Stereo on Compressed Data: How Much Accuracy Do You Really Lose?

Piotr Świerczyński

Chief Technology Officer

Most discussions of stereo vision quietly assume something that is rarely true in practice: that the matcher gets clean, uncompressed pixels straight off the sensor. In the field, that assumption breaks all the time - cameras hand us MJPEG streams, recorders save space by writing compressed frames, and datasets get shared as JPEG-encoded archives.

So the question we wanted to answer is concrete and practical: if the stereo pipeline is fed compressed images instead of raw ones, what does it cost us, and at what point does it stop being acceptable?

Experimental setup

We took 497 frames of off-road driving data collected in Arizona. The original imagery is 16-bit raw, which we treat as the upper bound on what the matcher can achieve. From there we generated two control conditions and a sweep of JPEG quality levels:

  • Raw 16-bit: the reference condition.

  • 8-bit: same data, mapped to 8 bits per channel. This is the control that separates "lost bit depth" from "lost high-frequency content."

  • JPEG 5, 15, 25, ..., 95: ten quality levels covering the full practical range.
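
As a rough sketch of how such a sweep could be enumerated: the snippet below lists the ten quality levels and shows one plausible way to map 16-bit samples to 8 bits. The plain right shift is an assumption made for illustration, not a description of the conversion used in the experiment, and the names `jpeg_quality_levels` and `to_8bit` are hypothetical. The JPEG encoding step itself is left to whatever encoder the pipeline already uses.

```python
def jpeg_quality_levels(start=5, stop=95, step=10):
    """Return the quality sweep 5, 15, ..., 95."""
    return list(range(start, stop + 1, step))


def to_8bit(samples_16bit):
    """Map 16-bit samples (0..65535) to 8 bits (0..255).

    Assumption: a plain right shift by 8 bits. A tone-mapped or
    gamma-encoded conversion would behave differently.
    """
    return [s >> 8 for s in samples_16bit]


levels = jpeg_quality_levels()
print(levels)                          # [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
print(len(levels))                     # 10
print(to_8bit([0, 255, 256, 65535]))   # [0, 0, 1, 255]
```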

For ground truth, we used NODAR GroundTruth, our offline neural matcher. It is too computationally expensive to run in real time, but it gives us accurate depth maps that we can use as a pseudo-ground truth reference for quantitatively evaluating the real-time matcher.

One methodological detail worth flagging: we ran autocalibration on the 16-bit data and then reused those extrinsic parameters across every other configuration. Autocalibration itself depends on image quality, so if we let it re-run on each compressed dataset, we would be measuring a mixture of calibration drift and matching degradation and we would not be able to tell them apart. By freezing the extrinsics, we isolate the matcher's response to compression.

Qualitative observations first

Two things are immediately visible just from watching the depth maps as we sweep the compression level.

The first is sparsification. As JPEG quality drops, the depth map gets holes. The matcher refuses to commit to a disparity in regions where the underlying texture has been smoothed away by quantization.

The second is more troubling: speckle starts appearing in regions of the sky. Our best guess is that JPEG's 8x8 block structure creates artificial texture at block boundaries, and the matcher occasionally locks onto these artifacts as if they were real features. A false return in the sky is a phantom obstacle, and a phantom obstacle in the wrong place can cause a vehicle to slam on the brakes for no reason.

Valid returns: how much of the scene survives?

The first quantitative question is what fraction of pixels in each frame get a valid disparity assignment. Note that large parts of the frame are occupied by the sky or are occluded by the car’s hood and we don’t expect returns in those regions - hence only around a third of the pixels in the frames can provide measurements.
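
A minimal sketch of the metric, assuming invalid pixels are stored as NaN or as a non-positive disparity (real matchers may use a different sentinel). The function name and the tiny example frames are made up for illustration and are not taken from the dataset.

```python
import math
from statistics import median


def valid_fraction(disparity_map):
    """Fraction of pixels that carry a valid disparity.

    Assumption: a pixel is invalid if it is NaN or <= 0.
    """
    total = 0
    valid = 0
    for row in disparity_map:
        for d in row:
            total += 1
            if not math.isnan(d) and d > 0:
                valid += 1
    return valid / total if total else 0.0


# Two tiny illustrative "frames" (made-up values, not real data).
frame_a = [[12.5, float("nan"), 0.0], [8.0, 7.5, float("nan")]]
frame_b = [[float("nan"), 3.0], [0.0, float("nan")]]

per_frame = [valid_fraction(f) for f in (frame_a, frame_b)]
print(per_frame)          # [0.5, 0.25]
print(median(per_frame))  # 0.375
```

Reporting the median across frames, rather than the mean, keeps a handful of unusual frames from dominating the summary.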

The raw 16-bit baseline sits at about 33% valid pixels, and 8-bit data tracks it almost exactly: bit depth alone is not what costs us pixels. From JPEG 95 down to about JPEG 65 the median is almost flat. Below JPEG 65 the curve steepens: JPEG 35 gives roughly 20% valid pixels (a third less than raw), JPEG 15 loses more than three quarters of the matches, and JPEG 5 collapses to 2-3% valid pixels.

The mechanism is straightforward. JPEG discards high-frequency content, and high-frequency content is exactly what the matcher uses to find correspondence between left and right views. Low-texture regions (painted walls, the side of a trailer) lose their match first. This dataset is relatively forgiving. Low-texture man-made environments would degrade faster.

Depth accuracy as a function of range

Error grows quadratically with range, exactly as stereo geometry predicts: depth error scales as Z² for a given disparity error, so doubling the range roughly quadruples the absolute error in meters. The spread between configurations also widens with range. At 0 to 10 meters every configuration is within a few centimeters; at 60 to 70 meters, JPEG 5 lands at 6.8 m MAE while raw is 1.8 m. Compression hurts most where you can least afford it. As a rule of thumb, above 30 m you pay a meaningful accuracy tax for anything below JPEG 75.
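
The quadratic growth follows from the triangulation relation Z = fB/d: differentiating with respect to disparity gives |ΔZ| ≈ Z²·|Δd| / (fB). Here is a minimal sketch of that first-order model. The focal length, baseline, and disparity error below are hypothetical values chosen for illustration, not the parameters of our rig or numbers from this dataset.

```python
def depth_error(range_m, focal_px, baseline_m, disparity_err_px):
    """First-order stereo depth error: dZ = Z^2 * dd / (f * B)."""
    return range_m ** 2 * disparity_err_px / (focal_px * baseline_m)


# Hypothetical rig: 2000 px focal length, 1.0 m baseline,
# and a fixed 0.25 px disparity error.
f_px, b_m, dd_px = 2000.0, 1.0, 0.25
for z in (10.0, 20.0, 40.0):
    print(f"{z:5.1f} m -> {depth_error(z, f_px, b_m, dd_px):.4f} m")
```

Each doubling of range multiplies the predicted error by four, which is why a small increase in disparity error from compression is barely visible up close and expensive at distance.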

Collapsing across range, the progression is monotonic and, down to JPEG 35, gradual. There is no cliff in that range. Each step from JPEG 95 down to JPEG 35 adds a few centimeters to the median MAE and widens the interquartile range. Below JPEG 35 things come apart: the median doubles between JPEG 35 and JPEG 15, and the upper whisker stretches past a meter.
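
For readers who want to reproduce this kind of summary on their own data, here is a minimal sketch of the two steps involved: a per-frame MAE against the reference depth, then the median and interquartile range across frames. It assumes invalid pixels are represented as `None`; the function names and the sample values are hypothetical.

```python
from statistics import median, quantiles


def frame_mae(estimated, reference):
    """Mean absolute depth error over pixels valid in both maps.

    Assumption: invalid pixels are represented as None.
    """
    errors = [abs(e - r) for e, r in zip(estimated, reference)
              if e is not None and r is not None]
    return sum(errors) / len(errors) if errors else None


def box_summary(values):
    """Median and interquartile range of per-frame MAE values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"median": median(values), "iqr": q3 - q1}


# Made-up per-frame MAE values in meters, for illustration only.
maes = [0.10, 0.12, 0.15, 0.20, 0.40]
print(frame_mae([10.0, None, 21.0], [10.5, 5.0, 20.0]))  # 0.75
print(box_summary(maes))
```

Note that the single outlier frame (0.40 m) leaves the median untouched, which is why the tail needs to be reported separately.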

One detail worth dwelling on: even at JPEG 95, the median MAE is already measurably worse than raw. The effect is small at the median, but it is real and consistent. There is no JPEG quality level at which compression is free, and the cost shows up first in the worst-case tail rather than the median. For safety-critical perception, that tail is the metric that matters.

The 8-bit and raw boxes sit close to each other and close to JPEG 95. On this daytime dataset, bit depth contributes far less to accuracy than compression artifacts do. We expect bit depth to matter much more in low light, where 8 bits crushes the signal in the darks.

What this means in practice

For safety-critical systems and adverse conditions, use raw high-bit-depth data: 12-bit or 16-bit, straight from the sensor. You avoid block artifacts, preserve the high-frequency texture the matcher needs, and keep headroom for night and high-dynamic-range scenes. Compression should be a deliberate trade against bandwidth, not a default.

Above JPEG 75 the degradation is small but not zero. Hammerhead, our real-time matcher, handles the security-camera default of JPEG 80 to 90 without meaningful operational degradation. Even JPEG 95 already widens the error distribution compared to raw, so if you control the pipeline end to end there is still a reason to prefer raw.

Below JPEG 50 we are skeptical. Sparsification gets aggressive and speckle artifacts suggest the matcher is being fooled by block-boundary structure. We would not deploy a real-time system on data below JPEG 50 without a thorough false-positive analysis.

Bit depth matters less in daylight than people expect, and more at night. On daytime data the 16-bit advantage over 8-bit is small, dominated by the compression curve. In low light, 8-bit clips detail in shadows and the gap widens fast.

The bottom line

A few caveats. This dataset is high-texture off-road driving; on low-texture man-made environments the degradation sets in sooner. We did not test night or low-light conditions, and we held autocalibration fixed by design, so we have not characterized how compression propagates into the extrinsics.

The headline result is that JPEG compression is not a binary "works or doesn't work" question for stereo matching. It is a continuous trade-off, and the curve is gentler than people often assume. Hammerhead works comfortably at JPEG 80, so legacy MJPEG infrastructure is genuinely usable. But for new systems we are designing or advising on, particularly anything safety-critical or operating in adverse conditions, the recommendation is unambiguous: feed the matcher 12-bit or 16-bit raw imagery. Compression is never free, even at JPEG 95.

When in doubt, send us a clip.