The second half of March was about one core question: can we extend Mari (short for our Marigold-inspired dual RGB + Depth prediction model) from predicting just RGB & depth to also predicting visibility (uncertainty), and does that improve depth estimation in fog?
Phase 1: Architecture & Losses
Model architecture: The key addition was the _VisibilityHead — a small convolutional network that predicts latent-space visibility/preservation confidence from hazy latents. This head sits alongside the existing dehazing and depth branches, turning the model into a three-output system.
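As a rough sketch of what such a head could look like (the 3-layer depth comes from the loss notes below; the channel widths, activations, and sigmoid output are my assumptions, not the actual _VisibilityHead implementation):

```python
import torch
import torch.nn as nn

class VisibilityHeadSketch(nn.Module):
    """Illustrative stand-in for _VisibilityHead: maps hazy latents
    (B, C, H, W) to a per-pixel visibility map in [0, 1]."""
    def __init__(self, in_channels: int = 4, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),  # assumed: keep predictions in [0, 1]
        )

    def forward(self, hazy_latents: torch.Tensor) -> torch.Tensor:
        return self.net(hazy_latents)
```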
Training loop: Rewired to use euler_train and Accelerator. Added cosine and linear warmup LR schedules, EMA checkpoint handling, and a new joint checkpoint selection strategy that weighs PSNR, SSIM, and AbsRel.
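For illustration, a joint checkpoint score could combine the three metrics along these lines; the 40 dB normalization and equal weights below are placeholders, not the values actually used.

```python
def joint_checkpoint_score(psnr: float, ssim: float, abs_rel: float,
                           w_psnr: float = 1.0, w_ssim: float = 1.0,
                           w_absrel: float = 1.0) -> float:
    """Combine dehazing quality (PSNR, SSIM; higher is better) and depth
    quality (AbsRel; lower is better) into one 'higher is better' score.
    Normalization constants and weights are illustrative only."""
    return (w_psnr * (psnr / 40.0)
            + w_ssim * ssim
            + w_absrel * (1.0 - abs_rel))
```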
Loss landscape:
- lambda_visibility: 0.05 — main visibility prediction loss
- lambda_visibility_preserve: 0.05 — encourages the visibility map to preserve detail
- lambda_visibility_rank: 0.02 — ranking loss between pixel pairs
- lambda_visibility_tv: 0.005 — total variation smoothness
With visibility_target_gamma: 4.0 and visibility_target_quantile: 0.9, the target visibility is derived from depth via a gamma-shaped mapping where nearby objects get high visibility and distant/foggy regions get low visibility.
Loss Ideas
The target is computed as exp(-4.0 * (delta / scale)) where delta is the per-pixel L1 difference between hazy and clear latents, normalized by the 90th percentile.
Intuition: This creates a soft binary map: pixels where hazy ≈ clear (already visible) get values near 1.0, while pixels where hazy ≠ clear (obscured by haze) get values near 0.0. The gamma=4.0 makes this transition sharp: even moderate degradation drops visibility quickly toward zero.
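A minimal sketch of that target computation; averaging the L1 difference over latent channels and taking the percentile per image are assumptions on my part:

```python
import torch

def visibility_target(hazy_latents: torch.Tensor, clear_latents: torch.Tensor,
                      gamma: float = 4.0, quantile: float = 0.9) -> torch.Tensor:
    """Soft visibility target in [0, 1]: ~1 where hazy latents match clear
    latents, ~0 where haze has strongly altered them."""
    # per-pixel L1 difference, reduced over latent channels (assumed reduction)
    delta = (hazy_latents - clear_latents).abs().mean(dim=1, keepdim=True)
    # normalize by the per-image 90th percentile of delta
    scale = torch.quantile(delta.flatten(1), quantile, dim=1)
    scale = scale.view(-1, 1, 1, 1).clamp_min(1e-6)
    return torch.exp(-gamma * delta / scale)
```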
1. lambda_visibility: 0.05 — Prediction Loss
Teaches a tiny 3-layer CNN to predict, from hazy latents alone (no clear image), where the haze is thick vs. thin. At inference time there's no clear image available, so the model must learn to infer degradation from haze cues alone.
Converges toward: A visibility head that accurately segments "already-clear" regions from "needs-dehazing" regions, purely from the hazy input.
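Sketched as a function of the head and the target above; an L1 objective is assumed here, since the exact loss form isn't spelled out:

```python
def visibility_prediction_loss(vis_head, hazy_latents, vis_target):
    """Train the head to reproduce the target from hazy latents alone;
    no clear image is needed, matching the inference-time setting."""
    vis_pred = vis_head(hazy_latents)          # (B, 1, H, W)
    return (vis_pred - vis_target).abs().mean()
```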
2. lambda_visibility_preserve: 0.05 — Preserve Loss
Computes |x0_hat_dehaze - hazy_latents| weighted by visibility_target. High-visibility pixels (already clear) get strong weight; low-visibility pixels (foggy) get weak weight.
In regions the target says are already clear, this loss penalizes the dehazing branch for changing anything. It's a "do no harm" constraint — don't hallucinate or distort regions that were fine to begin with.
Converges toward: A dehazing model that passes through already-visible content untouched while only modifying genuinely degraded regions. Prevents the common failure mode of dehazing models introducing color shifts or artifacts in clear foreground areas.
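A minimal sketch of that weighting, assuming a mean reduction over channels and pixels:

```python
def visibility_preserve_loss(x0_hat_dehaze, hazy_latents, vis_target):
    """'Do no harm': changes to high-visibility (already clear) regions are
    penalized strongly, changes to low-visibility (foggy) regions weakly."""
    per_pixel = (x0_hat_dehaze - hazy_latents).abs().mean(dim=1, keepdim=True)
    return (vis_target * per_pixel).mean()
```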
3. lambda_visibility_rank: 0.02 — Rank Loss
Samples 128 random pixel pairs, filters for pairs with significant depth difference (>0.05), then enforces: if pixel A is deeper than pixel B, then visibility(A) < visibility(B).
Injects the physical prior that haze is depth-dependent — farther objects are more obscured. This is a relative/ordinal constraint, not absolute, so it's robust to varying haze densities across scenes.
Converges toward: A visibility map that monotonically decreases with depth (on average). This gives the map physically plausible structure even when the prediction loss alone might produce noisy or inconsistent gradients. It's the weakest-weighted loss (0.02) because it's a soft prior, not a hard constraint — some scenes may have non-uniform haze.
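A hedged sketch of the pairwise constraint; the uniform pair sampling, hinge form, and zero margin are assumptions:

```python
import torch

def visibility_rank_loss(vis_pred, depth, num_pairs=128,
                         min_depth_gap=0.05, margin=0.0):
    """Ordinal prior: for pixel pairs with a significant depth gap, the
    deeper pixel should have lower predicted visibility.
    vis_pred and depth are assumed to be (B, 1, H, W)."""
    b = vis_pred.shape[0]
    vis_flat = vis_pred.flatten(2).squeeze(1)      # (B, H*W)
    depth_flat = depth.flatten(2).squeeze(1)       # (B, H*W)
    n = vis_flat.shape[1]
    idx_a = torch.randint(0, n, (b, num_pairs), device=vis_pred.device)
    idx_b = torch.randint(0, n, (b, num_pairs), device=vis_pred.device)
    d_a, d_b = depth_flat.gather(1, idx_a), depth_flat.gather(1, idx_b)
    v_a, v_b = vis_flat.gather(1, idx_a), vis_flat.gather(1, idx_b)
    mask = (d_a - d_b).abs() > min_depth_gap       # keep only significant depth gaps
    # if A is deeper than B we want vis(A) < vis(B), so penalize the violation
    violation = torch.relu(torch.sign(d_a - d_b) * (v_a - v_b) + margin)
    return (violation * mask).sum() / mask.sum().clamp_min(1)
```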
4. lambda_visibility_tv: 0.005 — Total Variation
L1 differences between adjacent pixels, summed over horizontal and vertical axes.
Penalizes sharp, noisy transitions in the visibility map. Without this, the visibility head could produce spatially inconsistent predictions (checkerboard artifacts, pixel-level noise).
Converges toward: A spatially smooth visibility map where transitions follow object/depth boundaries rather than noise. The very low weight (0.005) means it regularizes gently — it won't blur away real edges, just suppress noise.
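A minimal sketch; the per-axis mean reduction is an assumption:

```python
def visibility_tv_loss(vis_pred):
    """Total variation: L1 differences between horizontally and vertically
    adjacent pixels, discouraging pixel-level noise in the visibility map."""
    dh = (vis_pred[:, :, :, 1:] - vis_pred[:, :, :, :-1]).abs().mean()
    dv = (vis_pred[:, :, 1:, :] - vis_pred[:, :, :-1, :]).abs().mean()
    return dh + dv
```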
Phase 2: First Uncertainty Training
Trained for 17,600 steps (~36 hours) to completion
Comparing the two runs' final metrics:
| Metric | Run 132 (17.6k steps) | Run 197 (38.8k steps) |
|---|---|---|
| Total loss | 0.053 | 0.256 |
| Depth loss | 0.0045 | 0.098 |
| Dehaze loss | 0.024 | 0.133 |
| Visibility loss | 0.025 | 0.044 |
| Visibility mean | 0.334 | 0.448 |
| Visibility std | 0.074 | 0.080 |
Run 197's higher losses are likely explained by its training on the real-drive-sim dataset (more diverse, harder), while run 132 trained on VKITTI2. The higher visibility mean (0.45 vs 0.33) suggests the real-drive-sim fog is generally less dense (a nearer pixel distribution) than the synthetic VKITTI2 fog.
Evaluation: Mari vs Marigold
The core evaluation question: does Mari's joint dehazing + depth prediction outperform applying Marigold (the baseline depth estimator from the paper) directly to foggy images?
Caveat: It appears that the Marigold inference pipeline used downsampling by default. We will have to evaluate again against a full-resolution pipeline, without intermediate downsampling.
VKITTI2 (Synthetic Fog)
Three approaches compared on hazy VKITTI2 images:
- Marigold on hazy images — baseline, no dehazing
- Marigold on Mari-dehazed images — two-stage: Mari removes fog, then Marigold estimates depth
- Mari (uncertainty_1) — single-pass joint prediction
We can also see how much Mari's dehazing helps Marigold as a pre-processing step:
Real-Drive-Sim (Real-World Fog)
A new foggy dataset was generated from real-drive-sim using a dark channel prior (DCP) heuristic for fog synthesis (25k images with train/val/test splits). Both Mari and Marigold were evaluated on the 762-image validation split:
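For context, the standard fog model underlying this kind of synthesis is I = J*t + A*(1 - t). How the DCP heuristic produces the transmission map t in our pipeline is not detailed here, so the sketch below (dark-channel operator, exponential mapping, and all constants) is an assumption, not the actual synthesis code.

```python
import torch
import torch.nn.functional as F

def dark_channel(rgb: torch.Tensor, patch: int = 15) -> torch.Tensor:
    """Classic DCP building block: per-pixel channel minimum followed by a
    local minimum filter (implemented as max-pooling the negation)."""
    min_c = rgb.min(dim=1, keepdim=True).values
    return -F.max_pool2d(-min_c, kernel_size=patch, stride=1, padding=patch // 2)

def synthesize_fog(clear_rgb: torch.Tensor, beta: float = 1.0,
                   airlight: float = 0.9, patch: int = 15) -> torch.Tensor:
    """Hypothetical DCP-style fog synthesis: derive a transmission map from
    the dark channel (assumed mapping), then composite with the atmospheric
    scattering model I = J*t + A*(1 - t)."""
    dc = dark_channel(clear_rgb, patch)
    t = torch.exp(-beta * dc).clamp(0.05, 1.0)   # assumed dark-channel -> transmission mapping
    return clear_rgb * t + airlight * (1.0 - t)
```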
What's Next
- Cross-task fusion (RGB + Depth) of the visibility map
- Going from Affine to Metric Depth Estimation