The past experiments were another attempt to get a purely diffusion-based approach (the Marigold-adjacent diffusion architecture) to converge towards a reasonable metric depth estimator.

Below is the comparison table for our diffusion-only metric predictor ("mari_metric_2") against the UniDepthV2 and DepthAnythingV3 models, evaluated on both the hazy and the clear real-drive-sim dataset.

The metrics are computed over a benchmark depth range of 0 to 80 meters.
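For reference, a minimal sketch of how such range-restricted MAE/RMSE can be computed; `depth_metrics` is a hypothetical helper, not the actual evaluation code:

```python
import torch

def depth_metrics(pred: torch.Tensor, gt: torch.Tensor,
                  d_min: float = 0.0, d_max: float = 80.0) -> dict:
    """MAE / RMSE over pixels whose ground truth lies in (d_min, d_max]."""
    valid = (gt > d_min) & (gt <= d_max)   # range restriction; also drops gt == 0
    err = pred[valid] - gt[valid]
    return {"mae": err.abs().mean().item(),
            "rmse": err.pow(2).mean().sqrt().item()}
```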

[Table: real-drive-sim · depth]

Evaluation of the UniDepthV2 model on a foggy dataset:

Evaluation of our metric diffusion only model on the foggy dataset:

Oddity: hazy RGB performs BETTER than clean RGB in the benchmark range [0, 80] meters.

For the UniDepthV2 models, the MAE (though not the RMSE) was better on hazy RGBs than on clear RGBs. This MIGHT be due to the mechanics of the fog curation: the mean visibility is around 100 meters, possibly allowing the model to focus a bit better on the visible regions.

Depth error of UniDepthV2 vs DepthAnythingV3 on the clear dataset.

Depth error of UniDepthV2 vs DepthAnythingV3 on the hazy dataset.

Diffusion-only approach: Architecture and Losses

$$
\begin{array}{|l|c|c|l|}
\hline
\textbf{Component} & \textbf{State} & \textbf{LR} & \textbf{Rationale} \\
\hline
\text{VAE encoder (SD 2.1)} & \text{frozen} & \text{—} & \text{preserve well-conditioned latent manifold} \\
\text{VAE decoder (SD 2.1)} & \text{frozen} & \text{—} & \text{reused for dehaze AND depth decoding} \\
\text{CLIP text encoder} & \text{frozen} & \text{—} & \text{not used; empty embedding injected} \\
\hline
\text{conv\_in (4ch} \to \text{8ch)} & \text{trainable} & 1.5 \times 10^{-5} & \text{input re-wired: [noisy target, hazy cond]} \\
\text{UNet encoder (shared)} & \text{trainable} & 1.5 \times 10^{-5} & \text{shared across both tasks} \\
\text{mid\_block} & \text{trainable} & 1.5 \times 10^{-5} & \text{part of shared trunk} \\
\text{Dehaze decoder (up\_blocks)} & \text{trainable} & 3.0 \times 10^{-5} & \text{higher LR for task-specific head} \\
\text{Depth decoder (up\_blocks)} & \text{trainable} & 3.6 \times 10^{-5} & \text{highest LR: new domain (depth)} \\
\hline
\text{Visibility head} & \text{trainable} & \text{encoder-LR} & \text{lightweight, learns from aux losses} \\
\text{Viz residual adapters} & \text{trainable} & \text{decoder-LR} & \text{zero-init} \to \text{identity at } t{=}0 \\
\hline
\end{array}
$$
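A minimal sketch of how these per-component learning rates could be wired into an optimizer, assuming a diffusers-style `UNet2DConditionModel` (so `conv_in`, `down_blocks`, and `mid_block` exist as attributes); `dehaze_up_blocks`, `depth_up_blocks`, `vis_head`, and `vis_adapters` are assumed module handles, not confirmed names:

```python
import itertools
import torch

def build_optimizer(unet, dehaze_up_blocks, depth_up_blocks,
                    vis_head, vis_adapters) -> torch.optim.AdamW:
    """Per-component LR groups from the table above (module names are assumptions)."""
    enc_lr, dehaze_lr, depth_lr = 1.5e-5, 3.0e-5, 3.6e-5
    groups = [
        # shared trunk: re-wired conv_in + UNet encoder + mid_block
        {"params": itertools.chain(unet.conv_in.parameters(),
                                   unet.down_blocks.parameters(),
                                   unet.mid_block.parameters()),
         "lr": enc_lr},
        # task-specific decoder heads get higher LRs
        {"params": dehaze_up_blocks.parameters(), "lr": dehaze_lr},
        {"params": depth_up_blocks.parameters(), "lr": depth_lr},
        # visibility head at encoder LR; zero-init adapters at decoder LR
        {"params": vis_head.parameters(), "lr": enc_lr},
        {"params": vis_adapters.parameters(), "lr": dehaze_lr},
    ]
    return torch.optim.AdamW(groups)
```

The VAE and the CLIP text encoder simply get no parameter group, which keeps them frozen.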

$$
\begin{array}{|l|c|c|c|l|}
\hline
\textbf{Stream} & \textbf{Space} & \textbf{Shape} & \textbf{Role} & \textbf{Transform} \\
\hline
\text{hazy RGB} & \text{pixel} & [3,H,W] & \text{condition} & \text{VAE.encode} \to z_{\text{hazy}} \\
\text{clean RGB} & \text{pixel} & [3,H,W] & \text{target (dehaze)} & \text{VAE.encode} \to z_{\text{clean}} \\
\text{depth (metric, m)} & \text{pixel} & [1,H,W] & \text{target (depth)} & \text{normalize} \to \text{repeat 3ch} \to \text{VAE.encode} \to z_{\text{depth}} \\
\text{valid mask} & \text{pixel} & [1,H,W]\text{ bool} & \text{loss weight} & 8\times\text{ conservative downsample} \\
\hline
\end{array}
$$
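A sketch of the depth and mask rows in PyTorch. The exact normalization (clamp to [0, 80] m, affine map to [-1, 1]) and the reading of "conservative" as all-covered-pixels-valid (min-pool) are assumptions:

```python
import torch
import torch.nn.functional as F

def encode_depth(depth_m: torch.Tensor, vae, d_max: float = 80.0) -> torch.Tensor:
    """[B,1,H,W] metric depth in meters -> z_depth: [B,4,H/8,W/8].

    The normalization here is an assumption; any invertible map into the
    VAE's expected input range would do.
    """
    d = depth_m.clamp(0.0, d_max) / d_max * 2.0 - 1.0   # [-1, 1], like an RGB image
    d3 = d.repeat(1, 3, 1, 1)                           # fake 3-channel image
    return vae.encode(d3).latent_dist.sample() * vae.config.scaling_factor

def downsample_mask(valid: torch.Tensor, factor: int = 8) -> torch.Tensor:
    """Conservative 8x downsample: a latent pixel counts as valid only if
    every covered pixel is valid (min-pool implemented as -maxpool(-m))."""
    return -F.max_pool2d(-valid.float(), kernel_size=factor)
```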

$$
\begin{array}{|l|l|l|}
\hline
\textbf{Symbol} & \textbf{Meaning} & \textbf{Origin} \\
\hline
z_{\text{hazy}} & \text{latent of hazy input} & \text{VAE.encode(hazy)} \\
z_{\text{clean}} & \text{latent of clean target} & \text{VAE.encode(clean)} \\
z_{\text{depth}} & \text{latent of depth target} & \text{VAE.encode(repeat(norm(depth),3))} \\
x_t^{(k)} & \text{noisy target at step } t & \alpha_t z_{\text{target}}^{(k)} + \sigma_t \varepsilon \\
v_\star^{(k)} & \text{v-prediction target} & \alpha_t \varepsilon - \sigma_t z_{\text{target}}^{(k)} \\
v & \text{visibility map} & \sigma(\text{Head}(z_{\text{hazy}})) \\
\hline
\end{array}
$$
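The two derived quantities, $x_t^{(k)}$ and $v_\star^{(k)}$, are the standard v-prediction parameterization; a sketch matching the table exactly:

```python
import torch

def make_vpred_pair(z_target: torch.Tensor, alpha_t: torch.Tensor,
                    sigma_t: torch.Tensor):
    """x_t = alpha_t*z + sigma_t*eps and v_star = alpha_t*eps - sigma_t*z,
    with alpha_t, sigma_t broadcastable (e.g. [B,1,1,1])."""
    eps = torch.randn_like(z_target)              # epsilon ~ N(0, I)
    x_t = alpha_t * z_target + sigma_t * eps      # forward-noised latent
    v_star = alpha_t * eps - sigma_t * z_target   # v-prediction target
    return x_t, v_star
```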

UNet input/output per task

$$
\begin{array}{|c|c|c|c|}
\hline
\textbf{Task} & \textbf{Input (concat, 8ch)} & \textbf{Output (4ch)} & \textbf{Target} \\
\hline
\text{dehaze} & [x_t^{(\text{dehaze})} \,\Vert\, z_{\text{hazy}}] & \hat v^{(\text{dehaze})} & v_\star^{(\text{dehaze})} \\
\text{depth} & [x_t^{(\text{depth})} \,\Vert\, z_{\text{hazy}}] & \hat v^{(\text{depth})} & v_\star^{(\text{depth})} \\
\hline
\end{array}
$$
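A sketch of one forward pass per task, assuming a diffusers-style `UNet2DConditionModel` whose `conv_in` takes 8 channels; the routing into the task-specific up_blocks is assumed to happen inside `unet` and is not shown:

```python
import torch

def unet_step(unet, x_t, z_hazy, t, empty_emb):
    """One v-prediction forward pass: v_hat = UNet([x_t || z_hazy], t).

    empty_emb is the frozen CLIP embedding of the empty prompt ""
    ([B, 77, 1024] for SD 2.1's OpenCLIP encoder).
    """
    x_in = torch.cat([x_t, z_hazy], dim=1)  # [B, 8, h, w]: noisy target + condition
    return unet(x_in, t, encoder_hidden_states=empty_emb).sample  # [B, 4, h, w]
```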

The same $z_{\text{hazy}}$ is the condition for both branches; that is what makes dehaze and depth jointly conditioned on the observation.

$$
\begin{aligned}
& \text{Let } k \in \{\text{dehaze}, \text{depth}\} \\
& \bullet\ \tilde{m} = \text{downsampled valid mask} \\
& \bullet\ v = \text{visibility map} \\
& \bullet\ \delta = |z_{\text{hazy}} - z_{\text{clean}}| \quad \text{(latent-space haze residual)}
\end{aligned}
$$

$$
\begin{array}{|l|l|c|c|}
\hline
\textbf{Term} & \textbf{Expression} & \textbf{Masked?} & \textbf{Default weight} \\
\hline
L_{\text{dehaze}} & \Vert \hat v^{(\text{dehaze})} - v_\star^{(\text{dehaze})} \Vert_2^2 & \text{no} & 1.0 \\
\hline
L_{\text{depth}} & \frac{1}{|\tilde m|}\sum \tilde m \odot (\hat v^{(\text{depth})} - v_\star^{(\text{depth})})^2 & \text{yes} & 1.0 \\
\hline
L_{\text{vis}} & \text{BCE}\!\left(v,\ v_\star^{\text{vis}}\right),\quad v_\star^{\text{vis}} = \exp\!\left(-\gamma\,\delta / q_{0.9}(\delta)\right) & \text{no} & 0.05 \\
\hline
L_{\text{preserve}} & \text{penalize low } v \text{ where } \delta \text{ is small} & \text{no} & 0.05 \\
\hline
L_{\text{rank}} & \text{contrastive: } v_{\text{low-haze}} > v_{\text{high-haze}} & \text{yes} & 0.02 \\
\hline
L_{\text{tv}} & \text{anisotropic TV}(v) = \sum |\nabla_x v| + |\nabla_y v| & \text{no} & 0.005 \\
\hline
L_{\text{total}} & \sum_i \lambda_i L_i & & \\
\hline
\end{array}
$$
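A minimal sketch of how the weighted total could be assembled in PyTorch. The BCE target follows the $\exp(-\gamma\,\delta / q_{0.9}(\delta))$ expression above, but `gamma`, the low-haze threshold inside $L_{\text{preserve}}$, the channel-mean reduction of $\delta$, and the latent-resolution depth `d_latent` are all assumptions; `rank_loss` and `tv_loss` are the helpers sketched in the subsections below:

```python
import torch
import torch.nn.functional as F

WEIGHTS = {"dehaze": 1.0, "depth": 1.0, "vis": 0.05,
           "preserve": 0.05, "rank": 0.02, "tv": 0.005}

def total_loss(v_hat_dehaze, v_star_dehaze, v_hat_depth, v_star_depth,
               v, z_hazy, z_clean, m_tilde, d_latent, gamma: float = 3.0):
    # latent-space haze residual, reduced to one channel (an assumption)
    delta = (z_hazy - z_clean).abs().mean(dim=1, keepdim=True)
    l_dehaze = F.mse_loss(v_hat_dehaze, v_star_dehaze)
    # masked depth MSE: average only over valid latent pixels
    n_valid = (m_tilde.sum() * v_hat_depth.size(1)).clamp(min=1.0)
    l_depth = ((v_hat_depth - v_star_depth).pow(2) * m_tilde).sum() / n_valid
    # BCE against the soft target exp(-gamma * delta / q_0.9(delta))
    q90 = torch.quantile(delta.flatten(1), 0.9, dim=1).view(-1, 1, 1, 1)
    v_vis = torch.exp(-gamma * delta / q90.clamp(min=1e-6))
    l_vis = F.binary_cross_entropy(v.clamp(1e-6, 1 - 1e-6), v_vis)
    # preserve: v should stay high where almost no haze was added
    # (the 0.1 * q90 "small residual" threshold is an assumption)
    low_haze = (delta < 0.1 * q90).float()
    l_preserve = (low_haze * (1.0 - v)).mean()
    # per-image ranking prior and TV smoothness; helpers sketched below
    l_rank = torch.stack([rank_loss(v[b].flatten(), d_latent[b].flatten(),
                                    m_tilde[b].flatten())
                          for b in range(v.size(0))]).mean()
    l_tv = tv_loss(v)
    terms = {"dehaze": l_dehaze, "depth": l_depth, "vis": l_vis,
             "preserve": l_preserve, "rank": l_rank, "tv": l_tv}
    return sum(WEIGHTS[k] * terms[k] for k in terms)
```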

Visibility regularizers

Rank loss $L_{\text{rank}}$ - pairwise depth-visibility monotonicity

For each image, sample 128 random pairs $(i, j)$ of valid latent pixels and penalize ordering violations:

$$
\Delta_d = d_j - d_i, \qquad s = \text{sign}(\Delta_d)\,(v_i - v_j), \qquad \ell_{ij} = \text{softplus}(-s)
$$

$$
\begin{array}{|l|l|l|}
\hline
\textbf{Piece} & \textbf{Definition} & \textbf{Intuition} \\
\hline
\Delta_d = d_j - d_i & \text{depth gap between sampled pixels} & \text{which is closer / farther from camera} \\
\hline
|\Delta_d| > 0.05 & \text{margin filter} & \text{drop near-equal-depth pairs (no signal)} \\
\hline
\text{sign}(\Delta_d) & \text{direction flag } \in \{-1, +1\} & +1 \text{ if } j \text{ is farther}, \ -1 \text{ otherwise} \\
\hline
s = \text{sign}(\Delta_d)(v_i - v_j) & \text{signed visibility difference} & s > 0 \iff \text{closer pixel has higher } v \\
\hline
\text{softplus}(-s) = \log(1 + e^{-s}) & \text{smooth hinge / logistic penalty} & \approx 0 \text{ if correctly ranked}; \ \approx -s \text{ if inverted} \\
\hline
N_{\text{pairs}} = 128 & \text{stochastic sampling} & \text{unbiased estimator vs. } O(N^2) \text{ full compare} \\
\hline
L_{\text{rank}} = \mathbb{E}[\ell_{ij}] & \text{mean over valid pairs} & \text{monotonic prior: } v \downarrow \text{ as } d \uparrow \\
\hline
\end{array}
$$

Enforce an ordering consistent with atmospheric scattering physics ($v$ should decrease with depth). That makes it a much weaker and safer prior than the BCE term, which targets a specific numeric value.
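A sketch of this loss as described above (128 random pairs, $|\Delta_d| > 0.05$ margin filter, softplus penalty); the flat per-image tensor convention is an assumption, and $d$ is taken to be the normalized depth so the margin is in normalized units:

```python
import torch
import torch.nn.functional as F

def rank_loss(v: torch.Tensor, d: torch.Tensor, valid: torch.Tensor,
              n_pairs: int = 128, margin: float = 0.05) -> torch.Tensor:
    """Pairwise monotonicity prior for one image.

    v, d, valid: flat [N] tensors at latent resolution (visibility,
    normalized depth, valid mask).
    """
    idx = valid.nonzero().squeeze(1)                  # indices of valid pixels
    if idx.numel() < 2:
        return v.new_zeros(())
    i = idx[torch.randint(idx.numel(), (n_pairs,), device=idx.device)]
    j = idx[torch.randint(idx.numel(), (n_pairs,), device=idx.device)]
    delta_d = d[j] - d[i]
    keep = delta_d.abs() > margin                     # drop near-equal-depth pairs
    if not keep.any():
        return v.new_zeros(())
    # s > 0 iff the closer pixel has the higher visibility (correct order)
    s = torch.sign(delta_d[keep]) * (v[i[keep]] - v[j[keep]])
    return F.softplus(-s).mean()                      # smooth hinge on violations
```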


Total-variation loss $L_{\text{tv}}$ - edge-preserving smoothness on $v$

The TV loss:

$$
L_{\text{tv}} = \mathbb{E}\!\left[|\partial_x v|\right] + \mathbb{E}\!\left[|\partial_y v|\right], \qquad \partial_x v_{i,j} = v_{i,j+1} - v_{i,j}
$$

$$
\begin{array}{|l|l|l|}
\hline
\textbf{Piece} & \textbf{Definition} & \textbf{Intuition} \\
\hline
\partial_x v,\ \partial_y v & \text{forward differences (discrete gradients)} & \text{local spatial rate of change of } v \\
\hline
|\cdot| \text{ (L1)} & \text{absolute value} & \text{linear penalty tolerates sharp edges better than L2} \\
\hline
\text{anisotropic form} & \text{sum of axis-wise magnitudes} & \text{simpler than isotropic } \sqrt{(\partial_x v)^2 + (\partial_y v)^2} \\
\hline
\text{mean over pixels} & \text{size-normalized} & \text{resolution-independent magnitude} \\
\hline
\lambda_{\text{tv}} = 0.005 \text{ (small)} & \text{regularizer, not objective} & \text{suppress speckle without flattening } v \\
\hline
\end{array}
$$
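A sketch of the anisotropic TV term, matching the forward-difference definition above (this is the `tv_loss` helper referenced in the total-loss sketch):

```python
import torch

def tv_loss(v: torch.Tensor) -> torch.Tensor:
    """Anisotropic total variation on the visibility map v: [B, 1, h, w].

    Forward differences along x and y, L1 magnitude, mean over pixels so
    the value is resolution-independent.
    """
    dx = (v[..., :, 1:] - v[..., :, :-1]).abs().mean()  # horizontal gradients
    dy = (v[..., 1:, :] - v[..., :-1, :]).abs().mean()  # vertical gradients
    return dx + dy
```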