The past experiments again focused on getting a purely diffusion-based approach (a Marigold-adjacent diffusion architecture) to converge towards a reasonable metric depth estimator.
Below is the comparison table for our diffusion-only metric predictor ("mari_metric_2") against the UniDepthV2 and DepthAnythingV3 models, evaluated on both the hazy and the clear real-drive-sim dataset.
The metrics are computed over a benchmark depth range of 0 to 80 meters.
[Table: real-drive-sim depth]
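For reference, a minimal sketch of how range-limited MAE/RMSE can be computed; the tensor names and the masking convention are assumptions, not the actual evaluation code:

```python
import torch

def benchmark_metrics(pred_depth: torch.Tensor,
                      gt_depth: torch.Tensor,
                      d_min: float = 0.0,
                      d_max: float = 80.0) -> dict:
    """MAE / RMSE restricted to ground-truth depths inside (d_min, d_max] meters."""
    # keep only pixels with valid ground truth inside the benchmark range
    mask = (gt_depth > d_min) & (gt_depth <= d_max)
    err = pred_depth[mask] - gt_depth[mask]
    return {
        "mae": err.abs().mean().item(),
        "rmse": err.pow(2).mean().sqrt().item(),
    }
```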
Evaluation of the UniDepthV2 model on a foggy dataset:
Oddities: Hazy RGB performs BETTER than clean RGB in the benchmark range [0, 80] meters.
For the UniDepthV2 model, the MAE (though not the RMSE) on hazy RGBs outperformed that on clear RGBs. This MIGHT be due to the mechanics of the fog curation, i.e. the mean visibility is around 100 meters, which possibly lets the model focus a bit better on the visible regions.
Depth error of UniDepthV2 vs DepthAnythingV3 on the clear dataset.
Depth error of UniDepthV2 vs DepthAnythingV3 on the hazy dataset.
Diffusion-only approach: Architecture and Losses
| Component | State | LR | Rationale |
|---|---|---|---|
| VAE encoder (SD 2.1) | frozen | — | preserve well-conditioned latent manifold |
| VAE decoder (SD 2.1) | frozen | — | reused for dehaze AND depth decoding |
| CLIP text encoder | frozen | — | not used; empty embedding injected |
| conv_in (4ch → 8ch) | trainable | 1.5×10⁻⁵ | input re-wired: [noisy target, hazy cond] |
| UNet encoder (shared) | trainable | 1.5×10⁻⁵ | shared across both tasks |
| mid_block | trainable | 1.5×10⁻⁵ | part of shared trunk |
| Dehaze decoder (up_blocks) | trainable | 3.0×10⁻⁵ | higher LR for task-specific head |
| Depth decoder (up_blocks) | trainable | 3.6×10⁻⁵ | highest LR: new domain (depth) |
| Visibility head | trainable | encoder-LR | lightweight, learns from aux losses |
| Viz residual adapters | trainable | decoder-LR | zero-init → identity at t=0 |
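A minimal sketch of how this freezing and per-module learning-rate split could be wired up in PyTorch. The module/attribute names (`conv_in`, `down_blocks`, `mid_block`, the decoder handles) and the mapping of "encoder-LR" / "decoder-LR" to concrete values are assumptions, not the actual training code:

```python
import itertools
import torch

def build_optimizer(unet, dehaze_decoder, depth_decoder,
                    visibility_head, viz_adapters,
                    vae, text_encoder):
    """Freeze the SD 2.1 VAE + CLIP text encoder, train the rest with per-module LRs."""
    # frozen: preserve the well-conditioned latent manifold; text encoder is unused
    for m in (vae, text_encoder):
        m.requires_grad_(False)

    shared_trunk = itertools.chain(unet.conv_in.parameters(),
                                   unet.down_blocks.parameters(),
                                   unet.mid_block.parameters())
    return torch.optim.AdamW([
        {"params": shared_trunk,                   "lr": 1.5e-5},  # conv_in + UNet encoder + mid_block
        {"params": dehaze_decoder.parameters(),    "lr": 3.0e-5},  # dehaze up_blocks
        {"params": depth_decoder.parameters(),     "lr": 3.6e-5},  # depth up_blocks (new domain)
        {"params": visibility_head.parameters(),   "lr": 1.5e-5},  # "encoder-LR" (assumed value)
        {"params": viz_adapters.parameters(),      "lr": 3.0e-5},  # "decoder-LR" (assumed value)
    ])
```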
| Symbol | Meaning | Origin |
|---|---|---|
| $z_{\mathrm{hazy}}$ | latent of hazy input | `VAE.encode(hazy)` |
| $z_{\mathrm{clean}}$ | latent of clean target | `VAE.encode(clean)` |
| $z_{\mathrm{depth}}$ | latent of depth target | `VAE.encode(repeat(norm(depth), 3))` |
| $x_t^{(k)}$ | noisy target at step $t$ | $\alpha_t z_{\mathrm{target}}^{(k)} + \sigma_t \varepsilon$ |
| $v^{\star(k)}$ | v-prediction target | $\alpha_t \varepsilon - \sigma_t z_{\mathrm{target}}^{(k)}$ |
| $v$ | predicted visibility map | $\sigma(\mathrm{Head}(z_{\mathrm{hazy}}))$ |
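A sketch of how these quantities could be assembled for one training step, assuming the diffusers-style `vae.encode(...).latent_dist` API and omitting the VAE scaling factor; function and variable names are illustrative:

```python
import torch

def make_vpred_targets(vae, hazy, clean, depth_norm, alpha_t, sigma_t):
    """Latents, noisy inputs x_t^(k) and v-prediction targets v*^(k) for k in {dehaze, depth}."""
    z_hazy  = vae.encode(hazy).latent_dist.sample()
    z_clean = vae.encode(clean).latent_dist.sample()
    # normalized depth repeated to 3 channels so the RGB VAE can encode it
    z_depth = vae.encode(depth_norm.repeat(1, 3, 1, 1)).latent_dist.sample()

    out = {}
    for name, z_target in [("dehaze", z_clean), ("depth", z_depth)]:
        eps = torch.randn_like(z_target)
        x_t      = alpha_t * z_target + sigma_t * eps    # noisy target at step t
        v_target = alpha_t * eps - sigma_t * z_target    # v-prediction target
        # conv_in sees [noisy target, hazy condition] along channels (4ch -> 8ch)
        out[name] = {"unet_input": torch.cat([x_t, z_hazy], dim=1),
                     "v_target": v_target}
    return z_hazy, out
```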
| Term | Expression | Masked? | Default weight |
|---|---|---|---|
| $\mathcal{L}_{\mathrm{dehaze}}$ | $\lVert \hat v^{(\mathrm{dehaze})} - v^{\star(\mathrm{dehaze})} \rVert_2^2$ | no | 1.0 |
| $\mathcal{L}_{\mathrm{depth}}$ | $\frac{1}{\lvert \tilde m \rvert_1} \sum \tilde m \odot \big(\hat v^{(\mathrm{depth})} - v^{\star(\mathrm{depth})}\big)^2$ | yes | 1.0 |
| $\mathcal{L}_{\mathrm{vis}}$ | $\mathrm{BCE}(v, v^{\star}_{\mathrm{vis}})$, $v^{\star}_{\mathrm{vis}} = \exp(-\gamma\,\delta / q_{0.9}(\delta))$ | no | 0.05 |
| $\mathcal{L}_{\mathrm{preserve}}$ | penalize low $v$ where $\delta$ is small | no | 0.05 |
| $\mathcal{L}_{\mathrm{rank}}$ | contrastive: $v_{\mathrm{low\text{-}haze}} > v_{\mathrm{high\text{-}haze}}$ | yes | 0.02 |
| $\mathcal{L}_{\mathrm{tv}}$ | anisotropic $\mathrm{TV}(v) = \sum \lvert \nabla_x v \rvert + \lvert \nabla_y v \rvert$ | no | 0.005 |
| $\mathcal{L}_{\mathrm{total}}$ | $\sum_i \lambda_i \mathcal{L}_i$ | — | — |
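A compact sketch of the weighted sum with the λ defaults from the table; the dictionary layout, argument names, and the placeholder `losses_aux` for the preserve / rank / TV terms are assumptions:

```python
import torch
import torch.nn.functional as F

LAMBDAS = {"dehaze": 1.0, "depth": 1.0, "vis": 0.05,
           "preserve": 0.05, "rank": 0.02, "tv": 0.005}

def total_loss(v_hat, v_star, m_tilde, v, vis_target, losses_aux):
    """Weighted combination of the terms above (sketch, not the training code)."""
    l = {}
    # unmasked MSE on the dehaze branch
    l["dehaze"] = F.mse_loss(v_hat["dehaze"], v_star["dehaze"])
    # masked MSE on the depth branch, normalized by the number of valid pixels
    diff2 = m_tilde * (v_hat["depth"] - v_star["depth"]) ** 2
    l["depth"] = diff2.sum() / m_tilde.sum().clamp(min=1.0)
    # BCE against the target exp(-gamma * delta / q_0.9(delta))
    l["vis"] = F.binary_cross_entropy(v, vis_target)
    # preserve / rank / tv terms are computed elsewhere and passed in
    l.update(losses_aux)
    return sum(LAMBDAS[k] * l[k] for k in LAMBDAS)
```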
Visibility regularizers
Rank loss $\mathcal{L}_{\mathrm{rank}}$: pairwise depth-visibility monotonicity
For each image, sample 128 random pairs (i,j) of valid latent pixels and penalize ordering violations:
| Piece | Definition | Intuition |
|---|---|---|
| $\Delta d = d_j - d_i$ | depth gap between sampled pixels | which is closer / farther from camera |
| $\lvert \Delta d \rvert > 0.05$ | margin filter | drop near-equal-depth pairs (no signal) |
| $\mathrm{sign}(\Delta d)$ | direction flag $\in \{-1, +1\}$ | $+1$ if $j$ is farther, $-1$ otherwise |
| $s = \mathrm{sign}(\Delta d)\,(v_i - v_j)$ | signed visibility difference | $s > 0 \iff$ closer pixel has higher $v$ |
| $\mathrm{softplus}(-s) = \log(1 + e^{-s})$ | smooth hinge / logistic penalty | $\approx 0$ if correctly ranked; $\approx -s$ if inverted |
| $N_{\mathrm{pairs}} = 128$ | stochastic sampling | unbiased estimator vs. $O(N^2)$ full compare |
| $\mathcal{L}_{\mathrm{rank}} = \mathbb{E}[\ell_{ij}]$ | mean over valid pairs | monotonic prior: $v \downarrow$ as $d \uparrow$ |
This enforces an ordering consistent with atmospheric scattering physics ($v$ should decrease with depth). That makes it a much weaker and safer prior than the BCE term, which targets a specific numeric value.
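A minimal sketch of this pairwise sampling (128 pairs, 0.05 depth margin), assuming flattened per-image tensors for depth, $v$, and the validity mask:

```python
import torch
import torch.nn.functional as F

def rank_loss(depth: torch.Tensor, v: torch.Tensor, valid: torch.Tensor,
              n_pairs: int = 128, margin: float = 0.05) -> torch.Tensor:
    """Pairwise depth-visibility monotonicity loss for one image (flattened latent grid)."""
    idx = valid.nonzero(as_tuple=True)[0]
    if idx.numel() < 2:
        return depth.new_zeros(())
    # sample random pairs (i, j) of valid latent pixels
    i = idx[torch.randint(idx.numel(), (n_pairs,))]
    j = idx[torch.randint(idx.numel(), (n_pairs,))]

    delta_d = depth[j] - depth[i]                   # depth gap between sampled pixels
    keep = delta_d.abs() > margin                   # drop near-equal-depth pairs
    if keep.sum() == 0:
        return depth.new_zeros(())
    # s > 0 iff the closer pixel has the higher visibility
    s = torch.sign(delta_d[keep]) * (v[i][keep] - v[j][keep])
    return F.softplus(-s).mean()                    # smooth hinge, mean over valid pairs
```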
Total-variation loss $\mathcal{L}_{\mathrm{tv}}$ for edge-preserving smoothness on $v$
The TV loss is defined with forward differences:

$$\mathcal{L}_{\mathrm{tv}} = \mathbb{E}\big[\lvert \partial_x v \rvert\big] + \mathbb{E}\big[\lvert \partial_y v \rvert\big], \qquad \partial_x v_{i,j} = v_{i,j+1} - v_{i,j}$$
| Piece | Definition | Intuition |
|---|---|---|
| $\partial_x v,\ \partial_y v$ | forward differences (discrete gradients) | local spatial rate of change of $v$ |
| $\lvert \cdot \rvert$ (L1) | absolute value | — |
| anisotropic form | sum of axis-wise magnitudes | simpler than isotropic $\sqrt{(\partial_x v)^2 + (\partial_y v)^2}$ |
| mean over pixels | size-normalized | resolution-independent magnitude |
| $\lambda_{\mathrm{tv}} = 0.005$ (small) | regularizer, not objective | suppress speckle without flattening $v$ |
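For completeness, a sketch of the anisotropic TV term as written above, assuming a (B, 1, H, W) visibility map:

```python
import torch

def tv_loss(v: torch.Tensor) -> torch.Tensor:
    """Anisotropic total variation of a (B, 1, H, W) visibility map, size-normalized."""
    dx = v[..., :, 1:] - v[..., :, :-1]   # forward difference along x
    dy = v[..., 1:, :] - v[..., :-1, :]   # forward difference along y
    return dx.abs().mean() + dy.abs().mean()
```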