The past experiments were another attempt to get a purely diffusion-based approach (the Marigold-adjacent diffusion architecture) to converge towards a reasonable metric depth estimator.

Below is the comparison table for our diffusion-only metric predictor ("mari_metric_2") against the UniDepthV2 and DepthAnythingV3 models, evaluated on both the hazy and the clear real-drive-sim dataset.

The metrics are computed over a benchmark depth range of 0 to 80 meters.
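For reference, a minimal sketch of how such range-restricted MAE/RMSE can be computed; `depth_metrics` is a hypothetical helper, not the actual evaluation code:

```python
import torch

def depth_metrics(pred: torch.Tensor, gt: torch.Tensor,
                  d_min: float = 0.0, d_max: float = 80.0) -> dict:
    """MAE / RMSE over pixels whose ground truth lies in (d_min, d_max]."""
    valid = (gt > d_min) & (gt <= d_max)   # range restriction; also drops gt == 0
    err = pred[valid] - gt[valid]
    return {"mae": err.abs().mean().item(),
            "rmse": err.pow(2).mean().sqrt().item()}
```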

[Table: real-drive-sim · depth]

Evaluation of the UniDepthV2 model on a foggy dataset:

Evaluation of our metric diffusion only model on the foggy dataset:

Oddity: hazy RGB performs BETTER than clean RGB in the benchmark range [0, 80] meters.

For the UniDepthV2 models, the MAE (though not the RMSE) was better on hazy RGBs than on clear RGBs. This MIGHT be due to the mechanics of the fog curation: the mean visibility is around 100 meters, possibly allowing the model to focus a bit better on the visible regions.

Depth error of UniDepthV2 vs DepthAnythingV3 on the clear dataset.

Depth error of UniDepthV2 vs DepthAnythingV3 on the hazy dataset.

Diffusion-only approach: Architecture and Losses

$$
\begin{array}{|l|c|c|l|}
\hline
\textbf{Component} & \textbf{State} & \textbf{LR} & \textbf{Rationale} \\
\hline
\text{VAE encoder (SD 2.1)} & \text{frozen} & \text{—} & \text{preserve well-conditioned latent manifold} \\
\text{VAE decoder (SD 2.1)} & \text{frozen} & \text{—} & \text{reused for dehaze AND depth decoding} \\
\text{CLIP text encoder} & \text{frozen} & \text{—} & \text{not used; empty embedding injected} \\
\hline
\text{conv\_in (4ch} \to \text{8ch)} & \text{trainable} & 1.5 \times 10^{-5} & \text{input re-wired: [noisy target, hazy cond]} \\
\text{UNet encoder (shared)} & \text{trainable} & 1.5 \times 10^{-5} & \text{shared across both tasks} \\
\text{mid\_block} & \text{trainable} & 1.5 \times 10^{-5} & \text{part of shared trunk} \\
\text{Dehaze decoder (up\_blocks)} & \text{trainable} & 3.0 \times 10^{-5} & \text{higher LR for task-specific head} \\
\text{Depth decoder (up\_blocks)} & \text{trainable} & 3.6 \times 10^{-5} & \text{highest LR: new domain (depth)} \\
\hline
\text{Visibility head} & \text{trainable} & \text{encoder-LR} & \text{lightweight, learns from aux losses} \\
\text{Viz residual adapters} & \text{trainable} & \text{decoder-LR} & \text{zero-init} \to \text{identity at } t{=}0 \\
\hline
\end{array}
$$
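A minimal sketch of how these per-component learning rates could be wired into an optimizer, assuming a diffusers-style `UNet2DConditionModel` (so `conv_in`, `down_blocks`, and `mid_block` exist as attributes); `dehaze_up_blocks`, `depth_up_blocks`, `vis_head`, and `vis_adapters` are assumed module handles, not confirmed names:

```python
import itertools
import torch

def build_optimizer(unet, dehaze_up_blocks, depth_up_blocks,
                    vis_head, vis_adapters) -> torch.optim.AdamW:
    """Per-component LR groups from the table above (module names are assumptions)."""
    enc_lr, dehaze_lr, depth_lr = 1.5e-5, 3.0e-5, 3.6e-5
    groups = [
        # shared trunk: re-wired conv_in + UNet encoder + mid_block
        {"params": itertools.chain(unet.conv_in.parameters(),
                                   unet.down_blocks.parameters(),
                                   unet.mid_block.parameters()),
         "lr": enc_lr},
        # task-specific decoder heads get higher LRs
        {"params": dehaze_up_blocks.parameters(), "lr": dehaze_lr},
        {"params": depth_up_blocks.parameters(), "lr": depth_lr},
        # visibility head at encoder LR; zero-init adapters at decoder LR
        {"params": vis_head.parameters(), "lr": enc_lr},
        {"params": vis_adapters.parameters(), "lr": dehaze_lr},
    ]
    return torch.optim.AdamW(groups)
```

The VAE and the CLIP text encoder simply get no parameter group, which keeps them frozen.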

$$
\begin{array}{|l|c|c|c|l|}
\hline
\textbf{Stream} & \textbf{Space} & \textbf{Shape} & \textbf{Role} & \textbf{Transform} \\
\hline
\text{hazy RGB} & \text{pixel} & [3,H,W] & \text{condition} & \text{VAE.encode} \to z_{\text{hazy}} \\
\text{clean RGB} & \text{pixel} & [3,H,W] & \text{target (dehaze)} & \text{VAE.encode} \to z_{\text{clean}} \\
\text{depth (metric, m)} & \text{pixel} & [1,H,W] & \text{target (depth)} & \text{normalize} \to \text{repeat 3ch} \to \text{VAE.encode} \to z_{\text{depth}} \\
\text{valid mask} & \text{pixel} & [1,H,W]\text{ bool} & \text{loss weight} & 8\times\text{ conservative downsample} \\
\hline
\end{array}
$$
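A sketch of the depth and mask rows in PyTorch. The exact normalization (clamp to [0, 80] m, affine map to [-1, 1]) and the reading of "conservative" as all-covered-pixels-valid (min-pool) are assumptions:

```python
import torch
import torch.nn.functional as F

def encode_depth(depth_m: torch.Tensor, vae, d_max: float = 80.0) -> torch.Tensor:
    """[B,1,H,W] metric depth in meters -> z_depth: [B,4,H/8,W/8].

    The normalization here is an assumption; any invertible map into the
    VAE's expected input range would do.
    """
    d = depth_m.clamp(0.0, d_max) / d_max * 2.0 - 1.0   # [-1, 1], like an RGB image
    d3 = d.repeat(1, 3, 1, 1)                           # fake 3-channel image
    return vae.encode(d3).latent_dist.sample() * vae.config.scaling_factor

def downsample_mask(valid: torch.Tensor, factor: int = 8) -> torch.Tensor:
    """Conservative 8x downsample: a latent pixel counts as valid only if
    every covered pixel is valid (min-pool implemented as -maxpool(-m))."""
    return -F.max_pool2d(-valid.float(), kernel_size=factor)
```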

$$
\begin{array}{|l|l|l|}
\hline
\textbf{Symbol} & \textbf{Meaning} & \textbf{Origin} \\
\hline
z_{\text{hazy}} & \text{latent of hazy input} & \text{VAE.encode(hazy)} \\
z_{\text{clean}} & \text{latent of clean target} & \text{VAE.encode(clean)} \\
z_{\text{depth}} & \text{latent of depth target} & \text{VAE.encode(repeat(norm(depth),3))} \\
x_t^{(k)} & \text{noisy target at step } t & \alpha_t z_{\text{target}}^{(k)} + \sigma_t \varepsilon \\
v_\star^{(k)} & \text{v-prediction target} & \alpha_t \varepsilon - \sigma_t z_{\text{target}}^{(k)} \\
v & \text{visibility map} & \sigma(\text{Head}(z_{\text{hazy}})) \\
\hline
\end{array}
$$
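The two derived quantities, $x_t^{(k)}$ and $v_\star^{(k)}$, are the standard v-prediction parameterization; a sketch matching the table exactly:

```python
import torch

def make_vpred_pair(z_target: torch.Tensor, alpha_t: torch.Tensor,
                    sigma_t: torch.Tensor):
    """x_t = alpha_t*z + sigma_t*eps and v_star = alpha_t*eps - sigma_t*z,
    with alpha_t, sigma_t broadcastable (e.g. [B,1,1,1])."""
    eps = torch.randn_like(z_target)              # epsilon ~ N(0, I)
    x_t = alpha_t * z_target + sigma_t * eps      # forward-noised latent
    v_star = alpha_t * eps - sigma_t * z_target   # v-prediction target
    return x_t, v_star
```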

UNet input/output per task

$$
\begin{array}{|c|c|c|c|}
\hline
\textbf{Task} & \textbf{Input (concat, 8ch)} & \textbf{Output (4ch)} & \textbf{Target} \\
\hline
\text{dehaze} & [x_t^{(\text{dehaze})} \,\Vert\, z_{\text{hazy}}] & \hat v^{(\text{dehaze})} & v_\star^{(\text{dehaze})} \\
\text{depth} & [x_t^{(\text{depth})} \,\Vert\, z_{\text{hazy}}] & \hat v^{(\text{depth})} & v_\star^{(\text{depth})} \\
\hline
\end{array}
$$
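A sketch of one forward pass per task, assuming a diffusers-style `UNet2DConditionModel` whose `conv_in` takes 8 channels; the routing into the task-specific up_blocks is assumed to happen inside `unet` and is not shown:

```python
import torch

def unet_step(unet, x_t, z_hazy, t, empty_emb):
    """One v-prediction forward pass: v_hat = UNet([x_t || z_hazy], t).

    empty_emb is the frozen CLIP embedding of the empty prompt ""
    ([B, 77, 1024] for SD 2.1's OpenCLIP encoder).
    """
    x_in = torch.cat([x_t, z_hazy], dim=1)  # [B, 8, h, w]: noisy target + condition
    return unet(x_in, t, encoder_hidden_states=empty_emb).sample  # [B, 4, h, w]
```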

The same $z_{\text{hazy}}$ is the condition for both branches; that is what makes dehaze and depth jointly conditioned on the observation.

$$
\begin{aligned}
& \text{Let } k \in \{\text{dehaze}, \text{depth}\} \\
& \bullet\ \tilde{m} = \text{downsampled valid mask} \\
& \bullet\ v = \text{visibility map} \\
& \bullet\ \delta = |z_{\text{hazy}} - z_{\text{clean}}| \quad \text{(latent-space haze residual)}
\end{aligned}
$$

$$
\begin{array}{|l|l|c|c|}
\hline
\textbf{Term} & \textbf{Expression} & \textbf{Masked?} & \textbf{Default weight} \\
\hline
L_{\text{dehaze}} & \Vert \hat v^{(\text{dehaze})} - v_\star^{(\text{dehaze})} \Vert_2^2 & \text{no} & 1.0 \\
\hline
L_{\text{depth}} & \frac{1}{|\tilde m|}\sum \tilde m \odot (\hat v^{(\text{depth})} - v_\star^{(\text{depth})})^2 & \text{yes} & 1.0 \\
\hline
L_{\text{vis}} & \text{BCE}\!\left(v,\ v_\star^{\text{vis}}\right),\quad v_\star^{\text{vis}} = \exp\!\left(-\gamma\,\delta / q_{0.9}(\delta)\right) & \text{no} & 0.05 \\
\hline
L_{\text{preserve}} & \text{penalize low } v \text{ where } \delta \text{ is small} & \text{no} & 0.05 \\
\hline
L_{\text{rank}} & \text{contrastive: } v_{\text{low-haze}} > v_{\text{high-haze}} & \text{yes} & 0.02 \\
\hline
L_{\text{tv}} & \text{anisotropic TV}(v) = \sum |\nabla_x v| + |\nabla_y v| & \text{no} & 0.005 \\
\hline
L_{\text{total}} & \sum_i \lambda_i L_i & & \\
\hline
\end{array}
$$
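A minimal sketch of how the weighted total could be assembled in PyTorch. The BCE target follows the $\exp(-\gamma\,\delta / q_{0.9}(\delta))$ expression above, but `gamma`, the low-haze threshold inside $L_{\text{preserve}}$, the channel-mean reduction of $\delta$, and the latent-resolution depth `d_latent` are all assumptions; `rank_loss` and `tv_loss` are the helpers sketched in the subsections below:

```python
import torch
import torch.nn.functional as F

WEIGHTS = {"dehaze": 1.0, "depth": 1.0, "vis": 0.05,
           "preserve": 0.05, "rank": 0.02, "tv": 0.005}

def total_loss(v_hat_dehaze, v_star_dehaze, v_hat_depth, v_star_depth,
               v, z_hazy, z_clean, m_tilde, d_latent, gamma: float = 3.0):
    # latent-space haze residual, reduced to one channel (an assumption)
    delta = (z_hazy - z_clean).abs().mean(dim=1, keepdim=True)
    l_dehaze = F.mse_loss(v_hat_dehaze, v_star_dehaze)
    # masked depth MSE: average only over valid latent pixels
    n_valid = (m_tilde.sum() * v_hat_depth.size(1)).clamp(min=1.0)
    l_depth = ((v_hat_depth - v_star_depth).pow(2) * m_tilde).sum() / n_valid
    # BCE against the soft target exp(-gamma * delta / q_0.9(delta))
    q90 = torch.quantile(delta.flatten(1), 0.9, dim=1).view(-1, 1, 1, 1)
    v_vis = torch.exp(-gamma * delta / q90.clamp(min=1e-6))
    l_vis = F.binary_cross_entropy(v.clamp(1e-6, 1 - 1e-6), v_vis)
    # preserve: v should stay high where almost no haze was added
    # (the 0.1 * q90 "small residual" threshold is an assumption)
    low_haze = (delta < 0.1 * q90).float()
    l_preserve = (low_haze * (1.0 - v)).mean()
    # per-image ranking prior and TV smoothness; helpers sketched below
    l_rank = torch.stack([rank_loss(v[b].flatten(), d_latent[b].flatten(),
                                    m_tilde[b].flatten())
                          for b in range(v.size(0))]).mean()
    l_tv = tv_loss(v)
    terms = {"dehaze": l_dehaze, "depth": l_depth, "vis": l_vis,
             "preserve": l_preserve, "rank": l_rank, "tv": l_tv}
    return sum(WEIGHTS[k] * terms[k] for k in terms)
```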

Visibility regularizers

Rank loss $L_{\text{rank}}$ - pairwise depth-visibility monotonicity

For each image, sample 128 random pairs $(i, j)$ of valid latent pixels and penalize ordering violations:

$$
\Delta_d = d_j - d_i, \qquad s = \text{sign}(\Delta_d)\,(v_i - v_j), \qquad \ell_{ij} = \text{softplus}(-s)
$$

$$
\begin{array}{|l|l|l|}
\hline
\textbf{Piece} & \textbf{Definition} & \textbf{Intuition} \\
\hline
\Delta_d = d_j - d_i & \text{depth gap between sampled pixels} & \text{which is closer / farther from camera} \\
\hline
|\Delta_d| > 0.05 & \text{margin filter} & \text{drop near-equal-depth pairs (no signal)} \\
\hline
\text{sign}(\Delta_d) & \text{direction flag } \in \{-1, +1\} & +1 \text{ if } j \text{ is farther}, \ -1 \text{ otherwise} \\
\hline
s = \text{sign}(\Delta_d)(v_i - v_j) & \text{signed visibility difference} & s > 0 \iff \text{closer pixel has higher } v \\
\hline
\text{softplus}(-s) = \log(1 + e^{-s}) & \text{smooth hinge / logistic penalty} & \approx 0 \text{ if correctly ranked}; \ \approx -s \text{ if inverted} \\
\hline
N_{\text{pairs}} = 128 & \text{stochastic sampling} & \text{unbiased estimator vs. } O(N^2) \text{ full compare} \\
\hline
L_{\text{rank}} = \mathbb{E}[\ell_{ij}] & \text{mean over valid pairs} & \text{monotonic prior: } v \downarrow \text{ as } d \uparrow \\
\hline
\end{array}
$$

Enforce an ordering consistent with atmospheric scattering physics ($v$ should decrease with depth). That makes it a much weaker and safer prior than the BCE term, which targets a specific numeric value.
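A sketch of this loss as described above (128 random pairs, $|\Delta_d| > 0.05$ margin filter, softplus penalty); the flat per-image tensor convention is an assumption, and $d$ is taken to be the normalized depth so the margin is in normalized units:

```python
import torch
import torch.nn.functional as F

def rank_loss(v: torch.Tensor, d: torch.Tensor, valid: torch.Tensor,
              n_pairs: int = 128, margin: float = 0.05) -> torch.Tensor:
    """Pairwise monotonicity prior for one image.

    v, d, valid: flat [N] tensors at latent resolution (visibility,
    normalized depth, valid mask).
    """
    idx = valid.nonzero().squeeze(1)                  # indices of valid pixels
    if idx.numel() < 2:
        return v.new_zeros(())
    i = idx[torch.randint(idx.numel(), (n_pairs,), device=idx.device)]
    j = idx[torch.randint(idx.numel(), (n_pairs,), device=idx.device)]
    delta_d = d[j] - d[i]
    keep = delta_d.abs() > margin                     # drop near-equal-depth pairs
    if not keep.any():
        return v.new_zeros(())
    # s > 0 iff the closer pixel has the higher visibility (correct order)
    s = torch.sign(delta_d[keep]) * (v[i[keep]] - v[j[keep]])
    return F.softplus(-s).mean()                      # smooth hinge on violations
```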


Total-variation loss $L_{\text{tv}}$ - edge-preserving smoothness on $v$

The TV loss:

$$
L_{\text{tv}} = \mathbb{E}\!\left[|\partial_x v|\right] + \mathbb{E}\!\left[|\partial_y v|\right], \qquad \partial_x v_{i,j} = v_{i,j+1} - v_{i,j}
$$

$$
\begin{array}{|l|l|l|}
\hline
\textbf{Piece} & \textbf{Definition} & \textbf{Intuition} \\
\hline
\partial_x v,\ \partial_y v & \text{forward differences (discrete gradients)} & \text{local spatial rate of change of } v \\
\hline
|\cdot| \text{ (L1)} & \text{absolute value} & \text{linear penalty tolerates sharp edges better than L2} \\
\hline
\text{anisotropic form} & \text{sum of axis-wise magnitudes} & \text{simpler than isotropic } \sqrt{(\partial_x v)^2 + (\partial_y v)^2} \\
\hline
\text{mean over pixels} & \text{size-normalized} & \text{resolution-independent magnitude} \\
\hline
\lambda_{\text{tv}} = 0.005 \text{ (small)} & \text{regularizer, not objective} & \text{suppress speckle without flattening } v \\
\hline
\end{array}
$$
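A sketch of the anisotropic TV term, matching the forward-difference definition above (this is the `tv_loss` helper referenced in the total-loss sketch):

```python
import torch

def tv_loss(v: torch.Tensor) -> torch.Tensor:
    """Anisotropic total variation on the visibility map v: [B, 1, h, w].

    Forward differences along x and y, L1 magnitude, mean over pixels so
    the value is resolution-independent.
    """
    dx = (v[..., :, 1:] - v[..., :, :-1]).abs().mean()  # horizontal gradients
    dy = (v[..., 1:, :] - v[..., :-1, :]).abs().mean()  # vertical gradients
    return dx + dy
```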