The last two weeks were spent on a lot of experiments with the RDS dataset and the MERGE architecture.

Status quo was:

  • Dehazing via SD2.1 works well
  • Novelty from MERGE: use a Marigold-adjacent architecture with dual converters ("prediction heads") for image generation and depth maps.
  • Early MERGE experiments worked well on VKITTI2 (1216 x 352 RGB)
  • Extend tests to real-drive-sim
image.png
MERGE dehazed version of VKITTI2 scene
image.png
MERGE depth prediction of VKITTI2 scene

These outputs proved to be pretty comparable with those of the SD2.1 dehazing approach, while also providing a reasonable relative depth map.
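
For context, the dual-converter idea can be sketched roughly as below. This is a minimal, hypothetical sketch, not the actual MERGE code; `DualHeadDenoiser`, `rgb_head` and `depth_head` are illustrative names, and the backbone is just a placeholder for the shared denoiser.

```python
import torch
import torch.nn as nn

class DualHeadDenoiser(nn.Module):
    """Toy sketch of the dual-converter idea: one shared denoising backbone,
    two lightweight heads predicting the dehazed-image latent and the depth
    latent from the same features. Names and shapes are illustrative only."""

    def __init__(self, backbone: nn.Module, latent_ch: int = 4):
        super().__init__()
        self.backbone = backbone                       # shared denoiser (DiT or U-Net)
        self.rgb_head = nn.Conv2d(latent_ch, latent_ch, kernel_size=1)
        self.depth_head = nn.Conv2d(latent_ch, latent_ch, kernel_size=1)

    def forward(self, noisy_latents, timesteps, hazy_latent):
        # Condition on the hazy image by channel-wise latent concatenation.
        feats = self.backbone(torch.cat([noisy_latents, hazy_latent], dim=1), timesteps)
        return self.rgb_head(feats), self.depth_head(feats)   # (image latent, depth latent)
```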

image.png
image.png
MERGE dehazed real-drive-sim scene

real-drive-sim fails where VKITTI2 worked

At first I assumed this noise was due to "not enough denoising steps". The problem was neither the hyperparameters, nor too few epochs, nor the number of denoising steps, but rather the architecture itself.

MERGE employs PixArt, which is trained on 1024x512 images and is based on Diffusion Transformers that patchify the image; how well it reproduces detail at high resolution is highly dependent on the positional encodings it saw during training. While it is multiscale, for a task like image-to-image it simply fails to capture high-frequency information on high-resolution data.
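
To make the resolution argument concrete, here is a back-of-the-envelope token count, assuming the usual PixArt-style setup of an 8x VAE downsampling and 2x2 latent patches; the 1920x1080 line is only an illustrative stand-in for a high-resolution driving scene, not the actual real-drive-sim resolution.

```python
def n_tokens(width: int, height: int, vae_down: int = 8, patch: int = 2) -> int:
    """Number of DiT tokens for an image, assuming an 8x VAE and 2x2 latent patches."""
    return (width // vae_down // patch) * (height // vae_down // patch)

print(n_tokens(1024, 512))   # 2048 tokens: the regime the positional encodings were trained on
print(n_tokens(1216, 352))   # 1672 tokens: VKITTI2 stays close to that regime
print(n_tokens(1920, 1080))  # 8040 tokens: a high-res scene forces heavy interpolation of the
                             # positional embeddings, washing out high-frequency detail
```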

This is why the initial VKITTI experiments looked good:

  • VKITTI2 ("older dataset") has mostly plain/smooth surfaces
  • Its native resolution (1216x352) is more closely aligned with PixArt's training resolution
image.png

This means:

  • Use a backbone that supports "higher resolutions" (so we can work with realistic driving scenes)
  • Find a different approach

Since most experiments already use a batch size of 11 and max out the capabilities of a 4090, we might want to look at other models instead.

Fortunately, Marigold did exactly that and works with the original Stable Diffusion architecture, i.e. U-Nets.

Why U-Nets

While a DiT may understand the global composition better (that's exactly where the transformer shines), we do not need to understand the global composition at all. (At least not for the dehazing part; for understanding real-world priors... maybe.)

This is where the U-Net with its skip connections lets us preserve image details far better.

Either we drastically increase the "number of patches" in a DiT to preserve structure, or we use the naive but, in our case, fitting U-Net approach.
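
As a toy illustration of what the skip connections buy us (a sketch, not our actual model): full-resolution encoder features are concatenated straight into the decoder, so fine detail never has to survive the bottleneck.

```python
import torch
import torch.nn as nn

class TinyUNetBlock(nn.Module):
    """Minimal U-Net-style block: detail-carrying encoder features skip the bottleneck."""

    def __init__(self, ch: int = 64):
        super().__init__()
        self.enc = nn.Conv2d(ch, ch, 3, padding=1)
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)    # bottleneck path loses resolution
        self.up = nn.ConvTranspose2d(ch, ch, 2, stride=2)
        self.dec = nn.Conv2d(2 * ch, ch, 3, padding=1)           # sees skip + upsampled features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skip = self.enc(x)                            # full-resolution, high-frequency features
        bottleneck = self.up(self.down(skip))         # coarse, globally mixed features
        return self.dec(torch.cat([skip, bottleneck], dim=1))
```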

If we assume our task is mostly discriminative (heavily "biased" by the concatenated latent), i.e. going from a lightly hazed image to a clear image has little of a "generative" side to it, then the downsides of U-Net models vs. DiTs are basically zero, since the global attention of DiTs is a non-factor for understanding the scene: the scene is already almost fully present in the inputs.

Only in the deeper parts of the image, where we actually inpaint the scene into a "foggy wall", does this become a concern.

image.png
A dehazed version of a real-drive-sim scene from our earliest model

Our very first approach of using SD2.1 with latent concatenation (clear+noise, hazy) already produced extremely close results, but was simply ditched for being the "dehazing only" baseline.
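
For reference, this conditioning is cheap to retrofit onto SD2.1; below is a hedged sketch along the lines of what Marigold does (the checkpoint name and weight-duplication details are illustrative, not our training code).

```python
import torch
import torch.nn as nn
from diffusers import UNet2DConditionModel

# Load the SD2.1 denoising U-Net (expects 4 latent input channels).
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)

# Widen conv_in from 4 to 8 channels so the (clean) hazy latent can be
# concatenated with the noisy clear-image latent.
old = unet.conv_in
new = nn.Conv2d(8, old.out_channels, old.kernel_size, old.stride, old.padding)
with torch.no_grad():
    new.weight[:, :4] = old.weight          # reuse pretrained weights for the noisy latent
    new.weight[:, 4:] = old.weight          # duplicate them for the conditioning latent
    new.weight *= 0.5                       # keep the initial activation scale unchanged
    new.bias.copy_(old.bias)
unet.conv_in = new
unet.register_to_config(in_channels=8)

# Per training/inference step, the model input is then simply:
# model_in = torch.cat([noisy_clear_latent, hazy_latent], dim=1)   # (B, 8, h, w)
```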

So, instead of relying on MERGE, the original Marigold approach is the way forward: cheaper and faster training with good results on high-resolution images.

image.png
Clear image prediction of our new SD2.1 joint prediction model.

Model Visualization

Since training on VKITTI is cheaper, the following experiments fall back to that dataset again (knowing that the approach works for pretty much arbitrary, reasonably large resolutions).

Ablations

  • Radial Depth Pred
  • Planar Depth + FoV Map conditioning (see the sketch after this list)
  • Valid-Depths
  • Unreflect
  • Encoder freeze?
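
For the first two ablations, the underlying conversions are straightforward; here is a minimal sketch assuming standard pinhole intrinsics fx, fy, cx, cy (illustrative only, not tied to our actual data loaders).

```python
import numpy as np

def planar_to_radial(planar_depth: np.ndarray, fx: float, fy: float,
                     cx: float, cy: float) -> np.ndarray:
    """Convert planar (z-buffer) depth to radial (Euclidean) depth per pixel."""
    h, w = planar_depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Ray length per unit depth: ||((u-cx)/fx, (v-cy)/fy, 1)|| from the pinhole model.
    scale = np.sqrt(((u - cx) / fx) ** 2 + ((v - cy) / fy) ** 2 + 1.0)
    return planar_depth * scale

def fov_map(h: int, w: int, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Per-pixel angle (radians) between the viewing ray and the optical axis,
    usable as an extra conditioning channel."""
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    tan_theta = np.sqrt(((u - cx) / fx) ** 2 + ((v - cy) / fy) ** 2)
    return np.arctan(tan_theta)
```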