ICLR 2026

Latent Wavelet
Diffusion

For Ultra-High-Resolution Image Synthesis

Luigi Sigillo 1,2,3  ·  Shengfeng He 2  ·  Danilo Comminiello 1

1Sapienza University of Rome  ·  2Singapore Management University  ·  3EMBL

📄 Paper 💻 Code 📊 Results
[Figure: paper teaser]

The UHR Challenge & Our Solution

High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight training framework that significantly improves detail and texture fidelity in ultra-high-resolution (2K-4K) image synthesis. LWD introduces a novel, frequency-aware masking strategy derived from wavelet energy maps, which dynamically focuses the training process on detail-rich regions of the latent space. This is complemented by a scale-consistent VAE objective to ensure high spectral fidelity. The primary advantage of our approach is its efficiency: LWD requires no architectural modifications and adds zero additional cost during inference, making it a practical solution for scaling existing models. Across multiple strong baselines, LWD consistently improves perceptual quality and FID scores, demonstrating the power of signal-driven supervision as a principled and efficient path toward high-resolution generative modeling.

🌊
VAE Fine-Tuning with Scale-Consistency Loss
Fine-tuned with a scale-consistency loss to create a clean latent space free of HF compression artifacts.
🗺️
Wavelet Saliency Maps
A non-trainable 1-level DWT localizes where high-frequency energy lives — a heatmap of structural richness.
🎯
Adaptive Masked Loss
Time-dependent binary mask dynamically focuses gradient updates on the salient, detail-rich regions that matter most.
Zero Inference Overhead
DWT and masking are discarded after training; the model used at sampling time is architecturally identical to the baseline.

The LWD Framework

LWD is built on three tightly integrated stages that work together to solve the uniform-supervision bottleneck at ultra-high resolution.

Complete Architecture Flow
[Figure: Latent Wavelet Diffusion architecture diagram]
01
Stage 1

VAE Fine-Tuning with Scale-Consistency Loss

Standard VAEs optimize for reconstruction, not spectral accuracy: the encoder/decoder introduces its own high-frequency noise and aliasing, leaving a "dirty" latent code. Saliency maps computed from such a latent would misguide training rather than focus it.

We fine-tune the VAE with a multi-resolution scale-consistency loss. The key insight: real-world structures are self-similar across scales, while compression artifacts are not. We enforce that encoding a downscaled image should be spectrally identical to downscaling the full-res encoding.

$$ \mathcal{L}_{\text{VAE}} = \underbrace{\|D(z) - x\|^2_2}_{\text{Reconstruction}} + \alpha \underbrace{\|D(E(x_{\text{down}})) - x_{\text{down}}\|^2_2}_{\text{Scale Consistency}} + \beta \underbrace{D_{\text{KL}}(q(z \mid x) \,\|\, p(z))}_{\text{Latent Regularization}} + \lambda \underbrace{\mathcal{L}_{\text{LPIPS}}(D(z), x)}_{\text{Perceptual Loss}}. $$
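The reconstruction and scale-consistency terms above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `encode`/`decode` are toy stand-ins for the VAE, `downscale` is simple average pooling in place of a proper resize, and the KL and LPIPS terms are omitted.

```python
import numpy as np

def downscale(x, factor=2):
    """Average-pool downscaling (toy stand-in for a bicubic resize)."""
    h, w = x.shape[-2] // factor, x.shape[-1] // factor
    return x.reshape(*x.shape[:-2], h, factor, w, factor).mean(axis=(-3, -1))

def vae_losses(x, encode, decode, alpha=0.5):
    """Reconstruction + scale-consistency terms of the VAE objective.

    recon enforces ||D(z) - x||^2; recon_down enforces that the VAE
    reconstructs a downscaled image just as faithfully, tying its
    spectral behavior together across scales. KL / LPIPS omitted.
    """
    z = encode(x)
    recon = np.mean((decode(z) - x) ** 2)
    x_down = downscale(x)
    recon_down = np.mean((decode(encode(x_down)) - x_down) ** 2)
    return recon + alpha * recon_down

# Sanity check with an identity codec: both terms vanish exactly.
x = np.random.rand(3, 8, 8)
loss = vae_losses(x, encode=lambda a: a, decode=lambda a: a)
```

With a real VAE, `encode`/`decode` would be the fine-tuned networks and both terms would be strictly positive; the identity codec merely checks the plumbing.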
02
Stage 2

Wavelet Saliency Maps via DWT

We apply a fast, non-trainable 1-level Discrete Wavelet Transform (Haar wavelet) to the noisy latent $z_t$ at each training step. This decomposes it into four subbands: LL (approximation), LH (horizontal HF), HL (vertical HF), and HH (diagonal HF).

We aggregate the detail subbands into a spatial energy map — bright regions are structurally rich (hair, foliage), dark regions are simple (sky, flat walls).

$$ E(i,j) = \frac{1}{C} \sum_c \left[ (z_{LH}^{c,i,j})^2 + (z_{HL}^{c,i,j})^2 + (z_{HH}^{c,i,j})^2 \right]. $$
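The DWT and energy map above are cheap enough to write out directly. The sketch below is a minimal NumPy version of a 1-level Haar transform on a (C, H, W) latent; note that LH/HL naming conventions vary between libraries, so treat the subband labels here as illustrative.

```python
import numpy as np

def haar_dwt2(z):
    """One-level 2-D Haar DWT over the last two axes of a (C, H, W) array.

    Returns (LL, LH, HL, HH), each at half spatial resolution.
    """
    # Rows: lowpass (sum) and highpass (difference), orthonormal scaling.
    lo = (z[..., 0::2, :] + z[..., 1::2, :]) / np.sqrt(2)
    hi = (z[..., 0::2, :] - z[..., 1::2, :]) / np.sqrt(2)
    # Columns: same filters along the last axis.
    ll = (lo[..., 0::2] + lo[..., 1::2]) / np.sqrt(2)
    lh = (lo[..., 0::2] - lo[..., 1::2]) / np.sqrt(2)
    hl = (hi[..., 0::2] + hi[..., 1::2]) / np.sqrt(2)
    hh = (hi[..., 0::2] - hi[..., 1::2]) / np.sqrt(2)
    return ll, lh, hl, hh

def energy_map(z):
    """E(i,j): squared detail-subband energy, averaged over channels."""
    _, lh, hl, hh = haar_dwt2(z)
    return (lh ** 2 + hl ** 2 + hh ** 2).mean(axis=0)

z = np.random.randn(4, 16, 16)  # toy latent: 4 channels, 16x16
E = energy_map(z)               # spatial energy map at half resolution
```

A constant latent yields an all-zero energy map, matching the intuition that flat regions carry no high-frequency energy.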
03
Stage 3

Time-Dependent Adaptive Mask $M_t$

High-frequency detail is the first thing destroyed by noise. At high timesteps (lots of noise), the saliency map is itself noisy — we must not rely on it for fine-grained masking. Our mask follows a curriculum:

High noise (high t): Mask less, forcing the model to learn global structure (LL band).
Low noise (low t): Mask aggressively, focusing only on the most salient HF regions. The threshold tightens as detail becomes recoverable.

$$ M_t(i,j) = \begin{cases} 1 & \text{if } T \cdot (A_{\text{wavelet}}(i,j) + \ell) \geq t \\ 0 & \text{otherwise} \end{cases}. $$
$$ \mathcal{L}_{\text{masked}} = \left\| M_t \odot \left[ (\epsilon - z_0) - v_\Theta(z_t, t, y) \right] \right\|_2^2. $$
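The mask rule and masked loss above can be sketched as follows. This is a toy NumPy version, not the paper's code: the offset `floor` stands in for $\ell$, the saliency map is min-max normalized to [0, 1] (an assumption, since the paper's exact normalization of $A_{\text{wavelet}}$ is not shown here), and the per-pixel averaging over unmasked entries is a simplification.

```python
import numpy as np

def adaptive_mask(energy, t, T=1000, floor=0.1):
    """Binary mask M_t(i,j) = 1 where T * (A(i,j) + floor) >= t.

    `floor` (standing in for the offset l) guarantees every region is
    supervised for at least the lowest-noise fraction of timesteps;
    higher-energy regions stay supervised deeper into the noise schedule.
    """
    a = (energy - energy.min()) / (np.ptp(energy) + 1e-8)  # A in [0, 1]
    return (T * (a + floor) >= t).astype(np.float32)

def masked_loss(target, pred, mask):
    """|| M_t * (target - pred) ||^2, averaged over unmasked entries."""
    diff = mask * (target - pred)
    return (diff ** 2).sum() / np.maximum(mask.sum(), 1.0)

energy = np.random.rand(8, 8)
m_low = adaptive_mask(energy, t=50)    # low noise: every pixel passes
m_high = adaptive_mask(energy, t=950)  # high noise: only the most salient
```

In a training loop, `target - pred` would be the velocity residual $(\epsilon - z_0) - v_\Theta(z_t, t, y)$ computed per latent pixel.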

Quantitative Results

LWD is model-agnostic: it improves perceptual quality (FID ↓, LPIPS ↓) and dramatically improves texture fidelity (GLCM ↑) on both diffusion and flow-matching architectures, across 2K and 4K resolutions.

| Model | FID ↓ | CLIPScore ↑ | Aesthetics ↑ | GLCM ↑ | Compression ↓ |
|---|---|---|---|---|---|
| SD3-F16 | 43.82 | 31.50 | 5.91 | 0.75 | 11.23 |
| SD3-Diff4k-F16 | 40.18 | 34.04 | 5.96 | 0.79 | 11.73 |
| LWD + SD3-F16 | 38.74 | 34.94 | 6.17 | 0.74 | 11.99 |
| PixArt-Sigma-XL | 39.13 | 35.02 | 6.43 | 0.79 | 13.66 |
| LWD + PixArt-Sigma-XL | 36.14 | 35.21 | 6.27 | 0.87 | 6.05 |
| Sana-1.6B | 32.06 | 35.28 | 6.15 | 0.93 | 24.01 |
| LWD + Sana-1.6B | 34.30 | 35.58 | 6.23 | 0.78 | 27.34 |

What Changes Visually

The improvements are most visible in detail-rich regions: hair, fabric, foliage, fine architecture. Baseline models exhibit spectral collapse — blurring and loss of sharp edges. LWD-enhanced models preserve the intricate structures.

Interactive Visual Comparison

Choose a model with the slider, then drag each image divider to compare baseline output against LWD-enhanced output.

[Interactive comparison: baseline vs. +LWD outputs for PixArt-Sigma-XL, SD3-Diff4k-F16, Sana, and URAE, with two samples (A and B) per model]
Baseline (e.g. SD3-F16)
Common Failure Modes
- Spectral collapse: high-frequency details lost; skin and fabric appear over-smoothed
- Texture tiling: repetitive patterns fill large areas unnaturally
- Spatial inconsistency: global structure (face symmetry, building lines) degrades at UHR
- Muddled colors and blurred outlines at 4K

LWD (Ours)
Demonstrated Improvements
- Sharp HF detail: eyelashes, skin pores, and intricate lattice structures remain crisp
- Natural textures: no tiling; foliage and hair render with spatial variety
- Perceptual coherence: improved FID and LPIPS confirm a better global distribution match
- Vibrant, separated color panels and sharp architectural edges at 4K

Cite This Work

If you find LWD useful for your research, please consider citing our paper.

@inproceedings{sigillo2026latent,
  title={Latent Wavelet Diffusion For Ultra High-Resolution Image Synthesis},
  author={Luigi Sigillo and Shengfeng He and Danilo Comminiello},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=5og80LMVxG}
}