ICLR 2026

Latent Wavelet
Diffusion

For Ultra-High-Resolution Image Synthesis

Luigi Sigillo 1,2,3  ·  Shengfeng He 2  ·  Danilo Comminiello 1

1Sapienza University of Rome  ·  2Singapore Management University  ·  3EMBL

📄 Paper 💻 Code 📊 Results
[Figure: paper teaser]

The UHR Challenge & Our Solution

High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight training framework that significantly improves detail and texture fidelity in ultra-high-resolution (2K-4K) image synthesis. LWD introduces a novel, frequency-aware masking strategy derived from wavelet energy maps, which dynamically focuses the training process on detail-rich regions of the latent space. This is complemented by a scale-consistent VAE objective to ensure high spectral fidelity. The primary advantage of our approach is its efficiency: LWD requires no architectural modifications and adds zero additional cost during inference, making it a practical solution for scaling existing models. Across multiple strong baselines, LWD consistently improves perceptual quality and FID scores, demonstrating the power of signal-driven supervision as a principled and efficient path toward high-resolution generative modeling.

🌊
VAE Fine-Tuning with Scale-Consistency Loss
Fine-tuned with a scale-consistency loss to create a clean latent space free of HF compression artifacts.
🗺️
Wavelet Saliency Maps
A non-trainable 1-level DWT localizes where high-frequency energy lives — a heatmap of structural richness.
🎯
Adaptive Masked Loss
Time-dependent binary mask dynamically focuses gradient updates on the salient, detail-rich regions that matter most.
Zero Inference Overhead
DWT and masking are discarded after training; the model used at sampling time is architecturally identical to the baseline.

The LWD Framework

LWD is built on three tightly integrated stages that work together to solve the uniform-supervision bottleneck at ultra-high resolution.

Complete Architecture Flow
[Figure: Latent Wavelet Diffusion architecture diagram]
01
Stage 1

VAE Fine-Tuning with Scale-Consistency Loss

Standard VAEs optimize for reconstruction, not spectral accuracy: the encoder/decoder introduces its own high-frequency noise and aliasing, leaving a "dirty" latent code. Saliency maps computed from such a latent would misguide training rather than focus it.

We fine-tune the VAE with a multi-resolution scale-consistency loss. The key insight: real-world structures are self-similar across scales, while compression artifacts are not. We enforce that encoding a downscaled image should be spectrally identical to downscaling the full-res encoding.

$$ \mathcal{L}_{\text{VAE}} = \underbrace{\|D(z) - x\|^2_2}_{\text{Reconstruction}} + \alpha \underbrace{\|D(E(x_{\text{down}})) - x_{\text{down}}\|^2_2}_{\text{Scale Consistency}} + \beta \underbrace{D_{\text{KL}}(q(z \mid x) \,\|\, p(z))}_{\text{Latent Regularization}} + \lambda \underbrace{\mathcal{L}_{\text{LPIPS}}(D(z), x)}_{\text{Perceptual Loss}}. $$
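The reconstruction and scale-consistency terms above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `encode`/`decode` are toy stand-ins for the VAE, `downscale` is simple average pooling in place of a proper resize, and the KL and LPIPS terms are omitted.

```python
import numpy as np

def downscale(x, factor=2):
    """Average-pool downscaling (toy stand-in for a bicubic resize)."""
    h, w = x.shape[-2] // factor, x.shape[-1] // factor
    return x.reshape(*x.shape[:-2], h, factor, w, factor).mean(axis=(-3, -1))

def vae_losses(x, encode, decode, alpha=0.5):
    """Reconstruction + scale-consistency terms of the VAE objective.

    recon enforces ||D(z) - x||^2; recon_down enforces that the VAE
    reconstructs a downscaled image just as faithfully, tying its
    spectral behavior together across scales. KL / LPIPS omitted.
    """
    z = encode(x)
    recon = np.mean((decode(z) - x) ** 2)
    x_down = downscale(x)
    recon_down = np.mean((decode(encode(x_down)) - x_down) ** 2)
    return recon + alpha * recon_down

# Sanity check with an identity codec: both terms vanish exactly.
x = np.random.rand(3, 8, 8)
loss = vae_losses(x, encode=lambda a: a, decode=lambda a: a)
```

With a real VAE, `encode`/`decode` would be the fine-tuned networks and both terms would be strictly positive; the identity codec merely checks the plumbing.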
02
Stage 2

Wavelet Saliency Maps via DWT

We apply a fast, non-trainable 1-level Discrete Wavelet Transform (Haar wavelet) to the noisy latent $z_t$ at each training step. This decomposes it into four subbands: LL (approximation), LH (horizontal HF), HL (vertical HF), and HH (diagonal HF).

We aggregate the detail subbands into a spatial energy map — bright regions are structurally rich (hair, foliage), dark regions are simple (sky, flat walls).

$$ E(i,j) = \frac{1}{C} \sum_c \left[ (z_{LH}^{c,i,j})^2 + (z_{HL}^{c,i,j})^2 + (z_{HH}^{c,i,j})^2 \right]. $$
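The DWT and energy map above are cheap enough to write out directly. The sketch below is a minimal NumPy version of a 1-level Haar transform on a (C, H, W) latent; note that LH/HL naming conventions vary between libraries, so treat the subband labels here as illustrative.

```python
import numpy as np

def haar_dwt2(z):
    """One-level 2-D Haar DWT over the last two axes of a (C, H, W) array.

    Returns (LL, LH, HL, HH), each at half spatial resolution.
    """
    # Rows: lowpass (sum) and highpass (difference), orthonormal scaling.
    lo = (z[..., 0::2, :] + z[..., 1::2, :]) / np.sqrt(2)
    hi = (z[..., 0::2, :] - z[..., 1::2, :]) / np.sqrt(2)
    # Columns: same filters along the last axis.
    ll = (lo[..., 0::2] + lo[..., 1::2]) / np.sqrt(2)
    lh = (lo[..., 0::2] - lo[..., 1::2]) / np.sqrt(2)
    hl = (hi[..., 0::2] + hi[..., 1::2]) / np.sqrt(2)
    hh = (hi[..., 0::2] - hi[..., 1::2]) / np.sqrt(2)
    return ll, lh, hl, hh

def energy_map(z):
    """E(i,j): squared detail-subband energy, averaged over channels."""
    _, lh, hl, hh = haar_dwt2(z)
    return (lh ** 2 + hl ** 2 + hh ** 2).mean(axis=0)

z = np.random.randn(4, 16, 16)  # toy latent: 4 channels, 16x16
E = energy_map(z)               # spatial energy map at half resolution
```

A constant latent yields an all-zero energy map, matching the intuition that flat regions carry no high-frequency energy.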
03
Stage 3

Time-Dependent Adaptive Mask $M_t$

High-frequency detail is the first thing destroyed by noise. At high timesteps (lots of noise), the saliency map is itself noisy — we must not rely on it for fine-grained masking. Our mask follows a curriculum:

High noise (high t): Mask less, forcing the model to learn global structure (LL band).
Low noise (low t): Mask aggressively, focusing only on the most salient HF regions. The threshold tightens as detail becomes recoverable.

$$ M_t(i,j) = \begin{cases} 1 & \text{if } T \cdot (A_{\text{wavelet}}(i,j) + \ell) \geq t \\ 0 & \text{otherwise} \end{cases}. $$
$$ \mathcal{L}_{\text{masked}} = \left\| M_t \odot \left[ (\epsilon - z_0) - v_\Theta(z_t, t, y) \right] \right\|_2^2. $$
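The mask rule and masked loss above can be sketched as follows. This is a toy NumPy version, not the paper's code: the offset `floor` stands in for $\ell$, the saliency map is min-max normalized to [0, 1] (an assumption, since the paper's exact normalization of $A_{\text{wavelet}}$ is not shown here), and the per-pixel averaging over unmasked entries is a simplification.

```python
import numpy as np

def adaptive_mask(energy, t, T=1000, floor=0.1):
    """Binary mask M_t(i,j) = 1 where T * (A(i,j) + floor) >= t.

    `floor` (standing in for the offset l) guarantees every region is
    supervised for at least the lowest-noise fraction of timesteps;
    higher-energy regions stay supervised deeper into the noise schedule.
    """
    a = (energy - energy.min()) / (np.ptp(energy) + 1e-8)  # A in [0, 1]
    return (T * (a + floor) >= t).astype(np.float32)

def masked_loss(target, pred, mask):
    """|| M_t * (target - pred) ||^2, averaged over unmasked entries."""
    diff = mask * (target - pred)
    return (diff ** 2).sum() / np.maximum(mask.sum(), 1.0)

energy = np.random.rand(8, 8)
m_low = adaptive_mask(energy, t=50)    # low noise: every pixel passes
m_high = adaptive_mask(energy, t=950)  # high noise: only the most salient
```

In a training loop, `target - pred` would be the velocity residual $(\epsilon - z_0) - v_\Theta(z_t, t, y)$ computed per latent pixel.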

Quantitative Results

LWD is model-agnostic: it improves perceptual quality (FID ↓, LPIPS ↓) and dramatically improves texture fidelity (GLCM ↑) on both diffusion and flow-matching architectures, across 2K and 4K resolutions.

| Model | FID ↓ | CLIPScore ↑ | Aesthetics ↑ | GLCM ↑ | Compression ↓ |
|---|---|---|---|---|---|
| SD3-F16 | 43.82 | 31.50 | 5.91 | 0.75 | 11.23 |
| SD3-Diff4k-F16 | 40.18 | 34.04 | 5.96 | 0.79 | 11.73 |
| LWD + SD3-F16 | 38.74 | 34.94 | 6.17 | 0.74 | 11.99 |
| PixArt-Sigma-XL | 39.13 | 35.02 | 6.43 | 0.79 | 13.66 |
| LWD + PixArt-Sigma-XL | 36.14 | 35.21 | 6.27 | 0.87 | 6.05 |
| Sana-1.6B | 32.06 | 35.28 | 6.15 | 0.93 | 24.01 |
| LWD + Sana-1.6B | 34.30 | 35.58 | 6.23 | 0.78 | 27.34 |

What Changes Visually

The improvements are most visible in detail-rich regions: hair, fabric, foliage, fine architecture. Baseline models exhibit spectral collapse — blurring and loss of sharp edges. LWD-enhanced models preserve the intricate structures.

Interactive Visual Comparison

Choose a model with the slider, then drag each image divider to compare baseline output against LWD-enhanced output.

[Interactive comparison: baseline vs. +LWD outputs for PixArt-Sigma-XL, SD3-Diff4k-F16, Sana, and URAE, with two samples (A and B) per model]
Baseline (e.g. SD3-F16)
Common Failure Modes
- Spectral collapse: high-frequency details lost; skin and fabric appear over-smoothed
- Texture tiling: repetitive patterns fill large areas unnaturally
- Spatial inconsistency: global structure (face symmetry, building lines) degrades at UHR
- Muddled colors and blurred outlines at 4K

LWD (Ours)
Demonstrated Improvements
- Sharp HF detail: eyelashes, skin pores, and intricate lattice structures remain crisp
- Natural textures: no tiling; foliage and hair render with spatial variety
- Perceptual coherence: improved FID and LPIPS confirm a better global distribution match
- Vibrant, separated color panels and sharp architectural edges at 4K

Cite This Work

If you find LWD useful for your research, please consider citing our paper.

@inproceedings{sigillo2026latent,
  title={Latent Wavelet Diffusion For Ultra High-Resolution Image Synthesis},
  author={Luigi Sigillo and Shengfeng He and Danilo Comminiello},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=5og80LMVxG}
}