FRAMER unlocks the high-frequency potential of diffusion models for Real-World Super-Resolution without altering inference. We demonstrate superior performance across both DiT (FRAMER_D) and U-Net (FRAMER_U) backbones by addressing the "low-first, high-later" frequency hierarchy.
Restores intricate fur textures and fine details, overcoming the smoothing artifacts typical of diffusion models.
Effectively sharpens natural landscapes and starry skies, demonstrating robust generalization on the U-Net architecture.
Real-world image super-resolution (Real-ISR) seeks to recover high-resolution (HR) images from low-resolution (LR) inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise "low-first, high-later" hierarchy.
We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model's internal frequency hierarchy.
For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones, FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ).
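The FFT-based band decomposition underlying this supervision can be illustrated with a minimal sketch. The circular mask and the `cutoff` radius below are hypothetical choices for illustration; the paper's exact mask design is not specified here.

```python
import numpy as np

def split_bands(feat, cutoff=0.25):
    """Split a 2-D feature map into LF/HF bands via a circular FFT mask.
    `cutoff` is the LF radius as a fraction of normalized frequency
    (an assumed value, not necessarily the paper's)."""
    h, w = feat.shape
    spec = np.fft.fftshift(np.fft.fft2(feat))
    yy, xx = np.mgrid[:h, :w]
    radius = np.sqrt(((yy - h // 2) / h) ** 2 + ((xx - w // 2) / w) ** 2)
    lf_mask = (radius <= cutoff).astype(feat.dtype)
    # Complementary masks, so the two bands sum back to the input.
    lf = np.fft.ifft2(np.fft.ifftshift(spec * lf_mask)).real
    hf = np.fft.ifft2(np.fft.ifftshift(spec * (1.0 - lf_mask))).real
    return lf, hf

feat = np.random.default_rng(0).standard_normal((32, 32))
lf, hf = split_bands(feat)
```

Because the masks are complementary, `lf + hf` reconstructs the input, so the decomposition discards no information before supervision is applied per band.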
Figure 2. Band-wise magnitude densities showing the inherent LF dominance (bias) in natural images.
Figure 3. Layer-wise cosine similarity revealing the "low-first, high-later" frequency hierarchy.
We trace the limitations of current diffusion models in Real-ISR to a fundamental Low-Frequency (LF) Bias, evidenced by two key observations:
Natural image frequency distributions are inherently LF-dominant. The standard noise-prediction loss thus favors these dominant LF components, inevitably undertraining HF signals.
An analysis of layer-wise feature maps reveals that LF features stabilize early in the network, while HF features converge only near the final layers. A conventional loss supplies redundant gradients to early layers while starving the later, HF-refining layers.
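The first observation, LF dominance of natural-image spectra, can be checked with a small sketch. A smoothed noise field stands in for a natural image here (an assumption for self-containment; real images follow a similar roughly 1/f^2 power law), and we measure the share of spectral energy inside a low-frequency disk.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64))
# Double cumulative sum turns white noise into a Brownian-like surface
# whose spectrum, like natural images, decays roughly as 1/f^2.
img = np.cumsum(np.cumsum(noise, axis=0), axis=1)

spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
h, w = img.shape
yy, xx = np.mgrid[:h, :w]
radius = np.sqrt(((yy - h // 2) / h) ** 2 + ((xx - w // 2) / w) ** 2)

# The LF disk (radius <= 0.25) covers only ~20% of the frequency plane
# yet captures the overwhelming majority of the energy.
lf_share = spec[radius <= 0.25].sum() / spec.sum()
print(f"LF share of spectral energy: {lf_share:.3f}")
```

A pixel-space reconstruction loss weighted by this energy distribution therefore spends most of its gradient budget on the LF band, which is the under-training mechanism described above.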
Figure 4. Overview of FRAMER. The framework applies self-distillation from the final-layer teacher to intermediate student layers. We decompose teacher/student features into LF/HF bands via FFT masks. The key components are:
Intra Contrastive Loss stabilizes globally shared structures. It compares a student only against its teacher and a randomly sampled layer within the same network (no in-batch negatives), preventing false negatives common in batch-based contrastive learning.
Inter Contrastive Loss sharpens instance-specific details. It targets HF bands using both random-layer negatives (for layer progression) and in-batch negatives (for instance discrimination), counteracting the LF bias.
Frequency-based Adaptive Weight decomposes self-distillation across depth and frequency. It reweights supervision based on the actual layer-wise change rate relative to the final layer, mitigating scale-induced spectral bias.
Frequency-based Alignment Modulation gates the distillation strength based on student-teacher alignment. It suppresses large, unstable gradients in early layers when alignment is low, preventing early training collapse.
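The interaction between the HF contrastive objective and the alignment gate can be sketched as follows. The InfoNCE form with in-batch negatives and the cosine-similarity gate are generic stand-ins; the paper's exact loss terms, temperature, and gating function may differ, and `tau` is an assumed hyperparameter.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity along the last axis (supports broadcasting)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(-1)

def info_nce(student, teacher, tau=0.1):
    """InfoNCE with in-batch negatives: each student feature attracts its
    own teacher feature and repels the other instances in the batch."""
    logits = cos_sim(student[:, None, :], teacher[None, :, :]) / tau  # (B, B)
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    idx = np.arange(len(student))
    return -log_probs[idx, idx].mean()

def alignment_gate(student, teacher):
    """FAM-style gate (assumed form): scale distillation by current
    student-teacher alignment, damping gradients for poorly aligned
    (typically early) layers."""
    return np.clip(cos_sim(student, teacher).mean(), 0.0, 1.0)

rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 16))          # final-layer HF features
student = teacher + 0.1 * rng.standard_normal((4, 16))  # intermediate layer
loss = alignment_gate(student, teacher) * info_nce(student, teacher)
print(f"gated InterCL-style loss: {loss:.4f}")
```

When the student is close to the teacher, the gate approaches 1 and the contrastive term is near its minimum; when alignment is poor, the gate shrinks the whole objective, which is the early-collapse protection FAM provides.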
Table 1. Quantitative comparison of real-world image super-resolution methods. We evaluate fidelity metrics (PSNR, SSIM, LPIPS) as well as perceptual quality metrics (NIQE, MANIQA, MUSIQ) across multiple datasets, including DrealSR, RealSR, RealLR200, and RealLQ250. Overall, our method achieves competitive fidelity and generally superior perceptual quality across datasets.
Detailed visual comparisons against state-of-the-art methods.
Figure 10. Qualitative comparisons on datasets with Ground Truth (DrealSR, RealSR). We compare FRAMER against state-of-the-art methods (SwinIR, ResShift, SeeSR, PiSA-SR, DreamClear, DiT4SR). Red arrows indicate structural errors (e.g., hallucinations, object distortion), while Yellow arrows point to textural defects. FRAMER consistently mitigates these artifacts, producing sharper edges and faithful textures.
Figure 11. Qualitative comparisons on datasets without Ground Truth (RealLR200, RealLQ250). In these real-world scenarios with unknown degradations, baseline methods often suffer from severe artifacts (red arrows mark structural failures; yellow arrows mark textural anomalies). FRAMER demonstrates superior perceptual quality by effectively balancing noise suppression with detail generation.
@article{choi2025framer,
title={FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution},
author={Choi, Seungho and Sung, Jeahun and Oh, Jihyong},
journal={arXiv preprint arXiv:2512.01390},
year={2025}
}