
FRAMER

Frequency-Aligned Self-Distillation with Adaptive Modulation
Leveraging Diffusion Priors for Real-World Image Super-Resolution

Chung-Ang University
Creative Vision and Multimedia Lab (CMLab)

TL;DR

FRAMER unlocks the high-frequency potential of diffusion models for real-world super-resolution without altering inference. We demonstrate superior performance across both DiT (FRAMER_D) and U-Net (FRAMER_U) backbones by addressing the "low-first, high-later" frequency hierarchy.

Restoration Capability

FRAMER_D (DiT Backbone)

[Interactive comparison slider: LR input vs. FRAMER_D restoration]

Restores intricate fur textures and fine details, overcoming the smoothing artifacts typical of diffusion models.

FRAMER_U (U-Net Backbone)

[Interactive comparison slider: LR input vs. FRAMER_U restoration]

Effectively sharpens natural landscapes and starry skies, demonstrating robust generalization on the U-Net architecture.


Abstract

Real-world image super-resolution (Real-ISR) seeks to recover high-resolution (HR) images from low-resolution (LR) inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise "low-first, high-later" hierarchy.

We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model's internal frequency hierarchy.

For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones, FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ).

Motivation & Analysis


Figure 2. Band-wise magnitude densities showing the inherent LF dominance (bias) in natural images.


Figure 3. Layer-wise cosine similarity revealing the "low-first, high-later" frequency hierarchy.

Why Do Diffusion Models Struggle?

We trace the limitations of current diffusion models in Real-ISR to a fundamental Low-Frequency (LF) Bias, evidenced by two key observations:

1. Spectral Bias (Figure 2)

Natural image frequency distributions are inherently LF-dominant. The standard noise-prediction loss thus favors these dominant LF components, inevitably undertraining HF signals.

2. Depth-wise Hierarchy (Figure 3)

An analysis of layer-wise feature maps reveals that LF features stabilize early in the network, while HF features converge only near the final layers. A conventional loss supplies redundant gradients to early layers while starving the later, HF-refining layers.
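To make this analysis concrete, here is a minimal PyTorch sketch of its two ingredients: the circular FFT-mask band split (also used by the method below) and the per-layer cosine similarity to the final layer. The cutoff radius r, the (B, C, H, W) layout, and the assumption that all layers are resized to a common shape are ours, not the paper's.

import torch
import torch.nn.functional as F

def split_bands(feat, r=0.25):
    # Split (B, C, H, W) features into LF/HF parts via a circular FFT mask.
    # `r` is an assumed cutoff radius, as a fraction of the half-spectrum.
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat, norm="ortho"), dim=(-2, -1))
    yy = torch.arange(H, device=feat.device).float().view(-1, 1) - H // 2
    xx = torch.arange(W, device=feat.device).float().view(1, -1) - W // 2
    lf_mask = ((yy ** 2 + xx ** 2).sqrt() <= r * min(H, W) / 2).float()
    def inv(s):
        return torch.fft.ifft2(torch.fft.ifftshift(s, dim=(-2, -1)), norm="ortho").real
    return inv(spec * lf_mask), inv(spec * (1.0 - lf_mask))

@torch.no_grad()
def layerwise_band_similarity(features):
    # features: per-layer maps resized/projected to a common (B, C, H, W);
    # the last entry is the final layer, used as the reference.
    t_lf, t_hf = split_bands(features[-1])
    sims = []
    for f in features[:-1]:
        s_lf, s_hf = split_bands(f)
        sims.append((
            F.cosine_similarity(s_lf.flatten(1), t_lf.flatten(1), dim=1).mean().item(),
            F.cosine_similarity(s_hf.flatten(1), t_hf.flatten(1), dim=1).mean().item(),
        ))
    return sims  # LF similarity saturates early; HF rises only near the end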

Method Overview


Figure 4. Overview of FRAMER. The framework applies self-distillation from the final-layer teacher to intermediate student layers. We decompose teacher/student features into LF/HF bands via FFT masks. The key components are:

IntraCL (LF)

Intra Contrastive Loss stabilizes globally shared structures. It compares a student only against its teacher and a randomly sampled layer within the same network (no in-batch negatives), preventing false negatives common in batch-based contrastive learning.
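A hedged InfoNCE-style sketch of what IntraCL could look like under this description: the teacher's LF band is the only positive, and a single randomly sampled layer is the only negative. The temperature tau and the pooled (B, D) feature layout (e.g., split_bands output after spatial pooling) are assumptions.

import random
import torch
import torch.nn.functional as F

def intra_cl(student_lf, teacher_lf, all_layers_lf, layer_idx, tau=0.1):
    # student_lf / teacher_lf: (B, D) pooled LF features of one layer.
    # all_layers_lf: list of (B, D) LF features, one per layer.
    neg_idx = random.choice([i for i in range(len(all_layers_lf)) if i != layer_idx])
    neg_lf = all_layers_lf[neg_idx].detach()

    pos = F.cosine_similarity(student_lf, teacher_lf.detach(), dim=1) / tau  # (B,)
    neg = F.cosine_similarity(student_lf, neg_lf, dim=1) / tau               # (B,)

    # InfoNCE with the teacher at index 0; no in-batch negatives by design.
    logits = torch.stack([pos, neg], dim=1)  # (B, 2)
    labels = torch.zeros(student_lf.size(0), dtype=torch.long, device=student_lf.device)
    return F.cross_entropy(logits, labels)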

InterCL (HF)

Inter Contrastive Loss sharpens instance-specific details. It targets HF bands using both random-layer negatives (for layer progression) and in-batch negatives (for instance discrimination), counteracting the LF bias.
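Analogously, a sketch of InterCL on the HF band: the teacher remains the positive, while both a randomly sampled layer and the other samples in the batch act as negatives. Again, tau, the pooled (B, D) layout, and the L2 normalization are assumptions.

import random
import torch
import torch.nn.functional as F

def inter_cl(student_hf, teacher_hf, all_layers_hf, layer_idx, tau=0.1):
    # student_hf / teacher_hf: (B, D) pooled HF features of one layer.
    B = student_hf.size(0)
    s = F.normalize(student_hf, dim=1)
    t = F.normalize(teacher_hf.detach(), dim=1)

    neg_idx = random.choice([i for i in range(len(all_layers_hf)) if i != layer_idx])
    r = F.normalize(all_layers_hf[neg_idx].detach(), dim=1)

    pos = (s * t).sum(dim=1, keepdim=True) / tau        # (B, 1) matched pairs
    batch_neg = (s @ t.T) / tau                          # (B, B) in-batch negatives
    batch_neg.fill_diagonal_(float("-inf"))              # exclude the positive pair
    layer_neg = (s * r).sum(dim=1, keepdim=True) / tau   # (B, 1) random-layer negative

    logits = torch.cat([pos, batch_neg, layer_neg], dim=1)  # (B, B + 2)
    labels = torch.zeros(B, dtype=torch.long, device=s.device)
    return F.cross_entropy(logits, labels)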

FAW

Frequency-based Adaptive Weight decomposes self-distillation across depth and frequency. It reweights supervision based on the actual layer-wise change rate relative to the final layer, mitigating scale-induced spectral bias.
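One plausible instantiation of FAW, sketched below: per-layer weights proportional to how much each layer's band still differs from the final layer, normalized across depth. The norm-based change measure is our assumption; the paper's exact rule may differ.

import torch

@torch.no_grad()
def faw_weights(band_feats, eps=1e-8):
    # band_feats: per-layer (B, D) features of one band; last entry = final layer.
    # Returns (L-1,) weights over the intermediate layers.
    final = band_feats[-1]
    change = torch.stack([(f - final).norm(dim=1).mean() for f in band_feats[:-1]])
    return change / (change.sum() + eps)  # layers that still change most get more weight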

FAM

Frequency-based Alignment Modulation gates the distillation strength based on student-teacher alignment. It suppresses large, unstable gradients in early layers when alignment is low, preventing early training collapse.
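Finally, a hedged sketch of FAM as a detached similarity gate: layers whose features are still poorly aligned with the teacher receive damped supervision. Clamping to [0, 1] is an assumption.

import torch
import torch.nn.functional as F

def fam_gate(student, teacher):
    # student / teacher: (B, D) features of one layer and band. Detached on
    # both sides so the gate scales, but does not itself receive, gradients.
    sim = F.cosine_similarity(student.detach(), teacher.detach(), dim=1)  # (B,)
    return sim.clamp(min=0.0)  # low alignment => weak gate => small gradients

Under these sketches, a per-layer, per-band distillation term would be scaled roughly as faw_weights(...)[l] * (fam_gate(s_l, t) * loss_l).mean(), summed over layers and bands, which matches the description above of depth/frequency reweighting (FAW) plus similarity gating (FAM), though the paper's exact composition may differ.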

Quantitative Results


Table 1. Quantitative comparison of real-world image super-resolution methods. We evaluate full-reference metrics (PSNR, SSIM, LPIPS) as well as no-reference perceptual quality metrics (NIQE, MANIQA, MUSIQ) across multiple datasets, including DRealSR, RealSR, RealLR200, and RealLQ250. Overall, our method demonstrates competitive fidelity and generally superior perceptual quality.

Qualitative Comparisons

Detailed visual comparisons against state-of-the-art methods.

Datasets with Ground Truth (DRealSR, RealSR)


Figure 10. Qualitative comparisons on datasets with Ground Truth (DRealSR, RealSR). We compare FRAMER against state-of-the-art methods (SwinIR, ResShift, SeeSR, PiSA-SR, DreamClear, DiT4SR). Red arrows indicate structural errors (e.g., hallucinations, object distortion), while yellow arrows point to textural defects. FRAMER consistently mitigates these artifacts, producing sharper edges and faithful textures.

Real-World "In-the-Wild" (RealLR200, RealLQ250 Dataset)


Figure 11. Qualitative comparisons on datasets without Ground Truth (RealLR200, RealLQ250). In these real-world scenarios with unknown degradations, baseline methods often produce severe artifacts (red arrows mark structural failures; yellow arrows mark textural anomalies). FRAMER demonstrates superior perceptual quality by effectively balancing noise suppression with detail generation.

BibTeX

@article{choi2025framer,
  title={FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution},
  author={Choi, Seungho and Sung, Jeahun and Oh, Jihyong},
  journal={arXiv preprint arXiv:2512.01390},
  year={2025}
}