
A vision-language agent that performs all-in-one restoration by scheduling experts in the right order, trained from a learned simulator at O(N) cost instead of O(N²).
Chung-Ang University
Creative Vision and Multimedia Lab (CMLab)
†Corresponding Author
Real images suffer multiple degradations at once, and the order they are removed changes the final quality. DiTTo casts restoration as sequential expert scheduling by a VLM. Its Simulator (∪S-IR + AiO-IQA) builds the optimal-trajectory dataset with only O(N) simulator steps, and its Agent learns to plan via SFT plus Order-aware Restoration Alignment (ORA), enabling plug-and-play extension to new experts by updating only the lightweight ORA stage.
Recovers structure and contrast lost under compounded haze and darkness without amplifying residual noise.
Clears snow occlusion first, then deblurs, preserving fine edges that a wrong order would smear away.
Scroll wheel to zoom · drag the divider to compare
Screen recordings of DiTTo perceiving degradations, planning an order, and invoking experts step by step.
DiTTo perceives the degradations, plans an order, and emits structured tool calls. Each call invokes one expert and returns an intermediate state, and the image gets progressively cleaner at every step.
I can identify the following degradations present in the image: sensor noise, defocus blur, and snow.
OR · Order-aware RestorationPlanning the order by reasoning over frequency-domain interactions: remove sensor noise first (deblurring beforehand would sharpen the noise into the structure), then resolve defocus blur, and finally clear the snow occlusion.
Tool · Structured Call
Applying the actions in a sub-optimal order yields measurably lower quality at intermediate states, and early errors propagate to the final output, which is exactly what order-aware planning avoids.
De-fogging before de-raining can alter the apparent rain distribution; enhancing low-light before de-noising amplifies noise. The same degradation set can land at very different IQA depending on removal order.
With many degradations and many experts per type, the valid orderings explode. A VLM that reasons and emits structured tool calls is a natural fit for sequential expert scheduling.
Prior training-based agents need O(N²) real expert calls to build supervision, and re-generate everything when a new expert is added. DiTTo removes this coupling.
A single-degradation restoration simulator that cheaply approximates heterogeneous experts via action-conditioned clean/degraded feature mixing with adaptive frequency-band gating that removes the target degradation while preserving the rest.
An all-in-one scoring model that predicts per-action next-state quality directly from the current state and trajectory, picking the highest-scoring action so the whole ORTD trajectory unrolls in O(N) steps.
The VLM is fine-tuned on simulator-generated ORTD as multi-turn tool-use conversations, acquiring degradation perception, order-aware planning, and JSON tool-call formatting.
A DPO-style alignment that computes preference margins over decomposed planning axes (DP / OR / Tool) on a small expert-executed subset, closing the simulator-to-expert gap without diluting the signal across shared template tokens.
Adding a new expert reuses ∪S-IR, AiO-IQA and the SFT checkpoint, updating only the efficient ORA stage.










DiTTo removes mixed degradations more thoroughly while preserving natural textures and semantic detail. ★DiTTo uses an extended expert pool to show plug-and-play scalability.
Reported on the final restored state. DiTTo uses the same expert pool as JarvisIR; ★DiTTo uses the extended pool.
| Method | MUSIQ ↑ | MANIQA ↑ | CLIP-IQA+ ↑ | NIQE ↓ |
|---|---|---|---|---|
| All-in-One methods · 3 degradations | ||||
| MiOIR | 52.41 | 0.2815 | 0.4040 | 6.075 |
| AutoDIR | 52.29 | 0.3184 | 0.3985 | 8.537 |
| Agent-based methods · 3 degradations | ||||
| AgenticIR | 61.20 | 0.4585 | 0.6010 | 6.587 |
| 4KAgent | 65.40 | 0.5025 | 0.6555 | 6.121 |
| JarvisIR | 67.54 | 0.5331 | 0.6845 | 5.862 |
| DiTTo | 67.09 | 0.5823 | 0.7126 | 5.962 |
| ★DiTTo | 70.65 | 0.6855 | 0.8101 | 5.773 |
| Agent-based methods · 5 degradations | ||||
| 4KAgent | 69.10 | 0.5818 | 0.7355 | 5.547 |
| JarvisIR | 71.36 | 0.6191 | 0.7677 | 5.292 |
| DiTTo | 69.34 | 0.6270 | 0.7690 | 5.224 |
| ★DiTTo | 72.27 | 0.7241 | 0.8509 | 5.188 |
| Stage | JarvisIR | DiTTo | Speedup |
|---|---|---|---|
| Data generation | ~460 | ~10 | ~45× |
| SFT | ~37 | ~37 | 1× |
| Alignment | ~410 | ~11 | ~37× |
| End-to-end | ~870 | ~60 | ~15× |
@inproceedings{ditto2026,
title = {DiTTo: Scalable Order-aware All-in-One Image Restoration Agent},
author = {Choi, Seungho and Oh, Jihyong},
year = {2026}
}