DiTTo

Scalable Order-aware All-in-One
Image Restoration Agent

A vision-language agent that performs all-in-one restoration by scheduling experts in the right order, trained from a learned simulator at O(N) cost instead of O(N²).

Seungho Choi, Jihyong Oh^†

Chung-Ang University

Creative Vision and Multimedia Lab (CMLab)

^†Corresponding Author

arXiv Paper Code

O(N)

ORTD generation cost
(vs. O(N²) for prior agents)

~15×

faster end-to-end adaptation
to a new expert

SOTA

multi-degradation quality
on MiO-100

TL;DR

Real images suffer multiple degradations at once, and the order they are removed changes the final quality. DiTTo casts restoration as sequential expert scheduling by a VLM. Its Simulator (∪S-IR + AiO-IQA) builds the optimal-trajectory dataset with only O(N) simulator steps, and its Agent learns to plan via SFT plus Order-aware Restoration Alignment (ORA), enabling plug-and-play extension to new experts by updating only the lightweight ORA stage.

Restoration Capability

Drag to compare · scroll to zoom

Fog + Low-light + Noise

Recovers structure and contrast lost under compounded haze and darkness without amplifying residual noise.

Snow + Blur

Clears snow occlusion first, then deblurs, preserving fine edges that a wrong order would smear away.

Scroll wheel to zoom · drag the divider to compare

Watch it run

Demo Videos

Screen recordings of DiTTo perceiving degradations, planning an order, and invoking experts step by step.

Interactive restoration walk-through The agent assesses a multi-degraded image and schedules experts in real time.

Plug-and-play expert extension Adding a new expert and re-planning, with only the ORA stage updated.

How the agent thinks

Reasoning & Step-by-Step Restoration

DiTTo perceives the degradations, plans an order, and emits structured tool calls. Each call invokes one expert and returns an intermediate state, and the image gets progressively cleaner at every step.

User

Can you clean up this image? Assess what's wrong, then plan the restoration order and the expert to use at each step.

DiTTo Agent

DP · Degradation Perception

I can identify the following degradations present in the image: sensor noise, defocus blur, and snow.

OR · Order-aware Restoration

Planning the order by reasoning over frequency-domain interactions: remove sensor noise first (deblurring beforehand would sharpen the noise into the structure), then resolve defocus blur, and finally clear the snow occlusion.

Tool · Structured Call

[ {"action":"sensor_noise","model":"scunet"},
{"action":"defocus_blur","model":"drbnet"},
{"action":"snow","model":"snowmaster"} ]

Observed input

noise · defocus blur · snow

① De-noise

{"action":"sensor_noise","model":"scunet"}

② De-blur

{"action":"defocus_blur","model":"drbnet"}

③ De-snow → Output

{"action":"snow","model":"snowmaster"}

Applying the actions in a sub-optimal order yields measurably lower quality at intermediate states, and early errors propagate to the final output, which is exactly what order-aware planning avoids.

Motivation

Why Order & Why an Agent

Order changes quality

De-fogging before de-raining can alter the apparent rain distribution; enhancing low-light before de-noising amplifies noise. The same degradation set can land at very different IQA depending on removal order.

Combinatorial planning

With many degradations and many experts per type, the valid orderings explode. A VLM that reasons and emits structured tool calls is a natural fit for sequential expert scheduling.

The cost bottleneck

Prior training-based agents need O(N²) real expert calls to build supervision, and re-generate everything when a new expert is added. DiTTo removes this coupling.

Framework

Simulator + Agent

∪S-IR Simulator

A single-degradation restoration simulator that cheaply approximates heterogeneous experts via action-conditioned clean/degraded feature mixing with adaptive frequency-band gating that removes the target degradation while preserving the rest.

AiO-IQA Simulator

An all-in-one scoring model that predicts per-action next-state quality directly from the current state and trajectory, picking the highest-scoring action so the whole ORTD trajectory unrolls in O(N) steps.

Stage 1 · SFT Agent

The VLM is fine-tuned on simulator-generated ORTD as multi-turn tool-use conversations, acquiring degradation perception, order-aware planning, and JSON tool-call formatting.

Stage 2 · ORA Agent

A DPO-style alignment that computes preference margins over decomposed planning axes (DP / OR / Tool) on a small expert-executed subset, closing the simulator-to-expert gap without diluting the signal across shared template tokens.

This is a high-level overview. For the full architectural design, training objectives, and implementation details, please refer to the paper.

Adding a new expert reuses ∪S-IR, AiO-IQA and the SFT checkpoint, updating only the efficient ORA stage.

Qualitative Comparison

Against Prior Agents

Input

4KAgent

JarvisIR

DiTTo

★DiTTo

Input

4KAgent

JarvisIR

DiTTo

★DiTTo

DiTTo removes mixed degradations more thoroughly while preserving natural textures and semantic detail. ★DiTTo uses an extended expert pool to show plug-and-play scalability.

Quantitative Results

MiO-100 · No-reference IQA

Reported on the final restored state. DiTTo uses the same expert pool as JarvisIR; ★DiTTo uses the extended pool.

Method	MUSIQ ↑	MANIQA ↑	CLIP-IQA+ ↑	NIQE ↓
All-in-One methods · 3 degradations
MiOIR	52.41	0.2815	0.4040	6.075
AutoDIR	52.29	0.3184	0.3985	8.537
Agent-based methods · 3 degradations
AgenticIR	61.20	0.4585	0.6010	6.587
4KAgent	65.40	0.5025	0.6555	6.121
JarvisIR	67.54	0.5331	0.6845	5.862
DiTTo	67.09	0.5823	0.7126	5.962
★DiTTo	70.65	0.6855	0.8101	5.773
Agent-based methods · 5 degradations
4KAgent	69.10	0.5818	0.7355	5.547
JarvisIR	71.36	0.6191	0.7677	5.292
DiTTo	69.34	0.6270	0.7690	5.224
★DiTTo	72.27	0.7241	0.8509	5.188

Per-stage adaptation cost (hours, 2×B200)

Stage	JarvisIR	DiTTo	Speedup
Data generation	~460	~10	~45×
SFT	~37	~37	1×
Alignment	~410	~11	~37×
End-to-end	~870	~60	~15×

Cite

BibTeX

@article{choi2026ditto,
  title         = {DiTTo: Scalable Order-aware All-in-One Image Restoration Agent},
  author        = {Choi, Seungho and Oh, Jihyong},
  journal       = {arXiv preprint arXiv:2605.30915},
  year          = {2026},
  eprint        = {2605.30915},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}