DiTTo mascot

DiTTo

Scalable Order-aware All-in-One
Image Restoration Agent

A vision-language agent that performs all-in-one restoration by scheduling experts in the right order, trained from a learned simulator at O(N) cost instead of O(N²).

CAU Chung-Ang University
CMLab Creative Vision and Multimedia Lab (CMLab)

Corresponding Author

O(N)
ORTD generation cost
(vs. O(N²) for prior agents)
~15×
faster end-to-end adaptation
to a new expert
SOTA
multi-degradation quality
on MiO-100

TL;DR

Real images suffer multiple degradations at once, and the order they are removed changes the final quality. DiTTo casts restoration as sequential expert scheduling by a VLM. Its Simulator (∪S-IR + AiO-IQA) builds the optimal-trajectory dataset with only O(N) simulator steps, and its Agent learns to plan via SFT plus Order-aware Restoration Alignment (ORA), enabling plug-and-play extension to new experts by updating only the lightweight ORA stage.

Restoration Capability

Drag to compare · scroll to zoom

Fog + Low-light + Noise

degraded input
Input
DiTTo restored
DiTTo

Recovers structure and contrast lost under compounded haze and darkness without amplifying residual noise.

Snow + Blur

degraded input
Input
DiTTo restored
DiTTo

Clears snow occlusion first, then deblurs, preserving fine edges that a wrong order would smear away.

Scroll wheel to zoom · drag the divider to compare

Watch it run

Demo Videos

Screen recordings of DiTTo perceiving degradations, planning an order, and invoking experts step by step.

Interactive restoration walk-through The agent assesses a multi-degraded image and schedules experts in real time.
Plug-and-play expert extension Adding a new expert and re-planning, with only the ORA stage updated.
How the agent thinks

Reasoning & Step-by-Step Restoration

DiTTo perceives the degradations, plans an order, and emits structured tool calls. Each call invokes one expert and returns an intermediate state, and the image gets progressively cleaner at every step.

User
Can you clean up this image? Assess what's wrong, then plan the restoration order and the expert to use at each step.
DiTTo
DiTTo Agent
DP · Degradation Perception

I can identify the following degradations present in the image: sensor noise, defocus blur, and snow.

OR · Order-aware Restoration

Planning the order by reasoning over frequency-domain interactions: remove sensor noise first (deblurring beforehand would sharpen the noise into the structure), then resolve defocus blur, and finally clear the snow occlusion.

Tool · Structured Call
[ {"action":"sensor_noise","model":"scunet"},
  {"action":"defocus_blur","model":"drbnet"},
  {"action":"snow","model":"snowmaster"} ]
input state
Observed input
noise · defocus blur · snow
step 1
① De-noise
{"action":"sensor_noise","model":"scunet"}
step 2
② De-blur
{"action":"defocus_blur","model":"drbnet"}
final
③ De-snow → Output
{"action":"snow","model":"snowmaster"}

Applying the actions in a sub-optimal order yields measurably lower quality at intermediate states, and early errors propagate to the final output, which is exactly what order-aware planning avoids.

Motivation

Why Order & Why an Agent

Order changes quality

De-fogging before de-raining can alter the apparent rain distribution; enhancing low-light before de-noising amplifies noise. The same degradation set can land at very different IQA depending on removal order.

Combinatorial planning

With many degradations and many experts per type, the valid orderings explode. A VLM that reasons and emits structured tool calls is a natural fit for sequential expert scheduling.

The cost bottleneck

Prior training-based agents need O(N²) real expert calls to build supervision, and re-generate everything when a new expert is added. DiTTo removes this coupling.

Framework

Simulator + Agent

∪S-IR Simulator

A single-degradation restoration simulator that cheaply approximates heterogeneous experts via action-conditioned clean/degraded feature mixing with adaptive frequency-band gating that removes the target degradation while preserving the rest.

AiO-IQA Simulator

An all-in-one scoring model that predicts per-action next-state quality directly from the current state and trajectory, picking the highest-scoring action so the whole ORTD trajectory unrolls in O(N) steps.

Stage 1 · SFT Agent

The VLM is fine-tuned on simulator-generated ORTD as multi-turn tool-use conversations, acquiring degradation perception, order-aware planning, and JSON tool-call formatting.

Stage 2 · ORA Agent

A DPO-style alignment that computes preference margins over decomposed planning axes (DP / OR / Tool) on a small expert-executed subset, closing the simulator-to-expert gap without diluting the signal across shared template tokens.

This is a high-level overview. For the full architectural design, training objectives, and implementation details, please refer to the paper.

Adding a new expert reuses ∪S-IR, AiO-IQA and the SFT checkpoint, updating only the efficient ORA stage.

Qualitative Comparison

Against Prior Agents

input
Input
4kagent
4KAgent
jarvisir
JarvisIR
ditto
DiTTo
star ditto
★DiTTo
input
Input
4kagent
4KAgent
jarvisir
JarvisIR
ditto
DiTTo
star ditto
★DiTTo

DiTTo removes mixed degradations more thoroughly while preserving natural textures and semantic detail. ★DiTTo uses an extended expert pool to show plug-and-play scalability.

Quantitative Results

MiO-100 · No-reference IQA

Reported on the final restored state. DiTTo uses the same expert pool as JarvisIR; ★DiTTo uses the extended pool.

MethodMUSIQ ↑MANIQA ↑CLIP-IQA+ ↑NIQE ↓
All-in-One methods · 3 degradations
MiOIR52.410.28150.40406.075
AutoDIR52.290.31840.39858.537
Agent-based methods · 3 degradations
AgenticIR61.200.45850.60106.587
4KAgent65.400.50250.65556.121
JarvisIR67.540.53310.68455.862
DiTTo67.090.58230.71265.962
★DiTTo70.650.68550.81015.773
Agent-based methods · 5 degradations
4KAgent69.100.58180.73555.547
JarvisIR71.360.61910.76775.292
DiTTo69.340.62700.76905.224
★DiTTo72.270.72410.85095.188

Per-stage adaptation cost (hours, 2×B200)

StageJarvisIRDiTToSpeedup
Data generation~460~10~45×
SFT~37~37
Alignment~410~11~37×
End-to-end~870~60~15×
Cite

BibTeX

@inproceedings{ditto2026,
  title     = {DiTTo: Scalable Order-aware All-in-One Image Restoration Agent},
  author    = {Choi, Seungho and Oh, Jihyong},
  year      = {2026}
}