CHIMERA: Adaptive CacHe Injection and SeMantic Anchor Prompting for ZERo-shot ImAge Morphing with Morphing-oriented Metrics

arXiv 2025
*equal contribution, corresponding author; 1Chung-Ang University, CMLab; 2Princeton University
{rpekgus, jhseong, jihyongoh}@cau.ac.kr
mj7341@princeton.edu

TL;DR

CHIMERA enables smooth and semantically consistent zero-shot image morphing through Adaptive Cache Injection (ACI) and Semantic Anchor Prompting (SAP), along with GLCS, a new morphing-oriented metric for evaluating transition quality.

Results of CHIMERA

Ten examples, each with Image A (top), Image B (bottom), and the resulting morphing images.
Figure 1. Qualitative results of our method.

Abstract

Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances because they lack adaptive structural and semantic alignment. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion–guided denoising process. To handle large semantic and appearance disparities, we introduce Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches features from the down, mid, and up blocks for both inputs during DDIM inversion and re-injects them during denoising in a depth- and timestep-adaptive manner, enabling natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision–language model to generate a shared anchor prompt that bridges dissimilar inputs and guides the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.

Motivation & Observation


Frequency analysis showing LF bias
Figure 2. Frequency analysis of each feature in the diffusion U-Net and across different denoising timesteps.

Frequency analysis of the diffusion U-Net and the denoising timesteps


Within the diffusion U-Net, features in the mid layers tend to carry more low-frequency information, while features in the up layers carry more high-frequency information. Likewise, early denoising timesteps mainly encode low-frequency information, and late timesteps contain more high-frequency information. Based on these properties, ACI injects the cached features whose frequency characteristics match each denoising timestep.
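One way to make this kind of observation concrete is to measure how much of a feature map's spectral energy lies near the zero frequency. The sketch below is only an illustration under stated assumptions: the function name `low_frequency_ratio`, the circular low-pass mask, and the `cutoff` value are hypothetical choices, not the analysis procedure used for Figure 2.

```python
import numpy as np

def low_frequency_ratio(feature, cutoff=0.25):
    """Fraction of spectral energy inside a low-frequency disc.

    feature: 2D array (H, W), e.g. one channel of a U-Net feature map.
    cutoff:  disc radius as a fraction of the Nyquist frequency
             (hypothetical choice, not taken from the paper).
    """
    h, w = feature.shape
    # Centered 2D power spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(feature))
    energy = np.abs(spectrum) ** 2
    # Matching centered frequency axes in cycles per sample, [-0.5, 0.5).
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    # The Nyquist frequency is 0.5 cycles per sample.
    low_mask = radius <= cutoff * 0.5
    total = energy.sum()
    return float(energy[low_mask].sum() / total) if total > 0 else 0.0

# Toy inputs: a smooth bump (mostly low frequency) and a
# zero-mean checkerboard (energy concentrated at the Nyquist frequency).
axis = np.sin(np.linspace(0.0, np.pi, 32))
smooth = np.outer(axis, axis)
checker = (np.indices((32, 32)).sum(axis=0) % 2).astype(float) - 0.5

ratio_smooth = low_frequency_ratio(smooth)
ratio_checker = low_frequency_ratio(checker)
```

Applied per block and per timestep to real U-Net activations, a higher ratio would indicate a feature dominated by coarse structure, and a lower ratio one dominated by fine detail.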


Figure 3. Qualitative examples illustrating how CHIMERA and previous models differ in their ability to preserve smoothness, domain consistency, and perceptual quality.

The proposed CHIMERA shows a well-balanced improvement over previous methods in terms of smoothness, domain consistency, and perceptual quality.

Proposed Method

Figure 4. ACI corrects the timestep mismatch between inversion and denoising via the proposed IDM and reinjects multi-scale cached features (low-frequency structures early and high-frequency details later) to guide consistent morphing. SAP introduces a VLM-derived anchor prompt into early cross-attention layers, stabilizing semantics and reducing drift for heterogeneous input pairs.
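The idea of re-injecting low-frequency structure early and high-frequency detail later can be sketched as a weight that depends on both the block and the denoising progress. Everything below is an assumption made for illustration: the names `injection_weight` and `inject`, the linear schedules, and the linear blend of the two cached features are hypothetical, and the IDM timestep correction is not modeled. The down block is omitted because this page does not describe its frequency profile.

```python
import numpy as np

def injection_weight(block, progress):
    """Hypothetical injection strength for one U-Net block.

    block:    "mid" (low-frequency leaning) or "up" (high-frequency leaning).
    progress: 0.0 at the first (noisiest) denoising step, 1.0 at the last.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    if block == "mid":
        return 1.0 - progress  # strongest early, fades out
    if block == "up":
        return progress        # weak early, strongest late
    raise ValueError(f"unknown block: {block}")

def inject(current, cached_a, cached_b, alpha, block, progress):
    """Mix the current feature with a blend of the two cached features.

    alpha is the morphing position: 0.0 reproduces input A's cached
    feature, 1.0 reproduces input B's. Linear blending is an assumption.
    """
    blended = (1.0 - alpha) * cached_a + alpha * cached_b
    w = injection_weight(block, progress)
    return (1.0 - w) * current + w * blended

# Toy features standing in for real activations.
current = np.zeros((2, 2))
cached_a = np.ones((2, 2))
cached_b = 3.0 * np.ones((2, 2))

early_mid = inject(current, cached_a, cached_b, 0.5, "mid", 0.0)
early_up = inject(current, cached_a, cached_b, 0.5, "up", 0.0)
late_up = inject(current, cached_a, cached_b, 0.5, "up", 1.0)
```

In this toy schedule the mid-block cache fully overrides the current feature at the first step, while the up-block cache only takes over toward the end, mirroring the early-structure, late-detail ordering described in the caption.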

Proposed Metric


Effect of GCS
Effect of LCS

Figure 5. Qualitative examples demonstrating the effectiveness of GLCS. GLCS consists of GCS and LCS, and the qualitative results illustrate how well each component aligns with human perception.

Algorithm 1. Full computation of GLCS, which consists of GCS and LCS.

Quantitative Results

Table 1. Quantitative results for the 5-frame morphing between each input image pair.
Table 2. Quantitative results for the 14-frame morphing between each input image pair.

Qualitative Results

5-frame Qualitative Result 1
5-frame Qualitative Result 2
5-frame Qualitative Result 3
5-frame Qualitative Result 4

Figure 6. IMPUS [ICLR’24] shows good domain consistency with the input image pair, but it contains abrupt transitions and therefore lacks smoothness. DiffMorpher [CVPR’24] provides smoother transitions, but its domain consistency is weak, with objects disappearing or becoming unstable. FreeMorph [ICCV’25] produces overly saturated colors, which are common artifacts in diffusion-based generation. In contrast, the proposed CHIMERA maintains both smoothness and domain consistency.

14-frame Qualitative Result 1
14-frame Qualitative Result 2
14-frame Qualitative Result 3
14-frame Qualitative Result 4

Figure 7. Qualitative results for the more challenging 14-frame morphing setting. Consistent with Fig. 6, CHIMERA maintains both smoothness and domain consistency in this extended setting.

BibTeX citation

TBD;