TL;DR: V-Warper personalizes video generation by injecting fine appearance through semantic correspondence–guided value warping, achieving faithful identity while keeping motion natural and prompts aligned.

Teaser

Overview

We introduce V-Warper, a video personalization framework that generates subject-accurate videos from a reference image while following a text prompt. Unlike prior methods that rely on heavy video finetuning and still struggle with identity drift, V-Warper uses a coarse-to-fine strategy combining lightweight image-based adaptation with a fully training-free refinement stage. During generation, we compute semantic correspondences using RoPE-free mid-level attention features and warp appearance-rich value features into semantically matched regions of the video, guided by reliability masks. This design injects fine-grained appearance details without disturbing the model’s temporal prior, enabling more faithful, consistent, and motion-stable subject-driven video generation.

Our approach

Overall framework

(a): Given a few images of the subject, V-Warper coarsely adapts the subject's appearance to the video diffusion model using LoRA.

(b): After coarse appearance alignment, V-Warper further refines the appearance using parallel reference and generation branches. Token-level correspondences are used to warp reference appearance features and inject them into the generation branch, enhancing appearance consistency.
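As a rough, non-official sketch of this warping step, the snippet below matches each generation-branch token to its most similar reference token via cosine similarity of RoPE-free attention features and gathers the corresponding reference value features. Tensor shapes and the choice of cosine similarity are assumptions rather than the released implementation.

import torch
import torch.nn.functional as F

def warp_reference_values(gen_feat, ref_feat, ref_values):
    # gen_feat:   (N_gen, C)  RoPE-free attention features of the generation branch
    # ref_feat:   (N_ref, C)  RoPE-free attention features of the reference branch
    # ref_values: (N_ref, D)  appearance-rich value features of the reference branch
    gen_n = F.normalize(gen_feat, dim=-1)
    ref_n = F.normalize(ref_feat, dim=-1)
    sim = gen_n @ ref_n.t()                  # (N_gen, N_ref) cosine similarity
    best_sim, best_idx = sim.max(dim=-1)     # nearest reference token per generation token
    warped = ref_values[best_idx]            # warp reference values to matched positions
    return warped, best_sim                  # best_sim doubles as a reliability score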

V-Warper adopts a coarse-to-fine strategy: first, it lightly adapts the model using LoRA and a learnable subject embedding to encode coarse identity from a few reference images. During inference, the model enhances fine-grained appearance through a training-free refinement stage that computes semantic correspondences from RoPE-free mid-level attention features and injects appearance-rich value features into matched regions via masking. This design preserves the model’s temporal prior while accurately transferring high-frequency subject details.
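For the coarse stage described above, a minimal illustration of LoRA-style adaptation is sketched below: base weights stay frozen while low-rank residuals and a learnable subject embedding are trained on the few reference images. The target layer names, rank, scaling, and embedding size are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen nn.Linear with a trainable low-rank residual.
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # keep pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)        # start as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def inject_lora(model: nn.Module, suffixes=("to_q", "to_k", "to_v")):
    # Replace attention projections whose names end with the given suffixes
    # (suffix names are assumptions about the backbone's layer naming).
    for parent in list(model.modules()):
        for name, child in list(parent.named_children()):
            if isinstance(child, nn.Linear) and name.endswith(suffixes):
                setattr(parent, name, LoRALinear(child))
    return model

# A learnable subject embedding (hidden size is an assumption) can be optimized
# jointly with the LoRA weights to encode coarse identity.
subject_embedding = nn.Parameter(torch.zeros(1, 1, 3072))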

Motivation

Text-to-video diffusion models produce high-quality and temporally stable videos but still struggle to preserve fine-grained subject identity. Prior methods rely on finetuning or auxiliary supervision, overlooking the model’s intrinsic semantic structure. Our analysis shows that transformer-based video diffusion models encode strong semantic correspondences in their mid-level attention features, especially when RoPE is removed. This indicates that effective personalization can be achieved by leveraging these internal signals rather than extensively modifying the model.

Attention analysis
Comparison of matching behaviors. (a) Intermediate features fail to produce precise alignment, (b) Query–Key features suffer from positional bias due to RoPE, and only (c) RoPE-free Query–Key features yield accurate subject-aware correspondence.
Layer-wise correspondence evaluation. RoPE-free Query–Key features achieve the highest matching reliability, with the strongest alignment observed in mid-level MM-DiT layers, particularly layer 12.
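The layer-wise evaluation above can be reproduced, in spirit, with a simple nearest-neighbour matching protocol: given annotated corresponding tokens between two views of the same subject, measure how often a layer's features match them correctly. The snippet below is a generic sketch of such a protocol, not the paper's exact evaluation code.

import torch
import torch.nn.functional as F

def matching_accuracy(src_feat, tgt_feat, src_idx, tgt_idx):
    # src_feat, tgt_feat: (N, C) per-layer features (e.g. RoPE-free query/key tokens)
    #                     extracted from two views of the same subject
    # src_idx, tgt_idx:   (K,) indices of annotated corresponding tokens
    src = F.normalize(src_feat[src_idx], dim=-1)
    tgt = F.normalize(tgt_feat, dim=-1)
    pred = (src @ tgt.t()).argmax(dim=-1)           # nearest neighbour per annotated token
    return (pred == tgt_idx).float().mean().item()  # fraction matched correctly

Running this per layer, with and without RoPE applied to the features, yields the kind of comparison summarized above.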

Qualitative comparison

We qualitatively compare V-Warper with DreamVideo, VideoBooth, SDVG, and VACE, and observe consistently stronger identity preservation across frames while maintaining natural motion and prompt alignment.

Quantitative comparison

We report CLIP/DINO and VBench scores to show that V-Warper not only preserves identity effectively, but also maintains overall video quality and temporal coherence.

Comparison of identity preservation

V-Warper achieves the strongest identity preservation across all baselines while maintaining comparable text alignment, and does so without large-scale video finetuning—showing that efficient personalization can still deliver high visual fidelity.

Method    I-DINO    I-CLIP    T-CLIP
DreamVideo 0.322 0.641 0.290
VideoBooth 0.349 0.634 0.272
SDVG 0.661 0.787 0.294
VACE 0.651 0.796 0.326
V-Warper 0.738 0.825 0.297
Quantitative comparison of identity preservation, measuring image similarity to the reference subject (I-DINO, I-CLIP) and adherence to the text prompt (T-CLIP).
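For reference, I-CLIP and T-CLIP style scores can be computed per frame with an off-the-shelf CLIP model; the sketch below uses the Hugging Face transformers CLIP API with an assumed backbone, and I-DINO is computed analogously from DINO image features. Backbone choice, frame sampling, and averaging follow common practice rather than the paper's exact protocol.

import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(frames, reference, prompt):
    # frames: list of PIL frames of the generated video; reference: PIL image of the subject
    img_inputs = proc(images=frames + [reference], return_tensors="pt")
    img_emb = F.normalize(clip.get_image_features(**img_inputs), dim=-1)
    frame_emb, ref_emb = img_emb[:-1], img_emb[-1:]

    txt_inputs = proc(text=[prompt], return_tensors="pt", padding=True)
    txt_emb = F.normalize(clip.get_text_features(**txt_inputs), dim=-1)

    i_clip = (frame_emb @ ref_emb.t()).mean().item()  # frame-to-reference image similarity
    t_clip = (frame_emb @ txt_emb.t()).mean().item()  # frame-to-prompt alignment
    return i_clip, t_clip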

Results on VBench

Method Subject Cons. ↑ Background Cons. ↑ Motion Smooth. ↑ Dynamic Degree ↑ Aesthetic Quality ↑ Imaging Quality ↑ Temporal Flicker ↑
Training-based video personalization
VideoBooth 0.9143 0.9494 0.9649 0.6900 0.4483 56.0102 0.9550
SDVG 0.9811 0.9801 0.9922 0.2100 0.6498 64.9768 0.9889
VACE 0.9685 0.9751 0.9829 0.4500 0.6749 67.8530 0.9650
Optimization-based video personalization
DreamVideo 0.9591 0.9766 0.9734 0.1400 0.5111 62.4017 0.9630
V-Warper 0.9866 0.9750 0.9866 0.5100 0.6074 70.6617 0.9774

VBench evaluation showing that V-Warper delivers the strongest subject consistency and imaging quality, while maintaining competitive motion stability and perceptual scores without large-scale video training.

V-Warper outperforms prior optimization-based methods across nearly all metrics and matches or exceeds training-based approaches in subject consistency and image quality, without requiring large-scale video finetuning.

Ablation study

Component-level evaluation

The ablation compares coarse adaptation, value warping, and masked warping. Coarse adaptation offers a stable baseline but misses fine details. Value warping restores detailed appearance, though it sometimes introduces artifacts. Masking filters out unreliable transfers, yielding the best balance between identity fidelity and text alignment; a rough sketch of this masking step follows the table below.

Component    I-DINO    I-CLIP    T-CLIP
(I) Coarse Appearance Adaptation 0.645 0.791 0.320
(II) (I)+Value Warping 0.701 0.809 0.278
(III) (II)+Masking (V-Warper) 0.656 0.806 0.320

Component analysis showing that adding value warping improves identity fidelity, while masking restores stability and text consistency, resulting in the best overall balance.
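As a rough illustration of the masking component, assuming the best-match similarity returned by the warping step is reused as a per-token reliability score, the blend below keeps the generation branch's own values wherever the match is unreliable. The threshold value and the hard-mask blending rule are assumptions, not the exact released recipe.

import torch

def masked_value_injection(gen_values, warped_values, match_sim, tau=0.6):
    # gen_values:    (N_gen, D) value features of the generation branch
    # warped_values: (N_gen, D) reference values warped to generation tokens
    # match_sim:     (N_gen,)   similarity of each token's best reference match
    mask = (match_sim > tau).unsqueeze(-1).to(gen_values.dtype)  # reliability mask
    # Reliable tokens take the reference appearance; unreliable tokens are left
    # untouched, which preserves the model's temporal prior.
    return mask * warped_values + (1.0 - mask) * gen_values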

Conclusion

V-Warper tackles video personalization by combining lightweight image-based adaptation with correspondence-guided appearance injection. Rather than depending on heavy video finetuning, it leverages RoPE-free mid-level attention features to compute semantic correspondences during denoising and injects fine appearance via masked value warping. This coarse-to-fine design preserves temporal motion while reinforcing high-frequency identity details, yielding more faithful and consistent results than prior methods.

Citation

If you use this work or find it helpful, please consider citing:

@misc{lee2025vwarperappearanceconsistentvideodiffusion,
    title={V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping}, 
    author={Hyunkoo Lee and Wooseok Jang and Jini Yang and Taehwan Kim and Sangoh Kim and Sangwon Jung and Seungryong Kim},
    year={2025},
    eprint={2512.12375},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2512.12375}, 
}