Overview
We introduce V-Warper, a video personalization framework that generates subject-accurate videos from a reference image while following a text prompt. Unlike prior methods that rely on heavy video finetuning and still struggle with identity drift, V-Warper uses a coarse-to-fine strategy combining lightweight image-based adaptation with a fully training-free refinement stage. During generation, we compute semantic correspondences using RoPE-free mid-level attention features and warp appearance-rich value features into semantically matched regions of the video, guided by reliability masks. This design injects fine-grained appearance details without disturbing the model’s temporal prior, enabling more faithful, consistent, and motion-stable subject-driven video generation.
Our approach
(a): Given a few images of the subject, V-Warper coarsely adapts the subject's appearance to the video diffusion model using LoRA.
(b): After coarse appearance alignment, V-Warper further refines the appearance with parallel reference and generation branches. Using token-level correspondences, reference appearance features are warped and injected into the generation branch to enhance appearance consistency.
V-Warper adopts a coarse-to-fine strategy: first, it lightly adapts the model using LoRA and a learnable subject embedding to encode coarse identity from a few reference images. During inference, the model enhances fine-grained appearance through a training-free refinement stage that computes semantic correspondences from RoPE-free mid-level attention features and injects appearance-rich value features into matched regions via masking. This design preserves the model’s temporal prior while accurately transferring high-frequency subject details.
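To make the refinement stage concrete, below is a minimal PyTorch sketch of the masked value-warping step. The tensor names (`q_gen`, `k_ref`, `v_ref`, `v_gen`), their shapes, and the scalar reliability threshold `tau` are illustrative assumptions rather than the exact implementation; the idea is simply that each generated token pulls the value feature of its best-matching reference token, but only where the match is reliable.

```python
# Illustrative sketch of masked value warping (not the official implementation).
import torch
import torch.nn.functional as F

def warp_reference_values(q_gen, k_ref, v_ref, v_gen, tau=0.5):
    """Inject reference appearance into the generation branch via value warping.

    q_gen: [N_gen, C]  RoPE-free mid-level features of the generated tokens
    k_ref: [N_ref, C]  RoPE-free mid-level features of the reference tokens
    v_ref: [N_ref, D]  appearance-rich value features from the reference branch
    v_gen: [N_gen, D]  value features of the generation branch
    tau:   assumed cosine-similarity cutoff for the reliability mask
    """
    # Token-level semantic correspondence via cosine similarity.
    sim = F.normalize(q_gen, dim=-1) @ F.normalize(k_ref, dim=-1).T   # [N_gen, N_ref]
    best_sim, best_idx = sim.max(dim=-1)                              # best reference match per token

    # Reliability mask: keep only confident correspondences.
    reliable = (best_sim > tau).float().unsqueeze(-1)                 # [N_gen, 1]

    # Warp matched reference values into the generation branch; fall back to
    # the original values wherever the correspondence is unreliable.
    v_warped = v_ref[best_idx]                                        # [N_gen, D]
    return reliable * v_warped + (1.0 - reliable) * v_gen
```

In the method described above, this replacement happens inside attention layers during denoising, which is why the fine-grained appearance can be injected without disturbing the model's temporal prior.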
Motivation
Text-to-video diffusion models produce high-quality and temporally stable videos but still struggle to preserve fine-grained subject identity. Prior methods rely on finetuning or auxiliary supervision, overlooking the model’s intrinsic semantic structure. Our analysis shows that transformer-based video diffusion models encode strong semantic correspondences in their mid-level attention features, especially when RoPE is removed. This indicates that effective personalization can be achieved by leveraging these internal signals rather than extensively modifying the model.
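As a rough illustration of this analysis, the sketch below matches every patch of a generated frame to a reference image using cosine similarity over mid-level features extracted with rotary position embeddings (RoPE) disabled. The feature shapes, the patch-grid reshape, and the helper name `correspondence_map` are assumptions made for illustration only.

```python
# Illustrative probe of semantic correspondence from RoPE-free mid-level features.
import torch
import torch.nn.functional as F

def correspondence_map(feat_ref, feat_gen, h, w):
    """Match every patch of a generated frame to its nearest reference patch.

    feat_ref: [h*w, C] mid-level features of the reference image (RoPE disabled)
    feat_gen: [h*w, C] mid-level features of one generated frame (RoPE disabled)
    Returns per-patch (row, col) coordinates in the reference grid and the
    matching confidence (maximum cosine similarity).
    """
    sim = F.normalize(feat_gen, dim=-1) @ F.normalize(feat_ref, dim=-1).T  # [h*w, h*w]
    conf, idx = sim.max(dim=-1)
    rows = torch.div(idx, w, rounding_mode="floor")   # unravel flat indices
    cols = idx % w                                    # into the patch grid
    coords = torch.stack([rows, cols], dim=-1).view(h, w, 2)
    return coords, conf.view(h, w)
```

Spatially coherent, high-confidence matches on the subject region are what make the training-free warping described earlier possible without any extra supervision.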
Qualitative comparison
We qualitatively compare V-Warper with DreamVideo, VideoBooth, SDVG, and VACE, and observe consistently stronger identity preservation across frames while maintaining natural motion and prompt alignment.
Quantitative comparison
We report CLIP/DINO and VBench scores to show that V-Warper not only preserves identity effectively, but also maintains overall video quality and temporal coherence.
Comparison of identity preservation
V-Warper achieves the strongest identity preservation across all baselines while maintaining comparable text alignment, and does so without large-scale video finetuning—showing that efficient personalization can still deliver high visual fidelity.
| Method | IDINO↑ | ICLIP↑ | TCLIP↑ |
|---|---|---|---|
| DreamVideo | 0.322 | 0.641 | 0.290 |
| VideoBooth | 0.349 | 0.634 | 0.272 |
| SDVG | 0.661 | 0.787 | 0.294 |
| VACE | 0.651 | 0.796 | 0.326 |
| V-Warper | 0.738 | 0.825 | 0.297 |
Results on VBench
| Method | Subject Cons. ↑ | Background Cons. ↑ | Motion Smooth. ↑ | Dynamic Degree ↑ | Aesthetic Quality ↑ | Imaging Quality ↑ | Temporal Flicker ↑ |
|---|---|---|---|---|---|---|---|
| Training-based video personalization | |||||||
| VideoBooth | 0.9143 | 0.9494 | 0.9649 | 0.6900 | 0.4483 | 56.0102 | 0.9550 |
| SDVG | 0.9811 | 0.9801 | 0.9922 | 0.2100 | 0.6498 | 64.9768 | 0.9889 |
| VACE | 0.9685 | 0.9751 | 0.9829 | 0.4500 | 0.6749 | 67.8530 | 0.9650 |
| Optimization-based video personalization | |||||||
| DreamVideo | 0.9591 | 0.9766 | 0.9734 | 0.1400 | 0.5111 | 62.4017 | 0.9630 |
| V-Warper | 0.9866 | 0.9750 | 0.9866 | 0.5100 | 0.6074 | 70.6617 | 0.9774 |
VBench evaluation showing that V-Warper delivers the strongest subject consistency and imaging quality, while maintaining competitive motion stability and perceptual scores without large-scale video training.
V-Warper outperforms prior optimization-based methods across nearly all metrics and matches or exceeds training-based approaches in subject consistency and image quality, without requiring large-scale video finetuning.
Ablation study
Component-level evaluation
The ablation compares coarse adaptation, value warping, and masked warping. Coarse adaptation offers a stable baseline but misses fine details. Value warping restores detailed appearance, though it sometimes introduces artifacts. Masking filters out unreliable transfers, yielding the best balance between identity fidelity and text alignment.
| Component | IDINO↑ | ICLIP↑ | TCLIP↑ |
|---|---|---|---|
| (I) Coarse Appearance Adaptation | 0.645 | 0.791 | 0.320 |
| (II) (I)+Value Warping | 0.701 | 0.809 | 0.278 |
| (III) (II)+Masking (V-Warper) | 0.656 | 0.806 | 0.320 |
Component analysis showing that adding value warping improves identity fidelity, while masking restores stability and text consistency, resulting in the best overall balance.
Conclusion
V-Warper tackles video personalization by combining lightweight image-based adaptation with correspondence-guided appearance injection. Rather than depending on heavy video finetuning, it leverages RoPE-free mid-level attention features to compute semantic correspondences during denoising and injects fine appearance via masked value warping. This coarse-to-fine design preserves temporal motion while reinforcing high-frequency identity details, yielding more faithful and consistent results than prior methods.
Citation
If you use this work or find it helpful, please consider citing:
```bibtex
@misc{lee2025vwarperappearanceconsistentvideodiffusion,
  title={V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping},
  author={Hyunkoo Lee and Wooseok Jang and Jini Yang and Taehwan Kim and Sangoh Kim and Sangwon Jung and Seungryong Kim},
  year={2025},
  eprint={2512.12375},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.12375},
}
```