Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

Paul Hyunbin Cho1,*
Jinhyuk Jang1,*
SeokYoung Lee1
Joungbin Lee1
Siyoon Jin1
Heeseong Shin1
Jung Yi1
Yunjin Park2
Chulmin Park2
Seungryong Kim1,†

1KAIST AI   ·   2AIPARK
*Equal contribution   ·   †Corresponding author

TL;DR

Lip Forcing is the first autoregressive diffusion model for video-to-video (V2V) lip synchronization. It distills a 14B bidirectional teacher into two-step causal students at 1.3B and 14B scales. The 1.3B student runs 17.6× faster than its same-scale bidirectional counterpart and reaches real-time 31 FPS on a single H100 GPU. The 14B student runs 39.8× faster than its teacher at comparable reference fidelity.

31 FPS: real-time on a single GPU (H100)
2 steps: at inference, no CFG
17.6× speedup: 1.3B causal student model vs. same-scale bidirectional model
39.8× speedup: 14B causal student model vs. same-scale bidirectional model
PREVIEW

Preview

Nine outputs from Lip Forcing (14B) on TalkVid.

MOTIVATION

The fidelity–sync tradeoff

CFG fidelity-sync tradeoff: full trajectory analysis and 2x2 schedule factorial
(a) Along the full 50-step trajectory of the 14B teacher, classifier-free guidance (CFG, red) improves audio-visual sync but worsens reference fidelity. No-CFG (navy) shows the inverse. No fixed CFG scale optimizes both axes. (b) An Euler-step 2×2 factorial over schedules $(s_0, s_1)$ shows that mixed schedules recover most of the sync gap when the second step lands near $j_1\!=\!30$.

Diffusion lip-sync models exhibit a CFG fidelity-sync tradeoff: classifier-free guidance improves audio-visual synchronization at the cost of reference fidelity, and no single CFG scale achieves both. To distill this behavior into a two-step student, we run an Euler-step factorial over $(s_0, s_1)\in\{1.0, 4.5\}^2$. The no-CFG → CFG schedule lands near $\tau\!=\!0.769$ (~ODE step 30), on the reference-leaning side of the joint reference–sync optimum, with reference fidelity close to the CFG=1.0 ceiling and sync close to the CFG=4.5 ceiling. The remaining sync gap can be closed with explicit SyncNet supervision, whereas the reference cost incurred by a guided first step cannot. This selects the operating point the student is trained to reach at inference: a single landing at $\tau\!=\!0.769$ after one no-CFG velocity step from $\tau\!=\!0.999$.
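The correspondence between ODE step 30 and $\tau\!=\!0.769$ can be checked with a short sketch. It assumes a 50-step flow-matching schedule with a timestep shift of 5; the shift value is not stated on this page and is chosen here only because it reproduces 0.769. The names `shifted_tau` and `cfg_velocity` are hypothetical, and `cfg_velocity` uses the standard classifier-free guidance combination, in which a scale of 1.0 reduces to the conditional prediction (no CFG).

```python
from itertools import product


def shifted_tau(step: int, num_steps: int = 50, shift: float = 5.0) -> float:
    """Map an ODE step index to a noise level tau.

    Assumption: linear time t = 1 - step / num_steps, followed by the
    timestep shift tau = shift * t / (1 + (shift - 1) * t).
    """
    t = 1.0 - step / num_steps
    return shift * t / (1.0 + (shift - 1.0) * t)


def cfg_velocity(v_cond: float, v_uncond: float, scale: float) -> float:
    """Standard CFG combination; scale = 1.0 returns v_cond unchanged."""
    return v_uncond + scale * (v_cond - v_uncond)


# The 2x2 factorial: CFG scale s0 for the first step, s1 for the second.
schedules = list(product((1.0, 4.5), repeat=2))

# Landing point of the first step under the assumed schedule.
landing_tau = shifted_tau(30)

if __name__ == "__main__":
    print(schedules)
    print(round(landing_tau, 3))
```

Under these assumptions step 30 maps to roughly 0.769. The same formula gives exactly 1.0 at step 0 rather than 0.999, so it should be read as a rough consistency check rather than the authors' exact schedule.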

METHOD

Overall framework

Figure (Framework): Lip Forcing architecture, a Self-Forcing DMD distillation pipeline.
Figure (Speed vs. quality): Pareto frontier; Lip Forcing dominates on speed-quality.

The 14B bidirectional teacher provides score / guidance targets along a windowed schedule (Sync-Window DMD), while the causal student uses chunk-wise sliding-window attention (sink=1, window=6, total cache 7) to support autoregressive inference at deployment. Trained this way, both Lip Forcing variants land on the upper-right corner of the speed–quality Pareto frontier: the 1.3B student reaches 31 FPS — past the 25 FPS real-time threshold — while the 14B student delivers a 4.7× throughput gain over LatentSync at comparable reference fidelity. No prior method occupies the upper-right region; the closest competitors trade off either quality (Diff2Lip) or speed (LatentSync, OmniAvatar).
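The cache layout (sink=1, window=6, total 7) can be illustrated with a minimal sketch. The class `SinkWindowCache` is hypothetical and stores chunk identifiers rather than key/value tensors. It also assumes the sink is the first chunk of the sequence and that the window holds the six most recent chunks after it; the page does not specify these details.

```python
from collections import deque


class SinkWindowCache:
    """Keep the first `sink` chunks forever plus the last `window` chunks."""

    def __init__(self, sink: int = 1, window: int = 6):
        self.sink_size = sink
        self.sink = []                       # permanently retained chunks
        self.recent = deque(maxlen=window)   # oldest entry evicted on overflow

    def append(self, chunk):
        # Fill the sink first; afterwards every chunk enters the window.
        if len(self.sink) < self.sink_size:
            self.sink.append(chunk)
        else:
            self.recent.append(chunk)

    def visible(self):
        """Chunks the next generated chunk is allowed to attend to."""
        return self.sink + list(self.recent)


cache = SinkWindowCache()
for chunk_id in range(10):
    cache.append(chunk_id)
visible_chunks = cache.visible()

if __name__ == "__main__":
    print(visible_chunks)
```

Because the cache never grows beyond sink + window entries, per-chunk attention cost stays constant however long the video runs, which is the property autoregressive deployment relies on.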

1. Sync-Window DMD distillation. Standard DMD applies teacher CFG uniformly. We instead gate it to the sync-favoring band, derived from the trajectory analysis (roughly steps $j\!\in\![20,40]$). The student learns when guidance is useful and when it hurts.
2. Two-step inference, no CFG. At deployment, the student runs one no-CFG velocity step from $\tau\!=\!0.999$ to a landing at $\tau\!=\!0.769$, then one final denoising step to $\tau\!=\!0$. No classifier-free guidance is used at inference.
3. SyncNet reward. An explicit SyncNet-based reward closes the residual sync gap identified by the analysis.
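The gating rule and the two-step sampler described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `teacher_cfg_scale` and `two_step_sample` are hypothetical names, the guided scale of 4.5 is borrowed from the factorial analysis, the band edges are treated as inclusive, and the update assumes the common flow-matching convention $x_\tau = (1-\tau)\,x_0 + \tau\,\epsilon$ with velocity $v = \epsilon - x_0$.

```python
from typing import Callable, List, Sequence

Vector = List[float]

# Noise levels visited at inference: start, landing, clean sample.
TAUS: Sequence[float] = (0.999, 0.769, 0.0)


def teacher_cfg_scale(step: int, band=(20, 40),
                      guided: float = 4.5, unguided: float = 1.0) -> float:
    """Apply teacher CFG only inside the sync-favoring band of ODE steps."""
    lo, hi = band
    return guided if lo <= step <= hi else unguided


def two_step_sample(velocity: Callable[[Vector, float], Vector],
                    x: Vector, taus: Sequence[float] = TAUS) -> Vector:
    """Two Euler steps along taus; the student is queried without CFG."""
    for tau, tau_next in zip(taus[:-1], taus[1:]):
        v = velocity(x, tau)
        x = [xi + (tau_next - tau) * vi for xi, vi in zip(x, v)]
    return x


# Toy check: with the exact straight-line velocity, two steps recover x0.
x0 = [0.5, -1.0]
noise = [1.5, 2.0]


def toy_velocity(x: Vector, tau: float) -> Vector:
    return [n - c for n, c in zip(noise, x0)]


x_start = [(1 - TAUS[0]) * c + TAUS[0] * n for c, n in zip(x0, noise)]
result = two_step_sample(toy_velocity, x_start)

if __name__ == "__main__":
    print(teacher_cfg_scale(10), teacher_cfg_scale(30), teacher_cfg_scale(45))
    print(result)
```

In the toy check the velocity is constant, so Euler integration is exact. A learned student has no such guarantee, which is why the choice of landing point matters.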
QUANTITATIVE

Comparison on HDTF

Throughput measured on a single H100. Best in bold; second-best underlined.

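| Method | Steps | FPS ↑ | TTFF ↓ | Sync-C ↑ | Sync-D ↓ | CSIM ↑ | FID ↓ | FVD ↓ | SSIM ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Ground truth | - | - | - | 7.95 | 6.92 | - | - | - | - |
| Diff2Lip | 25 | 15.47 | 5.04 | 8.35 | 6.32 | 0.943 | 20.32 | 285.69 | 0.907 |
| LatentSync | 20 | 3.23 | 6.29 | 8.10 | 6.51 | 0.967 | 6.90 | 117.91 | 0.950 |
| X-Dub | 30 | 0.91 | 163.64 | 7.58 | 7.66 | 0.898 | 14.76 | 183.99 | 0.831 |
| OmniAvatar LS (1.3B) | 50 | 1.79 | 45.36 | 8.04 | 6.99 | 0.927 | 8.06 | 143.75 | 0.904 |
| OmniAvatar LS (14B) | 50 | 0.38 | 213.72 | 8.98 | 6.11 | 0.934 | 6.71 | 133.87 | 0.911 |
| Self Forcing (1.3B) | 4 | 27.48 | 0.38 | 7.12 | 7.80 | 0.939 | 7.51 | 124.78 | 0.915 |
| Lip Forcing (1.3B) | 2 | 31.58 | 0.32 | 6.88 | 7.93 | 0.943 | 6.76 | 118.86 | 0.919 |
| Lip Forcing (14B) | 2 | 15.11 | 0.54 | 7.59 | 7.23 | 0.949 | 7.01 | 107.88 | 0.938 |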

See the Pareto frontier in the Method section for the speed-quality view of these numbers.
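The headline speedups can be rechecked from the FPS column with a few lines of arithmetic. The sketch assumes that the same-scale bidirectional baselines are the OmniAvatar LS rows; this pairing is consistent with the quoted ratios but is not stated explicitly on this page.

```python
# FPS values copied from the HDTF comparison table.
fps = {
    "LatentSync": 3.23,
    "OmniAvatar LS (1.3B)": 1.79,
    "OmniAvatar LS (14B)": 0.38,
    "Lip Forcing (1.3B)": 31.58,
    "Lip Forcing (14B)": 15.11,
}

REALTIME_FPS = 25.0  # real-time threshold quoted in the Method section


def speedup(fast: str, slow: str) -> float:
    """Throughput ratio between two methods."""
    return fps[fast] / fps[slow]


small = speedup("Lip Forcing (1.3B)", "OmniAvatar LS (1.3B)")
large = speedup("Lip Forcing (14B)", "OmniAvatar LS (14B)")
vs_latentsync = speedup("Lip Forcing (14B)", "LatentSync")
is_realtime = fps["Lip Forcing (1.3B)"] > REALTIME_FPS

if __name__ == "__main__":
    print(round(small, 1), round(large, 1), round(vs_latentsync, 1), is_realtime)
```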

QUALITATIVE

Lip Forcing in action

Outputs from Lip Forcing (14B) on six clips from the TalkVid test set.

VS. BASELINES

Comparison with baselines

Side-by-side videos on TalkVid test clips: Input, Ours (Lip Forcing), Wav2Lip, VideoReTalking, MuseTalk, LatentSync. The comparison covers three test sets: TalkVid, HDTF, and Hallo3.

CITATION

BibTeX

@misc{cho2026lipforcingfewstepautoregressive,
  title         = {Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization},
  author        = {Paul Hyunbin Cho and Jinhyuk Jang and SeokYoung Lee and Joungbin Lee and Siyoon Jin and Heeseong Shin and Jung Yi and Yunjin Park and Chulmin Park and Seungryong Kim},
  year          = {2026},
  eprint        = {2606.11180},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.11180}
}