Teaser
Text Prompt: At a kitchen counter, a student in a white shirt places a notebook on a wooden table. Cups and plates are scattered nearby, and sunlight comes in through the window. The action demonstrates placing the notebook on the flat surface.
Text Prompt: A man in a white t-shirt lifts a large box from the ground near a delivery truck on a narrow street.
Text Prompt: The man in a blue shirt feeds a strawberry to the woman in a white chef coat.
Overview
We conduct a systematic analysis that formalizes interaction in video DiTs from two perspectives: semantic grounding and semantic propagation. We find that both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in these layers with multi-instance mask tracks, enhancing interaction fidelity and semantic alignment while reducing drift and hallucination.

MATRIX-11K Dataset – 11K videos with interaction-aware captions and instance-level mask tracks.
First systematic analysis of how video DiTs encode semantic grounding and semantic propagation through their 3D full attention, identifying interaction-dominant layers.
MATRIX Framework – a simple yet effective regularization that aligns attention in the interaction-dominant layers with multi-instance mask tracks via two terms: the Semantic Grounding Alignment (SGA) and Semantic Propagation Alignment (SPA) losses.
InterGenEval Protocol – an interaction-aware evaluation protocol that measures Key Interaction Semantic Alignment (KISA), Semantic Grounding Integrity (SGI), Semantic Propagation Integrity (SPI), and Interaction Fidelity (IF).
State-of-the-art performance – significant interaction-fidelity gains over baselines while maintaining video quality.
Dataset
Our Dataset Curation Pipeline

Top: An LLM (a) identifies interaction triplets, (b) filters them by Dynamism and Contactness, and (c) extracts per-ID appearance descriptions. Bottom: A VLM then verifies the candidates to select an anchor frame, from which SAM2 propagates masks to produce instance-level mask tracks.
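For concreteness, here is a minimal Python sketch of the curation pipeline under stated assumptions: the LLM/VLM helpers (`extract_triplets`, `describe`, `select_anchor`) and the Dynamism/Contactness thresholds are hypothetical placeholders, while the SAM2 calls follow the public `sam2` video-predictor API.

```python
from sam2.build_sam import build_sam2_video_predictor

def curate_video(video_dir, caption, llm, vlm, predictor):
    # (a) LLM identifies <subject, action, object> interaction triplets.
    triplets = llm.extract_triplets(caption)                  # hypothetical helper
    # (b) Filter triplets by Dynamism and Contactness (thresholds assumed).
    triplets = [t for t in triplets if t.dynamism > 0.5 and t.contactness > 0.5]
    # (c) Extract a per-ID appearance description for each instance.
    descriptions = {t.id: llm.describe(t) for t in triplets}  # hypothetical helper

    # A VLM verifies the candidates and selects an anchor frame,
    # returning one bounding box per instance ID on that frame.
    anchor_idx, boxes = vlm.select_anchor(video_dir, descriptions)  # hypothetical

    # SAM2 propagates masks from the anchor frame to build instance mask tracks.
    state = predictor.init_state(video_path=video_dir)
    for obj_id, box in boxes.items():
        predictor.add_new_points_or_box(state, frame_idx=anchor_idx,
                                        obj_id=obj_id, box=box)
    mask_tracks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        for i, obj_id in enumerate(obj_ids):
            mask_tracks.setdefault(obj_id, {})[frame_idx] = mask_logits[i] > 0.0
    return mask_tracks

# Example setup (checkpoint/config names illustrative):
# predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")
```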
Curated Dataset Examples
Analysis

(a) Influential layers: layers with high AAS that rank in the top-10 for many videos.
(b) Dominant layers: the influential layers whose mean AAS clearly separates successes from failures.
Influential Layer Candidates: We identify influential layers by rank-summing each layer's frequency in the AAS top-10 and its mean AAS, revealing that high alignment consistently concentrates in a small subset of layers.
Dominant Layer Selection: Among the influential layers, we identify the interaction-dominant layers that most directly govern the outcome: those where the success gap is large and positive and the failure gap is large and negative relative to the overall mean (both selection stages are sketched below).
Relevance to Interaction-Awareness in Generated Videos: When the attention maps in the interaction-dominant layers concentrate on the corresponding subject/object/union regions, generations are correct and preferred by human raters; when attention is diffuse or mislocalized, failures are common.
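A minimal NumPy sketch of this two-stage selection, under assumptions: `aas` is a [videos × layers] matrix of Attention Alignment Scores, `success` marks the human-verified outcomes, and the top-k cutoff and margin `tau` are illustrative choices rather than values from the paper.

```python
import numpy as np

def select_layers(aas, success, top_k=10, tau=0.0):
    """aas: [num_videos, num_layers] AAS matrix; success: bool [num_videos]."""
    num_layers = aas.shape[1]

    # Influential layers: rank-sum of (i) how often a layer lands in the
    # per-video AAS top-k and (ii) its mean AAS across all videos.
    per_video_topk = np.argsort(-aas, axis=1)[:, :top_k]
    freq = np.bincount(per_video_topk.ravel(), minlength=num_layers)
    mean_aas = aas.mean(axis=0)
    rank_sum = np.argsort(np.argsort(-freq)) + np.argsort(np.argsort(-mean_aas))
    influential = np.argsort(rank_sum)[:top_k]  # assumed cutoff

    # Dominant layers: among influential layers, keep those where successes
    # score above the overall mean and failures score below it.
    success_gap = aas[success].mean(axis=0) - mean_aas
    failure_gap = aas[~success].mean(axis=0) - mean_aas
    dominant = [l for l in influential
                if success_gap[l] > tau and failure_gap[l] < -tau]
    return influential, dominant
```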
Framework
Our analysis identifies a small set of interaction-dominant layers whose attention aligns well with human-verified successes, and shows that supervising these layers further improves the interaction-awareness of generated videos.
Motivated by our analysis, MATRIX introduces Semantic Grounding Alignment (SGA) and Semantic Propagation Alignment (SPA) losses that directly align attention maps in the interaction-dominant layers with ground-truth mask tracks.

SGA and SPA supervise video-to-text and video-to-video attention directly with ground-truth instance mask tracks. We apply these losses only to the interaction-dominant layers, routing alignment where it is most effective while leaving the remaining layers free to preserve video quality.
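As an illustration, here is a minimal PyTorch sketch of mask-aligned attention supervision. The tensor layout, the binary cross-entropy form of SGA, and the `1e-6` stabilizer are our assumptions; the paper's exact loss formulations may differ.

```python
import torch

def sga_loss(attn_v2t, text_idx, mask, eps=1e-6):
    """Semantic Grounding Alignment (sketch).
    attn_v2t: [B, N_vid, N_txt] video-to-text attention (rows sum to 1);
    text_idx: token indices of one instance's phrase;
    mask: [B, N_vid] binary {0,1} float mask track, flattened over space-time."""
    # Attention mass each video token assigns to the instance's text tokens.
    p = attn_v2t[..., text_idx].sum(-1)                       # [B, N_vid]
    # Pull mass toward the instance region and away from everywhere else.
    inside = -(torch.log(p + eps) * mask).sum() / mask.sum().clamp(min=1)
    outside = -(torch.log(1 - p + eps) * (1 - mask)).sum() / (1 - mask).sum().clamp(min=1)
    return inside + outside

def spa_loss(attn_v2v, mask, eps=1e-6):
    """Semantic Propagation Alignment (sketch).
    attn_v2v: [B, N_vid, N_vid] video-to-video attention; mask as above."""
    # Fraction of each masked token's attention that stays on its own instance.
    same_instance = torch.einsum('bij,bj->bi', attn_v2v, mask)  # [B, N_vid]
    return -(torch.log(same_instance + eps) * mask).sum() / mask.sum().clamp(min=1)
```

In training, these terms would be added with small weights to the standard denoising objective, and only at the selected interaction-dominant layers.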
InterGenEval

InterGenEval Protocol Pipeline
Results
Qualitative Comparisons
Quantitative Comparisons
Metrics are grouped as InterGenEval (KISA, SGI, IF), Human Fidelity (HA), and Video Quality (MS, IQ); higher is better throughout.

| Methods | KISA ↑ | SGI ↑ | IF ↑ | HA ↑ | MS ↑ | IQ ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| CogVideoX-2B-I2V | 0.420 | 0.470 | 0.445 | 0.937 | 0.993 | 69.69 |
| CogVideoX-5B-I2V | 0.406 | 0.491 | 0.449 | 0.936 | 0.987 | 69.66 |
| Open-Sora-11B-I2V | 0.453 | 0.508 | 0.480 | 0.891 | 0.992 | 63.32 |
| TaVid | 0.465 | 0.522 | 0.494 | 0.917 | 0.991 | 68.90 |
| MATRIX (Ours) | 0.546 | 0.641 | 0.593 | 0.954 | 0.994 | 69.73 |
Ablation Studies
| # | Method | KISA ↑ | SGI ↑ | IF ↑ | HA ↑ | MS ↑ | IQ ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (I) | Baseline (CogVideoX-5B-I2V) | 0.406 | 0.491 | 0.449 | 0.936 | 0.987 | 69.66 |
| (II) | TaVid | 0.465 | 0.522 | 0.494 | 0.917 | 0.991 | 68.90 |
| (III) | (I) + LoRA w/o layer selection | 0.445 | 0.526 | 0.486 | 0.940 | 0.994 | 69.77 |
| (IV) | (I) + LoRA w/ layer selection | 0.490 | 0.594 | 0.542 | 0.950 | 0.994 | 68.97 |
| (V) | (IV) + SPA loss | 0.451 | 0.540 | 0.496 | 0.937 | 0.995 | 70.26 |
| (VI) | (IV) + SGA loss | 0.509 | 0.592 | 0.550 | 0.952 | 0.994 | 69.62 |
| (VII) | (IV) + SPA loss + SGA loss (Ours) | 0.546 | 0.641 | 0.593 | 0.954 | 0.994 | 69.73 |
Citation
If you use this work or find it helpful, please consider citing:
@misc{jin2025matrixmasktrackalignment,
  title={MATRIX: Mask Track Alignment for Interaction-aware Video Generation},
  author={Siyoon Jin and Seongchan Kim and Dahyun Chung and Jaeho Lee and Hyunwook Choi and Jisu Nam and Jiyoung Kim and Seungryong Kim},
  year={2025},
  eprint={2510.07310},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.07310},
}