TL;DR: MATRIX identifies interaction-dominant layers in video DiTs and introduces a simple yet effective regularization that aligns their attention to multi-instance mask tracks, resulting in more interaction-aware video generation.

Teaser

Text Prompt: At a kitchen counter, a student in a white shirt places a notebook on a wooden table. Cups and plates are scattered nearby, and sunlight comes in through the window. The action demonstrates placing the notebook on the flat surface.

Text Prompt: A man in a white t-shirt lifts a large box from the ground near a delivery truck on a narrow street.

Text Prompt: The man in a blue shirt feeds a strawberry to the woman in a white chef coat.

Overview

We conduct a systematic analysis that formalizes interaction in video DiTs from two perspectives: semantic grounding and semantic propagation. We find that both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple yet effective regularization that aligns the attention of these layers with multi-instance mask tracks, enhancing interaction fidelity and semantic alignment while reducing drift and hallucination.

MATRIX teaser

MATRIX-11K Dataset: 11K videos with interaction-aware captions and instance-level mask tracks.

First systematic analysis of how video DiTs encode semantic grounding and semantic propagation, analyzed via 3D full attention, identifying interaction-dominant layers.

MATRIX Framework: a simple yet effective regularization that aligns attention in interaction-dominant layers with multi-instance mask tracks via two terms, the Semantic Grounding Alignment (SGA) and Semantic Propagation Alignment (SPA) losses.

InterGenEval Protocol: an interaction-aware evaluation protocol that measures Key Interaction Semantic Alignment (KISA), Semantic Grounding Integrity (SGI), Semantic Propagation Integrity (SPI), and Interaction Fidelity (IF).

State-of-the-art performance: significant interaction-fidelity gains over baselines while maintaining video quality.

Dataset

Our Dataset Curation Pipeline

MATRIX-11K illustration

Top: An LLM (a) identifies interaction triplets, (b) filters them using Dynamism and Contactness, and (c) extracts per-ID appearance descriptions. Bottom: A VLM then verifies candidates to select an anchor frame, from which SAM2 propagates masks to produce instance mask tracks.
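For illustration, the stages above could be wired together roughly as in the Python sketch below. The helper callables (LLM triplet extraction, VLM anchor selection, SAM2 propagation) and the threshold values are hypothetical placeholders, not the released implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Triplet:
    subject: str
    verb: str
    object: str
    subject_desc: str   # per-ID appearance description
    object_desc: str
    dynamism: float     # interaction-related scores used for filtering
    contactness: float

def curate_example(video, caption: str,
                   llm_extract: Callable[[str], List[Triplet]],
                   vlm_select_anchor: Callable,
                   sam2_propagate: Callable,
                   dyn_thr: float = 0.5, contact_thr: float = 0.5) -> dict:
    """Sketch of the curation flow; helpers and thresholds are illustrative."""
    # (a) The LLM identifies interaction triplets and appearance descriptions.
    triplets = llm_extract(caption)
    # (b) Filter triplets by Dynamism and Contactness.
    triplets = [t for t in triplets
                if t.dynamism >= dyn_thr and t.contactness >= contact_thr]
    # (c) The VLM verifies candidates and selects an anchor frame where the
    #     interacting instances are clearly visible.
    anchor_frame = vlm_select_anchor(video, triplets)
    # SAM2 propagates the anchor-frame masks through the video, yielding
    # per-instance mask tracks.
    mask_tracks = sam2_propagate(video, anchor_frame, triplets)
    return {"caption": caption, "triplets": triplets, "mask_tracks": mask_tracks}
```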

Curated Dataset Examples

Dataset Example. Given a video and its caption, we extract interaction information, including the subject, verb, object, their appearance descriptions, and interaction-related scores. Using this information, we extract instance mask tracks with SAM2 and a VLM.
Dataset Statistics. Our curated dataset, MATRIX-11K, consists of 11K videos with interaction-aware captions and corresponding instance mask tracks.
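To make the per-example contents concrete, one entry might be stored roughly as below. Every field name, path, and score value is an assumption for illustration, not the released schema; the caption reuses one of the teaser prompts.

```python
# Hypothetical layout of a single MATRIX-11K entry; field names, paths, and
# numeric values are illustrative only.
example = {
    "video": "videos/00042.mp4",
    "caption": "The man in a blue shirt feeds a strawberry to the woman "
               "in a white chef coat.",
    "interaction": {
        "subject": "man in a blue shirt",
        "verb": "feeds",
        "object": "strawberry",
        "appearance": {
            "subject": "a man wearing a blue shirt",
            "object": "a red strawberry",
        },
        "scores": {"dynamism": 0.8, "contactness": 0.9},  # used for filtering
    },
    # Instance mask tracks: one binary mask per frame per instance ID,
    # propagated by SAM2 from the VLM-selected anchor frame.
    "mask_tracks": {
        "subject": "masks/00042_subject.npz",
        "object": "masks/00042_object.npz",
    },
}
```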

Analysis

Comparisons diagram

(a) Influential layers: layers with high AAS that rank in the Top-10 for many videos.
(b) Dominant layers: the influential layers whose mean AAS clearly separates successes from failures.

Influential Layer Candidates: We identify influential layers by rank-summing each layer's frequency in the AAS top-10 and its mean AAS, revealing that high alignment consistently concentrates in a small subset of layers.

Dominant Layer Selection: Among the influential layers, we identify the interaction-dominant layers that most directly govern the outcome: those whose success gap is large and positive while their failure gap is large and negative relative to the overall mean.
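A minimal sketch of this two-stage selection, assuming a precomputed per-video, per-layer attention alignment score (AAS) matrix and human-verified success labels; the ranking details and values such as `top_k` and `n_influential` are illustrative, not the paper's exact settings.

```python
import numpy as np

def select_layers(aas: np.ndarray, success: np.ndarray,
                  top_k: int = 10, n_influential: int = 8):
    """Two-stage layer selection from attention alignment scores (AAS).

    aas:     (num_videos, num_layers) per-video, per-layer AAS.
    success: (num_videos,) boolean, human-verified interaction success.
    """
    _, num_layers = aas.shape

    # Influential layers: rank-sum of (i) how often each layer appears in a
    # video's AAS Top-10 and (ii) its mean AAS over all videos.
    topk = np.argsort(-aas, axis=1)[:, :top_k]
    freq = np.bincount(topk.ravel(), minlength=num_layers)
    mean_aas = aas.mean(axis=0)
    rank = lambda x: np.argsort(np.argsort(-x))          # 0 = best rank
    influential = np.argsort(rank(freq) + rank(mean_aas))[:n_influential]

    # Dominant layers: influential layers whose success gap is positive and
    # whose failure gap is negative relative to the overall mean (the paper
    # additionally requires both gaps to be large).
    success_gap = aas[success].mean(axis=0) - mean_aas
    failure_gap = aas[~success].mean(axis=0) - mean_aas
    dominant = [int(l) for l in influential
                if success_gap[l] > 0 and failure_gap[l] < 0]
    return influential, dominant
```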

Relevance to Interaction-Awareness in Generated Videos: When the attention maps in the interaction-dominant layers concentrate on the corresponding subject/object/union regions, generations are correct and preferred by human raters; when attention is diffuse or mislocalized, failures are common.

Framework

Our analysis shows that a small set of interaction-dominant layers aligns closely with human-verified success, and that strengthening these layers further improves the interaction-awareness of generated videos.

Motivated by our analysis, MATRIX introduces Semantic Grounding Alignment (SGA) and Semantic Propagation Alignment (SPA) losses that directly align attention maps in the interaction-dominant layers with ground-truth mask tracks.

MATRIX Framework illustration

SGA and SPA supervise video-to-text and video-to-video attention, respectively, directly with ground-truth instance mask tracks. We apply these losses only to the interaction-dominant layers, routing alignment where it is most effective while leaving the remaining layers untouched to preserve video quality.
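As a rough illustration of what such alignment terms could look like, the sketch below supervises one dominant layer's video-to-text and video-to-video attention with instance masks. The tensor shapes, normalization, and loss forms are assumptions, not the paper's exact formulation.

```python
import torch

def sga_loss(attn_v2t, masks, token_ids):
    """Semantic Grounding Alignment, sketched.

    attn_v2t:  (N_vid, N_txt) video-to-text attention of one dominant layer.
    masks:     (N_inst, N_vid) binary instance masks on the latent token grid.
    token_ids: (N_inst,) index of the text token naming each instance.
    Pulls each instance token's attention mass toward that instance's mask.
    """
    loss = attn_v2t.new_zeros(())
    for m, t in zip(masks.float(), token_ids):
        a = attn_v2t[:, t]
        p = a / a.sum().clamp(min=1e-6)     # attention as a distribution over video tokens
        q = m / m.sum().clamp(min=1.0)      # target: uniform inside the mask
        loss = loss + -(q * p.clamp(min=1e-8).log()).sum()   # cross-entropy
    return loss / max(len(token_ids), 1)

def spa_loss(attn_v2v, masks):
    """Semantic Propagation Alignment, sketched.

    attn_v2v: (N_vid, N_vid) video-to-video attention of one dominant layer.
    masks:    (N_inst, N_vid) binary instance mask tracks flattened over frames.
    Encourages tokens of an instance to keep attending within that instance
    across frames instead of drifting to unrelated regions.
    """
    loss = attn_v2v.new_zeros(())
    for m in masks.bool():
        if m.sum() == 0:
            continue
        rows = attn_v2v[m]                  # attention rows of in-mask tokens
        in_ratio = rows[:, m].sum(dim=-1) / rows.sum(dim=-1).clamp(min=1e-6)
        loss = loss + (1.0 - in_ratio).mean()
    return loss / max(len(masks), 1)
```

In training, such terms would presumably be added with small weights to the standard diffusion objective and applied only at the selected interaction-dominant layers, consistent with the description above.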

InterGenEval

InterGenEval Illustration

InterGenEval Protocol Pipeline

Results

Qualitative Comparisons

Quantitative Comparisons

Metrics: InterGenEval (KISA, SGI, IF), plus Human Fidelity and Video Quality metrics (HA, MS, IQ); higher is better for all.

| Methods | KISA ↑ | SGI ↑ | IF ↑ | HA ↑ | MS ↑ | IQ ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| CogVideoX-2B-I2V | 0.420 | 0.470 | 0.445 | 0.937 | 0.993 | 69.69 |
| CogVideoX-5B-I2V | 0.406 | 0.491 | 0.449 | 0.936 | 0.987 | 69.66 |
| Open-Sora-11B-I2V | 0.453 | 0.508 | 0.480 | 0.891 | 0.992 | 63.32 |
| TaVid | 0.465 | 0.522 | 0.494 | 0.917 | 0.991 | 68.90 |
| MATRIX (Ours) | 0.546 | 0.641 | 0.593 | 0.954 | 0.994 | 69.73 |

Ablation Studies

Metrics as in the table above.

| Methods | KISA ↑ | SGI ↑ | IF ↑ | HA ↑ | MS ↑ | IQ ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| (I) Baseline (CogVideoX-5B-I2V) | 0.406 | 0.491 | 0.449 | 0.936 | 0.987 | 69.66 |
| (II) TaVid | 0.465 | 0.522 | 0.494 | 0.917 | 0.991 | 68.90 |
| (III) (I) + LoRA w/o layer selection | 0.445 | 0.526 | 0.486 | 0.940 | 0.994 | 69.77 |
| (IV) (I) + LoRA w/ layer selection | 0.490 | 0.594 | 0.542 | 0.950 | 0.994 | 68.97 |
| (V) (IV) + SPA loss | 0.451 | 0.540 | 0.496 | 0.937 | 0.995 | 70.26 |
| (VI) (IV) + SGA loss | 0.509 | 0.592 | 0.550 | 0.952 | 0.994 | 69.62 |
| (VII) (IV) + SPA loss + SGA loss (Ours) | 0.546 | 0.641 | 0.593 | 0.954 | 0.994 | 69.73 |

Citation

If you use this work or find it helpful, please consider citing:

@misc{jin2025matrixmasktrackalignment,
title={MATRIX: Mask Track Alignment for Interaction-aware Video Generation}, 
author={Siyoon Jin and Seongchan Kim and Dahyun Chung and Jaeho Lee and Hyunwook Choi and Jisu Nam and Jiyoung Kim and Seungryong Kim},
year={2025},
eprint={2510.07310},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.07310}, 
}