Teaser
Text Prompt: At a kitchen counter, a student in a white shirt places a notebook on a wooden table. Cups and plates are scattered nearby, and sunlight comes in through the window. The action demonstrates placing the notebook on the flat surface.
Text Prompt: A man in a white t-shirt lifts a large box from the ground near a delivery truck on a narrow street.
Text Prompt: The man in a blue shirt feeds a strawberry to the woman in a white chef coat.
Overview
We conduct a systematic analysis that formalizes interaction in video DiTs from two perspectives: semantic grounding and semantic propagation. We find that both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in these layers with multi-instance mask tracks, enhancing interaction fidelity and semantic alignment while reducing drift and hallucination.

MATRIX-11K Dataset – 11K videos with interaction-aware captions and instance-level mask tracks.
First systematic analysis of how video DiTs encode semantic grounding and semantic propagation through their 3D full attention, identifying interaction-dominant layers.
MATRIX Framework – a simple yet effective regularization that aligns attention in the interaction-dominant layers with multi-instance mask tracks via two terms: the Semantic Grounding Alignment (SGA) and Semantic Propagation Alignment (SPA) losses.
InterGenEval Protocol – an interaction-aware evaluation protocol that measures Key Interaction Semantic Alignment (KISA), Semantic Grounding Integrity (SGI), Semantic Propagation Integrity (SPI), and Interaction Fidelity (IF).
State-of-the-art performance – significant interaction-fidelity gains over baselines while maintaining video quality.
Dataset
Our Dataset Curation Pipeline

Top: An LLM (a) identifies interaction triplets, (b) filters them by Dynamism and Contactness, and (c) extracts per-ID appearance descriptions. Bottom: A VLM then verifies the candidates to select an anchor frame, from which SAM2 propagates masks to produce instance-level mask tracks.
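For concreteness, here is a minimal Python sketch of the curation pipeline under stated assumptions: the LLM/VLM helpers (`extract_triplets`, `describe`, `select_anchor`) and the Dynamism/Contactness thresholds are hypothetical placeholders, while the SAM2 calls follow the public `sam2` video-predictor API.

```python
from sam2.build_sam import build_sam2_video_predictor

def curate_video(video_dir, caption, llm, vlm, predictor):
    # (a) LLM identifies <subject, action, object> interaction triplets.
    triplets = llm.extract_triplets(caption)                  # hypothetical helper
    # (b) Filter triplets by Dynamism and Contactness (thresholds assumed).
    triplets = [t for t in triplets if t.dynamism > 0.5 and t.contactness > 0.5]
    # (c) Extract a per-ID appearance description for each instance.
    descriptions = {t.id: llm.describe(t) for t in triplets}  # hypothetical helper

    # A VLM verifies the candidates and selects an anchor frame,
    # returning one bounding box per instance ID on that frame.
    anchor_idx, boxes = vlm.select_anchor(video_dir, descriptions)  # hypothetical

    # SAM2 propagates masks from the anchor frame to build instance mask tracks.
    state = predictor.init_state(video_path=video_dir)
    for obj_id, box in boxes.items():
        predictor.add_new_points_or_box(state, frame_idx=anchor_idx,
                                        obj_id=obj_id, box=box)
    mask_tracks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        for i, obj_id in enumerate(obj_ids):
            mask_tracks.setdefault(obj_id, {})[frame_idx] = mask_logits[i] > 0.0
    return mask_tracks

# Example setup (checkpoint/config names illustrative):
# predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")
```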
Curated Dataset Examples
Analysis

(a) Influential layers: layers with high AAS that rank in the top-10 for many videos.
(b) Dominant layers: the influential layers whose mean AAS clearly separates successes from failures.
Influential Layer Candidates: We identify influential layers by rank-summing each layer's frequency in the AAS top-10 and its mean AAS, revealing that high alignment consistently concentrates in a small subset of layers.
Dominant Layer Selection: Among the influential layers, we identify the interaction-dominant layers that most directly govern the outcome: those where the success gap is large and positive and the failure gap is large and negative relative to the overall mean (both selection stages are sketched below).
Relevance to Interaction-Awareness in Generated Videos: When the attention maps in the interaction-dominant layers concentrate on the corresponding subject/object/union regions, generations are correct and preferred by human raters; when attention is diffuse or mislocalized, failures are common.
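A minimal NumPy sketch of this two-stage selection, under assumptions: `aas` is a [videos × layers] matrix of Attention Alignment Scores, `success` marks the human-verified outcomes, and the top-k cutoff and margin `tau` are illustrative choices rather than values from the paper.

```python
import numpy as np

def select_layers(aas, success, top_k=10, tau=0.0):
    """aas: [num_videos, num_layers] AAS matrix; success: bool [num_videos]."""
    num_layers = aas.shape[1]

    # Influential layers: rank-sum of (i) how often a layer lands in the
    # per-video AAS top-k and (ii) its mean AAS across all videos.
    per_video_topk = np.argsort(-aas, axis=1)[:, :top_k]
    freq = np.bincount(per_video_topk.ravel(), minlength=num_layers)
    mean_aas = aas.mean(axis=0)
    rank_sum = np.argsort(np.argsort(-freq)) + np.argsort(np.argsort(-mean_aas))
    influential = np.argsort(rank_sum)[:top_k]  # assumed cutoff

    # Dominant layers: among influential layers, keep those where successes
    # score above the overall mean and failures score below it.
    success_gap = aas[success].mean(axis=0) - mean_aas
    failure_gap = aas[~success].mean(axis=0) - mean_aas
    dominant = [l for l in influential
                if success_gap[l] > tau and failure_gap[l] < -tau]
    return influential, dominant
```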
Framework
Our analysis identifies a small set of interaction-dominant layers whose attention aligns well with human-verified successes, and shows that supervising these layers further improves the interaction-awareness of generated videos.
Motivated by our analysis, MATRIX introduces Semantic Grounding Alignment (SGA) and Semantic Propagation Alignment (SPA) losses that directly align attention maps in the interaction-dominant layers with ground-truth mask tracks.

SGA and SPA supervise video-to-text and video-to-video attention directly with ground-truth instance mask tracks. We apply these losses only to the interaction-dominant layers, routing alignment where it is most effective while leaving the remaining layers free to preserve video quality.
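As an illustration, here is a minimal PyTorch sketch of mask-aligned attention supervision. The tensor layout, the binary cross-entropy form of SGA, and the `1e-6` stabilizer are our assumptions; the paper's exact loss formulations may differ.

```python
import torch

def sga_loss(attn_v2t, text_idx, mask, eps=1e-6):
    """Semantic Grounding Alignment (sketch).
    attn_v2t: [B, N_vid, N_txt] video-to-text attention (rows sum to 1);
    text_idx: token indices of one instance's phrase;
    mask: [B, N_vid] binary {0,1} float mask track, flattened over space-time."""
    # Attention mass each video token assigns to the instance's text tokens.
    p = attn_v2t[..., text_idx].sum(-1)                       # [B, N_vid]
    # Pull mass toward the instance region and away from everywhere else.
    inside = -(torch.log(p + eps) * mask).sum() / mask.sum().clamp(min=1)
    outside = -(torch.log(1 - p + eps) * (1 - mask)).sum() / (1 - mask).sum().clamp(min=1)
    return inside + outside

def spa_loss(attn_v2v, mask, eps=1e-6):
    """Semantic Propagation Alignment (sketch).
    attn_v2v: [B, N_vid, N_vid] video-to-video attention; mask as above."""
    # Fraction of each masked token's attention that stays on its own instance.
    same_instance = torch.einsum('bij,bj->bi', attn_v2v, mask)  # [B, N_vid]
    return -(torch.log(same_instance + eps) * mask).sum() / mask.sum().clamp(min=1)
```

In training, these terms would be added with small weights to the standard denoising objective, and only at the selected interaction-dominant layers.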
InterGenEval

InterGenEval Protocol Pipeline
Results
Qualitative Comparisons
Quantitative Comparisons
Metrics are grouped as InterGenEval (KISA, SGI, IF), Human Fidelity (HA), and Video Quality (MS, IQ); higher is better throughout.

| Methods | KISA ↑ | SGI ↑ | IF ↑ | HA ↑ | MS ↑ | IQ ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| CogVideoX-2B-I2V | 0.420 | 0.470 | 0.445 | 0.937 | 0.993 | 69.69 |
| CogVideoX-5B-I2V | 0.406 | 0.491 | 0.449 | 0.936 | 0.987 | 69.66 |
| Open-Sora-11B-I2V | 0.453 | 0.508 | 0.480 | 0.891 | 0.992 | 63.32 |
| TaVid | 0.465 | 0.522 | 0.494 | 0.917 | 0.991 | 68.90 |
| MATRIX (Ours) | 0.546 | 0.641 | 0.593 | 0.954 | 0.994 | 69.73 |
Ablation Studies
| # | Method | KISA ↑ | SGI ↑ | IF ↑ | HA ↑ | MS ↑ | IQ ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (I) | Baseline (CogVideoX-5B-I2V) | 0.406 | 0.491 | 0.449 | 0.936 | 0.987 | 69.66 |
| (II) | TaVid | 0.465 | 0.522 | 0.494 | 0.917 | 0.991 | 68.90 |
| (III) | (I) + LoRA w/o layer selection | 0.445 | 0.526 | 0.486 | 0.940 | 0.994 | 69.77 |
| (IV) | (I) + LoRA w/ layer selection | 0.490 | 0.594 | 0.542 | 0.950 | 0.994 | 68.97 |
| (V) | (IV) + SPA loss | 0.451 | 0.540 | 0.496 | 0.937 | 0.995 | 70.26 |
| (VI) | (IV) + SGA loss | 0.509 | 0.592 | 0.550 | 0.952 | 0.994 | 69.62 |
| (VII) | (IV) + SPA loss + SGA loss (Ours) | 0.546 | 0.641 | 0.593 | 0.954 | 0.994 | 69.73 |
Citation
If you use this work or find it helpful, please consider citing:
@misc{jin2025matrixmasktrackalignment,
  title={MATRIX: Mask Track Alignment for Interaction-aware Video Generation},
  author={Siyoon Jin and Seongchan Kim and Dahyun Chung and Jaeho Lee and Hyunwook Choi and Jisu Nam and Jiyoung Kim and Seungryong Kim},
  year={2025},
  eprint={2510.07310},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.07310},
}