Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation

1Nanjing University
2Horizon Robotics
3China Mobile
arXiv | Code (coming soon)

Long-Horizon Generation (200 frames). ENkG sustains temporally coherent, texture-preserving rollouts up to 200 frames, substantially reducing drift and collapse artifacts typical of static AR decoding.

Abstract

Autoregressive (AR) architectures have achieved significant success in large language models (LLMs), inspiring their exploration for video generation. In LLMs, top-p/top-k sampling strategies work exceptionally well: language tokens have high semantic density and low redundancy, so a fixed candidate-set size already strikes a balance between semantic accuracy and generation diversity. In contrast, video tokens have low semantic density and high spatio-temporal redundancy. This mismatch makes static top-k/top-p strategies ineffective for video decoders: they either introduce unnecessary randomness in low-uncertainty regions (static backgrounds) or get locked into early errors in high-uncertainty regions (foreground objects). Prediction errors accumulate as more frames are generated and eventually severely degrade long-horizon quality. To address this, we propose Entropy-Guided k-Guard (ENkG) sampling, a simple yet effective strategy that adapts sampling to token-wise dispersion, quantified by the entropy of each token’s predicted distribution. ENkG uses adaptive candidate-set sizes: in low-entropy regions, it employs fewer candidates to suppress redundant noise and preserve structural integrity; in high-entropy regions, it uses more candidates to mitigate error compounding. ENkG is model-agnostic, training-free, and adds negligible overhead. Experiments demonstrate consistent improvements in perceptual quality and structural stability over static top-k/top-p strategies.

Motivation



High-entropy regions form repeating textures (e.g., sky, foliage, and road), while low-entropy regions concentrate in structured content with distinguishable textures (e.g., boundaries, edges between sky and trees, road markings and lane lines).



This figure illustrates the phenomenon of entropy collapse in static AR sampling, where blue regions indicate low entropy and red regions indicate high entropy. Our method effectively alleviates this issue.

We study why static top-$k$/top-$p$ decoding breaks in autoregressive video generation and propose Entropy-Guided k-Guard (ENkG), a training-free, model-agnostic sampler that adapts token candidate sets to predictive entropy to mitigate error accumulation and entropy collapse.

F1. Flat token distributions

Video tokens have low semantic density and high spatio-temporal redundancy, producing flatter predictive distributions. A fixed truncation rule (top-$k$/top-$p$) is therefore brittle and amplifies early mistakes.

F2. Entropy aligns with structure

High-entropy regions correspond to repeating textures (sky/foliage/road), while low-entropy regions cluster around structured content (boundaries, edges, lane markings). Static truncation ignores this structure.

F3. Entropy collapse in long horizons

As generation proceeds, low-entropy regions expand and frame-average entropy drops, causing texture wash-out, over-smoothing, and degenerate dynamics.
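
As a concrete reference for how these entropy measurements can be obtained, the following is a minimal PyTorch sketch of per-token normalized entropy; the function name, tensor shapes, and the frame-averaging usage in the comments are illustrative assumptions rather than the paper's exact implementation.

import math

import torch
import torch.nn.functional as F


def normalized_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of each token's predictive distribution,
    normalized to [0, 1] by log(vocab_size).

    logits: (..., vocab_size) decoder outputs; the vocab dimension is reduced.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)   # in nats
    return entropy / math.log(logits.shape[-1])


# Tracking the frame-average entropy over a rollout exposes the collapse
# described above: the mean drifts toward low values as frames accumulate.
# `frame_logits` is a hypothetical (num_tokens, vocab_size) tensor holding
# the logits for every token of one generated frame.
# frame_entropy = normalized_token_entropy(frame_logits).mean()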

Method

Our sampling strategy adapts the token-wise candidate set based on the (normalized) entropy of the model’s predictive distribution at each decoding step. For low-entropy tokens (e.g., static background regions), we sample from a smaller candidate set to reduce stochastic artifacts. For high-entropy tokens (e.g., dynamic foreground regions), we expand the candidate set to promote exploration and alleviate error accumulation during long autoregressive decoding.

Comparison of truncation-based sampling strategies: Top-k, Top-p, and ENkG.
ENkG compared to fixed truncation sampling strategies. (b) Top-k samples from a fixed-size set of the k most probable tokens. (c) Top-p (nucleus) samples from the smallest set whose cumulative probability exceeds a fixed threshold. (a) ENkG sets a token-dependent nucleus threshold using predictive entropy and includes a small k-guard of highest-probability tokens, then samples from the renormalized distribution over the resulting candidate set.
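
The caption above translates almost directly into code. Below is a minimal PyTorch sketch of ENkG candidate-set construction for a single token position; the linear mapping from normalized entropy to the nucleus threshold and the default values of p_min, p_max, and k_guard are illustrative assumptions, not the paper's reported configuration.

import math

import torch
import torch.nn.functional as F


def enkg_sample(logits: torch.Tensor,
                p_min: float = 0.5,
                p_max: float = 0.95,
                k_guard: int = 5) -> torch.Tensor:
    """Entropy-guided k-guard sampling for a single token position (sketch).

    logits: (vocab_size,) raw predictive logits for the current token.
    p_min/p_max/k_guard and the linear entropy-to-threshold schedule are
    illustrative assumptions.
    """
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Normalized predictive entropy in [0, 1].
    entropy = float(-(probs * log_probs).sum() / math.log(logits.numel()))

    # Token-dependent nucleus threshold: low entropy -> tight nucleus,
    # high entropy -> wide nucleus (assumed linear schedule).
    p_tok = p_min + (p_max - p_min) * entropy

    # Smallest prefix of probability-sorted tokens whose mass reaches p_tok.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    nucleus_size = int((cumulative < p_tok).sum()) + 1

    # k-guard: never sample from fewer than k_guard of the most probable tokens.
    keep = min(max(nucleus_size, k_guard), logits.numel())
    candidate_idx = sorted_idx[:keep]

    # Renormalize over the candidate set and sample one token id.
    candidate_probs = probs[candidate_idx]
    candidate_probs = candidate_probs / candidate_probs.sum()
    choice = torch.multinomial(candidate_probs, num_samples=1)
    return candidate_idx[choice]

With k_guard = 1 and a constant threshold this reduces to ordinary nucleus sampling; the entropy-dependent threshold tightens the candidate set on low-entropy (background) tokens and widens it on high-entropy (foreground) tokens, while the k-guard preserves a minimum amount of exploration even when the nucleus collapses to a single token.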

Results

Visual Comparisons

Qualitatively, ENkG reduces texture wash-out, color drift, and frame freezing. Baseline decoding (static top-k/top-p/greedy) often yields blurred road markings and vegetation, unnatural color shifts, or degenerate near-static rollouts; ENkG preserves high-frequency details and maintains plausible motion over time.

Quantitative Results

Quantitatively, ENkG yields consistent gains on both DiverseDrive and nuPlan. Across DrivingWorld, VaVIM, and Cosmos, ENkG reduces FVD/FID while remaining training-free, indicating improved temporal realism and per-frame fidelity under identical generation settings.

Model                     | DiverseDrive                        | nuPlan
                          | FVD↓   FID↓   LPIPS↓  PSNR↑  SSIM↑ | FVD↓   FID↓   LPIPS↓  PSNR↑  SSIM↑
DrivingWorld (top-k 30)   | 696    61.78  0.401   14.03  0.43  | 583    37.80  0.380   14.22  0.39
DrivingWorld (+ENkG)      | 489    26.61  0.350   15.87  0.45  | 565    31.34  0.360   14.96  0.40
VaVIM (greedy)            | 1473   91.75  0.396   16.46  0.50  | 927    65.26  0.315   14.82  0.44
VaVIM (+ENkG)             | 1055   46.76  0.426   14.76  0.46  | 1031   41.60  0.327   14.43  0.42
Cosmos (top-p 0.8)*       | 1260   87.82  0.48    16.56  0.54  | 814    80.45  0.29    17.52  0.54
Cosmos (+ENkG)*           | 1132   84.67  0.47    16.61  0.53  | 801    75.01  0.29    17.81  0.55

* Cosmos uses a fixed 33-frame generation window; metrics are computed on the first 33 frames (others use 75).

Ablation Study

Impact of Entropy-Adaptive Guidance

Removing entropy guidance leads to texture decay and color shifting, consistent with entropy collapse dynamics.

Without Entropy Guidance

With Entropy Guidance (Ours)

Impact of k-Guard Design

The k-guard enforces minimal exploration even in low-entropy regimes, mitigating degenerate static rollouts and frame-freezing.

Without k-Guard

With k-Guard (Ours)

BibTeX

@misc{han2026entropyguidedkguardsamplinglonghorizon,
      title={Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation}, 
      author={Yizhao Han and Tianxing Shi and Zhao Wang and Zifan Xu and Zhiyuan Pu and Mingxiao Li and Qian Zhang and Wei Yin and Xiao-Xiao Long},
      year={2026},
      eprint={2601.19488},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.19488}, 
}