Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation

1Nanjing University
2Horizon Robotics
3China Mobile
arXiv | Code (coming soon)

Long-Horizon Generation (200 frames). ENkG sustains temporally coherent, texture-preserving rollouts up to 200 frames, substantially reducing drift and collapse artifacts typical of static AR decoding.

Abstract

Autoregressive (AR) architectures have achieved significant success in large language models (LLMs), inspiring their exploration for video generation. In LLMs, top-p/top-k sampling strategies work exceptionally well: language tokens have high semantic density and low redundancy, so a fixed candidate-set size already strikes a balance between semantic accuracy and generation diversity. In contrast, video tokens have low semantic density and high spatio-temporal redundancy. This mismatch makes static top-k/top-p strategies ineffective for video decoders: they either introduce unnecessary randomness in low-uncertainty regions (static backgrounds) or get locked into early errors in high-uncertainty regions (foreground objects). Prediction errors accumulate as more frames are generated and eventually severely degrade long-horizon quality. To address this, we propose Entropy-Guided k-Guard (ENkG) sampling, a simple yet effective strategy that adapts sampling to token-wise dispersion, quantified by the entropy of each token’s predicted distribution. ENkG uses adaptive candidate-set sizes: in low-entropy regions, it employs fewer candidates to suppress redundant noise and preserve structural integrity; in high-entropy regions, it uses more candidates to mitigate error compounding. ENkG is model-agnostic, training-free, and adds negligible overhead. Experiments demonstrate consistent improvements in perceptual quality and structural stability over static top-k/top-p strategies.

Motivation



High-entropy regions form repeating textures (e.g., sky, foliage, and road), while low-entropy regions concentrate in structured content with distinguishable textures (e.g., boundaries, edges between sky and trees, road markings and lane lines).



This figure illustrates the phenomenon of entropy collapse in static AR sampling, where blue regions indicate low entropy and red regions indicate high entropy. Our method effectively alleviates this issue.

We study why static top-$k$/top-$p$ decoding breaks in autoregressive video generation and propose Entropy-Guided k-Guard (ENkG), a training-free, model-agnostic sampler that adapts token candidate sets to predictive entropy to mitigate error accumulation and entropy collapse.

F1. Flat token distributions

Video tokens have low semantic density and high spatio-temporal redundancy, producing flatter predictive distributions. A fixed truncation rule (top-$k$/top-$p$) is therefore brittle and amplifies early mistakes.

F2. Entropy aligns with structure

High-entropy regions correspond to repeating textures (sky/foliage/road), while low-entropy regions cluster around structured content (boundaries, edges, lane markings). Static truncation ignores this structure.

F3. Entropy collapse in long horizons

As generation proceeds, low-entropy regions expand and frame-average entropy drops, causing texture wash-out, over-smoothing, and degenerate dynamics.
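
As a concrete reference for how these entropy measurements can be obtained, the following is a minimal PyTorch sketch of per-token normalized entropy; the function name, tensor shapes, and the frame-averaging usage in the comments are illustrative assumptions rather than the paper's exact implementation.

import math

import torch
import torch.nn.functional as F


def normalized_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of each token's predictive distribution,
    normalized to [0, 1] by log(vocab_size).

    logits: (..., vocab_size) decoder outputs; the vocab dimension is reduced.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)   # in nats
    return entropy / math.log(logits.shape[-1])


# Tracking the frame-average entropy over a rollout exposes the collapse
# described above: the mean drifts toward low values as frames accumulate.
# `frame_logits` is a hypothetical (num_tokens, vocab_size) tensor holding
# the logits for every token of one generated frame.
# frame_entropy = normalized_token_entropy(frame_logits).mean()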

Method

Our sampling strategy adapts the token-wise candidate set based on the (normalized) entropy of the model’s predictive distribution at each decoding step. For low-entropy tokens (e.g., static background regions), we sample from a smaller candidate set to reduce stochastic artifacts. For high-entropy tokens (e.g., dynamic foreground regions), we expand the candidate set to promote exploration and alleviate error accumulation during long autoregressive decoding.

Comparison of truncation-based sampling strategies: Top-k, Top-p, and ENkG.
ENkG compared to fixed truncation sampling strategies. (b) Top-k samples from a fixed-size set of the k most probable tokens. (c) Top-p (nucleus) samples from the smallest set whose cumulative probability exceeds a fixed threshold. (a) ENkG sets a token-dependent nucleus threshold using predictive entropy and includes a small k-guard of highest-probability tokens, then samples from the renormalized distribution over the resulting candidate set.
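
The caption above translates almost directly into code. Below is a minimal PyTorch sketch of ENkG candidate-set construction for a single token position; the linear mapping from normalized entropy to the nucleus threshold and the default values of p_min, p_max, and k_guard are illustrative assumptions, not the paper's reported configuration.

import math

import torch
import torch.nn.functional as F


def enkg_sample(logits: torch.Tensor,
                p_min: float = 0.5,
                p_max: float = 0.95,
                k_guard: int = 5) -> torch.Tensor:
    """Entropy-guided k-guard sampling for a single token position (sketch).

    logits: (vocab_size,) raw predictive logits for the current token.
    p_min/p_max/k_guard and the linear entropy-to-threshold schedule are
    illustrative assumptions.
    """
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Normalized predictive entropy in [0, 1].
    entropy = float(-(probs * log_probs).sum() / math.log(logits.numel()))

    # Token-dependent nucleus threshold: low entropy -> tight nucleus,
    # high entropy -> wide nucleus (assumed linear schedule).
    p_tok = p_min + (p_max - p_min) * entropy

    # Smallest prefix of probability-sorted tokens whose mass reaches p_tok.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    nucleus_size = int((cumulative < p_tok).sum()) + 1

    # k-guard: never sample from fewer than k_guard of the most probable tokens.
    keep = min(max(nucleus_size, k_guard), logits.numel())
    candidate_idx = sorted_idx[:keep]

    # Renormalize over the candidate set and sample one token id.
    candidate_probs = probs[candidate_idx]
    candidate_probs = candidate_probs / candidate_probs.sum()
    choice = torch.multinomial(candidate_probs, num_samples=1)
    return candidate_idx[choice]

With k_guard = 1 and a constant threshold this reduces to ordinary nucleus sampling; the entropy-dependent threshold tightens the candidate set on low-entropy (background) tokens and widens it on high-entropy (foreground) tokens, while the k-guard preserves a minimum amount of exploration even when the nucleus collapses to a single token.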

Results

Visual Comparisons

Qualitatively, ENkG reduces texture wash-out, color drift, and frame freezing. Baseline decoding (static top-k/top-p/greedy) often yields blurred road markings and vegetation, unnatural color shifts, or degenerate near-static rollouts; ENkG preserves high-frequency details and maintains plausible motion over time.

Quantitative Results

Quantitatively, ENkG yields consistent gains on both DiverseDrive and nuPlan. Across DrivingWorld, VaVIM, and Cosmos, ENkG reduces FVD/FID while remaining training-free, indicating improved temporal realism and per-frame fidelity under identical generation settings.

Model                     | DiverseDrive                        | nuPlan
                          | FVD↓   FID↓   LPIPS↓  PSNR↑  SSIM↑ | FVD↓   FID↓   LPIPS↓  PSNR↑  SSIM↑
DrivingWorld (top-k 30)   | 696    61.78  0.401   14.03  0.43  | 583    37.80  0.380   14.22  0.39
DrivingWorld (+ENkG)      | 489    26.61  0.350   15.87  0.45  | 565    31.34  0.360   14.96  0.40
VaVIM (greedy)            | 1473   91.75  0.396   16.46  0.50  | 927    65.26  0.315   14.82  0.44
VaVIM (+ENkG)             | 1055   46.76  0.426   14.76  0.46  | 1031   41.60  0.327   14.43  0.42
Cosmos (top-p 0.8)*       | 1260   87.82  0.48    16.56  0.54  | 814    80.45  0.29    17.52  0.54
Cosmos (+ENkG)*           | 1132   84.67  0.47    16.61  0.53  | 801    75.01  0.29    17.81  0.55

* Cosmos uses a fixed 33-frame generation window; metrics are computed on the first 33 frames (others use 75).

Ablation Study

Impact of Entropy-Adaptive Guidance

Removing entropy guidance leads to texture decay and color shifting, consistent with entropy collapse dynamics.

Without Entropy Guidance

With Entropy Guidance (Ours)

Impact of k-Guard Design

The k-guard enforces minimal exploration even in low-entropy regimes, mitigating degenerate static rollouts and frame-freezing.

Without k-Guard

With k-Guard (Ours)

BibTeX

@misc{han2026entropyguidedkguardsamplinglonghorizon,
      title={Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation}, 
      author={Yizhao Han and Tianxing Shi and Zhao Wang and Zifan Xu and Zhiyuan Pu and Mingxiao Li and Qian Zhang and Wei Yin and Xiao-Xiao Long},
      year={2026},
      eprint={2601.19488},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.19488}, 
}