reAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization

1National University of Singapore    2Stanford University    3The Chinese University of Hong Kong
reAR Teaser

News


[Oct 2025] Code, trained model, and project page are released.

Abstract


Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance still lags behind diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency: the AR-generated tokens may not be well-decoded by the tokenizer.

To address this, we propose reAR, a simple training strategy that introduces a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and to predict the embedding of the target token under a noisy context. It requires no changes to the tokenizer, generation order, inference pipeline, or external models.

Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of much larger state-of-the-art diffusion models (675M).

reAR Pipeline

Understanding the Bottleneck


Generator-Tokenizer Inconsistency

We identify two key sources of inconsistency between the autoregressive generator and the visual tokenizer:

1. Amplified Exposure Bias: During training with teacher forcing, the model predicts tokens given ground-truth context, but at inference it conditions on its own predictions. In visual AR, this leads to unseen token sequences that corrupt future predictions and spread structural artifacts across the image.

2. Embedding Unawareness: The AR model optimizes only discrete token indices without considering how these tokens are embedded by the tokenizer. However, decoded image quality depends on the embeddings of the generated tokens rather than their indices alone.

Amplified Exposure Bias

Embedding Unawareness

Token-wise Consistency Regularization

reAR addresses these issues through two complementary strategies: Noisy Context Regularization that exposes the model to perturbed context during training, and Codebook Embedding Regularization that aligns the generator's hidden states with the tokenizer's embedding space. This encourages the generator to be aware of how tokens are decoded into visual patches.
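The two strategies above can be sketched in PyTorch. This is a minimal illustration, not the released implementation: the function names, the MSE form of the embedding alignment, the two linear heads, and the weighting `lam` are all assumptions made for exposition.

```python
import torch
import torch.nn.functional as F

def perturb_context(tokens, vocab_size, noise_prob=0.1, generator=None):
    """Noisy Context Regularization (sketch): randomly replace a fraction of
    context tokens so the model trains on imperfect histories, mimicking the
    corrupted contexts it will condition on at inference time."""
    noise = torch.randint(vocab_size, tokens.shape, generator=generator)
    mask = torch.rand(tokens.shape, generator=generator) < noise_prob
    return torch.where(mask, noise, tokens)

def rear_loss(logits, hidden, targets, inputs, codebook,
              recover_head, predict_head, lam=1.0):
    """Codebook Embedding Regularization (sketch): alongside the standard
    next-token cross-entropy, the hidden state at each position is trained to
    recover the tokenizer embedding of the current token and to predict the
    embedding of the target token. `recover_head`/`predict_head` are assumed
    lightweight projections; `lam` balances the auxiliary terms."""
    # Standard next-token prediction on discrete indices.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    # Tokenizer embeddings of the current and target tokens.
    cur_emb = codebook[inputs]
    tgt_emb = codebook[targets]
    # Align the generator's hidden states with the tokenizer's embedding space.
    rec = F.mse_loss(recover_head(hidden), cur_emb)
    pred = F.mse_loss(predict_head(hidden), tgt_emb)
    return ce + lam * (rec + pred)
```

During training, `perturb_context` would be applied to the input sequence before the causal transformer's forward pass, and `rear_loss` would replace the plain cross-entropy objective; the tokenizer and inference pipeline are untouched, consistent with the method description above.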

Results


Generation Quality

Table 1 shows that reAR achieves strong results even with a standard raster-order AR model and a simple 2D patch tokenizer. reAR-S outperforms prior raster AR models like LlamaGen-XXL (FID 2.00 vs. 2.34; IS 295.7 vs. 253.9) using only 14% of the parameters (201M vs. 1.4B), and surpasses advanced-tokenizer AR models such as WeTok with just 13–15% of their size. It matches RAR and outperforms RandAR at similar scales, and reAR-L exceeds MAR-L and VAR-d30. While diffusion and masked-generation models remain strong, reAR narrows the gap with far fewer training epochs.

| Training Paradigm | Generation Model | Tokenizer Type | Tokenizer BPP16 | Training Epochs | #Params ↓ | FID ↓ | IS ↑ |
|---|---|---|---|---|---|---|---|
| Diffusion | LDM-4 | Patch-VAE | N/A | 200 | 400M | 3.60 | 247.7 |
| | DiT-XL | Patch-VAE | N/A | 1400 | 675M | 2.27 | 278.2 |
| | SiT-XL | Patch-VAE | N/A | 800 | 675M | 2.06 | 270.3 |
| | REPA | Patch-VAE | N/A | 800 | 675M | 1.42 | 305.7 |
| MAR | MAR-L | Patch-VAE | N/A | 800 | 479M | 1.98 | 290.3 |
| | MAR-H | Patch-VAE | N/A | 800 | 943M | 1.55 | 303.7 |
| Mask. | MaskGIT-re | Patch-VQ | 0.625 | 300 | 227M | 4.02 | 355.6 |
| | MAGVIT-v2 | Patch-VQ | 1.125 | 1080 | 307M | 1.78 | 319.4 |
| | Maskbit | Patch-LFQ | 0.875 | 1080 | 305M | 1.52 | 328.6 |
| | Mask-TiTok-64 | TiTok | 0.188 | 800 | 177M | 2.48 | 214.7 |
| | Mask-TiTok-128 | TiTok | 0.375 | 800 | 287M | 1.97 | 281.8 |
| VAR | VAR-d20 | VAR | 1.992 | 350 | 600M | 2.57 | 302.6 |
| | VAR-d30 | VAR | 1.992 | 350 | 2.0B | 1.92 | 323.1 |
| Rand. Causal AR | RAR-B | Patch-VQ | 0.625 | 400 | 261M | 1.95 | 290.5 |
| | RAR-L | Patch-VQ | 0.625 | 400 | 461M | 1.70 | 299.5 |
| | RAR-XL | Patch-VQ | 0.625 | 400 | 955M | 1.50 | 306.9 |
| | RandAR-L | Patch-VQ | 0.875 | 300 | 343M | 2.55 | 288.8 |
| | RandAR-XL | Patch-VQ | 0.875 | 300 | 775M | 2.25 | 317.8 |
| | RandAR-XXL | Patch-VQ | 0.875 | 300 | 1.4B | 2.15 | 322.0 |
| Tok. Causal AR | AR-FlexTok-XL | FlexTok | 0.125 | 300 | 1.3B | 2.02 | -- |
| | AR-GigaTok-XXL | GigaTok | 0.875 | 300 | 1.4B | 1.98 | 256.8 |
| | AR-WeTok-XL | WeTok | 1.667 | 300 | 1.5B | 2.31 | 276.6 |
| Raster. Causal AR | VQGAN-re | Patch-VQ | 0.875 | 100 | 1.4B | 5.20 | 280.3 |
| | Open-MAGVIT-v2 | Patch-LFQ | 1.125 | 300 | 1.5B | 2.33 | 271.8 |
| | LlamaGen-XL | Patch-VQ | 0.875 | 300 | 775M | 2.62 | 244.1 |
| | LlamaGen-XXL | Patch-VQ | 0.875 | 300 | 1.4B | 2.34 | 253.9 |
| | AR-L† | Patch-VQ | 0.625 | 400 | 461M | 3.02 | 256.2 |
| | reAR-S | Patch-VQ | 0.625 | 400 | 201M | 2.00 | 295.7 |
| | reAR-B | Patch-VQ | 0.625 | 400 | 261M | 1.91 | 300.9 |
| | reAR-L (cfg=10.0/11.0) | Patch-VQ | 0.625 | 400 | 461M | 1.86/1.90 | 316.9/323.2 |

Table 1: Results on 256x256 class-conditional generation on ImageNet-1K. "Mask." indicates masked generation; "Tok." denotes non-standard tokenization; "Rand." denotes randomized order; "Raster." denotes rasterization order. "†" indicates that the model is not publicly available and was trained with our implementation. BPP16 = 16×BPP (bits per pixel) measures the compression rate of discrete tokenizers and is not applicable ("N/A") to continuous tokenizers. "#Params" is the number of model parameters. "↑" and "↓" indicate whether higher or lower values are better, respectively.

Generalization

We also evaluate reAR on non-standard tokenizers TiTok and AliTok. Unlike RAR, which helps mainly on bidirectional tokenization, reAR consistently improves performance on both bidirectional (TiTok: 4.45 → 4.01) and unidirectional (AliTok: 1.50 → 1.42) tokenizers. Notably, it approaches diffusion-based REPA and outperforms Maskbit while using far fewer parameters (177M vs. 675M/305M).

| Model | Epochs | Params | FID ↓ |
|---|---|---|---|
| Maskbit | 1080 | 305M | 1.52 |
| REPA | 800 | 675M | 1.42 |
| AR-TiTok-b64 | 400 | 261M | 4.45 |
| RAR-TiTok-b64 | 400 | 261M | 4.07 |
| reAR-TiTok-b64 | 400 | 261M | 4.01 |
| AR-AliTok-B | 800 | 177M | 1.50 |
| RAR-B-AliTok | 800 | 177M | 1.52 |
| reAR-B-AliTok | 800 | 177M | 1.42 |

Table 2: Superior generalization ability. reAR adapts to different tokenizers and achieves state-of-the-art performance with smaller models.

Scaling Effect

We also study whether the scaling behavior of the original AR model is maintained with reAR. Specifically, we plot the FID at different training epochs for each model size. As Figure 1 shows, FID consistently decreases as model size and training iterations increase, revealing the potential of reAR for large-scale visual AR models.

Scaling Effect of reAR

Figure 1: Scaling Effect of reAR. As model size increases, the FID at each training step decreases consistently.

Sampling Speed

Like other autoregressive models, reAR benefits from KV caching to achieve high sampling speed. We measure throughput on a single A800 GPU with batch size 128. With a KV cache, autoregressive models run much faster than diffusion and MAR. Moreover, reAR-B-AliTok achieves lower FID at faster sampling speed even against parallel-decoding approaches such as Maskbit, TiTok, VAR, and RandAR.
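The speedup comes from caching each layer's keys and values so that every decoding step processes only the newest token. A minimal sketch of such a loop is below; the `step_fn(token, cache) -> (logits, cache)` interface is an assumption for illustration, not the released model's API.

```python
import torch

@torch.no_grad()
def sample_with_kv_cache(step_fn, bos, num_tokens, temperature=1.0):
    """KV-cached autoregressive sampling (sketch).

    `step_fn` consumes only the most recent token plus an opaque cache of
    past keys/values and returns next-token logits and the updated cache,
    so each step costs O(1) forward passes instead of re-encoding the
    whole prefix. `bos` is a (batch,) tensor of start tokens.
    """
    tokens, cache, cur = [bos], None, bos
    for _ in range(num_tokens):
        logits, cache = step_fn(cur, cache)          # attend over cached KV
        probs = torch.softmax(logits / temperature, dim=-1)
        cur = torch.multinomial(probs, 1).squeeze(-1)  # sample next token
        tokens.append(cur)
    return torch.stack(tokens, dim=-1)               # (batch, num_tokens + 1)
```

Without the cache, step t would recompute attention over all t previous tokens, giving quadratic total cost; with it, per-step cost stays constant in sequence length (at the price of storing the cached keys and values).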

Sampling Speed Comparison

Figure 2: Sampling Speed. Comparison of different methods on FID and throughput (images/sec).

Visual Quality Improvements

reAR demonstrates significant improvements in visual quality. The model generates more coherent and detailed images compared to baseline autoregressive models. The improvements are particularly notable in maintaining consistency across the entire image and reducing artifacts that typically arise from exposure bias.


Citation


@article{he2025rear,
    title={reAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization},
    author={Qiyuan He and Yicong Li and Haotian Ye and Jinghao Wang and Xinyao Liao and Pheng-Ann Heng and Stefano Ermon and James Zou and Angela Yao},
    year={2025},
    journal={arXiv preprint},
}