Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance still lags behind diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency: the AR-generated tokens may not be well decoded by the tokenizer.
To address this, we propose reAR, a simple training strategy that introduces a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and to predict the embedding of the target token under a noisy context. reAR requires no changes to the tokenizer, generation order, inference pipeline, or external models.
Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching much larger state-of-the-art diffusion models (675M).
We identify two key sources of inconsistency between the autoregressive generator and the visual tokenizer:
1. Amplified Exposure Bias: During training with teacher forcing, the model predicts tokens given ground-truth context, but at inference it conditions on its own predictions. In visual AR, this leads to unseen token sequences that corrupt future predictions and spread structural artifacts across the image.
2. Embedding Unawareness: The AR model optimizes only discrete token indices without considering how these tokens are embedded by the tokenizer. However, decoded image quality depends on the embeddings of the generated tokens rather than their indices alone.
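The second point can be made concrete with a toy sketch (the codebook here is a random stand-in, not the paper's tokenizer): two predictions that are equally "wrong" at the index level can sit at very different distances in the embedding space, and the embedding is what the decoder actually consumes.

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.normal(size=(16, 8))  # toy stand-in for a tokenizer codebook

target = 0
# rank all codes by embedding distance to the target code
dists = np.linalg.norm(codebook - codebook[target], axis=1)
near_wrong = int(np.argsort(dists)[1])  # closest non-target code
far_wrong = int(np.argmax(dists))       # most distant code

# both predictions have 0% index accuracy, yet the decoder receives
# very different inputs in embedding space
err_near = float(np.linalg.norm(codebook[near_wrong] - codebook[target]))
err_far = float(np.linalg.norm(codebook[far_wrong] - codebook[target]))
```

A token-index loss treats `near_wrong` and `far_wrong` identically, while the decoded patch quality depends on `err_near` versus `err_far`.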
Figure: Illustration of the two inconsistency sources, Amplified Exposure Bias and Embedding Unawareness.
reAR addresses these issues through two complementary strategies: Noisy Context Regularization that exposes the model to perturbed context during training, and Codebook Embedding Regularization that aligns the generator's hidden states with the tokenizer's embedding space. This encourages the generator to be aware of how tokens are decoded into visual patches.
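The two strategies can be sketched together in a minimal numpy mock-up. Everything here is an illustrative assumption rather than the paper's exact formulation: `h_cur` and `h_next` stand in for projection heads over the transformer's hidden states, the losses are unweighted, and the codebook is random.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, T = 16, 8, 10                  # vocab size, embedding dim, sequence length
codebook = rng.normal(size=(V, D))   # toy stand-in for the tokenizer codebook

def noisy_context(tokens, p=0.1):
    # Noisy Context Regularization (sketch): with probability p, replace a
    # ground-truth context token with a random index, so training exposes
    # the model to imperfect prefixes like those seen at inference
    mask = rng.random(tokens.shape) < p
    return np.where(mask, rng.integers(0, V, size=tokens.shape), tokens)

def rear_loss(logits, h_cur, h_next, tokens):
    # standard next-token cross-entropy: position t predicts token t+1
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    ce = -np.log(probs[np.arange(T - 1), tokens[1:]]).mean()
    # Codebook Embedding Regularization (sketch): align hidden-state
    # projections with the embeddings of the current and target tokens
    reg_cur = np.mean((h_cur - codebook[tokens[:-1]]) ** 2)
    reg_next = np.mean((h_next - codebook[tokens[1:]]) ** 2)
    return ce + reg_cur + reg_next

tokens = rng.integers(0, V, size=T)
ctx = noisy_context(tokens)               # perturbed training context
logits = rng.normal(size=(T - 1, V))      # stand-in transformer outputs
h_cur = rng.normal(size=(T - 1, D))       # stand-in projection-head outputs
h_next = rng.normal(size=(T - 1, D))
loss = rear_loss(logits, h_cur, h_next, tokens)
```

Because the regularizers only add terms to the training loss, the inference pipeline is untouched, consistent with the claim that reAR changes neither the tokenizer nor the sampling procedure.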
Table 1 shows that reAR achieves strong results even with a standard raster-order AR model and a simple 2D patch tokenizer. reAR-S outperforms prior raster AR models like LlamaGen-XXL (FID 2.00 vs. 2.34; IS 295.7 vs. 253.9) using only 14% of the parameters (201M vs. 1.4B), and surpasses advanced-tokenizer AR models such as WeTok at just 13–15% of their size. It matches RAR and outperforms RandAR at similar scales, and reAR-L exceeds MAR-L and VAR-d30. While diffusion and masked-generation models remain strong, reAR narrows the gap with far fewer training epochs.
| Training Paradigm | Generation Model | Tokenizer Type | Tokenizer BPP16 ↓ | Training Epochs | #Params ↓ | FID ↓ | IS ↑ |
|---|---|---|---|---|---|---|---|
| Diffusion | LDM-4 | Patch-VAE | N/A | 200 | 400M | 3.60 | 247.7 |
| | DiT-XL | Patch-VAE | N/A | 1400 | 675M | 2.27 | 278.2 |
| | SiT-XL | Patch-VAE | N/A | 800 | 675M | 2.06 | 270.3 |
| | REPA | Patch-VAE | N/A | 800 | 675M | 1.42 | 305.7 |
| MAR | MAR-L | Patch-VAE | N/A | 800 | 479M | 1.98 | 290.3 |
| | MAR-H | Patch-VAE | N/A | 800 | 943M | 1.55 | 303.7 |
| Mask. | MaskGIT-re | Patch-VQ | 0.625 | 300 | 227M | 4.02 | 355.6 |
| | MAGVIT-v2 | Patch-VQ | 1.125 | 1080 | 307M | 1.78 | 319.4 |
| | Maskbit | Patch-LFQ | 0.875 | 1080 | 305M | 1.52 | 328.6 |
| | Mask-TiTok-64 | TiTok | 0.188 | 800 | 177M | 2.48 | 214.7 |
| | Mask-TiTok-128 | TiTok | 0.375 | 800 | 287M | 1.97 | 281.8 |
| VAR | VAR-d20 | VAR | 1.992 | 350 | 600M | 2.57 | 302.6 |
| | VAR-d30 | VAR | 1.992 | 350 | 2.0B | 1.92 | 323.1 |
| Rand. Causal AR | RAR-B | Patch-VQ | 0.625 | 400 | 261M | 1.95 | 290.5 |
| | RAR-L | Patch-VQ | 0.625 | 400 | 461M | 1.70 | 299.5 |
| | RAR-XL | Patch-VQ | 0.625 | 400 | 955M | 1.50 | 306.9 |
| | RandAR-L | Patch-VQ | 0.875 | 300 | 343M | 2.55 | 288.8 |
| | RandAR-XL | Patch-VQ | 0.875 | 300 | 775M | 2.25 | 317.8 |
| | RandAR-XXL | Patch-VQ | 0.875 | 300 | 1.4B | 2.15 | 322.0 |
| Tok. Causal AR | AR-FlexTok-XL | FlexTok | 0.125 | 300 | 1.3B | 2.02 | -- |
| | AR-GigaTok-XXL | GigaTok | 0.875 | 300 | 1.4B | 1.98 | 256.8 |
| | AR-WeTok-XL | WeTok | 1.667 | 300 | 1.5B | 2.31 | 276.6 |
| Raster. Causal AR | VQGAN-re | Patch-VQ | 0.875 | 100 | 1.4B | 5.20 | 280.3 |
| | Open-MAGVIT-v2 | Patch-LFQ | 1.125 | 300 | 1.5B | 2.33 | 271.8 |
| | LlamaGen-XL | Patch-VQ | 0.875 | 300 | 775M | 2.62 | 244.1 |
| | LlamaGen-XXL | Patch-VQ | 0.875 | 300 | 1.4B | 2.34 | 253.9 |
| | AR-L† | Patch-VQ | 0.625 | 400 | 461M | 3.02 | 256.2 |
| | reAR-S | Patch-VQ | 0.625 | 400 | 201M | 2.00 | 295.7 |
| | reAR-B | Patch-VQ | 0.625 | 400 | 261M | 1.91 | 300.9 |
| | reAR-L (cfg=10.0/11.0) | Patch-VQ | 0.625 | 400 | 461M | 1.86/1.90 | 316.9/323.2 |
Table 1: Results on 256x256 class-conditional generation on ImageNet-1K. "Mask." indicates masked generation; "Tok." denotes non-standard tokenization; "Rand." denotes randomized order; "Raster." denotes rasterization order. "†" indicates that the model is not publicly released and is trained with our implementation. BPP16 = 16×BPP (bits per pixel) measures the compression rate of discrete tokenizers and is not applicable ("N/A") to continuous tokenizers. "#Params" is the number of model parameters. "↑" and "↓" indicate whether higher or lower values are better, respectively.
We also evaluate reAR on the non-standard tokenizers TiTok and AliTok. Unlike RAR, which mainly helps with bidirectional tokenization, reAR consistently improves performance on both bidirectional (TiTok: 4.45 → 4.01) and unidirectional (AliTok: 1.50 → 1.42) tokenizers. Notably, it approaches diffusion-based REPA and outperforms Maskbit while using far fewer parameters (177M vs. 675M/305M).
| Model | Epochs | Params | FID ↓ |
|---|---|---|---|
| Maskbit | 1080 | 305M | 1.52 |
| REPA | 800 | 675M | 1.42 |
| AR-TiTok-b64 | 400 | 261M | 4.45 |
| RAR-TiTok-b64 | 400 | 261M | 4.07 |
| reAR-TiTok-b64 | 400 | 261M | 4.01 |
| AR-AliTok-B | 800 | 177M | 1.50 |
| RAR-B-AliTok | 800 | 177M | 1.52 |
| reAR-B-AliTok | 800 | 177M | 1.42 |
Table 2: Superior generalization ability. reAR adapts to different tokenizers and achieves state-of-the-art performance with smaller models.
We also study whether reAR preserves the scaling behavior of the original AR model. Specifically, we plot FID at different training epochs for each model size. As Figure 1 shows, FID consistently decreases as model size and training iterations increase, revealing the potential of reAR for large-scale visual AR models.
Figure 1: Scaling Effect of reAR. As model size increases, the FID at each training step decreases consistently.
Like other autoregressive models, reAR benefits from a KV-cache to achieve high sampling speed. We measure throughput on a single A800 GPU with batch size 128. With KV-caching, autoregressive models run much faster than diffusion and MAR. Moreover, reAR-B-AliTok achieves lower FID at higher sampling speed even against parallel-decoding approaches such as Maskbit, TiTok, VAR, and RandAR.
Figure 2: Sampling Speed. Comparison of different methods on FID and throughput (images/sec).
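The KV-cache speedup comes from reusing the key/value projections of the prefix instead of recomputing them at every decoding step. Below is a minimal single-head numpy sketch (random weights, no transformer around it) showing that incremental decoding with a cache produces the same outputs as recomputing causal attention from scratch each step:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # head dimension

# random projection weights for one attention head (illustrative only)
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

def attend(q, K, V):
    # scaled dot-product attention of one query over cached keys/values
    w = q @ K.T / np.sqrt(D)
    w = np.exp(w - w.max())
    w /= w.sum()
    return w @ V

def decode_with_cache(xs):
    # incremental decoding: each step appends one (k, v) pair to the cache,
    # so per-step cost is linear in the prefix length
    K_cache, V_cache, outs = [], [], []
    for x in xs:
        K_cache.append(x @ Wk)
        V_cache.append(x @ Wv)
        outs.append(attend(x @ Wq, np.stack(K_cache), np.stack(V_cache)))
    return np.stack(outs)

def decode_full(xs):
    # reference: recompute all keys/values of the prefix at every step
    outs = []
    for t in range(1, len(xs) + 1):
        K, V = xs[:t] @ Wk, xs[:t] @ Wv
        outs.append(attend(xs[t - 1] @ Wq, K, V))
    return np.stack(outs)

xs = rng.normal(size=(6, D))
cached, full = decode_with_cache(xs), decode_full(xs)
```

The two decoders are mathematically identical; caching only removes redundant projection and attention work, which is why it changes throughput but not samples.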
reAR demonstrates significant improvements in visual quality. The model generates more coherent and detailed images compared to baseline autoregressive models. The improvements are particularly notable in maintaining consistency across the entire image and reducing artifacts that typically arise from exposure bias.
@article{he2025rear,
title={reAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization},
author={Qiyuan He and Yicong Li and Haotian Ye and Jinghao Wang and Xinyao Liao and Pheng-Ann Heng and Stefano Ermon and James Zou and Angela Yao},
year={2025},
journal={arXiv preprint},
}