Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance still lags behind diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency: the AR-generated tokens may not be well decoded by the tokenizer.
To address this, we propose reAR, a simple training strategy that introduces a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and to predict the embedding of the target token under a noisy context. reAR requires no changes to the tokenizer, generation order, inference pipeline, or external models.
Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching much larger state-of-the-art diffusion models (675M).
We identify two key sources of inconsistency between the autoregressive generator and the visual tokenizer:
1. Amplified Exposure Bias: During training with teacher forcing, the model predicts tokens given ground-truth context, but at inference it conditions on its own predictions. In visual AR, this leads to unseen token sequences that corrupt future predictions and spread structural artifacts across the image.
2. Embedding Unawareness: The AR model optimizes only discrete token indices without considering how these tokens are embedded by the tokenizer. However, decoded image quality depends on the embeddings of the generated tokens rather than their indices alone.
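The second point can be made concrete with a toy sketch (the codebook here is a random stand-in, not the paper's tokenizer): two predictions that are equally "wrong" at the index level can sit at very different distances in the embedding space, and the embedding is what the decoder actually consumes.

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.normal(size=(16, 8))  # toy stand-in for a tokenizer codebook

target = 0
# rank all codes by embedding distance to the target code
dists = np.linalg.norm(codebook - codebook[target], axis=1)
near_wrong = int(np.argsort(dists)[1])  # closest non-target code
far_wrong = int(np.argmax(dists))       # most distant code

# both predictions have 0% index accuracy, yet the decoder receives
# very different inputs in embedding space
err_near = float(np.linalg.norm(codebook[near_wrong] - codebook[target]))
err_far = float(np.linalg.norm(codebook[far_wrong] - codebook[target]))
```

A token-index loss treats `near_wrong` and `far_wrong` identically, while the decoded patch quality depends on `err_near` versus `err_far`.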
Figure: Illustration of the two inconsistency sources, Amplified Exposure Bias and Embedding Unawareness.
reAR addresses these issues through two complementary strategies: Noisy Context Regularization that exposes the model to perturbed context during training, and Codebook Embedding Regularization that aligns the generator's hidden states with the tokenizer's embedding space. This encourages the generator to be aware of how tokens are decoded into visual patches.
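The two strategies can be sketched together in a minimal numpy mock-up. Everything here is an illustrative assumption rather than the paper's exact formulation: `h_cur` and `h_next` stand in for projection heads over the transformer's hidden states, the losses are unweighted, and the codebook is random.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, T = 16, 8, 10                  # vocab size, embedding dim, sequence length
codebook = rng.normal(size=(V, D))   # toy stand-in for the tokenizer codebook

def noisy_context(tokens, p=0.1):
    # Noisy Context Regularization (sketch): with probability p, replace a
    # ground-truth context token with a random index, so training exposes
    # the model to imperfect prefixes like those seen at inference
    mask = rng.random(tokens.shape) < p
    return np.where(mask, rng.integers(0, V, size=tokens.shape), tokens)

def rear_loss(logits, h_cur, h_next, tokens):
    # standard next-token cross-entropy: position t predicts token t+1
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    ce = -np.log(probs[np.arange(T - 1), tokens[1:]]).mean()
    # Codebook Embedding Regularization (sketch): align hidden-state
    # projections with the embeddings of the current and target tokens
    reg_cur = np.mean((h_cur - codebook[tokens[:-1]]) ** 2)
    reg_next = np.mean((h_next - codebook[tokens[1:]]) ** 2)
    return ce + reg_cur + reg_next

tokens = rng.integers(0, V, size=T)
ctx = noisy_context(tokens)               # perturbed training context
logits = rng.normal(size=(T - 1, V))      # stand-in transformer outputs
h_cur = rng.normal(size=(T - 1, D))       # stand-in projection-head outputs
h_next = rng.normal(size=(T - 1, D))
loss = rear_loss(logits, h_cur, h_next, tokens)
```

Because the regularizers only add terms to the training loss, the inference pipeline is untouched, consistent with the claim that reAR changes neither the tokenizer nor the sampling procedure.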
Table 1 shows that reAR achieves strong results even with a standard raster-order AR model and a simple 2D patch tokenizer. reAR-S outperforms prior raster AR models like LlamaGen-XXL (FID 2.00 vs. 2.34; IS 295.7 vs. 253.9) using only 14% of the parameters (201M vs. 1.4B), and surpasses advanced-tokenizer AR models such as WeTok at just 13–15% of their size. It matches RAR and outperforms RandAR at similar scales, and reAR-L exceeds MAR-L and VAR-d30. While diffusion and masked-generation models remain strong, reAR narrows the gap with far fewer training epochs.
| Training Paradigm | Generation Model | Tokenizer Type | Tokenizer BPP16 ↓ | Training Epochs | #Params ↓ | FID ↓ | IS ↑ |
|---|---|---|---|---|---|---|---|
| Diffusion | LDM-4 | Patch-VAE | N/A | 200 | 400M | 3.60 | 247.7 |
| | DiT-XL | Patch-VAE | N/A | 1400 | 675M | 2.27 | 278.2 |
| | SiT-XL | Patch-VAE | N/A | 800 | 675M | 2.06 | 270.3 |
| | REPA | Patch-VAE | N/A | 800 | 675M | 1.42 | 305.7 |
| MAR | MAR-L | Patch-VAE | N/A | 800 | 479M | 1.98 | 290.3 |
| | MAR-H | Patch-VAE | N/A | 800 | 943M | 1.55 | 303.7 |
| Mask. | MaskGIT-re | Patch-VQ | 0.625 | 300 | 227M | 4.02 | 355.6 |
| | MAGVIT-v2 | Patch-VQ | 1.125 | 1080 | 307M | 1.78 | 319.4 |
| | Maskbit | Patch-LFQ | 0.875 | 1080 | 305M | 1.52 | 328.6 |
| | Mask-TiTok-64 | TiTok | 0.188 | 800 | 177M | 2.48 | 214.7 |
| | Mask-TiTok-128 | TiTok | 0.375 | 800 | 287M | 1.97 | 281.8 |
| VAR | VAR-d20 | VAR | 1.992 | 350 | 600M | 2.57 | 302.6 |
| | VAR-d30 | VAR | 1.992 | 350 | 2.0B | 1.92 | 323.1 |
| Rand. Causal AR | RAR-B | Patch-VQ | 0.625 | 400 | 261M | 1.95 | 290.5 |
| | RAR-L | Patch-VQ | 0.625 | 400 | 461M | 1.70 | 299.5 |
| | RAR-XL | Patch-VQ | 0.625 | 400 | 955M | 1.50 | 306.9 |
| | RandAR-L | Patch-VQ | 0.875 | 300 | 343M | 2.55 | 288.8 |
| | RandAR-XL | Patch-VQ | 0.875 | 300 | 775M | 2.25 | 317.8 |
| | RandAR-XXL | Patch-VQ | 0.875 | 300 | 1.4B | 2.15 | 322.0 |
| Tok. Causal AR | AR-FlexTok-XL | FlexTok | 0.125 | 300 | 1.3B | 2.02 | -- |
| | AR-GigaTok-XXL | GigaTok | 0.875 | 300 | 1.4B | 1.98 | 256.8 |
| | AR-WeTok-XL | WeTok | 1.667 | 300 | 1.5B | 2.31 | 276.6 |
| Raster. Causal AR | VQGAN-re | Patch-VQ | 0.875 | 100 | 1.4B | 5.20 | 280.3 |
| | Open-MAGVIT-v2 | Patch-LFQ | 1.125 | 300 | 1.5B | 2.33 | 271.8 |
| | LlamaGen-XL | Patch-VQ | 0.875 | 300 | 775M | 2.62 | 244.1 |
| | LlamaGen-XXL | Patch-VQ | 0.875 | 300 | 1.4B | 2.34 | 253.9 |
| | AR-L† | Patch-VQ | 0.625 | 400 | 461M | 3.02 | 256.2 |
| | reAR-S | Patch-VQ | 0.625 | 400 | 201M | 2.00 | 295.7 |
| | reAR-B | Patch-VQ | 0.625 | 400 | 261M | 1.91 | 300.9 |
| | reAR-L (cfg=10.0/11.0) | Patch-VQ | 0.625 | 400 | 461M | 1.86/1.90 | 316.9/323.2 |
Table 1: Results on 256x256 class-conditional generation on ImageNet-1K. "Mask." indicates masked generation; "Tok." denotes non-standard tokenization; "Rand." denotes randomized order; "Raster." denotes rasterization order. "†" indicates that the model is not publicly released and is trained with our implementation. BPP16 = 16×BPP (bits per pixel) measures the compression rate of discrete tokenizers and is not applicable ("N/A") to continuous tokenizers. "#Params" is the number of model parameters. "↑" and "↓" indicate whether higher or lower values are better, respectively.
We also evaluate reAR on the non-standard tokenizers TiTok and AliTok. Unlike RAR, which mainly helps with bidirectional tokenization, reAR consistently improves performance on both bidirectional (TiTok: 4.45 → 4.01) and unidirectional (AliTok: 1.50 → 1.42) tokenizers. Notably, it approaches diffusion-based REPA and outperforms Maskbit while using far fewer parameters (177M vs. 675M/305M).
| Model | Epochs | Params | FID ↓ |
|---|---|---|---|
| Maskbit | 1080 | 305M | 1.52 |
| REPA | 800 | 675M | 1.42 |
| AR-TiTok-b64 | 400 | 261M | 4.45 |
| RAR-TiTok-b64 | 400 | 261M | 4.07 |
| reAR-TiTok-b64 | 400 | 261M | 4.01 |
| AR-AliTok-B | 800 | 177M | 1.50 |
| RAR-B-AliTok | 800 | 177M | 1.52 |
| reAR-B-AliTok | 800 | 177M | 1.42 |
Table 2: Superior generalization ability. reAR adapts to different tokenizers and achieves state-of-the-art performance with smaller models.
We also study whether reAR preserves the scaling behavior of the original AR model. Specifically, we plot FID at different training epochs for each model size. As Figure 1 shows, FID consistently decreases as model size and training iterations increase, revealing the potential of reAR for large-scale visual AR models.
Figure 1: Scaling Effect of reAR. As model size increases, the FID at each training step decreases consistently.
Like other autoregressive models, reAR benefits from a KV-cache to achieve high sampling speed. We measure throughput on a single A800 GPU with batch size 128. With KV-caching, autoregressive models run much faster than diffusion and MAR. Moreover, reAR-B-AliTok achieves lower FID at higher sampling speed even against parallel-decoding approaches such as Maskbit, TiTok, VAR, and RandAR.
Figure 2: Sampling Speed. Comparison of different methods on FID and throughput (images/sec).
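The KV-cache speedup comes from reusing the key/value projections of the prefix instead of recomputing them at every decoding step. Below is a minimal single-head numpy sketch (random weights, no transformer around it) showing that incremental decoding with a cache produces the same outputs as recomputing causal attention from scratch each step:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # head dimension

# random projection weights for one attention head (illustrative only)
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

def attend(q, K, V):
    # scaled dot-product attention of one query over cached keys/values
    w = q @ K.T / np.sqrt(D)
    w = np.exp(w - w.max())
    w /= w.sum()
    return w @ V

def decode_with_cache(xs):
    # incremental decoding: each step appends one (k, v) pair to the cache,
    # so per-step cost is linear in the prefix length
    K_cache, V_cache, outs = [], [], []
    for x in xs:
        K_cache.append(x @ Wk)
        V_cache.append(x @ Wv)
        outs.append(attend(x @ Wq, np.stack(K_cache), np.stack(V_cache)))
    return np.stack(outs)

def decode_full(xs):
    # reference: recompute all keys/values of the prefix at every step
    outs = []
    for t in range(1, len(xs) + 1):
        K, V = xs[:t] @ Wk, xs[:t] @ Wv
        outs.append(attend(xs[t - 1] @ Wq, K, V))
    return np.stack(outs)

xs = rng.normal(size=(6, D))
cached, full = decode_with_cache(xs), decode_full(xs)
```

The two decoders are mathematically identical; caching only removes redundant projection and attention work, which is why it changes throughput but not samples.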
reAR demonstrates significant improvements in visual quality. The model generates more coherent and detailed images compared to baseline autoregressive models. The improvements are particularly notable in maintaining consistency across the entire image and reducing artifacts that typically arise from exposure bias.
@article{he2025rear,
title={reAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization},
author={Qiyuan He and Yicong Li and Haotian Ye and Jinghao Wang and Xinyao Liao and Pheng-Ann Heng and Stefano Ermon and James Zou and Angela Yao},
year={2025},
journal={arXiv preprint},
}