Conceptrol: Concept Control of Zero-shot Personalized Image Generation

Computer Vision & Machine Learning Group, National University of Singapore
Teaser.

TL;DR: Conceptrol is a free lunch that elicits the personalization ability of zero-shot adapters by transforming the image condition into a visual specification constrained by a textual concept, even outperforming fine-tuning methods.

Abstract

Personalized image generation with text-to-image diffusion models generates unseen images based on reference image content. Zero-shot adapter methods such as IP-Adapter and OminiControl are especially interesting because they do not require test-time fine-tuning. However, they struggle to balance preservation of personalized content with adherence to the text prompt. We identify a critical design flaw behind this performance gap: current adapters inadequately integrate personalization images with the textual descriptions. The generated images therefore replicate the personalized content rather than follow the text prompt instructions. Yet the base text-to-image model has strong conceptual understanding capabilities that can be leveraged. We propose Conceptrol, a simple yet effective framework that enhances zero-shot adapters without adding computational overhead. Conceptrol constrains the attention of the visual specification with a textual concept mask, which improves subject-driven generation capabilities. It achieves as much as 89% improvement on personalization benchmarks over the vanilla IP-Adapter and can even outperform fine-tuning approaches such as Dreambooth LoRA.

Understanding the Design Flaw of Zero-shot Adapters

Zero-shot adapters such as IP-Adapter and OminiControl do not leverage the textual concept, and as a result struggle to balance concept preservation with prompt adherence.

Insights from Attention Analysis

We gain three insights from attention analysis: (1) The attention from the personalized image is misaligned with the correct region of the personalized target. (2) The visual specification of the reference can transfer to regions with high attention scores. (3) A textual concept mask can be extracted from a specific block (i.e., UP.0.1.3 in SDXL) to provide an attention map with high scores on the proper region.
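To make insight (3) concrete, here is a minimal NumPy sketch of how a textual concept mask could be read off a cross-attention map. It is an illustration under stated assumptions, not the authors' implementation: the function name `textual_concept_mask`, the random stand-in tensors, and the max-normalization step are all hypothetical. In a real pipeline the queries and keys would come from a concept-specific block such as UP.0.1.3 in SDXL.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def textual_concept_mask(latent_queries, text_keys, concept_token_ids):
    """Hypothetical helper: attention mass each latent position puts on the concept tokens.

    latent_queries: (n_positions, d) queries from image latents
    text_keys:      (n_tokens, d) keys from the text prompt
    concept_token_ids: indices of the prompt tokens naming the concept (e.g. "dog")
    """
    d = latent_queries.shape[-1]
    # Cross-attention map over text tokens: (n_positions, n_tokens).
    attn = softmax(latent_queries @ text_keys.T / np.sqrt(d), axis=-1)
    # Sum the attention given to the concept tokens at each position.
    mask = attn[:, concept_token_ids].sum(axis=-1)
    # Assumed normalization: rescale so the strongest position is ~1.
    return mask / (mask.max() + 1e-8)

# Random stand-ins for one attention block (assumption, for illustration only).
rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 8))   # 16 latent positions, head dim 8
keys = rng.normal(size=(5, 8))       # 5 prompt tokens
mask = textual_concept_mask(queries, keys, concept_token_ids=[2])
```

The resulting `mask` has one value per latent position, with high values where the prompt's concept token receives the most attention.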

Personalization = Visual Specification + Textual Concept

Our method is simple and follows directly from the preceding analysis. We extract a textual concept mask from concept-specific blocks and use it to adjust the attention scores of the visual specification.
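The idea can be sketched on top of the decoupled cross-attention used by IP-Adapter-style adapters, where a text branch and an image branch are computed separately and summed. The sketch below is a hedged illustration rather than the authors' code: `masked_decoupled_attention`, its arguments, and the choice to scale the image branch per query are assumptions. Scaling the branch output per query is equivalent to scaling that query's post-softmax attention weights over the reference image tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Standard scaled dot-product attention: (n_q, d) x (n_k, d) -> (n_q, d_v).
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v

def masked_decoupled_attention(q, k_text, v_text, k_img, v_img,
                               concept_mask, scale=1.0):
    """Hypothetical sketch: gate the image branch with a textual concept mask.

    concept_mask: (n_q,) values in [0, 1], one per latent position.
    """
    text_out = attend(q, k_text, v_text)   # prompt-driven branch
    img_out = attend(q, k_img, v_img)      # reference-image branch
    # Only positions the text attributes to the concept receive the
    # visual specification from the reference image.
    return text_out + scale * concept_mask[:, None] * img_out

# Random stand-ins (assumption, for illustration only).
rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
k_text, v_text = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
k_img, v_img = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))

no_image = masked_decoupled_attention(q, k_text, v_text, k_img, v_img, np.zeros(16))
vanilla = masked_decoupled_attention(q, k_text, v_text, k_img, v_img, np.ones(16))
```

With an all-zero mask the layer falls back to plain text conditioning, and with an all-one mask it reduces to the unmasked adapter, so the mask interpolates between the two per position without adding any new parameters.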

Qualitative Results

Quantitative Results

Our method can significantly boost the personalization score without additional cost, and even surpasses fine-tuning methods such as Dreambooth LoRA.

BibTeX

If you find our work useful, please consider citing our paper:

@article{he2025conceptrol,
  title={Conceptrol: Concept Control of Zero-shot Personalized Image Generation},
  author={Qiyuan He and Angela Yao},
  journal={arXiv preprint arXiv:2503.06568},
  year={2025}
}