STAR

Scale-wise Text-to-image generation via
Auto-Regressive representations

Xiaoxiao Ma1,3,*, Mohan Zhou2,3,*, Tao Liang3, Yalong Bai3, Tiejun Zhao2, Huaian Chen1‡, Yi Jin1‡

1University of Science and Technology of China, 2Harbin Institute of Technology, 3Du Xiaoman

*Equal contribution. Work done during internships at Du Xiaoman.

‡Corresponding authors: anchen@mail.ustc.edu.cn, jinyi08@ustc.edu.cn

Performance & Comparison

Per-category FID on MJHQ-30K

Efficiency & CLIP-Score of 512x512 generation

STAR is a novel scale-wise text-to-image model that combines strong generation quality with fast inference.
Notably, STAR needs only 2.95s to generate a 512×512 image, less than half the 6.48s required by PixArt-α.

Architecture of STAR

We present STAR, a Scale-wise Text-to-image model that employs a scale-wise Auto-Regressive paradigm. Unlike VAR, which is limited to class-conditioned synthesis within a fixed set of predetermined categories, STAR enables text-driven open-set generation through three key designs:

  1. To boost diversity and generalizability with unseen combinations of objects and concepts, we introduce a pre-trained text encoder to extract representations for textual constraints, which we then use as guidance.
  2. To improve the interaction between the generated images and fine-grained text guidance, and so make results more controllable, we incorporate additional cross-attention layers at each scale.
  3. Given the natural structural correlation across different scales, we adopt 2D Rotary Positional Encoding (RoPE) and adapt it into a normalized version. This ensures that relative positions are interpreted consistently across images at different scales and stabilizes the training process.
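
The third design can be illustrated with a small sketch. The page does not give STAR's exact normalization, so the code below is only one plausible reading, not the authors' implementation: token indices of every scale are mapped onto a shared axis of length ref_side, and standard RoPE rotations are then applied separately to the row and column halves of the feature vector. The function names, ref_side=16, and base=10000.0 are all assumptions made for illustration.

```python
import math

def normalized_coords(side, ref_side=16):
    """Map token indices 0..side-1 of a side x side token map onto a shared
    axis of length ref_side, using cell centers (assumed normalization)."""
    return [(i + 0.5) / side * ref_side for i in range(side)]

def rope_1d(vec, pos, base=10000.0):
    """Standard RoPE: rotate each consecutive channel pair of vec by an
    angle proportional to pos, with one frequency per pair."""
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        theta = pos * base ** (-k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_2d(vec, row_pos, col_pos):
    """2D RoPE: the first half of the channels encodes the row position,
    the second half encodes the column position."""
    half = len(vec) // 2
    return rope_1d(vec[:half], row_pos) + rope_1d(vec[half:], col_pos)

def score(q, k, pos_q, pos_k):
    """Dot-product attention logit between a rotated query and key."""
    rq = rope_2d(q, *pos_q)
    rk = rope_2d(k, *pos_k)
    return sum(a * b for a, b in zip(rq, rk))

if __name__ == "__main__":
    coarse = normalized_coords(2)   # 2 x 2 token map
    fine = normalized_coords(4)     # 4 x 4 token map
    print(coarse, fine)

    q = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6, 0.7, -0.8]
    k = [0.5, 0.1, -0.3, 0.2, 0.4, -0.6, 0.9, 0.0]
    # Query from the coarse map attending to a key from the fine map.
    print(score(q, k, (coarse[0], coarse[1]), (fine[0], fine[1])))
```

Under this assumed normalization, a coarse token lands at the center of the finer tokens covering the same image region, so a given offset in normalized coordinates means the same spatial displacement at every scale. Because RoPE rotations compose, the attention logit depends only on the difference between the normalized query and key positions.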

Extensive experiments demonstrate that STAR surpasses existing benchmarks in terms of fidelity, image-text consistency, and aesthetic quality.
Our findings emphasize the potential of auto-regressive methods in the field of high-quality image synthesis, offering promising new directions for the T2I field currently dominated by diffusion methods.

Sample Generations

STAR produces realistic scenes and artistic images across various styles, and it adeptly handles detailed elements such as textures and architecture.