Precise control over regional generation in text-to-image diffusion models is essential for meeting complex user requirements. Conventional methods typically require additional training or fine-tuning on top of existing models, which incurs high computational cost and limits transferability. This thesis investigates a range of training-free, forward-only intervention schemes and verifies the improvements they bring to regionally controllable generation. We experiment with interventions targeting multiple aspects of the generation process, including the model input, hidden features, attention mechanisms, and model output, and ultimately identify an optimal training-free configuration. This configuration achieves regional control of generated images from multiple prompts and delivers competitive results. In addition, we introduce the VG500 dataset, derived from Visual Genome, together with a CLIP-based protocol for evaluating regional control accuracy. Evaluations on VG500 confirm that our method aligns generated regions with their corresponding texts while preserving global image coherence. This work contributes to the field of regionally controllable text-to-image generation by providing a flexible and efficient solution for region-controlled generation.
Achieving precise regional control within text-to-image diffusion models is crucial for fulfilling complex user requirements. Traditional methods often involve model retraining or fine-tuning, leading to substantial computational costs and limitations in transferability. This thesis investigates training-free, forward-only modulation techniques to address these challenges. Our approach targets various aspects of the generation process, including the model input, hidden features, attention maps, and output. We identify an optimal training-free configuration for layout-aware, multi-prompt image generation that demonstrates competitive performance. Additionally, we introduce VG500, a dataset derived from Visual Genome, along with a CLIP-based scoring protocol specifically designed for evaluating regional control accuracy. Extensive evaluations on VG500 verify the effectiveness of our method in accurately aligning image generation with regional specifications while maintaining image coherence. This work contributes to the field of text-to-image generation by providing a flexible and computationally efficient solution for regionally controllable generation.
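The CLIP-based regional scoring can be summarized informally as: crop each annotated region from the generated image and measure its CLIP similarity with the corresponding regional prompt, then aggregate over regions. The sketch below is a minimal illustration of this idea, assuming axis-aligned bounding boxes and the public openai/clip-vit-base-patch32 checkpoint; the function name regional_clip_score and all implementation details are illustrative, not the exact evaluation code used for VG500.

```python
# Minimal, illustrative sketch of a CLIP-based regional alignment score:
# crop each specified region from a generated image and measure its CLIP
# similarity with the corresponding regional prompt.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def regional_clip_score(image: Image.Image, regions):
    """regions: list of (prompt, (left, top, right, bottom)) pairs."""
    prompts = [prompt for prompt, _ in regions]
    crops = [image.crop(box) for _, box in regions]
    inputs = processor(text=prompts, images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Cosine similarity between each crop and its own prompt (diagonal pairs only).
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    per_region = (img_emb * txt_emb).sum(dim=-1)
    # Average over regions to obtain a single regional-alignment score.
    return per_region.mean().item()
```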