Towards Fine-Grained Alignment of Text and Image Modalities: Technologies and Applications

Author: Xiao Changming
  • Student ID
    2018******
  • Degree
    Doctoral
  • Email
    xcm******.cn
  • Defense Date
    2024.08.24
  • Advisor
    Zhang Changshui
  • Discipline
    Control Science and Engineering
  • Pages
    106
  • Confidentiality Level
    Public
  • Department
    025 Department of Automation
  • Keywords
    multi-modal learning; fine-grained alignment; open-set segmentation; cross-modal generation; multi-modal diffusion model

Abstract


In recent years, with the development of deep learning, artificial intelligence has been widely applied in fields such as computer vision, natural language processing, and audio signal processing. Building on this progress, researchers have explored how intelligent agents can, like humans, process information from multiple modalities simultaneously and complete diverse tasks in complex scenarios. A key step is to find correspondences between information from different modalities so that cross-modal information can be processed more effectively; this step is called alignment. Some alignment work focuses on whole-data alignment, such as matching an image with its description, which can be applied to scenarios like cross-modal retrieval. This dissertation focuses on a finer-grained form of alignment: alignment between the internal substructures of data, such as matching a region of an image with a phrase in its description. Establishing fine-grained alignment between modalities helps intelligent agents comprehend multi-modal information more accurately, achieve more effective human-machine interaction, and complete more complex tasks. This dissertation selects the two most common modalities in the real world, text and image, and studies the technologies and applications of their fine-grained alignment. It first focuses on tasks directly related to fine-grained alignment, namely dense image prediction, and then explores how high-quality text-image fine-grained alignment can help agents achieve better performance on practical tasks. The main contributions of this dissertation are as follows:
  • For the semantic segmentation task, a universal image segmentation algorithm based on text-to-image diffusion models is proposed. This work extracts fine-grained text-image correspondences from the attention mechanism of the multi-modal model and uses them to locate, in the image, the object entities described by the text. Experimental results demonstrate that the algorithm completes various dense image prediction tasks more efficiently and extends to personalized descriptions.
  • For the image editing task, a text-guided image editing algorithm with region-based attention is proposed. By training a mapping module that maps natural-language input to directions of image transformation, this work expands the feasible transformation space and enhances the practicality of image editing. By additionally training an attention module that finds fine-grained correspondences between input instructions and image regions, it restricts image changes to the text-related parts, thereby improving the precision of image editing.
  • For the object rearrangement task, a text-guided tabletop rearrangement framework based on diffusion models is proposed. The framework first parses the cluttered scene via text-image correspondences to obtain attribute information for objects in different regions of the image. It then trains a language-conditioned diffusion model to generate a tidy layout corresponding to the input text. Finally, it uses a large language model as a planner to generate a manipulation policy with which the robot arranges the messy tabletop into a tidy one. Simulation and real-robot experiments verify the effectiveness of the framework.
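The abstract does not give implementation details for turning diffusion-model attention into segmentation masks. As a rough illustration of the general idea (not the dissertation's actual algorithm), the sketch below assumes per-token cross-attention maps have already been extracted and averaged over layers and heads; the function name, input layout, and threshold are all hypothetical:

```python
import numpy as np

def masks_from_cross_attention(attn, class_token_ids, threshold=0.5):
    """Turn cross-attention maps into per-class segmentation masks.

    attn: (H, W, T) array -- attention weight of each image location
          over each of the T text tokens (hypothetical input; in practice
          these would come from a text-to-image diffusion model's UNet,
          averaged over layers/heads and denoising steps).
    class_token_ids: dict mapping class name -> index of its text token.
    Returns a dict mapping class name -> boolean (H, W) mask.
    """
    masks = {}
    for name, t in class_token_ids.items():
        m = attn[:, :, t]
        # Normalize the per-token map to [0, 1] before thresholding.
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)
        masks[name] = m > threshold
    return masks
```

In practice the attention maps are low-resolution and noisy, so real systems typically upsample and refine them; this sketch only shows the core extract-and-threshold step.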
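For the rearrangement framework, the abstract describes a language-conditioned diffusion model that generates tidy object layouts. As a generic sketch of how such sampling could look (a plain DDPM-style reverse loop, not the dissertation's model), the code below samples 2-D object positions given a text embedding; `denoise_fn`, the noise schedule, and the step count are all assumptions:

```python
import numpy as np

def sample_layout(denoise_fn, text_emb, n_objects, steps=50, rng=None):
    """DDPM-style reverse sampling of a 2-D tabletop layout.

    denoise_fn(x, t, text_emb) -> predicted noise; a hypothetical
    language-conditioned network. Positions start as Gaussian noise
    and are iteratively denoised toward a layout matching the text.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.standard_normal((n_objects, 2))       # noisy (x, y) positions
    betas = np.linspace(1e-4, 0.02, steps)        # linear noise schedule
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    for t in reversed(range(steps)):
        eps = denoise_fn(x, t, text_emb)
        # Posterior mean of the reverse step.
        x = (x - betas[t] / np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x
```

A planner (here, per the abstract, a large language model) would then turn the sampled target positions into pick-and-place actions for the robot.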