Knowledge is a core element of human cognition of the multi-modal world. Structured knowledge acquisition aims to automatically acquire knowledge triples consisting of entities and their semantic relations, including grounded knowledge about concrete scenes and abstract knowledge about general facts, and is a fundamental technology in artificial intelligence. However, current structured knowledge acquisition remains highly limited in scale, struggling to cover rich knowledge types and broad knowledge contexts. Inspired by human cognition theory, this paper explores large-scale structured knowledge acquisition from multi-modal data by establishing a bidirectional loop between the grounded knowledge about concrete scenes and the abstract knowledge about general facts, with multi-modal pre-trained models playing a foundational role, so that the acquisition of the two knowledge types can be mutually enhanced in the loop. The work is divided into three aspects: (1) Pre-training and adaptation of foundation models for structured knowledge acquisition, which explores fine-grained foundation models suitable for structured knowledge acquisition. This includes: Fine-grained prompt learning for pre-trained multi-modal models, which transforms multi-modal tasks into color-based fill-in-the-blank problems to fully exploit the fine-grained perceptual capability of multi-modal pre-trained models; Fine-grained pre-training for multi-modal models, which unifies language tokens and entities' visual position tokens into a discrete vocabulary to construct pre-trained multi-modal models suitable for structured knowledge acquisition. (2) Abstract knowledge-guided grounded knowledge acquisition, which grounds abstract knowledge about general facts into concrete scenes to guide the acquisition of grounded knowledge, facilitating deep scene understanding.
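The color-based fill-in-the-blank idea can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the prompt template, color vocabulary, and the stub scoring function standing in for a real pre-trained multi-modal model are all assumptions for demonstration.

```python
# Sketch of color-based cloze prompting for fine-grained visual grounding:
# candidate image regions are marked with distinct colors, and the model is
# asked which color fills the [MASK] slot. All names here are illustrative.

COLOR_WORDS = ["red", "green", "blue", "yellow"]  # assumed marker colors

def build_cloze_prompt(query: str) -> str:
    """Turn a grounding query into a fill-in-the-blank prompt whose answer
    is the marker color of the target region."""
    return f"{query} is in the [MASK] colored region."

def stub_model(prompt: str, colors: list) -> dict:
    """Stand-in for a pre-trained multi-modal model: returns a probability
    for each color word filling the [MASK] slot. Hard-coded for the demo."""
    demo = {"red": 0.1, "green": 0.7, "blue": 0.15, "yellow": 0.05}
    return {c: demo[c] for c in colors}

def ground(query: str) -> str:
    """Ground the query to a region by picking the highest-scoring color."""
    prompt = build_cloze_prompt(query)
    scores = stub_model(prompt, COLOR_WORDS)
    return max(scores, key=scores.get)

prompt = build_cloze_prompt("The man holding an umbrella")
region_color = ground("The man holding an umbrella")
```

In a real system, the stub would be replaced by a masked-language-model head over the multi-modal encoder, so grounding reduces to ordinary vocabulary prediction rather than a task-specific detection head.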
The work in visual relation recognition includes: Knowledge graph-guided visual relation recognition, which grounds knowledge graphs into image data to provide large-scale visual relation learning signals; Relation hierarchy-guided large-scale visual relation recognition, which transfers relation labels in image data along the relation hierarchy to learn large-scale visual relations. The work in textual relation recognition includes: Knowledge graph-guided document-level relation recognition, which grounds knowledge graphs into documents to extend the coverage of textual relation recognition to the document level; Knowledge graph-guided cross-document relation recognition, which grounds abstract knowledge graphs to connect multiple documents, covering more open cross-document knowledge contexts. (3) Grounded knowledge-driven abstract knowledge acquisition, which induces abstract knowledge about general facts from multiple grounded knowledge instances in concrete scenes to construct large-scale knowledge graphs. The work includes: Commonsense knowledge acquisition from visual relation instances, which abstracts commonsense relations between entities in images in a multi-instance learning paradigm; Relation discovery from textual relation instances, which clusters relation instances in open-domain corpora to discover novel relation types that enrich the relation schema of knowledge graphs. Based on these works, this paper forms a suite of algorithms for structured knowledge acquisition from multi-modal data, and explores their application in real-world scenarios, including multi-modal structured knowledge-enhanced controllable image synthesis, which significantly improves the semantic controllability of image synthesis models.
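The knowledge graph distant supervision used throughout aspect (2) can be sketched in a few lines. This is a hedged toy example, not the thesis's pipeline: the tiny knowledge graph, relation names, and substring-based entity matching are all simplifying assumptions (real systems use entity linking rather than string containment).

```python
# Sketch of distant supervision: any text span mentioning both entities of a
# knowledge graph triple is labeled with that triple's relation as a noisy
# training signal. The KG and corpus below are illustrative assumptions.

KG = {
    ("Paris", "France"): "capital_of",
    ("Seine", "Paris"): "flows_through",
}

def distant_label(sentences: list) -> list:
    """Return (sentence, head, tail, relation) tuples for every sentence
    that mentions both entities of some KG triple."""
    labeled = []
    for sent in sentences:
        for (head, tail), relation in KG.items():
            # Naive string matching stands in for proper entity linking.
            if head in sent and tail in sent:
                labeled.append((sent, head, tail, relation))
    return labeled

corpus = [
    "Paris is the capital and largest city of France.",
    "The Seine runs through the heart of Paris.",
]
labels = distant_label(corpus)
```

The same alignment idea scales from single sentences to documents and to bags of documents mentioning the same entity pair, which is what extends the supervision to document-level and cross-document contexts; the resulting labels are noisy, so downstream models must be trained to tolerate mislabeled instances.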