In recent years, with the development of multimedia technology, the modalities of network data have become increasingly complex and expressive. Meanwhile, research in artificial intelligence has made great progress in imitating human physiology, whereas research at the psychological level is still emerging and leaves considerable room for development. If machines could accurately understand and express emotions in multimodal data, this would not only carry major interdisciplinary significance for psychology and computer science, but also benefit practical applications such as user understanding and content creation on the Web. At present, related work suffers from limited dataset generality, weak correlation among multimodal emotional features, and poor robustness and interpretability in multimodal fusion and generation, which makes it difficult to meet the needs of complex real-world environments. Therefore, this paper studies emotion understanding and generation based on multimodal correlation: guided by psychology, we build multiple standard datasets and benchmarks, extract multimodal emotional features, and construct machine learning models, thereby capturing and strengthening the correlations between modalities so that user emotions become recognizable, content emotions can be reasoned about, and generated emotions are interpretable. This provides a research foundation for related fields and explores directions for further development. The main contributions of this paper are summarized as follows:

1. We propose a multimodal depression recognition method based on social media data. To capture multimodal correlations in social media big data and improve the understanding of user emotions, this paper constructs a high-quality, large-scale dataset of depressed users on social media and, guided by psychological theory, extracts multimodal features that are closely related to emotion. We use a multimodal dictionary learning algorithm to obtain a multimodal joint sparse representation for emotion classification, which achieves good performance in depression recognition. In addition, we mine typical online depressive behaviors from the large-scale population of depressed users (an illustrative sketch of such a pipeline follows this summary).

2. We propose a multimodal human emotion reasoning method based on video data. To capture and strengthen the correlations between modalities when modality information is incomplete, and to improve the robustness of content emotion understanding, this paper constructs a large-scale dataset for multimodal human emotion reasoning in videos, which provides person-level emotion annotations under modality-missing scenarios. We propose a multimodal emotion reasoning model based on the self-attention mechanism that, while performing multimodal fusion, also incorporates reasoning strategies such as emotion propagation and emotional context. The model achieves good performance on this dataset and provides a research foundation for the development of further emotion reasoning algorithms (a fusion sketch is also given below).
3. We propose a controllable expression video generation method with self-inferred intensity. To balance robustness, controllability, and interpretability in emotion generation, this paper proposes an intensity-based controllable video generation method that synthesizes an expression video from a single neutral face image, realizing correlated generation from the image modality to the video modality. A key highlight is that the expression intensity of each frame is automatically inferred during training, which avoids complicated and inaccurate manual intensity annotation; in addition, a unified model covering multiple expressions is provided for users to choose from (sketched below). This method facilitates content creation by the general public and enhances the overall vitality of the Internet.
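For Contribution 1, the abstract names a multimodal dictionary learning algorithm that yields a joint sparse representation for depression classification. The following is a minimal, generic stand-in for that idea using scikit-learn, not the thesis's actual algorithm: feature names, dimensions, and the concatenate-then-encode design are assumptions made only for illustration.

```python
# Illustrative sketch only: a generic stand-in for "multimodal joint sparse
# representation" via dictionary learning, NOT the algorithm used in the thesis.
# Feature names and sizes below are invented.
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_users = 200
text_feat = rng.normal(size=(n_users, 64))    # e.g., linguistic features
image_feat = rng.normal(size=(n_users, 128))  # e.g., visual features
social_feat = rng.normal(size=(n_users, 16))  # e.g., posting-behavior features
labels = rng.integers(0, 2, size=n_users)     # depressed vs. non-depressed

# Concatenate modalities so that one sparse code jointly represents a user.
X = np.hstack([text_feat, image_feat, social_feat])

# Learn a dictionary and encode each user as a sparse code over its atoms.
dict_learner = DictionaryLearning(n_components=32, alpha=1.0,
                                  transform_algorithm="lasso_lars",
                                  random_state=0, max_iter=50)
codes = dict_learner.fit_transform(X)          # (n_users, 32) sparse codes

# Classify depression from the joint sparse representation.
clf = LogisticRegression(max_iter=1000).fit(codes, labels)
print("training accuracy:", clf.score(codes, labels))
```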
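For Contribution 2, the abstract describes self-attention-based multimodal fusion under missing modalities. The sketch below shows only that generic fusion-with-masking pattern; it does not implement the emotion propagation or emotional context reasoning strategies, and the modality names, dimensions, and label space are assumptions.

```python
# Illustrative sketch only: self-attention fusion over modality tokens with a
# mask for missing modalities; a generic pattern, not the thesis's exact model.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dims=(512, 128, 300), d_model=256, n_classes=7):
        super().__init__()
        # One projection per modality (e.g., face, audio, text) into d_model.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, feats, present):
        # feats: list of (B, dim) tensors; present: (B, n_modalities) bool mask
        tokens = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)
        # key_padding_mask expects True where a position should be IGNORED.
        out = self.encoder(tokens, src_key_padding_mask=~present)
        # Masked mean-pool over the modalities that are actually available.
        w = present.unsqueeze(-1).float()
        pooled = (out * w).sum(1) / w.sum(1).clamp(min=1.0)
        return self.head(pooled)

model = AttentionFusion()
feats = [torch.randn(8, 512), torch.randn(8, 128), torch.randn(8, 300)]
present = torch.tensor([[True, True, False]] * 8)   # e.g., text is missing
logits = model(feats, present)                      # (8, 7) emotion scores
```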
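For Contribution 3, the abstract highlights generating an expression video from a single neutral face while inferring per-frame intensity without manual labels. The sketch below loosely mirrors that idea with a toy intensity estimator and an intensity-conditioned generator; the architectures, losses, and image sizes are invented for brevity and are not the thesis's actual model.

```python
# Illustrative sketch only: intensity-conditioned frame generation with a
# learned per-frame intensity estimator; all design choices here are assumed.
import torch
import torch.nn as nn

class IntensityEstimator(nn.Module):
    """Infers a scalar expression intensity in [0, 1] for a frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, neutral, frame):
        return self.net(torch.cat([neutral, frame], dim=1))  # (B, 1)

class Generator(nn.Module):
    """Generates a frame from a neutral face plus a target intensity."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())

    def forward(self, neutral, intensity):
        b, _, h, w = neutral.shape
        cond = intensity.view(b, 1, 1, 1).expand(b, 1, h, w)
        return self.net(torch.cat([neutral, cond], dim=1))

estimator, generator = IntensityEstimator(), Generator()
neutral = torch.rand(2, 3, 64, 64)            # single neutral face per sample
frame = torch.rand(2, 3, 64, 64)              # a training frame from the video
alpha = estimator(neutral, frame)             # self-inferred intensity, no labels
recon = generator(neutral, alpha)             # reconstruct the frame at alpha
loss = nn.functional.l1_loss(recon, frame)    # plus adversarial/other terms

# At inference, sweeping the intensity from 0 to 1 yields a controllable video.
video = [generator(neutral, torch.full((2, 1), a))
         for a in torch.linspace(0, 1, 16).tolist()]
```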