
Psychological Stress Detection Methods through Multi-modal Fusion

Author: Zhang Huijun (张慧君)
  • Student ID
    2017******
  • Degree
    Doctoral
  • Email
    zha******.cn
  • Defense Date
    2023.05.19
  • Advisor
    Ling Feng (冯铃)
  • Discipline
    Computer Science and Technology
  • Pages
    129
  • Confidentiality Level
    Public
  • Affiliation
    024 Department of Computer Science and Technology
  • Keywords
    Psychological Stress Detection, Multiple Modalities, Interactive Fusion, Mobile Application, Video

Abstract


As the pace of modern society keeps accelerating, psychological stress problems are becoming increasingly common and serious. Excessive psychological stress severely affects people's physical and mental health, so it is very important to identify psychological stress in a timely and effective manner. Psychological stress detection that relies on a single-modality data source often suffers from insufficient information, one-sided features, and missing data. This thesis aims to overcome the shortcomings of single-modal data by leveraging multi-modal data fusion for psychological stress detection, thereby enriching the input information, exploiting the complementarity between modalities, and achieving better detection results. This research still faces challenges related to data, multi-modal fusion, and modal content understanding. To address these challenges, this thesis conducts three studies.

Study 1: A multi-modal interactive-fusion psychological stress detection method based on a mobile application. This study proposed a smartphone-based multi-modal interactive fusion method to integrate multiple modalities for psychological stress detection. It designed an interactive attention mapping mechanism and an attention-based weight distribution mechanism, which automatically capture the association between each pair of modalities and automatically compute the importance of each modality. The study developed a mobile mental-health monitoring application named "Happort" (哈跑). Based on the Happort usage data (text, image, sleep, and exercise data) of 62 participants, a multi-modal psychological stress detection dataset of 961 samples was built. Experiments on this dataset showed that the method achieves an accuracy of 80.84% and that multi-modal fusion outperforms any single modality.

Study 2: A video-based psychological stress detection method with two-level fusion of face and action. This study extended the application scenarios of stress recognition by using surveillance video as a new channel for monitoring psychological stress. Leveraging facial expressions and actions in videos, it proposed a two-level stress detection network that first learns face-level and action-level representations and then fuses the two levels through a weighted fusion structure combining local and global attention. The study constructed a dataset of 2,092 video samples from 122 subjects, on which the proposed method achieved an accuracy of 85.42%.

Study 3: A video-based fine-grained multi-modal fusion psychological stress detection method. Building on the facial-expression and action modalities, this study further fused acoustic and textual features to enrich the input information. It explored psychological stress detection at the fine-grained video-clip level and used a graph neural network to learn and mine the correlations among video clips. The study proposed a graph-based cross-modal fusion model that achieves better stress detection through the weighted fusion of facial expressions, actions, audio, and text, and by learning the correlation between the behaviors in each pair of video clips. In the experiments, the proposed method achieved accuracies of 88.14% and 86.91% on two datasets, respectively. The experimental results also showed that multiple modalities often outperform a single modality on this task.
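
To make the recurring fusion mechanism concrete, below is a minimal sketch of attention-based weighted modality fusion in the spirit of Study 1's weight-distribution mechanism (and the weighted fusion used in Studies 2 and 3). It is an illustrative assumption, not the thesis's actual model: the class name, the scoring head, and all dimensions are hypothetical.

```python
# Illustrative sketch only (assumed architecture, not the thesis code):
# each modality embedding gets a learned importance weight, and the
# fused representation is the weighted sum of the embeddings.
import torch
import torch.nn as nn


class AttentionWeightedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Shared scoring head: maps each modality embedding to a scalar score.
        self.score = nn.Sequential(
            nn.Linear(dim, dim),
            nn.Tanh(),
            nn.Linear(dim, 1),
        )

    def forward(self, modality_embeddings: torch.Tensor) -> torch.Tensor:
        # modality_embeddings: (batch, num_modalities, dim)
        scores = self.score(modality_embeddings)       # (batch, M, 1)
        weights = torch.softmax(scores, dim=1)         # per-modality importance
        return (weights * modality_embeddings).sum(1)  # (batch, dim)


# Toy usage with four modalities (e.g. text, image, sleep, exercise).
fusion = AttentionWeightedFusion(dim=128)
embeddings = torch.randn(8, 4, 128)  # batch of 8 samples, 4 modalities
print(fusion(embeddings).shape)      # torch.Size([8, 128])
```

The softmax over the modality axis is what lets the model assign each modality a data-dependent importance, matching the abstract's claim that modality weights are computed automatically rather than fixed by hand.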
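
Likewise, a hedged sketch of Study 3's graph-based idea: each video clip becomes a graph node whose feature concatenates that clip's face, action, audio, and text embeddings, and a simple GCN-style update propagates information between every pair of clips. The fully connected adjacency, layer sizes, and mean pooling are assumptions for illustration; the abstract does not specify the thesis's actual graph construction.

```python
# Illustrative sketch only (assumed graph construction and layers):
# model pairwise correlations among video clips with a plain
# GCN-style update over a fully connected clip graph.
import torch
import torch.nn as nn


class ClipGraphFusion(nn.Module):
    def __init__(self, clip_dim: int, hidden: int, num_classes: int = 2):
        super().__init__()
        self.gcn1 = nn.Linear(clip_dim, hidden)
        self.gcn2 = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (num_clips, clip_dim), one fused feature vector per clip.
        n = clips.size(0)
        adj = torch.ones(n, n) / n              # normalized, fully connected
        h = torch.relu(self.gcn1(adj @ clips))  # message passing, layer 1
        h = torch.relu(self.gcn2(adj @ h))      # message passing, layer 2
        video = h.mean(dim=0)                   # pool clips to video level
        return self.classifier(video)           # stress / no-stress logits


# Toy usage: 6 clips, each a concatenation of four 128-d modality features.
model = ClipGraphFusion(clip_dim=4 * 128, hidden=256)
print(model(torch.randn(6, 512)).shape)  # torch.Size([2])
```

Message passing over all clip pairs is one straightforward way to realize "learning the correlation between the behaviors in each pair of video clips"; the thesis may well use a different edge definition or an attention-weighted adjacency instead of the uniform one assumed here.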