场景图作为一种描述图片视觉内容的结构化数据,有望改善下游视觉推理任务的性能及可解释性,从而帮助计算机更有效地"理解"图片。因此,近年来,场景图生成任务在计算机视觉领域受到了广泛的关注。其核心环节在于对视觉关系的检测。但是,与目标检测任务的相对客观性不同,视觉关系检测天然具有较强的不确定性。围绕引发该不确定性的谓语标签语言模糊性问题、视觉场景语义上的歧义性问题,以及应用这种不确定度量去改善数据标注效率的问题,本文分别进行了探索,主要工作内容如下:

1. 本文提出了一种基于概率分布的谓语标签语言模糊性建模方法。针对文本标签的"一词多义"现象,本文受自然语言处理领域单词表征相关工作的启发,提出用高斯分布来表征谓语标签,从而处理标签与视觉场景的一对多关系。该方法很自然地捕捉到了标签的语言模糊性:如果标签对应的视觉场景比较多样,则其高斯分布的方差也相对较大,因此可以有效建模谓语标签一词多义的特性。

2. 从视觉信息的角度,本文提出了一种通用的视觉关系语义歧义性表征方法。图片中每一对物体联合框的视觉特征都被映射为一个高斯分布,以概率分布的形式表达视觉场景的歧义性,并通过随机采样得到最终的特征实例。在该机制下,视觉场景的特征具有一定的随机性,因此预测结果很自然地具备了多样性;同时,对低频类别的覆盖率也得到了提高,有助于实现更均衡的预测。此外,该方法易于实现,可以作为一种插件式的模块移植到任何已有的场景图生成模型中。

3. 本文提出了一种基于概率不确定性建模的场景图生成主动学习框架。该框架首先将视觉关系实例表征为高斯分布,然后通过对比多次采样得到的不同预测结果来度量实例的不确定度,并以此作为筛选高质量样本进行标注的标准。利用该主动学习框架,可以有效提高数据标注的效率,在数据标注预算受限的情况下仍能实现优越的模型性能。

综上,本文分别从谓语标签和视觉信息的角度对场景图生成中的不确定性问题进行了探究,并通过对不确定性的度量有效改善了数据标注效率,从而提高了场景图生成模型的实用性,使场景图的下游推理任务更加受益。
As a type of structured data for describing the visual content of images, scene graphs may improve the performance and interpretability of downstream visual reasoning tasks, helping computers "understand" images more effectively. Therefore, in recent years, scene graph generation has received extensive attention in the computer vision community. The key to scene graph generation lies in the detection of visual relationships. However, unlike object detection, which is relatively objective, visual relationship detection is inherently uncertain. This thesis focuses on the linguistic fuzziness of predicate labels, the semantic ambiguity of visual scenes, and the application of uncertainty measures to improve the efficiency of data annotation. The main contributions are as follows:

1. This thesis proposes a probability-distribution-based method for modeling the linguistic fuzziness of predicate labels. To deal with the one-to-many mapping between labels and visual scenes, we propose to represent each predicate label as a Gaussian distribution, inspired by work on word representation in natural language processing. This method naturally captures the linguistic fuzziness of labels: if the visual scenes corresponding to a label are diverse, the variance of its Gaussian distribution will be correspondingly large. This representation therefore effectively models the polysemy of predicate labels.

2. From the perspective of visual information, this thesis proposes a universal method for representing the semantic ambiguity of visual relationships. The visual features of each object union region in an image are mapped to a Gaussian distribution, expressing the ambiguity of the visual scene in the form of a probability distribution, and the final feature instance is obtained by random sampling. Under this mechanism, the features of visual scenes possess stochasticity, so the predictions are naturally diversified.
At the same time, the coverage of infrequent categories is also improved, which helps to achieve more balanced predictions. In addition, the method is easy to implement and can be plugged into any existing scene graph generation model.

3. This thesis proposes an active learning framework for scene graph generation based on probabilistic uncertainty modeling. Visual relationship instances are represented as Gaussian distributions, and their uncertainty is measured by comparing the different predictions obtained from multiple samples; this score is then used to select high-quality examples for annotation. Using this framework, we can effectively improve the efficiency of data annotation and still achieve superior model performance under a limited annotation budget.

In conclusion, this thesis explores the uncertainty problem in scene graph generation from the perspectives of both predicate labels and visual information, and effectively improves the efficiency of data annotation by measuring that uncertainty. The thesis thus improves the practicality of scene graph generation models and benefits the downstream reasoning tasks of scene graphs.
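The mechanism shared by contributions 1 and 2 (representing a label or a union-region feature as a Gaussian and drawing stochastic instances from it) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the linear maps `w_mu` and `w_logvar`, the toy dimensions, and the function names are hypothetical stand-ins for the learned components described above, not the thesis's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_head(x, w_mu, w_logvar):
    """Map a feature vector to the parameters of a diagonal Gaussian.

    Hypothetical stand-in for a learned projection: both predicate-label
    embeddings and union-region visual features can be represented this way.
    """
    mu = x @ w_mu          # mean of the distribution
    logvar = x @ w_logvar  # log-variance; a more ambiguous input should yield larger variance
    return mu, logvar

def sample(mu, logvar, rng):
    """Reparameterized draw: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Toy setup: an 8-d union-region feature mapped to a 4-d latent Gaussian.
x = rng.standard_normal(8)
w_mu = rng.standard_normal((8, 4))
w_logvar = rng.standard_normal((8, 4))

mu, logvar = gaussian_head(x, w_mu, w_logvar)
z1 = sample(mu, logvar, rng)
z2 = sample(mu, logvar, rng)
# Two draws from the same input differ, which is the source of the
# prediction diversity described in contribution 2.
```

Because each forward pass samples a different feature instance, a downstream classifier fed with `z` produces varied predictions for the same visual scene, which is what improves coverage of infrequent predicate categories.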