Document retrieval is one of the most important and challenging problems in information retrieval research. Given a search query, document retrieval aims to retrieve the documents relevant to the query and to produce a ranked list based on document relevance. Most traditional retrieval models estimate document relevance directly from document-level signals, which tends to lose the relevance information carried by local content within the document. Recently, some studies have started to introduce finer-grained (e.g., paragraph-level, passage-level) relevance signals into retrieval models, which has improved retrieval performance to some extent. However, existing retrieval models based on fine-grained information still face several challenges. For example, it remains unclear how fine-grained relevance signals affect the relevance judgment of the whole document. In addition, most existing studies model fine-grained relevance signals independently, which discards contextual information and limits the performance of retrieval models. This work addresses these challenges through research in the following three aspects.

Research on users' fine-grained relevance perception mechanism. Understanding how users perceive relevance is the basis for designing reasonable retrieval models. Through user studies, we find that fine-grained answer text on Search Engine Result Pages (SERPs) can reduce users' search effort and has a positive effect on their search experience, which shows the need for research at a finer granularity than the document level. We construct a dataset with both paragraph-level and document-level relevance annotations. Based on a thorough analysis of this dataset, we find that paragraphs at different positions and of different lengths affect document-level relevance to different degrees, and that the presence of several consecutive relevant paragraphs within a document is an important signal of document relevance. These findings help us understand how users perceive document-level relevance and thus inspire the design of more reasonable retrieval models based on fine-grained relevance signals.

Relevance estimation based on users' fine-grained reading behavior. As an important cognitive behavior in information seeking, reading is closely tied to users' relevance perception process. Through user studies, we find that users' attention distribution while reading a document is influenced by factors such as position bias and selection bias. Building on this finding and on the framework of probabilistic generative models, we propose a Passage-level Reading behavior Model (PRM). PRM uses observable passage-level exposure and viewport duration information to model the unobservable skimming, careful reading, and satisfaction events that occur during reading. Experimental results show that PRM fits users' passage-level reading behavior well, estimates passage-level and document-level relevance effectively, and outperforms existing unsupervised retrieval models.

Document ranking based on fine-grained cumulative gain. Based on the above studies of users' relevance perception and reading process, we propose a BERT-based deep sequence model, the Passage Cumulative Gain Model (PCGM). PCGM simulates how user-perceived information gain accumulates passage by passage within a document and within a query session. Experimental results on multiple datasets show that PCGM can effectively predict passage cumulative gain sequences and outperforms existing models on document ranking and marginal relevance estimation tasks.
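To make the notion of passage-by-passage gain accumulation concrete, the following toy sketch may help. It is not PCGM itself, which is a BERT-based deep sequence model; it only illustrates the bookkeeping of a cumulative gain sequence under assumptions of our own: per-passage marginal gains are given as integers, gains are capped at a hypothetical maximum grade `MAX_GAIN`, and a document is scored by its final cumulative gain. The function names `cumulative_gain` and `rank_documents` are illustrative only.

```python
from itertools import accumulate
from typing import Dict, List, Sequence

# Assumption: relevance is graded on a 0..3 scale; the cap is hypothetical.
MAX_GAIN = 3


def cumulative_gain(marginal_gains: Sequence[int], max_gain: int = MAX_GAIN) -> List[int]:
    """Return the running, capped gain after each passage is read in order."""
    if any(g < 0 for g in marginal_gains):
        raise ValueError("marginal gains must be non-negative")
    # Start from 0 so that the cap also applies to the first passage,
    # then drop the initial 0 to keep one value per passage.
    running = accumulate(
        marginal_gains,
        lambda total, g: min(max_gain, total + g),
        initial=0,
    )
    return list(running)[1:]


def rank_documents(docs: Dict[str, Sequence[int]]) -> List[str]:
    """Order document ids by their final cumulative gain, highest first."""
    def final_gain(doc_id: str) -> int:
        sequence = cumulative_gain(docs[doc_id])
        return sequence[-1] if sequence else 0

    return sorted(docs, key=final_gain, reverse=True)


if __name__ == "__main__":
    # Toy marginal gains per passage (made up for illustration only).
    toy_docs = {"d1": [0, 1, 0], "d2": [1, 1, 1], "d3": []}
    for doc_id, gains in toy_docs.items():
        print(doc_id, cumulative_gain(gains))
    print(rank_documents(toy_docs))
```

In this sketch the cumulative sequence is non-decreasing by construction, reflecting the assumption that gain perceived from earlier passages is not lost later. A learned model would predict the sequence from the query and passage text rather than receive marginal gains as input.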