数据融合是数据挖掘中的一个关键环节。对于学术数据挖掘,跨数据源融合科技文献数据是构建开放学术知识图谱的根基。构建高精度学术知识图谱是深度挖掘科技情报的基础,如人才洞察、趋势预测、科技创新等。为了构建一个开放的大规模高精度的学术知识图谱,本文主要关注学术知识图谱构建过程中的多源数据融合问题以及学术知识图谱数据基准以评估知识图谱的质量。然而,该问题如今面临着实体异构性强、实体歧义性大、数据规模大、数据噪音大、缺乏全面的学术知识图谱评测基准等挑战。针对以上挑战,本文进行了如下几方面研究:

为了匹配不同数据源的大规模异构实体,本文首先提出了统一的实体匹配框架 LinKG。根据不同类型实体的独特的匹配特点,LinKG 为不同类型的实体定制不同的匹配模块。为了验证 LinKG 的不同模块的设计选择,本文基于单域和多域迁移模型刻画了不同匹配任务之间的关系。实验表明,LinKG 在两个大规模学术知识图谱之间取得超过 97% 的匹配准确率,并基于此生成和发布了亿级规模的开放学术图谱(OAG)。

学术图谱的不同类型实体中,作者的歧义性最大。从 OAG 中可以发现跨数据源的作者匹配率最低,进而可观察到不同数据源的论文作者分配存在部分不一致的问题。因此,本文从一个新的视角研究作者同名消歧——跨数据源同名消歧纠错,并提出一个框架 CrossND 来解决该问题。CrossND 包括一个细粒度的论文作者匹配模型以及两个交叉纠错方法,以此来推断哪个数据源的消歧结果是不正确的。实验表明,在两个真实数据集上,CrossND 可在零标注情况下显著提升同名消歧的效果。CrossND 被应用于 AMiner 帮助用户检查论文作者关系的正确性。

最后,本文研究如何评估学术知识图谱的质量,并提出一个全面的开放学术知识图谱数据基准 OAG-Benchmark,其包含学术知识图谱构建和学术知识图谱应用的十大学术任务。在学术图谱构建方面,OAG-Benchmark 包含知识图谱上数据融合、实体属性抽取、实体关系预测等不同类型的任务。在学术图谱应用方面,OAG-Benchmark 包含论文/专家推荐等经典任务以及学术知识问答、论文源头追溯、突破式创新预测等若干新颖的任务。OAG-Benchmark 目前包括 10 个任务、19 个数据集、50 余个基线模型以及 70 余个实验结果,可被用于全面评估学术图谱构建的精度和学术图谱应用的效果。
Data integration is a key step in data mining. For academic data mining, integrating scientific publication data from different sources is the foundation for building open academic knowledge graphs, and an accurate academic knowledge graph is in turn the basis for in-depth mining of scientific intelligence, such as talent insight, trend prediction, and technological innovation. To build an open, large-scale, high-precision academic knowledge graph, this thesis focuses on multi-source data integration for academic knowledge graphs, as well as on a data benchmark for evaluating the quality of academic knowledge graphs. These problems face several challenges: strong entity heterogeneity, high entity ambiguity, large data scale, noisy data, and the lack of a comprehensive evaluation benchmark. To address these challenges, this thesis conducts research in the following aspects.

To match large-scale heterogeneous entities across data sources, we first propose a unified entity matching framework, LinKG. Based on the distinct matching characteristics of each entity type, LinKG customizes a dedicated matching module for each type of entity. To validate the design choices of LinKG's modules, we characterize the relationships between different matching tasks using single-domain and multi-domain transfer models. Experiments show that LinKG achieves over 97% matching accuracy between two large-scale academic knowledge graphs, based on which we have generated and published the billion-scale Open Academic Graph (OAG).

Among the entity types in academic graphs, authors are the most ambiguous. Moreover, the cross-source matching rate of authors in OAG is the lowest, and the paper-author assignments from different data sources can be observed to be partially inconsistent. This thesis therefore studies author name disambiguation from a new perspective, cross-source correction of disambiguation errors, and proposes a framework named CrossND to solve it. CrossND comprises a fine-grained paper-author matching model and two cross-correction methods that infer which data source's disambiguation results are incorrect. Experiments on two real-world datasets show that CrossND significantly improves the quality of name disambiguation without any human labeling. CrossND has been deployed in AMiner to help users verify the correctness of paper-author relations.

Finally, we study how to evaluate the quality of academic knowledge graphs and propose a comprehensive open academic graph benchmark, OAG-Benchmark, which covers ten academic tasks spanning both the construction and the application of academic knowledge graphs. On the construction side, OAG-Benchmark includes tasks such as data integration over knowledge graphs, entity attribute extraction, and entity relationship prediction. On the application side, it includes classic tasks such as paper/expert recommendation, as well as several novel tasks, such as academic question answering, paper source tracing, and breakthrough-innovation prediction. OAG-Benchmark currently comprises 10 tasks, 19 datasets, more than 50 baseline models, and more than 70 experimental results, and can be used to comprehensively evaluate both the accuracy of academic graph construction and the effectiveness of academic graph applications.
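The abstract does not spell out LinKG's per-type matching modules. As a toy illustration only (not LinKG's actual method), the sketch below greedily links entities across two sources by normalized name similarity; the entity names and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so surface variants compare equal."""
    return " ".join(name.lower().split())


def match_entities(source_a, source_b, threshold=0.9):
    """Greedy one-to-one matching of entity names across two sources
    by string similarity; returns a list of (a, b, score) triples."""
    matches = []
    used = set()
    for a in source_a:
        best, best_score = None, threshold
        for b in source_b:
            if b in used:
                continue
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score >= best_score:
                best, best_score = b, score
        if best is not None:
            used.add(best)
            matches.append((a, best, best_score))
    return matches


# Hypothetical affiliation names from two sources; only one pair links.
pairs = match_entities(
    ["Tsinghua University", "MIT"],
    ["tsinghua  university", "Stanford University"],
)
```

Real systems at OAG scale would replace the quadratic greedy loop with blocking/hashing to generate candidate pairs, and the string ratio with learned type-specific matchers, which is what the dedicated modules in LinKG are for.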
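CrossND's cross-correction methods are not detailed in the abstract; the minimal sketch below only illustrates the inconsistency signal they start from: given the paper sets two sources assign to the same matched author profile, papers present in exactly one source are candidates for a disambiguation error in one of the sources. The paper IDs are hypothetical.

```python
def flag_inconsistent(assign_a: set, assign_b: set):
    """Return papers assigned to an author in only one of two sources.

    Each such paper signals a potential disambiguation error: either the
    assigning source attributed it wrongly, or the other source missed it.
    Deciding which case holds is the job of a matching model, not this check.
    """
    only_a = assign_a - assign_b  # assigned in source A but not in B
    only_b = assign_b - assign_a  # assigned in source B but not in A
    return only_a, only_b


# Hypothetical paper-author assignments for one author in two sources.
a_only, b_only = flag_inconsistent({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
```

In the thesis, the actual decision of which source is wrong is made by the fine-grained paper-author matching model and the two cross-correction methods, with no human labels.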