Since the start of the 21st century, China's growing comprehensive national strength and a changing external environment have brought more tests and challenges, placing higher demands on Party conduct, clean-government construction, and anti-corruption work. Against the background of comprehensive and strict Party governance, how to study and analyze existing clean-government information and how to predict integrity risk (廉政风险) are problems that urgently need to be solved. This thesis takes the news published on disciplinary inspection and supervision websites at all levels as its basic data, applies natural language processing techniques, and combines automation theory with integrity-risk research. By constructing a knowledge graph, it achieves analysis and simple prediction of integrity risk. The main work of this thesis is as follows:
(1) Obtaining the basic data for the integrity-risk knowledge graph. Web crawlers were written in Python using regular expressions, Selenium, and BeautifulSoup. They identify the "examination and investigation", "Party discipline and administrative sanctions", and "notification and exposure" sections of disciplinary inspection and supervision websites at all levels, crawl and save the pages of each section, and then extract the page text to obtain the basic data.
(2) Building the basic model of the integrity-risk knowledge graph. According to the needs of integrity-risk research, Protégé was used for knowledge modeling to determine the entity categories and entity relations of the graph. The model was then revised according to the extracted basic data and the entity recognition results, yielding more reasonable entity categories and entity relations.
(3) Establishing corpora of integrity-risk entities and entity relations. Using the BIO format, 8,038 entities in 10 categories were annotated. Using the format (relation category, head entity, tail entity, sentence containing the entities), 8,606 entity relations of 11 types were annotated. Together these form the entity corpus and the entity-relation corpus.
(4) Completing integrity-risk knowledge extraction. Entity recognition models were trained with the CRF and BiLSTM-CRF algorithms. Because the initial recognition performance was mediocre, the thesis proposes dividing entity recognition into a basic layer and an extension layer, which recognize different integrity-risk entities one after the other; this substantially improves recognition performance. On top of entity recognition, a rule-based model was built to extract entity relations, and it performs well.
(5) Constructing the integrity-risk knowledge graph. Using the trained models, 1,608,170 entities and 1,475,797 entity relations were extracted, and the knowledge graph was built in the Neo4j graph database. Entity alignment and entity disambiguation were carried out by combining Python with Cypher.
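
The two annotation formats named in item (3) can be illustrated with a minimal Python sketch. The abstract does not list the 10 entity categories or the 11 relation types, so the labels `PER` and `ORG`, the relation name `works_at`, the placeholder sentence, and the helper names `to_bio` and `from_bio` are all assumptions made for illustration, not the thesis's actual scheme.

```python
from typing import List, NamedTuple, Tuple


class Relation(NamedTuple):
    """One relation sample: (relation category, head, tail, sentence)."""
    relation: str
    head: str
    tail: str
    sentence: str


def to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start, end, label) spans, end exclusive, into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def from_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Recover (entity text, label) pairs from a BIO-tagged sequence."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    label = ""
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open entity.
            if current:
                entities.append((" ".join(current), label))
            current, label = [], ""
    if current:
        entities.append((" ".join(current), label))
    return entities


# Hypothetical example sentence with placeholder names.
tokens = ["Zhang", "San", "of", "the", "County", "Bureau", "was", "expelled"]
tags = to_bio(tokens, [(0, 2, "PER"), (4, 6, "ORG")])
sample = Relation("works_at", "Zhang San", "County Bureau", " ".join(tokens))
```

Chinese text would normally be tagged per character rather than per whitespace-separated token; the English tokens here only keep the example readable.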
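
As a rough illustration of how Python and Cypher might be combined in item (5), the sketch below builds a parameterized Cypher `MERGE` statement for one extracted relation. It is an assumption-laden example rather than the thesis's pipeline: the function name `build_merge_query`, the node property `name`, and the labels in the usage line are hypothetical. Because Cypher does not allow node labels or relationship types to be passed as query parameters, they are interpolated into the query text and validated first, while entity names travel as parameters.

```python
import re
from typing import Dict

# Labels and relationship types are interpolated, so restrict them to
# plain identifiers before building the query text.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_merge_query(head_label: str, rel_type: str, tail_label: str) -> str:
    """Return a Cypher statement that merges two nodes and one relation."""
    for name in (head_label, rel_type, tail_label):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"MERGE (h:{head_label} {{name: $head}}) "
        f"MERGE (t:{tail_label} {{name: $tail}}) "
        f"MERGE (h)-[r:{rel_type}]->(t)"
    )


# Hypothetical labels and placeholder entity names.
query = build_merge_query("Person", "WORKS_AT", "Organization")
params: Dict[str, str] = {"head": "Zhang San", "tail": "County Bureau"}
```

`MERGE` reuses a node whose label and `name` already match, so identical surface names collapse into one node. That is only the simplest form of alignment; aliases and different people who share a name would still need separate alignment and disambiguation logic. The query and parameters could then be handed to a Neo4j driver session for execution.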