Research on the Analysis Method and Application of Online Public Opinion Hot Spots Based on Natural Language Processing

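Author: Xie Haiguang (谢海光)
  • Student ID
    2015******
  • Degree
    Master's
  • Email
    443******com
  • Defense date
    2021.05.20
  • Advisor
    Hu Jianming (胡坚明)
  • Discipline
    Control Engineering
  • Pages
    88
  • Confidentiality level
    Public
  • Department
    025 Department of Automation
  • Chinese keywords
    自然语言处理,网络舆情分析,新词发现,DBSCAN,情感分析
  • English keywords
    natural language processing, network public opinion analysis, new word discovery, DBSCAN, sentiment analysis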

Abstract

According to available data, China had more than 900 million internet users as of March 2020, over 99% of whom accessed the internet through mobile devices. The mobile internet has become the main means by which people obtain information, express opinions, and participate in society. In recent years, natural language processing (NLP) technology has developed rapidly and has been widely applied in machine translation, social media monitoring, chatbots, targeted advertising, and other fields. However, research on the analysis of online public opinion hot spots is still rooted in traditional network technology and lags behind, particularly in its lack of in-depth mining of newly coined words and of community relationships among internet users. Based on NLP technology, this thesis studies and designs an analysis system for online public opinion hot spots. Using more than 200,000 articles and posts and more than 7 million comments collected from Tencent Information, Phoenix Information, Sina.com, Sina Weibo, and Baidu Information as the base data, and Python as the platform, it analyzes online public opinion hot spots. The main contents are as follows:

(1) Building on an existing lexicon, mutual information and information entropy are introduced and, in light of how content propagates online, a new propagation-based technique for discovering new internet words is constructed, enabling automatic identification of new words from massive data. To handle the large number of unordered, fragmented new words obtained, a Markov model is introduced to improve the bidirectional-matching Chinese word segmentation algorithm with ambiguity resolution, achieving accurate segmentation of online text and statistical analysis of hot strings, and laying a foundation for tracking and analyzing online public opinion.

(2) The theory of the DBSCAN algorithm is introduced and, by adding a penalty value, a penalty-improved DBSCAN method for constructing knowledge graphs of online public opinion is proposed. The method takes the most frequently occurring words as core points, defines the neighborhood radius (RADIUS) according to the link characteristics of web pages, computes border points and noise points, and forms a knowledge graph of online public opinion represented by clusters, supporting in-depth analysis of public opinion events.

(3) Combining the above methods with the design of new word discovery and keyword analysis, sentiment analysis, and regression analysis, an analysis system for online public opinion hot spots is studied and designed. A purpose-written web crawler collects information from Tencent Information, Phoenix Information, and other websites as the base data, and empirical application analysis is carried out on real cases of online public opinion hot spots.

In practice, the system designed in this thesis quickly and accurately discovers newly emerging information online and mines it in greater depth. It is faster, more accurate, and more intelligent than traditional online public opinion analysis systems and can provide better technical support for public opinion analysis by the relevant authorities.
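
The abstract does not give the formulas behind the new word discovery step in item (1), so the following is only a minimal Python sketch of the general idea of combining mutual information with information entropy. It assumes the common formulation in which a candidate string is scored by its lowest pointwise mutual information over all two-way splits (internal cohesion) and by the entropy of its left and right neighboring characters (boundary freedom). The function names, the thresholds `min_count`, `min_pmi`, and `min_entropy`, and the sample text are hypothetical, and the propagation-based weighting described in the thesis is not modeled here.

```python
import math
from collections import Counter, defaultdict


def candidate_scores(text, max_len=4):
    """Score every substring of length 2..max_len by cohesion and boundary freedom."""
    n = len(text)
    counts = Counter()
    left = defaultdict(Counter)   # candidate -> characters seen immediately before it
    right = defaultdict(Counter)  # candidate -> characters seen immediately after it
    for i in range(n):
        for k in range(1, max_len + 1):
            if i + k > n:
                break
            gram = text[i:i + k]
            counts[gram] += 1
            if k >= 2:
                if i > 0:
                    left[gram][text[i - 1]] += 1
                if i + k < n:
                    right[gram][text[i + k]] += 1

    def prob(gram):
        # Relative frequency among all substrings of the same length.
        return counts[gram] / (n - len(gram) + 1)

    def entropy(neighbors):
        # Shannon entropy (natural log) of the neighbor-character distribution.
        total = sum(neighbors.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in neighbors.values())

    scores = {}
    for gram, count in counts.items():
        if len(gram) < 2:
            continue
        # Cohesion: lowest pointwise mutual information over all two-way splits.
        pmi = min(
            math.log(prob(gram) / (prob(gram[:j]) * prob(gram[j:])))
            for j in range(1, len(gram))
        )
        # Freedom: the smaller of the left and right neighbor entropies.
        freedom = min(entropy(left[gram]), entropy(right[gram]))
        scores[gram] = (count, pmi, freedom)
    return scores


def discover_new_words(text, lexicon, min_count=3, min_pmi=1.0, min_entropy=0.5):
    """Return candidates absent from the lexicon that pass all three thresholds."""
    found = []
    for gram, (count, pmi, freedom) in candidate_scores(text).items():
        if gram in lexicon:
            continue
        if count >= min_count and pmi >= min_pmi and freedom >= min_entropy:
            found.append(gram)
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical sample: one coined phrase repeated in varied contexts.
    sample = "。".join(["我说奥利给吧", "他喊奥利给了", "你听奥利给呢", "她写奥利给啊"])
    print(discover_new_words(sample, lexicon=set()))
```

Taking the minimum over splits and over the two sides is a deliberately conservative choice: a fragment such as the first two characters of a longer word is always followed by the same character, so its right-neighbor entropy is zero and it is rejected. On real data the thresholds would need tuning against the corpus size.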