With the rise of e-commerce and advances in data collection and storage, large volumes of online reviews written directly by users are now retained and analyzed, providing a data foundation for applications in many fields. These review data are large in volume and broad in coverage, and they offer both customers and merchants a fairly comprehensive reference on the quality of online products and services. A general framework for mining, analyzing, and monitoring such data is therefore of considerable value for improving the quality of online products and services based on user feedback.

Review data often contain a large amount of noise, such as advertisements and other content unrelated to the purchase, as well as vague and highly subjective comments. To filter out such low-information content automatically, this study first addresses anomaly detection in review data. Targeting two typical kinds of anomalous reviews, non-reviews and ambiguous reviews, we extract related semantic features from the review text and, based on these features, design both an unsupervised and a supervised anomaly detection algorithm. Together these cover review anomaly detection under a broad range of application conditions and provide a high-quality data foundation for subsequent applications.

After filtering, review texts are rich in product topics and user sentiments, which jointly reflect the customer-perceived quality of online products and services. To capture how review texts change quantitatively over time, we build on existing joint sentiment-topic representations of review text and propose a sequential probabilistic generative model for online review texts, together with a corresponding online monitoring method. Under the framework of statistical process control, the method uses review texts supplied directly by users to monitor topics and sentiments jointly during the online product service process, providing decision support for quality feedback and early warning in the after-sales stage.

In addition to text, reviews on most websites are accompanied by numerical ratings, and analyzing text and ratings jointly aids both the interpretation of review content and feature extraction. To account for the dynamic correlation between the two data types, we propose a general joint probabilistic model for review texts and ratings and design an algorithm for estimating its parameters. The model characterizes the mechanism by which ratings and text are generated and extracts effective topic and sentiment features from both. Compared with baseline models, it offers clear advantages in interpretability and in predictive performance on hold-out review data, and a monitoring method built on the model improves monitoring performance.

In summary, with the core aim of evaluating and monitoring the quality of online product service processes using e-commerce user reviews, we propose a complete set of techniques and procedures for online review data, ranging from data pre-processing and anomaly detection, through probabilistic modeling, information quantification, and feature extraction, to online monitoring. This study provides an integrated framework for applying user-generated reviews to the quality evaluation and monitoring of online products and services.