基于监控信息的在线服务系统故障发现与根因定位研究

Research on Fault Discovery and Root Cause Location for Online Service System Based on Monitoring Information

Author: 赵鋆峰
  • Student ID
    2020******
  • Degree
    Master's
  • Email
    973******com
  • Defense date
    2023.05.18
  • Advisor
    王之梁
  • Discipline
    Cyberspace Security
  • Pages
    71
  • Confidentiality level
    Public
  • Department
    412 网络研究院
  • Chinese keywords
    异常检测, 根因定位, 在线服务系统
  • English keywords
    Anomaly detection, Root cause location, Online service system

Abstract

With the rapid development and spread of Internet technology, a growing number of businesses and services have migrated to online platforms, letting people shop, learn, and access entertainment and information at any time and from any location. It is therefore crucial for enterprises and institutions to build highly available, high-performance online service systems. However, the complexity of these systems and interference from external factors make failures inevitable. When a failure occurs, it can interrupt service, degrade user experience, and even cause serious data loss and security problems.

Modern online service systems are typically composed of multiple modules with different functions that call one another. To keep the system stable and reliable, and to detect potential faults promptly or locate root causes quickly once a failure occurs, operators record and collect various kinds of monitoring information. Examples include trace monitoring data, which records the complete invocation path of a request across modules, and metric monitoring data, which tracks server performance and resource utilization. How to use this monitoring data for accurate, low-overhead, and easily deployed fault detection and root cause localization remains a difficult problem. This thesis adopts a two-stage solution: it first uses trace data to detect faults in real time and localize them to the module level, and then uses metric data to localize the specific metrics of specific machines within that module, which shortens operators' troubleshooting time. Finally, the thesis presents a prototype system for fault detection and root cause localization.

The main research content and contributions of this thesis are summarized as follows:

(1) A real-time, online trace diagnosis method named TraceMiner is proposed for fault detection and localization across modules. On the one hand, it collects the trace data generated in the system in real time, identifies how requests are passed and invoked through the service system, and stores the necessary data. On the other hand, it runs periodic anomaly detection to find potential faults and uses a root cause localization algorithm based on anomaly scores to identify the module in which the fault originated. The method processes trace data in real time with high efficiency and low cost, and localizes the root cause module with high accuracy.

(2) A multi-stage root cause metric localization method named MetricMiner is proposed for localizing root cause machine metrics within a module. The method rests on the premise that a root cause metric not only shows a clear change in local statistical terms, but is also distinctive along the time dimension (compared with its own history) and the machine dimension (compared with the same metric on other machines). It therefore adopts a multi-stage framework of coarse screening, confirmation, and clustering, and uses a scoring strategy that fuses the scores of the three stages to produce a ranked list of recommended root causes, localizing root cause metrics accurately and efficiently.

(3) A prototype system for fault detection and root cause localization is implemented. It realizes the two methods above, integrates the related baseline algorithms for comparison, and provides visualization tools that conveniently display the raw trace and metric data, the intermediate and final results of the algorithms, and the comparison among algorithms. The system is suitable for practical deployment and also supports continuous, closed-loop iteration of the algorithms.
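As a rough illustration of the multi-stage scoring idea described for MetricMiner, the sketch below ranks (machine, metric) series by fusing three per-stage scores. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: the function names, the capped z-score used in every stage, the screening threshold, the fusion weights, and the sample data are all hypothetical. In particular, a simple comparison against peer machines stands in for the clustering stage.

```python
from statistics import mean, pstdev

CAP = 10.0  # hypothetical cap so that every stage score lands in [0, 1]


def shift_score(reference, window):
    """Distance of the window mean from the reference mean, measured in
    reference standard deviations, capped and scaled to [0, 1]."""
    spread = pstdev(reference)
    gap = abs(mean(window) - mean(reference))
    if spread == 0:
        return 1.0 if gap > 0 else 0.0
    return min(gap / spread, CAP) / CAP


def rank_root_cause_metrics(series, fault_start, local_len=5,
                            screen_threshold=0.3, weights=(0.4, 0.3, 0.3)):
    """series maps (machine, metric) -> list of samples; fault_start is the
    index where the fault window begins (assumed to be at least 2).
    Returns (machine, metric, fused_score) tuples, best candidate first."""
    ranked = []
    for (machine, metric), values in series.items():
        fault = values[fault_start:]
        # Stage 1, coarse screening: local change right around the fault.
        recent = values[max(0, fault_start - local_len):fault_start]
        local = shift_score(recent, fault)
        if local < screen_threshold:
            continue  # no clear local change, drop the candidate early
        # Stage 2, confirmation (time dimension): compare with full history.
        temporal = shift_score(values[:fault_start], fault)
        # Stage 3 (machine dimension): compare with the same metric on peers.
        # This peer comparison is a stand-in for the clustering stage.
        peers = [mean(v[fault_start:]) for (m, k), v in series.items()
                 if k == metric and m != machine]
        spatial = shift_score(peers, fault) if len(peers) >= 2 else 0.0
        # Fuse the three stage scores with hypothetical weights.
        fused = sum(w * s for w, s in zip(weights, (local, temporal, spatial)))
        ranked.append((fused, machine, metric))
    ranked.sort(reverse=True)
    return [(machine, metric, round(score, 3)) for score, machine, metric in ranked]


# Made-up sample data: only host-a's cpu series jumps at index 8.
series = {
    ("host-a", "cpu"): [10, 11, 10, 12, 11, 10, 11, 12, 40, 42, 41],
    ("host-b", "cpu"): [10, 12, 11, 10, 11, 12, 10, 11, 11, 12, 10],
    ("host-c", "cpu"): [11, 10, 12, 11, 10, 11, 12, 10, 12, 11, 11],
    ("host-a", "mem"): [50, 51, 50, 52, 51, 50, 51, 52, 51, 50, 52],
}
ranking = rank_root_cause_metrics(series, fault_start=8)
print(ranking)
```

With this sample data, only the cpu series on host-a survives the coarse screening, so it is the single entry in the ranking. Screening first keeps the later, more expensive comparisons limited to a small candidate set, which is one plausible reading of why the stages are ordered from coarse to fine.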