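With information technology now in widespread use, microservice systems have been widely adopted. Keeping a microservice system stable is critical both to user experience and to the interests of software companies. To keep applications built on microservice systems stable, operators need to closely monitor metrics at every level of the system and run anomaly detection on those metrics, so that anomalies hidden in the microservices are discovered quickly and in real time. Because of the intricate call relationships in a microservice system, a monitored anomaly may be caused by a fault in another microservice. Therefore, once operators and software engineers discover a severe anomaly in the system, they need root cause analysis to find its root cause. This paper addresses practical challenges in anomaly detection and root cause analysis and implements three algorithms for very large microservice systems. The main contributions are as follows.

Anomaly detection method for complex time-series data in microservice systems: This paper is the first to raise the problem that complex time-series data in microservice systems needs stable and effective anomaly detection, and it designs and implements the BUZZ algorithm. The main innovation of BUZZ is unsupervised anomaly detection on complex key performance indicators using a deep generative model. On monitoring data from Alibaba, its best F1 score ranges from 0.92 to 0.99, clearly outperforming other anomaly detection algorithms.

Efficient and transparent root cause analysis method for long-deployed microservice systems: This paper is the first to raise the problem that event causal graphs in long-deployed microservice systems require excessive manual configuration, and it designs and implements the DCGM algorithm. DCGM trains a causal parameter table in a supervised manner, replacing the causal rules that operations experts would otherwise summarize by hand, and then locates the root cause using path-based root cause probabilities together with likelihoods computed by a variational autoencoder. DCGM is evaluated on the microservice system of eBay. The results show that DCGM reaches the state of the art (top-1 and top-3 accuracy of 79.30% and 98.8% on the service-based dataset, and 85.3% and 96.6% on the business-domain-based dataset), even outperforming manually configured algorithms.

Fast-start root cause analysis method for short-deployed microservice systems: This paper is the first to raise the slow-start problem that manually configured rule-based algorithms and deep learning algorithms face on short-deployed microservice systems, and it designs the METARCA algorithm, which uses the meta-information in the microservice system for root cause analysis. METARCA starts quickly and is designed specifically for root cause analysis in short-deployed microservice systems. It consists of three modules: event embedding, causal attention, and a diffusion aggregation graph attention network. By imitating how operators read the meta-information of monitoring events, it infers the causal relationships between those events and locates the root cause in the microservice system. METARCA can start up within 3 months and achieve reliable results.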
With information technology now widely applied, systems based on the microservice architecture are in widespread use. Ensuring the stability of a microservice system is crucial to user experience and to the interests of software companies. To ensure the reliability of microservice-based application services, operation and maintenance personnel closely monitor the key performance indicators of the microservice system in real time and perform anomaly detection on these indicators, so that hidden abnormal states are discovered quickly. Furthermore, because of the intricate relationships among microservices, a monitored anomaly is likely caused by a fault in another microservice. Therefore, when operation and maintenance personnel and software engineers discover a critical anomaly, they need root cause analysis to find its root cause, so that the system can be repaired in time and quickly returned to its normal operating state. This paper addresses the practical challenges of anomaly detection and root cause analysis and implements three algorithms for large microservice systems. The main contributions of this paper are summarized as follows.

Anomaly detection method for complex time-series data in microservices: To the best of our knowledge, this paper is the first to point out that complex data in microservice systems requires stable and effective anomaly detection algorithms, and it proposes the Buzz algorithm to address this need. The main innovation of Buzz is unsupervised anomaly detection on complex key performance indicator data using deep generative models.
Its best F1 score on monitoring data from Alibaba is between 0.92 and 0.99, significantly outperforming Donut, a state-of-the-art unsupervised method based on a variational autoencoder without adversarial training, and Opprentice, a state-of-the-art supervised method.

An efficient and white-box root cause analysis method for long-term microservices: To the best of our knowledge, this paper is the first to identify the excessive manual effort required to build event causal graphs in microservice systems, and it designs and implements the DCGM algorithm to solve this problem. DCGM trains the causal parameter table in a supervised way, which replaces the causal rules otherwise summarized by operation and maintenance experts, and then locates the root cause through the path-based root cause probability and a historical likelihood correction. DCGM is evaluated on the microservice system of eBay. Experiments show that DCGM achieves state-of-the-art results (top-1 and top-3 accuracy of 79.30% and 98.8% on service-based datasets, and 85.3% and 96.6% on business-domain-based datasets), even outperforming manually configured algorithms. DCGM is also fast: it takes only about 3 seconds to locate the root cause for the entire microservice system.

Fast-start root cause analysis method for short-term microservices: To the best of our knowledge, this paper is the first to raise the slow-start problem in microservice systems, and it designs MetaRCA, a root cause localization algorithm that uses the meta-information in the microservice system.
MetaRCA comprises three modules (event embedding, causal attention, and a diffusion aggregation graph attention network) that imitate how operation and maintenance personnel read the meta-information attributes of monitored events, infer the causal relationships between those events, and locate the root cause of the microservice system. This addresses two problems: meta-information is difficult to exploit, and root cause localization algorithms need a long time to start. MetaRCA can start working within three months and deliver reliable results.
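
The results above are reported as a best F1 score (anomaly detection) and as top-1 and top-3 accuracy (root cause analysis). The following is a minimal sketch of how such metrics are commonly computed, not the evaluation code used in this work. The function names, the point-wise treatment of anomalies, and the "score >= threshold" rule are assumptions made for illustration; the abstract does not specify how thresholds or anomaly segments are handled.

```python
from typing import Sequence


def best_f1(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Maximum point-wise F1 over all candidate thresholds.

    Assumption: a point is flagged as anomalous when score >= threshold,
    and labels use 1 for anomalous points and 0 for normal points.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    best = 0.0
    # Every distinct score is tried as a threshold.
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def top_k_accuracy(
    rankings: Sequence[Sequence[str]], truths: Sequence[str], k: int
) -> float:
    """Fraction of incidents whose true root cause is among the first k candidates.

    Assumption: each ranking lists candidate root causes, most likely first.
    """
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not truths:
        raise ValueError("at least one incident is required")
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


if __name__ == "__main__":
    # Hypothetical anomaly scores and ground-truth labels.
    print(best_f1([0.1, 0.2, 0.9, 0.8, 0.3, 0.7], [0, 0, 1, 0, 0, 1]))

    # Hypothetical ranked root cause candidates for four incidents.
    ranked = [
        ["db", "cache", "api"],
        ["api", "db", "cache"],
        ["cache", "api", "db"],
        ["queue", "db", "api", "cache"],
    ]
    actual = ["db", "db", "db", "cache"]
    print(top_k_accuracy(ranked, actual, k=1), top_k_accuracy(ranked, actual, k=3))
```

Under this reading, a best F1 score is an upper bound over thresholds rather than the score at a single deployed threshold, and top-3 accuracy is never lower than top-1 accuracy on the same incidents.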