A Unified Framework for Anomaly Detection and Root Cause Analysis in Microservice Systems
DOI: https://doi.org/10.54097/1gw77589
Keywords: Microservice Architecture, Anomaly Detection, Root Cause Analysis, Observability, Telemetry Data, Machine Learning, Service Dependency Graph, Distributed Systems, System Monitoring
Abstract
Modern software applications increasingly rely on microservice architectures for scalability, flexibility, and rapid deployment. However, this architectural paradigm introduces new complexities in monitoring system behavior, identifying anomalies, and determining their root causes across distributed services. Existing solutions often address anomaly detection and root cause analysis (RCA) in isolation, leading to fragmented insights and delayed resolution. This paper proposes a unified framework that integrates real-time anomaly detection with automated RCA using machine learning and graph-based dependency modeling. The framework continuously monitors telemetry data—including metrics, logs, and traces—and applies an ensemble of statistical and deep learning models for multivariate anomaly detection. Detected anomalies are then contextualized through a service dependency graph and analyzed using causal inference techniques to identify the most probable root causes. We evaluate the framework on both synthetic benchmarks and real-world microservice deployments. Experimental results show that it achieves high precision and recall in anomaly detection while significantly reducing RCA latency compared to baseline methods. By combining anomaly detection and RCA in a cohesive pipeline, the proposed framework enhances system observability and reduces mean time to recovery (MTTR), thus improving operational resilience in complex microservice environments.
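To make the pipeline described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single latency metric per service, swaps the ensemble of statistical and deep learning models for a plain robust z-score (median and MAD), and swaps causal inference for a simple graph heuristic: an anomalous service is reported as a candidate root cause only if none of its direct dependencies is anomalous. The service names, latency values, threshold, and helper functions are all hypothetical.

```python
from statistics import median


def robust_z(history, value):
    """Robust z-score of `value` against `history`, based on median and MAD."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        # Degenerate history: any deviation at all counts as extreme.
        return 0.0 if value == med else float("inf")
    return 0.6745 * (value - med) / mad


def detect_anomalies(history, current, threshold=3.5):
    """Return the set of services whose current value is an outlier."""
    return {
        svc
        for svc, value in current.items()
        if abs(robust_z(history[svc], value)) > threshold
    }


def rank_root_causes(anomalous, depends_on):
    """Keep anomalous services with no anomalous direct dependency.

    `depends_on` maps a caller to the services it calls. This is a
    graph heuristic standing in for causal inference (an assumption).
    """
    return sorted(
        svc
        for svc in anomalous
        if not any(dep in anomalous for dep in depends_on.get(svc, ()))
    )


# Hypothetical latency samples in milliseconds.
history = {
    "frontend": [100, 102, 98, 101, 99],
    "orders": [50, 52, 49, 51, 50],
    "payments": [30, 31, 29, 30, 32],
    "inventory": [20, 21, 19, 20, 22],
}
current = {"frontend": 180, "orders": 120, "payments": 95, "inventory": 21}

# Hypothetical service dependency graph: caller -> callees.
depends_on = {"frontend": ["orders"], "orders": ["payments", "inventory"]}

anomalous = detect_anomalies(history, current)
root_causes = rank_root_causes(anomalous, depends_on)
print(sorted(anomalous), root_causes)
```

In this toy data the slowdown in `payments` propagates upstream to `orders` and `frontend`, so all three are flagged, while the graph step narrows the candidates to `payments` alone. A real deployment would need multivariate signals from metrics, logs, and traces, and a principled causal method in place of the heuristic.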
Copyright (c) 2025 Computer Life

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.







