Raft and Beyond: Practical Consensus Mechanisms for Geo-Distributed Data Systems
DOI:
https://doi.org/10.54097/zpssz517
Keywords:
Consensus protocols, Raft algorithm, Geo-distributed databases, State machine replication, Byzantine fault tolerance, Multi-Raft, Wide-area networks, Distributed transactions, Quorum systems, Fault tolerance
Abstract
Consensus protocols form the foundational building blocks of fault-tolerant distributed data systems, enabling nodes to agree on a consistent shared state despite failures, network partitions, and heterogeneous communication delays. The Raft algorithm, designed explicitly for understandability and practical deployability, has become the de facto foundation for a wide class of modern replicated storage systems. However, as organizations require data infrastructure to span geographically distributed (GD) environments, vanilla Raft exhibits significant performance limitations rooted in its single-leader architecture and majority quorum constraints, causing write latency to scale with cross-region wide-area network (WAN) round-trip times (RTTs). This survey provides a comprehensive review of consensus mechanisms from foundational Raft semantics to advanced variants engineered for GD deployments, covering state machine replication (SMR), Multi-Raft architectures, flexible quorum designs, and Byzantine fault tolerance (BFT). We analyze prominent WAN consensus protocols including WPaxos, EPaxos, and HotStuff, examining their theoretical guarantees and practical trade-offs in latency, throughput, and operational complexity. The survey further examines how production systems such as CockroachDB, TiKV, and YugabyteDB integrate and extend these protocols to achieve global-scale consistency. By synthesizing recent advances in adaptive leader election, hierarchical consensus, hybrid protocol design, and BFT convergence with crash-fault-tolerant (CFT) alternatives, this paper provides a structured reference for researchers and engineers designing the next generation of GD data infrastructure.
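The latency claim in the abstract can be illustrated with a small sketch. It is not taken from the paper: the function name and the round-trip figures are hypothetical, and the model assumes the leader counts its own vote, contacts all followers in parallel, and incurs no disk or processing cost. Under those assumptions, a single-leader majority-quorum write commits once the slowest follower needed to complete the majority has replied.

```python
def raft_commit_latency_ms(follower_rtts_ms):
    """Approximate time for a single leader to commit one log entry.

    Simplifying assumptions (not from the paper): the leader counts its own
    vote, replicates to all followers in parallel, and disk and CPU time are
    ignored, so only network round trips matter.
    """
    cluster_size = len(follower_rtts_ms) + 1   # followers plus the leader
    quorum = cluster_size // 2 + 1             # majority of the cluster
    acks_needed = quorum - 1                   # the leader already holds the entry
    if acks_needed == 0:
        return 0.0
    # The commit completes when the acks_needed-th fastest follower replies.
    return float(sorted(follower_rtts_ms)[acks_needed - 1])


# Hypothetical RTTs (ms) from the leader to four followers in a 5-node cluster.
single_region = [1, 1, 2, 2]
geo_distributed = [2, 70, 140, 210]   # one nearby follower, three remote ones

print(raft_commit_latency_ms(single_region))    # 1.0
print(raft_commit_latency_ms(geo_distributed))  # 70.0
```

In this toy model, one in-region follower is not enough to form a majority of five, so every write waits on at least one cross-region round trip. This is the kind of cost that flexible quorum and leader placement techniques aim to reduce.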
License
Copyright (c) 2026 Computer Life

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.







