High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads

  • Ashish Gupta
  • Jeff Shute
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 337)


Google’s Ads Data Infrastructure systems run the multi-billion dollar ads business at Google. High availability and strong consistency are critical for these systems. While most distributed systems handle machine-level failures well, handling datacenter-level failures is less common. In our experience, handling them is critical for running truly high-availability systems. Most of our systems (e.g., Photon, F1, Mesa) now support multi-homing as a fundamental design property. Multi-homed systems run live in multiple datacenters all the time, adaptively moving load between datacenters, with the ability to handle outages of any scale completely transparently.

This paper focuses primarily on stream processing systems, and describes our general approaches for building high availability multi-homed systems, discusses common challenges and solutions, and shares what we have learned in building and running these large-scale systems for over ten years.


Keywords: Stream processing, Distributed systems, Multi-homing, Databases



We would like to thank the teams inside Google who built and ran the systems we have described, and the earlier generations of systems that informed our current designs. We would like to thank Divyakant Agrawal for his help preparing this paper.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Google Inc., Mountain View, USA
