Verifying Equivalence of Spark Programs

  • Shelly Grossman
  • Sara Cohen
  • Shachar Itzhaky
  • Noam Rinetzky
  • Mooly Sagiv
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10427)


Apache Spark is a popular framework for writing large-scale data processing applications. Our long-term goal is to develop automatic tools for reasoning about Spark programs. This is challenging because Spark programs combine database-like relational algebraic operations and aggregate operations, corresponding to (nested) loops, with User-Defined Functions (UDFs). In this paper, we present a novel SMT-based technique for verifying the equivalence of Spark programs.

We model Spark as a programming language whose semantics imitates Relational Algebra queries (with aggregations) over bags (multisets) and allows for UDFs expressible in Presburger arithmetic. We prove that the problem of checking equivalence is undecidable even for programs that use a single aggregation operator. Thus, we present sound techniques for verifying the equivalence of interesting classes of Spark programs, and show that they are complete under certain restrictions. We implemented our technique and applied it to a few small, but intricate, test cases.
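To make the bag-based model concrete, the following is a minimal sketch in Python, not the paper's SMT-based technique. It models bags with `collections.Counter`, defines map, filter, and fold over them, and compares two small programs by exhaustive testing on small bags. All names here (`bag_map`, `bag_filter`, `bag_fold`, `p1`, `p2`, `agree_on_small_bags`) are hypothetical and chosen for illustration. Agreement on a bounded set of inputs is only evidence of equivalence, not a proof.

```python
from collections import Counter
from itertools import combinations_with_replacement


def bag_map(f, bag):
    """Apply f to every element, preserving multiplicities."""
    out = Counter()
    for x, n in bag.items():
        out[f(x)] += n
    return out


def bag_filter(p, bag):
    """Keep only the elements satisfying p, with their multiplicities."""
    return Counter({x: n for x, n in bag.items() if p(x)})


def bag_fold(init, f, bag):
    """Aggregate all elements of the bag, starting from init."""
    acc = init
    for x in bag.elements():
        acc = f(acc, x)
    return acc


# Two candidate programs whose UDFs use only linear integer arithmetic.
def p1(bag):
    # Double every element, then keep results greater than 10.
    return bag_filter(lambda x: x > 10, bag_map(lambda x: 2 * x, bag))


def p2(bag):
    # Keep elements greater than 5, then double them.
    return bag_map(lambda x: 2 * x, bag_filter(lambda x: x > 5, bag))


def agree_on_small_bags(prog_a, prog_b, values=range(-3, 9), max_size=3):
    """Bounded check: compare outputs on every bag of at most max_size elements."""
    for size in range(max_size + 1):
        for elems in combinations_with_replacement(values, size):
            bag = Counter(elems)
            if prog_a(bag) != prog_b(bag):
                return False
    return True
```

Here `p1` and `p2` agree because, over the integers, 2x > 10 holds exactly when x > 5. A verifier would establish this for all inputs at once, whereas the bounded check above only samples small bags over a fixed range of values.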



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Shelly Grossman (1)
  • Sara Cohen (2)
  • Shachar Itzhaky (3)
  • Noam Rinetzky (1)
  • Mooly Sagiv (1)

  1. Tel Aviv University, Tel Aviv, Israel
  2. The Hebrew University of Jerusalem, Jerusalem, Israel
  3. Massachusetts Institute of Technology, Cambridge, USA
