Distributed Secondo: an extensible and scalable database management system

Abstract

This paper describes a novel method for coupling a standalone database management system (DBMS) with a highly scalable key-value store. The system uses Apache Cassandra for data storage and the extensible DBMS Secondo as its query processing engine. The result is a distributed, general-purpose DBMS that is highly scalable and fault tolerant. The logical ring of Cassandra is used to split the input data into smaller units of work (UOWs) that can be processed independently. A decentralized algorithm assigns the UOWs to query processing nodes. If a node fails, its UOWs are recalculated on a different node. All data models (e.g., relational, spatial, and spatio-temporal) and functions (e.g., filters, aggregates, joins, and spatial joins) implemented in Secondo can be used in a scalable way without changing their implementation. Many aspects of the distribution are hidden from the user, and existing sequential queries can be easily converted into parallel ones.
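
The idea of deriving units of work from a logical ring and reassigning them after a node failure can be illustrated with a small sketch. This is not the algorithm from the paper, whose details the abstract does not give. It is a minimal illustration under stated assumptions: each node owns one integer token, a unit of work is the token range between a node's predecessor and its own token, and the range of a failed node is taken over by the next live node clockwise on the ring. The names `token_ranges` and `assign_units` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A unit of work is modelled as a token range (start, end] on the ring.
TokenRange = Tuple[int, int]


def token_ranges(node_tokens: Dict[str, int]) -> Dict[str, TokenRange]:
    """Map each node to the range (predecessor_token, own_token] it owns."""
    ordered = sorted(node_tokens.items(), key=lambda item: item[1])
    ranges: Dict[str, TokenRange] = {}
    for i, (node, token) in enumerate(ordered):
        # For i == 0 the index -1 wraps around to the last node on the ring.
        predecessor_token = ordered[i - 1][1]
        ranges[node] = (predecessor_token, token)
    return ranges


def assign_units(node_tokens: Dict[str, int],
                 alive: Set[str]) -> Dict[str, List[TokenRange]]:
    """Assign every token range to a live node.

    Assumption of this sketch: a live node keeps its own range, and the
    range of a failed node goes to the next live node clockwise.
    """
    ordered = [node for node, _ in
               sorted(node_tokens.items(), key=lambda item: item[1])]
    if not any(node in alive for node in ordered):
        raise ValueError("no live node left to process the units of work")

    ranges = token_ranges(node_tokens)
    assignment: Dict[str, List[TokenRange]] = {
        node: [] for node in ordered if node in alive
    }
    for i, owner in enumerate(ordered):
        j = i
        # Walk clockwise until a live node is found.
        while ordered[j % len(ordered)] not in alive:
            j += 1
        assignment[ordered[j % len(ordered)]].append(ranges[owner])
    return assignment


if __name__ == "__main__":
    tokens = {"node_a": 10, "node_b": 100, "node_c": 200}
    print(assign_units(tokens, alive={"node_a", "node_b", "node_c"}))
    print(assign_units(tokens, alive={"node_a", "node_c"}))
```

Every node that sees the same ring and the same set of live nodes computes the same assignment locally, so the sketch needs no central coordinator. How the actual system detects failures and agrees on the set of live nodes is outside its scope.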

Keywords

Distributed databases, Spatial data processing, Apache Cassandra, Fault tolerance, Big data

Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Faculty of Mathematics and Computer Science, FernUniversität Hagen, Hagen, Germany
