BBoxDB: a distributed and highly available key-bounding-box-value store

Abstract

BBoxDB is a distributed and highly available key-bounding-box-value store, which is designed to handle multi-dimensional big data. To handle large amounts of data, the software splits the stored data into multi-dimensional shards and spreads them across a cluster of nodes. Unlike existing key-value stores, BBoxDB stores each value together with an n-dimensional, axis-parallel bounding box. The bounding box describes the spatial location of the value in an n-dimensional space. Multi-dimensional data can be retrieved by using range queries, which are efficiently supported by indices. A space partitioner (e.g., a K-D Tree, a Quad-Tree or a Grid) is used to split the n-dimensional space into disjoint regions (distribution regions). Distribution regions are created dynamically, based on the stored data. BBoxDB can handle growing and shrinking datasets. The data redistribution is performed in the background and does not affect the availability of the system; read and write access is still possible at any time. BBoxDB works with distribution groups; the data of all tables in a distribution group is distributed in the same way (co-partitioned). Spatial joins on co-partitioned tables can be executed efficiently without data shuffling between nodes. BBoxDB supports spatial joins out of the box using the bounding boxes of the stored data. The joins are supported by a spatial index and executed in a distributed and parallel manner on the nodes of the cluster.
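
The data model described above can be illustrated with a minimal sketch. It is not BBoxDB's API: the names BBoxTuple, intersects, range_query and spatial_join are hypothetical, the linear scan stands in for the index-supported range query, and the nested loop stands in for the distributed, index-supported join.

```python
from dataclasses import dataclass
from typing import List, Tuple

# An axis-parallel bounding box: one (low, high) interval per dimension.
BoundingBox = Tuple[Tuple[float, float], ...]


@dataclass(frozen=True)
class BBoxTuple:
    """Hypothetical key-bounding-box-value tuple (illustrative only)."""
    key: str
    bbox: BoundingBox
    value: bytes


def intersects(a: BoundingBox, b: BoundingBox) -> bool:
    """Two boxes intersect if their intervals overlap in every dimension."""
    if len(a) != len(b):
        raise ValueError("bounding boxes must have the same dimensionality")
    return all(a_lo <= b_hi and b_lo <= a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b))


def range_query(tuples: List[BBoxTuple], query: BoundingBox) -> List[BBoxTuple]:
    """Return all tuples whose bounding box intersects the query box."""
    return [t for t in tuples if intersects(t.bbox, query)]


def spatial_join(left: List[BBoxTuple],
                 right: List[BBoxTuple]) -> List[Tuple[str, str]]:
    """Return the key pairs whose bounding boxes intersect (nested loop)."""
    return [(l.key, r.key)
            for l in left for r in right
            if intersects(l.bbox, r.bbox)]


if __name__ == "__main__":
    roads = [BBoxTuple("road_1", ((0.0, 5.0), (0.0, 1.0)), b"..."),
             BBoxTuple("road_2", ((10.0, 12.0), (10.0, 12.0)), b"...")]
    forests = [BBoxTuple("forest_1", ((4.0, 6.0), (0.5, 3.0)), b"...")]
    print([t.key for t in range_query(roads, ((4.0, 5.0), (0.0, 2.0)))])
    print(spatial_join(roads, forests))
```

Because only bounding boxes are compared, the result of such a join is a candidate set; an exact geometric test on the values would have to follow as a refinement step.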

Notes

  1.

    In Sect. 2.2 we discuss how n-dimensional point data could also be stored efficiently in a KVS.

  2.

    “If a geometry or geography shares any portion of space then they intersect.” [39]

  3.

    Most KVS use binary search on a key-sorted storage to retrieve the tuples. For smaller datasets with a predictable size, hashing can be used and tuple retrieval can be done with a time complexity of \(\mathcal{O}(1)\).

  4.

    Ranges of keys are assigned to the nodes. For example, node \(n_1\) stores the tuples whose keys begin with \([a-d]\), node \(n_2\) stores the data for the keys \([e-g]\), and so on. When a node becomes overloaded, the data is repartitioned: the old range is split into two parts and another node becomes responsible for one of the parts.

  5.

    A hash function is applied on the tuple’s key. The value range of the function is mapped to the available nodes. The mapping determines which node stores which data. Consistent hashing [40] makes it possible to add and remove nodes without repartitioning the already stored data.

  6.

    An entity describes an object with one or more attributes. A record consists of a collection of fields and represents the technical view of a stored entity. In this paper, we use both terms synonymously.

  7.

    When the data type of the attribute is not numeric (e.g., a string or a JPEG-encoded image), the values need to be mapped to a numeric data type before the min and max functions can be applied. This can be done in several ways, for example with a perfect hash function [55], by treating the bytes of the data type as numbers, or with a custom mapping function.

  8.

    These indices are typically built independently on each node. Building a global index over the complete data requires a lot of coordination between the nodes; therefore, most DKVS work with local secondary indices.

  9.

    DKVS like Cassandra also provide eventual consistency to ensure that the system can deal with network or node outages. Eventual consistency does not mean that replicas are outdated for a long time. During the regular operation of the cluster without outages, the replicas are updated with every write operation [41, p. 162ff]. However, when outages occur, replicas can contain outdated data for a longer period before they eventually become synchronized with the latest version of the data. Cassandra uses techniques like timestamps, gossip or read repair to ensure that the replicas become synchronized. BBoxDB implements similar techniques but does not improve on them; therefore, the implementation is not covered in this paper. Some implementation details can be found in the documentation of BBoxDB [12].

  10.

    Usually, only the split positions are stored in a K-D Tree. The data structure in ZooKeeper instead contains a description of the space which is covered by a region. This makes it possible to reuse this data structure with other space partitioners (e.g., the Quad-Tree or the grid).

  11.

    On a replicated distribution group, several nodes could notice at the same time that the region needs to be split. This could lead to unpredictable behavior of BBoxDB. To prevent such situations, the state of the distribution region is changed from ACTIVE to ACTIVE-FULL (see time \(t_1\) in Fig. 10). ZooKeeper is used as a coordinator that allows exactly one state change. Only the instance that performs this transition successfully executes the split.

  12.

    BBoxDB contains different resource allocators, which choose the nodes based on the available hardware, such as free disk space, number of hard disks, total memory or total CPUs. The resource allocator to be used can be chosen when a distribution group is created.

  13.

    The efficiency of operations without a hyperrectangle (such as delete operations) is improved by another index. This is discussed in Sect. 3.10.

  14.

    Other functions for mapping the key to a point in the one-dimensional space can also be used. It is only important that all nodes use the same function for reading and writing index entries.

  15.

    As discussed in Sect. 3.1.2, each tuple contains a version timestamp to identify the most recent version of the tuple. To delete a tuple, a deletion marker is stored on disk (see Sect. 3.3.2). Deletion markers are also stored together with a version timestamp. This makes it possible to store a new version of a tuple at time \(t_1\) and to apply a deletion operation later that affects only the tuples stored before time \(t_1\). This is needed to ensure that a version of the tuple is stored in BBoxDB at any time. Otherwise, the tuple would have to be deleted before the new version is stored. Such an implementation would lead to missing tuples in BBoxDB when a client executes a read request between the deletion and the put operation.

  16.

    Systems that use a static grid to work with multi-dimensional data, like Distributed Secondo, use more partitions than nodes to allow a dynamic scaling of the cluster. 128 partitions is a common size in a cluster of ten nodes (see [46] for more details).

  17.

    A further problem with the grid approach is that the creation of the grid depends on the dimensionality of the data. For example, the grid operator in SECONDO needs a concrete implementation for each dimensionality. At the moment, SECONDO contains implementations for two- and three-dimensional grids. Using an unsupported dimensionality leads to additional work for creating an appropriate grid operator. The K-D Tree in BBoxDB can be used immediately for any dimensionality without any adjustment.

  18.

    Tiny MD-HBase is limited to two-dimensional point data; therefore, we compare Tiny MD-HBase only with this dataset. To import the OSM point dataset, we modified Tiny MD-HBase and added an import function for GeoJSON data. In addition, we added some statistics about the scanned data. Our version of Tiny MD-HBase is available on GitHub [56].

  19.

    Spatial Hadoop can only handle two-dimensional data; therefore, range queries were performed only on the two-dimensional datasets.

  20.

    The source code of the baseline approach can be found in the BBoxDB repository at GitHub [13].

  21.

    The source code of the baseline approach can be found in the BBoxDB repository at GitHub [13].

  22.

    We also tried to execute the spatial join with Spatial Hadoop only on bounding boxes. We created new datasets which contained the bounding boxes of the OSM datasets. Afterwards, the spatial join was executed on these datasets. However, the execution time of the join increased by a factor of 5. We assume this is an error in the current implementation and therefore do not show the execution time in the figure of the experiment.

  23.

    This paper covers only failures in components of BBoxDB. BBoxDB uses ZooKeeper as a coordinator. ZooKeeper itself is also a highly available distributed system: multiple nodes run ZooKeeper in parallel and elect one leader. The leader is responsible for performing changes on the stored data. A leader failure results in a new leader election; during this time, ZooKeeper clients have to wait before their requests are processed. A discussion of handling failures in ZooKeeper is already contained in papers like [38, p. 11] and is not part of this paper.
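
The two placement schemes from Notes 4 and 5 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any particular KVS: the class names RangePartitioner and ConsistentHashRing are hypothetical, each node gets exactly one position on the hash ring (real systems typically use several virtual positions per node), and SHA-256 is an arbitrary choice of hash function.

```python
import bisect
import hashlib
from typing import Dict, List


class RangePartitioner:
    """Range partitioning (Note 4): each node owns a contiguous key range."""

    def __init__(self, split_keys: List[str], nodes: List[str]) -> None:
        # split_keys must be sorted; split_keys[i] is the first key of nodes[i + 1].
        if len(nodes) != len(split_keys) + 1:
            raise ValueError("need exactly one more node than split keys")
        self.split_keys = list(split_keys)
        self.nodes = list(nodes)

    def node_for(self, key: str) -> str:
        return self.nodes[bisect.bisect_right(self.split_keys, key)]

    def split(self, split_key: str, new_node: str) -> None:
        """Split the range containing split_key; new_node takes the upper part."""
        if split_key in self.split_keys:
            raise ValueError("range is already split at this key")
        pos = bisect.bisect_right(self.split_keys, split_key)
        self.split_keys.insert(pos, split_key)
        self.nodes.insert(pos + 1, new_node)


class ConsistentHashRing:
    """Consistent hashing (Note 5): keys and nodes share one hash ring."""

    def __init__(self) -> None:
        self._positions: List[int] = []   # sorted ring positions of the nodes
        self._owner: Dict[int, str] = {}  # ring position -> node name

    @staticmethod
    def _hash(text: str) -> int:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node: str) -> None:
        pos = self._hash(node)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        pos = self._hash(node)
        self._positions.remove(pos)
        del self._owner[pos]

    def node_for(self, key: str) -> str:
        if not self._positions:
            raise LookupError("the ring contains no nodes")
        # The first node position after the key's position owns the key;
        # the modulo wraps around at the end of the ring.
        idx = bisect.bisect_right(self._positions, self._hash(key))
        return self._owner[self._positions[idx % len(self._positions)]]
```

In the ring sketch, adding a node changes the owner only for the keys that fall between the new node's position and its predecessor; all other keys stay where they are. With range partitioning, a split likewise touches only the range that is divided.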

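The interplay of version timestamps and deletion markers from Note 15 can be made concrete with a toy in-memory store. The sketch assumes integer timestamps and an append-only list per key; VersionedStore and its methods are hypothetical names and do not reflect BBoxDB's storage engine.

```python
from typing import Dict, List, Optional, Tuple

# A stored entry is (version_timestamp, value); the value None is a deletion marker.
Entry = Tuple[int, Optional[bytes]]


class VersionedStore:
    """Append-only toy store (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[Entry]] = {}

    def put(self, key: str, value: bytes, version: int) -> None:
        self._entries.setdefault(key, []).append((version, value))

    def delete(self, key: str, version: int) -> None:
        # Nothing is removed; a marker with its own version timestamp is appended.
        self._entries.setdefault(key, []).append((version, None))

    def get(self, key: str) -> Optional[bytes]:
        entries = self._entries.get(key)
        if not entries:
            return None
        # The entry with the highest version timestamp decides the result.
        _, value = max(entries, key=lambda entry: entry[0])
        return value


if __name__ == "__main__":
    store = VersionedStore()
    store.put("k", b"old", version=5)
    store.put("k", b"new", version=10)  # new version stored at t_1 = 10
    store.delete("k", version=9)        # affects only versions older than t_1
    print(store.get("k"))
```

A reader that arrives between the put and the later delete sees the new version in both cases, so the tuple is never missing.
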
References

  1.

    Website of the Apache Accumulo project. https://accumulo.apache.org/, 2017. Online Accessed 05 Oct 2017

  2.

    Website of the Apache CouchDB project, 2018. http://couchdb.apache.org/. Online Accessed 15 April 2018

  3.

    Website of Apache Hadoop project. http://hadoop.apache.org/, 2018. Online Accessed 15 April 2018

  4.

    Apache software license, version 2.0, 2004. http://www.apache.org/licenses/. Online Accessed 15 May 2017

  5.

    Athanassoulis, M., Kester, M.S., Maas, L.M., Stoica, R., Idreos, S., Ailamaki, A., Callaghan, M.: Designing access methods: The RUM conjecture. In Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15–16, 2016, pp. 461–466 (2016)

  6.

    Baker, H.C., Hewitt, C.: The incremental garbage collection of processes. In Proceedings of the 1977 Symposium on Artificial Intelligence and Programming Languages, pp. 55–59, New York, NY, USA, ACM (1977)

  7.

    BBoxDB at the maven repository, 2018. https://maven-repository.com/artifact/org.bboxdb. Online Accessed 15 April 2018

  8.

    Docker image of BBoxDB on DockerHub, 2018. https://hub.docker.com/r/jnidzwetzki/bboxdb/. Online Accessed 15 April 2018

  9.

    Website of Docker Compose, 2018. https://docs.docker.com/compose/. Online Accessed 15 April 2018

  10.

    Website of the BBoxDB project. http://bboxdb.org, 2018. Online Accessed 03 Feb 2018

  11.

    Website of the Docker project, 2018. https://www.docker.com/. Online Accessed 15 Apr 2018

  12.

    The network protocol of BBoxDB. https://jnidzwetzki.github.io/bboxdb//dev/network.html, 2019. Online Accessed 16 Jul 2019

  13.

    Cassandra based baseline approach for performance evaluation, 2018. https://github.com/jnidzwetzki/bboxdb/blob/master/bboxdb-experiments/src/main/java/org/bboxdb/experiments/TestBaselineApproach.java. Online Accessed 21 Oct 2018

  14.

    Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)

  15.

    Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

  16.

    Böhm, C., Klump, G., Kriegel, H.P.: XZ-Ordering: A Space-Filling Curve for Objects with Spatial Extension. In Advances in Spatial Databases, 6th International Symposium, SSD’99, Hong Kong, China, July 20–23, 1999, Proceedings, pp. 75–90 (1999)

  17.

    Bracciale, Lorenzo, Bonola, Marco, Loreti, Pierpaolo, Bianchi, Giuseppe, Amici, Raul, Rabuffi, Antonello: CRAWDAD dataset roma/taxi (v. 2014-07-17). Downloaded from https://crawdad.org/roma/taxi/20140717, July (2014)

  18.

    Cassandra - ALLOW FILTERING explained, 2019. https://www.datastax.com/dev/blog/allow-filtering-explained-2. Online Accessed 15 Jul 2019

  19.

    Cassandra - Create Index, 2019. https://docs.datastax.com/en/archived/cql/3.3/cql/cql_reference/cqlCreateIndex.html. Online Accessed 15 Jul 2019

  20.

    Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2), 4:1–4:26 (2008)

  21.

    Transaction Processing Performance Council. TPC BENCHMARK H (Decision Support) Standard Specification. http://www.tpc.org/tpch/. Online Accessed 22 April 2018

  22.

    Website of Elasticsearch. https://www.elastic.co/products/elasticsearch/, 2018. Online Accessed 23 April 2018

  23.

    Eldawy, A., Mokbel, M.F.: SpatialHadoop: A MapReduce Framework for Spatial Data. In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13–17, 2015, pp. 1352–1363, (2015)

  24.

    Escriva, R., Wong, B., Sirer, E.G.: HyperDex: A Distributed, Searchable Key-value Store. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM ’12, pp 25–36, New York, NY, USA, ACM (2012)

  25.

    Finkel, R.A., Bentley, J.L.: Quad trees a data structure for retrieval on composite keys. Acta Inf. 4(1), 1–9 (1974)

  26.

    Fox, A., Eichelberger, C., Hughes, J., Lyon, S.: Spatio-temporal indexing in non-relational distributed databases. In 2013 IEEE International Conference on Big Data, pages 291–299, October (2013)

  27.

    Website of GeoCouch. https://github.com/couchbase/geocouch, 2018. Online Accessed 23 April 2018

  28.

    The Wikipedia article about Geohashing. https://en.wikipedia.org/wiki/Geohash, 2018. Online Accessed 03 Feb 2018

  29.

    Website of the GeoMesa project. http://www.geomesa.org, 2017. Online Accessed 05 Oct 2017

  30.

    Ghemawat, S., Gobioff, H., Leung, S.T.: The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, pp 29–43, New York, NY, USA, ACM (2003)

  31.

    Güting, R.H., Behr, T., Düntgen, C.: Secondo: A platform for moving objects database research and for publishing and integrating research implementations. IEEE Data Eng. Bull. 33(2), 56–63 (2010)

  32.

    Güting, R.H., Schneider, M.: Moving Objects Databases. Morgan Kaufmann, Los Altos (2005)

  33.

    Guttman, A.: R-trees: a dynamic index structure for spatial searching. SIGMOD Rec. 14(2), 47–57 (1984)

  34.

    Han, D., Stroulia, E.: HGrid: A Data Model for Large Geospatial Data Sets in HBase. pp. 910–917, 06 (2013)

  35.

    Website of Apache HBase. https://hbase.apache.org/, 2018. Online Accessed 12 Feb 2018

  36.

    HBase - Secondary Index, 2019. http://hbase.apache.org/book.html#secondary.indexes. Online Accessed 15 Jul 2019

  37.

    Hughes, J., Annex, A., Eichelberger, C., Fox, A., Hulbert, A., Ronquest, M.: Geomesa: a distributed architecture for spatio-temporal fusion. In Geospatial Informatics, Fusion, and Motion Video Analytics V, 94730F, volume 9473 of Proceedings SPIE, pp. 9473–9473–13, (2015)

  38.

    Hunt, P., Konar, M., Junqueira, F.P., Reed, B.: Zookeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIXATC’10, pages 11–25, Berkeley, CA, USA, USENIX Association (2010)

  39.

    ST_Intersects - Spatial Relationships and Measurements, 2018. http://postgis.net/docs/ST_Intersects.html. Online Accessed 15 May 2018

  40.

    Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., Lewin, D.: Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, STOC ’97, pp. 654–663, New York, NY, USA, ACM (1997)

  41.

    Kleppmann, M.: Designing Data-Intensive Applications. O’Reilly, Beijing (2017). ISBN 978-1-4493-7332-0

  42.

    Lakshman, A., Malik, P.: Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev. 44(2), 35–40 (2010)

  43.

    Li, S., Hu, S., Ganti, R.K., Srivatsa, M., Abdelzaher, T.F.: Pyro: A Spatial-Temporal Big-Data Storage System. In 2015 USENIX Annual Technical Conference, USENIX ATC ’15, July 8–10, Santa Clara, CA, USA, pp. 97–109, (2015)

  44.

    Website of MongoDB project. https://www.mongodb.com/, 2018. Online Accessed 23 Feb 2018

  45.

    Morton, G.M.: A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing. International Business Machines Company, Ottawa (1966)

  46.

    Nidzwetzki, J.K., Güting, R.H.: Distributed secondo: an extensible and scalable database management system. Distrib. Parallel Databases 35(3–4), 197–248 (2017)

  47.

    Nidzwetzki, J.K., Güting, R.H.: BBoxDB - A Scalable Data Store for Multi-Dimensional Big Data (Demo-Paper). In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM ’18, pp. 1867–1870, New York, NY, USA, ACM (2018)

  48.

    Nishimura, S., Das, S., Agrawal, D., Abbadi, A.E.: MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. In Proceedings of the 2011 IEEE 12th International Conference on Mobile Data Management—Volume 01, MDM ’11, pp. 7–16, Washington, DC, USA, IEEE Computer Society (2011)

  49.

    O’Neil, P., Cheng, E., Gawlick, D., O’Neil, E.: The log-structured merge-tree (lsm-tree). Acta Inf. 33(4), 351–385 (1996)

  50.

    Website of the Open Street Map Project, 2018. http://www.openstreetmap.org. Online Accessed 15 Apr 2018

  51.

    Orenstein, J.: A comparison of spatial query processing techniques for native and parameter spaces. SIGMOD Rec. 19(2), 343–352 (1990)

  52.

    Tamer Özsu, M., Valduriez, P.: Principles of Distributed Database Systems, 3rd edn. Springer, New York (2011)

  53.

    Patel, J.M., DeWitt, D.J.: Partition based spatial-merge join. SIGMOD Rec. 25(2), 259–270 (1996)

  54.

    Samet, H.: Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers Inc., San Francisco (2005)

  55.

    Sprugnoli, R.: Perfect hashing functions: a single probe retrieving method for static sets. Commun. ACM 20(11), 841–850 (1977)

  56.

    Modified version of Tiny MD-HBase on Github, 2018. https://github.com/jnidzwetzki/Tiny-MD-HBase. Online Accessed 15 Jul 2018

  57.

    The Tiny MD-HBase project on Github, 2018. https://github.com/shojinishimura/Tiny-MD-HBase. Online Accessed 26 Apr 2018

  58.

    Vogels, W.: Eventually consistent. Commun. ACM 52(1), 40–44 (2009)

  59.

    Whitby, M.A., Fecher, R., Bennight, C.: GeoWave: Utilizing Distributed Key-Value Stores for Multidimensional Data. In Advances in Spatial and Temporal Databases - 15th International Symposium, SSTD 2017, Arlington, VA, USA, August 21–23, 2017, Proceedings, pp. 105–122, (2017)

  60.

    Zhang, S., Han, J., Liu, Z., Wang, K., Xu, Z.: SJMR: parallelizing spatial join with mapreduce on clusters. In Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31 - September 4, 2009, New Orleans, Louisiana, USA, pp. 1–8, (2009)

  61.

    Zhou, X., Zhang, X., Wang, Y., Li, R., Wang, S.: Efficient Distributed Multi-dimensional Index for Big Data Management. In Proceedings of the 14th International Conference on Web-Age Information Management, WAIM’13, pp. 130–141, Berlin, Heidelberg, Springer (2013)

Acknowledgements

We are grateful for the free license of JProfiler, which ej-technologies GmbH provided for the BBoxDB open source project. The profiler helped us to speed up the implementation of BBoxDB significantly.

Author information

Correspondence to Jan Kristof Nidzwetzki.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Nidzwetzki, J.K., Güting, R.H. BBoxDB: a distributed and highly available key-bounding-box-value store. Distrib Parallel Databases 38, 439–493 (2020). https://doi.org/10.1007/s10619-019-07275-w

Keywords

  • Distributed data store
  • Storage engine
  • Key-bounding-box-value store
  • Multi-dimensional data store