Grid-R-tree: a data structure for efficient neighborhood and nearest neighbor queries in data mining

Abstract

Multi-dimensional indexing structures have gained considerable attention in data mining. The most commonly used structures for indexing data are the R-tree and its variants, the quad-tree, and the k-d-tree. These data structures support region queries (point, window and neighborhood queries) and nearest neighbor queries, all of which are used extensively in data mining algorithms. Although these structures can execute such queries in logarithmic time, the constraints associated with them become a bottleneck in query execution on large and high-dimensional datasets. Moreover, they do not cater to the specific data access patterns of data mining algorithms. In this paper, we propose a new data structure, Grid-R-tree, a grid-based R-tree specifically designed to address the querying requirements of multiple data mining algorithms. Grid-R-tree is a simple yet effective adaptation of the R-tree using the concept of a grid. We also introduce a new query over Grid-R-tree, called the cell-wise epsilon neighborhood query (CellWiseNBH), which captures the locality in the query execution pattern of density-based clustering algorithms and enables us to redesign them for improved efficiency. Our theoretical and experimental analysis shows that the proposed data structure outperforms the conventional R-tree on neighborhood and nearest neighbor queries. The experiments were conducted on datasets of size up to 100 million and dimensionality up to 74. The results also suggest that Grid-R-tree improves the efficiency of data mining algorithms such as the k-nearest neighbor classifier and DBSCAN clustering (including the redesigned version that uses CellWiseNBH).
Additionally, an adaptive grid optimization is applied to dense cells, that is, cells whose number of indexed data points exceeds a threshold \(\tau\), to keep the load evenly distributed across cells. This results in more efficient query performance on datasets with a skewed distribution of data points.
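
To illustrate why a grid helps with epsilon neighborhood queries, the following is a minimal sketch and not the authors' implementation. It assumes a uniform grid whose cell side equals the query radius eps, and it stores cells in a plain hash map rather than in an R-tree as the paper proposes. The names GridIndex, _cell_of and neighborhood are hypothetical. With this cell size, every point within eps of a query point lies in the query's own cell or in an adjacent cell, so only those cells need to be scanned.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


class GridIndex:
    """Uniform grid over d-dimensional points (illustrative only).

    Assumption: the cell side equals eps, and cells live in a dict,
    not in an R-tree as in the paper's Grid-R-tree.
    """

    def __init__(self, points, eps):
        self.eps = eps
        self.points = [tuple(p) for p in points]
        self.cells = defaultdict(list)  # cell coordinates -> point ids
        for i, p in enumerate(self.points):
            self.cells[self._cell_of(p)].append(i)

    def _cell_of(self, p):
        # Integer cell coordinates of point p.
        return tuple(floor(x / self.eps) for x in p)

    def neighborhood(self, q):
        """Return sorted ids of indexed points within eps of q (inclusive)."""
        centre = self._cell_of(q)
        result = []
        # Scan the query's cell and every adjacent cell (3^d cells in total).
        for offset in product((-1, 0, 1), repeat=len(centre)):
            cell = tuple(c + o for c, o in zip(centre, offset))
            for i in self.cells.get(cell, ()):
                if dist(self.points[i], q) <= self.eps:
                    result.append(i)
        return sorted(result)


points = [(0.0, 0.0), (0.5, 0.5), (2.0, 2.0), (0.9, 0.0), (-0.3, 0.2)]
index = GridIndex(points, eps=1.0)
print(index.neighborhood((0.0, 0.0)))  # prints [0, 1, 3, 4]
```

Scanning 3^d cells becomes impractical as the dimensionality d grows, so this sketch is only sensible for low-dimensional data.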



Author information

Corresponding author

Correspondence to Poonam Goyal.

About this article

Cite this article

Goyal, P., Challa, J.S., Kumar, D. et al. Grid-R-tree: a data structure for efficient neighborhood and nearest neighbor queries in data mining. Int J Data Sci Anal 10, 25–47 (2020). https://doi.org/10.1007/s41060-020-00208-2

Keywords

  • Data mining
  • Neighborhood queries
  • Nearest neighbor queries
  • R-tree
  • Density-based clustering