
STATS - A Point Access Method for Multidimensional Clusters

  • Giannis Evagorou
  • Thomas Heinis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10438)

Abstract

The ubiquity of high-dimensional data in machine learning and data mining applications makes its efficient indexing and retrieval from main memory crucial. Frequently, machine learning algorithms need to query specific characteristics of single multidimensional points. For example, given a clustered dataset, the cluster membership (CM) query retrieves the cluster to which an object belongs.

To answer this type of query efficiently, we have developed STATS, a novel main-memory index that scales to answer CM queries on increasingly large datasets. Current indexing methods are oblivious to the structure of clusters in the data. We therefore develop STATS around the key insight that exploiting the cluster information when indexing, and preserving it in the index, accelerates lookup. We show experimentally that STATS outperforms known methods with regard to retrieval time and scales well with dataset size for any number of dimensions.
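To make the CM query concrete, the following is a minimal sketch of what such a query returns, using a plain hash table as a naive baseline. It is not the STATS index, whose internals the abstract does not describe; the names `build_cm_index` and `cm_query` and the sample points are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple

# A multidimensional point, stored as a tuple so it can be used as a dict key.
Point = Tuple[float, ...]


def build_cm_index(points: Sequence[Point], labels: Sequence[int]) -> Dict[Point, int]:
    """Map each point to the label of the cluster it was assigned to.

    Assumption: the dataset has already been clustered, so `labels[i]`
    is the cluster of `points[i]`.
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    return {tuple(p): label for p, label in zip(points, labels)}


def cm_query(index: Dict[Point, int], point: Sequence[float]) -> Optional[int]:
    """Return the cluster label of `point`, or None if it is not indexed."""
    return index.get(tuple(point))


# Hypothetical clustered dataset: two clusters in two dimensions.
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1)]
labels = [0, 0, 1]
index = build_cm_index(points, labels)
```

With this baseline, `cm_query(index, (5.0, 5.1))` looks up the cluster of a single point by exact match. A hash table of this kind ignores the cluster structure of the data entirely, which is the kind of information the abstract says STATS is designed to exploit.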

Keywords

High-dimensional indexing, Clustering

Notes

Acknowledgements

This work is supported by the EU’s Horizon 2020 grant 650003 (Human Brain Project), EPSRC’s PETRAS IoT Hub, and HiPEDS (grant reference EP/L016796/1).

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Imperial College London, London, UK
