Annals of Data Science, Volume 5, Issue 4, pp 549–563

Fractal Dimension Calculation for Big Data Using Box Locality Index

  • Rong Liu
  • Robert Rallo
  • Yoram Cohen

Abstract

The box-counting approach for fractal dimension calculation is scaled up to big data using a data structure named the box locality index (BLI). The BLI is constructed as key-value pairs in which the key indexes the location of a “box” (i.e., a grid cell in the multi-dimensional space) and the value counts the number of data points inside that box (i.e., the “box occupancy”). This key-value structure significantly simplifies the traditionally used hierarchical structure and encodes only the information that the box-counting approach requires. Moreover, because box occupancies (the values) associated with the same index (the key) are aggregatable, the BLI gives the box-counting approach the scalability needed for fractal dimension calculation on big data with distributed computing techniques (e.g., MapReduce and Spark). Taking advantage of the BLI, MapReduce and Spark methods for fractal dimension calculation of big data are developed; they conduct box-counting at each grid level as a cascade of MapReduce/Spark jobs executed in a bottom-up fashion. In an empirical validation, the MapReduce and Spark methods demonstrated good effectiveness and efficiency in the fractal dimension calculation of a large synthetic dataset. In summary, this work provides an efficient solution for estimating the intrinsic dimension of big data, which is essential for many machine learning methods and data analytics tasks, including feature selection and dimensionality reduction.
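
To make the construction concrete, below is a minimal PySpark sketch of BLI-based box-counting. It assumes data points lie in the unit hypercube [0, 1)^d and uses a dyadic grid with 2^g cells per dimension at grid level g; the function names (box_key, coarsen) and the key encoding are illustrative assumptions rather than the paper's exact implementation. The coarsening step mirrors the bottom-up cascade of Spark jobs described above: halving each cell coordinate moves the BLI one grid level up, and reduceByKey re-aggregates the occupancies.

```python
# Minimal PySpark sketch of BLI-based box-counting (illustrative, not the
# paper's exact implementation). Assumes points lie in [0, 1)^d.
from pyspark import SparkContext
import numpy as np

sc = SparkContext(appName="bli-box-counting")

FINEST_LEVEL = 8  # finest grid: 2**8 cells per dimension

def box_key(point, level):
    # BLI key: the integer cell coordinates of the box containing `point`
    # at grid size r = 2**-level.
    cells = 2 ** level
    return tuple(min(int(x * cells), cells - 1) for x in point)

def coarsen(bli):
    # Bottom-up cascade step: halving each cell coordinate moves one grid
    # level up; reduceByKey re-aggregates the box occupancies, which is
    # what makes the BLI scalable across a cluster.
    return (bli.map(lambda kv: (tuple(c // 2 for c in kv[0]), kv[1]))
               .reduceByKey(lambda a, b: a + b))

# Synthetic 2-D data; in practice this would be an RDD loaded from
# distributed storage.
points = sc.parallelize(np.random.rand(100000, 2).tolist())

# Build the BLI at the finest grid level as (box index, box occupancy) pairs.
bli = (points.map(lambda p: (box_key(p, FINEST_LEVEL), 1))
             .reduceByKey(lambda a, b: a + b))

# Walk up the grid hierarchy, recording log r and log N(r) at each level.
log_r, log_n = [], []
for level in range(FINEST_LEVEL, 1, -1):
    log_r.append(np.log(2.0 ** -level))
    log_n.append(np.log(bli.count()))  # N(r): number of occupied boxes
    bli = coarsen(bli)

# N(r) ~ r**(-D0), so the box-counting dimension is the negated slope of
# the log-log fit.
slope, _ = np.polyfit(log_r, log_n, 1)
print(f"Estimated box-counting dimension D0: {-slope:.3f}")
sc.stop()
```

The same BLI would also support the correlation dimension D2: instead of counting occupied boxes, one would sum the squared occupancies at each level via a map over the values before aggregating.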

Keywords

Fractal dimension · Intrinsic dimension · Box-counting · Box locality index · MapReduce · Spark

Notes

Acknowledgements

This study was supported, in part, by the National Science Foundation and the Environmental Protection Agency under Cooperative Agreement No. DBI-0830117. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the Environmental Protection Agency. This work has not been subjected to EPA review and no official endorsement should be inferred. Support by the UCLA Water Technology Research Center is also acknowledged. R. Rallo is supported by the Laboratory Directed Research and Development Program at Pacific Northwest National Laboratory, a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Institute of the Environment and Sustainability, University of California, Los Angeles, USA
  2. Advanced Computing, Mathematics, and Data Division, Pacific Northwest National Laboratory, Richland, USA
  3. Chemical and Biomolecular Engineering Department, University of California, Los Angeles, USA
  4. Center for Environmental Implications of Nanotechnology, University of California, Los Angeles, USA