Data Mining and Knowledge Discovery, Volume 32, Issue 6, pp 1768–1805

Extreme-value-theoretic estimation of local intrinsic dimensionality

  • Laurent Amsaleg
  • Oussama Chelly
  • Teddy Furon
  • Stéphane Girard
  • Michael E. Houle
  • Ken-ichi Kawarabayashi
  • Michael Nett

Abstract

This paper is concerned with the estimation of a local measure of intrinsic dimensionality (ID) recently proposed by Houle. The local model can be regarded as an extension of Karger and Ruhl’s expansion dimension to a statistical setting in which the distribution of distances to a query point is modeled in terms of a continuous random variable. This form of intrinsic dimensionality can be particularly useful in search, classification, outlier detection, and other contexts in machine learning, databases, and data mining, as it has been shown to be equivalent to a measure of the discriminative power of similarity functions. Several estimators of local ID are proposed and analyzed based on extreme value theory, using maximum likelihood estimation, the method of moments, probability weighted moments, and regularly varying functions. An experimental evaluation is also provided, using both real and artificial data.
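As a rough illustration of the maximum likelihood approach mentioned above, the following minimal sketch computes a Hill-type estimate of local ID from the distances between a query point and its nearest neighbors. The abstract does not state the formula, so the form used here (the negative reciprocal of the mean log-ratio of each distance to the largest one) is an assumption about the estimator, and the function name `lid_mle` is hypothetical.

```python
import math


def lid_mle(distances):
    """Hill-type maximum likelihood estimate of local intrinsic dimensionality.

    `distances` holds the distances from a query point to its k nearest
    neighbors. The formula is an assumed form of the MLE estimator:
        ID = -1 / mean(log(x_i / w)),  where w = max(x_i).
    """
    # Zero distances (duplicates of the query) have no finite logarithm.
    dists = sorted(d for d in distances if d > 0)
    if len(dists) < 2:
        raise ValueError("need at least two positive distances")

    w = dists[-1]  # largest neighbor distance, used as the normalizer
    mean_log = sum(math.log(d / w) for d in dists) / len(dists)
    if mean_log == 0.0:
        raise ValueError("all distances are equal; estimate is undefined")
    return -1.0 / mean_log


if __name__ == "__main__":
    # Synthetic check: for points spread uniformly in an m-dimensional ball,
    # the distance to the center has CDF r**m, so its quantiles are u**(1/m).
    m, k = 5, 1000
    sample = [(i / k) ** (1 / m) for i in range(1, k + 1)]
    print(f"estimated local ID: {lid_mle(sample):.3f} (target {m})")
```

On such synthetic distances the estimate should land close to the dimension of the ball; on real data the result depends on the neighborhood size k.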

Keywords

Intrinsic dimension, Indiscriminability, Manifold learning, Curse of dimensionality, Maximum likelihood estimation, Extreme value theory

References

  1. Alves MF, de Haan L, Lin T (2003a) Estimation of the parameter controlling the speed of convergence in extreme value theory. Math Method Stat 12(2):155–176
  2. Alves MIF, Gomes MI, de Haan L (2003b) A new class of semi-parametric estimators of the second order parameter. Port Math 60(2):193–214
  3. Amsaleg L, Chelly O, Furon T, Girard S, Houle ME, Kawarabayashi K, Nett M (2015) Estimating local intrinsic dimensionality. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 29–38
  4. Balkema AA, De Haan L (1974) Residual life time at great age. Ann Probab 2(5):792–804
  5. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbor meaningful? In: International conference on database theory. Springer, Berlin, pp 217–235
  6. Beygelzimer A, Kakade S, Langford J (2006) Cover trees for nearest neighbor. In: International conference on machine learning. ACM, pp 97–104
  7. Bingham NH, Goldie CM, Teugels JL (1989) Regular variation, vol 27. Cambridge University Press, Cambridge
  8. Boujemaa N, Fauqueur J, Ferecatu M, Fleuret F, Gouet V, LeSaux B, Sahbi H (2001) IKONA for interactive specific and generic image retrieval. In: Proceedings of international workshop on multimedia content-based indexing and retrieval
  9. Bouveyron C, Celeux G, Girard S (2011) Intrinsic dimension estimation by maximum likelihood in isotropic probabilistic PCA. Pattern Recogn Lett 32(14):1706–1713
  10. Bruske J, Sommer G (1998) Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Trans Pattern Anal Mach Intell 20(5):572–575
  11. Camastra F, Vinciarelli A (2002) Estimating the intrinsic dimension of data with a fractal-based method. IEEE Trans Pattern Anal Mach Intell 24(10):1404–1407
  12. Cole R, Fanty M (1990) Spoken letter recognition. In: Proceedings of the third DARPA speech and natural language workshop, pp 385–390
  13. Coles S, Bawa J, Trenner L, Dorazio P (2001) An introduction to statistical modeling of extreme values, vol 208. Springer, Berlin
  14. Costa JA, Hero AO III (2004) Entropic graphs for manifold learning. In: Asilomar conference on signals, systems and computers, vol 1. IEEE, pp 316–320
  15. Dahan E, Mendelson H (2001) An extreme-value model of concept testing. Manag Sci 47(1):102–116
  16. de Vries T, Chawla S, Houle ME (2012) Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst 32(1):25–52
  17. Dong W, Moses C, Li K (2011) Efficient k-nearest neighbor graph construction for generic similarity measures. In: International World Wide Web conference. ACM, pp 577–586
  18. Donoho DL, Grimes C (2003) Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc Natl Acad Sci 100(10):5591–5596
  19. Faloutsos C, Kamel I (1994) Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension. In: Proceedings of the 13th ACM SIGACT–SIGMOD–SIGART symposium on principles of database systems. ACM, pp 4–13
  20. Fisher RA, Tippett LHC (1928) Limiting forms of the frequency distribution of the largest or smallest member of a sample. Math Proc Camb Philos Soc 24(02):180–190
  21. Fukunaga K, Olsen DR (1971) An algorithm for finding intrinsic dimensionality of data. IEEE Trans Comput 100(2):176–183
  22. Furon T, Jégou H (2013) Using extreme value theory for image detection. Research Report RR-8244, INRIA
  23. Gnedenko B (1943) Sur la distribution limite du terme maximum d’une série aléatoire. Ann Math 44(3):423–453
  24. Grimshaw SD (1993) Computing maximum likelihood estimates for the Generalized Pareto Distribution. Technometrics 35(2):185–191
  25. Gupta A, Krauthgamer R, Lee JR (2003) Bounded geometries, fractals, and low-distortion embeddings. In: Proceedings of the 44th annual IEEE symposium on foundations of computer science. IEEE, pp 534–543
  26. Guyon I, Gunn S, Ben-Hur A, Dror G (2004) Result analysis of the NIPS 2003 feature selection challenge. In: Neural information processing systems, pp 545–552
  27. Harris R (2001) The accuracy of design values predicted from extreme value analysis. J Wind Eng Ind Aerodyn 89(2):153–164
  28. Hein M, Audibert JY (2005) Intrinsic dimensionality estimation of submanifolds in \({R}^d\). In: International conference on machine learning. ACM, pp 289–296
  29. Hill BM et al (1975) A simple general approach to inference about the tail of a distribution. Ann Stat 3(5):1163–1174
  30. Hosking JR, Wallis JR (1987) Parameter and quantile estimation for the Generalized Pareto Distribution. Technometrics 29(3):339–349
  31. Houle ME (2013) Dimensionality, discriminability, density & distance distributions. In: 13th International conference on data mining workshops. IEEE, pp 468–473
  32. Houle ME (2015) Inlierness, outlierness, hubness and discriminability: an extreme-value-theoretic foundation. Tech. Rep. 2015-002E, National Institute of Informatics
  33. Houle ME, Kashima H, Nett M (2012a) Generalized expansion dimension. In: 12th International conference on data mining workshops. IEEE, pp 587–594
  34. Houle ME, Ma X, Nett M, Oria V (2012b) Dimensional testing for multi-step similarity search. In: 12th International conference on data mining. IEEE, pp 299–308
  35. Houle ME, Ma X, Oria V, Sun J (2014) Efficient algorithms for similarity search in axis-aligned subspaces. In: International conference on similarity search and applications. Springer, Berlin, pp 1–12
  36. Jégou H, Tavenard R, Douze M, Amsaleg L (2011) Searching in one billion vectors: re-rank with source coding. In: International conference on acoustics, speech and signal processing. IEEE, pp 861–864
  37. Jolliffe IT (1986) Principal component analysis and factor analysis. In: Principal component analysis. Springer, Berlin, pp 115–128
  38. Karger DR, Ruhl M (2002) Finding nearest neighbors in growth-restricted metrics. In: ACM symposium on theory of computing. ACM, pp 741–750
  39. Karhunen J, Joutsensalo J (1994) Representation and separation of signals using nonlinear PCA type learning. IEEE Trans Neural Netw 7(1):113–127
  40. Landwehr JM, Matalas N, Wallis J (1979) Probability weighted moments compared with some traditional techniques in estimating Gumbel parameters and quantiles. Water Resour Res 15(5):1055–1064
  41. Larrañaga P, Lozano JA (2002) Estimation of distribution algorithms: a new tool for evolutionary computation, vol 2. Springer, Berlin
  42. Lavenda BH, Cipollone E (2000) Extreme value statistics and thermodynamics of earthquakes: aftershock sequences. Ann Geophys 43(5):967–982
  43. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
  44. Levina E, Bickel PJ (2004) Maximum likelihood estimation of intrinsic dimension. In: Neural information processing systems, pp 777–784
  45. McNulty PJ, Scheick LZ, Roth DR, Davis MG, Tortora MR (2000) First failure predictions for EPROMs of the type flown on the MPTB satellite. IEEE Trans Nucl Sci 47(6):2237–2243
  46. Millán JdR (2004) On the need for on-line learning in brain-computer interfaces. In: Proceedings of the IEEE international joint conference on neural networks, vol 4. IEEE, pp 2877–2882
  47. Nett M (2014) Intrinsic dimensional design and analysis of similarity search. Ph.D. thesis, University of Tokyo
  48. Pestov V (2000) On the geometry of similarity search: dimensionality curse and concentration of measure. Inf Process Lett 73(1):47–51
  49. Pettis KW, Bailey TA, Jain AK, Dubes RC (1979) An intrinsic dimensionality estimator from near-neighbor information. IEEE Trans Pattern Anal Mach Intell 1:25–37
  50. Pickands J III (1975) Statistical inference using extreme order statistics. Ann Stat 3(1):119–131
  51. Radovanović M, Nanopoulos A, Ivanović M (2010a) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487–2531
  52. Radovanović M, Nanopoulos A, Ivanović M (2010b) Time-series classification in many intrinsic dimensions. In: Proceedings of the 2010 SIAM international conference on data mining. Citeseer, pp 677–688
  53. Rao CR (2009) Linear statistical inference and its applications, vol 22. Wiley, New York
  54. Roberts SJ (2000) Extreme value statistics for novelty detection in biomedical data processing. Proc Sci Meas Technol 147:363–367
  55. Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326
  56. Rozza A, Lombardi G, Ceruti C, Casiraghi E, Campadelli P (2012) Novel high intrinsic dimensionality estimators. Mach Learn J 89(1–2):37–65
  57. Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10(5):1299–1319
  58. Shaft U, Ramakrishnan R (2006) Theory of nearest neighbors indexability. ACM Trans Database Syst 31(3):814–838
  59. Takens F (1985) On the numerical determination of the dimension of an attractor. Springer, Berlin
  60. Tenenbaum JB, De Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323
  61. Tryon RG, Cruse TA (2000) Probabilistic mesomechanics for high cycle fatigue life prediction. J Eng Mater Technol 122(2):209–214
  62. Venna J, Kaski S (2006) Local multidimensional scaling. IEEE Trans Neural Netw 19(6):889–899
  63. Verveer PJ, Duin RPW (1995) An evaluation of intrinsic dimensionality estimators. IEEE Trans Pattern Anal Mach Intell 17(1):81–86
  64. von Brünken J, Houle ME, Zimek A (2015) Intrinsic dimensional outlier detection in high-dimensional data. Tech. Rep. 2015-003E, National Institute of Informatics
  65. Weber R, Schek HJ, Blott S (1998) A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: 24th international conference on very large data bases, vol 98, pp 194–205
  66. Zhang Z, Zha H (2004) Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. J Shanghai Univ (English Edition) 8(4):406–424

Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. Equipe LINKMEDIA - CNRS/IRISA Rennes, Rennes Cedex, France
  2. National Institute of Informatics, Tokyo, Japan
  3. Equipe LINKMEDIA - INRIA Rennes, Rennes Cedex, France
  4. Equipe MISTIS - INRIA Grenoble, Saint-Ismier Cedex, France
  5. Google Japan, Tokyo, Japan