Iterative Denoising

  • Original Paper
  • Published in: Computational Statistics

Abstract

A recurring problem in many fields is knowledge discovery in heterogeneous, high-dimensional data. In text mining, for example, an analyst often wishes to identify meaningful, implicit, and previously unknown information in an unstructured corpus. The lack of metadata and the complexity of document space make this task difficult. We describe Iterative Denoising, a methodology for knowledge discovery in large heterogeneous datasets that allows a user to visualize and discover potentially meaningful relationships and structures. We demonstrate the features of this methodology by analyzing a heterogeneous Science News corpus.

Corresponding author

Correspondence to Kendall E. Giles.

About this article

Cite this article

Giles, K.E., Trosset, M.W., Marchette, D.J. et al. Iterative Denoising. Comput Stat 23, 497–517 (2008). https://doi.org/10.1007/s00180-007-0090-8

  • DOI: https://doi.org/10.1007/s00180-007-0090-8
