Abstract
One problem in many fields is knowledge discovery in heterogeneous, high-dimensional data. As an example, in text mining an analyst often wishes to identify meaningful, implicit, and previously unknown information in an unstructured corpus. Lack of metadata and the complexities of document space make this task difficult. We describe Iterative Denoising, a methodology for knowledge discovery in large heterogeneous datasets that allows a user to visualize and to discover potentially meaningful relationships and structures. In addition, we demonstrate the features of this methodology in the analysis of a heterogeneous Science News corpus.
Cite this article
Giles, K.E., Trosset, M.W., Marchette, D.J. et al. Iterative Denoising. Comput Stat 23, 497–517 (2008). https://doi.org/10.1007/s00180-007-0090-8
DOI: https://doi.org/10.1007/s00180-007-0090-8