Iterative Denoising

Giles, Kendall E.; Trosset, Michael W.; Marchette, David J.; Priebe, Carey E.

doi:10.1007/s00180-007-0090-8

Original Paper
Published: 12 October 2007

Volume 23, pages 497–517, (2008)

Computational Statistics

Kendall E. Giles¹, Michael W. Trosset², David J. Marchette³ & Carey E. Priebe⁴

Abstract

One problem in many fields is knowledge discovery in heterogeneous, high-dimensional data. As an example, in text mining an analyst often wishes to identify meaningful, implicit, and previously unknown information in an unstructured corpus. Lack of metadata and the complexities of document space make this task difficult. We describe Iterative Denoising, a methodology for knowledge discovery in large heterogeneous datasets that allows a user to visualize and to discover potentially meaningful relationships and structures. In addition, we demonstrate the features of this methodology in the analysis of a heterogeneous Science News corpus.

Author information

Authors and Affiliations

Department of Statistical Sciences and Operations Research, Virginia Commonwealth University, Richmond, VA, 23284, USA
Kendall E. Giles
Department of Statistics, Indiana University, Bloomington, IN, 47405, USA
Michael W. Trosset
Dahlgren Division, Naval Surface Warfare Center, Dahlgren, VA, 22448, USA
David J. Marchette
Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218, USA
Carey E. Priebe

Corresponding author

Correspondence to Kendall E. Giles.

About this article

Cite this article

Giles, K.E., Trosset, M.W., Marchette, D.J. et al. Iterative Denoising. Comput Stat 23, 497–517 (2008). https://doi.org/10.1007/s00180-007-0090-8

Received: 06 July 2007
Accepted: 04 September 2007
Published: 12 October 2007
Issue Date: October 2008
DOI: https://doi.org/10.1007/s00180-007-0090-8
