Application of Latent Semantic Indexing to Processing of Noisy Text

Price, Robert J.; Zukas, Anthony E.

doi:10.1007/11427995_68

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3495)

Abstract

Latent semantic indexing (LSI) is a robust dimensionality-reduction technique for the processing of textual data. The technique can be applied to collections of documents independent of subject matter or language. Given a collection of documents, LSI can be employed to create a vector space in which both the documents and their constituent terms can be represented. In practice, spaces of several hundred dimensions are typically employed. The resulting spaces possess some unique properties that make them well suited to a range of information-processing problems. Of particular interest for this conference is the fact that the technique is highly resistant to noise. Many sources of classified text are still in hardcopy. Conversion of degraded documents to electronic form through optical character recognition (OCR) processing results in noisy text and poor retrieval performance when indexed by conventional information retrieval (IR) systems. The most salient feature of an LSI space is that proximity of document vectors in that space is a remarkably good surrogate for proximity of the respective documents in a conceptual sense. This fact has been demonstrated in a large number of tests involving a wide variety of subject matter, complexity, and languages. This feature enables the implementation of high-volume, high-accuracy automatic document categorization systems. In fact, the largest existing government and commercial applications of LSI are for automated document categorization. Previous work [1] has demonstrated the high performance of LSI on the Reuters-21578 [2] test set in comparison to other techniques. In more recent work, we have examined the ability of LSI to categorize documents that contain corrupted text. Testing using the Reuters-21578 test set demonstrated the robustness of LSI in conditions of increasing document degradation.
We wrote a Java class that degraded text in the test documents by inserting, deleting, and substituting characters randomly at specified error rates. Although true OCR errors are not random, the intent here was simply to show how far the text of the documents could be degraded while still yielding useful categorization results. Moreover, the nature of comparisons in the LSI space is such that random errors and systematic errors will have essentially the same effects. These results are extremely encouraging. They indicate that the categorization accuracy of LSI falls off very slowly, even at high levels of text errors. Thus, the categorization performance of LSI can be used to compensate for weaknesses in optical character recognition accuracy. In this poster session, we present results of applying this process to the much newer (and larger) Reuters RCV1-v2 categorization test set [3]. Initial results indicate that the technique provides robust noise immunity in large collections.
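
The abstract describes the degradation step only in outline, and the authors' Java class is not reproduced here. As a rough illustration of the idea, the following is a minimal sketch under stated assumptions: the class name TextDegrader, the seed parameter, the equal split between the three error types, and the lowercase replacement alphabet are all hypothetical choices, not details taken from the paper.

```java
import java.util.Random;

/**
 * Hypothetical sketch of a random text degrader (NOT the authors' class).
 * Each input character is corrupted with probability errorRate; a corrupted
 * character is hit by an insertion, a deletion, or a substitution, chosen
 * with equal probability (an assumption made for this illustration).
 */
final class TextDegrader {
    // Assumed pool of replacement characters; the paper does not specify one.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    private final double errorRate;
    private final Random rng;

    TextDegrader(double errorRate, long seed) {
        if (errorRate < 0.0 || errorRate > 1.0) {
            throw new IllegalArgumentException("errorRate must be in [0, 1]");
        }
        this.errorRate = errorRate;
        this.rng = new Random(seed); // seeded so that runs are repeatable
    }

    /** Returns a degraded copy of text; the input string is not modified. */
    String degrade(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (rng.nextDouble() >= errorRate) {
                out.append(c); // character passes through unchanged
                continue;
            }
            switch (rng.nextInt(3)) {
                case 0: // insertion: keep the character, add a random one after it
                    out.append(c).append(randomChar());
                    break;
                case 1: // deletion: drop the character
                    break;
                default: // substitution: may by chance pick the same character
                    out.append(randomChar());
                    break;
            }
        }
        return out.toString();
    }

    private char randomChar() {
        return ALPHABET.charAt(rng.nextInt(ALPHABET.length()));
    }
}
```

Under these assumptions, a call such as new TextDegrader(0.10, 42L).degrade(documentText) would corrupt roughly one character in ten, and sweeping errorRate upward would produce the increasingly degraded copies of a test set that the abstract refers to.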

References

[1] Zukas, A., Price, R.: Document Categorization Using Latent Semantic Indexing. In: Proceedings: 2003 Symposium on Document Image Understanding Technology, Greenbelt, MD, April 2003, pp. 87–91 (2003)
[2] Lewis, D.: Reuters-21578 Text Categorization Test Collection. Distribution 1.0. README file (version 1.2), September 26 (1997) (manuscript), http://www.daviddlewis.com/resources/testcollection/reuters21578/readme.html
[3] Lewis, D., Yang, Y., Rose, T., Li, F.: RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research 5, 361–397 (2004), http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf

Author information

Authors and Affiliations

Content Analyst LLC, Reston, VA
Robert J. Price
Science Applications International Corporation, Reston, VA
Anthony E. Zukas

Editor information

Editors and Affiliations

Department of Library and Information Science, Rutgers University,
Paul Kantor
School of Communication, Information and Library Studies, Rutgers University, 4 Huntington Street, 08901-1071, New Brunswick, NJ, USA
Gheorghe Muresan
Artificial Solutions, Altonaer Poststraße 13b, 22767, Hamburg, Germany
Fred Roberts
MIS Department, University of Arizona, 85721, Tucson, AZ, USA
Daniel D. Zeng
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Fei-Yue Wang
Department of Management Information Systems, Eller College of Management, The University of Arizona, 85721, AZ, USA
Hsinchun Chen
College of Computing, Georgia Tech Information Security Center, Georgia Institute of Technology, 801 Atlantic Drive, 30332-0280, Atlanta, GA, USA
Ralph C. Merkle

About this paper

Cite this paper

Price, R.J., Zukas, A.E. (2005). Application of Latent Semantic Indexing to Processing of Noisy Text. In: Kantor, P., et al. Intelligence and Security Informatics. ISI 2005. Lecture Notes in Computer Science, vol 3495. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11427995_68

DOI: https://doi.org/10.1007/11427995_68
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25999-2
Online ISBN: 978-3-540-32063-0
eBook Packages: Computer Science, Computer Science (R0)
