Application of Latent Semantic Indexing to Processing of Noisy Text
Latent semantic indexing (LSI) is a robust dimensionality-reduction technique for processing textual data. The technique can be applied to collections of documents independent of subject matter or language. Given a collection of documents, LSI can be used to create a vector space in which both the documents and their constituent terms are represented. In practice, spaces of several hundred dimensions are typically employed. The resulting spaces possess properties that make them well suited to a range of information-processing problems. Of particular interest for this conference is the fact that the technique is highly resistant to noise. Many sources of classified text are still in hardcopy. Converting degraded documents to electronic form through optical character recognition (OCR) yields noisy text, which leads to poor retrieval performance when that text is indexed by conventional information retrieval (IR) systems.
The most salient feature of an LSI space is that proximity of document vectors in that space is a remarkably good surrogate for the conceptual proximity of the respective documents. This has been demonstrated in a large number of tests spanning a wide variety of subject matter, complexity, and languages. This feature enables the implementation of high-volume, high-accuracy automatic document categorization systems. In fact, the largest existing government and commercial applications of LSI are for automated document categorization. Previous work [1] has demonstrated the high performance of LSI on the Reuters-21578 [2] test set in comparison to other techniques. In more recent work, we have examined the ability of LSI to categorize documents that contain corrupted text. Testing with the Reuters-21578 test set demonstrated the robustness of LSI under increasing document degradation.
We wrote a Java class that degraded the text of the test documents by randomly inserting, deleting, and substituting characters at specified error rates. Although true OCR errors are not random, the intent here was simply to show how far the text of the documents could be degraded while still yielding useful categorization results. Moreover, the nature of comparisons in the LSI space is such that random errors and systematic errors will have essentially the same effects. The results are extremely encouraging: they indicate that the categorization accuracy of LSI falls off very slowly, even at high levels of text errors. Thus, the categorization performance of LSI can be used to compensate for weaknesses in OCR accuracy. In this poster session we present results of applying this process to the much newer (and larger) Reuters RCV1-v2 categorization test set [3]. Initial results indicate that the technique provides robust noise immunity in large collections.
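The character-level degradation described above can be illustrated with a minimal sketch. This is not the authors' class, whose details are not given here. The class name `TextDegrader`, the method `degrade`, the `seed` parameter, the lowercase replacement alphabet, and the equal weighting of the three error types are all assumptions made for illustration. The error rate is interpreted as the per-character probability that exactly one error is applied.

```java
import java.util.Random;

/** Hypothetical sketch of a random text degrader; not the authors' implementation. */
final class TextDegrader {
    // Assumed replacement alphabet; the source does not say which characters were used.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    private TextDegrader() {}

    /**
     * Returns a copy of text in which each character is, with probability
     * errorRate, hit by one error: an insertion, a deletion, or a substitution
     * (chosen with equal probability, which is an assumption of this sketch).
     */
    static String degrade(String text, double errorRate, long seed) {
        if (!(errorRate >= 0.0 && errorRate <= 1.0)) {
            throw new IllegalArgumentException("errorRate must be in [0, 1]");
        }
        Random rng = new Random(seed); // seeded so that runs are repeatable
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (rng.nextDouble() >= errorRate) {
                out.append(c); // character survives unchanged
                continue;
            }
            switch (rng.nextInt(3)) {
                case 0: // insertion: a random character is added before the original
                    out.append(randomChar(rng)).append(c);
                    break;
                case 1: // deletion: the original character is dropped
                    break;
                default: // substitution: may by chance pick the same letter again
                    out.append(randomChar(rng));
                    break;
            }
        }
        return out.toString();
    }

    private static char randomChar(Random rng) {
        return ALPHABET.charAt(rng.nextInt(ALPHABET.length()));
    }

    public static void main(String[] args) {
        String sample = "latent semantic indexing is resistant to noise";
        for (double rate : new double[] {0.0, 0.1, 0.3, 0.5}) {
            System.out.println(rate + ": " + degrade(sample, rate, 42L));
        }
    }
}
```

Degrading a test collection at a series of increasing error rates and categorizing each degraded copy would give the kind of accuracy-versus-error-rate comparison the text describes. A fixed seed keeps each degraded copy reproducible between runs.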
- 1. Zukas, A., Price, R.: Document Categorization Using Latent Semantic Indexing. In: Proceedings: 2003 Symposium on Document Image Understanding Technology, Greenbelt, MD, April 2003, pp. 87–91 (2003)
- 2. Lewis, D.: Reuters-21578 Text Categorization Test Collection. Distribution 1.0. README file (version 1.2), September 26 (1997) (manuscript), http://www.daviddlewis.com/resources/testcollection/reuters21578/readme.html
- 3. Lewis, D., Yang, Y., Rose, T., Li, F.: RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research 5, 361–397 (2004), http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf