Document Representation and Quality of Text: An Analysis
Three factors are involved in text classification: the classification model, the similarity measure, and the document representation. In this chapter, we focus on document representation and demonstrate that the choice of representation has a profound impact on classification quality. We also show that text quality affects the choice of document representation. Our experiments use centroid-based classification, a simple and robust text classification scheme. We compare four types of document representation: N-grams, single terms, phrases, and a logic-based document representation called RDR. The N-gram representation is string-based and involves no linguistic processing. The single-term approach is based on words with minimal linguistic processing. The phrase approach is based on linguistically formed phrases and single words. RDR relies on linguistic processing and represents documents as a set of logical predicates. Our experiments on many text collections yielded similar results.
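As an illustration of how two of these pieces fit together, the following is a minimal sketch of centroid-based classification over a character N-gram representation. It is not the chapter's implementation: the function names, the choice of N = 3, the raw-count weighting, and the cosine similarity measure are all assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams; no linguistic processing is applied."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def normalize(vec):
    """Scale a sparse vector to unit length (empty dict if the norm is zero)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(k, 0.0) for k, w in a.items())


def train_centroids(labeled_docs, n=3):
    """Build one unit-length centroid per class from (text, label) pairs."""
    sums = {}
    for text, label in labeled_docs:
        acc = sums.setdefault(label, Counter())
        for k, v in normalize(char_ngrams(text, n)).items():
            acc[k] += v
    # Normalizing the sum gives the same direction as the class mean.
    return {label: normalize(acc) for label, acc in sums.items()}


def classify(text, centroids, n=3):
    """Assign the class whose centroid is most similar to the document."""
    vec = normalize(char_ngrams(text, n))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy training data, for illustration only.
docs = [
    ("the striker scored a goal in the match", "sports"),
    ("the team won the football game", "sports"),
    ("the processor executes instructions quickly", "tech"),
    ("the compiler optimizes the source code", "tech"),
]
centroids = train_centroids(docs)
print(classify("football match", centroids))
print(classify("compiler source code", centroids))
```

Swapping `char_ngrams` for a function that emits single terms or phrases would change only the document representation while leaving the classification model and similarity measure fixed, which is the kind of comparison the chapter describes.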