Document Representation and Quality of Text: An Analysis

  • Mostafa Keikha
  • Narjes Sharif Razavian
  • Farhad Oroumchian
  • Hassan Seyed Razi

There are three factors involved in text classification: the classification model, the similarity measure, and the document representation. In this chapter, we focus on document representation and demonstrate that the choice of document representation has a profound impact on the quality of the classification. We also show that text quality affects the choice of document representation. In our experiments we used centroid-based classification, which is a simple and robust text classification scheme. We compare four types of document representation: N-grams, single terms, phrases, and a logic-based document representation called RDR. The N-gram representation is a string-based representation with no linguistic processing. The single-term approach is based on words with minimal linguistic processing. The phrase approach is based on linguistically formed phrases and single words. RDR is based on linguistic processing and represents documents as a set of logical predicates. Our experiments on many text collections yielded similar results.
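To make the terminology concrete, the following is a minimal sketch of a character N-gram representation combined with a centroid-based classifier. It is an illustration only, not the chapter's implementation: the trigram size, the raw-count weighting, the cosine similarity measure, and all function names (`char_ngrams`, `train_centroids`, `classify`) are assumptions chosen for brevity, since the abstract does not specify them.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams; no linguistic processing is applied."""
    s = " ".join(text.lower().split())  # lowercase and collapse whitespace
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def normalize(vec):
    """Scale a sparse vector to unit length (an empty dict if the norm is zero)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def train_centroids(labeled_docs, n=3):
    """Average the unit-length document vectors of each class into one centroid."""
    sums, counts = {}, Counter()
    for text, label in labeled_docs:
        acc = sums.setdefault(label, Counter())
        for k, v in normalize(char_ngrams(text, n)).items():
            acc[k] += v
        counts[label] += 1
    return {
        label: normalize({k: v / counts[label] for k, v in acc.items()})
        for label, acc in sums.items()
    }


def classify(text, centroids, n=3):
    """Assign the class whose centroid is most similar to the document vector."""
    vec = normalize(char_ngrams(text, n))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training data, invented purely for illustration.
training = [
    ("the team won the football match", "sports"),
    ("football players scored a goal", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stock market and bank shares fell", "finance"),
]
centroids = train_centroids(training)
predicted = classify("football goal", centroids)
```

Swapping `char_ngrams` for a function that returns word, phrase, or predicate counts would change only the document representation while leaving the classifier and similarity measure fixed, which is the kind of comparison the chapter describes.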

Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Mostafa Keikha (1)
  • Narjes Sharif Razavian (1)
  • Farhad Oroumchian (2)
  • Hassan Seyed Razi (1)

  1. Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
  2. College of Information Technology, University of Wollongong in Dubai, Dubai, UAE
