Document Representation and Quality of Text: An Analysis

  • Mostafa Keikha
  • Narjes Sharif Razavian
  • Farhad Oroumchian
  • Hassan Seyed Razi

Three factors are involved in text classification: the classification model, the similarity measure, and the document representation. In this chapter, we focus on document representation and demonstrate that the choice of representation has a profound impact on classification quality. We also show that text quality affects the choice of document representation. Our experiments use centroid-based classification, a simple and robust text classification scheme. We compare four types of document representation: N-grams, single terms, phrases, and a logic-based document representation called RDR. The N-gram representation is string-based and involves no linguistic processing. The single-term approach is based on words with minimal linguistic processing. The phrase approach is based on linguistically formed phrases together with single words. RDR relies on linguistic processing and represents documents as a set of logical predicates. Our experiments on many text collections yielded similar results.
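
To make the setup concrete, the following is a minimal sketch of centroid-based classification applied to two of the representations named above: character N-grams and single terms. It is not the authors' implementation. The function names, the toy training data, and the choices of raw term-frequency weighting, L2 normalisation, and cosine similarity are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """String-based features: overlapping character n-grams, no linguistic processing."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def single_terms(text):
    """Word features with minimal processing: lowercasing and whitespace splitting."""
    return text.lower().split()


def vectorize(features):
    """Build an L2-normalised term-frequency vector (assumed weighting scheme)."""
    counts = Counter(features)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {f: c / norm for f, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(labelled_docs, featurize):
    """Average the document vectors of each class to obtain one centroid per class."""
    sums = defaultdict(Counter)
    sizes = Counter()
    for text, label in labelled_docs:
        for f, w in vectorize(featurize(text)).items():
            sums[label][f] += w
        sizes[label] += 1
    return {
        label: {f: w / sizes[label] for f, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(text, centroids, featurize):
    """Assign the class whose centroid is most similar to the document vector."""
    doc = vectorize(featurize(text))
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


# Hypothetical toy collection, invented purely for illustration.
TRAIN = [
    ("the striker scored a late goal", "sport"),
    ("the team won the football match", "sport"),
    ("the bank raised interest rates", "finance"),
    ("investors sold shares as markets fell", "finance"),
]

if __name__ == "__main__":
    for featurize in (single_terms, char_ngrams):
        centroids = train_centroids(TRAIN, featurize)
        label = classify("a goal in the football match", centroids, featurize)
        print(featurize.__name__, label)
```

Only the `featurize` argument changes between runs, so the classifier and similarity measure stay fixed while the document representation varies, which mirrors the kind of comparison the chapter describes. Phrase-based and RDR representations would need linguistic processing that this sketch does not attempt.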

Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Mostafa Keikha (1)
  • Narjes Sharif Razavian (1)
  • Farhad Oroumchian (2)
  • Hassan Seyed Razi (1)

  1. Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
  2. College of Information Technology, University of Wollongong in Dubai, Dubai, UAE
