This paper presents a Bayes document classifier that uses phrases as features. The phrases are extracted by a grammar whose rules are applied iteratively to the sequence of words in the document. The grammar is generated from a training set using statistical word association. We report an improvement in classification over the “bag of words” representation.
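
To make the idea of an association-driven phrase grammar concrete, here is a minimal Python sketch. It is not the authors' algorithm: it assumes pointwise mutual information over adjacent word pairs as the association measure, a greedy one-rule-per-iteration learning loop, and hypothetical names and thresholds (`learn_rules`, `extract_phrases`, `min_count`, `min_pmi`) chosen only for illustration.

```python
import math
from collections import Counter


def pair_scores(docs):
    """Return (pmi, counts) for every adjacent token pair in docs."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi, bigrams


def merge_pair(doc, pair):
    """Replace each occurrence of the adjacent pair with one phrase token."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) == pair:
            out.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out


def learn_rules(docs, max_rules=10, min_count=2, min_pmi=1.0):
    """Greedily learn merge rules (assumed procedure, not the paper's)."""
    docs = [list(doc) for doc in docs]
    rules = []
    for _ in range(max_rules):
        pmi, counts = pair_scores(docs)
        candidates = [
            (score, pair)
            for pair, score in pmi.items()
            if counts[pair] >= min_count and score >= min_pmi
        ]
        if not candidates:
            break
        _, best = max(candidates)
        rules.append(best)
        # Rewrite the corpus so later rules can build on merged phrases.
        docs = [merge_pair(doc, best) for doc in docs]
    return rules


def extract_phrases(doc, rules):
    """Apply the learned rules, in order, to a new token sequence."""
    doc = list(doc)
    for pair in rules:
        doc = merge_pair(doc, pair)
    return doc


corpus = [
    "the neural network learns".split(),
    "a neural network model".split(),
    "the model learns".split(),
]
rules = learn_rules(corpus)
features = extract_phrases("neural network learns fast".split(), rules)
```

On this toy corpus the only pair that occurs at least twice is `("neural", "network")`, so a single rule is learned and `features` contains the merged token `neural_network` alongside the remaining single words. The resulting tokens could then be counted as features for a Bayes classifier in place of individual words.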


Feature Vector, Mutual Information, Training Corpus, Association Measure, Association Matrix



Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Jan Bakus (1)
  • Mohamed Kamel (1)
  1. Pattern Analysis and Machine Intelligence Lab, Department of Systems Design Engineering, University of Waterloo, Waterloo, Canada
