Text Categorisation Using Document Profiling

  • Maximilien Sauban
  • Bernhard Pfahringer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2838)


This paper presents an extension of prior work by Michael D. Lee on psychologically plausible text categorisation. Our approach uses Lee’s model as a pre-processing filter that generates a dense representation of a given text document (a document profile) and passes it on to an arbitrary standard propositional learning algorithm. As with standard feature selection for text classification, this drastically reduces the dimensionality of the instances, which in turn greatly lowers the computational load on the subsequent learning algorithm. The filter itself is also very fast, as it is essentially a variant of Naive Bayes. We present several variations of the filter and evaluate them on the Reuters-21578 collection; the results show performance comparable to previously published results on that collection, at a lower computational cost.
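
The abstract does not spell out how a document profile is computed, so the following is only a minimal sketch of the general idea, under the assumption that a profile holds one accumulated naive-Bayes-style log-odds value per category. The function names `train_word_log_odds` and `document_profile`, the Laplace smoothing and the toy data are all hypothetical; this is neither Lee’s sequential sampling model nor the authors’ implementation.

```python
import math
from collections import Counter


def train_word_log_odds(docs, labels, categories, alpha=1.0):
    """Estimate log[P(w | c) / P(w | not c)] for every word and category.

    ASSUMPTION: Laplace-smoothed word frequencies stand in for whatever
    evidence measure the actual filter uses.
    """
    vocab = {w for doc in docs for w in doc}
    log_odds = {}
    for c in categories:
        pos, neg = Counter(), Counter()
        for doc, doc_labels in zip(docs, labels):
            (pos if c in doc_labels else neg).update(doc)
        pos_total = sum(pos.values()) + alpha * len(vocab)
        neg_total = sum(neg.values()) + alpha * len(vocab)
        log_odds[c] = {
            w: math.log((pos[w] + alpha) / pos_total)
            - math.log((neg[w] + alpha) / neg_total)
            for w in vocab
        }
    return log_odds


def document_profile(doc, log_odds, categories):
    """Map a tokenised document to one evidence value per category.

    Words never seen in training contribute no evidence (0.0).
    """
    return [sum(log_odds[c].get(w, 0.0) for w in doc) for c in categories]


# Toy data (invented for illustration only).
docs = [
    ["wheat", "grain", "export"],
    ["oil", "crude", "barrel"],
    ["grain", "harvest"],
    ["crude", "oil", "price"],
]
labels = [{"grain"}, {"crude"}, {"grain"}, {"crude"}]
categories = ["grain", "crude"]

model = train_word_log_odds(docs, labels, categories)
profile = document_profile(["wheat", "grain"], model, categories)
```

Under these assumptions each document is reduced from a vocabulary-sized vector to one number per category, and the resulting profiles could then be handed to any propositional learner as ordinary numeric attributes.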


Support Vector Machine, 10th European Conf, Posterior Odds, Document Profile, Sequential Sampling Process
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


  1. Yang, Y., Slattery, S., Ghani, R.: A study of approaches to hypertext categorization. Journal of Intelligent Info. Systems 18, 219–241 (2002)
  2. Roth, D.: Learning to resolve natural language ambiguities: a unified approach. In: Proc. of AAAI 1998, 15th Conf. of the American Association for Artificial Intelligence, pp. 806–813. AAAI Press, Menlo Park (1998)
  3. Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms and representations for text categorization. In: Proc. of CIKM 1998, 7th ACM Int. Conf. on Info. and Knowledge Management, pp. 148–155. ACM Press, New York (1998)
  4. Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 4–15. Springer, Heidelberg (1998)
  5. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
  6. Rocchio, J.J.: Relevance feedback in information retrieval. The SMART Retrieval System: Experiments in automatic document processing, 313–323 (1971)
  7. Yang, Y., Chute, C.G.: A linear least squares fit mapping method for information retrieval from natural language texts. In: 14th Int. Conf. on Computational Linguistics (COLING), pp. 447–453 (1992)
  8. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proc. of ICML 1997, 14th Int. Conf. on Machine Learning, pp. 412–420. Morgan Kaufmann, San Francisco (1997)
  9. Frank, E., Chui, C., Witten, I.H.: Text categorization using compression models. In: Proc. of DCC 2000, IEEE Data Compression Conf., pp. 200–209. IEEE Computer Society Press, Los Alamitos (2000)
  10. Wiener, E.D., Pedersen, J.O., Weigend, A.S.: A neural network approach to topic spotting. In: Proc. of SDAIR 1995, 4th Annual Symposium on Document Analysis and Info. Retrieval, pp. 317–332 (1995)
  11. Scott, S., Matwin, S.: Text classification using WordNet hypernyms. In: Use of WordNet in Natural Language Processing Systems: Proceedings of the Conf. Association for Computational Linguistics, pp. 38–44 (1998)
  12. Lee, M.D.: Fast text classification using sequential sampling processes. In: Proc. of the 14th Australian Joint Conf. on Artificial Intelligence, pp. 309–320. Springer, Heidelberg (2002)
  13. Apté, C., Damerau, F., Weiss, S.M.: Automated learning of decision rules for text categorization. Information Systems 12, 233–251 (1994)
  14. Witten, I.H., Frank, E., Trigg, L., Hall, M., Holmes, G., Cunningham, S.J.: Weka: Practical machine learning tools and techniques with java implementations. In: Proc. ICONIP/ANZIIS/ANNES 1999 Int. Workshop: Emerging Knowledge Engineering and Connectionist-Based Info. Systems, pp. 192–196 (1999)
  15. Quinlan, R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
  16. Holte, R.: Very simple classification rules perform well on most commonly used datasets. Machine Learning 11, 63–91 (1993)
  17. Aha, D., Kibler, D., Albert, M.: Instance-based learning algorithms. Machine Learning 6, 37–66 (1991)
  18. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector Learning (1998)
  19. John, G.H., Langley, P.: Estimating continuous distributions in bayesian classifiers. In: Proc. of the Eleventh Conf. on Uncertainty in Artificial Intelligence, pp. 338–345. Morgan Kaufmann, San Francisco (1995)
  20. McCallum, A., Nigam, K.: A comparison of event models for naive bayes text classification. In: AAAI 1998 Workshop on Learning for Text Categorization (1998)
  21. Rennie, J., Shih, L., Teevan, J., Karger, D.: Tackling the poor assumptions of naive bayes text classifiers. In: Proc. of the 20th Int. Conf. on Machine Learning, Morgan Kaufmann, San Francisco (2003)

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Maximilien Sauban (1)
  • Bernhard Pfahringer (1)
  1. Department of Computer Science, University of Waikato, Hamilton, New Zealand
