Text Categorisation Using Document Profiling

Sauban, Maximilien; Pfahringer, Bernhard

doi:10.1007/978-3-540-39804-2_37

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2838)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

This paper presents an extension of prior work by Michael D. Lee on psychologically plausible text categorisation. Our approach utilises Lee’s model as a pre-processing filter to generate a dense representation for a given text document (a document profile) and passes that on to an arbitrary standard propositional learning algorithm. Similarly to standard feature selection for text classification, the dimensionality of instances is drastically reduced this way, which in turn greatly lowers the computational load for the subsequent learning algorithm. The filter itself is very fast as well, as it basically is just an interesting variant of Naive Bayes. We present different variations of the filter and conduct an evaluation against the Reuters-21578 collection that shows performance comparable to previously published results on that collection, but at a lower computational cost.
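
The profiling step described above can be illustrated with a minimal sketch. It is not the authors' implementation or Lee's exact model: it assumes a document profile is a vector with one entry per category, where each entry is the sum of Laplace-smoothed log-odds word evidence for that category. The function names `train_evidence` and `profile`, the smoothing parameter `alpha`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def train_evidence(docs, labels, alpha=1.0):
    """Estimate per-category word evidence log P(w|c) - log P(w|not c).

    Assumption: Laplace smoothing with pseudo-count `alpha`; the paper's
    filter variants may weight words differently.
    """
    categories = sorted(set(labels))
    vocab = sorted({w for words in docs for w in words})
    in_counts = {c: Counter() for c in categories}
    out_counts = {c: Counter() for c in categories}
    for words, label in zip(docs, labels):
        for c in categories:
            target = in_counts if label == c else out_counts
            target[c].update(words)
    evidence = {}
    for c in categories:
        n_in = sum(in_counts[c].values()) + alpha * len(vocab)
        n_out = sum(out_counts[c].values()) + alpha * len(vocab)
        evidence[c] = {
            w: math.log((in_counts[c][w] + alpha) / n_in)
            - math.log((out_counts[c][w] + alpha) / n_out)
            for w in vocab
        }
    return categories, evidence


def profile(words, categories, evidence):
    """Map a tokenised document to one summed-evidence score per category.

    Words outside the training vocabulary contribute no evidence.
    """
    return [sum(evidence[c].get(w, 0.0) for w in words) for c in categories]


# Toy data (hypothetical): four tokenised documents in two categories.
docs = [
    ["wheat", "grain", "export"],
    ["grain", "harvest"],
    ["oil", "crude", "barrel"],
    ["crude", "opec", "oil"],
]
labels = ["grain", "grain", "crude", "crude"]

categories, evidence = train_evidence(docs, labels)
doc_profile = profile(["grain", "export", "wheat"], categories, evidence)
print(categories, doc_profile)
```

The resulting profile has as many attributes as there are categories, regardless of vocabulary size, so it can be handed to any propositional learner as an ordinary numeric instance.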

References

Yang, Y., Slattery, S., Ghani, R.: A study of approaches to hypertext categorization. Journal of Intelligent Info. Systems 18, 219–241 (2002)
Roth, D.: Learning to resolve natural language ambiguities: a unified approach. In: Proc. of AAAI 1998, 15th Conf. of the American Association for Artificial Intelligence, pp. 806–813. AAAI Press, Menlo Park (1998)
Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms and representations for text categorization. In: Proc. of CIKM 1998, 7th ACM Int. Conf. on Info. and Knowledge Management, pp. 148–155. ACM Press, New York (1998)
Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 4–15. Springer, Heidelberg (1998)
Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
Rocchio, J.J.: Relevance feedback in information retrieval. The SMART Retrieval System: Experiments in automatic document processing, 313–323 (1971)
Yang, Y., Chute, C.G.: A linear least squares fit mapping method for information retrieval from natural language texts. In: 14th Int. Conf. on Computational Linguistics (COLING), pp. 447–453 (1992)
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proc. of ICML 1997, 14th Int. Conf. on Machine Learning, pp. 412–420. Morgan Kaufmann, San Francisco (1997)
Frank, E., Chui, C., Witten, I.H.: Text categorization using compression models. In: Proc. of DCC 2000, IEEE Data Compression Conf., pp. 200–209. IEEE Computer Society Press, Los Alamitos (2000)
Wiener, E.D., Pedersen, J.O., Weigend, A.S.: A neural network approach to topic spotting. In: Proc. of SDAIR 1995, 4th Annual Symposium on Document Analysis and Info. Retrieval, pp. 317–332 (1995)
Scott, S., Matwin, S.: Text classification using WordNet hypernyms. In: Use of WordNet in Natural Language Processing Systems: Proceedings of the Conf. Association for Computational Linguistics, pp. 38–44 (1998)
Lee, M.D.: Fast text classification using sequential sampling processes. In: Proc. of the 14th Australian Joint Conf. on Artificial Intelligence, pp. 309–320. Springer, Heidelberg (2002)
Apté, C., Damerau, F., Weiss, S.M.: Automated learning of decision rules for text categorization. Information Systems 12, 233–251 (1994)
Witten, I.H., Frank, E., Trigg, L., Hall, M., Holmes, G., Cunningham, S.J.: Weka: Practical machine learning tools and techniques with Java implementations. In: Proc. ICONIP/ANZIIS/ANNES 1999 Int. Workshop: Emerging Knowledge Engineering and Connectionist-Based Info. Systems, pp. 192–196 (1999)
Quinlan, R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
Holte, R.: Very simple classification rules perform well on most commonly used datasets. Machine Learning 11, 63–91 (1993)
Aha, D., Kibler, D., Albert, M.: Instance-based learning algorithms. Machine Learning 6, 37–66 (1991)
Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector Learning (1998)
John, G.H., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Proc. of the Eleventh Conf. on Uncertainty in Artificial Intelligence, pp. 338–345. Morgan Kaufmann, San Francisco (1995)
McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification. In: AAAI 1998 Workshop on Learning for Text Categorization (1998)
Rennie, J., Shih, L., Teevan, J., Karger, D.: Tackling the poor assumptions of naive Bayes text classifiers. In: Proc. of the 20th Int. Conf. on Machine Learning, Morgan Kaufmann, San Francisco (2003)

Author information

Authors and Affiliations

Department of Computer Science, University of Waikato, Hamilton, New Zealand
Maximilien Sauban & Bernhard Pfahringer

Editor information

Editors and Affiliations

University of Nova Gorica, Nova Gorica, Slovenia
Nada Lavrač
Rudjer Bošković Institute, Bijenička 54, 10000, Zagreb, Croatia
Dragan Gamberger
Jozef Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Ljupčo Todorovski
Leiden Institute of Advanced Computer Science, Leiden University,
Hendrik Blockeel

About this paper

Cite this paper

Sauban, M., Pfahringer, B. (2003). Text Categorisation Using Document Profiling. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds) Knowledge Discovery in Databases: PKDD 2003. Lecture Notes in Computer Science, vol 2838. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39804-2_37

DOI: https://doi.org/10.1007/978-3-540-39804-2_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20085-7
Online ISBN: 978-3-540-39804-2
eBook Packages: Springer Book Archive
