Augmenting Naive Bayes Classifiers with Statistical Language Models

Peng, Fuchun; Schuurmans, Dale; Wang, Shaojun

doi:10.1023/B:INRT.0000011209.19643.e2


Published: September 2004

Information Retrieval, Volume 7, pages 317–345 (2004)


Abstract

We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier that allows for a local Markov dependence among observations, a model we refer to as the Chain Augmented Naive Bayes (CAN) Bayes classifier. CAN models have two advantages over standard naive Bayes classifiers. First, they relax some of the independence assumptions of naive Bayes by allowing a local Markov chain dependence in the observed variables, while still permitting efficient inference and learning. Second, they permit straightforward application of sophisticated smoothing techniques from statistical language modeling, which yields better parameter estimates than the standard Laplace smoothing used in naive Bayes classification. In this paper, we introduce CAN models and apply them to various text classification problems. To demonstrate the language-independent and task-independent nature of these classifiers, we present experimental results on several text classification problems (authorship attribution, text genre classification, and topic detection) in several languages (Greek, English, Japanese, and Chinese). We then systematically study the key factors in the CAN model that can influence classification performance, and analyze the strengths and weaknesses of the model.
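The idea summarized in the abstract can be illustrated with a minimal sketch: train one n-gram language model per class, then assign a document to the class that maximizes the class prior times the chain-rule probability of its tokens under that class's model. The code below is an assumption-laden illustration, not the authors' implementation. It fixes n = 2 (a bigram chain), and it uses linear interpolation with an add-one unigram estimate as a simple stand-in for the more sophisticated smoothing techniques the abstract refers to. The class name `ChainAugmentedNB`, the parameter `lam`, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class ChainAugmentedNB:
    """Illustrative per-class bigram classifier (all names are hypothetical)."""

    def __init__(self, lam=0.7):
        self.lam = lam  # assumed interpolation weight on the bigram estimate
        self.bigrams = defaultdict(Counter)   # class -> counts of (prev, cur)
        self.contexts = defaultdict(Counter)  # class -> counts of prev
        self.unigrams = defaultdict(Counter)  # class -> counts of cur
        self.class_docs = Counter()           # class -> number of training docs
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, c in zip(docs, labels):
            self.class_docs[c] += 1
            prev = "<s>"  # start-of-document marker
            for tok in tokens:
                self.bigrams[c][(prev, tok)] += 1
                self.contexts[c][prev] += 1
                self.unigrams[c][tok] += 1
                self.vocab.add(tok)
                prev = tok
        return self

    def log_prob(self, tokens, c):
        """log P(c) + sum_i log P(w_i | w_{i-1}, c) under the smoothed model."""
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen tokens
        total = sum(self.unigrams[c].values())
        logp = math.log(self.class_docs[c] / sum(self.class_docs.values()))
        prev = "<s>"
        for tok in tokens:
            # Add-one unigram estimate: never zero, so the log is always defined.
            uni = (self.unigrams[c][tok] + 1) / (total + vocab_size)
            ctx = self.contexts[c][prev]
            if ctx:
                bi = self.bigrams[c][(prev, tok)] / ctx
                p = self.lam * bi + (1 - self.lam) * uni
            else:
                p = uni  # context never seen in this class: use unigram only
            logp += math.log(p)
            prev = tok
        return logp

    def predict(self, tokens):
        return max(self.class_docs, key=lambda c: self.log_prob(tokens, c))


# Toy usage with made-up data.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"],
        ["stocks", "fell", "today"], ["stocks", "rose", "today"]]
labels = ["pets", "pets", "finance", "finance"]
model = ChainAugmentedNB().fit(docs, labels)
print(model.predict(["the", "cat", "ran"]))
```

Because the model only sees token sequences, the same sketch works at the character level by passing `list(text)` instead of a word list, which is one way to avoid language-specific tokenization. Setting `lam=0` reduces the model to a unigram naive Bayes classifier with add-one (Laplace) smoothing.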



Author information

Authors and Affiliations

Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts at Amherst, 140 Governors Drive, Amherst, MA, USA, 01003
Fuchun Peng
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, T6G 2E8
Dale Schuurmans
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, T6G 2E8
Shaojun Wang


About this article

Cite this article

Peng, F., Schuurmans, D. & Wang, S. Augmenting Naive Bayes Classifiers with Statistical Language Models. Information Retrieval 7, 317–345 (2004). https://doi.org/10.1023/B:INRT.0000011209.19643.e2


Issue Date: September 2004
DOI: https://doi.org/10.1023/B:INRT.0000011209.19643.e2
