Abstract
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier that allows for a local Markov dependence among observations, a model we refer to as the Chain Augmented Naive Bayes (CAN) classifier. CAN models have two advantages over standard naive Bayes classifiers. First, they relax some of the independence assumptions of naive Bayes by allowing a local Markov chain dependence in the observed variables, while still permitting efficient inference and learning. Second, they permit straightforward application of sophisticated smoothing techniques from statistical language modeling, which allows one to obtain better parameter estimates than the standard Laplace smoothing used in naive Bayes classification. In this paper, we introduce CAN models and apply them to various text classification problems. To demonstrate the language-independent and task-independent nature of these classifiers, we present experimental results on several text classification problems (authorship attribution, text genre classification, and topic detection) in several languages (Greek, English, Japanese, and Chinese). We then systematically study the key factors in the CAN model that can influence classification performance, and analyze the strengths and weaknesses of the model.
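The core idea can be illustrated with a minimal sketch: train one n-gram language model per class, then assign a document to the class whose prior plus chained n-gram log-probabilities is highest. This is not the authors' implementation. The class name `ChainAugmentedNB`, the choice of character bigrams, the start-padding symbol, and the add-one smoothing are all assumptions made for brevity; the abstract specifically argues for more sophisticated smoothing than this.

```python
import math
from collections import Counter, defaultdict


class ChainAugmentedNB:
    """Toy classifier with one character n-gram language model per class.

    Hypothetical illustration only: add-alpha smoothing stands in for the
    language-modeling smoothing techniques the paper has in mind.
    """

    def __init__(self, n=2, alpha=1.0):
        self.n = n                                   # n-gram order (assumed default: bigrams)
        self.alpha = alpha                           # add-alpha smoothing constant
        self.ngram_counts = defaultdict(Counter)     # class -> (context, char) counts
        self.context_counts = defaultdict(Counter)   # class -> context counts
        self.class_counts = Counter()                # class -> number of training documents
        self.vocab = set()                           # characters seen in training

    def _ngrams(self, text):
        # Pad the start so the first characters also have a full context.
        padded = "\x02" * (self.n - 1) + text
        for i in range(self.n - 1, len(padded)):
            yield padded[i - self.n + 1:i], padded[i]

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for context, char in self._ngrams(text):
                self.ngram_counts[label][(context, char)] += 1
                self.context_counts[label][context] += 1
                self.vocab.add(char)
        return self

    def log_score(self, text, label):
        # log P(class) + sum over positions of log P(char | previous n-1 chars, class)
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen characters
        for context, char in self._ngrams(text):
            num = self.ngram_counts[label][(context, char)] + self.alpha
            den = self.context_counts[label][context] + self.alpha * vocab_size
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.class_counts, key=lambda c: self.log_score(text, c))


if __name__ == "__main__":
    clf = ChainAugmentedNB(n=2).fit(
        ["the cat sat on the mat", "the dog ate the bone",
         "el gato come pescado", "el perro come carne"],
        ["en", "en", "es", "es"],
    )
    print(clf.predict("the bird"), clf.predict("el pato"))
```

Setting `n=1` removes the Markov dependence and reduces the sketch to an ordinary multinomial naive Bayes over characters, which is the sense in which the chain-augmented model generalizes the standard one.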
Cite this article
Peng, F., Schuurmans, D. & Wang, S. Augmenting Naive Bayes Classifiers with Statistical Language Models. Information Retrieval 7, 317–345 (2004). https://doi.org/10.1023/B:INRT.0000011209.19643.e2