Machine Learning, Volume 39, Issue 2–3, pp 103–134

Text Classification from Labeled and Unlabeled Documents using EM

  • Kamal Nigam
  • Andrew Kachites McCallum
  • Sebastian Thrun
  • Tom Mitchell

Abstract

This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available.

We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
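
The following is a minimal sketch of the semi-supervised EM procedure described above, using a multinomial naive Bayes model and a weighting factor on the unlabeled data. It is an illustration under simplifying assumptions (fixed number of EM iterations rather than a convergence test, a single mixture component per class); the function and variable names (train_em_nb, X_l, X_u, lam) are ours, not the authors'.

```python
# A sketch of EM with multinomial naive Bayes over labeled and unlabeled
# word-count matrices. Not the authors' implementation; illustrative only.
import numpy as np

def train_em_nb(X_l, y_l, X_u, n_classes, lam=1.0, n_iters=10, alpha=1.0):
    """X_l, X_u: (documents x vocabulary) word-count matrices.
    y_l: integer class labels for the labeled documents.
    lam: weight on the unlabeled data (the paper's first extension).
    alpha: Laplace smoothing pseudo-count."""
    # Hard labels for the labeled set as a one-hot "responsibility" matrix.
    R_l = np.eye(n_classes)[y_l]

    def m_step(R_u):
        # Retrain the classifier from labeled plus (down-weighted) unlabeled docs.
        R = np.vstack([R_l, lam * R_u])
        X = np.vstack([X_l, X_u])
        priors = R.sum(axis=0) + alpha
        priors /= priors.sum()
        word_counts = R.T @ X + alpha                     # (classes x vocabulary)
        theta = word_counts / word_counts.sum(axis=1, keepdims=True)
        return np.log(priors), np.log(theta)

    def e_step(log_priors, log_theta, X):
        # Posterior class probabilities for each document under the current model.
        log_post = X @ log_theta.T + log_priors
        log_post -= log_post.max(axis=1, keepdims=True)   # for numerical stability
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    # Initialize from the labeled documents alone (unlabeled docs get zero weight).
    log_priors, log_theta = m_step(np.zeros((X_u.shape[0], n_classes)))
    for _ in range(n_iters):
        R_u = e_step(log_priors, log_theta, X_u)          # E-step: label unlabeled docs
        log_priors, log_theta = m_step(R_u)               # M-step: retrain the classifier
    return log_priors, log_theta
```

Setting lam = 0 recovers training on the labeled documents only, while lam = 1 gives every probabilistically labeled document the same influence as a labeled one; intermediate values correspond to the weighted variant discussed in the paper.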

Keywords: text classification, Expectation-Maximization, integrating supervised and unsupervised learning, combining labeled and unlabeled data, Bayesian learning

Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Kamal Nigam (1)
  • Andrew Kachites McCallum (2, 3)
  • Sebastian Thrun (4)
  • Tom Mitchell (5)

  1. School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
  2. Just Research, Pittsburgh, USA
  3. School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
  4. School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
  5. School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
