Learning to Separate Text Content and Style for Classification

  • Dell Zhang
  • Wee Sun Lee
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4182)


Many text documents naturally have two kinds of labels. For example, we may label web pages from universities according to their categories, such as “student” or “faculty”, or according to their source universities, such as “Cornell” or “Texas”. We call one kind of label the content and the other kind the style. Given a set of documents, each with both content and style labels, we seek to learn to classify a set of documents in a new style, with no content labels, into its content classes. Assuming that every document is generated using words drawn from a mixture of two multinomial component models, one content model and one style model, we propose a method named Cartesian EM that constructs content models and style models through Expectation Maximization and performs classification of the unknown content classes transductively. Our experiments on real-world datasets show the proposed method to be effective for style-independent text content classification.
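The mixture assumption in the abstract can be illustrated with a small sketch. It is not the authors' Cartesian EM: it covers only the fully labeled case and omits the transductive step that infers content classes for documents in a new style. The function name `fit_content_style`, the fixed mixing weight `lam`, and the smoothing constant `alpha` are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def fit_content_style(counts, content, style, n_content, n_style,
                      lam=0.5, n_iter=20, alpha=1e-2):
    """EM sketch for p(w | c, s) = lam * theta[c, w] + (1 - lam) * phi[s, w].

    counts  : (n_docs, n_words) word-count matrix
    content : content label per document (known here, an assumption)
    style   : style label per document
    lam, alpha : hypothetical mixing weight and smoothing pseudo-count
    """
    counts = np.asarray(counts, dtype=float)
    content = np.asarray(content)
    style = np.asarray(style)
    # One-hot label matrices let the M-step sum expected counts per class.
    C = np.eye(n_content)[content]   # (n_docs, n_content)
    S = np.eye(n_style)[style]       # (n_docs, n_style)

    # Initialize both multinomials from smoothed per-class word counts.
    theta = C.T @ counts + alpha
    theta /= theta.sum(axis=1, keepdims=True)
    phi = S.T @ counts + alpha
    phi /= phi.sum(axis=1, keepdims=True)

    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior that each word occurrence came from the
        # content model rather than the style model.
        from_content = lam * theta[content]        # (n_docs, n_words)
        from_style = (1.0 - lam) * phi[style]
        mix = from_content + from_style
        log_liks.append(float((counts * np.log(mix)).sum()))
        resp = from_content / mix

        # M-step: re-estimate each multinomial from its expected counts.
        theta = C.T @ (counts * resp) + alpha
        theta /= theta.sum(axis=1, keepdims=True)
        phi = S.T @ (counts * (1.0 - resp)) + alpha
        phi /= phi.sum(axis=1, keepdims=True)
    return theta, phi, log_liks

# Toy data: 4 documents, 4 vocabulary words, 2 content and 2 style classes.
counts = np.array([[3, 0, 1, 0],
                   [2, 1, 0, 1],
                   [0, 3, 1, 0],
                   [0, 2, 0, 2]])
content = [0, 0, 1, 1]
style = [0, 1, 0, 1]
theta, phi, log_liks = fit_content_style(counts, content, style, 2, 2)
```

Each row of `theta` is a word distribution for one content class and each row of `phi` is a word distribution for one style class. A transductive variant would additionally treat the content labels of the new-style documents as hidden variables and update them inside the same loop.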


Keywords: Test Document, Style Model, Label Space, Style Type, Transductive Learning





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Dell Zhang (1)
  • Wee Sun Lee (2)
  1. School of Computer Science and Information Systems, Birkbeck, University of London, London, UK
  2. Department of Computer Science and Singapore-MIT Alliance, National University of Singapore, Singapore
