Learning to Separate Text Content and Style for Classification

Zhang, Dell; Lee, Wee Sun

doi:10.1007/11880592_7

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4182)

Abstract

Many text documents naturally have two kinds of labels. For example, we may label web pages from universities according to their categories, such as “student” or “faculty”, or according to their source universities, such as “Cornell” or “Texas”. We call one kind of label the content and the other kind the style. Given a set of documents, each with both content and style labels, we seek to effectively learn to classify a set of documents in a new style, with no content labels, into their content classes. Assuming that every document is generated using words drawn from a mixture of two multinomial component models, one content model and one style model, we propose a method named Cartesian EM that constructs content models and style models through Expectation Maximization and performs classification of the unknown content classes transductively. Our experiments on real-world datasets show the proposed method to be effective for style-independent text content classification.
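
The abstract gives the modelling assumption but not the update equations, so the following is only a minimal sketch of one way to read it, not the authors' algorithm. It assumes each word of a document with content c and style s is drawn with probability lam * P(w|c) + (1 - lam) * P(w|s), and it runs EM over labeled documents and unlabeled new-style documents together. The function name `cartesian_em_sketch`, the fixed mixing weight `lam`, the smoothing pseudo-count `alpha`, the uniform class prior, and the toy documents are all assumptions introduced for illustration.

```python
import math

def cartesian_em_sketch(docs, lam=0.5, iters=20, alpha=0.1):
    """docs: list of (word_counts, content_label_or_None, style_label).

    Hypothetical helper, not the paper's code. lam (content/style mixing
    weight), alpha (smoothing pseudo-count) and the uniform class prior
    are assumptions made for this sketch.
    """
    vocab = sorted({w for counts, _, _ in docs for w in counts})
    contents = sorted({c for _, c, _ in docs if c is not None})
    styles = sorted({s for _, _, s in docs})

    # Content posterior per document: one-hot if labeled, uniform otherwise.
    post = []
    for _, label, _ in docs:
        if label is None:
            post.append({c: 1.0 / len(contents) for c in contents})
        else:
            post.append({c: float(c == label) for c in contents})

    # Start every word distribution at uniform.
    p_c = {c: {w: 1.0 / len(vocab) for w in vocab} for c in contents}
    p_s = {s: {w: 1.0 / len(vocab) for w in vocab} for s in styles}

    def normalise(counts):
        total = sum(counts.values())
        return {w: n / total for w, n in counts.items()}

    for _ in range(iters):
        # E-step (words): split each word count between content and style.
        n_c = {c: {w: alpha for w in vocab} for c in contents}
        n_s = {s: {w: alpha for w in vocab} for s in styles}
        for (counts, _, s), q in zip(docs, post):
            for c in contents:
                for w, n in counts.items():
                    from_content = lam * p_c[c][w]
                    r = from_content / (from_content + (1 - lam) * p_s[s][w])
                    n_c[c][w] += q[c] * n * r
                    n_s[s][w] += q[c] * n * (1 - r)

        # M-step: re-estimate the smoothed multinomials.
        p_c = {c: normalise(n_c[c]) for c in contents}
        p_s = {s: normalise(n_s[s]) for s in styles}

        # E-step (documents): update posteriors of unlabeled documents only.
        for i, (counts, label, s) in enumerate(docs):
            if label is not None:
                continue
            log_p = {
                c: sum(n * math.log(lam * p_c[c][w] + (1 - lam) * p_s[s][w])
                       for w, n in counts.items())
                for c in contents
            }
            top = max(log_p.values())
            weights = {c: math.exp(v - top) for c, v in log_p.items()}
            z = sum(weights.values())
            post[i] = {c: v / z for c, v in weights.items()}
    return post, p_c, p_s

# Toy data (invented): two labeled styles, one new style without content labels.
docs = [
    ({"homework": 3, "course": 2, "ithaca": 2}, "student", "cornell"),
    ({"research": 3, "grant": 2, "ithaca": 2}, "faculty", "cornell"),
    ({"homework": 2, "course": 3, "austin": 2}, "student", "texas"),
    ({"research": 2, "grant": 3, "austin": 2}, "faculty", "texas"),
    ({"homework": 2, "course": 2, "madison": 3}, None, "wisconsin"),
    ({"research": 2, "grant": 2, "madison": 3}, None, "wisconsin"),
]
post, p_c, p_s = cartesian_em_sketch(docs)
predicted = [max(q, key=q.get) for q in post[4:]]
```

Here `predicted` holds the most probable content class for each of the two unlabeled documents. Because the new-style documents take part in the EM iterations themselves, the sketch is transductive in the sense the abstract describes: the style model for the unseen style is estimated from the very documents being classified.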

Author information

Authors and Affiliations

School of Computer Science and Information Systems, Birkbeck, University of London, London, WC1E 7HX, UK
Dell Zhang
Department of Computer Science and Singapore-MIT Alliance, National University of Singapore, 117543, Singapore
Wee Sun Lee

Editor information

Editors and Affiliations

Department of Computer Science, National University of Singapore, 3 Science Drive 2, 117543, Singapore
Hwee Tou Ng
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, 119613, Singapore
Mun-Kew Leong
Department of Computer Science, School of Computing, National University of Singapore, 117543, Singapore
Min-Yen Kan
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, P.O. Box, 119613, Singapore
Donghong Ji

About this paper

Cite this paper

Zhang, D., Lee, W.S. (2006). Learning to Separate Text Content and Style for Classification. In: Ng, H.T., Leong, MK., Kan, MY., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_7

DOI: https://doi.org/10.1007/11880592_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45780-0
Online ISBN: 978-3-540-46237-8
eBook Packages: Computer Science, Computer Science (R0)
