Document Classification

Encyclopedia of Machine Learning

Synonyms

Document categorization; Supervised learning on text data

Definition

Document classification is the process of assigning a document one or more labels from a predefined set of labels. The main issues in document classification concern classifying the free text that makes up the document's content, for instance classifying Web documents as being about arts, education, science, etc., or classifying news articles by topic. In general, document classification can consider and combine different properties of a document, such as document type, authors, links to other documents, and content. Machine learning methods applied to document classification are general classification methods adjusted to handle the specifics of text data.
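
As a rough illustration of the definition above, the following minimal sketch assigns one label from a predefined set to a free-text document. It assumes a bag-of-words representation and a multinomial Naive Bayes classifier with add-one smoothing; that choice of method is an assumption made for illustration and is not prescribed by this entry. The toy training documents, the label names, and the helper functions `tokenize`, `train`, and `classify` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace: a deliberately crude bag-of-words tokenizer."""
    return text.lower().split()


def train(labeled_docs):
    """Collect per-label document counts and word counts from (text, label) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest multinomial Naive Bayes log-score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(doc_counts):
        # Log prior: fraction of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus; the labels mirror the examples in the definition.
TRAINING_DOCS = [
    ("the gallery opened a new painting exhibition", "arts"),
    ("the sculpture and painting collection toured museums", "arts"),
    ("students enrolled in the new school curriculum", "education"),
    ("teachers revised the school exam curriculum", "education"),
    ("physicists measured the particle in the experiment", "science"),
    ("the laboratory experiment confirmed the theory", "science"),
]

model = train(TRAINING_DOCS)
predicted = classify("a painting exhibition at the gallery", model)
print(predicted)
```

On this toy corpus the script prints `arts`. The sketch returns a single best label; assigning several labels to one document, as the definition allows, could for example be handled by training one binary classifier per label, which is not shown here.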

Motivation and Background

Documents and text data are valuable sources of information, and their growing availability in electronic form has naturally led to the application of different analytic methods. One of the common...

© 2011 Springer Science+Business Media, LLC

Cite this entry

Mladenić, D., Brank, J., Grobelnik, M. (2011). Document Classification. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_230
