Synonyms
Document categorization; Supervised learning on text data
Definition
Document classification is the process of assigning one or more labels to a document from a predefined set of labels. The main issues in document classification concern classifying the free text that makes up the document content, for instance, classifying Web documents as being about arts, education, science, etc., or classifying news articles by topic. In general, document classification can consider and combine different properties of a document, such as document type, authors, links to other documents, and content. Machine learning methods applied to document classification are based on general classification methods adjusted to handle some specifics of text data.
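As a rough illustration of the definition, the sketch below trains a multinomial Naive Bayes classifier with add-one smoothing on a bag-of-words representation and assigns a single label to a new document. Naive Bayes is only one possible choice of general classification method, picked here for brevity; the entry does not prescribe it. The toy training documents, the label names (taken from the arts, education, science example), and the function names tokenize, train, and classify are all hypothetical. The sketch covers only the single-label case, not the assignment of several labels to one document.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (document text, label) pairs.
TRAINING_DOCS = [
    ("the gallery opened a new painting exhibition", "arts"),
    ("the sculpture and painting museum tour", "arts"),
    ("students attend school lessons with a teacher", "education"),
    ("the university teacher published new lessons", "education"),
    ("physics experiment confirms the theory", "science"),
    ("the laboratory experiment measured particles", "science"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Collect per-label word counts, per-label document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-posterior score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    # Words never seen in training carry no information here, so skip them.
    tokens = [t for t in tokenize(text) if t in vocabulary]
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DOCS)
print(classify("a painting exhibition at the museum", model))
print(classify("the teacher prepared school lessons", model))
```

Working in log space keeps the product of many small word probabilities from underflowing. A realistic system would replace the whitespace tokenizer with proper text preprocessing and would typically apply feature selection to the vocabulary.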
Motivation and Background
Documents and text data are valuable sources of information, and their growing availability in electronic form has naturally led to the application of different analytic methods. One of the common...
© 2011 Springer Science+Business Media, LLC
Cite this entry
Mladenić, D., Brank, J., Grobelnik, M. (2011). Document Classification. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_230
DOI: https://doi.org/10.1007/978-0-387-30164-8_230
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-30768-8
Online ISBN: 978-0-387-30164-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering