Thematic content analysis using supervised machine learning: An empirical evaluation using German online news

Scharkow, Michael

doi:10.1007/s11135-011-9545-7

Published: 28 July 2011

Volume 47, pages 761–773, (2013)

Quality & Quantity

Abstract

In recent years, two approaches to automatic content analysis have been introduced in the social sciences: semantic network analysis and supervised text classification. We argue that, although less linguistically sophisticated than semantic parsing techniques, statistical machine learning offers many advantages for applied communication research. By using manually coded material for training, supervised classification seamlessly bridges the gap between traditional and automatic content analysis. In this paper, we briefly introduce the conceptual foundations of machine learning approaches to text classification and discuss their application in social science research. We then evaluate their potential in an experimental study in which German online news was coded with established thematic categories. Moreover, we investigate whether and how linguistic preprocessing can improve classification quality. Results indicate that supervised text classification is generally robust and reliable for some categories, but may even be useful when it fails.
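The idea of training a classifier on manually coded material can be sketched in a few lines. The example below is a minimal illustration only: it uses a multinomial naive Bayes classifier as a stand-in, since the abstract does not name the algorithm evaluated in the study. The headlines, the category labels `politics` and `sports`, and the names `tokenize` and `NaiveBayes` are invented for this sketch and are not taken from the article's data or codebook.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters (a crude preprocessing step)."""
    return re.findall(r"[a-zäöüß]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self):
        self.class_docs = Counter()              # coded documents per category
        self.word_counts = defaultdict(Counter)  # token counts per category
        self.vocab = set()                       # all tokens seen in training

    def train(self, text, label):
        """Add one manually coded document to the training material."""
        self.class_docs[label] += 1
        for tok in tokenize(text):
            self.word_counts[label][tok] += 1
            self.vocab.add(tok)

    def predict(self, text):
        """Return the category with the highest log posterior score."""
        total_docs = sum(self.class_docs.values())
        scores = {}
        for label, n_docs in self.class_docs.items():
            n_tokens = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # ignore tokens never seen during training
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (n_tokens + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical hand-coded training material (invented for this sketch).
TRAINING = [
    ("Bundestag beschließt neues Gesetz zur Rente", "politics"),
    ("Kanzlerin trifft Minister im Bundestag", "politics"),
    ("Bayern gewinnt Spiel in der Bundesliga", "sports"),
    ("Trainer lobt Mannschaft nach dem Spiel", "sports"),
]

clf = NaiveBayes()
for text, label in TRAINING:
    clf.train(text, label)

print(clf.predict("Minister stellt Gesetz im Bundestag vor"))
```

Linguistic preprocessing of the kind the abstract mentions would slot into `tokenize`, for example by adding stemming or stop-word removal before the tokens are counted.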


Author information

Authors and Affiliations

Institute of Communication Studies, University of Hohenheim, Wollgrasweg 23, 70599, Stuttgart, Germany
Michael Scharkow

Corresponding author

Correspondence to Michael Scharkow.

About this article

Cite this article

Scharkow, M. Thematic content analysis using supervised machine learning: An empirical evaluation using German online news. Qual Quant 47, 761–773 (2013). https://doi.org/10.1007/s11135-011-9545-7

Issue Date: February 2013
