Thematic content analysis using supervised machine learning: An empirical evaluation using German online news

Quality & Quantity

Abstract

In recent years, two approaches to automatic content analysis have been introduced in the social sciences: semantic network analysis and supervised text classification. We argue that, although less linguistically sophisticated than semantic parsing techniques, statistical machine learning offers many advantages for applied communication research. By using manually coded material for training, supervised classification seamlessly bridges the gap between traditional and automatic content analysis. In this paper, we briefly introduce the conceptual foundations of machine learning approaches to text classification and discuss their application in social science research. We then evaluate their potential in an experimental study in which German online news was coded with established thematic categories. Moreover, we investigate whether and how linguistic preprocessing can improve classification quality. Results indicate that supervised text classification is generally robust and reliable for some categories, but may even be useful when it fails.

Author information

Corresponding author

Correspondence to Michael Scharkow.

About this article

Cite this article

Scharkow, M. Thematic content analysis using supervised machine learning: An empirical evaluation using German online news. Qual Quant 47, 761–773 (2013). https://doi.org/10.1007/s11135-011-9545-7

  • DOI: https://doi.org/10.1007/s11135-011-9545-7