Quality & Quantity, Volume 47, Issue 2, pp 761–773

Thematic content analysis using supervised machine learning: An empirical evaluation using German online news

  • Michael Scharkow


In recent years, two approaches to automatic content analysis have been introduced in the social sciences: semantic network analysis and supervised text classification. We argue that, although less linguistically sophisticated than semantic parsing techniques, statistical machine learning offers many advantages for applied communication research. By using manually coded material for training, supervised classification seamlessly bridges the gap between traditional and automatic content analysis. In this paper, we briefly introduce the conceptual foundations of machine learning approaches to text classification and discuss their application in social science research. We then evaluate their potential in an experimental study in which German online news was coded with established thematic categories. Moreover, we investigate whether and how linguistic preprocessing can improve classification quality. Results indicate that supervised text classification is generally robust and reliable for some categories, but may even be useful when it fails.
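The idea of training a classifier on manually coded material can be illustrated with a minimal sketch of a multinomial naive Bayes classifier. This is not the classifier or preprocessing pipeline evaluated in the paper; the function names, the whitespace tokenizer, the add-one smoothing, the toy headlines, and the two categories are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def train(coded_documents):
    """Count documents per category and words per category from (text, category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in coded_documents:
        class_counts[category] += 1
        for token in tokenize(text):
            word_counts[category][token] += 1
            vocabulary.add(token)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log posterior under naive Bayes."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for category, n_docs in class_counts.items():
        log_prob = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[category].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[category][token]
            # add-one (Laplace) smoothing avoids zero probabilities
            log_prob += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[category] = log_prob
    return max(scores, key=scores.get)


# Hypothetical, hand-made training examples (not data from the study).
coded = [
    ("Bundestag debattiert Haushalt", "politics"),
    ("Kanzlerin trifft Minister", "politics"),
    ("Bayern gewinnt Spiel", "sports"),
    ("Trainer lobt Mannschaft nach Spiel", "sports"),
]
model = train(coded)
print(classify("Minister im Bundestag", model))
print(classify("Trainer lobt Spiel", model))
```

In this sketch, the manually coded pairs play the role of the human-coded training material, and new uncoded texts are assigned the most probable category. Linguistic preprocessing such as stemming would replace the simple `tokenize` function.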


Keywords: Content analysis, Machine learning, Online news, Bayesian classifier





Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. Institute of Communication Studies, University of Hohenheim, Stuttgart, Germany
