Abstract
In recent years, two approaches to automatic content analysis have been introduced in the social sciences: semantic network analysis and supervised text classification. We argue that, although less linguistically sophisticated than semantic parsing techniques, statistical machine learning offers many advantages for applied communication research. By using manually coded material for training, supervised classification seamlessly bridges the gap between traditional and automatic content analysis. In this paper, we briefly introduce the conceptual foundations of machine learning approaches to text classification and discuss their application in social science research. We then evaluate their potential in an experimental study in which German online news was coded with established thematic categories. Moreover, we investigate whether and how linguistic preprocessing can improve classification quality. Results indicate that supervised text classification is generally robust and reliable for some categories, and that it may be useful even when it fails.
Cite this article
Scharkow, M. Thematic content analysis using supervised machine learning: An empirical evaluation using German online news. Qual Quant 47, 761–773 (2013). https://doi.org/10.1007/s11135-011-9545-7
DOI: https://doi.org/10.1007/s11135-011-9545-7