Beyond lexical frequencies: using R for text analysis in the digital humanities

  • Taylor Arnold
  • Nicolas Ballier
  • Paula Lissón
  • Lauren Tilton
Original Paper

Abstract

This paper presents a combination of R packages (user-contributed toolkits written in a common core programming language) to facilitate the humanistic investigation of digitised, text-based corpora. Our survey of text analysis packages includes those of our own creation (cleanNLP and fasttextM) as well as packages built by other research groups (stringi, readtext, hyphenatr, quanteda, and hunspell). By operating on generic object types, these packages unite research innovations in corpus linguistics, natural language processing, machine learning, statistics, and digital humanities. We begin by elaborating on the theoretical benefits of R as an elaborate glue language for bringing together several areas of expertise, and we compare it to linguistic concordancers and other tool-based approaches to text analysis in the digital humanities. We then showcase the practical benefits of this ecosystem by illustrating how R packages have been integrated into a digital humanities project. Throughout, the focus is on moving beyond the bag-of-words, lexical frequency model by incorporating linguistically driven analyses in research.

Keywords

  • Digital humanities
  • Text mining
  • Text interoperability

Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. University of Richmond, Virginia, USA
  2. UFR Études Anglophones, Université Paris Diderot, Paris, France
  3. Department of Linguistics, Universität Potsdam, Potsdam, Germany
