Language technology for digital humanities: introduction to the special issue
The use of digital resources and tools across humanities disciplines is steadily increasing, giving rise to new research paradigms and associated methods that are commonly subsumed under the term digital humanities. Digital humanities does not constitute a new discipline in itself, but rather a new approach to humanities research that cuts across different existing humanities disciplines. While digital humanities extends well beyond language-based research, textual resources and spoken language materials play a central role in most humanities disciplines.
Case studies of using language technology and/or language resources with the goal of finding new answers to existing research questions in a particular humanities discipline or addressing entirely new research questions.
Case studies of expanding the functionality of existing language processing tools in order to be able to address research questions in digital humanities.
The design of new language processing tools as well as annotation tools for spoken and written language, showcasing their use in digital humanities research.
Domain adaptation of rule-based, statistical, or machine-learning models for language processing tools in digital humanities research.
Challenges posed for language processing tools when used on diachronic data, language variation data, or literary texts.
Showcasing the use of language processing tools in humanities disciplines such as anthropology, gender studies, history, literary studies, philosophy, political science, and theology.
The collection of articles accepted for publication in this special issue covers a wide range of topics relevant to researchers in the digital humanities. Our main objective for this introduction is to relate the individual contributions to larger, on-going research themes in digital humanities and to highlight the role of natural language tools and resources in each article.
Computational Text Analysis within the Humanities: How to Combine Working Practices from the Contributing Fields? by Jonas Kuhn.
Dialogue Analysis: A Case Study on the New Testament by Chak Yan Yeung and John Lee.
Vector Space Explorations of Literary Language by Andreas van Cranenburgh, Karina van Dalen-Oskam, and Joris van Zundert.
Geoparsing Historical and Contemporary Literary Text set in the City of Edinburgh by Beatrice Alex, Claire Grover, Richard Tobin, and Jon Oberlander.
Token-based Spelling Variant Detection in Middle Low German Texts by Fabian Barteld, Chris Biemann, and Heike Zinsmeister.
Beyond Lexical Frequencies: Using R for Text Analysis in the Digital Humanities by Taylor Arnold, Nicolas Ballier, Paula Lissón, and Lauren Tilton.
The contribution by Yeung and Lee presents an automated approach for analyzing dialogues in the New Testament. The authors have developed a machine learning approach for identifying dialogues in three steps: First, they identify speakers and listeners, then they detect chains of quotes with alternating speakers/listeners, and finally they determine the boundaries of complete dialogues. They use automatic POS tagging, dependency parsing, and named entity recognition to create features for the machine learner. Based on these extracted dialogues, Yeung and Lee present a quantitative analysis of the dialogues.
The article by van Cranenburgh, van Dalen-Oskam, and van Zundert investigates to what extent machine learning (ML) approaches can successfully predict the degree of literariness of a novel. More specifically, the authors utilize two types of unsupervised ML models that are widely used in natural language processing: a topic model and a neural vector space model which are trained on different text passages of 2–3 pages in length. They show how different notions of semantic complexity can be derived from these models and investigate how well these complexity measures correlate with the literacy ratings of Dutch novels that were collected by an on-line survey.
Data-driven analysis of literary text is also the topic of the article contributed by Alex, Grover, Tobin, and Oberlander. The authors adapt the Edinburgh Geobrowser, an NLP tool for automatic enrichment of textual materials with geographical information, for use with historical literary texts set in the city of Edinburgh. The tool allows fine-grained annotation of street names, monuments, and other landmarks. The quality of the automatic annotation is evaluated against a gold standard. The tool has a modular architecture and is therefore easily adapted to other geographical locations.
A recurrent theme in the digitization and use of historical text corpora concerns wide-spread spelling variation for the same lemma. Barteld, Biemann, and Zinsmeister offer a new computational approach to dealing with this issue for Middle Low German texts. Contrary to most studies that deal with the phenomenon of spelling variation, the authors of the present article do not attempt to convert different spelling variants to a single, normalized form. Rather, their spelling variant detection approach generates all potential spelling variants for a given lemma and filters the set of potential variants by systematically inspecting the linguistic contexts of the spelling variants that occur in the text.
Due to its focus on data analysis and visualization, its large number of processing packages, and an active user community, the statistical computing language R is well suited to text analysis tasks and is gaining popularity in digital humanities. The article by Arnold, Ballier, Lissón, and Tilton presents a collection of R packages, built around a common text interchange format, to be used in digital humanities workflows. They demonstrate the power and usefulness of the ecosystem, which includes NLP tools, in a digital humanities project.
Digitising Swiss German—How to Process and Study a Polycentric Spoken Language by Yves Scherrer, Tanja Samardžić, and Elvira Glaser.
Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus by Yonatan Belinkov, Alexander Magidow, Alberto Barrón-Cedeño, Avi Shmidman, and Maxim Romanov.
Historical Corpora meet the Digital Humanities: The Jerusalem Corpus of Emergent Modern Hebrew by Aynat Rubinstein.
Belinkov, Magidow, Barrón-Cedeño, Shmidman, and Romanov describe the creation of a large scale diachronic corpus of written Arabic, automatically annotated for sentence boundaries, morphological segments, lemmas, POS tags, and constituent syntax. They then develop a computational methodology to cluster the diachronic texts into periods. In a final analysis of the results of their periodization algorithm on the corpus, Belinkov et al. not only confirm the established periodization of Standard Arabic into Classical and Modern Standard Arabic, but they also find evidence for a more differentiated periodization.
Rubinstein describes the process of creating an open-access corpus of Emergent Modern Hebrew, which includes extensive metadata and linguistic annotations. Throughout the process, care was taken to follow best practices and to comply with standards in the digital humanities. His article shows how the use of NLP tools, in combination with crowd sourcing and collaboration with external partners, made the construction of the resource possible. Use-cases are presented to demonstrate the use of the corpus in diachronic linguistic research.
From 0 to 10 Million Annotated Words— Part-of-Speech Tagging for Middle High German by Sarah Schulz and Nora Ketschik.
Exploiting Languages Proximity for Part-of-Speech Tagging of three French Regional Languages by Pierre Magistry, Anne-Laure Ligozat, and Sophie Rosset.
Approaching Terminological Ambiguity in Cross-disciplinary Communication as a Word Sense Induction Task. A Pilot Study. by Julie Mennes, Ted Pedersen, and Els Lefever.
Magistry, Ligozat, and Rosset also address the issue of POS tagging, but they investigate methods for POS tagging the regional languages Alsatian, Occitan, and Picard. For these languages, no resources exist, thus Magistry et al. leverage resources from the related, high-resourced languages German, Catalan, and French using delexicalization and transposition of highly frequent words.
Linguistic enrichment and disambiguation of text corpora is by no means limited to morphological and syntactic information. Mennes, Pedersen, and Lefever present a pilot study for sense clustering of ambiguous terms. The authors point out that such ambiguities can make communication difficult among researchers from different scientific disciplines in cross-disciplinary investigations, including in digital humanities. Mennes et al. conduct a pilot study for automatically inducing different sense clusters for ambiguous terms. Their study is based on the NLP software package Sense Clusters, previously developed by Pedersen.
We hope that this brief preview of each article will prove useful for navigating through this special issue and will stimulate readers to consult the individual contributions. We would like to take this opportunity to thank all authors for their contributions to this issue, and all referees who kindly agreed to review the submitted manuscripts for their in-depth comments and helpful suggestions.