Abstract
The paper describes computational tools that can be of great help to both qualitative and quantitative scholars in the humanities and social sciences who deal with words as data. The Java and Python tools described provide computer-automated ways of performing useful tasks: 1. check the well-formedness of filenames; 2. find user-defined characters in English-language stories (e.g., social actors, i.e., individuals, groups, organizations; animals) (“find the character”) via WordNet; 3. aggregate words into higher-level aggregates (e.g., “talk,” “say,” “write” are all verbs of “communication”) (“find the ancestor”) via WordNet; 4. evaluate human-created summaries of events taken from multiple sources where key actors found in the sources may have been left out of the summaries (“find the missing character”) via the Stanford CoreNLP POS and NER annotators; 5. list the documents in an event cluster where names or locations present close similarities (“check the character’s name tag”) using Levenshtein edit distance and the Stanford CoreNLP NER annotator; 6. list documents categorized into the wrong event cluster (“find the intruder”) via the Stanford CoreNLP POS and NER annotators; 7. classify loose documents into most-likely event clusters (“find the character’s home”) via the Stanford CoreNLP POS and NER annotators or a date matcher; 8. find similarities between documents (“find the plagiarist”) using Lucene. These automatic data-checking tools can be applied to ongoing or completed projects to check data reliability. The NLP tools are designed with “a fourth grader” in mind, a user with no computer science background. Some five thousand newspaper articles from a project on racial violence (Georgia 1875–1935) are used to show how the tools work. But the tools have much wider applicability to a variety of problems of interest to both qualitative and quantitative scholars who deal with text as data.
Notes
See, for instance, Franzosi’s PC-ACE (Program for Computer-Assisted Coding of Events) at www.pc-ace.com (Franzosi 2010).
The GitHub site will automatically install not only all the NLP Suite scripts but also Python and Anaconda required to run the scripts. It also provides extensive help on how to download and install a handful of external software required by some of the algorithms (e.g., Stanford CoreNLP, WordNet). The goal is to make it as easy as possible for non-technical users to take advantage of the tools with minimal investment.
We rely on the Python package openpyxl and ad hoc functions.
The newspaper collections found in Chronicling America of the Library of Congress (http://chroniclingamerica.loc.gov/newspapers/), the Digital Library of Georgia (http://dlg.galileo.usg.edu/MediaTypes/Newspapers.html?Welcome), The Atlanta Constitution, Proquest, Readex.
Multiple cross-references are also possible, whereby a document deals with several different events.
Contrary to some protest event projects based on a single newspaper source (e.g., The New York Times in the “Dynamics of Collective Action, 1960–1995” project that involved several social scientists, notably, Doug McAdam, John McCarthy, Susan Olzak, Sarah Soule, and led to dozens of influential publications; see for all McAdam and Su 2002), the Georgia lynching project is based on multiple newspaper sources for each event.
The most up-to-date numbers of terms are given in https://wordnet.princeton.edu/documentation/wnstats7wn.
A common critique of WordNet is that it is better suited to concrete concepts than to abstract ones. It is much easier to create hyponym/hypernym relationships between “conifer” as a type of “tree”, a “tree” as a type of “plant”, and a “plant” as a type of “organism” than to classify emotions like “fear” or “happiness” into such relationships.
The WordNet database comprises both single words and combinations of two or more words that typically come together with a specific meaning (collocations, e.g., coming out, shut down, thumbs up, stand in line, customs duty). Over 80% of terms in the WordNet database are collocations, at least at the time of Miller et al.’s Introduction to WordNet manual (1993, p. 2). For the English language (but WordNet is available for some 200 languages) the database contains a very large set of terms. The most up-to-date term counts are given at https://wordnet.princeton.edu/documentation/wnstats7wn.
On the way up through the hierarchy, the script relies on the WordNet concepts of hypernym – the generic term used to designate a whole class of specific instances (Y is a hypernym of X if X is a (kind of) Y) – and holonym – the name of the whole of which the meronym names a part (Y is a holonym of X if X is a part of Y).
Collocations are sets of two or more words that usually occur together to convey a complete meaning, e.g., “coming out,” “sunny side up”. Over 80% of terms in the WordNet database are collocations, at least at the time of Miller et al.’s Introduction to WordNet manual (1993, p. 2). For the English language (but WordNet is available for some 200 languages) the database contains a very large set of terms. The most up-to-date numbers of terms in each category are given at https://wordnet.princeton.edu/documentation/wnstats7wn.
The 25 top noun synsets are: act, animal, artifact, attribute, body, cognition, communication, event, feeling, food, group, location, motive, object, person, phenomenon, plant, possession, process, quantity, relation, shape, state, substance, time.
The 15 top verb synsets are: body, change, cognition, communication, competition, consumption, contact, creation, emotion, motion, perception, possession, social, stative, weather.
Unfortunately, there is no easy way to aggregate at levels lower than the top synsets. WordNet is a linked graph where each node is a synset and synsets are interlinked by means of conceptual-semantic and lexical relations. In other words, it is not a simple tree structure: there is no way to tell at which level a synset is located. For example, the synset “anger” can be traced from the top-level synset “feeling” along the path feeling → emotion → anger. But it can also be traced from the top-level synset “state” along the path state → condition → physiological condition → arousal → emotional arousal → anger. In the first case, “anger” is at level 3 (assuming “feeling” and the other top synsets are level 1); in the second case, it is at level 6. Programmatically, giving users more freedom to control the level of aggregation makes it hard to build a user-friendly communication protocol. If the user wants to aggregate up to level 3 (two levels below the top synset), should “anger” be considered a level 3 synset? Since there is no clear definition of how far away a synset is from the root (the top synsets), our algorithm aggregates all the way up to the root.
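The depth ambiguity described above can be made concrete with a toy hypernym graph (the tiny hand-built dictionary below is an illustration, not WordNet itself; the real database can be queried the same way, e.g., via NLTK's hypernym_paths):

```python
# Toy hypernym graph illustrating why synset "level" is ambiguous:
# WordNet is a DAG, not a tree, so a synset can sit at different
# depths depending on which path to a top synset is followed.
HYPERNYMS = {
    "anger": ["emotion", "emotional arousal"],
    "emotion": ["feeling"],
    "emotional arousal": ["arousal"],
    "arousal": ["physiological condition"],
    "physiological condition": ["condition"],
    "condition": ["state"],
    "feeling": [],   # top synset (level 1)
    "state": [],     # top synset (level 1)
}

def paths_to_root(word):
    """Return every hypernym path from word up to a top synset."""
    parents = HYPERNYMS.get(word, [])
    if not parents:
        return [[word]]
    return [[word] + rest
            for parent in parents
            for rest in paths_to_root(parent)]

for path in paths_to_root("anger"):
    print(" -> ".join(reversed(path)), "| level:", len(path))
# "anger" comes out at level 3 via "feeling" and level 6 via "state",
# exactly the ambiguity that motivates aggregating all the way to the root.
```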
Suppose that you wish to aggregate the verbs in your corpus under the label “violence.” WordNet top synsets for verbs do not include “violence” as a class. Verbs of violence may be listed under body, contact, social. You could use the Zoom IN/DOWN widget of Figure 24 to get a list of verbs in these top synsets, then manually go through the list to select only the verbs of violence of interest. That would mean manually going through the list of 956 verbs in the body class (e.g., to find there the verb “attack,” among others), the 2515 verbs of contact (e.g., to find there the verb “wrestle”), and the 1688 verbs of social (e.g., to find there the verb “abuse”): in total, 5159 distinct verbs. A restricted domain, for example newspaper articles on lynching, may have many fewer distinct verbs, indeed 2027, extracted using the lemmas of the POS annotator for all the VB* tags. Whether using the WordNet dictionary (a better solution if the list of verbs of violence is to be used across different corpora) or the POS distinct verb tags, the dictionary list can then be used to annotate the documents in the corpus via the NLP Suite dictionary annotator GUI.
We use the word “compilation”, rather than “summary”, since, by and large, we maintained the original newspaper language (e.g., the word “negro”, rather than “African American”) and original story line, however contrived the story may have appeared to be.
https://stanfordnlp.github.io/CoreNLP/ Manning et al. (2014).
More specifically, for locations, the NER tags used are: City, State_or_Province, Country. Several other NER values are also recognized and tagged (e.g., Numbers, Percentages, Money, Religion), but they are irrelevant in this context.
The column “List of Documents for Type of Error” may be split in several columns depending upon the number of documents found in error.
The algorithm can process all or selected NER values, comparing the associated word values either within a single event subdirectory or across all subdirectories (or all the files listed in a directory, for that matter).
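The comparison of NER word values across documents relies on Levenshtein edit distance; a minimal sketch of the classic dynamic-programming algorithm follows (the person names are hypothetical, and this is an illustration, not the Suite's own implementation):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical NER person tags from two documents in the same event cluster:
print(levenshtein("John Redding", "John Reading"))  # prints 1
# A small distance flags a likely variant spelling of the same actor,
# which the algorithm lists for the user to inspect.
```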
We calculate the relativity index using cosine similarity (Singhal 2001). We take the two lists of NN, NNS, Location, Date, Person, and Organization terms from the jth document (L1) and from all other j−1 documents (L2) and compute the cosine similarity between them. We construct a vector from each list by mapping the word count onto each unique word; the relativity index is then the cosine similarity between the two vectors, where n is the count of total unique words. For instance, if L1 is {Alice: 2, doctor: 3, hospital: 1} and L2 is {Bob: 1, hospital: 2}, and we fix the order of all words as {Alice, doctor, hospital, Bob}, then the first vector (V1) is (2, 3, 1, 0), the second vector (V2) is (0, 0, 2, 1), and the length n of the vectors is 4. The relativity index is the dot product of the two vectors divided by the product of their norms. Documents with a relativity index significantly lower than the rest of the cluster are flagged as unlikely to belong to the cluster.
\({\text{relativity index}} = \frac{\sum_{i = 1}^{n} V1_{i}\, V2_{i}}{\sqrt{\sum_{i = 1}^{n} V1_{i}^{2}}\;\sqrt{\sum_{i = 1}^{n} V2_{i}^{2}}}\)
The relativity index ranges from 0 to 1, where 0 means two documents are totally different, and 1 means two documents have exactly the same list of NN, NNS, Location, Date, Person, and Organization.
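As a check on the arithmetic in the worked example above, a minimal sketch of the relativity index as cosine similarity over word-count dictionaries (the function name is ours, not the Suite's):

```python
from math import sqrt

def relativity_index(counts1, counts2):
    """Cosine similarity between two word-count dictionaries."""
    vocab = sorted(set(counts1) | set(counts2))
    v1 = [counts1.get(w, 0) for w in vocab]
    v2 = [counts2.get(w, 0) for w in vocab]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# The worked example from the text: V1 = (2, 3, 1, 0), V2 = (0, 0, 2, 1)
L1 = {"Alice": 2, "doctor": 3, "hospital": 1}
L2 = {"Bob": 1, "hospital": 2}
print(relativity_index(L1, L2))  # 2 / (sqrt(14) * sqrt(5)) ≈ 0.239
```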
The bar chart displays the distribution of the most frequent threshold index values as intervals, with most records in the 0.25–0.29 interval.
It should be noted that the use of the words plagiarism and plagiarist in this context should be taken with a grain of salt. First, the data do not tell us anything about who copied whom, but only that the two different newspapers shared content, wholly or in part; furthermore, the shared content may well have come from an unacknowledged wire service (on the development and spread of news wire services in the United States during the second half of the nineteenth century, see Brooker-Gross 1981; on computational tools for plagiarism and authorship attribution, see, for instance, Stein et al. 2011).
http://lucene.apache.org/core/downloads.html. For a summary of approaches to document similarities, see Forsyth and Sharoff (2014).
Other approaches are also available. After all, determining document similarity has been a major research area due to its wide application in information retrieval, document clustering, machine translation, etc. Existing approaches to determine document similarity can be grouped into two categories: knowledge-based similarity and content-based similarity (Benedetti et al., 2019).
Knowledge-based similarity approaches extract information from other sources to supplement the corpus, so as to draw on more document features for analysis. For example, Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch 2007) represents documents as high-dimensional vectors based on features extracted from both the original articles and Wikipedia articles. Document similarity is then calculated using a vector space comparison algorithm. Since our main focus in this work is to detect plagiarism among texts in the same corpus, knowledge-based similarity approaches are not very fruitful here.
Content-based similarity approaches use only the textual information contained in the documents. Popular techniques in this field are Vector Space Models (Turney and Pantel 2010) and probabilistic models such as Okapi BM-25 (Robertson and Zaragoza 2009). These methods all transform documents into some form of representation and then perform either a vector space comparison or a query search match on the constructed representations.
document_duplicates.txt.
Users can specify different spans of temporal aggregation (e.g., year, quarter/year, month/year).
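A minimal sketch of how such temporal-aggregation keys might be derived from document dates (the function name and key formats are our assumptions, not the Suite's):

```python
from datetime import date

def period_key(d, span="year"):
    """Map a date to an aggregation key for the chosen temporal span."""
    if span == "year":
        return f"{d.year}"
    if span == "quarter":
        return f"Q{(d.month - 1) // 3 + 1}-{d.year}"
    if span == "month":
        return f"{d.month:02d}-{d.year}"
    raise ValueError(f"unknown span: {span}")

# A date drawn from the newspaper-article example in the next note:
print(period_key(date(1912, 12, 11), "quarter"))  # prints Q4-1912
```

Counting documents grouped by such keys yields the aggregated time series.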
In this specific application, documents are newspapers where the document name refers to the name of the paper (e.g., The New York Times) and the document instance refers to a specific newspaper article (e.g., The New York Times_12-11-1912_1, referring to page 1 of The New York Times of December 11, 1912). But the document name could refer to an ethnographic interview, with the document instance referring to an interviewer’s ID (by name or number), an interview’s location, time, or interviewee (by name or ID number).
The numbers in each row of the table add up to approximately the total number of newspaper articles in the corpus. The number is not exact due to the way the Lucene function “find top similar documents” computes similar documents, with discrepancies numbering in the teens.
References
Aggarwal, C.C., Zhai, C.: A survey of text classification algorithms. In: Aggarwal, C.C., Zhai, C. (eds.) Mining Text Data, pp. 163–222. Springer, Boston (2012)
Beck, E.M., Tolnay, S.: The killing fields of the deep South: the market for cotton and the lynching of blacks, 1882–1930. Am. Sociol. Rev. 55, 526–539 (1990)
Beck, E.M., Tolnay, S.E.: Confirmed inventory of southern lynch victims, 1882–1930. Data file available from authors (2004).
Benedetti, F., Beneventano, D., Bergamaschi, S., Simonini, G.: Computing inter document similarity with Context Semantic Analysis. Inf. Syst. 80, 136–147 (2019). https://doi.org/10.1016/j.is.2018.02.009
Białecki, A., Muir, R., Ingersoll, G.: Apache Lucene 4. In: SIGIR 2012 Workshop on Open Source Information Retrieval, August 16, 2012, Portland, OR, USA (2012)
Brundage, F.: Lynching in the New South: Georgia and Virginia, 1880–1930. University of Illinois Press, Urbana (1993)
Johansson, J., Borg, M., Runeson, P., Mäntylä, M.V.: A replicated study on duplicate detection: using Apache Lucene to search among android defects. In: Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 8. ACM (2014)
Brooker-Gross, S.R.: News wire services in the nineteenth-century United States. J. Hist. Geogr. 7(2), 167–179 (1981)
Cooper, J.W., Coden, A.R. Brown, E.W.: Detecting similar documents using salient terms. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, 245–251 (2002)
Edelmann, A., Wolff, T., Montagne, D., Bail, C.A.: Computational social science and sociology. Ann. Rev. Sociol. 46, 61–81 (2020)
Ericsson, K.A., Simon, H.A.: Protocol Analysis: Verbal Reports as Data, 2nd edn. MIT Press, Cambridge, MA (1996)
Evans, J.A., Aceves, P.: Machine translation: mining text for social theory. Ann. Rev. Sociol. 42, 21–50 (2016)
Fellbaum, C. (ed.): WordNet. An Electronic Lexical Database. MIT Press, Cambridge, MA (1998)
Forsyth, R.S., Sharoff, S.: Document dissimilarity within and across languages: a benchmarking study. Liter. Linguistic Comput 29(1), 6–22 (2014)
Franzosi, R.: Quantitative Narrative Analysis, vol. 162. Sage, Thousand Oaks, CA (2010)
Franzosi, R., De Fazio, G., Vicari, S.: Ways of measuring agency: an application of quantitative narrative analysis to lynchings in Georgia (1875–1930). Sociol. Methodol. 42(1), 1–42 (2012)
Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. IJcAI 7, 1606–1611 (2007)
Gambhir, M., Gupta, V.: Recent automatic text summarization techniques: a survey. Artif. Intell. Rev. 47, 1–66 (2017)
Grimm, J., Grimm, W.: The Original Folk and Fairy Tales of the Brothers Grimm: The Complete First Edition [Kinder- und Hausmärchen, 1812/1857]. Translated and edited by Jack Zipes. Princeton University Press, Princeton, NJ (2014)
Hutter, S.: Protest event analysis and its offspring. In: Donatella della Porta (ed.) Methodological Practices in Social Movement Research. Oxford: Oxford University Press, pp. 335–367 (2014)
Jacobs, J.: English fairy tales (Collected by Joseph Jacobs, Illustrated by John D. Batten). London: David Nutt (1890)
Klandermans, B., Staggenborg, S. (eds.): Methods of Social Movement Research. University of Minnesota Press, Minneapolis (2002)
Koopmans, R., Rucht, D.: Protest event analysis. In: Klandermans, Bert, Staggenborg, Suzanne (eds.) Methods of Social Movement Research, pp. 231–59. University of Minnesota Press, Minneapolis (2002)
Kowsari, K., Meimandi, K.J., Heidarysafa, M., Mendu, S., Barnes, L., Brown, D.: Text classification algorithms: a survey. Information 2019(10), 150 (2019)
Labov, W.: Language in the Inner City. University of Pennsylvania Press, Philadelphia (1972)
Lansdall-Welfare, T., Sudhahar, S., Thompson, J., Lewis, J., FindMyPast Newspaper Team, Cristianini, N.: Content analysis of 150 years of British periodicals. Proc. Natl. Acad. Sci. (PNAS), published online January 9, 2017, E457–E465 (2017)
Lansdall-Welfare, T., Cristianini, N.: History playground: a tool for discovering temporal trends in massive textual corpora. Digit. Scholar. Human. 35(2), 328–341 (2020)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR 163(4), 845–848 (1965) (Russian). English translation in Soviet Physics Doklady 10(8), 707–710 (1966)
Levin, B.: English Verb Classes and Alternations. The University of Chicago Press, Chicago (1993)
Lloret, E., Palomar, M.: Text summarisation in progress: a literature review. Artif. Intell. Rev. 37, 1–41 (2012)
MacEachren, A.M., Roth, R.E., O'Brien, J., Li, B., Swingley, D., and Gahegan, M.: Visual semiotics & uncertainty visualization: an empirical study. IEEE Transactions on Visualization and Computer Graphics, Vol. 18, No. 12, December 2012 (2012)
Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J. and McClosky, D.: The stanford CoreNLP natural language processing toolkit. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60 (2014)
McAdam, D., Su, Y.: The war at home: antiwar protests and congressional voting, 1965–1973. Am. Sociol. Rev. 67(5), 696–721 (2002)
McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action, Second Edition Covers Apache Lucene 3.0. Manning Publications Co, Greenwich, CT (2010)
Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to WordNet: an on-line lexical database. Int. J. Lexicogr. 3(4), 235–244 (1990)
Nenkova, A., McKeown, K.: A survey of text summarization techniques. In: Aggarwal, C.C., Cheng, X.Z. (eds.) Mining Text Data, pp. 43–76. Springer, Boston (2012)
Murchú, T.Ó., Lawless, S.: The problem of time and space: the difficulties in visualising spatiotemporal change in historical data. Proc. Dig. Human. 7(8), 12 (2014)
Panitch, L.: Corporatism: a growth industry reaches the monopoly stage. Can. J. Polit. Sci. 21(4), 813–818 (1988)
Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3(4), 333–389 (2009)
Singhal, A.: Modern information retrieval: a brief overview. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. 24(4), 35–43 (2001)
Stein, B., Lipka, N., Prettenhofer, P.: Plagiarism and authorship analysis. Lang. Resour. Eval. 45(1), 63–82 (2011)
Taylor, J.R.: Linguistic Categorization. Oxford University Press, Oxford (2004)
Tilly, C.: Popular Contention in Great Britain, 1758–1834. Harvard University Press, Cambridge, MA (1995)
Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Int. Res. 37(1), 141–188 (2010)
Zhang, H., Pan, J.: CASM: a deep-learning approach for identifying collective action events with text and image data from social media. Sociol. Methodol. 49(1), 1–57 (2019)
Zhang, Y., Li, J.L.: Research and improvement of search engine based on Lucene. Int. Conf. Intell. Human-Mach. Syst. Cybern. 2, 270–273 (2009)
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Franzosi, R., Dong, W. & Dong, Y. Qualitative and quantitative research in the humanities and social sciences: how natural language processing (NLP) can help. Qual Quant 56, 2751–2781 (2022). https://doi.org/10.1007/s11135-021-01235-2