Abstract
The paper describes computational tools that can be of great help to both qualitative and quantitative scholars in the humanities and social sciences who deal with words as data. The Java and Python tools described provide computer-automated ways of performing useful tasks: 1. check the well-formedness of filenames; 2. find user-defined characters in English-language stories (e.g., social actors, i.e., individuals, groups, organizations; animals) (“find the character”) via WordNet; 3. aggregate words into higher-level aggregates (e.g., “talk,” “say,” “write” are all verbs of “communication”) (“find the ancestor”) via WordNet; 4. evaluate human-created summaries of events taken from multiple sources where key actors found in the sources may have been left out of the summaries (“find the missing character”) via the Stanford CoreNLP POS and NER annotators; 5. list the documents in an event cluster where names or locations present close similarities (“check the character’s name tag”) using Levenshtein edit distance and the Stanford CoreNLP NER annotator; 6. list documents categorized into the wrong event cluster (“find the intruder”) via the Stanford CoreNLP POS and NER annotators; 7. classify loose documents into most-likely event clusters (“find the character’s home”) via the Stanford CoreNLP POS and NER annotators or a date matcher; 8. find similarities between documents (“find the plagiarist”) using Lucene. These automatic data-checking tools can be applied to ongoing or completed projects to check data reliability. The NLP tools are designed with “a fourth grader” in mind, a user with no computer science background. Some five thousand newspaper articles from a project on racial violence (Georgia 1875–1935) are used to show how the tools work. But the tools have much wider applicability to a variety of problems of interest to both qualitative and quantitative scholars who deal with text as data.
Notes
See, for instance, Franzosi’s PC-ACE (Program for Computer-Assisted Coding of Events) at www.pc-ace.com (Franzosi 2010).
The GitHub site will automatically install not only all the NLP Suite scripts but also Python and Anaconda required to run the scripts. It also provides extensive help on how to download and install a handful of external software required by some of the algorithms (e.g., Stanford CoreNLP, WordNet). The goal is to make it as easy as possible for non-technical users to take advantage of the tools with minimal investment.
We rely on the Python package openpyxl and ad hoc functions.
The newspaper collections found in Chronicling America of the Library of Congress (http://chroniclingamerica.loc.gov/newspapers/), the Digital Library of Georgia (http://dlg.galileo.usg.edu/MediaTypes/Newspapers.html?Welcome), The Atlanta Constitution, Proquest, Readex.
Multiple cross-references are also possible, whereby a document deals with several different events.
Contrary to some protest event projects based on a single newspaper source (e.g., The New York Times in the “Dynamics of Collective Action, 1960–1995” project that involved several social scientists, notably, Doug McAdam, John McCarthy, Susan Olzak, Sarah Soule, and led to dozens of influential publications; see for all McAdam and Su 2002), the Georgia lynching project is based on multiple newspaper sources for each event.
The most up-to-date numbers of terms are given in https://wordnet.princeton.edu/documentation/wnstats7wn.
A common critique of WordNet is that it is better suited to concrete concepts than to abstract ones. It is much easier to create hyponym/hypernym relationships between “conifer” as a type of “tree”, a “tree” as a type of “plant”, and a “plant” as a type of “organism” than to classify emotions like “fear” or “happiness” into such relationships.
The WordNet database comprises both single words and combinations of two or more words that typically come together with a specific meaning (collocations, e.g., coming out, shut down, thumbs up, stand in line, customs duty). Over 80% of terms in the WordNet database are collocations, at least at the time of Miller et al.’s Introduction to WordNet manual (1993, p. 2). For the English language (but WordNet is available for some 200 languages) the database contains a very large set of terms. The most up-to-date term counts are given at https://wordnet.princeton.edu/documentation/wnstats7wn.
On the way up through the hierarchy, the script relies on the WordNet concepts of hypernym – the generic term used to designate a whole class of specific instances (Y is a hypernym of X if X is a (kind of) Y) – and holonym – the name of the whole of which the meronym names a part (Y is a holonym of X if X is a part of Y).
Collocations are sets of two or more words that usually occur together to convey a complete meaning, e.g., “coming out,” “sunny side up”. Over 80% of terms in the WordNet database are collocations, at least at the time of Miller et al.’s Introduction to WordNet manual (1993, p. 2). For the English language (but WordNet is available for some 200 languages) the database contains a very large set of terms. The most up-to-date numbers of terms in each category are given at https://wordnet.princeton.edu/documentation/wnstats7wn.
The 25 top noun synsets are: act, animal, artifact, attribute, body, cognition, communication, event, feeling, food, group, location, motive, object, person, phenomenon, plant, possession, process, quantity, relation, shape, state, substance, time.
The 15 top verb synsets are: body, change, cognition, communication, competition, consumption, contact, creation, emotion, motion, perception, possession, social, stative, weather.
Unfortunately, there is no easy way to aggregate at levels lower than the top synsets. WordNet is a linked graph where each node is a synset and synsets are interlinked by means of conceptual-semantic and lexical relations. In other words, it is not a simple tree structure: there is no way to tell at which level a synset is located. For example, the synset “anger” can be traced from the top-level synset “feeling” along the path feeling → emotion → anger. But it can also be traced from the top-level synset “state” along the path state → condition → physiological condition → arousal → emotional arousal → anger. In the first case, “anger” is at level 3 (assuming “feeling” and the other top synsets are level 1); in the second case, it is at level 6. Programmatically, giving users more freedom to control the level of aggregation makes it hard to build a user-friendly communication protocol. If the user wants to aggregate up to level 3 (two levels below the top synset), should “anger” be considered a level 3 synset? Since there is no clear definition of how far away a synset is from the root (the top synsets), our algorithm aggregates all the way up to the root.
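The depth ambiguity described above can be made concrete with a toy hypernym graph (the tiny hand-built dictionary below is an illustration, not WordNet itself; the real database can be queried the same way, e.g., via NLTK's hypernym_paths):

```python
# Toy hypernym graph illustrating why synset "level" is ambiguous:
# WordNet is a DAG, not a tree, so a synset can sit at different
# depths depending on which path to a top synset is followed.
HYPERNYMS = {
    "anger": ["emotion", "emotional arousal"],
    "emotion": ["feeling"],
    "emotional arousal": ["arousal"],
    "arousal": ["physiological condition"],
    "physiological condition": ["condition"],
    "condition": ["state"],
    "feeling": [],   # top synset (level 1)
    "state": [],     # top synset (level 1)
}

def paths_to_root(word):
    """Return every hypernym path from word up to a top synset."""
    parents = HYPERNYMS.get(word, [])
    if not parents:
        return [[word]]
    return [[word] + rest
            for parent in parents
            for rest in paths_to_root(parent)]

for path in paths_to_root("anger"):
    print(" -> ".join(reversed(path)), "| level:", len(path))
# "anger" comes out at level 3 via "feeling" and level 6 via "state",
# exactly the ambiguity that motivates aggregating all the way to the root.
```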
Suppose that you wish to aggregate the verbs in your corpus under the label “violence.” WordNet top synsets for verbs do not include “violence” as a class. Verbs of violence may be listed under body, contact, social. You could use the Zoom IN/DOWN widget of Figure 24 to get a list of verbs in these top synsets, then manually go through the list to select only the verbs of violence of interest. That would mean manually going through the list of 956 verbs in the body class (e.g., to find there the verb “attack,” among others), the 2515 verbs of contact (e.g., to find there the verb “wrestle”), and the 1688 verbs of social (e.g., to find there the verb “abuse”): in total, 5159 distinct verbs. A restricted domain, for example newspaper articles on lynching, may have many fewer distinct verbs, indeed 2027, extracted using the lemmas of the POS annotator for all the VB* tags. Whether using the WordNet dictionary (a better solution if the list of verbs of violence is to be used across different corpora) or the POS distinct verb tags, the dictionary list can then be used to annotate the documents in the corpus via the NLP Suite dictionary annotator GUI.
We use the word “compilation”, rather than “summary”, since, by and large, we maintained the original newspaper language (e.g., the word “negro”, rather than “African American”) and original story line, however contrived the story may have appeared to be.
https://stanfordnlp.github.io/CoreNLP/ Manning et al. (2014).
More specifically, for locations, the NER tags used are: City, State_or_Province, Country. Several other NER values are also recognized and tagged (e.g., Numbers, Percentages, Money, Religion), but they are irrelevant in this context.
The column “List of Documents for Type of Error” may be split in several columns depending upon the number of documents found in error.
The algorithm can process all or selected NER values, comparing the associated word values either within a single event subdirectory or across all subdirectories (or all the files listed in a directory, for that matter).
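The comparison of NER word values across documents relies on Levenshtein edit distance; a minimal sketch of the classic dynamic-programming algorithm follows (the person names are hypothetical, and this is an illustration, not the Suite's own implementation):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical NER person tags from two documents in the same event cluster:
print(levenshtein("John Redding", "John Reading"))  # prints 1
# A small distance flags a likely variant spelling of the same actor,
# which the algorithm lists for the user to inspect.
```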
We calculate the relativity index using cosine similarity (Singhal 2001). We take the two lists of NN, NNS, Location, Date, Person, and Organization terms from the jth document (L1) and from all other j−1 documents (L2) and compute the cosine similarity between them. We construct a vector from each list by mapping the word count onto each unique word; the relativity index is then the cosine similarity between the two vectors, where n is the count of total unique words. For instance, if L1 is {Alice: 2, doctor: 3, hospital: 1} and L2 is {Bob: 1, hospital: 2}, and we fix the order of all words as {Alice, doctor, hospital, Bob}, then the first vector (V1) is (2, 3, 1, 0), the second vector (V2) is (0, 0, 2, 1), and the length n of the vectors is 4. The relativity index is the dot product of the two vectors divided by the product of their norms. Documents with a relativity index significantly lower than the rest of the cluster are flagged as unlikely to belong to the cluster.
\({\text{relativity index}} = \frac{\sum_{i = 1}^{n} V1_{i}\, V2_{i}}{\sqrt{\sum_{i = 1}^{n} V1_{i}^{2}}\;\sqrt{\sum_{i = 1}^{n} V2_{i}^{2}}}\)
The relativity index ranges from 0 to 1, where 0 means two documents are totally different, and 1 means two documents have exactly the same list of NN, NNS, Location, Date, Person, and Organization.
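As a check on the arithmetic in the worked example above, a minimal sketch of the relativity index as cosine similarity over word-count dictionaries (the function name is ours, not the Suite's):

```python
from math import sqrt

def relativity_index(counts1, counts2):
    """Cosine similarity between two word-count dictionaries."""
    vocab = sorted(set(counts1) | set(counts2))
    v1 = [counts1.get(w, 0) for w in vocab]
    v2 = [counts2.get(w, 0) for w in vocab]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# The worked example from the text: V1 = (2, 3, 1, 0), V2 = (0, 0, 2, 1)
L1 = {"Alice": 2, "doctor": 3, "hospital": 1}
L2 = {"Bob": 1, "hospital": 2}
print(relativity_index(L1, L2))  # 2 / (sqrt(14) * sqrt(5)) ≈ 0.239
```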
The bar chart displays the distribution of the most frequent threshold index values as intervals, with most records in the 0.25–0.29 interval.
It should be noted that the use of the words plagiarism and plagiarist in this context should be taken with a grain of salt. First, the data do not tell us anything about who copied whom, but only that the two different newspapers shared content, wholly or in part; furthermore, the shared content may well have come from an unacknowledged wire service (on the development and spread of news wire services in the United States during the second half of the nineteenth century, see Brooker-Gross 1981; on computational tools for plagiarism and authorship attribution, see, for instance, Stein et al. 2011).
http://lucene.apache.org/core/downloads.html. For a summary of approaches to document similarities, see Forsyth and Sharoff (2014).
Other approaches are also available. After all, determining document similarity has been a major research area due to its wide application in information retrieval, document clustering, machine translation, etc. Existing approaches to determine document similarity can be grouped into two categories: knowledge-based similarity and content-based similarity (Benedetti et al., 2019).
Knowledge-based similarity approaches extract information from other sources to supplement the corpus, so as to draw on more document features for analysis. For example, Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch 2007) represents documents as high-dimensional vectors based on features extracted from both the original articles and Wikipedia articles. Document similarity is then calculated using a vector space comparison algorithm. Since our main focus in this work is to detect plagiarism among texts in the same corpus, knowledge-based similarity approaches are not very fruitful here.
Content-based similarity approaches use only the textual information contained in the documents. Popular techniques in this field are Vector Space Models (Turney and Pantel 2010) and probabilistic models such as Okapi BM-25 (Robertson and Zaragoza 2009). These methods all transform documents into some form of representation and then perform either a vector space comparison or a query search match on the constructed representations.
document_duplicates.txt.
Users can specify different spans of temporal aggregation (e.g., year, quarter/year, month/year).
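A minimal sketch of how such temporal-aggregation keys might be derived from document dates (the function name and key formats are our assumptions, not the Suite's):

```python
from datetime import date

def period_key(d, span="year"):
    """Map a date to an aggregation key for the chosen temporal span."""
    if span == "year":
        return f"{d.year}"
    if span == "quarter":
        return f"Q{(d.month - 1) // 3 + 1}-{d.year}"
    if span == "month":
        return f"{d.month:02d}-{d.year}"
    raise ValueError(f"unknown span: {span}")

# A date drawn from the newspaper-article example in the next note:
print(period_key(date(1912, 12, 11), "quarter"))  # prints Q4-1912
```

Counting documents grouped by such keys yields the aggregated time series.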
In this specific application, documents are newspapers where the document name refers to the name of the paper (e.g., The New York Times) and the document instance refers to a specific newspaper article (e.g., The New York Times_12-11-1912_1, referring to page 1 of The New York Times of December 11, 1912). But the document name could refer to an ethnographic interview, with the document instance referring to an interviewer’s ID (by name or number), an interview’s location, time, or interviewee (by name or ID number).
The numbers in each row of the table add up to approximately the total number of newspaper articles in the corpus. The number is not exact due to the way the Lucene function “find top similar documents” computes similar documents, with discrepancies numbering in the teens.
References
Aggarwal, C.C., Zhai, C.: A survey of text classification algorithms. In: Aggarwal, C.C., Zhai, C. (eds.) Mining Text Data, pp. 163–222. Springer, Boston (2012)
Beck, E.M., Tolnay, S.: The killing fields of the deep South: the market for cotton and the lynching of blacks, 1882–1930. Am. Sociol. Rev. 55, 526–539 (1990)
Beck, E.M., Tolnay, S.E.: Confirmed inventory of southern lynch victims, 1882–1930. Data file available from authors (2004).
Benedetti, F., Beneventano, D., Bergamaschi, S., Simonini, G.: Computing inter document similarity with Context Semantic Analysis. Inf. Syst. 80, 136–147 (2019). https://doi.org/10.1016/j.is.2018.02.009
Białecki, A., Muir, R., Ingersoll, G.: Apache Lucene 4. In: SIGIR 2012 Workshop on Open Source Information Retrieval, August 16, 2012, Portland, OR, USA (2012)
Brundage, F.: Lynching in the New South: Georgia and Virginia, 1880–1930. University of Illinois Press, Urbana (1993)
Johansson, J., Borg, M., Runeson, P., Mäntylä, M.V.: A replicated study on duplicate detection: using Apache Lucene to search among android defects. In: Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 8. ACM (2014)
Brooker-Gross, S.R.: News wire services in the nineteenth-century United States. J. Hist. Geogr. 7(2), 167–179 (1981)
Cooper, J.W., Coden, A.R. Brown, E.W.: Detecting similar documents using salient terms. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, 245–251 (2002)
Edelmann, A., Wolff, T., Montagne, D., Bail, C.A.: Computational social science and sociology. Ann. Rev. Sociol. 46, 61–81 (2020)
Ericsson, K.A., Simon, H.A.: Protocol Analysis: Verbal Reports as Data, 2nd edn. MIT Press, Cambridge, MA (1996)
Evans, J.A., Aceves, P.: Machine translation: mining text for social theory. Ann. Rev. Sociol. 42, 21–50 (2016)
Fellbaum, C. (ed.): WordNet. An Electronic Lexical Database. MIT Press, Cambridge, MA (1998)
Forsyth, R.S., Sharoff, S.: Document dissimilarity within and across languages: a benchmarking study. Liter. Linguistic Comput 29(1), 6–22 (2014)
Franzosi, R.: Quantitative Narrative Analysis, vol. 162. Sage, Thousand Oaks, CA (2010)
Franzosi, R., De Fazio, G., Vicari, S.: Ways of measuring agency: an application of quantitative narrative analysis to lynchings in Georgia (1875–1930). Sociol. Methodol. 42(1), 1–42 (2012)
Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. IJcAI 7, 1606–1611 (2007)
Gambhir, M., Gupta, V.: Recent automatic text summarization techniques: a survey. Artif. Intell. Rev. 47, 1–66 (2017)
Grimm, J., Grimm, W.: The Original Folk and Fairy Tales of the Brothers Grimm: The Complete First Edition [Kinder- und Hausmärchen, 1812/1857]. Translated and edited by Jack Zipes. Princeton University Press, Princeton, NJ (2014)
Hutter, S.: Protest event analysis and its offspring. In: Donatella della Porta (ed.) Methodological Practices in Social Movement Research. Oxford: Oxford University Press, pp. 335–367 (2014)
Jacobs, J.: English fairy tales (Collected by Joseph Jacobs, Illustrated by John D. Batten). London: David Nutt (1890)
Klandermans, B., Staggenborg, S. (eds.): Methods of Social Movement Research. University of Minnesota Press, Minneapolis (2002)
Koopmans, R., Rucht, D.: Protest event analysis. In: Klandermans, Bert, Staggenborg, Suzanne (eds.) Methods of Social Movement Research, pp. 231–59. University of Minnesota Press, Minneapolis (2002)
Kowsari, K., Meimandi, K.J., Heidarysafa, M., Mendu, S., Barnes, L., Brown, D.: Text classification algorithms: a survey. Information 2019(10), 150 (2019)
Labov, W.: Language in the Inner City. University of Pennsylvania Press, Philadelphia (1972)
Lansdall-Welfare, T., Sudhahar, S., Thompson, J., Lewis, J., FindMyPast Newspaper Team, Cristianini, N.: Content analysis of 150 years of British periodicals. Proc. Natl. Acad. Sci. (PNAS), published online January 9, 2017, E457–E465 (2017)
Lansdall-Welfare, T., Cristianini, N.: History playground: a tool for discovering temporal trends in massive textual corpora. Digit. Scholar. Human. 35(2), 328–341 (2020)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR 163(4), 845–848 (1965) (Russian). English translation in Soviet Physics Doklady 10(8), 707–710 (1966)
Levin, B.: English Verb Classes and Alternations. The University of Chicago Press, Chicago (1993)
Lloret, E., Palomar, M.: Text summarisation in progress: a literature review. Artif. Intell. Rev. 37, 1–41 (2012)
MacEachren, A.M., Roth, R.E., O'Brien, J., Li, B., Swingley, D., and Gahegan, M.: Visual semiotics & uncertainty visualization: an empirical study. IEEE Transactions on Visualization and Computer Graphics, Vol. 18, No. 12, December 2012 (2012)
Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J. and McClosky, D.: The stanford CoreNLP natural language processing toolkit. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60 (2014)
McAdam, D., Su, Y.: The war at home: antiwar protests and congressional voting, 1965–1973. Am. Sociol. Rev. 67(5), 696–721 (2002)
McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action, Second Edition Covers Apache Lucene 3.0. Manning Publications Co, Greenwich, CT (2010)
Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to WordNet: an on-line lexical database. Int. J. Lexicogr. 3(4), 235–244 (1990)
Nenkova, A., McKeown, K.: A survey of text summarization techniques. In: Aggarwal, C.C., Cheng, X.Z. (eds.) Mining Text Data, pp. 43–76. Springer, Boston (2012)
Murchú, T.Ó., Lawless, S.: The problem of time and space: the difficulties in visualising spatiotemporal change in historical data. Proc. Dig. Human. 7(8), 12 (2014)
Panitch, L.: Corporatism: a growth industry reaches the monopoly stage. Can. J. Polit. Sci. 21(4), 813–818 (1988)
Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3(4), 333–389 (2009)
Singhal, A.: Modern information retrieval: a brief overview. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. 24(4), 35–43 (2001)
Stein, B., Lipka, N., Prettenhofer, P.: Plagiarism and authorship analysis. Lang. Resour. Eval. 45(1), 63–82 (2011)
Taylor, J.R.: Linguistic Categorization. Oxford University Press, Oxford (2004)
Tilly, C.: Popular Contention in Great Britain, 1758–1834. Harvard University Press, Cambridge, MA (1995)
Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Int. Res. 37(1), 141–188 (2010)
Zhang, H., Pan, J.: CASM: a deep-learning approach for identifying collective action events with text and image data from social media. Sociol. Methodol. 49(1), 1–57 (2019)
Zhang, Y., Li, J.L.: Research and improvement of search engine based on Lucene. Int. Conf. Intell. Human-Mach. Syst. Cybern. 2, 270–273 (2009)
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Franzosi, R., Dong, W. & Dong, Y. Qualitative and quantitative research in the humanities and social sciences: how natural language processing (NLP) can help. Qual Quant 56, 2751–2781 (2022). https://doi.org/10.1007/s11135-021-01235-2