Chapter

Text Mining

Part of the series Theory and Applications of Natural Language Processing pp 221-238

Towards a Historical Text Re-use Detection

  • Marco Büchler (Göttingen Centre for Digital Humanities, Georg August University Göttingen)
  • Philip R. Burns (Academic and Research Technologies, Northwestern University)
  • Martin Müller (Department of English, Northwestern University)
  • Emily Franzini (Department of Computer Science, Digital Humanities Chair)
  • Greta Franzini (Department of Computer Science, Digital Humanities Chair)

Abstract

Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, which complicates text re-use detection. It also increases the chances of redundancy in a Digital Library. In Natural Language Processing it is crucial to remove these redundancies before applying any kind of machine learning technique to the text. In the Humanities, these redundancies foreground textual criticism and allow scholars to identify lines of transmission. This chapter investigates two aspects of the historical text re-use detection process, based on seven English editions of the Holy Bible. First, we measure the performance of several techniques. For this purpose, when a verse (for example, Genesis, Chapter 1, Verse 1) is present in two editions, one verse is always understood as a paraphrase of the other. It is worth noting that paraphrasing is considered a hyponym of text re-use. Depending on the intention with which the new version was created, verses tend to differ significantly in the wording, but not in the meaning. These differences take the form of replacements, corrections, varying writing styles, etc. Secondly, this chapter explains and evaluates a way of extracting paradigmatic relations. For historical languages, however, the lack of language resources (for example, WordNet) makes non-literal text re-use and paraphrases much more difficult to identify. For this reason, we introduce both the aforementioned and other correlated steps as a method to identify text re-use, including language acquisition to detect changes that we call paradigmatic relations. The chapter concludes with the recommendation to move from a "single run" detection to an iterative process by using the acquired relations to run a new task.
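
The idea of treating the same verse in two editions as a paraphrase pair, and reading word replacements off that pair, can be illustrated with a minimal sketch. This is not the chapter's actual method: the function names, the Jaccard word-overlap score, and the use of Python's difflib alignment are all assumptions made for illustration, and the two verse strings are illustrative inputs rather than data from the study.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(verse):
    """Lowercase a verse and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", verse.lower())


def jaccard(verse_a, verse_b):
    """Word-set overlap between two verses (1.0 means identical vocabulary)."""
    set_a, set_b = set(tokenize(verse_a)), set(tokenize(verse_b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def substitution_candidates(verse_a, verse_b):
    """Return one-for-one word replacements between two aligned verses.

    Assumption: a single word replaced by a single word in otherwise
    matching context is a candidate paradigmatic relation.
    """
    tokens_a, tokens_b = tokenize(verse_a), tokenize(verse_b)
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == 1 and (j2 - j1) == 1:
            pairs.append((tokens_a[i1], tokens_b[j1]))
    return pairs


def collect_relations(aligned_verses):
    """Count candidate replacements over many (verse_a, verse_b) pairs."""
    counts = Counter()
    for verse_a, verse_b in aligned_verses:
        counts.update(substitution_candidates(verse_a, verse_b))
    return counts


# Illustrative wordings of the same verse (hypothetical input data).
edition_a = "In the beginning God created the heaven and the earth."
edition_b = "In the beginning God created the heavens and the earth."

score = jaccard(edition_a, edition_b)
relations = collect_relations([(edition_a, edition_b)])
```

In an iterative setting of the kind the abstract recommends, the counted pairs could be fed back as equivalences before a second detection run; how the chapter itself does this is not specified here.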