Towards a Historical Text Re-use Detection

Büchler, Marco; Burns, Philip R.; Müller, Martin; Franzini, Emily; Franzini, Greta

doi:10.1007/978-3-319-12655-5_11

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, which complicates text re-use detection. Furthermore, it increases the chances of redundancy in a Digital Library. In Natural Language Processing it is crucial to remove these redundancies before applying any kind of machine learning technique to the text. In the Humanities, these redundancies foreground textual criticism and allow scholars to identify lines of transmission. This chapter investigates two aspects of the historical text re-use detection process, based on seven English editions of the Holy Bible. First, we measure the performance of several techniques. For this purpose, when a verse (such as Genesis, Chapter 1, Verse 1) is present in two editions, one verse is always understood as a paraphrase of the other. It is worth noting that paraphrasing is considered a hyponym of text re-use. Depending on the intention with which the new version was created, verses tend to differ significantly in wording, but not in meaning. Secondly, this chapter explains and evaluates a way of extracting paradigmatic relations. For historical languages, however, there is a lack of language resources (for example, WordNet), which makes non-literal text re-use and paraphrases much more difficult to identify. These differences take the form of replacements, corrections, varying writing styles, etc. For this reason, we introduce both the aforementioned and other correlated steps as a method to identify text re-use, including language acquisition to detect changes that we call paradigmatic relations. The chapter concludes with the recommendation to move from a "single run" detection to an iterative process by using the acquired relations to run a new task.
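
The evaluation setup described in the abstract, in which verses sharing the same identifier across two editions are treated as paraphrases of each other, can be illustrated with a minimal sketch. It is not the chapter's method: the word-set Jaccard overlap and the `difflib`-based alignment are stand-in techniques chosen for brevity, and the function names (`tokens`, `jaccard`, `replacement_candidates`), the verse key format and the two sample wordings are assumptions made for illustration only.

```python
import re
from difflib import SequenceMatcher


def tokens(verse):
    """Lowercase a verse and split it into word tokens."""
    return re.findall(r"[a-z']+", verse.lower())


def jaccard(a, b):
    """Jaccard overlap of the two verses' word sets (1.0 means same vocabulary)."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def replacement_candidates(a, b):
    """Return (old, new) pairs where exactly one word was swapped for another.

    These single-word substitutions are only rough candidates for the kind of
    paradigmatic relation the abstract mentions.
    """
    ta, tb = tokens(a), tokens(b)
    matcher = SequenceMatcher(None, ta, tb, autojunk=False)
    pairs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            pairs.append((ta[i1], tb[j1]))
    return pairs


# Hypothetical toy data: two wordings of the same verse, keyed by verse ID.
edition_a = {"Gen 1:1": "In the beginning God created the heaven and the earth."}
edition_b = {"Gen 1:1": "In the beginning God created the heavens and the earth."}

# Verses with the same ID in both editions form the paraphrase pairs.
for verse_id in edition_a.keys() & edition_b.keys():
    a, b = edition_a[verse_id], edition_b[verse_id]
    print(verse_id, round(jaccard(a, b), 3), replacement_candidates(a, b))
```

Candidate pairs collected this way could, in principle, be fed back into a later detection run (for example by treating both words as equivalent), which is the spirit of the iterative process the abstract recommends; how the chapter actually does this is not shown here.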

Notes

1. http://www.diggingintodata.org/.
2. Supported by the National Endowment for the Humanities (NEH), http://ww.neh.gov/.
3. http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/.
4. www.paralleltext.info.
5. www.mysword.info.

Acknowledgements

This work has been made available by eTRACES (No. 01UA1101A) and the early career research group eTRAP (No. 01UG1409) of the German Ministry of Education and Research.

Author information

Authors and Affiliations

Göttingen Centre for Digital Humanities, Georg August University Göttingen, Papendiek 16, 37073, Göttingen, Germany
Marco Büchler
Academic and Research Technologies, Northwestern University, Evanston, IL, USA
Philip R. Burns
Department of English, Northwestern University, Evanston, IL, USA
Martin Müller
Department of Computer Science, Digital Humanities Chair, Augustusplatz 10/11, 04009, Leipzig, Germany
Emily Franzini & Greta Franzini

Corresponding author

Correspondence to Marco Büchler.

Editor information

Editors and Affiliations

Computer Science Department, Technische Universität Darmstadt FG Language Technology, Darmstadt, Germany
Chris Biemann
Computer Science Department, Goethe University WG Text Technology, Frankfurt am Main, Hessen, Germany
Alexander Mehler

About this chapter

Cite this chapter

Büchler, M., Burns, P.R., Müller, M., Franzini, E., Franzini, G. (2014). Towards a Historical Text Re-use Detection. In: Biemann, C., Mehler, A. (eds) Text Mining. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-12655-5_11

DOI: https://doi.org/10.1007/978-3-319-12655-5_11
Published: 13 December 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12654-8
Online ISBN: 978-3-319-12655-5
eBook Packages: Computer Science, Computer Science (R0)
