Abstract
Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, which complicates text re-use detection. It also increases the chances of redundancy in a Digital Library. In Natural Language Processing it is crucial to remove these redundancies before applying any kind of machine learning technique to the text. In the Humanities, these redundancies foreground textual criticism and allow scholars to identify lines of transmission. This chapter investigates two aspects of the historical text re-use detection process, based on seven English editions of the Holy Bible. First, we measure the performance of several techniques. For this purpose, when a verse (such as Genesis, Chapter 1, Verse 1) is present in two editions, one verse is always understood as a paraphrase of the other. It is worth noting that paraphrasing is considered a hyponym of text re-use. Depending on the intention with which the new version was created, verses tend to differ significantly in the wording, but not in the meaning. These differences are present in the form of replacements, corrections, varying writing styles, etc. Secondly, this chapter explains and evaluates a way of extracting paradigmatic relations. For historical languages, however, there is a lack of language resources (for example, WordNet), which makes non-literal text re-use and paraphrases much more difficult to identify. For this reason, we introduce both the aforementioned and other correlated steps as a method to identify text re-use, including language acquisition to detect changes that we call paradigmatic relations. The chapter concludes with the recommendation to move from a "single run" detection to an iterative process by using the acquired relations to run a new task.
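The verse-pairing setup described in the abstract can be illustrated with a minimal sketch. It is not the chapter's own method: it assumes a simple word-set Jaccard overlap as a stand-in for the techniques evaluated, and the names `tokens`, `jaccard`, `score_pairs`, `edition_a` and `edition_b` are hypothetical, as is the toy data (two familiar English renderings of Genesis 1:1, not necessarily among the seven editions used).

```python
import re


def tokens(text):
    """Lowercase a verse and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def jaccard(a, b):
    """Jaccard overlap of the word sets of two verses (0.0 to 1.0)."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty verses
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical toy data: verse ID -> verse text, one dict per edition.
edition_a = {
    "Gen 1:1": "In the beginning God created the heaven and the earth.",
}
edition_b = {
    "Gen 1:1": "In the beginning God created the heavens and the earth.",
}


def score_pairs(ed_a, ed_b):
    """Score every verse ID present in both editions.

    Verses sharing an ID are treated as paraphrases of each other,
    mirroring the assumption stated in the abstract.
    """
    shared_ids = ed_a.keys() & ed_b.keys()
    return {vid: jaccard(ed_a[vid], ed_b[vid]) for vid in shared_ids}


scores = score_pairs(edition_a, edition_b)
print(scores)
```

The two toy verses differ only in "heaven" versus "heavens", yet a purely literal measure already treats these as unrelated tokens. Variation of this kind is what the abstract's paradigmatic relations are meant to capture.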
Notes
- 2. Supported by the National Endowment for the Humanities (NEH), http://ww.neh.gov/.
Acknowledgements
This work has been made available by eTRACES (No. 01UA1101A) and the early career research group eTRAP (No. 01UG1409) of the German Ministry of Education and Research.
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Büchler, M., Burns, P.R., Müller, M., Franzini, E., Franzini, G. (2014). Towards a Historical Text Re-use Detection. In: Biemann, C., Mehler, A. (eds) Text Mining. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-12655-5_11
DOI: https://doi.org/10.1007/978-3-319-12655-5_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12654-8
Online ISBN: 978-3-319-12655-5
eBook Packages: Computer Science, Computer Science (R0)