Towards a Historical Text Re-use Detection

Abstract

Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, which complicates text re-use detection. It also increases the chances of redundancy in a Digital Library. In Natural Language Processing it is crucial to remove these redundancies before applying any kind of machine learning technique to the text. In the Humanities, these redundancies foreground textual criticism and allow scholars to identify lines of transmission. This chapter investigates two aspects of the historical text re-use detection process, based on seven English editions of the Holy Bible. First, we measure the performance of several techniques. For this purpose, when a verse (such as Genesis, Chapter 1, Verse 1) is present in two editions, one verse is always understood as a paraphrase of the other. It is worth noting that paraphrasing is considered a hyponym of text re-use. Depending on the intention with which the new version was created, verses tend to differ significantly in the wording, but not in the meaning. These differences take the form of replacements, corrections, varying writing styles, etc. Second, this chapter explains and evaluates a way of extracting paradigmatic relations. For historical languages, however, there is a lack of language resources (for example, WordNet), which makes non-literal text re-use and paraphrases much more difficult to identify. For this reason, we introduce both the aforementioned and other correlated steps as a method to identify text re-use, including language acquisition to detect changes that we call paradigmatic relations. The chapter concludes with the recommendation to move from a "single run" detection to an iterative process by using the acquired relations to run a new task.
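
The abstract does not say which detection techniques the chapter evaluates, so the following is only a minimal sketch of the general idea, under assumptions of our own. It treats two wordings of Genesis 1:1 as a paraphrase pair, scores them with Jaccard word overlap (a stand-in measure), and uses Python's `difflib.SequenceMatcher` to list one-for-one word substitutions as candidate paradigmatic relations. The helper names `tokenize`, `jaccard` and `replacement_candidates` are hypothetical, and the two sample verses are illustrative wordings, not necessarily taken from the seven editions studied.

```python
from difflib import SequenceMatcher

PUNCTUATION = ".,;:!?"


def tokenize(verse):
    """Lowercase a verse and split it into words, dropping edge punctuation."""
    words = (w.strip(PUNCTUATION) for w in verse.lower().split())
    return [w for w in words if w]


def jaccard(tokens_a, tokens_b):
    """Jaccard overlap of the two word sets (1.0 means identical vocabulary)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def replacement_candidates(tokens_a, tokens_b):
    """Return (old, new) pairs where exactly one word replaced another."""
    pairs = []
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only one-for-one substitutions; longer rewrites are ambiguous.
        if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            pairs.append((tokens_a[i1], tokens_b[j1]))
    return pairs


# Two illustrative wordings of Genesis 1:1 (assumed sample data).
verse_a = tokenize("In the beginning God created the heaven and the earth.")
verse_b = tokenize("In the beginning God created the heavens and the earth.")

similarity = jaccard(verse_a, verse_b)
candidates = replacement_candidates(verse_a, verse_b)
print(f"similarity={similarity:.3f}, candidates={candidates}")
```

In an iterative setup of the kind the abstract recommends, pairs collected this way could be fed back as equivalences before the similarity step is run again; how the chapter itself does this is not stated here.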

Acknowledgements

This work has been made possible by eTRACES (No. 01UA1101A) and the early career research group eTRAP (No. 01UG1409) of the German Ministry of Education and Research.

Corresponding author

Correspondence to Marco Büchler.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Büchler, M., Burns, P.R., Müller, M., Franzini, E., Franzini, G. (2014). Towards a Historical Text Re-use Detection. In: Biemann, C., Mehler, A. (eds) Text Mining. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-12655-5_11

  • DOI: https://doi.org/10.1007/978-3-319-12655-5_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12654-8

  • Online ISBN: 978-3-319-12655-5

  • eBook Packages: Computer Science, Computer Science (R0)
