Increasing Recall for Text Re-use in Historical Documents to Support Research in the Humanities

  • Marco Büchler
  • Gregory Crane
  • Maria Moritz
  • Alison Babeu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7489)


High-precision text re-use detection allows humanists to discover where and how particular authors are quoted (e.g., the different sections of Plato’s work that come in and out of vogue). This paper reports ongoing work to provide the high-recall text re-use detection that humanists often demand. Using as our testbed an edition of one Greek work in which quotations and paraphrases from the Homeric epics are marked, we were able to achieve a recall of at least 94% while maintaining a precision of 73%. This particular study is part of a larger effort to detect text re-use across 15 million words of Greek and 10 million words of Latin available or under development as openly licensed TEI XML.
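To make the two reported measures concrete, the following is a minimal sketch of how precision and recall could be computed when detected re-use pairs are compared against a gold standard of marked quotations. The function name, the pair representation, and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
def precision_recall(detected, gold):
    """Return (precision, recall) of detected re-use pairs against gold pairs."""
    detected, gold = set(detected), set(gold)
    true_positives = len(detected & gold)
    # Precision: share of detected pairs that are marked in the gold standard.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of gold-standard pairs that were detected.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: (passage id in the quoting work, Homeric line reference).
gold = {("a1", "Il.1.1"), ("a2", "Od.9.27"), ("a3", "Il.6.146"), ("a4", "Od.1.1")}
detected = {("a1", "Il.1.1"), ("a2", "Od.9.27"), ("a3", "Il.6.146"), ("a5", "Il.2.204")}

p, r = precision_recall(detected, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this setting, raising recall means missing fewer of the marked quotations and paraphrases, usually at the cost of admitting more false candidates and thus lowering precision.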


Keywords: historical text re-use, hypertextuality, Homer, Athenaeus





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Marco Büchler (1)
  • Gregory Crane (2)
  • Maria Moritz (1)
  • Alison Babeu (2)

  1. Institute for Computer Science, Leipzig University, Germany
  2. Department of Classics, Tufts University, Boston, USA
