A Big Data Case Study in Digital Humanities

Creating a Performance Benchmark for Canonical Text Services
  • Gerhard Heyer
  • Jochen Tiepmar


Although the volume of primary data in the text-oriented humanities is small compared to the terabytes that are now standard in Big Data applications, the secondary data that result from scholarly annotation require a reference model for primary data that is based on a fine-grained hierarchical structure. This paper proposes a reusable performance benchmark for Canonical Text Services (CTS), a service for accessing and retrieving the text content and structural meta-information of hierarchically structured texts, and shows how the benchmark can be used to evaluate the technical performance of such a system.


Hierarchically structured text, CTS protocol, Performance evaluation



Part of this work was funded by the German Federal Ministry of Education and Research within the project ScaDS Dresden/Leipzig (BMBF 01IS14014B).



Copyright information

© Gesellschaft für Informatik e.V. and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. NLP Group, Leipzig University, Leipzig, Germany
  2. ScaDS, Leipzig University, Leipzig, Germany
