A Corpus with Wavesurfer and TEI: Speech and Video in TEITOK

Part of the Lecture Notes in Computer Science book series (LNAI, volume 12848)


In this paper, we demonstrate how TEITOK provides a full online interface for speech and even video corpora. These corpora are fully searchable using the CQL query language and can contain all speech-related annotation, such as repetitions, gaps, and mispronunciations. The interface shows time-aligned annotations scrolling below the waveform, together with the video if there is one. Corpora are stored in the TEI/XML standard, with import and export functions for other established standards like ELAN, Praat, or Transcriber. It is even possible to annotate corpora directly in TEITOK.
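As a rough illustration of what time-aligned annotation stored in TEI/XML makes possible, the sketch below parses a small TEI-style fragment and looks up the utterance that is active at a given playback time, which is the kind of lookup a scrolling transcript needs. It is only a sketch under stated assumptions: `<u>` is the TEI utterance element, but the numeric `start`/`end` attributes in seconds, the `<del type="repetition">` markup, and the function names are invented for this example and are not taken from the paper or from TEITOK itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment. The <u> element is TEI's utterance element;
# the start/end values in seconds and the <del type="repetition"> markup are
# assumptions made for this sketch, not TEITOK's documented format.
SAMPLE = """
<text>
  <u who="A" start="0.00" end="1.80">good morning</u>
  <u who="B" start="1.80" end="3.25">good <del type="repetition">mor</del> morning</u>
  <u who="A" start="3.25" end="5.10">shall we start</u>
</text>
"""


def load_utterances(xml_text):
    """Return (start, end, speaker, text) tuples sorted by start time."""
    root = ET.fromstring(xml_text)
    utterances = []
    for u in root.iter("u"):
        # Join all nested text and normalise whitespace.
        text = " ".join("".join(u.itertext()).split())
        utterances.append(
            (float(u.get("start")), float(u.get("end")), u.get("who"), text)
        )
    return sorted(utterances)


def utterance_at(utterances, seconds):
    """Return (speaker, text) for the utterance covering `seconds`, or None."""
    for start, end, who, text in utterances:
        if start <= seconds < end:
            return who, text
    return None


if __name__ == "__main__":
    utts = load_utterances(SAMPLE)
    print(utterance_at(utts, 2.0))
```

A linear scan is enough for a short transcript; for long recordings, a binary search over the sorted start times would be the more natural choice.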


  • TEI
  • Spoken corpus
  • Multimedia corpora

Author information

Corresponding author

Correspondence to Maarten Janssen.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Janssen, M. (2021). A Corpus with Wavesurfer and TEI: Speech and Video in TEITOK. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83526-2

  • Online ISBN: 978-3-030-83527-9

  • eBook Packages: Computer Science, Computer Science (R0)