Behavior Research Methods, Volume 50, Issue 2, pp 466–489

AlignTool: The automatic temporal alignment of spoken utterances in German, Dutch, and British English for psycholinguistic purposes

  • Lars Schillingmann
  • Jessica Ernst
  • Verena Keite
  • Britta Wrede
  • Antje S. Meyer
  • Eva Belke


In language production research, the latency with which speakers produce a spoken response to a stimulus and the onset and offset times of words in longer utterances are key dependent variables. Measuring these variables automatically often yields partially incorrect results, yet exact measurement through visual inspection of the recordings is extremely time-consuming. We present AlignTool, an open-source alignment tool that first establishes preliminary onset and offset times of words and phonemes in spoken utterances using Praat, and then performs a forced alignment of the spoken utterances and their orthographic transcriptions in the automatic speech recognition system MAUS. AlignTool creates a Praat TextGrid file for inspection and, if necessary, manual correction by the user. We evaluated AlignTool's performance with recordings of single-word and four-word utterances as well as semi-spontaneous speech. AlignTool performs well with audio signals with an excellent signal-to-noise ratio, requiring virtually no corrections. For lower-quality audio signals, AlignTool remains highly functional, but its results may require more frequent manual corrections. We also found that recordings containing long silent intervals posed greater difficulties for AlignTool than recordings densely filled with speech, which it analyzed well overall. We expect that by semi-automating the temporal analysis of complex utterances, AlignTool will open new avenues in language production research.
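AlignTool's output format, the Praat TextGrid, is a plain-text file listing labeled time intervals on one or more tiers, which is what makes the manual inspection and correction step described above practical. The sketch below, with hypothetical word labels and boundary times (not taken from the article's materials), shows how a word-level alignment can be serialized into Praat's documented "long" TextGrid text format:

```python
# Minimal sketch of writing a word tier to a Praat TextGrid ("long" text
# format). The utterance, labels, and times below are hypothetical examples;
# the file layout follows Praat's documented TextGrid format.

def make_textgrid(intervals, tier_name="words"):
    """Serialize (xmin, xmax, label) intervals into a TextGrid string.

    An empty label conventionally marks silence, as in a pre-speech pause.
    """
    xmin = intervals[0][0]
    xmax = intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        '',
        f'xmin = {xmin}',
        f'xmax = {xmax}',
        'tiers? <exists>',
        'size = 1',
        'item []:',
        '    item [1]:',
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f'        xmin = {xmin}',
        f'        xmax = {xmax}',
        f'        intervals: size = {len(intervals)}',
    ]
    for i, (start, end, label) in enumerate(intervals, start=1):
        lines += [
            f'        intervals [{i}]:',
            f'            xmin = {start}',
            f'            xmax = {end}',
            f'            text = "{label}"',
        ]
    return '\n'.join(lines) + '\n'

# Hypothetical alignment of a two-word utterance with leading silence:
grid = make_textgrid([(0.0, 0.35, ""), (0.35, 0.82, "red"), (0.82, 1.4, "chair")])
```

A file like this can be opened in Praat alongside the audio, so boundaries proposed by the automatic alignment can be dragged into place by hand where the aligner erred.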


Keywords: Language production · Time course · Voice-key · Automatic alignment



Copyright information

© Psychonomic Society, Inc. 2018

Authors and Affiliations

  • Lars Schillingmann (1)
  • Jessica Ernst (2)
  • Verena Keite (2)
  • Britta Wrede (1)
  • Antje S. Meyer (3, 4)
  • Eva Belke (2)

  1. Technische Fakultät, Universität Bielefeld, Bielefeld, Germany
  2. Sprachwissenschaftliches Institut, Ruhr-Universität Bochum, Bochum, Germany
  3. Max-Planck-Institut für Psycholinguistik, Nijmegen, The Netherlands
  4. Radboud University, Nijmegen, The Netherlands
