Detecting Commas in Slovak Legal Texts

  • Róbert Sabo
  • Štefan Beňuš
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8655)


This paper reports on initial experiments with automatic comma recovery in legal texts. To decide whether to insert a comma, we propose using the probability of the bigram of two words without a comma and the probability of the trigram of the same words with the comma. These probabilities are determined by a language model trained on sentences in which commas are labeled as separate words. In the training database, one sentence corresponds to one line. The thresholds on the bigram and trigram probabilities were determined experimentally to achieve the best balance of precision and recall. The advantage of the proposed method is its high precision (95%) at a relatively satisfactory recall (49%). Precision is particularly important for judges, who are the potential users of an ASR system with an automatic comma insertion function.
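The decision rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the count-based maximum-likelihood estimates stand in for a real language model, the function names (`train_counts`, `bigram_prob`, `trigram_prob`, `insert_commas`), the exact form of the rule, and the threshold values of 0.5 are all assumptions made for illustration, and the English toy corpus merely replaces the Slovak legal training data.

```python
from collections import Counter

COMMA = ","  # commas are treated as separate tokens, as in the abstract


def train_counts(lines):
    """Count n-grams; each line is one sentence with commas as separate tokens."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for line in lines:
        tokens = line.split()
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
        tri.update(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def bigram_prob(w1, w2, uni, bi):
    """Maximum-likelihood estimate of P(w2 | w1), i.e. the pair without a comma."""
    return bi[(w1, w2)] / uni[w1] if uni[w1] else 0.0


def trigram_prob(w1, w2, bi, tri):
    """Maximum-likelihood estimate of P(w2 | w1, ','), i.e. the pair with a comma."""
    context = bi[(w1, COMMA)]
    return tri[(w1, COMMA, w2)] / context if context else 0.0


def insert_commas(words, counts, t_bigram=0.5, t_trigram=0.5):
    """Insert a comma between neighbours when the with-comma trigram is likely
    and the no-comma bigram is unlikely (assumed rule, placeholder thresholds)."""
    uni, bi, tri = counts
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i + 1 < len(words):
            nxt = words[i + 1]
            p_without = bigram_prob(word, nxt, uni, bi)
            p_with = trigram_prob(word, nxt, bi, tri)
            if p_with >= t_trigram and p_without <= t_bigram:
                out.append(COMMA)
    return out


# Toy training data: one sentence per line, commas as separate words.
corpus = [
    "the court held , that the claim fails",
    "the court held , that the appeal fails",
    "the court dismissed the claim",
]
counts = train_counts(corpus)
print(" ".join(insert_commas("the court held that the claim fails".split(), counts)))
```

Raising `t_trigram` or lowering `t_bigram` makes the rule more conservative, which is the kind of trade-off that favours precision over recall.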


Keywords: automatic speech recognition, Slavic languages, judicial domain





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Róbert Sabo (1)
  • Štefan Beňuš (1, 2)
  1. Institute of Informatics of Slovak Academy of Sciences, Bratislava, Slovakia
  2. Constantine the Philosopher University in Nitra, Nitra, Slovakia
