Multilevel Annotation in the Corpus for Parsing Russian Spontaneous Speech

  • Conference paper
Speech and Computer (SPECOM 2018)

Abstract

The paper describes PARS, a manually annotated corpus of spoken Russian built specifically for training parsing algorithms and extracting grammars from Russian spontaneous speech. The PARS corpus includes multiple annotation levels, ranging from signal-level boundaries of word forms and discourse units to syntactic structure representations that follow the Universal Dependencies standard. The presented results include a detailed description of the corpus structure, the principles of annotation, and the annotation levels.

Notes

  1. http://groups.inf.ed.ac.uk/switchboard/index.html

  2. http://spokencorpora.ru/

  3. http://universaldependencies.org/

  4. The models are freely available at the MANASLU8 repository: https://github.com/MANASLU8

  5. https://webanno.github.io/webanno/

  6. https://github.com/MANASLU8/PARS/demo

References

  1. Kibrik, A.A.: Stories about dreams. Corpus-based research of spoken Russian discourse [Rasskazi o snovideniyah. Korpusnoe issledovanie ustnogo russkogo diskursa] (2009). (in Russian)

  2. Blackmer, E.R., Mitton, J.L.: Theories of monitoring and the timing of repairs in spontaneous speech. Cognition 39(3), 173–194 (1991)

  3. Carlson, L., Marcu, D.: Discourse Tagging Reference Manual. ISI Technical report ISI-TR-545 54, 56 (2001)

  4. Dobrovoljc, K., Nivre, J.: The universal dependencies treebank of spoken Slovenian. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp. 1566–1573 (2016)

  5. Givón, T.: Topic Continuity in Discourse: A Quantitative Cross-language Study, vol. 3. John Benjamins Publishing (1983)

  6. Grimes, J.E.: The Thread of Discourse, vol. 207. Walter de Gruyter (1975)

  7. Heeman, P.A., Allen, J.F.: Speech repairs, intonational phrases, and discourse markers: modeling speakers’ utterances in spoken dialogue. Comput. Linguist. 25(4), 527–571 (1999)

  8. Hirschberg, J., Litman, D.: Empirical studies on the disambiguation of cue phrases. Comput. Linguist. 19(3), 501–530 (1993)

  9. Johnson, W.: Measurements of oral reading and speaking rate and disfluency of adult male and female stutterers and nonstutterers. J. Speech Hear. Disord. Monogr. Suppl. (1961)

  10. Kachkovskaia, T., Kocharov, D., Skrelin, P.A., Volskaya, N.B.: CoRuSS - a new prosodically annotated corpus of Russian spontaneous speech. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp. 1949–1954 (2016)

  11. Kovriguina, L., Shilin, I., Shipilo, A., Putintseva, A.: Russian tagging and dependency parsing models for Stanford CoreNLP natural language toolkit. In: Różewski, P., Lange, C. (eds.) KESW 2017. CCIS, vol. 786, pp. 101–111. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69548-8_8

  12. Levelt, W.J.: Monitoring and self-repair in speech. Cognition 14(1), 41–104 (1983)

  13. Longacre, R.E.: The Grammar of Discourse. Springer, New York (1983). https://doi.org/10.1007/978-1-4615-8018-8

  14. Maekawa, K., Koiso, H., Furui, S., Isahara, H.: Spontaneous speech corpus of Japanese. In: LREC (2000)

  15. de Marneffe, M.C., et al.: More constructions, more genres: extending Stanford dependencies. In: DepLing, pp. 187–196 (2013)

  16. de Marneffe, M.C., et al.: Universal Dependencies: A cross-linguistic typology, pp. 4585–4592 (2014)

  17. Miller, T.: Improved syntactic models for parsing speech with repairs. In: Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31–June 5, 2009, Boulder, Colorado, USA, pp. 656–664 (2009). http://www.aclweb.org/anthology/N09-1074

  18. Miller, T.A., Schuler, W.: A unified syntactic model for parsing fluent and disfluent speech. In: ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15–20 2008, Columbus, Ohio, USA, Short Papers. pp. 105–108 (2008). http://www.aclweb.org/anthology/P08-2027

  19. Nesterenko, I., Rauzy, S., Bertrand, R.: Prosody in a corpus of French spontaneous speech: perception, annotation and prosody-syntax interaction. In: Speech Prosody 2010-Fifth International Conference (2010)

  20. Polanyi, L.: A formal model of the structure of discourse. J. Pragmatics 12(5–6), 601–638 (1988)

  21. Sacks, H., Schegloff, E.A., Jefferson, G.: A simplest systematics for the organization of turn taking for conversation. In: Studies in the Organization of Conversational Interaction, pp. 7–55. Elsevier (1978)

  22. Sherstinova, T.: The structure of the ORD speech corpus of Russian everyday communication. In: Matoušek, V., Mautner, P. (eds.) TSD 2009. LNCS (LNAI), vol. 5729, pp. 258–265. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04208-9_37

  23. Shitaoka, K., Uchimoto, K., Kawahara, T., Isahara, H.: Dependency structure analysis and sentence boundary detection in spontaneous Japanese. In: Proceedings of the 20th International Conference on Computational Linguistics, p. 1107. Association for Computational Linguistics (2004)

  24. Shriberg, E.E.: Preliminaries to a theory of speech disfluencies. Ph.D. thesis, University of California, Berkeley (1994)

  25. Stepanova, S., Asinovskij, A., Bogdanova, N., Rusakova, M., Sherstinova, T.: Speech corpus of the Russian everyday communication “One Speaker’s Day”: basic conception and current [Zvukovoj korpus russkogo jazyka povsednevnogo obwenija “Odin rechevoj den’”: Koncepcija i sostojanie formirovanija]. Komp’iuternaia lingvistika i intellektual’nye tekhnologii. In: Proceedings of International Conference “Dialogue”, pp. 488–494 (2008)

Acknowledgments

This work was financially supported by the Russian Foundation for Basic Research (RFBR), Grant No. 16-36-60055.

Author information

Corresponding author

Correspondence to Liubov Kovriguina.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Kovriguina, L., Shilin, I., Putintseva, A., Shipilo, A. (2018). Multilevel Annotation in the Corpus for Parsing Russian Spontaneous Speech. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_33

  • DOI: https://doi.org/10.1007/978-3-319-99579-3_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99578-6

  • Online ISBN: 978-3-319-99579-3

  • eBook Packages: Computer Science, Computer Science (R0)