Development of Multi-lingual Spoken Corpora of Indian Languages

  • K. Samudravijaya
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4274)


This paper describes a recently initiated effort to collect and transcribe read as well as spontaneous speech data in four Indian languages. The completed preparatory work includes the design of phonetically rich sentences, a data acquisition setup for recording speech over a telephone channel, and a Wizard of Oz setup for acquiring speech from a spoken dialogue between a caller and a machine in the context of a remote information retrieval task. The care taken to collect speech data that is as close to real-world conditions as possible is described. The current status of the programme and the actions planned to achieve its goal are also reported.


Keywords: Automatic Speech Recognition, Speech Data, Speech Recognition System, Spontaneous Speech, Indian Language





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • K. Samudravijaya, Tata Institute of Fundamental Research, Mumbai, India
