Web-Drawn Corpus for Indian Languages: A Case of Hindi

  • Narayan Choudhary
Part of the Communications in Computer and Information Science book series (CCIS, volume 139)


Hindi text on the web has come of age since the advent of Unicode standards for Indic languages. Hindi content has been growing by leaps and bounds and is now easily accessible on the web. For linguists and Natural Language Processing practitioners, this could serve as a valuable corpus for study. This paper assesses how good a corpus manually collected from the web can be. I start with my observations on finding Hindi text and creating a representative corpus out of it. I then compare this corpus with another standard, manually crafted corpus and draw conclusions about what needs to be done to such a web corpus to make it more useful for studies in linguistics.


Keywords: Web Corpus; Hindi Corpora; Hindi corpora for linguistic analysis




Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Narayan Choudhary (1)
  1. Jawaharlal Nehru University, India
