Choudhary N. (2011) Web-Drawn Corpus for Indian Languages: A Case of Hindi. In: Singh C., Singh Lehal G., Sengupta J., Sharma D.V., Goyal V. (eds) Information Systems for Indian Languages. Communications in Computer and Information Science, vol 139. Springer, Berlin, Heidelberg
Hindi text on the web has come of age since the advent of Unicode standards for Indic languages. Hindi content has been growing rapidly and is now easily accessible on the web. For linguists and Natural Language Processing practitioners, it could serve as a valuable corpus for study. This paper examines how good a corpus collected manually from the web can be. I begin with my observations on finding Hindi text and building a representative corpus from it. I then compare this corpus with a standard, manually crafted corpus and draw conclusions about what needs to be done to make such a web corpus more useful for linguistic studies.
Keywords: Web Corpus, Hindi Corpora, Hindi corpora for linguistic analysis