Development of Odia Language Corpus from Modern News Paper Texts: Some Problems and Issues

  • Bishwa Ranjan Das
  • Srikanta Patnaik
  • Niladri Sekhar Dash
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 309)


In this paper, we describe the strategies and methods we adopted to design and develop a digital Odia corpus of newspaper texts. We also identify the scope for its use in different domains of Odia language technology and applied linguistics. The corpus is built from sample news reports published by some major Odia newspapers from Bhubaneswar and neighboring places. We address several issues relating to text corpus design, development, and management: size of the corpus in terms of the number of sentences and words, coverage of domains and sub-domains of news texts, text representation, the question of nativity, determination of target users, selection of time span, selection of texts, amount of sample for each text type, method of data sampling, manner of data input, corpus sanitation, corpus file management, and the problem of copyright. The corpus is stored in machine-readable format so that the text can be processed easily and quickly. We expect the corpus to be of great help in examining the present texture of the language and in retrieving the linguistic data and information required for writing a modern grammar of Odia with close reference to its empirical identity, usage, and status. The electronic Odia corpus can also be used in various fields of research and development for Odia.


Corpus, Odia, Newspaper, Sentence, Word, Text representation, Time span, File management, Copyright



Copyright information

© Springer India 2015

Authors and Affiliations

  • Bishwa Ranjan Das (1)
  • Srikanta Patnaik (1)
  • Niladri Sekhar Dash (2)
  1. Department of Computer Science and Information Technology, Institute of Technical Education and Research, Siksha ‘O’ Anusandhan University, Khandagiri, India
  2. Linguistic Research Unit, Indian Statistical Institute, Baranagar, India