Building a Hierarchical Annotated Corpus of Urdu: The URDU.KON-TB Treebank

  • Qaiser Abbas
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7181)


This work aims to develop a representative treebank for the South Asian language Urdu. Urdu is a comparatively under-resourced language, and a reliable treebank should have a significant impact on the state of the art in Urdu language processing. For the URDU.KON-TB treebank described here, a POS tagset, a syntactic tagset and a functional tagset are proposed. The treebank is built on an existing corpus of 19 million words of Urdu. Part-of-speech (POS) tagging and annotation of a selected set of sentences from different sub-domains of this corpus are being carried out manually, and the work completed to date is presented here. The hierarchical annotation scheme we adopted combines a phrase structure (PS) with a hybrid dependency structure (HDS).
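As a rough illustration of what a hierarchical annotation with POS, syntactic and functional layers can look like in code, the sketch below models a tree node that carries a phrase or POS label, an optional functional tag, and (for leaves) a token. It is a minimal sketch, not the URDU.KON-TB format: the class name `Node`, the labels (`S`, `NP`, `VP`, `N`, `V`), the functional tags (`SUBJ`, `OBJ`) and the tokens (`w1`, `w2`, `w3`) are all hypothetical placeholders and are not taken from the treebank's tagsets.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a hierarchically annotated tree (illustrative only)."""
    label: str                        # phrase label or POS tag (placeholder values)
    function: Optional[str] = None    # functional tag, e.g. a grammatical role
    token: Optional[str] = None       # surface word; set only on leaf nodes
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return self.token is not None

    def leaves(self) -> List["Node"]:
        """Return the leaf nodes in left-to-right order."""
        if self.is_leaf():
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

    def bracketed(self) -> str:
        """Render the tree as a bracketed string, joining label and function with '-'."""
        tag = self.label if self.function is None else f"{self.label}-{self.function}"
        if self.is_leaf():
            return f"({tag} {self.token})"
        inner = " ".join(child.bracketed() for child in self.children)
        return f"({tag} {inner})"


# Hypothetical verb-final clause; labels, functions and tokens are placeholders.
tree = Node("S", children=[
    Node("NP", function="SUBJ", children=[Node("N", token="w1")]),
    Node("VP", children=[
        Node("NP", function="OBJ", children=[Node("N", token="w2")]),
        Node("V", token="w3"),
    ]),
])

print(tree.bracketed())
# (S (NP-SUBJ (N w1)) (VP (NP-OBJ (N w2)) (V w3)))
```

Keeping the functional tag in its own field, rather than fusing it into the label string, lets the phrase-structure layer and the functional layer be queried separately; this is a design choice of the sketch, not a claim about how the treebank stores its annotation.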


Keywords: Urdu, Treebank, POS, Phrase, Hybrid





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Qaiser Abbas, Department of Linguistics, University of Konstanz, Konstanz, Germany
