Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus

  • Septina Dian Larasati
  • Vladislav Kuboň
  • Daniel Zeman
Part of the Communications in Computer and Information Science book series (CCIS, volume 100)

Abstract

This paper describes MorphInd, a robust finite-state morphology tool for Indonesian that performs both morphological analysis and lemmatization of a given surface word form, making its output suitable for further language processing. Compared with an existing Indonesian morphological analyzer [1], MorphInd covers more of Indonesian derivational and inflectional morphology and uses a more detailed tagset. It outputs each analysis as a sequence of segmented morphemes together with their morphological tags. The tool is implemented with finite-state technology, adopting the two-level morphology approach as implemented in Foma. On a preliminary-stage Indonesian corpus it achieved 84.6% coverage; as initially expected, most of the failures are proper nouns and foreign words.
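To make the terms "segmented morphemes with tags" and "coverage" concrete, the following is a minimal Python sketch. It is not MorphInd: the actual tool is a Foma finite-state grammar, and the lexicon, affix lists, tag names, and output notation below are all assumptions chosen for illustration. Coverage is taken here to mean the share of tokens that receive at least one analysis, which is why out-of-lexicon items such as proper nouns lower the figure.

```python
# Toy illustration only: a tiny lexicon and a few affix rules. This is NOT
# MorphInd's Foma grammar, tagset, or output format; all names are hypothetical.

LEXICON = {"baca": "VERB", "buku": "NOUN", "makan": "VERB"}  # lemma -> tag
PREFIXES = ["di", "ber"]   # small subset of Indonesian prefixes
SUFFIXES = ["kan", "nya"]  # small subset of Indonesian suffixes/enclitics


def analyze(word):
    """Return 'prefix+lemma<TAG>+suffix' for a known word form, else None."""
    word = word.lower()
    for prefix in [""] + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in [""] + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[: len(rest) - len(suffix)] if suffix else rest
            if stem in LEXICON:
                parts = [prefix, f"{stem}<{LEXICON[stem]}>", suffix]
                return "+".join(p for p in parts if p)
    return None


def coverage(tokens):
    """Fraction of tokens that receive at least one analysis."""
    if not tokens:
        return 0.0
    analyzed = sum(1 for t in tokens if analyze(t) is not None)
    return analyzed / len(tokens)


if __name__ == "__main__":
    sample = ["dibaca", "bukunya", "Jakarta", "makan"]
    for token in sample:
        print(token, "->", analyze(token))
    print("coverage:", coverage(sample))
```

In this sketch the proper noun "Jakarta" is absent from the lexicon and gets no analysis, mirroring the kind of failure the abstract reports. A real two-level analyzer would also model phonological alternations at morpheme boundaries, which plain string stripping cannot do.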


References

  [1] Pisceldo, F., Mahendra, R., Manurung, R., Arka, I.W.: A Two-Level Morphological Analyser for Indonesian. In: Abstract Submitted to the Australasian Language Technology (ALTA) Workshop 2008, Tasmania (2008)
  [2] Siregar, N.: Pencarian Kata Berimbuhan pada Kamus Besar Bahasa Indonesia dengan menggunakan Algoritma Stemming. Undergraduate thesis, Faculty of Computer Science, University of Indonesia (1995)
  [3] Adriani, M., Jelita, A., Nazief, S.B., Tahaghoghi, M., Williams, H.: Stemming Indonesian: A Confix-Stripping Approach. ACM Transactions on Asian Language Information Processing 6(4) (2007)
  [4] Hartono, H.: Pengembangan Pengurai Morfologi untuk Bahasa Indonesia dengan Model Morfologi Dua Tingkat Berbasiskan PC-KIMMO. Undergraduate thesis, Faculty of Computer Science, University of Indonesia (2002)
  [5] Beesley, K.R., Karttunen, L.: Finite State Morphology. CSLI Publications, Palo Alto (2003)
  [6] Hulden, M.: Foma: a finite-state compiler and library. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations Session, Athens, Greece, pp. 29–32 (2009)
  [7] Pusat Bahasa: Kamus Besar Bahasa Indonesia Daring (2008), http://pusatbahasa.diknas.go.id/kbbi/ (last access: February 14, 2011)
  [8] Darma Putra, D., Arfan, A., Manurung, R.: Building an Indonesian Wordnet. In: The Second International MALINDO Workshop (2008), http://bahasa.cs.ui.ac.id/wordnet/ (last access: February 14, 2011)
  [9] Pisceldo, F., Manurung, R., Adriani, M.: Probabilistic Part-of-Speech Tagging for Bahasa Indonesia. In: The Third International MALINDO Workshop, Colocated Event ACL-IJCNLP 2009, Singapore, August 1 (2009)
  [10] Farizki Wicaksono, A., Purwarianti, A.: HMM Based Part-of-Speech Tagger for Bahasa Indonesia. In: The Fourth International MALINDO Workshop, Jakarta, Indonesia (2010)
  [11] Joice: Pengembangan lanjut pengurai struktur kalimat bahasa indonesia yang menggunakan constraint-based formalism. Undergraduate thesis, Faculty of Computer Science, University of Indonesia (2002)
  [12] Hari Gusmita, R., Manurung, R.: Some Initial Experiments with Indonesian Probabilistic Parsing. In: The Second International MALINDO Workshop (2008)
  [13] Dian Larasati, S., Manurung, R.: Towards a Semantic Analysis of Bahasa Indonesia for Question Answering. In: Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, PACLING (2007)
  [14] Mahendra, R., Dian Larasati, S., Manurung, R.: Extending an Indonesian Semantic Analysis-based Question Answering System with Linguistic and World Knowledge Axioms. In: Proceedings of the 22nd Pacific Asia Conference on Language, Information, and Computation (PACLIC 2008), pp. 262–271 (2008)
  [15] PAN Localization, http://www.panl10n.net/english/OutputsIndonesia2.htm (last access: February 14, 2011)
  [16] Prague Markup Language (PML), http://ufal.mff.cuni.cz/jazz/pml/index_en.html (last access: February 14, 2011)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Septina Dian Larasati (1)
  • Vladislav Kuboň (1)
  • Daniel Zeman (1)
  1. Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic
