Identification of Transliterated Foreign Words in Hebrew Script

  • Yoav Goldberg
  • Michael Elhadad
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4919)


We present a loosely supervised method for context-free identification of transliterated foreign names and borrowed words in Hebrew text. The method is purely statistical and does not require any lexicon or linguistic analysis tool for the source language (Hebrew, in our case). It also does not require any manually annotated training data: we learn from noisy data acquired by over-generation. We report precision/recall results of 80/82 for a corpus of 4044 unique words, containing 368 foreign words.
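
To make the reported figures concrete, the sketch below back-computes approximate confusion counts from the numbers in the abstract. It assumes that 80/82 denotes precision and recall in percent and that both are measured on the foreign-word class. The function name `implied_counts` and the rounding are illustrative only, and the resulting counts are estimates, not figures reported by the authors.

```python
def implied_counts(precision, recall, n_positive, n_total):
    """Estimate confusion counts implied by a precision/recall pair.

    Assumption: precision and recall are fractions in [0, 1] measured
    on the positive (here: foreign-word) class.
    """
    tp = recall * n_positive        # foreign words correctly identified
    flagged = tp / precision        # all words the classifier marked as foreign
    fp = flagged - tp               # native words wrongly marked as foreign
    fn = n_positive - tp            # foreign words that were missed
    tn = n_total - n_positive - fp  # native words correctly left unmarked
    return {"tp": round(tp), "fp": round(fp), "fn": round(fn), "tn": round(tn)}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures taken from the abstract; interpreting 80/82 as percentages is an assumption.
counts = implied_counts(precision=0.80, recall=0.82, n_positive=368, n_total=4044)
print(counts)
print(round(f1_score(0.80, 0.82), 2))
```

Under these assumptions, roughly 300 of the 368 foreign words are found, at the cost of a few dozen native Hebrew words being flagged incorrectly.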





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Yoav Goldberg (1)
  • Michael Elhadad (1)
  1. Computer Science Department, Ben Gurion University of the Negev, Be’er Sheva, Israel
