Language resources for Hebrew

Language Resources and Evaluation

Abstract

We describe a suite of standards, resources and tools for computational encoding and processing of Modern Hebrew texts. These include an array of XML schemas for representing linguistic resources; a variety of text corpora, raw, automatically processed and manually annotated; lexical databases, including a broad-coverage monolingual lexicon, a bilingual dictionary and a WordNet; and morphological processors which can analyze, generate and disambiguate Hebrew word forms. The resources are developed under centralized supervision, so that they are compatible with each other. They are freely available and many of them have already been used for several applications, both academic and industrial.

Notes

  1. To facilitate readability we use a straightforward transliteration of Hebrew in this paper, where the characters (in Hebrew alphabetic order) are: abgdhwzxviklmnsypcqršt. In our resources, we use both a UTF-8 encoding of Hebrew and an ASCII transliteration, which differs from the above in two letters: ‘ ↔ y and š ↔ e.

  2. The undotted script is sometimes referred to as ktiv male “full script”, whereas the dotted script, without the diacritics, is called ktiv xaser “lacking script”. These terms are misleading, as any representation that does not depict the diacritics lacks many of the vowels.

  3. These are often called “entry” in similar projects.

  4. 13,475 of the 22,656 entries in the lexicon are dotted, and we continue to add dotted forms to the remaining entries.

  5. HAifa Morphological System for Analyzing Hebrew.

  6. An article also includes meta-data, such as its source, the author, the date of production, etc.

  7. http://www.haaretz.co.il/

  8. http://www.inn.co.il/

  9. http://www.themarker.com/

  10. http://www.knesset.gov.il/

  11. http://www.slamathil.allbiz.co.il/
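
The transliteration described in note 1 can be read as a simple character table. The following is a minimal sketch of that reading, not the authors' tool: the function and constant names are hypothetical, and mapping the five final letter forms to the same symbol as their regular forms is an assumption that the note does not state. Only the š ↔ e difference of the ASCII variant is modeled, since the letter listing as reproduced here already shows y.

```python
# Transliteration characters exactly as listed in note 1, in Hebrew alphabetic order.
NOTE_1_ORDER = "abgdhwzxviklmnsypcqršt"

# Code points of the final forms (final kaf, mem, nun, pe, tsadi).
# In Unicode each final form sits immediately before its regular form.
FINAL_FORMS = {0x05DA, 0x05DD, 0x05DF, 0x05E3, 0x05E5}


def build_table():
    """Build a Hebrew-to-Latin table from the 22 regular letters (alef..tav)."""
    regular = [cp for cp in range(0x05D0, 0x05EB) if cp not in FINAL_FORMS]
    table = {chr(cp): latin for cp, latin in zip(regular, NOTE_1_ORDER)}
    # Assumption (not stated in the note): a final form is written
    # with the same symbol as its regular form.
    for cp in FINAL_FORMS:
        table[chr(cp)] = table[chr(cp + 1)]
    return table


TABLE = build_table()


def transliterate(text, ascii_variant=False):
    """Replace Hebrew letters with the symbols of note 1; keep other characters."""
    result = "".join(TABLE.get(ch, ch) for ch in text)
    if ascii_variant:
        # The note says the ASCII scheme writes 'e' where the paper writes 'š'.
        result = result.replace("š", "e")
    return result


if __name__ == "__main__":
    word = "\u05e9\u05dc\u05d5\u05dd"  # shin, lamed, vav, final mem
    print(transliterate(word))
    print(transliterate(word, ascii_variant=True))
```

Characters outside the table (spaces, digits, punctuation) pass through unchanged, so the function can be applied to whole lines of undotted text.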

Acknowledgments

This work was funded by the Israeli Ministry of Science and Technology. Parts of this project were supported by the Israel Science Foundation (grant No. 137/06); by the Israel Internet Association; and by the Caesarea Rothschild Institute for Interdisciplinary Application of Computer Science at the University of Haifa. Several people were involved in this work, and we are extremely grateful to all of them: Meni Adler, Roy Bar-Haim, Dalia Bojan, Ido Dagan, Michael Elhadad, Nomi Guthmann, Adi Milea, Noam Ordan, Erel Segal, Danny Shacham, Shira Schwartz, Yoad Winter, and Shlomo Yona. We are grateful to the reviewers for their useful comments.

Corresponding author

Correspondence to Shuly Wintner.

About this article

Cite this article

Itai, A., Wintner, S. Language resources for Hebrew. Lang Resources & Evaluation 42, 75–98 (2008). https://doi.org/10.1007/s10579-007-9050-8

  • DOI: https://doi.org/10.1007/s10579-007-9050-8
