Language Resources and Evaluation, Volume 53, Issue 4, pp 807–835

Historical corpora meet the digital humanities: the Jerusalem Corpus of Emergent Modern Hebrew

  • Aynat Rubinstein
Original Paper


The paper describes the creation of the first open access multi-genre historical corpus of Emergent Modern Hebrew, made possible by implementation of digital humanities methods in the process of corpus curation, encoding, and dissemination. Corpus contents originate in the Ben-Yehuda Project, an open access repository of Hebrew literature online, and in digital images curated from the collections of the National Library of Israel, a selection of which have been transcribed through a dedicated crowdsourcing task that feeds back into the library’s online catalog. Texts in the corpus are encoded following best practices in the digital humanities, including markup of metadata that enables time-sensitive research, linguistic and other, of the corpus. Evaluation of morphological analysis based on Modern Hebrew language models is shown to distinguish between genres in the historical variety, highlighting the importance of ephemeral materials for linguistic research and for potential collaboration with libraries and cultural institutions in the process of corpus creation. We demonstrate the use of the corpus in diachronic linguistic research and suggest ways in which the association it provides between digital images and texts can be used to support automatic language processing and to enhance resources in the digital humanities.


Historical corpora, Language change, Ephemera, Digital humanities, Citizen science, Crowdsourcing, Hebrew



I wish to thank the three anonymous reviewers of this manuscript for their helpful comments. For invaluable discussion and feedback during all stages of the project, I am grateful to Sinai Rusinek. Thanks also to Meni Adler, Maayan Almagor, Yael Netzer, Avigail Tsirkin-Sadan, and Amir Zeldes. This research was supported by the Mandel Scholion Interdisciplinary Research Center in the Humanities and Jewish Studies at the Hebrew University of Jerusalem. I thank researchers at the Center for their support, especially Yael Reshef for enabling me to train research assistants of the “Emergence of Modern Hebrew” research group in the TEI format. Programming support by Itay Zandbank of The Research Software Company is also gratefully acknowledged.



Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. Hebrew University of Jerusalem, Jerusalem, Israel
