Historical corpora meet the digital humanities: the Jerusalem Corpus of Emergent Modern Hebrew


The paper describes the creation of the first open access multi-genre historical corpus of Emergent Modern Hebrew, made possible by implementation of digital humanities methods in the process of corpus curation, encoding, and dissemination. Corpus contents originate in the Ben-Yehuda Project, an open access repository of Hebrew literature online, and in digital images curated from the collections of the National Library of Israel, a selection of which have been transcribed through a dedicated crowdsourcing task that feeds back into the library’s online catalog. Texts in the corpus are encoded following best practices in the digital humanities, including markup of metadata that enables time-sensitive research, linguistic and other, of the corpus. Evaluation of morphological analysis based on Modern Hebrew language models is shown to distinguish between genres in the historical variety, highlighting the importance of ephemeral materials for linguistic research and for potential collaboration with libraries and cultural institutions in the process of corpus creation. We demonstrate the use of the corpus in diachronic linguistic research and suggest ways in which the association it provides between digital images and texts can be used to support automatic language processing and to enhance resources in the digital humanities.

Fig. 1

(Source: National Library of Israel, “Time Travel” Ephemera Collection.) Corresponding TEI-XML markup for the closer of the document is shown on the right. Markup includes structural features, annotation of locations, and date normalization

Fig. 2

Source: http://benyehuda.org/; accessed September 6, 2016

Fig. 3


  1.

    A survey of this rich literature is beyond the scope of this paper. Entries in the Encyclopedia of Hebrew Language and Linguistics (EHLL; Khan, ed., 2013) provide a useful starting point for reviews of all topics related to the Hebrew language. See in particular Reshef (2013) and references cited therein.

  2.


  3.


  4.


  5.


  6.

    Dumps of the Project are archived periodically as of 2018, at https://github.com/projectbenyehuda.

  7.


  8.


  9.

    See Ariel (2015), Bar-Ziv Levy and Agranovsky (2015), and Reshef (2016).

  10.

    A small section is devoted specifically to children’s newspapers. A benefit of the automatic text recognition with OCR is that the corpus is very large compared to other historical corpora of newspapers before the twentieth century (e.g., the 1.6 million-word Zurich English Newspaper Corpus (ZEN), 1661–1791; Lehmann et al. 2006).

  11.


  12.

    For a recent overview and critical evaluation of the leading hypotheses in this area, the reader is referred to Doron (2015).

  13.

    We are grateful to Asaf Bartov for providing this snapshot of the corpus at the Ben-Yehuda Hackathon (THATCamp Haifa, University of Haifa; February 2014).

  14.

    See http://web.nli.org.il/sites/NLI/English/digitallibrary/time_journey/. The Time Travel project is a collaborative project of NLI and the University of California Los Angeles and is sponsored by the Arcadia Fund.

  15.


  16.


  17.

    http://hebrew-academy.org.il/%d7%94%d7%9e%d7%99%d7%9c%d7%95%d7%9f/%d7%a1%d7%a4%d7%a8%d7%95%d7%aa-%d7%94%d7%a2%d7%aa-%d7%94%d7%97%d7%93%d7%a9%d7%94/. The Modern Literature subcorpus focuses on texts from the mid-eighteenth century up to the establishment of the State of Israel in 1948.

  18.

    Currently, the collection includes over 139,000 freely browsable items as well as 13,483 items that can only be viewed on location at the library.

  19.

    One exception is the work of Grosse et al. (1987) on the evolution of Ruhrdeutsch (cited in Anderwald and Szmrecsanyi 2009, p. 1134). In the study of the emergence of Modern Hebrew, Reshef (2009: 144–148, 2012) and Reshef and Helman (2009) stress the importance of municipal correspondence and posters to the study of language development. These materials have not been included in an openly accessible corpus, however. Another notable resource that contains, in part, historical ephemeral documents in classical and medieval Hebrew is the Cairo Genizah (available in digital format at http://www.jewishmanuscripts.org/). See Morag (1998) for an early assessment of the role of this corpus in the linguistic study of Hebrew.

  20.

    For relevant discussion, see Harshav (1993), Reshef (2013), and Doron (2015).

  21.

    A possible source for this non-standard use may be found in Mishnaic Hebrew, which is known to show precisely this kind of exception to definiteness agreement (Azar 1995: 246). Rubin (2013) notes that the nouns involved are usually generic or collective (e.g., ‘water’, ‘camel’), like ‘sesame’ in (2b). However, note that the phrase in (2b) is preceded by the “correctly” inflected noun phrase ha-min ha-muvḥar ‘the select brand’ (lit. def-kind def-select), which also features a generic noun. The juxtaposition of standard and non-standard forms is well-attested in materials from this period (see, e.g., Reshef 2016, pp. 195, 199).

  22.

    See Claridge (2008), Xiao (2008), Piotrowski (2012) and Yáñez-Bouza (2015) for an overview of these and other resources. Basic information about English language corpora can be gleaned from the Corpus Resource Database (CoRD; http://www.helsinki.fi/varieng/CoRD/index.html). Piotrowski (2012) provides an overview of historical corpora in languages other than English, including Arabic, Chinese, Dutch, French, German, Nordic languages, Latin and Ancient Greek, and Portuguese (see his chapter 8).

  23.

    On the limitations of studying historical phonology from corpora, see Curzan (2009: 1097).

  24.

    See Schilling (2013) on sociolinguistic fieldwork and Anderwald and Szmrecsanyi (2009) for an overview of corpus-based dialectology.

  25.

    Osey he-Ḥayil: (literally: ‘Achievers of Success’) http://nlics.org, powered by PyBossa (http://pybossa.com/).

  26.

    The term crowdsourcing is due to Howe (2006, 2008).

  27.

    We thank Maayan Almagor, NLI’s former Community Manager, and Sinai Rusinek from Digital Humanities Israel for their collaboration on this project.

  28.

    Volunteers were recruited through announcements distributed among students and faculty at the Hebrew University of Jerusalem (including research assistants who were involved in transcription of other materials in the corpus), through announcements in local forums of digital humanities, and through the library’s social media outlets.

  29.

    See, among others, Bendavid (1971), Reshef (2009), Shatil (2007), and Wigderson (2015).

  30.

    On ways to increase motivation of participants in crowdsourcing see, e.g., Kaufmann et al. (2011), Zheng et al. (2011), Zhao and Zhu (2014), and Morschheuser et al. (2016).

  31.

    http://nlics.org/, transcription of ads for children’s plays (accessed November 7, 2017).

  32.

    Experiments run by participants of the Ben-Yehuda Hackathon (THATCamp at the University of Haifa, February 2014).

  33.

    Some works in Russian, Yiddish, German, English, and Italian, for example, are included in the Ben-Yehuda Project.

  34.

    The heuristic of estimating creation dates based on author lifespan is mentioned in other historical corpora, e.g., the Corpus of Modern Yiddish (CMY; http://web-corpora.net/YNC/search/index.php). There are various implementations of this heuristic that one could employ; we leave experimentation with their accuracy for future research.

  35.

    Verbal templates are characteristic of Hebrew’s Semitic non-concatenative morphology. Future releases of the corpus will include annotation of roots alongside templates for verbs.

  36.

    See Piotrowski (2012) for a more complete survey of NLP tools for historical corpora in additional languages.

  37.

    Adler (2007) reports an accuracy of 93.36% for part of speech tagging (and segmentation), and 90.05% for full morphological analysis. Full analysis in his system consisted of more features than those tested here. Presumably, this is the case also for Goldberg et al.’s (2008) system.

  38.

    See Adler (2007: 2) for an example of a seven-way ambiguity in the morphological analysis of the four-character word bclm.

  39.

    In contrast, a corpus of historical texts from just one year, e.g., the Brown Corpus (with texts collected in 1961), is not a diachronic corpus according to this definition (Claridge 2008: 243).

  40.

    See footnote 22 above for pointers to comprehensive surveys of existing diachronic corpora.

  41.


  42.


  43.


  44.


  45.

    See the ANNIS user guide for details: http://corpus-tools.org/annis/documentation.html.

  46.

    A variety of reasons have been given in the literature to explain the decline in productivity of pa`al, which still remains the most common template in the language: the template is already associated with many roots, it cannot accommodate quadriliteral roots due to its morphophonology, it is not associated with a uniform semantics, and more. See Bolozky (2009: 360) and references cited there for further discussion.

  47.

    I thank an anonymous reviewer for suggesting this avenue for research. Changes in the productivity of MH verbal templates have so far been described based on a range of quantitative studies, but without systematically reporting the statistical significance of the findings (see, e.g., Bolozky 2009).

  48.

    Raw counts are as follows (for the 1840s and the 1970s): 6 and 966 in hitpa`el, 28 and 2299 in pi`el, 1 and 228 in pu`al, 13 and 1170 in nif`al.

  49.

    I thank Sinai Rusinek for discussion of these issues.

  50.

    Search-Yehuda working group (the group included the author, Livnat Herzig Sheinfux, Nadav Bin Nun, Nurit Melnik, Shira Wigderson, Sinai Rusinek, Tal Baumel, Toma Tasovac, and Tsvi Sadan).

  51.





I wish to thank the three anonymous reviewers of this manuscript for their helpful comments. For invaluable discussion and feedback during all stages of the project, I am grateful to Sinai Rusinek. Thanks also to Meni Adler, Maayan Almagor, Yael Netzer, Avigail Tsirkin-Sadan, and Amir Zeldes. This research was supported by the Mandel Scholion Interdisciplinary Research Center in the Humanities and Jewish Studies at the Hebrew University of Jerusalem. I thank researchers at the Center for their support, especially Yael Reshef for enabling me to train research assistants of the “Emergence of Modern Hebrew” research group in the TEI format. Programming support by Itay Zandbank of The Research Software Company (https://www.chelem.co.il) is also gratefully acknowledged.

Author information



Corresponding author

Correspondence to Aynat Rubinstein.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Corpus size estimates


Historical Jewish Press (JPress)

The JPress corpus is not freely available for search. In order to provide an estimation of the size of the corpus, I relied on the number of scanned newspaper pages from the years 1856–1970, as reported by the JPress team for the version of the corpus from August 2016 (Sinai Rusinek and Eyal Miller, p.c.). An estimation of token counts was done manually, by counting the number of tokens (base words, prepositions, and punctuation marks) in one column of one randomly chosen newspaper page in the corpus (Ha-zman, volume 78, April 20, 1914; page 2).

  • Tokens in column: 777

  • Tokens in page [estimate]: 4662 (six columns)

  • Total pages (1856–1970): 277,165

  • Total tokens (1856–1970) [estimate]: 1,292,143,230
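
The extrapolation above can be reproduced in a few lines. The following is a minimal sketch: the function and variable names are illustrative and not part of the original study, and it assumes, as the estimate itself does, that every page has six columns with the same token count as the sampled column.

```python
def tokens_per_page(tokens_in_column: int, columns_per_page: int) -> int:
    """Extrapolate from one manually counted column to a full page."""
    return tokens_in_column * columns_per_page


def total_tokens(page_tokens: int, total_pages: int) -> int:
    """Extrapolate from one page to the whole collection of scanned pages."""
    return page_tokens * total_pages


# Figures reported in the appendix for JPress (1856-1970)
TOKENS_IN_COLUMN = 777      # counted manually in one column of Ha-zman
COLUMNS_PER_PAGE = 6        # assumed uniform across the corpus
TOTAL_PAGES = 277_165       # scanned pages reported by the JPress team

page_estimate = tokens_per_page(TOKENS_IN_COLUMN, COLUMNS_PER_PAGE)
corpus_estimate = total_tokens(page_estimate, TOTAL_PAGES)

print(f"Tokens per page (estimate): {page_estimate:,}")
print(f"Total tokens (estimate): {corpus_estimate:,}")
```

Because the estimate rests on a single sampled column, it is best read as an order-of-magnitude figure; sampling more pages across decades would be one way to tighten it.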

Bar Ilan Responsa project

The corpus is searchable through a proprietary interface, in which it is possible to see word counts for individual texts and subcorpora. However, since works in the corpus are distributed without date metadata, it is not possible to search for those that are from a particular time period. To achieve an estimate of the size of the corpus for the period of interest to us, we located relevant subcorpora and estimated word counts for each of them. The following calculations are based on version 2.1 of the Bar Ilan Responsa Project.

Table 10

  • Historical corpora
  • Language change
  • Ephemera
  • Digital humanities
  • Citizen science
  • Crowdsourcing
  • Hebrew