Historical corpora meet the digital humanities: the Jerusalem Corpus of Emergent Modern Hebrew


The paper describes the creation of the first open access multi-genre historical corpus of Emergent Modern Hebrew, made possible by the implementation of digital humanities methods in corpus curation, encoding, and dissemination. Corpus contents originate in the Ben-Yehuda Project, an open access online repository of Hebrew literature, and in digital images curated from the collections of the National Library of Israel, a selection of which have been transcribed through a dedicated crowdsourcing task that feeds back into the library’s online catalog. Texts in the corpus are encoded following best practices in the digital humanities, including markup of metadata that enables time-sensitive research on the corpus, linguistic and otherwise. Evaluation of morphological analysis based on Modern Hebrew language models is shown to distinguish between genres in the historical variety, highlighting the importance of ephemeral materials for linguistic research and for potential collaboration with libraries and cultural institutions in the process of corpus creation. We demonstrate the use of the corpus in diachronic linguistic research and suggest ways in which the association it provides between digital images and texts can be used to support automatic language processing and to enhance resources in the digital humanities.

Fig. 1

(Source: National Library of Israel, “Time Travel” Ephemera Collection.) Corresponding TEI-XML markup for the closer of the document is shown on the right. Markup includes structural features, annotation of locations, and date normalization
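The kind of markup the caption describes (a closer with an annotated location and a normalized date) can be read back programmatically. The sketch below is illustrative only: the sample XML is invented and is not the markup shown in Fig. 1, and the function name `extract_closer_metadata` is hypothetical. It assumes the standard TEI namespace and the TEI elements `closer`, `dateline`, `placeName`, and `date` with a `when` attribute.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample for illustration; NOT the markup shown in Fig. 1.
SAMPLE_CLOSER = """
<closer xmlns="http://www.tei-c.org/ns/1.0">
  <dateline>
    <placeName ref="#tel-aviv">Tel Aviv</placeName>,
    <date when="1925-03-01">1 March 1925</date>
  </dateline>
  <signed>The Committee</signed>
</closer>
"""


def extract_closer_metadata(xml_text):
    """Return place names and normalized (ISO) dates found in a TEI <closer>."""
    root = ET.fromstring(xml_text)
    places = [
        (place.text or "").strip()
        for place in root.iterfind(".//tei:placeName", TEI_NS)
    ]
    # The `when` attribute carries the normalized date; the element text
    # keeps the date as written in the source document.
    dates = [
        date.get("when")
        for date in root.iterfind(".//tei:date", TEI_NS)
        if date.get("when")
    ]
    return {"places": places, "dates": dates}


metadata = extract_closer_metadata(SAMPLE_CLOSER)
print(metadata)
```

Keeping the normalized value in an attribute, separate from the transcribed text, is what would allow documents to be sorted or filtered by date without altering the transcription.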

Fig. 2

Source: http://benyehuda.org/; accessed September 6, 2016

Fig. 3


  1. A survey of this rich literature is beyond the scope of this paper. Entries in the Encyclopedia of Hebrew Language and Linguistics (EHLL; Khan, ed., 2013) provide a useful starting point for reviews of all topics related to the Hebrew language. See in particular Reshef (2013) and references cited therein.

  2.

  3.

  4.

  5.

  6. Dumps of the Project are archived periodically as of 2018, at https://github.com/projectbenyehuda.

  7.

  8.

  9. See Ariel (2015), Bar-Ziv Levy and Agranovsky (2015), and Reshef (2016).

  10. A small section is devoted specifically to children’s newspapers. A benefit of the automatic text recognition with OCR is that the corpus is very large compared to other historical corpora of newspapers before the twentieth century (e.g., the 1.6 million-word Zurich English Newspaper Corpus (ZEN), 1661–1791; Lehmann et al. 2006).

  11.

  12. For a recent overview and critical evaluation of the leading hypotheses in this area, the reader is referred to Doron (2015).

  13. We are grateful to Asaf Bartov for providing this snapshot of the corpus at the Ben-Yehuda Hackathon (THATCamp Haifa, University of Haifa; February 2014).

  14. See http://web.nli.org.il/sites/NLI/English/digitallibrary/time_journey/. The Time Travel project is a collaborative project of NLI and the University of California, Los Angeles, and is sponsored by the Arcadia Fund.

  15.

  16.

  17. http://hebrew-academy.org.il/%d7%94%d7%9e%d7%99%d7%9c%d7%95%d7%9f/%d7%a1%d7%a4%d7%a8%d7%95%d7%aa-%d7%94%d7%a2%d7%aa-%d7%94%d7%97%d7%93%d7%a9%d7%94/. The Modern Literature subcorpus focuses on texts from the mid-eighteenth century up to the establishment of the State of Israel in 1948.

  18. Currently, the collection includes over 139,000 freely browsable items as well as 13,483 items that can only be viewed on location at the library.

  19. One exception is the work of Grosse et al. (1987) on the evolution of Ruhrdeutsch (cited in Anderwald and Szmrecsanyi 2009, p. 1134). In the study of the emergence of Modern Hebrew, Reshef (2009: 144–148, 2012) and Reshef and Helman (2009) stress the importance of municipal correspondence and posters to the study of language development. These materials have not been included in an openly accessible corpus, however. Another notable resource that contains, in part, historical ephemeral documents in classical and medieval Hebrew is the Cairo Genizah (available in digital format at http://www.jewishmanuscripts.org/). See Morag (1998) for an early assessment of the role of this corpus in the linguistic study of Hebrew.

  20. For relevant discussion, see Harshav (1993), Reshef (2013), and Doron (2015).

  21. A possible source for this non-standard use may be found in Mishnaic Hebrew, which is known to show precisely this kind of exception to definiteness agreement (Azar 1995: 246). Rubin (2013) notes that the nouns involved are usually generic or collective (e.g., ‘water’, ‘camel’), like ‘sesame’ in (2b). However, note that the phrase in (2b) is preceded by the “correctly” inflected noun phrase ha-min ha-muvḥar ‘the select brand’ (lit. def-kind def-select), which also features a generic noun. The juxtaposition of standard and non-standard forms is well-attested in materials from this period (see, e.g., Reshef 2016, pp. 195, 199).

  22. See Claridge (2008), Xiao (2008), Piotrowski (2012) and Yáñez-Bouza (2015) for an overview of these and other resources. Basic information about English language corpora can be gleaned from the Corpus Resource Database (CoRD; http://www.helsinki.fi/varieng/CoRD/index.html). Piotrowski (2012) provides an overview of historical corpora in languages other than English, including Arabic, Chinese, Dutch, French, German, Nordic languages, Latin and Ancient Greek, and Portuguese (see his chapter 8).

  23. On the limitations of studying historical phonology from corpora, see Curzan (2009: 1097).

  24. See Schilling (2013) on sociolinguistic fieldwork and Anderwald and Szmrecsanyi (2009) for an overview of corpus-based dialectology.

  25. Osey he-Ḥayil (literally ‘Achievers of Success’): http://nlics.org, powered by PyBossa (http://pybossa.com/).

  26. The term crowdsourcing is due to Howe (2006, 2008).

  27. We thank Maayan Almagor, NLI’s former Community Manager, and Sinai Rusinek from Digital Humanities Israel for their collaboration on this project.

  28. Volunteers were recruited through announcements distributed among students and faculty at the Hebrew University of Jerusalem (including research assistants who were involved in transcription of other materials in the corpus), through announcements in local forums of digital humanities, and through the library’s social media outlets.

  29. See, among others, Bendavid (1971), Reshef (2009), Shatil (2007), and Wigderson (2015).

  30. On ways to increase motivation of participants in crowdsourcing see, e.g., Kaufmann et al. (2011), Zheng et al. (2011), Zhao and Zhu (2014), and Morschheuser et al. (2016).

  31. http://nlics.org/, transcription of ads for children’s plays (accessed November 7, 2017).

  32. Experiments run by participants of the Ben-Yehuda Hackathon (THATCamp at the University of Haifa, February 2014).

  33. Some works in Russian, Yiddish, German, English, and Italian, for example, are included in the Ben-Yehuda Project.

  34. The heuristic of estimating creation dates based on author lifespan is mentioned in other historical corpora, e.g., the Corpus of Modern Yiddish (CMY; http://web-corpora.net/YNC/search/index.php). There are various implementations of this heuristic that one could employ; we leave experimentation with their accuracy for future research.

  35. Verbal templates are characteristic of Hebrew’s Semitic non-concatenative morphology. Future releases of the corpus will include annotation of roots alongside templates for verbs.

  36. See Piotrowski (2012) for a more complete survey of NLP tools for historical corpora in additional languages.

  37. Adler (2007) reports an accuracy of 93.36% for part of speech tagging (and segmentation), and 90.05% for full morphological analysis. Full analysis in his system consisted of more features than those tested here. Presumably, this is the case also for Goldberg et al.’s (2008) system.

  38. See Adler (2007: 2) for an example of a seven-way ambiguity in the morphological analysis of the four-character word bclm.

  39. In contrast, a corpus of historical texts from just one year, e.g., the Brown Corpus (with texts collected in 1961), is not a diachronic corpus according to this definition (Claridge 2008: 243).

  40. See footnote 22 above for pointers to comprehensive surveys of existing diachronic corpora.

  41.

  42.

  43.

  44.

  45. See the ANNIS user guide for details: http://corpus-tools.org/annis/documentation.html.

  46. A variety of reasons have been given in the literature to explain the decline in productivity of pa`al, which still remains the most common template in the language: the template is already associated with many roots, it cannot accommodate quadri-literal roots due to its morphophonology, it is not associated with a uniform semantics, and more. See Bolozky (2009: 360) and references cited there for further discussion.

  47. I thank an anonymous reviewer for suggesting this avenue for research. Changes in the productivity of MH verbal templates have so far been described based on a range of quantitative studies, but without systematically reporting the statistical significance of the findings (see, e.g., Bolozky 2009).

  48. Raw counts are as follows (for the 1840s and the 1970s): 6 and 966 in hitpa`el, 28 and 2299 in pi`el, 1 and 228 in pu`al, 13 and 1170 in nif`al.

  49. I thank Sinai Rusinek for discussion of these issues.

  50. Search-Yehuda working group (the group included the author, Livnat Herzig Sheinfux, Nadav Bin Nun, Nurit Melnik, Shira Wigderson, Sinai Rusinek, Tal Baumel, Toma Tasovac, and Tsvi Sadan).

  51.


  1. Adler, M. (2007). Hebrew morphological disambiguation: An unsupervised stochastic word-based approach. Ph.D. thesis, Ben-Gurion University of the Negev.

  2. Adler, M., & Elhadad, M. (2006). An unsupervised morpheme-based HMM for Hebrew morphological disambiguation. In Proceedings of COLING-ACL-06, Sydney, Australia.

  3. Ahmed, M. A. (2018). XML annotation of Hebrew elements in Judeo-Arabic texts. Journal of Jewish Languages, 6, 221–242.

  4. Anderwald, L., & Szmrecsanyi, B. (2009). Corpus linguistics and dialectology. In M. Kytö & A. Lüdeling (Eds.), Corpus linguistics: An international handbook (Vol. 2, pp. 1126–1140). Berlin: De Gruyter.

  5. Ariel, Ch. (2015). The expression of material constitution in Revival Hebrew. Journal of Jewish Languages 3(1–2), 231–244. (Reprinted in E. Doron (Ed.), Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics, Brill (vol. 84)).

  6. Azar, M. (1995). The syntax of Mishnaic Hebrew. Jerusalem, Haifa: The Academy of the Hebrew Language and University of Haifa Press. (in Hebrew).

  7. Bar-Ziv Levy, M., & Agranovsky, V. (2015). The evolution of the structure of free relative clauses in Modern Hebrew: Internal development and contact language influence. Journal of Jewish Languages 3(1–2): 259–270. (Reprinted in Doron, E. (Ed.), Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics, Brill (vol. 84)).

  8. Belinkov, Y., Magidow, A., Romanov, M., Shmidman, A. & Koppel, M. (2016). Shamela: A large-scale historical Arabic corpus. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH at Coling) 2016 (pp. 45–53).

  9. Bendavid, A. (1971). Biblical Hebrew and Mishnaic Hebrew. Tel Aviv: Dvir. (in Hebrew).

  10. Ben-Ḥayyim, Z. (1953). On the use of the phrase yeš l-. Lĕšonénu La‘am 4. (in Hebrew).

  11. Ben-Ḥayyim, Z. (1992). The struggle for a language. Jerusalem: The Academy of the Hebrew Language. (in Hebrew).

  12. Bolozky, S. (2009). Frequency and productivity in the verb system of Israeli Hebrew. Lĕšonénu, 71, 345–367. (in Hebrew).

  13. Boneh, N. (2013). Mood and modality: Modern Hebrew. In G. Khan (Ed.), Encyclopedia of Hebrew Language and Linguistics (Vol. 2, pp. 693–703). Leiden: Brill.

  14. Claridge, C. (2008). Historical corpora. In M. Kytö & A. Lüdeling (Eds.), Corpus linguistics: An international handbook (Vol. 1, pp. 242–259). Berlin: De Gruyter.

  15. Culpeper, J., & Kytö, M. (2010). Early Modern English dialogues: Spoken interaction as writing. Cambridge: Cambridge University Press.

  16. Curzan, A. (2009). Historical corpus linguistics and evidence of language change. In M. Kytö & A. Lüdeling (Eds.), Corpus linguistics: An international handbook (Vol. 2, pp. 1091–1109). Berlin: De Gruyter.

  17. Doron, E. (2015). Introduction: Language contact and the development of Modern Hebrew. Journal of Jewish Languages 3(1–2): 5–26. (Reprinted in E. Doron (Ed.), Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics, Brill (vol. 84)).

  18. Doron, E. (2016). Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics (Vol. 84). Leiden: Brill.

  19. Garcia Martinez, M., & Walton, B. (2014). The wisdom of crowds: The potential of online communities as a tool for data analysis. Technovation, 34, 203–214.

  20. Geyken, A. (2007). The DWDS corpus: A reference corpus for the German language of the 20th century. In C. Fellbaum (Ed.), Collocations and idioms: Linguistic, lexicographic, and computational aspects (pp. 23–41). London: Continuum Press.

  21. Goldberg, Y., Adler, M. & Elhadad, M. (2008). EM can find pretty good HMM POS-taggers (when given a good start). In Proceedings of ACL-08: HLT (pp. 746–754).

  22. Grosse, S., Grimberg, M., Hölscher, T., Karweick, J., & Kuntz, H. (1987). Sprachwandel und Sprachwachstum im Ruhrgebiet des 19. Jahrhunderts unter dem Einfluss der Industrialisierung. Zeitschrift für Dialektologie und Linguistik, 54(2), 202–221. (in German).

  23. HaCohen-Kerner, Y., Beck, H., Yehudai, E., & Mughaz, D. (2010). Stylistic feature sets as classifiers of documents according to their historical period and ethnic origin. Applied Artificial Intelligence, 24(9), 847–862.

  24. Hana, J., Feldman, A. & Aharodnik, K. (2011). A low-budget tagger for Old Czech. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (pp. 10–18).

  25. Harshav, B. (1993). Language in time of revolution. Berkeley: University of California Press.

  26. Howe, J. (2006). The rise of crowdsourcing. Wired 14(6). http://www.wired.com/2006/06/crowds/. Accessed 11 Aug 2017.

  27. Howe, J. (2008). Crowdsourcing: Why the power of the crowd is driving the future of business. New York City: Crown Business.

  28. Itai, A., & Wintner, S. (2008). Language resources for Hebrew. Language Resources and Evaluation, 42(1), 75–98.

  29. Kaufmann, N., Schulze, T., & Veit, D. (2011). More than fun and money. Worker motivation in crowdsourcing – A study on Mechanical Turk. In Proceedings of the 17th Americas Conference on Information Systems, AMCIS 2011 (pp. 1–11).

  30. Krause, T., & Zeldes, A. (2016). ANNIS3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities, 31(1), 118–139.

  31. Lehmann, H. M., auf dem Keller, C., & Ruef, B. (2006). ZEN Corpus 1.0. In R. Facchinetti & M. Rissanen (Eds.), Corpus-based Studies of Diachronic English (pp. 135–155). New York: Peter Lang.

  32. Liebeskind, C., Dagan, I., & Schler, J. (2016). Semiautomatic construction of cross-period thesaurus. Journal on Computing and Cultural Heritage, 9(4), 22.

  33. Lin, Y., Michel, J. B., Aiden, E. L., Orwant, J., Brockman, W. & Petrov, S. (2012). Syntactic annotations for the Google Books Ngram Corpus. In Proceedings of the ACL 2012 System Demonstrations, ACL ‘12 (pp. 169–174).

  34. Meurman-Solin, A. (1995). A new tool: The Helsinki Corpus of Older Scots (1450–1700). ICAME Journal, 19, 49–62.

  35. Michel, J. B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Brockman, W., et al. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331, 176–182.

  36. Morag, S. (1998). The contribution of the Geniza to the study of the Hebrew language. Jewish Studies, 38, 239–251. (in Hebrew).

  37. Morschheuser, B., Hamari, J., & Koivisto, J. (2016). Gamification in crowdsourcing: A review. In T. X. Bui & R. H. Sprague Jr. (Eds.), Proceedings of the 49th Hawaii International Conference on System Sciences (pp. 4375–4384).

  38. Mughaz, D., HaCohen-Kerner, Y., & Gabbay, D. (2017). Mining and using key-words and key-phrases to identify the era of an anonymous text. In N. Nguyen, R. Kowalczyk, A. Pinto, & J. Cardoso (Eds.), Transactions on computational collective intelligence XXVI. Lecture notes in computer science (Vol. 10190). Cham: Springer.

  39. Neuman, Y. (2013). The diphthong [eʸ] in Israeli Hebrew: Its origin and the factors conditioning its distribution. In R. Ben-Shahar & N. Ben-Ari (Eds.), Hebrew—A Living Language, volume VI. Tel-Aviv: The Porter Institute for Poetics & Semiotics Tel-Aviv University, HaKibbutz HaMeuchad. (in Hebrew).

  40. Piotrowski, M. (2012). Natural language processing for historical texts. San Rafael: Morgan & Claypool.

  41. Reshef, Y. (2009). Continuity vs. change in the emergence of Standard Modern Hebrew: The verbal system in the early Mandate period. In H. Cohen (Ed.), Modern Hebrew: Two hundred and fifty years (pp. 143–176). Jerusalem: The Academy of the Hebrew Language. (in Hebrew).

  42. Reshef, Y. (2012). Early Spoken Hebrew. In S. Izre’el (Ed.), The speech machine as a language teacher Hebrew Spoken Here: Hebrew voices from Nazi Germany: A testimony on spoken Hebrew and Jewish life in Palestine during the British Mandate (pp. 163–187). Tel Aviv: The Haim Rubin Tel Aviv University Press. (in Hebrew).

  43. Reshef, Y. (2013). Revival of Hebrew: Grammatical structure and lexicon. In G. Khan (Ed.), Encyclopedia of Hebrew Language and Linguistics (Vol. 3, pp. 397–405). Leiden: Brill.

  44. Reshef, Y. (2016). Written Hebrew of the revival generation as a distinct phase in the evolution of Modern Hebrew. Journal of Semitic Studies, 61(1), 187–213.

  45. Reshef, Y., & Helman, A. (2009). Instructing or recruiting? Language and style in 1920s and 1930s Tel Aviv municipal posters. Jewish Studies Quarterly, 16, 306–332.

  46. Rissanen, M. (2008). Corpus linguistics and historical linguistics. In M. Kytö & A. Lüdeling (Eds.), Corpus linguistics: An international handbook (Vol. 1, pp. 53–68). Berlin: De Gruyter.

  47. Rögnvaldsson, E., & Helgadóttir, S. (2011). Morphological tagging of Old Norse texts and its use in studying syntactic variation and change. In C. Sporleder, A. van den Bosch, & K. Zervanou (Eds.), Language Technology for Cultural Heritage: Selected papers from the LaTeCH workshop series (pp. 63–76). Berlin: Springer.

  48. Rubin, A. D. (2013). Definite Article: Pre-Modern Hebrew. In G. Khan (Ed.), Encyclopedia of Hebrew Language and Linguistics (Vol. 1, pp. 678–682). Leiden: Brill.

  49. Rubinstein, A. (forthcoming). Existential possessive modality in the emergence of Modern Hebrew. In E. Doron, M. Rappaport Hovav, Y. Reshef, M. Taube (Eds.), Linguistic contact, continuity, and change in the genesis of Modern Hebrew. Amsterdam: John Benjamins (to appear).

  50. Rubinstein, A., Sichel, I., & Tsirkin-Sadan, A. (2015). Superfluous negation in Modern Hebrew and its origins. Journal of Jewish Languages 3(1–2): 165–182. (Reprinted in Doron, E. (Ed.), Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics, Brill (vol. 84)).

  51. Rusinek, S. (2016). Kima: Towards an open digital historical Hebrew gazetteer. http://commons.pelagios.org/2016/07/kima-towards-an-open-digital-historical-hebrew-gazetteer/. Accessed 11 Aug 2017.

  52. Saxton, G. D., Oh, O., & Kishore, R. (2013). Rules of crowdsourcing: Models, issues, and systems of control. Information Systems Management, 30(1), 2–20. https://doi.org/10.1080/10580530.2013.739883.

  53. Schilling, N. (2013). Sociolinguistic fieldwork. Cambridge: Cambridge University Press.

  54. Schmied, J. (1994). The Lampeter Corpus of Early Modern English Tracts. In M. Kytö, M. Rissanen, & S. Wright (Eds.), Corpora across the centuries: Proceedings of the First International Colloquium on English Diachronic Corpora (pp. 81–89). Amsterdam: Rodopi.

  55. Shatil, N. (2007). The synchronic status of nitpa’el. Divrei ha-ḥug ha-yisr’eli lavalshanut, 16, 105–127. (in Hebrew).

  56. Shehadeh, H. (1991). Gilguley ha-bituy ‘yesh (lo) lilmod’/‘haya (lo) lilmod’. In M. H. Goshen-Gottstein, S. Morag, & S. Kogut (Eds.), Studies on Hebrew and other Semitic languages presented to Professor Chaim Rabin on the occasion of his seventy-fifth birthday (pp. 415–442). Jerusalem: Academon Press. (in Hebrew).

  57. Tsirkin-Sadan, A. (2015). Inheritance and Slavic contact in the polysemy of bixlal. Journal of Jewish Languages 3(1–2): 218–230. (Reprinted in Doron, E. (Ed.), Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics, Brill (vol. 84)).

  58. Wigderson, S. (2015). The sudden disappearance of Nitpael and the rise of Hitpael in Modern Hebrew, and the role of Yiddish in the process. Journal of Jewish Languages 3(1–2): 199–206. (Reprinted in Doron, E. (Ed.), Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics, Brill (vol. 84)).

  59. Xiao, R. (2008). Well-known and influential corpora. In M. Kytö & A. Lüdeling (Eds.), Corpus linguistics: An international handbook (Vol. 1, pp. 383–457). Berlin: De Gruyter.

  60. Yáñez-Bouza, N. (2015). ‘Have you ever written a diary or journal?’ Diurnial prose and register variation. Neuphilologische Mitteilungen, 116(2), 449–474.

  61. Zhao, Y., & Zhu, Q. (2014). Effects of extrinsic and intrinsic motivation on participation in crowdsourcing contest. Online Information Review, 38(7), 896–917.

  62. Zheng, H., Li, D., & Hou, W. (2011). Task design, motivation, and participation in crowdsourcing contests. International Journal of Electronic Commerce, 15(4), 57–88.

  63. Zipser, F. & Romary, L. (2010). A model oriented approach to the mapping of annotation formats using standards. In Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta. URL: http://hal.archives-ouvertes.fr/inria-00527799/en/.

  64. Zohar, H., Liebeskind, C., Schler, J., & Dagan, I. (2013). Automatic thesaurus construction for cross generation corpus. Journal on Computing and Cultural Heritage, 6(1), 4.


I wish to thank the three anonymous reviewers of this manuscript for their helpful comments. For invaluable discussion and feedback during all stages of the project, I am grateful to Sinai Rusinek. Thanks also to Meni Adler, Maayan Almagor, Yael Netzer, Avigail Tsirkin-Sadan, and Amir Zeldes. This research was supported by the Mandel Scholion Interdisciplinary Research Center in the Humanities and Jewish Studies at the Hebrew University of Jerusalem. I thank researchers at the Center for their support, especially Yael Reshef for enabling me to train research assistants of the “Emergence of Modern Hebrew” research group in the TEI format. Programming support by Itay Zandbank of The Research Software Company (https://www.chelem.co.il) is also gratefully acknowledged.

Author information



Corresponding author

Correspondence to Aynat Rubinstein.

Appendix A: Corpus size estimates

Historical Jewish Press (JPress)

The JPress corpus is not freely available for search. To estimate the size of the corpus, I relied on the number of scanned newspaper pages from the years 1856–1970, as reported by the JPress team for the August 2016 version of the corpus (Sinai Rusinek and Eyal Miller, p.c.). Token counts were estimated manually, by counting the tokens (base words, prepositions, and punctuation marks) in one column of one randomly chosen newspaper page in the corpus (Ha-zman, volume 78, April 20, 1914; page 2), and then extrapolating to a full page and to the total number of pages.

  • Tokens in column: 777

  • Tokens in page [estimate]: 4662 (six columns)

  • Total pages (1856–1970): 277,165

  • Total tokens (1856–1970) [estimate]: 1,292,143,230
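The arithmetic behind these figures can be reproduced with a short sketch. The function name `estimate_total_tokens` is hypothetical, and the calculation assumes, as the estimate above does, that every page has six columns with the same token count as the sampled column.

```python
def estimate_total_tokens(tokens_per_column, columns_per_page, total_pages):
    """Extrapolate from one sampled column to a page and to the whole corpus."""
    tokens_per_page = tokens_per_column * columns_per_page
    total_tokens = tokens_per_page * total_pages
    return tokens_per_page, total_tokens


# Figures from the JPress estimate above.
tokens_per_page, total_tokens = estimate_total_tokens(
    tokens_per_column=777,
    columns_per_page=6,
    total_pages=277_165,
)
print(f"Tokens per page (estimate): {tokens_per_page:,}")  # 4,662
print(f"Total tokens (estimate): {total_tokens:,}")        # 1,292,143,230
```

Because the estimate rests on a single sampled column, it is best read as an order-of-magnitude figure; sampling more pages would give a sense of its variance.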

Bar Ilan Responsa project

The corpus is searchable through a proprietary interface, in which it is possible to see word counts for individual texts and subcorpora. However, since works in the corpus are distributed without date metadata, it is not possible to search for those from a particular time period. To estimate the size of the corpus for the period of interest to us, we located relevant subcorpora and estimated word counts for each of them. The following calculations are based on version 2.1 of the Bar Ilan Responsa Project.

Table 10

Cite this article

Rubinstein, A. Historical corpora meet the digital humanities: the Jerusalem Corpus of Emergent Modern Hebrew. Lang Resources & Evaluation 53, 807–835 (2019). https://doi.org/10.1007/s10579-019-09458-4


  • Historical corpora
  • Language change
  • Ephemera
  • Digital humanities
  • Citizen science
  • Crowdsourcing
  • Hebrew