The paper describes the creation of the first open access multi-genre historical corpus of Emergent Modern Hebrew, made possible by implementation of digital humanities methods in the process of corpus curation, encoding, and dissemination. Corpus contents originate in the Ben-Yehuda Project, an open access repository of Hebrew literature online, and in digital images curated from the collections of the National Library of Israel, a selection of which have been transcribed through a dedicated crowdsourcing task that feeds back into the library’s online catalog. Texts in the corpus are encoded following best practices in the digital humanities, including markup of metadata that enables time-sensitive research, linguistic and other, of the corpus. Evaluation of morphological analysis based on Modern Hebrew language models is shown to distinguish between genres in the historical variety, highlighting the importance of ephemeral materials for linguistic research and for potential collaboration with libraries and cultural institutions in the process of corpus creation. We demonstrate the use of the corpus in diachronic linguistic research and suggest ways in which the association it provides between digital images and texts can be used to support automatic language processing and to enhance resources in the digital humanities.
A survey of this rich literature is beyond the scope of this paper. Entries in the Encyclopedia of Hebrew Language and Linguistics (EHLL; Khan, ed., 2013) provide a useful starting point for reviews of all topics related to the Hebrew language. See in particular Reshef (2013) and references cited therein.
Dumps of the Project are archived periodically as of 2018, at https://github.com/projectbenyehuda.
A small section is devoted specifically to children’s newspapers. A benefit of the automatic text recognition with OCR is that the corpus is very large compared to other historical corpora of newspapers before the twentieth century (e.g., the 1.6 million-word Zurich English Newspaper Corpus (ZEN), 1661–1791; Lehmann et al. 2006).
For a recent overview and critical evaluation of the leading hypotheses in this area, the reader is referred to Doron (2015).
We are grateful to Asaf Bartov for providing this snapshot of the corpus at the Ben-Yehuda Hackathon (THATCamp Haifa, University of Haifa; February 2014).
See http://web.nli.org.il/sites/NLI/English/digitallibrary/time_journey/. The Time Travel project is a collaborative project of NLI and the University of California Los Angeles and is sponsored by the Arcadia Fund.
http://hebrew-academy.org.il/%d7%94%d7%9e%d7%99%d7%9c%d7%95%d7%9f/%d7%a1%d7%a4%d7%a8%d7%95%d7%aa-%d7%94%d7%a2%d7%aa-%d7%94%d7%97%d7%93%d7%a9%d7%94/. The Modern Literature subcorpus focuses on texts from the mid-eighteenth century up to the establishment of the State of Israel in 1948.
Currently, the collection includes over 139,000 freely browsable items as well as 13,483 items that can only be viewed on location at the library.
One exception is the work of Grosse et al. (1987) on the evolution of Ruhrdeutsch (cited in Anderwald and Szmrecsanyi 2009, p. 1134). In the study of the emergence of Modern Hebrew, Reshef (2009: 144–148, 2012) and Reshef and Helman (2009) stress the importance of municipal correspondence and posters to the study of language development. These materials have not been included in an openly accessible corpus, however. Another notable resource that contains, in part, historical ephemeral documents in classical and medieval Hebrew is the Cairo Genizah (available in digital format at http://www.jewishmanuscripts.org/). See Morag (1998) for an early assessment of the role of this corpus in the linguistic study of Hebrew.
A possible source for this non-standard use may be found in Mishnaic Hebrew, which is known to show precisely this kind of exception to definiteness agreement (Azar 1995: 246). Rubin (2013) notes that the nouns involved are usually generic or collective (e.g., ‘water’, ‘camel’), like ‘sesame’ in (2b). However, note that the phrase in (2b) is preceded by the “correctly” inflected noun phrase ha-min ha-muvḥar ‘the select brand’ (lit. def-kind def-select), which also features a generic noun. The juxtaposition of standard and non-standard forms is well-attested in materials from this period (see, e.g., Reshef 2016, pp. 195, 199).
See Claridge (2008), Xiao (2008), Piotrowski (2012) and Yáñez-Bouza (2015) for an overview of these and other resources. Basic information about English language corpora can be gleaned from the Corpus Resource Database (CoRD; http://www.helsinki.fi/varieng/CoRD/index.html). Piotrowski (2012) provides an overview of historical corpora in languages other than English, including Arabic, Chinese, Dutch, French, German, Nordic languages, Latin and Ancient Greek, and Portuguese (see his chapter 8).
On the limitations of studying historical phonology from corpora, see Curzan (2009: 1097).
We thank Maayan Almagor, NLI’s former Community Manager, and Sinai Rusinek from Digital Humanities Israel for their collaboration on this project.
Volunteers were recruited through announcements distributed among students and faculty at the Hebrew University of Jerusalem (including research assistants who were involved in transcription of other materials in the corpus), through announcements in local forums of digital humanities, and through the library’s social media outlets.
http://nlics.org/, transcription of ads for children’s plays (accessed November 7, 2017).
Experiments run by participants of the Ben-Yehuda Hackathon (THATCamp at the University of Haifa, February 2014).
Some works in Russian, Yiddish, German, English, and Italian, for example, are included in the Ben-Yehuda Project.
The heuristic of estimating creation dates based on author lifespan is mentioned in other historical corpora, e.g., the Corpus of Modern Yiddish (CMY; http://web-corpora.net/YNC/search/index.php). There are various implementations of this heuristic that one could employ; we leave experimentation with their accuracy for future research.
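One possible implementation of the lifespan heuristic can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used for the corpus: the function name `estimate_creation_year`, the assumed start of an author's writing career at age 20, and the choice of the midpoint of the productive years are all hypothetical.

```python
from typing import Optional


def estimate_creation_year(
    birth_year: int, death_year: int, adult_age: int = 20
) -> Optional[int]:
    """Estimate a text's creation year from its author's lifespan.

    Assumption (hypothetical): the author was productive from `adult_age`
    until death, and the midpoint of that interval serves as the estimate.
    """
    if death_year < birth_year:
        # Inconsistent metadata: no estimate can be given.
        return None
    # Clamp the start of the productive period to the death year for
    # authors who died before reaching `adult_age`.
    start = min(birth_year + adult_age, death_year)
    return (start + death_year) // 2


# A hypothetical author who lived from 1850 to 1910.
print(estimate_creation_year(1850, 1910))
```

Other variants (for example, a fixed offset from the birth year) would follow the same pattern; comparing their accuracy is the kind of experimentation left open above.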
Verbal templates are characteristic of Hebrew’s Semitic non-concatenative morphology. Future releases of the corpus will include annotation of roots alongside templates for verbs.
See Piotrowski (2012) for a more complete survey of NLP tools for historical corpora in additional languages.
See Adler (2007: 2) for an example of a seven-way ambiguity in the morphological analysis of the four-character word bclm.
In contrast, a corpus of historical texts from just one year, e.g., the Brown Corpus (with texts collected in 1961), is not a diachronic corpus according to this definition (Claridge 2008: 243).
See footnote 22 above for pointers to comprehensive surveys of existing diachronic corpora.
See the ANNIS user guide for details: http://corpus-tools.org/annis/documentation.html.
A variety of reasons have been given in the literature to explain the decline in productivity of pa`al, which still remains the most common template in the language: the template is already associated with many roots, it cannot accommodate quadri-literal roots due to its morphophonology, it is not associated with a uniform semantics, and more. See Bolozky (2009: 360) and references cited there for further discussion.
I thank an anonymous reviewer for suggesting this avenue for research. Changes in the productivity of MH verbal templates have so far been described based on a range of quantitative studies, but without systematically reporting the statistical significance of the findings (see, e.g., Bolozky 2009).
Raw counts are as follows (for the 1840s and the 1970s): 6 and 966 in hitpa`el, 28 and 2299 in pi`el, 1 and 228 in pu`al, 13 and 1170 in nif`al.
I thank Sinai Rusinek for discussion of these issues.
Search-Yehuda working group (the group included the author, Livnat Herzig Sheinfux, Nadav Bin Nun, Nurit Melnik, Shira Wigderson, Sinai Rusinek, Tal Baumel, Toma Tasovac, and Tsvi Sadan).
Adler, M. (2007). Hebrew morphological disambiguation: An unsupervised stochastic word-based approach. Ph.D. thesis, Ben-Gurion University of the Negev.
Adler, M., & Elhadad, M. (2006). An unsupervised morpheme-based HMM for Hebrew morphological disambiguation. In Proceedings of COLING-ACL-06, Sydney, Australia.
Ahmed, M. A. (2018). XML annotation of Hebrew elements in Judeo-Arabic texts. Journal of Jewish Languages, 6, 221–242.
Anderwald, L., & Szmrecsanyi, B. (2009). Corpus linguistics and dialectology. In M. Kytö & A. Lüdeling (Eds.), Corpus linguistics: An international handbook (Vol. 2, pp. 1126–1140). Berlin: De Gruyter.
Ariel, Ch. (2015). The expression of material constitution in Revival Hebrew. Journal of Jewish Languages 3(1–2), 231–244 (Reprinted in E. Doron (Ed.), Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics, Brill (vol. 84)).
Azar, M. (1995). The syntax of Mishnaic Hebrew. Jerusalem, Haifa: The Academy of the Hebrew Language and University of Haifa Press. (in Hebrew).
Bar-Ziv Levy, M. & Agranovsky, V. (2015). The evolution of the structure of free relative clauses in Modern Hebrew: Internal development and contact language influence. Journal of Jewish Languages 3(1–2): 259–270. (Reprinted in Doron, E. (Ed.), Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics, Brill (vol. 84)).
Belinkov, Y., Magidow, A., Romanov, M., Shmidman, A. & Koppel, M. (2016). Shamela: A large-scale historical Arabic corpus. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH at Coling) 2016 (pp. 45–53).
Bendavid, A. (1971). Biblical Hebrew and Mishnaic Hebrew. Tel Aviv: Dvir. (in Hebrew).
Ben-Ḥayyim, Z. (1953). On the use of the phrase yeš l-. Lĕšonénu La‘am 4. (in Hebrew).
Ben-Ḥayyim, Z. (1992). The struggle for a language. Jerusalem: The Academy of the Hebrew Language. (in Hebrew).
Bolozky, S. (2009). Frequency and productivity in the verb system of Israeli Hebrew. Lĕšonénu, 71, 345–367. (in Hebrew).
Boneh, N. (2013). Mood and modality: Modern Hebrew. In G. Khan (Ed.), Encyclopedia of Hebrew Language and Linguistics (Vol. 2, pp. 693–703). Leiden: Brill.
Claridge, C. (2008). Historical corpora. In M. Kytö & A. Lüdeling (Eds.), Corpus linguistics: An international handbook (Vol. 1, pp. 242–259). Berlin: De Gruyter.
Culpeper, J., & Kytö, M. (2010). Early Modern English dialogues: Spoken interaction as writing. Cambridge: Cambridge University Press.
Curzan, A. (2009). Historical corpus linguistics and evidence of language change. In M. Kytö & A. Lüdeling (Eds.), Corpus linguistics: An international handbook (Vol. 2, pp. 1091–1109). Berlin: De Gruyter.
Doron, E. (2015). Introduction: Language contact and the development of Modern Hebrew. Journal of Jewish Languages 3(1–2): 5–26. (Reprinted in E. Doron (Ed.), Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics, Brill (vol. 84)).
Doron, E. (2016). Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics (Vol. 84). Leiden: Brill.
Garcia Martinez, M., & Walton, B. (2014). The wisdom of crowds: The potential of online communities as a tool for data analysis. Technovation, 34, 203–214.
Geyken, A. (2007). The DWDS corpus: A reference corpus for the German language of the 20th century. In C. Fellbaum (Ed.), Collocations and idioms: Linguistic, lexicographic, and computational aspects (pp. 23–41). London: Continuum Press.
Goldberg, Y., Adler, M. & Elhadad, M. (2008). EM can find pretty good HMM POS-taggers (when given a good start). In Proceedings of ACL-08: HLT (pp. 746–754).
Grosse, S., Grimberg, M., Hölscher, T., Karweick, J., & Kuntz, H. (1987). Sprachwandel und Sprachwachstum im Ruhrgebiet des 19. Jahrhunderts unter dem Einfluss der Industrialisierung. Zeitschrift für Dialektologie und Linguistik, 54(2), 202–221. (in German).
HaCohen-Kerner, Y., Beck, H., Yehudai, E., & Mughaz, D. (2010). Stylistic feature sets as classifiers of documents according to their historical period and ethnic origin. Applied Artificial Intelligence, 24(9), 847–862.
Hana, J., Feldman, A. & Aharodnik, K. (2011). A low-budget tagger for Old Czech. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (pp. 10–18).
Harshav, B. (1993). Language in time of revolution. Berkeley: University of California Press.
Howe, J. (2006). The rise of crowdsourcing. Wired 14(6). http://www.wired.com/2006/06/crowds/. Accessed 11 Aug 2017.
Howe, J. (2008). Crowdsourcing: Why the power of the crowd is driving the future of business. New York City: Crown Business.
Itai, A., & Wintner, S. (2008). Language resources for Hebrew. Language Resources and Evaluation, 42(1), 75–98.
Kaufmann, N., Schulze, T. & Veit, D. (2011). More than fun and money. Worker Motivation in Crowdsourcing-A Study on Mechanical Turk. In Proceedings of the 17th Americas Conference on Information Systems-AMCIS, 2011 (pp. 1–11).
Krause, T. & Zeldes, A. (2016). ANNIS3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities 2016, 31(1), 118–139.
Lehmann, H. M., auf dem Keller, C., & Ruef, B. (2006). ZEN Corpus 1.0. In R. Facchinetti & M. Rissanen (Eds.), Corpus-based Studies of Diachronic English (pp. 135–155). New York: Peter Lang.
Liebeskind, C., Dagan, I., & Schler, J. (2016). Semiautomatic construction of cross-period thesaurus. Journal on Computing and Cultural Heritage, 9(4), 22.
Lin, Y., Michel, J. B., Aiden, E. L., Orwant, J., Brockman, W. & Petrov, S. (2012). Syntactic annotations for the Google Books Ngram Corpus. In Proceedings of the ACL 2012 System Demonstrations, ACL ‘12 (pp. 169–174).
Meurman-Solin, A. (1995). A new tool: The Helsinki Corpus of Older Scots (1450–1700). ICAME Journal, 19, 49–62.
Michel, J. B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Brockman, W., et al. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331, 176–182.
Morag, S. (1998). The contribution of the Geniza to the study of the Hebrew language. Jewish Studies, 38, 239–251. (in Hebrew).
Morschheuser, B., Hamari, J. & Koivisto, J. (2016). Gamification in crowdsourcing: A review. In T. X. Bui & R. H. Sprague Jr. (Eds.), Proceedings of the 49th Hawaii International Conference on System Sciences (pp. 4375–4384).
Mughaz, D., HaCohen-Kerner, Y., & Gabbay, D. (2017). Mining and using key-words and key-phrases to identify the era of an anonymous text. In N. Nguyen, R. Kowalczyk, A. Pinto, & J. Cardoso (Eds.), Transactions on computational collective intelligence XXVI. Lecture notes in computer science (Vol. 10190). Cham: Springer.
Neuman, Y. (2013). The diphthong [eʸ] in Israeli Hebrew: Its origin and the factors conditioning its distribution. In R. Ben-Shahar & N. Ben-Ari (Eds.), Hebrew—A Living Language, volume VI. Tel-Aviv: The Porter Institute for Poetics & Semiotics Tel-Aviv University, HaKibbutz HaMeuchad. (in Hebrew).
Piotrowski, M. (2012). Natural language processing for historical texts. San Rafael: Morgan & Claypool.
Reshef, Y. (2009). Continuity vs. change in the emergence of Standard Modern Hebrew: The verbal system in the early Mandate period. In H. Cohen (Ed.), Modern Hebrew: Two hundred and fifty years (pp. 143–176). Jerusalem: The Academy of the Hebrew Language. (in Hebrew).
Reshef, Y. (2012). Early Spoken Hebrew. In S. Izre’el (Ed.), The speech machine as a language teacher Hebrew Spoken Here: Hebrew voices from Nazi Germany: A testimony on spoken Hebrew and Jewish life in Palestine during the British Mandate (pp. 163–187). Tel Aviv: The Haim Rubin Tel Aviv University Press. (in Hebrew).
Reshef, Y. (2013). Revival of Hebrew: Grammatical structure and lexicon. In G. Khan (Ed.), Encyclopedia of Hebrew Language and Linguistics (Vol. 3, pp. 397–405). Leiden: Brill.
Reshef, Y. (2016). Written Hebrew of the revival generation as a distinct phase in the evolution of Modern Hebrew. Journal of Semitic Studies, 61(1), 187–213.
Reshef, Y., & Helman, A. (2009). Instructing or recruiting? Language and style in 1920s and 1930s Tel Aviv municipal posters. Jewish Studies Quarterly, 16, 306–332.
Rissanen, M. (2008). Corpus linguistics and historical linguistics. In M. Kytö & A. Lüdeling (Eds.), Corpus linguistics: An international handbook (Vol. 1, pp. 53–68). Berlin: De Gruyter.
Rögnvaldsson, E., & Helgadóttir, S. (2011). Morphological tagging of Old Norse texts and its use in studying syntactic variation and change. In C. Sporleder, A. van den Bosch, & K. Zervanou (Eds.), Language Technology for Cultural Heritage: Selected papers from the LaTeCH workshop series (pp. 63–76). Berlin: Springer.
Rubin, A. D. (2013). Definite Article: Pre-Modern Hebrew. In G. Khan (Ed.), Encyclopedia of Hebrew Language and Linguistics (Vol. 1, pp. 678–682). Leiden: Brill.
Rubinstein, A. (forthcoming). Existential possessive modality in the emergence of Modern Hebrew. In E. Doron, M. Rappaport Hovav, Y. Reshef, M. Taube (Eds.), Linguistic contact, continuity, and change in the genesis of Modern Hebrew. Amsterdam: John Benjamins (to appear).
Rubinstein, A., Sichel, I. & Tsirkin-Sadan, A. (2015). Superfluous negation in Modern Hebrew and its origins. Journal of Jewish Languages 3(1–2): 165–182. (Reprinted in Doron, E. (Ed.), Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics, Brill (vol. 84)).
Rusinek, S. (2016). Kima: Towards an open digital historical Hebrew gazetteer. http://commons.pelagios.org/2016/07/kima-towards-an-open-digital-historical-hebrew-gazetteer/. Accessed 11 Aug 2017.
Saxton, G. D., Oh, O., & Kishore, R. (2013). Rules of crowdsourcing: Models, issues, and systems of control. Information Systems Management, 30(1), 2–20. https://doi.org/10.1080/10580530.2013.739883.
Schilling, N. (2013). Sociolinguistic fieldwork. Cambridge: Cambridge University Press.
Schmied, J. (1994). The Lampeter Corpus of Early Modern English Tracts. In M. Kytö, M. Rissanen, & S. Wright (Eds.), Corpora across the centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, Rodopi (pp. 81–89).
Shatil, N. (2007). The synchronic status of nitpa’el. Divrei ha-ḥug ha-yisr’eli lavalshanut, 16, 105–127. (in Hebrew).
Shehadeh, H. (1991). Gilguley ha-bituy ‘yesh (lo) lilmod’/‘haya (lo) lilmod’. In M. H. Goshen-Gottstein, S. Morag, & S. Kogut (Eds.), Studies on Hebrew and other Semitic languages presented to Professor Chaim Rabin on the occasion of his seventy-fifth birthday (pp. 415–442). Jerusalem: Academon Press. (in Hebrew).
Tsirkin-Sadan, A. (2015). Inheritance and Slavic contact in the polysemy of bixlal. Journal of Jewish Languages 3(1–2): 218–230. (Reprinted in Doron, E. (Ed.), Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics, Brill (vol. 84)).
Wigderson, S. (2015). The sudden disappearance of Nitpael and the rise of Hitpael in Modern Hebrew, and the role of Yiddish in the process. Journal of Jewish Languages, 3(1–2): 199–206. (Reprinted in Doron, E. (Ed.), Language contact and the development of Modern Hebrew, Studies in Semitic Languages and Linguistics, Brill (vol. 84)).
Xiao, R. (2008). Well-known and influential corpora. In M. Kytö & A. Lüdeling (Eds.), Corpus linguistics: An international handbook (Vol. 1, pp. 383–457). Berlin: De Gruyter.
Yáñez-Bouza, N. (2015). ‘Have you ever written a diary or journal?’ Diurnal prose and register variation. Neuphilologische Mitteilungen, 116(2), 449–474.
Zhao, Y. & Zhu, Q. (2014). Effects of extrinsic and intrinsic motivation on participation in crowdsourcing contest. Online Information Review, 38(7), 896–917.
Zheng, H., Li, D., & Hou, W. (2011). Task design, motivation, and participation in crowdsourcing contests. International Journal of Electronic Commerce, 15(4), 57–88.
Zipser, F. & Romary, L. (2010). A model oriented approach to the mapping of annotation formats using standards. In Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta. URL: http://hal.archives-ouvertes.fr/inria-00527799/en/.
Zohar, H., Liebeskind, C., Schler, J., & Dagan, I. (2013). Automatic thesaurus construction for cross generation corpus. Journal on Computing and Cultural Heritage, 6(1), 4.
I wish to thank the three anonymous reviewers of this manuscript for their helpful comments. For invaluable discussion and feedback during all stages of the project, I am grateful to Sinai Rusinek. Thanks also to Meni Adler, Maayan Almagor, Yael Netzer, Avigail Tsirkin-Sadan, and Amir Zeldes. This research was supported by the Mandel Scholion Interdisciplinary Research Center in the Humanities and Jewish Studies at the Hebrew University of Jerusalem. I thank researchers at the Center for their support, especially Yael Reshef for enabling me to train research assistants of the “Emergence of Modern Hebrew” research group in the TEI format. Programming support by Itay Zandbank of The Research Software Company (https://www.chelem.co.il) is also gratefully acknowledged.
Appendix A: Corpus size estimates
Historical Jewish Press (JPress)
The JPress corpus is not freely available for search. In order to provide an estimation of the size of the corpus, I relied on the number of scanned newspaper pages from the years 1856–1970, as reported by the JPress team for the version of the corpus from August 2016 (Sinai Rusinek and Eyal Miller, p.c.). An estimation of token counts was done manually, by counting the number of tokens (base words, prepositions, and punctuation marks) in one column of one randomly chosen newspaper page in the corpus (Ha-zman, volume 78, April 20, 1914; page 2).
Tokens in column: 777
Tokens in page [estimate]: 4662 (six columns)
Total pages (1856–1970): 277,165
Total tokens (1856–1970) [estimate]: 1,292,143,230
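The extrapolation above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures reported in this appendix; the function names are illustrative, and the calculation assumes, as the estimate itself does, that the single sampled column is representative of every column and page in the corpus.

```python
def tokens_per_page(tokens_in_column: int, columns_per_page: int) -> int:
    """Scale the token count of one sampled column up to a full page."""
    return tokens_in_column * columns_per_page


def total_tokens(page_tokens: int, total_pages: int) -> int:
    """Scale the per-page estimate up to the whole collection."""
    return page_tokens * total_pages


# Figures reported above for the sampled page of Ha-zman (April 20, 1914).
page_estimate = tokens_per_page(tokens_in_column=777, columns_per_page=6)
corpus_estimate = total_tokens(page_estimate, total_pages=277_165)

print(f"Tokens per page (estimate): {page_estimate:,}")
print(f"Total tokens (estimate): {corpus_estimate:,}")
```

Because the estimate rests on one column of one page, it is best read as an order-of-magnitude figure; sampling several pages from different decades would presumably tighten it.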
Bar Ilan Responsa project
The corpus is searchable through a proprietary interface, in which it is possible to see word counts for individual texts and subcorpora. However, since works in the corpus are distributed without date metadata, it is not possible to search for those that are from a particular time period. To obtain an estimate of the size of the corpus for the period of interest to us, we located relevant subcorpora and estimated word counts for each of them. The following calculations are based on version 2.1 of the Bar Ilan Responsa Project.
About this article
Cite this article
Rubinstein, A. Historical corpora meet the digital humanities: the Jerusalem Corpus of Emergent Modern Hebrew. Lang Resources & Evaluation 53, 807–835 (2019). https://doi.org/10.1007/s10579-019-09458-4
- Historical corpora
- Language change
- Digital humanities
- Citizen science