1 Introduction

The project “Dictionnaire de Termes Médico-botaniques de l’Ancien Occitan” (DiTMAO) (fn. 1) aims at constructing an ontology-based information system for Old Occitan medico-botanical terminology. This article applies the lemon model (fn. 2) to the lexicon component and focuses on the modelling of the historical, multilingual terminology.

1.1 Aims, Background and Structure of the Article

Old Occitan is the medieval stage of Occitan, the autochthonous Romance language spoken in Southern France, today a regional minority language with several dialects. During the Middle Ages, the region and its language played a significant role in medical science due to the medical schools of Toulouse and Montpellier and the strong presence of Jewish physicians and scholars. For this reason, Old Occitan medico-botanical terminology is documented both in Latin and in Hebrew characters (cf. [3]). The DiTMAO project aims at making this terminology accessible to several scientific communities, such as those of Romance and Semitic studies, as well as that of the history of medicine.

The textual basis (fn. 3) of the lexicon, as described in [2, 9, 10], consists of medico-botanical texts in Latin and in Hebrew script. Among the sources in Hebrew script, the most prominent text type is the so-called synonym list. These lists contain a large number of Old Occitan medical and botanical terms in Hebrew characters with equivalents or explanations in other languages (also spelled in Hebrew characters), mostly in (Judaeo-)Arabic, but also in Hebrew, Latin, or other Romance languages and sometimes in Greek, Aramaic or Persian. They can be described as ancient multilingual dictionaries, which are of particular importance for Old Occitan lexicography for two main reasons: (i) the synonym lists of the Jewish tradition include vernacular (Old Occitan) terms from the 13th century onwards and thus contain very early attestations of Old Occitan technical terms; (ii) the corresponding terms in other ancient languages help to determine the meaning of otherwise opaque Old Occitan terms (cf. [3, 18, 19, 21]). A special difficulty of medieval texts in vernacular languages is that most terms are documented in a large number of variants (reflecting different spellings, dialects, or historical stages of the languages at issue). The dictionary will therefore include all variants of Old Occitan terms, together with the corresponding terms in at least six other ancient languages. Whenever possible, a translation into modern French and English will also be provided. The dictionary aims to be useful not only for users interested in Old Occitan but also for those reading the numerous medieval Hebrew medico-botanical texts written or translated in Southern France, since these texts are full of Occitan terminology and thus partially inaccessible even to readers with a good knowledge of Hebrew (cf. [22]).

After introducing the lemon model and our extensions, the article primarily deals with the lemmatization of simple and multiword terms and their representation in lemon. Furthermore, we will show how the corresponding terms in other ancient languages can be integrated and we will propose a way to resolve polysemy (fn. 4).

1.2 The Ontological Conception and the Lemon Model

Current trends in linguistic and lexical resources show a growing interest in publishing them in the context of the Semantic Web [14–16]. Sharing lexica in accordance with linked data principles has become essential: a resource (not only of a linguistic nature) that cannot be accessed, shared and reused as a dataset is effectively unreachable and thus of little use from a Semantic Web perspective. The lemon model has been developed as a standard for publishing lexica as RDF data. More precisely, lemon should be considered an Ontology-Lexicon model for the Multilingual Semantic Web [11], and its nature and purpose satisfy our need to represent the DiTMAO lexicon and the related ontologies. DiTMAO consists of three main domains: (i) the lexicographic domain, including the lemmatized forms (lemma, variants and corresponding terms in other ancient languages) and their linguistic and lexicographic description; (ii) the conceptual domain, describing the meaning of each term by means of subontologies for the fields of botany, zoology, mineralogy, human anatomy, diseases and therapy (medication, medical instruments), where we aim to complement the onomasiological description, if possible, with a modern scientific classification, at least for most of the plant names, and a medieval classification (fn. 5) of plants and other simple drugs; (iii) the documentation domain, giving the source for each form of a term and its meaning. The documentation is indispensable for a historical (diachronic) dictionary.

The lemon model will be extended with a documentation domain and new vocabulary that is necessary for the lemmatization of a historical multilingual and multi-alphabetical dictionary (fn. 6).

2 The Lexicographic Component

In the following sections, we describe the lemmatization of simple and multiword terms in Latin and Hebrew script and their representation in lemon. The representation will be illustrated by some representative examples from our corpus. The fact that we use just a few terms should not obscure the fact that our corpus contains about 5800 Old Occitan forms in Latin script and 3200 forms in Hebrew script. Furthermore, the corresponding terms in the other ancient languages amount to 3050 terms.

2.1 Lemmatization and Determination of Variants

As a general criterion of lemmatization, it has been decided for DiTMAO that a lemma is a term in Latin characters. All forms that differ from the lemma are classified as variants. Among the forms in Latin script, the lemma is determined following a set of criteria (fn. 7), and the form of an Old Occitan lemma is the oblique (fn. 8) singular form for nouns, the oblique singular masculine for adjectives, and the infinitive for verbs. For example, the corpus contains the following variants for the word meaning ‘hemp seed’: canabo, canebe, canabos, and variants in Hebrew characters (represented here together with the transliterated forms (fn. 9)): קנבוש/QNBWŠ, קִנַבוּש/QiNaBWuŠ, קנבונש/QNBWNŠ. The form canabo is taken as the lemma or leading variant. The form canabos is the plural form of the lemma canabo and is classified as a morphological variant. The form canebe differs with respect to spelling and pronunciation and is thus classified as a grapho-phonetic variant. As a general definition, the variants in Hebrew characters are all alphabetical variants. The forms קנבוש/QNBWŠ and קִנַבוּש/QiNaBWuŠ are alphabetical variants of the plural form canabos. In this sense they are variants of a variant. The form קִנַבוּש/QiNaBWuŠ additionally differs with respect to phonology: as indicated by the vowel signs, the initial syllable has to be interpreted as [ki] instead of [ka]. The form קנבונש/QNBWNŠ (read: “canabons”) has no corresponding form in Latin script in our corpus. It is thus classified as an alphabetical variant of the lemma, and additionally as a grapho-phonetic (fn. 10) and morphological variant. Furthermore, among the variants in Latin characters there are purely graphic variants, where the spelling does not reflect a difference in pronunciation, e.g. alcanna and alquana.

A certain difficulty for lemmatization lies in the fact that about 40% of the terms are only documented in Hebrew characters. Nevertheless, the general criterion for lemmatization (a lemma is a term in Latin script) has been established for two main reasons. First, it is not possible to map a Hebrew character uniquely to a Latin character. For example, the letter Alef (א - ʾ) may represent different vowels: it stands for /e/ in אשפרמא/ʾŠPRMʾ (read: “esperma”, ‘sperm’) and for /a/ in ארמולש/ʾRMWLŠ (read: “armols”, ‘orache’). The combinations of initial Alef with Yod or Waw can be interpreted as /i/ or /e/, as in אינגילש/ʾYNGYLŠ (read: “enguilas”, ‘eels’), or as /o/ or /u/, as in אורטיגש/ʾWRṬYGŠ (read: “ortigas”, ‘stinging nettles’). Second, having lemmata in two alphabets would additionally complicate the string search and the display of the results in alphabetical order. If a term is only documented in Hebrew characters, a corpus-external lemma, i.e. a form documented in other dictionaries, will be included. In some cases, however, there is no such corpus-external lemma (so the variant in Hebrew spelling is the only documented form), and we have to introduce a hypothetical or reconstructed form. For example, for the term אנאקירד - ʾNʾQYRD (read: “anacard”), we introduce the form *anacard as a hypothetical Old Occitan form with the meaning ‘marking nut’, the fruit of Semecarpus anacardium L. The meaning is documented for the Arabic term בלאדר/BLʾDR, which features as its synonym in the lists edited in [4]. Thus, we need to indicate for a lexical entry whether the lemma is a corpus-external, a reconstructed or a hypothetical form.

2.2 Modelling the Lemma and Its Variants

A lexicon entry in lemon consists of a Form and a LexicalSense. For the lemmatization, the class Form and its relations with LexicalEntry (lexicalForm and its subproperties canonicalForm and otherForm) are relevant. In lemon the lemma canabo will have the following shape:

The lemma is represented by the canonicalForm of the entry and its realization is the written representation (writtenRep). The language, although inferable from the lexicon, will be represented together with the ISO 15924 script code: Latn for Latin, Arab for Arabic, and Hebr for Hebrew. This is an elegant way to avoid the definition of a property specifying the script type. Linguistic information such as part of speech, gender and number will be integrated as attribute-value pairs from the Lexinfo ontology (fn. 11), an extension of lemon that provides data categories for linguistic annotations. These will be defined as subproperties of the property lemon:property. In a similar vein, the labels for corpus-external lemmata and hypothetical and reconstructed forms can be added to the canonicalForm.
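The listing for canabo is not reproduced in this text. As a minimal sketch of what the description above implies, the entry can be written as a handful of triples, modelled here as plain Python tuples of prefixed names. The identifiers ditmao:canabo and ditmao:canabo_form are hypothetical, and attaching the part of speech to the entry and the number to the form is an assumption, not something the text specifies.

```python
# Minimal sketch: RDF triples as (subject, predicate, object) tuples.
# ditmao:canabo and ditmao:canabo_form are hypothetical identifiers.

def tagged(text, lang, script):
    """Return a literal tagged with a language code and an ISO 15924 script code."""
    return f'"{text}"@{lang}-{script}'

canabo_lemma = [
    ("ditmao:canabo", "rdf:type", "lemon:LexicalEntry"),
    ("ditmao:canabo", "lemon:canonicalForm", "ditmao:canabo_form"),
    # Assumption: part of speech on the entry, number on the form.
    ("ditmao:canabo", "lexinfo:partOfSpeech", "lexinfo:noun"),
    ("ditmao:canabo_form", "rdf:type", "lemon:Form"),
    ("ditmao:canabo_form", "lemon:writtenRep", tagged("canabo", "aoc", "Latn")),
    ("ditmao:canabo_form", "lexinfo:number", "lexinfo:singular"),
]

# Print the triples in a Turtle-like one-statement-per-line layout.
for s, p, o in canabo_lemma:
    print(f"{s} {p} {o} .")
```

Carrying the script in the language tag means no extra triple is needed to state that the form is written in Latin characters.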

The subproperty ditmao:lemmaInfo will have the following values: ditmao:corpusExternalLemma, ditmao:hypotheticalForm and ditmao:reconstructedForm. For the representation of variants, the lemon model only provides the relation otherForm. The variant canabos has the following entry:
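The entry for canabos is likewise missing here. The following sketch, under the same tuple-based assumptions, shows a variant attached via lemon:otherForm and a small helper that only accepts the three ditmao:lemmaInfo values named above. All subject identifiers (for example ditmao:canabos_form and ditmao:anacard_form) are hypothetical.

```python
# The three values the text lists for ditmao:lemmaInfo.
LEMMA_INFO_VALUES = {
    "ditmao:corpusExternalLemma",
    "ditmao:hypotheticalForm",
    "ditmao:reconstructedForm",
}

def lemma_info(form, value):
    """Return a ditmao:lemmaInfo triple, rejecting values outside the listed three."""
    if value not in LEMMA_INFO_VALUES:
        raise ValueError(f"unknown lemmaInfo value: {value}")
    return (form, "ditmao:lemmaInfo", value)

# Hypothetical identifiers; canabos is the plural variant of canabo.
canabos_variant = [
    ("ditmao:canabo", "lemon:otherForm", "ditmao:canabos_form"),
    ("ditmao:canabos_form", "lemon:writtenRep", '"canabos"@aoc-Latn'),
    ("ditmao:canabos_form", "lexinfo:number", "lexinfo:plural"),
]

# The hypothetical form *anacard is labelled on its canonical form.
anacard_label = lemma_info("ditmao:anacard_form", "ditmao:hypotheticalForm")
```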

The fact that canabos is a morphological variant can be inferred from the value of lexinfo:number. An alphabetical variant can be formalized by adding a script tag to the language tag, e.g. aoc-Hebr or aoc-Arab (fn. 12). In order to give the transliteration, we adopted lexinfo:transliteration, which is defined as a subproperty of lemon:representation (the superproperty of lemon:writtenRep), in accordance with the Lemon Cookbook [17]. The specific transliteration alphabets are defined as subproperties of lexinfo:transliteration. For DiTMAO, a transliteration of Hebrew and Arabic is needed. The former is labelled HebrTransliteration and the latter ArabTransliteration, with the respective abbreviations HebrTrsl and ArabTrsl (fn. 13). The entry for קנבונש/QNBWNŠ (read: “canabons”) would have the following shape.
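A sketch of how the Hebrew-script form and its transliteration might sit side by side is given below. The ditmao: namespace for the two transliteration properties, the untagged transliteration literal, and the identifier ditmao:canabons_form are assumptions; the text only gives the property labels. The plural number is inferred from the classification of the form as a morphological variant.

```python
# Assumed property names; the text gives only the labels
# HebrTransliteration and ArabTransliteration.
SCRIPT_TRANSLITERATION = {
    "Hebr": "ditmao:HebrTransliteration",
    "Arab": "ditmao:ArabTransliteration",
}

def variant_in_script(form, text, script, transliteration, lang="aoc"):
    """Return the written representation and transliteration triples of a form."""
    return [
        (form, "lemon:writtenRep", f'"{text}"@{lang}-{script}'),
        (form, SCRIPT_TRANSLITERATION[script], f'"{transliteration}"'),
    ]

canabons = [
    ("ditmao:canabo", "lemon:otherForm", "ditmao:canabons_form"),
    *variant_in_script("ditmao:canabons_form", "קנבונש", "Hebr", "QNBWNŠ"),
    ("ditmao:canabons_form", "lexinfo:number", "lexinfo:plural"),
]
```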

One problem is the formalization of the graphic and grapho-phonetic variants. Only users who are familiar with Old Occitan phonology and dialectology can distinguish graphic from grapho-phonetic variants. Since the dictionary is also intended for researchers from other domains, an explicit indication of these types of variants is desirable. We propose to specify all types of variants (morphological, alphabetical, grapho-phonetic and graphic variants) as values of ditmao:variant, defined as a subproperty of lemon:property. This subproperty will take the following values: ditmao:alphabeticalVariant, ditmao:graphicVariant, ditmao:morphologicalVariant, and ditmao:graphophoneticVariant. The form canebe bears only the value ditmao:graphophoneticVariant. In addition to the marking of the script and grammatical number, the entry קנבונש/QNBWNŠ has the following shape:
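The variant typing described above can be sketched as follows. The four value names come from the text; the form identifiers and the helper function are hypothetical, and the three values assigned to the Hebrew-script form follow its classification in Sect. 2.1.

```python
# The four values the text lists for ditmao:variant.
VARIANT_TYPES = {
    "alphabeticalVariant",
    "graphicVariant",
    "morphologicalVariant",
    "graphophoneticVariant",
}

def variant_triples(form, *types):
    """Return one ditmao:variant triple per variant type, validating the names."""
    unknown = set(types) - VARIANT_TYPES
    if unknown:
        raise ValueError(f"unknown variant types: {sorted(unknown)}")
    return [(form, "ditmao:variant", f"ditmao:{t}") for t in types]

# canebe differs from the lemma in spelling and pronunciation only.
canebe = variant_triples("ditmao:canebe_form", "graphophoneticVariant")

# The Hebrew-script form read "canabons" carries three classifications.
canabons = variant_triples(
    "ditmao:canabons_form",
    "alphabeticalVariant",
    "graphophoneticVariant",
    "morphologicalVariant",
)
```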

The other variants in Hebrew characters have been classified as variants of a variant. The terms קנבוש/QNBWŠ and קִנַבוּש/QiNaBWuŠ are alphabetical variants of the morphological variant canabos. In order to represent a relation between two forms of one lexical entry, lemon provides the property formVariant. A symmetric subproperty of formVariant, ditmao:varOfVar, will be defined:

The subproperty ditmao:varOfVar will be added to the variant in Hebrew characters. An exemplary entry is shown below for the form קִנַבוּש/QiNaBWuŠ.
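As a sketch of the intended behaviour, the property definition and one asserted link can be written as triples, with a small function that makes the symmetry explicit. The form identifiers are hypothetical, and declaring the property as owl:SymmetricProperty is one plausible way of encoding "symmetric", not necessarily the encoding used in DiTMAO.

```python
# Property definition as described: a symmetric subproperty of lemon:formVariant.
var_of_var_definition = [
    ("ditmao:varOfVar", "rdfs:subPropertyOf", "lemon:formVariant"),
    ("ditmao:varOfVar", "rdf:type", "owl:SymmetricProperty"),  # assumed encoding
]

# The vocalized Hebrew-script form is linked to the plural form canabos.
asserted = [
    ("ditmao:qinabus_form", "ditmao:varOfVar", "ditmao:canabos_form"),
]

def symmetric_closure(triples, prop):
    """Add the inverse statement for every triple that uses a symmetric property."""
    closed = set(triples)
    for s, p, o in triples:
        if p == prop:
            closed.add((o, p, s))
    return closed

inferred = symmetric_closure(asserted, "ditmao:varOfVar")
```

Because the property is symmetric, the link only has to be stated on the Hebrew-script variant; the statement on canabos follows.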

2.3 Modeling Multiword Expressions

The multiword expressions contained in our corpus are mostly noun-adjective expressions, like goma arabica, ‘arabic gum’, or syntagmatic noun-preposition-noun expressions, like goma de gingibre, ‘ginger gum’. Multiword terms are classified as sublemmata in the sense of a strict alphabetical macrostructure of a dictionary. Both terms, goma arabica and goma de gingibre, are sublemmata of the lemma goma. Sublemmata are modeled as a relation between two lexical entries by means of the property lexicalVariant. For DiTMAO, a subproperty of lexicalVariant, sublemmaOf, will be defined. The term goma arabica will have the following entry:
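The sublemma relation can be sketched with the same tuple convention. The entry identifiers are hypothetical; the lookup function simply illustrates how the sublemmata of a lemma could be collected in alphabetical order for display.

```python
sublemma_triples = [
    ("ditmao:sublemmaOf", "rdfs:subPropertyOf", "lemon:lexicalVariant"),
    # Hypothetical entry identifiers for the two multiword terms.
    ("ditmao:goma_arabica", "ditmao:sublemmaOf", "ditmao:goma"),
    ("ditmao:goma_de_gingibre", "ditmao:sublemmaOf", "ditmao:goma"),
]

def sublemmata_of(lemma, triples):
    """Return the sublemmata of a lemma, sorted alphabetically by identifier."""
    return sorted(
        s for s, p, o in triples if p == "ditmao:sublemmaOf" and o == lemma
    )

print(sublemmata_of("ditmao:goma", sublemma_triples))
```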

For a description of the internal structure of multiword expressions, lemon provides a phrase structure module. Multiword terms can be decomposed into their components by means of an ordered list, the lemon:componentList. A list consists of components, which are linked by means of the property lemon:element to the lexical entries. Each component can be associated with a leaf of a tree structure representing the internal structure of the phrases goma arabica and goma de gingibre. The determinatum goma is the head of the noun phrase and the determinans is the adjective phrase or the prepositional phrase, which are themselves decomposed into an adjective and a preposition + noun phrase, respectively. Each component is linked to its lemma, which is unproblematic for the term goma de gingibre, because the components correspond to the canonical forms of the lemmata at issue. However, the term arabica is inflected for feminine and the canonical form of an adjective is, by definition, the masculine singular form. For relating such components, we cannot use the lemon:element property, since it is defined to have the class LexicalEntry as range. For this reason, we chose to define a specific property whose range is lemon:Form:
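The distinction between the two linking properties can be sketched as a small data structure: each component of the ordered list points either to a lexical entry (lemon:element) or to a specific form (ditmao:formElement, the property name used later in this section). The class, its field names and the identifiers ditmao:goma and ditmao:arabica_form are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Component:
    surface: str
    element: Optional[str] = None       # lemon:element, range LexicalEntry
    form_element: Optional[str] = None  # ditmao:formElement, range lemon:Form

    def link(self):
        """Return the (property, target) pair; exactly one target must be set."""
        if (self.element is None) == (self.form_element is None):
            raise ValueError("exactly one of element / form_element is required")
        if self.element is not None:
            return ("lemon:element", self.element)
        return ("ditmao:formElement", self.form_element)

# Ordered component list for goma arabica: the head noun matches its lemma,
# the inflected adjective is linked to a form instead of an entry.
goma_arabica = [
    Component("goma", element="ditmao:goma"),
    Component("arabica", form_element="ditmao:arabica_form"),
]
links = [c.link() for c in goma_arabica]
```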

The decomposition of goma arabica is shown in Fig. 1.

Fig. 1. Decomposition of goma arabica

A particularity of our corpus is the presence of multiword expressions consisting of an Old Occitan and a Hebrew word, e.g. בול חתום/ BWL ḤTWM meaning ‘sealed clay/earth’ and אגוז מושקאדא/ʾGWZ MWŠQʾDʾ, meaning ‘nutmeg’. The former consists of an Old Occitan head noun, בול/BWL, an alphabetical variant of the term bol, followed by a Hebrew passive participle ḥatum. The latter has a Hebrew head noun אגוז/ʾGWZ, meaning ‘nut’, followed by an alphabetical variant of the Old Occitan adjective muscada. These mixed terms mostly occur in Hebrew prose texts or in Hebrew translations and should be considered as foreign technical terms of Jewish physicians living in Southern France. As for the lemmatization, the terms are taken to be lexical entries of the ditmao_hebrew lexicon, irrespective of the language of the head noun. Due to the decomposition function of lemon, we can preserve the information that the components בול/BWL and מושקאדא/MWŠQʾDʾ are variants of the Old Occitan terms bol and muscat (fn. 14), respectively. How these terms can be represented in lemon will be discussed in the following subsection.

The term אגוז מושקאדא/ʾGWZ MWŠQʾDʾ is a sublemma of the Hebrew entry אגוז/ʾGWZ, meaning ‘nut’. The adjective מושקאדא/MWŠQʾDʾ is the alphabetical variant of the feminine form muscada, hence a variant of a variant.

In order to decompose the term, the relations lemon:element and ditmao:formElement are needed, because the head noun corresponds to a lemma of the ditmao_hebrew lexicon and the adjective is a variant.

The representations for the terms אגוז/ʾGWZ and מושקאדא/MWŠQʾDʾ have the following shape.

As for the term בול חתום/BWL ḤTWM, which consists of an Old Occitan head noun and a Hebrew passive participle, lemmatization is more problematic. In order to define a sublemma relation we would need to assume, contrary to fact, that the simple term בול/BWL was a Hebrew medical term. An equally undesired solution would be to allow the sublemma relation to be valid across the lexica. Thus, multiword terms with an Old Occitan head noun will not be lemmatized with respect to the sublemma relation, but the information that the word בול/BWL is an alphabetical variant of the Old Occitan term bol may be preserved, due to the decomposition, as shown below.

Furthermore, we preserve the information that the word חתום/ḤTWM is a morphological variant of the lemma חתם/ḤTM.

In some cases a mixed term is documented in a synonym list together with a term in Old Occitan. The mixed term will be classified as a corresponding term, in the same way as simple terms or other monolingual multiword expressions. For example, the term אגוז מושקאדא/ʾGWZ MWŠQʾDʾ appears together with the Old Occitan term נוץ מושקאדא/NWṢ MWŠQʾDʾ (read: “noz muscada”) and the Arabic term גוז בוי/GWZ BWY (read: “ǧawz bawwā”). The mixed term and the Arabic term will be linked as correspondences to the sublemma in Latin script: noze moscada. How these corresponding terms are modeled in lemon will be discussed in the next section.

2.4 Corresponding Terms and Other Sense Relations

As mentioned in the introduction, our corpus contains corresponding terms in other ancient languages, which have been considered as synonyms by the authors of the manuscripts. For example, the term ליטוגא/LYṬWGʾ (a variant of laytugua) figures as a synonym of the Aramaic term חסא/ḤSʾ and the Arabic term כס/KS in the synonym lists edited in [3]. The meaning of all three terms is documented (fn. 15) as ‘lettuce’ (in particular Lactuca sativa L.). But even if the terms have exactly the same meaning, they should not be considered as synonyms in the modern understanding of the term, because they do not belong to the same language (cf. [5]). In order to model this relation in lemon, we propose the property ditmao:correspondence as a subproperty of senseRelation. It links the senses of two lexical entries that belong to distinct lexica of ancient languages. In order to give a corresponding term in modern French and modern English, the subproperty lemon:translationOf will be used. The relations have to be kept apart for two main reasons: corresponding terms and translations belong to different historical stages and to different registers. The former are medieval technical terms and the latter are modern common names. Furthermore, the corpus contains Old Occitan terms that are synonyms in the modern understanding of the term, e.g. the terms litargia and mal de dormir both have the meaning ‘fatigue’. The corresponding LexicalSense of both terms is linked via the subproperty lemon:equivalent. The relations are represented in Fig. 2.

Fig. 2. Relating lexical senses in DiTMAO
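The choice between the two relations among ancient-language senses can be sketched as a function of the lexica involved. The sense identifiers and lexicon names below are hypothetical (only ditmao_hebrew is named in the text), and the function assumes the caller has already established that the two senses share a meaning.

```python
# Hypothetical sense identifiers mapped to hypothetical lexicon names.
SENSE_LEXICON = {
    "ditmao:laytugua_sense": "ditmao_occitan",
    "ditmao:hsa_sense": "ditmao_aramaic",
    "ditmao:ks_sense": "ditmao_arabic",
    "ditmao:litargia_sense": "ditmao_occitan",
    "ditmao:mal_de_dormir_sense": "ditmao_occitan",
}

def relate(sense_a, sense_b):
    """Link two senses with the same meaning: lemon:equivalent within one
    lexicon, ditmao:correspondence across distinct ancient-language lexica."""
    same_lexicon = SENSE_LEXICON[sense_a] == SENSE_LEXICON[sense_b]
    prop = "lemon:equivalent" if same_lexicon else "ditmao:correspondence"
    return (sense_a, prop, sense_b)

lettuce_links = [
    relate("ditmao:laytugua_sense", "ditmao:hsa_sense"),
    relate("ditmao:laytugua_sense", "ditmao:ks_sense"),
]
synonym_link = relate("ditmao:litargia_sense", "ditmao:mal_de_dormir_sense")
```

Translations into modern French and English are a third relation (lemon:translationOf) and are deliberately left out of this helper, since they link to modern common names rather than to medieval technical terms.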

However, about 20% of the lemmata in our corpus have more than one meaning. For example, we often find polysemous plant names which designate several species of a genus, e.g. the term laureola is documented with the names for the species Daphne oleoides Schreb., Daphne gnidium L., and Daphne sericea Vahl. In lemon, polysemy will be formalized as follows: a LexicalEntry has several instances of LexicalSense. The Arabic and Hebrew corresponding terms that feature in the synonym lists give an additional meaning: Daphne mezereum L. The entry of laureola thus has four instances of LexicalSense. Each LexicalSense has a translation into modern French and English, and the LexicalSense referring to Daphne mezereum L. will be linked via ditmao:correspondence to the respective Arabic and Hebrew entries. Furthermore, each LexicalSense of laureola has a referent in the botanical branch of the ontology, giving a general description of the plant, e.g. that it is a kind of shrub. These entities are linked to the modern classification, here the binominal plant names, and to a medieval classification. The term laureola is described as HOT and DRY in the third degree (see [12] and fn. 5). The general division of the conceptual subontology into an onomasiological subontology, a medieval and a modern classification system allows us to provide a description of the term's concepts independently of a modern or a medieval classification. This division is necessary for terms that designate e.g. medical instruments or substances whose composition is uncertain.
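The polysemy of laureola can be sketched as one entry with four senses, of which only the fourth carries correspondence links. The species names come from the text; the sense identifiers and the placeholder targets for the Arabic and Hebrew terms are hypothetical.

```python
# One hypothetical sense identifier per documented species.
laureola_senses = {
    "ditmao:laureola_sense1": "Daphne oleoides Schreb.",
    "ditmao:laureola_sense2": "Daphne gnidium L.",
    "ditmao:laureola_sense3": "Daphne sericea Vahl",
    "ditmao:laureola_sense4": "Daphne mezereum L.",
}

triples = [("ditmao:laureola", "lemon:sense", s) for s in laureola_senses]

# Only the sense supplied by the Arabic and Hebrew terms is linked to them.
# The two targets are placeholders, not attested entries.
triples += [
    ("ditmao:laureola_sense4", "ditmao:correspondence", "ditmao:arabic_term_sense"),
    ("ditmao:laureola_sense4", "ditmao:correspondence", "ditmao:hebrew_term_sense"),
]

def is_polysemous(entry, triples):
    """An entry is polysemous if it has more than one lemon:sense."""
    return sum(1 for s, p, _ in triples if s == entry and p == "lemon:sense") > 1

print(is_polysemous("ditmao:laureola", triples))
```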

3 Conclusion and Outlook

We have shown how the lemon model can be adapted to the needs of historical lexicography by defining subproperties of the basic lemon properties: lemon:senseRelation, lemon:formVariant, lemon:element and lemon:property. Furthermore, we introduced our own domain-specific vocabulary for the description of form variants. In the spirit of lemon and, in general, of the Semantic Web, we plan to link the dictionaries to other resources. However, at the moment the most important resource related to Old Occitan (i.e. DOM (fn. 16)) is a database and is not exposed as linked data. Among the resources we are planning to use to provide the conceptual references of lexical senses are DBpedia (fn. 17), Wikidata (fn. 18) and more domain-specific datasets, such as TDWG (fn. 19) or the Biological Taxonomy Vocabulary (fn. 20).

To ease the process of modelling the various lexica in lemon and the construction of the ontologies of reference, we are also working on a web editor. None of the currently available tools for the editing of lexica and ontologies appears suited to our purpose. Protégé (fn. 21), probably the most widely used tool for the construction of ontological resources, is general enough to allow the building of lemon resources. However, the process can be quite tedious, requiring the manual construction of instances of entries, senses, forms and relations among them. In addition, it is a stand-alone tool which cannot be used collaboratively by a team of users (its Web version (fn. 22) has several limitations, such as the lack of support for reasoning mechanisms and plug-in extensions). We also plan to develop a controlled natural language querying interface to ease access to the resources.