The enrichment of lexical resources through incremental parsebanking
Automatic syntactic analysis of a corpus requires detailed lexical and morphological information that cannot always be harvested from traditional dictionaries. Therefore the development of a treebank presents an opportunity to simultaneously enrich the lexicon. In building NorGramBank, we use an incremental parsebanking approach, in which a corpus is parsed and disambiguated, and after improvements to the grammar and the lexicon, reparsed. In this context we have implemented a text preprocessing interface where annotators can enter unknown words or missing lexical information either before parsing or during disambiguation. The information added to the lexicon in this way may be of great interest both to lexicographers and to other language technology efforts.
Keywords: Lexical resources, INESS, NorGramBank, Treebanking, LFG, Language research infrastructure, Automatic syntactic analysis
Parsebanking is the creation of a treebank through automatic parsing of a corpus with a grammar and lexicon. Since this process results in a large number of analyses which can readily be inspected, it provides an excellent testing ground for the development of a lexicon as well as a grammar. As parsing requires fine-grained distinctions which are often overlooked in traditional lexicography, parsebanking presents a good and until now insufficiently recognized context for enrichment and testing of the lexicon.
The INESS project (Infrastructure for the Exploration of Syntax and Semantics) is developing NorGramBank, a large parsebank for Norwegian. In the process, a grammar and lexicon for Norwegian are being further developed in tandem. Since the parser requires quite detailed morphosyntactic information in order to provide an analysis, the lexicon must be syntactically well informed. In our experience, which will be discussed in some detail in this paper, feedback from the parsebanking process is valuable for testing and improving lexical information.
‘“How do you spell love?” asked Piglet.’
It is our hypothesis that traditional dictionaries are insufficient sources of lexical information for parsing and that adding unknown words and more precise and complete information about known words will significantly improve parsing. We hope to show that parsebanking is a productive context for discovering and describing words and their morphosyntactic properties.
In the following, we will first explain how the syntax and lexicon mutually inform each other in our parsebanking approach. In Sect. 3, the interface for preprocessing texts will be presented. Section 4 describes how words that are not recognized by the morphological analyzer are treated, while Sect. 5 details the procedure for adding information for known words. In Sect. 6 issues concerned with multiword expressions are presented.
2 Grammar development and incremental parsebanking
Most current manually checked treebanks are produced in part by parsing a corpus. However, not all sentences may automatically get a correct analysis, due to missing coverage in the grammar and lexicon. Many treebanking efforts remedy this problem by means of manual editing of the parses. This may result in analyses which are not compatible with the grammar which was used for parsing. Furthermore, editing the parses directly will not lead to enrichment or correction of the lexicon. In contrast, our approach is based on incremental improvement of the grammar and lexicon during the parsebanking process (Losnegaard et al. 2012; Rosén and De Smedt 2007). This approach results not only in a manually checked parsebank, but also in a grammar which is fully compatible with the analyses in the parsebank, and moreover, in substantial lexicon improvements, as will be described below.
The grammar used for creating NorGramBank is NorGram, a hand-written broad coverage computational grammar which has been used in several language technology projects (Dyvik 2000; Butt et al. 2002). It is written in the Lexical Functional Grammar (LFG) framework (Bresnan 2001; Dalrymple 2001), which allows for deep analyses of considerable grammatical detail. We use the Xerox Linguistics Environment (XLE) for grammar development and parsing (Maxwell and Kaplan 1993). The analyses produced by XLE with NorGram are disambiguated and stored in the parsebank. Regular reparsing after improvements to the grammar and lexicon provides improvements in coverage. Thus we aim to incrementally produce high quality gold standard treebanks, which in turn are used for training a stochastic disambiguator in order to produce larger fully automatically parsed and disambiguated treebanks. This methodology is similar to and inspired by the LinGO Redwoods treebanking approach (Oepen et al. 2004).
‘The dog panted.’
In LFG, the syntax and the lexicon have an important interaction with each other, especially in the treatment of predicate-argument structure. The lexical entry for each verb must specify which arguments a verb requires. For example, in a transitive sentence, the lexical entry for the verb must specify that the verb can take an object. This specification interacts with the syntax in such a way that no grammatical analysis will be assigned to sentences lacking syntactic arguments which the verb specifies, or containing syntactic arguments which the verb does not specify.
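The interaction described here corresponds to what LFG calls completeness and coherence. The following is a minimal sketch of that check, assuming a toy lexicon format; the `Frame` class, the `LEXICON` entries and the `licensed` function are hypothetical names introduced for illustration and do not reflect the NorGram or XLE entry format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Hypothetical subcategorization frame: the functions a verb governs."""
    functions: frozenset


# Toy lexicon for illustration only (not the NorGram format).
LEXICON = {
    "stotre": [Frame(frozenset({"SUBJ"}))],        # 'stammer', intransitive only
    "se": [Frame(frozenset({"SUBJ", "OBJ"}))],     # 'see', transitive only
}


def licensed(verb, found_functions):
    """Return True if some frame of `verb` matches the arguments exactly.

    A missing argument violates completeness; an extra argument violates
    coherence. Either way no frame matches and the sentence gets no analysis.
    """
    found = frozenset(found_functions)
    return any(frame.functions == found for frame in LEXICON.get(verb, []))
```

With these toy entries, `licensed("se", {"SUBJ"})` is false because the object is missing, and `licensed("stotre", {"SUBJ", "OBJ"})` is false because the verb does not specify an object.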
As a result of the ubiquitous ambiguity of natural languages, parsing with a high-coverage formal grammar and lexicon will often return a very high number of alternative analyses for a sentence, whereas normally only one of those analyses will be appropriate in the given context. Some degree of manual disambiguation is unavoidable for the purpose of building a gold standard parsebank, which subsequently may be used for training a stochastic disambiguator. Whereas annotators in our approach never manually edit an analysis, they must verify if the parser has produced a correct analysis, and choose the correct analysis if several possible analyses are produced.
The disambiguation process has been optimized through the use of discriminants (Carter 1997; Oepen et al. 2004). The parsebanking system automatically analyzes the forest of alternatives, reducing it to a set of binary discriminants which allow the annotators to efficiently distinguish and select among a high number of alternatives (Rosén et al. 2007, 2009, 2012). While disambiguating, the annotators may discover that the correct analysis is not among the alternatives produced by the parser. In that case they first attempt to diagnose the problem, and often they may solve it by updating the lexicon and reparsing. If the problem persists, a change in the grammar may be necessary, which is reported through the issue tracking system that is integrated into the disambiguation interface.
This potentially continuous approach is scalable: new text can be automatically parsed and disambiguated stochastically by training on the manually disambiguated material. The information which is stored as a result of manual disambiguation is not just the selected analysis, but also the discriminant values chosen by the annotators, along with the rest of the analyses. Hence, when the entire treebank has been reparsed with the updated grammar (which happens at certain intervals), the stored discriminant values can be reapplied to the new set of alternative analyses, which is frequently sufficient to pick out a unique solution again. As mentioned above, this methodology is inspired by LinGO Redwoods (Oepen et al. 2004). What is novel in our approach is that we have designed and implemented discriminants for LFG grammars, and that the entire process is supported through a web-based annotation interface.
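The idea of storing discriminant values and reapplying them after a reparse can be sketched as follows. This is a simplified illustration, assuming that each analysis can be reduced to a set of property labels; the property strings and the function names `discriminants` and `apply_choices` are invented here and are not taken from the INESS implementation.

```python
def discriminants(analyses):
    """Return the properties that hold in some but not all analyses."""
    if not analyses:
        return set()
    property_sets = [set(props) for props in analyses.values()]
    return set.union(*property_sets) - set.intersection(*property_sets)


def apply_choices(analyses, choices):
    """Keep only the analyses consistent with every stored yes/no decision."""
    return {
        name: props
        for name, props in analyses.items()
        if all((prop in props) == value for prop, value in choices.items())
    }


# First parse: two analyses that differ only in attachment (toy labels).
first_parse = {
    "a1": {"PP modifies verb", "verb is transitive"},
    "a2": {"PP modifies noun", "verb is transitive"},
}
# The annotator answers one discriminant; the answer is what gets stored.
stored_choices = {"PP modifies verb": True}

# After a grammar update the analyses are new objects, but the stored
# decision can still be applied to them.
reparse = {
    "b1": {"PP modifies verb", "verb is transitive", "new feature"},
    "b2": {"PP modifies noun", "verb is transitive", "new feature"},
}
remaining = apply_choices(reparse, stored_choices)
```

Properties shared by all analyses, such as the transitive reading in this toy case, are not discriminants and are never presented to the annotator. If a reparse introduces an ambiguity that the stored choices do not cover, more than one analysis remains and a further decision is needed.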
The advantage of this parsebanking approach is that the resulting parsebank will always be fully compatible with the grammar. Parsebanks constructed in this way therefore achieve a very high level of consistency. It is also the case, however, that only sentences that are grammatical according to the current grammar will be fully analyzed, while others may receive a fragment parse or may fail to parse.
Earlier we carried out a detailed study of a small subcorpus in order to find out what the main causes of failed analyses are (Losnegaard et al. 2012). This study found that 21 % of the sentences had a full analysis that was not the correct one. Moreover, the study identified the interventions that were necessary in order to achieve the intended analysis for these sentences. Since the sentences studied initially had some analysis, they all involved words that were recognized, but sometimes not with the correct morphological analysis. We found that 29 % of the failed analyses were caused by syntactic problems, while 71 % were caused by lexical problems. Of the lexical problems, 41 % were caused by missing multiword expressions (MWEs), whereas 41 % were caused by incorrect lexical categories. These numbers indicate that correct lexical information is essential for successful syntactic analysis.
A parsebanking approach of this kind requires a large lexicon with detailed morphosyntactic information. The main basis for the NorGram lexicon has been the NorKompLeks electronic lexicon (Nordgård 2000). This lexicon is an adapted version of two traditional dictionaries of Norwegian: Bokmålsordboka (Landrø and Wangensteen 1993) and Nynorskordboka (Hovdenak et al. 1986). These dictionaries were developed by the University of Oslo and the Norwegian Language Council (Språkrådet), and in practice they define the official norms for spelling and inflection. Bokmålsordboka has approximately 70,000 lemmas, while Nynorskordboka has approximately 90,000. The dictionaries contain both etymologies and examples. The web versions are standard works of reference for most Norwegian users, with more than 70 million searches per year between them. The NorKompLeks lexicon added subcategorization frames for the verbs. The NorKompLeks format was converted by means of a program into the format required by XLE. Morphological analysis is handled by finite-state transducers derived from the resource Norsk Ordbank (Norwegian Word Bank), a database which contains inflectional and other information about all entries in Bokmålsordboka and Nynorskordboka, in addition to further material. However, as we will see below, the lexical information in NorKompLeks and Norsk Ordbank is not always complete and accurate, and needs to be supplemented.
3 Text preprocessing
An important source of texts for NorGramBank is a large repository of OCR-read fiction texts supplied by the National Library of Norway. Because OCR software makes certain errors, such as misinterpreting characters, omitting text, or inserting unwanted material, the documents must be preprocessed before syntactic parsing. Moreover, when a corpus is parsed, there will always be words that are unknown to the morphological analyzer and/or the lexicon. INESS has therefore developed an intelligent browser-based preprocessing interface which facilitates efficient text cleanup and the treatment of unknown word forms (Rosén et al. 2012).
The first step is text cleanup, which involves for example removing superfluous material that does not belong to the text, joining parts of sentences that have erroneously been split, and adding punctuation where it is missing. The interface offers practical editing functions for these cleanup operations.
After text cleanup, the annotators process word forms that have not been automatically recognized. The preprocessing interface presents a list of unknown words. Some of these are errors which must be corrected in the text itself before parsing, such as OCR errors, incidental misspellings, and typos. Other unknown words should be covered in the lexicon. Examples are names, foreign words, neologisms, productive compounds not recognized by the compound analyzer, and words only occurring in MWEs.
Nonstandard words of various types are also added to the lexicon. We distinguish between three main classes: archaic words, systematic misspellings, and forms belonging to nonstandard language varieties. An example of the first class, archaic words, is the plural noun form fjelle, in contrast to the current standard spelling fjell ‘mountains’. The second class, systematic misspellings, includes forms which are produced regularly by one or more authors. An example is the form tennveske, which is a common misspelling of tennvæske ‘charcoal lighter fluid’. Finally, the third class of nonstandard words covers forms that can be ascribed to a particular dialect, technolect, sociolect, or other language variety. An example is barnehagan, instead of the standard form barnehagen ‘the preschool’. The suffix -an in the nonstandard variant is used to imitate a dialect pronunciation. Instances of these three nonstandard classes are left unchanged in the text because normalizing them would be to interfere with actual language use.
The important common denominator of all types of unknown words which are not to be corrected is that while these forms fall outside standard dictionaries, it is a prerequisite for successful parsing that they nevertheless be included in our lexicon. Nonstandard words are explicitly marked as such in the lexicon, so that any reuse of the lexicon, for example for generation, would not result in these words being output inadvertently.
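The effect of marking nonstandard words can be illustrated with a small sketch in which analysis accepts every registered form while generation returns only standard ones. The `Form` class, the status labels and the function names are assumptions made for this illustration; the actual NorGram lexicon format is not shown in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    spelling: str
    lemma: str
    status: str = "standard"  # hypothetical labels: "archaic", "misspelling", "variety"


# Toy entries built from the examples in the text.
FORMS = [
    Form("tennvæske", "tennvæske"),
    Form("tennveske", "tennvæske", status="misspelling"),
    Form("barnehagen", "barnehage"),
    Form("barnehagan", "barnehage", status="variety"),
]


def analyze(spelling):
    """Parsing side: every registered form is recognized, standard or not."""
    return [form for form in FORMS if form.spelling == spelling]


def generate(lemma):
    """Generation side: nonstandard forms are filtered out."""
    return [
        form.spelling
        for form in FORMS
        if form.lemma == lemma and form.status == "standard"
    ]
```

In this sketch `analyze("barnehagan")` finds an entry, so the text can be parsed unchanged, whereas `generate("barnehage")` yields only the standard form.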
4 Adding unknown words during preprocessing
Table 1 Overview of the various types of unknown words added through preprocessing
Open word class (N, V, A, ADV)
Organization or brand name
Variant inflectional form
First name, masculine
First name, feminine
The preprocessing interface allows the annotators to add information about unknown words to the lexicon. Noninflecting words such as names and interjections are entered by assigning the appropriate lexical category to each entry. For words belonging to the open lexical classes the annotator specifies an inflectional pattern. Verbs must also be assigned subcategorization frames necessary for parsing. When a word is not recognized because of nonstandard spelling, the annotator must consider whether the spelling deviation concerns the stem or an inflection. Variant stems are registered with existing standard inflectional paradigms, and variant inflectional forms are registered as deviations from individual, standard inflectional forms. In order to add unknown words to the lexicon in an efficient way, the annotator makes use of a set of predefined options in the preprocessing interface. Each option corresponds to a certain type of entry. Most of these types can be entered by a single mouse click, while the recording of inflecting words and variant inflectional forms requires a few more steps.
4.1 Open word classes
Another example of an unknown word form is synonymiserer ‘synonymizes’, illustrated in Fig. 4. This is a productive verbal derivation from the noun synonym, inflected in the present tense. In order to add a new verb synonymisere, the annotator follows the same procedure as the one described for adding the noun narrativ. In this case, a verb with matching inflection, polemisere, has been selected from the drop-down menu in the “Inflects like” field, and this creates the proposed set of inflectional forms shown to the right in Fig. 4. Since synonymiserer is used intransitively in the given context, the annotator ticks off “intrans” in the “Verb frame” field before storing the new verb.
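The way an "Inflects like" choice can yield a proposed set of forms may be sketched by analogy with the model verb. This is only an illustration, assuming that the paradigm is derived by swapping suffixes on a shared stem; the listed forms of polemisere are the regular Bokmål forms as we understand them, and the selection of forms, the dictionary keys and the function name `inflect_like` are assumptions rather than a description of the INESS interface.

```python
from os.path import commonprefix

# Assumed paradigm of the model verb (regular Bokmål -ere verb).
MODEL = {
    "infinitive": "polemisere",
    "present": "polemiserer",
    "preterite": "polemiserte",
    "past participle": "polemisert",
    "imperative": "polemiser",
}


def inflect_like(new_lemma, model_forms):
    """Propose forms for `new_lemma` by copying the model's suffixes."""
    model_stem = commonprefix(list(model_forms.values()))
    suffixes = {name: form[len(model_stem):] for name, form in model_forms.items()}
    lemma_suffix = suffixes["infinitive"]
    if not new_lemma.endswith(lemma_suffix):
        raise ValueError("new lemma does not fit the model paradigm")
    # Slice by length so that an empty infinitive suffix is handled correctly.
    stem = new_lemma[: len(new_lemma) - len(lemma_suffix)]
    return {name: stem + suffix for name, suffix in suffixes.items()}


proposed = inflect_like("synonymisere", MODEL)
```

The proposed present tense form is then the one encountered in the text, synonymiserer. A procedure of this kind can only propose forms; the annotator still has to confirm them and select the verb frame.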
4.2 New compounds
Norwegian is a language with extensive productive compounding. Since compounds are written as single graphical words and compounding may be done on the fly, many legitimate compounds cannot be listed in the lexicon. Therefore an automatic compound analyzer is run on the text prior to preprocessing in order to identify compounds that are not already in the lexicon. Although the analyzer recognizes many compounds, the analysis of potential compounds is nevertheless restricted in order to prevent overgeneration.
Allowing compound constituents of fewer than three letters is generally considered a risk in automatic compound analysis; if such short constituents are allowed in general, many typos and misspelled words may be erroneously analyzed as compounds. We implement this restriction and allow short elements only if they are listed specially due to their observed occurrence in compounds.
Furthermore, some of the combinations that the compound analyzer allows have certain constraints imposed on them. For noun + adjective compounds, only a few nouns that occur frequently as the first element in compounds are allowed; examples are kjempe ‘giant’, drit ‘shit’, and rekord ‘record’. This explains why the compound avisgrå ‘newspaper gray’ was not recognized. This example, and numerous others, such as guttegærn ‘boy crazy’, silkehvit ‘silk white’ and helseriktig ‘health correct’ (‘healthy’), show that this constraint is too strong. For adjective + verb compounds, the verb is restricted to only being a past participle, which is the reason why blekpudre ‘pale powder’ (‘powder something to make it pale’) was not recognized. Again, however, this restriction seems too strong, since there are many examples of other forms of verbs than the past participle in this type of compound: ansvarliggjøre ‘responsible make’ (‘make responsible’), finpiske ‘fine whip’ (‘whip until fine’), hardkode ‘hard code’, etc.
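The restrictions discussed above can be made concrete with a small sketch of a two-part compound splitter. It is a toy under stated assumptions: the lexicon entries, the tag names and the function `analyze_compound` are invented for illustration, linking elements are ignored, and the list of specially allowed short elements is left empty. It is not the compound analyzer used in INESS.

```python
MIN_LEN = 3
SHORT_OK = set()  # short elements listed specially; empty in this toy
NOUNS_BEFORE_ADJ = {"kjempe", "drit", "rekord"}
ALLOWED_TYPES = {("N", "N"), ("N", "A"), ("A", "N"), ("A", "V"), ("V", "N")}

# Toy lexicon: word -> (part of speech, form); form is only used for verbs.
LEXICON = {
    "te": ("N", None), "lys": ("N", None), "avis": ("N", None),
    "kjempe": ("N", None), "kåpe": ("N", None),
    "grå": ("A", None), "fin": ("A", None), "blek": ("A", None),
    "pudre": ("V", "infinitive"), "pudret": ("V", "past participle"),
}


def analyze_compound(word):
    """Return all two-part splits of `word` that pass the restrictions."""
    results = []
    for i in range(1, len(word)):
        first, second = word[:i], word[i:]
        if first not in LEXICON or second not in LEXICON:
            continue
        # Short constituents are rejected unless listed specially.
        if any(len(p) < MIN_LEN and p not in SHORT_OK for p in (first, second)):
            continue
        pos1, _ = LEXICON[first]
        pos2, form2 = LEXICON[second]
        if (pos1, pos2) not in ALLOWED_TYPES:
            continue
        # Noun + adjective: only a few frequent first elements are allowed.
        if (pos1, pos2) == ("N", "A") and first not in NOUNS_BEFORE_ADJ:
            continue
        # Adjective + verb: the verb must be a past participle.
        if (pos1, pos2) == ("A", "V") and form2 != "past participle":
            continue
        results.append((first, second))
    return results
```

Under these assumptions avisgrå and blekpudre are rejected, as described in the text, while a form with a listed first noun (kjempe + grå) or a past participle (blek + pudret) is accepted. Relaxing a constraint amounts to extending `NOUNS_BEFORE_ADJ`, `SHORT_OK` or `ALLOWED_TYPES`, at the cost of more spurious analyses.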
Another reason why compounds are not recognized is that special forms which are only used in compounds are missing from the lexicon. An example is engleflokk ‘angel flock’ (‘flock of angels’), where engle is a variant of engel ‘angel’. Other examples are billedramme ‘picture frame’, where billed is an archaic form of bilde ‘picture’, and faktafeil ‘facts error’ (‘factual error’), where fakta is the plural of faktum ‘fact’. Although compounds with such special forms occur in the dictionaries that were used as sources, the specific first elements themselves were missing.
Finally, some compounds are not recognized because one or both of their constituents are misspelled. Examples are hårshampo, a misspelling of hårsjampo ‘hair shampoo’, and cafébord, a misspelling of kafébord ‘café table’.
Table 2 Overview of the most common compound types recognized during preprocessing
Compound type              Example
Noun + noun                te + lys ‘tea light’
Noun + adjective           avis + grå ‘newspaper gray’
Adjective + adjective      blå + brun ‘blue brown’
Adjective + noun           fin + kåpe ‘nice coat’
Preposition + noun         av + knapp ‘off button’
Preposition + adjective    gjennom + korrupt ‘through corrupt’
Preposition + verb         av + beite ‘off graze’
Noun + verb                dybde + bore ‘depth drill’
Verb + noun                ete + fest ‘eat party’
Adjective + verb           blek + pudre ‘pale powder’
Verb + adjective           drikke + klar ‘drink ready’
Table 2 gives an overview of the most common compound types that have been registered by annotators in this way. The column headed CA (for compound analyzer) shows which types the compound analyzer currently allows: noun + noun, noun + adjective, adjective + noun, adjective + verb and verb + noun. The reason why only these five were allowed initially was that they were assumed to be the most frequent types; allowing too many possible combinations could lead to many incorrect analyses of unknown words. The overview of types that were actually found shows that there are several additional frequent types that should be considered for incorporation into the compound analyzer. A detailed study of the individual examples in the different categories will help to determine which new types should be added to the compound analyzer, as well as which frequent short elements should be allowed.
4.3 Other types of unknown words
Names are a particularly frequent type of unknown word. They are typically missing from dictionaries. Table 1 shows that unknown last names, organization or brand names, and place names are very common. Since names are normally invariable, they can simply be assigned a part of speech.
‘“I’m not really an alien for you,” said Auguste.’
‘He went into American Bar, which boasted air conditioning.’
Missing lexical entries like these are easily added to the lexicon when they are identified in the preprocessing step. In this case, American Bar was entered as an organization name, and alien and air conditioning were entered as loans.
A particularly productive part of speech is interjections; especially writers of fiction are very creative in the way in which they write interjections. Bokmålsordboka has an entry for the interjection hysj ‘hush’ which also includes the alternative spelling hyss. There are several occurrences of this interjection in the fiction texts of NorGramBank, and many of them do not have either of the two standard spellings. The following eight variants of hysj/hyss have been registered so far: hysjjj, hyssj, hysssj, hyssssjjj, hysssssj, hysssssjjj, hysst, hyyyysssjjj. These examples show that the spelling of this interjection is unpredictable and to a large extent determined by the way in which an author chooses to express it in a given context. For parsebanking purposes, the challenge is that each time a new spelling is encountered, it is displayed in the preprocessing interface as an unknown word. The INESS interface makes it possible for annotators to add new variant spellings to an existing interjection in the lexicon.
In conclusion, as the annotator processes the unknown words, these words and the necessary information about them are added to the lexical resources exploited by the parser.
5 Known words with missing or incorrect information
Table 3 Overview of lexicon updates made by annotators for known words
Type of lexicon update
Transitive readings (incl. ditransitive)
Verb complement readings
Intransitives with expletive subject
Added count nouns
Added title readings
Adverbs and prepositions
Added adverb readings
Even though the NorKompLeks lexicon has added subcategorization frames for the verbs in Bokmålsordboka and Nynorskordboka, many quite common frames are not included. Table 3 gives an overview of the types of lexicon updates made by annotators during disambiguation. As shown in this table, the most frequent type of lexicon update concerns subcategorization frames for verbs. These instances cover a large number of different types of verb frames, which are sorted into six categories: MWE frames, intransitive readings, intransitives with expletive subject, transitive readings, verb complement readings, and inquit readings. New verb frames involving MWEs account for over half of these cases. In Table 3 the six groups of verb frame types, as well as the other types of lexicon updates, are listed in descending order of frequency.
‘The path flattens out.’
‘The mountain flattens out.’
‘There is flattening out.’
‘There is a flock of birds flattening out.’
‘But grandfather declined.’
‘“What do you mean by that?” she stammered.’
The sentence in example (15) was initially given a partial analysis by the parser. That is, the word sequences Hva mener du med det and stotret hun were respectively identified as sentence units, but no complete analysis was found, because the lexicon entry for the verb stotre ‘stammer’ contained only an intransitive reading. An inquit reading was added to the entry, and after reparse the sentence Hva mener du med det? was successfully analyzed as a sentential complement to the inquit verb.
Table 3 also presents numbers for lexicon updates concerning nouns, adverbs, prepositions, and adjectives. With respect to nouns and adjectives, the data indicate that these categories, too, show a considerable need for added subcategorization frames involving MWEs. Moreover, Table 3 shows that adding mass readings for nouns is another frequent type of lexicon update. Traditionally, dictionaries for Norwegian do not provide information on the distinction between mass terms and countables, but this information is required for producing correct syntactic analyses with NorGram. By default, the noun entries in the NorGram lexicon are therefore count nouns, and mass readings are added as they are encountered in the corpus.
One could imagine a number of automated procedures that create new lexical entries with modified subcategorization frames or features on the fly. A procedure that has been implemented in NorGram is similar to the Universal Grinder (Pelletier 1975), which produces mass noun readings from count nouns. In order to prevent overgeneration, the grinder is only applied in cases where the parser does not produce any analysis for a sentence; in these cases, all count nouns get mass readings as alternatives, and the sentence is automatically reparsed.
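The control flow of this fallback can be sketched as follows. The sketch assumes a lexicon that maps nouns to sets of readings and uses a deliberately trivial stand-in parser; `toy_parse`, `parse_with_grinder`, the determiner list and the example sentence are all invented for illustration and say nothing about how XLE or NorGram implement the procedure.

```python
DETERMINERS = {"en", "ei", "et"}


def toy_parse(tokens, lexicon):
    """Stand-in parser: a noun after a determiner needs a count reading,
    a bare noun needs a mass reading. Other words are ignored."""
    for i, token in enumerate(tokens):
        readings = lexicon.get(token)
        if readings is None:
            continue
        needed = "count" if i > 0 and tokens[i - 1] in DETERMINERS else "mass"
        if needed not in readings:
            return []
    return [" ".join(tokens)]


def parse_with_grinder(tokens, lexicon, parse):
    """Parse normally; only if that fails, retry with mass readings added."""
    analyses = parse(tokens, lexicon)
    if analyses:
        return analyses, False
    # Build a temporary lexicon; the stored lexicon is left untouched.
    ground = {
        word: (readings | {"mass"} if "count" in readings else set(readings))
        for word, readings in lexicon.items()
    }
    return parse(tokens, ground), True


lexicon = {"kylling": {"count"}}  # 'chicken', count reading by default
analyses, used_grinder = parse_with_grinder(
    ["vi", "spiste", "kylling"], lexicon, toy_parse
)
```

Because the grinder is only tried after a failed parse, sentences that already have an analysis never receive the additional mass readings, which is what keeps overgeneration in check.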
One could also imagine similar procedures for verbs. However, for verbs, there are many possible subcategorization frames, and allowing automatic postulation of unattested frames would easily lead to overgeneration. Therefore we only produce new subcategorization frames manually for cases that are present in the corpus.
Table 3 shows that lexicon updates involving new readings of adverbs constitute another frequent type. This illustrates how the lexical category of a given word must often be more fine-grained than what is provided by the lexicon. In the case of adverbs, there is only one large class with the part of speech ADV in the original lexicon. However, different types of adverbs vary considerably in their syntactic distribution, and it is therefore necessary to classify them into subcategories in order to account for this distribution. NorGram distinguishes between 24 categories of adverbs based on syntactic position, usually named according to their typical semantic contribution.
‘I actually have nothing to add, I suppose.’
‘quite far from the lake’
These examples illustrate that a descriptively adequate treatment of Norwegian needs to distinguish between different classes of adverbs motivated by their syntactic distribution. Such distinctions are not only relevant for parsing, but also for other purposes, such as language learning.
6 Multiword expressions
As already noted in the previous section, MWEs are involved in many of the necessary lexical updates. The term MWE is frequently used in computational linguistics and refers to the idiomatic, often non-literal part of the language. The notion of idiomaticity has been applied in numerous and various ways, but is generally associated with properties such as lexical and grammatical fixedness (or frozenness), convention, and non-compositionality (Nunberg et al. 1994; Moon 1998; Cowie 1998; Sag et al. 2002; Baldwin and Kim 2010). Non-compositionality refers to a situation where the linguistic properties of an expression cannot be fully derived from the properties of its component words and the way in which these normally combine, and it is central to many of the problems encountered in parsing of MWEs. Traditional dictionaries often list idioms as examples, but fail to provide information about their idiomaticity.
MWEs, and in particular MWEs that are idiosyncratic at the linguistic level (lexicalized MWEs in the terminology of Sag et al. 2002), present a great challenge for parsing because they exceed word boundaries, have unpredictable or irregular morphosyntactic properties, and are sometimes discontiguous. The most immediate problem with MWEs, however, simply concerns recognizing them as such (Losnegaard et al. 2012). Although there are a considerable number of MWE entries in NorKompLeks (more than 2500 prepositional verbs, 1800 particle verbs and almost 400 fixed expressions), these are far from sufficient for accounting for all of the MWEs occurring in our corpus.
‘He adjusted his wig and slowly turned to face her.’
‘What was the point of embarrassing me like that?’
‘She became very concerned with brushing cake crumbs off of her coat.’
In constructions with selected prepositions, the verb, noun or adjective will as a rule keep its original meaning, while the meaning of the preposition is semantically bleached and does not contribute to the semantics of the overall construction to any large extent. An example of this is le av ‘laugh of’ (‘laugh at’), where the verb retains its main sense ‘laugh’, and the preposition introduces as an argument the participant causing the laughter. Insofar as the preposition conveys some modification of the main predicate, this change in meaning will be idiosyncratic and fairly transparent. The meaning of the expression is thus not fully compositional, and the preposition to be used is not fully predictable. In example (18), the inherent and concrete meaning of the verb rette ‘straighten’ is preserved while the addition of the preposition på ‘on’ invokes the more specialized and figurative meaning ‘make right’, ‘adjust’.
In this respect, constructions with selected prepositions are situated somewhere between institutionalized MWEs (i.e. MWEs that are linguistically regular but whose component words have a high frequency of cooccurrence) and semantically transparent idioms; the relation between the main predicate and the preposition can, more than anything, be viewed as a special case of lexical preference. Whether these are to be treated as special constructions in the dictionary or grammar, dealt with compositionally, or accounted for as a valence property of the main predicate is a decision for the lexicon and grammar developer. The selected preposition is treated not as a lexical word, but as a grammatical word which is analyzed as an incorporated element in the predicate and whose main function is to signal the semantic role of an argument.
‘Why is the moon completely crisscrossed by cracks and ridges?’
Conventional dictionaries usually provide limited information about MWEs, and their treatment is sometimes incomplete or incoherent. Often the expressions are not given as separate lexical entries, but occur as examples in the definitions of single-word entries. This information is difficult to extract when constructing an electronic lexicon. The case of på kryss og tvers exemplifies this problem.
In Bokmålsordboka, the phrase på kryss og tvers occurs as an example both under the entry for kryss ‘cross’ and the entry for tvers ‘across’, but does not exist as an entry of its own. The entry for kryss, whose main category is specified as a noun, has a sense partition named kryssende bevegelse ‘crossing movement’, implying that ‘crossing movement’ is a specific meaning pertaining to kryss. For this sense, the example phrase gjennomsøke området på k- og tvers ‘search the area in every direction’ is given without further information. In the entry for tvers, whose main category is an adverb, a sense partition states that the word is also used as a noun, but no information is given about its meaning as such. For this sense, the expression på kryss og t- is given along with a definition: ‘i alle retninger’ (‘in all directions’).
Although the MWE is referenced twice in Bokmålsordboka, neither entry explicitly refers to it as an expression. The information included in the two entries varies and the structures of the subentries are also different. As a consequence, på kryss og tvers is not listed in NorKompLeks, and the MWE was added to the NorGram lexicon by an annotator during disambiguation of the treebank. Adding lexical entries for hitherto unanalyzed MWEs is thus an important factor for increasing parsing coverage. Moreover, the addition of words with spaces to the lexicon during parsebanking results in a coherently classified inventory of fixed MWEs.
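Treating a fixed MWE as a word with spaces can be sketched as a longest-match merge over the token sequence before lexical lookup. This is an illustrative simplification, assuming a plain list of tokens and an inventory keyed by lowercased token tuples; the names `MWES` and `merge_mwes` are invented here and do not describe the XLE tokenizer.

```python
# Toy inventory of fixed MWEs, keyed by their lowercased tokens.
MWES = {
    ("på", "kryss", "og", "tvers"): "på kryss og tvers",
}


def merge_mwes(tokens, mwes):
    """Merge the longest matching fixed MWE at each position into one token."""
    max_len = max(map(len, mwes), default=0)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in mwes:
                merged.append(mwes[candidate])
                i += n
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = merge_mwes(["området", "på", "kryss", "og", "tvers"], MWES)
```

After merging, the expression is a single lexical item and can receive one lexical category, which is only appropriate for fixed expressions. MWEs that inflect or allow intervening material need a different treatment.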
The NorGram lexicon also includes verbal idioms. These are idioms with a verbal core and a selected object; they have limited variability and have a non-compositional meaning. Some expressions are monovalent and only require a subject. Examples are finne sted ‘find place’ (‘take place’, ‘happen’) and klage sin nød ‘complain one’s distress’ (‘complain’, ‘pour out one’s troubles’), which both consist of a verb and a selected nominal object. Another type is exemplified by falle på kne ‘fall on knee’ (‘go down on one’s knees’, ‘grovel’), with a verb and a selected oblique object in the form of a prepositional phrase.
In each idiom frame, the verb predicate is extended with the fixed components of the idiom. Morphosyntactic restrictions on idiom components, such as definiteness and number for nouns, temporal or aspectual constraints, restrictions on passivization, etc., are regulated by special templates. The two examples of monovalent MWEs with a selected nominal object, finne sted ‘happen’ and klage sin nød ‘complain’, differ with respect to the definiteness of the object. The entries thus call the templates VPIDIOM-INDEFOBJ and VPIDIOM-DEFOBJ, respectively.
When certain morphological properties are particular to a MWE, these are specified directly in the MWE entry. The MWE klage sin nød, for instance, has special restrictions on the determiner of the noun nød, which is mandatory and must be a possessive, and the number of the noun, which must be singular.
‘“Serious incidents have occurred!” he shouts.’
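The interplay between templates and entry-specific restrictions described above can be illustrated with a minimal sketch. It is written in Python rather than in the XLE lexicon format, and it is not the NorGram implementation. Only the template names VPIDIOM-INDEFOBJ and VPIDIOM-DEFOBJ come from the text. The classes `NounPhrase` and `IdiomEntry`, the function `licenses`, the feature values, and the assumption that each template contributes nothing beyond a definiteness requirement are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NounPhrase:
    """Hypothetical, simplified feature bundle for an object noun phrase."""
    noun: str                          # lemma of the head noun
    definite: bool                     # definiteness of the noun phrase
    number: str                        # "sg" or "pl"
    determiner: Optional[str] = None   # e.g. "poss" for a possessive


@dataclass(frozen=True)
class IdiomEntry:
    """Hypothetical idiom entry: verb, selected noun, template, extra restrictions."""
    verb: str
    noun: str
    template: str                              # template name as given in the text
    required_determiner: Optional[str] = None  # restriction stated in the entry itself
    required_number: Optional[str] = None      # restriction stated in the entry itself


# Assumption: each template is reduced here to a definiteness requirement only.
TEMPLATE_DEFINITENESS = {
    "VPIDIOM-INDEFOBJ": False,
    "VPIDIOM-DEFOBJ": True,
}


def licenses(entry: IdiomEntry, verb: str, obj: NounPhrase) -> bool:
    """Return True if verb and object satisfy every restriction of the entry."""
    if verb != entry.verb or obj.noun != entry.noun:
        return False
    if obj.definite != TEMPLATE_DEFINITENESS[entry.template]:
        return False
    if entry.required_determiner is not None and obj.determiner != entry.required_determiner:
        return False
    if entry.required_number is not None and obj.number != entry.required_number:
        return False
    return True


FINNE_STED = IdiomEntry("finne", "sted", "VPIDIOM-INDEFOBJ")
KLAGE_SIN_NOD = IdiomEntry(
    "klage", "nød", "VPIDIOM-DEFOBJ",
    required_determiner="poss", required_number="sg",
)
```

In this sketch, the restrictions shared by many idioms live in the template table, while the restrictions particular to klage sin nød (possessive determiner, singular noun) are stored in the entry itself, mirroring the division of labour described in the text.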
Other idioms may subcategorize for both a subject and a complement. A complement may be nominal (OBJ), clausal (COMP) or infinitival (XCOMP).9 Examples are sette pris på OBJ|COMP|XCOMP ‘put price on’ (‘appreciate’), få tak i OBJ ‘get grasp in’ (‘get hold of’), gjøre et (stort) nummer av OBJ|COMP|XCOMP ‘do a (big) number of’ (‘make a big deal about’), and legge merke til OBJ|COMP ‘lay mark to’ (‘notice’). All of these have the syntactic structure verb, selected noun and selected preposition. A divalent idiom with a different syntactic pattern is bringe OBJ på bane ‘bring OBJ on field’ (‘bring (something) up’), where the fixed elements are the verb and a selected prepositional phrase.
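The notation used for these idioms, with fixed components written out and alternative complement functions separated by a vertical bar, can be unpacked mechanically. The following Python sketch is only an illustration of that notation. The function `parse_frame` and the set `FUNCTIONS` are hypothetical names, and the sketch records neither the position of the complement slot nor optional material such as (stort).

```python
from typing import FrozenSet, Tuple

# Complement functions mentioned in the text.
FUNCTIONS = {"OBJ", "COMP", "XCOMP"}


def parse_frame(notation: str) -> Tuple[Tuple[str, ...], FrozenSet[str]]:
    """Split an idiom notation into its fixed words and its complement alternatives.

    A whitespace-separated token counts as the complement slot if every part
    between vertical bars is a known function name; all other tokens are
    treated as fixed components of the idiom.
    """
    fixed = []
    alternatives: FrozenSet[str] = frozenset()
    for token in notation.split():
        parts = token.split("|")
        if all(part in FUNCTIONS for part in parts):
            alternatives = frozenset(parts)
        else:
            fixed.append(token)
    return tuple(fixed), alternatives


def accepts(notation: str, function: str) -> bool:
    """Return True if the idiom allows a complement with the given function."""
    _, alternatives = parse_frame(notation)
    return function in alternatives
```

For example, `accepts("legge merke til OBJ|COMP", "XCOMP")` is false under this sketch, whereas `accepts("sette pris på OBJ|COMP|XCOMP", "XCOMP")` is true.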
The treatment of idioms in the source dictionaries is no more consistent than that of the other MWEs discussed above. The expression finne sted, for instance, is listed in Bokmålsordboka under the entry for sted ‘place’ as an example under a sense partition labeled by, bygd, strøk (‘town, village, district’). Although it is given with a definition, ‘foregå’ (‘happen’), it occurs together with other examples illustrating the concrete sense, and no information is given on idiomaticity or variability. Under the verb finne, however, we find more explicit information about the expression. This entry has a separate subentry with the heading i uttrykket ‘in the expression’, which is the most common way of marking expressions as such. The idiom is listed under this heading along with its definition ‘hende, skje’ (‘happen, occur’).
7 Evaluation and conclusion
Correct lexical information is essential for successful syntactic analysis, but we have found that lexical resources derived from dictionaries lack much necessary information because they are typically not tested in parsing.
In order to measure the impact of preprocessing on coverage, we analyzed all parsed sentences with a tokenizer and a morphological analyzer that were built solely on the basis of the lexicon as it was before any text preprocessing. As detailed earlier, a sentence can only be successfully parsed if all its word tokens are recognized by the morphological analyzer. When MWEs are taken into account, a sentence may have several possible tokenizations, and a successful parse requires that all tokens be recognized in at least one of them. Therefore, when a given sentence had no tokenization in which all tokens were recognized by the morphological analyzer, we concluded that the sentence would not have been analyzed without the additional extracted morphology. The measured difference in coverage was substantial: among the 3,312,452 parsed sentences, 219,933 (6.6 %) would not have received any analysis without preprocessing. Moreover, many more sentences would have received a full analysis, but not the correct one, because of insufficiencies in the lexical resources, as discussed above.
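The coverage criterion used in this measurement can be sketched as follows. This is a minimal Python illustration and not the INESS tokenizer or morphological analyzer: the morphological analyzer is replaced by simple set membership, and the word lists, the sentences, and the names `tokenizations`, `has_recognized_tokenization` and `uncovered_share` are invented for the example.

```python
from typing import Iterator, List, Sequence, Set


def tokenizations(words: Sequence[str], mwes: Set[str], max_len: int = 4) -> Iterator[List[str]]:
    """Yield every segmentation of `words` into single words and listed MWEs."""
    if not words:
        yield []
        return
    # Option 1: the first word is a token on its own.
    for rest in tokenizations(words[1:], mwes, max_len):
        yield [words[0]] + rest
    # Option 2: the first n words form an MWE ("word with spaces") in the lexicon.
    for n in range(2, min(max_len, len(words)) + 1):
        candidate = " ".join(words[:n])
        if candidate in mwes:
            for rest in tokenizations(words[n:], mwes, max_len):
                yield [candidate] + rest


def has_recognized_tokenization(words: Sequence[str], known_words: Set[str], mwes: Set[str]) -> bool:
    """True if at least one tokenization consists of recognized tokens only."""
    return any(
        all(token in known_words or token in mwes for token in tokens)
        for tokens in tokenizations(words, mwes)
    )


def uncovered_share(sentences: Sequence[Sequence[str]], known_words: Set[str], mwes: Set[str]) -> float:
    """Fraction of sentences for which no tokenization is fully recognized."""
    failed = sum(
        1 for words in sentences
        if not has_recognized_tokenization(words, known_words, mwes)
    )
    return failed / len(sentences)


# Toy data, invented for illustration only.
BASE_WORDS = {"de", "lette", "etter", "på", "kryss", "og", "tvers"}
ADDED_WORDS = {"bloggposten"}       # stands for a word entered during preprocessing
MWES = {"på kryss og tvers"}        # stands for an MWE entered by an annotator
SENTENCES = [
    ["de", "lette", "på", "kryss", "og", "tvers"],
    ["de", "lette", "etter", "bloggposten"],
]

# Share reported in the text, recomputed from the two sentence counts.
REPORTED_SHARE = 219_933 / 3_312_452
```

With the toy base lexicon, the second sentence has no fully recognized tokenization, whereas both sentences pass the check once the added word is available. The check concerns recognition only: a sentence that passes it may still receive an incorrect analysis.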
In conclusion, we found that authentic text contains a wide variety of word forms which are not included in traditional dictionaries. Furthermore, traditional dictionaries do not cover all ways in which words are used, for example with respect to their subcategorization or in MWEs. Our parsebanking efforts are mainly aimed at high-quality treebanks and compatible grammars, but they also yield tested and updated lexical resources as a secondary result. Because these resources help overcome the limitations of traditional dictionaries, we find this result substantial and deserving of more attention.
Although some nonstandard words may not be desirable in a lexicon for language generation, they are useful for parsing where missing items can cause failure. Information about nonstandard words and new compounds can also be useful for other applications such as automatic proofreading. Some information which we add, such as valency, mass terms and MWEs, may in modified form be included in dictionaries and language teaching materials.
One possible approach to the issue of missing lexical information would consist of using more information sources (gazetteers) and making informed guesses. Although our lexicon already includes large lists of named entities, a named entity recognizer might spot a few more potential names. However, since we are developing a gold standard parsebank, any guesses would have to be manually checked anyway. In this context, the benefit of checking a guess over adding an unknown word is small.
Good practice in lexicon development presupposes the involvement of trained annotators, but also the use of a sophisticated preprocessing interface which promotes efficiency and consistency. In the present study we have described how the INESS preprocessing interface (Rosén et al. 2012), in its further developed form, has been useful in enriching the Norwegian lexicon. The software of our interface accommodates in principle any language, but the system would have to be adapted to the specific lexical categories, morphology, subcategorization, etc. of other languages.
The INESS project is building up a richer lexical resource for Norwegian and will continue to do so during the remainder of the project. The resulting reusable lexical resource will be made available upon completion of the INESS project in 2017.
Since the morphological structure of the words in the examples is not relevant in this article, we have not indicated morphological features in the glosses, but simply used two English words when necessary to render a Norwegian word.
For a discussion of interannotator agreement in the disambiguation process, see Dyvik et al. (2013).
The original paper (Losnegaard et al. 2012) erroneously suggests that 31 % were caused by incorrect lexical categories.
See http://www2.parc.com/isl/groups/nltt/xle/doc/walkthrough.html for an explanation of the XLE lexicon format.
Here and in the following examples of f-structures, we use the simplified XLE format where features other than predicates and functions are not displayed.
For a thorough account of MWEs and automatic analysis we refer to Sag et al. (2002).
Since the semantic arguments are not lexically fixed parts of the idiom, these are represented here in terms of their syntactic functions. Alternative realizations are given as disjuncts.
The work reported on in this article was carried out in the INESS project, which is funded by the Research Council of Norway and the University of Bergen. We are grateful to three anonymous reviewers for their constructive comments and suggestions.
- Baldwin, T., & Kim, S. N. (2010). Multiword expressions. In N. Indurkhya & F. J. Damerau (Eds.), Handbook of natural language processing, Chapter 12 (2nd ed.). Boca Raton, FL: CRC Press.
- Bresnan, J. (2001). Lexical-functional syntax. Malden, MA: Blackwell.
- Butt, M., Dyvik, H., King, T. H., Masuichi, H., & Rohrer, C. (2002). The parallel grammar project. In J. Carroll, N. Oostdijk, & R. Sutcliffe (Eds.), Proceedings of the workshop on grammar engineering and evaluation at the 19th international conference on computational linguistics (COLING) (pp. 1–7). Taipei: Association for Computational Linguistics.
- Carter, D. (1997). The TreeBanker: A tool for supervised training of parsed corpora. In Proceedings of the fourteenth national conference on artificial intelligence, Providence, RI (pp. 598–603).
- Cowie, A. P. (1998). Phraseology: Theory, analysis, and applications. Oxford: Clarendon Press.
- Dyvik, H. (2000). Nødvendige noder i norsk: Grunntrekk i en leksikalsk-funksjonell beskrivelse av norsk syntaks [Necessary nodes in Norwegian: Basic properties of a lexical-functional description of Norwegian syntax]. In Ø. Andersen, K. Fløttum, & T. Kinn (Eds.), Menneske, språk og felleskap (pp. 25–45). Oslo: Novus forlag.
- Dyvik, H., Thunes, M., Haugereid, P., Rosén, V., Meurer, P., De Smedt, K., et al. (2013). Studying interannotator agreement in discriminant-based parsebanking. In S. Kübler, P. Osenova, & M. Volk (Eds.), Proceedings of the twelfth workshop on treebanks and linguistic theories (TLT12) (pp. 37–48). Sofia: Bulgarian Academy of Sciences.
- Hovdenak, M., Killingbergtrø, L., Lauvhjell, A., Nordlie, S., Rommetveit, M., & Worren, D. (Eds.). (1986). Nynorskordboka: definisjons- og rettskrivingsordbok. Oslo: Det norske samlaget.
- Landrø, M. I., & Wangensteen, B. (Eds.). (1993). Bokmålsordboka: definisjons- og rettskrivningsordbok. Oslo: Universitetsforlaget.
- Losnegaard, G. S., Lyse, G. I., Thunes, M., Rosén, V., De Smedt, K., Dyvik, H., et al. (2012). What we have learned from Sofie: Extending lexical and grammatical coverage in an LFG parsebank. In J. Hajič, K. De Smedt, M. Tadić, & A. Branco (Eds.), META-RESEARCH Workshop on Advanced Treebanking at LREC2012, Istanbul, Turkey (pp. 69–76).
- Maxwell, J., & Kaplan, R. M. (1993). The interface between phrasal and functional constraints. Computational Linguistics, 19(4), 571–589.
- Moon, R. (1998). Fixed expressions and idioms in English: A corpus-based approach. Oxford: Oxford University Press.
- Nordgård, T. (2000). Nordkompleks: A Norwegian computational lexicon. In COMLEX 2000 workshop on computational lexicography and multimedia dictionaries (pp. 89–92). Patras: University of Patras.
- Rosén, V., & De Smedt, K. (2007). Theoretically motivated treebank coverage. In Proceedings of the 16th Nordic conference of computational linguistics (NoDaLiDa-2007) (pp. 152–159). Tartu: Tartu University Library.
- Rosén, V., De Smedt, K., Meurer, P., & Dyvik, H. (2012). An open infrastructure for advanced treebanking. In J. Hajič, K. De Smedt, M. Tadić, & A. Branco (Eds.), META-RESEARCH Workshop on Advanced Treebanking at LREC2012, Istanbul, Turkey (pp. 22–29).
- Rosén, V., Meurer, P., & De Smedt, K. (2007). Designing and implementing discriminants for LFG grammars. In T. H. King & M. Butt (Eds.), The proceedings of the LFG ’07 conference (pp. 397–417). Stanford: CSLI Publications.
- Rosén, V., Meurer, P., & De Smedt, K. (2009). LFG Parsebanker: A toolkit for building and searching a treebank as a parsed corpus. In F. Van Eynde, A. Frank, G. van Noord, & K. De Smedt (Eds.), Proceedings of the seventh international workshop on treebanks and linguistic theories (TLT7) (pp. 127–133). Utrecht: LOT.
- Rosén, V., Meurer, P., Losnegaard, G. S., Lyse, G. I., De Smedt, K., Thunes, M., et al. (2012). An integrated web-based treebank annotation system. In I. Hendrickx, S. Kübler, & K. Simov (Eds.), Proceedings of the eleventh international workshop on treebanks and linguistic theories (TLT11) (pp. 157–168). Lisbon: Edições Colibri.
- Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Lecture Notes in Computer Science. Proceedings of the third international conference on computational linguistics and intelligent text processing (Vol. 2276, pp. 189–206). Berlin: Springer.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.