Lexical Analysis of Textual Data
Keywords: Machine Translation; Information Retrieval System; Lexical Processing; Text Corpus; Lexical Resource
Lexical analysis refers to the association of meaning with explicitly specified textual strings, referred to here as lexical terms. These lexical terms are typically obtained from texts (whether natural or artificial) by a process called term extraction. The association of meaning with lexical terms involves a data structure known generically as a lexicon. The characteristic operation in using a lexicon is a lookup, where the input is a lexical term, and the output is a representation of one or more associated meanings. A lexicon consists of a collection of entries, each of which comprises an entry term and a meaning structure. Lookup entails finding any entries whose entry term matches the lexical term in question.
Here, the term lexical analysis is used to refer only to operations performed on complete words or word groups. Operations on the characters within words are the concern of morphology.
The use of text corpora for obtaining data on the properties of words may also be regarded as a branch of lexical analysis. A text corpus provides a large representative sample of a language or sublanguage, and may be used for generating data-sets such as lexicons (see later), for providing statistical information, and for identifying examples of particular language constructs for researchers. Operations on corpora include annotating words or word groups with grammatical or other information [9, 11].
The need for automatic lexical analysis dates back to the 1950s, when programming language compilers first had to recognize variable names and reserved words and assign an appropriate role to each one. As soon as computers started being used for processing natural language texts, lexical processing was required for extracting, recognizing, organizing, and correlating words and phrases. At first, lexicons for term recognition were compiled by hand, or adapted from existing dictionaries and lists, but later, automatic tools were developed, either for generating lexicons outright, or for reducing the amount of human effort involved in constructing them.
A natural language word is a sign whose meaning cannot be inferred (except in special cases such as onomatopoeic words) from its phonological or morphological structure. Lexical analysis is a process by which meanings are associated with specific words or other textual strings. These strings, which can be referred to as lexical terms, or just terms, are typically extracted during the scanning of some document.
Lexical terms may be individual words belonging to a natural or artificial language, but they may also include abbreviations such as “IBM” and “USAAF”, ‘pseudo-words’ such as “PL/1” and “B12”, and even multiword expressions (MWEs) such as “winter wheat” or “Boeing 747.” The key point is that the term must represent a known concept which is, or may be, relevant to some current need. The function of lexical analysis is to return information enabling that need to be satisfied.
A data structure or data collection which associates a set of lexical terms with their meanings is known as a lexicon. The meaning associated with a specified lexical term may be referred to as an output of the lexicon. The structure of the lexicon and the nature of the output vary according to what kind of information is required for the task at hand. Some applications may require access to two or more different lexicons, covering different (or partly different) sets of terms, and yielding different kinds of output.
Although the detailed arrangements can vary, a lexicon consists essentially of a set of entries, each of which maps a term to a meaning. In some lexicons, a given term can yield a number of alternative meanings. In such a case, the application program must choose the most appropriate meaning or (depending on the task at hand) make use of them all.
The archetypal lexicon is a traditional dictionary, in which each natural language word is accompanied by a definition of each distinct meaning of the word, together with indicators for pronunciation, syntax, etymology, etc.
In a lexicon, the key operation is the lookup, which takes a term and returns an output representing the required meaning or meanings. In cases where the term is not present in the lexicon, a “null” or “false” output is returned.
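The lookup operation can be sketched as follows, assuming a small lexicon held in memory as a Python dictionary. The entries and the function name `lookup` are illustrative only and are not taken from any particular system.

```python
from typing import Optional

# Hypothetical toy lexicon: each entry term maps to a list of meanings.
LEXICON: dict[str, list[str]] = {
    "oak": ["a tree of the genus Quercus"],
    "winter wheat": ["wheat sown in autumn and harvested the following summer"],
}


def lookup(term: str) -> Optional[list[str]]:
    """Return the meanings recorded for `term`, or None (the "null" output)
    when the term has no entry in the lexicon."""
    return LEXICON.get(term)
```

Here `None` plays the role of the "null" output; an application could equally return `False` or an empty list.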
In applications where the range of terms and meanings is very small, lexical analysis may be performed by encoding (or “hard wiring”) the relevant terms as literals in decision structures in a computer program.
Lexical Analysis in Text Processing
Reference to a lexicon is frequently required by an application program designed to process textual data, whether this consists of text in a natural language, or in an artificial language such as a programming language. In extracting terms from a text, the first step is normally to tokenize the text – that is, to divide the stream of characters into coherent groups called tokens, as appropriate for the task at hand. In a natural text, the tokens may include words, numbers, punctuation symbols, brackets, etc.
As noted earlier, lexical analysis typically involves the lookup of significant items such as words and phrases. Tokenization therefore usually involves, or is immediately followed by, a filtering process which recognizes and discards irrelevant items such as punctuation marks. In fact, in a language like English, extraction of single words can be achieved by regarding all characters except letters, and perhaps digits, as token delimiters. There remain only certain specific issues, such as the handling of hyphens and apostrophes, where the extracted terms need to conform to the practice used in the lexicon [6, 8].
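A minimal sketch of this delimiter-based extraction is given below. It assumes ASCII English text and adopts one possible convention for hyphens and apostrophes (keeping them when they occur inside a word); a real application would follow whatever convention its lexicon uses. The names `WORD` and `extract_words` are illustrative.

```python
import re

# Letters and digits form words; everything else acts as a delimiter.
# Assumption: a hyphen or apostrophe *inside* a word is kept, so that
# "well-known" and "isn't" are extracted as single terms.
WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def extract_words(text: str) -> list[str]:
    """Return the lower-cased words of `text`, discarding punctuation."""
    return [m.group(0).lower() for m in WORD.finditer(text)]
```

For example, `extract_words("The well-known B12 vitamin isn't rare.")` yields `['the', 'well-known', 'b12', 'vitamin', "isn't", 'rare']`.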
If the lexicon includes terms which are multiword expressions (MWEs), there is a problem in that MWEs are not explicitly flagged in a text, and so cannot be extracted directly. The only exception is the names of places, people and organizations, such as “Department for Transport” or “United Arab Emirates,” where the main component words are capitalized.
Lexical terms are almost always represented by grammatical noun phrases (NPs). Hence, one approach to the extraction of MWEs is to parse the text and then extract the NPs as potential terms. Parsing is however a slow process, and only likely to be worthwhile if the output of the parser is to be used for other purposes as well. Fortunately, NPs can also be identified by a more limited syntactic process based on recognizing patterns of grammatical tags, which can be done reasonably quickly.
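Tag-pattern recognition of this kind can be illustrated with a short sketch. It assumes the text has already been tagged by some part-of-speech tagger, uses a simplified tag set (`ADJ`, `NOUN`, and so on), and treats "zero or more adjectives followed by one or more nouns" as the NP pattern. All of these choices are assumptions made for illustration; practical systems use richer patterns.

```python
import re


def noun_phrases(tagged: list[tuple[str, str]]) -> list[str]:
    """Return candidate noun phrases from a list of (word, tag) pairs."""
    # Encode each tag as one character so that a regular expression
    # can be run over the tag sequence rather than over the words.
    code = "".join(
        "A" if tag == "ADJ" else "N" if tag == "NOUN" else "O"
        for _, tag in tagged
    )
    # Assumed NP pattern: optional adjectives, then at least one noun.
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"A*N+", code)
    ]
```

Given the tagged sequence the/DET hard/ADJ winter/NOUN wheat/NOUN grows/VERB, the function returns the single candidate "hard winter wheat", which could then be checked against the lexicon.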
A more simple-minded approach to term extraction involves generating all sequences of successive tokens up to a specified length, and checking each against the lexicon in decreasing order of length. With a four-token limit, this would seem to involve checking four potential terms per token, but in fact this is an overestimate. Firstly, sequences starting or finishing with a trivial word like “of” or “for”, or containing a punctuation symbol, might be discarded at once (though this begs the question of how efficiently these cases can be recognized). Secondly, if the first word of a long sequence fails to match, the shorter sequences starting with that word need not be tested. All in all, with its avoidance of syntactic processing, this can be quite an effective approach.
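The sequence-generation approach might be sketched as below. The sketch assumes that the lexicon is a set of space-joined, lower-cased terms and that, once a term has matched, scanning resumes after it (the text does not specify this). The filtering of trivial words and the early-exit optimization are omitted for brevity, and `extract_terms` is an illustrative name.

```python
def extract_terms(tokens: list[str], lexicon: set[str], max_len: int = 4) -> list[str]:
    """Greedy longest-match extraction of lexicon terms from a token list."""
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon:
                terms.append(candidate)
                i += n  # assumption: skip past the matched term
                break
        else:
            i += 1  # no term starts at this token
    return terms
```

With the lexicon `{"boeing 747", "boeing", "winter wheat", "wheat"}` and the tokens of "the boeing 747 carried winter wheat", the longer entries "boeing 747" and "winter wheat" are preferred over their single-word alternatives.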
A different method is to only extract single tokens. But if the current token happens to match the first word of a multiword term, the succeeding tokens of the text are checked against the remainder of that entry (and also against any other entries starting with the same word).
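This first-word method can be sketched by indexing the lexicon on the first word of each term. The function names and the choice to try longer entries before shorter ones are assumptions made for illustration.

```python
from collections import defaultdict


def build_first_word_index(terms: list[str]) -> dict[str, list[list[str]]]:
    """Group lexicon terms by their first word, longest entries first."""
    index = defaultdict(list)
    for term in terms:
        words = term.split()
        index[words[0]].append(words)
    for entries in index.values():
        entries.sort(key=len, reverse=True)
    return index


def match_terms(tokens: list[str], index: dict[str, list[list[str]]]) -> list[str]:
    """Scan tokens one at a time; on a first-word hit, compare the
    following tokens against the remainder of each candidate entry."""
    found = []
    i = 0
    while i < len(tokens):
        for words in index.get(tokens[i], []):
            if tokens[i:i + len(words)] == words:
                found.append(" ".join(words))
                i += len(words)
                break
        else:
            i += 1
    return found
```

Only tokens that begin some entry trigger any multiword comparison, so most tokens cost a single index probe.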
If it is known that the terms of interest will only occur in certain specific contexts, rules can be used for defining those contexts. For instance, a rule may be devised for extracting surnames from personal names, covering examples such as “Mrs Robinson,” “John F. Ratley, Jr.,” and “Revd Ben Wiles”. Alternatively, data may need to be extracted from semiformatted texts such as clinical reports. Rules of this kind may consist of, or be based on, regular expressions.
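As an illustration, a surname rule covering the three examples above could be written as a single regular expression. The lists of titles and suffixes are short, assumed samples, and the pattern is a sketch that will not handle every naming convention.

```python
import re

# Assumed, deliberately incomplete lists of titles and suffixes.
TITLE = r"(?:(?:Mr|Mrs|Ms|Dr|Revd|Prof)\.?\s+)?"
SUFFIX = r"(?:,?\s+(?:Jr|Sr|II|III)\.?)?"

# Optional title, any number of forenames or initials, the surname,
# and an optional suffix such as ", Jr.".
NAME = re.compile(
    rf"^{TITLE}(?:[A-Z][\w.'-]*\s+)*(?P<surname>[A-Z][\w'-]+){SUFFIX}$"
)


def surname(name: str):
    """Return the surname found in `name`, or None if the rule fails."""
    m = NAME.match(name.strip())
    return m.group("surname") if m else None
```

Applied to "Mrs Robinson", "John F. Ratley, Jr." and "Revd Ben Wiles", the rule extracts "Robinson", "Ratley" and "Wiles" respectively.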
As noted later, extracted terms are sometimes compiled into an index before further processing, but for the present it is assumed that each term is immediately passed on for lookup.
Besides terms extracted from a text, lexical processing may need to be applied to terms retrieved from a database or knowledge base, extracted from a user query, or entered into a text box in a user interface. In some cases, the lexical terms might be artificial values, such as employee numbers assigned by a company or internal identifiers assigned by a database system.
For a lexicon of any size, the efficiency of the lookup is of major importance. The techniques here are well known, including simple binary searching in arrays, the use of access trees, and hashing techniques. For fixed sets of terms, hashing is potentially the fastest of all methods, but if terms are likely to be added or deleted frequently, access trees are probably better. Further details about these methods can be found in any good book on data structures.
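Two of these techniques can be contrasted in a few lines. The sketch below uses Python's standard `bisect` module for binary search over a sorted array and the built-in `set` type (a hash table) for hashing; the function name `binary_lookup` is illustrative.

```python
from bisect import bisect_left


def binary_lookup(sorted_terms: list[str], term: str) -> bool:
    """Binary search in a sorted array: O(log n) comparisons per lookup."""
    i = bisect_left(sorted_terms, term)
    return i < len(sorted_terms) and sorted_terms[i] == term


terms = sorted(["oak", "tree", "plant", "branch", "twig"])
hashed_terms = set(terms)  # hashing: expected O(1) time per lookup

in_array = binary_lookup(terms, "oak")  # found by binary search
in_hash = "oak" in hashed_terms         # found by hashing
```

Both lookups give the same answer; the difference lies in the cost per lookup and in how cheaply the structure can be updated.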
Of course, in the case of a lexicon designed for use by human beings, alphabetical ordering is the almost invariable rule.
Use of the Output
The output of a lexicon, unless it is simply to be displayed on a user’s workstation, will be subjected to further processing by the application program. As noted earlier, the output of each lexicon is designed according to its intended uses.
Each entry in a lexicon consists of an entry term and an output structure. Besides the outputs explicitly defined in the lexicon, there is normally a default output which is returned in cases where the lexical term is not included among the entry terms.
The simplest form of lexicon is one in which the explicit outputs are all void; in other words, the lexicon is simply a list of terms. In effect, the lookup process returns the value “true” or “false,” depending on whether or not the term was found. Such a lexicon defines a set of terms, and the lookup is a test for set membership.
The simplest form of nonvoid lexicon is a simple association list, in which each output is a single value. In general, however, the outputs can have an arbitrarily complicated structure, as defined by two factors: multiplicity and complexity.
Multiplicity of output refers to the fact that, in many lexicons, a given entry term may be associated with a number of different outputs (or equivalently, there may be a number of separate entries which have the same entry term). In many natural language tasks, multiplicity of output equates to ambiguity (as in the various distinct meanings of the English word “dock”), which represents a problem to be resolved. In other cases (e.g., when the outputs enumerate the members of a set) the multiplicity is not a problem; for example, in an inverted file (see below) multiple outputs are the norm.
Complexity of output. In many lexicons each output consists of several differentiated data items. The output may for example be structured as a record or tuple, or it may have a deeper structure, as in some examples below.
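Both factors can be seen in a small sketch in which each output is a record and an entry term may have several such records. The `Sense` record, its fields, and the sample glosses are hypothetical, and the empty list stands in for the default output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    """One structured output: a record with differentiated data items."""
    part_of_speech: str
    gloss: str


# Multiplicity: the entry term "dock" has several distinct outputs.
LEXICON = {
    "dock": [
        Sense("noun", "a structure where ships load and unload"),
        Sense("noun", "the enclosure for the accused in a court"),
        Sense("verb", "to cut short"),
    ],
}


def senses(term: str, part_of_speech: str = None) -> list[Sense]:
    """Return all outputs for `term`, optionally filtered by part of
    speech; an empty list serves as the default output."""
    found = LEXICON.get(term, [])
    if part_of_speech is None:
        return found
    return [s for s in found if s.part_of_speech == part_of_speech]
```

Filtering by part of speech is one simple way in which an application might begin to resolve the ambiguity.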
Types of Lexicon
Word Lists and Stoplists
As noted above, the simplest form of lexicon is a simple term list representing a set of terms which share some common property. Membership of such a set can result in the acceptance of a term for further processing by the application, or alternatively can cause its rejection. An example of the latter case is a stoplist, which is routinely used in information retrieval systems to eliminate trivial words (“stopwords”) which can serve no purpose for retrieval. These are mainly syntactic function words (“closed class words”) such as the English words “the,” “of,” “and,” “for,” “over,” “because,” etc.
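A stoplist lookup is simply a set-membership test, as the following sketch shows. The stoplist contains only the example words given above; real stoplists are considerably longer.

```python
# Sample stoplist built from the examples in the text.
STOPLIST = frozenset({"the", "of", "and", "for", "over", "because"})


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Reject any token that is a member of the stoplist."""
    return [t for t in tokens if t.lower() not in STOPLIST]
```

For instance, `remove_stopwords(["The", "handling", "of", "hyphens"])` returns `["handling", "hyphens"]`.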
Indexes and Inverted Files
An index is a secondary structure which is provided to facilitate access to the items held in some primary body of information. A back-of-book index provides a familiar everyday example, but computer-based indexes can be generated automatically for any kind of information object which includes lexical data. An important example is the use of inverted files for assisting the retrieval of records in databases. Given a database in which each record contains a set of terms summarizing the properties of a particular entity, inspecting all of the records individually will likely be a slow process. Instead, an inverted file is generated in which each distinct term is associated with an entry listing the identifiers of the specific records which contain that term. Searching can then be performed quickly by accessing and comparing only those few entries which relate to the specified search terms.
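The construction and use of an inverted file can be sketched as follows. The sketch assumes that records are supplied as a mapping from record identifier to a list of terms, and that a search requires every query term to be present (a conjunctive query); both are assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_file(records: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map each distinct term to the set of record identifiers containing it."""
    inverted = defaultdict(set)
    for record_id, terms in records.items():
        for term in terms:
            inverted[term].add(record_id)
    return inverted


def search(inverted: dict[str, set[int]], query_terms: list[str]) -> set[int]:
    """Return the records containing all query terms, by intersecting
    only the entries for those terms."""
    postings = [inverted.get(term, set()) for term in query_terms]
    return set.intersection(*postings) if postings else set()
```

The search touches only the entries for the query terms, never the records themselves.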
Machine Readable Dictionaries
The outputs of traditional dictionaries and glossaries were designed for perusal by human beings. However, with their generally predictable structure, they proved to be a valuable lexical resource for natural language processing applications. Their great advantage was that they provided a ready-made summary of the whole of the language, apart from specialized technical terms. Their disadvantage was that their presentation of information was often inconsistent and incomplete. Experience of such problems soon led to the development of electronic dictionaries with more formally defined structures; perhaps the best-known example is the Longman Dictionary of Contemporary English (LDOCE), which is now available on CD-ROM.
A thesaurus is a structure which records and classifies relationships between distinct lexical terms. Probably its best-known use is for detection of synonyms, to enable distinct but equivalent terms to be merged or conflated. In information retrieval, a thesaurus may be used for expanding a set of query terms, thus improving the recall performance of a query. Some of the most detailed thesauruses provide information about the terminology of a technical domain, such as medicine or agriculture.
In fact, a well-developed thesaurus provides information about a range of distinct interterm relationships, and may support two distinct types of operation: (i) vocabulary control, and (ii) classification of relationships. Vocabulary control consists of mapping each lexical term onto a preferred term, denoting some specific concept in the domain, thus effectively conflating different (synonymous) mentions of that concept. Such a thesaurus also includes information about semantic relationships between distinct concepts, including antonymy and various hierarchical relations. The latter almost always include genus/type relations (e.g., plant/tree/oak), and often also whole/part relations (e.g., tree/branch/twig) and others.
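These two operations can be sketched with a pair of small tables. The preferred-term entries are invented for illustration, while the genus/type chain follows the plant/tree/oak example; the function names are likewise hypothetical.

```python
# Vocabulary control: each term maps to a preferred term (sample data).
PREFERRED = {"automobile": "car", "motorcar": "car", "car": "car"}

# Genus/type relation: each concept maps to its immediate genus.
BROADER = {"oak": "tree", "tree": "plant"}


def normalize(term: str) -> str:
    """Conflate synonymous mentions by mapping onto the preferred term."""
    return PREFERRED.get(term, term)


def ancestors(concept: str) -> list[str]:
    """Follow genus/type links upward from `concept`."""
    chain = []
    while concept in BROADER:
        concept = BROADER[concept]
        chain.append(concept)
    return chain
```

Here `normalize("automobile")` gives `"car"`, and `ancestors("oak")` gives `["tree", "plant"]`.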
WordNet is an online thesaurus developed at Princeton University in the early 1990s [5, 10]. In WordNet, groups of synonymous words are organized into “synsets” (but no preferred terms are designated). Words which are polysemous (i.e., possess more than one meaning) will occur in two or more synsets – for example, (board, plank) and (board, committee). This means that the entry term “board” returns two distinct outputs. WordNet also allows five further relations to be recorded between synsets – antonymy, genus/type, whole/part, manner, and logical entailment.
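The synset organization can be mimicked with toy data, as in the sketch below. This is not the WordNet database or its programming interface; the synsets are just the examples from the text, and `synsets_for` is a hypothetical helper.

```python
# Toy synsets taken from the examples in the text (not real WordNet data).
SYNSETS = [
    frozenset({"board", "plank"}),
    frozenset({"board", "committee"}),
    frozenset({"oak"}),
]


def synsets_for(word: str) -> list[frozenset]:
    """Return every synset in which `word` occurs."""
    return [s for s in SYNSETS if word in s]
```

A polysemous word such as "board" yields two synsets, whereas "plank" yields one.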
Although WordNet is an attractive and useful tool, it is really quite “weak” in terms of completeness and consistency. Thus, its use for supplementing search terms in information retrieval systems has led to disappointing results. It may be more useful for providing suggestions or asking questions in an online context.
The EDR dictionary, now administered by the Japanese National Institute of Information and Communications Technology, provides an impressive range of lexical resources. One of its main purposes is to support the development of robust and effective machine translation tools. Primary lexical access is via word dictionaries in Japanese and English, and these include links to a “concept dictionary” containing semantic information and hierarchical links to related concepts. Other components include co-occurrence dictionaries and a Japanese/English bilingual dictionary.
A thesaurus or online dictionary provides a kind of sketchy model of the world (or partial world), and the language used to describe it. The most fully developed systems of this kind are ontologies which, besides the features outlined above, may include definitions of processes, interactions, constraints, and temporal relationships.
Construction of Lexicons
Any body of text can be converted automatically into an index of words (or, with a little more trouble, of phrases) by extracting the words/phrases as discussed earlier and organizing them into an alphabetical list. An index thus produced is of course a kind of lexicon – one in which the outputs are restricted, e.g., to positional information or frequency. Many language processing applications start off by generating an index of extracted terms, which can then be inspected or manipulated at a later stage. Note that an inverted file is a form of index.
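A word index of this kind can be generated in a few lines. The sketch below records, for each word, the positions (counted in words) at which it occurs, from which the frequency follows directly; the tokenization rule and the name `build_index` are assumptions.

```python
import re
from collections import defaultdict


def build_index(text: str) -> dict[str, list[int]]:
    """Return an alphabetically ordered index mapping each word to the
    list of word positions at which it occurs in `text`."""
    positions = defaultdict(list)
    for pos, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        positions[match.group(0).lower()].append(pos)
    return dict(sorted(positions.items()))
```

For the text "The cat saw the dog", the index lists "cat", "dog", "saw", "the" in that order, with "the" recorded at positions 0 and 3 (a frequency of 2).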
The compilation of more complex forms of lexicon can only be automated to a limited extent. Statistical analysis of text corpora can provide useful lexical resources, or at least resources which can be used to facilitate the construction of practically useful lexicons. A key activity is looking for associations (collocations) between words in a text corpus. This involves taking the corpus, dividing it into suitable segments (e.g., paragraphs, sentences, or fixed-size windows) and then measuring the extent to which different terms tend to occur together using, e.g., the mutual information measure. One application of this approach is in the construction of a statistical thesaurus, by which strongly associated terms may be used for expanding sets of query terms in information retrieval systems.
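One common form of the mutual information measure is the pointwise score log2(P(x, y) / (P(x) P(y))), where the probabilities are estimated as the fraction of segments containing x, y, or both. The sketch below computes this score for every pair of terms that co-occur; the segment-based estimation and the name `pmi_scores` are assumptions, and practical systems usually add frequency thresholds or smoothing.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(segments: list[list[str]]) -> dict[tuple[str, str], float]:
    """Pointwise mutual information for each co-occurring term pair.

    Each segment is a list of terms; pairs are keyed in sorted order.
    """
    n = len(segments)
    term_count = Counter()
    pair_count = Counter()
    for segment in segments:
        terms = sorted(set(segment))  # count each term once per segment
        term_count.update(terms)
        pair_count.update(combinations(terms, 2))
    return {
        (x, y): math.log2((c / n) / ((term_count[x] / n) * (term_count[y] / n)))
        for (x, y), c in pair_count.items()
    }
```

A positive score indicates that two terms occur together more often than would be expected if they were independent; a negative score indicates the opposite.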
In a statistical thesaurus, the associations between terms are undifferentiated “tending-to-occur-together” relationships. More detailed analysis of patterns of word association may be used, tentatively, to differentiate between relationships of various types. One noteworthy example of this approach is the work of Grefenstette, while some further efforts in this direction have been reviewed by Matsumoto and Srinivasan.
As has already been noted, some kind of lexical analysis is performed by almost every program which operates on textual data, including systems for text retrieval, question-answering, summarization, information extraction, data mining, and machine translation.
The URLs given below provide datasets as well as lexical analysis tools.
URL to Code
- 2. Chodorow M, Byrd R, Heidorn G. Extracting semantic hierarchies from a large on-line dictionary. In: Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, Illinois, USA. 1985. p. 299–304.
- 3. Church K, Hanks P. Word association norms: mutual information and lexicography. Comput Linguist. 1990;16(1):22–9.
- 6. Frakes WB, Baeza-Yates R. Information retrieval: data structures & algorithms, Chapters III, VII, and IX. Englewood Cliffs: Prentice-Hall; 1992.
- 11. Mitkov R. The Oxford handbook of computational linguistics. Chapters III, XXI, XXIV, XXV, and XXXIII. Oxford: Oxford University Press; 2003.
- 13. Voorhees EM. Using WordNet to disambiguate word senses for text retrieval. In: Proceedings of the 16th annual international ACM SIGIR conference on research and development in information retrieval. 1993. p. 171–80.