Information Density and Linguistic Encoding (IDeaL)
We introduce IDeaL (Information Density and Linguistic Encoding), a collaborative research center that investigates the hypothesis that language use may be driven by the optimal use of the communication channel. From the point of view of linguistics, our approach promises to shed light on selected aspects of language variation that have hitherto not been sufficiently explained. Applications of our research can be envisaged in various areas of natural language processing and AI, including machine translation, text generation, speech synthesis and multimodal interfaces.
Keywords: Language variation, Information theory, Information density, Surprisal
Language exhibits a broad spectrum of variation at all linguistic levels. While there is a substantial body of research on the systematic description of linguistic variation accounting for factors such as region (dialect), social background (sociolect) and situational context (register), there are few attempts at finding a common explanation: Why do we usually have more than one option of encoding a given message? And how do we choose among the options available to us? These are the questions we address in IDeaL.
(1a) My boss confirmed that he is absolutely crazy.
(1b) My boss confirmed he is absolutely crazy.
(2a) Where should I put this stuff?
(2b) Where to put this stuff?
(3a) If this method of control were to be used, trains would operate more safely.
(3b) The use of this control method leads to safer train operation.
Our overarching research question is what governs the choice between such encoding options. Specifically, we begin with the hypothesis that such encoding choices serve to optimize density, that is, the amount of information conveyed per unit of time. Traditionally, linguistics has associated the informational content of a sentence or discourse with its semantics: the inherent meanings of words and constituents are combined compositionally to determine a sentential or discourse message. Psycholinguistics and computational linguistics, however, have increasingly turned to Information Theory as a mathematical framework for objectively quantifying the information conveyed by a linguistic unit (e.g. phoneme or word) as a function of its predictability in context, often termed surprisal.
IDeaL investigates the hypothesis that processing complexity is indexed by surprisal across linguistic levels and further that linguistic variation may be characterized by the optimal distribution of information across the linguistic signal. We provide a more formal definition of surprisal in Sect. 2, along with a summary of supporting evidence for its linguistic relevance, and then illustrate several different research areas in IDeaL in Sect. 3. We conclude with a brief summary and outlook (Sect. 4).
2 Information Density
At the word level, Hale posits that the cognitive effort associated with comprehending the next word, \(w_i\), of a sentence will be proportional to its surprisal (Eq. 1) [5, 9]. This claim is consistent with a wealth of experimental evidence demonstrating that behavioural (e.g. reading time) and neurophysiological (e.g. event-related brain potentials, ERPs) measures of processing effort are highly correlated with a word’s predictability [3, 8, 12, 14].
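In its standard information-theoretic formulation, the surprisal of a word is the negative log probability of that word given its preceding context. The symbol \(S\) and the choice of base 2 (so that surprisal is measured in bits) are notational assumptions made here for concreteness:

\[
S(w_i) = -\log_2 P(w_i \mid w_1, \ldots, w_{i-1})
\]

A word that is fully predictable from its context (\(P = 1\)) thus carries zero bits, and each halving of its probability adds one bit of surprisal.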
Evidence for the Uniform Information Density (UID) hypothesis, which holds that speakers prefer encodings that distribute information evenly across the signal, comes from the observation that various syntactic reduction phenomena (such as that-complementiser and that-relativiser omission, as well as auxiliary contraction) can be explained by a preference for uniformity in surprisal [6, 7]. Further support comes from the observation that speakers take more time to pronounce words when they occur in less predictable contexts. At the text level, Genzel and Charniak [4] find evidence that the information density of sentences can be viewed as uniform when context is taken into account, in contrast with the apparent increase in density when context is not considered.
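To make the notion of uniformity in surprisal concrete, the following minimal sketch compares two encodings of the same message by the variance and the peak of their per-word surprisal profiles. The function names are illustrative, and the conditional probabilities are hypothetical values invented for this example, not corpus estimates; variance and peak are only two of several conceivable uniformity measures.

```python
import math
from statistics import pvariance


def surprisal_bits(prob):
    """Surprisal in bits of a word with conditional probability `prob`."""
    return -math.log2(prob)


def surprisal_profile(probs):
    """Per-word surprisal values for a sequence of conditional probabilities."""
    return [surprisal_bits(p) for p in probs]


def uniformity(probs):
    """Return (variance, peak) of the surprisal profile; lower means more uniform."""
    profile = surprisal_profile(probs)
    return pvariance(profile), max(profile)


# HYPOTHETICAL probabilities, invented for illustration only.
# "... confirmed that he is ...": the complementiser carries part of the
# information associated with the onset of the complement clause.
with_that = [0.25, 0.25, 0.25, 0.5]   # confirmed, that, he, is
# "... confirmed he is ...": the same information falls on "he" alone.
without_that = [0.25, 0.0625, 0.5]    # confirmed, he, is

print("with 'that':   ", uniformity(with_that))
print("without 'that':", uniformity(without_that))
```

In this toy setting both variants convey the same total number of bits, but the variant without the complementiser concentrates them on a single word. Under these assumed probabilities, a preference for uniformity would favour the full form, whereas a more predictable complement clause would make the reduced form the smoother option.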
Particularly compelling is recent research demonstrating that the lexica of many human languages have adapted so as to encode words that are more predictable (on average) using shorter forms than less predictable words. The consequence of such a lexicon is precisely to increase uniformity by using longer forms for words that typically convey more information, thus distributing the information over time. Indeed, evidence also suggests that people’s decision to say math rather than mathematics is driven at least partly by the increased predictability of the word in a particular context: people use the short form when the word is less surprising and conveys fewer bits of information.
3 Research Programme
Building on the findings outlined above, IDeaL investigates the extent to which surprisal offers a pervasive explanation of language behavior across levels of linguistic representation. To this end, we examine both the mechanisms of encoding and the determinants of surprisal in detail by (a) identifying which aspects of linguistic and non-linguistic context are relevant for determining levels of surprisal and density, and (b) examining the diverse means languages make available for variation in linguistic encoding and thus modulation of surprisal: cross-linguistically, diachronically, and in different genres and registers. Measures of surprisal and processing difficulty can then feed back into models of human linguistic behavior as well as various kinds of computational applications. For instance, in human-computer interaction (HCI), they open a path to adaptive technology that makes effective use of the variation available in language to adapt utterances in a situated setting to the user and the environment.
3.1 Research Areas
The projects in IDeaL are distributed across three research areas: “Situational Context and World Knowledge” (Area A), “Discourse and Register” (Area B) and “Variation in Linguistic Encoding” (Area C). Below we give a few examples of the research carried out in each area.
Projects in Area B focus on surprisal in discourse context and different registers and text types. One of the projects, for example, looks at the hypothesis of linguistic densification in the evolution of scientific writing in English (mid 17th century to present), starting from the assumption that shared expertise of the author and their audience affects language use and, over a longer period, drives language change and the evolution of domain-specific language (register). As scientific activity in a given field develops and becomes more specialized, particular meanings become more predictable (within that scientific field). UID then predicts the emergence of denser encodings for these predictable meanings, which would optimize efficiency in communication.
Projects in Area C focus on testing the effect of linguistic predictability on the expansion or compression of linguistic items in encoding. Such effects can be observed in acoustic-phonetic realization in speech (e.g. shorter vs. longer durations of syllables in speech production) and are investigated in two projects in this area. Another project studies intercomprehension across languages of one family (here: Slavic languages), starting from the assumption that there is a correlation between (the degree of) language relatedness, intercomprehension and surprisal.
3.2 Methods for Measuring Surprisal
In order to address questions about the role of information density in language use and language evolution, it is necessary to accurately quantify the amount of information carried by a linguistic item. Two kinds of methods for quantifying surprisal are employed across the three research areas. Surprisal can be estimated experimentally from human subjects by asking for completions and by observing processing difficulty through eye-tracking and event-related potentials in EEG; and it can be estimated on the basis of probabilities obtained from large corpora representative of different domains. With regard to the latter, the development of language modeling approaches with more sophisticated and extended notions of linguistic context is a particular research focus. Finally, several projects are also concerned with developing computational models that represent aspects of language structure that are not observable in surface forms, modelling for example the surprisal of a syntactic structure, the thematic fit of words in specific roles when occurring in long-distance dependencies, or the information conveyed by discourse connectives and other linguistic cues about discourse relations.
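The corpus-based route can be sketched with the simplest possible language model. The example below estimates word surprisal from a bigram model with add-one smoothing. The model, the function names and the three-sentence toy corpus are assumptions made for illustration only; they are far simpler than the language models with extended notions of context that the projects actually target.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect context and bigram counts from a list of tokenised sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + tokens          # "<s>" marks the sentence start
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab


def surprisal(prev, word, model):
    """Surprisal in bits of `word` given the previous word `prev`."""
    context_counts, bigram_counts, vocab = model
    # Add-one smoothing keeps unseen bigrams from receiving zero probability.
    prob = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))
    return -math.log2(prob)


def sentence_surprisals(tokens, model):
    """Surprisal of every word in a sentence, each given its predecessor."""
    padded = ["<s>"] + tokens
    return [surprisal(prev, word, model) for prev, word in zip(padded, padded[1:])]


# Toy corpus, invented for illustration only.
corpus = [
    ["the", "dog", "barks"],
    ["the", "dog", "sleeps"],
    ["the", "cat", "sleeps"],
]
model = train_bigram(corpus)
for word, bits in zip(["the", "cat", "barks"],
                      sentence_surprisals(["the", "cat", "barks"], model)):
    print(f"{word}: {bits:.2f} bits")
```

In this toy corpus "dog" follows "the" more often than "cat" does, so it receives lower surprisal in that context; the unseen bigram "cat barks" receives the highest surprisal of the three words, but remains finite because of the smoothing.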
3.3 Applications in HCI and NLP
The notion of information density and its effect on linguistic encoding also has the potential to improve Natural Language Processing (NLP) applications, for instance by making effective use of the ability to represent alternative linguistic encodings of a particular meaning: if we can map high-density and low-density encodings of the same information onto one another, this should help to improve information retrieval, especially for complex multi-argument events.
Cognitive models of processing difficulty can also be used to inform natural language generation in dialog systems, such that the dialog system can choose the optimal utterance for a given user (e.g., layperson vs. expert) in a given situation. For instance, we investigate how to optimally manage the cognitive load induced by a language comprehension task in combination with a driving task (in an automotive simulator) for different user groups (younger vs. older adults). Experimental findings will contribute to the development of a language generation model that adapts linguistic encodings appropriately based on both the immediate setting and the cognitive capacity of the listener (see also Demberg et al. [2], this issue).
The overarching goal of IDeaL is to establish the extent to which variation in language and language use, from phonemes to discourse, can be explained by the pressure to distribute information evenly across the communicative channel. The constituent projects thus contribute towards the development of a comprehensive model of language use that unites traditionally different perspectives, from the cognitive and computational to the social and historical.
On the conceptual level, we assume that language is designed for communication and that language users are rational in the sense that they want communication to work. Hence, they will adapt their linguistic encodings in the service of successful communication. The notion of information density gives us a basis for investigating language use that is in accordance with these assumptions.
On the methodological level, using information theory as a common formal basis opens up a rich repertoire of computational methods commonly used in computational linguistics and in parts of psycholinguistics to other areas of linguistic investigation, including (cross-linguistic) language variation, language acquisition and language evolution. Also, since information theory is agnostic regarding any particular linguistic theory, different theoretical perspectives can be incorporated in a unified model of language use under this view.
In the present project phase, we focus on selected aspects of language variation and on developing a solid repertoire of methods for measuring ID/surprisal, building computational models, and determining the scope of surprisal-based accounts. In future phases, we would like to refine our models and expand their application to a broader spectrum of languages and other areas of linguistic investigation, such as language acquisition and learning, language typology and language evolution.
In a wider perspective, we believe that information theory can help describe and explain (aspects of) human experience more generally, notably its evolutionary, cognitive and social aspects. As language is a major window into human behavior, we would thus hope that the research carried out in IDeaL will provide contributions to the larger endeavor of modeling and explaining human experience.
IDeaL is funded by the Deutsche Forschungsgemeinschaft (DFG) under grant SFB (Sonderforschungsbereich) 1102 (www.sfb1102.uni-saarland.de). Support by the Cluster of Excellence Multimodal Computing and Interaction (MMCI) is also gratefully acknowledged.
- 2. Demberg V, Hoffmann J, Howcroft D, Klakow D, Torralba A (2015) Search challenges in natural language generation with complex optimization objectives. Künstliche Intelligenz (in this issue)
- 4. Genzel D, Charniak E (2002) Entropy rate constancy in text. In: Proceedings of the 40th meeting of the Association for Computational Linguistics, ACL ’02, pp 199–206
- 5. Hale J (2001) A probabilistic Earley parser as a psycholinguistic model. In: Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, NAACL ’01, Association for Computational Linguistics, Stroudsburg, PA, USA, pp 1–8
- 8. Kutas M, DeLong KA, Smith NJ (2011) A look around at what lies ahead: prediction and predictability in language processing. In: Bar M (ed) Predictions in the brain: using our past to generate a future. Oxford University Press, UK, pp 190–207
- 15. Teich E, Degaetano-Ortlieb S, Fankhauser P, Kermes H, Lapshinova-Koltunski E (2015) The linguistic construal of disciplinarity: a data mining approach using register features. J Assoc Info Sci Technol (JASIST)
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.