Building on the findings outlined above, IDeaL investigates the extent to which surprisal offers a pervasive explanation of language behavior across levels of linguistic representation. To this end, we examine both the mechanisms of encoding and the determinants of surprisal in detail by (a) identifying which aspects of linguistic and non-linguistic context are relevant for determining levels of surprisal and density, and (b) examining the diverse means languages make available for variation in linguistic encoding, and thus for the modulation of surprisal: cross-linguistically, diachronically, and in different genres and registers. Measures of surprisal and processing difficulty can then feed back into models of human linguistic behavior as well as various kinds of computational applications. For instance, in human-computer interaction (HCI), they open a path to adaptive technology that makes effective use of the variation available in language, adapting utterances in a situated setting to the user and the environment.
Research Areas
The projects in IDeaL are distributed across three research areas: “Situational Context and World Knowledge” (Area A), “Discourse and Register” (Area B) and “Variation in Linguistic Encoding” (Area C). Below, we give a few examples of the research carried out in each area.
Projects in Area A explore the extension of the notion of context beyond the strictly linguistic context, to relevant aspects of the current situation (e.g., objects present in the scene, interlocutors’ gestures or gaze) which help us explain linguistic variation observed in situated communication. A particular focus concerns the role of event knowledge in linguistic choice during (human) production. Here, we test experimentally how exactly human knowledge of typical complex but routine events—like going to a restaurant—is structured, and how humans integrate such script knowledge with the information contained in utterances that implicitly rely on it. Challenges addressed in these projects involve the acquisition of event knowledge at a large scale from texts and via human annotation, and the mapping between representations of script knowledge and actual textual references to these events (see, for example, Fig. 2).
Projects in Area B focus on surprisal in discourse context and in different registers and text types. One of the projects, for example, tests the hypothesis of linguistic densification in the evolution of scientific writing in English (mid-17th century to the present), starting from the assumption that the shared expertise of the author and their audience affects language use and, over a longer period, drives language change and the evolution of domain-specific language (register) [15]. As scientific activity in a given field develops and becomes more specialized, particular meanings become more predictable (within that scientific field). UID then predicts the emergence of denser encodings for these predictable meanings, which would optimize efficiency in communication.
Projects in Area C focus on testing the effect of linguistic predictability on the expansion or compression of linguistic items in encoding. Such effects can be observed in acoustic-phonetic realization in speech (e.g. shorter vs. longer durations of syllables in speech production) and are investigated in two projects in this area. Another project studies intercomprehension across languages of one family (here: Slavic languages), starting from the assumption that there is a correlation between (the degree of) language relatedness, intercomprehension and surprisal.
Methods for Measuring Surprisal
In order to address questions about the role of information density in language use and language evolution, it is necessary to accurately quantify the amount of information carried by a linguistic item. Two kinds of methods for quantifying surprisal are employed across the three research areas. Surprisal can be estimated experimentally from human subjects by asking for completions and by observing processing difficulty through eye-tracking and event-related potentials in EEG; and it can be estimated on the basis of probabilities obtained from large corpora representative of different domains. With regard to the latter, the development of language modeling approaches with more sophisticated and extended notions of linguistic context is a particular research focus. Finally, several projects are also concerned with developing computational models which represent aspects of language structure that are not observable in surface forms, modeling for example the surprisal of a syntactic structure, the thematic fit of words in specific roles when occurring in long-distance dependencies, or the information conveyed by discourse connectives and other linguistic cues about discourse relations.
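As a toy illustration of the corpus-based route (not one of the project's actual models), surprisal in bits can be sketched with a smoothed bigram language model; all names and the miniature corpus below are hypothetical:

```python
import math
from collections import Counter

def train_bigram_counts(tokens):
    """Count unigrams and adjacent word pairs in a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def surprisal(word, prev, unigrams, bigrams, vocab_size, alpha=1.0):
    """Surprisal in bits: -log2 P(word | prev), with add-alpha smoothing
    so that unseen continuations receive a finite (high) surprisal."""
    p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return -math.log2(p)

corpus = "the cat sat on the mat the cat ate".split()
uni, bi = train_bigram_counts(corpus)
V = len(uni)

# A frequent continuation is less surprising than a rare one:
s_cat = surprisal("cat", "the", uni, bi, V)  # "the cat" occurs twice
s_mat = surprisal("mat", "the", uni, bi, V)  # "the mat" occurs once
print(f"surprisal('cat'|'the') = {s_cat:.2f} bits")
print(f"surprisal('mat'|'the') = {s_mat:.2f} bits")
```

In the research described here, such conditional probabilities would of course come from far richer models and contexts, but the quantity being estimated is the same negative log-probability.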
Applications in HCI and NLP
The notion of information density and its effect on linguistic encoding also has the potential to improve Natural Language Processing (NLP) applications, for instance by making effective use of alternative linguistic encodings of a particular meaning: if we can map high-density and low-density encodings of the same information onto one another, this should help to improve information retrieval, especially for complex multi-argument events.
Cognitive models of processing difficulty can also be used to inform natural language generation in dialog systems, such that the dialog system can choose the optimal utterance for a given user (e.g., layperson vs. expert) in a given situation. For instance, we investigate how to optimally manage the cognitive load induced by a language comprehension task in combination with a driving task (in an automotive simulator) for different user groups (younger vs. older adults). Experimental findings will contribute to the development of a language generation model that adapts linguistic encodings appropriately based on both the immediate setting and the cognitive capacity of the listener (see also Demberg et al., this issue).