Cognitive and social well-being in older adulthood: The CoSoWELL corpus of written life stories

This paper presents the Cognitive and Social WELL-being (CoSoWELL) project that consists of two components. One is a large corpus of narratives written by over 1000 North American older adults (55+ years old) in five test sessions before and during the first year of the COVID-19 pandemic. The other component is a rich collection of socio-demographic data collected through a survey from the same participants. This paper introduces the first release of the corpus consisting of 1.3 million tokens and the survey data (CoSoWELL version 1.0). It also presents a series of analyses validating design decisions for creating the corpus of narratives written about personal life events that took place in the distant past, recent past (yesterday) and future, along with control narratives. We report results of computational topic modeling and linguistic analyses of the narratives in the corpus, which track the time-locked impact of the COVID-19 pandemic on the content of autobiographical memories before and during the COVID-19 pandemic. The main findings demonstrate a high validity of our analytical approach to unique narrative data and point to both the locus of topical shifts (narratives about recent past and future) and their detailed timeline. We make the CoSoWELL corpus and survey data available to researchers and discuss implications of our findings in the framework of research on aging and autobiographical memories under stress.
Supplementary Information: The online version contains supplementary material available at 10.3758/s13428-022-01926-0.

($\frac{\text{tokens}_{\text{noun}}}{\text{tokens}_{\text{noun}} + \text{tokens}_{\text{verb}}}$). The importance of this variable stems from the fact that distributional properties of nouns and verbs affect language processing in general (see Vigliocco et al., 2011). Moreover, differences in production and comprehension of nouns and verbs are associated with language impairments that increase in their incidence with advancing age, such as dementia and Alzheimer's disease. Both types of impairments are associated with telegraphic style characterized by a less frequent use of verbs relative to nouns, i.e., higher Noun-to-Verb ratio (Eyigoz et al., 2020; Le et al., 2011; Burke & Shafto, 2008).
A second set of variables under consideration provides several estimates of Type-Token Ratio (TTR) originally proposed by Johnson (1944) as a metric of lexical diversity in a language. Traditionally, TTR is calculated as the number of word types in a text divided by the number of word tokens, but other operationalizations are possible (for discussion see Kyle, 2019). In this study, we employed three related but different operationalizations of this metric: 1) based on words, in line with the traditional definition ($\frac{\text{types}_{\text{word}}}{\text{tokens}_{\text{word}}}$, labeled here TTR-w), 2) based on the part-of-speech associated with a given word ($\frac{\text{types}_{\text{pos}}}{\text{tokens}_{\text{pos}}}$, labeled TTR-p) and, finally, 3) based on the dependency relation associated with a particular word ($\frac{\text{types}_{\text{dep}}}{\text{tokens}_{\text{dep}}}$, labeled TTR-d). An increase in any of the three TTR measures can be interpreted as an increase in lexical (TTR-w) or syntactic diversity (TTR-p and TTR-d), signaling a greater linguistic productivity of an individual writer in a given narrative.
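The three ratios share one computation applied to different annotation layers. The following is a minimal Python sketch of that idea, not the pipeline used in the study; the toy narrative, its tags, and the function name `type_token_ratio` are illustrative assumptions.

```python
def type_token_ratio(items):
    """Number of distinct items (types) divided by the number of items (tokens)."""
    items = list(items)
    if not items:
        return None  # the ratio is undefined for an empty narrative
    return len(set(items)) / len(items)


# Hypothetical parsed narrative: (word, part-of-speech, dependency relation).
narrative = [
    ("i", "PRON", "nsubj"),
    ("walked", "VERB", "root"),
    ("home", "ADV", "advmod"),
    ("and", "CCONJ", "cc"),
    ("i", "PRON", "nsubj"),
    ("rested", "VERB", "conj"),
]

ttr_w = type_token_ratio(word for word, _, _ in narrative)  # word layer
ttr_p = type_token_ratio(pos for _, pos, _ in narrative)    # part-of-speech layer
ttr_d = type_token_ratio(dep for _, _, dep in narrative)    # dependency layer
```

In this toy case the word and dependency layers each have five types over six tokens, while the part-of-speech layer has four types over six tokens, so TTR-p is the lowest of the three.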
A third set of variables tapped into the syntactic complexity of a sentence that has been shown to affect language comprehension (e.g., Gibson, 2000; Gibson, 1998; Levy, 2008) and constrain language production (Ferreira, 1991; Scontras et al., 2015; Szmrecsanyi, 2004; see Nippold et al., 2014, for results across age groups). The first variable in this set captured how elaborate the syntactic constructions were in a given sentence, i.e., it measures how many syntactic dependencies there are in a sentence relative to the total number of words in that sentence. Since each dependency has exactly one head, the formula for the measure that we will refer to as d-Ratio was: $1 - \frac{n_{\text{head}}}{n_{\text{token}}}$. A higher d-Ratio indicates that a sentence contains more dependencies relative to its number of tokens, making the sentence syntactically more elaborate. This measure was calculated for every sentence and then averaged for each narrative.
The second variable in this set operationalized syntactic complexity in a given sentence by calculating the longest path in a dependency tree for the sentence. For example, in the sentence visualized in Figure 1, the longest dependency path was 3.
This measure follows the notion that longer dependency relations are more effortful to process (Gibson, 2000; Gibson, 1998) and that there is a general tendency to minimize dependency distances in language production (Futrell et al., 2015; Temperley, 2007). To keep computation efficient, a given dependency tree was treated as a directed acyclic graph (see Oya, 2011; Yadav et al., 2019) and the longest path was calculated using the diameter function in the R package igraph, version 1.2.6, i.e., the length of the longest shortest path, $\max_{u,v} d(u, v)$, between any two nodes $u$ and $v$ (Csardi, Nepusz, et al., 2006). Similar to d-Ratio, this variable was calculated for each sentence and averaged across the sentences for a given narrative. We will refer to this variable as the longest dependency path (LDP) in this study.
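To make the graph-based definition concrete, the following Python sketch computes the longest shortest directed path of a dependency tree with a breadth-first search from every node. It is an illustration only and not the igraph call used in the study. It assumes that edges run from head to dependent, that edge direction is respected, and that heads are encoded with the CoNLL-U convention (0 marks the root); the example parses are hypothetical.

```python
from collections import deque


def longest_dependency_path(heads):
    """Length of the longest shortest directed path in a dependency tree.

    heads[i] is the 1-based index of the head of token i + 1; 0 marks the root.
    Edges are directed from head to dependent.
    """
    children = {token: [] for token in range(1, len(heads) + 1)}
    for token, head in enumerate(heads, start=1):
        if head != 0:
            children[head].append(token)

    longest = 0
    for start in children:
        # Breadth-first search yields shortest path lengths from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for child in children[node]:
                if child not in dist:
                    dist[child] = dist[node] + 1
                    queue.append(child)
        longest = max(longest, max(dist.values()))
    return longest


# Hypothetical parse of "She read the old book": read -> book -> the.
ldp_tree = longest_dependency_path([2, 0, 5, 5, 2])
# A pure chain of four tokens, each depending on the previous one.
ldp_chain = longest_dependency_path([0, 1, 2, 3])
```

Because every directed path in such a tree runs from an ancestor to a descendant, the result equals the depth of the deepest token below the root: 2 for the first parse and 3 for the chain.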
The third and final variable in this set considered the notion of syntactic complexity at the level of the narrative by focusing on the use of complex clauses such as coordination and subordination relative to syntactically simplex ones (e.g., Beaman, 1984; Givón, 1991). For the purposes of the present study, a wide range of syntactic constructions, not just coordinating and subordinating clauses, was taken to indicate embeddedness. A sentence was marked as complex if it contained at least one of the nine dependency relations listed below; otherwise, it was considered simplex. The definitions follow the UD schema used in this study.
• parataxis: a relation between a word and other elements placed side by side without explicit coordination, subordination, or argument relation
• xcomp: a clausal complement of a verb or an adjective functioning as a predicative or clausal complement without its own subject
• ccomp: a clausal complement of a verb or adjective with a dependent clause which is a core argument
• advcl: an adverbial clause modifier
• acl:relcl: a relative clause modifier
• acl: finite and non-finite clauses modifying a nominal
• conj: a conjunct, i.e., an element linked to another element by coordination
• cc: the relation between a conjunct and a preceding coordinating conjunction
• mark: a word marking a clause as subordinate to another clause
For a given narrative, the numbers of simplex and complex sentences were counted and combined into a ratio ($1 - \frac{n_{\text{simplex}}}{n_{\text{complex}}}$), where higher values indicated that a given story had relatively more complex sentences.
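The classification and the resulting ratio can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: each sentence is represented simply as a list of its dependency relation labels, and the function names are hypothetical. Note that the ratio is undefined when a narrative contains no complex sentence; the sketch returns `None` in that case.

```python
# The nine dependency relations that mark a sentence as complex.
COMPLEX_RELATIONS = {
    "parataxis", "xcomp", "ccomp", "advcl", "acl:relcl",
    "acl", "conj", "cc", "mark",
}


def is_complex(sentence_relations):
    """A sentence is complex if it contains at least one marked relation."""
    return any(rel in COMPLEX_RELATIONS for rel in sentence_relations)


def complexity_ratio(narrative):
    """Compute 1 - n_simplex / n_complex for a list of sentences."""
    n_complex = sum(is_complex(sentence) for sentence in narrative)
    n_simplex = len(narrative) - n_complex
    if n_complex == 0:
        return None  # division by zero: the ratio is undefined
    return 1 - n_simplex / n_complex


# Hypothetical narrative with two complex sentences and one simplex sentence.
story = [
    ["nsubj", "root", "mark", "advcl"],  # complex (mark, advcl)
    ["nsubj", "root", "obj"],            # simplex
    ["nsubj", "root", "cc", "conj"],     # complex (cc, conj)
]
ratio = complexity_ratio(story)
```

Here the ratio is 1 - 1/2 = 0.5. The measure turns negative once simplex sentences outnumber complex ones.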
The final variable considered in this study is mean length of utterance (MLU) originally proposed by Brown (1973). MLU has been used extensively to study linguistic productions in children (see Rice et al., 2010, and citations therein) and adults (see Nippold et al., 2014, and citations therein). MLU can be operationalized either based on the number of morphemes or words (for recent discussion see Ezeizabarrena & Garcia Fernandez, 2018). Here, we opted for the simpler word-based calculation ($\frac{n_{\text{word}}}{n_{\text{sentence}}}$).
A total of eight stories were removed from the data due to undefined values in calculating the eight lexico-syntactic variables. Thus, the data used in this section consisted of 7,275 life stories written by 1,028 participants. The summary information of the lexico-syntactic variables across the story types is given in Table 1. To analyze the potential differences in the lexico-syntactic profile of the story types, we fitted eight separate linear mixed-effects models to the data. The package lme4 version 1.1-26 was used to fit the mixed-effects models (Bates et al., 2015), and the p-values and type III ANOVA tables were estimated using Satterthwaite's degrees of freedom method as implemented in the R package lmerTest (version 3.1-3; Kuznetsova et al., 2017). In a given model, the lexico-syntactic response variable was modeled as a function of the story type, and participants were included in the model as random intercepts.
The profile of story type "future" was associated with a relatively high lexical and syntactic diversity (as indicated by the type-token ratios in Figure 1) and increased syntactic elaborateness (see d-Ratio in Figure 1). Furthermore, stories about the future tended to be syntactically more complex (see LDP in Figure 1) and also longer (see MLU in Figure 1). Story type "past" had a similarly distinctive profile. Stories about the distant past were associated with decreased lexical (see TTR-w) and syntactic (see TTR-p and TTR-d) richness. They also tended to be shorter than the future stories (see MLU in Figure 1) and were characterized by syntactic simplicity (see LDP in Figure 1).
Stories about the present (story type "yesterday") displayed a similar profile to stories about the past with one crucial difference, namely, they were much richer in terms of both lexical (see TTR-w) and syntactic diversity (see TTR-p and TTR-d). Finally, "cookie" as a story type displayed a profile similar to "future" except for a relatively weak syntactic elaborateness (d-Ratio).
In sum, the analysis presented in this section demonstrated that story types displayed robust differences in their lexico-syntactic profiles. Interestingly, stories associated with past events appeared to be less vivid both lexically and syntactically, and this lack of elaborateness increased the further back in veridical time the depicted life story was situated. In contrast, the lexico-syntactic profile of the future story type was characterized by highly elaborate use of lexical and syntactic devices. Finally, the results presented in this section demonstrated that the story types used in this study tap into different language patterns, providing a method that yields a comprehensive snapshot of an individual's language use and personal life experiences.

Supplementary material S2: Data generating process of STM
The data-generating process of STM is shown graphically in Figure 2.
Conceptually, the model consists of three parts: 1) a prevalence model controlling the allocation of words to topics as a function of covariates, 2) a topical content model controlling the frequency of each word in a given topic as a function of covariates, and 3) a language model combining these two sources to generate the actual words in each document (for a detailed description of the model, we refer the reader to Roberts et al., 2019). Algorithmically, this type of topic model combines properties of the correlated topic model (CTM) (Blei & Lafferty, 2007), the Dirichlet-Multinomial Regression (DMR) topic model (Mimno & McCallum, 2008) and the Sparse Additive Generative (SAGE) topic model (Eisenstein et al., 2011). The structural topic model (STM) used here is implemented in the R package stm, version 1.3.6 (Roberts et al., 2019).
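For orientation, the three parts can be written out as the generative process commonly given for STM. The notation below follows the description in Roberts et al. (2019) as we understand it and does not come from the present text: $d$ indexes documents, $n$ word positions, and $k$ topics; $X_d$ are the prevalence covariates, $y_d$ is the content covariate, $m$ is the baseline log word frequency, and the $\kappa$ terms are deviations by topic, by covariate level, and by their interaction.

```latex
\begin{align*}
\theta_d \mid X_d, \gamma, \Sigma &\sim \mathrm{LogisticNormal}(X_d \gamma, \Sigma)
  && \text{(prevalence model)} \\
\beta_{d,k} &\propto \exp\left(m + \kappa_k + \kappa_{y_d} + \kappa_{y_d, k}\right)
  && \text{(topical content model)} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d)
  && \text{(topic assignment)} \\
w_{d,n} \mid z_{d,n}, \beta_{d, z_{d,n}} &\sim \mathrm{Multinomial}(\beta_{d, z_{d,n}})
  && \text{(language model)}
\end{align*}
```

Without prevalence and content covariates, the process reduces to a correlated topic model, which is consistent with the CTM lineage noted above.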

Supplementary material S3: Example of broad and narrow document-topic distribution
We illustrate the ability of a topic model to associate a specific narrative with a document-topic distribution. We took two extreme cases as examples in the story type "yesterday" (recent past): one narrative that was estimated to be dominated by a single topic and a second narrative that was estimated to display multiple topics simultaneously.
The former story is given as Example 1, and its document-topic distribution is visualized as a line graph in Figure 3 (top panel), where the x-axis shows the topics and the y-axis shows the document-topic probability (which sums to one for a given narrative).
Example 1: I went grocery shopping yesterday at two stores. The first one did not require face masks anymore, but the other one did. I must admit I felt safer in the second store and probably only shop there until the other one goes back to requiring masks. Got all the stuff I needed and walked home. Great exercise. A total of almost 3 miles walking and carrying groceries.
Figure 3. Estimated topic distribution across two life stories.
Example 1 was estimated to be primarily associated with a single topic, namely pandemic experiences. A close reading of the narrative confirms that the story described an event related to COVID-19. The fitted STM was able to construct and identify this abstract topic and associate it strongly with this particular narrative.
Example 2, written about a future event, illustrates the opposite pattern: a mixture of multiple topics. Its topic distribution is visualized in Figure 3 (bottom panel). The estimated topic distribution reveals that this particular story does not demonstrate a well-defined topic structure, as it contains lexical elements from topics such as education, life and death, pandemic experiences and retirement.
While the narrative can be seen as lacking in cohesiveness (Halliday & Hasan, 1976), the key elements verbalized in the narrative were well captured with STM.
Example 2: [...] have an idea what to do, similar to a fire drill, so she was prepared.

Supplementary material S4: Global variable importance
As a measure of variable importance, we used permutation variable importance (Breiman, 2001). This metric is based on the idea of breaking the original association between a given predictor and the response by randomly permuting that predictor. If the predictor is important, model performance will decrease when predictions are made with its permuted values. The difference in model performance before and after permutation is then used as a measure of variable importance. The permutation importance was computed for the 22 predictors and their relative ranking is visualized in Figure 4.
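The procedure can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not the implementation used in the study: the classifier (`predict`), the toy data, the use of accuracy as the performance measure, and the number of repeats are all assumptions made for the example.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    correct = sum(predict(row) == label for row, label in zip(rows, labels))
    return correct / len(labels)


def permutation_importance(predict, rows, labels, column, n_repeats=20, seed=0):
    """Mean drop in accuracy after randomly permuting one predictor column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)  # breaks the link between predictor and response
        permuted = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


# Toy data: column 0 determines the label, column 1 is unrelated to it.
rows = [[i % 2, i % 3] for i in range(20)]
labels = [row[0] for row in rows]


def predict(row):
    """Hypothetical fitted model that relies on column 0 only."""
    return row[0]


imp_signal = permutation_importance(predict, rows, labels, column=0)
imp_noise = permutation_importance(predict, rows, labels, column=1)
```

Permuting the informative column lowers accuracy, so `imp_signal` is positive, whereas permuting the column that the model ignores leaves every prediction unchanged and `imp_noise` is exactly zero.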
Figure 4. Relative variable importance in classifying the four story types.
The cookie-related predictors (cookie I, cookie II, and cookie III) were estimated to be highly important. This is to be expected given that as a type the cookie stories were extremely well discriminated. More importantly, the relative ranking of the predictors indicated that predictors that could be linked to autobiographical memories were estimated to have provided a significant contribution to the model performance, for example retirement, weekend, and the family-related predictors such as family members. It is also noteworthy that variables related to the current COVID-19 crisis were estimated to be important, suggesting that people wrote about important but recent life events. Thus, the estimated variable importance suggests that the discrimination of the story types might, at least partly, be based on groupings of predictors that can be linked to autobiographical memories. In order to investigate this possibility, the contribution of an individual predictor in discriminating a given story type was estimated.