1 Introduction

Over 50 million people in the world have dementia, a syndrome involving deterioration in memory and cognitive abilities, with an annual increase of 10 million (Organization et al., 2019). Earlier diagnosis can improve patients’ quality of life by enabling better planning and medical interventions at the most effective and appropriate stage (Prince et al., 2011, 2014; Association, 2016), especially if diagnosis occurs before the onset of clinical symptoms (Ritchie et al., 2016; Mortamais et al., 2017).

Standard diagnostic biomarkers such as MRI, PET scans and cerebrospinal fluids are intrusive and expensive, and therefore unsuitable for early diagnosis. The other main family of diagnostic methods comprises cognitive tests such as the ADAS-Cog (Rosen et al., 1984), MMSE (Folstein et al., 1975) and ACE-III (Noone, 2015), widely used in clinical studies. As well as suffering from sensitivity issues (Schneider & Sano, 2009) and lagging behind biomarkers in their ability to detect dementia onset (Jack et al., 2013), they have several other caveats: they are usually administered manually; they are pencil-and-paper tests requiring an expert, and are therefore applied only after initial referral to a doctor; and they are unsuited for testing across large populations or at home. For early diagnosis, pre-screening and condition monitoring we need methods that can be automated and that are complementary to biomarkers but easily observable within everyday life without intrusion.

1.1 Language datasets for dementia

The automatic analysis of patients’ spontaneous speech and language is a promising non-invasive, inexpensive approach to screening and monitoring dementia progression, as speech and language impairments caused by dementia can occur early in the course of the disease (Fraser et al., 2015; König et al., 2018). To allow progress in this field, researchers have been working with a number of different datasets. Table 1 provides an overview of the most widely used datasets (along with a new dataset proposed by us) in terms of aspects such as the population, the amount of data, the modality, the nature of the tasks and their ability to elicit linguistic information, the duration of the elicited tasks, and the longitudinal aspect (if any). Most consist of speech obtained in a clinical setting; some include either partially (Hansebo & Kihlgren, 2002; Weiner & Schultz, 2016) or fully (Luz et al., 2020) transcribed text, or text extracted through Automatic Speech Recognition (ASR) (Luz et al., 2021).

1.2 Language based tasks

The majority of research has focused on distinguishing people with Alzheimer’s Disease (AD) from cognitively normal controls, a.k.a. the AD classification task. The Pitt (Becker et al., 1994), ADReSS (Luz et al., 2020), and ADReSSo (Luz et al., 2021) datasets have been widely used for addressing this task; among them, Pitt is the largest dataset. The datasets include speech recordings (Luz et al., 2021) or speech and transcripts of verbal descriptions (Becker et al., 1994; Luz et al., 2020) obtained by asking subjects to describe either general-purpose pictures (e.g., pictures showing animals) or the Cookie Theft picture (CTP) from the Boston Diagnostic Aphasia Examination (Goodglass, 2013). Unlike the Pitt Corpus (Becker et al., 1994), which contains annual data for the same person up to five times, ADReSS (Luz et al., 2020) and ADReSSo (Luz et al., 2021) include a single speech sample per participant. ADReSSo also contains occasional interactions between instructors and participants.

The inference of patients’ cognitive scores, a.k.a., the score regression task, has seen less attention. Several studies have extracted different linguistic and acoustic features to predict cognitive scores using the Pitt dataset (Becker et al., 1994). In Luz et al. (2020, 2021), authors use acoustic features and features extracted from transcripts in a single and bimodal setting for predicting cognitive scores, showing the effectiveness of simple modality fusion.

The task of predicting changes in cognitive status per individual over time, a.k.a. disease progression, has received even less attention due to the lack of consistent longitudinal datasets. The Pitt (Becker et al., 1994) and Carolinas (Pope & Davis, 2011) datasets are the largest longitudinal datasets currently available for studying the language of individuals with dementia. An important limitation of the Pitt corpus is that the longitudinal aspect is limited, spanning at most 5 sessions per individual, with most participants having only two narratives. On the other hand, in the Carolinas dataset, healthy controls have up to two interviews over the longitudinal study while people with dementia have between 1 and 9 sessions. Preliminary work has addressed disease progression as a classification task (Weiner & Schultz, 2016; Clark et al., 2016; Luz et al., 2021) on the basis of longitudinal assessments in ADReSSo spanning two years from dementia onset. However, spontaneous speech is only collected once.

An important limitation of existing dementia datasets is that their longitudinal aspect is limited to data snapshots at either a fixed time (Luz et al., 2020, 2021) or a few points in time (Becker et al., 1994; Pope & Davis, 2011). Moreover, the language elicitation tasks are biased towards particular genres or domains via lab-based tasks, such as the description of a particular set of images (Becker et al., 1994; Luz et al., 2020, 2021). Some studies elicit speech from more natural spontaneous conversations (Pope & Davis, 2011; Weiner & Schultz, 2016). Yet, the tasks are restricted to particular topics, well known to participants and subject to learning effects (Goldberg et al., 2015). Due to the above limitations in dementia datasets, existing work in language and speech processing for discriminating between cohorts of healthy controls and people with dementia ignores changes in language and how these relate to changes in cognition over time.

Table 1 Overview of the most widely used datasets and comparative benefits of our new proposed multimodal longitudinal dataset

Here we address the above limitations and make the following contributions:

  • We introduce a novel multi-modal dementia corpus of rich longitudinal natural conversations collected over 2 phases, each spanning 28 sessions. This was obtained from people with various forms of dementia and healthy controls on the basis of reminiscence material and in a non-clinical setting. The corpus contains speech, transcriptions, and written language (i.e., pen and keyboard modalities). To the best of our knowledge, this is the first multi-modal longitudinal dataset with this range of modalities and covering such fine-grained longitudinal spans. We present the data collection process and the dataset itself in detail.

  • We establish longitudinal tasks and baselines across different modalities and investigate language changes across the cohorts of healthy controls and people with dementia over time.

  • We conduct a set of experiments that show significant discrimination between healthy controls and people with dementia across all modalities. Our tasks and baselines pave the way for future directions enabled by our new dataset.

2 Related work

2.1 Linguistic manifestation of dementia

Dementia is often associated with reduction in vocabulary size, syntactic complexity and information content (Maxim & Bryan, 1994; Croisile et al., 1996), as well as loss of coherence, both temporal (construction of logical time sequences) and thematic (continuity of topic) (Ellis, 1996). When conducting dialogue, adults with dementia show less coherence and cohesion and more disruptive topic shifts and empty phrases (Dijkstra et al., 2004), more topically irrelevant utterances (St-Pierre et al., 2005), and characteristic ways of responding to questions (Elsey et al., 2015). Dementia has also been associated with apathy (Nobis & Husain, 2018), emotional dysregulation and mood swings, even in people with mild to moderate dementia (Petry et al., 1989); such emotional aspects are known to surface and are detectable in language (e.g. Purver & Battersby, 2012).

2.2 Language tests

Interview-based tests have therefore been developed that assess linguistic ability for diagnosis and disease progression (Taler & Phillips, 2008; Tarawneh & Holtzman, 2012). However, these tests are vulnerable to practice effects (Goldberg et al., 2015), and are not applicable to everyday spontaneous speech. Promisingly, Forbes-McKay et al. (2013, 2014) showed that linguistic characteristics of spontaneous speech and writing can reliably discriminate healthy older controls from mild-moderate AD patients, and track aspects of decline over time; however, longitudinal findings were limited by infrequent (6-monthly) data collection and their results relied on manual linguistic analysis.

2.3 NLP for dementia

More recent work has used NLP approaches for dementia detection by analysing aspects of language such as lexical, grammatical, and semantic features (Ahmed et al., 2013; Orimaye et al., 2017; Kavé & Dassa, 2018), showing that people with dementia produce less lexical and semantic context and lower syntactic complexity compared to healthy controls. Lack of fluency through the study of paralinguistic features has also been shown to be indicative of people with dementia (de Ipiña et al., 2013). The semantics and pragmatics of language appear to be affected by dementia throughout the entire span of the disease, more so than syntax (Bayles & Boone, 1982). In particular, people with dementia talk more slowly with longer pauses (Gayraud et al., 2011; de Ipiña et al., 2013; Pistono et al., 2019).

Recent work has also investigated manually engineered acoustic features to recognize AD from spontaneous speech (Luz et al., 2020, 2021), while other work exploited non-linguistic features to distinguish people with AD from healthy controls (Nasreen et al., 2021). Neural models such as LSTM and CNN (Karlekar et al., 2018) or pre-trained language models (Yuan et al., 2020) have been used to analyze disfluency characteristics, such as filled pauses. Researchers have used neural approaches to extract either acoustic features (Pan et al., 2020, 2021) or linguistic information (Zhu et al., 2021) directly from the speech signal for dementia detection.

2.4 Longitudinal language changes

Existing work has focused on distinguishing people with dementia from healthy controls without considering language changes over time. Moreover, where present, longitudinal data in current datasets are sparse. For example, in the Carolinas Corpus (Pope & Davis, 2011), the largest available longitudinal dataset, subjects have up to 9 speech records across the longitudinal study. However, only 8 people (3 dementia and 5 controls) have 9 speech records across the entire collection. Our newly introduced corpus consists of 22 people, where each subject was asked to record 28 sessions of 15 mins of speech for each of the two study phases, as well as provide written logs (See Table 1 for the comparative benefits of our dataset). Additionally, ours is the first study to investigate longitudinal language changes across modalities and how these manifest in the dementia and control cohorts.

3 Collecting the longitudinal multimodal corpus

Our goal has been to collect a longitudinal multi-modal corpus including both spontaneous speech and writing, as well as extra-linguistic information associated with language production, and to focus on interactions occurring in a non-clinical setting. There are three important novel aspects in the corpus design: I) Conversations and written thoughts as well as associated paralinguistic information are obtained on the basis of reminiscence material, specifically images from past decades on topics of general interest. Reminiscence is a meaningful and useful activity for people with and without dementia that can improve cognition, mood, and quality of life (Pinquart & Forstmeier, 2012; Gonzalez et al., 2015). II) The corpus contains daily data over two phases each spanning around 4 weeks or 28 sessions. III) The corpus is collected in the participants’ own environment using a custom-built tablet application.

3.1 Corpus collection process

According to our protocol, our corpus is collected in separate phases, each lasting four weeks or 28 sessions, where phases are 14 weeks apart. In practice, due to unforeseen delays, the period between phases has been longer than 14 weeks (see Sect. 7). Participants are paired with a carer and asked to record daily sessions during each study phase, alternating between (a) 15 mins of conversation with their carers, and (b) typed or hand-written thoughts using a stylus pen.

Both spoken and written language are elicited using reminiscence material, i.e., images from the past, created by the dementia communication specialists Many Happy Returns (footnote 1). Images are presented using a bespoke Android tablet application which records spoken and written data and sends it to a secure remote server for storage. The application was designed, developed, and tested together with our commercial clinical partner Clinvivo (footnote 2), in consultation with a stakeholder group from the Alzheimer’s Society. The application allows recording three modalities: speech, typed text (keyboard), and hand-written text (pen). For the latter two, paralinguistic information such as keystrokes, pen strokes, pen pressure and deletions is also recorded.

At the start of each 4-week phase, participants are given a tablet running the purpose-built application, which contains reminiscence material and allows recording language in the various modalities. A Mini Mental State Examination (MMSE) (Folstein et al., 1975) and an Addenbrooke’s Cognitive Examination-III (ACE-III) (Noone, 2015) are administered by a suitably qualified person at the start and end of each phase respectively as cognitive impairment benchmarks. While these tests provide only minimal cognitive data, they allow us to assess cognitive change at the comparison points needed to analyse the rich linguistic information collected.

The investigators monitor the submission of data via a remote server. If data have not been submitted for 48 hours, a research team member contacts the participant and carer to ensure they are not having difficulties. If a participant is unable to record data for longer than a week, the research team considers their withdrawal from the study. If it is deemed that a participant has lost capacity to consent, they are withdrawn from further data collection. Subject to consultation with the participants’ carers, any data collected for that participant up to that point are used in the analysis.

3.2 Reminiscence material & tablet application

The bespoke tablet application shows a participant four images every day, each representing a topic of general interest from the 50s, 60s and 70s. Each image/topic is accompanied by three questions to help initiate a conversation or thought process and provide memory “joggers”. The participant chooses one topic out of these four as well as the mode of interaction (recording a conversation, typing or writing thoughts). The four images are pseudo-randomly selected from a pool of images available in each of the 4-week phases. Figure 1 illustrates the tablet application once a subject has chosen a particular topic (here “Radio”).

Fig. 1 Screenshot of the tablet application once a topic is chosen (here “Radio”)

The image material and corresponding questions were developed by an organisation specialising in dementia communication (footnote 3) and have been used with people in care homes. The 50s and 60s material was adapted for use in the tablet application. The 70s material was created for the purpose of the study. The collected corpus covers a set of 67 images/topics. Phase 1 includes 26 topics from the 50s and Phase 2 includes 41 images from the 60s and 70s. By design, topics were meant to be repeated every so often, but based on individuals’ feedback such repetition has been minimised and each phase includes different material. In particular, Phase 1 topics include: Goblin Teasmade, Washday and Smog, Keeping Warm, Household smells, Sundays, Housewives, Budgies, Radio, Television, The cinema and music, The Goons, Weekly children’s comics, The Coronation, Holiday Camps, The Modern Era - transport, Endless freedom, Toys and books, School, Knitting, Hair, Teddy boys and teenagers, Bikes, Fashion, Immigration, National Service, Sport. Phase 2 was augmented with more images from the 70s, which seem to better fit the memory bump of our participant cohort.

3.3 Participant recruitment

Our target was to recruit a cohort of people living with dementia or MCI (n=20) and age-matched controls (n=10). These numbers are appropriate for a pilot study. Previous longitudinal research examining language indicators in people with Alzheimer’s disease over a year (Forbes-McKay et al., 2013, 2014) has recruited similar numbers of participants, with assessments every six months. Participant inclusion and exclusion criteria as well as recruitment methods are described in Figure 2.

Fig. 2 Eligibility criteria for study participants and recruitment details

Eligibility: Participants are aged 65-80 years with mild to moderate dementia, MCI, or are age-matched healthy controls. They are resident in their own home or with family and are in daily contact with a carer or family member. They must have lived in the UK during the 50s-70s so they can relate to the reminiscence material. They must be able to conduct daily conversations and write their thoughts using the provided tablet application.

Recruitment: Primary means of recruiting participants has been through the Join Dementia Research platform (JDR), mailing lists, dementia networks, dementia cafes and memory clinics. Interested candidate participants are subsequently screened by qualified staff (NIHR research nurses) in terms of providing consent and meeting the study eligibility criteria.

3.4 Transcription

A subset of the spoken data (51 sessions from 8 participants spanning several weeks) was transcribed manually by experienced dialogue transcribers, using PRAAT (footnote 4). As well as the words spoken, transcripts include utterance timings and significant non-verbal events such as pauses, laughter, crying, yawning, whispering and coughing, as well as disfluencies including mis-speaking and reformulation. The transcription convention used was developed on the basis of the CHAT protocol (MacWhinney, 1992) and techniques for transcribers (Garrard et al., 2011).

3.5 Community support and dissemination

We have created a steering committee consisting of six Alzheimer’s Society volunteers and the founder of Many Happy Returns/Real Communication Works, an organisation working closely with people living with dementia to help with engagement and dialogue. We have had meetings every few months with the entire group and work closely with a subgroup on the design and usability of the data collection application as well as any concerns that may be faced by participants. This committee has had vital input into the design of the study and the identification of suitable participants, and will contribute to the dissemination of findings at the end of the study.

The raw data cannot be made publicly available, as this would not comply with our ethics approval. However, it could be made available to interested parties subject to a non-disclosure agreement (NDA). In the future, we also aim to make publicly available pre-trained embeddings for linguistic and audio modalities.

4 Dataset description

4.1 Participant demographics

We have data from 22 participants (6 females and 4 males with no dementia diagnosis, and 3 females and 9 males with a dementia diagnosis). The average age at the time of recruitment for people with dementia was 70.9 years and for healthy controls 68.9 years. Most participants had at least 10 years of education, with about 25% having completed a University degree. Conditions represented in the collected corpus include Mild Cognitive Impairment (MCI), Alzheimer’s Disease (AD), Vascular Dementia (VD), Frontotemporal Dementia (FD), and Mixed Dementia (MD). Overall, the corpus includes 10 healthy controls, 5 people with AD, 2 people with MCI, 2 people with FD, 1 with VD, and 2 with MD (1 AD+VD, 1 AD + Lewy body Dementia), covering 816 sessions and different modalities. Table 2 summarizes dataset statistics across all modalities.

For Phase 1, the cohorts include 12 people with dementia or MCI and 10 controls. As managing the longitudinal data collection process is expensive, both in terms of human effort and in terms of the hardware, software and data storage, our protocol catered for a small number of participants (20 people with dementia and 10 controls). Similar numbers have been used in previous longitudinal studies (Forbes-McKay et al., 2013, 2014). Our study protocol targeted recruitment of twice as many people with dementia as healthy controls, since we expect greater homogeneity between controls while dementia can manifest in many different ways; in practice, we only managed to recruit 12 people with dementia in the given timeframe. Owing to delays primarily external to the study, discussed in Sect. 7, only 3 people with dementia and 6 controls were able to complete Phase 2 of the study (footnote 5).

4.2 Statistical overview of different data modalities

Table 2 Overview of the dataset. Dem=Participants with Dementia. Ctrl=Participants with no dementia diagnosis

The speech modality is the most popular, accounting for 490 sessions by 22 participants, that is 101:26 hours of audio data. The Typed/Key modality follows with 17 participants and 271 sessions, while the Hand-written/Pen modality was selected by 12 participants in 104 sessions. In general, controls record a larger number of sessions than participants with dementia (51.1 (STD=13.1) vs 29.1 (STD=11.4)). In each session, participants chose 2.5 topics on average (STD=2.6). In total, the sessions cover 66/67 unique topics. However, participants with dementia addressed fewer topics in the same number of sessions compared to controls (63 vs 65).

For speech, the mean duration of sessions is slightly shorter for the dementia group compared to controls (12:11 mins (STD=4:31) vs 12:39 mins (STD=4:07)). Figure 3 summarizes the number of recorded sessions per individual in the two groups together with the session duration in the speech modality. For the majority of sessions, conversations last between 15 and 16 mins in both groups, although some topics seem harder for both groups, resulting in shorter sessions. On average, subjects choose the same topic 1.2 (STD=0.5) and 1.3 (STD=0.6) times over the longitudinal study in the control and dementia cohorts, respectively. Overall, the duration of individual conversations/sessions is balanced across the two groups. In total, the duration of speech sessions is 50:49 hours for people with dementia and 50:36 hours for controls.

Fig. 3 Summary of recorded sessions per individual in the two cohorts together with the duration in the speech modality

Table 3 summarizes statistics of other modalities included in the corpus, i.e., typed and hand-written daily logs, and transcribed daily conversations, along with their corresponding characteristics. For the typed daily logs, healthy controls spend 20.9 minutes writing a log, while the respective length of time to produce a written log for people with dementia is 35.9 minutes. By contrast, the average number of typed characters is 2,647 for healthy controls and 1,752 for people with dementia. We see a similar pattern for the hand-written logs, with the average length of a character sequence produced for this purpose being 529 for healthy controls and 392 for people with dementia. Healthy controls are therefore able to produce more written or typed text within a shorter amount of time. Yet the average recorded pen pressure is similar across the two cohorts.

Part of the spoken conversations (for 8 speakers, 6 people with dementia and 2 controls) has been manually transcribed. Most of the manually transcribed spoken conversations were chosen to be by participants with dementia (79/84 sessions). This allows us in the future to analyse linguistic patterns that characterise people with dementia and use the corresponding para-linguistic information to fine-tune pre-trained speech-to-text models for automatic speech recognition (ASR) specialising in speech by people with dementia.

Table 3 Overview of the typed, hand-written, and transcribed conversations in accordance to their particular characteristics

5 Longitudinal multimodal language changes across dementia and control cohorts

5.1 Task

Here we showcase the utility of our newly proposed dataset by investigating longitudinal changes in language across different modalities (i.e., speech, transcribed conversation, and typed text) in relation to the two cohorts (i.e., healthy controls and people with dementia). Our goal is to identify subjects’ language variations over time. In particular, given a sequence of N sessions \(\{S_1, S_2, \ldots, S_N\}\) over the longitudinal study, we first map each of the sessions to a d-dimensional representation \(\{S_1^d, S_2^d, \ldots, S_N^d\}\), where \(d \in \mathbb{I}^{+}\). We then compute the distance D between sessions over the longitudinal study, based on cosine similarity, to measure changes in language within subjects. To this effect, we explore two tasks for calculating language changes: a) between adjacent sessions, \(D(S_t^d,S_{t+1}^d)\) where \(1 \le t < N\), called the consecutive task, and b) from the beginning of data collection up to time t, \(D(S_1^d,S_{t}^d)\) where \(1 < t \le N\), called the non-consecutive task. To summarise the resulting distances D, we consider different statistical functions (i.e., mean, median, std). To the best of our knowledge, this is the first task to allow such fine-grained multimodal longitudinal analysis, as previous work mostly considered modality-specific classification of disease progression at limited fixed time points.
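The two tasks can be illustrated with a minimal sketch in plain Python. It assumes that the distance D is taken as one minus the cosine similarity and that "std" is the population standard deviation; neither detail is stated explicitly above, and the function names and toy vectors are purely illustrative.

```python
import math
from statistics import mean, median, pstdev


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors (assumed form of D)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def consecutive_distances(sessions):
    """Consecutive task: D(S_t, S_{t+1}) for t = 1, ..., N-1."""
    return [cosine_distance(a, b) for a, b in zip(sessions, sessions[1:])]


def non_consecutive_distances(sessions):
    """Non-consecutive task: D(S_1, S_t) for t = 2, ..., N."""
    first = sessions[0]
    return [cosine_distance(first, s) for s in sessions[1:]]


def summarise(distances):
    """Summarise one subject's distances with mean, median and (population) std."""
    return {
        "mean": mean(distances),
        "median": median(distances),
        "std": pstdev(distances),
    }


# Toy session-level representations for one subject (d = 2, N = 3).
sessions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(summarise(consecutive_distances(sessions)))
print(summarise(non_consecutive_distances(sessions)))
```

Both tasks yield N-1 distances per subject; they differ only in the reference session against which each session is compared.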

5.2 Session-level representations

To obtain session-level representations for both linguistic (transcribed spoken conversations and typed logs) and audio modalities (acoustic aspects of spoken conversations), we first segment language into utterances, where an utterance is defined as an unbroken chain of spoken or written language. We then map each of the utterances into a pre-trained embedding representation. We finally construct session-level representations by averaging the utterance embeddings within sessions.
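The averaging step can be sketched as follows. The utterance embeddings are assumed to be already computed by some pre-trained model; the function name `session_representation` and the toy vectors are hypothetical and only illustrate the mean-pooling described above.

```python
def session_representation(utterance_embeddings):
    """Average a list of equal-length utterance embeddings into one session vector."""
    if not utterance_embeddings:
        raise ValueError("a session needs at least one utterance embedding")
    dim = len(utterance_embeddings[0])
    if any(len(e) != dim for e in utterance_embeddings):
        raise ValueError("all utterance embeddings must share the same dimension")
    count = len(utterance_embeddings)
    # Component-wise mean over all utterances in the session.
    return [sum(e[i] for e in utterance_embeddings) / count for i in range(dim)]


# Toy example: two 2-dimensional utterance embeddings from one session.
toy_session = [[1.0, 2.0], [3.0, 4.0]]
print(session_representation(toy_session))
```

Mean pooling keeps the session vector in the same d-dimensional space as the utterance embeddings, so sessions with different numbers of utterances remain directly comparable.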

When working on the linguistic modality, segmentation is performed on an unbroken chain of spoken language for the transcriptions and on punctuation for the typed texts. Each segmented utterance is mapped onto a fixed-size sentence representation (Reimers & Gurevych, 2019). We chose sentence embedding representations as previous work has shown their effectiveness in assessing cognition through language for mental health (Iter et al., 2018; Voleti et al., 2019).
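As an illustration of punctuation-based segmentation for typed logs, the sketch below splits text after sentence-final punctuation. The exact rule used in the study is not specified, so the choice of `.`, `!` and `?` as boundaries and the function name `segment_typed_log` are assumptions.

```python
import re

# Assumed boundary: whitespace that follows '.', '!' or '?'.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def segment_typed_log(text):
    """Split a typed log into utterances at sentence-final punctuation."""
    parts = _BOUNDARY.split(text.strip())
    return [p.strip() for p in parts if p.strip()]


toy_log = "I had a radio. We listened daily! Did you?"
print(segment_typed_log(toy_log))
```

A rule this simple also splits after abbreviations such as "Mr.", so a more careful sentence splitter may be preferable in practice.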

For the audio modality, we use an end-to-end voice activity detection model (footnote 6) to perform segmentation on speech. In line with the linguistic modality, and as previous work showed the superiority of neural representations over manually engineered acoustic features (Zhu et al., 2021), we map speech segments to pre-trained speech embeddings. Here, we use the TRIpLet Loss network (TRILL), which has achieved good performance in non-semantic speech tasks including AD classification on DementiaBank (Shor et al., 2020). We encode moments of silence by applying random initialization.

5.3 Results

We calculated the mean, median, and std of the cosine distance between session-level representations for consecutive and non-consecutive sessions, for each speaker individually. We then averaged the obtained scores of speakers within each of the two cohorts. We chose cosine distance of sentence-level representations as it has been shown in previous work to be a strong baseline for tasks in mental health (Iter et al., 2018).
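The two-stage aggregation (per speaker first, then per cohort) can be sketched as below. The speaker identifiers, cohort labels and distance values are invented for illustration, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, median, pstdev

STATS = ("mean", "median", "std")


def speaker_summary(distances):
    """Summarise one speaker's list of session-to-session distances."""
    return {
        "mean": mean(distances),
        "median": median(distances),
        "std": pstdev(distances),
    }


def cohort_averages(distances_by_speaker, cohort_by_speaker):
    """Average the per-speaker summaries within each cohort."""
    per_cohort = {}
    for speaker, distances in distances_by_speaker.items():
        cohort = cohort_by_speaker[speaker]
        per_cohort.setdefault(cohort, []).append(speaker_summary(distances))
    return {
        cohort: {stat: mean(s[stat] for s in summaries) for stat in STATS}
        for cohort, summaries in per_cohort.items()
    }


# Hypothetical distances for three speakers.
toy_distances = {"p1": [0.2, 0.4], "p2": [0.4, 0.6], "c1": [0.1, 0.1]}
toy_cohorts = {"p1": "dementia", "p2": "dementia", "c1": "control"}
toy_result = cohort_averages(toy_distances, toy_cohorts)
print(toy_result)
```

Summarising per speaker before averaging gives each participant equal weight, regardless of how many sessions they recorded.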

For speech, we noticed that the mean and median cosine distance scores differed across the two cohorts for both the consecutive and non-consecutive tasks (see Table 4). However, the distance scores were significantly higher for the dementia group (\(p<0.05\)) when we calculated changes across non-consecutive sessions. That is, changes in speech across sessions were particularly prominent in temporally distant sessions. We also investigated speech variations for people who participated in both phases of the longitudinal study. There are 9 such participants (6 controls and 3 people with dementia). Here, we averaged the session embeddings per participant within a phase and calculated the distance between the two phases. Again, participants in the dementia cohort exhibited substantial speech variations across phases (see Table 5). This further justifies the importance of collecting longitudinal language data for dementia monitoring.
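The cross-phase comparison can be sketched as follows: session embeddings are averaged within each phase and the two phase-level vectors are then compared. The sketch assumes a distance of one minus cosine similarity; the function names and toy embeddings are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def phase_distance(phase1_sessions, phase2_sessions):
    """Distance between the averaged session embeddings of two phases."""
    return cosine_distance(centroid(phase1_sessions), centroid(phase2_sessions))


# Toy session embeddings for one participant in each phase.
toy_phase1 = [[1.0, 0.0], [0.8, 0.2]]
toy_phase2 = [[0.0, 1.0], [0.2, 0.8]]
print(phase_distance(toy_phase1, toy_phase2))
```

Averaging within a phase smooths out day-to-day variation, so the resulting score reflects the shift between phases rather than between individual sessions.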

We obtained similar results when conducting experiments with transcriptions and typed texts (see Table 4). Overall, we observed that transcribed speech is most informative in capturing longitudinal language changes across the two cohorts. Yet, speech is more useful when comparing people across phases (see Table 5). In the case of typed text, while distance scores are higher for the dementia cohort, the difference was not statistically significant. We assume this is because in planned, non-spontaneous texts, such as written thoughts, the planning that goes into writing makes the text more coherent. However, the typed and written text modalities convey additional, currently unexplored, extra-linguistic information (number of deletions, pauses between keystrokes) that reflects corrections of one’s text, and this may be a better indicator of changes in cognition. In the future, we aim to investigate self-repair tasks (Rohanian & Hough, 2021) that are more appropriate for written discourse.

Table 4 Averaged distance scores between the two cohorts (people with dementia and healthy controls) and across different modalities for the non-consecutive and consecutive tasks
Table 5 Averaged distance scores between the two phases and across different modalities for subjects participating in both phases of the longitudinal study

6 Conclusion

We introduce a novel fine-grained longitudinal multi-modal corpus containing data from healthy controls and people with dementia. The dataset covers audio and text, containing spoken and transcribed conversations, written and typed logs as well as associated extra-linguistic information such as pen strokes and keystrokes. Conversations and written thoughts are elicited in a natural setting, in the participants’ own environment, triggered by reminiscence material. Specifically, people can record their thoughts via recorded audio, typed or written text through a bespoke tablet application. We present the data collection process and describe the corpus, providing statistical information about the two cohorts across the different modalities collected. We also establish baselines to capture longitudinal language changes in relation to the two cohorts and across the audio and linguistic modalities. A set of initial experiments shows that longitudinal language variations are higher in people with dementia. This effect is even more pronounced across temporally distant sessions. In the future, we aim to investigate tasks that involve language-function variations, such as coherence and disfluency, that are particularly prominent in the progression of dementia.

7 Limitations

In this work, we introduced a multi-modal longitudinal corpus for monitoring changes in dementia progression. The corpus was collected in a natural setting from healthy controls and people with dementia over two phases, each spanning 28 sessions. Moreover, subjects could choose to hold conversations or write or type their thoughts on a variety of topics from reminiscence material provided by a bespoke tablet application. Despite the novel fine-grained longitudinal multimodal nature of the corpus, an important limitation is the relatively small scale of the cohorts in the study. In particular, only a small number of people were able to participate in the second phase of the study. This was due to unforeseen disruptions to the study, first via the introduction of the GDPR in 2018, which required pausing the study to update the data collection software, and then COVID-19. This meant that in several cases 12 months or longer elapsed between phases 1 and 2, and as a result many of our participants were no longer able to participate, primarily due to a decline in their health or a change in their personal circumstances. We aim to address this limitation by expanding the existing corpus with a new data collection spanning three phases within twelve months, recruiting individuals from a collaborating memory clinic. Nevertheless, the existing dataset is the first of its kind and has opened new avenues for research in longitudinal changes in language for people with dementia and across different modalities.

Indeed, we have introduced baselines to capture longitudinal changes in language across modalities in the two cohorts. In particular, we calculated the distance between adjacent sessions and between non-adjacent sessions, with each session mapped to a fixed-size representation. A set of initial experiments showed promising results for monitoring dementia using fine-grained multi-modal longitudinal data. However, these approaches are limited in capturing the various linguistic functions associated with the progression of dementia (Tang-Wai & Graham, 2008; Klimova et al., 2015). In future work, we aim to use NLP techniques to characterise the language in terms of features that are likely to be associated with disease onset and progression and/or are suitable for detecting changes in use over time across all types of conversations, i.e., speech, transcriptions, typed and written thoughts. These will include analysis in terms of lexical, syntactic and coherence features already identified in the literature (Fraser et al., 2015; Ellis, 1996), and in terms of recent approaches that infer vector-based representations of words or speakers (embeddings) from observed use and are well suited to tracking changes in language use over time (Hamilton et al., 2016; Tsakalidis et al., 2022).
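To make the session-distance baseline concrete, the following is a minimal sketch of one way such a computation could be organised. It is an illustration under stated assumptions rather than the exact procedure used in our experiments: the choice of cosine distance, the function names `cosine_distance` and `mean_session_distance`, and the `lag` parameter are all hypothetical, and the session vectors are assumed to have been produced beforehand by some fixed-size encoder.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors.

    Assumption: cosine distance is used purely for illustration; any other
    vector distance could be substituted here.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_session_distance(sessions, lag=1):
    """Mean distance between all pairs of sessions that are `lag` steps apart.

    sessions: chronologically ordered list of fixed-size session vectors
              for one participant (hypothetical input format).
    lag:      1 compares adjacent sessions; larger values compare
              non-adjacent, temporally more distant sessions.
    """
    if not 1 <= lag < len(sessions):
        raise ValueError("lag must be at least 1 and smaller than the number of sessions")
    distances = [
        cosine_distance(sessions[i], sessions[i + lag])
        for i in range(len(sessions) - lag)
    ]
    return sum(distances) / len(distances)


# Toy example with three two-dimensional "session" vectors.
toy_sessions = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
adjacent = mean_session_distance(toy_sessions, lag=1)      # pairs (0,1) and (1,2)
non_adjacent = mean_session_distance(toy_sessions, lag=2)  # pair (0,2)
print(adjacent, non_adjacent)
```

Under this sketch, comparing the per-participant means at different lags across the two cohorts would give one simple way of contrasting adjacent and temporally distant sessions.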

8 Ethical considerations

The collection of the corpus involves ethical considerations, especially as we are working with vulnerable individuals who have dementia. The study has received ethics approval from the NHS Research Ethics Committee (REC) and the Health Research Authority (HRA), with reference number 16/WS/0226. Participating individuals as well as their carers consented to data collection and analysis for research purposes. User-identifying information was kept separate from the language data collected via the bespoke tablet application.

While data was collected anonymously, there are potential ethical concerns with using spoken language and computational approaches for monitoring changes in cognitive status and dementia. One concern is related to privacy and confidentiality, as language data may contain sensitive personal information. Other potential risks involve the misuse of models trained on the data for monitoring changes in cognition, which could be used carelessly or maliciously without considering the impact and social consequences in the broader community. To mitigate such risks, we apply strategies such as running software on authorised servers only, encrypting data during transfer, and anonymising data prior to analysis. Data is only accessed by authorised individuals, and interested parties can only obtain access subject to an NDA that carefully states the research goals.

For a real-world application, ethical concerns are related to the potential for misdiagnosis or overdiagnosis, which could lead to unnecessary treatment or psychological distress for patients and their families. Additionally, there may be issues related to access and equity, as some individuals may not have access to the technology or resources needed for speech recognition and for monitoring through language analysis. Finally, there may be concerns related to the accuracy and reliability of the technology, as well as the potential for bias in the data or algorithms used for monitoring changes. It is important to consider these ethical concerns when developing and implementing technologies for dementia monitoring and diagnosis.