A longitudinal multi-modal dataset for dementia monitoring and diagnosis

Dementia affects cognitive functions of adults, including memory, language, and behaviour. Standard diagnostic biomarkers such as MRI are costly, whilst neuropsychological tests suffer from sensitivity issues in detecting dementia onset. The analysis of speech and language has emerged as a promising and non-intrusive technology to diagnose and monitor dementia. Currently, most work in this direction ignores the multi-modal nature of human communication and interactive aspects of everyday conversational interaction. Moreover, most studies ignore changes in cognitive status over time due to the lack of consistent longitudinal data. Here we introduce a novel fine-grained longitudinal multi-modal corpus collected in a natural setting from healthy controls and people with dementia over two phases, each spanning 28 sessions. The corpus consists of spoken conversations, a subset of which are transcribed, as well as typed and written thoughts and associated extra-linguistic information such as pen strokes and keystrokes. We present the data collection process and describe the corpus in detail. Furthermore, we establish baselines for capturing longitudinal changes in language across different modalities for two cohorts, healthy controls and people with dementia, outlining future research directions enabled by the corpus.


Introduction
Over 50 million people in the world have dementia, a syndrome involving deterioration in memory and cognitive abilities, with an annual increase of 10 million [1]. Earlier diagnosis can improve patients' quality of life by enabling better planning and medical interventions at the most effective and appropriate stage [2][3][4], especially if diagnosis occurs before the onset of clinical symptoms [5,6].
Standard diagnostic biomarkers such as MRI, PET scans and cerebrospinal fluid are intrusive and expensive, and therefore unsuitable for early diagnosis. The other main family of diagnostic methods comprises cognitive tests such as the ADAS-Cog [7], MMSE [8] and ACE-III [9], widely used in clinical studies. As well as suffering from sensitivity issues [10] and lagging behind biomarkers in their ability to detect dementia onset [11], they have several other drawbacks: they are usually administered manually; they are pencil-and-paper tests requiring an expert, and are therefore applied only after initial referral to a doctor; and they are unsuited to testing across large populations or at home. For early diagnosis, pre-screening and condition monitoring, we need methods that can be automated, that are complementary to biomarkers, and that are easily observable within everyday life without intrusion.

Language datasets for dementia:
The automatic analysis of patients' spontaneous speech and language is a promising non-invasive, inexpensive approach to screening and monitoring dementia progression, as speech and language impairments caused by dementia can occur early in the course of the disease [12,13]. To allow progress in this field, researchers have been working with a number of different datasets. Table 1 provides an overview of the most widely used datasets (along with the new dataset proposed by us) in terms of different aspects such as the population, the amount of data, the modality, the nature of the tasks and their ability to elicit linguistic information, the duration of the elicited tasks, and the longitudinal aspect (if any). Most consist of speech obtained in a clinical setting; some include either partially [14,15] or fully [16] transcribed text or text extracted through Automatic Speech Recognition (ASR) [17].

Language based tasks:
The majority of research has focused on distinguishing people with Alzheimer's Disease (AD) from cognitively normal controls, a.k.a. the AD classification task. The Pitt [18], ADReSS [16], and ADReSSo [17] datasets have been widely used for addressing this task; among them, Pitt is the largest. The datasets include speech recordings [17] or speech and transcripts of verbal descriptions [16,18] obtained by asking subjects to describe either general-purpose pictures (e.g., pictures showing animals) or the Cookie Theft picture (CTP) from the Boston Diagnostic Aphasia Examination [19]. Unlike the Pitt Corpus [18], which contains annual data for the same person up to five times, ADReSS [16] and ADReSSo [17] include a single speech sample per participant. ADReSSo also contains occasional interactions between instructors and participants.
The inference of patients' cognitive scores, a.k.a. the score regression task, has received less attention. Several studies have extracted different linguistic and acoustic features to predict cognitive scores using the Pitt dataset [18]. In [16,17], the authors use acoustic features and features extracted from transcripts in single-modality and bimodal settings for predicting cognitive scores, showing the effectiveness of simple modality fusion.
The task of predicting changes in cognitive status per individual over time, a.k.a. disease progression, has received even less attention due to the lack of consistent longitudinal datasets. The Pitt [18] and Carolinas [20] datasets are the largest longitudinal datasets currently available for studying the language of individuals with dementia. An important limitation of the Pitt corpus is that its longitudinal aspect is limited, spanning at most 5 sessions per individual, with most participants having two narratives only. In the Carolinas dataset, on the other hand, healthy controls have up to two interviews over the longitudinal study while people with dementia have between 1 and 9 sessions. Preliminary work has addressed disease progression as a classification task [15,17,21] on the basis of longitudinal assessments in ADReSSo spanning two years from dementia onset. However, spontaneous speech is only collected once.
An important limitation of existing dementia datasets is that their longitudinal aspect is limited to data snapshots at either a fixed time [16,17] or a few points in time [18,20]. Moreover, the language elicitation tasks are biased towards particular genres or domains via lab-based tasks, such as the description of a particular set of images [16][17][18]. Some studies elicit speech from more natural spontaneous conversations [15,20]. Yet, the tasks are restricted to particular topics, well known to participants and subject to learning effects [22]. Due to the above limitations in dementia datasets, existing work in language and speech processing for discriminating between cohorts of healthy controls and people with dementia ignores changes in language and how these relate to changes in cognition over time.
Here we address the above limitations and make the following contributions:
• We introduce a novel multi-modal dementia corpus of rich longitudinal natural conversations collected over 2 phases, each spanning 28 sessions. This was obtained from people with various forms of dementia and healthy controls on the basis of reminiscence material and in a non-clinical setting. The corpus contains speech, transcriptions, and written language (i.e., pen and keyboard modalities). To the best of our knowledge, this is the first multimodal longitudinal dataset with this range of modalities and covering such fine-grained longitudinal spans. We present the data collection process and the dataset itself in detail.
• We establish longitudinal tasks and baselines across different modalities and investigate language changes across the cohorts of healthy controls and people with dementia over time.
• We conduct a set of experiments that show significant discrimination between healthy controls and people with dementia across all modalities.
Our tasks and baselines pave the way for future research directions enabled by our new dataset.

Related work
Linguistic manifestation of dementia: Dementia is often associated with reduction in vocabulary size, syntactic complexity and information content [24,25], as well as loss of coherence, both temporal (construction of logical time sequences) and thematic (continuity of topic) [26]. When conducting dialogue, adults with dementia show less coherence and cohesion and more disruptive topic shifts and empty phrases [27], more topically irrelevant utterances [28], and characteristic ways of responding to questions [29]. Dementia has also been associated with apathy [30], emotional dysregulation and mood swings, even in people with mild to moderate dementia [31]; such emotional aspects are known to surface and are detectable in language [e.g., 32].

Language tests:
Interview-based tests have therefore been developed that assess linguistic ability for diagnosis and disease progression [33,34]. However, these tests are vulnerable to practice effects [22], and are not applicable to everyday spontaneous speech. Promisingly, [35,36] showed that linguistic characteristics of spontaneous speech and writing can reliably discriminate healthy older controls from mild-moderate AD patients, and track aspects of decline over time; however, longitudinal findings were limited by infrequent (6-monthly) data collection and their results relied on manual linguistic analysis.

NLP for dementia:
More recent work has used NLP approaches for dementia detection by analysing aspects of language such as lexical, grammatical, and semantic features [37][38][39], showing that people with dementia produce less lexical and semantic content and lower syntactic complexity compared to healthy controls. Lack of fluency, studied through paralinguistic features, has also been shown to be indicative of dementia [40]. The semantics and pragmatics of language appear to be affected by dementia throughout the entire span of the disease, more so than syntax [41]. In particular, people with dementia talk more slowly with longer pauses [40,42,43].
Recent work has also investigated manually engineered acoustic features to recognize AD from spontaneous speech [16,17], while other work exploited non-linguistic features to distinguish people with AD from healthy controls [44]. Neural models such as LSTM and CNN [45] or pre-trained language models [46] have been used to analyze disfluency characteristics, such as filled pauses.
Researchers have used neural approaches to extract either acoustic features [47,48] or linguistic information [49] directly from the speech signal for dementia detection.

Longitudinal language changes:
Existing work has focused on distinguishing people with dementia from healthy controls without considering language changes over time. Moreover, where present, longitudinal data in current datasets are sparse. For example, in the Carolinas Corpus [20], the largest available longitudinal dataset, subjects have up to 9 speech records across the longitudinal study. However, only 8 people (3 dementia and 5 controls) have 9 speech records across the entire collection. Our newly introduced corpus consists of 22 people, where each subject was asked to record 28 sessions of 15 mins of speech for each of the two study phases, as well as provide written logs (see Table 1 for the comparative benefits of our dataset). Additionally, ours is the first study to investigate longitudinal language changes across modalities and how these manifest in the dementia and control cohorts.

Collecting the longitudinal multimodal corpus
Our goal has been to collect a longitudinal multi-modal corpus including both spontaneous speech and writing, as well as extra-linguistic information associated with language production, and to focus on interactions occurring in a non-clinical setting. There are three important novel aspects in the corpus design: I) Conversations and written thoughts as well as associated paralinguistic information are obtained on the basis of reminiscence material, specifically images from past decades on topics of general interest. Reminiscence is a meaningful and useful activity for people with and without dementia that can improve cognition, mood, and quality of life [50,51]. II) The corpus contains daily data over two phases, each spanning around 4 weeks or 28 sessions. III) The corpus is collected in the participants' own environment using a custom-built tablet application.

Corpus collection process
According to our protocol, our corpus is collected in separate phases each lasting four weeks or 28 sessions, where phases are 14 weeks apart. In practice, due to unforeseen delays, the period between phases has been longer than 14 weeks (see the Limitations section). Participants are paired with a carer and asked to record daily sessions during each study phase, alternating between (a) 15 mins of conversation with their carers, and (b) typed or hand-written thoughts using a stylus pen.
Both spoken and written language are elicited using reminiscence material, i.e., images from the past, created by the dementia communication specialists Many Happy Returns. Images are presented using a bespoke Android tablet application which records spoken and written data and sends it to a secure remote server for storage. The application was designed, developed, and tested together with our commercial clinical partner Clinvivo, in consultation with a stakeholder group from the Alzheimer's Society. The application allows recording three modalities: speech, typed text (keyboard), and hand-written text (pen). For the latter two, paralinguistic information such as keystrokes, pen strokes, pen pressure and deletions is also recorded.
At the start of each 4-week phase, participants are given a tablet running the purpose-built application, which contains reminiscence material and allows recording language in the various modalities. A Mini Mental State Examination (MMSE) [8] and an Addenbrooke's Cognitive Examination-III (ACE-III) [9] are administered by a suitably qualified person at the start and end of each phase respectively as cognitive impairment benchmarks. While these tests provide only minimal cognitive data, they allow us to assess cognitive change at the comparison points needed to analyse the rich linguistic information collected.
The investigators monitor the submission of data via a remote server. If data have not been submitted for 48 hours, a research team member contacts the participant and carer to ensure they are not having difficulties. If a participant is unable to record data for longer than a week, the research team considers their withdrawal from the study. If it is deemed that a participant has lost capacity to consent, they are withdrawn from further data collection. Subject to consultation with the participants' carers, any data collected for that participant up to that point are used in the analysis.

Reminiscence Material & Tablet application
The bespoke tablet application shows a participant four images every day, each representing a topic of general interest from the 50s, 60s and 70s. Each image/topic is accompanied by three questions to help initiate a conversation or thought process and provide memory "joggers". The participant chooses one topic out of these four as well as the mode of interaction (recording a conversation, typing or writing thoughts). The four images are pseudo-randomly selected from a pool of images available in each of the 4-week phases. Figure 1 illustrates the tablet application once a subject has chosen a particular topic (here "Radio"). The image material and corresponding questions were developed by an organisation specialising in dementia communication and have been used with people in care homes. The 50s and 60s material was adapted for use in the tablet application. The 70s material was created for the purpose of the study. The collected corpus covers a set of 67 images/topics. Phase 1 includes 26 topics from the 50s and Phase 2 includes 41 images from the 60s and 70s. By design, topics were meant to be repeated every so often, but based on participants' feedback such repetition has been minimised and each phase includes different material. In particular, Phase 1 topics include: Goblin Teasmade, Washday and Smog, Keeping Warm, Household smells, Sundays, Housewives, Budgies, Radio, Television, The cinema and music, The Goons, Weekly children's comics, The Coronation, Holiday Camps, The Modern Era - transport, Endless freedom, Toys and books, School, Knitting, Hair, Teddy boys and teenagers, Bikes, Fashion, Immigration, National Service, Sport. Phase 2 was augmented with more images from the 70s, which seem to better fit the memory bump of our participant cohort.
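The text does not describe how the application implements its pseudo-random choice of four daily images, so the following is only a minimal sketch of one plausible approach. The function name `daily_topics`, the seeding scheme and the shortened topic pool are assumptions made for illustration, not the application's actual logic.

```python
import random

# A few Phase 1 topic names taken from the list above (shortened for brevity).
PHASE1_TOPICS = [
    "Goblin Teasmade", "Washday and Smog", "Keeping Warm",
    "Household smells", "Sundays", "Housewives", "Budgies", "Radio",
]

def daily_topics(pool, day, seed=0, k=4):
    """Return k distinct topics for a given day (hypothetical selection rule)."""
    # Assumed seeding scheme: one reproducible generator per (seed, day) pair.
    rng = random.Random(seed * 100_003 + day)
    # sample() draws without replacement, so the k topics are distinct.
    return rng.sample(pool, k)

todays = daily_topics(PHASE1_TOPICS, day=1)
```

Seeding per day keeps the selection reproducible, which would make it possible to reconstruct afterwards which four topics a participant was offered in a given session.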

Participant Recruitment
Our target was to recruit a cohort of people living with dementia or MCI (n=20) and age-matched controls (n=10). These numbers are appropriate for a pilot study. Previous longitudinal research examining language indicators in people with Alzheimer's disease over a year [35,36] has recruited similar numbers of participants, with assessments every six months. Participant inclusion and exclusion criteria as well as recruitment methods are described in Figure 2.
Participants in the study need to:
• be aged 65-80 years old, with a mild to moderate dementia diagnosis, MCI, or not suspected of having dementia (healthy control).
• be resident in their own home or with their family.
• be in daily contact with a carer or family member.
• have been raised in Britain or be able to relate to British social culture from the 50s, 60s and 70s (so as to match the background of current research and conversation stimulation material).
• be able to access a broadband connection at home.
• be able and willing to record conversations with their carer in English.
• be able and willing to write text in English.
• be able and willing to engage with the study on a daily basis for 15 minutes.
• be available for a twelve week intermittent period within a year.
Carers in the study need to:
• be willing to record daily conversations with participants in English.
• be available throughout the participant's time on the study.
Means of recruitment: Primary means of recruiting participants will be through the Join Dementia Research platform (JDR), mailing lists, dementia networks, dementia cafes and memory clinics. Interested candidate participants who meet the criteria will be subsequently screened by qualified staff such as NIHR research nurses in terms of suitability to our criteria; they will be asked to take an MMSE test and consent will be taken.

Transcription
A subset of the spoken data (51 sessions from 8 participants spanning several weeks) was transcribed manually by experienced dialogue transcribers, using PRAAT. As well as the words spoken, transcripts include significant non-verbal events such as utterance timings, pauses, laughter, crying, yawning, whispering and coughing, as well as disfluencies including mis-speaking and reformulation. The transcription convention used was developed on the basis of the CHAT protocol [52] and techniques for transcribers [53].

Community support and dissemination
We have created a steering committee consisting of six Alzheimer's Society volunteers and the founder of Many Happy Returns/Real Communication Works, an organisation working closely with people living with dementia to help with engagement and dialogue. We have had meetings every several months with the entire group and work closely with a subgroup on the design and usability of the data collection application as well as any concerns that may be faced by participants. This committee has had vital input into the design of the study, the identification of suitable participants as well as dissemination of findings at the end of the study. The raw data itself cannot be made publicly available as this does not comply with our ethics approval. Yet, it could be made available to interested parties subject to a non-disclosure agreement (NDA). In the future, we also aim to make publicly available pre-trained embeddings for linguistic and audio modalities.

Participant demographics
We have data from 22 participants (6 females and 4 males with no dementia diagnosis, and 3 females and 9 males with a dementia diagnosis). The average age at the time of recruitment for people with dementia was 70.9 years and for healthy controls 68.9 years. Most participants had at least 10 years of education, with about 25% having completed a University degree. Conditions represented in the collected corpus include Mild Cognitive Impairment (MCI), Alzheimer's Disease (AD), Vascular Dementia (VD), Frontotemporal Dementia (FD), and Mixed Dementia (MD). Overall, the corpus includes 10 healthy controls, 5 people with AD, 2 people with MCI, 2 people with FD, 1 with VD, and 2 with MD (1 AD + VD, 1 AD + Lewy body Dementia), covering 816 sessions and different modalities. Table 2 summarizes dataset statistics across all modalities.
For Phase 1, the cohorts include 12 people with dementia or MCI and 10 controls. As managing the longitudinal data collection process is expensive, both in terms of human effort and in terms of the hardware, software and data storage, our protocol catered for a small number of participants (20 people with dementia and 10 controls). Similar numbers have been used in previous longitudinal studies [35,36]. Our study protocol targeted recruitment of twice as many people with dementia as healthy controls, since we expect greater homogeneity between controls while dementia can manifest in many different ways; in practice, however, we only managed to recruit 12 people with dementia in the given timeframe. Owing to delays primarily external to the study, discussed in the Limitations section, only 3 people with dementia and 6 controls were able to complete Phase 2 of the study. The speech modality is the most popular, accounting for 490 sessions by 22 participants, that is, 101:26 hours of audio data. The Typed/Key modality follows with 17 participants and 271 sessions, while the Hand-written/Pen modality was selected by 12 participants in 104 sessions. In general, controls recorded more sessions than participants with dementia (51.1 (STD=13.1) vs 29.1 (STD=11.4)). In each session, participants chose 2.5 topics on average (STD=2.6). In total, the sessions cover 66/67 unique topics. However, participants with dementia addressed fewer topics in the same number of sessions compared to controls (63 vs 65).

Statistical Overview of different Data Modalities
For speech, the mean duration of sessions is slightly shorter for the dementia group compared to controls (12:11 mins (STD=4:31) vs 12:39 mins (STD=4:07)). Figure 3 summarizes the number of recorded sessions per individual in the two groups together with the session duration in the speech modality. For the majority of sessions, conversations last between 15 and 16 mins in both groups, although some topics seem harder for both groups, resulting in shorter sessions. On average, subjects choose the same topic 1.2 (STD=0.5) and 1.3 (STD=0.6) times over the longitudinal study in the control and dementia cohorts, respectively. Overall, the duration of individual conversations/sessions is balanced across the two groups. In total, the duration of speech sessions is 50:49 hours for people with dementia and 50:36 hours for controls.
Table 3 summarizes statistics of other modalities included in the corpus, i.e., typed and hand-written daily logs, and transcribed daily conversations, along with their corresponding characteristics. For the typed daily logs, healthy controls spend 20.9 minutes writing a log, while the respective length of time to produce a written log for people with dementia is 35.9 minutes. By contrast, the average number of typed characters is 2,647 for healthy controls and 1,752 for people with dementia. We see a similar pattern for the hand-written logs, with the average length of a character sequence produced for this purpose being 529 for healthy controls and 392 for people with dementia. Therefore, healthy controls are able to produce more written or typed text within a shorter amount of time. Yet the average recorded pen pressure is similar across the two cohorts. Part of the spoken conversations (for 8 speakers, 6 people with dementia and 2 controls) has been manually transcribed. Most of the manually transcribed spoken conversations were chosen to be by participants with dementia (79/84 sessions). This allows us in the future to analyse linguistic patterns that characterise people with dementia and use the corresponding para-linguistic information to fine-tune pre-trained speech-to-text models for automatic speech recognition (ASR) specialising in speech by people with dementia.

Longitudinal multimodal language changes across dementia and control cohorts

Task
Here we showcase the utility of our newly proposed dataset by investigating longitudinal changes in language across different modalities (i.e., speech, transcribed conversation, and typed text) in relation to the two cohorts (i.e., healthy controls and people with dementia). Our goal is to identify subjects' language variations over time. In particular, given a sequence of $N$ sessions $\{S_1, S_2, \ldots, S_N\}$ over the longitudinal study, we first map each of the sessions to a $d$-dimensional representation $\{S^d_1, S^d_2, \ldots, S^d_N\}$ such that $d \in \mathbb{Z}^+$. We then compute the distance $D$ across different sessions over the longitudinal study through cosine similarity for measuring changes in language within subjects. To this end, we explore two tasks by calculating language changes: a) between adjacent sessions, $D(S^d_t, S^d_{t+1})$ where $t \in \mathbb{N}$, called the consecutive task, and b) from the beginning of data collection up to time $t$, $D(S^d_1, S^d_t)$ where $t \in \mathbb{N}$ and $t > 1$, called the non-consecutive task. For calculating the distance $D$, we consider different statistical functions (i.e., mean, median, std). To the best of our knowledge, this is the first task to allow such fine-grained multimodal longitudinal analysis, as previous work mostly considered modality-specific classification of disease progression at limited fixed time points.
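As an illustration of the two tasks, the sketch below computes the consecutive and non-consecutive distances for one subject. It assumes that the distance is defined as one minus cosine similarity, which the text does not state explicitly; the function names and the toy two-dimensional embeddings are made up for the example.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def consecutive_distances(sessions):
    """D(S_t, S_{t+1}) for every pair of adjacent sessions."""
    return [cosine_distance(sessions[t], sessions[t + 1])
            for t in range(len(sessions) - 1)]

def non_consecutive_distances(sessions):
    """D(S_1, S_t) for every session t > 1."""
    return [cosine_distance(sessions[0], s) for s in sessions[1:]]

# Toy 2-dimensional session embeddings for one subject (made-up values).
sessions = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
consecutive = consecutive_distances(sessions)          # sessions (1,2), (2,3)
non_consecutive = non_consecutive_distances(sessions)  # sessions (1,2), (1,3)
```

Both functions return one distance per session pair, so a subject with N sessions yields N-1 values for each task.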

Session-level representations
To obtain session-level representations for both linguistic (transcribed spoken conversations and typed logs) and audio modalities (acoustic aspects of spoken conversations), we first segment language into utterances, where an utterance is defined as an unbroken chain of spoken or written language. We then map each of the utterances into a pre-trained embedding representation. We finally construct session-level representations by averaging the utterance embeddings within sessions.
When working on the linguistic modality, segmentation is performed on an unbroken chain of spoken language for the transcriptions and on punctuation for the typed texts. Each segmented utterance is mapped onto a fixed-size sentence representation [54]. We chose sentence embedding representations as previous work has shown their effectiveness in assessing cognition through language for mental health [55,56].
For the audio modality, we use an end-to-end voice activity detection model to perform segmentation on speech. In line with the linguistic modality, and as previous work showed the superiority of neural representations over manually engineered acoustic features [49], we map speech segments to pre-trained speech embeddings. Here, we use the TRIpLET Loss network (TRILL), which has achieved good performance in non-semantic speech tasks including AD classification on DementiaBank [57]. We encode moments of silence by applying random initialization.
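The typed-text branch of this pipeline can be sketched as follows. This is an assumption-laden illustration: `toy_embed` is a hypothetical stand-in for the pre-trained sentence encoder of [54], and the regular expression is only one possible way of segmenting on punctuation.

```python
import re
import numpy as np

def split_on_punctuation(text):
    """Split typed text into utterances at sentence-final punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def toy_embed(utterance, dim=8):
    """Hypothetical stand-in for a pre-trained sentence encoder."""
    vec = np.zeros(dim)
    for i, ch in enumerate(utterance.lower()):
        vec[(ord(ch) + i) % dim] += 1.0
    # Normalise to unit length; the guard avoids division by zero.
    return vec / max(np.linalg.norm(vec), 1e-12)

def session_embedding(text, embed=toy_embed):
    """Average the utterance embeddings of one session."""
    utterances = split_on_punctuation(text)
    return np.mean([embed(u) for u in utterances], axis=0)

log = "We listened to the radio every evening. Dad liked the news! Did you?"
utterances = split_on_punctuation(log)  # three utterances
vector = session_embedding(log)         # one fixed-size vector per session
```

In practice the `embed` argument would be replaced by a call to the chosen sentence or speech encoder; the averaging step stays the same for either modality.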

Results
We calculated the mean, median, and std cosine distance of session-level representations between consecutive and non-consecutive sessions for each speaker individually. We then averaged the obtained scores of speakers across the two cohorts. We chose cosine distance of sentence-level representations as it has been shown in previous work to be a strong baseline for tasks in mental health [55].
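The two-step aggregation described above (statistics per speaker, then an average over the cohort) can be sketched as below. The speaker identifiers and distance values are made up, and the use of the population standard deviation (NumPy's default) is an assumption.

```python
import numpy as np

def cohort_average(per_speaker_distances):
    """Average per-speaker mean/median/std distances over a cohort."""
    # One row per speaker: [mean, median, std] of that speaker's distances.
    stats = np.array([
        [np.mean(d), np.median(d), np.std(d)]
        for d in per_speaker_distances.values()
    ])
    mean, median, std = stats.mean(axis=0)
    return {"mean": float(mean), "median": float(median), "std": float(std)}

# Made-up distances for two hypothetical speakers in one cohort.
toy_cohort = {
    "speaker_a": [0.1, 0.3],
    "speaker_b": [0.2, 0.2, 0.5],
}
result = cohort_average(toy_cohort)
```

Computing statistics per speaker first means that each participant contributes equally to the cohort score, regardless of how many sessions they recorded.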
For speech, we noticed that the mean and median cosine distance scores were different across the two cohorts for both the consecutive and non-consecutive tasks (see Table 4). However, the distance scores were significantly higher for the dementia group (p < 0.05) when we calculated changes across non-consecutive sessions. That is, changes in speech across sessions were particularly prominent in temporally distant sessions. We also investigated speech variations for people who participated in both phases of the longitudinal study. There are 9 such participants (6 controls and 3 people with dementia). Here, we averaged the session embeddings per participant within a phase and calculated the distance between the two phases. Again, participants in the dementia cohort exhibited substantial speech variations across phases (see Table 5). This further justifies the importance of collecting longitudinal language data for dementia monitoring.
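The cross-phase comparison can be sketched in the same style. Again this assumes cosine distance defined as one minus cosine similarity, and the toy embeddings are made up for illustration.

```python
import numpy as np

def phase_distance(phase1_sessions, phase2_sessions):
    """Cosine distance between the mean session embeddings of two phases."""
    p1 = np.mean(phase1_sessions, axis=0)  # phase-level embedding, Phase 1
    p2 = np.mean(phase2_sessions, axis=0)  # phase-level embedding, Phase 2
    cos = p1 @ p2 / (np.linalg.norm(p1) * np.linalg.norm(p2))
    return 1.0 - float(cos)

# Toy session embeddings for one participant (made-up values).
phase1 = np.array([[1.0, 0.0], [1.0, 0.0]])
phase2 = np.array([[0.0, 1.0], [0.0, 1.0]])
same = phase_distance(phase1, phase1)       # identical phases
different = phase_distance(phase1, phase2)  # orthogonal phase embeddings
```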
We obtained similar results when conducting experiments with transcriptions and typed texts (see Table 4). Overall, we observed that transcribed speech is most informative in capturing longitudinal language changes across the two cohorts. Yet, speech is more useful when comparing people across phases (see Table 5). In the case of typed text, while distance scores are higher for the dementia cohort, the difference was not statistically significant. We assume this is because in planned, non-spontaneous texts, such as written thoughts, the planning that goes into writing makes the text more coherent. However, the typed and written text modalities convey additional, currently unexplored, extra-linguistic information (number of deletions, pauses between keystrokes) that shows corrections of one's text, and this may be a better indicator of changes in cognition. In the future, we aim to investigate self-repair tasks [58] that are more appropriate for written discourse.

Conclusion
We introduce a novel fine-grained longitudinal multi-modal corpus containing data from healthy controls and people with dementia. The dataset covers audio and text, containing spoken and transcribed conversations, written and typed logs as well as associated extra-linguistic information such as pen strokes and keystrokes. Conversations and written thoughts are elicited in a natural setting, in the participants' own environment, triggered by reminiscence material. Specifically, people can record their thoughts via recorded audio, typed or written text through a bespoke tablet application. We present the data collection process and describe the corpus, providing statistical information about the two cohorts across the different modalities collected. We also establish baselines to capture longitudinal language changes in relation to the two cohorts and across the audio and linguistic modalities. A set of initial experiments shows that longitudinal language variations are higher in people with dementia. This effect is even more pronounced across temporally distant sessions. In the future, we aim to investigate tasks that involve language-function variations, such as coherence and disfluency, that are particularly prominent in the progression of dementia.

Limitations
In this work, we introduced a multi-modal longitudinal corpus for monitoring changes in dementia progression.The corpus was collected in a natural setting from healthy controls and people with dementia over two phases, each spanning 28 sessions.Moreover, subjects could choose to hold conversations or write or type their thoughts on a variety of topics from reminiscence material provided by a bespoke tablet application.Despite the novel fine-grained longitudinal multimodal nature of the corpus, an important limitation is the relatively small-scale cohorts in the study.In particular there is only a small number of people who where able to participate in the second phase of the study.This was due to unforeseen disruptions to the study first via the introduction of GDPR regulation in 2018, which required pausing of the study to update software for data collection, and then COVID-19.This meant that in several cases 12 months or longer elapsed between phases 1 and 2 and as a result many of our participants were no longer able to participate, primarily due to a decline in their health or change in their personal circumstances.We aim to address this limitation by expanding the existing corpus with a new data collection spanning three phases within twelve months, by recruiting individuals from a collaborating memory clinic.Nevertheless the existing dataset is the first of its kind and has opened new avenues for research in longitudinal changes in language for people with dementia and across different modalities.
Indeed, we have introduced baselines to capture longitudinal changes in language across modalities in the two cohorts. In particular, we calculated the distance between adjacent sessions and between non-adjacent sessions after mapping each session to a fixed-size representation. A set of initial experiments showed promising results for monitoring dementia using fine-grained multi-modal longitudinal data. However, these approaches are limited in capturing the various linguistic functions associated with the progression of dementia [59,60]. In future work, we aim to use NLP techniques to characterise the language in terms of features likely to be associated with disease onset and progression and/or suitable for detecting changes in use over time across all types of conversations, i.e., speech, transcriptions, typed and written thoughts. These will include analysis in terms of lexical, syntactic and coherence features already identified in the literature [12,26], and in terms of recent approaches which infer vector-based representations of words or speakers (embeddings) from observed use and are well suited to tracking changes in language use over time [61,62].
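The session-distance baseline described above can be illustrated with a minimal sketch. It assumes that each session has already been mapped to a fixed-size vector and that cosine distance is used; the choice of distance, the function names `cosine_distance` and `session_distances`, and the `min_gap` parameter are illustrative assumptions rather than the exact setup used in our experiments.

```python
import math
from statistics import mean


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors.

    Cosine distance is an assumption made for this sketch.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def session_distances(sessions, min_gap=2):
    """Average distances over consecutive and non-consecutive session pairs.

    sessions: fixed-size vectors for one participant, in chronological order.
    min_gap:  smallest index gap treated as non-consecutive (hypothetical
              parameter introduced for illustration).
    Returns a dict whose values are means, or None when no pair exists.
    """
    n = len(sessions)
    # Adjacent sessions: (0, 1), (1, 2), ...
    consecutive = [
        cosine_distance(sessions[i], sessions[i + 1]) for i in range(n - 1)
    ]
    # Temporally distant sessions: every pair at least min_gap apart.
    non_consecutive = [
        cosine_distance(sessions[i], sessions[j])
        for i in range(n)
        for j in range(i + min_gap, n)
    ]
    return {
        "consecutive": mean(consecutive) if consecutive else None,
        "non_consecutive": mean(non_consecutive) if non_consecutive else None,
    }


if __name__ == "__main__":
    # Toy vectors standing in for three session representations.
    toy_sessions = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(session_distances(toy_sessions))
```

Averaging these per-participant scores within each cohort would give one consecutive and one non-consecutive value per cohort and modality, which can then be compared across cohorts.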

Ethical Considerations
The collection of the corpus involves ethical considerations, especially as we are working with vulnerable individuals who have dementia. The study has received ethics approval from the NHS Research Ethics Committee (REC) and the Health Research Authority (HRA), with reference number 16/WS/0226. Participating individuals as well as their carers consented to data collection and analysis for research purposes. User-identifying information was kept separate from the language data collected via the bespoke tablet application.
While data was collected anonymously, there are potential ethical concerns with using spoken language and computational approaches for monitoring changes in cognitive status and dementia. One concern is related to privacy and confidentiality, as language data may contain sensitive personal information. Other potential risks involve the misuse of models trained on the data for monitoring changes in cognition, which could be used carelessly or maliciously without considering the impact and social consequences in the broader community. To mitigate such risks, we apply strategies such as running software on authorised servers only, encrypting data during transfer, and anonymising data prior to analysis. Data is only accessed by authorised individuals, and interested parties can only obtain access subject to a non-disclosure agreement (NDA) which carefully states research goals.
For a real-world application, ethical concerns are related to the potential for misdiagnosis or overdiagnosis, which could lead to unnecessary treatment or psychological distress for patients and their families. Additionally, there may be issues related to access and equity, as some individuals may not have access to the necessary technology or resources for speech recognition and monitoring through analysis of language. Finally, there may be concerns related to the accuracy and reliability of the technology, as well as the potential for bias in the data or algorithms used for monitoring changes. It is important to consider these ethical concerns when developing and implementing technologies for dementia monitoring and diagnosis.

Fig. 2 Eligibility criteria for study participants and recruitment details.

Fig. 3 Summary of recorded sessions per individual in the two cohorts together with the duration in the speech modality.

Table 1
Overview of the most widely used datasets and comparative benefits of our newly proposed multi-modal longitudinal dataset. Elicitation = Data elicitation task. Duration = Average speech duration of the elicited task in minutes. CTP = Cookie Theft Picture. Dem = Participants with Dementia. Ctrl = Participants with no dementia diagnosis. AD Class = Alzheimer's Dementia classification. Score Reg = Score Regression. Monologue speech, ‡ Occasional interactions between clinicians and participants.

Table 2
Overview of the dataset. Dem = Participants with Dementia. Ctrl = Participants with no dementia diagnosis.

Table 3
Overview of the typed, hand-written, and transcribed conversations according to their particular characteristics. Numbers in parentheses are standard deviations.

Table 4
Averaged distance scores between the two cohorts (people with dementia and healthy controls) and across different modalities for the non-consecutive and consecutive tasks. Numbers in bold indicate a significant difference across cohorts. Results were rounded to the nearest thousandth.

Table 5
Averaged distance scores between the two phases and across different modalities for subjects participating in both phases of the longitudinal study. Numbers in bold indicate a significant difference across cohorts.