Automatic detection of cognitive impairment in elderly people using an entertainment chatbot with Natural Language Processing capabilities

Previous researchers have proposed intelligent systems for therapeutic monitoring of cognitive impairments. However, most existing practical approaches for this purpose are based on manual tests. This raises issues such as excessive caretaking effort and the white-coat effect. To avoid these issues, we present an intelligent conversational system for entertaining elderly people with news of their interest that monitors cognitive impairment transparently. Automatic chatbot dialogue stages allow assessing content description skills and detecting cognitive impairment with Machine Learning algorithms. We create these dialogue flows automatically from updated news items using Natural Language Generation techniques. The system also infers the gold standard of the answers to the questions, so it can assess cognitive capabilities automatically by comparing these answers with the user responses. It employs a similarity metric with values in [0, 1], in increasing level of similarity. To evaluate the performance and usability of our approach, we have conducted field tests with a test group of 30 elderly people in the earliest stages of dementia, under the supervision of gerontologists. In the experiments, we have analysed the effect of stress and concentration in these users. Those without cognitive impairment performed up to five times better. In particular, the similarity metric varied between 0.03, for stressed and unfocused participants, and 0.36, for relaxed and focused users. Finally, we developed a Machine Learning algorithm based on textual analysis features for automatic cognitive impairment detection, which attained accuracy, F-measure and recall levels above 80%. We have thus validated the automatic approach to detect cognitive impairment in elderly people based on entertainment content. The results suggest that the solution has strong potential for long-term user-friendly therapeutic monitoring of elderly people.


Introduction
The United Nations has reported (World Population Prospects report) that 9% of the world population is over 65 years old, and this percentage will reach 16% in 30 years. The population older than 80 is growing even faster, and it is expected to reach 450 million by 2050. This urges society to find innovative solutions to improve the living conditions of our elders, especially of those who live alone (Callahan et al. 2014; Hancock et al. 2006).
A main issue for the quality of life of elderly people is the high prevalence of cognitive impairment disorders. Affected people are mainly over 65 years old; they numbered 50 million in 2015, and this figure is expected to triple by 2050 (Livingston et al. 2017). Regular screening to detect early symptoms and monitor the progression of these disorders is considered beneficial for treatment planning and patient autonomy (Borson et al. 2013). However, it has also been observed that cognitive impairment assessment in primary care systems is inefficient (Löppönen et al. 2003; Boise et al. 2004).
As discussed in Section 2, even though existing telecare set-top-boxes and gateways have increasingly intelligent capabilities, they still do not communicate autonomously with the elderly using natural language. This is also the case of the solutions for cognitive evaluation, which mostly rely on sets of predefined manual tests.
Given this industrial gap, and as supported by the analysis of the state of the art in Section 2, we propose a novel conversational system for entertainment and therapeutic monitoring of elderly people that relies on NLP techniques and Machine Learning for empathetic chatbot behaviour generation and user-transparent automatic assessment.
From the perspective of the target users, the elderly, the main priority of any information system should be alleviating loneliness, whether the system has embedded cognitive monitoring capabilities or not. Accordingly, we want our solution to be perceived as a friendly intelligent assistant to access Internet media, that is, a conversational system that reads news. The news items will be interspersed with brief dialogues that subtly guide the users through a series of questions to gather their interests and evaluate their understanding of the information they have just consumed, which involves word category understanding and short-term memory, to evaluate cognitive impairment (Loewenstein et al. 2004; Crocco et al. 2014).
Conversational systems seem an adequate approach for this purpose. These software programs allow for human interaction with a machine using written or spoken natural language (Shawar and Atwell 2007). Ideally, the dialogue should be empathetic (Fung et al. 2018; Rashkin et al. 2019), a still distant goal even after the recent advances in Artificial Intelligence (AI) and in Natural Language Generation (NLG) techniques. Before virtual companions (Shum et al. 2018) become a reality, entertainment through interesting information will be more feasible. In this vein, it is well known that elders feel accompanied by simply listening to news in the background (Östlund 2010).
In more detail, our system reads recent news items and generates questions about them. At the same time, in order to automatically evaluate cognitive capabilities, the system measures the similarity between user answers and a gold standard (Yang and Powers 2005; Corley and Mihalcea 2005; Li et al. 2006; Feng et al. 2008) that is automatically generated from the news. To validate our approach, we have defined an answer similarity metric and we have performed tests on a sample of 30 patients of Asociación de Familiares de enfermos de Alzheimer y otras demencias de Galicia (AFAGA, the Galician Association of Relatives of Patients with Alzheimer's and other Dementias). This public association seeks to improve the quality of life of Alzheimer patients, provides guidance and information to relatives and to the public, and makes society aware of this reality to achieve a broader and more effective response. It collaborates actively in research on cognitive impairments.
The rest of this paper is organised as follows. Section 2 reviews related work and lists our contributions. Section 3 describes our conversational system for entertainment and user-transparent cognitive assessment. Section 4 presents our case study and the results of our word similarity approach to automatically evaluate cognitive impairment. Finally, Section 5 concludes the paper.

Related Work
Regarding the automatic detection of health conditions, there exists a wealth of work based on Machine Learning, such as Ghoneim et al. (2018), on a smart healthcare framework to detect medical data tampering; Sedik et al. (2021), on the analysis of the outbreak of COVID-19 disease; Ahmed et al. (2021), on an unsupervised Machine Learning approach to predict data-type attributes for optimal processing of telemedicine data, including text and images; and Sarrab and Alshohoumi (2021), on the real-time detection of abnormalities in streamed data from IoT sensors (furthermore, in Masud et al. (2021), a mutual authentication and secret key establishment protocol was proposed to protect medical IoT networks). Our research contributes to automatic smart telecare solutions based on Machine Learning.
However, despite the huge advances in AI for artificial reasoning and problem-solving, interactions are still far from human-like. Most recent AI systems have partial understanding of natural language and lack cognitive capabilities to enrich the communication with context-dependent information (Skjuve et al. 2019). Specifically, existing solutions for intelligent conversational systems are based on retrieval-based methods (Yasuda et al. 2014; Wu et al. 2018), which select the best candidate among a predefined set of alternative responses, and generation-based procedures (Oh et al. 2017; Su et al. 2017; Baby et al. 2017), which rely on NLP techniques to create human-like written or oral dialogue flows. Note that natural language is especially beneficial for user interfaces for its spontaneity and friendliness (Liu et al. 2018).
Among the existing intelligent conversational systems, we can mention Google Duplex (Lindgren and Andersson 2011) and the Neural Responding Machine in Shang et al. (2015) based on Recurrent Neural Networks. Moreover, in Wen et al. (2017) the authors presented a dialogue system based on the pipelined Wizard-of-Oz framework, which, unlike other approaches in the literature, can make assumptions. Regarding linguistic knowledge, in Wang et al. (2015) and Wu et al. (2018) syntactic features were used to generate coherent and human-like texts. Newscasters (Matsumoto et al. 2007), which, as previously said, generate a feeling of companionship (Östlund 2010), do not sustain dialogues with end users. Conversational systems have already been considered for entertainment and healthcare (Noh et al. 2017; Su et al. 2017), although still at an early stage.
Regarding existing conversational systems for entertainment (Johnson et al. 2016; Correia et al. 2016; Aaltonen et al. 2017), we highlight EduRobot (Budiharto et al. 2017) which can sing and tell stories, although it is not oriented to the specific needs of the elderly. It has been suggested to make conversational systems more appealing by modelling their interfaces as pets or avatars (Sharkey and Sharkey 2012). Unlike EduRobot, RobAlz (Salichs et al. 2016) has been specifically devised for this audience, but it has no therapeutic diagnostic capabilities. Few existing intelligent systems for senior healthcare (Foroughi et al. 2008; Hsu and Chien 2009; Suryadevara et al. 2012; Tseng et al. 2013; Samanta et al. 2014; Wang et al. 2016) have human-computer communication capabilities. The system by Yasuda et al. (2014) for people with dementia is an exception, but its communication capabilities are very limited: it just selects questions and answers among 120 pre-set options. Therefore, even though there still is incipient work on digital tools for therapeutic monitoring of people with dementia and other impairments, manual tests (written and task-based neurological and neuropsychological assessments with caretaker supervision) are typically used. For example, the Mini-Mental State Examination (Ridha and Rossor 2005) is a cognitive test on orientation, immediate memory, calculation, attention and comprehension, to cite some tasks, which produces scores about dementia levels. The Mini Cognitive Assessment Instrument (Milne et al. 2008) includes a verbal memory task and a clock drawing test. Finally, the CAMDEX test (Ball et al. 2004) is another standardised manual tool for the diagnosis of mental disorders, which is especially suitable for the early detection of dementia. It asks questions related to memory, personality, general mental and intellectual functioning, and judgement. It also considers specific symptoms and the medical histories of the users and their families. Note that the white-coat effect is a major concern in all these manual approaches, apart from the fact that they are time-consuming and require professional expertise.
Previous research has shown the benefits of combining dichotomous questions (also named closed questions) with essay questions to mitigate the white-coat effect (Ridha and Rossor 2005; Echeburúa et al. 2017). We also highlight the value of inserting distracting questions before attention-demanding questions in cognitive tests (Ridha and Rossor 2005). Altogether, this combination allows evaluating cognitive impairment less intrusively (Ball et al. 2004; Ridha and Rossor 2005; Milne et al. 2008). To the best of our knowledge, our proposal is the first intelligent system that embeds user-transparent, automatic cognitive assessment into a newscaster system that sustains dialogues with elderly people, based on NLP techniques for chatbot behaviour generation and human-machine communication.
We close this section with a review of related industrial initiatives, which further backs the social relevance of the field under study.
For instance, the Carelife system by Televés analyses personal routines from sensor data for custom home care. Its home gateway can be extended via peripherals, such as biomedical devices, as well as via software. Doro launched SmartCare in 2018. It includes a home gateway and home sensors to detect behavioural changes. However, neither these systems nor the SAM Robotic Concierge by Luvozo have built-in intelligence to communicate with the elderly using natural language.
Regarding general purpose domestic robots, there are examples such as ZenBo by Asus, with video surveillance, Internet shopping and agenda features. Its interactions are rather rigid. It can understand vocal orders, but it has no empathetic capabilities, nor is it tailored to the needs of the elderly.
We believe that a feasible path towards a next generation of intelligent conversational telecare systems is the augmentation through software of current platforms such as Carelife and SmartCare by relying on their simple voice interfaces (which, nowadays, caregivers employ to call the users), without any additional hardware add-ons. Some platforms are already open in this respect. For example, Buddy, by the European company Blue Frog Robotics, allows third-party developers to create new applications and distribute them via its store. We are not aware of any application of this robot to entertain and monitor elderly people, nor of any intelligent functionalities.
Regarding solutions focusing on cognitive evaluation, we must mention three approaches based on manual tests (written and task-oriented neurological and neuropsychological assessments), none of which employ automatic NLP techniques. Neurotrack is a set of cognitive tests to evaluate, monitor and strengthen brain health to reduce the risk of Alzheimer's and other dementias. Mezurio provides support for interactive data acquisition as a baseline for detecting individuals at risk of developing Alzheimer's disease. Altoida tests the functional and cognitive skills of patients with a Machine Learning algorithm but, like the previous two solutions, it is entirely based on a set of predefined tests and it does not have any bidirectional communication capabilities in natural language.

System Architecture
We present a novel intelligent system specially designed for therapeutic monitoring of elderly people with different levels of cognitive impairments or on the verge of suffering them. The users perceive the system as a news broadcasting service, with which they interact by voice from time to time. Therapeutic monitoring is embedded as a user-transparent functionality. Figure 1 illustrates the main modules of the system, on which we will elaborate in the next sections. They include online (news broadcast service and intelligent dialogue generation system) and local services (Android application with cognitive attention assessment service). Online services modules were implemented using the Eclipse Integrated Development Environment (IDE) and the Java 1.8 programming language. They were deployed on a Tomcat server to be made available through a REST API, which was programmed using the Jersey library. The Android application (compatible with devices running the Lollipop operating system or higher) was developed with Android Studio.
The system transforms Spanish speech into text and vice versa as input and output data (STT/speech-to-text and TTS/text-to-speech boxes in Figure 1). For this purpose, it employs the Google Voice Android Software Development Kit (SDK) library.
Regarding system activation, we use voice commands and facial recognition. For implementing the latter we employed the OpenCV library and an eye-sensing training data set (note that previous works have also exploited sophisticated schemes for this purpose, apart from voice commands (Alsmirat et al. 2019)).
As in Wang et al. (2020), we decided to combine text and images, in our case to help the users focus during the dialogue stages. Specifically, the dialogue with the users is guided with basic graphic indicators (see Figure 2): the screen displays a "muted" or "open" microphone to indicate the system's or user's turn to speak. Moreover, the "facial" expressions of the avatar of the conversational system provide empathetic feedback to the users based on text sentiment analysis. Finally, note that the user interface is an animated dog, simulating a pet as suggested in the literature (Sharkey and Sharkey 2012).
To ensure short response times, instead of querying external systems, the system relies on a MongoDB database in our own server containing all the necessary linguistic knowledge.

News Broadcast Service
The news broadcast service requires a varied and updated set of news to engage the target users. This news is periodically extracted from the Application Programming Interface (API) of the Spanish National Radio and Television (RTVE) channel with a GET query, using the tematicas and noticias API services, to gather news items on specific topics. This task is performed in the background owing to the requirements of the API, and it hides news processing from the target users by pre-saving news items on a daily basis. The content retrieved from the API is saved into a MongoDB structure using a JSON file. Date and topic features are exploited for indexing and searching. As a result, the News Broadcast Service delivers content immediately from the user's point of view.
News items are arranged into five categories: economy, politics, science, society and sports, all of them at national level except for politics and society, which focus on Galicia (Spain) in our case, since proximity is appealing to elderly people. Table 1 presents an example of a social news item. We provide the user with a summary of each piece of news by extracting the lead paragraph (for that purpose, we first split the news items into paragraphs and take the first paragraph after the title).
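As an illustration, the lead extraction step could be sketched as follows in Python. The function name `lead_paragraph` and the convention that paragraphs are separated by blank lines are assumptions made for this example; they do not describe the deployed Java services.

```python
import re


def lead_paragraph(news_text: str) -> str:
    """Return the first paragraph after the title, or "" if there is none."""
    # Assumption: paragraphs are separated by one or more blank lines.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", news_text) if b.strip()]
    # Assumption: the first block is the title, so the lead is the second one.
    return blocks[1] if len(blocks) > 1 else ""


# Hypothetical news item used only to exercise the function.
item = (
    "Police dismantle drug ring\n\n"
    "National Police has dismantled a ring.\n\n"
    "More details follow."
)
print(lead_paragraph(item))
```

Returning an empty string when no paragraph follows the title lets the caller skip items that cannot be summarised.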

Automatic Question Generation
Our question generation system combines linguistic knowledge from our aLexiS lexicon (García-Méndez et al. 2018, 2019) with the Named Entity Classification (NEC) functionality from Freeling (Atserias et al. 2006; Padró and Stanilovsky 2012). The former is saved in a MongoDB database to reduce the response time of the chatbot. Besides, the NEC process is executed in the background because of the complexity of the linguistic analysis of the news. These two resources allow extracting and identifying personal names, organisations and locations, which together constitute valuable data for question generation. More specifically, thanks to the linguistic information in aLexiS, our conversational system can adjust features such as the gender, number, person, and tense of the questions it generates.

Table 1. Example of a news item (columns: Topic, News item).
News item: 'National Police has dismantled a dangerous drug trafficking ring that operated in Galicia, Madrid, and Alicante in an operation in which ten people with a wide criminal record have been arrested, and 1,000 kilos of cocaine, 500,000 euros and ten luxury cars have been seized.'

As previously said, we follow the strategy of combining dichotomous and essay questions to reduce the white-coat effect and create a more relaxed atmosphere.
• To generate dichotomous questions, we rely on the NEC functionality of Freeling to extract personal names and locations from the news. This produces questions such as those in Table 2. The system always generates four similar options for each question, and one of them is picked at random and presented to the user. Then, depending on the user's answer, the system poses the next question as indicated in Table 3.
Regarding the extraction of the gold standard answers for the aforementioned types of questions, we obtain these data with Freeling syntactic parsing. Take the sentence 'National Police has dismantled a dangerous drug trafficking ring that operated in Galicia, Madrid, and Alicante' as an example. The corresponding essay question using our system is 'Who has dismantled a dangerous drug trafficking ring?', and the correct answer it produces as a reference is the noun phrase that precedes the verb, 'National Police'. Note that the best answer is obtained by extracting the noun phrases that precede the verb for 'who' questions, as in our example. On the other hand, for 'what' questions, we use the noun phrase that precedes the verb plus the verb itself. Take the sentence 'The Government will automatically extend the social electric bonds until September 15th' and its associated question 'What does the news say on September 15th will happen?' as an example of 'what' question handling. The correct produced answer is 'It will automatically extend social electric bonds'. Finally, for the question 'Which places does the news item mention?', the correct answer is produced by extracting all location entities from the news content using the NEC functionality of Freeling. Note that in the first example about the drug trafficking ring, the correct answer is Galicia, Madrid and Alicante.
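The rules for 'who' and 'which places' questions can be sketched as follows. This is a minimal illustration under stated assumptions: the chunk labels (`NP`, `VP`), the entity label (`LOC`) and the tuple-based input are hypothetical stand-ins for the output of a syntactic parser and an entity classifier; they are not Freeling's actual output format or API.

```python
from typing import List, Optional, Tuple

Chunk = Tuple[str, str]   # (label, text); the labels are hypothetical
Entity = Tuple[str, str]  # (kind, text); the kinds are hypothetical


def who_answer(chunks: List[Chunk]) -> Optional[str]:
    """Return the noun phrase that precedes the first verb group, if any."""
    for i, (label, _) in enumerate(chunks):
        if label == "VP":
            # Walk backwards to the closest preceding noun phrase.
            for prev_label, prev_text in reversed(chunks[:i]):
                if prev_label == "NP":
                    return prev_text
            return None
    return None


def places_answer(entities: List[Entity]) -> List[str]:
    """Return all location entities in order of appearance, without repeats."""
    places: List[str] = []
    for kind, text in entities:
        if kind == "LOC" and text not in places:
            places.append(text)
    return places


# Hand-written analysis of the example sentence from the text.
chunks = [
    ("NP", "National Police"),
    ("VP", "has dismantled"),
    ("NP", "a dangerous drug trafficking ring"),
]
entities = [
    ("ORG", "National Police"),
    ("LOC", "Galicia"),
    ("LOC", "Madrid"),
    ("LOC", "Alicante"),
]
print(who_answer(chunks))
print(places_answer(entities))
```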
Both the generation of questions and the extraction of gold standard answers are performed in the background, after the daily news-gathering process (see Section 3.1), using the same indexing scheme.
By combining dichotomous and essay questions, the conversational system establishes a dialogue with the end users. Each dialogue is composed of the following three stages:
• News: prior to the dialogue, the conversational system presents the news item.
• User-centred questions: a dichotomous question followed by two essay questions to distract the user.
• Attention-demanding question: a last essay question on key aspects of the news item. It allows assessing if the user understood the news piece, if he/she was focused during the conversation, and his/her short-term memory.
Figure 3 shows an example of a real dialogue according to this structure. To keep the user engaged, most questions are related to the news item (avatar marked in yellow).
The resulting user utterances are the input to the cognitive attention assessment service, which calculates the accuracy of the user answers by comparing them with the gold standard responses. We describe it in the next section.
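The three-stage structure of a session can be captured in a small data model. The sketch below is only illustrative: the class and function names are hypothetical, and the two user-centred example questions are invented placeholders, whereas the attention-demanding question and its gold answer are the ones quoted in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Question:
    text: str
    kind: str              # "dichotomous" or "essay"
    gold_answer: str = ""  # only needed where the answer is assessed


def build_dialogue(news: str, user_centred: List[Question],
                   attention: Question) -> List[str]:
    """Order the utterances of one session: news, distractors, final question."""
    kinds = [q.kind for q in user_centred]
    if kinds != ["dichotomous", "essay", "essay"]:
        raise ValueError("expected one dichotomous and two essay questions")
    if attention.kind != "essay":
        raise ValueError("the attention-demanding question must be an essay")
    return [news] + [q.text for q in user_centred] + [attention.text]


# Placeholder user-centred questions (invented for this example).
distractors = [
    Question("Have you ever been to Galicia?", "dichotomous"),
    Question("What do you like about that place?", "essay"),
    Question("Who would you visit it with?", "essay"),
]
final = Question("Who has dismantled a dangerous drug trafficking ring?",
                 "essay", gold_answer="National Police")
turns = build_dialogue("National Police has dismantled a ring.",
                       distractors, final)
print(len(turns))
```

Validating the order of question kinds when the dialogue is built keeps malformed sessions from reaching the assessment stage.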

Cognitive Attention Assessment Service
We employ the lexical Multilingual Central Repository (MCR) database (González-Agirre et al. 2012), which integrates the Spanish WordNet into the EuroWordNet framework, to obtain the semantic classification of the adjectives, adverbs, nouns and verbs in the news piece. For that purpose, we extract from MCR three semantic categories corresponding to Adimen SUMO, WordNet Domains and Top Ontology hierarchies for nouns and verbs, and Top Ontology hierarchies for adjectives and adverbs, since there is less information available for these two lexical categories (Pedersen et al. 2004). From MCR we also gather holonyms, hypernyms, hyponyms, meronyms, synonyms and related data for nouns and verbs, and only synonyms for adjectives and adverbs. Table 4 shows an example for the noun montaña 'mountain'. This linguistic knowledge was added to the aLexiS lexicon within the same JSON indexing scheme, which did not affect response time performance.
We obtain the final score for each response as a weighted average (expression (1)), where:
• N_x represents the number of words of lexical category x in the ideal response.
• For the i-th word of category x in the ideal answer, we calculate its similarity with all words within the same lexical category x in the user's response, and we take the highest value as x*_i.
Note that nouns and verbs are considered more relevant than adjectives and adverbs in expression (1). This choice is supported by the fact that the former generally carry most semantic information in a sentence, whereas the latter provide nuances (Feng et al. 2008; Corley and Mihalcea 2005).
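The aggregation can be illustrated with the following Python sketch. It is not a transcription of expression (1): the weights in `WEIGHTS` are hypothetical values chosen only to respect the stated ordering (nouns and verbs above adjectives and adverbs), the per-category normalisation by N_x is an assumption, and the word similarity is passed in as a function.

```python
from typing import Callable, Dict, List

# Hypothetical weights: only their ordering is taken from the text.
WEIGHTS: Dict[str, float] = {
    "noun": 0.35, "verb": 0.35, "adjective": 0.15, "adverb": 0.15,
}

Similarity = Callable[[str, str], float]


def best_matches(ideal: List[str], user: List[str],
                 similarity: Similarity) -> List[float]:
    """x*_i: highest similarity of each ideal word against the user's words."""
    return [max((similarity(w, u) for u in user), default=0.0) for w in ideal]


def response_score(ideal: Dict[str, List[str]], user: Dict[str, List[str]],
                   similarity: Similarity,
                   weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of the per-category mean best-match similarity."""
    total, weight_sum = 0.0, 0.0
    for category, ideal_words in ideal.items():
        if not ideal_words:  # N_x = 0: category absent from the ideal answer
            continue
        matches = best_matches(ideal_words, user.get(category, []), similarity)
        total += weights[category] * sum(matches) / len(ideal_words)
        weight_sum += weights[category]
    return total / weight_sum if weight_sum else 0.0


def exact(a: str, b: str) -> float:
    """Toy similarity used only for this example: 1 for identical words."""
    return 1.0 if a == b else 0.0


ideal = {"noun": ["teacher", "paper"], "verb": ["took"]}
user = {"noun": ["teacher"], "verb": ["took"]}
print(response_score(ideal, user, exact))
```

Renormalising by the weights of the categories actually present keeps the score in [0, 1] when the ideal answer has no adjectives or adverbs.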
To obtain x*_i we adapted the method by Yang and Powers (2005) to calculate the similarity s(word_1, word_2) between words word_1 and word_2 (expression (2)), where:
• α_s = 0.9 if the words are synonyms and 0.85 otherwise.
• d(word_1, word_2) = 0 if the words are holonyms, hypernyms, hyponyms, meronyms or synonyms, related to the same hierarchy category or belonging to it. Otherwise, d(word_1, word_2) is WordNet's shortest path between word_1 and word_2.
• β acts as a depth factor that decreases similarity exponentially, to the power of the number of hierarchical steps separating the two concepts. In the tests in Section 4 we set it to 0.7.
• γ is explained below.
Note that certain fairly similar words in the same WordNet domain category will have a very low similarity value with our method (less than 0.4). Consider for example the pair (madera 'wood', cartón 'cardboard'), whose similarity is 0.27, a low value taking into account that both terms define materials and share WordNet domain category 'substance'; and the pair (panadero 'baker', maestro 'teacher'), with a similarity of 0.10, although these words represent professions and belong to the same WordNet domain category 'person'.
To avoid this issue, we define correction factor γ in expression (2). By default γ = 0, except for the following two cases:
• For word pairs that belong to the same WordNet domain category, γ = 0.25. After applying this to the two previous examples, their similarities grow to 0.45 and 0.33 (from 0.27 and 0.10), respectively.
• For all terms with the same stem that have not been already classified as synonyms, with a similarity of 0.85 or less, γ = 0.5. This is because they belong to the same word family, and their similarity should be even higher for our purposes than in the previous case. For instance, the pair (flor 'flower', florista 'florist') would have a similarity value of 0.15 for γ = 0, but it becomes 0.58 after applying γ = 0.5.
The goal of these corrections is improving coherence, by defining a ground truth for word similarity (Li et al. 2006). This was tuned by considering the value range (0.3, 0.6) for similarity scores in cases such as autógrafo 'autograph' versus firma 'signature' and cojín 'cushion' versus almohada 'pillow'.
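The effect of γ can be checked numerically with the sketch below. Two assumptions are made, since the closed form of expression (2) is not reproduced here: the uncorrected similarity is taken as α_s · β^d, and the correction is taken as an interpolation towards 1, γ + (1 − γ) · s. The second assumption is consistent with the three corrected values quoted above (0.45, 0.33 and 0.58) up to rounding, but it remains an inference from those examples. Giving the same-stem case precedence over the same-domain case is a further assumption.

```python
def base_similarity(distance: int, synonyms: bool = False,
                    beta: float = 0.7) -> float:
    """Assumed uncorrected form: alpha_s * beta ** d."""
    alpha = 0.9 if synonyms else 0.85
    return alpha * beta ** distance


def choose_gamma(base: float, same_domain: bool, same_stem: bool,
                 synonyms: bool) -> float:
    """Pick the correction factor following the two cases listed above."""
    if same_stem and not synonyms and base <= 0.85:
        return 0.5   # same word family (assumed to take precedence)
    if same_domain:
        return 0.25  # same WordNet domain category
    return 0.0


def corrected_similarity(base: float, gamma: float) -> float:
    """Assumed correction: move the base value towards 1 by a fraction gamma."""
    return gamma + (1.0 - gamma) * base


# Base similarities quoted in the text, with the gamma that applies to each.
for base, gamma in [(0.27, 0.25), (0.10, 0.25), (0.15, 0.5)]:
    print(base, gamma, round(corrected_similarity(base, gamma), 3))
```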
Moreover, we pay special attention to the treatment of numbers. Given the fact that our goal is to assess the understanding of the news, we must take into account that even people with healthy minds seldom retain the exact quantities they have just heard. For example, after listening to the sentence 'there were 2569 casualties', a person will likely remember 'there were over 2500 casualties'. For this reason, we generate all the possible numbers a given quantity can be rounded to by dividing it by powers of ten, and we assign a 0.7 similarity score if any of them produces a match. For instance, if the correct amount is 2569, the possible right answers that would be assigned a 0.7 similarity score are 2000, 2500, 2560, 2570, 2600 and 3000. On top of that, if the words 'over' or 'under' are chosen correctly, a similarity score of 0.9 is assigned. Going back to the previous example, if the user's reply is 'there were over 2500 casualties', it receives a 0.9 similarity score, whereas if it is 'there were 2500 casualties' it gets a 0.7 similarity score, since it reflects less understanding of the original information.
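The rounding rule can be made concrete with a short sketch that reproduces the 2569 example. The function names are hypothetical, and three behaviours not specified in the text are assumptions: an exact answer scores 1.0, a number outside the candidate set scores 0.0, and a wrongly chosen 'over' or 'under' falls back to 0.7.

```python
from typing import Set


def rounding_candidates(n: int) -> Set[int]:
    """Floor and ceiling of n at every power of ten not exceeding n."""
    candidates: Set[int] = set()
    power = 10
    while power <= n:
        lower = (n // power) * power
        candidates.add(lower)
        if lower != n:  # n is not already a multiple of this power
            candidates.add(lower + power)
        power *= 10
    candidates.discard(n)  # the exact value is not a rounded answer
    return candidates


def number_score(correct: int, answered: int, qualifier: str = "") -> float:
    """Similarity score for a quantity recalled by the user."""
    if answered == correct:
        return 1.0  # assumption: exact recall gets full credit
    if answered not in rounding_candidates(correct):
        return 0.0  # assumption: unrelated numbers get no credit
    if qualifier == "over" and answered < correct:
        return 0.9
    if qualifier == "under" and answered > correct:
        return 0.9
    return 0.7


print(sorted(rounding_candidates(2569)))
print(number_score(2569, 2500, "over"), number_score(2569, 2500))
```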
As a final example, take the sentence un profesor llevó papel en blanco a su hogar en la montaña 'a teacher took blank paper to his home in the mountain' as the ideal response to a question. The following three user answers produce different results:
• Un hombre llevó folios blancos a su casa del monte 'a man took blank paper sheets to his house on the hill': this reply uses different words than the original, but keeps the same meaning. It would obtain a similarity score of 0.8.
• Un hombre sacó madera blanca de su apartamento 'a man took white wood from his apartment': this sentence has lost most of the original meaning, although some common concepts remain. It would obtain a similarity score of 0.51.
• Un hombre rompió una silla en una tienda 'a man broke a chair in a shop': this sentence has none of the original meaning left. It would obtain a similarity score of 0.25 (note that the subject was correctly inferred).
The sim metric allows the automatic evaluation of comprehension skills from the questions and their corresponding gold standard answers, and the procedure reflects the level of concentration during the conversation and the reliability of short-term memory.

Experimental Results and Discussion
In this section, we present the validation tests to assess the effectiveness of our approach to determine cognitive impairment levels.

Case Study
The case study comprises two experimental scenarios. The first experiment (Section 4.2.1) studies the sim metric presented in Section 3.3 as a tool to assess abstraction capabilities, for different user profiles and levels of cognitive impairment. The second experiment (Section 4.2.2) evaluates Machine Learning algorithms to automatically detect cognitive impairment in the users under study. These experiments were divided into "sessions", a session being a particular newscast and its associated dialogue with an elderly person. The profiles of the participants in these sessions were characterised as follows:
• At least 60 years old.
• Technological skills and hearing problems: existing or not.
• Study levels: basic (high school education or less) or superior (bachelor's, master's and doctoral degrees).
• Cognitive impairment level: absent, mild or severe, as established for the case study by gerontology experts from AFAGA.
• Stress: yes or no.
• Focus: yes or no.
The experiments lasted for three months and involved 30 users aged 75.73 ± 6.60 years (average ± standard deviation). All the users involved in the experiments are patients in the occupational therapy workshops of AFAGA. The tests were conducted under the supervision of their caregivers. For annotating cognitive impairment, AFAGA applied the Spanish version (Díaz Mardomingo and Peraita Adrados 2008) of the Global Deterioration Scale standard methodology (Reisberg et al. 1982). Specifically, 57% of the participants had cognitive impairment to some extent (40% mild, 17% severe) and the rest were mentally healthy. We used this methodology as the manual baseline for the comparison with our automatic Machine Learning detection approach in Section 4.2.2.
In each individual experiment, we registered the characteristics of the user. Tables 5 and 6 show the session registration sheet, which was filled in by the caregivers, and a real example of a session, with its newscast content and its associated interview.
In detail, 17 participants had basic technological skills (e.g., they regularly used electronic devices such as computers and smartphones), 9 suffered from hearing problems, 16 had a basic education and 14 had a superior level of education. 18 participants had been diagnosed with some cognitive impairment. Most of the users were in a positive frame of mind (14 were happy to participate and 16 simply accepted it). 22 users were clearly focused.
Regarding the white-coat effect, it is worth mentioning that 91.67% of the participants without any cognitive impairment and 61.11% of those with mild or severe cognitive impairments were relaxed during the experiments. As could be expected, cognitive impairment level increased with age (Ammal and Jayashree 2020).

Similarity Metric Assessment
Each experiment was composed of five different sessions whose corresponding news items were related to economy, politics, science, society, and sports. Each session consisted of a newscast and four questions, as explained in Section 3.
In the experiments, our system was able to separate users in most cases by sim scores that were significantly related to their level of cognitive impairment. Table 7 averages the sim metric for the three groups in the study (absent, mild and severe impairment).
By session, Table 8 shows that users with severe cognitive impairments scored significantly lower, while healthy ones or those with mild levels of cognitive impairment performed significantly better. The effect of mild cognitive impairment (compared to its absence) on the comprehension skills of the participants was noticeable in all sessions except for the third. This session was particularly challenging due to its vocabulary, which included technical terms such as pymes 'SMEs' (acronym for small and medium enterprises). Furthermore, in sessions 1, 2 and 4 the difference between mild and no cognitive impairment was noteworthy.
Session 5 was particularly interesting. In it, users received several clues in user-centred questions before being presented with the attention-demanding question. Thus, users with mild cognitive impairment performed better in this session than in session 3, for example. This is coherent with the fact that the reinforcement of key ideas from the news helped users to accurately answer attention-demanding questions.
Table 9 shows the average lengths of user responses. They tended to be shorter for higher impairment levels. In fact, for severe impairment, users tended to remain silent or answer concisely (e.g., 'I don't know').
Table 10 shows that, for users with cognitive impairments, stress had a severely negative effect on performance, and that high focus had a very positive outcome, even enhancing sim results from 0.20 to [...]

'Teresa Ribera has announced this Thursday that the Government will automatically extend electricity social bonds until September 15th to maintain the protection of people in a condition of energy vulnerability and to guarantee basic supplies at home due to the coronavirus crisis.'

Finally, Table 11 presents sim measurements versus technological skills and levels of education, for users with cognitive impairments. In view of these results, more educated users performed better in the experiments than those with a basic education. Apparently, technological skills only led to higher sim scores for users with a basic education (note, however, the large standard deviation of the sim score of skilled users with a superior educational level).

Automatic Cognitive Impairment Detection
Finally, in order to evaluate the effectiveness of the proposed system, we trained a set of selected Machine Learning algorithms (Salzberg 1994) for detecting cognitive impairment: Bayesian Network (bn), Decision Tree (dt), Random Forest (rf) and linear Support Vector Machine (svm), which have been widely used in medical applications (Lu et al. 2017; Bratić et al. 2018; Ghoneim et al. 2018; Rukmawan et al. 2021; Ahmed et al. 2021).
Table 12 shows the training and classification complexity of the Machine Learning algorithms we selected, for c classes, d features, k instances of the algorithm (where applicable) and n samples. bn has linear training complexity (Lu et al. 2006). dt and rf have logarithmic training complexity (Witten et al. 2016; Hassine et al. 2019). svm has the highest training complexity, but it is very fast at classification time if trained with a linear kernel, as in our case (Vapnik 2000). We employed the algorithm implementations from Weka (Witten et al. 2016).
Firstly, we divided the sample into user classes with and without impairments. Table 13 shows all the features we considered. Then, we applied the GainRatioAttributeEval feature selection algorithm, also from Weka, which evaluates the relevance of the attributes by measuring their gain ratio with regard to the target class. The most relevant features it selected for the classification model were, in decreasing order of importance, length of the response in characters, focus, sim for question 4 in session 2, age, technology skills, and sim for question 4 in session 4.
Note that in spite of the results in Table 11, the selection algorithm preferred technological skills over level of education.
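The gain ratio criterion can be sketched as the information gain of an attribute with respect to the class, divided by the entropy of the attribute itself. The following is a minimal illustration and not the Weka implementation: it assumes that every attribute is already discrete (numeric attributes such as age or response length would need discretising first), and the helper names and toy values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(attribute, target):
    """(H(target) - H(target | attribute)) / H(attribute)."""
    if len(attribute) != len(target):
        raise ValueError("attribute and target must have the same length")
    n = len(target)
    conditional = 0.0
    for value, count in Counter(attribute).items():
        subset = [t for a, t in zip(attribute, target) if a == value]
        conditional += (count / n) * entropy(subset)
    split_info = entropy(attribute)
    if split_info == 0.0:
        # A constant attribute carries no information about the class.
        return 0.0
    return (entropy(target) - conditional) / split_info


# Toy values, invented for illustration only (NOT the study data).
impaired = [False, False, True, True]
features = {
    "focus": ["high", "high", "low", "low"],   # separates the classes perfectly
    "hearing": ["ok", "poor", "ok", "poor"],   # unrelated to the class
}
ranking = sorted(features, key=lambda f: gain_ratio(features[f], impaired), reverse=True)
print(ranking)
```

Ranking the attributes by this score, from highest to lowest, yields an ordering analogous to the list of most relevant features above.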
Finally, Table 14 shows the classification results for the selected algorithms with 10-fold cross validation (Berrar 2019) to avoid overfitting. This methodology minimises underestimation and overestimation in the results. For this purpose, the dataset is divided into 10 segments, 9 of which are used for training and the remaining one for testing. This process is repeated ten times with non-overlapping testing segments across evaluations. At the end, the final performance metric is computed as the average of the intermediate tests. In addition, unlike the svm (Jatav and Sharma 2018) and bn (Wood et al. 2019) algorithms, which are less prone to overfitting issues, in the cases of dt and rf we limited the folds to 3 and the maximum depth to 5. To avoid bias, the entries from the same users were grouped to prevent them from being simultaneously used for training and evaluation.
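The grouping constraint described above (no user contributes entries to both the training and the testing side of a fold) can be sketched as follows. This is an illustrative splitter, not the procedure actually run in Weka: the function name `grouped_k_fold`, the round-robin assignment of users to folds, and the layout of one entry per user and session are all assumptions.

```python
def grouped_k_fold(user_ids, k=10):
    """Yield (train_idx, test_idx) pairs in which no user appears on both sides."""
    users = sorted(set(user_ids))
    if k > len(users):
        raise ValueError("k cannot exceed the number of distinct users")
    # Assign whole users (not individual entries) to folds, round-robin.
    fold_of_user = {user: i % k for i, user in enumerate(users)}
    for fold in range(k):
        test = [i for i, u in enumerate(user_ids) if fold_of_user[u] == fold]
        train = [i for i, u in enumerate(user_ids) if fold_of_user[u] != fold]
        yield train, test


# Assumed layout: 30 users with 5 entries each (one per session).
user_ids = [user for user in range(30) for _ in range(5)]
folds = list(grouped_k_fold(user_ids, k=10))

for train, test in folds:
    train_users = {user_ids[i] for i in train}
    test_users = {user_ids[i] for i in test}
    assert train_users.isdisjoint(test_users)

print(len(folds), len(folds[0][1]))
```

The final performance figure would then be the average of the metric over the ten test segments. The sketch does not shuffle users before assigning them to folds; a shuffled variant would only change the `users` ordering.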
Note that dt was finally selected due to its better performance, since it attained a detection accuracy of 86.67% using the most relevant features.
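For reference, the reported metrics follow the standard definitions for a binary detection task, which can be sketched as below. The function name and the toy labels are hypothetical, and treating 'impaired' as the positive class is an assumption; any per-class averaging that Weka may apply to its reported figures is not reproduced here.

```python
def binary_metrics(y_true, y_pred, positive=True):
    """Accuracy, precision, recall and F-measure for one positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Toy labels, invented for illustration only (True = impaired).
y_true = [True, True, True, False, False, False]
y_pred = [True, True, False, False, False, True]
metrics = binary_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```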

Conclusions
Even though there exists previous research on intelligent systems for therapeutic monitoring of cognitive impairment in elderly people, most current approaches are based on manual tests that rely on human supervision for early detection.
In this work, to reduce caregivers' effort and the white-coat effect, we have proposed a novel conversational system for entertainment and therapeutic monitoring of elderly people. It relies on nlp techniques for chatbot behaviour generation and user-transparent automatic assessment, by combining distracting (user-centred) questions with attention-demanding questions (embedded cognitive tests). Thus, our main contribution is a Machine Learning approach for user-transparent cognitive monitoring that is embedded into a user-centred entertainment solution. This approach is based on metrics that estimate the abstraction skills of the users from their answers during the automatic dialogue stages. Experimental results with elderly people under the supervision of afaga gerontologists indicate that our solution is satisfactory and has strong potential for user-friendly therapeutic monitoring. Preliminary analyses have obtained a cognitive impairment detection accuracy close to 90%.
Given these promising results, we plan to enhance the system with empathetic capabilities through user feedback. At the same time, users will benefit from real-time encouragement about their performance. [...] of cognitive impairments, and for validating our results.

Question 1: Does the name Teresa Ribera sound familiar to you?
Answer:
Yes → Question 1.1: What facts do you know about the life of Teresa Ribera?
No → Question 1.2: What do you associate the name of Teresa Ribera to?
Not available → Question 1.3: Why do you think people like the name Teresa?
Question 2: What is the meaning of the word 'Government'?
Question 3: What happened on September 15th according to the news?
Question 4: Do you consider this news item interesting?

Table 1
Example of news content.

Table 2
Example of dichotomous questions by different nec results.

Table 3
Example of dichotomous questions by different nec results depending on the user's response.

Table 4
Semantic data from mcr for noun montaña 'mountain'.

Table 5
Session registration sheet.

Table 6
News item for session 1.

Table 7
Average ± sd sim metric by level of impairment across all sessions.

Table 9
Average answer length by level of impairment, in characters.

Table 10
Average ± sd sim metric for users with cognitive impairments by stress and focus.

Table 11
Average ± sd sim metric for users with cognitive impairments by technological skills and level of education.

Table 12
Training and testing complexity of the Machine Learning algorithms.

Table 14
F-measure, recall and response times for the selected algorithms.