1 Introduction

In a world with an ever-increasing human lifespan, the quality of life of senior adults is becoming more and more relevant. According to the WHO [1], the population over the age of 60 will increase by 34% between 2020 and 2030, and with it the prevalence of neuro-psychiatric disorders, particularly dementia, which have an extremely high impact on people’s well-being and on their social and economic circumstances.

Mild Cognitive Impairment (MCI) is the transition stage between healthy aging and dementia and is characterized by subtle cognitive deficits that do not meet the criteria for the diagnosis of a major neuro-cognitive disorder (DSM-5) [2]. These difficulties can manifest themselves in areas such as memory, attention, language, orientation or decision making. Thus, detecting MCI in its early stages is beneficial for preventing the progression of the disease and, in certain cases, for slowing down some of its symptoms. However, in most cases the detection of cognitive deficits occurs when the symptoms are already evident and the underlying neurological disorder has already been present for some time [3], meaning the disease has already progressed. The traditional screening method for early detection of cognitive impairment involves the use of clinically validated gold-standard tests that assess the cognitive state of a person.

The inception of these tests traces back to the second half of the 20th century. One of the first widely used screening tools was the Mini-Mental State Examination (MMSE), published by Folstein [4] in 1975; it includes items on orientation, concentration, attention, verbal memory, naming and visuospatial skills. In the 1980s, the Alzheimer’s Disease Assessment Scale-Cognitive Subscale (ADAS-Cog) was developed [5]; it included 7 items, namely word recall, naming, commands, constructional praxis, ideational praxis, orientation and word recognition.

Table 1 Demographic and neuro-psychological characteristics of participants in each study group

One of the limitations of these evaluation instruments is the fact that they are dementia-oriented, particularly Alzheimer’s. Therefore, in later years other screening tools were created, e.g., the Montreal Cognitive Assessment (MoCA) [6] test, which has a 90% sensitivity for MCI detection (MMSE is not sensitive to MCI). Its telephone version (T-MoCA) [7, 8] is also validated and has a strong correlation with MoCA with a Pearson coefficient of 0.74.

The fact that MoCA is oriented at MCI detection makes it suitable as a screening tool for an early diagnosis.

In this context, the use of Information and Communication Technologies (ICT) could be a valuable tool for the early detection of MCI cases in a reliable and efficient way, with smart conversational agents being a disruptive technology with the potential to help detect neuro-psychiatric disorders in early stages [9, 10]. Note that the penetration of these technological tools among senior adults is not as high as in other age groups, which makes these tools even more relevant.

Previous research demonstrated that it is possible to implement a voice-based version of a gold standard test for cognitive assessment using conversational agents [11]. More specifically, DigiMoCA, an Alexa voice application based on T-MoCA, was developed and tested with actual elderly people using a smart speaker.

DigiMoCA makes use of Alexa’s voice recognition and natural language processing services, and persistently stores and retrieves session data in DynamoDB (Amazon’s NoSQL database service). Additionally, DigiMoCA utilizes prosodic annotations to adapt the speech rate to the user, and collects the response time to each item using a statistical estimation of round-trip times. This information is subsequently used to enhance DigiMoCA’s CI screening performance. DigiMoCA was evaluated using the Paradigm for Dialogue System Evaluation (PARADISE), yielding a confusion matrix with a Kappa coefficient \(\kappa = 0.901\). This means DigiMoCA understands the user approximately 90% of the time, which is equivalent to “almost perfect” [12] in terms of task completion performance.
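For reference, the kappa coefficient reported above is the standard Cohen’s kappa computed over a confusion matrix of expected versus recognized responses. A minimal sketch of that computation is shown below (in Python, with illustrative placeholder counts, not the actual study data):

```python
import numpy as np

def cohens_kappa(confusion: np.ndarray) -> float:
    """Cohen's kappa from a square confusion matrix (rows: expected, columns: observed)."""
    n = confusion.sum()
    p_observed = np.trace(confusion) / n  # raw agreement
    p_expected = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / n**2  # chance agreement
    return (p_observed - p_expected) / (1 - p_expected)

# Illustrative 2x2 matrix of understood vs. misunderstood dialogue turns (placeholder values).
example = np.array([[90, 5],
                    [4, 91]])
print(round(cohens_kappa(example), 3))  # ~0.905 with these placeholder counts
```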

The main objective of this work is to analyze the acceptability and usability of DigiMoCA through a user interaction pilot study [13]. For this, the perception of senior end-users as well as administrators was collected by means of standard evaluation questionnaires, and the outcomes were analyzed using standard statistical procedures.

Thus, the research question posed is:

Is the screening tool DigiMoCA acceptable and usable for the cognitive evaluation of senior adults, both by them and their evaluators?

Section 2 describes the sample of participants, the study design and the data analysis carried out; Section 3 presents and discusses the findings of the study, both from the senior end-users’ as well as the administrators’ point of view; finally, Section 4 summarizes the results of this research.

2 Material and methods

This user-interaction study included the participation of 46 senior end-users and 24 sector-related professionals. According to previous relevant works [14, 15], the calculation of the number of participants for a pilot study should take into account: (1) the parameters to be estimated; (2) the involvement of at least 30 participants; and (3) a minimum confidence interval of 80%. The present study fits all three criteria.

Table 2 DigiMoCA administrators’ demographic and professional characteristics

Senior end-users participated through two associations: the Parque Castrelos Daycare Center (PCDC) and the Association of Relatives of Alzheimer’s Patients (AFAGA), both located in the city of Vigo (Spain). Before the start of each study, applications were submitted to the Research Ethics Committee of Pontevedra-Vigo-Ourense, containing: (1) the objectives of the study, main and secondary; (2) the methodology proposed, i.e. the tests and questionnaires to administer, inclusion and exclusion criteria, recruiting procedure within the association, sample size and structure, and detailed schedule; (3) security concerns and how to address them (anonymization and encryption); (4) ethical and legal aspects, particularly regarding data privacy; and finally, (5) a copy of the informed consent to be signed in advance by all participants. The applications for AFAGA and PCDC were approved by the corresponding dictums with registration codes 2021/213 and 2023/115, respectively.

Inclusion criteria for senior participants consisted mainly of being over the age of 65 and not having an advanced state of dementia, any other psychological pathology, or any auditory/vocal disability. Table 1 summarizes the demographic characteristics of the end-user participants, classified by cognitive group. The mean age was 78.61 ± 6.75 years, with 65% of the participants being female. The number of individuals is fairly evenly distributed across groups. For cognitive state classification, we used the Global Deterioration Scale (GDS) [16], a widely utilized scale that describes the stage of cognitive impairment, with higher GDS scores meaning more deterioration. For additional information, Table 1 also reports the results of the T-MoCA evaluation for healthy users (HC), users with MCI and users with dementia (AD), as well as the Memory Failures of Everyday (MFE) [17] questionnaire and the Instrumental Activities of Daily Living (IADL) scale [18].

Administrator participants, on the other hand, were affiliated with several institutions, namely the Unit of Psychogerontology at the University of Santiago de Compostela, the Galicia Sur Health Research Institute, the Multimedia Technology Group at the University of Vigo, and also AFAGA and PCDC. Table 2 presents the information about these participants, who are predominantly from the health field. The sample has a 58.33% female composition, is mostly composed of middle-aged participants, and is fairly evenly distributed among different backgrounds. We can also see a variety in terms of seniority, ranging from less than 5 years of experience (29.17%) to more than 20 (20.83%).

2.1 Study design

The study was organized into three sessions: during the first one, the T-MoCA, MFE and IADL questionnaires were administered; during the second, after an interval of at least two weeks, the first DigiMoCA administration took place. Finally, again after two or more weeks, a second administration of DigiMoCA was carried out during the third session.

Before the first and after the second conversation with the agent, participants were asked to answer a Technology Acceptance Model (TAM) [20] questionnaire, which covers how users come to accept a technological system. In order to determine the acceptability of the conversational agent by participants, the designed TAM questionnaire addressed 3 dimensions:

  • Perceived usefulness (PU). It measures whether a participant finds the smart speaker useful, both as a general concept, and specifically during the cognitive assessment sessions.

  • Perceived ease-of-use (PEoU). It measures whether the conversation with the speaker was comfortable and straightforward for the user, purely in terms of communication.

  • Perceived satisfaction (PS). It measures whether the user enjoyed the utilization of the speaker, and whether they prefer it to a human counterpart (i.e., another person conducting T-MoCA as an interviewer).

The resulting questionnaire consisted of a 5-point Likert rating scale composed of 6 items, 2 for each main dimension (1 meaning strongly negative/disagree, 5 strongly positive/agree, 3 neutral). For reference, the TAM questionnaire used is available in Section 1, translated to English.

In addition to studying how end-users interacted with DigiMoCA, another study was conducted to gather the opinions of cognitive evaluation administrators on its usability and user-friendliness. These were individuals either responsible for administering cognitive assessment tools to older adults, or had a background of expertise related to application development and voice assistants. A 7-point Likert scale questionnaire based on the Post-Study System Usability Questionnaire (PSSUQ) [21] was used (1 meaning strongly disagree, 7 strongly agree, 4 neutral). The English translation of the PSSUQ questionnaire used is available in Section 2.

The PSSUQ-based questionnaire was designed to evaluate 3 usability dimensions, plus an overall score (a scoring sketch is given after the list):

  • System usefulness: measures the ease of use and convenience. In the designed version, it includes the average score of items 1 to 8.

  • Information quality: measures the usefulness of the information and messages provided by the application. It includes the average score of items 9 to 14.

  • Interface quality: measures the friendliness and functionality of the user interface of the system. It includes the average score of items 15 to 17 of the questionnaire.

  • Overall: measures overall usability, computed as the average of the scores of all items (1 to 18 in our case).
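Given these item groupings, the dimension scores reduce to simple averages over item subsets. A minimal scoring sketch follows, assuming (hypothetically) that responses are stored with one row per administrator and columns q1 to q18 on the 1-7 scale:

```python
import pandas as pd

# Hypothetical responses: one row per administrator, columns q1..q18 (1-7 Likert scale).
responses = pd.DataFrame(
    [[6, 7, 6, 5, 6, 7, 6, 6, 5, 6, 6, 5, 6, 6, 6, 5, 6, 7]],
    columns=[f"q{i}" for i in range(1, 19)],
)

def pssuq_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Compute the PSSUQ-based dimension scores used in this study."""
    items = lambda a, b: [f"q{i}" for i in range(a, b + 1)]
    return pd.DataFrame({
        "SYSUSE": df[items(1, 8)].mean(axis=1),       # system usefulness, items 1-8
        "INFOQUAL": df[items(9, 14)].mean(axis=1),    # information quality, items 9-14
        "INTERQUAL": df[items(15, 17)].mean(axis=1),  # interface quality, items 15-17
        "OVERALL": df[items(1, 18)].mean(axis=1),     # overall usability, items 1-18
    })

print(pssuq_scores(responses).round(2))
```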

2.2 Data analysis

The following statistical instruments were used to assess acceptability:

  • Fundamental statistics: mean, standard deviation and percentages.

  • Cronbach’s Alpha (\(\alpha \)) [22] to estimate the reliability, and specifically the internal consistency, of the responses. It is widely used in psychological test construction and interpretation, and it seeks to measure how closely test items are related to one another, i.e., whether they measure the same construct. When test items are closely related to each other, Cronbach’s alpha will be closer to 1; if they are not, it will be closer to 0. In this study, we use this metric to evaluate the internal consistency of the responses to the TAM (end-user-centered) and PSSUQ (administrator-centered) questionnaires. It is computed as follows:

    $$\begin{aligned} \alpha = \frac{k}{k-1} \left( 1 - \frac{\sum _{i=1}^{k} \sigma _i^2}{\sigma _x^2}\right) \end{aligned}$$
    (1)

    Where:

    • k is the number of items/questions included.

    • \(\sigma _i^2\) is the variance of each item across all responses.

    • \(\sigma _x^2\) is the total variance, including all items.

    According to Gliem [23], the value of Cronbach’s alpha can be interpreted in terms of internal consistency as follows: \(\alpha > 0.9\) means “excellent”; \(\alpha > 0.8\) means “good”; \(\alpha > 0.7\) means “acceptable”; \(\alpha > 0.6\) means “questionable”; and anything below 0.6 is considered an indicator of low internal consistency.

  • Student’s t-tests [24] were used to compare pre-pilot and post-pilot questionnaires, giving insight into the evolution of the participants’ acceptability perception during the administration. Statistical significance was measured by means of p-values.

  • Cohen’s d [25]: measures the effect size of T-tests, and is computed as the standardized mean difference between two groups (in this case, pre-pilot and post-pilot). It is computed as the difference between the means divided by the square root of the average of both variances:

    $$\begin{aligned} Cohen's\ d = \frac{M_{post} - M_{pre}}{\sqrt{\frac{SD_{pre}^2+SD_{post}^2}{2}}} \end{aligned}$$
    (2)

    Based on Tellez’s analysis [26], the interpretation of Cohen’s d is as follows: \(d < 0.2\) is a “trivial effect”; \(0.2< d < 0.5\) is a “small effect”; \(0.5< d < 0.8\) is a “medium effect”; and \(d > 0.8\) means a “large effect”.

Statistical analysis was performed using the Google Sheets online tool, as well as Google Colab with Jupyter notebooks written in Python. Several commonly-used data analysis libraries were used (e.g., NumPy, Pandas, Pingouin).
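As a minimal sketch of how these metrics could be obtained with the libraries mentioned above (the data and column names below are hypothetical placeholders, not the study’s data):

```python
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical TAM responses (1-5 Likert): one row per participant, one column per item.
tam = pd.DataFrame(rng.integers(1, 6, size=(46, 6)),
                   columns=["PU1", "PU2", "PEoU1", "PEoU2", "PS1", "PS2"])

# Cronbach's alpha: internal consistency of the item set (Eq. 1).
alpha, _ci = pg.cronbach_alpha(data=tam)

# Hypothetical pre/post scores of a single item for the paired comparison.
pre = tam["PU1"].to_numpy(dtype=float)
post = np.clip(pre + rng.integers(0, 2, size=len(pre)), 1, 5)

# Paired Student's t-test and its p-value.
t_stat, p_value = stats.ttest_rel(post, pre)

# Cohen's d as the standardized mean difference (Eq. 2), plus the percentage change.
d = (post.mean() - pre.mean()) / np.sqrt((pre.std(ddof=1)**2 + post.std(ddof=1)**2) / 2)
pct_change = 100 * (post.mean() - pre.mean()) / pre.mean()

print(f"alpha={alpha:.2f}  t={t_stat:.2f}  p={p_value:.3f}  d={d:.2f}  change={pct_change:+.1f}%")
```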

3 Results and discussion

This section presents and analyzes the main results obtained regarding the usability and acceptability of DigiMoCA, both from the end-users’ perspective (sample of n = 46) as well as the administrators’ (n = 24).

3.1 User interaction from senior end-users

As explained in Section 2, users completed the TAM questionnaire before and after the administration of DigiMoCA. The questionnaire included two sections, each with the 3 dimensions and 6 questions: one focused on technology in general, and another focused on DigiMoCA and conversational agents.

Table 3 DigiMoCA’s usability perception and Cronbach’s alpha scores for end users, based on TAM [20]

Table 3 presents the results of TAM’s 3-dimensional scale, taken from the post-administration evaluation and corresponding to the DigiMoCA section. The most relevant results are:

  • Perceived usefulness: a value of 3.87 ± 0.92 was obtained including all groups, with the highest rating within the MCI group (4.11 ± 0.92) and the lowest from the HC group (3.42 ± 0.93). Regarding the internal consistency of the answers, a value of \(\alpha \) = 0.63 was obtained, with the most internally consistent group being HC (\(\alpha \) = 0.76) and the least consistent being MCI (\(\alpha \) = 0.42).

  • Perceived ease of use: a value of 3.98 ± 0.96 was obtained including all groups. Once again, the highest mean value was found in the MCI group (4.14 ± 0.99), whereas the lowest rating was obtained within the HC group (3.83 ± 0.96). In terms of internal consistency, a value of \(\alpha \) = 0.73 was obtained overall, with the HC group being the most internally consistent (\(\alpha \) = 0.96) and MCI the least (\(\alpha \) = 0.28).

  • Perceived satisfaction: including all groups we observe a value of 3.27 ± 1.21, in this case with the best rating coming from the AD group (3.47 ± 1.16) and the worst rating from the HC group (3.00 ± 1.22). Regarding the internal consistency, a value of \(\alpha \) = 0.41 was obtained, with the most internally consistent group again being HC (\(\alpha \) = 0.56) and the least consistent being MCI (\(\alpha \) = 0.25).

Overall, we consider these results to be rather positive: none of the ratings drop below 3 (out of 5) on average, either considering the overall sample or any particular group/sub-sample. This means that regardless of the level of cognitive deterioration, the users find DigiMoCA useful, easy to use and satisfactory.

Regarding the internal consistency, however, it is only “acceptable” for one of the dimensions (PEoU), with a worryingly low value for the PS dimension. We believe this inconsistency is caused by the disparity of results obtained for the two questions regarding PS: the first asks whether participants “liked to use DigiMoCA”, and the second whether they would rather “use DigiMoCA instead of T-MoCA”. We observe that the answers to the second question (i.e., after interacting with the agent) are considerably lower than those to the first, perhaps due to the comparison between a human-robot interaction and a human-human interaction (which is usually strongly preferred by this demographic group).

Additionally, we can observe a tendency for the MCI group to give the highest ratings but with lowest internal consistency, whereas the HC group usually gives the lowest ratings but with highest internal consistency. One possible explanation for this behavior is that cognitive impairment can interfere with consistent reasoning; it is also likely that users with MCI had more trouble understanding the full implications of the questions posed, giving less consistent answers. Certainly, it is reasonable to believe that healthy users are generally more sensitive to the intrusiveness of these evaluations, hence the slightly lower ratings.

Tables 4 and 5 present the results of the perception variation between pre-administration and post-administration of DigiMoCA. Table 4 contains the results regarding the section about technology in general, while Table 5 contains the results of the section about conversational agents. Again, the data is classified by TAM dimensions (rows), including the results for each individual question (“.1” and “.2” for each dimension), and by cognitive group (columns): HC, MCI, AD and the whole sample.

The main objective of this analysis is to determine whether the acceptability perception of users changes significantly after the administration of DigiMoCA. For this, we performed Student’s t-tests with the pre and post questions, and obtained three metrics: the percentage change between the averages, Cohen’s d and the statistical significance (p-value). The following paragraphs address the main findings of this process.

Regarding the technology section, there is a percentage increase in all items of the first two dimensions: +6.17% for PU.1 (d = 0.33), +3.05% for PU.2 (d = 0.11), +5.26% for PEoU.1 (d = 0.17) and +9.00% for PEoU.2 (d = 0.44). However, there is only one item (PEoU.2) that exhibits a significant change (p = 0.010). Both items from the PS dimension remain essentially unchanged. Therefore, generally speaking, we can establish that the administration does not significantly change the acceptability of technology in senior adults, but we do observe a non-significant positive change in both PU and PEoU items. Furthermore, if we look at the sample sub-groups independently, we can also observe a positive non-significant change in the vast majority of items, only one of them being significant (PEoU.2 for AD group with +17.08% change; d = 0.84, p = 0.007).

Table 4 Study of perception variation about technology by cognitive group
Table 5 Study of perception variation about conversational agents by cognitive group

With respect to the conversational agents section, the acceptability shows a more noticeable improvement among most items, three of them being statistically significant, and we also find the first item with a “large effect” size: PU.1 with +59.14% (d = 1.06, p < 0.001), PEoU.2 with +13.71% (d = 0.65, p = 0.005), and PS.1 with +12.22% (d = 0.61, p = 0.005). We should also note that the PS.2 item has a significant decrease of -24.24% (d = 0.95, p < 0.001), but we do not think this particular item is a good representative of the PS dimension since, as stated previously, the pre and post questions are different, and thus it should be taken with a grain of salt. If we look at the sample sub-groups independently, we can notice that none of the significant changes are in the HC group, while most are concentrated in the MCI group: +85.84% (d = 1.29, p < 0.001) for PU.1, +21.61% (d = 1.04, p = 0.007) for PEoU.2, and +16.90% (d = 1.14, p = 0.003) for PS.1. Within the AD group, only PU.1 is statistically significant (+58.59%, d = 1.02, p = 0.013).

In light of the results discussed, it seems reasonable to affirm that the acceptability of conversational agents by senior adults improves significantly after the interaction with DigiMoCA. To support this, we found that at least one item exhibits a statistically significant (p < 0.05) positive change in all 3 dimensions, and if we discard item PS.2, which as pointed out above is probably not an accurate indicator, all items show an increase in acceptability across all groups.

Table 6 Usability perception and Cronbach’s alpha scores for DigiMoCA administrators based on PSSUQ [21]

3.2 Usability perception of DigiMoCA from administrators

In addition to the end-user interaction study, an additional study was carried out in order to measure the usability perception of DigiMoCA by cognitive assessment administrators and professionals. For this, we employed the PSSUQ questionnaire with items rated on a 7-point Likert scale, which is widely used to measure users’ perceived satisfaction with a software system. Table 6 summarizes the results, which are also categorized by gender, field of occupation and years of experience:

  • Overall usability (OVERALL): we obtain a 5.86 ± 1.24 mean value for all participants and all items. The mean rating does not excessively change based on gender or career experience, although the average rating for participants in the technological field was slightly higher (6.26 ± 0.94). The internal consistency obtained was “excellent" (\(\alpha \) = 0.95) overall, with some slight differences based on gender (\(\alpha \) being 0.88 for males and 0.97 for females), field of expertise (\(\alpha \) = 0.96 for health field, 0.90 for technological field) and experience (\(\alpha \) = 0.91 for administrators with 10+ years of experience, 0.97 for the ones with less than 10).

  • System usefulness (SYSUSE): including items 1 to 8, we obtain a mean value of 5.96 ± 1.14 for all participants. Again the mean rating is not considerably affected by gender or career experience, but we do obtain a slightly higher value of 6.36 ± 0.94 for participants in the technological field. As for the internal consistency of the answers, we get an “excellent" \(\alpha \) = 0.91 for the whole sample, although it does drop to just “good" for the male group (\(\alpha \) = 0.85) and the most experienced participants (\(\alpha \) = 0.88). The lowest internal consistency is found within the technological field, with an “acceptable" \(\alpha \) = 0.76.

  • Information quality (INFOQUAL): the mean value obtained from items 9 to 14 was 5.74 ± 1.44 overall. Once again, the highest differences found were based on the field of expertise: the technological field group had the highest mean value of 6.17 ± 1.22, while the lowest value was obtained from the health field group (5.63 ± 1.48). The overall internal consistency was \(\alpha \) = 0.90, and we do find differences between the demographic groups: higher consistency for females (\(\alpha \) = 0.96) than males (\(\alpha \) = 0.74); higher consistency for the health field group (\(\alpha \) = 0.91) than the technological group (\(\alpha \) = 0.79); and higher consistency for the least experienced individuals (\(\alpha \) = 0.93) than the most experienced (\(\alpha \) = 0.84).

  • Interface quality (INTERQUAL): including items 15 to 17, the overall mean rating was 5.81 ± 1.11. For this dimension the mean value for the technological field group was the highest (6.11 ± 0.65), and the mean value for the least experienced group was the lowest (5.71 ± 1.23). As for the internal consistency, this was the dimension with the lowest overall value, an “acceptable” \(\alpha \) = 0.77. Again we find considerable differences between demographic groups: higher \(\alpha \) = 0.88 for females than for males (\(\alpha \) = 0.34), higher \(\alpha \) = 0.80 for the health field than for the technological field (\(\alpha \) = 0.27), and higher \(\alpha \) = 0.90 for the less experienced group than for the people with 10+ years of experience (\(\alpha \) = 0.42). This is the only dimension where the internal consistency drops below an “acceptable” level, probably due to the small number of items it considers (only three).

In light of the presented results, we observe that the overall usability perception is generally positive, slightly under 6 out of 7 points, and never drops below 5 for any of its dimensions, even if considering specific demographic groups based on gender, career field and experience.

We do observe a pattern between the groups: females provide slightly lower ratings than males, but with a higher internal consistency. Exactly the same happens between the health field group (i.e., slightly lower ratings and higher consistency) and the technological field group, as well as between the most experienced group and the least experienced one. The fact that this pattern repeats across groups is expected, and it is probably due to the fact that the groups overlap: more males than females work in the tech field, and the males happen to be younger on average than the females (34.9 years old vs. 40.14, cf. Table 2), hence the difference found between different seniority groups. Furthermore, we noticed that participants from the medical field made more comments suggesting improvement areas than participants from the technical field, particularly regarding the user interface.

As to why this pattern occurs, we believe it is justified since DigiMoCA is inherently a technological and disruptive screening tool. Therefore, it is to be expected that professionals from the technological field are more keen on using it, and generally more interested in it and curious about how it works. Conversely, it also makes sense that professionals from the health field are more “skeptical” and less interested, since the health field is generally more stable and less prone to disruptive changes [27], and certainly more people-oriented than tool-oriented.

Finally, the fact that the information and interface-related items obtain a slightly lower rating across all groups is justified, as one of the main drawbacks of using a voice-only communication channel is the restriction of the user interface, which lacks visual user interaction. This probably means that the PSSUQ questionnaire should be adapted in this context to new ICT tools based on conversational agents, where questions about the user interface either need a reformulation or simply to be excluded.

4 Conclusion

In this paper, a user-interaction pilot study analyzing the usability and acceptability of DigiMoCA (a digital, Alexa-based cognitive impairment screening tool based on T-MoCA) is discussed, both from the end-users’ and the administrators’ perspectives.

In the case of end-users, a TAM questionnaire was utilized, administered both before and after the DigiMoCA sessions. Overall, the results show that users accept DigiMoCA, giving it a score above 3 in all three TAM dimensions, meaning that they perceive it as useful, easy to use and satisfactory. The perceived ease of use was particularly positive and internally consistent, with a mean score of 3.98. Additionally, the pre vs. post analysis shows that, while the acceptability of technology in general does not change significantly after the administration of DigiMoCA, the perceived acceptability of conversational agents specifically improves significantly. All three dimensions have an item with a statistically significant positive change. Moreover, the vast majority of non-significant changes were also positive.

In the case of test administrators, a PSSUQ questionnaire was used. Its results show that DigiMoCA is considered usable (mean score of 5.86) very consistently (\(\alpha \) = 0.95), with scores above 5 out of 7 for all dimensions and demographic groups. System usefulness was rated consistently higher than information and interface quality, and the biggest demographic differences were found between the health field group and the technological field group.

The sample size is one of the main limitations of the study. To estimate an ideal sample size, we first obtained an estimate of the prevalence of AD in Spain (10.285%, see Footnote 1). Then, for the required confidence level of 95%, we would need n = 142 participants per study group, which is far from the sample size achieved so far.
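For clarity, this figure can be reproduced with the standard sample-size formula for estimating a proportion, assuming (our assumption, since it is not stated above) a 5% margin of error and \(z = 1.96\) for 95% confidence:

$$\begin{aligned} n = \frac{z^2 \, p \, (1-p)}{e^2} = \frac{1.96^2 \times 0.10285 \times (1 - 0.10285)}{0.05^2} \approx 142 \end{aligned}$$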

Future lines of work include further characterizing the sample and studying acceptability and usability as a function of the participants’ technological training, including their relationship with technology throughout their lives. Additionally, it could be worthwhile to analyze more objective metrics, such as participants’ response times, which could enrich the study of DigiMoCA.

Ongoing work addresses the improvement of the perceived satisfaction of using DigiMoCA by making it more friendly, while also improving its interface and the information provided to the user, compensating for the limitations of voice-only interaction. As these aspects improve and user interaction with conversational agents is perceived as increasingly close to that with human administrators, the distinctive affordability and accessibility of smart assistant-based tests can effectively establish them as a powerful screening technology.
