Introduction

Severe mental disorders (SMDs) are defined as non-organic psychoses with long illness duration and severe functional impairment [1]. SMDs include schizophrenia, bipolar disorder, and major depressive disorder with psychotic features. Despite their relatively low prevalence, these disorders are among the leading causes of years lived with disability (YLD) [2]. Research shows that people with SMDs have significantly more cognitive difficulties than healthy controls [3,4,5,6,7]; moreover, a recent systematic review showed that cognitive symptoms in people with schizophrenia (PWS) follow heterogeneous trajectories [8].

Cognition refers to thinking skills, including acquiring and retaining knowledge, processing information, and reasoning. Cognitive function encompasses intellectual abilities such as perception, reasoning, and remembering; impairment in these functions (e.g., memory, judgment, and comprehension) is referred to as cognitive impairment [9]. PWS tend to have greater cognitive impairment than people with bipolar disorder (PWBD) and people with depression (PWD) [10,11,12,13].

Although SMDs share broadly similar domains of cognitive impairment, the impairment in PWS is more global than that in PWBD and PWD. Both PWS and PWBD show impairment in attention, verbal learning, and executive function [14, 15], whereas processing speed, working memory, verbal and visual learning, and reasoning are impaired in both PWS and PWD [14, 16]. In addition to these domains, PWS have more prominent impairment in social cognition [14], which may help differentiate PWS from PWBD and PWD.

Cognitive impairment in people with SMDs is associated with poor functional and clinical outcomes [17,18,19,20,21]. A recent study also showed that cognition worsens gradually if no intervention is provided [22]. Measuring the cognition of people with SMDs with robust instruments is important, since measurement and assessment are the first step toward intervention. For this purpose, several measures of cognition have been developed and validated in people with SMDs. However, most of these have been developed in Western countries and have not always been well adapted for use in low-income settings [23,24,25,26,27,28]. Norms for low- and middle-income countries (LMICs) also do not always exist, making the use and interpretation of these measures complex. In addition, it is not clear which measures would be better candidates for adaptation to low-income settings, since there is no previously synthesized report on the measurement properties of measures adapted in LMICs. Furthermore, most cognitive measures require literacy to respond to the items; a separate review of validation studies conducted in LMICs can therefore show readers which measures are more appropriate for adaptation in countries with low literacy rates. Finally, because multiple languages are spoken in most LMICs, such a review may also show which measures have been adapted across different LMICs and languages.

Although there are numerous studies on the validation of cognitive measures in people with SMDs, only one systematic review has addressed this topic [29], and none has focused on LMICs. Such a review is important because it can help researchers and clinicians choose the most appropriate measure for their context. This systematic review therefore aims to fill this gap by reviewing the psychometric properties of cognitive measures adapted or developed and validated among people with SMDs in LMICs.

Methods

We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline in conducting and reporting this systematic review [30]. We registered the protocol in the Prospective Register of Systematic Reviews (PROSPERO) before starting the search (registration number: CRD42019136099).

Databases searched

PubMed, Embase, PsycINFO, Global Index Medicus, and African Journals Online (AJOL) were searched from their dates of inception until June 07, 2019. Google Scholar was used for forward and backward searching; we conducted backward searching on 3 September 2019 and forward searching on 29 September 2019.

Search strategy

We used free-text and controlled vocabulary terms for four concepts: SMDs, cognition, psychometric properties, and LMICs, and combined these four concept blocks with the Boolean operator “AND”. For the complete search strategy, see online resource 1. To increase our chance of capturing all measures validated for the assessment of cognition in people with SMDs, we conducted forward and backward searches. In addition to our registered protocol, we consulted experts in the field by emailing them the final list of identified measures and asking about potentially missed measures.
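As a rough illustration of this structure (not the registered strategy, which is in online resource 1), the following Python sketch shows how synonyms within each concept block are OR-combined and the four blocks are then AND-combined; all terms are hypothetical placeholders:

```python
# Hypothetical sketch of the four-block Boolean search structure described
# above; the terms are illustrative placeholders, not the registered strategy.
CONCEPT_BLOCKS = {
    "SMDs": ["schizophrenia", "bipolar disorder", "psychotic depression"],
    "cognition": ["cognition", "neurocognition", "social cognition"],
    "psychometrics": ["psychometric properties", "validity", "reliability"],
    "LMICs": ["developing countries", "low income countries", "middle income countries"],
}

def or_block(terms):
    # Synonyms and controlled vocabulary terms within one concept are OR-combined
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# The four concept blocks are AND-combined, as described above
query = " AND ".join(or_block(terms) for terms in CONCEPT_BLOCKS.values())
print(query)
```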

Eligibility criteria

This review considered studies that aimed to develop/adapt and validate a cognitive measure in people with SMDs aged 18 years and older in LMICs. Diagnoses needed to be confirmed using the Diagnostic and Statistical Manual of Mental Disorders (DSM) [31], the International Classification of Diseases (ICD) [32], or other recognized diagnostic criteria. For this study, SMDs included schizophrenia, bipolar disorder, and depressive disorders; we chose these three groups of disorders because cognitive impairment is prominent in them. We excluded normative studies and adaptation studies involving only healthy participants. Although a normative study is an important step in the adaptation of measures, our aim was to focus on evaluating measures validated in people with SMDs.

We included any measure which was used to assess at least one domain of cognition. Both performance-based (instruments that evaluate behavior on a task or performance) and interview-based (instruments in which the examiner scores the performance through clinical interviews) measures were included.

In this review, a validation study was operationally defined as any study conducted to evaluate the psychometric properties of a measure, i.e., a study whose main objective was to report different dimensions of reliability and validity. We also included studies that reported the process of adaptation or development of a measure in people with SMDs without reporting its psychometric properties. Only studies from LMICs were included, using the World Bank classification of countries' economic status for the 2018/2019 financial year as a reference. Only studies published in English were included, with no restriction on study design.

Full-text identification process

We merged the articles retrieved from the databases and removed duplicates. Two of the authors (YG, AD) independently screened each article for eligibility by title and abstract, followed by full-text screening. Disagreements between the two screeners were resolved by consensus.

Data extraction

The first author (YG) extracted data from the included articles using a data extraction tool developed a priori, and another author (AD) checked all extracted data for correctness. The tool was developed in consultation with the senior authors, with reference to previously published systematic reviews and the requirements of the quality assessment, and was piloted on two articles (the data extraction template is in online resource 2). The core components of the data extraction tool were:

  • Authors’ name and affiliation, date of publication, and country

  • Study design

  • Type of the study (development, adaptation, validation)

  • Mode of administration (interview-based vs performance-based)

  • Total number of participants in each group (control vs patients)

  • Sociodemographic characteristics (age, gender, educational status, and language)

  • Duration to administer the tool

  • Specific cognitive domains addressed and the number of items of the measure and the domains/sub-tests

  • Psychometric properties reported, method of analysis, and findings

  • Elements of the quality assessment tool (described in detail on the quality assessment section below)

Risk of bias/quality assessment

YG and AD independently assessed the risk of bias of individual studies using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist [33]. Any disagreements between the two authors were resolved by discussion and consensus. In a deviation from the registered protocol, we used the updated version of the COSMIN checklist in conducting this review.

COSMIN has a total of 10 boxes for 10 different psychometric properties (each box has 3–35 items). Each item is scored on one of four ratings (very good, adequate, doubtful, or inadequate). A summary quality rating per measurement property is given for each study by taking the worst rating across the items for that property (“worst score counts”) [36]. Since the included studies were not homogeneous, we were not able to conduct an assessment of publication bias.
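A minimal sketch of this “worst score counts” rule, assuming the four ratings above (the item ratings in the example are hypothetical):

```python
# COSMIN "worst score counts": the summary quality of a study for one
# measurement property is the lowest rating among that box's items.
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]

def property_quality(item_ratings):
    """Return the worst (lowest) rating across the items of one COSMIN box."""
    return min(item_ratings, key=RATING_ORDER.index)

# A single doubtful item pulls the whole property down to doubtful
print(property_quality(["very good", "adequate", "doubtful"]))  # -> doubtful
```

In other words, one weak item cannot be compensated for by strong ratings on the other items of the same box.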

Criteria used to rank order the measures

In addition to our registered protocol, we evaluated and ranked the cognitive measures validated in PWS using five criteria that we developed by adapting those of previous reviews [37,38,39,40,41]. Our main reason for rank ordering the measures was to recommend better measures for adaptation in other settings (not to assess quality). The criteria used to rank the measures were:

  1. The number of studies that adapted/validated the measure: each measure received one point for every study that reported information about it, so a measure adapted in many studies scored higher.

  2. Year of publication of the studies that adapted/validated the measure: year of publication was considered in addition to the number of studies, to reduce the risk of recommending a measure adapted by many studies simply because it was developed earlier than others. A score of 5 was given for studies published in or after 2015, and a score of one for studies published before 1980. For measures evaluated in more than one study, the average of the publication year scores was taken.

  3. The number of domains the measure addressed: a single score was given by counting the number of domains the measure covers from the list of domains reported as impaired in PWS in the systematic review of Nuechterlein et al. [14].

  4. Duration to administer: scored inversely from one to three, i.e., three for brief measures taking 30 min or less, two for measures taking between 30 and 60 min, and one for measures taking more than 60 min to administer.

  5. The number of psychometric properties addressed and the findings: this criterion considers how many psychometric properties were evaluated for the measure and the findings reported. A scale from one to eight was used: the maximum score was given if five or more measurement properties from the COSMIN list were evaluated with excellent findings, and the minimum score if fewer than two measurement properties were evaluated and none had excellent findings. We did not consider the COSMIN quality rating in this criterion, only the number of measurement properties evaluated from the COSMIN list and the findings reported. According to this criterion, a better measure is one for which many measurement properties have been evaluated, all with excellent findings.

If the necessary information was not reported in the included studies, we gave a score of zero (not reported). The overall ranking of the measures was based on the total sum of scores across the above five criteria. The highest possible total score is 28, with higher scores indicating a measure more suitable for recommendation.
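A minimal scoring sketch of these five criteria, under stated assumptions: criterion 1 is capped at five so that the criterion maxima sum to 28 (5 + 5 + 7 + 3 + 8), the publication-year bands between 1980 and 2014 are interpolated (the text specifies only the end points), and the example inputs are hypothetical:

```python
# Five-criterion ranking sketch; assumptions and inputs as stated in the lead-in.
def year_score(year):
    if year >= 2015:
        return 5
    if year < 1980:
        return 1
    return 2 + (year - 1980) * 3 // 35    # assumed linear banding between 2 and 4

def duration_score(minutes):
    if minutes is None:
        return 0                           # not reported
    if minutes <= 30:
        return 3
    if minutes <= 60:
        return 2
    return 1

def rank_score(n_studies, pub_years, n_domains, minutes, psychometric_score):
    return (min(n_studies, 5)                                   # criterion 1 (assumed cap)
            + sum(map(year_score, pub_years)) / len(pub_years)  # criterion 2 (averaged)
            + n_domains                                         # criterion 3 (max 7, per [14])
            + duration_score(minutes)                           # criterion 4
            + psychometric_score)                               # criterion 5 (1-8)

# Hypothetical example loosely modeled on a five-study battery such as BACS
print(rank_score(5, [2008, 2011, 2015, 2017, 2018], 7, 37.4, 6))  # -> 24.6
```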

Data synthesis

We used a narrative synthesis to report the findings. For each identified measure, we reported psychometric properties, duration to administer, and other important extracted outcomes [e.g., the population in which the measure was evaluated, type of measure (performance- or interview-based), cognitive domains, number of items, etc.]. We also reported the methodological quality of each study for the specific measurement properties reported.

In addition to our registered protocol, we summarized and synthesized the findings. Since the purpose of the summary was general tool selection, we used the updated criteria for good measurement properties in the COSMIN manual for systematic reviews of patient-reported outcome measures, version 1, released in February 2018 [33]. For internal consistency, we graded a Cronbach’s α of ≥ 0.7 as excellent and < 0.7 as satisfactory. For test–retest assessment, an intra-class correlation coefficient (ICC) ≥ 0.7 was considered high, while < 0.7 was considered poor. For tests reporting a Pearson correlation (r) (test–retest reliability, convergent, or concurrent validity), we used Cohen’s classification: r ≥ 0.5 as large, 0.3–0.49 as medium, and 0.1–0.29 as small. Furthermore, we used the COSMIN criteria to summarize the evidence and grade the quality of evidence per measurement property for measures validated in PWS in more than one study [33].
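These cut-offs translate directly into simple grading rules; a minimal sketch (the function names are ours, the thresholds come from the text above):

```python
# Grading rules for the summary synthesis; thresholds as stated above.
def grade_cronbach_alpha(alpha):
    return "excellent" if alpha >= 0.7 else "satisfactory"

def grade_icc(icc):
    return "high" if icc >= 0.7 else "poor"

def grade_pearson_r(r):
    # Cohen's classification (test-retest, convergent, or concurrent validity)
    r = abs(r)
    if r >= 0.5:
        return "large"
    if r >= 0.3:
        return "medium"
    if r >= 0.1:
        return "small"
    return "below the smallest band"       # not graded in the text

print(grade_cronbach_alpha(0.82))  # -> excellent
print(grade_pearson_r(0.42))       # -> medium
```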

We reported results for performance-based and interview-based measures separately and compared measures validated in PWS with those validated in PWD and PWBD. Meta-analysis and meta-regression were not possible because of heterogeneity in the measures included and the measurement properties reported.

Results

Study characteristics

The search strategy yielded a total of 6091 articles, of which 67 remained after title and abstract screening. Full-text screening plus forward and backward searching resulted in 27 articles; one further article was added through expert recommendation, bringing the total number of articles included in this review to 28. Figure 1 shows the flow diagram of article identification. A list of excluded articles with the reasons for their exclusion is provided in online resource 3.

Fig. 1
figure 1

PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram. AJOL African Journals Online, GIM Global Index Medicus, LMICs Low- and middle-income countries, PWSMDs People with severe mental disorders

The 28 studies included in the review evaluated psychometric properties of 23 cognitive measures in people with SMDs from 12 LMICs. Most of these studies were from Brazil (n = 7/12) [42,43,44,45,46,47,48]. No study was conducted in low-income countries and only three studies were conducted in lower-middle-income countries [49,50,51] (Fig. 2).

Fig. 2
figure 2

Distribution of included articles in different geographical regions of the world

About two-thirds of the studies (n = 18/28) were conducted either in PWS or in people with schizophrenia spectrum disorders (PWSSD), with healthy controls (n = 13/18) [42, 43, 45, 46, 49, 52,53,54,55,56,57,58,59] or without healthy controls (n = 5/18) [47, 50, 51, 59, 60]. Three studies each were conducted in PWD [44, 62, 63] and PWBD [48, 64, 65] with healthy controls, while the remaining studies (n = 4/28) were conducted in mixed populations [66,67,68,69]. A total of 6396 participants (2196 clinical samples and 4200 healthy controls) were included in the main studies of this review. The sample size of PWS/PWSSD in the included studies ranged from 15 to 230 (median 50) for the main studies and from 15 to 188 for test–retest reliability. The sample size for healthy controls ranged from 15 to 1757 (median 77) for the main studies and from 15 to 84 for test–retest reliability studies. The mean age of PWS/PWSSD participants was 35.2 years. Most studies (n = 15/21) had more male participants. On average, PWS/PWSSD had 10.7 years of education. Table 1 describes the participants' characteristics.

Table 1 Study characteristics of included articles

Description of the measures

Twenty-three cognitive measures were identified from the 28 included studies. Of these, 15 were evaluated in PWSSD, three in PWD, one in PWBD, and one in PWS and PWD, while three measures were evaluated in a mixed population. The identified measures addressed from a single domain of cognition up to seven domains, with administration times ranging from 10 to 90 min.

Of the measures identified, 17 were performance-based: 12 were evaluated in PWSSD, one in PWD, one in PWS and PWD, and three in a mixed population. About half of these measures (n = 8/17) addressed only neurocognitive domains and six addressed social cognition, while three included domains of both neurocognition and social cognition. About two-thirds of the measures were batteries (n = 11/17), while six were single-domain tests.

Six of the 23 identified measures were interview-based. Of these, three were evaluated in PWSSD, two in PWD, and one in PWBD. Except for one, all of these measures addressed neurocognitive domains only (n = 5/6). See Table 2 for a detailed description of the measures.

Table 2 Description of the measures identified from the included articles

Psychometric properties evaluated

These are summarized in Table 3. The most commonly studied psychometric property was hypothesis testing, including convergent, concurrent, and known-group validity (evaluated in 25 studies), followed by internal consistency reliability (evaluated in 20 studies) and cross-cultural validity (evaluated in 15 studies). Test–retest reliability was evaluated in 14 studies and structural validity in 11 studies. The least reported measurement property was content validity (n = 3/28), followed by criterion validity (n = 4/28). None of the included studies evaluated responsiveness to change or measurement error. Very few (n = 6/28) studies reported other measurement properties, such as face validity (n = 1/28), learning effects (n = 1/28), tolerability/feasibility (n = 2/28), floor and ceiling effects (n = 2/28), comparison of measures (n = 1/28), or cross-cultural comparisons (n = 1/28).

Table 3 Psychometric properties reported in the included articles

It was not possible to pool the findings of the psychometric properties reported because of heterogeneity in the measures used, psychometric properties reported, and populations studied. We therefore summarized them in Table 3 and provided a narrative synthesis.

Almost all of the studies reported excellent internal consistency (n = 18/20), and most reported high test–retest reliability (n = 11/14). Fewer than one-third of the studies reported good concurrent validity (n = 4/14), whereas good convergent validity was reported by most (n = 11/15). All of the studies that evaluated known-group validity (n = 21) reported good ability to discriminate between clinical samples and healthy controls. Likewise, appropriate content validity (n = 4), excellent criterion validity (n = 4), high tolerability/feasibility (n = 2), good face validity (n = 1), minimal floor and ceiling effects (n = 2), and no learning effect (n = 1) were also reported.

Of the studies that evaluated performance-based measures, the majority reported excellent internal consistency (n = 8/12) and high test–retest reliability (n = 8/11), and half reported good concurrent validity (n = 4/8). Good convergent validity was reported in two-thirds of the studies (n = 6/9). Three studies assessed structural validity: two reported a one-factor structure and one a two-factor structure. Two studies evaluated criterion validity and reported excellent sensitivity and specificity. Nine different studies evaluated cross-cultural validity, yielding nine different versions of the measures.

Of the studies that evaluated interview-based measures, all reported excellent internal consistency (n = 8/8) and high test–retest reliability (n = 3/3), and most reported good convergent validity (n = 5/6). None of the included studies reported good concurrent validity, whereas two-thirds reported moderate concurrent validity (n = 4/6). Seven studies assessed structural validity: five reported one factor, one reported three factors, and one reported six factors. Two studies evaluated criterion validity and reported excellent sensitivity and specificity. Six different studies evaluated cross-cultural validity, yielding six different versions of the measures. Two studies evaluated the feasibility/tolerability of measures and found both to be feasible/tolerable for respondents (less than 1% missing values were reported). See Table 3 for details.

Summary of evidence per measure

Three studies [47, 61, 62] reported more than one measure, while six measures were described by more than one study. Table 4 summarizes the psychometric properties of measures reported in more than one study in PWS. For detailed psychometric properties of measures reported in other populations, or in PWS in only one study, see Table 3.

Table 4 Best evidence synthesis for measures evaluated in more than one study in people with schizophrenia, using the COSMIN manual for systematic reviews of PROMs, version 1, released in February 2018

Brief Assessment of Cognition in Schizophrenia (BACS): this performance-based measure was evaluated in five studies [42, 43, 51, 56, 60]. BACS has six sub-tests addressing seven domains of cognition and, on average, takes 37.4 min to administer in PWS (Table 2).

High-quality evidence was reported for internal consistency (positive ratings from four studies) and hypothesis testing (mixed results, with most of the four studies reporting good concurrent, convergent, and known-group validity). Moderate-quality evidence was reported for structural validity (a one-factor structure from two studies) and cross-cultural validity (yielding Persian, Brazilian, Indonesian, and Malay versions). Low-quality evidence was reported for test–retest reliability (positive ratings from three studies) and criterion validity (a positive rating from one study). Beyond the COSMIN list of measurement properties, minimal ceiling and floor effects were reported in one study (Table 4).

Frontal Assessment Battery (FAB): two studies [54, 69] evaluated this performance-based measure. FAB has six sub-tests, each scored on a Likert scale from 0 to 3, giving a total score of 0–18; a higher score reflects better performance. It takes only 10 min to administer but assesses only one domain (executive function) (Table 2).

High-quality evidence was reported for internal consistency (negative ratings from two studies) and hypothesis testing (positive ratings for good concurrent, convergent, and known-group validity from two studies). Moderate-quality evidence was reported for test–retest reliability (intermediate ratings from two studies), and low-quality evidence for cross-cultural validity [54]. Beyond the COSMIN list of measurement properties, very high inter-rater reliability was reported in one study (Table 4).

Hinting task: two studies [48, 61] addressed this performance-based measure of theory of mind. The Hinting task comprises 10 short sketches/stories assessing a person's ability to infer the intentions of the characters in the stories presented (Table 2).

Cross-cultural validity (n = 1/2) and comparison with other tests (n = 1/2) were evaluated for this measure. The cross-cultural validation of the Hinting task yielded the Brazilian version, rated as low-quality evidence; the task was also found to be the least difficult measure of theory of mind compared with the Reading the Mind in the Eyes Test (RMET) and the Faux Pas test (Table 4).

Clinical and research usefulness evaluation

We ranked the measures evaluated in PWS using the five criteria described in detail in the section “Criteria used to rank order the measures”. Summing the scores for each criterion, BACS ranked first, the Cognitive Assessment Interview (CAI) second, and the MATRICS Consensus Cognitive Battery (MCCB) and CogState Schizophrenia Battery (CSB) joint third. Considering performance-based and interview-based measures separately, BACS, MCCB, and CSB stood out as the top three performance-based measures, while CAI ranked first among interview-based measures, followed by the Schizophrenia Cognition Rating Scale (SCoRS) and the Self-Assessment Scale of Cognitive Complaints in Schizophrenia (SASCCS). See Table 5.

Table 5 Ranking of the measures evaluated only in people with schizophrenia in the included studies

Methodological quality

Of the ten domains of the COSMIN checklist, two (responsiveness to change and measurement error) were not reported in any of the included articles. We did not evaluate the patient-reported outcome development box (box 1), since this review focuses on adaptation rather than development studies and most measures included here are not freely available. Most of the included studies reported hypothesis testing (n = 25/28). For a summary of the methodological quality of the included studies, see Fig. 3.

Fig. 3
figure 3

Number of studies with a very good, adequate, doubtful, or inadequate quality rating for each measurement property addressed

The quality of the included studies ranged from very good to inadequate. Of the 19 studies rated for internal consistency, 13 were rated very good, five doubtful, and one inadequate; the doubtful ratings reflected minor methodological problems, and the inadequate rating was given because internal consistency was not calculated for each dimension of the scale. None of the studies that assessed test–retest reliability received a very good rating, because it was doubtful whether the time interval used was appropriate (5/10 doubtful ratings). All 10 studies that evaluated content validity were of doubtful quality, because the methodology followed was unclear (e.g., whether the group moderator was trained, a topic guide was used, or group meetings were recorded). Of the 10 studies that evaluated structural validity, one was of very good quality, four adequate, four doubtful, and one inadequate; the inadequate rating was due to a very small sample. Most of the included studies (n = 25) examined one form of hypothesis testing (i.e., concurrent, convergent, or known-group validity); most were rated very good (n = 18/25) and the remainder doubtful (n = 7/25), mainly because of minor methodological problems in sample selection. All studies that evaluated cross-cultural validity were rated doubtful (n = 15/15), mainly because multiple-group factor analysis was not performed, it was unclear whether the samples were similar, and the analytic approaches were not clearly described. Finally, four studies evaluated criterion validity: two of very good and two of doubtful quality. Figure 3 presents the methodological quality scores for each of the included studies. A detailed report of the quality rating for each study on each measurement property is given in online resource 4.

Discussion

There is limited availability of valid and reliable cognitive assessment and screening instruments for people with SMDs in LMICs. This is partly due to the limited adaptation and validation efforts in the literature. A first step to improving this situation is to systematically assess the current status of the literature and identify cognitive measures which are already validated and may be used and adapted for the assessment of cognition in people with SMDs in LMICs.

This review identified 28 studies and 23 independent cognitive measures. Most of the measures evaluated cognition in PWS in upper-middle-income countries such as Brazil and China. None of the studies was from a low-income setting, meaning there is no evidence about the psychometric properties of these measures in low-income countries. We found participants' education level to be high (on average 11 years of education). The majority of the studies had low methodological quality and were based on limited sample sizes; this was especially true of the cross-cultural validation and content validity studies.

According to the criteria we used for clinical and research usefulness, we recommend BACS, CAI, MCCB, and CSB as the most suitable measures to be adapted for the cognitive assessment of PWS in LMICs. Of these, BACS is the most frequently evaluated measure, with the most adaptations and reliable psychometric properties; in addition, it addresses comprehensive domains of neurocognition with a relatively short administration time. Since LMICs differ in cultural, linguistic, economic, and educational contexts, we recommend validation studies at different sites with the above measures as a starting point for the validation and adaptation of other measures. Studies focusing on norming of measures in LMICs would also be useful. Using and adapting a measure in a new context requires careful consideration of how the new version of the measure will interact with the new context and population [70,71,72]. Assessment of psychometric properties is the result of an interaction between the measure, the context, and the population. Adapting measures to LMICs requires a fundamental shift in each of these aspects, and therefore the literature and evidence on existing measures can have only limited value in informing the adaptation process [73, 74].

More than half of the measures included in this review (n = 13/22) do not assess social cognition, a domain usually found to be impaired in PWS. We recommend that clinicians and researchers in LMICs consider measures that include this domain, although differing social and contextual factors may make it challenging to develop comparable social cognition tests.

We did not find any studies conducted in low-income countries; the majority of the studies were from upper-middle-income countries (e.g., Brazil and China). This is an important finding, as it highlights a clear gap in the literature and in the availability of cognitive measures globally. The context in low-income countries differs from that in upper-middle-income and higher-income countries. For example, according to the World Bank, 67% of the total population in low-income countries are rural residents, who may not be literate [75]; overall, only 63% have basic literacy skills (the ability to read and write a simple sentence), and many may not be familiar with settings outside their local community. LMICs are diverse in educational status, culture, and language. With this in mind, it is important that local experts lead the adaptation of cognitive measures in LMICs, using the measures recommended in this review as a starting point.

The average educational level of PWSSD in the included studies was approximately 11 years, which suggests that the findings reported here may not translate directly to low-income settings, where overall literacy rates tend to be lower. Researchers need to consider the effects of education [76,77,78], culture [79, 80], and language [81] when deciding which measure to adapt and use in LMIC contexts.

It should be borne in mind that a low score on a cognitive test may not always reflect cognitive impairment, but simply a lack of familiarity with the material presented. LMICs are also diverse in cultural practice, which should be considered during the adaptation process. For example, one item of the SCoRS [82] requires the participant to assess how difficult it is for them to follow a television show. Answering this item has clear economic and cultural implications, and adapting it may require a fundamental rethink in relation to the setting. The other factor to consider is language. Again, LMICs are far from homogeneous in language knowledge and use. In many countries in Africa and Asia, multiple languages are spoken within a given country, and it may not be simple to define people's first language. Using a cognitive measure adapted in a different linguistic context may not be appropriate, and non-verbal cognitive measures may be preferred. This further emphasizes the question of how much context influences cognitive assessment [83]. The literature shows that differing results on a cognitive test can be due to variation in cultural interpretations, such as when a test has items or tasks that are familiar only in certain contexts [79].

This review has a number of strengths: we included any measure of cognition in people with SMDs, with no restriction on the domain of cognition evaluated. We also followed a rigorous protocol, which we preregistered in PROSPERO (https://www.crd.york.ac.uk/PROSPERO/), searched five databases without restriction on date of publication, and used a comprehensive quality assessment tool (the COSMIN criteria) [33].

However, our review has limitations. First, the broad scope of the review makes the data inappropriate for meta-analysis; our protocol deliberately allowed a wide variety of study outcomes in order to obtain a broad overview of the field, given the paucity of knowledge and the lack of prior systematic reviews. Second, the criteria we used to rank the measures have not been used previously (although we adapted them from previous reviews). Third, this review excluded non-English studies, which might limit the generalizability of the findings. Fourth, gray literature was not searched; however, we conducted forward and backward searching, which extended the included studies from 21 to 28. Readers should weigh these limitations when considering the generalizability of our review.

Reviewing and systematically assessing the psychometric properties of measures in this field is useful for researchers, clinicians, and policymakers in LMICs and beyond. Since LMICs are diverse in language, culture, and education, our recommendations may not work for every country and hence need to be contextualized. This review could help researchers select measures when planning studies, particularly adaptation studies. Other potential uses include guiding the choice of measures for longitudinal studies assessing change in cognition and for clinical trials of interventions aiming to improve cognition. In this review, we considered measurement properties such as test–retest reliability, learning effects, tolerability, and practicality, which are important in repeated assessment; researchers can therefore compare measures on these criteria when looking for the best measures for longitudinal studies and clinical trials. Clinicians in LMICs could use this review to compare different measures and choose the one that best suits their specific needs and context. Policymakers can use the results of this study to design prevention and treatment strategies regarding cognition in people with SMDs, such as developing guidelines, integrating routine assessment of cognition into clinical settings, and promoting research on the treatment of cognitive impairment. This review points clearly to a gap in the evidence for cognitive assessment for SMDs in LMICs, which may suggest a gap in the use of cognitive assessment in clinical practice and the need for adaptation and validation studies to make these tools available to services, clinicians, and service users.