Introduction

Psychometric instruments assessing behavioural dysfunctions (i.e. neuropsychiatric alterations within the affective, motivational, social and awareness dimensions) and functional outcomes (i.e. quality of life, functional independence and other aspects of physical status, e.g. sleep, pain or fatigue) in neurological, psychiatric and geriatric populations are relevant to clinical phenotyping, prognosis and intervention practice [21]. Indeed, besides aiding clinical diagnosis, behavioural/functional instruments (BFIs) are often used to estimate patients’ prognosis and are also adopted as clinical endpoints in interventional programs [21].

BFIs are typically delivered as either self- or proxy-report (i.e. caregiver or healthcare professional) questionnaires, and thus require sound psychometrics and diagnostics, as well as evidence of clinical usability in target populations [6]. However, it has been highlighted that BFIs often do not meet methodological-statistical requirements, both when developed de novo and when adapted from a different language and culture [60]. Of note, such methodological-statistical shortcomings have been identified as detrimentally influencing the level of recommendation of a given tool in both clinical practice and research [13], [42].

Given the abovementioned premises, and based on current knowledge on health measurement tools [58], this work aimed to assess the psychometrics, diagnostics and usability of BFIs currently available in Italy for neurological, geriatric and psychiatric populations, in order to (1) provide an up-to-date compendium of Italian BFIs designed for clinical and research aims in clinical populations and (2) deliver evidence-based information on the level of recommendation of Italian BFIs.

Methods

Search strategy

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were consulted [27]. This review was pre-registered on the International Prospective Register of Systematic Reviews (PROSPERO; ID: CRD42021295430; https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42021295430).

A systematic literature search was performed on December 1, 2021 (no date limit set), entering the following search string into the Scopus and PubMed databases: ( behavioural OR behavioral OR “quality of life” OR psychiatric OR psychopathological OR apathy OR depression OR anxiety OR qol OR mood OR “activities of daily living” OR “functional independence”) AND (validation OR validity OR standardization OR psychometric AND properties OR reliability OR version ) AND ( italian OR italy ) AND ( neurolog* OR neuropsych* OR cognitive ) AND ( questionnaire OR inventory OR tool OR instrument OR scale OR test OR interview OR checklist ). Fields of search were title, abstract and keywords for Scopus, and title and abstract for PubMed. Only peer-reviewed, full-text contributions written in English/Italian were considered. In addition, the reference lists of all relevant articles were hand-searched to identify further eligible studies.

Study eligibility criteria

Studies were evaluated for eligibility if they focused either on the psychometric/diagnostic/normative study of Italian or adapted-to-Italian BFIs or on their usability in healthy participants (HPs) and in patients with neurological/geriatric conditions or their proxies (e.g. caregivers). More specifically, eligible studies had to focus on (1) BFI psychometrics (i.e. validity and reliability), (2) diagnostics (i.e. intrinsic features, such as sensitivity and specificity, and post-test features, such as positive and negative predictive values and likelihood ratios) or (3) norm derivation. Studies that did not aim at providing normative data were included only if at least one property among validity, reliability and sensitivity/specificity (or related metrics) was assessed.

Conference proceedings, letters to the Editor, commentaries, animal studies, single-case studies, reviews/meta-analyses, abstracts, research protocols, qualitative studies, opinion papers and studies on paediatric populations were excluded.

Data collection and quality assessment

The screening stage was performed by two authors (E.N.A. and A.D.) and the eligibility stage by two other authors (G.M. and C.G.) via Rayyan (https://rayyan.qcri.org/welcome); both stages were supervised by another author (V.B.).

Data extraction was performed by four independent authors (S.M., G.S.D.T., V.B. and F.P.), whereas one independent author (E.N.A.) supervised this stage and checked the extracted data.

Extracted outcomes included (1) sample size, (2) sample representativeness (geographic coverage, exclusion criteria), (3) participants’ demographics, (4) instruments adaptation procedures, (5) administration time, (6) validity metrics, (7) reliability metrics (including significant change measures), (8) measures of sensitivity and specificity, (9) metrics derived from sensitivity and specificity, (10) norming methods and (11) other psychometric/diagnostic properties (e.g. accuracy, acceptability rate, assessment of ceiling/floor effects, ease of use).

Formal quality assessment was performed by four authors (S.M., G.S.D.T., V.B. and F.P.) and supervised by a further, independent one (E.N.A.). Quality assessment was performed for each BFI by developing two ad hoc checklists, the Behavioural and Functional Instrument Quality Assessment-Normative Sample (BFIQA-NS) and the Behavioural and Functional Instrument Quality Assessment-Clinical Population (BFIQA-CP) (Supplemental Material 1), both adapted from the Cognitive Screening Standardization Checklist (CSSC) [1]. Scores were cumulatively assigned to each BFI by evaluating all included studies addressing it. Some studies, although meeting the selection criteria, did not answer some of the questions included in the CSSC (e.g. diagnostic criteria for quality of life tools); in these cases, the relevant group of items was scored as “non-applicable” and this was accounted for, i.e. weighted, in the final score.

Both BFIQA-NS and BFIQA-CP total scores range from 0 to 50, and a given BFI was considered “statistically/methodologically sound” if it scored ≥25, i.e. 50% of the maximum. When more than one study focused on the same BFI in different populations, BFIQA scores were averaged (as both the BFIQA-NS and BFIQA-CP range from 0 to 50).
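As an illustration, the weighting of “non-applicable” items and the averaging across studies described above can be sketched as follows. This is a hypothetical reconstruction: the per-item maximum of 2 points is an assumption, as the actual BFIQA item rubric is detailed in Supplemental Material 1.

```python
# Hypothetical sketch of the weighting scheme described above: "non-applicable"
# items are dropped and the raw total is rescaled to the 0-50 range.
# The 2-point per-item maximum is an assumption, not the actual BFIQA rubric.

def bfiqa_score(item_scores, max_total=50, item_max=2):
    """Rescale the sum of applicable item scores to the 0-max_total range.

    item_scores: per-item scores, with None marking "non-applicable" items.
    """
    applicable = [s for s in item_scores if s is not None]
    if not applicable:
        raise ValueError("no applicable items")
    return max_total * sum(applicable) / (item_max * len(applicable))

def pooled_bfiqa(study_scores):
    """Average BFIQA scores obtained by different studies on the same BFI."""
    return sum(study_scores) / len(study_scores)

# A BFI scoring 2, 1, 2 on three applicable items (one item non-applicable)
score = bfiqa_score([2, 1, None, 2])       # rescaled to the 0-50 range
sound = pooled_bfiqa([score, 20.0]) >= 25  # averaged across two studies
```

The rescaling ensures that a BFI is not penalized for items that could not apply to it (e.g. diagnostic criteria for quality of life tools).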

Results

One hundred and eighteen studies were included; the study selection process according to PRISMA guidelines is shown in Fig. 1.

Fig. 1 Study selection process according to PRISMA guidelines. Adapted from Moher et al. (2009) (www.prisma-statement.org)

The included BFIs (N=102), along with a summary of their general clinical features and BFIQA scores, are detailed in Table 1, while the most relevant psychometric, diagnostic and usability evidence is described in Tables 2 and 3. The full reference list of included studies is provided in Supplementary File 2.

Table 1 Summary of general clinical features of behavioural/functional instruments and BFIQA scores
Table 2 Summary of main psychometric evidence
Table 3 Summary of main diagnostic and usability evidence

The most represented constructs assessed by the included BFIs were behavioural/psychiatric symptoms (apathy: N=6; anxiety: N=7; depression: N=11; general: N=13; other: N=11), quality of life (QoL; N=14) and physical status (activities of daily living/functional independence: N=16; other: N=10). Multidimensional BFIs (N=11) included behavioural, QoL and physical constructs. Forty-one BFIs were self-report, 21 were caregiver-report, and the remaining ones clinician-report.

The vast majority of studies (N=109) aimed at providing psychometric, diagnostic or normative data in clinical populations, of which 16 also addressed normotypical samples. The most represented neurological conditions were those of a degenerative/dysimmune etiology: multiple sclerosis (N=13), amyotrophic lateral sclerosis (N=6) and Parkinson’s disease (N=10). Dementia was addressed in 17 BFIs (Alzheimer’s disease: N=12; vascular dementia: N=3; frontotemporal dementia: N=1; Lewy body disease: N=1). Acute cerebrovascular accidents and traumatic brain injury were addressed in 5 and 3 BFIs, respectively. Nonspecific geriatric populations (see footnote 1 for the available details on inclusion criteria) were addressed in 5 BFIs, and mild cognitive impairment in 8. Psychiatric populations were addressed in 4 BFIs (major depressive disorder: N=1; schizophrenia spectrum disorders: N=1; other: N=2). Three BFIs specifically addressed healthy populations. Validity was investigated for 85 BFIs, mostly via convergence (N=58) and divergence (N=34). Criterion validity was assessed for 31 BFIs, and content validity for 3. Ecological validity was assessed in only 3 studies. The factorial structure underlying BFIs was examined by means of dimensionality-reduction approaches for 34 BFIs.

Reliability was investigated for 80 BFIs and mostly as internal consistency (N=64), test-retest (N=39) and inter-rater (N=25). Parallel forms were developed for one BFI only.

Item response theory (IRT) analyses were carried out for 9 BFIs only.

Among BFIs for which diagnostic properties could be computed (N=79), sensitivity and specificity measures were reported for 25 tools, whereas derived metrics such as predictive values and likelihood ratios were reported for 16. With respect to norming, when applicable (N=79), norms were derived through receiver-operating characteristics (ROC) analyses for 21 BFIs, while other methods (e.g. percentiles or z-scores) were adopted to derive cut-offs in another 31 studies. Diagnostic accuracy was tested in 21 studies.

As to feasibility, back-translation was performed for 57 BFIs; ease of use was assessed in 12, and ceiling/floor effects in 22. Strikingly, administration time was explicitly reported for very few BFIs (N=18).

Discussion

Overview

The present review provides Italian clinicians and researchers with a comprehensive, up-to-date compendium of available BFIs, along with information on their psychometrics, diagnostics and clinical usability. This work was designed to serve as a guide not only to practitioners selecting the appropriate tool for a given clinical question but also to researchers involved in clinical psychometrics as applied to neurology and geriatrics. With a view to raising awareness of the statistical-methodological standards that such instruments are expected to meet, the checklists delivered herewith (BFIQA) will hopefully come in handy for orienting both the development and the psychometric/diagnostic/usability study of BFIs. Indeed, at variance with the literature on diagnostic test accuracy as applied to performance-based psychometric instruments [26], existing guidelines for BFIs mostly focus on psychometrics while lacking thorough sections specifically devoted to diagnostics and clinical usability [30]. Although each of the BFIs included in this study can undoubtedly be recognized for its peculiarities and usefulness in research and clinical contexts, as to the level of recommendation assessed by the BFIQA, it has to be noted that 63.5% of those addressing clinical populations (N=96) fell below the pre-established cut-off of 25 (i.e. half of the full range of the scale).
More specifically, the following BFIs addressed to clinical populations reached a BFIQA score ≥25: ALS Depression Inventory [36], Apathy Evaluation Scale–Self Report version [44, 51], Anosognosia Questionnaire Dementia [17], Bedford Alzheimer Nursing Severity Scale [4], Beaumont Behavioural Inventory [20], Beck Depression Inventory-II [11, 64], Care-Related Quality of Life [63], Coop/Wonca [37], Disability Assessment Dementia Scale [12], Dimensional Apathy Scale [46, 53, 54], Dual-Task Impact in Daily-living Activities Questionnaire [38], Electronic format of Multiple Sclerosis Quality of Life-29 [48], Epilepsy-Quality of Life [40], Frontal Behavioural Inventory [2, 29], Geriatric Handicap Scale [62], Hospital Anxiety and Depression Scale [31], Hamilton Depression Rating Scale [33, 43, 45], Medical Outcomes Study-Human Immunodeficiency Virus [59], Multiple Sclerosis Quality of Life-29 [47], Multiple Sclerosis Quality of Life-54 [56], Neuropsychiatric Inventory–Nursing Home version [3], Non-Motor Symptoms Scale for Parkinson's disease [9], Non-Communicative Patient’s Pain Assessment Instrument [15], Observer-Rated version of the Parkinson Anxiety Scale [53], Pain Assessment in Advanced Dementia [8, 32], Progressive Supranuclear Palsy–Quality of Life [41], Quality of Life in Alzheimer’s Disease [5], Quality of Life in the Dysarthric Speaker [39], Quality of Life after Brain Injury [16, 19], Stroke Impact Scale 3.0 [61] and State-Trait Anxiety Inventory [24, 53, 55]. Moreover, of those also or exclusively referred to normative populations (N=22), only 4 were classified above the same cut-off (Table 1): Beaumont Behavioural Inventory [20], Barratt Impulsiveness Scale [28], Dimensional Apathy Scale [46, 53, 54] and Starkstein Apathy Scale [18]. Although an admittedly empirical methodology was adopted for quality assessment, these findings should warn practitioners about possible statistical and methodological shortcomings of several available BFIs.
In this respect, several issues have been highlighted as to psychometrics, diagnostics and clinical usability of BFIs.

Psychometrics

About two-thirds of all instruments came with basic validity evidence.

However, as far as validity is concerned, its assessment was often based on convergence/divergence, whereas criterion validity was only seldom examined. In this regard, it was not uncommon for criterion validity to have been tested via correlational, rather than regression, analyses, the latter being the proper ones for testing such a property. Indeed, although the two approaches are mathematically related, correlations are non-directional techniques solely intended to determine whether variables synchronously covary, while regressions allow one to test whether a first variable, attributed the status of a predictor, accounts for the variability of a second one, addressed as a criterion.
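The distinction can be made concrete with a minimal sketch on simulated data (NumPy only; the BFI and criterion below are purely illustrative): for a single predictor, the squared correlation equals the regression R², yet only the regression frames the BFI as a predictor accounting for criterion variance.

```python
import numpy as np

# Simulated, purely illustrative data: a hypothetical BFI score and an
# external gold-standard criterion (e.g. a clinician rating).
rng = np.random.default_rng(0)
criterion = rng.normal(50, 10, 200)
bfi_score = 0.8 * criterion + rng.normal(0, 5, 200)

# Correlation: non-directional -- only tells us the variables covary.
r = np.corrcoef(bfi_score, criterion)[0, 1]

# Regression: directional -- the BFI is cast as a predictor whose slope
# and R^2 quantify how much criterion variance it accounts for.
slope, intercept = np.polyfit(bfi_score, criterion, 1)
residuals = criterion - (slope * bfi_score + intercept)
r2 = 1 - residuals.var() / criterion.var()
```

With one predictor, `r2 == r**2` exactly; the added value of regression lies in the predictor-criterion framing (and, with multiple predictors, in partialling out covariates), not in a different number.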

In this respect, ecological validity, which is testable through correlational analyses and predictive models, was also infrequently investigated, raising the issue of whether certain BFIs effectively reflect functional outcomes in daily life. Moreover, it is striking that factorial structure was explored in only 34 BFIs, although such analysis appears to be fundamental, especially for questionnaires [58]. Finally, content validity was almost never addressed: although content validity can be difficult to assess for some BFIs (e.g. multi-domain instruments), our results strongly suggest the necessity of testing this property by collecting expert ratings on the goodness of the operationalization of the target construct. We encourage this practice, as it would provide practitioners with useful information about the target construct.

As to reliability, about 80% of BFIs come with such data. However, it is unfortunate to note that inter-rater agreement measures were lacking for 44 proxy-report BFIs, which are known to be highly subject to heterogeneity in score attribution from examiner to examiner, also in view of their different backgrounds (e.g. neurologists vs. psychologists). It should also be noted that assessing inter-rater reliability in self-report BFIs is possible, albeit methodologically complex, as evidenced by the fact that this feature was almost never assessed within the included self-report BFIs. This aim could be reached, for instance, by evaluating the rate of agreement between a below- vs. above-cut-off classification delivered by the target BFI and that yielded by another one measuring the same construct (e.g. presence vs. absence of apathetic features). Indeed, if one considers that a below- vs. above-cut-off classification represents a standardized clinical judgment provided by the instrument, then such a scenario is comparable to two clinicians (i.e. raters) evaluating a given clinical sign.
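A minimal sketch of this suggestion, using entirely hypothetical scores and cut-offs from two fictitious apathy scales: each instrument's below- vs. above-cut-off classification is treated as a "rater", and their agreement is quantified with Cohen's kappa.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two binary classification vectors (0/1)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    pe = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)          # chance agreement
    return (po - pe) / (1 - pe)

def classify(scores, cutoff):
    """1 = above cut-off (e.g. 'apathetic'), 0 = below."""
    return [int(s > cutoff) for s in scores]

# Hypothetical scores from two self-report apathy scales on the same patients,
# each with its own (hypothetical) clinical cut-off.
scale_a = [10, 22, 31, 8, 27, 15, 29, 12]
scale_b = [35, 48, 61, 30, 55, 40, 58, 33]
kappa = cohen_kappa(classify(scale_a, cutoff=20),
                    classify(scale_b, cutoff=45))
```

Kappa corrects raw agreement for chance, exactly as it would when comparing two human raters scoring the same clinical sign.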

Moreover, parallel forms of the included BFIs were almost never provided, which limits to an extent their usage in longitudinal applications. Although the development of parallel forms appears to be more relevant to performance-based instruments, practice effects cannot be ruled out in questionnaires either, especially short ones, which are likely to be remembered by the examinee [58].

Finally, it should be noted that, outside the framework of classical test theory, IRT analyses were almost never performed, despite their potentially providing relevant insights into the interpretation of BFI scores. In fact, while looking at total scores is crucial for drawing clinical judgments, single item-level information would help clinicians orient themselves towards a given diagnostic hypothesis, possibly also providing relevant prognostic information, albeit at a qualitative level. In this respect, data on item discrimination, i.e. an IRT parameter quantifying how well a given item discriminates between different levels of the underlying trait, and thus the extent to which it is informative, would allow examiners to address responses to such items with greater attention. For instance, within a BFI assessing dysexecutive behavioural features, an item on the development of a sweet tooth (for instance, following the onset of a neurodegenerative condition) might prove highly informative towards the diagnosis of a frontal disorder. By contrast, within the same tool, items targeting depressive symptoms might be less informative towards such a behavioural syndrome, as they are common to different brain disorders.
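To make the notion of discrimination concrete, here is a sketch under the two-parameter logistic (2PL) model with hypothetical parameter values: the higher the discrimination parameter a, the more sharply the probability of endorsing the item separates low from high levels of the latent trait.

```python
import math

def p_endorse_2pl(theta, a, b):
    """2PL item response function: probability of endorsing an item given
    latent trait level theta, discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical items from a dysexecutive-behaviour BFI:
# a highly discriminative item (e.g. new-onset sweet tooth) ...
high_disc = [p_endorse_2pl(t, a=2.5, b=0.0) for t in (-1.0, 0.0, 1.0)]
# ... vs a weakly discriminative one (e.g. depressive symptoms,
# common to many brain disorders)
low_disc = [p_endorse_2pl(t, a=0.5, b=0.0) for t in (-1.0, 0.0, 1.0)]
```

The steep curve (a = 2.5) changes endorsement probability dramatically across trait levels, so a positive response carries much diagnostic weight; the flat curve (a = 0.5) barely separates patients from non-patients.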

With that said, since practitioners and clinical researchers most of the time look at the global score yielded by a given BFI, a further useful output of potential IRT analyses is the test information function, which describes the overall informativity of the BFI across the underlying levels of the target construct. For instance, a BFI aimed at measuring apathy that proves mostly informative for individuals with high levels of the underlying construct (i.e. marked apathetic features) should be used with caution when assessing patients who do not display overt symptoms (and are thus unlikely to suffer from severe apathy), since it may yield false-negative results.

Diagnostics

It is unfortunate to note that, among the BFIs for which diagnostics could be computed and norms derived, such data were lacking for about one-third; this rate dropped further for non-intrinsic diagnostics (i.e. predictive values and likelihood ratios). This represents a major drawback for the clinical usability of certain BFIs as tools intended to convey diagnostic information. Diagnostic properties and norms should undoubtedly be addressed more accurately in future studies aimed at developing and standardizing BFIs. In this respect, researchers devoted to such scopes should note that diagnostic and normative investigations do not necessarily overlap. For instance, ROC analyses allow one both to derive a cut-off and to provide intrinsic/post-test diagnostics, but may also be used for the latter aim only. Moreover, norms can be derived through approaches other than ROC analyses, e.g. by means of z-based, percentile-based or regression-based techniques.

As to the derivation of cut-offs via ROC analyses, an advisable practice would be to provide different values based on different trade-offs between sensitivity and specificity. This would not only allow clinicians to be adaptive in selecting the most suitable cut-off values based on whether they intend to favour the sensitivity or the specificity of a given BFI, but also help clinical researchers identify an adequate threshold value for inclusion/exclusion purposes in research settings. Indeed, when including a given deficit as an exclusion criterion for recruitment, researchers might prefer stricter cut-offs, as they guarantee higher specificity and hence fewer false positives.
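The practice of reporting multiple cut-offs can be sketched as follows (the scores are entirely hypothetical): each candidate cut-off is paired with its sensitivity/specificity trade-off, letting the reader favour sensitivity (e.g. screening) or specificity (e.g. exclusion criteria in research).

```python
def sens_spec(patient_scores, control_scores, cutoff):
    """Sensitivity/specificity of the rule 'score >= cutoff -> positive'."""
    sens = sum(s >= cutoff for s in patient_scores) / len(patient_scores)
    spec = sum(s < cutoff for s in control_scores) / len(control_scores)
    return sens, spec

# Hypothetical BFI scores in patients and healthy controls
patients = [18, 22, 25, 19, 30, 27, 21, 24]
controls = [10, 14, 12, 17, 9, 15, 13, 20]

# Report several candidate cut-offs with their trade-offs,
# rather than a single "optimal" value
table = {c: sens_spec(patients, controls, c) for c in (16, 18, 21)}
```

In this toy example the lenient cut-off (16) maximizes sensitivity at the cost of false positives, whereas the strict one (21) reaches perfect specificity while missing a quarter of patients: precisely the trade-off a published table of cut-offs would make explicit.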

Finally, on the notion of “disease-specificity,” it would be reasonable not to limit the application of certain disease-specific BFIs to the clinical population(s) they were originally addressed to. For instance, questionnaires designed to assess depression in amyotrophic lateral sclerosis (ALS) by overcoming disability-related confounders [36] might as well be applied to other motor conditions (e.g. extra-pyramidal disorders, multiple sclerosis). Similarly, tools assessing dysexecutive-like behavioural changes in ALS [20] might come in handy for detecting such disturbances in other neurological conditions known to affect frontal networks (e.g. Huntington’s disease). This proposal arises from the consideration that common phenotypic manifestations may be underpinned by different pathophysiological processes; therefore, an extended application of disease-specific BFIs should occur only when such an assumption is met. Moreover, such “off-label” adoptions would undoubtedly require studies supporting the feasibility of these disease-specific BFIs in the desired populations.

Usability

Despite it being widely accepted that back-translation is required when adapting a given BFI to a new (target) language, very few BFIs appeared to have undergone such a procedure, and information on BFI adaptation was often lacking. This finding is in line with the notion that statistical and methodological deficiencies of psychometric instruments derive especially from cross-cultural adaptation frameworks [60].

Moreover, data on possible ceiling and/or floor effects were often unreported, preventing clinicians and researchers from evaluating whether a given BFI can be deemed suitable for a target clinical or non-clinical population. For instance, a BFI assessing behavioural disorders that presents a relevant ceiling/floor effect might be scarcely informative if administered with the aim of detecting sub-clinical alterations. The issue is even more relevant for BFIs addressed to clinical populations: while ceiling/floor effects might be expected in normotypical individuals if a given BFI is aimed at detecting clearly clinical symptoms (e.g. neuropsychiatric manifestations within the dysexecutive spectrum), the same does not apply to clinical populations known to present with such features (e.g. patients with frontal lobe damage). In other terms, a BFI yielding ceiling/floor effects in diseased populations is likely to be poorly usable at a clinical level. It follows that the assessment of ceiling/floor effects is most relevant when exploring the clinical usability of BFIs.
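Checking for ceiling/floor effects is straightforward, as the minimal sketch below shows on hypothetical data; the ~15% threshold used to flag an effect is a common rule of thumb rather than a fixed standard.

```python
def ceiling_floor_rates(scores, minimum, maximum):
    """Proportion of respondents at the scale's extremes. Rates above
    roughly 15% are conventionally taken to flag ceiling/floor effects."""
    n = len(scores)
    floor = sum(s == minimum for s in scores) / n
    ceiling = sum(s == maximum for s in scores) / n
    return floor, ceiling

# Hypothetical 0-30 behavioural scale administered to a patient sample
patient_scores = [30, 30, 28, 30, 27, 30, 30, 25, 30, 30]
floor_rate, ceiling_rate = ceiling_floor_rates(patient_scores, 0, 30)
# 70% of patients score at the maximum: the scale cannot separate
# severity levels in this population, limiting its clinical usability
```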

As for ease of use, researchers devoted to the development and psychometric/diagnostic/usability study of BFIs are encouraged to assess how difficult a questionnaire is to administer, score and interpret from the examiner’s standpoint, as well as to understand and complete from the examinee’s standpoint. The vast majority of the tools included in the present review did not come with such information.

The time required by BFIs was also frequently found to be lacking from reports, although this information is undoubtedly needed to determine whether a given tool is suitable for the target setting. For instance, not all BFIs might be adequate for bedside administration, some being relatively long and thus scarcely appropriate for time-restricted settings. Similarly, time requirements may differ depending on whether the setting is in-patient vs. out-patient.

Further suggestions for researchers

A number of further elements, not explicitly addressed earlier in this work, are listed herewith in order to help researchers devoted to BFI development and psychometric/diagnostic/usability study.

First, IRT analyses can also be useful, within the development of either a novel BFI or a shortened version of an existing one, for selecting items that adequately measure the target construct [14]. To such aims, IRT can also be complemented with classical test theory approaches in order to identify, through an empirical, data-driven approach, a set of criteria that an item must meet to be included in a given BFI under development, as recently proposed within the Italian scenario [28].

Second, the a priori estimation of an adequate sample size for the main target analyses of a psychometric/diagnostic/usability study of a given BFI is advisable. In this respect, a number of studies are available that suggest optimal, either empirical or simulation-based, sample size estimation procedures for, e.g. validity and reliability analyses [23], dimensionality-reduction techniques [22], ROC analyses [35], IRT analyses [50] and regression-based norming [49]. An a posteriori evaluation of the robustness of normative data can also be performed, as suggested by Crawford and Garthwaite [10].

Finally, researchers focused on BFI development and psychometric/diagnostic/usability study have to be aware of procedures for handling missing data according to their categorization (e.g. at-random vs. not-at-random missing values) [34]. This is particularly relevant when administering several tools within the same data collection session, especially to patients: indeed, participants might not agree or be able to complete the full range of instruments included in a study protocol, e.g. due to fatigue.
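A deliberately simple sketch of missingness-aware handling, assuming that sporadic item-level omissions (plausibly at random) can be person-mean imputed, whereas heavily incomplete questionnaires (plausibly not at random, e.g. abandoned due to fatigue) are flagged rather than imputed; the 20% threshold is an arbitrary illustrative choice, not a recommendation from the cited literature.

```python
def impute_person_mean(responses, max_missing=0.2):
    """Impute sporadic missing items (None) with the person's mean of the
    observed items, but refuse to impute when too much of the questionnaire
    is missing (plausibly not-at-random, e.g. abandonment due to fatigue)."""
    missing = [i for i, r in enumerate(responses) if r is None]
    if len(missing) / len(responses) > max_missing:
        return None                     # too incomplete: flag, do not impute
    observed = [r for r in responses if r is not None]
    mean = sum(observed) / len(observed)
    return [mean if r is None else r for r in responses]

# Two sporadic omissions out of ten items: imputed
completed = impute_person_mean([2, 3, None, 3, 2, None, 3, 2, 3, 3])
# Questionnaire abandoned after the first item: flagged, not imputed
abandoned = impute_person_mean([1, None, None, None])
```

Real protocols would of course rely on the categorization and techniques discussed in [34] (e.g. multiple imputation); the point here is only that the handling rule should depend on the presumed missingness mechanism.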

Conclusions

The present work provides practitioners with an up-to-date compendium of the BFIs available in Italy, highlights some possible criticisms of their properties, and delivers hopefully useful insights into best-practice guidelines. To this last aim, it is believed that the BFIQA scales provided herewith may serve as a blueprint for researchers to carefully consider the relevant aspects associated with the development and psychometric/diagnostic/usability study of BFIs, in order to strengthen the level of recommendation for their use in clinical practice across diagnostic, prognostic and interventional settings.