Psychometrics, diagnostics and usability of Italian tools assessing behavioural and functional outcomes in neurological, geriatric and psychiatric disorders: a systematic review

Background Psychometric instruments assessing behavioural and functional outcomes (BFIs) in neurological, geriatric and psychiatric populations are relevant to diagnostics, prognosis and intervention. However, BFIs often fail to meet methodological-statistical standards, which lowers their level of recommendation in clinical practice and research. This work thus aimed at (1) providing an up-to-date compendium on the psychometrics, diagnostics and usability of available Italian BFIs and (2) delivering evidence-based information on their level of recommendation. Methods This review was pre-registered (PROSPERO ID: CRD42021295430) and performed according to PRISMA guidelines. Several psychometric, diagnostic and usability measures were addressed as outcomes. Quality assessment was performed via an ad hoc checklist, the Behavioural and Functional Instrument Quality Assessment. Results Out of an initial N = 830 reports, 108 studies were included (N = 102 BFIs). Target constructs included behavioural/psychiatric symptoms, quality of life and physical functioning. BFIs were either self- or caregiver-/clinician-report. Studies in clinical conditions (including neurological, psychiatric and geriatric ones) were the most represented. Validity was investigated for 85 BFIs and reliability for 80. Criterion and factorial validity testing were infrequent, whereas content and ecological validity and parallel forms were almost never addressed. Item response theory analyses were seldom carried out. Diagnostics and norms were lacking for about one-third of BFIs. Information on administration time, ease of use and ceiling/floor effects was often unreported. Discussion Several available BFIs for the Italian population do not meet adequate statistical-methodological standards, a finding that should prompt greater care from researchers involved in their development. Supplementary Information The online version contains supplementary material available at 10.1007/s10072-022-06300-8.


Introduction
Psychometric instruments assessing behavioural dysfunctions (i.e. neuropsychiatric alterations within the affective, motivational, social and awareness dimensions) and functional outcomes (i.e. quality of life, functional independence and other aspects of physical status, e.g. sleep, pain or fatigue) in neurological, psychiatric and geriatric populations are relevant to clinical phenotyping, prognosis and intervention practice [21]. Indeed, besides aiding clinical diagnosis, behavioural/functional instruments (BFIs) are often regarded as relevant to estimating patients' prognosis, and are also adopted as clinical endpoints in interventional programs [21].
BFIs are often either self- or proxy-report (i.e. caregiver or healthcare professional) questionnaires, thus requiring sound psychometrics and diagnostics, as well as evidence of clinical usability in target populations [6]. However, it has been highlighted that BFIs often do not meet methodological-statistical requirements, both when developed de novo and when adapted from a different language.
Edoardo Nicolò Aiello, Alfonsina D'Iorio and Sonia Montemurro contributed equally.

Data collection and quality assessment
The screening stage was performed by two authors (E.N.A. and A.D.) and the eligibility stage by two other authors (G.M. and C.G.) via Rayyan (https://rayyan.qcri.org/welcome); both stages were supervised by another author (V.B.).
Data extraction was performed by four independent authors (S.M., G.S.D.T., V.B. and F.P.), whereas one independent author (E.N.A.) supervised this stage and checked the extracted data.
Formal quality assessment was performed by four authors (S.M., G.S.D.T., V.B. and F.P.) and supervised by a further, independent one (E.N.A.). Quality assessment was performed for each BFI via two ad hoc checklists, the Behavioural and Functional Instrument Quality Assessment-Normative Sample (BFIQA-NS) and the Behavioural and Functional Instrument Quality Assessment-Clinical Population (BFIQA-CP) (Supplemental Material 1), which were adapted from the Cognitive Screening Standardization Checklist (CSSC) [1]. Scores were cumulatively assigned to each BFI by evaluating all included studies addressing it. Some studies, although meeting the selection criteria, did not answer some of the questions included in the CSSC (e.g. diagnostic criteria for quality-of-life tools). In these cases, the relevant group of items was scored as "non-applicable", and this was accounted for, i.e. weighted, in the final score.
Both BFIQA-NS and BFIQA-CP total scores range from 0 to 50, and a given BFI was considered statistically/methodologically sound if it scored ≥25, i.e. 50% of the maximum. When more than one study focused on the same BFI in different populations, BFIQA scores were averaged (which is possible since both checklists share the same 0-50 range).

Results
One hundred and eighteen studies were included; the study selection process, performed according to PRISMA guidelines, is shown in Figure 1.
The included BFIs (N=102), along with a summary of their general clinical features and BFIQA scores, are detailed in Table 1, while the most relevant psychometric, diagnostic and usability evidence is described in Table 2 and Table 3. The full reference list of the included studies is provided in Supplementary File 2.
The vast majority of studies (N=109) aimed at providing psychometric, diagnostic or normative data in clinical populations, of which 16 also addressed normotypical samples. The most represented neurological conditions were those with a degenerative/dysimmune etiology: multiple sclerosis (N=13), amyotrophic lateral sclerosis (N=6) and Parkinson's disease (N=10). Dementia was addressed in 17 BFIs (Alzheimer's disease: N=12; vascular dementia: N=3; frontotemporal dementia: N=1; Lewy body disease: N=1). Acute cerebrovascular accidents and traumatic brain injury were addressed in 5 and 3 BFIs, respectively. Nonspecific geriatric populations (see footnote 1 for the available details on inclusion criteria) were addressed in 5 BFIs. Item response theory (IRT) analyses were carried out for 9 BFIs only.
Among the BFIs for which diagnostic properties could be computed (N=79), sensitivity and specificity measures were reported for 25 tools, whereas derived metrics such as predictive values and likelihood ratios were reported for 16. With respect to norming, when applicable (N=79), 21 BFIs derived norms through receiver-operating characteristics (ROC) analyses, while other methods (e.g. percentiles or z-scores) were adopted to derive cut-offs in another 31 studies. Diagnostic accuracy was tested in 21 studies.
As to feasibility, back-translation was performed for 57 BFIs; ease of use was assessed for 12, and ceiling/floor effects for 22. Strikingly, administration time was explicitly reported for very few BFIs (N=18).

Overview
The present review provides Italian clinicians and researchers with a comprehensive, up-to-date compendium of available BFIs, along with information on their psychometrics, diagnostics and clinical usability. This work was designed not only to guide practitioners in selecting the appropriate tool for a given clinical question but also to support researchers involved in clinical psychometrics as applied to neurology and geriatrics. With a view to raising awareness of the statistical-methodological standards that such instruments are expected to meet, the checklists delivered herewith (BFIQA) will hopefully prove useful in orienting both the development and the psychometric/diagnostic/usability study of BFIs. Indeed, at variance with the literature on diagnostic test accuracy as applied to performance-based psychometric instruments [26], existing guidelines for BFIs mostly focus on psychometrics while lacking thorough sections specifically devoted to diagnostics and clinical usability [30]. Although each of the BFIs included in this study can undoubtedly be recognized for its peculiarities and usefulness in research and clinical contexts, it has to be noted that, as to the level of recommendation assessed via the BFIQA, 63.5% of those addressing clinical populations (N=96) fell below the pre-established cut-off of 25 (i.e. half of the full range of the scale).
More specifically, the following BFIs addressed to clinical populations reached a BFIQA score ≥25: ALS Depression Inventory [36], Apathy Evaluation Scale-Self Report version [44,51], Anosognosia Questionnaire Dementia [17], Bedford Alzheimer Nursing Severity Scale [4], Beaumont Behavioural Inventory [20], Beck Depression Inventory-II [11,64], Care-Related Quality of Life [63], Coop/Wonca [37], Disability Assessment Dementia Scale [12], Dimensional Apathy Scale [46,53,54], Dual-Task Impact in Daily-living Activities Questionnaire [38], Electronic format of Multiple Sclerosis Quality of Life-29 [48], Epilepsy-Quality of Life [40], Frontal Behavioural Inventory [2,29], Geriatric Handicap Scale [62], Hospital Anxiety and Depression Scale [31], Hamilton Depression Rating Scale [33,43,45], Medical Outcomes Study-Human Immunodeficiency Virus [59], Multiple Sclerosis Quality of Life-29 [47], Multiple Sclerosis Quality of Life-54 [56], Neuropsychiatric Inventory-Nursing Home version [3], Non-Motor Symptoms Scale for Parkinson's disease [9], Non-Communicative Patient's Pain Assessment Instrument [15], Observer-Rated version of the Parkinson Anxiety Scale [53], Pain Assessment in Advanced Dementia [8,32], Progressive Supranuclear Palsy-Quality of Life [41], Quality of Life in Alzheimer's Disease [5], Quality of Life in the Dysarthric Speaker [39], Quality of Life after Brain Injury [16,19], Stroke Impact Scale 3.0 [61] and State-Trait Anxiety Inventory [24,53,55]. Moreover, of those also or exclusively referring to normative populations (N=22), only 4 were classified above the same cut-off (Table 1): Beaumont Behavioural Inventory [20], Barratt Impulsiveness Scale [28], Dimensional Apathy Scale [46,53,54] and Starkstein Apathy Scale [18]. Although the methodology adopted for quality assessment is a specific, empirical one, such findings should warn practitioners about possible statistical and methodological shortcomings of several available BFIs.
In this respect, several issues have been highlighted as to psychometrics, diagnostics and clinical usability of BFIs.

Psychometrics
About two-thirds of all instruments were characterized by basic validity evidence.
However, as far as validity is concerned, its assessment was often based on convergence/divergence, whereas criterion validity was only seldom examined. In this regard, it was not uncommon for criterion validity to be tested via correlational, instead of regression, analyses, the latter being the proper ones for testing such a property. Indeed, although the two approaches are mathematically related, correlations are non-directional techniques solely intended to determine whether variables synchronously covary, whereas regressions allow one to test whether a first variable, attributed the status of a predictor, is able to account for the variability of a second one, addressed as a criterion.
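The distinction above can be illustrated numerically. The following minimal sketch, using simulated data (all variable names and values are illustrative, not drawn from any included study), contrasts the non-directional correlation coefficient with a simple regression of a hypothetical external criterion on a BFI score:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: BFI scores (treated as the predictor) and an external
# criterion, e.g. clinician-rated severity; both are simulated here.
bfi_score = rng.normal(20, 5, size=100)
criterion = 0.8 * bfi_score + rng.normal(0, 3, size=100)

# Correlation: non-directional, only quantifies synchronous covariation.
r = np.corrcoef(bfi_score, criterion)[0, 1]

# Simple linear regression: directional, treats the BFI as a predictor of
# the criterion and yields a slope/intercept plus explained variance.
slope, intercept = np.polyfit(bfi_score, criterion, deg=1)
predicted = slope * bfi_score + intercept
ss_res = np.sum((criterion - predicted) ** 2)
ss_tot = np.sum((criterion - criterion.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"r = {r:.2f}, slope = {slope:.2f}, R^2 = {r_squared:.2f}")
```

In the single-predictor case the two approaches coincide numerically (R² equals r²); the advantage of the regression framework emerges once the criterion is modelled as a function of the predictor, e.g. with covariates added.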
In this respect, ecological validity, which is testable through correlational analyses and predictive models, was also infrequently investigated, raising the issue of whether certain BFIs effectively reflect functional outcomes in daily life. Moreover, it is striking that the factorial structure was explored for 34 BFIs only, albeit such an analysis appears fundamental, especially for questionnaires [58]. Finally, content validity was almost never addressed: although content validity can be difficult to assess for some BFIs (e.g. multi-domain instruments), our results strongly suggest the necessity of testing this property by collecting expert ratings on the goodness of the operationalization of the target construct. We encourage this practice, as it would provide practitioners with useful information about the target construct.
As to reliability, about 80% of BFIs come with such data. However, it is unfortunate to note that inter-rater agreement measures were lacking for 44 proxy-report BFIs, which are known to be highly subject to heterogeneity in score attribution from examiner to examiner, also considering their different backgrounds (e.g. neurologists vs. psychologists). In this respect, it should also be noted that assessing inter-rater reliability in self-report BFIs is possible, albeit methodologically complex, as evidenced by the fact that this feature was almost never assessed within the included self-report BFIs. This aim could be reached, for instance, by evaluating the rate of agreement between the below- vs. above-cut-off classification delivered by the target BFI and that yielded by another one measuring the same construct (e.g. presence vs. absence of apathetic features). Indeed, if one considers that a below- vs. above-cut-off classification reflects a standardized clinical judgment provided by the instrument, then such a scenario is comparable, for instance, to two clinicians (i.e. raters) evaluating a given clinical sign.
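The agreement-between-classifications approach sketched above can be quantified, for instance, with Cohen's kappa, a chance-corrected agreement index. The sketch below is purely illustrative: the two "BFIs" are simulated binary classifications sharing a latent trait, not real instruments.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical binary classifications (0 = below cut-off, 1 = above) from
# two BFIs assumed to measure the same construct; data are simulated from
# a shared latent trait plus instrument-specific noise.
n = 200
latent = rng.normal(size=n)
bfi_a = (latent + rng.normal(0, 0.5, n) > 0).astype(int)
bfi_b = (latent + rng.normal(0, 0.5, n) > 0).astype(int)

def cohens_kappa(x, y):
    """Chance-corrected agreement between two binary classifications."""
    p_obs = np.mean(x == y)
    # Expected agreement if the two classifications were independent
    p_both_yes = x.mean() * y.mean()
    p_both_no = (1 - x.mean()) * (1 - y.mean())
    p_exp = p_both_yes + p_both_no
    return (p_obs - p_exp) / (1 - p_exp)

kappa = cohens_kappa(bfi_a, bfi_b)
print(f"kappa = {kappa:.2f}")
```

Kappa ranges from -1 to 1, equals 0 for chance-level agreement, and 1 for perfect agreement, which makes it directly interpretable as a reliability index for the two-raters analogy drawn above.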
Moreover, parallel forms of the included BFIs were almost never provided, limiting their usage for longitudinal applications to an extent. Although the development of parallel forms appears more relevant to performance-based instruments, practice effects cannot be ruled out in questionnaires either, especially brief ones, which are likely to be remembered by the examinee [58].
Finally, it should be noted that, outside the framework of classical test theory, IRT analyses were almost never performed, despite their potential to provide relevant insights into the interpretation of BFI scores. In fact, while looking at total scores is crucial for drawing clinical judgments, single item-level information would help clinicians orient themselves towards a given diagnostic hypothesis, possibly also providing relevant prognostic information, albeit at a qualitative level. In this respect, data on item discrimination, i.e. an IRT parameter quantifying how well a given item discriminates between different levels of the underlying trait, and thus the extent to which it is informative, would allow examiners to address responses to such items with greater attention. For instance, within a BFI assessing dysexecutive behavioural features, an item on the development of a sweet tooth (for instance, following the onset of a neurodegenerative condition) might prove highly informative towards the diagnosis of a frontal disorder. By contrast, within the same tool, items targeting depressive symptoms might be less informative towards such a behavioural syndrome, as they are common to different brain disorders.
That said, since practitioners and clinical researchers most of the time look at the global score yielded by a given BFI, a further useful output of potential IRT analyses is the test information function, which describes the overall informativeness of the BFI as a function of the underlying level of the target construct. For instance, a BFI aimed at measuring apathy that proves most informative for individuals with higher levels of the underlying construct (i.e. marked apathetic features) should be used with caution when assessing patients who do not display overt symptoms (and are thus unlikely to suffer from severe apathy), since it may yield false-negative results.
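Item discrimination and the test information function can both be sketched numerically under the two-parameter logistic (2PL) model. The item parameters below are made up for illustration (they are not fitted to any included BFI); the example shows how a single highly discriminative, difficult item can make a scale most informative at high trait levels:

```python
import numpy as np

# Minimal 2PL IRT sketch with three hypothetical dichotomous items.
# a = discrimination, b = difficulty; values are illustrative only.
a = np.array([2.0, 1.0, 0.5])
b = np.array([1.5, 0.0, -1.0])

def item_information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

theta_grid = np.linspace(-3, 3, 121)
# Test information function: sum of item informations at each trait level.
tif = np.array([item_information(t, a, b).sum() for t in theta_grid])

peak_theta = theta_grid[np.argmax(tif)]
print(f"TIF peaks at theta = {peak_theta:.2f}")
```

Because the most discriminative item (a = 2.0) is also the most difficult (b = 1.5), the test information function peaks at high theta: such a hypothetical scale would measure severe presentations precisely while remaining comparatively uninformative, and thus prone to false negatives, for subclinical ones.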

Diagnostics
It is unfortunate to note that, among the BFIs for which diagnostics could be computed and norms derived, such data were lacking for about one-third; this rate dropped further when addressing non-intrinsic diagnostics (i.e. predictive values and likelihood ratios). This represents a major drawback for the clinical usability of certain BFIs as tools intended to convey diagnostic information. Diagnostic properties and norms should undoubtedly be addressed more accurately in future studies aimed at developing and standardizing BFIs. In this respect, researchers devoted to such aims should note that diagnostic and normative investigations do not necessarily overlap. For instance, ROC analyses allow one both to derive a cut-off and to provide intrinsic/post-test diagnostics, but may be used for the latter aim only. Moreover, norms can be derived through approaches other than ROC analyses, e.g. by means of z-based, percentile-based or regression-based techniques.
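The non-ROC norming approaches mentioned above can be sketched as follows. The normative sample is simulated and the scale range and cut-off percentile are illustrative assumptions, not taken from any included study:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical normative sample of BFI total scores (simulated); higher
# scores are assumed to indicate more severe symptoms.
norm_scores = rng.normal(10, 3, size=500)

# Percentile-based cut-off: flag scores above the 95th percentile of the
# normative distribution (an illustrative choice of percentile).
p95_cutoff = np.percentile(norm_scores, 95)

# z-based cut-off: mean + 1.645 SD targets the same upper 5% tail, but
# only under the assumption that scores are normally distributed.
z_cutoff = norm_scores.mean() + 1.645 * norm_scores.std(ddof=1)

print(f"95th percentile cut-off: {p95_cutoff:.1f}")
print(f"z-based cut-off:         {z_cutoff:.1f}")
```

The two cut-offs nearly coincide here because the simulated scores are normal; with skewed questionnaire scores, a typical situation for symptom scales, percentile-based (or regression-based) norms are usually the safer choice.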
As to the derivation of cut-offs via ROC analyses, an advisable practice would be to provide different values based on different trade-offs between sensitivity and specificity. This would not only allow clinicians to be adaptive in selecting the most suitable cut-off depending on whether they intend to favour the sensitivity or the specificity of a given BFI, but also help clinical researchers identify an adequate threshold for inclusion/exclusion purposes in research settings. Indeed, when including a given deficit as an exclusion criterion for recruitment, researchers might prefer stricter cut-offs, as these guarantee higher specificity and hence fewer false positives.
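The practice of reporting multiple cut-offs can be sketched with simulated data (group means, sample sizes and the 0.95 specificity target below are illustrative assumptions): the same ROC sweep yields both a balanced, Youden-optimal cut-off and a stricter, high-specificity one.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated BFI scores for controls and patients (illustrative only);
# higher scores are assumed to indicate more severe symptoms.
controls = rng.normal(8, 3, size=150)
patients = rng.normal(14, 3, size=150)

scores = np.concatenate([controls, patients])
labels = np.concatenate([np.zeros(150), np.ones(150)])  # 1 = patient

# Sweep all observed scores as candidate cut-offs ("score >= cut" = positive).
results = []
for cut in np.unique(scores):
    pred = scores >= cut
    sens = np.mean(pred[labels == 1])
    spec = np.mean(~pred[labels == 0])
    results.append((cut, sens, spec))

# Balanced cut-off: maximizes Youden's J = sensitivity + specificity - 1.
youden_cut = max(results, key=lambda r: r[1] + r[2] - 1)[0]

# Stricter cut-off, e.g. for exclusion purposes: best sensitivity among
# cut-offs guaranteeing specificity >= 0.95 (fewer false positives).
strict_cut = max((r for r in results if r[2] >= 0.95), key=lambda r: r[1])[0]

print(f"Youden cut-off: {youden_cut:.1f}; high-specificity cut-off: {strict_cut:.1f}")
```

Reporting both values (and, ideally, the full coordinates of the ROC curve) lets the end user choose the trade-off appropriate to the clinical or recruitment question at hand.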
Finally, on the notion of "disease-specificity", it would be reasonable not to limit the application of certain disease-specific BFIs to the clinical population(s) to which they were originally addressed. For instance, questionnaires designed to assess depression in amyotrophic lateral sclerosis (ALS) while overcoming disability-related confounders [36] might as well be applied to other motor conditions (e.g. extrapyramidal disorders, multiple sclerosis). Similarly, tools assessing dysexecutive-like behavioural changes in ALS [20] might come in handy for the detection of such disturbances in other neurological conditions known to affect frontal networks (e.g. Huntington's disease). This proposal arises from the consideration that common phenotypic manifestations may be underpinned by different pathophysiological processes; an extended application of disease-specific BFIs should therefore occur only when such an assumption is met. Moreover, such "off-label" adoptions would undoubtedly require studies supporting the feasibility of these disease-specific BFIs in the desired populations.

Usability
Despite it being widely accepted that back-translation is required when adapting a given BFI to a new (target) language, very few BFIs appeared to have undergone such a procedure, and information on BFI adaptation was often lacking. This finding is in line with the notion that statistical and methodological deficiencies of psychometric instruments derive especially from cross-cultural adaptation frameworks [60].
Moreover, data on possible ceiling and/or floor effects were often unreported, preventing clinicians and researchers from evaluating whether a given BFI can be deemed suitable for a target clinical or non-clinical population. For instance, a BFI assessing behavioural disorders that putatively presents a relevant ceiling/floor effect might be scarcely informative if administered with the aim of detecting subclinical alterations. However, this issue is of course even more relevant when dealing with BFIs addressed to clinical populations: indeed, while ceiling/floor effects might be expected in normotypical individuals if a given BFI is aimed at detecting clearly clinical symptoms (e.g. neuropsychiatric manifestations within the dysexecutive spectrum), the same does not apply to clinical populations known to present with such features (e.g. patients with frontal lobe damage). In other terms, a BFI yielding ceiling/floor effects in diseased populations is likely to be poorly usable at a clinical level. It follows that the assessment of ceiling/floor effects is most relevant when exploring the clinical usability of BFIs.
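Checking for ceiling/floor effects is straightforward once raw scores are available. The sketch below uses a simulated sample and a hypothetical 0-30 scale; the 15% threshold is a common rule of thumb for flagging such effects, not a criterion drawn from the included studies:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated total scores on a hypothetical 0-30 BFI administered to a
# clinical sample whose severity pushes scores towards the maximum.
scores = np.clip(rng.normal(27, 4, size=120).round(), 0, 30)

# A common rule of thumb flags a ceiling/floor effect when more than 15%
# of respondents obtain the maximum/minimum possible score.
ceiling_rate = np.mean(scores == 30)
floor_rate = np.mean(scores == 0)

print(f"ceiling: {ceiling_rate:.0%}, floor: {floor_rate:.0%}")
if ceiling_rate > 0.15:
    print("Warning: ceiling effect; the BFI may miss change at the severe end.")
```

A BFI flagged in this way in its target clinical population cannot discriminate among the most impaired patients nor register worsening over time, which is exactly the clinical-usability concern raised above.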
As for ease of use, researchers devoted to the development and psychometric/diagnostic/usability study of BFIs are encouraged to assess how difficult a questionnaire is to administer, score and interpret from the examiner's standpoint, and to understand and complete from the examinee's standpoint. The vast majority of the tools included in the present review did not come with such information.
The time requirements of BFIs were also frequently found to be lacking, although this information is undoubtedly needed to determine whether a given tool is suitable for the target setting. For instance, not all BFIs might be adequate for bedside administration, some being relatively long and thus scarcely appropriate for time-restricted settings. Similarly, time requirements may differ between in-patient and out-patient settings.

Further suggestions for researchers
A number of further elements, not explicitly addressed earlier in this work, are listed herewith in order to help researchers devoted to BFI development and psychometric/diagnostic/usability study.
First, IRT analyses can also be useful, within the development of either a novel BFI or a shortened version of an existing one, to select items that adequately measure the target construct [14]. To such aims, IRT can also be complemented with classical test theory approaches in order to identify, through an empirical, data-driven approach, a set of criteria that an item must meet in order to be included in a BFI under development, as recently proposed within the Italian scenario [28].
Second, the a priori estimation of an adequate sample size for the main target analyses of a psychometric/diagnostic/usability study of a given BFI is advisable. In this respect, a number of studies are available suggesting optimal, either empirical or simulation-based, sample size estimation procedures for, e.g., validity and reliability analyses [23], dimensionality-reduction techniques [22], ROC analyses [35], IRT analyses [50] and regression-based norming [49]. An a posteriori evaluation of the robustness of normative data can also be performed, as suggested by Crawford and Garthwaite [10].
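A simulation-based approach of the kind cited above can be sketched in a few lines. The example below is a toy version (the true correlation of 0.5, the precision target and the candidate sample sizes are all illustrative assumptions): it estimates, by Monte Carlo simulation, how the precision of a sample validity correlation improves with n.

```python
import numpy as np

rng = np.random.default_rng(5)

def ci_halfwidth(n, true_r=0.5, reps=2000):
    """Empirical 95%-interval half-width of sample correlations at size n."""
    cov = np.array([[1.0, true_r], [true_r, 1.0]])
    rs = []
    for _ in range(reps):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        rs.append(np.corrcoef(x[:, 0], x[:, 1])[0, 1])
    lo, hi = np.percentile(rs, [2.5, 97.5])
    return (hi - lo) / 2

# Inspect candidate sample sizes and pick the smallest one meeting the
# desired precision (e.g. half-width below +/- 0.15).
for n in (50, 100, 150):
    print(n, round(ci_halfwidth(n), 3))
```

Running the sweep before data collection makes the precision/cost trade-off explicit and documents the rationale behind the chosen sample size.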
Finally, researchers focused on BFI development and psychometric/diagnostic/usability study have to be aware of procedures aimed at handling missing data according to