The Telephone Language Screener (TLS): standardization of a novel telephone-based screening test for language impairment

Background This study aimed at developing and standardizing the Telephone Language Screener (TLS), a novel, disease-nonspecific, telephone-based screening test for language disorders.
Methods The TLS was developed in strict accordance with current psycholinguistic standards. It comprises nine tasks assessing the phonological, lexical-semantic and morpho-syntactic components, as well as an extra Backward Digit Span task. The TLS was administered to 480 healthy participants (HPs), along with the Telephone-based Semantic Verbal Fluency (t-SVF) test and a Telephone-based Composite Language Index (TBCLI), as well as to 37 cerebrovascular/neurodegenerative patients, who also underwent the language subscale of the Telephone Interview for Cognitive Status (TICS-L). An HP subsample was also administered an in-person language battery. Construct validity, factorial structure, internal consistency, test-retest and inter-rater reliability were tested. Norms were derived via Equivalent Scores. The capability of the TLS to discriminate patients from HPs and to identify, within the patient cohort, those with a defective TICS-L was also examined.
Results The TLS was underpinned by a mono-component structure and converged with the t-SVF (p < .001), the TBCLI (p < .001) and the in-person language battery (p = .002). It was internally consistent (McDonald’s ω = 0.67) and reliable between raters (ICC = 0.99) and at retest (ICC = 0.83). Age and education, but not sex, were predictors of TLS scores. The TLS optimally discriminated patients from HPs (AUC = 0.80) and successfully identified patients with an impaired TICS-L (AUC = 0.92). In patients, the TLS converged with TICS-L scores (p = .016).
Discussion The TLS is a valid, reliable, normed and clinically feasible telephone-based screener for language impairment.
Supplementary Information The online version contains supplementary material available at 10.1007/s10072-023-07149-1.


Introduction
Telephone-based cognitive screening (TBCS) plays a pivotal role in both clinical practice and research on brain disorders. By relying on a widespread, highly accessible and practicable medium, it reduces the geographical, logistic, economic, socio-demographic and organizational barriers that undermine access to health facilities, continuity of care and the viability and completion of epidemiological studies and decentralized clinical trials [1][2][3][4][5][6].
Whilst several TBCS tools have been developed for detecting global cognitive impairment [7,8], no standardized TBCS tests that focus on language are available, especially in Italy [9]. Indeed, within the Italian scenario, previously standardized TBCS tests tap either into global cognition, i.e. the Italian telephone-based Mini-Mental State Examination (Itel-MMSE) [10], the Telephone Interview for Cognitive Status (TICS) [11,12], the Tele-Global Exam of Mental State (Tele-GEMS) [13] and the ALS Cognitive Behavioral Screen™-Phone Version [14], or into overall executive efficiency, i.e. the Telephone-based Frontal Assessment Battery [15].
However, language dysfunctions, either primary or secondary to extra-linguistic deficits [16], are common to a variety of neurological disorders of different aetiologies.
Edoardo Nicolò Aiello and Veronica Pucci contributed equally.
In light of the trans-diagnostic relevance of language disorders, as well as of their prognostic implications [24], practitioners and clinical researchers would undoubtedly benefit from the availability of TBCS tests that specifically tap into language [25]. Such a stance has been highlighted by De Witte et al. [26], who explored the feasibility of a telephone-based language battery (i.e. the TeleLanguage) for monitoring neurosurgical patients over time. However, no full standardization was provided for the TeleLanguage [26].
Given the above premises, this study aimed at standardizing the Telephone Language Screener (TLS) [27], a novel, disease-nonspecific, telephone-based test for screening language deficits in patients with suspected/confirmed brain pathologies. More specifically, the current report focused on (1) assessing its psychometrics, (2) deriving its norms in an Italian population sample, and (3) offering preliminary feasibility evidence in a heterogeneous cohort of patients with neurodegenerative and cerebrovascular diseases.

Participants
The normative sample consisted of 480 Italian healthy participants (HPs) aged ≥18 years and with no history of (1) neurological/psychiatric disorders, (2) active psychotropic medications, (3) uncompensated/severe general medical conditions and (4) uncorrected hearing deficits. Sample stratification is shown in Supplementary Table 1. HPs were recruited through both the authors' personal acquaintances and advertising at the University of Milano-Bicocca and the University of Padova.

Telephone Language Screener
As to its structure, overall task content and aims, the TLS was originally inspired by the Screening for Aphasia in NeuroDegeneration (SAND) [33]. The TLS includes nine tasks that provide both component-specific and global measures of language. Their development, pursuant to rigorous psycholinguistic standards, is detailed in Supplementary Material 1. Briefly, the process entailed the following: (1) the identification of an initial pool of items that, based on the neurolinguistic literature, were likely to be relatively sensitive to language deficits; (2) pilot studies in HPs, which allowed the initial item sets to be refined by selecting the most feasible items; and (3) the assessment of the final item sets for their psycholinguistic features. The record-form and manual of the TLS are available upon request to the corresponding authors.
The TLS is structured into the following tasks:
• Connected Speech (CS): this task, adapted from Wilson et al. [34], Arcara and Bambini [35] and Catricalà et al. [33], allows speech samples to be collected within a semi-structured, ecological interview. It is subdivided into two subtasks (CSa; CSb) according to how speech production is elicited, since different elicitation modalities explore, to different extents, the language components engaged in connected speech [36]. The CSa requires the examinee to describe her/his morning routine (narrative elicitation) within 40 s; the CSb requires the examinee to verbalize how to brush one's teeth (procedural elicitation) within 20 s, and is simply an oral version of the written description task from the SAND [33]. CS performance (CSa + CSb) is scored on two levels: base and advanced (optional), as in Catricalà et al. [33]. The base CS scoring allows for a qualitative report of deficits within the phonological, lexical-semantic and morpho-syntactic components, as well as of features regarding articulatory deficits, speech fluency and communicative failures of a dysexecutive/inattentive aetiology [37]. According to Catricalà et al. [33], information units (IU, i.e. the most relevant meaning-conveying elements) represent the primary outcome of the CS task (range, 0-11). The advanced, optional scoring system encompasses a report of features suggestive of dysarthria and apraxia of speech [38] and allows a quantitative analysis of phonological, lexical-semantic and morpho-syntactic deficits [34,37].
• Repetition of Words (RoW), Non-Words (RoNW) and Sentences (RoS): these tasks, adapted from Catricalà et al. [33], allow the phonological component to be assessed within both receptive and productive modalities by requiring the examinee to repeat, one at a time, six words, five non-words (i.e. phonologically legal strings with no meaning) and three sentences. Including words and non-words in repetition tasks is relevant to assess the integrity of both lexical-semantic and phonological routes, respectively involved in processing words with and without semantic representations within the mental lexicon [48].
The TLS-Total score is equal to the sum of the IU, Spelling, SA, NtD-N, NtD-V, CML, RoW, RoNW and RoS sub-scores (range, 0-68). Additionally, the TLS comprises an "extra" Backward Digit Span (BDS) task, whose score is, however, not included in the TLS-Total score. This task, adapted from Monaco et al. (2013), was included in order to take into account phonological working memory/verbal short-term memory deficits that may affect language performance [49]. According to Pasotti et al. [50], two outcomes are computed: the longest sequence recalled, reflecting working memory capacity (BDS-WM), and the total number of sequences correctly reported (BDS-T), reflecting sustained attention during task execution.
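The scoring rules above can be illustrated with a minimal Python sketch. The function names, the dictionary layout of the sub-scores and the representation of BDS trials as (sequence length, correct) pairs are illustrative assumptions and are not taken from the TLS record-form or manual.

```python
# Sub-scores that contribute to the TLS-Total, as listed in the text.
TLS_SUBSCORES = ("IU", "Spelling", "SA", "NtD-N", "NtD-V",
                 "CML", "RoW", "RoNW", "RoS")


def tls_total(subscores):
    """Sum the nine sub-scores; BDS outcomes are deliberately left out."""
    missing = [name for name in TLS_SUBSCORES if name not in subscores]
    if missing:
        raise ValueError(f"missing sub-scores: {missing}")
    return sum(subscores[name] for name in TLS_SUBSCORES)


def bds_outcomes(trials):
    """Return (BDS-WM, BDS-T) from (sequence_length, correct) pairs.

    BDS-WM: longest sequence correctly recalled.
    BDS-T: number of sequences correctly reported.
    The trial layout is hypothetical.
    """
    correct_lengths = [length for length, correct in trials if correct]
    bds_wm = max(correct_lengths, default=0)
    bds_t = len(correct_lengths)
    return bds_wm, bds_t
```

Keeping the BDS outcomes in a separate function mirrors the fact that they are reported alongside, and not within, the TLS-Total.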

Other measures
For the purpose of convergent validity testing in HPs, the following measures were employed:
• The Telephone-based Semantic Verbal Fluency (t-SVF) task included within the Telephone-based Verbal Fluency Battery (t-VFB) [51], which was administered to 266 HPs;
• A Telephone-Based Composite Language Index (TBCLI; range, 0-7), computed as the sum of the language items included within the Itel-MMSE [10], i.e. naming-to-description of an object (N = 1) and sentence repetition (N = 1), and the Tele-GEMS [13], i.e. naming-to-description of objects (N = 4) and comprehension of a bi-phasic command (N = 1), which was administered to 200 HPs.
When standardizing a cognitive test that is intended to be administered remotely, it is pivotal to ascertain that it taps into the same construct(s) as in-person measures, which should not overlap with the target test itself [9,52,53]. To this aim, a subsample of 79 HPs underwent a set of standardized, in-person language tasks that mimicked each section of the TLS, i.e. the In-Person Composite Language Index (IPCLI). The IPCLI, described in Supplementary Table 2, results from the sum of the following measures (range, 0-68): the Spelling task from the Edinburgh Cognitive and Behavioural ALS Screen [54], the Noun- and Verb-Naming tasks from the Esame NeuroPsicologico per l'Afasia [55] and the Semantic Association, Repetition of Words, Repetition of Non-Words, Repetition of Sentences, Sentence Comprehension and Connected Speech tasks from the SAND [33].
Finally, for the purpose of convergent validity testing in patients, the Language sub-scale of the TICS (TICS-L; range, 1-8) [11,12] was administered. Patients also underwent the Mini-Mental State Examination [56] for clinical purposes.

Procedures
All participants first underwent a semi-structured interview for collecting demographic data and medical history, as well as an in-depth sound check for ensuring good call quality, whose protocol has been described elsewhere [15,51]. When tested over the telephone, participants were at home; in-person testing took place at the Institutions involved in the study. TBCS sessions lasted ≤45 min, whilst in-person evaluations lasted ≈30 min.
For test-retest reliability testing, 29 HPs were re-administered the TLS 30 days after baseline; for inter-rater reliability testing, 26 TLS record-forms were scored online by two examiners blinded to each other's scoring.
HPs undergoing both the TLS and the IPCLI were tested either first over the telephone and then in person at a 48-h distance (N=37) or vice versa (N=42), in order to rule out carry-over effects.
Data were collected by either licensed neuropsychologists or neuropsychology trainees; all examiners underwent ad hoc training delivered by the corresponding author. Data collection started in March 2021 and ended in May 2022.

Statistical analyses
Convergent validity against telephone-based measures (both in HPs and in patients), as well as the convergence between the TLS and the IPCLI, was tested through Spearman's coefficients, since the vast majority of measures did not meet the distributional assumptions of linear models (i.e. skewness and kurtosis values >|1| and >|3|, respectively) [57].
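The decision rule above (screen skewness and kurtosis, then fall back on a rank-based coefficient) can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the software used in the study; the function names are hypothetical, and treating the kurtosis threshold as referring to excess kurtosis computed with moment-based estimators is an assumption.

```python
from statistics import mean


def _central_moment(x, order):
    m = mean(x)
    return sum((v - m) ** order for v in x) / len(x)


def skewness(x):
    """Moment-based sample skewness."""
    return _central_moment(x, 3) / _central_moment(x, 2) ** 1.5


def excess_kurtosis(x):
    """Moment-based sample kurtosis minus 3 (assumed convention)."""
    return _central_moment(x, 4) / _central_moment(x, 2) ** 2 - 3.0


def violates_normality(x, skew_limit=1.0, kurt_limit=3.0):
    """Apply the |skewness| > 1 or |kurtosis| > 3 screening rule."""
    return abs(skewness(x)) > skew_limit or abs(excess_kurtosis(x)) > kurt_limit


def average_ranks(x):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's coefficient as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because only ranks enter the computation, the coefficient is unaffected by the skewed, ceiling-prone score distributions that are typical of language tests in healthy samples.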
In HPs, test-retest and inter-rater reliability were assessed via intra-class correlations (ICCs), whereas internal consistency and factorial structure were assessed via McDonald's ω and a Principal Component Analysis (PCA), respectively.
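As an illustration of how an intra-class correlation can be obtained from a subjects-by-raters (or subjects-by-sessions) score matrix, a minimal Python sketch is given below. The text does not state which ICC form was used; the two-way random-effects, absolute-agreement, single-measure form, often written ICC(2,1), is assumed here purely for demonstration, and the function name is hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) from a list of rows, one row per subject, one column per rater.

    Assumed form: two-way random effects, absolute agreement, single measure.
    """
    n = len(ratings)        # subjects
    k = len(ratings[0])     # raters or sessions
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-subject mean square
    ms_cols = ss_cols / (k - 1)               # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

For the inter-rater analysis each row would hold the two examiners' scores for one record-form; for the test-retest analysis, the baseline and retest scores of one participant.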
Norms were derived through the Equivalent Score (ES) method [58,59]. The ES method first entails a stepwise regression step that allows raw scores to be adjusted for significant demographic predictors. Subsequently, outer and inner tolerance limits (oTL; iTL) are identified on ranked adjusted scores (ASs) to provide a non-parametric, interval estimate of cut-off values. ASs ≤ oTL are attributed an ES = 0, i.e. an "impaired" performance, whereas ASs ≥ Mdn (the median) are attributed an ES = 4, i.e. a "high-end normal" performance. ASs falling between the oTL and the Mdn are then allotted to three further ability levels, whose thresholds are identified via a z-score-based approach: ES = 1 → "borderline" performance; ES = 2 → "low-end normal" performance; ES = 3 → "normal" performance. ASs falling between the oTL and the iTL are assigned an ES = 1 but cannot be inferentially judged as either below or above the cut-off.
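The z-score-based subdivision and the resulting ES assignment can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure implemented in the Supplementary calculation sheet: the oTL is assumed to correspond to a given population percentile (0.05 in the usage note), the interval between that percentile and the median is split into three equal parts on the z scale, the thresholds passed to `equivalent_score` are hypothetical score values, and the handling of scores falling exactly on an intermediate threshold is a convention chosen here.

```python
from statistics import NormalDist


def inner_es_percentiles(otl_percentile):
    """Percentiles separating ES 1/2 and ES 2/3.

    The z-interval between the oTL percentile and the median (z = 0)
    is divided into three equal parts and mapped back to percentiles.
    """
    if not 0.0 < otl_percentile < 0.5:
        raise ValueError("the oTL percentile must lie between 0 and 0.5")
    std_normal = NormalDist()
    z_otl = std_normal.inv_cdf(otl_percentile)  # negative value
    cut_1_2 = std_normal.cdf(z_otl * 2 / 3)
    cut_2_3 = std_normal.cdf(z_otl / 3)
    return cut_1_2, cut_2_3


def equivalent_score(adjusted, otl, cut_1_2, cut_2_3, median):
    """Map an adjusted score onto ES 0-4 given thresholds in score units."""
    if adjusted <= otl:
        return 0  # impaired
    if adjusted >= median:
        return 4  # high-end normal
    if adjusted <= cut_1_2:
        return 1  # borderline
    if adjusted <= cut_2_3:
        return 2  # low-end normal
    return 3      # normal
```

In practice, `inner_es_percentiles(0.05)` would yield the two percentiles of the ranked adjusted scores at which the intermediate thresholds are read off; those score values, together with the oTL and the median, are then supplied to `equivalent_score`.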
In HPs, a 2-parameter logistic (2-PL) Item Response Theory (IRT) model [60,61] was run via the mirt package [62] in R 4.1.0 in order to estimate item difficulty and discrimination values for each TLS item, except for BDS items (the BDS not being included in the TLS-Total) and IU ones (which, theoretically, are not closed-range items). According to Arifin and Yusoff [61], difficulty values ranging from −3 to +3 were regarded as typical (with values ≤−3 indexing an extremely easy item and those ≥+3 an extremely difficult item), whereas, as to discrimination, values ranging from 0.65 to 1.34 were regarded as indexing moderate discrimination, and those ≥1.35 and >1.7 as indexing high and extremely high discrimination, respectively.
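For readers less familiar with the 2-PL model, the following Python sketch shows the standard item response function and applies the interpretive thresholds reported above. It is not the mirt-based estimation itself; the function names are hypothetical, and the label returned for discrimination values below 0.65, a range not addressed in the text, is a placeholder.

```python
import math


def p_correct(theta, a, b):
    """2-PL probability of a correct response.

    theta: latent ability; a: item discrimination; b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def classify_difficulty(b):
    """Thresholds as reported in the text (after Arifin and Yusoff)."""
    if b <= -3:
        return "extremely easy"
    if b >= 3:
        return "extremely difficult"
    return "typical"


def classify_discrimination(a):
    """Thresholds as reported in the text; the lowest label is a placeholder."""
    if a > 1.7:
        return "extremely high"
    if a >= 1.35:
        return "high"
    if a >= 0.65:
        return "moderate"
    return "below moderate"  # range not classified in the text
```

An item with a strongly negative difficulty is answered correctly by almost all examinees of average ability, which is the pattern to be expected for most language items in healthy participants.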
Clinical usability was tested via receiver-operating characteristic (ROC) analyses by comparing (1) the whole clinical group and each clinical subgroup against the normative sample and (2) patients with an impaired performance on the TICS-L (i.e. ES = 0) to those with an above-cut-off performance on it (i.e. ES ≥ 1).
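The area under the ROC curve used in such comparisons can be computed without fitting any model, as the probability that a randomly chosen case scores lower than a randomly chosen control. The Python sketch below illustrates this rank-based definition; it assumes that lower TLS scores indicate worse performance, counts ties as one half, and uses a hypothetical function name. It is not the software employed in the study.

```python
def auc_lower_is_impaired(case_scores, control_scores):
    """AUC as P(case score < control score), with ties counted as 0.5."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case < control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

A value of 0.5 corresponds to chance-level discrimination and a value of 1 to perfect separation of the two groups.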

Power analyses
Based on Hobart et al.'s [63] recommendations, the minimum sample sizes for reliability and validity analyses in HPs were set at N=20 and N=80, respectively.
A sample size of 100 was deemed sufficient for the PCA according to the guidelines delivered by Kyriazos [64].
According to Baylor et al.'s [60] rule-of-thumb suggestions, 250 observations were deemed as adequate to run the 2-PL IRT model.
For ROC analyses comparing clinical groups to the normative sample, according to Obuchowski [66] and through the R package easyROC [67], the minimum sample sizes were estimated at N=30 and N=6, respectively, by addressing a case-control allocation ratio of 5, AUC = 0.8, 1-β = 0.8 and α = 0.05 within a single-test ROC analysis. For ROC analyses comparing patients with a defective vs. a normal TICS-L score, the minimum sample sizes were set at N = 4 and N = 20, respectively, by addressing an allocation ratio of 5, AUC = 0.85, 1-β = 0.8 and α = 0.05 within a single-test ROC analysis.

Results
Demographic and telephone-based cognitive measures of the normative sample are shown in Table 1. Supplementary Table 3 reports IPCLI measures of the target HP subsample. A ceiling effect in the TLS-Total, defined as a score ≥95th percentile of the normative performance, was detected in 6% of the sample.
Table 2 shows item difficulty and discrimination values. The majority of TLS items fell within a typical difficulty range, albeit towards easiness, with only a limited number of them being classified as extremely easy. Spelling items proved to be the most difficult ones. As to discrimination, the vast majority of TLS items showed moderate discrimination, with some of those included within Spelling, SA, NtD-N, NtD-V, RoW and CML yielding high-to-extremely-high discrimination.
Table 3 shows adjustment equations for raw TLS measures as well as TLs and ES thresholds for TLS ASs. Norms were derived from the whole sample for all TLS tasks except for the advanced CS scoring measures, which were derived from N=219 HPs (see Supplementary Table 4 for the stratification of this sub-sample and Supplementary Table 5 for its descriptive statistics). Norms for the BDS-T/-WM were instead derived from N=401 HPs. An automated AS and ES calculation sheet is provided within the Supplementary Material 2. RoW and IU, as well as a number of advanced CS scoring measures, were not predicted by age, education or sex. Age negatively predicted the vast majority of measures (ps < .05), at times concurrently with education (ps < .05), which instead was a positive predictor. No sex differences emerged for any TLS measure (ps ≥ .06).

Discussion
The present work provides Italian practitioners and researchers with a standardized, disease-nonspecific TBCS test for language impairment, i.e. the TLS [27], along with preliminary evidence on its clinical usability in neurodegenerative and cerebrovascular patients. The TLS adds to the range of standardized TBCS tests that are currently available in Italy [10][11][12][13][14][15], with the aim of improving tele-neuropsychological practice within both clinical and research settings [9]. Remarkably, the development and comprehensive standardization of a TBCS test for language impairment is unprecedented within the international literature; the procedures herewith described will thus hopefully serve as a paradigm for future research on this topic, as well as for adaptations of the TLS to other languages and cultures.
The TLS was indeed developed according to rigorous psycholinguistic/neurolinguistic standards, and proved to be valid (at both the structure and construct levels) and reliable (at the internal, test-retest and inter-rater levels), as well as to converge with in-person language measures. Additionally, item-level information for the TLS has been herewith provided, with the aim of easing the interpretation of its results [9,68]. Moreover, since it comes with regression-based norms for both its total score and each of its tasks, the TLS allows both overall and component-specific language deficits to be detected. In this respect, the inclusion of normed BDS measures also allows for qualitatively determining whether phonological working memory/verbal short-term memory deficits impact on TLS scores. Specific sections have also been developed within the TLS record-form to qualitatively report relevant semeiotic elements related to motor speech disorders and overall communicative failures, as well as to quantify connected speech deficits; the latter aspect is of major relevance in the light of the promising role of speech sample analyses towards an early detection of cognitive decline in a variety of brain disorders [36]. The TLS was also shown to be able both to discriminate HPs from neurological patients and to identify, within a clinical cohort, the occurrence of language deficits (i.e. a defective TICS-L score) in cerebrovascular and neurodegenerative diseases. Similarly to the SAND [33], which represented a relevant source of inspiration for its development, the TLS should thus be intended as a disease-nonspecific language screener to be applied for case-finding aims whenever deemed appropriate, i.e. also beyond primary aphasic syndromes [69][70][71]. In this regard, it has nevertheless to be noted that the SAND was specifically intended to be administered to patients with neurodegenerative disorders, whilst the TLS is meant not to be bound to a specific set of aetiologies.

Limitations and future perspectives
The present study is of course not free of limitations.
In the first place, a number of elements need to be highlighted with regard to the current psychometric analyses.
First, the convergent validity of the TLS was herewith tested, in HPs, against two measures (i.e. the t-SVF and the TBCLI) that mostly tap into the lexical-semantic component and do not fully cover the wide range of language functions/components covered by the TLS itself. In this respect, it also has to be noted that semantic fluency tasks load on executive functions to a non-negligible extent, albeit to a lesser degree than phonemic fluency ones [72,73]. Similarly, in neurological patients, convergent validity was tested against the TICS-L, which, once again, mostly assesses the lexical-semantic component and is thus far from being a comprehensive language index. Hence, even though the TLS proved to be significantly associated with all of the abovementioned measures, there is still a need to further explore its construct validity by employing ad hoc, telephone-based language tests that minimally rely on extra-linguistic abilities and that cover the full range of functions/components assessed by the TLS itself.
Second, it has to be noted that the TLS showed acceptable, albeit not high, internal consistency. However, this was not unexpected: its target construct, i.e. language, is inherently multi-faceted. Consistently, the adoption of internal consistency as an index of the reliability of cognitive screeners has been questioned, as they often address either multiple cognitive domains/functions or different facets of a given cognitive domain/function that is supposed, but not empirically proven, to be unitary [68].
Third, it is worth noting that the IRT model revealed that the majority of TLS items were overall easy, and that only a limited number of them came with a high discriminative power. However, this finding is consistent with the empirical notion according to which language tests are generally not challenging for HPs. At the same time, this should not lead to equating item easiness/low discrimination in HPs with clinical uselessness in patients [74].
Finally, it should be mentioned that the sample from which norms for the advanced CS scoring measures were derived was smaller (N=219) than the whole sample of HPs (N=480). However, based on the current power analysis, this sample is satisfactory in size, and is larger than that employed for the normative study of the SAND (N=134) [33], which encompasses similar measures. Relatedly, regarding the stratification of the present normative sample, it should be noted that adjustment coefficients might not be validly applicable, and should thus be interpreted cautiously, for those age and education classes which are not herewith represented, i.e. individuals with ≤5 years of education and aged up to 60 years for the CS task, and those with ≤5 years of education and aged up to 45 years for the remaining tasks and the TLS-Total.
Some further considerations have to be made with regard to the characteristics of a number of tasks included within the TLS.
First, the restricted time limits that have been set for CS tasks might prevent the collection of meaningful speech samples in patients with severe production deficits (e.g. non-fluent aphasias). Thus, although the choice of keeping such a time window as narrow as possible was made in order for the TLS to be as little time-consuming as possible, further investigations are needed for determining whether these tasks are actually informative in patients with a severe reduction in speech output.
Second, one might question the adequacy of the Spelling task to the aim of detecting phonological deficits. It is indeed true that, at variance with other languages (such as English), the vast majority of Italian words are characterized by predictable phonological-to-orthographic/orthographic-to-phonological conversion rules. Hence, Italian speakers are mostly unfamiliar with oral spelling, which makes this task quite challenging for healthy individuals too [75,76], as also confirmed by the present IRT analysis showing that Spelling items were among the most difficult ones. In addition, oral spelling engages attentive and executive functions [77]. It follows that, in order to determine whether examinees might present with phonological deficits or not, Spelling scores should not be interpreted alone, but rather in the light of the performance on the other TLS tasks assessing phonology (i.e. RoW, RoNW and RoS). Having said that, the Spelling task remains a relevant part of the TLS; a qualitative analysis of the errors on this task can help provide some further insight into the integrity of the phonological structure of single lexical entries.
Third, as is the case for the Token Test from which it takes inspiration, the CML too taps into multiple, extra-linguistic cognitive functions [78]. Indeed, the performance on the Token Test not only depends on the integrity of the morpho-syntactic component, but also on phonological working memory/verbal short-term memory and executive functions [79]. Hence, CML scores need to be interpreted along with the performance on the BDS-T/-WM: in the presence of a defective performance on both the CML and the BDS-T/-WM, examiners should not confidently conclude on the presence of morpho-syntactic deficits. It has however to be noted that the presence of defective BDS-T/-WM performances should lead examiners to interpret with caution the results of all TLS tasks (and not only of the CML), since working and/or short-term memory impairments negatively affect different language functions.
It should be then borne in mind that the present study is not exhaustive of all the clinimetric and feasibility investigations that are supposed to be performed on a given cognitive screener [68].
First, further studies are mandatory in order to test the diagnostics and cross-sectional feasibility of the TLS in clinical populations whose language impairment, regardless of the aetiology, is confirmed by either a clinical diagnosis or via first-/second-level, gold-standard language batteries, such as the SAND [33] or the Aachen Aphasia Test [80], respectively. Indeed, within the present study, only preliminary evidence on the feasibility of the TLS for case-control discrimination and case-finding aims was provided. In this regard, it should also be noted that such results relied on a suboptimal operationalization of the positive state, since the TICS-L is far from being a comprehensive measure of language. Moreover, the present clinical cohort was relatively small in size and highly heterogeneous, and detailed information on both its linguistic and extra-linguistic profile was not collected.
Furthermore, no evidence has been herewith provided on the longitudinal feasibility of the TLS, e.g. its responsiveness and susceptibility to practice effects. Future investigations are needed to examine such properties, given that language impairment features as a chronic condition in several brain disorders (i.e. post-stroke aphasia and primary progressive aphasia) [16]. In this respect, it would also be advisable that reliable change indices are derived and/or that parallel forms are developed.

Conclusions
In conclusion, the TLS is a valid, reliable, normed and clinically feasible TBCS test for language deficits. Future studies are nonetheless needed on its clinimetrics and feasibility, within both the cross-sectional and longitudinal dimensions, in both HPs and patients whose core cognitive feature is language impairment.

Table 1
Demographic and telephone-based cognitive data of the normative sample
* Derived as the sum of the language items included within the Italian telephone-based Mini-Mental State Examination [10] and the Tele-Global Examination of Mental State [13]
a Data available for N = 200 participants
b Data available for N = 266 participants
c Data available for N = 401 participants

Table 2
Item difficulty and discrimination values as yielded by the 2-PL IRT model in HPs (N = 401)
n.a. not applicable (the model yielded inadequate estimates), TLS Telephone Language Screener, 2-PL two-parameter logistic, IRT Item Response Theory, HPs healthy participants
* Item not originally dichotomous and thus dichotomized based on a raw score above vs. below the 5th percentile of the empirical distribution
■ Extremely easy item
▲ High discrimination
• Extremely high discrimination

Table 3
Adjustment equations and Equivalent Score thresholds for raw TLS measures
TLS Telephone Language Screener, AS adjusted score, CS Connected Speech, RS raw score, oTL outer tolerance limit, iTL inner tolerance limit, n.a. adjustment equation not available (no significant demographic predictors for this measure)
a Norms computed on the whole sample (N = 480)

Table 4
Patients' background and cognitive data
SVD small vessel disease, NDD neurodegenerative diseases, MMSE Mini-Mental State Examination, TICS Telephone Interview for Cognitive Status, TICS-L Telephone Interview for Cognitive Status-Language, BDS-T Backwards Digit Span-Total, BDS-WM Backwards Digit Span-Working Memory, SA Semantic Association, NtD-N Naming to Description of Nouns, NtD-V Naming to Description of Verbs, RoW Repetition of Words, RoNW Repetition of Non-Words, RoS Repetition of Sentences, CML Comprehension and Memory Load, IU Information Units
a Data available for N = 34 patients