Introduction

Currently, the standard diagnostic dementia work-up consists of a clinical evaluation, a full neuropsychological examination, and MR imaging of the brain, which often includes visual inspection of different brain regions using standardized rating scales, such as Scheltens’ medial temporal lobe atrophy (MTA) scale [41, 42] and the global cortical atrophy (GCA) scale [30]. On structural MRI, beyond mild MTA due to normal aging, at least moderate MTA is suggestive of Alzheimer’s disease (AD), in particular when the hippocampus is affected [33, 48]. Hippocampal atrophy is an established biomarker of AD [18, 29, 47]. Co-occurrence of reduced hippocampal volumes and enlargement of the inferior lateral ventricle (ILV), the two regions used to formulate the MTA score, is also typical of atrophy due to neurodegeneration in AD and facilitates differentiation from individuals with congenitally small hippocampi [11, 14, 46]. However, especially in earlier symptomatic stages of AD, hippocampal atrophy is hard to objectify by visual inspection [39]. The small size of the hippocampus, together with the existing variability in hippocampal volumes (also present in the normal population), complicates the detection of atrophy due to AD. According to Harper et al. (2015) [13], the inter-rater reliability of visual rating scales is satisfactory, with an intraclass correlation coefficient (ICC) of 0.8 for MTA and 0.6 for GCA. However, since visual rating scales depend on the MRI acquisition protocol, make subtle distinctions difficult to identify, are subject to intra- and inter-rater variability, and require trained neuroradiologists, automated volumetric assessment has become increasingly important as an additional measure for AD diagnostics, aiming to diagnose AD earlier, e.g., in the prodromal stage.
The added diagnostic value of automated hippocampal volumetry to the diagnostic confidence of AD (beyond neuropsychological evaluation, cerebrospinal fluid AD biomarkers, and brain FDG-PET scan) has been shown and emphasized in previous literature [3]. Even though automated volumetry is widely used in research settings, its integration in routine clinical practice is still an evolving process [7, 16, 25]. As suggested by Vernooij et al. [51], the main concerns hampering integration of automated software pertain to a lack of standardization and validation, doubts about specificity, and the difficulty of transferring research findings into the clinical setting to help diagnose an individual patient [24]. To overcome potential shortcomings of automated hippocampal volumetry alone for (early) AD diagnosis, we suggest the use of an automated MTA ratio, defined as the ratio between ILV and hippocampal volumes expressed as a percentage. The implementation of a continuous MTA variable, compared to a five-step scale, might offer a more fine-grained metric to differentiate between abnormality and normality. Notwithstanding the existing approaches for automated MTA scoring [20, 22, 31, 32, 34], successful integration is challenged by difficulties in accurately segmenting intricate anatomical structures (thus requiring manual correction) and in accommodating individual variations and unique patient characteristics [5, 7, 15, 35]. Therefore, continued efforts to validate, evaluate, improve, and complement these approaches are vital to overcome these complexities.

To this end, an ILV/hippocampus (Hip) ratio based on icobrain dm (dementia), a CE-marked and FDA-cleared automated volumetric post-processing software for clinical MRI scans, was developed [19, 36, 45]. The primary objective of this exploratory clinical study is to evaluate the performance of the ILV/Hip ratio and compare it with that of hippocampal volumes and visual ratings, as part of the validation of automated volumetry in routine clinical practice. The secondary objective is to investigate the correlation between visual MTA scales and the ILV/Hip ratio from icobrain dm in a heterogeneous patient population comprising different degrees of cognitive decline.

Material and methods

Study population

This study consisted of 112 subjects who underwent a clinical routine MRI examination in the context of a full cognitive clinical diagnostic work-up. All consenting patients between the ages of 60 and 90 (inclusive) who had a memory consultation at the department of neurology at UZ Brussel between September 2020 and December 2022 were considered for inclusion, irrespective of the severity of cognitive decline. Exclusion criteria consisted of MRI contraindications and structural lesions in the region of interest (temporal lobe). Patients were classified in compliance with the National Institute on Aging-Alzheimer’s Association criteria for “MCI due to AD” and “Dementia due to AD” [1, 9, 17, 26, 44]. Subjective cognitive decline (SCD) subjects were diagnosed according to the criteria of Jessen et al. (2014) [21]. Note that the aforementioned criteria were applied wherever possible, since not all cerebrospinal fluid (CSF) biomarkers were available for the entire study population. Therefore, the final diagnosis used in this study does not necessarily imply a biomarker-based diagnosis. In total, 16 cognitively healthy controls (CN), 33 SCD subjects, 35 mild cognitive impairment (MCI) patients, and 27 dementia (DEM) patients were included in this study. Lastly, a randomly selected patient with normal pressure hydrocephalus (NPH) was included to illustrate the effect of a large ILV volume on the MTA score.

MRI acquisition protocol

Brain MRI was performed in all participants using the Philips Ingenia 3T (Philips Medical Systems, Best, The Netherlands). The MRI examination consisted of a sagittal 3D T1-weighted sequence, a sagittal 3D FLAIR-weighted sequence, a coronal T2-weighted sequence, a 3D susceptibility-weighted imaging (SWI) sequence, and a diffusion-weighted imaging (DWI) sequence. For this study, only the 3D T1-weighted sequence was used, both for volumetry and for MTA scoring on a coronal reconstruction. All scan parameters of the 3D T1-weighted sequence are listed in Supplementary Material Table 1.

Image analysis

Visual assessment

The MTA scale by Scheltens, rated on coronal T1-weighted images, was determined individually by three experienced radiologists (G-J. A., T. V., and S. R.), blinded to diagnosis and sex. In case of discrepancy between individual ratings, a consensus MTA score was agreed upon. Images were viewed and evaluated on a Barco (Kortrijk, Belgium) diagnostic screen in AGFA Picture Archiving and Communication System (PACS).

Automated volumetry

From each T1-weighted image, automated brain volumetry was computed by icobrain dm (v 5.10) for total, left, and right hippocampal volumes, as well as for total, left, and right ILV volumes. The initial steps in icobrain dm’s pipeline included skull stripping, bias field correction, and computation of a head size normalization factor as the determinant of an affine transformation to MNI space. The hippocampal and ILV segmentations were obtained with a deep learning-based algorithm trained on a dataset of T1-weighted brain images with high variability both at the population level and in terms of scanners and acquisition parameters [28]. Additionally, a specific intensity-based augmentation strategy that enhances generalizability was used during training [27].

ILV/Hip ratio

As an automated alternative to the visual MTA score, the ILV/Hip ratio reflects the degree of hippocampal atrophy by accounting for both hippocampal volume loss and compensatory expansion of the ILV. It is defined as the ratio between ILV and hippocampal volumes, expressed as a percentage, and was calculated according to the following formula:

$$\text{ILV/Hip ratio}=\frac{\text{inferior lateral ventricle volume (ILV, cm}^3\text{)}}{\text{hippocampal volume (HC, cm}^3\text{)}}\times100$$

for each hemisphere (left and right) separately, as well as combined (total).
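The formula can be illustrated with a minimal Python sketch. The study itself relied on icobrain dm and R, so this is not the authors' implementation. The function name and the volumes are hypothetical, and computing the total ratio from the summed left and right volumes is an assumption, since the text does not state how the combined value is derived.

```python
def ilv_hip_ratio(ilv_cm3: float, hip_cm3: float) -> float:
    """Return the ILV/Hip ratio as a percentage (volumes in cm^3)."""
    if hip_cm3 <= 0:
        raise ValueError("hippocampal volume must be positive")
    return ilv_cm3 / hip_cm3 * 100.0


# Hypothetical volumes in cm^3 (illustrative only, not study data).
left = {"ilv": 0.5, "hip": 4.0}
right = {"ilv": 0.75, "hip": 3.75}

left_ratio = ilv_hip_ratio(left["ilv"], left["hip"])
right_ratio = ilv_hip_ratio(right["ilv"], right["hip"])
# Assumption: the total ratio uses the summed left and right volumes.
total_ratio = ilv_hip_ratio(left["ilv"] + right["ilv"],
                            left["hip"] + right["hip"])

print(f"left={left_ratio:.1f}%  right={right_ratio:.1f}%  total={total_ratio:.1f}%")
```

Because the ratio grows both when the ILV enlarges and when the hippocampus shrinks, it is more sensitive to their co-occurrence than either volume alone.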

Normative reference population

In order to integrate the variables age and sex in the interpretation of the ILV/Hip ratio score, a large reference dataset (n = 1903; age range, 6–96 years) was used. It comprised subjects without cognitive complaints from 14 different studies based on open-source data (Supplementary Material Table 2) and served to determine whether a patient’s ILV/Hip ratio score deviates from the score expected for an individual without cognitive complaints of the same age and sex. Normal aging, derived from the reference dataset, was modeled through univariate interpolating splines, fitted through the percentiles of a shifting age window. Comparing an ILV/Hip ratio score with the trends observed in the subjects without cognitive complaints resulted in a normative percentile adjusted for age and sex. The same methodology was applied to hippocampal and ILV volumes, creating a percentile score that can be compared to a chosen “cut-off.” Typically, the range between percentile 1 and percentile 99 can be considered a normal range. Any value outside this range might be considered abnormal, which can be used in clinical routine to integrate age and thus evaluate whether a subject’s ILV/Hip ratio score deviates from a healthy aging pattern. A value between the 90th and 99th percentiles can still be considered normal, since this can be inherent to the normal distribution of biological variables, but should nevertheless be interpreted with caution, suggesting that clinical follow-up within 1–2 years might be warranted.
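The idea of an age- and sex-adjusted percentile can be sketched in Python as follows. This is a deliberate simplification and not the method used in the study: it computes an empirical percentile within a fixed age window and omits the spline smoothing. The function names, the window half-width of 5 years, and the synthetic reference data are all assumptions made for illustration.

```python
from bisect import bisect_right


def windowed_percentile(value, age, sex, reference, half_width=5.0):
    """Empirical percentile of `value` among same-sex reference subjects
    whose age lies within +/- half_width years of `age`."""
    peers = sorted(
        r["value"] for r in reference
        if r["sex"] == sex and abs(r["age"] - age) <= half_width
    )
    if not peers:
        raise ValueError("no reference subjects in this age window")
    # Share of peers with a value less than or equal to the subject's value.
    return 100.0 * bisect_right(peers, value) / len(peers)


def zone(percentile):
    """Map a percentile of a 'higher is worse' measure (such as the
    ILV/Hip ratio) onto the zones described in the text."""
    if percentile < 1 or percentile > 99:
        return "abnormal"
    if percentile >= 90:
        return "caution"
    return "normal"


# Synthetic reference: 100 women aged 65-75 with ratio values 1..100 (%).
reference = [
    {"age": 65 + (i % 11), "sex": "F", "value": float(i)}
    for i in range(1, 101)
]
p = windowed_percentile(95.0, age=70, sex="F", reference=reference)
print(p, zone(p))
```

For hippocampal volumes, where a low value is the concerning direction, the caution band would be mirrored to the lower percentiles.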

To demonstrate clinical interpretability on patient level, individual cases for each diagnostic category (CN, SCD, MCI, and DEM), as well as the NPH case, were presented. Lastly, an error bar (EB) calculation was performed to evaluate performance specifications, where the error bar interval (− EB and + EB) contains the difference between test and retest values with 90% confidence. This is important since automated measurements can be subject to measurement errors. Therefore, measurement variability should also be considered during result interpretation.
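One possible reading of the error bar is sketched below in Python. The exact procedure is not described in the text, so this sketch assumes an empirical, distribution-free definition: EB is the smallest value that covers 90% of the absolute test-retest differences. The function names and the data are hypothetical.

```python
import math


def error_bar(test_values, retest_values, coverage=0.90):
    """Smallest EB such that at least `coverage` of the absolute
    test-retest differences are <= EB (assumed empirical definition)."""
    diffs = sorted(abs(t - r) for t, r in zip(test_values, retest_values))
    if not diffs:
        raise ValueError("need at least one test-retest pair")
    # Small epsilon guards against floating-point round-off in the product.
    k = math.ceil(coverage * len(diffs) - 1e-9)
    return diffs[max(k, 1) - 1]


def exceeds_error_bar(baseline, follow_up, eb):
    """True if a change is larger than expected from measurement noise."""
    return abs(follow_up - baseline) > eb


# Hypothetical test-retest measurements (arbitrary units).
test = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
retest = [t - d for t, d in zip(test, range(1, 11))]
eb = error_bar(test, retest)
print(eb, exceeds_error_bar(50, 60, eb), exceeds_error_bar(50, 55, eb))
```

Under this reading, a difference between two measurements that stays within the interval from −EB to +EB cannot be distinguished from measurement variability.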

Statistical analysis

Descriptive statistics

R environment (R-Studio, v.1.0.136) for statistical computing and graphics was used for all data processing with the following “packages” and (functions). Demographic information was reported as percentages, mean and standard deviation (SD) and/or median and interquartile range (IQR). For categorical variables, the chi-square test of independence was used, while continuous variables were analyzed by the ANOVA test, with a significance level of 0.05 (R package: “arsenal” (tableby and write2word)).

Inter-rater variability analysis

To ensure the quality of the visual assessment for an adequate comparison to automated volumetry, the inter-rater variability was evaluated through the intraclass correlation coefficient (ICC, with 95% confidence intervals (CI)), a measure of reproducibility between repeated measurements of the same item carried out by different observers (R package: “psych” (ICC, v. 2.3.0)). A two-way mixed model, single measurement, with absolute agreement was used. The output was the ICC estimate with its respective confidence intervals [38, 43]. The mean ICC and CI were calculated using Fisher’s z transformation: the ICC values were transformed to z-scores (R function: (atanh)), the mean of the z-scores was calculated, and the (tanh) function was then applied to obtain the mean ICC value.
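The averaging step via Fisher’s z transformation can be sketched in a few lines of Python, mirroring the atanh and tanh calls mentioned above. The study used R, so this is only an illustration. The function name and the three ICC values are hypothetical.

```python
import math


def mean_icc_fisher(iccs):
    """Average correlation-type coefficients on Fisher's z scale."""
    z_scores = [math.atanh(r) for r in iccs]   # r -> z
    mean_z = sum(z_scores) / len(z_scores)     # average on the z scale
    return math.tanh(mean_z)                   # z -> r


# Hypothetical pairwise ICC estimates (not study results).
pairwise = [0.70, 0.80, 0.90]
print(round(mean_icc_fisher(pairwise), 3))
```

Averaging on the z scale gives more weight to coefficients close to 1 than a plain arithmetic mean, so the result is slightly higher than the arithmetic mean of the raw values. The same transformation can be applied to the confidence limits.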

The ICC is a value going from 0, which indicates no agreement, to 1, indicating absolute agreement, which can be interpreted as either poor (x < 0.50), moderate (0.50 < x < 0.75), good (0.75 < x < 0.90), or excellent (x > 0.90), when taking into account the 95% confidence intervals of the ICC estimate, as suggested by Koo and Li in 2016 [2, 23]. The ICC was calculated using the following formula:

$$ICC=\frac{S_A^2}{S_A^2+S_W^2}$$

where $S_A^2$ is the variance among groups, and $S_W^2$ is the variance within groups [53]. The intra-rater variability was not assessed, as this was beyond the scope of this study.
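To make the variance-ratio formula concrete, the Python sketch below estimates the two variance components from a one-way ANOVA decomposition, treating each subject as a group and the raters as repeated measurements. This is a simplification: the study used a two-way mixed model with absolute agreement from the R package “psych”, which this sketch does not reproduce. The function name and the ratings are hypothetical.

```python
def icc_one_way(ratings):
    """One-way ICC: among-subject variance / (among + within variance).

    `ratings` is a list of rows, one row per subject, one column per rater.
    """
    n = len(ratings)        # number of subjects (groups)
    k = len(ratings[0])     # number of raters per subject
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    s2_among = (ms_between - ms_within) / k   # estimate of the among-groups variance
    s2_within = ms_within                     # estimate of the within-groups variance
    return s2_among / (s2_among + s2_within)


# Hypothetical MTA scores from three raters for four subjects.
scores = [[0, 1, 0], [2, 2, 3], [4, 3, 4], [1, 1, 1]]
print(round(icc_one_way(scores), 3))
```

When all raters agree perfectly, the within-groups variance is zero and the coefficient equals 1.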

Association analysis

The association between the visual MTA rating (each rater separately and the MTA consensus), cognitive outcome (reflected by the Mini-Mental State Examination, MMSE), and the automated brain segmentations computed by icobrain dm (total, left, and right hippocampal volumes, total, left, and right ILV volumes, and the total, left, and right ILV/Hip ratio) was first quantified using Spearman’s correlation analysis (R package: “base R” (cor)) (alpha < 0.05). Spearman’s rank correlation is a non-parametric measure that is robust to potential non-linearity and outliers, which are often encountered when comparing ordinal (visual MTA) and continuous (automated brain segmentation) variables. Spearman’s values range from −1 to 1. A value of −1 indicates a perfect negative monotonic relationship; 0 indicates no monotonic relationship; and 1 indicates a perfect positive monotonic relationship. The strength of the relationship can be interpreted as follows: very weak (|x| < 0.20), weak (0.20 ≤ |x| < 0.40), moderate (0.40 ≤ |x| < 0.60), strong (0.60 ≤ |x| < 0.80), or very strong (|x| ≥ 0.80). The sign gives the direction: a positive value indicates that both variables increase together, and a negative value implies an inverse relationship.
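A compact Python sketch of Spearman’s rank correlation, including the handling of tied ranks that ordinal MTA scores inevitably produce, is given below. The study used R’s cor function, so this is an illustrative re-implementation, not the authors’ code. The function names and example data are hypothetical; the strength labels follow the thresholds listed above.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def strength(rho):
    a = abs(rho)
    if a < 0.20:
        return "very weak"
    if a < 0.40:
        return "weak"
    if a < 0.60:
        return "moderate"
    if a < 0.80:
        return "strong"
    return "very strong"


# Hypothetical visual MTA scores and ILV/Hip ratios (%).
mta = [0, 1, 1, 2, 3, 4]
ratio = [8.0, 12.0, 10.0, 19.0, 35.0, 60.0]
rho = spearman(mta, ratio)
print(round(rho, 3), strength(rho))
```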

Additionally, for a more comprehensive analysis and to not be overly reliant on a single method, Kendall’s Tau (R package: “base R” (cor)) (alpha < 0.05) was used to give more weight to the ordinal nature of the visual MTA ratings and to examine the concordance between the automated measurements and visual MTA ratings. Kendall’s Tau, like Spearman’s correlation, ranges from −1 to 1, where −1 indicates a perfect negative association; 0 implies no association; and 1 signifies a perfect positive association. The strength of the association can be summarized as very weak (|x| < 0.10), weak (0.10 ≤ |x| < 0.30), moderate (0.30 ≤ |x| < 0.50), strong (0.50 ≤ |x| < 0.70), or very strong (|x| ≥ 0.70).
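The pair-counting logic behind Kendall’s Tau can be sketched in Python as follows. The sketch implements the tau-b variant, which corrects for ties and is therefore suited to ordinal MTA scores. That the study’s values correspond to tau-b is an assumption, as the text does not name the variant. The function name and data are hypothetical.

```python
def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant/discordant pair counts (O(n^2))."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                 # tied on both: drops out entirely
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = ((untied + only_x_tied) * (untied + only_y_tied)) ** 0.5
    if denom == 0:
        raise ValueError("tau-b is undefined when a variable is constant")
    return (concordant - discordant) / denom


# Hypothetical example: tied MTA scores against a continuous measurement.
tau = kendall_tau_b([0, 0, 1, 1], [1.0, 2.0, 3.0, 4.0])
print(round(tau, 4))
```

Kendall’s Tau is usually smaller in magnitude than Spearman’s rho on the same data, which is why the two measures use different interpretation thresholds.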

Moreover, a logistic regression was performed (R package: “rms” (lrm)) to further evaluate the relationship between the visual MTA as outcome and the automated brain segmentations as predictor. Various additional scale invariant metrics, including the concordance index (c-index/area under the curve (AUC)) and the Brier score, were used to further quantitatively evaluate the strength, direction, degree of association, and correlation.

The AUC assesses how well the model distinguishes between the different visual MTA severity scores based on the predicted probabilities. The AUC varies between 0 and 1. As a general guideline, an AUC > 0.90 signifies excellent discrimination, 0.80 ≤ AUC < 0.90 represents good, 0.70 ≤ AUC < 0.80 denotes moderate to good, 0.60 ≤ AUC < 0.70 reflects moderate to poor, and AUC < 0.60 indicates very poor discrimination. An AUC of 0.50 indicates no discriminative power (random chance). Thus, a high AUC suggests the model is effective at ordering and ranking cases according to MTA severity and would indicate that both scoring methods provide very similar rankings, and by extension, high correlation, encouraging the prospects for interchangeability.
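The ranking interpretation of the c-index can be made explicit with a short Python sketch: among all pairs of cases with different outcomes, it counts how often the case with the higher outcome also has the higher predicted value. For a binary outcome, this equals the AUC. The function name and data are hypothetical, and the sketch is not the computation performed by the “rms” package.

```python
def c_index(predicted, outcome):
    """Share of comparable pairs ranked correctly (ties count as 0.5)."""
    concordant = tied = comparable = 0
    n = len(outcome)
    for i in range(n):
        for j in range(i + 1, n):
            if outcome[i] == outcome[j]:
                continue                  # same outcome: pair is not comparable
            comparable += 1
            hi, lo = (i, j) if outcome[i] > outcome[j] else (j, i)
            if predicted[hi] > predicted[lo]:
                concordant += 1
            elif predicted[hi] == predicted[lo]:
                tied += 1
    if comparable == 0:
        raise ValueError("all outcomes are identical")
    return (concordant + 0.5 * tied) / comparable


# Hypothetical predicted probabilities and binary outcomes.
c = c_index([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(c)
```

A value of 0.5 corresponds to random ordering and a value of 1 to perfect ordering, which is why the measure is independent of the scale of the predictor.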

The Brier score quantifies the accuracy of predicted probabilities, where a low Brier score suggests accurate predictions, further reinforcing correlation between the two variables. A Brier score ≤ 0.2 indicates excellent model performance; 0.2 < Brier score ≤ 0.25 reflects good performance; 0.25 < Brier score ≤ 0.3 suggests fair performance; 0.3 < Brier score ≤ 0.35 signifies poor performance; and a Brier score > 0.35 implies very poor model performance.
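For a binary outcome, the Brier score is the mean squared difference between the predicted probability and the observed outcome (0 or 1). A minimal Python sketch follows. The function name and data are hypothetical, and the sketch covers the binary case only, not the way the “rms” package handles an ordinal outcome.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical predictions for four cases.
bs = brier_score([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0])
print(round(bs, 3))
```

A model that always predicts 0.5 obtains a Brier score of 0.25, which puts the interpretation thresholds above into perspective.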

Even though these additional metrics do not offer a traditional correlation coefficient, they do offer information about the quality of predictions and alignment between predicted and actual values, which indirectly reflects the degree of correlation.

Classification accuracy

To determine whether the ILV/Hip ratio exhibits comparable diagnostic precision or provides additional information compared to other measurements, classification accuracy through logistic regression of the visual MTA ratings (each rater separately and the MTA consensus) and the automated volumetric measurements (total, left, and right hippocampal volumes, total, left, and right ILV volumes, and the total, left, and right ILV/Hip ratio), was conducted for each variable separately as predictors. The following pairwise combinations of disease stages were considered as binary outcomes: SCD vs. CN, MCI vs. CN, DEM vs. CN, MCI vs. SCD, DEM vs. SCD, and DEM vs. MCI.

Classification performance was evaluated using receiver operating characteristic (ROC) analysis, with the R package “pROC” (roc, auc, coords, and ci) and the “stats” (predict and glm) package [40]. AUC, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were documented for each pairwise combination of disease stages. The AUC was computed with the trapezoidal rule. The Youden index was used to determine the threshold that maximizes the distance to the identity (diagonal) line, from which the sensitivity, specificity, PPV, and NPV were calculated. In addition, for each binary classification, resampling with replacement was employed to estimate the variability of the AUC. The resulting bootstrap-based confidence intervals were then used to investigate whether the AUC values of the variables differed significantly.
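The cut-off selection and the bootstrap step can be sketched in Python as follows. This is an illustration under stated assumptions, not the pROC implementation: the AUC is computed by pairwise comparison (which equals the trapezoidal area under the empirical ROC curve), higher scores are assumed to indicate the positive class, and the bootstrap is a plain percentile bootstrap without stratification. All function names and data are hypothetical.

```python
import random


def auc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(scores, labels):
    """Threshold maximizing J = sensitivity + specificity - 1."""
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l == 1)
        tn = sum(1 for s, l in zip(scores, labels) if s < t and l == 0)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        sens, spec = tp / (tp + fn), tn / (tn + fp)
        j = sens + spec - 1
        if best is None or j > best["youden_j"]:   # first maximum is kept
            best = {
                "cutoff": t, "youden_j": j,
                "sensitivity": sens, "specificity": spec,
                "ppv": tp / (tp + fp),
                "npv": tn / (tn + fn) if (tn + fn) else float("nan"),
            }
    return best


def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (resampling with replacement)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = [labels[i] for i in idx]
        if 0 < sum(resampled) < n:                 # need both classes present
            stats.append(auc([scores[i] for i in idx], resampled))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical ILV/Hip ratios (%) with 1 = DEM and 0 = CN.
scores = [5, 8, 12, 15, 18, 25, 30, 40]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(auc(scores, labels), youden_cutoff(scores, labels))
print(bootstrap_auc_ci(scores, labels, n_boot=500))
```

For a measure where low values indicate disease, such as hippocampal volume, the scores would be negated before calling these functions. Overlapping bootstrap intervals, as used in the study, are a conservative way to judge whether two AUC values differ.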

Results

Study population

The demographic and volumetric characteristics of the study population are presented in Table 1. This study population consisted of one-hundred-twelve subjects with a mean age (± SD) of 66.85 ± 13.64 years, composed of cognitively healthy controls (N = 16), SCD subjects (N = 33), and patients (MCI = 35, DEM = 27, and NPH = 1) belonging to different stages of cognitive decline.

Table 1 Demographic and volumetric characteristics of the study population

Inter-rater variability of visual assessment

To assess visual assessment reproducibility, the inter-rater variability for each visually rated MTA score (total, left, and right) was determined for all pairwise and all-rater comparisons (Table 2). The largest inter-rater variability was found between Rater I and Rater II for the right MTA score (0.714 [0.610, 0.794]). The smallest overall inter-rater variability was seen between Rater I and Rater III for the total MTA score (0.907 [0.868, 0.935]).

Table 2 Inter-rater variability

Relationship between automated volumetry and the visual MTA score

Subsequently, the association between visual MTA scores (each rater separately and the MTA consensus), cognitive outcome (reflected by the MMSE), and the automated brain segmentations computed by icobrain dm (total, left, and right hippocampal volumes, total, left, and right ILV volumes, and total, left, and right ILV/Hip ratio) was assessed using ordinal logistic regression, the Brier score, Kendall’s Tau, and Spearman’s correlation analysis (Table 3). Additionally, the automated brain structure volumes and calculated ILV/Hip ratio versus the consensus MTA rating scores were visualized per MTA score (0–4) (Fig. 1). The automated measurements and consensus MTA scores versus the MMSE were visualized in Fig. 2.

Table 3 The relationship between visual MTA and automated volumetry
Fig. 1

Violin plots of visually assessed medial temporal atrophy (MTA) score and ILV/Hip ratio (left: a, total: b, and right: c), hippocampal volumes (left: d, total: e, and right: f), and inferior lateral ventricular (ILV) volumes (left: g, total: h, and right: i)

Fig. 2

Spearman’s correlations graphs of cognitive outcome reflected by the Mini-Mental State Examination (MMSE) versus visually assessed medial temporal atrophy (MTA) scores (left: a, total: b, and right: c), ILV/Hip ratio (left: d, total: e, and right: f), hippocampal volumes (left: g, total: h, and right: i), and inferior lateral ventricular (ILV) volumes (left: j, total: k, and right: l)

A high consistency was observed across the various association metrics testing for potential interchangeability among the considered variables. Within both individual raters’ analyses and the consensus, the consensus scores demonstrated the strongest correlation to the automated measurements. Specifically, the total, left, and right ILV/Hip ratio within the consensus scores displayed the most robust associations with the corresponding total, left, and right visual MTA ratings. Moreover, the ILV/Hip ratio showed a uniform pattern of higher AUC, Kendall-Tau, and Spearman coefficients, along with lower Brier scores, in comparison to the correlation between visual MTA ratings and the ILV, or hippocampal volumes alone.

The Spearman’s correlation analysis indicated a moderate to strong negative significant (p < 0.001) correlation for total (ρ = −0.659), left (ρ = −0.608), and right (ρ = −0.670) hippocampal volumes versus the consensus MTA ratings. Conversely, a strong positive and significant (p < 0.001) correlation was seen for the total (ρ = 0.867), left (ρ = 0.836), and right (ρ = 0.864) ILV volumes versus the consensus MTA rating, comparable to the total (ρ = 0.877), left (ρ = 0.851), and right (ρ = 0.866) ILV/Hip ratio score versus the consensus MTA rating. It is noteworthy that, while individual hippocampal volumes showed the lowest Spearman’s correlation to the MTA score, individual ILV volumes exhibited a slightly lower correlation to the MMSE score (total (ρ = −0.400), left (ρ = −0.403), and right (ρ = −0.368)) compared to the other examined measures (visual MTA: total (ρ = −0.492), left (ρ = −0.463), and right (ρ = −0.443); ILV/Hip ratio: total (ρ = −0.464), left (ρ = −0.466), and right (ρ = −0.402); and hippocampus: total (ρ = 0.496), left (ρ = 0.445), and right (ρ = 0.497)).

The Kendall Tau analysis provided an additional complementary layer of confirmation, consistently mirroring the patterns previously observed in Spearman’s correlation. The concordance among different correlation measures suggests the independence of observed relationships from specific data characteristics and underscores the reliability of the interconnectedness between variables, regardless of the analytical approach chosen.

Classification accuracy

The classification accuracy of the automated brain segmentations and the visual MTA consensus ratings is listed in Table 4. The ILV/Hip ratio measurements demonstrated classification performance competitive with the visual MTA scores, with varying levels of sensitivity and specificity across the different pairwise comparisons. No significant differences were observed between the two methods based on the confidence intervals of the AUC values. As the disease severity gap widens from milder stages (e.g., SCD vs. CN) to more severe stages (e.g., DEM vs. CN), there is a notable trend of increasing AUC, sensitivity, and specificity.

Table 4 Classification accuracy

Particularly, when assessing DEM vs. CN, both the visual MTA and the total ILV/Hip ratio showed excellent and comparable performance: an AUC [CI] of 0.953 [0.898–0.999], with a sensitivity of 0.778 and a specificity of 0.938, for the total visual MTA consensus rating, and an AUC of 0.938 [0.853–0.999], with a sensitivity of 0.963 and a specificity of 0.875, for the total ILV/Hip ratio. Similarly, the individual ILV volumes demonstrated excellent discriminative power, with an AUC of 0.912 [0.795–0.999], sensitivity of 0.963, and specificity of 0.850. The individual hippocampal volumes showed a slightly lower, but still strong performance, with an AUC of 0.882 [0.784–0.980], sensitivity of 0.741, and specificity of 0.938. Interestingly, for the DEM vs. MCI pairwise comparison, individual total ILV volumes exhibited a high sensitivity of 0.960, but a notably lower specificity of 0.486, compared to a well-balanced sensitivity of 0.740 and specificity of 0.686 for the total ILV/Hip ratio. Moreover, total hippocampal volumes showed a comparable pattern in specificity, but a consistently lower sensitivity compared to the total ILV/Hip ratio. Recall that all sensitivity and specificity values were computed at the Youden index, which maximizes the sum of sensitivity and specificity; other trade-offs between sensitivity and specificity would be obtained by using different cut-off selection methods.

The ILV/Hip ratio and individual ILV volumes performing similarly, with slightly lower outcomes observed for hippocampal volumes, is a consistent trend across all pairwise comparisons, emphasizing a commensurable performance in terms of classification accuracy for automated brain segmentations and visual MTA consensus ratings.

Comparison to a normative reference population

Subsequently, all automated segmentations were compared to a reference population of subjects without cognitive complaints (n = 1903), depicting the normal ranges of selected brain structure volumes/ratios across a relevant age interval (Fig. 3). Performance specifications, featuring error bar results, can be found in Supplementary Material Table 3.

Fig. 3

Population graphs: ILV/Hip ratio (in %, left: a, total: b, and right: c), inferior lateral ventricular (ILV) volumes (in mL, left: d, total: e, and right: f), and hippocampal volumes (in mL, left: g, total: h, and right: i). The graphs show the median and normal ranges (percentiles 1 to 99, with percentiles 90, 50, and 10 also represented) of selected brain structure volumes/ratios of cognitively healthy subjects (n = 1903) across a relevant age interval. The time point volumes are marked as circles. The size of the circles corresponds to the Scheltens’ MTA consensus score given by the three radiologists, with MTA = 0 corresponding to the smallest circle size and MTA = 4 to the largest. Color-coded dot representation: CN, dark green; SCD, green; MCI, orange; DEM, red; and NPH, purple. Color-coded population graph percentile interpretation: 1–90th percentile, green (normal); 90–99th percentile, orange (caution); and > 99th percentile, dark blue (abnormality)

For every individual, the ILV/Hip ratio was compared with the normative reference population. Individual values are illustrated by dots on the graph, and the reference population is represented through a color-coded background and percentile lines indicated on the right side of the graph. Each individual’s dot size corresponds to their visual MTA consensus score, with larger circles denoting higher MTA scores. On the ILV/Hip ratio population graphs, a clear correlation with increasing MTA scores (dot size), diagnosis (color-coded), and reference population percentiles is visible. The classification of MTA “severity” was proportional between visual MTA ratings and the percentiles of ILV/Hip ratio measurements on the population graphs, indicating an equivalent performance of visual assessment and automated volumetry for MTA determination. For the NPH case, it is more likely that the MTA severity is caused by a deviation of ILV volumes (Fig. 3d–f, purple circle located in the blue zone) and not by an abnormal hippocampal volume (Fig. 3g–i, purple circle located in the orange zone). Individual percentiles per subject can be found in Supplementary Material Table 4.

Clinical interpretability at patient level

To validate clinical interpretability at patient level, a subject was selected from each diagnostic category and assessed individually. The following brain structures were evaluated: upper lateral ventricles (green), inferior lateral ventricles (purple), and hippocampal volumes (yellow).

In the case of the CN (age, 83.3 years old) subject, all automated measurements fell within the 1st to the 90th percentile range when compared to an age-matched reference population of healthy individuals (Fig. 4). This aligns with the visual assessment, which yielded an MTA score of 1 for both left and right hemispheres, as well as the total MTA score.

Fig. 4

Cognitively normal (CN) case. Cross-sectional coronal T1-weighted image at the level of the medial temporal lobe and hippocampus with segmentation of the upper lateral ventricles (green), inferior lateral ventricle (purple), and hippocampus (yellow). A: Automated volumetric measurements; left: axial view; middle: sagittal view; right: coronal view. B: Population graphs; left: hippocampal volumes of this subject showing the gray “x” marker in the green region between the 50th and 90th percentiles; middle: the ILV/Hip ratio of this subject showing the gray “x” marker in the green region between the 1st and 50th percentiles; right: ILV volumes of this subject showing the gray “x” marker in the green region between the 1st and 50th percentiles. Color-coded population graph percentile interpretation: 1–90th percentile, green (normal); 90–99th percentile, orange (caution); and > 99th percentile, dark blue (abnormality)

For the SCD subject (age, 71.9 years old), the hippocampal volumes fell between the 50th and 90th percentiles, which is within the expected normal range compared to a population of similar age. In contrast, the ILV volume and ILV/Hip ratio fell between the 90th and 99th percentiles, suggesting caution. Visual inspection yielded a consistent rating of 2 across all MTA measurements, which, taking the subject’s age into account, was considered abnormal (Fig. 5).

Fig. 5

Subjective cognitive decline (SCD) case. Cross-sectional coronal T1-weighted image at the level of the medial temporal lobe and hippocampus with segmentation of the upper lateral ventricles (green), inferior lateral ventricle (purple), and hippocampus (yellow). A: Automated volumetric measurements; left: axial view; middle: sagittal view; right: coronal view. B: Population graphs; left: hippocampal volumes of this subject showing the gray “x” marker in the green region just above the 50th percentile; middle: the ILV/Hip ratio of this subject showing the gray “x” marker in the orange region between the 90th and 99th percentiles; right: ILV volumes of this subject showing the gray “x” marker in the orange region between the 90th and 99th percentiles. Color-coded population graph percentile interpretation: 1–90th percentile, green (normal); 90–99th percentile, orange (caution); and > 99th percentile, dark blue (abnormality)

The MCI patient’s (age, 72.7 years old) hippocampal volumes fell below the 1st percentile, indicating significant atrophy in this region (Fig. 6). Conversely, the ILV volume surpassed the 90th percentile, leading to an ILV/Hip ratio above the 99th percentile, which signifies that this outcome is rare among the healthy population. The MCI patient received a visual MTA score of 2 for each of the measures, as did the SCD subject.

Fig. 6

Mild cognitive impairment (MCI) case. Cross-sectional coronal T1-weighted image at the level of the medial temporal lobe and hippocampus with segmentation of the upper lateral ventricles (green), inferior lateral ventricle (purple), and hippocampus (yellow). A: Automated volumetric measurements; left: axial view; middle: sagittal view; right: coronal view. B: Population graphs; left: hippocampal volumes of this subject showing the gray “x” marker in the blue region, below the 1st percentile; middle: the ILV/Hip ratio of this subject showing the gray “x” marker in the blue region, just above the 99th percentile; right: ILV volumes of this subject showing the gray “x” marker in the orange region, above the 90th percentile. Color-coded population graph percentile interpretation: 1–90th percentile, green (normal); 90–99th percentile, orange (caution); and > 99th percentile, dark blue (abnormality)

For the DEM patient (age, 77.5 years), the ILV volume and the corresponding ILV/Hip ratio substantially exceeded the 99th percentile of the healthy age-matched reference population, while the hippocampal volume fell below the 1st percentile (Fig. 7). Visual MTA assessment yielded a corresponding score of 4 for the left and 3 for the right hemisphere, resulting in a total consensus rating of 3.5. Lastly, in the NPH case (age, 73.7 years), despite a normal hippocampal volume, there was a noticeable enlargement of the lateral ventricles (Fig. 8). This led to an ILV/Hip ratio that exceeded the 99th percentile, a characteristic observation in patients with normal pressure hydrocephalus, which is one of the causes of dementia that can be managed and potentially reversed with appropriate treatment. In line with the automated volumetric measurements, this patient received a visual MTA score of 4 for the left hemisphere, the right hemisphere, and the total MTA score.

Fig. 7

Dementia (DEM) case. Cross-sectional coronal T1-weighted image at the level of the medial temporal lobe and hippocampus with segmentation of the upper lateral ventricles (green), inferior lateral ventricle (purple), and hippocampus (yellow). A — Automated volumetric measurements left: axial view; middle: sagittal view; right: coronal view. B — Population graphs left: hippocampal volumes of this subject showing the gray “x” marker in the blue region, below the 1st percentile; middle: the ILV/Hip ratio of this subject showing the gray “x” marker in the blue region, above the 99th percentile. Right: ILV volumes of this subject showing the gray “x” marker in the blue region, above the 99th percentile. Color-coded population graph percentile interpretation; 1–90th percentiles, green (normal); 90–99th percentiles, orange (caution); > 99th percentile, dark blue (abnormality)

Fig. 8

Normal-pressure hydrocephalus (NPH) case. Cross-sectional coronal T1-weighted image at the level of the medial temporal lobe and hippocampus with segmentation of the upper lateral ventricles (green), inferior lateral ventricle (purple), and hippocampus (yellow). A — Automated volumetric measurements left: axial view; middle: sagittal view; right: coronal view. B — Population graphs left: hippocampal volumes of this subject showing the gray “x” marker in the orange region, between the 1st and 10th percentile; middle: the ILV/Hip ratio of this subject showing the gray “x” marker in the blue region, above the 99th percentile; right: ILV volumes of this subject showing the gray “x” marker in the blue region, above the 99th percentile. Color-coded population graph percentile interpretation; 1–90th percentiles, green (normal); 90–99th percentiles, orange (caution); > 99th percentile, dark blue (abnormality)

Discussion

In this study, we investigated the ILV/Hip ratio computed by icobrain dm as a potential additional metric in the diagnostic work-up of neurodegenerative diseases such as AD. Our findings demonstrate that, in addition to the aforementioned advantages of automated volumetry over manual assessment, the performance of the ILV/Hip ratio score is comparable to that of the consensus of three individual raters (who showed a high level of inter-rater agreement), indicating that the ILV/Hip ratio score can serve as an additional (e.g., confirmative) metric for MTA stage and progression.

This study has shown that ILV volumes and ILV/Hip ratio scores correlate more strongly with visually assessed MTA ratings than automated hippocampal volumes do. This emphasizes the importance of considering not only the hippocampal volumes but also the inferior lateral ventricle as an equally important structure in MTA assessment in neurodegenerative disorders.

Regarding the ordinal logistic regression analysis, it needs to be noted that differences in coefficients among hippocampal volumes, ILV volumes, and the ILV/Hip ratio are influenced by variations in their scales and units. Specifically, the ILV/Hip ratio is a percentage, while individual hippocampal volumes and ILV volumes are measured in milliliters, making direct comparisons challenging. These differences emphasize the importance of relying on scale-invariant metrics, such as AUC, Brier scores, Kendall's tau, and Spearman coefficients, for an accurate assessment of correlation strength. Despite the lower coefficients observed for the ILV/Hip ratio compared to the individual hippocampal and ILV volumes, the additional scale-invariant metrics reveal a stronger correlation to the visual MTA ratings. Lastly, regarding the correlation to cognitive outcome, the slightly lower correlation coefficients for individual ILV volumes suggest that, in contrast to visual MTA, the ILV/Hip ratio, and hippocampal volumes, the individual ILV volumes may have, as was to be expected, a less pronounced relationship with cognitive function as assessed by MMSE scores. This observation, while anticipated, underscores the potential added value of incorporating the ILV/Hip ratio to enhance the precision of cognitive assessment beyond the singular focus on individual volumes. Furthermore, it needs to be noted that the MMSE, being a single-time-point and general measure of cognitive function that lacks the specificity needed for diagnostic differentiation, may not be sensitive to early stages of cognitive decline. It might not capture subtle cognitive changes that occur in early phases of the disease, or differences among individuals with near-normal (ceiling effect) or very low (floor effect) cognitive function. In addition, different levels of education and diverse cultural backgrounds might introduce bias. Finally, the MMSE may not provide a comprehensive evaluation of all cognitive functions affected in AD.
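
The dependence of regression coefficients on measurement units, as opposed to the unit independence of rank-based correlation, can be illustrated with a small sketch. The data below are hypothetical, and an ordinary least-squares slope is used as a simplified stand-in for the ordinal logistic regression coefficient of the actual analysis. Expressing the same volumes in microliters instead of milliliters rescales the slope by a factor of 1000, whereas the Spearman coefficient is unchanged because it depends only on ranks.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def ols_slope(x, y):
    """Slope of a simple least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)


# Hypothetical ILV volumes and visual MTA scores, for illustration only.
ilv_ml = [0.5, 0.8, 1.2, 2.0, 3.5]
mta = [0, 1, 1, 2, 4]
ilv_ul = [v * 1000 for v in ilv_ml]  # the same volumes in microliters

slope_ml, slope_ul = ols_slope(ilv_ml, mta), ols_slope(ilv_ul, mta)
rho_ml, rho_ul = spearman(ilv_ml, mta), spearman(ilv_ul, mta)

print("slope per ml:", slope_ml, "| slope per microliter:", slope_ul)
print("Spearman (ml):", rho_ml, "| Spearman (microliter):", rho_ul)
```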

In terms of diagnostic purposes, it needs to be stressed that the MTA score is not specific for AD, and normal imaging findings in the medial temporal lobe region do not exclude AD either, especially in the early stages. When assessing the classification accuracy, both the ILV/Hip ratio and the visual MTA ratings showed similar consistency in their ability to discriminate between different pairwise disease stage comparisons. The closely matched AUC values and overlapping CI values between different methods suggest no statistically significant difference in classification accuracy. These findings collectively show a robust and potentially equivalent predictive performance across the spectrum of disease severity, as evidenced by the improving discrimination metrics and decreasing prediction errors across the considered variables.

Moreover, considering their comparable specificity patterns, both total individual hippocampal volumes and the total ILV/Hip ratio are implicated in holding value for confirming specific diagnoses. Nevertheless, the consistently lower sensitivity (at the Youden index) observed in individual total hippocampal volumes suggests they may be less effective than the total ILV/Hip ratio in identifying borderline cases.
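
For reference, the Youden index is J = sensitivity + specificity − 1, and the operating point "at the Youden index" is the threshold that maximizes J. The following is a minimal sketch of that selection, assuming a measure for which higher values indicate disease (such as the ILV/Hip ratio). The function name and the data are hypothetical and do not reproduce the analysis of this study.

```python
def youden_optimal_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity) maximising Youden's J.

    A case is called positive when score >= threshold.
    labels: 1 = patient, 0 = control. Ties in J keep the lowest threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= threshold)
        tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s < threshold)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1
        if best is None or j > best[0]:
            best = (j, threshold, sensitivity, specificity)
    return best[1], best[2], best[3]


# Hypothetical ratio values (%) and group labels, for illustration only.
scores = [10, 12, 15, 18, 22, 25, 30, 40]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

threshold, sensitivity, specificity = youden_optimal_threshold(scores, labels)
print(threshold, sensitivity, specificity)
```

A measure with comparable specificity but lower sensitivity at this operating point misses more true cases, which is the pattern described above for the total hippocampal volumes.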

In the light of this study, the selection between employing the individual ILV volumes alone or the ILV/Hip ratio is contingent upon the specific clinical context and the prioritization of diagnostic considerations. Each approach presents distinct characteristics with implications for minimizing different types of diagnostic errors. Nonetheless, it needs to be noted that the ILV score alone will not be sufficient for adequate differential diagnosis in the presence of (additional co-) pathologies such as NPH, which needs to be considered for a correct interpretation of the ILV/Hip ratio score.

The ILV/Hip ratio showed a good balance between sensitivity and specificity for all pairwise comparisons (except SCD vs. CN), holding potential to reveal unique patterns that are not evident in the individual raw volumes. This can aid in identifying subgroup profiles within a cohort, which can have clinical implications (e.g., better clinical decision support), or be useful for patient stratification. Furthermore, the ratio provides a normalized measure that considers the relative ILV size compared to the hippocampal volume, which is particularly useful in comparing patients with varying brain sizes, improving diagnostic consistency and comparability.

While the ILV/Hip ratio may not consistently outperform other measurements in terms of classification accuracy, it can be a useful complementary quantitative tool, particularly in certain diagnostic contexts, for example when examining the trade-off between sensitivity and specificity.

To determine the applicability of automated volumetric measures such as the ILV/Hip ratio score in routine clinical practice, it is essential to be able to distinguish between what is considered part of a healthy aging pattern and what is not. The existing guidelines for visual MTA scoring are well-established and include a widely accepted age-related cut-off at 75 years old (with an MTA score of 1.5 or more in both hemispheres considered “abnormal” in younger patients (age < 75) and an MTA score of 2 or more in both hemispheres being abnormal at age > 75) [33, 48]. However, the optimal coronal slice position for MTA scoring has not been universally agreed upon [37]. This might lead to inconsistency and scoring variations, in contrast to automated tools, which consistently rely on a predefined coronal slice position. Thus, the ILV/Hip ratio score, which yields a continuous variable in contrast to the visual MTA score with its scale ranging from 0 to 4, allows for a more standardized and fine-grained determination of normality and abnormality, which, when combined with population graphs, holds potential clinical relevance.

The ability to interpret ILV/Hip ratio score measurements using age- and sex-matched population graphs adds to the already continuous nature of the ILV/Hip ratio score. A continuous measurement enables close monitoring of gradual decline. It avoids the situation in which a patient aged 74 with an MTA score of 1.5 (pathological at that age) is assigned a normal MTA score at a follow-up examination one year later solely because the patient has crossed the aforementioned (stringent) age threshold, since subtle changes and trends are not captured by visual assessment. In addition, it needs to be noted that the existence and inconsistent use of different (age-specific and clinical population-based) cut-offs for defining visual MTA rating scale abnormalities has been identified as a major issue and source of heterogeneity, with only a few studies addressing this concern so far [24]. Thus, while preserving the advantages of automated volumetric imaging quantification, the ILV/Hip ratio score can be translated into a standardized value that is interpretable against an age- and sex-matched normative population, while retaining a performance comparable to visual assessment.
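
The threshold effect described above can be made explicit with a short sketch of the age-dependent visual MTA cut-off (a score of 1.5 or more in both hemispheres below age 75, and 2 or more in both hemispheres above it). The function name is hypothetical, and treating an age of exactly 75 as belonging to the older group is an assumption, since the rule is only stated for ages below and above 75.

```python
def visual_mta_abnormal(age_years: float, left: float, right: float) -> bool:
    """Apply the age-dependent visual MTA cut-off to both hemispheres.

    Assumption: age exactly 75 is handled with the cut-off of the older group.
    """
    cutoff = 1.5 if age_years < 75 else 2.0
    return left >= cutoff and right >= cutoff


# The same visual scores at two consecutive yearly examinations.
at_74 = visual_mta_abnormal(74, left=1.5, right=1.5)
at_75 = visual_mta_abnormal(75, left=1.5, right=1.5)
print("abnormal at 74:", at_74)
print("abnormal at 75:", at_75)
```

With unchanged scores, the classification switches from abnormal to normal purely because of the age step, whereas a percentile read from an age-matched population graph changes gradually with age.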

In fact, numerous studies confirm the superiority of automated volumetry over visual rating [3, 4, 6, 8, 22, 34, 49], strengthening the case for using automated equivalents of visual rating scales in routine clinical practice. However, it is important to consider the existing inter-software variability in automated volumetry, which can give rise to differing clinical interpretations, emphasizing the need to avoid assuming interchangeability of software applications [54]. Although not explored in this study, another potential limitation of the ILV/Hip ratio score versus visual MTA assessment is the introduction of scanner-specific dependency, an undesired effect. Using a diverse dataset in training deep learning-based image segmentation methods, as in icobrain v 5.10, has been shown to lead to lower inter-scanner variability [28]. Furthermore, the reproducibility of the ILV/Hip ratio was not assessed in this paper. However, software reproducibility has been previously described and validated by Wittens and Allemeersch et al. (2021) [52]. Besides age and gender, education can also be seen as a confounder in MTA grading, which was not taken into account in the current study and should be part of further validation studies [50]. An additional challenge in this study is the use of a convenience sample, in which not all diagnoses were substantiated by CSF biomarkers. While this does not affect the inter-observer findings, it does introduce complexity in accurately interpreting abnormalities. The NIA-AA criteria were applied wherever possible; however, due to the retrospective nature of the study, coupled with common constraints in routine clinical practice, including contraindications for lumbar puncture (e.g., coagulation disorders, thrombocytopenia, use of anticoagulants, increased intracranial pressure, and resistance to or incapacity to participate in the procedure), not all criteria for each subject were met in full to obtain a biomarker-based diagnosis. Lastly, it needs to be noted that validating automated volumetry against the visual MTA, a suboptimal gold standard, may not directly lead to substantial improvements in the field. However, since the visual MTA is often used as a reference point for diagnosis and treatment decisions, validating against this standard ensures that automated methods align with established clinical practices, making them more readily applicable and interpretable in real-world scenarios. Showing equivalence of alternative or complementary automated methods with the familiarity, simplicity, and ease of implementation that the visual MTA rating provides to clinicians can aid in gaining acceptance and establishing trust in the reliability and accuracy of automated methods, justifying their integration into routine practice [12].

Future studies should validate the ILV/Hip ratio score on larger clinical datasets containing MRI acquisitions from different scanner types to evaluate its generalizability, and should further fine-tune the percentiles as alternative interpretable “thresholds” for the ILV/Hip ratio score that correspond to specific visual MTA rating scores, in order to determine, among other things, its diagnostic utility. Lastly, it would be valuable to investigate the combined use of the automated MTA alternative and an automated equivalent of the entorhinal cortex atrophy score (ERICA) [10], which uses the entorhinal cortex, parahippocampal gyri, and amygdala as primary structures for assessing atrophy patterns, to further improve the accuracy and specificity of neurodegenerative disease diagnosis and monitoring.

Conclusion

The ILV/Hip ratio score showed an excellent correlation to the visually assessed MTA consensus rating, currently regarded as the gold standard for MTA scoring. The weaker correlation of this visually assessed MTA consensus rating to hippocampal volumes, which have become a widely accepted additional informative metric in MTA assessment, emphasizes the potential use of the ILV/Hip ratio score in a heterogeneous patient population. Furthermore, the possibility to calibrate the ILV/Hip ratio using age- and sex-matched healthy population graphs has an added value for future research to validate the use of automated volumetry.