Introduction

The sense of smell is largely underappreciated until it becomes dysfunctional. Unlike vision, hearing, balance, and touch, this sense is rarely assessed quantitatively in the clinic, even though hundreds of thousands of persons seek medical help annually for problems with this primary sense, a situation markedly accentuated by the COVID-19 pandemic (Gerkin et al., 2020). Loss or distortion of the ability to smell degrades quality of life, including the flavor of foods and beverages, and compromises safety by impairing the ability to detect fire, leaking natural gas, spoiled food, and other environmental dangers (Devanand et al., 2014). It is now well established that olfactory dysfunction can be a harbinger of Alzheimer's and other neurodegenerative diseases and, in elderly populations, can triple the likelihood of mortality over a five-year period (Liu et al., 2019; Devanand et al., 2014). Without quantitative testing, patients are often unaware of an olfactory deficit (Doty et al., 1987). Such testing is critical for establishing the veracity of a patient's complaint, the probability of malingering, the impact of treatments, and the degree to which a patient's function is normal for his or her age and sex.

Although numerous brief odor identification tests are available for quantifying the ability to smell, in many cases their efficacy has not been evaluated, and they lack parallel forms that would allow longitudinal testing without potential confounding from knowledge of prior test items. Tests with larger numbers of items do not appear to meaningfully suffer from these problems (Doty et al., 2021), yet out of practicality many clinicians and researchers prefer briefer tests. Unfortunately, in addition to lacking parallel forms, such brief tests typically lack an empirical rationale for their length.

The present study employed state-of-the-art machine learning techniques to generate a series of parallel brief odor identification tests whose lengths optimize practicality as well as sensitivity and specificity in differentiating persons with abnormal smell function from those with normal smell function. Machine learning techniques exploit automated analytical algorithms and combinatorial search methods to maximize predictive power. Training focused on test items administered to persons with an intact sense of smell and to persons with smell loss secondary to SARS-CoV-2 infection, the latter serving as a surrogate for smell dysfunction more generally.

Method

Experimental design

To achieve the goal of the study, we first determined the optimal number of odorant test items needed to accurately differentiate patients with a disorder known to produce significant smell loss from healthy controls. The data to which the machine learning algorithms were applied came from smell tests administered to 100 confirmed COVID-19 patients (age: 45.4 ± 11.8 years; 67 males) and to 132 healthy subjects (age: 43.7 ± 15.4 years; 48 males) tested prior to the outbreak of COVID-19 in Iran (Moein, Hashemian, Mansourafshar, et al., 2020a; Moein, Hashemian, Tabarsi, & Doty, 2020b).

The Persian version of the 40-item University of Pennsylvania Smell Identification Test (UPSIT®; Sensonics® International, Haddon Heights, NJ) was administered to all subjects. This version comprises 40 odorants drawn from a library of 51 odorants used across the different cultural versions of the UPSIT®. The UPSIT® assesses the comparative ability of subjects to identify odorants at the suprathreshold level (Doty et al., 1984). The subject releases each odorant by scraping an odorized strip with a pencil tip and then indicates which of four written response alternatives corresponds to the perceived smell. A response must be made even if no smell is perceived or the smell seems different from all of the listed alternatives (i.e., the test is forced-choice).

The dataset contained the UPSIT® test items that were correctly identified, along with subject demographic information, i.e., age, gender, and educational level. A summary of the subjects' responses to the UPSIT® items is shown in Table 1. Note that although only the names of the odorants are listed, each odorant has four associated response alternatives; thus, in this table, as elsewhere in the text, the name of an odorant actually denotes a UPSIT® item comprising the odorant and its response alternatives. All attributes, except for age, coconut, and pizza, differed significantly between COVID-19-positive and healthy subjects (p < 0.01), based on Student's t test for continuous attributes and Fisher's exact test for binary-valued attributes.
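As an illustration of these group comparisons, the following minimal Python sketch (the study itself used MATLAB®, so this is not the authors' code) applies the two tests to a single attribute; the variable names and simulated response rates are hypothetical.

```python
# Illustrative sketch of the group comparisons described above: Student's
# t test for continuous attributes, Fisher's exact test for binary ones.
import numpy as np
from scipy import stats

def compare_groups(covid, healthy, is_binary):
    """Return the p value comparing one attribute across the two groups."""
    if is_binary:
        # 2x2 contingency table: rows = group, columns = correct/incorrect
        table = [[np.sum(covid == 1), np.sum(covid == 0)],
                 [np.sum(healthy == 1), np.sum(healthy == 0)]]
        _, p = stats.fisher_exact(table)
    else:
        _, p = stats.ttest_ind(covid, healthy)
    return p

# Hypothetical binary UPSIT® item (1 = correctly identified)
rng = np.random.default_rng(0)
item_covid = rng.binomial(1, 0.35, size=100)    # 100 COVID-19 subjects
item_healthy = rng.binomial(1, 0.90, size=132)  # 132 healthy subjects
print(compare_groups(item_covid, item_healthy, is_binary=True))
```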

Table 1 Odorant detection rates

To examine the generalizability of the test scores beyond the initial COVID-19 cohort, an independent group of 32 validation subjects [mean (SD) age: 55.56 (16.52) years; 12 males], with or without smell loss as determined from previous UPSIT® scores (Doty, 1995), was administered the brief tests that we developed (see Table S2). Sixty-seven subjects for whom such scores were available were contacted by phone or email; 45 agreed to participate and received the test kit by mail. Thirty-two completed test kits were returned to the researchers within the study time frame, and these subjects' data were included in the analysis. Among the subjects with compromised smell function (i.e., UPSIT® scores < 35 for women and < 34 for men), the etiologies were traumatic brain injury (two subjects), upper respiratory viral infection (one subject), nasal surgery (two subjects), and unknown causes (21 subjects). Each subject was paid $20 for participating in the validation study.

Statistical analysis

The binary UPSIT® item-response data were used to train and test popular machine learning algorithms available in MATLAB® version 2020a (MATLAB®, 2020): logistic regression (Grimm & Yarnold, 1995), artificial neural networks with ten hidden units (Haykin, 1998), decision trees (Breiman et al., 1984), k-nearest neighbor (kNN; k = 3 with the city-block distance metric), support vector machines (SVM) (Hearst et al., 1998), and the ensemble learning methods random forests (Breiman, 2001) and AdaBoost (adaptive boosting) (Freund & Schapire, 1999). Parameter sweep analyses for the number of hidden units in the neural networks and the number of nearest neighbors in kNN are provided in supplementary Figs. S1 and S2. To compare the results with common smell-test scoring approaches, a simple linear discriminant analysis (LDA) classifier based on the total number of correctly identified odorant items was also used.
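As a rough guide to reproducing a comparable analysis, the sketch below instantiates analogous models in Python with scikit-learn; this is an approximation of the MATLAB® setup, not the authors' code. Hyperparameters follow the text where stated (ten hidden units; kNN with k = 3 and the city-block metric); all other settings are scikit-learn defaults and are assumptions.

```python
# Approximate scikit-learn counterparts of the classifiers named above.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "neural network": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000),
    "decision tree": DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(n_neighbors=3, metric="cityblock"),
    "random forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "SVM": SVC(probability=True),  # probabilities enable AUC computation
}
```

Each of these models can then be scored with the leave-one-out procedure described below.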

Machine learning and feature selection

To reduce data dimensionality, remove noise, and optimize the predictive performance of each method, feature selection was performed to identify an optimal subset of odorant items. We used a sequential forward feature selection strategy in which the selected odorant set was repeatedly extended as long as the inclusion of a new odorant improved cross-validation performance. As a trade-off between computing time and completeness of the search, a modified sequential strategy was implemented in which, at each selection iteration, all combinations of two to three odorants were considered for inclusion; candidate combinations were assessed until the optimized model was reached for each machine learning method (a sketch of the simpler one-item-at-a-time variant appears below).
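A minimal sketch of the basic one-item-at-a-time forward selection loop follows; the modified strategy described above additionally evaluates all two- and three-odorant extensions at each iteration. The helper `cv_score`, assumed here, would return the leave-one-out optimization criterion for a candidate item subset.

```python
# Sketch of plain sequential forward feature selection.
def forward_select(items, cv_score):
    """Greedily grow an item set while the cross-validation score improves."""
    selected, best = [], float("-inf")
    while True:
        candidates = [it for it in items if it not in selected]
        if not candidates:
            break
        score, item = max((cv_score(selected + [it]), it) for it in candidates)
        if score <= best:  # no candidate improves the criterion: stop
            break
        best = score
        selected.append(item)
    return selected
```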

A leave-one-out cross-validation procedure was applied to the data to assess the performance of each machine learning method. At each iteration, one subject was left out as the test sample, the model was trained on the remaining samples, and a prediction was made for the left-out subject. This procedure was executed for every subject in the dataset at each feature selection iteration. The performance metrics were accuracy (the proportion of correct predictions), sensitivity (true-positive rate), specificity (true-negative rate), and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, a plot of the true-positive rate against the false-positive rate. For some machine learning methods (e.g., SVM, random forest, and AdaBoost), different combinations of sensitivity and specificity are possible depending on where the categorization threshold is set; therefore, in addition to the optimized combination of sensitivity and specificity, the threshold-independent AUC was also reported. Feature selection continued until no further improvement in performance could be achieved.
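The evaluation loop might look like the following scikit-learn sketch, in which `model`, `X` (a NumPy array of binary item responses), and `y` (group labels, with 1 denoting smell loss) are hypothetical stand-ins.

```python
# Sketch of leave-one-out cross-validation with the four reported metrics.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut

def loocv_metrics(model, X, y):
    preds = np.empty_like(y)
    scores = np.empty(len(y), dtype=float)
    for train, test in LeaveOneOut().split(X):
        m = clone(model).fit(X[train], y[train])
        preds[test] = m.predict(X[test])
        scores[test] = m.predict_proba(X[test])[:, 1]  # P(smell loss)
    tp = np.sum((preds == 1) & (y == 1))
    tn = np.sum((preds == 0) & (y == 0))
    return {
        "accuracy": np.mean(preds == y),
        "sensitivity": tp / np.sum(y == 1),  # true-positive rate
        "specificity": tn / np.sum(y == 0),  # true-negative rate
        "AUC": roc_auc_score(y, scores),
    }
```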

To provide tests useful for practical serial testing of smell function, an optimization search for multiple odorant sets was performed. The number of items per set was chosen by weighing practicality against predictive performance, based on the item-selection results. The rationale was that, for example, four relatively brief tests could be administered sequentially, with one of them administered a second time to assess temporal reproducibility. Such tests could also be administered in different orders and at varied intervals, such as every day or every other day over the course of a week. To ensure heterogeneity among the tests, we required that no odorant item appear in more than three tests and that no two tests share more than half of their items (these constraints are sketched in code below). The sequential forward feature selection algorithm was applied to the items of the four tests simultaneously under these constraints. Here, the optimization criterion was the arithmetic mean of the accuracy, sensitivity, specificity, and AUC, assessed from leave-one-out cross-validation of the sum-based LDA model.
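A sketch of the heterogeneity constraints, assuming each test is represented as a list of odorant names, is given below.

```python
# Sketch of the constraints imposed on the four candidate odorant sets:
# no item in more than three tests, and no pair of tests sharing more
# than half (here, four) of their eight items.
from collections import Counter

def sets_are_valid(tests, max_tests_per_item=3, max_overlap=4):
    counts = Counter(item for test in tests for item in test)
    if any(c > max_tests_per_item for c in counts.values()):
        return False
    for i in range(len(tests)):
        for j in range(i + 1, len(tests)):
            if len(set(tests[i]) & set(tests[j])) > max_overlap:
                return False
    return True
```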

Measurement of agreement

To investigate the level of agreement between the scores of the four final smell tests, the equality of means and variances was tested simultaneously using the Bradley–Blackwood test (Bradley & Blackwood, 1989). This test provides an F statistic calculated from a regression involving each pair of smell tests: given the two scores from a pair of tests (T1 and T2), the difference and the sum of the paired scores (D = T1 − T2 and S = T1 + T2) are used to test for a null slope and intercept in the regression of D on S. This simultaneous test is based on the equation

$$F=\frac{\left[\frac{\sum {D}_i^2- SSE}{2}\right]}{\left[\frac{SSE}{n-2}\right]},$$

where n is the number of paired observations, \(\sum {D}_i^2\) is the sum of squares of the n observed differences, and SSE is the residual sum of squares from the regression of D on S. The null hypothesis is the equality of the two measurements. The alpha level for the critical F statistic was set at 0.001. When F exceeds Fcritical(2, n − 2), the null hypothesis of the equality of the two measurements is rejected (in the current context, failure to reject the null hypothesis is the desired outcome).
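The following sketch computes this F statistic and compares it against the critical value; `t1` and `t2` are assumed to hold the paired scores from two parallel tests.

```python
# Sketch of the Bradley-Blackwood test from the equation above.
import numpy as np
from scipy import stats

def bradley_blackwood(t1, t2, alpha=0.001):
    t1, t2 = np.asarray(t1, float), np.asarray(t2, float)
    d, s = t1 - t2, t1 + t2
    n = len(d)
    slope, intercept = np.polyfit(s, d, 1)  # regression of D on S
    sse = np.sum((d - (slope * s + intercept)) ** 2)
    f = ((np.sum(d ** 2) - sse) / 2) / (sse / (n - 2))
    f_crit = stats.f.ppf(1 - alpha, 2, n - 2)
    # f < f_crit: equality of means and variances is not rejected
    return f, f_crit, f < f_crit
```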

Test validation using the independent subject group

The 32 subjects self-administered each of the parallel tests on five consecutive days, with the first smell test administered a second time on the fifth day (the same odorants as Test 1 presented in a different order). Each test yielded a score between 0 and 8. A subject's overall score was computed as the sum of the scores across the five test administrations (maximum possible: 40). Each subject's UPSIT® score and clinical smell-function category were compared with the parallel-test scores. Spearman's rho was used to quantify the correlation between the UPSIT® scores and the total scores, and the AUC of the ROC curve was used to evaluate how well the parallel smell tests classified the subjects' smell-function category.
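These validation analyses could be sketched as follows; the short arrays are hypothetical stand-ins for the 32 subjects' data, and the summed scores are negated for the AUC computation because lower scores indicate dysfunction.

```python
# Sketch of the validation analyses: Spearman correlation and ROC AUC.
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

upsit = [38, 12, 35, 20, 31]     # prior UPSIT® scores (hypothetical)
total = [37, 10, 36, 18, 30]     # summed parallel-test scores, 0-40
dysfunctional = [0, 1, 0, 1, 1]  # 1 = clinically compromised smell

rho, p = spearmanr(upsit, total)
auc = roc_auc_score(dysfunctional, [-t for t in total])
print(rho, p, auc)
```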

Results

Different machine learning algorithms resulted in different odorant sets

The classification performance achieved by each machine learning method throughout the feature selection process is summarized in Fig. 1. The number of odorants at which each method achieved its highest performance varied between 7 and 29. As Fig. 1 indicates, for all of the machine learning algorithms, once the highest classification performance had been reached, the addition of more odorants was either inconsequential or detrimental to performance. The LDA, logistic regression, and AdaBoost methods were more robust to the addition of odorants, declining only slightly, whereas the other methods performed less well when more than the optimal number of odorants was used.

Fig. 1

Optimization criteria for different machine learning methods. Note. Optimization criteria (arithmetic mean of the accuracy, sensitivity, specificity, and AUC metrics) achieved during the feature selection strategy with increasing odorant counts for each machine learning method. Leave-one-out cross-validation was used to calculate the classification performance metrics

The optimal UPSIT® odorant sets for each of the machine learning methods and their corresponding cross-validation performance metrics are shown in Table 2. Our feature selection strategy was designed to optimize the arithmetic mean of the accuracy, sensitivity, specificity, and AUC metrics. The simple sum-based LDA method using 29 odorants had the best overall performance, with an accuracy of 95.7%, a sensitivity of 94.0% (true-positive rate), a specificity of 97.0% (true-negative rate), and an AUC of 0.97. Logistic regression using 23 odorants achieved equivalent specificity with slightly lower performance on the other metrics. Compared to LDA, support vector machines (SVM) with 15 odorants achieved better sensitivity (97.0%) but lower specificity (89.4%). k-Nearest neighbor (kNN; k = 3 with the city-block distance metric) with 23 features achieved 99.2% specificity at the cost of lower sensitivity (85.0%). The AdaBoost and random forest methods performed similarly, with balanced sensitivity and specificity. The neural network (with ten hidden units) and decision tree methods performed poorly, with inferior sensitivity. The LDA decision threshold was 20 odorants, meaning that a subject is predicted to have smell loss (i.e., to belong to the COVID-19 group) if he or she correctly identifies 20 or fewer of the 29 selected odorants. Interestingly, LDA using all 40 UPSIT® odorants with no feature selection (decision threshold of 27) had slightly lower accuracy and sensitivity but the same specificity as the LDA method with the 29 optimized features.
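Expressed as code, the sum-based decision rule is trivial to apply, which is part of its practical appeal; the sketch below assumes the responses to the selected items are coded as 0/1.

```python
# Sketch of the sum-based LDA decision rule: with the 29 selected items,
# a raw score of 20 or fewer predicts smell loss. (For the eight-item
# tests reported later, the analogous rule uses threshold=5.)
def predict_smell_loss(item_responses, threshold=20):
    return sum(item_responses) <= threshold

print(predict_smell_loss([1] * 18 + [0] * 11))  # True: 18 <= 20
```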

Table 2 Leave-one-out cross-validation performance of machine learning methods using features selected for optimized classification performance

For each method, using the odorant sets selected by the other methods did not produce better performance than the optimized odorant set selected for that specific method. Thus, the differences in the odorant sets selected by different methods were not due to the non-exhaustive nature of our feature selection strategy. Although each method had a separate set of odorants that optimized its performance, some odorants were common to all methods. For instance, grass and lemon were selected by all eight machine learning models, and menthol was selected by all methods except SVM. Inclusion of the available demographic information, namely age and sex, did not result in any improvement beyond what was achievable with the odorant test data alone (see supplementary Table S1).

The order and incremental contributions of the selected odorants to the classification performance of the LDA model are shown in Fig. 2. The performance metrics were robust across cross-validation iterations, with little variability in all metrics except sensitivity, which varied by up to 5% across the ten-fold cross-validation iterations.

Fig. 2

Sequential feature selection performance metrics for the LDA model. Note. Odorants appear in the order of their inclusion to the selected feature set during the sequential feature selection algorithm. Ten-fold cross validation was used to quantify the variability. Shaded areas show the range of performance values observed in ten-fold cross validations. Odorant names indicate which of the UPSIT® odorant/response sets were employed

The largest contributions to performance occurred within the first eight odorants, and for certain machine learning methods (i.e., decision tree and random forest), brief tests with as few as 3–5 odorant items could not achieve 90% accuracy (Fig. 1). Beyond this sweet spot of eight odorants, additional items yielded diminishing or negative returns. The classification performance of each machine learning method was therefore re-assessed using an optimized set of eight odorants (Table 3). The sum-based LDA method, the most practical of these machine learning methods, achieved 92.7% accuracy, 90.0% sensitivity, and 94.7% specificity. The LDA decision threshold was correct identification of six odorants, meaning that a subject was predicted to be normosmic (i.e., not belonging to the COVID-19 group) if he or she correctly identified six or more of the eight odorants.

Table 3 Leave-one-out cross-validation performance using only eight features

Brief smell tests optimized for sequential testing

To provide brief practical smell tests that could be administered serially (e.g., for screening) within a short period of time, we identified four eight-item sets of odorants. Sum-based LDA was chosen for deriving these serial tests because of its high performance (Tables 2 and 3) and its simple implementation, which requires no complicated computer modeling. A constrained optimization procedure was used to select the odorants for each set so as to maximize the diversity among the sets. Specifically, an iterative feature selection strategy added odorants to each of the four sets simultaneously, limiting the overlap between any pair of final sets to less than 50% of their items. The choice of eight odorants per set was based on the sweet spot observed in Fig. 1 and Table 3 for the trade-off between practicality and classification performance.

The four optimized eight-odorant sets that were most useful for differentiating smell loss (COVID-19 patients) from normosmia (healthy controls) are shown in Table 4. These sets achieved accuracies of 91–93%, sensitivities of 88–92%, and specificities of 90–95%. The LDA decision threshold for each set was six odorants, meaning that a subject is predicted to be normosmic (i.e., not belonging to the COVID-19 group) if he or she correctly identifies six or more of the eight odorants. These odorant sets can be employed successively in the same subjects on multiple days to minimize the likelihood of recalling previous sets of stimuli, making them particularly useful for longitudinal tracking of smell function when rapid screening is needed.

Table 4 Optimized odorant sets for multiple testing of smell function

Finally, an optimal sequential test should show small differences between parallel-test scores within a given subject and large between-subject variability, ensuring that the full range of the scale is used. The test scores for the sequential tests were computed for all subjects, and the level of agreement of each pair of tests was measured using the Bradley–Blackwood test with Fcritical(2, 139) = 7.26. The computed F statistics for the pairwise combinations of the optimized smell tests were 0.07, 2.19, 3.36, 2.02, 1.48, and 6.29, all less than Fcritical; every pair thus passed the simultaneous test of equality of means and variances.

Test performance of the independent validation group

Each of the 32 subjects completed the parallel tests on five consecutive days (the four tests in Table 4, plus the repetition of Test 1 on the fifth day). The mean (SD) scores of the five consecutive tests were 5.06 (1.78), 5.03 (1.58), 5.53 (2.17), 5.25 (1.60), and 5.53 (2.13), and the mean total score across the five tests was 26.40 (7.33). A repeated-measures analysis of variance indicated that the five test scores did not differ significantly [F(4, 124) = 1.11, p = 0.35]. The AUCs of the ROC curves for the individual parallel tests ranged from 0.79 to 0.83 in detecting olfactory dysfunction, and the AUC increased to 0.91 when the parallel-test scores were totaled (Fig. 3).
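A hedged sketch of such a repeated-measures ANOVA in Python with statsmodels follows; the long-format data frame is a simulated stand-in for the 32 subjects × 5 administrations.

```python
# Sketch of a repeated-measures ANOVA across the five administrations.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(32), 5),
    "test": np.tile(np.arange(5), 32),
    "score": rng.integers(0, 9, size=32 * 5),  # each test scored 0-8
})
# Yields an F(4, 124) test of the equality of the five test means
print(AnovaRM(df, depvar="score", subject="subject", within=["test"]).fit())
```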

Fig. 3

Comparison of sequential test scores with the UPSIT® score. Note. The scores of the sequential tests were summed to form a serial test score, which correlated with the UPSIT® score (r = 0.84, p < 0.0001). The area under the ROC curve was 0.91 in differentiating normosmics from olfactorily compromised subjects. The right subpanels compare each parallel smell test with the UPSIT® score, together with the AUCs of the corresponding ROC curves

Discussion

Alterations in the sense of smell are among the first signs of a number of serious medical diseases and disorders, including COVID-19 and such neurodegenerative diseases as Parkinson's and Alzheimer's. Nevertheless, sequential screening of smell function is rarely practiced in the clinical setting, despite its usefulness in identifying the onset of disorders such as COVID-19. One potential reason is the lack of short parallel tests that circumvent subjects' recall of prior test items.

In this study, a number of machine learning models were compared using UPSIT® odorant items to establish optimal brief screening tests for differentiating patients with smell dysfunction from healthy controls. We used COVID-19 patients as a surrogate for smell dysfunction in general, as this group provides a well-defined basis for such differentiation. Although most of the investigated models performed quite well, the sum-based LDA model was superior, exhibiting 95.7% accuracy, 94.0% sensitivity, and 97.0% specificity. A minimum of eight odorant/response items was needed to produce accuracy rates of 91–93%; brief tests with as few as 3–5 odorant items could not achieve such high levels of accuracy.

In addition to determining the minimum number of odorant items needed to achieve high sensitivity, we identified four sets of short eight-item odorant tests useful for serial testing while minimizing the likelihood of subjects remembering previously presented stimuli. Serial testing potentially allows for the detection and continuous monitoring of smell dysfunction and may also be useful in tracking disease progression, recovery, and intervention effectiveness. Such testing may also benefit persons who cannot accurately complete longer tests in a single session (e.g., because of fatigue or difficulty concentrating, as occurs in some elderly persons and patients with dementia).

To increase the generalizability to other populations and to focus on the predictive power of the smell tests per se, we used only the responses to the odorant items in the machine learning analysis and did not emphasize non-olfactory demographic variables. Inclusion of the limited demographic information available, namely age and sex, did not result in any improvement beyond that achievable with the odorant test data alone (see Supplementary File). Combining the smell test data developed in this study with other demographic data would likely necessitate a more sophisticated scoring scheme.

To our knowledge, this is the first study to develop multiple parallel brief smell identification tests useful for detecting temporal changes in smell function with high sensitivity and specificity. The study has both strengths and weaknesses. Among its strengths are the use of data from a relatively large number of persons with and without smell dysfunction and the systematic exploration of numerous machine learning algorithms to optimize sensitivity in detecting smell dysfunction. Several potential weaknesses should be acknowledged. First, one might argue that the sample size was inadequate for the analyses performed; however, the cross-validation procedure addresses this concern, and the stability of performance across validation folds suggests otherwise. Second, data augmentation (Shorten & Khoshgoftaar, 2019), a procedure employed to reduce overfitting when training machine learning models, was not performed; because the dataset contained relatively balanced numbers of participants in the two groups, it was not deemed necessary. Third, the tests we employed rely on odor identification and do not directly account for changes on other measures, such as odor detection threshold. This concern is mitigated by well-documented evidence that identification tests such as the UPSIT® correlate strongly with a number of other types of olfactory tests, including detection threshold tests (Doty et al., 1994; Doty et al., 2019).