Introduction

After reaching pandemic status, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the etiological agent of coronavirus disease 2019 (COVID-19), has infected more than two hundred million individuals and caused over five million deaths worldwide (https://covid19.who.int). To mitigate the morbidity associated with COVID-19, there is a substantial need for improved population-scale testing solutions that identify infection early and thus allow adequate tracking [1]. New diagnostic techniques that are fast, accurate, and low-cost will not only help the management of the current crisis, but also serve as a baseline for the development of multiplex technology that will be useful in the response to future epidemics. Presently, most diagnostic methods involve sampling and testing different fluids such as nasopharyngeal cell lysate, saliva, or blood.

SARS-CoV-2 infections in the initial stage are currently identified using real-time quantitative polymerase chain reaction (RT-qPCR) assays, considered the gold standard method, which may require up to 3 days after infection for a reliable positive signal [2]. In addition, RT-qPCR tests require between 3 and 4 h to be concluded [3] and are hardly used on a daily basis due to their elevated cost and shortages of biomarkers and key reagents.

Intermediate stage or past infections are investigated using serum-based testing methods such as enzyme-linked immunosorbent assays (ELISAs), lateral flow immunoassays (LFIAs), or chemiluminescent immunoassays (CLIAs). These tests normally detect a significant and measurable concentration of immunoglobulin G (IgG) and immunoglobulin M (IgM) antibodies in blood samples. However, the build-up of such antibodies in the blood is slow; thus, concentrations of IgG and IgM are measurable by these methods only after 2 weeks of infection [4]. Sensitive and specific serological methods are not fast, requiring 4 to 6 h to be completed [3]. Moreover, since most infections become apparent only upon symptom onset, the current methods of testing are unlikely to identify pre-symptomatic carriers. It is estimated that as many as 50% of individuals infected with SARS-CoV-2 are asymptomatic, which hampers early-stage interventions that reduce transmission [5, 6]. There is also a large number of unreported infection cases and COVID-19-related deaths [7].

In this regard, appropriate clinical samples are essential to produce reliable results for the diagnosis of infection with SARS-CoV-2. For the primary diagnostic assessment of current SARS-CoV-2 infection, the Centers for Disease Control and Prevention (CDC) recommends collecting and testing an upper respiratory specimen, which includes sputum, bronchoalveolar lavage, and tracheal aspirate samples [8]. Considering that the virus induces little or no viremia, it is essential to search for the virus in the local infection milieu. As such, fast, accurate, and inexpensive methods for the early, real-time detection of SARS-CoV-2 in sputum, bronchoalveolar lavage, and tracheal aspirate samples are urgently needed.

Emerging optical methods have been proposed for the detection of viral diseases [9]. Such methods usually detect labeled samples or use expensive and complex laser-based measurement techniques [9]. However, efforts to implement fast and sensitive diagnostic approaches have emerged in response to the current health crisis, as key steps to control the pandemic as well as part of reopening strategies [10]. Although the combination of RT-qPCR and serological tests such as ELISA is ideal for an accurate diagnosis, the detection of antibodies is particularly relevant during later transmission [11]. Thus, a fast and label-free methodology for COVID-19 diagnosis during the first days after infection is desirable.

Here, we report the use of a patent-pending [12,13,14] label-free optical spectroscopic method of straightforward operation, combined with machine learning (ML) processing of the acquired spectroscopic data, as a new diagnostic method of SARS-CoV-2. Using inactivated nasopharyngeal swab samples from RT-qPCR tested individuals, as well as inactivated tracheal aspirate from intubated patients, we show that this patent-pending multiplex method can be used to detect diseased individuals in less than 15 min, with elevated accuracy, and at a very low cost.

Methods

Study design and overview

We investigated whether optical spectroscopy data of nasopharyngeal swab and tracheal aspirate samples could be effectively used to detect SARS-CoV-2 infection with the aid of machine learning methods and without the use of biomarkers, in a fast and accurate way. Figure 1 shows an overview of the study process, divided into four steps: participant recruitment, collection of nasopharyngeal swab or tracheal aspirate samples, optical spectroscopy, and machine learning modeling.

Fig. 1

Overview of the study process. a Enrollment. b Collection of nasopharyngeal swab or tracheal aspirate samples. c Optical spectroscopy measurements. d The algorithm was trained to predict the probability of a patient having COVID-19

Participant recruitment

The samples used in this research were collected from nasopharynx swabs of 152 patients suspected of SARS-CoV-2 infection, from asymptomatic individuals and from mildly symptomatic non-hospitalized patients. In addition, tracheal aspirate samples from 12 healthy patients and 12 critically ill COVID-19 patients, aged from 18 to 80 years old, 14 males and 10 females, under mechanical ventilation at the Intensive Care Unit of Risoleta Tolentino Neves Hospital were also studied.

The use of these samples was approved by the Ethical Committee (CAAE: 32113420.6.0000.5149; 1686320.0.0000.5149) from Universidade Federal de Minas Gerais (UFMG). Sensitive information was duly anonymized. All procedures followed ethical guidelines in accordance with Brazilian national regulations.

Collection of nasopharyngeal swab samples

Nasopharyngeal and oropharyngeal swab samples were collected from participants by inserting a rayon swab with a plastic shaft into the nostril, parallel to the palate, and gently scraping for a few seconds to absorb secretions. Another swab was rubbed against the tonsils for sample collection. Next, the swabs were immediately immersed together in a sterile tube containing 2 mL of guanidine isothiocyanate solution. RNA extracted from all samples was tested by RT-qPCR using probes for viral and human genes. RT-qPCR was performed at the Vaccine Technology Center (CTVacinas) of the Universidade Federal de Minas Gerais to allow a definitive diagnosis of SARS-CoV-2 infection. Ground truth categorization of swab samples into negative (78) versus positive (74) SARS-CoV-2 infection was based on PCR results. Further details of the RT-qPCR results can be found in the Supplementary Information.

Collection of tracheal aspirate samples

Tracheal aspirate (TA) samples (2–10 mL) were collected during the early morning routine of COVID-19 patients. All patients included in the study tested positive for SARS-CoV-2 by RT-qPCR targeting the E gene. Only patients with productive secretions were included in the study. Samples were aspirated into sterile tracheal secretion collectors and immediately processed in a biosafety level 3 laboratory.

Optical spectroscopy measurements

For the optical measurements, each nasopharyngeal swab or tracheal aspirate sample was thawed and homogenized by spinning for 1 min, at room temperature and 1200 rpm. Next, 10 µL of the sample was deposited on a 22 mm × 22 mm glass #1½ coverslip (Corning, USA) and covered with a second coverslip. The sandwich samples were studied by ellipsometry in the 245–1690 nm wavelength range, with incidence angle varying from 45 to 70° in 5-degree steps. The measurements were repeated in 9 different regions of approximately 3 mm × 6 mm of each slide, organized as a 3 × 3 rectangular mesh, in order to account for possible spatial inhomogeneities across the samples.

Development of the machine learning model

The measured data was used to train a machine learning model to identify SARS-CoV-2 infected patients. This model was specifically trained to predict the infection status for each of the distinct positions read from the individual slides. The patient’s final diagnosis was defined by the average infection probability over all positions in the slide. An average probability below 0.5 was classified as a negative diagnosis; otherwise, the diagnosis was positive.
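The patient-level decision rule described above can be sketched in a few lines of Python. This is only an illustration: the function name and the example probabilities are hypothetical, and the only elements taken from the text are the averaging over slide positions and the 0.5 cut-off.

```python
from statistics import mean


def diagnose_patient(position_probabilities, threshold=0.5):
    """Average the per-position infection probabilities of one slide.

    Returns a (label, average) pair; an average below the threshold is
    labelled negative, anything else positive.
    """
    if not position_probabilities:
        raise ValueError("at least one measured position is required")
    average = mean(position_probabilities)
    label = "negative" if average < threshold else "positive"
    return label, average


# Hypothetical probabilities for the 9 positions of one slide (illustrative only)
example_slide = [0.82, 0.64, 0.71, 0.40, 0.55, 0.93, 0.67, 0.48, 0.75]
label, average = diagnose_patient(example_slide)
```

Averaging lets a few low-confidence positions be outvoted by the rest of the slide, which is one plausible reason for measuring several regions per sample.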

Model design was performed in three stages: feature treatment/model type selection, training with data augmentation, and model tuning. Throughout, model quality was assessed by accuracy, precision, recall, F1, and ROC-AUC scores in a test set, determined at the patient level. F1 was chosen as the reference metric for optimization.
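For reference, the threshold-dependent scores named above follow from the patient-level confusion counts. The sketch below uses the textbook definitions and made-up labels; it is not the evaluation code used in the study, and ROC-AUC is omitted because it requires the predicted probabilities rather than the final labels.

```python
def classification_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = infected)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up patient-level labels, for illustration only
scores = classification_scores(
    y_true=[1, 1, 1, 1, 0, 0, 0, 0],
    y_pred=[1, 1, 1, 0, 1, 0, 0, 0],
)
```

F1 is the harmonic mean of precision and recall, so optimizing it penalizes a model that gains recall only by producing many false positives.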

In the first stage of model design, the pipeline consisted of the following sequential steps: manual variable selection, manual feature selection, scaling preprocessing, outlier detection and removal, automatic feature selection, and model type selection. These steps aimed to identify the variables, features, and preprocessing procedures that would yield the best models. For manual variable selection, we considered the variables related to experimental design. The angles of incidence were tried individually (45, 50, 55, 60, 65, and 70°) and combined (all angles). Four wavelength windows were tried: below 380 nm, between 380 and 1000 nm, above 1000 nm, and the whole range (all wavelengths). Due to concerns about rapid sample degradation, as well as the desire to speed up the procedure in a clinical setting, three combinations of positions were tried: positions 1–3, 1–5, and 1–9 (all positions). For manual feature selection, we used combinations of the measured ellipsometry features: angles Ψ and Δ, depolarization, intensity, and the real and imaginary parts of the complex reflectance ratio ρ = tan(Ψ)·exp(iΔ). The scaling step was introduced to express all measurements on a comparable scale; the methods tested were MinMaxScaler, StandardScaler, QuantileTransformer, and RobustScaler, as implemented by the Python package Scikit-Learn v0.24 [15]. The outlier detection methods tested were PCA, LOF, KNN, COPOD, and IForest, with contamination rates in the range of 1 to 12.5%, as implemented by the Python package PyOD v0.8.7 [16]. Automatic feature selection was performed to rank the features according to their discriminative power. The methods tested in this step were ExtraTreesClassifier (with both Gini and entropy criteria), PCA, and LDA, as implemented by Scikit-Learn v0.24. After the features were ranked, we tried the top “n” features, with n ranging from 20 to 500. For model type, we tested implementations of logistic regression, support vector machine, gradient boosting classifier, and deep neural network (multi-layer perceptron classifier) from Scikit-Learn v0.24, and the XGBoost classifier from the Python package XGBoost v1.4.0 [17].
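The complex reflectance ratio used among the manual features follows the standard ellipsometry relation ρ = tan(Ψ)·exp(iΔ). As a minimal sketch, assuming Ψ and Δ are reported in degrees (the function name and the example values are ours, not the authors'), its real and imaginary parts can be derived as follows:

```python
import cmath
import math


def reflectance_ratio(psi_deg, delta_deg):
    """Complex reflectance ratio rho = tan(psi) * exp(i * delta).

    Both ellipsometric angles are assumed to be given in degrees.
    """
    psi = math.radians(psi_deg)
    delta = math.radians(delta_deg)
    return math.tan(psi) * cmath.exp(1j * delta)


# Illustrative values only: psi = 45 deg gives |rho| = 1, and
# delta = 90 deg puts rho on the imaginary axis.
rho = reflectance_ratio(45.0, 90.0)
real_part, imag_part = rho.real, rho.imag
```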

In the second stage of model design, we tested the top performing models identified so far with a technique of data augmentation presented in [18]. The main idea of the method is to create synthetic training data by mixing the original measurements; more data tends to increase the power of generalization of the model. The synthetic data in this study was generated by averaging two measurements, making sure that only measurements from the same class and position would be mixed. The original data was also kept in the training set. The test set consisted only of original measurements.
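A minimal sketch of this augmentation scheme is given below. The text does not state how the pairs were chosen, so the sketch assumes that every pair of measurements sharing class and position is averaged once; the dictionary keys and the function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def augment_by_averaging(measurements):
    """Add synthetic spectra made by averaging pairs of training spectra.

    Each measurement is a dict with keys 'label', 'position' and 'spectrum'
    (a list of floats). Only measurements with the same label and the same
    slide position are mixed; the originals are kept in the output.
    """
    groups = defaultdict(list)
    for m in measurements:
        groups[(m["label"], m["position"])].append(m)

    synthetic = []
    for (label, position), members in groups.items():
        for a, b in combinations(members, 2):
            mixed = [(x + y) / 2 for x, y in zip(a["spectrum"], b["spectrum"])]
            synthetic.append(
                {"label": label, "position": position, "spectrum": mixed}
            )
    return list(measurements) + synthetic


# Toy training set (values are illustrative only)
train = [
    {"label": 1, "position": 1, "spectrum": [1.0, 3.0]},
    {"label": 1, "position": 1, "spectrum": [3.0, 5.0]},
    {"label": 0, "position": 1, "spectrum": [0.0, 0.0]},
]
augmented = augment_by_averaging(train)
```

The function should be applied to the training set only, after the train/test split, so that no synthetic sample carries information from a test patient.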

The third and last stage consisted of further tuning the best performing models by adjusting the parameters specific to each model type. We performed an exhaustive search over some of the adjustable parameters listed in each model’s documentation, relying once again on the data augmentation setup.

Throughout the model design protocol, models were trained with a training set and evaluated with a test set. Even though models were trained on individual positions in the slides, we made sure that the same slide would not be present in the training and test sets at the same time, thereby preventing data leakage at the patient level. These sets were generated by randomly splitting all the available measurements in a stratified fashion, reserving 20% of the patients for the test set. All metrics reported are an average over 10 such splits, produced as follows: first, all available slides were shuffled and split into 5 folds of roughly the same size; this process was then repeated once, yielding the 10 folds reported. Therefore, each of the 2 sets of 5 splits covered the whole dataset, and each patient was evaluated twice by the same model, trained with different patients each time.
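The splitting protocol can be illustrated with the following sketch, in which patients (slides), not individual positions, are the unit of splitting. The round-robin stratification and all names are assumptions made for illustration; the study's own splitting code is not described in more detail than above.

```python
import random
from collections import defaultdict


def repeated_patient_folds(patient_labels, n_folds=5, n_repeats=2, seed=0):
    """Yield (train_ids, test_ids) pairs of patient identifiers.

    patient_labels maps each patient id to its class. Within each repeat,
    patients of every class are shuffled and dealt round-robin into the
    folds, so each fold keeps roughly the overall class balance and every
    patient appears in exactly one test set per repeat.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        by_class = defaultdict(list)
        for patient_id, label in patient_labels.items():
            by_class[label].append(patient_id)

        folds = [[] for _ in range(n_folds)]
        for label in sorted(by_class):
            ids = by_class[label]
            rng.shuffle(ids)
            for i, patient_id in enumerate(ids):
                folds[i % n_folds].append(patient_id)

        for fold in folds:
            test_ids = set(fold)
            train_ids = set(patient_labels) - test_ids
            yield train_ids, test_ids


# Toy cohort: 20 hypothetical patients, half of them positive
cohort = {f"patient_{i}": i % 2 for i in range(20)}
splits = list(repeated_patient_folds(cohort))
```

All positions measured on a patient's slide follow that patient into either the training or the test set, which is what prevents leakage between the two.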

Results

Machine learning model

Figure 2 depicts the steps of data preparation related to feature selection, prior to model implementation, for the nasopharyngeal swabs. The solid lines in panels a, b, and c are the mean spectra of the physical property denoted on the y-axis, at a particular angle, measured at the wavelengths denoted on the x-axis, for all the positions in the slides. The shaded areas are the corresponding standard deviations, and the readings are separated by infection status (color coded). Each position of the slide is represented by a set of data as exemplified in Fig. 2a; such a set contains readings for 9 different physical properties (Ψ, Δ, depolarization (depol), intensity, real part of ρ, imaginary part of ρ, sin(Δ), cos(Δ), tan(Ψ)), 6 different angles (45–70°), and 674 different wavelengths, making a total of 9 × 6 × 674 = 36,396 features available as a starting point for the development of the algorithm. After the manual selection of features, each position is represented by 198 features (Fig. 2b), which contain data for one single angle (55°), one single physical property (depolarization), and a sub-range of wavelengths (above 1000 nm). Figure 2c represents the remaining features after the automatic feature selection and data scaling, where the wavelengths are ordered by their importance as given by the method chosen for feature selection. At this point, 166 features remain: 166 selected wavelengths from the depolarization spectra at 55°. Figure 2d is a PCA representation of the same data shown in Fig. 2c; some patients of the same status cluster together, but not all healthy and infected individuals are clearly discriminated by the data alone. The machine learning model is responsible for this final step in the classification task.
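The feature counts quoted above can be checked with a short calculation. The property names below are shorthand labels chosen for this sketch; the counts themselves (9 properties, 6 angles, 674 wavelengths, 198 wavelengths above 1000 nm) are taken from the text.

```python
# Shorthand labels for the 9 physical properties listed in the text
properties = [
    "psi", "delta", "depol", "intensity",
    "re_rho", "im_rho", "sin_delta", "cos_delta", "tan_psi",
]
angles_deg = list(range(45, 75, 5))   # 45, 50, ..., 70 degrees
n_wavelengths = 674                   # wavelengths per spectrum

# Starting point: every property at every angle and wavelength
total_features = len(properties) * len(angles_deg) * n_wavelengths

# After manual selection: one property, one angle, 198 wavelengths
n_wavelengths_above_1000_nm = 198
manual_features = 1 * 1 * n_wavelengths_above_1000_nm

# Fraction of the original feature space kept after manual selection
kept_fraction = manual_features / total_features
```

Manual selection therefore keeps roughly 0.5% of the starting feature space before automatic selection reduces it further to 166 features.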

Fig. 2

Transformation of the data along the processing steps. a Representation of the measured data. b Data after manual feature selection. c Data after automatic feature selection and scaling. d PCA of the data at the end of feature selection

For the nasopharyngeal swab samples, the feature that delivered the best scores was depolarization, measured at an angle of 55°, at wavelengths above 1000 nm, and at all positions of a slide. The features were scaled by the RobustScaler method. Outlier detection and removal were performed by the IForest method with a contamination rate of 10%. Samples from the test set were not evaluated for the presence of outliers, meaning that outliers were removed only from the training set. Automatic feature selection was guided by the ExtraTreesClassifier with Gini criterion, and 166 features were fed into the model. The model that yielded the best F1 score was an implementation of the MLPClassifier, from the Python package Scikit-Learn v0.24 [15], which is used to design neural networks. It contained two hidden layers with 100 neurons each. All layers were activated by the ReLU function. The solver used was SGD, with alpha of 1E-5, momentum of 0.95, and constant learning rate. This setup yielded a model able to diagnose patients with an accuracy of 85.0% (standard deviation 6.0%), F1 of 85.9% (5.4%), precision of 79.1% (7.2%), recall of 90.4% (5.4%), and ROC-AUC of 0.900 (0.045).
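As a rough sketch of how such a configuration might be assembled with Scikit-Learn, the pipeline below chains RobustScaler, ExtraTrees-based selection of 166 features, and an MLPClassifier with the hyperparameters quoted above. Several points are assumptions and not the authors' code: outlier removal is omitted because a Scikit-Learn pipeline cannot drop training rows, the number of trees, iteration limit, and random seeds are arbitrary, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def build_swab_model(n_features=166, seed=0):
    """Scaling -> importance-based feature selection -> neural network."""
    selector = SelectFromModel(
        # n_estimators and random_state are illustrative choices
        ExtraTreesClassifier(n_estimators=100, criterion="gini",
                             random_state=seed),
        max_features=n_features,
        threshold=-np.inf,  # keep exactly the top n_features by importance
    )
    network = MLPClassifier(
        hidden_layer_sizes=(100, 100),
        activation="relu",
        solver="sgd",
        alpha=1e-5,
        momentum=0.95,
        learning_rate="constant",
        max_iter=500,        # assumption: not reported in the text
        random_state=seed,
    )
    return make_pipeline(RobustScaler(), selector, network)


# Synthetic stand-in for 198 depolarization values per measured position
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 198))
y = rng.integers(0, 2, size=120)
X[y == 1, :20] += 1.0  # give the positive class a detectable shift

model = build_swab_model().fit(X, y)
position_probabilities = model.predict_proba(X)[:, 1]
```

The per-position probabilities returned by `predict_proba` would then be averaged per slide to obtain the patient-level diagnosis.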

In the case of the tracheal samples, four features were used: Ψ, Δ, depolarization, and intensity, measured at an angle of 70°, at all wavelengths, at positions 1–3. The best scaling method was RobustScaler, and outliers were removed by the KNN method with a contamination rate of 2.5%. Automatic feature selection was guided by the ExtraTreesClassifier with Gini criterion, and the model performed best using 568 features. Due to the lower number of samples, the best results were achieved prior to the data augmentation phase. The best model was an implementation of the LogisticRegression classifier with standard parameters, as implemented by the Scikit-Learn v0.24 package [15]. The accuracy at the patient level was 97.2% (standard deviation of 5.5%), F1 was 97.2% (5.7%), precision was 96.4% (7.4%), recall was 97.2% (8.3%), and ROC-AUC was 1.0.

In the configuration of 9 measured positions and only one measured angle, we estimated a 7-min interval to carry out the measurement and classification of one sample, and less than 15 min for the overall diagnosis process of one patient, including the collection of the nasopharyngeal swab samples, preparation of the sample to be measured, the optical measurements, and the AI processing of the data. The measurement and classification of the tracheal aspirate are even faster, since only 3 positions of the slide need to be measured.

Discussion

The rapid worldwide spread of the new SARS-CoV-2 virus has shown the necessity and impact of governmental restrictions, such as lockdowns, to prevent the increase in cases and the collapse of health centers [19]. Likewise, this pandemic revealed the urgent need for fast, precise, and well-timed diagnostic systems to identify and manage the treatment of infected individuals, thus mitigating the effects of COVID-19. Up to now, the most widely applied diagnostic methods have been RT-qPCR assays at the early stage of infection, using samples collected from nasopharyngeal and oropharyngeal swabs, and ELISA at a later stage of infection, by evaluating the patient’s sera [20]. Despite the elevated sensitivity of the currently available tests, false positive and false negative results may occur depending on the time of infection and the viral load. For example, it may be challenging to find viral RNA in some samples due to the quality of transport and handling. Radiological methods such as chest computed tomography or thoracic radiography have also revealed remarkable signs of COVID-19 disease; however, they cannot be used for disease screening [21].

New methodologies for mass testing are available that apply lateral flow assays (LFA) through different approaches, mainly using nanomaterials. Among them, an electrochemical immunoassay based on a graphene electrode was functionalized with anti-spike antibodies for the rapid detection of the SARS-CoV-2 virus via the spike surface protein [22]. Another study has proposed a three-dimensional assembly of electrodes of reduced-graphene-oxide (rGO) nanoflakes immobilized with specific viral antigens and integrated with a microfluidic device [23]. In addition, a rapid electrochemical detection of SARS-CoV-2 antibodies using a commercially available impedance sensing platform was also proposed, which contains sensing electrodes coated with SARS-CoV-2 spike protein and exposes samples to an anti-SARS-CoV-2 monoclonal antibody [24]. However, these technologies have drawbacks that are difficult to overcome, such as automation and integration of microfluidics, as well as the avoidance of nonspecific biomolecule adhesion in their systems.

Plasmonic biosensors have encouraged the development of novel approaches to achieve the effective coverage of the biological receptor while confirming the affinity and specificity of targeted viral nucleic acids, proteins, or whole virus [25]. Localized surface plasmon resonance (LSPR) has already been proposed to detect other viruses of medical interest such as dengue and Zika virus [26]. In addition, rapid serological tests based on gold nanoparticles (AuNPs) that identify the presence of IgM and/or IgG immunoglobulins are commercially available [27], and a semiconducting single-walled carbon nanotube (SWCNT)-based field-effect transistor (FET) was assessed for detecting SARS-CoV-2 antigens in clinical nasopharyngeal samples [28]. Nevertheless, most rapid tests available have shown a considerable lack of specificity [29].

A more sophisticated biosensing platform was suggested that couples reverse transcription recombinase polymerase amplification (RT-RPA) with clustered regularly interspaced short palindromic repeats (CRISPR-Cas12a) for SARS-CoV-2 detection. This methodology utilizes DNA-modified gold nanoparticles (AuNPs) as a universal colorimetric readout and can specifically target the ORF1ab and N regions of the SARS-CoV-2 genome [30]. However, it is expensive and unlikely to be commercially available at large scale.

On the other hand, spectroscopic techniques have been suggested as rapid, accurate, and relatively cost-effective methods not only for virus detection but also for infection checking and follow-up [31, 32]. For instance, surface-enhanced Raman spectroscopy (SERS) [33], a COVID-19 salivary Raman fingerprint [34], and a superfast, reagent-free, and non-destructive approach based on attenuated total reflection Fourier-transform infrared (ATR-FTIR) spectroscopy [35] have already shown reliability for diagnostic applications.

The ability to monitor potential virus mutations is essential, especially in identifying SARS-CoV-2 variants that are known to change their RNA sequence. The use of spectroscopic techniques combined with artificial intelligence models will allow detection of the virus and probably the monitoring of any changes related to it [36]. AI has been employed in health care fields for several purposes, ranging from the prediction of disease spread trajectory to the development of diagnostic and prognostic models [37], including algorithms that predict the overall prognosis of COVID-19 patients [38]. Moreover, a machine-learning model that predicts a positive SARS-CoV-2 infection in an RT-PCR test based on symptoms has already been established [39].

Despite all the recent advances in SARS-CoV-2 diagnosis methods mentioned above, there is an urgent need to develop a reagent-free, scalable, low-cost, sensitive, and specific assay for rapid detection of SARS-CoV-2 within minutes, or ideally in seconds, at the early stage of infection. Here, we have demonstrated the use of a label-free optical spectroscopy method of simple operation, combined with ML processing of the acquired raw spectroscopy data, as an innovative method for SARS-CoV-2 infection detection in inactivated samples of nasopharyngeal swab and tracheal aspirate. Our methodology was validated by RT-qPCR and is applicable not only to asymptomatic patients or patients with mild symptoms in the first stage of infection, but also to critically ill COVID-19 patients under mechanical ventilation in intensive care units. Spectroscopic data from the samples, carrying information about the dielectric properties of the sample over a broad spectral range, was acquired. Software was specifically developed to manipulate the data and process them via an artificial intelligence algorithm. Both the spectroscopic technique and the software are patent pending at this moment. One of the advantages of the present method is that the samples are not labeled or processed after collection. The samples can be measured right after collection or after several weeks of storage at − 20 °C. The volume of sample required for the test is relatively small, limited to 10 µL dropped between regular glass coverslips for measurement. Since the samples are inactivated at the moment of collection, there is a very low biological risk associated with the preparation, manipulation, measurement, and later disposal of the slides. The simplicity and automation of the measurements and data processing procedures avoid the need for highly qualified personnel. These characteristics ensure the low cost of our method.

In general, the performance scores of different diagnostic tests are not comparable. Most of the scores depend on the cut-off point selection, as in the case of accuracy, selectivity, and sensitivity, for example. Other scores, such as the area under the receiver operating characteristic curve (AUC), are independent of the cut-off point selection but are affected by asymmetries in the population of tested samples. However, just to put the results of our method in perspective (sensitivity of 90.4% and 97.2% for nasopharyngeal swab samples and tracheal aspirate samples, respectively), we should mention that SARS-CoV-2 detection with nasopharyngeal swabs by RT-PCR has been reported with a sensitivity of 77% [40], 63% [41], 79% [42], and 73% [43]. In addition, a processing time of less than 15 min, which can be reduced with further automation of the process, together with accuracy and sensitivity comparable to those of the above-mentioned methods of COVID-19 diagnosis, makes this solution well suited to contribute to the diagnosis of emerging infectious diseases and future pandemics of public health importance.

Our study has some limitations. The fact that nasopharyngeal swab RT-PCR sensitivity varies throughout the disease course [40] limits the external validity of our findings. A future systematic follow-up study is necessary to understand how the performance scores of our method evolve during the disease course. It is also possible that other pre-clinical conditions could influence the classification outcome of our method. Further studies are necessary to understand the role of infection by common diseases that produce clinical conditions similar to COVID-19.

Conclusion

There is a massive demand for alternative methods to detect new cases of COVID-19 as well as to investigate the epidemiology of the disease. In many countries, dependence on imported commercial kits significantly limits testing capacity and increases costs for the public health system [11]. Decentralization of diagnostic testing and other technology transfer activities should be prioritized to improve accessibility in remote or isolated areas and reduce costs for the public health system [44]. Our approach demonstrates an accurate, simple, fast, label-free, and cost-effective methodology for SARS-CoV-2 diagnosis.