1 Introduction

Coronavirus disease 2019 (COVID-19) is a respiratory illness caused by the novel coronavirus, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). First identified in Wuhan, China in December 2019, COVID-19 has since spread to over 100 countries worldwide [1]. Compared to other coronavirus-related diseases such as SARS and MERS, COVID-19 is highly contagious [2,3,4] and can lead to severe symptoms, including fever, shortness of breath, cough, sore throat, sputum, and myalgia [5]. Patients with underlying health conditions such as heart disease, diabetes, obesity, chronic kidney disease, and chronic obstructive pulmonary disease are more susceptible to severe symptoms [6]. Some individuals infected with SARS-CoV-2 do not exhibit any symptoms, but they can still spread the virus and are therefore required to quarantine.

Various methods have been developed to detect SARS-CoV-2 in infected patients and prevent transmission to vulnerable individuals. The most widely used method is the lateral flow immunoassay, which is cost-effective and provides relatively quick results [7]. However, some lateral flow immunoassay tests are only qualitative or partially quantitative and require extensive validation and quality control [8]. Another commonly used method is the polymerase chain reaction (PCR), which is recognized as the most reliable method by government and health facilities. PCR provides accurate results with high sensitivity but requires up to 48 h to obtain results and cannot be used on patients who have recently recovered from COVID-19 [9].

Optical spectroscopy is an emerging method for detecting SARS-CoV-2 and has shown promising results in rapidly detecting other types of viruses with high accuracy [10,11,12,13,14]. It has been used more frequently in recent years because of its speed and non-destructive nature [15]. Raman spectroscopy, a type of optical spectroscopy, involves the scattering of monochromatic light by the biomolecules that make up a virus, resulting in a change in wavelength due to the excitation of molecules from the ground state to a virtual state [16]. The change in wavelength is quantified by the Raman shift (measured in cm−1), with a larger change in wavelength corresponding to a larger Raman shift. Raman spectroscopy is non-invasive, as the samples typically do not require further heat or chemical treatment. Furthermore, Raman spectroscopy has an advantage over other optical spectroscopy methods such as infrared or ultraviolet spectroscopy, mainly because the collected Raman signal is not significantly affected by the presence of water in the sample [17]. Biofingerprint detection based on Raman spectroscopy is mainly used in medical applications such as the detection of various types of cancer and the diagnosis of other diseases [17, 18]. It has also been used in previous studies to identify the presence of viral particles in samples [10, 12, 19, 20].

Although most countries have entered the endemic phase of the pandemic, where previously applied restrictions have been relaxed, there is still a need to contain COVID-19 to prevent a surge in cases. The public has seen the highly contagious and potentially deadly nature of COVID-19, especially for vulnerable individuals [21]. Therefore, there is a high demand for research and development of rapid and robust methods for detecting SARS-CoV-2. In this study, we analyze the Raman peaks present in negative and positive samples using spectral data obtained with a portable Raman system built in our laboratory for the purpose of SARS-CoV-2 detection. The proposed method would benefit the development of a pattern recognition model by utilizing the detected peaks, consequently improving the specificity and reducing the false negatives of the model.

2 Data and methods

2.1 Sample preparation

The study involved the analysis of 75 nasal swab samples collected from patients infected with SARS-CoV-2 as positive samples, and 75 nasal swab samples from healthy patients as negative samples. The samples were collected from Kuala Lumpur International Healthcare Centre Kota Kinabalu (KLIHCKK) and were stored at −18 °C after being placed in vials containing 3 mL of viral transport media (VTM). A blank VTM sample was also used in this study as a control sample, so that its spectral data could be compared with those of the positive and negative samples.

To prepare the samples for acquiring Raman spectra, 1 mL of each sample, including the blank VTM sample, was transferred into a cuvette using a pipette. This was done to ensure that the samples were homogeneously mixed and prepared for analysis. The samples were then analyzed using Raman spectroscopy to determine their molecular composition and identify any differences between the positive and negative samples. All necessary safety precautions were taken during the collection and handling of the samples to prevent any potential contamination or exposure to the virus. Additionally, the study was conducted in accordance with ethical guidelines and regulations to ensure the protection of patient privacy and confidentiality.

2.2 Spectral data acquisition

Figure 1 illustrates the simplified setup of the Raman system used in this study. The system consists of a 200-mW laser diode as the excitation source, emitting a laser beam with a wavelength of 532 nm. Raman scattered light from a sample is collected using a microscope objective with a numerical aperture of 0.22 and a magnification of 10×. The collected light then passes through a long-wave pass filter, which allows the transmission of Raman scattered light at longer wavelengths while rejecting Rayleigh scattered light.

Fig. 1 Simplified diagram for Raman system setup

To detect the normally weak Raman signal, a charge-coupled device (CCD) is used in the setup. The CCD is cooled down for 15 min before use to reduce the noise associated with it. An acquisition time of 15 s is used for every sample to further enhance the signal collected at the CCD.

Before the Raman system can be used for analysis, it must be calibrated to optimize the acquisition of Raman spectra. A standard sample is used for calibration to ensure that the signal acquired by the CCD matches the required Raman shift displayed by the computer with minimal error. In this study, polystyrene is used as the standard sample for calibration as it is readily available and has numerous Raman peaks that can be compared with the datasheet for polystyrene.
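As an illustration of what such a calibration can look like, the sketch below fits a linear correction that maps measured polystyrene peak positions onto reference positions by least squares. This is only one possible approach and is not necessarily the procedure used in this study. The reference positions are commonly tabulated polystyrene values and should be replaced by those of the datasheet actually used; the "measured" positions are synthetic and hypothetical.

```python
# Minimal calibration sketch (assumed approach, not the study's procedure).

def fit_linear_correction(measured, reference):
    """Least-squares fit of reference = slope * measured + intercept."""
    n = len(measured)
    mean_x = sum(measured) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in measured)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(measured, reference))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Assumed reference positions of polystyrene peaks (cm^-1); check the datasheet.
REFERENCE_CM1 = [620.9, 1001.4, 1031.8, 1602.3]

# Hypothetical instrument readings with a small linear distortion.
measured_cm1 = [1.002 * r + 1.5 for r in REFERENCE_CM1]

slope, intercept = fit_linear_correction(measured_cm1, REFERENCE_CM1)
corrected_cm1 = [slope * m + intercept for m in measured_cm1]
max_error = max(abs(c - r) for c, r in zip(corrected_cm1, REFERENCE_CM1))
print(f"slope={slope:.5f}, intercept={intercept:.3f}, max error={max_error:.2e} cm^-1")
```

A residual error that stays large after such a fit would suggest that a linear correction is not sufficient for the instrument.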

Overall, the Raman system used in this study was carefully calibrated and optimized to ensure the accurate acquisition of Raman spectra from the collected samples. This setup allowed for the reliable comparison and analysis of Raman spectra between positive and negative samples.

2.3 Processing of Raman spectra

The Raman system shown in Fig. 1 is used to acquire Raman spectral data from the collected samples. The Raman shift range covered by the system is from 200 to 3800 cm−1. The acquired spectral data is processed using MATLAB software to observe the Raman peaks.
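For orientation, the Raman shift range can be related to the wavelength of the scattered light. The sketch below assumes Stokes scattering (scattered light at longer wavelengths than the 532 nm excitation, consistent with the long-wave pass filter in the setup) and uses the standard relation between Raman shift and wavelength; the function names are illustrative only.

```python
# Conversion between Raman shift (cm^-1) and scattered wavelength (nm),
# assuming Stokes scattering.

def stokes_wavelength_nm(excitation_nm, shift_cm1):
    """Wavelength of Stokes-scattered light for a given Raman shift."""
    excitation_wavenumber = 1e7 / excitation_nm  # nm -> cm^-1
    scattered_wavenumber = excitation_wavenumber - shift_cm1
    return 1e7 / scattered_wavenumber


def raman_shift_cm1(excitation_nm, scattered_nm):
    """Raman shift corresponding to a scattered wavelength."""
    return 1e7 / excitation_nm - 1e7 / scattered_nm


EXCITATION_NM = 532.0
low_nm = stokes_wavelength_nm(EXCITATION_NM, 200.0)
high_nm = stokes_wavelength_nm(EXCITATION_NM, 3800.0)
print(f"200 cm^-1 -> {low_nm:.2f} nm, 3800 cm^-1 -> {high_nm:.2f} nm")
```

Under these assumptions, the 200 to 3800 cm−1 range corresponds to scattered wavelengths of roughly 538 to 667 nm.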

In this study, the spectral data between 400 and 2000 cm−1 is of particular interest, as Raman peaks located in this range indicate the presence of functional groups found in viral particles [22]. Therefore, the acquired spectral data is trimmed to this range.

Next, background noise is removed from the spectral data using a Savitzky–Golay filter with a third-order polynomial and a window of seven points. This filter smooths the spectral data so that Raman peaks can be clearly observed. The smoothed data then undergoes baseline correction to remove wide peaks that might be the product of background fluorescence [12]. The baseline is regressed using a spline approximation within windows of width 150 cm−1. This ensures that sharp peaks are preserved while wider humps are removed from the spectral data, leading to a clearer and more accurate representation of the Raman peaks. Figure 2 illustrates the Raman spectrum of a sample before and after the Savitzky–Golay filter and baseline correction.
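The processing chain described above (trimming, smoothing, baseline correction) can be sketched in plain Python as follows. This is a simplified illustration, not the MATLAB code used in the study. It assumes evenly spaced Raman shifts in ascending order, pads the edges by repeating the end values, and, as a plainly named substitute for the spline approximation, estimates the baseline by linear interpolation between the minima of consecutive 150 cm−1 windows. The spectrum is synthetic.

```python
import math

# 7-point, third-order Savitzky-Golay smoothing weights (they sum to 21).
SG_WEIGHTS = (-2, 3, 6, 7, 6, 3, -2)
SG_NORM = 21


def trim(shifts, intensities, low=400.0, high=2000.0):
    """Keep only the points whose Raman shift lies in [low, high]."""
    kept = [(s, y) for s, y in zip(shifts, intensities) if low <= s <= high]
    return [s for s, _ in kept], [y for _, y in kept]


def savgol_smooth(y):
    """Smooth evenly spaced data; edges are padded by repeating end values."""
    half = len(SG_WEIGHTS) // 2
    padded = [y[0]] * half + list(y) + [y[-1]] * half
    return [
        sum(w * padded[i + k] for k, w in enumerate(SG_WEIGHTS)) / SG_NORM
        for i in range(len(y))
    ]


def estimate_baseline(shifts, y, window=150.0):
    """Piecewise-linear baseline through the minimum of each window
    (a simplification of the spline approximation described in the text)."""
    anchors = []
    start = shifts[0]
    while start <= shifts[-1]:
        idx = [i for i, s in enumerate(shifts) if start <= s < start + window]
        if idx:
            j = min(idx, key=lambda i: y[i])
            anchors.append((shifts[j], y[j]))
        start += window
    baseline = []
    for s in shifts:
        if s <= anchors[0][0]:
            baseline.append(anchors[0][1])
        elif s >= anchors[-1][0]:
            baseline.append(anchors[-1][1])
        else:
            for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
                if x0 <= s <= x1:
                    t = (s - x0) / (x1 - x0)
                    baseline.append(y0 + t * (y1 - y0))
                    break
    return baseline


# Synthetic spectrum: sloping background, one sharp peak at 1000 cm^-1, ripple.
shifts = [200.0 + 2.0 * i for i in range(1801)]
raw = [
    0.002 * s + 5.0
    + 10.0 * math.exp(-(((s - 1000.0) / 8.0) ** 2))
    + 0.1 * math.sin(12.9898 * i)
    for i, s in enumerate(shifts)
]

x, y = trim(shifts, raw)
smoothed = savgol_smooth(y)
baseline = estimate_baseline(x, smoothed)
corrected = [v - b for v, b in zip(smoothed, baseline)]
peak_shift = x[max(range(len(corrected)), key=corrected.__getitem__)]
print(f"{len(x)} points kept, strongest peak at {peak_shift:.0f} cm^-1")
```

With NumPy and SciPy available, `scipy.signal.savgol_filter(y, window_length=7, polyorder=3)` performs the same kind of smoothing with more careful edge handling.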

Fig. 2 Processing of Raman spectra using Savitzky–Golay filter and baseline correction

2.4 Selection and assignment of peaks from spectral data

To facilitate the selection of peaks based on a fixed prominence threshold, the intensity axis of spectral data is converted to relative intensity by setting the minimum and maximum values of the intensity axis to 0 and 1, respectively. Raman peaks are detected using the findpeaks function in MATLAB, which applies a set of criteria [12]. The findpeaks function is improved and modified in this study to identify the location at which half the prominence of each peak is taken, along with other parameters such as the peak’s prominence and width. This is important to ensure that peaks located within another peak are not counted. Peaks with a width larger than 10 cm−1 and a prominence larger than 0.2 in the Raman spectra are selected, while other peaks that may be attributed to background noise remaining after the smoothing procedure are removed.
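The selection criteria can be illustrated with a small stand-alone sketch. It is not the modified findpeaks function used in the study: it is a simplified pure-Python peak finder that normalizes the intensity to [0, 1], computes each peak's prominence, measures the width at half prominence by linear interpolation, and keeps peaks with a prominence above 0.2 and a width above 10 cm−1. The test spectrum is synthetic.

```python
import math


def min_max_normalize(y):
    """Rescale intensities so that the minimum is 0 and the maximum is 1."""
    lo, hi = min(y), max(y)
    return [(v - lo) / (hi - lo) for v in y]


def _half_level_crossing(shifts, y, peak, step, level):
    """Walk from the peak in direction `step` until y drops to `level`."""
    j = peak
    while 0 < j < len(y) - 1 and y[j] > level:
        j += step
    if y[j] > level:  # reached the signal edge without crossing
        return shifts[j]
    prev = j - step  # last sample still above the level
    t = (level - y[j]) / (y[prev] - y[j])
    return shifts[j] + t * (shifts[prev] - shifts[j])


def find_peaks(shifts, y, min_prominence=0.2, min_width=10.0):
    """Return local maxima that exceed the prominence and width thresholds."""
    peaks = []
    n = len(y)
    for i in range(1, n - 1):
        if not (y[i] > y[i - 1] and y[i] >= y[i + 1]):
            continue
        # Lowest point on each side before a strictly higher sample is met.
        left, left_min = i, y[i]
        while left > 0 and y[left - 1] <= y[i]:
            left -= 1
            left_min = min(left_min, y[left])
        right, right_min = i, y[i]
        while right < n - 1 and y[right + 1] <= y[i]:
            right += 1
            right_min = min(right_min, y[right])
        prominence = y[i] - max(left_min, right_min)
        if prominence <= min_prominence:
            continue
        level = y[i] - prominence / 2.0
        width = (_half_level_crossing(shifts, y, i, +1, level)
                 - _half_level_crossing(shifts, y, i, -1, level))
        if width > min_width:
            peaks.append({"shift": shifts[i], "prominence": prominence,
                          "width": width})
    return peaks


def gaussian(s, center, sigma, amplitude):
    return amplitude * math.exp(-0.5 * ((s - center) / sigma) ** 2)


# Synthetic spectrum: one broad strong peak, one narrow spike, one weak peak.
shifts = [400.0 + i for i in range(1601)]
intensity = [
    gaussian(s, 1000.0, 10.0, 1.0)
    + gaussian(s, 700.0, 2.0, 0.5)
    + gaussian(s, 1500.0, 10.0, 0.1)
    for s in shifts
]
peaks = find_peaks(shifts, min_max_normalize(intensity))
for p in peaks:
    print(f"{p['shift']:.0f} cm^-1, prominence {p['prominence']:.2f}, "
          f"width {p['width']:.1f} cm^-1")
```

In this example only the peak at 1000 cm−1 is kept: the spike at 700 cm−1 is too narrow and the peak at 1500 cm−1 is not prominent enough.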

Next, the experimental peaks of positive samples are assigned based on literature peaks obtained from previous studies [22,23,24]. Only peaks that coincide or are in proximity to the literature peaks associated with positive samples are assigned. This step is repeated for negative samples, and the presence of peaks is compared between positive and negative samples.
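The assignment step can be sketched as a nearest-match search with a tolerance. The 10 cm−1 tolerance is the one stated in Section 3.2; the literature positions below are a small subset of those discussed in that section, and the experimental positions are hypothetical values chosen for illustration.

```python
def assign_peaks(experimental, literature, tolerance=10.0):
    """Map each literature peak to the closest experimental peak that lies
    within `tolerance` cm^-1; literature peaks with no match are omitted."""
    assignments = {}
    for lit in literature:
        candidates = [p for p in experimental if abs(p - lit) <= tolerance]
        if candidates:
            assignments[lit] = min(candidates, key=lambda p: abs(p - lit))
    return assignments


literature_cm1 = [577.0, 748.0, 1048.0, 1554.0]   # subset, for illustration
experimental_cm1 = [575.2, 760.5, 1051.0, 1300.0]  # hypothetical sample

assigned = assign_peaks(experimental_cm1, literature_cm1)
for lit, exp in assigned.items():
    print(f"literature {lit:.1f} cm^-1 <- experimental {exp:.1f} cm^-1")
```

Here the experimental peak at 760.5 cm−1 is left unassigned because it lies 12.5 cm−1 from the nearest literature peak.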

3 Results and discussion

3.1 Detection of peaks in positive and negative samples

The peaks of all 75 positive and 75 negative samples were detected based on the criteria described in Sect. 2.4. As shown in Table 1, positive samples exhibited more detected peaks than negative samples, suggesting the presence of Raman peaks that are exclusive to positive samples. However, further assignment of these peaks is necessary to verify that they are indeed unique to positive samples. This is particularly important because the ranges overlap: some positive samples have as few as 5 detected peaks, while some negative samples have up to 19. Therefore, rigorous peak assignment is crucial to ensure accurate differentiation between positive and negative samples.

Table 1 Descriptive statistics for number of peaks detected in samples

3.2 Assignment of peaks from positive and negative samples

Table 2 lists, for each literature peak obtained from previous studies [22,23,24], the number of positive and negative samples in which an experimental peak coincides with or lies close to that literature peak. Between 2 and 15 peaks in positive samples, and between 1 and 6 peaks in negative samples, are assigned based on the literature peaks listed in Table 2.

Table 2 Peak assignment on experimental peaks of positive and negative samples

Figure 3a and b show the Raman spectra of a selected positive sample and a selected negative sample, respectively, with the peaks detected by MATLAB highlighted in grey. Figure 3c shows the Raman spectrum of a blank VTM sample. The peaks highlighted in brown are assigned based on the literature peaks in Table 2. Peaks that coincide with a literature peak or deviate from it by no more than 10 cm−1 are assigned. Analysis of all peaks in positive samples shows that peaks at around 501.5, 509, 577, 656.2, 716, 748, 782.5, 846.5, 951.5, 983.7, 1000–1060, 1207, 1249, 1317, 1384, 1453 and 1554 cm−1 are present in significantly more positive samples than negative samples. Since less than 10% of the samples containing these peaks are negative samples, these peaks can be used to identify positive samples with a possibility of false positive detection on negative samples below 10%. Meanwhile, analysis of all peaks in negative samples shows that peaks at 1126, 1155, 1263, 1600.7 and 1665–1669 cm−1 are present in more negative samples than positive samples. These peaks can be used to detect negative samples. However, some of these negative peaks are also present in positive samples. Using them in pattern recognition might cause these positive samples to be identified as negative samples, leading to false negatives. Peaks located in proximity to 1340.4, 1378 and 1492.1 cm−1 are not considered, as these peaks are also significant in blank VTM samples.
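The 10% criterion used above amounts to a simple ratio, sketched below. The counts are hypothetical and serve only to show the arithmetic; the actual counts are those reported in Table 2.

```python
def negative_share(n_positive_with_peak, n_negative_with_peak):
    """Fraction of the samples containing a peak that are negative samples."""
    total = n_positive_with_peak + n_negative_with_peak
    if total == 0:
        raise ValueError("peak not present in any sample")
    return n_negative_with_peak / total


# Hypothetical counts for one peak: 40 positive and 3 negative samples.
share = negative_share(40, 3)
is_positive_marker = share < 0.10
print(f"negative share = {share:.1%}, usable as positive marker: {is_positive_marker}")
```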

Fig. 3 Raman spectra for (a) a selected positive sample, (b) a selected negative sample, and (c) a blank VTM sample

Raman peaks situated in close proximity to 577, 656.2, 748, 951.5, 983.7, 1048, 1207, and 1554 cm−1 are prominent in significantly more positive samples and are attributed to amino acids that are present in coronavirus. Experimental peaks close to 577, 748, 951.5, 983.7 and 1554 cm−1 originate from tryptophan (Trp) found in positive samples, but not in negative samples. Another peak at 1048 cm−1 (situated between 1000 and 1060 cm−1) that is prevalent in positive samples corresponds to aromatic amino acids, particularly phenylalanine (Phe) and Trp [23]. The environment rich in these amino acids in positive samples is due to the importance of Phe and Trp in developing the viral protein structure or in interacting with physiologically expressed molecules [25]. However, it is also possible that these peaks are due to other amino acids, such as cysteine (Cys) at 577 cm−1 and valine (Val) at 983.7 cm−1.

Some of the Raman peaks (close to 509, 922, 1155, and 1453 cm−1) are associated with glycoproteins and lipids. Many positive and negative samples have a Raman peak located near 1155 cm−1, assigned to glucose, which is due to assembled and fragmented viral particles present in positive and negative samples, respectively [23]. In contrast, the other Raman peaks are more prominent in positive samples, indicating that these peaks are due to assembled viral particles represented by physio/pathological modifications of lipids and glycoproteins in contact with the virus [26, 27].

Amide bands in the protein structure are represented by Raman peaks at 1263 and 1665–1669 cm−1 in this experiment. The amide I and III bands present in some positive and negative samples correspond to β-sheets/coils. This is consistent with the previous study by Huang et al., which stated that amide bands in Raman spectra of SARS-CoV-2 show more obvious β-sheets/coils compared to α-helices [24]. Not all positive samples exhibit amide I and III bands in their Raman spectra. This is a common occurrence, as stated by Sanchez et al. [22] and observed in several previous studies [23, 28, 29], because whether amide bands appear in Raman spectra depends on the size of the amino acid side chain [28]. In addition, the absence of amide bands in many positive samples is also due to deactivation of the virus, which might break amino acids. Some negative samples also exhibit amide I and III bands, which are attributed to fragmented proteins that may still circulate during convalescence and for weeks after infection.

Table 3 summarizes the Raman peaks that are shown to be significant in positive and negative samples. Analysis of all 75 positive and 75 negative samples shows that peaks associated with the presence of SARS-CoV-2 are present in significantly more positive samples than negative samples. Because of this, these peaks can be used to distinguish positive samples from negative samples with minimal occurrence of false positives. In contrast, while peaks associated with the absence of SARS-CoV-2 are present in negative samples, these peaks are also prominent in positive samples. The use of these peaks to distinguish negative samples may lead to considerable false negatives. To ensure that the peaks can be reliably incorporated into the development of a pattern recognition model, only peaks associated with positive samples should be considered, which minimizes false negatives due to peaks in negative samples. These peaks can be used with high specificity as they are not significantly present in blank VTM samples, as shown in Fig. 3c.

Table 3 Summary of Raman peaks significant in positive and negative samples

4 Conclusion

Peak assignment on Raman spectra of positive and negative samples was performed to serve as a database for rapid detection of SARS-CoV-2 in samples using pattern recognition. The assigned peaks were analyzed in this study to determine the significant peaks that can be used to distinguish positive and negative samples based on their Raman spectra. Peaks associated with positive samples are reliable in identifying positive samples. As these peaks are prominent in positive samples but not in negative samples or in the blank VTM sample, they can be implemented in a pattern recognition model with reduced false negatives and high specificity.