To describe the clinical presentations at the time of diagnosis of SARS-CoV-2 infection in Costa Rica during the pre-vaccination period, we implemented a machine learning strategy to identify clusters or clinical profiles at the population level among 18,974 records of positive cases. Seven clinical profiles were identified, each with a specific composition of clinical manifestations and some patterns related to viral load (lower for the asymptomatic group), while the risk factors and the SARS-CoV-2 genomic features were distributed among all the clusters. This work was an observational study based on random samples taken at random points in time from suspected cases, most of whom had ongoing disease and reported symptoms.
The clinical manifestations of COVID-19 define a large spectrum of symptoms at the population level, as found in other studies (Kim et al. 2020; Fu et al. 2020; Sudre et al. 2021). Estimates of the features and proportion of the distinct clinical manifestations of COVID-19, including asymptomatic cases, are vital parameters for modeling studies (Byambasuren et al. 2020). In addition, early identification of symptoms is important for successful diagnosis, medical management, and treatment selection (Kostopoulou et al. 2015). This is a key point for health professionals who are in charge of gathering symptom information when testing patients (the time of diagnosis during the first point of contact), so that they can differentiate between the most and least prevalent clinical presentations of COVID-19 in a specific community. In this regard, we were motivated to conduct this study with symptoms in SARS-CoV-2-infected cases from Costa Rica during 2020 (the pre-vaccination period).
At the time of diagnosis, 18 symptoms were found to be present in at least 1% of the SARS-CoV-2-infected cases from Costa Rica, including non-specific symptoms (fever, headache, etc.), as well as respiratory and gastrointestinal manifestations. Using a machine learning approach, seven major clusters or clinical profiles were found with those symptoms to describe the manifestations at the population level. The clusters showed the expected heterogeneity in the clinical presentation among SARS-CoV-2-infected cases from Costa Rica, as has been observed worldwide according to hundreds of case reports (Dixon et al. 2021; Han et al. 2020; Sudre et al. 2021; Tong et al. 2020; Kim et al. 2020; Fu et al. 2020). In addition, six main symptoms define the clinical profiles (Fig. 3) and should be given particular attention when filling in a patient’s chart: cough, fever, headache, arthralgia, anosmia, and dysgeusia. Consistently, most of these manifestations are included in the limited number of symptoms that are known to be associated with infectious diseases (Jeon et al. 2020). Furthermore, the general description of the clinical manifestations can be used as part of the “case definition of COVID-19” given by the local and international epidemiological surveillance systems (World Health Organization 2021).
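The two steps described above (keeping symptoms present in at least 1% of cases, then summarizing each cluster by its symptom frequencies) can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the toy records, and the cluster labels are hypothetical, and the labels are assumed to come from whatever clustering method is applied upstream.

```python
from collections import Counter


def prevalent_symptoms(records, min_prevalence=0.01):
    """Return the symptoms reported in at least `min_prevalence` of the records.

    `records` is a list of sets, one set of symptom names per positive case
    (an empty set stands for an asymptomatic case).
    """
    n = len(records)
    if n == 0:
        return []
    counts = Counter(symptom for rec in records for symptom in rec)
    return sorted(s for s, c in counts.items() if c / n >= min_prevalence)


def cluster_profiles(records, labels, symptoms):
    """Return, for each cluster label, the fraction of its cases with each symptom."""
    sizes = Counter(labels)
    hits = {label: Counter() for label in sizes}
    kept = set(symptoms)
    for rec, label in zip(records, labels):
        for symptom in rec & kept:
            hits[label][symptom] += 1
    return {
        label: {s: hits[label][s] / sizes[label] for s in symptoms}
        for label in sizes
    }


# Toy data (hypothetical): four cases and three made-up cluster labels.
records = [{"cough", "fever"}, {"cough"}, set(), {"anosmia", "dysgeusia"}]
labels = ["A", "A", "B", "C"]

# A 50% threshold is used only so that the toy data filter something out;
# the study's threshold corresponds to min_prevalence=0.01.
kept = prevalent_symptoms(records, min_prevalence=0.5)
profiles = cluster_profiles(records, labels, kept)
```

A per-cluster frequency table of this kind is one simple way to read off which symptoms dominate each profile.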
A multivariable logistic regression and exploratory factor analysis by Dixon et al. (2021) determined five symptom clusters, among which ageusia, anosmia, and fever tend to be highly associated with SARS-CoV-2 infection, which resembles our findings in cluster C6. This is also consistent with a meta-analysis in which up to 52.73% and 43.93% of SARS-CoV-2-infected cases presented olfactory and gustatory dysfunction, respectively (Tong et al. 2020), symptoms that we also found in cluster C7. In a second cluster, Dixon et al. (2021) reported shortness of breath, cough, and chest pain, but only cough had a high frequency in our data (cluster C2), without being associated with the other two symptoms. Interestingly, another study showed that a range of respiratory symptoms was a significant predictor of test positivity in the diagnosis of SARS-CoV-2 infection (Kotsiou et al. 2021).
A third cluster reported by Dixon et al. (2021) was composed of fatigue, muscle ache, and headache. Of those symptoms, we only found headache as the main symptom in cluster C4. Finally, the last two clusters reported were represented by vomiting and diarrhea, and by a runny nose with a sore throat (Dixon et al. 2021). Neither of those two clusters coincides with our findings. As Fig. 3 shows, even if digestive symptoms are present among Costa Rican SARS-CoV-2-infected cases from C1 to C7, their frequency is very low. Nonetheless, this should not be neglected, as it has been reported that some individuals present digestive symptoms alone, which is of clinical relevance because those patients may take longer to achieve viral clearance than those with associated respiratory symptoms (Han et al. 2020).
In another work that used a similar machine learning approach to study COVID-19 symptoms, six temporal profiles were identified from self-reported data (Sudre et al. 2021). For a better comparison, their day 0 symptoms were contrasted with our findings. Interestingly, dysgeusia was not included as a main symptom in their study, even though it was the most prevalent one in our cluster C7. Cough and fever were found to be associated with the second cluster reported by Sudre et al. (2021) as well as with profile C3 in our study. Headaches were distributed among all the clusters in both studies.
Regarding risk factors, three chronic diseases were found among Costa Rican patients in all seven clusters. From most to least prevalent, the most significant conditions were high blood pressure, diabetes, and asthma. Interestingly, this finding is highly consistent with a meta-analysis by Yang et al. (2020), who reported that the most prevalent comorbidities among SARS-CoV-2 patients were hypertension (21.1%), diabetes (9.7%), cardiovascular disease (8.4%), and respiratory system disease (1.5%). Another study, which was based on environmental and health-related predictors for SARS-CoV-2 infection, revealed a vulnerability to COVID-19 in cases with previous pneumonia (Mouliou et al. 2021a), although this risk factor was not studied in our work. Taken together, it is clinically relevant to take these comorbidities into account when screening individuals tested for COVID-19. However, we identified no dependence between the comorbidities and the clinical profiles of SARS-CoV-2-infected cases. This result is in line with a meta-analysis that reported that up to 90% of clinical and demographic variables showed inconsistent associations with COVID-19 outcomes (Jeon et al. 2020).
Despite consulting several databases, we found no other works that applied machine learning to symptoms, risk factors, and SARS-CoV-2 genomic data of SARS-CoV-2-infected cases at the same time, and none that used the initial clinical profile at the time of diagnosis. Machine learning techniques prove to be a very useful approach to study the variety of COVID-19 symptoms when large data sets are available. This technique reduces the heterogeneity of the disease’s clinical presentation to a small number of profiles, and thus may help clinicians heighten vigilance for some specific symptoms over others.
On the other hand, the cluster of asymptomatic cases (C1) represents 18% of the total positive cases. This percentage is in line with other analyses in which the asymptomatic cases vary between 15 and 30% (Centers for Disease Control and Prevention US 2021; Byambasuren et al. 2020), although other studies found higher frequencies (Byambasuren et al. 2020; Lee et al. 2020). This variation can be explained by multiple factors, such as: (i) the definition of asymptomatic, which depends on a specific moment (at time of diagnosis in our case) but can eventually change during the infection, with distinct symptom onset turning these into pre-/post-symptomatic cases (Mouliou and Gourgoulianis 2022); in fact, these conditions have called into question the real existence of asymptomatic cases, as discussed in a recent study (Mouliou and Gourgoulianis 2022); (ii) the fact that diagnostic tests identify SARS-CoV-2 carriers and not necessarily COVID-19 patients (Mouliou et al. 2022b); and (iii) the possible existence of preexisting immunity from previous infection, which can affect the clinical outcome during reinfections or coinfections, although associations with a possible reduction of symptomatology are still being monitored (Mouliou and Gourgoulianis 2022; Molina-Mora et al. 2022).
The comparison of expected viral load between symptomatic and asymptomatic cases, using the Ct value, has also been reported as very variable (Tutuncu et al. 2021; Trunfio et al. 2021). Similar to our findings, in which the symptomatic groups had lower Ct values, another study reported that higher viral load was associated with more signs and symptoms at diagnosis and a more frequent pattern of respiratory and systemic complaints (Trunfio et al. 2021). However, other works have suggested no association between viral load and symptom status (Tutuncu et al. 2021; Lee et al. 2020). These very diverse patterns of Ct values and clinical outcomes are a drawback that can be explained by individual factors and technical issues. For a specific case (individual factors), as discussed before, some asymptomatic cases could be related to genetics, risk factors, or preexisting immunity from previous/concomitant infections (with possible effects on viral replication or viral shedding and finally on symptom onset or transmission) (Mouliou and Gourgoulianis 2022). In the case of sample processing (technical issues), the results can be affected by the technology, sample quality, and the time of sampling after infection (Buchan et al. 2020), as well as by the general performance of the rRT-PCR test, which is not error-free and can generate false positive and false negative results (Mouliou and Gourgoulianis 2021). Therefore, this complex scenario implies that there is no consensus on the relationship between the initial viral load and the clinical manifestations of COVID-19 (Trunfio et al. 2021; Byambasuren et al. 2020).
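As a reminder of why lower Ct values are read as higher viral load, the following Python sketch converts a Ct difference into an approximate fold difference in starting template. It is only an illustration under stated assumptions, not part of the analysis in this study: the function name is hypothetical, both Ct values are assumed to come from the same assay and protocol, and the default assumes ideal amplification (template doubling at every cycle).

```python
def fold_difference(ct_a, ct_b, efficiency=1.0):
    """Approximate ratio of starting template in sample A relative to sample B.

    Assumes both Ct values come from the same assay. `efficiency` is the
    per-cycle amplification efficiency; 1.0 means perfect doubling per cycle.
    """
    return (1.0 + efficiency) ** (ct_b - ct_a)


# Hypothetical values: a sample at Ct 20 versus a sample at Ct 25.
ratio = fold_difference(20.0, 25.0)
```

Under these assumptions, a difference of five cycles corresponds to roughly a 32-fold difference in template, which is why Ct values obtained with different kits or protocols are difficult to compare directly.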
Regarding the SARS-CoV-2 genotypes, our findings of independence between the clinical presentation of COVID-19 and the genomic determinants of the SARS-CoV-2 sequence are in line with other studies (Hodcroft et al. 2020; Grubaugh et al. 2020; van Dorp et al. 2020). For each cluster, a diversity of clades and lineages was identified, and the clinical profiles were also independent of the presence or absence of the mutation T1117I from the Costa Rican lineage B.1.1.389 (Molina-Mora et al. 2021). This situation reminds us that the clinical profiles depend on both the viral agent and human host conditions. Human genetics, comorbidities, and risk conditions have been described as the predominant factors in the clinical outcome of COVID-19, as found in several studies (LoPresti et al. 2020; Sironi et al. 2020; Toyoshima et al. 2020; Molina-Mora et al. 2021).
Furthermore, given the distribution of SARS-CoV-2 genotypes among all the clusters, our results suggest that genomic features of the virus are not associated with specific changes in the clinical presentation, as has been reported recently, including for relevant variants (Nakamichi et al. 2021; Graham et al. 2021). The lack of change in symptoms for different SARS-CoV-2 genotypes also indicates that existing testing and surveillance infrastructure does not need to change specifically for these versions of the SARS-CoV-2 genome (Graham et al. 2021).
Our analyses presented some limitations that must be taken into account in the interpretation of results: (1) classification of positive cases of COVID-19 was based on the positivity of an rRT-PCR for nasopharyngeal samples, i.e., we depended on the performance of the test and sample quality; (2) Ct values were obtained with distinct RT-PCR test kits (not performed in the same lab or with the same protocols), thus Ct value comparisons can be affected by these differences; (3) records were retrieved from a local database (with predefined symptoms) with some missing information, mainly for SARS-CoV-2 genomic data or other potential symptoms (i.e., dyspnea, sputum production, neck pain, shiver); (4) this study was based on symptoms present in at least 1% of the patients, and rare or low-frequency symptoms were not included in the clustering analysis (see “Materials and Methods”); (5) cases corresponded to random samples at random points in time, which were considered as a single group, without consideration of a particular symptomatology for reinfection or coinfection cases; and (6) data on social behavior or genetic factors of the host were not considered in this study.
Finally, because mass vaccination started in January 2021 in Costa Rica (the first doses were applied at the end of December 2020), we consider that this study offers a distinctive overview of COVID-19 in the pre-vaccination period (2020). In future work, we hope to assess vaccination status and how it has influenced the clinical profiles of SARS-CoV-2-infected cases during 2021.