Background

Since René-Théophile-Hyacinthe Laennec invented the stethoscope in the nineteenth century, the interpretation of internal body sounds has made it a ubiquitous part of a doctor's uniform. As stethoscopes are easy to handle, inexpensive and non-invasive, they can provide valuable clinical information at any level of care, and auscultation is a fundamental step in the preliminary clinical diagnosis and early assessment of pulmonary disease, shortening delays in diagnosis and emergency management [1]. However, the utility of this tool ultimately depends on the user's perceptual capacity to discriminate and interpret pathological sound patterns across the respiratory cycle. Follow-up comparative assessments then rely on the user's ability to remember these patterns from a previous time point, or to mentally reconstruct them from descriptions written in clinical notes. As this interpretation is a highly subjective skill, inter-listener variability limits reproducibility, with accuracy ranging widely with experience and differing across specialities [2, 3]. Discrepancies also arise from a lack of standardisation in nomenclature [3], rendering the results equivocal and/or incomprehensible to other caregivers. Further heterogeneity may originate from differences in the intrinsic properties of the stethoscope and from extrinsic factors such as obesity, ambient noise and patient compliance (e.g., a crying child).

To better standardise the detection of abnormal lung sounds, there has been recent interest in digitising respiratory sound acquisition with electronic stethoscopes [4,5,6,7] and analysing the recordings with artificial intelligence (AI) methods such as deep learning [8,9,10,11,12,13]. For coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), AI methods have revealed clear patterns in the radiological presentation [14,15,16,17,18,19], and preliminary evidence on the predictive capacity of respiratory sound is emerging. For instance, simple models such as logistic regression and support vector machines (SVMs) were able to predict the diagnosis of COVID-19 from breath and cough sounds crudely collected on a mobile application, with an area under the curve (AUC) of around 70% [20]. Another group achieved above 95% sensitivity and specificity in discriminating COVID-19 coughs from those of other pathologies as well as from healthy subjects [21]. However, no evidence exists on the potential of digital lung sounds for early detection and, more importantly, risk stratification in COVID-19. Indeed, while around 80% of infections are either asymptomatic or self-resolving after a few weeks of mild disease, the remaining 20% can rapidly progress to acute respiratory distress syndrome with a poor prognosis and high mortality [22, 23]. Hence, early risk stratification is crucial for prompt referral and early intervention, as well as for the appropriate allocation of limited hospital resources.

Owing to the initially non-specific symptomatic presentation of COVID-19, diagnosis and risk stratification are based on more objective paraclinical exams: reverse transcription polymerase chain reaction (RT-PCR) to detect the genetic signature of SARS-CoV-2 in nasopharyngeal swabs for diagnosis, and high-resolution chest computed tomography (CT) to estimate prognosis [24, 25]. However, even RT-PCR may yield false-negative results [26], and in times of high transmission, testing backlogs can render turnaround times clinically irrelevant [27]. Risk stratification by CT, on the other hand, is inconvenient for several reasons. Firstly, these machines are usually housed in centralised, high-level healthcare infrastructures, inappropriate for triage, and necessitate the transfer and handling of potentially infectious patients. Secondly, they expose patients to ionising radiation. Thirdly, the cost and skill required to acquire and operate them have made CT scanners rare in many parts of the world.

Thus, we aim to develop a set of early diagnostic and risk-stratification algorithms for COVID-19 from lung auscultations. To this end, we will collect standardised digital lung recordings from patients triaged for COVID-19 testing at screening sites or already hospitalised for diagnosed COVID-19. We hypothesize that early diagnostic and prognostic acoustic signatures of COVID-19 can be detected by deep learning independently of the caregiver’s auscultation skills, thus better standardising decision making and resource allocation.

Methods

Study design

This is a single-centre population-based study divided into two aims: a case–control study for diagnostic classification, and a prospective cohort study for risk stratification. For diagnosis, cases will be those testing positive for SARS-CoV-2 by RT-PCR on a nasopharyngeal swab (sensitivity 89.0%, specificity 99.7%) [28] at any time during the 14 days following enrolment. Thus, if a patient is initially classified as COVID-negative and becomes positive within 14 days of enrolment, the initial test result is considered a false negative and the patient is re-classified as COVID-positive. Controls will be those presenting at triage who consistently receive negative test results during the 14 days following their enrolment. In addition to triage sites, COVID-positive hospitalised patients (not in intensive care) will also be enrolled. RT-PCR tests will be performed and repeated according to public health guidelines, as described in Table 1. For risk stratification, COVID-positive patients will be separated into the ordinal severity classes shown in Table 2.
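For concreteness, a minimal sketch of this labelling rule in Python (the per-patient test log, function and variable names are illustrative, not part of the protocol):

```python
from datetime import date, timedelta

def final_covid_label(enrolment: date, tests: list[tuple[date, bool]]) -> str:
    """Label a patient COVID-positive if any RT-PCR result in the 14 days
    following enrolment is positive; an initially negative patient who
    converts within this window is re-labelled (the initial result is
    treated as a false negative). Otherwise the patient is a control."""
    window_end = enrolment + timedelta(days=14)
    positive_in_window = any(
        result for test_date, result in tests
        if enrolment <= test_date <= window_end
    )
    return "COVID-positive" if positive_in_window else "COVID-negative"
```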

Table 1 Study schedule
Table 2 Outcome groups

Population

Inclusion criteria are consecutive consenting adult patients (i.e. aged ≥ 16 years) meeting COVID-19 testing criteria. At the time of writing, the testing criteria are close contact (less than 1.5 m for a total of 15 min or more) with a documented SARS-CoV-2 infection, and/or any of the following symptoms: cough, dyspnoea, fever, sudden loss of taste/smell, or flu-like symptoms (i.e. sore throat, runny or stuffy nose, muscle or body aches, headaches, fatigue, etc.) [29]. Exclusion criteria are: (1) oxygen supplementation greater than 10 L/min delivered by any device (due to major modification of the auscultatory sounds), (2) mechanical ventilation (likewise due to major modification of the auscultatory sounds), (3) severe illness requiring hospitalisation in an intensive care unit, (4) inability to be mobilised for posterior auscultation, and (5) known or suspected immunodeficiency and/or immunotherapy. Because the symptoms of COVID-19 are non-specific, the COVID-negative patients will span a vast range of differential diagnoses; these are not recorded, as we aim instead to represent all patients attending a screening facility irrespective of the differential diagnosis.

Data collection

For each patient, demographic and clinical data will be collected, including age, sex, medical history, pre-existing diseases known to predispose to poor outcomes in COVID-19 (according to the US Centers for Disease Control and Prevention [30]), and the signs and symptoms of the current episode. Lung auscultations will be recorded with a Littmann 3200 electronic stethoscope (3M Health Care, St. Paul, USA) using the proprietary Littmann StethAssist software v.1.3. Digital lung auscultations will be performed at six thoracic sites for at least 30 s each: four posterior (left and right apical and basal zones) and two axillary (left and right). Anterior sites will not be auscultated, both to reduce the interference of heart sounds and to prevent airborne transmission of SARS-CoV-2 to investigators. The audio files and patient clinical data will be encoded as anonymised files by the local investigators and uploaded to the REDCap server hosted by the hospital (REDCap, Vanderbilt University, Nashville, TN, USA; https://www.project-redcap.org/resources/citations/). Patients positive for COVID-19 will then be followed up until outcome (i.e., discharge, hospitalisation, intubation and/or death), which serves as the label for prognostic risk stratification. Hospitalised patients will have repeat lung auscultations every 48 h until discharge, ICU referral or death. The study schedule is detailed in Table 1.

Sample size calculation

Deep learning will be used to decompose the audio signals into meaningful parameters. Each patient will provide six recordings of 30 s each. Sample sizes are estimated for the training and test sets such that performance (sensitivity, specificity, area under the receiver operating characteristic [ROC] curve, and accuracy) can be derived with minimal variance on a stable training curve.

Assuming discriminative power similar to that of previous work (personal communication) distinguishing healthy from pathological lung sounds in pneumonia, which used 80 patients in balanced classes (40 pathological and 40 control) with 8 auscultation sites of 30 s each, we estimate that convergence above 80% AUC-ROC with 10% variability can be achieved with the same number of patients in each class (i.e. for diagnosis: 40 COVID-positive and 40 COVID-negative; for risk stratification: 40 severe and 40 non-severe). Starting from the risk-stratification requirement of 40 COVID-positive patients with severe disease, and given that 20% of COVID-positive patients are expected to be classified as "severe" (requiring hospitalisation) [31], at least 200 COVID-positive patients must be recruited. As the expected positivity rate at these recruitment sites averages around 20%, at least 1000 enrolments are needed to secure 200 COVID-positive patients (Fig. 1).
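This back-calculation can be summarised in a few lines (rates taken from the text above; a sketch, not study code):

```python
import math

needed_severe   = 40    # severe COVID-positive patients required per class
severe_rate     = 0.20  # expected share of positives becoming severe [31]
positivity_rate = 0.20  # expected test-positivity rate at recruitment sites

needed_positive  = math.ceil(needed_severe / severe_rate)        # 200
needed_enrolment = math.ceil(needed_positive / positivity_rate)  # 1000
```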

Fig. 1 Sample size required to secure at least 40 patients in each outcome group

Currently, around 4900 tests per week are performed in the targeted outpatient group at Geneva University Hospitals. With a test positivity rate of 11.4% as of 11 November 2020 [32], this corresponds to 4341 COVID-negative and 559 COVID-positive patients per week, meaning that recruitment should nominally take about one week. Assuming 50% recruitment consent and accounting for the lability of these epidemiological values, we anticipate a recruitment period of 1–2 months.
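The corresponding timeline arithmetic, under the same assumptions (the consent-adjusted estimate is our illustrative reading of the paragraph above):

```python
tests_per_week = 4900
positivity     = 0.114  # as of 11 November 2020 [32]
consent_rate   = 0.50

positives_per_week = tests_per_week * positivity           # ~559
negatives_per_week = tests_per_week - positives_per_week   # ~4341

# Weeks needed to reach both targets (1000 enrolments, 200 positives):
weeks_needed = max(1000 / (tests_per_week * consent_rate),
                   200 / (positives_per_week * consent_rate))  # ~0.7 weeks
```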

Human predictive baseline

To generate a human baseline for sound-based diagnosis and risk stratification, randomly selected lung sounds will be blinded to outcome and evaluated in random order by several clinicians (residents, fellows, professors and pulmonologists), who will be required to classify them as either COVID-positive or COVID-negative (with and without access to the corresponding clinical and demographic information). Once this is completed, the clinicians will be given only the set of COVID-positive samples and asked to stratify their risk (with and without access to the clinical symptomatology and demographic information collected on day 0, before any diagnostic tests are undertaken). Lung sounds will also be annotated by expert consensus for audible pathology types (wheezing, crackles, rhonchi, etc.).

A kappa statistic for inter-rater concordance will be computed for each evaluation, and an AUC-ROC reported against the outcome. The distribution of discriminative sound patterns identified by the human experts will be explored by unsupervised machine learning using k-means clustering (grouping sounds into k clusters, with k optimised using the elbow method). Objective sound patterns will thus be clustered by k-means in a vector space [33], and the distribution of human labels within these clusters will be reported to visualise the alignment between objective patterns and human interpretation.
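As an illustration, the agreement and clustering steps could look as follows (a sketch assuming scikit-learn and pandas; the label arrays and sound embeddings are assumed inputs, and the chosen k of 4 is arbitrary). Cohen's kappa is pairwise, so with more than two raters a multi-rater statistic such as Fleiss' kappa would be substituted:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import cohen_kappa_score

# Pairwise inter-rater agreement between two clinicians' binary labels
# (rater_a_labels, rater_b_labels: equal-length arrays of 0/1 labels):
kappa = cohen_kappa_score(rater_a_labels, rater_b_labels)

# Elbow method: fit k-means over a range of k and inspect the inertia curve.
X = np.asarray(sound_embeddings)  # one feature vector per sound
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 11)}

# After choosing k at the "elbow", report how human labels distribute
# across the objective clusters:
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
label_distribution = pd.crosstab(clusters, np.asarray(human_labels))
```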

The development of DeepBreath

DeepBreath is a deep learning algorithm for diagnosing and risk-stratifying COVID-19 from lung sounds. In preparation for these analyses, digital auscultations are cleaned to remove non-biological frequencies and amplitudes generated by ambient noise. The frequency range of normal lung sounds extends from below 100 Hz to 1000 Hz, with a sharp drop at approximately 100 to 200 Hz [1], whereas tracheal sounds extend from 100 to 5000 Hz. In the lower band (under 100 Hz), heart and thoracic muscle sounds overlap [34]. Abnormal lung sounds (wheezing, rhonchi, etc.) have characteristic frequencies and durations that differentiate them from each other [1]. In this study, all signals will be sampled at 16 kHz with 8-bit resolution, and the built-in filter will range from 20 to 2000 Hz. Heart and thoracic muscle sounds, as well as other low-frequency background noises, will be filtered out with a high-pass filter (cut-off frequency: 150 Hz).
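One plausible implementation of this filtering step, assuming SciPy (the protocol fixes only the 150 Hz cut-off, not the filter family or order, so the Butterworth design below is an assumption):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def highpass(audio: np.ndarray, sr: int = 16_000,
             cutoff: float = 150.0, order: int = 5) -> np.ndarray:
    """Zero-phase high-pass filter to suppress heart/muscle sounds and
    other low-frequency background noise below the cut-off frequency."""
    sos = butter(order, cutoff, btype="highpass", fs=sr, output="sos")
    return sosfiltfilt(sos, audio)
```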

The sounds will then be divided into overlapping time windows of between 1 and 10 s and transformed into Mel-frequency cepstral coefficients (MFCCs). Several data augmentation techniques will be explored, such as amplitude scaling, pitch shifting, and random time shifting. The effect of each pre-processing method will be tested, and the best-performing approach according to sensitivity and specificity will be reported. This dataset will then be fed into a convolutional neural network (CNN) with max pooling and dropout before binary classification by a support vector machine (SVM) into positive vs negative COVID-19 test results (diagnostic model) or hospitalisation/death vs outpatient/self-resolving (risk-stratification model). Longitudinal auscultations of hospitalised patients will also be used to assess the severity and progression of the disease, normalised to the time of symptom onset.
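A sketch of this pipeline under stated assumptions: the window length, MFCC count, augmentation parameters and CNN architecture below are illustrative choices, as the protocol specifies only the overall design (MFCC windows into a CNN with max pooling and dropout, feeding an SVM):

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def mfcc_windows(audio, sr=16_000, win_s=5.0, hop_s=2.5, n_mfcc=20):
    """Split a recording into overlapping windows; compute MFCCs per window."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    for start in range(0, max(len(audio) - win, 0) + 1, hop):
        yield librosa.feature.mfcc(y=audio[start:start + win], sr=sr,
                                   n_mfcc=n_mfcc)

def augment(audio, sr=16_000):
    """Example augmentations: amplitude scaling, pitch shift, time shift."""
    audio = audio * np.random.uniform(0.8, 1.2)
    audio = librosa.effects.pitch_shift(audio, sr=sr,
                                        n_steps=np.random.uniform(-1, 1))
    return np.roll(audio, np.random.randint(-sr, sr))

class CnnEncoder(nn.Module):
    """CNN with max pooling and dropout; its embedding feeds an SVM."""
    def __init__(self, emb_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Dropout(0.3), nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, emb_dim)

    def forward(self, x):  # x: (batch, 1, n_mfcc, time)
        return self.proj(self.features(x).flatten(1))

# The SVM head, e.g. from scikit-learn, then classifies the embeddings:
# from sklearn.svm import SVC
# svm = SVC(kernel="rbf").fit(train_embeddings, train_labels)
```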

Statistical analysis plan

Firstly, the collected data will be described. Continuous variables will be reported as means with their 95% confidence intervals (CIs). Categorical variables will be reported as proportions (percentages). Features will be compared between outcome groups for diagnosis (COVID-positive versus COVID-negative) and prognosis (within COVID-positive patients: outpatient, hospitalisation, or worsening requiring ICU referral/death) using logistic regression, with odds ratios and 95% CIs. Pearson's and Spearman's correlation coefficients will be used to assess relationships between normally and non-normally distributed continuous variables, respectively.
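For illustration, these comparisons could be computed as follows (a sketch assuming statsmodels and SciPy; the DataFrame and column names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# df: one row per patient; "covid" is the binary outcome, while "age" and
# "crackles" are hypothetical example features.
X = sm.add_constant(df[["age", "crackles"]])
fit = sm.Logit(df["covid"], X).fit()
odds_ratios = np.exp(fit.params)    # odds ratios
or_ci95 = np.exp(fit.conf_int())    # and their 95% CIs

# Correlations between continuous variables:
r, p_pearson = stats.pearsonr(df["age"], df["resp_rate"])   # normal
rho, p_spearman = stats.spearmanr(df["age"], df["crp"])     # non-normal
```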

Missing data will be zero-padded in the CNN and reported in the descriptive statistics. Missingness will also be assessed as a function of other features; features with more than 50% missing values, or with significant bias in missingness, will be removed. For the primary outcome, the ability of the DeepBreath algorithm to distinguish between COVID-19-positive and -negative patients, as well as between severity outcome groups, will be quantified using the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value, negative predictive value, and likelihood ratios, each presented with its 95% CI. The accuracy of the AI algorithm will be compared with human expert discrimination by sound. The optimism of the AI algorithm will be estimated as the difference in performance between the training and validation sets.
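A sketch of the threshold-based diagnostic metrics (assuming scikit-learn; confidence intervals, e.g. by bootstrap, are omitted for brevity):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def diagnostic_metrics(y_true, y_score, threshold=0.5):
    """AUC plus confusion-matrix-based diagnostic metrics at one threshold."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    return {
        "auc": roc_auc_score(y_true, y_score),
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr+": sens / (1 - spec),
        "lr-": (1 - sens) / spec,
    }
```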

All statistical tests will be two-sided with a type-I error risk of 5%. Data analysis will be carried out using GraphPad Prism, version 9 (GraphPad Software, San Diego, CA, USA) for graph figures, and R version 4.0 (R Foundation, Vienna, Austria) for descriptive statistics and statistical tests.

Discussion

Recent advances in deep learning show promise for supporting physicians in standardising the detection and interpretation of complex patterns in pulmonary disease. Artificial intelligence has been shown to outperform physicians in discriminating respiratory pathologies from functional respiratory explorations [35], symptoms [36, 37], and/or radiological examinations [38]. The development of AI algorithms for the analysis of respiratory acoustic signals has been proposed previously [9, 10]. It remains to be determined whether such an algorithm can serve as an initial, accurate screening tool for patients suspected of COVID-19. An algorithm of this kind has the potential to support early diagnosis, guide the allocation of resources, and identify those in need of early hospitalisation [39].

This study aims to collect a standardised dataset of digital lung auscultations and derive a deep learning model able to detect acoustic signatures of the presence and severity of COVID-19. We hypothesize that automated interpretation of lung auscultation could democratise the accuracy of this critical clinical exam, decoupling it from the individual capabilities of whichever health worker a patient may be fortunate (or unfortunate) enough to have. We plan to incorporate this algorithm into an autonomous digital stethoscope (currently under development) that could help decentralise high-quality respiratory examination and monitoring, and perhaps even empower patients to assess themselves, reducing the nosocomial infections that arise from the close contact of a traditional clinical exam. Making high-quality lung auscultation accessible to unskilled or decentralised actors could broaden COVID-19 mass screening and identify patients at earlier stages of disease, preventing transmission, allowing earlier pharmacological intervention, and refining the pre-test probability to better select candidates for further PCR testing (i.e. resource conservation). Additionally, such a tool could empower patients confined at home to self-monitor their symptoms, informing telemedicine and personalising care.

This study has several limitations. First, ICU inpatients with severe COVID-19 presentations requiring ventilatory support will not be included. This limits longitudinal auscultation of critically ill patients and prevents risk-stratification assessment of those who progress very unfavourably (or favourably) from that stage onwards. Second, as a single-centre study conducted in a high-income country with easy access to health care, caution will be required when assessing the generalisability of the results to different populations, especially in resource-limited countries and remote areas. Finally, our study population will be recruited primarily from an emergency department triage centre and COVID-19-dedicated hospitalisation units, and may therefore present more acute symptoms and pathological lung sounds than patients encountered in ambulatory care services.