Introduction

Radiology is the medical speciality that uses different imaging techniques to detect and treat diseases. One of the most widely used techniques in this field is radiography (X-ray), which generates images of the inside of the body and has the advantage of being quick to perform and requiring low doses of radiation1,2. Within the scope of X-rays, one of the most common tests is the chest X-ray3,4,5,6, used to detect pulmonary and cardiovascular diseases, among others.

Despite the importance of this speciality and the volume of radiological tests performed, the shortage of radiologists able to interpret them7,8,9,10,11, as well as the increase in interpretation errors as a consequence of heavy workloads, is becoming increasingly evident12,13,14,15. Furthermore, radiological diagnosis is a component of clinical competencies in various specialities, including family and community medicine, whose practitioners commonly interpret chest X-rays despite their lower degree of expertise16. This highlights the need, and therefore the opportunity, to introduce tools such as artificial intelligence (AI) into this field to support radiologists and other healthcare practitioners who need to interpret an X-ray.

Technological support tools, such as computer-aided diagnosis (CAD), have long been used in this area. However, the advent of machine learning and deep learning models in recent years has offered the potential to develop new support tools that aim to overcome the primary limitations of CAD and enhance accuracy17,18. Unlike traditional CAD, deep learning models are designed to train on large databases, to improve continuously over time by learning from errors, and to detect more than one condition at a time, all of which makes them considerably more powerful.

In this context, AI algorithms, including deep and machine learning models, can assist in diagnosis, potentially enhancing diagnostic accuracy. While AI will not replace professionals, it is essential to acknowledge that its implementation in routine clinical practice must be both safe and effective20,21. One current concern is that many studies on applications of new AI models present only in silico validation, a phenomenon called “digital exceptionalism”, without performing external validation in the actual implementation environment. External validation is important because it allows the accuracy of the model to be estimated in a population different from the training population, selected in real clinical practice, thus allowing the subsequent generalisation of the results22,23,24,25,26.

A recent study conducting external validation of an AI algorithm designed to classify chest X-rays as normal or abnormal, using a cohort of real images from two primary care centres, indicated that additional training with data from these environments was required to enhance the algorithm’s performance, particularly in clinical settings different from its initial training environment27.

In this context, external validation, crucial for ensuring non-discrimination and equity in healthcare, should be a key requirement for the widespread implementation of AI algorithms. However, it is not yet specifically mandated by European legislation (Regulation 2017/745) and is not a prerequisite for marketing an AI algorithm28. For this reason, different groups of experts around the world have developed guidelines stipulating the essential requirements for using AI algorithms as a complementary diagnostic tool. Yet, while these expert groups emphasise the necessity of external validation to confirm an algorithm's practical potential in clinical settings, this requirement is not yet mandated in the Regulation26,29,30.

The implementation of AI in healthcare appears to be an imminent reality that can offer significant benefits to both professionals and the general population. However, it is essential to implement safe and validated tools in real clinical settings to maintain fairness. Accordingly, this study aims to externally validate an AI algorithm's diagnoses in real primary care settings by comparing them with a radiologist’s diagnoses, as well as to identify diagnoses for which the algorithm may not have been trained.

Materials and methods

The study protocol has been previously described and published32. Nevertheless, the most relevant points of the present study are described below.

The ChestEye AI algorithm

Oxipit33, one of the leading companies in AI medical image reading, has developed a fully automatic computer-aided diagnosis (CAD) AI algorithm for reading chest X-rays, trained with more than 300,000 images and available through a web platform called ChestEye. The ChestEye imaging service has been certified as a Class II medical device on the Australian Register of Therapeutic Goods and has also been CE marked33. The web platform reads the inserted chest X-ray and returns an automatic report, with the ability to detect 75 conditions covering 90% of diagnoses, as well as a heat map showing the locations of the findings. Thus, ChestEye allows radiologists to analyse only the most relevant X-rays33,34.

Study design

We conducted a prospective observational study for the external validation of the AI algorithm in a region of Catalonia, including users who were scheduled for chest radiography at the Osona Primary Care Centre. For each user, the report of the reference radiologist (considered the gold standard) was obtained. Subsequently, the research team input the image into the AI algorithm to obtain its diagnosis. This allowed the AI's performance to be compared with the reference standard in terms of accuracy, sensitivity, specificity, positive predictive value and negative predictive value.

Description of the study population, time frame, and data collection

The study was carried out at the Catalan Institute of Health’s Primary Care Centre Vic Nord (Osona, Catalonia, Spain), a reference centre where all chest X-rays in the region are performed (with a coverage of 125,000 users). At this same centre, convenience recruitment was carried out from 7 February 2022 to 31 May 2022. The study was explained, and the information sheet and informed consent were given to all patients who came for a chest X-ray and met the inclusion criteria32.

The reference population of the study was the entire population of the Osona region who underwent chest X-rays at the study centre and agreed to participate in the study. The study included only anteroposterior chest X-rays of patients over 18 years of age, and excluded pregnant women and poor-quality chest X-rays (poor exposure, non-centred or rotated images).

Sample size

Due to a problem with the image collection centre, the sample size calculated in the protocol32 could not be reached. For this reason, the sample was recalculated by increasing the precision level by one percentage point. Thus, in order to validate the algorithm, a sample of 450 images was needed to estimate an overall accuracy expected to be around 80%, with a 95% confidence level, a precision level of 5% and an allowance of 15% for replacements.
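For reference, this type of calculation can be sketched in R using the standard normal-approximation formula for a single proportion. The protocol32 does not specify the exact method or how the replacement allowance was incorporated, so the formula and inputs below are assumptions and the resulting figure is purely illustrative:

```r
# Minimal sketch of the standard single-proportion sample-size formula.
# The protocol32 may have used a different method or inputs, so the
# figure produced here is illustrative rather than the study's own.
sample_size <- function(p, d, conf = 0.95, replacements = 0.15) {
  z <- qnorm(1 - (1 - conf) / 2)          # ~1.96 for a 95% confidence level
  n <- ceiling(z^2 * p * (1 - p) / d^2)   # images needed for precision d
  ceiling(n / (1 - replacements))         # inflate for expected replacements
}

sample_size(p = 0.80, d = 0.05)  # expected accuracy 80%, precision 5%
```

Note that the exact figure obtained depends on the precision level and the inflation method assumed.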

Procedure

Once recruitment was completed, the Technical Service of the Catalan Institute of Health of Central Catalonia automatically extracted the patients’ anonymised images and their corresponding non-anonymised reports. The images and reports were then coded together so that they could be matched.

The research team then entered all the images into the AI model to extract its interpretation (a diagnosis or the absence of abnormalities). At the same time, three general practitioners interpreted the reference radiologists’ reports, without seeing the images in order to avoid assumptions, with the aim of extracting the diagnoses described.

Finally, the group of general practitioners grouped all the conditions detectable by the AI model into 9 categories according to the anatomy of the thorax, in order to allow both an individual and a grouped analysis. The categories were: external implants, mediastinal findings or conditions, cardiac and valvular conditions, vessel conditions, bone conditions, pleural or pleural space conditions, upper abdominal findings or conditions, pulmonary parenchymal findings or conditions, and others.
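As an illustrative sketch, this kind of grouping can be represented as a simple lookup table. The assignments below are examples inferred from the category names, not the study's actual mapping (the grouping was performed manually by the general practitioners):

```r
# Illustrative condition-to-category lookup table (example assignments
# only; the algorithm's full list of 75 conditions is not reproduced here).
category <- c(
  "sternal cables" = "external implants",
  "enlarged heart" = "cardiac and valvular conditions",
  "enlarged aorta" = "vessel conditions",
  "abnormal rib"   = "bone conditions",
  "consolidation"  = "pulmonary parenchymal findings or conditions",
  "nodule"         = "pulmonary parenchymal findings or conditions"
)

# Mapping the diagnoses of one image onto the grouped categories:
diagnoses <- c("consolidation", "abnormal rib")
unique(unname(category[diagnoses]))
```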

Statistical analysis

To validate the algorithm, the AI algorithm’s diagnoses were compared with those of the gold standard. The accuracy of the algorithm and the confusion matrix were obtained from the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Sensitivity and specificity were also calculated. These measurements were obtained for the total sample, for each condition and for each of the categories defined according to chest anatomy. Analyses were performed with R software version 4.2.1, and all confidence intervals were calculated at the 95% level.
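As an illustrative sketch of these calculations (not the study's actual analysis script), the metrics can be derived in base R from the four confusion-matrix counts. Exact Clopper-Pearson intervals via binom.test are an assumption, since the study does not state which CI method was used, and the counts below are hypothetical:

```r
# Illustrative sketch of the validation metrics (not the study's actual
# script). Exact Clopper-Pearson 95% CIs via binom.test are an assumption;
# the counts below are hypothetical.
metric_with_ci <- function(successes, total) {
  ci <- binom.test(successes, total)$conf.int          # exact 95% CI
  c(estimate = successes / total, lower = ci[1], upper = ci[2])
}

tp <- 14; tn <- 240; fp <- 5; fn <- 19                 # hypothetical counts

rbind(
  accuracy    = metric_with_ci(tp + tn, tp + tn + fp + fn),
  sensitivity = metric_with_ci(tp, tp + fn),           # TP / (TP + FN)
  specificity = metric_with_ci(tn, tn + fp),           # TN / (TN + FP)
  ppv         = metric_with_ci(tp, tp + fp),           # positive predictive value
  npv         = metric_with_ci(tn, tn + fn)            # negative predictive value
)
```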

Ethics committee

The University Institute for Research in Primary Health Care Jordi Gol i Gurina (Barcelona, Spain) ethics committee approved the study protocol (approval code: 21/288). Written informed consent was requested from all patients participating in the study.

Ethical considerations

Radiologists’ assessment and decisions were not influenced by this study, as the normal radiology referral workflow was not affected. This project was approved by the Research Ethics Committee (REC) from the Foundation University Institute for Primary Health Care Research Jordi Gol i Gurina (IDIAPJGol) (P21/288-P). The study was performed in accordance with relevant guidelines/regulations, and informed consent was obtained from all participants. All research was performed in accordance with the Declaration of Helsinki.

Results

Of the 471 patients who agreed to participate in the study and provide their images and reports, the final sample for external validation of the model comprised 278, mainly due to computer-related issues when extracting the data. In some cases, the automatic extraction did not retrieve both elements, i.e. either the image or the report was missing. In addition, some reports did not include an interpretation of the image because it was a follow-up X-ray; in these cases, the report only indicated whether or not there were changes with respect to the previous one, and they therefore had to be discarded from the analysis (Fig. 1). Of the final sample, 144 images (51.8%) showed no radiological abnormalities according to the radiologist's report.

Figure 1

Flow chart of the final sample of study images.

Although the final sample consisted of 278 images, a single image may be suggestive of one or more conditions or anatomical abnormalities. Consequently, the sum of the images without abnormalities plus the counts of the individual conditions does not correspond to the total sample (n = 278). Of the conditions or anatomical abnormalities for which the algorithm was trained, the AI model identified 33 in the sample; the most prevalent were nodule (n = 37, 13.3%), consolidation (n = 28, 10.1%), abnormal rib (n = 17, 6.1%), enlarged heart (n = 15, 5.4%) and aortic sclerosis (n = 8, 2.9%). The reference radiologist identified 35; the most prevalent were consolidation (n = 30, 10.8%), pulmonary emphysema (n = 20, 7.2%), enlarged aorta (n = 14, 5.0%), enlarged heart (n = 12, 4.3%), hilar prominence (n = 12, 4.3%) and linear atelectasis (n = 11, 3.9%) (Table 1).

Table 1 Description of the conditions or anatomical abnormalities of the 278 images and their respective diagnoses according to the radiologist and the AI algorithm.

Analysing the validity of the AI algorithm, the average accuracy was 0.95 (95% CI 0.92; 0.98), the average sensitivity was 0.48 (95% CI 0.30; 0.66) and the average specificity was 0.98 (95% CI 0.97; 0.99). The accuracy, sensitivity and specificity values for each condition can be seen in Table 2. The values for true positives, true negatives, false positives, and false negatives for each condition are presented in Table 1 of the supplementary information.

Table 2 Accuracy, sensitivity, specificity, positive and negative predictive values with a 95% confidence interval for each condition.

The reference radiologist identified a list of conditions for which the algorithm was not trained, and which were therefore classified as “Other”. The most prevalent were bronchial wall thickening (n = 13, 4.68%), fibroscarring lesions or abnormalities (n = 11, 3.96%) and chronic pulmonary abnormalities (n = 11, 3.96%) (Table 3).

Table 3 Description of conditions not contemplated by the AI model.

Figures 2 and 3 show some examples of the AI algorithm’s performance in cases where the algorithm’s diagnosis was successful and in cases where errors occurred.

Figure 2

Image of a patient (upper-left) in which, according to the radiologist's report, there is only consolidation, but the algorithm detects an abnormal rib (upper-right), consolidation (lower-left) and two nodules (lower-right). Note the confusion of a consolidation with mammary tissue and of two nodules with the two mammary areolae.

Figure 3

Image of a patient (left) in which the AI algorithm and the radiologist detected the same condition: consolidation (right).

In order to perform a more general analysis, the conditions were grouped into 10 groups according to chest anatomy (the nine anatomical categories described above plus images without radiological abnormalities), considering only the diagnoses for which the AI model was trained. According to the radiologist, the most prevalent groupings were lung parenchymal conditions (n = 71, 25.5%), bone conditions (n = 21, 7.5%) and vessel conditions (n = 18, 6.5%). According to the AI model, the most prevalent groupings were lung parenchymal conditions (n = 57, 20.5%), bone conditions (n = 21, 7.5%) and cardiac and/or valvular conditions (n = 15, 5.4%) (Table 4).

Table 4 Description of the conditions of the 278 images according to the radiologist and AI model, grouped according to chest anatomy.

Finally, Table 5 shows the accuracy, sensitivity and specificity values for each grouping. Of these, it is worth noting the low sensitivity values for mediastinal conditions (0.0, 95% CI 0.0; 0.96), vessel conditions (0.29, 95% CI 0.11; 0.52) and bone conditions (0.24, 95% CI 0.08; 0.47). In contrast, the highest sensitivity values were recorded for external implants (0.67, 95% CI 0.22; 0.96), upper abdominal conditions (0.67, 95% CI 0.30; 0.93) and cardiac and/or valvular conditions (0.67, 95% CI 0.35; 0.90). Also noteworthy is the model’s strong ability to detect images without radiological abnormalities. The values for true positives, true negatives, false positives and false negatives for each grouping are presented in Table 1 of the supplementary information.

Table 5 Accuracy, sensitivity and specificity values for each grouping.

Discussion

The aim of this study was to perform an external validation, in real clinical practice, of the diagnostic capability of an AI algorithm with respect to the reference radiologist for chest X-rays, as well as to detect possible diagnoses for which the algorithm had not been trained. The overall accuracy of the algorithm was 0.95 (95% CI 0.92; 0.98), the sensitivity was 0.48 (95% CI 0.30; 0.66) and the specificity was 0.98 (95% CI 0.97; 0.99). The results obtained further highlight, as indicated by different expert groups26,28,29, the need for external validations of AI algorithms in a real clinical context in order to establish the necessary measures and adaptations to ensure safety and effectiveness in any environment. Therefore, in the context of the model developed, it is important to understand and interpret what each of the results obtained indicates.

High accuracy values were observed in most cases (ranging between 0.7 and 1). Accuracy represents the proportion of correctly classified results among the total number of cases examined. This value was high because, both for each condition and for the groups of conditions, the capacity to detect true negatives was good, bearing in mind that most of the images analysed had no abnormalities (51.8%). An AI algorithm that quickly determines that there is no abnormality can function as a triage tool, streamlining the diagnostic process, allowing the professional to focus on other tests, reducing waiting lists and waiting times for diagnoses, and even reducing the cost of secondary tests.
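To make the influence of prevalence on this result explicit, accuracy can be decomposed as a prevalence-weighted average of sensitivity and specificity. The following worked example is illustrative only, assuming a condition present in roughly 10% of images together with the average sensitivity (0.48) and specificity (0.98) reported above:

$$\text{Accuracy} = \text{Sens} \times \text{Prev} + \text{Spec} \times (1 - \text{Prev}) = 0.48 \times 0.10 + 0.98 \times 0.90 \approx 0.93$$

Thus, even with modest sensitivity, per-condition accuracy remains high when most images are negative for that condition.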

Sensitivity refers to the ability to detect an abnormality when one is really present. High sensitivity values were observed for anatomical findings or abnormalities such as sternal cables, enlarged heart, abnormal ribs, spinal implants, cardiac valve or interstitial markings. On the other hand, low sensitivity values were observed for most conditions, indicating that the algorithm had limited ability to detect certain conditions, such as those of the mediastinum, vessels or bones. These findings align with the results of a study that performed an external validation of a similar algorithm in an emergency department35. Additionally, the algorithm exhibited low sensitivity in detecting pulmonary emphysema, linear atelectasis and hilar prominence, which are prevalent conditions in the primary care setting31.

Low sensitivity was also observed in the detection of nodules, with the algorithm finding more nodules than the reference radiologist and, in most cases, confusing them with areolae in the breast tissue. Although it is important to be able to detect any warning sign, with the professional remaining in charge of clinical judgement and of determining the need for complementary tests, it is possible that this external validation has detected a possible gender bias in the training of the algorithm. In chest imaging, it is important to distinguish between the physiological aspects of breast tissue, and the changes it may undergo during various life stages, and signs of conditions or abnormalities36. Other studies have also reported a high false-positive rate in the detection of nodules due to other causes, such as fat, pleura or interstitial lung disease37.

Finally, specificity is the ability to correctly identify images in which there are no radiological abnormalities. The results showed high values for all condition groupings, reflecting the algorithm's good capacity to recognise images without abnormalities.

In line with the authors' aim of contributing to the improvement of the AI model, several radiologist findings were identified that were overlooked during the algorithm's training, especially those related to bronchial conditions, including chronic bronchopathy, bronchiectasis and bronchial wall thickening. Additionally, the algorithm missed common chronic conditions often seen in primary care, including chronic pulmonary abnormalities, COPD and fibrocystic abnormalities. Furthermore, it was noted that certain condition names within the AI algorithm should be adjusted to align with the names used in the radiology field: interstitial markings could be changed to interstitial abnormality, consolidation to condensation, aortic sclerosis to valvular sclerosis, and abnormal rib to rib fracture.

Having discussed the main variables that characterise the algorithm's capacity, it should be noted that the results obtained differ from those of the majority of published studies, most of which reported higher algorithm performance. However, most of these were internal validations not tested in real clinical practice settings38,39,40.

A study in Korea performed an internal and external validation of an AI algorithm capable of detecting the 10 most prevalent chest X-ray abnormalities and demonstrated the difference in sensitivity and specificity values between the two. The internal validation obtained sensitivity and specificity values of 0.87–0.94 and 0.81–0.98, respectively, whereas the external validation obtained values of 0.61–1.00 and 0.71–0.98, respectively41. This difference can also be seen in a study in Michigan, where internal and external validation of an AI algorithm capable of detecting the most common chest X-ray abnormalities was performed42, and in a study at the Seoul University School of Medicine, where an algorithm for lung cancer detection in population screening was validated43.

Therefore, the results obtained from the external validation show the need to increase the sensitivity of the algorithm for most conditions. Considering that AI should serve as a diagnostic support tool and the ultimate responsibility for medical decisions rests with the practitioner, it is ideal for the algorithm to flag potential abnormalities for the practitioner to review and confirm. This ensures the highest diagnostic effectiveness. Recent studies have shown that the use of an AI algorithm to support the practitioner significantly improves diagnostic sensitivity and specificity and reduces image reading time20,44.

Enhanced sensitivity could help address the global shortage of specialised radiologists, especially in Central Catalonia's primary care setting, where this validation was conducted45,46. Increasingly, general practitioners are tasked with interpreting X-rays, and in this context the advancement of these tools can be a valuable asset in the diagnostic process.

Limitations and strengths

One significant limitation of the study was the small sample size for certain specific conditions, due to the difficulty of obtaining the required number of cases for conditions that are uncommon in real clinical practice. Consequently, the external validation for these conditions yielded less reliable estimates. However, because the sample reflected real clinical practice, a large volume of images without radiological abnormalities was obtained, which allowed a good external validation of the model's ability to detect images without abnormalities.

In addition, the reference diagnosis did not always come from a single radiologist but from a group of radiologists. This could represent a limitation, since no consensus reading was performed, but the intention was not to alter actual clinical practice. Moreover, as the study aimed to test the algorithm in primary care settings, a double interpretation of the images was performed: first by the radiologist and subsequently, through the radiologist's report, by the family and community physician. Finally, another limitation was the lack of information on the sex of the users analysed. In light of the results obtained, a further study separating the capabilities of the algorithm according to sex would be very relevant, since they might not be the same. However, given the small sample available for most conditions, separating the analyses by sex in the present study would have been unreliable and the results not comparable.

On the other hand, the greatest strength of the study is that it presents an external validation in real primary care clinical practice, of which there are currently few examples. Most studies present an internal validation, but it is very important to perform an external validation in order to estimate the accuracy of the model in a population other than the training population, thus allowing the results to be generalised.

Conclusion

This study presents the external validation of an AI algorithm for reading chest X-rays in the primary care setting, achieved by comparing its diagnoses with those made by a radiologist. The algorithm was validated using measures such as accuracy, sensitivity and specificity, and proved useful in identifying images with or without abnormalities. However, further training is needed to increase its diagnostic capability for some of the conditions analysed. It is important that this training is done in a real environment, with real images, in order to perform robust external validations. Our analysis highlights the need for continuous improvement to ensure that the algorithm is a reliable and effective tool in the primary care environment.

The role of AI in healthcare should be to assist and support practitioners. Being able to reliably detect images without abnormalities can have a very positive impact: reducing waiting times for diagnoses, reducing secondary tests to rule out conditions, streamlining practitioners' work and, ultimately, benefiting patient care and, indirectly, patients' health.