1 Introduction

Artificial intelligence (AI) is a computer science discipline capable of analyzing complex medical data. In many clinical contexts, its ability to discover meaningful relationships within collected data can be employed in diagnosis, treatment, and outcome prediction [1,2,3].

Artificial intelligence systems are computer programs that allow computers to operate in ways that appear intelligent. Alan Turing (1950), a British mathematician, was one of the pioneers of modern computer science and artificial intelligence [4]. He characterized intelligent behavior in a computer as the capacity to exhibit human-level performance in cognitive tasks, a criterion subsequently known as the “Turing test” [5, 6]. The Turing test remains one of the most debated issues in artificial intelligence and cognitive science, since a machine might fail the test and yet still be intelligent. Turing proposed the Turing test (TT) in his 1950 Mind article “Computing Machinery and Intelligence” [7] as a replacement for the question “Can machines think?” The goal of Turing’s work was to provide a mechanism for determining whether or not a computer can think. His paper has been seen as the “starting point” of artificial intelligence (AI), whereas the TT has been regarded as its final objective. He further proposed the Imitation Game to give this idea a concrete form [8,9,10].

Researchers have been investigating potential uses of intelligent techniques in every sector of medicine since the last century, and medical AI has seen a marked rise in popularity over the previous two decades. AI systems can consume, analyze, and report vast amounts of data from various modalities to diagnose disease and guide healthcare choices. Beyond diagnosis, AI can aid in predicting survival rates for cancer patients, such as those with lung cancer. In radiology, AI is being used to diagnose disorders from CT scans, MR imaging, and X-rays [1, 4, 11,12,13]. At the same time, questions of fairness and ethics have become crucial as more and more techniques approach deployment in clinical settings [14,15,16,17,18].

2 Problem statement

Our prime question, given the advancements in state-of-the-art AI-based medical imaging algorithms and devices, is how to measure an algorithm’s performance before deployment and decide whether it is better than, or at least as good as, a clinician in a real-life medical setting [19,20,21]. This raises the related concern of whether we can fully trust an AI and grant it the status of an independent entity, or whether we need a clinician in the loop to oversee the predictions made by the algorithm [22]. In addition, we can examine whether current state-of-the-art techniques are good enough for clinical use, or whether further development is needed, by testing whether they match the precision and accuracy of an experienced clinician on a diverse cohort of patients [23].

2.1 Aim and objective

With the rise of AI-based radiological devices and algorithms providing clinical, diagnostic, and prognostic predictions, we need to look beyond model accuracy and consider whether these modalities are ethically sound and free of biases [24,25,26]. With our proposed test, we can closely analyze the predictions made by an algorithm, compare them against those of humans, and judge whether it is safe enough to deploy in a medical institution while accounting for any prevalent biases it may have [27,28,29]. The article draws its inspiration from A. M. Turing’s classic Turing test: we propose a modified Turing test that serves as a metric for discovering an AI model’s true performance in a real-life clinical setting and can also help detect possible biases.

3 Methodology

3.1 Dataset

For this project, we used two different datasets to train and test our models. For training, we used the publicly available Medical Imaging Data Resource Center (MIDRC)—RSNA International COVID-19 Open Radiology Database (RICORD) [30]. The MIDRC-RICORD dataset 1a was developed in partnership with the Society of Thoracic Radiology (STR) and the American Society of Nuclear Medicine. For all COVID-positive thoracic computed tomography (CT) imaging studies, pixel-level volumetric segmentation with clinical annotations was performed by thoracic radiology subspecialists, following a labeling schema coordinated with other international consensus panels and COVID data annotation efforts.

Database 1a of MIDRC-RICORD comprises 120 thoracic computed tomography (CT) examinations from four international sites, each annotated with precise segmentations and diagnostic labels. For model training, we used these 120 chest CT studies (axial series) as input. The data were retrieved from The Cancer Imaging Archive [31]. A sample CT scan of lungs infected with COVID-19 is shown in Fig. 1.

To test our model, we used the COVID-19 CT Lung and Infection Segmentation Dataset, publicly available on Zenodo [32]. This dataset contains 20 labeled and annotated COVID-19 CT scans. The left lung, right lung, and infection areas were labeled by two radiologists and verified by an expert radiologist. The dataset fits our research interests well because it provides human-annotated segmentations alongside the ground truth.

Fig. 1

A sample CT scan of lungs infected with COVID-19

3.2 Data preprocessing and training

The volumes in this dataset have an in-plane resolution of 512 × 512 pixels, and the number of slices varies across the patient cohort. The input CT images contain information that is not needed for the procedure, so preprocessing is required to remove this unnecessary information from the volumes. To reduce memory consumption, each slice in the 3D CT data was downsampled to 256 × 256.
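To make this step concrete, a minimal sketch of the slice downsampling might look as follows; the function name and the assumption that volumes are loaded as NumPy arrays of shape (H, W, slices) are ours, not the authors’.

```python
import numpy as np
from scipy.ndimage import zoom

def resize_slices(volume: np.ndarray, target_hw=(256, 256)) -> np.ndarray:
    """Downsample each axial slice from 512x512 to 256x256,
    leaving the (variable) number of slices untouched."""
    h, w, _ = volume.shape
    factors = (target_hw[0] / h, target_hw[1] / w, 1.0)
    # order=1 -> trilinear interpolation, a common choice for CT intensities
    return zoom(volume, factors, order=1)
```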

Because the inputs are 3D volumes, a considerable amount of GPU memory is required. To address this, we used a 3D patch-based strategy: each input volume is randomly divided into 16 patches of size 128 × 128 × 32. A well-trained CNN should be insensitive to changes in translation, scale, and viewpoint, which requires a substantial quantity of training data. We used augmentation to reach this quantity and thereby obtained data invariance. Augmentation was applied to each 128 × 128 × 32 patch at random, using a variety of methods: zooming in and out, shearing, horizontal and vertical translations, and 90-degree rotation.
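The sketch below illustrates the random patch extraction and one of the augmentations (the 90-degree rotation); the function names are illustrative, and the remaining augmentations (zoom, shear, translations) could be applied analogously, e.g., with scipy.ndimage.

```python
import numpy as np

def random_patches(volume, mask, n=16, size=(128, 128, 32), rng=None):
    """Randomly crop n paired image/mask patches from a 3D volume.
    Assumes the volume spans at least `size` voxels along each axis."""
    rng = rng or np.random.default_rng()
    ph, pw, pd = size
    H, W, D = volume.shape
    for _ in range(n):
        y = rng.integers(0, H - ph + 1)
        x = rng.integers(0, W - pw + 1)
        z = rng.integers(0, D - pd + 1)
        yield (volume[y:y+ph, x:x+pw, z:z+pd],
               mask[y:y+ph, x:x+pw, z:z+pd])

def augment(patch, rng):
    """One of the augmentations mentioned above, chosen at random."""
    if rng.random() < 0.5:
        patch = np.rot90(patch, axes=(0, 1))  # 90-degree in-plane rotation
    return patch
```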

Further, all the data were divided into training, validation, and test sets in a 60:20:20 ratio.
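A minimal sketch of such a split, assuming it is performed at the patient level with scikit-learn (both assumptions on our part, with placeholder case IDs):

```python
from sklearn.model_selection import train_test_split

patients = [f"case_{i:03d}" for i in range(120)]  # placeholder IDs
train_ids, rest = train_test_split(patients, test_size=0.4, random_state=42)
val_ids, test_ids = train_test_split(rest, test_size=0.5, random_state=42)
# -> 72 training, 24 validation, 24 test volumes (60:20:20)
```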

3.3 Segmentation model

Semantic segmentation of the lung CT scans was performed using a VGG16-UNet model, whose performance we compared to that of other models such as UNet, UNet++, UNet3+, and Attention UNet [33,34,35,36], shown in Fig. 2. VGG16-UNet was chosen because its encoder closely resembles UNet’s contracting path while using fewer parameters than UNet [37].

The left-hand side of the network is an encoder incorporating the 13 convolutional layers of the original VGG16. After each convolutional block, a MaxPooling operation reduces the spatial dimensions of the feature maps by 2 \(\times\) 2. The right-hand side of the network is a decoder, in which UpSampling operations restore the dimensions of the image; each UpSampling operation repeats the rows and columns of its input by 2 \(\times\) 2. Skip connections, implemented with concatenate operations that combine the corresponding encoder and decoder feature maps, restore spatial detail: since this is a variant of the fully convolutional network (FCN) for semantic segmentation, the spatial information of the image needs to be retained. The last convolutional layer has only one filter, playing the role of the final Dense layer in most other neural networks, and produces the binary mask prediction. In total, the network has about 29 convolutional layers, each followed by a PReLU activation, whose alpha parameter is learned during training. The last convolutional layer uses a sigmoid activation function.
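The description above can be made concrete with a compact Keras sketch. Note that the paper trains on 3D patches while VGG16 is an inherently 2D backbone, so we show the 2D slice-wise variant; the decoder filter counts and the 3-channel input are our assumptions, not the authors’ exact configuration.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def conv_block(x, filters):
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.PReLU(shared_axes=[1, 2])(x)  # alpha learned during training
    return x

def vgg16_unet(input_shape=(256, 256, 3)):
    # Encoder: the 13 conv layers of VGG16 (weights=None -> trained from scratch)
    encoder = VGG16(include_top=False, weights=None, input_shape=input_shape)
    # Feature maps taken just before each pooling step serve as skips.
    skip_names = ["block1_conv2", "block2_conv2",
                  "block3_conv3", "block4_conv3"]
    skips = [encoder.get_layer(n).output for n in skip_names]
    x = encoder.get_layer("block5_conv3").output  # bottleneck

    # Decoder: 2x2 upsampling, concatenated skip connections, conv blocks
    for skip, filters in zip(reversed(skips), [512, 256, 128, 64]):
        x = layers.UpSampling2D(2)(x)        # repeats rows/columns 2x2
        x = layers.Concatenate()([x, skip])  # skip connection
        x = conv_block(x, filters)

    # Single-filter conv with sigmoid -> binary mask prediction
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(encoder.input, out)
```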

Fig. 2

a The model architecture outlining the workflow of UNet, UNet++, and UNet3+. The notable difference between the three models is the skip connections: UNet uses plain skip connections; UNet++ has nested, dense skip connections, which have the downside of not exploring a sufficient amount of information across full scales; UNet3+, by contrast, uses full-scale skip connections so that more information is available during upsampling. b VGG16-UNet comprises an encoder based on a VGG16 model and a decoder based on a UNet model. c Attention UNet adds attention mechanisms to a standard UNet, focusing on the varying size and shape of target structures

The semantic segmentation produced by our proposed UNet-VGG16 is shown in Fig. 3. We trained the model on multiple edge cases (artifact scans or complex patient cases) to produce more generalized segmentations, and the model performed well across many of these cases.

Fig. 3

Semantic segmentation of normal and edge cases’ lung infection produced by our proposed UNet model

3.4 Modified Turing test

This study analyzes the Turing test’s potential use in healthcare informatics, aiming to highlight the broader application of diagnostic accuracy approaches to the Turing test in present and future AI scenarios. Accordingly, we aim to create a measurable, diagnostic-accuracy-based scoring approach for the Turing test (how distinguishable are clinicians and AI models?). We adapted the Turing test to the framework of diagnostic accuracy testing, accounting for false positives and true negatives (Fig. 5).

As shown in Fig. 5, a blinded examiner (A) attempts to differentiate between a human control (B) and a computer test subject (C) versus a human test subject (D). The examiner does not know whether a test subject is human or a machine, so the (C) versus (D) comparison provides the diagnostic assessment. Framed as a diagnostic test, the redesigned Turing test can be assessed with diagnostic accuracy techniques and can provide the fast feedback of a human examiner—a method for determining whether a computer (C) is indistinguishable from its human counterpart.

The findings of this test can be compared against a gold-standard reference, namely whether or not the test subject actually is a computer. The segmentation done by the expert radiologist (D) is shown in Fig. 4; the radiologist did not see the ground truth while producing the human-annotated segmentation.

Fig. 4

Human-annotated segmentation produced by expert radiologist

The participants (radiologists) were asked to judge how accurate each given segmentation is compared to the ground truth and, on that basis, to classify whether the segmentation is fully accurate and detailed, and hence done by a professional radiologist, or whether it misses features and was therefore done by an AI model. The motive of this study is not to see who produces neater segmentations; rather, it focuses on whether the machine learning algorithms pick up the clinically important features in the scan.

Fig. 5

A systemic layout of the modified Turing test

Consequently, each computer can be evaluated numerous times by the same human and the results compared to determine how biased or accurate the algorithm is. This allows us to obtain several diagnostic evaluation parameters, such as sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and to generate a receiver operating characteristic (ROC) curve. The proposed diagnostic metrics can be derived using the principles of the confusion matrix [38], as shown in Fig. 6, and a sketch of this computation follows.
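In the sketch below, “AI” is treated as the positive class; the arrays are placeholder data, not the study’s results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, auc

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = segmentation made by AI
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])  # 1 = examiner said "AI"

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
ppv = tp / (tp + fp)           # positive predictive value
npv = tn / (tn + fn)           # negative predictive value

# With graded examiner confidence scores instead of hard labels,
# an ROC curve can be generated as well:
scores = np.array([0.9, 0.2, 0.4, 0.8, 0.7, 0.1, 0.3, 0.2])
fpr, tpr, _ = roc_curve(y_true, scores)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"PPV={ppv:.2f} NPV={npv:.2f} AUC={auc(fpr, tpr):.2f}")
```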

The AI model would be considered more accurate and reliable if its predictions lead the radiologists to believe that the segmentation was done by a real human being, in terms of precision, selection of the area of interest, and the handling of any important considerations in edge cases [39, 40].

To the best of our knowledge, this technique has not been implemented before and is therefore highly novel. The modified Turing test can provide verifiable diagnostic precision and robust statistical effect sizes in the evaluation of AI for computer-based and robotic healthcare and clinical solutions.

4 Results and discussion

4.1 Segmentation results

In this study, we used multiple metrics to evaluate the performance of the model: dice coefficient (DSC), mean intersection over union (mIoU), recall (RE), precision (PR), specificity (SP), and F1-score (F1). The metrics are defined as follows:

$$\begin{aligned}&{\text {DSC}}(Y, {\hat{Y}})=\frac{2|Y \cap {\hat{Y}}|}{|Y| + |{\hat{Y}}|} \end{aligned}$$
(1)
$$\begin{aligned}&{\text {mIoU}}(Y, {\hat{Y}})=\frac{|Y \cap {\hat{Y}}|}{|Y \cup {\hat{Y}}|} \end{aligned}$$
(2)
$$\begin{aligned}&\mathrm{PR}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \end{aligned}$$
(3)
$$\begin{aligned}&\mathrm{RE}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \end{aligned}$$
(4)
$$\begin{aligned}&\mathrm{SP}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} \end{aligned}$$
(5)
$$\begin{aligned}&F 1=2 \times \frac{\mathrm{PR} \times \mathrm{RE}}{\mathrm{PR}+\mathrm{RE}}. \end{aligned}$$
(6)
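For reference, Eqs. (1) and (2) translate directly into NumPy for binary masks; the small epsilon guarding against empty masks is our addition.

```python
import numpy as np

def dice(y, y_hat, eps=1e-7):
    """Dice coefficient, Eq. (1), for boolean masks."""
    inter = np.logical_and(y, y_hat).sum()
    return (2.0 * inter + eps) / (y.sum() + y_hat.sum() + eps)

def iou(y, y_hat, eps=1e-7):
    """Intersection over union, Eq. (2); mIoU averages this over images."""
    inter = np.logical_and(y, y_hat).sum()
    union = np.logical_or(y, y_hat).sum()
    return (inter + eps) / (union + eps)
```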

Table 1 compares the segmentation results of the UNet, UNet++, UNet3+, Attention UNet, and UNet-VGG16 models in terms of all metrics used in our experiments. Our proposed UNet-VGG16 model achieved the highest scores among all models on every metric, shown in bold.

Table 1 Segmentation metrics results among various UNet models trained

In addition, during testing the model was examined on various edge cases and on cases with complex or rare infections to check whether the UNet is biased. The results are very promising: our model achieved a dice score of 94.76% on these critical cases.

4.2 Modified Turing test results

For this study, 10 board-certified radiologists, each with more than 10 years of experience in interpreting cardio-thoracic imaging, reviewed 20 sets of medical images and predicted whether each segmentation was done by a human or an AI, based on the preciseness and accuracy of the segmentation. All radiologists were given the same platform and the same amount of time to make their predictions. The analysis of each radiologist’s predictions is given in Table 2.

A true positive (TP) denotes that the tester correctly detected an AI-based segmentation, and a true negative (TN) denotes that the tester correctly detected a human-based segmentation. A false positive (FP) means the tester thought the segmentation was done by an AI when it was actually done by a human, whereas a false negative (FN) means the tester thought the segmentation was done by a human when it was actually produced by the AI. This convention is made explicit in the short sketch below.
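The following tiny helper, which is purely illustrative, encodes the convention with “AI” as the positive class:

```python
def quadrant(truth: str, guess: str) -> str:
    """Map a (ground truth, examiner guess) pair to a confusion-matrix cell."""
    if truth == "AI":
        return "TP" if guess == "AI" else "FN"
    return "TN" if guess == "human" else "FP"

assert quadrant("AI", "human") == "FN"  # AI mistaken for a human
assert quadrant("human", "AI") == "FP"  # human mistaken for an AI
```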

Table 2 Analysis of prediction derived from the test results of the participants

We consider TP and FN our most important metrics here, as they reveal the most about the performance of the AI algorithm in a clinical setting. The TP score reflects how recognizable the AI-based segmentation is: participants reported that when a segmented scan missed less obvious infections or over-segmented some areas, it was easy to conclude that the segmentation was done by an AI, because a professional radiologist would never produce such a segmentation [41]. A high TP score is therefore not a good sign for the AI, as it means the generated segmentation is not clinically convincing. The FN score, by contrast, is what brings an algorithm closer to an “expert radiologist”: the more FN scores the model earns, the harder it is to distinguish whether the segmentation was done by a human or a machine, and the closer the AI system is to a professional radiologist.

To understand the results further, we calculated evaluation metrics such as accuracy, recall, and precision to characterize the overall performance and behavior of the participants as well as the AI model; the results are shown in Table 3.

Table 3 Evaluation metrics results among all the participants
Fig. 6

Diagnostic evaluation metrics generated through test results

Overall, our UNet model did exceedingly well in this test: it not only achieved a high FN score but also received a low TP ratio. This distribution suggests that the model is capable enough to be implemented in a clinical setting. At the same time, there was also a considerable portion of TP and TN cases, in which the participants correctly distinguished AI from human segmentations, so we would still need a clinician in the loop to safeguard patient care against false predictions made by the algorithm. We also included 5 edge cases among the 20 image sets; on these, the AI-based segmentation was most often identified as AI-generated.

Finally, we compared the behavior and performance of the original Turing test proposed by Alan Turing with the modified Turing test proposed in this study using bivariate meta-analysis [42], and the results closely matched our expectations. Figure 7 shows that our modified Turing test behaves just like the original test, and the UNet model’s performance can also be analyzed from the plot.

Fig. 7

Bivariate meta-analysis of actual Turing test and our modified Turing test

5 Conclusion and future work

The number of AI-based medical imaging devices is increasing every single day, and it is crucial to think about the potential biases they may inherit. This test could serve as a transnational standard for upcoming AI modalities. Through the study we conducted, we learned that our modified Turing test is a notably strong standard for measuring the actual performance of an AI model on a variety of edge cases and normal cases, and that it also helps detect whether the algorithm is biased towards any one type of case. Not only can we detect biases, but we can also classify the type of bias and work towards resolving it (Fig. 7).

Since artificial intelligence systems in healthcare can be used for both the diagnosis and the treatment of disease, even a tiny error can result in diagnostic inaccuracy and, consequently, increased morbidity and mortality. It is therefore critical to conduct a comprehensive verification and validation of each artificial intelligence system before using it for diagnosis. Distinguishing between computers and humans (by the Turing test or the modified Turing test) should not detract from the importance of the diagnostic accuracy of each computer-based AI system in disease detection and healthcare provision, which should be independently appraised for its safety, precision, and utility. As we proceed towards the coming ages of AI in medicine, this technique will remain applicable not only to segmentation but also to various other prediction and detection models. The modified Turing test gives us trust in the AI algorithm and helps us, if not look inside, then at least predict what is inside the black box of the algorithm.

The future of this subject lies in applying diagnostic accuracy methods to the modified Turing test, which will spur the development of enhanced technology that can closely replicate human behavior. This has the potential to produce healthcare computers and other artificial-intelligence-based technologies that improve human health and quality of life while igniting the next generation of human–technological conversation.