Deep convolutional neural networks are a type of artificial neural network that utilizes multiple hidden layers. Convolutional neural networks have been shown to be highly accurate at image classification, which may be due to their ability to represent images at different layers: edges and blobs at earlier layers, parts of objects at intermediate layers, and whole objects at later layers.
In this study, we explored three datasets with varying levels of difficulty. With the easiest dataset (differentiation of chest vs. abdominal radiographs), both pre-trained networks, AlexNet_T and GoogLeNet_T, achieved 100% accuracy with only 45 training cases and no additional data-augmentation techniques. Moreover, even the untrained networks achieved very high accuracy (Table 1). In contrast, four times as many training cases plus data-augmentation techniques were used for the presence/absence and low/normal ET tube position datasets, resulting in 2160 training images, 48 times the number used for the chest/abdominal radiography dataset. Despite this, the best-performing presence/absence and low/normal ET tube models were still not as accurate as the chest/abdominal model (Table 1).
This may be explained by the rich set of differences between chest and abdominal radiographs, which can be appreciated in almost any aspect of the image. There is also a stark difference in contrast between the lungs, which contain a large number of darker pixels, and the abdomen, which contains a larger number of relatively brighter pixels. As such, for datasets with more conspicuous differences, fewer training examples may be needed to develop an accurate model.
On the other hand, differentiating the presence and absence of an ET tube can be considered more difficult [21]. This may be because the proportion of the image that changes with the presence or absence of the ET tube is small (Figs. 5 and 6), whereas chest and abdominal radiographs differ in many aspects of the image. Nevertheless, using pre-trained networks and data augmentation, an AUC of 0.989 was obtained with the best-performing GoogLeNet_T model (Table 1 and Fig. 1). The most challenging dataset was determining low vs. normal position of the ET tube. One explanation is that the carina can sometimes be difficult to identify on portable antero-posterior (AP) chest radiographs, which are often degraded by image noise and have lower contrast resolution than PA radiographs [21]. This classification task requires assessment of both the tube tip and the position of the carina. To help the neural network classifiers, these images were manually cropped and centered on the trachea, including the bifurcation region (Figs. 7 and 8). One limitation is that cropping was performed manually in this study; automated cropping centered on the carina would be needed for a fully automated implementation. Another limitation is that a “high” position of the ET tube was not assessed in this study, which would be worthwhile to consider in future research. The best-performing model for this dataset was a pre-trained GoogLeNet, with an AUC of 0.809 (Table 1 and Fig. 1).
Overall, GoogLeNet performed better than AlexNet for the ET presence/absence and low/normal datasets (Table 1). GoogLeNet, also known as Inception V1, is a newer architecture than AlexNet and achieved better accuracy on ImageNet (top-5 error rate of 6.7% compared to 15.4% for AlexNet) [7, 8]. This may be explained by the extra depth of GoogLeNet, which is 22 layers deep, compared to AlexNet's 8 layers. GoogLeNet achieves this greater depth while limiting computational cost through the “inception” module, which places small 1 × 1 convolutions in parallel to reduce the number of feature channels before the more computationally expensive 3 × 3 and 5 × 5 convolutions, a design sometimes referred to as a “bottleneck”. As such, GoogLeNet is considered one of the most efficient neural network architectures. This design also provides a mixture of smaller, intermediate, and larger convolutions, which may help integrate information from a wider portion of the image.
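The computational saving of the bottleneck design can be sketched with a simple multiply-accumulate count (the feature-map and channel sizes below are illustrative choices, not GoogLeNet's actual values):

```python
# Multiply-accumulate count for a convolution on an H x W feature map:
# H * W * k * k * C_in * C_out
def conv_macs(h, w, k, c_in, c_out):
    return h * w * k * k * c_in * c_out

H = W = 28               # illustrative feature-map size
C_IN, C_OUT = 192, 32    # illustrative channel counts

# Direct 5x5 convolution applied to all input channels
direct = conv_macs(H, W, 5, C_IN, C_OUT)

# Bottleneck: 1x1 reduction to 16 channels, then the 5x5 convolution
reduced = conv_macs(H, W, 1, C_IN, 16) + conv_macs(H, W, 5, 16, C_OUT)

print(direct, reduced, round(direct / reduced, 1))
```

With these illustrative sizes, the 1 × 1 reduction cuts the multiply count of the 5 × 5 branch by roughly an order of magnitude, which is the essence of the bottleneck design.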
It is possible that down-sampling the data to 256 × 256 pixels, as is commonly done in other deep learning studies [7, 8], reduced the classifier's accuracy for both the ET presence and localization tasks. Of the two, ET tip localization depends more on spatial resolution, and better preservation of resolution may be needed for accurate models. Of course, higher-resolution images require more GPU memory, which may be a limiting factor depending on the depth of the neural network and the available hardware. Further research using higher-resolution images would be worthwhile.
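The memory pressure from higher resolutions can be illustrated with a back-of-the-envelope calculation (the channel count and float32 storage below are illustrative assumptions, not the study's actual configuration):

```python
# Activation memory for one conv layer's output grows with the square of
# the image side: side * side * channels * bytes-per-value (float32 = 4).
def activation_bytes(side, channels, bytes_per_value=4):
    return side * side * channels * bytes_per_value

for side in (256, 512, 1024):
    mib = activation_bytes(side, 64) / (1024 ** 2)
    print(side, round(mib, 1), "MiB")
```

Because activation memory grows with the square of the image side, doubling the input resolution quadruples the memory needed per layer, before even accounting for gradients and optimizer state.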
Transfer learning using networks pre-trained on non-medical images (ImageNet) performed better than untrained networks (Table 1), in keeping with prior studies [10, 11, 13], although this was only statistically significant for the presence/absence GoogLeNet_T and AlexNet_T models compared to their untrained counterparts. The first layers of well-trained networks tend to learn general features such as edges and blobs, as shown with multiple neural network architectures regardless of the type of training data [22]. The last layers, on the other hand, are thought to be specific to the training data, in this case chest radiographs. As such, it makes sense that leveraging the pre-trained weights of the initial layers of a well-formed neural network and training mostly the last layer (initialized with random weights) should improve accuracy. In this study, none of the layers were frozen; instead, the pre-trained layers were set to learn at a slower rate. Rajkomar et al. demonstrated that pre-training with grayscale images from ImageNet resulted in better accuracy than pre-training with color images when using transfer learning for grayscale chest radiographs [11]. However, this was only true when all layers except the last were frozen. That study also demonstrated that similarly high accuracy can be obtained using models pre-trained on color images if the layers are not frozen (fine-tuning of all layers) and a reduced learning rate is used, as in the present study. Figures 2 and 3 show training curves for the pre-trained AlexNet networks for ET presence/absence using base learning rates of 0.001 and 0.01. The curves show that lower base learning rates (e.g., 0.001) are important when utilizing transfer learning, as the pre-trained weights are already well optimized, and higher rates (e.g., 0.01) may result in slower or no convergence.
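The idea of fine-tuning all layers while updating pre-trained weights more slowly can be sketched with a toy model (hypothetical code, not the study's actual training setup): a "pre-trained" weight that starts near its optimum receives a small learning rate, while a freshly initialized parameter learns faster.

```python
# Toy transfer-learning sketch: fit y = 2x + 1 with plain SGD, using a
# smaller learning rate for the "pre-trained" weight w than for the newly
# initialized parameter b (analogous to fine-tuning vs. a new last layer).
def fit(steps=200):
    w = 1.9                         # "pre-trained" weight, near the optimum 2.0
    b = 0.0                         # new parameter, starting from scratch
    lr_pretrained, lr_new = 0.001, 0.01
    data = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]
    for _ in range(steps):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr_pretrained * err * x   # slow update preserves pre-trained knowledge
            b -= lr_new * err              # fast update lets the new parameter adapt
    return w, b

w, b = fit()
print(round(w, 2), round(b, 2))
```

Both parameters converge to the target values (w ≈ 2, b ≈ 1); the point of the sketch is only that differential learning rates let the new parameter do most of the adapting while the pre-trained one drifts gently.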
For the ET presence/absence and low/normal datasets, additional augmentation was performed using pre-processed images, including contrast enhancement with CLAHE, non-rigid deformation, and quadrilateral rotations (90, 180, and 270°). This served to increase the size of the dataset and to provide more variation in the images to mitigate overfitting. Overall, models with augmentation achieved higher AUCs than those without, although this was only statistically significant for the GoogLeNet_T model (Table 2). Quadrilateral rotations were chosen because rotated images are occasionally sent by the modality to the reading worklists by accident, and we wanted the DCNNs to handle that potential variation. Intuitively, one might expect shallow rotation angles (e.g., ±5°) to aid training and preempt overfitting, because of their relative similarity to the original image. However, when combined with the other augmentation strategies (CLAHE and non-rigid deformation) and pre-trained networks, augmentation with quadrilateral rotations yielded greater accuracy (Table 3), although this was again only statistically significant for the GoogLeNet_T model. It is possible that quadrilateral rotations introduce greater differences in the images, reducing overfitting to the training data. However, more research is needed to see whether this holds for other datasets.
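As a minimal sketch (using a nested list as a stand-in for a grayscale image), the quadrilateral-rotation augmentation expands each labeled image into four orientations:

```python
# Quadrilateral-rotation augmentation: each image yields its 90/180/270-degree
# rotations in addition to the original, a 4x increase in training images.
def rotate90(img):
    """Rotate an image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment_with_rotations(img):
    """Return the original image plus its 90/180/270-degree rotations."""
    out, cur = [img], img
    for _ in range(3):
        cur = rotate90(cur)
        out.append(cur)
    return out

img = [[1, 2],
       [3, 4]]
views = augment_with_rotations(img)
print(len(views), views[1])
```

Because the rotation is lossless (no interpolation, unlike shallow-angle rotations), the label of the original image carries over to all four views unchanged.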
Deep neural networks can be thought of as functional black boxes because of the tremendous number of parameters they contain; AlexNet, for example, has approximately 60 million parameters and GoogLeNet has 5 million [7, 8]. However, there are strategies to inspect a network and determine the parts of an image that drive its activations [23]. One method involves creation of a saliency map, which highlights the parts of an image that contribute most to the predicted label, computed from the gradient of the class score with respect to the input image [23]. Figure 9 is a saliency map for the ET tube presence/absence task, derived from the GoogLeNet_T model. Figure 10 is a saliency map for the ET tube position classification task, also from the GoogLeNet_T model. The maps were created using one back-propagation pass through the DCNN. In these examples, the black parts of the image contribute most to the network's prediction, as opposed to the light gray background. In Fig. 9, the area of the endotracheal tube at the level of the thoracic inlet has a visible contribution to the prediction class, which lends credence to the model, as it is assessing the correct region. However, rib edges and an overlying unrelated catheter also contribute, so one could infer that the network has room for improvement. In Fig. 10, the ET tube, enteric tube, and an overlying catheter all contribute to the prediction class, indicating that the network is not as well formed and is inferring from parts of the image (the enteric tube and overlying catheter) that are not relevant to the ground-truth label (ET tube is low). The enteric tube and overlying catheter share similar features, with a linear appearance and brighter pixel intensities, which may be confusing the model. More training cases would likely improve this.
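The gradient-based idea behind a saliency map can be sketched with a toy "classifier" (a hypothetical linear score, not the actual DCNN): each pixel's saliency is the magnitude of the score's gradient with respect to that pixel, here estimated by finite differences rather than the single back-propagation pass used in the study.

```python
# Toy saliency sketch: the "classifier score" is a hypothetical linear
# function of the pixels, and saliency is |d score / d pixel|.
def score(pixels, weights):
    # Stand-in for the network's output logit for the predicted class.
    return sum(p * w for p, w in zip(pixels, weights))

def saliency(pixels, weights, eps=1e-4):
    sal = []
    for i in range(len(pixels)):
        bumped = list(pixels)
        bumped[i] += eps
        # Finite-difference gradient magnitude; a real saliency map obtains
        # this for every pixel at once via one back-propagation pass.
        sal.append(abs((score(bumped, weights) - score(pixels, weights)) / eps))
    return sal

pixels = [0.2, 0.9, 0.1, 0.5]
weights = [0.0, 3.0, -1.0, 0.5]   # hypothetical learned weights
print(saliency(pixels, weights))
```

Pixels attached to large-magnitude weights dominate the map regardless of sign, mirroring how both the ET tube and the visually similar enteric tube can light up in Fig. 10.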
One would expect that increasing the number of training cases should improve accuracy, as deep neural networks have been shown to perform better with larger sample sizes [15]. Figure 4 shows the AUCs of the pre-trained AlexNet_T and GoogLeNet_T classifiers for the presence/absence data using 25, 50, 75, and 100% of the total training data. Training with the full dataset resulted in higher AUCs for GoogLeNet_T and AlexNet_T than training with 25% of the data, for example (P = 0.015 and P < 0.001, respectively; Fig. 4).
To further improve these results, one could consider different types of deep artificial neural networks, different pre-processing steps, pre-training on a large sample of radiology images (rather than non-medical images), higher matrix sizes, working with DICOM files directly, or using a combination of machine learning techniques.