Data Preparation
Data Collection
IRB approval was obtained for this retrospective study. Using an internal report search engine (Render), all radiographs and radiology reports using the exam code “XRBAGE” were queried from 2005 to 2015. Accession numbers, ages, genders, and radiology reports were collected into a database. Using the open source software OsiriX, DICOM images corresponding to the accession numbers were exported. Our hospital’s radiology reports include the patient’s chronological age and the bone age with reference to the standards of Greulich and Pyle, second edition [1].
Data Categorization
Radiographs from patients with a chronological age of 5–18 years and from skeletally mature patients (18 years and up) were included in the dataset. Ages 0–4 years were excluded for two reasons. First, there were only a limited number of radiographs for patients in the 0–4-year-old bracket (298 cases for females and 292 cases for males), which significantly limited the volume of images usable for training in that range. Second, the overwhelming indication for bone age assessment at our institution is for questions of delayed puberty, short stature, or precocious puberty. These examinations are infrequently performed for patients less than 5 years of age. The reported bone ages were extracted from the radiologist report by searching for bone age-related keywords such as “bone age” and “skeletal.” The extracted bone ages were converted to years, floored, and categorized by year from 5 to 18 years. Skeletally mature cases were considered 18 years [10]. For cases where the reported bone age was given as a range, we assigned the arithmetic mean of the range as the actual bone age. The total number of studies originally retrieved was 5208 for the female cohort and 5317 for the male cohort. After excluding ages 0–4 years and aberrant cases (right hands, deformed images, and uninterpretable reports), 4278 radiographs for females and 4047 radiographs for males were labeled by skeletal age as in Fig. 2.
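The labeling rules above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names, the assumption that ages arrive as years and months already parsed from the report, and the capping of labels above 18 at 18 are all illustrative choices.

```python
import math

MIN_LABEL = 5      # ages 0-4 years are excluded from the dataset
MATURE_LABEL = 18  # skeletally mature cases are treated as 18 years


def years_from(years, months=0):
    """Convert a 'Y years M months' report value to fractional years."""
    return years + months / 12.0


def bone_age_label(low=None, high=None, mature=False):
    """Map a reported bone age (in years) to an integer class label.

    `low` alone is a single reported age; `low` and `high` together are a
    reported range, which is replaced by its arithmetic mean. Returns None
    for cases that fall in the excluded 0-4 year bracket.
    """
    if mature:
        return MATURE_LABEL
    age = low if high is None else (low + high) / 2.0
    label = math.floor(age)
    if label < MIN_LABEL:
        return None
    # Assumption: any floored value above 18 is capped at the mature label.
    return min(label, MATURE_LABEL)
```

For example, a report of 13 years 6 months maps to class 13, and a reported range of 10 to 12 years maps to class 11.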
We randomly selected 15% of the total data for use as a validation dataset and 15% for use as a test dataset. The remainder (70%) was used as training datasets for the female and male cohorts. The validation datasets were used to tune hyperparameters and to select the best model among those saved at each epoch. The best network was evaluated using the test datasets to determine whether the top 1 prediction matched the ground truth or fell within 1 or 2 years of it. To keep comparisons fair, we used the same split datasets for every test rather than drawing new random splits.
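A fixed 70/15/15 split and the three evaluation criteria can be sketched as follows. The seed value, the function names, and the rounding of the split sizes are assumptions made for illustration; the study states only the proportions and that the same split was reused.

```python
import random


def split_indices(n, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle indices 0..n-1 with a fixed seed and split them 70/15/15.

    Reusing the same seed reproduces the same split for every experiment.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = round(n * val_frac)
    n_test = round(n * test_frac)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


def within_k_accuracy(predicted, truth, k):
    """Fraction of top-1 predictions within k years of the ground truth.

    k=0 is an exact match; k=1 and k=2 give the within-1-year and
    within-2-year criteria.
    """
    hits = sum(abs(p - t) <= k for p, t in zip(predicted, truth))
    return hits / len(truth)
```

Calling `split_indices(4278, seed=42)` twice returns identical index lists, which is what keeps comparisons between trained models on equal footing.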
Preprocessing Engine
Input DICOM images vary considerably in intensity, contrast, and grayscale base (white background and black bones or black background and white bones) as shown in Fig. 3. This variance among the training radiographs hinders algorithms from learning salient features. A preprocessing pipeline that standardizes images is therefore essential for model accuracy, as it eliminates as much unnecessary noise as possible. For this application, bones are the most important features to be preserved and enhanced as they are central to BAAs. Therefore, we propose a novel preprocessing engine that consists of a detection CNN to identify/segment the hand/wrist and create a corresponding mask, followed by a vision pipeline to standardize and maximize the invariant features of images.
Normalization
The first step of the preprocessing engine is to normalize radiographs for grayscale base and image size before feeding them to the detection CNN. Some images have black bones with white backgrounds and others have white bones with black backgrounds (Fig. 3). Image size also varies considerably, from a few hundred to a few thousand pixels. To normalize the different grayscale bases, we calculated the pixel means of 10 × 10 image patches in the four corners of each image and compared them with half of the maximum value for the given image resolution (e.g., 128 for 8-bit resolution). This effectively determines whether an image has a white or black background, allowing us to normalize them all to black backgrounds. The next step normalizes the sizes of input images. Almost all hand radiographs are height-wise rectangles. Accordingly, we resized the heights of all images to 512 pixels while preserving their aspect ratios and then zero-padded the widths to 512 pixels, ultimately creating standardized 512 × 512 images. We chose this size for two reasons: it needed to be larger than the required input size (224 × 224) for the neural network, and it offers the best balance between the performance of the detection CNN and the speed of preprocessing. Larger squares improve the detection CNN performance at the cost of slower deployment time, while smaller squares accelerate the testing time but result in worse image preprocessing.
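The two normalization steps can be sketched with NumPy as below. Several details are assumptions rather than the study's implementation: the four corner means are averaged before comparison with the half value, resizing uses nearest-neighbor sampling, the padding is split symmetrically, and images wider than tall are rejected.

```python
import numpy as np


def to_black_background(img, bit_depth=8, patch=10):
    """Invert the image if its corner patches indicate a white background."""
    half = (2 ** bit_depth) / 2          # e.g. 128 for 8-bit images
    max_val = 2 ** bit_depth - 1
    corners = [img[:patch, :patch], img[:patch, -patch:],
               img[-patch:, :patch], img[-patch:, -patch:]]
    # Assumption: average the four corner means into one background estimate.
    corner_mean = np.mean([c.mean() for c in corners])
    return max_val - img if corner_mean > half else img


def resize_and_pad(img, size=512):
    """Resize height to `size` keeping aspect ratio, then zero-pad the width."""
    h, w = img.shape
    new_w = max(1, round(w * size / h))
    if new_w > size:
        raise ValueError("this sketch handles height-wise rectangles only")
    # Nearest-neighbor resampling (assumption; any interpolation would do).
    rows = np.arange(size) * h // size
    cols = np.arange(new_w) * w // new_w
    resized = img[rows[:, None], cols[None, :]]
    out = np.zeros((size, size), dtype=img.dtype)
    left = (size - new_w) // 2           # assumption: symmetric padding
    out[:, left:left + new_w] = resized
    return out
```

Applying `resize_and_pad(to_black_background(img))` to a 1000 × 600 white-background image yields a 512 × 512 image with a black background and zero-valued side margins.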
Detection CNN
There are five different types of objects on hand radiographs: bone, tissue, background, collimation, and annotation markers (Fig. 3). To segment the hand and wrist from radiographs, we used a CNN to detect bones and tissues, constructed a hand/wrist mask, and applied a vision pipeline to standardize images. As shown in Fig. 4, image patches for the five classes were sampled in the normalized images through the use of ROIs. The sampled patches form a balanced dataset with 1 M samples from each class. We used 1000 unique radiographs, which were randomly selected from the training dataset, to generate diverse object patches. We used LeNet-5 [11] as the network topology for the detection CNN because the network is an efficient model for coarse-grained recognition of obviously distinctive datasets and is used in applications such as MNIST digit recognition [12]. In addition, the network requires only a small amount of computation and little memory for trainable parameters at deployment time. We trained the model on the sampled patches for 100 epochs using a stochastic gradient descent (SGD) algorithm with a base learning rate of 0.01, decreased by a factor of ten in three steps based on convergence of the loss function. For each class, 25% of the training images were held out as a validation dataset to select the best model across epochs.
Reconstruction
The next step is to construct a label map which contains hand and non-hand regions. For each input radiograph, the detection system slides across the entire image, sampling patches, and records all class scores per pixel using the trained detection CNN. Based on the recorded scores, each pixel is assigned its highest-scoring class. After that, a label map is constructed by assigning pixels labeled as bone and tissue classes to a hand label and other pixels to a non-hand label.
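The sliding-window reconstruction can be sketched as follows. The way scores are recorded per pixel is an assumption here (scores from every patch covering a pixel are summed), and `toy_classifier` is a hypothetical stand-in for the trained detection CNN, used only so the example runs.

```python
import numpy as np

CLASSES = ["bone", "tissue", "background", "collimation", "annotation"]
HAND_CLASSES = [0, 1]  # bone and tissue together form the hand label


def build_label_map(img, classify_patch, patch=32, stride=4):
    """Slide a patch over the image and return a boolean hand/non-hand map.

    `classify_patch` takes a (patch x patch) array and returns five class
    scores. Assumption: scores of all patches covering a pixel are summed.
    """
    h, w = img.shape
    scores = np.zeros((len(CLASSES), h, w))
    covered = np.zeros((h, w), dtype=bool)
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            window = img[top:top + patch, left:left + patch]
            patch_scores = np.asarray(classify_patch(window), dtype=float)
            scores[:, top:top + patch, left:left + patch] += patch_scores[:, None, None]
            covered[top:top + patch, left:left + patch] = True
    best_class = scores.argmax(axis=0)
    # Pixels never covered by a patch default to the non-hand label.
    return np.isin(best_class, HAND_CLASSES) & covered


def toy_classifier(window):
    """Hypothetical stand-in for the detection CNN: bright means bone."""
    return [1.0, 0, 0, 0, 0] if window.mean() > 128 else [0, 0, 1.0, 0, 0]
```

With a real model, `classify_patch` would wrap a forward pass of the detection network and return its five class probabilities.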
Mask Generation
Most label maps have clearly separated hand and non-hand regions, but, as in the example in Fig. 4, false-positive regions were sometimes assigned to the hand class. We therefore extracted the largest contiguous contour, filled it, and created a clean mask for the hand and wrist, as shown in Fig. 4.
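One way to realize this step is sketched below. It substitutes the largest 4-connected component for the largest contour and fills holes by flood-filling the background from the image border; both are stand-ins chosen so that the example needs only NumPy, not necessarily what the study used.

```python
from collections import deque

import numpy as np


def _flood(mask, seeds):
    """Return the pixels of `mask` that are 4-connected to any seed pixel."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    queue = deque()
    for r, c in seeds:
        if mask[r, c] and not seen[r, c]:
            seen[r, c] = True
            queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and mask[nr, nc] and not seen[nr, nc]:
                seen[nr, nc] = True
                queue.append((nr, nc))
    return seen


def largest_filled_region(label_map):
    """Keep the largest hand-labeled region and fill its interior holes."""
    remaining = np.asarray(label_map, dtype=bool).copy()
    best = np.zeros_like(remaining)
    while remaining.any():
        r, c = np.argwhere(remaining)[0]
        component = _flood(remaining, [(r, c)])
        if component.sum() > best.sum():
            best = component
        remaining &= ~component
    # Background reachable from the border is outside; the rest is the mask.
    h, w = best.shape
    border = [(r, c) for r in range(h) for c in (0, w - 1)]
    border += [(r, c) for c in range(w) for r in (0, h - 1)]
    outside = _flood(~best, border)
    return ~outside
```

Small false-positive islands are discarded because they never become the largest component, and gaps inside the hand region are closed by the fill.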
Vision Pipeline
After creating the mask, the system passes it to the vision pipeline. The first stage uses the mask to remove extraneous artifacts from the image. Next, the segmented region is centered in the new image to eliminate translational variance. Subsequently, histogram equalization for contrast enhancement, denoising, and sharpening filters are applied to enhance the bones. A final preprocessed image is shown in Fig. 4.
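The masking, centering, and contrast-enhancement stages can be sketched as below for 8-bit images. Centering on the bounding box of the mask and using global histogram equalization are assumptions; the denoising and sharpening filters are omitted because their type and parameters are not specified here.

```python
import numpy as np


def apply_mask_and_center(img, mask):
    """Zero out non-hand pixels and move the hand region to the image center.

    Assumption: the region is centered by its bounding box.
    """
    out = np.zeros_like(img)
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return out
    top, bottom = rows.min(), rows.max() + 1
    left, right = cols.min(), cols.max() + 1
    crop = np.where(mask, img, 0)[top:bottom, left:right]
    h, w = img.shape
    r0 = (h - crop.shape[0]) // 2
    c0 = (w - crop.shape[1]) // 2
    out[r0:r0 + crop.shape[0], c0:c0 + crop.shape[1]] = crop
    return out


def equalize_hist(img):
    """Global histogram equalization for a uint8 image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    denom = cdf[-1] - cdf_min
    if denom == 0:                      # constant image: nothing to stretch
        return img.copy()
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255), 0, 255)
    return lut.astype(np.uint8)[img]
```

In a full pipeline, denoising and sharpening filters would follow `equalize_hist(apply_mask_and_center(img, mask))`.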
Image Sample Patch Size and Stride Selection
Preprocessing performance depends on the size of an image sample patch and the stride by which the detection system moves. We conducted a regressive test to find the optimal image patch size and stride by comparing varying strides (2, 4, 8, 16) and image patch sizes (16 × 16, 24 × 24, 32 × 32, 40 × 40, 48 × 48, 56 × 56, 64 × 64) as shown in Fig. 5a. For this experiment, 280 images (10 images per class for each of the female and male cohorts) were randomly selected from the test dataset to evaluate the preprocessing engine’s performance by calculating the arithmetic mean of Intersection over Union values (mIoU) between the predicted and ground truth binary maps. Based on the results in Fig. 5, a 32 × 32 image patch size and a stride of 4 are the optimal configuration with an mIoU of 0.92.
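The mIoU metric used in this comparison can be written as a short NumPy sketch. The function names are illustrative, and treating two empty masks as a perfect match is an assumption for an edge case the text does not address.

```python
import numpy as np


def iou(pred, truth):
    """Intersection over Union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(np.logical_and(pred, truth).sum() / union)


def mean_iou(preds, truths):
    """Arithmetic mean of per-image IoU values (mIoU)."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, truths)]))
```

Evaluating `mean_iou` over the selected images for every (patch size, stride) pair gives the grid of values from which the best configuration is chosen.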
Classification CNN
Deep CNNs consist of alternating convolution and pooling layers to learn layered hierarchical and representative abstractions from input images, followed by fully connected classification layers that are trained on the feature vectors extracted from the earlier layers. They have achieved considerable success in many computer vision tasks including object classification, detection, and semantic segmentation. Many innovative deep neural networks and novel training methods have demonstrated impressive performance for image classification tasks, most notably in the ImageNet competition [13–15]. The rapid advance in classification of natural images is due to the availability of large-scale and comprehensively annotated datasets such as ImageNet [16]. However, obtaining medical datasets on such a scale and with annotation of equal quality remains a challenge. Medical data cannot be easily accessed due to patient privacy regulations, and image annotation requires an onerous and time-consuming effort from highly trained human experts. Most classification problems in the medical imaging domain are fine-grained recognition tasks which classify highly similar-appearing objects in the same class using local discriminative features. For example, skeletal ages are evaluated by the progression in epiphyseal width relative to the metaphyses at different phalanges, carpal bone appearance, and radial or ulnar epiphyseal fusion, but not by the shape of the hand and wrist. Subcategory recognition tasks are known to be more challenging than basic-level recognition as less data and fewer discriminative features are available [17]. One approach to fine-grained recognition is transfer learning. It uses well-trained, low-level knowledge from a large-scale dataset and then fine-tunes the weights to make the network specific for a target application.
This approach has been applied to datasets that are similar to the large-scale ImageNet such as Oxford flowers [18], Caltech bird species [19], and dog breeds [20]. Although medical images are considerably different from natural images, transfer learning can be a possible solution by using generic filter banks trained on the large dataset and adjusting parameters to render high-level features specific for medical applications. Recent works [21, 22] have demonstrated the effectiveness of transfer learning from general pictures to the medical imaging domain by fine-tuning several (or all) network layers using the new dataset.
Optimal Network Selection for Transfer Learning
We considered three high-performing CNNs, AlexNet [13], GoogLeNet [14], and VGG-16 [15], as candidates for our system as they were validated in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [23]. Canziani et al. performed a comparative study between the candidate networks, and a summary of their differences is presented in Table 1 [24]. If accuracy is the sole determiner, VGG-16 is the best performer and AlexNet is the worst. However, GoogLeNet utilizes ∼25 times fewer trainable parameters to achieve comparable performance to VGG-16 with a faster inference time. In addition, GoogLeNet is the most efficient neural network [24], particularly because the inception modules described in Figs. 5 and 6 enable the network to learn hierarchical representative features without many trainable parameters by minimizing the number of fully connected layers.
Table 1 Comparisons of the three candidate networks for transfer learning in terms of trainable parameter number, computational requirements for a single inference, and single-crop top 1 accuracy on the ImageNet validation dataset
Training Details
We retrieved a pretrained model of GoogLeNet from Caffe Zoo [25] and set about fine-tuning the network to medical images. ImageNet consists of color images, and the first layer filters of GoogLeNet correspondingly comprise three RGB channels. Hand radiographs are grayscale, however, and only need a single channel. As such, we converted the filters into a single channel by taking arithmetic means of the preexisting RGB values. We confirmed that the converted grayscale filters matched the same general patterns of filters, mostly consisting of edge, corner, and blob extractors. After initializing the network with the pretrained model, our networks were further trained using SGD for 100 epochs with a mini-batch size of 96 using 9 different combinations of hyperparameters, including base learning rates (0.001, 0.005, 0.01) and gamma values (0.1, 0.5, 0.75), in conjunction with a momentum term of 0.9 and a weight decay of 0.005. The learning rate, a hyperparameter that controls how quickly weights and biases change during training, is multiplied by the gamma value at each of three steps to ensure a stable convergence of the loss function. It is challenging to determine the best learning rate because it varies with intrinsic factors of the dataset and neural network topology. To resolve this, we ran an extensive grid search over combinations of hyperparameters using the NVIDIA Devbox [26] to find the optimal learning rate schedule.
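The filter conversion and the learning rate grid can be sketched as follows. The weight layout (output channels, input channels, height, width) and the step epochs in `ASSUMED_STEP_EPOCHS` are assumptions; the text states only that the rate is reduced in three steps over 100 epochs, not when those steps occur.

```python
from itertools import product

import numpy as np

BASE_LRS = (0.001, 0.005, 0.01)
GAMMAS = (0.1, 0.5, 0.75)
ASSUMED_STEP_EPOCHS = (25, 50, 75)  # hypothetical; not given in the text


def rgb_filters_to_gray(weights):
    """Average first-layer filters over the RGB axis.

    `weights` is assumed to have shape (out_channels, 3, kh, kw); the result
    has shape (out_channels, 1, kh, kw).
    """
    return weights.mean(axis=1, keepdims=True)


def step_lr(base_lr, gamma, epoch, step_epochs=ASSUMED_STEP_EPOCHS):
    """Learning rate after multiplying by gamma at each step already passed."""
    steps_taken = sum(epoch >= s for s in step_epochs)
    return base_lr * gamma ** steps_taken


# The 9 hyperparameter combinations explored by the grid search.
GRID = list(product(BASE_LRS, GAMMAS))
```

With a base rate of 0.01 and gamma of 0.1, for example, the rate under these assumed step epochs falls to 0.001, 0.0001, and 0.00001 in turn.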
Preventing Overfitting (Data Augmentation)
Deep neural networks require a large amount of labeled training data for stable convergence and high classification accuracy. If there is limited training data, deep neural networks will overfit and fail to generalize to the target application. This is a particular challenge in medical imaging, as compilation of high-quality and well-annotated images is a laborious and expensive process. As a result, several methods are used to decrease the risk of overfitting. Data augmentation is one technique where we synthetically increase the size of the training dataset with geometric transformations, photometric transformations, noise injections, and color jittering [13], while preserving the same image label. Table 2 details the geometric, contrast, and brightness transformations used for real-time data augmentation and the number of possible synthetic images for each. Affine transformations, including rotation, scaling, and shearing, together with photometric variation were utilized to improve resiliency of the network to geometric variants and variations in contrast or intensity. Rotations ranged from −30° to +30° in 5° increments. Scaling operations were performed by multiplying the width by 0.85–1.0 in 0.01 increments and the height by 0.9–1.0 in 0.01 increments. Shearing was performed by applying an x and y angle ranging from −5° to +5° with an increment of 1°. Brightness was adjusted by multiplying all pixels by a factor ranging from 0.9 to 1.0 with an increment of 0.01 and adding an integer ranging from 0 to 10. Each transformation was additionally toggled by a random switch. By using real-time data augmentation, a single image can be transformed into one of 110,715,000 images (= 61 × 150 × 121 × 100), preventing image repetition during each epoch. This method does not increase computing time or storage as images for the next iteration are augmented on the CPU while the previous iteration is being trained via the GPU.
We excluded random horizontal inversion, frequently utilized for natural images, because BAA only uses left-sided radiographs by convention. We also did not perform random translation as all images were centered at the image preprocessing stage.
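Real-time augmentation amounts to drawing a fresh set of transformation parameters for every image in every iteration. The sketch below samples parameters from the ranges stated above; the probability of each random switch (`p_apply`) and the parameter names are assumptions, and applying the parameters to an image is left out.

```python
import random


def sample_augmentation(rng, p_apply=0.5):
    """Draw one set of augmentation parameters from the stated ranges.

    Each transformation is enabled by its own random switch; the switch
    probability `p_apply` is an assumption. Disabled transformations keep
    their identity values. No horizontal flip or translation is sampled.
    """
    params = {
        "rotation_deg": 0, "scale_w": 1.0, "scale_h": 1.0,
        "shear_x_deg": 0, "shear_y_deg": 0, "gain": 1.0, "offset": 0,
    }
    if rng.random() < p_apply:   # rotation: -30 to +30 in 5 degree steps
        params["rotation_deg"] = rng.randrange(-30, 31, 5)
    if rng.random() < p_apply:   # scaling: width 0.85-1.0, height 0.9-1.0
        params["scale_w"] = rng.randint(85, 100) / 100
        params["scale_h"] = rng.randint(90, 100) / 100
    if rng.random() < p_apply:   # shearing: x and y angles, -5 to +5 degrees
        params["shear_x_deg"] = rng.randint(-5, 5)
        params["shear_y_deg"] = rng.randint(-5, 5)
    if rng.random() < p_apply:   # brightness: gain 0.9-1.0 plus offset 0-10
        params["gain"] = rng.randint(90, 100) / 100
        params["offset"] = rng.randint(0, 10)
    return params
```

Passing a seeded `random.Random` instance makes an augmentation sequence reproducible, which can help when debugging a training run.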
Table 2 Summary of real-time data augmentation methods used in the study