Fully Automated Deep Learning System for Bone Age Assessment
- 7.3k Downloads
Skeletal maturity progresses through discrete phases, a fact that is used routinely in pediatrics where bone age assessments (BAAs) are compared to chronological age in the evaluation of endocrine and metabolic disorders. While central to many disease evaluations, little has changed to improve the tedious process since its introduction in 1950. In this study, we propose a fully automated deep learning pipeline to segment a region of interest, standardize and preprocess input radiographs, and perform BAA. Our models use an ImageNet pretrained, fine-tuned convolutional neural network (CNN) to achieve 57.32 and 61.40% accuracies for the female and male cohorts on our held-out test images. Female test radiographs were assigned a BAA within 1 year 90.39% and within 2 years 98.11% of the time. Male test radiographs were assigned 94.18% within 1 year and 99.00% within 2 years. Using the input occlusion method, attention maps were created which reveal what features the trained model uses to perform BAA. These correspond to what human experts look at when manually performing BAA. Finally, the fully automated BAA system was deployed in the clinical environment as a decision supporting system for more accurate and efficient BAAs at much faster interpretation time (<2 s) than the conventional method.
KeywordsBone-age Structured reporting Artificial neural networks (ANNs) Automated measurement Automated object detection Clinical workflow Computer-aided diagnosis (CAD) Computer vision Data collection Decision support Digital X-ray radiogrammetry Efficiency Classification Machine learning Artificial intelligence
Skeletal maturity progresses through a series of discrete phases, particularly in the wrist and hands. As such, pediatric medicine has used this regular progression of growth to assign a bone age and correlate it with a child’s chronological age. If discrepancies are present, these help direct further diagnostic evaluation of possible endocrine or metabolic disorders. Alternatively, these examinations may be used to optimally time interventions for limb-length discrepancies. While the process of bone age assessment (BAA) is central to the evaluation of many disease states, the actual process of BAA has not changed significantly since the publication of the groundbreaking atlas in 1950 by Greulich and Pyle , which was developed from studying children in Ohio from 1931 to 1942.
BAA can be performed either using the Greulich and Pyle (GP)  or Tanner-Whitehouse (TW2)  methods. The GP method compares the patient’s radiograph with an atlas of representative ages and determines the bone age. The TW2 system is based on a scoring system that examines 20 specific bones. In both cases, BAA requires a considerable time and contains significant interrater variability, leading to clinical challenges when therapy decisions are made based on changes in a patient’s BAA. Attempts have been made to shorten the evaluation process by defining shorthand methods to perform BAA more efficiently; however, these still rely on human interpretation and reference to an atlas .
BAA is the ideal target for automated image evaluation as there are few images in a single study (one image of the left hand and wrist) and relatively standardized reported findings (all reports contain chronological and skeletal ages with relatively standardized keywords, like “bone age” or “year old”). This combination is an appealing target for machine learning, as it sidesteps many labor-intensive preprocessing steps such as using Natural Language Processing (NLP) to process radiology reports for relevant findings.
IRB approval was obtained for this retrospective study. Using an internal report search engine (Render), all radiographs and radiology reports using the exam code “XRBAGE” were queried from 2005 to 2015. Accession numbers, ages, genders, and radiology reports were collected into a database. Using the open source software OsiriX, DICOM images corresponding to the accession numbers were exported. Our hospital’s radiology reports include the patient’s chronological age and the bone age with reference to the standards of Greulich and Pyle, second edition .
We randomly selected 15% of the total data for use as a validation dataset and 15% for use as a test dataset. The remainder (70%) was used as training datasets for the female and male cohorts. The validation datasets were utilized to tune hyperparameters to find the best model out of several trained models during each epoch. The best network was evaluated using the test datasets to determine whether the top 1 prediction matched the ground truth, was within 1 year or 2 years. In order to make a fair comparison, we used the same split datasets for each test as new random datasets might prevent fair comparisons.
The first step of the preprocessing engine is to normalize radiographs for a grayscale-base and image size before feeding them to the detection CNN. Some images have black bones with white backgrounds and others have white bones with black backgrounds (Fig. 3). Image size varies considerably from a few thousand to a few hundred pixels. To normalize the different grayscale bases, we calculated the pixel-means of 10 × 10 image patches in the four corners of each image and compared them with the half value of the maximum value for a given image resolution (e.g., 128 for 8-bit resolution). This effectively determines whether an image has a white or black background, allowing us to normalize them all to black backgrounds. The next step normalizes sizes of input images. Almost all hand radiographs are height-wise rectangles. Accordingly, we resized the heights of all images to 512 pixels, then through a combination of preserving their aspect ratios and using zero-padding; the widths were all made 512 pixels, ultimately creating standardized 512 × 512 images. We chose this size for two reasons: it needed to be larger than the required input size (224 × 224) for the neural network, and this size is the optimal balance for the performance of the detection CNN and the speed of preprocessing. Larger squares improve the detection CNN performance at the cost of slower deployment time, while smaller squares accelerate the testing time, but they result in worse image preprocessing.
The next step is to construct a label map which contains hand and non-hand regions. For each input radiograph, the detection system slides across the entire image, sampling patches, and records all class scores per pixel using the trained detection CNN. Based on the score records, the highest-score class is labeled to each pixel. After that, a label map is constructed by assigning pixels labeled as bone and tissue classes to a hand label and other pixels to a non-hand label.
Most label maps have clearly split regions of hand and non-hand classes, but like an example in Fig. 4, false-positive regions were sometimes assigned to the hand class. As a result, we extracted the largest contiguous contour, filled it, and then created a clean mask for the hand and wrist shown in Fig. 4.
After creating the mask, the system passes it to the vision pipeline. The first stage uses the mask to remove extraneous artifacts from the image. Next, the segmented region is centered in the new image to eliminate translational variance. Subsequently, histogram equalization for contrast enhancement, denoising, and sharpening filters are applied to enhance the bones. A final preprocessed image is shown in Fig. 4.
Image Sample Patch Size and Stride Selection
Deep CNNs consist of alternating convolution and pooling layers to learn layered hierarchical and representative abstractions from input images, followed by fully connected classification layers which are then trainable with the feature vectors extracted from the earlier layers. They have achieved considerable success in many computer vision tasks including object classification, detection, and semantic segmentation. Many innovative deep neural networks and novel training methods have demonstrated impressive performance for image classification tasks, most notably in the ImageNet competition [13, 14, 15]. The rapid advance in classification of natural images is due to the availability of large-scale and comprehensively annotated datasets such as ImageNet . However, obtaining medical datasets on such scale and with equal quality annotation as ImageNet remains a challenge. Medical data cannot be easily accessed due to patient privacy regulations, and image annotation requires an onerous and time-consuming effort of highly trained human experts. Most classification problems in the medical imaging domain are fine-grained recognition tasks which classify highly similar appearing objects in the same class using local discriminative features. For example, skeletal ages are evaluated by the progression in epiphyseal width relative to the metaphyses at different phalanges, carpal bone appearance, and radial or ulnar epiphyseal fusion, but not by the shape of the hand and wrist. Subcategory recognition tasks are known to be more challenging compared to basic level recognition as less data and fewer discriminative features are available . One approach to fine-grained recognition is transfer learning. It uses well-trained, low-level knowledge from a large-scale dataset and then fine-tunes the weights to make the network specific for a target application. This approach has been applied to datasets that are similar to the large-scale ImageNet such as Oxford flowers , Caltech bird species , and dog breeds . Although medical images are considerably different from natural images, transfer learning can be a possible solution by using generic filter banks trained on the large dataset and adjusting parameters to render high-level features specific for medical applications. Recent works [21, 22] have demonstrated the effectiveness of transfer learning from general pictures to the medical imaging domain by fine-tuning several (or all) network layers using the new dataset.
Optimal Network Selection for Transfer Learning
Comparisons of the three candidate networks for transfer learning in terms of trainable parameter number, computational requirements for a single inference, and single-crop top 1 accuracy on the ImageNet validation dataset
We retrieved a pretrained model of GoogLeNet from Caffe Zoo  and set about fine-tuning the network to medical images. ImageNet consists of color images, and the first layer filters of GoogLeNet correspondingly comprise three RGB channels. Hand radiographs are grayscale, however, and only need a single channel. As such, we converted the filters into a single channel by taking arithmetic means of the preexisting RGB values. We confirmed that the converted grayscale filters matched the same general patterns of filters, mostly consisting of edge, corner, and blob extractors. After initializing the network with the pretrained model, our networks were further trained using an SGD for 100 epochs with a mini-batch size of 96 using 9 different combinations of hyperparameters, including base learning rates (0.001, 0.005, 0.01) and gamma values (0.1, 0.5, 0.75), in conjunction with a momentum term of 0.9 and a weight decay of 0.005. Learning rate, a hyperparameter that controls the rate of weights and bias change during training a neural network, is decreased by the gamma value by three steps to ensure a stable convergence to loss function. It is challenging to determine the best learning rate because it varies with intrinsic factors of the dataset and neural network topology. To resolve this, we use an extensive grid search for optimal combinations of hyperparameters using the NVIDIA Devbox  to find the optimal learning rate schedule.
Preventing Overfitting (Data Augmentation)
Summary of real-time data augmentation methods used in the study
No. of synthetic images
−30° ≤ rotation angle ≤30°
0.85 ≤ width < 1.0, 0.9 ≤ height < 1.0
−5° ≤ x angle ≤5°, −5° ≤ y angle ≤5°
α* pixel +_β, (0.9 ≤ α ≤ 1.0, 0 < β ≤ 10)
Optimal Depth of Fine Tuning
Comparison with Previous Works
Summary and comparison of prior attempts at automated BAA: dataset, method, salient features, and their limitations
24 GP female images
SIFT; SVD Fully connected NN
Fixed-sized features vectors from SIFT description with SVD
Training and validation with limited data; deficiency of robustness to actual images
180 images from 
Canny edge detection Fuzzy classification
Morphological features regarding carpal bones
Not applicable for children above 7 years
205 images from 
Canny edge detection Fuzzy classification
Morphological features regarding carpal bones (Capitate Hamate)
Not applicable for children above 5 years for females and 7 years for males
1559 images from multiple sources
Features regarding shapes, intensity, texture of RUS bones
Vulnerable to excessive noise in images chronological age used as input
8325 images at MGH
Deep CNN transfer learning
Data driven, automatically extracted features
The most successful attempt to date is BoneXpert , a software only medical device approved for use in Europe and the first commercial implementation of automated BAA. BoneXpert utilizes a generative model, the active appearance model (AAM), to automatically segment 15 bones in the hand and wrist and then determine either the GP or TW2 bone age based on shape, intensity, and textural features. Even though BoneXpert reports considerable accuracy for automated BAA, it has several critical limitations. BoneXpert does not identify bone age directly, because the prediction depends on a relationship between chronological and bone ages . The system is brittle and will reject radiographs when there is excessive noise. Prior studies report that BoneXpert rejected around 235 individual bones out of 5161 (4.5%) . Finally, BoneXpert does not utilize the carpal bones, despite their containing discriminative features for young children.
In summary, all prior attempts at automated BAA are based on hand-crafted features, reducing the capability of the algorithms from generalizing to the target application. Our approach exploits transfer learning with a pretrained deep CNN to automatically extract important features from all bones on an ROI that was automatically segmented by a detection CNN. Unfortunately, all prior approaches used varying datasets and provide limited details of their implementations and parameter selection that it is impossible to make a fair comparison with prior conventional approaches.
How to Improve the System?
The trained model in this study achieved impressive classification accuracy within 2 years (>98%) and within 1 year (>90%) for the female and male cohorts. Areas for future improvement abound. We plan to use insights from attention maps and iterative radiologist feedback to direct further learning and improve prediction accuracy. The attention maps reveal key regions similar to what domain experts use to perform conventional BAA; however, it is not certain whether the algorithm uses the exact same features as domain experts. Rather, this method of visualization only reveals that the important regions of the images are similar. The CNN could be using as yet unknown features to perform accurate fine-grained classification which happen to be in the same regions. Further investigation is needed to determine if bone morphology is what the CNN is using for BAA.
However, the algorithm still has room for improvement to provide even more accurate BAA at a faster interpretation time. We down sampled native DICOM images to 8-bit resolution jpegs (224 × 224) to provide a smaller matrix size and use GPU-based parallel computing. In the future, using the native 14-bit or 16-bit resolution images with larger matrix sizes will likely improve the performance of algorithm.
Another approach could be to develop a new neural network architecture optimized for BAA. Recent advanced networks, like GoogLeNet , VGGNet , and ResNet , contain many layers—16 to 152—and run the risk of overfitting given our relatively small amount of training images. Creating a new network topology might be a better approach for BAA which could be more effective than using transfer learning. This would require future systematic study to determine the best algorithm for BAA, beyond the scope of this work.
Lastly, we need to reconsider that bone ages obtained from reports may not necessarily reflect ground truth as BAA is inherently based on subjective analysis of human experts. In some radiology reports, bone ages were recorded as single numbers, a numerical range, or even a time point not in the original GP atlas. In addition, Greulich and Pyle’s original atlas  provides standard deviations that range from 8 to 11 months for a given chronological age, reflecting the inherent variation in the study population. As such, not all the ground truths can be assumed as correct. To counter this, the algorithm could be enhanced with an iterative training by applying varying weights to training images based on confidence levels in reports.
The proposed deep learning system for BAA will be used in the clinical environment to both more efficiently and more accurately perform BAA. It takes approximately 10 ms to perform a single BAA with a preprocessed image. However, it requires averagely 1.71 s to crop, segment, and preprocess an image prior to classification. Most of the time is consumed by the construction of the label map prior to segmentation. The time could be decreased by exploiting a selective search to process only plausible regions of interest . Additionally, instead of preserving aspect ratios and creating a 512 × 512 pixels image, image warping to a smaller matrix size could reduce the computational time required for segmentation at the cost of eventual output image quality. The optimal balance requires a systematic study, beyond the scope of this work. Although all stages of preprocessing and BAA cannot be performed in real time (<30 ms), net interpretation time (<2 s) is still accelerated compared to conventional BAA, which ranges from 1.4 to 7.9 min .
Figure 1 details the process of conventional BAA by radiologists and the proposed fully automated BAA system with automated report generation. Radiologists conventionally compare the patient’s radiograph to reference images in the G&P atlas, a repetitive and time-consuming task. Since bone age is evaluated based on a subjective comparison, interrater variability can be considerable. As a result, our system has another major advantage: it reduces interobserver variability for a given examination. Repeated presentations of the same radiograph to the CNN will always result in the same BAA.
Our workflow shows the radiologist a relevant range of images from the G&P atlas with probability estimate of which the algorithm considers the best match. The radiologist then chooses which image he or she thinks is the most accurate BAA, triggering the system to create a standardized report. This system can be seamlessly embedded into the reporting environment, where it provides structured data, improving the quality of health data reported to the EMR.
While our system has much potential to improve workflow, increase quality, and speed interpretation, there are important limitations. Exclusion of 0–4 year olds slightly limits the broad applicability of the system to all ages. Given that 10 years of accessions only included 590 patients of ages 0–4 years (5.6% of the total query), this limitation was felt to be acceptable given the relative rarity of patients in this age range. Eventually, by adding more radiographs to the dataset, we hope to expand our system to include all ages.
Another limitation is our usage of integer-based BAA, rather than providing time-points every 6 months. This is unfortunately inherent to the GP method. The original atlas did not provide consistent time points for assignment of age, rather than during periods of rapid growth, there are additional time points. This also makes training and clinical assessment difficult, given the constant variability in age ranges. This has been a problem that multiple others have tried to correct, such as Gilsanz and Ratib’s work in this area with the Digital Atlas of Skeletal Maturity, which uses idealized images from Caucasian children to provide 29 age groups from 8 months to 18 years of age . While their atlas is more consistent than the GP atlas, it has the serious limitation of not seeing wide clinical adoption, therefore limiting the available training data that we can then use for machine learning.
Because our cohort was underpowered for determinations below annual age determinations, we elected to floor ages in the cases where the age was reported as “X years, 6 months” to maintain a consistent approach to handling all intermediate time points and the fact that chronological ages are naturally counted with flooring. However, this could be introducing error. Retraining the models to account for this by using selectively rounded cases, a higher volume of cases, higher resolution images, or higher powered computer systems to find the optimal combination of settings is beyond the scope of this work but an important future direction.
Lastly, an important consideration is the extent of interobserver variability. Limited directly comparable data is available in the literature regarding interobserver variability in BAA. These estimates range from 0.96 years for British registrars evaluating 50 images using Greulich and Pyle to Tanner’s own publications which suggested manual interpretation using the TW2 system resulted in differences greater than 1 stage ranging from 17 to 33% of the time [38, 39, 40]. The most comprehensive open dataset available of hand radiographs with assessment by two raters is the Digital Hand Atlas , compiled by the Image Processing and Informatics Lab at the University of Southern California in the late 1990s. All radiographs in that series were rated by two raters, with an overall RMSE of 0.59 years—0.54 years for females, 0.57 years for males, and 0.66 years for all children ranging from 5 to 18 years of age. More recent publication from Korea reported interobserver variation of 0.51 ± 0.44 years by the GP method . These values provide a baseline for the human interobserver variability; however, they may underestimate the true degree of interobserver variability. Our values of 0.93 years for females and 0.82 years for males are comparable to the upper limits of these reported values, keeping in mind that our system does not reject malformed images. While our dataset does provide a rich source from which to perform a rigorous assessment of interobserver variability with multiple raters and experience levels, performing such an analysis is beyond the scope of this work and will be performed as part of future examinations to help guide assessments of system performance.
We have created a fully automated, deep learning system to automatically detect and segment the hand and wrist, standardize the images using a preprocessing engine, perform automated BAA with a fine-tuned CNN, and generate structured radiology reports with the final decision by a radiologist. This system automatically standardizes all hand radiographs of different formats, vendors, and quality to be used as a training dataset for future model enhancement and achieves excellent average BAA accuracy of 98.56% within 2 years and 92.29% within 1 year for the female and male cohorts. We determined that the trained algorithm assesses similar regions of the hand and wrist for BAA as what a human expert does via attention maps. Lastly, our BAA system can be deployed in the clinical environment by displaying three to five reference images from the G&P atlas with an indication of our automated BAA for radiologists to make the final age determination with one-click, structured report generation.
- 2.Tanner JM, Whitehouse RH, Cameron N. Assessment of skeletal maturity and prediction of adult height (Tw2 method). 1989.Google Scholar
- 3.Heyworth BE, Osei D, Fabricant PD, Green DW. A new, validated shorthand method for determining bone age. Annual Meeting of the. hss.edu; 2011; Available: https://www.hss.edu/files/hssboneageposter.pdf
- 6.Liskowski P, Pawel L, Krzysztof K. Segmenting retinal blood vessels with deep neural networks. IEEE Trans Med Imaging; 1–1, 2016.Google Scholar
- 10.Gilsanz V, Ratib O. Hand bone age: a digital atlas of skeletal maturity. Springer Science & Business Media; 2005.Google Scholar
- 12.LeCun Y, Cortes C, Burges C. The MNIST database of handwritten digits. 1998;Google Scholar
- 13.Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems. pp. 1097–1105, 2012.Google Scholar
- 14.Szegedy C, Christian S, Wei L, Yangqing J, Pierre S, Scott R, et al. Going deeper with convolutions. 2015 I.E. Conference on Computer Vision and Pattern Recognition (CVPR). doi: 10.1109/cvpr.2015.7298594, 2015.
- 15.Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition [Internet]. arXiv [cs.CV]. 2014. Available: http://arxiv.org/abs/1409.1556
- 16.Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. ImageNet: a large-scale hierarchical image database. Computer Vision and Pattern Recognition, 2009 CVPR 2009 I.E. Conference on. 2009. pp. 248–255.Google Scholar
- 18.pt?>Nilsback ME, Zisserman A. Automated flower classification over a large number of classes. Computer Vision, Graphics Image Processing, 2008 ICVGIP ‘08 Sixth Indian Conference on. 2008. pp. 722–729.Google Scholar
- 19.Wah C, Branson S, Welinder P, Perona P, Belongie S. The Caltech-UCSD birds-200-2011 dataset. Pasadena, CA: California Institute of Technology; 8, 2011.Google Scholar
- 20.Russakovsky O, Deng J, Krause J, Berg A, Fei-Fei L. Large scale visual recognition challenge 2013 (ILSVRC2013). 2013.Google Scholar
- 24.Canziani A, Paszke A, Culurciello E. An analysis of deep neural network models for practical applications [Internet]. arXiv [cs.CV]. 2016. Available: http://arxiv.org/abs/1605.07678
- 25.Jia Y. Caffe model zoo. 2015;Google Scholar
- 26.NVIDIA® DIGITS™ DevBox. In: NVIDIA Developer [Internet]. 16 Mar 2015 [cited 23 Aug 2016]. Available: https://developer.nvidia.com/devbox
- 27.Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T, editors. Computer vision—ECCV 2014. Springer International Publishing; 2014. pp. 818–833.Google Scholar
- 28.Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps [Internet]. arXiv [cs.CV]. 2013. Available: http://arxiv.org/abs/1312.6034
- 29.Seok J, Hyun B, Kasa-Vubu J, Girard A. Automated classification system for bone age X-ray images. 2012 I.E. International Conference on Systems, Man, and Cybernetics (SMC). IEEE; pp. 208–213.Google Scholar
- 35.He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition [Internet]. arXiv [cs.CV]. 2015. Available: http://arxiv.org/abs/1512.03385
- 36.Greulich WW, Pyle SI. Radiographic atlas of skeletal development of the hand and wrist. Am J Med Sci. pdfs.journals.lww.com; 1959; Available: http://pdfs.journals.lww.com/amjmedsci/1959/09000/Radiographic_Atlas_of_Skeletal_Development_of_the.30.pdf
- 37.Girshick R, Donahue J, Darrell T. Rich feature hierarchies for accurate object detection and semantic segmentation. and pattern recognition. cv-foundation.org; 2014; Available: http://www.cv-foundation.org/openaccess/content_cvpr_2014/html/Girshick_Rich_Feature_Hierarchies_2014_CVPR_paper.html
- 41.Kim SY, Oh YJ, Shin JY, Rhie YJ, Lee KH. Comparison of the Greulich-Pyle and Tanner Whitehouse (TW3) methods in bone age assessment. J Korean Soc Pediatr Endocrinol; 13:50–55, 2008.Google Scholar
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.