
1 Introduction

Wrist fractures are the most common type of fracture seen in emergency departments (EDs). They are estimated to account for 18% of fractures seen in adults [1, 2] and 25% of fractures seen in children [2]. They are usually identified in EDs by doctors examining lateral (LAT) and posteroanterior (PA) radiographs. Yet wrist fractures remain among the most commonly missed fractures in radiographs examined in EDs [3, 4]. Systems that can identify suspicious wrist areas and notify ED staff could reduce the number of misdiagnoses.

In this paper we describe a fully automated system for detecting radius fractures in PA and LAT radiographs. For each view, a global search [5] is performed to find the approximate position of the radius. The detailed outline of the bone is then located using a random forest regression-voting constrained local model (RFCLM) [6]. Convolutional neural networks (CNNs) are trained to detect fractures from cropped patches containing the region of interest. The decisions from the two views are averaged for better performance. This is the first work to demonstrate an automatic system that identifies fractures from PA and LAT wrist radiographs using convolutional neural networks, and it outperforms previously published work.

2 Previous Work

Early work on fracture detection used non-visual techniques: analysing mechanical vibration [7], analysing acoustic waves travelling along the bone [8], or measuring electrical conductivity [9]. The first published work on detecting fractures in radiographs was [10], in which an algorithm was developed to measure the femoral neck-shaft angle and use it to determine whether the femur was fractured. There is a body of literature on radiographic fracture detection covering a variety of anatomical regions, including arm fractures [11], femur fractures [10, 12,13,14,15], and vertebral endplates [16]. Cao et al. [17] worked on fractures in a range of anatomical regions using stacked random forests to fuse different feature representations (Schmid texture features, Gabor texture features, and forward contextual intensity). They achieved a sensitivity of 81% and a precision of 25%.

Work on wrist fracture detection from radiographs is still limited. The earliest works [13, 14] used active shape models and active appearance models [18] to locate the approximate contour of the radius and trained support vector machines (SVMs) on extracted texture features (Gabor, Markov random field, and gradient intensity). They worked on a small dataset with only 23 fractured examples in their test set and achieved encouraging performance. In previous work [19, 20] we used RFCLMs [21] to segment the radius in PA and LAT views and trained random forest (RF) classifiers on statistical shape parameters and eigen-mode texture features [18]. The fully automated system achieved a performance of 91.4% (area under the receiver operating characteristic curve, AUC) on a dataset of 787 radiographs (378 of which were fractured) in cross-validation experiments and was the first to combine both views.

Instead of hand-crafting features, Kim et al. [22] re-trained the top layer (i.e. the classifier) of the Inception v3 network [23] to detect fractures in wrist LAT views using features previously learned from non-radiological images (ImageNet [24]). This was the first work to use deep learning for detecting wrist fractures. The system was tested on 100 images (half of which were fractured) and an AUC of 95.4% was reported. However, images where the lateral projection was inconclusive for the presence or absence of a fracture were excluded, which biases the results favourably but contradicts the goal of developing such systems (i.e. helping clinicians with difficult, commonly missed fractures). Olczak et al. [25] re-trained five common deep networks from the Caffe library [26] on a dataset of 256,000 wrist, hand, and ankle radiographs, of which 56% contained fractures. The dataset was divided into 70% training, 20% validation, and 10% testing sets and used to train the networks on the tasks of detecting fractures and determining the exam view, body part, and laterality (left or right). Labels were extracted by automatically mining reports and DICOM headers. The images were rescaled to \(256\,{\times }\,256\) and then cropped to a subsection of the original image matching the network's input size. This pre-processing distorts the images, but the authors argued that the nature of the tasks does not require undistorted images. The networks were pre-trained on the ImageNet dataset [24] and their top layers (i.e. classifiers) were replaced with fully connected layers suitable for each task. The best-performing network (VGG 16 [27]) achieved a fracture detection accuracy of 83%, without a reported false positive rate.
The model handles the various views independently and does not combine them for a decision. Another related work [28] used a very deep CNN-based model (169 trainable layers) for abnormality detection from raw radiographs. Images are labelled as normal or abnormal, where abnormal does not always mean "fractured": it sometimes means that metalwork is present. Their dataset contains metal hardware in both categories (normal and abnormal) and also spans different age groups. This makes the definition of abnormality rather unclear, as what is considered abnormal for one age group can be seen as normal for another and vice versa.

3 Background

3.1 Shape Modeling and Matching

Statistical shape models (SSMs) [18] are widely used for studying the contours of bones. Shape is the quality that remains after all differences due to location, orientation, and scale are removed from a population of same-class objects. SSMs assume that each shape instance is a deformed version of the mean shape describing the object class. The training data are used to identify the mean shape and its possible deformations. The contour of an object is described by a set of model points \((x_i, y_i)\) packed into a 2n-D vector \(\mathbf {x}=(x_{1},\ldots ,x_{n},y_{1},\ldots ,y_{n})^T\). An SSM is a linear model of the shape variations of the object across the training dataset, built by applying principal component analysis (PCA) to the aligned shapes and fitting a Gaussian distribution in the reduced space. A shape instance \(\mathbf {x}\) is represented as:

$$\begin{aligned} \mathbf {x} \approx T(\bar{\mathbf {x}}+\mathbf {P}\mathbf {b}:\theta ), \end{aligned}$$
(1)

where \(\bar{\mathbf {x}}\) is the mean shape, \(\mathbf {P}\) is the matrix whose columns are the orthogonal eigenvectors corresponding to the t largest eigenvalues of the covariance matrix of the training data, \(\mathbf {b}\) is the vector of shape parameters, and \(T(.:\theta )\) applies a similarity transformation with parameters \(\theta \) between the common reference frame and the image frame. The number of retained eigenvectors t is chosen to represent most of the total variation (e.g. 95–\(98\%\)).
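As an illustration, the following is a minimal sketch of building such a shape model with PCA, assuming the training shapes have already been aligned to a common reference frame (the function and variable names are ours, not from the paper):

```python
import numpy as np

def build_ssm(shapes, var_fraction=0.98):
    """Build a linear shape model from aligned training shapes.

    shapes: (N, 2n) array, each row is (x1..xn, y1..yn) after alignment.
    Returns the mean shape, the eigenvector matrix P (2n x t) and the t eigenvalues.
    """
    mean_shape = shapes.mean(axis=0)
    centred = shapes - mean_shape
    # PCA via the covariance matrix of the aligned training shapes
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Keep enough modes to explain e.g. 98% of the total variance
    t = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_fraction) + 1
    return mean_shape, eigvecs[:, :t], eigvals[:t]

def shape_from_params(mean_shape, P, b):
    """Reconstruct a shape in the reference frame: x_bar + P b (Eq. 1,
    before the similarity transform T into the image frame)."""
    return mean_shape + P @ b
```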

One of the most effective algorithms for locating the outline of bones in radiographs is RFCLM [6]. This uses a collection of RFs to predict the most likely location of each point based on nearby image patches. A shape model is then used to constrain the points and encode the result.
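A minimal sketch of the shape-constraint step is shown below: candidate points are projected into the model space and each parameter is limited, here to three standard deviations (the limit is our assumption for illustration, not a value taken from the paper):

```python
import numpy as np

def constrain_shape(points, mean_shape, P, eigvals, n_sd=3.0):
    """Project a candidate shape onto the model and clamp its parameters.

    points: (2n,) vector of candidate point positions in the reference frame.
    Returns the nearest plausible shape allowed by the model.
    """
    b = P.T @ (points - mean_shape)        # shape parameters of the candidate
    limits = n_sd * np.sqrt(eigvals)       # allow +/- n_sd per mode
    b = np.clip(b, -limits, limits)
    return mean_shape + P @ b
```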

3.2 Convolutional Neural Network

CNNs are a class of deep feed-forward artificial neural networks for processing data that have a known grid-like topology. They emerged from studies of the brain's visual cortex and have benefited from the recent increases in computational power and in the amount of available training data.

A typical CNN (Fig. 1) stacks a few convolutional layers, followed by a subsampling (pooling) layer, then another few convolutional layers, another pooling layer, and so on. At the top of the stack, fully-connected layers output a prediction (e.g. estimated class probabilities). This layer-wise organisation allows CNNs to combine low-level features into higher-level features (Fig. 2), learning the features themselves and eliminating the need for hand-crafted feature extractors. In addition, the learned features are translation invariant and incorporate the two-dimensional (2D) spatial structure of images, which has contributed to CNNs achieving state-of-the-art results in image-related tasks.

Fig. 1.

A convolutional neural network-based classifier applied to a single-channel input image. Every convolutional layer (Conv) transforms its input into a three-dimensional volume of neuron activations. The pooling layer (Pool) downsamples the volume spatially, independently in each feature map. At the end, fully-connected layers (FC) output a prediction.

A convolutional layer has k filters (or kernels) of size \(r\,{\times }\,r\,{\times }\,c\) (the receptive field size), where r is smaller than the input width/height and c equals the input depth. Each filter is convolved with the input volume in a sliding-window fashion to produce a feature map (Fig. 2). Each convolution operation is followed by a nonlinear activation, typically a rectified linear unit (ReLU), which sets negative values to zero. A feature map can be subsampled by taking the mean or maximum value over \(p\,{\times }\,p\) contiguous regions to produce translation-invariant features (pooling). The value of p usually ranges between 2 and 5, depending on how large the input is. This reduction in spatial size leads to fewer parameters and less computation, and helps control overfitting.
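For intuition, a small sketch of the size arithmetic for unpadded convolution and non-overlapping pooling follows; the specific filter and pool sizes are illustrative, not those of our networks:

```python
def conv_output_size(n, r, stride=1):
    """Spatial size after an unpadded (valid) convolution with an r x r filter."""
    return (n - r) // stride + 1

def pool_output_size(n, p):
    """Spatial size after non-overlapping p x p pooling."""
    return n // p

# e.g. a 121 x 121 patch, a 5 x 5 filter, then 2 x 2 pooling:
n = conv_output_size(121, 5)   # 117
n = pool_output_size(n, 2)     # 58
print(n)
```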

The local connections, tied weights, and pooling mean that CNNs have fewer trainable parameters than fully connected networks with the same number of hidden units. The parameters are learned by back-propagation with gradient-based optimisation to minimise a cost function.

Fig. 2.

In a convolutional neural network, each neuron receives input from only a restricted subarea (receptive field) of the previous layer's output. Convolving the k filters with the whole input volume produces k feature maps.

4 Methods

4.1 Patch Preparation

Because most parts of a radiograph are either background or irrelevant to the task, we chose to train CNNs on cropped patches rather than raw images. The steps of the automated system are shown in Fig. 3. Following our previous work [20], we used a global search with a random forest regression voting (RFRV) technique to find the approximate radius location (red dots in Fig. 3), followed by a local search performed by a sequence of RFCLM models of increasing resolution to find its contour. The automatic point annotation accurately gives the position, orientation, and scale of the distal radius. This is used to transform the bone into a standardised coordinate frame before cropping a patch of size \(n_i\,{\times }\,n_i\) pixels containing the bone. We used the resulting patches to train and test a CNN. The process is completely automatic. Figure 4 shows examples of radiographs and extracted patches.
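A minimal sketch of this standardisation and cropping step, assuming the landmark points from the RFCLM search and their reference positions in the patch frame are available (the reference-frame convention and function name are ours):

```python
import numpy as np
import cv2

def crop_standardised_patch(image, points, ref_points, patch_size):
    """Warp the image so the located radius matches a reference pose, then crop.

    image: grayscale radiograph as a 2D array.
    points: (n, 2) landmark positions found by the RFCLM search in the image.
    ref_points: (n, 2) corresponding positions in the standardised patch frame.
    patch_size: side length of the square output patch in pixels.
    """
    # Least-squares similarity transform from image landmarks to the reference frame
    M, _ = cv2.estimateAffinePartial2D(points.astype(np.float32),
                                       ref_points.astype(np.float32))
    return cv2.warpAffine(image, M, (patch_size, patch_size))
```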

Fig. 3.

Fully automated system for detecting wrist fractures. (Color figure online)

Fig. 4.

Example pairs of radiographs for four subjects with (a) a normal radius and (b)–(d) fractured radii. The first and third rows show the posteroanterior and lateral views respectively. The corresponding cropped patches appear below each view.

4.2 Network Architecture

We trained one CNN for each view. The two CNNs were classical stacks of CP layers (CP refers to one ReLU-activated convolutional layer followed by a pooling layer) with two consecutive fully-connected (FC) layers. No padding was used. Weights were initialised with the Xavier uniform kernel initializer [29] and biases were initialised to zero. The loss function was binary cross-entropy, optimised with Adam [30] (default parameter values). Input patch sizes of \(121\,{\times }\,121\) and \(151\,{\times }\,151\) were used for the PA and LAT networks respectively. Architecture details are summarised in Table 1. In our experiments we gradually increased the number of CP layers and chose the network with the best performance. Figure 5 shows an example network with three CP layers followed by two FC layers.
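A minimal Keras sketch of this kind of network is given below; the filter counts, kernel size, and dense-layer width are placeholders for illustration, not the exact values from Table 1:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cp_network(input_size, n_cp_layers=3, n_filters=32, kernel=3):
    """Stack of CP blocks (ReLU-activated convolution + max pooling), no padding,
    followed by two fully-connected layers, as used for one view."""
    inputs = keras.Input(shape=(input_size, input_size, 1))
    x = inputs
    for _ in range(n_cp_layers):
        x = layers.Conv2D(n_filters, kernel, padding="valid", activation="relu",
                          kernel_initializer="glorot_uniform",  # Xavier uniform
                          bias_initializer="zeros")(x)
        x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)          # fracture probability
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(),            # default parameters
                  loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC()])
    return model

pa_model = build_cp_network(121, n_cp_layers=4)
lat_model = build_cp_network(151, n_cp_layers=3)
```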

Fig. 5.

Example network with three CP layers followed by two fully-connected layers.

5 Experiments and Results

5.1 Data

We collected a wrist dataset containing 1010 pairs of wrist radiographs (PA and LAT) for 1010 adult patients, 505 of whom had fractures (Fig. 4). Images for 787 patients, 378 of whom had fractures, were gathered from two local EDs, while the rest were taken from the MURA dataset [28], treating fractures as the abnormality. The fractured examples do not contain any plaster casts or metalware, to ensure the network learns features for detecting fractures rather than hardware.

5.2 Fracture Detection

We carried out 5-fold cross-validation experiments. During each fold, 802 radiographs were used as the training set, 102 as the validation set, and 102 as the testing set. The validation and testing sets were then swapped so that all the data were tested exactly once. Each time, a network was trained from scratch for 20 epochs with batch size\(\,{=}\,32\) and the model with the lowest validation loss was selected. The training data were randomly shuffled at the start of each epoch to produce different batches each time. We found that the architecture with three CP layers performed best for the LAT view and the one with four CP layers performed best for the PA view. Having trained the two CNNs, one for each view, their outputs are combined by averaging (Fig. 6). Figure 7 shows the average performance and learning curves. We achieved an average performance of AUC\(\,{=}\,95\)% for the PA view, 93% for the LAT view, and 96% for the two views combined.
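A minimal sketch of how the two view-specific outputs can be averaged and scored, assuming the test patches and labels are already prepared (scikit-learn is used here only to compute the AUC; the function name is ours):

```python
from sklearn.metrics import roc_auc_score

def evaluate_combined(pa_model, lat_model, pa_patches, lat_patches, labels):
    """Score each view separately, then average the two per-patient outputs.

    pa_patches, lat_patches: (N, size, size, 1) test patches for the same N patients.
    labels: (N,) ground truth, 1 = fractured, 0 = normal.
    """
    pa_scores = pa_model.predict(pa_patches).ravel()
    lat_scores = lat_model.predict(lat_patches).ravel()
    combined = (pa_scores + lat_scores) / 2.0       # average the two views
    return {"PA": roc_auc_score(labels, pa_scores),
            "LAT": roc_auc_score(labels, lat_scores),
            "Combined": roc_auc_score(labels, combined)}
```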

Table 1. The overall architecture, with feature-map sizes corresponding to an input wrist patch of size \(121\,{\times }\,121\). The same architecture was also used with \(151\,{\times }\,151\) inputs. In our experiments we gradually increased the number of CP layers and chose the network with the best performance.
Fig. 6.

During testing the outputs for both views are combined by averaging.

Fig. 7.

Fracture detection. (a) Receiver operating characteristic (ROC) curve for the posteroanterior view. (b) ROC curve for the lateral view. (c) ROC curve for both views combined. (d) Example learning curves for a model.

Kim et al. [22] used features originally learned for classifying non-radiological images [24] to detect fractures in LAT views, and reported an AUC of 95.4%. Unlike their work, we have not excluded images where the lateral projection was inconclusive for the presence or absence of a fracture, an exclusion that would bias the results favourably. We performed 5-fold cross-validation and report an overall AUC of 96%. For comparison with our previous RF-based technique [20], we repeated all experiments of [20] on the current dataset with the same fold divisions and found an AUC of 92% for the two views combined, and 89% and 91% for the PA and LAT views respectively (Table 2 and Fig. 8). The CNN-based technique clearly outperforms the RF-based one.

Table 2. Comparison between convolutional neural network (CNN)-based and random forest (RF)-based techniques on the same dataset in terms of area under the curve ± standard deviation (PA - posteroanterior, LAT - lateral).
Fig. 8.

Comparison between receiver operating characteristics curves for the proposed convolutional neural network-based technique and the relevant random forest-based work in [20] on: (a) posteroanterior view, (b) lateral view, and (c) both views combined for the same dataset in terms of area under the curve ± standard deviation.

5.3 Conclusions

We presented a system for automatic wrist fracture detection from plain PA and LAT radiographs. A CNN is trained from scratch on radiographic patches cropped around the joint after automatic segmentation and registration. This directed pre-processing ensures that learning takes place only on the targeted region at a standardised scale, which reduces the noise the CNN is exposed to compared with training on full images containing parts irrelevant to the task. Radiographs, unlike photographs, have predictable content that allows model-based techniques to work well; they can therefore provide CNNs with an input that dispenses with the need to (1) perform any data augmentation and (2) unnecessarily complicate the deep architecture and its learning process. Our work was the first to train CNNs from scratch on the task of detecting wrist fractures and to combine the two views for a decision. The experiments showed that combining the results from both views improves the overall classification performance, with an AUC of 96% compared with 95% for the PA view and 93% for the LAT view.