1 Introduction

Burns are skin injuries caused by heat, radiation and other acute trauma. According to the World Health Organization (WHO), about 300,000 people die every year as a result of burn-related injuries, with a high incidence in developing countries [1]. In 2004, WHO reported that nearly 11 million people required medical treatment for burns. Similarly, the study in [2] finds that nearly half a million Americans are affected by thermal injuries each year, with almost 40,000 hospital admissions. The severity of burn injuries varies from superficial burns that heal within 14 days to the most complicated deeper (full-thickness) burns that take more than 3 weeks to heal and require surgical management [3].

Early patient recovery and healing of burn wounds are of paramount importance and depend on effective and timely assessment. A timely assessment provides an avenue for a decision to be made as early as possible on whether surgery (skin grafting) is required [4]. This ensures a shorter hospital stay, reduced expenses and a lower risk of hospital-acquired complications. Burns are commonly assessed clinically by observation owing to its availability and lower diagnostic cost: visual and tactile examination of the wound's characteristics, such as sensibility, appearance and capillary blanching, is the usual practical approach [5]. Biopsy is an alternative technique in which a sample of the burn wound is extracted and examined histologically to determine burn depth. Another important, non-invasive and promising technique for burn assessment is Laser Doppler Imaging (LDI), which measures blood flow and provides a prediction of healing potential [6] corresponding to the severity of the burn, thereby supporting prompt decision-making.

However, the reliability of these approaches suffers from diagnostic shortcomings, such as inconsistency of assessment between different burn specialists [7]; this is mostly associated with clinical assessment, whose reliability rests on the dermatologist's experience. Histological analysis is undermined by high rates of sampling error and a lack of standardized interpretation. The reliability of LDI in diagnosing burns in children is affected by movement artifacts and by high acquisition and operating costs, and the device requires a high level of expertise to operate. These limitations therefore necessitate an alternative technique that is robust and effective in terms of accuracy, cost and timely decision-making. In this work, we propose the use of a deep transfer learning approach to discriminate whether a given image shows burnt skin or not. To facilitate this, we use datasets from different ethnicities so that the proposed diagnostic approach can be robust and effective.

1.1 Contributions

In this study, we propose a deep transfer learning approach to discriminate between human skin burns and healthy skin images in both Caucasian and African patients, relying on transfer learning due to deficient data. The study further provides an extensive analysis of data diversification, or inclusion, when training a deep learning algorithm. This is crucially important in order to produce a robust and less biased diagnostic platform. The rest of the paper is structured as follows: in Sect. 1.2, we highlight prior work conducted using machine learning approaches, and in Sect. 1.3 we give an overview of the relevant convolutional neural networks. In Sect. 2, we present our methodology. In Sect. 3, we present our results and discuss them. Finally, in Sect. 4, we summarise the findings and discuss possible future directions in which this study can be investigated further.

1.2 Literature Review

Machine learning algorithms are widely applied to solve real-world problems in different domains [8]. Specifically, deep learning techniques have recently achieved remarkable success in areas such as security [9], traffic forecasting [10] and agriculture [11], as well as age, gender and face recognition [12,13,14,15]. Moreover, in the health sector, deep learning has been applied to image classification and has performed extremely well in the detection of diseases [16,17,18,19].

Burns are traumatic injuries that subject thousands of people to physical deformities and, in extreme cases, loss of life, affecting different body parts such as the face, lower and upper limbs, and neck [20]. Their devastating effect is felt severely and causes distress to victims, their families and the nation as a whole. Recently, a number of researchers have attempted to address burn assessment challenges using machine learning algorithms.

A study by [21] proposed an automated process for the identification and classification of scald burns into different depth categories using colour and texture features from LAB images. The experiment was conducted on 50 images in each burn depth category using K-Nearest Neighbour (KNN) and Support Vector Machine (SVM) classifiers. SVM achieved the higher classification accuracy in every category: 85% on first-degree burns compared to 70% by KNN, 87.5% on second-degree burns compared to 82% by KNN, and 92.5% on third-degree burns compared to 75% by KNN.

The study in [7] used off-the-shelf features extracted by a pre-trained Convolutional Neural Network model, with an SVM as the classification algorithm, to identify whether an image contains burnt or healthy skin. The study used 1360 RGB Caucasian images (equally distributed between burnt and healthy skin) comprising burn injuries from different body locations, and achieved a classification accuracy of 99.5% on this Caucasian dataset.

Another study by the authors in [22] used 74 burn images in LAB colour space from Caucasian patients to discriminate burns using machine learning. The approach used handcrafted features to train an SVM, which achieved a classification accuracy of 82.43%.

Additionally, the study reported in [23] classified burns and pressure ulcer wounds in Caucasians. Three different ImageNet pre-trained convolutional neural network models were used as feature extractors, and in all three cases an SVM was used as the classifier. The evaluation was carried out using ten-fold cross-validation in order to avoid biases that might arise during the data splitting process. Interestingly, up to 99% accuracy was recorded using all the features from the ImageNet models.

The study reported in [24] similarly used a Convolutional Neural Network to predict burn depth in pediatric patients. The study was conducted on 23 burn images with ground truth defined by experienced clinicians. The original images were augmented by extracting regions of interest, resulting in 676 samples: 119 superficial burns, 120 superficial partial-thickness, 108 intermediate partial-thickness, 111 deep partial-thickness, 111 normal skin images and 107 background images. The authors trained four different Convolutional Neural Network models via fine-tuning; ResNet101 yielded the maximum accuracy of 81.66%, ResNet50 yielded 77.79%, GoogleNet 73.89%, and VGG-16 77.53%.

In contrast, our approach in this study uses a state-of-the-art deep learning model (specifically, a pre-trained ImageNet model) to discriminate burns using embedded feature representations from diverse ethnicities. To the best of our knowledge, this is the first study that provides extensive experiments and analysis of classifying burns in different ethnic or racial groups.

1.3 Convolutional Neural Network

Convolutional Neural Networks (ConvNets) are machine learning algorithms inspired by the human brain and used as architectures for classification tasks such as image recognition. A ConvNet architecture generally consists of a series of layers, where each layer produces 2D arrays of pixels (feature maps) that serve as input to the next layer. Training ConvNet architectures was a bottleneck for researchers, owing to the inability to access huge amounts of data and powerful computational machines, until around 2010, when a large repository of images called ImageNet [25] was made available. The ConvNet architecture is fundamentally made up of the following layers [26] (a minimal code sketch follows the list):

  • Input layer: this is where data are passed into the network. The data can be raw image pixels or their transformations.

  • Convolutional layer: this layer contains fixed-size filters arranged in series that perform convolution operations, producing what is referred to as a feature map.

  • Pooling layer: this is where the dimensions of the feature map produced by the convolutional layer are reduced, allowing the network to focus on the most important features.

  • Rectified Linear Unit (ReLU) layer: this layer applies a non-linear function to the output of the previous layer, setting all negative values to zero.

  • Fully connected layer: this is where the high-level reasoning over the patterns generated by the previous layers is done. All the activations in the previous layer have full connections to the neurons in this layer. For feature extraction using a pre-trained ConvNet model, features are generated here and used to train another classification algorithm.

  • Loss layer: this is where the deviation between the true and the predicted labels is penalized. This is normally the last layer of the ConvNet, and various loss functions are used depending on the task; examples include SoftMax, Cross-Entropy and Sigmoid.
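To make these layers concrete, the following is a minimal sketch of a ConvNet in PyTorch (our choice of framework for illustration; the layer sizes are illustrative and do not correspond to any model discussed in this paper):

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """Minimal ConvNet illustrating the layers described above."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: 16 filters -> feature maps
            nn.ReLU(),                                   # ReLU layer: sets negative values to zero
            nn.MaxPool2d(2),                             # pooling layer: halves each spatial dimension
        )
        self.classifier = nn.Linear(16 * 112 * 112, num_classes)  # fully connected layer

    def forward(self, x):                    # input layer: raw pixels, e.g. a 3 x 224 x 224 image
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)            # a loss layer (e.g. cross-entropy) is attached during training

model = TinyConvNet()
logits = model(torch.randn(1, 3, 224, 224))  # one dummy RGB image
```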

Deep ConvNet models became popular, and the research domain received more recognition, thanks to data availability and computational resources from 2010 onwards. Among the most common ConvNet models from 2010 to date are AlexNet, VGG-16 and ResNet-50. The AlexNet model, proposed by the authors of [27] from the University of Toronto in Canada, is composed of 5 convolutional layers with interweaved max-pooling layers and 3 fully connected layers. The first convolutional layer takes an input of size 224 × 224 and is equipped with 96 filters of size 11 × 11 with a stride of 4. The output of the first layer goes into the second layer as input (after the pooling operation), which is equipped with 256 filters of size 5 × 5. The third, fourth and fifth layers contain 384, 384 and 256 filters respectively, all of size 3 × 3, while the fully connected layers have 4096 neurons each, with a soft-max layer as the final layer for classification.

In 2014, the Visual Geometry Group (VGG) at Oxford University made another breakthrough in image classification using a sixteen-layer ConvNet model: thirteen convolutional and three fully connected layers [28]. Apart from being deeper than AlexNet, the receptive fields were significantly reduced to 3 × 3 in the convolutional layers and 2 × 2 in the pooling layers, with a stride of 2 maintained throughout. As in AlexNet, a soft-max layer was used as the final classification layer.

GoogleNet achieved remarkable performance in 2014 [29]. It has a total of 22 layers (or 27 layers including pooling layers). The architecture comprises parallel convolutional layers with different filters, whose outputs are concatenated as input to the subsequent layer. Unlike previously proposed models, GoogleNet is also equipped with 1 × 1 convolutions for dimensionality reduction. GoogleNet outperformed all other proposed models in the ILSVRC competition in 2014.

However, deeper ConvNets encountered several difficulties during training, including vanishing gradients and accuracy degradation, which were among the issues faced by VGG. When the network is deep, the gradient shrinks towards zero as it is propagated back from where the loss function is computed, resulting in a network that learns nothing. Among the proposals that addressed these challenges by allowing the network to go deeper while increasing performance is the Residual Network from Microsoft [30], which won the ILSVRC competition in 2015. The Residual Network (known as ResNet) uses skip connections to allow a copy of the gradient to be passed to the subsequent layer without passing through other weight layers, as depicted in Fig. 1.

Fig. 1 Illustration of the residual block
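The skip connection of Fig. 1 can be expressed in a few lines; below is a minimal sketch of a basic residual block in PyTorch (an illustrative sketch following the standard ResNet design, not code from the original paper):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x, where x bypasses the weight layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                      # gradients flow through this addition unchanged
        return self.relu(out)
```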

In summary, our proposal in this paper is straightforward: automatic recognition of burns using deep learning.

2 Materials and Methodology

ConvNets are among the best-known machine learning algorithms today, and their use has been widely exploited in medical image analysis. They are supervised machine learning techniques that can extract highly discriminative image features given a considerable amount of training data. Deploying a ConvNet is typically carried out in one of three ways: training a ConvNet from scratch, fine-tuning an existing ConvNet, or using off-the-shelf features. Training from scratch requires powerful computational machines and huge amounts of data, which is very challenging. The fine-tuning and off-the-shelf approaches are referred to as transfer learning. Fine-tuning involves customizing the top-most layers of an existing learned ConvNet while freezing the lower layers; features extracted by the frozen layers are used to train the customized or newly added top layers on the new dataset. Alternatively, classification algorithms such as Support Vector Machines (SVM) and Decision Trees (DT) can be trained on off-the-shelf features extracted by the frozen learned layers of the existing ConvNet.
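As an illustration of the off-the-shelf approach, the sketch below (assuming PyTorch/torchvision and scikit-learn; the variable names are hypothetical, and this is not the exact pipeline of any cited study) freezes an ImageNet pre-trained ResNet50 and trains an SVM on its pooled features:

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

# Load an ImageNet pre-trained ResNet50 and drop its 1000-class output layer.
backbone = models.resnet50(pretrained=True)
extractor = nn.Sequential(*list(backbone.children())[:-1])  # keep everything up to global pooling
extractor.eval()
for p in extractor.parameters():
    p.requires_grad = False  # frozen: features are used "off the shelf"

def extract_features(images):
    """images: (N, 3, 224, 224) tensor -> (N, 2048) feature matrix."""
    with torch.no_grad():
        return extractor(images).flatten(1).numpy()

# Hypothetical usage: X_train is a batch of pre-processed images, y_train the labels.
# clf = SVC(kernel="linear").fit(extract_features(X_train), y_train)
```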

Generally, there are several existing pre-trained models that are used for transfer learning, for example, VGG16, VGG19, ResNet50, ResNet101.

For the problem domain proposed in this study, we utilised the ResNet50 pre-trained model. Our methodology uses a fine-tuning approach in which the initial layers of ResNet50 were frozen to extract useful features, and the substituted top-most layers were subsequently trained on those features. In the next subsections, we present the datasets, the fine-tuning scenario and the experimental framework.

2.1 Data Collection

Datasets were collected from patients of different ethnicities, namely Caucasians and Africans. The Caucasian dataset was collected at Bradford hospital in the United Kingdom, and the African dataset was collected from the Federal Teaching Hospital Gombe, Nigeria. 1360 Caucasian images were successfully collected, comprising 680 burn images and 680 healthy skin images. The African dataset contains 270 burn images and 270 healthy skin images, totaling 540 images. Figure 2 shows burn samples from the two ethnicities. The database images include both pediatric and adult patients as well as different burn complexities.

Fig. 2 Dataset samples: a–c are burn samples from Caucasian patients; d–f are burn samples from African patients; while f–k are healthy skin samples from Caucasian and African patients respectively

2.2 Pre-processing

Prior to this research, the datasets from both Caucasians and Africans contained regions that are not relevant to burns identification. In order to diminish the risk of distorting the final results, such regions were carefully cropped out, and the images were normalized to enhance homogeneity.

2.3 Fine-Tuning Scenario

Pre-trained ConvNet models have become very useful owing to the ability to transfer their internal deep representations to recognition tasks in other domains facing limited data availability. They can be used as feature extractors or modified (fine-tuned) for a new task. Generic image features such as edges and blobs are captured by the early layers of a pre-trained model, while the later layers capture more specific image details. Fine-tuning a pre-trained ConvNet model involves copying all the layers of the model except the last, and replacing the last layer with a new task-specific layer whose size corresponds to the number of classes in the new domain. Figure 3 below depicts an illustration of this scenario.

Fig. 3 Illustration of fine-tuning scenario

We freeze the lower layers of ResNet50, as depicted in Fig. 3, while customizing the top layers. The customized ResNet50 has a fully connected layer with 512 nodes that passes through a Rectified Linear Unit activation (ReLU), followed by a dropout layer added to ensure good generalization. Figure 4 shows a fragment of the code used to modify the last layer of the pre-trained ConvNet model. The original model outputs 1000 different object categories, while the new customized model outputs 2 classes (i.e. healthy or burnt skin).

Fig. 4 Fragment of code of the fine-tuned layer
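The code in Fig. 4 is not reproduced here; the following is a sketch of the modification it describes (a 512-node fully connected layer, ReLU, dropout and a 2-class output replacing the 1000-class layer), under the assumption that PyTorch is used (the dropout rate is our assumption, as it is not reported):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True)

# Freeze the lower (pre-trained) layers so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the original 1000-class layer with the customised head described above.
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 512),  # fully connected layer with 512 nodes
    nn.ReLU(),                             # rectified linear activation
    nn.Dropout(p=0.5),                     # dropout for generalisation (rate assumed)
    nn.Linear(512, 2),                     # 2 output classes: healthy vs. burnt skin
)
```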

2.4 Training Setting

We artificially enlarge the training dataset via a process called augmentation. The augmentation strategy includes random resize-crop, random aspect ratio, random rotation, horizontal flip, and center crop. This also helps avoid training on a very small dataset, which may lead to over-fitting. 80% of the dataset was allocated for training and the remaining 20% for validation.
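A sketch of this augmentation strategy and the 80/20 split using torchvision (an assumed implementation; the exact parameter values and data path are not reported in the paper and are placeholders here):

```python
from torchvision import datasets, transforms
from torch.utils.data import random_split

# Training-time augmentation mirroring the strategy listed above.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random resize-crop with random aspect ratio
    transforms.RandomRotation(degrees=30),  # random rotation (angle assumed)
    transforms.RandomHorizontalFlip(),      # horizontal flip with probability 0.5
    transforms.CenterCrop(224),             # center crop
    transforms.ToTensor(),
])

# The directory path is hypothetical; one sub-folder per class (burn / healthy).
dataset = datasets.ImageFolder("data/skin_images", transform=train_transforms)

# 80%/20% train/validation split, as described above.
n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
```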

We utilised a free-access Jupyter environment called Google Colaboratory (Colab), which allows developing and training deep learning algorithms using on-demand powerful computing resources. Colab gives free access to Python libraries for developing deep learning applications, run on hardware equipped with an NVIDIA Tesla K80 (12 GB RAM, 2496 CUDA cores @ 560 MHz).

3 Results and Discussion

Figure 5 presents the experimental framework, in which features extracted by the frozen layers of ResNet50 were used to train the two newly added dense layers and a classification layer.

Fig. 5 Illustration of our experimental framework
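For completeness, a sketch of the training step implied by this framework, continuing the earlier sketches (`model` is the fine-tuned ResNet50 from above; the optimiser, learning rate, epoch count and `train_loader` are our assumptions, as they are not reported in the paper):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

criterion = nn.CrossEntropyLoss()
# Optimise only the newly added head; frozen parameters are excluded.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

train_loader = DataLoader(train_set, batch_size=32)  # batch size assumed
num_epochs = 200                                     # hypothetical; training curves span a few hundred epochs

for epoch in range(num_epochs):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```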

We conducted a series of experiments using the Caucasian, African and combined (global) datasets to reveal the training approach whose diverse feature representations yield the most robust diagnostic model. Figure 6 shows the training process in which only the dataset from Caucasian patients is used to train the algorithm.

Fig. 6 Training process on the Caucasian dataset

Training and validation losses are depicted in Fig. 7. There is clearly little or no over-fitting: the trained model fits the training dataset well, and the performance on the validation set is good. Both training and validation losses settled rapidly from the early epochs, and this impressive performance was maintained up to the last epoch.

Fig. 7 Training and validation loss using the Caucasian dataset

Figure 8 depicts the training and validation accuracies. A maximum validation accuracy of 99.6% was achieved, and the performance stabilized even before epoch 200.

Fig. 8 Training and validation accuracy on the Caucasian dataset

On the other hand, Fig. 9 shows the training process using the dataset from African patients. The maximum validation accuracy achieved by the trained model is 96.4%.

Fig. 9 Training process using the African dataset

Figure 10 shows the training and validation losses, while Fig. 11 shows the training and validation accuracy using the dataset from African patients. We recorded an impressive result using the proposed approach, with slightly poorer generalization compared to the previous result using Caucasian images. We attribute the misclassifications to the poor resolution of some of the images and to poor or uncontrolled illumination during the acquisition process. This raises the concern that a machine learning algorithm trained on Caucasian datasets can be biased when tested on African skin, which is one of the major limitations of previous studies. We believe such research is not satisfactory without accounting for ethnic or racial representation during the training process: more than 90% of burn-related injuries occur in low/middle-income countries such as those in Africa and Asia [31]. Therefore, we further explore whether a model trained on a specific ethnic dataset can provide good identification accuracy on other racial datasets.

Fig. 10 Training and validation loss on the African dataset

Fig. 11 Training and validation accuracy using the African dataset

A lack of racial or ethnic diversity during the training of a machine learning algorithm tends to produce an unrealistic model, as shown in Fig. 12, which depicts the training process using the Caucasian dataset with validation on the African dataset; Fig. 13 shows the corresponding training and validation accuracy, and Fig. 14 the corresponding training and validation loss. The validation accuracy on data from African patients is clearly not impressive; the model appears to have overfit. A similar finding is obtained when the model is trained on images from African patients: the validation accuracy on Caucasian data shows poor generalization, with the model tending to be biased, as depicted in Figs. 15, 16 and 17.

Fig. 12 Training process using the Caucasian dataset, validated using the African dataset

Fig. 13 Training and validation accuracy of a model trained on the Caucasian dataset and validated on the African dataset

Fig. 14 Training and validation loss of a model trained on the Caucasian dataset and validated using the African dataset

Fig. 15 Training process using the African dataset, validated using the Caucasian dataset

Fig. 16 Training and validation accuracy of a model trained on the African dataset and validated using the Caucasian dataset

Fig. 17 Training and validation loss of a model trained on the African dataset and validated using the Caucasian dataset

The results presented in Figs. 13 and 16 clearly indicate that ConvNets are biased in such a way that they recognise the particular racial data they were trained on. This phenomenon of recognizing one racial group while failing to recognise another is termed the 'other-race effect' [32]. Taking Figs. 24 and 27 into consideration, this tells us that racial diversity in the training datasets tends to produce a robust system that can be deployed and utilized effectively while diminishing bias.

We have seen so far how good ConvNets are at discriminating burns in both Caucasian and African skin types. We have also seen how poorly a ConvNet trained on data from a single racial group performs on the other. At this point, the datasets from Caucasians and Africans were put together, forming a new dataset which we simply designate the global dataset. Figures 18, 19 and 20 show the results of a model trained on the global dataset, with a validation set that likewise contains global data. The result is quite impressive, with a validation accuracy of up to 98%.
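Forming the global dataset amounts to concatenating the two ethnicity-specific datasets, e.g. (a sketch under our PyTorch assumption; `caucasian_set` and `african_set` are hypothetical dataset objects built as in the earlier split sketch):

```python
from torch.utils.data import ConcatDataset

# Combine the two ethnicity-specific datasets into the 'global' dataset.
global_set = ConcatDataset([caucasian_set, african_set])
```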

Fig. 18 Training process on the global dataset

Fig. 19 Training and validation loss using the global dataset

Fig. 20 Training and validation accuracy using the global dataset

We further introduced a reshuffling operation during training on the global dataset in order to diminish variance and to ensure the model remains robust and overfits less. Figures 21, 22 and 23 show the training process, the training and validation losses, and the training and validation accuracies respectively. The performance reaches up to 99% (see Fig. 23), outperforming the previous outcome depicted in Fig. 20.
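Under our PyTorch assumption, reshuffling corresponds to enabling shuffling in the data loader, which re-draws the sample order at every epoch (the batch size here is an assumption):

```python
from torch.utils.data import DataLoader

# shuffle=True re-draws the sample order at each epoch, so batches mix
# Caucasian and African images rather than following a fixed ordering.
train_loader = DataLoader(global_set, batch_size=32, shuffle=True)  # batch size assumed
```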

Fig. 21 Training process using the reshuffled global dataset

Fig. 22 Training and validation loss using the reshuffled global dataset

Fig. 23 Training and validation accuracy using the reshuffled global dataset

Figures 24, 25 and 26 show results in which the global dataset is used for training while the validation set contains only Caucasian data. This shows that hard-encoded data representations, or embedded features from diverse ethnicities, provide effective identification of burns regardless of the racial dataset used at the deployment phase. Similarly, Figs. 27, 28 and 29 show the same approach with training on the global dataset and validation on the African dataset only.

Fig. 24 Training process using the global dataset, validated using the Caucasian dataset

Fig. 25 Training accuracy using the global dataset and validation accuracy using the Caucasian dataset

Fig. 26 Training loss using the global dataset and validation loss using the Caucasian dataset

Fig. 27 Training process using the global dataset, validated using the African dataset

Fig. 28 Training accuracy using the global dataset and validation accuracy using the African dataset

Fig. 29 Training loss using the global dataset and validation loss using the African dataset

Table 1 below presents a summary of all the experiments conducted in this paper. The results show that a machine learning algorithm trained on a dataset containing different racial representations produces the most robust and effective diagnostic system. The abbreviations used in the table are: Caucasian dataset (Cauc), African dataset (Afri), global dataset (Glo), reshuffling (Re) and accuracy (Acc).

Table 1 Classification accuracy

4 Conclusion

This paper provides in-depth experiments and analysis of discriminating burns in different ethnic or racial identities, achieving a recognition accuracy of up to 99%. The study further provides a baseline for future investigation, specifically in healthcare, into how embedded racial feature representations in the training data yield a robust and flexible diagnostic tool that can be deployed anywhere.

The experimental results reveal that embedding diverse burn features in the training datasets makes a deep learning algorithm more robust and less biased when deployed to a new environment. A model that is well trained on Caucasian data performs poorly when tested on African data and vice-versa: the classification accuracy was only 87.5% when the model was trained on Caucasian data and tested on African data, and 83.4% when trained on African data and tested on Caucasian data. But when the model was trained on the global dataset, classification accuracies of 99.3% and 97.1% were achieved on the Caucasian and African validation data respectively.

Identification accuracy was evaluated using three databases of burn images (Caucasian, African and global). Prediction on the Caucasian database yielded the best result, achieving up to 99.5% accuracy. Conversely, a decrease in performance was observed when the algorithm was trained on the African database, though it still outperformed experienced dermatologists' evaluation, indicating impressive identification performance. Combining the two databases into a new (global) database ensures that future prediction bias is avoided, as image representations from both ethnic groups are represented; the result achieved state-of-the-art recognition accuracy.

However, the classification results in this paper include some misclassifications, as shown in the samples depicted in Fig. 30. These misclassifications are attributed to a number of factors, ranging from feature similarity between full-thickness burns and some normal (healthy) skin in the Caucasian dataset, to poor image resolution and bad illumination during image acquisition in the African dataset. Full-thickness burn features such as whitish and waxy appearance were the major challenge observed; hence, such burns were misclassified as healthy skin. Additionally, unpeeled burnt skin in the African dataset was misclassified as healthy skin, as shown in Fig. 30.

Fig. 30 Sample of misclassified images

One obvious limitation of our proposed pipeline is the lack of a comparative analysis of our results against prior work; this is due to our inability to gain access to the datasets used in the literature, as all efforts to obtain the data used in existing works have been futile. Our work can therefore serve as a baseline for future investigation, as we intend to make our data publicly available in due course. Moreover, extensive data processing is needed to further diminish the misclassification of burns as healthy skin, particularly deep (full-thickness) burns, and thereby address the challenge of underestimation.

Another limitation of our study is the lack of experiments to discriminate burn depth, which is critical to deciding whether an injury is severe enough for surgical intervention; this will be treated in future work.