Introduction

More than 50% of households in the United States have one or more dogs (Wise et al. 2005). Although dog ownership can provide mental health and emotional benefits (Takashima and Day 2014), an intimate relationship with dogs exposes people to potential sources of zoonotic pathogens responsible for more than 70 human diseases (Mani and Maguire 2009; Stull et al. 2013; Chomel 2014). These infections can be troublesome and even life-threatening in very young children, pregnant women, the elderly, and the immunocompromised (Albrecht Suzanne 2011). Therefore, dog contact has been recognized as a risk factor for many zoonotic infections by bacteria, fungi, parasites, and viruses through direct or indirect contact (Chomel 1992).

Dermatophytosis is a common dog-associated condition that is caused by a variety of skin fungi that can infect both humans and animals (Allizond et al. 2016). Microsporum canis, Microsporum gypseum, and Trichophyton mentagrophytes are the main etiologic dermatophytes in dogs and cause clinical signs such as hair loss, scaling, and crusting (Bond et al. 2010). However, non-dermatophytic fungi such as Alternaria spp., Scopulariopsis spp., Penicillium spp., Rhizopus spp., and Fusarium spp. have also been reported to have dermatophytic potential and are increasingly regarded as disease-causing agents in animals and humans (Aho 1983; Magdy Mohamed Khalil and Ahmed Yahya 1991; Seyedmousavi et al. 2015). Common household dogs serve as the main reservoir of methicillin-resistant Staphylococcus aureus (MRSA), which can be transmitted from humans to animals and vice versa (Albrecht Suzanne 2011; Ghasemzadeh and Namazi 2015). MRSA causes skin infections that begin as a red, raised lesion resembling an infected pimple, boil, or insect bite (Albrecht Suzanne 2011), and can progress to more serious conditions in humans, such as endocarditis, osteomyelitis, pneumonia, and sepsis (Albrecht Suzanne 2011). Dog owners have also been reported to be at higher risk than non-dog owners for colonization by β-lactamase-producing Escherichia coli (Meyer et al. 2012).

To diagnose dog-associated infections correctly, sampling of the wound, biopsy, or microscopic examination is required. In addition, the physician must maintain an index of suspicion for various zoonotic infections and routinely ask patients whether dogs are present in the household and about the animals' health. Because these processes require time and a high-level skill set, we employed a convolutional neural network (CNN)-based algorithm developed using infected skin images taken with a multispectral imaging device (Kim et al. 2016) to reduce the time and effort required for a consistent infection diagnosis. CNN-based models have been successfully applied to the detection and diagnosis of diseases in the field of medical imaging, including computed tomography and magnetic resonance imaging, and have achieved highly promising results (Yadav and Jadhav 2019).

In this study, 95 images of diseased or non-diseased dog skin were collected using a multispectral imaging device, and data augmentation was applied to increase the data size by a factor of 1000 through resizing, rotation, and translation of the original images. From these data, three independent binary classification models were developed to discriminate three skin diseases (bacterial dermatosis, fungal infection, and hypersensitivity allergic dermatosis) from non-diseased skin. We chose ResNet (He et al. 2015), InceptionNet (Szegedy et al. 2015), DenseNet (Huang et al. 2016), and MobileNet (Howard et al. 2019) to create models for the diagnosis of dog skin disease. The multispectral imaging device can capture nine images at once at different wavelengths, i.e., ~ 440, 460, 480, 515, 550, 585, 620, 655, and 690 nm. Models developed using either the normal images or the multispectral images alone were biased toward a certain label. To enhance performance by merging the strengths of each image type, consensus models were built (see the CNN model training section in “Methods”); these achieved accuracies of 0.89, 0.82, and 0.87 on the validation data set for bacterial dermatosis, fungal infection, and hypersensitivity allergic dermatosis, respectively.

Methods

Data acquisition

Skin images of 95 pet dogs (23 with bacterial dermatosis, 19 with fungal infections, 23 with hypersensitivity allergic dermatosis, and 30 with no disease) were collected at a nonclinical CRO institution (KNOTUS Co. Ltd, Incheon, Korea). The protocol was reviewed and approved by the Institutional Animal Care and Use Committee at KNOTUS Co. Ltd (approval number: KNOTUS IACUC 20-KE-723). The skin images of pet dogs were taken after obtaining written consent from the dog owners. Normal images were taken using the camera of a mobile device (Fig. 1a), and multispectral images were obtained using a mobile device equipped with a multispectral imaging device (Kim et al. 2016) (Fig. 1b). After a veterinarian examined each dog, an image was taken with the suspected lesion at the center of the frame, under the supervision of the dog owner. All data are available in the Mendeley repository: Hwang, Sungbo; Shin, Hyun Kil; Park, Jin Moon; Kwon, Bosun; Kang, Myung-Gyun (2022), “Classification of pet dog skin diseases using deep learning with images captured from multispectral imaging device”, Mendeley Data, V1, https://doi.org/10.17632/5dbht54kw7.1.

Fig. 1
figure 1

Images used for training and testing of our model: a without and b with multispectral imaging device

Data augmentation

Three images randomly selected from each group were used as the validation data set (Fig. 2). The size of an image from the mobile device was (1920, 1080), which was too large to train on. Thus, all skin images were cropped and resized to (128, 128) using TensorFlow 2.6.0 (Abadi et al. 2016). Each image was augmented by a factor of 1000 through rotation and translation. The rotation angles ranged from − 180° to 180°, the range of translation was set to ± 10 pixels, and both the rotation angle and the translation length were drawn from a uniform distribution using NumPy 1.19.5 (Harris et al. 2020). Since the total number of images per disease was imbalanced, images for each disease were randomly subsampled to match the size of the fungal infection image data, the smallest of the three data sets, to balance the labels in the training data.
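The parameter sampling described above can be sketched as follows. This is a minimal illustration with NumPy only, assuming the stated ranges (rotation uniform in ± 180°, translation uniform in ± 10 pixels); the helper names are hypothetical, not the authors' code, and the rotation itself would be applied with an image library such as TensorFlow's image ops, which is omitted here.

```python
import numpy as np

def sample_augmentation_params(rng):
    """Draw one rotation angle and one translation offset,
    both uniform, matching the ranges stated in the text."""
    angle = rng.uniform(-180.0, 180.0)       # degrees
    dx, dy = rng.integers(-10, 11, size=2)   # pixels, inclusive bounds
    return angle, int(dx), int(dy)

def translate(img, dx, dy):
    """Shift an (H, W, C) image by (dy, dx) pixels,
    zero-filling the exposed border (illustrative helper)."""
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    dst_rows = slice(max(dy, 0), min(h + dy, h))
    dst_cols = slice(max(dx, 0), min(w + dx, w))
    src_rows = slice(max(-dy, 0), min(h - dy, h))
    src_cols = slice(max(-dx, 0), min(w - dx, w))
    out[dst_rows, dst_cols] = img[src_rows, src_cols]
    return out

rng = np.random.default_rng(0)
img = np.ones((128, 128, 3))
angle, dx, dy = sample_augmentation_params(rng)
shifted = translate(img, dx, dy)
```

Repeating this draw 1000 times per original image, each time with a fresh angle and offset, would yield the thousandfold augmentation the text describes.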

Fig. 2
figure 2

Flowchart showing data sorting and pre-processing in data set of bacterial dermatosis

Data set labelling

Three disease data sets were prepared for binary classification. The label for each image was determined by the veterinarians who examined the dogs. The labels in the binary-class data sets were bacterial or non-bacterial dermatosis, fungal or non-fungal infection, and hypersensitivity or non-hypersensitivity allergic dermatosis. All data sets were balanced between the two labels. To prevent overfitting, each data set was split into training and test sets at ratios of 0.85 and 0.05, respectively. The flow of data set generation for bacterial dermatosis is described in Fig. 2.
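The stated split ratios can be made concrete with a simple shuffled index split. The helper below is hypothetical, not the authors' tooling; the leftover fraction is returned only to show where the remaining indices go (the paper's validation images are selected separately, before augmentation).

```python
import random

def split_indices(n, train_frac=0.85, test_frac=0.05, seed=0):
    """Shuffle n sample indices and split them at the stated ratios;
    whatever remains after the train and test cuts is held out."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)
    n_test = int(n * test_frac)
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    rest = idx[n_train + n_test:]
    return train, test, rest
```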

CNN model training

The data were used to train CNN models: InceptionNet, ResNet, DenseNet, and MobileNet, as the ResNet50, InceptionV3, DenseNet121, and MobileNetV3Small architectures implemented in TensorFlow 2.6.0 (Abadi et al. 2016). To extract features from each filtered image with a limited number of model parameters, the input node size was set to (128, 128, 27), covering the RGB channels of all nine images. Since the image size in our data set differs from that of ImageNet, model parameters were randomly initialized without transfer learning. Categorical cross-entropy was used as the loss function. The maximum number of epochs was set to 150, with early stopping triggered when the test-set loss did not decrease for 20 consecutive epochs. Accuracy and area under the curve (AUC) are commonly used metrics of model performance. Accuracy is the ratio of correctly predicted images to all images; AUC measures how well the model ranks positive examples above negative ones across classification thresholds. When a model is trained on an unbalanced data set, its predictions can be biased toward the label with more data, because training concentrates on the majority label and the model fails to learn the trend of the minority label. The Matthews correlation coefficient (MCC) accounts for true positives, true negatives, false positives, and false negatives together, and therefore measures prediction accuracy balanced between sensitivity and specificity even on an unbalanced data set. All metrics were calculated on the test and validation sets. The best model was selected based on the AUC and MCC of the validation set. Since the validation set was not used in the learning process, metrics measured on it best represent genuine model performance in the actual diagnosis of dog skin disease.
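The case for MCC over plain accuracy on unbalanced data can be shown with a small worked example. This is an illustrative sketch, not the paper's evaluation code: a classifier that mostly predicts the majority (negative) label reaches high accuracy while its MCC stays low, exposing the bias.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when any marginal sum is zero (degenerate case)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Mostly-negative data: 8 positives, 92 negatives. The classifier finds
# only 2 of the 8 positives, yet overall accuracy still looks strong.
tp, tn, fp, fn = 2, 90, 2, 6
print(round(accuracy(tp, tn, fp, fn), 2))  # 0.92
print(round(mcc(tp, tn, fp, fn), 2))       # 0.32
```

The gap between 0.92 and 0.32 is exactly the kind of imbalance the text describes, which is why model selection here leans on MCC rather than accuracy alone.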

To build a more balanced and accurate model, we developed a consensus model from the best single models for each disease. Two consensus models were constructed per disease, using two criteria. Under Criterion 1, a prediction counted only if the two models agreed: an image was predicted as positive (target disease) or negative (non-target disease) when both models gave the same prediction; otherwise the prediction was treated as uncertain, neither positive nor negative. Under Criterion 2, an image was predicted as diseased when at least one model was positive, and as non-diseased only when both models predicted it as negative. Of the two consensus models, the more balanced one was selected based on the MCC value.
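The two criteria can be sketched directly; the label strings and function names below are illustrative, not the authors' implementation.

```python
def consensus_criterion1(pred_a, pred_b):
    """Criterion 1: keep a label only when both models agree;
    otherwise the image is left 'uncertain'."""
    return pred_a if pred_a == pred_b else "uncertain"

def consensus_criterion2(pred_a, pred_b):
    """Criterion 2: call 'disease' if at least one model is positive;
    'non-disease' only when both models are negative."""
    return "disease" if "disease" in (pred_a, pred_b) else "non-disease"
```

Criterion 1 trades coverage for agreement (some images stay unlabeled), while Criterion 2 always commits to a label but favors the positive call.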

Results

Bacterial dermatosis

The bacterial dermatosis classification models were developed using four CNN architectures on the normal and multispectral image data sets (Fig. 3 and Table S1).

Fig. 3
figure 3

Accuracy, Area under curve (AUC), and Matthews correlation coefficient (MCC) of bacterial dermatosis binary classification models using a normal image of test data set, b normal image of validation data set, c multispectral images of test data set, and d multispectral images of validation data set

Normal image-based CNN models

In testing, the ResNet and DenseNet models presented identical performance: 0.99 in both accuracy and MCC, and 1.00 in AUC (Fig. 3a). The InceptionNet model achieved 0.96 in accuracy, 0.99 in AUC, and 0.92 in MCC. The MobileNet model attained an accuracy of 0.90, an AUC of 0.98, and an MCC of 0.80, the lowest MCC among the four architectures. In validation, the DenseNet model achieved the highest accuracy, AUC, and MCC of the four architectures (Fig. 3b); the MobileNet model scored the second-highest MCC. For the DenseNet model, sensitivity and specificity were 1.00 and 0.73, respectively (Table S2e), supporting that the prediction model was not biased toward the non-bacterial dermatosis group. Therefore, the best-performing model on the normal image data set for bacterial dermatosis was the DenseNet model.

Multispectral image-based models

In testing, the InceptionNet and MobileNet models showed identical performance, with accuracy, AUC, and MCC values of 1.00 (Fig. 3c). The ResNet model showed slightly lower accuracy, AUC, and MCC, although its accuracy remained comparable. In validation, the highest MCC was attained by the InceptionNet model (Fig. 3d), followed by the DenseNet and then the ResNet model; the MobileNet model attained the lowest AUC and MCC. Overall, every model's MCC dropped in validation; the InceptionNet model's MCC (0.31) was the highest but remained low compared to its other metrics. By MCC, the most balanced architecture was InceptionNet, which achieved a sensitivity of 0.59 and a specificity of 0.74. The other three models showed relatively higher specificities but lower sensitivities, implying that they could be biased toward the non-bacterial dermatosis group (Table S3e). Therefore, for multispectral images, the best-performing model for bacterial dermatosis was produced by InceptionNet.

Consensus model using the best models for each image data set

To obtain a more accurate model, a consensus model was built from the DenseNet model on the normal image data set and the InceptionNet model on the multispectral image data set, the best models for each data set. For the Criterion 1 model, accuracy, MCC, sensitivity, and specificity were 0.89, 0.76, 1.00, and 0.86 (Tables S4a and S4c); for the Criterion 2 model, they were 0.68, 0.50, 1.00, and 0.57 (Tables S4b and S4c). The Criterion 1 consensus model was selected as the best predictive model of bacterial dermatosis for its better balance and metrics (Fig. 4). This consensus model also showed better prediction accuracy in all metrics than either single model.

Fig. 4
figure 4

Accuracy and Matthews correlation coefficient (MCC) of bacterial dermatosis binary classification models using the consensus method. Criteria 1 and 2 are described in the last paragraph of the Methods section. The detailed metric values are given in Table S1

Fungal infection

The prediction results for fungal infections are summarized in Fig. 5 and Table S5.

Fig. 5
figure 5

Accuracy, Area under curve (AUC), and Matthews correlation coefficient (MCC) of fungal infection binary classification models using a normal image of test data set, b normal image of validation data set, c multispectral images of test data set, and d multispectral images of validation data set

Normal image-based models

In testing, the accuracy, AUC, and MCC values of all architectures except MobileNet were 1.00; the MobileNet model achieved 0.98, 1.00, and 0.95, respectively (Fig. 5a). In validation, the ResNet model achieved the highest accuracy and MCC, and the DenseNet model the highest AUC (Fig. 5b). To choose the best model between ResNet and DenseNet, sensitivity and specificity were calculated for both (Table S6e): 0.48 and 0.81 for the ResNet model versus 0.67 and 0.65 for the DenseNet model. Since, based on the MCC values, the DenseNet model was considered more biased than the ResNet model, the best model for fungal infection on the normal image data set was determined to be ResNet.

Multispectral image-based models

The accuracy, AUC, and MCC values of both the ResNet and DenseNet models were 1.00 in testing (Fig. 5c); the InceptionNet and MobileNet models achieved comparable but slightly lower values on the same data set. In validation, the ResNet model was the best and the MobileNet model the second best by accuracy and AUC (Fig. 5d). When sensitivity and specificity were examined (Table S7e), the ResNet model achieved 0.72 and 0.69, whereas the MobileNet model attained 1.00 and 0.20, respectively. Thus, the ResNet model was concluded to offer better predictions for fungal infections on multispectral images.

Consensus model using the best models for each image data set

The consensus model was constructed from two different ResNet models: the best model on the normal image data set and the best model on the multispectral image data set. For the Criterion 1 consensus model, accuracy, MCC, sensitivity, and specificity were 0.82, 0.49, 0.68, and 0.85 (Tables S8a and S8c); for the Criterion 2 model, they were 0.67, 0.38, 0.83, and 0.61 (Tables S8b and S8c). For fungal infection, the Criterion 1 consensus model was selected as the best, because it was not only more balanced but also higher in accuracy and MCC than the Criterion 2 model (Fig. 6). Although the sensitivity of the consensus model was lower than that of the ResNet model on the multispectral image data set, its other metrics were higher than those of the single models.

Fig. 6
figure 6

Accuracy and Matthews correlation coefficient (MCC) of fungal infection binary classification models using the consensus method. Criteria 1 and 2 are described in the last paragraph of the Methods section. The detailed metric values are given in Table S5

Hypersensitivity allergic dermatosis

The performance of the classification models for hypersensitivity allergic dermatosis is described in Fig. 7 and Table S9.

Fig. 7
figure 7

Accuracy, Area under curve (AUC), and Matthews correlation coefficient (MCC) of hypersensitivity allergic dermatosis binary classification models using a normal image of test data set, b normal image of validation data set, c multispectral images of test data set, and d multispectral images of validation data set

Normal image-based models

The ResNet and MobileNet models, tested on the test data set, showed accuracy, AUC, and MCC of 1.00 (Fig. 7a). The DenseNet model achieved better accuracy (0.97), AUC (1.00), and MCC (0.94) than the InceptionNet model (0.69, 0.88, and 0.46), but fell short of the ResNet and MobileNet models. In validation, the ResNet model produced the best performance among the models, with an accuracy of 0.81, an AUC of 0.62, and an MCC of 0.41 (Fig. 7b); the InceptionNet model was the second best, with an accuracy of 0.79, an AUC of 0.47, and an MCC of 0.36. Sensitivity and specificity of the ResNet model were 0.33 and 0.96, whereas the MobileNet model showed a sensitivity of 0.00 and a specificity of 0.65 (Table S10e). Although the ResNet model was more balanced than the MobileNet model, the MCC values indicated that both were biased toward the non-hypersensitivity allergic dermatosis group.

Multispectral image-based models

All CNN models recorded an accuracy and AUC of 1.00 in testing (Fig. 7c); however, their prediction accuracies on the validation data set varied widely (Fig. 7d). Among them, the MobileNet model achieved the highest accuracy (0.82) and AUC (0.94); the ResNet model performed lower on both. When sensitivity and specificity were computed to assess model balance, the MobileNet model achieved 0.38 and 0.97, whereas the ResNet model gave 0.28 and 0.98 (Table S11e). Considering the validation results and the bias toward disease or non-disease, the MobileNet model was determined to be the best model for hypersensitivity allergic dermatosis.

Consensus model using the best models for each image data set

The consensus model was built from the ResNet model on the normal image data set and the MobileNet model on the multispectral image data set. For the Criterion 1 model, accuracy, MCC, sensitivity, and specificity were 0.89, 0.32, 0.12, and 1.00 (Tables S12a and S12c); for the Criterion 2 model, they were 0.87, 0.63, 0.66, and 0.93 (Tables S12b and S12c). The Criterion 2 model therefore achieved better performance than the single models and the Criterion 1 model. Moreover, the Criterion 2 consensus model for hypersensitivity allergic dermatosis proved more balanced than the two consensus models previously selected for bacterial dermatosis and fungal infection (Fig. 8).

Fig. 8
figure 8

Accuracy and Matthews correlation coefficient (MCC) of hypersensitivity allergic dermatosis binary classification models using the consensus method. Criteria 1 and 2 are described in the last paragraph of the Methods section. The detailed metric values are given in Table S9

Discussion

The data used in this study consisted of healthy or diseased skin images from 16 dog breeds (Fig. S1). Dogs from 15 breeds were used for model training, whereas dogs from 5 breeds were used for validation. The golden retriever was included only in the validation set, to test the models on an unseen breed. The most common breed was the Maltese, with 27 dogs (Figs. 6, 7, 8).

The color and thickness of fur differed among breeds; however, skin redness did not (Fig. S2). Since skin redness was similarly observed in all skin diseases across all breeds, we considered breed differences negligible for the development of CNN models. Images of bacterial dermatosis, fungal infection, hypersensitivity allergic dermatosis, and healthy skin in the Maltese (Fig. S3) showed that skin condition could be visually distinguished without difficulty: diseased skin showed varying degrees of redness depending on disease type, whereas healthy skin did not appear red. Since differences in redness patterns by disease type were observable, the CNN models were expected to identify these diseases from the images.

The best CNN models developed here were chosen based on accuracy, AUC, and MCC in testing, as well as sensitivity and specificity in validation. On the test data set, all CNN models showed remarkable accuracy, AUC, and MCC; however, these results were not helpful for determining the best model, because the metrics were all close to 1.0. Therefore, we used the metrics obtained on the validation data set, which differed substantially between models. Among the three metrics, higher accuracy in one of the two classes, disease or non-disease, was found to contribute to higher AUC values; however, the confusion matrices of CNN models with such higher accuracies showed significantly biased results. To select a more balanced model, the MCC was adopted, and the model with the highest MCC was chosen as the best model and used to develop the consensus model.

In our study, the accuracies of most single models for bacterial dermatosis and fungal infection were below 0.8, whereas those of the consensus models exceeded 0.80. The multispectral imaging device characterizes image features appearing at specific wavelengths better than general cameras (Kim et al. 2016), and multispectral images have previously been used to build a model discriminating seborrheic dermatitis from psoriasis regions (Kim et al. 2019). Since skin redness, as observed by humans, could be examined only at the three wavelengths of ~ 620, ~ 655, and ~ 690 nm in the multispectral imaging device, the multispectral images were expected to extract features not available in normal images. We expected that the normal image could be used to locate areas suspected of skin-disease lesions, while the multispectral image could additionally help confirm skin redness in those areas. Because the consensus model combined these two types of information, it performed better than the normal image- or multispectral image-based models alone.
Since the multispectral images were obtained through filters on the device, it might seem possible to recover a normal image by simply concatenating RGB channels of the multispectral images, which would undermine the advantage of the consensus model. We therefore examined whether a normal image could be reconstructed in this way. If the device captured single, specific wavelengths, a normal image could be recovered by combining the RGB channels of the image from each wavelength channel-by-channel. In reality, however, the device captures a narrow range of wavelengths, so each RGB channel of a multispectral image contains data from a range of wavelengths, and it was not possible to reconstruct a normal image by simply concatenating the channels. This means that the multispectral image of each specific wavelength region carries additional significant information, which explains why the consensus model for each skin disease outperformed the normal image- or multispectral image-based model alone.

To the best of our knowledge, no machine learning- or deep learning-based prediction models for classifying dog skin diseases were previously available, although various machine learning models using images of human skin disease have been developed (Han et al. 2018; Liu et al. 2020; Thomsen et al. 2020; Srinivasu et al. 2021). A critical difference between images of human skin and dog skin is the presence of fur, which restricts visibility of the diseased skin; the absence of hair has made it easier to develop highly accurate models for human skin disease. Notably, fur was visible in the four images taken at central wavelengths of 480–580 nm among the nine images per dog (Fig. 1b). The presence of fur lowered the performance of our models compared to models developed for human skin disease.

The multispectral device allows users to easily take skin images under identical conditions, which may contribute to massive collection of dog skin images. In addition, the models can be used with small computational resources and short calculation times, since the selected CNN architectures are not complex. For these reasons, we expect that the models could be easily updated as users contribute additional images.

Conclusions

Classification models to discriminate bacterial dermatosis, fungal infection, and hypersensitivity allergic dermatosis from non-diseased skin were developed using four CNN architectures. Based on prediction performance on the validation data sets, the best normal image-based models for bacterial dermatosis, fungal infection, and hypersensitivity allergic dermatosis were generated from the DenseNet, ResNet, and ResNet architectures, respectively; the best multispectral image-based models came from InceptionNet, ResNet, and MobileNet. However, both the normal image- and multispectral image-based models for all diseases turned out to be somewhat biased. Features from normal images contributed more to delineating skin-disease lesions, whereas features from multispectral images were more significant in determining whether a suspected lesion was diseased. To overcome the skewness of each single model, we developed consensus models for each skin disease to maximize the strengths of the features from each model. The consensus models achieved more balanced accuracy on images from the disease and non-disease groups than the single models.