Introduction

Artificial intelligence and machine learning have made impressive progress in recent years, particularly in the realm of image analysis. In healthcare, many specialties are image-centric in their data, with dermatology being a prime example. Other specialties that apply the power of deep neural networks to images include radiology, cardiology, and ophthalmology, among others. Some non-image datasets have even been successfully recast as images to take advantage of convolutional networks, for example by treating the time series from 12-lead electrocardiograms as if they were images.

In modern-day (2022) dermatology, the intersection with artificial intelligence appears in several forms. In this chapter we describe the dermatology AI landscape, providing an overview of the types of questions commonly asked and the data and processing needed to attempt to answer those questions.

While we warmly embrace the progress that AI in dermatology has made, in our opinion the most helpful frameworks going forward will likely fall under the category of “augmented intelligence”, wherein humans and computers work synergistically to improve care delivery [1]. We also direct the reader to the chapter “From ‘Human versus Machine’ to ‘Human with Machine’” for a discussion of AI-assisted decision making.

Please note that fully explaining many of the best practices and pitfalls identified in this chapter is beyond its scope. Rather, they are intended to point the interested reader in the right direction.

Brief Review

Recent advances of AI in dermatology have primarily depended on leveraging so-called deep neural networks (DNNs). This style of learning uses neural network-based computational models consisting of multiple processing layers. Traditional artificial neural networks (ANNs) typically comprise a limited number of layers built from linear combinations of “nodes”. A node is similar to a linear regression model embedded inside a non-linear activation function. The weights, both internal to the nodes and in the combinations of nodes, are optimized from the data, and the network is trained to obtain supervised representations optimized for a specific task.
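As a concrete illustration of this idea, the minimal sketch below implements a single node as a linear model wrapped in a sigmoid activation; the use of NumPy and the specific input and weight values are our own illustrative choices, not drawn from any particular study.

```python
import numpy as np

def sigmoid(z):
    """Non-linear activation squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def node(x, w, b):
    """A single network node: a linear model (w.x + b) inside an activation."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative values only: a 3-feature input and arbitrary weights.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2

print(node(x, w, b))  # a value in (0, 1), analogous to logistic regression output
```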

Deep neural networks have more complex architectures with a higher number of layers and connections, thus allowing them to learn data representations with multiple levels of abstraction. DNNs are usually trained in an end-to-end manner using backpropagation. In AI-based dermatology studies, the most common architecture employed is a special variety known as a convolutional neural network (CNN).

Convolutional neural networks are the primary AI tools in dermatology as of 2022.

CNNs are inspired by the visual cortex and leverage a convolution operator (a sliding combination of element-wise multiplication and summation) followed by feature pooling (e.g., averaging) to learn translation-invariant representations. They achieve superior performance due to their capacity to learn and extract deep, hierarchical features from skin image datasets. Current CNN architectures typically consist of multiple convolutional and pooling layers stacked together to model the input data space, where the output of one layer serves as the input to the next. Many state-of-the-art architectures used in dermatology were originally developed by technology companies; examples include ResNet, DenseNet121, EfficientNet, and GoogLeNet.
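To make the stacked convolution-and-pooling pattern concrete, here is a deliberately tiny PyTorch sketch; the layer sizes, the 224×224 input, and the three-class output are illustrative assumptions rather than any published dermatology architecture.

```python
import torch
import torch.nn as nn

class TinySkinCNN(nn.Module):
    """Two convolution + pooling stages followed by a classification head."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learnable filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # feature pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x):
        x = self.features(x)  # the output of one layer feeds the next
        return self.classifier(x.flatten(1))

# One fake RGB photograph at 224x224 resolution.
logits = TinySkinCNN()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 3])
```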

Best Practice 15.1

Always start with a known state-of-the-art network architecture rather than designing your own. EfficientNet is often a (relatively) fast way to see if your dataset has an extractable signal.
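As a sketch of this practice (assuming torchvision ≥ 0.13 for the weights API, and, purely illustratively, a two-class lesion task), one can load a pretrained EfficientNet and swap in a new classification head:

```python
import torch.nn as nn
from torchvision import models

# Start from a known architecture with pretrained ImageNet weights.
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)

# Replace the final linear layer with one sized for our task
# (2 classes, e.g., malignant vs. benign -- an illustrative choice).
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 2)
```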

The most prevalent applications of AI in dermatology have been via traditional supervised learning, wherein a DNN is trained to learn the relationship between input data and known corresponding target labels. Examples of supervised learning include CNNs trained for skin cancer diagnosis, risk stratification of skin cancer (indolent vs aggressive), and general lesion identification.

DNNs require significant amounts of training data to perform well, and current dermatology datasets, particularly pathology datasets, are of limited size relative to the massive troves of internet photographs (e.g., ImageNet) used by major technology firms to train models. An important improvement, therefore, to the traditional training of DNNs for use in dermatology is transfer learning. In transfer learning, instead of starting from scratch, one begins with a network that is known to perform well on a similar problem. Transfer learning dramatically reduces the amount of training data required and is particularly useful when examples with known outcomes are challenging to acquire. Unsurprisingly, many of the published dermatology deep learning studies to date employ transfer learning to train their DNNs.
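A common transfer-learning recipe, sketched below under the illustrative assumptions of a pretrained ResNet-50 backbone and a small two-class dataset, is to freeze the pretrained weights and train only a new head:

```python
import torch.nn as nn
from torchvision import models

# Begin with a network known to perform well on a similar (image) problem.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze every pretrained weight so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# A fresh, trainable head for a two-class problem (illustrative).
model.fc = nn.Linear(model.fc.in_features, 2)
```

With more data, one can instead unfreeze some or all of the backbone and fine-tune it at a reduced learning rate.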

Best Practice 15.2

Use transfer learning any time you can find a dataset similar enough to yours.

Pitfall 15.1

When using transfer learning, if you cannot find a dataset naturally similar to yours and must start with weights from, e.g., ImageNet, always try training from scratch as well. In pathology in particular, it is sometimes better to just start over.

For a more complete description of these topics specific to dermatology, please see Murphree et al. [2] and Puri et al. [3]. For more general descriptions please consult [4] or [5].

Current Applications

Broadly speaking, applications of computer vision in dermatology can be categorized either by input data or by problem type.

Categorizing applications by input type. Input data is usually one of the following: clinical photographs, dermoscopic photographs, or digitized pathology slides (also known as whole slide images, or WSI). Dermoscopic photos are captured via special instruments known as dermatoscopes. These instruments are used by dermatologists to reduce reflections from the skin and provide a more uniform but very distinctive-looking image; dermoscopic images are immediately distinguishable from standard photographs by their unique circular appearance. Clinical photographs obtained by providers are often different enough in quality from those captured by patients that the two can be considered different data types.

Best Practice 15.3

Be alert to data differences in photographs acquired by patients vs those acquired by medical photographers or informed providers. If you mix them, be certain to check that there is balance among outcomes by origin. Are photographs acquired in a dermatologist’s office more likely to contain cases than controls?
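A quick way to perform this check, sketched below with pandas on hypothetical origin and outcome columns, is a normalized cross-tabulation:

```python
import pandas as pd

# Hypothetical metadata: one row per photograph.
df = pd.DataFrame({
    "origin":  ["patient", "clinic", "clinic", "patient", "clinic"],
    "outcome": ["benign", "malignant", "malignant", "benign", "benign"],
})

# Row-normalized cross-tabulation: does outcome prevalence differ by origin?
print(pd.crosstab(df["origin"], df["outcome"], normalize="index"))
```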

Categorizing applications by problem type. Problem type is most typically supervised learning, where a label of interest is known for each observation, and can appear as classification or segmentation. For example, photographs of lesions may be labeled as malignant or benign. Similarly, regions of a pathology slide may be labeled (annotated) as epidermis, dermis, eccrine gland, etc. In the first case one might seek to classify the lesion in the photograph, while in the second one might seek to segment the slide into different regions of known tissue type. Often the two are combined, for example segmenting a slide to identify regions of tumor, then using those regions as training data for, e.g., a tumor risk classifier.

In addition to traditional supervised learning, in the pathology space in particular there is growing interest in weakly supervised learning, where expensive pixel-level annotations are replaced by slide-level labels. Slide-level labels can often be extracted automatically from electronic health records and thus do not require manual effort by pathology specialists.

Best Practice 15.4

Weakly supervised learning may be a promising future direction to alleviate the burden of acquiring costly and time-consuming pixel level annotations.

Here we discuss different current or recent applications divided by problem type, with a special section on dermatopathology.

Classification

In a classification problem, the goal is to learn a label (or set of labels) for an image in its entirety. For example, given a photograph of a skin lesion the classifier might distinguish between melanoma and benign nevus, and if melanoma then might further characterize it as aggressive or indolent.

Given the understandably pressing nature of the disease, the vast majority of AI applications in dermatology have focused on cancer, primarily cutaneous lesions. Comparatively less attention has been paid to other categories of skin disease such as inflammatory dermatoses (rashes). This may also be driven in part by the greater spatial uniformity of lesions in general. Here we will discuss only lesions.

Various AI-based approaches have been developed for the detection and diagnosis of skin cancer, ranging from conventional low-level pixel processing methods using handcrafted features to more recent CNN-based deep learning approaches. CNNs have achieved state-of-the-art performance in skin lesion analysis, along with performance superior to dermatologists in distinguishing between pigmented and non-pigmented skin lesions across multiple studies. For example, Esteva et al. [6] were the first to propose a CNN model to identify epidermal and melanocytic lesions, comparing its performance to that of 21 board-certified dermatologists on two specific tasks: distinguishing squamous cell carcinomas (SCC) from benign seborrheic keratoses (SK) and malignant melanomas from benign nevi. On a biopsy-proven test set of 135 epidermal images, 130 melanocytic non-dermoscopy images, and 111 melanocytic dermoscopy images, dermatologists were asked whether to biopsy, treat the lesion, or reassure the patient without biopsy. In parallel, the CNN was tasked with classifying the same lesions. The network outperformed the average performance of the dermatologists in each case. The authors concluded by graphical inspection that the CNN’s performance was comparable to that of the board-certified dermatologists; however, we note that no formal statistical test was applied. Concurrently, Han et al. (2019) [7] used a ResNet-based CNN to automatically classify 12 skin disorders, achieving a level of performance comparable to 16 dermatologists. The network identified coarse and irregular portions of the lesion as important features for malignant tumors, which was highlighted via gradient-based activation maps (Grad-CAM) generated from the CNN. The activation maps allowed interpretability of the CNN’s classification output. Another study, by Codella et al. [8], combined a CNN with hand-coded features and sparse coding, an approach that could potentially achieve higher accuracy than dermatologists in melanoma detection. Brinker et al. (2019) [9] proposed an enhanced CNN architecture for skin lesion classification using 12,378 images. They performed a thorough evaluation, comparing the classification performance of their CNN on 100 images to that of 157 dermatologists across 12 university hospitals in Germany. By some metrics, this system outperformed the average dermatologist.

Haenssle et al. (2018) [10] similarly sought to compare the performance of a CNN trained to recognize melanoma in dermoscopic images to that of 58 international dermatologists with varying levels of experience in dermoscopy (29% beginner, 19% skilled, and 52% expert by self-report). The dermatologists were asked to classify lesions in two stages: dermoscopy alone in the first stage, and dermoscopy with clinical images and additional clinical information in the second. While the stage II performance of both beginner and skilled dermatologists improved significantly relative to stage I, the CNN, which was trained on images only, still outperformed dermatologists of all experience levels in both stages. This study highlights the importance of including a large group of dermatologists with varying levels of experience, as well as of using open-source datasets and lesions from different anatomic sites and histologic types during CNN training. It also demonstrates the importance of integrating clinical information and clinical experience when comparing human performance to algorithmic performance.

Recently, Soenksen et al. (2021) [11] proposed a deep CNN that identifies early-stage melanoma from wide-field photographs of patients’ bodies captured using mobile phones, subsequently ranking suspicious pigmented lesions (SPLs) and flagging them for further examination. The AI tool achieved more than 90% sensitivity in distinguishing SPLs from non-suspicious lesions and achieved performance comparable to board-certified dermatologists, highlighting its efficacy as a triage tool.

While the above studies are highlighted for their CNN performance in comparison to human dermatologists, numerous other studies address lesion identification. Most focus on improved algorithmic performance, with some recent studies focusing on improving model robustness. For example, Han et al. (2018) [12] demonstrated that CNNs trained on images from Asian patients performed poorly on Caucasian patients and vice versa, highlighting the importance of training CNNs with skin lesions from a wide range of age groups and ethnicities. Gessert et al. (2020) [13] used an ensemble of deep learning models, including EfficientNets, SENet, and ResNeXt WSL, together with a search strategy for skin lesion classification. Maron et al. (2021) [14] proposed a benchmark out-of-distribution dataset for melanoma detection by adding artificial noise-based corruptions and image perturbations to lesion images, observing that while DenseNet121 [15] showed the best corruption robustness, AlexNet achieved better perturbation robustness. Sayed, Soliman, and Hassanien (2021) [16] tackled class imbalance in existing melanoma classification datasets from the ISIC challenges [17] with a random over-sampling method followed by data augmentation, achieving state-of-the-art accuracy using a simpler CNN architecture named SqueezeNet [18].
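As a hedged sketch of the general over-sampling-plus-augmentation recipe (not the specific pipeline of [16]), one might combine PyTorch’s WeightedRandomSampler with on-the-fly torchvision transforms; the labels and dataset below are hypothetical stand-ins.

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader
from torchvision import transforms

# Hypothetical integer labels (0 = benign, 1 = melanoma) for an imbalanced set.
labels = torch.tensor([0, 0, 0, 0, 1])

# Weight each sample inversely to its class frequency so minority-class
# examples are drawn more often (random over-sampling with replacement).
class_counts = torch.bincount(labels)
weights = 1.0 / class_counts[labels].float()
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

# Augmentations applied on the fly so repeated draws are not identical copies.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(20),
    transforms.ColorJitter(brightness=0.1),
])

# loader = DataLoader(my_dataset, batch_size=32, sampler=sampler)  # my_dataset is hypothetical
```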

Although the studies mentioned demonstrate the richness and variety of applications of CNNs to classifying cutaneous lesions, including some that appear to perform well relative to non-specialists or trainees, at this time (2022) there appears to be only a single prospective clinical trial [19] utilizing AI for skin disease. We look forward to seeing more such trials performed.

Segmentation

Segmentation is similar to classification, but rather than trying to predict a label for an image in its entirety it seeks to do so for each pixel in the image.

This problem is in many ways more challenging than classification. Nevertheless, deep learning has achieved promising success in skin lesion segmentation, in particular with melanoma. Lesion segmentation remains a challenging task for deep learning methods because of various complexities, including regions of interest (ROIs) of varying shapes and sizes, fuzzy boundaries, capture-dependent color variation, and the presence of hair. Due to these complexities, traditional “handcrafted” approaches, such as those based on thresholding, region-based active contour models, or clustering, tend to underperform. In contrast, CNN-based methods can automatically learn features that are maximally helpful, for example, for distinguishing lesions from normal skin. Most segmentation frameworks leverage an encoder-decoder network, wherein an “encoder” network consisting of convolution and pooling layers extracts features from the input image, which are then passed to a “decoder” network that performs a series of unpooling and deconvolution operations to construct the segmentation output. Goyal, Yap, and Hassanpour (2017) [20] used Fully Convolutional Networks (FCNs) to learn hierarchical features and derive multi-class segmentation maps for three distinct forms of skin lesions: benign nevi, melanoma, and seborrheic keratoses. Yuan and Lo (2019) [21] achieved the highest segmentation accuracy (Jaccard (JAC) index of 76.5%) in the International Skin Imaging Collaboration’s (ISIC) 2017 challenge [17] by using a 19-layer convolutional-deconvolutional neural network, training their model with different color spaces of dermoscopy images. Sarker et al. (2018) [22] proposed an architecture combining skip-connections, dilated residual networks, and multi-scale pyramid pooling to extract additional contextual information. They also leveraged End Point Error as a content loss function to preserve melanoma boundaries.

The most popular architecture which has achieved state-of-the-art performance in skin lesion segmentation is U-Net [23].

U-Net was proposed for biomedical image segmentation and is based on Fully Convolutional Networks (FCNs) developed for natural object detection. It has a U-shaped architecture that concatenates the feature maps from each encoder layer with the corresponding upsampled decoder feature maps using “skip-connections”, thus allowing it to retain the fine-grained details required for segmentation. Lin et al. (2017) [24] performed an initial study highlighting the efficacy of U-Net with histogram equalization (Dice coefficient of 77%) over C-means clustering (Dice coefficient of 61%) for skin lesion segmentation. Various architectures combining U-Net with alternate CNN architectures have been proposed subsequently. For instance, Zafar et al. (2020) [25] proposed a fully automatic skin lesion segmentation method combining U-Net and ResNet, achieving an average JAC index of 77.2% on the ISIC 2017 dataset. Recently, Ashraf et al. (2022) [26] suggested that a JAC index above 80% in lesion segmentation indicates that an approach is reliable and appropriate for subjective clinical assessment. They proposed three deep learning models, including U-Net, deep residual U-Net (ResUNet), and improved ResUNet (ResUNet++), along with an improved pre-processing pipeline employing an inpainting algorithm to eliminate obscuring hair structures. They also leveraged test-time image augmentation and a conditional random field (CRF) in the post-processing stage, achieving a state-of-the-art 80.73% JAC index on the ISIC 2017 dataset.
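The essential U-Net idea, skip connections concatenating encoder features with upsampled decoder features, can be sketched in a deliberately tiny two-level PyTorch model; real U-Nets use more levels and channels, and the sizes below are illustrative.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    """One encoder level, one decoder level, one skip connection."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc = conv_block(3, 16)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)   # upsample decoder features
        self.dec = conv_block(32, 16)                       # 32 = 16 skip + 16 upsampled
        self.head = nn.Conv2d(16, num_classes, 1)           # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)
        b = self.bottleneck(self.pool(e))
        d = self.dec(torch.cat([self.up(b), e], dim=1))     # the skip connection
        return self.head(d)

# A per-pixel prediction map for one 128x128 image.
print(TinyUNet()(torch.randn(1, 3, 128, 128)).shape)  # [1, 2, 128, 128]
```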

Dermatopathology

Deep learning in dermatopathology centers on traditional pathology slides that are digitized into images (WSI) by scanners. Whole slides provide a wealth of cellular-level information on morphology and spatial arrangement, making them attractive for deep learning-based biomarker extraction.

Due to their immense size, special technical considerations are critically important when working with digitized pathology slide images. Currently the best practice is to divide the slide into smaller patches of tissue, with patch sizes often chosen to match the input size of a given neural network architecture.
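For concreteness, here is a minimal patch-tiling sketch using the OpenSlide library (one common choice among several WSI readers); the file name and 224-pixel patch size are illustrative assumptions.

```python
import openslide

slide = openslide.OpenSlide("example_slide.svs")  # hypothetical file path
patch_size = 224  # matched to a typical CNN input size
level = 0         # highest-resolution pyramid level
width, height = slide.level_dimensions[level]

patches = []
for y in range(0, height - patch_size + 1, patch_size):
    for x in range(0, width - patch_size + 1, patch_size):
        # read_region returns an RGBA PIL image at the requested level.
        patch = slide.read_region((x, y), level, (patch_size, patch_size)).convert("RGB")
        patches.append(patch)  # in practice: filter out background, save to disk
```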

Best Practice 15.5

A common paradigm for deep learning on pathology slides is to first divide the slide into small patches of tissue. Then one trains a tissue-level classifier, typically a deep neural network. This predicts the type of tissue in the patch using labels from pixel-level annotations. Afterwards, a slide-level classifier can be trained to predict using the tissue-level predictions as input and the slide labels as output. The slide-level classifier is typically a model such as logistic regression, often chosen to avoid overfitting.
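A minimal sketch of the slide-level stage of this paradigm, using scikit-learn’s logistic regression on per-slide tissue fractions: every number below is invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-slide features: fraction of patches predicted as
# [tumor, dermis, epidermis] by the tissue-level classifier.
X = np.array([
    [0.40, 0.35, 0.25],
    [0.05, 0.60, 0.35],
    [0.55, 0.30, 0.15],
    [0.02, 0.70, 0.28],
])
y = np.array([1, 0, 1, 0])  # hypothetical slide-level labels (1 = malignant)

# A simple, hard-to-overfit slide-level model.
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])
```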

Pitfall 15.2

While important in all applications of ML in healthcare, deep learning on whole slide images is especially prone to inadvertently learning biases in the dataset rather than actual physiology. A pernicious example is that of scanner effects, in which the model learns which scanner was used to acquire an image. This causes problems when one outcome of interest is more frequently acquired on one model of scanner, something common in multicenter studies. A red flag is if many examples of a single (potentially rare) disease need to be supplied by a single institution. Similar biases can occur if images are acquired using different staining protocols or scanning parameters.
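One practical, if informal, diagnostic (our suggestion, not a method from the cited literature) is to train a “probe” classifier to predict the scanner from the same features the disease model sees; performance well above chance signals that scanner information is present and could be exploited. A sketch with purely synthetic stand-in features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))   # hypothetical patch embeddings
scanner = rng.integers(0, 2, size=200)  # which scanner acquired each patch

# If this probe scores much better than chance (~0.5 here), the features
# carry scanner information that a disease model could latch onto.
# (With these random features, the score should hover near chance.)
probe = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(probe, features, scanner, cv=5).mean())
```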

The majority of applications of AI in dermatopathology to date have focused on the traditional formalin-fixed, paraffin-embedded tissue that is the mainstay of modern pathology. However, there is growing interest in also utilizing the fresh-frozen tissue common in dermatologic surgery.

A prototypical AI project in dermatopathology focuses on extracting features from WSI for diagnostic prediction tasks such as cancer grading or cancer subtyping. Moreno-Andrés et al. [27] developed a diagnostic support tool to identify mitotic cells within detected tumor regions of whole slide images. The authors report a diagnostic accuracy of 83% for their model trained on 59 WSIs. This tool could augment a dermatopathologist’s practice by identifying areas of the slide with the highest density of mitotic figures, and could also potentially reduce the need for immunohistochemical stains for mitosis.

Olsen et al. [28] similarly trained a CNN using 450 WSI to classify basal cell carcinomas, dermal nevi, and seborrheic keratoses. Their Visual Geometry Group (VGG) network achieved an AUC of 0.99 for basal cell carcinomas, 0.97 for dermal nevi, and 0.99 for seborrheic keratoses. Hart et al. [29] developed a CNN to differentiate between Spitz and conventional melanocytic lesions on histopathology. They trained their model on 100 curated whole slide images and first evaluated it on curated image sections, where it demonstrated 99% accuracy. They then conducted a second experiment evaluating the model’s performance on non-curated image patches of the entire slide. In contrast to the curated experiment, the model achieved a significantly lower accuracy of 52.3% on the non-curated patches. Hekler et al. [30] built a similar CNN trained on 695 whole slide images to classify images as melanoma or benign nevi, comparing its performance to dermatopathologists. Performance was evaluated on randomly cropped 10× magnification sections. The CNN achieved a melanoma sensitivity/specificity/accuracy of 76%/60%/68%, respectively, while the 11 dermatopathologists achieved a mean sensitivity/specificity/accuracy of 51.8%/66.5%/59.2%, respectively. However, these results should be interpreted with caution: in a normal clinical setting, pathologists can evaluate the whole slide and are not restricted to randomly cropped segments.

Sankarapandian et al. [31] presented a deep learning-based triaging system that performs hierarchical melanocytic specimen classification into low (MPATH I-II), intermediate (MPATH III), or high (MPATH IV-V) diagnostic categories, enabling prioritization of melanoma cases. They leverage transfer learning, using a pretrained ResNet50 network to extract patch-level features from WSI, and formulate the classification problem in a weakly supervised multiple-instance paradigm using slide-level labels only. By combining patch features using max-pooling, their tool is able to classify suspected melanoma without requiring pixel-level annotations and could substantially reduce diagnostic turnaround time for melanoma by ensuring that suspected melanoma cases are routed directly to subspecialists.
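The max-pooling aggregation at the heart of such a weakly supervised approach can be sketched in a few lines; the feature dimension (2048, matching a ResNet50 embedding) and the three output categories are illustrative assumptions rather than details of [31].

```python
import torch
import torch.nn as nn

class MaxPoolMIL(nn.Module):
    """Weakly supervised multiple-instance classifier over patch features."""
    def __init__(self, feat_dim=2048, num_classes=3):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, patch_feats):
        # patch_feats: (num_patches, feat_dim), e.g., from a pretrained network.
        pooled, _ = patch_feats.max(dim=0)  # max-pool across the slide's patches
        return self.head(pooled)            # one prediction per slide

# One hypothetical slide represented by 500 patches of 2048-d features.
logits = MaxPoolMIL()(torch.randn(500, 2048))
print(logits.shape)  # torch.Size([3]) -- low / intermediate / high
```

Only the slide-level label is needed to train this head, since the loss is computed on the pooled, per-slide prediction.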

Thomas et al. [32] proposed an interpretable deep learning method to classify several common skin cancers (basal cell carcinoma, squamous cell carcinoma, and intraepidermal carcinoma) using WSI. Using manual labeling, they characterized the tissue into 12 meaningful dermatological classes, including hair follicles and sweat glands, as well as the well-defined stratified layers of the skin. They then trained a classifier to assign sub-regions of a WSI to the 12 classes, thus representing the WSI with a segmentation map much as a pathologist would. By analyzing the tissue context using the segmentation map obtained from the classifier, they achieved high WSI classification accuracy while ensuring interpretability.

One notable limitation in the dermatopathology literature is the limited work on leveraging AI to predict patient prognosis and response to therapy from morphological slide features. While existing approaches have tried to link pathological features, such as tumor grade and subtype, to patient prognosis, none have demonstrated a direct link between multiscale pathology image features, combined with patients’ genetic profiles, and survival outcomes or treatment response for adjuvant/neoadjuvant therapy.

Datasets and Challenges

All of the approaches described here depend critically on sufficient quantities of appropriate data, ideally free of biases and representative of a wide variety of patients. While this is unlikely to ever be achieved in practice, dermatology benefits from several large, publicly available datasets. Many of these datasets have been partially combined under the auspices of the International Skin Imaging Collaboration (ISIC) Archive [17]. Although the ISIC Archive has several known limitations [33], it is an invaluable resource for advancing AI research in dermatology. The collaboration also hosts challenges [17] each year, typically associated with prominent computer vision conferences, that provide an engaging opportunity for computer scientists to apply new techniques to relevant dermatologic problems.

Best Practice 15.6

Keep an up-to-date list of publicly available datasets so that you can use them when appropriate.

Conclusion

As an image-centric specialty, dermatology has become an area of particular interest for applications of artificial intelligence and computer vision in healthcare. While many approaches have focused on pigmented and non-pigmented lesions, melanoma in particular, the field is vast, encompassing some 3000 known skin diseases affecting approximately one third of the global population [34,35,36]. The opportunity this presents to ease the global disease burden, particularly by enhancing remote access to specialty care, is incredibly exciting, and we look forward to its bright future.

Key Concepts in This Chapter

Deep neural networks, especially convolutional neural networks, are the primary AI tools in dermatology as of 2022.

Broadly speaking, applications of computer vision in dermatology can be categorized either by input data or by problem type.

Input data is usually one of the following: clinical photographs, dermoscopic photographs, or digitized pathology slides (also known as whole slide images, or WSI).

Problem type is most typically supervised learning, where a label of interest is known for each observation, and can appear as classification or segmentation.

In a classification problem, the goal is to learn a label (or set of labels) for an image in its entirety. For example, given a photograph of a skin lesion the classifier might distinguish between melanoma and benign nevus, and if melanoma then might further characterize it as aggressive or indolent.

Segmentation is similar to classification, but rather than trying to predict a label for an image in its entirety it seeks to do so for each pixel in the image.

The most popular architecture which has achieved state-of-the-art performance in skin lesion segmentation is U-Net.

Due to their immense size, special technical considerations are critically important when working with digitized pathology slide images. Currently the best practice is to divide the slide into smaller patches of tissue, with patch sizes often chosen to match the input size of a given neural network architecture.

Pitfalls in This Chapter

Pitfall 1. When using transfer learning, if you cannot find a dataset naturally similar to yours and must start with weights from, e.g., ImageNet, always try training from scratch as well. In pathology in particular, it is sometimes better to just start over.

Pitfall 2. While important in all applications of ML in healthcare, deep learning on whole slide images is especially prone to inadvertently learning biases in the dataset rather than actual physiology. A pernicious example is that of scanner effects, in which the model learns which scanner was used to acquire an image. This causes problems when one outcome of interest is more frequently acquired on one model of scanner, something common in multicenter studies. Similar biases can occur if images are acquired using different staining protocols or scanning parameters.

Best Practices in This Chapter

Best Practice 15.1. Always start with a known state-of-the-art network architecture rather than designing your own. EfficientNet is often a (relatively) fast way to see if your dataset has an extractable signal.

Best Practice 15.2. Use transfer learning any time you can find a dataset similar enough to yours. See Pitfall 1 above, however.

Best Practice 15.3. Be alert to data differences in photographs acquired by patients vs those acquired by medical photographers or informed providers. If you mix them, be certain to check that there is balance among outcomes by origin.

Best Practice 15.4. Weakly supervised learning may be a promising future direction to alleviate the burden of acquiring costly and time-consuming pixel level annotations.

Best Practice 15.5. A common paradigm for deep learning on pathology slides is to first divide the slide into small patches of tissue. Then one trains a tissue-level classifier, typically a deep neural network. This predicts the type of tissue in the patch using labels from pixel-level annotations. Afterwards, a slide-level classifier can be trained to predict using the tissue-level predictions as input and the slide labels as output. The slide-level classifier is typically a model such as logistic regression, often chosen to avoid overfitting.

Best Practice 15.6. Keep an up-to-date list of publicly available datasets so that you can use them when appropriate.

Questions and Discussion Topics in This Chapter

  1. Discuss: Are photographs acquired in a dermatologist’s office more likely to contain cases than controls?

  2. Describe a red flag to look for when being alert to scanner effects.

  3. What is the primary difference between supervised learning and weakly supervised learning?

  4. Noise and artifacts may be present in images that are not visible to the human eye. Read this blog post by Andrew Janowczyk: http://www.andrewjanowczyk.com/the-noise-in-our-digital-pathology-slides/

     (a) How might this affect a study?

     (b) What approaches could you take to mitigate it?

  5. What are some of the ways that published studies have compared machine performance to dermatologist performance? Do any study designs have particular advantages or disadvantages?