The Application of Convolutional Deep Learning in Dermatology
Classifying data using CNNs is now relatively accessible, computationally efficient and inexpensive, hence the explosion in so-called 'artificial intelligence'. In medicine to date, the main areas of application have been the visual diagnostic specialties of dermatology, radiology and pathology. Attempts to automate aspects of dermatology with computer-aided image classification have been made for over 30 years;[6–8] however, previous efforts have achieved only limited accuracy. Although attempts have been made in recent years to use neural networks to diagnose or monitor inflammatory dermatoses,[9–11] these have generally not been as successful or impressive as the networks constructed to diagnose skin lesions, particularly melanoma. Melanoma is therefore the focus of the remainder of this review, and Table S1 (see Supporting Information) summarizes the studies that compare networks head-to-head with dermatologists.[12–21]
In 2017, Esteva et al. published a landmark study in Nature that was notable for being the first to compare a neural network's performance against that of dermatologists. They used a pretrained GoogLeNet Inception v3 architecture and fine-tuned the network (transfer learning) using a dataset of 127 463 clinical and dermoscopic images of skin lesions (subsequent studies have shown it is possible to train networks on significantly smaller datasets, numbering in the thousands). For testing, they selected a subset of biopsy-confirmed clinical and dermoscopic images and asked over 20 dermatologists for their treatment decisions. Dermatologists were presented with 265 clinical images and 111 dermoscopic images of 'keratinocytic' or 'melanocytic' nature, and asked whether they would: (i) advise biopsy or further treatment or (ii) reassure the patient. The authors inferred a 'malignant' or 'benign' diagnosis from these management decisions, and then plotted the dermatologists' performance on the network's ROC curves with regard to classifying the keratinocytic or melanocytic lesions (which were subdivided as dermoscopic or clinical) as 'benign' or 'malignant' (Figure 4a). In both 'keratinocytic' and 'melanocytic' categories, the average dermatologist performed at a level below the CNN ROC curves, with only one individual dermatologist performing better than the CNN ROC curve in each category. This suggests that, in the context of this study, the CNN had superior accuracy to the dermatologists.
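The comparison described above can be illustrated with a minimal sketch: a model's ROC curve is built from its malignancy scores, each reader's biopsy/reassure decisions are reduced to a single sensitivity and specificity, and the reader's point is checked against the curve. This is not the code used by Esteva et al.; the function names, the linear interpolation between ROC points and the toy data are all assumptions made for illustration.

```python
def roc_points(labels, scores):
    """ROC points (false positive rate, true positive rate) for binary labels
    (1 = malignant, 0 = benign) and model scores (higher = more malignant)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one malignant and one benign case")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Lower the threshold one distinct score at a time (ties move together).
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def tpr_at(points, fpr):
    """Highest true positive rate the curve reaches at a given false positive
    rate, using linear interpolation between ROC points (an assumption)."""
    best = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= fpr <= x1:
            y = y1 if x1 == x0 else y0 + (y1 - y0) * (fpr - x0) / (x1 - x0)
            best = max(best, y)
    return best


def reader_point(labels, decisions):
    """Sensitivity and specificity of one reader, where a decision of
    1 = biopsy/treat (inferred 'malignant') and 0 = reassure ('benign')."""
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = sum(1 for y, d in zip(labels, decisions) if y == 1 and d == 1)
    tn = sum(1 for y, d in zip(labels, decisions) if y == 0 and d == 0)
    return tp / positives, tn / negatives


def below_curve(points, sensitivity, specificity):
    """True if the reader's operating point lies strictly below the ROC curve."""
    return sensitivity < tpr_at(points, 1 - specificity)


# Toy data, invented for illustration only (not taken from any study).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
reader_decisions = [1, 1, 0, 0, 1, 0, 0, 0]

curve = roc_points(labels, model_scores)
sens, spec = reader_point(labels, reader_decisions)
print(sens, spec, below_curve(curve, sens, spec))  # 0.5 0.75 True
```

A reader reduces to a single point rather than a curve because a biopsy/reassure decision is binary, whereas the model's continuous score can be thresholded at any level.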
Figure 4. Receiver operating characteristic (ROC) curves from studies by Esteva et al.,[14] Brinker et al.[19,20] and Tschandl et al.[21] Most often, the dermatologists' comparative performance is plotted as individual data points. A point lying below the curve means that the dermatologist's sensitivity and specificity, and therefore accuracy, are considered inferior to those of the model in the study. The studies all demonstrate that, on average, dermatologists sit below the ROC curve of the machine learning algorithm. It is noticeable that the performance of the clinicians in Brinker's studies (b, c), for example, is inferior to that of the clinicians in the Esteva study (a). Although there is a greater spread of clinical experience in the Brinker studies, the discrepancy could also be related to how the clinicians were tested. In both Brinker's and Tschandl's studies, some individual data points represent performance that is substantially lower than real-world data would suggest, which may indicate that the assessments were biased against clinicians. AUC, area under the curve; CNN, convolutional neural network. All figures are reproduced with permission of the copyright holders.
A recently published large study, detailed in two papers by Brinker et al.,[19,20] involved training a 'ResNet' model on the publicly available International Skin Imaging Collaboration (ISIC) database, which contains in excess of 20 000 labelled dermoscopic images that are required to meet some basic quality standards. This network was trained on over 12 000 images to perform two tasks: the first was to classify dermoscopic images of melanocytic lesions as benign or malignant (Figure 4b), and the second was to classify clinical images of melanocytic lesions as benign or malignant (Figure 4c). The dermatologists were assessed using 200 test images, and the decision requested mirrored that in the study of Esteva et al.: to biopsy/treat or to reassure. Additionally, the dermatologists' demographic data, such as experience and training level, were requested.
Relative performance was again quantified using ROC curves: a mean ROC curve was drawn by calculating the average predicted class probability for each test image (Figure 4b, c). The dermatologists' performance for the same set of images was then plotted against this ROC curve. Barring a few individual exceptions, the dermatologists' performance fell below the CNN ROC curves in both the clinical and dermoscopic image classifications. The authors also used a second approach, whereby they set the sensitivity of the CNN at the level of the attending dermatologists, and compared the mean specificity achieved at equivalent sensitivity. In the dermoscopic test, at a sensitivity of 74·1%, the dermatologists' specificity was 60% whereas the CNN achieved a superior 86·5%.
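The second approach, comparing specificity at matched sensitivity, can be sketched as follows. This is an illustrative outline rather than the procedure used by Brinker et al.: the function names are hypothetical, the data are invented, and the rule of taking the highest threshold that reaches the target sensitivity is an assumption. The first helper shows one way to average predicted class probabilities per image across several model runs before thresholding.

```python
def mean_scores(runs):
    """Average predicted malignancy probability per image across model runs.
    `runs` is a list of equally long score lists, one list per run."""
    return [sum(column) / len(column) for column in zip(*runs)]


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called malignant."""
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / positives, tn / negatives


def specificity_at_sensitivity(labels, scores, target_sensitivity):
    """Find the highest threshold at which the model's sensitivity reaches the
    target (e.g. the readers' mean sensitivity); return the threshold together
    with the sensitivity and specificity achieved there."""
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(labels, scores, threshold)
        if sens >= target_sensitivity:
            return threshold, sens, spec
    raise ValueError("target sensitivity cannot be reached")


# Toy data, invented for illustration only (not taken from any study).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]

# Suppose the readers' mean sensitivity on these images was 0.75.
print(specificity_at_sensitivity(labels, scores, 0.75))  # (0.7, 0.75, 1.0)
```

Fixing sensitivity at the readers' level turns the comparison into a single number, the difference in specificity, which is how the 60% vs. 86·5% figures above should be read.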
As part of an international effort to produce technology for early melanoma diagnosis, in 2016 an annual challenge was established to test the performance of machine learning algorithms using the image database from the ISIC. A recent paper by Tschandl et al. summarizes the results of the most recent competition, held in August to September 2018, and also compares the performance of the submitted algorithms against 511 human readers recruited from the World Dermoscopy Congress, who comprised a mixture of board-certified dermatologists, dermatology residents and general practitioners (Figure 4d). Test batches of 30 images were generated to compare the groups, with each image presented as a multiple-choice question offering seven diagnoses. When comparing all 139 algorithms against all dermatologists, dermatologists on average achieved 17 out of 30 on the image multiple-choice questions, whereas the algorithms on average achieved 19. As expected, years of experience improved the probability of making a correct diagnosis. Regardless, the top three algorithms in the challenge outperformed even experts with > 10 years of experience, and the ROC curves of these top three algorithms sit well above the average performance of the human readers.
The British Journal of Dermatology. 2020;183(3):423-430. © 2020 Blackwell Publishing