What Is AI? Applications of Artificial Intelligence to Dermatology

X. Du-Harpur; F.M. Watt; N.M. Luscombe; M.D. Lynch


The British Journal of Dermatology. 2020;183(3):423-430. 

Key Biases, Limitations and Risks of Automated Skin Lesion Classification

Given that, remarkably, all of the published studies indicate superiority of machine learning algorithms over dermatologists, it is worth exploring the biases commonly found in these study designs. These can be categorized into biases that favour the networks and biases that disadvantage clinicians. With regard to the first category, it is worth noting that in the studies described, the neural networks were generally trained and tested on images drawn from the same dataset. This closed-loop approach to training and testing highlights a common limitation in machine learning: poor 'generalizability'. On the occasions that generalizability has been tested, neural networks have often been found lacking. For example, Han et al. released their neural network, a Microsoft ResNet-152 architecture trained on nearly 20 000 skin lesion images from a variety of sources, as a web application.[15] When Navarrete-Dechent et al. tested the network on data from the ISIC dataset, to which it had not previously been exposed, its performance dropped from a reported area under the curve (AUC) of 0·91 to achieving the correct diagnosis in only 29 out of 100 lesions, implying a far lower AUC.[23] Algorithms are fundamentally a reflection of their training data: if the input image dataset is biased in some way, algorithmic performance will suffer, and this will only become apparent when networks are tested on completely separate datasets.
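
The internal-versus-external evaluation gap described above can be sketched in a few lines. The following is an illustration only: the lesion scores are invented, and the rank-based AUC function is a standard textbook construction, not the evaluation code used in any of the cited studies.

```python
# Toy illustration of external validation: the same ranking metric (AUC)
# computed on an internal test split versus an external dataset.
# All scores below are invented for illustration.

def auc(malignant_scores, benign_scores):
    """Rank-based AUC: the probability that a randomly chosen malignant
    lesion receives a higher score than a randomly chosen benign one
    (ties count as half a win)."""
    pairs = len(malignant_scores) * len(benign_scores)
    wins = sum((m > b) + 0.5 * (m == b)
               for m in malignant_scores for b in benign_scores)
    return wins / pairs

# Internal test split drawn from the training distribution: scores separate well.
internal = auc([0.9, 0.8, 0.7, 0.85], [0.2, 0.3, 0.75, 0.6])   # 0.9375

# External dataset (different cameras, populations, lesion mix):
# the separation collapses towards chance.
external = auc([0.6, 0.4, 0.5, 0.3], [0.5, 0.45, 0.35, 0.55])  # 0.46875

print(internal, external)
```

A high AUC on the internal split says nothing about the external figure; only evaluation on data the network has never seen reveals the gap.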

Another important limitation of the methodology used to compare AI models with dermatologists is that ROC curves, although a useful visual representation of the trade-off between sensitivity and specificity, do not address other important clinical risks. For example, in order to capture more melanomas (increased sensitivity), an algorithm may misclassify more benign naevi as malignant (false positives). This could lead to unnecessary biopsies for patients, which, aside from the harm to patients, would create additional demand on an already burdened healthcare system. There is evidence that dermatologists have better 'number needed to biopsy' metrics for melanoma than nondermatologists.[24] Reporting the number needed to biopsy would be a useful addition to studies such as that of Esteva et al.,[14] as it would aid in estimating the potential patient and health economic impact.
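
The trade-off can be made concrete with simple arithmetic. In the sketch below, all counts are hypothetical (a notional screening set of 50 melanomas and 950 benign naevi), chosen only to show how moving to a more sensitive operating point can worsen the number needed to biopsy.

```python
# Illustrative arithmetic only: hypothetical confusion-matrix counts showing
# how raising sensitivity can inflate false positives and worsen the
# number needed to biopsy (NNB). Figures are invented, not from cited studies.

def metrics(tp, fn, fp, tn):
    """Return sensitivity, specificity and number needed to biopsy.

    NNB = lesions biopsied (test positives) per melanoma found = (tp + fp) / tp
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    nnb = (tp + fp) / tp
    return sensitivity, specificity, nnb

# Operating point A: conservative threshold.
print(metrics(tp=40, fn=10, fp=50, tn=900))    # sens 0.80, spec ~0.95, NNB 2.25

# Operating point B: threshold lowered to capture more melanomas; sensitivity
# rises, but false positives triple and the NNB nearly doubles.
print(metrics(tp=47, fn=3, fp=150, tn=800))    # sens 0.94, spec ~0.84, NNB ~4.19
```

An ROC curve summarizes only the first two quantities; the third, which drives biopsy workload and patient harm, is invisible on it.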

It is also worth noting that these datasets were retrospectively collated and repurposed for image classification training; the images captured may therefore not be representative, either in the proportions of the diagnoses included or in displaying typical features. As neural networks are essentially a reflection of their labelled input data, this will undoubtedly have consequences for how they perform; however, given the lack of 'real-world' studies, it is difficult to know how significant this is. Assessing clinicians using images from these datasets may also introduce a bias that disadvantages them, as lesions deemed worthy of photography or biopsy may not be representative of their lesion type; as a result, clinicians' diagnostic sensitivity may be lower than in a normal clinic. This hypothesis for the discrepancy in diagnostic accuracy was borne out in a recent Cochrane review, in which the diagnostic sensitivity of dermatologists examining melanocytic lesions with dermoscopy was 92%,[25] considerably higher than that typically found in neural network studies; for example, in Tschandl et al.'s web-based study of 511 clinicians, the sensitivity of experts was 81·2%.[22] The manner in which clinical decisions are mapped to 'benign' or 'malignant' labels also rests on assumptions that may not be accurate; for example, a dermatologist's decision to biopsy a lesion reflects risk, not an outright 'malignant' classification.

From a safety perspective, there are two considerations that have yet to be addressed in the studies. Firstly, in order to 'replace' a dermatologist, an algorithm must be able to match the current gold standard for screening a patient's skin lesions: a clinical assessment by a dermatologist, who examines the lesion in the context of the patient's history and the rest of their skin. Published studies do not compare neural networks against this standard of assessment; the networks are compared only with dermatologists presented with dermoscopic or clinical images, sometimes with limited additional clinical information. Not only does this bias the studies against dermatologists, who are not trained or accustomed to making diagnoses without this information, it also limits any justification for deploying algorithms in a clinical setting as a replacement for dermatologists: fundamentally, it has not yet been demonstrated that they are equivalent to the standard of dermatological care currently provided to patients. A second important consideration is that training data lack sufficient quantities of certain types of lesion, particularly rarer presentations of malignancy such as amelanotic melanoma.[15] It is not yet clear how algorithms will perform when presented with entirely novel, potentially malignant lesions; this has rare but significant safety implications for patients.

From a legal perspective, an issue that has yet to be fully addressed is the lack of explainability of neural networks. Currently, it is not possible to know what contributes to their decision-making process. This has led to criticisms and concerns that neural networks function as 'black boxes', with potentially unanticipated and hard-to-explain failure modes. The European Union's General Data Protection Regulation specifies explainability as a requirement for algorithmic decision making, which is currently not achievable.[26,27] Algorithmic decision making also has uncertain status in the USA, where the Food and Drug Administration has advised that, until a body of evidence from clinical trials exists, clinical decisions suggested by AI ought to be considered AI guided, not AI provided, and liability would still rest with the clinician.[28]