Accuracy and Efficiency of Deep-Learning–Based Automation of Dual Stain Cytology in Cervical Cancer Screening

Nicolas Wentzensen, MD; Bernd Lahrmann, PhD; Megan A. Clarke, PhD; Walter Kinney, MD; Diane Tokugawa, MD; Nancy Poitras, BS; Alex Locke, MD; Liam Bartels, BS; Alexandra Krauthoff, BS; Joan Walker, MD; Rosemary Zuna, MD; Kiranjit K. Grewal, MS; Patricia E. Goldhoff, MD; Julie D. Kingery, MD; Philip E. Castle, PhD; Mark Schiffman, MD; Thomas S. Lorey, MD; Niels Grabe, PhD


J Natl Cancer Inst. 2021;113(1):72-79. 

General Approach

CYTOREADER uses whole-slide scanners (Hamamatsu Nanozoomers HT, XR, and S360) for imaging of ThinPrep (Hologic) or SurePath (Becton Dickinson, BD) slides, 2 widely used liquid-based cytology technologies. CYTOREADER is a cloud-based system (Google Cloud Platform) that can also run as a local installation. Training of deep-learning algorithms for automated DS evaluation was performed using small areas (tiles) from whole slides containing individual or small numbers of epithelial cells. For training of the deep-learning algorithms, tiles from training slides were manually evaluated for DS-positive cells by 3 observers (Supplementary Figure 1, available online).

Deep Learning

Two deep-learning approaches (Convolutional Neural Network with 4 layers [CNN4] and Inception-v3 with 48 layers [IncV3]) were developed sequentially as shown in Figure 1 and described in Supplementary Methods (available online). The algorithms determine the number of DS-positive cells on a slide by detecting the number of tiles above a certain likelihood threshold. A slide is considered positive if the number of DS-positive cells on a slide exceeds a certain cutoff. Training and validation were conducted on the tile level and the slide level. First, a training set from 450 patients was selected for which the clinical endpoint cervical intraepithelial neoplasia grade 3 or greater (CIN3+) was unblinded. Tiles were selected for initial training (80%) and validation (20%) of the algorithm. The deep-learning network provides a likelihood for each tile above which it is considered positive (0.5 for CNN4 and 0.4 for IncV3). The resulting candidate CNN was applied on the slide level on training slides. A cutoff of positive tiles is used to determine slide positivity (≥3 tiles per cell for CNN4 and ≥2 tiles per cell for IncV3). From misclassified slides, false-positive or false-negative tiles were extracted and fed back into the original CNN training to optimize classification accuracy of the CNN. A final locked CNN was applied on the patient level on the blinded validation set comprising 3803 slides. CNN4 showed good performance in ThinPrep slides but not in SurePath slides. Subsequently, a second algorithm (IncV3) was trained specifically for SurePath slides (Supplementary Methods, available online). We published a GitHub repository and created a web page with a source code description of the models and the installation instructions.
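The two-step decision rule described above (a per-tile likelihood threshold followed by a per-slide cutoff on the number of positive tiles) can be illustrated with a minimal sketch. This is not the published CYTOREADER code: the function names and dictionary layout are hypothetical, only the numeric thresholds are taken from the text, and the handling of a likelihood exactly equal to the threshold is an assumption.

```python
# Thresholds quoted in the text; the data layout is an illustrative assumption.
TILE_THRESHOLD = {"CNN4": 0.5, "IncV3": 0.4}   # likelihood above which a tile is positive
SLIDE_CUTOFF = {"CNN4": 3, "IncV3": 2}         # minimum positive tiles for a positive slide


def count_positive_tiles(tile_likelihoods, model):
    """Count tiles whose likelihood lies above the model's threshold.

    Assumption: 'above' is read as strictly greater than the threshold.
    """
    threshold = TILE_THRESHOLD[model]
    return sum(1 for p in tile_likelihoods if p > threshold)


def is_slide_positive(tile_likelihoods, model):
    """Call a slide positive when the positive-tile count reaches the cutoff."""
    return count_positive_tiles(tile_likelihoods, model) >= SLIDE_CUTOFF[model]


if __name__ == "__main__":
    # Hypothetical likelihoods for four tiles from one slide.
    likelihoods = [0.9, 0.6, 0.45, 0.1]
    for model in ("CNN4", "IncV3"):
        print(model, count_positive_tiles(likelihoods, model),
              is_slide_positive(likelihoods, model))
```

In this hypothetical example the same slide is called negative under the CNN4 settings and positive under the IncV3 settings, which shows how strongly the slide-level result depends on the two cutoffs.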

Figure 1.

Study design. AI = artificial intelligence; CNN = convolutional neural network; CIN3+ = cervical intraepithelial neoplasia grade 3 or worse; DS = dual stain.

Study Populations

The Biopsy Study is a population-based study of women aged 18 years or older referred to colposcopy at the University of Oklahoma Health Sciences Center between 2009 and 2011.[24] We included DS slides from 602 women as previously described.[19] The study population was split into a representative training set (193 slides with 741 DS-positive and 953 DS-negative tiles) and a validation set of 409 slides (Figure 1). This study was approved by the University of Oklahoma and National Cancer Institute (NCI) institutional review boards (IRB); written informed consent was obtained from all participants before study enrollment.

The Anal Cancer Screening Study (ACSS) was based at the San Francisco Kaiser Permanente Northern California (KPNC) Anal Cancer Screening Clinic. HIV-positive men who have sex with men, aged 18 years or older, were enrolled at KPNC between 2009 and 2010. DS slides from 318 men were generated as previously described.[25] From 19 training slides, 445 DS-positive and 532 DS-negative tiles were used for training (Figure 1). This study was approved by the KPNC and NCI IRBs; written informed consent was obtained from all participants before study enrollment.

At KPNC, DS was evaluated for triage of HPV-positive women between 2012 and 2015 in a population of women aged 25 years and older who were undergoing routine cervical cancer screening.[16] From a screening population of more than 300 000 women in a year, 3333 slides from HPV-positive women were included. From 238 training slides, 8215 DS-positive and 9739 DS-negative tiles were used for training (Figure 1). The study was approved by the KPNC IRB and was exempted from institutional review at the NCI by the Office of Human Subjects Research. Patient consent was waived because deidentified discarded specimens were used in this study.

Clinical Endpoints

All studies followed routine clinical practice at the respective institutions. Cytology was classified by the Bethesda System: negative for intraepithelial lesions or malignancy, atypical squamous cells of undetermined significance, low-grade squamous intraepithelial lesions, and high-grade squamous intraepithelial lesions (HSIL).[26] Final diagnosis was established by histopathology classified according to the cervical intraepithelial neoplasia (CIN) scale for cervical endpoints, which indicates the extent of dysplastic cells in the cervical epithelium: no indication for biopsy, normal, CIN grade 1 (CIN1), CIN grade 2 (CIN2), CIN grade 3 (CIN3), and cancer. We grouped adenocarcinoma in situ with CIN3. For anal disease endpoints, the comparable anal intraepithelial neoplasia (AIN) nomenclature was used.

p16/Ki-67 Staining and Evaluation

For the Biopsy Study and ACSS, slides were prepared from residual PreservCyt material using a T2000 processor (Hologic, Bedford, MA). For the KPNC study, slides were prepared from residual SurePath tubes according to the manufacturer's instructions (BD, Sparks, MD). Immunostaining of cervical cytology slides for p16/Ki-67 was performed using the CINtec Plus Kit (Roche, Tucson, AZ) according to the manufacturer's instructions. DS-trained cytotechnologists reviewed all slides; a slide was considered positive if 1 or more cervical epithelial cells stained with both a brown cytoplasmic stain (p16) and a red nuclear stain (Ki-67), irrespective of morphologic abnormalities. Slides from the Biopsy Study and ACSS were stained and evaluated at Roche mtm laboratories AG, Heidelberg, Germany, whereas slides from the Kaiser DS study were stained and evaluated at KPNC. HPV testing with partial genotyping (HPV16 and HPV18) at KPNC was based on the cobas assay (Roche, Pleasanton, CA).

Statistical Analysis

We created boxplots and calculated medians to show the distribution of DS-positive cells in cytology and histology categories. We compared differences in DS cell counts in ordinal cytology and histology categories using 1-way analysis of variance. The primary endpoint for the Biopsy Study and the Kaiser Study was CIN3 or greater (CIN3+). For ACSS, the primary endpoint was AIN2 or AIN3 (AIN2+). Receiver operating characteristic (ROC) curve analysis was conducted for the number of DS-positive cells against the primary endpoints, and the area under the curve (AUC) was calculated. Sensitivity and specificity coordinates for manual DS evaluation and cytology were plotted on the ROC curve for comparison. We calculated percentage positivity, sensitivity, specificity, and Youden's index in the Biopsy Study and ACSS for the cutoff determined by CNN4 and for manual DS evaluation. In the Kaiser Study, with a representative population of women who underwent routine screening, we calculated percentage positivity, sensitivity, specificity, and positive and negative predictive values for automated and manual DS. Differences in positivity, sensitivity, and specificity were evaluated using an exact McNemar's χ² test, and differences in predictive values were evaluated using the generalized score statistic as implemented in the R package DTComPair.[27] To evaluate clinical efficiency of each strategy, we estimated the number of CIN3+ detected for different cutoffs of DS-positive cells and the ratio of the number of tests and colposcopies per case of CIN3+ detected. We also evaluated the theoretical performance of automated DS in a fully vaccinated population by excluding all women who were positive for HPV16 and/or HPV18 from the analysis. Analyses were performed in SPSS, Stata, and R. All statistical tests were 2-sided, and P < .05 was considered statistically significant.
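For readers who want to reproduce the summary measures named above, the following is a minimal sketch of the standard definitions, computed from a 2×2 table of test result against disease status, together with a 2-sided exact McNemar test on discordant pairs. It is an illustration only, not the analysis code used in the study (which was run in SPSS, Stata, and R); the function names and the example counts are hypothetical.

```python
from math import comb


def diagnostic_metrics(tp, fp, fn, tn):
    """Standard accuracy measures from a 2x2 table (test result vs disease status)."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "positivity": (tp + fp) / total,           # share of all tests that are positive
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,   # Youden's index J
        "ppv": tp / (tp + fp),                     # positive predictive value
        "npv": tn / (tn + fn),                     # negative predictive value
    }


def exact_mcnemar_p(b, c):
    """Two-sided exact McNemar P value.

    b and c are the discordant pair counts (positive on one test only).
    Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the functions.
    print(diagnostic_metrics(tp=40, fp=60, fn=10, tn=890))
    print(exact_mcnemar_p(2, 8))
```

The exact McNemar test uses only the discordant pairs, which is why it suits paired comparisons such as automated versus manual DS read on the same slides.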