Background

In the ongoing initiative to eliminate cervical cancer, the World Health Organization (WHO) highlights the central role of HPV [1]. The recently demonstrated efficacy of single-dose HPV vaccination provides hope for eventual primary prevention of cervical cancer [2]. However, implementing affordable and accurate HPV screening is still a major challenge in lower-resource settings [3,4,5,6].

Where HPV testing is done, a negative HPV result in mid-adulthood provides strong reassurance that cervical cancer/precancer is not present or imminent [7, 8]. While the high negative predictive value is settled, feasibility of HPV testing remains an unsettled issue especially in lower-resource settings. A major practical issue is the clinical management of women testing HPV-positive [3]. WHO recommends either ablation/excision of the cervical transformation zone in its entirety in all women testing HPV-positive ("screen-treat") or performance of an additional test on HPV-positive women to determine who will benefit most from treatment ("screen-triage-treat") [1].

As a novel screen-triage-treat strategy, we are evaluating two complementary tests: HPV genotyping and artificial intelligence (AI)-assisted visual evaluation. Regarding HPV genotyping, there are approximately a dozen HPV types classified as carcinogenic, and the type of HPV strongly modifies risk of cancer, i.e., the risk groups in descending order of carcinogenicity are HPV16, then HPV18/45, then HPV31/33/35/52/58, then HPV39/51/56/59/68 [9, 10]. HPV typing could be provided at minimal added costs compared with HPV positivity/negativity, and several assays have incorporated HPV typing as part of their readout of results [5, 11].

The second complementary triage method is a visual adjunct called "automated visual evaluation (AVE)". AVE is a real-time, deep learning-based classifier of cervical appearance with a readout as either reflective of precancer/cancer, indeterminate, or normal based on images captured by a digital camera [12,13,14,15].

The cross-combination of the four-level HPV type groups with the three-level AVE algorithm score creates 12 risk levels (Fig. 1), a gradient that could help clinicians identify which HPV-positive women are most likely to have precancer and are therefore in greatest need of ablation or excisional cervical treatments to prevent cervical cancer [3].

Fig. 1
figure 1

Combination of HPV typing and automated visual evaluation (AVE) to create risk score for cervical screening. *In case of multiple infections, the result will be hierarchical, as HPV16 positive, else (if HPV16 negative) HPV18/45 positive, else (if HPV16 and HPV18/45 negative) HPV31/33/35/52/58 positive, else (if HPV16 and HPV18/45 and HPV31/33/35/52/58 negative) HPV39/51/56/59/68 positive, else negative

We present an evaluation of the HPV-AVE approach for predicting precancer/cancer based on screening results and histologic outcomes in the screening program in Lusaka, Zambia.

Methods

Study population and field study

The research took place in the well-established cervical cancer screening program nested inside public sector primary care health facilities in Lusaka [16, 17]. Women in this analysis were recruited under informed consent from clinics which primarily offer screening and treatment with thermal ablation or large loop excision of the transformation zone (LLETZ) by trained nurses. The standard of care screening was conducted by experienced nurses who examined women with visual inspection with acetic acid (VIA)-based screening aided with high-quality digital cameras; the magnified illuminated images permits both inspection of the surface morphology of the cervix and facilitates expert telemedicine quality assurance. Emphasizing sensitive criteria to avoid missing precancer/cancer, ~ 25% of women screen positive, reflecting partly the high HIV prevalence [17, 18]. As a research ‘add-on’ for this study, the nurses also took an additional triplicate set of images using a Samsung Galaxy J8™ smartphone camera. They also collected a cervical swab that was sent for subsequent HPV testing using the BD Onclarity™ assay system installed in Lusaka at a major referral hospital.

Expert gynecologic pathology review was available. Cases of cervical precancer/cancer were defined clinically as women having histologic CIN2, CIN3, or cancer. Glandular neoplasia was uncommon and grouped with the corresponding severity of squamous diagnoses (AIS with CIN3, ADC with SCC). Controls were women with completely visible squamocolumnar junctions (Type 1 or 2 Transformation Zones) whose visual screen was judged to be normal and not requiring referral for biopsy, combined with those that were referred but had histologic findings < CIN2.

HPV testing

The results of the HPV testing performed in Lusaka were obtained by Onclarity batch testing for research purposes only, and unconnected to clinical management. Onclarity provides HPV typing that can approximate the type groups ranked in order of carcinogenicity [19]. Specifically, the assay yields results individually for HPV 16, 18, 31, 45, 51, and 52, but combines 33/58, 56/59/66, and 35/39/68. For the purposes of this research the results were further grouped based on established risk of cancer in a hierarchical classification as HPV16, else HPV18/45, else HPV 31/33/52/58, else HPV 35/51/56/59/66/68. Of note, the inclusion of HPV 35 in the lowest risk group is now known to be an error (among individuals of African heritage, it properly belongs with the other HPV 16-related types in the HPV 31 group) [20], and the incorrect inclusion of HPV 66 as carcinogenic is another acknowledged limitation of this assay [21], leading to some false positives.

Automated visual evaluation (AVE) algorithm

The AVE algorithm was pre-trained on the NCI cervical image bank that contains more than 150,000 images taken with Cerviscopes (35 mm film images called Cervigrams, subsequently digitized) or DSLR camera images taken by beam splitting of Zeiss colposcope images [13]. The reader is referred elsewhere for detailed description of the logic, training, and initial validation of this deep-learning algorithm [12,13,14,15]. As noted above, the algorithm yields an ordered three-level classification of severity ("likely precancer/cancer", "indeterminate", or "normal" appearance). Its performance has been validated on internal "hold-back" test sets but, prior to this presentation, had not yet been validated in combination with HPV genotyping on an external dataset using a different image device in a distinct screening population.

Treatment and histologic diagnoses

An important aspect of the Zambian screening program is expert treatment of screen-positive women [16]. If visually assessed lesions meet the WHO criteria for ablation by cryotherapy or thermal ablation of the transformation zone, that treatment is performed [22]. For the purposes of this research, women underwent biopsy prior to ablation to detail underlying pathology. If more extensive treatment was needed, either LLETZ was performed or punch biopsies were taken to exclude invasion as guided by clinical assessment or/and expert review of digital cervigrams.

The case and control histologic diagnoses in this study were based therefore on punch biopsies or LLETZ specimens, evaluated by an expert pathologist. As stated above, women that screened negative were also included as controls despite having no biopsy (as were those with negative digital cervicography/biopsy) given the very sensitive threshold for VIA positivity, high rates of referral, and the substantial expertise of the examining nurses.

Data analysis

The population diagram for the study is shown in Fig. 2. The associations of HPV type group and AVE classification with histologic outcome were visualized for all women having all three variables (Additional file 1: Fig. S1).

Fig. 2
figure 2

Consort diagram of Zambia dataset

The data analysis included the following: First, we tested transfer learning of the candidate AVE algorithm for immediate use without modification on the J8 images. The J8 image type was a kind not previously included in AVE training. We postulated that portability might require retraining of the AVE algorithm to permit familiarity with the new image type. A small subset of images from women in the Zambian screening clinic was used for retraining, and contained 80 individuals with each classification (precancer/cancer, indeterminate, normal). The retraining images were added incrementally (20, then 40, then 60, then 80) to the NCI core collection to consider incrementally how many of the previously unfamiliar kind of images were needed to transfer the algorithm successfully; 40 was the chosen number achieving reasonable performance (Additional file 1: Table S1).

Once AVE was trained to analyze the Zambian J8 images, we assessed repeatability of the AVE results obtained from the three replicate images of the same individual captured by the J8 smartphone camera. Repeatability was assessed as an ordinal 3 × 3 table since the output was three ordinal classes of increasing severity: normal, indeterminate (HPV-positive patients with some equivocal/borderline/look-alike cervical changes), and precancer/cancer. In assessing reproducibility, the percent of the individuals that were extremely misclassified on replicates was of special interest (i.e., normal images classified as precancer/cancer, or vice versa).

Accuracy of a test is typically judged to be the correct identification of cases and non-cases, generally assessed in a 2 × 2 table by sensitivity, specificity, and their tradeoff (area under the receiver operating curve, or AUC). However, in this version of AVE, a large "gray zone" of indeterminate results was established between precancer/cancer and normal. Thus, the analysis assessed a 3 × 3 matrix (which shows the three diagnostic truth classes as one dimension and three-level test classification as the other). The worst inaccuracies, i.e., the percent of extreme errors, were again of special interest (precancer/cancer called normal, or vice versa).

Results

Shown in Table 1 are the general characteristics of the Zambian screening population. The variables including HIV, HPV genotyping, histopathology, VIA classification, and ground truth classes for AVE algorithm are presented. Of note, a high percentage, 35%, of the women in the total analysis population (test set in Table 1) were HIV-positive.

Table 1 Summary of Zambia screening population (test set n = 998 and retraining/validation set n = 240)

Non-portability of AVE algorithm to new image type

The initial "transfer learning" application of the pre-trained AVE algorithm to the Zambian J8 image set generated very poor performance, with nearly random discrimination of the reference diagnostic classes (Additional file 1: Table S1).

Retraining was required, i.e., adding case/control J8 images from the Zambia dataset into our original training/validation sets. We tried including 17 + 3, 35 + 5, 52 + 8, 70 + 10 (training + validation) individuals’ data from each ground truth class (for example, we randomly selected 40 individuals with ground truth of normal, indeterminate, and precancer/cancer each 35 to be added into our training set and five to be added into our validation set, which is used during retraining at various checkpoints to monitor the progress of the retraining). Except for the experiment with only 17 + 3 individuals’ data addition, all other retrained algorithms performed well (Additional file 1: Table S1). For the rest of this section, we will present validation results of the AVE algorithm retrained with additional 35 + 5 (training + validation images) individuals’ data from each diagnostic class.

Repeatability

Table 2 and Fig. 3 present repeatability results of the retrained algorithm as measured in the test set. Each individual in this dataset had on average three images captured by Samsung J8. Table 2 compares the AVE result from the first two J8 images captured from the same patient at the same visit, as a 3 × 3 ordinal matrix with AVE predictions on the first J8 image as one dimension and the second image on the other. Of the individuals in the test set, 79% had the same AVE test result on both images with weighted kappa score 0.72 (95% CI 0.68–0.76). The other pairwise comparisons (first image versus third, second versus third) generated comparable results.

Table 2 Repeatability of AVE algorithm results obtained from 2 different J8 images captured from the same patient at the same visit
Fig. 3
figure 3

Bland–Altman plot—assessing the repeatability of AVE scores obtained from 2 different J8 images captured from the same patient at the same screening visit. It displays the repeatability of AVE scores obtained from 2 different J8 images captured from the same patient at the same visit by Bland–Altman plot. The x-axis shows the average of 2 AVE scores obtained from 2 different J8 images while the y-axis shows the difference of these scores. Continuous AVE score is obtained as the summation of class label (0, 1, 2) multiplied by its corresponding class AVE prediction. Each point in the plot is colored according to its ground truth. Blue points represent ground truth normal patients, yellows are indeterminate cases, and reds are confirmed CIN2+ cases. Under perfect repeatability, score differences are expected to be zero; therefore, in an ideal situation, all of the points on the graph are expected to be lying on the y = 0 line (horizontal line passing through 0). However, in our situation points vary around this horizontal line, and the variability is highest at the middle (where x = 1). This means that the variability in score differences is dependent on score averages. The variability is smallest at each end 0 (corresponding to normal) and 2 (corresponding to precancer/cc), and is highest at the middle which means there is low repeatability at indeterminate class compared to definite normal and precancer/cancer classes

Figure 3 displays the difference between the two replicate image scores on the y-axis plotted against the average of them on the x-axis (a Bland–Altman plot). Continuous scores were obtained by multiplying predicted class probabilities given by the AVE algorithm with their corresponding class labels (0: normal, 1: indeterminate, 2: precancer/cancer). Most of the normal class images (represented by blue dots) are clustered on the left end of the graph with only small variability on the y-axis, meaning most of the normal images were repeatedly estimated as normal across the replicate images. The same holds for the precancer/cancer cases (red dots) as well; however, the variability of the continuous score (i.e., larger differences between replicates) increases somewhat at the middle (indeterminate class images). In other words, indeterminate images generated slightly more variable AVE results.

Accuracy

The AVE classification trended strongly toward more severe classification linked to the severity of the histologic diagnosis (Additional file 1: Fig. 1). Table 3 displays accuracy of the algorithm results among HPV risk groups, compared with the actual histologic results. Both tests showed strong associations with case-indeterminate-control status.

Table 3 Results by histology of HPV then AVE from J8 images, pretrained algorithm retrained with 35 + 5 patients’ data added to each class from Zambia screening population

Risk stratification

As shown in Figs. 4 and 5, at each step in the screening/triage strategy, we estimated pre- and post-test chance of having precancer/cancer. We considered sequentially the data available on the different variables, to simulate independent performance. Both kinds of triage tests (AVE and HPV type) were linked strongly with histologic diagnoses.

Fig. 4
figure 4

Step by step precancer/cancer stratification: low prevalence Zambia study population, after knowing HIV status, after knowing HPV status. This figure explains step by step risk discrimination in a population after knowing each screening test result. In total, there are 931 patients screened in this study in Zambia (67 excluded due to no J8 image). After testing for HIV, precancer/cc risk of HIV+ patients increases to 15% while the risk decreases to 2.2% for HIV-negative patients. After HIV, if the patients get tested for HPV genotype, we can observe even further risk discrimination. A 15% precancer/cc risk of HIV-positive patients increases to 48% if they are positive for HPV type 16. Similarly, the risk decreases from 15 to 1.4% if the patients are HPV HR-negative. For HIV-negative patients, the precancer/cc risk increases from 2.2 to 22% if they are positive for HPV type 16. Similarly, if HIV-negative patients are negative for any HPV HR types then their precancer/cc risk decreases to 0.23%. In the above figure, the number of patients (N) observed in each category and their precancer/cc risk are displayed separately for each category. *67 individuals have no J8 images, they have images captured by other camera types. **37 of HIV+ individuals and 28 of HIV− individuals, and 2 of the HIV missing individuals do not have any J8 images (which add up to 67 from the previous step), so they are not included in this analysis. 22 individuals have missing HIV result (317 HIV+, 592 HIV − , and 22 missing HIV will add up to the previous step, screening population)

Fig. 5
figure 5

Precancer risk stratification among low-prevalence HIV-positive Zambia study population by HPV and AVE combined results. This figure is an extension of Fig. 4 extending the risk discrimination to demonstrate the intended use case of PAVE, triaging HPV-positive individuals with HPV genotyping and AVE. The population is first tested for HIV, and this figure shows the risk discrimination among HIV-positive patients first tested with HPV and then with AVE. As demonstrated in previous images, 15% precancer/cc risk of HIV-positive patients will vary between 48 and 1.4% (HPV 16+ and HPV HR-negative, respectively) after being tested for HPV genotype. If we apply AVE test after HPV genotype, we can observe even finer risk discrimination such that 48% precancer/cc risk of HIV-positive and HPV 16+ patients will increase to 72% for AVE result precancer/cc and decrease to 21% for AVE result normal. The highest risk group is HPV 16+ and AVE precancer/cc, followed by other HPV-positive groups and AVE precancer/cc. The lowest risk groups are HPV HR-negative and AVE normal/indeterminate with 0–2.0% precancer/cc risk. 37 of HIV+ individuals do not have any J8 images (their images were captured by other camera types), so they are not included in this analysis. *No cases observed in these categories

Figure 5 demonstrates the use case of HPV and AVE together. The overall precancer/cancer risk of the population was 6.7%. We divided by HIV status. HIV-negative women had very low risk of CIN2 or worse, with so few cases that fine stratification by both HPV type group and AVE was not feasible (5 CIN2+ among HPV 16+, 7 CIN2+ among other hrHPV+, and 1 CIN2+ among hrHPV-). The precancer/cancer risk of HIV-positive patients varied between 48 and 1.4% after testing for HPV status (HPV 16+ and HPV HR-negative, respectively). When we added the AVE test after HPV, we observed even finer risk discrimination such that 48% precancer/cancer risk of HIV-positive and HPV 16+ patients increased to 72% for AVE result precancer/cancer and decreased to 27% and 21% for AVE result indeterminate and normal, respectively.

Additional file 1: Figure S2 presents the rank order of the individuals according to their predicted chance of having cervical precancer/cancer. The figure demonstrates the population (x-axis) ranked based on the HPV-AVE algorithm and compares what percent of the observed precancer/cancers (y-axis) in this population could be detected if a certain percent of the high-risk population is referred for management. In this figure, the HIV-positive population is demonstrated because there are more precancer/cancer outcomes permitting stratification of risk. In this example, one possible threshold or cutpoint for managing patients could be drawn at risks equal or greater than that observed for women that are HPV16+ and normal AVE. At this cutpoint, 92% of the expected precancers would be detected and treated. To achieve this sensitivity, 31% of the HIV-positive population would be referred for management.

Discussion

The data support that a combination of HPV typing and automated visual evaluation (AVE) could accurately distinguish women at different risks of cervical precancer/cancer. Current cervical cancer prevention guidelines in the US and Canada employ the principles of risk-based management [23, 24]. Applying risk-based management in resource-limited settings is paramount as resources are concentrated on patients at highest risk of cancer, and the harms of overtreatment are avoided in those at low risk [3]. Using HPV testing with genotyping and AVE for risk-based management would not require new scientific discovery, but the real-life challenge of establishing the strategy in a lower-resource setting like the Zambian public sector is of paramount importance.

As one possible strategy, the screening process could start with collecting self-sampled vaginal swabs from the general screening population [25] and evaluating of these samples by a sensitive target-amplification test like the new ScreenFire HPV test [26]. (under review). This HPV test will give genotyping results for the high-risk HPV-positive patients in the 4 hierarchical channels, which are HPV 16, else HPV 18 or 45, else HPV 31, 33, 35, 52, 58, else HPV 39, 51, 56, 59, 68. After obtaining the HPV test result, patients with high-risk HPV-positive could have an image captured by a dedicated hand-held camera (i.e., cell phone camera, digital camera, or other local choice). The cervix image would be evaluated by the AVE algorithm to give one of the three test results: normal cervix, indeterminate (neither completely normal nor a precancer/cancer), or precancer/cancer. This deep-learning based screening test, AVE, would be an assistive technology to guide clinicians in these settings, reducing the number of unnecessary referrals for treatment or workup.

The 12-level screening table would indicate each woman's chance of having precancer/cancer, and could be used by the decision-makers in the local public health and clinical authorities to create risk-action thresholds. In other words, those in charge of each setting could decide based on resources and risk tolerances how to convert the risks into actions, similar to how risks are used in current US and Canadian guidelines [23, 24].

The strengths of this study include a large dataset of images from women with and without HIV, collection of HPV testing with genotyping, availability of multiple (triplicate) cervical images, and cervical disease outcomes on all patients. The nurses performing the standard of care VIA screening aided by digital camera imaging are highly trained, and have individually performed thousands of screenings, and often act as master-trainers for colleagues across the country and region. Adding to the fact that they also have an internal quality assurance program, the quality of VIA results in this study is expected to be much higher than is reported in most studies worldwide [17, 27].

The limitations of this convenience dataset include inclusion only of women with complete data by the study end, the assumption that VIA-negative patients did not have precancer, and the potential limitations of the Onclarity HPV assay (e.g., the inclusion of a lower-risk type HPV66, grouping of HPV35 with lower risk HPV types that is inconsistent with true precancer/cancer risk in an African population [20]), and the low numbers of precancer/cancer in the HIV-negative population precluding detailed analysis. Finally, the AVE algorithm was not run on a smartphone camera itself (was run offline using high intensive graphic cards/chipsets on computers in the lab). Adapting and miniaturizing high performing AVE algorithms onto lower-cost devices is a major technical and operational requirement for this technology to be transformed into a near-patient/point-of-care clinical application [28].

AVE shows promise as an assistive technology when performing screen-triage-and treat strategies [13,14,15]. However, this and other studies indicate that AVE cannot be applied “out of the box” to new patient populations or used with different image capture devices than those used to train the original algorithm [12]. Unless setting-specific images are provided to retrain the algorithm, it will fail to distinguish precancer/cancer at a rate much higher than chance alone [12]. This study suggests that approximately 40 images (35 training + 5 validation) of each class (normal, indeterminate, precancer+) could retrain the algorithm to function in a new setting. Obtaining these data would require dedicated protocols to obtain images and pathology specimens, which in turn could require screening of several hundred to several thousand women in each new setting. Finally, there is a theoretical potential for AVE-assisted VIA as a primary visual screening approach that could replace VIA as—at least—a non-inferior alternative to HPV-based primary screening, especially in settings of great burden where HPV remains unavailable or unaffordable. However, the formal evaluation of this strategy requires rigorously conducted clinical effectiveness, cost effectiveness, and implementation science studies.

Conclusion

Given advanced understanding of cervical cancer etiology and pathogenesis, and effective prevention methods, the critical measure of success/failure is whether we actually save lives and reduce suffering from cervical cancer [10]. Cervical cancer is already and increasingly linked to deeply inequitable distribution of prevention efforts [29,30,31]. The vital need is for immediate prioritization of prevention efforts where they are lacking. The HPV typing and AVE visual triage approach has promise but will need evaluation in clinical effectiveness and implementation studies to determine if it is indeed a feasible and affordable real option in settings of great need like Zambia.