Introduction

Cervical cancer ranks fourth among gynecological malignancies, with 600,000 new cases and 340,000 deaths worldwide in 2020 [1]. As a vital tool in the diagnosis and management of cervical cancer, colposcopy has become more commonly used around the world [2, 3]. However, colposcopy practice is not yet standardized. To promote a standardized approach, the International Federation of Cervical Pathology and Colposcopy (IFCPC) proposed a series of three terminologies, in 1975, 1990, and 2002. Then, in 2011, the IFCPC nomenclature committee examined the previous IFCPC terminologies and the existing knowledge before producing the first evidence-based terminology. The 2011 terminology is more comprehensive and was recommended to replace all previous terminologies [4,5,6]. Indeed, several studies conducted since 2011 have shown that this "new" terminology does improve colposcopic accuracy, if used correctly, and is clinically practicable [7, 8]. However, very few studies have assessed the 2011 IFCPC terminology [9], and no one has systematically reviewed the best available evidence for global communities.

Even though the relatively new IFCPC standard highlights the continuous development of colposcopic technologies and our understanding of colposcopic findings, the performance of these technologies in diagnosing squamous intraepithelial lesions varies substantially [9,10,11]. Fortunately, a number of technologies have emerged, such as dynamic spectral imaging (DSI) [12], smartphones [13], artificial intelligence [14, 15], and portable pocket colposcopy [16, 17]. These all help to provide more sophisticated analysis and therefore more appropriate diagnoses. However, it remains necessary to assess colposcopy performance against the gold standard of biopsy. It is important to note that cervical biopsies take the form of a punch biopsy, endocervical curettage, or cone biopsy, all of which are invasive and are performed only in cases suspected on colposcopy. Of course, most physicians err on the side of caution, but colposcopy is a subjective process that requires skill and experience. Indeed, many clinicians and researchers alike have postulated that a comprehensive synthesis of the best available evidence would prove useful [18, 19].

In 2017, the American Society for Colposcopy and Cervical Pathology (ASCCP) organized multiple working groups to draft colposcopy standards for the United States [20]. After systematically reviewing 18 unique articles and synthesizing knowledge, the researchers recognized that there remains wide variation in both guidance and quality indicators. Crucially, the sample of studies was US-centric, and a number of studies conducted around the world may provide more generalizable insights. Therefore, it may be possible to yield more reliable findings if we assess colposcopic effectiveness from around the world, according to a set standard. Here, we systematically review (and meta-analyze) the evidence to assess the diagnostic performance of colposcopy-guided biopsies at different thresholds for the detection of histologically confirmed cervical intraepithelial neoplasia grade 2 or worse (CIN2+) according to the 2011 IFCPC terminology.

Materials and methods

The study was developed and completed in compliance with the PRISMA checklist and the study protocol was registered in PROSPERO (CRD42021293845).

Data source: search strategy and selection criteria

Relevant articles were identified using a set search strategy implemented in the following databases: PubMed, Embase, the Cochrane Library and Web of Science, from 1 January 2011 to 7 January 2023. Searching began in May 2021, with two updated searches conducted in July 2022 and January 2023, before submitting for publication.

Search terms included "uterine cervical neoplasia", "squamous intraepithelial lesions", "colposcopy", "biopsy", "pathology", "sensitivity and specificity". In addition, we performed a reference list search to ensure all available evidence could be included or at least discussed. The detailed search strategy has been provided in the Supplementary materials.

Three thousand thirty-three articles were initially identified. Articles were included if they met three criteria: (1) the results of pathological examination were obtained using punch biopsy, cone biopsy, or hysterectomy specimens; (2) the article included raw data (not just aggregated data) in the form of a table comparing colposcopic impressions to results from colposcopy-guided biopsy, with results broken down according to either of two independent histopathologic classifications: the cervical intraepithelial neoplasia (CIN) system, consisting of normal, CIN1, CIN2, CIN3 and cancer, or the Lower Anogenital Squamous Terminology (LAST) system, including normal, low-grade squamous intraepithelial lesions (LSIL), high-grade squamous intraepithelial lesions (HSIL) and cancer; and (3) all results had been determined in accordance with the 2011 IFCPC terminology.

The following were excluded: duplicate publications; reviews; editorials; studies of non-human samples; and studies that did not address colposcopy for detecting CIN. Duplicates were manually removed using EndNote software (version X9). Two authors independently screened the titles and abstracts according to these eligibility criteria, and the full texts of relevant articles were downloaded and reviewed. Any disagreement was resolved through discussion with a third author.

Data extraction and risk-of-bias assessment

Two cut-off values were set in this study. First, if colposcopy results suggested normal or benign, the patients were categorized as <LSIL, and if the results were CIN1, CIN2, CIN3, LSIL, HSIL, or cancer, we categorized patients as LSIL+. Second, if the colposcopy results of patients were considered normal, benign, CIN1 or LSIL, patients were considered <HSIL, and if the results were CIN2, CIN3, HSIL, or cancer, we considered the patients as HSIL+. To unify the criteria, we combined CIN1, CIN2, CIN3, LSIL, HSIL, and cancer into LSIL or worse (LSIL+), and combined CIN2, CIN3, HSIL and cancer into HSIL or worse (HSIL+). The predictive value of colposcopy in diagnosing CINs was based on its accuracy for detecting HSIL+ (confirmed by histopathology).
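The two cut-off rules above can be sketched in a few lines of Python. This is only an illustration of the dichotomization logic: the numeric severity ranks and the names `SEVERITY`, `CUTOFF_RANK` and `is_positive` are assumptions introduced here and are not taken from the study.

```python
# Ordered severity scale covering both histopathologic systems (CIN and LAST).
# The numeric ranks are an illustrative assumption, not values from the study.
SEVERITY = {
    "normal": 0, "benign": 0,
    "CIN1": 1, "LSIL": 1,
    "CIN2": 2, "CIN3": 2, "HSIL": 2,
    "cancer": 3,
}

# Minimum rank that counts as "positive" under each cut-off.
CUTOFF_RANK = {"LSIL+": 1, "HSIL+": 2}


def is_positive(grade: str, cutoff: str) -> bool:
    """Return True if `grade` is at or above the chosen cut-off."""
    return SEVERITY[grade] >= CUTOFF_RANK[cutoff]


# Example: the same impression classified under both cut-offs.
for grade in ("benign", "CIN1", "HSIL"):
    print(grade, is_positive(grade, "LSIL+"), is_positive(grade, "HSIL+"))
```

Under this sketch, a CIN1 or LSIL impression counts as positive at the LSIL+ cut-off but as negative at the HSIL+ cut-off, which is the distinction that drives the two sets of estimates reported later.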

Data extraction from eligible articles was performed by one author; two independent authors then compiled the data into a standardized table, while another author cross-checked the extracted information. Disagreements were resolved by a third author.

Extracted information included publication year, number of patients, time of recruitment, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). If these data were not provided directly, we back-calculated the required values from the reported data.
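As an illustration of what such back-calculation can look like, the sketch below rebuilds a 2×2 table from a reported sensitivity and specificity together with the numbers of histologically positive and negative patients, and then derives PPV and NPV. The function names and the rounding rule are assumptions made for this example; the exact procedure used for each included study may have differed.

```python
def back_calculate_2x2(sensitivity, specificity, n_diseased, n_healthy):
    """Rebuild TP, FN, FP, TN counts from reported accuracy and group sizes.

    Counts are rounded to the nearest integer (an assumption of this sketch).
    """
    tp = round(sensitivity * n_diseased)
    fn = n_diseased - tp
    tn = round(specificity * n_healthy)
    fp = n_healthy - tn
    return tp, fn, fp, tn


def predictive_values(tp, fn, fp, tn):
    """Return (PPV, NPV) from the four cells of the 2x2 table."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Hypothetical study: 100 patients with disease and 200 without.
cells = back_calculate_2x2(0.80, 0.90, 100, 200)
print(cells, predictive_values(*cells))
```

The sketch does not guard against empty rows or columns (for example, no test-positive patients), which would need handling before use on real extraction sheets.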

Study quality was assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) [21], which consists of four domains: 1) patient selection; 2) index test; 3) reference standard; and 4) flow and timing. The first three domains can also be used to assess applicability. Two authors independently assessed quality of the included reports and conflicts were resolved through discussion. Details of quality assessment can be seen in Figure S1.

Statistical analysis

Sensitivity and specificity estimates, according to different cut-off values for colposcopic diagnosis, were calculated by cross-tabulation. Forest plots were generated for each test, with the corresponding 95% confidence intervals (95% CIs). Pooled estimates for test accuracy are presented graphically with Summary Receiver-Operating Characteristic (SROC) curves. Summary points, the area under the receiver operating characteristic curve (AUC) with 95% CIs, and prediction contours are also described. The risk of publication bias was assessed using a funnel plot and Egger’s regression test. STATA (version 16.0) and RevMan (version 5.4) were used for all data analyses.
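For readers who wish to reproduce the per-study estimates, the cross-tabulation step can be sketched as follows. The analyses in this study were run in STATA and RevMan; the Python functions below are hypothetical, and the Wilson score interval is an assumption chosen for illustration, since the interval method is not stated above.

```python
import math


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def sens_spec(tp, fn, fp, tn):
    """Sensitivity and specificity, each as (estimate, lower, upper)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": (sens, *wilson_ci(tp, tp + fn)),
        "specificity": (spec, *wilson_ci(tn, tn + fp)),
    }


# Hypothetical 2x2 table for a single study.
for name, (est, lo, hi) in sens_spec(tp=80, fn=20, fp=20, tn=180).items():
    print(f"{name}: {est:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

These per-study values correspond to what a forest plot displays; pooling across studies requires a dedicated meta-analytic model and is not attempted in this sketch.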

Results

Study characteristics

A total of 3,033 records were identified, of which 1,312 published before 2011 were excluded. Of the remaining 1,721 records, 713 were excluded as duplicates, leaving 1,008, and a further 863 were excluded at the initial screening of abstracts. The remaining 145 full-text articles were assessed, and a further 130 studies were excluded for not fulfilling our predetermined inclusion criteria (Fig. 1). Finally, a total of 15 articles [7,8,9, 22,23,24,25,26,27,28,29,30,31,32] with 22,764 participants that met the criteria were included in the study.

Fig. 1 Selection and inclusion process of included studies

Fifteen studies met the data requirements for meta-analysis of <LSIL compared with LSIL+, and of <HSIL compared with HSIL+. Sensitivity and specificity were calculated independently for each of these two thresholds. Overall likelihood ratios, SROC curves, sensitivity, specificity, and AUCs were calculated. Two SROC curves were estimated by meta-analysis of the sensitivity and specificity of all included studies, and reflect the overall performance of colposcopy as a diagnostic tool.

Additionally, the calculated AUCs could be used to compare the two thresholds. Table 1 summarizes the demographic characteristics of the participants in the included articles. Tables 2 and 3 report the sensitivity and specificity with their 95% CIs, as well as the numbers of true positives, true negatives, false positives, and false negatives, for each of the two thresholds in the 14 included publications; these are all important indicators of diagnostic accuracy.

Table 1 Participant demographics for 15 included studies
Table 2 Effectiveness of colposcopy in distinguishing <LSIL from LSIL+
Table 3 Effectiveness of colposcopy in distinguishing <HSIL from HSIL+

Figure 2 presents the SROC curves with prediction and confidence contours and the AUCs with 95% CIs. Figures 3 and 4 present the sensitivity and specificity data with 95% CIs from each study at the different cut-offs reported. Figure S2 presents Deeks’ funnel plot asymmetry test. The p-value for Deeks’ asymmetry test was 0.06, and Egger’s regression test (P = 0.064) was likewise not statistically significant, suggesting no evidence of publication bias.

Fig. 2 Sensitivity and specificity reported for diagnostic colposcopic impression in 14 studies (each study represented by a point in the figure) relative to the gold standard of biopsy for distinguishing A <LSIL from LSIL+; B <HSIL from HSIL+. The solid line in the graph shows the receiver operating characteristic curve determined from regression analysis

Fig. 3 Sensitivity and specificity reported for distinguishing <LSIL from LSIL+

Fig. 4 Sensitivity and specificity reported for distinguishing <HSIL from HSIL+

Pooled performance of colposcopy under two thresholds

With LSIL+ as the cut-off, the pooled sensitivity was 0.92 (95% CI 0.88–0.95) and the pooled specificity was 0.51 (95% CI 0.43–0.59). SROC analysis confirmed the ability of colposcopy to distinguish <LSIL from LSIL+, with an AUC of 0.82 (95% CI 0.78–0.85). The I² was 96.27% for sensitivity and 99.21% for specificity. With HSIL+ as the cut-off, the pooled sensitivity was 0.68 (95% CI 0.58–0.76) and the pooled specificity was 0.93 (95% CI 0.88–0.96). SROC analysis confirmed the ability of colposcopy to distinguish <HSIL from HSIL+, with an AUC of 0.89 (95% CI 0.86–0.91). The I² was 98.04% for sensitivity and 99.21% for specificity.
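To make the trade-off between the two thresholds more tangible, the pooled point estimates above can be converted into approximate likelihood ratios. The sketch below is a back-of-the-envelope calculation only: it applies the standard definitions to the pooled point estimates, whereas likelihood ratios estimated directly by a bivariate meta-analytic model may differ slightly. The function name `likelihood_ratios` is introduced here for illustration.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


# Pooled point estimates reported above for each cut-off.
pooled = {"LSIL+": (0.92, 0.51), "HSIL+": (0.68, 0.93)}

for cutoff, (sens, spec) in pooled.items():
    lr_pos, lr_neg = likelihood_ratios(sens, spec)
    print(f"{cutoff}: LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

On these point estimates, the positive likelihood ratio is roughly 1.9 at the LSIL+ cut-off and roughly 9.7 at the HSIL+ cut-off, while the negative likelihood ratio is roughly 0.16 and 0.34, respectively. This is consistent with LSIL+ being the better threshold for ruling disease out and HSIL+ the better threshold for ruling it in.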

Discussion

This study aimed to assess the accuracy of colposcopy based on histopathologic findings according to the 2011 IFCPC terminology, which might help to generate generalizable findings by synthesizing evidence from studies using the same terminology around the world. We systematically reviewed (and meta-analyzed) evidence to assess the performance of colposcopy at different thresholds. Our results suggest that the sensitivity of colposcopic diagnosis is high (0.92), although with relatively low specificity (0.51), when LSIL+ is adopted as the cut-off value. Conversely, using HSIL+ as the cut-off value appears to lower the sensitivity (0.68) and raise the specificity (0.93). AUC analysis also indicates a higher level of overall accuracy using HSIL+ as the threshold (0.89 vs 0.82). Quality assessment suggests that the included studies are of high quality, and there was no apparent publication bias, despite the limited number of studies included.

The wide ranges of values for both sensitivity and specificity found across the 15 included studies were very similar to the ranges reported in reviews by Brown et al. [35] and Underwood et al. [36]. Findings on overall sensitivity (0.68) and specificity (0.93) when using HSIL+ as the cut-off value were also comparable to those of previous studies [37,38,39]. The sensitivity of colposcopy for HSIL+ has been reported to range from 49 to 61% [40], and specificity from 79 to 96.5%. Given that our results are within these ranges, approximately 40% of CIN2+ cases are missed at initial colposcopy when using this threshold. This is far too high and requires our immediate attention, because late diagnosis limits the number and efficacy of treatment options. However, previous research also showed that over one-third of all CIN2+ cases would progress to cervical cancer over a period of 10 to 15 years [41], while the chance of missed lower-grade lesions progressing to invasive disease was small, which justifies advocating HSIL+ as the more clinically meaningful cut-off value regardless of the lower sensitivity.

In this study, the diagnostic performance for detecting CIN2+ was calculated for colposcopic impressions using both cut-off values, i.e., LSIL+ and HSIL+. When the cut-off value of colposcopic impressions was LSIL+, the lower specificity was to be expected, as some of the patients with a low-grade colposcopic diagnosis may not have pathologic CIN2+. Biopsy of all suspected lesions, i.e., LSIL+, appears to result in the highest sensitivity for detecting CIN2+ and is the main biopsy strategy used in low- and middle-income countries (LMICs). For example, this approach is commonly used in China, because it is difficult to accurately grade LSIL and HSIL lesions. Evidence from this study therefore supports a cut-off value of low-grade or worse (LSIL+) to reduce the number of missed CIN2+ cases, even though specificity drops.

It should be noted that verification bias is a particular problem in studies of colposcopy. This is because of the process involved and the economic pressures health systems face. Biopsies are only performed in suspected cases and, as a result, biopsy often becomes a process of verification rather than investigation. If no biopsy is taken when colposcopy results are negative, false negatives go undetected and the apparent sensitivity may approach 100%, which spuriously reaffirms clinical decisions. Additionally, spectrum bias might occur due to diversity in disease prevalence. Even though the incidence of a disease would not change sensitivity and specificity calculations within the test population, this can affect a sample consisting of disease-negative participants [35]. Again, this has a knock-on effect and creates ambiguity from screening to diagnosis, which causes unnecessary anguish, raises the cost of healthcare, and ultimately costs lives.
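The effect of partial verification on sensitivity can be illustrated with simple arithmetic. The sketch below uses a hypothetical cohort and a hypothetical function, `apparent_sensitivity`; it assumes that every colposcopy-positive woman is biopsied and that only a given fraction of colposcopy-negative women are, so that only that fraction of false negatives is ever observed.

```python
def apparent_sensitivity(tp, fn, verified_negative_fraction):
    """Sensitivity as observed when only a fraction of colposcopy-negative
    women are biopsied, so only that fraction of false negatives is found."""
    if not 0.0 <= verified_negative_fraction <= 1.0:
        raise ValueError("verified_negative_fraction must lie in [0, 1]")
    observed_fn = fn * verified_negative_fraction
    return tp / (tp + observed_fn)


# Hypothetical cohort: 100 women with CIN2+, of whom colposcopy flags 68.
tp, fn = 68, 32
for fraction in (1.0, 0.5, 0.0):
    observed = apparent_sensitivity(tp, fn, fraction)
    print(f"negatives biopsied: {fraction:.0%}, apparent sensitivity: {observed:.2f}")
```

With full verification the true sensitivity of 0.68 is recovered; as the fraction of biopsied negatives falls, the apparent sensitivity rises, reaching 1.00 when no colposcopy-negative woman is biopsied.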

While colposcopy is increasingly common in cervical cancer screening, it is, as we have mentioned, a subjective procedure. A number of researchers have found that correlations between colposcopy and histopathology are all too often misleading and generally unsatisfactory [42]. For example, Ruan et al. [27] found that colposcopy often underestimates the occurrence of squamous intraepithelial lesions when using biopsy as the pathologic gold standard. By contrast, Tatiyachonwiphut et al. [38] found that colposcopic diagnoses more often overestimate the incidence of cervical pathologies. These discrepancies might be due to the use of different colposcopic thresholds and methods; however, evidence from this meta-analysis suggests that LSIL+ may be selected as the cut-off value for directing biopsy in areas with underdeveloped colposcopy, while HSIL+ can be selected as a cut-off value to avoid unnecessary diagnosis. Moreover, digital colposcopy has potential value in providing accurate and objective measurements of a number of cervical features. Some studies have reported high sensitivities and specificities for digital colposcopy compared with traditional colposcopy [43], although its wide application may be limited by relatively high purchase and maintenance costs, an important factor especially in lower-resource areas.

One of the reasons for conducting this systematic review was to provide support for global communities. The distribution of medical resources is uneven, so there are fewer senior colposcopists in LMICs, which directly affects women’s health and well-being [40]. Therefore, to meet the challenge of eliminating cervical cancer in LMICs, studies exploring feasible methods to improve the diagnostic performance of colposcopy are needed. Bekkers et al. [44] found that junior colposcopists were significantly more likely to require biopsy than more senior colposcopists. Due to a lack of experience, junior colposcopists tend to order biopsies when in doubt. Conversely, the increased confidence in colposcopic assessment displayed by more senior colposcopists might result in higher positive predictive values, but this is often at the expense of lower sensitivity. Junior colposcopists might thus not be able to identify HSIL+ cases accurately based on colposcopic images, and therefore prefer to perform more biopsies for final confirmation.

In these circumstances, LSIL+ should be used as the cut-off value to reduce the number of false negatives. Since colposcopic biopsy is the gold standard for diagnosing cervical cancer, it is important to improve the accuracy of colposcopy to improve identification processes. Specific training is compulsory before practitioners can be certified as colposcopists in some countries [44]. In LMICs, the quality of colposcopists could be effectively improved by increasing the amount and standard of training and by giving more professional guidance to uncertified colposcopists. In addition, enhancing quality control and advocating novel training methods, such as widely applicable teaching equipment [45] and training software [46], could also help to enhance colposcopists’ skills. This study suggests that skills vary substantially and that the application of the IFCPC terminology may also vary, which requires further research.

This systematic review identified gaps in our knowledge and some methodological issues that should be considered in future studies of cervical screening. Standardizing the evaluation of colposcopy based on the 2011 IFCPC terminology could not only help to provide a reference for colposcopists, but also highlight emerging techniques for assessment [47]. Even though there are many alternative options for cervical cancer screening, Sawaya et al. [48] suggest that studies directly assessing the accuracy of screening tests or comparing test results with colposcopy are inconclusive. In limited-resource settings [49], objective methods, such as molecular HPV testing, may be more appropriate. Until now, some of these methods have not been universally accepted due to concerns about health resource conditions in certain areas such as sub-Saharan Africa [50]. Therefore, colposcopy remains fundamentally important in high-income countries and increasingly useful in LMICs.

Strengths and limitations

While this is the first systematic review with meta-analysis of the diagnostic value of colposcopy based on the latest version of the IFCPC terminology, there were some limitations that could not be avoided. First, because of the strict screening criteria used in this study, the number of studies included in this analysis was relatively small. This clearly reduces the generalizability of the findings and means that recommendations can only be tentative. We attempted to quantify the diagnostic performance of colposcopy under two cut-off values, which underpin its utility in clinical practice. However, biases in the design of the included studies also made the interpretation of our findings less than certain. It also emerged that variability in practice is apparent and that we were unable to plan for or extract information on differences in experience and skill. This certainly requires further attention, and would perhaps be best examined by health economists and medical educationalists, who could assess the impact of educational strategies on patients.

Conclusions

This meta-analysis confirmed the diagnostic value of colposcopy as an effective tool for diagnosing cervical lesions. High sensitivity was observed with the LSIL+ cut-off, while high specificity was observed with the HSIL+ cut-off for squamous intraepithelial lesions. These findings might be used to provide guidance for future clinical practice and colposcopic research. However, further research into the impact of colposcopy-based educational strategies is required.