Introduction

Over the last 20 years, mammographic screening has become widespread in most developed countries, with the aim of reducing breast cancer mortality through early detection of the disease. However, the organisation and delivery of screening vary across geographic regions in ways that may influence its effectiveness. Most countries in Europe offer population-based screening programmes following the recommendations of the European guidelines, with defined screening intervals and target populations related to age [1]. In the US, although several organisations recommend routine screening [2, 3], actual screening practices vary by personal and medical provider preferences. Access to care also varies, for instance, by insurance status. Most commonly, screening is opportunistic, in response to recommendations made during a routine medical consultation or on the basis of a possible increased risk of developing breast cancer [2].

Comparing accuracy measures across different organisational models for mammographic screening provides valuable information that can be used to guide clinical practice and breast cancer screening policy. Only a few studies have compared such measures across organisational models for breast cancer screening and most have focused on the sensitivity, specificity, and cancer rates [4–8]. Positive predictive value (PPV), the proportion of women either recalled or who undergo a biopsy and are subsequently diagnosed with cancer, is reported less often.

This study utilises data collected in Norway and Spain as part of the service screening programmes and in the US by the Breast Cancer Surveillance Consortium (BCSC), a large on-going study of mammographic screening performance in community practice. We estimated accuracy measures for mammographic screening in the three countries with the aim of comparing the sensitivity and specificity as well as PPV.

Materials and methods

Information from 898,418 screened women in the US (1996–2008), 527,464 in Norway (1996–2007), and 517,317 in Spain (1996–2009) was included in the study. All women were aged 50 to 69 years at screening. These women contributed a total of 5,713,594 screening exams.

Screening organisation

No organised mammography screening programme exists in the US (Table 1). Screening is performed opportunistically, typically according to the guidelines of organisations such as the American Cancer Society [3] and the US Preventive Services Task Force [2]. Under the Patient Protection and Affordable Care Act, all health insurance plans must cover screening mammography with no patient cost sharing [9]. All facilities performing screening mammography are accredited by the US FDA and follow the regulations set forth by the Mammography Quality Standards Act [10]. Recommendations generally call for initiation of screening at age 40 or 50 and continuation until at least age 74 [2]. The recommended screening interval also varies, with some organisations calling for annual and others for biennial screening. In consultation with their medical providers, women choose to receive screening mammography according to personal preference. Screening mammography typically consists of two-view (mediolateral oblique and craniocaudal views) bilateral examinations. Radiologists’ assessments and recommendations are based on the American College of Radiology’s Breast Imaging Reporting and Data System (BI-RADS®) [11], and examinations are typically read by an individual radiologist, sometimes with the use of computer-aided detection (CAD). Full-field digital mammography (FFDM) began diffusing into community practice after FDA approval of the first FFDM machines in 2000. As of December 2009, 60 % of accredited mammography facilities in the US were using FFDM [12].

Table 1 Description of the main characteristics of breast cancer screening organisation in the US, Norway, and Spain

Both Norway and Spain adhere to the European Guidelines for Quality Assurance in Mammographic Screening [1], which recommend biennial invitation to mammography screening for women aged 50 to 69 years.

The Norwegian Breast Cancer Screening Programme is administered by the Cancer Registry of Norway, which is also responsible for the surveillance and quality assurance of the programme. The Programme started in 1996 and became nationwide in 2005. The participation rate was 76.2 % [13]. Women are invited to two-view bilateral mammography by a personal invitation letter, regardless of cancer history. Women are screened at stationary and mobile units. The programme performs independent reading with consensus/arbitration. A score of 1–5 is given for each breast, by each radiologist, where 1 indicates a negative screening examination and 5 a high likelihood of malignancy. All cases with a score of 2 or higher given by one or both readers are discussed at a consensus meeting where the final decision of whether to recall the woman is made. Recall examinations take place at one of the 16 screening centres. FFDM was implemented gradually in Norway, from 2000 to 2011. As of the end of 2008, 48 % of the screening mammograms were performed with FFDM [14].

The Spanish Breast Cancer Screening Programme started in 1990 and was nationwide by 2006. The overall participation rate was 74.0 % [15]. In Spain, breast cancer screening is government funded. Women are actively invited to participate in population-based mammography screening by an invitation letter. The standard procedure for radiological performance in Spain is two-view mammography with double reading. The BI-RADS® [11] scale or equivalent is used to rate the probability of cancer. Women with positive mammographic findings, scored as 3, 4, 5, or 0, are recalled for further assessments to confirm or rule out malignancy at reference hospitals of each screening area. From 2004 onwards, FFDM was gradually introduced in Spain. As of December 2009, digital mammograms represented 25.7 % of screening tests.

Data sources

The study is based on data from the BCSC in the US, the Cancer Registry in Norway, and the updated database of the Cumulative False Positive Risk (CFPR) study in Spain [16].

The BCSC is a consortium of breast imaging registries throughout the US linked to population-based cancer registries. These registries collect information from community mammography facilities on mammography examinations and patient risk factors. The study included data on screening examinations in 1996–2008 captured by seven regional registries from diverse geographic locations that have previously been used to describe the distribution of screening mammography accuracy in the US [17]. Subsequent breast cancer diagnoses were obtained by linking BCSC data to pathology databases, regional Surveillance, Epidemiology, and End Results (SEER) programmes, and state tumour registries. Data were pooled at a central Statistical Coordinating Center. Data for this study were obtained from the BCSC Research Resource [18].

Screening data from Norway include examinations from women screened throughout the country between 1996 and 2007. Data from screening in Spain were drawn from an anonymised database that gathers information from eight screening areas. The database was originally created in 2006 for the CFPR Study [16] and was subsequently updated [19, 20]. The study includes data from women screened between 1996 and 2009.

All BCSC registries and the BCSC Statistical Coordinating Center received Institutional Review Board approval for active or passive consenting processes or a waiver of consent to enrol participants, link data, and perform analysis. All procedures were Health Insurance Portability and Accountability Act compliant, and registries and the Coordinating Center received a Federal Certificate of Confidentiality and other protections for the identities of women, physicians, and facilities. Data collection in Norway followed the regulations of the Cancer Registry of Norway and no ethical committee approval was necessary since all data received were aggregated. Data collection in Spain was performed following a study protocol approved by the institutional review boards at all participating screening areas.

Definitions

For US women, a screening mammogram was defined as bilateral mammograms with screening indication performed on women without a personal history of breast cancer or breast augmentation who had not received mammography within the prior 9 months. In Norway and Spain, all mammograms performed on women attending the population-based screening programme were considered screening mammograms.

A recall was defined as abnormal findings on the screening mammogram, leading to a recall for further assessment. Based on the findings of the imaging workup, women were referred back to screening or for an invasive procedure [fine-needle aspiration cytology (FNAC), core needle biopsy, or an open biopsy]. Short-term follow-up at 6 months after the screening examination is sometimes recommended in the US but is not recommended in Spain and in Norway where further assessment takes place and concludes within 4 months of the screening examination. For the BCSC cohort, a recall was defined as a BI-RADS assessment of 0, 4, or 5 [11].

For all three countries a false-positive recall was defined as a recall for further assessment where no breast cancer was confirmed, regardless of the procedures performed. A false-positive screening result may also include an invasive procedure with benign morphology, referred to as a false positive with invasive procedures. A screen-detected cancer was defined as ductal carcinoma in situ (DCIS) or invasive breast cancer diagnosed as a result of further assessment due to abnormal findings on the screening mammograms.

In the BCSC data, a positive screening result was defined as false positive if no cancer was diagnosed within 12 months of the screening examination and prior to the next screening mammogram. All cancers diagnosed within 12 months of a positive screening mammogram and prior to the next screening mammogram were considered screen-detected. An interval cancer was defined as a breast cancer detected within 12 months after a negative screening mammogram and prior to the next screening mammogram.

In Norway and Spain, false-positives and screen-detected cancers were defined based on cancers diagnosed as a result of further assessment conducted following the screening mammogram. An interval cancer was defined as a breast cancer diagnosed within 730 days after a negative screening examination, with or without an invasive procedure, and before the next screening examination.
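The two interval-cancer definitions above differ mainly in the length of the follow-up window. The following is a minimal sketch of that rule, not the study's actual code: the function name, the parameter names, the inclusive boundary, and the reading of "12 months" as 365 days are all assumptions made for illustration.

```python
from typing import Optional

US_WINDOW_DAYS = 365      # assumption: "12 months" taken as 365 days
EUROPE_WINDOW_DAYS = 730  # 730 days, as stated for Norway and Spain


def is_interval_cancer(
    days_to_diagnosis: int,
    window_days: int,
    days_to_next_screen: Optional[int] = None,
) -> bool:
    """Classify a cancer diagnosed after a NEGATIVE screening examination.

    days_to_diagnosis: days from the negative screen to the diagnosis.
    window_days: length of the follow-up window for the country.
    days_to_next_screen: days to the next screening examination, if any.
    """
    # Diagnosis must fall inside the follow-up window (inclusive: assumption).
    within_window = 0 < days_to_diagnosis <= window_days
    # Diagnosis must also precede the next screening examination.
    before_next = (
        days_to_next_screen is None or days_to_diagnosis < days_to_next_screen
    )
    return within_window and before_next
```

Under these assumptions, a cancer diagnosed 400 days after a negative examination would count as an interval cancer with the 730-day window but not with the 12-month window, which illustrates why the two definitions can yield different sensitivity estimates.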

Sensitivity was defined as the number of screen-detected cancers divided by the number of screen-detected cancers plus interval cancers, while specificity was defined as the number of true-negative screening examinations divided by the number of true-negative tests plus false positives.
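As an illustration of these two definitions, the sketch below computes both measures from raw counts. The function names and the example counts are hypothetical and are not taken from the study data.

```python
def sensitivity(screen_detected: int, interval_cancers: int) -> float:
    """Screen-detected cancers / (screen-detected + interval cancers)."""
    return screen_detected / (screen_detected + interval_cancers)


def specificity(true_negatives: int, false_positives: int) -> float:
    """True-negative examinations / (true negatives + false positives)."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts, for illustration only.
example_sensitivity = sensitivity(screen_detected=80, interval_cancers=20)
example_specificity = specificity(true_negatives=950, false_positives=50)
```

Note that the denominator of sensitivity depends on how interval cancers are counted, so the follow-up window used for interval cancers feeds directly into the estimate.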

Rates were defined as the number of cases per 1000 screening examinations. PPV-1 was defined as the number of screen-detected breast cancers divided by the number of recalls due to positive mammographic findings. PPV-2 refers to recalled examinations including invasive procedures or (in the BCSC) recommendation for invasive procedures. The number of women who needed to be recalled and to undergo an invasive procedure to detect one breast cancer was estimated by taking the inverse of PPV-1 (1/PPV-1) and PPV-2 (1/PPV-2), respectively.
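The rate and PPV definitions above reduce to simple ratios. The following is a minimal sketch with hypothetical function names and made-up counts; it assumes that PPV-2 uses the same numerator as PPV-1 (screen-detected cancers), with invasive procedures as the denominator.

```python
def rate_per_1000(cases: int, screening_exams: int) -> float:
    """Number of cases per 1000 screening examinations."""
    return 1000.0 * cases / screening_exams


def ppv(screen_detected: int, positives: int) -> float:
    """Screen-detected cancers divided by the number of positives.

    For PPV-1, `positives` is the number of recalls; for PPV-2 (assumed
    here), it is the number of recalls that included an invasive procedure.
    """
    return screen_detected / positives


def number_needed(ppv_value: float) -> float:
    """Women recalled (or biopsied) per cancer detected: 1 / PPV."""
    return 1.0 / ppv_value


# Hypothetical counts, for illustration only.
ppv1 = ppv(screen_detected=50, positives=500)   # 0.10
ppv2 = ppv(screen_detected=50, positives=125)   # 0.40
recalls_per_cancer = number_needed(ppv1)
biopsies_per_cancer = number_needed(ppv2)
```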

Statistical analysis

We included all screening mammograms performed on eligible women during the study period, including multiple screening mammograms for some women. We used generalised estimating equations (GEE) to account for within-woman correlation by means of the robust Huber-White (sandwich) variance estimator [21]. The z-test was used to examine differences in accuracy measures between countries. P-values < 0.05 were considered statistically significant.
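For readers unfamiliar with the z-test, the sketch below shows a plain two-proportion z-test with an unpooled (Wald) standard error. It is a simplified stand-in and not the analysis used in the study: it treats every examination as independent, whereas the GEE approach described above adjusts the variance for repeated examinations of the same woman. The function name and the counts are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference p1 - p2.

    x1, x2: number of events (e.g. recalls) in each group.
    n1, n2: number of examinations in each group.
    Assumes independent observations (no within-woman correlation).
    """
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z_stat, p_val = two_proportion_z_test(60, 100, 40, 100)
significant = p_val < 0.05
```

Ignoring the clustering would generally make the standard error too small when the same woman contributes several examinations, which is the reason for the robust variance estimator.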

Estimates of sensitivity and specificity were stratified by several factors: first or subsequent screen, calendar year of the screening mammogram, age at screening, and screening modality [screen-film mammography (SFM) or FFDM]. Cancer detection rates, false-positive rates, PPV-1, and PPV-2 were stratified by the time since the last screening mammogram (<18 months, 18 to 30 months, >30 months). The 95 % confidence intervals (95 % CIs) were calculated.

Analyses were conducted using R v.3.0.0 (US), STATA v.12 (Spain), SPSS v.12.0 (Spain and Norway), and SAS (Norway).

Results

The study included information about 5,713,594 screening examinations from 1,943,199 women screened in 1996–2009 at age 50 to 69 years. Overall, 26,430 cancers were screen-detected and 6,756 emerged as interval cancers.

Tables 2 and 3 show overall measures of screening accuracy in the three countries. The highest rate of screen-detected cancers was found in Norway, followed by the US and Spain [5.5 cancers per 1,000 screening mammograms, 4.5 ‰ and 4.0 ‰, respectively (p < 0.001)], which is equivalent to 181.5, 223.0, and 247.4 screening examinations needed to detect one cancer, respectively. The highest rate of DCIS was observed in the US, followed by Norway and Spain [1.1 ‰, 0.9 ‰, and 0.7 ‰, respectively (p < 0.001)]. The highest sensitivity was reported in the US, followed by Spain and Norway (83.1 %, 79.0 %, and 75.5 %). Conversely, the highest specificity was found in Norway, followed by Spain and the US (97.1 %, 96.2 %, and 91.3 %). PPV-1 was 16.4 % in Norway, 9.8 % in Spain, and 4.9 % in the US, which implies that 6.1 women were required to undergo further workup to detect one cancer in Norway, 10.2 in Spain, and 20.3 in the US. PPV-2 was 39.4 % in Norway, 38.9 % in Spain, and 25.9 % in the US.
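The conversions quoted in this paragraph can be approximately reproduced from the rounded figures. The sketch below uses hypothetical helper names. Because it starts from the rounded rates and PPVs reported in the text, its results differ slightly from the published values, which were presumably computed from unrounded estimates.

```python
def screens_per_cancer(rate_per_1000: float) -> float:
    """Screening examinations needed to detect one cancer."""
    return 1000.0 / rate_per_1000


def recalls_per_cancer(ppv_percent: float) -> float:
    """Women recalled per cancer detected, from PPV-1 given in percent."""
    return 100.0 / ppv_percent


# Rounded values as reported in the text above.
reported_rates = {"Norway": 5.5, "US": 4.5, "Spain": 4.0}  # per 1000 exams
reported_ppv1 = {"Norway": 16.4, "Spain": 9.8, "US": 4.9}  # percent

screens = {c: screens_per_cancer(r) for c, r in reported_rates.items()}
recalls = {c: recalls_per_cancer(p) for c, p in reported_ppv1.items()}
```

With the rounded inputs, Spain's 4.0 per 1000 gives 250 examinations per cancer, against the published 247.4.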

Table 2 Number and rate (per 1000 screening examinations) of screen-detected, interval cancer and false-positive screening examinations (per 100 screening examinations) in mammographic screening performed in the US, Norway, and Spain
Table 3 Sensitivity, specificity, positive predictive value of recalls (PPV-1) and invasive procedures (PPV-2) in mammographic screening performed in the US, Norway, and Spain

Stratification revealed differences between the countries for both sensitivity and specificity (Table 4) that were consistent with the overall measures observed in Table 3. In all strata, sensitivity was higher in the US than in Norway and Spain, and specificity was lower in the US than in Norway and Spain. However, there were some notable patterns in the differences within specific strata. Specifically, differences in specificity between countries were larger at the first compared to subsequent screenings. The smallest differences for sensitivity, but not for specificity, were detected in women aged 60–69 (83.8 %, 82.1 %, and 79.2 % in the US, Spain, and Norway, respectively). Differences in sensitivity among the US, Norway, and Spain were greater with FFDM (85.8 %, 73.5 %, and 76.4 %, respectively). The only exception to this pattern was found in the first years of the study period, where the sensitivity in Spain was higher than in the US.

Table 4 Sensitivity and specificity by type of screening history (first or subsequent), year, age at screening examination, and type of mammography (screen-film mammography, SFM, or full-field digital mammography, FFDM)

In both Norway and Spain, the largest percentage of subsequent examinations was performed 18–30 months after the prior examination (95.8 % and 93.9 %, respectively), whereas in the US 68.5 % of screening tests were performed within 18 months of the prior test (Table 5). For all countries, the rates of screen-detected cancers, invasive cancers, and false positives, as well as PPV-1, increased with the time since the prior screening test. PPV-1 for mammograms performed 18–30 months after the last screening was 5.2 %, 19.4 %, and 11.6 % in the US, Norway, and Spain, respectively.

Table 5 Number and rates (per 1000 screening examinations) of screen-detected cancer, false-positive screening examinations (n, per 100 screening examinations) and positive predictive value of recalls (PPV-1) and invasive procedures (PPV-2) by time since last screening examination

Discussion

We compared accuracy measures for mammographic screening performed in community practice in the US and through population-based screening programmes in two European countries. The highest specificity and PPV were found in the European population-based screening programmes, whereas the highest sensitivity was found in the US. The results suggest that the opportunistic approach with annual mammography requires more interventions to detect one cancer compared with biennial screening in organised programmes.

Opportunistic screening is known to be more interventionist than population-based approaches, which translates into a higher number of recalls and false-positive results, as reported in prior studies comparing screening performance indicators between the US and Europe [4, 7]. Different explanations for the higher recall rates in the US have been proposed. First, the use of a single reading and the lower radiologist interpretive volume required in the US for accreditation may partially explain the differences [12]. A minimum interpretive volume of 5000 mammograms per radiologist per year is recommended in Europe [1]. These programmatic characteristics, however, have not been associated with a decrease in the sensitivity or cancer detection rate [22]. A second explanation is that the threat of lawsuits for malpractice in the US might induce radiologists to order further tests and procedures aiming to decrease the number of missed cancers [23, 24]. Finally, differences exist with respect to the targets for recall rates: European guidelines recommend <3 % of mammograms should result in recalls [1] while the BI-RADS recommendations in the US are 5–10 % [11]. In spite of organisational similarities between Norway and Spain, we observed differences in PPV-1 as a result of the higher detection rates and lower false-positive rates in Norway than in Spain. The smaller cancer detection rate observed in Spain can be partly attributed to the lower background breast cancer incidence in this country in comparison with Norway and the US [25]. Different definitions of recall for invasive procedures among the US, Norway, and Spain may also partly explain the lower PPV-2 values reported in the US.

The higher sensitivity in the US in comparison to Norway and Spain can be partially explained by the different screening periodicity. It could also be affected by other factors such as the test sensitivity. However, identifying the specific contributing factors is beyond the scope of this study. Because screening examinations in the US were mostly annual, women there were less likely to develop an interval cancer between examinations, which was directly reflected in sensitivity. Unfortunately, because of differences in screening practices between the US and the European screening programmes, we were not able to compare the sensitivity for women screened every 2 years in the three countries. Nevertheless, prior work comparing the sensitivity between the US (based on data from Vermont and North Carolina) and Norway indicated that the sensitivity for 2-year screening intervals in these US regions was almost the same as that in Norway [4, 5, 8].

The trend of higher sensitivity and lower specificity in the US vs. Spain and Norway persisted across all strata investigated. Differences observed in first screenings, both for the percentage of mammograms and for sensitivity and specificity, could be related to differences in recommendations for screening initiation. In the US, some organisations and providers recommend that women begin screening at age 40 [3]. As a result, relatively few women have their initial screening at age 50 or older. The comparison of characteristics of first screening examinations is thus confounded by age at first screening.

When comparing cancer rates among women who attended screening with an 18–30-month interval, the values from the US and Norway, countries with similar background breast cancer incidence [25], were similar. However, PPVs continued to differ. This reflects the fact that, while cancer detection is mainly dependent on background incidence, PPV is more sensitive to variations in radiological practice and the organisation of the screening programme.

Our study has some limitations. First, we have taken a descriptive approach based on aggregate data, and therefore we did not control for potential confounders like individual age. However, in an attempt to make data more comparable across countries, we restricted the study population to women aged 50–69. Second, despite using consensual definitions of the screening terms, some unavoidable differences remained such as the definition for interval cancer in the US and Europe, which directly affects the reported screening sensitivity. In the US, estimates of sensitivity and/or interval cancers based on 2-year follow-up after the screening mammogram would be biased because the majority of women return for another screening examination at a 1-year interval. However, the results represent variability in the radiological performance between countries, which is the main objective of the study. Third, some overestimation of sensitivity estimates in Spain cannot be discounted since there is a lack of a nationwide cancer registry, which may result in some missed interval cancers. However, the mechanisms for identifying interval cancers have improved over time [26], which is reflected in a decrease in sensitivity estimates.

In summary, the opportunistic approach to screening in the US is more interventionist, resulting in more frequent follow-up evaluations and shorter screening intervals than the European population-based approaches. This translates into a somewhat higher sensitivity of screening mammography in the US but at the cost of higher clinical burden on the women. Population-based approaches stress the balance between sensitivity and specificity, aiming to decrease the clinical burden—and the related harms and costs—to participating women.