Many current cohorts of women assembled for cancer research in the USA are insufficient for examining the factors, from biology to the environment, that are associated with breast cancer risk among a diverse group of women. Most of these ongoing cohorts have extensive questionnaire risk factor data collected for all participants, yet they are made up predominantly of White women. Smaller subcohorts have blood samples for prospective analysis of hormones, metabolic markers, and DNA [1]. These cohorts typically have mammograms retrieved on only a small subset of the participants, if at all [2, 3]. For example, the Mayo Mammography Health Study includes 19,924 women seen at the Mayo Clinic mammography service from 2003 to 2006, with breast cancer risk factor measures and follow-up through electronic health records, but it lacks racial diversity [4]. Further, despite growing evidence that mammographic breast density and additional markers of parenchymal texture [5] are strong risk factors for breast cancer [6,7,8], few studies integrate repeated mammography measures [9] with questionnaire risk factors and blood-based markers. Even fewer cohorts routinely integrate breast tissue from benign and malignant lesions [1, 10, 11]. To address gaps in the racial and ethnic composition of cohorts, newer studies such as the Black Women’s Health Study [12] have been established, yet these face similar challenges in assembling tissue samples and image data. That study has, however, provided valuable insights into risk prediction in Black women [13]. A recent validation of an AI-based mammography breast cancer risk model included 7 data sets and relied heavily on Emory for mammograms from African American women [14]. Thus, a resource gap exists that limits epidemiologic investigations and validation of risk prediction models.
As Potter noted over 15 years ago [1], the integration of all these data sources is essential to fully capitalize on genomics, proteomics, geographic and environmental measures, and tissue to integrate data on host and tumor phenotype. While he proposed a million-person “last cohort,” we describe here baseline data on a cohort that meets many of the principles he outlined.

Purpose of the study

Dr. Colditz and colleagues established the Joanne Knight Breast Health Cohort at Siteman Cancer Center to collect, store, and ultimately share comprehensive data sets and tissue specimens for future research. The screening mammography service at Siteman and Washington University School of Medicine offered us the potential to recruit a diverse population of women [15] and to bring routine mammography images and all breast biopsies into the cohort follow-up as a feature of the prospective data collection. The cohort thereby fills two of the major gaps in existing US cohort studies and facilitates the study of risk factors and the validation of models for breast cancer risk prediction.

The Joanne Knight Breast Health Center at Siteman Cancer Center at Washington University School of Medicine, St. Louis, Missouri, provides mammography services for women from varying socioeconomic and racial backgrounds in the St. Louis region. These include women with coverage through the Missouri breast and cervical cancer screening programs (funded by the Centers for Disease Control and Prevention and the state), women with Komen Fund and Barnard Fund coverage for the uninsured, and regularly insured women with private insurance or Medicare coverage. All women are screened with the same technology (Hologic). The mammography service stores all images, and as of 2019 all screening has used tomosynthesis.

Materials and methods


Posters describing the study were placed in waiting areas, and women attending mammographic breast screening or diagnostic procedures at the Joanne Knight Breast Health Center were approached to participate. All of them completed extended data collection for breast cancer risk estimation. The Joanne Knight Breast Health Center screens approximately 25,000 women and performs high-risk and diagnostic screening for another 15,000 women per year [15]. Women aged 18 and older attending the Breast Health Center were eligible to enroll. More than 50% of eligible women attending for screening mammograms chose to enroll. Males were excluded, as were women with a self-reported blood transfusion within the past 4 months and women with self-reported positive status for HIV, hepatitis B, or hepatitis C.

The variables needed for the simplified Rosner–Colditz breast cancer risk prediction model [16] have been routinely collected since 2010, and risk estimates are incorporated into reporting from breast health screening mammograms. These variables include age, menopausal status, age at menopause, pregnancy history, history of benign breast disease, current menopausal hormone therapy (estrogen alone, estrogen plus progestin, progesterone alone, and other), current BMI, height, and daily alcohol intake (see measures below). Women who were invited to the study and agreed to participate provided consent and then proceeded to blood draw. A 20 ml blood sample was drawn and aliquoted for storage at −80 °C in the Siteman Tissue Procurement Core liquid nitrogen freezer system. Aliquots of white blood cells and plasma are stored separately in cryotubes.

Cohort participants consented to (1) retrospective and prospective review of medical records (including radiologic images, pathology reports); (2) one-time 20 ml blood draw; (3) access to tissue not required for clinical care (e.g., breast biopsy tissue following conclusive clinical pathology assays); and (4) optional future contact for the purposes of long-term follow-up and/or to recruit for other related research projects. Record linkage identifies new mammograms, biopsies, and other visits to BJC Health Care facilities. BJC is a non-profit health care organization serving metro St. Louis, mid-Missouri, and Southern Illinois.

Enrollment from November 2008 to April 2012 included 12,153 women who provided blood and risk factor data. In a survey of 158 women who opted not to enroll over a two-week period in October 2009, the most commonly cited reason was a lack of time to give the blood sample (30.4%). The next largest group (19.0%) wanted more time to think about participation. Other reasons included not wanting to give a blood sample (8.9%) and not wanting researchers to have access to their medical records (8.9%); 13.9% provided no answer. The majority of enrolled women came in for a screening mammogram, and a subset came for diagnostic follow-up (5.4% of the total cohort). Of the enrolled women, 1,672 had a history of cancer at enrollment, leaving 10,481 women free from cancer at baseline.
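The enrollment counts and survey percentages reported above can be checked with simple arithmetic. The following is a minimal sketch using only the figures stated in the text; the variable names are illustrative and not part of the study's data dictionary.

```python
# Counts reported in the text.
ENROLLED = 12_153      # women who provided blood and risk factor data
PRIOR_CANCER = 1_672   # women with a history of cancer at enrollment

cancer_free = ENROLLED - PRIOR_CANCER  # 10481

# Reasons reported in the survey of 158 non-participants (percent of respondents).
reasons = {
    "lack of time for blood sample": 30.4,
    "wanted more time to decide": 19.0,
    "did not want to give blood": 8.9,
    "did not want records accessed": 8.9,
    "no answer": 13.9,
}

# Round to one decimal place to avoid floating-point noise in the sum.
reported_total = round(sum(reasons.values()), 1)   # 81.1
not_itemized = round(100 - reported_total, 1)      # 18.9

print(cancer_free, reported_total, not_itemized)
```

The listed categories account for 81.1% of survey respondents; the text does not itemize the reasons given by the remaining 18.9%.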

Methods of follow-up

Follow-up of cohort participants, as determined by mammography and other clinic visits through December 2020, showed that 78% were seen in 2019 or 2020, a further 4.4% were seen most recently in 2018, and a further 2.4% in 2017. All women remain under surveillance for return to follow-up mammography. Follow-up is passive, through medical record linkages every 6 months, annual tumor registry searches, and annual mortality searches. As a result, over 80% of women have been seen within the last 36 months. The average follow-up through most recent contact is 9.2 years per participant.
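To illustrate how an average follow-up time of this kind is typically computed, the sketch below takes the interval from enrollment to most recent contact for each participant and averages it. This is not the study's actual code: the function names and the two example records are hypothetical, and dividing days by 365.25 is an assumed convention.

```python
from datetime import date
from typing import Iterable, Tuple

DAYS_PER_YEAR = 365.25  # assumed convention for converting days to years


def follow_up_years(enrolled: date, last_contact: date) -> float:
    """Years of follow-up contributed by one participant."""
    if last_contact < enrolled:
        raise ValueError("last contact precedes enrollment")
    return (last_contact - enrolled).days / DAYS_PER_YEAR


def mean_follow_up(records: Iterable[Tuple[date, date]]) -> float:
    """Mean follow-up in years across (enrolled, last_contact) pairs."""
    years = [follow_up_years(e, c) for e, c in records]
    if not years:
        raise ValueError("no records supplied")
    return sum(years) / len(years)


# Hypothetical participants, for illustration only.
example_records = [
    (date(2009, 1, 15), date(2020, 1, 15)),
    (date(2011, 6, 1), date(2019, 6, 1)),
]
example_mean = mean_follow_up(example_records)
```

Total person-years for the cohort would be the sum of the individual values rather than their mean.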

Exposure measures

At enrollment, a baseline questionnaire, blood draw, and mammogram were obtained, along with an address for follow-up and for geocoding to measures of structural inequality. Baseline blood samples were divided into multiple aliquots: 3 aliquots of 1 ml for DNA extraction and 6 plasma aliquots of 1 ml per participant. Aliquots were placed into cryovials and stored at −80 °C in LN2 freezers.

Women self-reported breast cancer risk factors on entry to the cohort. These are drawn from established and validated measures [17]. The baseline questionnaire assessed height, weight at age 18, current weight, and weight at menopause; age at menarche, age at first birth, age at each subsequent birth, and parity; whether menses had ceased (yes/no), age at menopause, surgical removal of the uterus with or without removal of the ovaries, and age at hysterectomy; family history of breast cancer (mother and/or sister) and Ashkenazi Jewish heritage; history of benign breast biopsy; current use of hormone therapy (yes/no, type, and duration); current use of oral contraceptives (yes/no) and duration; current alcohol intake; and current smoking status and cigarettes per day.

Mammograms: screening mammograms taken 12–24 months prior to baseline, at baseline, and at subsequent follow-up screening have been identified and stored. These images are stored along with the recorded BI-RADS density category (a = almost entirely fat, b = scattered areas of fibroglandular density, c = heterogeneously dense, d = extremely dense). Routine screening mammograms were obtained using Hologic machines.
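For analysis, the four BI-RADS density categories listed above can be represented as a simple lookup. The sketch below is illustrative only; the function names and the handling of letter case and whitespace are assumptions, not a description of how the cohort database stores these codes.

```python
from collections import Counter
from typing import Dict, Iterable

# Category labels as given in the text.
BIRADS_DENSITY: Dict[str, str] = {
    "a": "almost entirely fat",
    "b": "scattered areas of fibroglandular density",
    "c": "heterogeneously dense",
    "d": "extremely dense",
}


def describe_density(code: str) -> str:
    """Return the text label for a BI-RADS density code (case-insensitive)."""
    key = code.strip().lower()
    if key not in BIRADS_DENSITY:
        raise ValueError(f"unknown BI-RADS density code: {code!r}")
    return BIRADS_DENSITY[key]


def tabulate_density(codes: Iterable[str]) -> Dict[str, int]:
    """Count mammograms per density category, including empty categories."""
    counts = Counter(code.strip().lower() for code in codes)
    return {key: counts.get(key, 0) for key in BIRADS_DENSITY}


# Hypothetical codes, for illustration only.
example_counts = tabulate_density(["a", "b", "b", "D"])
```

Note that `tabulate_density` silently ignores codes outside a through d; a stricter version could call `describe_density` on each code first to reject them.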

County-level measures of structural inequality: We summarize multiple measures of county-level structural inequality that were selected for their relevance to population health and health disparities. First, we include five multi-dimensional factors representing several domains of structural inequality. Each factor consists of four or five variables clustering around one of the following themes: racial and economic segregation; population change; opportunity for socioeconomic advancement; economic environment; and population and housing characteristics. The factors were derived using exploratory factor analysis (EFA) in SAS 9.4 (SAS Institute Inc., Cary, NC) together with theory-driven choices. The underlying data were publicly available and previously compiled by the Health Inequality Project [18]. We also include three versions of the Index of Concentration at the Extremes (ICE) for (1) race, (2) income, and (3) income and race combined. The ICE measures compare the most advantaged groups to the least advantaged groups; the combined ICE measure compares higher-income White or Caucasian populations to lower-income Black or African American populations. These measures describe the distribution of extreme privilege and deprivation for these indicators across a specified area [19]. Finally, we include measures of area-level debt delinquency for any debt and for medical debt, since area-level indebtedness has been shown to affect household finances as well as available neighborhood-level services [20], which have implications for neighborhood stability and subsequent health. The measures of area-level debt delinquency are publicly available through the Urban Institute [21]. All variables were linked to each participant’s geocoded county of residence at the time of enrollment, for a total of 224 unique counties.
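The text does not state the ICE formula. In its commonly used form, the index for an area is the number of people in the most advantaged group minus the number in the least advantaged group, divided by the total population with known values, so that it ranges from −1 (everyone in the deprived extreme) to +1 (everyone in the privileged extreme). The sketch below assumes that standard formulation; the function name, argument names, and example county are hypothetical, and the specific group definitions used in this cohort are those described in [19].

```python
def ice(n_privileged: float, n_deprived: float, n_total: float) -> float:
    """Index of Concentration at the Extremes for a single area.

    Computes (A - P) / T, where A is the count in the most advantaged
    group, P is the count in the least advantaged group, and T is the
    total population with known values for the indicator.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_privileged < 0 or n_deprived < 0:
        raise ValueError("group counts must be non-negative")
    if n_privileged + n_deprived > n_total:
        raise ValueError("group counts cannot exceed n_total")
    return (n_privileged - n_deprived) / n_total


# Hypothetical county: 10,000 residents, 3,000 in the advantaged extreme
# and 1,000 in the deprived extreme.
example_ice = ice(3000, 1000, 10000)  # 0.2
```

Values near zero can arise either when few residents fall in either extreme or when the two extremes are balanced, so the index is usually interpreted alongside the underlying counts.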


The cohort free from cancer at baseline includes 10,481 women. The distribution by race/ethnicity reflects the racial distribution in our catchment area and is summarized in Table 1. Almost 27% of the cohort is Black or African American, less than 1% is Asian, and 69% is White or Caucasian. Of these women, 1% identify as Hispanic. Women were aged 30 to 94 at entry, with 90% between ages 35 and 69 at blood draw and a median age of 54. Furthermore, 4.3% of participants come from rural residential addresses as defined by RUCA codes.

Table 1 Race and ethnicity and age distribution of women participating in the Joanne Knight Breast Health Cohort at Siteman Cancer Center, Washington University

Breast cancer risk factors at baseline are summarized in Table 2. Briefly, women were on average 54.8 years old at enrollment, and nulliparity was more common among White (20.4%) than Black (11.5%) women. Overall, 61% of participants were postmenopausal at entry to the cohort. The mean body mass index (BMI) was 29.3 kg/m2, and of note it was higher for Black women (32.8 kg/m2) than for White women (28.0 kg/m2). During follow-up, linkages to cancer registry and pathology records have identified 272 incident invasive breast cancers and 116 in situ lesions through October 2020. A total of 623 benign biopsy samples from June 28, 2010 through December 31, 2020 have been identified and are stored for centralized pathology review and classification. Through January 2021, we have confirmed 329 deaths within this cohort.

Table 2 Joanne Knight Breast Health Cohort selected characteristics at entry, 10,092 women free from cancer

Socioeconomic status varied among participants. Overall, 45.6% were living in counties where the share of the population with delinquent debt of any kind was at or above 30% (Table 3).

Table 3 Joanne Knight Breast Health Cohort baseline: county-level structural inequality (n = 10,481 women, n = 224 counties)

Early findings

Plasma samples from the cohort have been evaluated for carotenoid concentrations and risk of proliferative benign breast disease diagnosed from baseline through April 2016 [22]. Among women under age 50, we observed that African Americans had lower levels of alpha- and beta-carotene and higher levels of beta-cryptoxanthin and lutein/zeaxanthin. There was a suggested inverse association between plasma carotenoids and risk of proliferative benign breast disease. Ongoing analysis aims to use this cohort to externally validate the Rosner–Colditz breast cancer risk model that includes mammographic breast density, breast cancer questionnaire risk factors, and polygenic risk scores [16]. The study also motivates novel statistical methods for breast image data analysis in the time-to-event setting [23,24,25]. For example, we reported methods using supervised Functional Principal Component Analysis of baseline full-field mammographic images [23] and a refinement to accommodate the irregular boundary of the mammographic image [24].


This new cohort brings breast images and pathology from routine care in a clinical setting that serves a diverse population into prospective epidemiologic investigations of breast cancer. The integration of blood markers with questionnaire-based risk factors, tissue samples for all breast biopsies, and repeated mammograms on participants brings unique strengths to this cohort. Furthermore, the diversity of this population, which is approximately one-quarter African American, fills gaps in breast cancer etiology, risk prediction development, and validation of breast cancer risk models in diverse populations.

Although repeated visits to the breast health center for screening mammography could facilitate updated or repeated blood measures, the epidemiologic evidence and resources to justify this have not been assembled to date. However, because breast images are the product of the mammography visit, improving approaches to maximize use of the information in these repeated images appears to be the most efficient approach to improve risk stratification as part of routine breast health services.

Data access

With IRB approval, deidentified data and plasma or tissue samples can be shared with investigators. Applications submitted to Dr. Colditz are reviewed by an internal Siteman committee, including breast pathology, mammography, and Tissue Procurement Core leadership. Material transfer agreements are developed once access is approved, and data, tissue samples, or blood samples are shipped as agreed. The overall study is approved by the institutional review board at Washington University in St. Louis.