Introduction

In the early 1990s the breast cancer susceptibility genes BRCA1 and BRCA2 were identified through linkage analyses [1–4]. BRCA1, located on chromosome 17q12-q21, consists of 24 exons encoding a protein of 1,863 amino acids and is involved in DNA repair [5, 6], in transcription [7, 8], and in the cell cycle checkpoint in DNA damage response [9–11]. BRCA2, located on chromosome 13q12-q13, consists of 27 exons encoding a protein of 3,418 amino acids and is also involved in DNA repair [12–15], but its role in transcription and the cell cycle checkpoint is less clear [16].

As of April 2007, a total of 1,643 distinct BRCA1 variants and 1,856 distinct BRCA2 variants had been reported in the Breast Cancer Information Core (BIC) Database [17]. Among these variants, frameshift mutations, nonsense mutations, splice variants and a few well-documented missense mutations are considered deleterious [18], while synonymous variants have been considered benign or polymorphic. A large number of missense or intronic variants of BRCA1 or BRCA2 remain of unknown significance. The proportion of breast cancer patients who carry these unclassified variants (UVs) is about 9% [19]. Given that only 2% to 3% of breast cancer patients have deleterious mutations in BRCA1 or BRCA2 [20], understanding the clinical significance of this relatively large number of UVs is of great importance.

Functional studies can provide direct insight into whether a UV has biological consequences, but few of these studies have been performed [21, 22]. Other approaches have been applied to classify the significance of UVs, including comparisons of allele frequencies [18], algorithms such as Polyphen (see Materials and methods) [23], examination of sequence conservation across species [24–26], and characterization of the physicochemical nature of the amino acid substitutions (Grantham matrix scores) [26, 27]. A combination approach of the sequence conservation and Grantham matrix score methods was applied to classify a large number of UVs [26]. No systematic evaluation, however, has been conducted to determine whether patients who carry the variants classified as high risk using these methods have characteristics similar to those of patients with known deleterious BRCA1/BRCA2 mutations, which would suggest that these high-risk UVs are deleterious.

Breast cancer patients with a known deleterious mutation in BRCA1/BRCA2 are more likely to have a family history of breast cancer or ovarian cancer [28] and an earlier age of diagnosis than noncarrier patients [18, 29]. In addition, BRCA1 deleterious mutation carriers are more likely to have estrogen receptor (ER)-negative and progesterone receptor (PR)-negative tumors than women without such mutations [29]. In the current analyses, we classified BRCA1/BRCA2 UVs using the four methods listed above and a combination of the Grantham matrix scores and sequence conservation. We then evaluated the validity and usefulness of each method by comparing the risk categories of UV carriers with respect to these three well-defined characteristics of BRCA1/BRCA2 deleterious mutation carriers.

Materials and methods

Subjects

The data collection methods for this study have been described previously [30]. In brief, female patients diagnosed with histologically confirmed first primary invasive breast cancer were identified through the Los Angeles County Cancer Surveillance Program, a population-based Surveillance, Epidemiology and End Results registry supported by the State of California and the National Cancer Institute. Eligible cases were US born and English speaking, white (including Hispanic) or African-American, aged 20 to 49 years at diagnosis, and Los Angeles County residents at diagnosis. A total of 2,882 eligible cases were identified (2,534 whites and 348 African-Americans) between February 1998 and May 2003. Recruitment of African-Americans began after the initiation of the study with eligible African-American cases diagnosed from January 2000 to May 2003.

Among the 2,882 potentially eligible cases, 1,794 (62%) were interviewed (1,585 white, 209 African-American). Reasons for nonparticipation were patient refusal (n = 428), no longer a resident of Los Angeles County (n = 37), not located (n = 88), death (n = 38), serious illness or disability (n = 18), physician refusal (n = 50), or inability to schedule the interview within 18 months of diagnosis (n = 429). The study was approved by the Institutional Review Board of the University of Southern California. All participants provided written informed consent.
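The participation figures in this paragraph can be checked for internal consistency. The following is a minimal sketch using only the counts reported above; the variable names and dictionary labels are our own shorthand, not terms from the study protocol.

```python
# Counts as reported in the text; the keys are illustrative labels only.
eligible = 2882
interviewed = 1794

nonparticipation = {
    "patient refusal": 428,
    "no longer a Los Angeles County resident": 37,
    "not located": 88,
    "death": 38,
    "serious illness or disability": 18,
    "physician refusal": 50,
    "interview not scheduled within 18 months": 429,
}

not_interviewed = sum(nonparticipation.values())
response_rate = interviewed / eligible

# Every eligible case should be either interviewed or listed under a reason.
assert interviewed + not_interviewed == eligible

print(f"not interviewed: {not_interviewed}")
print(f"response rate: {response_rate:.1%}")
```

The listed reasons account for every eligible case that was not interviewed, and the computed response rate agrees with the reported 62% after rounding.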

Data and blood specimen collection

An in-person interview was completed using a modified version of the structured questionnaire used in the Women's Contraceptive and Reproductive Experiences Study [31]. The questionnaire included detailed information on demographic characteristics, family history of breast cancer or ovarian cancer, ethnic origin, and environmental factors such as oral contraceptive use, reproductive history, alcohol use, smoking history, and radiation exposure. We obtained information up to the date of breast cancer diagnosis. Blood specimens were collected from 1,519 participants (85%) and were transported to the Norris Cancer Center Genetics Core Laboratory in Styrofoam containers on frozen ice packs. For the first 50 samples the buffy coat was immediately extracted and stored, and for the remaining samples we stored whole blood.

Sequencing of BRCA1 and BRCA2 genes

All BRCA1 and BRCA2 exons (except BRCA1 exons 1 and 4 and BRCA2 exon 1) as well as all exon–intron boundaries were sequenced. Exon 1 was not sequenced for either gene because it is located upstream of the translation start site in both genes. BRCA1 exon 4 was not sequenced because it is not found in the normal BRCA1 mRNA transcript.

DNA extraction, amplification and sequencing were carried out in the USC Genomics Core Laboratory using a protocol similar to that previously described [32]. The detailed procedures are described in the supplemental methods (see Additional File 1). We sequenced BRCA1/BRCA2 genes for 1,469 out of 1,519 blood specimens. We were unable to sequence the remaining 50 specimens due to insufficient DNA.

Thirty-three randomly selected, blinded samples were resequenced for quality control purposes. The discordance rate was 0.19%: 16 discordant sequencing results out of the total 8,646 variant sites sequenced (262 variant sites for each of the 33 samples). In addition, 166 subjects who had noninformative sequencing results on one or more variant sites were resequenced or genotyped using the TaqMan assay (for BRCA2 I2490T, N372H, and N991D) as previously described [33].
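The quality-control arithmetic can be reproduced directly from the reported counts. This is a small sketch; the variable names are ours. The figure of 262 variant sites per sample also matches the 105 distinct BRCA1 variants plus 157 distinct BRCA2 variants reported in the Results.

```python
# Counts as reported in the text.
samples_resequenced = 33
brca1_variant_sites = 105
brca2_variant_sites = 157
discordant_calls = 16

variant_sites_per_sample = brca1_variant_sites + brca2_variant_sites
total_calls = samples_resequenced * variant_sites_per_sample
discordance_rate = discordant_calls / total_calls

print(f"variant sites per sample: {variant_sites_per_sample}")
print(f"total calls compared: {total_calls}")
print(f"discordance rate: {discordance_rate:.2%}")
```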

Epidemiologic and histologic variables

Age at diagnosis was categorized as <35 years, 35 to 39 years, 40 to 44 years, and 45 to 49 years. We classified women based on their family history of breast cancer or ovarian cancer as follows: one or more breast cancer or ovarian cancer patients among their first-degree relatives (mother and full sisters); no first-degree family history of breast cancer or ovarian cancer but one or more breast cancer or ovarian cancer patients among their second-degree relatives (mother's or father's full sisters, and grandmothers); no first-degree or second-degree relatives diagnosed with breast cancer or ovarian cancer; and an unknown first-degree family history. We considered unknown second-degree family history as no family history.

The ER and PR status of the breast cancer was obtained by abstracting pathology reports collected by the Los Angeles County Cancer Surveillance Program. Among the 1,469 subjects, ER/PR information was available for 1,216 patients (83%). For the ER/PR analyses, we excluded 63 patients who had borderline ER/PR status and 101 patients whose ER/PR status was +/- or -/+, leaving 1,052 patients with a +/+ or -/- receptor status.

Classification of BRCA1/BRCA2 mutation status

We classified each identified BRCA1/BRCA2 variant according to its predicted functional and biological significance as follows: definitely disease-causing variants (DDCVs), including frameshift mutations, nonsense mutations, splice variants that were previously reported to affect splicing or were located at the exon/intron boundary, and missense variants that were previously shown to be deleterious; UVs, including in-frame deletions/insertions, intronic variants that might affect splicing by creating a splice donor/acceptor site, variants next to the exon/intron boundary, and most missense variants; and benign polymorphic variants, including synonymous variants, intronic variants that are unlikely to affect splicing, and a few missense mutations that were reported to be benign. (See Additional File 2 for a list of all variants identified in this study, with their classification and the reasons and references for such classification.)

Further classification of BRCA1 and BRCA2 unclassified variants

We further classified BRCA1/BRCA2 UVs using the following methods.

Classification based on allele frequency

We divided the UVs into high-frequency unclassified variants (HFUVs) and low-frequency unclassified variants (LFUVs) depending on the minor allele frequency (≥ 1% versus <1%) in each ethnic group (142 African-Americans, 222 Hispanic whites, 1,105 non-Hispanic whites). If the minor allele frequency was ≥ 1% in one or more ethnic groups, the UV was categorized as an HFUV; otherwise it was categorized as an LFUV. This categorization was based on the assumption that variants with high frequency would be less likely to be disease causing compared with variants of very low frequency.
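The allele frequency rule can be sketched as follows. This is an illustration rather than the study's actual code. It assumes that the allele frequency is computed over two chromosomes per subject, which the text does not state explicitly, and the allele counts in the usage example are hypothetical.

```python
HFUV_THRESHOLD = 0.01  # 1% minor allele frequency cutoff from the text

# Group sizes as reported in the text.
GROUP_SIZES = {
    "African-American": 142,
    "Hispanic white": 222,
    "non-Hispanic white": 1105,
}


def minor_allele_frequency(variant_alleles, n_subjects):
    """Minor allele frequency, assuming 2 chromosomes per subject."""
    if n_subjects <= 0:
        raise ValueError("n_subjects must be positive")
    freq = variant_alleles / (2 * n_subjects)
    return min(freq, 1 - freq)


def classify_by_allele_frequency(allele_counts, group_sizes=GROUP_SIZES):
    """Return 'HFUV' if the MAF is >= 1% in any ethnic group, else 'LFUV'."""
    for group, n_subjects in group_sizes.items():
        maf = minor_allele_frequency(allele_counts.get(group, 0), n_subjects)
        if maf >= HFUV_THRESHOLD:
            return "HFUV"
    return "LFUV"


# Hypothetical counts: 3 variant alleles among African-Americans (3/284, above 1%).
common_in_one_group = {"African-American": 3}
# Hypothetical counts: 5 variant alleles among non-Hispanic whites (5/2210, below 1%).
rare_everywhere = {"non-Hispanic white": 5}

print(classify_by_allele_frequency(common_in_one_group))
print(classify_by_allele_frequency(rare_everywhere))
```

Because a single group exceeding the cutoff is sufficient, a variant that is rare overall can still be an HFUV if it is common in the smallest group.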

Polyphen-based classification

Polyphen is an algorithm that classifies the functional effect of each missense variant into three categories (probably damaging, possibly damaging, and benign) [34]. This classification is based on the chemical characteristics of the substitution site (for example, disulfide bond, transmembrane region), the alignment of homologous sequences, and protein three-dimensional structures [23]. UVs other than missense variants are not classified by Polyphen. The Polyphen classification in this report is based on access to the algorithm in March 2007.

Classification based on sequence conservation across mammalian species

A variant that occurs at a site with high-degree conservation is considered more likely to be deleterious than a variant occurring at a site with low-degree conservation [35]. We selected only mammals for cross-species comparisons of the BRCA1/BRCA2 sequences, since the function of these two proteins in mammals could be different from that in other animals. We selected all mammalian species whose BRCA1/BRCA2 sequences were reported in the National Center for Biotechnology Information gene database or whose complete coding sequences were reported in the National Center for Biotechnology Information nucleotide sequence database. Ten species for BRCA1 and five species for BRCA2 met these criteria (see Additional File 2). Sequence alignment was performed using the Clustal W method [36] and the MegAlign software (DNASTAR, Inc., Madison, WI, USA).

We classified BRCA1/BRCA2 missense variants into three categories (high conservation, moderate conservation, and low conservation) depending on the number of the species that had a different amino acid from that of the human at the site of variation. For each UV in BRCA1 we considered differences in zero or one species out of the 10 examined to represent high conservation, differences in two or three species to represent moderate conservation, and differences in four or more species to represent low conservation. For BRCA2 we compared sequences of five species: no difference in all five species was considered high conservation, one or two differences were considered moderate conservation, and three or more differences were considered low conservation.
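The conservation cutoffs described above translate into a short lookup. This is a minimal sketch: the function names are ours, the alignment column in the usage example is hypothetical, and alignment gaps are not handled because the text does not say how they were treated.

```python
# Cutoffs from the text: (species compared, maximum number of differing species
# for "high", maximum for "moderate"). More differences than that means "low".
CONSERVATION_CUTOFFS = {
    "BRCA1": (10, 1, 3),
    "BRCA2": (5, 0, 2),
}


def count_differing_species(human_residue, other_residues):
    """Number of species whose residue differs from the human residue."""
    return sum(1 for residue in other_residues if residue != human_residue)


def conservation_category(gene, n_differing):
    """Map the number of differing species to high/moderate/low conservation."""
    n_species, high_max, moderate_max = CONSERVATION_CUTOFFS[gene]
    if not 0 <= n_differing <= n_species:
        raise ValueError(f"expected 0..{n_species} differing species")
    if n_differing <= high_max:
        return "high"
    if n_differing <= moderate_max:
        return "moderate"
    return "low"


# Hypothetical alignment column for a BRCA1 site: human residue R,
# with one of the ten other species carrying K instead.
column = ["R", "R", "K", "R", "R", "R", "R", "R", "R", "R"]
n_diff = count_differing_species("R", column)
print(n_diff, conservation_category("BRCA1", n_diff))
```

Note that the same count can fall into different categories for the two genes: one differing species is high conservation for BRCA1 but moderate conservation for BRCA2.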

Classification based on the Grantham matrix score

The Grantham matrix score (GMS) is a composite measure of the degree of amino acid substitution, taking into account the side-chain composition, polarity, and molecular volume of the two amino acids [27]. We dichotomized the GMS at 60, a criterion previously used to define neutral missense variants [26].

Integration of sequence conservation and the Grantham matrix score

We adopted a previously reported classification scheme integrating the sequence conservation and the GMS [26]. Briefly, if the variant was located at a fully conserved site or led to a nonconservative substitution at a conserved site, it was considered deleterious. If the variant amino acid was observed in other species or the variant led to a conservative substitution, it was considered neutral. See Additional File 1 for further details.

Classification of women who carry unclassified variants in BRCA1/BRCA2

Each subject was categorized hierarchically based on their BRCA1 and BRCA2 mutation status (Figure 1). This means that anyone successfully classified by the first criterion would not be further classified by the criteria that followed. This hierarchical classification leads to mutually exclusive categories (DDCV carriers, UV carriers, normal/polymorphic carriers, and patients with unknown mutation status) as follows. First, a patient was classified as a DDCV carrier if she had one or more of the DDCV(s). Second, if the patient did not belong to the DDCV group and had a noninformative result at any of the identified DDCV sites, she was classified as unknown. Third, if the patient did not belong to these first two categories and carried one or more of the UVs, she was classified as a UV carrier. Fourth, if the patient did not belong to the first three categories and any of the sequencing results at the identified UV sites was noninformative for the subject, she was classified as unknown. Finally, if the patient did not belong to any of the preceding categories, she was classified as a polymorphic or normal genotype carrier.
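The five-step hierarchy can be expressed as an ordered sequence of checks in which the first matching rule wins. The sketch below is one possible encoding, not the study's implementation. It assumes each subject's genotype calls are stored as a mapping from site name to one of "variant", "reference", or "noninformative", and that a site missing from the mapping is treated as a reference call. The site names are hypothetical.

```python
def classify_subject(calls, ddcv_sites, uv_sites):
    """Assign a subject to one mutually exclusive category, first match wins."""
    def any_call(sites, value):
        return any(calls.get(site) == value for site in sites)

    if any_call(ddcv_sites, "variant"):
        return "DDCV carrier"
    if any_call(ddcv_sites, "noninformative"):
        return "unknown"
    if any_call(uv_sites, "variant"):
        return "UV carrier"
    if any_call(uv_sites, "noninformative"):
        return "unknown"
    return "normal/polymorphic"


# Hypothetical site names for illustration only.
DDCV_SITES = {"ddcv_a", "ddcv_b"}
UV_SITES = {"uv_a", "uv_b"}

subjects = {
    "s1": {"ddcv_a": "variant", "uv_a": "variant"},
    "s2": {"ddcv_b": "noninformative", "uv_a": "variant"},
    "s3": {"uv_b": "variant"},
    "s4": {"uv_a": "noninformative"},
    "s5": {},
}

for subject_id, calls in subjects.items():
    print(subject_id, classify_subject(calls, DDCV_SITES, UV_SITES))
```

The second subject illustrates the consequence of the ordering: a noninformative result at a DDCV site leads to an unknown status even though the subject carries a UV.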

Figure 1 Illustration of the classification scheme of BRCA1/BRCA2 variants. DDCV, definitely disease-causing variant; UV, unclassified variant.

UV carriers were further classified hierarchically into mutually exclusive categories of high risk, moderate risk, low risk, and unknown risk according to the various UV classifications. For example, when applying the allele frequency method, a UV carrier was classified as high risk if the subject carried one or more of the LFUVs, as unknown risk if any of the sequencing results at the LFUV site was noninformative for the subject, as low risk if the subject carried one or more of the HFUVs, and as unknown risk if any of the sequencing results at the HFUV site was noninformative for the subject. Classification using other methods such as Polyphen, the GMS, or sequence conservation followed the same hierarchical logic.
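The same first-match logic can be written once and reused for each UV classification method by passing an ordered list of risk tiers. This is a sketch under assumptions: genotype calls are encoded as "variant", "reference", or "noninformative"; the site names are hypothetical; and the three-tier example assumes that Polyphen's "possibly damaging" category corresponds to moderate risk.

```python
def classify_uv_carrier(calls, tiers):
    """Walk the tiers from highest to lowest risk; the first match wins.

    tiers is an ordered list of (risk_label, sites) pairs.
    """
    for risk_label, sites in tiers:
        if any(calls.get(site) == "variant" for site in sites):
            return risk_label
        if any(calls.get(site) == "noninformative" for site in sites):
            return "unknown risk"
    raise ValueError("subject carries no UV at the listed sites")


# Allele frequency method: LFUVs are checked before HFUVs (hypothetical sites).
frequency_tiers = [
    ("high risk", {"lfuv_a", "lfuv_b"}),
    ("low risk", {"hfuv_a"}),
]

# Three-tier method such as Polyphen (hypothetical sites and assumed mapping).
polyphen_tiers = [
    ("high risk", {"probably_damaging_a"}),
    ("moderate risk", {"possibly_damaging_a"}),
    ("low risk", {"benign_a"}),
]

print(classify_uv_carrier({"lfuv_a": "variant", "hfuv_a": "variant"}, frequency_tiers))
print(classify_uv_carrier({"lfuv_b": "noninformative", "hfuv_a": "variant"}, frequency_tiers))
print(classify_uv_carrier({"hfuv_a": "variant"}, frequency_tiers))
print(classify_uv_carrier({"possibly_damaging_a": "variant"}, polyphen_tiers))
```

A carrier of both an LFUV and an HFUV is assigned to the high-risk group, and a noninformative result at a higher tier prevents assignment to a lower tier.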

Six BRCA1 UV carriers and six BRCA2 UV carriers with a possible splice variant or in-frame deletion were categorized only by allele frequency since Polyphen, the GMS, and the integrated GMS/sequence conservation methods are not applicable to these splice variants and in-frame deletions. These women were therefore excluded from the analyses using Polyphen, the GMS, sequence conservation, and the integrated GMS/sequence conservation methods.

Statistical analyses

We compared the UV classification methods of allele frequency, Polyphen, sequence conservation, and the GMS by examining the pairwise joint distribution of BRCA1/BRCA2 UVs as classified using each method. Tests for a linear trend in the GMS across the three UV categories classified using Polyphen and the sequence conservation method were conducted in linear regression models. The mean GMS across two UV categories using allele frequency was compared by t test. We assessed whether UV classifications using allele frequency, Polyphen and the sequence conservation methods are correlated using an exact Mantel–Haenszel chi-square test.

We performed case–case analyses to examine the association between BRCA1 or BRCA2 carrier status categorized using each method (exposure variable) and outcome variables (clinical and disease characteristics). Case–case analyses were conducted using polychotomous logistic regression when the outcome variable was family history of breast cancer or ovarian cancer. The association with the ER/PR status was analyzed using logistic regression. We used linear regression where the outcome variable was age at diagnosis of breast cancer. When examining BRCA1, results were adjusted for the BRCA2 mutation status (DDCV, non-DDCV, unknown), and vice versa.

All reported P values are two-sided. The SAS 9.1 package was used for all analyses (SAS Institute, Cary, NC, USA).

Results

A total of 105 distinct BRCA1 variants (including 32 DDCVs) and 157 distinct BRCA2 variants (including 27 DDCVs) were identified in the 1,469 breast cancer patients (see Additional File 3). Among these distinct variants, 22 BRCA1 variants and 30 BRCA2 variants had not been reported in the BIC as of April 2007.

Correlated classifications using various approaches

Classification using the Polyphen algorithm appeared to be correlated both with the GMS and the conservation method: BRCA1/BRCA2 missense variants classified as high risk (probably damaging) using Polyphen had a higher mean GMS than those classified as low risk (benign missense variants) (Table 1). BRCA1/BRCA2 missense variants classified as benign missense variants using Polyphen were generally located at sites with a low degree of sequence conservation, while probably damaging missense variants tended to be located in highly conserved regions (Table 2). The GMS, however, was not strongly correlated with level of conservation across species (Table 1). The classification using the allele frequency method seemed to be associated with the classifications using other methods, although, given the small number of HFUVs of BRCA1/BRCA2, not all of these analyses achieved statistical significance.

Table 1 Mean Grantham matrix score of BRCA1/BRCA2 variants (unclassified variants) according to classification using allele frequency, Polyphen, and sequence conservation
Table 2 Joint distribution of BRCA1/BRCA2 variants (unclassified variants) according to classification using allele frequency, Polyphen, and sequence conservation

Classification of case patients with regard to BRCA1 or BRCA2 status

Among the 1,469 case patients in this study, 61 women carried a BRCA1 DDCV and 34 women carried a BRCA2 DDCV. Among the remaining women, 307 women and 860 women were UV carriers in BRCA1 and in BRCA2, respectively.

Classification of BRCA1/BRCA2 status in relation to epidemiologic and histologic outcome variables

Family history of breast cancer or ovarian cancer

The BRCA1 DDCV carriers were substantially more likely to have a first-degree family history of breast cancer or ovarian cancer than the normal/polymorphic BRCA1 carriers (odds ratio = 11.3; Table 3) after adjusting for the BRCA2 mutation status. The UV carriers were also significantly, although to a smaller extent, more likely to have a first-degree family history than normal/polymorphic BRCA1 carriers (odds ratio = 1.54). The high-risk UV carriers were, in general, significantly more likely to have a first-degree family history of breast cancer or ovarian cancer than normal/polymorphic women, whereas the low-risk UV carriers were not. For example, the high-risk UV carriers identified using the allele frequency (LFUV) or Polyphen (probably damaging) methods were more likely to have a first-degree family history (odds ratio = 2.00 and 3.39, respectively) than normal/polymorphic BRCA1 carriers.

Table 3 Association between family history of breast cancer or ovarian cancer and BRCA1 or BRCA2 status of the breast cancer patients

A similar trend was observed using the sequence conservation or the GMS method, although differences between the categories of UV carriers were smaller. The integrated method of the GMS/sequence conservation classified only nine subjects as high risk, and their odds ratio was not different from that of the women who remained unclassified.

The BRCA2 DDCV carriers were also at a higher risk of having a first-degree family history of breast cancer or ovarian cancer compared with the normal/polymorphic BRCA2 carriers (odds ratio = 3.69) after adjusting for BRCA1 mutation status. The association was weaker than that of BRCA1 DDCV carriers. Regardless of the classification method, the high-risk UV carriers were not statistically significantly different from the normal/polymorphic BRCA2 carriers with regard to family history (Table 3).

Age at diagnosis and estrogen receptor/progesterone receptor status

As expected, compared with the carriers of normal/polymorphic BRCA1, the BRCA1 DDCV carriers had a much earlier age at diagnosis (by 4.1 years; P < 0.001) and more ER/PR-negative tumors (odds ratio = 7.24, 95% confidence interval = 3.56 to 14.7). Case patients with high-risk UVs, however, did not have such characteristics regardless of the method of UV classification. The BRCA2 DDCV or UV status was not associated with early age at diagnosis or with ER/PR negativity (data not shown).
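As a quick plausibility check on the reported odds ratio and its confidence interval, the sketch below assumes a Wald-type 95% interval that is symmetric on the log odds scale, which is the usual output of logistic regression but is not stated in the text. Under that assumption, the geometric mean of the interval bounds should approximately recover the point estimate, and the width of the interval gives the implied standard error of the log odds ratio.

```python
import math

# Values as reported in the text for ER/PR-negative tumors in BRCA1 DDCV carriers.
odds_ratio = 7.24
ci_lower, ci_upper = 3.56, 14.7

Z_95 = 1.96  # assumed normal quantile for a two-sided 95% Wald interval

# On the log scale the point estimate sits midway between the bounds,
# so the geometric mean of the bounds should be close to the odds ratio.
geometric_midpoint = math.sqrt(ci_lower * ci_upper)

# Implied standard error of log(OR) under the Wald assumption.
se_log_or = (math.log(ci_upper) - math.log(ci_lower)) / (2 * Z_95)

print(f"geometric midpoint of CI: {geometric_midpoint:.2f}")
print(f"implied SE of log(OR): {se_log_or:.3f}")
```

The midpoint agrees with the reported estimate to within rounding, so the interval is consistent with the point estimate under this assumption.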

Comparisons of the classifications using the methods in this study and the Breast Cancer Information Core

The recent update of the BIC includes the assessment of the clinical importance of each variant. This assessment is based on several criteria, including epidemiological, segregation, and co-occurrence data. Among the UVs in this study, one BRCA1 UV (IVS5-11T > G) was classified as clinically important whereas three BRCA1 UVs and 19 BRCA2 UVs were classified as clinically nonimportant. IVS5-11T > G was classified as a high-risk UV using allele frequency (LFUV). Since this variant is not a missense variant, other methods were not applicable. Table 4 shows how each UV that was considered nonimportant in the BIC was classified by the five UV classification methods. The allele frequency and the GMS method classified a large number of variants as high risk that were considered nonimportant by the BIC, particularly for BRCA2. In contrast, Polyphen and the conservation methods classified few such variants as high risk.

Table 4 Classification of BRCA1/BRCA2 variants (unclassified variants) that were considered clinically not important in the Breast Cancer Information Core database

Discussion

In the present study of young breast cancer patients, we identified numerous variants in BRCA1/BRCA2 by direct sequencing, including 22 BRCA1 variants and 30 BRCA2 variants that had not been reported in the BIC as of April 2007. We applied various methods to classify 44 BRCA1 UVs and 95 BRCA2 UVs. To our knowledge, our study is the first to attempt to classify a large number of BRCA1/BRCA2 UVs identified in population-based breast cancer patients and to correlate these variants with outcome variables.

We found that classifications of BRCA1/BRCA2 UVs using the various classification methods in general agree with each other (Table 1 and Table 2). In particular, Polyphen seemed to be correlated with the GMS and with sequence conservation, which is expected given the composite nature of this algorithm. This intercorrelation supports the reliability of the classification methods.

In general, the BRCA1 UV carriers classified as high risk were at increased risk of having a family history of breast cancer or ovarian cancer. Family history has been considered a powerful tool in classifying UVs [37], and having a first-degree relative with breast cancer increases the breast cancer risk about twofold [38]. The odds ratio for the high-risk UV group was highest when using Polyphen, suggesting that the algorithm is better for the purpose of describing high-risk variants when using family history as a measure of true risk. We cannot exclude, however, the possibility that more stringent cutoff points to define the high-risk group using other methods (that is, high-degree conservation defined as no cross-species variation; or high GMS defined as >100) might increase the odds ratio estimates of the high-risk group. In this study, we did not have sufficient numbers of UV carriers to investigate this possibility.

Considering that the high-risk BRCA1 UV carriers classified using all of the classification methods were at a higher risk of having a family cancer history (either statistically significantly or nonsignificantly), we expected to observe similar trends using age at diagnosis or the ER/PR status as the outcome variables. These trends, however, were not observed. The narrow age range of our study subjects, all of whom were under age 50 at diagnosis, could have limited the study power. For analyses of the ER/PR status, our exclusion of about 30% of women because of missing, borderline, or mixed (-/+ or +/-) ER/PR status may have limited the statistical power. Alternatively, it is possible that only truncating mutations (resulting in a complete loss of BRCA1 functions), but not missense variants (retaining part of its ability; for example, the ability to interact with certain proteins), of BRCA1 lead to the high frequency of ER/PR-negative tumors.

For BRCA2, it is unclear why none of the classification methods identified high-risk UV carriers when family history was used as the measure of true risk. One explanation could be the fact that BRCA2 DDCV carriers themselves did not have such a high odds ratio as seen for BRCA1 DDCV carriers. The BRCA2 DDCV carrier status was also not associated with age at diagnosis in this study, again possibly because all of our subjects were younger than 50 years and the age at diagnosis for BRCA2 DDCV carriers is not as early as for BRCA1 DDCV carriers [29]. In our study, the median ages were 40 and 45 years for BRCA1 and BRCA2 DDCV carriers, respectively.

Homozygous deleterious mutations in BRCA1/BRCA2 are lethal [39–42]. In the present study, all of the low-risk UVs classified using the allele frequency method (except those that were common only in African-Americans) were observed as homozygous and therefore should be benign. Consistent with this, all our low-risk UVs (HFUVs) that have been classified by the BIC were assessed as clinically nonimportant. In contrast, quite a few variants classified by the BIC as nonimportant are rare variants, and are therefore classified as high-risk UVs (LFUVs) in our study. If a variant has arisen very recently, its population frequency will be low even though the variant is not clinically important [43]. The allele-frequency method may therefore be better for the purpose of describing low-risk UVs than high-risk UVs.

The GMS is a pairwise comparison of the two substituted amino acids, and it has been argued that a multiple comparison – that is, a comparison of the substituted amino acids taking into account the natural variation of the substituted site across species – would provide better information [44, 45]. One method of achieving such a multiple comparison is to use the integrated method of Abkevich and colleagues [26]. In our study, however, this method was not an improvement over the individual application of the two methods.

The Polyphen algorithm compares homologous sequences for conservation and examines the structural and physicochemical aspects of the substitution. We found that the high-risk UV carriers identified using Polyphen had a higher odds ratio of first-degree family history than those identified using any of the other methods. We also found that the number of clinically nonimportant variants that were classified as high risk or medium risk was smallest when using Polyphen. The Polyphen algorithm has been reported to have the smallest false-positive rate among the various online algorithms, including SIFT [35]. Polyphen has not previously been applied to BRCA1/BRCA2, whereas SIFT has been adopted for BRCA1 [24, 25]. Our results suggest that Polyphen might be useful to identify high-risk UVs, especially when the UV has never been reported and/or clinical information is not available.

Efforts to classify UVs are accumulating: several groups have used simple combinations of sequence conservation and the severity of amino acid substitutions [24–26]. Whether the classification is clinically valid, however, has not been systematically examined [26]. Other studies have used extensive multifactorial models, most of them focusing on a few BRCA1 UVs. These models incorporate several approaches used in this study as well as clinical characteristics [46], co-occurrence with deleterious mutations [19, 46], and histopathological information [19]. Although clinical and co-occurrence information has provided strong evidence to classify UVs [37, 46], such information is not always available, especially for UVs that have not been reported before. Further, it has been suggested that these "ideal" criteria cannot classify the majority of the UVs [37]. The classification methods used in the present study may serve as "readily available" additional information to classify UVs.

Conclusion

The present study suggests that the application of different methodologies such as allele frequency, Polyphen, the GMS, and sequence conservation may be useful for evaluating UVs, especially when little functional or clinical data are available. While we found high correlations between these classification methods, our study suggests that each method has different levels of false-positives and false-negatives. The Polyphen algorithm appeared more appropriate in identifying high-risk variants whereas the allele frequency may be useful in classifying high-frequency variants as nonimportant. Although our study does not directly address the question of whether each specific UV is associated with the risk of breast cancer, our results suggest that these methods could be helpful in understanding the significance of a UV especially when other clinical or genetic information is not available. Further, the application of these methods may help to prioritize UVs for further functional or familial study.