Data description

Utility of the dataset

The cell line BT-474 was isolated by Lasfargues et al. [1] in 1978, from a biopsy of invasive ductal carcinoma from a 60 year old Caucasian female. Since that time it has become one of the most heavily utilized cell lines for breast cancer research. At the time of writing, entering the search term “BT-474 OR BT474” into PubMed resulted in 973 unique articles. Surprisingly, the complete genome sequence of this cell line has yet to be published. In this paper, we fill that void in the collective scientific knowledge by providing high coverage whole genome data for BT-474.

Previous studies have shown that BT-474 has a modal number of chromosomes approximating tetraploidy, and most of these chromosomes are covered with megabase-sized amplifications, deletions, and other structural rearrangements [2]. In an effort to provide better coverage of these complex rearranged regions, and to provide variant phasing and error correcting information, we generated high coverage libraries from long genomic DNA (~40 kb) using Long Fragment Read (LFR) technology [3, 4], and supplemented those libraries with a standard (STD) short mate pair library (~500 bp) [5] for a combined total coverage of over ~300X. We hope the freely available resource provided in this paper will benefit our understanding of the biology of cancer, and ultimately help to improve therapies for patients.

Library generation

DNA was isolated from low passage number BT-474 cells, procured from the American Type Culture Collection (ATCC, Manassas, VA, USA), using a RecoverEase dialysis kit (Agilent, Santa Clara, CA, USA). This material was further fragmented to 300–800 base pairs using a Covaris E220 (Covaris, Woburn, MA, USA), and processed using Complete Genomics’ proprietary standard library construction [5]. For LFR libraries, approximately 5 cells were collected and deposited into a 1.5 ml microtube with 10 μl of distilled water. Cells were lysed, and DNA was denatured using 1 μl of 20 mM KOH and 0.5 mM EDTA. Denatured genomic DNA was dispersed across a 384-well plate. In each well, long genomic fragments (~40 kb) were amplified, fragmented, and tagged with a unique barcode adapter as previously described [3]. All libraries were sequenced using Complete Genomics’ nanoarray sequencing platform [5].

BT-474 genome analysis

Read data of 343, 298, and 261 Gb from the STD, LFR1, and LFR2 libraries, respectively, were mapped to the NCBI human reference genome (build 37) using Complete Genomics’ pipeline [3, 5, 6] (Table 1), resulting in close to ~100X coverage in each of the libraries. The high coverage allowed more than 90 % of the genome and exome of each library to be called (Table 1.). Plotting reads falling within 100 kb consecutive windows for the BT-474 standard library resulted in the expected complex pattern of amplifications affecting almost all chromosomes [2] (Fig. 1). Known amplifications of ERBB2 and the HOX gene cluster on chromosome 17 [2] are readily identifiable from this plot, as well as many other megabase-sized highly amplified regions. Analysis of both standard and LFR libraries resulted in the discovery of 110, 175, and 145 interchromosomal translocations in the STD, LFR1, and LFR2 libraries, respectively (Table 2). Clustering these translocations, based on windows of 5 kb around the breakpoints, led to the overlap of many translocations within and between libraries, and an overall reduction in the total number of translocations to 291 (Table 2 and Fig. 2). Additionally, comparing our results to a published RNA sequencing analysis of BT-474 [7, 8] demonstrated that three of the five coding interchromosomal translocations were called in our data (Table 3). In the remaining two translocations that were not called by our algorithms, raw reads were found to support their existence in our libraries; for the STARD3-DOK5 translocation, improved algorithms would most likely detect this event. In the case of the TRPC4AP-MRPL45 translocation only one mate pair read in the STD library was found in support, making it unlikely to have been called even with modifications to our algorithms.

Table 1 BT-474 genome statistics
Fig. 1
figure 1

100 kb read coverage. For the standard library of BT-474 reads were averaged across consecutive 100 kb bins, normalized to a tetraploid copy number, and plotted such that each dot represents the coverage of a single 100 kb region of the genome. Y-axis shows haploid copy number; x-axis shows genome position increasing from left to right for chromosome and position

Table 2 Potential translocations identified in BT-474
Fig. 2
figure 2

Overlap of inter-chromosomal translocations between libraries. Interchromosomal translocations were called for each library. The overlap between libraries was determined by considering translocations found within 5 kb of each other, and in the same orientation, to be the same event. This also resulted in the aggregation of multiple close translocations within the same sample into a single event. In total, 109 translocations were found in the standard library (black), and 147 and 133 were found in LFR libraries 1 (blue) and 2 (green), respectively. Of these, 85 interchromosomal translocations are shared between at least two libraries

Table 3 Translocations confirmed by published RNA sequencing data

Single nucleotide variants (SNVs) numbering 3.24 million were called in the STD library, and over 2.85 million in each of the LFR libraries. Of these, 2.84 million were called in all libraries (Fig. 3), demonstrating good reproducibility between different methods of library construction. For all libraries the ratio of heterozygous to homozygous was close to 1; a ratio much lower than the expected ~1.6 for Caucasian genomes. This is most likely the result of loss of heterozygosity (LOH) from the deletion or multi-copy amplifications of large portions, and/or the complete parental copy of almost all chromosomes in the BT-474 genome, as seen in our data (Fig. 1 and Fig. 4), and as previously described [2]. This was confirmed by estimating what would happen to heterozygous variants in the NA12878 genome (the sample used by the ‘Genome in a Bottle’ Consortium [9]) in two scenarios: if the same percentage of the genome was LOH based on 1) the percentage of the genome lost, or 2) the percentage of variants lost (22.2 % and 20.3 %, respectively, Table 4). In both cases the ratio of heterozygous to homozygous variants was reduced to close to 1 (Table 5).

Fig. 3
figure 3

Overlap of called variations between libraries. Single nucleotide variants (SNVs) numbering 3.24 million were called in the STD library, and over 2.85 million in each of the LFR libraries. The overlap between each library was compared and plotted. The standard library (black), and LFR libraries 1 (blue) and 2 (green) are highly overlapping, demonstrating that the majority of the variant calls are highly reproducible between separately processed sequencing libraries

Fig. 4
figure 4

Circos plot of the BT-474 genome. Chromosome number (in large bold numbers and letters), chromosome position (in small numbers), and a karyotype ideogram form the outer circle of the plot. The remaining circles are described in order of outermost to innermost: called ploidy (the copy number of region; blue-gray), Lesser Allele Fraction (LAF, the fraction of the lesser allele, 0.5 for a heterozygous SNP, 0 for a homozygous SNP; green), density of heterzogous SNPs (orange), and density of homozygous SNPs (blue). Lines in the center of the plot represent interchromosomal junctions

Table 4 Calculation of the amount of LOH in BT-474
Table 5 NA12878 simulation of LOH event in BT-474

Analysis of the coding regions of a comprehensive list of known cancer-causing genes [10, 11] identified 67 small variants (<50 base pairs, Additional file 1). Most of these are probably inherited variants with no involvement in tumor formation, however variants in TP53 and PIK3CA, previously found as somatic mutations in many tumors [12], were found in this cell line (Additional file 1). Also identified in our data: a potentially inherited variant in CHEK2, listed as ‘likely to be pathogenic’ in the ClinVar database [13]. To demonstrate the quality of our variant calls we compared them to a list generated by targeted sequencing of BT-474 as part of the Cancer Cell Line Encyclopedia (CCLE) project [14]. When the data from all three libraries were combined, 92 % of the variants found in CCLE were also called in our data, suggesting that our BT-474 genome is of good quality (Fig. 5 and Additional file 2). Further, 130 variants were found in two or more of our libraries that were not found in the CCLE data. This is either because the exons in which these variants were found were not covered as part of the CCLE target set, these variants were missed in the CCLE sequencing analysis, or to a lesser extent they are false positives in our dataset (Additional file 2).

Fig. 5
figure 5

Data directory tree. The output from the LFR process consists of a series of files and folders. A complete description of everything contained within the Complete Genomics data package can be found in the Additional file 3

Availability of supporting data

Complete Genomics data formats

The entire data set from Complete Genomics, provided here, consists of a series of files and directories covering various categories of whole genome analysis (Fig. 6). A complete description of all files and the methods used to generate them can be found in the “Standard Sequencing Service Data File Formats v2.5” document provided by Complete Genomics, Inc. (available in Additional file 3 [15]).

Fig. 6
figure 6

Overlap of called variations between libraries and CCLE data. Variants in the standard (black), and LFR 1 (blue) and 2 (green) libraries from this study found in the genes analyzed in the CCLE study, were compared to those variants called for BT-474 by the CCLE (orange). Eighty-nine percent of variants found by CCLE (orange) in BT-474 were also found within at least one of our libraries

LFR-specific files

Data packages from LFR do not include directories for structural variation (SV) or mobile element insertion (MEI; for more information on the content of these directories see the “Standard Sequencing Service Data File Formats v2.5” file mentioned above. In addition, one of the fields in the variant file (hapLink) is modified and there are six new fields described below:

  • hapLink: LFR phased variants have an ID with the pattern: “Phased_#_#_#”, where # is an integer, the first two #s describe unique contigs, and the last # in the series is either 1 or 0 and represents the two possible haplotypes for each contig. All SNPs sharing the same “Phased_#_#_#” are from the same haplotype.

  • wellCount: total number of LFR wells (out of 384) containing sequence reads calling the variant or reference allele. This metric is used to identify polymerase-induced false positive calls, since it is unlikely that random polymerase errors will occur in multiple different wells. A complete explanation of this concept can be found in Peters et al. [4].

  • wellIDs: contains the IDs of the specific wells from which reads calling the variant come.

  • exclusiveWellCount: this is the number of wells at each locus that have reads calling only the variant or the reference allele (not both). For true heterozygous variants this number should be close to that obtained for “wellCount”.

  • SharedWellCount: at each locus this is the number of wells that contain reads calling both alleles. For true heterozygous variants this should be low; having a high number here suggests mapping errors. For homozygous variants almost all of the well counts should be in this field.

  • MinExclusiveWellCountInThisLocus: this is the minimum number of exclusive wells (non-shared well counts) at each locus.

  • MaxExculisveWellCountInThisLocus: this is the maximum number of exclusive wells (non-shared well counts) at each locus.

LFR structural variant analysis files

Each LFR genome contains an LFR-specific structural variant file in the ASM directory (see Fig. 2 for directory tree). This file is generated using a novel algorithm that identifies unexpected mate-pairs that are found in more than one compartment of an LFR library (manuscript in preparation). A full description of the headers can be found within each file under the Excel tab labeled “Header Description”.

Read and mapping data for all genomes reported here are available at the European Nucleotide Archive (ENA) under study accession number PRJEB10587. Sample accession numbers for each sequence library can be found in Table 1. Supporting data is also available from the GigaScience GigaDB database [16].