Introduction

Central nervous system (CNS) tumors, including brain and spinal cord tumors, have an estimated incidence rate of 18.71 cases per 100,000 person-years (American Brain Tumor Association http://www.abta.org/). In children under age 15, CNS tumors are the leading cause of solid tumor death (Jemal et al. 2008). CNS tumors can be classified based on tumor location and histology. According to the 2009 statistical reports of the Central Brain Tumor Registry of the United States (CBTRUS), the majority of primary CNS tumors are meningiomas (33.4%), glioblastomas (17.6%) and astrocytomas (7.4%). Currently, diagnosis and treatment of brain tumors depend heavily on tumor histology, although brain tumors sharing the same histological features often display variable drug response and clinical behavior. These differences have been shown to be due to numerous genetic and epigenetic aberrations occurring during brain development (Zlatescu et al. 2001; Mueller et al. 2002; Brennan et al. 2009; Gravendeel et al. 2009). Since cancer development and progression result from an accumulation of genetic and/or epigenetic alterations (Vogelstein and Kinzler 1993), a comprehensive understanding of the molecular mechanisms underlying these events is needed for the development of more effective treatment regimens. Much effort has been made to discover genetic and epigenetic events underlying brain tumor development, including a number of comprehensive studies aimed at uncovering alterations in gene expression, DNA methylation, DNA copy number and/or chromosomal structure.

Coordinated efforts have been undertaken by various centers and institutes to address the genomic changes involved in human cancers. The Cancer Genome Atlas (TCGA, http://www.cancergenome.nih.gov) has provided a large number of datasets, on both gene expression and DNA methylation, for glioblastoma multiforme (GBM). The Repository of Molecular Brain Neoplasia Data (REMBRANDT, http://caintegrator-info.nci.nih.gov/) provides molecular research as well as clinical trials data on brain tumors including gliomas. The REMBRANDT site also provides tools to analyze clinical and experimental data across multiple clinical trials and studies.

In addition, a number of studies on brain tumors have generated large amounts of research data. For gene expression studies, both sequencing- and hybridization-based approaches have been utilized to identify molecular markers for the sub-classification of brain tumors, including Serial Analysis of Gene Expression (SAGE) and multiple expression microarray platforms. In addition, a number of gene expression studies have been conducted to correlate transcript profiles of particular brain tumors with their clinical behavior, response to particular drugs, or recurrence incidences (Freije et al. 2004; Bandres et al. 2005; Hegi et al. 2005; Hoelzinger et al. 2005; Rich et al. 2005; Verhaak et al. 2010). These studies have yielded a number of candidate genes that might prove valuable as prognostic markers and therapeutic targets.

Epigenetic modifications, and DNA methylation in particular, have been shown to be involved in tumor initiation, progression and response to drugs (Hegi et al. 2005; Smith et al. 2007). Methylation of DNA can suppress gene transcription either by preventing the binding of transcription factors (Kim et al. 2003), or by enabling the binding of chromatin repressors (Bird and Wolffe 1999). Large methylation datasets have been generated in an effort to develop the methylation landscape of normal human brain (Rollins et al. 2006; Schumacher et al. 2006; Wu et al. 2010).

Alu retrotransposons, the most prevalent repetitive element in the human genome at 1.2 million copies, have also been associated with the development of various tumors (Lin et al. 1988; Chen et al. 1995; Miki et al. 1996). Alu family sequences may also disrupt the normal splicing pattern of a gene (Lev-Maor et al. 2003), and approximately 8,000 Alu elements have been predicted to have the potential to become exonized (Sorek et al. 2004). Moreover, the continued amplification of Alu family sequences has given rise to insertion polymorphisms within human populations (Batzer et al. 1995; Xing et al. 2009).

Lastly, DNA copy number alterations, both losses and gains, are commonly observed in tumors and they are bound to affect gene expression (Taylor et al. 2005; Deshmukh et al. 2008).

Uncovering the molecular mechanisms underlying development and progression of brain tumors requires seamless integration of large genomic, epigenomic and transcriptomic datasets. Large amounts of data have been generated in numerous studies using various techniques, but the resulting datasets are typically in different formats and often not easily retrievable. Here we report the development of a database, Brain Tumor Epigenomics/genomics Database at Children’s Memorial Hospital (BTECH), designed to store genomic, epigenomic and transcriptomic datasets derived from brain tumor studies and to facilitate their integration. A genome browser has been installed to allow simultaneous visualization of genomic, epigenomic and transcriptomic features. The BTECH database contains manually curated datasets extracted from publications and various other public resources as well as up-to-date DNA methylation data generated in our laboratory from normal brain and brain tumor tissues. We anticipate that BTECH will prove invaluable to investigators conducting genomic, epigenomic and/or transcriptomic analyses of brain tumors in their pursuit of the molecular mechanisms underlying development and progression of these tumors.

Data Acquisition and Processing

BTECH harbors a wide range of molecular datasets on, or of relevance to, brain tumors. These include gene expression, DNA methylation, DNA copy number alterations (gains and losses) and structural chromosomal alterations, as well as lists of genes of special interest (genes related to cancer or to brain tumors, and genes differentially expressed in the brain or in brain tumors) that were manually curated from public datasets. In addition, BTECH includes general genomic annotations as well as specific data on subsets of Alu elements, including data on insertional polymorphism, transcription and methylation. These are listed in Table 1 along with pertinent information concerning the datasets.

Table 1 Data contents in BTECH (as of October 2010)
  (i) Gene Expression Datasets

    To compile a collection of gene expression profiles from various brain tumor studies, queries were first conducted against either Entrez PubMed (http://www.ncbi.nlm.nih.gov/Literature/) or the NCBI Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/index.cgi). The queries for GEO were conducted with “brain tumor” as the keyword and “Homo sapiens” as the organism. Given the wide range of study purposes and experimental designs, only datasets using surgical tumor samples or drug-treated tumor cell lines were selected. In total, thirteen expression datasets were downloaded, derived from 105 different types or sub-types of brain tumors as well as from brain tumor cell lines (Freije et al. 2004; Bredel et al. 2005; Dong et al. 2005a; Ma et al. 2005; Pachiappan et al. 2005; Taylor et al. 2005; Bredel et al. 2006; Lee et al. 2006; Sun et al. 2006; Tso et al. 2006; Turkheimer et al. 2006; Donson et al. 2009). In addition, a large Serial Analysis of Gene Expression (SAGE) brain tumor dataset derived from seventy-three SAGE libraries (Boon et al. 2002) was retrieved from GEO. Altogether these datasets encompass 937 brain-related biological samples (Table 1). Each downloaded SOFT file was further processed (Supplementary Table S1) and the final values were stored in BTECH.
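The processing steps applied to each SOFT file are given in Supplementary Table S1 and are not reproduced here. Purely as an illustration of the file layout, the following minimal sketch reads the value table of a GEO DataSet-style SOFT text, assuming that metadata lines begin with `^`, `!` or `#` and that the table sits between `!dataset_table_begin` and `!dataset_table_end`. The function name `parse_soft_table` and the toy input are hypothetical and are not part of the BTECH pipeline.

```python
import io


def parse_soft_table(handle):
    """Return (header, rows) for the value table of a GDS-style SOFT text.

    Hypothetical helper for illustration; not the BTECH processing code.
    """
    header, rows, in_table = None, [], False
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("!dataset_table_begin"):
            in_table = True
            continue
        if line.startswith("!dataset_table_end"):
            break
        if not in_table or not line:
            continue  # skip metadata lines (^, !, #) outside the table
        fields = line.split("\t")
        if header is None:
            header = fields  # first table line names the columns
        else:
            rows.append(fields)
    return header, rows


# Toy input with made-up identifiers and values (not a real GEO record).
toy_soft = io.StringIO(
    "^DATASET = GDS0000\n"
    "!dataset_title = toy example\n"
    "#ID_REF = probe identifier\n"
    "!dataset_table_begin\n"
    "ID_REF\tIDENTIFIER\tGSM1\tGSM2\n"
    "probe_a\tGENE_A\t10.5\t11.0\n"
    "probe_b\tGENE_B\t7.2\tnull\n"
    "!dataset_table_end\n"
)
header, rows = parse_soft_table(toy_soft)
```

Missing values (shown here as `null`) are left as strings so that any downstream normalization can decide how to treat them.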

    Gene expression data on GBM from TCGA were downloaded from the TCGA data portal site (http://www.cancergenome.nih.gov/dataportal/data/about/, total 369 gene expression and 148 DNA methylation datasets as of September 30, 2010). For each dataset, three levels of data are available from TCGA (raw, normalized, and interpreted). Because the interpreted data are generated by merging the normalized data with the annotation data and therefore provide the most information, this level of data was downloaded. For the purpose of data integration in BTECH, a total of 80 samples with both gene expression and DNA methylation data were selected. The associated gene expression and DNA methylation data are stored in BTECH, together with the available clinical information for these samples.
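Selecting samples that have both data types amounts to intersecting two sets of sample identifiers. The short sketch below illustrates that step under the assumption that each dataset can be reduced to a list of sample IDs; the identifiers and the function name `samples_with_both` are made up for illustration.

```python
def samples_with_both(expression_ids, methylation_ids):
    """Return the sorted sample identifiers present in both collections."""
    return sorted(set(expression_ids) & set(methylation_ids))


# Made-up identifiers; real TCGA barcodes are not reproduced here.
expression_ids = ["S01", "S02", "S03", "S04"]
methylation_ids = ["S02", "S04", "S05"]
paired = samples_with_both(expression_ids, methylation_ids)
print(paired)
```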

    The REMBRANDT site (https://caintegrator.nci.nih.gov/rembrandt/) not only provides molecular research as well as clinical trials data on gliomas, but also provides tools to analyze clinical and experimental data across multiple clinical trials and studies. Because of the complexity of the clinical information, and to preserve the integrity of the REMBRANDT analyses, datasets from this resource are not included in BTECH. Instead, a hyperlink is provided for further reference.

    We also obtained expression data from Gene Expression Atlas2 provided by the Genomics Institute of the Novartis Research Foundation (human reference sequence NCBI Build 36.1). In this study the expression patterns for thousands of protein-coding genes as well as poorly characterized genes were examined (Su et al. 2004). From this study we downloaded the expression profile of 79 different normal human tissues including brain tissues from such areas as cerebellum, cortex, medulla oblongata, pituitary, pons, and temporal lobe.

  (ii) DNA Methylation Dataset

    Several large-scale DNA methylation datasets on human normal brain or brain tumors have been made public in the past few years (Rollins et al. 2006; Schumacher et al. 2006; Wu et al. 2010). The methylation datasets were retrieved from the publications of these studies. Moreover, as part of the Human Epigenome Project, the DNA methylation profiles of chromosomes 6, 20, and 22 were also generated; these were obtained from multiple samples derived from a total of 12 different tissue types (Eckhardt et al. 2006). Although brain tissue was not included in this study, the methylation data might prove valuable as a reference for future studies. Therefore, these datasets were downloaded as well (2006 release). As these datasets were generated using different techniques, four sets of Perl scripts were written in order to make the methylation datasets compatible with one another and also with the BTECH genome browser (Supplementary Table S2).
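The Perl scripts themselves are described in Supplementary Table S2. As a rough illustration of what such a harmonization step can look like, the sketch below (written in Python rather than Perl) formats one methylation measurement as a nine-column, tab-delimited GFF3-style line, a format commonly loaded into GMOD genome browsers. Whether BTECH uses exactly this target format is an assumption; the function `to_gff3`, the feature type `methylation_site` and the attribute keys are illustrative choices only.

```python
def to_gff3(chrom, start, end, fraction, source="toy_study", sample="sample1"):
    """Format one methylation measurement as a GFF3-style line.

    start and end are 1-based inclusive coordinates; fraction is the
    methylated fraction between 0 and 1. The feature type and attribute
    keys are illustrative, not taken from BTECH.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    attributes = f"ID={sample}_{chrom}_{start};sample={sample}"
    columns = [
        chrom, source, "methylation_site",
        str(start), str(end),
        f"{fraction:.3f}",  # score column carries the methylation level
        ".", ".",           # strand and phase are not applicable here
        attributes,
    ]
    return "\t".join(columns)


line = to_gff3("chr6", 1000, 1001, 0.85)
print(line)
```

Mapping every source dataset onto one shared record layout of this kind is what allows tracks from different techniques to be displayed side by side.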

    As mentioned above, methylation data from TCGA on GBM were downloaded from the TCGA data portal site. A total of 80 samples with both DNA methylation and gene expression data were selected. The corresponding DNA methylation and gene expression data, together with the clinical information for these samples, are available from BTECH.

    In addition, a high-throughput sequence-based strategy was developed that enables simultaneous determination of the methylation status of thousands of CpG sites localized within and in the 5′ flanking genomic region of over 30,000 Alu retrotransposons dispersed throughout the genome (Xie et al. 2009). This is of special relevance given that hypomethylation of repetitive elements has been shown to be strongly associated with cancer progression and poor clinical outcome (Cho et al. 2007; Estecio et al. 2007; Roman-Gomez et al. 2008). Hence, Alu repeats may prove invaluable as reporters of epigenomic alterations during tumor development and progression. Several ependymoma tumor specimens (including primary non-aggressive, primary aggressive, and recurrent tumors) and disease-free brain tissue samples were investigated using this technique (Xie et al. 2010). The resulting methylation datasets from all these samples are stored in BTECH and are hence easily retrievable. To preserve the integrity of the data and the sample set, they are listed under a separate category, namely “Soares’ Lab Methylation Data”, with separate tracks for individual tumor samples. The results from this dataset can be visualized through the genome browser (Fig. 1).

    Fig. 1 An example of an Alu element that shows lower methylation (yellow) in pediatric ependymomas compared to normal brain controls, and the lowest in recurrent tumors. This Alu element resides in an intron of the DNM2 gene. Datasets from gene expression studies on pediatric ependymomas indicate a higher expression (red) of this gene in tumor samples.

  (iii) Chromosomal Alteration Dataset

    DNA copy number alterations including both gains and losses are frequently associated with tumor initiation and/or tumor progression (Brito-Babapulle and Atkin 1981; Whang-Peng et al. 1982). Such alterations might result in decreased expression of a tumor suppressor gene or in increased expression of an oncogene. Three such datasets have been generated for brain tumors in recent years. One study used matrix-based comparative genomic DNA microarray to analyze chromosomal alterations in 68 ependymomas (Mendrzyk et al. 2006). Another study used cDNA microarray-based comparative genomic hybridization technology to profile copy number alterations in a series of 54 gliomas encompassing a wide range of histopathological types and tumor grades (Bredel et al. 2005). The third study analyzed 178 tumors for genomic alterations using the Affymetrix 100 K single-nucleotide polymorphism (SNP) arrays (Kotliarov et al. 2006). For each alteration identified in the aforementioned studies, BTECH contains information on genomic localization and, whenever possible, the number of times it has been observed (Supplementary Table S3).
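Recording the genomic localization of each alteration together with the number of times it has been observed can be pictured as a simple tally. The sketch below is a minimal illustration that assumes alterations are reported as (chromosome, start, end, type) tuples and treats two reports as the same alteration only when all four values match exactly, which is a simplification. The coordinates and the function name `tally_alterations` are made up and do not come from Supplementary Table S3.

```python
from collections import Counter


def tally_alterations(calls):
    """Count how often each (chrom, start, end, kind) alteration is reported."""
    counts = Counter()
    for chrom, start, end, kind in calls:
        if kind not in ("gain", "loss"):
            raise ValueError(f"unknown alteration type: {kind}")
        counts[(chrom, start, end, kind)] += 1
    return counts


# Made-up calls pooled from several hypothetical tumors.
calls = [
    ("chr7", 54000000, 56000000, "gain"),
    ("chr7", 54000000, 56000000, "gain"),
    ("chr9", 21000000, 22500000, "loss"),
]
counts = tally_alterations(calls)
for (chrom, start, end, kind), n in sorted(counts.items()):
    print(f"{chrom}:{start}..{end}\t{kind}\tobserved {n}x")
```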

  (iv) Genes of Interest

    a. Tumor-related Genes

      A series of genetic and epigenetic alterations occur during tumor initiation and progression, some conferring a growth advantage. Genes that are directly or indirectly affected by such alterations are referred to as tumor-related genes. We compiled a list of 3,432 tumor-related genes for inclusion in BTECH. These include oncogenes, tumor suppressor genes and genes whose altered expression has been associated with tumor development, progression and/or metastasis.

      The compiled list was derived from two resources: the NCBI database and the Affymetrix Human Cancer Array HC-G110 (Affymetrix, http://www.affymetrix.com/). First, we searched the “gene” category of the NCBI database for the keyword “tumor” or “cancer”. The NCBI database contains information from the Online Mendelian Inheritance in Man (OMIM), and provides a summary of known functions for the genes within the database. The resulting list contained a total of 3,223 genes, of which 214 were oncogenes, 57 were tumor suppressor genes and the remainder were genes with altered expression in tumors (as of September 30, 2010). Next, we included 365 genes from the Affymetrix Human Cancer Array HC-G110 that were represented in the “known gene” table in the UCSC database.

      In addition, we generated a list of 242 genes that have been reported to be specifically associated with brain tumors, once again using the NCBI database. In this case we searched the “gene” category using the brain tumor names from the WHO grading system. An annotation table for these 242 genes was generated using the UCSC genome database and subsequently included in BTECH.

    b. Brain Tumor Differentially Expressed Genes

      An initial search of PubMed for brain tumor related large-scale gene expression studies revealed twelve publications, eleven of which involved microarrays (Ljubimova et al. 2001; Mariani et al. 2001; Park et al. 2003; Raza et al. 2004; Dong et al. 2005b; Fathallah-Shaykh 2005; Hoelzinger et al. 2005; Mehrian Shai et al. 2005; Kikuchi et al. 2006; Bozinov et al. 2008; Rand et al. 2008), and one of which was based on SAGE analysis (Boon et al. 2002). All genes identified in these studies as differentially expressed were extracted for inclusion in BTECH. For the SAGE-based study we extracted the genes that were found to have a p-value of less than 0.05 after pair-wise comparisons between normal brain and different types of brain tumors. A total of 544 differentially expressed genes were thus extracted, of which 321 could be unambiguously mapped to specific genomic loci. These 321 human genes were included in the BTECH database, along with the respective annotations regarding the conditions under which they were identified as differentially expressed.

  (v) Alu Elements

    Almost half of the human genome is composed of repetitive elements, which consist of interspersed repeats and tandem repeats (Jordan et al. 2003). Alu elements, the most abundant short interspersed nuclear elements, account for approximately 10% of the human genome sequence (Kazazian 2004). Rather than being passive bystanders in the human genome, Alu elements play an active role in cancer predisposition and development. Regions that have a high density of Alu elements have been reported to be associated with cancer-inducing chromosomal alterations such as deletions, duplications, and translocations (Mauillon et al. 1996; Strout et al. 1998; Kolomietz et al. 2001; Gad et al. 2002; Teugels et al. 2005; Abo-Dalo et al. 2010; Iskow et al. 2010). A number of transcriptionally competent elements continue to amplify in the human genome by retrotransposition, thus leading to insertional polymorphism (Batzer et al. 1995; Cordaux et al. 2007). Moreover, Alu elements were shown to be associated with 5% of the alternatively spliced internal exons in the human genome (Lev-Maor et al. 2003). Approximately 8,000 Alus were predicted to have the potential to exonize (Sorek et al. 2004). In the BTECH database, we provide the annotations for all Alu elements in the human genome (approximately 1.2 million) based on the UCSC RepeatMasker table. In addition, we extracted the list of Alu elements that are polymorphic with respect to integration in human populations (Batzer et al. 1995; Cordaux et al. 2007), as well as the list of Alu elements that are prone to becoming exonized (Sorek et al. 2004).

    In addition to the genetic alterations that Alu elements may cause, hypomethylation of Alu elements has been shown to be strongly associated with cancer progression and poor clinical outcome (Riggs and Jones 1983; Feinberg and Tycko 2004; Cho et al. 2007; Estecio et al. 2007; Roman-Gomez et al. 2008). In a recent publication using a high-throughput sequencing-based technique, we reported the methylation status of 31,178 Alu elements and their 5′ flanking sequences in normal brain and in ependymomas of different grades (Xie et al. 2010). These data have been included in the BTECH database and are retrievable from the genome browser, where they are listed under the category “Soares’ Lab Methylation Data” with separate tracks for individual tumor samples (Fig. 1).

  (vi) General Annotations

    In addition to the molecular data and the selected gene lists discussed above, BTECH also harbors relevant genome annotations. The following genome annotations were downloaded from the databases indicated below:

    • RefSeq mRNA, Entrez genes, three-frame translation and DNA recombination rates were downloaded from the UCSC genome database (human reference sequence NCBI Build 36.1).

    • CpG islands were identified using the program described by Takai and Jones (2002), without masking the repetitive elements.

    • SNP annotations were obtained from dbSNP (http://www.ncbi.nlm.nih.gov/SNP/, build 127).

    • All microRNA elements were downloaded and extracted from the Sanger Institute (http://www.sanger.ac.uk/, release 14).

    • Chromosome ideogram information was from NCBI.

    • Genomic sequences were based on human reference sequence (NCBI Build 36.1).
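For the CpG island annotation in the list above, the criteria commonly attributed to Takai and Jones (2002) are a length of at least 500 bp, a G+C content of at least 55% and an observed/expected CpG ratio of at least 0.65. The sketch below only checks whether a single candidate sequence meets these thresholds; it is not the Takai and Jones program, which scans the genome with a sliding window, and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Obs/Exp CpG = (number of CpG * length) / (number of C * number of G)
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def meets_island_criteria(seq, min_len=500, min_gc=0.55, min_obs_exp=0.65):
    """Check one candidate region against the three thresholds."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp


# Toy sequences: a 600 bp CG repeat passes, a 600 bp AT repeat does not.
print(meets_island_criteria("CG" * 300))
print(meets_island_criteria("AT" * 300))
```

Because repetitive elements were not masked, GC-rich repeats such as Alu elements can contribute to regions that satisfy these thresholds.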

Database Design and Implementation

The BTECH database has been installed on a Mac OS X Server and is freely accessible through http://cmbteg.childrensmemorial.org. It consists of a relational database hosted by the MySQL database management system (http://www.mysql.com/). In addition, a genome browser has been installed to provide an integrated interface for displaying experimental datasets and annotations using the genome annotation viewer software components from GMOD (Generic Model Organism Database project) (Stein et al. 2002). The datasets displayed in the genome browser are represented by different tracks. These tracks are laid on top of the human genomic sequence, thus enabling simultaneous visualization of all corresponding molecular data, sequence features and annotations (Fig. 2). From the genome browser, users can query the database using gene names, symbols, GenBank accession numbers or genomic coordinates.
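A query by genomic coordinates can be thought of as an interval-overlap lookup over track features. The sketch below illustrates the idea with Python's built-in `sqlite3` module standing in for MySQL. The table `feature`, its columns, the helper names and the toy rows are all hypothetical; the actual BTECH schema is not described here.

```python
import sqlite3


def parse_region(text):
    """Parse a region such as 'chr22: 14540000..14640000'."""
    chrom, span = text.replace(" ", "").split(":")
    start, end = (int(x) for x in span.split(".."))
    return chrom, start, end


def features_in_region(conn, region):
    """Return (track, name) for features overlapping the region."""
    chrom, start, end = parse_region(region)
    # Two intervals overlap when each one starts before the other ends.
    cur = conn.execute(
        "SELECT track, name FROM feature "
        "WHERE chrom = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY start_pos",
        (chrom, end, start),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature ("
    "track TEXT, name TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER)"
)
conn.executemany(
    "INSERT INTO feature VALUES (?, ?, ?, ?, ?)",
    [
        ("genes", "GENE_A", "chr22", 14550000, 14560000),
        ("methylation", "site_1", "chr22", 14639990, 14640010),
        ("genes", "GENE_B", "chr22", 15000000, 15010000),
        ("genes", "GENE_C", "chr6", 14550000, 14560000),
    ],
)
hits = features_in_region(conn, "chr22: 14540000..14640000")
print(hits)
```

Keeping every track in one coordinate system is what makes a single overlap condition sufficient to pull expression, methylation and annotation features for the same region.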

Fig. 2 An integrative view of brain tumor genomic, epigenomic, and transcriptomic data using the genome browser. A detailed view can be found at: http://cmbteg.childrensmemorial.org/cgi-bin/gbrowse/btech/ with the query region chr22: 14540000..14640000

Future Directions

The BTECH database is the first integrated molecular database that has been designed specifically for brain tumor studies. The database will be continuously updated and expanded. Future development will focus on the design of query engines for the database. Such query engines would allow users to search the large collection of brain tumor study datasets and would support efficient data mining. In addition, we will analyze the various kinds of alterations and develop hypotheses for a better understanding of the molecular mechanisms underlying brain tumor development and progression.

Information Sharing Statement

BTECH and its data contents are freely available to the public from this website: http://cmbteg.childrensmemorial.org. The visualization of these data and the corresponding data resources can be accessed from: http://cmbteg.childrensmemorial.org/cgi-bin/gbrowse/btech/