Integrative computational biology for cancer research
Over the past two decades, high-throughput (HTP) technologies such as microarrays and mass spectrometry have fundamentally changed clinical cancer research. They have revealed novel molecular markers of cancer subtypes, metastasis, and drug sensitivity and resistance. Some have been translated into the clinic as tools for early disease diagnosis, prognosis, and individualized treatment and response monitoring. Despite these successes, many challenges remain: HTP platforms are often noisy and suffer from false positives and false negatives; optimal analysis and successful validation require complex workflows; and great volumes of data are accumulating at a rapid pace. Here we discuss these challenges, and show how integrative computational biology can help diminish them by creating new software tools, analytical methods, and data standards.
Keywords: Microarray Platform; Gene Ontology Category; International Cancer Genome Consortium; Human Interactome; Distribute Data Source
Since the commercialization of DNA microarray technology in the late 1990s, high-throughput (HTP) data relevant to cancer research have been accumulating at an ever-increasing rate. These data have led to crucial insights into fundamental cancer biology, including the mechanisms of tumorigenesis, metastasis, and drug resistance (Rhodes and Chinnaiyan 2005). They have also had enormous clinical impact: for example, several cancers can now be fractionated into therapeutic subsets with unique prognostic outcomes based on their molecular phenotypes (Buyse et al. 2006; Dhanasekaran et al. 2001; Lowe et al. 2010; Pegram et al. 1998; Slamon and Press 2009; Spentzos et al. 2004; Zhu et al. 2010b). Despite these successes, many cancers still have a high mortality rate and no effective treatment. Looking at 1.9 million patients from 31 countries and 5 continents, the CONCORD study found that current treatments achieve a 5-year survival rate for less than 50% of diagnosed cancers (Coleman et al. 2008). For many cancers, survival rates have not changed in decades: pancreatic cancer remains almost 100% lethal, and the overall survival rate for lung cancer has improved only from 13% to 16%. Most cancers still lack any effective early disease biomarkers, and predictive signatures are limited to a few known mutations, such as EGFR or KRAS in lung cancer or HER2 in breast cancer. Predictive and prognostic biomarkers are often inconsistent from study to study (i.e., they show poor overlap), and cannot be validated by other methods or in new cohorts of patients (Diamandis 2010; Dupuy and Simon 2007; Lau et al. 2007).
The key difficulty is that cancer is a complex and heterogeneous disease: many genes are amplified, deleted, mutated, and up- or down-regulated. Many pathways are activated or suppressed. These changes vary substantially in different cancers, in different patients with the same cancer, and even in different tumor samples from the same patient (Axelrod et al. 2009; Bachtiary et al. 2006; Blackhall et al. 2004). To get the full picture, we will need to combine information from diverse experimental platforms and other sources that offer different perspectives on the problem, e.g., gene and protein expression, protein–protein interactions (PPIs) and pathways, chromosomal aberrations, mutation events, epigenetic changes, and clinical information from drug trials and the bedside—leading to integrative computational biology.
The challenges fall into three main categories. The first is noise: HTP platforms are inherently noisy—results vary substantially from run to run and from lab to lab, and are prone to false positives and negatives. The second challenge is volume: there is a vast quantity of relevant data, new data are piling up at an increasing rate and old data need to be constantly reinterpreted and updated in the light of new findings or reanalyzed with improved algorithms. The third challenge is analysis: simple methods—such as differential expression analyses of microarray data—often miss much of the signal in the data.
Integrative computational methods will continue to play a central role in addressing these challenges. We need new designs for databases; new software and workflows to combine and continuously update heterogeneous and distributed data; new analytical methods to identify complex signals in different data sources; and new standards for generating, maintaining, and sharing data. These methods will depend on advances in many areas such as statistics, knowledge representation and ontology, machine learning, data mining, graph theory, and visualization. Integrative analyses may ultimately lead to a better understanding of cancer, earlier diagnoses and true personalized medicine, where therapies are individually tailored based on combinations of single nucleotide polymorphisms, and gene, protein and microRNA expression levels (Auffray et al. 2009; Augen 2001; Cervigne et al. 2009; Reis et al. 2010; Zhu et al. 2010a, b).
Integrative computational biology shares many tools and goals with the closely related field of systems biology, the discipline that attempts to explain the structure and behavior of complex biological systems as a function of their simpler components (Kirschner 2005). The term systems biology may be applied to describe anything from the output of an ‘omics’ experiment to explicit mechanical models of tumor growth (Deisboeck et al. 2009; Hornberg et al. 2006). Integrative computational biology, in contrast, specifically concerns the computational interrogation and integration of diverse HTP molecular biology data.
In this review, we discuss the major challenges of HTP cancer studies and give examples of integrative computational methods that help to meet them.
The challenges facing HTP cancer biology
A wealth of genomic and proteomic cancer data is now available from HTP screens. While these data have improved our understanding of basic cancer biology and some have even translated into improved patient diagnosis and treatment, significant challenges remain. In this section we briefly review some of the major obstacles to the progress of HTP cancer research.
Cancer is a heterogeneous disease, and HTP platforms are noisy, i.e., the resulting data sets have false positives and false negatives. Consequently, there can be large variation in results from lab to lab (Bell et al. 2009; Irizarry et al. 2005); methods are needed to control this noise and to integrate different data sources to make them more reliable. The main noise issues that plague HTP cancer research are false negatives and false positives, intra- and inter-sample heterogeneity, and platform bias.
False negatives and false positives in HTP data
HTP screens suffer from substantial noise—both false positives and false negatives—which must be resolved through complementary experiments and computational analysis (Auffray et al. 2009; Augen 2001). For example, while HTP PPI screens can identify thousands of protein interactions at once, they do so at the cost of either high false discovery rates or poor sensitivity. The false discovery rate is the proportion of detected interactions that are false, and sensitivity is the proportion of true interactions that are successfully detected. When interactions detected by two HTP studies were tested in small-scale screens, they were found to have false discovery rates of 22% (Rual et al. 2005) and 38% (Stelzl et al. 2005). An evaluation of five HTP methods found that their sensitivity rates ranged from only 21 to 36% at false discovery rates of 0–11% (Braun et al. 2009). Similarly, mass spectrometry analyses of human serum typically produce many false negatives (Gstaiger and Aebersold 2009). One problem is that human serum has a high dynamic range—protein concentrations are estimated to vary over ten orders of magnitude (Gstaiger and Aebersold 2009). Mass spectrometers have a much smaller range of detection, leading to false negatives: low abundance proteins are not detected. This challenge may be somewhat diminished by extensive sample fractionation (Kislinger et al. 2006).
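The two error measures defined above can be made concrete with a short sketch. The Python below is illustrative only: the protein labels, the toy "reference" set of true interactions, and the helper names `pair`, `false_discovery_rate` and `sensitivity` are assumptions introduced here, not data or code from the cited studies.

```python
def pair(a, b):
    """Represent an undirected interaction as an order-independent pair."""
    return frozenset((a, b))


def false_discovery_rate(detected, reference):
    """Proportion of detected interactions that are not in the reference set."""
    detected = set(detected)
    if not detected:
        return 0.0
    return len(detected - set(reference)) / len(detected)


def sensitivity(detected, reference):
    """Proportion of reference (true) interactions that were detected."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(set(detected) & reference) / len(reference)


# Hypothetical toy data: five true interactions, four reported by a screen.
reference = {pair("A", "B"), pair("A", "C"), pair("B", "D"),
             pair("C", "D"), pair("D", "E")}
detected = {pair("B", "A"), pair("B", "D"), pair("C", "D"), pair("A", "E")}

fdr = false_discovery_rate(detected, reference)   # 1 of 4 reported pairs is false
sens = sensitivity(detected, reference)           # 3 of 5 true pairs are recovered
print(f"FDR = {fdr:.2f}, sensitivity = {sens:.2f}")
```

In practice the reference set is itself incomplete, so both quantities can only be estimated, for example by retesting a sample of interactions in small-scale screens as described above.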
Tumors are heterogeneous
Often only a single sample from each tumor is available for analysis. But tumors are highly heterogeneous, so these samples may not be representative of the whole tumor (Axelrod et al. 2009; Bachtiary et al. 2006; Blackhall et al. 2004). Tumors comprise cells belonging to several distinct subpopulations—e.g., tumor regions can be hypoxic to different extents, or be made up of different proportions of tumor initiating cells (cancer stem cells)—and these differences have consequences for predicting drug response and prognosis (Axelrod et al. 2009; Blackhall et al. 2004; Jubb et al. 2010). Intra-tumor heterogeneity can lead to different cell populations expressing different levels of protein, or having different mutations and copy number alterations, which complicates analyses. For example, variable tumor epithelial and stromal cell content in breast tumor samples can significantly affect gene expression profiles and signature accuracy (Cleator et al. 2006; Myhre et al. 2010). Some of these difficulties can be alleviated with techniques such as laser-capture micro-dissection (Fend and Raffeld 2000), which can isolate regions of a sample that contain a more uniform population of cells, but these techniques remain costly and slow. In addition, they may not result in a sufficient amount of material for follow-up experiments. Though single samples from tumors can suffice for population-level studies (Axelrod et al. 2009), the fact that two samples from the same patient can be quite different means that this heterogeneity poses a significant challenge to personalized medicine. On top of that, in cancer studies there are issues with the control samples available for analysis: often these “normal” samples come from tissue directly adjacent to the tumor. The properties of these samples, such as gene expression profiles, may be quite different from those of more distant tissue or of a healthy patient (Chandran et al. 2005).
Different experimental platforms can disagree
Experimental platforms produced by different companies can yield conflicting results (Curtis et al. 2009; Elias et al. 2005; Tan et al. 2003). For example, different microarray platforms use different probe designs and labeling and have different dynamic ranges. A gene may appear overexpressed in cancer on one platform, yet under-expressed on another, simply because the two platforms use different DNA sequences to “probe” the same gene. On a given platform, some genes are represented by many probes while others are represented by one or none, and this representation differs between platforms. Genes that are expressed at low levels are particularly problematic for concordance across array platforms (Barnes et al. 2005). Another problem is that genome annotations continue to grow and change (Eggle et al. 2009). Updated probe set definitions can substantially affect the number and the identity of differentially expressed genes (Sandberg and Larsson 2007). Further, HTP technology is developing rapidly and new technologies are coming into use every year, so there will always be new differences and conflicts to resolve. The challenge is to have infrastructure set up to compare and integrate new technologies as they come along, and to systematically identify the best workflow for data processing and analysis (e.g., Ponzielli et al. 2008).
A primary goal of integrative computational biology analysis in individualized medicine is to identify small groups of genes/proteins/microRNAs, etc., that can be used to improve diagnosis, predict outcome or predict treatment response, i.e., to identify prognostic or predictive signatures. Their identification in HTP data is challenging because basic analysis methods fail to capture the entire signal in the data, and good signatures do not consist solely of the most differentially expressed molecules. In this section we focus on gene expression microarray studies, but the criticisms apply equally well to similar experimental designs, such as those using HTP protein or microRNA assays.
Lists of differentially expressed genes show poor overlap across studies
The most popular way of analyzing microarray data is to detect whether individual genes are differentially expressed in one condition versus in another, e.g., in non-responders versus responders to some drug treatment. Differentially expressed genes are widely used in cancer research—their applications include deriving molecular signatures of cancer subtypes, increasing our understanding of the biology of tumorigenesis, and providing new candidate markers for diagnosis, prognosis, and drug response. Unfortunately, there are major challenges with their identification and interpretation. Previous work has shown that lists of differentially expressed genes are poorly reproduced across studies (Tian et al. 2005; Zhang et al. 2008); even random subsets of samples from one experiment can yield widely divergent gene lists. These problems are caused by high dimensionality, small number of samples, and noise (biological and technical variability), but they can be exacerbated by the analysis method. For example, analyses that quantify the differential expression of gene groups rather than individual genes show higher conservation across platforms and studies (Subramanian et al. 2005).
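A minimal sketch can illustrate how per-gene differential expression is scored and how the overlap between two studies' gene lists can be quantified. Everything in it is an assumption made for illustration: the gene labels, the expression values, the use of a Welch t-statistic for ranking, and the Jaccard index as the overlap measure. The cited studies may have used different statistics.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch t-statistic comparing the mean expression of two sample groups."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0:
        return 0.0  # degenerate case, treated as uninformative in this sketch
    return (mean(group_a) - mean(group_b)) / se


def top_genes(study, k):
    """Return the k genes with the largest absolute t-statistic."""
    ranked = sorted(study, key=lambda g: (-abs(welch_t(*study[g])), g))
    return set(ranked[:k])


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two gene sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical data: gene -> (responder values, non-responder values).
study_1 = {
    "G1": ([5.0, 5.2, 4.8], [1.0, 1.1, 0.9]),
    "G2": ([3.0, 3.5, 2.5], [2.9, 3.4, 2.6]),
    "G3": ([4.0, 4.1, 3.9], [2.0, 2.2, 1.8]),
    "G4": ([1.0, 2.0, 3.0], [1.5, 2.5, 2.0]),
}
study_2 = {
    "G1": ([5.1, 4.9, 5.0], [1.2, 0.8, 1.0]),
    "G2": ([4.0, 4.2, 3.8], [2.0, 2.1, 1.9]),
    "G3": ([3.0, 3.4, 2.6], [2.9, 3.1, 3.0]),
    "G4": ([2.0, 2.5, 1.5], [2.1, 1.9, 2.0]),
}

list_1 = top_genes(study_1, k=2)
list_2 = top_genes(study_2, k=2)
overlap = jaccard(list_1, list_2)
print(sorted(list_1), sorted(list_2), round(overlap, 2))
```

With only three samples per group, as in this toy example, a single noisy gene is enough to change which genes make the list, which is the instability described above.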
The most differentially expressed genes do not yield the best signatures
Very often in microarray experiments, the most differentially expressed genes are used to construct prognostic or predictive signatures: machine learning methods are trained to use the expression levels of those genes to predict, e.g., the disease state of a patient and probability of survival. The problem with this approach is that single-gene analyses overlook multivariate effects. More sophisticated analyses are needed to identify sets of genes that complement one another, i.e., ones whose combined expression levels yield the best-performing prognostic signatures. Such analyses show that genes essential to a good prognostic signature are often not highly differentially expressed on their own (Chuang et al. 2007; Fujita et al. 2008).
Signatures validate poorly on other data sets
One of the most important contributions of HTP biology to cancer research has been to develop prognostic and predictive signatures. Unfortunately, many signatures have failed to validate by other methods or in new cohorts of patients (Lau et al. 2007; Zhu et al. 2008). Obviously, this is a great concern for personalized medicine. Existing prognostic and predictive biomarkers for the same condition overlap only partially, and the set of biomarker genes identified depends strongly on the subset of patients used to generate it (Ein-Dor et al. 2005). One study estimated that to achieve 50% overlap in prognostic gene sets for breast cancer patients would require several thousand samples (Ein-Dor et al. 2006). Several factors contribute to these problems, including: (1) diverse patients and heterogeneous tumors, (2) different profiling platforms, (3) diverse statistical and bioinformatics approaches to biomarker identification (Shedden et al. 2008), (4) an insufficient number of samples (Ein-Dor et al. 2006), and (5) the existence of multiple equivalent signatures (Boutros et al. 2009; Ein-Dor et al. 2005).
The pace of discovery is rapid, data are piling up, and annotations are changing all the time. We need integrated workflows to organize data sets and make them easy to access, incorporate, annotate and update.
Large amounts of data need to be integrated and maintained
Many disease-related targets remain uncharacterized
Many disease-related targets remain uncharacterized; we have little or no information about their protein interaction partners, the pathways they control or are affected by, or their splice variants, functional mutations, or protein structure. For example, while there are roughly 21,000 genes in the human genome (Clamp et al. 2007), the Reactome and KEGG Pathway databases contain pathway annotations for only about 4,000 of these (Kanehisa et al. 2008; Matthews et al. 2009). And while many studies have shown that interaction networks provide information that can substantially improve predictive signatures (Chuang et al. 2007; Fortney et al. 2010; Nibbe et al. 2010), they can do so only for genes represented in the interactome.
The human interactome is poorly characterized
PPI networks provide biological context and meaning for gene signatures and yield better network-based prognostic and predictive signatures. Unfortunately, with current experimental knowledge the coverage of the human interactome is only around 13%. The human interactome is estimated to have around 650,000 interactions (Stumpf et al. 2008), and currently there are only 82,857 unique experimentally derived interactions (and an additional 55,471 unique interologous interactions) in I2D, a database that integrates HTP PPI data sets and most online PPI databases [such as BioGRID (Stark et al. 2006), BIND (Bader et al. 2003), DIP (Xenarios et al. 2000), HPRD (Peri et al. 2004), InnateDB (Lynn et al. 2008), IntAct (Hermjakob et al. 2004) and MINT (Zanzoni et al. 2002)]. While HTP methods can help to identify many protein interactions, their overlap is usually low, and even when combined they were shown to have a false negative rate of around 41% (Braun et al. 2009). Interaction dynamics with respect to time and localization remain largely unknown. Also, though over 92% of human genes have splice variants (Wang et al. 2008), and many genes may have hundreds or even thousands of such variants [for instance, the Drosophila gene Dscam may have over 38,000 alternative splice forms (Schmucker et al. 2000)], we still lack information about most variant-specific interactions.
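The coverage figure quoted above follows directly from the interaction counts in the text, as the short calculation below shows. The constant and function names are hypothetical. The combined figure assumes that the experimental and interologous sets are disjoint and that every interolog is a true interaction; the text implies the first assumption but states neither.

```python
# Counts taken from the text above.
ESTIMATED_INTERACTOME_SIZE = 650_000
EXPERIMENTAL_INTERACTIONS = 82_857
INTEROLOGOUS_INTERACTIONS = 55_471


def coverage(known, estimated_total):
    """Fraction of the estimated interactome that is currently known."""
    return known / estimated_total


experimental_coverage = coverage(EXPERIMENTAL_INTERACTIONS,
                                 ESTIMATED_INTERACTOME_SIZE)
# Assumption: interologs are disjoint from the experimental set and all correct.
combined_coverage = coverage(EXPERIMENTAL_INTERACTIONS + INTEROLOGOUS_INTERACTIONS,
                             ESTIMATED_INTERACTOME_SIZE)

print(f"experimental only: {experimental_coverage:.1%}")
print(f"with interologs:   {combined_coverage:.1%}")
```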
How integrative computational biology can address these challenges
The field of integrative computational biology uses techniques from computer science, mathematics, physics and engineering to comprehensively analyze and interpret biological data. Through the creation of new analysis and visualization methods, software tools and databases, it can help diminish the challenges to HTP cancer biology. Here we present some successful applications of integrative computational biology to cancer research. These applications fall into four main categories: data integration, network analysis, databases, and standards.
As we have seen, noise in HTP cancer studies can arise both from biological and technological variability. One of the most effective strategies for reducing both types of noise is data integration. The idea is simple: we can be more confident about the result of an experiment if similar experiments yielded similar results. We can integrate different experiments that measure the same biological entity, such as microarray studies measuring tumor versus normal gene expression differences on different experimental platforms. We can also integrate different data types, such as mutation, expression, and proteomic data. Clearly, data integration can increase our confidence in results that are consistent across multiple studies and experimental modalities. But data integration can also increase sensitivity, since different platforms and methods exhibit different biases—e.g., some protein interactions may be undetectable by some methods. Integrative computational approaches can also be complemented by better experimental design: e.g., in tumor profiling, analyzing multiple samples from the same patient reduces the effect of intra-tumor heterogeneity (Bachtiary et al. 2006; Blackhall et al. 2004), and executing multiple MudPIT (Multi-dimensional Protein Identification Technology) runs significantly improves sensitivity (Gortzak-Uzan et al. 2008; Kislinger and Emili 2005; Sodek et al. 2008; Wei et al. 2011).
Integrating the same type of data across multiple platforms and studies
With microarray and similar data, small sample numbers and different experimental platforms can lead to highly variable results. These problems can be addressed by combining data from different studies and platforms, which increases the effective number of samples and helps control for inter-platform heterogeneity.
Most approaches to integrating microarray data can be divided into two general classes, pooling and meta-analyses. In pooling, multiple expression data sets are merged into a single data set (van Vliet et al. 2008; Warnat et al. 2005); typically, gene measurements from each separate study are transformed before pooling to make the experiments more comparable (Fierro et al. 2008). Previous work found that pooling six breast cancer data sets (over 900 samples in total) yielded better-performing signatures (van Vliet et al. 2008). In contrast, for a meta-analysis, statistics are computed for each data set separately and then combined. Meta-analyses identify gene changes that are seen consistently across many studies (Rhodes and Chinnaiyan 2005). Several cancer-specific databases gather information from multiple studies to facilitate meta-analysis, including Oncomine (Rhodes et al. 2007), GeneSigDB (Culhane et al. 2010), and CDIP. CDIP currently covers lung, ovarian, prostate and head and neck cancers. For example, for prostate cancer, CDIP contains 119 different microarray analyses with over 850 tumor and normal samples. While 12,975 unique genes are significantly differentially regulated in at least one study, far fewer show this trend across multiple studies and we can use this information to prioritize genes, both for biological validation and for signature generation.
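The difference between pooling and meta-analysis can be sketched in a few lines. The example makes several assumptions that are not taken from the cited work: a per-study z-score is used as the transformation before pooling, Fisher's method is used to combine per-study p-values, and all numbers are invented. Other transformations and combination rules are equally possible.

```python
from math import exp, log
from statistics import mean, stdev


def standardize(values):
    """Z-score one study's measurements so that studies share a common scale."""
    centre, spread = mean(values), stdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - centre) / spread for v in values]


def pool(studies):
    """Pooling: transform each study separately, then merge into one data set."""
    pooled = []
    for values in studies:
        pooled.extend(standardize(values))
    return pooled


def fisher_combined_p(p_values):
    """Meta-analysis: combine independent p-values (each > 0) with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom; for even degrees of freedom the upper tail has the
    closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    half_statistic = -sum(log(p) for p in p_values)
    term = total = 1.0
    for i in range(1, k):
        term *= half_statistic / i
        total += term
    return exp(-half_statistic) * total


# Hypothetical measurements of one gene on two platforms with different scales.
study_a = [8.0, 9.0, 10.0]
study_b = [100.0, 120.0, 140.0]
pooled = pool([study_a, study_b])

# Hypothetical per-study p-values for the same gene.
combined_p = fisher_combined_p([0.04, 0.10, 0.03])
print(pooled, combined_p)
```

Pooling yields one larger data set that can be analyzed as a whole, whereas the meta-analysis never mixes raw measurements and only combines the evidence from each study.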
Integrating different types of data
Integrating complementary data from different sources is helpful for reducing noise and prioritizing targets (Gortzak-Uzan et al. 2008; Varambally et al. 2005). For example, Gortzak-Uzan et al. (2008) combined proteins identified in ovarian cancer ascites with differentially expressed genes from CDIP and PPIs from I2D to identify putative biomarkers for early ovarian cancer detection in serum. For target prioritization, prognostic and predictive signatures can be put in their biological context by overlapping and expanding them with data from cancer-specific resources such as the Sanger Census database (Bamford et al. 2004; Stratton et al. 2009), the COSMIC catalog of somatic mutations (Bamford et al. 2004), GeneSigDB, and Oncomine. Techniques such as Gene Set Enrichment Analysis (GSEA) (Subramanian et al. 2005) can be used to combine experimental data with annotation and pathway databases such as the Gene Ontology (Ashburner et al. 2000), KEGG, and Reactome which allows us to convert lists of differentially expressed genes into lists of differentially expressed gene groups, which are more stable across studies (Subramanian et al. 2005).
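The step from gene lists to gene groups can be illustrated with a simple over-representation test. This is a deliberately simpler technique than GSEA, which works with ranked lists and a running enrichment score; it is shown only because it fits in a few lines. The function name and the counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, overlap):
    """Hypergeometric upper-tail probability P(X >= overlap).

    population: number of genes measured
    annotated:  number of those genes that belong to the gene group
    selected:   number of differentially expressed genes
    overlap:    number of differentially expressed genes in the group
    """
    upper = min(annotated, selected)
    tail = sum(comb(annotated, i) * comb(population - annotated, selected - i)
               for i in range(overlap, upper + 1))
    return tail / comb(population, selected)


# Hypothetical example: 20 genes measured, 5 in a pathway,
# 4 differentially expressed, 3 of which are in the pathway.
p_value = enrichment_p_value(population=20, annotated=5, selected=4, overlap=3)
print(f"p = {p_value:.4f}")
```

A small p-value suggests that the group as a whole is affected, even if the individual member genes differ from study to study.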
Integrating data to predict new PPIs
Protein networks are essential for cancer signature design and interpretation (Chuang et al. 2007; Ergun et al. 2007; Rhodes and Chinnaiyan 2005), yet current experimental networks are far from complete, contain false positives, and in general lack context, such as condition, place and strength of individual interactions. Biological techniques can also be enriched by in silico approaches to predict new interactions and the context of existing ones. For example, we recently developed FpClass (Kotlyar and Jurisica 2006; Kotlyar et al. 2011), an association mining algorithm that greatly expands coverage of the human proteome without increasing false discovery rate: it predicts 178,738 new high-accuracy interactions for 9,372 proteins. FpClass integrates many different data types to predict interactions, including sequence, domains, gene expression, and post-translational modifications. Importantly, FpClass reduces the number of uncharacterized disease genes (again, where a gene is considered uncharacterized if its protein product participates in fewer than 5 interactions, see Fig. 2) to 34% for OMIM and 11% for SCGC.
Integrating data to predict disease gene function
Though many cancer-related genes remain uncharacterized, HTP data and in silico methods can be applied to assign them putative functions (Hu et al. 2007). For example, in the successful MouseFunc prediction challenge (Pena-Castillo et al. 2008), teams of scientists competed to predict mouse gene function (Gene Ontology categories) on the basis of several different data sources, including expression, sequence, interactions, phenotype annotations, disease associations and phylogenetic profiles. Computational methods for predicting the function of uncharacterized cancer genes have been reviewed previously by Hu et al. (2007).
Network approaches have been successful in addressing some of the analysis-based challenges of HTP cancer biology. Genes do not act in isolation; they form highly complex and interlinked molecular networks. Examining genes in the context of these networks can yield valuable clues about their function and relations, and expand our knowledge of individual cancer pathways and their cross-talk (Agarwal et al. 2009; Mills et al. 2009). Despite the noise in current protein interaction data sets, network analysis can uncover biologically relevant information, such as lethality (Hahn and Kern 2005; Jeong et al. 2001), functional organization (Gavin et al. 2002; Maslov and Sneppen 2002; Wuchty 2006), hierarchical structure (Ravasz et al. 2002; Yu et al. 2006), modularity (Han et al. 2004) and network-building motifs (Milo et al. 2002; Przulj et al. 2006; Rice et al. 2005). Three important applications of protein networks to cancer research include signature generation, signature interpretation, and disease gene prediction.
Networks for signature generation
By integrating network information with gene expression data, we can identify predictive signatures that perform better and are more conserved across studies than signatures based on gene expression data alone. There are many ways of using networks to create improved gene signatures. One class of methods that has proven very successful is score-based subnetwork biomarkers (Chuang et al. 2007; Fortney et al. 2010; Hwang et al. 2009; Ideker et al. 2002; Nacu et al. 2007). In these approaches, genes are aggregated around an initial “seed” gene in a network to generate subnetworks whose pooled activity levels can be used to predict the value of some response variable, such as disease status or survival time. Subnetworks identified using this approach were shown to be highly conserved across studies, and to perform better than individual genes or pre-defined gene groups at predicting breast cancer metastasis (Chuang et al. 2007). Importantly, many crucial genes belonging to subnetwork biomarkers were not differentially expressed on their own, demonstrating the added value of a network approach. Related approaches have been used to develop subnetwork biomarkers for colon cancer using a combination of proteome and transcriptome data (Nibbe et al. 2009, 2010).
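The seed-and-grow idea described above can be sketched as a greedy search. This is a simplified illustration and not the algorithm of any of the cited papers: the scoring function (a Welch t-statistic on the mean expression of the subnetwork), the stopping rule, the toy network and the expression values are all assumptions.

```python
from math import sqrt
from statistics import mean, variance


def t_score(values, labels):
    """Absolute Welch t-statistic of per-sample values between classes 1 and 0."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    diff = abs(mean(cases) - mean(controls))
    se = sqrt(variance(cases) / len(cases) + variance(controls) / len(controls))
    if se == 0:
        return float("inf") if diff > 0 else 0.0
    return diff / se


def activity(genes, expression):
    """Pooled activity: per-sample mean expression over the subnetwork's genes."""
    return [mean(column) for column in zip(*(expression[g] for g in genes))]


def grow_subnetwork(seed, network, expression, labels):
    """Greedily add the neighbouring gene that most improves the score."""
    members = {seed}
    best = t_score(activity(members, expression), labels)
    while True:
        frontier = {n for g in members for n in network.get(g, ())} - members
        scored = [(t_score(activity(members | {c}, expression), labels), c)
                  for c in sorted(frontier)]
        if not scored:
            break
        top_score, top_gene = max(scored)
        if top_score <= best:
            break  # no neighbour improves the score, so stop growing
        members.add(top_gene)
        best = top_score
    return members, best


# Hypothetical interaction network (adjacency sets) and expression data.
network = {"SEED": {"G1", "G2"}, "G1": {"SEED", "G3"},
           "G2": {"SEED"}, "G3": {"G1"}}
labels = [1, 1, 1, 0, 0, 0]  # three cases followed by three controls
expression = {
    "SEED": [2.0, 1.0, 3.0, 0.0, 1.0, -1.0],
    "G1":   [2.0, 3.0, 1.3, 0.0, -1.0, 0.7],
    "G2":   [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "G3":   [1.0, 1.0, 1.2, 1.0, 1.1, 1.0],
}

members, best = grow_subnetwork("SEED", network, expression, labels)
print(sorted(members), round(best, 1))
```

In this toy example the two selected genes are individually noisy, but their noise partly cancels when their expression is averaged, so the subnetwork separates the classes far better than the seed alone.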
Networks for signature interpretation
Many genes that play a role in predictive signatures have not been previously linked to cancer, and thus can be considered as novel candidate cancer genes. Networks can be used to link these genes with known cancer mechanisms and pathways (Radulovich et al. 2010; Rhodes et al. 2005; Sodek et al. 2008; Tomasini et al. 2008). Gene signatures mapped to protein interactions can be further annotated with other profiles (including proteomic, CGH, and miRNA studies), and with network structures, such as graphlets (Przulj 2007; Przulj et al. 2006). Networks can also reveal new connections between different prognostic signatures. For example, we recently identified a 15-gene prognostic and predictive signature in lung cancer (Zhu et al. 2010a). Though our signature did not directly overlap with previously published ones, network analysis revealed that they were highly related: there were direct interactions between the protein products of genes from our signature and others. Similar results have been shown in other studies (Zhu et al. 2009, 2010b).
Networks for identifying new disease genes
Analyses of the network connectivity of cancer genes have shown that they can be characterized by several topological properties. For example, proteins encoded by cancer genes tend to be central in interaction networks [they have high degree and betweenness centrality (Jonsson and Bates 2006; Rambaldi et al. 2008; Syed et al. 2010)], have high clustering coefficients (Li et al. 2009), and are overrepresented in network motifs (Rambaldi et al. 2008). Several methods use the topological characteristics of known cancer genes, in combination with other features (such as Gene Ontology categories, protein domains, biological pathways, and sequence features), to predict new cancer genes (Aragues et al. 2008; Rambaldi et al. 2008) or functional SNPs (Savas et al. 2009). Many algorithms identify modules in interaction networks, or groups of densely interconnected genes that can be highly functionally related (King et al. 2004; Newman 2006; Palla et al. 2005; Spirin and Mirny 2003). Module-finding algorithms can also be applied to predict new disease genes (Chuang et al. 2007; Goh et al. 2007): clusters enriched for known cancer genes may implicate novel genes in cancer.
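Two of the topological properties mentioned above, degree and clustering coefficient, are simple to compute from an edge list, as the sketch below shows for a hypothetical four-protein network. Betweenness centrality is omitted because it requires a shortest-path computation. The function names and the toy network are assumptions introduced for illustration.

```python
from itertools import combinations


def build_adjacency(edges):
    """Turn a list of undirected edges into a node -> set-of-neighbours map."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def degree(adjacency, node):
    """Number of interaction partners of a node."""
    return len(adjacency[node])


def clustering_coefficient(adjacency, node):
    """Fraction of pairs of a node's neighbours that interact with each other."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical network: a triangle A-B-C with an extra protein D bound to A.
adjacency = build_adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("A", "D")])

for protein in sorted(adjacency):
    print(protein, degree(adjacency, protein),
          round(clustering_coefficient(adjacency, protein), 2))
```

Features of this kind, computed for every protein in an interaction network, can then be combined with annotation-based features in a classifier trained on known cancer genes.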
Generating context-specific co-expression networks from microarray data
Co-expression networks are distinct from PPI networks: instead of indicating physical interactions, edges (and edge weights) between genes reflect the degree of correlation of their expression profiles. Co-expression networks have also been useful in cancer research (e.g., Aggarwal et al. 2006; Choi et al. 2005). For example, we can use microarray data from different health and disease conditions, such as cancer and non-cancer, to create networks specific to those contexts. We can then compare the networks using a variety of measures that reflect local and global network structure, such as graphlets (Przulj et al. 2006), communities (Palla et al. 2005), etc. Weighted Gene Co-expression Network Analysis (WGCNA) (Zhang and Horvath 2005), a popular method for generating and interpreting co-expression networks, has been applied to study prostate cancer (Wang et al. 2009) and glioblastoma (Horvath et al. 2006).
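A minimal sketch of context-specific co-expression networks follows. It builds one network per condition by thresholding the absolute Pearson correlation and then lists the edges unique to each condition. The gene labels, expression values and threshold of 0.8 are hypothetical. The `soft_weight` function shows the power transformation used for unsigned weighted networks in the WGCNA framework; the exponent of 6 is only an illustrative value.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def coexpression_edges(profiles, threshold):
    """Connect gene pairs whose absolute correlation reaches the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges


def soft_weight(r, beta=6):
    """Soft-threshold edge weight |r|**beta, as used for unsigned weighted networks."""
    return abs(r) ** beta


# Hypothetical expression profiles (four samples per condition).
cancer = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [1, -1, 1, -1]}
normal = {"A": [1, 2, 3, 4], "B": [1, -1, 1, -1], "C": [4, 3, 2, 1]}

cancer_edges = coexpression_edges(cancer, threshold=0.8)
normal_edges = coexpression_edges(normal, threshold=0.8)
cancer_only = set(cancer_edges) - set(normal_edges)
normal_only = set(normal_edges) - set(cancer_edges)
weak_weight = soft_weight(pearson(cancer["A"], cancer["C"]))

print(cancer_only, normal_only, weak_weight)
```

A hard threshold discards weak correlations entirely, whereas the soft weight keeps every gene pair but shrinks weak correlations towards zero.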
Databases and visualization
Integrating heterogeneous and distributed data sources
Different data sources can be complementary. For instance, while biological pathway databases may disagree on individual pathway definitions, by combining them we can diminish their false positives and false negatives; by integrating pathways with protein interactions, we can improve their coverage and relevance (Radulovich et al. 2010; Savas et al. 2009). Having all of these data available from a single source or at least in a standardized format simplifies integrative analyses and reduces errors and database incompatibilities (arising from, e.g., inconsistent gene nomenclature). One example is CDIP (Gortzak-Uzan et al. 2008), a cancer informatics portal that combines HTP data from microarrays, CGH, KEGG, I2D, annotations from GeneCards (Rebhan et al. 1997), and other molecular information specific to different cancer types. Another is mirDIP (Shirdel et al. 2011), a microRNA integration portal that combines computational predictions of microRNA:mRNA binding from multiple databases (version 1 combines the 11 largest sources).
Updating old data to reflect the latest knowledge
In part because of the noise inherent in HTP platforms, our best knowledge of what they are measuring and how their data should be interpreted changes with time. Past work has shown that using updated transcript data to re-annotate the assignment of microarray probe sets to genes provided for Affymetrix microarray platforms affected 20–30% of all probe sets (Gautier et al. 2004). Updated probe set definitions lead to higher precision and accuracy (Sandberg and Larsson 2007), and more consistent results across different microarray platforms (Carter et al. 2005; Elo et al. 2005). The Ensembl database (Hubbard et al. 2009) maintains and regularly updates a list of probeset mappings for many popular microarray platforms.
Visualizing biological networks
Web services for frequent updates
One of the main challenges faced by integrated databases is staying current: the knowledge and data in the source databases are changing all the time. Fortunately, many source databases provide access to their content through web services. Integrated databases and analysis tools can then query these databases in response to a user request and access the latest data. For example, NAViGaTOR uses web services to access pathway data from KEGG, Reactome, Pathway Commons (Cerami et al. 2011) and WikiPathways (Pico et al. 2008), protein interactions from the IMEx consortium (Orchard et al. 2007), and protein interactions and gene associations via iHOP (Hoffmann and Valencia 2005) or STRING. Of course, not every relevant data set can be accessed through web services; a list of web services currently available is provided by the EMBRACE Registry (Pettifer et al. 2009).
Text mining for automated database annotation and quality-checking
Many of the data and observations gained from biological experiments are not available in an easily accessible form: they are buried in the text of journal articles. Text mining techniques can automatically extract biological knowledge from free text; these have been extensively reviewed (e.g., Altman et al. 2008; Cohen and Hunter 2008; Hoffmann et al. 2005; Jensen et al. 2006; Rodriguez-Esteban 2009). For integrated databases, text mining methods have proven especially valuable for automatically annotating and quality-checking HTP data. For example, text mining has been applied to annotate the I2D PPI database (Niu et al. 2010) and to verify predicted PPIs from FpClass (Kotlyar et al. 2011). Text mining can also supplement integrated databases with new knowledge, e.g., by extracting gene-drug relationships (Kuhn et al. 2010; von Eichborn et al. 2011).
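One of the simplest text mining strategies, sentence-level co-occurrence, can be sketched in a few lines. This is only a toy illustration and not the method used by the systems cited above: the gene and drug dictionaries, the example text, and the function name are assumptions, and real pipelines typically add entity normalization, negation handling, and statistical scoring.

```python
import re
from collections import Counter

# Tiny illustrative dictionaries; real systems use curated lexicons.
GENES = {"EGFR", "KRAS", "ERBB2"}
DRUGS = {"erlotinib", "trastuzumab"}

def gene_drug_cooccurrences(text):
    """Count sentences in which a known gene and a known drug both appear."""
    counts = Counter()
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        genes = {t.upper() for t in tokens} & GENES
        drugs = {t.lower() for t in tokens} & DRUGS
        for gene in genes:
            for drug in drugs:
                counts[(gene, drug)] += 1
    return counts

# Made-up example text standing in for article abstracts.
text = (
    "EGFR mutations may predict response to erlotinib. "
    "Trastuzumab targets ERBB2. "
    "KRAS status was also recorded. "
    "Patients with EGFR mutations received erlotinib."
)
pairs = gene_drug_cooccurrences(text)
```

Pairs that co-occur repeatedly become candidate relationships, which can then be curated manually or checked against an existing database.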
Standards and recommendations
Progress in HTP research relies on well-designed and widely adopted standards for how data are produced, managed, and analyzed. These provide a consistent way of dealing with large volumes of data, help reduce noise, and ensure reproducibility at both the experimental and computational levels.
Standards for HTP data and tools
Different research groups, databases, and analyses must use consistent reporting standards so their data can be effectively shared, integrated, and evaluated. In cancer research, standards are needed at several levels of data collection and analysis, from tumor collection, storage, and sample preparation to pre-processing and further computational study of the resulting HTP data.
For many HTP data types, there are now well-developed standards for data reporting, such as Minimum Information About a Microarray Experiment [MIAME (Brazma et al. 2001)], and large public repositories for raw data, such as the GEO (Barrett et al. 2005) and the PRoteomics IDEntifications database [PRIDE (Jones et al. 2006)]. Data standards development is an ongoing area of research, and has been extensively reviewed (Brazma et al. 2006; Brooksbank and Quackenbush 2006; Enkemann 2010). Outstanding issues include data identifiers (e.g., the many-to-many mappings between genes from databases like EntrezGene and Ensembl) and open access to data. Though many high-impact journals now mandate that all HTP data and code associated with a publication be made open-access, this requirement is not universal, and sometimes the same journal will require microarray data submission but not proteomic data. In addition, articles that do publish their data too often make them available only in an impractical form, such as hundreds of pages in supplementary PDF files. We need new initiatives to standardize (and enforce) HTP data submission formats to ensure that they are machine-readable. This will enable wider use of data and new discoveries, and may also substantially improve the quality of research by reducing errors (Baggerly and Coombes 2009; Carey and Stodden 2010).
For computational data analysis, we also need large-scale comparisons to provide a fair evaluation of different methods. In cancer research, these efforts depend on resources like GeneSigDB (Culhane et al. 2010), a curated database containing over 2,000 cancer gene signatures, and large well-annotated data sets such as the set of gene expression profiles for over 400 non-small cell lung cancer adenocarcinoma tumors provided by the Director’s Challenge Consortium for the Molecular Classification of Lung Adenocarcinoma (Shedden et al. 2008). New initiatives such as TCGA and ICGC are expanding cancer profiles across multiple different platforms and tumor types. Making the data and related clinical information publicly available will significantly contribute to our understanding of the molecular changes in cancer, enabling new discoveries as well as more comprehensive validation of novel prognostic and predictive signatures.
Recommendations for biological experiments
The quality of any computational analysis is limited by the data that are available. Considering intra- and inter-tumor heterogeneity, we may sometimes have too few samples or too much noise to be able to develop cancer signatures with the required accuracy and reproducibility. In this case, computational analyses of the available data can at least provide guidelines as to the kinds of experiments and data that are needed. For example, one study estimated that achieving 50% overlap in prognostic gene sets for breast cancer patients would require several thousand samples (Ein-Dor et al. 2006), and other work created a general tool to calculate the number of samples needed to train a classifier as a function of normalized fold change, class prevalence, and the number of genes on an array (Dobbin et al. 2008). Much of the human PPI network, including PPIs of known cancer genes, remains uncharacterized. Past work shows that a reliable confidence score can be associated with an interaction by feeding the output of four different complementary HTP assays into a logistic regression model (Braun et al. 2009). Also, high-confidence PPIs predicted by various computational methods can focus future experiments and reduce their false positives and false negatives (Kotlyar and Jurisica 2006; Kotlyar et al. 2011).
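The idea of turning the outcomes of several complementary assays into a single confidence score can be sketched with a logistic model. The code below is an illustration only: the assay names, intercept, and coefficients are invented placeholders, whereas a real model would be fitted to positive and negative reference sets of interactions.

```python
import math

# Hypothetical, unfitted parameters chosen only for illustration.
INTERCEPT = -3.0
WEIGHTS = {"assay_1": 1.5, "assay_2": 1.2, "assay_3": 1.0, "assay_4": 0.8}

def confidence(results):
    """Logistic confidence score from binary assay outcomes.

    `results` maps assay name -> True/False; missing assays count as negative.
    """
    z = INTERCEPT + sum(
        weight * float(bool(results.get(assay, False)))
        for assay, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))

no_support = confidence({})
one_assay = confidence({"assay_1": True})
all_assays = confidence({name: True for name in WEIGHTS})
```

With positive coefficients the score rises with each additional supporting assay, so interactions detected by several independent methods rank above those seen by only one.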
A directory of tools and resources for integrative computational cancer biology
Below is a brief directory of links to some of the major resources for integrative computational biology that appear in this review.
Biological pathway annotations and related data
I2D—Interologous Interaction Database http://ophid.utoronto.ca/i2d/
STRING—Search Tool for the Retrieval of Interacting Genes http://string-db.org/
BioGRID—Biological General Repository for Interaction Datasets http://thebiogrid.org/
HPRD—Human Protein Reference Database http://www.hprd.org/
MINT—Molecular INTeraction Database http://mint.bio.uniroma2.it/
IMEx—International Molecular Exchange Consortium http://www.imexconsortium.org/
Sanger Cancer Gene Census http://www.sanger.ac.uk/genetics/CGP/Census/
COSMIC—Catalogue of Somatic Mutations in Cancer http://www.sanger.ac.uk/resources/databases/cosmic.html
mirDIP—microRNA Data Integration Portal http://ophid.utoronto.ca/mirDIP
Network visualization software
Gene function prediction
Cancer genome initiatives
This work was supported in part by Ontario Research Fund (GL2-01-030), Canadian Institutes of Health Research (BIO-99745), the Canada Foundation for Innovation (CFI #12301 and #203383), the Canada Research Chair Program, and IBM to IJ, and the Ontario Ministry of Health and Long Term Care. The views expressed do not necessarily reflect those of the OMOHLTC. We thank M. Kotlyar, A. Hrvojic, A. Teisanu, and W. Xie for providing data for the figures.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- Agarwal R, Gonzalez-Angulo AM, Myhre S, Carey M, Lee JS, Overgaard J, Alsner J, Stemke-Hale K, Lluch A, Neve RM, Kuo WL, Sorlie T, Sahin A, Valero V, Keyomarsi K, Gray JW, Borresen-Dale AL, Mills GB, Hennessy BT (2009) Integrative analysis of cyclin protein levels identifies cyclin b1 as a classifier and predictor of outcomes in breast cancer. Clin Cancer Res 15:3654–3662
- Altman RB, Bergman CM, Blake J, Blaschke C, Cohen A, Gannon F, Grivell L, Hahn U, Hersh W, Hirschman L, Jensen LJ, Krallinger M, Mons B, O’Donoghue SI, Peitsch MC, Rebholz-Schuhmann D, Shatkay H, Valencia A (2008) Text mining for biology—the way forward: opinions from leading scientists. Genome Biol 9(Suppl 2):S7
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29
- Augen J (2001) Information technology to the rescue! Nat Biotechnol 19(Suppl):BE39–BE40
- Braun P, Tasan M, Dreze M, Barrios-Rodiles M, Lemmens I, Yu H, Sahalie JM, Murray RR, Roncari L, de Smet AS, Venkatesan K, Rual JF, Vandenhaute J, Cusick ME, Pawson T, Hill DE, Tavernier J, Wrana JL, Roth FP, Vidal M (2009) An experimentally derived confidence score for binary protein–protein interactions. Nat Methods 6:91–97
- Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M (2001) Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat Genet 29:365–371
- Buyse M, Loi S, van’t Veer L, Viale G, Delorenzi M, Glas AM, d’Assignies MS, Bergh J, Lidereau R, Ellis P, Harris A, Bogaerts J, Therasse P, Floore A, Amakrane M, Piette F, Rutgers E, Sotiriou C, Cardoso F, Piccart MJ (2006) Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J Natl Cancer Inst 98:1183–1192
- Cervigne NK, Reis PP, Machado J, Sadikovic B, Bradley G, Galloni NN, Pintilie M, Jurisica I, Perez-Ordonez B, Gilbert R, Gullane P, Irish J, Kamel-Reid S (2009) Identification of a microRNA signature associated with progression of leukoplakia to oral carcinoma. Hum Mol Genet 18:4818–4829
- Coleman MP, Quaresma M, Berrino F, Lutz JM, De Angelis R, Capocaccia R, Baili P, Rachet B, Gatta G, Hakulinen T, Micheli A, Sant M, Weir HK, Elwood JM, Tsukuma H, Koifman S, ES GA, Francisci S, Santaquilani M, Verdecchia A, Storm HH, Young JL (2008) Cancer survival in five continents: a worldwide population-based study (CONCORD). Lancet Oncol 9:730–756
- Galsworthy MJ (2009) MEDSUM: an online MEDLINE summary tool. http://www.medsum.info
- Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415:141–147
- Gortzak-Uzan L, Ignatchenko A, Evangelou AI, Agochiya M, Brown KA, St Onge P, Kireeva I, Schmitt-Ulms G, Brown TJ, Murphy J, Rosen B, Shaw P, Jurisica I, Kislinger T (2008) A proteome resource of ovarian cancer ascites: integrated proteomic and bioinformatic analyses to identify putative biomarkers. J Proteome Res 7:339–351
- Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res 32:D452–D455
- Hoffmann R, Valencia A (2005) Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics 21(Suppl 2):ii252–ii258
- Hoffmann R, Krallinger M, Andres E, Tamames J, Blaschke C, Valencia A (2005) Text mining for metabolic pathways, signaling cascades, and protein networks. Sci STKE 2005(283):pe21
- Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Qi S, Chen Z, Lee Y, Scheck AC, Liau LM, Wu H, Geschwind DH, Febbo PG, Kornblum HI, Cloughesy TF, Nelson SF, Mischel PS (2006) Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc Natl Acad Sci USA 103:17402–17407
- Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Graf S, Haider S, Hammond M, Holland R, Howe K, Jenkinson A, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Lawson D, Longden I, Megy K, Meidl P, Overduin B, Parker A, Pritchard B, Rios D, Schuster M, Slater G, Smedley D, Spooner W, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wilder S, Zadissa A, Birney E, Cunningham F, Curwen V, Durbin R, Fernandez-Suarez XM, Herrero J, Kasprzyk A, Proctor G, Smith J, Searle S, Flicek P (2009) Ensembl 2009. Nucleic Acids Res 37:D690–D697
- Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, Bernabe RR, Bhan MK, Calvo F, Eerola I, Gerhard DS, Guttmacher A, Guyer M, Hemsley FM, Jennings JL, Kerr D, Klatt P, Kolar P, Kusada J, Lane DP, Laplace F, Youyong L, Nettekoven G, Ozenberger B, Peterson J, Rao TS, Remacle J, Schafer AJ, Shibata T, Stratton MR, Vockley JG, Watanabe K, Yang H, Yuen MM, Knoppers BM, Bobrow M, Cambon-Thomsen A, Dressler LG, Dyke SO, Joly Y, Kato K, Kennedy KL, Nicolas P, Parker MJ, Rial-Sebbag E, Romeo-Casabona CM, Shaw KM, Wallace S, Wiesner GL, Zeps N, Lichter P, Biankin AV, Chabannon C, Chin L, Clement B, de Alava E, Degos F, Ferguson ML, Geary P, Hayes DN, Johns AL, Kasprzyk A, Nakagawa H, Penny R, Piris MA, Sarin R, Scarpa A, van de Vijver M, Futreal PA, Aburatani H, Bayes M, Botwell DD, Campbell PJ, Estivill X, Grimmond SM, Gut I, Hirst M, Lopez-Otin C, Majumder P, Marra M, McPherson JD, Ning Z, Puente XS, Ruan Y, Stunnenberg HG, Swerdlow H, Velculescu VE, Wilson RK, Xue HH, Yang L, Spellman PT, Bader GD, Boutros PC, Flicek P, Getz G, Guigo R, Guo G, Haussler D, Heath S, Hubbard TJ, Jiang T et al (2010) International network of cancer genome projects. Nature 464:993–998
- Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G, Griffin C, Hilmer SC, Hoffman E, Jedlicka AE, Kawasaki E, Martinez-Murillo F, Morsberger L, Lee H, Petersen D, Quackenbush J, Scott A, Wilson M, Yang Y, Ye SQ, Yu W (2005) Multiple-laboratory comparison of microarray platforms. Nat Methods 2:345–350
- Kislinger T, Cox B, Kannan A, Chung C, Hu P, Ignatchenko A, Scott MS, Gramolini AO, Morris Q, Hallett MT, Rossant J, Hughes TR, Frey B, Emili A (2006) Global survey of organ and organelle protein expression in mouse: combined proteomic and transcriptomic profiling. Cell 125:173–186
- Kotlyar M, Niu Y, Ponzielli R, Ding Z, Mills GB, Penn LZ, Jurisica I (2011) Predicting human protein–protein interactions using non-independent features (in preparation)
- Lau SK, Boutros PC, Pintilie M, Blackhall FH, Zhu CQ, Strumpf D, Johnston MR, Darling G, Keshavjee S, Waddell TK, Liu N, Lau D, Penn LZ, Shepherd FA, Jurisica I, Der SD, Tsao MS (2007) Three-gene prognostic classifier for early-stage non small-cell lung cancer. J Clin Oncol 25:5562–5569
- Lowe JA, Jones P, Wilson DM (2010) Network biology as a new approach to drug discovery. Curr Opin Drug Discov Dev 13:524–526
- Lynn DJ, Winsor GL, Chan C, Richard N, Laird MR, Barsky A, Gardy JL, Roche FM, Chan TH, Shah N, Lo R, Naseer M, Que J, Yau M, Acab M, Tulpan D, Whiteside MD, Chikatamarla A, Mah B, Munzner T, Hokamp K, Hancock RE, Brinkman FS (2008) InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol Syst Biol 4:218
- Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B, Kanapin A, Lewis S, Mahajan S, May B, Schmidt E, Vastrik I, Wu G, Birney E, Stein L, D’Eustachio P (2009) Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res 37:D619–D622
- Pegram MD, Lipton A, Hayes DF, Weber BL, Baselga JM, Tripathy D, Baly D, Baughman SA, Twaddell T, Glaspy JA, Slamon DJ (1998) Phase II study of receptor-enhanced chemosensitivity using recombinant humanized anti-p185HER2/neu monoclonal antibody plus cisplatin in patients with HER2/neu-overexpressing metastatic breast cancer refractory to chemotherapy treatment. J Clin Oncol 16:2659–2671
- Pena-Castillo L, Tasan M, Myers CL, Lee H, Joshi T, Zhang C, Guan Y, Leone M, Pagnani A, Kim WK, Krumpelman C, Tian W, Obozinski G, Qi Y, Mostafavi S, Lin GN, Berriz GF, Gibbons FD, Lanckriet G, Qiu J, Grant C, Barutcuoglu Z, Hill DP, Warde-Farley D, Grouios C, Ray D, Blake JA, Deng M, Jordan MI, Noble WS, Morris Q, Klein-Seetharaman J, Bar-Joseph Z, Chen T, Sun F, Troyanskaya OG, Marcotte EM, Xu D, Hughes TR, Roth FP (2008) A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol 9(Suppl 1):S2
- Peri S, Navarro JD, Kristiansen TZ, Amanchy R, Surendranath V, Muthusamy B, Gandhi TK, Chandrika KN, Deshpande N, Suresh S, Rashmi BP, Shanker K, Padma N, Niranjan V, Harsha HC, Talreja N, Vrushabendra BM, Ramya MA, Yatish AJ, Joy M, Shivashankar HN, Kavitha MP, Menezes M, Choudhury DR, Ghosh N, Saravana R, Chandran S, Mohan S, Jonnalagadda CK, Prasad CK, Kumar-Sinha C, Deshpande KS, Pandey A (2004) Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res 32:D497–D501
- Reis PP, Tomenson M, Cervigne NK, Machado J, Jurisica I, Pintilie M, Sukhai MA, Perez-Ordonez B, Grenman R, Gilbert RW, Gullane PJ, Irish JC, Kamel-Reid S (2010) Programmed cell death 4 loss increases tumor cell invasion and is regulated by miR-21 in oral squamous cell carcinoma. Mol Cancer 9:238
- Rhodes DR, Chinnaiyan AM (2005) Integrative analysis of the cancer transcriptome. Nat Genet 37(Suppl):S31–S37
- Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R, Yu J, Briggs BB, Barrette TR, Anstet MJ, Kincead-Beal C, Kulkarni P, Varambally S, Ghosh D, Chinnaiyan AM (2007) Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia 9:166–180
- Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, Klitgord N, Simon C, Boxem M, Milstein S, Rosenberg J, Goldberg DS, Zhang LV, Wong SL, Franklin G, Li S, Albala JS, Lim J, Fraughton C, Llamosas E, Cevik S, Bex C, Lamesch P, Sikorski RS, Vandenhaute J, Zoghbi HY, Smolyar A, Bosak S, Sequerra R, Doucette-Stamm L, Cusick ME, Hill DE, Roth FP, Vidal M (2005) Towards a proteome-scale map of the human protein–protein interaction network. Nature 437:1173–1178
- Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC, Zhu CQ, Strumpf D, Hanash S, Shepherd FA, Ding K, Seymour L, Naoki K, Pennell N, Weir B, Verhaak R, Ladd-Acosta C, Golub T, Gruidl M, Sharma A, Szoke J, Zakowski M, Rusch V, Kris M, Viale A, Motoi N, Travis W, Conley B, Seshan VE, Meyerson M, Kuick R, Dobbin KK, Lively T, Jacobson JW, Beer DG (2008) Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 14:822–827
- Shirdel EA, Xie W, Mak TW, Jurisica I (2011) NAViGaTing the microNome—using multiple microRNA prediction databases to identify signaling pathway-associated microRNA. PLoS One 6(2):e17429
- Sodek KL, Evangelou AI, Ignatchenko A, Agochiya M, Brown TJ, Ringuette MJ, Jurisica I, Kislinger T (2008) Identification of pathways associated with invasive behavior by ovarian cancer cells using multidimensional protein identification technology (MudPIT). Mol Biosyst 4:762–773
- Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann S, Goedde A, Toksoz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H, Wanker EE (2005) A human protein–protein interaction network: a resource for annotating the proteome. Cell 122:957–968
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102:15545–15550
- Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, von Mering C (2011) The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res 39:D561–D568
- Tomasini R, Tsuchihara K, Wilhelm M, Fujitani M, Rufini A, Cheung CC, Khan F, Itie-Youten A, Wakeham A, Tsao MS, Iovanna JL, Squire J, Jurisica I, Kaplan D, Melino G, Jurisicova A, Mak TW (2008) TAp73 knockout shows genomic instability with infertility and tumor suppressor functions. Genes Dev 22:2677–2691
- Varambally S, Yu J, Laxman B, Rhodes DR, Mehra R, Tomlins SA, Shah RB, Chandran U, Monzon FA, Becich MJ, Wei JT, Pienta KJ, Ghosh D, Rubin MA, Chinnaiyan AM (2005) Integrative genomic and proteomic analysis of prostate cancer reveals signatures of metastatic progression. Cancer Cell 8:393–406
- Wei Y, Tong J, Taylor P, Strumpf D, Ignatchenko V, Pham NA, Yanagawa N, Liu G, Jurisica I, Shepherd FA, Tsao MS, Kislinger T, Moran MF (2011) Primary tumor xenografts of human lung adeno and squamous cell carcinoma express distinct proteomic signatures. J Proteome Res 10:161–174
- Yu H, Paccanaro A, Trifonov V, Gerstein M (2006) Predicting interactions in protein networks by completing defective cliques. Bioinformatics 22(7):823–829
- Zhang B, Horvath S (2005) A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 4 (article 17)
- Zhu CQ, da Cunha Santos G, Ding K, Sakurada A, Cutz JC, Liu N, Zhang T, Marrano P, Whitehead M, Squire JA, Kamel-Reid S, Seymour L, Shepherd FA, Tsao MS (2008) Role of KRAS and EGFR as biomarkers of response to erlotinib in National Cancer Institute of Canada Clinical Trials Group Study BR.21. J Clin Oncol 26:4268–4275
- Zhu CQ, Ding K, Strumpf D, Weir BA, Meyerson M, Pennell N, Thomas RK, Naoki K, Ladd-Acosta C, Liu N, Pintilie M, Der S, Seymour L, Jurisica I, Shepherd FA, Tsao MS (2010a) Prognostic and predictive gene signature for adjuvant chemotherapy in resected non-small-cell lung cancer. J Clin Oncol 28:4417–4424