Optimization of multi-classifiers for computational biology: application to gene finding and expression
Genomes of many organisms have been sequenced over the last few years. However, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed to address part of this problem: the location of genes along a genome and their expression. We propose a multi-objective methodology to combine state-of-the-art algorithms into an aggregation scheme in order to obtain optimal methods’ aggregations. The results obtained show a major improvement in sensitivity when our methodology is compared to the performance of individual methods for the gene finding and gene expression problems. The methodology proposed here is an automatic method generator, and a step forward in exploiting all already existing methods, by providing alternative optimal methods’ aggregations to answer concrete queries for a certain biological problem with maximized prediction accuracy. As more approaches are integrated for each of the presented problems, de novo accuracy can be expected to improve further.
Keywords: Multi-objective · Gene finding · Gene expression
1 Introduction

Genomes of many organisms have been sequenced over the last few years. However, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed to address one part of this problem: the location of genes along a genome [2, 3, 4]. Unfortunately, finding genes in a genomic sequence is far from a trivial problem. Computational gene prediction methods have yet to achieve perfect accuracy, even in the relatively simple prokaryotic genomes. Gene prediction is one of the most important problems in computational biology due to the inherent value of the set of protein-coding genes for other analyses.
Another part of the problem is determining when, where and for how long these genes are turned on or off. Microarray technology allows the simultaneous evaluation of the expression of hundreds of genes in a single assay, making it a powerful tool for expression profiling as well as for the diagnosis and classification of cancers and other diseases. However, this technology presents a wealth of analysis problems, such as the inherent variability of cDNA microarrays at the individual slide and spot level, the large-scale nature of the data, and the fact that the full use of expression profiles for inferring gene function is still only partly explored. Many new methods have been developed to address the statistical challenge of identifying “important” genes in the large sets of raw sequence data [6, 7, 8, 9, 10, 11]. However, there is still a dearth of computational methods to facilitate understanding of differential gene expression profiles (e.g., profiles that change over time and/or over treatments and/or over patients) and to decide which of the many available statistical methods is the most reliable for identifying differences across profiles.
Despite the advances in both of these problems, existing approaches to predict genes and to analyze microarray data have intrinsic advantages and limitations [1, 12]. Furthermore, no program or methodology can provide perfect predictions for any given input for either of these two problems.
The problems of gene finding (identifying genes, exons and introns, and the beginning and end of genes) and analysis of gene expression are formulated in this paper as classification problems. The gene finding problem can be interpreted as a simple decision between which section of a sequence is protein coding and which is not. Concerning gene expression from microarray experiments, the classification problem can be seen as a decision between which genes are active or inactive at a given time point and/or under a given condition. For both problems many different programs are available, which give distinct solutions. There have been previous approaches to combine gene predictors [13, 14, 15] and microarray analyses [16, 17], but these maximize accuracy by weighting both sensitivity and specificity functions into a single objective. In contrast, our methodology uses a multi-objective approach to extract the best methods’ aggregations by maximizing the specificity and sensitivity of their predictions individually. This approach combines state-of-the-art algorithms into an aggregation scheme to provide better predictions by taking advantage of the different methodologies’ strengths and avoiding their weaknesses.
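The multi-objective selection step described above can be illustrated with a small sketch: candidate aggregations are scored on sensitivity and specificity, and only Pareto-optimal aggregations (those not dominated on both objectives simultaneously) are kept. The candidate names and scores below are hypothetical, not values from the paper:

```python
def pareto_front(candidates):
    """Keep aggregations not dominated in both sensitivity and specificity.

    candidates: list of (name, sensitivity, specificity) tuples.
    An aggregation is dominated if another candidate is at least as good
    on both objectives and strictly better on one of them.
    """
    front = []
    for name, sn, sp in candidates:
        dominated = any(
            (sn2 >= sn and sp2 >= sp) and (sn2 > sn or sp2 > sp)
            for _, sn2, sp2 in candidates
        )
        if not dominated:
            front.append((name, sn, sp))
    return front

# Hypothetical scores for three candidate aggregations
candidates = [
    ("A ∪ B", 0.91, 0.40),  # high sensitivity, low specificity
    ("A ∩ B", 0.55, 0.88),  # low sensitivity, high specificity
    ("A",     0.50, 0.35),  # dominated by both aggregations above
]
print(pareto_front(candidates))  # "A" is dominated and filtered out
```

Both aggregations survive because neither beats the other on both objectives at once; the single method "A" is removed because "A ∪ B" dominates it.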
We applied our methodology to both of these problems. For the gene finding problem, we used the datasets from the ENCODE Genome Annotation Assessment Project (EGASP) [18, 19]. These datasets contain manually curated fragments of the human genome originating from the ENCODE project. This dataset was selected for the EGASP assessment because the genes encoded in these regions were not used to train any particular gene predictor; therefore, it is not a biased dataset. In the case of microarray analysis, we used a dataset derived from longitudinal blood expression profiles of human volunteers treated with intravenous endotoxin, compared to those treated with a placebo, in order to study inflammation and the human response to injury. This dataset was part of a Large-scale Collaborative Research Project sponsored by the National Institute of General Medical Sciences.
2 Materials and methods
2.1 Gene finding problem: dataset and programs
For the gene finding problem, we selected 27 ENCODE regions to test our proposal. These ENCODE regions have undergone an exhaustive annotation strategy prior to EGASP by the HAVANA team. They consist of 2,471 total transcripts representing 434 unique protein-coding gene loci.
The programs used in this study are those used in the EGASP competition, which are ab initio gene predictors using a single genome sequence. These programs were designed to predict gene structure, or at least a set of spliceable exons, in vertebrate or human genome sequences: GeneID, Genscan, GeneMark, Augustus and GeneZilla. GeneID combines different algorithms, using Position Weight Arrays to detect features such as splice sites, start and stop codons, Markov Models to score exons, and Dynamic Programming (DP) to assemble the gene structure. Genscan uses a general probabilistic model for the gene structure of human genomic sequences. It has the capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent sets of genes occurring on either or both DNA strands. GeneMark for eukaryotes gathers the original GeneMark models into a naturally derived hidden Markov model framework with gene boundaries modeled as transitions between hidden states. Augustus is a gene predictor for eukaryotic genomic sequences that is based on a generalized hidden Markov model, a probabilistic model of a sequence and its gene structure. GeneZilla is based on the Generalized Hidden Markov Model (GHMM) framework, similar to Genscan. Graph-theoretic representations of the high-scoring open reading frames are provided, allowing for exploration of sub-optimal gene models. It makes use of Interpolated Markov Models (IMMs) and Maximal Dependence Decomposition (MDD), and includes states for signal peptides, branch points, TATA boxes and CAP sites. For each method, the closest organism available for each gene in the dataset was selected. Predictions on both strands were extracted.
2.2 Gene expression profile finding: datasets and analysis methods
The dataset used was derived from longitudinal blood expression profiles of human volunteers treated with intravenous endotoxin compared to those treated with a placebo. The data are related to the host response over time to systemic inflammatory insults, as part of a large-scale collaborative research project sponsored by the National Institute of General Medical Sciences (http://www.gluegrant.org). The data were derived from blood samples collected from eight normal human volunteers, four treated with intravenous endotoxin (i.e., patients 1–4) and four with placebo (i.e., patients 5–8). Complementary RNA was generated from circulating leukocytes at 0, 2, 4, 6, 9 and 24 h after the intravenous infusion and hybridized with GeneChips® HG-U133A v2.0 from Affymetrix Inc., containing a set of 22,283 probe sets. A total set of 29 gene expression profiles (sets of genes which exhibit a common behavior throughout the conditions of the problem under study; time, treatment and patient in our particular case) is contained in the dataset and forms the focus of our study.
The methods analyzed in this study were applied to identify meaningful gene expression profiles from microarray data. The list of programs used comprises the methods most frequently applied to the analysis of microarray data: Student’s t test, Permutation Test, Analysis of Variance (ANOVA) and Repeated Measures Analysis of Variance (RMANOVA). These methods have been applied to the inflammation and host response to injury problem to account for different experimental conditions, such as treatment versus control and different time points. Therefore, Student’s t test and Permutation Test have been applied in two different ways: considering treatment versus control and considering time. The ANOVA and RMANOVA tests can account for more than one experimental condition simultaneously; therefore, they have been applied in three different ways: considering treatment versus control, considering time, and considering treatment versus control and time simultaneously.
The aggregation of the results of different methods in the Gene Expression Profile Finding Problem is performed by combining the results (groups of probe sets) obtained by each of the individual methods. The union of two methods Ma and Mb (Ma ∪ Mb) is defined as the group containing all genes retrieved by either method Ma or method Mb. The intersection of two methods Ma and Mb (Ma ∩ Mb) is defined as the group containing all genes retrieved by both methods Ma and Mb.
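In code, these two operators reduce to plain set operations on the probe-set identifiers retrieved by each method; a minimal sketch, with hypothetical probe-set lists for two methods:

```python
# Probe sets retrieved by two hypothetical methods Ma and Mb
ma = {"201463_s_at", "206011_at", "211367_s_at"}
mb = {"206011_at", "211367_s_at", "217741_s_at"}

union = ma | mb         # Ma ∪ Mb: genes retrieved by either method
intersection = ma & mb  # Ma ∩ Mb: genes retrieved by both methods

print(sorted(union))
print(sorted(intersection))  # ['206011_at', '211367_s_at']
```

The union operator favors sensitivity (more genes retrieved), while the intersection operator favors specificity (only genes all methods agree on), which is why both appear among the optimal aggregations.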
We evaluate the performance of each method aggregation in retrieving each of the 29 gene expression profiles present in the inflammation and host response to injury dataset. In our particular problem, when studying the behavior of method Mi in retrieving a gene expression profile Pj, we define true positives (TPs) as probe sets retrieved by method Mi which exhibit the gene expression profile Pj, true negatives (TNs) as probe sets not retrieved by method Mi which do not exhibit the gene expression profile Pj, false positives (FPs) as probe sets retrieved by method Mi which do not exhibit the gene expression profile Pj, and false negatives (FNs) as probe sets not retrieved by method Mi which exhibit the gene expression profile Pj.
TP, TN, FP and FN information is typically summarized in terms of the sensitivity (Sn), the proportion of probe sets belonging to Pj in the dataset that are correctly retrieved by the method Mi under evaluation, and the specificity (Sp), the proportion of probe sets correctly retrieved by method Mi out of all the probe sets retrieved by Mi (see Eq. 1). These measures are formally defined as for the Gene Finding Problem.
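A minimal sketch of these measures from the TP/TN/FP/FN counts defined above. The correlation coefficient is assumed here to be the Matthews CC, the usual choice for this kind of confusion-matrix summary; the counts in the example are hypothetical:

```python
import math

def confusion_measures(tp, tn, fp, fn):
    """Sensitivity, specificity and correlation coefficient (CC)
    from TP/TN/FP/FN counts as defined in the text."""
    sn = tp / (tp + fn)  # proportion of Pj probe sets retrieved by Mi
    sp = tp / (tp + fp)  # proportion of retrieved probe sets that belong to Pj
    # Matthews correlation coefficient (assumed form of the paper's CC)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc

sn, sp, cc = confusion_measures(tp=40, tn=900, fp=10, fn=10)
print(round(sn, 2), round(sp, 2))  # 0.8 0.8
```

Note how a large number of false positives drags CC down even when sensitivity is high, which is the effect discussed for the gene expression results below.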
3 Results

The results obtained by applying our methodology to the two proposed biological problems outperform, in terms of specificity and sensitivity, the results obtained by classical methods, even though gene prediction and the identification of gene expression profiles are problems of a different nature.
3.1 Gene finding
Individual gene finding methods’ performance
Genscan showed the highest CC while GeneMark obtained the lowest CC. GeneID obtained the highest specificity and the lowest sensitivity, while GeneZilla showed the highest sensitivity with lower specificity. The analysis of the individual results shows that some algorithms are able to predict certain genes very accurately with CC values close to 1, but the same algorithm completely fails to predict other genes (CC below 0.7 or even 0.5). These results show that a high average CC does not imply a good performance, and vice versa, since the average might hide some low CCs for specific genes.
Ten best methods’ aggregations (% genes correctly predicted):
- Augustus ∪ GeneID
- Augustus ∪ Genscan
- Genscan ∩ GeneZilla
- Augustus ∪ Genscan ∪ GeneID
- Augustus ∪ Genscan ∪ GeneID ∪ GeneMark
- Augustus ∪ GeneZilla ∪ GeneMark
- Augustus ∪ GeneZilla ∪ Genscan ∪ GeneID
- GeneZilla ∩ GeneID
- GeneZilla ∩ Genscan ∩ GeneID
- Augustus ∩ GeneZilla
The sensitivity of each method aggregation using the union operator to predict each gene is shown in Fig. 6b. There are a few genes (e.g., AC068580, AC079630, AC021607) that obtain very low sensitivity for all methods’ aggregations, as can be seen from their green columns. The aggregation of methods increases the sensitivity of the prediction, as shown by the red cells in Fig. 6b.
Figure 6c shows gene prediction specificity for each methods’ aggregation using the intersection operator. The methods’ aggregations using the intersection operator increase the specificity of the prediction when compared to single methods and to the methods’ aggregations using the union operator, as illustrated in Fig. 6c. However, there are several genes that are not predicted by any method (e.g., AC068580, AC079630, AC021607). Sensitivity, on the other hand, behaves differently for each methods’ aggregation (Fig. 6d). Some genes are more difficult to predict than others, as represented by mostly red (e.g., AC072051, AL023881) or green columns (e.g., AC079630, AC068580). Finally, there are several genes which are not recognized by any method or methods’ aggregation (e.g., AC068580, AC079630, AC021607).
3.2 Gene expression profile finding
Relabelling of methods analyzed in this study (M1–M10, in order):
- Student’s t test considering treatment versus control
- Student’s t test considering time
- Permutation test considering treatment versus control
- Permutation test considering time
- ANOVA considering treatment versus control
- ANOVA considering time
- ANOVA considering treatment and time
- RMANOVA considering treatment
- RMANOVA considering time
- RMANOVA considering treatment and time
Results obtained by the individual methods, M1 to M10
Out of all the gene expression methods analyzed, ANOVA considering time (represented by M5) achieved the best sensitivity level and the Permutation Test considering time (M3) the best specificity level. However, their average correlation coefficient (CC) over all profiles was not the highest. We can see that the CC levels are generally low. This is due to the type of problem we are dealing with: finding a particular profile in a very large set of data and obtaining as a result a large rate of false positives (FPs), which dramatically decreases the correlation coefficient associated with each method.
However, as occurred with the gene prediction problem, the results show that a low average CC does not imply a bad performance. Some methods recover particular profiles with better CC levels, and also with good specificity/sensitivity levels.
Top ten methods’ aggregations according to the CC obtained with the union and intersection operators:
- M1 ∪ M5 ∪ M7 ∪ M8
- M1 ∪ M3 ∪ M7 ∪ M8
- M1 ∪ M2 ∪ M3 ∪ M7 ∪ M8 ∪ M9
- M4 ∪ M5 ∪ M6 ∪ M7 ∪ M9 ∪ M10
- M1 ∩ M3 ∩ M9
- M1 ∩ M3 ∩ M5 ∩ M9
- M2 ∩ M4 ∩ M6 ∩ M7 ∩ M8 ∩ M9 ∩ M10
- M3 ∪ M5 ∪ M6 ∪ M10
- M4 ∪ M5 ∪ M7
- M1 ∪ M5 ∪ M7 ∪ M9 ∪ M10
The best methods’ aggregations applying the union operator include ANOVA considering time (M5) and ANOVA considering time and treatment (M7), which appear combined in four out of the seven best aggregations obtained with the union operator. In fact, the best aggregation in terms of correlation coefficient is M1 ∪ M5 ∪ M7 ∪ M8; when M5 is replaced by M3 (Permutation Test considering time), the sensitivity value decreases from 0.912 to 0.856 with an increase in specificity from 0.076 to 0.158.
4 Discussion

We propose a methodology to combine algorithms for a biological problem into an aggregation scheme. Our approach consists of using a multi-objective approach to extract the best methods’ aggregations by maximizing the specificity and sensitivity of their predictions. This approach can provide better predictions by combining the advantages and strengths of the different algorithms available for a certain problem and avoiding the redundant and overlapping predictions that might be produced depending on the methodologies and the aggregation scheme used.
The application of the proposed methodology to the gene finding and gene expression problems shows, in both cases, a performance improvement of the optimal methods’ aggregations when compared to the individual methods for each topic.
When determining which methods’ aggregation was the best one for the gene prediction problem, sensitivity and specificity were in conflict. Nevertheless, the estimation of the correlation coefficient helped in selecting the best methods’ aggregations.
The best aggregations include methods employing different algorithmic strategies that correctly predict different subsets of the genes in the dataset. Although the statistical properties of coding regions allow for a good discrimination between large coding and non-coding regions, the exact identification of the limits of exons or of gene boundaries remains difficult. For instance, GeneID has strong constraints concerning this point. In the case of alternative splicing, a predicted structure frequently splits a single true gene into several or, alternatively, merges several genes into one. Such problems are, however, very complex, as intergenic and intronic sequences do not differ much, and specific gene boundary signals in the UTRs (e.g., the TATA box and the polyadenylation signal) are often too variable and sometimes not even present. Some gene finders, like GeneZilla, obtain low specificity levels; this may be because they were tested with unmasked sequences. It is well known that gene finding programs perform worse on unmasked sequences due to the high ‘protein-coding-like’ content of repetitive elements, resulting in an increase in the number of false positive predictions. Augustus obtained very good results individually and takes part in many of the best methods’ aggregations, showing robust results. Nevertheless, it was not able to identify some coding sequences that other gene finding methods, such as Genscan and GeneMark, could for ENCODE regions ENm011 and ENr322. The obtained results indicate that exon accuracy could be improved by implementing a mixed approach: applying the union operator only to the predicted regions of higher quality and the intersection operator to low-quality regions.
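The mixed approach suggested above can be sketched as region-wise aggregation: take the union of predictions where prediction quality is high and the intersection elsewhere. The per-region quality scores, threshold and exon names below are hypothetical; only the region identifiers come from the text:

```python
def mixed_aggregate(pred_a, pred_b, quality, threshold=0.8):
    """Aggregate two per-region prediction sets: union for
    high-quality regions, intersection for low-quality ones."""
    result = {}
    for region, q in quality.items():
        a = pred_a.get(region, set())
        b = pred_b.get(region, set())
        result[region] = a | b if q >= threshold else a & b
    return result

# Hypothetical exon predictions from two gene finders
pred_a = {"ENm011": {"exon1", "exon2"}, "ENr322": {"exon3"}}
pred_b = {"ENm011": {"exon2"}, "ENr322": {"exon3", "exon4"}}
quality = {"ENm011": 0.9, "ENr322": 0.5}  # hypothetical quality scores

print(mixed_aggregate(pred_a, pred_b, quality))
# ENm011 (high quality) → union; ENr322 (low quality) → intersection
```

The design choice mirrors the operators' trade-off: the union maximizes sensitivity where the predictors are trustworthy, while the intersection guards specificity where they are not.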
There are several previous publications combining gene finding programs [15, 39], but they fail to obtain good results as they use all programs simultaneously instead of optimizing their aggregation. De novo gene prediction for compact eukaryotic genomes is already quite accurate, although mammalian gene prediction lags far behind in accuracy. One future direction would be to extend this approach to identify ways of quickly combining many or all existing programs trained for the same organism, and to determine the upper limit of predictive power achievable by aggregations of programs genome-wide.
The application of our methodology to the standard analytical methods used for microarray analysis alleviated the problems exhibited by individual methods, including missing important probe sets. The improvement in sensitivity was greater than 20% without a reduction in specificity for the methods’ aggregations used. Our approach was able to detect probe sets not reported in the first publication of the dataset, where two classic microarray analysis methods, M1 and M3, were individually applied. In fact, some of these probe sets have been shown to be related both in expression level and functionality to probe sets stated as relevant in that publication. Such is the case of probe set 206011_at, related to gene CASP1 and found by applying our methodology, which is related in gene expression level (see additional Fig. 1) and in function (apoptosis-related cysteine peptidase) to probe set 211367_s_at, stated as relevant for the inflammation problem. Probe set 206011_at was found by the method aggregation M7 ∪ M10.
As in the gene finding problem, the aggregations of the different programs/methods proved optimal and consistently outperformed even the best individual approach, in some cases producing dramatic improvements in sensitivity and specificity. Moreover, we observed that even the worst-performing methods contributed to aggregations with more accurate programs.
The proposed methodology applied to microarray technology is valid either for providing the optimal methods’ aggregations for a query profile, or for identifying all differential profiles in a given set of microarray data and suggesting the optimal methods’ aggregations for them. Although we have applied our procedure to time-course structured experiments, these constitute a more general case of simpler microarray problems in which microarray samples are taken as single data points. Therefore, the methodology presented is also useful for simpler microarray experiments with single data points.
Our approach presents various advantages over the standard analytical methods for microarray experiments. The aggregation of the union and intersection operators provides the possibility of querying negative samples (i.e., genes which exhibit a given profile but not others). The representation used for the profiles is optimal: it allows us to examine the behavior of the genes independently in each subject, and facilitates the identification of different behaviors of genes across the subjects in the same experimental group. These differences can help us to discover the influence of biological conditions not previously considered in the experiment, such as gender or age. In contrast to other approaches, the system provides solutions based on a trade-off of specificity versus sensitivity, whereas other methods evaluate their solutions over one measure, usually a ratio between false positives and the total number of genes retrieved. The computational procedure presented can solve some of the problems currently present in the process of analyzing a microarray experiment, such as deciding which analytical methodology to follow, extracting biologically significant results, and properly managing complex experiments harboring several experimental conditions, time series and inter-subject variation. Therefore, it provides a robust platform for the analysis of many types of microarray experiments, from the simplest experimental design to the most complex, providing accurate and reliable results.
In the last 10 years, the existing competitive spirit has increased the number of programs/algorithms created, updated and adapted for the two biological problems presented here [1, 2, 4, 10, 28, 41]. On the one hand, the development of a new algorithm always implies the sacrifice of one objective in favor of another, which makes it very difficult for novel approaches to improve the quality of existing ones in absolute terms. On the other hand, the impressive number of alternative algorithms available for different biological problems is confusing for users, who wonder what makes the programs different, which one should be used in which situation, and which level of prediction confidence to expect. Finally, users also wonder whether current programs can answer all their questions. The answer is most probably no, and will remain negative, as it is unrealistic to imagine that such complex biological processes can be explained merely by looking at one objective.
Our future work will extend the methodology proposed here into an automatic method generator, and a step forward in exploiting all already existing methods, by providing optimal methods’ aggregations to answer concrete queries for a certain biological problem with maximized prediction accuracy.
This work was supported in part by the Spanish Ministry of Science and Technology (MEC) under project TIN-2006-12879 and the Consejeria de Innovacion, Investigacion y Ciencia de la Junta de Andalucia under project TIC-02788. I. Zwir is a senior research scientist supported by the Howard Hughes Medical Institute and the “Ramon y Cajal” program of the MEC, C. del Val was supported by the “Programa de Retorno de Investigadores” from the Junta de Andalucia.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- 4. Haussler D (1998) Computational genefinding. Trends Biochem Sci Suppl:12–15
- 5. Smyth GK, Yang YH (2003) Methods Mol Biol 224:111
- 7. Li C, Wong WH (2003) The analysis of gene expression data: methods and software. Springer, New York, pp 120–141
- 8. Pan W, Lin J, Le C (2001) Funct Integr Genomics 3:117
- 16. Rhodes DR, Barrette TR, Rubin MA, Ghosh D, Chinnaiyan AM (2002) Cancer Res 62:4427
- 21. Calvano SE, Xiao W, Richards DR, Felciano RM, Baker HV, Cho RJ, Chen RO, Brownstein BH, Cobb JP, Tschoeke SK, Miller-Graziano C, Moldawer LL, Mindrinos MN, Davis RW, Tompkins RG, Lowry SF, Inflammation and Host Response to Injury Large Scale Collaborative Research Program (2005) Nature 437:1032
- 22. Halmos P (1960) Naïve set theory. Princeton, NJ
- 23. Mitchell TM (1997) Machine learning. McGraw-Hill, New York
- 24. Cohon JL (1978) Multiobjective programming and planning. Academic Press, New York
- 29. Borodovsky M, Lomsadze A, Nikolai I, Ryan M (2003) Curr Protoc Bioinformatics, Chap 4, Unit 4.6
- 33. Burset M, Guigó R (1996) Genomics 34:353
- 34. Rubio-Escudero C (2007) Fusion of knowledge towards the identification of genetic profiles in the systemic inflammation problem. University of Granada
- 35. Everitt B, Der G (1996) Statistical analysis of medical data using SAS. Chapman & Hall, London
- 36. Chankong V, Haimes YY (1983) Multiobjective decision making: theory and methodology. North-Holland, Amsterdam
- 39. Tech M, Merkl R (2003) In Silico Biol 3:441
- 41. Li C, Wong WH (2001) Genome Biol 2:193