Hierarchical classification of protein function with ensembles of rules and particle swarm optimisation

  • Focus
Soft Computing

Abstract

This paper focuses on hierarchical classification problems where the classes to be predicted are organized in the form of a tree. The standard top-down divide-and-conquer approach for hierarchical classification consists of building a hierarchy of classifiers, where a classifier is built for each internal (non-leaf) node in the class tree. Each classifier discriminates only between the child classes of its node. After the tree of classifiers is built, the system uses it to classify test examples one class level at a time, so that when an example is assigned a class at a given level, only the child classes of that class need to be considered at the next level. This approach has the drawback that, if a test example is misclassified at a certain class level, it will be misclassified at all deeper levels too. In this paper we propose hierarchical classification methods to mitigate this drawback. More precisely, we propose a method called hierarchical ensemble of hierarchical rule sets (HEHRS), where different ensembles are built at different levels in the class tree and each ensemble consists of different rule sets built from training examples at different levels of the class tree. We also use a particle swarm optimisation (PSO) algorithm to optimise the rule weights used by HEHRS to combine the predictions of different rules into a class to be assigned to a given test example. In addition, we propose a variant of a method to mitigate the aforementioned drawback of top-down classification. These three types of methods are compared against the standard top-down hierarchical classification method on six challenging bioinformatics datasets involving the prediction of protein function. Overall, HEHRS with the rule weights optimised by the PSO algorithm obtains the best predictive accuracy out of the four types of hierarchical classification method.
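
The standard top-down approach described in the abstract, and the way an early error propagates to deeper levels, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class tree, the feature names `x` and `y`, the threshold classifiers, and the function names are all hypothetical and chosen only to make the control flow visible.

```python
from typing import Callable, Dict, List

Example = Dict[str, float]
Classifier = Callable[[Example], str]

# Hypothetical two-level class tree: each internal node lists its child classes.
CLASS_TREE: Dict[str, List[str]] = {
    "root": ["A", "B"],
    "A": ["A.1", "A.2"],
    "B": ["B.1", "B.2"],
}


def make_threshold_classifier(
    feature: str, threshold: float, low: str, high: str
) -> Classifier:
    """Toy stand-in for a trained classifier at one internal node."""

    def classify(example: Example) -> str:
        return low if example[feature] < threshold else high

    return classify


# One classifier per internal (non-leaf) node; each one only
# discriminates between the child classes of its own node.
CLASSIFIERS: Dict[str, Classifier] = {
    "root": make_threshold_classifier("x", 0.5, "A", "B"),
    "A": make_threshold_classifier("y", 0.5, "A.1", "A.2"),
    "B": make_threshold_classifier("y", 0.5, "B.1", "B.2"),
}


def classify_top_down(
    example: Example,
    tree: Dict[str, List[str]],
    classifiers: Dict[str, Classifier],
    root: str = "root",
) -> List[str]:
    """Return the predicted class at each level, from the top level down."""
    path: List[str] = []
    node = root
    # Descend until a leaf class (a node with no children) is reached.
    while node in tree:
        node = classifiers[node](example)
        path.append(node)
    return path


if __name__ == "__main__":
    # Classified as A at level 1, so only A.1 and A.2 are considered at level 2.
    print(classify_top_down({"x": 0.2, "y": 0.8}, CLASS_TREE, CLASSIFIERS))
    # If this example's true class were A.2, the level-1 error (B) could not
    # be undone: only B's children are considered at level 2.
    print(classify_top_down({"x": 0.6, "y": 0.8}, CLASS_TREE, CLASSIFIERS))
```

In the second call the root classifier sends the example to `B`, so no level-2 classifier under `A` is ever consulted. This is the drawback that the methods proposed in the paper aim to mitigate.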


Author information

Corresponding author

Correspondence to Nicholas Holden.

About this article

Cite this article

Holden, N., Freitas, A.A. Hierarchical classification of protein function with ensembles of rules and particle swarm optimisation. Soft Comput 13, 259–272 (2009). https://doi.org/10.1007/s00500-008-0321-0

  • DOI: https://doi.org/10.1007/s00500-008-0321-0
