Rank-based classifiers for extremely high-dimensional gene expression data

  • Regular Article
Advances in Data Analysis and Classification

Abstract

Predicting phenotypes from gene expression profiles is a classification task of growing importance in precision medicine. Although these expression signals are real-valued, it is questionable whether they can be analyzed on an interval scale. As with many biological signals, their influence on, e.g., protein levels is usually non-linear, so the raw values can be misinterpreted. In this article we study gene expression profiles with up to 54,000 dimensions. We analyze these measurements on an ordinal scale by replacing the real-valued profiles with their ranks. This rank transformation can be used to construct invariant classifiers that are unaffected by noise induced by data transformations, which can occur in the measurement setup. Our 10 \(\times\) 10-fold cross-validation experiments on 86 data sets and 19 classification models indicate that classifiers largely benefit from this transformation. Random forests and support vector machines in particular achieve improved classification results on a significant majority of data sets.
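The rank transformation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rank_transform`, the example values, the use of average ranks for ties, and the choice of a log2 rescaling as the data transformation are all assumptions made for illustration. The sketch shows that a strictly increasing transformation of a profile leaves its ranks unchanged, which is the kind of invariance the abstract refers to.

```python
import math


def rank_transform(profile):
    """Replace each value by its rank (1 = smallest) within one profile.

    Ties receive the average of the ranks they span. This tie rule is an
    assumption; the abstract does not say how ties are handled.
    """
    order = sorted(range(len(profile)), key=lambda i: profile[i])
    ranks = [0.0] * len(profile)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while (end + 1 < len(order)
               and profile[order[end + 1]] == profile[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based average rank of the run
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


# Hypothetical expression values for one sample (five genes).
profile = [120.5, 8.2, 950.0, 8.2, 47.1]

# A strictly increasing transformation, here a log2 rescaling of the signal.
log_profile = [math.log2(value) for value in profile]

ranks = rank_transform(profile)
log_ranks = rank_transform(log_profile)

print(ranks)               # ranks of the raw profile
print(log_ranks == ranks)  # the rescaled profile yields the same ranks
```

A classifier trained on `ranks` instead of `profile` would therefore receive identical input whether or not the log2 rescaling was applied. Note that ranks are computed within each sample, so this sketch covers transformations applied to a whole profile, not gene-wise rescalings across samples.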

Acknowledgements

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013) under Grant Agreement No. 602783, the German Research Foundation (DFG, SFB 1074 project Z1), and the Federal Ministry of Education and Research (BMBF, Gerontosys II, Forschungskern SyStaR, project ID 0315894A and e:Med, SYMBOL-HF, Grant ID 01ZX1407A), all to HAK.

Author information

Corresponding author

Correspondence to Hans A. Kestler.

Additional information

L. Lausser, F. Schmid, and L.-R. Schirra contributed equally.

Electronic supplementary material

Supplementary material 1 (pdf 230 KB)

About this article

Cite this article

Lausser, L., Schmid, F., Schirra, LR. et al. Rank-based classifiers for extremely high-dimensional gene expression data. Adv Data Anal Classif 12, 917–936 (2018). https://doi.org/10.1007/s11634-016-0277-3

  • DOI: https://doi.org/10.1007/s11634-016-0277-3
