Interpretable Per Case Weighted Ensemble Method for Cancer Associations

  • Adrin Jalali
  • Nico Pfeifer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8701)

Abstract

Over the past decades, biology has transformed into a high-throughput research field, both in the number of different measurement techniques and in the number of variables measured by each technique (e.g., from Sanger sequencing to deep sequencing), and it is increasingly targeted at individual cells [3]. This has led to an unprecedented growth of biological information. Consequently, techniques that help researchers extract the important insights from the data are becoming increasingly important. Molecular measurements from cancer patients, such as gene expression and DNA methylation, are usually very noisy. Furthermore, cancer types can be very heterogeneous. Therefore, one of the main assumptions of machine learning, that the underlying unknown distribution is the same for all samples in training and test data, might not be completely fulfilled.

In this work, we introduce a method that is aware of this potential bias and uses an estimate of the differences when generating the final prediction. For this, we introduce a set of sparse classifiers based on L1-SVMs [1], under the constraint that the classifiers use disjoint sets of features. Furthermore, for each feature chosen by one of the classifiers, we introduce a regression model based on Gaussian process regression that uses additional features. For a given test sample, we can then use these regression models to estimate, for each classifier, how well its features are predictable by the corresponding Gaussian process regression models. This information is then used for a confidence-based weighting of the classifiers for the test sample. Schapire and Singer showed that incorporating confidences of classifiers can improve the performance of an ensemble method [2]. However, in their setting the confidences of classifiers are estimated using the training data and are thus fixed for all test samples, whereas in our setting we estimate the confidences of individual classifiers per given test sample.
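The per-sample weighting idea can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state how predictability is turned into a confidence, so the Gaussian-style agreement score, the averaging over features, and all function and variable names below are assumptions made for illustration. The sketch takes the classifier decision values and the Gaussian process predictive means and standard deviations as given inputs rather than fitting any model.

```python
import math

def feature_agreement(observed, pred_mean, pred_std):
    """Assumed agreement score in [0, 1]: 1 when the observed value equals
    the regression model's predicted mean, decaying with the standardized
    residual. Requires pred_std > 0."""
    z = (observed - pred_mean) / pred_std
    return math.exp(-0.5 * z * z)

def classifier_confidence(observed, pred_means, pred_stds):
    """Assumed confidence of one classifier for one test sample:
    the mean agreement over the features that classifier uses."""
    scores = [feature_agreement(o, m, s)
              for o, m, s in zip(observed, pred_means, pred_stds)]
    return sum(scores) / len(scores)

def weighted_vote(decisions, confidences):
    """Combine signed decision values with per-sample confidences.
    Falls back to an unweighted mean if all confidences are zero."""
    total = sum(confidences)
    if total == 0:
        score = sum(decisions) / len(decisions)
    else:
        score = sum(c * d for c, d in zip(confidences, decisions)) / total
    label = 1 if score >= 0 else -1
    return label, score

# Hypothetical test sample with two classifiers on disjoint feature sets.
# Classifier A: its features match the regression predictions exactly.
conf_a = classifier_confidence([0.5, 1.2], [0.5, 1.2], [0.3, 0.4])
# Classifier B: its features lie three standard deviations from the predictions.
conf_b = classifier_confidence([3.0, -3.0], [0.0, 0.0], [1.0, 1.0])

# A votes +1, B votes -1; B is down-weighted for this sample only.
label, score = weighted_vote([1.0, -1.0], [conf_a, conf_b])
print(label, round(score, 3))
```

In this toy case the two classifiers disagree, and the one whose features look atypical for the test sample contributes almost nothing to the vote. With confidences fixed at training time, as in the setting of [2], the same weights would apply to every test sample.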

In our evaluation, the new method achieved state-of-the-art performance on many different cancer data sets with measured DNA methylation or gene expression. Moreover, we developed a method to visualize our learned classifiers to find interesting associations with the target label. Applying the method to a leukemia data set, we found several ribosomal proteins associated with leukemia that might be interesting targets for follow-up studies and that support the hypothesis that ribosomes are a new frontier in gene regulation.

Keywords

machine learning cancer biomarkers supervised prediction ensemble methods support vector machines Gaussian processes 

References

  1. Bradley, P.S., Mangasarian, O.L.: Feature selection via concave minimization and support vector machines. In: ICML, vol. 98, pp. 82–90 (1998)
  2. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3), 297–336 (1999)
  3. Shapiro, E., Biezuner, T., Linnarsson, S.: Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat. Rev. Genet. 14(9), 618–630 (2013)

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Adrin Jalali (1, 2)
  • Nico Pfeifer (1)
  1. Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany
  2. Saarbrücken Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
