Cluster Computing, Volume 18, Issue 4, pp 1317–1329

Ensemble learning of runtime prediction models for gene-expression analysis workflows

  • David A. Monge
  • Matěj Holec
  • Filip Železný
  • Carlos García Garino


The adequate management of scientific workflow applications depends strongly on the availability of accurate performance models of their sub-tasks. Numerous approaches use machine learning to generate such models autonomously, thus reducing the human effort associated with this process. However, these standalone models may lack robustness, which degrades the quality of the information provided to the workflow systems built on top of them. This paper presents a novel approach for learning ensemble prediction models of task runtime. The ensemble-learning method known as bootstrap aggregating (bagging) is used to produce robust ensembles of M5P regression trees with better predictive performance than standalone models can achieve. Our approach has been tested on gene-expression analysis workflows. The results show that the ensemble method leads to significant reductions in prediction error compared with learned standalone models. To the best of our knowledge, this is the first work to use ensemble learning for generating performance prediction models. These promising results encourage further research in this direction.
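As a rough illustration of the bagging idea named in the abstract, the following Python sketch trains each ensemble member on a bootstrap resample of the training data and averages the members' predictions. It is not the authors' implementation. The paper bags M5P model trees, whereas this sketch swaps in a much simpler base learner (a one-split regression stump on a single feature) so that it needs only the standard library. The function names `fit_stump` and `fit_bagging`, the choice of input size as the only feature, and the toy data are all assumptions made for illustration.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on a single numeric feature.

    Stand-in base learner; the paper uses M5P model trees instead.
    """
    pairs = sorted(zip(xs, ys))
    best = None  # (squared error, threshold, left mean, right mean)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    if best is None:  # all feature values identical: predict the mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm


def fit_bagging(xs, ys, n_models=25, seed=0):
    """Train n_models stumps on bootstrap resamples and average them."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: n indices drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)


# Hypothetical toy data: input size (GB) versus task runtime (minutes).
sizes = [1, 2, 3, 4, 10, 11, 12, 13]
runtimes = [1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 4.8, 5.0]

predict = fit_bagging(sizes, runtimes)
print(predict(2), predict(12))
```

Averaging over resampled models reduces the variance of an unstable base learner, which is the robustness argument the abstract makes for ensembles over standalone models.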


Keywords: Performance prediction; Ensemble learning; Data-intensive workflows; Gene expressions analysis experiments



This research is supported by the ANPCyT project No. PICT-2012-2731, and by the MINCyT project No. RC0904. MH and FZ were supported by the Czech Science Foundation project No. P202/12/2032. The financial support from SeCTyP-UNCuyo through project No. M004 is also gratefully acknowledged. DAM wants to thank CONICET for the granted fellowship. We also want to thank Alejandro Edera and Rubén Santos for their fruitful comments. Finally, the authors want to thank the anonymous reviewers for their valuable comments and suggestions that helped to improve the quality of this paper.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • David A. Monge (1)
  • Matěj Holec (2)
  • Filip Železný (2)
  • Carlos García Garino (3)

  1. ITIC Research Institute & Faculty of Exact and Natural Sciences, National University of Cuyo (UNCuyo), Mendoza, Argentina
  2. IDA Research Group, Czech Technical University in Prague, Prague, Czech Republic
  3. ITIC Research Institute & Faculty of Engineering, National University of Cuyo (UNCuyo), Mendoza, Argentina
