Abstract
The generation of a feature matrix is the first step in conducting machine learning analyses on complex data sets such as those containing DNA, RNA or protein sequences. These matrices contain information for each object, which has to be identified by using complex algorithms to interrogate the data. They are normally generated by combining the results of running such algorithms across various datasets from different, distributed data sources. For non-computing experts, the generation of such matrices therefore proves a barrier to employing machine learning techniques. Furthermore, as datasets grow larger, this barrier is heightened by the limitations of the single personal computer that investigators most often use to carry out such analyses. Here we propose a user-friendly system to generate feature matrices in a way that is flexible, scalable and extendable. Additionally, by making use of the Berkeley Open Infrastructure for Network Computing (BOINC) software, the process can be sped up using the distributed volunteer computing that is possible in most institutions. The system combines the Grid and Cloud User Support Environment (gUSE) with the Web Services Parallel Grid Runtime and Developer Environment Portal (WS-PGRADE) to create workflow-based science gateways that allow users to submit work to the distributed computing resources. This report demonstrates the use of our proposed WS-PGRADE/gUSE BOINC system to identify features to populate matrices from very large DNA sequence data repositories. However, we propose that this system could also be used to analyse a wide variety of feature sets, including image, numerical and text data.
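To make the idea of a feature matrix concrete, the following is a minimal sketch in Python, not the authors' system. It assumes three illustrative features (sequence length, GC content and CpG dinucleotide count) and hypothetical function names. It makes no use of BOINC, gUSE or WS-PGRADE and shows only how independent feature algorithms can each fill one column of a matrix with one row per sequence.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical feature functions (assumptions for illustration only):
# each one maps a single DNA sequence to a single numeric value.


def sequence_length(seq: str) -> float:
    """Number of bases in the sequence."""
    return float(len(seq))


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def cpg_count(seq: str) -> float:
    """Number of CG dinucleotides in the sequence."""
    return float(seq.upper().count("CG"))


# Column name -> feature function. Adding a column means adding an entry here.
FEATURES: Dict[str, Callable[[str], float]] = {
    "length": sequence_length,
    "gc_content": gc_content,
    "cpg_count": cpg_count,
}


def build_feature_matrix(
    sequences: Dict[str, str],
    features: Dict[str, Callable[[str], float]] = FEATURES,
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Return (row ids, column names, matrix) with one row per sequence."""
    row_ids = list(sequences)
    col_names = list(features)
    matrix = [
        [features[name](sequences[seq_id]) for name in col_names]
        for seq_id in row_ids
    ]
    return row_ids, col_names, matrix


if __name__ == "__main__":
    demo = {"seq1": "ACGT", "seq2": "CGCG"}  # made-up sequences
    rows, cols, matrix = build_feature_matrix(demo)
    print(cols)
    for seq_id, values in zip(rows, matrix):
        print(seq_id, values)
```

Because each cell depends on only one sequence and one feature function, the cells can be computed independently of one another, which is the property that makes this kind of workload a natural fit for distribution across many machines.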
Acknowledgements
We would like to thank the technical and administration staff of the Department of Computer Science, Brunel University for their support and Brunel University for support in kind.
Author Responsibilities
MG designed and implemented the system; AP provided and designed the biological problem and advised on the needs of biologists; ST advised on the distributed system and workflow design; SS advised on the hill-climbing technique.
Ethics declarations
Competing Interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ghorbani, M., Swift, S., Taylor, S.J.E. et al. Design of a Flexible, User Friendly Feature Matrix Generation System and its Application on Biomedical Datasets. J Grid Computing 18, 507–527 (2020). https://doi.org/10.1007/s10723-020-09518-y
DOI: https://doi.org/10.1007/s10723-020-09518-y