PyBDA: a command line tool for automated analysis of big biological data sets
Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to lack of accessible tools that scale to hundreds of millions of data points.
We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake in order to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analyzing image-based RNA interference data of 150 million single cells.
PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used with simple command line calls entirely making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.
KeywordsBig data Data analysis Command line Pipeline Computing cluster Grid engine Machine learning
Gradient boosting machines
Generalized linear model
Gaussian mixture model
High performance computing
Independent component analysis
Linear discriminant analysis
Principal component analysis
The advent of technologies that produce very large amounts of high-dimensional biological data is posing not only statistical, but primarily computational difficulties for researchers in bioinformatics, including in single-cell sequencing, genome-wide association studies, or imaging [1, 2, 3]. For statistical analysis and machine learning of gene expression data, tools such as Scanpy  exist. However, they scale only up to a (few) million observations rendering them unsuitable for the analysis of, e.g., microscopy imaging data often comprising billions of cells. Approaches that scale to big data sets by using high-performance computing, such as reviewed in , have been developed mainly for sequence analysis, but not statistical analysis for data derived from, for instance, imaging or mass spectrometry.
Here, we introduce PyBDA, a Python command line tool for automated analysis of big biological data sets. PyBDA offers easily customizable machine learning pipelines that require only minimal programming knowledge. The main goal of PyBDA is to simplify the repetitive, time-consuming task of creating customized machine learning pipelines and combine it with distributed computation on high-performance clusters. The main contributions of PyBDA are (i) a command line tool for the analysis of big data sets with automated pipelines and generation of relevant plots after each analysis, (ii) various statistical and machine learning methods either using novel, custom implementations or interfacing to MLLib  from Apache Spark , and (iii) a modularized framework that can be easily extended to introduce new methods and algorithms. We built PyBDA with a special emphasis on ease of usability and automation of multiple machine learning tasks, such that minimal programming and implementation effort is required and tasks can be executed quickly.
PyBDA provides various statistical methods and machine learning algorithms that scale to very large, high-dimensional data sets. Since most machine learning algorithms are computationally expensive and big high-dimensional data does not fit into the memory of standard desktop computers, PyBDA uses Apache Spark’s DataFrame API for computation which automatically partitions data across nodes of a computing cluster, or, if no cluster environment is available, uses the resources available.
In comparison to other data analysis libraries, for instance [8, 9], where the user needs to use the provided API, PyBDA is a command line tool that does not require extensive programming knowledge. Instead the user only needs to define a config file in which they specify the algorithms to be used. PyBDA then automatically builds a workflow and executes the specified methods one after another. PyBDA uses Snakemake  to automatically execute these workflows of methods.
Comparison to other big data tools
In the last decade several big data analysis and machine learning frameworks have been proposed, yet none of them allow for easy, automated pipelining of multiple data analysis or machine learning tasks. Here, we briefly compare the pros and cons of PyBDA with some of the most popular frameworks, including TensorFlow , scikit-learn , mlr , MLLib  and h20 . Furthermore, many other machine learning tools, such as PyTorch , Keras  or Edward  that are comparable in functionality to the previous frameworks exist. For the sake of completeness, we also mention tools for probabilistic modelling, such as PyMC3 , GPFlow  or greta  which, of course, are primarily designed for statistical modelling and probabilistic programming and not for big data analysis.
Common statistical analysis and machine learning tools
In comparison to PyBDA, the other methods we considered here are either complex to learn, take some time to get used to, or are not able to cope with big data sets. For instance, TensorFlow scales well to big, high-dimensional data sets and allows for the implementation of basically any numerical method. However, while being the most advanced of the compared tools, it has a huge, complex API and needs extensive knowledge of machine learning to be usable, for instance to implement the evidence lower bound of a variational autoencoder or to choose an optimizer for minimizing a custom loss function. On the other hand, tools such as scikit-learn and mlr are easy to use and have a large range of supported methods, but do not scale well, because some of their functionality is not distributable on HPC clusters and consequently not suitable for big data. The two tools that are specifically designed for big data, namely MLLib and h20, are very similar to each other. A drawback of both is the fact that the range of models and algorithms is rather limited in comparison to tools such as scikit-learn and mlr. In comparison to h20’s H20Frame API, we think Spark not only provides a superior DataFrame/RDD API that has more capabilities and is easier for extending a code base with new methods, but also has better integration for linear algebra. For instance, computation of basic descriptive statistics using map-reduce or matrix multiplication are easier implemented using Spark.
PyBDA is the only specifically built to not require much knowledge of programming or machine learning. It can be used right away without much time to get used to an API. Furthermore, due to using Spark it scales well and can be extended easily.
Methods provided by PyBDA
Linear discriminant analysis
Independent component analysis
Gaussian mixture models
Generalized linear models
PyBDA is a command line tool for machine learning of big biological data sets scaling up to hundreds of millions of data points. PyBDA automatically parses a user defined pipeline of multiple machine learning and data analysis tasks from a config file and distributes jobs to compute nodes using Snakemake and Apache Spark. We believe PyBDA will be a valuable and user-friendly tool supporting big data analytics and continued community-driven development of new algorithms.
Availability and requirements
Project name: PyBDA
Project home page: https://github.com/cbg-ethz/pybda
Operating system(s): Linux and MacOS X
Programming language: Python
Other requirements: Python 3.6, Java JDK 8, Apache Spark 2.4.0
License: GNU GPLv3
Any restrictions to use by non-academics: License needed
The authors want to thank David Seifert and Martin Pirkl for fruitful discussions and Kim Jablonski for reviewing of and commenting on code and implementation details.
SD developed PyBDA and its documentation. SD conducted the application. SD and NB wrote the manuscript. ME and CD extracted features from the cells and provided the data. CD and NB lead and conducted the research. All authors read and approved the final manuscript.
This work has been funded by SystemsX.ch, the Swiss Initiative in Systems Biology, under grant RTD 2013/152 (TargetInfectX), and by ERASysAPP, the ERA-Net for Applied Systems Biology, under grant ERASysAPP-30 (SysVirDrug). The funding agencies were not involved in the design of the study, nor collection, analysis, and interpretation of data, nor in writing the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 2.Katal A, Wazid M, Goudar RH. Big Data: Issues, Challenges, Tools and Good Practices. In: 2013 Sixth International Conference on Contemporary Computing (IC3). IEEE: 2013. p. 404–9. https://doi.org/10.1109/IC3.2013.661222.
- 3.Marx V. The big challenges of big data. Nature 498. 2013.Google Scholar
- 5.Guo R, Zhao Y, Zou Q, Fang X, Peng S. Bioinformatics applications on Apache Spark. GigaScience. 2018; 7(8):098.Google Scholar
- 6.Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S, et al.Mllib: Machine learning in Apache Spark. J Mach Learn Res. 2016; 17(1):1235–41.Google Scholar
- 8.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al.Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011; 12(Oct):2825–30.Google Scholar
- 9.Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, Casalicchio G, Jones ZM. mlr: Machine Learning in R. J Mach Learn Res. 2016; 17(1):5938–42.Google Scholar
- 11.Abadi M, Agarwal A, Barham P, Brevdo E, Citro C, Corrado GS, Davis A, Dean J, Devin M, et al.TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467. 2016.Google Scholar
- 12.H, 2O.ai. Python Interface for H2O. 2019. Python module version 22.214.171.124. https://github.com/h2oai/h2o-3.
- 13.Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A. Automatic Differentiation in PyTorch. In: NIPS Autodiff Workshop: 2017.Google Scholar
- 14.Chollet F, et al.Keras. 2015. https://keras.io.
- 15.Tran D, Kucukelbir A, Dieng AB, Rudolph M, Liang D, Blei DM. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787. 2016.Google Scholar
- 17.Matthews DG, Alexander G, Van Der Wilk M, Nickson T, Fujii K, Boukouvalas A, León-Villagrá P, Ghahramani Z, Hensman J. GPflow: A Gaussian Process Library using TensorFlow. J Mach Learn Res. 2017; 18(1):1299–304.Google Scholar
- 18.Golding N. Greta: Simple and Scalable Statistical Modelling in R. 2018. R package version 0.3.0. https://CRAN.R-project.org/package=greta.
- 19.Pafka S. benchm-ml. GitHub. 2019. https://github.com/szilard/benchm-ml/tree/941dfd4ebab3854b3a49fd70c192ecf21e483267.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.