Abstract
In this work, new implementations of the U-BRAIN (Uncertainty-managing Batch Relevance-Based Artificial Intelligence) supervised machine learning algorithm are described. The implementations, referred to as SP-BRAIN (SP stands for Spark), aim to process large datasets efficiently. Given the iterative nature of the algorithm and its dependence on in-memory data, a non-standard MapReduce paradigm is applied, taking into account several memory and performance issues, e.g., the granularity of the MAP task, the reduction in the shuffling operation, caching, partial data recomputing, and usage of clusters. The implementations benefit from the components of the whole Hadoop ecosystem, such as HDFS, Yarn, and streaming. Testing is performed in cloud execution environments, using different configurations with up to 128 cores. The performance of the new implementations is evaluated on three known datasets, and the findings are compared with those of a previous U-BRAIN parallel implementation. The results show a speedup of up to 20× with good scalability and reliability in cluster environments.
References
Aloisio A, Izzo V, Rampone S (2006) VLSI implementation of greedy-based distributed routing schemes for ad hoc networks. Soft Comput 11(9):865–872. https://doi.org/10.1007/s00500-006-0138-7
Armbrust M et al (2015) Scaling spark in the real world. Proc VLDB Endow 8(12):1840–1843. https://doi.org/10.14778/2824032.2824080
Benson D, Karsch-Mizrachi I, Lipman D, Ostell J, Sayers E (2010) GenBank. Nucleic Acids Res 39:D32–D37. https://doi.org/10.1093/nar/gkq1079
Celli F, Cumbo F, Weitschek E (2018) Classification of large DNA methylation datasets for identifying cancer drivers. Big Data Res 13:21–28. https://doi.org/10.1016/j.bdr.2018.02.005
Chambers A, Zaharia M (2018) Spark: the definitive guide, 1st edn. O’Reilly Media, Sebastopol, pp 49–58, 239–246, 326–328
Clancy S, Brown W (2008) Translation: DNA to mRNA to protein | learn science at scitable. Nature.com. [Online]. https://www.nature.com/scitable/topicpage/translation-dna-to-mrna-to-protein-393. Accessed 10 Mar 2019
D’Angelo G, Palmieri F, Ficco M, Rampone S (2015) An uncertainty-managing batch relevance-based approach to network anomaly detection. Appl Soft Comput 36:408–418. https://doi.org/10.1016/j.asoc.2015.07.029
D’Angelo G, Pilla R, Tascini C, Rampone S (2019) A proposal for distinguishing between bacterial and viral meningitis using genetic programming and decision trees. Soft Comput. https://doi.org/10.1007/s00500-018-03729-y
Daly P (2000) Review: Java threads. Comput Bull 42(2):30. https://doi.org/10.1093/combul/42.2.30-b
D’Angelo G, Rampone S (2014a) Towards a HPC-oriented parallel implementation of a learning algorithm for bioinformatics applications. BMC Bioinform. https://doi.org/10.1186/1471-2105-15-s5-s2
D’Angelo G, Rampone S (2014b) Diagnosis of aerospace structure defects by a HPC implemented soft computing algorithm. In: 2014 IEEE metrology for aerospace (MetroAeroSpace). https://doi.org/10.1109/metroaerospace.2014.6865959
Dean J, Ghemawat S (2008) MapReduce. Commun ACM 51(1):107. https://doi.org/10.1145/1327452.1327492
Dobre C, Xhafa F (2014) Intelligent services for Big Data science. Future Gener Comput Syst 37:267–281. https://doi.org/10.1016/j.future.2013.07.014
Dörre J, Apel S, Lengauer C (2014) Modeling and optimizing MapReduce programs. Concurr Comput Pract Exp 27(7):1734–1766. https://doi.org/10.1002/cpe.3333
Eddy D, Adler J, Patterson B, Lucas D, Smith K, Morris M (2011) Individualized guidelines: the potential for increasing quality and reducing costs. Ann Intern Med 154(9):627. https://doi.org/10.7326/0003-4819-154-9-201105030-00008
Firouzi F et al (2018) Internet-of-Things and big data for smarter healthcare: from device to architecture, applications and analytics. Future Gener Comput Syst 78:583–586. https://doi.org/10.1016/j.future.2017.09.016
Gantz J, Reinsel D (2012) The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC Go-to-Market Services, Framingham, pp 1–16
Google (2019a) Google Cloud Platform Overview | Overview | Google Cloud. Google Cloud, 2019. [Online]. https://cloud.google.com/docs/overview/. Accessed 10 Mar 2019
Google (2019b) Cloud Dataproc FAQ | Cloud Dataproc Documentation | Google Cloud. Google Cloud, 2019. [Online]. https://cloud.google.com/dataproc/docs/resources/faq. Accessed 07 Jan 2019
Google (2019c) Geography and Regions | Documentation | Google Cloud. Google Cloud, 2019. [Online]. https://cloud.google.com/docs/geography-and-regions. Accessed 10 Feb 2019
Gray J (2008) Distributed computing economics. Queue 6(3):63–68. https://doi.org/10.1145/1394127.1394131
Grolinger K, Hayes M, Higashino W, L’Heureux A, Allison D, Capretz M (2014) Challenges for MapReduce in Big Data. In: 2014 IEEE world congress on services. https://doi.org/10.1109/services.2014.41
HDFS (2019) HDFS Architecture Guide. Hadoop.apache.org, 2019. [Online]. https://hadoop.apache.org/docs/current1/hdfs_design.html#Portability+Across+Heterogeneous+Hardware+and+Software+Platforms. Accessed 07 Jan 2019
Hennessy JL, Patterson D (2011) Computer architecture, 4th edn. Elsevier Morgan Kaufmann, Amsterdam, p 39
Huang X et al (2018) Revealing Alzheimer’s disease genes spectrum in the whole-genome by machine learning. BMC Neurol. https://doi.org/10.1186/s12883-017-1010-3
Huedecker N, Mery A, Ankush J (2017) Market guide for Hadoop distributions. Gartner IT glossary, 01–Feb–2017. [Online]. https://www.gartner.com/doc/3591517/market-guide-hadoop-distributions. Accessed 8 Mar 2019
Karau H, Warren R (2017) High performance Spark, 1st edn. O’Reilly Media Inc., Sebastopol, CA, USA, pp 66–69, 92–97, 115–118, 125–127, 136–146
Kleppmann M (2017) Designing data-intensive applications, 1st edn. O’Reilly Media Inc., Sebastopol, CA, USA, pp 6–11, 273–284, 295–298, 389–410, 424–426
Kranzlmüller D, Kacsuk P, Dongarra J (2005) Recent advances in parallel virtual machine and message passing interface. Int J High Perform Comput Appl 19(2):99–101. https://doi.org/10.1177/1094342005054256
L’Heureux A, Grolinger K, Elyamany HF, Capretz MAM (2017) Machine learning with Big Data: challenges and approaches. IEEE Access 5:7776–7797. https://doi.org/10.1109/ACCESS.2017.2696365
Marx V (2013) The big challenges of big data. Nature 498(7453):255–260. https://doi.org/10.1038/498255a
McBryan O (1994) An overview of message passing environments. Parallel Comput 20(4):417–444. https://doi.org/10.1016/0167-8191(94)90021-3
Mohamed A, Berg W, Peng H, Luo Y, Jankowitz R, Wu S (2017) A deep learning method for classifying mammographic breast density categories. Med Phys 45(1):314–321. https://doi.org/10.1002/mp.12683
Morfino V, Rampone S, Weitschek E (2019) A comparison of Apache Spark supervised machine learning algorithms for DNA splicing sites prediction. In: Esposito A, Faundez-Zanuy M, Morabito FC, Pasero E (eds) Neural approaches to dynamics of signal exchanges. Springer, Singapore, pp 133–143. https://doi.org/10.1007/978-981-13-8950-4_13
Narkhede N, Shapira G, Palino T (2017) Kafka: the definitive guide, 1st edn. O’Reilly Media Inc., Sebastopol, pp 1–16
Pardi W (2004) Programming concurrent and distributed algorithms in Java. IEEE Distrib Syst Online 5(11):5. https://doi.org/10.1109/mdso.2004.32
Parker C (2012) Unexpected challenges in large scale machine learning. In: 1st International workshop on big data, streams and heterogeneous source mining: algorithms, systems, programming models and applications, Beijing, China, pp 1–6
Perrella A, Morfino V (2014) WTC (WE TAKE CARE) Experimental smartphone app to follow-up and take care of patients with chronic infectious disease: which impact on patients life style? In: Nardone C, Rampone S (eds) Global sustainability inside and outside the territory, proceedings of the 1st international workshop. World Scientific, pp 107–112 https://doi.org/10.1142/9789814651325_0009
Pollastro P, Rampone S (2002) HS3D, a dataset of Homo sapiens splice regions, and its extraction procedure from a major public database. Int J Mod Phys C 13(08):1105–1117. https://doi.org/10.1142/s0129183102003796
Pugh W, Spacco J (2004) MPJava: high-performance message passing in Java using Java.nio. Lang Compil Parallel Comput. https://doi.org/10.1007/978-3-540-24644-2_21
Rampone S (1998) Recognition of splice junctions on DNA sequences by BRAIN learning algorithm. Bioinformatics 14(8):676–684. https://doi.org/10.1093/bioinformatics/14.8.676
Rampone S (2004) An error tolerant software equipment for human DNA characterization. IEEE Trans Nucl Sci 51(5):2018–2026. https://doi.org/10.1109/tns.2004.835609
Rampone S (2009) A web content management system for a geo-archeological research program. J Uncertain Syst 3(2):95–107
Rampone S, Russo C (2012) A fuzzified BRAIN algorithm for learning DNF from incomplete data. Electron J Appl Stat Anal 5(2):256–270. https://doi.org/10.1285/i20705948v5n2p256
Rampone S, Valente A (2012) Neural network aided evaluation of landslide susceptibility in Southern Italy. Int J Mod Phys C 23(1):10–29
Ryza S, Laserson U, Owen S, Wills J (2015) Advanced analytics with Spark, 1st edn. O’Reilly Media Inc., Sebastopol, p 66
Sa S (2018) Big Data in healthcare management: a review of literature. Am J Theor Appl Bus 4(2):57. https://doi.org/10.11648/j.ajtab.20180402.14
Sitto K, Presser M (2015) Field guide to Hadoop, 1st edn. O’Reilly Media, Inc, Sebastopol, pp 13–42, 55–117
Spark (2019a) Tuning - Spark 2.4.0 Documentation. Spark.apache.org, 2019. [Online]. https://spark.apache.org/docs/latest/tuning.html. Accessed 10 Feb 2019
Spark (2019b) Apache Spark™ - Unified Analytics Engine for Big Data. Spark.apache.org, 2019. [Online]. https://spark.apache.org. Accessed 07 Jan 2019
Suwinski P, Ong C, Ling M, Poh Y, Khan A, Ong H (2019) Advancing personalized medicine through the application of whole exome sequencing and big data analytics. Front Genet. https://doi.org/10.3389/fgene.2019.00049
Taylor R (2010) An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform. https://doi.org/10.1186/1471-2105-11-s12-s1
UCI (2019) UCI Machine Learning Repository. Archive.ics.uci.edu, 2019. [Online]. http://archive.ics.uci.edu/ml/index.php. Accessed 10 Mar 2019
Weitschek E, Fiscon G, Fustaino V, Felici G, Bertolazzi P (2015) Clustering and classification techniques for gene expression profile pattern analysis. Pattern Recognit Comput Mol Biol. https://doi.org/10.1002/9781119078845.ch19
Weitschek E, Lauro S, Cappelli E, Bertolazzi P, Felici G (2018) CamurWeb: a classification software and a large knowledge base for gene expression data of cancer. BMC Bioinform. https://doi.org/10.1186/s12859-018-2299-7
White T (2015) Hadoop: the definitive guide, 4th edn. O’Reilly & Associates, Sebastopol, pp 10, 22–37, 43–96
Zaharia M et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI’12 Proceedings of the 9th USENIX conference on networked systems design and implementation, San Jose, CA, p 2
Zaharia M, Reynold S, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache Spark: a unified engine for big data processing. Commun ACM 59(11):56–65. https://doi.org/10.1145/2934664
Acknowledgements
This work has been supported by Università del Sannio, POR CAMPANIA FESR 2014/2020, “Distretti ad Alta Tecnologia, Aggregazioni e Laboratori Pubblico Privati per il rafforzamento del potenziale scientifico e tecnologico della Regione Campania”, Distretto Aerospaziale della Campania (DAC) S.C.A.R.L., in the framework of the TABASCO project B43D18000220007.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: Optimization choices
In this appendix some optimization choices are described.
1.1 Use of Spark Structured API
Using the Spark Structured APIs instead of the lower-level API (RDDs) yields more performant code, because the low-level code automatically generated by Spark is generally more efficient than the code written by a programmer of average experience. Furthermore, code written against a higher-level structure, which Spark translates into low-level code, can benefit from the improvements introduced in future versions (Chambers and Zaharia 2018).
1.2 Use of Scala programming language
For most DataFrame and structured API operations, the performance difference among the API language interfaces (Java, Scala, Python, R) is small. A significant difference appears, however, when using UDFs and custom RDD operations, such as MAP or REDUCE with user-defined functions (Chambers and Zaharia 2018). In fact, when UDFs or User-Defined Aggregate Functions (UDAFs) are written in non-JVM languages, such as Python or R, much of the performance benefit is lost, because the data must be transferred out of the JVM (Karau and Warren 2017). This transfer is computationally intensive and problematic for several reasons:
The data must be serialized and deserialized in order to be transferred between processes (Chambers and Zaharia 2018);
The transfer of data among different processes is performed via inter-process communication (IPC) mechanisms, which are slower than in-JVM multithread-based communication (Pardi 2004; Daly 2000);
The JVM and the Python (or R) processes, which must coexist in the same environment, compete for resources (e.g., memory and CPU). This “resource pressure”, especially regarding memory, also reduces the stability of the application (Chambers and Zaharia 2018);
It is difficult to guarantee a strict data type correspondence among different programming languages in the presence of transformations, which can further undermine application stability (Chambers and Zaharia 2018).
For these reasons, we decided to use Scala, the language in which Spark is written, for all core library elements.
1.3 Cross-join specific optimizations
As previously described, the CROSS approach creates a Cartesian product, with considerable space occupancy, as in the UBJ implementation (D’Angelo and Rampone 2014a).
To mitigate the high space occupancy, we add a numeric ID to every positive instance and to every negative instance. This gives a more efficient memory allocation and avoids storing the instances themselves in the DataFrame that contains the Cartesian product. Using numeric IDs instead of strings as keys reduces memory consumption (Spark 2019a, b), and the ID column of the negative instances is renamed to avoid the duplicate column names problem (Chambers and Zaharia 2018). Thus, the cross-join result contains these numeric IDs together with the partial relevance of each pair.
The relevance of each positive/negative pair is computed only once, using a UDF, and a copy of the cross-join result (crossWork) is used in the variable selection cycles of the Sij computation (see 2.2 of the algorithm schema).
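The ID-based layout described above can be illustrated with a small plain-Python sketch. It is not Spark code and not the authors' implementation. The names `cross_with_ids` and `pair_score` are hypothetical, and `pair_score` is only a stand-in (a count of differing positions) for the partial relevance, whose actual definition belongs to the algorithm schema and is not reproduced here.

```python
from itertools import product


def pair_score(pos, neg):
    # Hypothetical stand-in for the partial relevance of a pair:
    # the number of positions at which the two instances differ.
    return sum(1 for a, b in zip(pos, neg) if a != b)


def cross_with_ids(positives, negatives):
    # Assign a numeric ID to each instance; the instances stay in these
    # two small lookup tables and are not copied into the product.
    pos_ids = dict(enumerate(positives))
    neg_ids = dict(enumerate(negatives))
    # The Cartesian product stores only (positive ID, negative ID, score),
    # and each score is computed exactly once.
    cross = [
        (pid, nid, pair_score(pos_ids[pid], neg_ids[nid]))
        for pid, nid in product(pos_ids, neg_ids)
    ]
    return pos_ids, neg_ids, cross


positives = ["0110", "1010"]
negatives = ["0111", "1100", "0000"]
pos_ids, neg_ids, cross = cross_with_ids(positives, negatives)
```

Each row of `cross` holds two integers and a score regardless of how long the instances are, which is the point of replacing the instances with IDs in the product.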
Another important improvement is the use of the Broadcast Hash Join instead of the Shuffled Hash Join. Broadcasting a low-cardinality DataFrame can massively improve join performance by reducing shuffling (Karau and Warren 2017; Spark 2019a, b). In general, Spark supports two types of joins: the shuffle join and the broadcast join. In the shuffle join, every node of the cluster interacts with every other node, sharing data according to which node holds the keys needed to perform the join. These communications are expensive and can congest the network, especially if partitioning is not optimal. When both tables are big, this behavior is unavoidable. When a big table is joined with a small one (in particular, one small enough to fit in the memory of each single worker node), the join can instead be optimized by forcing a Broadcast Join. With this option all the broadcast data, in our case the Positive Instances, are replicated on each node of the cluster. This replication, which may sound expensive, spares the computation the all-to-all communication during the entire join process: there is a large communication in the very first phase of the join, but none afterwards (Chambers and Zaharia 2018).
The use of a broadcast join can be hinted to the Spark engine. Although Spark automatically optimizes the choice of the join strategy, we observed, on average, an improvement when forcing a broadcast join.
In Fig. 12 we report some experimental results comparing the broadcast join with the join strategy selected automatically by Spark. In three cases out of four, the broadcast join has a lower computing time than the Spark default.
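The idea behind the broadcast join can be sketched in plain Python, without Spark. This is a conceptual illustration only: the function name `broadcast_hash_join`, the list-of-lists model of partitions, and the sample rows are all assumptions made for the example. The small table is turned into a hash table once (the "broadcast"), and every partition of the big table is then joined locally.

```python
from collections import defaultdict


def broadcast_hash_join(big_partitions, small_table, key):
    # Build the hash table once; this stands in for the single broadcast step
    # in which the small table is copied to every node.
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in big_partitions:
        # Each partition is processed using only its own rows and the local
        # copy of the lookup table: no big-table row changes partition.
        joined = [
            {**big_row, **small_row}
            for big_row in partition
            for small_row in lookup.get(big_row[key], [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Two partitions of a "big" table and one small table, joined on "k".
negatives = [
    [{"k": 1, "neg": "n1"}, {"k": 2, "neg": "n2"}],
    [{"k": 1, "neg": "n3"}],
]
positives = [{"k": 1, "pos": "p1"}, {"k": 2, "pos": "p2"}]
result = broadcast_hash_join(negatives, positives, "k")
```

The output keeps the partitioning of the big table unchanged, which mirrors the absence of all-to-all communication after the initial broadcast.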
1.4 DataFrame caching
In general, Spark does not automatically perform any persistence or caching, because storing RDDs can be very time consuming. Indeed, every kind of persistence has a high cost and is unlikely to improve performance for operations that are performed only once.
Furthermore, on large datasets the cost of persisting or checkpointing can be so high that recomputing is preferable (Karau and Warren 2017; Kleppmann 2017). However, for some kinds of Spark programs, reusing an RDD can lead to huge performance gains, both in speed and in the reduction of failures. One relevant case is iterative computation, which is precisely our case. Thus, in-memory caching (Chambers and Zaharia 2018) is applied to the negative instances for the BROADP approach and to the negative, positive, and cross-join DataFrames for the CROSS approach. Spark automatically persists cached data to disk when the memory is full.
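Why caching pays off in an iterative computation can be shown with a toy model in plain Python. The class `LazyDataset` and the function `build_negatives` are hypothetical and only mimic the behavior discussed above: a dataset defined by a recipe is recomputed on every read unless it has been cached. Unlike Spark, where caching takes effect at the first action, this sketch materializes the data as soon as `cache()` is called.

```python
class LazyDataset:
    """A dataset defined by a recipe that is re-run on every read unless cached."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self.compute_calls = 0

    def cache(self):
        # Materialize once and keep the result in memory for later reads.
        if self._cached is None:
            self._cached = self.collect()
        return self

    def collect(self):
        # Serve from memory when available; otherwise run the recipe again.
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        return self._compute()


def build_negatives():
    # Stand-in for an expensive chain of transformations.
    return [x * x for x in range(5)]


uncached = LazyDataset(build_negatives)
cached = LazyDataset(build_negatives).cache()

# Three iterations that each read the same data.
for _ in range(3):
    uncached.collect()
    cached.collect()
```

After the loop, the uncached dataset has run its recipe once per iteration, while the cached one has run it a single time. For data read only once the two counts would be equal, and caching would add only overhead.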
Cite this article
Morfino, V., Rampone, S. & Weitschek, E. SP-BRAIN: scalable and reliable implementations of a supervised relevance-based machine learning algorithm. Soft Comput 24, 7417–7434 (2020). https://doi.org/10.1007/s00500-019-04366-9
DOI: https://doi.org/10.1007/s00500-019-04366-9