SP-BRAIN: scalable and reliable implementations of a supervised relevance-based machine learning algorithm

Morfino, Valerio; Rampone, Salvatore; Weitschek, Emanuel

doi:10.1007/s00500-019-04366-9

Methodologies and Application
Published: 20 September 2019

Soft Computing, volume 24, pages 7417–7434 (2020)

Abstract

In this work, new implementations of the U-BRAIN (Uncertainty-managing Batch Relevance-Based Artificial Intelligence) supervised machine learning algorithm are described. The implementations, referred to as SP-BRAIN (SP stands for Spark), aim to process large datasets efficiently. Given the iterative nature of the algorithm and its dependence on in-memory data, a non-standard MapReduce paradigm is applied, taking into account several memory and performance problems, e.g., the granularity of the MAP task, the reduction in the shuffling operation, caching, partial data recomputing, and usage of clusters. The implementations benefit from the components of the whole Hadoop ecosystem, such as HDFS, Yarn, and streaming. Testing is performed in cloud execution environments, using different configurations with up to 128 cores. The performance of the new implementations is evaluated on three known datasets, and the findings are compared with those of a previous U-BRAIN parallel implementation. The results show a speedup of up to 20× with good scalability and reliability in cluster environments.

References

Aloisio A, Izzo V, Rampone S (2006) VLSI implementation of greedy-based distributed routing schemes for ad hoc networks. Soft Comput 11(9):865–872. https://doi.org/10.1007/s00500-006-0138-7
Armbrust M et al (2015) Scaling spark in the real world. Proc VLDB Endow 8(12):1840–1843. https://doi.org/10.14778/2824032.2824080
Benson D, Karsch-Mizrachi I, Lipman D, Ostell J, Sayers E (2010) GenBank. Nucleic Acids Res 39:D32–D37. https://doi.org/10.1093/nar/gkq1079
Celli F, Cumbo F, Weitschek E (2018) Classification of large DNA methylation datasets for identifying cancer drivers. Big Data Res 13:21–28. https://doi.org/10.1016/j.bdr.2018.02.005
Chambers A, Zaharia M (2018) Spark: the definitive guide, 1st edn. O’Reilly Media, Sebastopol, pp 49–58, 239–246, 326–328
Clancy S, Brown W (2008) Translation: DNA to mRNA to protein | learn science at scitable. Nature.com. [Online]. https://www.nature.com/scitable/topicpage/translation-dna-to-mrna-to-protein-393. Accessed 10 Mar 2019
D’Angelo G, Palmieri F, Ficco M, Rampone S (2015) An uncertainty-managing batch relevance-based approach to network anomaly detection. Appl Soft Comput 36:408–418. https://doi.org/10.1016/j.asoc.2015.07.029
D’Angelo G, Pilla R, Tascini C, Rampone S (2019) A proposal for distinguishing between bacterial and viral meningitis using genetic programming and decision trees. Soft Comput. https://doi.org/10.1007/s00500-018-03729-y
Daly P (2000) Review: Java threads. Comput Bull 42(2):30. https://doi.org/10.1093/combul/42.2.30-b
D’Angelo G, Rampone S (2014a) Towards a HPC-oriented parallel implementation of a learning algorithm for bioinformatics applications. BMC Bioinform. https://doi.org/10.1186/1471-2105-15-s5-s2
D’Angelo G, Rampone S (2014b) Diagnosis of aerospace structure defects by a HPC implemented soft computing algorithm. In: 2014 IEEE metrology for aerospace (MetroAeroSpace). https://doi.org/10.1109/metroaerospace.2014.6865959
Dean J, Ghemawat S (2008) MapReduce. Commun ACM 51(1):107. https://doi.org/10.1145/1327452.1327492
Dobre C, Xhafa F (2014) Intelligent services for Big Data science. Future Gener Comput Syst 37:267–281. https://doi.org/10.1016/j.future.2013.07.014
Dörre J, Apel S, Lengauer C (2014) Modeling and optimizing MapReduce programs. Concurr Comput Pract Exp 27(7):1734–1766. https://doi.org/10.1002/cpe.3333
Eddy D, Adler J, Patterson B, Lucas D, Smith K, Morris M (2011) Individualized guidelines: the potential for increasing quality and reducing costs. Ann Intern Med 154(9):627. https://doi.org/10.7326/0003-4819-154-9-201105030-00008
Firouzi F et al (2018) Internet-of-Things and big data for smarter healthcare: from device to architecture, applications and analytics. Future Gener Comput Syst 78:583–586. https://doi.org/10.1016/j.future.2017.09.016
Gantz J, Reinsel D (2012) The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC Go-to-Market Services, Framingham, pp 1–16
Google (2019a) Google Cloud Platform Overview | Overview | Google Cloud. Google Cloud, 2019. [Online]. https://cloud.google.com/docs/overview/. Accessed 10 Mar 2019
Google (2019b) Cloud Dataproc FAQ | Cloud Dataproc Documentation | Google Cloud. Google Cloud, 2019. [Online]. https://cloud.google.com/dataproc/docs/resources/faq. Accessed 07 Jan 2019
Google (2019c) Geography and Regions | Documentation | Google Cloud. Google Cloud, 2019. [Online]. https://cloud.google.com/docs/geography-and-regions. Accessed 10 Feb 2019
Gray J (2008) Distributed computing economics. Queue 6(3):63–68. https://doi.org/10.1145/1394127.1394131
Grolinger K, Hayes M, Higashino W, L’Heureux A, Allison D, Capretz M (2014) Challenges for MapReduce in Big Data. In: 2014 IEEE world congress on services. https://doi.org/10.1109/services.2014.41
HDFS (2019) HDFS Architecture Guide. Hadoop.apache.org, 2019. [Online]. https://hadoop.apache.org/docs/current1/hdfs_design.html#Portability+Across+Heterogeneous+Hardware+and+Software+Platforms. Accessed 07 Jan 2019
Hennessy JL, Patterson D (2011) Computer architecture, 4th edn. Elsevier Morgan Kaufmann, Amsterdam, p 39
Huang X et al (2018) Revealing Alzheimer’s disease genes spectrum in the whole-genome by machine learning. BMC Neurol. https://doi.org/10.1186/s12883-017-1010-3
Huedecker N, Mery A, Ankush J (2017) Market guide for Hadoop distributions. Gartner IT glossary, 01–Feb–2017. [Online]. https://www.gartner.com/doc/3591517/market-guide-hadoop-distributions. Accessed 8 Mar 2019
Karau H, Warren R (2017) High performance Spark, 1st edn. O’Reilly Media Inc., Sebastopol, CA, USA, pp 66–69, 92–97, 115–118, 125–127, 136–146
Kleppmann M (2017) Designing data-intensive applications, 1st edn. O’Reilly Media Inc., Sebastopol, CA, USA, pp 6–11, 273–284, 295–298, 389–410, 424–426
Kranzlmüller D, Kacsuk P, Dongarra J (2005) Recent advances in parallel virtual machine and message passing interface. Int J High Perform Comput Appl 19(2):99–101. https://doi.org/10.1177/1094342005054256
L’Heureux A, Grolinger K, Elyamany HF, Capretz MAM (2017) Machine learning with Big Data: challenges and approaches. IEEE Access 5:7776–7797. https://doi.org/10.1109/ACCESS.2017.2696365
Marx V (2013) The big challenges of big data. Nature 498(7453):255–260. https://doi.org/10.1038/498255a
McBryan O (1994) An overview of message passing environments. Parallel Comput 20(4):417–444. https://doi.org/10.1016/0167-8191(94)90021-3
Mohamed A, Berg W, Peng H, Luo Y, Jankowitz R, Wu S (2017) A deep learning method for classifying mammographic breast density categories. Med Phys 45(1):314–321. https://doi.org/10.1002/mp.12683
Morfino V, Rampone S, Weitschek E (2019) A comparison of Apache Spark supervised machine learning algorithms for DNA splicing sites prediction. In: Esposito A, Faundez-Zanuy M, Morabito FC, Pasero E (eds) Neural approaches to dynamics of signal exchanges. Springer, Singapore, pp 133–143. https://doi.org/10.1007/978-981-13-8950-4_13
Narkhede N, Shapira G, Palino T (2017) Kafka: the definitive guide, 1st edn. O’Reilly Media Inc., Sebastopol, pp 1–16
Pardi W (2004) Programming concurrent and distributed algorithms in Java. IEEE Distrib Syst Online 5(11):5. https://doi.org/10.1109/mdso.2004.32
Parker C (2012) Unexpected challenges in large scale machine learning. In: 1st International workshop on big data, streams and heterogeneous source mining: algorithms, systems, programming models and applications, Beijing, China, pp 1–6
Perrella A, Morfino V (2014) WTC (WE TAKE CARE) Experimental smartphone app to follow-up and take care of patients with chronic infectious disease: which impact on patients life style? In: Nardone C, Rampone S (eds) Global sustainability inside and outside the territory, proceedings of the 1st international workshop. World Scientific, pp 107–112 https://doi.org/10.1142/9789814651325_0009
Pollastro P, Rampone S (2002) HS3D, a dataset of Homo sapiens splice regions, and its extraction procedure from a major public database. Int J Mod Phys C 13(08):1105–1117. https://doi.org/10.1142/s0129183102003796
Pugh W, Spacco J (2004) MPJava: high-performance message passing in Java using Java.nio. Lang Compil Parallel Comput. https://doi.org/10.1007/978-3-540-24644-2_21
Rampone S (1998) Recognition of splice junctions on DNA sequences by BRAIN learning algorithm. Bioinformatics 14(8):676–684. https://doi.org/10.1093/bioinformatics/14.8.676
Rampone S (2004) An error tolerant software equipment for human DNA characterization. IEEE Trans Nucl Sci 51(5):2018–2026. https://doi.org/10.1109/tns.2004.835609
Rampone S (2009) A web content management system for a geo-archeological research program. J Uncertain Syst 3(2):95–107
Rampone S, Russo C (2012) A fuzzified BRAIN algorithm for learning DNF from incomplete data. Electron J Appl Stat Anal 5(2):256–270. https://doi.org/10.1285/i20705948v5n2p256
Rampone S, Valente A (2012) Neural network aided evaluation of landslide susceptibility in Southern Italy. Int J Mod Phys C 23(1):10–29
Ryza S, Laserson U, Owen S, Wills J (2015) Advanced analytics with Spark, 1st edn. O’Reilly Media Inc., Sebastopol, p 66
Sa S (2018) Big Data in healthcare management: a review of literature. Am J Theor Appl Bus 4(2):57. https://doi.org/10.11648/j.ajtab.20180402.14
Sitto K, Presser M (2015) Field guide to Hadoop, 1st edn. O’Reilly Media, Inc, Sebastopol, pp 13–42, 55–117
Spark (2019a) Tuning - Spark 2.4.0 Documentation. Spark.apache.org, 2019. [Online]. https://spark.apache.org/docs/latest/tuning.html. Accessed 10 Feb 2019
Spark (2019b) Apache Spark™ - Unified Analytics Engine for Big Data. Spark.apache.org, 2019. [Online]. https://spark.apache.org. Accessed 07 Jan 2019
Suwinski P, Ong C, Ling M, Poh Y, Khan A, Ong H (2019) Advancing personalized medicine through the application of whole exome sequencing and big data analytics. Front Genet. https://doi.org/10.3389/fgene.2019.00049
Taylor R (2010) An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform. https://doi.org/10.1186/1471-2105-11-s12-s1
UCI (2019) UCI Machine Learning Repository. Archive.ics.uci.edu, 2019. [Online]. http://archive.ics.uci.edu/ml/index.php. Accessed 10 Mar 2019
Weitschek E, Fiscon G, Fustaino V, Felici G, Bertolazzi P (2015) Clustering and classification techniques for gene expression profile pattern analysis. Pattern Recognit Comput Mol Biol. https://doi.org/10.1002/9781119078845.ch19
Weitschek E, Lauro S, Cappelli E, Bertolazzi P, Felici G (2018) CamurWeb: a classification software and a large knowledge base for gene expression data of cancer. BMC Bioinform. https://doi.org/10.1186/s12859-018-2299-7
White T (2015) Hadoop: the definitive guide, 4th edn. O’Reilly & Associates, Sebastopol, pp 10, 22–37, 43–96
Zaharia M et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI’12 Proceedings of the 9th USENIX conference on networked systems design and implementation, San Jose, CA, p 2
Zaharia M, Reynold S, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache Spark: a unified engine for big data processing. Commun ACM 59(11):56–65. https://doi.org/10.1145/2934664

Acknowledgements

This work has been supported by Università del Sannio, POR CAMPANIA FESR 2014/2020, “Distretti ad Alta Tecnologia, Aggregazioni e Laboratori Pubblico Privati per il rafforzamento del potenziale scientifico e tecnologico della Regione Campania”, Distretto Aerospaziale della Campania (DAC) S.C.A.R.L., in the framework of the TABASCO project B43D18000220007.

Author information

Authors and Affiliations

Department of Law, Economics, Management and Quantitative Methods (DEMM), University of Sannio, Benevento, Italy
Valerio Morfino & Salvatore Rampone
Department of Engineering, Uninettuno University, Rome, Italy
Emanuel Weitschek

Corresponding author

Correspondence to Salvatore Rampone.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Optimization choices

In this appendix some optimization choices are described.

1.1 Use of Spark Structured API

Using the Spark Structured APIs instead of the lower-level API (RDDs) yields more performant code, because the low-level code generated automatically by Spark is generally faster than code written by a programmer of average experience. Furthermore, a higher-level structure that Spark translates into low-level code can benefit from improvements in future versions (Chambers and Zaharia 2018).

1.2 Use of Scala programming language

For most DataFrame and structured API operations, the performance difference between the API interfaces (Java, Scala, Python, R) is not significant. A significant difference does appear, however, when using UDFs and custom RDD operations, such as MAP or REDUCE with user-defined functions (Chambers and Zaharia 2018). In fact, when UDFs or User Defined Aggregate Functions (UDAFs) are written in non-JVM languages, such as Python or R, much of the performance benefit is lost, because the data must still be transferred out of the JVM (Karau and Warren 2017). This transfer is computationally intensive and problematic for several reasons:

The data must be serialized and deserialized in order to be transferred between processes (Chambers and Zaharia 2018);
The transfer of data among different processes is performed via inter-process communication (IPC) mechanisms, which are slower than in-JVM multithread-based communications (Pardi 2004; Daly 2000);
The JVM and the Python (or R) processes, which must coexist in the same environment, compete for resources (e.g., memory and CPU). This “resource pressure”, especially regarding memory, also reduces the stability of the application (Chambers and Zaharia 2018);
It is difficult to guarantee a strict data type correspondence among different programming languages in the presence of transformations, which can further undermine application stability (Chambers and Zaharia 2018).

For these reasons, we decided to use Scala, the language in which Spark is written, for all core library elements.
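The serialization and type-correspondence points in the list above can be illustrated with a minimal, standard-library-only sketch. It is an analogy rather than Spark's actual transfer mechanism: the row contents are hypothetical, and JSON is used only as a convenient language-neutral encoding, not because Spark relies on it.

```python
import json
import pickle

# A hypothetical row as a Python worker process might hold it.
row = {"id": 7, "features": (1, 0, 1), "label": True}

# Crossing a process boundary means encoding to bytes and decoding again;
# both steps cost CPU time on every transferred row.
wire_bytes = pickle.dumps(row)
same_language_copy = pickle.loads(wire_bytes)

# A language-neutral encoding has no tuple type, so the round trip
# silently changes the type of the "features" field to a list.
cross_language_copy = json.loads(json.dumps(row))
```

Within one language the round trip preserves the row exactly, whereas the language-neutral round trip returns a list where a tuple went in. Mismatches of this kind are one way in which type correspondence across runtimes can break.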

1.3 Cross-join specific optimizations

As previously described, the CROSS approach creates a Cartesian product with a considerable space occupancy, as in the UBJ implementation (D’Angelo and Rampone 2014a).

To mitigate the high space occupancy, we add a numeric ID to each positive and each negative instance, so that memory is allocated more efficiently and the DataFrame holding the Cartesian product does not need to store the instances themselves. The use of numeric IDs instead of strings as keys reduces memory consumption (Spark 2019a, b), and the ID column of the negative instances is renamed to avoid the duplicate column names problem (Chambers and Zaharia 2018). Thus, the cross-join result includes these numeric IDs together with the partial relevance of each pair.

The relevance of each positive/negative pair is computed only once using a UDF, and a copy of the cross-join (crossWork) is used in the variable selection cycles of the S_ij computation (see 2.2 of the algorithm schema).
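The ID-based layout described above can be sketched in plain Python, without Spark. The function names are illustrative, and `partial_relevance` is a hypothetical stand-in for the relevance UDF (here it simply counts differing positions); only the idea of storing numeric IDs plus a per-pair value, computed once, comes from the text.

```python
from itertools import product


def partial_relevance(pos, neg):
    # Hypothetical stand-in for the relevance UDF: the number of
    # positions at which the two instances differ.
    return sum(1 for a, b in zip(pos, neg) if a != b)


def cross_with_ids(positives, negatives):
    # Assign a numeric ID to every instance (its position in the list).
    pos_by_id = dict(enumerate(positives))
    neg_by_id = dict(enumerate(negatives))
    # The cross product keeps only (pos_id, neg_id, relevance) rows,
    # never the instance strings themselves; each value is computed once.
    rows = [
        (pid, nid, partial_relevance(p, n))
        for (pid, p), (nid, n) in product(pos_by_id.items(), neg_by_id.items())
    ]
    return pos_by_id, neg_by_id, rows


positives = ["1010", "1100"]
negatives = ["0010", "0111", "1111"]
pos_by_id, neg_by_id, cross = cross_with_ids(positives, negatives)
```

Each row of `cross` holds three small integers regardless of instance length, and the original instances remain recoverable through the two ID lookups when an iteration needs them.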

One more important improvement is the use of the Broadcast Hash Join instead of the Shuffled Hash Join. Broadcasting a low-cardinality DataFrame can massively improve join performance thanks to the reduction in shuffle joins (Karau and Warren 2017; Spark 2019a, b). In general, Spark supports two types of joins: shuffle join and broadcast join. In a shuffle join, every node of the cluster interacts with every other node to share data according to which node has the keys needed to perform the join. These communications are expensive and can congest the network, especially if partitioning is not optimal. When both tables are big, this behavior is unavoidable. But when a big table is joined with a small one (in particular, a table small enough to fit in the memory of each single worker node), the join can be optimized by forcing a Broadcast Join. With this option all broadcasted data, in our case the Positive Instances, are replicated on each node of the cluster. Although this may sound expensive, it spares the computation the all-to-all communication during the entire join process: there is one large communication in the very first phase of the join, and none afterwards (Chambers and Zaharia 2018).
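The mechanics of a broadcast hash join can be sketched in plain Python as follows. This is a toy model under stated assumptions, not Spark code: partitions are ordinary lists standing in for the data held by each worker, and the function and variable names are illustrative.

```python
def broadcast_hash_join(big_partitions, small_table):
    # Build the hash table once from the small side; in a cluster this
    # table is what would be copied to every worker node.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    joined = []
    for partition in big_partitions:
        # Each partition is probed locally against the replicated table,
        # so no row of the big table has to move between partitions.
        local = [
            (key, big_value, small_value)
            for key, big_value in partition
            for small_value in lookup.get(key, [])
        ]
        joined.append(local)
    return joined


# Two partitions of a "big" table and one "small" table, keyed by integer.
big_partitions = [[(0, "n0"), (1, "n1")], [(1, "n2"), (2, "n3")]]
small_table = [(0, "p0"), (1, "p1")]
result = broadcast_hash_join(big_partitions, small_table)
```

The output keeps the partitioning of the big side, which reflects the point made in the text: the only data that travels is the small table, once, at the start of the join.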

The use of a broadcast join can be hinted to the Spark engine. Although Spark automatically optimizes the choice of join strategy, we observed, on average, an improvement when forcing a broadcast join.

In Fig. 12 we report some experimental results comparing a broadcast join with the join strategy selected automatically by Spark. In three cases out of four, the broadcast join has a lower computing time than the Spark default.

1.4 DataFrame caching

In general, Spark does not automatically perform any persistence or caching, because storing RDDs can be very time consuming. Indeed, all kinds of persistence have a high cost and are unlikely to improve performance for operations that are performed only once.

Furthermore, on large datasets the cost of persisting or checkpointing can be so high that recomputing is more desirable (Karau and Warren 2017; Kleppmann 2017). However, for some kinds of Spark programs, reusing an RDD can lead to huge performance gains, both in terms of speed and of reducing failures. One of the relevant cases is iterative computation, which is precisely our case. Thus, memory caching (Chambers and Zaharia 2018) is used on the negative instances for the BROADP approach and on the negative, positive, and cross-join DataFrames for the CROSS approach. Spark automatically persists cached data on disk when the memory is full.
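Why caching pays off in an iterative computation can be shown with a small plain-Python model of lazy evaluation. `LazyDataset` is a toy class written for this illustration, not a Spark API, and `expensive_load` is a hypothetical placeholder for any costly derivation of a dataset.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark API)."""

    def __init__(self, compute):
        self._compute = compute
        self._use_cache = False
        self._cached = None
        self.computations = 0  # how many times the data was recomputed

    def cache(self):
        # Mark the dataset so that the first materialization is kept.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


def expensive_load():
    # Hypothetical placeholder for an expensive derivation.
    return [x * x for x in range(5)]


def iterate(dataset, n_iterations):
    # Every iteration needs the whole dataset again.
    total = 0
    for _ in range(n_iterations):
        total += sum(dataset.collect())
    return total


uncached = LazyDataset(expensive_load)
cached = LazyDataset(expensive_load).cache()
uncached_total = iterate(uncached, 10)
cached_total = iterate(cached, 10)
```

Both runs return the same total, but the uncached dataset is recomputed in each of the ten iterations while the cached one is computed once. For a dataset used only once the bookkeeping would bring no benefit, which matches the caveat in the text.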

About this article

Cite this article

Morfino, V., Rampone, S. & Weitschek, E. SP-BRAIN: scalable and reliable implementations of a supervised relevance-based machine learning algorithm. Soft Comput 24, 7417–7434 (2020). https://doi.org/10.1007/s00500-019-04366-9

Issue Date: May 2020
