SP-BRAIN: scalable and reliable implementations of a supervised relevance-based machine learning algorithm

  • Methodologies and Application

Abstract

In this work, new implementations of the U-BRAIN (Uncertainty-managing Batch Relevance-Based Artificial Intelligence) supervised machine learning algorithm are described. The implementations, referred to as SP-BRAIN (SP stands for Spark), aim to efficiently process large datasets. Given the iterative nature of the algorithm and its dependence on in-memory data, a non-standard MapReduce paradigm is applied, taking into account several memory and performance issues, e.g., the granularity of the MAP task, the reduction of the shuffling operation, caching, partial data recomputing, and the usage of clusters. The implementations benefit from the components of the Hadoop ecosystem, such as HDFS, YARN, and streaming. Testing is performed in cloud execution environments, using different configurations with up to 128 cores. The performance of the new implementations is evaluated on three known datasets, and the findings are compared to those of a previous U-BRAIN parallel implementation. The results show a speedup of up to 20× with good scalability and reliability in cluster environments.



Acknowledgements

This work has been supported by Università del Sannio—POR CAMPANIA FESR 2014/2020—”Distretti ad Alta Tecnologia, Aggregazioni e Laboratori Pubblico Privati per il rafforzamento del potenziale scientifico e tecnologico della Regione Campania”—Distretto Aerospaziale della Campania (DAC) S.C.A.R.L.—in the framework of the TABASCO project B43D18000220007.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Salvatore Rampone.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Optimization choices

In this appendix, some optimization choices are described.

1.1 Use of the Spark Structured API

The use of the Spark Structured APIs instead of the lower-level API (RDDs) generates more performant code, because the low-level code automatically generated by Spark is generally faster than the code written by a programmer of average experience. Furthermore, a higher-level structure that is translated into low-level code by Spark can benefit from the improvements of future Spark versions (Chambers and Zaharia 2018).
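As an illustration, the following minimal sketch (the dataset path and column names are hypothetical, not taken from the SP-BRAIN code) shows the declarative Structured API style, where Spark's optimizer, rather than hand-written RDD code, produces the low-level execution plan:

```scala
// Minimal sketch: the Structured API lets Spark plan and optimize execution.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StructuredApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-api-sketch")
      .getOrCreate()

    // Hypothetical instances table with "label" and "value" columns.
    val instances = spark.read.parquet("/path/to/instances")

    // Declarative aggregation: Spark generates the low-level code.
    val counts = instances.groupBy(col("label")).agg(count(lit(1)).as("n"))
    counts.show()

    spark.stop()
  }
}
```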

1.2 Use of Scala programming language

For most DataFrame and Structured API operations, the performance difference between the API interfaces (Java, Scala, Python, R) is not very relevant. However, a significant difference appears when using UDFs and custom RDD operations, such as MAP or REDUCE with user-defined functions (Chambers and Zaharia 2018). In fact, when using UDFs or User-Defined Aggregate Functions (UDAFs) written in non-JVM languages, such as Python or R, much of the performance benefit is lost, because the data must still be transferred out of the JVM (Karau and Warren 2017). This transfer is computationally intensive and problematic for several reasons:

  • Data must be serialized and deserialized in order to be transferred between processes (Chambers and Zaharia 2018);

  • The transfer of data among different processes is performed via inter-process communication (IPC) mechanisms, which are slower than in-JVM multithread-based communications (Pardi 2004; Daly 2000);

  • There is resource competition (e.g., for memory and CPU) between the JVM and the Python (or R) processes that need to live together in the same environment. This “resource pressure”, especially regarding memory, also reduces the stability of the application (Chambers and Zaharia 2018);

  • It is difficult to guarantee a strict data-type correspondence among different programming languages in the presence of transformations; this is another element that can undermine application stability (Chambers and Zaharia 2018).

For these reasons, we decided to use Scala, the language in which Spark is written, for all core library elements.
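To illustrate the point, the following minimal sketch (the UDF and column names are illustrative, not the paper's code) shows a Scala UDF that runs entirely inside the JVM, so no data has to be serialized to an external Python or R process:

```scala
// Minimal sketch: a JVM-native UDF applied to a DataFrame column.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object JvmUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jvm-udf-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("10110", "00101").toDF("instance")

    // Hypothetical UDF: counts the '1' symbols of a binary-string instance.
    val onesCount = udf((s: String) => s.count(_ == '1'))

    df.withColumn("ones", onesCount($"instance")).show()
    spark.stop()
  }
}
```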

1.3 Cross-join specific optimizations

As previously described, the CROSS approach creates a Cartesian product, with a significant space occupancy, as in the UBJ implementation (D’Angelo and Rampone 2014a).

To mitigate the high space occupancy, we decided to add a numeric ID to the positive and to the negative instances, in order to obtain a more efficient memory allocation and to avoid storing the instances of the Cartesian product in the DataFrame. The use of numeric IDs instead of strings as keys reduces memory consumption (Spark 2019a, b), and the ID column of the negative instances is renamed to handle the duplicate column name problem (Chambers and Zaharia 2018). Thus, these numeric IDs, together with the partial relevance of each pair, are included in the cross-join result.

The relevance of each positive/negative pair is computed just once using a UDF, and a copy of the cross-join DataFrame (crossWork) is used to compute the cycles of the variable selection in the Sij computation (see step 2.2 of the algorithm schema).
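The following minimal sketch gives an idea of this construction; the column names, the relevance function (here a simple Hamming-style count), and the crossWork handling are illustrative assumptions, not the original SP-BRAIN code:

```scala
// Minimal sketch: numeric IDs on both sides, a renamed negative-ID column,
// and a per-pair relevance computed once with a UDF on the cross-join.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CrossJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cross-join-sketch").getOrCreate()
    import spark.implicits._

    val positives = Seq("110", "101").toDF("pInstance")
      .withColumn("pId", monotonically_increasing_id())
    val negatives = Seq("000", "011").toDF("nInstance")
      .withColumn("nId", monotonically_increasing_id()) // distinct name, no duplicate "id"

    // Hypothetical relevance: the number of differing positions of the pair.
    val relevance = udf((p: String, n: String) =>
      p.zip(n).count { case (a, b) => a != b })

    // Only the IDs and the partial relevance are kept, not the full instances.
    val cross = positives.crossJoin(negatives)
      .select($"pId", $"nId", relevance($"pInstance", $"nInstance").as("rel"))

    val crossWork = cross // working copy used in the variable-selection cycles
    crossWork.show()
    spark.stop()
  }
}
```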

One more important improvement is the use of the Broadcast Hash Join instead of the Shuffled Hash Join. Broadcasting a low-cardinality DataFrame can massively improve the join performance thanks to the reduction of shuffling (Karau and Warren 2017; Spark 2019a, b). In general, Spark supports two types of joins: the shuffle join and the broadcast join. In the shuffle join, every node of the cluster interacts with every other node to share information about which node holds the keys needed to perform the join. These communications are expensive and can congest the network, especially if the partitioning is not optimal. Of course, when joining two big tables, this behavior is mandatory. But when joining a big table with a small table (in particular, a table small enough to fit in the memory of each single worker node), we have the chance to optimize the join by forcing a broadcast join. This option means that all broadcast data, in our case the positive instances, are replicated on each node of the cluster. This behavior, which may sound expensive, saves the computation from all-to-all communication during the entire join process. So, there is a large communication in the very first phase of the join, but no further communication afterwards (Chambers and Zaharia 2018).

The use of a broadcast join can be hinted to the Spark engine. Even though Spark automatically optimizes the choice of joining strategy, we noticed, on average, an improvement by forcing a broadcast join.
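A minimal sketch of such a hint is shown below; the DataFrame names and contents are illustrative, not the SP-BRAIN code:

```scala
// Minimal sketch: hinting Spark to broadcast the small positive-instances
// DataFrame so the join avoids an all-to-all shuffle of the large side.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-sketch").getOrCreate()
    import spark.implicits._

    val positives = Seq((0L, "110"), (1L, "101")).toDF("pId", "pInstance") // small table
    val negatives = Seq((0L, "000"), (1L, "011")).toDF("nId", "nInstance") // large in practice

    // The hint replicates `positives` to every worker node once, up front.
    val crossed = negatives.crossJoin(broadcast(positives))
    crossed.show()
    spark.stop()
  }
}
```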

In Fig. 12 we report some experimental results comparing the broadcast join with the join strategy selected automatically by Spark. We note that in three cases out of four the broadcast join has a lower computing time than the Spark default one.

Fig. 12 Broadcast join vs. automatically optimized join: training time comparison. The x-axis reports the dataset used (IPDATA and HS3D_UB) and the number of working cores (4 and 8); the y-axis reports the training time

1.4 DataFrame caching

In general, Spark does not automatically perform any persistence or caching, because storing RDDs can be very time-consuming. Indeed, all forms of persistence have a high cost and are unlikely to improve performance for operations that are performed only once.

Furthermore, on large datasets the cost of persisting or checkpointing can be so high that recomputing is more desirable (Karau and Warren 2017; Kleppmann 2017). However, for some kinds of Spark programs, reusing an RDD can lead to huge performance gains, both in terms of speed and of reduced failures. One of the relevant cases is iterative computation, which is precisely our case. Thus, in-memory caching (Chambers and Zaharia 2018) is used on the negative instances for the BROADP approach and on the negative, positive, and cross-join DataFrames for the CROSS approach. Spark automatically persists cached data to disk when the memory is full.
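As a minimal sketch (the DataFrame name and path are hypothetical), caching a DataFrame that is reused across iterations may look as follows; with the MEMORY_AND_DISK storage level, Spark spills cached partitions to disk when memory is full:

```scala
// Minimal sketch: caching a DataFrame reused by an iterative computation.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("caching-sketch").getOrCreate()

    val negatives = spark.read.parquet("/path/to/negatives") // hypothetical path

    // For DataFrames, cache() is equivalent to persist(MEMORY_AND_DISK):
    // partitions that do not fit in memory are spilled to disk.
    negatives.persist(StorageLevel.MEMORY_AND_DISK)

    // Iterative reuse: each pass reads the cached data instead of recomputing it.
    for (_ <- 1 to 3) println(negatives.count())

    negatives.unpersist()
    spark.stop()
  }
}
```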


Cite this article

Morfino, V., Rampone, S. & Weitschek, E. SP-BRAIN: scalable and reliable implementations of a supervised relevance-based machine learning algorithm. Soft Comput 24, 7417–7434 (2020). https://doi.org/10.1007/s00500-019-04366-9
