
A distributed ensemble of relevance vector machines for large-scale data sets on Spark


Abstract

The relevance vector machine (RVM) is a machine learning algorithm based on sparse Bayesian theory that shows good classification performance on small-scale data sets. However, due to the RVM's \(O\left( n^{3}\right) \) runtime complexity and \(O\left( n^{2}\right) \) space complexity, it is difficult to train models on medium- or large-scale data sets. Therefore, a distributed ensemble of relevance vector machines on the Spark framework (DE-RVM) is proposed. In this approach, the data set is divided into a number of disjoint subsets, and on each subset a set of RVM classifiers is trained with sample-type-based AdaBoost-RVM (STAB-RVM), following the idea of ensemble learning. The final classifier is generated by a combination method that incorporates a diversity measure over the RVM classifiers; the combination weights minimizing the empirical loss of the combined classifier are obtained by solving a quadratic programming problem. The algorithm was applied to both artificial and real data sets. The experimental results show that the proposed method offers good classification performance and effectively improves the ability of the RVM to process large-scale data sets.
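The overall scheme described above (partition the data into disjoint subsets, boost a set of base classifiers on each subset, then combine the subset ensembles) can be illustrated with a minimal stdlib-only Python sketch. This is not the paper's implementation: decision stumps stand in for RVM base classifiers, plain weighted voting replaces the diversity-aware quadratic-programming combination, and local lists replace Spark RDDs. All names and the toy data set are assumptions for illustration.

```python
import math
import random

def stump_train(X, y, w):
    # Pick the (feature, threshold, polarity) stump with lowest weighted
    # error -- a cheap stand-in for training an RVM on the subset.
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for pol in (1, -1):
                pred = [pol if x[j] > t else -pol for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, pol)
    return best  # (weighted error, feature index, threshold, polarity)

def stump_predict(model, x):
    _, j, t, pol = model
    return pol if x[j] > t else -pol

def adaboost(X, y, rounds=5):
    # Standard binary AdaBoost on one data subset (labels in {-1, +1}).
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        model = stump_train(X, y, w)
        err = max(model[0], 1e-10)
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, model))
        # Reweight: increase the weight of misclassified samples.
        w = [wi * math.exp(-alpha * yi * stump_predict(model, xi))
             for wi, xi, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def ensemble_predict(ensembles, x):
    # Combine all subset ensembles by summed weighted votes
    # (the paper instead solves a QP with a diversity measure).
    score = sum(alpha * stump_predict(m, x)
                for ens in ensembles for alpha, m in ens)
    return 1 if score >= 0 else -1

# Toy data: label is the sign of the first feature.
random.seed(0)
def make_point():
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    return x, (1 if x[0] > 0 else -1)

data = [make_point() for _ in range(200)]

# Disjoint partition into k subsets; one boosted ensemble per subset.
# On Spark this step would be a mapPartitions over an RDD.
k = 4
subsets = [data[i::k] for i in range(k)]
ensembles = [adaboost([x for x, _ in sub], [yi for _, yi in sub])
             for sub in subsets]

test = [make_point() for _ in range(100)]
acc = sum(ensemble_predict(ensembles, x) == y for x, y in test) / len(test)
print(f"holdout accuracy: {acc:.2f}")
```

The disjoint-subset split keeps each boosting run small, which is the point of the approach: the cubic cost of RVM training applies only to each subset's size, not to the full data set.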





Acknowledgements

This work is supported by the National Natural Science Foundation of China under projects 61402345 and 61735013.

Author information

Correspondence to Fang Liu.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.


Cite this article

Qin, W., Liu, F., Tong, M. et al. A distributed ensemble of relevance vector machines for large-scale data sets on Spark. Soft Comput 25, 7119–7130 (2021). https://doi.org/10.1007/s00500-021-05671-y
