Novel method for optimizing performance in resource constrained distributed data streams

Abstract

The Big Data era has created many opportunities for applying data mining techniques to discover knowledge patterns across large and diverse collections of data whose volume is growing at an exponential rate. Recent approaches to Distributed Data Mining (DDM) have focused on addressing the heterogeneous nature of data sources. However, such approaches do not prioritize reducing data communication costs, which can be prohibitive in large-scale sensor networks where bandwidth is a limited resource. In fact, high communication and computational costs are the two most prominent problems encountered in heterogeneous distributed environments. Moreover, decreasing the communication load in a distributed environment has an adverse effect on classification accuracy. Therefore, the research challenge lies in balancing transmission cost, computational cost, and accuracy. This paper proposes an algorithm, Performance Optimizer in Distributed Stream Mining (PODSM), based on Bayesian inference, that reduces communication volume and resource time in a heterogeneous distributed data mining environment while retaining prediction accuracy. The approach computes statistics from past data and then applies these statistics to new data; in other words, it gives the system the ability to learn from experience. Our experimental evaluation shows that a significant reduction in communication load and an improvement in classification response time can be achieved across a diverse range of dataset types. For one of the datasets, communication overhead was reduced by 34.66% and resource time by nearly 27%. Importantly, rather than suffering a loss in accuracy, this dataset showed an accuracy increase of 0.44%.
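The general idea of computing statistics from past data and reusing them on new data can be illustrated with a small sketch. This is not the PODSM algorithm itself, whose details are not given here. It is a plain naive Bayes classifier with Laplace smoothing, under the assumption that a local site classifies a record on its own when its posterior confidence reaches a threshold and transmits the record only when confidence is low. The class `LocalBayesSite`, its methods, the threshold value, and the toy sensor records are all hypothetical.

```python
from collections import Counter, defaultdict


class LocalBayesSite:
    """Hypothetical local site that learns conditional probabilities from past records."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold                    # assumed confidence cut-off
        self.class_counts = Counter()                 # N(class)
        self.feature_counts = defaultdict(Counter)    # (index, value) -> N(value, class)
        self.feature_values = defaultdict(set)        # index -> distinct values seen

    def learn(self, record, label):
        """Update the stored statistics with one labelled historical record."""
        self.class_counts[label] += 1
        for i, value in enumerate(record):
            self.feature_counts[(i, value)][label] += 1
            self.feature_values[i].add(value)

    def posterior(self, record):
        """Naive Bayes posterior over classes, with Laplace smoothing."""
        total = sum(self.class_counts.values())
        if total == 0:
            raise ValueError("no historical records have been learned yet")
        scores = {}
        for label, n_label in self.class_counts.items():
            score = n_label / total                   # prior P(class)
            for i, value in enumerate(record):
                n_joint = self.feature_counts[(i, value)][label]
                n_values = len(self.feature_values[i]) or 1
                score *= (n_joint + 1) / (n_label + n_values)
            scores[label] = score
        norm = sum(scores.values())
        return {label: s / norm for label, s in scores.items()}

    def classify(self, record):
        """Return (label, confidence, transmit); transmit is True when confidence is low."""
        post = self.posterior(record)
        label = max(post, key=post.get)
        confidence = post[label]
        return label, confidence, confidence < self.threshold


# Toy historical data (made up for illustration only).
history = [
    (("hot", "high"), "alarm"),
    (("hot", "high"), "alarm"),
    (("hot", "high"), "alarm"),
    (("cold", "low"), "normal"),
    (("cold", "low"), "normal"),
    (("cold", "low"), "normal"),
]

site = LocalBayesSite(threshold=0.9)
for record, label in history:
    site.learn(record, label)

# A record resembling the history is handled locally; an ambiguous one is flagged for transmission.
for new_record in [("hot", "high"), ("hot", "low")]:
    label, confidence, transmit = site.classify(new_record)
    print(new_record, label, round(confidence, 3), "transmit" if transmit else "local")
```

In this sketch the communication saving comes entirely from the records that are classified locally. The threshold controls the trade-off: raising it sends more records over the network, while lowering it relies more heavily on the locally learned statistics.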

Notes

  1. An alias for low-confidence, signifies confidence lower than the threshold.

  2. First set of records that is used by the trouble site as the historic data for calculation of the conditional probabilities.

Author information

Corresponding author

Correspondence to Rashi Bhalla.

About this article

Cite this article

Bhalla, R., Pears, R., Naeem, M.A. et al. Novel method for optimizing performance in resource constrained distributed data streams. Appl Intell 52, 12924–12942 (2022). https://doi.org/10.1007/s10489-021-03019-5

  • DOI: https://doi.org/10.1007/s10489-021-03019-5

Keywords

  • Big data
  • Bayesian inference
  • Distributed data stream mining
  • Heterogeneous distributed data