Abstract
The Big Data era has created many opportunities to use data mining techniques to discover knowledge patterns across large and diverse collections of data whose volume is growing at an exponential rate. Recent approaches to Distributed Data Mining (DDM) have focused on addressing the heterogeneous nature of data sources. However, such approaches do not prioritize reducing data communication costs, which can be prohibitive in large-scale sensor networks where bandwidth is a limited resource. In fact, high communication and computational costs are the two most prominent problems encountered in heterogeneous distributed environments. Moreover, efforts to decrease the communication load in a distributed environment have an adverse effect on classification accuracy. The research challenge therefore lies in balancing transmission cost, computational cost, and accuracy. This paper proposes an algorithm, Performance Optimizer in Distributed Stream Mining (PODSM), based on Bayesian inference, to reduce the communication volume and resource time in a heterogeneous distributed data mining environment while retaining prediction accuracy. The approach computes statistics from past data and then applies these statistics to new data; in other words, it gives the system the ability to learn from experience. Our experimental evaluation reveals that a significant reduction in communication load and an improvement in classification response time can be achieved across a diverse range of dataset types. For one of the datasets, communication overhead was reduced by 34.66% and resource time by nearly 27%. Importantly, rather than degrading, accuracy on this dataset increased by 0.44%.
Notes
An alias for low-confidence, signifies confidence lower than the threshold.
First set of records that is used by the trouble site as the historic data for calculation of the conditional probabilities.
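The general idea described in the abstract and notes (conditional probabilities computed from historic records, then a confidence threshold applied to new records) can be illustrated with a minimal Python sketch. This is not the PODSM algorithm, whose details are not given in this excerpt. The function names, the 0.8 threshold, the toy sensor readings, and the rule that only low-confidence records are communicated are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_conditional_probabilities(records, labels):
    """Estimate P(label | feature tuple) from historic records by counting.

    Hypothetical helper: a stand-in for the statistics a site might
    compute from its first batch of records.
    """
    counts = defaultdict(Counter)
    for features, label in zip(records, labels):
        counts[tuple(features)][label] += 1
    return {
        key: {label: n / sum(c.values()) for label, n in c.items()}
        for key, c in counts.items()
    }


def classify_or_escalate(features, cond_probs, threshold=0.8):
    """Return (label, confidence, needs_communication) for one new record.

    Assumption: a record is sent over the network only when the local
    confidence falls below the threshold (the low-confidence case).
    """
    posterior = cond_probs.get(tuple(features))
    if not posterior:
        # Pattern never seen in the historic data: treat as low-confidence.
        return None, 0.0, True
    label, confidence = max(posterior.items(), key=lambda kv: kv[1])
    return label, confidence, confidence < threshold


# Toy historic data (hypothetical sensor readings and class labels).
historic = [("hot", "dry"), ("hot", "dry"), ("hot", "dry"),
            ("cold", "dry"), ("cold", "dry")]
historic_labels = ["alarm", "alarm", "alarm", "ok", "alarm"]

cond_probs = fit_conditional_probabilities(historic, historic_labels)
confident = classify_or_escalate(("hot", "dry"), cond_probs)
uncertain = classify_or_escalate(("cold", "dry"), cond_probs)
unseen = classify_or_escalate(("cold", "wet"), cond_probs)
```

In this sketch only the `uncertain` and `unseen` records would trigger communication, which is the kind of saving the abstract attributes to reusing statistics from past data.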
About this article
Cite this article
Bhalla, R., Pears, R., Naeem, M.A. et al. Novel method for optimizing performance in resource constrained distributed data streams. Appl Intell 52, 12924–12942 (2022). https://doi.org/10.1007/s10489-021-03019-5
DOI: https://doi.org/10.1007/s10489-021-03019-5
Keywords
- Big data
- Bayesian inference
- Distributed data stream mining
- Heterogeneous distributed data