Abstract
Most data mining approaches assume that the data are available from a single source. When data are produced at many physically distributed locations, as at a retailer such as Wal-Mart, these methods require a data center that gathers the data from all of the locations. Transmitting large amounts of data to a data center is sometimes expensive or even impractical. Distributed and parallel data mining algorithms were developed to address this problem. In this paper, we survey state-of-the-art algorithms and applications in distributed data mining and discuss future research opportunities.
Acknowledgments
This work is partially supported by the Chinese Academy of Sciences under Grant No. 20040402, the Changjiang Scholar Program of the Ministry of Education of China, the National Natural Science Foundation of China under Grant No. 71132008, and the US National Science Foundation under Grant No. 1044845.
Cite this article
Zeng, L., Li, L., Duan, L. et al. Distributed data mining: a survey. Inf Technol Manag 13, 403–409 (2012). https://doi.org/10.1007/s10799-012-0124-y
DOI: https://doi.org/10.1007/s10799-012-0124-y