Distributed data mining: a survey

Zeng, Li; Li, Ling; Duan, Lian; Lu, Kevin; Shi, Zhongzhi; Wang, Maoguang; Wu, Wenjuan; Luo, Ping

doi:10.1007/s10799-012-0124-y

Published: 17 May 2012

Volume 13, pages 403–409, (2012)

Information Technology and Management

Li Zeng¹,
Ling Li²,
Lian Duan¹,³,
Kevin Lu⁴,
Zhongzhi Shi¹,
Maoguang Wang¹,
Wenjuan Wu⁵ &
Ping Luo¹

Abstract

Most data mining approaches assume that the data can be provided from a single source. When data are produced at many physically distributed locations, as at Wal-Mart, these methods require a data center that gathers the data from all locations. Transmitting large amounts of data to a data center is sometimes expensive or even impractical. Distributed and parallel data mining algorithms were developed to solve this problem. In this paper, we survey state-of-the-art algorithms and applications in distributed data mining and discuss future research opportunities.

Acknowledgments

This work is partially supported by the Chinese Academy of Sciences under Grant No. 20040402, the Changjiang Scholar Program of the Ministry of Education of China, the National Natural Science Foundation of China under Grant No. 71132008, and the US National Science Foundation under Grant No. 1044845.

Author information

Authors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100080, China
Li Zeng, Lian Duan, Zhongzhi Shi, Maoguang Wang & Ping Luo
Old Dominion University, Norfolk, VA, 23529, USA
Ling Li
New Jersey Institute of Technology, Newark, NJ, 07102, USA
Lian Duan
Brunel University, Uxbridge, UB8 3PH, UK
Kevin Lu
School of Information, Renmin University of China, Beijing, 100872, China
Wenjuan Wu

Corresponding author

Correspondence to Lian Duan.

About this article

Cite this article

Zeng, L., Li, L., Duan, L. et al. Distributed data mining: a survey. Inf Technol Manag 13, 403–409 (2012). https://doi.org/10.1007/s10799-012-0124-y

Issue Date: December 2012
