Information Technology and Management, Volume 13, Issue 4, pp 403–409

Distributed data mining: a survey

  • Li Zeng
  • Ling Li
  • Lian Duan
  • Kevin Lu
  • Zhongzhi Shi
  • Maoguang Wang
  • Wenjuan Wu
  • Ping Luo


Most data mining approaches assume that the data can be provided from a single source. When data are produced at many physically distributed locations, as at Wal-Mart, these methods require a data center that gathers the data from those locations. Transmitting large amounts of data to a data center can be expensive or even impractical. Distributed and parallel data mining algorithms have been developed to solve this problem. In this paper, we survey state-of-the-art algorithms and applications in distributed data mining and discuss future research opportunities.


Data mining, Business intelligence, Business analytics, Decision support systems, Distributed systems, Literature review



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Li Zeng (1)
  • Ling Li (2)
  • Lian Duan (1, 3)
  • Kevin Lu (4)
  • Zhongzhi Shi (1)
  • Maoguang Wang (1)
  • Wenjuan Wu (5)
  • Ping Luo (1)
  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. Old Dominion University, Norfolk, USA
  3. New Jersey Institute of Technology, Newark, USA
  4. Brunel University, Uxbridge, UK
  5. School of Information, Renmin University of China, Beijing, China
