A Distributed Approach to Detect Outliers in Very Large Data Sets

  • Fabrizio Angiulli
  • Stefano Basta
  • Stefano Lodi
  • Claudio Sartori
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6271)


We propose a distributed approach to distance-based outlier detection in very large data sets. The algorithm is based on the concept of outlier detection solving set [1], a small subset of the data set that can be provably used to predict novel outliers. It exploits parallel computation to meet two basic needs: (i) reducing the run time with respect to the centralized version and (ii) handling distributed data sets. The first goal is achieved by decomposing the overall computation into cooperating parallel tasks. Besides preserving the correctness of the result, the proposed scheme exhibited excellent performance: experimental results showed that the run time scales well with the number of nodes. The second goal is achieved by executing each parallel task on only a portion of the entire data set, which makes the algorithm suitable for distributed data sets. Importantly, while solving the distance-based outlier detection task in the distributed scenario, our method computes an outlier detection solving set of the overall data set of the same quality as that computed by the corresponding centralized method.
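
To make the distributed setting concrete, the following is a minimal sketch of how a distance-based outlier score can be computed when the data set is split across nodes. It is not the solving-set algorithm of the paper. It is a naive baseline written under stated assumptions: an object's weight is taken to be the sum of the distances to its k nearest neighbours (a definition commonly used in distance-based outlier detection), the outliers are the n objects of largest weight, and every name below (`local_knn`, `true_weight`, `top_n_outliers`, `partitions`) is hypothetical. Each partition stands in for the data held by one local node.

```python
import heapq
import math


def local_knn(candidate, partition, k, skip_index=None):
    """Return the k smallest distances from `candidate` to one partition.

    `skip_index` excludes the candidate itself when it lives in this partition.
    """
    dists = (math.dist(candidate, p)
             for i, p in enumerate(partition) if i != skip_index)
    return heapq.nsmallest(k, dists)


def true_weight(node, index, partitions, k):
    """Sum of the distances from one object to its k nearest neighbours overall."""
    candidate = partitions[node][index]
    merged = []
    for j, part in enumerate(partitions):
        # Each call stands in for work a separate local node would do.
        skip = index if j == node else None
        merged.extend(local_knn(candidate, part, k, skip))
    # The global k nearest distances are always among the local k nearest ones.
    return sum(heapq.nsmallest(k, merged))


def top_n_outliers(partitions, k, n):
    """Return the n (weight, object) pairs of largest weight."""
    scored = [(true_weight(j, i, partitions, k), p)
              for j, part in enumerate(partitions)
              for i, p in enumerate(part)]
    return heapq.nlargest(n, scored)


# Assumed toy data: two "nodes", the second one holding an isolated point.
partitions = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(1.0, 1.0), (10.0, 10.0)],
]
print(top_n_outliers(partitions, k=2, n=1))
```

In this sketch every object is compared against every partition, so the cost is quadratic in the data set size. Only the k local nearest distances per node need to be sent back for merging, which is the property that lets the weight be computed without moving the data.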


Keywords: Outlier Detection; Local Node; Parallel Task; Candidate Object; True Weight
These keywords were added by machine and not by the authors.




References

  1. Angiulli, F., Basta, S., Pizzuti, C.: Distance-based detection and prediction of outliers. TKDE 18(2), 145–160 (2006)
  2. Angiulli, F., Fassetti, F.: Dolphin: An efficient algorithm for mining distance-based outliers in very large datasets. TKDD 3(1) (2009)
  3. Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets. TKDE 17(2), 203–215 (2005)
  4. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
  5. Bay, S.D., Schwabacher, M.: Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proc. KDD (2003)
  6. Ghoting, A., Parthasarathy, S., Otey, M.E.: Fast mining of distance-based outliers in high-dimensional datasets. Data Min. Knowl. Discov. 16(3), 349–364 (2008)
  7. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2001)
  8. Hung, E., Cheung, D.W.-L.: Parallel mining of outliers in large database. Distributed and Parallel Databases 12(1), 5–26 (2002)
  9. Knorr, E., Ng, R.: Algorithms for mining distance-based outliers in large datasets. In: Proc. Int. Conf. on Very Large Databases (VLDB 1998), pp. 392–403 (1998)
  10. Koufakou, A., Georgiopoulos, M.: A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Mining and Knowledge Discovery (November 11, 2009) (published online)
  11. Lozano, E., Acuña, E.: Parallel algorithms for distance-based and density-based outliers. In: ICDM, pp. 729–732 (2005)
  12. Otey, M.E., Ghoting, A., Parthasarathy, S.: Fast distributed outlier detection in mixed-attribute data sets. Data Min. Knowl. Discov. 12(2-3), 203–228 (2006)
  13. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proc. Int. Conf. on Management of Data (SIGMOD 2000), pp. 427–438 (2000)
  14. Tao, Y., Xiao, X., Zhou, S.: Mining distance-based outliers from large databases in any metric space. In: KDD, pp. 394–403 (2006)

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Fabrizio Angiulli (1)
  • Stefano Basta (2)
  • Stefano Lodi (3)
  • Claudio Sartori (3)

  1. DEIS-UNICAL, Rende, Italy
  2. ICAR-CNR, Rende, Italy
  3. DEIS-UNIBO, Bologna, Italy
