Extraction of Outliers from Imbalanced Sets

  • Pavel Škrabánek
  • Natália Martínková
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10334)

Abstract

In this paper, we present an outlier detection method designed for small datasets, such as those collected in animal group behaviour research. The method targets global outliers in unlabelled datasets where the inliers form one predominant cluster and the outliers lie at a distance from the centre of that cluster. In addition, the inliers must greatly outnumber the outliers. The extraction of exceptional observations (EEO) method is based on the Mahalanobis distance and has one tuning parameter. We propose a visualization method that allows an expert to estimate the value of the tuning parameter. The method was tested and evaluated on 44 datasets. On datasets satisfying the method's requirements, it achieved excellent results, fully comparable with those of other methods. For large datasets, the higher computational cost of the method might be prohibitive. This drawback can be partially mitigated by using an alternative distance measure. We propose the Euclidean distance combined with standard deviation normalization as a reliable alternative.
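
The abstract does not spell out the EEO algorithm, so the following is only a minimal sketch of the general idea it describes: score each observation by its distance from the centre of a single cluster and flag those beyond one tuning parameter. The function names, the threshold value of 3.5, and the synthetic data are illustrative assumptions, not the authors' implementation. The second function sketches the cheaper alternative mentioned above, Euclidean distance after per-feature standard deviation normalization.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    # Sample covariance of the features (columns are variables).
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # The pseudo-inverse tolerates a singular covariance matrix.
    cov_inv = np.linalg.pinv(cov)
    # Row-wise quadratic form: d_i^2 = c_i^T * cov_inv * c_i
    return np.sqrt(np.einsum("ij,jk,ik->i", centred, cov_inv, centred))

def normalized_euclidean_distances(X):
    """Euclidean distance from the mean after dividing each feature by its std."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant features unscaled
    return np.sqrt(((centred / std) ** 2).sum(axis=1))

def flag_outliers(distances, threshold):
    """Boolean mask: True where the distance exceeds the tuning parameter."""
    return np.asarray(distances) > threshold

# Synthetic imbalanced set (assumed data): 50 inliers in one cluster, 2 distant points.
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(50, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])
X = np.vstack([inliers, outliers])

threshold = 3.5  # hypothetical value; the paper leaves this choice to an expert
mask_mahalanobis = flag_outliers(mahalanobis_distances(X), threshold)
mask_euclidean = flag_outliers(normalized_euclidean_distances(X), threshold)
```

The normalized Euclidean variant ignores correlations between features, which is what makes it cheaper: it needs only per-feature standard deviations rather than a covariance matrix and its inverse.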

Keywords

Outlier analysis, Distance based method, Global outlier, Single cluster, Mahalanobis distance, Biology

Notes

The work was supported by the University of Pardubice (PŠ) and the Czech Science Foundation grant number 17-20286S (NM).

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Faculty of Electrical Engineering and Informatics, University of Pardubice, Pardubice, Czech Republic
  2. Institute of Vertebrate Biology, Czech Academy of Sciences, Brno, Czech Republic
  3. Institute of Biostatistics and Analyses, Masaryk University, Brno, Czech Republic
