Influence of Outliers Introduction on Predictive Models Quality

  • Mateusz Kalisch
  • Marcin Michalak
  • Marek Sikora
  • Łukasz Wróbel
  • Piotr Przystałka
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 613)

Abstract

The paper presents the results of research on how the level of outliers in the data (train and test data considered separately) influences the quality of model prediction in a classification task. A set of 100 semi-artificial time series was considered, whose independent variables were close to real ones observed in an underground coal mining environment and whose dependent variable was generated with a decision tree. For every considered method (decision trees, naive Bayes, logistic regression and kNN) a reference model was built (no outliers in the data), and its quality was compared with the quality of two models: Out–Out (outliers in train and test data) and Non-out–Out (outliers only in test data). Fifty levels of outliers in the data were considered, from 1 % to 50 %. The statistical comparison of the models was based on the sign test.
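
The sign-test comparison mentioned above can be illustrated with a minimal sketch. It assumes one quality score per time series for each model (for example accuracy, which is an assumption here, as the abstract does not name the measure) and a two-sided exact test with ties discarded. The function name `sign_test` and the sample scores are hypothetical and are not taken from the paper.

```python
from math import comb


def sign_test(reference_scores, candidate_scores):
    """Two-sided exact sign test on paired scores.

    Returns (wins, losses, p_value), where a "win" means the candidate
    model scored higher than the reference model on the same series.
    Ties are discarded, as is usual for the sign test.
    """
    pairs = list(zip(reference_scores, candidate_scores))
    wins = sum(1 for ref, cand in pairs if cand > ref)
    losses = sum(1 for ref, cand in pairs if cand < ref)
    n = wins + losses
    if n == 0:
        # All pairs tied: no evidence of a difference.
        return wins, losses, 1.0
    k = min(wins, losses)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins, losses, min(1.0, 2 * tail)


# Hypothetical scores: reference model vs. a model evaluated on data with outliers.
reference = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89, 0.94, 0.90, 0.91]
with_outliers = [0.85, 0.86, 0.90, 0.84, 0.83, 0.88, 0.85, 0.89, 0.86, 0.88]
wins, losses, p_value = sign_test(reference, with_outliers)
print(wins, losses, p_value)
```

In a setup like the one described, such a test would presumably be repeated for each method and each outlier level, with the reference model as one side of every pair.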

Keywords

  • Data analysis
  • Classification
  • Outlier detection
  • Time series

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Mateusz Kalisch (1)
  • Marcin Michalak (2, 3)
  • Marek Sikora (2, 3)
  • Łukasz Wróbel (3)
  • Piotr Przystałka (1)

  1. Institute of Fundamentals of Machinery Design, Silesian University of Technology, Gliwice, Poland
  2. Institute of Informatics, Silesian University of Technology, Gliwice, Poland
  3. Institute of Innovative Technologies EMAG, Katowice, Poland
