Machine Learning, Volume 95, Issue 2, pp 225–256

An instance level analysis of data complexity

  • Michael R. Smith
  • Tony Martinez
  • Christophe Giraud-Carrier


Most data complexity studies have focused on characterizing the complexity of the entire data set and do not provide information about individual instances. Knowing which instances are misclassified and understanding why they are misclassified and how they contribute to data set complexity can improve the learning process and could guide the future development of learning algorithms and data analysis methods. The goal of this paper is to better understand the data used in machine learning problems by identifying and analyzing the instances that are frequently misclassified by learning algorithms that have shown utility to date and are commonly used in practice. We identify instances that are hard to classify correctly (instance hardness) by classifying over 190,000 instances from 64 data sets with 9 learning algorithms. We then use a set of hardness measures to understand why some instances are harder to classify correctly than others. We find that class overlap is a principal contributor to instance hardness. We seek to integrate this information into the training process to alleviate the effects of class overlap and present ways that instance hardness can be used to improve learning.
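The core quantity described above, instance hardness, can be sketched as the fraction of learning algorithms that misclassify a given instance. The snippet below is a minimal illustration based on that reading of the abstract, not the paper's exact formulation; the classifier names and the `predictions` structure are hypothetical placeholders.

```python
# Hedged sketch: instance hardness as the fraction of learning algorithms
# that misclassify an instance. "predictions" is a hypothetical mapping of
# algorithm name -> list of predicted labels, one entry per instance.

def instance_hardness(predictions, true_labels):
    """Return a per-instance hardness score in [0, 1]."""
    algorithms = list(predictions)
    hardness = []
    for i, label in enumerate(true_labels):
        wrong = sum(1 for a in algorithms if predictions[a][i] != label)
        hardness.append(wrong / len(algorithms))
    return hardness

# Toy example with three hypothetical classifiers and four instances.
preds = {
    "tree":  [0, 1, 1, 0],
    "knn":   [0, 1, 0, 0],
    "bayes": [0, 0, 0, 0],
}
truth = [0, 1, 1, 0]
# Instance 2 is misclassified by 2 of the 3 algorithms, so its
# hardness is 2/3; instances 0 and 3 are classified correctly by all.
print(instance_hardness(preds, truth))
```

In the paper's experiments this idea is applied at scale (over 190,000 instances, 64 data sets, 9 learning algorithms), with held-out predictions rather than the toy labels used here.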


Keywords: Instance hardness · Dataset hardness · Data complexity


Copyright information

© The Author(s) 2013

Authors and Affiliations

  • Michael R. Smith¹
  • Tony Martinez¹
  • Christophe Giraud-Carrier¹

  1. Department of Computer Science, Brigham Young University, Provo, USA
