Machine Learning, Volume 95, Issue 2, pp 225–256

An instance level analysis of data complexity

  • Michael R. Smith
    • Department of Computer Science, Brigham Young University
  • Tony Martinez
    • Department of Computer Science, Brigham Young University
  • Christophe Giraud-Carrier
    • Department of Computer Science, Brigham Young University

DOI: 10.1007/s10994-013-5422-z

Cite this article as:
Smith, M.R., Martinez, T. & Giraud-Carrier, C. Mach Learn (2014) 95: 225. doi:10.1007/s10994-013-5422-z


Most data complexity studies have focused on characterizing the complexity of the entire data set and do not provide information about individual instances. Knowing which instances are misclassified and understanding why they are misclassified and how they contribute to data set complexity can improve the learning process and could guide the future development of learning algorithms and data analysis methods. The goal of this paper is to better understand the data used in machine learning problems by identifying and analyzing the instances that are frequently misclassified by learning algorithms that have shown utility to date and are commonly used in practice. We identify instances that are hard to classify correctly (instance hardness) by classifying over 190,000 instances from 64 data sets with 9 learning algorithms. We then use a set of hardness measures to understand why some instances are harder to classify correctly than others. We find that class overlap is a principal contributor to instance hardness. We seek to integrate this information into the training process to alleviate the effects of class overlap and present ways that instance hardness can be used to improve learning.
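The abstract describes instance hardness as a per-instance property estimated by classifying each instance with a set of learning algorithms. A minimal sketch of that idea, assuming hardness is taken as the fraction of algorithms that misclassify an instance (the function and variable names here are illustrative, not the paper's notation):

```python
def instance_hardness(true_labels, predictions_per_algorithm):
    """Estimate hardness of each instance as the fraction of learning
    algorithms that misclassify it.

    predictions_per_algorithm: one list of predicted labels per algorithm,
    each aligned with true_labels (e.g. from cross-validation).
    """
    n_algs = len(predictions_per_algorithm)
    hardness = []
    for i, y in enumerate(true_labels):
        # Count how many of the algorithms got this instance wrong.
        misclassified = sum(1 for preds in predictions_per_algorithm
                            if preds[i] != y)
        hardness.append(misclassified / n_algs)
    return hardness

# Toy example: 3 hypothetical algorithms, 4 instances.
y_true = [0, 1, 1, 0]
preds = [
    [0, 1, 0, 0],  # algorithm A's predictions
    [0, 1, 1, 1],  # algorithm B's predictions
    [0, 1, 0, 0],  # algorithm C's predictions
]
print(instance_hardness(y_true, preds))
# Instances 0 and 1 are easy (all algorithms correct); instance 2 is
# hard (misclassified by 2 of 3 algorithms).
```

In the paper's setting the predictions would come from unseen-data evaluation (e.g. cross-validation) of the 9 learning algorithms over the 64 data sets, and instances with consistently high hardness are the ones the hardness measures then try to explain.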


Keywords: Instance hardness · Dataset hardness · Data complexity

Copyright information

© The Author(s) 2013