# An instance level analysis of data complexity


## Abstract

Most data complexity studies have focused on characterizing the complexity of the entire data set and do not provide information about individual instances. Knowing which instances are misclassified and understanding why they are misclassified and how they contribute to data set complexity can improve the learning process and could guide the future development of learning algorithms and data analysis methods. The goal of this paper is to better understand the data used in machine learning problems by identifying and analyzing the instances that are frequently misclassified by learning algorithms that have shown utility to date and are commonly used in practice. We identify instances that are hard to classify correctly (*instance hardness*) by classifying over 190,000 instances from 64 data sets with 9 learning algorithms. We then use a set of hardness measures to understand why some instances are harder to classify correctly than others. We find that class overlap is a principal contributor to instance hardness. We seek to integrate this information into the training process to alleviate the effects of class overlap and present ways that instance hardness can be used to improve learning.
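The abstract's core quantity, instance hardness, is the frequency with which an instance is misclassified across a pool of learning algorithms. As a minimal illustrative sketch (not the paper's implementation; the function name and toy predictions are invented here), given each classifier's predictions aligned with the true labels, hardness is simply the per-instance misclassification rate:

```python
def instance_hardness(y_true, predictions):
    """Fraction of classifiers that misclassify each instance.

    y_true      : list of true labels, length n
    predictions : list of per-classifier prediction lists, each length n
    Returns a list of n hardness values in [0, 1].
    """
    n_clf = len(predictions)
    return [
        sum(pred[i] != label for pred in predictions) / n_clf
        for i, label in enumerate(y_true)
    ]

# Toy example: 3 hypothetical classifiers, 4 instances.
y = [0, 1, 1, 0]
preds = [
    [0, 1, 0, 0],  # classifier A
    [0, 1, 1, 1],  # classifier B
    [0, 0, 0, 0],  # classifier C
]
print(instance_hardness(y, preds))
```

In the paper's setting the predictions would come from cross-validated runs of the 9 learning algorithms over the 64 data sets; instances with hardness near 1 are the frequently misclassified ones the analysis targets.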

### Keywords

Instance hardness · Dataset hardness · Data complexity

- Abe, N., & Mamitsuka, H. (1998). Query learning strategies using boosting and bagging. In
*Proceedings of the fifteenth international conference on machine learning*(pp. 1–9). Google Scholar - Abe, N., Zadrozny, B., & Langford, J. (2006). Outlier detection by active learning. In
*Proceedings of the 12th international conference on knowledge discovery and data mining*(pp. 504–509). New York: ACM. Google Scholar - Barnett, V., & Lewis, T. (1978).
*Outliers in statistical data*(2nd ed.). New York: Wiley. MATHGoogle Scholar - Batista, G. E. A. P. A., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data.
*SIGKDD Explorations Newsletter*,*6*(1), 20–29. CrossRefGoogle Scholar - Bennett, P. N. (2000).
*Assessing the calibration of naive Bayes’ posterior estimates*(Tech. Rep. CMU-CS-00-155). Carnegie Mellon University. Google Scholar - Brazdil, P., Giraud-Carrier, C., Soares, C., & Vilalta, R. (2009).
*Metalearning: applications to data mining*. Berlin: Springer. MATHGoogle Scholar - Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). Lof: identifying density-based local outliers.
*SIGMOD Record*,*29*(2), 93–104. CrossRefGoogle Scholar - Bridle, J. S. (1989). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In
*Neuro-computing: algorithms, architectures and applications*(pp. 227–236). Berlin: Springer. Google Scholar - Brighton, H., & Mellish, C. (2002). Advances in instance selection for instance-based learning algorithms.
*Data Mining and Knowledge Discovery*,*6*(2), 153–172. MathSciNetCrossRefMATHGoogle Scholar - Brodley, C. E., & Friedl, M. A. (1999). Identifying mislabeled training data.
*Journal of Artificial Intelligence Research*,*11*, 131–167. MATHGoogle Scholar - Brodley, C. E., & Utgoff, P. E. (1995). Multivariate decision trees.
*Machine Learning*,*19*(1), 45–77. MATHGoogle Scholar - Dagan, I., & Engelson, S. P. (1995). Committee-based sampling for training probabilistic classifiers. In
*Proceedings of the 12th international conference on machine learning*(pp. 150–157). Google Scholar - Domingos, P., & Pazzani, M. J. (1996). Beyond independence: conditions for the optimality of the simple Bayesian classifier. In L. Saitta (Ed.),
*ICML*(pp. 105–112). San Mateo: Morgan Kaufmann. Google Scholar - Frank, A., & Asuncion, A. (2010). UCI machine learning repository. http://archive.ics.uci.edu/ml.
- Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In
*Thirteenth international conference on machine learning*(pp. 148–156). Google Scholar - Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. (1992). Information, prediction, and query by committee. In
*Advances in neural information processing systems (NIPS)*(pp. 483–490). Google Scholar - Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The weka data mining software: an update.
*SIGKDD Explorations Newsletter*,*11*(1), 10–18. CrossRefGoogle Scholar - Ho, T. K., & Basu, M. (2002). Complexity measures of supervised classification problems.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,*24*, 289–300. CrossRefGoogle Scholar - John, G. H. (1995). Robust decision trees: removing outliers from databases. In
*Knowledge discovery and data mining*(pp. 174–179). Google Scholar - Knorr, E. M., & Ng, R. T. (1999). Finding intensional knowledge of distance-based outliers. In
*Proceedings of the 25th international conference on very large data bases*(pp. 211–222). Google Scholar - Kriegel, H. P., Kröger, P., Schubert, E., & Zimek, A. (2009). Loop: local outlier probabilities. In
*Proceedings of the 18th ACM conference on information and knowledge management*(pp. 1649–1652). Google Scholar - Kriegel, H. P., Kröger, P., Schubert, E., & Zimek, A. (2011). Interpreting and unifying outlier scores. In
*SDM*(pp. 13–24). Google Scholar - Lee, J., & Giraud-Carrier, C. (2011). A metric for unsupervised metalearning.
*Intelligent Data Analysis*,*15*(6), 827–841. Google Scholar - Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In
*Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval*(pp. 3–12). Google Scholar - Mansilla, E. B., & Ho, T. K. (2004). On classifier domains of competence. In
*ICPR*(Vol. 1, pp. 136–139). Google Scholar - Mitchell, T. M. (1982). Generalization as search.
*Artifical Intelligence*,*18*(2), 203–226. CrossRefMathSciNetGoogle Scholar - Orriols-Puig, A., Macià, N., Bernadó-Mansilla, E., & Ho, T. K. (2009).
*Documentation for the data complexity library in C++*(Tech. Rep. 2009001). La Salle, Universitat Ramon Llull. Google Scholar - Peterson, A. H., & Martinez, T. R. (2005). Estimating the potential for combining learning models. In
*Proceedings of the ICML workshop on meta-learning*(pp. 68–75). Google Scholar - Platt, J. (2000). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In
*Advances in large margin classifiers*. Google Scholar - Quinlan, J. R. (1993).
*C4.5: programs for machine learning*. San Mateo: Morgan Kaufmann. Google Scholar - Salojärvi, J., Puolamäki, K., Simola, J., Kovanen, L., Kojo, I., & Kaski, S. (2005).
*Inferring relevance from eye movements: Feature extraction*(Tech. Rep. A82). Helsinki University of Technology. Google Scholar - Sayyad Shirabad, J., & Menzies, T. (2005). The PROMISE repository of software engineering databases. School of Information Technology and Engineering, University of Ottawa, Canada, http://promise.site.uottawa.ca/SERepository/.
- Scheffer, T., Decomain, C., & Wrobel, S. (2001). Active hidden Markov models for information extraction. In
*Proceedings of the 4th international conference on advances in intelligent data analysis, IDA ’01*(pp. 309–318). London: Springer. CrossRefGoogle Scholar - Segata, N., Blanzieri, E., & Cunningham, P. (2009). A scalable noise reduction technique for large case-based systems. In
*Proceedings of the 8th international conference on case-based reasoning: case-based reasoning research and development*(pp. 328–342). Google Scholar - Settles, B. (2010).
*Active learning literature survey*(Tech. Rep. Computer Sciences Technical Report 1648). University of Wisconsin-Madison. Google Scholar - Seung, H. S., Opper, M., & Sompolinsky, H. (1992). Query by committee. In
*Proceedings of the fifth annual workshop on computational learning theory*(pp. 287–294). CrossRefGoogle Scholar - Smith, M. R., & Martinez, T. (2011). Improving classification accuracy by identifying and removing instances that should be misclassified. In
*Proceedings of the IEEE internation joint conference on neural networks*(pp. 2690–2697). Google Scholar - Stiglic, G., & Kokol, P. (2009). GEMLer: gene expression machine learning repository. http://gemler.fzv.uni-mb.si/.
- Thomson, K., & McQueen, R. J. (1996).
*Machine learning applied to fourteen agricultural datasets*(Tech. Rep. 96/18). The University of Waikato. Google Scholar - Tomek, I. (1976). An experiment with the edited nearest-neighbor rule.
*IEEE Transactions on Systems, Man and Cybernetics*,*6*, 448–452. MathSciNetCrossRefMATHGoogle Scholar - Tong, S., & Koller, D. (2001). Support vector machine active learning with applications to text classification.
*Journal of Machine Learning Research*,*2*, 45–66. MATHGoogle Scholar - van Hulse, J., Khoshgoftaar, T. M., & Napolitano, A. (2007). Experimental perspectives on learning from imbalanced data. In
*Proceedings of the 24th international conference on machine learning*(pp. 935–942). New York: ACM. Google Scholar - Webb, G. I. (2000). Multiboosting: a technique for combining boosting and wagging.
*Machine Learning*,*40*(2), 159–196. CrossRefGoogle Scholar - Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms.
*Neural Computation*,*8*(7), 1341–1390. CrossRefGoogle Scholar - Zadrozny, B., & Elkan, C. (2001). Learning and making decisions when costs and probabilities are both unknown. In
*KDD*(pp. 204–213). Google Scholar - Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In
*KDD*(pp. 694–699). New York: ACM. Google Scholar