Skip to main content
Log in

Pruning boxes in a box-based classification method

  • Regular Article
  • Published:
Advances in Data Analysis and Classification Aims and scope Submit manuscript

Abstract

In this work we address an extension of box clustering in supervised classification problems that makes use of optimization problems to refine the results obtained by agglomerative techniques. The central concept of box clustering is that of homogeneous boxes that give rise to overtrained classifiers under some conditions. Thus, we focus our attentions on the issue of pruning out redundant boxes, using the information gleaned from the other boxes generated under the hypothesis that such a choice would identify simpler models with good predictive power. We propose a pruning method based on an integer optimization problem and a family of sub problems derived from the main one. The overall performances are then compared to the accuracy levels of competing methods on a wide range of real data sets. The method has proven to be robust, making it possible to derive a more compact system of boxes in the instance space with good performance on training and test data.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1

Similar content being viewed by others

References

  • Almuallim H, Dietterich TG (1991) Learning with many irrelevant features. In: Proceedings of the ninth national conference on artificial intelligence, vol 2. AAAI Press, Menlo Park, CA, pp 547–552

  • Bache K, Lichman M (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html

  • Bertolazzi P, Felici G, Festa P, Lancia G (2008) Logic classification and feature selection for biomedical data. Comput Math Appl 55(5):889–899

    Article  MathSciNet  MATH  Google Scholar 

  • Boros E, Hammer PL, Ibaraki T, Kogan A (1997) Logical analysis of numerical data. Math Progr 79:163–190

    MathSciNet  MATH  Google Scholar 

  • Boros E, Hammer PL, Ibaraki T, Kogan A, Mayoraz E, Muchnik I (2000) An implementation of logical analysis of data. Knowl Data Eng IEEE Trans 12(2):292–306

    Article  Google Scholar 

  • Davenport MA, Baraniuk MG, Scott CD (2006a) Controlling false alarms with support vector machines. In: IEEE international conference on acoustics, speech, and signal processing (ICASSP), Toulouse, France

  • Davenport MA, Baraniuk MG, Scott CD (2006b) Learning minimum volume sets with support vector machines. In: IEEE workshop on machine learning for signal processing (MLSP), Maynooth, Ireland

  • Davenport MA, Baraniuk RG, Scott C (2010) Tuning support vector machines for minimax and neyman-pearson classification. IEEE Trans Pattern Anal Mach Intell 32(10):1888–1898. http://www.ece.rice.edu/~md/np_svm.php

  • Eckstein J, Hammer PL, Liu Y, Nediak M, Simeone B (2002) The maximum box problem and its application to data analysis. Comput Optim Appl 23(3):285–298

    Article  MathSciNet  MATH  Google Scholar 

  • Felici G, Simeone B, Spinelli V (2010) Classification techniques and error control in logic mining. In: Stahlbock R, Crone SF, Lessmann S (eds) Data mining, Annals of information systems, vol 8. Springer, London, pp 99–119. ISBN:978-1-4419-1279-4

  • Grudzinski K, Grochowski M, Duch W (2010) Pruning classification rules with reference vector selection methods. In: Rutkowski L, Scherer R, Tadeusiewicz R, Zadeh LA, Zurada JM (eds) ICAISC (1), Lecture notes in computer science, vol 6113. Springer, Berlin, Heidelberg, pp 347–354

  • Hammer PL, Liu Y, Simeone B, Szedmàk S (2004) Saturated systems of homogeneous boxes and the logical analysis of numerical data. Discret Appl Math 144:103–109

    Article  MathSciNet  MATH  Google Scholar 

  • Harris E (2002) Information gain versus gain ratio: a study of split method biases. In: ISAIM. http://dblp.uni-trier.de/db/conf/isaim/isaim2002.html

  • Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. Knowl Data Eng IEEE Trans 17(3):299–310

  • Kaneko A, Kano M (2003) Discrete geometry on red and blue points in the plane–a survey. In: Discrete and computational geometry. Springer, Berlin, Heidelberg, pp 551–570

  • Kohavi R, Provost F (1998) Glossary of terms. Mach Learn 30:271–274

    Article  Google Scholar 

  • Maloof MA (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In: ICML-2003 workshop on learning from imbalanced data sets

  • Maravalle M, Ricca F, Simeone B, Spinelli V (2014) Carpal tunnel syndrome automatic classification: electromyography vs. ultrasound imaging. In: Proceedings of TOP, pp 1–24

  • McCarthy K, Zabar B, Weiss G (2005) Does cost-sensitive learning beat sampling for classifying rare classes? In: Proceedings of the 1st international workshop on utility-based data mining, pp 69–77

  • Nitesh VC (2005) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) The data mining and knowledge discovery handbook. Springer, New York, pp 853–867

    Google Scholar 

  • Schaffer C (1994) A conservative law for generalization performance. In: Cohen WW, Hirsh H (eds) Eleventh international conference on machine learning, ICML. Morgan Kaufmann, San Francisco, CA, pp 259–265. ISBN:1-55860-335-2

  • Shah D, Lakshmanan LVS, Ramamritham K, Sudarshan S (1999) Interestingness and pruning of mined patterns. In: 1999 ACM SIGMOD workshop on research issues in data mining and knowledge discovery. http://dblp.uni-trier.de/db/conf/dmkd/dmkd1999.html

  • Simeone B, Spinelli V (2007) The optimization problem framework for box clustering approach in logic mining. In: Book of abstract of Euro XXII–22nd European conference on operational research, Euro XXII

  • Simeone B, Felici G, Spinelli V (2007) A graph coloring approach for box clustering techniques in logic mining. In: Book of Abstract of Euro XXII–22nd European conference on operational research, Euro XXII

  • Sogaard A (2013) Semi-supervised learning and domain adaptation in natural language processing. Morgan & Claypool, San Rafael

    Google Scholar 

  • Weka (2013) Machine learning group– data mining software in java. University of Waikato, Department of Computer Science. http://www.cs.waikato.ac.nz/ml/weka

  • Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, San Francisco, CA. ISBN:0123748569, 9780123748560

  • Wu S, Flach P (2005) A scored AUC metric for classifier evaluation and selection In: ICML’05 workshop on ROC Analysis in Machine Learning, Bonn, Germany, August 2005

  • Zadrozny B, Langford J, Abe N (2003) Cost-sensitive learning by cost-proportionate example weighting. In: Proceedings of the third IEEE international conference on data mining

  • Zilberstein S (1996) Using anytime algorithms in intelligent systems. AI Mag 17(3):73–83

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Vincenzo Spinelli.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Spinelli, V. Pruning boxes in a box-based classification method. Adv Data Anal Classif 10, 285–304 (2016). https://doi.org/10.1007/s11634-014-0193-3

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11634-014-0193-3

Keywords

Mathematics Subject Classification

Navigation