Abstract
Class imbalance is a significant issue in practical classification problems. Important countermeasures, such as re-sampling, instance weighting, and cost-sensitive learning, have been developed, but each approach has limitations as well as advantages. Synthetic re-sampling methods are widely applicable but require a vector representation to generate additional instances. Instance-based methods can be applied to distance space data but are not tractable with respect to a global objective. Cost-sensitive learning can minimize the expected cost given the costs of errors, but generally does not extend to nonlinear measures such as the F-measure and the area under the curve. To address these shortcomings, this paper proposes a nearest neighbor classification model that employs a class-wise weighting scheme to counteract the class imbalance and a convex optimization technique to learn its weight parameters. The proposed model thus retains the simple instance-based rule for prediction, while providing mathematical support for learning to maximize a nonlinear performance measure over the training set. An empirical study evaluates the proposed algorithm on imbalanced distance space data and compares it with existing methods.
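As a rough sketch of the kind of rule the abstract describes (not the paper's exact formulation; the weighting scheme, toy distance matrix, and weight values below are all hypothetical), a class-wise weighted nearest-neighbor prediction over a precomputed distance matrix might look like:

```python
import numpy as np

def weighted_nn_predict(dist, train_labels, class_weights):
    """Predict by a class-wise weighted 1-NN rule: each distance is
    divided by the weight of the candidate neighbor's class, so a
    larger class weight shrinks effective distances and favors that
    class (e.g., a minority class under imbalance).

    dist: (n_test, n_train) pairwise distance matrix
    train_labels: (n_train,) integer class labels
    class_weights: dict mapping class label -> positive weight
    """
    scale = np.array([class_weights[c] for c in train_labels])
    preds = []
    for row in dist:
        preds.append(train_labels[int(np.argmin(row / scale))])
    return np.array(preds)

# toy distance-space data: 4 training points, class 0 = majority, 1 = minority
dist = np.array([[0.5, 2.0, 1.8, 2.2],
                 [2.0, 0.4, 1.9, 2.1]])
labels = np.array([0, 0, 1, 1])
print(weighted_nn_predict(dist, labels, {0: 1.0, 1: 1.0}))  # plain 1-NN
print(weighted_nn_predict(dist, labels, {0: 1.0, 1: 5.0}))  # minority up-weighted
```

With equal weights the rule reduces to ordinary 1-NN; up-weighting the minority class flips both toy predictions toward it. In the paper these weights are not hand-set but learned by convex optimization against a target performance measure.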
References
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Chawla N, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6
Chan P, Stolfo S (1998) Toward scalable learning with non-uniform cost and class distributions: a case study in credit card fraud detection. In: Proceedings of the fourth international conference on knowledge discovery and data mining, pp 164–168
Chawla NV, Cieslak DA, Hall LO, Joshi A (2008) Automatically countering imbalance and its empirical relationship to costs. Data Min Knowl Discov 17(2):225–252
Köknar-Tezel S, Latecki LJ (2011) Improving SVM classification on imbalanced time series data sets with ghost points. Knowl Inf Syst 28(1):1–23
Li Y, Zhang X (2011) Improving k nearest neighbor with Exemplar generalization for imbalanced classification. In: Proceedings of the 15th Pacific-Asia conference on advances in knowledge discovery and data mining, vol 2, PAKDD’11, pp 321–332
Liu W, Chawla S (2011) Class confidence weighted kNN algorithms for imbalanced data sets. In: Proceedings of the 15th Pacific-Asia conference on advances in knowledge discovery and data mining, vol 2, PAKDD’11, pp 345–356
Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. In: Proceedings of the 5th SIGKDD international conference on knowledge discovery and data mining, pp 155–164
Sun Y, Kamel MS, Wong AKC, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recogn 40(12):3358–3378
Joachims T, Finley T, Yu CNJ (2009) Cutting-plane training of structural SVMs. Mach Learn 77:27–59
Tsochantaridis I, Joachims T, Hofmann T, Altun Y (2005) Large margin methods for structured and interdependent output variables. J Mach Learn Res 6:1453–1484
Hido S, Kashima H (2008) Roughly balanced bagging for imbalanced data. In: Proceedings of the SIAM international conference on data mining. SDM 2008, pp 143–152
Wallace BC, Small K, Brodley CE, Trikalinos TA (2011) Class imbalance, redux. In: Proceedings of the 2011 IEEE 11th international conference on data mining. ICDM’11, pp 754–763
Chen S, He H, Garcia EA (2010) RAMOBoost: ranked minority oversampling in boosting. IEEE Trans Neural Netw 21(10):1624–1642
Masnadi-Shirazi H, Vasconcelos N (2010) Risk minimization, probability elicitation, and cost-sensitive SVMs. In: Proceedings of the 27th international conference on machine learning, pp 759–766
Holte RC, Acker LE, Porter BW (1989) Concept learning and the problem of small disjuncts. In: Proceedings of the 11th international joint conference on artificial intelligence, pp 813–818
Batista G, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29
Napierala K, Stefanowski J (2012) BRACID: a comprehensive approach to learning rules from imbalanced data. J Intell Inf Syst 39(2):335–373
Ando S (2012) Performance-optimizing classification of time-series based on nearest neighbor density approximation. In: 2012 IEEE 12th international conference on data mining workshops (ICDMW), pp 759–764
Calders T, Jaroszewicz S (2007) Efficient AUC optimization for classification. In: Proceedings of the European conference on principles and practice of knowledge discovery in databases, pp 42–53
Yue Y, Finley T, Radlinski F, Joachims T (2007) A support vector method for optimizing average precision. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. SIGIR ’07, pp 271–278
Joachims T (2006) Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’06, pp 217–226
Fukunaga K (1990) Introduction to statistical pattern recognition, computer science and scientific computing, 2nd edn. Elsevier science, Amsterdam
Angiulli F, Basta S, Pizzuti C (2006) Distance-based detection and prediction of outliers. IEEE Trans Knowl Data Eng 18(2):145–160
Tomek I (1976) Two modifications of CNN. IEEE Trans Syst Man Cybern SMC–6(11):769–772
Covões TF, Hruschka ER, Ghosh J (2013) A study of \(k\)-means-based algorithms for constrained clustering. Intell Data Anal 17(3):485–505
Zeng H, Cheung YM (2012) Semi-supervised maximum margin clustering with pairwise constraints. IEEE Trans Knowl Data Eng 24(5):926–939
Joachims T (2005) A support vector method for multivariate performance measures. In: Proceedings of the 22nd international conference on machine learning. ICML ’05, pp 377–384
Pham DT, Chan AB (1998) Control chart pattern recognition using a new type of self-organizing neural network. Proc Inst Mech Eng Part I J Syst Control Eng 212(2):115–127
Saito N (1994) Local feature extraction and its applications using a library of bases. Ph.D. thesis, Yale University, New Haven, CT, USA
Bache K, Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml
Kaluža B, Mirchevska V, Dovgan E, Luštrek M, Gams M (2010) An agent-based approach to care in independent living. In: Ambient intelligence, vol 6439 of lecture notes in computer science. Springer, Berlin, pp 177–186
Zhang H, Berg AC, Maire M, Malik J (2006) SVM-KNN: discriminative nearest neighbor classification for visual category recognition. In: Proceedings of the 2006 IEEE computer society conference on computer vision and pattern recognition, vol 2, pp 2126–2136
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Int Res 16(1):321–357
Chawla NV, Lazarevic A, Hall LO, Bowyer K (2003) SMOTEBoost: improving prediction of the minority class in boosting. In: Knowledge discovery in databases: PKDD 2003, vol 2838 of lecture notes in computer science. Springer, Berlin, pp 107–119
Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244
Verikas A, Gelzinis A, Bacauskiene M (2011) Mining data with random forests: a survey and results of new tests. Pattern Recogn 44(2):330–349
Chang C, Lin C (2001) LIBSVM: a library for support vector machines
Flach PA, Hernández-Orallo J, Ramirez CF (2011) A coherent interpretation of AUC as a measure of aggregated classification performance. In: Proceedings of the 28th international conference on machine learning, ICML 2011, pp 657–664
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Xi X, Keogh E, Shelton C, Wei L, Ratanamahatana CA (2006) Fast time series classification using numerosity reduction. In: ICML ’06: Proceedings of the 23rd international conference on machine learning, pp 1033–1040
Acknowledgments
The authors would like to thank the handling editor and the anonymous reviewers for their valuable and insightful comments. This study was supported by Grant-in-Aid for Young Scientists (B) 25730127 from the Japanese Ministry of Education, Culture, Sports, Science and Technology.
Appendices
Appendix 1: Structural SVM learning
A structural classifier addresses a problem with multivariate input or output variables that are structured or interdependent [10, 11]. The dependency is captured in the max-margin learning formulation using a feature function and an optimizer. The feature function \({\varPsi }\) generates an arbitrary representation from a pair of input/output values \((\mathbf {x},y)\). A discriminant function \(F:X\times {Y}\rightarrow \mathbb {R}\) is then defined as the inner product of a weight parameter \(\mathbf {w}\) and \({\varPsi }\):

\(F(\mathbf {x},y;\mathbf {w})=\left\langle \mathbf {w},{\varPsi }(\mathbf {x},y)\right\rangle \quad (30)\)
From F and the optimizer that selects a class over Y, the decision function \(f(\mathbf {x};\mathbf {w})\) is defined as

\(f(\mathbf {x};\mathbf {w})=\mathop {\mathrm {argmax}}\nolimits _{y\in {Y}}F(\mathbf {x},y;\mathbf {w}) \quad (31)\)
A typical example of a structural classifier is the multiclass SVM described in [11]. Let \(\{c_j\}_{j=1}^n\) denote the class values and \({\varLambda }(c_j)=\mathbf {e}_j\) the binary encoding of \(c_j\), i.e., a unit vector whose jth element is 1. The feature function \({\varPsi }\) is defined as

\({\varPsi }(\mathbf {x},c_j)={\varLambda }(c_j)\otimes \mathbf {x} \quad (32)\)
where \(\otimes \) denotes the tensor product.
Substituting the feature function (32) and a stack of vectors \(\mathbf {w}=[\mathbf {w}'_1,\ldots ,\mathbf {w}'_p]\) as parameters into (30), the discriminant function produces a set of margins for the respective classes. The class predicted by (31) is then the one with the largest margin.
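As a concrete illustration of this construction (the weight values below are hypothetical), the following sketch builds \({\varPsi }(\mathbf {x},c_j)={\varLambda }(c_j)\otimes \mathbf {x}\) by placing \(\mathbf {x}\) into the jth block of a stacked vector, and predicts the class with the largest margin:

```python
import numpy as np

def psi(x, j, n_classes):
    """Feature function Lambda(c_j) (x) x: place x in the j-th block
    of a (n_classes * dim)-vector, zeros elsewhere."""
    out = np.zeros(n_classes * x.size)
    out[j * x.size:(j + 1) * x.size] = x
    return out

def predict(x, w, n_classes):
    """Discriminant F(x, y; w) = <w, psi(x, y)>; predict the class
    with the largest margin, as in (31)."""
    scores = [w @ psi(x, j, n_classes) for j in range(n_classes)]
    return int(np.argmax(scores))

# w is a stack [w'_1, w'_2] of per-class weight vectors (hypothetical values)
w = np.array([1.0, 0.0,   # w'_1
              0.0, 1.0])  # w'_2
print(predict(np.array([2.0, 1.0]), w, 2))  # margins (2.0, 1.0) -> class 0
```

Because \({\varPsi }\) zeroes out every block except the jth, the inner product \(\langle \mathbf {w},{\varPsi }(\mathbf {x},c_j)\rangle \) reduces to the per-class margin \(\langle \mathbf {w}'_j,\mathbf {x}\rangle \).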
Training of the structural classifier is formulated as a quadratic programming problem [10]. Given the training samples \(\{(\mathbf {x}_i,y_i)\}_{i=1}^\eta \),

Problem 2

\(\min _{\mathbf {w},\,\xi _i\ge 0}\ \frac{1}{2}\Vert \mathbf {w}\Vert ^2+\frac{C}{\eta }\sum _{i=1}^\eta \xi _i\)

subject to \(\forall {i},\forall {y}\ne {y}_i\),

\(\left\langle \mathbf {w},{\varPsi }(\mathbf {x}_i,y_i)\right\rangle -\left\langle \mathbf {w},{\varPsi }(\mathbf {x}_i,y)\right\rangle \ge {\varDelta }(y_i,y)-\xi _i\)
An approximated solution of Problem 2 can be obtained by a cutting-plane algorithm [22].
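The core step of each cutting-plane iteration is a separation oracle that finds the most violated constraint for a sample. The sketch below illustrates that step for the multiclass case with margin rescaling and 0/1 loss; it is a simplified illustration, not the algorithm of [22], and the weight values are hypothetical:

```python
import numpy as np

def most_violated_label(x, y_true, w, n_classes):
    """Separation oracle for margin rescaling with 0/1 loss: among
    y != y_true, return the label maximizing Delta(y_true, y) + F(x, y; w).
    A cutting-plane iteration adds the constraint for this label to
    the working set before re-solving the restricted QP."""
    dim = x.size
    best_y, best_val = None, -np.inf
    for y in range(n_classes):
        if y == y_true:
            continue
        # per-class margin F(x, y; w) under the stacked parameterization
        val = 1.0 + w[y * dim:(y + 1) * dim] @ x
        if val > best_val:
            best_y, best_val = y, val
    return best_y, best_val

# hypothetical 3-class stacked weights, input dimension 2
w = np.array([1.0, 0.0,    # w'_1
              0.0, 1.0,    # w'_2
              -1.0, -1.0]) # w'_3
y_hat, viol = most_violated_label(np.array([2.0, 1.0]), 0, w, 3)
print(y_hat, viol)  # class 1 with value 1 + 1.0 = 2.0
```

The full algorithm alternates this oracle with re-optimization over the accumulated constraints until no constraint is violated by more than a tolerance \(\epsilon \).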
Appendix 2: Multiclass formulation of SNN classifier
This section describes the formulation of SNN classifier learning with multiple majority and minority classes. As in the binary classification problem described in Sect. 4.2, the main intuition is to maximize the margin between the correct output and all alternative labelings through constrained minimization of the \(\ell _2\)-norm.
Let \(\fancyscript{M}=\{i\}_{i=1}^m\) and \(\fancyscript{N}=\{m+j\}_{j=1}^n\) denote the class values of the majority and minority classes, respectively. Note that the class values in \(\fancyscript{N}\) differ from those in Sect. 4.2. For mathematical convenience, a matrix W is introduced to represent the weights used in selecting the nearest neighbors among different classes. The relation between W and the weight vector \(\mathbf {w}\) is defined as follows:
where
W can be computed automatically from each \(\mathbf {w}\); it is omitted from the main text to avoid overloading the notation.
Let us define the decision function for an individual input as follows:
where \(\odot \) denotes the Hadamard product and \(\mathbf {u}(y)\) the unit vector whose yth element is 1.
In (35), the yth row of W is used in choosing the nearest neighbor. From the definition of W, the weights are equivalent among the minority classes if \(y\in \fancyscript{M}\); otherwise, i.e., when \(y\in \fancyscript{N}\), the weights are equivalent among the majority classes.
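The paper's exact definition of W is given by the omitted equations (33)-(34); the sketch below only illustrates the sharing pattern stated in this paragraph, with hypothetical scalar weights standing in for the entries derived from \(\mathbf {w}\):

```python
import numpy as np

def build_W(m, n, w_maj, w_min, w_self=1.0):
    """Build an (m+n) x (m+n) weight matrix with the sharing pattern
    described in the text: in a majority-class row, all minority-class
    columns share one weight; in a minority-class row, all
    majority-class columns share one weight. The scalars w_maj, w_min,
    w_self are hypothetical placeholders for the learned entries.

    m: number of majority classes, n: number of minority classes;
    classes 0..m-1 are majority, m..m+n-1 are minority.
    """
    k = m + n
    W = np.full((k, k), w_self)
    W[:m, m:] = w_min   # majority row: shared weight over minority columns
    W[m:, :m] = w_maj   # minority row: shared weight over majority columns
    return W

print(build_W(2, 2, w_maj=0.5, w_min=2.0))
```

Under this structure, selecting row y of W in (35) applies one shared weight to all classes on the other side of the majority/minority split, which is what lets the multiclass problem reuse the binary class-wise weighting.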
The decision function for the structural input is written as a summation of \(f'\),
Based on (36), let us define the feature function \({\varPsi }\) as follows
and the discriminant function F as a matrix product of W and \({\varPsi }\).
Using \({\varPsi }\) and F, (36) is rewritten in the same form as (26)
From (37), the constrained \(\ell _2\)-norm minimization problem is formulated as follows.
Problem 3
subject to \(\forall \mathbf {z}\in (\fancyscript{M}\cup \fancyscript{N})^\eta {\setminus }\{\mathbf {y}\}\)
Problem 3 can be solved by the cutting-plane method described in Sect. 4.
Ando, S. Classifying imbalanced data in distance-based feature space. Knowl Inf Syst 46, 707–730 (2016). https://doi.org/10.1007/s10115-015-0846-3