1 Introduction

Many classifiers have been designed for classification of patterns in n-dimensional space. They typically assign patterns to two or more classes and are based on various principles, for example linear combination (Orozco-Alzate et al. 2019), mutual distances (Shahid and Singh 2019; Liu and Wang 2022; Veneri et al. 2022), variants of gradient boosting (Bentejac et al. 2021; Liang and Sur 2022), AdaBoost (Hu et al. 2020), and bagging (Medina-Pérez et al. 2017; Jafarzadeh et al. 2021).

Our new approach is based on the assumption that, for data with a complex structure, it is beneficial to divide the data into an unspecified number of sub-groups. These artificial groups, so-called hidden classes, can be formed easily using unsupervised or supervised learning techniques. We can then use small low-level classifiers with limited validity that focus on the data belonging to these classes. The way the hidden classes are formed is not crucial; what matters is that they can support the final classification, which is subsequently improved by their optimal unification. Such an optimal union leads to better classification results than working with the original dataset directly. This idea is the cornerstone of the proposed classifier. Furthermore, we emphasize that we focus on the sensitivity of each individual class and do not omit any of them.

We propose using simple clustering methods that produce more clusters than there are final classes, though neither too many nor too few. After forming such clusters, we unite them with the optimal union method (Hrebik et al. 2019). This union enables the formation of a classifier with the highest possible critical sensitivity. Our aim is to introduce a novel classifier suitable for processing higher-dimensional data, so dimension reduction must also be part of the classifier. Dimensionality reduction is optional, but in our opinion it generally increases the efficiency of the system; our recommendation is data whitening, which yields a dimensionless description of the data.

A separate part of our paper discusses the meaning of critical sensitivity and why we advocate this criterion. The basic idea is that, in classification into classes, we evaluate the percentage of correctly classified patterns separately for each class, which we refer to as the sensitivity of that class. Most classifiers, however, maximize their accuracy over all data, which of course represents an unbalanced evaluation. We focus on the class with the worst results and call its sensitivity the critical sensitivity. Even simple clustering methods, such as DBSCAN, can provide high-quality clusters whose unification creates a classifier in which even the worst class has a sufficient level of sensitivity. Therefore, even in the case of classes of unbalanced size, the proposed classifier does not neglect the smallest class because of its low number of patterns. Our method focuses on generating hidden classes from the given dataset; other authors suggest generating synthetic samples instead (Rekha and Madhu 2022).

In this section, we also provide a brief summary of the current state of research. Pattern recognition (Duda et al. 2012) represents a common aim of artificial intelligence and machine learning. In machine learning, we distinguish between unsupervised and supervised approaches. The typical clustering models (Karlsson 2010) are connectivity, centroid, distribution, density, subspace, graph-based, and neural network ones (Back et al. 2018; Shi et al. 2019; Lin et al. 2020). A well-known unsupervised neural network model is the self-organizing map (SOM). Subspace models such as Principal Component Analysis or Independent Component Analysis can be regarded as data processing techniques for dimension reduction (Eldar and Oppenheim 2003; Jolliffe 2011; Nguyen and Holmes 2019) of an original dataset consisting of a large number of interrelated variables. Dimensionality reduction leads to a data representation using fewer features. Another approach to dimension reduction is linear discriminant analysis, either into two classes (Eldar and Oppenheim 2003; Croux et al. 2008) or into more classes (Rao and Toutenburg 1995).

Our proposed approach consists of three steps:

  • Dimensionality reduction and standardization,

  • Data clustering in the space of reduced dimension,

  • Hidden class forming and their optimal union into the desired output classes.

In the first step, dimensionality reduction is based on data whitening (Eldar and Oppenheim 2003) or, as an alternative, on multi-class discriminant analysis. The subsequent clustering is performed by the parameter-driven DBSCAN (Ester et al. 1996), which generates several classes, their structure, and outliers. The proposed clustering technique is fast and can be reduced to the SLINK algorithm (Sibson 1973) in many cases. In the last step, we define hidden classes as the clusters from the previous step. The relationship between these hidden classes and the learned output classes is optimized using a binary programming technique focused on maximizing the critical sensitivity of the classifier.

The second section summarizes several principles used in multi-classifiers, focusing on data whitening, multiple discriminant analysis, data clustering, and the concept of hidden classes. The third section discusses the framework and structure of the novel multi-classifier. The numerical experiments on basic pattern sets, together with the resulting optimal settings and quality measures, are summarized in the next section. The last section concludes the paper.

2 Multi-classification preliminaries

Basic facts related to classification into several classes are summarized in this section. The optional preprocessing methods, both unsupervised and supervised, are discussed first. Data clustering is then discussed as a method of generating hidden classes. Finally, the concept of the union of hidden classes is presented as the kernel of the novel approach.

2.1 Multi-classification Task

The basic framework of classifying vector patterns (Duda et al. 2012) into several classes is established first. Let \(n, m, N \in {\mathbb {N}}\) be the numbers of features, patterns, and classes, satisfying \(N \ge 2\). Let \({\textbf{x}} \in {\mathbb {R}}^n\) be a feature vector and \(y,y^* \in \lbrace 1,..., N \rbrace\) be a classifier output and its required value. Denoting a pattern as \({\textbf{p}} = ({\textbf{x}}, y^*)\), the classifier is defined as a function

$$\begin{aligned} \textrm{c}:{\mathbb {R}}^n \rightarrow \{1,...,N\} \end{aligned}$$

and the classifier response is therefore \(y = \textrm{c}({\textbf{x}})\). Denoting \({\textbf{x}}_k \in {\mathbb {R}}^n\) and \(y_k^* \in \lbrace 1,..., N \rbrace\) as the feature vector and required output of the k-th pattern, we define a pattern set as

$$\begin{aligned} {\mathscr {S}} = \lbrace ( {\textbf{x}}_k, y_k^* ): k = 1,..., m \rbrace \end{aligned}$$

The pattern set can be represented by an input matrix \({\textbf{X}} \in {\mathbb {R}}^{m \times n}\) and an output vector \({\textbf{y}}^* \in \lbrace 1,..., N \rbrace ^m\). Any classifier is a complex system that applies various data processing techniques to obtain the final decision. Selected approaches are summarised in the following subsections.

2.2 Data whitening

The first but optional step of any classification is an efficient transformation that decreases the number of features but preserves information about pattern differences. The main idea of principal component analysis (PCA) (Jolliffe 2011) is to reduce the dimensionality of an original dataset consisting of a large number of interrelated variables while retaining as much as possible of the variation present in the data. This is achieved by transforming the data into a new set of variables called principal components, which are uncorrelated and ordered so that the first few retain most of the variation present in all of the original variables.

Let \(D \in {\mathbb {N}}\) be the reduced dimension satisfying \(D<n\). The dimensionality reduction from \({\mathbb {R}}^n\) to \({\mathbb {R}}^D\) using PCA is based on the linear transformation

$$\begin{aligned} {\textbf{z}} = \textrm{PCA}({\textbf{x}}) = {\textbf{W}}_1^{\textrm{T}}({\textbf{x}}-{\textbf{x}}_0) \end{aligned}$$
(1)

The PCA is designed to satisfy \(\textrm{E} \,{\textbf{z}} = {\textbf{0}}\) and \(\textrm{var} \,\,{\textbf{z}} = {\textbf{D}}\), where \({\textbf{D}}\) is a diagonal matrix. The resulting parameters of the PCA are \({\textbf{W}}_1 \in {\mathbb {R}}^{n \times D}\) and

$$\begin{aligned} {\textbf{x}}_0 = \frac{1}{m} \sum _{k=1}^m {\textbf{x}}_k \end{aligned}$$
(2)

The transforming matrix \({\textbf{W}}_1\) is calculated as follows. First, we shift the input matrix to obtain \({\textbf{X}}_{\textrm{S}} = {\textbf{X}} - {\textbf{1}}_m {\textbf{x}}_0^{\textrm{T}}\), where \({\textbf{1}}_m\) is the m-dimensional vector of ones. Then we calculate the covariance matrix \({\textbf{A}} = {\textbf{X}}_{\textrm{S}}^{\textrm{T}} {\textbf{X}}_{\textrm{S}} \ge 0\) and apply Eigen-Value Decomposition (EVD), i.e. we find the eigenvectors \({\textbf{v}} \in {\mathbb {R}}^n\) and eigenvalues \(\lambda \ge 0\) in the equation \(({\textbf{A}} - \lambda {\textbf{I}}_n){\textbf{v}} = {\textbf{0}}\), where \({\textbf{I}}_n \in {\mathbb {R}}^{n \times n}\) is the identity matrix.

The solutions can be ordered as \(\lambda _{(1)} \ge \lambda _{(2)} \ge ... \ge \lambda _{(D)} \ge 0\) with corresponding normalized eigenvectors \({\textbf{v}}_{(1)}, {\textbf{v}}_{(2)},..., {\textbf{v}}_{(D)}\). The resulting PCA matrix (Jolliffe 2011) is

$$\begin{aligned} {\textbf{W}}_1 = ({\textbf{v}}_{(1)}, {\textbf{v}}_{(2)},..., {\textbf{v}}_{(D)}) \end{aligned}$$
(3)

and the dimensionality reduction generates the new feature matrix \({\textbf{Z}} = {\textbf{X}}_{\textrm{S}} {\textbf{W}}_1\).
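
For illustration, the following Python sketch shows how \({\textbf{W}}_1\) and the reduced features of Eqs. (1)-(3) could be computed with NumPy; the helper name pca_transform is ours, and the snippet is a minimal sketch rather than the authors' implementation.

```python
import numpy as np


def pca_transform(X, D):
    """PCA dimensionality reduction, a minimal sketch of Eqs. (1)-(3).

    X : (m, n) input matrix, D : reduced dimension with D < n.
    Returns the centre x0, the matrix W1 (n, D) and Z = X_S W1.
    """
    x0 = X.mean(axis=0)                       # Eq. (2)
    Xs = X - x0                               # shifted input matrix X_S
    lam, V = np.linalg.eigh(Xs.T @ Xs)        # EVD of A = X_S^T X_S
    order = np.argsort(lam)[::-1][:D]         # indices of the D largest eigenvalues
    W1 = V[:, order]                          # Eq. (3)
    return x0, W1, Xs @ W1
```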

Data Whitening (DWH) (Eldar and Oppenheim 2003) represents an improved version of PCA which guarantees a unit covariance matrix of the resulting vector. The transform is defined as

$$\begin{aligned} {\textbf{z}} = \textrm{DWH}({\textbf{x}}) = {\textbf{W}}_2^{\textrm{T}} ({\textbf{x}}-{\textbf{x}}_0) \end{aligned}$$
(4)

The matrix \({\textbf{W}}_2^{\textrm{T}}\) is designed to satisfy \(\textrm{E} \,{\textbf{z}} = {\textbf{0}}\) and \(\textrm{var} \,{\textbf{z}} = {\textbf{I}}_D\). Using the result of the EVD, we directly calculate (Eldar and Oppenheim 2003)

$$\begin{aligned} {\textbf{W}}_2 = \left( \frac{\mathbf {v_{(1)}}}{\sqrt{\lambda _{(1)}}}, \frac{\mathbf {v_{(2)}}}{\sqrt{\lambda _{(2)}}},..., \frac{{\textbf{v}}_{(D)}}{\sqrt{\lambda _{(D)}}} \right) \end{aligned}$$
(5)

Due to duality, we can perform the data whitening for \(m<n\) in a more efficient way. We calculate \({\textbf{B}} = {\textbf{X}}_{\textrm{S}} {\textbf{X}}_{\textrm{S}}^{\textrm{T}} \ge 0\) and perform its EVD, which is driven by the equation \(({\textbf{B}} - \lambda {\textbf{I}}_m){\textbf{u}} = {\textbf{0}}\). The solutions can again be ordered as \(\lambda _{(1)} \ge \lambda _{(2)} \ge ... \ge \lambda _{(D)} \ge 0\) with corresponding normalized eigenvectors \({\textbf{u}}_{(1)}, {\textbf{u}}_{(2)},..., {\textbf{u}}_{(D)} \in {\mathbb {R}}^m\). The resulting whitening matrix is

$$\begin{aligned} {\textbf{W}}_2 = {\textbf{X}}^{\textrm{T}}_{\textrm{S}} \left( \frac{\mathbf {u_{(1)}}}{{\lambda _{(1)}}}, \frac{\mathbf {u_{(2)}}}{{\lambda _{(2)}}},..., \frac{{\textbf{u}}_{(D)}}{{\lambda _{(D)}}} \right) \end{aligned}$$
(6)

and the data whitening generates the new feature matrix \({\textbf{Z}} = {\textbf{X}}_{\textrm{S}} {\textbf{W}}_2\) in both cases. Data whitening in its primal or dual form is the preferred optional data preprocessing in this paper.
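
A corresponding sketch of data whitening, covering both the primal form of Eq. (5) and the dual form of Eq. (6), might look as follows; the switch on \(m \ge n\) and the helper name whitening_transform are our assumptions.

```python
import numpy as np


def whitening_transform(X, D):
    """Data whitening, a minimal sketch of Eqs. (4)-(6).

    Uses the primal form (Eq. (5)) for m >= n and the dual form (Eq. (6))
    for m < n. Assumes the D leading eigenvalues are strictly positive.
    Returns W2 (n, D) and the whitened features Z = X_S W2.
    """
    m, n = X.shape
    Xs = X - X.mean(axis=0)                        # centred data, Eq. (2)
    if m >= n:                                     # primal form
        lam, V = np.linalg.eigh(Xs.T @ Xs)         # EVD of A = X_S^T X_S
        lam, V = lam[::-1][:D], V[:, ::-1][:, :D]  # D largest eigenvalues first
        W2 = V / np.sqrt(lam)                      # Eq. (5)
    else:                                          # dual form, cheaper when m < n
        lam, U = np.linalg.eigh(Xs @ Xs.T)         # EVD of B = X_S X_S^T
        lam, U = lam[::-1][:D], U[:, ::-1][:, :D]
        W2 = Xs.T @ (U / lam)                      # Eq. (6)
    return W2, Xs @ W2
```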

2.3 Multiple discriminant analysis

Another approach to dimensionality reduction is based on knowledge of class membership. Having information about the classes, we can also perform a linear data transformation to obtain higher data separation. Classical Fisher discriminant analysis (FDA) (Mika et al. 1999; Croux et al. 2008) is designed for two classes, but Rao (Rao and Toutenburg 1995; Duda et al. 2012) generalized it to the multi-classification task as follows.

The Rao method transforms the data from \({\mathbb {R}}^n\) to \({\mathbb {R}}^{N-1}\) for \(N \ge 2\) using a linear transformation

$$\begin{aligned} {\textbf{z}} = \textrm{RAO} ({\textbf{x}}) = {\textbf{W}}_3^{\textrm{T}} ({\textbf{x}} - {\textbf{x}}_0) \end{aligned}$$
(7)

where \({\textbf{W}}_3 \in {\mathbb {R}}^{n \times (N-1)}\).

The method is based on the pattern index sets \({\mathcal {D}}_i = \lbrace k \in {\mathbb {N}}: y_k^* = i \rbrace\) for \(i=1,...,N\) and their cardinalities \(m_i = \textrm{card} \, {\mathcal {D}}_i\). After the evaluation of the class centres

$$\begin{aligned} {\textbf{t}}_i = \frac{1}{m_i} \sum _{k \in {\mathcal {D}}_i} {\textbf{x}}_k \end{aligned}$$
(8)

we can calculate the within-class scatter matrix

$$\begin{aligned} {\textbf{S}}_{\textrm{W}} = \sum _{i=1}^{N} {\textbf{S}}_i \ge 0 \end{aligned}$$
(9)

where

$$\begin{aligned} {\textbf{S}}_i = \sum _{k \in {\mathcal {D}}_i} ({\textbf{x}}_k - {\textbf{t}}_i)({\textbf{x}}_k - {\textbf{t}}_i)^{\textrm{T}} \end{aligned}$$
(10)

The total and between-class scatter matrices are calculated as

$$\begin{aligned} {\textbf{S}}_{\textrm{T}} = \sum _{k=1}^{m} ({\textbf{x}}_k - {\textbf{x}}_0)({\textbf{x}}_k - {\textbf{x}}_0)^{\textrm{T}} \end{aligned}$$
(11)
$$\begin{aligned} {\textbf{S}}_{\textrm{B}} = {\textbf{S}}_{\textrm{T}} - {\textbf{S}}_{\textrm{W}} = \sum _{i=1}^{N} m_i ({\textbf{t}}_i - {\textbf{x}}_0)({\textbf{t}}_i - {\textbf{x}}_0)^{\textrm{T}} \end{aligned}$$
(12)

When the pattern set is non-degenerate, then \({\textbf{S}}_{\textrm{W}} > 0\) and we solve the generalized EVD problem driven by the equation

$$\begin{aligned} \left( {\textbf{S}}_{\textrm{W}}^{-1} {\textbf{S}}_{\textrm{B}} - \lambda {\mathbb {I}}_n \right) {\textbf{e}} = {\textbf{0}}. \end{aligned}$$
(13)

Solutions of the generalized EVD can be ordered as \(\lambda _{(1)} \ge \lambda _{(2)} \ge ... \ge \lambda _{(N-1)} \ge 0\) with corresponding eigenvectors \({\textbf{e}}_{(1)}, {\textbf{e}}_{(2)},..., {\textbf{e}}_{(N-1)}\). Finally, the transformation matrix of the Rao method is

$$\begin{aligned} {\textbf{W}}_3 = ({\textbf{e}}_{(1)}, {\textbf{e}}_{(2)},..., {\textbf{e}}_{(N-1)}) \end{aligned}$$
(14)

and the pattern set again has the new feature matrix \({\textbf{Z}} = {\textbf{X}}_{\textrm{S}} {\textbf{W}}_3\).

The Rao method is well informed, but its output dimension is fixed to \(N-1\); data whitening, in contrast, allows an arbitrary output dimension but does not use class membership information. Therefore, the two approaches can generate different matrices \({\textbf{W}}\) and \({\textbf{Z}}\). The Rao method is also useful for data preprocessing.
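
Assuming a non-degenerate pattern set (so that \({\textbf{S}}_{\textrm{W}} > 0\)), the Rao transformation of Eqs. (7)-(14) could be sketched as below; the generalized eigenproblem of Eq. (13) is solved here with scipy.linalg.eigh, and the helper name rao_transform is ours.

```python
import numpy as np
from scipy.linalg import eigh


def rao_transform(X, y, N):
    """Multi-class discriminant analysis (Rao), a sketch of Eqs. (7)-(14).

    X : (m, n) input matrix, y : (m,) labels in {1, ..., N}.
    Assumes a non-degenerate pattern set, i.e. S_W positive definite.
    Returns W3 (n, N-1) and the reduced features Z = X_S W3.
    """
    n = X.shape[1]
    x0 = X.mean(axis=0)
    S_W = np.zeros((n, n))                    # within-class scatter, Eqs. (9)-(10)
    S_B = np.zeros((n, n))                    # between-class scatter, Eq. (12)
    for i in range(1, N + 1):
        Xi = X[y == i]
        ti = Xi.mean(axis=0)                  # class centre, Eq. (8)
        S_W += (Xi - ti).T @ (Xi - ti)
        S_B += len(Xi) * np.outer(ti - x0, ti - x0)
    lam, E = eigh(S_B, S_W)                   # generalized EVD of Eq. (13), ascending order
    W3 = E[:, ::-1][:, :N - 1]                # eigenvectors of the N-1 largest eigenvalues, Eq. (14)
    return W3, (X - x0) @ W3
```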

2.4 DBSCAN technique

There are also various approaches to pattern classification. In this section, we focus on modern sequential clustering algorithms such as SLINK, CLINK, and finally DBSCAN. The SLINK algorithm (Sibson 1973) carries out single-link cluster analysis on an arbitrary dissimilarity coefficient and provides a representation of the resulting dendrogram which can readily be converted into the usual tree diagram. There is also an alternative implementation (Goyal et al. 2020) which reduces the number of distance calculations required by the standard implementation of SLINK, whose time complexity is O\((m \, \textrm{log}\, m)\) for m patterns. Hierarchical clustering that omits the initial sorting and consecutive clustering (Schmidt et al. 2017), having a linear time complexity, has also been presented as an alternative to single linkage clustering.

An algorithm for complete linkage clustering (Patel et al. 2015) is based, like SLINK, on a compact representation of the dendrogram. Fast algorithms for CLINK clustering (Banerjee et al. 2021) show that complete linkage clustering of m points can be computed in O\((m \, \textrm{log}^2 m)\) time.

Density-based spatial clustering of applications with noise (DBSCAN) (Ester et al. 1996) is a non-parametric algorithm which, given a set of points in some metric space, groups together points that are closely packed and marks the remaining points as outliers.

We study patterns in a vector space, \({\textbf{x}}_k \in {\mathbb {R}}^n, k=1,2,...,m\), where m, n are the number of patterns and the space dimensionality, whereas DBSCAN is defined in a metric space. Applying the Euclidean distance, we define the mutual distances as \(d_{i,j}=|| {\textbf{x}}_i - {\textbf{x}}_j ||_2\). Various versions of this algorithm (Antony and Deshpande 2016; Bai et al. 2017) differ in the method of distance computation. Inefficient implementations (Shen et al. 2016; Schubert et al. 2017) calculate all mutual distances before data clustering, but there are more effective procedures that decrease the time complexity of DBSCAN to O\((m \, \textrm{log}\, m)\), as in the case of SLINK.

DBSCAN is driven by two parameters \(\epsilon > 0\), \(k_{\textrm{min}} \ge 2\), which fully depend on the user's choice. We set them to obtain the best sensitivity of the resulting classifier in the process of cross-validation. DBSCAN generates an undirected graph \({\mathcal {G}}\) with vertex set \({\mathcal {V}} = \{ 1,2,...,m \}\) and edge set \({\mathcal {E}}= \{e_1,e_2,...,e_t \}\), where the pattern \({\textbf{x}}_i\) is placed in vertex i for \(i=1,2,...,m\). There are three types of vertices: hard members, soft members, and outliers.

The vertex i is called a hard member when \(\textrm{card} \{ j:d_{i,j} \le \epsilon \} \ge k_{\textrm{min}}\). Every edge \(e= \{ i,j \}\) has to satisfy \(d_{i,j} \le \epsilon\). The edge e is called a hard connection when both vertices i, j are hard members, and a soft connection when the vertex i is a hard member but the vertex j is not. The remaining edges are eliminated. The resulting graph \({\mathcal {G}}\) has several components. A component is declared a cluster when it has at least two vertices; the remaining isolated vertices are declared outliers.

The main advantage of SLINK, CLINK, and DBSCAN is the ability of sequential learning with acceptable time complexity. In particular, the hard members of DBSCAN, the number of clusters, and the outliers are invariant to the pattern order during the learning process.
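
The graph formulation above can be turned into a naive O\((m^2)\) sketch that follows the hard/soft member rules literally; it is not an optimized DBSCAN implementation, and the helper name dbscan_graph is ours.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def dbscan_graph(X, eps, k_min):
    """Graph formulation of DBSCAN from Sect. 2.4, a naive O(m^2) sketch.

    Returns a label per pattern: cluster index 0, 1, ... or -1 for outliers.
    """
    m = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # mutual distances d_{i,j}
    hard = (d <= eps).sum(axis=1) >= k_min                     # hard members (self counted, d_{i,i}=0)
    # keep edges {i, j} with d_{i,j} <= eps and at least one hard endpoint
    within = (d <= eps) & ~np.eye(m, dtype=bool)
    keep = within & (hard[:, None] | hard[None, :])
    _, comp = connected_components(csr_matrix(keep), directed=False)
    labels = np.full(m, -1)
    next_id = 0
    for c in np.unique(comp):
        idx = np.where(comp == c)[0]
        if len(idx) >= 2:                                      # components with >= 2 vertices are clusters
            labels[idx] = next_id
            next_id += 1
    return labels
```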

2.5 Union of hidden classes

Both supervised and unsupervised approaches to pattern classification can be used to form hidden classes inside the final classifier. The system of hidden classes arises from the uncertainty of class membership, the imperfectness of classification, or any context-out approach. We apply a deterministic approach based on the relationship between the hidden groups and the output classes (Hrebik et al. 2019) as follows. The aim is to optimize this relationship as the best mapping from the hidden to the output classes. The strict classifier is defined as a mapping \({\textrm{c}}: {\mathscr {L}}_H \rightarrow {\mathscr {L}}_N\) from the set \({\mathscr {L}}_H\) of hidden class indices to the set \({\mathscr {L}}_N\) of final class indices, where \({\mathscr {L}}_n = \lbrace 1,..., n \rbrace\). This mapping can be expressed via the matrix \({\mathbb {X}} \in \lbrace 0,1 \rbrace ^{N \times H}\), where \(x_{i,j} = 1\) iff \(d_k \in {\mathscr {H}}_j \Rightarrow d_k \in {\mathscr {C}}_i\). Therefore, \(x_{i,j} = 1\) exactly when every pattern belonging to \({\mathscr {H}}_j\) is assigned to \({\mathscr {C}}_i\). The uniqueness conditions \(\sum _{i=1}^{N} x_{i,j}= 1\) have to be satisfied for \(j=1,...,H\).

The relation between the classes and the hidden groups is presented via the contingency table \({\mathbb {F}} \in {\mathbb {N}}_0^{N \times H}\), where \(f_{i,j} = \text {card} \lbrace k:d_k \in {\mathscr {C}}_i \bigcap {\mathscr {H}}_j \rbrace\) is the result of pattern counting. Here, the joint frequency \(f_{i,j}\) is the number of patterns belonging to both class \({\mathscr {C}}_i\) and group \({\mathscr {H}}_j\), which can be relativized as

$$\begin{aligned} q_{i,j} = \frac{f_{i,j}}{\sum _{k=1}^{H} f_{i,k}} \end{aligned}$$
(15)

where \(i = 1,...,N\), \(j=1,...,H\).

There are two approaches to evaluating the quality of a classifier. One of them is accuracy, i.e. the success of the classifier as a whole, given by the proportion of successfully classified patterns among all patterns. The main disadvantage of this approach is the imbalance of results across individual classes. The other concept is based on sensitivity, which is very often used in medicine to evaluate the percentage of success in classifying a sick patient; if we are instead interested in the percentage of success in determining that a patient is healthy, medicine uses the term specificity. If we do not know in advance how many classes the task has, it is more convenient to call the classical sensitivity the sensitivity of the first class (\(se_1\)), the specificity the sensitivity of the second class (\(se_2\)), and in general to introduce \(se_i\) for the i-th class, i.e. the number of correctly classified patterns of that class relative to the total number of patterns in that class. It is obvious that, if we want to impose strict requirements on the classifier, it is not a good idea to maximize its accuracy, but rather the so-called critical sensitivity, which is simply the smallest of the individual class sensitivities. Substituting the values \(f_{i,j}\) and \(x_{i,j}\) into these definitions, we obtain the following formulas.

The accuracy of given classifier can be expressed as

$$\begin{aligned} acc = \frac{1}{m} \sum _{i=1}^{N} \sum _{j=1}^{H} f_{i,j} x_{i,j} \end{aligned}$$
(16)

Using the concept of class sensitivity as a relative frequency of true classification, we can calculate it for \(i = 1,...,N\) as

$$\begin{aligned} se_i = \sum _{j=1}^H q_{i,j} x_{i,j} \end{aligned}$$
(17)

An average sensitivity can be defined as

$$\begin{aligned} ase = \frac{1}{N} \sum _{i=1}^N se_i \end{aligned}$$
(18)

A lower estimate of class sensitivity is defined as a critical sensitivity

$$\begin{aligned} se^* = \min \lbrace se_i:i=1,...,N \rbrace \end{aligned}$$
(19)

We prefer critical sensitivity as the strict criterion of classifier efficiency and maximize it via the union of hidden classes. The accuracy criterion is also reported as the traditional measure frequently used by many authors.
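
Given a contingency table \({\mathbb {F}}\) and an assignment matrix \({\mathbb {X}}\), the quality measures of Eqs. (16)-(19) reduce to a few array operations; the following sketch (with our helper name union_quality) assumes both are NumPy arrays.

```python
import numpy as np


def union_quality(F, X):
    """Quality measures of a given union, a sketch of Eqs. (16)-(19).

    F : (N, H) contingency table, X : (N, H) binary assignment matrix.
    """
    Q = F / F.sum(axis=1, keepdims=True)     # relative frequencies q_{i,j}, Eq. (15)
    acc = (F * X).sum() / F.sum()            # accuracy, Eq. (16)
    se = (Q * X).sum(axis=1)                 # class sensitivities se_i, Eq. (17)
    ase = se.mean()                          # average sensitivity, Eq. (18)
    se_star = se.min()                       # critical sensitivity, Eq. (19)
    return acc, se, ase, se_star
```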

In accordance with Hrebik et al. (2019), we maximize the critical sensitivity \(se^*\). The corresponding mixed binary optimization task is

$$\begin{aligned} se^* = \max \end{aligned}$$
(20)

subject to

$$\begin{aligned} \sum _{i=1}^{N} x_{i,j} = 1 \ \textrm{for}\ j = 1,...,H \end{aligned}$$
(21)
$$\begin{aligned} \sum _{j=1}^{H} q_{i,j} x_{i,j} - \textit{se}^* \ge 0 \ \textrm{for}\ \textit{i} = 1,...,\textit{N} \end{aligned}$$
(22)
$$\begin{aligned} x_{i,j} \in \lbrace 0,1 \rbrace \ \textrm{for}\ \textit{i} = 1,...,\textit{N},\ \textit{j} = 1,...,\textit{H} \end{aligned}$$
(23)
$$\begin{aligned} \textit{se}^* \in [0,1] \end{aligned}$$
(24)

with the real artificial variable \(\textit{se}^*\). The inequalities Eq. (22) guarantee that \(se^*\) is a lower bound of the critical sensitivity during the optimization process.

From the theoretical point of view, it is necessary to ask whether this task has a solution in general. If we choose \(se^*=0\), then all inequalities Eq. (22) hold regardless of the values of x. This means that, if the task is too complicated, in the worst possible scenario the optimal value of the critical sensitivity will be \(se^*=0\), signalling that the original task has no useful solution even though the system of inequalities Eq. (22) is feasible. So, if we obtain \(se^*=0\), the given task cannot be solved by the given method. In the case of degeneration, the task can have multiple solutions; therefore, the following consideration is important.

After \(se^*\) has been determined, we can resolve the task degeneration by solving an additional binary programming task which guarantees the same critical sensitivity and maximizes the accuracy as

$$\begin{aligned} acc = \frac{1}{m} \sum _{i=1}^{N} \sum _{j=1}^{H} f_{i,j} x_{i,j} =\max \end{aligned}$$
(25)

subject to

$$\begin{aligned} \sum _{i=1}^{N} x_{i,j} = 1 \ \textrm{for}\ j = 1,...,H \end{aligned}$$
(26)
$$\begin{aligned} \sum _{j=1}^{H} q_{i,j} x_{i,j} \ge se^* \ \textrm{for}\ \textit{i} = 1,...,\textit{N} \end{aligned}$$
(27)
$$\begin{aligned} x_{i,j} \in \lbrace 0,1 \rbrace \ \textrm{for}\ \textit{i} = 1,...,\textit{N},\ \textit{j} = 1,...,\textit{H} \end{aligned}$$
(28)

3 Framework of multi-classification

The novel approach to vector pattern multi-classification is based on a combination of the approaches mentioned above. Basic assumptions and procedures are summarized in this section.

3.1 Assumptions

Let N, M, n be the number of output classes, the number of patterns, and the pattern dimension, which are unlimited in general. However, there is a threshold value \(n^*\) of the pattern dimension which switches between the deterministic and random sub-sampling approaches. In both cases, the first step of classification is dimensionality reduction using data whitening or multi-class discriminant analysis, which converts the data into a space of dimension \(D \le n\). In the second step, the reduced data are clustered using the DBSCAN technique and the hidden classes are formed. The optimal union of these hidden classes is performed in the last step of the multiple classification.

3.2 Classification for \(n \le n^*\)

The learning strategy of the classification depends on the pattern dimension. When the patterns are not too large, we proceed with the whole set of patterns. In the case of data whitening, we use a learning procedure with parameters \(D, k_{\textrm{min}}, \epsilon\); in the case of multi-class discriminant analysis, we have only two free parameters \(k_{\textrm{min}}, \epsilon\) because \(D=N-1\). In the first step, we transform the data matrix \({\textbf{X}} \in {\mathbb {R}}^{M \times n}\) to \({\textbf{Y}} \in {\mathbb {R}}^{M \times D}\) using data whitening, Eqs. (1)-(5), or the Rao method, Eqs. (7)-(14). Then we use the DBSCAN technique for clustering in \({\mathbb {R}}^D\) with parameters \(k_{\textrm{min}}, \epsilon\). Every cluster forms a new hidden class of patterns, and the outliers, which are also localized by DBSCAN, are ignored. Finally, the optimal union of the hidden classes, Eqs. (20)-(24), is performed.
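
Putting the previous sketches together, the learning procedure of this subsection could be outlined as follows; fit_hidden_class_classifier is a hypothetical name, and whitening_transform, rao_transform, dbscan_graph, and optimal_union are the sketches from Sects. 2.2-2.5.

```python
import numpy as np


def fit_hidden_class_classifier(X, y, N, D, eps, k_min, method="whitening"):
    """Learning sketch of Sect. 3.2, reusing the sketches of Sects. 2.2-2.5.

    y holds labels in {1, ..., N}; outliers found by DBSCAN are ignored.
    """
    if method == "rao":
        W, Z = rao_transform(X, y, N)          # Eqs. (7)-(14), D = N - 1
    else:
        W, Z = whitening_transform(X, D)       # Eqs. (4)-(6)
    labels = dbscan_graph(Z, eps, k_min)       # hidden classes, -1 marks outliers
    kept = labels >= 0
    H = labels[kept].max() + 1                 # number of hidden classes
    F = np.zeros((N, H))                       # contingency table f_{i,j}
    for cls, hid in zip(y[kept], labels[kept]):
        F[cls - 1, hid] += 1
    Xu, se_star, acc = optimal_union(F)        # Eqs. (20)-(28)
    return W, labels, Xu, se_star, acc
```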

There are no problems with a large number of patterns M because data whitening, the Rao method, and DBSCAN are all designed for a large amount of data. However, the parameters of DBSCAN must be selected so that the number H of hidden classes is not too large.

3.3 Approximated classification for \(n > n^*\)

When the pattern vector length n is too large, its reduction is a necessary preprocessing step, which is a kind of context-out data whitening. We first select \(n_{\textrm{red}} \le n^*\) and create a random sub-sample of m patterns which are supposed to be representative of the given pattern set. When \(m \le n^*\), the dual form of learning, Eq. (6), is performed as an alternative data whitening which produces the weight matrix \({{\textbf {W}}} \in {\mathbb {R}}^{n \times n_{\textrm{red}}}\). Using this matrix, we transform the original data to obtain a matrix \({\textbf{X}}_{\textrm{red}} \in {\mathbb {R}}^{M \times n_\textrm{red}}\). This matrix of reduced patterns is used instead of the original pattern set in the learning strategy of Sect. 3.2. This process is of a stochastic nature, and \(n_{\textrm{red}}, m\) are two additional parameters that control the preliminary dimensionality reduction. Therefore, the novel classification algorithm is also applicable to long pattern vectors, but with a context-out imperfectness related to the random sampling of patterns.
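
A sketch of this preliminary reduction, under our naming and assuming whitening_transform from Sect. 2.2, could look like this; the random-seed handling is an implementation detail.

```python
import numpy as np


def reduce_long_patterns(X, n_red, m_sub, seed=None):
    """Preliminary reduction for long patterns (Sect. 3.3), a sketch.

    A random sub-sample of m_sub patterns (m_sub < n) feeds the dual
    whitening of Eq. (6) via whitening_transform; the learned weight
    matrix then reduces the whole set to n_red features.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m_sub, replace=False)
    W, _ = whitening_transform(X[idx], n_red)    # dual form since m_sub < n
    return (X - X[idx].mean(axis=0)) @ W         # X_red, centred by the sub-sample mean
```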

3.4 Classification verification

We suppose the parameters of the classification are set on the complete pattern set to obtain the classifier with the maximum possible critical sensitivity \(se^*\). The role of the parameter \(\epsilon\) for fixed \(D, k_{\textrm{min}}\) is crucial and can rapidly change the class sensitivities, but the critical sensitivity is a piecewise-continuous function of \(\epsilon\), and therefore there is an interval of \(\epsilon\) which maximizes \(se^*\). After this preliminary parameter setting, we have to perform the cross-validation. When the number of patterns M is small, we prefer the Leave-One-Out (Wong 2015; Gronau and Wagenmakers 2019) cross-validation technique, but for large M we can use the generally recommended 10-fold (Xu et al. 2018; Steyerberg 2019) cross-validation scheme. When we map the role of the parameter \(\epsilon\) under cross-validation, the values of \(se^*\) are not as high in many cases, and the interval of optimal \(\epsilon\) can also differ from the one found on the training set. There are no general rules, and this phenomenon is studied experimentally in the next section.
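
On the training set, the parameter setting described here amounts to a simple sweep over a grid of \(\epsilon\) values; the sketch below assumes every grid value yields at least one cluster and reuses the hypothetical fit_hidden_class_classifier from the sketch above, while cross-validation would simply wrap this loop.

```python
def tune_epsilon(X, y, N, D, eps_grid, k_min=2, method="whitening"):
    """Sweep epsilon and keep the value maximizing critical sensitivity
    on the training set (a sketch under the assumptions stated above)."""
    best_eps, best_se = None, -1.0
    for eps in eps_grid:
        _, _, _, se_star, _ = fit_hidden_class_classifier(X, y, N, D, eps, k_min, method)
        if se_star > best_se:
            best_eps, best_se = eps, se_star
    return best_eps, best_se
```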

3.5 Classifier structure

The new classifier ensures that every pattern \({\textbf{x}}_k \in {\mathbb {R}}^n, k=1,2,...,m\), where m, n are the number of patterns and the space dimensionality, is recognized as belonging to one of the classes \({\mathscr {C}}_i, i=1,2,...,N\). The classifier consists of three parts, and the proposed structure is captured in Fig. 1. The first part of the system is a linear transformation that maps all patterns from dimension n to a lower dimension D; either multiple discriminant analysis (Sect. 2.3) or data whitening (Sect. 2.2) is used for this transformation. The second part of the system uses the standard DBSCAN tool with parameters \(\epsilon\) and \(k_{\textrm{min}}\) (Sect. 2.4). Depending on these parameters, the individual patterns are assigned, in an unsupervised way, to H classes representing the hidden classes. The third part of the system is the optimal union using binary programming techniques (Sect. 2.5). This creates a system that unambiguously assigns each input x to a class.

Fig. 1: Classifier structure

4 Experimental part

To demonstrate the results of the proposed approach, we have selected ten basic classification tasks (Dua 2020). This allows a clear comparison with other classification methods. All datasets analysed during the current study are available in the repositories referred to in Table 1, mainly in the UC Irvine Machine Learning Repository (Dua 2020). The table also includes the number of patterns, their dimensionality, and the number of classes. For comparison of our results, we work with relevant papers presenting classification results. The main aim is to confirm that our method is comparable with the others. As our approach based on critical sensitivity is not commonly used, we compare the basic criterion of classification accuracy. We present and compare the optimal setting in both cases: on the training set as well as under leave-one-out cross-validation.

Table 1 Datasets for Assessment

4.1 Case study: iris flower classification task

As the primary research dataset, we decided to use the well-known and widely used Iris dataset (Swain et al. 2012). The Iris dataset contains three classes of fifty instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two, and the latter two are not linearly separable from each other. Every iris pattern is a four-dimensional real positive vector.

First of all, we have to prepare the data for our model. We can use data whitening or the Rao method. Our aim is to demonstrate the use of DBSCAN. We test different values of \(\epsilon\) and investigate the results for \(k_{\textrm{min}}\) from 2 to 5. The \(\epsilon\) values in the testing examples were set from 0 to 1.26. The optimal settings with the highest value of \(se^*\) are summarized in Table 2 for the training set and in Table 3 for leave-one-out cross-validation.

We confirm the best results using \(k_{\textrm{min}} = 2\), the Rao method and \(\epsilon\) from \(\epsilon _{\textrm{min}} = 0.08\) to \(\epsilon _{\textrm{max}} = 0.09\) in the naive case. Leave-one-out cross-validation led to a wider interval of \(\epsilon\), reaching from 0.04 to 0.14 for the same setting. The value of the critical sensitivity is above ninety percent in both cases. The reached maximum accuracy value is fully comparable to other published results (Kulluk et al. 2012; Ozyildirim and Avci 2014; Asafuddoula et al. 2017; Yin and Gelenbe 2018), summarized in Tables 4 and 5 together with the reached minimum and maximum accuracy and the placement among the benchmark techniques.

Table 2 Optimal setting on training set
Table 3 Optimal setting under leave-one-out cross-validation

4.2 Application to other pattern sets

We have also applied our novel method to nine other datasets: Wine, Glass, Cancer (Wisconsin), Haberman, Liver, Ionosphere, Cancer (Coimbra), Transfusion, and Cryotherapy, described in Table 1. Most of the tasks are classification into two classes; in the case of the Glass dataset, an alternative task with seven classes is also known. The results of training are collected in Table 2 as the optimal values of the classification parameters and the corresponding values of accuracy and sensitivities. The proposed method has also been treated by leave-one-out cross-validation; the obtained results are summarized in Table 3 in a similar way. There are no dramatic changes in the parameter setting or the resulting accuracy and sensitivities between Tables 2 and 3. The SLINK algorithm, i.e. DBSCAN with \(k_{\textrm{min}} = 2\), is a good choice in the majority of cases, but the type and dimension of the primal coordinate reduction is task sensitive. The Rao technique of multi-class discriminant analysis is useful in many cases.

As seen in Tables 4 and 5, our method is comparable with standard classification techniques. To see how competitive our results are, we went through several papers on this topic (Kulluk et al. 2012; Ozyildirim and Avci 2014; Rani and Ganesh 2014; Abdar et al. 2017; Asafuddoula et al. 2017; Kahramanli 2017; Aslan et al. 2018; Li and Chen 2018; Talabni and Engin 2018; Yin and Gelenbe 2018; Austria et al. 2019; Chan and Chin 2019; Kraipeerapun and Amornsamankul 2019; Rahman et al. 2020). The rank of our novel method has been evaluated for every dataset and for both the training and the cross-validation processes. The comparison results, including the rank among the others, are given in Tables 4 and 5. The proposed method falls in the second quartile of the involved reference methods in most cases. As the use of the datasets varies, the number of benchmark techniques used for comparison also varies, as summarized in the following paragraphs.

Table 4 Accuracy compared classifiers for training
Table 5 Accuracy compared classifiers for cross-validation

According to Kulluk et al. (2012), for the training and cross-validation comparison we used five different variants of harmony search algorithms and the standard backpropagation algorithm. For training, we compared the results with an approximator using a spiking random neural network and five other benchmarks (Yin and Gelenbe 2018). For the cross-validation comparison, we used a generalized classifier neural network and its logarithmic learning implementation, a probabilistic neural network, and a standard multilayer perceptron, as presented in Ozyildirim and Avci (2014). The results of an incremental ensemble classifier method (Asafuddoula et al. 2017) were also used as a benchmark.

The comparison of training and testing for the Haberman and Liver datasets includes product-unit neural networks as a special class of feed-forward neural networks, together with results for the backpropagation and Levenberg-Marquardt algorithms, as presented in Kahramanli (2017).

For the Ionosphere and Transfusion datasets, we also used the method proposed in Chan and Chin (2019) to tackle the problem of imbalanced data based on cosine similarity, together with results for the synthetic minority oversampling technique and the adaptive synthetic sampling approach. The Liver dataset is compared with the results of two proposed methods, Boosted C5.0 and CHAID, based on decision trees, and five other methods (backpropagation, NB tree, decision tree, C5.0, support vector machine, and a basic neural network) presented in Abdar et al. (2017). The training results for the Transfusion dataset were obtained by the naive Bayesian classifier, an implementation of the Iterative Dichotomiser 3 algorithm, and a random tree, all presented in Rani and Ganesh (2014).

Both breast cancer datasets, Cancer Wisconsin and Coimbra, were compared with five different classification models, including decision tree, random forest, support vector machine, neural network and logistic regression, as presented in Li and Chen (2018). The Coimbra dataset is further compared with the results of an artificial neural network, a standard extreme learning machine, a support vector machine and a K-nearest neighbour classifier presented in Aslan et al. (2018), and additionally with ten classification algorithms and their variations, including logistic regression, k-nearest neighbour, support vector machine, decision tree, random forest, gradient boosting method, and naive Bayes, presented in Austria et al. (2019).

In the case of the Cryotherapy dataset, we used a comparison with four methods employing kernel functions to improve the learning capacity of a support vector machine, presented in Talabni and Engin (2018). Seven additional methods for Cryotherapy were also considered: two methods based on the combination of cascade generalization and a complementary neural network, and five existing methods (neural network, stacked generalization, cascade generalization, complementary neural network, and the combination of stacked generalization), as presented in Kraipeerapun and Amornsamankul (2019). The Cryotherapy results were also compared with the proposed use of a support vector machine and nine standard methods (k-nearest neighbours, binary logistic regression, linear discriminant analysis, quadratic discriminant analysis, classification and regression trees, random forest, adaptive boosting, gradient boosting, and bagging), presented and summarized in Rahman et al. (2020).

The results for training on the Iris, Wine, Glass, Haberman, Liver and Ionosphere datasets are also compared with the 1-NN classifier and two of its variants, namely the Hypersphere Classifier and the Adaptive Nearest Neighbor Rule, as presented in Orozco-Alzate et al. (2019). The results of Fuzzy Pattern Trees using Grammatical Evolution, called Fuzzy Grammatical Evolution (Murphy et al. 2022), are used for comparison on the Iris, Wine, Haberman and Transfusion datasets.

For additional comparison, we have also included the average cross-validation accuracy of the scalable ensemble technique XGBoost, random forest, gradient boosting, LightGBM using selective sampling of high-gradient instances, and Ordered CatBoost, which modifies the computation of gradients to avoid the prediction shift, as presented in Bentejac et al. (2021) for Iris, Wine, Cancer (Wisconsin), Liver, and Ionosphere. Cross-validation results of the hybrid classification model named HyCASTLE (Veneri et al. 2022) are also included for Iris, Glass and Haberman.

Based on all the previous experiments, we consider our method quite robust. Robustness here means a similar parameter setting in the training and cross-validation processes. The novel method is therefore comparable with current methods, and its learning is simple and robust with respect to the parameter values. Moreover, we additionally present the values of the critical sensitivities, which we believe are relevant for the classifier quality.

5 Conclusion

The proposed type of pattern classifier with embedded dimensionality reduction and hidden class forming has been tested. Based on training and cross-validation using ten standard datasets, the new type of classifier has several advantages, evaluated below.

Using training for direct verification or leave-one-out cross-validation, we set the DBSCAN parameter \(k_{\textrm{min}}=2\) in most cases to obtain the best value of the critical sensitivity of the given system. Therefore, there is no need to use the general DBSCAN approach, because its reduced version, the SLINK clustering algorithm, suffices even though it produces less compact hidden clusters. This is usually the main weakness of the SLINK approach, but in our case the optimal union of hidden classes eliminates this disadvantage. A future realization of this classifier can therefore use only SLINK instead.

The most important property of this classifier is its similar behaviour during learning on the training set and during cross-validation. We suppose that the selected method of dimensionality reduction and the length of the resulting vectors depend only on the dataset and are independent of the verification strategy.

There is also a similarity in the range of the parameter \(\epsilon \in [ \epsilon _{\textrm{min}}, \epsilon _{\textrm{max}} ]\). Moreover, the optimal \(\epsilon\) is included in the optimum range of the cross-validation. Therefore, the optimal set of the parameter \(\epsilon\) can be directly used as a relevant estimate of \(\epsilon\) for the leave-one-out cross-validation process.

The proposed method aims at critical sensitivity maximization, which produces non-trivial unions of hidden classes. However, the authors of the reference techniques are oriented mainly towards the accuracy of classification, which complicates the comparison of methods.

There are several general recommendations for the setting of the classifier. The user first decides whether to apply multi-class discriminant analysis or data whitening of a given dimension D. The SLINK method with parameter \(\epsilon\) is suggested for hidden class forming. The optimal values of D and \(\epsilon\) can be estimated using only the training set for verification. Finally, the leave-one-out cross-validation is a simple process focused only on the improvement of the parameter \(\epsilon\), which remains unchanged in many cases. The proposed classifier represents a comparable alternative to other pattern classification techniques, focusing on critical sensitivity.

The main advantage of the proposed classifier is that it successfully avoids the curse of dimensionality and includes automatic reduction and standardization of the input dataset. All cluster analysis is carried out in dimensionless coordinates and thus offers a wide range of uses across many applications. It is a relatively simple classifier with properties comparable to more complicated ones. As one of the further applications, we can recommend, for example, the recognition of unstructured data such as strings, trees, and graphs. In such a case, it is necessary to set up a suitable feature description, which precedes the reduction layer of our classifier.