1 Introduction

Understanding the underlying data distribution before applying a machine learning process is an important step in the analysis of data; otherwise, wrong choices may be made at the different stages of the machine learning process. Every machine learning algorithm makes, either explicitly or implicitly, assumptions about the data that must hold for it to work effectively. For linear regression, the typical assumptions include linearity (there is a linear relationship between the independent and dependent variables), exogeneity (the errors between observed and predicted values have conditional mean zero), no multicollinearity (no independent variable is a linear combination of the others), homoscedasticity (the errors have the same variance in each observation) and normality (the errors are normally distributed) [7, 23]. For random forests [2], one assumption is that changes in the dependent variable are best described by axis-aligned hyper-rectangles in the space of the independent variables (because random forests are tree-based). Another assumption is that no future value of the dependent variable will fall outside the range of values already in the training data. If the distribution of the data can be described by canonical statistical distributions, it is possible to gain considerable inferential and predictive power [15]. The key to any successful use of data in an analysis or a decision is applying the correct machine learning/statistical modelling technique to the data at hand.

In this paper we consider a particular type of data distribution where there are multiple modalities (concentrations/clusters of data) within each class, i.e., within-class multimodality, and study how to choose the right feature extraction methods to model such data more effectively. Fig. 1(a) illustrates within-class multimodality at a conceptual level, where there are two and three modalities in Class One and Class Two, respectively. Within-class multimodality is prevalent in the real world. For example, we can recognise people under different illuminations, and also in different poses. If we represent face images of the same person under different illuminations, images taken under different illuminations are likely to fall into different clusters (see Fig. 1(b) for an illustration). Indeed, face recognition under varying illumination is a challenging problem [25, 31]. The same can be said of face recognition under different head poses (see Fig. 1(c) for an illustration). Another potential application is energy disaggregation of appliances by non-intrusive load monitoring (NILM) [8, 11, 12, 19], namely disaggregating the total consumption readings into the consumption patterns of each individual appliance, where the total consumption reading of a house represents a class and the appliances in the house are the modalities within this class. Dividing a class into multiple modalities is therefore similar to disaggregating the total consumption of all appliances into the consumption of each appliance.

Fig. 1

(a) There are two modalities in Class One, and three modalities in Class Two, where different modalities are marked by different colours. (b) Each person has three different illumination modalities: two face images in the green dotted circle are taken under normal lighting; one face image in the cyan dotted circle is taken under normal lighting and right light on; one face image in the orange dotted circle is taken under normal lighting and left light on. (c) Each person has two different head pose modalities: two face images in the green dotted circle are taken with frontal head pose, and one face image in the cyan dotted circle is taken with rightwards head pose

Within-class multimodality has been largely ignored, or at least under-studied, in the literature. The closest studies are linear discriminant analysis (LDA) [5, 22], subclass discriminant analysis (SDA) [32], mixture subclass discriminant analysis (MSDA) [6], and separability-oriented subclass discriminant analysis (SSDA) [26]. Unlike LDA, which separates different classes under the assumption that each class is unimodal, SDA, MSDA and SSDA recognise that a class may be multimodal and seek to find LDA dimensions based on multimodality descriptors through the notion of subclass. SDA, MSDA and SSDA have better classification performance than LDA, which indicates the importance of within-class multimodality for classification. LDA is a classical approach to discriminant dimensionality reduction. It transforms data from the original data space into a lower dimensional space (the LDA space) so that within-class compactness and between-class separation are both maximised. This is achieved by maximising the well-known Fisher objective, which is composed of the within-class scatter matrix and the between-class scatter matrix [5, 22]. In the presence of within-class multimodality, LDA reduces dimensionality by merging the multiple modalities in each class into a single modality. SDA extends LDA in order to separate classes at a subclass level rather than at a class level. It transforms data into a lower dimensional LDA space so that between-subclass separation and within-class compactness are maximised. The SDA subclasses are discovered using the leave-one-out-test (LOOT) criterion proposed in [32] or the stability criterion [18]. MSDA extends SDA by replacing SDA's within-class scatter matrix with a new within-subclass scatter matrix. SSDA further extends SDA to minimise the level of overlap between subclasses within every class; thus between-class separation, between-subclass separation and within-class compactness are all maximised. The SSDA subclasses are discovered by an agglomerative hierarchical clustering algorithm using a new criterion called the separability criterion [26], which aims to divide each class into several non-overlapping clusters.

A lot is known about within-class unimodality classification, whose aim is to build a model assuming there is one modality per class. It is well known that simultaneously minimising intra-class variance and maximising inter-class variance increases learning performance [4, 28, 29]. However, much less is known about within-class multimodality classification, where the data distribution within each class is multimodal. Existing studies (e.g. SDA and SSDA) only scratch the surface of multimodality, and many questions remain unanswered. In this paper, we present an extensive study of within-class multimodality classification guided by the following five key questions, which are important for understanding multimodality, designing new learning algorithms and improving existing ones.

  • Question 1: Why do we consider multimodality?

  • Question 2: How many clusters should we use?

  • Question 3: How should we utilise the clusters?

  • Question 4: Do we gain real benefits on real-world data?

  • Question 5: If we keep increasing the number of modalities, what will happen?

The study of these questions is important for a number of reasons. Firstly, it will reveal a relationship between the modality of the data distribution and the comparative performance of classification, making it possible to gain insight into the data from the comparative performance of models built with different dimensionality reduction techniques. Secondly, it will establish that different dimensionality reduction techniques are suitable for different data distributions. Thirdly, it will provide a direction for improving other machine learning algorithms, such as neural networks, by designing new loss functions.

We create artificial data sets with a range of modalities and conduct extensive experiments in order to answer Questions 1-3 (and possibly Question 5). We also select real-world data sets that clearly have multiple modalities and conduct extensive experiments to answer Question 4. The contributions of this paper are highlighted as follows:

  • We answer the five key questions listed above.

  • We obtain the following useful findings: (1) when within-class multimodality is present, concurrently maximising between-class separation, within-class compactness and between-subclass separation can lead to significant performance gains; (2) within-class multimodal classification offers a competitive solution to face recognition under different lighting and face pose conditions, where each lighting/pose condition corresponds to a separate modality in the data space; (3) there is a correlation between multimodality and performance gain in within-class multimodality classification, and optimal performance can be expected if the number of modalities used in the within-class multimodality classification algorithm is the same as the true number of within-class modalities.

The rest of the paper is organised as follows. Section 2 presents relevant work including linear discriminant analysis (LDA), subclass discriminant analysis (SDA) and separability-oriented subclass discriminant analysis (SSDA). Section 3 focuses on artificial data sets and their rationale. Section 4 attempts to answer various questions about multimodality using artificial data sets, and Section 5 attempts to answer other questions using real data sets. Section 6 concludes the paper with a summary.

In the rest of the paper we use the terms cluster, subclass and modality in different contexts; they are interchangeable in this paper.

2 Related work

In this section, we present an overview of related work, including LDA, SDA and SSDA, to provide the context for this work and to introduce the necessary technical notation.

2.1 Linear discriminant analysis

Linear discriminant analysis (LDA) is a classical method for discriminant analysis. It has been widely used in many areas, such as pattern recognition [13, 14] and machine learning [10, 27]. LDA seeks a linear combination of features that separates two or more classes of objects. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before subsequent classification [30]. LDA uses a between-class scatter matrix Sb to measure the separability of classes, and a within-class scatter matrix Sw to measure the compactness of each class. LDA then attempts to find a linear projection matrix W that projects data into a new space (the LDA space), spanned by the LDA features (or LDA dimensions), such that a measure of the between-class scatter matrix Sb in the new space is maximised while the same measure of the within-class scatter matrix Sw in the new space is minimised. Sb and Sw are defined, respectively, as follows:

$$ S_{b}=\frac{1}{N}\sum\limits_{i=1}^{C} N_{i}(\mu_{i} - \mu)(\mu_{i} - \mu)^{T}, $$
(1)
$$ S_{w}=\frac{1}{N}\sum\limits_{i=1}^{C} \sum\limits_{j=1}^{N_{i}}(x_{ij} - \mu_{i})(x_{ij} - \mu_{i})^{T}, $$
(2)

where N is the number of samples, Ni is the number of samples in class i, C is the number of classes, μi is the mean of class i, μ is the global mean of all samples, and xij denotes the jth sample in class i.

LDA is an optimisation process, with the following Fisher objective:

$$ J^{LDA}(W) = \frac{tr(W^{T} S_{b} W)}{tr(W^{T} S_{w} W)}, $$
(3)

where W is a projection matrix that projects data from the data space to the LDA space. In order to find an LDA space that separates different classes well, LDA needs to find the matrix \(W^{*} = \arg\max_{W} J^{LDA}(W)\). It turns out that the sought-after projection matrix W∗ is composed of the eigenvectors corresponding to the largest eigenvalues of \(S_{w}^{-1}S_{b}\) [26], under the assumption that every class is Gaussian distributed and has the same covariance.
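As a concrete illustration, the following NumPy sketch (our own illustration, not the authors' code) builds Sb and Sw as in (1) and (2) and takes the leading eigenvectors of \(S_{w}^{-1}S_{b}\) as the projection matrix W; the function name lda_fit and the use of a pseudo-inverse to guard against a singular Sw are our own choices.

```python
import numpy as np

def lda_fit(X, y, n_components):
    """Minimal LDA sketch: build S_b and S_w as in (1)-(2), then take the
    leading eigenvectors of pinv(S_w) @ S_b as the projection matrix W."""
    N, d = X.shape
    mu = X.mean(axis=0)                                # global mean
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        diff = (mu_c - mu).reshape(-1, 1)
        S_b += (len(Xc) / N) * diff @ diff.T           # between-class scatter, eq. (1)
        S_w += ((Xc - mu_c).T @ (Xc - mu_c)) / N       # within-class scatter, eq. (2)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1]             # largest eigenvalues first
    return eigvecs[:, order[:n_components]].real       # project with X @ W
```

Projecting the data is then simply Z = X @ W, after which any classifier (1-NN in this paper) can be applied in the LDA space.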

2.2 Subclass discriminant analysis

Subclass discriminant analysis (SDA) [32] is a variant of LDA that separates classes at a subclass level rather than at a class level, based on the observation that the data distribution in a class may be multimodal (i.e., forming clusters). This is achieved by dividing each class into a set of subclasses and then running an LDA-like optimisation process to maximise between-subclass separation and within-class compactness.

The between-class scatter matrix Sb of LDA is replaced by the between-subclass scatter matrix, defined in (4):

$$ S_{b}^{SDA}= \sum\limits_{i=1}^{C-1}\sum\limits_{j=1}^{K_{i}}\sum\limits_{l=i+1}^{C}\sum\limits_{n=1}^{K_{l}} p_{ij}p_{ln}(\mu_{ij}-\mu_{ln})(\mu_{ij}-\mu_{ln})^{T}, $$
(4)

where C denotes the number of classes, Ki (Kl) denotes the number of subclasses in class i (l), μij (μln) denotes the mean of the jth (nth) subclass in class i (l), \(p_{ij}=\frac{N_{ij}}{N}\) (\(p_{ln}=\frac{N_{ln}}{N}\)) denotes the prior of the jth (nth) subclass of class i (l), and Nij (Nln) is the number of samples in the jth (nth) subclass of class i (l).
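For illustration, a hedged NumPy sketch of (4) is given below, assuming a vector `sub` that assigns each sample to a subclass within its class; the variable names and the pairwise loop over subclass statistics are our own choices rather than the original implementation.

```python
import numpy as np

def sda_between_subclass_scatter(X, y, sub):
    """Sketch of the between-subclass scatter S_b^{SDA} in (4); `sub` holds a
    subclass index for every sample (within its own class)."""
    N, d = X.shape
    stats = []                                  # (prior, mean, class) per subclass
    for c in np.unique(y):
        for s in np.unique(sub[y == c]):
            mask = (y == c) & (sub == s)
            stats.append((mask.sum() / N, X[mask].mean(axis=0), c))
    S_b = np.zeros((d, d))
    for p_ij, mu_ij, c_i in stats:
        for p_ln, mu_ln, c_l in stats:
            if c_i < c_l:                       # pairs of subclasses from different classes
                diff = (mu_ij - mu_ln).reshape(-1, 1)
                S_b += p_ij * p_ln * diff @ diff.T
    return S_b
```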

The within-class scatter matrix of SDA is redefined as the sample covariance matrix, given in (5):

$$ S_{w}^{SDA}= \frac{1}{N}\sum\limits_{j=1}^{N}(x_{j}-\mu)(x_{j}-\mu)^{T}, $$
(5)

where N, xj, and μ are the number of instances, the jth instance and the mean of all instances, respectively.

The Fisher objective is redefined as in (6):

$$ J^{SDA}(W) = \frac{tr(W^{T} S_{b}^{SDA} W)}{tr(W^{T} S_{w}^{SDA} W)}. $$
(6)

In order to divide each class into the same number of subclasses, the leave-one-out-test (LOOT) criterion [32] or the faster stability criterion [18] is used together with a nearest neighbour based clustering algorithm [32]. Firstly, the clustering algorithm sorts the samples of each class so that samples that are close in Euclidean distance are placed near each other in the ordering. To achieve this, the two samples A and B in each class that have the largest Euclidean distance between them are found and taken as the 1st and nth samples in the sorted data. After that, the samples ranked 1st to (n/2)th are those near A, and the samples ranked (n/2 + 1)th to nth are those near B. Then, based on the number of subclasses set by the user, the sorted samples of each class are divided into the specified number of subclasses. Finally, the LOOT criterion or the stability criterion is used to find the optimal number of subclasses for each class.
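One simple reading of this sorting-and-splitting procedure is sketched below (our own approximation of the nearest neighbour based clustering in [32], not the reference implementation): the two most distant samples A and B anchor the ordering, the remaining samples are ordered by their distance to A, and the ordering is cut into equal-sized contiguous subclasses.

```python
import numpy as np

def sort_and_split(Xc, n_subclasses):
    """Approximate SDA-style subclass division for the samples Xc of one class."""
    D = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=-1)
    a, b = np.unravel_index(np.argmax(D), D.shape)   # most distant pair A, B
    order = np.argsort(D[a])                         # ascending distance to A
    sub = np.empty(len(Xc), dtype=int)
    for k, idx in enumerate(np.array_split(order, n_subclasses)):
        sub[idx] = k                                 # contiguous chunks become subclasses
    return sub
```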

2.3 Separability-oriented subclass discriminant analysis

Separability-oriented subclass discriminant analysis (SSDA) [26] is an extension of SDA, which also separates classes at the subclass level. It aims to (1) maximise the between-subclass separation within every class; (2) maximise the within-class compactness; and (3) maximise the overall between-class separation. This is achieved through an LDA-like optimisation process operating at the subclass level with a different Fisher objective.

The way optimal subclasses are found for each class is very different from SDA. SSDA aims to find subclasses with no or little overlap through agglomerative hierarchical clustering guided by a separability criterion [26]. The resulting clustering is one that maximises the average Euclidean distance (AED) between the mean of a class and the means of the subclasses in that class.
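To make the criterion concrete, the sketch below (a hedged illustration; the exact comparison rule of the separability criterion is given in [26] and not reproduced here) computes the AED of a candidate partition of one class obtained by agglomerative hierarchical clustering, using scikit-learn's AgglomerativeClustering with its default linkage as an assumed stand-in for the clustering used in SSDA.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def aed(Xc, labels):
    """Average Euclidean distance between the class mean and its subclass means."""
    mu = Xc.mean(axis=0)
    return np.mean([np.linalg.norm(Xc[labels == k].mean(axis=0) - mu)
                    for k in np.unique(labels)])

def candidate_subclasses(Xc, k):
    """One candidate partition of a class into k subclasses, with its AED score;
    SSDA compares such candidates using its separability criterion."""
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(Xc)
    return labels, aed(Xc, labels)
```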

Three versions of SSDA exist [26], each with a different combination of between-class scatter matrix and within-class scatter matrix. One version is reviewed here. The between-class scatter matrix of SSDA, \(S_{b}^{SSDA}\), is defined in terms of the subclasses:

$$ S_{b}^{SSDA} = \sum\limits_{i=1}^{C}\frac{N_{i}}{N}\sum\limits_{j=1}^{K_{i}}(\mu_{ij}-\mu)(\mu_{ij}-\mu)^{T}, $$
(7)

where N is the number of samples in the data set, Ni is the number of samples in class i (\(i = 1, 2,\dots,C\), where C is the number of classes) such that \({\sum}_{i=1}^{C}{N_{i}}=N\), Ki is the number of subclasses in class i, μ is the mean of the whole data set and μij is the mean of subclass j of class i.

The within-class scatter matrix is the standard LDA within-class scatter matrix, \(S_{w}^{SSDA}=S_{w}\). The Fisher objective of SSDA, JSSDA(W), is therefore given in (8), with Sb replaced by \(S_{b}^{SSDA}\). We summarise SSDA in Algorithm 1 and show its main steps in the flowchart of Fig. 2, where the notation has the same meaning as in Algorithm 1; a small code sketch of the SSDA scatter construction follows the figure.

$$ J^{SSDA}(W) = \frac{tr(W^{T} S_{b}^{SSDA} W)}{tr(W^{T} S_{w}^{SSDA} W)}=\frac{tr(W^{T} S_{b}^{SSDA} W)}{tr(W^{T} S_{w} W)}. $$
(8)
Algorithm 1 The SSDA algorithm
Fig. 2

The flowchart of the SSDA algorithm
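The following NumPy sketch (our own illustration, assuming subclass labels `sub` have already been produced by the clustering step above) constructs \(S_{b}^{SSDA}\) as in (7); the projection W is then obtained exactly as for LDA, from the leading eigenvectors of \(S_{w}^{-1}S_{b}^{SSDA}\).

```python
import numpy as np

def ssda_between_scatter(X, y, sub):
    """Sketch of S_b^{SSDA} from (7): subclass means measured against the
    global mean, weighted by the class priors N_i / N."""
    N, d = X.shape
    mu = X.mean(axis=0)
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        w = (y == c).sum() / N                        # class prior N_i / N
        for s in np.unique(sub[y == c]):
            mu_ij = X[(y == c) & (sub == s)].mean(axis=0)
            diff = (mu_ij - mu).reshape(-1, 1)
            S_b += w * diff @ diff.T
    return S_b

# W then comes from the leading eigenvectors of pinv(S_w) @ ssda_between_scatter(X, y, sub),
# with S_w the standard LDA within-class scatter (see the LDA sketch in Section 2.1).
```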

3 Artificial data

In order to answer the research questions mentioned above, we generate four types of artificial data.

  • Type 1 consists of two different classes, and the samples in each class are drawn from a single multivariate normal distribution. This type is denoted by C2M1.

  • Type 2 consists of two different classes, and every class has two subclasses of samples generated from two multivariate normal distributions. This type is denoted by C2M2.

  • Type 3 consists of two different classes, and every class has three subclasses of samples generated from three multivariate normal distributions. This type is denoted by C2M3.

  • Type 4 consists of three different classes, and every class has three subclasses of samples generated from three multivariate normal distributions. This type is denoted by C3M3.

The number of variables is one parameter of a multivariate normal distribution; it is set to 30 for all types of artificial data in our studies. Two other important parameters, the mean μ and the covariance σ, are needed to generate artificial data from a multivariate normal distribution. In our studies, the mean μ is a 1-by-30 vector whose values are integers chosen randomly from the range [1,10]. The covariance σ is a 30-by-30 diagonal matrix. There are two covariance matrices for C2M1, one for each class. The values of one covariance matrix for C2M1 are integers chosen randomly from the range [10,21], and the values of the other covariance matrix are integers chosen randomly from the range [20,41].

There are four covariance matrices for C2M2, one for each subclass and two for each class (there are two subclasses in each class). For class one, the values of the covariance matrices for the two subclasses are integers chosen randomly from the range [10,21], and the values of the covariance matrices for the two subclasses of class two are integers chosen randomly from the range [20,41].

There are six covariance matrices for C2M3, one for each subclass and three for each class. For class one, the values of the covariance matrices for the three subclasses are integers chosen randomly from the range [10,21]. For class two, the values of the covariance matrices for the three subclasses are integers chosen randomly from the range [20,41].

There are nine covariance matrices for C3M3, one for each subclass and three for each class. For class one, the values of the covariance matrices for the three subclasses are integers chosen randomly from the range [1,10]. For class two and class three, the values of the covariance matrices for the three subclasses are integers chosen randomly from the ranges [10,21] and [20,41], respectively.

In total, 10 data sets are generated for each type, and every class of every artificial data set (of any type) has 1000 samples. Therefore C2M1, C2M2 and C2M3 each have a total of 2000 samples, with 1000 per class. For C2M2 and C2M3, the samples in each class are randomly placed into two and three subclasses respectively, according to a probability distribution that varies from data set 1 to data set 10. C3M3 has a total of 3000 samples, with 1000 per class. The samples in each class are randomly placed into three subclasses in the same way as for C2M2 and C2M3. The actual numbers of samples per subclass are shown in Tables 2, 3 and 4.
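As an illustration of this generation procedure, the sketch below builds one C2M2-style data set with NumPy; the 500/500 subclass split, the random seed and the helper names are our own choices (the actual data sets use subclass proportions that vary from data set 1 to 10).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 30                                                 # number of variables

def diag_cov(low, high):
    """30-by-30 diagonal covariance with integer entries drawn from [low, high]."""
    return np.diag(rng.integers(low, high + 1, size=D).astype(float))

def make_subclass(n, cov_range):
    """One modality: multivariate normal with a random integer mean in [1, 10]."""
    mean = rng.integers(1, 11, size=D).astype(float)
    return rng.multivariate_normal(mean, diag_cov(*cov_range), size=n)

# C2M2-style data: two classes, two modalities per class, 1000 samples per class
X1 = np.vstack([make_subclass(500, (10, 21)), make_subclass(500, (10, 21))])
X2 = np.vstack([make_subclass(500, (20, 41)), make_subclass(500, (20, 41))])
X, y = np.vstack([X1, X2]), np.array([0] * 1000 + [1] * 1000)
```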

4 Multimodality in artificial data

Multiple modalities exist in data. In order to gain full insight into the issue of within-class multimodality, various questions can be asked and answered. In the Introduction, some questions are posed explicitly, and the rest of this paper seeks answers to them. Some questions are answered using artificial data in this section; others are answered using real-world data in the next section.

4.1 Q1: Is it necessary to address within-class multimodality?

To answer this question we consider and compare experimentally three approaches in the presence of within-class multimodality:

  • separating within-class modalities for every class through the extraction of features by dimensionality reduction methods such as SDA and SSDA;

  • merging within-class modalities as a uni-modality for every class in the process of feature extraction using a dimensionality reduction method such as LDA; and

  • doing nothing about within-class multimodality and using the original data for classification.

In order to evaluate these three approaches, we conduct experiments using k-nearest neighbour (kNN, k = 1) as the classifier on all of the artificial data sets. We consider four cases: (1) Original: the original artificial data sets without any dimensionality reduction; (2) LDA processed; (3) SDA processed; (4) SSDA processed. In addition, we use one half of each data set for training and the other half for testing.
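A hedged sketch of this evaluation protocol is given below using scikit-learn; the stratified split, the random seed and the use of scikit-learn's LinearDiscriminantAnalysis in place of the paper's own LDA/SDA/SSDA implementations are our assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def accuracy_1nn(X, y, reducer=None, seed=0):
    """Half of the data for training, half for testing; 1-NN on either the
    original features or the features produced by a fitted reducer."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    if reducer is not None:                      # e.g. LinearDiscriminantAnalysis()
        X_tr = reducer.fit_transform(X_tr, y_tr)
        X_te = reducer.transform(X_te)
    return KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr).score(X_te, y_te)

# acc_original = accuracy_1nn(X, y)
# acc_lda      = accuracy_1nn(X, y, reducer=LinearDiscriminantAnalysis())
```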

Tables 1, 2, 3 and 4 show the experimental results in the four cases on all of the artificial data sets. From these results, we can observe the following:

  • It is apparent that SSDA outperforms Original and LDA on all artificial data sets. In particular, SSDA improves classification accuracy over Original by at least 9% on all of the C2M1, C2M2 and C2M3 data sets, and by at least 14% on the C3M3 data sets.

  • LDA, SDA and SSDA outperform Original consistently, so dimensionality reduction in the style of LDA can indeed improve classification performance significantly. Whilst this is not new, it indicates that doing nothing about multimodality is suboptimal.

  • When there is only one modality per class: it is clear from Table 1 that the differences between LDA, SDA and SSDA do not appear to be significant. This suggests that when there is only one modality per class, dimensionality reduction using SDA or SSDA makes little difference compared with using LDA.

  • As for LDA and its variants, we can rank them in terms of their performance as LDA ≤ SDA ≤ SSDA on the artificial data sets with within-class multimodality, namely C2M2, C2M3 and C3M3. This suggests that dealing with within-class multimodality in the SSDA way is better.

  • When there are multiple modalities per class: from Tables 2, 3 and 4, it is clear that dimensionality reduction at the subclass level, as in SDA or SSDA, is better than at the class level, as in LDA. Furthermore, SSDA clearly outperforms SDA in these experiments. This suggests that separating the subclasses (in other words, reducing the overlap of different subclasses) within every class while at the same time separating all classes is a better approach than simply pulling the subclasses of one class away from the subclasses of other classes.

  • When the number of modalities per class increases: according to Tables 1, 2 and 3, the classification accuracy generally drops for all methods, suggesting that the complexity of the problem increases. This can be seen more clearly in Fig. 3. Interestingly, the margin of performance drop is smallest for SSDA, suggesting that SSDA is more robust than Original, LDA and SDA when the number of modalities per class changes.

From these observations we can draw the conclusion that it is indeed necessary to deal with the issue of within-class multimodality. This conclusion is further confirmed using the real data sets in Section 5.

Fig. 3

The classification performance of Original, LDA, SDA and SSDA on ten C2M1 data sets, ten C2M2 data sets and ten C2M3 data sets: In the line charts, the horizontal axis shows the ten data sets from C2M1, C2M2 and C2M3, and the vertical axis shows the classification accuracy

Table 1 Classification accuracy with kNN (k = 1) of Original, LDA, SDA and SSDA on ten C2M1 data sets
Table 2 Classification accuracy with kNN (k = 1) of Original, LDA, SDA and SSDA on ten C2M2 data sets, along with the ratio between the numbers of samples from different subclasses in each class
Table 3 Classification accuracy with kNN (k = 1) of Original, LDA, SDA and SSDA on ten C2M3 data sets, along with the ratio between the numbers of samples from different subclasses in each class
Table 4 Classification accuracy with kNN (k = 1) of Original, LDA, SDA and SSDA on ten C3M3 data sets, along with the ratio between the numbers of samples from different subclasses in each class

4.2 Q2: How many within-class modalities should we use?

There is a clear difference between SDA and SSDA in terms of classification accuracy, as shown in Tables 1, 2, 3 and 4. SDA and SSDA both try to separate classes at the subclass level, but they differ in two ways: (1) how to find the within-class modalities; and (2) once found, how to make use of these modalities. We examine the first issue in this subsection and discuss the second issue in Subsection 4.3.

SDA uses a stability criterion to find class modalities, whereas SSDA uses a separability criterion. Tables 5, 6, 7 and 8 show the numbers of class modalities found by SDA and SSDA for the 10 data sets of types C2M1, C2M2, C2M3 and C3M3, respectively. It is clear that the numbers found by SDA and SSDA are quite different. The numbers found by SSDA are in general quite close to the true numbers of within-class modalities, and identical for most of the data sets. Apart from a few cases, the numbers found by SDA are quite different from the true numbers.

Table 5 The number of subclasses found by SDA and SSDA for each class in the C2M1 data sets
Table 6 The number of subclasses found by SDA and SSDA for each class in the C2M2 data sets
Table 7 The number of subclasses found by SDA and SSDA for each class in the C2M3 data sets
Table 8 The number of subclasses found by SDA and SSDA for each class in the C3M3 data sets

Furthermore, SSDA can find the true within-class modalities even for classes with imbalanced proportions of data between subclasses. For instance, SSDA separates each of Class One, Class Two and Class Three of C3M3-10 into three modalities, even though their subclass ratios are 16 : 500 : 484, 23 : 462 : 515 and 20 : 487 : 493, respectively.

All of these observations suggest that (1) consistently good classification performance relies on finding the correct number of modalities; and (2) SSDA finds the number of within-class modalities more accurately than SDA. This will be further verified on the two face databases in Section 5.2.

4.3 Q3: How should we utilise the modalities?

After the multiple within-class modalities are found, we need to utilise them in order to reduce dimensionality for the purpose of effective classification. SDA and SSDA provide different solutions, both based on the LDA optimisation process but with different Fisher objectives. To compare the two solutions, we apply the SDA and SSDA optimisation processes to the artificial data sets that contain within-class multimodality (i.e., C2M2, C2M3 and C3M3). In addition, the true number of within-class modalities (True-MN) is used in both SDA and SSDA. The experimental results are presented in Tables 9, 10 and 11.

Table 9 The classification accuracy with kNN (k = 1) of SDA and SSDA using the true number of within-class modalities on the C2M2 data sets
Table 10 The classification accuracy with kNN (k = 1) of SDA and SSDA using the true number of within-class modalities on the C2M3 data sets
Table 11 The classification accuracy with kNN (k = 1) of SDA and SSDA using the true number of within-class modalities on the C3M3 data sets

From Tables 9, 10 and 11, it is clear that the performance of SSDA with True-MN is consistently higher than that of SDA with True-MN. This suggests that the SSDA optimisation process utilises the modalities better than the SDA optimisation process. Furthermore, it shows that maximising inter-subclass and inter-class separation at the same time is a worthwhile goal of LDA-like dimensionality reduction when the true modalities are found in the data.

5 Multimodality in real data

Separating within-class multimodalities results in good performance on artificial data, when the modality of the data is known. For real-world data, the modality of the data is unknown even if we believe that there should be multimodality, e.g., as in the problem of face recognition discussed in Section 1. Can we obtain real benefits by addressing within-class multimodality in real-world data in the same way as for artificial data? This is the question we want to answer in this section. We consider two types of data. One is general data from the UCI data repository [3]; the other is face image data, as it is intuitively plausible that there is within-class multimodality associated with lighting conditions and head pose.

In our experiments, we consider k-nearest neighbour (kNN, k = 1) as the classifier. In this paper we study the within-class multimodality classification problem by focusing on extracting discriminant features. Some commonly used classifiers have built-in feature selection/extraction functions; for example, Support Vector Machines (SVM) and Decision Trees (DT) select features as part of the learning process. The kNN classifier does not have any built-in feature selection/extraction function, so it is selected and used in our experiments. Additionally, we use ten-fold cross-validation as the evaluation framework, and Estimated Mean Accuracy (EMA) and Standard Error of the Mean (SEM) [9] as the evaluation metrics: \(EMA = \frac{{\sum}_{i = 1}^{10}p_{i}}{10}\), where pi denotes the percentage of correct classifications in the ith fold; \(SEM = \frac{\delta}{\sqrt{10}}\), where \(\delta = \sqrt{\frac{{\sum}_{i = 1}^{10}(p_{i} - EMA)^{2}}{9}}\). The higher the EMA and the lower the SEM, the better the classification performance. Moreover, to make the evaluation results more reliable, we run each experiment 10 times using ten-fold cross-validation and report the average EMA (AEMA) and average SEM (ASEM).
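For clarity, the two metrics can be computed from the per-fold accuracies of one ten-fold cross-validation run as in the short sketch below (function and variable names are our own).

```python
import numpy as np

def ema_sem(fold_accuracies):
    """EMA and SEM as defined above, given the per-fold accuracies p_1, ..., p_10."""
    p = np.asarray(fold_accuracies, dtype=float)
    ema = p.mean()
    delta = np.sqrt(((p - ema) ** 2).sum() / (len(p) - 1))   # sample standard deviation
    return ema, delta / np.sqrt(len(p))

# ema, sem = ema_sem(per_fold_accuracies)   # AEMA/ASEM average these over 10 runs
```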

5.1 General data

We select eleven UCI data sets using two criteria: (1) all attributes must be numerical; (2) there must be many attributes so that dimensionality reduction is meaningful. General information about the eleven UCI data sets is shown in Table 12.

Table 12 General information about the eleven UCI data sets used, where #I denotes the number of instances, #C denotes the number of classes and #A denotes the number of attributes

Furthermore, we compare SSDA and SDA against adaptive local linear discriminant analysis (ALLDA) [20]. To compare with ALLDA as fairly as possible, we follow the experimental settings used in [20], since we do not have the source code of ALLDA. In [20], four UCI data sets are used to test the performance of ALLDA: Australian, Heart, Pima and Diabetes. We cannot find a Diabetes data set corresponding to the description in [20], so we compare SDA and SSDA with ALLDA on the remaining three data sets. The experimental settings used in [20] are: (1) a fixed proportion of samples is randomly selected from every class as training data and the remaining samples are used as testing data; the splits of the Australian, Heart and Pima data sets are described in Table 13; (2) 1-nearest neighbour is used as the classifier and each experiment is conducted over 20 random splits; (3) the mean accuracy (Macc) and standard deviation (Std) are used to evaluate classification performance.

Table 13 General information about, and the splits of, the Australian, Heart and Pima data sets, where #C denotes the number of classes, #Training denotes the number of training samples, #Testing denotes the number of testing samples and #A denotes the number of attributes

Experimental results are presented in Tables 14 and 15. The results of ALLDA in Table 15 are cited from [20]. From these results we note the following observations:

  • LDA, SDA and SSDA achieve better performance than Original on the majority of the UCI data sets. This further supports the conclusion drawn from the artificial data sets that it is necessary to deal with the issue of within-class multimodality.

  • SSDA achieves better performance than the other three methods on the majority of data sets. In particular, SSDA outperforms LDA on all UCI data sets.

  • Compared with Original and LDA, both SDA and SSDA show superior performance on CMSC, DR, MF-fou, Parkinsons, Yeast and Isolet. This suggests that these data sets are likely to have salient within-class multimodalities. Fig. 4 is a visualisation of these data sets in a two-dimensional space by t-SNE [16], where different colours represent different classes. t-SNE is a technique for visualising high-dimensional data sets by giving each sample a location in a two- or three-dimensional space. It can be observed that these data sets comprise different classes and that some classes consist of several clusters, which correspond to within-class modalities. In particular, the presence of multimodality is clear in Parkinsons, where class one consists of several red clusters and class two consists of several cyan clusters.
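A minimal sketch of the kind of two-dimensional t-SNE embedding shown in Fig. 4 is given below; the perplexity, random seed and plotting choices are our own and are not taken from the paper.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# X: feature matrix of one data set, y: class labels
Z = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
plt.scatter(Z[:, 0], Z[:, 1], c=y, s=5, cmap="tab10")   # colour = class label
plt.title("t-SNE embedding")
plt.show()
```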

Fig. 4

Visualisation of the QSAR-B, CMSC, DR, MF-fou, M1-C1, Parkinsons, WWQ, SP, Yeast, Isolet and Vertebral data sets in a two-dimensional space

Table 14 AEMA±ASEM values with kNN (k = 1) of Original, LDA, SDA and SSDA on eleven UCI data sets
Table 15 Macc±Std values with kNN (k = 1) of Original, LDA, SDA, SSDA and ALLDA on the Australian, Heart and Pima data sets, where the results of ALLDA are cited from [20]

5.2 Face image data

We conduct face recognition experiments on two widely used face databases: the AR face database [17] and the FERET face database [21]. Face recognition is a multi-class classification problem, where each person is regarded as a class. Face recognition attempts to determine whether a face image belongs to someone in the database, given a collection of images for each person in the database. A person's set of face images may contain multiple modalities when the images are captured under different illumination conditions or head poses. The purpose of this study is therefore to test whether the within-class multimodality methods discussed in this paper can bring benefit to this problem.

In our experiments, the images are represented by their pixel values, resulting in large numbers of features. Our face recognition task therefore becomes a small sample size (SSS) problem [24]. To deal with this problem, a two-stage PCA + LDA method [1] is used: we apply PCA to reduce data dimensionality, retaining the principal components that explain 90% of the variance, before LDA, SDA and SSDA are used (a minimal sketch of this pipeline is given after the database descriptions). Details of the two face databases used in our experiments are given below:

  • AR face database: The AR face database contains frontal-view face images of 126 different persons (70 males and 56 females). Each person was photographed under different lighting conditions (normal lighting; normal lighting and left light on; normal lighting and right light on; normal lighting and both left and right lights on) and with distinct facial expressions (neutral expression, smile, anger, and scream), and some images have partial occlusions (sunglasses or scarf). For each person, a total of 13 images were taken in each of two sessions, separated by an interval of two weeks. Hence, there are 26 frontal face images per person. In our experiments, we use a subset of the AR face data set comprising 700 face images of 100 persons: the 7 non-occluded face images of each person taken under different lighting conditions and with different facial expressions in the first session. We crop the face part of each image and then resize all images to a standard size of 80 x 100 pixels (see Fig. 5(a) for some examples). Thus, every face image in the AR database has 8000 features.

  • FERET face database: The FERET face database includes over 10,000 face images with different head poses, lighting conditions and expressions. In our experiments, we use a subset of the FERET face database that consists of 700 images of 100 people, with 7 images per person. Again, the face portion of each image is cropped out and normalised to a standard size of 100 x 100 pixels (see Fig. 5(b) for some examples), giving 10000 features for each FERET image.
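The two-stage pipeline mentioned above can be sketched as follows with scikit-learn; here LinearDiscriminantAnalysis stands in for LDA (the paper additionally applies SDA and SSDA in the PCA space), and the pipeline construction is our own illustration rather than the authors' code.

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    PCA(n_components=0.90),              # keep components explaining 90% of the variance
    LinearDiscriminantAnalysis(),        # SDA/SSDA would replace this stage in the paper
    KNeighborsClassifier(n_neighbors=1), # 1-NN classification in the reduced space
)
# pipeline.fit(train_pixels, train_labels); pipeline.score(test_pixels, test_labels)
```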

Fig. 5

Sample images from the face databases

We run experiments with Original, LDA, SDA and SSDA on the AR and FERET face databases 10 times using ten-fold cross-validation. Experimental results are shown in Tables 16 and 17. It is clear that SSDA achieves higher face recognition accuracy than the other three methods on both face image databases; SDA also outperforms Original and LDA on both face databases. These results suggest that within-class multimodality does exist in these image databases, and that tackling within-class multimodality in the manner of SDA and SSDA does bring benefits.

Table 16 EMA±SEM values with kNN (k = 1) of Original, LDA, SDA and SSDA on the AR face database
Table 17 EMA±SEM values with kNN (k = 1) of Original, LDA, SDA and SSDA on the FERET face database

Furthermore, we want to see what within-class modalities SDA and SSDA find for AR and FERET, and whether the modalities found are consistent with reality. To this end, we apply SDA and SSDA to all images of AR and FERET, respectively. The maximum number of modalities for each class is set to 7 for both methods, since every person has only 7 images in the AR and FERET databases. From the within-class modalities found by SDA and SSDA, shown in Figs. 6 and 7, we obtain the following observations:

  • From Fig. 6, it is readily seen that the four modalities found by SSDA correspond to the four different illumination conditions in the AR database: normal lighting; normal lighting and left light on; normal lighting and right light on; normal lighting and both left and right lights on. Although SDA successfully finds two types of illumination modalities (normal lighting and left light on; normal lighting and both left and right lights on), it mixes up the images taken under normal lighting and those taken with the left light on.

  • For the FERET database, both SDA and SSDA find different types of within-class modalities for different classes, as shown in Fig. 7. Again, SSDA identifies two types of illumination modalities for each class: normal lighting and low lighting. However, SDA fails to find the low-lighting modality for some classes, as in Fig. 7(a)(2).

  • Apart from identifying the illumination modalities in the FERET database, SSDA can find all the correct head pose modalities for some classes (see Fig. 7(b)(3)): a frontal modality, leftwards modalities at two different angles and rightwards modalities at two different angles. SDA can also find some correct head pose modalities for some classes, for example the modalities represented by the cyan and purple dotted circles in Fig. 7(a)(3).

Therefore, all results from these experiments on two real face databases are consistent with the results on the artificial data sets. When there is within-class multimodality in the data, dealing with the multimodality problem in the manner of either SDA or SSDA is beneficial and, furthermore, the SSDA approach is better than the SDA approach. Interestingly, we have shown that SDA and SSDA offer potential solutions to a challenging problem – face recognition under different lighting and head pose conditions.

Fig. 6

Examples of modality distributions found by SDA and SSDA on the AR face database, where dotted circles with different colours represent different modalities found by SDA and SSDA. In (b), the green dotted circle represents the illumination modality with normal lighting; the cyan dotted circle represents the illumination modality with normal lighting and right light on; the orange dotted circle represents the illumination modality with normal lighting and left light on; and the red dotted circle represents the illumination modality with normal lighting and both left and right lights on

Fig. 7

Examples of modality distributions found by SDA and SSDA on the FERET face database, where dotted circles with different colours represent different modalities found by SDA and SSDA

5.3 Runtime performance

Running times of Original, LDA, SDA and SSDA are shown in Table 18. It is clear that SSDA is slower than Original and LDA but faster than SDA on most of the data sets.

Table 18 Running time, in seconds, of Original, LDA, SDA and SSDA on the eleven UCI data sets and the two face databases, over 10 runs of ten-fold cross-validation

6 Conclusion

Within-class multimodality exists in real-world data; it was first studied in [32] and more recently in [26], but many questions about within-class multimodality remain unanswered and its true value has not been fully uncovered. This paper presents an extensive study of the within-class multimodality problem through experiments on both artificial data and real data, in order to establish a strong case for within-class multimodal classification.

It has been shown, using both artificial data and real data, that when within-class multimodality is present, maximising between-subclass separation, between-class separation and within-class compactness at the same time, in the manner of SDA or SSDA, increases classification performance, with SSDA being the better approach. It has also been shown that addressing within-class multimodality in this way works best when the true number of modalities is known. Interestingly, the experiments on the face image databases suggest that SDA and SSDA offer an alternative approach to face recognition under different lighting and head pose conditions.

We believe that a strong case for within-class multimodal classification can be established. We also believe that this classification approach offers a new perspective on improving existing classification algorithms, such as Gaussian mixture models and convolutional neural networks, and even on devising new classification algorithms. These will be our future research directions.