1 Introduction

Understanding the underlying data distribution before applying a machine learning process is an important step in the analysis of data; otherwise, wrong choices may be made at the different stages of the machine learning process. Every machine learning algorithm makes, either explicitly or implicitly, assumptions about the data that must hold for it to work effectively. For linear regression, the typical assumptions include linearity (there is a linear relationship between the independent and dependent variables), exogeneity (the errors between observed and predicted values have conditional mean zero), no multicollinearity (no independent variable is a linear combination of the others), homoscedasticity (the errors have the same variance in each observation) and normality (the errors are normally distributed) [7, 23]. For random forests [2], one assumption is that changes in the dependent variable are best described by axis-aligned hyper-rectangles in the space of the independent variables (because random forests are tree-based). Another assumption is that no future value of the dependent variable will fall outside the range of values already in the training data. If the distribution of the data can be described by canonical statistical distributions, it is possible to gain considerable inferential and predictive power [15]. The key to any successful use of data in an analysis or a decision is applying the correct machine learning/statistical modelling technique to the data at hand.

In this paper we consider a particular type of data distribution where there are multiple modalities (concentrations/clusters of data) within each class, i.e., within-class multimodality, and study how to choose the right feature extraction methods to model such data more effectively. Fig. 1(a) illustrates within-class multimodality at a conceptual level, where there are two and three modalities in Class One and Class Two, respectively. Within-class multimodality is prevalent in the real world. For example, we can recognise people under different illuminations, and also in different poses. If we represent face images of the same person under different illuminations, images taken under different illuminations are likely to fall into different clusters (see Fig. 1(b) for an illustration). Indeed, face recognition under varying illumination is a challenging problem [25, 31]. The same can be said of face recognition under different head poses (see Fig. 1(c) for an illustration). Another potential application is energy disaggregation of appliances by non-intrusive load monitoring (NILM) [8, 11, 12, 19], namely disaggregating the total consumption readings into the consumption patterns of each individual appliance, where the total consumption reading of a house represents a class and the appliances in the house are the modalities within this class. Dividing a class into multiple modalities is therefore similar to disaggregating the total consumption of all appliances into the consumption of each appliance.

Fig. 1

(a) There are two modalities in Class One, and three modalities in Class Two, where different modalities are marked by different colours. (b) Each person has three different illumination modalities: two face images in the green dotted circle are taken under normal lighting; one face image in the cyan dotted circle is taken under normal lighting and right light on; one face image in the orange dotted circle is taken under normal lighting and left light on. (c) Each person has two different head pose modalities: two face images in the green dotted circle are taken with frontal head pose, and one face image in the cyan dotted circle is taken with rightwards head pose

Within-class multimodality has been largely ignored, or at least under-studied, in the literature. The closest studies are linear discriminant analysis (LDA) [5, 22], subclass discriminant analysis (SDA) [32], mixture subclass discriminant analysis (MSDA) [6], and separability-oriented subclass discriminant analysis (SSDA) [26]. Unlike LDA, which separates different classes under the assumption that each class is unimodal, SDA, MSDA and SSDA recognise that a class may be multimodal and seek to find LDA dimensions based on multimodality descriptors through the notion of subclass. SDA, MSDA and SSDA have better classification performance than LDA, which indicates the importance of within-class multimodality for classification. LDA is a classical approach to discriminant dimensionality reduction. It transforms data from the original data space into a lower dimensional space (the LDA space) so that within-class compactness and between-class separation are both maximised. This is achieved by maximising the well-known Fisher objective, which is composed of the within-class scatter matrix and the between-class scatter matrix [5, 22]. In the presence of within-class multimodality, LDA reduces dimensionality by merging the multiple modalities in each class into a single modality. SDA extends LDA in order to separate classes at a subclass level rather than at a class level. It transforms data into a lower dimensional LDA space so that between-subclass separation and within-class compactness are maximised. The SDA subclasses are discovered using the leave-one-out-test (LOOT) criterion proposed in [32] or the stability criterion [18]. MSDA extends SDA by replacing SDA's within-class scatter matrix with a new within-subclass scatter matrix. SSDA further extends SDA to minimise the level of overlap between subclasses within every class; thus between-class separation, between-subclass separation and within-class compactness are all maximised. The SSDA subclasses are discovered by an agglomerative hierarchical clustering algorithm using a new criterion called the separability criterion [26], which aims to divide each class into several non-overlapping clusters.

A lot is known about within-class unimodality classification, whose aim is to build a model assuming there is one modality per class. It is well known that simultaneously minimising intra-class variance and maximising inter-class variance increases learning performance [4, 28, 29]. However, much less is known about within-class multimodality classification, where the data distribution within each class is multimodal. Existing studies (e.g. SDA and SSDA) only scratch the surface of multimodality, and many questions remain unanswered. In this paper, we present an extensive study of within-class multimodality classification guided by the following five key questions, which are important for understanding multimodality, designing new learning algorithms and improving existing ones.

  • Question 1: Why do we consider multimodality?

  • Question 2: How many clusters should we use?

  • Question 3: How should we utilise the clusters?

  • Question 4: Do we gain real benefits on real-world data?

  • Question 5: If we keep increasing the number of modalities, what will happen?

The study of these questions is important for a number of reasons. Firstly, it will reveal a relationship between the modality of the data distribution and the comparative performance of classification, making it possible to gain insight into the data from the comparative performance of models built with different dimensionality reduction techniques. Secondly, it will establish that different dimensionality reduction techniques are suitable for different data distributions. Thirdly, it will provide a direction for improving other machine learning algorithms, such as neural networks, by designing new loss functions.

We create artificial data sets with a range of modalities and conduct extensive experiments in order to answer Questions 1-3 (and possibly Question 5). We also select real-world data sets that clearly have multiple modalities and conduct extensive experiments to answer Question 4. The contributions of this paper are highlighted as follows:

  • We answer the five key questions listed above.

  • We obtain the following useful findings: (1) when within-class multimodality is present, concurrently maximising between-class separation, within-class compactness and between-subclass separation can lead to significant performance gains; (2) within-class multimodal classification offers a competitive solution to face recognition under different lighting and face pose conditions, where each lighting/pose condition corresponds to a separate modality in the data space; (3) there is a correlation between multimodality and performance gain in within-class multimodality classification, and optimal performance can be expected if the number of modalities used in the within-class multimodality classification algorithm is the same as the true number of within-class modalities.

The rest of the paper is organised as follows. Section 2 presents relevant work including linear discriminant analysis (LDA), subclass discriminant analysis (SDA) and separability-oriented subclass discriminant analysis (SSDA). Section 3 focuses on artificial data sets and their rationale. Section 4 attempts to answer various questions about multimodality using artificial data sets, and Section 5 attempts to answer other questions using real data sets. Section 6 concludes the paper with a summary.

In the rest of the paper we use the terms cluster, subclass and modality in different contexts; they are interchangeable in this paper.

2 Related work

In this section, we present an overview of related work, including LDA, SDA and SSDA, to provide the context for this work and to introduce the necessary technical notation.

2.1 Linear discriminant analysis

Linear discriminant analysis (LDA) is a classical method for discriminant analysis. It has been widely used in many areas, such as pattern recognition [13, 14] and machine learning [10, 27]. LDA seeks a linear combination of features that separates two or more classes of objects. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before subsequent classification [30]. LDA uses a between-class scatter matrix Sb to measure the separability of classes, and a within-class scatter matrix Sw to measure the compactness of each class. LDA then attempts to find a linear projection matrix W that projects data into a new space (the LDA space), spanned by the LDA features (or LDA dimensions), such that a measure of the between-class scatter matrix Sb in the new space is maximised while the same measure of the within-class scatter matrix Sw in the new space is minimised. Sb and Sw are defined, respectively, as follows:

$$ S_{b}=\frac{1}{N}\sum\limits_{i=1}^{C} N_{i}(\mu_{i} - \mu)(\mu_{i} - \mu)^{T}, $$
(1)
$$ S_{w}=\frac{1}{N}\sum\limits_{i=1}^{C} \sum\limits_{j=1}^{N_{i}}(x_{ij} - \mu_{i})(x_{ij} - \mu_{i})^{T}, $$
(2)

where N is the number of samples, Ni is the number of samples in class i, C is the number of classes, μi is the mean of class i, μ is the global mean of all samples, and xij denotes the jth sample in class i.

LDA is an optimisation process, with the following Fisher objective:

$$ J^{LDA}(W) = \frac{tr(W^{T} S_{b} W)}{tr(W^{T} S_{w} W)}, $$
(3)

where W is a projection matrix that projects data from the data space to the LDA space. In order to find an LDA space that separates different classes well, LDA needs to find the matrix \(W^{*} = \arg\max_{W} J^{LDA}(W)\). It turns out that the sought-after projection matrix W∗ is composed of the eigenvectors corresponding to the largest eigenvalues of \(S_{w}^{-1}S_{b}\) [26], under the assumption that every class is Gaussian distributed and has the same covariance.
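As a concrete illustration, the following NumPy sketch (our own illustration, not the authors' code) builds Sb and Sw as in (1) and (2) and takes the leading eigenvectors of \(S_{w}^{-1}S_{b}\) as the projection matrix W; the function name lda_fit and the use of a pseudo-inverse to guard against a singular Sw are our own choices.

```python
import numpy as np

def lda_fit(X, y, n_components):
    """Minimal LDA sketch: build S_b and S_w as in (1)-(2), then take the
    leading eigenvectors of pinv(S_w) @ S_b as the projection matrix W."""
    N, d = X.shape
    mu = X.mean(axis=0)                                # global mean
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        diff = (mu_c - mu).reshape(-1, 1)
        S_b += (len(Xc) / N) * diff @ diff.T           # between-class scatter, eq. (1)
        S_w += ((Xc - mu_c).T @ (Xc - mu_c)) / N       # within-class scatter, eq. (2)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1]             # largest eigenvalues first
    return eigvecs[:, order[:n_components]].real       # project with X @ W
```

Projecting the data is then simply Z = X @ W, after which any classifier (1-NN in this paper) can be applied in the LDA space.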

2.2 Subclass discriminant analysis

Subclass discriminant analysis (SDA) [32] is a variant of LDA that separates classes at a subclass level rather than at a class level, based on the observation that the data distribution in a class may be multimodal (i.e., forming clusters). This is achieved by dividing each class into a set of subclasses and then running an LDA-like optimisation process to maximise between-subclass separation and within-class compactness.

The between-class scatter matrix Sb of LDA is replaced by the between-subclass scatter matrix, defined in (4):

$$ S_{b}^{SDA}= \sum\limits_{i=1}^{C-1}\sum\limits_{j=1}^{K_{i}}\sum\limits_{l=i+1}^{C}\sum\limits_{n=1}^{K_{l}} p_{ij}p_{ln}(\mu_{ij}-\mu_{ln})(\mu_{ij}-\mu_{ln})^{T}, $$
(4)

where C denotes the number of classes, Ki (Kl) denotes the number of subclasses in class i (l), μij (μln) denotes the mean of the jth (nth) subclass in class i (l), \(p_{ij}=\frac{N_{ij}}{N}\) (\(p_{ln}=\frac{N_{ln}}{N}\)) denotes the prior of the jth (nth) subclass of class i (l), and Nij (Nln) is the number of samples in the jth (nth) subclass of class i (l).
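For illustration, a hedged NumPy sketch of (4) is given below, assuming a vector `sub` that assigns each sample to a subclass within its class; the variable names and the pairwise loop over subclass statistics are our own choices rather than the original implementation.

```python
import numpy as np

def sda_between_subclass_scatter(X, y, sub):
    """Sketch of the between-subclass scatter S_b^{SDA} in (4); `sub` holds a
    subclass index for every sample (within its own class)."""
    N, d = X.shape
    stats = []                                  # (prior, mean, class) per subclass
    for c in np.unique(y):
        for s in np.unique(sub[y == c]):
            mask = (y == c) & (sub == s)
            stats.append((mask.sum() / N, X[mask].mean(axis=0), c))
    S_b = np.zeros((d, d))
    for p_ij, mu_ij, c_i in stats:
        for p_ln, mu_ln, c_l in stats:
            if c_i < c_l:                       # pairs of subclasses from different classes
                diff = (mu_ij - mu_ln).reshape(-1, 1)
                S_b += p_ij * p_ln * diff @ diff.T
    return S_b
```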

The within-class scatter matrix of SDA is redefined as the sample covariance matrix, given in (5):

$$ S_{w}^{SDA}= \frac{1}{N}\sum\limits_{j=1}^{N}(x_{j}-\mu)(x_{j}-\mu)^{T}, $$
(5)

where N, xj, and μ are the number of instances, the jth instance and the mean of all instances, respectively.

The Fisher objective is redefined as in (6):

$$ J^{SDA}(W) = \frac{tr(W^{T} S_{b}^{SDA} W)}{tr(W^{T} S_{w}^{SDA} W)}. $$
(6)

In order to divide each class into the same number of subclasses, the leave-one-out-test (LOOT) criterion [32] or the faster stability criterion [18] is used together with a nearest neighbour based clustering algorithm [32]. Firstly, the clustering algorithm sorts the samples of each class so that samples that are close in Euclidean distance are placed near each other in the ordering. To achieve this, the two samples A and B in each class that have the largest Euclidean distance between them are found and taken as the 1st and nth samples in the sorted data. After that, the samples ranked 1st to (n/2)th are those near A, and the samples ranked (n/2 + 1)th to nth are those near B. Then, based on the number of subclasses set by the user, the sorted samples of each class are divided into the specified number of subclasses. Finally, the LOOT criterion or the stability criterion is used to find the optimal number of subclasses for each class.
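One simple reading of this sorting-and-splitting procedure is sketched below (our own approximation of the nearest neighbour based clustering in [32], not the reference implementation): the two most distant samples A and B anchor the ordering, the remaining samples are ordered by their distance to A, and the ordering is cut into equal-sized contiguous subclasses.

```python
import numpy as np

def sort_and_split(Xc, n_subclasses):
    """Approximate SDA-style subclass division for the samples Xc of one class."""
    D = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=-1)
    a, b = np.unravel_index(np.argmax(D), D.shape)   # most distant pair A, B
    order = np.argsort(D[a])                         # ascending distance to A
    sub = np.empty(len(Xc), dtype=int)
    for k, idx in enumerate(np.array_split(order, n_subclasses)):
        sub[idx] = k                                 # contiguous chunks become subclasses
    return sub
```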

2.3 Separability-oriented subclass discriminant analysis

Separability-oriented subclass discriminant analysis (SSDA) [26] is an extension of SDA, which also separates classes at the subclass level. It aims to (1) maximise the between-subclass separation within every class; (2) maximise the within-class compactness; and (3) maximise the overall between-class separation. This is achieved through an LDA-like optimisation process operating at the subclass level with a different Fisher objective.

The way optimal subclasses are found for each class is very different from SDA. SSDA aims to find subclasses with no or little overlap through agglomerative hierarchical clustering guided by a separability criterion [26]. The resulting clustering is one that maximises the average Euclidean distance (AED) between the mean of a class and the means of the subclasses in that class.
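To make the criterion concrete, the sketch below (a hedged illustration; the exact comparison rule of the separability criterion is given in [26] and not reproduced here) computes the AED of a candidate partition of one class obtained by agglomerative hierarchical clustering, using scikit-learn's AgglomerativeClustering with its default linkage as an assumed stand-in for the clustering used in SSDA.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def aed(Xc, labels):
    """Average Euclidean distance between the class mean and its subclass means."""
    mu = Xc.mean(axis=0)
    return np.mean([np.linalg.norm(Xc[labels == k].mean(axis=0) - mu)
                    for k in np.unique(labels)])

def candidate_subclasses(Xc, k):
    """One candidate partition of a class into k subclasses, with its AED score;
    SSDA compares such candidates using its separability criterion."""
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(Xc)
    return labels, aed(Xc, labels)
```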

Three versions of SSDA exist [26], each with a different combination of between-class scatter matrix and within-class scatter matrix. One version is reviewed here. The between-class scatter matrix of SSDA, \(S_{b}^{SSDA}\), is defined in terms of the subclasses:

$$ S_{b}^{SSDA} = \sum\limits_{i=1}^{C}\frac{N_{i}}{N}\sum\limits_{j=1}^{K_{i}}(\mu_{ij}-\mu)(\mu_{ij}-\mu)^{T}, $$
(7)

where N is the number of samples in the data set, Ni is the number of samples in class i (\(i = 1, 2,\dots,C\), where C is the number of classes) such that \({\sum}_{i=1}^{C}{N_{i}}=N\), Ki is the number of subclasses in class i, μ is the mean of the whole data set and μij is the mean of subclass j of class i.

The within-class scatter matrix is the standard LDA within-class scatter matrix, \(S_{w}^{SSDA}=S_{w}\). The Fisher objective of SSDA, JSSDA(W), is therefore given in (8), with Sb replaced by \(S_{b}^{SSDA}\). We summarise SSDA in Algorithm 1 and show its main steps in the flowchart of Fig. 2, where the notation has the same meaning as in Algorithm 1; a small code sketch of the SSDA scatter construction follows the figure.

$$ J^{SSDA}(W) = \frac{tr(W^{T} S_{b}^{SSDA} W)}{tr(W^{T} S_{w}^{SSDA} W)}=\frac{tr(W^{T} S_{b}^{SSDA} W)}{tr(W^{T} S_{w} W)}. $$
(8)
Algorithm 1 The SSDA algorithm
Fig. 2

The flowchart of the SSDA algorithm
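The following NumPy sketch (our own illustration, assuming subclass labels `sub` have already been produced by the clustering step above) constructs \(S_{b}^{SSDA}\) as in (7); the projection W is then obtained exactly as for LDA, from the leading eigenvectors of \(S_{w}^{-1}S_{b}^{SSDA}\).

```python
import numpy as np

def ssda_between_scatter(X, y, sub):
    """Sketch of S_b^{SSDA} from (7): subclass means measured against the
    global mean, weighted by the class priors N_i / N."""
    N, d = X.shape
    mu = X.mean(axis=0)
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        w = (y == c).sum() / N                        # class prior N_i / N
        for s in np.unique(sub[y == c]):
            mu_ij = X[(y == c) & (sub == s)].mean(axis=0)
            diff = (mu_ij - mu).reshape(-1, 1)
            S_b += w * diff @ diff.T
    return S_b

# W then comes from the leading eigenvectors of pinv(S_w) @ ssda_between_scatter(X, y, sub),
# with S_w the standard LDA within-class scatter (see the LDA sketch in Section 2.1).
```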

3 Artificial data

In order to answer the research questions mentioned above, we generate four types of artificial data.

  • Type 1 consists of two different classes, and the samples in each class are drawn from a single multivariate normal distribution. This type is denoted by C2M1.

  • Type 2 consists of two different classes, and every class has two subclasses of samples generated from two multivariate normal distributions. This type is denoted by C2M2.

  • Type 3 consists of two different classes, and every class has three subclasses of samples generated from three multivariate normal distributions. This type is denoted by C2M3.

  • Type 4 consists of three different classes, and every class has three subclasses of samples generated from three multivariate normal distributions. This type is denoted by C3M3.

The number of variables is one parameter of a multivariate normal distribution; it is set to 30 for all types of artificial data in our studies. Two other important parameters, the mean μ and the covariance σ, are needed to generate artificial data from a multivariate normal distribution. In our studies, the mean μ is a 1-by-30 vector whose values are integers chosen randomly from the range [1,10]. The covariance σ is a 30-by-30 diagonal matrix. There are two covariance matrices for C2M1, one for each class. The values of one covariance matrix for C2M1 are integers chosen randomly from the range [10,21], and the values of the other covariance matrix are integers chosen randomly from the range [20,41].

There are four covariance matrices for C2M2, one for each subclass and two for each class (there are two subclasses in each class). For class one, the values of the covariance matrices for the two subclasses are integers chosen randomly from the range [10,21], and the values of the covariance matrices for the two subclasses of class two are integers chosen randomly from the range [20,41].

There are six covariance matrices for C2M3, one for each subclass and three for each class. For class one, the values of the covariance matrices for the three subclasses are integers chosen randomly from the range [10,21]. For class two, the values of the covariance matrices for the three subclasses are integers chosen randomly from the range [20,41].

There are nine covariance matrices for C3M3, one for each subclass and three for each class. For class one, the values of the covariance matrices for the three subclasses are integers chosen randomly from the range [1,10]. For class two and class three, the values of the covariance matrices for the three subclasses are integers chosen randomly from the ranges [10,21] and [20,41], respectively.

In total, 10 data sets are generated for each type, and every class of every artificial data set (of any type) has 1000 samples. Therefore C2M1, C2M2 and C2M3 each have a total of 2000 samples, with 1000 per class. For C2M2 and C2M3, the samples in each class are randomly placed into two and three subclasses respectively, according to a probability distribution that varies from data set 1 to data set 10. C3M3 has a total of 3000 samples, with 1000 per class. The samples in each class are randomly placed into three subclasses in the same way as for C2M2 and C2M3. The actual numbers of samples per subclass are shown in Tables 2, 3 and 4.
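As an illustration of this generation procedure, the sketch below builds one C2M2-style data set with NumPy; the 500/500 subclass split, the random seed and the helper names are our own choices (the actual data sets use subclass proportions that vary from data set 1 to 10).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 30                                                 # number of variables

def diag_cov(low, high):
    """30-by-30 diagonal covariance with integer entries drawn from [low, high]."""
    return np.diag(rng.integers(low, high + 1, size=D).astype(float))

def make_subclass(n, cov_range):
    """One modality: multivariate normal with a random integer mean in [1, 10]."""
    mean = rng.integers(1, 11, size=D).astype(float)
    return rng.multivariate_normal(mean, diag_cov(*cov_range), size=n)

# C2M2-style data: two classes, two modalities per class, 1000 samples per class
X1 = np.vstack([make_subclass(500, (10, 21)), make_subclass(500, (10, 21))])
X2 = np.vstack([make_subclass(500, (20, 41)), make_subclass(500, (20, 41))])
X, y = np.vstack([X1, X2]), np.array([0] * 1000 + [1] * 1000)
```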

4 Multimodality in artificial data

Multiple modalities exist in data. In order to gain full insight into the issue of within-class multimodality, various questions can be asked and answered. In the Introduction, some questions are posed explicitly, and the rest of this paper seeks answers to them. Some questions are answered using artificial data in this section; others are answered using real-world data in the next section.

4.1 Q1: Is it necessary to address within-class multimodality?

To answer this question we consider and compare experimentally three approaches in the presence of within-class multimodality:

  • separating within-class modalities for every class through the extraction of features by dimensionality reduction methods such as SDA and SSDA;

  • merging within-class modalities as a uni-modality for every class in the process of feature extraction using a dimensionality reduction method such as LDA; and

  • doing nothing about within-class multimodality and using the original data for classification.

In order to evaluate these three approaches, we conduct experiments using k-nearest neighbour (kNN, k = 1) as the classifier on all of the artificial data sets. We consider four cases: (1) Original: the original artificial data sets without any dimensionality reduction; (2) LDA processed; (3) SDA processed; (4) SSDA processed. In addition, we use one half of each data set for training and the other half for testing.
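A hedged sketch of this evaluation protocol is given below using scikit-learn; the stratified split, the random seed and the use of scikit-learn's LinearDiscriminantAnalysis in place of the paper's own LDA/SDA/SSDA implementations are our assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def accuracy_1nn(X, y, reducer=None, seed=0):
    """Half of the data for training, half for testing; 1-NN on either the
    original features or the features produced by a fitted reducer."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    if reducer is not None:                      # e.g. LinearDiscriminantAnalysis()
        X_tr = reducer.fit_transform(X_tr, y_tr)
        X_te = reducer.transform(X_te)
    return KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr).score(X_te, y_te)

# acc_original = accuracy_1nn(X, y)
# acc_lda      = accuracy_1nn(X, y, reducer=LinearDiscriminantAnalysis())
```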

Tables 1, 2, 3 and 4 show the experimental results in the four cases on all of the artificial data sets. From these results, we can observe the following:

  • It is apparent that SSDA outperforms Original and LDA on all artificial data sets. In particular, SSDA improves classification accuracy over Original by at least 9% on all of the C2M1, C2M2 and C2M3 data sets, and by at least 14% on the C3M3 data sets.

  • LDA, SDA and SSDA outperform Original consistently, so dimensionality reduction in the style of LDA can indeed improve classification performance significantly. Whilst this is not new, it indicates that doing nothing about multimodality is suboptimal.

  • When there is only one modality per class: it is clear from Table 1 that the differences between LDA, SDA and SSDA do not appear to be significant. This suggests that when there is only one modality per class, dimensionality reduction using SDA or SSDA makes little difference compared with using LDA.

  • As for LDA and its variants, we can rank them in terms of their performance as LDA ≤ SDA ≤ SSDA on the artificial data sets with within-class multimodality, namely C2M2, C2M3 and C3M3. This suggests that dealing with within-class multimodality in the SSDA way is better.

  • When there are multiple modalities per class: from Tables 2, 3 and 4, it is clear that dimensionality reduction at the subclass level, as in SDA or SSDA, is better than at the class level, as in LDA. Furthermore, SSDA clearly outperforms SDA in these experiments. This suggests that separating the subclasses (in other words, reducing the overlap of different subclasses) within every class while at the same time separating all classes is a better approach than simply pulling the subclasses of one class away from the subclasses of other classes.

  • When the number of modalities per class increases: according to Tables 1, 2 and 3, the classification accuracy generally drops for all methods, suggesting that the complexity of the problem increases. This can be seen more clearly in Fig. 3. Interestingly, the margin of performance drop is smallest for SSDA, suggesting that SSDA is more robust than Original, LDA and SDA when the number of modalities per class changes.

From these observations we can draw the conclusion that it is indeed necessary to deal with the issue of within-class multimodality. This conclusion is further confirmed using the real data sets in Section 5.

Fig. 3

The classification performance of Original, LDA, SDA and SSDA on ten C2M1 data sets, ten C2M2 data sets and ten C2M3 data sets: In the line charts, the horizontal axis shows the ten data sets from C2M1, C2M2 and C2M3, and the vertical axis shows the classification accuracy

Table 1 Classification accuracy with kNN (k = 1) of Original, LDA, SDA and SSDA on ten C2M1 data sets
Table 2 Classification accuracy with kNN (k = 1) of Original, LDA, SDA and SSDA on ten C2M2 data sets, along with the ratio between the numbers of samples from different subclasses in each class
Table 3 Classification accuracy with kNN (k = 1) of Original, LDA, SDA and SSDA on ten C2M3 data sets, along with the ratio between the numbers of samples from different subclasses in each class
Table 4 Classification accuracy with kNN (k = 1) of Original, LDA, SDA and SSDA on ten C3M3 data sets, along with the ratio between the numbers of samples from different subclasses in each class

4.2 Q2: How many within-class modalities should we use?

There is a clear difference between SDA and SSDA in terms of classification accuracy, as shown in Tables 1, 2, 3 and 4. SDA and SSDA both try to separate classes at the subclass level, but they differ in two ways: (1) how to find the within-class modalities; and (2) once found, how to make use of these modalities. We examine the first issue in this subsection and discuss the second issue in Subsection 4.3.

SDA uses a stability criterion to find class modalities, whereas SSDA uses a separability criterion. Tables 5, 6, 7 and 8 show the numbers of class modalities found by SDA and SSDA for the 10 data sets of types C2M1, C2M2, C2M3 and C3M3, respectively. It is clear that the numbers found by SDA and SSDA are quite different. The numbers found by SSDA are in general quite close to the true numbers of within-class modalities, and identical for most of the data sets. Apart from a few cases, the numbers found by SDA are quite different from the true numbers.

Table 5 The number of subclasses found by SDA and SSDA for each class in the C2M1 data sets
Table 6 The number of subclasses found by SDA and SSDA for each class in the C2M2 data sets
Table 7 The number of subclasses found by SDA and SSDA for each class in the C2M3 data sets
Table 8 The number of subclasses found by SDA and SSDA for each class in the C3M3 data sets

Furthermore, SSDA can find the true within-class modalities even for classes with imbalanced proportions of data between subclasses. For instance, SSDA separates each of Class One, Class Two and Class Three of C3M3-10 into three modalities, even though their subclass ratios are 16 : 500 : 484, 23 : 462 : 515 and 20 : 487 : 493, respectively.

All of these observations suggest that (1) consistently good classification performance relies on finding the correct number of modalities; and (2) SSDA finds the number of within-class modalities more accurately than SDA. This will be further verified on the two face databases in Section 5.2.

4.3 Q3: How should we utilise the modalities?

After the multiple within-class modalities are found, we need to utilise them in order to reduce dimensionality for the purpose of effective classification. SDA and SSDA provide different solutions, both based on the LDA optimisation process but with different Fisher objectives. To compare the two solutions, we apply the SDA and SSDA optimisation processes to the artificial data sets that contain within-class multimodality (i.e., C2M2, C2M3 and C3M3). In addition, the true number of within-class modalities (True-MN) is used in both SDA and SSDA. The experimental results are presented in Tables 9, 10 and 11.

Table 9 The classification accuracy with kNN (k = 1) of SDA and SSDA using the true number of within-class modalities on the C2M2 data sets
Table 10 The classification accuracy with kNN (k = 1) of SDA and SSDA using the true number of within-class modalities on the C2M3 data sets
Table 11 The classification accuracy with kNN (k = 1) of SDA and SSDA using the true number of within-class modalities on the C3M3 data sets

From Tables 9, 10 and 11, it is clear that the performance of SSDA with True-MN is consistently higher than that of SDA with True-MN. This suggests that the SSDA optimisation process utilises the modalities better than the SDA optimisation process. Furthermore, it shows that maximising inter-subclass and inter-class separation at the same time is a worthwhile goal of LDA-like dimensionality reduction when the true modalities are found in the data.

5 Multimodality in real data

Separating within-class multimodalities results in good performance on artificial data, when the modality of the data is known. For real-world data, the modality of the data is unknown even if we believe that there should be multimodality, e.g., as in the problem of face recognition discussed in Section 1. Can we obtain real benefits by addressing within-class multimodality in real-world data in the same way as for artificial data? This is the question we want to answer in this section. We consider two types of data. One is general data from the UCI data repository [3]; the other is face image data, as it is intuitively plausible that there is within-class multimodality associated with lighting conditions and head pose.

In our experiments, we consider k-nearest neighbour (kNN, k = 1) as the classifier. In this paper we study the within-class multimodality classification problem by focusing on extracting discriminant features. Some commonly used classifiers have built-in feature selection/extraction functions; for example, Support Vector Machines (SVM) and Decision Trees (DT) select features as part of the learning process. The kNN classifier does not have any built-in feature selection/extraction function, so it is selected and used in our experiments. Additionally, we use ten-fold cross-validation as the evaluation framework, and Estimated Mean Accuracy (EMA) and Standard Error of the Mean (SEM) [9] as the evaluation metrics: \(EMA = \frac{{\sum}_{i = 1}^{10}p_{i}}{10}\), where pi denotes the percentage of correct classifications in the ith fold; \(SEM = \frac{\delta}{\sqrt{10}}\), where \(\delta = \sqrt{\frac{{\sum}_{i = 1}^{10}(p_{i} - EMA)^{2}}{9}}\). The higher the EMA and the lower the SEM, the better the classification performance. Moreover, to make the evaluation results more reliable, we run each experiment 10 times using ten-fold cross-validation and report the average EMA (AEMA) and average SEM (ASEM).
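For clarity, the two metrics can be computed from the per-fold accuracies of one ten-fold cross-validation run as in the short sketch below (function and variable names are our own).

```python
import numpy as np

def ema_sem(fold_accuracies):
    """EMA and SEM as defined above, given the per-fold accuracies p_1, ..., p_10."""
    p = np.asarray(fold_accuracies, dtype=float)
    ema = p.mean()
    delta = np.sqrt(((p - ema) ** 2).sum() / (len(p) - 1))   # sample standard deviation
    return ema, delta / np.sqrt(len(p))

# ema, sem = ema_sem(per_fold_accuracies)   # AEMA/ASEM average these over 10 runs
```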

5.1 General data

We select eleven UCI data sets using two criteria: (1) all attributes must be numerical; (2) there must be many attributes so that dimensionality reduction is meaningful. General information about the eleven UCI data sets is shown in Table 12.

Table 12 General information about the eleven UCI data sets used, where #I denotes the number of instances, #C denotes the number of classes and #A denotes the number of attributes

Furthermore, we compare SSDA and SDA against adaptive local linear discriminant analysis (ALLDA) [20]. To compare with ALLDA as fairly as possible, we follow the experimental settings used in [20], since we do not have the source code of ALLDA. In [20], four UCI data sets are used to test the performance of ALLDA: Australian, Heart, Pima and Diabetes. We cannot find a Diabetes data set corresponding to the description in [20], so we compare SDA and SSDA with ALLDA on the remaining three data sets. The experimental settings used in [20] are: (1) a fixed proportion of samples is randomly selected from every class as training data and the remaining samples are used as testing data; the splits of the Australian, Heart and Pima data sets are described in Table 13; (2) 1-nearest neighbour is used as the classifier and each experiment is conducted over 20 random splits; (3) the mean accuracy (Macc) and standard deviation (Std) are used to evaluate classification performance.

Table 13 General information about, and the splits of, the Australian, Heart and Pima data sets, where #C denotes the number of classes, #Training denotes the number of training samples, #Testing denotes the number of testing samples and #A denotes the number of attributes

Experimental results are presented in Tables 14 and 15. The results of ALLDA in Table 15 are cited from [20]. From these results we note the following observations:

  • LDA, SDA and SSDA achieve better performance than Original on the majority of the UCI data sets. This further supports the conclusion drawn from the artificial data sets that it is necessary to deal with the issue of within-class multimodality.

  • SSDA achieves better performance than the other three methods on the majority of data sets. In particular, SSDA outperforms LDA on all UCI data sets.

  • Compared with Original and LDA, both SDA and SSDA show superior performance on CMSC, DR, MF-fou, Parkinsons, Yeast and Isolet. This suggests that these data sets are likely to have salient within-class multimodalities. Fig. 4 is a visualisation of these data sets in a two-dimensional space by t-SNE [16], where different colours represent different classes. t-SNE is a technique for visualising high-dimensional data sets by giving each sample a location in a two- or three-dimensional space. It can be observed that these data sets comprise different classes and that some classes consist of several clusters, which correspond to within-class modalities. In particular, the presence of multimodality is clear in Parkinsons, where class one consists of several red clusters and class two consists of several cyan clusters.
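A minimal sketch of the kind of two-dimensional t-SNE embedding shown in Fig. 4 is given below; the perplexity, random seed and plotting choices are our own and are not taken from the paper.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# X: feature matrix of one data set, y: class labels
Z = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
plt.scatter(Z[:, 0], Z[:, 1], c=y, s=5, cmap="tab10")   # colour = class label
plt.title("t-SNE embedding")
plt.show()
```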

Fig. 4

Visualisation of the QSAR-B, CMSC, DR, MF-fou, M1-C1, Parkinsons, WWQ, SP, Yeast, Isolet and Vertebral data sets in a two-dimensional space

Table 14 AEMA±ASEM values with kNN (k = 1) of Original, LDA, SDA and SSDA on eleven UCI data sets
Table 15 Macc±Std values with kNN (k = 1) of Original, LDA, SDA, SSDA and ALLDA on the Australian, Heart and Pima data sets, where the results of ALLDA are cited from [20]

5.2 Face image data

We conduct face recognition experiments on two widely used face databases: the AR face database [17] and the FERET face database [21]. Face recognition is a multi-class classification problem, where each person is regarded as a class. Face recognition attempts to determine whether a face image belongs to someone in the database, given a collection of images for each person in the database. A person's set of face images may contain multiple modalities when the images are captured under different illumination conditions or head poses. The purpose of this study is therefore to test whether the within-class multimodality methods discussed in this paper can bring benefit to this problem.

In our experiments, the images are represented by their pixel values, resulting in large numbers of features. Our face recognition task therefore becomes a small sample size (SSS) problem [24]. To deal with this problem, a two-stage PCA + LDA method [1] is used: we apply PCA to reduce data dimensionality, retaining the principal components that explain 90% of the variance, before LDA, SDA and SSDA are used (a minimal sketch of this pipeline is given after the database descriptions). Details of the two face databases used in our experiments are given below:

  • AR face database: The AR face database contains frontal-view face images of 126 different persons (70 males and 56 females). Each person was photographed under different lighting conditions (normal lighting; normal lighting and left light on; normal lighting and right light on; normal lighting and both left and right lights on) and with distinct facial expressions (neutral expression, smile, anger, and scream), and some images have partial occlusions (sunglasses or scarf). For each person, a total of 13 images were taken in each of two sessions, separated by an interval of two weeks. Hence, there are 26 frontal face images per person. In our experiments, we use a subset of the AR face data set comprising 700 face images of 100 persons: the 7 non-occluded face images of each person taken under different lighting conditions and with different facial expressions in the first session. We crop the face part of each image and then resize all images to a standard size of 80 x 100 pixels (see Fig. 5(a) for some examples). Thus, every face image in the AR database has 8000 features.

  • FERET face database: The FERET face database includes over 10,000 face images with different head poses, lighting conditions and expressions. In our experiments, we use a subset of the FERET face database that consists of 700 images of 100 people, with 7 images per person. Again, the face portion of each image is cropped out and normalised to a standard size of 100 x 100 pixels (see Fig. 5(b) for some examples), giving 10000 features for each FERET image.
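The two-stage pipeline mentioned above can be sketched as follows with scikit-learn; here LinearDiscriminantAnalysis stands in for LDA (the paper additionally applies SDA and SSDA in the PCA space), and the pipeline construction is our own illustration rather than the authors' code.

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    PCA(n_components=0.90),              # keep components explaining 90% of the variance
    LinearDiscriminantAnalysis(),        # SDA/SSDA would replace this stage in the paper
    KNeighborsClassifier(n_neighbors=1), # 1-NN classification in the reduced space
)
# pipeline.fit(train_pixels, train_labels); pipeline.score(test_pixels, test_labels)
```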

Fig. 5

Sample images from the face databases

We run experiments with Original, LDA, SDA and SSDA on the AR and FERET face databases 10 times using ten-fold cross-validation. Experimental results are shown in Tables 16 and 17. It is clear that SSDA achieves higher face recognition accuracy than the other three methods on both face image databases; SDA also outperforms Original and LDA on both face databases. These results suggest that within-class multimodality does exist in these image databases, and that tackling within-class multimodality in the manner of SDA and SSDA does bring benefits.

Table 16 EMA±SEM values with kNN (k = 1) of Original, LDA, SDA and SSDA on the AR face database
Table 17 EMA±SEM values with kNN (k = 1) of Original, LDA, SDA and SSDA on the FERET face database

Furthermore, we want to see what within-class modalities SDA and SSDA find for AR and FERET, and whether the modalities found are consistent with reality. To this end, we apply SDA and SSDA to all images of AR and FERET, respectively. The maximum number of modalities for each class is set to 7 for both methods, since every person has only 7 images in the AR and FERET databases. From the within-class modalities found by SDA and SSDA, shown in Figs. 6 and 7, we obtain the following observations:

  • From Fig. 6, it is readily seen that the four modalities found by SSDA correspond to the four different illumination conditions in the AR database: normal lighting; normal lighting and left light on; normal lighting and right light on; normal lighting and both left and right lights on. Although SDA successfully finds two types of illumination modalities (normal lighting and left light on; normal lighting and both left and right lights on), it mixes up the images taken under normal lighting and those taken with the left light on.

  • For the FERET database, both SDA and SSDA find different types of within-class modalities for different classes, as shown in Fig. 7. Again, SSDA identifies two types of illumination modalities for each class: normal lighting and low lighting. However, SDA fails to find the low-lighting modality for some classes, as in Fig. 7(a)(2).

  • Apart from identifying the illumination modalities in the FERET database, SSDA can find all the correct head pose modalities for some classes (see Fig. 7(b)(3)): a frontal modality, leftwards modalities at two different angles and rightwards modalities at two different angles. SDA can also find some correct head pose modalities for some classes, for example the modalities represented by the cyan and purple dotted circles in Fig. 7(a)(3).

Therefore, all results from these experiments on two real face databases are consistent with the results on the artificial data sets. When there is within-class multimodality in the data, dealing with the multimodality problem in the manner of either SDA or SSDA is beneficial and, furthermore, the SSDA approach is better than the SDA approach. Interestingly, we have shown that SDA and SSDA offer potential solutions to a challenging problem – face recognition under different lighting and head pose conditions.

Fig. 6

Examples of modality distributions found by SDA and SSDA on the AR face database, where dotted circles with different colours represent different modalities found by SDA and SSDA. In (b), the green dotted circle represents the illumination modality with normal lighting; the cyan dotted circle represents the illumination modality with normal lighting and right light on; the orange dotted circle represents the illumination modality with normal lighting and left light on; and the red dotted circle represents the illumination modality with normal lighting and both left and right lights on

Fig. 7

Examples of modality distributions found by SDA and SSDA on the FERET face database, where dotted circles with different colours represent different modalities found by SDA and SSDA

5.3 Runtime performance

Running times of Original, LDA, SDA and SSDA are shown in Table 18. It is clear that SSDA is slower than Original and LDA but faster than SDA on most of the data sets.

Table 18 Running time, in seconds, of Original, LDA, SDA and SSDA on the eleven UCI data sets and the two face databases, over 10 runs of ten-fold cross-validation

6 Conclusion

Within-class multimodality exists in real-world data; it was first studied in [32] and more recently in [26], but many questions about within-class multimodality remain unanswered and its true value has not been fully uncovered. This paper presents an extensive study of the within-class multimodality problem through experiments on both artificial data and real data, in order to establish a strong case for within-class multimodal classification.

It has been shown, using both artificial data and real data, that when within-class multimodality is present, maximising between-subclass separation, between-class separation and within-class compactness at the same time, in the manner of SDA or SSDA, increases classification performance, with SSDA being the better approach. It has also been shown that addressing within-class multimodality in this way works best when the true number of modalities is known. Interestingly, the experiments on the face image databases suggest that SDA and SSDA offer an alternative approach to face recognition under different lighting and head pose conditions.

We believe that a strong case for within-class multimodal classification can be established. We also believe that this classification approach offers a new perspective on improving existing classification algorithms, such as Gaussian mixture models and convolutional neural networks, and even on devising new classification algorithms. These will be our future research directions.