Semi-supervised Ladder Networks for Speech Emotion Recognition

As a major component of speech signal processing, speech emotion recognition has become increasingly essential to understanding human communication. Benefiting from deep learning, many researchers have proposed various unsupervised models to extract effective emotional features and supervised models to train emotion recognition systems. In this paper, we utilize semi-supervised ladder networks for speech emotion recognition. The model is trained by minimizing the supervised loss and an auxiliary unsupervised cost function. The addition of the unsupervised auxiliary task provides powerful discriminative representations of the input features, and also acts as a regularizer for the emotional supervised task. We also compare the ladder network with other classical autoencoder structures. The experiments were conducted on the interactive emotional dyadic motion capture (IEMOCAP) database, and the results reveal that the proposed method achieves superior performance with a small number of labelled samples and outperforms the other methods.


Introduction
As one of the main information mediums in human communication, speech contains not only basic language information, but also a wealth of emotional information. Emotion can help people understand real expressions and potential intentions. Speech emotion recognition (SER) has many applications in human-computer interaction, since it can help machines understand emotional states like human beings do [1]. For example, speech emotion recognition can be utilized to monitor customers' emotional states, which reflect service quality, in call centers. This information can help improve service quality and reduce the workload of manual evaluation [2].
Emotion is conventionally represented as several discrete human emotional moods such as happiness, sadness and anger over utterances [3]. In speech emotion recognition, the establishment of a speech emotional database is based on the premise that every speech utterance is assigned to one of several emotional categories. As a result, most researchers regard speech emotion recognition as a typical supervised learning task. Given the emotional database, classification models are trained to predict an exact emotional label for each utterance. Thus, many conventional machine learning methods have been applied successfully in speech emotion recognition. Hidden Markov models (HMMs) and Gaussian mixture models (GMMs), which emphasize the temporal nature of the speech signal and had achieved great performance in speech recognition, were also applied in SER [4, 5]. Support vector machines (SVMs), which have the advantage of modeling small data sets, usually achieved better performance than other alternative models [6]. Inspired by the success of various tasks with deep learning [7, 8], numerous research efforts have been made to build effective speech emotion recognition models with deep neural networks (DNNs), leading to impressive achievements [9, 10].
However, speech emotion recognition still faces many challenges, such as the diversity of speakers, genders, languages and cultures, all of which influence system performance. Differences in recording conditions also harm the stability of such systems. While automatic systems have been shown to outperform naive human listeners on speech emotion classification [11], existing SER systems are not as mature as those for speech and image classification tasks. One of the serious problems is the shortage of emotional data, which limits the robustness of the models.
Supervised classification methods estimate the emotional class by learning the differences between categories. A sufficiently large number of labelled speech emotional samples is necessary to guarantee accurate decision boundaries. However, the acquisition of labelled data demands experts' knowledge and is also highly time consuming. Even worse, there is considerable ambiguity and subjectivity at the boundaries of the emotions, since the expressions and perceptions of different people differ [12]. Thus, there is no definite standard for providing emotional labels. Due to these shortcomings, the number of speech emotion databases is limited, and they cannot cover the diversity of different conditions [13].
Considering the scarcity of speech emotion data, it is beneficial to take full advantage of the information from unlabeled data. Unsupervised learning is one choice, which extracts robust feature representations from the data automatically without depending on label information. This technique can depict the intrinsic structures of the data, and has stronger modeling and generalization ability for training better classification models [14]. Various unsupervised feature learning approaches have been explored to generate salient emotional feature representations for speech emotion recognition, such as autoencoders (AE) [15] and denoising autoencoders (DAE) [14]. The purpose of AE and DAE is to obtain intermediate feature representations which can rebuild the input data as faithfully as possible. Other sophisticated methods, such as variational autoencoders (VAE) [16] and generative adversarial networks (GAN) [17], have achieved better performance in SER. They emphasize modeling the distribution of the data rather than the data itself, in an explicit form such as a normal distribution for VAE and in an implicit form for GAN.
The feature representations learned from unsupervised models are usually used as the inputs of supervised classification models to train speech emotion recognition systems. Nevertheless, such an approach has an underlying problem. The former unsupervised learning plays the role of the feature extractor, whose target is to recover the input signals perfectly. This means all information persists as much as possible, yet we only need to focus on emotionally relevant information. On the other hand, the latter supervised learning only concentrates on the information that is good for classification prediction. Extra information that may be complementary for SER is dropped. Therefore, the feature representations learned from unsupervised learning may not necessarily support the supervised classification task. The objectives of the two steps, the unsupervised part and the supervised part, are not consistent because they are trained separately.
To address this problem, deep semi-supervised learning has been proposed [18-20]. Semi-supervised learning is the combination of unsupervised feature representation learning and supervised model training. The key is that these two parts are trained simultaneously so that the feature representations obtained from unsupervised learning accord better with the supervised model. Typical structures, such as semi-supervised variational autoencoders [19] and ladder networks [18], have achieved competitive performance with fewer labelled training samples in other areas.
Benefiting from the unsupervised learning part, semi-supervised learning can learn rich feature representations with the aid of many unlabeled examples to improve the performance of supervised tasks. Due to the scarcity of speech emotional data and the abundance of general speech data, it is appropriate to apply semi-supervised learning approaches to speech emotion recognition. In fact, the auxiliary unsupervised learning part also plays the role of regularization in the semi-supervised learning model. Regularization is essential to develop speech emotion recognition systems that generalize across different conditions [21]. Conventional models obtain poor performance when the training and testing databases differ [22, 23]. By training models that are optimized for primary and auxiliary tasks, the feature representations become more general, avoiding overfitting to a particular domain. It is therefore appealing to create unsupervised auxiliary tasks to regularize the network.
A classic semi-supervised learning structure is an autoencoder which introduces additional unsupervised learning. The autoencoder structure can be replaced by other structures like DAE and VAE, and more layers can be stacked. A more advanced structure is the semi-supervised ladder network [18, 24]. Similar to DAE, every layer of a ladder network is intended to reconstruct its corrupted inputs. Furthermore, the ladder network adds lateral connections between each layer of the encoder and decoder, which distinguishes it from DAE. Figuratively, this is also the meaning of the term "ladder", and it reflects the deep multilayer structure of the ladder network. The attraction of hierarchical layer models is the ability to model latent variables from low layers to high layers. Generally, low layers represent specific information while high layers can generate abstract features which are invariant and relevant for classification tasks. This can model more complex nonlinear structures than conventional methods [25].
Most unsupervised methods aim to learn intermediate feature representations that may not support the underlying emotion classification task. This paper proposes to employ the unsupervised reconstruction of the inputs as an auxiliary task to regularize the network, while optimizing the performance of an emotion classification system. We efficiently achieve this goal with a semi-supervised ladder network. The addition of the unsupervised auxiliary task not only provides powerful discriminative representations of the input features, but also acts as a regularizer for the primary supervised emotion task. The core contributions of this paper can be summarized as follows: 1) We utilize semi-supervised learning with a ladder network for speech emotion recognition. We emphasize the importance of the unsupervised reconstruction and skip connection modules. In addition, higher layers of the ladder network have a better ability to obtain discriminative features.
2) We show the benefit of semi-supervised ladder networks and that the promising results can be obtained with only a small number of labelled samples.
3) We compare the ladder network with DAE and VAE methods for emotion recognition from speech, showing the superior performance of the ladder network. In addition, a convolutional neural network structure for the encoder and decoder has a better ability to encode emotional characteristics.
The remainder of the paper is organized as follows. Section 2 discusses the related work. In Section 3, we describe our proposed methods. We then present the dataset and acoustic features used for the experiments in Section 4. Section 5 presents the experimental results and analysis. Finally, Section 6 concludes this paper.

Related work
Traditional speech emotion recognition relies on well-established hand-crafted speech emotional features. The most popular acoustic features are frame-level low-level descriptors (LLDs) such as mel-frequency cepstral coefficients (MFCC), followed by utterance-level information extraction with different functionals, such as the mean and maximum [26, 27]. With the great achievements of deep learning, a large number of approaches utilize DNNs to extract effective emotional feature representations, which are then fed as inputs to an emotional classifier. Kim et al. [28] captured high-order nonlinear relationships from multimodal data with four deep belief network (DBN) architectures. First, the audio features and video features were fed into their individual layers; then their outputs were concatenated in a later layer to generate the final multimodal fusion emotional features. Finally, an SVM classifier was used to evaluate their performance.
Various autoencoders have been widely applied in speech emotion recognition. Deng et al. [29] proposed shared-hidden-layer autoencoders for common feature transfer learning to cope with the mismatch between corpora. They then extended this to sparse autoencoders (SAE) [30]. In the source domain, every emotional class trained its own individual SAE model. After that, all training data of the target domain were reconstructed with the corresponding SAE of the same class to alleviate the difference. The newly reconstructed data were regarded as training data to train an SVM model to predict test samples. Furthermore, they substituted SAE with DAE to obtain a performance gain [6]. Xia and Liu [31] proposed a modified DAE to distinguish emotional representations from non-emotional factors like speakers and genders. They designed two hidden layers separately to represent emotional and non-emotional representations in parallel. The non-emotional layer was trained first like a normal autoencoder. Then, the emotional layer was trained with the non-emotional layer frozen. Finally, the emotional representations were the inputs of an SVM for speech emotion classification. Next, they incorporated gender information to model emotion-specific characteristics for a further performance gain [32].
Ghosh et al. [33, 34] combined DAE and bidirectional long short-term memory (BLSTM) AE to obtain more robust emotional representations from the original speech spectrogram. They utilized a multilayer perceptron (MLP) to evaluate the performance of the generated latent representations. Eskimez et al. [35] systematically investigated four kinds of unsupervised feature learning methods, DAE, VAE, adversarial autoencoder (AAE) and adversarial variational Bayes (AVB), for improving the performance of speech emotion recognition. They showed that the models which emphasized the distribution of speech emotional data, namely VAE, AAE and AVB, outperformed DAE.
Deng et al. [36] proposed semi-supervised autoencoders to improve the performance of SER. This was achieved by regarding the unlabeled data as an extra class, which explicitly aided the supervised learning by incorporating prior information from unlabeled samples. In this paper, our work builds upon the ladder network to further explore the influence of semi-supervised learning for speech emotion recognition.
Valpola [24] proposed the ladder network to reinforce autoencoder networks. The unsupervised tasks involve the reconstruction of hidden representations of a denoising autoencoder with lateral connections between the encoder and decoder layers. Rasmus et al. [18,37] further extended this idea to support supervised learning. They included a batch normalization to reduce covariate shift. They also compared various denoising functions to be used by the decoder. The representations from the encoder are simultaneously used to solve the supervised learning problem. The ladder network conveniently solved unsupervised auxiliary tasks along with primary supervised tasks. Finally, Pezeshki et al. [38] explored different components of the ladder network, noting that lateral connections between encoder and decoder and the addition of noise at every layer of the network greatly contributed to their improved performance. The skip connections between the encoder and decoder ease the pressure of transporting information needed to reconstruct the representations to the top layers. Therefore, top layers can learn features that are useful for the supervised task, such as the emotional prediction.
Inspired by their work, we propose semi-supervised ladder networks for speech emotion recognition, showing their benefits for emotion prediction. This work is an extension of our previous work presented in [39], which focused on discrete emotion recognition. In similar work, Parthasarathy and Busso [40] utilized the ladder network to perform dimensional emotion recognition with multitask learning. Note that our work [39] was published first at the AAAC Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia 2018), before Parthasarathy's work [40] at the conference of the International Speech Communication Association (INTERSPEECH 2018).

Method
In this section, we describe a specific ladder network architecture with two hidden layers, as shown in Fig. 1. There are two encoders in the ladder network: one is a noisy encoder whose inputs are corrupted by noise, similar to DAE, and the other is a clean encoder that processes the original clean input signal with shared parameters. The ladder network combines a primary supervised task with an auxiliary unsupervised task. The auxiliary unsupervised task reconstructs the hidden representations of the clean encoder. The noisy encoder is simultaneously used to train the primary classification task.
The key aspect of the ladder network is the lateral connections between the layers of the encoder and decoder. These lateral skip connections establish the relationships between each layer of the noisy encoder and its corresponding layer in the decoder. This operation enables information to flow freely between the encoder and decoder. As a result, the feature representations from low layers to high layers change from specific to abstract and emotion-relevant for speech emotion classification tasks. Formally, the ladder network is defined as follows:

x̃, z̃^(1), ..., z̃^(L), ỹ = Encoder_noisy(x)    (1)

x, z^(1), ..., z^(L), y = Encoder_clean(x)    (2)

x̂, ẑ^(1), ..., ẑ^(L) = Decoder(z̃^(1), ..., z̃^(L))    (3)

where the variables x, y, ỹ and y* are the input, the noiseless output, the noisy output and the true target, respectively. The variables z^(l), z̃^(l) and ẑ^(l) are the hidden representation, its noisy version, and its reconstructed version at layer l. In the following parts, we give a detailed description of the ladder network to introduce our proposed methods.

Encoder
The encoder of the ladder network is a fully connected MLP network. Gaussian noise with variance σ² is added to each layer of the noisy encoder, as shown in Fig. 1. The representations from the final layer of the encoder are used for the supervised task. The decoder tries to reconstruct the latent representation at every layer using the clean encoder's representation z as the target. In the training phase, the supervised task is trained with the noisy encoder, which further regularizes the supervised learning. Meanwhile, the clean encoder is utilized to predict the emotional class in the testing phase.
In the forward network, a single layer of the encoder includes three types of calculation. The inputs are first transformed with a linear transformation, then batch normalization is applied with the mean and standard deviation of the mini-batch, followed by a non-linear activation function. The detailed schematic diagram is illustrated in Fig. 2. Formally, the noisy encoder is defined as follows:

z_pre^(l) = W^(l) h̃^(l−1)    (4)

μ^(l) = mean(z_pre^(l))    (5)

σ^(l) = std(z_pre^(l))    (6)

z̃^(l) = (z_pre^(l) − μ^(l)) / σ^(l) + n^(l)    (7)

h̃^(l) = φ(γ^(l) ⊙ (z̃^(l) + β^(l)))    (8)

where h̃^(l) is the post-activation at layer l and W^(l) is the weight matrix from layer l−1 to layer l. μ^(l) and σ^(l) are the mean and standard deviation of the mini-batch at layer l. The Gaussian noise n^(l) with zero mean and variance σ² is added after normalization to obtain the pre-activation z̃^(l). The purpose of γ^(l) and β^(l) is to increase the diversity and robustness of the model. Finally, a non-linear activation function φ is applied to obtain the output h̃^(l). The difference between the noisy encoder and the clean encoder is that the clean encoder has no noise term, and h̃^(l) and z̃^(l) are replaced with h^(l) and z^(l), respectively.

Fig. 1  The architecture using semi-supervised ladder networks [18] for speech emotion recognition. One path is the noisy encoder with injected noise as described above; the other is the clean encoder (the second item of (1)-(3)) without noise.
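To make the computation concrete, here is a minimal NumPy sketch of one noisy-encoder layer following (4)-(8). The layer sizes, the ReLU nonlinearity and the noise level here are illustrative assumptions, not the exact settings of our implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_encoder_layer(h_prev, W, gamma, beta, sigma_noise=0.3):
    """One noisy-encoder layer, (4)-(8): linear map, batch normalization,
    noise injection, then scale/shift and a nonlinearity (sketch)."""
    z_pre = h_prev @ W                                    # (4) linear transform
    mu, sd = z_pre.mean(axis=0), z_pre.std(axis=0)        # (5)-(6) batch statistics
    z_tilde = (z_pre - mu) / (sd + 1e-8)                  # batch normalization
    z_tilde = z_tilde + sigma_noise * rng.standard_normal(z_tilde.shape)  # (7) noise
    h = np.maximum(gamma * (z_tilde + beta), 0.0)         # (8) scale, shift, ReLU
    return z_tilde, h

batch, d_in, d_out = 32, 384, 500
x = rng.standard_normal((batch, d_in))
W = 0.05 * rng.standard_normal((d_in, d_out))
z_tilde, h = noisy_encoder_layer(x, W, np.ones(d_out), np.zeros(d_out))
assert z_tilde.shape == (batch, d_out) and h.shape == (batch, d_out)
```

The clean encoder would run the same code with `sigma_noise=0`, producing z and h instead of z̃ and h̃.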

Decoder
The structure of the decoder is similar to that of the encoder. The goal of the decoder is to denoise the noisy latent representations. Instead of using a nonlinear activation function, the denoising function combines top-down information from the decoder and the lateral connection from the corresponding encoder layer. With lateral connections, the ladder network behaves similarly to hierarchical latent variable models. Lower layers are mostly responsible for reconstructing the input vector, and higher layers can learn more abstract, discriminative features for speech emotion recognition.
Similarly, batch normalization is also employed at each layer of the decoder. In the backward pass of the decoder, the inputs of every layer come from two signals: one from the layer above and the other the noisy signal from the corresponding layer in the encoder. The detailed schematic diagram is illustrated in Fig. 2. Formally, the decoder is defined by the following equations:

u^(L) = NB(h̃^(L))    (9)

u^(l) = NB(V^(l) ẑ^(l+1)),  l < L    (10)

ẑ^(l) = g(z̃^(l), u^(l))    (11)

ẑ_BN^(l) = (ẑ^(l) − μ^(l)) / σ^(l)    (12)

where NB(·) denotes batch normalization and V^(l) is a weight matrix from layer l+1 to layer l.
The function g(·, ·) is also called the combinator function as it combines the vertical connection u^(l) and the lateral connection z̃^(l). We use the function proposed by Pezeshki et al. [38], modeled by an MLP:

g(z̃^(l), u^(l)) = MLP([z̃^(l), u^(l), z̃^(l) ⊙ u^(l)])    (13)

where u^(l) is the batch-normalized projection of the layer above and ⊙ represents the Hadamard product.
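A compact sketch of such an MLP combinator follows, assuming a small element-wise network with hypothetical random weights and a ReLU hidden layer; a real model learns these weights jointly with the rest of the network:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_combinator(z_tilde, u, params):
    """Element-wise MLP combinator over [z~, u, z~*u] (sketch of the
    idea in Pezeshki et al.; weights here are random placeholders)."""
    W1, b1, W2, b2 = params
    inp = np.stack([z_tilde, u, z_tilde * u], axis=-1)   # (dim, 3)
    hid = np.maximum(inp @ W1 + b1, 0.0)                 # (dim, hidden)
    return (hid @ W2 + b2).squeeze(-1)                   # z_hat, shape (dim,)

dim, hidden = 8, 4
params = (rng.standard_normal((3, hidden)), np.zeros(hidden),
          rng.standard_normal((hidden, 1)), np.zeros(1))
z_hat = mlp_combinator(rng.standard_normal(dim), rng.standard_normal(dim), params)
assert z_hat.shape == (dim,)
```

Because the MLP acts element-wise on each unit, the combinator stays cheap even for wide layers.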

Objective function
The objective function of the ladder network consists of two parts, corresponding to the supervised part and the unsupervised part, respectively. The goal of the unsupervised part is to reconstruct the input signals, whose effect is to automatically obtain effective intermediate hidden representations that better accord with speech emotion classification. Besides, the unsupervised objective regularizes the supervised speech emotion recognition task. The unsupervised objective and supervised objective are optimized simultaneously, which integrates the system into a whole model to train the ladder network and avoids inconsistent optimization objectives between the two parts.

The supervised loss CE is the cross-entropy cost calculated between the noisy output ỹ from the top of the noisy encoder and the true target y*. The unsupervised loss is the reconstruction loss between every layer of the clean encoder and its corresponding layer of the decoder with lateral connections. The total objective function is a weighted sum of the supervised loss and unsupervised loss:

C = CE + Σ_l λ^(l) RC^(l)    (14)

where λ^(l) is a hyper-parameter weight for the unsupervised loss at layer l and CE is the supervised loss:

CE = − (1/N) Σ_n log P(ỹ_n = y*_n | x_n)    (15)

RC^(l) is the reconstruction loss at layer l:

RC^(l) = ||z^(l) − ẑ_BN^(l)||²    (16)

where μ^(l) and σ^(l) in (12) are the mean and standard deviation of the samples in the clean encoder. The output of the decoder is normalized to remove the effect of unwanted noise introduced by the limited batch size of batch normalization.

Fig. 2  The detailed calculation structure diagram of the ladder network [38]. Both sides of the figure are encoders; the left is the noisy encoder ((4) to (8)) and the right is the clean encoder. At each layer of the encoders, linear transformation and normalization are first applied. The noisy encoder then injects noise to get z̃^(l), while the clean encoder has no noise. Then, the batch normalization correction and nonlinear activation are computed to get h̃^(l) and h^(l), respectively. At the decoder ((9) to (13)), the inputs of every layer come from two signals: one from the layer above and the other the noisy signal from the corresponding layer in the encoder. Linear transformation and normalization are applied to ẑ^(l+1) before combining the signals. CE stands for the supervised cross-entropy cost and RC stands for the unsupervised reconstruction cost. The total objective function is a weighted sum of the supervised loss and unsupervised loss.
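The total cost combining the cross-entropy term and the per-layer reconstruction terms can be sketched as follows; the toy shapes and λ values are illustrative only:

```python
import numpy as np

def ladder_loss(y_true, y_noisy_logits, z_clean, z_hat_bn, lambdas):
    """Total ladder cost: cross entropy on the noisy output plus a
    weighted sum of per-layer reconstruction costs (sketch only)."""
    # supervised part: cross entropy between noisy softmax output and target
    logits = y_noisy_logits - y_noisy_logits.max(axis=1, keepdims=True)
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    ce = -np.log(p[np.arange(len(y_true)), y_true] + 1e-12).mean()
    # unsupervised part: squared error between clean z and normalized z_hat
    rc = sum(lam * np.mean((z - zh) ** 2)
             for lam, z, zh in zip(lambdas, z_clean, z_hat_bn))
    return ce + rc

rng = np.random.default_rng(0)
y = rng.integers(0, 4, size=16)                     # 4 emotion classes
loss = ladder_loss(y, rng.standard_normal((16, 4)),
                   [rng.standard_normal((16, 100))] * 2,   # clean z per layer
                   [rng.standard_normal((16, 100))] * 2,   # z_hat_BN per layer
                   lambdas=[1.0, 0.1])
assert loss > 0
```

Setting all λ values to zero recovers a purely supervised network, while setting the cross-entropy weight to zero recovers a purely unsupervised denoising objective.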

Variational autoencoder (VAE)
In this paper, we utilize VAE [16], another variant of AE, as a comparison. Unlike DAE, which aims to reconstruct the data, VAE emphasizes modeling the explicit distribution form of the data to generate more intrinsic feature representations, as shown in Fig. 3. Formally, VAE is defined as follows:

μ, σ = Encoder(x)    (17)

z = μ + σ ⊙ ε,  ε ∼ N(0, I)    (18)

x̂ = Decoder(z)    (19)

where x is the input and Encoder(·) denotes the encoder network. μ and σ are the mean and standard deviation of the normal distribution learned by the encoder network, and ε ∼ N(0, I) is Gaussian noise with zero mean and unit standard deviation.

During the training of VAE, besides the reconstruction loss, a Kullback-Leibler (KL) divergence loss is also used:

L_KL = D_KL(N(μ, σ²) ∥ N(0, I))    (20)

where N(0, I) is the prior multivariate Gaussian distribution.
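The sampling step and the KL term both have simple closed forms; the following sketch (with a hypothetical 8-dimensional latent space) illustrates the reparameterization trick and the KL divergence to the standard normal prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    return mu + sigma * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over dimensions."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma + 1e-12))

mu, sigma = np.zeros(8), np.ones(8)   # toy posterior equal to the prior
z = reparameterize(mu, sigma)
assert z.shape == (8,)
assert abs(kl_to_standard_normal(mu, sigma)) < 1e-9  # KL vanishes at the prior
```

Sampling through ε rather than from N(μ, σ²) directly keeps the path from the loss to μ and σ differentiable, which is what makes VAE training possible.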

Database
In this paper, we conduct the experiments on the interactive emotional dyadic motion capture (IEMOCAP) database [41].
The database contains 12 hours of multimodal data, including audio, visual and textual data. We focus on emotion recognition from the speech data. It was recorded by ten actors, five males and five females. The recordings are based on dyadic interactions acted in two different scenarios: scripted play and spontaneous dialog. After completing the conversations, the recordings were segmented at the sentence level. Valid and valuable sentences were selected and annotated with emotional labels (including neutral) by at least three annotators. The database has nine emotion classes in total, namely angry, excited, happy, sad, neutral, frustrated, fearful, surprised, and disgusted. For this study, we use four categories to evaluate the system performance, including "angry", "happy", "sad" and "neutral", which are researched most frequently and have the most samples. Like other researchers do [9, 10], we merge the "excited" class into the "happy" class. Only the sentences for which at least two annotations agree are selected. In total, we collect 5 531 utterances. Training and test sets are partitioned in a leave-one-speaker-out manner. The class distribution is: 20.0% "angry", 19.6% "sad", 29.6% "happy", and 30.8% "neutral".

Acoustic features
The inputs of the networks are traditional hand-crafted acoustic features for speech emotion recognition. We refer to the baseline features of the INTERSPEECH 2009 Emotion Challenge [42]. As shown in Table 1, the set contains 16 acoustic low-level descriptors (LLDs) including zero-crossing rate (ZCR), root mean square (RMS) frame energy, pitch frequency (normalized to 500 Hz), harmonics-to-noise ratio (HNR) by autocorrelation function, and mel-frequency cepstral coefficients (MFCC) 1-12. Their first-order delta regression coefficients double the LLDs, resulting in 32 LLDs. 12 functionals (mean, standard deviation, kurtosis, skewness, minimum and maximum values with their relative positions, range, and two linear regression coefficients with their mean square error (MSE)) are applied to the 32 LLDs to calculate 384-dimensional features. The extraction of the LLDs and the computation of the functionals are done using the openSMILE toolkit [43].
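As a rough illustration of the functional step, the following sketch applies 12 simple functionals to each of 32 toy LLD contours to obtain a 384-dimensional vector; the exact functional definitions used by openSMILE may differ slightly from these:

```python
import numpy as np

rng = np.random.default_rng(0)

def functionals(contour):
    """12 IS09-style functionals for one LLD contour (illustrative)."""
    t = np.arange(len(contour))
    mean, std = contour.mean(), contour.std()
    c = contour - mean
    skew = (c ** 3).mean() / (std ** 3 + 1e-12)     # skewness
    kurt = (c ** 4).mean() / (std ** 4 + 1e-12)     # kurtosis
    slope, offset = np.polyfit(t, contour, 1)       # linear regression fit
    mse = ((slope * t + offset - contour) ** 2).mean()
    return np.array([mean, std, kurt, skew,
                     contour.min(), contour.max(),
                     t[contour.argmin()] / len(t),  # relative position of min
                     t[contour.argmax()] / len(t),  # relative position of max
                     contour.max() - contour.min(), # range
                     offset, slope, mse])

# 32 toy contours (16 LLDs plus their deltas) for a 100-frame utterance
llds = rng.standard_normal((32, 100))
features = np.concatenate([functionals(c) for c in llds])
assert features.shape == (384,)                     # 32 LLDs x 12 functionals
```

The resulting fixed-length vector is what allows utterances of different durations to share one input layer of size 384.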

Experimental setup and evaluation metrics
In the experiments, three hidden layers with sizes of 500-300-100 from low to high are employed. The size of the input layer is 384, corresponding to the acoustic feature dimension, and the final prediction layer has 4 units, corresponding to the emotional classes. The hyper-parameter weights λ^(l) of the unsupervised loss in (14) are optimized with a grid search. Since every layer has an individual λ^(l), a global search would consume much time; thus, this parameter is optimized layer by layer. The ADAM optimization algorithm [44] is utilized. The batch size is set to 32. The initial learning rate is 0.02 for 50 iterations, followed by 25 iterations with the learning rate decaying linearly to 0. In the following parts, we use SVM to evaluate the performance of the feature representations learned by the networks. To determine the parameters of SVM, we use a grid search in the ranges of [1.0, 100.0] and [0.0001, 0.1] for C and g, respectively. Each experiment is repeated five times to account for instability. The evaluation measure is the unweighted average recall (UAR).
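The SVM evaluation step can be sketched with scikit-learn; the data here are random stand-ins for the learned feature representations, the grid values are illustrative points within the stated ranges, and UAR is computed as macro-averaged recall:

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 100))          # stand-in feature representations
y = rng.integers(0, 4, size=80)             # four emotion classes

# Grid search over C in [1, 100] and gamma ("g") in [0.0001, 0.1]
grid = {"C": np.logspace(0, 2, 5), "gamma": np.logspace(-4, -1, 4)}
clf = GridSearchCV(SVC(kernel="rbf"), grid, cv=3)
clf.fit(X, y)
pred = clf.predict(X)

# UAR: recall averaged over classes, so each class counts equally
uar = recall_score(y, pred, average="macro")
assert 0.0 <= uar <= 1.0
```

Because the four classes have unequal sizes (30.8% "neutral" versus 19.6% "sad"), UAR is a fairer measure than plain accuracy here.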

Results
For semi-supervised learning, the true benefit of the ladder network appears when only a few labelled samples are available for the primary supervised task. Thus, we conduct four semi-supervised emotion recognition tasks with 300, 600, 1 200 and 2 400 labelled samples. Labelled samples are chosen randomly from the training set, but the number of training samples in each class is balanced. The remaining training samples are utilized to pretrain the network without label information. To evaluate the performance of the semi-supervised ladder networks, we utilize three other methods as comparisons. SVM is utilized to evaluate the performance of the acoustic features as baseline results. The DAE method has a network structure similar to the ladder network shown in Fig. 1; however, it has no skip connections or reconstruction losses at the higher layers. Further, the VAE method is obtained by replacing the encoder and decoder of the DAE structure with the parts described in Fig. 3. The network settings of DAE and VAE are the same as those of the ladder network. Table 2 lists the average accuracies with standard deviations over five trials for the four models. In addition, the best accuracy of the five trials is presented in the table as well. The following analyses are mostly based on the average accuracies. With the increase of training samples, the performance gradually improves, and the best performance is achieved with all training samples in all situations, showing that an increasing number of training samples is beneficial to the emotional classification task. Furthermore, our proposed methods achieve superior performance with a small number of labelled data. The VAE method using only 600 training samples achieves better performance (53.7%) than SVM using all training samples (52.4%), while the ladder network needs only 300 training samples to reach 53.6%. This suggests semi-supervised learning has a positive influence on performance improvement for speech emotion recognition.
It is worth noticing that the performance improves faster for the three network methods when fewer training samples are available, specifically from 300 to 600 and from 600 to 1 200. This suggests the auxiliary unsupervised task is essential to performance improvement. Overall, the three network methods achieve better performance than the SVM baseline in all situations. Thus, the representations learned by deep autoencoder structures achieve better performance than conventional models when using similar hand-crafted acoustic features. As can be seen from Table 2, VAE yields better accuracy than the DAE method, which shows that VAE can model the intrinsic structure of speech emotional data to generate better feature representations. However, VAE is less stable, since its standard deviations are greater than DAE's. We can also observe that the ladder network achieves the best performance among all methods. Note that the difference between DAE and the ladder network is the existence of lateral connections. The results verify that lateral connections between the encoder and decoder greatly contribute to the improved performance. The ladder network yields better performance than VAE with a smaller standard deviation, showing its superiority for speech emotion recognition.
The results of Table 2 are based on the structure whose encoder and decoder are composed of MLP layers. Next, we replace the MLP layers with convolutional neural network (CNN) layers for the three network models to improve the performance. The inputs are frame-level features, which replace the utterance-level features used with MLP. Specifically, the encoder is replaced with three 2D convolutional layers and the decoder is replaced with three 2D deconvolutional layers. The filter numbers and kernel sizes are shown in Table 3. The number of parameters increases from 2.8 M with MLP to 5.8 M with CNN for the three networks. Accordingly, the training time also increases from less than one hour to about two hours for the three networks. The training time of the ladder network is longer than that of DAE because of the addition of lateral connections and the reconstruction loss. The training time of VAE is sometimes longer and sometimes shorter than that of the ladder network due to its instability.
The corresponding experimental results are shown in Table 4. Comparing Table 2 with Table 4, we can observe that the models with MLP have better performance when fewer training samples (300 and 600) are available, while the models with CNN have better performance when more training samples (1 200 and more) are available. In addition, the standard deviations are relatively decreased. This suggests CNN has a better and more robust ability to encode emotional characteristics when the training data is sufficient. Similarly, the ladder network achieves better performance than VAE, which in turn achieves better performance than DAE. The results also verify the significance of unsupervised learning when few training samples are available.
Autoencoder structures have a superior ability to extract effective feature representations compared with other network models. We extract features from the highest hidden layer of the trained models and feed them to an SVM classifier to assess their quality. Fig. 4 shows the classifier performance of the three network methods on the testing set with 300, 600, 1 200 and 2 400 labelled samples, using MLP and CNN as the feature extractor. We also explore the influence of supervised learning: in Fig. 4, the suffix "1" denotes training without supervised learning, while "2" denotes training with supervised learning. In every setting, the ladder network achieves the best performance, followed by VAE, with DAE performing worst. Comparing Fig. 4 (a) with Fig. 4 (b), the CNN structure achieves better performance than the MLP structure. For example, when using 2 400 training samples with supervised learning, the ladder network with the CNN structure reaches 60.3%, better than the MLP structure′s result of 59.8%. The results also show that models trained with supervised learning outperform those trained without it: "Ladder2" with CNN achieves 60.3% against 59.4% for "Ladder1", which shows that the supervised information helps guide better feature representations. The results in Fig. 4 surpass those in Tables 2 and 4, confirming the ability of autoencoder structures to generate more discriminative feature representations for speech emotion recognition.
After comparing the performance of the features from the highest layer, we examine the features from low layers to high layers for the three network structures. Again, an SVM is used to evaluate feature quality. This part is based on the CNN structure using all training samples, and the experimental results are shown in Fig. 5. The performance of the first layer ("384") corresponds to the raw acoustic features, similar to Table 2, and the final layer ("4") gives the accuracies of supervised learning, similar to Table 4. The results show that the accuracies of the last layer "4" are worse than those of the last hidden layer "100", consistent with the results of Fig. 4. For all three network methods, the accuracies improve as the layers deepen. We observe that DAE outperforms VAE and the ladder network in the first hidden layer "500", while VAE and the ladder network outperform DAE in the following layers "300" and "100". Therefore, higher layers have the advantage of generating more salient emotional representations. Further, the ladder network outperforms VAE in all hidden layers, which verifies the effectiveness of our proposed methods.

Finally, we examine performance on the different speech emotional categories. Table 5 shows the results of the models corresponding to the best results in Table 4, obtained with all training samples. The accuracies of DAE, VAE and the ladder network are 57.0%, 58.5% and 59.7%, respectively. Compared with DAE, the VAE method yields better performance on "angry", "happy" and "sad". The ladder network achieves its best performance on "angry", "happy" and "neutral", while its performance on "sad" decreases slightly. Thus, the enhanced network structure particularly benefits the recognition of "angry" and "happy".
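The layer-wise evaluation above amounts to freezing the trained encoder, reading out activations at a chosen depth, and fitting a separate classifier on them. The sketch below illustrates the procedure with NumPy, random weights standing in for a trained encoder, hypothetical layer widths, and a nearest-centroid probe in place of the SVM actually used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [384, 500, 300, 100]   # hypothetical widths, mirroring the MLP encoder
Ws = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def features_at(x, depth):
    """Activations after `depth` encoder layers (depth=0 -> raw features)."""
    h = x
    for W in Ws[:depth]:
        h = np.maximum(h @ W, 0.0)
    return h

def nearest_centroid(train_x, train_y, test_x):
    """Tiny stand-in probe; the paper evaluates each layer with an SVM."""
    classes = np.unique(train_y)
    cents = np.stack([train_x[train_y == c].mean(0) for c in classes])
    d = ((test_x[:, None, :] - cents[None]) ** 2).sum(-1)
    return classes[d.argmin(1)]

x = rng.normal(size=(20, 384))
y = rng.integers(0, 4, size=20)            # four emotion classes
for depth in range(len(sizes)):
    f = features_at(x, depth)
    pred = nearest_centroid(f, y, f)       # train-set probe, illustration only
```

Sweeping `depth` and recording probe accuracy at each layer is the experiment behind Fig. 5; with a trained encoder, accuracy would be expected to rise toward the higher hidden layers, as reported above.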
We also compare the proposed method with other methods in the literature, as shown in Table 6. Our proposed method achieves better performance, 59.7%, than the 56.1% of Michael′s work in [9], which uses an attentive convolutional neural network to recognize emotions. Fayek et al. [10] introduce a frame-based formulation to model intra-utterance dynamics with end-to-end deep learning, achieving better performance of 60.9%. The feature representations from the top layer with an SVM achieve 60.3% in Fig. 5, which is comparable to [10].

Conclusions
In this paper, we apply semi-supervised learning to speech emotion recognition to explore the effect of the ladder network. The unsupervised reconstruction of the inputs serves as an auxiliary task that regularizes the network and generates more powerful representations for the speech emotion recognition system. We conduct experiments on the IEMOCAP database, and the results demonstrate that the proposed methods achieve superior performance with a small number of labelled data. We also compare the ladder network with two classic network structures, DAE and VAE, showing that the ladder network outperforms them significantly. The results suggest that lateral connections between the encoder and decoder contribute greatly to the improved performance: these skip connections relieve the higher layers of the pressure of carrying all the information needed to reconstruct the representations, so a higher layer can concentrate on generating discriminative features for speech emotion recognition. Meanwhile, the supervised learning task is beneficial for generating more effective feature representations. Besides, the CNN encodes emotional characteristics better and more robustly than the MLP structure. Finally, our proposed methods particularly benefit the recognition of "angry" and "happy". In the future, we will utilize more of the available unlabelled speech data to improve the performance of SER, and will explore deeper semi-supervised learning and other network structures such as recurrent neural networks (RNNs).

Table 6 Performance comparison between our method and other methods

Model                  Accuracy (%)
Attentive CNN [9]      56.1
Frame-based SER [10]   60.9