Data-driven Dimensional Expression Generation via Encapsulated Variational Auto-Encoders

Concerning facial expression generation, recent advances in generative models, fuelled by the sheer volume of training data, allow high-quality facial expressions to be generated without the laborious facial expression annotation procedure. However, these generative processes bear little relevance to the psychologically conceptualised dimensional plane, i.e., the Arousal-Valence two-dimensional plane, and consequently generate facial expressions that are psychologically uninterpretable. For this reason, in this research, we present a novel generative model that learns psychologically compatible (low-dimensional) representations of facial expressions to permit the generation of facial expressions along the psychologically conceptualised Arousal-Valence dimensions. To generate Arousal-Valence compatible facial expressions, we resort to a novel form of data-driven generative model, the encapsulated variational auto-encoders (EVAE), which consists of two connected variational auto-encoders. The two variational auto-encoders in our EVAE model are concatenated through a tuneable continuous hyper-parameter, which bounds the learning of the EVAE. Since this tuneable hyper-parameter, along with the linearly sampled inputs, largely determines the process of generating facial expressions, we hypothesise a correspondence between continuous scales on the hyper-parameter and the sampled inputs, and the psychologically conceptualised Arousal-Valence dimensions. For empirical validation, two publicly released facial expression datasets, i.e., the Frey faces and FERG-DB datasets, were employed to evaluate the dimensional generative performance of our proposed EVAE. Across both datasets, the facial expressions generated along our two hypothesised continuous scales were observed to be consistent with the psychologically conceptualised Arousal-Valence dimensions.
Applying our proposed EVAE model to the Frey faces and FERG-DB facial expression datasets, we demonstrate the feasibility of generating facial expressions along the conceptualised Arousal-Valence dimensions. In conclusion, to generate facial expressions along the psychologically conceptualised Arousal-Valence dimensions, we propose a novel type of generative model, the encapsulated variational auto-encoders (EVAE), which allows the generation process to be disentangled into two tuneable continuous factors. Validated on two publicly available facial expression datasets, we demonstrate the association between these factors and the Arousal-Valence dimensions in facial expression generation, deriving a data-driven Arousal-Valence plane for affective computing. Despite its embryonic stage, our research may shed light on the prospect of continuous, dimensional affective computing.


Dimensional Expression Generation
Facial expression is one of the primary channels through which human beings express emotions and sentiments. How to reflect the complexity of human facial expressions is one of the central enquiries in affective computing. A long line of research on affective generation has paid overwhelming attention to the discrete, non-continuous, non-dimensional modelling of human affect [1,2]. These simplified learned affective representations quickly reached their bottleneck in reflecting the richness of sentiments [3]: emotions and sentiments in the wild are not restricted to a fixed taxonomy, e.g., the six basic emotions [4]. Instead, human facial expressions should be represented and generated at a finer granularity along continuous dimensions. A recent neurological finding also suggests that human emotion may be represented in a distributed, dimensional manner [5].
Regarding the dimensional representation of facial expressions, the classic psychologically conceptualised Arousal-Valence (A-V) dimensions hold great promise in locating most facial expressions along two dimensions [6,7]. As shown in Fig. 1, the arousal dimension (the vertical axis in Fig. 1) measures the global feeling of dynamism or lethargy while expressing a certain emotion, whereas the valence dimension offers a rough measure of the polarity of the feeling. However, in affective computing, facial expression generation along these A-V dimensions in a data-driven manner remains opaque, i.e., it is unknown whether these psychologically conceptualised A-V dimensions can be mathematically modelled by two continuous factors in dimensional facial expression generation.

Overview of Our Approach
To tackle the challenge of dimensional facial expression generation, among various deep learning models [8], we rely on a recent advance in generative modelling, the variational auto-encoder (VAE) [9], to generate facial expressions in a data-driven manner. Unfortunately, it is imprudent to employ a VAE directly for our targeted dimensional facial expression generation owing to the following issue. Under the conventional framework of a VAE, the encoded latent representations of facial expressions are largely entangled, making it difficult to tease the two continuous factors we seek, for the arousal and valence dimensions, apart from the entangled latent representations. Hence, a novel form of VAE is needed: one that encodes disentangled latent representations and yields two continuous scales corresponding to the arousal and valence dimensions.
To this end, we propose a novel form of variational auto-encoder, the encapsulated variational auto-encoders (EVAE), tailored specifically for learning data-driven dimensional affective representations of facial expressions. These learned dimensional representations can then be used to generate artificial facial expressions along the theoretically conceptualised A-V dimensions. The crux of our proposed EVAE is a tuneable hyper-parameter that permits the encoding of flexible latent representations. These flexible, anatomically aware latent representations can be further boiled down to two continuous factors: the included hyper-parameter and the sampled inputs used to generate the latent variables. Mapping these two factors onto two continuous scales allows us to establish a Cartesian coordinate system that models the theoretical A-V dimensions.

Fig. 1 Theoretical A-V dimensions of facial expressions. This theoretical A-V two-dimensional plane is a reproduction of the work in [6,7]. The horizontal axis represents the valence dimension, whereas the vertical axis stands for the arousal dimension. The majority of human emotions can be represented on this plane. E.g., in comparison to the feeling of excitement, the feeling of happiness is relatively low in the arousal but high in the valence dimension.

The rest of this article is organised as follows. Before delineating our method, we first present a succinct review of previous attempts at dimensional expression generation, followed by a detailed description of our proposed encapsulated variational auto-encoders (EVAE). The succeeding section is devoted to a detailed description of the usage of our proposed EVAE in generating expressions along the modelled A-V dimensions. To validate our approach, two empirical experiments were conducted to demonstrate the feasibility of generating data-driven A-V dimensional expressions on two publicly available datasets, i.e., the Frey faces and FERG-DB datasets.
We conclude this article by highlighting some bearing limitations and several future research avenues.

Related Works
Previous efforts in learning dimensional affective representations can be regarded as downgrading the original sentiment classification problem to a regression problem that reports continuous ratings (values) in place of discrete labels [10,11]. This trend calls for the release of large-scale, dimensionally annotated affect datasets, e.g., Aff-Wild [12], SEMAINE [13], and AffectNet [14]. Relying on these datasets, several previous studies revised the original discrete classification models into regression models that output continuous arousal and valence values [15,16]. Recent advancements along this direction mix discrete categorisation with dimensional annotations via convex combination [17] or a multi-task learning strategy [18]. Other studies resort to multi-modal feature fusion [19,20], rigorously defined facial features (or facial landmarks) [21,22], and recovered 3D facial features [23] to learn dimensional affective representations.
However, the above-mentioned attempts can hardly be perceived as the definitive solution for learning dimensional affective representations. The annotations they rely on are expensive to collect, subjective, prone to errors, and varied across datasets. Hence, in our research, we forgo these ambiguous annotations and develop a fully unsupervised approach to learning dimensional representations.
Using supervised self-organising maps and a variant of neural networks, i.e., extreme learning machines, Bugnon et al. [24] also aim to link the theoretical A-V space with a mathematically defined latent/graphic space. Similar to our work, their attempt also focuses on yielding a low-dimensional (2D) A-V plane. Interestingly, the hyper-parameter-modulated inner space between our two component VAEs can be interpreted as a latent space model, similar to a self-organising map. Notwithstanding, our work diverges from theirs in its generative property: the latent space learned by EVAE, with a tuneable hyper-parameter, allows the easy generation of affect in accordance with the A-V dimensions.
Narrowing our focus to prevailing methodological modifications of the VAE, a plethora of variants that seek to learn interpretable representations of data can be found, spanning improved generalisation, e.g., beta-VAE [25] and Multi-entity VAE [26], and the usage of a discretised latent space, e.g., VQ-VAE [27]. Interestingly, despite the similar learning objective, i.e., to diversify the generation process of the VAE, our proposed EVAE differs considerably from these alternatives in its flexible independence assumption between the two sets of latent variables z_b and z_s. This flexible independence constraint cannot be realised by tuning a hyper-parameter in beta-VAE, by the heuristic posterior parameter selection in Multi-entity VAE, or by resorting to a less accurate discretisation approach as in VQ-VAE.

EVAE: Overview
Different from the conventional formation of a VAE (Fig. 2(I)), in our proposed EVAE (Fig. 2(II)), the employment of two latent variables implies a novel structured encoder-decoder architecture. That is, for a given dataset with x ∈ X, two separate probabilistic encoders, q_b(z_b|x) and q_s(z_s|x), and decoders, p_b(x|z_b) and p_s(x|z_s), compile into two component VAEs, denoted VAE_b (the base VAE) and VAE_s (the scaffolding VAE), with latent variables z_b, z_s and two sets of to-be-optimised parameters, {θ_b, φ_b} and {θ_s, φ_s}. Importantly, these two latent variables z_b, z_s allow the coinage of the joint encoder and decoder of our proposed EVAE: q_{b,s}((z_b, z_s)|x) and p_{b,s}(x|(z_b, z_s)). However, it is essential to note that we do not assume conditional independence of the two latent variables. I.e., the simplified assumption of a factorised joint encoder, e.g., q_b(z_b|x) · q_s(z_s|x), or a factorised joint decoder does not hold in this research. Instead, we introduce a hyper-parameter α to tune the relation between the two latent variables in deriving the corresponding specific analytic expressions of the joint encoder and decoder.
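The two-component structure above can be sketched numerically. The following is a minimal sketch, not the paper's implementation: the class and weight names are hypothetical, and simple linear maps stand in for the feedforward networks the paper trains.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyVAE:
    """A minimal linear encoder/decoder pair standing in for one component VAE."""
    def __init__(self, x_dim, z_dim):
        self.W_enc = rng.normal(0, 0.1, (z_dim, x_dim))  # encoder mean weights
        self.W_dec = rng.normal(0, 0.1, (x_dim, z_dim))  # decoder mean weights

    def encode(self, x):
        mu = self.W_enc @ x                              # mean of q(z|x)
        return mu + 0.1 * rng.standard_normal(mu.shape)  # reparameterised sample

    def decode(self, z):
        return self.W_dec @ z                            # mean of p(x|z)

class EVAE:
    """Two component VAEs: the base VAE_b and the scaffolding VAE_s."""
    def __init__(self, x_dim, z_dim):
        self.vae_b = TinyVAE(x_dim, z_dim)  # base
        self.vae_s = TinyVAE(x_dim, z_dim)  # scaffolding

    def forward(self, x):
        z_b, z_s = self.vae_b.encode(x), self.vae_s.encode(x)
        return self.vae_b.decode(z_b), self.vae_s.decode(z_s), z_b, z_s

model = EVAE(x_dim=560, z_dim=2)  # 20 x 28 Frey-face pixels, 2D latent
x = rng.standard_normal(560)
xb, xs, zb, zs = model.forward(x)
print(xb.shape, zb.shape)  # (560,) (2,)
```

The point of the sketch is the shape of the architecture: one input feeds two independent encoder/decoder pairs, whose latents z_b and z_s are later coupled through α.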

EVAE: Encoders and Decoders
Prior to deriving the analytic form of our targeted joint encoder and decoder in EVAE, we first introduce the parameterisations used for the component encoders, q_b(z_b|x) and q_s(z_s|x), and decoders, p_b(x_b|z_b) and p_s(x_s|z_s). We parameterise the base encoder as a diagonal Gaussian, i.e., log q_b(z_b|x) = log N(z_b; μ_b, σ_b^2 I), where the to-be-optimised variational parameters are φ_b, which are used to produce the mean (μ_b) and s.d. (σ_b) of the approximating distribution.
Differing from the simplified parameterisation used in the base encoder, we let a more complex full-rank Gaussian distribution be the form of our scaffolding encoder, i.e., log q_s(z_s|x) = log N(z_s; μ_s, L), where the parameterised variational parameter φ_s is the concatenation of the mean vector and the decomposed covariance matrix, i.e., {μ_s, L}. Here, an LKJ prior is placed on the covariance matrix. The LKJ prior is well suited to modelling the covariance in a multivariate normal distribution, where x ∼ N(μ, Σ^{-1}). It provides a prior on the correlation matrix of the parameters, i.e., C = Corr(x_i, x_j), which fuses the standard deviations of all components; its density function is f(C|η) ∝ |C|^{η−1}. We then use the numerically stable Cholesky decomposition to decompose the covariance matrix Σ into Σ = LL^T, where L is a lower triangular matrix, to speed up variational inference.
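The Cholesky parameterisation above can be illustrated concretely. This is a sketch of the sampling mechanics only, not the paper's trained encoder: the covariance values are arbitrary, and a reparameterised draw from N(μ_s, Σ) is produced as μ_s + Lε with ε standard normal.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2
mu_s = np.zeros(d)

# A valid covariance matrix and its lower-triangular Cholesky factor: Sigma = L L^T
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)

# One reparameterised sample from the full-rank Gaussian N(mu_s, Sigma)
eps = rng.standard_normal(d)
z_s = mu_s + L @ eps

# Empirical sanity check: many such samples recover the covariance
z_many = rng.standard_normal((100000, d)) @ L.T
print(np.round(np.cov(z_many.T), 1))
```

Working with L rather than Σ keeps every sample valid (any lower-triangular L with positive diagonal yields a positive-definite Σ) and is the numerical stability the text refers to.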
We further let the two decoders, p_b(x_b|z_b) and p_s(x_s|z_s), take similar multivariate Gaussian forms to those used for the preceding encoders, i.e., log p_b(x_b|z_b) = log N(x_b; μ'_b, σ'^2_b I) and log p_s(x_s|z_s) = log N(x_s; μ'_s, L'), where x_b and x_s are the reconstructed inputs from the base and scaffolding VAEs, respectively.

EVAE: the Importance of α
Armed with the parameterisations of the two component encoders and decoders, we are now positioned to derive the analytic expressions of our joint encoder and decoder. An optimal analytic expression of the joint encoder or decoder should satisfy the following two requirements. (1) The derived expression should be flexible enough to express the fully factorised, full equality, and mixed relations between the two component VAEs.
(2) The derived expression needs to be fully differentiable to allow fast and accurate approximate marginal inference on the variable x.
To this end, we consider an analytic expression of our targeted joint encoder that allows smooth interpolation between the two component encoders, in the following Eq. 1:

q_{b,s}((z_b, z_s)|x) ∝ q_b(z_b|x) · q_s(z_s|x) · exp(−R_e(z_b, z_s)),    (1)

where the introduced discrepancy function R_e(z_b, z_s) can be further expressed as:

R_e(z_b, z_s) = (1/α) ‖z_b^l − z_s^l‖^2.    (2)

As samples are easier to work with, we Monte-Carlo sample the encoded representations from the two encoders, i.e., z_b^l ∼ q_b(z_b|x) and z_s^l ∼ q_s(z_s|x), where l indexes the samples; in practice, a single sample suffices. Interestingly, paying close attention to this discrepancy function in Eq. 2, the included positive factor α offers us a tool to exert direct control over the encoded latent representations.
Specifically, as α → 0, an equality constraint is implied that forces the two encoded latent representations from the two encoders to coincide, reducing our proposed EVAE to two identical VAEs. Numerically, this imposed equality constraint is equivalent to inserting a Dirac delta function, δ(z_b − z_s), to enforce the exactness of the two encoded latent representations. In opposition to the preceding case, when α → +∞, the two encoded latent representations are learned to diverge towards full independence; in this case, a factorised joint encoder is assumed, i.e., q_b(z_b|x) · q_s(z_s|x). Similar to the factorisation noted for the joint encoder, the analytic expression of the joint decoder p_{b,s}(x|(z_b, z_s)) can be derived as:

p_{b,s}(x|(z_b, z_s)) ∝ p_b(x|z_b) · p_s(x|z_s) · exp(−R_d(x̂_b, x̂_s)),    (3)

where the discrepancy function for the joint decoder is:

R_d(x̂_b, x̂_s) = (1/α) ‖x̂_b − x̂_s‖^2.    (4)

Here, x̂_b and x̂_s represent the reconstructed data instances from VAE_b and VAE_s respectively, i.e., x̂_b ∼ p_b(x|z_b) and x̂_s ∼ p_s(x|z_s). Notice that we incorporate the same positive hyper-parameter α to scale the differences between the two reconstructed data instances.
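The limiting behaviour of α described above can be checked numerically. This is a sketch under an assumed squared-difference form of the discrepancy (the function name and the exact functional form are illustrative, not taken from the paper):

```python
import numpy as np

def discrepancy(a, b, alpha):
    """Assumed squared-difference discrepancy, scaled by 1/alpha,
    standing in for R_e (on latents) or R_d (on reconstructions)."""
    return np.sum((a - b) ** 2) / alpha

z_b = np.array([0.5, -1.0])
z_s = np.array([0.4, -0.8])

tight = discrepancy(z_b, z_s, alpha=1e-3)  # alpha -> 0: huge penalty, equality enforced
loose = discrepancy(z_b, z_s, alpha=1e3)   # alpha -> inf: penalty vanishes, independence
print(tight > loose)                        # True
print(discrepancy(z_b, z_b, alpha=1e-3))    # 0.0 once the two representations coincide
```

A small α makes any disagreement between z_b and z_s prohibitively costly (the equality regime), while a large α lets the two latents move freely (the factorised regime), which is exactly the interpolation the joint encoder is designed to provide.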

EVAE: the Learning Objective
We complete our EVAE model by defining a simple factorised joint prior on the two latent variables, i.e.,

p(z_b, z_s) = p(z_b) · p(z_s) = N(z_b; 0, I) · N(z_s; 0, I).    (5)

The learning objective of a conventional VAE is expressed as:

L(θ, φ; x) = −D_KL(q_φ(z|x) ‖ p(z)) + E_{q_φ(z|x)}[log p_θ(x|z)],    (6)

where the first term denotes the regularisation penalty from the variational encoder, and the second term expresses the decoder-induced reconstruction loss between the generation and the original input. Following a similar derivation, and using the defined analytic expressions of the joint encoder (cf. Eq. 1), the joint decoder (cf. Eq. 3), and the joint prior (cf. Eq. 5), the learning objective of the EVAE is defined as:

L_EVAE(θ_b, φ_b, θ_s, φ_s; x) = −D_KL(q_{b,s}((z_b, z_s)|x) ‖ p(z_b, z_s)) + E_{q_{b,s}}[log p_{b,s}(x|(z_b, z_s))].    (7)

In Eq. 7, the first term measures the K-L divergence between the joint encoder q_{b,s}((z_b, z_s)|x) and the harnessed factorised prior p(z_b, z_s). The second term in Eq. 7 computes the reconstruction error between the inputted samples and those reconstructed by the joint decoder.

EVAE: the α-Bounded Learning
Relying on the proposed EVAE, we are now positioned to spell out the generative process. The targeted generative process unfolds as follows: two latent representations (values of z_b and z_s) are first encoded by the approximated probabilistic encoders, and these latent representations are then fed to the joint decoder to generate expressions.
As the encoding of latent representations depends uniformly on the values of z_b and z_s, it is critical to reveal the forces that shape z_b and z_s. To this end, we first rewrite the objective function in Eq. 7 as:

L_EVAE = L_vae_b(θ_b, φ_b; x) + L_vae_s(θ_s, φ_s; x) − (1/α)(‖z_b − z_s‖^2 + ‖x̂_b − x̂_s‖^2).    (8)

Rewriting the learning objective this way groups the two discrepancy functions together. The remaining terms L_vae_b(θ_b, φ_b; x) and L_vae_s(θ_s, φ_s; x) match the learning objectives of two conventional VAEs. As seen in Eq. 8, the included hyper-parameter α plays the key role in shaping the learning objective. However, the current support of our incorporated hyper-parameter α is the semi-positive half-line, which needs to be transformed to real coordinates to serve as a continuous axis. Hence, we resort to a simple continuous function, the log function, to give α support in the real coordinate space ℜ. The transformed hyper-parameter is α ⟵ log(α).
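The grouped objective and the log transform can be sketched as follows. The function name and the squared-difference penalties are assumptions for illustration; the two ELBO terms are treated as given scalars, as they would come from the two component VAEs:

```python
import numpy as np

def evae_objective(elbo_b, elbo_s, z_b, z_s, xhat_b, xhat_s, alpha):
    """Sketch of the grouped objective: two conventional VAE objectives
    plus the alpha-scaled encoder and decoder discrepancy penalties."""
    penalty = (np.sum((z_b - z_s) ** 2) + np.sum((xhat_b - xhat_s) ** 2)) / alpha
    return elbo_b + elbo_s - penalty

z_b, z_s = np.array([0.2, 0.1]), np.array([0.3, -0.1])
xh_b, xh_s = np.zeros(4), 0.1 * np.ones(4)

loose = evae_objective(-10.0, -11.0, z_b, z_s, xh_b, xh_s, alpha=100.0)
tight = evae_objective(-10.0, -11.0, z_b, z_s, xh_b, xh_s, alpha=0.01)
print(loose > tight)  # True: a small alpha penalises any disagreement heavily

# Mapping alpha onto the real line so it can serve as a continuous axis
alpha_axis = np.log(np.array([0.01, 1.0, 100.0]))
print(alpha_axis)     # negative, zero, positive: an unconstrained coordinate
```

The log transform is what turns the positive-only hyper-parameter into a signed axis, which the next sections use as the arousal coordinate.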
Interestingly, this newly transformed hyper-parameter not only preserves the appealing properties of the original α, i.e., the implied independence (α → +∞) and equality (α → 0) relations, but also allows us to sample negative values of α to influence the encoding of latent representations. As shown in Fig. 3, in terms of the learning objective, the value of α exerts direct control in bounding the learning of EVAE.

Valence Dimension
To model the valence dimension, we first fix the factor α to a constant. With a fixed α and sufficient training sessions, the encoded representations can easily be produced via the sampled values of the learned latent variables z_b and z_s, yielding low-dimensional (2D) latent representations of a given human affect dataset. The sampling technique we adopt to produce values of the latent variables relies on running the inverse cumulative distribution function (CDF) of the Gaussian on a linearly spaced unit square ranging from −1 to 1. As these linearly sampled values live in real coordinates, we then map them onto a continuous scale, denoted X_sample. With α fixed, different values on this continuous scale lead to diversified encoded latent representations that correspond to different kinds of human facial expressions. In this research, we hypothesise that this continuous scale models the valence dimension of our targeted data-driven A-V dimensions.
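The inverse-CDF sampling step can be sketched in a one-dimensional slice. This is an illustration under the assumption that the linearly spaced inputs are mapped to probabilities in the open unit interval (the extremes are avoided because the Gaussian inverse CDF diverges at 0 and 1); it uses the standard-library NormalDist, not the paper's implementation:

```python
import numpy as np
from statistics import NormalDist

# Linearly spaced probabilities -> Gaussian quantiles via the inverse CDF
probs = np.linspace(0.05, 0.95, 7)
x_sample = np.array([NormalDist().inv_cdf(p) for p in probs])
print(np.round(x_sample, 2))
```

Equal steps in the input grid become a monotone, symmetric sweep through the latent space, which is what lets X_sample act as a continuous valence axis.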

Arousal Dimension
Unfortunately, with a fixed α, the exploitation of the previous continuous scale can merely ensure the modelling of the valence dimension. Modelling the arousal dimension demands further analysis of the second factor, the included hyper-parameter α.
Specifically, driving α towards values close to 0 enforces the encoded representations from the two encoders (the base and scaffolding ones) to coincide with each other, producing two identical latent representations. In this research, we hypothesise that, conditioned on these uniform encoded latent representations, the decoder in our proposed EVAE produces neutral facial expressions. I.e., an association between the real '0' on a Cartesian coordinate and the generated neutral human affect can be established.
Pushing α to the two extremes, α → +∞ and α → −∞, allows the diversification and unification of the encoded latent representations. Since α now lives on the unconstrained real line, tuning α from negative to positive (except 0) should elevate the degree of the same facial expression while the previously defined valence axis is held fixed. Hence, a continuous scale on α is hypothesised here to represent the data-driven arousal dimension.

Data-driven A-V Two-dimensional Plane
Armed with the modelled arousal and valence dimensions, we can finally combine them to build our targeted data-driven A-V dimensional plane. The first axis, the arousal dimension, is the continuous scale on the introduced hyper-parameter α, reflecting the level of dynamism in human expressions. The second axis, the valence dimension, is modelled as the sampled value (X_sample) used to produce the latent variables in EVAE, delineating the polarity of the human affect. Plotting these two axes together forms a Cartesian coordinate system that renders the data-driven A-V dimensions. This data-driven A-V plane is the cornerstone of this research: it transforms a psychologically conceptualised plane (shown in Fig. 4(a)) into a data-driven Cartesian coordinate plane (shown in Fig. 4(b)) for representing human facial expressions. Each learned human expression can now be represented as a pair of real numbers on these data-driven A-V dimensions. More importantly, as both dimensions are modelled by continuous scales, facial expressions can be generated along the two dimensions.
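Sweeping the two axes jointly produces a grid of generated images, one per (arousal, valence) cell. In the sketch below the decoder is a hypothetical stand-in (a fixed nonlinear map); in the paper's setting a trained EVAE decoder would occupy its place, and the axis values mirror the α points used later in Fig. 7:

```python
import numpy as np
from statistics import NormalDist

def decode(z, alpha):
    """Hypothetical stand-in for a trained EVAE joint decoder: any function
    mapping a latent value and alpha to a 20 x 28 = 560-pixel image."""
    w = np.linspace(0.0, 1.0, 560)
    return np.tanh(w * z + 0.01 * alpha)

# Valence axis: Gaussian quantiles of linearly spaced inputs (X_sample)
valence_axis = [NormalDist().inv_cdf(p) for p in np.linspace(0.1, 0.9, 5)]
# Arousal axis: representative values of the (log-transformed) alpha
arousal_axis = [50, 25, 0, -25, -50]

grid = np.stack([np.stack([decode(v, a) for v in valence_axis])
                 for a in arousal_axis])
print(grid.shape)  # (5, 5, 560): one generated image per (arousal, valence) cell
```

Laying the resulting images out on this 5 × 5 grid is precisely the data-driven Cartesian plane of Fig. 4(b).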

Empirical Validations
We ran empirical experiments on two publicly available facial expression datasets: the Frey faces and FERG-DB [28] datasets. The objective is to empirically validate whether our proposed EVAE, with a tuneable hyper-parameter, is capable of learning A-V dimensional representations for expression generation, i.e., whether the valence dimension is associated with the sampled inputs used to produce the latent variables, and the arousal dimension is modelled by the single tuneable hyper-parameter α.

Frey Faces Dataset
The Frey faces dataset contains 1965 grey-scale images of Brendan Frey's face, taken from sequential frames of a video, each with dimensions 20 × 28. Each image went through an input normalisation pre-processing step. This dataset is ideal for evaluating the dimensional generative performance of EVAE due to its assumed homogeneity in facial features, i.e., all emotional expressions were rendered by a single person.

FERG-DB Dataset
The Facial Expression Research Group Database (FERG-DB) [28] is an annotated facial expression dataset created with the Maya 3D modelling software, containing the expressions of animated characters. Differing from the Frey faces dataset, this dataset was created specifically for automatic facial analysis. The entire dataset contains 55767 annotated facial expression images of six characters. The modelled facial expressions range from low-valence expressions, e.g., anger, disgust, sadness, and fear, to high-valence ones, e.g., surprise and joy. To fit our objective, we intentionally discard the original discrete annotations associated with the expressions. Each expression image in the FERG-DB dataset was grey-scaled and normalised.
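The grey-scaling and normalisation step applied to both datasets can be sketched as follows. Min-max scaling to [0, 1] is one plausible reading of the paper's "input normalisation"; the function name and the per-image scaling choice are assumptions:

```python
import numpy as np

def preprocess(img_u8):
    """Map a uint8 grey-scale image to floats in [0, 1] via per-image
    min-max scaling (one plausible normalisation; an assumption here)."""
    x = img_u8.astype(np.float64)
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

rng = np.random.default_rng(2)
img = rng.integers(0, 256, size=(28, 20))  # a fake 20 x 28 grey-scale frame
x = preprocess(img)
print(x.min() >= 0.0, x.max() <= 1.0)
```

Keeping inputs on a common [0, 1] scale is what makes the mean-squared reconstruction loss comparable across images and datasets.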

Experimental Set-ups
For the two encoders in our proposed EVAE, two different feedforward neural networks, with isotropic Gaussian priors on the latent variables, i.e., p(z_b) = N(z_b; 0, I) and p(z_s) = N(z_s; 0, I), were employed to warp the variational parameters of the base and scaffolding encoders, forming two inference networks. For the decoders, two complementary feedforward neural networks were used as generative networks.
For the Frey faces dataset, the network configurations of the inference and generative networks are summarised in the upper panel of Table 1. Note that, across other feasible configurations for the encoders and decoders, e.g., the number of hidden units and the type of layer-wise activation, the relative performance of the models was observed to be insensitive to these choices. For the cost function, we strictly follow the learning objective derived in Eq. 8. To account for the reconstruction loss in the cost function, we favoured the mean squared error, as it induced stability in the practical implementation.
Compared with the previous implementation on the Frey faces dataset, learning A-V dimensional representations from FERG-DB facial expressions is a more difficult task, as some facial features that are non-essential for expression perception are also included, e.g., the gender or musculature of the faces and the different hairstyles. Hence, for the employed inference and generative networks, a different set of configurations was applied, shown in the lower panel of Table 1. In terms of the loss function and the choice of gradient-based optimisation algorithm, we continue to use the mean squared loss and Adam [32], respectively. Different from the previously fixed number of learning epochs, we adopt the early stopping technique here to prevent overfitting, setting the patience to 2.

Fig. 4 The conversion between the theoretical and data-driven A-V two-dimensional planes. (a) The psychologically conceptualised A-V dimensional plane. (b) The data-driven A-V dimensional plane. The sampled inputs (denoted X_sample) are modelled as the valence dimension, whereas the continuous scale on α is modelled as the arousal dimension.
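The early stopping rule used above can be sketched in a few lines. The class name is illustrative; the logic is the standard patience criterion: stop once the validation loss has failed to improve for `patience` consecutive epochs.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=2):
        self.patience, self.best, self.bad_epochs = patience, float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training

stopper = EarlyStopping(patience=2)
losses = [1.0, 0.8, 0.9, 0.85, 0.95]
stopped_at = next(i for i, l in enumerate(losses) if stopper.step(l))
print(stopped_at)  # 3: the second consecutive epoch without a new best after 0.8
```

With patience 2, training here halts at epoch 3, before the model can drift further from its best validation loss.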

Evaluation Metrics
Compared with the straightforward assessment of discriminative models in classification and regression tasks, providing an objective metric to assess the performance of a generative model is challenging. To this end, both quantitative and qualitative evaluations have to be taken into consideration. The quantitative evaluation serves to measure how well our proposed EVAE is trained, whereas the qualitative one serves to assess the visual fidelity of the generated expressions along the A-V dimensions.
In terms of the chosen metrics, for the quantitative measurement, we default to the commonly used optimised evidence lower bound (ELBO), i.e., L [33], as our target index. Meanwhile, for comparative purposes, we include the original VAE as a benchmark in the quantitative evaluation. For the qualitative evaluation, the perceptual quality of generated expressions along the modelled A-V dimensions is our primary criterion in assessing the generative performance of our proposed EVAE.
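For a single conventional VAE with a diagonal Gaussian posterior, the ELBO used as the quantitative metric has a simple closed form for its KL term. This is a generic sketch (the function names are illustrative, and the squared-error reconstruction term is stated up to additive constants), not the paper's exact computation:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def elbo(x, x_recon, mu, logvar):
    """ELBO with a squared-error reconstruction term (up to additive constants)."""
    recon = -np.sum((x - x_recon) ** 2)
    return recon - gaussian_kl(mu, logvar)

x = np.array([0.2, 0.8, 0.5])
print(elbo(x, x, np.zeros(2), np.zeros(2)))  # 0.0: perfect reconstruction, prior-matched posterior
```

A higher (less negative) ELBO indicates a better-trained model, which is the sense in which the following sections compare optimised lower bounds across values of α.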

Quantitative Evaluation
In terms of the quantitative results on the Frey faces dataset, as the learning of EVAE is bounded by the continuous scale on α, this bounding effect leads to varied optimised lower bounds across different values of α, as shown in Fig. 5.
As the optimised lower bound of EVAE varies across different values of α, a direct comparison between our proposed EVAE and the conventional VAE on this metric offers limited validation.

Qualitative Evaluation
Moving beyond the less grounded quantitative assessment, we gravitate towards the qualitative evaluation. Relying on the generative property of our proposed EVAE, when tuning the magnitude along the modelled valence dimension from one end to the other, we expect to observe a shift in the generated expressions from negative to positive. This is achieved by forward-propagating a linearly spaced unit square (2D) through the inverse CDF of the Gaussian to produce the latent variables. The latent variables were then fed to the learned generative networks to generate the corresponding images.
As shown in Fig. 6, a clear shift in the generated facial expressions was observed, from a disgust-like negative expression (the leftmost in Fig. 6) to a joy-like expression (the rightmost in Fig. 6). This observed expression shift demonstrates the feasibility of generating expressions along the modelled valence dimension.
Facial expressions that differ in the modelled arousal dimension should exhibit differentiated degrees of variation for the same affective signal. As demonstrated in Fig. 7, along the modelled arousal dimension, varying degrees of facial expression are generated from different instantiations of EVAE. Facial expressions generated at the two ends of the arousal dimension render diversified and unified patterns of generation, respectively. These observed variations in expression generation show that, by tuning the magnitude on the modelled arousal dimension, our proposed EVAE is capable of generating expressions in line with the theoretical arousal dimension.

Quantitative Evaluation
Following the similar bounded learning effect observed on the Frey faces dataset (Fig. 5), the hyper-parameter α also bounded the learning of EVAE on the FERG-DB dataset. As delineated in Fig. 8, the chosen value of α has a direct impact on the final optimised ELBO of our proposed EVAE on this dataset.

Qualitative Evaluation
To evaluate whether our proposed EVAE is capable of generating expressions along the modelled valence dimension, i.e., the sampled input, we linearly interpolated the latent space while fixing α to a certain value; a gradual shift in the generated facial expressions was observed, as shown in Fig. 9. In comparison with the previous implementation of EVAE on the Frey faces dataset, the observed expression shift on the FERG-DB dataset is still evident but less smooth. As shown in Fig. 10, by tuning the numerical value of α, our proposed EVAE is able to generate designated expressions that correspond to their rough positions on the theoretical arousal dimension. We inherit the previously defined operational definition of the arousal dimension: a high value on the arousal dimension should lead to the generation of diversified facial expressions, whereas a low value leads to the generation of indifferent ones. Armed with this operational definition, we observed differentiated patterns of generated expressions between the bottom and top panels in Fig. 10.
However, as the employed FERG-DB dataset has six different animation characters, these diversified facial features largely influence the perceptual quality of the generated expressions. The impacts of certain facial features, such as the hairstyle and identity of a character, are hard to control under the current version of EVAE; we leave this to future research.

Fig. 7 Facial expression generation along the modelled arousal dimension on the Frey faces dataset. The modelled arousal axis is depicted at the leftmost of this figure. We choose the 5 most representative points on this continuous scale, i.e., 50, 25, 0, −25, −50. For each value along the modelled arousal dimension, the corresponding Frey facial expressions are generated. I.e., a negative value leads to more dramatic facial expressions and a wider range of regenerated affect (high arousal) in comparison to its positive alternative. Setting the arousal index to 0, the generated affects are neutral expressions. Note that the axis is rendered from the positive to the negative end; this is done on purpose to coincide with the theoretical A-V plane [6].

Fig. 8 The bounded learning of EVAE on the FERG-DB dataset. It is clear that on the FERG-DB dataset, the hyper-parameter α can exert direct control over the ELBO in optimisation.

Conclusion
In this pioneering research, we propose a novel form of variational auto-encoder, the encapsulated variational auto-encoder (EVAE), to automatically learn psychologically conceptualised Arousal-Valence (A-V) dimensional representations of human facial expressions. As the encoding of latent representations in our proposed EVAE can be boiled down to two continuous factors, the sampled input and a tuneable hyper-parameter, it allows the building of a Cartesian coordinate system along the A-V dimensions. Leveraging the generative property of our proposed EVAE permits the generation of artificial facial expressions in accordance with the theoretical A-V dimensions.
Unfortunately, these learned dimensional representations of facial expressions suffer from two limitations that thwart their large-scale use. The first is the poor perceptual quality of the expressions reproduced from these representations. The second is the lack of an objective validation metric for assessing the performance of the learned data-driven A-V dimensions.
In tackling the foregoing issues, we highlight two main mitigation approaches for consideration. To improve the perceptual quality of the reproduced expressions and lower the dataset dependency, a large-scale facial expression dataset, trained with more sophisticated network structures, should be considered. To yield a fair assessment of the performance of our learned data-driven A-V dimensions, validating the regenerated expressions against the celebrated facial action coding system (FACS) [34] can also be an interesting research avenue.

Fig. 9 Facial expression generation along the modelled valence dimension on the FERG-DB dataset. By tuning the value on the modelled valence dimension, i.e., the sampled input, EVAE is capable of generating expressions from one polarity to the other
Despite the mentioned limitations, to the best of the authors' knowledge, this is the first time that the dimensional representations of facial expressions defined in cognitive science have been mathematically modelled and learned through generative modelling. The demonstrated feasibility of generating facial expressions along the conceptualised A-V dimensions sheds light on the future prospect of continuous, dimensional affective computing.
Regarding future research directions, two streams, concerning both theoretical refinements and real-life applications of our approach, can be considered. From the theoretical perspective, a future refinement is expected to greatly simplify the current model from two full auto-encoders to two decoders, which would largely reduce the number of training parameters and stabilise the training process. From the application perspective, our EVAE method can be readily applied to a spectrum of domains, including software engineering for facial expression generation and psychological experiments in which the generated continuous facial expressions along the A-V axes serve as stimuli.
Funding This study is supported by the National Natural Science Foundation of China under Grant No. 61472117.

Declarations
Ethical Approval This article does not contain any studies with human participants or animals performed by any of the authors.

Conflicts of Interest
The authors declare that they have no conflicts of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.