1 Introduction

Understanding and reasoning about the world from a limited set of observations is a central goal of artificial intelligence. For instance, we can infer the movement of a ball in motion at a single glance, because the human brain is capable of disentangling positions from a set of images without supervision. Disentanglement learning is therefore highly desirable for building intelligent applications. Disentangled representations have been argued to benefit a large variety of downstream tasks (Schölkopf et al., 2012). According to Kim and Mnih (2018), a disentangled representation exposes interpretable semantic information and has enabled substantial advances, including but not limited to reducing the performance gap between humans and AI approaches (Higgins et al., 2018b; Tenenbaum, 2018). Other applications of disentangled representations include semantic image understanding and generation (Lample et al., 2017), zero-shot learning (Zhu et al., 2019), and reinforcement learning (Higgins et al., 2017b).

As described in the seminal paper by Bengio et al. (2013), humans can understand and reason from a complex observation and induce its explanatory factors. The observations are generated by explanatory ground-truth factors \( {c} \) that are not directly observable. Disentanglement learning aims to obtain a representation that recovers and separates these factors from the observations. The precise notion of disentanglement remains an open topic (Do and Tran, 2020; Higgins et al., 2018a); we follow the strict definition in which one and only one latent variable \({z}_i\) represents one corresponding factor \({c}_j\) (Burgess et al., 2017).

Locatello et al. (2019) proved that disentanglement learning is impossible without inductive biases on the model and data. One popular inductive bias on the model assumes that the latent variables are independent; the resulting approaches, which penalize the total correlation (TC), dominate visual disentanglement learning (Chen et al., 2018; Kumar et al., 2018). This assumption holds when the factors are sampled uniformly; in reality, however, independent factors often exhibit statistical correlation (Träuble et al., 2021). For instance, men are more likely to have short hair, so the observed data show a correlation between gender and hair length. Yet a man who is not bald may grow long hair if desired; sex does not determine hair length, and the two remain independent factors. Exploring disentanglement approaches beyond the independence assumption is therefore vital for real-world applications.

Another popular line of research is based on information theory (Jeon et al., 2021; Chen et al., 2016). These works hypothesize that gradually increasing the information bottleneck (IB) leads to better disentanglement (Burgess et al., 2017; Dupont, 2018). Unfortunately, in practice, IB-based approaches usually exhibit lower performance than those penalizing the TC (Locatello et al., 2019). Does this mean that total correlation beats the IB? We believe the answer is negative. In this research, we investigate why IB-based approaches fall behind TC-based ones in practice and find that the information diffusion (ID) problem is an invisible hurdle that the IB community should address.

Information diffusion means that one factor’s information diffuses into two or more latent variables, so the disentanglement scores fluctuate during training. Figure 1 shows the disentanglement scores of three approaches with their best hyperparameter settings; many trials exhibit high variance. We connect the ID problem to the instability of current approaches in Sect. 3.

Fig. 1: The distribution of the BetaVAE metric, MIG, and DCI disentanglement on dSprites. Models are abbreviated (V=\(\beta \)-VAE, TV=\(\beta \)-TCVAE, AV=AnnealedVAE), and 50 trials are run with different random seeds

In this paper, we trace the ID problem by measuring NMI1 and NMI2 (see Eq. 8). When IB-based approaches such as AnnealedVAE (Burgess et al., 2017) and CascadeVAEC (Jeong and Song, 2019) learn new information, the previously learned information may diffuse into other latent variables. To increase the IB gradually, it is crucial to detect which components contribute differently to the objective. To this end, we develop the annealing test to measure the information freezing point (IFP), the critical value of the pressure at which the model starts to learn information from the inputs. We also find that a factor is easier to disentangle if its IFP distribution is well separated from the others.

Inspired by distillation in chemistry, we can divide the training process into several stages and disentangle one component at each stage. In particular, we propose a framework, called the distilling entangled factor (DEFT), that disentangles factors stage by stage. At each stage, DEFT chooses a selective pressure, according to the IFP distribution, that lets only some information pass through the IB. In addition, at the \(m\)-th stage DEFT reduces the backward information of the first \(m-1\) sub-encoders by scaling their gradients, which relieves the ID problem. We evaluate DEFT on four datasets and observe robust performance; we also examine it on a dataset with correlated factors. Our code and all experimental settings are published in dlib for PyTorch, forked from disentanglement lib. Our contributions are summarized as follows:

  • We hypothesize that the ID problem is one reason for the low performances of IB-based approaches.

  • We propose DEFT, a multistage disentangling framework, to address the ID problem by blocking partial information and scaling the backward information.

2 Preliminary

2.1 Disentanglement approaches

Variational autoencoder In variational inference, the posterior \(p(z|x)\) is intractable. The variational autoencoder (VAE) (Kingma and Welling, 2014) uses a neural network \(q_\phi (z|x)\) (the encoder) to approximate the posterior \(p(z|x)\), while another neural network \(p_\theta (x|z)\) (the decoder) reconstructs the observations. The VAE maximizes the evidence lower bound (ELBO):

$$\begin{aligned} \begin{aligned} \mathcal {L}(\theta , \phi ) = \mathbb {E}_{q_\phi (\mathbf {z}|\mathbf {x})}[\log {p_\theta (x|z)}] - D_{\mathrm {KL}}(q_\phi (z|x) || p(z)). \end{aligned} \end{aligned}$$
(1)

\( \varvec{\beta } \)-VAE Higgins et al. (2017a) discovered the relationship between disentanglement and the strength of the Kullback-Leibler (KL) divergence penalty. They proposed the \( \beta \)-VAE, which introduces an additional pressure on the KL term:

$$\begin{aligned} \begin{aligned} \mathcal {L}^1(\theta , \phi ; \beta ) = \mathbb {E}_{q_\phi (\mathbf {z}|\mathbf {x})}[\log {p_\theta (x|z)}] - \beta D_{\mathrm {KL}}(q_\phi (z|x) || p(z)), \end{aligned} \end{aligned}$$
(2)

where \(\beta \) controls the pressure for the posterior \(q_{\phi }(z|x)\) to match the factorized unit Gaussian prior \(p(z)\). However, there is a trade-off between the quality of the reconstructed images and the performance of disentanglement.
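
To make the objectives concrete, the following is a minimal PyTorch-style sketch of the \(\beta\)-VAE loss in Eq. (2), assuming a Gaussian encoder that outputs `mu` and `logvar` and a decoder that outputs Bernoulli logits for image pixels; setting `beta=1` recovers the plain ELBO of Eq. (1). The function name and interface are ours, not the published implementation.

```python
import torch.nn.functional as F

def beta_vae_loss(x, x_recon_logits, mu, logvar, beta=4.0):
    """Negative ELBO with a beta-weighted KL term (Eq. 2); beta = 1 gives Eq. 1."""
    # Bernoulli reconstruction term, summed over pixels and averaged over the batch.
    recon = F.binary_cross_entropy_with_logits(
        x_recon_logits, x, reduction="none").sum(dim=[1, 2, 3]).mean()
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return recon + beta * kl
```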

AnnealedVAE Burgess et al. (2017) proposed the AnnealedVAE, which progressively increases the information capacity of the latent variables while training:

$$\begin{aligned} \begin{aligned} \mathcal {L}^2(\theta , \phi ; C) = \mathbb {E}_{q_\phi (\mathbf {z}|\mathbf {x})}[\log {p_\theta (x|z)}] - \gamma \left| D_{\mathrm {KL}}(q_\phi (z|x) || p(z)) - C \right| , \end{aligned} \end{aligned}$$
(3)

where \(\gamma {}\) is a sufficiently large constant (usually 1000) that constrains the latent information, and \(C\) is the capacity, which gradually increases from zero to a large value.
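
A corresponding sketch of the AnnealedVAE objective in Eq. (3) is given below; it reuses the same Gaussian-encoder assumptions, and the linear capacity schedule (with a maximum of 25, the value used in Sect. 3) is illustrative rather than the authors' exact schedule.

```python
import torch.nn.functional as F

def annealed_vae_loss(x, x_recon_logits, mu, logvar, step,
                      gamma=1000.0, c_max=25.0, anneal_steps=100_000):
    """AnnealedVAE objective (Eq. 3): gamma * |KL - C|, where the capacity C
    grows linearly from 0 to c_max over anneal_steps training iterations."""
    recon = F.binary_cross_entropy_with_logits(
        x_recon_logits, x, reduction="none").sum(dim=[1, 2, 3]).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    c = min(c_max, c_max * step / anneal_steps)     # annealed capacity
    return recon + gamma * (kl - c).abs()
```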

\( \varvec{\beta } \)-TCVAE The TC (Watanabe, 1960) quantifies the dependency among variables. \(\beta \)-TCVAE (Chen et al., 2018) decomposes the KL term into three parts: mutual information (MI), total correlation (TC), and dimension-wise KL (DWKL). The TC can be penalized to achieve both high reconstruction quality and disentanglement:

$$\begin{aligned} \begin{aligned} \mathcal {L}^3(\theta , \phi ; \beta ) =&\mathbb {E}_{q_\phi (\mathbf {z}|\mathbf {x})}[\log {p_\theta (x|z)}] - \mathbb {E}_{q(z, n)}\left[ \log \frac{q_\phi (z \mid n) p(n)}{q_\phi (z) p(n)}\right] - \\&\beta \mathbb {E}_{q_\phi (z)}\left[ \log \frac{q_\phi (z)}{\prod _{j} q_\phi \left( {z}_{j}\right) }\right] -\sum _{j} \mathbb {E}_{q_\phi \left( {z}_{j}\right) }\left[ \log \frac{q_\phi \left( {z}_{j}\right) }{p\left( {z}_{j}\right) }\right] . \end{aligned} \end{aligned}$$
(4)
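
The TC term that \(\beta\) scales in Eq. (4) cannot be computed exactly; \(\beta\)-TCVAE estimates it with minibatch-weighted sampling. The sketch below shows that estimator under our own naming, assuming diagonal-Gaussian posteriors and a known training-set size; it is not the authors' published code.

```python
import math
import torch

def gaussian_log_density(z, mu, logvar):
    """Element-wise log N(z; mu, exp(logvar))."""
    c = math.log(2 * math.pi)
    return -0.5 * (c + logvar + (z - mu).pow(2) / logvar.exp())

def total_correlation(z, mu, logvar, dataset_size):
    """Minibatch-weighted-sampling estimate of TC(z), the term scaled by beta
    in Eq. (4).  z, mu, logvar are (B, D) tensors for one minibatch."""
    B = z.size(0)
    # log q(z_i,d | x_j) for every sample pair (i, j) and dimension d: (B, B, D).
    log_qz_ij = gaussian_log_density(
        z.unsqueeze(1), mu.unsqueeze(0), logvar.unsqueeze(0))
    log_norm = math.log(B * dataset_size)
    # log q(z_i): density of the full latent vector, marginalized over the data.
    log_qz = torch.logsumexp(log_qz_ij.sum(dim=2), dim=1) - log_norm
    # sum_d log q(z_i,d): product of the marginals.
    log_qz_prod = (torch.logsumexp(log_qz_ij, dim=1) - log_norm).sum(dim=1)
    return (log_qz - log_qz_prod).mean()
```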

CascadeVAEC Jeong and Song (2019) provided another total correlation penalization through information cascading. They proved that \(TC(z)=\sum _{i=2}^{d} I({z}_{1:i-1};{z}_i)\). CascadeVAEC, the continuous version, releases one latent variable per stage, encouraging the model to disentangle one factor during the \(i\)-th stage:

$$\begin{aligned} \begin{aligned} \mathcal {L}^4(\theta , \phi ; \beta _l,\beta _h) =&\mathbb {E}_{q_\phi (\mathbf {z}|\mathbf {x})}[\log {p_\theta (x|z)}] - \\&\beta _l D_{\mathrm {KL}}(q_\phi ({z}_{1:i}|x) || p({z}_{1:i})) - \beta _h D_{\mathrm {KL}}(q_\phi ({z}_{i+1:d}|x) || p({z}_{i+1:d})), \end{aligned} \end{aligned}$$
(5)

where \(\beta _l\) is a small value for opening the information flow, \(\beta _h\) is a large value for blocking information, and \(d\) is the number of dimensions.
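
For diagonal-Gaussian posteriors, the split KL term of Eq. (5) can be computed per dimension as sketched below; the default weights are placeholders, and the full CascadeVAEC objective adds the reconstruction term as before.

```python
def cascade_vae_c_kl(mu, logvar, stage, beta_l=1.0, beta_h=50.0):
    """Stage-wise KL penalty of CascadeVAEC (Eq. 5): at stage i the first i
    dimensions get the small weight beta_l (information flows), while the
    remaining dimensions get the large weight beta_h (information is blocked).
    The default weights are illustrative placeholders."""
    # Per-dimension KL(q(z_j|x) || N(0, 1)), averaged over the batch.
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean(dim=0)
    return beta_l * kl_per_dim[:stage].sum() + beta_h * kl_per_dim[stage:].sum()
```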

Relevant but not compared approaches ICA (Comon, 1994) and PCA (Wold et al., 1987) guarantee independence mathematically, and their nonlinear versions are helpful for disentanglement (Sorrenson et al., 2020). However, they require the factors to follow a factorized prior distribution. Learning factorial codes (Schmidhuber, 1992) is limited to binary codes. Merely encouraging independence is theoretically insufficient to disentangle factors, and the inductive biases on the data and the model should be made explicit (Locatello et al., 2019).

2.2 Disentanglement evaluation

Several metrics have been proposed to evaluate disentanglement, including the BetaVAE metric (Higgins et al., 2017a), the FactorVAE metric (Kim and Mnih, 2018), the MI gap (Chen et al., 2018), modularity (Ridgeway and Mozer, 2018), DCI (Eastwood and Williams, 2018), and the SAP score (Kumar et al., 2018). Shannon MI is an information-theoretic quantity that measures the amount of information shared between two variables. Based on it, the MIG (Chen et al., 2018) measures, for each factor, the gap between the two latent variables with the highest MI:

$$\begin{aligned} \text {MIG} = \frac{1}{\Vert {c} \Vert } \sum _{i=1}^{\Vert {c} \Vert } \text {NMI}({c}_i,1) - \text {NMI}({c}_i,2), \end{aligned}$$
(6)

where \(\text {NMI}({c}_k,m)\) is the \(m\)-th largest normalized MI (NMI) between a latent variable \({z}_j\) and \({c}_k\). It is computed as:

$$\begin{aligned} \begin{aligned} \text {NMI}({c}_k,m) = \frac{1}{H \left( {c}_{k} \right) } I({z}_{j^m};{c}_k), \end{aligned} \end{aligned}$$
(7)

where \({z}\) is the vector of latent variables, \(c\) is the vector of ground-truth factors, and \(j^m\) denotes the index of the \(m\)-th largest element (\(\displaystyle j^1={{\,\mathrm{arg\,max}\,}}_i I({z}_i;{c}_k)\)). \(\text {NMI}({c}_k,1)\) measures how well the best single variable captures the factor \({c}_k\), and \(\text {NMI}({c}_k,2)\) indicates the information diffused into other variables. Therefore, the gap between \(\text {NMI}({c}_k,1)\) and \(\text {NMI}({c}_k,2)\) should be large for good disentanglement.
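
A compact sketch of how MIG (Eq. 6) and the NMIs of Eq. (7) can be estimated from samples is shown below, following the usual recipe of discretizing continuous latents into bins; the binning scheme and helper names are our own choices.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mig(latents, factors, n_bins=20):
    """MIG (Eq. 6) from samples: latents (N, D) continuous, factors (N, K) discrete.

    Each latent dimension is discretized into n_bins bins before computing the
    normalized mutual information NMI(c_k, m) of Eq. (7)."""
    binned = np.stack([np.digitize(z, np.histogram(z, n_bins)[1][:-1])
                       for z in latents.T])                      # (D, N)
    gaps = []
    for c in factors.T:                                          # each factor c_k
        h_c = mutual_info_score(c, c)                            # H(c_k)
        nmi = np.sort([mutual_info_score(c, z) / h_c for z in binned])[::-1]
        gaps.append(nmi[0] - nmi[1])                             # NMI(c_k,1) - NMI(c_k,2)
    return float(np.mean(gaps))
```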

3 Motivation

Locatello et al. (2019) conducted a survey of current disentanglement approaches and showed that these approaches have a high variance in disentanglement scores. They concluded that “tuning hyperparameters matters more than the choice of the objective function” (see Fig. 7 in their paper). A reliable and robust approach should therefore achieve consistently high performance with low variance. We investigated \(\beta \)-VAE (\(\beta =4\)), \(\beta \)-TCVAE (\(\beta =6\)), AnnealedVAE (\(C=25\)), and CascadeVAEC on dSprites and traced their disentanglement scores through training. Fig. 2 shows the curves of three metrics (BetaVAE metric, MIG, and DCI disentanglement) for the four models. AnnealedVAE, CascadeVAEC, and \(\beta \)-TCVAE show significant improvements within the first iterations. However, CascadeVAEC drops sharply at iteration 10,000, and AnnealedVAE shows a downward trend after iteration 10,000. The training process does not consistently improve disentanglement, resulting in poor final performance.

Fig. 2: Disentanglement fluctuation of the IB-based approaches. AnnealedVAE and CascadeVAEC can degenerate to lower disentanglement scores

One solution to the fluctuation is to block some information with a narrow information bottleneck and then assign the newly admitted information to a new latent variable by widening the bottleneck. AnnealedVAE and CascadeVAEC follow this concept; however, they differ in how they expand the IB. AnnealedVAE directly controls the capacity of the latent variables through an annealed, increasing parameter \(C\). CascadeVAEC increases the capacity by relieving the pressure on the \( i \)-th latent variable at the \( i \)-th stage, opening the information flow. Ideally, such IB-based approaches should show a steady growth of disentanglement; in practice, they also fluctuate.

Fig. 3: The change of NMI over the training process on dSprites. The NMI of many factors decreases gradually after the largest value has been captured at an early stage, especially for the factor scale

A perfectly disentangled representation projects each factor into a single latent variable. In other words, the largest NMI (\(\text {NMI}({c},1)\)) reaches its maximum of \( 1 \), and the second largest NMI (\(\text {NMI}({c},2)\)) stays close to \( 0 \). A decrease in \(\text {NMI}({c},1)\) therefore implies that the information of one factor diffuses into another latent variable, which we define as information diffusion (ID). When ID occurs, the representation can be said to re-entangle.

Although the final disentanglement score is informative, it is insufficient to reveal problems in the learning process; monitoring metrics during training can expose these hidden problems. We therefore monitored \(\text {NMI}({c},1)\) and \(\text {NMI}({c},2)\) while training AnnealedVAE on dSprites (training details in Sect. 5.1), as shown in Fig. 3. We computed the NMIs for the five factors every 10,000 iterations and present each measurement in one row. Ideally, the expanded capacity should let the model learn new information; instead, \(\text {NMI}({c},1)\) for scale decreased after 50,000 iterations. AnnealedVAE thus suffered from the ID problem, which caused its low performance.

4 Method

4.1 Information freezing

Fig. 4: Information freezing. The model starts to learn information at iteration 7500 (\(\beta =32\)), where the KL increases and the reconstruction error decreases

Burgess et al. (2017) proposed that the value of \(\beta\) in \(\beta \)-VAE controls the IB between the inputs and latent variables, similar to the role of temperature in distillation; a low value of \(\beta\) encourages the MI \(I(x;z)\), so more information condenses into the latent space. The IFP is the critical point at which the model starts to learn information from the observations. It is an intrinsic, almost invariant property of a dataset; thus, different factors can be identified by their IFPs.

Definition 1

The IFP is the maximum value of \(\beta {}\), such that \(I(x;z)>0\) for the \(\beta \)-VAE objective.

We introduce the annealing test to determine the IFP for a given dataset. The objective of the annealing test is the same as that of \(\beta \)-VAE, except that \(\beta {}\) is annealed from a high value to 1 (e.g., starting at 200 and ending at 1). As the pressure of the KL term decays, there is a critical point where \(I(x;z)\) increases and the reconstruction error decreases. For example, in Fig. 4 we trained the model with \(\beta {}\) annealed from 200 to 1 over 100,000 iterations; the IFP is approximately 32, reached at around iteration 7500. In practice, we take the IFP as the value of \(\beta\) at which the model starts to learn information (\(I(x;z)\) exceeds 0.1).
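
A sketch of this annealing test is given below, assuming a model with the VAE interface `x_recon_logits, mu, logvar = model(x)` and a loader that yields image batches; the KL term is used as a proxy for \(I(x;z)\), and all names are ours.

```python
import itertools
import torch.nn.functional as F

def annealing_test(model, loader, optimizer, beta_start=200.0, beta_end=1.0,
                   total_steps=100_000, threshold=0.1):
    """Annealing test: train a beta-VAE from scratch while beta decays linearly
    from beta_start to beta_end, and return the information freezing point
    (IFP), i.e. the beta at which I(x; z) -- approximated by the KL term --
    first exceeds `threshold` nats."""
    data = itertools.islice(itertools.cycle(loader), total_steps)
    for step, x in enumerate(data):                  # loader assumed to yield images
        beta = beta_start + (beta_end - beta_start) * step / total_steps
        x_recon_logits, mu, logvar = model(x)        # assumed VAE interface
        recon = F.binary_cross_entropy_with_logits(
            x_recon_logits, x, reduction="none").sum(dim=[1, 2, 3]).mean()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
        loss = recon + beta * kl
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        if kl.item() > threshold:
            return beta                              # information starts to freeze here
    return beta_end                                  # threshold never crossed
```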

4.2 DEFT

Fig. 5: Illustration of DEFT with two sub-encoders (G = 2), each with K = 2 latent variables. DEFT uses isolated sub-encoders and scales part of the backward information

Distillation is the process of separating a mixture into its components by heating it to an appropriate temperature, such that each component boils off and condenses in its target container. Inspired by distillation in chemistry, this paper proposes a novel disentanglement approach based on \(\beta \)-VAE that distills independent components into several isolated sub-encoders. A suitable pressure \( \beta \) separates a component with a high IFP from the components with lower IFPs. We therefore derive an iterative algorithm to Distill (disentangle) the Entangled FacTor, named DEFT. Specifically, DEFT splits the latent variables into G groups of K latent variables each, giving G \(\times \) K latent variables in total. The decoder takes the concatenation of the latent variables of all groups as input, just like a conventional decoder. DEFT also divides the training process into G stages, assigning a different pressure \( \beta ^i \) to the \( i \)-th stage so that the model extracts components according to their IFPs. In addition, DEFT scales the gradients of the old sub-encoders to prevent the ID problem; that is, the backward gradients of the first \(i-1\) sub-encoders are multiplied by a scaling coefficient \( \gamma \). The architecture of DEFT (\(\text {G}=2, \text {K}=2\)) at stage 2 is shown in Fig. 5. The forward pass feeds one image to the two isolated sub-encoders and concatenates their outputs into a four-dimensional vector that is fed to the decoder. The backward pass scales the gradients of the old variables; for example, \(\nabla z^1=\gamma \nabla {z}_{1:2}\) and \(\nabla z^2= \nabla {z}_{3:4}\) at stage 2. The full procedure is given in Algorithm 1, where \(q_{\phi _i}(z^i|x)\) denotes one sub-encoder, \(p_\theta (x|z)\) the decoder, and \(\mathcal {L}\) the \(\beta \)-VAE objective.

DEFT chooses a suitable value of \(\beta\) to separate factors, acting like the temperature in distillation, so that the desired factor’s information passes through the bottleneck and freezes into the latent variables. Furthermore, backward information scaling is applied to the old variables to prevent their information from diffusing into others.

Algorithm 1: DEFT
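
The sketch below condenses the procedure described above, assuming image-only data loaders and reusing the `beta_vae_loss` sketch from Sect. 2.1; it is our reconstruction, not the published Algorithm 1. The gradient scaling of old sub-encoders is implemented by mixing `z` with its detached copy, which leaves the forward value unchanged but multiplies the backward gradient by \(\gamma\).

```python
import itertools
import torch

def deft_forward(sub_encoders, decoder, x, stage, gamma=0.1):
    """One DEFT forward pass at a given stage (Fig. 5).  The first stage-1
    sub-encoders are "old": their forward values are unchanged, but the
    gradient flowing back through their latents is scaled by gamma."""
    zs, mus, logvars = [], [], []
    for i, enc in enumerate(sub_encoders):
        mu, logvar = enc(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        if i < stage - 1:
            # same forward value; backward gradient through z multiplied by gamma
            z = gamma * z + (1.0 - gamma) * z.detach()
        zs.append(z); mus.append(mu); logvars.append(logvar)
    z_all = torch.cat(zs, dim=1)              # concatenated latents fed to the decoder
    return decoder(z_all), torch.cat(mus, dim=1), torch.cat(logvars, dim=1)

def train_deft(sub_encoders, decoder, loader, betas, steps_per_stage, gamma=0.1):
    """Stage-by-stage training: stage i uses the pressure betas[i-1] in the
    beta-VAE objective (beta_vae_loss from the sketch in Sect. 2.1)."""
    modules = list(sub_encoders) + [decoder]
    params = [p for m in modules for p in m.parameters()]
    opt = torch.optim.Adam(params, lr=5e-4, betas=(0.0, 0.99))
    for stage, beta in enumerate(betas, start=1):
        for x in itertools.islice(itertools.cycle(loader), steps_per_stage):
            x_recon, mu, logvar = deft_forward(sub_encoders, decoder, x, stage, gamma)
            loss = beta_vae_loss(x, x_recon, mu, logvar, beta=beta)
            opt.zero_grad(); loss.backward(); opt.step()
```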

5 Experiment

5.1 Settings

In this work, we use two encoder architectures (standard and lite) and one decoder architecture, as shown in Table 1. DEFT uses the lite architecture for each sub-encoder, with a total latent dimension of \(\text {K} \times \text {G}\); the other approaches use the standard architecture. All models use the same decoder. All layers are activated by ReLU. The optimizer is Adam with a learning rate of 5e-4, \(\beta _1 = 0,\ \beta _2 = 0.99\). The batch size is 256, which accelerates training.

Table 1 Lite encoder, standard encoder, and decoder architecture for all experiments. For dSprites and SmallNORB, \(c=1\). For Color and Scream, \(c=3\)

5.2 Supervised problem

Dataset detail We compared DEFT with other approaches on dSprites (Matthey et al., 2017), color dSprites (Color for short), scream dSprites (Scream for short) (Locatello et al., 2019), and SmallNORB (LeCun et al., 2004). The images of dSprites are strictly generated by five factors: three shapes (square, ellipse, and heart); six scale values (0.5, 0.6, 0.7, 0.8, 0.9, and 1.0); 40 orientation values in \([0, 2\pi]\); 32 position X values; and 32 position Y values. The two variants of dSprites (Color and Scream) introduce random noise and are closer to real-world conditions. SmallNORB is generated from 3D objects and is much more complex than 2D shapes. It contains five generic categories (four-legged animals, human figures, airplanes, trucks, and cars); nine elevation values (30, 35, 40, 45, 50, 55, 60, 65, and 70); eighteen azimuth values (0, 20, 40, ..., 340); and six lighting conditions.

Fig. 6: The distribution of IFPs on four datasets. The red number denotes the pressure required to separate the factors. SmallNORB has four factors (category: CAT, elevation: EL, azimuth: AZ, lighting condition: LT); the three variants of dSprites have five factors (shape: SHP, scale: SC, orientation: ORIEN, position X: posX, position Y: posY)

Information freezing point The ideal situation is to find a set of \(\beta \) values that isolates the IFPs into several non-overlapping regions. To obtain the distribution of IFPs with respect to a factor \({c}_i\), we enumerate all possible values of factor \({c}_i\) for a random sample and calculate its IFP using the algorithm introduced in Sect. 4.1; we repeat this procedure 50 times to estimate the IFP distribution of \({c}_i\). We measured the IFPs of the factors on the four datasets, as shown in Fig. 6. dSprites and Color have more separable IFPs than Scream and SmallNORB. Although the three variants of dSprites share the same factors, their IFPs differ; this difference in IFP distributions explains why current approaches fail to transfer hyperparameters across different problems in Locatello et al. (2019). Note that the IFP distributions are almost separable for dSprites and Color, and the ground-truth factors are independent in all four datasets. In summary, all four datasets have independent factors, while the IFP distributions of Scream and SmallNORB are inseparable. Based on the IFP distributions, we summarize the training settings for DEFT in Table 2. We tuned the hyperparameters of the compared approaches for the highest MIG and report these settings in Table 3.
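
A hedged sketch of this estimation procedure follows; `dataset.sample_fixed` and `make_model` are assumed helpers (the first returns the images obtained by sweeping one factor while the others stay fixed, the second builds a fresh \(\beta\)-VAE, data loader, and optimizer), and `annealing_test` is the sketch from Sect. 4.1.

```python
import numpy as np

def ifp_distribution(dataset, factor_index, make_model, n_trials=50):
    """Estimate the IFP distribution of one factor: fix a random setting of the
    other factors, sweep factor `factor_index` over all its values, and run the
    annealing test on the resulting image set; repeat n_trials times."""
    ifps = []
    for _ in range(n_trials):
        images = dataset.sample_fixed(factor_index)   # sweep one factor only
        model, loader, optimizer = make_model(images)  # fresh beta-VAE per trial
        ifps.append(annealing_test(model, loader, optimizer))
    return np.array(ifps)
```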

Table 2 Experimental settings for DEFT. \(\gamma {}\) is always \(0.1\) (see Sect. 5.6). The number of iterations per stage (N) is sufficiently large for the objective to converge. The number of latents per sub-encoder (K) is not less than the number of newly learned factors. The number of sub-encoders (G) is determined by the number of separable regions in Fig. 6
Table 3 Experimental settings for the compared approaches
Fig. 7: The performance on three disentanglement metrics (FactorVAE (Kim and Mnih, 2018), MIG (Chen et al., 2018), DCI dis. (Eastwood and Williams, 2018)) for five approaches (V=\(\beta \)-VAE, TV=\(\beta \)-TCVAE, AV=AnnealedVAE, CV=CascadeVAEC, DF=DEFT) on four datasets (Color, dSprites, SmallNORB, Scream)

Fig. 8: MIG distribution of DEFT on four datasets at different stages

Fig. 9: Reconstruction error for different approaches and datasets (V=\(\beta \)-VAE, TV=\(\beta \)-TCVAE, AV=AnnealedVAE, CV=CascadeVAEC, DF=DEFT)

5.3 Performance

We trained each model 50 times and compared our model with the other four disentanglement approaches on dSprites, Color, Scream, and SmallNORB.

Disentanglement metric We show the disentanglement metrics in Fig. 7. All approaches perform worse on Scream and SmallNORB, where the IFP distributions are inseparable. \(\beta \)-VAE and \(\beta \)-TCVAE perform similarly on all four datasets. CascadeVAEC performs well on the three variants of dSprites but has high variance in most cases. DEFT outperforms the others in most cases and has lower variance. A downside of DEFT is that it reduces the search space of possible solutions, for better or worse; as a result, its best-performing models score lower. The distributions of MIG at different stages are shown in Fig. 8. The results on all four datasets show that DEFT obtains low scores at the first stage and gradually improves disentanglement in the following stages.

Reconstruction quality We also show the distributions of the reconstruction error in Fig. 9. CascadeVAEC and DEFT generally produce higher-quality reconstructions. Although CascadeVAEC beats DEFT in some cases (dSprites and SmallNORB), the improvements are negligible compared with the overall errors (10% for dSprites, 2% for SmallNORB), and the differences are indistinguishable to the human eye. In general, DEFT reduces variance by blocking partial information and achieves both high image quality and disentanglement.

Failure rate We define the failure rate as the percentage of models that fail to learn a disentangled representation, i.e., whose MIG score is lower than 0.1. Table 4 shows the failure rates; DEFT has the lowest average failure rate. Although AnnealedVAE succeeds in disentangling factors on three datasets, it fails on Scream in most cases. It is possible to reduce the failure rate of AnnealedVAE on Scream, but we tried six settings and none of them performed well on all datasets. From the IFP distributions in Fig. 6, SmallNORB has a separable factor that is easy to disentangle, which gives SmallNORB a high lower bound on disentanglement despite its low overall score. Generally, DEFT significantly decreases the failure rate compared to the other approaches.
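
For completeness, the failure rate amounts to the simple computation below; the function name is ours.

```python
import numpy as np

def failure_rate(mig_scores, threshold=0.1):
    """Percentage of trained models whose MIG falls below the threshold."""
    return 100.0 * float((np.asarray(mig_scores) < threshold).mean())
```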

Table 4 Failure rate (%) for each approach (column) and dataset (row)

Visualization Higgins et al. (2017a) introduced the latent traversal, which visualizes the generated images while traversing a single latent \({z}_i\). Fig. 10 shows the latent traversals of the best model (highest MIG score). One can see the intrinsic relationship between IFP and disentanglement: orientation has the lowest IFP among all factors and is the hardest one for all approaches to disentangle. For SmallNORB, the lighting condition is separable from the other factors and is easy to disentangle. For Scream, three factors have similar IFP distributions, making it a hard problem for all disentanglement approaches.

Fig. 10: Latent traversal of DEFT on four datasets (MIG score). Each column shows the images obtained by traversing a latent variable \({z}_i\) representing a factor, with its VIR in the last row (Suter et al., 2019). We choose the variable with the highest MI for each factor. The same variable has the highest MI for both elevation and lighting condition

5.4 Correlative but separable

To demonstrate the superiority of IB-based approaches, we built a dataset of a triangle with three factors (posX, posY, and orientation), where posX and posY are independent and the triangle always points to the center of the canvas, \(\theta = \arctan (\text {posY}-16,\text {posX}-16)\). Figure 11a shows samples from this toy dataset. We trained CascadeVAEC, \(\beta \)-TCVAE (\( \beta =6 \)), and DEFT (\( \text {K}=2,\text {G}=2 \)) for 10,000 steps, repeated 10 times. As shown in Fig. 11b, all three approaches disentangle posX and posY successfully. However, only DEFT extracts the orientation information (\( I({z}_4;\text {orientation}) \) is high, while \( I({z}_4;\text {posX}) \) and \( I({z}_4;\text {posY}) \) are low). DEFT also achieves higher disentanglement scores on all three metrics, as shown in Fig. 11c. The latent traversal in Fig. 12 shows that DEFT attains high image quality and separates the orientation information. The correlation makes it difficult for \(\beta \)-VAE to disentangle orientation.
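
The factor-sampling rule of this toy dataset is sketched below (the two-argument arctan is atan2); the canvas size of 32 and the discrete position grid are our assumptions, and the image rendering itself is omitted.

```python
import numpy as np

def sample_triangle_factors(n, size=32):
    """Factors of the toy dataset: posX and posY are sampled independently,
    while the orientation always points to the canvas center, so it is a
    deterministic (hence correlated) function of position."""
    pos_x = np.random.randint(0, size, n)
    pos_y = np.random.randint(0, size, n)
    theta = np.arctan2(pos_y - 16, pos_x - 16)   # orientation toward the center
    return pos_x, pos_y, theta
```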

Fig. 11: a Dataset visualization. b NMI matrix \(I({z}_i;{c}_j)\) for the three approaches. c Disentanglement scores for the different approaches

Fig. 12: Latent traversal of CascadeVAEC, \(\beta \)-TCVAE, and DEFT on the separable but correlated dataset. Each column shows the reconstructed images obtained by traversing the variable from -2 to 2

5.5 Unsupervised problem

3D Chairs (Aubry et al., 2014) is an unlabeled dataset containing 1394 3D models from the Internet.

Annealing test without supervision In common situations, label information is unavailable, so the factors’ IFP distributions are hard to obtain. Instead, we estimate an upper bound of the IFP distribution in the unsupervised setting. Intuitively, the rate of information increment changes when a new factor starts to freeze. We conducted an annealing test on dSprites and 3D Chairs without labels and plotted the curves of \(\beta\) versus \( \Delta I(x;z) \) in Fig. 13. This estimate agrees with the upper bound of the IFP distribution for position and scale, as shown in Fig. 13a. In Fig. 13b, one can recognize the points where the latent information suddenly increases, at \(\beta \approx 36\) and \(\beta \approx 16\). Although this method still needs human participation, it shows the potential for a fully unsupervised separation procedure. We therefore set \( \text {G = 3}, \text {K = 3}, \beta _j=\{36, 16, 1\} \) for 3D Chairs and trained DEFT for 20 epochs per stage. We compared the performance with \(\beta \)-TCVAE and CascadeVAEC on 3D Chairs, as shown in Fig. 14. DEFT learns one additional interpretable property compared with CascadeVAEC: leg orientation.
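
One way to flag such change points automatically is sketched below; the detection rule (a jump above a multiple of the median increment so far) is a heuristic of ours, not the exact criterion used here, and in the paper the final separation is still chosen by a human.

```python
import numpy as np

def information_increment_points(betas, mi_values, rel_jump=2.0):
    """Flag candidate separation points in the unsupervised annealing test:
    values of beta at which the increment of I(x; z) suddenly grows."""
    increments = np.diff(mi_values)          # recorded while beta is annealed down
    flagged = []
    for i in range(1, len(increments)):
        baseline = np.median(np.abs(increments[:i])) + 1e-8
        if increments[i] > rel_jump * baseline:
            flagged.append(betas[i + 1])     # beta at which the jump occurs
    return flagged
```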

Fig. 13: Information increment variability. The broken line denotes the trend of the mutual information increment, the dot denotes a mutation point of the increment, and the star denotes the selected separation of the IFP distributions

Fig. 14: Latent traversal on 3D Chairs. Each row shows the reconstructed images obtained by traversing the corresponding variable from -2 to 2. Two samples are shown for each factor, separated by a line

5.6 Analysis

We introduce the following metrics to analyze the disentanglement problems during training in detail:

$$\begin{aligned} \begin{aligned} \text {NMI1} = \frac{1}{\Vert {c} \Vert } \sum _{i=1}^{\Vert {c} \Vert } \text {NMI}({c}_i,1), \ \text {NMI2} = \frac{1}{\Vert {c} \Vert } \sum _{i=1}^{\Vert {c} \Vert } \text {NMI}({c}_i,2). \end{aligned} \end{aligned}$$
(8)

NMI1 denotes the major information representing the factors and should be as large as possible (1 at maximum). In contrast, NMI2 indicates the information diffused away from the major latent variables and should be as small as possible (0 at minimum).

To analyze the effects of the techniques applied in DEFT, a simple model with only two stages is examined on dSprites. All experiments share the same first-stage setting and differ only in their second-stage settings. At the first stage, the model was trained with \(\beta ^1=70\) for 15,000 iterations so that, according to the IFP distribution in Fig. 6, it could only learn a disentangled representation of posX and posY.

Piecewise pressure At the second stage, the model was trained for 15,000 iterations with different values of \(\beta ^2\). The experimental results in Fig. 15 show that a lower \(\beta ^2\) helps the model learn the less significant factors with small IFPs (shape and orientation) but hinders improvement on the factors with high IFPs (scale, posX, and posY). \(\beta ^2\) acts as a valve for the passing information, and impure information is harmful to disentanglement. Note that even an increased \(\beta ^2\) still suffers from a growing \(\text {NMI}({c},2)\), so adjusting the pressure alone cannot improve disentanglement.

Backward information scaling At the second stage, we trained the model with \(\beta ^2=30\) for 15,000 iterations across different values of \(\gamma {}\). As shown in Fig. 16, the diffused information (scale) decreases as \(\gamma \) is reduced, relieving the ID problem. However, \(\text {NMI}({c},1)\) reaches its lowest value when all backward information is clipped (\(\gamma =0\)), which prevents the model from extracting new factors. \(\text {NMI}({c},2)\) increases together with \(\gamma {}\); a small value of \(\gamma {}\) is sufficient to learn the major information while preventing it from diffusing into other variables. In conclusion, \(\beta \) controls the passing information, and a large value yields pure information; \(\gamma \) retards the increment of NMI2; disentanglement benefits from both techniques through relieving the ID problem.

Fig. 15: Each row shows \(\text {NMI}({c},1)\) or \(\text {NMI}({c},2)\) in an independent trial with different values of \(\beta \)

Fig. 16: Each row shows \(\text {NMI}({c},1)\) or \(\text {NMI}({c},2)\) in an independent trial with different values of \(\gamma \)

Comparison To see the overall effect of DEFT, we compare DEFT with AnnealedVAE and CascadeVAEC on dSprites over 10 runs (see Table 2 for details). Note that we use the standard DEFT in this part. From Fig. 17a, one can see a decline in NMI1 between iterations 1,000 and 3,000 for AnnealedVAE and an overall low level of NMI1 for CascadeVAEC, whereas DEFT shows a steady improvement and a high level of NMI1. We also show NMI2 for four values of \(\gamma \) (0, 0.1, 0.5, 1) in Fig. 17b. The curves with error regions over 10 trials demonstrate that \(\gamma =0.1\) achieves a lower NMI2.

Fig. 17: Comparison of three models and four values of \(\gamma \)

5.7 Complexity

The difference between DEFT and the other approaches lies mainly in the encoder: DEFT uses a fractional encoder composed of several sub-encoders. Let \(\varvec{W}_{(\mathrm {G}\times \mathrm {K}) \times \mathrm {M}}\) represent the parameters of a normal encoder and \(\varvec{W}_{\mathrm {K} \times \mathrm {M}}^i\) the parameters of one sub-encoder of a fractional encoder, where M is the input dimension. Ideally, the computational costs of both should be the same, \(J(\varvec{W} x) = \sum _{i=1}^\mathrm {G} J(\varvec{W^i} x)\); in practice there are extra operations, such as the iterative loop and the concatenation of the latent variables. For a fair comparison, we set the latent dimension to 1,000 and adjust the number of channels in the convolutional layers so that the fractional encoder and the normal one have the same total number of parameters. Each trial randomly generates a batch of 256 samples and runs one forward and one backward pass. Table 5 shows the mean and standard deviation of the runtime (in seconds) over 100 trials. Overall, the extra cost of the fractional encoder is only about 6.9% for G = 100, which is acceptable in practice, and common disentanglement tasks usually have fewer than ten ground-truth factors.
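
A simplified sketch of this timing comparison is given below, using parameter-matched linear layers in place of the convolutional encoders of Table 1; the input dimension M and the choice of linear layers are our assumptions.

```python
import statistics
import time
import torch
import torch.nn as nn

def time_encoders(M=4096, G=100, K=10, batch=256, trials=100):
    """Runtime of a normal encoder (one (G*K) x M map) versus a fractional
    encoder (G sub-encoders of size K x M whose outputs are concatenated).
    Both have the same parameter count."""
    normal = nn.Linear(M, G * K)
    fractional = nn.ModuleList(nn.Linear(M, K) for _ in range(G))
    x = torch.randn(batch, M)

    def measure(forward):
        times = []
        for _ in range(trials):
            start = time.perf_counter()
            forward(x).sum().backward()              # one forward and backward pass
            times.append(time.perf_counter() - start)
        return statistics.mean(times), statistics.stdev(times)

    t_normal = measure(normal)
    t_fractional = measure(lambda inp: torch.cat([m(inp) for m in fractional], dim=1))
    return t_normal, t_fractional
```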

Table 5 The computational cost (seconds) for the normal encoder and the fractional encoder

6 Conclusion

Based on existing studies of IBs, we have developed new insights into why these approaches perform worse than TC-based ones. In particular, we identified the IFP distribution of each factor by performing an annealing test; a dataset is easy to disentangle if its IFP distributions are separable. Furthermore, we found that the ID problem is an invisible hurdle that prevents steady improvements in disentanglement. We proposed DEFT, which retains the learned information by blocking partial information; in addition, scaling the backward information helps relieve the ID problem. Our results show that IB-based approaches are competitive and have the potential to solve problems with correlated factors.

We verified that the ID problem causes the low performance of IB-based approaches. However, as a straightforward solution, DEFT still needs further improvement. In the future, an automatic way to select the best separation of the IFP distributions is highly desirable.