
Gated information bottleneck for generalization in sequential environments

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Deep neural networks suffer from poor generalization to unseen environments when the underlying data distribution differs from that of the training set. By learning minimum sufficient representations from training data, the information bottleneck (IB) approach has demonstrated its effectiveness in improving generalization across different AI applications. In this work, we propose a new neural network-based IB approach, termed gated information bottleneck (GIB), that dynamically drops spurious correlations and progressively selects the most task-relevant features across different environments via a trainable soft mask (on raw features). Using the recently proposed matrix-based Rényi’s \(\alpha \)-order mutual information estimator, GIB enjoys a simple and tractable objective, without any variational approximation or distributional assumption. We empirically demonstrate the superiority of GIB over other popular neural network-based IB approaches in adversarial robustness and out-of-distribution detection. Meanwhile, we also establish the connection between IB theory and invariant causal representation learning, and observe that GIB demonstrates appealing performance when different environments arrive sequentially, a more practical scenario in which invariant risk minimization fails.


Notes

  1. https://github.com/rois-codh/kmnist.

  2. https://www.nist.gov/itl/products-and-services/emnist-dataset.

  3. https://github.com/facebookresearch/GradientEpisodicMemory.

  4. https://github.com/mattriemer/mer.

References

  1. Alesiani F, Yu S, Yu X (2021) Gated information bottleneck for generalization in sequential environments. In: IEEE International Conference on Data Mining

  2. Tishby N, Pereira FC, Bialek W (2000) The information bottleneck method. arXiv preprint arXiv:physics/0004057

  3. Alemi AA, Fischer I, Dillon JV, Murphy K (2017) Deep variational information bottleneck. In: International Conference on Learning Representations

  4. Wieczorek A, Roth V (2020) On the difference between the information bottleneck and the deep information bottleneck. Entropy 22(2):131

  5. Gilad-Bachrach R, Navot A, Tishby N (2003) An information theoretic tradeoff between complexity and accuracy. Learning theory and kernel machines. Springer, London, pp 595–609

  6. Shwartz-Ziv R, Tishby N (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810

  7. Yu S, Principe JC (2019) Understanding autoencoders with information theoretic concepts. Neural Netw 117:104–123

  8. Kolchinsky A, Tracey BD, Wolpert DH (2019) Nonlinear information bottleneck. Entropy 21(12):1181

  9. Fischer I (2020) The conditional entropy bottleneck. Entropy 22(9):999

  10. Mahabadi RK, Belinkov Y, Henderson J (2021) Variational information bottleneck for effective low-resource fine-tuning. In: International Conference on Learning Representations

  11. Kim J, Kim M, Woo D, Kim G (2021) Drop-Bottleneck: learning discrete compressed representation for noise-robust exploration. arXiv preprint arXiv:2103.12300

  12. Fischer I, Alemi AA (2020) CEB improves model robustness. Entropy 22(10):1081

  13. Giraldo LGS, Rao M, Principe JC (2014) Measures of entropy from data using infinitely divisible kernels. IEEE Trans Inf Theor 61(1):535–548

  14. Yu S, Alesiani F, Yu X, Jenssen R, Principe J (2021) Measuring dependence with matrix-based entropy functional. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp 10781–10789

  15. Arjovsky M, Bottou L, Gulrajani I, Lopez-Paz D (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893

  16. Mitrovic J, McWilliams B, Walker J, Buesing L, Blundell C (2020) Representation learning via invariant causal mechanisms. arXiv preprint arXiv:2010.07922

  17. Ahuja K, Shanmugam K, Varshney K, Dhurandhar A (2020) Invariant risk minimization games. arXiv preprint arXiv:2002.04692

  18. Pearl J (2009) Causality. Cambridge University Press, Cambridge

  19. Peters J, Bühlmann P, Meinshausen N (2016) Causal inference by using invariant prediction: identification and confidence intervals. J Royal Stat Soc Ser B Stat Methodol 78(5):947–1012

  20. Moyer D, Gao S, Brekelmans R, Steeg GV, Galstyan A (2018) Invariant representations without adversarial training. arXiv preprint arXiv:1805.09458

  21. Achille A, Soatto S (2018) Emergence of invariance and disentanglement in deep representations. J Mach Learn Res 19(1):1947–1980

  22. Amjad RA, Geiger BC (2019) Learning representations for neural network-based classification using the information bottleneck principle. IEEE Trans Pattern Anal Mach Intell 42(9):2225–2239

  23. Bengio Y, Léonard N, Courville A (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432

  24. Goodfellow IJ, Shlens J, Szegedy C (2015) Explaining and harnessing adversarial examples. In: International Conference on Learning Representations

  25. Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A (2018) Towards deep learning models resistant to adversarial attacks. In: International Conference on Learning Representations

  26. Hendrycks D, Gimpel K (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: International Conference on Learning Representations

  27. Clanuwat T, Bober-Irizar M, Kitamoto A, Lamb A, Yamamoto K, Ha D (2018) Deep learning for classical Japanese literature. arXiv preprint arXiv:1812.01718

  28. Cohen G, Afshar S, Tapson J, Van Schaik A (2017) EMNIST: extending MNIST to handwritten letters. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp 2921–2926. IEEE

  29. Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A et al (2017) Overcoming catastrophic forgetting in neural networks. Proc National Acad Sci 114(13):3521–3526

  30. Lopez-Paz D, Ranzato M (2017) Gradient episodic memory for continual learning. Adv Neural Inf Process Syst, pp 6467–6476

  31. Riemer M, Cases I, Ajemian R, Liu M, Rish I, Tu Y, Tesauro G (2018) Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910

  32. Tishby N, Zaslavsky N (2015) Deep learning and the information bottleneck principle. In: 2015 IEEE Information Theory Workshop (ITW), pp 1–5. IEEE

  33. Zaidi A, Estella-Aguerri I et al (2020) On the information bottleneck problems: models, connections, applications and information theoretic views. Entropy 22(2):151

  34. Goldfeld Z, Polyanskiy Y (2020) The information bottleneck problem and its applications in machine learning. IEEE J Selected Areas Inf Theor 1(1):19–38

  35. Achille A, Soatto S (2018) Information dropout: learning optimal representations through noisy computation. IEEE Trans Pattern Anal Mach Intell 40(12):2897–2905

  36. Kolchinsky A, Tracey BD (2017) Estimating mixture entropy with pairwise distances. Entropy 19(7):361

  37. Belghazi MI, Baratin A, Rajeshwar S, Ozair S, Bengio Y, Courville A, Hjelm D (2018) Mutual information neural estimation. In: International Conference on Machine Learning, pp 531–540. PMLR

  38. Elad A, Haviv D, Blau Y, Michaeli T (2019) Direct validation of the information bottleneck principle for deep nets. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops

  39. Strouse D, Schwab DJ (2017) The deterministic information bottleneck. Neural Comput 29(6):1611–1630

  40. Kolchinsky A, Tracey BD, Van Kuyk S (2019) Caveats for information bottleneck in deterministic scenarios. In: International Conference on Learning Representations

  41. Li XL, Eisner J (2019) Specializing word embeddings (for parsing) by information bottleneck. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2744–2754

  42. Igl M, Ciosek K, Li Y, Tschiatschek S, Zhang C, Devlin S, Hofmann K (2019) Generalization in reinforcement learning with selective noise injection and information bottleneck. Adv Neural Inf Process Syst 32:13978–13990

  43. Goyal A, Islam R, Strouse D, Ahmed Z, Larochelle H, Botvinick M, Bengio Y, Levine S (2019) InfoBot: transfer and exploration via the information bottleneck. In: International Conference on Learning Representations

  44. Kim Y, Nam W, Kim H, Kim J-H, Kim G (2019) Curiosity-bottleneck: Exploration by distilling task-specific novelty. In: International Conference on Machine Learning, pp 3379–3388. PMLR

  45. Yu X, Yu S, Príncipe JC (2021) Deep deterministic information bottleneck with matrix-based entropy functional. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 3160–3164. IEEE

  46. Wu T, Ren H, Li P, Leskovec J (2020) Graph information bottleneck. Adv Neural Inf Process Syst 33:20437–20448

  47. Yu J, Xu T, Rong Y, Bian Y, Huang J, He R (2021) Graph information bottleneck for subgraph recognition. In: International Conference on Learning Representations

  48. Zheng K, Yu S, Li B, Jenssen R, Chen B (2022) BrainIB: interpretable brain network-based psychiatric diagnosis with graph information bottleneck. arXiv preprint arXiv:2205.03612

  49. Hjelm RD, Fedorov A, Lavoie-Marchildon S, Grewal K, Bachman P, Trischler A, Bengio Y (2019) Learning deep representations by mutual information estimation and maximization. International Conference for Learning Representations (ICLR)

  50. Velickovic P, Fedus W, Hamilton WL, Liò P, Bengio Y, Hjelm RD (2019) Deep graph infomax. ICLR 2(3):4

  51. Belinkov Y, Henderson J, et al (2020) Variational information bottleneck for effective low-resource fine-tuning. In: International Conference on Learning Representations

  52. Wang B, Wang S, Cheng Y, Gan Z, Jia R, Li B, Liu J (2021) InfoBERT: improving robustness of language models from an information theoretic perspective. International Conference on Learning Representations

  53. Bengio Y, Deleu T, Rahaman N, Ke R, Lachapelle S, Bilaniuk O, Goyal A, Pal C (2019) A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912

  54. Krueger D, Caballero E, Jacobsen J-H, Zhang A, Binas J, Zhang D, Priol RL, Courville A (2020) Out-of-distribution generalization via risk extrapolation (REx). arXiv preprint arXiv:2003.00688

  55. Javed K, White M, Bengio Y (2020) Learning causal models online. arXiv preprint arXiv:2006.07461

  56. Ahuja K, Caballero E, Zhang D, Bengio Y, Mitliagkas I, Rish I (2021) Invariance principle meets information bottleneck for out-of-distribution generalization. arXiv preprint arXiv:2106.06607

  57. Li B, Shen Y, Wang Y, Zhu W, Reed CJ, Zhang J, Li D, Keutzer K, Zhao H (2021) Invariant information bottleneck for domain generalization. arXiv preprint arXiv:2106.06333

  58. Huszár F (2019) Invariant risk minimization: an information theoretic view. Blog Post. https://www.inference.vc/invariant-risk-minimization/

  59. Zhao H, Des Combes RT, Zhang K, Gordon G (2019) On learning invariant representations for domain adaptation. In: International Conference on Machine Learning, pp 7523–7532. PMLR

  60. Zhao S, Gong M, Liu T, Fu H, Tao D (2020) Domain generalization via entropy regularization. Adv Neural Inf Process Syst 33:14675–14687

  61. Romano S, Chelly O, Nguyen V, Bailey J, Houle ME (2016) Measuring dependency via intrinsic dimensionality. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp 1207–1212. IEEE

Author information

Correspondence to Francesco Alesiani or Shujian Yu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This paper was presented in part at the 2021 IEEE International Conference on Data Mining (ICDM) [1].

Appendix A Proofs of Lemma 1 and Lemma 2

Before providing the proofs of Lemma 1 and Lemma 2, let us first introduce the following three lemmas.

Lemma 3

Given two random variables X, Z, whose joint distribution is given by \(p(z,x)\) and conditional distribution by \(p(z \vert x)\), the KL divergence between \(p(z \vert x)\) and p(z) is upper bounded by:

$$\begin{aligned} {\text {KL}}( p(z|x)||p(z))&\le {{\,\mathrm{{\mathbb {E}}}\,}}_{x'} {\text {KL}}( p(z|x)||p(z|x') ) \end{aligned}$$
(A1)

Proof of Lemma 3

Indeed, we have:

$$\begin{aligned} {\text {KL}}( p(z|x)||p(z))&= \sum _{z} p(z|x) \ln \frac{p(z|x)}{p(z)} \\&= \sum _{z} p(z|x) \ln p(z|x) - \sum _{z} p(z|x) \ln {p(z)} \\&= \sum _{z} p(z|x) \ln p(z|x) - \sum _{z} p(z|x) \ln {\sum _{x'} p(z|x') p(x')} \\&\le \sum _{z} p(z|x) \ln p(z|x) - \sum _{x'} p(x') \sum _{z} p(z|x) \ln { p(z|x') } \\&= \sum _{x'} p(x') \left[ \sum _{z} p(z|x) \ln p(z|x) - \sum _{z} p(z|x) \ln { p(z|x') } \right] \\&= \sum _{x'} p(x') \left[ \sum _{z} p(z|x) \ln \frac{p(z|x)}{p(z|x')} \right] \\&= {{\,\mathrm{{\mathbb {E}}}\,}}_{x'} {\text {KL}}( p(z|x)||p(z|x')), \end{aligned}$$

where we used \(p(z) = \sum _{x'} p(z|x') p(x') \) and \(\ln {\sum _{x'} p(z|x') p(x')} \ge \sum _{x'} p(x') \ln { p(z|x') }\). \(\square \)
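To make the inequality concrete, the following small numerical sketch (not part of the paper; it assumes discrete X and Z and uses numpy only) draws a random conditional p(z|x) and marginal p(x), and checks that KL(p(z|x)||p(z)) is indeed no larger than the expected KL on the right-hand side of (A1) for every value of x.

```python
# Numerical sanity check of Lemma 3 for discrete random variables.
# Illustrative sketch only, not code from the paper.
import numpy as np

rng = np.random.default_rng(0)
nx, nz = 5, 4                                        # number of states of X and Z

p_x = rng.dirichlet(np.ones(nx))                     # p(x)
p_z_given_x = rng.dirichlet(np.ones(nz), size=nx)    # rows: p(z | x)
p_z = p_x @ p_z_given_x                              # p(z) = sum_x p(x) p(z|x)

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return np.sum(p * np.log(p / q))

for x in range(nx):
    lhs = kl(p_z_given_x[x], p_z)
    # E_{x'} KL( p(z|x) || p(z|x') )
    rhs = sum(p_x[xp] * kl(p_z_given_x[x], p_z_given_x[xp]) for xp in range(nx))
    assert lhs <= rhs + 1e-12, (lhs, rhs)
    print(f"x={x}:  KL(p(z|x)||p(z)) = {lhs:.4f}  <=  E_x' KL = {rhs:.4f}")
```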

Lemma 4

Given two random variables X, Z, with two joint distributions given by \(p^1(z,x),p^2(z,x)\) and conditional distributions \(p^1(z|x),p^2(z|x)\), the KL divergence between \(p^1(z|x)\) and \(p^2(z)\) is upper bounded by:

$$\begin{aligned} {\text {KL}}( p^1(z|x)||p^2(z))&\le {{\,\mathrm{{\mathbb {E}}}\,}}_{x'} {\text {KL}}( p^1(z|x)||p^2(z|x')) \end{aligned}$$
(A2)

Proof of Lemma 4

The derivation is similar to that of Lemma 3, where we used \(p^2(z) = \sum _{x'} p^2(z|x') p^2(x') \). \(\square \)

Given Lemma 3 and Lemma 4, we can now introduce the main inequality in the following Lemma.

Lemma 5

Given two random variables X, Z, whose joint distribution is given by \(p(z,x)\) and conditional distribution by \(p(z|x)\), the mutual information \(I(x,z)\) is upper bounded by:

$$\begin{aligned} I(x,z)&\le {{\,\mathrm{{\mathbb {E}}}\,}}_x {{\,\mathrm{{\mathbb {E}}}\,}}_{x'} {\text {KL}}( p(z|x)||p(z|x')) \end{aligned}$$
(A3)

Proof of Lemma 5

The mutual information is the KL divergence between the joint distribution and the product of the marginal distributions:

$$\begin{aligned} I(x,z)&= {\text {KL}}(p(x,z)||p(x)p(z)) \end{aligned}$$
(A4)
$$\begin{aligned}&= {{\,\mathrm{{\mathbb {E}}}\,}}_x {\text {KL}}(p(z|x)||p(z)). \end{aligned}$$
(A5)

We thus build an upper bound to the mutual information:

$$\begin{aligned} I(x,z)&\le {{\,\mathrm{{\mathbb {E}}}\,}}_x {{\,\mathrm{{\mathbb {E}}}\,}}_{x'} {\text {KL}}( p(z|x)||p(z|x')), \end{aligned}$$
(A6)

where we used Lemma 3. \(\square \)
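Building on the previous sketch, the bound (A6) can also be checked directly: compute \(I(x,z) = {{\,\mathrm{{\mathbb {E}}}\,}}_x {\text {KL}}(p(z|x)||p(z))\) as in (A5) and compare it with the double expectation on the right-hand side. Again, this is only an illustrative check under the same discrete setup, not code from the paper.

```python
# Sketch: verify I(x,z) <= E_x E_x' KL( p(z|x) || p(z|x') ) for a random
# discrete joint distribution (illustrative only, same setup as above).
import numpy as np

rng = np.random.default_rng(1)
nx, nz = 6, 3
p_x = rng.dirichlet(np.ones(nx))
p_z_given_x = rng.dirichlet(np.ones(nz), size=nx)
p_z = p_x @ p_z_given_x

kl = lambda p, q: np.sum(p * np.log(p / q))

mi = sum(p_x[x] * kl(p_z_given_x[x], p_z) for x in range(nx))          # (A5)
bound = sum(p_x[x] * p_x[xp] * kl(p_z_given_x[x], p_z_given_x[xp])
            for x in range(nx) for xp in range(nx))                    # (A6)

print(f"I(x,z) = {mi:.4f}  <=  upper bound = {bound:.4f}")
assert mi <= bound + 1e-12
```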

We now provide the proofs of Lemma 1 and Lemma 2, respectively.

1.1 Proof of Lemma 1

This inequality can be obtained by expanding the two distributions and using the concavity of the logarithm (Jensen's inequality):

$$\begin{aligned} {\text {KL}}( p^1(z) || p^2(z))&= {{\,\mathrm{{\mathbb {E}}}\,}}_{p^1(z)} \ln \frac{p^1(z)}{p^2(z)} \\&=\sum _{z} p^1(z) \ln p^1(z) - \sum _{z} p^1(z) \ln p^2(z) \\&=\sum _{x} p^1(x) \sum _{z} p^1(z|x) \ln \sum _{x'} p^1(x') p^1(z|x') - \sum _{z} p^1(z) \ln p^2(z) \\&\ge \sum _{x} p^1(x) \sum _{x'} p^1(x') \sum _{z} p^1(z|x) \ln p^1(z|x') - \sum _{z} p^1(z) \ln p^2(z) \\&= \sum _{x} p^1(x) \sum _{x'} p^1(x') \sum _{z} p^1(z|x) \ln p^1(z|x') \\&\quad - \sum _{x} p^1(x) \sum _{z} p^1(z|x) \ln p^2(z) \\&= \sum _{x} p^1(x) \sum _{x'} p^1(x') \sum _{z} p^1(z|x) \ln p^1(z|x') \\&\quad - \sum _{x} p^1(x) \sum _{x'} p^1(x') \sum _{z} p^1(z|x) \ln p^2(z) \\&= \sum _{x} p^1(x) \sum _{x'} p^1(x') ( \sum _{z} p^1(z|x) \ln p^1(z|x') - \sum _{z} p^1(z|x) \ln p^2(z) ) \\&= \sum _{x} p^1(x) \sum _{x'} p^1(x') ( \sum _{z} p^1(z|x) \ln p^1(z|x') \\&\quad - \sum _{z} p^1(z|x) \ln \sum _{x''} p^2(x'') p^2(z|x'') ) \\&\approx \sum _{x} p^1(x) \sum _{x'} p^1(x') ( \sum _{z} p^1(z|x) \ln p^1(z|x') \\&\quad - \sum _{x''} p^2(x'') \sum _{z} p^1(z|x) \ln p^2(z|x'') ) \\&= \sum _{x} p^1(x) \sum _{x'} p^1(x') \sum _{x''} p^2(x'') \left[ \sum _{z} p^1(z|x) \ln p^1(z|x') \right. \\&\quad \left. - \sum _{z} p^1(z|x) \ln p^2(z|x'') \right] \\&= \sum _{x} p^1(x) \sum _{x'} p^1(x') \sum _{x''} p^2(x'') \left[ \sum _{z} p^1(z|x) \ln \frac{p^1(z|x')}{ p^2(z|x'')} \right] \\&= \sum _{x} p^1(x) \sum _{x'} p^1(x') \sum _{x''} p^2(x'') {\text {KL}}( p^1(z|x') || p^2(z|x'')) \\&= {{\,\mathrm{{\mathbb {E}}}\,}}_{x\sim p^1(x),x'\sim p^1(x),x'' \sim p^2(x)} {\text {KL}}( p^1(z|x') || p^2(z|x'')) \\&\approx \sum _{x} p^1(x) \sum _{x''} p^2(x'') {\text {KL}}( p^1(z|x) || p^2(z|x'')) \\&= {{\,\mathrm{{\mathbb {E}}}\,}}_{x,x'} {\text {KL}}( p^1(z|x) || p^2(z|x')) \end{aligned}$$

where we used \(\ln \sum _{x'} p^1(x') p^1(z|x') \ge \sum _{x'} p^1(x') \ln p^1(z|x') \), \(\sum _{x'} p^1(x') p^1(z|x') = p^1(z)\) and \(\sum _{x'} p^1(x') = 1\). In the last approximation, we substitute \(p^1(z|x') \rightarrow p^1(z|x)\).

1.2 Proof of Lemma 2

Similar to Lemma 5, this property follows from Lemma 4. Indeed, considering the cross-domain mutual information between X and Z under the two distributions, we have:

$$\begin{aligned} I^{12}(X;Z)&= {\text {KL}}(p^1(z,x)||p^2(z)p^2(x)) \end{aligned}$$
(A7)
$$\begin{aligned}&= {{\,\mathrm{{\mathbb {E}}}\,}}_x {\text {KL}}(p^1(z|x)||p^2(z)) \end{aligned}$$
(A8)
$$\begin{aligned}&\le {{\,\mathrm{{\mathbb {E}}}\,}}_x {{\,\mathrm{{\mathbb {E}}}\,}}_{x'} {\text {KL}}( p^1(z|x)||p^2(z|x')), \end{aligned}$$
(A9)

where \({\text {KL}}(p^1(z,x)||p^2(z)p^2(x)) = {{\,\mathrm{{\mathbb {E}}}\,}}_x {\text {KL}}(p^1(z|x)||p^2(z))\) holds when \(p^2(x)=p^1(x)\), and \({\text {KL}}( p^1(z|x)||p^2(z)) \le {{\,\mathrm{{\mathbb {E}}}\,}}_{x'} {\text {KL}}( p^1(z|x)||p^2(z|x')) \) follows from Lemma 4.
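The same kind of numerical check works for this cross-domain bound: draw two environment-specific conditionals \(p^1(z|x)\) and \(p^2(z|x)\) over a shared marginal p(x) and compare the two sides of (A8)–(A9). The sketch below is purely illustrative and is not taken from the paper.

```python
# Sketch: cross-domain bound of Lemma 2 with a shared p(x) and two
# environment-specific conditionals p1(z|x), p2(z|x). Illustrative only.
import numpy as np

rng = np.random.default_rng(2)
nx, nz = 5, 4
p_x = rng.dirichlet(np.ones(nx))                  # shared marginal, p1(x) = p2(x)
p1_z_x = rng.dirichlet(np.ones(nz), size=nx)      # environment 1: p1(z|x)
p2_z_x = rng.dirichlet(np.ones(nz), size=nx)      # environment 2: p2(z|x)
p2_z = p_x @ p2_z_x                               # p2(z)

kl = lambda p, q: np.sum(p * np.log(p / q))

i12 = sum(p_x[x] * kl(p1_z_x[x], p2_z) for x in range(nx))                 # (A8)
bound = sum(p_x[x] * p_x[xp] * kl(p1_z_x[x], p2_z_x[xp])
            for x in range(nx) for xp in range(nx))                        # (A9)

print(f"I12(X;Z) = {i12:.4f}  <=  bound = {bound:.4f}")
assert i12 <= bound + 1e-12
```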

1.3 Simulation details

1.3.1 Synthetic experiment

In this experiment, the features have a fixed size of \(d = 20\). Hyper-parameter search was done via grid search over the values \(\lambda \in [0, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]\), with 5 repetitions, a learning rate of \(lr=1e-4\), 10'000 iterations, and 1'000 samples. Experiments were performed on a server with 8 CPUs, 64 GB of RAM, and one GPU with 8 GB of RAM.
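For concreteness, a schematic of the search loop described above is sketched below. Here train_and_evaluate is a hypothetical placeholder for the actual training routine; only the grid values, repetition count, learning rate, iteration count, sample count, and feature size come from the description above.

```python
# Hypothetical sketch of the hyper-parameter grid search described above.
# train_and_evaluate is a placeholder, not the authors' code.
import numpy as np

LAMBDAS = [0.0, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]    # grid for the IB trade-off weight
REPETITIONS = 5
LEARNING_RATE = 1e-4
NUM_ITERATIONS = 10_000
NUM_SAMPLES = 1_000
FEATURE_DIM = 20                                  # fixed feature size d

def train_and_evaluate(lam, seed):
    """Placeholder for one training run; returns a validation score."""
    rng = np.random.default_rng(seed)
    return rng.random()                           # stand-in metric

results = {}
for lam in LAMBDAS:
    scores = [train_and_evaluate(lam, seed) for seed in range(REPETITIONS)]
    results[lam] = (np.mean(scores), np.std(scores))

best_lambda = max(results, key=lambda k: results[k][0])
print("best lambda:", best_lambda, "mean/std score:", results[best_lambda])
```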

1.3.2 Colored dataset experiment

Experiments have been conducted using an MLP with three hidden linear layers with dropout, each of 200 neurons and a ReLU activation function. Cross-entropy is used as the minimization loss. Hyper-parameter search is based on grid search. For IRM and IRMG, the best parameters suggested in the original works have been used. Regularization terms have been selected based on heuristics. Each environment has \(5'000\) samples, while the test set has \(10'0000\) samples; models are trained over 100 epochs. For MER, EWC, and GEM, the settings are those from the authors, when possible.
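A minimal PyTorch sketch of the backbone described above (three hidden linear layers of 200 units with ReLU and dropout, trained with cross-entropy) might look as follows. The input size, number of classes, and dropout rate are assumptions made only for illustration, not values reported in the paper.

```python
# Minimal sketch of the MLP backbone described above (not the authors' code).
# Input dimension, number of classes, and dropout rate are assumed values.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim=2 * 14 * 14, num_classes=2, hidden=200, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x.flatten(start_dim=1))

model = MLP()
criterion = nn.CrossEntropyLoss()                 # cross-entropy minimization loss
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(32, 2, 14, 14)                    # dummy colored-digit batch
y = torch.randint(0, 2, (32,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```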

1.4 Analysis of the complexity (time, space, sample size)

The proposed GIB algorithm requires only marginal additional complexity in time and space. The space complexity is two times the size of the features, which is marginal with respect to the number of variables in a typical DNN.
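To illustrate the "marginal" claim, the back-of-the-envelope sketch below compares the stated gate overhead, assumed here to be two trainable vectors of length equal to the feature dimension (i.e., 2d extra parameters), with the parameter count of the MLP sketched above, using the sizes mentioned in the previous sections. The gate parameterization is an assumption for illustration, not the exact GIB implementation.

```python
# Illustration of the space-overhead claim: an assumed per-feature gate with
# two trainable vectors of length d ("two times the size of features") versus
# the parameter count of a small MLP backbone (d -> 200 -> 200 -> 200 -> 2).
feature_dim = 20
gate_params = 2 * feature_dim                      # assumed 2*d gate overhead

hidden, num_classes = 200, 2
mlp_params = (feature_dim * hidden + hidden) \
           + 2 * (hidden * hidden + hidden) \
           + (hidden * num_classes + num_classes)

print(f"gate parameters: {gate_params}")
print(f"MLP parameters:  {mlp_params}")
print(f"relative overhead: {gate_params / mlp_params:.2%}")
```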

Fig. 10: (a) Total dependence value among each dimension of the representation measured by \(T_\alpha ^*\) and IDD; and (b) PGD on MNIST

1.5 The independence assumption in Drop-Bottleneck

Although Drop-Bottleneck [11] shares similar ideas with ours, i.e., it discretely drops redundant features that are irrelevant to the task, we emphasize two significant differences here:

  • Drop-Bottleneck does not build the connection between IB and invariant representation learning. It targets reinforcement learning and implements the maximization of the mutual information \(I(z,y)\) with the mutual information neural estimator (MINE), as used in the well-known Deep InfoMax. By contrast, like other popular IB approaches, we simply implement this term with a cross-entropy loss.

  • Drop-Bottleneck assumes independence between the dimensions of the representation Z for simplicity (i.e., \(p(z)=\prod _{i=1}^p p(z_i)\)), and claims that the total dependence among \(Z_1,Z_2,\ldots ,Z_p\) could decrease to zero as the optimization progresses.

Although it is hard to directly compare GIB with Drop-Bottleneck, we additionally provide two empirical justifications for why the independence assumption is harmful. Figure 10a shows the total dependence when we train an MLP on MNIST. As can be seen, the total dependence (measured by the state-of-the-art \(T_\alpha ^*\) [14] and IDD [61]) is far from zero, which indicates that the pairwise independence assumption does not hold throughout training. Figure 10b shows the result when we replace our entropy estimator (i.e., Eq. (12)) with a simpler Shannon entropy estimator that assumes full independence (i.e., \(H(z)=\sum _i^p H(z_i)\)). As can be seen, such an assumption significantly degrades the performance.
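The comparison in Figure 10a can be reproduced in spirit with the matrix-based Rényi's \(\alpha \)-order entropy functional [13, 14]: estimate the joint entropy of the representation from the normalized Hadamard product of per-dimension Gram matrices and compare H(z) with \(\sum _i H(z_i)\); the gap (sum minus joint) is an un-normalized total dependence. The sketch below uses random features and a Gaussian kernel with an arbitrary width, so the specific numbers are illustrative only and do not reproduce the paper's figures.

```python
# Sketch: matrix-based Renyi alpha-order entropy [13,14] to compare the joint
# entropy H(z) with the sum of per-dimension entropies sum_i H(z_i).
# Random features and an arbitrary kernel width; numbers are illustrative only.
import numpy as np

def gram(x, sigma=1.0):
    """Normalized Gaussian Gram matrix A with tr(A) = 1 for a 1-d sample vector."""
    d2 = (x[:, None] - x[None, :]) ** 2
    k = np.exp(-d2 / (2 * sigma ** 2))
    return k / len(x)                      # K_ii = 1, so A_ij = K_ij / n

def renyi_entropy(a, alpha=1.01):
    """S_alpha(A) = 1/(1-alpha) * log2( sum_i lambda_i(A)^alpha )."""
    eigvals = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    return np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha)

rng = np.random.default_rng(0)
n, p = 100, 5
z = rng.standard_normal((n, p))
z[:, 1] = 0.8 * z[:, 0] + 0.2 * z[:, 1]    # introduce dependence between dims

grams = [gram(z[:, i]) for i in range(p)]
sum_marginals = sum(renyi_entropy(a) for a in grams)

joint = np.ones((n, n))
for a in grams:                            # Hadamard product of Gram matrices
    joint = joint * a
joint = joint / np.trace(joint)            # renormalize to unit trace
h_joint = renyi_entropy(joint)

print(f"sum_i H(z_i)   = {sum_marginals:.3f}")
print(f"H(z) (joint)   = {h_joint:.3f}")
print(f"gap (total dep) = {sum_marginals - h_joint:.3f}")
```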

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Alesiani, F., Yu, S. & Yu, X. Gated information bottleneck for generalization in sequential environments. Knowl Inf Syst 65, 683–705 (2023). https://doi.org/10.1007/s10115-022-01770-w
