
Parameter identifiability of a deep feedforward ReLU neural network

Published in: Machine Learning

Abstract

The possibility for one to recover the parameters—weights and biases—of a neural network thanks to the knowledge of its function on a subset of the input space can be, depending on the situation, a curse or a blessing. On one hand, recovering the parameters allows for better adversarial attacks and could also disclose sensitive information from the dataset used to construct the network. On the other hand, if the parameters of a network can be recovered, it guarantees the user that the features in the latent spaces can be interpreted. It also provides foundations to obtain formal guarantees on the performances of the network. It is therefore important to characterize the networks whose parameters can be identified and those whose parameters cannot. In this article, we provide a set of conditions on a deep fully-connected feedforward ReLU neural network under which the parameters of the network are uniquely identified—modulo permutation and positive rescaling—from the function it implements on a subset of the input space.


Notes

  1. For clarity of the proofs, we index the layers from K (input) to 0 (output). The input layer is not counted, hence the ‘K layers’.


Acknowledgements

Our work has benefited from the AI Interdisciplinary Institute ANITI. ANITI is funded by the French “Investing for the Future - PIA3” program under the Grant agreement no ANR-19-PI3A-0004. The authors gratefully acknowledge the support of the DEEL project https://www.deel.ai/.

Funding

This work was funded by ANITI.

Author information


Contributions

All authors contributed to the main ideas present in this manuscript. The first draft of the manuscript was written by JBP and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Joachim Bona-Pellissier is the main contributor to the manuscript.

Corresponding author

Correspondence to Joachim Bona-Pellissier.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Consent for publication

The authors all agree to publish the article in Machine Learning.

Additional information

Editor: Hendrik Blockeel.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Definitions, notations and preliminary results

Appendix 1 is structured as follows: after giving some notations in Sect. 1.1, we recall the definition of a continuous piecewise linear function and some corresponding basic properties in Sect. 1.2, and we give our formalization of deep ReLU networks as well as some well-known properties in Sect. 1.3.

1.1 Basic notations and definitions

We denote by

$$\begin{aligned} \begin{array}{rrcl} {\sigma }:&{}{{\mathbb {R}}}&{}\longrightarrow &{}{{\mathbb {R}}}\\ &{}{t}&{} \longmapsto &{} {\max (t,0)}\end{array} \end{aligned}$$

the ReLU activation function. If \(x = (x_1, \dots , x_m)^T \in {\mathbb {R}}^m\) is a vector, we denote \(\sigma (x) = (\sigma (x_1), \dots , \sigma (x_m))^T\).

If \(A \subset {\mathbb {R}}^m\), we denote by \(\mathring{A}\) the interior of A and \({\overline{A}}\) the closure of A with respect to the standard topology of \({\mathbb {R}}^m\). We denote by \(\partial A = {\overline{A}} \backslash \mathring{A}\) the topological boundary of A.

For \(m,n \in {\mathbb {N}}^*\), we denote by \({\mathbb {R}}^n\) the vector space of n-dimensional real vectors and by \({\mathbb {R}}^{m \times n}\) the vector space of real matrices with m rows and n columns. On the space of vectors, we use the norm \(\Vert x \Vert = \sqrt{\sum _{i=1}^n x_i^2}\). For \(x \in {\mathbb {R}}^n\) and \(r>0\), we denote \(B(x,r) = \{ y \in {\mathbb {R}}^n, \Vert y - x \Vert < r \}\).

For any vector \(x \in {\mathbb {R}}^n\) whose coefficients \(x_i\) are all different from zero, we denote by \(x^{-1}\) or \(\frac{1}{x}\) the vector \(\left( \frac{1}{x_1}, \frac{1}{x_2}, \dots , \frac{1}{x_n}\right) ^T\).

For any matrix \(M \in {\mathbb {R}}^{m\times n}\), for all \(i \in \llbracket 1, m \rrbracket\), we denote by \(M_{i,.}\) the \(i^{\text {th}}\) row of M. The vector \(M_{i,.}\) is a row vector whose \(j^{\text {th}}\) component is \(M_{i,j}\). Similarly, for \(j \in \llbracket 1, n \rrbracket\), we denote by \(M_{.,j}\) the \(j^{\text {th}}\) column of M, which is the column vector whose \(i^{\text {th}}\) component is \(M_{i,j}\). For any matrix \(M \in {\mathbb {R}}^{m \times n}\), we denote by \(M^T \in {\mathbb {R}}^{n\times m}\) the transpose matrix of M.

To avoid any confusion, we will denote by \((M^T)_{i,.}\) the \(i^{\text {th}}\) row of the matrix \(M^T\) and by \(M_{i,.}^{ T}\) the transpose of the row vector \(M_{i,.}\), which is a column vector. Similarly, we will denote by \((M^T)_{.,j}\) the \(j^{\text {th}}\) column of \(M^T\) and \(M_{.,j}^{ T}\) the transpose of the column vector \(M_{.,j}\).

For \(n \in {\mathbb {N}}^*\), we denote by \({{\,\textrm{Id}\,}}_n\) the \(n \times n\) identity matrix and by \(\mathbbm {1}_n\) the vector \((1,1, \dots , 1)^T \in {\mathbb {R}}^n\).

If \(\lambda \in {\mathbb {R}}^n\) is a vector of size n, for some \(n \in {\mathbb {N}}^*\), we denote by \({{\,\textrm{D}\,}}(\lambda )\) the \(n \times n\) matrix defined by:

$$\begin{aligned} {{\,\textrm{D}\,}}(\lambda )_{i,j} = {\left\{ \begin{array}{ll} \lambda _i &{} \text {if } i=j \\ 0 &{} \text {otherwise.}\end{array}\right. } \end{aligned}$$

For any integer \(m \in {\mathbb {N}}^*\), we denote by \({\mathfrak {S}}_m\) the set of all permutations of \(\llbracket 1, m \rrbracket\). We denote by \(id_{\llbracket 1, m \rrbracket }\) and \(id_{{\mathbb {R}}^m}\) the identity functions on \(\llbracket 1, m \rrbracket\) and \({\mathbb {R}}^m\) respectively.

For any permutation \(\varphi \in {\mathfrak {S}}_m\), we denote by \(P_\varphi\) the \(m \times m\) permutation matrix associated to \(\varphi\):

$$\begin{aligned} \forall i, j \in \llbracket 1, m \rrbracket , \quad (P_\varphi )_{i,j} = {\left\{ \begin{array}{ll} 1 &{} \text {if } \varphi (j) = i\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(20)

For all \(x \in {\mathbb {R}}^m\), we have:

$$\begin{aligned} (P_{\varphi }x)_i = x_{\varphi ^{-1}(i)}. \end{aligned}$$
(21)

Using (21) we see that \(P_{\varphi ^{-1}} P_{\varphi }x = x\), which shows, since \(P_{\varphi }\) is orthogonal, that we have

$$\begin{aligned} P_{\varphi }^{-1} = P_{\varphi ^{-1}} = P_{\varphi }^T. \end{aligned}$$
(22)
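
As a small worked example (added for concreteness), take \(m = 3\) and \(\varphi \in {\mathfrak {S}}_3\) defined by \(\varphi (1) = 2\), \(\varphi (2) = 3\) and \(\varphi (3) = 1\). Then (20) gives

$$\begin{aligned} P_\varphi = \begin{pmatrix} 0 &{} 0 &{} 1 \\ 1 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 \end{pmatrix} \qquad \text {and} \qquad P_\varphi x = (x_3, x_1, x_2)^T, \end{aligned}$$

in agreement with (21), since \(\varphi ^{-1}(1) = 3\), \(\varphi ^{-1}(2) = 1\) and \(\varphi ^{-1}(3) = 2\); one also checks directly that \(P_\varphi ^T P_\varphi = {{\,\textrm{Id}\,}}_3\), consistently with (22).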

Let \(l,m,n \in {\mathbb {N}}^*\). For any matrix \(M \in {\mathbb {R}}^{m \times l}\) and any function \(f: {\mathbb {R}}^m \rightarrow {\mathbb {R}}^n\), we denote with a slight abuse of notation \(f \circ M\) the function \(x \mapsto f(Mx)\).

If X and Y are two sets and \(h: X \rightarrow Y\) is a function, for a subset \(A \subset Y\), we denote by \(h^{-1}(A)\) the following set:

$$\begin{aligned} \{x \in X, h(x) \in A \}. \end{aligned}$$

Note that this does not require the function h to be injective.

1.2 Continuous piecewise linear functions

We now introduce a few definitions and properties around the notion of continuous piecewise linear function.

Definition 15

Let \(m \in {\mathbb {N}}^{*}\). A subset \(D \subset {\mathbb {R}}^m\) is a closed polyhedron if and only if there exist \(q \in {\mathbb {N}}^*\), \(a_1, \dots , a_q \in {\mathbb {R}}^m\) and \(b_1, \dots , b_q \in {\mathbb {R}}\) such that for all \(x \in {\mathbb {R}}^m\),

$$\begin{aligned} x \in D \quad \Longleftrightarrow \quad {\left\{ \begin{array}{ll} a_1^T x + b_1 \le 0 \\ \vdots \\ a_q^T x + b_q \le 0. \end{array}\right. } \end{aligned}$$

Remarks

  • A closed polyhedron is convex as an intersection of convex sets.

  • Since the systems of inequalities defining several closed polyhedra can be merged into a single system, an intersection of closed polyhedra is a closed polyhedron.

  • For \(q=1\) and \(a_1 = 0\), taking \(b_1 > 0\) and \(b_1 \le 0\) respectively we can show that \(\emptyset\) and \({\mathbb {R}}^m\) are both closed polyhedra.
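
To illustrate Definition 15 (an added example), the unit square \([0,1]^2 \subset {\mathbb {R}}^2\) is a closed polyhedron: it is described by \(q = 4\) inequalities, with

$$\begin{aligned} a_1 = (-1,0)^T, \ b_1 = 0, \qquad a_2 = (1,0)^T, \ b_2 = -1, \qquad a_3 = (0,-1)^T, \ b_3 = 0, \qquad a_4 = (0,1)^T, \ b_4 = -1, \end{aligned}$$

since the conditions \(a_i^T x + b_i \le 0\) read \(x_1 \ge 0\), \(x_1 \le 1\), \(x_2 \ge 0\) and \(x_2 \le 1\).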

Proposition 16

Let \(m,l \in {\mathbb {N}}^*\). If \(h: {\mathbb {R}}^l \rightarrow {\mathbb {R}}^m\) is linear and C is a closed polyhedron of \({\mathbb {R}}^m\), then \(h^{-1}(C)\) is a closed polyhedron of \({\mathbb {R}}^l\).

Proof

The function h is linear so there exist \(M \in {\mathbb {R}}^{m \times l}\) and \(b \in {\mathbb {R}}^m\) such that for all \(x \in {\mathbb {R}}^l\),

$$\begin{aligned} h(x) = Mx + b. \end{aligned}$$

The set C is a closed polyhedron so there exist \(a_1, \dots , a_q \in {\mathbb {R}}^m\) and \(b_1, \dots , b_q \in {\mathbb {R}}\) such that \(y \in C\) if and only if

$$\begin{aligned}{\left\{ \begin{array}{ll} a_1^T y + b_1 \le 0 \\ \vdots \\ a_q^T y + b_q \le 0. \end{array}\right. }\end{aligned}$$

For all \(x \in {\mathbb {R}}^l\),

$$\begin{aligned} \begin{aligned} x \in h^{-1}(C) \quad&\Longleftrightarrow \quad h(x) \in C \\&\Longleftrightarrow \quad {\left\{ \begin{array}{ll} a_1^T (Mx + b) + b_1 \le 0 \\ \vdots \\ a_q^T (Mx + b) + b_q \le 0 \end{array}\right. } \\&\Longleftrightarrow \quad {\left\{ \begin{array}{ll} (a_1^T M)x + (a_1^Tb + b_1) \le 0 \\ \vdots \\ (a_q^T M)x + (a_q^Tb + b_q) \le 0. \end{array}\right. } \end{aligned} \end{aligned}$$

This shows that \(h^{-1}(C)\) is a closed polyhedron. \(\square\)

Definition 17

We say that a function \(g: {\mathbb {R}}^m \rightarrow {\mathbb {R}}^n\) is continuous piecewise linear if there exists a finite set of closed polyhedra whose union is \({\mathbb {R}}^m\) and such that g is linear over each polyhedron.

Example

Since \({\mathbb {R}}^m\) is a closed polyhedron, we see in particular that an affine function \(x \mapsto Ax + b\), with \(A \in {\mathbb {R}}^{n \times m}\) and \(b \in {\mathbb {R}}^n\), is continuous piecewise linear from \({\mathbb {R}}^m\) to \({\mathbb {R}}^n\).

Example 18

The vectorial ReLU function \(\sigma : {\mathbb {R}}^m \rightarrow {\mathbb {R}}^m\) is continuous piecewise linear. Indeed, each of the \(2^m\) closed orthants is a closed polyhedron, defined by a system of the form

$$\begin{aligned} {\left\{ \begin{array}{ll}\epsilon _1 x_1 \ge 0 \\ \vdots \\ \epsilon _m x_m \ge 0, \end{array}\right. } \end{aligned}$$

with \(\epsilon _i \in \{-1, 1 \}\), and over such an orthant, the ReLU coincides with the affine function

$$\begin{aligned} (x_1, \dots , x_m ) \mapsto \left( \frac{1+\epsilon _1}{2} x_1, \dots ,\frac{1+\epsilon _m}{2} x_m\right) . \end{aligned}$$

The continuity is not obvious from this definition; we prove it in the following proposition.

Proposition 19

A continuous piecewise linear function is continuous.

Proof

Let \(g: {\mathbb {R}}^m \rightarrow {\mathbb {R}}^n\) be a continuous piecewise linear function. There exists a finite family of closed polyhedra \(C_1, \dots , C_r\) such that \(\bigcup _{i=1}^r C_i = {\mathbb {R}}^m\) and g is linear on each closed polyhedron \(C_i\).

Let \(x \in {\mathbb {R}}^m\). Let \(\epsilon > 0\).

Let us denote \(I = \{ i \in \llbracket 1, r \rrbracket , \ x \in C_i \}\). Since the polyhedra are closed, there exists \(r_0 > 0\) such that for all \(i \notin I, B(x,r_0) \cap C_i = \emptyset\). We thus have

$$\begin{aligned} B(x,r_0) = \bigcup _{i=1}^r (B(x,r_0) \cap C_i) = \bigcup _{i \in I} \left( B(x,r_0) \cap C_i \right) . \end{aligned}$$

For all \(i \in I\), g is linear, and therefore continuous, on \(C_i\), so there exists \(r_i > 0\) such that

$$\begin{aligned} y \in C_i \cap B(x,r_i) \ \Rightarrow \ \Vert g(y) - g(x) \Vert \le \epsilon . \end{aligned}$$

Let \(r = \min (r_0, \min _{i \in I} (r_i) )\). For all \(y \in B(x,r)\) there exists \(i \in I\) such that \(y \in C_i\), and since \(r \le r_i\), we have

$$\begin{aligned} \Vert g(y) - g(x) \Vert \le \epsilon . \end{aligned}$$

Summarizing, for any \(x \in {\mathbb {R}}^m\) and for any \(\epsilon > 0\), there exists \(r> 0\) such that

$$\begin{aligned} y \in B(x,r) \ \Rightarrow \ \Vert g(y) - g(x) \Vert \le \epsilon . \end{aligned}$$

This shows g is continuous. \(\square\)

Proposition 20

If \(h: {\mathbb {R}}^l \rightarrow {\mathbb {R}}^m\) and \(g: {\mathbb {R}}^m \rightarrow {\mathbb {R}}^n\) are two continuous piecewise linear functions, then \(g \circ h\) is continuous piecewise linear.

Proof

By definition, there exist a family \(C_1, \dots , C_r\) of closed polyhedra of \({\mathbb {R}}^l\) such that \(\bigcup _{i=1}^r C_i = {\mathbb {R}}^l\) and h is linear on each \(C_i\), and a family \(D_1, \dots , D_s\) of closed polyhedra of \({\mathbb {R}}^m\) such that \(\bigcup _{j=1}^s D_j = {\mathbb {R}}^m\) and g is linear on each \(D_j\). Let \(i \in \llbracket 1, r \rrbracket\) and \(j \in \llbracket 1, s \rrbracket\). The function h coincides with a linear map \({\tilde{h}}: {\mathbb {R}}^l \rightarrow {\mathbb {R}}^m\) on \(C_i\), and the inverse image of a closed polyhedron by a linear map is a closed polyhedron (Proposition 16), so \({\tilde{h}}^{-1}(D_j)\) is a closed polyhedron. Thus \(h^{-1}(D_j)\cap C_i = {\tilde{h}}^{-1}(D_j)\cap C_i\) is a closed polyhedron, as an intersection of closed polyhedra. The function h is linear on \(C_i\) and g is linear on \(D_j\), so \(g \circ h\) is linear on \(h^{-1}(D_j) \cap C_i\). We have a family of closed polyhedra,

$$\begin{aligned} \left( h^{-1}(D_j) \cap C_i\right) _{\begin{array}{c} i \in \llbracket 1, r \rrbracket \\ j \in \llbracket 1, s \rrbracket \end{array}}, \end{aligned}$$

on each of which \(g \circ h\) is linear. Given that

$$\begin{aligned} \bigcup _{i = 1}^r \bigcup _{j=1}^s h^{-1}(D_j) \cap C_i = \bigcup _{i = 1}^r C_i = {\mathbb {R}}^l, \end{aligned}$$

we can conclude that \(g \circ h\) is continuous piecewise linear. \(\square\)

Definition 21

Let \(g: {\mathbb {R}}^m \rightarrow {\mathbb {R}}^n\) be a continuous piecewise linear function. Let \(\Pi\) be a set of closed polyhedra of \({\mathbb {R}}^m\). We say that \(\Pi\) is admissible with respect to the function g if and only if:

  • \(\bigcup _{D \in \Pi } D = {\mathbb {R}}^m\),

  • for all \(D \in \Pi\), g is linear on D,

  • for all \(D \in \Pi\), \(\mathring{D} \ne \emptyset\).
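
As a simple illustration of Definition 21 (added example), take \(m = n = 1\) and \(g = \sigma\), the scalar ReLU. The set \(\Pi = \{ (-\infty , 0], [0, +\infty ) \}\) is admissible with respect to g: both half-lines are closed polyhedra with nonempty interior, their union is \({\mathbb {R}}\), and \(\sigma\) is linear on each of them (it coincides with \(x \mapsto 0\) on the first and with \(x \mapsto x\) on the second).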

Proposition 22

For all \(g: {\mathbb {R}}^m \rightarrow {\mathbb {R}}^n\) continuous piecewise linear, there exists a set of closed polyhedra \(\Pi\) admissible with respect to g.

Proof

Let \(g: {\mathbb {R}}^m \rightarrow {\mathbb {R}}^n\) be a continuous piecewise linear function. By definition there exists a finite set of closed polyhedra \(D_1, \dots , D_s\) such that \(\bigcup _{i=1}^s D_i = {\mathbb {R}}^m\) and g is linear on each \(D_i\).

Let \(I = \{ i \in \llbracket 1, s \rrbracket , \mathring{D_i} \ne \emptyset \}\). Let us show that \(\bigcup _{i \in I} D_i = {\mathbb {R}}^m\).

We first show that if a polyhedron \(D_i\) has empty interior, then it is contained in an affine hyperplane. Indeed, if it is not contained in an affine hyperplane, then there exist \(m+1\) affinely independent points \(x_1, \dots , x_{m+1} \in D_i\). Since a closed polyhedron is convex, the convex hull \({{\,\textrm{Conv}\,}}(x_1, \dots , x_{m+1})\) of these points, which is an m-simplex, is contained in \(D_i\), and thus \(D_i\) has nonempty interior.

Let \(x \in {\mathbb {R}}^m\). For all \(i \notin I\), \(D_i\) is contained in an affine hyperplane, and a finite union of affine hyperplanes does not contain any nontrivial ball. As a consequence, for all \(n \in {\mathbb {N}}^*\), the ball \(B(x,\frac{1}{n})\) is not contained in \(\bigcup _{i \notin I} D_i\), and thus there exists \(i_n \in I\) such that \(D_{i_n} \cap B(x,\frac{1}{n}) \ne \emptyset\). Since I is finite, there exists \(i \in I\) such that \(i_n = i\) for infinitely many n, and thus \(x \in \overline{D_i}\).

We have shown that for all \(x \in {\mathbb {R}}^m\) there exists \(i \in I\) such that \(x \in \overline{D_i} = D_i\), which means that

$$\begin{aligned} \bigcup _{i \in I} D_i = {\mathbb {R}}^m. \end{aligned}$$

Hence, the set \(\Pi : = \{ D_i, i \in I \}\) is admissible with respect to g. \(\square\)

Proposition 23

Let \(h: {\mathbb {R}}^l \rightarrow {\mathbb {R}}^m\) be a continuous piecewise linear function and let \({\mathcal {P}}\) be a finite set of closed polyhedra of \({\mathbb {R}}^m\). Then

  • for all \(D \in {\mathcal {P}}\), \(h^{-1}(D)\) is a finite union of closed polyhedra;

  • \(\bigcup _{D \in {\mathcal {P}}} \partial h^{-1}(D)\) is contained in a finite union of hyperplanes \(\bigcup _{k=1}^s A_k\).

Proof

Consider \(\Pi\) an admissible set of closed polyhedra with respect to h. Let \(D \in {\mathcal {P}}\). Since \(\bigcup _{C \in \Pi } C = {\mathbb {R}}^l\), we can write

$$\begin{aligned} h^{-1}(D) = h^{-1}(D) \cap \left( \bigcup _{C \in \Pi } C\right) = \bigcup _{C \in \Pi } \left( h^{-1}(D) \cap C \right) . \end{aligned}$$

For all \(C \in \Pi\), h is linear over C, so \(h^{-1}(D) \cap C\) is a polyhedron (see Proposition 16). This shows the first point of the proposition.

Since \(h^{-1}(D) \cap C\) is a polyhedron, \(\partial \left( h^{-1}(D) \cap C \right)\) is contained in a finite union of hyperplanes. Moreover, the boundary of a finite union is contained in the union of the boundaries:

$$\begin{aligned} \partial \left[ \bigcup _{C \in \Pi } \left( h^{-1}(D) \cap C \right) \right] \ \subset \ \bigcup _{C \in \Pi } \partial \left( h^{-1}(D) \cap C \right) , \end{aligned}$$

which shows that \(\partial \left[ \bigcup _{C \in \Pi } \left( h^{-1}(D) \cap C \right) \right]\), that is \(\partial h^{-1}(D)\), is contained in a finite union of hyperplanes too. This holds for any \(D \in {\mathcal {P}}\), and since \({\mathcal {P}}\) is finite, it also holds for the union \(\bigcup _{D \in {\mathcal {P}}} \partial h^{-1}(D)\). \(\square\)

1.3 Neural networks

We consider fully connected feedforward neural networks with the ReLU activation function. We index the layers in reverse order, from K to 0, for some \(K \ge 2\). The input layer is the layer K, the output layer is the layer 0, and between them are \(K-1\) hidden layers. For \(k \in \llbracket 0, K \rrbracket\), we denote by \(n_k \in {\mathbb {N}}\) the number of neurons of the layer k. This means the information contained at the layer k is an \(n_k\)-dimensional vector.

Let \(k \in \llbracket 0, K-1 \rrbracket\). We denote the weights between the layer \(k+1\) and the layer k with a matrix \(M^{k} \in {\mathbb {R}}^{n_k \times n_{k+1}}\), and we consider a bias \(b^{k} \in {\mathbb {R}}^{n_k}\) in the layer k. If \(k \ne 0\), we add a ReLU activation function. If \(x \in {\mathbb {R}}^{n_{k+1}}\) is the information contained at the layer \(k+1\), the layer k contains:

$$\begin{aligned} {\left\{ \begin{array}{ll}\sigma (M^{k} x + b^{k}) &{} \text {if } k \ne 0 \\ M^{0} x + b^{0} &{} \text {if } k=0.\end{array}\right. } \end{aligned}$$

The parameters of the network can be summarized in the couple \(({{\textbf {M}}},{{\textbf {b}}})\), where \({{\textbf {M}}} = (M^{0}, M^{1}, \dots , M^{K-1}) \in {\mathbb {R}}^{n_0 \times n_1} \times \dots \times {\mathbb {R}}^{n_{K-1} \times n_K}\) and \({{\textbf {b}}} = (b^{0}, b^{1}, \dots , b^{K-1}) \in {\mathbb {R}}^{n_0} \times \dots \times {\mathbb {R}}^{n_{K-1}}\). We formalize the transformation implemented by one layer of the network with the following definition.

Definition 24

For a network with parameters \(({{\textbf {M}}}, {{\textbf {b}}})\), we define the family of functions \((h_0,\dots , h_{K-1})\) such that for all \(k \in \llbracket 0, K-1 \rrbracket\), \(h_k: {\mathbb {R}}^{n_{k+1}} \rightarrow {\mathbb {R}}^{n_k}\) and for all \(x \in {\mathbb {R}}^{n_{k+1}}\),

$$\begin{aligned} h_k(x)={\left\{ \begin{array}{ll}\sigma (M^{k} x + b^{k}) &{} \text {if } k \ne 0 \\ M^{0} x + b^{0} &{} \text {if } k=0.\end{array}\right. } \end{aligned}$$

The function implemented by the network is then

$$\begin{aligned} f_{{{\textbf {M}}},{{\textbf {b}}}} = h_0 \circ h_1 \circ \dots \circ h_{K-1}: {\mathbb {R}}^{n_K} \longrightarrow {\mathbb {R}}^{n_0}. \end{aligned}$$
(23)

The network and its parameters are represented in Fig. 1 in the main part.
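
For readers who prefer code, the following minimal NumPy sketch (ours, not part of the paper; the function names are illustrative) evaluates \(f_{{{\textbf {M}}},{{\textbf {b}}}}\) exactly as in Definition 24 and (23), with the reverse layer indexing \(K, \dots , 0\).

```python
import numpy as np

def relu(x):
    # componentwise ReLU: sigma(t) = max(t, 0)
    return np.maximum(x, 0.0)

def forward(M, b, x):
    """Evaluate f_{M,b}(x) for a fully connected ReLU network.

    M = [M^0, ..., M^{K-1}] with M^k of shape (n_k, n_{k+1});
    b = [b^0, ..., b^{K-1}] with b^k of shape (n_k,).
    Hidden layers k = K-1, ..., 1 apply an affine map followed by a ReLU;
    the output layer k = 0 is purely affine (Definition 24).
    """
    K = len(M)
    for k in range(K - 1, 0, -1):   # h_{K-1}, then h_{K-2}, ..., then h_1
        x = relu(M[k] @ x + b[k])
    return M[0] @ x + b[0]          # h_0: no activation
```

For instance, with \(K = 2\), the code reduces to \(x \mapsto M^{0}\sigma (M^{1}x + b^{1}) + b^{0}\).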

For all \(l \in \llbracket 0, K-1 \rrbracket\), we denote \({{\textbf {M}}}^{\le l} = (M^{0}, M^{1}, \dots , M^{l})\) and \({{\textbf {b}}}^{\le l} = (b^{0}, b^{1}, \dots , b^{l})\).

Remark 25

Since the vectorial ReLU function is continuous piecewise linear, Proposition 20 guarantees that the functions \(h_k\) are continuous piecewise linear.

We now define a few more functions associated to a network.

Definition 26

For a network with parameters \(({{\textbf {M}}}, {{\textbf {b}}})\), we define the family of functions \((h^{lin}_0,\dots , h^{lin}_{K-1})\) such that for all \(k \in \llbracket 0, K-1 \rrbracket\), \(h_k^{lin}: {\mathbb {R}}^{n_{k+1}} \rightarrow {\mathbb {R}}^{n_k}\) and for all \(x \in {\mathbb {R}}^{n_{k+1}}\),

$$\begin{aligned} h_k^{lin} (x) = M^kx + b^k. \end{aligned}$$

The functions \(h_k^{lin}\) correspond to the linear part of the transformation implemented by the network between two layers, before applying the activation \(\sigma\).

Definition 27

For a network with parameters \(({{\textbf {M}}}, {{\textbf {b}}})\), we define the family of functions \((f_K, f_{K-1}, \dots , f_{0})\) as follows:

  • \(f_{K} = id_{{\mathbb {R}}^{n_K}}\),

  • for all \(k \in \llbracket 0, K-1 \rrbracket\),    \(f_k = h_k \circ h_{k+1} \circ \dots \circ h_{K-1}\).

Remark

In particular we have \(f_0 = f_{{{\textbf {M}}},{{\textbf {b}}}}\).

The function \(f_k:{\mathbb {R}}^{n_K} \rightarrow {\mathbb {R}}^{n_k}\) represents the transformation implemented by the network between the input layer and the layer k.

Definition 28

For a network with parameters \(({{\textbf {M}}}, {{\textbf {b}}})\), we define the sequence \((g_0,\dots , g_{K})\) as follows:

  • \(g_{0} = id_{{\mathbb {R}}^{n_0}}\),

  • for all \(k \in \llbracket 1, K \rrbracket\),    \(g_k = h_0 \circ h_1 \circ \dots \circ h_{k-1}\).

Remark

We have in particular

  • \(g_K = f_{{\textbf {M,b}}}\);

  • for all \(k \in \llbracket 0, K \rrbracket\), \(f_{{{\textbf {M}}},{{\textbf {b}}}} = g_k\circ f_k\).

The function \(g_k: {\mathbb {R}}^{n_k} \rightarrow {\mathbb {R}}^{n_0}\) represents the transformation implemented by the network between the layer k and the output layer.
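
Under the same (illustrative) conventions as the sketch following Definition 24, the functions \(f_k\) and \(g_k\) are the two halves of the composition (23), and the identity \(f_{{{\textbf {M}}},{{\textbf {b}}}} = g_k\circ f_k\) can be checked numerically; the sketch below uses our own naming assumptions and is not code from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def f_k(M, b, x, k):
    # Definition 27: apply h_{K-1}, ..., h_{k+1}, h_k to the input x.
    K = len(M)
    for j in range(K - 1, k - 1, -1):
        x = M[j] @ x + b[j]
        if j != 0:
            x = relu(x)
    return x                         # f_K is the identity (empty loop)

def g_k(M, b, x, k):
    # Definition 28: apply h_{k-1}, ..., h_1, h_0 to a vector of layer k.
    for j in range(k - 1, -1, -1):
        x = M[j] @ x + b[j]
        if j != 0:
            x = relu(x)
    return x                         # g_0 is the identity (empty loop)

# For any k in [0, K]:  g_k(M, b, f_k(M, b, x, k), k) equals f_{M,b}(x),
# and f_k(M, b, x, 0) == g_k(M, b, x, K) == f_{M,b}(x).
```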

In this paper the functions implemented by the networks are considered on a subset \(\Omega \subset {\mathbb {R}}^{n_K}\). The successive layers of a network map this subset into the spaces \({\mathbb {R}}^{n_k}\), inducing a subset \(\Omega _k \subset {\mathbb {R}}^{n_k}\) for each k, as in the following definition.

Definition 29

For a network with parameters \(({{\textbf {M}}}, {{\textbf {b}}})\), for any \(\Omega \subset {\mathbb {R}}^{n_K}\), we denote for all \(k \in \llbracket 0,K \rrbracket\),

$$\begin{aligned} \Omega _k = f_k (\Omega ). \end{aligned}$$

Definition 30

For a network with parameters \(({{\textbf {M}}}, {{\textbf {b}}})\), for all \(k \in \llbracket 2, K \rrbracket\), for all \(i \in \llbracket 1, n_{k-1} \rrbracket\), we define

$$\begin{aligned} H_i^k = \{x \in {\mathbb {R}}^{n_k}, \ M^{k-1}_{i,.}x + b_i^{k-1} = 0\}. \end{aligned}$$

Remark

When \(M^{k-1}_{i,.} \ne 0\), the set \(H^{k}_i\) is a hyperplane.

Remark 31

The objects defined in Definitions 24, 26, 27, 28, 29 and 30 all depend on \(({{\textbf {M}}}, {{\textbf {b}}})\), but to simplify the notation we do not write it explicitly. To disambiguate when manipulating a second network, whose parameters we will denote by \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\), we will denote by \({\tilde{h}}_k\), \({\tilde{h}}_k^{lin}\), \({\tilde{f}}_k\), \({\tilde{g}}_k\), \({\tilde{\Omega }}_k\) and \({\tilde{H}}^{k}_i\) the corresponding objects.

Proposition 32

For all \(k \in \llbracket 0, K\rrbracket\), \(f_k\) and \(g_k\) are continuous piecewise linear.

Proof

We show this by induction: for the initialisation we have \(f_K = id_{{\mathbb {R}}^{n_K}}\) which is continuous piecewise linear. Now let \(k \in \llbracket 0, K-1 \rrbracket\) and assume \(f_{k+1}\) is continuous piecewise linear. By definition, we have \(f_k = h_k \circ f_{k+1}\). The function \(h_k\) is continuous piecewise linear as noted in Remark 25. By Proposition 20, the composition of two continuous piecewise linear functions is continuous piecewise linear, so \(f_k\) is continuous piecewise linear. The conclusion follows by induction.

We do the same for \((g_0, \dots , g_K)\) starting with \(g_0\): first we have \(g_0 = id_{{\mathbb {R}}^{n_0}}\) which is continuous piecewise linear, then for all \(k \in \llbracket 1, K \rrbracket\), we have \(g_k = g_{k-1} \circ h_{k-1}\), and we conclude by composition of two continuous piecewise linear functions. \(\square\)

Corollary 33

The function \(f_{{{\textbf {M}}},{{\textbf {b}}}}\) is continuous piecewise linear.

Proof

It follows immediately from \(f_{{{\textbf {M}}},{{\textbf {b}}}} = f_0\) and Proposition 32. \(\square\)

Recall the definition of an admissible set with respect to a continuous piecewise linear function (Definition 21). Proposition 32 allows the following definition.

Definition 34

Consider a network parameterization \({\textbf {(M, b)}}\), and the functions \(g_k\) associated with it. We say that a list of sets of closed polyhedra \(\varvec{\Pi } = (\Pi _1, \dots , \Pi _{K-1})\) is admissible with respect to \(({{\textbf {M}}}, {{\textbf {b}}})\) if and only if for all \(k \in \llbracket 1, K-1 \rrbracket\), the set \(\Pi _k\) is admissible with respect to \(g_k\).

Remark

For a list \(\varvec{\Pi } = (\Pi _1, \dots , \Pi _{K-1})\), for all \(l \in \llbracket 1, K-1 \rrbracket\), we denote \(\varvec{\Pi }^{\le l} = (\Pi _1, \dots , \Pi _{l})\). If \(\varvec{\Pi }\) is admissible with respect to \({\textbf {(M, b)}}\), then \(\varvec{\Pi }^{\le l}\) is admissible with respect to \(({{\textbf {M}}}^{\le l}, {{\textbf {b}}}^{\le l})\).

Proposition 35

For any network parameterization \({\textbf {(M, b)}}\), there always exists a list of sets of closed polyhedra \(\varvec{\Pi }\) that is admissible with respect to \({\textbf {(M, b)}}\).

Proof

For all \(k\in \llbracket 1, K-1 \rrbracket\), since \(g_k\) is continuous piecewise linear, Proposition 22 guarantees that there exists an admissible set of polyhedra \(\Pi _k\) with respect to \(g_k\). We simply define \(\varvec{\Pi } = (\Pi _1, \dots , \Pi _{K-1})\). \(\square\)

Definition 36

For a parameterization \({\textbf {(M, b)}}\) and a list \(\varvec{\Pi }\) admissible with respect to \({\textbf {(M, b)}}\), for all \(k \in \llbracket 1, K-1 \rrbracket\), for all \(D \in \Pi _k\), since \(g_k\) is linear over D and D has nonempty interior, we can define \(V^k(D) \in {\mathbb {R}}^{n_0 \times n_k}\) and \(c^k(D) \in {\mathbb {R}}^{n_0}\) as the unique couple that satisfies:

$$\begin{aligned} \forall x \in D, \quad g_k(x) = V^k(D) x + c^k(D). \end{aligned}$$
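
Because \(g_k\) is affine on D and \(\mathring{D} \ne \emptyset\), the pair \((V^k(D), c^k(D))\) is determined by finitely many evaluations of \(g_k\) inside D. The sketch below is our own illustration (hypothetical names; it assumes a callable g implementing \(g_k\) and a point x0 such that x0 and the probe points x0 + eps * e_j all lie in D).

```python
import numpy as np

def affine_part(g, x0, eps=1e-3):
    """Recover (V, c) such that g(x) = V @ x + c on a region containing x0
    and the probe points x0 + eps * e_j (exact there, since g is affine)."""
    n = x0.shape[0]
    g0 = g(x0)
    columns = [(g(x0 + eps * np.eye(n)[:, j]) - g0) / eps for j in range(n)]
    V = np.column_stack(columns)     # j-th column is V e_j
    c = g0 - V @ x0
    return V, c
```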

We now introduce the equivalence relation between parameterizations, often referred to as equivalence modulo permutation and positive rescaling.

Definition 37

(Equivalent parameterizations) If \(({{\textbf {M}}},{{\textbf {b}}})\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) are two network parameterizations, we say that \(({{\textbf {M}}},{{\textbf {b}}})\) is equivalent modulo permutation and positive rescaling, or simply equivalent, to \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\), and we write \(({{\textbf {M}}},{{\textbf {b}}}) \sim ({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\), if and only if there exist:

  • a family of permutations \(\varvec{\varphi } = (\varphi _0, \dots , \varphi _{K}) \in {\mathfrak {S}}_{n_0} \times \dots \times {\mathfrak {S}}_{n_{K}}\), with \(\varphi _0 = id_{\llbracket 1, n_0 \rrbracket }\) and \(\varphi _{K} = id_{\llbracket 1, n_K \rrbracket }\),

  • a family of vectors \(\varvec{\lambda }=(\lambda ^{0}, \lambda ^{1}, \dots , \lambda ^{K}) \in (\mathbb {R}_+^*)^{n_0} \times \dots \times ( \mathbb {R}_+^*)^{n_{K}}\), with \(\lambda ^{0} = \mathbbm {1}_{n_0}\) and \(\lambda ^{K} = \mathbbm {1}_{n_K}\),

such that for all \(k \in \llbracket 0, K-1 \rrbracket\),

$$\begin{aligned} {\left\{ \begin{array}{ll}{\tilde{M}}^{k}= P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k} )M^{k}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1} \\ {\tilde{b}}^{k} = P_{\varphi _k}{{\,\textrm{D}\,}}(\lambda ^{k} ) b^{k}.\end{array}\right. } \end{aligned}$$
(24)

Remarks

  1.

    Recall that we denote by \(\frac{1}{\lambda ^{k+1}}\) the vector whose components are \(\frac{1}{\lambda ^{k+1}_i}\). Note that \({{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} = {{\,\textrm{D}\,}}(\frac{1}{\lambda ^{k+1}})\). Using (21), for all \(k \in \llbracket 0, K-1 \rrbracket\), (24) means that for all \((i,j) \in \llbracket 1, n_k \rrbracket \times \llbracket 1, n_{k+1} \rrbracket\),

    $$\begin{aligned} {\tilde{M}}^{k}_{i,j} = \frac{\lambda ^{k}_{\varphi _k^{-1}(i)}}{\lambda ^{k+1}_{\varphi _{k+1}^{-1}(j)}} M^{k}_{\varphi _k^{-1}(i),\varphi _{k+1}^{-1}(j)} \end{aligned}$$

    and

    $$\begin{aligned} {\tilde{b}}^{k}_{i} = \lambda ^{k}_{\varphi _k^{-1}(i)} b^{k}_{\varphi _k^{-1}(i)}. \end{aligned}$$
  2.

    We go from a parameterization to an equivalent one by:

    • permuting the neurons of each hidden layer k with a permutation \(\varphi _k\);

    • for each hidden layer k, multiplying all the weights of the edges arriving at the neuron j (from the layer \(k+1\)), as well as the bias \(b^k_j\), by some positive number \(\lambda ^k_j\), and multiplying all the weights of the edges leaving the neuron j (towards the layer \(k-1\)) by \(\frac{1}{\lambda ^k_j}\); a numerical sketch of these two operations is given below.
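
These two operations can be verified numerically. The code below is our own illustration (hypothetical names, reusing the forward sketch given after Definition 24) for a one-hidden-layer network (\(K = 2\)): it permutes and positively rescales the hidden neurons as in (24) and checks that the implemented function is unchanged.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(M, b, x):
    K = len(M)
    for k in range(K - 1, 0, -1):
        x = relu(M[k] @ x + b[k])
    return M[0] @ x + b[0]

rng = np.random.default_rng(0)
n2, n1, n0 = 4, 5, 3                                 # input, hidden and output widths
M = [rng.standard_normal((n0, n1)),                  # M^0
     rng.standard_normal((n1, n2))]                  # M^1
b = [rng.standard_normal(n0), rng.standard_normal(n1)]   # b^0, b^1

phi = rng.permutation(n1)                            # phi_1, as an index array
P = np.zeros((n1, n1))
P[phi, np.arange(n1)] = 1.0                          # (P_phi)_{phi(j), j} = 1, cf. (20)
lam = rng.uniform(0.5, 2.0, size=n1)                 # lambda^1, positive rescaling
D, Dinv = np.diag(lam), np.diag(1.0 / lam)

# Equivalent parameterization, Eq. (24), with phi_0, phi_2 = id and lambda^0, lambda^2 = 1.
M_t = [M[0] @ Dinv @ P.T,                            # M^0 D(lambda^1)^{-1} P_{phi_1}^{-1}
       P @ D @ M[1]]                                 # P_{phi_1} D(lambda^1) M^1
b_t = [b[0].copy(), P @ D @ b[1]]                    # b^0 unchanged; P_{phi_1} D(lambda^1) b^1

x = rng.standard_normal(n2)
assert np.allclose(forward(M, b, x), forward(M_t, b_t, x))
```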

Proposition 38

The relation \(\sim\) is an equivalence relation.

Proof

Let us first establish the following identity, which we will use in the proof. For any \(n \in {\mathbb {N}}^*\), \(\lambda \in {\mathbb {R}}^{n}\) and \(\varphi \in {\mathfrak {S}}_n\),

$$\begin{aligned} {{\,\textrm{D}\,}}(\lambda ) P_{\varphi } = P_{\varphi } {{\,\textrm{D}\,}}(P_{\varphi }^{-1} \lambda ). \end{aligned}$$
(25)

Indeed, \({{\,\textrm{D}\,}}(\lambda ) P_{\varphi }\) is the matrix obtained by multiplying each row i of \(P_{\varphi }\) by \(\lambda _i\), so recalling (20), for all \(i,j \in \llbracket 1,n\rrbracket\), we have

$$\begin{aligned} ({{\,\textrm{D}\,}}(\lambda ) P_{\varphi })_{i,j} = {\left\{ \begin{array}{ll} \lambda _i &{} \text {if } \varphi (j) = i\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

At the same time, \(P_{\varphi } {{\,\textrm{D}\,}}(P_{\varphi }^{-1} \lambda )\) is the matrix obtained by multiplying each column j of \(P_{\varphi }\) by \((P_{\varphi }^{-1} \lambda )_j = \lambda _{\varphi (j)}\) (see (21) and (22)), so for all \(i,j \in \llbracket 1,n\rrbracket\), we have

$$\begin{aligned} (P_{\varphi } {{\,\textrm{D}\,}}(P_{\varphi }^{-1} \lambda ))_{i,j} = {\left\{ \begin{array}{ll} \lambda _{\varphi (j)} &{} \text {if } \varphi (j) = i\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

The two matrices are clearly equal.

We can now show the proposition.

  • To show reflexivity we can take \(\lambda ^{k} = \mathbbm {1}_{n_k}\) and \(\varphi _k = id_{\llbracket 1, n_k \rrbracket }\) for all \(k \in \llbracket 0, K \rrbracket\).

  • Let us show symmetry. Assume a parameterization \({\textbf {(M, b)}}\) is equivalent to another parameterization \(({\tilde{{\textbf {M}}}}, {\tilde{{\textbf {b}}}})\). Let us denote by \(\varvec{\varphi }\) and \(\varvec{\lambda }\) the corresponding families of permutations and vectors, as in Definition 37. Inverting the expression of \({\tilde{M}}^k\) in Definition 37 and using (25) twice, we have for all \(k \in \llbracket 0, K-1 \rrbracket\):

    $$\begin{aligned} \begin{aligned} {\tilde{M}}^{k}&= P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k} )M^{k}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1} \\&\Longleftrightarrow \quad {{\,\textrm{D}\,}}(\lambda ^{k})^{-1} P_{\varphi _{k}}^{-1} {\tilde{M}}^{k} P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1}) = M^{k} \\&\Longleftrightarrow \quad P_{\varphi _{k}}^{-1} {{\,\textrm{D}\,}}(P_{\varphi _{k}} \lambda ^{k})^{-1} {\tilde{M}}^{k} {{\,\textrm{D}\,}}(P_{\varphi _{k+1}}\lambda ^{k+1}) P_{\varphi _{k+1}} = M^{k}, \end{aligned} \end{aligned}$$

    so denoting \({\tilde{\varphi }}_k = \varphi _{k}^{-1}\) and \({\tilde{\lambda }}^{k} = (P_{\varphi _{k}} \lambda ^{k})^{-1}\), and recalling that \(P_{\varphi _k^{-1}} = P_{\varphi _k}^{-1}\), we have, for all \(k \in \llbracket 0, K-1 \rrbracket\),

    $$\begin{aligned} M^{k} = P_{{\tilde{\varphi }}_{k}} {{\,\textrm{D}\,}}({\tilde{\lambda }}^{k}) {\tilde{M}}^{k} {{\,\textrm{D}\,}}(\tilde{\lambda }^{k+1})^{-1} P_{{\tilde{\varphi }}_{k+1}}^{-1}. \end{aligned}$$

    We show similarly that for all \(k \in \llbracket 0, K-1 \rrbracket\),

    $$\begin{aligned} b^k = P_{{\tilde{\varphi }}_k} {{\,\textrm{D}\,}}({\tilde{\lambda }}^{k}){\tilde{b}}^k. \end{aligned}$$

    We naturally have \({\tilde{\varphi }}_0 = id_{\llbracket 1, n_0 \rrbracket }\) and \({\tilde{\varphi }}_K = id_{\llbracket 1, n_K \rrbracket }\), as well as \({\tilde{\lambda }}^0 = \mathbbm {1}_{n_0}\) and \({\tilde{\lambda }}^K = \mathbbm {1}_{n_K}\).

    This proves the symmetry of the relation.

  • Let us show transitivity. Assume \({\textbf {(M, b)}}\), \(({\tilde{{\textbf {M}}}}, {\tilde{{\textbf {b}}}})\) and \((\check{{{\textbf {M}}}}, \check{{{\textbf {b}}}})\) are three parameterizations such that \({\textbf {(M, b)}} \sim ({\tilde{{\textbf {M}}}}, {\tilde{{\textbf {b}}}})\) and \(({\tilde{{\textbf {M}}}}, {\tilde{{\textbf {b}}}}) \sim (\check{{{\textbf {M}}}}, \check{{{\textbf {b}}}})\).

    As in Definition 37, we denote by \(\varvec{\varphi }\), \(\tilde{\varvec{\varphi }}\), \(\varvec{\lambda }\) and \(\tilde{\varvec{\lambda }}\) the families of permutations and vectors such that, for all \(k \in \llbracket 0, K-1 \rrbracket\),

    $$\begin{aligned} {\left\{ \begin{array}{ll}{\tilde{M}}^{k} = P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k} )M^{k}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1} \\ {\tilde{b}}^{k} = P_{{\varphi }_{k}} {{\,\textrm{D}\,}}({\lambda }^{k} ) b^{k}, \end{array}\right. } \end{aligned}$$

    and

    $$\begin{aligned}{\left\{ \begin{array}{ll} \check{M}^{k} = P_{{\tilde{\varphi }}_{k}} {{\,\textrm{D}\,}}({\tilde{\lambda }}^{k} ){\tilde{M}}^{k}{{\,\textrm{D}\,}}({\tilde{\lambda }}^{k+1})^{-1}P_{{\tilde{\varphi }}_{k+1}}^{-1} \\ \check{b}^{k} = P_{{\tilde{\varphi }}_{k}} {{\,\textrm{D}\,}}({\tilde{\lambda }}^{k} ) {\tilde{b}}^{k}. \end{array}\right. } \end{aligned}$$

    Combining these and using (25), we have

    $$\begin{aligned} \begin{aligned} \check{M}^{k}&= P_{{\tilde{\varphi }}_{k}} {{\,\textrm{D}\,}}({\tilde{\lambda }}^{k} )P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k} )M^{k}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1} {{\,\textrm{D}\,}}({\tilde{\lambda }}^{k+1})^{-1}P_{{\tilde{\varphi }}_{k+1}}^{-1} \\&= P_{{\tilde{\varphi }}_{k}} \left( {{\,\textrm{D}\,}}({\tilde{\lambda }}^{k} )P_{\varphi _{k}} \right) {{\,\textrm{D}\,}}(\lambda ^{k} )M^{k} \\&\quad \cdot {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}\left( {{\,\textrm{D}\,}}({\tilde{\lambda }}^{k+1})P_{\varphi _{k+1}}\right) ^{-1} P_{{\tilde{\varphi }}_{k+1}}^{-1} \\&= P_{{\tilde{\varphi }}_{k}} \left( P_{\varphi _{k}} {{\,\textrm{D}\,}}(P_{\varphi _{k}}^{-1}{\tilde{\lambda }}^{k} ) \right) {{\,\textrm{D}\,}}(\lambda ^{k} )M^{k} \\&\quad \cdot {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}\left( P_{\varphi _{k+1}} {{\,\textrm{D}\,}}( P_{\varphi _{k+1}}^{-1}{\tilde{\lambda }}^{k+1})\right) ^{-1} P_{{\tilde{\varphi }}_{k+1}}^{-1} \\&= P_{{\tilde{\varphi }}_{k}} P_{\varphi _{k}} {{\,\textrm{D}\,}}(P_{\varphi _{k}}^{-1}{\tilde{\lambda }}^{k} ) {{\,\textrm{D}\,}}(\lambda ^{k} )M^{k} \\&\quad \cdot {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} {{\,\textrm{D}\,}}(P_{\varphi _{k+1}}^{-1}{\tilde{\lambda }}^{k+1})^{-1}P_{\varphi _{k+1}}^{-1}P_{{\tilde{\varphi }}_{k+1}}^{-1}, \end{aligned} \end{aligned}$$

    and

    $$\begin{aligned} \begin{aligned} \check{b}^{k}&= P_{{\tilde{\varphi }}_{k}} {{\,\textrm{D}\,}}({\tilde{\lambda }}^{k} )P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k} ) b^{k} \\&= P_{{\tilde{\varphi }}_{k}}P_{\varphi _{k}} {{\,\textrm{D}\,}}(P_{\varphi _{k}}^{-1}{\tilde{\lambda }}^{k} ) {{\,\textrm{D}\,}}(\lambda ^{k} ) b^{k}. \end{aligned} \end{aligned}$$

    Hence denoting \(\check{\varphi }_k = {\tilde{\varphi }}_{k} \circ \varphi _{k}\) and \(\check{\lambda }^{k} = {{\,\textrm{D}\,}}(P_{\varphi _{k}}^{-1}{\tilde{\lambda }}^{k}) \lambda ^{k}\), for all \(k \in \llbracket 0, K \rrbracket\), we see that, for \(k \in \llbracket 0, K-1 \rrbracket\),

    $$\begin{aligned} \check{M}^{k} = P_{\check{\varphi }_{k}} {{\,\textrm{D}\,}}(\check{\lambda }^{k} )M^{k}{{\,\textrm{D}\,}}(\check{\lambda }^{k+1})^{-1}P_{\check{\varphi }_{k+1}}^{-1} \end{aligned}$$

    and

    $$\begin{aligned} \check{b}^{k} = P_{\check{\varphi }_{k}} {{\,\textrm{D}\,}}(\check{\lambda }^{k} ) b^{k}. \end{aligned}$$

    Naturally, we also have \(\check{\varphi }_0 = id_{\llbracket 1, n_0 \rrbracket }\) and \(\check{\varphi }_K = id_{\llbracket 1, n_K \rrbracket }\), as well as \(\check{\lambda }^0 = \mathbbm {1}_{n_0}\) and \(\check{\lambda }^K = \mathbbm {1}_{n_K}\), which shows that \({\textbf {(M, b)}} \sim (\check{{{\textbf {M}}}}, \check{{{\textbf {b}}}})\).

\(\square\)

Recall the objects \(h_k, f_k, g_k, \Omega _k, H_i^k\) associated with a parameterization \(({{\textbf {M}}}, {{\textbf {b}}})\), defined in Definitions 24, 27, 28, 29 and 30, and recall that we denote by \({\tilde{h}}_k, {\tilde{f}}_k, {\tilde{g}}_k, {\tilde{\Omega }}_k\) and \({\tilde{H}}_i^k\) the corresponding objects with respect to another parameterization \(({\tilde{{\textbf {M}}}}, {\tilde{{\textbf {b}}}})\). We give in the following proposition the relations that link these objects when the two parameterizations \({\textbf {(M, b)}}\) and \(({\tilde{{\textbf {M}}}}, {\tilde{{\textbf {b}}}})\) are equivalent.

Proposition 39

Assume \(({{\textbf {M}}},{{\textbf {b}}}) \sim ({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) and consider \(\varvec{\varphi }\) and \(\varvec{\lambda }\) as in Definition 37. Let \(\varvec{\Pi }\) be a list of sets of closed polyhedra that is admissible with respect to \({\textbf {(M, b)}}\). Then:

  1.

    for all \(k \in \llbracket 0, K-1 \rrbracket\),

    $$\begin{aligned} {\tilde{h}}_k = P_{\varphi _k} {{\,\textrm{D}\,}}(\lambda ^{k}) \circ h_k \circ {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1}, \end{aligned}$$
  2.

    for all \(k \in \llbracket 0, K \rrbracket\),

    $$\begin{aligned} {\tilde{f}}_k= & {} P_{\varphi _k} {{\,\textrm{D}\,}}(\lambda ^{k}) \circ f_k, \nonumber \\ {\tilde{g}}_k= & {} g_k \circ {{\,\textrm{D}\,}}(\lambda ^{k})^{-1} P_{\varphi _{k}}^{-1},\nonumber \\ {\tilde{\Omega }}_k= & {} P_{\varphi _{k}}{{\,\textrm{D}\,}}(\lambda ^{k}) \Omega _k, \end{aligned}$$
    (26)
  3.

    for all \(k \in \llbracket 2,K \rrbracket\), for all \(i \in \llbracket 1, n_{k-1} \rrbracket\),

    $$\begin{aligned} {\tilde{H}}_i^k = P_{\varphi _{k}}{{\,\textrm{D}\,}}(\lambda ^{k})H_{\varphi _{k-1}^{-1}(i)}^k, \end{aligned}$$
  4.

    for all \(k \in \llbracket 1, K-1 \rrbracket\), the set of closed polyhedra \({\tilde{\Pi }}_k = \{ P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k}) D, D \in \Pi _k \}\) is admissible for \({\tilde{g}}_k\), i.e. the list \(\varvec{{\tilde{\Pi }}} = ({\tilde{\Pi }}_1, \dots , {\tilde{\Pi }}_{K-1})\) is admissible with respect to \(({\tilde{{\textbf {M}}}}, {\tilde{{\textbf {b}}}})\).

Proof

  1.

    Let \(k \in \llbracket 0, K-1 \rrbracket\). If \(k \ne 0\), we have from Definition 24:

    $$\begin{aligned} \begin{aligned} {\tilde{h}}_k(x)&= \sigma ({\tilde{M}}^{k} x + {\tilde{b}}^{k}) \\&= \sigma \left( P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k})M^{k}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1} x \right. \\&\quad \left. + P_{\varphi _k}{{\,\textrm{D}\,}}(\lambda ^{k} ) b^{k}\right) \\&=\sigma \left( P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k}) \left[ M^{k}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1} x + b^{k} \right] \right) . \\ \end{aligned} \end{aligned}$$

    Denote \(y:= \left[ M^{k}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1} x + b^{k} \right]\). Let \(i \in \llbracket 1, n_k \rrbracket\). Using (21) and the fact that \(\lambda ^{k}_{\varphi _k^{-1}(i)}\) is nonnegative, the \(i^{\text {th}}\) coordinate of \({\tilde{h}}_k(x)\) is

    $$\begin{aligned} \begin{aligned} {\tilde{h}}_k(x)_i&= \left[ \sigma \left( P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k})y \right) \right] _i = \sigma \left( \left[ P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k})y \right] _i \right) \\&= \sigma \left( \lambda ^{k}_{\varphi _k^{-1}(i)} y_{\varphi _k^{-1}(i)}\right) \\&= \lambda ^{k}_{\varphi _k^{-1}(i)} \sigma \left( y_{\varphi _k^{-1}(i)}\right) \\&= \left[ P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k}) \sigma \left( y \right) \right] _i. \end{aligned} \end{aligned}$$

    Finally, we find the expression of \({\tilde{h}}_k(x)\):

    $$\begin{aligned} \begin{aligned} {\tilde{h}}_k(x)&= P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k}) \sigma \left( y\right) \\&= P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k}) \sigma \left( M^{k}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1} x + b^{k} \right) \\&= P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k})h_k \left( {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1} (x)\right) . \end{aligned} \end{aligned}$$

    This concludes the proof when \(k \ne 0\).

    The case \(k=0\) is proven similarly but replacing the ReLU function \(\sigma\) by the identity.

  2.
    • We prove by induction the expression of \({\tilde{f}}_k\).

      For \(k = K\), we have \({\tilde{f}}_K = f_K = id_{{\mathbb {R}}^{n_K}}\), and since \(P_{\varphi _K} = {{\,\textrm{Id}\,}}_{n_K}\) and \(\lambda ^{K} = \mathbbm {1}_{n_K}\) the equality \({\tilde{f}}_K = P_{\varphi _K} {{\,\textrm{D}\,}}(\lambda ^{K}) f_K\) holds.

      Now let \(k \in \llbracket 0, K-1 \rrbracket\). Suppose the induction hypothesis is true for \({\tilde{f}}_{k+1}\). Using the expression of \({\tilde{h}}_k\) we just proved in 1 and the induction hypothesis, we have

      $$\begin{aligned} \begin{aligned} {\tilde{f}}_k&= {\tilde{h}}_k \circ {\tilde{f}}_{k+1} \\&= \left( P_{\varphi _k} {{\,\textrm{D}\,}}(\lambda ^{k}) \circ h_k \circ {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1}\right) \circ \left( P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1}) \circ f_{k+1} \right) \\&= P_{\varphi _k} {{\,\textrm{D}\,}}(\lambda ^{k}) \circ h_k \circ f_{k+1} \\&= P_{\varphi _k} {{\,\textrm{D}\,}}(\lambda ^{k}) \circ f_k. \end{aligned} \end{aligned}$$

      This concludes the induction.

    • We prove similarly the expression of \({\tilde{g}}_k\), but starting from \(k = 0\): first we have \({\tilde{g}}_0 = g_0 = id_{{\mathbb {R}}^{n_0}}\), and then, for \(k \in \llbracket 0, K-1 \rrbracket\), we write \({\tilde{g}}_{k+1} = {\tilde{g}}_{k} \circ {\tilde{h}}_{k}\) and we use the induction hypothesis and the expression of \({\tilde{h}}_k\).

    • Using relation (26), which we just proved, we obtain

      $$\begin{aligned} {\tilde{\Omega }}_k = {\tilde{f}}_k(\Omega ) = P_{\varphi _k} {{\,\textrm{D}\,}}(\lambda ^{k}) f_k (\Omega ) = P_{\varphi _k} {{\,\textrm{D}\,}}(\lambda ^{k}) \Omega _k. \end{aligned}$$
  3.

    Let \(k \in \llbracket 2, K \rrbracket\) and \(i \in \llbracket 1, n_{k-1} \rrbracket\). For all \(x \in {\mathbb {R}}^{n_k}\), using (24) and (21),

    $$\begin{aligned} \begin{aligned} x \in {\tilde{H}}^k_i \quad&\Longleftrightarrow \quad {\tilde{M}}^{k-1}_{i,.}x + {\tilde{b}}_i^{k-1} = 0 \\&\Longleftrightarrow \quad \left[ P_{\varphi _{k-1}} {{\,\textrm{D}\,}}(\lambda ^{k-1} )M^{k-1}{{\,\textrm{D}\,}}(\lambda ^{k})^{-1}P_{\varphi _{k}}^{-1} \right] _{i,.} x \\&\quad \qquad + \left[ P_{\varphi _{k-1}} {{\,\textrm{D}\,}}(\lambda ^{k-1} ) b^{k-1} \right] _{i} =0 \\&\Longleftrightarrow \quad \lambda ^{k-1}_{\varphi _{k-1}^{-1}(i)}M^{k-1}_{\varphi _{k-1}^{-1}(i),.}{{\,\textrm{D}\,}}(\lambda ^{k})^{-1}P_{\varphi _{k}}^{-1} x + \lambda ^{k-1}_{\varphi _{k-1}^{-1}(i)}b^{k-1}_{\varphi _{k-1}^{-1}(i)} = 0 \\&\Longleftrightarrow \quad \lambda ^{k-1}_{\varphi _{k-1}^{-1}(i)} \left( M^{k-1}_{\varphi _{k-1}^{-1}(i),.}{{\,\textrm{D}\,}}(\lambda ^{k})^{-1}P_{\varphi _{k}}^{-1} x + b^{k-1}_{\varphi _{k-1}^{-1}(i)} \right) = 0 \\&\Longleftrightarrow \quad M^{k-1}_{\varphi _{k-1}^{-1}(i),.}{{\,\textrm{D}\,}}(\lambda ^{k})^{-1}P_{\varphi _{k}}^{-1} x + b^{k-1}_{\varphi _{k-1}^{-1}(i)} = 0 \\&\Longleftrightarrow \quad {{\,\textrm{D}\,}}(\lambda ^{k})^{-1}P_{\varphi _{k}}^{-1} x \in H^k_{\varphi _{k-1}^{-1}(i)}. \\ \end{aligned} \end{aligned}$$

    Thus, \({\tilde{H}}^k_i = P_{\varphi _{k}}{{\,\textrm{D}\,}}(\lambda ^{k})H_{\varphi _{k-1}^{-1}(i)}^k\).

  4.

    For all \(D \in \Pi _k\), denote \({\tilde{D}} = P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k}) D\). We have \({\tilde{\Pi }}_k = \{ {\tilde{D}}, D \in \Pi _k \}\).

    Let \(D \in \Pi _k\). The matrix \(P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k})\) is invertible so, according to Proposition 16, \({\tilde{D}} = P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k}) D\) is a closed polyhedron, and since \(\mathring{D} \ne \emptyset\) we also have \(\mathring{{\tilde{D}}} \ne \emptyset\).

Now recall from Item 2 that:

$$\begin{aligned} {\tilde{g}}_k = g_k \circ {{\,\textrm{D}\,}}(\lambda ^{k})^{-1} P_{\varphi _{k}}^{-1}. \end{aligned}$$

For all \(x \in {\tilde{D}}\), we have \({{\,\textrm{D}\,}}(\lambda ^{k})^{-1} P_{\varphi _{k}}^{-1} x \in D\). Since \(\Pi _k\) is admissible with respect to \(g_k\) (by definition of \(\varvec{\Pi }\)), \(g_k\) is linear on D, and thus the function \({\tilde{g}}_k\) is linear on \({\tilde{D}}\).

Again, since \(\Pi _k\) is admissible with respect to \(g_k\), we have \(\bigcup _{D \in \Pi _k} D = {\mathbb {R}}^m\), and thus

$$\begin{aligned} \begin{aligned} \bigcup _{{\tilde{D}} \in {\tilde{\Pi }}_k} {\tilde{D}}&= \bigcup _{D \in \Pi _k} P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k}) D \\&= P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k}) \left( \bigcup _{D \in \Pi _k} D \right) \\&= P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k}) \left( {\mathbb {R}}^m \right) \\&= {\mathbb {R}}^m, \end{aligned} \end{aligned}$$

which shows that \({\tilde{\Pi }}_k\) is admissible with respect to \({\tilde{g}}_k\).

This being true for any \(k \in \llbracket 1, K-1 \rrbracket\), we conclude that \(\varvec{{\tilde{\Pi }}}\) is admissible with respect to \(({\tilde{{\textbf {M}}}}, {\tilde{{\textbf {b}}}})\). \(\square\)

Corollary 40

If \(({{\textbf {M}}},{{\textbf {b}}}) \sim ({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\), then \(f_{{{\textbf {M}}},{{\textbf {b}}}} = f_{{\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}}\).

Proof

Consider \(\varvec{\varphi }\) and \(\varvec{\lambda }\) as in Definition 37. Looking at (26) for \(k=0\), and using the fact that \(f_0 = f_{{{\textbf {M}}},{{\textbf {b}}}}\) and \({\tilde{f}}_0 = f_{{\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}}\), we obtain from Proposition 39

$$\begin{aligned} f_{{\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}} = P_{\varphi _0} {{\,\textrm{D}\,}}(\lambda ^{0}) f_{{{\textbf {M}}},{{\textbf {b}}}}. \end{aligned}$$

By definition of \(\varvec{\varphi }\) and \(\varvec{\lambda }\), we have \(P_{\varphi _0} = {{\,\textrm{Id}\,}}_{n_0}\) and \(\lambda ^{0} = \mathbbm {1}_{n_0}\), so we can finally conclude:

$$\begin{aligned} f_{{\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}} = f_{{{\textbf {M}}},{{\textbf {b}}}}. \end{aligned}$$

\(\square\)
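
For concreteness, the invariance stated in Corollary 40 can be checked numerically. The following is a minimal sketch (in Python, with hypothetical layer widths, random parameters and forward layer indexing, all of which are illustrative assumptions and not part of the construction above): it builds an equivalent parameterization by permuting and positively rescaling the hidden neurons as in Definition 37 and compares the two implemented functions on a batch of random inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Hypothetical layer widths, listed from the input to the output layer
# (forward indexing, which reverses the paper's numbering).
widths = [4, 5, 3, 2]
L = len(widths) - 1
Ws = [rng.standard_normal((widths[t + 1], widths[t])) for t in range(L)]
bs = [rng.standard_normal(widths[t + 1]) for t in range(L)]

def forward(weights, biases, x):
    # x has shape (input width, number of samples)
    for t, (W, b) in enumerate(zip(weights, biases)):
        x = W @ x + b[:, None]
        if t < len(weights) - 1:   # ReLU on hidden layers, identity at the output
            x = relu(x)
    return x

# One invertible map T = P D(lambda) per hidden layer, identity at the
# input and output layers, as in Definition 37.
Ts = [np.eye(widths[0])]
for r in range(1, L):
    P = np.eye(widths[r])[rng.permutation(widths[r])]
    lam = rng.uniform(0.5, 2.0, widths[r])          # positive rescaling
    Ts.append(P @ np.diag(lam))
Ts.append(np.eye(widths[L]))

Ws_eq = [Ts[t + 1] @ Ws[t] @ np.linalg.inv(Ts[t]) for t in range(L)]
bs_eq = [Ts[t + 1] @ bs[t] for t in range(L)]

# The equivalent parameterization implements the same function.
X = rng.standard_normal((widths[0], 1000))
print(np.allclose(forward(Ws, bs, X), forward(Ws_eq, bs_eq, X)))
```

Only the positivity of the rescaling matters here: it is what lets the diagonal factor commute with the ReLU, as in the computation of Item 1 of Proposition 39.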

Definition 41

We say that \(({{\textbf {M}}},{{\textbf {b}}})\) is normalized if for all \(k \in \llbracket 1, K-1 \rrbracket\), for all \(i \in \llbracket 1, n_k \rrbracket\), we have:

$$\begin{aligned} \Vert {M}^{k}_{i,.}\Vert =1. \end{aligned}$$

Proposition 42

If \(({{\textbf {M}}},{{\textbf {b}}})\) satisfies, for all \(k \in \llbracket 1, K-1 \rrbracket\), for all \(i \in \llbracket 1, n_k \rrbracket\), \(M^{k}_{i,.} \ne 0\), then there exists an equivalent parameterization \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) that is normalized.

Proof

We define recursively the family \((\lambda ^{0}, \lambda ^{1}, \dots , \lambda ^{K}) \in (\mathbb {R}_+^*)^{n_0} \times \dots \times ( \mathbb {R}_+^*)^{n_{K}}\) by:

  • \(\lambda ^{K} = \mathbbm {1}_{n_K}\);

  • for all \(k \in \llbracket 1, K-1 \rrbracket\), for all \(i \in \llbracket 1, n_{k}\rrbracket\),

    $$\begin{aligned} \lambda ^{k}_i = \frac{1}{\Vert M^{k}_{i,.}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}\Vert }; \end{aligned}$$
  • \(\lambda ^{0}= \mathbbm {1}_{n_0}\).

Consider the parameterization \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) defined by, for all \(k \in \llbracket 0, K-1 \rrbracket\):

$$\begin{aligned} {\left\{ \begin{array}{ll}{\tilde{M}}^{k}= {{\,\textrm{D}\,}}(\lambda ^{k} )M^{k}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} \\ {\tilde{b}}^{k} = {{\,\textrm{D}\,}}(\lambda ^{k} ) b^{k}.\end{array}\right. } \end{aligned}$$

The parameterization is, by definition, equivalent to \({\textbf {(M, b)}}\), and, for all \(k \in \llbracket 1, K-1 \rrbracket\), for all \(i \in \llbracket 1, n_k \rrbracket\):

$$\begin{aligned} \begin{aligned} \Vert {\tilde{M}}^{k}_{i,.}\Vert&= \left\| \left[ {{\,\textrm{D}\,}}(\lambda ^{k}) M^{k}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} \right] _{i,.} \right\| \\&= \left\| \lambda ^{k}_i M^{k}_{i,.}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} \right\| \\&= \left\| \frac{1}{\Vert M^{k}_{i,.}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}\Vert } M^{k}_{i,.}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} \right\| \\&=1. \end{aligned} \end{aligned}$$

\(\square\)
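
The recursive choice of \(\lambda\) in the proof above is constructive. A minimal sketch (hypothetical widths and random parameters, forward indexing from input to output, all illustrative assumptions) rescales each hidden layer so that the rows of its incoming weight matrix have unit norm; by Corollary 40, the implemented function is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
widths = [4, 5, 3, 2]                               # illustrative widths only
L = len(widths) - 1
Ws = [rng.standard_normal((widths[t + 1], widths[t])) for t in range(L)]
bs = [rng.standard_normal(widths[t + 1]) for t in range(L)]

scales = [np.ones(widths[0])]                       # lambda = 1 at the input layer
for t in range(L - 1):                              # hidden layers only
    rescaled_rows = Ws[t] / scales[t]               # rows of M D(lambda_prev)^{-1}
    scales.append(1.0 / np.linalg.norm(rescaled_rows, axis=1))
scales.append(np.ones(widths[L]))                   # lambda = 1 at the output layer

Wn = [np.diag(scales[t + 1]) @ Ws[t] @ np.diag(1.0 / scales[t]) for t in range(L)]
bn = [scales[t + 1] * bs[t] for t in range(L)]

# Every hidden weight matrix now has unit-norm rows; the output layer
# absorbs the compensation, so the parameterization stays equivalent.
print([np.allclose(np.linalg.norm(W, axis=1), 1.0) for W in Wn[:-1]])
```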

Proposition 43

If \(({{\textbf {M}}},{{\textbf {b}}})\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) are both normalized, then they are equivalent if and only if there exists a family of permutations \((\varphi _0, \dots , \varphi _{K}) \in {\mathfrak {S}}_{n_0} \times \dots \times {\mathfrak {S}}_{n_{K}}\), with \(\varphi _0 = id_{\llbracket 1, n_0 \rrbracket }\) and \(\varphi _{K} = id_{\llbracket 1, n_K \rrbracket }\), such that for all \(k \in \llbracket 0, K-1 \rrbracket\):

$$\begin{aligned} {\left\{ \begin{array}{ll}{\tilde{M}}^{k}= P_{\varphi _{k}} M^{k}P_{\varphi _{k+1}}^{-1} \\ {\tilde{b}}^{k} = P_{\varphi _k} b^{k}.\end{array}\right. } \end{aligned}$$
(27)

Proof

Assume \(({{\textbf {M}}},{{\textbf {b}}})\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) are equivalent. Then there exist a family of permutations \((\varphi _0, \dots , \varphi _{K}) \in {\mathfrak {S}}_{n_0} \times \dots \times {\mathfrak {S}}_{n_{K}}\) and a family \((\lambda ^{0},\dots , \lambda ^{K}) \in (\mathbb {R}_+^*)^{n_0} \times \dots \times ( \mathbb {R}_+^*)^{n_{K}}\) as in Definition 37.

Let us prove by induction that \(\lambda ^{k} = \mathbbm {1}_{n_k}\) for all \(k \in \llbracket 0, K \rrbracket\).

For \(k=K\) it is true by Definition 37.

Let \(k \in \llbracket 1, K-1 \rrbracket\), and suppose \(\lambda ^{k+1} = \mathbbm {1}_{n_{k+1}}\). This means \({{\,\textrm{D}\,}}(\lambda ^{k+1})= {{\,\textrm{Id}\,}}_{n_{k+1}}\). Let \(i \in \llbracket 1, n_k \rrbracket\). Since \({\textbf {(M, b)}}\) is normalized, \(\Vert M^{k}_{i,.} \Vert =1\). Since \(P_{\varphi _{k+1}}^{-1}\) is a permutation matrix, it is orthogonal so \(\Vert M^{k}_{i,.} P_{\varphi _{k+1}}^{-1} \Vert = \Vert M^{k}_{i,.} \Vert = 1\). Recalling (24) and using the fact that \(({\tilde{{\textbf {M}}}}, {\tilde{{\textbf {b}}}})\) is normalized, that \({{\,\textrm{D}\,}}(\lambda ^{k+1})= {{\,\textrm{Id}\,}}_{n_{k+1}}\) and that \(\lambda ^{k}_i\) is positive, we have:

$$\begin{aligned} \begin{aligned} 1&= \Vert {\tilde{M}}^{k}_{\varphi _k(i),.} \Vert = \Vert \lambda ^{k}_i M^{k}_{i,.} {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} P_{\varphi _{k+1}}^{-1} \Vert \\&= \lambda ^{k}_i \Vert M^{k}_{i,.} P_{\varphi _{k+1}}^{-1} \Vert \\&= \lambda ^{k}_i. \end{aligned} \end{aligned}$$

This shows \(\lambda ^{k} = \mathbbm {1}_{n_k}\).

The case \(k=0\) is also true by Definition 37.

Equation (24) with \(\lambda ^{k} = \mathbbm {1}_{n_k}\) for all \(k \in \llbracket 0, K\rrbracket\) is precisely equation (27).

The converse is clear: (27) is a particular case of (24) with \(\lambda ^{k} = \mathbbm {1}_{n_k}\). \(\square\)

Appendix 2: Main theorem

In Appendix 2, we prove the main theorem using the notations and results of Appendix 1, and assuming Lemma 53, which is proven in Appendix 3.

More precisely, we begin by stating the conditions \({{\textbf {C}}}\) and \({{\textbf {P}}}\) in Sect. 8.1, we then state our main result, which is Theorem 51, in Sect. 8.2, and we give a consequence of this result in terms of risk minimization, which is Corollary 52, in Sect. 8.3. Finally we prove Theorem 51 and Corollary 52 in Sects. 8.4 and 8.5 respectively.

1.1 Conditions

Assume \(g: {\mathbb {R}}^m \rightarrow {\mathbb {R}}^n\) is a continuous piecewise linear function, \(\Pi\) is a set of closed polyhedra admissible with respect to g, and let \(\Omega \subset {\mathbb {R}}^l\), \(M \in {\mathbb {R}}^{m \times l}\) and \(b \in {\mathbb {R}}^{m}\).

We define

$$\begin{aligned} h: {\mathbb {R}}^l \longrightarrow {\mathbb {R}}^m, \qquad x \longmapsto \sigma (Mx + b) \end{aligned}$$

and

$$\begin{aligned} h^{lin}: {\mathbb {R}}^l \longrightarrow {\mathbb {R}}^m, \qquad x \longmapsto Mx + b. \end{aligned}$$

Definition 44

For all \(i \in \llbracket 1, m \rrbracket\), we denote \(E_i = \{x \in {\mathbb {R}}^m, \ x_i = 0 \}\).

Definition 45

Let \(D \in \Pi\). The function g coincides with a linear function on D. Since the interior of D is nonempty, we define \(V(D) \in {\mathbb {R}}^{n \times m}\) and \(c(D) \in {\mathbb {R}}^n\) as the unique couple satisfying, for all \(x \in D\):

$$\begin{aligned} g(x) = V(D)x + c(D). \end{aligned}$$

Definition 46

We say that \((g,M,b,\Omega ,\Pi )\) satisfies the conditions \({{\textbf {C}}}\) if and only if:

\({{\textbf {C}}}.a)\):

M is full row rank;

\({{\textbf {C}}}.b)\):

for all \(i \in \llbracket 1, m \rrbracket\), there exists \(x \in \mathring{\Omega }\) such that

$$\begin{aligned} M_{i,.}x + b_i = 0, \end{aligned}$$

or equivalently,

$$\begin{aligned} E_i \cap h^{lin}(\mathring{\Omega }) \ne \emptyset ; \end{aligned}$$
\({{\textbf {C}}}.c)\):

for all \(D \in \Pi\), for all \(i \in \llbracket 1, m \rrbracket\), if \(E_i \cap D \cap h(\Omega ) \ne \emptyset\) then \(V_{.,i}(D) \ne 0\);

\({{\textbf {C}}}.d)\):

for any affine hyperplane \(H \subset {\mathbb {R}}^{l}\),

$$\begin{aligned} H \cap \mathring{\Omega } \ \not \subset \bigcup _{D \in \Pi } \partial h^{-1}(D). \end{aligned}$$
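
Conditions \({{\textbf {C}}}.a)\) and \({{\textbf {C}}}.b)\) can be tested directly from (M, b) and \(\Omega\). As an illustration, when \(\Omega\) is an axis-aligned box, the sketch below (with hypothetical values of M, b and of the box bounds) checks the full row rank of M and whether each hyperplane \(\{x, \ M_{i,.}x + b_i = 0\}\) meets the interior of the box. Conditions \({{\textbf {C}}}.c)\) and \({{\textbf {C}}}.d)\), which involve the piecewise linear function g, are not covered by this sketch.

```python
import numpy as np

# Hypothetical weights and an axis-aligned box Omega = [lo, hi]^l
# (illustrative values only).
M = np.array([[1.0, 0.5, -0.3],
              [0.2, -1.0, 0.8]])
b = np.array([0.1, -0.2])
lo, hi = -1.0, 1.0
m, l = M.shape

# C.a): M has full row rank.
full_row_rank = np.linalg.matrix_rank(M) == m

# C.b): the hyperplane {x, M_i x + b_i = 0} meets the open box iff the
# affine map x -> M_i x + b_i takes both signs on the box, i.e. its
# minimum over the box is < 0 and its maximum is > 0.
mins = np.minimum(M * lo, M * hi).sum(axis=1) + b
maxs = np.maximum(M * lo, M * hi).sum(axis=1) + b
crosses_interior = (mins < 0) & (maxs > 0)

print(full_row_rank, crosses_interior)
```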

Definition 47

For all \(k \in \llbracket 1, K-1 \rrbracket\), for all \(i \in \llbracket 1, n_k \rrbracket\), we denote \(E_i^k = \{x \in {\mathbb {R}}^{n_k}, x_i = 0 \}\).

We now state the conditions \({{\textbf {P}}}\) (already stated in the main text in Definition 5).

Definition 48

We say that \(({{\textbf {M}}},{{\textbf {b}}},\Omega , \varvec{\Pi })\) satisfies the conditions \({{\textbf {P}}}\) if and only if for all \(k \in \llbracket 1, K-1 \rrbracket\), \((g_k, M^{k}, b^{k}, \Omega _{k+1}, \Pi _k)\) satisfies the conditions \({{\textbf {C}}}\).

Explicitly, for all \(k \in \llbracket 1, K-1 \rrbracket\), the conditions are the following:

\({{\textbf {P}}}.a)\):

\(M^k\) is full row rank;

\({{\textbf {P}}}.b)\):

for all \(i \in \llbracket 1, n_k \rrbracket\), there exists \(x \in \mathring{\Omega }_{k+1}\) such that

$$\begin{aligned} M^k_{i,.}x + b^k_i = 0, \end{aligned}$$

or equivalently

$$\begin{aligned} E_i^k \cap h_k^{lin}(\mathring{\Omega }_{k+1}) \ne \emptyset ; \end{aligned}$$
\({{\textbf {P}}}.c)\):

for all \(D \in \Pi _k\), for all \(i \in \llbracket 1, n_k \rrbracket\), if \(E^k_i \cap D \cap \Omega _{k} \ne \emptyset\) then \(V^k_{.,i}(D) \ne 0\);

\({{\textbf {P}}}.d)\):

for any affine hyperplane \(H \subset {\mathbb {R}}^{n_{k+1}}\),

$$\begin{aligned} H \cap \mathring{\Omega }_{k+1} \ \not \subset \bigcup _{D \in \Pi _k} \partial h_k^{-1}(D). \end{aligned}$$

Remark 49

The condition \({{\textbf {P}}}.b)\) implies that for all \(k \in \llbracket 1, K-1 \rrbracket\), \(\mathring{\Omega }_{k+1} \ne \emptyset\), and in particular for \(k = K-1\), the set \(\Omega = \Omega _K\) has nonempty interior.

The following proposition shows that the conditions \({{\textbf {P}}}\) are stable modulo permutation and positive rescaling, as defined in Definition 37.

Proposition 50

Suppose \(({{\textbf {M}}}, {{\textbf {b}}})\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) are two equivalent network parameterizations, and suppose \(({{\textbf {M}}}, {{\textbf {b}}}, \Omega , \varvec{\Pi })\) satisfies the conditions \({{\textbf {P}}}\). Then, if we define \(\varvec{{\tilde{\Pi }}}\) as in Item 4 of Proposition 39, \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}, \Omega , \varvec{{\tilde{\Pi }}})\) satisfies the conditions \({{\textbf {P}}}\).

Proof

Since \(({{\textbf {M}}}, {{\textbf {b}}})\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) are equivalent, by Definition 37 there exist

  • a family of permutations \((\varphi _0, \dots , \varphi _{K}) \in {\mathfrak {S}}_{n_0} \times \dots \times {\mathfrak {S}}_{n_{K}}\), with \(\varphi _0 = id_{\llbracket 1, n_0 \rrbracket }\) and \(\varphi _{K} = id_{\llbracket 1, n_K \rrbracket }\),

  • a family \((\lambda ^{0}, \lambda ^{1}, \dots , \lambda ^{K}) \in (\mathbb {R}_+^*)^{n_0} \times \dots \times ( {\mathbb {R}}_+^*)^{n_{K}}\), with \(\lambda ^{0} = \mathbbm {1}_{n_0}\) and \(\lambda ^{K} = \mathbbm {1}_{n_K}\),

such that

$$\begin{aligned} {\left\{ \begin{array}{ll}{\tilde{M}}^{k}= P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k} )M^{k}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1} \\ {\tilde{b}}^{k} = P_{\varphi _k}{{\,\textrm{D}\,}}(\lambda ^{k} ) b^{k}.\end{array}\right. } \end{aligned}$$
(28)

Let \(k \in \llbracket 1, K-1 \rrbracket\). We know the conditions \({{\textbf {P}}}.a) - {{\textbf {P}}}.d)\) are satisfied by \((g_k, M^{k}, b^{k}, \Omega _{k+1}, \Pi _k)\), let us show they are satisfied by \(({\tilde{g}}_k, {\tilde{M}}^{k}, {\tilde{b}}^{k}, {\tilde{\Omega }}_{k+1}, {\tilde{\Pi }}_k)\).

  • \({{\textbf {P}}}.a)\) Since \(M^{k}\) satisfies \({{\textbf {P}}}.a)\), it is full row rank, and using (28) and the fact that the matrices \(P_{\varphi _{k}}, {{\,\textrm{D}\,}}(\lambda ^{k} ), {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}\) and \(P_{\varphi _{k+1}}^{-1}\) are invertible, we see that \({\tilde{M}}^k\) is full row rank.

  • \({{\textbf {P}}}.b)\) Let \(i \in \llbracket 1, n_k \rrbracket\). Since \((g_k, M^{k}, b^{k}, \Omega _{k+1}, \Pi _k)\) satisfies the condition \({{\textbf {P}}}.b)\), we can choose \(x \in \mathring{\Omega }_{k+1}\) such that

    $$\begin{aligned} M^k_{\varphi _k^{-1}(i),.}x + b^k_{\varphi _k^{-1}(i)} = 0. \end{aligned}$$
    (29)

    Recall from Proposition 39 that

    $$\begin{aligned} {\tilde{\Omega }}_{k+1} = P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1}) \Omega _{k+1}. \end{aligned}$$

    Since \(P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1})\) is an invertible matrix, it induces a homeomorphism of \({\mathbb {R}}^{n_{k+1}}\), and thus this identity also holds for the interiors:

    $$\begin{aligned} \mathring{{\tilde{\Omega }}}_{k+1} = P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1}) \mathring{\Omega }_{k+1}. \end{aligned}$$

    Given that \(x \in \mathring{\Omega }_{k+1}\), defining \(y = P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1})x\), we have \(y \in \mathring{{\tilde{\Omega }}}_{k+1}\).

    Using (28), (21) and (29), we have

    $$\begin{aligned} \begin{aligned} {\tilde{M}}^k_{i,.}y + {\tilde{b}}^k_i&= [P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k} ) M^k{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1}]_{i,.} y + [P_{\varphi _{k}} {{\,\textrm{D}\,}}(\lambda ^{k} )b^k]_i \\&= [{{\,\textrm{D}\,}}(\lambda ^{k} ) M^k{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1}]_{\varphi _k^{-1}(i),.} y + [ {{\,\textrm{D}\,}}(\lambda ^{k} )b^k]_{\varphi _k^{-1}(i)} \\&= \lambda ^k_{\varphi _k^{-1}(i)}M^k_{\varphi _k^{-1}(i),.}{{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1}P_{\varphi _{k+1}}^{-1} y + \lambda ^{k}_{\varphi _{k}^{-1}(i)}b^k_{\varphi _{k}^{-1}(i)} \\&= \lambda ^k_{\varphi _k^{-1}(i)}M^k_{\varphi _k^{-1}(i),.}x + \lambda ^{k}_{\varphi _{k}^{-1}(i)}b^k_{\varphi _{k}^{-1}(i)} \\&= 0. \end{aligned} \end{aligned}$$

    We showed that there exists \(y \in \mathring{{\tilde{\Omega }}}_{k+1}\) such that

    $$\begin{aligned} {\tilde{M}}^k_{i,.}y + {\tilde{b}}^k_i = 0, \end{aligned}$$

    which concludes the proof of \({{\textbf {P}}}.b)\).

  • \({{\textbf {P}}}.c)\) Let \({\tilde{D}} \in {\tilde{\Pi }}_{k}\) and \(i \in \llbracket 1, n_k \rrbracket\). Suppose \(E_i^k \cap {\tilde{D}} \cap {\tilde{h}}_k({\tilde{\Omega }}_{k+1})\ne \emptyset\), and let us show \({\tilde{V}}^k_{.,i} ({\tilde{D}}) \ne 0\).

    Let \(x \in {\tilde{\Omega }}_{k+1}\) such that \({\tilde{h}}_k(x) \in E_i^k \cap {\tilde{D}}\). Inverting the equalities of Proposition 39 we get

    • \(h_{k} = {{\,\textrm{D}\,}}(\lambda ^{k})^{-1}P_{\varphi _{k}}^{-1} {\tilde{h}}_{k} \circ P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1})\),

    • \(H^{k+1}_{\varphi _{k}^{-1}(i)} = {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} P_{\varphi _{k+1}}^{-1} {\tilde{H}}_{i}^{k+1}\),

    • \(\Omega _{k+1} = {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} P_{\varphi _{k+1}}^{-1} {\tilde{\Omega }}_{k+1}\).

    Denote \(D = {{\,\textrm{D}\,}}(\lambda ^{k})^{-1} P_{\varphi _{k}}^{-1} {\tilde{D}}\). Since \({\tilde{\Pi }}_k\) has been defined as in Item 4 of Proposition 39, we know that \(D \in \Pi _k\). Let \(y = {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} P_{\varphi _{k+1}}^{-1}x\). Let us prove that \(h_k(y) \in E_{\varphi _k^{-1}(i)}^{k} \cap D \cap h_k(\Omega _{k+1})\).

    Since \(x \in {\tilde{\Omega }}_{k+1}\), we see that \(y \in \Omega _{k+1}\), so \(h_k(y) \in h_k(\Omega _{k+1})\).

    We also have

    $$\begin{aligned} \begin{aligned} h_k(y)&= {{\,\textrm{D}\,}}(\lambda ^{k})^{-1}P_{\varphi _{k}}^{-1} {\tilde{h}}_{k} \circ P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1})\left( {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} P_{\varphi _{k+1}}^{-1} x \right) \\&= {{\,\textrm{D}\,}}(\lambda ^{k})^{-1}P_{\varphi _{k}}^{-1} {\tilde{h}}_{k}\left( x\right) , \end{aligned} \end{aligned}$$

    which shows, since \({\tilde{h}}_k(x) \in {\tilde{D}}\), that \(h_k(y) \in D\).

    Since, by hypothesis, \({\tilde{h}}_k(x) \in E_i^k\), using (21) and (22), we have

    $$\begin{aligned} \begin{aligned} \left[ h_k(y)\right] _{\varphi _k^{-1}(i)}&= \left[ {{\,\textrm{D}\,}}(\lambda ^{k})^{-1}P_{\varphi _{k}}^{-1} {\tilde{h}}_{k}\left( x\right) \right] _{\varphi _k^{-1}(i)} \\&= \frac{1}{\lambda ^{k}_{\varphi _k^{-1}(i)}} \left[ P_{\varphi _{k}}^{-1} {\tilde{h}}_{k}(x)\right] _{\varphi _k^{-1}(i)} \\&= \frac{1}{\lambda ^{k}_{\varphi _k^{-1}(i)}} ({\tilde{h}}_k(x))_i \\&= 0. \end{aligned} \end{aligned}$$

    This proves that \(h_k(y) \in E_{\varphi _k^{-1}(i)}^k\).

    We proved that

    $$\begin{aligned} h_k(y) \in E_{\varphi _k^{-1}(i)}^k \cap D \cap h_k(\Omega _{k+1}),\end{aligned}$$

    which shows this intersection is not empty. Since \((g_k, M^{k}, b^{k}, \Omega _{k+1}, \Pi _k)\) satisfies \({{\textbf {P}}}.c)\), we have \(V_{.,\varphi _k^{-1}(i)}^k(D) \ne 0\).

    Since, according to Proposition 39,

    $$\begin{aligned} {\tilde{g}}_k = g_k \circ {{\,\textrm{D}\,}}(\lambda ^{k})^{-1} P_{\varphi _{k}}^{-1}, \end{aligned}$$

    we deduce:

    $$\begin{aligned} {\tilde{V}}^k({\tilde{D}}) = V^k(D) {{\,\textrm{D}\,}}(\lambda ^k)^{-1} P_{\varphi _k}^{-1}. \end{aligned}$$
    (30)

    For a matrix A and a permutation \(\varphi\), we have \([P_{\varphi } A ]_{i,.} = A_{\varphi ^{-1}(i),.}\), so by taking the transpose, we see that \([A^T P_{\varphi }^{-1}]_{.,i} = (A^T)_{.,\varphi ^{-1}(i)}\).

    Taking the \(i^{\text {th}}\) column of (30), we thus obtain

    $$\begin{aligned} {\tilde{V}}_{.,i}^k ({\tilde{D}}) = \left[ V^k(D) {{\,\textrm{D}\,}}(\lambda ^k)^{-1} P_{\varphi _k}^{-1}\right] _{.,i} = \frac{1}{\lambda ^{k}_{\varphi _k^{-1}(i)}}V_{.,\varphi _k^{-1}(i)}^k(D), \end{aligned}$$

    which shows that \({\tilde{V}}_{.,i}^k({\tilde{D}}) \ne 0\).

  • \({{\textbf {P}}}.d)\) Let \({\tilde{H}} \subset {\mathbb {R}}^{n_{k+1}}\) be an affine hyperplane. Denote \(H = {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} P_{\varphi _{k+1}}^{-1} {\tilde{H}}\). Since \({{\textbf {P}}}.d)\) holds for \((g_k, M^{k}, b^{k}, \Omega _{k+1}, \Pi _k)\), using Item 2 of Proposition 39, we have

    $$\begin{aligned} {\tilde{H}} \cap \mathring{{\tilde{\Omega }}}_{k+1}&= P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1}) \left( H \cap \mathring{\Omega }_{k+1}\right) \\&\not \subset P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1}) \bigcup _{D \in \Pi _k} \partial h_{k}^{-1}(D) \\&= \bigcup _{D \in \Pi _k} P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1}) \partial h_{k}^{-1}(D). \end{aligned}$$
    (31)

    For all k, \(P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1})\) is an invertible matrix, so it induces a homeomorphism of \({\mathbb {R}}^{n_{k+1}}\). We thus have

    $$\begin{aligned} P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1}) \partial h_{k}^{-1}(D) = \partial \left( P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1}) h_{k}^{-1}(D)\right) . \end{aligned}$$
    (32)

    Furthermore, by Item 1 of Proposition 39, we have \({\tilde{h}}_k = P_{\varphi _k} {{\,\textrm{D}\,}}(\lambda ^k) h_k \circ {{\,\textrm{D}\,}}(\lambda ^{k+1})^{-1} P_{\varphi _{k+1}}^{-1}\), so

    $$\begin{aligned} {\tilde{h}}_k^{-1} = P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1}) h_k^{-1} \circ {{\,\textrm{D}\,}}(\lambda ^k)^{-1} P_{\varphi _k}^{-1}, \end{aligned}$$

    and since \({\tilde{D}} = P_{\varphi _k} {{\,\textrm{D}\,}}(\lambda ^k) D\),

    $$\begin{aligned} {\tilde{h}}_k^{-1} ({\tilde{D}}) = P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1}) h_k^{-1}(D). \end{aligned}$$
    (33)

    Combining (32) and (33), we obtain

    $$\begin{aligned} P_{\varphi _{k+1}} {{\,\textrm{D}\,}}(\lambda ^{k+1}) \partial h_{k}^{-1}(D) = \partial {\tilde{h}}_k^{-1}({\tilde{D}}), \end{aligned}$$

    and we can thus reformulate (31) as

    $$\begin{aligned} {\tilde{H}} \cap \mathring{{\tilde{\Omega }}}_{k+1} \not \subset \bigcup _{{\tilde{D}} \in {\tilde{\Pi }}_k} \partial {\tilde{h}}_{k}^{-1}({\tilde{D}}). \end{aligned}$$

\(\square\)

1.2 Identifiability statement

We restate here the main theorem, already stated as Theorem 7 in the main part of the article.

Theorem 51

Let \(K \in {\mathbb {N}}\), \(K \ge 2\). Suppose we are given two networks with K layers, identical number of neurons per layer, and with respective parameters \(({{\textbf {M}}}, {{\textbf {b}}})\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\). Assume \(\varvec{\Pi }\) and \(\varvec{{\tilde{\Pi }}}\) are two lists of sets of closed polyhedra that are admissible with respect to \(({{\textbf {M}}}, {{\textbf {b}}})\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) respectively. Denote by \(n_K\) the number of neurons of the input layer, and suppose we are given a set \(\Omega \subset {\mathbb {R}}^{n_K}\) such that \(({{\textbf {M}}}, {{\textbf {b}}},\Omega ,\varvec{\Pi })\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}},\Omega , \varvec{{\tilde{\Pi }}})\) satisfy the conditions \({\textbf {{{\textbf {P}}}}}\), and such that, for all \(x \in \Omega\):

$$\begin{aligned} f_{{{\textbf {M}}},{{\textbf {b}}}}(x) = f_{\tilde{{{\textbf {M}}}}, \tilde{{{\textbf {b}}}}}(x). \end{aligned}$$

Then:

$$\begin{aligned} ({{\textbf {M}}},{{\textbf {b}}}) \sim ({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}). \end{aligned}$$

1.3 An application to risk minimization

We restate here the consequence of the main result in terms of minimization of the population risk, already stated as Corollary 8 in the main part.

Assume we are given a couple of input–output variables (XY) generated by a ground truth network with parameters \(({{\textbf {M}}}, {{\textbf {b}}})\):

$$\begin{aligned} Y = f_{{\textbf {M,b}}}(X). \end{aligned}$$

We can use Theorem 51 to show that the only way to bring the population risk to 0 is to find the ground truth parameters, modulo permutation and positive rescaling.

Indeed, let \(\Omega \subset {\mathbb {R}}^{n_K}\) be a domain that is contained in the support of X, and suppose \(L: {\mathbb {R}}^{n_0} \times {\mathbb {R}}^{n_0} \rightarrow {\mathbb {R}}_+\) is a loss function such that \(L(y,y') = 0 \Rightarrow y = y'\). Consider the population risk:

$$\begin{aligned} R({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}) = {\mathbb {E}}[L(f_{{\tilde{{{\textbf {M}}}}},{\tilde{{{\textbf {b}}}}}}(X), Y)]. \end{aligned}$$
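
In practice this population risk can only be estimated. The sketch below approximates it by Monte Carlo sampling, with a hypothetical ground-truth network, a squared-error loss and X uniform on \([-1,1]^3\); all of these choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

def net(weights, biases, x):
    # x has shape (input width, number of samples)
    for t, (W, b) in enumerate(zip(weights, biases)):
        x = W @ x + b[:, None]
        if t < len(weights) - 1:
            x = relu(x)
    return x

# Hypothetical ground truth (M, b) and a candidate parameterization.
widths = [3, 4, 2]
Ws = [rng.standard_normal((widths[t + 1], widths[t])) for t in range(2)]
bs = [rng.standard_normal(widths[t + 1]) for t in range(2)]
Wt = [W + 0.1 * rng.standard_normal(W.shape) for W in Ws]
bt = [b + 0.1 * rng.standard_normal(b.shape) for b in bs]

# Monte Carlo estimate of the risk, with X uniform on [-1, 1]^3 and a
# squared-error loss (which vanishes only when the two outputs coincide).
X = rng.uniform(-1.0, 1.0, (3, 100_000))
Y = net(Ws, bs, X)
risk = np.mean(np.sum((net(Wt, bt, X) - Y) ** 2, axis=0))
print(risk)
```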

We have the following result.

Corollary 52

Suppose there exists a list of sets of closed polyhedra \(\varvec{\Pi }\) admissible with respect to \(({{\textbf {M}}}, {{\textbf {b}}})\) such that \(({{\textbf {M}}}, {{\textbf {b}}},\Omega , \varvec{\Pi })\) satisfies the conditions \({{\textbf {P}}}\).

If \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) is also such that there exists a list of sets of closed polyhedra \(\varvec{{\tilde{\Pi }}}\) admissible with respect to \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) such that \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}},\Omega , \varvec{{\tilde{\Pi }}})\) satisfies the conditions \({{\textbf {P}}}\), and if \(({{\textbf {M}}}, {{\textbf {b}}}) \not \sim ({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\), then:

$$\begin{aligned} R({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}) > 0. \end{aligned}$$

1.4 Proof of Theorem 51

To prove Theorem 51, we can assume the parameterizations \(({{\textbf {M}}}, {{\textbf {b}}})\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) are normalized. Indeed, if they are not, by Proposition 42 there exist a normalized parameterization \(({\textbf {M'}}, {\textbf {b'}})\) equivalent to \(({{\textbf {M}}}, {{\textbf {b}}})\) and a normalized parameterization \(({\tilde{{{\textbf {M}}}}'}, {\tilde{{{\textbf {b}}}}'})\) equivalent to \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\). Note that we can apply Proposition 42 because \(M^{k}\) and \({\tilde{M}}^{k}\) are full row rank (condition \({{\textbf {P}}}.a)\)) for all \(k \in \llbracket 1, K-1 \rrbracket\), so their rows are always nonzero. We derive \(\varvec{\Pi }'\) from \(\varvec{\Pi }\) and \(\varvec{{\tilde{\Pi }}}'\) from \(\varvec{{\tilde{\Pi }}}\) as in Item 4 of Proposition 39. By Proposition 50, \(({\textbf {M'}}, {\textbf {b'}},\Omega , \varvec{\Pi }')\) and \(({\tilde{{{\textbf {M}}}}'}, {\tilde{{{\textbf {b}}}}'}, \Omega , \varvec{{\tilde{\Pi }}}')\) also satisfy the conditions \({{\textbf {P}}}\). By Corollary 40, \(f_{{\textbf {M'}}, {\textbf {b'}}} = f_{{{\textbf {M}}}, {{\textbf {b}}}}\) and \(f_{{\tilde{{{\textbf {M}}}}'}, {\tilde{{{\textbf {b}}}}}'} = f_{{\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}}\), so we have, for all \(x \in \Omega\):

$$\begin{aligned} f_{{\textbf {M'}}, {\textbf {b'}}}(x) = f_{{\tilde{{{\textbf {M}}}}'}, {\tilde{{{\textbf {b}}}}'}}(x). \end{aligned}$$

\(({\textbf {M'}}, {\textbf {b'}}, \Omega , \varvec{\Pi '})\) and \(({\tilde{{{\textbf {M}}}}'}, {\tilde{{{\textbf {b}}}}'}, \Omega , \varvec{{\tilde{\Pi }}'})\) satisfy the hypotheses of Theorem 51. If we are able to show that \(({\textbf {M'}}, {\textbf {b'}}) \sim ({\tilde{{{\textbf {M}}}}'}, {\tilde{{{\textbf {b}}}}'})\), then \(({{\textbf {M}}}, {{\textbf {b}}}) \sim ({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) follows immediately from the transitivity of the equivalence relation, proven in Proposition 38.

Thus in the proof \(({{\textbf {M}}}, {{\textbf {b}}})\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) will be assumed to be normalized.

To prove the theorem, we need the following fundamental lemma (already stated as Lemma 14 in the main text), which is proven in Appendix 3.

Lemma 53

Let \(l,m,n \in {\mathbb {N}}^*\). Suppose \(g,{\tilde{g}}: {\mathbb {R}}^m \rightarrow {\mathbb {R}}^n\) are continuous piecewise linear functions, \(\Omega \subset {\mathbb {R}}^l\) is a subset and let \(M, {\tilde{M}} \in {\mathbb {R}}^{m \times l}\), \(b, {\tilde{b}} \in {\mathbb {R}}^{m}\). Denote \(h: x \mapsto \sigma (Mx + b)\) and \({\tilde{h}}: x \mapsto \sigma ({\tilde{M}}x + {\tilde{b}})\). Assume \(\Pi\) and \({\tilde{\Pi }}\) are two sets of polyhedra admissible with respect to g and \({\tilde{g}}\) respectively as in Definition 21.

Suppose \((g,M,b,\Omega , \Pi )\) and \(({\tilde{g}}, {\tilde{M}}, {\tilde{b}}, \Omega , {\tilde{\Pi }})\) satisfy the conditions \({{\textbf {C}}}\), and for all \(i \in \llbracket 1, m \rrbracket\), \(\Vert M_{i,.} \Vert = \Vert {\tilde{M}}_{i,.} \Vert = 1\).

Suppose for all \(x \in \Omega\):

$$\begin{aligned} g \circ h (x) = {\tilde{g}} \circ {\tilde{h}}(x). \end{aligned}$$

Then, there exists a permutation \(\varphi \in {\mathfrak {S}}_m\), such that:

  • \({\tilde{M}} = P_{\varphi }M\);

  • \({\tilde{b}} = P_{\varphi }b\);

  • g and \({\tilde{g}} \circ P_{\varphi }\) coincide on \({h}(\Omega )\).

Proof of Theorem 51

We prove the theorem by induction on K.

Initialization. Assume here \(K=2\). We are going to apply Lemma 53. Since \(({{\textbf {M}}}, {{\textbf {b}}},\Omega ,\varvec{\Pi })\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}},\Omega , \varvec{{\tilde{\Pi }}})\) satisfy the conditions \({\textbf {{{\textbf {P}}}}}\), by definition, \((g_1, M^{1}, b^{1}, \Omega _{2}, \Pi _1)\) and \(({\tilde{g}}_1, {\tilde{M}}^{1}, {\tilde{b}}^{1}, \Omega _{2}, {\tilde{\Pi }}_1)\) satisfy the conditions \({{\textbf {C}}}\) (note that \({\tilde{\Omega }}_2 = \Omega _2 = \Omega\)). The networks are normalized, so we have, for all \(i \in \llbracket 1, n_1 \rrbracket\),

$$\begin{aligned} \Vert M_{i,.}^{1}\Vert = \Vert {\tilde{M}}_{i,.}^{1} \Vert = 1. \end{aligned}$$

By the assumptions of Theorem 51, for all \(x \in \Omega\),

$$\begin{aligned} g_1 \circ h_1 (x) = f_{{\textbf {M,b}}} (x) = f_{{\tilde{{\textbf {M}}}},{\tilde{{\textbf {b}}}}}(x) = {\tilde{g}}_1 \circ {\tilde{h}}_1(x). \end{aligned}$$

We can thus apply Lemma 53, which shows that there exists a permutation \(\varphi \in {\mathfrak {S}}_{n_1}\) such that

  • \({\tilde{M}}^1 = P_{\varphi }M^1\);

  • \({\tilde{b}}^1 = P_{\varphi }b^1\);

  • \(g_1\) and \({\tilde{g}}_1 \circ P_{\varphi }\) coincide on \({h_1}(\Omega )\).

Recall from Definition 30 that for all \(i \in \llbracket 1, n_1 \rrbracket\), we denote \(H^2_i = \{ x \in {\mathbb {R}}^{n_2}, \ M^1_{i,.} x + b^1_i = 0\}.\) Let \((v_1, \dots , v_{n_1})\) be the canonical basis of \({\mathbb {R}}^{n_1}\). Let us show that for all \(i \in \llbracket 1, n_1 \rrbracket\),

$$\begin{aligned} M^0 v_i = {\tilde{M}}^0 P_{\varphi } v_i. \end{aligned}$$

Let \(i \in \llbracket 1, n_1 \rrbracket\). By \({{\textbf {P}}}.b)\), \(H^2_i \cap \mathring{\Omega } \ne \emptyset\). Since \(M^1\) is full row rank by \({{\textbf {P}}}.a)\), none of the hyperplanes \(H^2_j\), with \(j \ne i\), is parallel to \(H^2_i\). As a consequence, the intersections \(H^2_i \cap H^2_j\) have Hausdorff dimension at most \(n_2 - 2\), so there exists \(x \in \mathring{\Omega } \cap H^2_i \backslash \left( \bigcup _{j\ne i} H^2_j \right)\), and \(\epsilon > 0\) such that \(B(x,\epsilon ) \subset \mathring{\Omega }\) and \(B(x,\epsilon ) \cap H^2_j = \emptyset\) for all \(j \ne i\). Let u be a unit vector such that \(M^1_{j,.}u = 0\) for all \(j \ne i\) and \(M^1_{i,.}u = \alpha > 0\) (this is possible again since \(M^1\) is full row rank).
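
The vector u used here can be computed explicitly by projecting the row \(M^1_{i,.}\) onto the orthogonal complement of the other rows; a minimal sketch, with a hypothetical full-row-rank matrix, is the following.

```python
import numpy as np

# Hypothetical full-row-rank matrix M1 (n1 x n2) and a row index i.
rng = np.random.default_rng(0)
n1, n2 = 3, 5
M1 = rng.standard_normal((n1, n2))
i = 1

# Orthonormal basis of the null space of the rows j != i.
others = np.delete(M1, i, axis=0)
_, _, Vt = np.linalg.svd(others)
null_basis = Vt[others.shape[0]:]            # valid because others has full row rank

# Project the row M1[i] onto that null space and normalize. Since M1 has
# full row rank, the projection is nonzero; by construction M1[j] @ u = 0
# for all j != i, while alpha = M1[i] @ u > 0.
proj = null_basis.T @ (null_basis @ M1[i])
u = proj / np.linalg.norm(proj)
alpha = M1[i] @ u
print(np.allclose(others @ u, 0.0), alpha > 0)
```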

For all \(j \in \llbracket 1, n_1 \rrbracket \backslash \{ i \}\), we have

$$\begin{aligned} \sigma (M^1_{j,.} (x + \epsilon u) + b^1_j) - \sigma (M^1_{j,.} x + b^1_j) = \sigma (M^1_{j,.}x + b^1_j) - \sigma (M^1_{j,.} x + b^1_j) = 0. \end{aligned}$$

At the same time, we have

$$\begin{aligned} \begin{aligned} \sigma (M^1_{i,.} (x + \epsilon u) + b^1_i) - \sigma (M^1_{i,.} x + b^1_i)&= M^1_{i,.} (x + \epsilon u) + b^1_i - \left( M^1_{i,.} x + b^1_i \right) \\&= \epsilon M^1_{i,.} u \\&= \epsilon \alpha . \end{aligned} \end{aligned}$$

Summarizing,

$$\begin{aligned} \begin{aligned} h_1(x + \epsilon u)-h_1(x)&= \sigma (M^1(x + \epsilon u) + b^1) - \sigma (M^1x + b^1) \\&= \epsilon \alpha v_i. \end{aligned} \end{aligned}$$

Let us denote \(y_2 = h_1 ( x + \epsilon u) \in h_1 (\Omega )\) and \(y_1 = h_1 (x) \in h_1 ( \Omega )\). We have shown \(y_2 - y_1 = \epsilon \alpha v_i\), and since \(g_1\) and \({\tilde{g}}_1 \circ P_{\varphi }\) coincide on \(h_1 ( \Omega )\), we have

$$\begin{aligned} \begin{aligned} g_1 ( y_2) - g_1 (y_1)&= {\tilde{g}}_1 \circ P_{\varphi } ( y_2) - {\tilde{g}}_1 \circ P_{\varphi }(y_1) \\ \Longleftrightarrow \ M^0(y_2 - y_1)&= {\tilde{M}}^0 P_{\varphi } (y_2 - y_1) \\ \Longleftrightarrow \ \epsilon \alpha M^0 v_i&= \epsilon \alpha {\tilde{M}}^0 P_{\varphi } v_i \\ \Longleftrightarrow \ M^0 v_i&= {\tilde{M}}^0 P_{\varphi } v_i. \end{aligned} \end{aligned}$$

Since this last equality holds for any \(i \in \llbracket 1, n_1 \rrbracket\), we conclude that

$$\begin{aligned} M^0 = {\tilde{M}}^0 P_{\varphi }, \end{aligned}$$

and using one last time that \(g_1\) and \({\tilde{g}}_1 \circ P_{\varphi }\) coincide on \(h_1(\Omega )\), we obtain

$$\begin{aligned} b^0 = {\tilde{b}}^0, \end{aligned}$$

i.e. we have shown

$$\begin{aligned}{\left\{ \begin{array}{ll} {\tilde{M}}^0 = M^0 P_{\varphi }^{-1}\\ {\tilde{b}}^0 = b^0. \end{array}\right. }\end{aligned}$$

Defining \(P_{\varphi _1} = P_{\varphi }\), \(P_{\varphi _0} = {{\,\textrm{Id}\,}}_{n_0}\) and \(P_{\varphi _2} = {{\,\textrm{Id}\,}}_{n_2}\), we can use Proposition 43 to conclude that

$$\begin{aligned} {\textbf {(M, b)}} \sim ({\tilde{{\textbf {M}}}}, {\tilde{{\textbf {b}}}}). \end{aligned}$$

Induction step. Let \(K \ge 3\) be an integer. Suppose Theorem 51 is true for all networks with \(K-1\) layers.

Consider two networks with parameters \(({{\textbf {M}}}, {{\textbf {b}}})\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\), with K layers and, for all \(k \in \llbracket 0, K \rrbracket\), same number \(n_k\) of neurons per layer. Let \(\varvec{\Pi }\) and \(\varvec{{\tilde{\Pi }}}\) be two lists of sets of closed polyhedra that are admissible with respect to \({\textbf {(M, b)}}\) and \(({\tilde{{\textbf {M}}}}, {\tilde{{\textbf {b}}}})\) respectively (Definition 34), and let \(\Omega \subset {\mathbb {R}}^{n_K}\) be such that \(({{\textbf {M}}}, {{\textbf {b}}}, \Omega , \varvec{\Pi })\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}, \Omega , \varvec{{\tilde{\Pi }}})\) satisfy the conditions \({{\textbf {P}}}\) and \(f_{{{\textbf {M}}}, {{\textbf {b}}}}\) and \(f_{{\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}}\) coincide on \(\Omega\).

Recall the functions \(h_k\) and \(g_k\) associated to \(({{\textbf {M}}}, {{\textbf {b}}})\), defined in Definition 24 and Definition 28 respectively, and the corresponding functions \({\tilde{h}}_k\) and \({\tilde{g}}_k\) associated to \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\).

We have two matrices \(M^{K-1}\) and \({\tilde{M}}^{K-1} \in {\mathbb {R}}^{n_{K-1} \times n_K}\), two vectors \(b^{K-1}\) and \({\tilde{b}}^{K-1} \in {\mathbb {R}}^{n_{K-1}}\), two functions \(g_{K-1}\) and \({\tilde{g}}_{K-1}:{\mathbb {R}}^{n_{K-1}} \rightarrow {\mathbb {R}}^{n_0}\), two sets \(\Pi _{K-1}\) and \({\tilde{\Pi }}_{K-1}\) such that:

  • \(\forall x \in \Omega\),    \(g_{K-1}\circ h_{K-1}(x) = g_K(x) = f_{{{\textbf {M}}},{{\textbf {b}}}} (x) = f_{{\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}} (x) = {\tilde{g}}_K(x) = {\tilde{g}}_{K-1} \circ {\tilde{h}}_{K-1}(x)\),

  • \(g_{K-1}\) and \({\tilde{g}}_{K-1}\) are continuous piecewise linear, and \(\Pi _{K-1}\) and \({\tilde{\Pi }}_{K-1}\) are admissible with respect to \(g_{K-1}\) and \({\tilde{g}}_{K-1}\) respectively,

  • \((g_{K-1},M^{K-1}, b^{K-1}, \Omega , \Pi _{K-1})\) and \(({\tilde{g}}_{K-1}, {\tilde{M}}^{K-1}, {\tilde{b}}^{K-1}, \Omega , {\tilde{\Pi }}_{K-1})\) satisfy the conditions \({{\textbf {C}}}\),

  • \(\forall i \in \llbracket 1, n_{K-1} \rrbracket\),    \(\Vert M^{K-1}_{i,.} \Vert = \Vert {\tilde{M}}^{K-1}_{i,.} \Vert = 1\).

The third point comes from the fact that the conditions \({{\textbf {P}}}\) hold for \(({{\textbf {M}}}, {{\textbf {b}}},\Omega ,\varvec{\Pi })\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}},\Omega , {\varvec{{\tilde{\Pi }}}})\), and the fourth point comes from the fact that \(({{\textbf {M}}},{{\textbf {b}}})\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\) are normalized.

Thus, the objects \(g_{K-1}, {\tilde{g}}_{K-1}, M^{K-1}, b^{K-1}, {\tilde{M}}^{K-1}, {\tilde{b}}^{K-1}, \Pi _{K-1}\) and \({\tilde{\Pi }}_{K-1}\) satisfy the hypotheses of Lemma 53 and hence there exists \(\varphi \in {\mathfrak {S}}_{n_{K-1}}\) such that

$$\begin{aligned} {\left\{ \begin{array}{ll} {\tilde{M}}^{K-1} = P_{\varphi } M^{K-1}, \\ {\tilde{b}}^{K-1} = P_{\varphi } b^{K-1}, \end{array}\right. } \end{aligned}$$
(34)

and \(g_{K-1}\) and \({\tilde{g}}_{K-1} \circ P_{\varphi }\) coincide on \(\Omega _{K-1}\).

Let us denote \({{\textbf {M}}}^{*} = (M^{0}, \dots , M^{K-3}, M^{K-2}P_{\varphi }^{-1})\). The functions \(g_{K-1} \circ P_{\varphi }^{-1}\) and \({\tilde{g}}_{K-1}\) are implemented by two networks with \(K-1\) layers, indexed from \(K-1\) down to 0, with parameters \(({{\textbf {M}}}^{*},{{\textbf {b}}}^{\le K-2})\) and \(({\tilde{{{\textbf {M}}}}}^{\le K-2},{\tilde{{{\textbf {b}}}}}^{\le K-2})\) respectively. The previous paragraph shows these functions coincide on \(P_{\varphi } \Omega _{K-1}\). Recalling the definition of \({\tilde{\Omega }}_{K-1}\) and since, by (34), \({\tilde{f}}_{K-1} = {\tilde{h}}_{K-1} = P_{\varphi } h_{K-1}\), we have

$$\begin{aligned} {\tilde{\Omega }}_{K-1} = {\tilde{f}}_{K-1}(\Omega ) = P_{\varphi } h_{K-1}(\Omega ) = P_{\varphi } \Omega _{K-1}, \end{aligned}$$

i.e. the functions \(g_{K-1} \circ P_{\varphi }^{-1} = f_{{{\textbf {M}}}^{*},{{\textbf {b}}}^{\le {{\textbf {K-2}}}}}\) and \({\tilde{g}}_{K-1}= f_{{\tilde{{{\textbf {M}}}}}^{\le {{\textbf {K-2}}}}, {\tilde{{{\textbf {b}}}}}^{\le {{\textbf {K-2}}}}}\) coincide on \({\tilde{\Omega }}_{K-1}\).

Since \(({{\textbf {M}}},{{\textbf {b}}},\Omega , \varvec{\Pi })\) and \(({\tilde{{{\textbf {M}}}}},{\tilde{{{\textbf {b}}}}}, \Omega , \varvec{{\tilde{\Pi }}})\) satisfy the conditions \({{\textbf {P}}}\), the tuples \((g_k, M^{k}, b^{k}, \Omega _{k+1}, \Pi _k)\) and \(({\tilde{g}}_k, {\tilde{M}}^{k}, {\tilde{b}}^{k}, {\tilde{\Omega }}_{k+1}, {\tilde{\Pi }}_k)\) satisfy the conditions \({{\textbf {C}}}\) for all \(k \in \llbracket 1, K-1 \rrbracket\), and in particular for all \(k \in \llbracket 1, K-2 \rrbracket\). Therefore \(({{\textbf {M}}}^{\le K-2},{{\textbf {b}}}^{\le K-2},\Omega _{K-1}, \varvec{\Pi }^{\le K-2})\) and \(({\tilde{{{\textbf {M}}}}}^{\le K-2},{\tilde{{{\textbf {b}}}}}^{\le K-2}, {\tilde{\Omega }}_{K-1}, \varvec{{\tilde{\Pi }}}^{\le K-2})\) satisfy the conditions \({{\textbf {P}}}\).

Let us verify that \(({{\textbf {M}}}^{*},{{\textbf {b}}}^{\le K-2}, {\tilde{\Omega }}_{K-1}, \varvec{\Pi }^{\le K-2})\) also satisfies the conditions \({{\textbf {P}}}\). Indeed, the only things that differ from \(({{\textbf {M}}}^{\le K-2},{{\textbf {b}}}^{\le K-2},\Omega _{K-1}, \varvec{\Pi }^{\le K-2})\) are \({\tilde{\Omega }}_{K-1}\) and the weights \(M^{*K-2}\) between the layer \(K-1\) and the layer \(K-2\). Writing \(M^{*K-2} = M^{K-2} P_{\varphi }^{-1}\), \(h^*_{K-2} = h_{K-2} \circ P_{\varphi }^{-1}\), \({\tilde{\Omega }}_{K-1} = P_{\varphi } \Omega _{K-1}\) and \(H^{*K-1}_i = P_{\varphi } H_i^{K-1}\), let us check that the conditions \({{\textbf {C}}}\) also hold for \((g_{K-2}, M^{*K-2}, b^{K-2}, {\tilde{\Omega }}_{K-1}, \Pi _{K-2})\).

Indeed \(P_{\varphi }^{-1}\) is invertible, so \(M^{*K-2}\) is full row rank and \({{\textbf {C}}}.a)\) holds.

Define \(h_{K-2}^{*lin}(x) = M^{*K-2}x + b^{K-2}\). Since \(h_{K-2}^{*lin} = h_{K-2}^{lin} \circ P_{\varphi }^{-1}\) and \(\mathring{{\tilde{\Omega }}}_{K-1} = P_{\varphi } \mathring{\Omega }_{K-1}\), we have, for all \(i \in \llbracket 1, n_{K-2} \rrbracket\),

$$\begin{aligned} E_i \cap h_{K-2}^{*lin}(\mathring{{\tilde{\Omega }}}_{K-1}) = E_i \cap h_{K-2}^{lin}(\mathring{\Omega }_{K-1}) \ne \emptyset , \end{aligned}$$

and \({{\textbf {C}}}.b)\) is satisfied.

Similarly, the observation \(h_{K-2}^*({\tilde{\Omega }}_{K-1}) = h_{K-2}(\Omega _{K-1})\) yields \({{\textbf {C}}}.c)\).

Finally, assume \(H^* \subset {\mathbb {R}}^{n_{K-1}}\) is an affine hyperplane. Let \(H = P_{\varphi }^{-1} H^*\). We have by hypothesis

$$\begin{aligned} H \cap \mathring{\Omega }_{K-1} \not \subset \bigcup _{D \in \Pi _{K-2}} \partial h_{K-2}^{-1} (D), \end{aligned}$$

thus

$$\begin{aligned} \begin{aligned} H^* \cap \mathring{{\tilde{\Omega }}}_{K-1}&= P_{\varphi } \left( H \cap \mathring{\Omega }_{K-1} \right) \\&\not \subset P_{\varphi } \bigcup _{D \in \Pi _{K-2}} \partial h_{K-2}^{-1} (D)\\&= \bigcup _{D \in \Pi _{K-2}} \partial (P_{\varphi } h_{K-2}^{-1} (D)). \end{aligned} \end{aligned}$$

For all \(D \in \Pi _{K-2}\) we have

$$\begin{aligned} \begin{aligned} P_{\varphi } h_{K-2}^{-1} (D)&= P_{\varphi }\{ y, \ h_{K-2}(y) \in D \}\\&= P_{\varphi }\{ P_{\varphi }^{-1} x, \ h_{K-2} \circ P_{\varphi }^{-1}(x) \in D \}\\&= \{ x, \ h_{K-2}^*(x) \in D \}\\&= h_{K-2}^{*-1} (D). \end{aligned} \end{aligned}$$

Therefore,

$$\begin{aligned} \begin{aligned} H^* \cap \mathring{{\tilde{\Omega }}}_{K-1}&\not \subset \bigcup _{D \in \Pi _{K-2}} \partial h_{K-2}^{*-1} (D), \end{aligned} \end{aligned}$$

which proves \({{\textbf {C}}}.d)\).

Since the rest stays unchanged, we can conclude.

The induction hypothesis can thus be applied to \(({{\textbf {M}}}^{*},{{\textbf {b}}}^{\le K-2}, {\tilde{\Omega }}_{K-1}, \varvec{\Pi }^{\le K-2})\) and \(({\tilde{{{\textbf {M}}}}}^{\le K-2},{\tilde{{{\textbf {b}}}}}^{\le K-2}, {\tilde{\Omega }}_{K-1}, \varvec{{\tilde{\Pi }}}^{\le K-2})\), to obtain:

$$\begin{aligned} ({{\textbf {M}}}^{*},{{\textbf {b}}}^{\le K-2}) \sim ({\tilde{{{\textbf {M}}}}}^{\le K-2},{\tilde{{{\textbf {b}}}}}^{\le K-2}). \end{aligned}$$

Since we also have

$$\begin{aligned}&\forall k \in \llbracket 1, K-3 \rrbracket , \ \forall i \in \llbracket 1, n_k \rrbracket , \qquad \Vert M^{*k}_{i,.}\Vert =\Vert M_{i,.}^{k}\Vert = 1 \quad \text {and} \quad \Vert {\tilde{M}}_{i,.}^{k}\Vert =1,\\&\forall i \in \llbracket 1, n_{K-2} \rrbracket , \quad \Vert M^{*K-2}_{i,.}\Vert = \Vert M_{i,.}^{K-2}P_{\varphi }^{-1}\Vert =\Vert M_{i,.}^{K-2}\Vert = 1 \quad \text {and} \quad \Vert {\tilde{M}}_{i,.}^{K-2}\Vert =1, \end{aligned}$$

Proposition 43 shows that there exists a family of permutations \((\varphi _0, \dots , \varphi _{K-1}) \in {\mathfrak {S}}_{n_0} \times \dots \times {\mathfrak {S}}_{n_{K-1}}\), with \(\varphi _0 = id_{\llbracket 1, n_0 \rrbracket }\) and \(\varphi _{K-1} = id_{\llbracket 1, n_{K-1} \rrbracket }\), such that:

$$\begin{aligned} \forall k \in \llbracket 0, K-3 \rrbracket , \quad {\left\{ \begin{array}{ll}{\tilde{M}}^{k}= P_{\varphi _{k}} M^{*k}P_{\varphi _{k+1}}^{-1} = P_{\varphi _{k}} M^{k}P_{\varphi _{k+1}}^{-1} \\ {\tilde{b}}^{k} = P_{\varphi _k} b^{k},\end{array}\right. } \end{aligned}$$
(35)

and:

$$\begin{aligned} {\left\{ \begin{array}{ll}{\tilde{M}}^{K-2}= P_{\varphi _{K-2}} M^{*K-2}P_{\varphi _{K-1}}^{-1} = P_{\varphi _{K-2}} (M^{K-2} P_{\varphi }^{-1})P_{\varphi _{K-1}}^{-1} = P_{\varphi _{K-2}} M^{K-2} P_{\varphi }^{-1}\\ {\tilde{b}}^{K-2} = P_{\varphi _{K-2}} b^{K-2}.\end{array}\right. } \end{aligned}$$
(36)

We can define \((\psi _0, \dots , \psi _K) \in {\mathfrak {S}}_{n_0} \times \dots \times {\mathfrak {S}}_{n_K}\) by:

  • \(\psi _0 = id_{\llbracket 1, n_0 \rrbracket }\), \(\psi _K = id_{\llbracket 1, n_K \rrbracket }\);

  • \(\forall k \in \llbracket 1, K-2 \rrbracket\), \(\psi _k = \varphi _k\);

  • \(\psi _{K-1} = \varphi\);

and using (35), (36) and (34) altogether, we then have, for all \(k \in \llbracket 0, K-1 \rrbracket\):

$$\begin{aligned} {\left\{ \begin{array}{ll}{\tilde{M}}^{k}= P_{\psi _{k}} M^{k}P_{\psi _{k+1}}^{-1} \\ {\tilde{b}}^{k} = P_{\psi _k} b^{k}.\end{array}\right. } \end{aligned}$$

It follows from Proposition 43 that \(({{\textbf {M}}},{{\textbf {b}}}) \sim ({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\). \(\square\)

1.5 Proof of Corollary 52

Corollary 52 is an immediate consequence of Theorem 51.

Since \(({{\textbf {M}}},{{\textbf {b}}}, \Omega , \varvec{\Pi })\) and \(({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}}, \Omega , \varvec{{\tilde{\Pi }}})\) satisfy the conditions \({{\textbf {P}}}\) and \(({{\textbf {M}}},{{\textbf {b}}}) \not \sim ({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})\), the contrapositive of Theorem 51 shows that there exists \(x \in \Omega\) such that \(f_{{{\textbf {M}}},{{\textbf {b}}}}(x) \ne f_{\tilde{{{\textbf {M}}}}, \tilde{{{\textbf {b}}}}}(x)\). The function \(f_{{{\textbf {M}}},{{\textbf {b}}}} - f_{\tilde{{{\textbf {M}}}}, \tilde{{{\textbf {b}}}}}\) is continuous, so there exists \(r > 0\) such that for all \(u \in B(x,r)\), \(f_{{{\textbf {M}}},{{\textbf {b}}}}(u) \ne f_{\tilde{{{\textbf {M}}}}, \tilde{{{\textbf {b}}}}}(u)\), and hence \(L(f_{{{\textbf {M}}},{{\textbf {b}}}}(u),f_{\tilde{{{\textbf {M}}}}, \tilde{{{\textbf {b}}}}}(u)) > 0\). Since \(\Omega\) is included in the support of X and \(x \in \Omega\), denoting \({\mathbb {P}}_X\) the law of X, we have \({\mathbb {P}}_X(B(x,r)) > 0\) and thus

$$\begin{aligned} \begin{aligned} R({\tilde{{{\textbf {M}}}}}, {\tilde{{{\textbf {b}}}}})&= {\mathbb {E}}[L(f_{{\tilde{{{\textbf {M}}}}},{\tilde{{{\textbf {b}}}}}}(X), f_{{{\textbf {M}}}, {{\textbf {b}}}}(X))] \\&\ge \int _{B(x,r)} L(f_{{{\textbf {M}}},{{\textbf {b}}}}(u), f_{\tilde{{{\textbf {M}}}}, \tilde{{{\textbf {b}}}}}(u)) \textrm{d} {\mathbb {P}}_X(u) \\&> 0. \end{aligned} \end{aligned}$$

Appendix 3: Proof of Lemma 53

In this section we prove Lemma 53.

Let \((g,M,b,\Omega , \Pi )\) and \(({\tilde{g}}, {\tilde{M}}, {\tilde{b}}, \Omega , {\tilde{\Pi }})\) be as in the lemma. In particular, we assume they satisfy the conditions \({{\textbf {C}}}\) all along Appendix 3.

We denote, for all \(x \in {\mathbb {R}}^l\):

$$\begin{aligned} f(x) = g\left( \sigma ( M x + b)\right) . \end{aligned}$$

Recall that, for all \(x \in {\mathbb {R}}^l\), \(h(x) = \sigma (Mx + b)\) and \({\tilde{h}}(x) = \sigma ({\tilde{M}} x + {\tilde{b}})\).

Recall that, as in Definition 30, we define for all \(i \in \llbracket 1, m \rrbracket\) the sets \(H_i = \{x \in {\mathbb {R}}^l, \ M_{i,.}x + b_i = 0 \}\) and \({\tilde{H}}_i = \{ x \in {\mathbb {R}}^l, \ {\tilde{M}}_{i,.} x + {\tilde{b}}_i = 0 \}\). By condition C.a), for all \(i \in \llbracket 1, m \rrbracket\), \(M_{i,.} \ne 0\) and \({\tilde{M}}_{i,.} \ne 0\) so \(H_i\) and \({\tilde{H}}_i\) are hyperplanes.

Recall that for all \(D \in \Pi\), we define \(V(D) \in {\mathbb {R}}^{n \times m}\) and \(c(D) \in {\mathbb {R}}^n\) as in Definition 45, and similarly for all \({\tilde{D}} \in {\tilde{\Pi }}\), we define \({\tilde{V}}({\tilde{D}}) \in {\mathbb {R}}^{n \times m}\) and \({\tilde{c}}({\tilde{D}}) \in {\mathbb {R}}^n\) associated to \({\tilde{g}}\).

We now define \(s: {\mathbb {R}}^{l} \rightarrow \{0,1 \}^{m}\) as follows:

$$\begin{aligned} \forall i \in \llbracket 1, m \rrbracket , \quad s_i(x):= {\left\{ \begin{array}{ll} 1 &{} \text {if }M_{i,.} x + b_i \ge 0 \\ 0 &{} \text {otherwise.}\end{array}\right. } \end{aligned}$$
(37)

We define similarly \({\tilde{s}}\) for \(({\tilde{M}}, {\tilde{b}})\). We thus have, for all \(i \in \llbracket 1, m \rrbracket\),

$$\begin{aligned} \sigma (M_{i,.}x + b_i) = s_i(x) (M_{i,.}x + b_i) \end{aligned}$$

and

$$\begin{aligned} \sigma ({\tilde{M}}_{i,.}x + {\tilde{b}}_i) = {\tilde{s}}_i(x) ({\tilde{M}}_{i,.}x + {\tilde{b}}_i). \end{aligned}$$
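
As a quick numerical check of this rewriting (with hypothetical M, b and x), the ReLU output equals the preactivation gated coordinatewise by the sign pattern s(x):

```python
import numpy as np

# Hypothetical M, b and a point x (illustrative values only).
rng = np.random.default_rng(3)
M = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
x = rng.standard_normal(3)

pre = M @ x + b
s = (pre >= 0).astype(float)      # the sign pattern s(x) of (37)

# The ReLU output is the preactivation gated coordinatewise by s(x).
print(np.allclose(np.maximum(pre, 0.0), s * pre))
```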

Let \(D \in \Pi\). For all \(y \in D\), we have, by definition,

$$\begin{aligned}g(y) = V(D) y + c(D),\end{aligned}$$

thus, for all \(x \in h^{-1}(D)\),

$$\begin{aligned} f(x)&= V(D)h(x)+ c(D) \\&= V(D) \sigma (Mx + b) + c(D) \\&= \sum _{k=1}^m V_{.,k}(D) s_k(x) (M_{k,.}x + b_k) +c(D). \end{aligned}$$
(38)

Similarly, for all \({\tilde{D}} \in {\tilde{\Pi }}\), for all \(x \in {\tilde{h}}^{-1}({\tilde{D}}) \cap \Omega\),

$$\begin{aligned} f(x) = \sum _{k=1}^m {\tilde{V}}_{.,k}({\tilde{D}}) {\tilde{s}}_k(x) ({\tilde{M}}_{k,.}x + {\tilde{b}}_k) +{\tilde{c}}({\tilde{D}}). \end{aligned}$$
(39)

Proposition 54

Let \(D \in \Pi\). For all \(i \in \llbracket 1, m \rrbracket\), for all \(x \in H_i \cap \overset{\circ }{\overbrace{h^{-1}(D)}} \cap \mathring{\Omega } \backslash \left( \bigcup _{k \ne i} H_k \right)\), f is not differentiable at the point x.

Proof

Let \(i \in \llbracket 1, m \rrbracket\) and suppose \(x \in H_i \cap \overset{\circ }{\overbrace{h^{-1}(D)}} \cap \mathring{\Omega } \backslash \left( \bigcup _{k \ne i} H_k \right)\). Let us consider the function \(t \mapsto f(x + t M_{i,.}^{ T})\). Since \(x \in H_i\) and \(\Vert M_{i,.} \Vert = 1\) by hypothesis,

$$\begin{aligned} M_{i,.}(x + t M_{i,.}^{ T}) + b_i = t M_{i,.}M_{i,.}^{ T} + M_{i,.}x + b_i = t \Vert M_{i,.}\Vert ^2 = t. \end{aligned}$$
(40)

Given the definition of s in (37), we thus have

$$\begin{aligned} s_i(x + t M^{ T}_{i,.}) = {\left\{ \begin{array}{ll} 1 &{} \text {if } t \ge 0 \\ 0 &{} \text {if } t < 0.\end{array}\right. } \end{aligned}$$

Since \(x \in \overset{\circ }{\overbrace{h^{-1}(D)}}\) which is an open set, for t small enough we have \(x + t M_{i,.}^{ T} \in \overset{\circ }{\overbrace{h^{-1}(D)}}\) and thus, using (38) and (40),

$$\begin{aligned} \begin{aligned} f(x + t M_{i,.}^{ T})&= \sum _{k =1}^m V_{.,k}(D) s_k(x + t M_{i,.}^{ T}) \left( M_{k,.}( x + t M_{i,.}^{ T}) + b_k \right) + c(D) \\&= {\left\{ \begin{array}{ll} \sum _{k \ne i} V_{.,k}(D) s_k(x + tM_{i,.}^{ T})\left( M_{k,.}(x + t M_{i,.}^{ T}) + b_k \right) \\ +c(D) + t V_{.,i}(D) &{} \text { if } t \ge 0 \\ \sum _{k \ne i} V_{.,k}(D) s_k(x + t M_{i,.}^{ T}) \left( M_{k,.}( x + t M_{i,.}^{ T}) + b_k \right) \\ + c(D) &{} \text { if } t < 0.\end{array}\right. } \end{aligned} \end{aligned}$$

Since x does not belong to any of the hyperplanes \(H_k\) for \(k \ne i\), which are closed, there exists \(\epsilon > 0\) such that for all \(t \in \ ] - \epsilon , \epsilon [\) and for all \(k \ne i\), \(x + tM^{ T}_{i,.} \notin H_k\). Therefore, for all \(t \in \ ] - \epsilon , \epsilon [\), for all \(k \in \llbracket 1, m \rrbracket \backslash \{i\}\), \(s_k(x + t M_{i,.}^{ T}) = s_k(x)\) and

$$\begin{aligned} f(x + t M_{i,.}^{ T}) = {\left\{ \begin{array}{ll} \sum _{k \ne i} V_{.,k}(D) s_k(x)( M_{k,.} (x + t M_{i,.}^{ T}) + b_k) + c(D) \\ + t V_{.,i}(D) &{} \text { if } t \ge 0 \\ \sum _{k \ne i} V_{.,k}(D) s_k(x) ( M_{k,.} (x + t M_{i,.}^{ T}) + b_k) + c(D) &{} \text { if } t < 0. \end{array}\right. } \end{aligned}$$

The right derivative of \(t \mapsto f(x + t M_{i,.}^{ T})\) at 0 is:

$$\begin{aligned} \sum _{k \ne i} V_{.,k}(D) s_k(x) M_{k,.}M_{i,.}^{ T} + V_{.,i}(D). \end{aligned}$$

The left derivative of \(t \mapsto f(x + t M_{i,.}^{ T})\) at 0 is:

$$\begin{aligned} \sum _{k \ne i} V_{.,k}(D) s_k(x) M_{k,.} M_{i,.}^{ T}. \end{aligned}$$

Since \(x \in H_i \cap h^{-1}(D) \cap \Omega\), we have \(h(x) \in E_i \cap D \cap h(\Omega )\) so the condition \({{\textbf {C}}}.c)\) implies that \(V_{.,i}(D) \ne 0\). We conclude that the left and right derivatives at x do not coincide and thus f is not differentiable at x. \(\square\)
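
The computation above shows that the jump between the two one-sided derivatives is the column \(V_{.,i}(D)\) (here with \(\Vert M_{i,.}\Vert = 1\)). The following sketch verifies this numerically on a hypothetical example in which g is globally linear, so that \(\Pi = \{ {\mathbb {R}}^m \}\) and \(V(D) = V\).

```python
import numpy as np

# Hypothetical two-layer example: g is globally linear, g(y) = V y + c.
rng = np.random.default_rng(4)
m, l, n = 4, 5, 3
M = rng.standard_normal((m, l))
M /= np.linalg.norm(M, axis=1, keepdims=True)    # unit-norm rows, as assumed
b = rng.standard_normal(m)
V = rng.standard_normal((n, m))
c = rng.standard_normal(n)
f = lambda x: V @ np.maximum(M @ x + b, 0.0) + c

i = 2
x0 = -b[i] * M[i]        # a point of H_i (generically not on the other H_k)
d = M[i]                 # the direction M_{i,.}^T used in the proof
t = 1e-6

right = (f(x0 + t * d) - f(x0)) / t
left = (f(x0) - f(x0 - t * d)) / t
# The jump between the one-sided derivatives is the column V_{.,i}.
print(np.allclose(right - left, V[:, i], atol=1e-6))
```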

Lemma 55

Let \(D \in \Pi\). For all \(x \in \overset{\circ }{\overbrace{h^{-1}(D)}} \ \backslash \left( \bigcup _{i=1}^m H_i \right)\), there exists \(r>0\) such that f is differentiable on \(B(x,r)\).

Proof

Consider \(x \in \overset{\circ }{\overbrace{h^{-1}(D)}} \ \backslash \left( \bigcup _{i=1}^m H_i \right)\). Since the hyperplanes \(H_i\) are closed, there exists a ball \(B(x, r) \subset \overset{\circ }{\overbrace{h^{-1}(D)}}\) such that for all \(i \in \llbracket 1, m \rrbracket\), \(B(x,r) \cap H_i = \emptyset\). As a consequence, for all \(y \in B(x,r)\), \(s(y) = s(x)\). Using (38) we get, for all \(y \in B(x,r)\),

$$\begin{aligned} f(y) = \sum _{i=1}^m V_{.,i}(D) s_i(x) \left( M_{i,.} y + b_i \right) + c(D). \end{aligned}$$

The right-hand side of this equality is affine in the variable y, so f is differentiable on \(B(x,r)\). \(\square\)

Lemma 56

Let \(\gamma : {\mathbb {R}}^l \rightarrow {\mathbb {R}}^m\) be a continuous piecewise linear function. Let \({\mathcal {P}}\) be a finite set of polyhedra of \({\mathbb {R}}^m\) such that \(\bigcup _{D \in {\mathcal {P}}} D = {\mathbb {R}}^m\). Let \(A_1, \dots , A_s\) be a set of hyperplanes such that \(\bigcup _{D \in {\mathcal {P}}} \partial \gamma ^{-1} (D) \subset \bigcup _{k=1}^s A_k\) (Proposition 23 shows the existence of such hyperplanes). Let H be an affine hyperplane and \(a \in {\mathbb {R}}^l, b \in {\mathbb {R}}\) such that \(H = \{ x \in {\mathbb {R}}^l, a^Tx + b = 0 \}\). Denote \(I = \{ k \in \llbracket 1, s \rrbracket , A_k = H \}\). Let \(x \in H\) be such that for all \(k \in \llbracket 1, s \rrbracket \backslash I\), \(x \notin A_k\). Then there exist \(r>0\) and \(D_-, D_+ \in {\mathcal {P}}\) (not necessarily distinct) such that

$$\begin{aligned} B(x,r) \cap \{ y \in {\mathbb {R}}^l, a^T y + b < 0 \} \quad \subset \quad \gamma ^{-1}(D_-) \\ B(x,r) \cap \{ y \in {\mathbb {R}}^l, a^T y + b > 0 \} \quad \subset \quad \gamma ^{-1}(D_+). \end{aligned}$$

Proof

Let \(r > 0\) such that

$$\begin{aligned} B(x, r) \cap \left( \bigcup _{k \notin I} A_k \right) = \emptyset . \end{aligned}$$

\(B(x,r) \backslash H\) has two connected components: \(B_- = B(x,r) \cap \{ y \in {\mathbb {R}}^l, a^Ty + b <0 \}\) and \(B_+ = B(x,r) \cap \{ y \in {\mathbb {R}}^l, a^Ty + b > 0 \}\). The set \(B_-\) (resp. \(B_+\)) is convex as an intersection of two convex sets.

Since \(\bigcup _{D \in {\mathcal {P}}} D = {\mathbb {R}}^m\), there exists \(D_- \in {\mathcal {P}}\) such that \(\gamma ^{-1}(D_-) \cap B_- \ne \emptyset\). Let us show that

$$\begin{aligned} B_- \ \subset \ \gamma ^{-1}(D_-). \end{aligned}$$

Indeed, \(B_- \cap \left( \bigcup _{k \notin I} A_k \right) = \emptyset\) and \(B_- \cap H = \emptyset\) so \(B_- \cap \left( \bigcup _{k \in I} A_k \right) = \emptyset\), therefore we have

$$\begin{aligned} B_- \cap \left( \bigcup _{D \in {\mathcal {P}}} \partial \gamma ^{-1}(D) \right) \ \subset \ B_- \cap \left( \bigcup _{k=1}^s A_k \right) = \emptyset . \end{aligned}$$

In particular, \(B_- \cap \partial \gamma ^{-1}(D_-) = \emptyset\). Let \(Y = \gamma ^{-1}(D_-) \cap B_-\). Let us denote by \(\partial _{B_-} Y\) the topological boundary of Y with respect to the topology of \(B_-\). Let us show the following inclusion:

$$\begin{aligned} \partial _{B_-} Y \ \subset \ \partial \gamma ^{-1}(D_-) \cap B_-. \end{aligned}$$

Indeed, let \(y \in \partial _{B_-} Y\). By definition, there exist two sequences \((u_n)\) and \((v_n)\) such that \(u_n \in Y\), \(v_n \in B_- \backslash Y\), and both \(u_n\) and \(v_n\) tend to y. In particular, \(u_n \in \gamma ^{-1} (D_-)\) and \(v_n \in {\mathbb {R}}^l \backslash \gamma ^{-1}(D_-)\), so \(y \in \partial \gamma ^{-1}(D_-)\). Since \(y \in B_-\), we have \(y \in \partial \gamma ^{-1}(D_-) \cap B_-\).

Combined with \(B_- \cap \partial \gamma ^{-1}(D_-) = \emptyset\), this shows \(\partial _{B_-} Y = \emptyset\), and as a consequence Y is open and closed in \(B_-\). Since \(B_-\) is connected and Y is not empty, we conclude that \(Y = B_-\), i.e. \(B_- \ \subset \ \gamma ^{-1}(D_-)\).

We show similarly that there exists \(D_+ \in {\mathcal {P}}\) such that \(B_+ \subset \gamma ^{-1}(D_+)\). \(\square\)

Proposition 57

There exists a bijection \(\varphi \in {\mathfrak {S}}_m\) such that for all \(i \in \llbracket 1, m \rrbracket\), \({\tilde{H}}_i = H_{\varphi ^{-1}(i)}\).

Proof

We denote by X the set of all points of \(\mathring{\Omega }\) at which f is not differentiable. We denote by \({\mathcal {G}}\) the set of all hyperplanes of \({\mathbb {R}}^l\). We denote \({\mathcal {H}} = \{ H \in {\mathcal {G}}, \ H \cap \mathring{\Omega } \ne \emptyset \text { and } H \cap \mathring{\Omega } \subset {\overline{X}} \}\). We want to show \({\mathcal {H}} = \{ H_i, i\in \llbracket 1, m \rrbracket \}\).

Indeed, once this is established, since \({\mathcal {H}}\) depends only on \(\Omega\) and f, we also have \({\mathcal {H}} = \{ {\tilde{H}}_i, i \in \llbracket 1, m \rrbracket \}\), and thus \(\{ H_i, i \in \llbracket 1, m \rrbracket \} = \{ {\tilde{H}}_i, i \in \llbracket 1, m \rrbracket \}\). Since, using C.a), for all ij, \(i \ne j\), we have \(H_i \ne H_j\) and \({\tilde{H}}_i \ne {\tilde{H}}_j\), we can conclude that there exists a permutation \(\varphi \in {\mathfrak {S}}_m\) such that, for all \(i \in \llbracket 1, m \rrbracket\), \({\tilde{H}}_i = H_{\varphi ^{-1}(i)}\).

Let us show \({\mathcal {H}} \subset \{ H_i, i\in \llbracket 1, m \rrbracket \}\).

To begin, let us show that \({\overline{X}} \cap \mathring{\Omega } \ \subset \ \bigcup _{D \in \Pi } \partial h^{-1}(D) \cup \bigcup _{i=1}^m H_i\). Let \(x \in {\overline{X}} \cap \mathring{\Omega }\). Let \(D \in \Pi\) such that \(h(x) \in D\). Since \(x \in {\overline{X}}\), there does not exist any \(r>0\) such that f is differentiable on B(xr). The contrapositive of Lemma 55 shows that \(x \notin \overset{\circ }{\overbrace{h^{-1}(D)}} \backslash \left( \bigcup _{i=1}^m H_i \right)\), so either \(x \in \bigcup _{i=1}^m H_i\) or \(x \notin \overset{\circ }{\overbrace{h^{-1}(D)}}\). In the latter case, since \(x \in h^{-1}(D)\) by definition of D, we have \(x \in h^{-1}(D) \backslash \overset{\circ }{\overbrace{h^{-1}(D)}} \subset \partial h^{-1}(D)\).

This shows:

$$\begin{aligned} {\overline{X}} \cap \mathring{\Omega } \quad \subset \quad \bigcup _{D \in \Pi } \partial h^{-1}(D) \cup \bigcup _{i=1}^m H_i. \end{aligned}$$
(41)

Let \(H \in {\mathcal {H}}\). We are going to show that there exists \(i \in \llbracket 1, m \rrbracket\) such that \(H = H_i\).

We know by condition C.d) that \(H \cap \mathring{\Omega } \not \subset \bigcup _{D \in \Pi } \partial h^{-1}(D)\). Let \(x \in (H \cap \mathring{\Omega } )\backslash \left( \bigcup _{D \in \Pi } \partial h^{-1}(D) \right)\). The set \(\bigcup _{D \in \Pi } \partial h^{-1}(D)\) is closed, so there exists a ball

$$\begin{aligned} B(x,r) \subset \mathring{\Omega } \backslash \left( \bigcup _{D \in \Pi } \partial h^{-1}(D) \right) . \end{aligned}$$
(42)

By definition of \({\mathcal {H}}\),

$$\begin{aligned} H \cap \mathring{\Omega } \ \subset \ {\overline{X}} \cap \mathring{\Omega }, \end{aligned}$$

so using the fact that \(B(x,r) \subset \mathring{\Omega }\) we have:

$$\begin{aligned} B(x,r) \cap H \ = \ B(x,r) \cap H \cap \mathring{\Omega } \ \subset \ B(x,r) \cap {\overline{X}} \cap \mathring{\Omega }. \end{aligned}$$

Thus, using (41),

$$\begin{aligned} \begin{aligned} B(x,r) \cap H \ {}&\subset \ B(x,r) \cap {\overline{X}} \cap \mathring{\Omega } \\&\subset \ B(x,r) \cap \left( \bigcup _{D \in \Pi } \partial h^{-1}(D) \cup \bigcup _{i=1}^m H_i \right) \\&= \ \left( B(x,r) \cap \bigcup _{D \in \Pi } \partial h^{-1}(D) \right) \cup \left( B(x,r) \cap \bigcup _{i=1}^m H_i \right) , \end{aligned} \end{aligned}$$

and since by (42) the first set of the last equality is empty, we have

$$\begin{aligned} \begin{aligned} B(x,r) \cap H \ \subset \ B(x,r) \cap \bigcup _{i=1}^m H_i. \end{aligned} \end{aligned}$$

Therefore,

$$\begin{aligned} \begin{aligned} B(x,r) \cap H&= \left( B(x,r) \cap H \right) \cap \left( B(x,r)\cap \bigcup _{i = 1}^m H_i\right) \\&= B(x,r) \cap H \cap \bigcup _{i = 1}^m H_i \\&= B(x,r) \cap \bigcup _{i = 1}^m \left( H \cap H_i \right) . \end{aligned} \end{aligned}$$

Assume, by contradiction, that for all \(i \in \llbracket 1, m \rrbracket\) we have \(H \ne H_i\). Then \(H \cap H_i\) is an affine space of dimension at most \(l-2\), so it has Hausdorff dimension at most \(l-2\). A finite union of sets of Hausdorff dimension at most \(l-2\) still has Hausdorff dimension at most \(l-2\). Thus, \(B(x,r) \cap H = B(x,r) \cap \bigcup _{i = 1}^m \left( H \cap H_i \right)\) has Hausdorff dimension at most \(l-2\), which is absurd since \(x \in H\) implies that \(B(x,r) \cap H\) has Hausdorff dimension \(l-1\). Hence there exists \(i \in \llbracket 1, m \rrbracket\) such that \(H = H_i\).

We have shown

$$\begin{aligned} {\mathcal {H}} \subset \{ H_i, i\in \llbracket 1, m \rrbracket \}. \end{aligned}$$
(43)

Let us show \(\{ H_i, i\in \llbracket 1, m \rrbracket \} \subset {\mathcal {H}}\).

Let \(i \in \llbracket 1, m \rrbracket\). Let us prove \(H_i \in {\mathcal {H}}\).

First, by condition C.b) we know that \(E_i \cap h^{lin}(\mathring{\Omega }) \ne \emptyset\), so there exists \(x \in \mathring{\Omega }\) such that \(h^{lin}(x) \in E_i\). Since \(h^{lin}(x) = Mx +b\) and \(E_i\) is the space of vectors whose \(i^{\text {th}}\) coordinate is 0, this is equivalent to

$$\begin{aligned} M_{i,.}x + b_i = 0, \end{aligned}$$

or, in other words, \(x \in H_i\). This proves that \(H_i \cap \mathring{\Omega } \ne \emptyset\). We still need to prove \(H_i \cap \mathring{\Omega } \subset {\overline{X}}\).

Let \(x \in H_i \cap \mathring{\Omega }\). Let us prove \(x \in {\overline{X}}\).

Since M is full row rank, the row vectors \(M_{1,.}, \dots , M_{m,.}\) are linearly independent, and thus for all \(k \in \llbracket 1, m \rrbracket \backslash \{ i \}\), \(H_k \cap H_i\) has Hausdorff dimension at most \(l-2\).

Proposition 23 shows that \(\bigcup _{D \in \Pi } \partial h^{-1}(D)\) is contained in a finite union of hyperplanes \(\bigcup _{k=1}^s A_k\). Let \(I = \{ k \in \llbracket 1, s \rrbracket , \ A_k = H_i \}\). For all \(k \in \llbracket 1, s \rrbracket \backslash I\), \(A_k \cap H_i\) is either empty or an intersection of two non-parallel hyperplanes; in both cases it is an affine space of dimension at most \(l-2\).

Thus,

$$\begin{aligned} H_i \cap \left( (\bigcup _{k \ne i} H_k ) \cup (\bigcup _{k \notin I} A_k) \right) \end{aligned}$$

has Hausdorff dimension strictly smaller than \(l-1\), so for any \(r>0\) there exists

$$\begin{aligned} y \in B(x,r) \cap H_i \cap \mathring{\Omega } \backslash \left( (\bigcup _{k \ne i} H_k ) \cup (\bigcup _{k \notin I} A_k) \right) . \end{aligned}$$
(44)

In the rest of the proof, we show that such a y is an element of X. Once this is established, since it is true for all \(r>0\), we conclude that \(x \in {\overline{X}}\) and therefore \(H_i \in {\mathcal {H}}\).

If there exists \(D \in \Pi\) such that \(y \in \overset{\circ }{\overbrace{h^{-1}(D)}}\), then

$$\begin{aligned} y \in H_i \cap \overset{\circ }{\overbrace{h^{-1}(D)}} \cap \mathring{\Omega } \backslash \left( \bigcup _{k \ne i} H_k \right) \end{aligned}$$

therefore we can use Proposition 54 to conclude that f is not differentiable at y.

Otherwise we can use Lemma 56 to find \(R_1>0\), \(D_-\) and \(D_+ \in \Pi\) such that

$$\begin{aligned} B(y,R_1) \cap \{ z \in {\mathbb {R}}^l, M_{i,.} z + b_i < 0 \} \quad \subset \quad h^{-1}(D_-) \\ B(y,R_1) \cap \{ z \in {\mathbb {R}}^l, M_{i,.} z + b_i > 0 \} \quad \subset \quad h^{-1}(D_+). \end{aligned}$$

Since for all \(j \ne i\), \(y \notin H_j\) and since these hyperplanes are closed, there exists \(R_2 > 0\) such that for all \(j \ne i\), \(B(y,R_2) \cap H_j = \emptyset\). Let \(R = \min (R_1,R_2)\) and denote \(B_- = B(y,R) \cap \{ z \in {\mathbb {R}}^l, M_{i,.} z + b_i < 0 \}\) and \(B_+ = B(y,R) \cap \{ z \in {\mathbb {R}}^l, M_{i,.} z + b_i > 0 \}\).

For all \(z \in B_-\), using (38) with the fact that \(s_i(z) = 0\) and \(s_k(z) = s_k(y)\) for all \(k \ne i\), we have

$$\begin{aligned} f(z) = \sum _{k \ne i} V_{.,k}(D_-) s_k(y) (M_{k,.}z + b_k) +c(D_-). \end{aligned}$$
(45)

For all \(z \in B_+\), using this time that \(s_i(z) = 1\), we have

$$\begin{aligned} f(z) = \sum _{k \ne i} V_{.,k}(D_+) s_k(y) (M_{k,.}z + b_k) +c(D_+) + V_{.,i}(D_+) (M_{i,.}z + b_i). \end{aligned}$$
(46)

If f was differentiable at y, we would derive from (45) the expression of the Jacobian matrix

$$\begin{aligned} J_f(y) = \sum _{k \ne i} V_{.,k}(D_-) s_k(y) M_{k,.},\end{aligned}$$
(47)

but we would also derive from (46) the expression

$$\begin{aligned} J_f(y) = \sum _{k \ne i} V_{.,k}(D_+) s_k(y) M_{k,.} + V_{.,i}(D_+) M_{i,.}, \end{aligned}$$
(48)

hence, subtracting (47) from (48), we would find

$$\begin{aligned} \sum _{k \ne i} (V_{.,k}(D_+) - V_{.,k}(D_-)) s_k(y) M_{k,.} + V_{.,i}(D_+) M_{i,.} = 0. \end{aligned}$$

Since M is full row rank, this would imply that \(V_{.,i}(D_+) = 0\).

However, since \(h^{-1}(D_+)\) is closed and contains \(B_+\), we have \(y \in \overline{B_+} \subset h^{-1}(D_+)\). Recalling (44), we thus have

$$\begin{aligned} y \in H_i \cap h^{-1}(D_+) \cap \mathring{\Omega }, \end{aligned}$$

thus

$$\begin{aligned} h(y) \in E_i \cap D_+ \cap h(\mathring{\Omega }), \end{aligned}$$

which shows the latter intersection is not empty. By assumption C.c) this implies that \(V_{.,i}(D_+) \ne 0\), which is a contradiction. Therefore f is not differentiable at y.

As a conclusion, we have shown that for all \(r>0\), there exists \(y \in B(x,r) \cap \mathring{\Omega }\) such that f is not differentiable at y. In other words, \(x \in {\overline{X}}\).

Since x is arbitrary in \(H_i \cap \mathring{\Omega }\), we have shown that for all \(i \in \llbracket 1, m \rrbracket\),

$$\begin{aligned} H_i \cap \mathring{\Omega } \ \subset \ {\overline{X}}, \end{aligned}$$

i.e., since we have already shown that \(H_i \cap \mathring{\Omega } \ne \emptyset\),

$$\begin{aligned} H_i \in {\mathcal {H}}. \end{aligned}$$

Finally \(\{ H_i, i\in \llbracket 1, m \rrbracket \} \subset {\mathcal {H}}\), and, using (43),

$$\begin{aligned} {\mathcal {H}} = \{ H_i, i\in \llbracket 1, m \rrbracket \}. \end{aligned}$$

\(\square\)
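
The proof has an operational reading: inside \(\mathring{\Omega }\), the hyperplanes \(H_i\) are exactly the hyperplanes that are (up to closure) filled by points where f is not differentiable. The following sketch is only an illustration under assumed toy dimensions, random weights and arbitrary numerical thresholds; it detects kinks of f along a random segment by comparing one-sided directional derivatives and checks that every detected kink lies on some \(H_i\).

```python
import numpy as np

rng = np.random.default_rng(1)
l, m, out = 2, 3, 1                              # toy dimensions (assumptions)
M = rng.standard_normal((m, l))
M /= np.linalg.norm(M, axis=1, keepdims=True)    # unit-norm rows, as in the normalization
b = rng.standard_normal(m)
V = rng.standard_normal((out, m))
c = rng.standard_normal(out)

def f(y):
    # Toy stand-in for f on a single region: V sigma(My + b) + c
    return V @ np.maximum(M @ y + b, 0.0) + c

def one_sided_slopes(p, d, t, eps=1e-5):
    # Directional derivatives of t -> f(p + t d) from the left and from the right
    right = (f(p + (t + eps) * d) - f(p + t * d)) / eps
    left  = (f(p + t * d) - f(p + (t - eps) * d)) / eps
    return left, right

# Scan a random segment; wherever the one-sided slopes disagree, f has a kink,
# and that point should lie (numerically) on one of the hyperplanes M_{i,.}x + b_i = 0.
p, d = rng.standard_normal(l), rng.standard_normal(l)
for t in np.linspace(-5, 5, 20001):
    left, right = one_sided_slopes(p, d, t)
    if np.max(np.abs(left - right)) > 1e-3:
        x = p + t * d
        dists = np.abs(M @ x + b)                # distances to the hyperplanes H_i
        assert dists.min() < 1e-3, "kink found away from every H_i?"
print("every detected kink lies on some hyperplane H_i, as in Proposition 57")
```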

Proposition 58

For all \(i \in \llbracket 1, m \rrbracket\), there exists \(\epsilon _{\varphi ^{-1}(i)} \in \{ -1, 1 \}\) such that

$$\begin{aligned} {\tilde{M}}_{i,.} = \epsilon _{\varphi ^{-1}(i)} M_{{\varphi ^{-1}(i)},.} \qquad \text {and} \qquad {\tilde{b}}_i = \epsilon _{\varphi ^{-1}(i)} b_{\varphi ^{-1}(i)}. \end{aligned}$$

Proof

Let \(i \in \llbracket 1, m \rrbracket\). We know that \({\tilde{H}}_i = H_{\varphi ^{-1}(i)}\), so the equations \({\tilde{M}}_{i,.} x + {\tilde{b}}_i = 0\) and \(M_{{\varphi ^{-1}(i)},.} x + b_{\varphi ^{-1}(i)} = 0\) define the same hyperplane. This is only possible if the two sets of parameters are proportional, with a nonzero factor: there exists \(\epsilon _{\varphi ^{-1}(i)} \in {\mathbb {R}}^*\) such that \({\tilde{M}}_{i,.} = \epsilon _{\varphi ^{-1}(i)} M_{\varphi ^{-1}(i),.}\) and \({\tilde{b}}_i = \epsilon _{\varphi ^{-1}(i)} b_{\varphi ^{-1}(i)}\). But since \(\Vert {\tilde{M}}_{i,.}\Vert = \Vert M_{{\varphi ^{-1}(i)},.}\Vert =1\) by hypothesis, we necessarily have \(\epsilon _{\varphi ^{-1}(i)} \in \{-1,1\}\). \(\square\)
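
For a concrete (and entirely hypothetical) instance of this normalization argument: if two equations \(a^Tx+b=0\) describe the same hyperplane and both normal vectors are rescaled to unit norm, the two parameter vectors can only differ by a sign, which is the \(\epsilon _{\varphi ^{-1}(i)} \in \{-1,1\}\) of Proposition 58.

```python
import numpy as np

a = np.array([3.0, 4.0]); b = 2.0            # an arbitrary hyperplane a^T x + b = 0 (assumption)
lam = -2.5                                   # any nonzero rescaling describes the same hyperplane
a2, b2 = lam * a, lam * b

# Normalize both equations so that the normal vector has unit norm, as assumed in the paper
a_n,  b_n  = a  / np.linalg.norm(a),  b  / np.linalg.norm(a)
a2_n, b2_n = a2 / np.linalg.norm(a2), b2 / np.linalg.norm(a2)

# After normalization, the two parameterizations agree up to a sign eps in {-1, +1}
eps = np.sign(a_n @ a2_n)
assert np.allclose(a2_n, eps * a_n) and np.isclose(b2_n, eps * b_n)
print("eps =", eps)                          # here eps = -1, matching Proposition 58
```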

Proposition 59

For all \(i \in \llbracket 1, m \rrbracket\),

  • \({\tilde{M}}_{i,.} = M_{{\varphi ^{-1}(i)},.}\);

  • \({\tilde{b}}_i = b_{\varphi ^{-1}(i)}\).

Proof

By Proposition 58, we know that there exists \((\epsilon _i)_{1 \le i \le m} \in \{ -1, 1 \}^{m}\) such that for all \(i \in \llbracket 1, m \rrbracket\),

$$\begin{aligned} {\tilde{M}}_{i,.} = \epsilon _{\varphi ^{-1}(i)} M_{{\varphi ^{-1}(i)},.} \qquad \text { and } \qquad {\tilde{b}}_i = \epsilon _{\varphi ^{-1}(i)} b_{\varphi ^{-1}(i)}. \end{aligned}$$
(49)

We need to prove that for all \(i \in \llbracket 1, m \rrbracket\), \(\epsilon _{\varphi ^{-1}(i)} = 1\).

Let \(i \in \llbracket 1, m \rrbracket\).

Applying Proposition 23 to h and \(\Pi\), we see that \(\bigcup _{D \in \Pi } \partial h^{-1}(D)\) is contained in a finite union of hyperplanes \(\bigcup _{k=1}^s A_k\). Applying it to \({\tilde{h}}\) and \({\tilde{\Pi }}\), we see similarly that \(\bigcup _{{\tilde{D}} \in {\tilde{\Pi }}} \partial {\tilde{h}}^{-1}({\tilde{D}})\) is contained in a finite union of hyperplanes \(\bigcup _{k=1}^r B_k\).

Let \(I = \{ k \in \llbracket 1, s \rrbracket , \ A_k = H_i \}\) and \(J = \{ k \in \llbracket 1, r \rrbracket , \ B_k = H_i \}\). For all \(k \in \llbracket 1, s \rrbracket \backslash I\), since \(A_k \ne H_i\), \(A_k \cap H_i\) is either empty or an intersection of two non-parallel hyperplanes; in both cases it is an affine space of dimension at most \(l-2\). The same applies to \(B_k \cap H_i\) for all \(k \in \llbracket 1, r \rrbracket \backslash J\). For all \(j \ne i\), \(H_j \ne H_i\), so \(H_j \cap H_i\) is also an affine space of dimension at most \(l-2\). Since \(H_i \cap \mathring{\Omega }\) is nonempty by \({{\textbf {C}}}.b)\), we can thus find a vector

$$\begin{aligned} x \quad \in \quad \mathring{\Omega } \ \cap \ H_i \ \backslash \ \left( (\bigcup _{k \notin I} A_k ) \cup (\bigcup _{k \notin J } B_k ) \cup ( \bigcup _{j \ne i} H_j ) \right) . \end{aligned}$$

Applying Lemma 56 with \(\Pi\), h, \(H_i\) and \((M_{i,.}, b_i)\), we find \(r_1 > 0\), \(D_-\) and \(D_+ \in \Pi\) such that

$$\begin{aligned} B(x,r_1) \cap \{ y \in {\mathbb {R}}^l, M_{i,.} y + b_i < 0 \} \quad \subset \quad h^{-1}(D_-) \nonumber \\ B(x,r_1) \cap \{ y \in {\mathbb {R}}^l, M_{i,.} y + b_i > 0 \} \quad \subset \quad h^{-1}(D_+). \end{aligned}$$
(50)

Applying the same lemma with \({\tilde{\Pi }}\), \({\tilde{h}}\), \(H_i\) and \((M_{i,.}, b_i)\) we find \(r_2 > 0\), \({\tilde{D}}_-\) and \({\tilde{D}}_+ \in {\tilde{\Pi }}\) such that

$$\begin{aligned} B(x,r_2) \cap \{ y \in {\mathbb {R}}^l, M_{i,.} y + b_i < 0 \} \quad \subset \quad {\tilde{h}}^{-1}({\tilde{D}}_-) \nonumber \\ B(x,r_2) \cap \{ y \in {\mathbb {R}}^l, M_{i,.} y + b_i > 0 \} \quad \subset \quad {\tilde{h}}^{-1}({\tilde{D}}_+). \end{aligned}$$
(51)

Since the hyperplanes \(H_j\) are closed, we can also find \(r_3 > 0\) such that for all \(j \ne i\), \(B(x,r_3) \cap H_j = \emptyset\). Taking \(r = \min (r_1,r_2,r_3)\) and denoting \(B_+ = B(x,r) \cap \{ y \in {\mathbb {R}}^l, M_{i,.} y + b_i > 0 \}\), we derive from (50) and (51) that

$$\begin{aligned} B_+ \quad \subset \quad h^{-1} (D_+) \cap {\tilde{h}}^{-1} ( {\tilde{D}}_+). \end{aligned}$$

Since \(r \le r_3\), we have \(B_+ \cap \left( \bigcup _{j \ne i} H_j \right) = \emptyset\), and by definition \(B_+ \cap \{y \in {\mathbb {R}}^l, M_{i,.}y + b_i = 0 \} = \emptyset\), so \(B_+ \cap H_i = \emptyset\). Hence \(B_+ \cap \left( \bigcup _{j =1 }^m H_j \right) = \emptyset\), so for all \(j \in \llbracket 1, m \rrbracket\), there exists \(\delta _j \in \{0,1\}\) such that for all \(y \in B_+\), \(s_j(y) = \delta _j\). Since \(\bigcup _{j=1}^m {\tilde{H}}_j = \bigcup _{j=1}^m H_j\), we similarly have \(B_+ \cap \bigcup _{j=1}^m {\tilde{H}}_j = \emptyset\), so for all \(j \in \llbracket 1, m \rrbracket\), there exists \({\tilde{\delta }}_j \in \{0,1\}\) such that for all \(y \in B_+\), \({\tilde{s}}_j(y) = {\tilde{\delta }}_j\).

For all \(y \in B_+\), we thus have, using (38),

$$\begin{aligned} \sum _{j=1}^m V_{.,j}(D_+) \delta _j \left( M_{j,.}y + b_j \right) + c(D_+) = \sum _{j=1}^m {\tilde{V}}_{.,j}({\tilde{D}}_+) {\tilde{\delta }}_j \left( {\tilde{M}}_{j,.}y + {\tilde{b}}_j \right) + {\tilde{c}}({\tilde{D}}_+). \end{aligned}$$

\(B_+\) is a nonempty open set, so identifying the linear parts of the two affine expressions yields

$$\begin{aligned} \sum _{j=1}^m V_{.,j}(D_+) \delta _j M_{j,.}&= \sum _{j=1}^m {\tilde{V}}_{.,j}({\tilde{D}}_+) {\tilde{\delta }}_j {\tilde{M}}_{j,.} \nonumber \\&= \sum _{j=1}^m {\tilde{V}}_{.,j}({\tilde{D}}_+) {\tilde{\delta }}_j \epsilon _{\varphi ^{-1}(j)} M_{{\varphi ^{-1}(j)},.} \nonumber \\&= \sum _{j=1}^m {\tilde{V}}_{.,\varphi (j)}({\tilde{D}}_+) {\tilde{\delta }}_{\varphi (j)} \epsilon _{j} M_{j,.}. \end{aligned}$$
(52)

The condition \({{\textbf {C}}}.a)\) states that M is full row rank, so the vectors \(M_{j,.}\) are linearly independent. Applied to (52), this information yields, for all \(j \in \llbracket 1, m \rrbracket\),

$$\begin{aligned} V_{.,j}(D_+) \delta _j = {\tilde{V}}_{.,\varphi (j)}({\tilde{D}}_+) {\tilde{\delta }}_{\varphi (j)} \epsilon _{j}, \end{aligned}$$

and in particular,

$$\begin{aligned} V_{.,i}(D_+) \delta _i = {\tilde{V}}_{.,\varphi (i)}({\tilde{D}}_+) {\tilde{\delta }}_{\varphi (i)} \epsilon _{i}. \end{aligned}$$
(53)

Since \(h^{-1}(D_+)\) and \({\tilde{h}}^{-1}({\tilde{D}}_+)\) are closed, we have

$$\begin{aligned} \overline{B_+} \ \subset \ h^{-1}(D_+) \cap {\tilde{h}}^{-1}({\tilde{D}}_+), \end{aligned}$$

and since \(x \in \overline{B_+}\) with \(x \in H_i \cap \mathring{\Omega }\), we have \(h(x) \in E_i \cap D_+ \cap h(\mathring{\Omega })\) and \({\tilde{h}}(x) \in E_{\varphi (i)} \cap {\tilde{D}}_+ \cap {\tilde{h}}(\mathring{\Omega })\) (recall that \(H_i = {\tilde{H}}_{\varphi (i)}\)). The condition C.c) thus implies that \(V_{.,i}(D_+) \ne 0\) and \({\tilde{V}}_{.,{\varphi (i)}}({\tilde{D}}_+) \ne 0\). We also have \(\epsilon _i \ne 0\), so from (53) we obtain

$$\begin{aligned} \delta _i = 0 \Leftrightarrow {\tilde{\delta }}_{\varphi (i)} = 0. \end{aligned}$$

By definition, the coefficient \(\delta _i\) depends on the sign of \(M_{i,.}y + b_i\) for \(y \in B_+\): if \(M_{i,.}y + b_i\) is positive, then \(\delta _i =1\), and if \(M_{i,.}y + b_i\) is negative, then \(\delta _i = 0\) (\(M_{i,.}y + b_i\) cannot be equal to zero since \(y \notin H_i\)). The coefficient \({\tilde{\delta }}_{\varphi (i)}\) depends similarly on the sign of \({\tilde{M}}_{{\varphi (i)},.}y + {\tilde{b}}_{\varphi (i)}\). Thus, \(M_{i,.}y + b_i\) and \({\tilde{M}}_{{\varphi (i)},.}y + {\tilde{b}}_{\varphi (i)}\) have the same sign.

Since \(\epsilon _i \in \{-1,1 \}\) and

$$\begin{aligned} {\tilde{M}}_{{\varphi (i)},.}y + {\tilde{b}}_{\varphi (i)} = \epsilon _i {M}_{i,.}y + \epsilon _i {b}_i = \epsilon _i \left( {M}_{i,.}y + {b}_i \right) , \end{aligned}$$

we conclude that \(\epsilon _i=1\). \(\square\)
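
The mechanism that fixes the sign can be mimicked numerically: on the positive side of \(H_i\), the original parameters give \(\delta _i = 1\), and only the choice \(\epsilon _i = 1\) makes the candidate parameters activate the same unit on the same side. The sketch below uses hypothetical unit-norm parameters chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
l = 3
Mi = rng.standard_normal(l)
Mi /= np.linalg.norm(Mi)                 # unit-norm row M_{i,.} (illustrative values)
bi = 0.3

# A point y on the strictly positive side of H_i: here M_{i,.} y + b_i = 1 > 0
y = (1.0 - bi) * Mi

for eps in (+1.0, -1.0):
    Mi_t, bi_t = eps * Mi, eps * bi      # candidate (tilde) parameters, same hyperplane
    same_side = np.sign(Mi_t @ y + bi_t) == np.sign(Mi @ y + bi)
    print(f"eps = {eps:+.0f}: unit i activates on the same side of H_i -> {same_side}")

# Only eps = +1 keeps delta_i and tilde_delta_{phi(i)} equal on B_+, which is
# what forces epsilon_i = 1 in Proposition 59.
```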

We can now finish the proof of Lemma 53. It results from the above that:

$$\begin{aligned} {\tilde{M}}&= P_{\varphi }M, \\ {\tilde{b}}&= P_{\varphi } b. \end{aligned}$$

We have by hypothesis, for all \(x \in \Omega\),

$$\begin{aligned} {\tilde{g}}(\sigma ({\tilde{M}}x + {\tilde{b}})) = g(\sigma (Mx + b)), \end{aligned}$$

but since \({\tilde{M}} = P_{\varphi }M\) and \({\tilde{b}} = P_{\varphi } b\) we also have:

$$\begin{aligned} {\tilde{g}}(\sigma ({\tilde{M}}x + {\tilde{b}})) = {\tilde{g}}(\sigma (P_{\varphi }Mx + P_{\varphi } b)) = {\tilde{g}}(P_{\varphi }\sigma (Mx + b)). \end{aligned}$$

Combining these, we have for all \(x \in \Omega\),

$$\begin{aligned} {\tilde{g}} \circ P_{\varphi } (h(x)) = g(h(x)), \end{aligned}$$

i.e. \({\tilde{g}} \circ P_{\varphi }\) and g coincide on \(h(\Omega )\).
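
The last chain of equalities uses only the fact that the coordinatewise ReLU \(\sigma\) commutes with a permutation of coordinates, \(\sigma (P_{\varphi } z) = P_{\varphi } \sigma (z)\). A quick numerical sanity check of this identity, on hypothetical toy data rather than the paper's construction, is given below.

```python
import numpy as np

rng = np.random.default_rng(3)
l, m = 3, 4                              # toy dimensions (assumptions)
M = rng.standard_normal((m, l))
b = rng.standard_normal(m)

phi = rng.permutation(m)                 # a permutation of {0, ..., m-1}
P = np.eye(m)[phi]                       # permutation matrix: row i of P is e_{phi(i)}

relu = lambda z: np.maximum(z, 0.0)      # coordinatewise ReLU sigma

x = rng.standard_normal(l)
lhs = relu(P @ M @ x + P @ b)            # sigma(P_phi M x + P_phi b)
rhs = P @ relu(M @ x + b)                # P_phi sigma(M x + b)
assert np.allclose(lhs, rhs)             # sigma commutes with the permutation

# Hence, if tilde_M = P_phi M and tilde_b = P_phi b, both networks produce the same
# first-layer features up to the fixed permutation P_phi, so tilde_g composed with
# P_phi must coincide with g on h(Omega), as stated at the end of the proof.
```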
