On the choice of the low-dimensional domain for global optimization via random embeddings

Abstract

The challenge of taking many variables into account in optimization problems may be overcome under the hypothesis of low effective dimensionality. Then, the search for solutions can be reduced to the random embedding of a low-dimensional space into the original one, resulting in a more manageable optimization problem. Specifically, in the case of time-consuming black-box functions and when the budget of evaluations is severely limited, global optimization with random embeddings appears as a sound alternative to random search. Yet, in the case of box constraints on the native variables, defining suitable bounds on a low-dimensional domain appears to be complex. Indeed, a small search domain does not guarantee finding a solution even under restrictive hypotheses about the function, while a larger one may slow down convergence dramatically. Here we tackle the issue of low-dimensional domain selection based on a detailed study of the properties of the random embedding, giving insight into the aforementioned difficulties. In particular, we describe a minimal low-dimensional set in correspondence with the embedded search space. We additionally show that an alternative equivalent embedding procedure yields simultaneously a simpler definition of the low-dimensional minimal set and better properties in practice. Finally, the performance and robustness gains of the proposed enhancements for Bayesian optimization are illustrated on numerical examples.
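For readers new to the approach, here is a minimal, illustrative sketch (not the paper's algorithm) of optimization via a random embedding: a random matrix \(\mathbf {A}\) maps a low-dimensional point \(\mathbf {y}\in \mathbb {R}^d\) to \(\phi (\mathbf {y}) = p_\mathcal {X}(\mathbf {A}\mathbf {y})\) in the box \(\mathcal {X}= [-1,1]^D\), and the optimization is carried out over a low-dimensional domain. The test function, dimensions and bounds below are placeholder assumptions, and a generic local optimizer stands in for the Bayesian optimization used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
D, d = 25, 2                        # original and low embedding dimensions (placeholders)
A = rng.standard_normal((D, d))     # random embedding matrix of size D x d

def f(x):
    # placeholder black box with low effective dimensionality:
    # only the first two native variables matter
    return (x[0] - 0.3) ** 2 + (x[1] + 0.5) ** 2

def p_X(x):
    # convex projection onto the hypercube X = [-1, 1]^D
    return np.clip(x, -1.0, 1.0)

def g(y):
    # low-dimensional objective: phi(y) = p_X(A y), then evaluate f
    return f(p_X(A @ y))

# optimize over a heuristic low-dimensional box [-sqrt(d), sqrt(d)]^d; choosing
# these bounds well is precisely the question addressed in the paper
res = minimize(g, x0=np.zeros(d), bounds=[(-np.sqrt(d), np.sqrt(d))] * d)
print(res.x, res.fun)
```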


Notes

  1.

    See [58] for the extended journal version.

  2.

    Outside of the set corresponding to \(\mathcal {P}_I\) in \(\mathbb {R}^d\), the influential variables indexed by I would be fixed to \(\pm 1\).

References

  1. Binois, M.: Uncertainty quantification on Pareto fronts and high-dimensional strategies in Bayesian optimization, with applications in multi-objective automotive design. Ph.D. thesis, Ecole Nationale Supérieure des Mines de Saint-Etienne (2015)

  2. Binois, M., Ginsbourger, D., Roustant, O.: A warped kernel improving robustness in Bayesian optimization via random embeddings. In: Dhaenens, C., Jourdan, L., Marmion, M.E. (eds.) Learning and Intelligent Optimization. Lecture Notes in Computer Science, vol. 8994, pp. 281–286. Springer, New York (2015). https://doi.org/10.1007/978-3-319-19084-6_28

  3. Carpentier, A., Munos, R.: Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In: International Conference on Artificial Intelligence and Statistics (2012)

  4. Černý, M.: Goffin's algorithm for zonotopes. Kybernetika 48(5), 890–906 (2012)

  5. Chen, B., Castro, R., Krause, A.: Joint optimization and variable selection of high-dimensional Gaussian processes. In: Proceedings of the International Conference on Machine Learning (ICML) (2012)

  6. Chen, Y., Hoffman, M.W., Colmenarejo, S.G., Denil, M., Lillicrap, T.P., de Freitas, N.: Learning to learn for global optimization of black box functions. arXiv preprint arXiv:1611.03824 (2016)

  7. Constantine, P.G., Dow, E., Wang, Q.: Active subspace methods in theory and practice: applications to kriging surfaces. SIAM J. Sci. Comput. 36(4), A1500–A1524 (2014)

  8. Courrier, N., Boucard, P.A., Soulier, B.: Variable-fidelity modeling of structural analysis of assemblies. J. Glob. Optim. 64(3), 577–613 (2016)

  9. Dixon, L., Szegö, G.: The global optimization problem: an introduction. Towards Glob. Optim. 2, 1–15 (1978)

  10. Djolonga, J., Krause, A., Cevher, V.: High-dimensional Gaussian process bandits. In: Advances in Neural Information Processing Systems, pp. 1025–1033 (2013)

  11. Donoho, D.L.: High-dimensional data analysis: the curses and blessings of dimensionality. In: AMS Math Challenges Lecture, pp. 1–32 (2000)

  12. Durrande, N.: Étude de classes de noyaux adaptées à la simplification et à l'interprétation des modèles d'approximation. Une approche fonctionnelle et probabiliste. Ph.D. thesis, Saint-Etienne, EMSE (2011)

  13. Durrande, N., Ginsbourger, D., Roustant, O.: Additive kernels for Gaussian process modeling. Annales de la Faculté des Sciences de Toulouse 21(3), 481–499 (2012)

  14. Duvenaud, D.K.: Automatic model construction with Gaussian processes. Ph.D. thesis, University of Cambridge (2014)

  15. Feliot, P., Bect, J., Vazquez, E.: A Bayesian approach to constrained single- and multi-objective optimization. J. Glob. Optim. 67, 1–37 (2015)

  16. Filliman, P.: Extremum problems for zonotopes. Geometriae Dedicata 27(3), 251–262 (1988)

  17. Franey, M., Ranjan, P., Chipman, H.: Branch and bound algorithms for maximizing expected improvement functions. J. Stat. Plan. Inference 141(1), 42–55 (2011)

  18. Gardner, J., Guo, C., Weinberger, K., Garnett, R., Grosse, R.: Discovering and exploiting additive structure for Bayesian optimization. In: Artificial Intelligence and Statistics, pp. 1311–1319 (2017)

  19. Garnett, R., Osborne, M., Hennig, P.: Active learning of linear embeddings for Gaussian processes. In: 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014), pp. 230–239. AUAI Press (2014)

  20. Gutmann, H.M.: A radial basis function method for global optimization. J. Glob. Optim. 19(3), 201–227 (2001)

  21. Hastie, T., Tibshirani, R., Friedman, J., Franklin, J.: The elements of statistical learning: data mining, inference and prediction. Math. Intell. 27(2), 83–85 (2005)

  22. Hennig, P., Schuler, C.J.: Entropy search for information-efficient global optimization. J. Mach. Learn. Res. 13, 1809–1837 (2012)

  23. Huang, D., Allen, T.T., Notz, W.I., Zeng, N.: Global optimization of stochastic black-box systems via sequential kriging meta-models. J. Glob. Optim. 34(3), 441–466 (2006)

  24. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization, pp. 507–523. Springer, Berlin (2011)

  25. Iooss, B., Lemaître, P.: A review on global sensitivity analysis methods. In: Meloni, C., Dellino, G. (eds.) Uncertainty Management in Simulation-Optimization of Complex Systems: Algorithms and Applications. Springer, Berlin (2015)

  26. Ivanov, M., Kuhnt, S.: A parallel optimization algorithm based on FANOVA decomposition. Qual. Reliab. Eng. Int. 30(7), 961–974 (2014)

  27. Jones, D., Schonlau, M., Welch, W.: Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13(4), 455–492 (1998)

  28. Kandasamy, K., Schneider, J., Póczos, B.: High dimensional Bayesian optimisation and bandits via additive models. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 295–304 (2015)

  29. Krein, M., Milman, D.: On extreme points of regular convex sets. Studia Mathematica 9(1), 133–138 (1940)

  30. Krityakierne, T., Akhtar, T., Shoemaker, C.A.: SOP: parallel surrogate global optimization with Pareto center selection for computationally expensive single objective problems. J. Glob. Optim. 66, 1–21 (2016)

  31. Laguna, M., Martí, R.: Experimental testing of advanced scatter search designs for global optimization of multimodal functions. J. Glob. Optim. 33(2), 235–255 (2005)

  32. Le, V.T.H., Stoica, C., Alamo, T., Camacho, E.F., Dumur, D.: Uncertainty representation based on set theory. In: Zonotopes, pp. 1–26 (2013)

  33. Li, C.L., Kandasamy, K., Póczos, B., Schneider, J.: High dimensional Bayesian optimization via restricted projection pursuit models. In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 884–892 (2016)

  34. Liu, B., Zhang, Q., Gielen, G.G.: A Gaussian process surrogate model assisted evolutionary algorithm for medium scale expensive optimization problems. IEEE Trans. Evol. Comput. 18(2), 180–192 (2014)

  35. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis (Probability and Mathematical Statistics). Academic Press, Cambridge (1980)

  36. Mathar, R., Zilinskas, A.: A class of test functions for global optimization. J. Glob. Optim. 5(2), 195–199 (1994)

  37. McMullen, P.: On zonotopes. Trans. Am. Math. Soc. 159, 91–109 (1971)

  38. Meyer, C.D.: Matrix Analysis and Applied Linear Algebra, vol. 2. SIAM, Philadelphia (2000)

  39. Mishra, S.: Global Optimization by Differential Evolution and Particle Swarm Methods: Evaluation on Some Benchmark Functions. Tech. rep., University Library of Munich, Germany (2006)

  40. Mockus, J., Tiesis, V., Zilinskas, A.: The application of Bayesian methods for seeking the extremum. Towards Glob. Optim. 2, 117–129 (1978)

  41. Morris, M.D., Mitchell, T.J., Ylvisaker, D.: Bayesian design and analysis of computer experiments: use of derivatives in surface prediction. Technometrics 35(3), 243–255 (1993)

  42. Neal, R.M.: Bayesian Learning for Neural Networks. Lecture Notes in Statistics, vol. 118. Springer, Berlin (1996)

  43. Nguyen, H.H., Vu, V.: Random matrices: law of the determinant. Ann. Probab. 42(1), 146–167 (2014)

  44. Oh, C., Gavves, E., Welling, M.: BOCK: Bayesian optimization with cylindrical kernels. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80, pp. 3868–3877. PMLR, Stockholmsmässan, Stockholm, Sweden (2018). http://proceedings.mlr.press/v80/oh18a.html

  45. Qian, H., Hu, Y.Q., Yu, Y.: Derivative-free optimization of high-dimensional non-convex functions by sequential random embeddings. In: IJCAI (2016)

  46. Rasmussen, C.E., Williams, C.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)

  47. Rios, L.M., Sahinidis, N.V.: Derivative-free optimization: a review of algorithms and comparison of software implementations. J. Glob. Optim. 56(3), 1247–1293 (2013)

  48. Rolland, P., Scarlett, J., Bogunovic, I., Cevher, V.: High-dimensional Bayesian optimization via additive models with overlapping groups. In: International Conference on Artificial Intelligence and Statistics, pp. 298–307 (2018)

  49. Roustant, O., Ginsbourger, D., Deville, Y.: DiceKriging, DiceOptim: two R packages for the analysis of computer experiments by kriging-based metamodeling and optimization. J. Stat. Softw. 51(1), 1–55 (2012)

  50. Salem, M.B., Bachoc, F., Roustant, O., Gamboa, F., Tomaso, L.: Sequential dimension reduction for learning features of expensive black-box functions. Preprint (2018)

  51. Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., de Freitas, N.: Taking the human out of the loop: a review of Bayesian optimization. Proc. IEEE 104(1), 148–175 (2016)

  52. Song, W., Keane, A.J.: Surrogate-based aerodynamic shape optimization of a civil aircraft engine nacelle. AIAA J. 45(10), 2565–2574 (2007)

  53. Turlach, B.A., Weingessel, A.: quadprog: Functions to Solve Quadratic Programming Problems. R package version 1.5-5 (2013). https://CRAN.R-project.org/package=quadprog

  54. Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 (2010)

  55. Villemonteix, J., Vazquez, E., Sidorkiewicz, M., Walter, E.: Global optimization of expensive-to-evaluate functions: an empirical comparison of two sampling criteria. J. Glob. Optim. 43(2), 373–389 (2009)

  56. Viswanath, A., Forrester, A.J., Keane, A.J.: Dimension reduction for aerodynamic design optimization. AIAA J. 49(6), 1256–1266 (2011)

  57. Wang, Z., Gehring, C., Kohli, P., Jegelka, S.: Batched large-scale Bayesian optimization in high-dimensional spaces. In: International Conference on Artificial Intelligence and Statistics (2018)

  58. Wang, Z., Hutter, F., Zoghi, M., Matheson, D., de Freitas, N.: Bayesian optimization in a billion dimensions via random embeddings. J. Artif. Intell. Res. (JAIR) 55, 361–387 (2016)

  59. Wang, Z., Zoghi, M., Hutter, F., Matheson, D., de Freitas, N.: Bayesian optimization in high dimensions via random embeddings. In: IJCAI (2013)

  60. Ziegler, G.M.: Lectures on Polytopes, vol. 152. Springer, Berlin (1995)


Acknowledgements

We thank the anonymous reviewers for helpful comments on the earlier version of the paper. Parts of this work have been conducted within the frame of the ReDice Consortium, gathering industrial (CEA, EDF, IFPEN, IRSN, Renault) and academic (Ecole des Mines de Saint-Etienne, INRIA, and the University of Bern) partners around advanced methods for Computer Experiments. M.B. also acknowledges partial support from National Science Foundation grant DMS-1521702.

Author information

Corresponding author

Correspondence to Mickaël Binois.


Appendices

A Proofs

A.1 Properties of the convex projection

We begin with two elementary properties of the convex projection onto the hypercube \(\mathcal {X}= [-1,1]^D\):

Property 1

[Tensorization property] \(\forall \mathbf {x}\in \mathbb {R}^D\), \(p_\mathcal {X}\begin{pmatrix} x_1 \\ \vdots \\ x_D \end{pmatrix} = \begin{pmatrix} p_{[-1,1]}(x_1) \\ \vdots \\ p_{[-1,1]}(x_D) \end{pmatrix} .\)

Property 2

[Commutativity with some isometries] Let q be an isometry represented by a diagonal matrix with diagonal entries \(\varepsilon _i = \pm 1\), \(1 \le i \le D\). Then, for all \(\mathbf {x}\in \mathbb {R}^D\), \(p_\mathcal {X}(q(\mathbf {x})) = \begin{pmatrix} \varepsilon _1 p_{[-1,1]}(x_1) \\ \vdots \\ \varepsilon _D p_{[-1,1]}(x_D) \end{pmatrix} = q(p_\mathcal {X}(\mathbf {x}))\).
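These two properties are straightforward to check numerically. The sketch below (an illustrative verification, not part of the paper) compares componentwise clipping with the full projection, and checks commutation with a random sign-flip isometry; the dimension and test point are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 10
x = 3.0 * rng.standard_normal(D)              # an arbitrary point of R^D

def p_X(x):
    # convex projection onto [-1, 1]^D, applied componentwise (Property 1)
    return np.clip(x, -1.0, 1.0)

# Property 1: the projection onto the hypercube acts coordinate by coordinate
assert np.allclose(p_X(x), [np.clip(xi, -1.0, 1.0) for xi in x])

# Property 2: commutation with a diagonal isometry q of entries +/- 1
eps = rng.choice([-1.0, 1.0], size=D)         # diagonal of q
assert np.allclose(p_X(eps * x), eps * p_X(x))
print("Properties 1 and 2 hold on this random point")
```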

A.2 Proof of Theorem 1

Proof

First, note that \(\mathcal {U}\) is a closed set as a finite union of closed sets. Then, let us show that \(p_\mathcal {X}(\mathbf {A}\mathcal {U}) = \mathcal {E}\). Consider \(\mathbf {x}\in \mathcal {E}\), hence \(|x_i| \le 1\) and \(\exists \mathbf {y}\in \mathbb {R}^d\) s.t. \(\mathbf {x}= p_\mathcal {X}(\mathbf {A}\mathbf {y})\). Denote \(\mathbf {b}= \mathbf {A}\mathbf {y}\). We distinguish two cases:

  1.

    At least d components of \(\mathbf {b}\) are in \([-1,1]\). Then there exists a set \(I \subset \{1, \ldots , D \}\) of cardinality d such that \(\mathbf {y}\in \bigcap \nolimits _{i \in I} \mathcal {S}_i = \mathcal {P}_I \subseteq \mathcal {U}\), implying that \(\mathbf {x}\in p_\mathcal {X}(\mathbf {A}\mathcal {U})\).

  2.

    \(0 \le k < d\) components of \(\mathbf {b}\) are in \([-1,1]\). It is enough to consider the case \(\mathbf {b}\in [0, +\infty )^D\). Indeed, for any \(\mathbf {x}\in \mathcal {E}\) and any \(\mathbf {A}\in \mathcal {A}\), let \(\varvec{\varepsilon }\) be the isometry given by the diagonal \(D \times D\) matrix with entries \(\pm 1\) such that \(\varvec{\varepsilon } \mathbf {x}\in [0, +\infty )^D\). It follows that \(\varvec{\varepsilon }\mathbf {b}\) is in \([0, +\infty )^D\) too. Denote \(\mathbf {x}' = \varvec{\varepsilon } \mathbf {x}\), \(\mathbf {b}' = \varvec{\varepsilon } \mathbf {b}\) and \(\mathbf {A}' = \varvec{\varepsilon } \mathbf {A}\). Thus if \(\exists \mathbf {u}\in \mathcal {U}\) such that \(\mathbf {x}' = p_\mathcal {X}(\mathbf {b}') = p_\mathcal {X}(\mathbf {A}' \mathbf {u})\), then by Property 2, \(\varvec{\varepsilon } \mathbf {x}= \varvec{\varepsilon } p_\mathcal {X}(\mathbf {A}\mathbf {u})\), leading to \(\mathbf {x}= p_\mathcal {X}(\mathbf {b}) = p_\mathcal {X}(\mathbf {A}\mathbf {u})\). From now on, we therefore assume that \(b_i \ge 0\), \(1 \le i \le D\). Furthermore, up to a permutation of indices, we can assume that \(0 \le b_1 \le \ldots \le b_D\). Hence \(b_i > 1\) if \(i > k\) and \(\mathbf {x}= (x_1 = b_1, \ldots , x_k = b_k, 1, \ldots , 1)^T\). Let \(\mathbf {y}' \in \mathbb {R}^d\) be the solution of \(\mathbf {A}_{1, \ldots ,d} \mathbf {y}' = (b_1, \ldots , b_k, 1, \ldots , 1)^T\) (vector of size d). Such a solution exists since \(\mathbf {A}_{1, \ldots ,d}\) is invertible by hypothesis. Then define \(\mathbf {b}' = \mathbf {A}\mathbf {y}'\), so that \(\mathbf {b}' = (b_1, \ldots , b_k, 1, \ldots , 1, b_{d+1}', \ldots , b_D')^T\). We have \(\mathbf {b}' \in \text {Ran}(\mathbf {A})\) and \(\mathbf {y}' \in \mathcal {P}_{1, \ldots , d} \subseteq \mathcal {U}\).

    • If \(\min _{i \in \{d+1, \ldots , D\}}(b_i') \ge 1\), then \(p_\mathcal {X}(\mathbf {b}') = p_\mathcal {X}(\mathbf {b}) = \mathbf {x}\), and thus \(\mathbf {x}= p_\mathcal {X}(\mathbf {A}\mathbf {y}') \in p_\mathcal {X}(\mathbf {A}\mathcal {U})\).

    • Else, the set \(L = \{i \in \mathbb {N}: d+1 \le i \le D \,,\,b'_i < 1\}\) is not empty. Consider \(\mathbf {c}= \lambda \mathbf {b}' + (1-\lambda )\mathbf {b}\), \(\lambda \in ]0,1[\). By linearity, since both \(\mathbf {b}\) and \(\mathbf {b}'\) belong to \(\text {Ran}(\mathbf {A})\), \(\mathbf {c}\in \text {Ran}(\mathbf {A})\).

      • For \(1 \le i \le k\), \(c_i = x_i\).

      • For \(k+1 \le i \le d\), \(c_i = \lambda + (1- \lambda )b_i \ge 1\) since \(b_i > 1\).

      • For \(i \in \{d+1, \ldots , D\} \setminus L\), \(b'_i \ge 1\) and \(b_i > 1\) hence \(c_i \ge 1\).

      • We now focus on the remaining components in L. For all \(i \in L\), we solve \(c_i = 1\), i.e., \(\lambda b'_i + (1-\lambda ) b_i = \lambda (b'_i - b_i) + b_i = 1\). The solution is \(\lambda _i = \frac{b_i-1}{b_i - b'_i}\), with \(b_i - b'_i > 0\) since \(b'_i < 1\). Also \(b_i - 1 > 0\) and \(b_i - 1 < b_i - b'_i\) such that we have \(\lambda _i \in ]0,1[\). Denote \(\lambda ^* = \min _{i \in L} \lambda _i\) and the corresponding index \(i^*\). By construction, \(c_{i^*} = 1\) and \(\forall i \in L\), \(c_i = \lambda ^* (b'_i - b_i) + b_i \ge \lambda _i (b'_i - b_i) + b_i = 1\) since \(\lambda _i \ge \lambda ^*\) and \(b'_i - b_i < 0\).

      To summarize, we can construct \(\mathbf {c}^*\) with \(\lambda ^*\) that has at least \(k + 1\) components in \([-1,1]\) (the first k and the \(i^*\)th one), the others being greater than or equal to 1. Moreover, \(\mathbf {c}^* \in \text {Ran}(\mathbf {A})\) and fulfills \(p_\mathcal {X}(\mathbf {c}^*) = p_\mathcal {X}(\mathbf {b}) = \mathbf {x}\) by Property 1. If \(k+1 = d\), this corresponds to case 1 above; otherwise, it is possible to reiterate by taking \(\mathbf {b}= \mathbf {c}^*\). Hence we have a pre-image of \(\mathbf {x}\) by \(\phi \) in \(\mathcal {U}\).

Thus the surjection property is shown. There remains to show that \(\mathcal {U}\) is the smallest closed set achieving this, along with additional topological properties.

Let us show that any closed set \(\mathcal {Y}\subseteq \mathbb {R}^d\) such that \(p_\mathcal {X}(\mathbf {A}\mathcal {Y}) = \mathcal {E}\) contains \(\mathcal {U}\). To this end, we consider \(\mathcal {U}^* = \bigcup \nolimits _{I \subseteq \{1, \ldots , D\}, |I| = d} \mathring{\mathcal {P}}_I\) with \(\mathring{\mathcal {P}}_I = \left\{ {\mathbf {y}\in \mathbb {R}^d \,,\,\forall i \in I, -1< \mathbf {A}_i \mathbf {y}< 1}\right\} \), the interior of the parallelotopes. We claim that every point of \(p_\mathcal {X}(\mathbf {A}\mathcal {U}^*)\) has a unique pre-image by \(\phi \) in \(\mathbb {R}^d\). Indeed, all \(\mathbf {x}\in p_\mathcal {X}(\mathbf {A}\mathcal {U}^*)\) have (at least) d components whose absolute value is strictly lower than 1. Without loss of generality, we suppose that they are the first d ones, \(I = \{1, \ldots , d\}\). Then there exists a unique \(\mathbf {y}\in \mathbb {R}^d\) s.t. \(\mathbf {x}= p_\mathcal {X}(\mathbf {A}\mathbf {y})\), because \(\mathbf {x}_I = (\mathbf {A}\mathbf {y})_I = \mathbf {A}_I \mathbf {y}\) has a unique solution since \(\mathbf {A}_I\) is invertible. Since \(\mathcal {Y}\) is in surjection with \(\mathcal {E}\) for \({\phi }|_{\mathcal {Y}}\) and every point of \(p_\mathcal {X}(\mathbf {A}\mathcal {U}^*)\) has a unique pre-image by \(\phi \), it follows that \(\mathcal {U}^* \subseteq \mathcal {Y}\). Additionally, \(\mathcal {Y}\) is a closed set, so it must contain the closure \(\mathcal {U}\) of \(\mathcal {U}^*\).

Finally, let us prove the topological properties of \(\mathcal {U}\). Recall that the parallelotopes \(\mathcal {P}_I\) \((I \subseteq \{1, \ldots , D \}, |I| = d)\) are compact convex sets, as linear transformations of d-cubes. Thus \(\mathcal {I}= \bigcap \limits _{I \subseteq \{1, \ldots , D \}, |I| = d} \mathcal {P}_I\) is a compact convex set as the intersection of compact convex sets, and it is non-empty (\(O \in \mathcal {I}\)). It follows that \(\mathcal {U}= \bigcup \limits _{I \subseteq \{1, \ldots , D \}, |I| = d} \mathcal {P}_I\) is compact and connected as a finite union of compact connected sets with a non-empty intersection, i.e., \(\mathcal {I}\). Additionally, \(\mathcal {U}\) is star-shaped with respect to any point in \(\mathcal {I}\) (since \(\mathcal {I}\) belongs to all parallelotopes in \(\mathcal {U}\)).\(\square \)
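For intuition, the constructive argument above can be transcribed numerically. The sketch below (an illustration under assumed placeholder dimensions, not code from the paper) iterates the interpolation step of the proof to exhibit, for random \(\mathbf {y}\in \mathbb {R}^d\), a pre-image in \(\mathcal {U}\) of \(p_\mathcal {X}(\mathbf {A}\mathbf {y})\).

```python
import numpy as np

rng = np.random.default_rng(4)
D, d = 6, 2
A = rng.standard_normal((D, d))              # any d rows are a.s. invertible for Gaussian A

def p_X(x):
    return np.clip(x, -1.0, 1.0)

def preimage_in_U(A, y, tol=1e-9):
    """Constructive step of the proof of Theorem 1: starting from b = A y,
    move within Ran(A) (after a sign flip reducing to b >= 0, Property 2)
    until at least d coordinates lie in [-1, 1], keeping p_X unchanged."""
    D, d = A.shape
    b = A @ y
    s = np.where(b >= 0, 1.0, -1.0)
    A2, b2 = s[:, None] * A, s * b           # now b2 = A2 y >= 0 componentwise
    while np.sum(b2 <= 1 + tol) < d:
        inside = np.flatnonzero(b2 <= 1 + tol)
        outside = np.flatnonzero(b2 > 1 + tol)
        J = np.concatenate([inside, outside[: d - len(inside)]])
        target = np.where(b2[J] <= 1 + tol, b2[J], 1.0)
        y2 = np.linalg.solve(A2[J], target)  # uses invertibility of the selected d rows
        bp = A2 @ y2                         # b' in the proof, still in Ran(A2)
        L = [i for i in outside if bp[i] < 1 - tol]
        if not L:                            # bp already has at least d coords in [0, 1]
            b2 = bp
            continue
        lam = min((b2[i] - 1.0) / (b2[i] - bp[i]) for i in L)
        b2 = lam * bp + (1.0 - lam) * b2     # one more coordinate hits exactly 1
    u = np.linalg.lstsq(A2, b2, rcond=None)[0]   # low-dimensional point with A2 u = b2
    return u

# check on random points: u lies in U and has the same image as y through p_X(A .)
for _ in range(100):
    y = 10.0 * rng.standard_normal(d)
    u = preimage_in_U(A, y)
    assert np.sum(np.abs(A @ u) <= 1 + 1e-6) >= d        # u belongs to U
    assert np.allclose(p_X(A @ u), p_X(A @ y), atol=1e-6)
print("Theorem 1 illustrated: every sampled point of E has a pre-image in U")
```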

A.3 Proof of Proposition 1

Proof

It follows from Definition 2 that \(p_\mathbf {A}(\mathcal {X})\) is a zonotope of center O, obtained from the orthogonal projection of the D-hypercube \(\mathcal {X}\). As such, \(p_\mathbf {A}(\mathcal {X})\) is a convex polytope.

Since \(\mathcal {E}\subset \mathcal {X}\), it is direct that \(p_\mathbf {A}(\mathcal {E}) \subseteq p_\mathbf {A}(\mathcal {X})\).

To prove \(p_\mathbf {A}(\mathcal {X}) \subseteq p_\mathbf {A}(\mathcal {E})\), let us start with vertices. Denote by \(\mathbf {x}\in \mathbb {R}^D\) a vertex of \(p_\mathbf {A}(\mathcal {X})\). If \(\mathbf {x}\in \mathcal {X}\), then \(p_\mathbf {A}(p_\mathcal {X}(\mathbf {x})) = p_\mathbf {A}(\mathbf {x}) = \mathbf {x}\), i.e., \(\mathbf {x}\) has a pre-image in \(\mathcal {E}\) by \(p_\mathbf {A}\). Else, if \(\mathbf {x}\notin \mathcal {X}\), consider the vertex \(\mathbf {v}\) of \(\mathcal {X}\) such that \(p_\mathbf {A}(\mathbf {v}) = \mathbf {x}\). Suppose that \(\mathbf {v}\notin \mathcal {E}\). Let us remark that if \(\mathbf {v}\) is a vertex of \(\mathcal {X}\) such that \(\mathbf {v}\notin \mathcal {E}\), then \(\text {Ran}(\mathbf {A}) \cap H_v = \emptyset \), where \(H_v\) is the open hyper-octant (with strict inequalities) that contains \(\mathbf {v}\). Indeed, if \(\mathbf {z}\in \text {Ran}(\mathbf {A}) \cap H_v\), then \(\exists k \in \mathbb {R}^*\) such that \(p_\mathcal {X}(k \mathbf {z}) = \mathbf {v}\), which contradicts \(\mathbf {v}\notin \mathcal {E}\). Denote by \(\mathbf {u}\) the intersection of the line \((O\mathbf {x})\) with \(\mathcal {X}\); since \(\mathbf {x}\notin H_v\), \(\mathbf {u}\notin H_v\) either, hence \(\widehat{\mathbf {x}\mathbf {u}\mathbf {v}} > \pi /2\). Then \(\Vert \mathbf {u}- \mathbf {v}\Vert < \Vert \mathbf {x}- \mathbf {v}\Vert \), which contradicts \(\mathbf {x}= p_\mathbf {A}(\mathbf {v})\). Hence \(\mathbf {v}\in \mathcal {E}\) and \(\mathbf {x}\) has a pre-image by \(p_\mathbf {A}\) in \(\mathcal {E}\).

Now, suppose \(\exists \mathbf {x}\in p_\mathbf {A}(\mathcal {X})\) such that its pre-image(s) in \(\mathcal {X}\) by \(p_\mathbf {A}\) belong to \(\mathcal {X}\setminus \mathcal {E}\). Denote by \(\mathbf {x}' \in p_\mathbf {A}(\mathcal {X})\) the closest vertex of \(p_\mathbf {A}(\mathcal {X})\), which has a pre-image in \(\mathcal {E}\) by \(p_\mathbf {A}\). By continuity of \(p_\mathbf {A}\), there exists \(\mathbf {x}'' \in [\mathbf {x}, \mathbf {x}']\) with pre-image in \((\mathcal {X}\setminus \mathcal {E}) \cap \mathcal {E}= \emptyset \), hence a contradiction since \(\mathbf {x}''\) has at least one pre-image. Consequently \(\mathbf {x}\) has at least one pre-image in \(\mathcal {E}\), and \(p_\mathbf {A}(\mathcal {X}) \subseteq p_\mathbf {A}(\mathcal {E})\).\(\square \)
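Proposition 1 can be illustrated numerically: since \(p_\mathbf {A}\) can be written \(\mathbf {B}^\top \mathbf {B}\), comparing \(p_\mathbf {A}(\mathcal {X})\) and \(p_\mathbf {A}(\mathcal {E})\) amounts to comparing \(\mathcal {Z}= \mathbf {B}\mathcal {X}\) and \(\mathbf {B}\mathcal {E}\) in \(\mathbb {R}^d\). The sketch below (an illustration with placeholder dimensions, not part of the paper) checks, for a few random directions, that \(\mathbf {B}\mathcal {E}\) reaches the support function of the zonotope \(\mathcal {Z}\); the reverse inclusion is immediate since \(\mathcal {E}\subset \mathcal {X}\).

```python
import numpy as np

rng = np.random.default_rng(2)
D, d = 8, 2
A = rng.standard_normal((D, d))
B = np.linalg.qr(A)[0].T                 # d x D: transpose of A after orthonormalization

def p_X(x):
    return np.clip(x, -1.0, 1.0)

# For any direction w, a far-away point k * B^T w of Ran(A) projects onto the vertex
# of X maximizing <B^T w, x>, so B E reaches the support function of Z in direction w.
for _ in range(5):
    w = rng.standard_normal(d)
    h_Z = np.abs(B.T @ w).sum()                          # support function of Z = B X
    y0 = np.linalg.lstsq(A, B.T @ w, rcond=None)[0]      # pre-image: A y0 = B^T w
    x = p_X(A @ (1e6 * y0))                              # a point of E
    print(np.isclose(w @ (B @ x), h_Z))                  # supports coincide
```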

A.4 Proof of Theorem 2

Proof

As a preliminary, let us show that \(\forall \mathbf {y}\in \mathcal {Z}\), \(\gamma (\mathbf {y}) \in \mathcal {E}\). Let \(\mathbf {x}_1 \in \mathcal {X}\cap p_\mathbf {A}^{-1}(\mathbf {B}^\top \mathbf {y})\) \((\ne \emptyset )\). From Proposition 1, \(\mathbf {B}^\top \mathbf {y}\) also has a pre-image \(\mathbf {x}_2 \in \mathcal {E}\) by \(p_\mathbf {A}\), and denote \(\mathbf {u}\in \text {Ran}(\mathbf {A})\) such that \(p_\mathcal {X}(\mathbf {u}) = \mathbf {x}_2\), i.e., \(\Vert \mathbf {x}_2 - \mathbf {u}\Vert = \min \limits _{\mathbf {x}\in \mathcal {X}} \Vert \mathbf {x}- \mathbf {u}\Vert \). We have \(\Vert \mathbf {x}_1 - \mathbf {u}\Vert ^2 = \Vert \mathbf {x}_1 - \mathbf {B}^\top \mathbf {y}\Vert ^2 + \Vert \mathbf {B}^\top \mathbf {y}- \mathbf {u}\Vert ^2\) and \(\Vert \mathbf {x}_2 - \mathbf {u}\Vert ^2 = \Vert \mathbf {x}_2 - \mathbf {B}^\top \mathbf {y}\Vert ^2 + \Vert \mathbf {B}^\top \mathbf {y}- \mathbf {u}\Vert ^2\), as \(\mathbf {x}_1, \mathbf {x}_2 \in p_\mathbf {A}^{-1}(\mathbf {B}^\top \mathbf {y})\). Then, \(\Vert \mathbf {x}_2 - \mathbf {u}\Vert \le \Vert \mathbf {x}_1 - \mathbf {u}\Vert \Rightarrow \Vert \mathbf {x}_2 - \mathbf {B}^\top \mathbf {y}\Vert \le \Vert \mathbf {x}_1 - \mathbf {B}^\top \mathbf {y}\Vert \), with equality if and only if \(\mathbf {x}_1 = \mathbf {x}_2\), so that \(\gamma (\mathbf {y}) \in \mathcal {E}\).

We now proceed by showing that \(\gamma \) defines a bijection from \(\mathcal {Z}\) to \(\mathcal {E}\), with \(\gamma ^{-1} = \mathbf {B}\). First, \(\forall \mathbf {y}\in \mathcal {Z}\), \(\mathbf {B}\gamma (\mathbf {y}) = \mathbf {y}\) by definition of \(\gamma \). It remains to show that, \(\forall \mathbf {x}\in \mathcal {E}\), \(\gamma (\mathbf {B}\mathbf {x}) = \mathbf {x}\). Let \(\mathbf {x}\in \mathcal {E}\) and \(\mathbf {u}\in \text {Ran}(\mathbf {A})\) such that \(p_\mathcal {X}(\mathbf {u}) = \mathbf {x}\). Suppose that \(\gamma (\mathbf {B}\mathbf {x}) = \mathbf {x}' \in \mathcal {E}\) with \(\mathbf {x}' \ne \mathbf {x}\); in particular, \(\Vert \mathbf {x}' - \mathbf {B}^\top \mathbf {B}\mathbf {x}\Vert < \Vert \mathbf {x}- \mathbf {B}^\top \mathbf {B}\mathbf {x}\Vert \). Again, \(\mathbf {x}, \mathbf {x}' \in p_\mathbf {A}^{-1}(\mathbf {B}^\top \mathbf {B}\mathbf {x})\), hence \(\Vert \mathbf {x}' - \mathbf {B}^\top \mathbf {B}\mathbf {x}\Vert ^2 + \Vert \mathbf {B}^\top \mathbf {B}\mathbf {x}- \mathbf {u}\Vert ^2 = \Vert \mathbf {x}' - \mathbf {u}\Vert ^2 < \Vert \mathbf {x}- \mathbf {B}^\top \mathbf {B}\mathbf {x}\Vert ^2 + \Vert \mathbf {B}^\top \mathbf {B}\mathbf {x}- \mathbf {u}\Vert ^2 = \Vert \mathbf {x}- \mathbf {u}\Vert ^2\), which contradicts \(\mathbf {x}= p_\mathcal {X}(\mathbf {u})\). Thus \(\gamma (\mathbf {B}\mathbf {x}) = \mathbf {x}\).

\(\mathcal {Z}\) is compact, convex and centrally symmetric as a zonotope, see Definition 2. Finally, any set smaller than \(\mathcal {Z}\) would not have an image through \(\gamma \) covering \(\mathcal {E}\) entirely, which concludes the proof.\(\square \)
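Following the characterization used in the proof above, \(\gamma (\mathbf {y})\) is the solution of a small quadratic program: minimize \(\Vert \mathbf {x}- \mathbf {B}^\top \mathbf {y}\Vert ^2\) over \(\mathbf {x}\in [-1,1]^D\) subject to \(\mathbf {B}\mathbf {x}= \mathbf {y}\) (in R, the quadprog package [53] solves such problems). The following is an illustrative Python transcription with a generic solver and placeholder dimensions; it is a sketch, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
D, d = 8, 2
A = rng.standard_normal((D, d))
B = np.linalg.qr(A)[0].T                 # d x D with orthonormal rows

def gamma(y):
    """gamma(y): point of X = [-1,1]^D closest to B^T y among those whose
    orthogonal projection onto Ran(A) is B^T y, i.e. a small QP:
        min ||x - B^T y||^2  s.t.  B x = y,  -1 <= x <= 1."""
    t = B.T @ y
    res = minimize(lambda x: 0.5 * np.sum((x - t) ** 2),
                   x0=np.clip(t, -1.0, 1.0),
                   jac=lambda x: x - t,
                   bounds=[(-1.0, 1.0)] * D,
                   constraints=[{"type": "eq", "fun": lambda x: B @ x - y,
                                 "jac": lambda x: B}],
                   method="SLSQP")
    return res.x

# a point y of the zonotope Z = B X and its image gamma(y) in E
y = B @ rng.uniform(-1.0, 1.0, size=D)   # guaranteed to lie in Z
x = gamma(y)
print(np.allclose(B @ x, y, atol=1e-5), np.all(np.abs(x) <= 1 + 1e-8))
```

In the alternative formulation \((\mathcal {R}')\), the low-dimensional objective is then evaluated as \(f(\gamma (\mathbf {y}))\) for \(\mathbf {y}\in \mathcal {Z}\).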

A.5 Proof of Proposition 2

Proof

The first part directly follows from the properties of the convex and orthogonal projections: both are 1-Lipschitz, hence they cannot increase d-dimensional volumes. In detail: \(\text {Vol}_d(\mathbf {A}\mathcal {U}) \ge \text {Vol}_d(p_\mathcal {X}(\mathbf {A}\mathcal {U})) = \text {Vol}_d(\mathcal {E}) \ge \text {Vol}_d(p_\mathbf {A}(\mathcal {E})) = \text {Vol}_d(\mathbf {B}^\top \mathcal {Z}) = \text {Vol}_d(\mathcal {Z})\).

For the second part, we need the width of a strip \(\mathcal {S}_i\), i.e., the distance between its two bounding hyperplanes: \(l_i = 2 / \Vert \mathbf {A}_i \Vert \). Recall that \(\mathbf {B}= \mathbf {A}^\top \), that the rows of \(\mathbf {A}\) have equal norm and that \(\mathbf {A}\) has orthonormal columns. Then, following the proof of [16, Theorem 1], \(\sum \limits _{j = 1}^d \Vert \mathbf {B}_j \Vert ^2 = d\) (orthonormality) \(= \sum \limits _{i = 1}^D \sum \limits _{j = 1}^d A_{i,j}^2 = \sum \limits _{i = 1}^D \Vert \mathbf {A}_i \Vert ^2 = D \Vert \mathbf {A}_1 \Vert ^2\), hence \(\Vert \mathbf {A}_1 \Vert = \sqrt{d/D}\). As \(\mathcal {Z}\) is enclosed in the d-ball of radius \(\sqrt{D}\) and the d-ball of radius \(\sqrt{D/d}\) is enclosed in \(\mathcal {I}\), the result follows from the formula for the volume of a d-ball of radius \(\rho \): \(\frac{\pi ^{d/2}}{\Gamma (d/2 + 1)} \rho ^d\). \(\square \)
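As a quick numerical illustration of this last step (with placeholder values of D and d), the following computes the volumes of the d-ball of radius \(\sqrt{D}\) enclosing \(\mathcal {Z}\) and of the d-ball of radius \(\sqrt{D/d}\) inscribed in \(\mathcal {I}\):

```python
import math

def ball_volume(d, rho):
    # volume of the d-dimensional ball of radius rho: pi^(d/2) / Gamma(d/2 + 1) * rho^d
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * rho ** d

D, d = 25, 2                                   # placeholder dimensions
vol_outer = ball_volume(d, math.sqrt(D))       # upper bounds Vol_d(Z)
vol_inner = ball_volume(d, math.sqrt(D / d))   # lower bounds Vol_d(I)
print(vol_outer, vol_inner, vol_outer / vol_inner)   # the ratio equals d^(d/2)
```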

B Main notations

d: Low embedding dimension
D: Original dimension, \(D \gg d\)
\(\mathbf {A}\): Random embedding matrix of size \(D \times d\)
\(\mathbf {B}\): Transpose of \(\mathbf {A}\) after orthonormalization
\(\mathcal {X}\): Search domain \([-1,1]^D\)
\(\mathcal {Y}\): Low-dimensional optimization domain, in \(\mathbb {R}^d\)
\(\mathcal {Z}\): Zonotope \(\mathbf {B}\mathcal {X}\)
\(p_\mathcal {X}\): Convex projection onto \(\mathcal {X}\)
\(p_\mathbf {A}\): Orthogonal projection onto \(\text {Ran}(\mathbf {A})\)
\(\Psi \): Warping function from \(\mathbb {R}^d\) to \(\text {Ran}(\mathbf {A})\)
\(\phi \): Mapping from \(\mathcal {Y}\subset \mathbb {R}^d\) to \(\mathbb {R}^D\)
\(\gamma \): Mapping from \(\mathcal {Z}\subset \mathbb {R}^d\) to \(\mathbb {R}^D\)
\((\mathcal {R})\): Optimization problem for REMBO with mapping \(\phi \)
\((\mathcal {R}')\): Optimization problem for REMBO with mapping \(\gamma \)
\((\mathcal {Q})\): Minimal volume problem definition with mapping \(\phi \)
\((\mathcal {Q}')\): Minimal volume problem definition with mapping \(\gamma \)
\(\mathcal {B}\): Box enclosing \(\mathcal {Z}\)
\(\mathcal {E}\): Image of \(\mathbb {R}^d\) by \(\phi \)
\(\mathcal {S}_i\): Strip associated with the ith row of \(\mathbf {A}\)
\(\mathcal {I}\): Intersection of all strips \(\mathcal {S}_i\)
\(\mathcal {U}\): Union of all intersections of d strips \(\mathcal {S}_i\)
\(\mathcal {P}_I\): Parallelotope associated with the strips in the set I


About this article


Cite this article

Binois, M., Ginsbourger, D. & Roustant, O. On the choice of the low-dimensional domain for global optimization via random embeddings. J Glob Optim 76, 69–90 (2020). https://doi.org/10.1007/s10898-019-00839-1


Keywords

  • Expensive black-box optimization
  • Low effective dimensionality
  • Zonotope
  • REMBO
  • Bayesian optimization