Abstract
We introduce the structure optimized proximity scaling (STOPS) framework for hyperparameter selection in parametrized multidimensional scaling and extensions (proximity scaling; PS). The selection process for hyperparameters is based on the idea that we want the configuration to show a certain structural quality (c-structuredness). A number of structures and how to measure them are discussed. We combine the structural quality, by means of c-structuredness indices, with the PS badness-of-fit measure in a multi-objective scalarization approach, yielding the Stoploss objective. Computationally we suggest a profile-type algorithm that first solves the PS problem and then uses Stoploss in an outer step to optimize over the hyperparameters. Bayesian optimization with treed Gaussian processes is recommended as an apt and efficient strategy for carrying out the outer optimization. This way, hyperparameter tuning for many instances of PS is covered in a single conceptual framework. We illustrate the use of the STOPS framework with three data examples.
1 Introduction
For unsupervised learning, dimensionality reduction and exploratory data analysis, a popular approach is to represent and visualize multivariate proximities in lower dimensions, also known as ordination. Such methods were first known as multidimensional scaling (MDS) and later also collected under the nonlinear dimension reduction and manifold learning monikers. We will use the umbrella term proximity scaling (PS) by which we mean MDS and extensions (for a comprehensive overview, see, e.g., France and Carroll 2011).
The main idea behind these methods is that for N data points or objects there is a matrix of pairwise proximities between object i and object j and one looks for a representation of the objects—the configuration—in a lower dimensional target space where the distances between objects optimally approximate the proximities. Examples include classical MDS (Torgerson 1958), metric MDS and nonmetric MDS (Kruskal 1964), Sammon mapping (Sammon 1969), Isomap (De’ath 1999; Tenenbaum et al. 2000) or power stress MDS (POST-MDS, Buja et al. 2008; Groenen and De Leeuw 2010).
PS methods play an important role in the exploration and communication of data. There exists a plethora of PS methods suitable for many different purposes. Interestingly in the wild one seems to encounter only a relatively narrow PS toolbox, mainly classical MDS and nonmetric/metric MDS. General and flexible PS methods, which allow for various transformations of the quantities of interest, are more rarely encountered. This is unfortunate as we are convinced that flexible PS methods can be both useful and powerful and ignoring them may lead to foregone insights.
Flexible PS methods typically have hyperparameters governing the transformations, which are either explicit or implicit, often yielding standard PS methods for certain hyperparameter constellations. For example, a popular method for unravelling a manifold embedded in a higher-dimensional space is Isomap, which does classical MDS on proximities that are derived from a neighborhood graph. The graph and derived proximities depend on a hyperparameter for the neighborhood that can be freely chosen. Similarly, in POST-MDS there are power transformations for fitted distances, proximities and/or weights which can be freely chosen and yield metric MDS if the exponents are 1.
A possible obstacle for a more widespread use of flexible PS methods are the hyperparameters for the transformations of the quantities of interest and their selection. An instructive example of this are the ideas of Ramsay (1982) that were read before the Royal Statistical Society but have gained relatively little traction. Ramsay suggests power and spline transformations to be used in MDS and, regarding the hyperparameters, recommends using an exponent of 1.5 and polynomials of degree 2 based on “general experience” (p. 288 and 290, Ramsay 1982). The choice of transformations and hyperparameters is met with some contention by influential discussants like J. C. Gower, F. Critchley, J. de Leeuw, and others. Two comments from the discussion illustrate this: S. Tagg states “this complexity requires understanding of the choice of parameters” (p. 303, Ramsay 1982) and C. Chatfield says “by allowing [...] for different transformations [...] he hopes to have a realistic, albeit rather complicated, model. My worry is that the model is too complicated and yet still not realistic enough.” (p. 306, Ramsay 1982). This debate on how to choose the optimal parameters for flexible MDS methods has not yet been resolved. Hence, if these methods are used, the choice of hyperparameters often has an arbitrary trial-and-error aspect to it as they are simply set ad hoc (e.g., Ramsay 1982; Buja et al. 2008; Mair et al. 2014) and/or different values are tried out (sometimes in a semi-optimized fashion) until values are found that work in light of the application, see, e.g., the analyses in Buja and Swayne (2002); Borg and Groenen (2005); Chen and Buja (2009, 2013); De Leeuw (2014).
We propose to approach the hyperparameter selection in a principled way: as a computational problem of optimizing over the hyperparameter space for a wide variety of PS methods. Hence this article’s contributions are: a) providing a methodological framework that subsumes a wide array of flexible PS methods and parametric transformations for them, b) suggesting the setup of objective functions for hyperparameter selection to optimize for within this framework, c) operationalizing the building blocks that comprise the objective functions, and d) identifying a general computational approach that is suitable to tackle all these instances in the same way.
One contribution is that we propose criteria that can be used for operationalizing this optimization problem. Our proposal is based on the observation that in many applications of PS the obtained result—which is faithful to the input proximities—is interpreted with respect to the arrangement of objects in the configuration. We call this “structural quality” for a specific notion of structural appearance, for example, that objects are arranged in quadrants or as a circumplex. In recent years this aspect of interpretation in PS has been made explicit by using indices to measure a structural quality of interest, e.g., by Akkucuk and Carroll (2006); France and Carroll (2007); Chen and Buja (2009, 2013); Rusch et al. (2021).
Since structural quality is a property of the configuration, we coin the term c-structuredness to single out that we mean the degree of structuredness of a configuration with respect to a certain structure. High c-structuredness is something we aim for in PS results under the condition that the proximities are still represented faithfully. Changing the hyperparameters for the transformations typically also changes the c-structuredness, which provides a way for us to set up tuning of the hyperparameters so that we obtain a configuration that is both faithful and c-structured.
Our suggestions build on precursor ideas of letting a criterion guide the selection of hyperparameters in PS, as for example in Akkucuk and Carroll (2006) and Chen and Buja (2009, 2013), where tuning has been done manually or over a grid. France and Akkucuk (2021) propose a visualization and exploration framework along these lines. An optimization approach to hyperparameter tuning in MDS was previously used by Rusch et al. (2021). Our proposal extends and complements these approaches.
In the remainder of this article we present our conceptual framework for hyperparameter tuning in PS: STOPS for STructure Optimized Proximity Scaling. It allows one to computationally tackle this task for a wide array of flexible PS versions, subsuming standard MDS methods used in the wild, which result as specific hyperparameter constellations in the search space. We do this by making the notion of the structural quality that is sought and interpreted explicit; these structural appearances are condensed into c-structuredness indices to clearly quantify the degree of structuredness of a PS result. The goal of finding hyperparameters is then handled as a multi-objective optimization problem: The c-structuredness indices are combined with the PS badness-of-fit by scalarization into an objective function that is parametrized with the hyperparameters. The optimal hyperparameters are found via a general purpose optimization routine; one that emerged as working well for this problem class was Bayesian Optimization (BO, Mockus 1989). This gives a general approach that allows us to handle most instances of parametrized PS in the same way.
The article is organized as follows: It starts with a high level description of the STOPS framework, the objective functions and the different building blocks that make up the framework in Sect. 2. We then discuss the building blocks in detail: in Sect. 3 we discuss badness-of-fit objectives, in Sect. 4 we elaborate on transformations and their governing hyperparameters. In Sect. 5 we turn to c-structuredness, i.e., structures of interest and their quantification by indices. Section 6 discusses hyperparameter selection within the STOPS framework. In Sect. 7 we illustrate the use of the framework with three data examples. Concluding remarks can be found in Sect. 8. As Supplementary Information we include the R code file to reproduce the results and figures, and a supplementary document with details on (additional) structures and structuredness indices, the nature of Stoploss and normalization of badness-of-fit.
2 The STOPS framework
In proximity scaling methods we start from a given matrix \(\varvec{\Delta }\) of pairwise symmetric proximities between objects \(i, j; i, j=1,\dots ,N\), with individual entries \(\delta _{ij}\). We will assume the proximities to take on a minimum when two observations are equal, i.e., dissimilarities. The main diagonal of \(\varvec{\Delta }\) is 0.
Let \(\hat{\varvec{\Delta }}\) be the result of transformations applied to the entries of the proximity matrix \(\varvec{\Delta }\), where \(\hat{\varvec{\Delta }}=T_\Delta (\varvec{\Delta })=T_\Delta (\varvec{\Delta }\vert \theta _\Delta )\), with individual entries \(\hat{\delta }_{ij}\). Let \(\varvec{X}\) denote an \(N \times M\) matrix (the configuration) of lower dimension \(M<N\) (mostly \(M \ll N\)) from which the matrix \(\varvec{D}(\varvec{X})\) comprising untransformed pairwise distances \(d_{ij}(\varvec{X})\) between objects (row vectors) in \(\varvec{X}\) can be derived. The matrix \(\hat{\varvec{D}}(\varvec{X})=T_{D}(\varvec{D}(\varvec{X}))=T_D(\varvec{D}(\varvec{X})\vert \theta _D)\) comprises transformed pairwise distances between objects in \(\varvec{X}\), with individual entries \(\hat{d}_{ij}(\varvec{X})\). We call \(T_{\Delta }: (\varvec{\Delta },\theta _\Delta ) \mapsto \hat{\varvec{\Delta }}\) a proximity transformation function and \(T_D: (\varvec{D}(\varvec{X}),\theta _D) \mapsto \hat{\varvec{D}}(\varvec{X})\) a distance transformation function. Some PS models also allow different weights either given a priori as an input weight matrix \(\varvec{W}\) with elements \(w_{ij}\), or as transformed values based on the input weights. In case of the latter this is a transformed weight matrix \(\hat{\varvec{W}}=T_W(\varvec{W}\vert \theta _W)\), with elements \(\hat{w}_{ij}\) for the weight transformation function \(T_W: (\varvec{W},\theta _W) \mapsto \hat{\varvec{W}}\). The combined hyperparameter vector of all these transformations is \(\varvec{\theta }=(\theta _\Delta ,\theta _D,\theta _W)^\top \).
We search for an optimal configuration that allows one to reconstruct the matrix \(\hat{\varvec{\Delta }}\) as well as possible from \(\hat{\varvec{D}}(\varvec{X})\), i.e., we want \(\hat{\varvec{D}}(\varvec{X}) \approx \hat{\varvec{\Delta }}\). This is achieved by minimizing a measure of badness-of-fit \(\sigma _{\text {PS}}(\varvec{X} \vert \varvec{\theta }) = \mathcal {L}(\hat{\varvec{\Delta }}=T_\Delta (\varvec{\Delta }\vert \theta _\Delta ),\hat{\varvec{D}}(\varvec{X})=T_D(\varvec{D}(\varvec{X})\vert \theta _D),\hat{\varvec{W}}=T_W(\varvec{W}\vert \theta _W))\), where \(\mathcal {L}\) denotes a loss function.
Minimizing the badness-of-fit criterion in PS means finding—for given \(\varvec{\theta }\)—the optimal configuration \(\varvec{X}^*\) out of all possible \(\varvec{X}\) as

\(\varvec{X}^*(\varvec{\theta }) = \arg \min _{\varvec{X}} \sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta }) \qquad (1)\)
Many measures of badness-of-fit with different types of transformations and hyperparameters have been proposed (see Sects. 3 and 4). Our framework covers all of them; we give a loose taxonomy of the most popular ones in Sect. 3. Concrete values of \(\varvec{\theta }\) are typically chosen manually ad hoc but can also be found by optimization; the latter is what we focus on in this article.
Let us assume we are interested in L different structural qualities of \(\varvec{X}\) and that we have L corresponding univariate c-structuredness indices \(I_l(\varvec{X}\vert \varvec{\gamma })\) for the \(l=1,\dots , L\) different structures, capturing the essence of the structural appearance of the configuration with respect to the lth structure. For example, we might be interested in both the structural appearance of how clustered the configuration is (structure 1) and how strongly linearly related the column vectors of the configuration are (structure 2). We then measure the c-structuredness of \(\varvec{X}\) for the two structures with an index for clusteredness and one for linear dependence respectively. The \(\varvec{\gamma }\) are optional metaparameters for the indices, which we assume are given and fixed; they control how c-structuredness is measured. Some structures one might be interested in, along with their c-structuredness indices, will be discussed in Sect. 5 and many more in Appendix A in the supplementary document. We further assume broadly that the transformations \(T_\Delta (\varvec{\Delta }\vert \varvec{\theta })\) and/or \(T_D(\varvec{D}(\varvec{X})\vert \varvec{\theta })\) and/or \(T_W(\varvec{W}\vert \varvec{\theta })\) produce different c-structuredness in \(\varvec{X}\) for different values of \(\varvec{\theta }\).
In a nutshell, our proposal is to select optimal hyperparameters \(\varvec{\theta }^*\) for the scaling procedure by assessing the c-structuredness of an optimal configuration \(\varvec{X}^*\) found from a PS method for given \(\varvec{\theta }\), usually in combination with its badness-of-fit value. We aim at finding a \(\varvec{\theta }^*\) that, when used as transformation parameters in the PS method, will give a configuration that has high (or low) values of the c-structuredness indices. We view this as a multi-objective optimization problem, where we want to maximize/minimize different criteria (either badness-of-fit, or c-structuredness, or both) over \(\varvec{\theta }\). C-structuredness may thus be induced at a possible expense of fit, but we control the amount of that expense.
To formalize this we explicitly write the building blocks of the objective function used for hyperparameter tuning via STOPS as a function of \(\varvec{\theta }\): Let us denote by \(\varvec{X}^*(\varvec{\theta })\) the optimal solution from minimizing a badness-of-fit \(\sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })\) for a fixed \(\varvec{\theta }\), so \(\varvec{X}^*(\varvec{\theta }):= \arg \min _{\varvec{X}} \sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })\). Further we also have the L different univariate indices with possible metaparameters \(\varvec{\gamma }\), \(I_l(\varvec{X}^*(\varvec{\theta })\vert \varvec{\gamma })\), to be optimized for.
Specific variants of STOPS can be instantiated by defining objective functions \(\text {Stoploss}(\varvec{\theta }\vert v_0, \dots , v_L, \varvec{\gamma })\), comprising either badness-of-fit or c-structuredness indices or both in a scalarized combination. Two variants of objective functions—called additive STOPS (aSTOPS) and multiplicative STOPS (mSTOPS) respectively—are of the following form:

\(\text {Stoploss}(\varvec{\theta }\vert v_0, \dots , v_L, \varvec{\gamma }) = v_0 \cdot \sigma _{\text {PS}}(\varvec{X}^*(\varvec{\theta })\vert \varvec{\theta }) + \sum ^L_{l=1} v_l \cdot I_l(\varvec{X}^*(\varvec{\theta })\vert \varvec{\gamma }) \qquad (2)\)

and

\(\text {Stoploss}(\varvec{\theta }\vert v_0, \dots , v_L, \varvec{\gamma }) = \sigma _{\text {PS}}(\varvec{X}^*(\varvec{\theta })\vert \varvec{\theta })^{v_0} \cdot \prod ^L_{l=1} I_l(\varvec{X}^*(\varvec{\theta })\vert \varvec{\gamma })^{v_l} \qquad (3)\)
with \(v_0 \in \mathbb {R}_{\ge 0}\) and \(v_1,\dots ,v_L \in \mathbb {R}\) being the scalarization weights. Numerically, the badness-of-fit function value \(\sigma _{\text {PS}}(\varvec{X}^*(\varvec{\theta })\vert \varvec{\theta })\) needs to be normalized to be scale-free and commensurable for comparability of different values of \(\varvec{\theta }\). We discuss such normalization in Appendix B of the supplementary document. The objective function for aSTOPS is fully compensatory, whereas for mSTOPS it ensures that a normalized badness-of-fit of 0 will always lead to the minimal \(\text {Stoploss}(\varvec{\theta }\vert v_0, \dots , v_L, \varvec{\gamma })\) for a positive value of \(I_l(\cdot )\). For notational convenience, we will refer to the objective functions for STOPS variants by \(\text {Stoploss}(\varvec{\theta })\) for the remainder of the paper.
The \(v_0,\dots ,v_L\) are weights that determine how the badness-of-fit and c-structuredness indices are scalarized^{Footnote 1} for Stoploss. This can be used to set how strongly the criteria are taken into account or to control the trade-off of fit and c-structuredness in determining hyperparameters (typically in a convex combination). For example, we might want to tune by only minimizing badness-of-fit over \(\varvec{\theta }\) (\(v_0=1, v_l=0\)), or find the best configurations and only optimize for structure over \(\varvec{\theta }\) without taking badness-of-fit into account (\(v_0=0, v_l \ne 0\)), or tune so that we relax \(10\%\) of goodness-of-fit for more c-structuredness (\(v_0=0.9\) and \(\sum ^L_{l=1}v_l=0.1\)). A negative (positive) weight for \(v_l\) means that a higher (lower) value for the lth index is favored.
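To make the scalarization concrete, here is a minimal sketch in code. The function names and toy values are our own, and we assume the weights enter the multiplicative variant as exponents:

```python
# Sketch of the two Stoploss scalarizations (illustrative; function names and
# the assumption that weights enter mSTOPS as exponents are ours).

def stoploss_additive(fit, indices, v0, v):
    """aSTOPS: weighted sum of normalized badness-of-fit and the L indices."""
    return v0 * fit + sum(vl * il for vl, il in zip(v, indices))

def stoploss_multiplicative(fit, indices, v0, v):
    """mSTOPS: weighted product; a fit of 0 yields the minimum for positive indices."""
    out = fit ** v0
    for vl, il in zip(v, indices):
        out *= il ** vl
    return out

# Relax 10% of fit for more c-structuredness; a negative v_l favors a high index.
loss = stoploss_additive(0.2, [0.8], v0=0.9, v=[-0.1])
```

Note how a normalized badness-of-fit of 0 makes the multiplicative variant minimal whenever all indices are positive, matching the compensatory/non-compensatory distinction above.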
Using either (2) or (3) for hyperparameter selection, we then need to find

\(\varvec{\theta }^* = \arg \min _{\varvec{\theta } \in \varvec{\Theta }} \text {Stoploss}(\varvec{\theta }) \qquad (4)\)
Accordingly, hyperparameter tuning with STOPS comprises the following building blocks:
1. The PS badness-of-fit loss function \(\sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })\) that allows us to find an optimal \(\varvec{X}^*(\varvec{\theta })\) for given \(\varvec{\theta }\) (see Sect. 3).

2. The transformations employed that depend on the vector \(\varvec{\theta }\), so any of \(T_\Delta (\varvec{\Delta }\vert \varvec{\theta })\), \(T_D(\varvec{D}(\varvec{X})\vert \varvec{\theta })\) and \(T_W(\varvec{W}\vert \varvec{\theta })\) (see Sect. 4).

3. The structures of interest and their c-structuredness indices \(I_l(\varvec{X}^*(\varvec{\theta })\vert \varvec{\gamma })\) (see Sect. 5).

4. Hyperparameter selection via an outer loop: \(\text {Stoploss}(\varvec{\theta })\) is used solely for optimization over the hyperparameters \(\varvec{\theta }\), while \(\varvec{X}^*(\varvec{\theta })\) itself is found conditional on the transformation parameters from \(\sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })\). This allows us to utilize tailored optimization for the main model parameters (the fitted distances) in the badness-of-fit functions.

Our proposal is a formalized hyperparameter selection procedure supplanting the standard workflow of tuning hyperparameters by trying out and comparing different solutions ad hoc. The modular design of the STOPS framework offers great flexibility: it can incorporate many different instances of badness-of-fit functions, transformations and c-structuredness indices, and it provides a computational approach that is generally applicable for continuous and discrete hyperparameter spaces \(\varvec{\Theta }\).
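The profile-type outer loop can be sketched as follows. A plain grid search stands in for the Bayesian optimization recommended later, and `fit_ps` and `structuredness_index` are hypothetical placeholders for the inner PS solver and a c-structuredness index:

```python
# Profile-type outer loop (sketch): the inner PS solver is a black box that
# returns X*(theta) and its normalized badness-of-fit; an additive Stoploss
# with a single index is evaluated only in the outer step.

def tune_stops(thetas, fit_ps, structuredness_index, v0=0.9, v1=0.1):
    """Return the candidate theta minimizing v0*fit - v1*I(X*(theta)).

    fit_ps(theta) -> (X_star, fit) and structuredness_index(X_star) -> float
    are placeholders; the negative sign on v1 favors a high index value.
    """
    best_theta, best_loss = None, float("inf")
    for theta in thetas:
        X_star, fit = fit_ps(theta)                          # inner PS step
        loss = v0 * fit - v1 * structuredness_index(X_star)  # outer Stoploss
        if loss < best_loss:
            best_theta, best_loss = theta, loss
    return best_theta, best_loss
```

In practice the candidate grid would be replaced by the Bayesian optimization strategy discussed in Sect. 6, which needs far fewer (expensive) inner PS fits.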
3 Proximity scaling losses for measuring badness-of-fit
To be more detailed about finding the configuration, recall that the problem proximity scaling solves is to find an \(N \times M\) matrix \(\varvec{X}^*\) by means of a sensible loss criterion \(\sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })=\mathcal {L}(\hat{\varvec{\Delta }}=T_\Delta (\varvec{\Delta }\vert \varvec{\theta }),\hat{\varvec{D}}(\varvec{X})=T_D(\varvec{D}(\varvec{X})\vert \varvec{\theta }), \hat{\varvec{W}}=T_W(\varvec{W}\vert \varvec{\theta }))\) that measures how closely the fitted \(\hat{\varvec{D}}(\varvec{X})\) approximates \(\hat{\varvec{\Delta }}\) (badness-of-fit). We now discuss different loss functions \(\mathcal {L}(\cdot )\) that have proven to be especially popular.
3.1 Quadratic loss: STRESS
One popular type of PS is least squares scaling, which employs the quadratic loss function. This type of loss is usually called Stress (for standardized residual sum of squares, Kruskal 1964). A general formulation of a Stress-type loss function is

\(\sigma _{\text {stress}}(\varvec{X}\vert \varvec{\theta })=\sum _{i<j} \hat{w}_{ij}\left( \hat{d}_{ij}(\varvec{X})-\hat{\delta }_{ij}\right) ^2 \qquad (5)\)
Here, the \(\hat{w}_{ij}\) are finite (transformed) input weights, with \(\hat{w}_{ij}=0\) if a proximity is missing or to be ignored.
The fitted distances in the configuration are usually some type of Minkowski distance

\(d_{ij}(\varvec{X})=\left( \sum ^M_{m=1} \vert x_{im}-x_{jm}\vert ^p\right) ^{1/p} \qquad (7)\)

with \(p > 0\), typically Euclidean, so \(p=2\). Stress losses need to be minimized iteratively. Popular algorithms for finding the minimum of (5) are majorization (De Leeuw 1977) or gradient descent (Buja and Swayne 2002).
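As a minimal numerical sketch of this recipe (our own illustration: plain raw Stress with Euclidean distances and unit weights, minimized by a crude analytic-gradient descent rather than the optimized majorization solvers cited above):

```python
import numpy as np

def raw_stress(X, Delta):
    """Raw Stress: sum of squared residuals between Euclidean configuration
    distances and dissimilarities over pairs i < j (unit weights)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(X.shape[0], k=1)
    return ((D[iu] - Delta[iu]) ** 2).sum()

def mds_gradient_descent(Delta, M=2, steps=2000, lr=0.005, seed=1):
    """Minimize raw Stress by plain gradient descent from a random start."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(Delta.shape[0], M))
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]
        D = np.sqrt((diff ** 2).sum(-1))
        np.fill_diagonal(D, 1.0)     # avoid division by zero on the diagonal
        R = D - Delta
        np.fill_diagonal(R, 0.0)     # the diagonal carries no residual
        # gradient of raw Stress w.r.t. each row of X
        X = X - lr * 2.0 * ((R / D)[:, :, None] * diff).sum(axis=1)
    return X
```

Running this on the distance matrix of a known point set drives the Stress toward zero, illustrating why (5) must be handled iteratively rather than in closed form.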
3.2 Approximation by inner product: STRAIN
The second popular type is the Strain loss function (Torgerson 1958). Here, \(\varvec{\Delta }\) is transformed to \(\hat{\varvec{\Delta }}\) so that \(T_\Delta (\varvec{\Delta }\vert \theta _\Delta )=(h\circ s)(\varvec{\Delta }\vert \varvec{\theta }_\Delta )\), where s is a function parametrized by \(\theta _\Delta \) and \(h(\cdot )\) is the double centering operation (i.e., for a matrix \(\varvec{A}\), \(h(\varvec{A})=\varvec{A}-\varvec{A}_{i.}-\varvec{A}_{.j}+\varvec{A}_{..}\), where \(\varvec{A}_{i.}, \varvec{A}_{.j}, \varvec{A}_{..}\) are matrices consisting of the row, column and grand means respectively). Subsequently \(\hat{\varvec{\Delta }}\) is approximated by the inner product matrix of \(\varvec{X}\), so, \(\hat{\varvec{\Delta }} \approx \varvec{X}\varvec{X}^\top \).
In the context of Strain we always assume that \(T_\Delta (\cdot )\) is a composite function of the double centering operation and some other parametrized function \(s(\cdot \vert \theta _\Delta )\), so we can express Strain as

\(\sigma _{\text {strain}}(\varvec{X}\vert \varvec{\theta })=\sum _{i,j}\left( \hat{\delta }_{ij}-\langle \varvec{x}_i,\varvec{x}_j\rangle \right) ^2 \qquad (8)\)

if \(d_{ij}(\varvec{X})\) is as in (7) and \(\hat{\delta }_{ij}=h(s_{ij}(-\delta _{ij}^2/2\vert \varvec{\theta }))\). Strain losses can usually be solved by an eigenvalue decomposition.
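When s is the identity, this is classical scaling, which the following sketch (our own minimal illustration) solves via double centering and an eigendecomposition:

```python
import numpy as np

def classical_mds(Delta, M=2):
    """Classical scaling: double-center -Delta**2/2 and use the top-M
    eigenpairs of the resulting inner-product matrix."""
    n = Delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = J @ (-(Delta ** 2) / 2.0) @ J     # double centering h(-Delta^2/2)
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    vals, vecs = vals[::-1][:M], vecs[:, ::-1][:, :M]
    # scale top eigenvectors by sqrt of (non-negative) eigenvalues
    return vecs * np.sqrt(np.clip(vals, 0.0, None))
```

If Delta holds Euclidean distances of some M-dimensional point set, the embedding reproduces those distances exactly up to rotation, reflection and translation, which is why no iteration is needed.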
3.3 Repulsion and attraction: energy model
Another way of interpreting the PS objective is related to energy models with a pairwise attraction part \(\propto \hat{d}_{ij}(\varvec{X})^\nu \) and a pairwise repulsion part \(\propto \hat{\delta }_{ij}\hat{d}_{ij}(\varvec{X})\) between objects (Chen and Buja 2013). This is

\(\sigma _{\text {energy}}(\varvec{X}\vert \varvec{\theta })=\sum _{i<j} \hat{w}_{ij}\left( a\,\hat{d}_{ij}(\varvec{X})^\nu - b\,\hat{\delta }_{ij}\hat{d}_{ij}(\varvec{X})\right) \qquad (9)\)

with a, b being some constants. For \(\nu =2, a=1\) and \(b=2\) this is (5) where terms depending solely on \(\hat{\varvec{\Delta }}\) are disregarded for finding \(\varvec{X}\).
4 Parametrized transformations
Many PS methods allow transforming the input dissimilarities and/or the fitted distances and/or the input weights. In this section, we discuss popular transformations that are employed with the above-mentioned proximity scaling losses.^{Footnote 2} Note that our list is not meant to be exhaustive.
We consider possible parameter vectors \(\varvec{\theta } \subseteq \{\theta _\Delta ,\theta _D,\theta _W\}\) for the transformations \(T_\Delta , T_D, T_W\) for proximities, fitted distances or weights. These are the hyperparameters we later want to tune.
4.1 Transforming observed proximities
A simple and very flexible approach is to transform the input proximities, i.e., using \(T_\Delta (\varvec{\Delta })\). The advantage is that this can easily be implemented for, and in principle applied with, all PS methods. Often it is also possible to use \(T_W(\varvec{W})\).
4.1.1 Metric scaling transformations of proximities
In metric scaling one applies a parametric bijective strictly monotonic transformation to the proximities.
With specific choices for \(T_\Delta (\cdot )\) one can express many popular PS versions, including absolute MDS with \(\hat{\delta }_{ij}=\delta _{ij}\), ratio MDS with \(\hat{\delta }_{ij}=\vert b\vert \delta _{ij}\), interval MDS with \(\hat{\delta }_{ij}=a+b\delta _{ij}\ge 0\), logarithmic MDS with \(\hat{\delta }_{ij}=a+b \log (\delta _{ij})\), and exponential MDS with \(\hat{\delta }_{ij}=a+b \exp (\delta _{ij})\) (for all of these, see Borg and Groenen 2005), power MDS with \(\hat{\delta }_{ij} =\delta _{ij}^{\lambda }\) (e.g., Buja and Swayne 2002), or, by additionally setting the \(\hat{w}_{ij}\), instances of metric scaling that use a priori inverse weighting with the observed proximities, e.g., setting \(\hat{w}_{ij}=\delta ^{-1}_{ij}\) (Sammon mapping, Sammon 1969) or \(\hat{w}_{ij}=\delta ^{-2}_{ij}\) (elastic scaling, McGee 1966), or curvilinear component analysis (Demartines and Herault 1997) for a hyperparameter \(\theta _W=\tau \) with \(\hat{\varvec{W}}=T_W(\hat{\varvec{D}}(\varvec{X})\vert \theta _W)\) being a bounded and monotonically nonincreasing function, e.g., \(\hat{w}_{ij}=1\) if \(\hat{d}_{ij}(\varvec{X})\le \tau \) and 0 otherwise.
These transformations are governed by the appropriate hyperparameters \(\theta _\Delta \) and/or \(\theta _W\) to yield the models.
4.1.2 Geodesic transformation of proximities
A popular method for manifold learning is Isomap (De’ath 1999; Tenenbaum et al. 2000), which is a PS method (originally Strain-type) with the \(\hat{\delta }_{ij}\) being the geodesic distance between objects i, j as imposed by a weighted graph. These proximities are defined as the sum of edge lengths along the shortest path \(\text {SP}(i,j\vert G(k))\) between two objects in a neighborhood graph G(k) for a given parameter k (the number of nearest neighbors), where the objects are the vertices, so \(\hat{\delta }_{ij}=\text {SP}(i,j\vert G(k))\). Alternatively, one can define the neighborhood graph in terms of an \(\varepsilon \)-radius, so \(\hat{\delta }_{ij}=\text {SP}(i,j\vert G(\varepsilon ))\).
The transformation is governed by \(\theta _\Delta =k\) or \(\theta _\Delta =\varepsilon \) as the hyperparameter that defines the neighborhood graph.
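A sketch of the geodesic transformation (our own illustration: a symmetrized k-nearest-neighbor graph followed by Floyd-Warshall shortest paths; production Isomap implementations typically use Dijkstra on sparse graphs, and we assume each object's smallest dissimilarity is its zero self-dissimilarity):

```python
import numpy as np

def geodesic_proximities(Delta, k):
    """Shortest-path lengths SP(i, j | G(k)) on the symmetrized k-nearest-
    neighbor graph built from the dissimilarity matrix Delta."""
    n = Delta.shape[0]
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nn = np.argsort(Delta[i])[1:k + 1]  # k nearest neighbors (position 0 is i itself)
        G[i, nn] = Delta[i, nn]
        G[nn, i] = Delta[nn, i]             # symmetrize the neighborhood graph
    for m in range(n):                      # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, m][:, None] + G[m, :][None, :])
    return G
```

For points on a curved manifold the resulting \(\hat{\delta }_{ij}\) exceed the direct dissimilarities, because paths are forced through the neighborhood graph; pairs in disconnected components stay at infinity, which is why the choice of k matters.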
4.2 Transforming observed proximities and fitted distances
One may also transform the distances that get fit in the configuration, that is, applying a distance transformation function \(T_D(\varvec{D}(\varvec{X}))\) to the fitted distances. This is more complicated than simply transforming the input proximities—the whole fitting process has to be adapted to accommodate the transformation of fitted distances. It is then natural to apply proximity transformation functions, distance transformation functions and/or weight transformations simultaneously, using any combination of \(T_D(\varvec{D}(\varvec{X}))\), \(T_\Delta (\varvec{\Delta })\) and \(T_W(\varvec{W})\). This allows for a rich class of models with possible parameter vectors \(\varvec{\theta } \subseteq \{\theta _\Delta ,\theta _D,\theta _W\}\), corresponding to the hyperparameters for the transformations of proximities, fitted distances and weights.
4.2.1 Power transformations for proximities and distances
Employing power transformations on fitted distances and proximities is often done (e.g., Ramsay 1982; Buja and Swayne 2002; Buja et al. 2008; Groenen and De Leeuw 2010).
A general instance of a PS type with power-transformed fitted distances is r-stress (De Leeuw 2014), the transformation being \(\hat{d}_{ij}(\varvec{X}) = d_{ij}(\varvec{X})^{2r}\) with \(r \in \mathbb {R}_+\). A number of stress versions can be expressed as special or limiting cases of r-stress, including raw Stress (\(r=0.5\), Kruskal 1964), s-stress (\(r=1\), Takane et al. 1977) and maximum likelihood MDS (Ramsay 1977) for \(r \rightarrow 0\). Here \(\theta _D=r\).
It is straightforward to extend this to use power functions as the proximity transformation function, distance transformation function and weight transformation function simultaneously (POST-MDS), so \(\hat{d}_{ij}(\varvec{X})=d_{ij}(\varvec{X})^\kappa \), \(\hat{\delta }_{ij}=\delta ^\lambda _{ij}\), and \(\hat{w}_{ij}=w_{ij}^\rho \) with \(\lambda , \rho \in \mathbb {R}, \kappa \in \mathbb {R}_+\). Inserted into (5) this is called power stress or p-stress in the literature. It subsumes many metric MDS models. Here \(\varvec{\theta }\) is a three-dimensional parameter vector, \(\varvec{\theta }=(\theta _{\Delta },\theta _D,\theta _W)^\top =(\lambda ,\kappa ,\rho )^\top \).
This encompasses Sammon- or elastic-scaling-type models by using \(\hat{w}_{ij}=\hat{\delta }_{ij}^\rho \) for the appropriate \(\rho \), and can be turned into curvilinear component analysis-type models with \(\hat{\varvec{W}}=T_W(\hat{\varvec{D}}(\varvec{X})\vert \tau )\) being a bounded and monotonically decreasing function, e.g., \(\hat{w}_{ij}=\mathbb {1}(\hat{d}_{ij}(\varvec{X}) \le \tau )\), where \(\varvec{\theta }=(\kappa ,\lambda ,\tau )^\top \).
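As a sketch of how these power transformations enter the stress sum (5), here is a single pair's p-stress contribution (our own illustration with toy values):

```python
# One pair's contribution to p-stress: power-transform the fitted distance,
# the proximity and the weight, then take the weighted squared residual.

def pstress_term(d_ij, delta_ij, w_ij, kappa, lam, rho):
    d_hat = d_ij ** kappa        # transformed fitted distance
    delta_hat = delta_ij ** lam  # transformed proximity
    w_hat = w_ij ** rho          # transformed weight
    return w_hat * (d_hat - delta_hat) ** 2
```

With kappa = lam = rho = 1 this reduces to the plain metric stress term, which is the sense in which p-stress subsumes metric MDS.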
4.2.2 Box–Cox transformations for proximities and distances
Chen and Buja (2013) propose Box–Cox transformations on observed proximities and on fitted distances in an energy badness-of-fit formulation (9). For complete data matrices it is a three-parameter energy-type MDS family, BC-MDS, \(\sigma _{\text {BC}}(\varvec{X}\vert \varvec{\theta })\):
with \(\mu , \rho \in \mathbb {R}\) and \(\lambda \in \mathbb {R}_+\). Here \(BC_\alpha \) is the one-parameter Box–Cox transformation (Box and Cox 1964) with parameter \(\alpha \),

\(BC_{\alpha }(x)={\left\{ \begin{array}{ll} \frac{x^{\alpha }-1}{\alpha } &{} \text {if } \alpha \ne 0\\ \log (x) &{} \text {if } \alpha = 0. \end{array}\right. }\)
Note that here the distance transformations used in the attraction and repulsion part are not equal as \(\lambda > 0\).
The hyperparameter vector is \(\varvec{\theta }=(\mu ,\lambda ,\rho )^\top \).
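The one-parameter Box–Cox transformation, written out directly (standard definition; the log branch is the continuous \(\alpha \rightarrow 0\) limit):

```python
import math

def box_cox(x, alpha):
    """One-parameter Box-Cox transform of x > 0: (x**alpha - 1)/alpha,
    with log(x) as the continuous limit at alpha = 0."""
    if alpha == 0:
        return math.log(x)
    return (x ** alpha - 1.0) / alpha
```

The continuity at zero is what makes the family usable for smooth hyperparameter search over \(\mu , \lambda , \rho \).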
4.2.3 Sets of local neighborhood
Yet another idea can be expressed with parametrized transformations: local MDS (lMDS, Chen and Buja 2009). Let \(N_k\) define the symmetric set of nearby pairs of points (i, j) such that \((i,j)\in N_k\) iff i is among the k-nearest neighbors of j or vice versa. Let \(\delta _{\infty }\) be a large constant “imputed” dissimilarity and w a small weight, e.g., \(w \approx 1/\delta _{\infty }\) as in the standard lMDS objective.
The lMDS objective can be expressed in our framework as (5) with the transformations

\(\hat{\delta }_{ij}={\left\{ \begin{array}{ll} \delta _{ij} &{} \text {if } (i,j) \in N_k\\ \delta _{\infty } &{} \text {if } (i,j) \notin N_k \end{array}\right. }\)

and

\(\hat{w}_{ij}={\left\{ \begin{array}{ll} 1 &{} \text {if } (i,j) \in N_k\\ w &{} \text {if } (i,j) \notin N_k. \end{array}\right. }\)
For the standard lMDS objective this can be reduced to a version with a free hyperparameter \(\tau =2w\delta _{\infty }\) for a given k, as well as the k itself (Chen and Buja 2009); hence \(\varvec{\theta }=(k,\tau )^\top \).
5 Structures, c-structuredness and indices
Central to our proposal for hyperparameter tuning in PS models is the concept of c-structuredness, as c-structuredness is often desirable from an applied point of view. We informally defined c-structuredness as the adherence of the arrangement of objects in a configuration \(\varvec{X}\) to a predefined notion of what constitutes structure. How much a structure in question is present in \(\varvec{X}\) is expressed as the amount of c-structuredness: The higher the c-structuredness, the clearer the structure is present.
Naturally, there is a very high number of possible structures. Some examples of c-structuredness that we deem to be of particular interest are given in Fig. 1 with index values as formalized below and in Appendix A in the supplementary document. The c-structuredness types we single out here are (i) c-regularity (objects arranged on a regular grid), (ii) c-association (any (non)linear association), (iii) c-clusteredness (objects arranged in clusters), (iv) c-linearity (objects arranged on a straight line), (v) c-functionality (objects arranged on a smooth line) and (vi) c-manifoldness (objects arranged so that they resemble a manifold).
For STOPS we need to quantify information about the c-structuredness present in \(\varvec{X}\). We do this with univariate c-structuredness indices that capture the essence of a particular structure in \(\varvec{X}\). The indices should be numerically high (low) the more (less) of a given structure we find. To illustrate, for c-linearity and \(M=2\), we may use the absolute value of the Pearson correlation coefficient for the columns of \(\varvec{X}\) as a c-structuredness index, which is 1 when there is a perfect noise-free linear relationship or 0 when there is no linear relationship.
We aim at indices that capture the essence of a particular structure, depend on the arrangement of objects in \(\varvec{X}\), are bounded from above and below, i.e., have unique finite minima and maxima, and are nonnegative. In what follows we list three examples of structures in a configuration as illustration. For each of these there is an index that captures the c-structuredness for that structure; these are also used in the examples.
We generally write c-structuredness indices as \(\text {I}_{\text {c-structuredness}}(\varvec{X} \vert \varvec{\gamma })\),^{Footnote 3} which means \(\text {I}_{\text {c-structuredness}}(\cdot )\) is an index that reflects c-structuredness as a function of \(\varvec{X}\), possibly depending on further index meta-parameters \(\varvec{\gamma }\) which are assumed to be given.
5.1 C-clusteredness and c-regularity
These structures are concerned with how clustered the configuration appears in the sense of Rusch et al. (2018). The concept essentially captures where \(\varvec{X}\) falls on a continuum between unclustered and maximally clustered.
Following Rusch et al. (2018), for a minimum number c of points that must comprise a cluster we denote with \(O(\varvec{X} \vert c)=(\varvec{x}_{o(i)})_{o(i) = 1, \ldots , N}\) an ordering of the N original row vectors \(\varvec{x}_i, (i=1, \ldots ,N)\) in \(\varvec{X}\); \(O(\varvec{X} \vert c)\) is a permutation of the rows of \(\varvec{X}\). The position of object \(\varvec{x}_i\) in the ordering \(O(\varvec{X} \vert c)\) is \(o(i)\) (which depends on c, but we drop this for readability). \(O(\varvec{X} \vert c)\) can be obtained by the algorithm OPTICS (Ankerst et al. 1999), which provides the bijective algorithmic mapping \(o: \{1,\ldots ,N\} \rightarrow \{1,\ldots ,N\}\). OPTICS also augments \(O(\varvec{X} \vert c)\) with each object’s so-called representative reachability distance \(r^*_{o(i)}\), which cannot be expressed in closed form.
For the index of c-clusteredness we then use the normalized OPTICS Cordillera, \(\text {OC}'(\varvec{X})\) (Rusch et al. 2018),
where \(\varvec{\gamma }=(c,\epsilon ,q,d_{\max })^\top \) collects the four free meta-parameters of the OPTICS Cordillera. Here \(q \ge 1\) designates a q-norm, \(\epsilon \) is the maximum radius around each point to look for neighbors and \(d_{\max }\) denotes a maximum reference distance, \(\min d_{ij} \le d_{\max } \le \epsilon \), that winsorizes the \(r^*_{o(i)}\) for robustness. Apart from c the meta-parameters are optional; we suggest fixing them as \(d_{\max }=1.5 \max (d_{ij}(\varvec{X}^0))\), \(\epsilon =2 \max _{i,j}(\max (\delta _{ij},d_{ij}(\varvec{X}^0)))\) and q as the norm of the target space (\(\varvec{X}^0\) refers to the Strain solution for untransformed proximities). The larger this index value, the more c-clusteredness we find. See the middle left panel in Fig. 1 for an example.
The \(\text {OC}'(\varvec{X})\) takes on its minimum value when for each point the distance to the \(c\)th neighbor is constant (Rusch et al. 2018). We can use this to fashion an index for c-regularity, a structure where the observations are arranged so that all nearest neighbor distances are equal. We fix the parameter c to 2 (nearest neighbor) and \(q=1\) and therefore use
\(\text {I}_{\text {c-regularity}}(\varvec{X} \vert \varvec{\gamma }) = 1-\text {OC}'(\varvec{X}\vert c=2,q=1,\epsilon ,d_{\max }),\)
with \(\varvec{\gamma }=(\epsilon ,d_{\max })^\top \). This index will be 1 if the point arrangement is perfectly regular and 0 if perfectly clustered with respect to N, \(c=2\) and \(d_{\max }\). See the top left panel in Fig. 1.
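The intuition behind the index (all nearest-neighbor distances equal means perfect regularity) can be conveyed with a much cruder stand-in: one minus the coefficient of variation of nearest-neighbor distances. This is NOT the OPTICS-Cordillera-based index above; the function name and construction are ours, for illustration only:

```python
import numpy as np

def nn_regularity(X):
    """Crude c-regularity surrogate: 1 minus the coefficient of variation
    of nearest-neighbour distances. Equals 1 exactly when all nearest-
    neighbour distances coincide (a perfectly regular arrangement) and
    drops as points clump together. Not the OPTICS Cordillera index."""
    X = np.asarray(X, dtype=float)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)
    nn = D.min(axis=1)   # distance from each point to its nearest neighbour
    return 1.0 - nn.std() / nn.mean()

# On a 5 x 5 unit grid every nearest-neighbour distance is 1.
grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
print(nn_regularity(grid))  # 1.0
```

A clustered configuration, by contrast, has strongly varying nearest-neighbor distances and yields a noticeably smaller value.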
Using these indices with negative weights in Stoploss will favor parameters that give configurations with high c-clusteredness/c-regularity, whereas positive weights favor low c-clusteredness/c-regularity.
5.2 C-manifoldness
This structure captures how close almost arbitrary, real-valued transformed projections of columns of \(\varvec{X}\) lie to a 1D manifold (a line) in the transformed space.
We use the maximal correlation coefficient (Gebelein 1941; Sarmanov 1958) for this; the higher the coefficient, the stronger the relationship. For a sample, the maximal correlation can be calculated approximately with the alternating conditional expectations algorithm (ACE, Breiman and Friedman 1985). Let \(x_m\) denote the mth column vector of \(\varvec{X}, (m=1,\dots ,M)\). To obtain the sample version, we take the pairs \((x_{ik}, x_{il}), i=1,\dots ,N\), input them into the ACE algorithm and use the output, denoted \(\text {ACE}(x_k,x_l)\), to construct a c-structuredness index for a given \(\varvec{X}\).
We then use an aggregation function \(\phi _{k,l}(\cdot )\) (e.g., the maximum) of the ACE value between any two different columns \(x_k,x_l; k \ne l\), so
\(\text {I}_{\text {c-manifoldness}}(\varvec{X}) = \phi _{k,l}\left( \text {ACE}(x_k,x_l)\right) .\)
An example of such a relationship is given in the bottom right of Fig. 1. As before, a negative weight for this index in Stoploss would favor \(\varvec{\theta }\) that provide higher index values and positive weights lower index values.
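The ACE idea can be sketched compactly. The following rough sketch alternately replaces each variable's score by the conditional mean of the other's score within quantile bins; Breiman and Friedman (1985) use data-driven smoothers instead of fixed bins, so this simplification is ours and for illustration only:

```python
import numpy as np

def ace_maxcor(x, y, bins=10, iters=50):
    """Rough sketch of maximal correlation via the ACE idea: alternately
    set each variable's score to the conditional mean of the other's
    score within quantile bins, then report the correlation of the final
    scores. A binned simplification of Breiman & Friedman's algorithm."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    bx = np.digitize(x, np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1]))
    by = np.digitize(y, np.quantile(y, np.linspace(0, 1, bins + 1)[1:-1]))
    theta = (y - y.mean()) / y.std()                  # initial score of y
    for _ in range(iters):
        # phi(x) = E[theta(y) | x], estimated by x-bin means
        m = np.bincount(bx, theta, bins) / np.maximum(np.bincount(bx, None, bins), 1)
        phi = m[bx]
        # theta(y) = E[phi(x) | y], estimated by y-bin means, re-standardized
        m = np.bincount(by, phi, bins) / np.maximum(np.bincount(by, None, bins), 1)
        theta = m[by]
        theta = (theta - theta.mean()) / theta.std()
    return abs(np.corrcoef(phi, theta)[0, 1])

# A noisy parabola: near-zero Pearson correlation, high maximal correlation.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 2000)
y = x**2 + 0.01 * rng.normal(size=2000)
print(ace_maxcor(x, y))  # close to 1, unlike |Pearson correlation|
```

The index from the text would then aggregate over column pairs, e.g., with the maximum as \(\phi _{k,l}\): `max(ace_maxcor(X[:, k], X[:, l]) for k in range(M) for l in range(k))`.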
These are instances of the c-structuredness concept as used in Sect. 7. Details on these and more structures that may be of interest can be found in Appendix A in the supplementary document. We are quick to point out that other structures/indices are conceivable and can easily be incorporated in our framework, e.g., the ones from France and Carroll (2007).
6 Optimization for STOPS
Our approach to searching for the hyperparameters builds on an idea in Rusch et al. (2021). We view hyperparameter tuning via STOPS as a multiobjective optimization problem of either minimizing badness-of-fit, or maximizing/minimizing c-structuredness indices, or both over \(\varvec{\theta }\). We choose a scalarization approach leading to variants of \(\text {Stoploss}(\varvec{\theta })\) as in (2) and (3) and the resulting optimization problem (4).
Minimizing any instance of (4) as a function of \(\varvec{\theta }\) is a difficult, analytically intractable problem. There are several reasons for this: First, concrete instances of Stoploss result from combining any badness-of-fit measure with any number of different c-structuredness indices, which leads to a high number of possible concrete instances. These concrete instances need not share structural similarities that can be exploited for optimization. Second, the optimization surface over \(\varvec{\theta }\) can be difficult for different STOPS models in that it is typically nonconvex, may have local optima, may have (removable or jump) discontinuities and may be smooth or nonsmooth to varying degrees over regions of the support of \(\varvec{\theta }\). Third, \(\varvec{\theta }\) may be continuous or discrete. All of these problems are empirically illustrated for a typical Stoploss (the one from Sect. 7.1) in Appendix C in the supplement, where we also elaborate on the nature of Stoploss.
We therefore aim at a general purpose approach that can deal with the intractability and thus be used for all conceivable Stoplosses. To that end we approach the optimization problem with the nested profile-type strategy laid out in Algorithm 1, which only employs function evaluations; u denotes the iteration step, and we do not list the given \(\varvec{\Delta }\) and \(v_0,\dots ,v_l\) for \(\text {Stoploss}(\varvec{\theta })\) in the algorithm. \(\varvec{\Theta }\) denotes the feasible set of \(\varvec{\theta }\) (e.g., box constraints). The \(\sigma _{\text {PS}}\left( \varvec{X}^*(\varvec{\theta })\vert \varvec{\theta } \right) \) employed should be scale- and unit-free, e.g., normalized to lie between zero and one (see Appendix B in the supplementary document).
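The nested profile-type strategy can be sketched as follows. All names here are hypothetical stand-ins: `solve_ps` represents the inner PS solve, the weights `v` and index functions correspond to the additive scalarization, and a plain grid over candidates replaces the candidate generator discussed next:

```python
import numpy as np

def stoploss(theta, solve_ps, indices, v):
    """Additive Stoploss: v[0] * badness-of-fit + sum of v[l] * index_l(X*).
    `solve_ps` stands in for the inner PS solve of Algorithm 1: given theta
    it returns the optimal configuration X*(theta) together with its
    normalized badness-of-fit. All names here are hypothetical."""
    X_star, fit = solve_ps(theta)
    return v[0] * fit + sum(vl * idx(X_star) for vl, idx in zip(v[1:], indices))

def profile_search(candidates, solve_ps, indices, v):
    """Outer step: evaluate Stoploss at each proposed theta and keep the
    best. Plain grid search as candidate generator; for costly Stoplosses
    Bayesian optimization is the recommended generator instead."""
    values = [stoploss(t, solve_ps, indices, v) for t in candidates]
    u = int(np.argmin(values))
    return candidates[u], values[u]

# Toy inner solver: pretend the fit is worst away from theta = 2.
toy_solve = lambda t: (np.zeros((3, 2)), abs(t - 2.0))
print(profile_search([0.0, 1.0, 2.0, 3.0], toy_solve, [], [1.0]))  # (2.0, 0.0)
```

The key design point is the nesting: the inner solver is an unmodified PS routine, and the outer step only ever sees function values of Stoploss.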
To generate candidates for \(\varvec{\theta }_u\) in Step 3 of Algorithm 1, different general purpose strategies for global optimization like random search, grid search and/or derivative-free metaheuristics can in principle be employed. We have had good experiences with particle swarm optimization (Eberhart and Kennedy 1995) but also with random search and the adaptive Luus-Jaakola pattern search (Luus and Jaakola 1973; Rusch et al. 2021) for a moderate number of iterations (around 100), so these strategies can be recommended.
That said, one of the contributions of this article is to suggest a generally applicable default strategy that can be used with, and promises to work well for, every conceivable instance of Stoploss. Working well here means suggesting candidates for \(\varvec{\theta }\) that lead to relatively low values of the Stoploss objectives in an efficient manner (i.e., for a relatively small number of iterations), as each function evaluation in STOPS can be very costly (especially so for transformations of the \(\varvec{D}(\varvec{X})\)) due to having to solve a PS problem. To establish a general purpose strategy, we empirically investigated a number of solvers, including simulated annealing (Kirkpatrick et al. 1983), random search, the adaptive Luus-Jaakola algorithm, particle swarm optimization, and different versions of Bayesian optimization (BO, Mockus 1989) on different Stoplosses and different data sets (not shown).
What emerged as a generally applicable strategy that worked well in most cases (in the sense of suggesting candidates with lower “minima” than its competitors for a small number of iterations of around 10-20) was Bayesian optimization with a treed Gaussian process with jumps to the limiting linear model as the surrogate model (TGPLLM, Gramacy and Lee 2008), which we elaborate on below. This strategy not only worked well for efficiently finding good candidates for \(\varvec{\theta }\) in all the data instances we used it for, but is also theoretically able to accommodate the aforementioned structural difficulties of Stoploss (see also Appendix C in the Supplement). We note that particle swarm optimization has often performed on par and sometimes better for a higher number of iterations.
Optimizing Stoploss with Bayesian optimization When investigating different general purpose solvers for optimization in STOPS we found that Bayesian optimization (BO, Mockus 1989) lends itself well as an out-of-the-box solution for suggesting candidates for \(\varvec{\theta }\) in Step 3 of Algorithm 1.
The basic idea in Bayesian optimization is to approximate the unknown Stoploss surface with a flexible surrogate model (“prior”). The surrogate model identifies areas for exploration where we can expect improvement of the objective by maximizing an acquisition function over the surrogate surface. From these areas new candidate(s) for \(\varvec{\theta }\) are sampled and the objective function is evaluated at the new candidate(s) (“data”). Then the surrogate model is updated (“posterior”) to reflect this new information, the acquisition function of the refitted surrogate model is again maximized and the whole process repeats. Hence in each iteration a new candidate is chosen that is expected to improve the objective of interest based on the available information, thus trading off exploitation and exploration in an efficient way.
This approach works well for STOPS due to three aspects: First, BO needs only function evaluations, so the modularity of STOPS and the lack of exploitable structure is no hindrance. Second, BO is competitive in situations where the parameter vector is low-dimensional (e.g., Siivola et al. 2021), as is the case for all the \(\varvec{\theta }\) we outlined in Sect. 4 (with at most three elements). Third, fitting the surrogate model and optimizing the acquisition function may be less expensive than evaluating the objective function, as the cost of finding a configuration in PS can be quite high; BO can then dramatically reduce the number of evaluations necessary to get close to a global optimum.
To describe the BO approach more formally, we have the unknown objective function \(\text {Stoploss}(\varvec{\theta }): \varvec{\Theta } \rightarrow \mathbb {R}\). For notational convenience we write \(\mathcal {Y}:=\text {Stoploss}(\varvec{\theta })\) for the function and \(\mathcal {Y}_u:=\text {Stoploss}(\varvec{\theta }_u)\) for the \(u\)th evaluation of the function at \(\varvec{\theta }_u\). We have a surrogate model for the objective, \(\mathcal {M}(\varvec{\theta },\mathcal {Y},\iota )\), with possible surrogate model meta-parameters \(\iota \), fitted to the sequence of U pairs \(\{\varvec{\theta }_u,\mathcal {Y}_u\}_{u=1}^{U}\). We also have the acquisition function \(\Omega \left( \varvec{\theta }\vert \{\varvec{\theta }_u,\mathcal {Y}_u\},\mathcal {M}(\varvec{\theta },\mathcal {Y},\iota )\right) \). We look for
\(\varvec{\theta }_{u+1} = \arg \max _{\varvec{\theta } \in \varvec{\Theta }} \Omega \left( \varvec{\theta }\vert \{\varvec{\theta }_u,\mathcal {Y}_u\},\mathcal {M}(\varvec{\theta },\mathcal {Y},\iota )\right) ,\)
which constitutes the new candidate. Then \(\mathcal {Y}_{u+1}\) is evaluated and the model \(\mathcal {M}\) is updated with the new point \(\{\varvec{\theta }_{u+1},\mathcal {Y}_{u+1}\}\), yielding the “posterior”,
Then the acquisition function gets maximized for the updated data and the whole process repeats until some termination criterion is met.
For \(\Omega (\varvec{\theta }\vert \{\varvec{\theta }_u,\mathcal {Y}_u\}, \mathcal {M}(\varvec{\theta },\mathcal {Y},\iota ))\) we use the expected improvement (EI) criterion (Jones et al. 1998), which is
\(\Omega \left( \varvec{\theta }\vert \{\varvec{\theta }_u,\mathcal {Y}_u\},\mathcal {M}(\varvec{\theta },\mathcal {Y},\iota )\right) = E\left[ \max \left( \mathcal {Y}^*-\mathcal {Y},0\right) \right] ,\)
with \(\mathcal {Y}^*=\min \mathcal {Y}_u\). EI has shown good behavior over a wide array of tasks (Bergstra et al. 2011).
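Under a Gaussian surrogate posterior the EI criterion for minimization has the familiar closed form, which can be sketched as follows (a generic sketch from the surrogate's posterior mean and standard deviation, not the TGPLLM implementation):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """EI acquisition for minimization, evaluated from the surrogate's
    posterior mean `mu` and standard deviation `sigma` at a candidate:
    EI = (y* - mu) * Phi(z) + sigma * phi(z) with z = (y* - mu) / sigma,
    and EI = max(y* - mu, 0) in the noise-free limit sigma = 0."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    safe = np.where(sigma > 0, sigma, 1.0)       # avoid division by zero
    z = (y_best - mu) / safe
    ei = (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, np.maximum(y_best - mu, 0.0))
```

The new candidate \(\varvec{\theta }_{u+1}\) is the maximizer of this quantity over \(\varvec{\Theta }\); note how larger posterior uncertainty `sigma` raises EI, which is exactly the exploration incentive described above.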
What is crucial for the performance of BO is how well the surrogate model \(\mathcal {M}\) is able to approximate the unknown objective. To find a good surrogate model \(\mathcal {M}\) for our case, we empirically investigated the behavior of different \(\text {Stoploss}(\varvec{\theta })\) over \(\varvec{\theta }\) for a number of different data sets; see the examples in Appendix C in the supplementary document.
Given the nature of Stoploss we mentioned before, we looked for a surrogate model that is nonstationary, allows for jumps and piecewise constant regions (e.g., for discrete \(\varvec{\Theta }\)) and allows sufficiently rough Gaussian processes within segments over \(\varvec{\theta }\) (for nonconvexity). We found it in the treed Gaussian process with jumps to the limiting linear model, with the separable power exponential family for the correlation structure \(\iota \). This process recursively partitions the search space into nonoverlapping segments (like a regression tree) and for each segment independent Gaussian processes are used, which have linear models as their limit. This allows nonstationarity over the whole search space and accommodates piecewise linear or constant areas of the search space (which is useful especially for discrete \(\varvec{\theta }\)). Of further note is that the independent GPs do not have to connect at the boundaries of the segments, thus allowing for jump discontinuities. The specification and estimation of this process is fully Bayesian; for details, see Gramacy and Lee (2008).
BO with TGPLLM is sufficiently flexible and general for optimizing Stoploss for the dimensionality of \(\varvec{\theta }\) that we are faced with, both from a theoretical perspective and over all our empirical investigations. We thus recommend it as the default approach for generating candidates for optimizing \(\text {Stoploss}(\varvec{\theta })\) in a small number of iterations. We point out that there may be combinations of data, hyperparameters and Stoploss specification for which a different approach may be more accurate or efficient, e.g., the example in Sect. 7.1, where a crude grid search is sufficient. It is also possible to develop tailored optimization approaches for concrete Stoplosses that exploit structure and perform better. Also, for a \(\varvec{\theta }\) with more than 20 dimensions, standard BO starts to perform less well. Nevertheless, we see BO with the TGPLLM surrogate model as a general strategy that can be successfully used with every conceivable Stoploss that can be derived from the suggestions within this article.
7 Application
In this section, we demonstrate how the STOPS framework can be used for tuning hyperparameters in PS for individual data analytical instances. For the purpose of illustration, we consider a diverse set of Stoplosses comprising different PS badness-of-fit measures, transformations and c-structuredness indices.
7.1 Unrolling the swiss roll
As a simple example for illustrating the concept, we use the STOPS framework to select hyperparameters for the geodesic distance function in Isomap to unroll the classic swiss roll regularly. In the swiss roll example, data lie equally spaced on a spiral manifold (“swiss roll”) embedded in a higher dimensional space. Proximity scaling methods that emphasize local structure are able to flatten out this manifold in the target space. One of the most popular PS variants for doing this is Isomap.
We use an example with 150 data points in three dimensions lying on a grid on the embedded manifold, where along the y dimension there are five points and along the x dimension there are 30. We code the points in a diverging palette from the center of the roll along the spiral direction. The flattening-out operation can be considered successful if the same shades are arranged vertically, the palette runs from left to right, and the grid is recovered.
As described in Sect. 4.1.2, Isomap has a governing hyperparameter for the calculation of the geodesic distances. In line with our objective of flattening out the swiss roll, we are looking for a solution with objects arranged on a regular grid and also want to preserve the neighborhood of points around each point as well as possible (specifically preserving a neighborhood of 5 points). In terms of c-structuredness, one way to measure whether the objects are arranged on a grid is c-regularity (see Sect. 5.1), and preservation of the neighborhood can be assessed with c-faithfulness (see Appendix A.7 in the supplementary document), both of which would ideally be 1.
We use the \(\varepsilon \)-version of Isomap and may now set the hyperparameter for the geodesic distances to different values and tune by inspecting for which hyperparameter values we get a faithful, regular representation (e.g., manually or via a grid search). We may also tune the hyperparameter with the STOPS framework automatically.
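The geodesic-distance step of the \(\varepsilon \)-version of Isomap can be sketched as follows (the subsequent MDS step on the geodesic distances is omitted; the function name is ours):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def epsilon_geodesic(X, eps):
    """Geodesic distance step of the eps-variant of Isomap: connect all
    pairs of points closer than eps and take graph shortest paths. The
    result is the dissimilarity matrix handed to the MDS step (omitted
    here); entries are inf when eps is too small to connect the graph."""
    X = np.asarray(X, dtype=float)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    W = np.where(D <= eps, D, np.inf)   # keep only eps-neighbourhood edges
    return shortest_path(W, method="D", directed=False)

# Points on a quarter circle: with a small eps the geodesic between the
# endpoints follows the arc (about pi/2) rather than the straight chord.
t = np.linspace(0.0, np.pi / 2, 50)
X = np.column_stack([np.cos(t), np.sin(t)])
G = epsilon_geodesic(X, eps=0.05)
print(G[0, -1])  # about 1.57, versus a Euclidean chord of about 1.41
```

This also shows why \(\varepsilon \) needs tuning: too small a value disconnects the neighborhood graph (infinite entries), too large a value short-circuits the manifold and the geodesic collapses towards the Euclidean distance.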
In Table 1 we list values of normalized stress, c-regularity and c-faithfulness for different \(\varepsilon \) (our \(\varvec{\theta }\)) obtained in a grid search from 0.09 to 0.3 by increments of 0.02. Here the best values for c-regularity and c-faithfulness would be obtained^{Footnote 4} with \(\varepsilon ^*= 0.11\).
Alternatively, we can use STOPS with Isomap and a weight of 0 for the badness-of-fit and \(-1\) for c-regularity and c-faithfulness respectively (negative, as we want to maximize them). We use the mSTOPS variant for \(\text {Stoploss}(\varvec{\theta })\), combining the two c-structuredness indices multiplicatively. The search space for \(\varepsilon \) is set to be between 0.09 and 0.3. We use BO with TGPLLM for 10 iterations. Note that due to our objective we only consider c-regularity and c-faithfulness and disregard stress, as the latter measures how closely the fitted distances approximate the dissimilarities (which is of no concern for our choice of \(\varepsilon \) here). The role of a fit measure is played by c-faithfulness.
Tuning the hyperparameter with the mSTOPS procedure^{Footnote 5} suggests using \(\varepsilon ^*=0.1095\). The resulting configuration can be found in the upper panel of Fig. 2. This is the most faithful and regular representation that we obtain by varying \(\varepsilon \), with values of 0.9911 for c-regularity and 0.8293 for c-faithfulness, although we cannot achieve a perfectly regular rectangle. Both values either match or improve on the corresponding optimal values we found in the grid search for \(\varepsilon =0.11\). For comparison, a configuration with nonoptimal \(\varepsilon =0.21\) is shown in the lower panel of Fig. 2. This solution has c-regularity equal to the optimal one, but lower c-faithfulness due to the difficulty of representing the inner part of the swiss roll for this \(\varepsilon \).
7.2 Handwritten digits
We consider a second application of the STOPS framework, this time on handwritten digits data from Alimoglu and Alpaydin (1997). Following Izenman (2009), the original data were obtained from 44 people who handwrote the digits \(0,\dots ,9\) 250 times by following the trajectory of the pen on a \(500 \times 500\) grid which was then normalized. From the normalized trajectory, eight bivariate points were randomly selected for each handwritten digit, leading to 16 variables per digit. We look at a random sample of 500 written digits.
To these data we apply different nonlinear dimensionality reduction methods that allow for parametric transformations. We use the STOPS framework to select the transformation parameters for visualization of the multivariate similarities with emphasis on two structures: clusters, to make the separation of digits clear, and an underlying manifold. The PS methods we tune with STOPS are Sammon mapping with power transformations of the proximities, Box–Cox MDS, POSTMDS and local MDS. This is a relatively large data set in an MDS context and fitting a single PS can already be costly, so using BO is an efficient strategy.
We use aSTOPS with normalized stress with a weight of \(v_0=1\), c-clusteredness (see Sect. 5.1) with a high weight of \(v_1=-400\) and c-manifoldness (Sect. 5.2) with a moderate weight of \(v_2=-2.5\) (negative weights, as both types of c-structuredness are to be maximized).^{Footnote 6} The OPTICS Cordillera index uses \(q=2\), \(c=5\), \(\epsilon =10\) and \(d_{\max }=0.6\); the other indices use the default values. For comparability of indices the configurations are rescaled so that the highest column variance is 1. BO with TGPLLM is carried out in 10 steps.^{Footnote 7}
The optimal hyperparameter for the Sammon mapping with power transformation was found at \(\lambda ^*=5.4\). The c-clusteredness value was 0.15 and the c-manifoldness value was 0.928. The configuration is shown in the top left panel of Fig. 3.
For lMDS the transformation hyperparameter values selected by STOPS were \(k^*=14.73, \tau ^*=3\). With them the obtained solution has a c-clusteredness value of 0.094 and a c-manifoldness value of 0.953. The configuration is shown in the top right panel of Fig. 3.
The transformation hyperparameters for Box–Cox MDS were found at \(\mu ^*=4.631, \lambda ^*=4.859, \rho ^*=1.285\). The c-clusteredness value was 0.124 and the c-manifoldness value was 0.961. The configuration is shown in the bottom left panel of Fig. 3.
Lastly, for POSTMDS with \(w_{ij}=\delta _{ij}\), the transformation hyperparameter values were \(\kappa ^*=0.787, \lambda ^*=3.766, \rho ^*=2.979\). The c-clusteredness value was 0.116 and the c-manifoldness value was 0.862. The configuration for the POSTMDS solution is shown in the bottom right panel of Fig. 3.
Overall, we can see that the STOPS hyperparameter selection leads to configurations that, in general, arrange the digits in a sensible way: Due to the high c-clusteredness weight, similar-looking digits are positioned visually close to each other, e.g., 0, 5, 8 or 1, 2, 7. This is rather pronounced for lMDS, power Sammon and POSTMDS; BCMDS produces an arrangement in three clusters that is less in accordance with the other representations and the ground truth. The c-manifoldness weight ensured that each arrangement does not deviate too far from an appreciable, imagined submanifold. For every method, using the optimal hyperparameters from STOPS improves the c-structuredness indices as compared to an untransformed Sammon solution (which had c-clusteredness of 0.0833 and c-manifoldness of 0.5939).
For comparison, we also include two configurations in Fig. 3 obtained from Sammon mapping and lMDS respectively, where the c-structuredness weights have been set to \(v_1=v_2=0\) and \(v_0=1\), therefore disregarding c-structuredness and only minimizing the badness-of-fit criterion over \(\varvec{\theta }\). We can see that disregarding c-clusteredness leads to less c-structuredness: for the Sammon mapping result, clusters are less discernible in the bottom left plot than in the top left of Fig. 3 (especially when ignoring the coloring), and for lMDS clusters are hard to make out in the configuration obtained when disregarding c-clusteredness (bottom right plot vs. top right plot).
7.3 Republican mantras
For our last illustration of the STOPS framework in action, we turn to the “Republican Mantra” data from Mair et al. (2014). The data are natural language texts that were periodically obtained from the website of the Republican party (Grand Old Party, or GOP) in the USA, which at that time hosted a section called “Republican Faces”. In this section, supporters of the Republican party gave a short statement about why they see themselves as Republican. The statements always started with “I’m a Republican, because...” followed by the person’s personal ending. An example statement would be “...I believe in a free market society which enables hard work to equal success”. Mair et al. (2014) used MDS to explore the document term matrix of these statements but encountered difficulties when using the cosine distance between the words due to an almost-equal-dissimilarities artifact (see the first row of Fig. 4; for illustration, this arrangement has c-clusteredness of 0.053 with respect to \(c=6, d_{\max }=1.2, \epsilon =10, q=2\)). This prompted Mair et al. (2014) to abandon the cosine distance in favor of a co-occurrence-based dissimilarity measure, the gravity similarity, which was subjected to power transformations. Of note is that the concrete transformations they used in their work were chosen manually and ad hoc.
We revisit the data from the original angle: We retain the cosine distance that the authors originally aimed for but employ the STOPS framework to guide the PS result towards more c-structuredness by choosing power transformations in a flexible PS version that is appropriate for promoting structures of interest. This also serves as an empirical example of how an approach that relied on manual trial-and-error hyperparameter selection would have benefited from the STOPS framework. To illustrate the versatility of STOPS we use aSTOPS and mSTOPS with different structures and weights.
Additive STOPS: high c-clusteredness, high c-association, low c-complexity With aSTOPS we select parameters in a POSTMDS with cosine dissimilarities so that there is a focus on c-clusteredness (see Sect. 5.1) and c-association (Sect. A.1 in the Supplement) but favoring lower c-complexity (Sect. A.2 in the Supplement) for the association. To translate this into STOPS model weights, we used \(-10, -5\) and 1 as weights for the c-structuredness indices. The weight for stress was 1. Essentially this means that c-clusteredness is weighted twice as important as c-association, which is valued 5 times as important as stress. C-complexity is more of an afterthought, but for two results with similar c-clusteredness and c-association, we prefer the one with lower complexity, traded off equally with badness-of-fit. We note that the weights we use are somewhat arbitrary, but what is clear is that all in all we relax the fit criterion quite a bit to allow for high c-clusteredness and c-functionality. The meta-parameters in \(\varvec{\gamma }\) are \(c=6, d_{\max }=1.2, \epsilon =10, q=2\) for c-clusteredness and \(\omega =0.9\) for c-association and c-complexity.
The resulting configuration is displayed in the second row of Fig. 4. The objects are arranged in clusters close to a circumplex structure, reflecting the STOPS setup of a clustered arrangement with a relatively simple nonlinear association. It stems from a POSTMDS with \(w_{ij}=\delta _{ij}\) and parameters \(\varvec{\theta }^*=(2.635,1.185,3)^\top \). The square root of the normalized stress value is 0.63 and the c-structuredness indices are 0.386 for c-clusteredness, 0.999 for c-association and 3.807 for c-complexity.
Multiplicative STOPS: high c-nonmonotonicity, moderate c-association With mSTOPS we again select parameters in a POSTMDS with cosine dissimilarities, this time focusing on having a nonmonotonic, nonlinear associative structure in the target space, so high c-nonmonotonicity (Sect. A.10 in the Supplement) and moderate c-association (Sect. A.1 in the Supplement). Essentially we want the objects to be projected close to a functional that should be highly nonmonotonic.^{Footnote 8} To achieve this, badness-of-fit is allowed to become moderately high relative to the c-structuredness, but not as high as before with aSTOPS. This translates to weights of \(-2\) and \(-1\) for c-nonmonotonicity and c-association respectively, and a weight for stress of 1. The meta-parameters in \(\varvec{\gamma }\) were \(\omega =0.9\) for c-nonmonotonicity and c-association.
The resulting configuration is displayed in the last row of Fig. 4. We see that there is an associative structure that is rather complicated and highly nonmonotonic (reminiscent of a trefoil). It stems from a POSTMDS with \(w_{ij}=\delta _{ij}\) and parameter values \(\varvec{\theta }^*=(2.443,19.756,2.984)^\top \). The square root of the normalized stress value is 0.545 and the c-structuredness indices are 0.469 for c-nonmonotonicity and 0.999 for c-association.
7.4 Artifactual c-structuredness
There is an application aspect of selecting hyperparameters based on partially optimizing for c-structuredness that we need to point out: It is possible that this “forces” a configuration to look a certain way even if that does not correspond to the ground truth in the high-dimensional data or in the data generating process, so the c-structuredness exhibited can be an artifact of using STOPS rather than inherent in the data. In that sense, STOPS can artificially generate c-structuredness instead of uncovering “real” c-structuredness.^{Footnote 9} Two situations are of particular interest to discuss: First, if the \(\varvec{\Delta }\) are not at all structured (they exhibit the null structure, i.e., one resulting from perfectly equal dissimilarities; Buja and Swayne 2002), and second, if the \(\varvec{\Delta }\) exhibit some ground truth structures but these structures are not represented in the set of structures used for a STOPS model. We address both of these situations briefly.
In the first situation, the data correspond to the null structure of equal dissimilarities (\(\delta _{ij}=\delta =const.\)), i.e., the objects form a regular simplex in \((N-1)\)-dimensional space. For all the PS methods we mentioned, this leads to highly structured configuration artifacts for the fitted distances. The effect that STOPS has in this situation depends on the transformations used for the \(\varvec{\Delta }\) (and, by extension, the transformation vector \(\theta _{\Delta }\)). Generally speaking, if after applying transformations to the \(\varvec{\Delta }\) the elements of \(\hat{\varvec{\Delta }}\) are still equal, then the null structure is preserved in the sense of producing the same configuration artifact in the target space for the fitted distances, regardless of the choice of transformation for \(\hat{\varvec{\Delta }}\). This is the case when all \(\delta _{ij}\) are subjected to the same deterministic transformation for all i, j. STOPS will then not induce c-structuredness artificially beyond the effect of the null structure on the PS configuration. This holds true for all of the transformations and PS methods we discussed in this article.
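The invariance argument can be checked directly: an elementwise deterministic transformation (here, power transformations) of a constant dissimilarity matrix is again constant, so up to overall scale the PS input and hence the configuration artifact does not change. A small illustrative check:

```python
import numpy as np

# Null structure: all off-diagonal dissimilarities equal (regular simplex).
N, delta = 6, 3.0
Delta = np.full((N, N), delta)
np.fill_diagonal(Delta, 0.0)

for lam in (0.5, 1.0, 2.0, 3.5):       # elementwise power transformations
    hat = Delta ** lam                 # same deterministic map for every (i, j)
    off = hat[~np.eye(N, dtype=bool)]  # transformed off-diagonal entries
    # still perfectly equal, so the PS configuration artifact is unchanged
    assert np.allclose(off, off[0])
print("null structure preserved under power transforms")
```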
In the second situation the elements of \(\varvec{\Delta }\) are differentiated and informative for the ground truth structure, but STOPS does not take the correct structure into account. Two mechanisms are relevant here. First, the ground truth structure in the high-dimensional data cannot be recovered with the chosen PS method. Using STOPS may then in general also fail to uncover the structure. This problem can be mitigated within STOPS, however, by including structures that measure how faithful the configuration is to the high-dimensional ground truth and/or the original data, e.g., with c-faithfulness (see Appendix A.7 in the supplementary document). That way it is possible for STOPS to actually improve recovery of the ground truth via the PS method, as STOPS selects transformations that then partially optimize for the ground truth structure.
Second, the PS method specified for STOPS is able to recover the ground truth, but this ground truth is not among the structures specified. Then using the wrong c-structuredness index may lead STOPS to select hyperparameters that artificially distort the mapping towards the c-structuredness that was specified. Using STOPS may then in general induce artificial c-structuredness (because it focuses on the wrong structure), distort or fail to uncover the ground truth structure in the result (because it does not consider the correct structure), recover only parts of the correct structure (if the correct structure shares aspects with the wrong structure), or any combination of these. STOPS has one mechanism that is meant to mitigate this, which is to put relatively high weight on the badness-of-fit part (via \(v_0\)) and relatively little weight on c-structuredness (via \(v_l\)), with the extreme of no weight on c-structuredness (\(v_l=0, \forall l\)). This makes STOPS select hyperparameters that lead to configurations that are still faithful to the \(\hat{\varvec{\Delta }}\) in the sense of mainly minimizing the badness-of-fit over \(\varvec{\theta }\). To gauge the extent of the distortion, one can compare the STOPS result obtained with c-structuredness weights different from 0 to the STOPS result obtained with all \(v_l=0\).
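The role of the weights \(v_0\) and \(v_l\) can be sketched as follows. This is our own hedged illustration in Python, not the stops implementation; the function name and the sign convention (higher index values meaning more c-structuredness, entering negatively since Stoploss is minimized) are assumptions for the sketch.

```python
def stoploss(badness_of_fit, structuredness, v0, v):
    """Scalarized Stoploss sketch: v0 weights badness-of-fit, each v[l]
    weights one c-structuredness index. Indices enter with a negative
    sign here because more structuredness should lower the loss."""
    return v0 * badness_of_fit - sum(vl * sl for vl, sl in zip(v, structuredness))

# With all c-structuredness weights zero (v_l = 0 for all l), only the
# badness-of-fit matters, i.e., plain PS hyperparameter selection:
stoploss(0.151, [0.083, 0.5939], v0=1.0, v=[0.0, 0.0])  # = 0.151
```

Comparing results for \(v_l=0\) with results for nonzero \(v_l\), as suggested above, then amounts to comparing the minimizers of these two variants of the objective.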
8 Discussion
In this article we suggested a framework for hyperparameter selection in proximity scaling models, coined STOPS for STructure Optimized Proximity Scaling. The selection process for the hyperparameters is based on the idea that we want the configuration to show c-structuredness, enabling easier interpretation. The underlying objective function, Stoploss, combines c-structuredness indices and/or a badness-of-fit measure as a function of the PS hyperparameters, and thus allows optimization over their space. We presented a nested, profile-type optimization procedure that solves the PS problem given the hyperparameters by a standard PS algorithm in an inner optimization step and then uses the resulting configurations in the Stoploss to optimize over the hyperparameters in an outer optimization step. We suggested Bayesian optimization via treed Gaussian processes with jumps to limiting linear models as an efficient strategy for generating candidates in the outer optimization. The use of the STOPS framework was illustrated with three examples; in these examples there were structural considerations about the configuration that we used to select hyperparameters to achieve a desired effect.
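The nested, profile-type structure can be sketched in miniature. The following is our own toy illustration in Python, not the stops implementation: the "inner" PS step is reduced to a closed-form optimal scaling of fixed base distances, the c-structuredness index is a stand-in (spread of fitted distances), and the outer optimizer is a grid search where the paper uses Bayesian optimization with TGP-LLM. All names and the specific inner problem are assumptions for the sketch.

```python
def inner_ps(delta, d0, lam):
    """Inner step: for fixed hyperparameter lam, fit the transformed
    dissimilarities delta^lam by an optimal scaling b of base distances
    d0 (closed form), returning fitted distances and normalized stress."""
    dh = [d ** lam for d in delta]
    b = sum(x * y for x, y in zip(dh, d0)) / sum(y * y for y in d0)
    fitted = [b * y for y in d0]
    stress = sum((x - f) ** 2 for x, f in zip(dh, fitted)) / sum(x * x for x in dh)
    return fitted, stress

def structuredness(fitted):
    """Toy stand-in for a c-structuredness index: spread of the fitted
    distances (higher = more differentiated configuration)."""
    m = sum(fitted) / len(fitted)
    return sum((f - m) ** 2 for f in fitted) / len(fitted)

def outer_search(delta, d0, grid, v0=1.0, v1=0.1):
    """Outer step: the configuration is profiled out by inner_ps; minimize
    the scalarized Stoploss over the hyperparameter candidates."""
    best = None
    for lam in grid:
        fitted, stress = inner_ps(delta, d0, lam)
        loss = v0 * stress - v1 * structuredness(fitted)
        if best is None or loss < best[1]:
            best = (lam, loss)
    return best

delta = [1.0, 2.0, 3.0, 4.0]
d0 = [1.0, 1.5, 2.5, 4.5]
lam_best, loss_best = outer_search(delta, d0, [0.5, 1.0, 1.5, 2.0])
```

In the actual framework the grid search is replaced by a surrogate-model-driven candidate generator, which matters because each inner solve (a full PS fit) is expensive.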
The aim of this article was to suggest a general, principled way to select transformation parameters in PS methods based on the structural appearance of the configuration. For the scalarized objective we limited ourselves to optimizing only over the hyperparameters of the PS methods; in principle, and with little adaptation to the framework, it would be possible to treat the scalarization weights as an additional set of hyperparameters and optimize over them as well.
We chose to set up the framework for hyperparameter selection in a flexible, modular fashion. This makes it possible to plug in any type of badness-of-fit PS objective, use whatever optimization is advantageous for the PS objective, combine any number of c-structuredness indices in different ways, and use any global optimization strategy over the hyperparameters. We found Bayesian optimization with TGP-LLM to be well suited as a generally applicable strategy for the latter step, but due to the modular nature of the framework any other metaheuristic or random search can be used. The proposed framework is sufficiently general and modular to allow for conceptual and computational extensions in future research. For example, the framework allows the development of tailored optimization for concrete Stoplosses, it can be incorporated into a larger context of evaluation of PS methods such as France and Akkucuk (2021), and the idea of structure-based hyperparameter selection can be adapted to unsupervised learning methods that do not fit neatly into the presented PS framework, for example t-SNE or UMAP (McInnes et al. 2018).
We discussed structured appearances that we think are of interest to researchers and suggested statistics by which to measure them. We are quick to point out that our list is not meant to be exhaustive and that any scalar statistic that captures a property of the configuration can be incorporated into STOPS. We therefore believe our framework to be applicable even beyond the suggestions made in this paper.
As a hyperparameter tuning method, the STOPS framework is akin to imposing indirect, soft constraints in PS via the choice of hyperparameters, rather than incorporating hard, direct constraints on the configuration, and this is where we see the method located. In cases where strong theory or hard constraints exist, constrained versions of PS may be preferred over the STOPS framework. Since the method works completely unsupervised with respect to a ground truth for structures, the structures that are produced may not be “real” and would need to be validated.
Despite these limitations, we believe the STOPS framework to be a versatile conceptual framework for the task of hyperparameter tuning in ordination, unsupervised learning and dimensionality reduction. It can be utilized for data exploration, visualization and scaling in applications where one seeks configurations that show an interpretable structural quality while also retaining properties of a standard PS variant. This way one no longer has to restrict oneself to the narrow toolbox of standard MDS but can utilize the full world of flexible parametrized PS variants, with the standard versions resulting as specific hyperparameter constellations in the search space; the choice of hyperparameters is no longer ad hoc but principled and reproducible; the obtained hyperparameter values usually lead to the desired structure being appreciably present in the configuration (especially as compared to a standard MDS result), under the condition that the PS model still fits well; and both the fit and the c-structuredness can be quantified with descriptive statistics, thus making results reproducible and comparable between data sets, studies and settings.
Code Availability
An R script to fully reproduce results and figures is available as Supplementary Information.
Data Availability
The data sets used are available online in the R package stops.
Notes
While the setup of a STOPS model is at the discretion of the user, from an application and interpretability view we recommend including at most two c-structuredness indices, otherwise the effect of the indices compensating each other can be hard to predict. More than two can still be interpretable if the indices refer to related aspects; for example, high c-nonmonotonicity and low c-complexity would favor quadratic or circular arrangements.
Not all combinations have been proposed in the literature; conceptually, mixing and matching the badness-of-fit functions with the parametrized transformations for proximities, distances and weights is possible.
In the STOPS context, we always have \(\varvec{X}(\varvec{\theta })\) but for readability and generality, we only write \(\varvec{X}\) in the cstructuredness indices.
Running R 4.0.3 under Linux Mint 19.2 on an Intel Core i5-8350U CPU with 1.7 GHz, this grid search took 0.5 seconds.
Running R 4.0.3 under Linux Mint 19.2 on an Intel Core i5-8350U CPU with 1.7 GHz, optimization with our implementation took 12.5 seconds.
The weight values come from trading off the relative scales of the structure indices in the standard Sammon mapping to be commensurate; it has a badness-of-fit of 0.151, c-clusteredness of 0.083 and c-manifoldness of 0.5939. The scales of the indices are of relative magnitude of roughly 8:1 for c-clusteredness and c-manifoldness and 4:1 for c-manifoldness and fit for the initial Sammon solution. We want to trade these magnitudes off on the same scale and put more weight on c-clusteredness, say, 20 times more than on c-manifoldness, and more weight on c-manifoldness than on fit, say, 10 times. So, weights of 400 for c-clusteredness, 2.5 for c-manifoldness and 1 for stress reflect this.
Due to how BO with TGP-LLM works, this amounts to fitting around 60 PS models each. For our prototype implementation in R 4.1.2 on a PC running Linux Mint 20.1 with an Intel Core i7-6700 with 3.40 GHz, the timings were 751 seconds for Sammon mapping with powers, 2662 seconds for LMDS (with a maximum of \(1\times 10^4\) PS iterations), 6169 seconds for Box–Cox MDS (with a maximum of \(1\times 10^4\) PS iterations) and 106809 seconds for POSTMDS (with a maximum of \(5\times 10^4\) PS iterations). The bottleneck is fitting the PS problem, especially for POSTMDS.
We do not have a theoretical justification for these structures; this is meant as another showcase of STOPS and also as a contrast to the aSTOPS result, to illustrate how different various STOPS models can look for the same data.
This is by no means exclusive to STOPS; it shares this property with constrained forms of MDS (e.g., De Leeuw and Heiser 1980; Mathar 1990; Bronstein et al. 2006) and with many other types of unsupervised learning methods such as t-SNE (van der Maaten and Hinton 2008) (whose emphasis on small-scale structure can distort global structure) or k-means clustering (which always produces k Voronoi clusters in the distance metric used) (Mucherino et al. 2009).
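The weight trade-off arithmetic in the notes above (weights of 400, 2.5 and 1) can be checked numerically. This is our own verification sketch in Python, not code from the paper or the stops package; "effective contribution" here simply means weight times index value for the initial Sammon solution.

```python
# Index values for the initial Sammon solution, from the footnote above.
fit, c_clus, c_mani = 0.151, 0.083, 0.5939
w_clus, w_mani, w_fit = 400, 2.5, 1

# Effective contribution of each term = weight * index value.
eff_clus = w_clus * c_clus   # ~33.2
eff_mani = w_mani * c_mani   # ~1.485
eff_fit = w_fit * fit        # 0.151

# c-clusteredness ends up with roughly 20x the effective weight of
# c-manifoldness, which in turn gets roughly 10x that of fit:
print(round(eff_clus / eff_mani, 1))  # prints 22.4
print(round(eff_mani / eff_fit, 1))   # prints 9.8
```

The realized ratios (about 22:1 and 10:1) match the stated targets of "20 times" and "10 times" up to the rounding of the chosen weights.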
References
Akkucuk, U., Carroll, J.D.: PARAMAP vs. Isomap: a comparison of two nonlinear mapping algorithms. J. Classif. 23(2), 221–254 (2006). https://doi.org/10.1007/s00357-006-0014-2
Alimoglu, F., Alpaydin, E.: Combining multiple representations and classifiers for pen-based handwritten digit recognition. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition, pp. 637–640 (1997). https://doi.org/10.1109/ICDAR.1997.620583
Ankerst, M., Breunig, M.M., Kriegel, H.P., et al.: OPTICS: ordering points to identify the clustering structure. In: ACM SIGMOD International Conference on Management of Data, vol. 28, pp. 49–60. ACM Press, New York (1999). https://doi.org/10.1145/304182.304187
Bergstra, J.S., Bardenet, R., Bengio, Y., et al.: Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems, pp. 2546–2554 (2011). https://proceedings.neurips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf
Borg, I., Groenen, P.J.: Modern Multidimensional Scaling: Theory and Applications, 2nd edn. Springer, New York (2005). https://doi.org/10.1007/0-387-28981-X
Box, G.E., Cox, D.R.: An analysis of transformations. J. Roy. Stat. Soc.: Ser. B (Methodol.) 26(2), 211–243 (1964). https://doi.org/10.1111/j.2517-6161.1964.tb00553.x
Breiman, L., Friedman, J.H.: Estimating optimal transformations for multiple regression and correlation. J. Am. Stat. Assoc. 80(391), 580–598 (1985). https://doi.org/10.2307/2288473
Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proc. Natl. Acad. Sci. 103(5), 1168–1172 (2006). https://doi.org/10.1073/pnas.0508601103
Buja, A., Swayne, D.F.: Visualization methodology for multidimensional scaling. J. Classif. 19(1), 7–43 (2002). https://doi.org/10.1007/s00357-001-0031-0
Buja, A., Swayne, D.F., Littman, M.L., et al.: Data visualization with multidimensional scaling. J. Comput. Graph. Stat. 17(2), 444–472 (2008). https://doi.org/10.1198/106186008X318440
Chen, L., Buja, A.: Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. J. Am. Stat. Assoc. 104(485), 209–219 (2009). https://doi.org/10.1198/jasa.2009.0111
Chen, L., Buja, A.: Stress functions for nonlinear dimension reduction, proximity analysis, and graph drawing. J. Mach. Learn. Res. 14, 1145–1173 (2013). https://jmlr.org/papers/v14/chen13a.html
De Leeuw, J.: Applications of convex analysis to multidimensional scaling. In: Barra, J.R., Brodeau, F., Romier, G., et al. (eds.) Recent Developments in Statistics, pp. 133–145. North Holland Publishing Company, Amsterdam (1977). https://escholarship.org/uc/item/4ps3b5mj
De Leeuw, J., Heiser, W.J.: Multidimensional scaling with restrictions on the configuration. Multivar. Anal. 5(1), 501–522 (1980)
De Leeuw, J.: Minimizing rStress using majorization. Tech. rep., UCLA Statistics Preprint Series (2014). https://rpubs.com/deleeuw/142619
De’ath, G.: Extended dissimilarity: a method of robust estimation of ecological distances from high beta diversity data. Plant Ecol. 144(2), 191–199 (1999). https://doi.org/10.1023/A:1009763730207
Demartines, P., Herault, J.: Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets. IEEE Trans. Neural Netw. 8(1), 148–154 (1997). https://doi.org/10.1109/72.554199
Eberhart, R.C., Kennedy, J.: A new optimizer using particle swarm theory. In: Proceedings of the Sixth International Symposium on Micro Machine and Human Science, vol. 1, pp. 39–43. IEEE Press, Piscataway, NJ (1995). https://doi.org/10.1109/MHS.1995.494215
France, S.L., Akkucuk, U.: A review, framework, and R toolkit for exploring, evaluating, and comparing visualization methods. Vis. Comput. 37, 457–475 (2021). https://doi.org/10.1007/s00371-020-01817-5
France, S., Carroll, D.: Development of an agreement metric based upon the rand index for the evaluation of dimensionality reduction techniques, with applications to mapping customer data. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition, pp. 499–517. Springer, Berlin, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73499-4_38
France, S.L., Carroll, J.D.: Two-way multidimensional scaling: a review. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 41(5), 644–661 (2011). https://doi.org/10.1109/TSMCC.2010.2078502
Gebelein, H.: Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM - J. Appl. Math. Mech. / Zeitschrift für Angewandte Mathematik und Mechanik 21(6), 364–379 (1941). https://doi.org/10.1002/zamm.19410210604
Gramacy, R.B., Lee, H.K.H.: Bayesian treed Gaussian process models with an application to computer modeling. J. Am. Stat. Assoc. 103(483), 1119–1130 (2008). https://doi.org/10.1198/016214508000000689
Groenen, P., De Leeuw, J.: Power-stress for multidimensional scaling. Tech. rep., UCLA, Los Angeles, USA (2010)
Izenman, A.J.: Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, New York (2009). https://doi.org/10.1007/978-0-387-78189-1
Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Global Optim. 13(4), 455–492 (1998). https://doi.org/10.1023/A:1008306431147
Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983). https://doi.org/10.1126/science.220.4598.671
Kruskal, J.B.: Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29(1), 1–27 (1964). https://doi.org/10.1007/BF02289565
Luus, R., Jaakola, T.: Optimization by direct search and systematic reduction of the size of search region. Am. Inst. Chem. Eng. J. (AIChE) 19(4), 760–766 (1973). https://doi.org/10.1002/aic.690190413
Mair, P., Rusch, T., Hornik, K.: The grand old party: a party of values? SpringerPlus 3(697), 1–10 (2014). https://doi.org/10.1186/2193-1801-3-697
Mathar, R.: Multidimensional scaling with constraints on the configuration. J. Multivar. Anal. 33(2), 151–156 (1990). https://doi.org/10.1016/0047-259X(90)90043-H
McGee, V.E.: The multidimensional analysis of ‘elastic’ distances. Br. J. Math. Stat. Psychol. 19(2), 181–196 (1966). https://doi.org/10.1111/j.2044-8317.1966.tb00367.x
McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction (2018). https://doi.org/10.48550/ARXIV.1802.03426
Mockus, J.: Bayesian Approach to Global Optimization: Theory and Applications, Mathematics and its Applications, vol. 37. Springer, Netherlands, Amsterdam (1989). https://doi.org/10.1007/978-94-009-0909-0
Mucherino, A., Papajorgji, P.J., Pardalos, P.M.: Clustering by k-means, pp. 47–82. Springer, New York (2009). https://doi.org/10.1007/978-0-387-88615-2_3
Ramsay, J.O.: Maximum likelihood estimation in multidimensional scaling. Psychometrika 42(2), 241–266 (1977). https://doi.org/10.1007/BF02294052
Ramsay, J.O.: Some statistical approaches to multidimensional scaling data (with discussion). J. Royal Stat. Soc. Series A (General) 145(3), 285–312 (1982). http://www.jstor.org/stable/2981865
Rusch, T., Hornik, K., Mair, P.: Assessing and quantifying clusteredness: the OPTICS Cordillera. J. Comput. Graph. Stat. 27(1), 220–233 (2018). https://doi.org/10.1080/10618600.2017.1349664
Rusch, T., Mair, P., Hornik, K.: Cluster optimized proximity scaling. J. Comput. Graph. Stat. 30(4), 1156–1167 (2021). https://doi.org/10.1080/10618600.2020.1869027
Sammon, J.W.: A nonlinear mapping for data structure analysis. IEEE Trans. Comput. C-18(5), 401–409 (1969). https://doi.org/10.1109/TC.1969.222678
Sarmanov, O.V.: The maximum correlation coefficient (symmetrical case). In: Doklady Akademii Nauk SSSR, pp. 715–718 (1958). http://mi.mathnet.ru/eng/dan23105
Siivola, E., Paleyes, A., González, J., et al.: Good practices for Bayesian optimization of high dimensional structured spaces. Appl. AI Lett. 2(2), e24 (2021). https://doi.org/10.1002/ail2.24
Takane, Y., Young, F., De Leeuw, J.: Nonmetric individual differences multidimensional scaling: an alternating least squares method with optimal scaling features. Psychometrika 42(1), 7–67 (1977). https://doi.org/10.1007/BF02293745
Tenenbaum, J.B., De Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000). https://doi.org/10.1126/science.290.5500.2319
Torgerson, W.S.: Theory and Methods of Scaling. Wiley, New York (1958)
van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(86), 2579–2605 (2008). http://jmlr.org/papers/v9/vandermaaten08a.html
Acknowledgements
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Thomas Rusch. The first draft of the manuscript was written by Thomas Rusch and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding
Open access funding provided by Vienna University of Economics and Business (WU). No funds, grants, or other support was received for this work.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Rusch, T., Mair, P. & Hornik, K. Structure-based hyperparameter selection with Bayesian optimization in multidimensional scaling. Stat Comput 33, 28 (2023). https://doi.org/10.1007/s11222-022-10197-w