Abstract
We introduce the structure optimized proximity scaling (STOPS) framework for hyperparameter selection in parametrized multidimensional scaling and extensions (proximity scaling; PS). The selection process for hyperparameters is based on the idea that we want the configuration to show a certain structural quality (c-structuredness). A number of structures and how to measure them are discussed. We combine the structural quality, by means of c-structuredness indices, with the PS badness-of-fit measure in a multi-objective scalarization approach, yielding the Stoploss objective. Computationally we suggest a profile-type algorithm that first solves the PS problem and then uses Stoploss in an outer step to optimize over the hyperparameters. Bayesian optimization with treed Gaussian processes is recommended as an apt and efficient strategy for carrying out the outer optimization. This way, hyperparameter tuning for many instances of PS is covered in a single conceptual framework. We illustrate the use of the STOPS framework with three data examples.
1 Introduction
For unsupervised learning, dimensionality reduction and exploratory data analysis, a popular approach is to represent and visualize multivariate proximities in lower dimensions, also known as ordination. Such methods were first known as multidimensional scaling (MDS) and later also collected under the nonlinear dimension reduction and manifold learning monikers. We will use the umbrella term proximity scaling (PS) by which we mean MDS and extensions (for a comprehensive overview, see, e.g., France and Carroll 2011).
The main idea behind these methods is that for N data points or objects there is a matrix of pairwise proximities between object i and object j and one looks for a representation of the objects—the configuration—in a lower dimensional target space where the distances between objects optimally approximate the proximities. Examples include classical MDS (Torgerson 1958), metric MDS and nonmetric MDS (Kruskal 1964), Sammon mapping (Sammon 1969), Isomap (De’ath 1999; Tenenbaum et al. 2000) or power stress MDS (POST-MDS, Buja et al. 2008; Groenen and De Leeuw 2010).
PS methods play an important role in the exploration and communication of data. There exists a plethora of PS methods suitable for many different purposes. Interestingly in the wild one seems to encounter only a relatively narrow PS toolbox, mainly classical MDS and nonmetric/metric MDS. General and flexible PS methods, which allow for various transformations of the quantities of interest, are more rarely encountered. This is unfortunate as we are convinced that flexible PS methods can be both useful and powerful and ignoring them may lead to foregone insights.
Flexible PS methods typically have hyperparameters governing the transformations, which are either explicit or implicit, often yielding standard PS methods for certain hyperparameter constellations. For example, a popular method for unravelling a manifold embedded in a higher-dimensional space is Isomap, which does classical MDS on proximities that are derived from a neighborhood graph. The graph and derived proximities depend on a hyperparameter for the neighborhood that can be freely chosen. Similarly, in POST-MDS there are power transformations for fitted distances, proximities and/or weights which can be freely chosen and yield metric MDS if the exponents are 1.
A possible obstacle for a more widespread use of flexible PS methods are the hyperparameters for the transformations of the quantities of interest and their selection. An instructive example of this are the ideas of Ramsay (1982) that were read before the Royal Statistical Society but have gained relatively little traction. Ramsay suggests power and spline transformations to be used in MDS and, regarding the hyperparameters, recommends using an exponent of 1.5 and polynomials of degree 2 based on “general experience” (p. 288 and 290, Ramsay 1982). The choice of transformations and hyperparameters is met with some contention by influential discussants like J. C. Gower, F. Critchley, J. de Leeuw, and others. Two comments from the discussion illustrate this: S. Tagg states “this complexity requires understanding of the choice of parameters” (p. 303, Ramsay 1982) and C. Chatfield says “by allowing [...] for different transformations [...] he hopes to have a realistic, albeit rather complicated, model. My worry is that the model is too complicated and yet still not realistic enough.” (p. 306, Ramsay 1982). This debate on how to choose the optimal parameters for flexible MDS methods has not yet been resolved. Hence, if these methods are used, the choice of hyperparameters often has an arbitrary trial-and-error aspect to it as they are simply set ad hoc (e.g., Ramsay 1982; Buja et al. 2008; Mair et al. 2014) and/or different values are tried out (sometimes in a semi-optimized fashion) until values are found that work in light of the application, see, e.g., the analyses in Buja and Swayne (2002); Borg and Groenen (2005); Chen and Buja (2009, 2013); De Leeuw (2014).
We propose to approach the hyperparameter selection in a principled way: as a computational problem of optimizing over the hyperparameter space for a wide variety of PS methods. Hence this article’s contributions are: a) providing a methodological framework that subsumes a wide array of flexible PS methods and parametric transformations for them, b) suggesting the setup of objective functions for hyperparameter selection to optimize for within this framework, c) operationalizing the building blocks that comprise the objective functions, and d) identifying a general computational approach that is suitable to tackle all these instances in the same way.
One contribution is that we propose criteria that can be used for operationalizing this optimization problem. Our proposal is based on the observation that in many applications of PS the obtained result—which is faithful to the input proximities—is interpreted with respect to the arrangement of objects in the configuration. We call this “structural quality” for a specific notion of structural appearance, for example, that objects are arranged in quadrants or as a circumplex. In recent years this aspect of interpretation in PS has been made explicit by using indices to measure a structural quality of interest, e.g., by Akkucuk and Carroll (2006); France and Carroll (2007); Chen and Buja (2009, 2013); Rusch et al. (2021).
Since structural quality is a property of the configuration, we coin the term c-structuredness to single out that we mean the degree of structuredness of a configuration with respect to a certain structure. High c-structuredness is something we aim for in PS results under the condition that the proximities are still represented faithfully. Changing the hyperparameters for the transformations typically also changes the c-structuredness, which provides a way for us to set up tuning of the hyperparameters so that we obtain a configuration that is both faithful and c-structured.
Our suggestions build on precursor ideas of letting a criterion guide the selection of hyperparameters in PS, as for example in Akkucuk and Carroll (2006) and Chen and Buja (2009, 2013), where tuning has been done manually or over a grid. France and Akkucuk (2021) propose a visualization and exploration framework along these lines. An optimization approach to hyperparameter tuning in MDS was previously used by Rusch et al. (2021). Our proposal extends and complements these approaches.
In the remainder of this article we present our conceptual framework for hyperparameter tuning in PS: STOPS for STructure Optimized Proximity Scaling. It allows one to computationally tackle this task for a wide array of flexible PS versions, subsuming standard MDS methods used in the wild, which result as specific hyperparameter constellations in the search space. We do this by making the notion of the structural quality that is sought and interpreted explicit; these structural appearances are condensed into c-structuredness indices to clearly quantify the degree of structuredness of a PS result. The goal of finding hyperparameters is then handled as a multi-objective optimization problem: The c-structuredness indices are combined with the PS badness-of-fit by scalarization into an objective function that is parametrized with the hyperparameters. The optimal hyperparameters are found via a general purpose optimization routine; one that emerged as working well for this problem class was Bayesian Optimization (BO, Mockus 1989). This gives a general approach that allows us to handle most instances of parametrized PS in the same way.
The article is organized as follows: It starts with a high level description of the STOPS framework, the objective functions and the different building blocks that make up the framework in Sect. 2. We then discuss the building blocks in detail: in Sect. 3 we discuss badness-of-fit objectives, in Sect. 4 we elaborate on transformations and their governing hyperparameters. In Sect. 5 we turn to c-structuredness, i.e., structures of interest and their quantification by indices. Section 6 discusses hyperparameter selection within the STOPS framework. In Sect. 7 we illustrate the use of the framework with three data examples. Concluding remarks can be found in Sect. 8. As Supplementary Information we include the R code file to reproduce the results and figures, and a supplementary document with details on (additional) structures and structuredness indices, the nature of Stoploss and normalization of badness-of-fit.
2 The STOPS framework
In proximity scaling methods we start from a given matrix \(\varvec{\Delta }\) of pairwise symmetric proximities between objects \(i, j; i, j=1,\dots ,N\), with individual entries \(\delta _{ij}\). We will assume the proximities to take on a minimum when two observations are equal, i.e., dissimilarities. The main diagonal of \(\varvec{\Delta }\) is 0.
Let \(\hat{\varvec{\Delta }}\) be the result of transformations applied to the entries of the proximity matrix \(\varvec{\Delta }\), where \(\hat{\varvec{\Delta }}=T_\Delta (\varvec{\Delta })=T_\Delta (\varvec{\Delta }\vert \theta _\Delta )\), with individual entries \(\hat{\delta }_{ij}\). Let \(\varvec{X}\) denote an \(N \times M\) matrix (the configuration) of lower dimension \(M<N\) (mostly \(M \ll N\)) from which the matrix \(\varvec{D}(\varvec{X})\) comprising untransformed pairwise distances \(d_{ij}(\varvec{X})\) between objects (row vectors) in \(\varvec{X}\) can be derived. The matrix \(\hat{\varvec{D}}(\varvec{X})=T_{D}(\varvec{D}(\varvec{X}))=T_D(\varvec{D}(\varvec{X})\vert \theta _D)\) comprises transformed pairwise distances between objects in \(\varvec{X}\), with individual entries \(\hat{d}_{ij}(\varvec{X})\). We call \(T_{\Delta }: (\varvec{\Delta },\theta _\Delta ) \mapsto \hat{\varvec{\Delta }}\) a proximity transformation function and \(T_D: (\varvec{D}(\varvec{X}),\theta _D) \mapsto \hat{\varvec{D}}(\varvec{X})\) a distance transformation function. Some PS models also allow different weights either given a priori as an input weight matrix \(\varvec{W}\) with elements \(w_{ij}\), or as transformed values based on the input weights. In case of the latter this is a transformed weight matrix \(\hat{\varvec{W}}=T_W(\varvec{W}\vert \theta _W)\), with elements \(\hat{w}_{ij}\) for the weight transformation function \(T_W: (\varvec{W},\theta _W) \mapsto \hat{\varvec{W}}\). The combined hyperparameter vector of all these transformations is \(\varvec{\theta }=(\theta _\Delta ,\theta _D,\theta _W)^\top \).
We search for an optimal configuration that allows one to reconstruct the matrix \(\hat{\varvec{\Delta }}\) as well as possible from \(\hat{\varvec{D}}(\varvec{X})\), i.e., we want \(\hat{\varvec{D}}(\varvec{X}) \approx \hat{\varvec{\Delta }}\). This is achieved by minimizing a measure of badness-of-fit \(\sigma _{\text {PS}}(\varvec{X} \vert \varvec{\theta }) = \mathcal {L}(\hat{\varvec{\Delta }}=T_\Delta (\varvec{\Delta }\vert \theta _\Delta ),\hat{\varvec{D}}(\varvec{X})=T_D(\varvec{D}(\varvec{X})\vert \theta _D),\hat{\varvec{W}}=T_W(\varvec{W}\vert \theta _W))\), where \(\mathcal {L}\) denotes a loss function.
Minimizing the badness-of-fit criterion in PS means finding—for given \(\varvec{\theta }\)—the optimal configuration \(\varvec{X}^*\) out of all possible \(\varvec{X}\) as

\(\varvec{X}^*(\varvec{\theta }) = \arg \min _{\varvec{X}} \sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta }) \qquad (1)\)
Many measures of badness-of-fit with different types of transformations and hyperparameters have been proposed (see Sects. 3 and 4). Our framework covers all of them; we give a loose taxonomy of the most popular ones in Sect. 3. Concrete values of \(\varvec{\theta }\) are typically chosen manually ad hoc but can also be found by optimization; the latter is what we focus on in this article.
Let us assume we are interested in L different structural qualities of \(\varvec{X}\) and that we have L corresponding univariate c-structuredness indices \(I_l(\varvec{X}\vert \varvec{\gamma })\) for the \(l=1,\dots , L\) different structures, capturing the essence of the structural appearance of the configuration with respect to the lth structure. For example, we might be interested in both the structural appearance of how clustered the configuration is (structure 1) and how strongly linearly related the column vectors of the configuration are (structure 2). We then measure the c-structuredness of \(\varvec{X}\) for the two structures with an index for clusteredness and one for linear dependence respectively. The \(\varvec{\gamma }\) are optional metaparameters for the indices, which we assume are given and fixed; they control how c-structuredness is measured. Some structures one might be interested in, along with their c-structuredness indices, will be discussed in Sect. 5 and many more in Appendix A in the supplementary document. We further assume broadly that the transformations \(T_\Delta (\varvec{\Delta }\vert \varvec{\theta })\) and/or \(T_D(\varvec{D}(\varvec{X})\vert \varvec{\theta })\) and/or \(T_W(\varvec{W}\vert \varvec{\theta })\) produce different c-structuredness in \(\varvec{X}\) for different values of \(\varvec{\theta }\).
In a nutshell, our proposal is to select optimal hyperparameters \(\varvec{\theta }^*\) for the scaling procedure by assessing the c-structuredness of an optimal configuration \(\varvec{X}^*\) found from a PS method for given \(\varvec{\theta }\), usually in combination with its badness-of-fit value. We aim at finding a \(\varvec{\theta }^*\) that, when used as transformation parameters in the PS method, will give a configuration that has high (or low) values of the c-structuredness indices. We view this as a multi-objective optimization problem, where we want to maximize/minimize different criteria (either badness-of-fit, or c-structuredness, or both) over \(\varvec{\theta }\). C-structuredness may thus be induced at a possible expense of fit, but we control the amount of that expense.
To formalize this we explicitly write the building blocks of the objective function used for hyperparameter tuning via STOPS as a function of \(\varvec{\theta }\): Let us denote by \(\varvec{X}^*(\varvec{\theta })\) the optimal solution from minimizing a badness-of-fit \(\sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })\) for a fixed \(\varvec{\theta }\), so \(\varvec{X}^*(\varvec{\theta }):= \arg \min _{\varvec{X}} \sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })\). Further we also have the L different univariate indices with possible metaparameters \(\varvec{\gamma }\), \(I_l(\varvec{X}^*(\varvec{\theta })\vert \varvec{\gamma })\), to be optimized for.
Specific variants of STOPS can be instantiated by defining objective functions \(\text {Stoploss}(\varvec{\theta }\vert v_0, \dots , v_L, \varvec{\gamma })\), comprising either badness-of-fit or c-structuredness indices or both in a scalarized combination. Two variants of objective functions—called additive STOPS (aSTOPS) and multiplicative STOPS (mSTOPS) respectively—are of the following form:

\(\text {Stoploss}(\varvec{\theta }\vert v_0, \dots , v_L, \varvec{\gamma }) = v_0 \cdot \sigma _{\text {PS}}(\varvec{X}^*(\varvec{\theta })\vert \varvec{\theta }) + \sum ^L_{l=1} v_l \cdot I_l(\varvec{X}^*(\varvec{\theta })\vert \varvec{\gamma }) \qquad (2)\)

and

\(\text {Stoploss}(\varvec{\theta }\vert v_0, \dots , v_L, \varvec{\gamma }) = \sigma _{\text {PS}}(\varvec{X}^*(\varvec{\theta })\vert \varvec{\theta })^{v_0} \cdot \prod ^L_{l=1} I_l(\varvec{X}^*(\varvec{\theta })\vert \varvec{\gamma })^{v_l} \qquad (3)\)
with \(v_0 \in \mathbb {R}_{\ge 0}\) and \(v_1,\dots ,v_L \in \mathbb {R}\) being the scalarization weights. Numerically, the badness-of-fit function value \(\sigma _{\text {PS}}(\varvec{X}^*(\varvec{\theta })\vert \varvec{\theta })\) needs to be normalized to be scale-free and commensurable for comparability of different values of \(\varvec{\theta }\). We discuss such normalization in Appendix B of the supplementary document. The objective function for aSTOPS is fully compensatory, whereas for mSTOPS it ensures that a normalized badness-of-fit of 0 will always lead to the minimal \(\text {Stoploss}(\varvec{\theta }\vert v_0, \dots , v_L, \varvec{\gamma })\) for a positive value of \(I_l(\cdot )\). For notational convenience, we will refer to the objective functions for STOPS variants by \(\text {Stoploss}(\varvec{\theta })\) for the remainder of the paper.
The \(v_0,\dots ,v_L\) are weights that determine how the badness-of-fit and c-structuredness indices are scalarized^{Footnote 1} for Stoploss. This can be used to set how strongly the criteria are taken into account or to control the trade-off of fit and c-structuredness in determining hyperparameters (typically in a convex combination). For example, we might want to tune by only minimizing badness-of-fit over \(\varvec{\theta }\) (\(v_0=1, v_l=0\)), or find the best configurations and only optimize for structure over \(\varvec{\theta }\) without taking badness-of-fit into account (\(v_0=0, v_l \ne 0\)), or tune so that we relax \(10\%\) of goodness-of-fit for more c-structuredness (\(v_0=0.9\) and \(\sum ^L_{l=1}v_l=0.1\)). A negative (positive) weight for \(v_l\) means that a higher (lower) value for the lth index is favored.
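To make the scalarization concrete, here is a minimal sketch in code. The function names and toy values are our own, and we assume the weights enter the multiplicative variant as exponents:

```python
# Sketch of the two Stoploss scalarizations (illustrative; function names and
# the assumption that weights enter mSTOPS as exponents are ours).

def stoploss_additive(fit, indices, v0, v):
    """aSTOPS: weighted sum of normalized badness-of-fit and the L indices."""
    return v0 * fit + sum(vl * il for vl, il in zip(v, indices))

def stoploss_multiplicative(fit, indices, v0, v):
    """mSTOPS: weighted product; a fit of 0 yields the minimum for positive indices."""
    out = fit ** v0
    for vl, il in zip(v, indices):
        out *= il ** vl
    return out

# Relax 10% of fit for more c-structuredness; a negative v_l favors a high index.
loss = stoploss_additive(0.2, [0.8], v0=0.9, v=[-0.1])
```

Note how a normalized badness-of-fit of 0 makes the multiplicative variant minimal whenever all indices are positive, matching the compensatory/non-compensatory distinction above.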
Using either (2) or (3) for hyperparameter selection, we then need to find

\(\varvec{\theta }^* = \arg \min _{\varvec{\theta } \in \varvec{\Theta }} \text {Stoploss}(\varvec{\theta }) \qquad (4)\)
Accordingly, hyperparameter tuning with STOPS comprises the following building blocks:
1. The PS badness-of-fit loss function \(\sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })\) that allows us to find an optimal \(\varvec{X}^*(\varvec{\theta })\) for given \(\varvec{\theta }\) (see Sect. 3).

2. The transformations employed that depend on the vector \(\varvec{\theta }\), so any of \(T_\Delta (\varvec{\Delta }\vert \varvec{\theta })\), \(T_D(\varvec{D}(\varvec{X})\vert \varvec{\theta })\) and \(T_W(\varvec{W}\vert \varvec{\theta })\) (see Sect. 4).

3. The structures of interest and their c-structuredness indices \(I_l(\varvec{X}^*(\varvec{\theta })\vert \varvec{\gamma })\) (see Sect. 5).

4. Hyperparameter selection via an outer loop: \(\text {Stoploss}(\varvec{\theta })\) is used solely for optimization over the hyperparameters \(\varvec{\theta }\), while \(\varvec{X}^*(\varvec{\theta })\) itself is found conditional on the transformation parameters from \(\sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })\). This allows us to utilize tailored optimization for the main model parameters (the fitted distances) in the badness-of-fit functions.

Our proposal is a formalized hyperparameter selection procedure supplanting the standard workflow of tuning hyperparameters by trying out and comparing different solutions ad hoc. The modular design of the STOPS framework offers great flexibility: it can incorporate many different instances of badness-of-fit functions, transformations and c-structuredness indices, and it provides a computational approach that is generally applicable for continuous and discrete hyperparameter spaces \(\varvec{\Theta }\).
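The profile-type outer loop can be sketched as follows. A plain grid search stands in for the Bayesian optimization recommended later, and `fit_ps` and `structuredness_index` are hypothetical placeholders for the inner PS solver and a c-structuredness index:

```python
# Profile-type outer loop (sketch): the inner PS solver is a black box that
# returns X*(theta) and its normalized badness-of-fit; an additive Stoploss
# with a single index is evaluated only in the outer step.

def tune_stops(thetas, fit_ps, structuredness_index, v0=0.9, v1=0.1):
    """Return the candidate theta minimizing v0*fit - v1*I(X*(theta)).

    fit_ps(theta) -> (X_star, fit) and structuredness_index(X_star) -> float
    are placeholders; the negative sign on v1 favors a high index value.
    """
    best_theta, best_loss = None, float("inf")
    for theta in thetas:
        X_star, fit = fit_ps(theta)                          # inner PS step
        loss = v0 * fit - v1 * structuredness_index(X_star)  # outer Stoploss
        if loss < best_loss:
            best_theta, best_loss = theta, loss
    return best_theta, best_loss
```

In practice the candidate grid would be replaced by the Bayesian optimization strategy discussed in Sect. 6, which needs far fewer (expensive) inner PS fits.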
3 Proximity scaling losses for measuring badness-of-fit
To be more detailed about finding the configuration, recall that the problem proximity scaling solves is to find an \(N \times M\) matrix \(\varvec{X}^*\) by means of a sensible loss criterion \(\sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })=\mathcal {L}(\hat{\varvec{\Delta }}=T_\Delta (\varvec{\Delta }\vert \varvec{\theta }),\hat{\varvec{D}}(\varvec{X})=T_D(\varvec{D}(\varvec{X})\vert \varvec{\theta }), \hat{\varvec{W}}=T_W(\varvec{W}\vert \varvec{\theta }))\) that measures how closely the fitted \(\hat{\varvec{D}}(\varvec{X})\) approximates \(\hat{\varvec{\Delta }}\) (badness-of-fit). We now discuss different loss functions \(\mathcal {L}(\cdot )\) that have proven to be especially popular.
3.1 Quadratic loss: STRESS
One popular type of PS is least squares scaling, which employs the quadratic loss function. This type of loss is usually called Stress (for standardized residual sum of squares, Kruskal 1964). A general formulation of a Stress-type loss function is

\(\sigma _{\text {stress}}(\varvec{X}\vert \varvec{\theta })=\sum _{i<j} \hat{w}_{ij}\left( \hat{d}_{ij}(\varvec{X})-\hat{\delta }_{ij}\right) ^2 \qquad (5)\)
Here, the \(\hat{w}_{ij}\) are finite (transformed) input weights, with \(\hat{w}_{ij}=0\) if a proximity is missing or to be ignored.
The fitted distances in the configuration are usually some type of Minkowski distance

\(d_{ij}(\varvec{X})=\left( \sum ^M_{m=1} \vert x_{im}-x_{jm}\vert ^p\right) ^{1/p} \qquad (7)\)

with \(p > 0\), typically Euclidean, so \(p=2\). Stress losses need to be minimized iteratively. Popular algorithms for finding the minimum of (5) are majorization (De Leeuw 1977) or gradient descent (Buja and Swayne 2002).
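As a minimal numerical sketch of this recipe (our own illustration: plain raw Stress with Euclidean distances and unit weights, minimized by a crude analytic-gradient descent rather than the optimized majorization solvers cited above):

```python
import numpy as np

def raw_stress(X, Delta):
    """Raw Stress: sum of squared residuals between Euclidean configuration
    distances and dissimilarities over pairs i < j (unit weights)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(X.shape[0], k=1)
    return ((D[iu] - Delta[iu]) ** 2).sum()

def mds_gradient_descent(Delta, M=2, steps=2000, lr=0.005, seed=1):
    """Minimize raw Stress by plain gradient descent from a random start."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(Delta.shape[0], M))
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]
        D = np.sqrt((diff ** 2).sum(-1))
        np.fill_diagonal(D, 1.0)     # avoid division by zero on the diagonal
        R = D - Delta
        np.fill_diagonal(R, 0.0)     # the diagonal carries no residual
        # gradient of raw Stress w.r.t. each row of X
        X = X - lr * 2.0 * ((R / D)[:, :, None] * diff).sum(axis=1)
    return X
```

Running this on the distance matrix of a known point set drives the Stress toward zero, illustrating why (5) must be handled iteratively rather than in closed form.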
3.2 Approximation by inner product: STRAIN
The second popular type is the Strain loss function (Torgerson 1958). Here, \(\varvec{\Delta }\) is transformed to \(\hat{\varvec{\Delta }}\) so that \(T_\Delta (\varvec{\Delta }\vert \theta _\Delta )=(h\circ s)(\varvec{\Delta }\vert \varvec{\theta }_\Delta )\), where s is a function parametrized by \(\theta _\Delta \) and \(h(\cdot )\) is the double centering operation (i.e., for a matrix \(\varvec{A}\), \(h(\varvec{A})=\varvec{A}-\varvec{A}_{i.}-\varvec{A}_{.j}+\varvec{A}_{..}\), where \(\varvec{A}_{i.}, \varvec{A}_{.j}, \varvec{A}_{..}\) are matrices consisting of the row, column and grand means respectively). Subsequently \(\hat{\varvec{\Delta }}\) is approximated by the inner product matrix of \(\varvec{X}\), so, \(\hat{\varvec{\Delta }} \approx \varvec{X}\varvec{X}^\top \).
In the context of Strain we always assume that \(T_\Delta (\cdot )\) is a composite function of the double centering operation and some other parametrized function \(s(\cdot \vert \theta _\Delta )\), so we can express Strain as

\(\sigma _{\text {strain}}(\varvec{X}\vert \varvec{\theta })=\sum _{i,j}\left( \hat{\delta }_{ij}-\langle \varvec{x}_i,\varvec{x}_j\rangle \right) ^2 \qquad (8)\)

if \(d_{ij}(\varvec{X})\) is as in (7) and \(\hat{\delta }_{ij}=h(s_{ij}(-\delta _{ij}^2/2\vert \varvec{\theta }))\). Strain losses can usually be solved by an eigenvalue decomposition.
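When s is the identity, this is classical scaling, which the following sketch (our own minimal illustration) solves via double centering and an eigendecomposition:

```python
import numpy as np

def classical_mds(Delta, M=2):
    """Classical scaling: double-center -Delta**2/2 and use the top-M
    eigenpairs of the resulting inner-product matrix."""
    n = Delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = J @ (-(Delta ** 2) / 2.0) @ J     # double centering h(-Delta^2/2)
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    vals, vecs = vals[::-1][:M], vecs[:, ::-1][:, :M]
    # scale top eigenvectors by sqrt of (non-negative) eigenvalues
    return vecs * np.sqrt(np.clip(vals, 0.0, None))
```

If Delta holds Euclidean distances of some M-dimensional point set, the embedding reproduces those distances exactly up to rotation, reflection and translation, which is why no iteration is needed.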
3.3 Repulsion and attraction: energy model
Another way of interpreting the PS objective is related to energy models with a pairwise attraction part \(\propto \hat{d}_{ij}(\varvec{X})^\nu \) and a pairwise repulsion part \(\propto \hat{\delta }_{ij}\hat{d}_{ij}(\varvec{X})\) between objects (Chen and Buja 2013). This is

\(\sigma _{\text {energy}}(\varvec{X}\vert \varvec{\theta })=\sum _{i<j} \hat{w}_{ij}\left( a\,\hat{d}_{ij}(\varvec{X})^\nu - b\,\hat{\delta }_{ij}\hat{d}_{ij}(\varvec{X})\right) \qquad (9)\)

with a, b being some constants. For \(\nu =2, a=1\) and \(b=2\) this is (5) where terms depending solely on \(\hat{\varvec{\Delta }}\) are disregarded for finding \(\varvec{X}\).
4 Parametrized transformations
Many PS methods allow transforming the input dissimilarities and/or the fitted distances and/or the input weights. In this section, we discuss popular transformations that are employed with the above-mentioned proximity scaling losses.^{Footnote 2} Note that our list is not meant to be exhaustive.
We consider possible parameter vectors \(\varvec{\theta } \subseteq \{\theta _\Delta ,\theta _D,\theta _W\}\) for the transformations \(T_\Delta , T_D, T_W\) for proximities, fitted distances or weights. These are the hyperparameters we later want to tune.
4.1 Transforming observed proximities
A simple and very flexible approach is to transform the input proximities, i.e., using \(T_\Delta (\varvec{\Delta })\). The advantage is that this can easily be implemented for, and in principle applied with, all PS methods. Often it is also possible to use \(T_W(\varvec{W})\).
4.1.1 Metric scaling transformations of proximities
In metric scaling one applies a parametric bijective strictly monotonic transformation to the proximities.
With specific choices for \(T_\Delta (\cdot )\) one can express many popular PS versions, including absolute MDS with \(\hat{\delta }_{ij}=\delta _{ij}\), ratio MDS with \(\hat{\delta }_{ij}=\vert b\vert \delta _{ij}\), interval MDS with \(\hat{\delta }_{ij}=a+b\delta _{ij}\ge 0\), logarithmic MDS with \(\hat{\delta }_{ij}=a+b \log (\delta _{ij})\), and exponential MDS with \(\hat{\delta }_{ij}=a+b \exp (\delta _{ij})\) (for all of these, see Borg and Groenen 2005), power MDS with \(\hat{\delta }_{ij} =\delta _{ij}^{\lambda }\) (e.g., Buja and Swayne 2002), or, by additionally setting the \(\hat{w}_{ij}\), instances of metric scaling that use a priori inverse weighting with the observed proximities, e.g., setting \(\hat{w}_{ij}=\delta ^{-1}_{ij}\) (Sammon mapping, Sammon 1969) or \(\hat{w}_{ij}=\delta ^{-2}_{ij}\) (elastic scaling, McGee 1966), or curvilinear component analysis (Demartines and Herault 1997) for a hyperparameter \(\theta _W=\tau \) with \(\hat{\varvec{W}}=T_W(\hat{\varvec{D}}(\varvec{X})\vert \theta _W)\) being a bounded and monotonically nonincreasing function, e.g., \(\hat{w}_{ij}=1\) if \(\hat{d}_{ij}(\varvec{X})\le \tau \) and 0 otherwise.
These transformations are governed by the appropriate hyperparameters \(\theta _\Delta \) and/or \(\theta _W\) to yield the models.
4.1.2 Geodesic transformation of proximities
A popular method for manifold learning is Isomap (De’ath 1999; Tenenbaum et al. 2000), which is a PS method (originally Strain-type) with the \(\hat{\delta }_{ij}\) being the geodesic distance between objects i, j as imposed by a weighted graph. These proximities are defined as the sum of edge lengths along the shortest path \(\text {SP}(i,j\vert G(k))\) between two objects in a neighborhood graph G(k) for a given parameter k (the number of nearest neighbors), where the objects are the vertices, so \(\hat{\delta }_{ij}=\text {SP}(i,j\vert G(k))\). Alternatively, one can define the neighborhood graph in terms of an \(\varepsilon \)-radius, so \(\hat{\delta }_{ij}=\text {SP}(i,j\vert G(\varepsilon ))\).
The transformation is governed by \(\theta _\Delta =k\) or \(\theta _\Delta =\varepsilon \) as the hyperparameter that defines the neighborhood graph.
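A sketch of the geodesic transformation (our own illustration: a symmetrized k-nearest-neighbor graph followed by Floyd-Warshall shortest paths; production Isomap implementations typically use Dijkstra on sparse graphs, and we assume each object's smallest dissimilarity is its zero self-dissimilarity):

```python
import numpy as np

def geodesic_proximities(Delta, k):
    """Shortest-path lengths SP(i, j | G(k)) on the symmetrized k-nearest-
    neighbor graph built from the dissimilarity matrix Delta."""
    n = Delta.shape[0]
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nn = np.argsort(Delta[i])[1:k + 1]  # k nearest neighbors (position 0 is i itself)
        G[i, nn] = Delta[i, nn]
        G[nn, i] = Delta[nn, i]             # symmetrize the neighborhood graph
    for m in range(n):                      # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, m][:, None] + G[m, :][None, :])
    return G
```

For points on a curved manifold the resulting \(\hat{\delta }_{ij}\) exceed the direct dissimilarities, because paths are forced through the neighborhood graph; pairs in disconnected components stay at infinity, which is why the choice of k matters.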
4.2 Transforming observed proximities and fitted distances
One may also transform the distances that get fit in the configuration, that is, applying a distance transformation function \(T_D(\varvec{D}(\varvec{X}))\) to the fitted distances. This is more complicated than simply transforming the input proximities—the whole fitting process has to be adapted to accommodate the transformation of fitted distances. It is then natural to apply proximity transformation functions, distance transformation functions and/or weight transformations simultaneously, using any combination of \(T_D(\varvec{D}(\varvec{X}))\), \(T_\Delta (\varvec{\Delta })\) and \(T_W(\varvec{W})\). This allows for a rich class of models with possible parameter vectors \(\varvec{\theta } \subseteq \{\theta _\Delta ,\theta _D,\theta _W\}\), corresponding to the hyperparameters for the transformations of proximities, fitted distances and weights.
4.2.1 Power transformations for proximities and distances
Employing power transformations on fitted distances and proximities is often done (e.g., Ramsay 1982; Buja and Swayne 2002; Buja et al. 2008; Groenen and De Leeuw 2010).
A general instance of a PS type with power-transformed fitted distances is r-stress (De Leeuw 2014), the transformation being \(\hat{d}_{ij}(\varvec{X}) = d_{ij}(\varvec{X})^{2r}\) with \(r \in \mathbb {R}_+\). A number of stress versions can be expressed as special or limiting cases of r-stress, including raw Stress (\(r=0.5\), Kruskal 1964), s-stress (\(r=1\), Takane et al. 1977) and maximum likelihood MDS (Ramsay 1977) for \(r \rightarrow 0\). Here \(\theta _D=r\).
It is straightforward to extend this to use power functions as the proximity transformation function, distance transformation function and weight transformation function simultaneously (POST-MDS), so \(\hat{d}_{ij}(\varvec{X})=d_{ij}(\varvec{X})^\kappa \), \(\hat{\delta }_{ij}=\delta ^\lambda _{ij}\), and \(\hat{w}_{ij}=w_{ij}^\rho \) with \(\lambda , \rho \in \mathbb {R}, \kappa \in \mathbb {R}_+\). Inserted into (5) this is called power stress or p-stress in the literature. It subsumes many metric MDS models. Here \(\varvec{\theta }\) is a three-dimensional parameter vector, \(\varvec{\theta }=(\theta _{\Delta },\theta _D,\theta _W)^\top =(\lambda ,\kappa ,\rho )^\top \).
This encompasses Sammon- or elastic-scaling-type models by using \(\hat{w}_{ij}=\hat{\delta }_{ij}^\rho \) for the appropriate \(\rho \), and can be turned into curvilinear component analysis-type models with \(\hat{\varvec{W}}=T_W(\hat{\varvec{D}}(\varvec{X})\vert \tau )\) being a bounded and monotonically decreasing function, e.g., \(\hat{w}_{ij}=\mathbb {1}(\hat{d}_{ij}(\varvec{X}) \le \tau )\), where \(\varvec{\theta }=(\kappa ,\lambda ,\tau )^\top \).
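As a sketch of how these power transformations enter the stress sum (5), here is a single pair's p-stress contribution (our own illustration with toy values):

```python
# One pair's contribution to p-stress: power-transform the fitted distance,
# the proximity and the weight, then take the weighted squared residual.

def pstress_term(d_ij, delta_ij, w_ij, kappa, lam, rho):
    d_hat = d_ij ** kappa        # transformed fitted distance
    delta_hat = delta_ij ** lam  # transformed proximity
    w_hat = w_ij ** rho          # transformed weight
    return w_hat * (d_hat - delta_hat) ** 2
```

With kappa = lam = rho = 1 this reduces to the plain metric stress term, which is the sense in which p-stress subsumes metric MDS.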
4.2.2 Box–Cox transformations for proximities and distances
Chen and Buja (2013) propose Box–Cox transformations on observed proximities and on fitted distances in an energy badness-of-fit formulation (9). For complete data matrices it is a three-parameter energy-type MDS family, BC-MDS, \(\sigma _{\text {BC}}(\varvec{X}\vert \varvec{\theta })\):
with \(\mu , \rho \in \mathbb {R}\) and \(\lambda \in \mathbb {R}_+\). Here \(BC_\alpha \) is the one-parameter Box–Cox transformation (Box and Cox 1964) with parameter \(\alpha \),

\(BC_{\alpha }(x)={\left\{ \begin{array}{ll} \frac{x^{\alpha }-1}{\alpha } &{} \text {if } \alpha \ne 0\\ \log (x) &{} \text {if } \alpha = 0. \end{array}\right. }\)
Note that here the distance transformations used in the attraction and repulsion part are not equal as \(\lambda > 0\).
The hyperparameter vector is \(\varvec{\theta }=(\mu ,\lambda ,\rho )^\top \).
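The one-parameter Box–Cox transformation, written out directly (standard definition; the log branch is the continuous \(\alpha \rightarrow 0\) limit):

```python
import math

def box_cox(x, alpha):
    """One-parameter Box-Cox transform of x > 0: (x**alpha - 1)/alpha,
    with log(x) as the continuous limit at alpha = 0."""
    if alpha == 0:
        return math.log(x)
    return (x ** alpha - 1.0) / alpha
```

The continuity at zero is what makes the family usable for smooth hyperparameter search over \(\mu , \lambda , \rho \).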
4.2.3 Sets of local neighborhood
Yet another idea can be expressed with parametrized transformations: local MDS (lMDS, Chen and Buja 2009). Let \(N_k\) define the symmetric set of nearby pairs of points (i, j) such that \((i,j)\in N_k\) iff i is among the k-nearest neighbors of j or vice versa. Let \(\delta _{\infty }\) be a large constant “imputed” dissimilarity and w a small weight, e.g., \(w \approx 1/\delta _{\infty }\) as in the standard lMDS objective.
The lMDS objective can be expressed in our framework as (5) with the transformations

\(\hat{\delta }_{ij}={\left\{ \begin{array}{ll} \delta _{ij} &{} \text {if } (i,j) \in N_k\\ \delta _{\infty } &{} \text {if } (i,j) \notin N_k \end{array}\right. }\)

and

\(\hat{w}_{ij}={\left\{ \begin{array}{ll} 1 &{} \text {if } (i,j) \in N_k\\ w &{} \text {if } (i,j) \notin N_k. \end{array}\right. }\)
For the standard lMDS objective this can be reduced to a version with a free hyperparameter \(\tau =2w\delta _{\infty }\) for a given k, as well as the k itself (Chen and Buja 2009); hence \(\varvec{\theta }=(k,\tau )^\top \).
5 Structures, c-structuredness and indices
Central to our proposal for hyperparameter tuning in PS models is the concept of c-structuredness, as c-structuredness is often desirable from an applied point of view. We informally defined c-structuredness as the adherence of the arrangement of objects in a configuration \(\varvec{X}\) to a predefined notion of what constitutes structure. How much a structure in question is present in \(\varvec{X}\) is expressed as the amount of c-structuredness: The higher the c-structuredness, the clearer the structure is present.
Naturally, there is a very high number of possible structures. Some examples of c-structuredness that we deem to be of particular interest are given in Fig. 1 with index values as formalized below and in Appendix A in the supplementary document. The c-structuredness types we single out here are (i) c-regularity (objects arranged on a regular grid), (ii) c-association (any (non)linear association), (iii) c-clusteredness (objects arranged in clusters), (iv) c-linearity (objects arranged on a straight line), (v) c-functionality (objects arranged on a smooth line) and (vi) c-manifoldness (objects arranged so that they resemble a manifold).
For STOPS we need to quantify information about the c-structuredness present in \(\varvec{X}\). We do this with univariate c-structuredness indices that capture the essence of a particular structure in \(\varvec{X}\). The indices should be numerically high (low) the more (less) of a given structure we find. To illustrate, for c-linearity and \(M=2\), we may use the absolute value of the Pearson correlation coefficient for the columns of \(\varvec{X}\) as a c-structuredness index, which is 1 when there is a perfect noise-free linear relationship or 0 when there is no linear relationship.
We aim at indices that capture the essence of a particular structure, depend on the arrangement of objects in \(\varvec{X}\), are bounded from above and below, i.e., have unique finite minima and maxima, and are nonnegative. In what follows we list three examples of structures in a configuration as illustration. For each of these there is an index that captures the c-structuredness for that structure; these are also used in the examples.
We generally write c-structuredness indices as \(\text {I}_{\text {c-structuredness}}(\varvec{X} \vert \varvec{\gamma })\),^{Footnote 3} which means \(\text {I}_{\text {c-structuredness}}(\cdot )\) is an index that reflects c-structuredness as a function of \(\varvec{X}\), possibly depending on further index meta-parameters \(\varvec{\gamma }\) which are assumed to be given.
5.1 C-clusteredness and c-regularity
These structures are concerned with how clustered the configuration appears in the sense of Rusch et al. (2018). The concept essentially captures where \(\varvec{X}\) falls on a continuum between unclustered and maximally clustered.
Following Rusch et al. (2018), for a minimum number c of points that must comprise a cluster we denote with \(O(\varvec{X} \vert c)=(\varvec{x}_{o(i)})_{o(i) = 1, \ldots , N}\) an ordering of the N original row vectors \(\varvec{x}_i, (i=1, \ldots ,N)\) in \(\varvec{X}\); \(O(\varvec{X} \vert c)\) is a permutation of the rows of \(\varvec{X}\). The position of object \(\varvec{x}_i\) in the ordering \(O(\varvec{X} \vert c)\) is \(o(i)\) (which depends on c, but we drop this for readability). \(O(\varvec{X} \vert c)\) can be obtained by the algorithm OPTICS (Ankerst et al. 1999), which provides the bijective algorithmic mapping \(o: \{1,\ldots ,N\} \rightarrow \{1,\ldots ,N\}\). OPTICS also augments \(O(\varvec{X} \vert c)\) with each object’s so-called representative reachability distance \(r^*_{o(i)}\), which cannot be expressed in closed form.
For the index of c-clusteredness we then use the normalized OPTICS Cordillera, \(\text {OC}'(\varvec{X})\) (Rusch et al. 2018),
where \(\varvec{\gamma }=(c,\epsilon ,q,d_{\max })^\top \) collects the four free meta-parameters of the OPTICS Cordillera. Here \(q \ge 1\) designates a q-norm, \(\epsilon \) is the maximum radius around each point to look for neighbors and \(d_{\max }\) denotes a maximum reference distance, \(\min d_{ij} \le d_{\max } \le \epsilon \), that winsorizes the \(r^*_{o(i)}\) for robustness. Apart from c the meta-parameters are optional; we suggest fixing them as \(d_{\max }=1.5 \max (d_{ij}(\varvec{X}^0))\), \(\epsilon =2 \max _{i,j}(\max (\delta _{ij},d_{ij}(\varvec{X}^0)))\) and q as the norm of the target space (\(\varvec{X}^0\) refers to the Strain solution for untransformed proximities). The larger this index value, the more c-clusteredness we find. See the middle left panel in Fig. 1 for an example.
The \(\text {OC}'(\varvec{X})\) takes on its minimum value when for each point the distance to the \(c\)th neighbor is constant (Rusch et al. 2018). We can use this to fashion an index for c-regularity, a structure where the observations are arranged so that all nearest neighbor distances are equal. We fix the parameter c to 2 (nearest neighbor) and \(q=1\) and therefore use
\(\text {I}_{\text {c-regularity}}(\varvec{X} \vert \varvec{\gamma }) = 1-\text {OC}'(\varvec{X}\vert c=2,q=1,\epsilon ,d_{\max }),\)
with \(\varvec{\gamma }=(\epsilon ,d_{\max })^\top \). This index will be 1 if the point arrangement is perfectly regular and 0 if perfectly clustered with respect to N, \(c=2\) and \(d_{\max }\). See the top left panel in Fig. 1.
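The intuition behind the index (all nearest-neighbor distances equal means perfect regularity) can be conveyed with a much cruder stand-in: one minus the coefficient of variation of nearest-neighbor distances. This is NOT the OPTICS-Cordillera-based index above; the function name and construction are ours, for illustration only:

```python
import numpy as np

def nn_regularity(X):
    """Crude c-regularity surrogate: 1 minus the coefficient of variation
    of nearest-neighbour distances. Equals 1 exactly when all nearest-
    neighbour distances coincide (a perfectly regular arrangement) and
    drops as points clump together. Not the OPTICS Cordillera index."""
    X = np.asarray(X, dtype=float)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)
    nn = D.min(axis=1)   # distance from each point to its nearest neighbour
    return 1.0 - nn.std() / nn.mean()

# On a 5 x 5 unit grid every nearest-neighbour distance is 1.
grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
print(nn_regularity(grid))  # 1.0
```

A clustered configuration, by contrast, has strongly varying nearest-neighbor distances and yields a noticeably smaller value.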
Using these indices with negative weights in Stoploss will favor parameters that give configurations with high c-clusteredness/c-regularity, whereas positive weights favor low c-clusteredness/c-regularity.
5.2 C-manifoldness
This structure captures how close almost arbitrary, real-valued transformed projections of columns of \(\varvec{X}\) lie to a 1D manifold (a line) in the transformed space.
We use the maximal correlation coefficient (Gebelein 1941; Sarmanov 1958) for this; the higher the coefficient, the stronger the relationship. For a sample, the maximal correlation can be calculated approximately with the alternating conditional expectations algorithm (ACE, Breiman and Friedman 1985). Let \(x_m\) denote the mth column vector of \(\varvec{X}, (m=1,\dots ,M)\). To obtain the sample version, we take the pairs \((x_{ik}, x_{il}), i=1,\dots ,N\), input them into the ACE algorithm and use the output, denoted \(\text {ACE}(x_k,x_l)\), to construct a c-structuredness index for a given \(\varvec{X}\).
We then use an aggregation function \(\phi _{k,l}(\cdot )\) (e.g., the maximum) of the ACE value between any two different columns \(x_k,x_l; k \ne l\), so
\(\text {I}_{\text {c-manifoldness}}(\varvec{X}) = \phi _{k,l}\left( \text {ACE}(x_k,x_l)\right) .\)
An example of such a relationship is given in the bottom right of Fig. 1. As before, a negative weight for this index in Stoploss would favor \(\varvec{\theta }\) that provide higher index values and positive weights lower index values.
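The ACE idea can be sketched compactly. The following rough sketch alternately replaces each variable's score by the conditional mean of the other's score within quantile bins; Breiman and Friedman (1985) use data-driven smoothers instead of fixed bins, so this simplification is ours and for illustration only:

```python
import numpy as np

def ace_maxcor(x, y, bins=10, iters=50):
    """Rough sketch of maximal correlation via the ACE idea: alternately
    set each variable's score to the conditional mean of the other's
    score within quantile bins, then report the correlation of the final
    scores. A binned simplification of Breiman & Friedman's algorithm."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    bx = np.digitize(x, np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1]))
    by = np.digitize(y, np.quantile(y, np.linspace(0, 1, bins + 1)[1:-1]))
    theta = (y - y.mean()) / y.std()                  # initial score of y
    for _ in range(iters):
        # phi(x) = E[theta(y) | x], estimated by x-bin means
        m = np.bincount(bx, theta, bins) / np.maximum(np.bincount(bx, None, bins), 1)
        phi = m[bx]
        # theta(y) = E[phi(x) | y], estimated by y-bin means, re-standardized
        m = np.bincount(by, phi, bins) / np.maximum(np.bincount(by, None, bins), 1)
        theta = m[by]
        theta = (theta - theta.mean()) / theta.std()
    return abs(np.corrcoef(phi, theta)[0, 1])

# A noisy parabola: near-zero Pearson correlation, high maximal correlation.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 2000)
y = x**2 + 0.01 * rng.normal(size=2000)
print(ace_maxcor(x, y))  # close to 1, unlike |Pearson correlation|
```

The index from the text would then aggregate over column pairs, e.g., with the maximum as \(\phi _{k,l}\): `max(ace_maxcor(X[:, k], X[:, l]) for k in range(M) for l in range(k))`.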
These are instances of the c-structuredness concept as used in Sect. 7. Details on these and more structures that may be of interest can be found in Appendix A in the supplementary document. We are quick to point out that other structures/indices are conceivable and can easily be incorporated in our framework, e.g., the ones from France and Carroll (2007).
6 Optimization for STOPS
Our approach to searching for the hyperparameters builds on an idea in Rusch et al. (2021). We view hyperparameter tuning via STOPS as a multiobjective optimization problem of either minimizing badness-of-fit, or maximizing/minimizing c-structuredness indices, or both over \(\varvec{\theta }\). We choose a scalarization approach leading to variants of \(\text {Stoploss}(\varvec{\theta })\) as in (2) and (3) and the resulting optimization problem (4).
Minimizing any instance of (4) as a function of \(\varvec{\theta }\) is a difficult, analytically intractable problem. There are several reasons for this: First, concrete instances of Stoploss result from combining any badness-of-fit measure with any number of different c-structuredness indices, which leads to a high number of possible concrete instances. These concrete instances need not share structural similarities that can be exploited for optimization. Second, the optimization surface over \(\varvec{\theta }\) can be difficult for different STOPS models in that it is typically nonconvex, may have local optima, may have (removable or jump) discontinuities and may be smooth or nonsmooth to varying degrees over regions of the support of \(\varvec{\theta }\). Third, \(\varvec{\theta }\) may be continuous or discrete. All of these problems are empirically illustrated for a typical Stoploss (the one from Sect. 7.1) in Appendix C in the supplement, where we also elaborate on the nature of Stoploss.
We therefore aim at a general purpose approach that can deal with the intractability and thus be used for all conceivable Stoplosses. To that end we approach the optimization problem with the nested profile-type strategy laid out in Algorithm 1, which only employs function evaluations; u denotes the iteration step, and we do not list the given \(\varvec{\Delta }\) and \(v_0,\dots ,v_l\) for \(\text {Stoploss}(\varvec{\theta })\) in the algorithm. \(\varvec{\Theta }\) denotes the feasible set of \(\varvec{\theta }\) (e.g., box constraints). The \(\sigma _{\text {PS}}\left( \varvec{X}^*(\varvec{\theta })\vert \varvec{\theta } \right) \) employed should be scale- and unit-free, e.g., normalized to lie between zero and one (see Appendix B in the supplementary document).
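The nested profile-type strategy can be sketched as follows. All names here are hypothetical stand-ins: `solve_ps` represents the inner PS solve, the weights `v` and index functions correspond to the additive scalarization, and a plain grid over candidates replaces the candidate generator discussed next:

```python
import numpy as np

def stoploss(theta, solve_ps, indices, v):
    """Additive Stoploss: v[0] * badness-of-fit + sum of v[l] * index_l(X*).
    `solve_ps` stands in for the inner PS solve of Algorithm 1: given theta
    it returns the optimal configuration X*(theta) together with its
    normalized badness-of-fit. All names here are hypothetical."""
    X_star, fit = solve_ps(theta)
    return v[0] * fit + sum(vl * idx(X_star) for vl, idx in zip(v[1:], indices))

def profile_search(candidates, solve_ps, indices, v):
    """Outer step: evaluate Stoploss at each proposed theta and keep the
    best. Plain grid search as candidate generator; for costly Stoplosses
    Bayesian optimization is the recommended generator instead."""
    values = [stoploss(t, solve_ps, indices, v) for t in candidates]
    u = int(np.argmin(values))
    return candidates[u], values[u]

# Toy inner solver: pretend the fit is worst away from theta = 2.
toy_solve = lambda t: (np.zeros((3, 2)), abs(t - 2.0))
print(profile_search([0.0, 1.0, 2.0, 3.0], toy_solve, [], [1.0]))  # (2.0, 0.0)
```

The key design point is the nesting: the inner solver is an unmodified PS routine, and the outer step only ever sees function values of Stoploss.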
To generate candidates for \(\varvec{\theta }_u\) in Step 3 of Algorithm 1, different general purpose strategies for global optimization like random search, grid search and/or derivative-free metaheuristics can in principle be employed. We have had good experiences with particle swarm optimization (Eberhart and Kennedy 1995) but also with random search and the adaptive Luus-Jaakola pattern search (Luus and Jaakola 1973; Rusch et al. 2021) for a moderate number of iterations (around 100), so these strategies can be recommended.
That said, one of the contributions of this article is to suggest a generally applicable default strategy that can be used with, and promises to work well for, every conceivable instance of Stoploss. Working well here means suggesting candidates for \(\varvec{\theta }\) that lead to relatively low values of the Stoploss objectives in an efficient manner (i.e., for a relatively small number of iterations), as each function evaluation in STOPS can be very costly (especially so for transformations of the \(\varvec{D}(\varvec{X})\)) due to having to solve a PS problem. To establish a general purpose strategy, we empirically investigated a number of solvers, including simulated annealing (Kirkpatrick et al. 1983), random search, the adaptive Luus-Jaakola algorithm, particle swarm optimization, and different versions of Bayesian optimization (BO, Mockus 1989) on different Stoplosses and different data sets (not shown).
What emerged as a generally applicable strategy that worked well in most cases (in the sense of suggesting candidates with lower “minima” than its competitors for a small number of iterations of around 10-20) was Bayesian optimization with a treed Gaussian process with jumps to the limiting linear model as the surrogate model (TGPLLM, Gramacy and Lee 2008), which we elaborate on below. This strategy not only worked well for efficiently finding good candidates for \(\varvec{\theta }\) in all the data instances we used it for, but is also theoretically able to accommodate the aforementioned structural difficulties of Stoploss (see also Appendix C in the Supplement). We note that particle swarm optimization has often performed on par and sometimes better for a higher number of iterations.
Optimizing Stoploss with Bayesian optimization When investigating different general purpose solvers for optimization in STOPS we found that Bayesian optimization (BO, Mockus 1989) lends itself well as an out-of-the-box solution for suggesting candidates for \(\varvec{\theta }\) in Step 3 of Algorithm 1.
The basic idea in Bayesian optimization is to approximate the unknown Stoploss surface with a flexible surrogate model (“prior”). The surrogate model identifies areas for exploration where we can expect improvement of the objective by maximizing an acquisition function over the surrogate surface. From these areas new candidate(s) for \(\varvec{\theta }\) are sampled and the objective function is evaluated at the new candidate(s) (“data”). Then the surrogate model is updated (“posterior”) to reflect this new information, the acquisition function of the refitted surrogate model is again maximized and the whole process repeats. Hence in each iteration a new candidate is chosen that is expected to improve the objective of interest based on the available information, thus trading off exploitation and exploration in an efficient way.
This approach works well for STOPS due to three aspects: First, BO needs only function evaluations, so the modularity of STOPS and the lack of exploitable structure is no hindrance. Second, BO is competitive in situations where the parameter vector is low-dimensional (e.g., Siivola et al. 2021), as is the case for all the \(\varvec{\theta }\) we outlined in Sect. 4 (with at most three elements). Third, fitting the surrogate model and optimizing the acquisition function may be less expensive than evaluating the objective function, as the cost of finding a configuration in PS can be quite high; BO can then dramatically reduce the number of evaluations necessary to get close to a global optimum.
To describe the BO approach more formally, we have the unknown objective function \(\text {Stoploss}(\varvec{\theta }): \varvec{\Theta } \rightarrow \mathbb {R}\). For notational convenience we write \(\mathcal {Y}:=\text {Stoploss}(\varvec{\theta })\) for the function and \(\mathcal {Y}_u:=\text {Stoploss}(\varvec{\theta }_u)\) for the \(u\)th evaluation of the function at \(\varvec{\theta }_u\). We have a surrogate model for the objective, \(\mathcal {M}(\varvec{\theta },\mathcal {Y},\iota )\), with possible surrogate model meta-parameters \(\iota \), fitted to the sequence of U pairs \(\{\varvec{\theta }_u,\mathcal {Y}_u\}_{u=1}^{U}\). We also have the acquisition function \(\Omega \left( \varvec{\theta }\vert \{\varvec{\theta }_u,\mathcal {Y}_u\},\mathcal {M}(\varvec{\theta },\mathcal {Y},\iota )\right) \). We look for
\(\varvec{\theta }_{u+1} = \arg \max _{\varvec{\theta } \in \varvec{\Theta }} \Omega \left( \varvec{\theta }\vert \{\varvec{\theta }_u,\mathcal {Y}_u\},\mathcal {M}(\varvec{\theta },\mathcal {Y},\iota )\right) ,\)
which constitutes the new candidate. Then \(\mathcal {Y}_{u+1}\) is evaluated and the model \(\mathcal {M}\) is updated with the new point \(\{\varvec{\theta }_{u+1},\mathcal {Y}_{u+1}\}\), yielding the “posterior”,
Then the acquisition function gets maximized for the updated data and the whole process repeats until some termination criterion is met.
For \(\Omega (\varvec{\theta }\vert \{\varvec{\theta }_u,\mathcal {Y}_u\}, \mathcal {M}(\varvec{\theta },\mathcal {Y},\iota ))\) we use the expected improvement (EI) criterion (Jones et al. 1998), which is
\(\Omega \left( \varvec{\theta }\vert \{\varvec{\theta }_u,\mathcal {Y}_u\},\mathcal {M}(\varvec{\theta },\mathcal {Y},\iota )\right) = E\left[ \max \left( \mathcal {Y}^*-\mathcal {Y},0\right) \right] ,\)
with \(\mathcal {Y}^*=\min \mathcal {Y}_u\). EI has shown good behavior over a wide array of tasks (Bergstra et al. 2011).
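Under a Gaussian surrogate posterior the EI criterion for minimization has the familiar closed form, which can be sketched as follows (a generic sketch from the surrogate's posterior mean and standard deviation, not the TGPLLM implementation):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """EI acquisition for minimization, evaluated from the surrogate's
    posterior mean `mu` and standard deviation `sigma` at a candidate:
    EI = (y* - mu) * Phi(z) + sigma * phi(z) with z = (y* - mu) / sigma,
    and EI = max(y* - mu, 0) in the noise-free limit sigma = 0."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    safe = np.where(sigma > 0, sigma, 1.0)       # avoid division by zero
    z = (y_best - mu) / safe
    ei = (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, np.maximum(y_best - mu, 0.0))
```

The new candidate \(\varvec{\theta }_{u+1}\) is the maximizer of this quantity over \(\varvec{\Theta }\); note how larger posterior uncertainty `sigma` raises EI, which is exactly the exploration incentive described above.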
What is crucial for the performance of BO is how well the surrogate model \(\mathcal {M}\) is able to approximate the unknown objective. To find a good surrogate model \(\mathcal {M}\) for our case, we empirically investigated the behavior of different \(\text {Stoploss}(\varvec{\theta })\) over \(\varvec{\theta }\) for a number of different data sets; see the examples in Appendix C in the supplementary document.
Given the nature of Stoploss we mentioned before, we looked for a surrogate model that is nonstationary, allows for jumps and piecewise constant regions (e.g., for discrete \(\varvec{\Theta }\)) and allows sufficiently rough Gaussian processes within segments over \(\varvec{\theta }\) (for nonconvexity). We found it in the treed Gaussian process with jumps to the limiting linear model, with the separable power exponential family for the correlation structure \(\iota \). This process recursively partitions the search space into nonoverlapping segments (like a regression tree) and for each segment independent Gaussian processes are used, which have linear models as their limit. This allows nonstationarity over the whole search space and accommodates piecewise linear or constant areas of the search space (which is useful especially for discrete \(\varvec{\theta }\)). Of further note is that the independent GPs do not have to connect at the boundaries of the segments, thus allowing for jump discontinuities. The specification and estimation of this process is fully Bayesian; for details, see Gramacy and Lee (2008).
BO with TGPLLM is sufficiently flexible and general for optimizing Stoploss for the dimensionality of \(\varvec{\theta }\) that we are faced with, both from a theoretical perspective and over all our empirical investigations. We thus recommend it as the default approach for generating candidates for optimizing \(\text {Stoploss}(\varvec{\theta })\) in a small number of iterations. We point out that there may be combinations of data, hyperparameters and Stoploss specification for which a different approach may be more accurate or efficient, e.g., the example in Sect. 7.1, where a crude grid search is sufficient. It is also possible to develop tailored optimization approaches for concrete Stoplosses that exploit structure and perform better. Also, for a \(\varvec{\theta }\) with more than 20 dimensions, standard BO starts to perform less well. Nevertheless, we see BO with the TGPLLM surrogate model as a general strategy that can be successfully used with every conceivable Stoploss that can be derived from the suggestions within this article.
7 Application
In this section, we demonstrate how the STOPS framework can be used for tuning hyperparameters in PS for individual data analytical instances. For the purpose of illustration, we consider a diverse set of Stoplosses comprising different PS badness-of-fit measures, transformations and c-structuredness indices.
7.1 Unrolling the swiss roll
As a simple example for illustrating the concept, we use the STOPS framework to select hyperparameters for the geodesic distance function in Isomap to unroll the classic swiss roll regularly. In the swiss roll example, data lie equally spaced on a spiral manifold (“swiss roll”) embedded in a higher dimensional space. Proximity scaling methods that emphasize local structure are able to flatten out this manifold in the target space. One of the most popular PS variants for doing this is Isomap.
We use an example with 150 data points in three dimensions lying on a grid on the embedded manifold, where along the y dimension there are five points and along the x dimension there are 30. We code the points in a diverging palette from the center of the roll along the spiral direction. The flattening-out operation can be considered successful if the same shades are arranged vertically, the palette runs from left to right, and the grid is recovered.
As described in Sect. 4.1.2, Isomap has a governing hyperparameter for the calculation of the geodesic distances. In line with our objective of flattening out the swiss roll, we are looking for a solution with objects arranged on a regular grid and also want to preserve the neighborhood of points around each point as well as possible (specifically preserving a neighborhood of 5 points). In terms of c-structuredness, one way to measure whether the objects are arranged on a grid is c-regularity (see Sect. 5.1), and preservation of the neighborhood can be assessed with c-faithfulness (see Appendix A.7 in the supplementary document), both of which would ideally be 1.
We use the \(\varepsilon \)-version of Isomap and may now set the hyperparameter for the geodesic distances to different values and tune by inspecting for which hyperparameter values we get a faithful, regular representation (e.g., manually or via a grid search). We may also tune the hyperparameter with the STOPS framework automatically.
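The geodesic-distance step of the \(\varepsilon \)-version of Isomap can be sketched as follows (the subsequent MDS step on the geodesic distances is omitted; the function name is ours):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def epsilon_geodesic(X, eps):
    """Geodesic distance step of the eps-variant of Isomap: connect all
    pairs of points closer than eps and take graph shortest paths. The
    result is the dissimilarity matrix handed to the MDS step (omitted
    here); entries are inf when eps is too small to connect the graph."""
    X = np.asarray(X, dtype=float)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    W = np.where(D <= eps, D, np.inf)   # keep only eps-neighbourhood edges
    return shortest_path(W, method="D", directed=False)

# Points on a quarter circle: with a small eps the geodesic between the
# endpoints follows the arc (about pi/2) rather than the straight chord.
t = np.linspace(0.0, np.pi / 2, 50)
X = np.column_stack([np.cos(t), np.sin(t)])
G = epsilon_geodesic(X, eps=0.05)
print(G[0, -1])  # about 1.57, versus a Euclidean chord of about 1.41
```

This also shows why \(\varepsilon \) needs tuning: too small a value disconnects the neighborhood graph (infinite entries), too large a value short-circuits the manifold and the geodesic collapses towards the Euclidean distance.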
In Table 1 we list values of normalized stress, c-regularity and c-faithfulness for different \(\varepsilon \) (our \(\varvec{\theta }\)) obtained in a grid search from 0.09 to 0.3 by increments of 0.02. Here the best values for c-regularity and c-faithfulness would be obtained^{Footnote 4} with \(\varepsilon ^*= 0.11\).
Alternatively, we can use STOPS with Isomap and a weight of 0 for the badness-of-fit and \(-1\) for c-regularity and c-faithfulness respectively (negative, as we want to maximize them). We use the mSTOPS variant for \(\text {Stoploss}(\varvec{\theta })\), combining the two c-structuredness indices multiplicatively. The search space for \(\varepsilon \) is set to be between 0.09 and 0.3. We use BO with TGPLLM for 10 iterations. Note that due to our objective we only consider c-regularity and c-faithfulness and disregard stress, as the latter measures how closely the fitted distances approximate the dissimilarities (which is of no concern for our choice of \(\varepsilon \) here). The role of a fit measure is played by c-faithfulness.
Tuning the hyperparameter with the mSTOPS procedure^{Footnote 5} suggests using \(\varepsilon ^*=0.1095\). The resulting configuration can be found in the upper panel of Fig. 2. This is the most faithful and regular representation that we obtain by varying \(\varepsilon \), with values of 0.9911 for c-regularity and 0.8293 for c-faithfulness, although we cannot achieve a perfectly regular rectangle. Both values either match or improve on the corresponding optimal values we found in the grid search for \(\varepsilon =0.11\). For comparison, a configuration with nonoptimal \(\varepsilon =0.21\) is shown in the lower panel of Fig. 2. This solution has c-regularity equal to the optimal one, but lower c-faithfulness due to the difficulty of representing the inner part of the swiss roll for this \(\varepsilon \).
7.2 Handwritten digits
We consider a second application of the STOPS framework, this time on handwritten digits data from Alimoglu and Alpaydin (1997). Following Izenman (2009), the original data were obtained from 44 people who handwrote the digits \(0,\dots ,9\) 250 times by following the trajectory of the pen on a \(500 \times 500\) grid which was then normalized. From the normalized trajectory, eight bivariate points were randomly selected for each handwritten digit, leading to 16 variables per digit. We look at a random sample of 500 written digits.
To these data we apply different nonlinear dimensionality reduction methods that allow for parametric transformations. We use the STOPS framework to select the transformation parameters for visualization of the multivariate similarities with emphasis on two structures: clusters, to make the separation of digits clear, and an underlying manifold. The PS methods we tune with STOPS are Sammon mapping with power transformations of the proximities, Box–Cox MDS, POSTMDS and local MDS. This is a relatively large data set in an MDS context and fitting a single PS can already be costly, so using BO is an efficient strategy.
We use aSTOPS with normalized stress with a weight of \(v_0=1\), c-clusteredness (see Sect. 5.1) with a high weight of \(v_1=-400\) and c-manifoldness (Sect. 5.2) with a moderate weight of \(v_2=-2.5\) (negative weights, as both types of c-structuredness are to be maximized).^{Footnote 6} The OPTICS Cordillera index uses \(q=2\), \(c=5\), \(\epsilon =10\) and \(d_{\max }=0.6\); the other indices use the default values. For comparability of indices the configurations are rescaled so that the highest column variance is 1. BO with TGPLLM is carried out in 10 steps.^{Footnote 7}
The optimal hyperparameter for the Sammon mapping with power transformation was found at \(\lambda ^*=5.4\). The c-clusteredness value was 0.15 and the c-manifoldness value was 0.928. The configuration is shown in the top left panel of Fig. 3.
For lMDS the transformation hyperparameter values selected by STOPS were \(k^*=14.73, \tau ^*=3\). With them the obtained solution has a c-clusteredness value of 0.094 and a c-manifoldness value of 0.953. The configuration is shown in the top right panel of Fig. 3.
The transformation hyperparameters for Box–Cox MDS were found at \(\mu ^*=4.631, \lambda ^*=4.859, \rho ^*=1.285\). The c-clusteredness value was 0.124 and the c-manifoldness value was 0.961. The configuration is shown in the bottom left panel of Fig. 3.
Lastly, for POSTMDS with \(w_{ij}=\delta _{ij}\), the transformation hyperparameter values were \(\kappa ^*=0.787, \lambda ^*=3.766, \rho ^*=2.979\). The c-clusteredness value was 0.116 and the c-manifoldness value was 0.862. The configuration for the POSTMDS solution is shown in the bottom right panel of Fig. 3.
Overall, we can see that the STOPS hyperparameter selection leads to configurations that, in general, arrange the digits in a sensible way: Due to the high c-clusteredness weight, similar-looking digits are positioned visually close to each other, e.g., 0, 5, 8 or 1, 2, 7. This is rather pronounced for lMDS, power Sammon and POSTMDS; BCMDS produces an arrangement in three clusters that is less in accordance with the other representations and the ground truth. The c-manifoldness weight ensured that each arrangement does not deviate too far from an appreciable, imagined submanifold. For every method, using the optimal hyperparameters from STOPS improves the c-structuredness indices as compared to an untransformed Sammon solution (which had c-clusteredness of 0.0833 and c-manifoldness of 0.5939).
For comparison, we also include two configurations in Fig. 3 obtained from Sammon mapping and lMDS respectively, where the c-structuredness weights have been set to \(v_1=v_2=0\) and \(v_0=1\), therefore disregarding c-structuredness and only minimizing the badness-of-fit criterion over \(\varvec{\theta }\). We can see that disregarding c-clusteredness leads to less c-structuredness: for the Sammon mapping result, clusters are less discernible in the bottom left plot than in the top left of Fig. 3 (especially when ignoring the coloring), and for lMDS clusters are hard to make out in the configuration obtained when disregarding c-clusteredness (bottom right plot vs. top right plot).
7.3 Republican mantras
For our last illustration of the STOPS framework in action, we turn to the “Republican Mantra” data from Mair et al. (2014). The data are natural language texts that were periodically obtained from the website of the Republican party (Grand Old Party, or GOP) in the USA, which at that time hosted a section called “Republican Faces”. In this section, supporters of the Republican party gave a short statement about why they see themselves as Republican. The statements always started with “I’m a Republican, because...” followed by the person’s personal ending. An example statement would be “...I believe in a free market society which enables hard work to equal success”. Mair et al. (2014) used MDS to explore the document term matrix of these statements but encountered difficulties when using the cosine distance between the words due to an almost-equal-dissimilarities artifact (see the first row of Fig. 4; for illustration, this arrangement has c-clusteredness of 0.053 with respect to \(c=6, d_{\max }=1.2, \epsilon =10, q=2\)). This prompted Mair et al. (2014) to abandon the cosine distance in favor of a co-occurrence-based dissimilarity measure, the gravity similarity, which was subjected to power transformations. Of note is that the concrete transformations they used in their work were chosen manually and ad hoc.
We revisit the data from the original angle: We retain the cosine distance that the authors originally aimed for but employ the STOPS framework to guide the PS result towards more c-structuredness by choosing power transformations in a flexible PS version that is appropriate for promoting structures of interest. This also serves as an empirical example of how an approach that relied on manual trial-and-error hyperparameter selection would have benefited from the STOPS framework. To illustrate the versatility of STOPS we use aSTOPS and mSTOPS with different structures and weights.
Additive STOPS: high c-clusteredness, high c-association, low c-complexity With aSTOPS we select parameters in a POSTMDS with cosine dissimilarities so that there is a focus on c-clusteredness (see Sect. 5.1) and c-association (Sect. A.1 in the Supplement) but favoring lower c-complexity (Sect. A.2 in the Supplement) for the association. To translate this into STOPS model weights, we used \(-10, -5\) and 1 as weights for the c-structuredness indices. The weight for stress was 1. Essentially this means that c-clusteredness is weighted twice as important as c-association, which is valued 5 times as important as stress. C-complexity is more of an afterthought, but for two results with similar c-clusteredness and c-association, we prefer the one with lower complexity, traded off equally with badness-of-fit. We note that the weights we use are somewhat arbitrary, but what is clear is that all in all we relax the fit criterion quite a bit to allow for high c-clusteredness and c-functionality. The meta-parameters in \(\varvec{\gamma }\) are \(c=6, d_{\max }=1.2, \epsilon =10, q=2\) for c-clusteredness and \(\omega =0.9\) for c-association and c-complexity.
The resulting configuration is displayed in the second row of Fig. 4. The objects are arranged in clusters close to a circumplex structure, reflecting the STOPS setup of a clustered arrangement with a relatively simple nonlinear association. It stems from a POSTMDS with \(w_{ij}=\delta _{ij}\) and parameters \(\varvec{\theta }^*=(2.635,1.185,3)^\top \). The square root of the normalized stress value is 0.63 and the c-structuredness indices are 0.386 for c-clusteredness, 0.999 for c-association and 3.807 for c-complexity.
Multiplicative STOPS: high c-nonmonotonicity, moderate c-association With mSTOPS we again select parameters in a POSTMDS with cosine dissimilarities, this time focusing on having a nonmonotonic, nonlinear associative structure in the target space, so high c-nonmonotonicity (Sect. A.10 in the Supplement) and moderate c-association (Sect. A.1 in the Supplement). Essentially we want the objects to be projected close to a functional that should be highly nonmonotonic.^{Footnote 8} To achieve this, badness-of-fit is allowed to become moderately high relative to the c-structuredness, but not as high as before with aSTOPS. This translates to weights of \(-2\) and \(-1\) for c-nonmonotonicity and c-association respectively, and a weight for stress of 1. The meta-parameters in \(\varvec{\gamma }\) were \(\omega =0.9\) for c-nonmonotonicity and c-association.
The resulting configuration is displayed in the last row of Fig. 4. We see that there is an associative structure that is rather complicated and highly nonmonotonic (reminiscent of a trefoil). It stems from a POSTMDS with \(w_{ij}=\delta _{ij}\) and parameter values \(\varvec{\theta }^*=(2.443,19.756,2.984)^\top \). The square root of the normalized stress value is 0.545 and the c-structuredness indices are 0.469 for c-nonmonotonicity and 0.999 for c-association.
7.4 Artifactual c-structuredness
There is an application aspect of selecting hyperparameters based on partially optimizing for c-structuredness that we need to point out: It is possible that this “forces” a configuration to look a certain way even if that does not correspond to the ground truth in the high-dimensional data or in the data generating process, so the c-structuredness exhibited can be an artifact of using STOPS rather than inherent in the data. In that sense, STOPS can artificially generate c-structuredness instead of uncovering “real” c-structuredness.^{Footnote 9} Two situations are of particular interest to discuss: First, if the \(\varvec{\Delta }\) are not at all structured (they exhibit the null structure, i.e., one resulting from perfectly equal dissimilarities; Buja and Swayne 2002), and second, if the \(\varvec{\Delta }\) exhibit some ground truth structures but these structures are not represented in the set of structures used for a STOPS model. We address both of these situations briefly.
In the first situation, the data correspond to the null structure of equal dissimilarities (\(\delta _{ij}=\delta =const.\)), i.e., the objects form a regular simplex in \((N-1)\)-dimensional space. For all the PS methods we mentioned, this leads to highly structured configuration artifacts for the fitted distances. The effect that STOPS has in this situation depends on the transformations used for the \(\varvec{\Delta }\) (and, by extension, the transformation vector \(\theta _{\Delta }\)). Generally speaking, if after applying transformations to the \(\varvec{\Delta }\) the elements of \(\hat{\varvec{\Delta }}\) are still equal, then the null structure is preserved in the sense of producing the same configuration artifact in the target space for the fitted distances, regardless of the choice of transformation for \(\hat{\varvec{\Delta }}\). This is the case when all \(\delta _{ij}\) are subjected to the same deterministic transformation for all i, j. STOPS will then not induce c-structuredness artificially beyond the effect of the null structure on the PS configuration. This holds true for all of the transformations and PS methods we discussed in this article.
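The invariance argument can be checked directly: an elementwise deterministic transformation (here, power transformations) of a constant dissimilarity matrix is again constant, so up to overall scale the PS input and hence the configuration artifact does not change. A small illustrative check:

```python
import numpy as np

# Null structure: all off-diagonal dissimilarities equal (regular simplex).
N, delta = 6, 3.0
Delta = np.full((N, N), delta)
np.fill_diagonal(Delta, 0.0)

for lam in (0.5, 1.0, 2.0, 3.5):       # elementwise power transformations
    hat = Delta ** lam                 # same deterministic map for every (i, j)
    off = hat[~np.eye(N, dtype=bool)]  # transformed off-diagonal entries
    # still perfectly equal, so the PS configuration artifact is unchanged
    assert np.allclose(off, off[0])
print("null structure preserved under power transforms")
```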
In the second situation the elements of \(\varvec{\Delta }\) are differentiated and informative for the ground truth structure, but STOPS does not take the correct structure into account. Two mechanisms are relevant here. First, the ground truth structure in the high-dimensional data cannot be recovered with the chosen PS method. Using STOPS may then in general also fail to uncover the structure. This problem can be mitigated within STOPS, however, by including structures that measure how faithful the configuration is to the high-dimensional ground truth and/or the original data, e.g., with c-faithfulness (see Appendix A.7 in the supplementary document). That way it is possible for STOPS to actually improve recovery of the ground truth via the PS method, as STOPS selects transformations that then partially optimize for the ground truth structure.
Second, the PS method specified for STOPS is able to recover the ground truth, but this ground truth is not among the structures specified. Then using the wrong c-structuredness index may lead STOPS to select hyperparameters that artificially distort the mapping towards the c-structuredness that was specified. Using STOPS may then in general induce artificial c-structuredness (because it focuses on the wrong structure), distort or fail to uncover the ground truth structure in the result (because it does not consider the correct structure), recover only parts of the correct structure (if the correct structure shares aspects with the wrong structure), or any combination of these. STOPS has one mechanism that is meant to mitigate this, which is to put relatively high weight on the badness-of-fit part (via \(v_0\)) and relatively little weight on c-structuredness (via \(v_l\)), with the extreme of no weight on c-structuredness (\(v_l=0, \forall l\)). This makes STOPS select hyperparameters that lead to configurations that are still faithful to the \(\hat{\varvec{\Delta }}\) in the sense of mainly minimizing the badness-of-fit over \(\varvec{\theta }\). To gauge the extent of the distortion, one can compare the STOPS result obtained with c-structuredness weights different from 0 to the STOPS result obtained with all \(v_l=0\).
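The role of the weights \(v_0\) and \(v_l\) can be sketched as follows. This is our own hedged illustration in Python, not the stops implementation; the function name and the sign convention (higher index values meaning more c-structuredness, entering negatively since Stoploss is minimized) are assumptions for the sketch.

```python
def stoploss(badness_of_fit, structuredness, v0, v):
    """Scalarized Stoploss sketch: v0 weights badness-of-fit, each v[l]
    weights one c-structuredness index. Indices enter with a negative
    sign here because more structuredness should lower the loss."""
    return v0 * badness_of_fit - sum(vl * sl for vl, sl in zip(v, structuredness))

# With all c-structuredness weights zero (v_l = 0 for all l), only the
# badness-of-fit matters, i.e., plain PS hyperparameter selection:
stoploss(0.151, [0.083, 0.5939], v0=1.0, v=[0.0, 0.0])  # = 0.151
```

Comparing results for \(v_l=0\) with results for nonzero \(v_l\), as suggested above, then amounts to comparing the minimizers of these two variants of the objective.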
8 Discussion
In this article we suggested a framework for hyperparameter selection in proximity scaling models, coined STOPS for STructure Optimized Proximity Scaling. The selection process for the hyperparameters is based on the idea that we want the configuration to show c-structuredness, enabling easier interpretation. The underlying objective function, Stoploss, combines c-structuredness indices and/or a badness-of-fit measure as a function of the PS hyperparameters, and thus allows optimization over their space. We presented a nested, profile-type optimization procedure that solves the PS problem given the hyperparameters by a standard PS algorithm in an inner optimization step and then uses the resulting configurations in the Stoploss to optimize over the hyperparameters in an outer optimization step. We suggested Bayesian optimization via treed Gaussian processes with jumps to limiting linear models as an efficient strategy for generating candidates in the outer optimization. The use of the STOPS framework was illustrated with three examples; in these examples there were structural considerations about the configuration that we used to select hyperparameters to achieve a desired effect.
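The nested, profile-type structure can be sketched in miniature. The following is our own toy illustration in Python, not the stops implementation: the "inner" PS step is reduced to a closed-form optimal scaling of fixed base distances, the c-structuredness index is a stand-in (spread of fitted distances), and the outer optimizer is a grid search where the paper uses Bayesian optimization with TGP-LLM. All names and the specific inner problem are assumptions for the sketch.

```python
def inner_ps(delta, d0, lam):
    """Inner step: for fixed hyperparameter lam, fit the transformed
    dissimilarities delta^lam by an optimal scaling b of base distances
    d0 (closed form), returning fitted distances and normalized stress."""
    dh = [d ** lam for d in delta]
    b = sum(x * y for x, y in zip(dh, d0)) / sum(y * y for y in d0)
    fitted = [b * y for y in d0]
    stress = sum((x - f) ** 2 for x, f in zip(dh, fitted)) / sum(x * x for x in dh)
    return fitted, stress

def structuredness(fitted):
    """Toy stand-in for a c-structuredness index: spread of the fitted
    distances (higher = more differentiated configuration)."""
    m = sum(fitted) / len(fitted)
    return sum((f - m) ** 2 for f in fitted) / len(fitted)

def outer_search(delta, d0, grid, v0=1.0, v1=0.1):
    """Outer step: the configuration is profiled out by inner_ps; minimize
    the scalarized Stoploss over the hyperparameter candidates."""
    best = None
    for lam in grid:
        fitted, stress = inner_ps(delta, d0, lam)
        loss = v0 * stress - v1 * structuredness(fitted)
        if best is None or loss < best[1]:
            best = (lam, loss)
    return best

delta = [1.0, 2.0, 3.0, 4.0]
d0 = [1.0, 1.5, 2.5, 4.5]
lam_best, loss_best = outer_search(delta, d0, [0.5, 1.0, 1.5, 2.0])
```

In the actual framework the grid search is replaced by a surrogate-model-driven candidate generator, which matters because each inner solve (a full PS fit) is expensive.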
The aim of this article was to suggest a general, principled way to select transformation parameters in PS methods based on the structural appearance of the configuration. For the scalarized objective we limited ourselves to optimizing only over the hyperparameters of the PS methods; in principle, and with little adaptation to the framework, it would be possible to treat the scalarization weights as an additional set of hyperparameters and optimize over them as well.
We chose to set up the framework for hyperparameter selection in a flexible, modular fashion. This makes it possible to plug in any type of badness-of-fit PS objective, use whatever optimization is advantageous for the PS objective, combine any number of c-structuredness indices in different ways, and use any global optimization strategy over the hyperparameters. We found Bayesian optimization with TGP-LLM to be well suited as a generally applicable strategy for the latter step, but due to the modular nature of the framework any other metaheuristic or random search can be used. The proposed framework is sufficiently general and modular to allow for conceptual and computational extensions in future research. For example, the framework allows the development of tailored optimization for concrete Stoplosses, it can be incorporated into a larger context of evaluation of PS methods such as France and Akkucuk (2021), and the idea of structure-based hyperparameter selection can be adapted to unsupervised learning methods that do not fit neatly into the presented PS framework, for example t-SNE or UMAP (McInnes et al. 2018).
We discussed structured appearances that we think are of interest to researchers and suggested statistics by which to measure them. We are quick to point out that our list is not meant to be exhaustive and that any scalar statistic that captures a property of the configuration can be incorporated into STOPS. We therefore believe our framework to be applicable even beyond the suggestions made in this paper.
As a hyperparameter tuning method, the STOPS framework is akin to imposing indirect, soft constraints in PS via the choice of hyperparameters, rather than incorporating hard, direct constraints on the configuration, and this is where we see the method located. In cases where strong theory or hard constraints exist, constrained versions of PS may be preferred over the STOPS framework. Since the method works completely unsupervised with respect to a ground truth for structures, the structures that are produced may not be “real” and would need to be validated.
Despite these limitations, we believe the STOPS framework to be a versatile conceptual framework for the task of hyperparameter tuning in ordination, unsupervised learning and dimensionality reduction. It can be utilized for data exploration, visualization and scaling in applications where one seeks configurations that show an interpretable structural quality while also retaining properties of a standard PS variant. This way one no longer has to restrict oneself to the narrow toolbox of standard MDS but can utilize the full world of flexible parametrized PS variants, with the standard versions resulting as specific hyperparameter constellations in the search space; the choice of hyperparameters is no longer ad hoc but principled and reproducible; the obtained hyperparameter values usually lead to the desired structure being appreciably present in the configuration (especially as compared to a standard MDS result), under the condition that the PS model still fits well; and both the fit and the c-structuredness can be quantified with descriptive statistics, thus making results reproducible and comparable between data sets, studies and settings.
Code Availability
An R script to fully reproduce results and figures is available as Supplementary Information.
Data Availability
The data sets used are available online in the R package stops.
Notes
While the setup of a STOPS model is at the discretion of the user, from an application and interpretability view we recommend including at most two c-structuredness indices, otherwise the effect of the indices compensating each other can be hard to predict. More than two can still be interpretable if the indices refer to related aspects; for example, high c-nonmonotonicity and low c-complexity would favor quadratic or circular arrangements.
Not all combinations have been proposed in the literature; conceptually, mixing and matching the badness-of-fit functions with the parametrized transformations for proximities, distances and weights is possible.
In the STOPS context, we always have \(\varvec{X}(\varvec{\theta })\) but for readability and generality, we only write \(\varvec{X}\) in the cstructuredness indices.
Running R 4.0.3 under Linux Mint 19.2 on an Intel Core i5-8350U CPU with 1.7 GHz, this grid search took 0.5 seconds.
Running R 4.0.3 under Linux Mint 19.2 on an Intel Core i5-8350U CPU with 1.7 GHz, optimization with our implementation took 12.5 seconds.
The weight values come from trading off the relative scales of the structure indices in the standard Sammon mapping to be commensurate; it has a badness-of-fit of 0.151, c-clusteredness of 0.083 and c-manifoldness of 0.5939. The scales of the indices are of relative magnitude of roughly 8:1 for c-clusteredness and c-manifoldness and 4:1 for c-manifoldness and fit for the initial Sammon solution. We want to trade these magnitudes off on the same scale and put more weight on c-clusteredness, say, 20 times more than on c-manifoldness, and more weight on c-manifoldness than on fit, say, 10 times. So, weights of 400 for c-clusteredness, 2.5 for c-manifoldness and 1 for stress reflect this.
Due to how BO with TGP-LLM works, this amounts to fitting around 60 PS models each. For our prototype implementation in R 4.1.2 on a PC running Linux Mint 20.1 with an Intel Core i7-6700 with 3.40 GHz, the timings were 751 seconds for Sammon mapping with powers, 2662 seconds for LMDS (with a maximum of \(1\times 10^4\) PS iterations), 6169 seconds for Box–Cox MDS (with a maximum of \(1\times 10^4\) PS iterations) and 106809 seconds for POSTMDS (with a maximum of \(5\times 10^4\) PS iterations). The bottleneck is fitting the PS problem, especially for POSTMDS.
We do not have a theoretical justification for these structures; this is meant as another showcase of STOPS and also as a contrast to the aSTOPS result, to illustrate how different various STOPS models can look for the same data.
This is by no means exclusive to STOPS; it shares this property with constrained forms of MDS (e.g., De Leeuw and Heiser 1980; Mathar 1990; Bronstein et al. 2006) and with many other types of unsupervised learning methods such as t-SNE (van der Maaten and Hinton 2008) (whose emphasis on small-scale structure can distort global structure) or k-means clustering (which always produces k Voronoi clusters in the distance metric used) (Mucherino et al. 2009).
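The weight trade-off arithmetic in the notes above (weights of 400, 2.5 and 1) can be checked numerically. This is our own verification sketch in Python, not code from the paper or the stops package; "effective contribution" here simply means weight times index value for the initial Sammon solution.

```python
# Index values for the initial Sammon solution, from the footnote above.
fit, c_clus, c_mani = 0.151, 0.083, 0.5939
w_clus, w_mani, w_fit = 400, 2.5, 1

# Effective contribution of each term = weight * index value.
eff_clus = w_clus * c_clus   # ~33.2
eff_mani = w_mani * c_mani   # ~1.485
eff_fit = w_fit * fit        # 0.151

# c-clusteredness ends up with roughly 20x the effective weight of
# c-manifoldness, which in turn gets roughly 10x that of fit:
print(round(eff_clus / eff_mani, 1))  # prints 22.4
print(round(eff_mani / eff_fit, 1))   # prints 9.8
```

The realized ratios (about 22:1 and 10:1) match the stated targets of "20 times" and "10 times" up to the rounding of the chosen weights.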
References
Akkucuk, U., Carroll, J.D.: PARAMAP vs. Isomap: a comparison of two nonlinear mapping algorithms. J. Classif. 23(2), 221–254 (2006). https://doi.org/10.1007/s00357-006-0014-2
Alimoglu, F., Alpaydin, E.: Combining multiple representations and classifiers for pen-based handwritten digit recognition. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition, pp. 637–640 (1997). https://doi.org/10.1109/ICDAR.1997.620583
Ankerst, M., Breunig, M.M., Kriegel, H.P., et al.: OPTICS: ordering points to identify the clustering structure. In: ACM SIGMOD International Conference on Management of Data, vol. 28, pp. 49–60. ACM Press, New York (1999). https://doi.org/10.1145/304182.304187
Bergstra, J.S., Bardenet, R., Bengio, Y., et al.: Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems, pp. 2546–2554 (2011). https://proceedings.neurips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf
Borg, I., Groenen, P.J.: Modern Multidimensional Scaling: Theory and Applications, 2nd edn. Springer, New York (2005). https://doi.org/10.1007/0-387-28981-X
Box, G.E., Cox, D.R.: An analysis of transformations. J. Roy. Stat. Soc.: Ser. B (Methodol.) 26(2), 211–243 (1964). https://doi.org/10.1111/j.2517-6161.1964.tb00553.x
Breiman, L., Friedman, J.H.: Estimating optimal transformations for multiple regression and correlation. J. Am. Stat. Assoc. 80(391), 580–598 (1985). https://doi.org/10.2307/2288473
Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proc. Natl. Acad. Sci. 103(5), 1168–1172 (2006). https://doi.org/10.1073/pnas.0508601103
Buja, A., Swayne, D.F.: Visualization methodology for multidimensional scaling. J. Classif. 19(1), 7–43 (2002). https://doi.org/10.1007/s00357-001-0031-0
Buja, A., Swayne, D.F., Littman, M.L., et al.: Data visualization with multidimensional scaling. J. Comput. Graph. Stat. 17(2), 444–472 (2008). https://doi.org/10.1198/106186008X318440
Chen, L., Buja, A.: Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. J. Am. Stat. Assoc. 104(485), 209–219 (2009). https://doi.org/10.1198/jasa.2009.0111
Chen, L., Buja, A.: Stress functions for nonlinear dimension reduction, proximity analysis, and graph drawing. J. Mach. Learn. Res. 14, 1145–1173 (2013). https://jmlr.org/papers/v14/chen13a.html
De Leeuw, J.: Applications of convex analysis to multidimensional scaling. In: Barra, J.R., Brodeau, F., Romier, G., et al. (eds.) Recent Developments in Statistics, pp. 133–145. North Holland Publishing Company, Amsterdam (1977). https://escholarship.org/uc/item/4ps3b5mj
De Leeuw, J., Heiser, W.J.: Multidimensional scaling with restrictions on the configuration. Multivar. Anal. 5(1), 501–522 (1980)
De Leeuw, J.: Minimizing rStress using majorization. Tech. rep., UCLA Statistics Preprint Series (2014). https://rpubs.com/deleeuw/142619
De’ath, G.: Extended dissimilarity: a method of robust estimation of ecological distances from high beta diversity data. Plant Ecol. 144(2), 191–199 (1999). https://doi.org/10.1023/A:1009763730207
Demartines, P., Herault, J.: Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets. IEEE Trans. Neural Netw. 8(1), 148–154 (1997). https://doi.org/10.1109/72.554199
Eberhart, R.C., Kennedy, J.: A new optimizer using particle swarm theory. In: Proceedings of the Sixth International Symposium on Micro Machine and Human Science, vol. 1, pp. 39–43. IEEE Press, Piscataway, NJ (1995). https://doi.org/10.1109/MHS.1995.494215
France, S.L., Akkucuk, U.: A review, framework, and R toolkit for exploring, evaluating, and comparing visualization methods. Vis. Comput. 37, 457–475 (2021). https://doi.org/10.1007/s00371-020-01817-5
France, S., Carroll, D.: Development of an agreement metric based upon the rand index for the evaluation of dimensionality reduction techniques, with applications to mapping customer data. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition, pp. 499–517. Springer, Berlin, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73499-4_38
France, S.L., Carroll, J.D.: Two-way multidimensional scaling: a review. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 41(5), 644–661 (2011). https://doi.org/10.1109/TSMCC.2010.2078502
Gebelein, H.: Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM - J. Appl. Math. Mech. / Zeitschrift für Angewandte Mathematik und Mechanik 21(6), 364–379 (1941). https://doi.org/10.1002/zamm.19410210604
Gramacy, R.B., Lee, H.K.H.: Bayesian treed Gaussian process models with an application to computer modeling. J. Am. Stat. Assoc. 103(483), 1119–1130 (2008). https://doi.org/10.1198/016214508000000689
Groenen, P., De Leeuw, J.: Power-stress for multidimensional scaling. Tech. rep., UCLA, Los Angeles, USA (2010)
Izenman, A.J.: Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, New York (2009). https://doi.org/10.1007/978-0-387-78189-1
Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Global Optim. 13(4), 455–492 (1998). https://doi.org/10.1023/A:1008306431147
Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983). https://doi.org/10.1126/science.220.4598.671
Kruskal, J.B.: Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29(1), 1–27 (1964). https://doi.org/10.1007/BF02289565
Luus, R., Jaakola, T.: Optimization by direct search and systematic reduction of the size of search region. Am. Inst. Chem. Eng. J. (AIChE) 19(4), 760–766 (1973). https://doi.org/10.1002/aic.690190413
Mair, P., Rusch, T., Hornik, K.: The grand old party: a party of values? SpringerPlus 3(697), 1–10 (2014). https://doi.org/10.1186/2193-1801-3-697
Mathar, R.: Multidimensional scaling with constraints on the configuration. J. Multivar. Anal. 33(2), 151–156 (1990). https://doi.org/10.1016/0047-259X(90)90043-H
McGee, V.E.: The multidimensional analysis of ‘elastic’ distances. Br. J. Math. Stat. Psychol. 19(2), 181–196 (1966). https://doi.org/10.1111/j.2044-8317.1966.tb00367.x
McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction (2018). https://doi.org/10.48550/ARXIV.1802.03426
Mockus, J.: Bayesian Approach to Global Optimization: Theory and Applications, Mathematics and its Applications, vol. 37. Springer, Netherlands, Amsterdam (1989). https://doi.org/10.1007/978-94-009-0909-0
Mucherino, A., Papajorgji, P.J., Pardalos, P.M.: Clustering by k-means, pp. 47–82. Springer, New York (2009). https://doi.org/10.1007/978-0-387-88615-2_3
Ramsay, J.O.: Maximum likelihood estimation in multidimensional scaling. Psychometrika 42(2), 241–266 (1977). https://doi.org/10.1007/BF02294052
Ramsay, J.O.: Some statistical approaches to multidimensional scaling data (with discussion). J. Royal Stat. Soc. Series A (General) 145(3), 285–312 (1982). http://www.jstor.org/stable/2981865
Rusch, T., Hornik, K., Mair, P.: Assessing and quantifying clusteredness: the OPTICS Cordillera. J. Comput. Graph. Stat. 27(1), 220–233 (2018). https://doi.org/10.1080/10618600.2017.1349664
Rusch, T., Mair, P., Hornik, K.: Cluster optimized proximity scaling. J. Comput. Graph. Stat. 30(4), 1156–1167 (2021). https://doi.org/10.1080/10618600.2020.1869027
Sammon, J.W.: A nonlinear mapping for data structure analysis. IEEE Trans. Comput. C-18(5), 401–409 (1969). https://doi.org/10.1109/TC.1969.222678
Sarmanov, O.V.: The maximum correlation coefficient (symmetrical case). In: Doklady Akademii Nauk SSSR, pp. 715–718 (1958). http://mi.mathnet.ru/eng/dan23105
Siivola, E., Paleyes, A., González, J., et al.: Good practices for Bayesian optimization of high dimensional structured spaces. Appl. AI Lett. 2(2), e24 (2021). https://doi.org/10.1002/ail2.24
Takane, Y., Young, F., De Leeuw, J.: Nonmetric individual differences multidimensional scaling: an alternating least squares method with optimal scaling features. Psychometrika 42(1), 7–67 (1977). https://doi.org/10.1007/BF02293745
Tenenbaum, J.B., De Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000). https://doi.org/10.1126/science.290.5500.2319
Torgerson, W.S.: Theory and Methods of Scaling. Wiley, New York (1958)
van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(86), 2579–2605 (2008). http://jmlr.org/papers/v9/vandermaaten08a.html
Acknowledgements
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Thomas Rusch. The first draft of the manuscript was written by Thomas Rusch and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding
Open access funding provided by Vienna University of Economics and Business (WU). No funds, grants, or other support was received for this work.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Rusch, T., Mair, P. & Hornik, K. Structure-based hyperparameter selection with Bayesian optimization in multidimensional scaling. Stat Comput 33, 28 (2023). https://doi.org/10.1007/s11222-022-10197-w