Abstract
We introduce the structure optimized proximity scaling (STOPS) framework for hyperparameter selection in parametrized multidimensional scaling and extensions (proximity scaling; PS). The selection process for hyperparameters is based on the idea that we want the configuration to show a certain structural quality (c-structuredness). A number of structures and how to measure them are discussed. We combine the structural quality, by means of c-structuredness indices, with the PS badness-of-fit measure in a multi-objective scalarization approach, yielding the Stoploss objective. Computationally we suggest a profile-type algorithm that first solves the PS problem and then uses Stoploss in an outer step to optimize over the hyperparameters. Bayesian optimization with treed Gaussian processes is recommended as an apt and efficient strategy for carrying out the outer optimization. This way, hyperparameter tuning for many instances of PS is covered in a single conceptual framework. We illustrate the use of the STOPS framework with three data examples.
1 Introduction
For unsupervised learning, dimensionality reduction and exploratory data analysis, a popular approach is to represent and visualize multivariate proximities in lower dimensions, also known as ordination. Such methods were first known as multidimensional scaling (MDS) and later also collected under the nonlinear dimension reduction and manifold learning monikers. We will use the umbrella term proximity scaling (PS) by which we mean MDS and extensions (for a comprehensive overview, see, e.g., France and Carroll 2011).
The main idea behind these methods is that for N data points or objects there is a matrix of pairwise proximities between object i and object j and one looks for a representation of the objects—the configuration—in a lower dimensional target space where the distances between objects optimally approximate the proximities. Examples include classical MDS (Torgerson 1958), metric MDS and nonmetric MDS (Kruskal 1964), Sammon mapping (Sammon 1969), Isomap (De’ath 1999; Tenenbaum et al. 2000) or power stress MDS (POST-MDS, Buja et al. 2008; Groenen and De Leeuw 2010).
PS methods play an important role in the exploration and communication of data. There exists a plethora of PS methods suitable for many different purposes. Interestingly in the wild one seems to encounter only a relatively narrow PS toolbox, mainly classical MDS and nonmetric/metric MDS. General and flexible PS methods, which allow for various transformations of the quantities of interest, are more rarely encountered. This is unfortunate as we are convinced that flexible PS methods can be both useful and powerful and ignoring them may lead to foregone insights.
Flexible PS methods typically have hyperparameters governing the transformations, which are either explicit or implicit, often yielding standard PS methods for certain hyperparameter constellations. For example, a popular method for unravelling a manifold embedded in a higher-dimensional space is Isomap, which does classical MDS on proximities that are derived from a neighborhood graph. The graph and derived proximities depend on a hyperparameter for the neighborhood that can be freely chosen. Similarly, in POST-MDS there are power transformations for fitted distances, proximities and/or weights which can be freely chosen and yield metric MDS if the exponents are 1.
A possible obstacle for a more widespread use of flexible PS methods are the hyperparameters for the transformations of the quantities of interest and their selection. An instructive example of this are the ideas of Ramsay (1982) that were read before the Royal Statistical Society but have gained relatively little traction. Ramsay suggests power and spline transformations to be used in MDS and, regarding the hyperparameters, recommends using an exponent of 1.5 and polynomials of degree 2 based on “general experience” (p. 288 and 290, Ramsay 1982). The choice of transformations and hyperparameters is met with some contention by influential discussants like J. C. Gower, F. Critchley, J. de Leeuw, and others. Two comments from the discussion illustrate this: S. Tagg states “this complexity requires understanding of the choice of parameters” (p. 303, Ramsay 1982) and C. Chatfield says “by allowing [...] for different transformations [...] he hopes to have a realistic, albeit rather complicated, model. My worry is that the model is too complicated and yet still not realistic enough.” (p. 306, Ramsay 1982). This debate on how to choose the optimal parameters for flexible MDS methods has not yet been resolved. Hence, if these methods are used, the choice of hyperparameters often has an arbitrary trial-and-error aspect to it as they are simply set ad hoc (e.g., Ramsay 1982; Buja et al. 2008; Mair et al. 2014) and/or different values are tried out (sometimes in a semi-optimized fashion) until values are found that work in light of the application, see, e.g., the analyses in Buja and Swayne (2002); Borg and Groenen (2005); Chen and Buja (2009, 2013); De Leeuw (2014).
We propose to approach the hyperparameter selection in a principled way: as a computational problem of optimizing over the hyperparameter space for a wide variety of PS methods. Hence this article’s contributions are: a) providing a methodological framework that subsumes a wide array of flexible PS methods and parametric transformations for them, b) suggesting the setup of objective functions for hyperparameter selection to optimize for within this framework, c) operationalizing the building blocks that comprise the objective functions, and d) identifying a general computational approach that is suitable to tackle all these instances in the same way.
One contribution is that we propose criteria that can be used for operationalizing this optimization problem. Our proposal is based on the observation that in many applications of PS the obtained result—which is faithful to the input proximities—is interpreted with respect to the arrangement of objects in the configuration. We call this “structural quality” for a specific notion of structural appearance, for example, that objects are arranged in quadrants or as a circumplex. In recent years this aspect of interpretation in PS has been made explicit by using indices to measure a structural quality of interest, e.g., by Akkucuk and Carroll (2006); France and Carroll (2007); Chen and Buja (2009, 2013); Rusch et al. (2021).
Since structural quality is a property of the configuration, we coin the term c-structuredness to single out that we mean the degree of structuredness of a configuration with respect to a certain structure. High c-structuredness is something we aim for in PS results under the condition that the proximities are still represented faithfully. Changing the hyperparameters for the transformations typically also changes the c-structuredness, which provides a way for us to set up tuning of the hyperparameters so that we obtain a configuration that is both faithful and c-structured.
Our suggestions build on precursor ideas of letting a criterion guide the selection of hyperparameters in PS, as for example in Akkucuk and Carroll (2006) and Chen and Buja (2009, 2013), where tuning has been done manually or over a grid. France and Akkucuk (2021) propose a visualization and exploration framework along these lines. An optimization approach to hyperparameter tuning in MDS was previously used by Rusch et al. (2021). Our proposal extends and complements these approaches.
In the remainder of this article we present our conceptual framework for hyperparameter tuning in PS: STOPS for STructure Optimized Proximity Scaling. It allows one to computationally tackle this task for a wide array of flexible PS versions, subsuming standard MDS methods used in the wild, which result as specific hyperparameter constellations in the search space. We do this by making the notion of the structural quality that is sought and interpreted explicit; these structural appearances are condensed into c-structuredness indices to clearly quantify the degree of structuredness of a PS result. The goal of finding hyperparameters is then handled as a multi-objective optimization problem: The c-structuredness indices are combined with the PS badness-of-fit by scalarization into an objective function that is parametrized with the hyperparameters. The optimal hyperparameters are found via a general purpose optimization routine; one that emerged as working well for this problem class was Bayesian Optimization (BO, Mockus 1989). This gives a general approach that allows us to handle most instances of parametrized PS in the same way.
The article is organized as follows: It starts with a high level description of the STOPS framework, the objective functions and the different building blocks that make up the framework in Sect. 2. We then discuss the building blocks in detail: in Sect. 3 we discuss badness-of-fit objectives, in Sect. 4 we elaborate on transformations and their governing hyperparameters. In Sect. 5 we turn to c-structuredness, i.e., structures of interest and their quantification by indices. Section 6 discusses hyperparameter selection within the STOPS framework. In Sect. 7 we illustrate the use of the framework with three data examples. Concluding remarks can be found in Sect. 8. As Supplementary Information we include the R code file to reproduce the results and figures, and a supplementary document with details on (additional) structures and structuredness indices, the nature of Stoploss and normalization of badness-of-fit.
2 The STOPS framework
In proximity scaling methods we start from a given matrix \(\varvec{\Delta }\) of pairwise symmetric proximities between objects \(i, j; i, j=1,\dots ,N\), with individual entries \(\delta _{ij}\). We will assume the proximities to take on a minimum when two observations are equal, i.e., dissimilarities. The main diagonal of \(\varvec{\Delta }\) is 0.
Let \(\hat{\varvec{\Delta }}\) be the result of transformations applied to the entries of the proximity matrix \(\varvec{\Delta }\), where \(\hat{\varvec{\Delta }}=T_\Delta (\varvec{\Delta })=T_\Delta (\varvec{\Delta }\vert \theta _\Delta )\), with individual entries \(\hat{\delta }_{ij}\). Let \(\varvec{X}\) denote an \(N \times M\) matrix (the configuration) of lower dimension \(M<N\) (mostly \(M \ll N\)) from which the matrix \(\varvec{D}(\varvec{X})\) comprising untransformed pairwise distances \(d_{ij}(\varvec{X})\) between objects (row vectors) in \(\varvec{X}\) can be derived. The matrix \(\hat{\varvec{D}}(\varvec{X})=T_{D}(\varvec{D}(\varvec{X}))=T_D(\varvec{D}(\varvec{X})\vert \theta _D)\) comprises transformed pairwise distances between objects in \(\varvec{X}\), with individual entries \(\hat{d}_{ij}(\varvec{X})\). We call \(T_{\Delta }: (\varvec{\Delta },\theta _\Delta ) \mapsto \hat{\varvec{\Delta }}\) a proximity transformation function and \(T_D: (\varvec{D}(\varvec{X}),\theta _D) \mapsto \hat{\varvec{D}}(\varvec{X})\) a distance transformation function. Some PS models also allow different weights either given a priori as an input weight matrix \(\varvec{W}\) with elements \(w_{ij}\), or as transformed values based on the input weights. In case of the latter this is a transformed weight matrix \(\hat{\varvec{W}}=T_W(\varvec{W}\vert \theta _W)\), with elements \(\hat{w}_{ij}\) for the weight transformation function \(T_W: (\varvec{W},\theta _W) \mapsto \hat{\varvec{W}}\). The combined hyperparameter vector of all these transformations is \(\varvec{\theta }=(\theta _\Delta ,\theta _D,\theta _W)^\top \).
We search for an optimal configuration that allows one to reconstruct the matrix \(\hat{\varvec{\Delta }}\) as well as possible from \(\hat{\varvec{D}}(\varvec{X})\), i.e., we want \(\hat{\varvec{D}}(\varvec{X}) \approx \hat{\varvec{\Delta }}\). This is achieved by minimizing a measure of badness-of-fit \(\sigma _{\text {PS}}(\varvec{X} \vert \varvec{\theta }) = \mathcal {L}(\hat{\varvec{\Delta }}=T_\Delta (\varvec{\Delta }\vert \theta _\Delta ),\hat{\varvec{D}}(\varvec{X})=T_D(\varvec{D}(\varvec{X})\vert \theta _D),\hat{\varvec{W}}=T_W(\varvec{W}\vert \theta _W))\), where \(\mathcal {L}\) denotes a loss function.
Minimizing the badness-of-fit criterion in PS means finding—for given \(\varvec{\theta }\)—the optimal configuration \(\varvec{X}^*\) out of all possible \(\varvec{X}\) as

\(\varvec{X}^*(\varvec{\theta }) = \arg \min _{\varvec{X}} \sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta }) \qquad (1)\)
Many measures of badness-of-fit with different types of transformations and hyperparameters have been proposed (see Sects. 3 and 4). Our framework covers all of them; we give a loose taxonomy of the most popular ones in Sect. 3. Concrete values of \(\varvec{\theta }\) are typically chosen manually ad hoc but can also be found by optimization; the latter is what we focus on in this article.
Let us assume we are interested in L different structural qualities of \(\varvec{X}\) and that we have L corresponding univariate c-structuredness indices \(I_l(\varvec{X}\vert \varvec{\gamma })\) for the \(l=1,\dots , L\) different structures, capturing the essence of the structural appearance of the configuration with respect to the lth structure. For example, we might be interested in both the structural appearance of how clustered the configuration is (structure 1) and how strongly linearly related the column vectors of the configuration are (structure 2). We then measure the c-structuredness of \(\varvec{X}\) for the two structures with an index for clusteredness and one for linear dependence respectively. The \(\varvec{\gamma }\) are optional metaparameters for the indices, which we assume are given and fixed; they control how c-structuredness is measured. Some structures one might be interested in, along with their c-structuredness indices, will be discussed in Sect. 5 and many more in Appendix A in the supplementary document. We further assume broadly that the transformations \(T_\Delta (\varvec{\Delta }\vert \varvec{\theta })\) and/or \(T_D(\varvec{D}(\varvec{X})\vert \varvec{\theta })\) and/or \(T_W(\varvec{W}\vert \varvec{\theta })\) produce different c-structuredness in \(\varvec{X}\) for different values of \(\varvec{\theta }\).
In a nutshell, our proposal is to select optimal hyperparameters \(\varvec{\theta }^*\) for the scaling procedure by assessing the c-structuredness of an optimal configuration \(\varvec{X}^*\) found from a PS method for given \(\varvec{\theta }\), usually in combination with its badness-of-fit value. We aim at finding a \(\varvec{\theta }^*\) that, when used as transformation parameters in the PS method, will give a configuration that has high (or low) values of the c-structuredness indices. We view this as a multi-objective optimization problem, where we want to maximize/minimize different criteria (either badness-of-fit, or c-structuredness, or both) over \(\varvec{\theta }\). C-structuredness may thus be induced at a possible expense of fit, but we control the amount of that expense.
To formalize this we explicitly write the building blocks of the objective function used for hyperparameter tuning via STOPS as a function of \(\varvec{\theta }\): Let us denote by \(\varvec{X}^*(\varvec{\theta })\) the optimal solution from minimizing a badness-of-fit \(\sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })\) for a fixed \(\varvec{\theta }\), so \(\varvec{X}^*(\varvec{\theta }):= \arg \min _{\varvec{X}} \sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })\). Further we also have the L different univariate indices with possible metaparameters \(\varvec{\gamma }\), \(I_l(\varvec{X}^*(\varvec{\theta })\vert \varvec{\gamma })\), to be optimized for.
Specific variants of STOPS can be instantiated by defining objective functions \(\text {Stoploss}(\varvec{\theta }\vert v_0, \dots , v_L, \varvec{\gamma })\), comprising either badness-of-fit or c-structuredness indices or both in a scalarized combination. Two variants of objective functions—called additive STOPS (aSTOPS) and multiplicative STOPS (mSTOPS) respectively—are of the following form:

\(\text {Stoploss}(\varvec{\theta }\vert v_0, \dots , v_L, \varvec{\gamma }) = v_0 \cdot \sigma _{\text {PS}}(\varvec{X}^*(\varvec{\theta })\vert \varvec{\theta }) + \sum ^L_{l=1} v_l \cdot I_l(\varvec{X}^*(\varvec{\theta })\vert \varvec{\gamma }) \qquad (2)\)

and

\(\text {Stoploss}(\varvec{\theta }\vert v_0, \dots , v_L, \varvec{\gamma }) = \sigma _{\text {PS}}(\varvec{X}^*(\varvec{\theta })\vert \varvec{\theta })^{v_0} \cdot \prod ^L_{l=1} I_l(\varvec{X}^*(\varvec{\theta })\vert \varvec{\gamma })^{v_l} \qquad (3)\)
with \(v_0 \in \mathbb {R}_{\ge 0}\) and \(v_1,\dots ,v_L \in \mathbb {R}\) being the scalarization weights. Numerically, the badness-of-fit function value \(\sigma _{\text {PS}}(\varvec{X}^*(\varvec{\theta })\vert \varvec{\theta })\) needs to be normalized to be scale-free and commensurable for comparability of different values of \(\varvec{\theta }\). We discuss such normalization in Appendix B of the supplementary document. The objective function for aSTOPS is fully compensatory, whereas for mSTOPS it ensures that a normalized badness-of-fit of 0 will always lead to the minimal \(\text {Stoploss}(\varvec{\theta }\vert v_0, \dots , v_L, \varvec{\gamma })\) for a positive value of \(I_l(\cdot )\). For notational convenience, we will refer to the objective functions for STOPS variants by \(\text {Stoploss}(\varvec{\theta })\) for the remainder of the paper.
The \(v_0,\dots ,v_L\) are weights that determine how the badness-of-fit and c-structuredness indices are scalarized^{Footnote 1} for Stoploss. This can be used to set how strongly the criteria are taken into account or to control the trade-off of fit and c-structuredness in determining hyperparameters (typically in a convex combination). For example, we might want to tune by only minimizing badness-of-fit over \(\varvec{\theta }\) (\(v_0=1, v_l=0\)), or find the best configurations and only optimize for structure over \(\varvec{\theta }\) without taking badness-of-fit into account (\(v_0=0, v_l \ne 0\)), or tune so that we relax \(10\%\) of goodness-of-fit for more c-structuredness (\(v_0=0.9\) and \(\sum ^L_{l=1}v_l=0.1\)). A negative (positive) weight for \(v_l\) means that a higher (lower) value for the lth index is favored.
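To make the scalarization concrete, here is a minimal sketch in code. The function names and toy values are our own, and we assume the weights enter the multiplicative variant as exponents:

```python
# Sketch of the two Stoploss scalarizations (illustrative; function names and
# the assumption that weights enter mSTOPS as exponents are ours).

def stoploss_additive(fit, indices, v0, v):
    """aSTOPS: weighted sum of normalized badness-of-fit and the L indices."""
    return v0 * fit + sum(vl * il for vl, il in zip(v, indices))

def stoploss_multiplicative(fit, indices, v0, v):
    """mSTOPS: weighted product; a fit of 0 yields the minimum for positive indices."""
    out = fit ** v0
    for vl, il in zip(v, indices):
        out *= il ** vl
    return out

# Relax 10% of fit for more c-structuredness; a negative v_l favors a high index.
loss = stoploss_additive(0.2, [0.8], v0=0.9, v=[-0.1])
```

Note how a normalized badness-of-fit of 0 makes the multiplicative variant minimal whenever all indices are positive, matching the compensatory/non-compensatory distinction above.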
Using either (2) or (3) for hyperparameter selection, we then need to find

\(\varvec{\theta }^* = \arg \min _{\varvec{\theta } \in \varvec{\Theta }} \text {Stoploss}(\varvec{\theta }) \qquad (4)\)
Accordingly, hyperparameter tuning with STOPS comprises the following building blocks:
1. The PS badness-of-fit loss function \(\sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })\) that allows us to find an optimal \(\varvec{X}^*(\varvec{\theta })\) for given \(\varvec{\theta }\) (see Sect. 3).

2. The transformations employed that depend on the vector \(\varvec{\theta }\), so any of \(T_\Delta (\varvec{\Delta }\vert \varvec{\theta })\), \(T_D(\varvec{D}(\varvec{X})\vert \varvec{\theta })\) and \(T_W(\varvec{W}\vert \varvec{\theta })\) (see Sect. 4).

3. The structures of interest and their c-structuredness indices \(I_l(\varvec{X}^*(\varvec{\theta })\vert \varvec{\gamma })\) (see Sect. 5).

4. Hyperparameter selection via an outer loop: \(\text {Stoploss}(\varvec{\theta })\) is used solely for optimization over the hyperparameters \(\varvec{\theta }\), while \(\varvec{X}^*(\varvec{\theta })\) itself is found conditional on the transformation parameters from \(\sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })\). This allows us to utilize tailored optimization for the main model parameters (the fitted distances) in the badness-of-fit functions.

Our proposal is a formalized hyperparameter selection procedure supplanting the standard workflow of tuning hyperparameters by trying out and comparing different solutions ad hoc. The modular design of the STOPS framework offers great flexibility: it can incorporate many different instances of badness-of-fit functions, transformations and c-structuredness indices, and it provides a computational approach that is generally applicable for continuous and discrete hyperparameter spaces \(\varvec{\Theta }\).
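The profile-type outer loop can be sketched as follows. A plain grid search stands in for the Bayesian optimization recommended later, and `fit_ps` and `structuredness_index` are hypothetical placeholders for the inner PS solver and a c-structuredness index:

```python
# Profile-type outer loop (sketch): the inner PS solver is a black box that
# returns X*(theta) and its normalized badness-of-fit; an additive Stoploss
# with a single index is evaluated only in the outer step.

def tune_stops(thetas, fit_ps, structuredness_index, v0=0.9, v1=0.1):
    """Return the candidate theta minimizing v0*fit - v1*I(X*(theta)).

    fit_ps(theta) -> (X_star, fit) and structuredness_index(X_star) -> float
    are placeholders; the negative sign on v1 favors a high index value.
    """
    best_theta, best_loss = None, float("inf")
    for theta in thetas:
        X_star, fit = fit_ps(theta)                          # inner PS step
        loss = v0 * fit - v1 * structuredness_index(X_star)  # outer Stoploss
        if loss < best_loss:
            best_theta, best_loss = theta, loss
    return best_theta, best_loss
```

In practice the candidate grid would be replaced by the Bayesian optimization strategy discussed in Sect. 6, which needs far fewer (expensive) inner PS fits.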
3 Proximity scaling losses for measuring badness-of-fit
To be more detailed about finding the configuration, recall that the problem proximity scaling solves is to find an \(N \times M\) matrix \(\varvec{X}^*\) by means of a sensible loss criterion \(\sigma _{\text {PS}}(\varvec{X}\vert \varvec{\theta })=\mathcal {L}(\hat{\varvec{\Delta }}=T_\Delta (\varvec{\Delta }\vert \varvec{\theta }),\hat{\varvec{D}}(\varvec{X})=T_D(\varvec{D}(\varvec{X})\vert \varvec{\theta }), \hat{\varvec{W}}=T_W(\varvec{W}\vert \varvec{\theta }))\) that measures how closely the fitted \(\hat{\varvec{D}}(\varvec{X})\) approximates \(\hat{\varvec{\Delta }}\) (badness-of-fit). We now discuss different loss functions \(\mathcal {L}(\cdot )\) that have proven to be especially popular.
3.1 Quadratic loss: STRESS
One popular type of PS is least squares scaling, which employs the quadratic loss function. This type of loss is usually called Stress (for standardized residual sum of squares, Kruskal 1964). A general formulation of a Stress-type loss function is

\(\sigma _{\text {stress}}(\varvec{X}\vert \varvec{\theta })=\sum _{i<j} \hat{w}_{ij}\left( \hat{d}_{ij}(\varvec{X})-\hat{\delta }_{ij}\right) ^2 \qquad (5)\)
Here, the \(\hat{w}_{ij}\) are finite (transformed) input weights, with \(\hat{w}_{ij}=0\) if a proximity is missing or to be ignored.
The fitted distances in the configuration are usually some type of Minkowski distance

\(d_{ij}(\varvec{X})=\left( \sum ^M_{m=1} \vert x_{im}-x_{jm}\vert ^p\right) ^{1/p} \qquad (7)\)

with \(p > 0\), typically Euclidean, so \(p=2\). Stress losses need to be minimized iteratively. Popular algorithms for finding the minimum of (5) are majorization (De Leeuw 1977) or gradient descent (Buja and Swayne 2002).
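As a minimal numerical sketch of this recipe (our own illustration: plain raw Stress with Euclidean distances and unit weights, minimized by a crude analytic-gradient descent rather than the optimized majorization solvers cited above):

```python
import numpy as np

def raw_stress(X, Delta):
    """Raw Stress: sum of squared residuals between Euclidean configuration
    distances and dissimilarities over pairs i < j (unit weights)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(X.shape[0], k=1)
    return ((D[iu] - Delta[iu]) ** 2).sum()

def mds_gradient_descent(Delta, M=2, steps=2000, lr=0.005, seed=1):
    """Minimize raw Stress by plain gradient descent from a random start."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(Delta.shape[0], M))
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]
        D = np.sqrt((diff ** 2).sum(-1))
        np.fill_diagonal(D, 1.0)     # avoid division by zero on the diagonal
        R = D - Delta
        np.fill_diagonal(R, 0.0)     # the diagonal carries no residual
        # gradient of raw Stress w.r.t. each row of X
        X = X - lr * 2.0 * ((R / D)[:, :, None] * diff).sum(axis=1)
    return X
```

Running this on the distance matrix of a known point set drives the Stress toward zero, illustrating why (5) must be handled iteratively rather than in closed form.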
3.2 Approximation by inner product: STRAIN
The second popular type is the Strain loss function (Torgerson 1958). Here, \(\varvec{\Delta }\) is transformed to \(\hat{\varvec{\Delta }}\) so that \(T_\Delta (\varvec{\Delta }\vert \theta _\Delta )=(h\circ s)(\varvec{\Delta }\vert \varvec{\theta }_\Delta )\), where s is a function parametrized by \(\theta _\Delta \) and \(h(\cdot )\) is the double centering operation (i.e., for a matrix \(\varvec{A}\), \(h(\varvec{A})=\varvec{A}-\varvec{A}_{i.}-\varvec{A}_{.j}+\varvec{A}_{..}\), where \(\varvec{A}_{i.}, \varvec{A}_{.j}, \varvec{A}_{..}\) are matrices consisting of the row, column and grand means respectively). Subsequently \(\hat{\varvec{\Delta }}\) is approximated by the inner product matrix of \(\varvec{X}\), so, \(\hat{\varvec{\Delta }} \approx \varvec{X}\varvec{X}^\top \).
In the context of Strain we always assume that \(T_\Delta (\cdot )\) is a composite function of the double centering operation and some other parametrized function \(s(\cdot \vert \theta _\Delta )\), so we can express Strain as

\(\sigma _{\text {strain}}(\varvec{X}\vert \varvec{\theta })=\sum _{i,j}\left( \hat{\delta }_{ij}-\langle \varvec{x}_i,\varvec{x}_j\rangle \right) ^2 \qquad (8)\)

if \(d_{ij}(\varvec{X})\) is as in (7) and \(\hat{\delta }_{ij}=h(s_{ij}(-\delta _{ij}^2/2\vert \varvec{\theta }))\). Strain losses can usually be solved by an eigenvalue decomposition.
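When s is the identity, this is classical scaling, which the following sketch (our own minimal illustration) solves via double centering and an eigendecomposition:

```python
import numpy as np

def classical_mds(Delta, M=2):
    """Classical scaling: double-center -Delta**2/2 and use the top-M
    eigenpairs of the resulting inner-product matrix."""
    n = Delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = J @ (-(Delta ** 2) / 2.0) @ J     # double centering h(-Delta^2/2)
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    vals, vecs = vals[::-1][:M], vecs[:, ::-1][:, :M]
    # scale top eigenvectors by sqrt of (non-negative) eigenvalues
    return vecs * np.sqrt(np.clip(vals, 0.0, None))
```

If Delta holds Euclidean distances of some M-dimensional point set, the embedding reproduces those distances exactly up to rotation, reflection and translation, which is why no iteration is needed.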
3.3 Repulsion and attraction: energy model
Another way of interpreting the PS objective is related to energy models with a pairwise attraction part \(\propto \hat{d}_{ij}(\varvec{X})^\nu \) and a pairwise repulsion part \(\propto \hat{\delta }_{ij}\hat{d}_{ij}(\varvec{X})\) between objects (Chen and Buja 2013). This is

\(\sigma _{\text {energy}}(\varvec{X}\vert \varvec{\theta })=\sum _{i<j} \hat{w}_{ij}\left( a\,\hat{d}_{ij}(\varvec{X})^\nu - b\,\hat{\delta }_{ij}\hat{d}_{ij}(\varvec{X})\right) \qquad (9)\)

with a, b being some constants. For \(\nu =2, a=1\) and \(b=2\) this is (5) where terms depending solely on \(\hat{\varvec{\Delta }}\) are disregarded for finding \(\varvec{X}\).
4 Parametrized transformations
Many PS methods allow transforming the input dissimilarities and/or the fitted distances and/or the input weights. In this section, we discuss popular transformations that are employed with the above-mentioned proximity scaling losses.^{Footnote 2} Note that our list is not meant to be exhaustive.
We consider possible parameter vectors \(\varvec{\theta } \subseteq \{\theta _\Delta ,\theta _D,\theta _W\}\) for the transformations \(T_\Delta , T_D, T_W\) for proximities, fitted distances or weights. These are the hyperparameters we later want to tune.
4.1 Transforming observed proximities
A simple and very flexible approach is to transform the input proximities, i.e., using \(T_\Delta (\varvec{\Delta })\). The advantage is that this can easily be implemented for, and in principle applied with, all PS methods. Often it is also possible to use \(T_W(\varvec{W})\).
4.1.1 Metric scaling transformations of proximities
In metric scaling one applies a parametric bijective strictly monotonic transformation to the proximities.
With specific choices for \(T_\Delta (\cdot )\) one can express many popular PS versions, including absolute MDS with \(\hat{\delta }_{ij}=\delta _{ij}\), ratio MDS with \(\hat{\delta }_{ij}=\vert b\vert \delta _{ij}\), interval MDS with \(\hat{\delta }_{ij}=a+b\delta _{ij}\ge 0\), logarithmic MDS with \(\hat{\delta }_{ij}=a+b \log (\delta _{ij})\), and exponential MDS with \(\hat{\delta }_{ij}=a+b \exp (\delta _{ij})\) (for all of these, see Borg and Groenen 2005), power MDS with \(\hat{\delta }_{ij} =\delta _{ij}^{\lambda }\) (e.g., Buja and Swayne 2002), or, by additionally setting the \(\hat{w}_{ij}\), instances of metric scaling that use a priori inverse weighting with the observed proximities, e.g., setting \(\hat{w}_{ij}=\delta ^{-1}_{ij}\) (Sammon mapping, Sammon 1969) or \(\hat{w}_{ij}=\delta ^{-2}_{ij}\) (elastic scaling, McGee 1966), or curvilinear component analysis (Demartines and Herault 1997) for a hyperparameter \(\theta _W=\tau \) with \(\hat{\varvec{W}}=T_W(\hat{\varvec{D}}(\varvec{X})\vert \theta _W)\) being a bounded and monotonically nonincreasing function, e.g., \(\hat{w}_{ij}=1\) if \(\hat{d}_{ij}(\varvec{X})\le \tau \) and 0 otherwise.
These transformations are governed by the appropriate hyperparameters \(\theta _\Delta \) and/or \(\theta _W\) to yield the models.
4.1.2 Geodesic transformation of proximities
A popular method for manifold learning is Isomap (De’ath 1999; Tenenbaum et al. 2000), which is a PS method (originally Strain-type) with the \(\hat{\delta }_{ij}\) being the geodesic distance between objects i, j as imposed by a weighted graph. These proximities are defined as the sum of edge lengths along the shortest path \(\text {SP}(i,j\vert G(k))\) between two objects in a neighborhood graph G(k) for a given parameter k (the number of nearest neighbors), where the objects are the vertices, so \(\hat{\delta }_{ij}=\text {SP}(i,j\vert G(k))\). Alternatively, one can define the neighborhood graph in terms of an \(\varepsilon \)-radius, so \(\hat{\delta }_{ij}=\text {SP}(i,j\vert G(\varepsilon ))\).
The transformation is governed by \(\theta _\Delta =k\) or \(\theta _\Delta =\varepsilon \) as the hyperparameter that defines the neighborhood graph.
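A sketch of the geodesic transformation (our own illustration: a symmetrized k-nearest-neighbor graph followed by Floyd-Warshall shortest paths; production Isomap implementations typically use Dijkstra on sparse graphs, and we assume each object's smallest dissimilarity is its zero self-dissimilarity):

```python
import numpy as np

def geodesic_proximities(Delta, k):
    """Shortest-path lengths SP(i, j | G(k)) on the symmetrized k-nearest-
    neighbor graph built from the dissimilarity matrix Delta."""
    n = Delta.shape[0]
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nn = np.argsort(Delta[i])[1:k + 1]  # k nearest neighbors (position 0 is i itself)
        G[i, nn] = Delta[i, nn]
        G[nn, i] = Delta[nn, i]             # symmetrize the neighborhood graph
    for m in range(n):                      # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, m][:, None] + G[m, :][None, :])
    return G
```

For points on a curved manifold the resulting \(\hat{\delta }_{ij}\) exceed the direct dissimilarities, because paths are forced through the neighborhood graph; pairs in disconnected components stay at infinity, which is why the choice of k matters.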
4.2 Transforming observed proximities and fitted distances
One may also transform the distances that get fit in the configuration, that is, applying a distance transformation function \(T_D(\varvec{D}(\varvec{X}))\) to the fitted distances. This is more complicated than simply transforming the input proximities—the whole fitting process has to be adapted to accommodate the transformation of fitted distances. It is then natural to apply proximity transformation functions, distance transformation functions and/or weight transformations simultaneously, using any combination of \(T_D(\varvec{D}(\varvec{X}))\), \(T_\Delta (\varvec{\Delta })\) and \(T_W(\varvec{W})\). This allows for a rich class of models with possible parameter vectors \(\varvec{\theta } \subseteq \{\theta _\Delta ,\theta _D,\theta _W\}\), corresponding to the hyperparameters for the transformations of proximities, fitted distances and weights.
4.2.1 Power transformations for proximities and distances
Employing power transformations on fitted distances and proximities is often done (e.g., Ramsay 1982; Buja and Swayne 2002; Buja et al. 2008; Groenen and De Leeuw 2010).
A general instance of a PS type with power-transformed fitted distances is r-stress (De Leeuw 2014), the transformation being \(\hat{d}_{ij}(\varvec{X}) = d_{ij}(\varvec{X})^{2r}\) with \(r \in \mathbb {R}_+\). A number of stress versions can be expressed as special or limiting cases of r-stress, including raw Stress (\(r=0.5\), Kruskal 1964), s-stress (\(r=1\), Takane et al. 1977) and maximum likelihood MDS (Ramsay 1977) for \(r \rightarrow 0\). Here \(\theta _D=r\).
It is straightforward to extend this to use power functions as the proximity transformation function, distance transformation function and weight transformation function simultaneously (POST-MDS), so \(\hat{d}_{ij}(\varvec{X})=d_{ij}(\varvec{X})^\kappa \), \(\hat{\delta }_{ij}=\delta ^\lambda _{ij}\), and \(\hat{w}_{ij}=w_{ij}^\rho \) with \(\lambda , \rho \in \mathbb {R}, \kappa \in \mathbb {R}_+\). Inserted into (5) this is called power stress or p-stress in the literature. It subsumes many metric MDS models. Here \(\varvec{\theta }\) is a three-dimensional parameter vector, \(\varvec{\theta }=(\theta _{\Delta },\theta _D,\theta _W)^\top =(\lambda ,\kappa ,\rho )^\top \).
This encompasses Sammon- or elastic-scaling-type models by using \(\hat{w}_{ij}=\hat{\delta }_{ij}^\rho \) for the appropriate \(\rho \), and can be turned into curvilinear component analysis-type models with \(\hat{\varvec{W}}=T_W(\hat{\varvec{D}}(\varvec{X})\vert \tau )\) being a bounded and monotonically decreasing function, e.g., \(\hat{w}_{ij}=\mathbb {1}(\hat{d}_{ij}(\varvec{X}) \le \tau )\), where \(\varvec{\theta }=(\kappa ,\lambda ,\tau )^\top \).
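As a sketch of how these power transformations enter the stress sum (5), here is a single pair's p-stress contribution (our own illustration with toy values):

```python
# One pair's contribution to p-stress: power-transform the fitted distance,
# the proximity and the weight, then take the weighted squared residual.

def pstress_term(d_ij, delta_ij, w_ij, kappa, lam, rho):
    d_hat = d_ij ** kappa        # transformed fitted distance
    delta_hat = delta_ij ** lam  # transformed proximity
    w_hat = w_ij ** rho          # transformed weight
    return w_hat * (d_hat - delta_hat) ** 2
```

With kappa = lam = rho = 1 this reduces to the plain metric stress term, which is the sense in which p-stress subsumes metric MDS.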
4.2.2 Box–Cox transformations for proximities and distances
Chen and Buja (2013) propose Box–Cox transformations on observed proximities and on fitted distances in an energy badness-of-fit formulation (9). For complete data matrices it is a three-parameter energy-type MDS family, BC-MDS, \(\sigma _{\text {BC}}(\varvec{X}\vert \varvec{\theta })\):
with \(\mu , \rho \in \mathbb {R}\) and \(\lambda \in \mathbb {R}_+\). Here \(BC_\alpha \) is the one-parameter Box–Cox transformation (Box and Cox 1964) with parameter \(\alpha \),

\(BC_{\alpha }(x)={\left\{ \begin{array}{ll} \frac{x^{\alpha }-1}{\alpha } &{} \text {if } \alpha \ne 0\\ \log (x) &{} \text {if } \alpha = 0. \end{array}\right. }\)
Note that here the distance transformations used in the attraction and repulsion part are not equal as \(\lambda > 0\).
The hyperparameter vector is \(\varvec{\theta }=(\mu ,\lambda ,\rho )^\top \).
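The one-parameter Box–Cox transformation, written out directly (standard definition; the log branch is the continuous \(\alpha \rightarrow 0\) limit):

```python
import math

def box_cox(x, alpha):
    """One-parameter Box-Cox transform of x > 0: (x**alpha - 1)/alpha,
    with log(x) as the continuous limit at alpha = 0."""
    if alpha == 0:
        return math.log(x)
    return (x ** alpha - 1.0) / alpha
```

The continuity at zero is what makes the family usable for smooth hyperparameter search over \(\mu , \lambda , \rho \).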
4.2.3 Sets of local neighborhood
Yet another idea can be expressed with parametrized transformations: local MDS (lMDS, Chen and Buja 2009). Let \(N_k\) define the symmetric set of nearby pairs of points (i, j) such that \((i,j)\in N_k\) iff i is among the k-nearest neighbors of j or vice versa. Let \(\delta _{\infty }\) be a large constant “imputed” dissimilarity and w a small weight, e.g., \(w \approx 1/\delta _{\infty }\) as in the standard lMDS objective.
The lMDS objective can be expressed in our framework as (5) with the transformations

\(\hat{\delta }_{ij}={\left\{ \begin{array}{ll} \delta _{ij} &{} \text {if } (i,j) \in N_k\\ \delta _{\infty } &{} \text {if } (i,j) \notin N_k \end{array}\right. }\)

and

\(\hat{w}_{ij}={\left\{ \begin{array}{ll} 1 &{} \text {if } (i,j) \in N_k\\ w &{} \text {if } (i,j) \notin N_k. \end{array}\right. }\)
For the standard lMDS objective this can be reduced to a version with a free hyperparameter \(\tau =2w\delta _{\infty }\) for a given k, as well as the k itself (Chen and Buja 2009); hence \(\varvec{\theta }=(k,\tau )^\top \).
5 Structures, c-structuredness and indices
Central to our proposal for hyperparameter tuning in PS models is the concept of c-structuredness, as c-structuredness is often desirable from an applied point of view. We informally defined c-structuredness as the adherence of the arrangement of objects in a configuration \(\varvec{X}\) to a predefined notion of what constitutes structure. How much a structure in question is present in \(\varvec{X}\) is expressed as the amount of c-structuredness: The higher the c-structuredness, the clearer the structure is present.
Naturally, there is a very high number of possible structures. Some examples of c-structuredness that we deem to be of particular interest are given in Fig. 1 with index values as formalized below and in Appendix A in the supplementary document. The c-structuredness types we single out here are (i) c-regularity (objects arranged on a regular grid), (ii) c-association (any (non)linear association), (iii) c-clusteredness (objects arranged in clusters), (iv) c-linearity (objects arranged on a straight line), (v) c-functionality (objects arranged on a smooth line) and (vi) c-manifoldness (objects arranged so that they resemble a manifold).
For STOPS we need to quantify information about the c-structuredness present in \(\varvec{X}\). We do this with univariate c-structuredness indices that capture the essence of a particular structure in \(\varvec{X}\). The indices should be numerically high (low) the more (less) of a given structure we find. To illustrate, for c-linearity and \(M=2\), we may use the absolute value of the Pearson correlation coefficient for the columns of \(\varvec{X}\) as a c-structuredness index, which is 1 when there is a perfect noise-free linear relationship or 0 when there is no linear relationship.
We aim at indices that capture the essence of a particular structure, depend on the arrangement of objects in \(\varvec{X}\), are bounded from above and below, i.e., have unique finite minima and maxima, and are nonnegative. In what follows we list three examples of structures in a configuration as illustration. For each of these there is an index that captures the c-structuredness for that structure; these are also used in the examples.
We generally write c-structuredness indices as \(\text {I}_{\text {c-structuredness}}(\varvec{X} \vert \varvec{\gamma })\),^{Footnote 3} which means \(\text {I}_{\text {c-structuredness}}(\cdot )\) is an index that reflects c-structuredness as a function of \(\varvec{X}\), possibly depending on further index meta-parameters \(\varvec{\gamma }\) which are assumed to be given.
5.1 C-clusteredness and c-regularity
These structures are concerned with how clustered the configuration appears in the sense of Rusch et al. (2018). The concept essentially captures where \(\varvec{X}\) falls on a continuum between unclustered and maximally clustered.
Following Rusch et al. (2018), for a minimum number c of points that must comprise a cluster we denote with \(O(\varvec{X} \vert c)=(\varvec{x}_{o(i)})_{o(i) = 1, \ldots , N}\) an ordering of the N original row vectors \(\varvec{x}_i, (i=1, \ldots ,N)\) in \(\varvec{X}\); \(O(\varvec{X} \vert c)\) is a permutation of the rows of \(\varvec{X}\). The position of object \(\varvec{x}_i\) in the ordering \(O(\varvec{X} \vert c)\) is \(o(i)\) (which depends on c, but we drop this for readability). \(O(\varvec{X} \vert c)\) can be obtained by the algorithm OPTICS (Ankerst et al. 1999), which provides the bijective algorithmic mapping \(o: \{1,\ldots ,N\} \rightarrow \{1,\ldots ,N\}\). OPTICS also augments \(O(\varvec{X} \vert c)\) with each object’s so-called representative reachability distance \(r^*_{o(i)}\), which cannot be expressed in closed form.
For the index of c-clusteredness we then use the normalized OPTICS Cordillera, \(\text {OC}'(\varvec{X})\) (Rusch et al. 2018),
where \(\varvec{\gamma }=(c,\epsilon ,q,d_{\max })^\top \) collects the four free meta-parameters of the OPTICS Cordillera. Here \(q \ge 1\) designates a q-norm, \(\epsilon \) is the maximum radius around each point to look for neighbors and \(d_{\max }\) denotes a maximum reference distance, \(\min d_{ij} \le d_{\max } \le \epsilon \), that winsorizes the \(r^*_{o(i)}\) for robustness. Apart from c the meta-parameters are optional; we suggest fixing them as \(d_{\max }=1.5 \max (d_{ij}(\varvec{X}^0))\), \(\epsilon =2 \max _{i,j}(\max (\delta _{ij},d_{ij}(\varvec{X}^0)))\) and q as the norm of the target space (\(\varvec{X}^0\) refers to the Strain solution for untransformed proximities). The larger this index value, the more c-clusteredness we find. See the middle left panel in Fig. 1 for an example.
The \(\text {OC}'(\varvec{X})\) takes on its minimum value when for each point the distance to the \(c\)th neighbor is constant (Rusch et al. 2018). We can use this to fashion an index for c-regularity, a structure where the observations are arranged so that all nearest neighbor distances are equal. We fix the parameter c to 2 (nearest neighbor) and \(q=1\) and therefore use
\(\text {I}_{\text {c-regularity}}(\varvec{X} \vert \varvec{\gamma }) = 1-\text {OC}'(\varvec{X}\vert c=2,q=1,\epsilon ,d_{\max }),\)
with \(\varvec{\gamma }=(\epsilon ,d_{\max })^\top \). This index will be 1 if the point arrangement is perfectly regular and 0 if perfectly clustered with respect to N, \(c=2\) and \(d_{\max }\). See the top left panel in Fig. 1.
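The intuition behind the index (all nearest-neighbor distances equal means perfect regularity) can be conveyed with a much cruder stand-in: one minus the coefficient of variation of nearest-neighbor distances. This is NOT the OPTICS-Cordillera-based index above; the function name and construction are ours, for illustration only:

```python
import numpy as np

def nn_regularity(X):
    """Crude c-regularity surrogate: 1 minus the coefficient of variation
    of nearest-neighbour distances. Equals 1 exactly when all nearest-
    neighbour distances coincide (a perfectly regular arrangement) and
    drops as points clump together. Not the OPTICS Cordillera index."""
    X = np.asarray(X, dtype=float)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)
    nn = D.min(axis=1)   # distance from each point to its nearest neighbour
    return 1.0 - nn.std() / nn.mean()

# On a 5 x 5 unit grid every nearest-neighbour distance is 1.
grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
print(nn_regularity(grid))  # 1.0
```

A clustered configuration, by contrast, has strongly varying nearest-neighbor distances and yields a noticeably smaller value.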
Using these indices with negative weights in Stoploss will favor parameters that give configurations with high c-clusteredness/c-regularity, whereas positive weights favor low c-clusteredness/c-regularity.
5.2 C-manifoldness
This structure captures how close almost arbitrary, real-valued transformed projections of columns of \(\varvec{X}\) lie to a 1D manifold (a line) in the transformed space.
We use the maximal correlation coefficient (Gebelein 1941; Sarmanov 1958) for this; the higher the coefficient, the stronger the relationship. For a sample, the maximal correlation can be calculated approximately with the alternating conditional expectations algorithm (ACE, Breiman and Friedman 1985). Let \(x_m\) denote the mth column vector of \(\varvec{X}, (m=1,\dots ,M)\). To obtain the sample version, we take the pairs \((x_{ik}, x_{il}), i=1,\dots ,N\), input them into the ACE algorithm and use the output, denoted \(\text {ACE}(x_k,x_l)\), to construct a c-structuredness index for a given \(\varvec{X}\).
We then use an aggregation function \(\phi _{k,l}(\cdot )\) (e.g., the maximum) of the ACE value between any two different columns \(x_k,x_l; k \ne l\), so
\(\text {I}_{\text {c-manifoldness}}(\varvec{X}) = \phi _{k,l}\left( \text {ACE}(x_k,x_l)\right) .\)
An example of such a relationship is given in the bottom right of Fig. 1. As before, a negative weight for this index in Stoploss would favor \(\varvec{\theta }\) that provide higher index values and positive weights lower index values.
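The ACE idea can be sketched compactly. The following rough sketch alternately replaces each variable's score by the conditional mean of the other's score within quantile bins; Breiman and Friedman (1985) use data-driven smoothers instead of fixed bins, so this simplification is ours and for illustration only:

```python
import numpy as np

def ace_maxcor(x, y, bins=10, iters=50):
    """Rough sketch of maximal correlation via the ACE idea: alternately
    set each variable's score to the conditional mean of the other's
    score within quantile bins, then report the correlation of the final
    scores. A binned simplification of Breiman & Friedman's algorithm."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    bx = np.digitize(x, np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1]))
    by = np.digitize(y, np.quantile(y, np.linspace(0, 1, bins + 1)[1:-1]))
    theta = (y - y.mean()) / y.std()                  # initial score of y
    for _ in range(iters):
        # phi(x) = E[theta(y) | x], estimated by x-bin means
        m = np.bincount(bx, theta, bins) / np.maximum(np.bincount(bx, None, bins), 1)
        phi = m[bx]
        # theta(y) = E[phi(x) | y], estimated by y-bin means, re-standardized
        m = np.bincount(by, phi, bins) / np.maximum(np.bincount(by, None, bins), 1)
        theta = m[by]
        theta = (theta - theta.mean()) / theta.std()
    return abs(np.corrcoef(phi, theta)[0, 1])

# A noisy parabola: near-zero Pearson correlation, high maximal correlation.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 2000)
y = x**2 + 0.01 * rng.normal(size=2000)
print(ace_maxcor(x, y))  # close to 1, unlike |Pearson correlation|
```

The index from the text would then aggregate over column pairs, e.g., with the maximum as \(\phi _{k,l}\): `max(ace_maxcor(X[:, k], X[:, l]) for k in range(M) for l in range(k))`.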
These are instances of the c-structuredness concept as used in Sect. 7. Details on these and more structures that may be of interest can be found in Appendix A in the supplementary document. We are quick to point out that other structures/indices are conceivable and can easily be incorporated in our framework, e.g., the ones from France and Carroll (2007).
6 Optimization for STOPS
Our approach to searching for the hyperparameters builds on an idea in Rusch et al. (2021). We view hyperparameter tuning via STOPS as a multiobjective optimization problem of either minimizing badness-of-fit, or maximizing/minimizing c-structuredness indices, or both over \(\varvec{\theta }\). We choose a scalarization approach leading to variants of \(\text {Stoploss}(\varvec{\theta })\) as in (2) and (3) and the resulting optimization problem (4).
Minimizing any instance of (4) as a function of \(\varvec{\theta }\) is a difficult, analytically intractable problem. There are several reasons for this: First, concrete instances of Stoploss result from combining any badness-of-fit measure with any number of different c-structuredness indices, which leads to a high number of possible concrete instances. These concrete instances need not share structural similarities that can be exploited for optimization. Second, the optimization surface over \(\varvec{\theta }\) can be difficult for different STOPS models in that it is typically nonconvex, may have local optima, may have (removable or jump) discontinuities and may be smooth or nonsmooth to varying degrees over regions of the support of \(\varvec{\theta }\). Third, \(\varvec{\theta }\) may be continuous or discrete. All of these problems are empirically illustrated for a typical Stoploss (the one from Sect. 7.1) in Appendix C in the supplement, where we also elaborate on the nature of Stoploss.
We therefore aim at a general purpose approach that can deal with the intractability and thus be used for all conceivable Stoplosses. To that end we approach the optimization problem with the nested profile-type strategy laid out in Algorithm 1, which only employs function evaluations; u denotes the iteration step, and we do not list the given \(\varvec{\Delta }\) and \(v_0,\dots ,v_l\) for \(\text {Stoploss}(\varvec{\theta })\) in the algorithm. \(\varvec{\Theta }\) denotes the feasible set of \(\varvec{\theta }\) (e.g., box constraints). The \(\sigma _{\text {PS}}\left( \varvec{X}^*(\varvec{\theta })\vert \varvec{\theta } \right) \) employed should be scale- and unit-free, e.g., normalized to lie between zero and one (see Appendix B in the supplementary document).
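The nested profile-type strategy can be sketched as follows. All names here are hypothetical stand-ins: `solve_ps` represents the inner PS solve, the weights `v` and index functions correspond to the additive scalarization, and a plain grid over candidates replaces the candidate generator discussed next:

```python
import numpy as np

def stoploss(theta, solve_ps, indices, v):
    """Additive Stoploss: v[0] * badness-of-fit + sum of v[l] * index_l(X*).
    `solve_ps` stands in for the inner PS solve of Algorithm 1: given theta
    it returns the optimal configuration X*(theta) together with its
    normalized badness-of-fit. All names here are hypothetical."""
    X_star, fit = solve_ps(theta)
    return v[0] * fit + sum(vl * idx(X_star) for vl, idx in zip(v[1:], indices))

def profile_search(candidates, solve_ps, indices, v):
    """Outer step: evaluate Stoploss at each proposed theta and keep the
    best. Plain grid search as candidate generator; for costly Stoplosses
    Bayesian optimization is the recommended generator instead."""
    values = [stoploss(t, solve_ps, indices, v) for t in candidates]
    u = int(np.argmin(values))
    return candidates[u], values[u]

# Toy inner solver: pretend the fit is worst away from theta = 2.
toy_solve = lambda t: (np.zeros((3, 2)), abs(t - 2.0))
print(profile_search([0.0, 1.0, 2.0, 3.0], toy_solve, [], [1.0]))  # (2.0, 0.0)
```

The key design point is the nesting: the inner solver is an unmodified PS routine, and the outer step only ever sees function values of Stoploss.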
To generate candidates for \(\varvec{\theta }_u\) in Step 3 of Algorithm 1, different general purpose strategies for global optimization like random search, grid search and/or derivative-free metaheuristics can in principle be employed. We have had good experiences with particle swarm optimization (Eberhart and Kennedy 1995) but also with random search and the adaptive Luus-Jaakola pattern search (Luus and Jaakola 1973; Rusch et al. 2021) for a moderate number of iterations (around 100), so these strategies can be recommended.
That said, one of the contributions of this article is to suggest a generally applicable default strategy that can be used with, and promises to work well for, every conceivable instance of Stoploss. Working well here means suggesting candidates for \(\varvec{\theta }\) that lead to relatively low values of the Stoploss objectives in an efficient manner (i.e., for a relatively small number of iterations), as each function evaluation in STOPS can be very costly (especially so for transformations of the \(\varvec{D}(\varvec{X})\)) due to having to solve a PS problem. To establish a general purpose strategy, we empirically investigated a number of solvers, including simulated annealing (Kirkpatrick et al. 1983), random search, the adaptive Luus-Jaakola algorithm, particle swarm optimization, and different versions of Bayesian optimization (BO, Mockus 1989) on different Stoplosses and different data sets (not shown).
What emerged as a generally applicable strategy that worked well in most cases (in the sense of suggesting candidates with lower “minima” than its competitors for a small number of iterations of around 10-20) was Bayesian optimization with a treed Gaussian process with jumps to the limiting linear model as the surrogate model (TGPLLM, Gramacy and Lee 2008), which we elaborate on below. This strategy not only worked well for efficiently finding good candidates for \(\varvec{\theta }\) in all the data instances we used it for, but is also theoretically able to accommodate the aforementioned structural difficulties of Stoploss (see also Appendix C in the Supplement). We note that particle swarm optimization has often performed on par and sometimes better for a higher number of iterations.
Optimizing Stoploss with Bayesian optimization When investigating different general purpose solvers for optimization in STOPS we found that Bayesian optimization (BO, Mockus 1989) lends itself well as an out-of-the-box solution for suggesting candidates for \(\varvec{\theta }\) in Step 3 of Algorithm 1.
The basic idea in Bayesian optimization is to approximate the unknown Stoploss surface with a flexible surrogate model (“prior”). The surrogate model identifies areas for exploration where we can expect improvement of the objective by maximizing an acquisition function over the surrogate surface. From these areas new candidate(s) for \(\varvec{\theta }\) are sampled and the objective function is evaluated at the new candidate(s) (“data”). Then the surrogate model is updated (“posterior”) to reflect this new information, the acquisition function of the refitted surrogate model is again maximized and the whole process repeats. Hence in each iteration a new candidate is chosen that is expected to improve the objective of interest based on the available information, thus trading off exploitation and exploration in an efficient way.
This approach works well for STOPS due to three aspects: First, BO needs only function evaluations, so the modularity of STOPS and the lack of exploitable structure is no hindrance. Second, BO is competitive in situations where the parameter vector is low-dimensional (e.g., Siivola et al. 2021), as is the case for all the \(\varvec{\theta }\) we outlined in Sect. 4 (with at most three elements). Third, fitting the surrogate model and optimizing the acquisition function may be less expensive than evaluating the objective function, as the cost of finding a configuration in PS can be quite high; BO can then dramatically reduce the number of evaluations necessary to get close to a global optimum.
To describe the BO approach more formally, we have the unknown objective function \(\text {Stoploss}(\varvec{\theta }): \varvec{\Theta } \rightarrow \mathbb {R}\). For notational convenience we write \(\mathcal {Y}:=\text {Stoploss}(\varvec{\theta })\) for the function and \(\mathcal {Y}_u:=\text {Stoploss}(\varvec{\theta }_u)\) for the \(u\)th evaluation of the function at \(\varvec{\theta }_u\). We have a surrogate model for the objective, \(\mathcal {M}(\varvec{\theta },\mathcal {Y},\iota )\), with possible surrogate model meta-parameters \(\iota \), fitted to the sequence of U pairs \(\{\varvec{\theta }_u,\mathcal {Y}_u\}_{u=1}^{U}\). We also have the acquisition function \(\Omega \left( \varvec{\theta }\vert \{\varvec{\theta }_u,\mathcal {Y}_u\},\mathcal {M}(\varvec{\theta },\mathcal {Y},\iota )\right) \). We look for
\(\varvec{\theta }_{u+1} = \arg \max _{\varvec{\theta } \in \varvec{\Theta }} \Omega \left( \varvec{\theta }\vert \{\varvec{\theta }_u,\mathcal {Y}_u\},\mathcal {M}(\varvec{\theta },\mathcal {Y},\iota )\right) ,\)
which constitutes the new candidate. Then \(\mathcal {Y}_{u+1}\) is evaluated and the model \(\mathcal {M}\) is updated with the new point \(\{\varvec{\theta }_{u+1},\mathcal {Y}_{u+1}\}\), yielding the “posterior”,
Then the acquisition function gets maximized for the updated data and the whole process repeats until some termination criterion is met.
For \(\Omega (\varvec{\theta }\vert \{\varvec{\theta }_u,\mathcal {Y}_u\}, \mathcal {M}(\varvec{\theta },\mathcal {Y},\iota ))\) we use the expected improvement (EI) criterion (Jones et al. 1998), which is
\(\Omega \left( \varvec{\theta }\vert \{\varvec{\theta }_u,\mathcal {Y}_u\},\mathcal {M}(\varvec{\theta },\mathcal {Y},\iota )\right) = E\left[ \max \left( \mathcal {Y}^*-\mathcal {Y},0\right) \right] ,\)
with \(\mathcal {Y}^*=\min \mathcal {Y}_u\). EI has shown good behavior over a wide array of tasks (Bergstra et al. 2011).
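Under a Gaussian surrogate posterior the EI criterion for minimization has the familiar closed form, which can be sketched as follows (a generic sketch from the surrogate's posterior mean and standard deviation, not the TGPLLM implementation):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """EI acquisition for minimization, evaluated from the surrogate's
    posterior mean `mu` and standard deviation `sigma` at a candidate:
    EI = (y* - mu) * Phi(z) + sigma * phi(z) with z = (y* - mu) / sigma,
    and EI = max(y* - mu, 0) in the noise-free limit sigma = 0."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    safe = np.where(sigma > 0, sigma, 1.0)       # avoid division by zero
    z = (y_best - mu) / safe
    ei = (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, np.maximum(y_best - mu, 0.0))
```

The new candidate \(\varvec{\theta }_{u+1}\) is the maximizer of this quantity over \(\varvec{\Theta }\); note how larger posterior uncertainty `sigma` raises EI, which is exactly the exploration incentive described above.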
What is crucial for the performance of BO is how well the surrogate model \(\mathcal {M}\) is able to approximate the unknown objective. To find a good surrogate model \(\mathcal {M}\) for our case, we empirically investigated the behavior of different \(\text {Stoploss}(\varvec{\theta })\) over \(\varvec{\theta }\) for a number of different data sets; see the examples in Appendix C in the supplementary document.
Given the nature of Stoploss we mentioned before, we looked for a surrogate model that is nonstationary, allows for jumps and piecewise constant regions (e.g., for discrete \(\varvec{\Theta }\)) and allows sufficiently rough Gaussian processes within segments over \(\varvec{\theta }\) (for nonconvexity). We found it in the treed Gaussian process with jumps to the limiting linear model, with the separable power exponential family for the correlation structure \(\iota \). This process recursively partitions the search space into nonoverlapping segments (like a regression tree) and for each segment independent Gaussian processes are used, which have linear models as their limit. This allows nonstationarity over the whole search space and accommodates piecewise linear or constant areas of the search space (which is useful especially for discrete \(\varvec{\theta }\)). Of further note is that the independent GPs do not have to connect at the boundaries of the segments, thus allowing for jump discontinuities. The specification and estimation of this process is fully Bayesian; for details, see Gramacy and Lee (2008).
BO with TGPLLM is sufficiently flexible and general for optimizing Stoploss for the dimensionality of \(\varvec{\theta }\) that we are faced with, both from a theoretical perspective and over all our empirical investigations. We thus recommend it as the default approach for generating candidates for optimizing \(\text {Stoploss}(\varvec{\theta })\) in a small number of iterations. We point out that there may be combinations of data, hyperparameters and Stoploss specification for which a different approach may be more accurate or efficient, e.g., the example in Sect. 7.1, where a crude grid search is sufficient. It is also possible to develop tailored optimization approaches for concrete Stoplosses that exploit structure and perform better. Also, for a \(\varvec{\theta }\) with more than 20 dimensions, standard BO starts to perform less well. Nevertheless, we see BO with the TGPLLM surrogate model as a general strategy that can be successfully used with every conceivable Stoploss that can be derived from the suggestions within this article.
7 Application
In this section, we demonstrate how the STOPS framework can be used for tuning hyperparameters in PS for individual data analytical instances. For the purpose of illustration, we consider a diverse set of Stoplosses comprising different PS badness-of-fit measures, transformations and c-structuredness indices.
7.1 Unrolling the swiss roll
As a simple example for illustrating the concept, we use the STOPS framework to select hyperparameters for the geodesic distance function in Isomap to unroll the classic swiss roll regularly. In the swiss roll example, data lie equally spaced on a spiral manifold (“swiss roll”) embedded in a higher dimensional space. Proximity scaling methods that emphasize local structure are able to flatten out this manifold in the target space. One of the most popular PS variants for doing this is Isomap.
We use an example with 150 data points in three dimensions lying on a grid on the embedded manifold, where along the y dimension there are five points and along the x dimension there are 30. We code the points in a diverging palette from the center of the roll along the spiral direction. The flattening-out operation can be considered successful if the same shades are arranged vertically, the palette runs from left to right, and the grid is recovered.
As described in Sect. 4.1.2, Isomap has a governing hyperparameter for the calculation of the geodesic distances. In line with our objective of flattening out the swiss roll, we are looking for a solution with objects arranged on a regular grid and also want to preserve the neighborhood of points around each point as well as possible (specifically preserving a neighborhood of 5 points). In terms of c-structuredness, one way to measure whether the objects are arranged on a grid is c-regularity (see Sect. 5.1), and preservation of the neighborhood can be assessed with c-faithfulness (see Appendix A.7 in the supplementary document), both of which would ideally be 1.
We use the \(\varepsilon \)-version of Isomap and may now set the hyperparameter for the geodesic distances to different values and tune by inspecting for which hyperparameter values we get a faithful, regular representation (e.g., manually or via a grid search). We may also tune the hyperparameter with the STOPS framework automatically.
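The geodesic-distance step of the \(\varepsilon \)-version of Isomap can be sketched as follows (the subsequent MDS step on the geodesic distances is omitted; the function name is ours):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def epsilon_geodesic(X, eps):
    """Geodesic distance step of the eps-variant of Isomap: connect all
    pairs of points closer than eps and take graph shortest paths. The
    result is the dissimilarity matrix handed to the MDS step (omitted
    here); entries are inf when eps is too small to connect the graph."""
    X = np.asarray(X, dtype=float)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    W = np.where(D <= eps, D, np.inf)   # keep only eps-neighbourhood edges
    return shortest_path(W, method="D", directed=False)

# Points on a quarter circle: with a small eps the geodesic between the
# endpoints follows the arc (about pi/2) rather than the straight chord.
t = np.linspace(0.0, np.pi / 2, 50)
X = np.column_stack([np.cos(t), np.sin(t)])
G = epsilon_geodesic(X, eps=0.05)
print(G[0, -1])  # about 1.57, versus a Euclidean chord of about 1.41
```

This also shows why \(\varepsilon \) needs tuning: too small a value disconnects the neighborhood graph (infinite entries), too large a value short-circuits the manifold and the geodesic collapses towards the Euclidean distance.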
In Table 1 we list values of normalized stress, c-regularity and c-faithfulness for different \(\varepsilon \) (our \(\varvec{\theta }\)) obtained in a grid search from 0.09 to 0.3 by increments of 0.02. Here the best values for c-regularity and c-faithfulness would be obtained^{Footnote 4} with \(\varepsilon ^*= 0.11\).
Alternatively, we can use STOPS with Isomap and a weight of 0 for the badness-of-fit and \(-1\) for c-regularity and c-faithfulness respectively (negative, as we want to maximize them). We use the mSTOPS variant for \(\text {Stoploss}(\varvec{\theta })\), combining the two c-structuredness indices multiplicatively. The search space for \(\varepsilon \) is set to be between 0.09 and 0.3. We use BO with TGPLLM for 10 iterations. Note that due to our objective we only consider c-regularity and c-faithfulness and disregard stress, as the latter measures how closely the fitted distances approximate the dissimilarities (which is of no concern for our choice of \(\varepsilon \) here). The role of a fit measure is played by c-faithfulness.
Tuning the hyperparameter with the mSTOPS procedure^{Footnote 5} suggests using \(\varepsilon ^*=0.1095\). The resulting configuration can be found in the upper panel of Fig. 2. This is the most faithful and regular representation that we obtain by varying \(\varepsilon \), with values of 0.9911 for c-regularity and 0.8293 for c-faithfulness, although we cannot achieve a perfectly regular rectangle. Both values either match or improve on the corresponding optimal values we found in the grid search for \(\varepsilon =0.11\). For comparison, a configuration with nonoptimal \(\varepsilon =0.21\) is shown in the lower panel of Fig. 2. This solution has c-regularity equal to the optimal one, but lower c-faithfulness due to the difficulty of representing the inner part of the swiss roll for this \(\varepsilon \).
7.2 Handwritten digits
We consider a second application of the STOPS framework, this time on handwritten digits data from Alimoglu and Alpaydin (1997). Following Izenman (2009), the original data were obtained from 44 people who handwrote the digits \(0,\dots ,9\) 250 times by following the trajectory of the pen on a \(500 \times 500\) grid which was then normalized. From the normalized trajectory, eight bivariate points were randomly selected for each handwritten digit, leading to 16 variables per digit. We look at a random sample of 500 written digits.
To these data we apply different nonlinear dimensionality reduction methods that allow for parametric transformations. We use the STOPS framework to select the transformation parameters for visualization of the multivariate similarities with emphasis on two structures: clusters, to make the separation of digits clear, and an underlying manifold. The PS methods we tune with STOPS are Sammon mapping with power transformations of the proximities, Box–Cox MDS, POSTMDS and local MDS. This is a relatively large data set in an MDS context and fitting a single PS can already be costly, so using BO is an efficient strategy.
We use aSTOPS with normalized stress with a weight of \(v_0=1\), c-clusteredness (see Sect. 5.1) with a high weight of \(v_1=-400\) and c-manifoldness (Sect. 5.2) with a moderate weight of \(v_2=-2.5\) (negative weights, as both types of c-structuredness are to be maximized).^{Footnote 6} The OPTICS Cordillera index uses \(q=2\), \(c=5\), \(\epsilon =10\) and \(d_{\max }=0.6\); the other indices use the default values. For comparability of indices the configurations are rescaled so that the highest column variance is 1. BO with TGPLLM is carried out in 10 steps.^{Footnote 7}
The optimal hyperparameter for the Sammon mapping with power transformation was found at \(\lambda ^*=5.4\). The c-clusteredness value was 0.15 and the c-manifoldness value was 0.928. The configuration is shown in the top left panel of Fig. 3.
For lMDS the transformation hyperparameter values selected by STOPS were \(k^*=14.73, \tau ^*=3\). With them the obtained solution has a c-clusteredness value of 0.094 and a c-manifoldness value of 0.953. The configuration is shown in the top right panel of Fig. 3.
The transformation hyperparameters for Box–Cox MDS were found at \(\mu ^*=4.631, \lambda ^*=4.859, \rho ^*=1.285\). The c-clusteredness value was 0.124 and the c-manifoldness value was 0.961. The configuration is shown in the bottom left panel of Fig. 3.
Lastly, for POSTMDS with \(w_{ij}=\delta _{ij}\), the transformation hyperparameter values were \(\kappa ^*=0.787, \lambda ^*=3.766, \rho ^*=2.979\). The c-clusteredness value was 0.116 and the c-manifoldness value was 0.862. The configuration for the POSTMDS solution is shown in the bottom right panel of Fig. 3.
Overall, we can see that the STOPS hyperparameter selection leads to configurations that, in general, arrange the digits in a sensible way: Due to the high c-clusteredness weight, similar-looking digits are positioned visually close to each other, e.g., 0, 5, 8 or 1, 2, 7. This is rather pronounced for lMDS, power Sammon and POSTMDS; BCMDS produces an arrangement in three clusters that is less in accordance with the other representations and the ground truth. The c-manifoldness weight ensured that each arrangement does not deviate too far from an appreciable, imagined submanifold. For every method, using the optimal hyperparameters from STOPS improves the c-structuredness indices as compared to an untransformed Sammon solution (which had c-clusteredness of 0.0833 and c-manifoldness of 0.5939).
For comparison, we also include two configurations in Fig. 3 obtained from Sammon mapping and lMDS respectively, where the c-structuredness weights have been set to \(v_1=v_2=0\) and \(v_0=1\), therefore disregarding c-structuredness and only minimizing the badness-of-fit criterion over \(\varvec{\theta }\). We can see that disregarding c-clusteredness leads to less c-structuredness: for the Sammon mapping result, clusters are less discernible in the bottom left plot than in the top left of Fig. 3 (especially when ignoring the coloring), and for lMDS clusters are hard to make out in the configuration obtained when disregarding c-clusteredness (bottom right plot vs. top right plot).
7.3 Republican mantras
For our last illustration of the STOPS framework in action, we turn to the “Republican Mantra” data from Mair et al. (2014). The data are natural language texts that were periodically obtained from the website of the Republican party (Grand Old Party, or GOP) in the USA, which at that time hosted a section called “Republican Faces”. In this section, supporters of the Republican party gave a short statement about why they see themselves as Republican. The statements always started with “I’m a Republican, because...” followed by the person’s personal ending. An example statement would be “...I believe in a free market society which enables hard work to equal success”. Mair et al. (2014) used MDS to explore the document term matrix of these statements but encountered difficulties when using the cosine distance between the words due to an almost-equal-dissimilarities artifact (see the first row of Fig. 4; for illustration, this arrangement has c-clusteredness of 0.053 with respect to \(c=6, d_{\max }=1.2, \epsilon =10, q=2\)). This prompted Mair et al. (2014) to abandon the cosine distance in favor of a co-occurrence-based dissimilarity measure, the gravity similarity, which was subjected to power transformations. Of note is that the concrete transformations they used in their work were chosen manually and ad hoc.
We revisit the data from the original angle: We retain the cosine distance that the authors originally aimed for but employ the STOPS framework to guide the PS result towards more c-structuredness by choosing power transformations in a flexible PS version that is appropriate for promoting structures of interest. This also serves as an empirical example of how an approach that relied on manual trial-and-error hyperparameter selection would have benefited from the STOPS framework. To illustrate the versatility of STOPS we use aSTOPS and mSTOPS with different structures and weights.
Additive STOPS: high c-clusteredness, high c-association, low c-complexity With aSTOPS we select parameters in a POSTMDS with cosine dissimilarities so that there is a focus on c-clusteredness (see Sect. 5.1) and c-association (Sect. A.1 in the Supplement) but favoring lower c-complexity (Sect. A.2 in the Supplement) for the association. To translate this into STOPS model weights, we used \(-10, -5\) and 1 as weights for the c-structuredness indices. The weight for stress was 1. Essentially this means that c-clusteredness is weighted twice as important as c-association, which is valued 5 times as important as stress. C-complexity is more of an afterthought, but for two results with similar c-clusteredness and c-association, we prefer the one with lower complexity, traded off equally with badness-of-fit. We note that the weights we use are somewhat arbitrary, but what is clear is that all in all we relax the fit criterion quite a bit to allow for high c-clusteredness and c-functionality. The meta-parameters in \(\varvec{\gamma }\) are \(c=6, d_{\max }=1.2, \epsilon =10, q=2\) for c-clusteredness and \(\omega =0.9\) for c-association and c-complexity.
The resulting configuration is displayed in the second row of Fig. 4. The objects are arranged in clusters close to a circumplex structure, reflecting the STOPS setup of a clustered arrangement with a relatively simple nonlinear association. It stems from a POSTMDS with \(w_{ij}=\delta _{ij}\) and parameters \(\varvec{\theta }^*=(2.635,1.185,3)^\top \). The square root of the normalized stress value is 0.63 and the c-structuredness indices are 0.386 for c-clusteredness, 0.999 for c-association and 3.807 for c-complexity.
Multiplicative STOPS: high c-nonmonotonicity, moderate c-association With mSTOPS we again select parameters in a POSTMDS with cosine dissimilarities, this time focusing on having a nonmonotonic, nonlinear associative structure in the target space, so high c-nonmonotonicity (Sect. A.10 in the Supplement) and moderate c-association (Sect. A.1 in the Supplement). Essentially we want the objects to be projected close to a functional that should be highly nonmonotonic.^{Footnote 8} To achieve this, badness-of-fit is allowed to become moderately high relative to the c-structuredness, but not as high as before with aSTOPS. This translates to weights of \(-2\) and \(-1\) for c-nonmonotonicity and c-association respectively, and a weight for stress of 1. The meta-parameters in \(\varvec{\gamma }\) were \(\omega =0.9\) for c-nonmonotonicity and c-association.
The resulting configuration is displayed in the last row of Fig. 4. We see that there is an associative structure that is rather complicated and highly nonmonotonic (reminiscent of a trefoil). It stems from a POSTMDS with \(w_{ij}=\delta _{ij}\) and parameter values \(\varvec{\theta }^*=(2.443,19.756,2.984)^\top \). The square root of the normalized stress value is 0.545 and the c-structuredness indices are 0.469 for c-nonmonotonicity and 0.999 for c-association.
7.4 Artifactual c-structuredness
There is an application aspect of selecting hyperparameters based on partially optimizing for c-structuredness that we need to point out: It is possible that this “forces” a configuration to look a certain way even if that does not correspond to the ground truth in the high-dimensional data or in the data generating process, so the c-structuredness exhibited can be an artifact of using STOPS rather than inherent in the data. In that sense, STOPS can artificially generate c-structuredness instead of uncovering “real” c-structuredness.^{Footnote 9} Two situations are of particular interest to discuss: First, if the \(\varvec{\Delta }\) are not at all structured (they exhibit the null structure, i.e., one resulting from perfectly equal dissimilarities; Buja and Swayne 2002), and second, if the \(\varvec{\Delta }\) exhibit some ground truth structures but these structures are not represented in the set of structures used for a STOPS model. We address both of these situations briefly.
In the first situation, the data correspond to the null structure of equal dissimilarities (\(\delta _{ij}=\delta =const.\)), i.e., the objects form a regular simplex in \((N-1)\)-dimensional space. For all the PS methods we mentioned, this leads to highly structured configuration artifacts for the fitted distances. The effect that STOPS has in this situation depends on the transformations used for the \(\varvec{\Delta }\) (and, by extension, the transformation vector \(\theta _{\Delta }\)). Generally speaking, if after applying transformations to the \(\varvec{\Delta }\) the elements of \(\hat{\varvec{\Delta }}\) are still equal, then the null structure is preserved in the sense of producing the same configuration artifact in the target space for the fitted distances, regardless of the choice of transformation for \(\hat{\varvec{\Delta }}\). This is the case when all \(\delta _{ij}\) are subjected to the same deterministic transformation for all i, j. STOPS will then not induce c-structuredness artificially beyond the effect of the null structure on the PS configuration. This holds true for all of the transformations and PS methods we discussed in this article.
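The invariance argument can be checked directly: an elementwise deterministic transformation (here, power transformations) of a constant dissimilarity matrix is again constant, so up to overall scale the PS input and hence the configuration artifact does not change. A small illustrative check:

```python
import numpy as np

# Null structure: all off-diagonal dissimilarities equal (regular simplex).
N, delta = 6, 3.0
Delta = np.full((N, N), delta)
np.fill_diagonal(Delta, 0.0)

for lam in (0.5, 1.0, 2.0, 3.5):       # elementwise power transformations
    hat = Delta ** lam                 # same deterministic map for every (i, j)
    off = hat[~np.eye(N, dtype=bool)]  # transformed off-diagonal entries
    # still perfectly equal, so the PS configuration artifact is unchanged
    assert np.allclose(off, off[0])
print("null structure preserved under power transforms")
```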
In the second situation the elements of \(\varvec{\Delta }\) are differentiated and informative for the ground truth structure, but STOPS does not take the correct structure into account. Two mechanisms are relevant here. First, the ground truth structure in the high-dimensional data cannot be recovered with the chosen PS method. Using STOPS may then in general also fail to uncover the structure. This problem can be mitigated within STOPS, however, by including structures that measure how faithful the configuration is to the high-dimensional ground truth and/or the original data, e.g., with c-faithfulness (see Appendix A.7 in the supplementary document). That way it is possible for STOPS to actually improve recovery of the ground truth via the PS method, as STOPS selects transformations that then partially optimize for the ground truth structure.
Second, the PS method specified for STOPS is able to recover the ground truth, but this ground truth is not among the structures specified. Then using the wrong c-structuredness index may lead STOPS to select hyperparameters that artificially distort the mapping towards the c-structuredness that was specified. Using STOPS may then in general induce artificial c-structuredness (because it focuses on the wrong structure), distort or fail to uncover the ground truth structure in the result (because it does not consider the correct structure), recover only parts of the correct structure (if the correct structure shares aspects with the wrong structure), or any combination of these. STOPS has one mechanism that is meant to mitigate this, which is to put relatively high weight on the badness-of-fit part (via \(v_0\)) and relatively little weight on c-structuredness (via \(v_l\)), with the extreme of no weight on c-structuredness (\(v_l=0, \forall l\)). This makes STOPS select hyperparameters that lead to configurations that are still faithful to the \(\hat{\varvec{\Delta }}\) in the sense of mainly minimizing the badness-of-fit over \(\varvec{\theta }\). To gauge the extent of the distortion, one can compare the STOPS result obtained with c-structuredness weights different from 0 to the STOPS result obtained with all \(v_l=0\).
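The role of the weights \(v_0\) and \(v_l\) can be sketched as follows. This is our own hedged illustration in Python, not the stops implementation; the function name and the sign convention (higher index values meaning more c-structuredness, entering negatively since Stoploss is minimized) are assumptions for the sketch.

```python
def stoploss(badness_of_fit, structuredness, v0, v):
    """Scalarized Stoploss sketch: v0 weights badness-of-fit, each v[l]
    weights one c-structuredness index. Indices enter with a negative
    sign here because more structuredness should lower the loss."""
    return v0 * badness_of_fit - sum(vl * sl for vl, sl in zip(v, structuredness))

# With all c-structuredness weights zero (v_l = 0 for all l), only the
# badness-of-fit matters, i.e., plain PS hyperparameter selection:
stoploss(0.151, [0.083, 0.5939], v0=1.0, v=[0.0, 0.0])  # = 0.151
```

Comparing results for \(v_l=0\) with results for nonzero \(v_l\), as suggested above, then amounts to comparing the minimizers of these two variants of the objective.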
8 Discussion
In this article we suggested a framework for hyperparameter selection in proximity scaling models, coined STOPS for STructure Optimized Proximity Scaling. The selection process for the hyperparameters is based on the idea that we want the configuration to show c-structuredness, enabling easier interpretation. The underlying objective function, Stoploss, combines c-structuredness indices and/or a badness-of-fit measure as a function of the PS hyperparameters, and thus allows optimization over their space. We presented a nested, profile-type optimization procedure that solves the PS problem given the hyperparameters by a standard PS algorithm in an inner optimization step and then uses the resulting configurations in the Stoploss to optimize over the hyperparameters in an outer optimization step. We suggested Bayesian optimization via treed Gaussian processes with jumps to limiting linear models as an efficient strategy for generating candidates in the outer optimization. The use of the STOPS framework was illustrated with three examples; in these examples there were structural considerations about the configuration that we used to select hyperparameters to achieve a desired effect.
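The nested, profile-type structure can be sketched in miniature. The following is our own toy illustration in Python, not the stops implementation: the "inner" PS step is reduced to a closed-form optimal scaling of fixed base distances, the c-structuredness index is a stand-in (spread of fitted distances), and the outer optimizer is a grid search where the paper uses Bayesian optimization with TGP-LLM. All names and the specific inner problem are assumptions for the sketch.

```python
def inner_ps(delta, d0, lam):
    """Inner step: for fixed hyperparameter lam, fit the transformed
    dissimilarities delta^lam by an optimal scaling b of base distances
    d0 (closed form), returning fitted distances and normalized stress."""
    dh = [d ** lam for d in delta]
    b = sum(x * y for x, y in zip(dh, d0)) / sum(y * y for y in d0)
    fitted = [b * y for y in d0]
    stress = sum((x - f) ** 2 for x, f in zip(dh, fitted)) / sum(x * x for x in dh)
    return fitted, stress

def structuredness(fitted):
    """Toy stand-in for a c-structuredness index: spread of the fitted
    distances (higher = more differentiated configuration)."""
    m = sum(fitted) / len(fitted)
    return sum((f - m) ** 2 for f in fitted) / len(fitted)

def outer_search(delta, d0, grid, v0=1.0, v1=0.1):
    """Outer step: the configuration is profiled out by inner_ps; minimize
    the scalarized Stoploss over the hyperparameter candidates."""
    best = None
    for lam in grid:
        fitted, stress = inner_ps(delta, d0, lam)
        loss = v0 * stress - v1 * structuredness(fitted)
        if best is None or loss < best[1]:
            best = (lam, loss)
    return best

delta = [1.0, 2.0, 3.0, 4.0]
d0 = [1.0, 1.5, 2.5, 4.5]
lam_best, loss_best = outer_search(delta, d0, [0.5, 1.0, 1.5, 2.0])
```

In the actual framework the grid search is replaced by a surrogate-model-driven candidate generator, which matters because each inner solve (a full PS fit) is expensive.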
The aim of this article was to suggest a general, principled way to select transformation parameters in PS methods based on the structural appearance of the configuration. For the scalarized objective we limited ourselves to optimizing only over the hyperparameters of the PS methods; in principle, and with little adaptation to the framework, it would be possible to treat the scalarization weights as an additional set of hyperparameters and optimize over them as well.
We chose to set up the framework for hyperparameter selection in a flexible, modular fashion. This makes it possible to plug in any type of badness-of-fit PS objective, use whatever optimization is advantageous for the PS objective, combine any number of c-structuredness indices in different ways, and use any global optimization strategy over the hyperparameters. We found Bayesian optimization with TGP-LLM to be well suited as a generally applicable strategy for the latter step, but due to the modular nature of the framework any other metaheuristic or random search can be used. The proposed framework is sufficiently general and modular to allow for conceptual and computational extensions in future research. For example, the framework allows the development of tailored optimization for concrete Stoplosses, it can be incorporated into a larger context of evaluation of PS methods such as France and Akkucuk (2021), and the idea of structure-based hyperparameter selection can be adapted to unsupervised learning methods that do not fit neatly into the presented PS framework, for example t-SNE or UMAP (McInnes et al. 2018).
We discussed structured appearances that we think are of interest to researchers and suggested statistics by which to measure them. We are quick to point out that our list is not meant to be exhaustive and that any scalar statistic that captures a property of the configuration can be incorporated into STOPS. We therefore believe our framework to be applicable even beyond the suggestions made in this paper.
As a hyperparameter tuning method, the STOPS framework is akin to imposing indirect, soft constraints in PS via the choice of hyperparameters, rather than incorporating hard, direct constraints on the configuration, and this is where we see the method located. In cases where strong theory or hard constraints exist, constrained versions of PS may be preferred over the STOPS framework. Since the method works completely unsupervised with respect to a ground truth for structures, the structures that are produced may not be “real” and would need to be validated.
Despite these limitations, we believe the STOPS framework to be a versatile conceptual framework for the task of hyperparameter tuning in ordination, unsupervised learning and dimensionality reduction. It can be utilized for data exploration, visualization and scaling in applications where one seeks configurations that show an interpretable structural quality while also retaining properties of a standard PS variant. This way one no longer has to restrict oneself to the narrow toolbox of standard MDS but can utilize the full world of flexible parametrized PS variants, with the standard versions resulting as specific hyperparameter constellations in the search space; the choice of hyperparameters is no longer ad hoc but principled and reproducible; the obtained hyperparameter values usually lead to the desired structure being appreciably present in the configuration (especially as compared to a standard MDS result), under the condition that the PS model still fits well; and both the fit and the c-structuredness can be quantified with descriptive statistics, thus making results reproducible and comparable between data sets, studies and settings.
Code Availability
An R script to fully reproduce results and figures is available as Supplementary Information.
Data Availability
The data sets used are available online in the R package stops.
Notes
While the setup of a STOPS model is at the discretion of the user, from an application and interpretability view we recommend including at most two c-structuredness indices, otherwise the effect of the indices compensating each other can be hard to predict. More than two can still be interpretable if the indices refer to related aspects; for example, high c-nonmonotonicity and low c-complexity would favor quadratic or circular arrangements.
Not all combinations have been proposed in the literature; conceptually, mixing and matching the badness-of-fit functions with the parametrized transformations for proximities, distances and weights is possible.
In the STOPS context, we always have \(\varvec{X}(\varvec{\theta })\) but for readability and generality, we only write \(\varvec{X}\) in the cstructuredness indices.
Running R 4.0.3 under Linux Mint 19.2 on an Intel Core i5-8350U CPU with 1.7 GHz, this grid search took 0.5 seconds.
Running R 4.0.3 under Linux Mint 19.2 on an Intel Core i5-8350U CPU with 1.7 GHz, optimization with our implementation took 12.5 seconds.
The weight values come from trading off the relative scales of the structure indices in the standard Sammon mapping to be commensurate; it has a badness-of-fit of 0.151, c-clusteredness of 0.083 and c-manifoldness of 0.5939. The scales of the indices are of relative magnitude of roughly 8:1 for c-clusteredness and c-manifoldness and 4:1 for c-manifoldness and fit for the initial Sammon solution. We want to trade these magnitudes off on the same scale and put more weight on c-clusteredness, say, 20 times more than on c-manifoldness, and more weight on c-manifoldness than on fit, say, 10 times. So, weights of 400 for c-clusteredness, 2.5 for c-manifoldness and 1 for stress reflect this.
Due to how BO with TGP-LLM works, this amounts to fitting around 60 PS models each. For our prototype implementation in R 4.1.2 on a PC running Linux Mint 20.1 with an Intel Core i7-6700 with 3.40 GHz, the timings were 751 seconds for Sammon mapping with powers, 2662 seconds for LMDS (with a maximum of \(1\times 10^4\) PS iterations), 6169 seconds for Box–Cox MDS (with a maximum of \(1\times 10^4\) PS iterations) and 106809 seconds for POSTMDS (with a maximum of \(5\times 10^4\) PS iterations). The bottleneck is fitting the PS problem, especially for POSTMDS.
We do not have a theoretical justification for these structures; this is meant as another showcase of STOPS and also as a contrast to the aSTOPS result, to illustrate how different various STOPS models can look for the same data.
This is by no means exclusive to STOPS; it shares this property with constrained forms of MDS (e.g., De Leeuw and Heiser 1980; Mathar 1990; Bronstein et al. 2006) and with many other types of unsupervised learning methods such as t-SNE (van der Maaten and Hinton 2008) (whose emphasis on small-scale structure can distort global structure) or k-means clustering (which always produces k Voronoi clusters in the distance metric used) (Mucherino et al. 2009).
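The weight trade-off arithmetic in the notes above (weights of 400, 2.5 and 1) can be checked numerically. This is our own verification sketch in Python, not code from the paper or the stops package; "effective contribution" here simply means weight times index value for the initial Sammon solution.

```python
# Index values for the initial Sammon solution, from the footnote above.
fit, c_clus, c_mani = 0.151, 0.083, 0.5939
w_clus, w_mani, w_fit = 400, 2.5, 1

# Effective contribution of each term = weight * index value.
eff_clus = w_clus * c_clus   # ~33.2
eff_mani = w_mani * c_mani   # ~1.485
eff_fit = w_fit * fit        # 0.151

# c-clusteredness ends up with roughly 20x the effective weight of
# c-manifoldness, which in turn gets roughly 10x that of fit:
print(round(eff_clus / eff_mani, 1))  # prints 22.4
print(round(eff_mani / eff_fit, 1))   # prints 9.8
```

The realized ratios (about 22:1 and 10:1) match the stated targets of "20 times" and "10 times" up to the rounding of the chosen weights.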
References
Akkucuk, U., Carroll, J.D.: PARAMAP vs. Isomap: a comparison of two nonlinear mapping algorithms. J. Classif. 23(2), 221–254 (2006). https://doi.org/10.1007/s00357-006-0014-2
Alimoglu, F., Alpaydin, E.: Combining multiple representations and classifiers for pen-based handwritten digit recognition. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition, pp. 637–640 (1997). https://doi.org/10.1109/ICDAR.1997.620583
Ankerst, M., Breunig, M.M., Kriegel, H.P., et al.: OPTICS: ordering points to identify the clustering structure. In: ACM SIGMOD International Conference on Management of Data, vol. 28, pp. 49–60. ACM Press, New York (1999). https://doi.org/10.1145/304182.304187
Bergstra, J.S., Bardenet, R., Bengio, Y., et al.: Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems, pp. 2546–2554 (2011). https://proceedings.neurips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf
Borg, I., Groenen, P.J.: Modern Multidimensional Scaling: Theory and Applications, 2nd edn. Springer, New York (2005). https://doi.org/10.1007/0-387-28981-X
Box, G.E., Cox, D.R.: An analysis of transformations. J. Roy. Stat. Soc.: Ser. B (Methodol.) 26(2), 211–243 (1964). https://doi.org/10.1111/j.2517-6161.1964.tb00553.x
Breiman, L., Friedman, J.H.: Estimating optimal transformations for multiple regression and correlation. J. Am. Stat. Assoc. 80(391), 580–598 (1985). https://doi.org/10.2307/2288473
Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proc. Natl. Acad. Sci. 103(5), 1168–1172 (2006). https://doi.org/10.1073/pnas.0508601103
Buja, A., Swayne, D.F.: Visualization methodology for multidimensional scaling. J. Classif. 19(1), 7–43 (2002). https://doi.org/10.1007/s00357-001-0031-0
Buja, A., Swayne, D.F., Littman, M.L., et al.: Data visualization with multidimensional scaling. J. Comput. Graph. Stat. 17(2), 444–472 (2008). https://doi.org/10.1198/106186008X318440
Chen, L., Buja, A.: Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. J. Am. Stat. Assoc. 104(485), 209–219 (2009). https://doi.org/10.1198/jasa.2009.0111
Chen, L., Buja, A.: Stress functions for nonlinear dimension reduction, proximity analysis, and graph drawing. J. Mach. Learn. Res. 14, 1145–1173 (2013). https://jmlr.org/papers/v14/chen13a.html
De Leeuw, J.: Applications of convex analysis to multidimensional scaling. In: Barra, J.R., Brodeau, F., Romier, G., et al. (eds.) Recent Developments in Statistics, pp. 133–145. North Holland Publishing Company, Amsterdam (1977). https://escholarship.org/uc/item/4ps3b5mj
De Leeuw, J., Heiser, W.J.: Multidimensional scaling with restrictions on the configuration. Multivar. Anal. 5(1), 501–522 (1980)
De Leeuw, J.: Minimizing rStress using majorization. Tech. rep., UCLA Statistics Preprint Series (2014). https://rpubs.com/deleeuw/142619
De’ath, G.: Extended dissimilarity: a method of robust estimation of ecological distances from high beta diversity data. Plant Ecol. 144(2), 191–199 (1999). https://doi.org/10.1023/A:1009763730207
Demartines, P., Herault, J.: Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets. IEEE Trans. Neural Netw. 8(1), 148–154 (1997). https://doi.org/10.1109/72.554199
Eberhart, R.C., Kennedy, J.: A new optimizer using particle swarm theory. In: Proceedings of the Sixth International Symposium on Micro Machine and Human Science, vol. 1, pp. 39–43. IEEE Press, Piscataway, NJ (1995). https://doi.org/10.1109/MHS.1995.494215
France, S.L., Akkucuk, U.: A review, framework, and R toolkit for exploring, evaluating, and comparing visualization methods. Vis. Comput. 37, 457–475 (2021). https://doi.org/10.1007/s00371-020-01817-5
France, S., Carroll, D.: Development of an agreement metric based upon the rand index for the evaluation of dimensionality reduction techniques, with applications to mapping customer data. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition, pp. 499–517. Springer, Berlin, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73499-4_38
France, S.L., Carroll, J.D.: Two-way multidimensional scaling: a review. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 41(5), 644–661 (2011). https://doi.org/10.1109/TSMCC.2010.2078502
Gebelein, H.: Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM - J. Appl. Math. Mech. / Zeitschrift für Angewandte Mathematik und Mechanik 21(6), 364–379 (1941). https://doi.org/10.1002/zamm.19410210604
Gramacy, R.B., Lee, H.K.H.: Bayesian treed Gaussian process models with an application to computer modeling. J. Am. Stat. Assoc. 103(483), 1119–1130 (2008). https://doi.org/10.1198/016214508000000689
Groenen, P., De Leeuw, J.: Power-stress for multidimensional scaling. Tech. rep., UCLA, Los Angeles, USA (2010)
Izenman, A.J.: Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, New York (2009). https://doi.org/10.1007/978-0-387-78189-1
Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Global Optim. 13(4), 455–492 (1998). https://doi.org/10.1023/A:1008306431147
Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983). https://doi.org/10.1126/science.220.4598.671
Kruskal, J.B.: Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29(1), 1–27 (1964). https://doi.org/10.1007/BF02289565
Luus, R., Jaakola, T.: Optimization by direct search and systematic reduction of the size of search region. Am. Inst. Chem. Eng. J. (AIChE) 19(4), 760–766 (1973). https://doi.org/10.1002/aic.690190413
Mair, P., Rusch, T., Hornik, K.: The grand old party: a party of values? SpringerPlus 3(697), 1–10 (2014). https://doi.org/10.1186/2193-1801-3-697
Mathar, R.: Multidimensional scaling with constraints on the configuration. J. Multivar. Anal. 33(2), 151–156 (1990). https://doi.org/10.1016/0047-259X(90)90043-H
McGee, V.E.: The multidimensional analysis of ‘elastic’ distances. Br. J. Math. Stat. Psychol. 19(2), 181–196 (1966). https://doi.org/10.1111/j.2044-8317.1966.tb00367.x
McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction (2018). https://doi.org/10.48550/ARXIV.1802.03426
Mockus, J.: Bayesian Approach to Global Optimization: Theory and Applications, Mathematics and its Applications, vol. 37. Springer, Netherlands, Amsterdam (1989). https://doi.org/10.1007/978-94-009-0909-0
Mucherino, A., Papajorgji, P.J., Pardalos, P.M.: Clustering by k-means, pp. 47–82. Springer, New York (2009). https://doi.org/10.1007/978-0-387-88615-2_3
Ramsay, J.O.: Maximum likelihood estimation in multidimensional scaling. Psychometrika 42(2), 241–266 (1977). https://doi.org/10.1007/BF02294052
Ramsay, J.O.: Some statistical approaches to multidimensional scaling data (with discussion). J. Royal Stat. Soc. Series A (General) 145(3), 285–312 (1982). http://www.jstor.org/stable/2981865
Rusch, T., Hornik, K., Mair, P.: Assessing and quantifying clusteredness: the OPTICS Cordillera. J. Comput. Graph. Stat. 27(1), 220–233 (2018). https://doi.org/10.1080/10618600.2017.1349664
Rusch, T., Mair, P., Hornik, K.: Cluster optimized proximity scaling. J. Comput. Graph. Stat. 30(4), 1156–1167 (2021). https://doi.org/10.1080/10618600.2020.1869027
Sammon, J.W.: A nonlinear mapping for data structure analysis. IEEE Trans. Comput. C-18(5), 401–409 (1969). https://doi.org/10.1109/TC.1969.222678
Sarmanov, O.V.: The maximum correlation coefficient (symmetrical case). In: Doklady Akademii Nauk SSSR, pp. 715–718 (1958). http://mi.mathnet.ru/eng/dan23105
Siivola, E., Paleyes, A., González, J., et al.: Good practices for Bayesian optimization of high dimensional structured spaces. Appl. AI Lett. 2(2), e24 (2021). https://doi.org/10.1002/ail2.24
Takane, Y., Young, F., De Leeuw, J.: Nonmetric individual differences multidimensional scaling: an alternating least squares method with optimal scaling features. Psychometrika 42(1), 7–67 (1977). https://doi.org/10.1007/BF02293745
Tenenbaum, J.B., De Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000). https://doi.org/10.1126/science.290.5500.2319
Torgerson, W.S.: Theory and Methods of Scaling. Wiley, New York (1958)
van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(86), 2579–2605 (2008). http://jmlr.org/papers/v9/vandermaaten08a.html
Acknowledgements
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Thomas Rusch. The first draft of the manuscript was written by Thomas Rusch and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding
Open access funding provided by Vienna University of Economics and Business (WU). No funds, grants, or other support was received for this work.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Rusch, T., Mair, P. & Hornik, K. Structure-based hyperparameter selection with Bayesian optimization in multidimensional scaling. Stat Comput 33, 28 (2023). https://doi.org/10.1007/s11222-022-10197-w