1 Introduction

Seriation is an exploratory combinatorial data analysis method that aims at reordering data objects to capture and identify patterns and trends of gradually varying similarities in the data. The general objective of the resulting reordering is to position more similar objects proximately and dissimilar ones further apart. The original motivation for seriation arose in the field of archeology, when Sir Flinders Petrie used sequencing to infer the chronological order of a set of graves based on the artifacts recovered from them (Hodson 1968). The problem of seriation was mathematically formalized by Kendall (1971). Since then, it has been studied and successfully put to practice in several other areas, such as sociology and psychology (Liiv 2010), gene sequencing (Fulkerson and Gross 1965), and bioinformatics (Tsafrir et al. 2005; Tien et al. 2008; Recanati et al. 2017). Seriation can also be used in exploratory data visualization (Havens and Bezdek 2012) as a means for rearranging similarity or dissimilarity matrices, so that global patterns (e.g., the number or tendency of clusters) can be identified. For this purpose, it has been applied to reveal patterns in microarray data (Tien et al. 2008), and to arrange words or documents in text mining based on their co-occurrence statistics (Mavroeidis and Bingham 2010); the latter work also includes the reordering of word-by-document similarity matrices for the purpose of tracking the flow of conversations. A broad overview of different applications and miscellaneous theoretical details of seriation is presented by Liiv (2010) and Hahsler et al. (2008). More recent works include the systematic experimental analysis of seriation methods and measures by Hahsler (2017), mechanisms for comparing and fusing generated orderings by Goulermas et al. (2016), and the introduction of various modeling formulations and solution procedures for robust seriation by Recanati et al. (2018).

Seriation methods employ heuristics or combinatorial optimization procedures in order to identify orderings that maintain object proximities according to their pairwise (dis)similarities. They typically act on a symmetric similarity (dissimilarity) matrix to simultaneously interchange its rows and columns, such that its entries decrease (increase) monotonically while departing from the main diagonal. Formally, given an \(n\times n\) symmetric similarity matrix \({\mathbf {A}}\), the goal of seriation is to find an ideal row and column reordering, such that \(A_{ik} \le \min ( A_{ij},A_{jk})\) for all \(i, j, k\) with \(1\le i \le j \le k \le n\); in other words, to bring it into a Robinsonian form.

One consistent objective for seriation is the p-SUM (Juvan and Mohar 1992), defined as \(\frac{1}{p} \sum _{i,j=1}^n A_{ij} |i - j |^p\), since for all \(p>0\), an optimal ordering that renders any pre-Robinsonian matrix to a Robinsonian one can be found (Laurent and Seminaroti 2015). The p-SUM problem, which was initially introduced in the context of the matrix envelope reduction problem (George and Pothen 1994), describes a class of objective functions that can be modeled as instances of the quadratic assignment problem (QAP) (Burkard et al. 1999), where a Toeplitz Robinsonian dissimilarity matrix is involved to represent positional differences of the objects. Different values of p confer different penalties on similar objects that are far apart in the linear ordering. Various instances of this problem have been studied, with the most widespread being the \(p=2\) case, which is referred to as the 2-SUM problem. In the context of seriation, the 2-SUM objective is known as the inertia criterion when it is applied to dissimilarity values (Hahsler et al. 2008). The 2-SUM objective penalizes the squared difference of the coordinates between similar instances, and can be expressed as a quadratic function of a permutation vector involving a graph Laplacian matrix (the details can be found later in Sect. 3.3).

Another specific case of the p-SUM is the 1-SUM problem, also known as the optimal linear arrangement problem (George and Pothen 1994), which is more difficult to analyze in terms of a spectral approximation and bounds, as it is no longer a quadratic function of the permutation vector. In comparison with the 2-SUM objective function which relies on squared positional differences of the objects, the 1-SUM uses absolute differences. Finally, interesting p-SUM instances for seriation are the cases when \(p<1\), corresponding to quasi \(\ell _p\)-norms, as they are less sensitive to large positional differences and relatively more sensitive to local ordering, and can therefore prioritize local neighborhoods of similar objects.

As a QAP instance, the p-SUM is an NP-hard combinatorial problem with \(\mathcal {O}(n!)\) possible discrete solutions corresponding to permutations (Çela 2013). Therefore, solving such seriation formulations optimally can be impractical when the problem size is large. In the ideal and infrequent case where the data yield a pre-Robinsonian similarity matrix, an optimal solution can be identified in polynomial time (Barnard et al. 1993; Atkins et al. 1998) by sorting the objects according to the corresponding entries of the Fiedler vector (Fiedler 1973), which is the eigenvector of the graph Laplacian associated with its smallest non-zero eigenvalue. However, when the similarity matrix is not pre-Robinsonian, this spectral solution is only guaranteed to approximately minimize the 2-SUM problem. Therefore, alternative approaches for the p-SUM problem are desirable.

There exist various directions for solving QAP problems (Anstreicher 2003; Burkard et al. 1999; Burkard and Çela 1999; Loiola et al. 2007). Examples of exact QAP algorithms include branch-and-bound (Brusco and Stahl 2001), cutting plane methods (Bazaraa and Sherali 1982) and dynamic programming approaches (Christofides and Benavent 1989). As exact methods can only be used for QAP instances of small sizes, suboptimal algorithms and heuristics that maintain good running performance have been very popular. Some of them include improvement methods, such as local search, tabu search (Glover and Laguna 1997), simulation approaches such as simulated annealing, and population-based heuristics such as evolutionary optimization (Mühlenbein 1989). Besides these, there are relaxation-based algorithms in the context of graph matching (Vogelstein et al. 2015; Lyzinski et al. 2016). Particularly for the 2-SUM case, recent works (Fogel et al. 2013; Lim and Wright 2014; Fogel et al. 2015) have shown how the relaxations of the 2-SUM problem can be solved using interior-point methods relying on either matrix- or vector-based formulations. However, these relaxations may yield solutions far from the optimum permutation and there is no guarantee that the nearest permutation will minimize the original objective.

Relaxation methods have mostly been applied to the 2-SUM problem but not the general p-SUM. Our contribution is to propose a set of first-order optimization methods for minimizing certain p-SUM objectives. The methodology combines first-order optimization with graduated non-convexity, which successively transforms the relaxation into a concave problem, so that the final solution is guaranteed to be a permutation. We previously showed (Evangelopoulos et al. 2017) that this approach outperforms other convex relaxation methods for the 2-SUM problem and scales very well with large datasets. Additionally, while previous methods rely on extra ordering information to achieve good performance, our method does not have such a requirement. Here, we extend this work by proposing algorithms for approximately solving the 1-SUM and \(\tfrac{1}{2}\)-SUM objectives. The proposed methodologies are able to scale up to problem sizes unattainable with existing approaches, and, apart from noiseless cases, they outperform the spectral approximation algorithms, which are the most computationally efficient approaches. To the best of our knowledge, this is the first time that highly scalable algorithms for the p-SUM problem with \(p<2\) have been proposed.

The rest of the paper is organized as follows. In Sect. 2, we present recent developments in the field and the current state-of-the-art algorithms. In Sect. 3, we give a detailed description for each of the proposed algorithms, with the different subsections presenting various formulations and optimization-related aspects. Section 4 contains detailed experimental evaluations and comparisons with regard to the performance of the algorithms, while relevant analyses and conclusions are presented in Sect. 5.

2 Relation to existing methods

The most extensively studied instance of the p-SUM problem is the 2-SUM one because it is amenable to a much more convenient algebraic formulation. The most recent approaches approximate the 2-SUM problem via convex relaxations. Specifically, Fogel et al. (2013, 2015) formulate their relaxation over the set of doubly stochastic matrices which is known to be the convex hull of the permutation matrices, while Lim and Wright (2014) use sorting networks to generate a set of linear constraints in order to perform the optimization in terms of the permutahedron (Goemans 2015), which is the convex hull of all permutation vectors. In both cases, interior point methods are used to optimize a regularized version of the 2-SUM problem that can be written as a quadratic program with additional linear constraints. The permutahedron-based method performs better and is considerably faster as it uses an order of \(\mathcal {O}(n \log ^2 n)\) variables and constraints. Furthermore, both approaches can be used to solve a semi-supervised instance of seriation as they both accommodate the use of additional ordering constraints.

Nevertheless, the aforementioned convex relaxation approaches do not outperform spectral ordering unless additional ordering constraints are used. Moreover, they suffer from scalability issues and when the input size increases significantly, even commercial solvers cannot alleviate the need for demanding computational resources. Furthermore, recent work (Vogelstein et al. 2015; Lyzinski et al. 2016) on solving general QAP problems suggests that convex relaxations do not always outperform indefinite formulations. Towards this direction, Lim and Wright (2016a) present a new framework for approximating general QAP problems formulated in terms of sorting networks, and use a continuation procedure (Blake 1983; Rangarajan and Chellappa 1990; Liu and Qiao 2014) that starts by solving a convex relaxation of the problem and then gradually converts it to a concave one, to finally yield a local optimum to the original discrete problem. A similar approach was followed by Zaslavskiy et al. (2009), where instead of employing an objective function with a convex and non-convex component as used in typical continuation methods, the authors follow the solution path of a linear combination of two different relaxations of the initial problem, one convex and one concave, in order to approximately solve it.

Other instances of the p-SUM problem, especially for \(p<1\), have not been studied extensively in the seriation literature. Juvan and Mohar (1992, 1993) were the first to present a theoretical analysis of the minimization of the p-SUM problem for \(p=1,2,\) and \(\infty \) using a spectral method. George and Pothen (1994) investigate the specific cases of 1-SUM and 2-SUM and their close connection to the matrix envelope reduction problem (George and Liu 1981), as the former problem is expressed via the sum of spreads of the non-zero entries in each row, while the latter uses the sum of squared spreads. Most analyses of the different p-SUM instances have employed spectral methods. Such methods were also used by Helmberg et al. (1995) to obtain lower bounds on the bandwidth problem. In this work we present alternative methodologies that enable us to solve approximations of different p-SUM problems more efficiently than other convex relaxations and spectral methods.

3 Proposed methodology

3.1 Preliminaries and basic notations

Let \({\varvec{\pi }}\) denote a permutation vector consisting of the rearrangement of the integers \(1,\ldots ,n\). The set of n! distinct permutations (which for convenience are treated here as vectors) is denoted by \(\mathcal {P}^n\). Each permutation describes the rearrangement of the entries of an n-dimensional vector, with one convention being that the element at position \(\pi _i\) is moved to position i. This transformation can be explicitly represented by an \(n\times n\) matrix \({\varvec{\Pi }}\) from the set of permutation matrices \(\mathcal {M}^n\) with elements defined by

$$\begin{aligned} \varPi _{ij} = \left\{ \begin{array}{l l} 1, &{} \text { if } \pi _i =j,\\ 0, &{} \text { otherwise}. \end{array} \right. \end{aligned}$$
(1)

This also allows \({\varvec{\Pi }}\) to be converted to its corresponding permutation via \({\varvec{\Pi }}{\mathbf {e}}={\varvec{\pi }}\), where \({\mathbf {e}}= (1,2,...,n)^\top \) is the identity permutation.
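As a minimal illustration of this convention (a sketch of ours, not part of the original formulation), the following Python/NumPy snippet builds \({\varvec{\Pi }}\) from a permutation vector and checks that \({\varvec{\Pi }}{\mathbf {e}}={\varvec{\pi }}\); the function name is illustrative and 0-based indexing is used for convenience.

```python
import numpy as np

def perm_matrix(pi):
    """Permutation matrix with Pi[i, j] = 1 iff pi[i] == j (0-based indices)."""
    n = len(pi)
    P = np.zeros((n, n))
    P[np.arange(n), pi] = 1.0
    return P

pi = np.array([2, 0, 3, 1])            # a permutation of 0..3 (0-based analogue of 1..n)
P = perm_matrix(pi)
e = np.arange(4)                       # identity permutation (0-based)
assert np.array_equal(P @ e, pi)       # Pi e recovers the permutation vector
```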

Many combinatorial problems involving the optimal arrangement of objects can be modeled by objective functions parametrized by permutation vectors or matrices. In particular, the aforementioned QAP describes models that are quadratic with respect to a permutation matrix, and can be expressed as

$$\begin{aligned} \mathrm {QAP}({\mathbf {A}},{\mathbf {B}})&\triangleq \mathrm {tr}\left[ {\mathbf {A}}{\varvec{\Pi }}{\mathbf {B}}^\top {\varvec{\Pi }}^\top \right] =\sum _{i,j=1}^n A_{ij} B_{\pi _i\pi _j}, \end{aligned}$$
(2)

where the problem depends on the two parameter matrices \({\mathbf {A}}\) and \({\mathbf {B}}\).

For seriation we are interested in specific QAP instances, where \({\mathbf {A}}\) is a non-negative symmetric data-dependent matrix that encapsulates the pairwise similarities between n objects. \({\mathbf {B}}\) is a Toeplitz Robinsonian dissimilarity matrix with elements \(B_{ij} =\frac{1}{p} |i-j |^p\) for some \(p > 0\). It acts as the seriation template with elements increasing across diagonals while moving away from the main one. In this case, the QAP corresponds to the p-SUM problem (George and Pothen 1994)

$$\begin{aligned} \mathrm {QAP}({\mathbf {A}},{\mathbf {B}}) = \frac{1}{p} \sum _{i,j=1}^n A_{ij} |\pi _i - \pi _j |^p. \end{aligned}$$
(3)

When \({\mathbf {A}}\) is Robinsonian, the identity permutation optimizes the QAP (Laurent and Seminaroti 2015), and if \({\mathbf {A}}\) is pre-Robinsonian, then a solution can be found in polynomial time (Atkins et al. 1998). Different cases for p yield different types of problems. For example, for \(p=1, 2\) and in the limit of \(\infty \), we obtain the 1-SUM or optimal linear arrangement, the 2-SUM, and the bandwidth minimization problem, respectively (this relies on the more conventional problem definition of \(\left( \sum A_{ij} |\pi _i - \pi _j |^p\right) ^\frac{1}{p}\)). Approximate solutions for this problem can be searched for with a variety of QAP approximation methods, including simulated annealing, tabu search, and evolutionary methods (Loiola et al. 2007).
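For concreteness, a small sketch (ours) of evaluating the p-SUM objective of Eq. (3) for a given permutation vector; the function name and the random test data are illustrative only.

```python
import numpy as np

def p_sum(A, pi, p):
    """(1/p) * sum_{i,j} A[i, j] * |pi[i] - pi[j]|**p for a permutation vector pi."""
    diffs = np.abs(pi[:, None] - pi[None, :]).astype(float)
    return (A * diffs**p).sum() / p

rng = np.random.default_rng(0)
A = rng.random((6, 6)); A = (A + A.T) / 2        # symmetric non-negative similarities
pi = rng.permutation(6) + 1                      # a 1-based permutation vector
print(p_sum(A, pi, 2), p_sum(A, pi, 1), p_sum(A, pi, 0.5))   # 2-SUM, 1-SUM, 1/2-SUM
```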

3.2 Problem relaxations

Recent work on the 2-SUM (Fogel et al. 2013; Lim and Wright 2014; Fogel et al. 2015) has considered convex relaxations on the set of permutation matrices and also on permutation vectors. The relaxed feasible sets are the convex hull of permutation matrices which is the Birkhoff polytope, i.e., the set of doubly stochastic matrices \(\mathcal {B}^n\triangleq \{{\mathbf {X}}: {\mathbf {X}}{\mathbf {1}}={\mathbf {X}}^\top {\mathbf {1}}={\mathbf {1}}, X_{ij}\ge 0\}\), and the convex hull of permutation vectors which is the permutahedron (Goemans 2015) denoted as \(\mathcal {PH}^n\). These are directly related by enumerating all contributing permutations; that is, for each \({\mathbf {X}}=\sum _{i=1}^{n!} a_i {\varvec{\Pi }}_i \in \mathcal {B}^n\), we have \({\mathbf {x}}={\mathbf {X}}{\mathbf {e}}=\sum _{i=1}^{n!} a_i {\varvec{\pi }}_i \in \mathcal {PH}^n\), where the ith vertex correspondence between the polytopes is through \({\varvec{\pi }}_i = {\varvec{\Pi }}_i {\mathbf {e}}\), and the coefficients of the convex combination satisfy \(a_i\ge 0\) and \(\sum _{i=1}^{n!} a_i =1\).

For the p-SUM problem, possible relaxations can be expressed as

$$\begin{aligned} \min _{{\mathbf {x}}\in \mathcal {PH}^n} \frac{1}{p} \sum _{i,j} A_{ij} |x_i - x_j |^p, \end{aligned}$$
(4)

or, in matrix form, as

$$\begin{aligned} \min _{{\mathbf {X}}\in \mathcal {B}^n} \mathrm {tr}\left[ {\mathbf {A}}{\mathbf {X}}{\mathbf {B}}^\top {\mathbf {X}}^\top \right] , \end{aligned}$$
(5)

where \(B_{ij}= \frac{1}{p} |i - j |^p\). The first objective function, for \(p\ge 1\) and \(A_{ij}\ge 0\), is convex, since it is a non-negative combination of convex functions \(| \cdot |^p\) applied to the linear functions \(x_i - x_j\), with \(i,j\in \{1,\ldots ,n\}\). The second objective depends on \({\mathbf {A}}\) and \({\mathbf {B}}\), but these can be adjusted in their diagonals before relaxation to become convex. Nonetheless, this convexity is not useful. For example, the constant vector \(\frac{n+1}{2}{\mathbf {1}}\), which lies at the barycenter of the permutahedron, minimizes the relaxed problem in Eq. (4) since all \(x_i-x_j=0\).

In order to find non-trivial solutions further from the barycenter and closer to the vertices, we exploit the fact that the norm of every permutation vector is constant and maximal over the relaxed set, and attempt to maximize the norm of the relaxed solution while simultaneously minimizing the original objective. Using a tradeoff parameter \(\mu >0\), this leads to the following regularized objective

$$\begin{aligned} \min _{{\mathbf {x}}\in \mathcal {PH}^n}\frac{1}{p} \sum _{i,j} A_{ij} |x_i - x_j |^p - \mu \left\| {\mathbf {x}} \right\| _2^2. \end{aligned}$$
(6)

3.3 Regularized 2-SUM relaxation

Due to its quadratic form, the 2-SUM case is amenable to more convenient algebraic manipulations and it has therefore attracted further attention by recent works (Barnard et al. 1993; Atkins et al. 1998; Fogel et al. 2013; Lim and Wright 2014; Fogel et al. 2015). In particular, the associated QAP can be reformulated into an equivalent one parametrized by a rank-1 matrix as

$$\begin{aligned} \text {QAP}({\mathbf {A}},{\mathbf {B}})&=\frac{1}{2} \sum _{i,j=1}^n {A_{ij}(\pi _i^2+\pi _j^2-2\pi _i\pi _j )} \\&=\sum _{i=1}^n \pi _i^2 \sum _{j=1}^n A_{ij} - \sum _{i,j=1}^n \pi _i \pi _j A_{ij} \\&={\varvec{\pi }}^\top \left( {\text {dg}}\left( {\mathbf {A}}{\mathbf {1}}\right) - {\mathbf {A}}\right) {\varvec{\pi }}={\varvec{\pi }}^\top {\mathbf {L}}_{\mathbf {A}}{\varvec{\pi }}\\&=\mathrm {tr}[ {\mathbf {L}}_{\mathbf {A}}{\varvec{\Pi }}{\mathbf {e}}{\mathbf {e}}^\top {\varvec{\Pi }}^\top ] =\text {QAP}({\mathbf {L}}_{\mathbf {A}},{\mathbf {e}}{\mathbf {e}}^\top ), \end{aligned}$$

where \({\text {dg}}\left( {\mathbf {x}}\right) \) returns a diagonal matrix with elements from a vector \({\mathbf {x}}\). The matrix \({\mathbf {L}}_{\mathbf {A}}\triangleq {\text {dg}}\left( {\mathbf {A}}{\mathbf {1}}\right) -{\mathbf {A}}\) is defined to be the graph Laplacian, and is guaranteed to be positive semidefinite for symmetric non-negative \({\mathbf {A}}\), since \(f({\mathbf {x}})\triangleq {\mathbf {x}}^\top {\mathbf {L}}_{\mathbf {A}}{\mathbf {x}}=\frac{1}{2}\sum _{i,j} A_{ij}(x_i-x_j)^2 \ge 0, \forall {\mathbf {x}}\in \mathbb {R}^n\).
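The identity above can be checked numerically; the following sketch (ours, with illustrative names) constructs \({\mathbf {L}}_{\mathbf {A}}\) and verifies that the quadratic form equals the 2-SUM value.

```python
import numpy as np

def graph_laplacian(A):
    """L_A = dg(A 1) - A for a symmetric non-negative similarity matrix A."""
    return np.diag(A.sum(axis=1)) - A

rng = np.random.default_rng(1)
A = rng.random((5, 5)); A = (A + A.T) / 2
L = graph_laplacian(A)
pi = (rng.permutation(5) + 1).astype(float)           # a permutation vector

quad = pi @ L @ pi                                    # pi^T L_A pi
direct = 0.5 * sum(A[i, j] * (pi[i] - pi[j])**2 for i in range(5) for j in range(5))
assert np.isclose(quad, direct)                       # matches the 2-SUM value
```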

The resulting QAP form above is very practical as it can be used in a relaxed version of the 2-SUM, expressed in either of the following forms

$$\begin{aligned} \min _{{\mathbf {x}}\in \mathcal {PH}^n} {\mathbf {x}}^\top {\mathbf {L}}_{\mathbf {A}}{\mathbf {x}}\equiv \min _{ {\mathbf {X}}\in \mathcal {B}^n} {\mathbf {e}}^\top {\mathbf {X}}^\top {\mathbf {L}}_{\mathbf {A}}{\mathbf {X}}{\mathbf {e}}. \end{aligned}$$
(7)

It is clear that the objective function is convex since, in terms of the first form, the Hessian \({\mathbf {L}}_{\mathbf {A}}\) is positive semidefinite. In terms of the matrix form, the objective can be rewritten as \({\text {vec}}\left( {\mathbf {X}}\right) ^\top \!( {\mathbf {e}}{\mathbf {e}}^\top \otimes {\mathbf {L}}_{\mathbf {A}}) {\text {vec}}\left( {\mathbf {X}}\right) \), where \(\otimes \) denotes the Kronecker product, and the Hessian \({\mathbf {e}}{\mathbf {e}}^\top \otimes {\mathbf {L}}_{\mathbf {A}}\) is positive semidefinite. However, the optimal solution to this relaxed formulation is the barycenter \(\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^\top \) of \(\mathcal {B}^n\), since \([{\mathbf {L}}_{\mathbf {A}}{\mathbf {1}}]_i =\sum _k {\mathbf {A}}_{ik}-\sum _j {\mathbf {A}}_{ij} =0\), which gives \({\mathbf {1}}^\top {\mathbf {L}}_{\mathbf {A}}{\mathbf {1}}=\sum _i [{\mathbf {L}}_{\mathbf {A}}{\mathbf {1}}]_i =0\) that corresponds to the minimum of the objective function.

The objective in Eq. (7) can be modified in line with the regularization described in Sect. 3.2 to produce a non-trivial solution. For example, the objective \({\mathbf {x}}^\top {\mathbf {L}}_{\mathbf {A}}{\mathbf {x}}- \mu \left\| {\mathbf {x}} \right\| _2^2\) can be used, but this precludes convexity for any \(\mu >0\). An alternative modification for the 2-SUM minimization problem with a concave regularizer is suggested by Fogel et al. (2013) and Lim and Wright (2014) as

$$\begin{aligned} \min _{{\mathbf {x}}\in \mathcal {PH}^n} \Big \{ f_\mu ({\mathbf {x}}) \triangleq {\mathbf {x}}^\top ({\mathbf {L}}_{\mathbf {A}}\!-\!\mu {\mathbf {H}}) {\mathbf {x}}= {\mathbf {x}}^\top {\mathbf {L}}_{\mathbf {A}}{\mathbf {x}}- \mu \left\| {\mathbf {H}}{\mathbf {x}} \right\| _2^2 \Big \}. \end{aligned}$$
(8)

The use of the above regularizer leaves the sought optimization intact, since by using the constant matrix \({\mathbf {J}}={\mathbf {1}}{\mathbf {1}}^\top \) and the centering one \({\mathbf {H}}={\mathbf {I}}-\frac{1}{n}{\mathbf {J}}\), we have

$$\begin{aligned} \left\| {\mathbf {x}} \right\| _2^2&={\textstyle \left\| ({\mathbf {H}}\!+\!\frac{1}{n}{\mathbf {J}}){\mathbf {x}} \right\| _2^2=\left\| {\mathbf {H}}{\mathbf {x}} \right\| _2^2+ \frac{2}{n} {\mathbf {x}}^\top {\mathbf {H}}{\mathbf {J}}{\mathbf {x}}+ \left\| \frac{1}{n}{\mathbf {J}}{\mathbf {x}} \right\| _2^2} \\&= \left\| {\mathbf {H}}{\mathbf {x}} \right\| _2^2 + \textstyle \frac{n(n+1)^2}{4}, \end{aligned}$$

where we make use of the facts that \({\mathbf {H}}{\mathbf {J}}=\mathbf {0}\), and that for any \({\mathbf {x}}\in \mathcal {PH}^n\), \({\mathbf {J}}{\mathbf {x}}=\frac{n(n+1)}{2} {\mathbf {1}}\). Note that this equivalence between the two regularizers holds independently of the problem relaxation from \(\mathcal {P}^n\) to \(\mathcal {PH}^n\).

The same can also be observed for the matrix formulation, after adding a constant c to one of the QAP matrix parameters. Specifically, the optimization of \(\text {QAP}({\mathbf {A}}+ c {\mathbf {J}},{\mathbf {B}})\) is unaffected (again either before or after the relaxation to \(\mathcal {B}^n\)), as it changes by the constant quantity \(c {\mathbf {1}}^\top {\mathbf {B}}{\mathbf {1}}\). In such a case, replacing \({\mathbf {A}}\) by \({\tilde{{\mathbf {A}}}}= {\mathbf {A}}-\frac{\mu }{n}{\mathbf {J}}\) yields \({\mathbf {L}}_{{\tilde{{\mathbf {A}}}}}={\mathbf {L}}_{\mathbf {A}}-\mu {\mathbf {I}}+\frac{\mu }{n}{\mathbf {J}}={\mathbf {L}}_{\mathbf {A}}-\mu {\mathbf {H}}\). The new matrix may no longer be positive semidefinite (it is not a proper Laplacian matrix as \({\mathbf {A}}-\frac{\mu }{n}{\mathbf {J}}\) may have negative entries) and the resulting minimization is not always convex.

Although the objective in Eq. (8) is generally non-convex because it is the difference of convex functions, convexity can be preserved for values of \(\mu \) that keep \({\mathbf {L}}_{\mathbf {A}}-\mu {\mathbf {H}}\) positive semidefinite. Note that the constant vector is an eigenvector of both \({\mathbf {L}}_{\mathbf {A}}\) and \({\mathbf {H}}\) with an associated eigenvalue \(\lambda _1=0\). Consequently, choosing \(\mu \le \lambda _2({\mathbf {L}}_{\mathbf {A}})\) ensures convexity (henceforth, for the eigenvalues \(\lambda _i\) of a matrix \({\mathbf {X}}\) we assume the ordering \(\lambda _1({\mathbf {X}})\le \cdots \le \lambda _n({\mathbf {X}})\)). Moreover, choosing \(\mu \ge \lambda _n({\mathbf {L}}_{\mathbf {A}})\) ensures that this matrix is negative semidefinite, which renders the objective concave. Therefore, adjusting \(\mu \) from \(\lambda _2({\mathbf {L}}_{\mathbf {A}})\) to \(\lambda _n({\mathbf {L}}_{\mathbf {A}})\) can gradually transform the relaxed 2-SUM problem from a convex, to an indefinite and finally to a concave problem. In general, except for the concave form, the relaxed solutions may lie in the interior of the polytope and far from the set of sought permutations. However, in the concave form, the solution will necessarily lie at the boundaries. We exploit this fact and use a continuation scheme to successively find relaxed solutions moving from the convex to the concave case, which is a common approach for similar problems (Zaslavskiy et al. 2009; Xia 2010; Liu and Qiao 2014).
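In practice, the endpoints of this continuation can be read directly off the spectrum of \({\mathbf {L}}_{\mathbf {A}}\); a minimal sketch of ours, assuming a dense eigen-decomposition is affordable:

```python
import numpy as np

def continuation_range(L):
    """mu <= lambda_2(L_A) keeps L_A - mu*H positive semidefinite (convex start),
    while mu >= lambda_n(L_A) makes it negative semidefinite (concave end)."""
    evals = np.linalg.eigvalsh(L)        # eigenvalues in ascending order
    return evals[1], evals[-1]
```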

3.4 First-order optimization with graduated non-convexity

Given an initial feasible solution \({\mathbf {x}}^{(0)}\) and a current value for \(\mu \), we now show how to solve the relaxed and regularized 2-SUM problem using first-order optimization. In particular, we employ the conditional gradient, also known as the Frank-Wolfe (FW) algorithm (Frank and Wolfe 1956), to ensure that the optimization variable at each iteration remains within the convex hull of \(\mathcal {P}^n\). We note that other first-order methods, such as projected gradient descent (Bertsekas 1995), which over the permutahedron can be equally efficient per iteration (Lim and Wright 2016b), could also be employed. However, FW can produce sparse iterates for certain classes of convex optimization problems, relies on a norm-free (affine-invariant) notion of smoothness, and does not need a projection step (Jaggi 2013; Bubeck 2015). Due to its simplicity we use it throughout this work.

The FW update at iteration \(k+1\) can be written as

$$\begin{aligned} {\mathbf {x}}^{(k+1)} = \alpha {\mathbf {x}}^\star + (1-\alpha ) {\mathbf {x}}^{(k)}, \end{aligned}$$
(9)

where

$$\begin{aligned} {\mathbf {x}}^\star = \mathop {\hbox {arg min}}\limits _{{\mathbf {x}}\in \mathcal {PH}^n} \; \langle \nabla f_\mu ({\mathbf {x}}^{(k)}) , {\mathbf {x}}\rangle , \end{aligned}$$
(10)

and \(\alpha \in [0,1]\) is the step size. The gradient descent direction is based on optimizing a linearization of the objective function \(f_\mu \) in Eq. (8) over the constraint set, given by

$$\begin{aligned} {\tilde{f}}_\mu ({\mathbf {x}})= f_\mu ({\mathbf {x}}^{(k)}) + \langle \nabla f_\mu ({\mathbf {x}}^{(k)}), {\mathbf {x}}- {\mathbf {x}}^{(k)} \rangle , \end{aligned}$$
(11)

where \(\frac{1}{2} \nabla f_\mu ({\mathbf {x}}^{(k)}) = {\mathbf {L}}_{\mathbf {A}}{\mathbf {x}}^{(k)} - \mu {\mathbf {H}}{\mathbf {x}}^{(k)}\). The solution \({\mathbf {x}}^\star = {\hbox {arg min}}_{{\mathbf {x}}\in \mathcal {PH}^n} {\tilde{f}}_\mu ({\mathbf {x}})\) is necessarily a permutation, since a bounded linear program is optimized at a vertex of the constraint set. To calculate it, we use Hardy–Littlewood–Pólya’s rearrangement theorem (Hardy et al. 1952), that states that two vectors \(\varvec{a}\) and \(\varvec{b}\) assume the minimum shuffled inner product when sorted in opposite orders. This happens, for example, when the permutations \({\varvec{\pi }}\) and \({\varvec{\tau }}\) order two given vectors \(\varvec{a}\) and \(\varvec{b}\) descending and ascending, respectively, or equivalently when \({\varvec{\tau }}({\varvec{\pi }}^{-1})\) reorders \(\varvec{a}\) while \(\varvec{b}\) is kept in its original order. In this situation, by setting \(\varvec{a}=\nabla f_\mu ({\mathbf {x}}^{(k)})\) and \(\varvec{b}={\mathbf {e}}\), we obtain the permutation \({\mathbf {x}}^\star = {\hbox {arg min}}_{{\mathbf {x}}\in \mathcal {P}^n} \langle \varvec{a}, {\mathbf {x}}\rangle = {\varvec{\pi }}^{-1}\) (or in permutation matrix format \({\hbox {arg min}}_{{\varvec{\Pi }}\in \mathcal {M}^n} \langle {\mathbf {e}}, {\varvec{\Pi }}^\top \varvec{a} \rangle \)) whose inverse (\({\varvec{\pi }}\)) sorts the gradient descending.
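In code, this linear minimization oracle amounts to a single sort of the gradient; the sketch below (ours, with an illustrative function name) assigns position 1 to the largest gradient entry, position 2 to the next, and so on.

```python
import numpy as np

def permutahedron_lmo(grad):
    """argmin_{x in PH^n} <grad, x>: by the rearrangement theorem, the minimizer
    places the values 1..n opposite to the descending order of the gradient."""
    n = len(grad)
    x = np.empty(n)
    x[np.argsort(-grad)] = np.arange(1, n + 1)   # largest gradient entry gets position 1
    return x
```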

Given \({\mathbf {x}}^\star \), the optimal step size \(\alpha \) can then be easily computed in closed form, as \(f_\mu (\alpha {\mathbf {x}}^\star +(1-\alpha ) {\mathbf {x}}^{(k)} ) \) is quadratic in \(\alpha \). Since its second and first order coefficients are respectively \(\gamma _2 =({\mathbf {x}}^\star - {\mathbf {x}}^{(k)})^\top (\mathbf {L_A} - \mu {\mathbf {H}}) ({\mathbf {x}}^\star - {\mathbf {x}}^{(k)}) = f_\mu ({\mathbf {x}}^\star - {\mathbf {x}}^{(k)})\) and \(\gamma _1 = \langle \nabla f_\mu ({\mathbf {x}}^{(k)}), {\mathbf {x}}^\star - {\mathbf {x}}^{(k)} \rangle \), the optimizing step within [0, 1] is (note that \(\gamma _1 \le 0\) always holds, since \({\mathbf {x}}^\star \) minimizes the linearization, and under convexity \(\gamma _1 \le f_\mu ({\mathbf {x}}^\star )-f_\mu ({\mathbf {x}}^{(k)})\))

$$\begin{aligned} \alpha = \left\{ \begin{array}{l l} \text{ min }\left( \frac{-\gamma _1}{2\gamma _2},1\right) , &{} \text{ if } \gamma _2 > 0, \\ 0, &{} \text{ if } \gamma _2 \le 0 \; \wedge \; f_\mu ({\mathbf {x}}^\star ) \ge f_\mu ({\mathbf {x}}^{(k)}), \\ 1, &{} \text{ if } \gamma _2 \le 0 \; \wedge \; f_\mu ({\mathbf {x}}^\star ) < f_\mu ({\mathbf {x}}^{(k)}). \end{array} \right. \end{aligned}$$
(12)

As previously mentioned, to solve the problem in Eq. (7) we use a continuation scheme that starts from a solution to a convex instance of the problem in Eq. (8). In each iteration we increase \(\mu \) by multiplying it with a user-defined parameter \(\gamma > 1\) and solve the new problem until the solution becomes discrete, which is guaranteed in the concave case. This graduated non-convexity approach (Blake 1983; Rangarajan and Chellappa 1990) yields a sequence of relaxed solutions that ultimately lead to a local optimum of the original discrete problem. The procedure can be started at a permutation or any point around the barycenter. However, we have experimentally observed that starting from the ordering of the Fiedler vector frequently leads to better solutions in terms of 2-SUM value, and therefore we use that as a starting point (the continuation scheme almost always converges to a different solution except for pre-Robinsonian cases). We note here that calculating this ordering does not require an extra initial eigen-decomposition, since in our setting this is already performed in order to determine the initial parameter \(\mu _0\). The method converges when \(\alpha \) reaches near-zero values. Algorithm 1, referred to as Graduated non-Convexity Relaxation (GnCR), summarizes the main steps of this vector-based graduated non-convexity approach to solve the relaxed regularized 2-SUM problem.

Algorithm 1 Graduated non-Convexity Relaxation (GnCR)
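Since the pseudocode figure is not reproduced here, the following Python/NumPy sketch outlines our reading of the GnCR loop; it is illustrative rather than the authors' MATLAB implementation, and the tolerance, iteration cap and safeguards are assumptions.

```python
import numpy as np

def gncr(A, gamma=1.05, tol=1e-8, max_inner=200):
    """Sketch of Algorithm 1: Frank-Wolfe with graduated non-convexity for the
    relaxed, regularized 2-SUM objective f_mu of Eq. (8)."""
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A                       # graph Laplacian L_A
    H = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    evals, evecs = np.linalg.eigh(L)
    mu, mu_max = max(evals[1], 1e-8), evals[-1]          # convex start, concave end
    x = np.argsort(np.argsort(evecs[:, 1])) + 1.0        # Fiedler-vector ordering as start
    while mu < mu_max * gamma:                           # last pass is concave (mu >= lambda_n)
        for _ in range(max_inner):
            g = 2 * (L @ x - mu * (H @ x))               # gradient of f_mu
            x_star = np.empty(n)                         # LMO: sort gradient descending
            x_star[np.argsort(-g)] = np.arange(1, n + 1)
            d = x_star - x
            g1, g2 = g @ d, d @ (L @ d) - mu * (d @ (H @ d))   # Eq. (12) coefficients
            if g2 > 0:
                alpha = min(-g1 / (2 * g2), 1.0)
            else:
                f_new = x_star @ (L @ x_star) - mu * (x_star @ (H @ x_star))
                f_old = x @ (L @ x) - mu * (x @ (H @ x))
                alpha = 1.0 if f_new < f_old else 0.0
            if alpha < tol:                              # converged for this mu
                break
            x = alpha * x_star + (1 - alpha) * x
        mu *= gamma                                      # graduated non-convexity step
    return np.argsort(np.argsort(x)) + 1                 # read off the final permutation
```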

Computationally, the proposed method is highly efficient, since each update only requires a single matrix-vector multiplication to compute the gradient vector (where any sparsity and/or low-rank structure of \({\mathbf {A}}\) can be exploited) and the sorting of the gradient vector, which has complexity \(\mathcal {O}(n\log n)\). For example, if \({\mathbf {A}}={\mathbf {M}}{\mathbf {M}}^\top \) where \({\mathbf {M}}\) is a sparse matrix with Tn non-zero entries, then the time complexity of each gradient computation \(\frac{1}{2}\nabla f_\mu ({\mathbf {x}})= \mathbf {D} {\mathbf {x}}-{\mathbf {M}}({\mathbf {M}}^\top {\mathbf {x}})-\mu {\mathbf {H}}{\mathbf {x}}\), where \(\mathbf {D}={\text {dg}}\left( {\mathbf {M}}({\mathbf {M}}^\top {\mathbf {1}})\right) \), is \(\mathcal {O}(Tn)\) due to the sparse matrix with vector multiplication. Likewise, the function evaluation can be calculated as \(f_\mu ({\mathbf {x}})= \frac{1}{2} \langle \nabla f_\mu ({\mathbf {x}}), {\mathbf {x}}\rangle \).
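As a sketch of the sparse case described above (ours; it assumes \({\mathbf {A}}={\mathbf {M}}{\mathbf {M}}^\top \) with \({\mathbf {M}}\) stored as a SciPy sparse matrix), the gradient can be formed without ever materializing \({\mathbf {A}}\) or \({\mathbf {H}}\):

```python
import numpy as np
import scipy.sparse as sp

def grad_half_sparse(M, x, mu):
    """(1/2) grad f_mu(x) = D x - M (M^T x) - mu H x, where A = M M^T and
    D = dg(M (M^T 1)); costs O(T n) for a sparse M with T n non-zeros."""
    n = M.shape[0]
    d = np.asarray(M @ (M.T @ np.ones(n))).ravel()   # diagonal of dg(A 1)
    Hx = x - x.mean()                                # H x without forming H
    return d * x - np.asarray(M @ (M.T @ x)).ravel() - mu * Hx

# example: a random sparse data matrix with roughly 5 non-zeros per row
M = sp.random(1000, 200, density=0.025, format="csr", random_state=0)
g = grad_half_sparse(M, np.random.default_rng(0).permutation(1000) + 1.0, mu=0.1)
```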

As far as convergence is concerned, a rate of \(\mathcal {O}\left( \frac{1}{\sqrt{t}}\right) \) (where t is the number of iterations) for non-convex objectives is known for the FW method (Lacoste-Julien 2016), which applies here since the objective is not necessarily convex across the varying values of \(\mu \). Specifically, it is shown that the minimal FW gap is upper bounded by the quantity \(\frac{\max \{2h_0,C_{f_\mu }\}}{\sqrt{t+1}}\), for an objective \(f_\mu \) as defined in Eq. (8). The quantity \(h_0= f_\mu ({\mathbf {x}}_0) - \min _{{\mathbf {x}}\in \mathcal {PH}^n} f_\mu ({\mathbf {x}})\) is the initial global suboptimality, and \(C_{f_\mu }\) the related curvature constant defined over \(f_\mu \). Due to the regularization the latter becomes

$$\begin{aligned} C_{f_\mu }= \sup \limits _{ \begin{array}{c} {\mathbf {x}}, {\mathbf {s}}\in \mathcal {PH}^n \!,\, \alpha \in (0,1], \\ {\mathbf {y}}={\mathbf {x}}+\alpha ({\mathbf {s}}-{\mathbf {x}}) \end{array}} \tfrac{2}{\alpha ^2} \mathcal {D}_f({\mathbf {y}},{\mathbf {x}}) - \tfrac{2\mu }{\alpha ^2} \left\| {\mathbf {H}}({\mathbf {x}}-{\mathbf {y}}) \right\| _2^2 \le C_f, \end{aligned}$$

since \(\mathcal {D}_f({\mathbf {y}},{\mathbf {x}}) - \mu \left\| {\mathbf {H}}({\mathbf {x}}-{\mathbf {y}}) \right\| _2^2 \le \mathcal {D}_f({\mathbf {y}},{\mathbf {x}})\), where \(\mathcal {D}_f\) is the Bregman distance over f. We note that for adaptive FW variants, such as the away-step, pairwise, and fully corrective FW, linear and sublinear convergence rates have been shown for strongly convex and convex problems, respectively (Lacoste-Julien and Jaggi 2015). In our case, however, experiments showed that such variants yield negligible benefit in solution quality, and can sometimes even increase the overall running time (e.g., each step of the fully corrective FW has significant computational demands, as a quadratic optimization is realized over the polytope defined by an active set of permutations).

3.5 A smoothed regularized relaxation for the 1-SUM

We now consider the 1-SUM or optimal linear arrangement problem (George and Pothen 1994), which is harder to analyze as it is no longer a quadratic function of the permutation vector. Although it is a convex function, a regularized form such as that of Eq. (8) cannot be employed here, since \(\mu >0\) cannot control the convexity of the formulation. Additionally, the non-smoothness of this problem, resulting from the absolute terms in \(\sum _{i,j=1}^n A_{ij} |x_i - x_j |\), prevents the use of a gradient approach (subgradient methods may not be suitable for the regularized formulation that assumes non-convex forms). Therefore, we propose a smooth approximation of the 1-SUM problem in order to enable us to utilize the continuation scheme of Sect. 3.4.

We employ a pseudo-Huber function (Fountoulakis and Gondzio 2016) of the form

$$\begin{aligned} \psi _\delta (x) = \sqrt{\delta ^2 + x^2} - \delta , \end{aligned}$$
(13)

which has bounded and Lipschitz continuous first and second derivatives. Other formulations of the pseudo-Huber functions were previously used in Hartley and Zisserman (2004) and González-Recio and Forni (2011). Figure 1 sketches \(\psi _\delta (x)\) for different values of the parameter \(\delta >0\). This form is a smooth approximation of the Huber loss penalty function (Huber 1992), and approximates |x| as \(\delta \) approaches zero. Unlike the Huber loss function, which is only first-order differentiable, the pseudo-Huber function is second-order differentiable, a fact essential to the convexity analysis of the continuation process, as shown later in this section. The first two derivatives of the pseudo-Huber function are

$$\begin{aligned} \psi _\delta ^{'}(x)&= \frac{x}{\sqrt{\delta ^2+x^2}} = \frac{x}{\psi _\delta (x)+\delta } , \end{aligned}$$
(14)
$$\begin{aligned} \psi _\delta ^{''}(x)&= \frac{\delta ^2}{(\delta ^2+x^2)^\frac{3}{2}} = \frac{\delta ^2}{(\psi _\delta (x)+\delta )^3}, \end{aligned}$$
(15)

and as \(\psi _\delta ^{''}(x) >0\), it is a strictly convex function.

Fig. 1 Plots of the pseudo-Huber function \(\psi _\delta (x)\) scaled within [0, 100], for different parameter values \(\delta \)

By using the pseudo-Huber loss, we can formulate a smooth approximation of the 1-SUM problem of Eq. (4) for \(p=1\). The new objective is defined as

$$\begin{aligned} \phi _\delta ({\mathbf {x}}) \triangleq \sum _{i,j=1}^n A_{ij} \psi _\delta (x_i - x_j), \end{aligned}$$
(16)

and is also convex for non-negative \(A_{ij}\), as a non-negative combination of convex functions applied to the linear functions \(x_i-x_j\) (this can also be shown from the Hessian of \(\phi _\delta ({\mathbf {x}})\) being diagonally dominant).

The first and second order derivatives of \(\phi _\delta ({\mathbf {x}})\) (for symmetric \({\mathbf {A}}\)) assume the simple-to-calculate forms of

$$\begin{aligned} \frac{\partial \phi _\delta ({\mathbf {x}})}{\partial x_i} = 2 \sum _{k=1}^n A_{ik} \frac{(x_i-x_k)}{\sqrt{ \delta ^2 + (x_i-x_k)^2 } }= 2 \sum _{k=1}^n A_{ik} \psi _\delta ^{'}(x_i-x_k), \end{aligned}$$
(17)

and

$$\begin{aligned} \frac{\partial ^2 \phi _\delta ({\mathbf {x}})}{\partial x_i \partial x_j} = \left\{ \begin{array}{l l l} -&{}2 A_{ij}~\psi ^{''}_\delta (x_i - x_j), &{} \text{ if } i \ne j, \\ &{}2 \sum \limits _{\begin{array}{c} k=1 \\ k\ne i \end{array}}^n A_{ik}~\psi ^{''}_\delta (x_i - x_k), &{} \text{ if } i=j. \\ \end{array} \right. \end{aligned}$$
(18)

Since the minimization of \(\phi _\delta ({\mathbf {x}})\) leads to the trivial barycenter solution, and in order to apply a continuation scheme, we instead solve the regularized form, defined as

$$\begin{aligned} \min _{{\mathbf {x}}\in \mathcal {PH}^n} \Big \{ \phi _{\delta ,\mu }({\mathbf {x}}) \triangleq \phi _\delta ({\mathbf {x}}) - \mu \left\| {\mathbf {H}}{\mathbf {x}} \right\| _2^2 \Big \}. \end{aligned}$$
(19)

It can be observed from Eq. (18) that the Hessian \(\nabla ^2\phi _{\delta }({\mathbf {x}})\) is equal to the Laplacian \({\mathbf {L}}_{{\mathbf {G}}}= {\text {dg}}\left( {\mathbf {G}}{\mathbf {1}}\right) -{\mathbf {G}}\), where \({\mathbf {G}}\) is a hollow matrix whose off-diagonal elements are the negated mixed partials, and is therefore centered. This allows us to apply a continuation scheme following the same reasoning as in Sect. 3.3. In particular, setting an initial value \(\mu \le \lambda _2(\nabla ^2\phi _{\delta }({\mathbf {x}}))\) enables us to start from a convex instance of the objective \(\phi _{\delta ,\mu }({\mathbf {x}})\), and by gradually increasing \(\mu \) we can eventually convert it into a concave one. As in the GnCR algorithm, during each iteration of the continuation process the FW method is used, but here the step size is estimated with a golden section search (Bertsekas 1995). Unlike GnCR, we do not use the ordering of the Fiedler vector as a starting point for the continuation procedure, since the spectral solution approximates the 2-SUM problem and not the 1-SUM. Initial experimentation showed that, depending on the similarity matrix (for instance, when it is close to pre-Robinsonian), such an initialization could help, but the gain was too small to offset the extra computation. For this method, we start from around the barycenter; specifically, from the midpoint between the barycenter \(\frac{n+1}{2}{\mathbf {1}}\) and \({\mathbf {e}}\). Experimental tests on the sensitivity of the algorithm to the \(\delta \) parameter reveal that a sufficiently small \(\delta \) ensuring good performance can be found within \(\left[ \frac{n}{50},\frac{n}{10}\right] \). However, very small choices of \(\delta \) have been shown to result in ill-conditioning, something also verified by Fountoulakis and Gondzio (2016). The parameter \(\delta \) can be chosen via a grid search over the interval \(\left[ \frac{n}{50},\frac{n}{10}\right] \), performed in parallel, or simply set initially by the user. We refer to this “Huberized” 1-SUM algorithm as H-GnCR.
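For reference, a small sketch (ours) of the building blocks of H-GnCR: the pseudo-Huber function of Eq. (13), the smoothed objective of Eq. (16), and its gradient of Eq. (17); the function names are illustrative and a symmetric non-negative \({\mathbf {A}}\) is assumed.

```python
import numpy as np

def pseudo_huber(x, delta):
    """psi_delta(x) = sqrt(delta^2 + x^2) - delta (Eq. 13)."""
    return np.sqrt(delta**2 + x**2) - delta

def phi_delta(A, x, delta):
    """Smoothed 1-SUM objective of Eq. (16)."""
    D = x[:, None] - x[None, :]
    return (A * pseudo_huber(D, delta)).sum()

def grad_phi_delta(A, x, delta):
    """Gradient of Eq. (17): entry i equals 2 * sum_k A[i, k] * psi'(x_i - x_k)."""
    D = x[:, None] - x[None, :]
    return 2 * (A * (D / np.sqrt(delta**2 + D**2))).sum(axis=1)
```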

3.6 A kernel annealing approach for the quasi p-SUM

Depending on the employed objective function, seriation can focus on the global or more localized aspects of ordering (Earle and Hurley 2015; Hahsler 2017). Emphasis on the local ordering corresponds to prioritizing neighborhoods of similar objects as opposed to the global ordering that additionally separates dissimilar objects. After having investigated the p-SUM objective \(\frac{1}{p} \sum _{i,j} A_{ij} |x_i - x_j |^p\) for \(p=1,2\), we now consider the case of \(p<1\). The motivation is that the optimization becomes more sensitive to small differences \(|x_i-x_j |\) than in the \(p\ge 1\) case, which encourages more local object placements. Figure 2 exemplifies the effects of localized ordering for three p-SUM cases on a toy dataset.

Fig. 2 Seriated points of the Double moons dataset, for the 2-SUM (left), 1-SUM (center), and \(\tfrac{1}{2}\)-SUM (right). The order is implied by the lines connecting the points consecutively. The rightmost sequence follows the local ordering better, as it avoids moving back and forth between the two moons

One difficulty with the \(p<1\) case is that the objective is non-convex and non-smooth, which prevents the application of the proposed continuation-based optimization scheme. As an alternative, we use an approximation through a series of indefinite functions. In particular, we use the Cauchy distribution-based kernel (Basak 2008), defined as \(K_\sigma (x-y) = \frac{1}{1+ \frac{(x - y)^2}{\sigma ^2}}\), and we approximate the term \(|x_i - x_j |^p\) with the function

$$\begin{aligned} \xi _\sigma (x)=1 - K_\sigma (x)=\frac{x^2}{\sigma ^2+x^2}. \end{aligned}$$
(20)

The scale parameter \(\sigma \) can be used to approximate the effects of the penalty contributions for cases of \(p<1\). Figure 3 presents some plots to demonstrate the behavior of \(\xi _\sigma \) for various values of \(\sigma \).

Fig. 3 Plots of the function \(\xi _\sigma \) (dotted lines) and \(|x|^p\) (solid), for different values of the parameters \(\sigma \) and p. Both functions are scaled within [0, 1]. For larger \(\sigma \), the former can locally approximate \(x^2\), but for smaller kernel sizes it behaves more similarly to \(|x|^p\) for \(p<1\)

Unlike the pseudo-Huber function, this kernel-based smoothing function is not convex. The first and second derivatives are

$$\begin{aligned} \xi _\sigma ^{'}(x)&= \frac{2\sigma ^2 x}{(\sigma ^2 +x^2)^2}, \end{aligned}$$
(21)
$$\begin{aligned} \xi _\sigma ^{''}(x)&= \frac{2\sigma ^4 - 6\sigma ^2 x^2}{(\sigma ^2+x^2)^3}, \end{aligned}$$
(22)

and the sign of \(\xi _\sigma ^{''}\) depends on the input and the positive scale parameter \(\sigma \); specifically, it is non-negative when \(\sigma \ge |x |\sqrt{3}\).

Substituting \(\xi _\sigma \) in the objective of Eq. (4), gives

$$\begin{aligned} \varphi _{\sigma }({\mathbf {x}}) \triangleq \sum _{i,j=1}^n A_{ij} \xi _\sigma (x_i-x_j). \end{aligned}$$
(23)

It can be seen that in order for \(\xi _\sigma (x_i-x_j)\) to be convex when \(x_i\) and \(x_j\) are components of \({\mathbf {x}}\in \mathcal {PH}^n\), we need \(\sigma \ge (n-1)\sqrt{3}\). Another observation is that if we restrict \(\xi _\sigma (x)\) to \([1-n,n-1]\) and scale accordingly, then we have \(\lim _{\sigma \rightarrow \infty } \frac{\xi _\sigma (x)}{\xi _\sigma (n-1)} = \frac{x^2}{(n-1)^2}\). This shows that for large \(\sigma \), Eq. (23) approximates the 2-SUM problem, as normalizing \(\xi _\sigma (x)\) by \(\frac{1}{\xi _\sigma (n-1)}\) and \(x^2\) by \(\frac{1}{(n-1)^2}\) does not affect the optimization.

Since the Hessian \(\nabla ^2\varphi _{\sigma }({\mathbf {x}})\) is written in a form similar to Eq. (18), the regularized form can be given similarly to that of Sect. 3.5. That is

$$\begin{aligned} \min _{{\mathbf {x}}\in \mathcal {PH}^n} \Big \{ \varphi _{\sigma ,\mu }({\mathbf {x}}) \triangleq \varphi _\sigma ({\mathbf {x}}) - \mu \left\| {\mathbf {H}}{\mathbf {x}} \right\| _2^2 \Big \}, \end{aligned}$$
(24)

where \(\varphi _{\sigma ,\mu }({\mathbf {x}})\) is convex for \(\sigma \ge (n-1)\sqrt{3}\) and \(\mu \le \lambda _2(\nabla ^2 \varphi _\sigma ({\mathbf {x}}))\).

Although any value \(p<1\) can potentially be useful for recovering the local order, here we focus on the \(\frac{1}{2}\)-SUM objective, which experimentally appeared to be more sensitive in capturing local structure within the proposed setup. We follow a heuristic annealing of the scale parameter \(\sigma \), whose value is gradually decreased. In each step, a continuation scheme is realized with an increasing \(\mu \) until the problem becomes concave (based on empirical observations, we only need to ensure that the setup of \(\varphi _{\sigma ,\mu }({\mathbf {x}})\) is convex for the initial and largest \(\sigma \) value, while the remaining steps may start from an indefinite one). For experimentation, we let \(\sigma \) vary within the interval \(\left[ \frac{n}{5}, 4n\right] \) in order for \(\xi _\sigma \) to capture various profiles of \(|x|^p\). Each solution obtained from a \(\sigma \) step is recorded and used to initialize the subsequent step, but from a shifted location to avoid solution stagnation. We finally report, amongst the recorded solutions, the one that minimizes the \(\frac{1}{2}\)-SUM value. However, the method can be used independently of the p-SUM formulation to suit a given application. For example, one can instead seek the solution that minimizes the \(\delta _{count}\) measure from Sect. 4.2 or any other measure that captures local order. It has to be noted that although \(\xi _\sigma (x)\) is, as shown in Fig. 3, only a rough approximation of \(|x|^{\frac{1}{2}}\), when used in an annealing scheme of the \(\sigma \) parameter with restarts, it results in good solutions in terms of the \(\frac{1}{2}\)-SUM value. We refer to this heuristic approximation as C-GnCR, and we summarize its main steps in Algorithm 2.

Algorithm 2 (C-GnCR)
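As the pseudocode figure is not reproduced here, the following sketch gives our high-level reading of the C-GnCR loop: the \(\sigma \) schedule, the near-barycenter start, the shifted restarts and the selection by \(\frac{1}{2}\)-SUM value follow the description above, while the inner solver `fw_continuation` (assumed to minimize the regularized objective of Eq. (24) by FW with increasing \(\mu \)) and the restart noise level are illustrative assumptions.

```python
import numpy as np

def half_sum(A, x):
    """Quasi 1/2-SUM value, (1/p) sum A_ij |x_i - x_j|^p with p = 1/2."""
    return 2.0 * (A * np.sqrt(np.abs(x[:, None] - x[None, :]))).sum()

def c_gncr(A, fw_continuation, n_steps=10, rng=None):
    """Sketch of Algorithm 2: anneal sigma from 4n down to n/5 and keep the
    recorded solution with the smallest 1/2-SUM value."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    sigmas = np.geomspace(4 * n, n / 5, n_steps)                # annealed kernel sizes
    x = 0.5 * ((n + 1) / 2 * np.ones(n) + np.arange(1, n + 1))  # near-barycenter start
    best_x, best_val = None, np.inf
    for sigma in sigmas:
        x = fw_continuation(A, sigma, x)            # continuation on phi_{sigma, mu}
        val = half_sum(A, x)
        if val < best_val:
            best_x, best_val = x.copy(), val
        x = x + rng.normal(scale=0.01 * n, size=n)  # shifted restart to avoid stagnation
    return best_x
```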

The recent work of Recanati et al. (2018) on robust seriation uses a formulation that controls error contributions to reduce sensitivity to outliers. In this respect, this can be an additional motivation for using the Cauchy-based kernel here, as for small \(\sigma \) values it has a similar limiting effect. We note that we also tested other approximation functions, such as the Gaussian (the Laplacian and the log-kernel are not applicable since both are non-smooth functions), but the Cauchy-based one shows the best overall performance when used in the proposed annealing process (see Table 8). Nonetheless, the choice of the approximating function may depend on the given problem.

4 Experimental results

We present a series of experiments in order to compare the proposed algorithms with other relevant methods in terms of both utility and scalability. Section 4.1 presents experimental results from comparisons with state-of-the-art algorithms for seriation and various heuristics that approximately solve the QAP. We use several datasets with different characteristics ranging from synthetic to real. Section 4.2 contains a detailed comparison among the different p-SUM algorithms, and highlights the utility of each one in sequencing problems using interpretable supervised measures. Finally, in Sect. 4.3 we test the scalability of the algorithms, and in Sect. 4.4 we test their performance on image seriation problems.

4.1 Benchmark evaluation

In this section we experiment with the following methods:

  • GnCR the graduated non-convexity 2-SUM relaxation in Algorithm 1.

  • H-GnCR the 1-SUM method relying on the pseudo-Huber approximation.

  • C-GnCR the annealing-based quasi \(\tfrac{1}{2}\)-SUM method in Algorithm 2.

  • Spectral\(_\text{ A }\) the spectral method (Barnard et al. 1993) that sorts the entries of the Fiedler vector of the unnormalized Laplacian.

  • Spectral\(_\text{ B }\) the spectral method (Ding and He 2004) that sorts the entries of the Fiedler vector of the normalized Laplacian.

  • vRCR (Vector-regularized convex 2-SUM relaxation) minimizes problem (8) using an interior point solver (we only use the tie-breaking constraint). Its implementation was provided to us by the authors (Lim and Wright 2014).

  • vRCR\(_2\) variant of vRCR that minimizes problem (8) using FW on the permutahedron with the tie-breaking constraint (Lim and Wright 2014); also used to solve problems (19) and (24).

  • FAQ the fast approximate QAP method (Vogelstein et al. 2015), based on the relaxation on the Birkhoff polytope and the Frank–Wolfe method.

  • SA a simulated annealing-based optimizer (Brusco and Stahl 2000).

We note that other population-based heuristics (Kennedy and Eberhart 1995; Yang 2008) were also tried, but they were found to perform worse than SA and were therefore not included in our results. Each algorithm is implemented in MATLAB ver.9.3. For timing comparisons we use a 2.93 GHz 12-Core Intel Xeon desktop with 16 GB of memory. Typical parameters are \(\gamma = 1.05\) and \(\mu _0\) set to the second smallest eigenvalue of each corresponding Hessian. Sections 3.5 and 3.6 discuss in detail the parameter choices for \(\delta \) and \(\sigma \) for the H-GnCR and C-GnCR methods, respectively.

We selected a range of real and synthetic datasets, associated either with a similarity matrix \({\mathbf {A}}\in \mathbb {R}_{\ge 0}^{n\times n}\), or a data matrix \({\mathbf {M}}=[{\mathbf {m}}_1,\ldots ,{\mathbf {m}}_n]^\top \), in which case we assume that \(A_{ij} = |{\mathbf {m}}_i^\top {\mathbf {m}}_j |\). These sets include:

  • Real datasets from the \(\texttt {seriation}\) R-package (Hahsler et al. 2008):

    • Munsingen a \({59\times 70}\) binary matrix \({\mathbf {M}}\).

    • Psych24 a \({24\times 24}\) similarity matrix \({\mathbf {A}}\).

    • Gene expression (wood) a \({136\times 6}\) matrix \({\mathbf {M}}\).

    • Zoo a \({101\times 16}\) matrix \({\mathbf {M}}\).

  • Other real datasets :

    • Votes a \({232\times 16}\) binary matrix \({\mathbf {M}}\) (Dheeru and Karra Taniskidou 2017).

    • Facebook ego-network a \({324\times 324}\) similarity matrix \({\mathbf {A}}\) (Leskovec and Krevl 2014).

    • Elutriation gene expression a \({301\times 14}\) matrix \({\mathbf {M}}\) (Alter et al. 2000).

  • Datasets from the SuiteSparse Matrix Collection (Davis and Hu 2011):

    • CAT a \({85\times 85}\) similarity matrix \({\mathbf {A}}\).

    • DWT a \({59\times 59}\) similarity matrix \({\mathbf {A}}\).

  • Synthetic datasets:

    • Markov chains (Lim and Wright 2014) a \(100\times 100\) matrix \({\mathbf {A}}\), which is the covariance matrix of 50 independent linear Markov chains, with each one generated as \(X_i =bX_{i-1}+\epsilon _i, \quad i\in \{1,\ldots ,100\}\), where \(\epsilon _i~\sim ~\mathcal {N}(0, \sigma ^2)\), \(b = 0.999\), and \(\sigma = 0.5\).

    • Artificial graves a \(100 \times 200\) binary \({\mathbf {M}}\), that models the incidence of artifacts in graves assuming that the occurrence rate of each artifact follows a Gaussian curve. Specifically, each grave is associated with a time-point \(t_i\sim \mathcal {U}(0,1)\). The probability that the jth artifact will appear in a grave is defined as \(\text {Pr}(M_{i,j}=1)=\alpha _i \beta _j \exp (-\frac{\left\| t_i-\mu _j \right\| }{2\sigma _j^2})\), where \(\alpha _i \sim \text {Lognormal}(\log (0.3),0.3)\), \(\beta _j \sim \mathcal {U}(0,1)\), \(\mu _j \sim \mathcal {U}(-1,2)\), and the standard deviation \(\sigma _j\) is distributed with a truncated Jeffrey’s prior between [0.01, 0.25].

    • Robinsonian\(_N\) formed from an \(N \times M\) binary 0–1 matrix \({\mathbf {M}}\) that has the consecutive ones property (C1P), that is, its rows can be rearranged such that the ones in every column form a single contiguous sequence (Fulkerson and Gross 1965).

    • Double moons a \(100 \times 100\) matrix \({\mathbf {A}}\), generated from points that form two half moons in 2-D space. The similarity matrix \({\mathbf {A}}\) is computed using a Gaussian kernel (Baudat and Anouar 2001).

We evaluate the utility of the proposed algorithms by comparing their objective function values with a number of seriation methods that can solve different p-SUM problems. All evaluations are run over multiple randomly shuffled instances of the available datasets. We additionally use the weighted Robinson events (WRE) measure (Hahsler et al. 2008) to assess the Robinsonian structure of a similarity matrix. Since the values on different datasets are not comparable, for interpretability we report a normalized value for each measure that quantifies the deviation from the best performer for that dataset. For the ith dataset the deviation for the jth algorithm is defined as

$$\begin{aligned} \varDelta _{i,j} = \frac{score_{i,j} - best_i}{best_i}. \end{aligned}$$
(25)
Table 1 Deviation from the best 2-SUM value across the 12 datasets
Table 2 Deviation from the best 1-SUM value across the 12 datasets
Table 3 Deviation from the best \(\frac{1}{2}\)-SUM value across the 12 datasets
Table 4 Deviation from the best WRE score when solving the 2-SUM across the 12 datasets
Table 5 Deviation from the best WRE score when solving the 1-SUM across the 12 datasets
Table 6 Deviation from the best WRE score when solving the \(\frac{1}{2}\)-SUM across the 12 datasets

Additionally, we report the overall average deviations for each algorithm across all datasets. Tables 1, 2 and 3 show the normalized deviation from the best p-SUM value for each algorithm and for the 12 datasets. FAQ, SA and vRCR\(_2\) directly solve each corresponding p-SUM problem. For the 1-SUM and \(\frac{1}{2}\)-SUM, the vRCR method is not included in the comparisons as it is designed to solve the 2-SUM, but we do compare with the two spectral methods as their 2-SUM solutions can perform well in near noiseless cases. Table 1 shows the normalized deviation for the 2-SUM and demonstrates that the proposed GnCR algorithm outperforms the others for the 2-SUM criterion (boldfaced table entries denote best performance). Unlike previous convex relaxation methods (Lim and Wright 2014; Fogel et al. 2015), the proposed method can outperform both spectral methods without the use of any extra ordering information. The performance difference was assessed with a sign test, which showed that GnCR performs better than both spectral methods with a p-value of 0.0084 at a significance level of 0.05 (a Bonferroni correction for the two hypotheses tested was applied). Similarly, Table 2 shows the results for the 1-SUM case, where it is clear that the proposed H-GnCR outperforms all competing ones except SA. Nonetheless, it achieves a normalized deviation very close to the best. Lastly, Table 3 summarizes results for the \(\frac{1}{2}\)-SUM case, where the proposed C-GnCR achieves the second best overall performance, having an insignificant difference from the best performer, which is SA. Tables 4, 5 and 6 show similar trends for the WRE measure, where the proposed methods achieve scores very close to the best performing method, SA.

Table 7 Average 2-SUM values and running times for the three proposed algorithms and Spectral\(_\text{ A }\) using pre-Robinsonian matrices of sizes \(n=100\) and \(n=500\)
Fig. 4 Reconstructions for the Robinsonian\(_{100}\) dataset, for Spectral\(_\text{ A }\) with perfect reconstruction and the three proposed methods. a Spectral\(_\text{ A }\). b GnCR. c H-GnCR. d C-GnCR

In a second set of experiments we test the consistency of the proposed algorithms on artificial Robinsonian datasets of size \(n=100\) and \(n=500\) by comparing them against the spectral solution, which can find the optimal solution in noiseless cases. For each problem size we generate 20 randomly permuted instances and find the best reordering for each dataset. We measure the 2-SUM values of each algorithm in order for the comparison to be consistent with that of the spectral solution. Table 7 shows the average 2-SUM values and running times of each algorithm. We can see that for both datasets GnCR achieves the optimal score, owing to the spectral initialization, and C-GnCR outperforms H-GnCR. However, C-GnCR is much slower than the rest of the methods due to the kernel size annealing process. Overall, the scores are comparable, which supports the view that the underlying optimization mechanisms of the proposed methods behave consistently. Figure 4 provides graphical representations of the quality of the different reconstructions.

4.2 Comparison on the different p-SUM objectives

We now examine the utility in terms of seriation quality for each algorithm when solving different instances of the p-SUM. We first test the performance of C-GnCR when using different kernel approximations in Table 8. It can be seen that the Cauchy-based outperforms the Gaussian-based with respect to both objective value and running time.

Table 8 Average performance and running time of C-GnCR from 20 runs using the Artificial graves dataset, for two different kernel approximations

Subsequently, we ascertain the ability of the different algorithms to solve the \(\frac{1}{2}\)-SUM problem in situations where local ordering is of particular interest. For this setting we employ 10 random repetitions of the Double moons dataset (see Fig. 2) with size \(n=400\), and use the membership of each object to one of the two moons to evaluate a resulting ordering \({\varvec{\pi }}\). If we define the class label matrix as

$$\begin{aligned} C_{ij} = \left\{ \begin{array}{ll} 1, &{} \text {if } i,j \in \text { same class},\\ 0, &{} \text {otherwise}, \end{array} \right. \end{aligned}$$
(26)

the first measure we propose quantifies the number of times a seriation algorithm places objects from different classes next to each other as

$$\begin{aligned} \delta _{count}({\varvec{\pi }},{\mathbf {C}}) \triangleq \sum _{i=1}^{n-1} \left( 1 - C_{\pi (i)\pi (i+1)} \right) . \end{aligned}$$
(27)

The second measure we use penalizes objects from the same class that are placed far apart. It has the same form as the 2-SUM objective, but with the similarity matrix \({\mathbf {A}}\) replaced by the above \({\mathbf {C}}\); that is

$$\begin{aligned} \text {2-SUM}_{sup}({\varvec{\pi }},{\mathbf {C}}) \triangleq {\varvec{\pi }}^\top {\mathbf {L}}_{{\mathbf {C}}} {\varvec{\pi }}. \end{aligned}$$
(28)
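As an illustration, the following is a minimal NumPy sketch of both measures, written directly from Eqs. (26)–(28); `labels` is assumed to hold the moon membership of each object, and the Laplacian \({\mathbf {L}}_{{\mathbf {C}}}\) is formed in the usual way as \(\mathrm {diag}({\mathbf {C}}{\mathbf {1}}) - {\mathbf {C}}\).

```python
import numpy as np

def class_matrix(labels):
    """C_ij = 1 if objects i and j share a class label, 0 otherwise (Eq. 26)."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def delta_count(perm, C):
    """Number of adjacent pairs in the ordering from different classes (Eq. 27);
    `perm[i]` is the object placed at position i."""
    perm = np.asarray(perm)
    return float(np.sum(1.0 - C[perm[:-1], perm[1:]]))

def two_sum_sup(pi, C):
    """Supervised 2-SUM (Eq. 28): pi^T L_C pi, where `pi[i]` is the position
    assigned to object i and L_C is the Laplacian of C."""
    L = np.diag(C.sum(axis=1)) - C
    return float(pi @ L @ pi)

# hypothetical usage, converting between the two conventions:
# C = class_matrix(labels)
# pi = np.argsort(perm) + 1            # position (1..n) of each object
# scores = delta_count(perm, C), two_sum_sup(pi, C)
```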
Table 9 Supervised evaluation of seriation quality for different algorithms solving the general p-SUM using the Double moons dataset

Table 9 shows the average values of both measures above. It can be seen that the algorithms solving the \(\frac{1}{2}\)-SUM perform much better, as the sought seriation is more sensitive to the local structure. The algorithms used for the 2-SUM, that is GnCR, FAQ and SA, show similar performance in both measures. For the 1-SUM, the proposed H-GnCR achieves the second best performance for the \(\text {2-SUM}_{sup}\), but is the worst with respect to \(\delta _{count}\), owing to the fact that it solves a smooth approximation of the 1-SUM, in contrast with FAQ and SA. For the \(\frac{1}{2}\)-SUM case, the proposed C-GnCR performs worse in terms of both measures, again due to the underlying approximation, but maintains a \(\text {2-SUM}_{sup}\) value very close to the best.

We further examine the effects of solving the general p-SUM on the Facebook ego-network dataset, which contains a network of connections among friends of a user (McAuley and Leskovec 2012). In this case, seriation can be used to reveal node clustering patterns, as orderings that are more sensitive to the local structure can highlight tighter social circles. Figure 5a–c shows the effects of solving the 2-SUM, 1-SUM and \(\frac{1}{2}\)-SUM problems with SA, chosen here for its objective approximation quality. Figure 5d–f displays the corresponding cluster crossing curves. These are calculated as in Ding and He (2004), by summing fractions of pairwise similarities between objects; they can indicate cluster overlap, and their minimum values are attained at boundaries between clusters. In this experiment we can see that a smaller p yields an increased number of clusters (more valleys) of smaller sizes (narrower peaks).
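Since the text describes the crossing curves only briefly, the sketch below is one plausible reading rather than the exact procedure of Ding and He (2004): at each cut point along the ordering, the similarities between the objects placed just before and just after the cut are summed within a small window and normalized, so that valleys indicate cluster boundaries. The window size `m` and the normalization are illustrative choices.

```python
import numpy as np

def cluster_crossing(A, perm, m=10):
    """Crossing value at each cut point of an ordering: the average similarity
    between the (up to) m objects before the cut and the (up to) m after it."""
    A_ord = A[np.ix_(perm, perm)]
    n = A_ord.shape[0]
    curve = np.zeros(n - 1)
    for i in range(1, n):
        block = A_ord[max(0, i - m):i, i:min(n, i + m)]
        curve[i - 1] = block.sum() / block.size
    return curve
```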

Fig. 5 Seriation of the Facebook dataset for different p-SUM losses using the SA method. The top row displays the seriated similarity matrices and the bottom row contains the corresponding cluster crossing curves. a 2-SUM map. b 1-SUM map. c \(\frac{1}{2}\)-SUM map. d 2-SUM crossing. e 1-SUM crossing. f \(\frac{1}{2}\)-SUM crossing

Since for this dataset we do not have distinct class labels, we use the Hamiltonian path criterion (Hahsler et al. 2008) to assess the local ordering of the resulting seriation. Table 10 presents the performance of the proposed methods against FAQ and SA, across different p-SUM objectives. We can see that for all methods, as we reduce p, the measure increases (since we use similarities), which suggests more localized orderings. Furthermore, for the 2-SUM objective, GnCR outperforms both FAQ and SA, while for the 1-SUM and \(\frac{1}{2}\)-SUM, the proposed H-GnCR and C-GnCR perform worse.

Table 10 Hamiltonian path measures for different algorithms that solve the general p-SUM for the Facebook dataset

4.3 Envelope reduction on big NASA datasets

In order to test the scalability of our proposed algorithms at an even larger scale, we apply them to a collection of large (\(n>1000\)) sparse matrices taken from the SuiteSparse Matrix Collection (Davis and Hu 2011). The quality of seriation can also be measured by the bandwidth and the envelope of the symmetric similarity matrix: the bandwidth is the maximum row width (the width of a row being the largest distance between any non-zero element in that row and the diagonal), while the envelope size is the sum of all row widths (George and Pothen 1994); a small sketch of both measures follows the dataset list below. The goal of this experiment is to examine whether the proposed methods can successfully reduce the envelope size of sparse matrices of size up to \(n=36{,}519\). Specifically, we use four sparse NASA datasets:

  • BARTH4 a 6,691\(\times \)6,691 binary asymmetric \({\mathbf {A}}_u\) with 26,439 non-zero elements, symmetrized as \({\mathbf {A}}= {\mathbf {A}}_u^\top \vee {\mathbf {A}}_u\) (where \(\vee \) denotes elementwise OR).

  • BARTH5 a 15,606\(\times \)15,606 binary asymmetric \({\mathbf {A}}_u\) with 61,484 non-zero elements, symmetrized as before.

  • PWT a 36,519\(\times \)36,519 binary symmetric \({\mathbf {A}}\) with 326,107 non-zero elements.

  • CAN a 1,072\(\times \)1,072 binary symmetric \({\mathbf {A}}\) with 11,372 non-zero elements.
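For clarity, here is a minimal sketch of the two measures and of the elementwise-OR symmetrization mentioned above; it is written for a dense 0/1 NumPy array purely for illustration, whereas the actual datasets are stored as large sparse matrices.

```python
import numpy as np

def symmetrize(A_u):
    """Elementwise OR symmetrization: A = A_u^T v A_u for a 0/1 matrix."""
    return np.logical_or(A_u, A_u.T).astype(np.int8)

def row_widths(A):
    """Width of each row: largest distance of a non-zero entry to the diagonal."""
    n = A.shape[0]
    widths = np.zeros(n, dtype=int)
    for i in range(n):
        cols = np.flatnonzero(A[i])
        if cols.size:
            widths[i] = int(np.max(np.abs(cols - i)))
    return widths

def bandwidth_and_envelope(A):
    """Bandwidth = maximum row width; envelope size = sum of all row widths."""
    w = row_widths(A)
    return int(w.max()), int(w.sum())
```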

Table 11 Envelope size, bandwidth, objective function and running time for each algorithm for the four datasets

The only algorithms feasible for this set of experiments are the three proposed methods and the two spectral methods. We present the envelope size and bandwidth of each reordered matrix, as in Barnard et al. (1993), and also report the p-SUM objective values and running times in Table 11. In terms of envelope size, we can see that GnCR and H-GnCR show the best performance across all datasets. Nevertheless, C-GnCR also maintains good performance, very close to the best. Regarding the bandwidth, each of the proposed methods H-GnCR, GnCR and C-GnCR achieves the best value for one of the first three datasets, and Spectral\(_\text{ B }\) for the last one. It is notable, however, that GnCR maintains a low bandwidth very close to the best in all cases. With regard to the 2-SUM, GnCR shows the best performance on all datasets apart from PWT, where C-GnCR scores best. For the 1-SUM, GnCR and H-GnCR outperform the other methods for the first two (BARTH4, BARTH5) and last two (PWT, CAN) datasets, respectively. Results for the \(\frac{1}{2}\)-SUM objective show that the proposed H-GnCR scores best for the last three datasets, while C-GnCR is best for BARTH4. Again, it is notable that C-GnCR scores very close to the best for the remaining datasets. Finally, we can see that all algorithms maintain a reasonable running time as the problem size increases and thus prove to be very scalable.

Figure 6 gives a visual representation of the original BARTH4 matrix and the reordered matrices of the tested algorithms. It can be seen that all methods show similar behavior and successfully reduce the envelope of the corresponding matrix.

Fig. 6 Original similarity map for the BARTH4 dataset, and reordered versions produced by the two spectral and proposed algorithms. a Original map. b Spectral\(_\text{ A }\) map. c Spectral\(_\text{ B }\) map. d GnCR map. e H-GnCR map. f C-GnCR map

4.4 Seriation of images

In this section we explore seriation of complex patterns, such as images, where ordering them according to their semantic content may be of interest. An optimized linear ordering can reveal whether there are smooth variations across the patterns. Possible applications include browsing collections of photos while preserving scene similarity, exploring patterns of pathology amongst medical images, or sequencing video frames.

In order to test the performance of the proposed methods on images we use two datasets: a set of 40 rotating teapot images (Weinberger and Saul 2004) captured \(4.5^\circ \) apart, spanning \(180^\circ \) and categorized into 8 classes, and a set of 585 images from the MSRC2 databaseFootnote 6 categorized into 20 distinct classes. We represent images as bags of visual words (Csurka et al. 2004), that is, histograms of quantized local descriptors densely sampled over overlapping patches of each image (Tuytelaars 2010). In this setup, we use SIFT (Lowe 2004) vector descriptorsFootnote 7 and image patches 12 pixels long, overlapping every 6 pixels. For the bag-of-visual-words representation we use k-means with 500 clusters. Then, to derive the similarity matrix we use the exponentiated \(\chi ^2\) distance, as in Quadrianto et al. (2010). For comparison, we also include the kernelized sorting (KS) algorithm (Quadrianto et al. 2010), which can align a set of images according to a given template, in this case a one-dimensional grid.
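As a rough illustration of the final step of this pipeline, the sketch below computes an exponentiated \(\chi ^2\) similarity matrix from bag-of-visual-words histograms; the kernel form \(\exp (-\gamma \chi ^2)\) and the bandwidth parameter \(\gamma \) are assumptions here and may differ from the precise settings of Quadrianto et al. (2010).

```python
import numpy as np

def chi2_distances(H):
    """Pairwise chi-squared distances between rows of a histogram matrix H."""
    H = np.asarray(H, dtype=float)
    n = H.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        num = (H[i] - H) ** 2
        den = H[i] + H
        D[i] = 0.5 * np.sum(np.divide(num, den, out=np.zeros_like(num),
                                      where=den > 0), axis=1)
    return D

def exp_chi2_similarity(H, gamma=1.0):
    """Exponentiated chi-squared similarity: exp(-gamma * chi2(h_i, h_j))."""
    return np.exp(-gamma * chi2_distances(H))
```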

Table 12 \(\delta _{count}\) measure for each algorithm (parenthesized numbers indicate the value of p) for the teapots and MSRC2 image datasets

The numerical results in Table 12 rely on \(\delta _{count}\) to evaluate how closely images from the same category are placed. We use different p-SUM objectives to obtain more local ordering solutions. For the Teapots dataset we can see that H-GnCR and SA\(\left( {\tfrac{1}{2}}\right) \) achieve the optimum \(\delta _{count}\) value. It is also notable that all three proposed methods maintain a very low value across the different objectives, while this is not the case for all SA and FAQ versions. For the MSRC2 dataset, SA\(\left( {\tfrac{1}{2}}\right) \) scores the best \(\delta _{count}\), while C-GnCR achieves the third best.

Fig. 7 Image sequence of a teapot captured at different angles, seriated using Spectral\(_\text{ A }\) and H-GnCR. a Spectral\(_\text{ A }\). b H-GnCR

Figure 7 shows the seriated teapots for the spectral (Barnard et al. 1993) and H-GnCR methods, while Figs. 8 and 9 show the results on MSRC2 for C-GnCR and SA\(\left( {\tfrac{1}{2}}\right) \), respectively. For the teapot experiment we can see that H-GnCR finds an ordering that reflects the smooth variation across the patterns, while the spectral method fails to do so. For MSRC2, it is noticeable that images with similar content are frequently placed close to each other along the linear ordering, e.g., categories of trees, animals, cars, planes, faces, flowers, books, etc. Although a perfect reconstruction of the original order cannot be achieved in this case, both methods do a good job seriating images with animals, trees and books, while the SA\(\left( {\tfrac{1}{2}}\right) \) method performs better in seriating images with faces.

Fig. 8 Seriated images from the MSRC2 dataset with C-GnCR. Differently shaded bars at the top of each image correspond to different categories

Fig. 9 Seriated images from the MSRC2 dataset with SA\(\left( {\tfrac{1}{2}}\right) \). Differently shaded bars at the top of each image correspond to different categories

We additionally evaluate the ability of the proposed methods to find a solution close to the true underlying ordering of the rotating teapots. Table 13 presents, for all methods, the corresponding absolute Kendall’s tau (Critchlow 2012), which measures the rank correlation between two orderings, and the PPC (Goulermas et al. 2016), which measures the agreement in terms of positional proximities. We can see that H-GnCR and SA\(\left( {\tfrac{1}{2}}\right) \) achieve the optimum scores. Moreover, all three proposed methods maintain agreement scores very close to the best across the different objectives.
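As a small illustration of the first of these two scores, the absolute Kendall’s tau between a recovered and the true ordering can be computed with SciPy as below; taking the absolute value makes the score invariant to reversing the ordering. The PPC is not sketched here, as it follows the specific construction of Goulermas et al. (2016).

```python
import numpy as np
from scipy.stats import kendalltau

def abs_kendall_tau(perm, true_perm):
    """|tau| between the positions two orderings assign to the same objects;
    `perm[k]` is the object placed at position k."""
    pos = np.argsort(np.asarray(perm))           # position of each object
    true_pos = np.argsort(np.asarray(true_perm))
    tau, _ = kendalltau(pos, true_pos)
    return abs(tau)
```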

5 Discussion and conclusions

In this work we have introduced a new set of algorithms for a continuous relaxation of various versions of the p-SUM problem, based on a graduated non-convexity procedure with a first-order optimization method that operates directly on a permutation vector. To the best of our knowledge, this is the first time continuation-based algorithms have been used to approximate such a wide range of instances of the p-SUM. A clear advantage of such vector-based gradient search when solving large problems is that it is very efficient and naturally scalable.

The experimental results from the previous sections contain some interesting observations regarding the usefulness of the proposed methods for the problem of seriation. In the first set of experiments we examined the utility of the three proposed algorithms on a set of real and artificial datasets. The results show that all proposed algorithms maintain good performance across a wide range of datasets. SA appears to be the main competitor in this experimental setup, but it is a much slower method (running times are usually greater than 1000 s for problem sizes over \(n=500\)). It is also notable that the two convex relaxation approaches (vRCR and vRCR\(_2\)) do not outperform the two spectral methods when no auxiliary information is present. Similar behavior is observed for the WRE measure as well. Moreover, we verified the consistency of the proposed methods with the aid of pre-Robinsonian datasets, where the results show that the proposed methods effectively solve the noiseless seriation problem and perform comparably to the spectral method. This further supports the benefit of graduated non-convexity as a means of tracking solutions close to the global optimum.

Table 13 Kendall’s \(|\tau |\) and PPC scores between final solution and true underlying ordering, for the Teapots dataset (parenthesized numbers indicate the value of p). Values closer to 1 indicate better ordering agreement

To explore the suitability of the methods for solving different p-SUM problems, in Sect. 4.2 we used a synthetic dataset with class label information, which enabled us to calculate a local ordering measure and compare algorithms that solve general p-SUM instances. The results show that as we reduce the value of p, the seriation results become more localized. The proposed methods for the 1-SUM and the \(\tfrac{1}{2}\)-SUM perform slightly worse in terms of \(\delta _{count}\), due to the fact that they rely on smooth approximations of the objective functions. H-GnCR outperforms the FAQ method in terms of \(\text {2-SUM}_{sup}\); this can be explained by the fact that a single misplacement of objects that end up very far apart can result in poor seriation quality when assessed globally. Additional experiments on the Facebook dataset demonstrate the effects of solving the p-SUM for \(p<2\). The results in terms of the Hamiltonian path criterion show that as we reduce p, more localized orderings are obtained. It can also be seen that the proposed method for the 2-SUM outperforms FAQ and SA, but this is not the case for the ones designed for the 1-SUM and \(\tfrac{1}{2}\)-SUM, again due to their approximated objective functions.

The scalability of the proposed methods was tested on four large scale sparse matrices in the context of the envelope reduction problem. For this reason, we compared with the two spectral methods, which perform well on such problems. The experiments reveal that all three proposed methods are very scalable. In terms of envelope reduction quality, GnCR and H-GnCR achieve the best envelope size values, a fact that further supports their close connection to the envelope reduction problem. C-GnCR is slightly worse, but maintains a performance very close to the best on all datasets. For the bandwidth measure, each of the proposed methods achieves the best value for one of the first three datasets, while for the CAN dataset Spectral\(_\text{ B }\) outperforms them. The results therefore suggest that the proposed methods are suitable for envelope reduction.

Finally, the proposed methods were applied to image seriation. With regard to the Teapots dataset, all proposed methods show good performance, with H-GnCR performing best. Additionally, all proposed methods appear able to find solutions that are very close to the true underlying ordering. For the MSRC2 dataset we see that C-GnCR performs well in terms of keeping images of the same category close together, although it does not outperform SA\(\left( {\tfrac{1}{2}}\right) \), which scores best. In general, the problem of image seriation using extracted features is a challenging one and is highly dependent on the type and quality of the features, such as the SIFT descriptors.

Overall, the results demonstrate the practical benefit of solving the p-SUM for different values of p. The proposed algorithms show competitive performance and strong scalability to problem sizes unattainable by other methods, which makes them suitable for highlighting patterns of global or local similarities in data across various real-world applications, such as bioinformatics, data mining, image analysis and data visualization.