1 Introduction

Linear regression and its extensions are among the most fundamental models in statistics and related fields. In the classical and most common setting, we are given n samples with features \({\varvec{x}}_{i} \in {\mathbb {R}}^{d}\) and response \(y_{i} \in {\mathbb {R}}\), where i denotes the sample index. We assume that the features and responses are perfectly matched, i.e., \({\varvec{x}}_i\) and \(y_i\) correspond to the same record or sample. However, in important applications (for example, due to errors in the data merging process), the correspondence between the response and features may be broken [13, 14, 19]. This erroneous correspondence needs to be adjusted before performing downstream statistical analysis. Thus motivated, we consider a mismatched linear model with responses \({\varvec{y}} = [y_1,\ldots ,y_n]^\top \in {\mathbb {R}}^n\) and covariates \({\varvec{X}}=[{\varvec{x}}_1,\ldots , {\varvec{x}}_n]^\top \in {\mathbb {R}}^{n\times d} \) satisfying

$$\begin{aligned} {\varvec{P}}^* {\varvec{y}} = {\varvec{X}} \varvec{\beta }^* + \varvec{\epsilon } \end{aligned}$$
(1.1)

where \(\varvec{\beta }^*\in {\mathbb {R}}^d\) is the vector of true regression coefficients, \(\varvec{\epsilon } = [\epsilon _1,\ldots ,\epsilon _n]^\top \in {\mathbb {R}}^n\) is the noise term, and \({\varvec{P}}^*\in {\mathbb {R}}^{n\times n}\) is an unknown permutation matrix. We consider the classical setting where \(n > d\) and \({\varvec{X}}\) has full rank, and we seek to estimate both \({\varvec{\beta }}^*\) and \({\varvec{P}}^*\) based on the n observations \(\{(y_{i}, {\varvec{x}}_{i})\}_{1}^{n}\). Note that the main computational difficulty in this task arises from the unknown permutation.
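As a concrete illustration of model (1.1), the following snippet (a minimal sketch; the variable names, dimensions, and noise level are our own choices and are not taken from the paper) simulates a design matrix, a sparse permutation \({\varvec{P}}^*\) that mismatches only r of the n responses, and the corresponding response vector:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 100, 5, 10                 # samples, features, mismatched responses (all illustrative)

X = rng.standard_normal((n, d))
beta_star = rng.standard_normal(d)
eps = 0.1 * rng.standard_normal(n)

# Sparse permutation P*: cyclically shift a random subset of r indices, fix the rest.
# We store P* as an array pi with pi[i] = j meaning (P* y)_i = y_j.
pi = np.arange(n)
idx = rng.choice(n, size=r, replace=False)
pi[idx] = np.roll(idx, 1)            # an r-cycle, so exactly r indices are moved

# Model (1.1): P* y = X beta* + eps, so y is obtained by "un-permuting" the right-hand side.
Py = X @ beta_star + eps             # this is the vector P* y
y = np.empty(n)
y[pi] = Py                           # y_{pi[i]} = (P* y)_i, i.e. y = (P*)^{-1} (X beta* + eps)
```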

Linear regression with mismatched/permuted data—for example, model (1.1)—has a long history in statistics dating back to the 1960s [5, 6, 13]. In addition to the aforementioned application in record linkage, similar problems also appear in robotics [22], multi-target tracking [4] and signal processing [3], among others. Recently, this problem has garnered significant attention from the statistics and machine learning communities. A series of recent works [1, 2, 7, 8, 9, 11, 14, 15, 17, 19, 20, 21, 23, 24, 26] have studied the statistical and computational aspects of this model. To learn the coefficients \({\varvec{\beta }}^*\) and the permutation matrix \({\varvec{P}}^*\), one can consider the following natural optimization problem:

$$\begin{aligned} \min _{{\varvec{\beta }}, {\varvec{P}}}~ \Vert {\varvec{P}} {\varvec{y}} - {\varvec{X}} {\varvec{\beta }} \Vert ^2 ~~~ \mathrm{s.t.} ~~~ {\varvec{P}} \in \varPi _n \end{aligned}$$
(1.2)

where \(\varPi _n\) is the set of \(n \times n\) permutation matrices. Solving problem (1.2) is difficult as there are exponentially many choices for \({\varvec{P}} \in \varPi _{n}\). Given \({\varvec{P}}\), however, it is easy to estimate \({\varvec{\beta }}\) via least squares. [24] shows that in the noiseless setting (\({\varvec{\epsilon }}={\varvec{0}}\)), a solution \((\hat{{\varvec{P}}}, \hat{{\varvec{\beta }}})\) of problem (1.2) equals \(( {\varvec{P}}^*, {\varvec{\beta }}^*)\) with probability one if \(n\ge 2d\) and the entries of \({\varvec{X}}\) are independent and identically distributed (iid) as per a distribution that is absolutely continuous with respect to the Lebesgue measure. [11, 15] study the estimation of \(({\varvec{P}}^*, {\varvec{\beta }}^*) \) in the noisy setting. It is shown in [15] that Problem (1.2) is NP-hard if \(d\ge \kappa n \) for some constant \(\kappa >0\). A polynomial-time approximation algorithm appears in [11] for fixed d.

However, as noted in [11], this algorithm does not appear to be efficient in practice. [8] propose a branch-and-bound method that can solve small problems with \(n\le 20\) (within a reasonable time). [16] propose a branch-and-bound method for a concave minimization formulation, which can solve problem (1.2) with \(d\le 8\) and \(n \approx 100\) (the authors report a runtime of 40 minutes to solve instances with \(d=8\) and \(n=100\)). [23] propose an approach using tools from algebraic geometry, which can handle problems with \(d\le 6\) and n ranging from \(10^3\) to \(10^5\); the cost of this method increases exponentially with d. This approach is exact in the noiseless case but approximate in the noisy case (\({\varvec{\epsilon }} \ne {\varvec{0}}\)). Several heuristics have been proposed for (1.2): examples include alternating minimization [9, 26] and expectation maximization [2]. As far as we can tell, these methods are sensitive to initialization and have limited theoretical guarantees.

As discussed in [18, 19], in several applications, a small fraction of the samples are mismatched — that is, the permutation \({\varvec{P}}^*\) is sparse. In other words, if we let \(r:= |\{i\in [n] ~|~ {\varvec{P}}^* {\varvec{e}}_i \ne {\varvec{e}}_i \}|\) where \({\varvec{e}}_1, \ldots , {\varvec{e}}_n\) are the standard basis elements of \({\mathbb {R}}^n\), then r is much smaller than n. In this paper, we focus on such sparse permutation matrices, and assume the value of r is known or a good estimate is available to the practitioner. This motivates a constrained version of (1.2), given by

$$\begin{aligned} \min _{{\varvec{\beta }}, {\varvec{P}}}~ \Vert {\varvec{P}} {\varvec{y}} -{\varvec{X}} {\varvec{\beta }} \Vert ^2 ~~~ \mathrm{s.t.} ~~~ {\varvec{P}} \in \varPi _n, ~ \mathsf {dist}({\varvec{P}}, {\varvec{I}}_n) \le R \end{aligned}$$
(1.3)

where the constraint \(\mathsf {dist}({\varvec{P}}, {\varvec{I}}_{n}) \le R\) restricts the number of mismatches between \({\varvec{P}}\) and the identity permutation to be at most R; see (1.4) for a formal definition of \(\mathsf {dist}(\cdot ,\cdot )\). Above, R is taken such that \(r \le R \le n\) (further details on the choice of R can be found in Sects. 3 and 5). Note that as long as \(r \le R \le n\), the true parameters \(({\varvec{P}}^*, {\varvec{\beta }}^*)\) lead to a feasible solution to (1.3). In the special case when \(R = n\), the constraint \(\mathsf {dist}({\varvec{P}}, {\varvec{I}}_{n}) \le R\) is redundant, and Problem (1.3) is equivalent to problem (1.2). Interesting convex optimization approaches based on robust regression have been proposed in [18] to approximately solve (1.3) when \(r \ll n\); the authors focus on obtaining an estimate of \({\varvec{\beta }}^*\). Similar ideas have been extended to problems with multiple responses in [19].

Problem (1.3) can be formulated as a mixed-integer program (MIP) with \(O(n^2)\) binary variables (to model the unknown permutation matrix). Solving this MIP with off-the-shelf MIP solvers (e.g., Gurobi) becomes computationally expensive even for a small value of n (e.g. \(n \approx 50\)). To the best of our knowledge, there are no computationally practical algorithms with theoretical guarantees that can optimally solve the original problem (1.3), under suitable assumptions on the problem data. Addressing this gap is the main focus of this paper: We propose and study a novel greedy local search method for Problem (1.3). Loosely speaking, at every step our algorithm performs a greedy swap or transposition, in an attempt to improve the cost function. Based on our numerical experiments, this algorithm is typically efficient in practice. We also propose an approximate version of the greedy swap procedure that scales to much larger problem instances. We establish theoretical guarantees on the convergence of the proposed method under suitable assumptions on the problem data. Under a suitable scaling of the number of mismatched pairs compared to the number of samples and features, and certain assumptions on the covariates and noise, our local search method converges to an objective value that is at most a constant multiple of the squared norm of the underlying noise term. From a statistical viewpoint, this is the best objective value that one can hope to obtain (due to the noise in the problem). Interestingly, in the special case of \({\varvec{\epsilon }} = {\varvec{0}}\) (i.e., the noiseless setting), our algorithm converges to an optimal solution of (1.3) at a linear rate. We also prove an upper bound on the estimation error of \({\varvec{\beta }}^*\) (in \(\ell _2\) norm) and derive a bound on the number of iterations taken by our proposed local search method to find a solution with this estimation error.
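For reference, one natural way (ours, for illustration; other encodings are possible) to write the mixed-integer formulation mentioned above is to model the permutation through binary entries \(P_{ij}\) subject to doubly stochastic constraints, counting the mismatches via the diagonal of \({\varvec{P}}\):

$$\begin{aligned} \min _{{\varvec{\beta }},\, {\varvec{P}}} ~ \sum _{i=1}^n \Big ( \sum _{j=1}^n P_{ij} y_j - {\varvec{x}}_i^\top {\varvec{\beta }} \Big )^2 ~~~ \mathrm{s.t.} ~~~ P_{ij}\in \{0,1\}, ~~ \sum _{j} P_{ij} = 1 ~ \forall i, ~~ \sum _{i} P_{ij} = 1 ~ \forall j, ~~ \sum _{i} (1-P_{ii}) \le R \ , \end{aligned}$$

which has \(n^2\) binary variables and a convex quadratic objective.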

Notation and preliminaries: For a vector \({\varvec{a}}\), we let \(\Vert {\varvec{a}}\Vert \) denote the Euclidean norm, \(\Vert {\varvec{a}} \Vert _{\infty }\) the \(\ell _{\infty }\)-norm and \(\Vert {\varvec{a}}\Vert _0\) the \(\ell _0\)-pseudo-norm (i.e., number of nonzeros) of \({\varvec{a}}\). We let \( |||\cdot |||_2\) denote the operator norm for matrices. Let \(\{\varvec{e}_1,\ldots ,{\varvec{e}}_n\}\) be the standard orthonormal basis of \({\mathbb {R}}^n\). For a finite set S, we let \( \# S \) denote its cardinality. For any permutation matrix \({\varvec{P}}\), let \(\pi _{{\varvec{P}}}\) be the corresponding permutation of \(\{1,2,\ldots ,n\}\), that is, \(\pi _{{\varvec{P}}}(i) = j\) if and only if \({\varvec{e}}_i^\top {\varvec{P}} = {\varvec{e}}_j^\top \), if and only if \(P_{ij} = 1\). We define the distance between two permutation matrices \({\varvec{P}}\) and \({\varvec{Q}}\) as

$$\begin{aligned} \mathsf {dist}({\varvec{P}},{\varvec{Q}}) = \# \left\{ i \in [n] : \pi _{{\varvec{P}}} (i) \ne \pi _{{\varvec{Q}}}(i) \right\} . \end{aligned}$$
(1.4)

Recall that we assume \(r = \mathsf {dist}({\varvec{P}}^*, {\varvec{I}}_n)\). For a given permutation matrix \({\varvec{P}}\), define the m-neighbourhood of \({\varvec{P}}\) as

$$\begin{aligned} {\mathcal {N}}_m({\varvec{P}}) := \{ {\varvec{Q}} \in \varPi _n : ~ \mathsf {dist}({\varvec{P}},{\varvec{Q}}) \le m \} . \end{aligned}$$
(1.5)

It is easy to check that \({\mathcal {N}}_1({\varvec{P}}) = \{{\varvec{P}}\}\), and for any \(R\ge 2\), \({\mathcal {N}}_R({\varvec{P}})\) contains more than one element. For any permutation matrix \({\varvec{P}}\in \varPi _n\), we define its support as: \(\mathsf {supp}({\varvec{P}}) := \left\{ i\in [n] :~ \pi _{{\varvec{P}}}(i) \ne i \right\} .\) For a real symmetric matrix \({\varvec{A}}\), let \(\lambda _{\max }({\varvec{A}})\) and \(\lambda _{\min }({\varvec{A}})\) denote the largest and smallest eigenvalues of \({\varvec{A}}\), respectively.
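In code, with a permutation matrix \({\varvec{P}}\) stored as an integer array pi satisfying pi[i] \(= \pi _{{\varvec{P}}}(i)\), the distance (1.4) and the support above are one line each (a minimal sketch; the array representation and the function names are our own):

```python
import numpy as np

def perm_dist(pi_P, pi_Q):
    """dist(P, Q) as in (1.4): number of indices on which the two permutations differ."""
    return int(np.sum(np.asarray(pi_P) != np.asarray(pi_Q)))

def perm_supp(pi_P):
    """supp(P): indices i with pi_P(i) != i."""
    pi_P = np.asarray(pi_P)
    return np.flatnonzero(pi_P != np.arange(len(pi_P)))
```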

For two positive scalar sequences \(\{a_n\}, \{b_n\}\), we write \(a_n = O(b_n)\) or equivalently, \(a_n/b_n = O(1)\), if there exists a universal constant C such that \( a_n \le C b_n \). We write \(a_n = \varOmega (b_n)\) or equivalently, \(a_n/b_n = \varOmega (1)\), if there exists a universal constant c such that \( a_n \ge c b_n \). We write \(a_n = \varTheta (b_n)\) if both \(a_n = O(b_n) \) and \(a_n = \varOmega (b_n)\) hold.

2 A local search method

Here we present our local search method for (1.3). For any fixed \({\varvec{P}}\in \varPi _n\), by minimizing the objective function in (1.3) with respect to \({\varvec{\beta }}\), we have an equivalent formulation

$$\begin{aligned} \min _{{\varvec{P}}} ~~ \Vert {\varvec{P}} {\varvec{y}} - {\varvec{H}} {\varvec{P}} {\varvec{y}} \Vert ^2 ~~\mathrm{s.t.} ~~{\varvec{P}}\in \varPi _n , ~ \mathsf {dist}({\varvec{P}}, {\varvec{I}}_n) \le R \ , \end{aligned}$$
(2.1)

where \({\varvec{H}} = {\varvec{X}} ({\varvec{X}}^\top \varvec{X})^{-1} {\varvec{X}}^\top \) is the projection matrix onto the column space of \({\varvec{X}}\). To simplify notation, let \( \widetilde{{\varvec{H}}} := {\varvec{I}}_n - {\varvec{H}}\); then (2.1) is equivalent to

$$\begin{aligned} \min _{{\varvec{P}}} ~~ \Vert \widetilde{{\varvec{H}}} {\varvec{P}} {\varvec{y}} \Vert ^2 ~~ \mathrm{s.t.} ~{\varvec{P}}\in \varPi _n , ~ \mathsf {dist}({\varvec{P}}, {\varvec{I}}_n) \le R \ . \end{aligned}$$
(2.2)

Our local search approach for the optimization of Problem (2.2) is summarized in Algorithm 1.

Algorithm 1 (greedy local search). Initialize \({\varvec{P}}^{(0)} = {\varvec{I}}_n\). At iteration \(k = 0, 1, 2, \ldots \), update

$$\begin{aligned} {\varvec{P}}^{(k+1)} \in \mathop {\mathrm {argmin}}_{{\varvec{P}} \in {\mathcal {N}}_{2}({\varvec{P}}^{(k)}) \cap {\mathcal {N}}_{R}({\varvec{I}}_n)} ~ \Vert \widetilde{{\varvec{H}}} {\varvec{P}} {\varvec{y}} \Vert ^2 \ . \end{aligned}$$

(2.3)

At iteration k, Algorithm 1 finds a swap (within a distance of R from \({\varvec{I}}_n\)) that leads to the smallest objective value. To see the computational cost of (2.3), note that:

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}} {\varvec{P}}{\varvec{y}} \Vert ^2= & {} \Vert \widetilde{{\varvec{H}}}({\varvec{P}}-{\varvec{P}}^{(k)}){\varvec{y}} +\widetilde{{\varvec{H}}}{\varvec{P}}^{(k)}{\varvec{y}} \Vert ^2 \nonumber \\= & {} \Vert \widetilde{{\varvec{H}}}({\varvec{P}}-{\varvec{P}}^{(k)}){\varvec{y}} \Vert ^2 +2 \langle ({\varvec{P}}-{\varvec{P}}^{(k)}){\varvec{y}}, \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \rangle \nonumber \\&+ \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)}{\varvec{y}} \Vert ^2 \ . \end{aligned}$$
(2.4)

For each \({\varvec{P}}\) with \(\mathsf {dist}({\varvec{P}}, {\varvec{P}}^{(k)}) \le 2\), the vector \(({\varvec{P}}-{\varvec{P}}^{(k)}) {\varvec{y}}\) has at most two nonzero entries. Since we pre-compute \(\widetilde{{\varvec{H}}}\), computing the first term in (2.4) costs O(1) operations. As we retain a copy of \(\widetilde{{\varvec{H}}}{\varvec{P}}^{(k)}{\varvec{y}} \) in memory, computing the second term in (2.4) also costs O(1) operations. Therefore, computing (2.3) requires \(O(n^2)\) operations, as there are at most \(n^2\) possible swaps to search over. The \(O(n^2)\) per-iteration cost is quite reasonable for medium-sized examples with n being a few hundred to a few thousand, but might be expensive for larger examples. In Sect. 4, we propose a fast method to find an approximate solution of (2.3) that scales to instances with \(n \approx 10^7\) in a few minutes (see Sect. 5 for numerical findings).
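The following Python sketch shows one way to implement Algorithm 1 along these lines; it is our own illustration (the function name, the array representation of \({\varvec{P}}\), and the stopping rule are ours, and the speedups of Sect. 4 are not included). It pre-computes \(\widetilde{{\varvec{H}}}\), caches \(\widetilde{{\varvec{H}}}{\varvec{P}}^{(k)}{\varvec{y}}\), and evaluates each candidate swap in O(1) time via (2.4).

```python
import numpy as np

def local_search(X, y, R, max_iter=1000):
    """A sketch of Algorithm 1 for problem (2.2).

    The permutation P is stored as an array pi with (P y)_i = y[pi[i]].
    Starting from the identity, each iteration applies the transposition that
    most decreases ||H_tilde P y||^2 while keeping dist(P, I_n) <= R; the loop
    stops when no transposition improves the objective.
    """
    n, _ = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)       # projection onto the column space of X
    Ht = np.eye(n) - H                          # H_tilde = I_n - H
    pi = np.arange(n)                           # P^(0) = I_n
    w = Ht @ y                                  # cached H_tilde P y
    mism = 0                                    # dist(P, I_n)

    for _ in range(max_iter):
        best_diff, best_swap = 0.0, None
        for i in range(n):
            for j in range(i + 1, n):
                delta = y[pi[j]] - y[pi[i]]     # (P' - P) y = delta * (e_i - e_j)
                if delta == 0.0:
                    continue
                new_mism = (mism - (pi[i] != i) - (pi[j] != j)
                            + (pi[j] != i) + (pi[i] != j))
                if new_mism > R:                # stay inside N_R(I_n)
                    continue
                # change in objective, computed from (2.4) using the cached vector w
                diff = (delta**2 * (Ht[i, i] + Ht[j, j] - 2 * Ht[i, j])
                        + 2 * delta * (w[i] - w[j]))
                if diff < best_diff:
                    best_diff, best_swap = diff, (i, j, delta, new_mism)
        if best_swap is None:                   # no improving swap: stop
            break
        i, j, delta, mism = best_swap
        w += delta * (Ht[:, i] - Ht[:, j])      # update H_tilde P y in O(n)
        pi[i], pi[j] = pi[j], pi[i]

    beta = np.linalg.solve(X.T @ X, X.T @ y[pi])  # least-squares fit given the final P
    return pi, beta
```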

3 Theoretical guarantees for Algorithm 1

Here we present theoretical guarantees for Algorithm 1. The main assumptions and conclusions appear in Sect. 3.1. Section 3.2 presents the proofs of the main theorems. The development in Sects. 3.1 and 3.2 assumes that the problem data (i.e., \({\varvec{y}}, {\varvec{X}}, {\varvec{\epsilon }}\)) is deterministic. Section 3.3 discusses conditions on the distribution of the features and the noise term, under which the main assumptions hold true with high probability.

3.1 Main results

We now state the main theorems on the convergence of Algorithm 1; the proofs are presented in Sect. 3.2. For any \(m\le n\), define

$$\begin{aligned} {\mathcal {B}}_m:= \left\{ {\varvec{w}} \in {\mathbb {R}}^n:~ \Vert {\varvec{w}}\Vert _0 \le m \right\} . \end{aligned}$$
(3.1)

We first state the assumptions useful for our technical analysis.

Assumption 1

Suppose \({\varvec{X}}\), \({\varvec{y}}\), \({\varvec{\epsilon }}\), \({\varvec{\beta }}^*\) and \({\varvec{P}}^*\) satisfy the model (1.1) with \(\mathsf {dist}({\varvec{P}}^*, {\varvec{I}}_n) \le r\). Suppose the following conditions hold:

(1) There exist constants \(U>L>0\) such that

    $$\begin{aligned} \max _{i,j\in [n]} |y_i-y_j| \le U, ~~~ \mathrm{and } ~~~ | ({\varvec{P}}^*{\varvec{y}})_i - y_i | \ge L ~~~~~ \forall i\in \mathsf {supp}({\varvec{P}}^*) \ . \end{aligned}$$
(2) Set \(R=10 C_1rU^2/L^2+4\) for some constant \( C_1 > 1\).

(3) There is a constant \(\rho _n = O(d \log (n)/n)\) such that \(R \rho _n \le L^2/(90U^2)\), and

    $$\begin{aligned} \Vert {\varvec{H}} {\varvec{u}}\Vert ^2 \le \rho _n \Vert {\varvec{u}}\Vert ^2 ~~ \forall {\varvec{u}}\in {\mathcal {B}}_{4}, ~~\text {and} ~~\Vert {\varvec{H}} {\varvec{u}}\Vert ^2 \le R\rho _n \Vert {\varvec{u}}\Vert ^2 ~~ \forall {\varvec{u}}\in {\mathcal {B}}_{2R} \ . \end{aligned}$$
    (3.2)
(4) There is a constant \( {{\bar{\sigma }}}\ge 0\) satisfying \({{\bar{\sigma }}} \le \min \{0.5, (\rho _nd)^{-1/2}\} L^2/(80U) \) such that

    $$\begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert _{\infty } \le {{\bar{\sigma }}}, ~~ \Vert {\varvec{\epsilon }}\Vert _\infty \le {{\bar{\sigma }}}, ~~\text {and}~~ \Vert \varvec{H}{\varvec{\epsilon }} \Vert \le \sqrt{d} {{\bar{\sigma }}} \ . \end{aligned}$$
    (3.3)

Note that the lower bound in Assumption 1 (1) states that the y-value of a mismatched record is not too close to its original (pre-mismatch) value. Assumption 1 (2) states that R is set to a constant multiple of r. This constant can be large (\(\ge 10U^2/L^2\)) and appears to be an artifact of our proof techniques; our numerical experience suggests that it can be much smaller in practice. Assumption 1 (3) is a restricted eigenvalue (RE)-type condition [25]: multiplying any (2R)-sparse vector by \({\varvec{H}}\) results in a vector with small norm (when \(R\rho _n <1\)). Section 3.3 discusses conditions on the distribution of the rows of \({\varvec{X}}\) under which Assumption 1 (3) holds true with high probability. Note that if \(\rho _n = \varTheta (d \log (n)/n)\), then for the assumption \(R \rho _n \le L^2/(90U^2)\) to hold true, we require \(n/\log (n) = \varOmega (dr)\). Assumption 1 (4) limits the amount of noise \({\varvec{\epsilon }}\) in the problem. Section 3.3 presents conditions on the distributions of \({\varvec{\epsilon }}\) and \({\varvec{X}}\) (in a random design setting) which ensure that Assumption 1 (4) holds true with high probability.
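To see what Assumption 1 (3) asks for numerically: since \({\varvec{H}}\) is an orthogonal projection, \(\Vert {\varvec{H}}{\varvec{u}}\Vert ^2 = {\varvec{u}}^\top {\varvec{H}}{\varvec{u}}\), so the smallest valid \(\rho _n\) over \({\mathcal {B}}_4\) equals the largest eigenvalue among all \(4\times 4\) principal submatrices of \({\varvec{H}}\). The snippet below is our own rough check with illustrative sizes (random subset sampling only yields a lower bound on that maximum) for a Gaussian design:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 5                                   # illustrative sizes (ours)
X = rng.standard_normal((n, d))
H = X @ np.linalg.solve(X.T @ X, X.T)            # projection onto the column space of X

# max_{u in B_4, u != 0} ||H u||^2 / ||u||^2 = max over 4-subsets S of lambda_max(H[S, S]).
# Exhaustive enumeration is O(n^4); sampling random subsets gives a (lower) estimate.
rho_hat = 0.0
for _ in range(20000):
    S = rng.choice(n, size=4, replace=False)
    rho_hat = max(rho_hat, np.linalg.eigvalsh(H[np.ix_(S, S)]).max())

print(f"estimated rho_n over B_4: {rho_hat:.4f};  d*log(n)/n = {d*np.log(n)/n:.4f}")
```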

Assumption 1 (3) plays an important role in our technical analysis. In particular, this allows us to approximate the objective function in (2.2) with one that is easier to analyze. To provide some intuition, we write \(\widetilde{{\varvec{H}}}{\varvec{P}}^{(k)}{\varvec{y}} = \widetilde{{\varvec{H}}}({\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}) + \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\)—noting that \(\widetilde{{\varvec{H}}}{\varvec{P}}^* {\varvec{y}} = \widetilde{{\varvec{H}}}({\varvec{X}}{\varvec{\beta }}^* + {\varvec{\epsilon }}) = \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\), and assuming that the noise \({\varvec{\epsilon }}\) is small, we have:

$$\begin{aligned} \begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)}{\varvec{y}}\Vert ^2 \approx \Vert \widetilde{{\varvec{H}}}( {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}) \Vert ^2 \approx \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2. \end{aligned} \end{aligned}$$
(3.4)

Intuitively, the term on the right-hand side is the approximate objective that we analyze in our theory. Lemma 1 presents a one-step decrease property on the approximate objective function.

Lemma 1

(One-step decrease) Given any \({\varvec{y}}\in {\mathbb {R}}^n \) and \({\varvec{P}}, {\varvec{P}}^* \in \varPi _n\), there exists a permutation matrix \(\widetilde{{\varvec{P}}}\in \varPi _n\) such that \(\mathsf {dist}(\widetilde{{\varvec{P}}}, {\varvec{P}}) = 2\), \(\mathsf {supp}(\widetilde{{\varvec{P}}}({\varvec{P}}^*)^{-1}) \subseteq \mathsf {supp}( {\varvec{P}} ({\varvec{P}}^*)^{-1}) \) and

$$\begin{aligned} \Vert {\varvec{P}} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}\Vert ^2 - \Vert \widetilde{{\varvec{P}}}{\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 \ge (1/2) \Vert {\varvec{P}} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}\Vert ^2_{\infty } \ . \end{aligned}$$
(3.5)

If in addition \( \Vert {\varvec{P}} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{0} \le m\) for some \(m\le n\), then

$$\begin{aligned} \Vert \widetilde{{\varvec{P}}}{\varvec{y}} - {\varvec{P}}^* {\varvec{y}}\Vert ^2 \le \left( 1- 1/(2m)\right) \Vert {\varvec{P}} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}\Vert ^2 \ . \end{aligned}$$
(3.6)
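Lemma 1 involves only \({\varvec{y}}\) and the two permutations, so it can be sanity-checked by brute force on tiny instances. The snippet below is our own numerical check (not part of the paper's argument): for every \({\varvec{P}}\) it searches all transpositions \(\widetilde{{\varvec{P}}}\) satisfying the support condition and verifies (3.5).

```python
import numpy as np
from itertools import combinations, permutations

rng = np.random.default_rng(2)
n = 6
y = rng.standard_normal(n)                       # generic y (distinct entries)
pi_star = np.array([1, 0, 3, 2, 4, 5])           # an arbitrary P* (two transpositions)

Py = lambda p: y[np.asarray(p)]                  # (P y)_i = y[p[i]]

for pi in permutations(range(n)):
    pi = np.array(pi)
    base = np.sum((Py(pi) - Py(pi_star)) ** 2)
    if base == 0.0:                              # skip the trivial case P y = P* y
        continue
    supp = set(np.flatnonzero(pi != pi_star))    # supp(P (P*)^{-1}) = {i : pi(i) != pi_star(i)}
    best_drop = -np.inf
    for i, j in combinations(range(n), 2):
        pt = pi.copy(); pt[i], pt[j] = pt[j], pt[i]          # candidate P~, dist(P~, P) = 2
        if not set(np.flatnonzero(pt != pi_star)) <= supp:   # support condition of Lemma 1
            continue
        best_drop = max(best_drop, base - np.sum((Py(pt) - Py(pi_star)) ** 2))
    # (3.5): the best swap decreases the objective by at least half the squared sup-norm
    assert best_drop >= 0.5 * np.max(np.abs(Py(pi) - Py(pi_star))) ** 2 - 1e-9
print("Lemma 1 (3.5) verified for all permutations P with n =", n)
```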

The main results make use of Lemma 1 and formalize the intuition conveyed in (3.4). We first present a result regarding the support of the permutation matrix \({\varvec{P}}^{(k)}\) delivered by Algorithm 1.

Proposition 1

(Support detection) Suppose Assumption 1 holds. Let \(\{{\varvec{P}}^{(k)}\}_{k \ge 0}\) be the permutation matrices generated by Algorithm 1. Then for all \(k\ge R/2\), it holds \( \mathsf {supp}({\varvec{P}}^{*}) \subseteq \mathsf {supp}({\varvec{P}}^{(k)}) \).

Proposition 1 states that the support of \({\varvec{P}}^*\) will be contained within the support of \({\varvec{P}}^{(k)}\) after at most R/2 iterations. Intuitively, this result is a consequence of Assumption 1 (1), which assumes that the mismatches represented by \({\varvec{P}}^*\) have “strong signal”. Proposition 1 is also useful in the proofs of the main theorems below (e.g., see the claim in (3.22) in the proof of Theorem 3 for details).

We now present some additional assumptions required for the results that follow.

Assumption 2

Let \(\rho _n\) and \({{\bar{\sigma }}}\) be parameters appearing in Assumption 1.

(1) Suppose \(R^2 \rho _n \le 1/10\).

(2) There is a constant \(\sigma \ge 0\) such that \({{\bar{\sigma }}}^2 \le \sigma ^2 \min \{n/(660R^2), n/(5dR)\}\), and

    $$\begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \ge (1/2)n \sigma ^2. \end{aligned}$$
    (3.7)

In light of the discussion following Assumption 1, Assumption 2 (1) places a stricter condition on the size of n via the requirement \(R^2 \rho _n \le 1/10\). If \(\rho _n = \varTheta (d\log (n)/n)\), then we would need \(n/\log (n) =\varOmega ( dr^2)\), which is stronger than the condition \(n/\log (n) = \varOmega (dr)\) needed in Assumption 1.

Assumption 2 (2) imposes a lower bound on \(\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert \); since \(\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2 = \Vert {\varvec{\epsilon }}\Vert ^2 - \Vert {\varvec{H}}{\varvec{\epsilon }}\Vert ^2\), this can be equivalently viewed as an upper bound on \(\Vert {\varvec{H}} {\varvec{\epsilon }} \Vert \), in addition to the upper bound appearing in Assumption 1 (4). Section 3.3 provides a sufficient condition for Assumption 2 (2) to hold with high probability. In particular, in the noiseless case (\({\varvec{\epsilon }} = {\varvec{0}}\)), Assumption 1 (4) and Assumption 2 (2) hold with \({{\bar{\sigma }}} = \sigma = 0\).

We now state the first convergence result.

Theorem 3

(Linear convergence of objective up to noise level) Suppose Assumptions 1 and 2 hold with R being an even number. Let \(\{{\varvec{P}}^{(k)}\}_{k \ge 0}\) be the permutation matrices generated by Algorithm 1. Then for any \(k\ge 0\), we have

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 \le \bigg (1- \frac{1}{18R}\bigg )^k \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(0)} {\varvec{y}} \Vert ^2 + 36 \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \ . \end{aligned}$$
(3.8)

In the special (noiseless) setting when \({\varvec{\epsilon }} = {\varvec{0}}\), Theorem 3 establishes that the sequence of objective values generated by Algorithm 1 converges to zero (i.e., the optimal objective value) at a linear rate. The parameter governing the linear rate of convergence depends upon the search width R. Following the discussion after Assumption 2, the sample-size requirement \(n/\log (n) = \varOmega (dr^2)\) is more stringent than that needed for the model to be identifiable (\(n\ge 2d\)) [24] in the noiseless setting. In particular, when \( n/(d\log n) = O(1)\), the number of mismatched pairs r needs to be bounded by a constant. Numerical evidence presented in Sect. 5 (for the noiseless case) suggests that the sample size n needed to recover \({\varvec{P}}^*\) is smaller than what our theory requires.

In the noisy case (i.e. \({\varvec{\epsilon }}\ne {\varvec{0}}\)), the bound (3.8) provides an upper bound on the objective value consisting of two terms. The first term converges to 0 at a linear rate, similar to the noiseless case. The second term is a constant multiple of the squared norm of the unavoidable noise term \( \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \). In other words, Algorithm 1 finds a solution whose objective value is at most a constant multiple of the objective value at the true permutation \({\varvec{P}}^*\).

Theorem 3 proves a convergence guarantee on the objective value. The next result provides an upper bound on the \(\ell _{\infty }\)-norm of the mismatched entries, i.e., \(\Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }\). For any \({\varvec{Q}} \in \varPi _n\), define

$$\begin{aligned} G({\varvec{Q}}) ~ := ~\Vert \widetilde{{\varvec{H}}}{\varvec{Q}} {\varvec{y}} \Vert ^2 - \min _{{\varvec{P}} \in {\mathcal {N}}_2({\varvec{Q}})\cap {\mathcal {N}}_R(\varvec{I}_n)} \Vert \widetilde{{\varvec{H}}}{\varvec{P}} {\varvec{y}} \Vert ^2 \end{aligned}$$
(3.9)

that is, \(G({\varvec{Q}}) \) is the decrease in the objective value after one step of local search starting at \({\varvec{Q}}\). For the permutation matrices \(\{{\varvec{P}}^{(k)}\}_{k \ge 0}\) generated by Algorithm 1, we know \(G({\varvec{P}}^{(k)}) = \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 - \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k+1)} {\varvec{y}} \Vert ^2 \).

Theorem 4

(\(\ell _\infty \)-bound on mismatched pairs) Suppose Assumptions 1 and 2 hold, and let \(\{{\varvec{P}}^{(k)}\}_{k \ge 0}\) be the permutation matrices generated by Algorithm 1. Then for all \(k\ge 0\) it holds

$$\begin{aligned} \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2 \le 800 {{\bar{\sigma }}}^2 + 10G({\varvec{P}}^{(k)}) \ . \end{aligned}$$

Theorem 4 states that the largest squared error of the mismatched pairs (i.e., \(\Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2\)) is bounded above by a constant multiple of the one-step decrease in objective value (i.e. \(G({\varvec{P}}^{(k)})\)) plus a term comparable to the noise level \(O({{\bar{\sigma }}}^2)\). In particular, if Algorithm 1 is terminated at an iteration \({\varvec{P}}^{(k)}\) with \( G({\varvec{P}}^{(k)})\) of the order of \({\bar{\sigma }}^2\), then \(\Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2_{\infty } \) is bounded by a constant multiple of \( {{\bar{\sigma }}}^2\).

Note that the constant 800 in the bound of Theorem 4 is conservative and may be improved with a careful adjustment of the constants appearing in the proof and in the assumptions.

In light of Theorem 4, we can prove an upper bound on the estimation error of \({\varvec{\beta }}^*\), using an additional assumption stated below.

Assumption 5

There exists a constant \({{\bar{\gamma }}} >0\) such that

$$\begin{aligned} (1) \quad \lambda _{\min } \bigg (\frac{1}{n} {\varvec{X}}^\top \varvec{X}\bigg ) \ge {{\bar{\gamma }}}~~~\quad (2) \quad \Vert ({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top {\varvec{\epsilon }} \Vert \le {{\bar{\sigma }}} \sqrt{\frac{d}{n{{\bar{\gamma }}}}} \ . \end{aligned}$$
(3.10)

where \({{\bar{\sigma }}}\) is as defined in Assumption 1.

Section 3.3 presents conditions on \({\varvec{X}},\varvec{\epsilon }\) under which Assumption 5 is satisfied with high probability.

Theorem 6

(Estimation error) Suppose Assumptions 1, 2 and 5 hold. Suppose iteration k of Algorithm 1 satisfies \(G({\varvec{P}}^{(k)}) \le c{{\bar{\sigma }}}^2\) for a constant \(c>0\). Let \({\varvec{\beta }}^{(k)} := ({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top {\varvec{P}}^{(k)} {\varvec{y}}\) and denote \({{\bar{c}}}:= 800+ 10c\). Then we have

$$\begin{aligned} \Vert {\varvec{\beta }}^{(k)} - {\varvec{\beta }}^* \Vert ^2 ~\le ~ 4{{\bar{\gamma }}}^{-1}{{\bar{c}}} \frac{R {{\bar{\sigma }}}^2 }{n} + 2 {{\bar{\gamma }}}^{-1} \frac{d {{\bar{\sigma }}}^2}{n} \end{aligned}$$
(3.11)

and

$$\begin{aligned} \frac{1}{n} \bigg \Vert ( {\varvec{P}}^{(k)} )^{-1} {\varvec{X}} {\varvec{\beta }}^{(k)} - ({\varvec{P}}^*)^{-1} {\varvec{X}} {\varvec{\beta }}^* \bigg \Vert ^2 ~\le ~ 2 \left( \sqrt{2{{\bar{c}}}}+3\right) ^2 \frac{R{{\bar{\sigma }}}^2}{n} + \frac{2d{{\bar{\sigma }}}^2}{n}. \end{aligned}$$
(3.12)

Theorem 6 (cf. bound (3.11)) states that as long as k is sufficiently large, the estimation error \(\Vert {\varvec{\beta }}^{(k)} - {\varvec{\beta }}^* \Vert ^2 \) is of the order \(O(({r+d}){{\bar{\sigma }}}^2/n)\), assuming \({\bar{\gamma }}\) is a constant. Therefore, as \(n\rightarrow \infty \) (with r, d fixed), the estimator delivered by our algorithm (after sufficiently many iterations) converges to the true regression coefficient vector \({\varvec{\beta }}^*\). In addition, (3.12) provides an upper bound on the entrywise “denoising error” (the left-hand side of (3.12)), which is of the order \(O((r+d){{\bar{\sigma }}}^2/n)\). See [14] for past works and discussions on this error metric.
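To make the two error metrics concrete, the lines below continue our illustrative snippets from Sects. 1 and 2 (so X, y, r, pi, beta_star and local_search are assumed to be in scope; this is not part of the paper's experiments) and evaluate the left-hand sides of (3.11) and (3.12):

```python
import numpy as np

pi_hat, beta_hat = local_search(X, y, R=2 * r)   # R chosen by us purely for illustration

# left-hand side of (3.11): squared estimation error of beta*
est_err = float(np.sum((beta_hat - beta_star) ** 2))

# left-hand side of (3.12): entrywise "denoising" error; note (P^{-1} v)_{pi[i]} = v_i
fit_hat = np.empty_like(y);  fit_hat[pi_hat] = X @ beta_hat
fit_true = np.empty_like(y); fit_true[pi] = X @ beta_star
denoise_err = float(np.mean((fit_hat - fit_true) ** 2))

print(est_err, denoise_err)
```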

The following theorem provides an upper bound on the total number of local search steps needed to find a \({\varvec{P}}^{(k)}\) with \(G({\varvec{P}}^{(k)}) \le c {{\bar{\sigma }}}^2\).

Theorem 7

(Iteration complexity) Suppose Assumptions 1 and 2 hold. Let \(\{{\varvec{P}}^{(k)}\}_{k \ge 0}\) be the permutation matrices generated by Algorithm 1. Given any \(c>0\), define

$$\begin{aligned} K^\dagger := \left\lceil \log \bigg ( \frac{36\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2}{\Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(0)} {\varvec{y}} \Vert ^2} \bigg ) \bigg / \log \bigg ( 1 - \frac{1}{18R} \bigg ) ~+ ~ \frac{72n}{c} \right\rceil +1 \ . \end{aligned}$$
(3.13)

Then there exists \(0\le k \le K^\dagger \) such that \(G({\varvec{P}}^{(k)}) \le c{{\bar{\sigma }}}^2\).

Proof

Denote

$$\begin{aligned} K_1 := \left\lceil \log \bigg ( \frac{36\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2}{\Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(0)} {\varvec{y}} \Vert ^2} \bigg ) \bigg / \log \bigg ( 1 - \frac{1}{18R} \bigg ) \right\rceil \ . \end{aligned}$$
(3.14)

Then by Theorem 3, after \(K_1\) iterations, it holds

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(K_1)} {\varvec{y}} \Vert ^2 \le 36 \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2 + 36\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2 = 72 \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2 \le 72 n {{\bar{\sigma }}}^2 \end{aligned}$$
(3.15)

where the second inequality follows from Assumption 1 (4). Suppose now that \(G({\varvec{P}}^{(k)}) > c {{\bar{\sigma }}}^2 \) for all \(K_1 \le k \le K^\dagger -1\); then

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(K^\dagger )} {\varvec{y}} \Vert ^2 = \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(K_1)} {\varvec{y}} \Vert ^2 - \sum _{k=K_1}^{K^{\dagger }-1} G({\varvec{P}}^{(k)}) < 72 n {{\bar{\sigma }}}^2 - \frac{72n}{c} c {{\bar{\sigma }}}^2 = 0 \ , \end{aligned}$$

which is a contradiction. So there must exist some \(K_1 \le k \le K^\dagger -1\) such that \(G({\varvec{P}}^{(k)}) \le c {{\bar{\sigma }}}^2 \). \(\square \)

Note that if R and \(\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2/ \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(0)} {\varvec{y}} \Vert ^2\) are bounded by a constant, then the number of iterations \(K^\dagger = O(n)\). Therefore, in this situation, one can find an estimate \({\varvec{\beta }}^{(k)}\) satisfying \( \Vert {\varvec{\beta }}^{(k)} - {\varvec{\beta }}^* \Vert ^2 \le O((d+r){{\bar{\sigma }}}^2/n)\) within O(n) iterations of Algorithm 1.
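As a quick illustration of how (3.13) is evaluated, the sketch below computes \(K^\dagger \) for given R, c, n and the two squared norms (the numerical values in the example call are ours and purely illustrative):

```python
import math

def K_dagger(n, R, c, noise_sq, init_sq):
    """Iteration bound (3.13); noise_sq = ||H_tilde eps||^2 (assumed > 0),
    init_sq = ||H_tilde P^(0) y||^2."""
    k1 = math.log(36.0 * noise_sq / init_sq) / math.log(1.0 - 1.0 / (18.0 * R))
    return math.ceil(k1 + 72.0 * n / c) + 1

print(K_dagger(n=1000, R=20, c=1.0, noise_sq=1e-3, init_sq=1.0))   # example values (ours)
```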

3.2 Proofs of main theorems

In this section, we present the proofs of Proposition 1, Theorem 3, Theorem 4 and Theorem 6. We first present a technical result used in our proofs.

Lemma 2

Suppose Assumption 1 holds. Let \(\{{\varvec{P}}^{(k)}\}_{k \ge 0}\) be the permutation matrices generated by Algorithm 1. Suppose \(\Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \ge L \) for some \( k\ge 1 \). Suppose at least one of the two conditions holds: (i) \(k\le R/2\); or (ii) \(k\ge R/2 + 1\), and \(\mathsf {supp}({\varvec{P}}^*) \subseteq \mathsf {supp}({\varvec{P}}^{(k')})\) for all \( R/2 \le k'\le k - 1\). Then for all \(t\le k-1\), we have

$$\begin{aligned} \Vert {\varvec{P}}^{(t+1)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 - \Vert {\varvec{P}}^{(t)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 \le -L^2/5 \ . \end{aligned}$$
(3.16)

The proof of Lemma 2 is presented in Sect. A.3. As mentioned earlier, our analysis makes use of the one-step decrease condition in Lemma 1. Note, however, that if the permutation matrix at the current iteration, denoted by \({\varvec{P}}^{(k)}\), is on the boundary, i.e. \(\mathsf {dist}({\varvec{P}}^{(k)}, {\varvec{I}}_n) = R\), it is not clear whether the permutation found by Lemma 1 lies within the search region \({\mathcal {N}}_R({\varvec{I}}_n)\). Lemma 2 helps address this issue (see the proof of Theorem 3 below for details).

3.2.1 Proof of Proposition 1

We show this result by contradiction. Suppose that there exists a \(k\ge R/2\) such that \( \mathsf {supp}({\varvec{P}}^{*}) \not \subseteq \mathsf {supp}({\varvec{P}}^{(k)}) \). Let \( T\ge R/2 \) be the first iteration (\(\ge R/2\)) such that \( \mathsf {supp}({\varvec{P}}^{*}) \not \subseteq \mathsf {supp}({\varvec{P}}^{(T)}) \), i.e.,

$$\begin{aligned} \mathsf {supp}({\varvec{P}}^{*}) \not \subseteq \mathsf {supp}({\varvec{P}}^{(T)}) ~~~ {\mathrm{and}} ~~~ \mathsf {supp}({\varvec{P}}^{*}) \subseteq \mathsf {supp}({\varvec{P}}^{(k)}) ~~ \forall ~ R/2 \le k \le T-1 \ . \end{aligned}$$

Let \(i\in \mathsf {supp}({\varvec{P}}^{*})\) be such that \(i\notin \mathsf {supp}({\varvec{P}}^{(T)})\); then by Assumption 1 (1), we have

$$\begin{aligned} \Vert {\varvec{P}}^{(T)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \ge | {\varvec{e}}_i^\top ({\varvec{P}}^{(T)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}) | = | \varvec{e}_i^\top ( {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}) | \ge L. \end{aligned}$$

By Lemma 2, we have \( \Vert {\varvec{P}}^{(k+1)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 - \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 \le -L^2/5 \) for all \(k\le T-1\). As a result,

$$\begin{aligned} \Vert {\varvec{P}}^{(T)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}\Vert ^2 - \Vert {\varvec{P}}^{(0)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}\Vert ^2 \le -TL^2/5 \le -RL^2/10 \ . \end{aligned}$$

Since by Assumption 1 (1), \( \Vert {\varvec{P}}^{(0)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 = \Vert {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 \le r U^2 \ , \) we have

$$\begin{aligned} \Vert {\varvec{P}}^{(T)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}\Vert ^2 \le r U^2 - RL^2/10 \le r U^2 - \frac{L^2}{10} \frac{10C_1rU^2}{L^2} = (1-C_1) rU^2<0 \ . \end{aligned}$$

This is a contradiction, so such an iteration counter T does not exist; and for all \(k\ge R/2\), we have \( \mathsf {supp}({\varvec{P}}^{*}) \subseteq \mathsf {supp}({\varvec{P}}^{(k)}) \).

3.2.2 Proof of Theorem 3

Let \(a = 2.2\). Because \(R\ge 10r\) (which follows from Assumption 1 (2), since \(C_1 > 1\) and \(U \ge L\)), we have \(r +R \le 1.1R = aR/2\); and for any \(k\ge 0\):

$$\begin{aligned}\Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _0 \le \mathsf {dist}( {\varvec{P}}^{(k)} , {\varvec{I}}_n ) + \mathsf {dist}( {\varvec{P}}^* , {\varvec{I}}_n ) \le R+r\le aR/2.\end{aligned}$$

Hence, by Lemma 1, there exists a permutation matrix \(\widetilde{{\varvec{P}}}^{(k)} \in \varPi _n\) such that \(\mathsf {dist}( \widetilde{{\varvec{P}}}^{(k)} , {\varvec{P}}^{(k)} ) \le 2 \), \(\mathsf {supp}( \widetilde{{\varvec{P}}}^{(k)} ({\varvec{P}}^*)^{-1}) \subseteq \mathsf {supp}( {\varvec{P}}^{(k)} ({\varvec{P}}^*)^{-1} ) \) and

$$\begin{aligned} \Vert \widetilde{{\varvec{P}}}^{(k)} {\varvec{y}}- {\varvec{P}}^* {\varvec{y}}\Vert ^2 \le (1-1/(aR) ) \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 \ . \end{aligned}$$

As a result,

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}(\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}}- {\varvec{P}}^* {\varvec{y}}) \Vert ^2 \le \Vert \widetilde{{\varvec{P}}}^{(k)} {\varvec{y}}- {\varvec{P}}^* {\varvec{y}} \Vert ^2\le & {} (1-1/(aR) ) \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 \\\le & {} \frac{1-1/(aR)}{1-R\rho _n} \Vert \widetilde{{\varvec{H}}}({\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}) \Vert ^2 \ , \end{aligned}$$

where the last inequality is from Assumption 1 (3). Note that by Assumption 2 (1), we have \(R^2 \rho _n \le 1/10 \le 1/(2a) \), so \( R\rho _n \le 1/(2aR) \). Because \( (1- 1/(aR)) \le (1-1/(2aR))^2 \), we have

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}(\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}}- {\varvec{P}}^* {\varvec{y}}) \Vert ^2\le & {} \frac{1-1/(aR)}{1-1/(2aR)} \Vert \widetilde{{\varvec{H}}}({\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^*{\varvec{y}}) \Vert ^2 \\\le & {} (1-1/(2aR)) \Vert \widetilde{{\varvec{H}}}({\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^*{\varvec{y}}) \Vert ^2 \ . \end{aligned}$$

Recall that (1.1) leads to \(\widetilde{{\varvec{H}}}{\varvec{P}}^* {\varvec{y}} = \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\), so we have

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}}- \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \le (1-1/(2aR)) \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} - \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \ . \end{aligned}$$
(3.17)

Let \(\eta := 1/(2aR)\) and \({\varvec{z}}:= \widetilde{{\varvec{P}}}^{(k)} {\varvec{y}} - {\varvec{P}}^{(k)} {\varvec{y}}\); then (3.17) leads to:

$$\begin{aligned}&\Vert \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}} \Vert ^2 \nonumber \\&\quad \le (1-\eta )\Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 - \eta \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 + 2 \langle \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}}, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle - 2(1-\eta ) \langle \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}}, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle \nonumber \\&\quad \le (1-\eta )\Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 + 2 (1-\eta ) \langle \widetilde{{\varvec{H}}}{\varvec{z}} , \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle + 2\eta \langle \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}}, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle \nonumber \\&\quad \le (1-\eta )\Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 + 2|\langle \widetilde{{\varvec{H}}}{\varvec{z}} , \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle | + 2\eta |\langle \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}}, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle | \ \end{aligned}$$
(3.18)

where, to arrive at the second inequality, we drop the term \(- \eta \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2\). We now make use of the following claim whose proof is in Sect. A.7:

$$\begin{aligned} {\textbf {Claim.}} ~~~~ 2|\langle \widetilde{{\varvec{H}}}{\varvec{z}} , \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle | \le \frac{\eta }{4}\Vert \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}}\Vert ^2 + \frac{\eta }{4}\Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}}\Vert ^2 + 4\eta \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \ . \end{aligned}$$
(3.19)

On the other hand, by Cauchy-Schwarz inequality,

$$\begin{aligned} 2\eta |\langle \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}}, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle | \le ({\eta }/{4}) \Vert \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}} \Vert ^2 + 4\eta \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2 \ . \end{aligned}$$
(3.20)

Combining (3.18), (3.19) and (3.20), we have

$$\begin{aligned} \begin{aligned} \Vert \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}} \Vert ^2&\le (1-\eta )\Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 + (\eta /2) \Vert \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}} \Vert ^2 \\&\quad + (\eta /4) \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 +8 \eta \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \ . \end{aligned} \end{aligned}$$

After some rearrangement, the above leads to:

$$\begin{aligned} \begin{aligned} \Vert \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}} \Vert ^2&\le \frac{1-3\eta /4}{1-\eta /2} \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2+\frac{8 \eta }{1-\eta /2} \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \\&\le (1-\eta /4) \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2+9 \eta \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \ \end{aligned} \end{aligned}$$
(3.21)

where the second inequality uses \( 1-3\eta /4 \le (1-\eta /2) (1-\eta /4)\) and \( (1-\eta /2)^{-1} \le 9/8 \) (recall, \(\eta = 1/(2aR)\)).

To complete the proof, we use another claim whose proof is in Sect. A.6:

$$\begin{aligned} {\textbf {Claim.}} ~~~ \mathrm{For~ any~ } k\ge 0 \mathrm{~ it ~ holds~ that~} \widetilde{{\varvec{P}}}^{(k)} \in {\mathcal {N}}_R({\varvec{I}}_n) \cap {\mathcal {N}}_{2}({\varvec{P}}^{(k)}) \ . \end{aligned}$$
(3.22)

By the above claim, the update rule (2.3) and inequality (3.21), we have

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k+1)} {\varvec{y}}\Vert ^2 \le \Vert \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}} \Vert ^2 \le (1-\eta /4) \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2+9 \eta \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \ . \end{aligned}$$

Using the notation \(a_k:= \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2\), \(\lambda =1-\eta /4\) and \( {\tilde{e}} = 9 \eta \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \), the above inequality leads to: \( a_{k+1} \le \lambda a_k + {\tilde{e}}\) for all \(k\ge 0\). Therefore, we have

$$\begin{aligned} \frac{a_{k+1}}{\lambda ^{k+1}} \le \frac{a_{k}}{\lambda ^{k}} + \frac{{\tilde{e}}}{\lambda ^{k+1}} \le \frac{a_{k-1}}{\lambda ^{k-1}} + \frac{{\tilde{e}}}{\lambda ^{k}} + \frac{{\tilde{e}}}{\lambda ^{k+1}} \le \cdots \le \frac{a_0}{\lambda ^0} + {\tilde{e}} \sum _{i=1}^{k+1} \frac{1}{\lambda ^i} \ , \end{aligned}$$

which implies \( a_k \le a_0 \lambda ^k + {\tilde{e}} \sum _{i=1}^k \lambda ^{i-1} \le a_0 \lambda ^k + ({{\tilde{e}}}/{(1-\lambda )}) \). This leads to

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2\le & {} (1-\eta /4)^k \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(0)} {\varvec{y}} \Vert ^2 + \frac{9 \eta \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2}{\eta /4} \\\le & {} (1- {1}/{(8aR)})^k \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(0)}{\varvec{y}} \Vert ^2 + 36 \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \ . \end{aligned}$$

Recalling that \( 8a \le 18 \), we conclude the proof of the theorem.

3.2.3 Proof of Theorem 4

By the definition of \(G(\cdot )\), we have

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 \le \Vert \widetilde{{\varvec{H}}}{\varvec{P}} {\varvec{y}} \Vert ^2 + G({\varvec{P}}^{(k)}) ~~~ \forall ~ {\varvec{P}}\in {\mathcal {N}}_2({\varvec{P}}^{(k)}) \cap {\mathcal {N}}_R({\varvec{I}}_n) \ . \end{aligned}$$
(3.23)

By Lemma 1, there exists a permutation matrix \(\widetilde{{\varvec{P}}}^{(k)} \in \varPi _n\) such that

$$\begin{aligned}\mathsf {dist}( \widetilde{{\varvec{P}}}^{(k)} ,{\varvec{P}}^{(k)} ) \le 2, ~~\mathsf {supp}( \widetilde{{\varvec{P}}}^{(k)} ({\varvec{P}}^*)^{-1}) \subseteq \mathsf {supp}( {\varvec{P}}^{(k)} ({\varvec{P}}^*)^{-1} )\end{aligned}$$

and

$$\begin{aligned} \Vert \widetilde{{\varvec{P}}}^{(k)} {\varvec{y}}- {\varvec{P}}^* {\varvec{y}}\Vert ^2 - \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^*{\varvec{y}} \Vert ^2 \le - (1/2) \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^*{\varvec{y}} \Vert ^2_{\infty } \ . \end{aligned}$$
(3.24)

By Claim (3.22) we have

$$\begin{aligned} \widetilde{{\varvec{P}}}^{(k)} \in {\mathcal {N}}_R({\varvec{I}}_n) \cap {\mathcal {N}}_{2}({\varvec{P}}^{(k)}) \ . \end{aligned}$$
(3.25)

Therefore, by (3.23) and (3.25), we have

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 \le \Vert \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(k)} {\varvec{y}} \Vert ^2 + G({\varvec{P}}^{(k)})\ . \end{aligned}$$

Let \({\varvec{z}} := \widetilde{{\varvec{P}}}^{(k)} {\varvec{y}} - {\varvec{P}}^{(k)} {\varvec{y}}\). Recall that \(\widetilde{{\varvec{H}}}{\varvec{P}}^* {\varvec{y}} = \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\), so by the inequality above we have

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}({\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}) + \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \le \Vert \widetilde{{\varvec{H}}}({\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}) + \widetilde{{\varvec{H}}}{\varvec{\epsilon }} + \widetilde{{\varvec{H}}}{\varvec{z}} \Vert ^2 +G({\varvec{P}}^{(k)}) \end{aligned}$$

which is equivalent to

$$\begin{aligned} -2 \langle \widetilde{{\varvec{H}}}({\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^*{\varvec{y}}) , \widetilde{{\varvec{H}}}{\varvec{z}} \rangle - \Vert \widetilde{{\varvec{H}}}{\varvec{z}} \Vert ^2 \le 2 \langle \widetilde{{\varvec{H}}}{\varvec{z}}, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle + G({\varvec{P}}^{(k)}) \ . \end{aligned}$$
(3.26)

On the other hand, from (3.24) we have

$$\begin{aligned} \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} + {\varvec{z}} \Vert ^2 - \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2\le - (1/2) \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2 \ \end{aligned}$$

or equivalently,

$$\begin{aligned} 2 \langle {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}, {\varvec{z}} \rangle + \Vert {\varvec{z}}\Vert ^2 \le - (1/2) \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2 \ . \end{aligned}$$
(3.27)

Summing up (3.26) and (3.27) we have

$$\begin{aligned} \begin{aligned}&~ 2 \langle {\varvec{H}}( {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}), {\varvec{H}} {\varvec{z}} \rangle + \Vert {\varvec{H}} {\varvec{z}}\Vert ^2 \\&\quad \le ~ 2 \langle \widetilde{{\varvec{H}}}{\varvec{z}}, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle - (1/2) \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2 + G({\varvec{P}}^{(k)}) \ . \end{aligned} \end{aligned}$$
(3.28)

Note that

$$\begin{aligned} \begin{aligned}&\ 2 | \langle {\varvec{H}}( {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}), {\varvec{H}} {\varvec{z}} \rangle | \le 2 \Vert {\varvec{H}}( {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}) \Vert \cdot \Vert {\varvec{H}} \varvec{z} \Vert \\&\quad \le \ 2 \sqrt{R\rho _n} \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert \Vert {\varvec{H}} {\varvec{z}} \Vert \le 2\sqrt{2} R \sqrt{\rho _n} \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \Vert {\varvec{H}} {\varvec{z}} \Vert \end{aligned} \end{aligned}$$
(3.29)

where the second inequality is by Assumption 1 (3) and the third inequality uses \(\Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _0 \le 2R\). From Assumption 2 (1) we have \(R\sqrt{\rho _n} \le 1/\sqrt{10}\), hence

$$\begin{aligned} \begin{aligned}&~ 2 | \langle {\varvec{H}}( {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}), {\varvec{H}} {\varvec{z}} \rangle | \le \frac{2}{\sqrt{5}} \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \Vert {\varvec{H}} {\varvec{z}} \Vert \\&\quad \le ~ \frac{1}{5} \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2 + \Vert {\varvec{H}} {\varvec{z}} \Vert ^2 \end{aligned} \end{aligned}$$
(3.30)

where the last inequality is by Cauchy-Schwarz inequality. Rearranging terms in (3.28), and making use of (3.30), we have

$$\begin{aligned}&\Vert {\varvec{H}} {\varvec{z}}\Vert ^2 + (1/2) \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2 \\&\quad \le 2 \langle \widetilde{{\varvec{H}}}{\varvec{z}}, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle - 2 \langle {\varvec{H}}( {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}), {\varvec{H}} {\varvec{z}} \rangle + G({\varvec{P}}^{(k)}) \\&\quad \le 2 \langle \widetilde{{\varvec{H}}}{\varvec{z}}, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle + (1/5) \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2 + \Vert {\varvec{H}} {\varvec{z}} \Vert ^2 + G({\varvec{P}}^{(k)}) \ . \end{aligned}$$

As a result,

$$\begin{aligned} (3/10) \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2\le & {} 2 \langle \widetilde{{\varvec{H}}}{\varvec{z}}, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle + G({\varvec{P}}^{(k)}) \nonumber \\= & {} 2\langle {\varvec{z}}, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle + G({\varvec{P}}^{(k)}). \end{aligned}$$
(3.31)

By the definition of \({\varvec{z}}\), we know there exist \(i,j \in [n]\) such that

$$\begin{aligned}{\varvec{z}} = \Vert {\varvec{z}}\Vert _{\infty } ({\varvec{e}}_i - {\varvec{e}}_j) = (\Vert {\varvec{z}} \Vert /\sqrt{2})({\varvec{e}}_i - {\varvec{e}}_j).\end{aligned}$$

Therefore

$$\begin{aligned} 2 \langle {\varvec{z}}, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle = \sqrt{2} \Vert {\varvec{z}}\Vert \langle {\varvec{e}}_i - {\varvec{e}}_j, \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \rangle \le 2\sqrt{2} \Vert {\varvec{z}}\Vert \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert _{\infty } \le 2\sqrt{2} {{\bar{\sigma }}} \Vert \varvec{z}\Vert \end{aligned}$$
(3.32)

where the last inequality makes use of Assumption 1 (4). On the other hand, by (3.27) we have

$$\begin{aligned} \Vert {\varvec{z}}\Vert ^2 \le 2 |\langle {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}, {\varvec{z}} \rangle | = \sqrt{2} \Vert {\varvec{z}} \Vert |\langle {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}, {\varvec{e}}_i - {\varvec{e}}_j \rangle | \ , \end{aligned}$$

and hence

$$\begin{aligned} \Vert {\varvec{z}}\Vert \le \sqrt{2} |\langle {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}, {\varvec{e}}_i - {\varvec{e}}_j \rangle | \le 2\sqrt{2} \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \ . \end{aligned}$$
(3.33)

Combining (3.31), (3.32) and (3.33), we have

$$\begin{aligned} (3/10) \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2\le & {} 8 {{\bar{\sigma }}} \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } + G({\varvec{P}}^{(k)}) \nonumber \\\le & {} 80 {{\bar{\sigma }}}^2 + (1/5) \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2 + G({\varvec{P}}^{(k)}) \end{aligned}$$
(3.34)

where the second inequality is by the Cauchy-Schwarz inequality. The inequality in display (3.34) leads to

$$\begin{aligned} \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2 \le 800 {{\bar{\sigma }}}^2 + 10G({\varvec{P}}^{(k)}) \ \end{aligned}$$

which completes the proof of this theorem.

3.2.4 Proof of Theorem 6

Recall that \({\varvec{P}}^* {\varvec{y}} = {\varvec{X}} {\varvec{\beta }}^* + {\varvec{\epsilon }}\), so it holds \({\varvec{\beta }}^* = ({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top ({\varvec{P}}^* {\varvec{y}} - {\varvec{\epsilon }} )\). Therefore

$$\begin{aligned} {\varvec{\beta }}^{(k)} - {\varvec{\beta }}^* = ({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top ( {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} + {\varvec{\epsilon }}) \ . \end{aligned}$$
(3.35)

Hence we have

$$\begin{aligned} \begin{aligned} \Vert {\varvec{\beta }}^{(k)} - {\varvec{\beta }}^* \Vert&\le \Vert ({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top ( {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} ) \Vert + \Vert ({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top {\varvec{\epsilon }} \Vert \\&\le \Vert ({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top ( {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} ) \Vert + {{\bar{\sigma }}} \sqrt{{d}/{(n{{\bar{\gamma }}})}} \end{aligned} \end{aligned}$$
(3.36)

where the second inequality is by Assumption 5 (2). Note that

$$\begin{aligned} \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert \le \sqrt{2R} \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \le {{\bar{\sigma }}} \sqrt{2R{{\bar{c}}}} \end{aligned}$$
(3.37)

where the first inequality is because \({\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}\) has at most 2R non-zero coordinates; the second inequality makes use of Theorem 4 and the definition of \({{\bar{c}}}\) in Theorem 6. On the other hand, by Assumption 5 (1) we have

$$\begin{aligned} |||({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top |||_2 = \sqrt{|||({\varvec{X}}^\top {\varvec{X}})^{-1} |||_2} \le 1/\sqrt{n{{\bar{\gamma }}}} \ . \end{aligned}$$
(3.38)

Combining (3.36), (3.37) and (3.38) we have

$$\begin{aligned} \Vert {\varvec{\beta }}^{(k)} - {\varvec{\beta }}^* \Vert\le & {} |||({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top |||_2 \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert + {{\bar{\sigma }}} \sqrt{{d}/{(n{{\bar{\gamma }}})}} \\\le & {} {{\bar{\sigma }}} \sqrt{2{{\bar{\gamma }}}^{-1}R{{\bar{c}}}/n} + {{\bar{\sigma }}} \sqrt{{d{{\bar{\gamma }}}^{-1}}/{n}} \ . \end{aligned}$$

Squaring both sides of the above, we get

$$\begin{aligned} \Vert {\varvec{\beta }}^{(k)} - {\varvec{\beta }}^* \Vert ^2 ~\le ~ 4{{\bar{\gamma }}}^{-1}{{\bar{c}}} \frac{R {{\bar{\sigma }}}^2 }{n} + 2 {{\bar{\gamma }}}^{-1} \frac{d {{\bar{\sigma }}}^2}{n} \ \end{aligned}$$

which completes the proof of (3.11).

We will now prove (3.12). Let us denote \(\varvec{J}:= ( {\varvec{P}}^{(k)} )^{-1} {\varvec{X}} {\varvec{\beta }}^{(k)} - ({\varvec{P}}^*)^{-1} {\varvec{X}} {\varvec{\beta }}^*\). Note that we can write

$$\begin{aligned} {\varvec{J}} ~=~ ( {\varvec{P}}^{(k)} )^{-1} {\varvec{X}} ({\varvec{\beta }}^{(k)} - {\varvec{\beta }}^*) + ( ( {\varvec{P}}^{(k)} )^{-1} - ({\varvec{P}}^*)^{-1} ) {\varvec{X}} {\varvec{\beta }}^* \ . \end{aligned}$$
(3.39)

Multiplying both sides of (3.35) by \(({\varvec{P}}^{(k)} )^{-1} {\varvec{X}}\), we have

$$\begin{aligned} ( {\varvec{P}}^{(k)} )^{-1} {\varvec{X}} ({\varvec{\beta }}^{(k)} - {\varvec{\beta }}^*) = ( {\varvec{P}}^{(k)} )^{-1} {\varvec{H}} ( {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} + {\varvec{\epsilon }}) \ . \end{aligned}$$
(3.40)

On the other hand,

$$\begin{aligned} \begin{aligned}&~ ( ( {\varvec{P}}^{(k)} )^{-1} - ({\varvec{P}}^*)^{-1} ) {\varvec{X}} {\varvec{\beta }}^* \\&\quad =~ ( ( {\varvec{P}}^{(k)} )^{-1} - ({\varvec{P}}^*)^{-1} ) ({\varvec{P}}^* {\varvec{y}} - {\varvec{\epsilon }}) \\&\quad =~ ({\varvec{P}}^{(k)})^{-1} ( {\varvec{P}}^* {\varvec{y}} - {\varvec{P}}^{(k)} {\varvec{y}}) - ( ( {\varvec{P}}^{(k)} )^{-1} - ({\varvec{P}}^*)^{-1} ) {\varvec{\epsilon }} \ . \end{aligned} \end{aligned}$$
(3.41)

Combining (3.39), (3.40) and (3.41) we have

$$\begin{aligned} \begin{aligned} {\varvec{J}}&= \underbrace{( {\varvec{P}}^{(k)} )^{-1} \widetilde{{\varvec{H}}}( {\varvec{P}}^* {\varvec{y}} - {\varvec{P}}^{(k)} {\varvec{y}} )}_{:={\varvec{J}}_{1}}~+~ \underbrace{( {\varvec{P}}^{(k)} )^{-1} {\varvec{H}} {\varvec{\epsilon }}}_{:={\varvec{J}}_{2}} ~+~ \underbrace{( ({\varvec{P}}^*)^{-1} - ( {\varvec{P}}^{(k)} )^{-1} ) {\varvec{\epsilon }}}_{:={\varvec{J}}_{3}} \\&= {\varvec{J}}_1 + {\varvec{J}}_2 + {\varvec{J}}_3 \ . \end{aligned} \end{aligned}$$
(3.42)

By (3.37), we know

$$\begin{aligned} \Vert {\varvec{J}}_1 \Vert \le \Vert {\varvec{P}}^* {\varvec{y}} - {\varvec{P}}^{(k)} {\varvec{y}} \Vert \le {{\bar{\sigma }}} \sqrt{2R\bar{c}} \ . \end{aligned}$$
(3.43)

By Assumption 1 (4) we have

$$\begin{aligned} \Vert {\varvec{J}}_2 \Vert = \Vert {\varvec{H}} {\varvec{\epsilon }} \Vert \le \sqrt{d} {{\bar{\sigma }}} \ . \end{aligned}$$
(3.44)

Since \(\mathsf {dist}(({\varvec{P}}^*)^{-1} , ( {\varvec{P}}^{(k)} )^{-1}) \le 2R\), it holds

$$\begin{aligned} \Vert {\varvec{J}}_3 \Vert = \Vert ({\varvec{P}}^*)^{-1} {\varvec{\epsilon }} - ( {\varvec{P}}^{(k)} )^{-1} {\varvec{\epsilon }} \Vert \le 2\sqrt{2R} \Vert {\varvec{\epsilon }}\Vert _\infty \le 3 \sqrt{R} {{\bar{\sigma }}} \end{aligned}$$
(3.45)

where the last inequality makes use of Assumption 1 (4).

Using (3.43), (3.44) and (3.45) to bound the right-hand side of (3.42), we have

$$\begin{aligned} \Vert {\varvec{J}} \Vert ~ \le ~ {{\bar{\sigma }}} \sqrt{2R{{\bar{c}}}} + \sqrt{d} {{\bar{\sigma }}} + 3 \sqrt{R} {{\bar{\sigma }}} \ . \end{aligned}$$

As a result,

$$\begin{aligned} \frac{1}{n} \Vert {\varvec{J}} \Vert ^2 ~ \le ~ 2 (\sqrt{2{{\bar{c}}}}+3)^2 \frac{R{{\bar{\sigma }}}^2}{n} + \frac{2d{{\bar{\sigma }}}^2}{n} \ . \end{aligned}$$

This completes the proof of (3.12).

3.3 Sufficient conditions for assumptions to hold

Our analysis in Sects. 3.1 and 3.2 was completely deterministic under Assumptions 1, 2 and 5. To provide some intuition, in the following we discuss some probability models on \({\varvec{X}}\) and \({\varvec{\epsilon }}\) under which Assumption 1 (3), (4), Assumption 2 (2) and Assumption 5 hold true with high probability.

3.3.1 A random model for the matrix \({\varvec{X}}\)

When the rows of \({\varvec{X}}\) are iid draws from a well-behaved probability distribution, Assumption 1 (3) and Assumption 5 (1) hold true with high probability. This is formalized in the following lemma.

Lemma 3

Suppose the rows \(\varvec{x}_1,\ldots ,{\varvec{x}}_n\) of the matrix \({\varvec{X}}\) are iid zero-mean random vectors in \({\mathbb {R}}^d\) with covariance matrix \({\varvec{\varSigma }}\in {\mathbb {R}}^{d\times d}\). Suppose there exist constants \(\gamma , b,V>0\) such that \( \lambda _{\min } ({\varvec{\varSigma }}) \ge \gamma \), \( \Vert {\varvec{x}}_i \Vert \le b \) and \( \Vert {\varvec{x}}_i \Vert _{\infty } \le V \) almost surely. Given any \(\tau >0\), define

$$\begin{aligned} \delta _{n,m}:= 16V^2 \bigg (\frac{d}{n\gamma }\log (2d/\tau ) + \frac{dm}{n\gamma } \log (3n^2) \bigg ) \ . \end{aligned}$$

Suppose n is large enough such that \(\sqrt{\delta _{n,m}} \ge 2/n\) and \( 3b^2 |||{\varvec{\varSigma }}|||_2 \log (2d/\tau ) / n\le (1/4)\gamma ^2 \). Then with probability at least \( 1-2\tau \), it holds \(\lambda _{\min } (\frac{1}{n} {\varvec{X}}^\top {\varvec{X}} ) \ge \gamma /2\), and

$$\begin{aligned} \Vert {\varvec{H}} {\varvec{u}} \Vert ^2 \le {\delta _{n,m}} \Vert \varvec{u}\Vert ^2 ~~~~ \forall ~{\varvec{u}}\in {\mathcal {B}}_m \ . \end{aligned}$$
(3.46)

The proof of Lemma 3 is presented in Sect. A.4. Suppose there are universal constants \({{\bar{c}}}>0\) and \({{\bar{C}}}>0\) such that the parameters \((\gamma , V, b, |||{\varvec{\varSigma }} |||_2 ,\tau )\) in Lemma 3 satisfy \({{\bar{c}}} \le \gamma , V, b, |||{\varvec{\varSigma }} |||_2 ,\tau \le {{\bar{C}}}\). Given a pre-specified probability level (e.g., \(1-2\tau =0.99\)), under the setting of Lemma 3, if we set \(\rho _n = \delta _{n,4}\) and \({{\bar{\gamma }}} = \gamma /2\), then Assumption 1 (3) and Assumption 5 (1) hold with probability at least \(1-2\tau \).

Note that the almost sure boundedness assumption on \(\Vert \varvec{x}_i\Vert \) can be relaxed to cases where \(\Vert {\varvec{x}}_i\Vert \) is bounded with high probability (e.g., \({\varvec{x}}_i {\mathop {\sim }\limits ^{\text {iid}}} N({\varvec{0}}, {\varvec{\varSigma }})\)).

3.3.2 The error distribution

In the following, we discuss a commonly used random setting under which Assumption 1 (4) and Assumption 2 (2) hold with high probability. A random variable \(\xi \) is called sub-Gaussian [25] with variance proxy \(\vartheta ^2\) (denoted by \(\xi \in \mathsf {subG}(\vartheta ^2)\)) if \({\mathbb {E}}[\xi ] = 0\) and \({\mathbb {E}}e^{t\xi } \le e^{\vartheta ^2 t^2 /2}\) for all \(t\in {\mathbb {R}}\).

Lemma 4

Suppose \({\varvec{\epsilon }} = [\epsilon _1,\ldots ,\epsilon _n]^\top \) with \(\epsilon _1,\ldots ,\epsilon _n {\mathop {\sim }\limits ^{\text {iid}}} \mathsf {subG}(\sigma ^2)\) for some \(\sigma >0 \). Suppose \({\varvec{\epsilon }}\) is independent of \({\varvec{X}}\). Then with probability at least \( 1-\tau \) it holds

(a) \(\max \{\Vert {\varvec{\epsilon }} \Vert _{\infty }, \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert _{\infty } \} \le \sigma \sqrt{2\log (6n/\tau )}\);

(b) \( \Vert {\varvec{H}} {\varvec{\epsilon }} \Vert \le \sigma \sqrt{2d\log (6d/\tau )} \);

(c) \( \Vert ({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top {\varvec{\epsilon }} \Vert \le \lambda ^{-1/2}_{\min } ( \frac{1}{n} {\varvec{X}}^\top {\varvec{X}} ) \cdot \sigma \sqrt{2d \log (6d/\tau )/n}\).

In addition, if \({{\widetilde{\sigma }}}^2 := \text {Var}(\epsilon _i) \ge (3/4) \sigma ^2\), then there exists a universal constant \(C>0\) such that if \(\sqrt{\log (4/\tau )/(Cn)} + 2d \log (4d/\tau )/n \le 1/4 \), then with probability at least \(1-\tau \),

$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \ge (1/2)n\sigma ^2 \ . \end{aligned}$$
(3.47)

The proof of Lemma 4 is presented in Sect. A.5. Note that in Lemma 4 the assumption \(\text {Var}(\epsilon _i) \ge (3/4) \sigma ^2\) can be replaced by \( \text {Var}(\epsilon _i) \ge C_0 \sigma ^2 \) for any constant \(C_0>0\), with the conclusion changing accordingly (i.e., the factor 1/2 in (3.47) is replaced by another constant). In particular, Lemma 4 holds when \(\epsilon _i {\mathop {\sim }\limits ^{\text {iid}}} N(0,\sigma ^2)\). Given a pre-specified probability level (e.g., \(1-\tau =0.99\)), under the setting of Lemma 4, if we set \({{\bar{\sigma }}} = \sigma \sqrt{2\log (6n/\tau )}\), then Assumption 1 (4), Assumption 2 (2) and Assumption 5 (2) hold with probability at least \(1-2\tau \).

3.3.3 Summary

We summarize parameter choices informed by the results above in the following corollary.

Corollary 1

Suppose the matrix \({\varvec{X}}\) is drawn from a probability model as discussed in Lemma 3 with \(m=R\), and the noise term \({\varvec{\epsilon }}\) satisfies the assumptions in Lemma 4. Then with probability at least \(1-6\tau \), the inequalities in (3.2), (3.3), (3.7) and (3.10) hold true with the following parameters:

$$\begin{aligned} \rho _n = 16V^2 \bigg (\frac{d}{n\gamma }\log (2d/\tau ) + \frac{4d}{n\gamma } \log (3n^2) \bigg ), \end{aligned}$$
(3.48)

\({{\bar{\sigma }}} = \sigma \sqrt{2\log (6n/\tau )}\) and \({{\bar{\gamma }}} = \gamma /2\).

If in addition the conditions in Assumption 1 (1) and (2) hold true and the following four inequalities

$$\begin{aligned} R \rho _n \le L^2/(90U^2), \quad {{\bar{\sigma }}} \le \min \{0.5, (\rho _nd)^{-1/2}\} L^2/(80U) \ , \end{aligned}$$
$$\begin{aligned} R^2 \rho _n \le 1/10, \quad {{\bar{\sigma }}}^2 \le \sigma ^2 \min \{n/(660R^2), \ n/(5dR)\} \ \end{aligned}$$
(3.49)

are true, then all the statements in Assumptions 1, 2 and 5 are satisfied.
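For illustration, the following Julia sketch evaluates the parameter choices of Corollary 1 and checks the four inequalities in (3.49); the helper name and interface are ours and are not part of the paper's code, and the problem constants must be supplied by the user.

```julia
# Illustrative sketch: evaluate ρ_n (3.48), σ̄ and γ̄ as in Corollary 1, and
# report whether the four inequalities in (3.49) hold. All constants
# (n, d, R, σ, γ, V, L, U, τ) are user-supplied; the function name is ours.
function corollary1_check(; n, d, R, σ, γ, V, L, U, τ)
    ρ_n = 16V^2 * (d / (n * γ) * log(2d / τ) + 4d / (n * γ) * log(3n^2))   # (3.48)
    σ̄ = σ * sqrt(2 * log(6n / τ))
    γ̄ = γ / 2
    ok = (R * ρ_n <= L^2 / (90U^2)) &&
         (σ̄ <= min(0.5, (ρ_n * d)^(-1/2)) * L^2 / (80U)) &&
         (R^2 * ρ_n <= 1/10) &&
         (σ̄^2 <= σ^2 * min(n / (660R^2), n / (5d * R)))
    return (ρ_n = ρ_n, σ̄ = σ̄, γ̄ = γ̄, conditions_hold = ok)
end
```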

4 Approximate local search steps for computational scalability

As discussed in Sect. 2, the local search step (2.3) in Algorithm 1 costs \(O(n^2)\) per iteration k—this can limit the scalability of Algorithm 1 to problems with a large n. Here we discuss an efficient method to find an approximate solution for step (2.3). Suppose that in the k-th iteration of Algorithm 1 the permutation \({\varvec{P}}^{(k)}\) satisfies \( \mathsf {dist}({\varvec{P}}^{(k)} , {\varvec{I}}_n) \le R-2 \). Then update (2.3) becomes

$$\begin{aligned} \min _{{\varvec{P}}}~~~ \Vert \widetilde{{\varvec{H}}}{\varvec{P}} {\varvec{y}} \Vert ^2 ~~~ \text {s.t. }~~~ \mathsf {dist}({\varvec{P}}, {\varvec{P}}^{(k)} ) \le 2. \end{aligned}$$
(4.1)

Problem (4.1) can be equivalently formulated as

$$\begin{aligned}&\min _{i,j\in [n]} \Vert \widetilde{{\varvec{H}}}({\varvec{P}}^{(k)} {\varvec{y}} - (y_i - y_j) ({\varvec{e}}_i- {\varvec{e}}_j) ) \Vert ^2 \nonumber \\&\quad = \min _{i,j\in [n]} \bigg \{ \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 - 2 (y_i - y_j) \langle {\varvec{e}}_i - {\varvec{e}}_j , \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \rangle \nonumber \\&\qquad + (y_i - y_j)^2 \Vert \widetilde{{\varvec{H}}}({\varvec{e}}_i - {\varvec{e}}_j) \Vert ^2 \bigg \}. \end{aligned}$$
(4.2)

In view of Assumption 1 (3), for \(n\gg d\) we have \( \Vert \widetilde{{\varvec{H}}}({\varvec{e}}_i - {\varvec{e}}_j) \Vert ^2 \approx 2 \); and in general, \(\Vert \widetilde{{\varvec{H}}}({\varvec{e}}_i - {\varvec{e}}_j) \Vert ^2 \le 2\). Hence, one can approximately optimize (4.2) by replacing \(\Vert \widetilde{{\varvec{H}}}({\varvec{e}}_i - {\varvec{e}}_j) \Vert ^2\) with its upper bound 2, i.e., by minimizing the following upper bound on the last two terms of display (4.2):

$$\begin{aligned} \min _{i,j\in [n]} ~~ \bigg \{- 2 (y_i - y_j) \langle {\varvec{e}}_i - {\varvec{e}}_j , \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \rangle + 2 (y_i - y_j)^2 \bigg \}. \end{aligned}$$
(4.3)

Denoting \({\varvec{w}} := \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}}\) and \({\varvec{v}}: = {\varvec{y}} - {\varvec{w}}\), the objective in (4.3) is given by

$$\begin{aligned}&- 2 (y_i - y_j) \langle {\varvec{e}}_i - {\varvec{e}}_j , \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \rangle + 2 (y_i - y_j)^2 \\&\quad = 2(y_i - y_j) (-w_i + w_j + y_i - y_j) = 2(y_i - y_j) (v_i - v_j) \ . \end{aligned}$$

So problem (4.3) is equivalent to

$$\begin{aligned} \min _{i,j\in [n]} ~~ (y_i - y_j) (v_i - v_j). \end{aligned}$$
(4.4)
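For concreteness, (4.4) can be solved by a full scan over all pairs; the following Julia sketch (illustrative only, not the paper's implementation) does exactly this at a cost of \(O(n^2)\).

```julia
# Illustrative O(n^2) scan for (4.4): given y and w = H̃ P^(k) y, form v = y - w
# and return the best pair (i, j). Returning i == j means no swap improves the
# objective, since the objective of (4.4) equals zero at i = j.
function best_swap_bruteforce(y::Vector{Float64}, w::Vector{Float64})
    v = y .- w                           # v := y - w, as defined after (4.3)
    best, pair = 0.0, (1, 1)
    for i in eachindex(y), j in eachindex(y)
        val = (y[i] - y[j]) * (v[i] - v[j])
        if val < best
            best, pair = val, (i, j)
        end
    end
    return pair, best
end
```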

As we discuss below, the computational cost of the above problem can be reduced by exploiting its structural properties. Let us denote \({\varvec{z}}_i = (y_i, v_i) \in {\mathbb {R}}^2\). Among the set of points \(\{{\varvec{z}}_1,\ldots , {\varvec{z}}_n\}\), we say \({\varvec{z}}_i\) is a “left-top” point if for all \(j\in [n]\),

$$\begin{aligned} {\varvec{z}}_j \notin \{ (u_1,u_2) \in {\mathbb {R}}^2 ~|~ u_1 \le y_i , ~ u_2 \ge v_i \} \setminus \{{\varvec{z}}_i\} \ . \end{aligned}$$

We say \({\varvec{z}}_i\) is a “right-bottom” point if for all \(j\in [n]\),

$$\begin{aligned} {\varvec{z}}_j \notin \{ (u_1,u_2) \in {\mathbb {R}}^2 ~|~ u_1 \ge y_i , ~ u_2 \le v_i \} \setminus \{{\varvec{z}}_i\} \ . \end{aligned}$$

Figure 1 shows an example of left-top and right-bottom points for a collection of \({\varvec{z}}_{i}\)’s with noisy \({\varvec{y}}\). It can be seen that the number of left-top and right-bottom points can be much smaller than the total number of points.

Fig. 1 Left-top points and right-bottom points in a 2D collection of points \(\varvec{z}_i = (y_i, v_i) \in {\mathbb {R}}^2 \)

Let \((i^*,j^*)\) be an optimal solution to (4.4); then one of \(\{{\varvec{z}}_{i^*}, {\varvec{z}}_{j^*}\}\) must be a left-top point and the other a right-bottom point. Let \({\mathcal {Z}}_{lt}\) and \({\mathcal {Z}}_{rb}\) be the sets of left-top and right-bottom points respectively, and define

$$\begin{aligned} {\mathcal {S}}_{lt} := \{i\in [n] ~| ~ {\varvec{z}}_{i} \in {\mathcal {Z}}_{lt}\},~~~ {\mathcal {S}}_{rb} := \{i\in [n] ~| ~ {\varvec{z}}_{i} \in {\mathcal {Z}}_{rb}\} \ . \end{aligned}$$

Then Problem (4.4) is equivalent to

$$\begin{aligned} \min _{i\in {\mathcal {S}}_{lt}, ~ j\in {\mathcal {S}}_{rb}} ~~ (y_i - y_j) (v_i - v_j) \end{aligned}$$
(4.5)

implying that it suffices to compute values of \((y_i - y_j) (v_i - v_j)\) for \(i\in {\mathcal {S}}_{lt}\) and \(j\in {\mathcal {S}}_{rb}\). Algorithm 2 discusses how to compute \( {\mathcal {S}}_{lt}\) and \({\mathcal {S}}_{rb}\)—this requires (a) performing a sorting operation on \({\varvec{y}}\), which can be done once with a cost of \(O(n\log (n))\); and (b) two additional passes over the data with cost O(n) (to be performed at every iteration of Algorithm 1).
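A minimal Julia sketch of this computation is given below; it is an illustration rather than the actual Algorithm 2, and for simplicity it assumes that all \(y_i\) and all \(v_i\) are distinct.

```julia
# Compute the index sets S_lt (left-top) and S_rb (right-bottom) with one
# O(n log n) sort and two O(n) passes. Assumes distinct y- and v-values.
function lefttop_rightbottom(y::Vector{Float64}, v::Vector{Float64})
    ord = sortperm(y)                     # indices sorted by increasing y
    S_lt = Int[]; running_max = -Inf
    for i in ord                          # left-top: v[i] beats every v seen so far
        if v[i] > running_max
            push!(S_lt, i); running_max = v[i]
        end
    end
    S_rb = Int[]; running_min = Inf
    for i in reverse(ord)                 # right-bottom: scan in decreasing y
        if v[i] < running_min
            push!(S_rb, i); running_min = v[i]
        end
    end
    return S_lt, reverse(S_rb)            # S_rb listed in increasing y
end
```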

The computation of (4.5) can be further simplified, as discussed in Sect. 4.1.


4.1 Faster computation of problem (4.5)

To simplify the computation of (4.5), we introduce a partial order ‘\(\preceq \)’ on points in \({\mathbb {R}}^2\): for \(\varvec{p},{\varvec{q}} \in {\mathbb {R}}^2\), we write \({\varvec{p}}\preceq \varvec{q}\) if \(p_1 \le q_1\) and \(p_2 \le q_2\). It is easy to check that for any two points \({\varvec{z}}_i, {\varvec{z}}_j \in {\mathcal {Z}}_{rb}\), either \({\varvec{z}}_i \preceq {\varvec{z}}_j\) or \({\varvec{z}}_j \preceq {\varvec{z}}_i\) holds. So we can write \({\mathcal {Z}}_{rb}=\{{\varvec{z}}_{i_1}, \ldots , {\varvec{z}}_{i_{L'}}\}\) with

$$\begin{aligned} {\varvec{z}}_{i_1} \preceq {\varvec{z}}_{i_2 } \preceq \cdots \preceq {\varvec{z}}_{i_{L'}}. \end{aligned}$$
(4.6)

For any \({\varvec{z}}_{m} \in {\mathcal {Z}}_{lt}\), two cases can happen:

1. There is no point \({\varvec{z}}_{i_t} \in {\mathcal {Z}}_{rb}\) satisfying \((y_{i_t} - y_m) (v_{i_t} - v_m) \le 0\).

2. There exist \({{\bar{t}}} ,{{\bar{b}}} \in [L']\) with \( {{\bar{b}}} \le {{\bar{t}}}\) such that \( (y_m - y_{i_t}) (v_m - v_{i_t}) \le 0 \) for all \({{\bar{b}}} \le t \le {{\bar{t}}}\), and \( (y_m - y_{i_t}) (v_m - v_{i_t}) > 0 \) for all \(t>{{\bar{t}}}\) or \(t<{{\bar{b}}}\).

Because \({\mathcal {Z}}_{rb}\) is totally ordered as in (4.6), Case 1 above can be identified by a bisection method at a cost of (at most) \( O(\log n)\). Similarly, for Case 2, the values of \({{\bar{t}}}\) and \({{\bar{b}}}\) can be found by bisection. Since the optimal value of (4.4) is non-positive (the objective equals zero when \(i=j\)), it suffices to evaluate \( (y_m - y_{i_t}) (v_m - v_{i_t}) \) only for \({{\bar{b}}} \le t \le {{\bar{t}}}\).
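The following Julia sketch illustrates this bisection via `searchsortedlast` (a binary search); as before it is an illustration only and assumes distinct \(y\)- and \(v\)-values.

```julia
# For a left-top point (y_m, v_m), locate the contiguous range (b̄, t̄) of the
# chain Z_rb on which (y_m - y_{i_t})(v_m - v_{i_t}) ≤ 0. The chain is passed as
# vectors y_rb, v_rb ordered as in (4.6), so both are increasing.
function candidate_range(y_rb::Vector{Float64}, v_rb::Vector{Float64},
                         y_m::Real, v_m::Real)
    t_y = searchsortedlast(y_rb, y_m)    # last t with y_{i_t} ≤ y_m  (binary search)
    t_v = searchsortedlast(v_rb, v_m)    # last t with v_{i_t} ≤ v_m  (binary search)
    lo, hi = minmax(t_y, t_v)
    lo == hi && return nothing           # Case 1: the product is positive for every t
    return (lo + 1, hi)                  # Case 2: the range (b̄, t̄)
end
```

Only the products in the returned range need to be evaluated when scanning over the left-top points.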

The methods described for solving (4.4) are summarized in Algorithm 2. Finally, note that when \(\mathsf {dist}({\varvec{P}}^{(k)}, {\varvec{I}}_n) \ge R-1\), similar ideas are still applicable. When \(\mathsf {dist}({\varvec{P}}^{(k)}, \varvec{I}_n) = R-1\), we consider the following problem:

$$\begin{aligned} \min _{i,j} ~ (y_i - y_j) (v_{i} - v_j)~~~~ \text {s.t.} ~ i\in [n], ~ j\in \mathsf {supp}({\varvec{P}}^{(k)}). \end{aligned}$$
(4.7)

Similarly, when \(\mathsf {dist}({\varvec{P}}^{(k)}, {\varvec{I}}_n) = R\), we consider:

$$\begin{aligned} \min _{i,j} ~ (y_i - y_j) (v_{i} - v_j)~~~~ \text {s.t.} ~ i, j\in \mathsf {supp}({\varvec{P}}^{(k)}). \end{aligned}$$
(4.8)

Problems (4.7) and (4.8) can also be efficiently solved by finding the sets of left-top and right-bottom points and using the partial order to simplify the computation. We omit the details for brevity.

5 Experiments

We perform numerical experiments to study the performance of Algorithm 1.

Data generation. We consider the setup in our basic model (1.1), where entries of \({\varvec{X}}\in {\mathbb {R}}^{n\times d}\) are iid N(0, 1); \({\varvec{\beta }}^*\) is generated uniformly from the unit sphere in \({\mathbb {R}}^d\) (i.e., \(\Vert {\varvec{\beta }}^*\Vert = 1\)), and \({\varvec{\beta }}^*\) is independent of \({\varvec{X}}\). We consider two schemes for generating the permutation \({\varvec{P}}^*\): (a) Random scheme: select r coordinates uniformly at random from \(\{1, \ldots , n\}\). (b) Equi-spaced scheme: Assume \(y_1\le \cdots \le y_n\) (otherwise re-order the data). Let \(a_1<\cdots <a_r\) be the sequence of r equi-spaced real numbers with \(a_1 = \min _{i\in [n]} y_i\) and \(a_r = \max _{i\in [n]} y_i\). Select r indices \(i_1<\cdots <i_r\) such that \(i_1 = \mathop {\mathrm{argmin}}_{i\in [n] } |y_i - a_1|\) and \(i_s = \mathop {\mathrm{argmin}}_{i_{s-1}+1 \le i \le n} |y_i - a_s|\) for all \(2\le s \le r\). After the r coordinates are chosen, we generate a uniformly distributed random permutation on these r coordinates.

We generate \({\varvec{\epsilon }}\) (independent of \({\varvec{X}}\) and \({\varvec{\beta }}^*\)) with \(\epsilon _{i} {\mathop {\sim }\limits ^{\text {iid}}} N(0,\sigma ^2)\) for some \(\sigma \ge 0\) (\(\sigma = 0\) corresponds to the noiseless setting). Unless otherwise specified, we set the tolerance \({\mathfrak {e}} = 0\) and \(K=\infty \) in Algorithm 1.
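For reference, a minimal Julia sketch of this data-generating process (random scheme) is shown below; it is an illustration and not the code used for the experiments, and the function name is ours.

```julia
using LinearAlgebra, Random

# Generate an instance of model (1.1): X with iid N(0,1) entries, β* uniform on
# the unit sphere, Gaussian noise, and a permutation chosen by the random
# scheme. The permutation is stored as a vector `perm` with (P* y)_i = y[perm[i]].
function generate_instance(n::Int, d::Int, r::Int, σ::Real)
    X = randn(n, d)
    β = randn(d); β ./= norm(β)           # ‖β*‖ = 1
    ϵ = σ .* randn(n)
    perm = collect(1:n)                   # start from the identity permutation
    idx = randperm(n)[1:r]                # random scheme: pick r coordinates
    perm[idx] = idx[randperm(r)]          # shuffle them uniformly at random
    y = similar(ϵ)
    y[perm] = X * β .+ ϵ                  # model (1.1): P* y = X β* + ϵ
    return X, β, y, perm
end

# Example: a small instance matching the noisy experiments of Sect. 5.2
# X, β_star, y, perm_star = generate_instance(1000, 10, 10, 0.1)
```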

5.1 Experiments for the noiseless setting

We first consider the noiseless setting (\({\varvec{\epsilon }} = {\varvec{0}}\)) with different combinations of (d, r, n). We use the random scheme to generate the unknown permutation \({\varvec{P}}^*\). We set \(R = n\) in Algorithm 1 and impose a maximum iteration limit of 1000. While our algorithm parameter choices are not covered by our theory, in practice when r is small our local search algorithm converges to optimality, and the number of iterations is bounded by a small constant multiple of r (e.g., for \(r=50\), the algorithm converges to optimality within around 60 iterations).

Figure 2 presents preliminary results on examples with \(n = 500\), \(d\in \{20,50,100,200\}\), and 40 roughly equi-spaced values of \(r \in [10,400]\). In Fig. 2 [left panel], we plot the Hamming distance of the solution \(\hat{{\varvec{P}}}\) computed by Algorithm 1 and the underlying permutation \({\varvec{P}}^*\) (i.e. \(\mathsf {dist}(\hat{{\varvec{P}}}, {\varvec{P}}^*)\)) versus r. In Fig. 2 [right panel], we present errors in estimating \(\varvec{\beta }^*\) versus r. More precisely, let \( \hat{\varvec{\beta }}\) be the solution computed by Algorithm 1 (i.e. \(\hat{\varvec{\beta }}= ({\varvec{X}}^\top {\varvec{X}})^{-1}{\varvec{X}}^\top \hat{{\varvec{P}}}{\varvec{y}}\)), then the beta error is defined as \( \Vert \hat{\varvec{\beta }}- {\varvec{\beta }}^* \Vert /\Vert {\varvec{\beta }}^* \Vert \). For each choice of (rd), we consider the average over 50 independent replications (the vertical bars show standard errors, which are hardly visible in the figures). As shown in Fig. 2, when r is small, the underlying permutation \({\varvec{P}}^*\) can be exactly recovered, and thus the corresponding beta error is also 0. As r becomes larger, Algorithm 1 fails to recover \({\varvec{P}}^*\) exactly; and \(\mathsf {dist}({\varvec{P}}^*, \hat{{\varvec{P}}})\) is close to the maximal possible value \(n = 500\). In contrast, the estimation error appears to vary more smoothly: As the value of r increases, beta error increases. We also observe that the recovery of \({\varvec{P}}^*\) depends upon the number of covariates d — permutation recovery performance deteriorates with increasing d. This is consistent with our theory suggesting that the performance of our algorithm depends upon both r and d.

Fig. 2 Left: Values of Hamming distance \(\mathsf {dist}(\hat{{\varvec{P}}}, {\varvec{P}}^*)\) versus r. Right: Values of beta error \(\Vert \hat{\varvec{\beta }}- {\varvec{\beta }}^* \Vert / \Vert {\varvec{\beta }}^*\Vert \) versus r

5.2 Experiments for the noisy setting

We explore the performance of Algorithm 1 under the noisy setting (\({\varvec{\epsilon }} \ne {\varvec{0}}\)).

Performance for different values of R: We define Relative Obj as the objective value computed by Algorithm 1 divided by \(\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2\). Figure 3 presents the Relative Obj, beta error and Hamming distance of the local search algorithm for different values of R (the x-axis corresponds to the values of R). Here we consider \(n=1000\), \(d=10\), \(r=10\) and \(\sigma =0.1\), and use the equi-spaced scheme to choose the mismatched coordinates in \({\varvec{P}}^*\). We highlight the value at \(R = r = 10\) by a red point. As shown in Fig. 3, as R increases, the Relative Obj decreases below 1 – this is consistent with our theory stating that with a proper choice of R, the final objective value will be below a constant multiple of \(\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2\).

In contrast to the Relative Obj profile, as R increases the beta error and Hamming distance first decrease and then increase. This appears to suggest that when R is too large, Algorithm 1 can overfit, and further regularization may be necessary to mitigate overfitting. A detailed investigation of this matter is left as future work. In this example, the best beta error and Hamming distance are achieved when R equals r. Note that in Fig. 3 [left panel], the Relative Obj is close to 1 when we choose R close to r. Therefore, if we have a good estimate of the noise level \(\sigma \) (but the exact value of r is not available), we can choose a value of R at which the Relative Obj is approximately 1.

Finally, we note that in the noisy case, the local search method cannot exactly recover \({\varvec{P}}^*\). Indeed, for a solution to (1.2) to exactly recover \({\varvec{P}}^*\) in the noisy case, a smaller value of \(\sigma \) is needed (see the discussion in [15]). Even though we cannot exactly recover \({\varvec{P}}^*\) in our example, we may still obtain a good estimate of \({\varvec{\beta }}^*\)—see Fig. 3 [middle panel].

Fig. 3 Experiment on an instance with \(n=1000\), \(d=10\), \(r=10\) and \(\sigma =0.1\). Left: Relative Obj vs R. Middle: beta error vs R. Right: Hamming distance vs R. The circled red point corresponds to \(R = r\)

Estimating \({\varvec{P}}^*, {\varvec{\beta }}^*\) under different noise levels: For a given \(\sigma \) (standard deviation of the noise), let relative beta error be the value \(\Vert \hat{{\varvec{\beta }}} - {\varvec{\beta }}^* \Vert / ( \sigma \Vert {\varvec{\beta }}^*\Vert )\), where \(\hat{{\varvec{P}}}\) and \(\hat{{\varvec{\beta }}}\) are the estimates available from Algorithm 1 upon termination.

We consider an example with \(n = 500\), \(d=r=10\) and different values of \(\sigma \in \{ 0.01, 0.03, 0.1, 0.3, 1.0\}\), and use the random scheme to generate the unknown permutation \({\varvec{P}}^*\). We run Algorithm 1 with \(R = r\). Figure 4 presents the values of the Hamming distance and the relative beta error under different noise levels. In Fig. 4 [left panel], it can be seen that the Hamming distance increases with \(\sigma \): recovering \({\varvec{P}}^*\) becomes harder as the noise level grows. In Fig. 4 [right panel], the relative beta error remains nearly constant across different values of \(\sigma \). This appears to be consistent with our conclusion in Theorem 6 that \(\Vert \hat{{\varvec{\beta }}} - {\varvec{\beta }}^* \Vert \) is bounded by a quantity proportional to \(\sigma \).

Fig. 4 Experiment on an instance with \(n=500\), \(d=10\), \(r=10\), and different noise levels \(\sigma \in \{0.01, 0.03, 0.1, 0.3, 1.0\}\). Left: Hamming distance vs \(\sigma \). Right: relative beta error vs \(\sigma \)

5.3 Comparisons with existing methods

We compare the following methods for (1.3):

  • AltMin: The alternating minimization method of [9]. We initialize with \({\varvec{P}} ={\varvec{I}}_n\) and \({\varvec{\beta }} = {\varvec{0}}\), and alternately minimize over \({\varvec{P}}\) and \({\varvec{\beta }}\) until no improvement can be made.

  • StoEM: The stochastic expectation maximization method [2]. We run the algorithm for 30 steps under the default setting.

  • S-BD: The robust regression relaxation method of [18]. We set the regularization parameter \(\lambda = 4(1+M)\sigma \sqrt{2\log (n)/n}\) with \(M=1\) (this value of \(\lambda \) is given by Theorem 2 of [18]).

For both AltMin and StoEM, we use the implementation provided in the repository accompanying paper [2]. We compare these methods with Algorithm 1 (denoted by Alg-1) and the variant of Algorithm 1 with the fast approximate local search steps introduced in Sect. 4 (denoted by Fast-Alg-1). We run Alg-1 and Fast-Alg-1 by setting \(R = r\).

Table 1 presents the beta errors of different methods on an example with \(n = 500\), \(d=10\); and \({\varvec{P}}^* \) chosen by the random scheme. The presented values are the average of 10 independent replications.

As shown in Table 1, in the noiseless setting, Alg-1 can recover the true value of \({\varvec{\beta }}^*\) for r up to 300. Fast-Alg-1 is quite similar to Alg-1, though its performance is marginally worse for larger values of r. In contrast, AltMin and StoEM are not able to exactly estimate \({\varvec{\beta }}^*\) even for small values of r, and they have large beta errors for all values of r. S-BD, which is based on convex optimization, can also recover \({\varvec{\beta }}^*\) for \(r\le 200\); for \(r = 300\), however, its performance degrades and the beta error is large.

In the noisy setting with \(\sigma = 0.1\), Alg-1 and Fast-Alg-1 have similar performance and compute estimates of \({\varvec{\beta }}\) with much smaller beta error than AltMin and StoEM. For small values of r (\(\le 100\)), S-BD performs similarly to Alg-1 and Fast-Alg-1, while for \(r = 200\) and 300 its performance degrades, with a much larger beta error than Alg-1 and Fast-Alg-1.

Table 1 Comparison with existing methods on an example with \(n=500\), \(d=10\)

5.4 Scalability to large instances

We explore the scalability of our proposed approach to large n problems (from \(n\approx 10^4\) up to \(n\approx 10^7\))—for these instances, Fast-Alg-1 appears to be computationally attractive. All these experiments are run on the MIT engaging cluster with 1 CPU and 16GB memory. The codes are written in Julia 1.2.0.

We consider examples with \(d = 100\), \(r = 50\) and \(n\in \{10^4, 10^5, 10^6, 10^7\}\). Here, the mismatched coordinates of \({\varvec{P}}^*\) are chosen based on the random scheme. We set \(R = r\) for all instances. For these examples, we do not form the \(n \times n\) matrices \(\widetilde{{\varvec{H}}}\) or \({\varvec{H}}\) explicitly, but compute a thin QR decomposition of \({\varvec{X}}\) (\(\varvec{Q}\in {\mathbb {R}}^{n\times d}\)) and maintain \({\varvec{Q}} \) in memory. The results are presented in Table 2, where “total time” is the total runtime of Fast-Alg-1 upon termination, “QR time” is the time used for the QR decomposition, and “iterations” is the number of iterations taken by the local search method until convergence. All numbers reported in the table are averaged across 10 independent replications. As shown in Table 2, Fast-Alg-1 can solve examples with n up to \(10^7\) (and \(d = 100\)) within around 200 seconds; this runtime includes the time to complete around 60 local search iterations and the time to perform the QR decomposition. The total runtime is empirically seen to be of the order O(n) as n increases. Note that the QR time (i.e., the time to perform the QR decomposition) can be viewed as a benchmark runtime for ordinary least squares. Hence, for the examples considered, the runtime of Fast-Alg-1 appears to be a constant multiple of the runtime of ordinary least squares (the total time will increase with r and/or R). Interestingly, the runtimes for the noisy case (\(\sigma = 0.1\)) are smaller than for the noiseless case (\(\sigma = 0\)). We believe this is because Algorithm 2 is faster in the noisy case: we empirically observe fewer “left-top” and “right-bottom” points in the noisy case than in the noiseless case.
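For illustration, the following Julia sketch shows the matrix-free use of the thin QR factor described above (the function names are ours, not the experiment code): since \({\varvec{H}} = {\varvec{Q}}{\varvec{Q}}^\top \) when \({\varvec{X}}\) has full column rank, products with \({\varvec{H}}\) and \(\widetilde{{\varvec{H}}}\) cost \(O(nd)\) per application.

```julia
using LinearAlgebra

# Keep only the thin QR factor Q of X (n × d, orthonormal columns); then
# H u = Q (Q' u) and H̃ u = u - Q (Q' u), so the n × n matrices are never formed.
thin_q(X::Matrix{Float64}) = Matrix(qr(X).Q)                            # thin factor

apply_H(Q::Matrix{Float64}, u::Vector{Float64})  = Q * (Q' * u)         # H u
apply_Ht(Q::Matrix{Float64}, u::Vector{Float64}) = u .- apply_H(Q, u)   # H̃ u

# Example (hypothetical sizes): compute w = H̃ P^(k) y without forming H̃
# X = randn(10_000, 100); Py = randn(10_000)
# w = apply_Ht(thin_q(X), Py)
```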

Table 2 Runtimes of Fast-Alg-1 on instances with \(d = 100\), \(r=50\) and different n. For reference, the time taken by Alg-1 in the case \(n=10^4\) is 100 s, which is 500 times slower than Fast-Alg-1