Abstract
Linear regression is a fundamental modeling tool in statistics and related fields. In this paper, we study an important variant of linear regression in which the predictor-response pairs are partially mismatched. We use an optimization formulation to simultaneously learn the underlying regression coefficients and the permutation corresponding to the mismatches. The combinatorial structure of the problem leads to computational challenges. We propose and study a simple greedy local search algorithm for this optimization problem that enjoys strong theoretical guarantees and appealing computational performance. We prove that under a suitable scaling of the number of mismatched pairs compared to the number of samples and features, and certain assumptions on the problem data, our local search algorithm converges to a nearly-optimal solution at a linear rate. In particular, in the noiseless case, our algorithm converges to the global optimal solution at a linear rate. Based on this result, we prove an upper bound on the estimation error of the regression coefficients. We also propose an approximate local search step that allows us to scale our approach to much larger instances. We conduct numerical experiments to gather further insights into our theoretical results, and show promising performance gains compared to existing approaches.
1 Introduction
Linear regression and its extensions are among the most fundamental models in statistics and related fields. In the classical and most common setting, we are given n samples with features \({\varvec{x}}_{i} \in {\mathbb {R}}^{d}\) and response \(y_{i} \in {\mathbb {R}}\), where i indexes the samples. We assume that the features and responses are perfectly matched, i.e., \({\varvec{x}}_i\) and \(y_i\) correspond to the same record or sample. However, in important applications (for example, due to errors in the data merging process), the correspondence between the response and features may be broken [13, 14, 19]. This erroneous correspondence needs to be adjusted before performing downstream statistical analysis. Thus motivated, we consider a mismatched linear model with responses \({\varvec{y}} = [y_1,\ldots ,y_n]^\top \in {\mathbb {R}}^n\) and covariates \({\varvec{X}}=[{\varvec{x}}_1,\ldots , {\varvec{x}}_n]^\top \in {\mathbb {R}}^{n\times d} \) satisfying
$$\begin{aligned} {\varvec{P}}^* {\varvec{y}} = {\varvec{X}} {\varvec{\beta }}^* + {\varvec{\epsilon }} \ , \end{aligned}$$(1.1)
where \(\varvec{\beta }^*\in {\mathbb {R}}^d\) are the true regression coefficients, \(\varvec{\epsilon } = [\epsilon _1,\ldots ,\epsilon _n]^\top \in {\mathbb {R}}^n\) is the noise term, and \({\varvec{P}}^*\in {\mathbb {R}}^{n\times n}\) is an unknown permutation matrix. We consider the classical setting where \(n > d\) and \({\varvec{X}}\) has full rank; and seek to estimate both \({\varvec{\beta }}^*\) and \({\varvec{P}}^*\) based on the n observations \(\{(y_{i}, {\varvec{x}}_{i})\}_{1}^{n}\). Note that the main computational difficulty in this task arises from the unknown permutation.
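For concreteness, the sketch below generates a synthetic instance of model (1.1) (equivalently, \({\varvec{P}}^* {\varvec{y}} = {\varvec{X}} {\varvec{\beta }}^* + {\varvec{\epsilon }}\)); the Gaussian design, the cyclic construction of the sparse permutation, and all function names are our illustrative assumptions, not part of the paper:

```python
import numpy as np

def make_mismatched_data(n=200, d=5, r=10, sigma=0.1, seed=0):
    """Sample an instance of model (1.1), P* y = X beta* + eps, where the
    unknown permutation P* displaces exactly r of the n rows."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    beta_star = rng.standard_normal(d)
    eps = sigma * rng.standard_normal(n)
    # pi_star encodes P* via (P* v)[i] = v[pi_star[i]]; cycling r randomly
    # chosen indices displaces each of them, so dist(P*, I_n) = r.
    pi_star = np.arange(n)
    moved = rng.choice(n, size=r, replace=False)
    pi_star[moved] = np.roll(moved, 1)
    # P* y = X beta* + eps is equivalent to y[pi_star] = X beta* + eps.
    y = np.empty(n)
    y[pi_star] = X @ beta_star + eps
    return X, y, beta_star, pi_star

X, y, beta_star, pi_star = make_mismatched_data()
assert int(np.sum(pi_star != np.arange(len(y)))) == 10  # r mismatched rows
```

With `sigma=0` this produces a noiseless instance of the kind analyzed in Theorem 3.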
Linear regression with mismatched/permuted data—for example, model (1.1)—has a long history in statistics dating back to the 1960s [5, 6, 13]. In addition to the aforementioned application in record linkage, similar problems also appear in robotics [22], multi-target tracking [4] and signal processing [3], among others. Recently, this problem has garnered significant attention from the statistics and machine learning communities. A series of recent works [1, 2, 7,8,9, 11, 14, 15, 17, 19,20,21, 23, 24, 26] have studied the statistical and computational aspects of this model. To learn the coefficients \({\varvec{\beta }}^*\) and the matrix \({\varvec{P}}^*\), one can consider the following natural optimization problem:
$$\begin{aligned} \min _{{\varvec{P}}\in \varPi _n, ~{\varvec{\beta }}\in {\mathbb {R}}^d} ~~ \Vert {\varvec{P}} {\varvec{y}} - {\varvec{X}} {\varvec{\beta }} \Vert ^2 \ , \end{aligned}$$(1.2)
where \(\varPi _n\) is the set of \(n \times n\) permutation matrices. Solving problem (1.2) is difficult as there are exponentially many choices for \({\varvec{P}} \in \varPi _{n}\). Given \({\varvec{P}}\) however, it is easy to estimate \({\varvec{\beta }}\) via least squares. [24] shows that in the noiseless setting (\({\varvec{\epsilon }}={\varvec{0}}\)), a solution \((\hat{{\varvec{P}}}, \hat{{\varvec{\beta }}})\) of problem (1.2) equals \(( {\varvec{P}}^*, {\varvec{\beta }}^*)\) with probability one if \(n\ge 2d\) and the entries of \({\varvec{X}}\) are independent and identically distributed (iid) as per a distribution that is absolutely continuous with respect to the Lebesgue measure. [11, 15] study the estimation of \(({\varvec{P}}^*, {\varvec{\beta }}^*) \) under the noisy setting. It is shown in [15] that Problem (1.2) is NP-hard if \(d\ge \kappa n \) for some constant \(\kappa >0\). A polynomial-time approximation algorithm appears in [11] for a fixed d.
However, as noted in [11], this algorithm does not appear to be efficient in practice. [8] propose a branch-and-bound method that can solve small problems with \(n\le 20\) within a reasonable time. [16] propose a branch-and-bound method for a concave minimization formulation, which can solve problem (1.2) with \(d\le 8\) and \(n \approx 100\) (the authors report a runtime of 40 minutes to solve instances with \(d=8\) and \(n=100\)). [23] propose an approach using tools from algebraic geometry, which can handle problems with \(d\le 6\) and n ranging from \(10^3\) to \(10^5\)—the cost of this method increases exponentially with d. This approach is exact for the noiseless case but approximate for the noisy case (\({\varvec{\epsilon }} \ne {\varvec{0}}\)). Several heuristics have been proposed for (1.2): examples include alternating minimization [9, 26] and Expectation Maximization [2]; as far as we can tell, these methods are sensitive to initialization and have limited theoretical guarantees.
As discussed in [18, 19], in several applications, a small fraction of the samples are mismatched — that is, the permutation \({\varvec{P}}^*\) is sparse. In other words, if we let \(r:= |\{i\in [n] ~|~ {\varvec{P}}^* {\varvec{e}}_i \ne {\varvec{e}}_i \}|\) where \({\varvec{e}}_1, \ldots , {\varvec{e}}_n\) are the standard basis elements of \({\mathbb {R}}^n\), then r is much smaller than n. In this paper, we focus on such sparse permutation matrices, and assume the value of r is known or a good estimate is available to the practitioner. This motivates a constrained version of (1.2), given by
$$\begin{aligned} \min _{{\varvec{P}}\in \varPi _n, ~{\varvec{\beta }}\in {\mathbb {R}}^d} ~~ \Vert {\varvec{P}} {\varvec{y}} - {\varvec{X}} {\varvec{\beta }} \Vert ^2 ~~~ \mathrm {s.t.} ~~~ \mathsf {dist}({\varvec{P}}, {\varvec{I}}_{n}) \le R \ , \end{aligned}$$(1.3)
where the constraint \(\mathsf {dist}({\varvec{P}}, {\varvec{I}}_{n}) \le R\) restricts the number of mismatches between \({\varvec{P}}\) and the identity permutation to be at most R; see (1.4) for a formal definition of \(\mathsf {dist}(\cdot ,\cdot )\). Above, R is taken such that \(r \le R \le n\) (further details on the choice of R can be found in Sects. 3 and 5). Note that as long as \(r \le R \le n\), the true parameters \(({\varvec{P}}^*, {\varvec{\beta }}^*)\) lead to a feasible solution to (1.3). In the special case when \(R = n\), the constraint \(\mathsf {dist}({\varvec{P}}, {\varvec{I}}_{n}) \le R\) is redundant, and problem (1.3) is equivalent to problem (1.2). Interesting convex optimization approaches based on robust regression have been proposed in [18] to approximately solve (1.3) when \(r \ll n\). The authors focus on obtaining an estimate of \({\varvec{\beta }}^*\). Similar ideas have been extended to consider problems with multiple responses in [19].
Problem (1.3) can be formulated as a mixed-integer program (MIP) with \(O(n^2)\) binary variables (to model the unknown permutation matrix). Solving this MIP with off-the-shelf solvers (e.g., Gurobi) becomes computationally expensive even for small values of n (e.g., \(n \approx 50\)). To the best of our knowledge, there are no computationally practical algorithms with theoretical guarantees that can optimally solve problem (1.3) under suitable assumptions on the problem data. Addressing this gap is the main focus of this paper: We propose and study a novel greedy local search methodFootnote 1 for problem (1.3). Loosely speaking, at every step our algorithm performs a greedy swap or transposition in an attempt to improve the cost function. This algorithm is typically efficient in practice based on our numerical experiments. We also propose an approximate version of the greedy swap procedure that scales to much larger problem instances. We establish theoretical guarantees on the convergence of the proposed method under suitable assumptions on the problem data. Under a suitable scaling of the number of mismatched pairs compared to the number of samples and features, and certain assumptions on the covariates and noise, our local search method converges to an objective value that is at most a constant multiple of the squared norm of the underlying noise term. From a statistical viewpoint, this is the best objective value that one can hope to obtain (due to the noise in the problem). Interestingly, in the special case of \({\varvec{\epsilon }} = {\varvec{0}}\) (i.e., the noiseless setting), our algorithm converges to an optimal solution of (1.3) at a linear rateFootnote 2. We also prove an upper bound on the estimation error of \({\varvec{\beta }}^*\) (in \(\ell _2\) norm) and derive a bound on the number of iterations taken by our proposed local search method to find a solution with this estimation error.
Notation and preliminaries: For a vector \({\varvec{a}}\), we let \(\Vert {\varvec{a}}\Vert \) denote the Euclidean norm, \(\Vert {\varvec{a}} \Vert _{\infty }\) the \(\ell _{\infty }\)-norm and \(\Vert {\varvec{a}}\Vert _0\) the \(\ell _0\)-pseudo-norm (i.e., number of nonzeros) of \({\varvec{a}}\). We let \( |||\cdot |||_2\) denote the operator norm for matrices. Let \(\{\varvec{e}_1,\ldots ,{\varvec{e}}_n\}\) be the standard orthonormal basis of \({\mathbb {R}}^n\). For a finite set S, we let \( \# S \) denote its cardinality. For any permutation matrix \({\varvec{P}}\), let \(\pi _{{\varvec{P}}}\) be the corresponding permutation of \(\{1,2,\ldots ,n\}\), that is, \(\pi _{{\varvec{P}}}(i) = j\) if and only if \({\varvec{e}}_i^\top {\varvec{P}} = {\varvec{e}}_j^\top \), which holds if and only if \(P_{ij} = 1\). We define the distance between two permutation matrices \({\varvec{P}}\) and \({\varvec{Q}}\) as
$$\begin{aligned} \mathsf {dist}({\varvec{P}}, {\varvec{Q}}) := \# \left\{ i \in [n] :~ \pi _{{\varvec{P}}}(i) \ne \pi _{{\varvec{Q}}}(i) \right\} \ . \end{aligned}$$(1.4)
Recall that we assume \(r = \mathsf {dist}({\varvec{P}}^*, {\varvec{I}}_n)\). For a given permutation matrix \({\varvec{P}}\), define the m-neighbourhood of \({\varvec{P}}\) as
$$\begin{aligned} {\mathcal {N}}_m({\varvec{P}}) := \left\{ {\varvec{Q}} \in \varPi _n :~ \mathsf {dist}({\varvec{Q}}, {\varvec{P}}) \le m \right\} \ . \end{aligned}$$
It is easy to check that \({\mathcal {N}}_1({\varvec{P}}) = \{{\varvec{P}}\}\), and for any \(R\ge 2\), \({\mathcal {N}}_R({\varvec{P}})\) contains more than one element. For any permutation matrix \({\varvec{P}}\in \varPi _n\), we define its support as: \(\mathsf {supp}({\varvec{P}}) := \left\{ i\in [n] :~ \pi _{{\varvec{P}}}(i) \ne i \right\} .\) For a real symmetric matrix \({\varvec{A}}\), let \(\lambda _{\max }({\varvec{A}})\) and \(\lambda _{\min }({\varvec{A}})\) denote the largest and smallest eigenvalues of \({\varvec{A}}\), respectively.
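In code, \(\mathsf {dist}\) and \(\mathsf {supp}\) are one-liners when permutation matrices are represented by their permutations \(\pi _{{\varvec{P}}}\); the small sketch below (function names are ours) also illustrates why a single swap has distance two:

```python
import numpy as np

def perm_dist(pi_P, pi_Q):
    """dist(P, Q): number of indices at which pi_P and pi_Q disagree."""
    return int(np.sum(np.asarray(pi_P) != np.asarray(pi_Q)))

def perm_supp(pi_P):
    """supp(P) = {i : pi_P(i) != i}."""
    pi_P = np.asarray(pi_P)
    return set(np.flatnonzero(pi_P != np.arange(len(pi_P))).tolist())

identity = np.arange(5)
swapped = np.array([1, 0, 2, 3, 4])       # transpose rows 0 and 1
# Two distinct permutations must disagree in at least two positions,
# which is why N_1(P) = {P} and a single swap has distance exactly 2.
assert perm_dist(swapped, identity) == 2
assert perm_supp(swapped) == {0, 1}
assert perm_dist(identity, identity) == 0
```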
For two positive scalar sequences \(\{a_n\}, \{b_n\}\), we write \(a_n = O(b_n)\) or equivalently, \(a_n/b_n = O(1)\), if there exists a universal constant C such that \( a_n \le C b_n \). We write \(a_n = \varOmega (b_n)\) or equivalently, \(a_n/b_n = \varOmega (1)\), if there exists a universal constant c such that \( a_n \ge c b_n \). We write \(a_n = \varTheta (b_n)\) if both \(a_n = O(b_n) \) and \(a_n = \varOmega (b_n)\) hold.
2 A local search method
Here we present our local search method for (1.3). For any fixed \({\varvec{P}}\in \varPi _n\), by minimizing the objective function in (1.3) with respect to \({\varvec{\beta }}\), we have an equivalent formulation
$$\begin{aligned} \min _{{\varvec{P}}\in \varPi _n} ~ \Vert ({\varvec{I}}_n - {\varvec{H}}) {\varvec{P}} {\varvec{y}} \Vert ^2 ~~~ \mathrm {s.t.} ~~~ \mathsf {dist}({\varvec{P}}, {\varvec{I}}_{n}) \le R \ , \end{aligned}$$(2.1)
where \({\varvec{H}} = {\varvec{X}} ({\varvec{X}}^\top \varvec{X})^{-1} {\varvec{X}}^\top \) is the projection matrix onto the column space of \({\varvec{X}}\). To simplify notation, let \( \widetilde{{\varvec{H}}} := {\varvec{I}}_n - {\varvec{H}}\); then (2.1) is equivalent to
$$\begin{aligned} \min _{{\varvec{P}}\in \varPi _n} ~ \Vert \widetilde{{\varvec{H}}} {\varvec{P}} {\varvec{y}} \Vert ^2 ~~~ \mathrm {s.t.} ~~~ \mathsf {dist}({\varvec{P}}, {\varvec{I}}_{n}) \le R \ . \end{aligned}$$(2.2)
Our local search approach for the optimization of Problem (2.2) is summarized in Algorithm 1.
![Algorithm 1: local search for problem (2.2)](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10107-022-01863-y/MediaObjects/10107_2022_1863_Figa_HTML.png)
At iteration k, Algorithm 1 finds a swap (within a distance of R from \({\varvec{I}}_n\)) that leads to the smallest objective value. To see the computational cost of (2.3), note that:
$$\begin{aligned} \Vert \widetilde{{\varvec{H}}} {\varvec{P}} {\varvec{y}} \Vert ^2 = \Vert \widetilde{{\varvec{H}}} ({\varvec{P}} - {\varvec{P}}^{(k)}) {\varvec{y}} \Vert ^2 + 2 \big \langle \widetilde{{\varvec{H}}} ({\varvec{P}} - {\varvec{P}}^{(k)}) {\varvec{y}} , \widetilde{{\varvec{H}}} {\varvec{P}}^{(k)} {\varvec{y}} \big \rangle + \Vert \widetilde{{\varvec{H}}} {\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 \ . \end{aligned}$$(2.4)
For each \({\varvec{P}}\) with \(\mathsf {dist}({\varvec{P}}, {\varvec{P}}^{(k)}) \le 2\), the vector \(({\varvec{P}}-{\varvec{P}}^{(k)}) {\varvec{y}}\) has at most two nonzero entries. Since we pre-compute \(\widetilde{{\varvec{H}}}\), computing the first term in (2.4) costs O(1) operations. As we retain a copy of \(\widetilde{{\varvec{H}}}{\varvec{P}}^{(k)}{\varvec{y}} \) in memory, computing the second term in (2.4) also costs O(1) operations. Therefore, computing (2.3) requires \(O(n^2)\) operations, as there are at most \(n^2\) possible swaps to search over. The \(O(n^2)\) per-iteration cost is quite reasonable for medium-sized examples with n being a few hundred to a few thousand, but might be expensive for larger examples. In Sect. 4, we propose a fast method to find an approximate solution of (2.3) that scales to instances with \(n \approx 10^7\) in a few minutes (see Sect. 5 for numerical findings).
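The following NumPy sketch is our own rendering of the greedy swap search described above (names and the stopping rule are ours, not the paper's). It caches \(\widetilde{{\varvec{H}}}{\varvec{P}}^{(k)}{\varvec{y}}\) so that each candidate swap is scored in O(1) time, giving the \(O(n^2)\) per-iteration cost:

```python
import numpy as np

def local_search(X, y, R, max_iter=100):
    """Greedy local search in the spirit of Algorithm 1: at each iteration,
    apply the transposition of two entries of P y that most decreases
    ||H_tilde P y||^2, subject to dist(P, I_n) <= R."""
    n, _ = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto col(X)
    Ht = np.eye(n) - H                      # H_tilde = I_n - H
    pi = np.arange(n)                       # current permutation: (P y)[i] = y[pi[i]]
    w = y.copy()                            # w = P y
    g = Ht @ w                              # cached H_tilde P y
    obj = float(w @ g)                      # ||Ht w||^2 = w^T Ht w (Ht idempotent)
    for _ in range(max_iter):
        base_dist = int(np.sum(pi != np.arange(n)))
        best_change, best_swap = 0.0, None
        for i in range(n):
            for j in range(i + 1, n):
                delta = w[j] - w[i]
                if delta == 0.0:
                    continue
                # O(1) objective change for swapping entries i and j:
                # f(w + delta (e_i - e_j)) - f(w)
                change = (2 * delta * (g[i] - g[j])
                          + delta ** 2 * (Ht[i, i] - 2 * Ht[i, j] + Ht[j, j]))
                if change < best_change:
                    # O(1) update of dist(P, I_n) for this candidate swap
                    dist = (base_dist - (pi[i] != i) - (pi[j] != j)
                            + (pi[j] != i) + (pi[i] != j))
                    if dist <= R:
                        best_change, best_swap = change, (i, j)
        if best_swap is None:               # no improving feasible swap: stop
            break
        i, j = best_swap
        delta = w[j] - w[i]
        g += delta * (Ht[:, i] - Ht[:, j])  # O(n) cache update
        w[i], w[j] = w[j], w[i]
        pi[i], pi[j] = pi[j], pi[i]
        obj += best_change
    beta = np.linalg.solve(X.T @ X, X.T @ w)  # least squares at the final P
    return pi, beta, obj
```

The O(1) incremental scoring rests on \(\widetilde{{\varvec{H}}}\) being a symmetric projection, so \(\Vert \widetilde{{\varvec{H}}}{\varvec{w}}\Vert ^2 = {\varvec{w}}^\top \widetilde{{\varvec{H}}}{\varvec{w}}\) and a two-entry change to \({\varvec{w}}\) perturbs the objective through only a handful of entries of \(\widetilde{{\varvec{H}}}\) and the cached vector.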
3 Theoretical guarantees for Algorithm 1
Here we present theoretical guarantees for Algorithm 1. The main assumptions and conclusions appear in Sect. 3.1. Section 3.2 presents the proofs of the main theorems. The development in Sects. 3.1 and 3.2 assumes that the problem data (i.e., \({\varvec{y}}, {\varvec{X}}, {\varvec{\epsilon }}\)) is deterministic. Section 3.3 discusses conditions on the distribution of the features and the noise term, under which the main assumptions hold true with high probability.
3.1 Main results
We now state the main theorems on the convergence of Algorithm 1; the proofs are presented in Sect. 3.2. For any \(m\le n\), define
$$\begin{aligned} {\mathcal {B}}_{m} := \left\{ {\varvec{u}} \in {\mathbb {R}}^n :~ \Vert {\varvec{u}} \Vert _0 \le m \right\} \ . \end{aligned}$$(3.1)
We first state the assumptions useful for our technical analysis.
Assumption 1
Suppose \({\varvec{X}}\), \({\varvec{y}}\), \({\varvec{\epsilon }}\), \({\varvec{\beta }}^*\) and \({\varvec{P}}^*\) satisfy the model (1.1) with \(\mathsf {dist}({\varvec{P}}^*, {\varvec{I}}_n) \le r\). Suppose the following conditions hold:
-
(1)
There exist constants \(U>L>0\) such that
$$\begin{aligned} \max _{i,j\in [n]} |y_i-y_j| \le U, ~~~ \text {and} ~~~ | ({\varvec{P}}^*{\varvec{y}})_i - y_i | \ge L ~~~~~ \forall i\in \mathsf {supp}({\varvec{P}}^*) \ . \end{aligned}$$ -
(2)
Set \(R=10 C_1rU^2/L^2+4\) for some constant \( C_1 > 1\).
-
(3)
There is a constant \(\rho _n = O(d \log (n)/n)\) such that \(R \rho _n \le L^2/(90U^2)\), and
$$\begin{aligned} \Vert {\varvec{H}} {\varvec{u}}\Vert ^2 \le \rho _n \Vert {\varvec{u}}\Vert ^2 ~~ \forall {\varvec{u}}\in {\mathcal {B}}_{4}, ~~\text {and} ~~\Vert {\varvec{H}} {\varvec{u}}\Vert ^2 \le R\rho _n \Vert {\varvec{u}}\Vert ^2 ~~ \forall {\varvec{u}}\in {\mathcal {B}}_{2R} \ . \end{aligned}$$(3.2) -
(4)
There is a constant \( {{\bar{\sigma }}}\ge 0\) satisfying \({{\bar{\sigma }}} \le \min \{0.5, (\rho _nd)^{-1/2}\} L^2/(80U) \) such that
$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert _{\infty } \le {{\bar{\sigma }}}, ~~ \Vert {\varvec{\epsilon }}\Vert _\infty \le {{\bar{\sigma }}}, ~~\text {and}~~ \Vert \varvec{H}{\varvec{\epsilon }} \Vert \le \sqrt{d} {{\bar{\sigma }}} \ . \end{aligned}$$(3.3)
Note that the lower bound in Assumption 1 (1) states that the y-value of a mismatched record is not too close to its original (pre-mismatch) value. Assumption 1 (2) states that R is set to a constant multiple of r. This constant can be large (\(\ge 10U^2/L^2\)), and appears to be an artifact of our proof techniques. Our numerical experience suggests that this constant can be much smaller in practice. Assumption 1 (3) is a restricted eigenvalue (RE)-type condition [25] stating that multiplying any (2R)-sparse vector by \(\varvec{H}\) results in a vector with small norm (when \(R\rho _n <1\)). Section 3.3 discusses conditions on the distribution of the rows of \({\varvec{X}}\) under which Assumption 1 (3) holds true with high probability. Note that if \(\rho _n = \varTheta (d \log (n)/n)\), then for the assumption \(R \rho _n \le L^2/(90U^2)\) to hold true, we require \(n/\log (n) = \varOmega (dr)\). Assumption 1 (4) limits the amount of noise \({\varvec{\epsilon }}\) in the problem. Section 3.3 presents conditions on the distributions of \({\varvec{\epsilon }}\) and \({\varvec{X}}\) (in a random design setting) which ensure that Assumption 1 (4) holds true with high probability.
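One can probe Assumption 1 (3) numerically. Because \({\varvec{H}}\) is a symmetric projection, for any \({\varvec{u}}\) supported on a set S we have \(\Vert {\varvec{H}}{\varvec{u}}\Vert ^2 = {\varvec{u}}_S^\top {\varvec{H}}_{SS} {\varvec{u}}_S\), so the sharp constant over supports of size m is \(\max _{\# S = m} \lambda _{\max }({\varvec{H}}_{SS})\). The sketch below (our own; it samples supports rather than enumerating them, so it only yields a lower estimate of \(\rho _n\)) illustrates that this quantity is small for a Gaussian design with \(n \gg d\):

```python
import numpy as np

def estimate_rho(X, m=4, n_samples=2000, seed=0):
    """Lower-estimate the best rho with ||H u||^2 <= rho ||u||^2 for all
    m-sparse u: for a support S, the sharp constant is lambda_max(H[S, S])."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    H = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto col(X)
    rho = 0.0
    for _ in range(n_samples):
        S = rng.choice(n, size=m, replace=False)
        rho = max(rho, float(np.linalg.eigvalsh(H[np.ix_(S, S)]).max()))
    return rho

X = np.random.default_rng(1).standard_normal((500, 5))  # n = 500, d = 5
rho_hat = estimate_rho(X)
# Any principal submatrix of a projection has eigenvalues in [0, 1];
# for this design the estimate sits far below 1.
assert 0.0 < rho_hat < 1.0
```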
Assumption 1 (3) plays an important role in our technical analysis. In particular, this allows us to approximate the objective function in (2.2) with one that is easier to analyze. To provide some intuition, we write \(\widetilde{{\varvec{H}}}{\varvec{P}}^{(k)}{\varvec{y}} = \widetilde{{\varvec{H}}}({\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}) + \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\)—noting that \(\widetilde{{\varvec{H}}}{\varvec{P}}^* {\varvec{y}} = \widetilde{{\varvec{H}}}({\varvec{X}}{\varvec{\beta }}^* + {\varvec{\epsilon }}) = \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\), and assuming that the noise \({\varvec{\epsilon }}\) is small, we have:
$$\begin{aligned} \Vert \widetilde{{\varvec{H}}} {\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 \approx \Vert \widetilde{{\varvec{H}}} ({\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^{*} {\varvec{y}}) \Vert ^2 \approx \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^{*} {\varvec{y}} \Vert ^2 \ . \end{aligned}$$(3.4)
Intuitively, the term on the right-hand side is the approximate objective that we analyze in our theory. Lemma 1 presents a one-step decrease property on the approximate objective function.
Lemma 1
(One-step decrease) Given any \({\varvec{y}}\in {\mathbb {R}}^n \) and \({\varvec{P}}, {\varvec{P}}^* \in \varPi _n\), there exists a permutation matrix \(\widetilde{{\varvec{P}}}\in \varPi _n\) such that \(\mathsf {dist}(\widetilde{{\varvec{P}}}, {\varvec{P}}) = 2\), \(\mathsf {supp}(\widetilde{{\varvec{P}}}({\varvec{P}}^*)^{-1}) \subseteq \mathsf {supp}( {\varvec{P}} ({\varvec{P}}^*)^{-1}) \) and
If in addition \( \Vert {\varvec{P}} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{0} \le m\) for some \(m\le n\), then
The main results make use of Lemma 1 and formalize the intuition conveyed in (3.4). We first present a result regarding the support of the permutation matrix \({\varvec{P}}^{(k)}\) delivered by Algorithm 1.
Proposition 1
(Support detection) Suppose Assumption 1 holds. Let \(\{{\varvec{P}}^{(k)}\}_{k \ge 0}\) be the permutation matrices generated by Algorithm 1. Then for all \(k\ge R/2\), it holds \( \mathsf {supp}({\varvec{P}}^{*}) \subseteq \mathsf {supp}({\varvec{P}}^{(k)}) \).
Proposition 1 states that the support of \({\varvec{P}}^*\) will be contained within the support of \({\varvec{P}}^{(k)}\) after at most R/2 iterations. Intuitively, this result is because of Assumption 1 (1), which assumes that the mismatches represented by \({\varvec{P}}^*\) have “strong signal”. Proposition 1 is also useful for the proofs of the main theorems below (e.g., see Claim 3.25 in the proof of Theorem 3 for details).
We now present some additional assumptions required for the results that follow.
Assumption 2
Let \(\rho _n\) and \({{\bar{\sigma }}}\) be parameters appearing in Assumption 1.
-
(1)
Suppose \(R^2 \rho _n \le 1/10\).
-
(2)
There is a constant \(\sigma \ge 0\) such that \(\bar{\sigma }^2 \le \sigma ^2 \min \{n/(660R^2), n/(5dR)\}\), and
$$\begin{aligned} \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \ge (1/2)n \sigma ^2. \end{aligned}$$(3.7)
In light of the discussion following Assumption 1, Assumption 2 (1) places a stricter condition on the size of n via the requirement \(R^2 \rho _n \le 1/10\). If \(\rho _n = \varTheta (d\log (n)/n)\), then we would need \(n/\log (n) =\varOmega ( dr^2)\), which is stronger than the condition \(n/\log (n) = \varOmega (dr)\) needed in Assumption 1.
Assumption 2 (2) imposes a lower bound on \(\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert \) – this can be equivalently viewed as an upper bound on \(\Vert {\varvec{H}} {\varvec{\epsilon }} \Vert \), in addition to the upper bound appearing in Assumption 1 (4). Section 3.3 provides a sufficient condition for Assumption 2 (2) to hold with high probability. In particular, in the noiseless case (\({\varvec{\epsilon }} = {\varvec{0}}\)), Assumption 1 (4) and Assumption 2 (2) hold with \({{\bar{\sigma }}} = \sigma = 0\).
We now state the first convergence result.
Theorem 3
(Linear convergence of objective up to noise level) Suppose Assumptions 1 and 2 hold with R being an even number. Let \(\{{\varvec{P}}^{(k)}\}_{k \ge 0}\) be the permutation matrices generated by Algorithm 1. Then for any \(k\ge 0\), we have
$$\begin{aligned} \Vert \widetilde{{\varvec{H}}} {\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 \le \left( 1 - \frac{1}{18 R} \right) ^{k} \Vert \widetilde{{\varvec{H}}} {\varvec{P}}^{(0)} {\varvec{y}} \Vert ^2 + 36 \Vert \widetilde{{\varvec{H}}} {\varvec{\epsilon }} \Vert ^2 \ . \end{aligned}$$(3.8)
In the special (noiseless) setting when \({\varvec{\epsilon }} = {\varvec{0}}\), Theorem 3 establishes that the sequence of objective values generated by Algorithm 1 converges to zero (i.e., the optimal objective value) at a linear rate. The parameter governing the linear rate of convergence depends upon the search width R. Following the discussion after Assumption 2, the sample-size requirement \(n/\log (n) = \varOmega (dr^2)\) is more stringent than that needed for the model to be identifiable (\(n\ge 2d\)) [24] in the noiseless setting. In particular, when \( n/(d\log n) = O(1)\), the number of mismatched pairs r needs to be bounded by a constant. Numerical evidence presented in Sect. 5 (for the noiseless case) suggests that the sample size n needed to recover \({\varvec{P}}^*\) is smaller than what is suggested by our theory.
In the noisy case (i.e. \({\varvec{\epsilon }}\ne {\varvec{0}}\)), the bound (3.8) provides an upper bound on the objective value consisting of two terms. The first term converges to 0 with a linear rate similar to the noiseless case. The second term is a constant multiple of the squared norm of the unavoidable noise termFootnote 3: \( \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \). In other words, Algorithm 1 finds a solution whose objective value is at most a constant multiple of the objective value at the true permutation \({\varvec{P}}^*\).
Theorem 3 proves a convergence guarantee on the objective value. The next result provides upper bounds on the \(\ell _{\infty }\)-norm of the mismatched entries i.e., \(\Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }\). For any \({\varvec{Q}} \in \varPi _n\), define
$$\begin{aligned} G({\varvec{Q}}) := \Vert \widetilde{{\varvec{H}}} {\varvec{Q}} {\varvec{y}} \Vert ^2 - \min _{ {\varvec{P}} \in {\mathcal {N}}_2({\varvec{Q}}) \cap {\mathcal {N}}_R({\varvec{I}}_n) } \Vert \widetilde{{\varvec{H}}} {\varvec{P}} {\varvec{y}} \Vert ^2 \ , \end{aligned}$$(3.9)
that is, \(G({\varvec{Q}}) \) is the decrease in the objective value after one step of local search starting at \({\varvec{Q}}\). For the permutation matrices \(\{{\varvec{P}}^{(k)}\}_{k \ge 0}\) generated by Algorithm 1, we know \(G({\varvec{P}}^{(k)}) = \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2 - \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k+1)} {\varvec{y}} \Vert ^2 \).
Theorem 4
(\(\ell _\infty \)-bound on mismatched pairs) Suppose Assumptions 1 and 2 hold, and let \(\{{\varvec{P}}^{(k)}\}_{k \ge 0}\) be the permutation matrices generated by Algorithm 1. Then for all \(k\ge 0\) it holds
$$\begin{aligned} \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^{*} {\varvec{y}} \Vert _{\infty }^2 \le 10 \, G({\varvec{P}}^{(k)}) + 800 {{\bar{\sigma }}}^2 \ . \end{aligned}$$(3.10)
Theorem 4 states that the largest squared error of the mismatched pairs (i.e., \(\Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty }^2\)) is bounded above by a constant multiple of the one-step decrease in objective value (i.e., \(G({\varvec{P}}^{(k)})\)) plus a term comparable to the noise level \(O({{\bar{\sigma }}}^2)\). In particular, if Algorithm 1 is terminated at an iterate \({\varvec{P}}^{(k)}\) with \( G({\varvec{P}}^{(k)})\) of the order of \({\bar{\sigma }}^2\), then \(\Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2_{\infty } \) is bounded by a constant multiple of \( {{\bar{\sigma }}}^2\).
Note that the constant 800 in (3.10) is conservative and may be improved with a careful adjustment of the constants appearing in the proof and in the assumptions.
In light of Theorem 4, we can prove an upper bound on the estimation error of \({\varvec{\beta }}^*\), using an additional assumption stated below.
Assumption 5
There exists a constant \({{\bar{\gamma }}} >0\) such that
where \({{\bar{\sigma }}}\) is as defined in Assumption 1.
Section 3.3 presents conditions on \({\varvec{X}},\varvec{\epsilon }\) under which Assumption 5 is satisfied with high probability.
Theorem 6
(Estimation error) Suppose Assumptions 1, 2 and 5 hold. Suppose iteration k of Algorithm 1 satisfies \(G({\varvec{P}}^{(k)}) \le c{{\bar{\sigma }}}^2\) for a constant \(c>0\). Let \({\varvec{\beta }}^{(k)} := ({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top {\varvec{P}}^{(k)} {\varvec{y}}\) and denote \({{\bar{c}}}:= 800+ 10c\). Then we have
and
Theorem 6 (cf. bound (3.11)) states that as long as k is sufficiently largeFootnote 4, the estimation error \(\Vert {\varvec{\beta }}^{(k)} - {\varvec{\beta }}^* \Vert ^2 \) is of the order \(O(({r+d}){{\bar{\sigma }}}^2/n)\), assuming \({\bar{\gamma }}\) is a constant. Therefore, as \(n\rightarrow \infty \) (with r, d fixed), the estimator delivered by our algorithm (after sufficiently many iterations) converges to the true regression coefficient vector, \({\varvec{\beta }}^*\). In addition, (3.12) provides an upper bound on the entrywise “denoising error” (left hand side of (3.12))—this is of the order \(O((r+d){{\bar{\sigma }}}^2/n)\). See [14] for past works and discussions on this error metric.
The following theorem provides an upper bound on the total number of local search steps needed to find a \({\varvec{P}}^{(k)}\) with \(G({\varvec{P}}^{(k)}) \le c {{\bar{\sigma }}}^2\).
Theorem 7
(Iteration complexity) Suppose Assumptions 1 and 2 hold. Let \(\{{\varvec{P}}^{(k)}\}_{k \ge 0}\) be the permutation matrices generated by Algorithm 1. Given any \(c>0\), define
Then there exists \(0\le k \le K^\dagger \) such that \(G({\varvec{P}}^{(k)}) \le c{{\bar{\sigma }}}^2\).
Proof
Denote
Then by Theorem 3, after \(K_1\) iterations, it holds
where the second inequality follows from Assumption 1 (4). Suppose \(G({\varvec{P}}^{(k)}) > c {{\bar{\sigma }}}^2 \) for all \(K_1 \le k \le K^\dagger -1\); then
which is a contradiction. So there must exist some \(K_1 \le k \le K^\dagger -1\) such that \(G({\varvec{P}}^{(k)}) \le c {{\bar{\sigma }}}^2 \). \(\square \)
Note that if R and \(\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2/ \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(0)} {\varvec{y}} \Vert ^2\) are bounded by a constant, then the number of iterations \(K^\dagger = O(n)\). Therefore, in this situation, one can find an estimate \({\varvec{\beta }}^{(k)}\) satisfying \( \Vert {\varvec{\beta }}^{(k)} - {\varvec{\beta }}^* \Vert ^2 \le O((d+r){{\bar{\sigma }}}^2/n)\) within O(n) iterations of Algorithm 1.
3.2 Proofs of main theorems
In this section, we present the proofs of Proposition 1, Theorem 3, Theorem 4 and Theorem 6. We first present a technical result used in our proofs.
Lemma 2
Suppose Assumption 1 holds. Let \(\{{\varvec{P}}^{(k)}\}_{k \ge 0}\) be the permutation matrices generated by Algorithm 1. Suppose \(\Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \ge L \) for some \( k\ge 1 \). Suppose at least one of the two conditions holds: (i) \(k\le R/2\); or (ii) \(k\ge R/2 + 1\), and \(\mathsf {supp}({\varvec{P}}^*) \subseteq \mathsf {supp}({\varvec{P}}^{(k')})\) for all \( R/2 \le k'\le k - 1\). Then for all \(t\le k-1\), we have
$$\begin{aligned} \Vert {\varvec{P}}^{(t+1)} {\varvec{y}} - {\varvec{P}}^{*} {\varvec{y}} \Vert ^2 - \Vert {\varvec{P}}^{(t)} {\varvec{y}} - {\varvec{P}}^{*} {\varvec{y}} \Vert ^2 \le - L^2/5 \ . \end{aligned}$$
The proof of Lemma 2 is presented in Sect. A.3. As mentioned earlier, our analysis makes use of the one-step decrease condition in Lemma 1. Note however, if the permutation matrix at the current iteration, denoted by \({\varvec{P}}^{(k)}\), is on the boundary, i.e. \(\mathsf {dist}({\varvec{P}}^{(k)}, {\varvec{I}}_n) = R\), it is not clear whether the permutation found by Lemma 1 is within the search region \({\mathcal {N}}_R({\varvec{I}}_n)\). Lemma 2 helps address this issue (See the proof of Theorem 3 below for details).
3.2.1 Proof of Proposition 1
We show this result by contradiction. Suppose that there exists a \(k\ge R/2\) such that \( \mathsf {supp}({\varvec{P}}^{*}) \not \subseteq \mathsf {supp}({\varvec{P}}^{(k)}) \). Let \( T\ge R/2 \) be the first iteration (\(\ge R/2\)) such that \( \mathsf {supp}({\varvec{P}}^{*}) \not \subseteq \mathsf {supp}({\varvec{P}}^{(T)}) \), i.e.,
Let \(i\in \mathsf {supp}({\varvec{P}}^{*})\) be an index with \(i\notin \mathsf {supp}({\varvec{P}}^{(T)})\); then by Assumption 1 (1), we have
By Lemma 2, we have \( \Vert {\varvec{P}}^{(k+1)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 - \Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 \le -L^2/5 \) for all \(k\le T-1\). As a result,
Since by Assumption 1 (1), \( \Vert {\varvec{P}}^{(0)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 = \Vert {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 \le r U^2 \ , \) we have
This is a contradiction, so such an iteration counter T does not exist; and for all \(k\ge R/2\), we have \( \mathsf {supp}({\varvec{P}}^{*}) \subseteq \mathsf {supp}({\varvec{P}}^{(k)}) \).
3.2.2 Proof of Theorem 3
Let \(a = 2.2\). Because \(R\ge 10r\), we have \(r +R \le 1.1R = aR/2\); and for any \(k\ge 0\):
Hence, by Lemma 1, there exists a permutation matrix \(\widetilde{{\varvec{P}}}^{(k)} \in \varPi _n\) such that \(\mathsf {dist}( \widetilde{{\varvec{P}}}^{(k)} , {\varvec{P}}^{(k)} ) \le 2 \), \(\mathsf {supp}( \widetilde{{\varvec{P}}}^{(k)} ({\varvec{P}}^*)^{-1}) \subseteq \mathsf {supp}( {\varvec{P}}^{(k)} ({\varvec{P}}^*)^{-1} ) \) and
As a result,
where the last inequality is from Assumption 1 (3). Note that by Assumption 2 (1), we have \(R^2 \rho _n \le 1/10 \le 1/(2a) \), so \( R\rho _n \le 1/(2aR) \). Because \( (1- 1/(aR)) \le (1-1/(2aR))^2 \), we have
Recall that (1.1) leads to \(\widetilde{{\varvec{H}}}{\varvec{P}}^* {\varvec{y}} = \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\), so we have
Let \(\eta := 1/(2aR)\) and \({\varvec{z}}:= \widetilde{{\varvec{P}}}^{(k)} {\varvec{y}} - {\varvec{P}}^{(k)} {\varvec{y}}\), then (3.17) leads to:
where, to arrive at the second inequality, we drop the term \(- \eta \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2\). We now make use of the following claim whose proof is in Sect. A.7:
On the other hand, by the Cauchy-Schwarz inequality,
Combining (3.18), (3.19) and (3.20), we have
After some rearrangement, the above leads to:
where the second inequality uses \( 1-3\eta /4 \le (1-\eta /2) (1-\eta /4)\) and \( (1-\eta /2)^{-1} \le 9/8 \) (recall, \(\eta = 1/(2aR)\)).
To complete the proof, we use another claim whose proof is in Sect. A.6:
By the above claim, the update rule (2.3) and inequality (3.21), we have
Using the notation \(a_k:= \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}} \Vert ^2\), \(\lambda =1-\eta /4\) and \( {\tilde{e}} = 9 \eta \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert ^2 \), the above inequality leads to: \( a_{k+1} \le \lambda a_k + {\tilde{e}}\) for all \(k\ge 0\). Therefore, we have
which implies \( a_k \le a_0 \lambda ^k + {\tilde{e}} \sum _{i=1}^k \lambda ^{i-1} \le a_0 \lambda ^k + ({{\tilde{e}}}/{(1-\lambda )}) \). This leads to
Recalling that \( 8a \le 18 \), we conclude the proof of the theorem.
3.2.3 Proof of Theorem 4
By the definition of \(G(\cdot )\), we have
By Lemma 1, there exists a permutation matrix \(\widetilde{{\varvec{P}}}^{(k)} \in \varPi _n\) such that
and
By Claim (3.22) we have
Therefore, by (3.23) and (3.25), we have
Let \({\varvec{z}} := \widetilde{{\varvec{P}}}^{(k)} {\varvec{y}} - {\varvec{P}}^{(k)} {\varvec{y}}\). Recall that \(\widetilde{{\varvec{H}}}{\varvec{P}}^* {\varvec{y}} = \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\), so by the inequality above we have
which is equivalent to
On the other hand, from (3.24) we have
or equivalently,
Summing up (3.26) and (3.27) we have
Note that
where the second inequality is by Assumption 1 (3) and the third inequality uses \(\Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _0 \le 2R\). From Assumption 2 (1) we have \(R\sqrt{\rho _n} \le 1/\sqrt{10}\), hence
where the last inequality is by the Cauchy-Schwarz inequality. Rearranging terms in (3.28), and making use of (3.30), we have
As a result,
By the definition of \({\varvec{z}}\), we know there exist \(i,j \in [n]\) such that
Therefore
where the last inequality makes use of Assumption 1 (4). On the other hand, by (3.27) we have
and hence
Combining (3.31), (3.32) and (3.33), we have
where the second inequality is by the Cauchy–Schwarz inequality. The inequality in display (3.34) leads to
which completes the proof of this theorem.
3.2.4 Proof of Theorem 6
Recall that \({\varvec{P}}^* {\varvec{y}} = {\varvec{X}} {\varvec{\beta }}^* + {\varvec{\epsilon }}\), so it holds \({\varvec{\beta }}^* = ({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top ({\varvec{P}}^* {\varvec{y}} - {\varvec{\epsilon }} )\). Therefore
Hence we have
where the second inequality is by Assumption 5 (2). Note that
where the first inequality is because \({\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}\) has at most 2R non-zero coordinates; the second inequality makes use of Theorem 4 and the definition of \({{\bar{c}}}\) in Theorem 6. On the other hand, by Assumption 5 (1) we have
Combining (3.36), (3.37) and (3.38) we have
Squaring both sides of the above, we get
which completes the proof of (3.11).
We will now prove (3.12). Let us denote \(\varvec{J}:= ( {\varvec{P}}^{(k)} )^{-1} {\varvec{X}} {\varvec{\beta }}^{(k)} - ({\varvec{P}}^*)^{-1} {\varvec{X}} {\varvec{\beta }}^*\). Note that we can write
Multiplying both sides of (3.35) by \(({\varvec{P}}^{(k)} )^{-1} {\varvec{X}}\), we have
On the other hand,
Combining (3.39), (3.40) and (3.41) we have
By (3.37), we know
By Assumption 1 (4) we have
Since \(\mathsf {dist}(({\varvec{P}}^*)^{-1} , ( {\varvec{P}}^{(k)} )^{-1}) \le 2R\), it holds
where the last inequality makes use of Assumption 1 (4).
Using (3.43), (3.44) and (3.45) to bound the r.h.s. of (3.42), we have
As a result,
This completes the proof of (3.12).
3.3 Sufficient conditions for assumptions to hold
Our analysis in Sects. 3.1 and 3.2 was entirely deterministic under Assumptions 1, 2 and 5. To provide some intuition, in the following we discuss some probability models on \({\varvec{X}}\) and \({\varvec{\epsilon }}\) under which Assumption 1 (3), (4), Assumption 2 (2) and Assumption 5 hold true with high probability.
3.3.1 A random model for the matrix \({\varvec{X}}\)
When the rows of \({\varvec{X}}\) are iid draws from a well-behaved probability distribution, Assumption 1 (3) and Assumption 5 (1) hold true with high probability. This is formalized in the following lemma.
Lemma 3
Suppose the rows of the matrix \({\varvec{X}}\): \(\varvec{x}_1,\ldots ,{\varvec{x}}_n\) are iid zero-mean random vectors in \({\mathbb {R}}^d\) with covariance matrix \({\varvec{\varSigma }}\in {\mathbb {R}}^{d\times d}\). Suppose there exist constants \(\gamma , b,V>0\) such that \( \lambda _{\min } ({\varvec{\varSigma }}) \ge \gamma \), \( \Vert {\varvec{x}}_i \Vert \le b \) and \( \Vert {\varvec{x}}_i \Vert _{\infty } \le V \) almost surely. Given any \(\tau >0\), define
Suppose n is large enough such that \(\sqrt{\delta _{n,m}} \ge 2/n\) and \( 3b^2 |||{\varvec{\varSigma }}|||_2 \log (2d/\tau ) / n\le (1/4)\gamma ^2 \). Then with probability at least \( 1-2\tau \), it holds \(\lambda _{\min } (\frac{1}{n} {\varvec{X}}^\top {\varvec{X}} ) \ge \gamma /2\), and
The proof of Lemma 3 is presented in Sect. A.4. Suppose there are universal constants \({{\bar{c}}}>0\) and \({{\bar{C}}}>0\) such that the parameters \((\gamma , V, b, |||{\varvec{\varSigma }} |||_2 ,\tau )\) in Lemma 3 satisfy \({{\bar{c}}} \le \gamma , V, b, |||{\varvec{\varSigma }} |||_2 ,\tau \le {{\bar{C}}}\). Given a pre-specified probability level (e.g., \(1-2\tau =0.99\)), under the setting of Lemma 3, if we set \(\rho _n = \delta _{n,4}\) and \({{\bar{\sigma }}} = \gamma /2\), then Assumption 1 (3) and Assumption 5 (1) are true with high probability (\(\ge 1-2\tau \)).
Note that the almost sure boundedness assumption on \(\Vert \varvec{x}_i\Vert \) can be relaxed to cases when \(\Vert {\varvec{x}}_i\Vert \) is bounded with high probability (e.g. \({\varvec{x}}_i {\mathop {\sim }\limits ^{\text {iid}}} N({\varvec{0}}, {\varvec{\varSigma }})\)).
3.3.2 The error distribution
In the following, we discuss a commonly used random setting under which Assumption 1 (4) and Assumption 2 (2) hold with high probability. A random variable \(\xi \) is called sub-Gaussian [25] with variance proxy \(\vartheta ^2\) (denoted by \(\xi \in \mathsf {subG}(\vartheta ^2)\)) if \({\mathbb {E}}[\xi ] = 0\) and \({\mathbb {E}}e^{t\xi } \le e^{\vartheta ^2 t^2 /2}\) for all \(t\in {\mathbb {R}}\).
Lemma 4
Suppose \({\varvec{\epsilon }} = [\epsilon _1,\ldots ,\epsilon _n]^\top \) with \(\epsilon _1,\ldots ,\epsilon _n {\mathop {\sim }\limits ^{\text {iid}}} \mathsf {subG}(\sigma ^2)\) for some \(\sigma >0 \). Suppose \({\varvec{\epsilon }}\) is independent of \({\varvec{X}}\). Then with probability (see footnote 5) at least \( 1-\tau \) it holds
- (a) \(\max \{\Vert {\varvec{\epsilon }} \Vert _{\infty }, \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert _{\infty } \} \le \sigma \sqrt{2\log (6n/\tau )}\).
- (b) \( \Vert {\varvec{H}} {\varvec{\epsilon }} \Vert \le \sigma \sqrt{2d\log (6d/\tau )} \).
- (c) \( \Vert ({\varvec{X}}^\top {\varvec{X}})^{-1} {\varvec{X}}^\top {\varvec{\epsilon }} \Vert \le \lambda ^{-1/2}_{\min } ( \frac{1}{n} {\varvec{X}}^\top {\varvec{X}} ) \cdot \sigma \sqrt{2d \log (6d/\tau )/n}\).
In addition, if \({{\widetilde{\sigma }}}^2 := \text {Var}(\epsilon _i) \ge (3/4) \sigma ^2\), then there exists a universal constant \(C>0\) such that if \(\sqrt{\log (4/\tau )/(Cn)} + 2d \log (4d/\tau )/n \le 1/4 \), then with probability at least \(1-\tau \),
The proof of Lemma 4 is presented in Sect. A.5. Note that in Lemma 4 the assumption \(\text {Var}(\epsilon _i) \ge (3/4) \sigma ^2\) can be replaced by \( \text {Var}(\epsilon _i) \ge C_0 \sigma ^2 \) for any constant \(C_0\), with the conclusion changing accordingly (i.e., 1/2 in (3.47) will be replaced by another constant). In particular, Lemma 4 holds true when \(\epsilon _i {\mathop {\sim }\limits ^{\text {iid}}} N(0,\sigma ^2)\). Given a pre-specified probability level (e.g., \(1-\tau =0.99\)), under the setting of Lemma 4, if we set \({{\bar{\sigma }}} = \sigma \sqrt{2\log (6n/\tau )}\), then Assumption 1 (4), Assumption 2 (2) and Assumption 5 (2) hold with probability at least \(1-2\tau \).
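To build intuition for Lemma 4 (a), a quick Monte Carlo check (our own sketch with Gaussian noise and illustrative parameter values) confirms that the bound \(\Vert {\varvec{\epsilon }}\Vert _{\infty } \le \sigma \sqrt{2\log (6n/\tau )}\) fails far less often than the nominal level \(\tau \):

```python
# Monte-Carlo sanity check (our illustration, not part of the paper's proof):
# for eps_i ~ N(0, sigma^2), the event max_i |eps_i| > sigma*sqrt(2*log(6n/tau))
# should occur with frequency well below tau.
import math
import random

random.seed(0)
n, sigma, tau, trials = 500, 0.1, 0.05, 2000
thresh = sigma * math.sqrt(2 * math.log(6 * n / tau))

violations = sum(
    max(abs(random.gauss(0.0, sigma)) for _ in range(n)) > thresh
    for _ in range(trials)
)
print(violations / trials)  # empirical failure rate (typically far below tau)
```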
3.3.3 Summary
We summarize parameter choices informed by the results above in the following corollary.
Corollary 1
Suppose the matrix \({\varvec{X}}\) is drawn from a probability model as discussed in Lemma 3 with \(m=R\); and the noise term \({\varvec{\epsilon }}\) satisfies the assumptions in Lemma 4. Then with probability at least \(1-6\tau \), the inequalities in (3.2), (3.3), (3.7) and (3.10) hold true with the following parameters
\({{\bar{\sigma }}} = \sigma \sqrt{2\log (6n/\tau )}\) and \({{\bar{\gamma }}} = \gamma /2\).
If in addition the conditions in Assumption 1 (1) and (2) hold true and the following four inequalities
are true, then all the statements in Assumptions 1, 2 and 5 are satisfied.
4 Approximate local search steps for computational scalability
As discussed in Sect. 2, the local search step (2.3) in Algorithm 1 costs \(O(n^2)\) for each iteration k—this can limit the scalability of Algorithm 1 to problems with a large n. Here we discuss an efficient method to find an approximate solution for step (2.3). Suppose that in the k-th iteration of Algorithm 1 the permutation \({\varvec{P}}^{(k)}\) satisfies \( \mathsf {dist}({\varvec{P}}^{(k)} , {\varvec{I}}_n) \le R-2 \), then update (2.3) is
Problem (4.1) can be equivalently formulated as
In view of Assumption 1 (3), for \(n\gg d\), we have \( \Vert \widetilde{{\varvec{H}}}({\varvec{e}}_i - {\varvec{e}}_j) \Vert ^2 \approx 2 \). Note that in general, \(\Vert \widetilde{{\varvec{H}}}({\varvec{e}}_i - {\varvec{e}}_j) \Vert ^2 \le 2\). Hence, one can approximately optimize (4.2) by minimizing an upper bound of the last two terms in the second line of display (4.2). This is given by:
Denoting \({\varvec{w}} := \widetilde{{\varvec{H}}}{\varvec{P}}^{(k)} {\varvec{y}}\) and \({\varvec{v}}: = {\varvec{y}} - {\varvec{w}}\), the objective in (4.3) is given by
So problem (4.3) is equivalent to
As we discuss below, the computation cost of the above problem can be reduced by making use of its structural properties. Let us denote \({\varvec{z}}_i = (y_i, v_i) \in {\mathbb {R}}^2\). Among the set of points \(\{{\varvec{z}}_1,\ldots , {\varvec{z}}_n\}\), we say \({\varvec{z}}_i\) is a “left-top” point if for all \(j\in [n]\),
We say \({\varvec{z}}_i\) is a “right-bottom” point if for all \(j\in [n]\),
Figure 1 shows an example of left-top and right-bottom points for a collection of \({\varvec{z}}_{i}\)’s with noisy \({\varvec{y}}\). It can be seen that the number of left-top and right-bottom points can be much smaller than the total number of points.
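Before exploiting this structure, note that solving (4.4) amounts to searching for the index pair minimizing \((y_i - y_j)(v_i - v_j)\) (the quantity computed below). A direct \(O(n^2)\) scan, shown here as a small Python sketch with made-up data, is the baseline that the left-top/right-bottom reduction accelerates:

```python
# A direct O(n^2) baseline for (4.4) -- our own sketch with made-up data.
def best_swap_bruteforce(y, v):
    n = len(y)
    best_val, best_pair = 0.0, None     # only non-positive values are useful
    for i in range(n):
        for j in range(i + 1, n):
            val = (y[i] - y[j]) * (v[i] - v[j])
            if val < best_val:
                best_val, best_pair = val, (i, j)
    return best_val, best_pair

y = [0.1, 2.0, -1.3, 0.7]
v = [0.5, -0.2, 1.1, 0.0]
best_val, best_pair = best_swap_bruteforce(y, v)
assert best_pair == (1, 2)              # the most negative product
```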
Let \((i^*,j^*)\) be an optimal solution to (4.4); then one of \(\{{\varvec{z}}_{i^*}, {\varvec{z}}_{j^*}\}\) must be a left-top point and the other a right-bottom point. Let \({\mathcal {Z}}_{lt}\) and \({\mathcal {Z}}_{rb}\) be the sets of left-top and right-bottom points respectively, and define
Then Problem (4.4) is equivalent to
implying that it suffices to compute the values of \((y_i - y_j) (v_i - v_j)\) for \(i\in {\mathcal {S}}_{lt}\) and \(j\in {\mathcal {S}}_{rb}\). Algorithm 2 describes how to compute \( {\mathcal {S}}_{lt}\) and \({\mathcal {S}}_{rb}\): this requires (a) performing a sorting operation on \({\varvec{y}}\), which can be done once with a cost of \(O(n\log (n))\); and (b) two additional passes over the data with cost O(n) (to be performed at every iteration of Algorithm 1).
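The following Python sketch illustrates this pruning. It is our reconstruction (the displayed definitions of left-top and right-bottom points are formatted as equations in the published version): a point is kept as left-top if no other point lies weakly to its left with a larger \(v\), and symmetrically for right-bottom; searching pairs over the two (typically much smaller) frontiers recovers the optimal value.

```python
# Pruned pair search for (4.4): compare the frontier-restricted search with
# the full O(n^2) scan on random data (our own illustration).
import random

def frontiers(y, v):
    order = sorted(range(len(y)), key=lambda i: (y[i], -v[i]))
    lt, best = [], float("-inf")
    for i in order:                     # increasing y: keep running max of v
        if v[i] > best:
            lt.append(i)
            best = v[i]
    rb, worst = [], float("inf")
    for i in reversed(order):           # decreasing y: keep running min of v
        if v[i] < worst:
            rb.append(i)
            worst = v[i]
    return lt, rb

def best_pair(y, v):
    lt, rb = frontiers(y, v)
    return min((y[i] - y[j]) * (v[i] - v[j]) for i in lt for j in rb)

random.seed(1)
y = [random.gauss(0.0, 1.0) for _ in range(200)]
v = [random.gauss(0.0, 1.0) for _ in range(200)]
full = min((y[a] - y[b]) * (v[a] - v[b]) for a in range(200) for b in range(200))
assert abs(best_pair(y, v) - full) < 1e-12   # pruned search matches full scan
```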
The computation of (4.5) can be further simplified as discussed in the following section.
![Algorithm 2](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10107-022-01863-y/MediaObjects/10107_2022_1863_Figb_HTML.png)
4.1 Faster computation of problem (4.5)
To simplify the computation of (4.5), we introduce a partial order ‘\(\preceq \)’ on the points in \({\mathbb {R}}^2\): for \({\varvec{p}},{\varvec{q}} \in {\mathbb {R}}^2\), we write \({\varvec{p}}\preceq {\varvec{q}}\) if \(p_1 \le q_1\) and \(p_2 \le q_2\). It is easy to check that for any two points \({\varvec{z}}_i, {\varvec{z}}_j \in {\mathcal {Z}}_{rb}\), either \({\varvec{z}}_i \preceq {\varvec{z}}_j\) or \({\varvec{z}}_j \preceq {\varvec{z}}_i\). So we can write \({\mathcal {Z}}_{rb}=\{{\varvec{z}}_{i_1}, \ldots , {\varvec{z}}_{i_{L'}}\}\) with
For any \({\varvec{z}}_{m} \in {\mathcal {Z}}_{lt}\), two cases can happen:
1. There is no point \({\varvec{z}}_{i_t} \in {\mathcal {Z}}_{rb}\) satisfying \((y_{i_t} - y_m) (v_{i_t} - v_m) \le 0\).
2. There exist \({{\bar{t}}} ,{{\bar{b}}} \in [L']\) with \( {{\bar{b}}} \le {{\bar{t}}}\) such that \( (y_m - y_{i_t}) (v_m - v_{i_t}) \le 0 \) for all \({{\bar{b}}} \le t \le {{\bar{t}}}\), and \( (y_m - y_{i_t}) (v_m - v_{i_t}) > 0 \) for all \(t>{{\bar{t}}}\) or \(t<{{\bar{b}}}\).
Because \({\mathcal {Z}}_{rb}\) is nicely ordered as in (4.6), Case 1 above can be identified by a bisection method with cost (at most) \( O(\log n)\). Similarly, for Case 2, the values of \({{\bar{t}}}\) and \({{\bar{b}}}\) can be found using bisection. Since the optimal value of (4.4) must be non-positive, we can compute \( (y_m - y_{i_t}) (v_m - v_{i_t}) \) only for \({{\bar{b}}} \le t \le {{\bar{t}}}\).
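Since the chain (4.6) is nondecreasing in both coordinates, the sign pattern of \((y_m - y_{i_t}) (v_m - v_{i_t})\) along \(t\) changes at most twice, so the window \([{{\bar{b}}}, {{\bar{t}}}]\) can be located with two bisections. A Python sketch under our reading of the two cases (synthetic chain data; exact ties are ignored):

```python
# Locate the contiguous window of non-positive products along the ordered
# right-bottom chain via bisection; verified against a linear scan.
import bisect
import random

def nonpositive_window(y_rb, v_rb, ym, vm):
    """Return (lo, hi): indices t with (ym - y_rb[t])*(vm - v_rb[t]) <= 0 form
    the contiguous range lo..hi (empty when lo > hi, i.e. Case 1 above).
    Both y_rb and v_rb are nondecreasing along the chain."""
    lo = min(bisect.bisect_left(y_rb, ym), bisect.bisect_left(v_rb, vm))
    hi = max(bisect.bisect_right(y_rb, ym), bisect.bisect_right(v_rb, vm)) - 1
    return lo, hi

random.seed(2)
L = 100
y_rb = sorted(random.gauss(0.0, 1.0) for _ in range(L))
v_rb = sorted(random.gauss(0.0, 1.0) for _ in range(L))
for _ in range(200):
    ym, vm = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    lo, hi = nonpositive_window(y_rb, v_rb, ym, vm)
    scan = [t for t in range(L) if (ym - y_rb[t]) * (vm - v_rb[t]) <= 0]
    assert scan == list(range(lo, hi + 1))   # bisection matches linear scan
```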
The methods described for solving (4.4) are summarized in Algorithm 2. Finally, note that when \(\mathsf {dist}({\varvec{P}}^{(k)}, {\varvec{I}}_n) \ge R-1\), similar ideas are still applicable. When \(\mathsf {dist}({\varvec{P}}^{(k)}, \varvec{I}_n) = R-1\), we consider the following problem:
Similarly, when \(\mathsf {dist}({\varvec{P}}^{(k)}, {\varvec{I}}_n) = R\), we consider:
Problems (4.7) and (4.8) can also be efficiently solved by finding the sets of left-top and right-bottom points and using the partial order to simplify the computation. We omit the details for brevity.
5 Experiments
We perform numerical experiments to study the performance of Algorithm 1.
Data generation. We consider the setup in our basic model (1.1), where entries of \({\varvec{X}}\in {\mathbb {R}}^{n\times d}\) are iid N(0, 1); \({\varvec{\beta }}^*\) is generated uniformly from the unit sphere in \({\mathbb {R}}^d\) (i.e., \(\Vert {\varvec{\beta }}^*\Vert = 1\)), and \({\varvec{\beta }}^*\) is independent of \({\varvec{X}}\). We consider two schemes for generating the permutation \({\varvec{P}}^*\): (a) Random scheme: select r coordinates uniformly from \(\{1, \ldots , n\}\). (b) Equi-spaced scheme: assume \(y_1\le \cdots \le y_n\) (otherwise re-order the data). Let \(a_1<\cdots <a_r\) be the sequence of r equi-spaced real numbers with \(a_1 = \min _{i\in [n]} y_i\) and \(a_r = \max _{i\in [n]} y_i\). Select r indices \(i_1<\cdots <i_r\) such that \(i_1 = \mathop {\mathrm{argmin}}_{i\in [n] } |y_i - a_1|\) and \(i_s = \mathop {\mathrm{argmin}}_{i_{s-1}+1 \le i \le n} |y_i - a_s|\) for all \(2\le s \le r\). In either scheme, after the r coordinates are chosen, we generate a uniformly distributed random permutation on these r coordinates (see footnote 6).
We generate \({\varvec{\epsilon }}\) (independent of \({\varvec{X}}\) and \({\varvec{\beta }}^*\)) with \(\epsilon _{i} {\mathop {\sim }\limits ^{\text {iid}}} N(0,\sigma ^2)\) for some \(\sigma \ge 0\) (\(\sigma = 0\) corresponds to the noiseless setting). Unless otherwise specified, we set the tolerance \({\mathfrak {e}} = 0\) and \(K=\infty \) in Algorithm 1.
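The random-scheme generation above can be sketched as follows. This is our own minimal implementation, assuming the model \({\varvec{P}}^* {\varvec{y}} = {\varvec{X}} {\varvec{\beta }}^* + {\varvec{\epsilon }}\) and one consistent convention for applying the permutation; names are illustrative.

```python
# Minimal sketch of the Sect. 5 data generation (random scheme).
import math
import random

random.seed(0)
n, d, r, sigma = 50, 5, 10, 0.1

X = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
g = [random.gauss(0.0, 1.0) for _ in range(d)]
nrm = math.sqrt(sum(t * t for t in g))
beta = [t / nrm for t in g]                 # uniform on the unit sphere

eps = [random.gauss(0.0, sigma) for _ in range(n)]
z = [sum(X[i][k] * beta[k] for k in range(d)) + eps[i] for i in range(n)]

# random scheme: a uniformly random permutation on r uniformly chosen coords
pi = list(range(n))
S = random.sample(range(n), r)
T = S[:]
random.shuffle(T)
for a, b in zip(S, T):
    pi[a] = b                               # pi moves only the r chosen coords

y = [0.0] * n
for i in range(n):
    y[pi[i]] = z[i]                         # mismatched responses: P* y = z

assert sum(yi != zi for yi, zi in zip(y, z)) <= r
```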
5.1 Experiments for the noiseless setting
We first consider the noiseless setting (\({\varvec{\epsilon }} = {\varvec{0}}\)) with different combinations of (d, r, n). We use the random scheme to generate the unknown permutation \({\varvec{P}}^*\). We set \(R = n\) in Algorithm 1 and a maximum iteration limit of 1000. While these parameter choices are not covered by our theory, in practice when r is small, our local search algorithm converges to optimality, and the number of iterations is bounded by a small constant multiple of r (e.g., for \(r=50\), the algorithm converges to optimality within around 60 iterations).
Figure 2 presents preliminary results on examples with \(n = 500\), \(d\in \{20,50,100,200\}\), and 40 roughly equi-spaced values of \(r \in [10,400]\). In Fig. 2 [left panel], we plot the Hamming distance between the solution \(\hat{{\varvec{P}}}\) computed by Algorithm 1 and the underlying permutation \({\varvec{P}}^*\) (i.e., \(\mathsf {dist}(\hat{{\varvec{P}}}, {\varvec{P}}^*)\)) versus r. In Fig. 2 [right panel], we present errors in estimating \({\varvec{\beta }}^*\) versus r. More precisely, let \( \hat{\varvec{\beta }}\) be the solution computed by Algorithm 1 (i.e., \(\hat{\varvec{\beta }}= ({\varvec{X}}^\top {\varvec{X}})^{-1}{\varvec{X}}^\top \hat{{\varvec{P}}}{\varvec{y}}\)); then the beta error is defined as \( \Vert \hat{\varvec{\beta }}- {\varvec{\beta }}^* \Vert /\Vert {\varvec{\beta }}^* \Vert \). For each choice of (r, d), we average over 50 independent replications (the vertical bars show standard errors, which are hardly visible in the figures). As shown in Fig. 2, when r is small, the underlying permutation \({\varvec{P}}^*\) can be exactly recovered, and thus the corresponding beta error is also 0. As r becomes larger, Algorithm 1 fails to recover \({\varvec{P}}^*\) exactly, and \(\mathsf {dist}({\varvec{P}}^*, \hat{{\varvec{P}}})\) is close to the maximal possible value \(n = 500\). In contrast, the estimation error varies more smoothly: as r increases, the beta error increases. We also observe that the recovery of \({\varvec{P}}^*\) depends upon the number of covariates d: permutation recovery performance deteriorates with increasing d. This is consistent with our theory suggesting that the performance of our algorithm depends upon both r and d.
5.2 Experiments for the noisy setting
We explore the performance of Algorithm 1 under the noisy setting (\({\varvec{\epsilon }} \ne {\varvec{0}}\)).
Performance for different values of R: We define Relative Obj as the objective value computed by Algorithm 1 divided by \(\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2\). Figure 3 presents the Relative Obj, beta error and Hamming distance of the local search algorithm for different values of R (the x-axis corresponds to the values of R). Here we consider \(n=1000\), \(d=10\), \(r=10\) and \(\sigma =0.1\); and use the equi-spaced scheme to choose the mismatched coordinates in \({\varvec{P}}^*\). We highlight the value at \(R = r = 10\) by a red point. As shown in Fig. 3, as R increases, the Relative Obj decreases below 1; this is consistent with our theory stating that with a proper choice of R, the final objective value will be below a constant multiple of \(\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2\).
As R increases, unlike the Relative Obj profile, the beta error and Hamming distance first decrease and then increase. This appears to suggest that when R is too large, Algorithm 1 can overfit, and further regularization may be necessary to mitigate overfitting. A detailed investigation of this matter is left as future work. In this example, the best beta error and Hamming distance are achieved when R equals r. Note that in Fig. 3 [left panel], the Relative Obj is close to 1 when we choose R close to r. Therefore, if we have a good estimate of the noise level \(\sigma \) (but the exact value of r is not available), we can choose a value of R at which the Relative Obj is approximately 1.
Finally, we note that in the noisy case, the local search method cannot exactly recover \({\varvec{P}}^*\). Indeed, in the noisy case, if a solution to (1.2) has to exactly recover \({\varvec{P}}^*\), we need to take a smaller value of \(\sigma \) (see discussions in [15]). Even though in our example, we cannot exactly recover \({\varvec{P}}^*\), we may still be able to obtain a good estimate for \({\varvec{\beta }}^*\)—see Fig. 3 [middle panel].
Estimating \({\varvec{P}}^*, {\varvec{\beta }}^*\) under different noise levels: For a given \(\sigma \) (standard deviation of the noise), let relative beta error be the value \(\Vert \hat{{\varvec{\beta }}} - {\varvec{\beta }}^* \Vert / ( \sigma \Vert {\varvec{\beta }}^*\Vert )\), where \(\hat{{\varvec{P}}}\) and \(\hat{{\varvec{\beta }}}\) are the estimates available from Algorithm 1 upon termination.
We consider an example with \(n = 500\), \(d=r=10\), and different values of \(\sigma \in \{ 0.01, 0.03, 0.1, 0.3, 1.0\}\), and use the random scheme to generate the unknown permutation \({\varvec{P}}^*\). We run Algorithm 1 with the setting \(R = r\). Figure 4 presents the values of Hamming distance and relative beta error under different noise levels. In Fig. 4 [left panel], it can be seen that as \(\sigma \) increases, the Hamming distance also increases: recovering \({\varvec{P}}^*\) becomes harder as the noise level grows. In Fig. 4 [right panel], we can see that the relative beta error barely changes across different values of \(\sigma \). This appears to be consistent with our conclusion in Theorem 6 that \(\Vert \hat{{\varvec{\beta }}} - {\varvec{\beta }}^* \Vert \) is bounded by a value proportional to \(\sigma \).
5.3 Comparisons with existing methods
We compare across the following methods for (1.3):
- AltMin: The alternating minimization method of [9]. We initialize with \({\varvec{P}} ={\varvec{I}}_n\) and \({\varvec{\beta }} = {\varvec{0}}\), and alternately minimize over \({\varvec{P}}\) and \({\varvec{\beta }}\) until no improvement can be made.
- StoEM: The stochastic expectation maximization method of [2]. We run the algorithm for 30 steps under the default setting.
- S-BD: The robust regression relaxation method of [18]. We set the regularization parameter \(\lambda = 4(1+M)\sigma \sqrt{2\log (n)/n}\) with \(M=1\) (this value of \(\lambda \) is given by Theorem 2 of [18]).
For both AltMin and StoEM, we use the implementation provided in the repository (see footnote 7) accompanying the paper [2]. We compare these methods with Algorithm 1 (denoted by Alg-1) and the variant of Algorithm 1 with the fast approximate local search steps introduced in Sect. 4 (denoted by Fast-Alg-1). We run Alg-1 and Fast-Alg-1 with \(R = r\).
Table 1 presents the beta errors of different methods on an example with \(n = 500\), \(d=10\); and \({\varvec{P}}^* \) chosen by the random scheme. The presented values are the average of 10 independent replications.
As shown in Table 1, in the noiseless setting, Alg-1 can recover the true value of \({\varvec{\beta }}^*\) for r up to 300. Fast-Alg-1 behaves quite similarly to Alg-1, though its performance is marginally worse for larger values of r. In contrast, AltMin and StoEM are not able to exactly estimate \({\varvec{\beta }}^*\) even for small values of r, and for all values of r they have large beta errors. S-BD, which is based on convex optimization, can also recover \({\varvec{\beta }}^*\) for \(r\le 200\), but for \(r = 300\) its performance degrades and it incurs a large beta error.
In the noisy setting, with \(\sigma = 0.1\), Alg-1 and Fast-Alg-1 have similar performance, and compute a value of \({\varvec{\beta }}\) with much smaller beta error compared to AltMin and StoEM. For small values of r (\(\le 100\)), S-BD has performance similar to Alg-1 and Fast-Alg-1, while for \(r = 200\) and 300 its performance degrades, with a much larger beta error than Alg-1 and Fast-Alg-1.
5.4 Scalability to large instances
We explore the scalability of our proposed approach to large-n problems (from \(n\approx 10^4\) up to \(n\approx 10^7\)); for these instances, Fast-Alg-1 appears to be computationally attractive. All these experiments are run on the MIT engaging cluster with 1 CPU and 16GB memory. The code is written in Julia 1.2.0.
We consider examples with \(d = 100\), \(r = 50\) and \(n\in \{10^4, 10^5, 10^6, 10^7\}\). Here, the mismatched coordinates of \({\varvec{P}}^*\) are chosen based on the random scheme. We set \(R = r\) for all instances. For these examples, we do not form the \(n \times n\) matrices \(\widetilde{{\varvec{H}}}\) or \({\varvec{H}}\) explicitly, but compute a thin QR decomposition of \({\varvec{X}}\) (\({\varvec{Q}}\in {\mathbb {R}}^{n\times d}\)) and maintain \({\varvec{Q}} \) in memory. The results are presented in Table 2, where “total time” is the total runtime of Fast-Alg-1 upon termination, “QR time” is the time used for the QR decomposition, and “iterations” is the number of iterations taken by the local search method till convergence. All numbers reported in the table are averaged across 10 independent replications. As shown in Table 2, Fast-Alg-1 can solve examples with n up to \(10^7\) (and \(d = 100\)) within around 200 seconds; this runtime includes the time to complete around 60 iterations of local search steps and the time to perform the QR decomposition. The total runtime is empirically seen to be of the order O(n) as n increases. Note that the QR time (i.e., the time to perform the QR decomposition) can be viewed as a benchmark runtime for ordinary least squares. Hence, for the examples considered, the runtime of Fast-Alg-1 appears to be a constant multiple of the runtime of ordinary least squares (the total time will increase with r and/or R). Interestingly, it can be seen that the runtimes for the noisy case (\(\sigma = 0.1\)) are smaller than for the noiseless case (\(\sigma = 0\)). We believe this is because Algorithm 2 is faster in the noisy case: there, we empirically observe fewer “left-top” and “right-bottom” points than in the noiseless case.
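The memory-saving device used here can be sketched in a few lines. With a thin QR factorization \({\varvec{X}} = {\varvec{Q}}{\varvec{R}}\), where \({\varvec{Q}}\) has orthonormal columns, \({\varvec{H}}{\varvec{v}} = {\varvec{Q}}({\varvec{Q}}^\top {\varvec{v}})\) and \(\widetilde{{\varvec{H}}}{\varvec{v}} = {\varvec{v}} - {\varvec{Q}}({\varvec{Q}}^\top {\varvec{v}})\) cost \(O(nd)\), so the \(n\times n\) projection matrices are never formed. A pure-Python illustration (the actual experiments use Julia; classical Gram–Schmidt is adequate for this tiny well-conditioned example):

```python
# Thin-QR projection trick: H~ v = v - Q(Q^T v) in O(n d).
import random

def thin_q(X):
    """Orthonormal columns spanning col(X), via classical Gram-Schmidt."""
    n, d = len(X), len(X[0])
    Q = []
    for k in range(d):
        v = [X[i][k] for i in range(n)]
        for q in Q:
            dot = sum(vi * qi for vi, qi in zip(v, q))
            v = [vi - dot * qi for vi, qi in zip(v, q)]
        nrm = sum(vi * vi for vi in v) ** 0.5
        Q.append([vi / nrm for vi in v])
    return Q

def residual_project(Q, v):
    """Compute H~ v = v - Q (Q^T v) without forming an n x n matrix."""
    out = v[:]
    for q in Q:
        dot = sum(vi * qi for vi, qi in zip(v, q))
        out = [oi - dot * qi for oi, qi in zip(out, q)]
    return out

random.seed(0)
n, d = 30, 4
X = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
Q = thin_q(X)
resid = residual_project(Q, [X[i][0] for i in range(n)])
assert max(abs(t) for t in resid) < 1e-9   # H~ annihilates col(X)
```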
Notes
1. We draw inspiration from the local search method presented in [10] in the context of a different problem: high dimensional sparse regression.
2. The extended abstract [12], which is a shorter version of this manuscript, studies the noiseless setting.
3. Recall that the objective value at \({\varvec{P}} = {\varvec{P}}^*\) is \( \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^* {\varvec{y}} \Vert ^2= \Vert \widetilde{{\varvec{H}}}({\varvec{X}} {\varvec{\beta }}^* + {\varvec{\epsilon }}) \Vert ^2= \Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }}\Vert ^2\).
4. We note that in Algorithm 1, as \(k \rightarrow \infty \), the quantity \(G({\varvec{P}}^{(k)})\rightarrow 0\), and the condition \(G({\varvec{P}}^{(k)}) \le c{{\bar{\sigma }}}^2\) will hold for k sufficiently large.
5. The probability statements here are conditional on \({\varvec{X}}\).
6. Note that \({\varvec{P}}^*\) may not satisfy \( \mathsf {dist}({\varvec{P}}^* , {\varvec{I}}_n) = r\), but \( \mathsf {dist}({\varvec{P}}^* ,{\varvec{I}}_n)\) will be close to r.
References
Abid, A., Poon, A., Zou, J.: Linear regression with shuffled labels. arXiv preprint arXiv:1705.01342 (2017)
Abid, A., Zou, J.: Stochastic EM for shuffled linear regression. arXiv preprint arXiv:1804.00681 (2018)
Balakrishnan, A.V.: On the problem of time jitter in sampling. IRE Trans. Inf. Theory 8(3), 226–236 (1962)
Blackman, S.S.: Multiple-target tracking with radar applications. Artech House, Norwood, MA (1986)
DeGroot, M.H., Feder, P.I., Goel, P.K.: Matchmaking. Ann. Math. Stat. 42(2), 578–593 (1971)
DeGroot, M.H., Goel, P.K.: The matching problem for multivariate normal data. Sankhyā: The Indian Journal of Statistics, Series B 14–29 (1976)
Dokmanić, I.: Permutations unlabeled beyond sampling unknown. IEEE Signal Process. Lett. 26(6), 823–827 (2019)
Emiya, V., Bonnefoy, A., Daudet, L., Gribonval, R.: Compressed sensing with unknown sensor permutation. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1040–1044. IEEE (2014)
Haghighatshoar, S., Caire, G.: Signal recovery from unlabeled samples. IEEE Trans. Signal Process. 66(5), 1242–1257 (2017)
Hazimeh, H., Mazumder, R.: Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Oper. Res. 68(5), 1517–1537 (2020)
Hsu, D.J., Shi, K., Sun, X.: Linear regression without correspondence. In: Advances in Neural Information Processing Systems, 1531–1540 (2017)
Mazumder, R., Wang, H.: Linear regression with mismatched data: A provably optimal local search algorithm. In: Integer Programming and Combinatorial Optimization: 22nd International Conference, IPCO 2021, Atlanta, GA, USA, May 19–21, 2021, Proceedings 22, 443–457. Springer (2021)
Neter, J., Maynes, E.S., Ramanathan, R.: The effect of mismatching on the measurement of response errors. J. Am. Stat. Assoc. 60(312), 1005–1027 (1965)
Pananjady, A., Wainwright, M.J., Courtade, T.A.: Denoising linear models with permuted data. In: 2017 IEEE International Symposium on Information Theory (ISIT), 446–450. IEEE (2017)
Pananjady, A., Wainwright, M.J., Courtade, T.A.: Linear regression with shuffled data: Statistical and computational limits of permutation recovery. IEEE Trans. Inf. Theory 64(5), 3286–3300 (2017)
Peng, L., Tsakiris, M.C.: Linear regression without correspondences via concave minimization. IEEE Signal Process. Lett. 27, 1580–1584 (2020)
Shi, X., Li, X., Cai, T.: Spherical regression under mismatch corruption with application to automated knowledge translation. J. Am. Stat. Assoc. 1–12 (2020)
Slawski, M., Ben-David, E.: Linear regression with sparsely permuted data. Electron. J. Stat. 13(1), 1–36 (2019)
Slawski, M., Ben-David, E., Li, P.: Two-stage approach to multivariate linear regression with sparsely mismatched data. J. Mach. Learn. Res. 21(204), 1–42 (2020)
Slawski, M., Diao, G., Ben-David, E.: A pseudo-likelihood approach to linear regression with partially shuffled data. J. Comput. Graph. Stat. 1–13 (2021)
Slawski, M., Rahmani, M., Li, P.: A sparse representation-based approach to linear regression with partially shuffled labels. In: Uncertainty in Artificial Intelligence, 38–48. PMLR (2020)
Stachniss, C., Leonard, J.J., Thrun, S.: Simultaneous localization and mapping. In: Springer Handbook of Robotics, 1153–1176. Springer (2016)
Tsakiris, M.C., Peng, L., Conca, A., Kneip, L., Shi, Y., Choi, H.: An algebraic-geometric approach for linear regression without correspondences. IEEE Trans. Inf. Theory 66(8), 5130–5144 (2020)
Unnikrishnan, J., Haghighatshoar, S., Vetterli, M.: Unlabeled sensing with random linear measurements. IEEE Trans. Inf. Theory 64(5), 3237–3253 (2018)
Wainwright, M.J.: High-dimensional statistics: A non-asymptotic viewpoint, vol. 48. Cambridge University Press, UK (2019)
Wang, G., Zhu, J., Blum, R.S., Willett, P., Marano, S., Matta, V., Braca, P.: Signal amplitude estimation and detection from unlabeled binary quantized samples. IEEE Trans. Signal Process. 66(16), 4291–4303 (2018)
Acknowledgements
The authors would like to thank the anonymous referees for their constructive comments that led to an improvement of the paper.
Funding
Open Access funding provided by the MIT Libraries
A subset of this paper appeared in IPCO 2021 [12]. This research was supported in part, by grants from the Office of Naval Research: ONR-N000141812298 (YIP), the National Science Foundation: NSF-IIS-1718258, IBM and Liberty Mutual Insurance, awarded to Rahul Mazumder.
Proofs and technical results
1.1 Technical lemmas
Lemma 5
(Covariance estimation) Let \({\varvec{X}} = [\varvec{x}_1,\ldots ,{\varvec{x}}_n]^\top \) be a random matrix in \( {\mathbb {R}}^{n\times d} \). Suppose rows \({\varvec{x}}_1,\ldots ,{\varvec{x}}_n\) are iid zero-mean random vectors in \({\mathbb {R}}^d\) with covariance matrix \({\varvec{\varSigma }}\in {\mathbb {R}}^{d\times d}\). Suppose \(\Vert \varvec{x}_i\Vert \le b\) almost surely. Then for any \(t >0\), it holds
See, e.g., Corollary 6.20 of [25] for a proof.
Lemma 6
Suppose three permutation matrices \({\varvec{P}}, \widetilde{{\varvec{P}}}, {\varvec{Q}} \in \varPi _n\) satisfy
then it holds \(\mathsf {supp}(\widetilde{{\varvec{P}}}) \subseteq \mathsf {supp}( {\varvec{P}}) \).
Proof
We just need to show that for any \(i\notin \mathsf {supp}({\varvec{P}})\), it holds \( i\notin \mathsf {supp}(\widetilde{{\varvec{P}}})\). Let \(i\notin \mathsf {supp}({\varvec{P}})\), then \({\varvec{e}}_i^\top {\varvec{P}} = {\varvec{e}}_i^\top \). Since \( \mathsf {supp}({\varvec{Q}} ) \subseteq \mathsf {supp}({\varvec{P}}) \), we also have \( {\varvec{e}}_i^\top {\varvec{Q}} = {\varvec{e}}_i^\top \). So it holds \( {\varvec{e}}_i^\top {\varvec{P}} {\varvec{Q}}^{-1} = {\varvec{e}}_i^\top \) or equivalently \( i \notin \mathsf {supp}({\varvec{P}} {\varvec{Q}}^{-1} ) \). Because \( \mathsf {supp}(\widetilde{{\varvec{P}}}{\varvec{Q}}^{-1} ) \subseteq \mathsf {supp}({\varvec{P}} {\varvec{Q}}^{-1} )\), we have \( i \notin \mathsf {supp}(\widetilde{{\varvec{P}}}{\varvec{Q}}^{-1} ) \), or equivalently \( {\varvec{e}}_i^\top \widetilde{{\varvec{P}}}{\varvec{Q}}^{-1} = {\varvec{e}}_i^\top \). This implies \( {\varvec{e}}_i^\top \widetilde{{\varvec{P}}}= {\varvec{e}}_i^\top {\varvec{Q}} = {\varvec{e}}_i^\top \), or equivalently, \( i \notin \mathsf {supp}(\widetilde{{\varvec{P}}}) \). \(\square \)
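The statement can also be sanity-checked numerically. In the sketch below (our own illustration), permutations are stored as index maps, \(\mathsf {supp}({\varvec{P}})\) is the set of non-fixed points, and we verify the implication on random triples satisfying the two support inclusions used in the proof:

```python
# Randomized check of Lemma 6 on small permutations.
import itertools
import random

def supp(p):                      # indices moved by the permutation
    return {i for i, pi in enumerate(p) if pi != i}

def compose(p, q):                # (P Q)(i) = p[q[i]]
    return [p[qi] for qi in q]

def inverse(p):
    inv = [0] * len(p)
    for i, pi in enumerate(p):
        inv[pi] = i
    return inv

random.seed(0)
n = 6
perms = [list(t) for t in itertools.permutations(range(n))]
hits = 0
for _ in range(5000):
    P, Pt, Q = random.choice(perms), random.choice(perms), random.choice(perms)
    if supp(Q) <= supp(P) and \
       supp(compose(Pt, inverse(Q))) <= supp(compose(P, inverse(Q))):
        assert supp(Pt) <= supp(P)   # conclusion of Lemma 6
        hits += 1
assert hits > 0                      # the hypotheses were actually exercised
```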
Lemma 7
Suppose Assumption 1 holds. Let \(\widetilde{{\varvec{P}}},{\varvec{P}} \in {\mathcal {N}}_R({\varvec{I}}_n) \subseteq \varPi _n\) with \(\mathsf {dist}({\varvec{P}}, \widetilde{{\varvec{P}}}) \le 4\). Let
then it holds \(| \varDelta (\widetilde{{\varvec{P}}}, {\varvec{P}}) - \varDelta _{\widetilde{{\varvec{H}}}} (\widetilde{{\varvec{P}}}, {\varvec{P}})| < L^2/5 \).
Proof
Let \({\varvec{z}} := \widetilde{{\varvec{P}}}{\varvec{y}} - {\varvec{P}}{\varvec{y}} \). Note that
and
where the last equality uses the fact \( \widetilde{{\varvec{H}}}{\varvec{P}}^* {\varvec{y}} = \widetilde{{\varvec{H}}}({\varvec{X}} {\varvec{\beta }}^* + {\varvec{\epsilon }}) = \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \). From (A.1) and (A.2) we have
where the second equality is because \(\widetilde{{\varvec{H}}}= {\varvec{I}}_n - {\varvec{H}}\). As a result,
Since \( \mathsf {dist}({\varvec{P}}, \widetilde{{\varvec{P}}}) \le 4 \), we have \( {\varvec{z}} = \widetilde{{\varvec{P}}}{\varvec{y}} - {\varvec{P}}{\varvec{y}} \in {\mathcal {B}}_4 \), hence by Assumption 1 (3) and (1) we have
Since \( \mathsf {dist}({\varvec{P}}, {\varvec{P}}^*) \le \mathsf {dist}({\varvec{P}}, {\varvec{I}}_n) + \mathsf {dist}({\varvec{P}}^*, {\varvec{I}}_n) \le R+r \), we have \( {\varvec{P}}{\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \in {\mathcal {B}}_{R+r} \subseteq {\mathcal {B}}_{2R}\), hence by Assumption 1 (3) and (1),
Again because \( {\varvec{z}} \in {\mathcal {B}}_4 \) we have
Combining (A.3) – (A.6) we have
where in the last inequality we use \(R> 4\). By Assumption 1 (4) we know \(\Vert {\varvec{\epsilon }}\Vert _{\infty } \le {{\bar{\sigma }}} \) and \(\Vert {\varvec{H}} {\varvec{\epsilon }} \Vert \le \sqrt{d} {{\bar{\sigma }}}\), so we have
From Assumption 1 (4) we have \({{\bar{\sigma }}} \le \min \{0.5, (\rho _nd)^{-1/2}\} L^2/(80U) \). This implies
Note that by Assumption 1 (3) we have \(R\rho _n \le L^2/(90U^2)\), or equivalently, \(9R\rho _nU^2 \le L^2/10\). Combining this with (A.7) and (A.8), we have
\(\square \)
Lemma 8
(Decrease in infinity norm) Let \({\varvec{P}},\widetilde{{\varvec{P}}}\in \varPi _n\) with \(\mathsf {dist}({\varvec{P}},\widetilde{{\varvec{P}}}) = 2\). For any \({\varvec{v}}\in {\mathbb {R}}^n\), if \( \Vert \widetilde{{\varvec{P}}}{\varvec{v}} - {\varvec{v}}\Vert ^2 < \Vert {\varvec{P}} {\varvec{v}} - \varvec{v}\Vert ^2 \), then it holds \( \Vert \widetilde{{\varvec{P}}}{\varvec{v}} - \varvec{v}\Vert _{\infty } \le \Vert {\varvec{P}} {\varvec{v}} - \varvec{v}\Vert _{\infty }\).
Proof
Let \(i\in [n]\) be the index such that \(|{\varvec{e}}_i^\top (\widetilde{{\varvec{P}}}{\varvec{v}} - {\varvec{v}})| = \Vert \widetilde{{\varvec{P}}}{\varvec{v}} - {\varvec{v}} \Vert _{\infty }\). If \( {\varvec{e}}_i^\top {\varvec{P}} = {\varvec{e}}_i^\top \widetilde{{\varvec{P}}}\), then immediately we have
If \( {\varvec{e}}_i^\top {\varvec{P}} \ne {\varvec{e}}_i^\top \widetilde{{\varvec{P}}}\), let \(\ell \in [n]\) be the index such that \( {\varvec{e}}_i^\top \widetilde{{\varvec{P}}}= {\varvec{e}}_{\ell }^\top {\varvec{P}} \). Since \(\mathsf {dist}({\varvec{P}},\widetilde{{\varvec{P}}}) = 2\), it holds \( {\varvec{e}}^\top _{\ell } \widetilde{{\varvec{P}}}= {\varvec{e}}_i^\top {\varvec{P}} \). Denote \( i_+:= \pi _{{\varvec{P}}}(i) \) and \(\ell _+:= \pi _{{\varvec{P}}}(\ell )\). Because \( {\varvec{e}}_{i_+}^\top = {\varvec{e}}_i^\top {\varvec{P}} = {\varvec{e}}_\ell ^\top \widetilde{{\varvec{P}}}\) and \({\varvec{e}}_{\ell _+}^\top = {\varvec{e}}_\ell ^\top {\varvec{P}} = {\varvec{e}}_i^\top \widetilde{{\varvec{P}}}\), it holds \( i_+ = \pi _{\widetilde{{\varvec{P}}}}(\ell ) \) and \(\ell _+ = \pi _{\widetilde{{\varvec{P}}}}(i) \). As a result, we have
By the assumption that \( \Vert \widetilde{{\varvec{P}}}{\varvec{v}} - {\varvec{v}}\Vert ^2 < \Vert {\varvec{P}} {\varvec{v}} - {\varvec{v}}\Vert ^2 \), we have
Let us denote \(L:= \Vert \widetilde{{\varvec{P}}}{\varvec{v}}- {\varvec{v}} \Vert _{\infty } = |{\varvec{e}}_i^\top (\widetilde{{\varvec{P}}}{\varvec{v}} - {\varvec{v}})| = |v_i - v_{\ell _+}|\). In what follows, we consider the different cases arising from the ordering of the values \(v_i\), \( v_{i_+} \), \(v_{\ell }\) and \(v_{\ell _+}\). Without loss of generality, we can assume \( v_{i_+} \ge v_i \). This leaves 12 possible orderings of \(v_i\), \( v_{i_+} \), \(v_{\ell }\) and \(v_{\ell _+}\): the first 6 cases below correspond to \( v_{\ell _+} \ge v_{\ell } \), and the last 6 cases correspond to \( v_{\ell _+} \le v_{\ell } \).
Case 1: \( v_{i_+} \ge v_{\ell _+} \ge v_i \ge v_{\ell } \). Then we have \(\Vert {\varvec{P}} {\varvec{v}} - {\varvec{v}}\Vert _{\infty } \ge |v_i - v_{i_+}| \ge |v_i - v_{\ell _+}| = L \).
Case 2: \( v_{i_+}\ge v_i \ge v_{\ell _+} \ge v_{\ell } \). Let \(a = v_{i_+} - v_{i}\), \( b = v_i - v_{\ell _+}\) and \(c= v_{\ell _+} - v_{\ell } \). Then
This is a contradiction to (A.9). So this case cannot appear.
Case 3: \( v_{i_+} \ge v_{\ell _+} \ge v_{\ell } \ge v_i\). Then we have \(\Vert {\varvec{P}} {\varvec{v}} - {\varvec{v}}\Vert _{\infty } \ge |v_i - v_{i_+}| \ge |v_i - v_{\ell _+}| = L \).
Case 4: \( v_{\ell _+} \ge v_{i_+} \ge v_{\ell } \ge v_{i} \). Let \(a = v_{\ell _+} - v_{i_+} \), \(b = v_{i_+} - v_{\ell }\) and \( c = v_{\ell } - v_{i}\). Then
This is a contradiction to (A.9). So this case cannot appear.
Case 5: \( v_{\ell _+}\ge v_{\ell } \ge v_{i_+} \ge v_{i} \). Let \( a= v_{\ell _+}- v_{\ell }\), \(b = v_{\ell } - v_{i_+}\) and \(c = v_{i_+} - v_{i} \). Then
This is a contradiction to (A.9). So this case cannot appear.
Case 6: \( v_{\ell _+}\ge v_{i_+} \ge v_{i}\ge v_{\ell } \). Then \(\Vert {\varvec{P}} {\varvec{v}} - \varvec{v}\Vert _{\infty } \ge |v_\ell - v_{\ell _+}| \ge |v_i - v_{\ell _+}| = L \).
Case 7: \( v_{i_+}\ge v_{\ell } \ge v_{i}\ge v_{\ell _+} \). Then \(\Vert {\varvec{P}} {\varvec{v}} - \varvec{v}\Vert _{\infty } \ge |v_\ell - v_{\ell _+}| \ge |v_i - v_{\ell _+}| = L \).
Case 8: \( v_{i_+} \ge v_{i} \ge v_{\ell } \ge v_{\ell _+} \). Let \(a = v_{i_+} - v_{i}\), \( b = v_{i} - v_{\ell } \), \( c = v_{\ell } - v_{\ell _+} \). Then
This is a contradiction to (A.9). So this case cannot appear.
Case 9: \( v_{i_+} \ge v_{\ell } \ge v_{\ell _+} \ge v_{i} \). Then \(\Vert {\varvec{P}} {\varvec{v}} - \varvec{v}\Vert _{\infty }\ge |v_i - v_{i_+}| \ge |v_i - v_{\ell _+}| = L \).
Case 10: \( v_{\ell } \ge v_{i_+} \ge v_{\ell _+} \ge v_{i} \). Then \( \Vert {\varvec{P}} {\varvec{v}} - \varvec{v}\Vert _{\infty } \ge |v_i - v_{i_+}| \ge |v_i - v_{\ell _+}| = L \).
Case 11: \( v_{\ell } \ge v_{\ell _+} \ge v_{i_+} \ge v_{i} \). Let \( a= v_{\ell } - v_{\ell _+} \), \(b = v_{\ell _+} - v_{i_+}\) and \(c = v_{i_+} - v_{i} \). Then
This is a contradiction to (A.9). So this case cannot appear.
Case 12: \( v_{\ell } \ge v_{i_+} \ge v_{i} \ge v_{\ell _+} \). Then \(\Vert {\varvec{P}} {\varvec{v}} - \varvec{v}\Vert _{\infty } \ge |v_\ell - v_{\ell _+}| \ge |v_i - v_{\ell _+}| = L \).
In view of all these cases, we have \( \Vert {\varvec{P}} {\varvec{v}} -{\varvec{v}}\Vert _{\infty } \ge L = \Vert \widetilde{{\varvec{P}}}{\varvec{v}} - {\varvec{v}}\Vert _{\infty } \). \(\square \)
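Lemma 8 can also be probed numerically. The Python sketch below (illustrative, not part of the proof) draws random permutations, applies a random transposition so that \(\mathsf {dist}({\varvec{P}},\widetilde{{\varvec{P}}}) = 2\), and confirms that whenever the squared norm strictly decreases, the infinity norm does not increase:

```python
import numpy as np

rng = np.random.default_rng(0)

def infinity_norm_check(n=8):
    p = rng.permutation(n)                  # pi_P
    i, l = rng.choice(n, size=2, replace=False)
    pt = p.copy()
    pt[i], pt[l] = pt[l], pt[i]             # swap two images: dist(P, Ptilde) = 2
    v = rng.normal(size=n)
    Pv, Ptv = v[p], v[pt]                   # (P v)_k = v_{pi_P(k)}
    if np.sum((Ptv - v) ** 2) < np.sum((Pv - v) ** 2):
        return np.max(np.abs(Ptv - v)) <= np.max(np.abs(Pv - v)) + 1e-12
    return True                             # hypothesis not met: nothing to check

ok = all(infinity_norm_check() for _ in range(20000))
print(ok)  # True: no violation found
```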
1.2 Proof of Lemma 1
Without loss of generality we assume \({\varvec{P}}^* = {\varvec{I}}_n\) (otherwise, we work with \({\varvec{P}} ({\varvec{P}}^*)^{-1} \) in place of \({\varvec{P}} \) and \( \widetilde{{\varvec{P}}}({\varvec{P}}^*)^{-1}\) in place of \(\widetilde{{\varvec{P}}}\)). For any \(k\in [n]\), let \(k_+:= \pi _{{\varvec{P}}}(k)\). Let i be the index such that \(( y_{i_+} - y_i)^2 = \Vert {\varvec{P}} {\varvec{y}}- {\varvec{y}}\Vert ^2_{\infty } \). Without loss of generality, we can assume \( y_{i_+} > y_i \). Denote \(i_0 = i\) and \(i_1 = i_+\). By the structure of a permutation, there exists a cycle
where \(q_1 \mathop {\longrightarrow }\limits ^{P} q_2\) means \(q_2 = \pi _{{\varvec{P}}}(q_1)\). By moving from \(y_i \) to \(y_{i_+}\), we “upcross” the value \( \frac{y_i+y_{i_+}}{2} \). Since the cycle (A.10) finally returns to \(i_0\), there exists one step where we “downcross” the value \( \frac{y_i+y_{i_+}}{2} \). In other words, there exists \(j\in [n]\) with \((j, j_+) \ne (i, i_+)\) such that \( y_{j_+} < y_j \) and \( \frac{y_i+y_{i_+}}{2} \in [y_{j_+}, y_j] \). Define \(\widetilde{{\varvec{P}}}\) as follows:
We note that \( \mathsf {dist}({\varvec{P}}, \widetilde{{\varvec{P}}}) = 2 \) and \(\mathsf {supp}(\widetilde{{\varvec{P}}}) \subseteq \mathsf {supp}( {\varvec{P}} ) \). Since
there are 3 cases depending upon the relative ordering of \( y_i, y_{i_+}, y_j, y_{j_+}\), as considered below.
Case 1: \( y_j \ge y_{i_+} \ge y_{j_+} \ge y_i \). In this case, let \( a = y_j - y_{i_+} \), \(b = y_{i_+} - y_{j_+}\) and \(c = y_{j_+} - y_i \). Then \(a,b,c \ge 0\), and
Since \( \frac{y_i+y_{i_+}}{2} \in [y_{j_+}, y_j] \), we have \(b = y_{i_+} - y_{j_+} \ge y_{i_+} - \frac{y_i+y_{i_+}}{2} = \frac{y_{i_+}- y_i}{2} \), and hence
Case 2: \( y_{i_+} \ge y_{j} \ge y_{i} \ge y_{j_+} \). In this case, let \( a = y_{i_+} - y_{j} \), \(b = y_{j} - y_{i}\) and \(c = y_{i} - y_{j_+} \). Then \(a,b,c \ge 0\), and
Since \( \frac{y_i+y_{i_+}}{2} \in [y_{j_+}, y_j] \), we have \( b = y_{j} - y_{i} \ge \frac{y_i+y_{i_+}}{2} - y_i = \frac{y_{i_+}- y_i}{2}\), and hence
Case 3: \( y_{i_+} \ge y_{j} \ge y_{j_+} \ge y_{i} \). In this case, let \( a = y_{i_+} - y_{j} \), \(b = y_{j} - y_{j_+}\) and \(c = y_{j_+} - y_{i} \). Then \(a,b,c \ge 0\), and
Note that \( \Vert {\varvec{P}} {\varvec{y}}- {\varvec{y}} \Vert _{\infty }^2 = (y_i-y_{i_+})^2 = (a+b+c)^2 \). Because \( \frac{y_i+y_{i_+}}{2} \in [y_{j_+}, y_j] \), we have
which implies \( a \le (a+b+c)/2 \) and \( c\le (a+b+c)/2 \). So we have
where
This is equivalent to
Combining Cases 1, 2 and 3 completes the proof of (3.5).
For the proof of (3.6), note that if \( \Vert {\varvec{P}}{\varvec{y}} -{\varvec{P}}^* {\varvec{y}} \Vert _{0} \le m\), then \( \Vert {\varvec{P}} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2 \le m \Vert {\varvec{P}} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert ^2_{\infty }\). Using (3.5) we have:
which completes the proof of (3.6).
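For intuition, the guarantee just proved can be checked empirically. The Python snippet below is an illustrative sketch (not part of the proof): taking \({\varvec{P}}^* = {\varvec{I}}_n\) and the factor \(1/2\), which is what Cases 1–3 above jointly guarantee, it verifies on random instances that some transposition neighbor of \({\varvec{P}}\) decreases \(\Vert {\varvec{P}}{\varvec{y}} - {\varvec{y}}\Vert ^2\) by at least \(\tfrac{1}{2}\Vert {\varvec{P}}{\varvec{y}} - {\varvec{y}}\Vert _{\infty }^2\):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

def max_gap(p, y):
    """Largest decrease of ||P y - y||^2 over all transposition neighbors of p."""
    base = np.sum((y[p] - y) ** 2)
    best = 0.0
    for i, j in itertools.combinations(range(len(p)), 2):
        q = p.copy()
        q[i], q[j] = q[j], q[i]             # swap the images of i and j
        best = max(best, base - np.sum((y[q] - y) ** 2))
    return best

ok = True
for _ in range(2000):
    p = rng.permutation(8)
    if np.array_equal(p, np.arange(8)):
        continue                            # identity permutation: nothing to improve
    y = rng.normal(size=8)
    L2 = np.max((y[p] - y) ** 2)            # squared infinity norm of Py - y
    ok = ok and (max_gap(p, y) >= 0.5 * L2 - 1e-9)
print(ok)  # True: a distance-2 move always achieves the guaranteed decrease
```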
1.3 Proof of Lemma 2
To prove Lemma 2, we first prove the following proposition:
Proposition 2
Under the assumptions of Lemma 2, it holds \(\Vert {\varvec{P}}^{(t)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \ge L \) for all \(0 \le t \le k\).
Proof
As \(\Vert {\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \ge L \), let \(i\in [n]\) be the index such that \( | {\varvec{e}}_i^\top ({\varvec{P}}^{(k)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}) | \ge L \). We can assume that there exists \(j \le k-1\) such that
since otherwise \({\varvec{e}}_i^\top {\varvec{P}}^{(t)} = {\varvec{e}}_i^\top {\varvec{P}}^{(k)} \) for all \(0\le t \le k\) and hence \(\Vert {\varvec{P}}^{(t)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \ge | {\varvec{e}}_i^\top ({\varvec{P}}^{(t)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}}) | \ge L\) for all \(0\le t \le k\), i.e., the conclusion of Proposition 2 holds true.
Below we prove Proposition 2 under the assumption (A.11). By (A.11) we know that \( {\varvec{e}}_i^\top {\varvec{P}}^{(j)} \ne {\varvec{e}}_i^\top {\varvec{P}}^{(j+1)} \). For any \(t\ge 0\) let \({\varvec{Q}}^{(t)} := {\varvec{P}}^{(t)} ({\varvec{P}}^*)^{-1}\). Then we have
Since \( \mathsf {dist}({\varvec{Q}}^{(j)}, {\varvec{Q}}^{(j+1)}) = \mathsf {dist}({\varvec{P}}^{(j)}, {\varvec{P}}^{(j+1)}) =2\), there must exist an index \(\ell \in [n]\) such that
In the following, denote \(i_+:= \pi _{{\varvec{Q}}^{(j)}} (i )\) and \(\ell _+:= \pi _{{\varvec{Q}}^{(j)}} (\ell )\) and let \({\varvec{y}}^*:= {\varvec{P}}^* {\varvec{y}}\). Then we have
Since \(\ell _+ = \pi _{{\varvec{Q}}^{(j)}} (\ell ) = \pi _{{\varvec{Q}}^{(j+1)}} (i)\), we have
By the definition of j and equality (A.13), we have
Since \( \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(j+1)} {\varvec{y}} \Vert \le \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(j)} {\varvec{y}} \Vert \), using Lemma 7 with \(\widetilde{{\varvec{P}}}= {\varvec{P}}^{(j+1)}\) and \({\varvec{P}} = {\varvec{P}}^{(j)}\), we have
This leads to
which when combined with (A.14) leads to:
As a result, we have
By Lemma 1, there exists \(\widetilde{{\varvec{P}}}^{(j)}{\in }\,\varPi _n\) such that \(\mathsf {dist}(\widetilde{{\varvec{P}}}^{(j)}, {\varvec{P}}^{(j)})\,{\le }\,2\), \(\mathsf {supp}(\widetilde{{\varvec{P}}}^{(j)} ({\varvec{P}}^*)^{-1} ) \subseteq \mathsf {supp}({\varvec{P}}^{(j)} ({\varvec{P}}^*)^{-1} )\) and
We now make use of the following claim:
Proof of Claim (A.17): Note that if \(j\le R/2 - 1\), then \( \mathsf {dist}({\varvec{I}}_n, \widetilde{{\varvec{P}}}^{(j)} ) \le \mathsf {dist}( {\varvec{I}}_n, {\varvec{P}}^{(j)} ) + \mathsf {dist}( {\varvec{P}}^{(j)}, \widetilde{{\varvec{P}}}^{(j)} ) \le R\). Otherwise, from the statement of Lemma 2, we know that \(\mathsf {supp}({\varvec{P}}^*) \subseteq \mathsf {supp}({\varvec{P}}^{(j)}) \). Using Lemma 6 with \({\varvec{P}} = {\varvec{P}}^{(j)}\), \({\varvec{Q}} = {\varvec{P}}^*\) and \(\widetilde{{\varvec{P}}}= \widetilde{{\varvec{P}}}^{(j)}\) we have \(\mathsf {supp}(\widetilde{{\varvec{P}}}^{(j)}) \subseteq \mathsf {supp}({\varvec{P}}^{(j)})\). Since \({\varvec{P}}^{(j)} \in {\mathcal {N}}_R({\varvec{I}}_n)\), we also have \(\widetilde{{\varvec{P}}}^{(j)} \in {\mathcal {N}}_R({\varvec{I}}_n)\). The proof of Claim (A.17) is complete.
Because \(\widetilde{{\varvec{P}}}^{(j)} \in {\mathcal {N}}_R({\varvec{I}}_n)\), by the update step in the local search algorithm, we know \( \Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(j+1)} {\varvec{y}} \Vert \le \Vert \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(j)} {\varvec{y}} \Vert \). Using Lemma 7 again with \(\widetilde{{\varvec{P}}}= {\varvec{P}}^{(j+1)}\) and \({\varvec{P}} = \widetilde{{\varvec{P}}}^{(j)}\), we have
Combining (A.16), (A.18) and (A.15) we have
which is equivalent to
We now use Lemma 8 with \(\widetilde{{\varvec{P}}}= {\varvec{Q}}^{(j+1)}\), \({\varvec{P}} = \varvec{Q}^{(j)}\) and \({\varvec{v}} = {\varvec{y}}^*\), to obtain
By the arguments above we have proved that \(\Vert {\varvec{P}}^{(j)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \ge L \). Recall that we also have \( \Vert {\varvec{P}}^{(t)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \ge L\) for all \( j+1 \le t\le k \), so we know that \( \Vert {\varvec{P}}^{(t)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \ge L \) for all \(j\le t \le k\). We can just replace k by j and repeat the arguments above to obtain \( \Vert {\varvec{P}}^{(t)} {\varvec{y}} - {\varvec{P}}^* {\varvec{y}} \Vert _{\infty } \ge L \) for all \( 0 \le t\le k\). This completes the proof of Proposition 2. \(\square \)
With Proposition 2 at hand, we are ready to wrap up the proof of Lemma 2. For each \(t\le k-1\), by Lemma 1, there exists \(\widetilde{{\varvec{P}}}^{(t)} \in \varPi _n\) such that \( \mathsf {dist}(\widetilde{{\varvec{P}}}^{(t)}, {\varvec{P}}^{(t)}) \le 2 \), \(\mathsf {supp}(\widetilde{{\varvec{P}}}^{(t)} ({\varvec{P}}^*)^{-1} ) \subseteq \mathsf {supp}({\varvec{P}}^{(t)} ({\varvec{P}}^*)^{-1} )\) and
where the second inequality is by Proposition 2. With the same arguments as in the proof of Claim (A.17), we have \(\widetilde{{\varvec{P}}}^{(t)} \in {\mathcal {N}}_R({\varvec{I}}_n)\), hence \(\Vert \widetilde{{\varvec{H}}}{\varvec{P}}^{(t+1)} {\varvec{y}} \Vert ^2 \le \Vert \widetilde{{\varvec{H}}}\widetilde{{\varvec{P}}}^{(t)} {\varvec{y}} \Vert ^2\). Using Lemma 7 again with \({\varvec{P}} = \widetilde{{\varvec{P}}}^{(t)}\) and \(\widetilde{{\varvec{P}}}={\varvec{P}}^{(t+1)}\), we have
Combining (A.19) and (A.20) we have
which completes the proof of Lemma 2.
1.4 Proof of Lemma 3
We will prove that for any \( {\varvec{u}} \in {\mathcal {B}}_m\) (cf. definition (3.1)),
Take \(t_n := \sqrt{3b^2 \log (2d/\tau ) / (n|||{\varvec{\varSigma }}|||_2)}\). By assumption in the statement of Lemma 3 we have \(t_n |||{\varvec{\varSigma }} |||_2 \le \gamma /2\). From Lemma 5 and some simple algebra we have:
with probability at least \(1-\tau \). By Weyl’s inequality, \(|\lambda _{\min } ({\varvec{X}}^\top {\varvec{X}})/n- \lambda _{\min } ({\varvec{\varSigma }})|\) is bounded by the left-hand side of (A.22). So we have
where we use \(t_{n}|||{\varvec{\varSigma }} |||_2 \le \gamma /2\). Hence we have \(\lambda _{\max }(({\varvec{X}}^\top {\varvec{X}})^{-1}) \le {2}/{(n \gamma )} \) and
Let \( {\mathcal {B}}_m(1) := \{{\varvec{u}}\in {\mathcal {B}}_m: ~ \Vert {\varvec{u}}\Vert \le 1 \}\), and let \({\varvec{u}}^1,\ldots ,{\varvec{u}}^M\) be an \((\sqrt{\delta _{n,m}}/2)\)-net of \({\mathcal {B}}_m(1) \), that is, for any \({\varvec{u}}\in {\mathcal {B}}_m(1)\), there exists some \({\varvec{u}}^j\) such that \( \Vert {\varvec{u}}^j - {\varvec{u}}\Vert \le \sqrt{\delta _{n,m}}/2 \). Since the \((\sqrt{\delta _{n,m}}/2) \)-covering number of \({\mathcal {B}}_m(1)\) is bounded by \(({6}/{\sqrt{\delta _{n,m}}})^m {n \atopwithdelims ()m}\), we can take
where the second inequality is from our assumption that \(\sqrt{\delta _{n,m}}\ge 2/n\). By Hoeffding's inequality, for each fixed \(j\in [M]\) and for all \( k\in [d] \), we have
Applying a union bound over \(k\in [d]\) to the inequality above, we have that for any \(\rho >0\), with probability at least \( 1- \rho \), the following inequality holds for all \(k\in [d]\):
where the second inequality is because each \({\varvec{u}}^j \in {\mathcal {B}}_m(1)\). As a result,
Take \(\rho = \tau / M\), then by the union bound, with probability at least \(1-\tau \), it holds
Combining (A.24) with (A.23), we have that for all \(j\in [M]\),
Recall that \( M\le (3n^2)^m \), so we have
where the last equality follows from the definition of \(\delta _{n,m}\). Using the above bound in (A.25), we have
For any \({\varvec{u}}\in {\mathcal {B}}_m(1)\), there exists some \(j\in [M]\) such that \(\Vert {\varvec{u}}- {\varvec{u}}^j\Vert \le \sqrt{\delta _{n,m}}/2\), hence
Since both (A.22) and (A.24) have failure probability of at most \(\tau \), we know that (A.26) holds with probability at least \(1-2\tau \). This proves the conclusion for all \({\varvec{u}}\in {\mathcal {B}}_m(1)\). For a general \({\varvec{u}}\in {\mathcal {B}}_m\), \( {\varvec{u}}/ \Vert \varvec{u}\Vert \in {\mathcal {B}}_m(1)\), hence we have
which is equivalent to (A.21). This completes the proof.
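The counting step \( M\le (3n^2)^m \) used above combines the bound \((6/\sqrt{\delta _{n,m}})^m \le (3n)^m\) (which follows from \(\sqrt{\delta _{n,m}} \ge 2/n\)) with \({n \atopwithdelims ()m} \le n^m\). The following Python check (illustrative, not part of the proof) confirms the inequality at the extreme value \(\sqrt{\delta _{n,m}} = 2/n\), where \((6/\sqrt{\delta _{n,m}})^m = (3n)^m\):

```python
from math import comb

# At sqrt(delta_{n,m}) = 2/n, the covering-number bound becomes
# (3n)^m * C(n, m), which should be at most (3n^2)^m since C(n, m) <= n^m.
ok = all(
    (3 * n) ** m * comb(n, m) <= (3 * n * n) ** m
    for n in range(3, 40)
    for m in range(1, n + 1)
)
print(ok)  # True
```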
1.5 Proof of Lemma 4
Since \(\epsilon _i\sim \mathsf {subG}(\sigma ^2)\), by Chernoff's inequality, for all \(t>0\), we have
As a result, for all \(t>0\), \({\mathbb {P}}(\Vert {\varvec{\epsilon }}\Vert _{\infty }>t) \le \sum _{i=1}^n {\mathbb {P}}( |\epsilon _i| >t ) \le 2n \exp (-t^2/(2\sigma ^2)) \). Therefore with probability at least \(1-\tau /3\) we have \(\Vert {\varvec{\epsilon }} \Vert _{\infty } \le \sigma \sqrt{2\log (6n/\tau )}\).
For any \(i\in [n]\), since \(\Vert {\varvec{e}}_i^\top \widetilde{{\varvec{H}}}\Vert ^2 \le 1 \), one can check that \({\varvec{e}}_i^\top \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \) is also sub-Gaussian with variance proxy \(\sigma ^2\). Similar to the arguments in the previous paragraph, with probability at least \(1- \tau /3\), we have \(\Vert \widetilde{{\varvec{H}}}{\varvec{\epsilon }} \Vert _{\infty } \le \sigma \sqrt{2\log (6n/\tau )}\). So we have proved that, with probability at least \(1-2\tau /3\), the inequalities in (a) hold.
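The union-bound calculation behind part (a) — \({\mathbb {P}}(\Vert {\varvec{\epsilon }}\Vert _{\infty } > \sigma \sqrt{2\log (6n/\tau )}) \le 2n \exp (-\log (6n/\tau )) = \tau /3\) — can be illustrated by simulation. The Python sketch below (illustrative, with Gaussian noise and arbitrary example values of \(n\), \(\sigma \) and \(\tau \)) estimates the failure frequency and checks that it stays below \(\tau /3\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, tau, trials = 200, 1.5, 0.1, 5000   # illustrative values
thresh = sigma * np.sqrt(2 * np.log(6 * n / tau))
# Count trials in which the max-norm of the noise exceeds the threshold.
fails = sum(
    np.max(np.abs(rng.normal(0.0, sigma, size=n))) > thresh
    for _ in range(trials)
)
print(fails / trials, tau / 3)  # empirical failure rate is well below tau/3
```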
Let \({\varvec{X}} = \varvec{{{\bar{U}}}} {\varvec{D}} \varvec{{{\bar{V}}}}^\top \) be the SVD of \({\varvec{X}}\) with \(\varvec{{{\bar{U}}}} \in {\mathbb {R}}^{n\times d}\) satisfying \(\varvec{{{\bar{U}}}}^\top \varvec{{{\bar{U}}}} = {\varvec{I}}_d\); \(\varvec{{{\bar{V}}}} \in {\mathbb {R}}^{d\times d}\) satisfying \( \varvec{{{\bar{V}}}}^\top \varvec{{{\bar{V}}}} = {\varvec{I}}_d\); and \({\varvec{D}} \in {\mathbb {R}}^{d\times d}\) being diagonal. Then we have
and
Note that for any \(j\in [d]\), one can verify that \(\varvec{e}_j^\top \varvec{{{\bar{U}}}}^\top {\varvec{\epsilon }}\) is sub-Gaussian with variance proxy \( \sigma ^2\). Hence, for any \(t>0\), we have
As a result, with probability at least \(1-\tau /3\) we have \(\Vert \varvec{{{\bar{U}}}}^\top {\varvec{\epsilon }} \Vert _{\infty } \le \sigma \sqrt{2\log (6d/\tau )} \), and hence
and
Note that \( |||{\varvec{D}}^{-1} |||_2 = \frac{1}{\sqrt{n}} \lambda _{\min }^{-1/2} (\frac{1}{n} {\varvec{X}}^\top {\varvec{X}} ) \), so this completes the proofs of (b) and (c).
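The identities implicit in the displays above — \({\varvec{H}} = {\varvec{X}}({\varvec{X}}^\top {\varvec{X}})^{-1}{\varvec{X}}^\top = \varvec{{{\bar{U}}}}\varvec{{{\bar{U}}}}^\top \), so that \(\Vert {\varvec{H}}{\varvec{\epsilon }}\Vert = \Vert \varvec{{{\bar{U}}}}^\top {\varvec{\epsilon }}\Vert \le \sqrt{d}\,\Vert \varvec{{{\bar{U}}}}^\top {\varvec{\epsilon }}\Vert _{\infty }\) — admit a direct numerical check (illustrative Python with arbitrary dimensions, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 5                                       # illustrative dimensions
X = rng.normal(size=(n, d))
eps = rng.normal(size=n)

U, D, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(D) Vt, U is n x d
H = X @ np.linalg.inv(X.T @ X) @ X.T               # projection onto col(X)

ok = (
    np.allclose(H, U @ U.T)                                        # H = U U^T
    and np.isclose(np.linalg.norm(H @ eps), np.linalg.norm(U.T @ eps))
    and np.linalg.norm(H @ eps)
        <= np.sqrt(d) * np.max(np.abs(U.T @ eps)) + 1e-9           # sqrt(d) bound
)
print(ok)  # True
```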
To prove (3.47), using Bernstein's inequality we have
for a universal constant C. Taking \(t = \sqrt{\log (4/\tau )/(Cn)}\) in the inequality above, and noting that \(t\le 1\) (because of the assumption \(\sqrt{\log (4/\tau )/(Cn)} + 2d \log (4d/\tau )/n \le 1/4 \)), we have
As a result, with probability at least \(1-\tau /2\)
where the second inequality makes use of the assumption that \({{\widetilde{\sigma }}}^2 \ge (3/4) \sigma ^2\). On the other hand we know that with probability at least \(1-\tau /2\) it holds \(\Vert {\varvec{H}} {\varvec{\epsilon }} \Vert \le \sigma \sqrt{2d\log (4d/\tau )}\). As a result, with probability at least \(1-\tau \) we have
where the last inequality uses the assumption \(\sqrt{\log (4/\tau )/(Cn)} + 2d \log (4d/\tau )/n \le 1/4\).
1.6 Proof of Claim (3.22)
To prove Claim (3.22), we just need to prove that \(\widetilde{{\varvec{P}}}^{(k)} \in {\mathcal {N}}_R({\varvec{I}}_n)\), i.e., \(\mathsf {dist}(\widetilde{{\varvec{P}}}^{(k)} ,{\varvec{I}}_n) \le R\). If \(k\le R/2 - 1\), because \( \mathsf {dist}({\varvec{P}}^{(t+1)}, {\varvec{P}}^{(t)}) \le 2 \) for all \(t\ge 0\) and \({\varvec{P}}^{(0)} ={\varvec{I}}_n\), we have \( \mathsf {dist}({\varvec{P}}^{(k)}, {\varvec{I}}_n) \le 2k \le R-2 \). Hence
Otherwise, by Proposition 1, it holds \( \mathsf {supp}( {\varvec{P}}^* ) \subseteq \mathsf {supp}({\varvec{P}}^{(k)})\). Then from Lemma 6, we have \( \mathsf {supp}(\widetilde{{\varvec{P}}}^{(k)}) \subseteq \mathsf {supp}({\varvec{P}}^{(k)})\), therefore \(\widetilde{{\varvec{P}}}^{(k)} \in {\mathcal {N}}_R( {\varvec{I}}_n)\).
1.7 Proof of Claim (3.19)
Note that
Let \(w = 33\). Since only two coordinates of \({\varvec{z}}\) are non-zero, we have
where the second inequality uses Cauchy-Schwarz inequality. Using Assumption 1 (3) and (4) we have
By Assumption 1 (3), we have \(\rho _nR \le L^2/(90U^2) \le 1/73 \le 1/(aw)\), hence \( \rho _n/2 \le 1/(2aRw) = \eta /w\). As a result,
Combining (A.33), (A.34) and (A.32) we have
From \(\rho _nR \le L^2/(90U^2) \) we know that \(\rho _n \le 0.01\). Therefore by Assumption 1 (3) we have \( \Vert \widetilde{{\varvec{H}}}{\varvec{z}}\Vert ^2 \ge (1-\rho _n) \Vert {\varvec{z}}\Vert ^2 \ge 0.99\Vert {\varvec{z}}\Vert ^2 \). So \( \Vert {\varvec{z}}\Vert ^2 \le 1.02 \Vert \widetilde{{\varvec{H}}}{\varvec{z}}\Vert ^2 \), and combining this with (A.35) we have
where the second inequality uses the definition of \({\varvec{z}}\) and the Cauchy-Schwarz inequality. Note that \(4.08/w = 4.08/33 \le 1/8\) and \(w/2 = 33/2 \le 17\), so we have
From Assumption 2 (2), we have \( {{\bar{\sigma }}}^2 / \sigma ^2 \le \min \{n/(660R^2), n/(5Rd)\}\), which implies
As a result, by (A.36) and (A.37) we have
Multiplying both sides of (A.38) by 2 completes the proof.
Mazumder, R., Wang, H. Linear regression with partially mismatched data: local search with theoretical guarantees. Math. Program. 197, 1265–1303 (2023). https://doi.org/10.1007/s10107-022-01863-y
Keywords
- Linear regression
- Mismatched data
- Local search method
- Permutation learning
- Mixed integer programming
- Polynomial time algorithm