Linear regression with partially mismatched data: local search with theoretical guarantees

Linear regression is a fundamental modeling tool in statistics and related fields. In this paper, we study an important variant of linear regression in which the predictor-response pairs are partially mismatched. We use an optimization formulation to simultaneously learn the underlying regression coefficients and the permutation corresponding to the mismatches. The combinatorial structure of the problem leads to computational challenges. We propose and study a simple greedy local search algorithm for this optimization problem that enjoys strong theoretical guarantees and appealing computational performance. We prove that under a suitable scaling of the number of mismatched pairs compared to the number of samples and features, and certain assumptions on the problem data, our local search algorithm converges to a nearly-optimal solution at a linear rate. In particular, in the noiseless case, our algorithm converges to the global optimal solution at a linear rate. Based on this result, we prove an upper bound on the estimation error of the regression coefficients. We also propose an approximate local search step that allows us to scale our approach to much larger instances. We conduct numerical experiments to gather further insights into our theoretical results, and show promising performance gains compared to existing approaches.


Introduction
Linear regression and its extensions are among the most fundamental models in statistics and related fields. In the classical and most common setting, we are given $n$ samples with features $x_i \in \mathbb{R}^d$ and response $y_i \in \mathbb{R}$, where $i$ denotes the sample index. We assume that the features and responses are perfectly matched, i.e., $x_i$ and $y_i$ correspond to the same record or sample. However, in important applications (for example, due to errors in the data merging process), the correspondence between the response and the features may be broken [13,14,19]. This erroneous correspondence needs to be adjusted before performing downstream statistical analysis. Thus motivated, we consider a mismatched linear model with responses $y = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$ and covariates $X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times d}$ satisfying
$$P^* y = X \beta^* + \epsilon \qquad (1.1)$$
where $\beta^* \in \mathbb{R}^d$ is the vector of true regression coefficients, $\epsilon = [\epsilon_1, \ldots, \epsilon_n]^\top \in \mathbb{R}^n$ is the noise term, and $P^* \in \mathbb{R}^{n \times n}$ is an unknown permutation matrix. We consider the classical setting where $n > d$ and $X$ has full rank, and seek to estimate both $\beta^*$ and $P^*$ based on the $n$ observations $\{(y_i, x_i)\}_1^n$. Note that the main computational difficulty in this task arises from the unknown permutation.
Linear regression with mismatched/permuted data, as in model (1.1), has a long history in statistics dating back to the 1960s [13,5,6]. In addition to the aforementioned application in record linkage, similar problems also appear in robotics [22], multi-target tracking [4] and signal processing [3], among others. Recently, this problem has garnered significant attention from the statistics and machine learning communities. A series of recent works [8,24,14,15,1,2,11,9,17,26,7,23,19,20,21] have studied the statistical and computational aspects of this model. To learn the coefficients $\beta^*$ and the matrix $P^*$, one can consider the following natural optimization problem:
$$\min_{\beta, P} \ \|P y - X \beta\|^2 \quad \text{s.t.} \quad P \in \Pi_n \qquad (1.2)$$
where $\Pi_n$ is the set of $n \times n$ permutation matrices. Solving problem (1.2) is difficult as there are exponentially many choices for $P \in \Pi_n$. Given $P$, however, it is easy to estimate $\beta$ via least squares. [24] shows that in the noiseless setting ($\epsilon = 0$), a solution $(\hat P, \hat\beta)$ of problem (1.2) equals $(P^*, \beta^*)$ with probability one if $n \geq 2d$ and the entries of $X$ are independent and identically distributed (iid) as per a distribution that is absolutely continuous with respect to the Lebesgue measure. [15,11] study the estimation of $(P^*, \beta^*)$ in the noisy setting. It is shown in [15] that Problem (1.2) is NP-hard if $d \geq \kappa n$ for some constant $\kappa > 0$. A polynomial-time approximation algorithm appears in [11] for fixed $d$; however, as noted in [11], this algorithm does not appear to be efficient in practice. [8] proposes a branch-and-bound method that can solve small problems with $n \leq 20$ within a reasonable time. [16] proposes a branch-and-bound method for a concave minimization formulation, which can solve problem (1.2) with $d \leq 8$ and $n \approx 100$ (the authors report a runtime of 40 minutes to solve instances with $d = 8$ and $n = 100$). [23] proposes an approach using tools from algebraic geometry, which can handle problems with $d \leq 6$ and $n = 10^3 \sim 10^5$; the cost of this method increases exponentially with $d$. This approach is exact in the noiseless case but approximate in the noisy case ($\epsilon \neq 0$). Several heuristics have been proposed for (1.2): examples include alternating minimization [9,26] and Expectation Maximization [2]; as far as we can tell, these methods are sensitive to initialization and have limited theoretical guarantees. As discussed in [18,19], in several applications only a small fraction of the samples are mismatched, that is, the permutation $P^*$ is sparse. In other words, if we let $r := \#\{i \in [n] : P^* e_i \neq e_i\}$, where $e_1, \ldots, e_n$ are the standard basis elements of $\mathbb{R}^n$, then $r$ is much smaller than $n$. In this paper, we focus on such sparse permutation matrices, and assume that the value of $r$ is known or that a good estimate is available to the practitioner. This motivates a constrained version of (1.2), given by
$$\min_{\beta, P} \ \|P y - X \beta\|^2 \quad \text{s.t.} \quad P \in \Pi_n, \ \mathrm{dist}(P, I_n) \leq R \qquad (1.3)$$
where the constraint $\mathrm{dist}(P, I_n) \leq R$ restricts the number of mismatches between $P$ and the identity permutation to be below $R$; see (1.4) for a formal definition of $\mathrm{dist}(\cdot, \cdot)$. Above, $R$ is taken such that $r \leq R \leq n$ (further details on the choice of $R$ can be found in Sections 3 and 5). Note that as long as $r \leq R \leq n$, the true parameters $(P^*, \beta^*)$ lead to a feasible solution of (1.3). In the special case when $R = n$, the constraint $\mathrm{dist}(P, I_n) \leq R$ is redundant, and Problem (1.3) is equivalent to problem (1.2).
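To illustrate why (1.2) is combinatorially hard while being easy in $\beta$ for a fixed $P$, here is a minimal brute-force sketch in Python (our illustration, assuming numpy; it is practical only for very small $n$ since it enumerates all $n!$ permutations):

```python
import itertools
import numpy as np

def brute_force_12(X, y):
    """Solve (1.2) by enumerating all n! permutations (tiny n only).

    `perm` encodes P via (P y)[i] = y[perm[i]]; given the permutation,
    beta is recovered by ordinary least squares.
    """
    n = len(y)
    best_obj, best_perm, best_beta = np.inf, None, None
    for perm in itertools.permutations(range(n)):
        Py = y[list(perm)]
        beta, *_ = np.linalg.lstsq(X, Py, rcond=None)
        obj = np.sum((Py - X @ beta) ** 2)
        if obj < best_obj:
            best_obj, best_perm, best_beta = obj, np.array(perm), beta
    return best_obj, best_perm, best_beta

# noiseless toy instance: the optimal objective value is 0
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
y = X @ rng.standard_normal(2)
y[[0, 1]] = y[[1, 0]]                     # mismatch two responses
obj, perm, beta = brute_force_12(X, y)
```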
Interesting convex optimization approaches based on robust regression have been proposed in [18] to approximately solve (1.3) when $r \ll n$. The authors focus on obtaining an estimate of $\beta^*$. Similar ideas have been extended to problems with multiple responses in [19].
Problem (1.3) can be formulated as a mixed-integer program (MIP) with $O(n^2)$ binary variables (to model the unknown permutation matrix). Solving this MIP with off-the-shelf MIP solvers (e.g., Gurobi) becomes computationally expensive for even a small value of $n$ (e.g., $n \approx 50$). To the best of our knowledge, there are no computationally practical algorithms with theoretical guarantees that can optimally solve the original problem (1.3) under suitable assumptions on the problem data. Addressing this gap is the main focus of this paper: we propose and study a novel greedy local search method$^1$ for Problem (1.3). Loosely speaking, at every step our algorithm performs a greedy swap or transposition in an attempt to improve the cost function. This algorithm is typically efficient in practice based on our numerical experiments. We also propose an approximate version of the greedy swap procedure that scales to much larger problem instances. We establish theoretical guarantees on the convergence of the proposed method under suitable assumptions on the problem data. Under a suitable scaling of the number of mismatched pairs compared to the number of samples and features, and certain assumptions on the covariates and noise, our local search method converges to an objective value that is at most a constant multiple of the squared norm of the underlying noise term. From a statistical viewpoint, this is the best objective value that one can hope to obtain (due to the noise in the problem). Interestingly, in the special case of $\epsilon = 0$ (i.e., the noiseless setting), our algorithm converges to an optimal solution of (1.3) at a linear rate$^2$. We also prove an upper bound on the estimation error of $\beta^*$ (in $\ell_2$ norm) and derive a bound on the number of iterations taken by our proposed local search method to find a solution with this estimation error.
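For concreteness, one natural MIP formulation of (1.3) along these lines (a sketch based on our reading; the paper does not display its exact formulation) models the permutation by a binary matrix $Z$ with assignment constraints and encodes $\mathrm{dist}(P, I_n) \leq R$ through the trace:
$$\min_{\beta,\, Z} \ \|Z y - X \beta\|^2 \quad \text{s.t.} \quad Z \mathbf{1} = \mathbf{1}, \quad Z^\top \mathbf{1} = \mathbf{1}, \quad Z \in \{0, 1\}^{n \times n}, \quad n - \mathrm{tr}(Z) \leq R ,$$
which indeed uses $n^2$ binary variables, consistent with the $O(n^2)$ count above.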
Notation and preliminaries: For a vector $a$, we let $\|a\|$ denote the Euclidean norm, $\|a\|_\infty$ the $\ell_\infty$-norm, and $\|a\|_0$ the $\ell_0$-pseudo-norm (i.e., the number of nonzeros) of $a$. We let $\|\cdot\|_2$ denote the operator norm for matrices. Let $\{e_1, \ldots, e_n\}$ be the standard orthogonal basis of $\mathbb{R}^n$. For a finite set $S$, we let $\#S$ denote its cardinality. For any permutation matrix $P$, let $\pi_P$ be the corresponding permutation of $\{1, 2, \ldots, n\}$; that is, $\pi_P(i) = j$ if and only if $e_i^\top P = e_j^\top$, if and only if $P_{ij} = 1$. We define the distance between two permutation matrices $P$ and $Q$ as
$$\mathrm{dist}(P, Q) := \#\{i \in [n] : \pi_P(i) \neq \pi_Q(i)\} . \qquad (1.4)$$
Recall that we assume $r = \mathrm{dist}(P^*, I_n)$. For a given permutation matrix $P$, define the $m$-neighbourhood of $P$ as
$$N_m(P) := \{Q \in \Pi_n : \mathrm{dist}(Q, P) \leq m\} .$$
It is easy to check that $N_1(P) = \{P\}$, and for any $R \geq 2$, $N_R(P)$ contains more than one element. For any permutation matrix $P \in \Pi_n$, we define its support as $\mathrm{supp}(P) := \{i \in [n] : \pi_P(i) \neq i\}$. For a real symmetric matrix $A$, let $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ denote its largest and smallest eigenvalues, respectively.

$^1$ We draw inspiration from the local search method presented in [10] in the context of a different problem: high-dimensional sparse regression.
$^2$ The extended abstract [12], which is a shorter version of this manuscript, studies the noiseless setting.
For two positive scalar sequences $\{a_n\}, \{b_n\}$, we write $a_n = O(b_n)$ (equivalently, $a_n/b_n = O(1)$) if there exists a universal constant $C$ such that $a_n \leq C b_n$. We write $a_n = \Omega(b_n)$ (equivalently, $a_n/b_n = \Omega(1)$) if there exists a universal constant $c > 0$ such that $a_n \geq c b_n$. We write $a_n = \Theta(b_n)$ if both $a_n = O(b_n)$ and $a_n = \Omega(b_n)$ hold.

A local search method
Here we present our local search method for (1.3). For any fixed $P \in \Pi_n$, by minimizing the objective function in (1.3) with respect to $\beta$, we obtain the equivalent formulation
$$\min_P \ \|P y - H P y\|^2 \quad \text{s.t.} \quad P \in \Pi_n, \ \mathrm{dist}(P, I_n) \leq R , \qquad (2.1)$$
where $H = X (X^\top X)^{-1} X^\top$ is the projection matrix onto the columns of $X$. To simplify notation, denote $\bar H := I_n - H$; then (2.1) is equivalent to
$$\min_P \ \|\bar H P y\|^2 \quad \text{s.t.} \quad P \in \Pi_n, \ \mathrm{dist}(P, I_n) \leq R . \qquad (2.2)$$
Our local search approach for the optimization of Problem (2.2) is summarized in Algorithm 1.

Algorithm 1 A local search method for Problem (2.2).
Input: Initial permutation $P^{(0)} = I_n$; tolerance $e \geq 0$; maximum number of iterations $K$.
For $k = 0, 1, \ldots, K$: compute
$$P^{(k+1)} \in \operatorname*{argmin}_{P \in N_2(P^{(k)}) \cap N_R(I_n)} \ \|\bar H P y\|^2 ; \qquad (2.3)$$
if $\|\bar H P^{(k)} y\|^2 - \|\bar H P^{(k+1)} y\|^2 \leq e$, stop and output $P^{(k)}$.
At iteration $k$, Algorithm 1 finds a swap (within a distance of $R$ from $I_n$) that leads to the smallest objective value. To see the computational cost of (2.3), note that
$$\|\bar H P y\|^2 = \|\bar H (P - P^{(k)}) y + \bar H P^{(k)} y\|^2 = \|\bar H (P - P^{(k)}) y\|^2 + 2 \langle (P - P^{(k)}) y, \bar H P^{(k)} y \rangle + \|\bar H P^{(k)} y\|^2 . \qquad (2.4)$$
For each $P$ with $\mathrm{dist}(P, P^{(k)}) \leq 2$, the vector $(P - P^{(k)}) y$ has at most two nonzero entries. Since we pre-compute $\bar H$, computing the first term in (2.4) costs $O(1)$ operations. As we retain a copy of $\bar H P^{(k)} y$ in memory, computing the second term in (2.4) also costs $O(1)$ operations. Therefore, computing (2.3) requires $O(n^2)$ operations, as there are at most $n^2$-many possible swaps to search over. The $O(n^2)$ per-iteration cost is quite reasonable for medium-sized examples with $n$ ranging from a few hundred to a few thousand, but can be expensive for larger examples. In Section 4, we propose a fast method to find an approximate solution of (2.3) that scales to instances with $n \approx 10^7$ in a few minutes (see Section 5 for numerical findings).
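For concreteness, here is a minimal Python sketch of Algorithm 1 with the $O(1)$ swap evaluation from (2.4) (our illustrative implementation; the function and variable names are ours, and a tuned implementation would update the cached vector with a rank-two correction instead of a full matrix-vector product):

```python
import numpy as np

def local_search(X, y, R, max_iter=1000, tol=0.0):
    """Greedy local search for (2.2) (a sketch of Algorithm 1).

    Maintains u = P^(k) y via the index array `perm` (u = y[perm]) and
    the cached vector w = Hbar @ u, so each candidate swap is scored
    in O(1) using the expansion (2.4).
    """
    n = len(y)
    H = X @ np.linalg.solve(X.T @ X, X.T)    # projection onto col(X)
    Hbar = np.eye(n) - H
    perm = np.arange(n)
    u = y[perm].astype(float)
    w = Hbar @ u
    for _ in range(max_iter):
        mism = perm != np.arange(n)          # current mismatched positions
        best_delta, best_pair = -tol, None
        for i in range(n):
            for j in range(i + 1, n):
                d = u[j] - u[i]              # swap entries i and j of u
                # change in ||Hbar u||^2, via (2.4):
                delta = (d * d * (Hbar[i, i] + Hbar[j, j] - 2.0 * Hbar[i, j])
                         + 2.0 * d * (w[i] - w[j]))
                # distance to I_n after the swap must stay <= R
                new_mism = (mism.sum() - mism[i] - mism[j]
                            + (perm[j] != i) + (perm[i] != j))
                if new_mism <= R and delta < best_delta:
                    best_delta, best_pair = delta, (i, j)
        if best_pair is None:                # no improving feasible swap
            break
        i, j = best_pair
        perm[[i, j]] = perm[[j, i]]
        u = y[perm].astype(float)
        w = Hbar @ u                         # a rank-two O(n) update also works
    beta = np.linalg.solve(X.T @ X, X.T @ u)
    return perm, beta
```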

Theoretical guarantees for Algorithm 1
Here we present theoretical guarantees for Algorithm 1. The main assumptions and conclusions appear in Section 3.1. Section 3.2 presents the proofs of the main theorems. The development in Sections 3.1 and 3.2 assumes that the problem data (i.e., $y$, $X$, $\epsilon$) is deterministic. Section 3.3 discusses conditions on the distribution of the features and the noise term under which the main assumptions hold with high probability.

Main results
We state and prove the main theorems on the convergence of Algorithm 1. For any $m \leq n$, we define a restricted eigenvalue-type quantity $\rho_m$ associated with the projection matrix $H$ (used in Assumption 3.1 below). We first state the assumptions useful for our technical analysis.
Note that the lower bound in Assumption 3.1 (1) states that the $y$-value for a record that has been mismatched is not too close to its original value (before the mismatch). Assumption 3.1 (2) states that $R$ is set to a constant multiple of $r$. This constant can be large ($\geq 10 U^2/L^2$) and appears to be an artifact of our proof techniques; our numerical experience suggests that it can be much smaller in practice. Assumption 3.1 (3) is a restricted eigenvalue (RE)-type condition [25] stating that multiplying any $(2R)$-sparse vector by $H$ results in a vector with small norm (in the case $R \rho_n < 1$). Section 3.3 discusses conditions on the distribution of the rows of $X$ under which Assumption 3.1 (3) holds with high probability. Note that if $\rho_n = \Theta(d \log(n)/n)$, then for the assumption $R \rho_n \leq L^2/(90 U^2)$ to hold, we require $n/\log(n) = \Omega(dr)$. Assumption 3.1 (4) limits the amount of noise in the problem. Section 3.3 presents conditions on the distributions of $\epsilon$ and $X$ (in a random design setting) which ensure that Assumption 3.1 (4) holds with high probability. Assumption 3.1 (3) plays an important role in our technical analysis: it allows us to approximate the objective function in (2.2) with one that is easier to analyze. To provide some intuition, we write $\bar H P^{(k)} y = \bar H (P^{(k)} y - P^* y) + \bar H \epsilon$ (noting that $\bar H P^* y = \bar H (X \beta^* + \epsilon) = \bar H \epsilon$), and assuming that the noise is small, we have
$$\|\bar H P^{(k)} y\|^2 \approx \|\bar H (P^{(k)} y - P^* y)\|^2 \approx \|P^{(k)} y - P^* y\|^2 . \qquad (3.4)$$
Intuitively, the term on the right-hand side is the approximate objective that we analyze in our theory.
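For intuition, here is the elementary identity behind the second approximation in (3.4), with $v := P^{(k)} y - P^* y$ (our elaboration; the sparsity bound is exactly what the RE-type condition provides):
$$\|\bar H v\|^2 = v^\top (I_n - H) v = \|v\|^2 - \|H v\|^2 \approx \|v\|^2 ,$$
since $v$ has at most $2R$ nonzero entries, and Assumption 3.1 (3) forces $\|H v\|^2$ to be a small fraction of $\|v\|^2$ for such sparse vectors.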
Lemma 3.2, stated next, presents a one-step decrease property for this approximate objective.

Lemma 3.2 (One-step decrease) Given any $y \in \mathbb{R}^n$ and $P, P^* \in \Pi_n$, there exists a permutation matrix $\tilde P \in \Pi_n$ such that $\mathrm{dist}(\tilde P, P) = 2$, $\mathrm{supp}(\tilde P (P^*)^{-1}) \subseteq \mathrm{supp}(P (P^*)^{-1})$, and
$$\|\tilde P y - P^* y\|^2 - \|P y - P^* y\|^2 \leq -\frac{1}{2}\, \|P y - P^* y\|_\infty^2 . \qquad (3.5)$$
If in addition $\|P y - P^* y\|_0 \leq m$ for some $m \leq n$, then
$$\|\tilde P y - P^* y\|^2 \leq \Big(1 - \frac{1}{2m}\Big) \|P y - P^* y\|^2 .$$
The main results make use of Lemma 3.2 and formalize the intuition conveyed in (3.4). We first present a result regarding the support of the permutation matrices $P^{(k)}$ delivered by Algorithm 1.

Proposition 3.3 Suppose Assumption 3.1 holds, and let $\{P^{(k)}\}_{k \geq 0}$ be the permutation matrices generated by Algorithm 1. Then for all $k \geq R/2$, it holds that $\mathrm{supp}(P^*) \subseteq \mathrm{supp}(P^{(k)})$.

Proposition 3.3 states that the support of $P^*$ will be contained within the support of $P^{(k)}$ after at most $R/2$ iterations. Intuitively, this result is a consequence of Assumption 3.1 (1), which posits that the mismatches represented by $P^*$ carry a "strong signal". Proposition 3.3 is also useful for the proofs of the main theorems below (e.g., see Claim 3.25 in the proof of Theorem 3.5 for details).
We now present some additional assumptions required for the results that follow.
Assumption 3.4 Let $\rho_n$ and $\bar\sigma$ be the parameters appearing in Assumption 3.1.
We now state the first convergence result.

Theorem 3.5 Suppose Assumptions 3.1 and 3.4 hold, and let $\{P^{(k)}\}_{k \geq 0}$ be the iterates of Algorithm 1. Then there exists a scalar $\eta \in (0, 1)$ (depending on $R$; see the proof for its definition) such that for all $k \geq 0$,
$$\|\bar H P^{(k)} y\|^2 \leq \Big(1 - \frac{\eta}{4}\Big)^k \|\bar H P^{(0)} y\|^2 + 36\, \|\bar H \epsilon\|^2 . \qquad (3.8)$$
In the special (noiseless) setting where $\epsilon = 0$, Theorem 3.5 establishes that the sequence of objective values generated by Algorithm 1 converges to zero (i.e., the optimal objective value) at a linear rate. The parameter governing the linear rate of convergence depends upon the search width $R$. Following the discussion after Assumption 3.4, the sample-size requirement $n/\log(n) = \Omega(d r^2)$ is more stringent than what is needed for the model to be identifiable ($n \geq 2d$) [24] in the noiseless setting. In particular, when $n/(d \log n) = O(1)$, the number of mismatched pairs $r$ needs to be bounded by a constant. Numerical evidence presented in Section 5 (for the noiseless case) suggests that the sample size $n$ needed to recover $P^*$ is smaller than what is suggested by our theory.
In the noisy case (i.e., $\epsilon \neq 0$), the bound (3.8) provides an upper bound on the objective value consisting of two terms. The first term converges to 0 at a linear rate, as in the noiseless case. The second term is a constant multiple of the squared norm of the unavoidable noise term $\|\bar H \epsilon\|^2$. In other words, Algorithm 1 finds a solution whose objective value is at most a constant multiple of the objective value at the true permutation $P^*$. Theorem 3.5 proves a convergence guarantee on the objective value; the next result provides upper bounds on the $\ell_\infty$-norm of the mismatched entries, i.e., $\|P^{(k)} y - P^* y\|_\infty$. For any $Q \in \Pi_n$, define
$$G(Q) := \|\bar H Q y\|^2 - \min_{P \in N_2(Q) \cap N_R(I_n)} \|\bar H P y\|^2 ;$$
that is, $G(Q)$ is the decrease in the objective value after one step of local search starting at $Q$. For the permutation matrices $\{P^{(k)}\}_{k \geq 0}$ generated by Algorithm 1, we have $G(P^{(k)}) = \|\bar H P^{(k)} y\|^2 - \|\bar H P^{(k+1)} y\|^2$.
Theorem 3.6 Suppose Assumptions 3.1 and 3.4 hold, and let $\{P^{(k)}\}_{k \geq 0}$ be generated by Algorithm 1. Then for all $k \geq R/2$,
$$\|P^{(k)} y - P^* y\|_\infty^2 \leq 800\, \bar\sigma^2 + 10\, G(P^{(k)}) . \qquad (3.10)$$
Theorem 3.6 states that the largest squared error of the mismatched pairs (i.e., $\|P^{(k)} y - P^* y\|_\infty^2$) is bounded above by a constant multiple of the one-step decrease in objective value (i.e., $G(P^{(k)})$) plus a term comparable to the noise level $O(\bar\sigma^2)$. In particular, if Algorithm 1 is terminated at an iterate $P^{(k)}$ with $G(P^{(k)})$ of the order of $\bar\sigma^2$, then $\|P^{(k)} y - P^* y\|_\infty^2$ is bounded by a constant multiple of $\bar\sigma^2$.
Note that the constant 800 in (3.10) is conservative and may be improved with a careful adjustment of the constants appearing in the proof and in the assumptions.
In light of Theorem 3.6, we can prove an upper bound on the estimation error for $\beta^*$, using an additional assumption stated below.

Assumption 3.7 There exists a constant $\bar\gamma > 0$ such that the following two conditions hold, where $\bar\sigma$ is as defined in Assumption 3.1.

Section 3.3 presents conditions on $X$ under which Assumption 3.7 is satisfied with high probability.
Theorem 3.8 (Estimation error) Suppose Assumptions 3.1, 3.4 and 3.7 hold. Suppose iteration $k$ of Algorithm 1 satisfies $G(P^{(k)}) \leq c \bar\sigma^2$ for a constant $c > 0$. Let $\beta^{(k)} := (X^\top X)^{-1} X^\top P^{(k)} y$ and denote $\bar c := 800 + 10c$. Then the estimation-error bound (3.11) and the denoising bound (3.12) hold. Theorem 3.8 (cf. bound (3.11)) states that as long as $k$ is sufficiently large, the estimation error $\|\beta^{(k)} - \beta^*\|^2$ is of the order $O((r + d)\bar\sigma^2/n)$, assuming $\bar\gamma$ is a constant. Therefore, as $n \to \infty$ (with $r, d$ fixed), the estimator delivered by our algorithm (after sufficiently many iterations) converges to the true regression coefficient vector $\beta^*$. In addition, (3.12) provides an upper bound on the entrywise "denoising error" (the left-hand side of (3.12)), which is also of the order $O((r + d)\bar\sigma^2/n)$. See [14] for past works and discussions on this error metric.
The following theorem provides an upper bound on the total number of local search steps needed to find a $P^{(k)}$ with $G(P^{(k)}) \leq c \bar\sigma^2$.
which is a contradiction. So there must exist some $k \leq K^\dagger$ with $G(P^{(k)}) \leq c \bar\sigma^2$. Note that if $R$ and $\|\bar H \epsilon\|^2 / \|\bar H P^{(0)} y\|^2$ are bounded by a constant, then the number of iterations satisfies $K^\dagger = O(n)$. Therefore, in this situation, one can find an estimate $\beta^{(k)}$ satisfying the error bounds of Theorem 3.8 within $O(n)$ iterations of Algorithm 1.

Proofs of main theorems
In this section, we present the proofs of Proposition 3.3, Theorem 3.5, Theorem 3.6 and Theorem 3.8. We first present a technical result used in our proofs.

Lemma 3.10 Suppose Assumption 3.1 holds. Let $\{P^{(k)}\}_{k \geq 0}$ be the permutation matrices generated by Algorithm 1, and suppose $\|P^{(k)} y - P^* y\|_\infty \geq L$ for some $k \geq 1$. Suppose, in addition, that at least one of two conditions (stated precisely in Section A.3) holds. Then
$$\|P^{(k+1)} y - P^* y\|^2 - \|P^{(k)} y - P^* y\|^2 \leq -L^2/5 .$$
The proof of Lemma 3.10 is presented in Section A.3. As mentioned earlier, our analysis makes use of the one-step decrease property in Lemma 3.2. Note, however, that if the permutation matrix at the current iteration, $P^{(k)}$, is on the boundary, i.e., $\mathrm{dist}(P^{(k)}, I_n) = R$, it is not clear whether the permutation found by Lemma 3.2 lies within the search region $N_R(I_n)$. Lemma 3.10 helps address this issue (see the proof of Theorem 3.5 below for details).

Proof of Proposition 3.3
We show this result by contradiction. Suppose that there exists a $T \geq R/2$ such that $\mathrm{supp}(P^*) \not\subseteq \mathrm{supp}(P^{(T)})$. Let $i \in \mathrm{supp}(P^*)$ with $i \notin \mathrm{supp}(P^{(T)})$; then by Assumption 3.1 (1), we have $\|P^{(T)} y - P^* y\|_\infty \geq L$, and in particular $\|P^{(T)} y - P^* y\|^2 \geq L^2 > 0$. By Lemma 3.10, we have $\|P^{(k+1)} y - P^* y\|^2 - \|P^{(k)} y - P^* y\|^2 \leq -L^2/5$ for all $k \leq T - 1$. As a result,
$$\|P^{(T)} y - P^* y\|^2 \leq \|P^{(0)} y - P^* y\|^2 - T \cdot L^2/5 \leq \|P^{(0)} y - P^* y\|^2 - (R/2) \cdot L^2/5 .$$
Since by Assumption 3.1 (1), $\|P^{(0)} y - P^* y\|^2 = \|y - P^* y\|^2 \leq r U^2$, and $R \geq 10\, r U^2/L^2$ by Assumption 3.1 (2), we have $\|P^{(T)} y - P^* y\|^2 \leq r U^2 - R L^2/10 \leq 0$. This is a contradiction, so such an iteration counter $T$ does not exist; and for all $k \geq R/2$, we have $\mathrm{supp}(P^*) \subseteq \mathrm{supp}(P^{(k)})$.
To complete the proof, we use another claim concerning the iterates $P^{(k)}$, whose statement and proof are given in Section A.6. By this claim, the update rule (2.3), and inequality (3.21), we obtain a recursion on the objective values. Using the notation $a_k := \|\bar H P^{(k)} y\|^2$, $\lambda = 1 - \eta/4$ and $\tilde e = 9\eta \|\bar H \epsilon\|^2$, this recursion reads $a_{k+1} \leq \lambda a_k + \tilde e$ for all $k \geq 0$. Therefore, we have $a_k \leq \lambda^k a_0 + \tilde e/(1 - \lambda) = \lambda^k a_0 + 36 \|\bar H \epsilon\|^2$, which establishes (3.8) and concludes the proof of the theorem.
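For completeness, here is the geometric-series step used above, spelled out (a standard computation under the stated recursion):
$$a_k \leq \lambda^k a_0 + \tilde e \sum_{t=0}^{k-1} \lambda^t \leq \lambda^k a_0 + \frac{\tilde e}{1 - \lambda} = \Big(1 - \frac{\eta}{4}\Big)^k a_0 + \frac{9 \eta \|\bar H \epsilon\|^2}{\eta/4} = \Big(1 - \frac{\eta}{4}\Big)^k a_0 + 36\, \|\bar H \epsilon\|^2 .$$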

Proof of Theorem 3.6
By the definition of $G(\cdot)$, there exists $\tilde P \in \Pi_n$, feasible for the one-step search in (2.3) starting from $P^{(k)}$, for which the relations (3.23)–(3.25) hold. By (3.23) and (3.25), and recalling that $\bar H P^* y = \bar H \epsilon$, we obtain an inequality equivalent to (3.26). On the other hand, from (3.24) we obtain (3.27). Summing (3.26) and (3.27) gives (3.28), where the second inequality is by Assumption 3.1 (3) and the third inequality uses (3.30), whose last step follows from the Cauchy–Schwarz inequality. Rearranging terms in (3.28), and making use of (3.30), we arrive at (3.31). By the definition of $z$, there exist $i, j \in [n]$ such that (3.32) holds, where the last inequality makes use of Assumption 3.1 (4). On the other hand, by (3.27) we obtain (3.33). Combining (3.31), (3.32) and (3.33) yields (3.34), where the second inequality is by the Cauchy–Schwarz inequality. The inequality in display (3.34) leads to
$$\|P^{(k)} y - P^* y\|_\infty^2 \leq 800\, \bar\sigma^2 + 10\, G(P^{(k)}) ,$$
which completes the proof of the theorem.

Proof of Theorem 3.8
Recall that $P^* y = X \beta^* + \epsilon$, so it holds that $\beta^* = (X^\top X)^{-1} X^\top (P^* y - \epsilon)$. Therefore,
$$\beta^{(k)} - \beta^* = (X^\top X)^{-1} X^\top \big( P^{(k)} y - P^* y + \epsilon \big) .$$
Hence we obtain (3.36), where the second inequality is by Assumption 3.7 (2). Note that (3.37) holds, where the first inequality is because $P^{(k)} y - P^* y$ has at most $2R$ nonzero coordinates, and the second inequality makes use of Theorem 3.6 and the definition of $\bar c$ in Theorem 3.8. On the other hand, by Assumption 3.7 (1) we obtain (3.38). Combining (3.36), (3.37) and (3.38), and squaring both sides, completes the proof of (3.11).

Sufficient conditions for assumptions to hold
Our analysis in Sections 3.1 and 3.2 is completely deterministic in nature, relying on Assumptions 3.1, 3.4 and 3.7. To provide some intuition, we discuss below some probability models on $X$ and $\epsilon$ under which Assumption 3.1 (3), (4), Assumption 3.4 (2) and Assumption 3.7 hold true with high probability.

A random model for the matrix X
When the rows of $X$ are iid draws from a well-behaved probability distribution, Assumption 3.1 (3) and Assumption 3.7 (1) hold true with high probability. This is formalized in the following lemma.
Note that the almost-sure boundedness assumption on $\|x_i\|$ can be relaxed to cases where $\|x_i\|$ is bounded with high probability (e.g., $x_i \overset{\text{iid}}{\sim} N(0, \Sigma)$).

Summary
We summarize parameter choices informed by the results above in the following corollary.
Corollary 3.13 Suppose the matrix $X$ is drawn from a probability model as discussed in Lemma 3.11 with $m = R$, and the noise term $\epsilon$ satisfies the assumptions in Lemma 3.12. Then, with probability at least $1 - 6\tau$, the inequalities in (3.2), (3.3), (3.7) and (3.10) hold true with the parameter choices delivered by Lemmas 3.11 and 3.12.

An approximate local search step

In view of Assumption 3.1 (3), for $n \gg d$ we have $\|\bar H (e_i - e_j)\|^2 \approx 2$; note that, in general, $\|\bar H (e_i - e_j)\|^2 \leq 2$. Hence, one can approximately optimize (4.2) by minimizing an upper bound on the last two terms in the second line of display (4.2); this upper bound is given in (4.3). Denoting $w := \bar H P^{(k)} y$ and $v := y - w$, the objective in (4.3) can be expressed through the coordinates of $y$ and $v$.

So problem (4.3) is equivalent to
$$\min_{i, j \in [n]} \ (y_i - y_j)(v_i - v_j) . \qquad (4.4)$$
As we discuss below, the computational cost of the above problem can be reduced by exploiting its structural properties. Let us denote $z_i = (y_i, v_i) \in \mathbb{R}^2$. Among the set of points $\{z_1, \ldots, z_n\}$, we say $z_i$ is a "left-top" point if for all $j \in [n]$, either $y_j \geq y_i$ or $v_j \leq v_i$; we say $z_i$ is a "right-bottom" point if for all $j \in [n]$, either $y_j \leq y_i$ or $v_j \geq v_i$. Let $(i^*, j^*)$ be an optimal solution of (4.4); then one of $\{z_{i^*}, z_{j^*}\}$ must be a left-top point and the other a right-bottom point. Let $Z_{lt}$ and $Z_{rb}$ be the sets of left-top and right-bottom points, respectively, and let $S_{lt}$ and $S_{rb}$ denote the corresponding index sets. Then Problem (4.4) is equivalent to
$$\min_{i \in S_{lt},\ j \in S_{rb}} \ (y_i - y_j)(v_i - v_j) , \qquad (4.5)$$
implying that it suffices to compute the values $(y_i - y_j)(v_i - v_j)$ for $i \in S_{lt}$ and $j \in S_{rb}$. Algorithm 2 discusses how to compute $S_{lt}$ and $S_{rb}$; this requires (a) a sorting operation on $y$, which can be done once at a cost of $O(n \log(n))$, and (b) two additional passes over the data at cost $O(n)$ (to be performed at every iteration of Algorithm 1). The computation of (4.5) can be further simplified, as discussed in the following section.

Algorithm 2 Fast algorithm for Problem (4.4)
Input: Vectors $v, y \in \mathbb{R}^n$.
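A minimal Python sketch of the two frontier passes and the (unpruned) pair search (our illustration; ties in $y$ are ignored for simplicity, and the final double loop is precisely what the bisection scheme of the next subsection accelerates):

```python
import numpy as np

def best_approx_swap(y, v):
    """Find argmin_{i,j} (y_i - y_j) * (v_i - v_j) via frontier points.

    A sketch of Algorithm 2 (ties in y ignored): sort by y once, then
    one forward pass collects the left-top frontier (running max of v)
    and one backward pass the right-bottom frontier (running min of v).
    """
    n = len(y)
    order = np.argsort(y)                 # O(n log n), done once for y
    ys, vs = y[order], v[order]
    lt, run_max = [], -np.inf             # left-top: no point left & above
    for t in range(n):
        if vs[t] >= run_max:
            lt.append(t)
        run_max = max(run_max, vs[t])
    rb, run_min = [], np.inf              # right-bottom: no point right & below
    for t in range(n - 1, -1, -1):
        if vs[t] <= run_min:
            rb.append(t)
        run_min = min(run_min, vs[t])
    # the optimum pairs one left-top point with one right-bottom point;
    # the paper prunes this double loop further via the partial order
    best, pair = 0.0, None                # the optimal value is non-positive
    for a in lt:
        for b in rb:
            val = (ys[a] - ys[b]) * (vs[a] - vs[b])
            if val < best:
                best, pair = val, (order[a], order[b])
    return best, pair
```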

Faster computation of Problem (4.5)
To simplify the computation of (4.5), we introduce a partial order $\preceq$ on points in $\mathbb{R}^2$: for $p, q \in \mathbb{R}^2$, we write $p \preceq q$ if $p_1 \leq q_1$ and $p_2 \leq q_2$. It is easy to check that for any two points $z_i, z_j \in Z_{rb}$, either $z_i \preceq z_j$ or $z_j \preceq z_i$. So we can write
$$Z_{rb} = \{z_{i_1}, z_{i_2}, \ldots\} \quad \text{with} \quad z_{i_1} \preceq z_{i_2} \preceq \cdots . \qquad (4.6)$$
For any $z_m \in Z_{lt}$, two cases can happen, depending on how $z_m$ compares to the chain (4.6). Because $Z_{rb}$ is totally ordered as in (4.6), Case 1 can be identified by a bisection method at a cost of (at most) $O(\log n)$. Similarly, for Case 2, the values of $\bar t$ and $\bar b$ can be found using bisection. Since the optimal value of (4.4) must be non-positive, we need to compute $(y_m - y_{i_t})(v_m - v_{i_t})$ only for $\bar b \leq t \leq \bar t$.
The methods described for solving (4.4) are summarized in Algorithm 2. Finally, note that when $\mathrm{dist}(P^{(k)}, I_n) \geq R - 1$, similar ideas still apply. When $\mathrm{dist}(P^{(k)}, I_n) = R - 1$, we consider the restricted variant of (4.4) given in (4.7), in which the candidate swap must keep the distance to $I_n$ within $R$; when $\mathrm{dist}(P^{(k)}, I_n) = R$, we consider the analogous restricted problem (4.8). Problems (4.7) and (4.8) can also be solved efficiently by finding the sets of left-top and right-bottom points and using the partial order to simplify the computation. We omit the details for brevity.

Experiments
We perform numerical experiments to study the performance of Algorithm 1.
Data generation. We consider the setup of our basic model (1.1), where the entries of $X \in \mathbb{R}^{n \times d}$ are iid $N(0, 1)$, and $\beta^*$ is generated uniformly from the unit sphere in $\mathbb{R}^d$ (i.e., $\|\beta^*\| = 1$), independently of $X$. We consider two schemes for generating the permutation $P^*$: (a) Random scheme: select $r$ coordinates uniformly from $\{1, \ldots, n\}$. (b) Equi-spaced scheme: assume $y_1 \leq \cdots \leq y_n$ (otherwise re-order the data). Let $a_1 < \cdots < a_r$ be the sequence of $r$ equi-spaced real numbers with $a_1 = \min_{i \in [n]} y_i$ and $a_r = \max_{i \in [n]} y_i$. Select $r$ indices $i_1 < \cdots < i_r$ such that $i_1 = \operatorname{argmin}_{i \in [n]} |y_i - a_1|$ and $i_s = \operatorname{argmin}_{i_{s-1}+1 \leq i \leq n} |y_i - a_s|$ for all $2 \leq s \leq r$. After the $r$ coordinates are chosen, we generate a uniformly distributed random permutation on these $r$ coordinates (see the sketch below). We generate $\epsilon$ (independent of $X$ and $\beta^*$) with $\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$ for some $\sigma \geq 0$ ($\sigma = 0$ corresponds to the noiseless setting). Unless otherwise specified, we set the tolerance $e = 0$ and $K = \infty$ in Algorithm 1.
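A minimal Python sketch of the random-scheme data generation just described (our illustration, assuming numpy; `perm` encodes $P^*$ via $(P^* y)_i = y_{\mathrm{perm}[i]}$):

```python
import numpy as np

def make_instance(n, d, r, sigma, rng):
    """Generate (X, y, perm, beta*) under model (1.1), random scheme."""
    X = rng.standard_normal((n, d))
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)          # beta* uniform on the unit sphere
    eps = sigma * rng.standard_normal(n)
    perm = np.arange(n)
    idx = rng.choice(n, size=r, replace=False)  # r coordinates, chosen uniformly
    perm[idx] = rng.permutation(idx)      # uniform random permutation on them
    # (the realized number of mismatches is therefore at most r)
    y = np.empty(n)
    y[perm] = X @ beta + eps              # enforce P* y = X beta* + eps
    return X, y, perm, beta

X, y, perm, beta_star = make_instance(500, 10, 50, 0.1, np.random.default_rng(0))
```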

Experiments for the noiseless setting
We first consider the noiseless setting ($\epsilon = 0$) with different combinations of $(d, r, n)$. We use the random scheme to generate the unknown permutation $P^*$. We set $R = n$ in Algorithm 1 and a maximum iteration limit of 1000. While these algorithm parameter choices are not covered by our theory, in practice when $r$ is small, our local search algorithm converges to optimality, and the number of iterations is bounded by a small constant multiple of $r$ (e.g., for $r = 50$, the algorithm converges to optimality within around 60 iterations). In Figure 2 [left panel], we plot the Hamming distance between the solution $\hat P$ computed by Algorithm 1 and the underlying permutation $P^*$ (i.e., $\mathrm{dist}(\hat P, P^*)$) versus $r$. In Figure 2 [right panel], we present the error in estimating $\beta^*$ versus $r$. More precisely, letting $\hat\beta$ be the solution computed by Algorithm 1 (i.e., $\hat\beta = (X^\top X)^{-1} X^\top \hat P y$), the beta error is defined as $\|\hat\beta - \beta^*\| / \|\beta^*\|$. For each choice of $(r, d)$, we report the average over 50 independent replications (the vertical bars show standard errors, which are hardly visible in the figures). As shown in Figure 2, when $r$ is small, the underlying permutation $P^*$ is exactly recovered, and the corresponding beta error is 0. As $r$ becomes larger, Algorithm 1 fails to recover $P^*$ exactly, and $\mathrm{dist}(P^*, \hat P)$ approaches the maximal possible value $n = 500$. In contrast, the estimation error varies more smoothly: as the value of $r$ increases, the beta error increases. We also observe that the recovery of $P^*$ depends upon the number of covariates $d$: permutation recovery performance deteriorates with increasing $d$. This is consistent with our theory, which suggests that the performance of our algorithm depends upon both $r$ and $d$.

Experiments for the noisy setting
We explore the performance of Algorithm 1 under the noisy setting ($\epsilon \neq 0$).
Performance for different values of $R$: We define Relative Obj as the objective value computed by Algorithm 1 divided by $\|\bar H \epsilon\|^2$. Figure 3 presents the Relative Obj, beta error and Hamming distance of the local search algorithm for different values of $R$ (the x-axis corresponds to the values of $R$). Here we consider $n = 1000$, $d = 10$, $r = 10$ and $\sigma = 0.1$, and use the equi-spaced scheme to choose the mismatched coordinates of $P^*$. We highlight the value at $R = r = 10$ by a red point. As shown in Figure 3, as $R$ increases, the Relative Obj decreases below 1; this is consistent with our theory stating that with a proper choice of $R$, the final objective value will be below a constant multiple of $\|\bar H \epsilon\|^2$. In contrast to the Relative Obj profile, the beta error and Hamming distance first decrease and then increase as $R$ grows. This suggests that when $R$ is too large, Algorithm 1 can overfit, and further regularization may be necessary to mitigate overfitting; a detailed investigation of this matter is left as future work. In this example, the best beta error and Hamming distance are achieved when $R$ equals $r$. Note that in Figure 3 [left panel], the Relative Obj is close to 1 when we choose $R$ close to $r$. Therefore, if we have a good estimate of the noise level $\sigma$ (but the exact value of $r$ is not available), we can choose a value of $R$ at which the Relative Obj is approximately 1 (see the sketch below).
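This selection rule can be sketched as follows (our illustration: `run_local_search` is a hypothetical routine returning the final objective of Algorithm 1, the factor 1.1 is an arbitrary slack, and we approximate $\|\bar H \epsilon\|^2$ by its expectation $\sigma^2 (n - d)$ under iid Gaussian noise):

```python
import numpy as np

def choose_R(X, y, sigma_hat, R_grid, run_local_search):
    """Pick the smallest R whose Relative Obj is roughly 1.

    `run_local_search(X, y, R)` is a hypothetical routine returning the
    final objective ||Hbar P y||^2 of Algorithm 1; we benchmark against
    E||Hbar eps||^2 = sigma^2 (n - d) for iid N(0, sigma^2) noise.
    """
    n, d = X.shape
    benchmark = sigma_hat ** 2 * (n - d)
    for R in sorted(R_grid):
        if run_local_search(X, y, R) <= 1.1 * benchmark:  # slack 1.1: arbitrary
            return R
    return max(R_grid)
```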
Finally, we note that in the noisy case, the local search method cannot exactly recover $P^*$. Indeed, for a solution of (1.2) to exactly recover $P^*$ in the noisy case, a smaller value of $\sigma$ would be needed (see the discussion in [15]). Even though we cannot exactly recover $P^*$ in this example, we may still obtain a good estimate of $\beta^*$; see Figure 3 [middle panel].
Estimating $P^*, \beta^*$ under different noise levels: For a given $\sigma$ (standard deviation of the noise), let the relative beta error be $\|\hat\beta - \beta^*\| / (\sigma \|\beta^*\|)$, where $\hat P$ and $\hat\beta$ are the estimates available from Algorithm 1 upon termination.
We consider an example with $n = 500$, $d = r = 10$ and different values of $\sigma \in \{0.01, 0.03, 0.1, 0.3, 1.0\}$, and use the random scheme to generate the unknown permutation $P^*$. We run Algorithm 1 with $R = r$. Figure 4 presents the Hamming distance and relative beta error under different noise levels. In Figure 4 [left panel], we see that the Hamming distance increases with $\sigma$: recovering $P^*$ becomes harder as the noise level grows. In Figure 4 [right panel], the relative beta error remains nearly unchanged across different values of $\sigma$. This is consistent with the conclusion of Theorem 3.8 that $\|\hat\beta - \beta^*\|$ is bounded by a value proportional to $\sigma$.

Comparisons with existing methods
We compare the following methods for (1.3):
• AltMin: the alternating minimization method of [9]. We initialize with $P = I_n$ and $\beta = 0$, and alternately minimize over $P$ and $\beta$ until no improvement can be made.
• StoEM: The stochastic expectation maximization method [2]. We run the algorithm for 30 steps under the default setting.
• S-BD: the convex optimization approach based on robust regression proposed in [18].

For AltMin and StoEM, we use the implementation provided in the repository accompanying the paper [2]. We compare these methods with Algorithm 1 (denoted by Alg-1) and the variant of Algorithm 1 with the fast approximate local search steps introduced in Section 4 (denoted by Fast-Alg-1). We run Alg-1 and Fast-Alg-1 with $R = r$. Table 1 presents the beta errors of the different methods on an example with $n = 500$, $d = 10$, and $P^*$ chosen by the random scheme. The reported values are averages over 10 independent replications.
As shown in Table 1, in the noiseless setting, Alg-1 recovers the true value of $\beta^*$ for $r$ up to 300. Fast-Alg-1 behaves quite similarly to Alg-1, though its performance is marginally worse for larger values of $r$. In contrast, AltMin and StoEM are not able to exactly estimate $\beta^*$ even for small values of $r$, and they have large beta errors for all values of $r$. S-BD, which is based on convex optimization, can also recover $\beta^*$ for $r \leq 200$, but for $r = 300$ its performance degrades and it has a large beta error.

Table 1: Comparison with existing methods on an example with $n = 500$, $d = 10$.
In the noisy setting with $\sigma = 0.1$, Alg-1 and Fast-Alg-1 have similar performance and compute estimates of $\beta^*$ with much smaller beta error than AltMin and StoEM. For small values of $r$ ($\leq 100$), S-BD performs similarly to Alg-1 and Fast-Alg-1, while for $r = 200$ and $300$ its performance degrades and its beta error is much larger than that of Alg-1 and Fast-Alg-1.

Scalability to large instances
We explore the scalability of our proposed approach to problems with large $n$ (from $n \approx 10^4$ up to $n \approx 10^7$); for these instances, Fast-Alg-1 is computationally attractive. All these experiments are run on the MIT Engaging cluster with 1 CPU and 16 GB of memory. The code is written in Julia 1.2.0.
We consider examples with $d = 100$, $r = 50$ and $n \in \{10^4, 10^5, 10^6, 10^7\}$. Here, the mismatched coordinates of $P^*$ are chosen by the random scheme, and we set $R = r$ for all instances. For these examples, we do not form the $n \times n$ matrices $H$ or $\bar H$ explicitly; instead, we compute a thin QR decomposition of $X$ (with $Q \in \mathbb{R}^{n \times d}$) and maintain $Q$ in memory. The results are presented in Table 2, where "total time" is the total runtime of Fast-Alg-1 upon termination, "QR time" is the time used for the QR decomposition, and "iterations" is the number of iterations taken by the local search method until convergence. All numbers reported in the table are averaged across 10 independent replications. As shown in Table 2, Fast-Alg-1 can solve examples with $n$ up to $10^7$ (and $d = 100$) within around 200 seconds; this runtime includes roughly 60 iterations of local search steps and the time to perform the QR decomposition. The total runtime is empirically of the order $O(n)$ as $n$ increases. Note that the QR time can be viewed as a benchmark runtime for ordinary least squares; hence, for the examples considered, the runtime of Fast-Alg-1 is a constant multiple of the runtime of ordinary least squares (the total time will increase with $r$ and/or $R$). Interestingly, the runtimes for the noisy case ($\sigma = 0.1$) are smaller than for the noiseless case ($\sigma = 0$). We believe this is because Algorithm 2 is faster in the noisy case: empirically, the numbers of "left-top" and "right-bottom" points are smaller than in the noiseless case.

Table 2: Runtimes of Fast-Alg-1 on instances with $d = 100$, $r = 50$ and varying $n$. For reference, the time taken by Alg-1 for $n = 10^4$ is 100 seconds, which is 500x slower than Fast-Alg-1.
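A sketch of this memory-saving device (our illustration): with a thin QR decomposition $X = QR$, one has $H = Q Q^\top$, so $\|\bar H v\|^2 = \|v\|^2 - \|Q^\top v\|^2$ and $\bar H_{ij} = \mathbb{1}[i = j] - \langle q_i, q_j \rangle$, with $q_i$ the $i$-th row of $Q$:

```python
import numpy as np

def hbar_tools(X):
    """Return helpers for ||Hbar v||^2 and Hbar entries without forming Hbar.

    Uses a thin QR decomposition X = Q R with Q of shape (n, d),
    so memory is O(nd) instead of O(n^2).
    """
    Q, _ = np.linalg.qr(X, mode='reduced')

    def sq_norm(v):
        # ||Hbar v||^2 = ||v||^2 - ||Q^T v||^2, since Hbar = I - Q Q^T
        Qtv = Q.T @ v
        return v @ v - Qtv @ Qtv

    def hbar_entry(i, j):
        # Hbar_ij = 1[i == j] - <q_i, q_j>
        return (1.0 if i == j else 0.0) - Q[i] @ Q[j]

    return sq_norm, hbar_entry
```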

Acknowledgments
The authors would like to thank the anonymous referees for their constructive comments that led to an improvement of the paper.

A.1 Technical lemmas
Lemma A.1 (Covariance estimation) Let $X = [x_1, \ldots, x_n]^\top$ be a random matrix in $\mathbb{R}^{n \times d}$ whose rows $x_1, \ldots, x_n$ are iid zero-mean random vectors in $\mathbb{R}^d$ with covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$. Suppose $\|x_i\| \leq b$ almost surely. Then for any $t > 0$, a standard tail bound on $\|\frac{1}{n} X^\top X - \Sigma\|_2$ holds; see, e.g., Corollary 6.20 of [25] for a proof.
If $e_i^\top \tilde P = e_i^\top P$, then the claim holds immediately. If $e_i^\top \tilde P \neq e_i^\top P$, let $\ell \in [n]$ be the index such that $e_i^\top \tilde P = e_\ell^\top P$. Since $\mathrm{dist}(\tilde P, P) = 2$, it holds that $e_\ell^\top \tilde P = e_i^\top P$. Denote $i^+ := \pi_P(i)$ and $\ell^+ := \pi_P(\ell)$. Because $e_{i^+}^\top = e_i^\top P = e_\ell^\top \tilde P$ and $e_{\ell^+}^\top = e_\ell^\top P = e_i^\top \tilde P$, it holds that $i^+ = \pi_{\tilde P}(\ell)$ and $\ell^+ = \pi_{\tilde P}(i)$. In what follows, we discuss different cases depending upon the ordering of the values $v_i, v_{i^+}, v_\ell$ and $v_{\ell^+}$. Without loss of generality, we can assume $v_{i^+} \geq v_i$. Then there are 12 possible orderings of $v_i, v_{i^+}, v_\ell, v_{\ell^+}$; the first 6 cases below correspond to $v_{\ell^+} \geq v_\ell$, and the last 6 to $v_{\ell^+} \leq v_\ell$.
In each of these cases, the computation leads to a contradiction to (A.9), so these cases cannot appear.
In view of all these cases, we obtain the desired inequality, completing the proof.

A.2 Proof of Lemma 3.2
Without loss of generality, we assume $P^* = I_n$ (otherwise, we work with $P (P^*)^{-1}$ in place of $P$ and $\tilde P (P^*)^{-1}$ in place of $\tilde P$). For any $k \in [n]$, let $k^+ := \pi_P(k)$. Let $i$ be an index such that $(y_{i^+} - y_i)^2 = \|P y - y\|_\infty^2$. Without loss of generality, we can assume $y_{i^+} > y_i$. Denote $i_0 = i$ and $i_1 = i^+$. By the structure of a permutation, there exists a cycle
$$i_0 \xrightarrow{P} i_1 \xrightarrow{P} i_2 \xrightarrow{P} \cdots \xrightarrow{P} i_0 , \qquad \text{(A.10)}$$
where $q_1 \xrightarrow{P} q_2$ means $q_2 = \pi_P(q_1)$. In moving from $y_i$ to $y_{i^+}$, we "upcross" the value $(y_i + y_{i^+})/2$. Since the cycle (A.10) finally returns to $i_0$, there exists one step where we "downcross" this value. In other words, there exists $j \in [n]$ with $(j, j^+) \neq (i, i^+)$ such that $y_{j^+} < y_j$ and $(y_i + y_{i^+})/2 \in [y_{j^+}, y_j]$. Define $\tilde P$ as follows:
$$\pi_{\tilde P}(i) = j^+ , \quad \pi_{\tilde P}(j) = i^+ , \quad \pi_{\tilde P}(k) = \pi_P(k) \ \ \forall k \neq i, j .$$
We note that $\mathrm{dist}(\tilde P, P) = 2$ and $\mathrm{supp}(\tilde P) \subseteq \mathrm{supp}(P)$. There are 3 cases, depending upon the relative ordering of $y_i, y_{i^+}, y_j, y_{j^+}$, as considered below.
Case 2: $y_{i^+} \geq y_j \geq y_i \geq y_{j^+}$. In this case, let $a = y_{i^+} - y_j$, $b = y_j - y_i$ and $c = y_i - y_{j^+}$. Then $a, b, c \geq 0$, and the decrease in objective equals
$$\|P y - y\|^2 - \|\tilde P y - y\|^2 = (a+b)^2 + (b+c)^2 - a^2 - c^2 = 2b(a + b + c) .$$
Since $(y_i + y_{i^+})/2 \leq y_j$, we have $b \geq a$, and hence
$$2b(a + b + c) \geq 2b(a + b) \geq (a + b)^2 \geq \frac{1}{2}(y_{i^+} - y_i)^2 .$$
Case 3: $y_{i^+} \geq y_j \geq y_{j^+} \geq y_i$. In this case, let $a = y_{i^+} - y_j$, $b = y_j - y_{j^+}$ and $c = y_{j^+} - y_i$. Then $a, b, c \geq 0$, and the decrease in objective equals
$$\|P y - y\|^2 - \|\tilde P y - y\|^2 = (a+b+c)^2 + b^2 - a^2 - c^2 = 2(a + b)(b + c) .$$
Since $(y_i + y_{i^+})/2 \in [y_{j^+}, y_j]$, we have $a + b \geq (a+b+c)/2$ and $b + c \geq (a+b+c)/2$, and hence
$$2(a + b)(b + c) \geq \frac{1}{2}(a + b + c)^2 = \frac{1}{2}(y_{i^+} - y_i)^2 .$$
Combining Cases 1, 2 and 3 completes the proof of (3.5).

A.3 Proof of Lemma 3.10
To prove Lemma 3.10, we first prove the following proposition.

Proposition A.5 Under the assumptions of Lemma 3.10, it holds that $\|P^{(t)} y - P^* y\|_\infty \geq L$ for all $0 \leq t \leq k$.
Let $X = \bar U D \bar V^\top$ be the SVD of $X$, with $\bar U \in \mathbb{R}^{n \times d}$ satisfying $\bar U^\top \bar U = I_d$, $\bar V \in \mathbb{R}^{d \times d}$ satisfying $\bar V^\top \bar V = I_d$, and $D \in \mathbb{R}^{d \times d}$ diagonal. Then we have $H = \bar U \bar U^\top$. Note that for any $j \in [d]$, one can verify that $e_j^\top \bar U^\top \epsilon$ is sub-Gaussian with variance proxy $\sigma^2$. Hence, for any $t > 0$, we have
$$\mathbb{P}\big( \|\bar U^\top \epsilon\|_\infty > t \big) \leq 2d \exp\big( -t^2/(2\sigma^2) \big) .$$