On the Incommensurability Phenomenon

Suppose that two large, multi-dimensional data sets are each noisy measurements of the same underlying random process, and principal components analysis is performed separately on each data set to reduce its dimensionality. In some circumstances it may happen that the two lower-dimensional data sets have an inordinately large Procrustean fitting-error between them. The purpose of this manuscript is to quantify this "incommensurability phenomenon." In particular, under specified conditions, the square Procrustean fitting-error of the two normalized lower-dimensional data sets is asymptotically a convex combination, via a correlation parameter, of the square Hausdorff distance between the projection subspaces and the maximum possible value of the square Procrustean fitting-error for normalized data. We show how this gives rise to the incommensurability phenomenon, and we employ illustrative simulations, as well as real data, to explore how the incommensurability phenomenon may have an appreciable impact.


Overview and Outline
The ever-increasing importance of modern big-data analytics brings with it the imperative to understand fusion of, and inference on, multiple massive, disparate, distributed data sets.
What processing can be profitably done separately, for subsequent joint inference? In the case where each data set consists of measurements on the same objects, and combining the full data sets is prohibitively expensive, it seems reasonable to separately project each large, high-dimensional data set into a lower-dimensional space, say via principal components analysis with embedding dimension d. However, the separate projections need not agree unless there is a sufficiently large gap between the d-th and (d+1)-th eigenvalues of the covariance matrix for the data sets. These factors may combine to allow substantial probability of a significant distance between the separate projection subspaces, which then causes an inordinately large Procrustean fitting-error.
Of course, one remedy is simply not to do the projections separately for the two data sets; robust joint embedding schemes are available, such as those developed in [24], [16], and [13]. Indeed, an easily used candidate is canonical correlation analysis (CCA) [9], [7], which can be extended to situations where more than two data sets are being treated, and CCA has good properties for subsequent inferential tasks [20], [21], [17]. The incommensurability phenomenon can then be avoided at the cost of the extra computation involved, although this extra computation might be a significant burden when dealing with a large volume of data in a distributed system.
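To make the CCA alternative concrete, here is a minimal sketch (not the cited implementations) of classical two-view CCA via whitening and an SVD, computed with numpy; the data shapes, regularization constant, and the synthetic two-view data below are illustrative assumptions, not taken from the manuscript.

```python
import numpy as np

def canonical_correlations(X, Y, reg=1e-8):
    """Classical CCA via whitening + SVD.

    X, Y: (m, n) data matrices (features x samples), as in the text.
    Returns the canonical correlations, each in [0, 1], in decreasing order.
    """
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)   # centering, i.e. right-multiplying by H_n
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sxx = Xc @ Xc.T / n + reg * np.eye(X.shape[0])
    Syy = Yc @ Yc.T / n + reg * np.eye(Y.shape[0])
    Sxy = Xc @ Yc.T / n

    def inv_sqrt(S):
        # inverse symmetric square root, for whitening
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.clip(np.linalg.svd(M, compute_uv=False), 0.0, 1.0)

rng = np.random.default_rng(0)
m, n = 5, 2000
Z = rng.standard_normal((m, n))
X = Z + 0.3 * rng.standard_normal((m, n))   # two noisy views of a shared signal Z
Y = Z + 0.3 * rng.standard_normal((m, n))
rho = canonical_correlations(X, Y)
```

Because the two views share the signal Z, the leading canonical correlations come out close to the per-coordinate correlation of the views.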
Another possible remedy would be to choose the embedding dimension d so as to maintain enough of a gap between the d-th and (d+1)-th eigenvalues of the data sets' covariance matrix.
However, this remedy may actually throw out the baby with the bath water; indeed, limiting the embedding dimension to maintain a healthy eigengap may come at the expense of additional signal that might have been mined through the inclusion of additional dimensions, were that practical.
Besides the theoretical relationships proven in this manuscript, a practical and applied contribution of this manuscript is the take-away lesson: awareness of the potential danger in not doing a joint embedding (or adopting similar tactics) in the course of dimension reduction with correlated data sets. Indeed, in the sections that follow, we provide an illustrative vision of what could go wrong.

A Cautionary Tale of Two Scientists
For this section only, we explore an idealized scenario for the purpose of straightforward illustration; the general setting will be treated in Section 3. For this entire manuscript, a general background reference for the matrix analysis tools that we employ (e.g. Procrustes fitting, singular value decomposition, spectral and norm identities and inequalities such as Weyl's Theorem for Hermitian matrices and the interlacing inequalities for Hermitian matrices) is the classical text [8]; background on the Grassmannian (e.g. principal angles, Hausdorff distance) useful for our particular work is easily accessible in [14]; and background on principal components analysis (PCA) can be found in [1]. A classical and broad textbook on the Grassmannian is [5].
Suppose that two scientists each take daily measurements of m features of a random process, where m is a large, positive integer. For each day i = 1, 2, 3, ..., the first scientist records her daily measurements as X^(i) ∈ R^m, where X^(i)_j is her measurement of the j-th feature, and the second scientist records his daily measurements as Y^(i) ∈ R^m, where Y^(i)_j is his measurement of the j-th feature, for j = 1, 2, ..., m. Although the two scientists want to record the same process, suppose that their measurements are made with some error, which we model in the following manner.
There are three collections of random variables {Z^(i)_j}, {Z'^(i)_j}, and {Z''^(i)_j}, each over indices i = 1, 2, 3, ... and j = 1, 2, ..., m, such that these random variables are all collectively independent and identically distributed, and their common distribution has finite variance α > 0. Suppose that the random variables {Z^(i)_j} are the signal feature values associated with the process that the scientists would like to record, and the random variables {Z'^(i)_j} and {Z''^(i)_j} are confounding noise. Let a real-valued "measurement-accuracy" parameter γ be fixed in the interval [0, 1]. One scenario is that for each day i = 1, 2, 3, ... and feature j = 1, 2, ..., m, the first scientist's measurement X^(i)_j is Z^(i)_j or Z'^(i)_j with respective probabilities γ and 1 − γ, and the second scientist's measurement Y^(i)_j is Z^(i)_j or Z''^(i)_j with respective probabilities γ and 1 − γ, all of these choices made independently. A second scenario is that, instead, for each day i = 1, 2, 3, ... and feature j = 1, 2, ..., m, the measurements are the linear combinations X^(i)_j := γZ^(i)_j + √(1 − γ²) Z'^(i)_j and Y^(i)_j := γZ^(i)_j + √(1 − γ²) Z''^(i)_j. The main result of Section 2 is Theorem 1, which will hold in either of these two scenarios. At one extreme, if γ = 1, then the two scientists' measurements are perfectly accurate and X^(i) = Y^(i) for all i. At the other extreme, if γ = 0, then the two scientists' measurements are independent of each other.
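A minimal numpy sketch of the first (mixture) scenario follows; the particular values of m, γ, α, and the normal choice for the common distribution are illustrative assumptions. The point it checks empirically is that each coordinate pair shares the signal Z with probability γ², so Cov(X_j, Y_j) = γ²α while Var(X_j) = Var(Y_j) = α.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_days, gamma, alpha = 4, 200_000, 0.8, 1.0  # alpha = common variance (assumed values)

# signal Z and the two scientists' confounding noises Z', Z'' (all i.i.d., variance alpha)
Z   = rng.normal(0.0, np.sqrt(alpha), (n_days, m))
Zp  = rng.normal(0.0, np.sqrt(alpha), (n_days, m))
Zpp = rng.normal(0.0, np.sqrt(alpha), (n_days, m))

# First scenario: each measurement is the true signal with probability gamma,
# otherwise that scientist's own confounding noise.
X = np.where(rng.random((n_days, m)) < gamma, Z, Zp)
Y = np.where(rng.random((n_days, m)) < gamma, Z, Zpp)

# The coordinates share Z with probability gamma^2, so Cov(X_j, Y_j) = gamma^2 * alpha
cov_xy = np.mean(X * Y, axis=0)   # the means are zero in this sketch
```

Running this, `cov_xy` concentrates near γ²α = 0.64, while the per-coordinate variances stay near α.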
For each positive integer n, denote by X^(n) the matrix [X^(1) | X^(2) | ··· | X^(n)] ∈ R^{m×n} consisting of the first scientist's measurements over the first n days, and denote by Y^(n) the matrix [Y^(1) | Y^(2) | ··· | Y^(n)] ∈ R^{m×n} consisting of the second scientist's measurements over the first n days.
Because the measurement vectors are in the high-dimensional space R^m, suppose the scientists project their respective measurement vectors to the lower-dimensional space R^k for some smaller, positive integer k. This is done in the following manner. Let H_n = I_n − (1/n)J_n denote the centering matrix (I_n and J_n are, respectively, the n × n identity matrix and the n × n matrix of all ones). Suppose that the first scientist chooses a sequence A^(1), A^(2), A^(3), ... of random (or deterministic) elements of the Grassmannian G_{k,m} (the space of all k-dimensional subspaces of R^m), and suppose that the second scientist chooses a sequence B^(1), B^(2), B^(3), ... of random (or deterministic) elements of G_{k,m}. No assumptions are made on the distributions of these elements of the Grassmannian or on their dependence/independence, but one example of interest is where, for n = 1, 2, 3, ..., A^(n), B^(n) ∈ G_{k,m} denote the respective k-dimensional subspaces to which principal components analysis (PCA) projects X^(n)H_n and Y^(n)H_n, respectively (and separately). Let P_{A^(n)} denote the projection operator from R^m onto A^(n). On each day n, the first scientist reports the scaled matrix X̃^(n) := (√k/‖P_{A^(n)} X^(n) H_n‖_F) · P_{A^(n)} X^(n) H_n ∈ R^{k×n} to the Governing Board of Scientists, and the second scientist reports the scaled matrix Ỹ^(n) := (√k/‖P_{B^(n)} Y^(n) H_n‖_F) · P_{B^(n)} Y^(n) H_n ∈ R^{k×n}. Now, the Governing Board of Scientists wants to perform its own check that the two scientists are indeed taking measurements reflecting the same process. So the Governing Board of Scientists computes the Procrustean fitting-error ε(X̃^(n), Ỹ^(n)) := min ‖W X̃^(n) − Ỹ^(n)‖_F, the minimum taken over orthogonal W ∈ R^{k×k}. It will later be seen (from Equation (5)) that the square Procrustean fitting-error satisfies 0 ≤ ε²(X̃^(n), Ỹ^(n)) ≤ 2k; the Governing Board of Scientists reasons that this square Procrustean fitting-error should be small (negligible compared to 2k) if indeed γ is close to 1. Is this reasoning valid?
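The Procrustean fitting-error above is the classical orthogonal Procrustes problem, whose minimum is obtained from an SVD: min over orthogonal W of ‖W X̃ − Ỹ‖_F² equals ‖X̃‖_F² + ‖Ỹ‖_F² − 2·(sum of singular values of Ỹ X̃^T). The following numpy sketch (with arbitrary illustrative data) computes it and also verifies the minimizer explicitly.

```python
import numpy as np

def procrustes_error_sq(Xt, Yt):
    """Square Procrustean fitting-error: min over orthogonal k x k W of
    ||W @ Xt - Yt||_F^2, which by the classical SVD solution equals
    ||Xt||_F^2 + ||Yt||_F^2 - 2 * (sum of singular values of Yt @ Xt.T)."""
    s = np.linalg.svd(Yt @ Xt.T, compute_uv=False)
    return np.linalg.norm(Xt) ** 2 + np.linalg.norm(Yt) ** 2 - 2.0 * s.sum()

def scale_to_sqrt_k(M, k):
    """Normalize M so that ||M||_F = sqrt(k), as for the scaled reports in the text."""
    return np.sqrt(k) * M / np.linalg.norm(M)

rng = np.random.default_rng(2)
k, n = 2, 50
Xt = scale_to_sqrt_k(rng.standard_normal((k, n)), k)
Yt = scale_to_sqrt_k(rng.standard_normal((k, n)), k)

err2 = procrustes_error_sq(Xt, Yt)          # always lies in [0, 2k]

# the optimal orthogonal W is U @ Vt, where Yt @ Xt.T = U diag(s) Vt
U, _, Vt = np.linalg.svd(Yt @ Xt.T)
W = U @ Vt
```

With the √k-normalization, ‖X̃‖_F² = ‖Ỹ‖_F² = k, so the closed form immediately gives the bound 0 ≤ ε² ≤ 2k cited in the text.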
In the following, d(·,·) denotes the Hausdorff distance (e.g. see [14]) on the Grassmannian G_{k,m}; for A, B ∈ G_{k,m} it satisfies d²(A, B) = 2 Σ_{i=1}^k sin²θ_i(A, B), where θ_1(A, B), ..., θ_k(A, B) are the principal angles between A and B. Note that the square Hausdorff distance satisfies 0 ≤ d²(A, B) ≤ 2k. Theorem 1. In either of the two scenarios of this section, it holds almost surely that ε²(X̃^(n), Ỹ^(n)) − [γ² · d²(A^(n), B^(n)) + (1 − γ²) · 2k] → 0 as n → ∞. The proof of Theorem 1 is given later, in Section 3.2, as a special case of the more general Theorem 2.
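This distance is easy to compute: the cosines of the principal angles between two subspaces are the singular values of Q_A^T Q_B for orthonormal bases Q_A, Q_B, and d²(A, B) also equals ‖P_A − P_B‖_F². The numpy sketch below (with arbitrary illustrative subspaces) checks the two formulas against each other.

```python
import numpy as np

def grassmann_d2(QA, QB):
    """Square Hausdorff distance d^2(A, B) = 2 * sum_i sin^2(theta_i) between the
    column spans of the orthonormal bases QA, QB (both m x k). The cosines of
    the principal angles are the singular values of QA.T @ QB."""
    c = np.clip(np.linalg.svd(QA.T @ QB, compute_uv=False), 0.0, 1.0)
    return 2.0 * np.sum(1.0 - c ** 2)

def ortho_basis(M):
    # orthonormal basis for the column space of M
    Q, _ = np.linalg.qr(M)
    return Q

rng = np.random.default_rng(3)
m, k = 6, 2
QA = ortho_basis(rng.standard_normal((m, k)))
QB = ortho_basis(rng.standard_normal((m, k)))

# d^2 also equals the square Frobenius norm of the difference of the projectors
PA, PB = QA @ QA.T, QB @ QB.T
d2 = grassmann_d2(QA, QB)
```

The agreement follows from ‖P_A − P_B‖_F² = 2k − 2 tr(P_A P_B) = 2 Σ_i sin²θ_i, which also exhibits the bound 0 ≤ d² ≤ 2k.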
Theorem 1 says that ε²(X̃^(n), Ỹ^(n)) asymptotically becomes this convex combination (via γ²) of 2k and d²(A^(n), B^(n)). In particular, if γ is close to 0, so that the two scientists' measurements are nearly independent of each other, then indeed ε²(X̃^(n), Ỹ^(n)) is close to its maximum possible value 2k; but if γ is close to 1, meaning that the scientists' measurements are close to being the same as each other, we then have ε²(X̃^(n), Ỹ^(n)) close to d²(A^(n), B^(n)). Is this square Hausdorff distance close to zero when γ is close to 1?
In Section 4 we show that, in fact, if the (separate) principal components analysis projections are used then this may not be the case, and the square Hausdorff distance d²(A^(n), B^(n)) might even be close to its maximum possible value of 2k. By contrast, if the two scientists had both used the simple-minded projection consisting of just taking the first k coordinates of R^m and ignoring the rest of the coordinates, then d²(A^(n), B^(n)) would be identically zero.

The Asymptotic Relationship Between Procrustean Fitting-Error and the Projection Distance
The main result of this section is the statement and proof of Theorem 2. We begin with a description of a general setting and a list of basic facts that will be used in the proof of Theorem 2.

Preliminaries and the general setting
From this point on, we will consider a much more general setting than the idealized setting of Section 2. Suppose now that X^(1), X^(2), X^(3), ... ∈ R^m and Y^(1), Y^(2), Y^(3), ... ∈ R^m are random vectors (for convenience, let us denote X ≡ X^(1), Y ≡ Y^(1)) such that the stacked random vectors [X^(1); Y^(1)], [X^(2); Y^(2)], [X^(3); Y^(3)], ... ∈ R^{2m} are independent and identically distributed, with covariance matrix

Cov([X; Y]) = [ Cov(X)    Cov(X, Y) ;  Cov(Y, X)    Cov(Y) ].

(We no longer require, in the manner of Section 2, that X and Y have independent or identically distributed components, nor that they arise as a mixture of other random variables in any particular way.) Assume that Cov(X) and Cov(Y) are both nonzero matrices.
Then define, for each positive integer n, the random matrices X^(n) := [X^(1) | X^(2) | ··· | X^(n)] ∈ R^{m×n} and Y^(n) := [Y^(1) | Y^(2) | ··· | Y^(n)] ∈ R^{m×n}; let A^(n) ∈ G_{k,m} denote the k-dimensional subspace to which principal components analysis (PCA) projects X^(n)H_n, and let B^(n) ∈ G_{k,m} denote the k-dimensional subspace to which PCA projects Y^(n)H_n (these projections being done separately). In the special case where Cov(X) and Cov(Y) are scalar multiples of I_m, we will explicitly allow {A^(n)}_{n=1}^∞ and {B^(n)}_{n=1}^∞ to be any sequences of elements in G_{k,m} whatsoever, deterministic or random.
It is useful to consider the projections P_{A^(n)} and P_{B^(n)} as m × m symmetric, idempotent matrices (i.e., keep the ambient coordinate system R^m for the projection's range) and, for each n = 1, 2, ..., to define the scaled matrices X̃^(n) and Ỹ^(n) as in Section 2, now viewed as elements of R^{m×n}. (There is no difference for our results and for the Procrustean fitting-error if, as in Section 2, we instead treated P_{A^(n)} and P_{B^(n)} as functions R^m → R^k with the coordinate systems of A^(n) and B^(n), respectively, in which case we have X̃^(n) and Ỹ^(n) in R^{k×n} instead of R^{m×n}.) For any matrix C ∈ R^{m×m} with only real-valued eigenvalues (e.g., symmetric matrices), let λ_1(C) ≥ λ_2(C) ≥ ··· ≥ λ_m(C) denote its eigenvalues, and let σ_1(C) ≥ σ_2(C) ≥ ··· ≥ σ_m(C) denote its singular values. Recall that if C is symmetric and positive semidefinite (e.g., a covariance matrix) then λ_i(C) = σ_i(C) for all i = 1, 2, ..., m, and recall that, for any C ∈ R^{m×n}, D ∈ R^{n×m}, the nonzero eigenvalues of CD are the same as the nonzero eigenvalues of DC. For any A, B ∈ G_{k,m} (with projection matrices P_A, P_B) and all i = 1, 2, ..., m, we thus have σ²_i(P_A P_B) = λ_i(P_A P_B P_B^T P_A^T) = λ_i(P_A P_B P_B P_A) = λ_i(P_A P_A P_B P_B) = λ_i(P_A P_B). In fact, P_A P_B has at most k positive eigenvalues and at most k positive singular values (the rest of the eigenvalues and the rest of the singular values are all zero) and, for all i = 1, 2, ..., k, σ_i(P_A P_B) = cos θ_i(A, B), where {θ_i(A, B)}_{i=1}^k are the principal angles between A and B.
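These identities are easy to confirm numerically. The sketch below (with arbitrary illustrative subspaces) checks that the top k singular values of P_A P_B are the principal-angle cosines, that its top k eigenvalues are the squared cosines, and that the rest vanish.

```python
import numpy as np

rng = np.random.default_rng(4)
m, k = 7, 3
QA, _ = np.linalg.qr(rng.standard_normal((m, k)))
QB, _ = np.linalg.qr(rng.standard_normal((m, k)))
PA, PB = QA @ QA.T, QB @ QB.T          # m x m symmetric, idempotent projectors

# cosines of the principal angles between span(QA) and span(QB)
cosines = np.linalg.svd(QA.T @ QB, compute_uv=False)

# singular values of PA @ PB: the top k are the cosines, the rest are zero
svals = np.linalg.svd(PA @ PB, compute_uv=False)

# eigenvalues of PA @ PB: the top k are cos^2(theta_i), the rest are zero
evals = np.sort(np.real(np.linalg.eigvals(PA @ PB)))[::-1]
```

This matches the chain σ²_i(P_A P_B) = λ_i(P_A P_B P_B P_A) = λ_i(P_A P_B) displayed in the text.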
We also define, for each n = 1, 2, ..., the quantity ð²(A^(n), B^(n)), a modification of d²(A^(n), B^(n)) in which the principal-angle terms are weighted according to the covariance structure of X and Y. Later, in Proposition 6, we will prove that it always holds that 0 ≤ ð²(A^(n), B^(n)) ≤ 2k. For this reason, we like to view ð²(A^(n), B^(n)) as a weighted form of the square Hausdorff distance.

The result
Within the setting of Section 3.1, we now state and prove the main result of Section 3. Theorem 2. In the setting of Section 3.1, it holds almost surely that ε²(X̃^(n), Ỹ^(n)) − [ρ · ð²(A^(n), B^(n)) + (1 − ρ) · 2k] → 0 as n → ∞, where ρ is defined as ρ := (Σ_{j=1}^k σ_j(Cov(X, Y))) / √((Σ_{j=1}^k λ_j(Cov(X))) · (Σ_{j=1}^k λ_j(Cov(Y)))).
In Proposition 7 we prove that 0 ≤ ρ ≤ 1. To prove Theorem 2, we first establish two auxiliary results, Lemmas 3 and 4. Proof of Lemma 3: For each n = 1, 2, 3, ..., let us consider a singular value decomposition X^(n)H_n = U^(n) Λ^(n) (V^(n))^T, where U^(n) ∈ R^{m×m} is orthogonal, Λ^(n) ∈ R^{m×n} is a "diagonal" matrix with nonnegative diagonal entries non-increasing along its main diagonal, and V^(n) ∈ R^{n×n} is orthogonal. Since A^(n) is spanned by the first k columns of U^(n), we may write P_{A^(n)} = U^(n) E (U^(n))^T, where E ∈ R^{m×m} is the diagonal matrix with its first k diagonal entries 1 and its remaining diagonal entries 0.
Thus, the matrix (1/n) X^(n)H_n (X^(n)H_n)^T and the matrix (1/n) P_{A^(n)} X^(n)H_n (X^(n)H_n)^T P_{A^(n)} share their k largest eigenvalues, and the remaining m − k eigenvalues of the latter matrix are 0.
By the Strong Law of Large Numbers, (1/n) X^(n)H_n (X^(n)H_n)^T → Cov(X) almost surely as n → ∞. Lastly, recall that we explicitly allow {A^(n)}_{n=1}^∞ to be any elements of G_{k,m} in the special case that Cov(X) = α · I_m for some α > 0; indeed, in this special case, note, by the boundedness of {P_{A^(n)}}_{n=1}^∞ and the Strong Law of Large Numbers, that the same convergence holds as n → ∞, as desired.
We are now able to prove the main result of this section, Theorem 2.
Proof of Theorem 2: Let δ be as defined in Lemma 4. Note that, for any nonnegative, bounded sequences {a^(n)} and {b^(n)} with a^(n) − b^(n) → 0, we also have √(a^(n)) − √(b^(n)) → 0 as n → ∞; indeed, this holds because a^(n) and b^(n) are bounded. Thus, by Lemma 4, and noting that the rank of Ỹ^(n)(X̃^(n))^T is at most k, we obtain (8). But the expression in (8) can be simplified, by (5) and (3), to yield the statement of Theorem 2.

There is a special case of Theorem 2 that deserves attention. Theorem 5. In the setting of Section 3.1, if Cov(X) = Cov(Y) and Cov(X, Y) = βI_m for a real number β, then it holds almost surely that ε²(X̃^(n), Ỹ^(n)) − [(|β|/α) · d²(A^(n), B^(n)) + (1 − |β|/α) · 2k] → 0 as n → ∞, where α := (1/k) Σ_{j=1}^k λ_j(Cov(X)). Theorem 5 is an immediate consequence of Theorem 2, since we previously pointed out that when Cov(X, Y) is a scalar multiple of the identity then ð²(A^(n), B^(n)) = d²(A^(n), B^(n)). Finally, Theorem 1 from Section 2 is an immediate consequence of Theorem 5, after noting that the setting of Section 2 is a special case of the setting of Section 3.1, with (recall the definitions of α and γ from Section 2) Cov(X) = Cov(Y) = α · I_m and Cov(X, Y) = γ²α · I_m. So the |β| and α of Theorem 5 are, respectively, γ²·α and α; thus in Theorem 5 we have |β|/α = γ²·α/α = γ². This proves Theorem 1.

Bounds for ð² and ρ
Proof of Proposition 6: The upper bound is trivial. To prove the lower bound, first we re-express (3) as in (9), and we show that each summand in the summation of (9) is nonnegative. Indeed, for any S ∈ R^{m×m} and i = 1, 2, ..., m, we have that σ_i(S P_{A^(n)}) ≤ σ_i(S) and σ_i(P_{A^(n)} S) ≤ σ_i(S); this is seen as follows. Say P_{A^(n)} = QEQ^T is such that Q ∈ R^{m×m} is orthogonal and E is diagonal with 1's and 0's on its diagonal. Then σ²_i(S P_{A^(n)}) = λ_i(P_{A^(n)} S^T S P_{A^(n)}) = λ_i(E Q^T S^T S Q E) ≤ λ_i(Q^T S^T S Q) = σ²_i(S), the inequality holding by the Interlacing Theorem for Hermitian matrices. By a similar argument σ_i(P_{A^(n)} S) ≤ σ_i(S), and applying these in succession yields that σ_i(P_{A^(n)} S P_{B^(n)}) ≤ σ_i(S).

Proposition 7. For ρ, as defined in Theorem 2, it holds that 0 ≤ ρ ≤ 1.

Proof of Proposition 7: Let Cov(X, Y) = UΛV^T be a singular value decomposition; i.e. U, V ∈ R^{m×m} are orthogonal and Λ ∈ R^{m×m} is diagonal, with nonincreasing nonnegative diagonal entries. Define M ∈ R^{2m×2m} by

M := [ U  0_m ; 0_m  V ]^T · Cov([X; Y]) · [ U  0_m ; 0_m  V ],

where 0_m ∈ R^{m×m} is the matrix of zeros. A covariance matrix is positive semidefinite, thus M is positive semidefinite, as are all of its principal submatrices. For each j = 1, 2, ..., k, the two-by-two submatrix consisting of the j-th and (j + m)-th rows and columns of M has nonnegative diagonals and a nonnegative determinant, thus

(Λ_{jj})² ≤ (U^T Cov(X) U)_{jj} · (V^T Cov(Y) V)_{jj}.    (10)

Now, taking square roots in (10), summing over j = 1, 2, ..., k, and applying the Cauchy-Schwarz inequality to the resulting right-hand side, we obtain

Σ_{j=1}^k Λ_{jj} ≤ √( (Σ_{j=1}^k (U^T Cov(X) U)_{jj}) · (Σ_{j=1}^k (V^T Cov(Y) V)_{jj}) ).    (11)

For any Hermitian matrix, the vector of its eigenvalues always majorizes the vector of its diagonal entries, so that

Σ_{j=1}^k (U^T Cov(X) U)_{jj} ≤ Σ_{j=1}^k λ_j(Cov(X)),    (12)

and Proposition 7 follows from (11), (12), and (12) applied to Cov(Y) and V.

An isometry-corrective property of ð 2
Suppose that W ∈ R^{m×m} is an orthogonal matrix such that Cov(X, Y) = βW, where β ∈ R is nonzero; this might arise in situations similar to the Cautionary Tale of Two Scientists in Section 2 (wherein two scientists are taking measurements of the same random process), except that the second scientist permutes the order of the features (i.e., W is a permutation matrix). Define W B^(n) := {Wx : x ∈ B^(n)}. In this situation, the quantity d²(A^(n), W B^(n)) may be more interesting than the quantity d²(A^(n), B^(n)), since A^(n) might be viewed as being more comparable to W B^(n) than to B^(n). Indeed, if the eigenvalues of Cov(X) are distinct and n is large and W is not I_m, then d²(A^(n), W B^(n)) would be small, in contrast to d²(A^(n), B^(n)).
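The following numpy sketch illustrates the point of the previous paragraph with d² alone (the definition of ð² is not reproduced here). The covariance spectrum, sample size, and the choice of a reversal permutation are illustrative assumptions: when the second scientist records the same features in permuted order, d²(A^(n), W B^(n)) is essentially zero while d²(A^(n), B^(n)) is large.

```python
import numpy as np

def pca_basis(X, k):
    """Orthonormal basis of the top-k PCA subspace of the data X (m x n)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k]

def d2(QA, QB):
    c = np.clip(np.linalg.svd(QA.T @ QB, compute_uv=False), 0.0, 1.0)
    return 2.0 * np.sum(1.0 - c ** 2)

rng = np.random.default_rng(5)
m, k, n = 6, 2, 5000
# distinct covariance eigenvalues, so each PCA subspace is stable
scales = np.sqrt(np.array([6.0, 5.0, 4.0, 3.0, 2.0, 1.0]))
X = scales[:, None] * rng.standard_normal((m, n))

W = np.eye(m)[::-1]      # permutation matrix reversing the feature order
Y = W.T @ X              # the second scientist records the features in reverse order

QA = pca_basis(X, k)
QB = pca_basis(Y, k)

# d^2(A, B) is large, but d^2(A, W B) is essentially zero
d2_AB, d2_AWB = d2(QA, QB), d2(QA, W @ QB)
```

Here B^(n) is (up to numerical error) the W-permuted copy of A^(n), so mapping it back through W recovers A^(n) almost exactly.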
Proposition 8. In the case of the previous paragraph, ð²(A^(n), B^(n)) agrees with the corresponding weighted square distance between A^(n) and W B^(n); that is, ð² automatically incorporates the correction by W. Proposition 8 will be illustrated in Section 4.3.
Proof of Proposition 8: Here we have P_{W B^(n)} = W P_{B^(n)} W^T and, by (2), (3), (13), and (14), the claimed identity follows.

Simulations and Real Data
In this section, simulations and real data illustrate and support the theorems that we stated and proved in the previous sections, and we then use these simulations and real data to illustrate how the "incommensurability phenomenon" can arise as a consequence. What is meant by this phenomenon is the occurrence of an inordinately large Procrustean fitting-error between projected data that were originally highly correlated. (This phenomenon was named in Priebe et al. [13].)

A first illustration
Our first illustration of Theorem 2 and Theorem 5 is with X and Y distributed multivariate normal (with mean vector consisting of all zeros) such that Cov(X) = Cov(Y) = I_6 and Cov(X, Y) = β · I_6 for assorted values of β. Note that ρ as defined in Theorem 2 is β here, note that α and β as defined in Theorem 5 are, respectively, 1 and β here, and note that ð²(A^(n), B^(n)) = d²(A^(n), B^(n)) here because Cov(X, Y) is a scalar multiple of the identity. Also, this example may be seen as an illustration of Theorem 1 (in the Tale of Two Scientists) with the γ² there being the β here.
The dimension of the space containing X and Y is m = 6, and we will project to spaces of dimension k = 2.
For each of β = 0, .1, .2, .3, .4, .5, .6, .7, .8, .9, .99, and for each of n = 1000 and n = 10000, we obtained 1000 realizations of X^(n) and Y^(n) and used PCA to obtain A^(n) and B^(n). In Figure 1, we plotted the values of ε²(X̃^(n), Ỹ^(n)) against the respective values of d²(A^(n), B^(n)), in the colors blue, green, red, cyan, magenta, blue, green, red, cyan, magenta, blue for the respective values of β = 0, .1, .2, .3, .4, .5, .6, .7, .8, .9, .99. For reference, we also included in Figure 1 lines with y-intercept (1 − β) · 2k and slope β, for each of the above-specified values of β; basically, Theorem 1, Theorem 2, and Theorem 5 state that the scatter plots will adhere to these respective lines in the limit as n goes to ∞. Indeed, notice in Figure 1 that the scatter plots adhere very closely to their respective lines, and such adherence substantially improves as n = 1000 is raised to n = 10000, which supports/illustrates the claims of Theorem 1, Theorem 2, and Theorem 5.
The above was done using PCA to generate A^(n) and B^(n). What if we instead took A^(n) and B^(n) to (each) be the span of the first two standard-basis vectors in R^6? We will call this the "trivial" choice of A^(n) and B^(n). Of course, the value of d²(A^(n), B^(n)) would then always be identically zero, and note that Theorem 1, Theorem 2, and Theorem 5 still apply with this choice of A^(n) and B^(n) because Cov(X) and Cov(Y) are scalar multiples of the identity. Thus, if these experiments were performed instead with the trivial choice of A^(n) and B^(n), the scatter plots would land at the far left of Figure 1 (along the y-axis at d²(A^(n), B^(n)) = 0), clustered about their respective lines.

Figure 1: For each of β = 0 (blue), .1 (green), .2 (red), .3 (cyan), .4 (magenta), .5 (blue), .6 (green), .7 (red), .8 (cyan), .9 (magenta), .99 (blue), and for each of n = 1000 (left) and n = 10000 (right), there were 1000 Monte Carlo replicates using k = 2. Note that the axis values are to be multiplied by 2k, which is 4 here, since the ranges of ε² and d² are [0, 2k].

Indeed, we then performed the above experiments for the trivial choice of A^(n) and B^(n), and we computed the sample mean and sample standard deviation of ε²(X̃^(n), Ỹ^(n)) over the 1000 Monte Carlo replicates when n = 10000, both for PCA-generated A^(n) and B^(n) and for the trivial choice. Indeed, besides the notable exception when β = 0 (where there is no correlation anyway between X and Y), the values of ε²(X̃^(n), Ỹ^(n)) were substantially larger when PCA was used to generate A^(n) and B^(n) than for the trivial choice of A^(n) and B^(n). This is the incommensurability phenomenon, a situation where the use of PCA has the consequence of an inordinately large Procrustean fitting-error.
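A single-replicate sketch of this illustration (one realization with β = 0.5, n = 10000, m = 6, k = 2, rather than 1000 Monte Carlo replicates) can be written in numpy as follows; the √k-normalization of the reported matrices is as described earlier. The realized point (d², ε²) should land near the reference line with y-intercept (1 − β)·2k and slope β.

```python
import numpy as np

def pca_basis(M, k):
    """Orthonormal basis of the k-dimensional PCA subspace of the centered M (m x n)."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k]

def scale_to_sqrt_k(M, k):
    return np.sqrt(k) * M / np.linalg.norm(M)

def eps2(Xt, Yt):
    s = np.linalg.svd(Yt @ Xt.T, compute_uv=False)
    return np.linalg.norm(Xt) ** 2 + np.linalg.norm(Yt) ** 2 - 2.0 * s.sum()

def d2(QA, QB):
    c = np.clip(np.linalg.svd(QA.T @ QB, compute_uv=False), 0.0, 1.0)
    return 2.0 * np.sum(1.0 - c ** 2)

rng = np.random.default_rng(6)
m, k, n, beta = 6, 2, 10_000, 0.5

# jointly normal X, Y with Cov(X) = Cov(Y) = I_6 and Cov(X, Y) = beta * I_6
Z1 = rng.standard_normal((m, n))
Z2 = rng.standard_normal((m, n))
X, Y = Z1, beta * Z1 + np.sqrt(1.0 - beta ** 2) * Z2

XH = X - X.mean(axis=1, keepdims=True)   # X @ H_n, without forming H_n explicitly
YH = Y - Y.mean(axis=1, keepdims=True)
QA, QB = pca_basis(XH, k), pca_basis(YH, k)

Xt = scale_to_sqrt_k(QA.T @ XH, k)       # the scientists' scaled reports
Yt = scale_to_sqrt_k(QB.T @ YH, k)

lhs = eps2(Xt, Yt)
rhs = beta * d2(QA, QB) + (1.0 - beta) * 2 * k   # reference line of Theorem 5
```

Repeating this over many realizations reproduces a scatter plot along the reference line, in the manner of Figure 1.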
Let us call the differences between the values ε²(X̃^(n), Ỹ^(n)) and the values predicted by the respective limiting lines the residuals. It is noteworthy that, in the above experiments, the sample standard deviation of the residuals when PCA was used to generate A^(n) and B^(n) is very close to the sample standard deviation of ε²(X̃^(n), Ỹ^(n)) for the trivial choice of A^(n) and B^(n). Specifically, comparing these two computed standard deviations, the variation in ε²(X̃^(n), Ỹ^(n)) about the limiting line when PCA generates A^(n) and B^(n) is approximately the same as the variation in ε²(X̃^(n), Ỹ^(n)) for the trivial choice of A^(n) and B^(n) (in which d²(A^(n), B^(n)) = 0 identically); as such, d²(A^(n), B^(n)) explains all of the rest of the variation here in ε²(X̃^(n), Ỹ^(n)) when PCA is used.

A second illustration
Our next illustration of Theorem 2 and Theorem 5 is with X and Y multivariate normal (with mean vector of all zeros) such that Cov(X) = Cov(Y) is the diagonal matrix in R^{20×20} with .7 in every diagonal entry except for the first, which has the value 1, and such that Cov(X, Y) = .6 · I_20.
So we are using m = 20 here. As above, ð²(A^(n), B^(n)) = d²(A^(n), B^(n)) here because Cov(X, Y) is a scalar multiple of the identity.
In the experiments for the left figure in Figure 2, we computed the sample mean and sample standard deviation of the normalized Procrustean fitting-error for each of k = 1, 2, 10, and recorded them in a table. As k increases, the correlation ρ increases, so it would seem at first thought that the normalized Procrustean fitting-error should decrease. Indeed, the leftmost green points in (the left figure of) Figure 2 are below the leftmost red points, which are below the leftmost blue points.
However, overall, the normalized Procrustean fitting-error is seen in the table above to be much higher in the case of k = 2 than in the case of k = 1. This is explained by noting a substantial gap between the first eigenvalue of Cov(X) and the second eigenvalue of Cov(X) (1 vs .7), whereas there is no gap between the second eigenvalue of Cov(X) and the third eigenvalue of Cov(X) (both are .7). Thus when k = 1 the PCA projection has little variance, whereas when k = 2 the PCA projection has much variance, often causing a much larger Hausdorff distance between A^(n) and B^(n), which results in a larger Procrustean fitting-error by Theorem 2. As such, the case of k = 2 is an example of the incommensurability phenomenon of inordinately large Procrustean fitting-error. But then observe that when k = 10 we find that the normalized Procrustean fitting-error is competitive with the k = 1 case; even though the tenth and eleventh eigenvalues of Cov(X) are the same, nonetheless the correlation ρ has increased, and the variance of the PCA projection has decreased enough to improve the normalized Procrustean fitting-error to be competitive with the case of k = 1.
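The role of the eigengap can be demonstrated directly on the subspaces themselves. In the numpy sketch below (sample size, correlation level, number of replicates, and the two spectra are illustrative assumptions), two highly correlated samples are separately projected by PCA; with a large gap above the projection dimension the subspaces nearly coincide, while with a flat spectrum the average square Hausdorff distance is far larger.

```python
import numpy as np

def pca_basis(X, k):
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k]

def d2(QA, QB):
    c = np.clip(np.linalg.svd(QA.T @ QB, compute_uv=False), 0.0, 1.0)
    return 2.0 * np.sum(1.0 - c ** 2)

def mean_d2(diag_cov, k, n=500, reps=50, beta=0.9, seed=7):
    """Average d^2(A, B) over replicates, when Cov(X) = Cov(Y) = diag(diag_cov),
    the coordinates of X and Y have correlation beta, and PCA with dimension k
    is applied to each sample separately."""
    rng = np.random.default_rng(seed)
    scale = np.sqrt(np.asarray(diag_cov))[:, None]
    total = 0.0
    for _ in range(reps):
        Z1 = rng.standard_normal((len(diag_cov), n))
        Z2 = rng.standard_normal((len(diag_cov), n))
        X = scale * Z1
        Y = scale * (beta * Z1 + np.sqrt(1.0 - beta ** 2) * Z2)
        total += d2(pca_basis(X, k), pca_basis(Y, k))
    return total / reps

gap    = mean_d2([10.0] + [1.0] * 19, k=1)   # big gap after the first eigenvalue
no_gap = mean_d2([1.0] * 20,          k=1)   # flat spectrum, no gap at all
```

With the gapped spectrum both PCA directions lock onto the dominant eigenvector, whereas with the flat spectrum the directions are driven by sampling noise, despite the strong correlation between the samples.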
Not only may the incommensurability phenomenon occur when there is no spectral gap in the covariance structure at the projection dimension, but the incommensurability phenomenon may also occur when this spectral gap is positive but small. Indeed, we repeated the experiments performed for the left figure in Figure 2, changing only the second diagonal entry of Cov(X) = Cov(Y) from .7 to λ for each of λ = .71, .72, .73, .74, .75; we got a very similar-looking scatter plot to the left figure in Figure 2, with similar sample means and sample standard deviations.

A third illustration
Now consider X and Y exactly as in the illustration at the beginning of Section 4.2, with the only exception that the coordinates of Y have been permuted into reverse order. Performing the very same experiments from the beginning of Section 4.2, the scatter plots of ε²(X̃^(n), Ỹ^(n)) vs d²(A^(n), B^(n)) will not look like the scatter plots in Figure 2. However, since the permutation transformation is an isometry, we then have, by Proposition 8 in Section 3.4, that the scatter plots of ε²(X̃^(n), Ỹ^(n)) vs ð²(A^(n), B^(n)) will indeed look like the scatter plots in Figure 2. The use of ð² automatically accounts for isometric transformations of X and/or Y from a common frame, in the manner of this example.
It should also be mentioned that, for the illustration of this section (with the covariance matrix above), if A^(n) and B^(n) were not generated with PCA, but instead A^(n) and B^(n) were selected to be (the same as each other by setting them to be) the span of any number of standard-basis vectors, then the Procrustes fitting-error would be disastrously large. The fact that such a naive choice of A^(n) and B^(n) was successful in the illustration in Section 4.1 was just a byproduct of the good fortune that X and Y did not have permuted coordinates or any other isometric transformation applied to them.

The incommensurability phenomenon in real data
We next illustrate the incommensurability phenomenon using real data from the 2014 Science article of Vogelstein et al. [23], titled "Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning." See O'Leary and Marder [12] for a big-picture overview and discussion of the contributions of this article. This data from Vogelstein et al. will be observed by Two Scientists who will record highly correlated observations. We will show the incommensurability phenomenon creeping into the Two Scientists' efforts. (The data related to this section is available online at http://www.cis.jhu.edu/~parky/Incomm/.) In the Vogelstein et al. paper [23], the authors consider a collection of optogenetically manipulated Drosophila larvae, with the goal of generating a behavioral reference atlas. The animals considered are partitioned into lines, with each line defined by the neuron classes which are being optogenetically manipulated. Each line includes multiple replicates (dishes), in each of which numerous animals are found. Videos of animal behavior are processed into a multivariate behavioral time series for each animal, these time series give rise to an animal dissimilarity matrix, and multidimensional scaling applied to this dissimilarity matrix yields a representation of the collection of animals in high-dimensional Euclidean space. For our purposes, we will focus our attention on the sixteen most significant dimensions; in this manner, every animal corresponds to a vector in R^16.
In our experiment here, for each of i = 1, 2, 3, ..., 242, the first scientist will record X^(i) ∈ R^16 and the second scientist will record Y^(i) ∈ R^16, as follows. There were a total of n = 242 dishes in the Vogelstein et al. data set corresponding to the control line pBDPU-ChR2. For each i = 1, 2, 3, ..., 242, we select the two most correlated animals in the i-th dish; the first scientist picks, equiprobably, one of these two animals and sets X^(i) ∈ R^16 to be this animal's associated vector, and the other scientist is left with the other animal and sets Y^(i) ∈ R^16 to be that animal's associated vector. The first scientist's observations are stored in the matrix X^(242) := [X^(1) | X^(2) | ··· | X^(242)] ∈ R^{16×242}, and the second scientist's observations are stored in the matrix Y^(242) := [Y^(1) | Y^(2) | ··· | Y^(242)] ∈ R^{16×242}. (When we replicate our experiment, the identities of the two most correlated animals in each dish don't change from one replication to the next, but the assignments of the two animals to the scientists are independent Bernoulli(1/2) trials.) For each embedding dimension k = 1, 2, ..., 16, we use PCA to generate A^(242) and B^(242), and then we compute d²(A^(242), B^(242)) and ε²(X̃^(242), Ỹ^(242)) in the manner described in Section 3.1.
Performing 10000 Monte Carlo replications of this experiment, we plot in Figure 3 the resulting values of d²(A^(242), B^(242)) and ε²(X̃^(242), Ỹ^(242)) for each embedding dimension k. Although the values of ε²(X̃^(242), Ỹ^(242)) do seem to increase as k increases, it also seems that these increases in ε²(X̃^(242), Ỹ^(242)) are explained by the increased values of d²(A^(242), B^(242)), as these increased values of d²(A^(242), B^(242)) occur. This is the incommensurability phenomenon. Although it is not as dramatic as with the simulated data, it is present in this real-data setting.

Summary and discussion
When principal components analysis (PCA) is used for the dimension reduction of two random data sets that are highly correlated with each other, there is a natural hope that the projected (and normalized) data sets will be commensurate, in the sense that a Procrustes transformation of one to the other will render it close in distance (according to the strength of correlation in the original data). However, sometimes this Procrustean fitting-error is higher than what might be expected; this is the "incommensurability phenomenon." It may occur when the projections are done separately for the two data sets and there is an insufficient gap between covariance eigenvalues at the cutoff where the more-principal components are kept and the less-principal components are discarded, which can lead to nontrivial variance in the resulting PCA projectors. (Indeed, the Cautionary Tale of Two Scientists from Section 2, with its spherical covariance structure, creates a perfect storm.) Our main result is Theorem 2, which succinctly quantifies the asymptotic effect of (an adaptation of) the Hausdorff distance between the PCA projections, in terms of the strength of the correlation between the original data sets, on the Procrustean fitting-error of the projected data.
We then illustrate that highly-correlated data, even with a mild gap in covariance eigenvalues, can appreciably exhibit the incommensurability phenomenon; indeed, what we observe from the simulations is very closely aligned with the asymptotic relationship that we proved.
Awareness of these results is important when decisions are made regarding dimension reduction for separate data sets assumed to represent similar phenomena.For example, in distributed settings it may be assumed that highly correlated large data sets can be merged after dimension reduction, thereby allowing for more computationally efficient data transfer.However, our results indicate that this approach can be disastrous, even when the assumption that the separate data sets are highly correlated is valid.

Figure 2: with k = 1 in blue, k = 2 in red, and k = 10 in green. As before, lines are drawn on the figure to indicate the limiting relationship between ε²(X̃^(n), Ỹ^(n)) and d²(A^(n), B^(n)) that is predicted by Theorem 2 and Theorem 5; indeed, the scatter plots adhere very closely to these respective lines.