Inference on covariance operators via concentration inequalities: k-sample tests, classification, and clustering via Rademacher complexities

We propose a novel approach to the analysis of covariance operators making use of concentration inequalities. First, non-asymptotic confidence sets are constructed for such operators. Then, subsequent applications including a k sample test for equality of covariance, a functional data classifier, and an expectation-maximization style clustering algorithm are derived and tested on both simulated and phoneme data.


Introduction
Functional data spans many realms of applications from medical imaging, (Jiang et al., 2014), to speech and linguistics, (Pigoli et al., 2014), to genetics, (Panaretos et al., 2010). General inference techniques for functional data are one area of analysis that has received much attention in recent years from the construction of confidence sets, to other topics such as k-sample tests, classification, and clustering of functional data. Most testing methodology treats the data as continuous L 2 valued functions and subsequently reduces the problem to a finite dimensional one through expansion in some orthogonal basis such as the often utilized Karhunen-Loève expansion (Horváth and Kokoszka, 2012). However, inference in the more general Banach space setting, making use of a variety of norms, has not been addressed adequately. We propose a novel methodology for performing fully functional inference through the application of concentration inequalities; for general concentration of measure results, see Ledoux (2001) and Boucheron et al. (2013). Special emphasis is given to inference on covariance operators, which offers a fruitful way to analyze functional data.
Imagine multiple samples of speech data collected from multiple speakers. Each speaker will have his or her own sample covariance operator taking into account the unique variations of his or her speech and language. An exploratory researcher may want to find natural clusters amidst the speakers perhaps corresponding to gender, language, or regional accent. Meanwhile, a linguist studying the similarities between languages may want to test for the equality of such covariances. A computer scientist may need to implement an algorithm that when given speech data quickly identifies what language is being spoken and furthermore parses the sound clip and identifies each individual phoneme in order to process the speech into text. Our proposed method has the versatility to yield statistical tests that answer these questions as well as others.
Past methods for analyzing covariance operators (Panaretos et al., 2010;Fremdt et al., 2013) rely on the Hilbert-Schmidt setting for their inference. However, the recent work of Pigoli et al. (2014) argues that the use of the Hilbert-Schmidt metric ignores the geometry of the covariance operators and that more statistical power can be gained by using alternative metrics. Hence, we approach such inference for covariance operators in the full generality of Banach spaces, by using a non-asymptotic concentration of measure approach, which has previously been used in nonparametric statistics and machine learning, sometimes under the name of 'Rademacher complexities' (Koltchinskii, 2001(Koltchinskii, , 2006Bartlett et al., 2002;Bartlett and Mendelson, 2003;Giné and Nickl, 2010;Arlot et al., 2010;Lounici and Nickl, 2011;Kerkyacharian et al., 2012;Fan, 2011). These concentration inequalities provide a natural way to construct non-asymptotic confidence regions. p p = tr((Σ * Σ) p/2 ). For p = ∞, the Schatten norm is the operator norm: Σ ∞ = sup f ∈H1 ( Σf H2 / f H1 ). In the case that Σ is compact, self-adjoint, and trace-class, then given the associated eigenvalues {λ i } ∞ i=1 , the p-Schatten norm coincides with the p norm of the eigenvalues: In order to construct a covariance operator from a sample of functional data, the notion of tensor product is required. Let f, g ∈ L 2 (I) and φ in the dual space L 2 (I) * with inner product f, φ = φ(f ). The tensor product, f ⊗ g, is the rank one operator defined by (f ⊗ g)φ = g, φ f = φ(g)f .
Secondly, we will implement a Rademacher symmetrization technique in the concentration inequalities to come. This requires the use of the namesake Rademacher random variables.
One particularly fruitful avenue of functional data analysis is the analysis of covariance operators. Such an approach to functional data has been discussed by Panaretos et al. (2010) for DNA microcircles, by Fremdt et al. (2013) for the egg laying practices of fruit flies, and by Pigoli et al. (2014) with application to differentiating spoken languages. Definition 2.3 (Covariance Operator) Let I ⊆ R, and let f be a random function (variable) in L 2 (I) with E f 2 L 2 < ∞ and mean zero. The associated covariance operator If I = {i 1 , . . . , i m } has finite cardinality, then f = (f 1 , . . . , f m ) is a random vector in R m and for some fixed vector Covariance operators are integral operators with the kernel function c f (s, t) = cov{f (s), f (t)} ∈ L 2 (I × I). Such operators are characterized by the result that for f ∈ L 2 (I), Σ f is a covariance operator if and only if it is trace-class, self-adjoint, and compact on L 2 (I) where the symmetry follows immediately from the definition and the finite trace norm comes from Parseval's equality (Bosq, 2012, Theorem 1.7) (Horváth and Kokoszka, 2012, Section 2.3).
Furthermore, working under the assumption that E f 4 L 2 < ∞, we will require tensor powers of covariance Specifically for covariance operators, the tensor power takes on a similar integral operator form with kernel c Σ (s, t, u, v Given an Hilbert space H with inner product ·, · , the adjoint of a bounded linear operator Σ : H → H, denoted as Σ * , is the unique operator such that Σf, g = f, Σ * g for f, g ∈ H. The existence of which is given by the Riesz representation theorem (Rudin, 1987, chapter 2). For self-adjoint operators, such as the covariance operators of interest, Σ = Σ * .
We begin with a sample of functional data. Let f 1 , . . . , f n ∈ L 2 (I) be independent and identically distributed observations with mean zero and covariance operator Σ. Let the sample mean bef = n −1 n i=1 f i and the empirical The initial goal is to construct a confidence set for Σ with respect to some metric d(·, ·) of the form {Σ : d(Σ, Σ) ≤ r(n,Σ, α)}, which has coverage 1 − α for any desired α ∈ [0, 1] and a radius r depending only on the data and α. Such a confidence set can be utilized for a wide variety of statistical analyses.
3 Constructing a confidence set

Confidence sets for the mean in Banach spaces
The goal of this section is to construct a non-asymptotic confidence region in the Banach space setting. These can subsequently be specialized to our case of interest, covariance operators, when the X i below are replaced with f ⊗2 i . The construction of our confidence set is based on Talagrand's concentration inequality (Talagrand, 1996) with explicit constants. This inequality is typically stated for empirical processes (Giné and Nickl, 2016, Theorem 3.3.9 and 3.3.10 ), but applies to random variables with values in a separable Banach space (B, · B ) as well by simple duality arguments (Giné and Nickl, 2016, Example 2.1.6). Let X 1 , . . . , X n ∈ (B, · B ) be mean zero independent and identically distributed Banach space valued random variables with X i B ≤ U for all i = 1, . . . , n where U is some positive constant. Furthermore, let ·, · : B × B * → R such that for X ∈ B and φ ∈ B * then X, φ = φ(X). Define where the supremum is taken over a countably dense subset of the unit ball of B * . Furthermore, define v = 2U E (Z) + nσ 2 .
The above tail bound incorporates the unknown E( X − E X B ). Consequently, a symmetrization technique is used. This term is replaced by the norm of the Rademacher average R n = n −1 n i=1 ε i (X i −X) where the ε i are independent and identically distributed Rademacher random variables also independent of the X i . This substitution is justified by invoking the symmetrization inequality (Giné and Nickl, 2016, Theorem 3.1.21), In practice, the coefficient of 2 is unnecessary as averaging the data already has a symmetrizing effect, and if starting from symmetric data, then X −X and ε(X −X) are equal in distribution. This result allows us to replace the original expectation with the expectation of the Rademacher average. Furthermore, Talagrand's inequality also applies to R n . Hence, the Rademacher average concentrates strongly about its expectation, which justifies dropping the expectation. In practice, one can use the intermediary E ε ( R n B ), which can be approximated for reasonable sized data sets via Monte Carlo simulations of the ε i . However, this is not strictly necessary, and for large data sets, a single random draw of ε i will suffice (Giné and Nickl, 2016, Section 3.4.2).
The resulting (1 − α)-confidence set is To make use of these results in practice, both the weak variance σ 2 must be estimated for the data and a reasonable choice of U must be made, and a main contribution of this present paper is to propose some theoretically motivated but practically useful non-asymptotic choices for these constants that work for the functional data applications we are investigating. This will be discussed in the setting of covariance operators in the following subsection.

Confidence sets for covariance operators
To construct a confidence set for covariance operators, let our functional data f i ∈ L 2 (I) and f ⊗2 , the Hilbert space of bounded linear operators mapping L 2 to L 2 , such that (f i ⊗ f i )φ = f, φ f for some φ ∈ L 2 . Hence, for some desired p-Schatten norm, · p , with p ∈ [1, ∞) and with conjugate q = p/(p − 1), we have for the supremum being taken over a countably dense subset of the unit ball of Π ∈ Op(L 2 ). The Rademacher average, is also in Op(L 2 ), because for any φ ∈ L 2 (I) and for some M ∈ R, is a bounded operator and |ε i | = 1. Furthermore, in the case that there exists a fixed c ∈ R with f i L 2 ≤ c for all i corresponding to a physical bound on the energy of It will be determined below that U ≥ σ in this case. In general, setting U = σ gives good experimental results when f i is Gaussian as will be discussed in later sections. This results in v n ≈ σ 2 /n. For any p ∈ [1, ∞) and α ∈ [0, 1/2], the proposed (1 − α)-confidence set inspired by Equation 3.1 for covariance operators is where σ depends on the distribution on the functional data. As a rule of thumb for the choice of σ 2 , as we will now show, is to note that σ 2 ≤ E(f ⊗4 ) − Σ ⊗2 p and to estimate this bound empirically byσ 2 . For example, when the f i are from a Gaussian processσ ≤ 2 1/2 Σ p as explained in detail in Section 3.4.
To calculate the weak variance σ 2 , define f ⊗n = f ⊗ . . . ⊗ f to be the n-fold tensor product of f with itself and extend the definition of ·, · : where the inequality stems from the fact that the supremum is being taken over a larger set. However, in the Hilbert space setting, the dual of the tensor product does coincide with the tensor product of the dual space, and thus the above inequality can be replaced with an equality if the Hilbert-Schmidt norm, 2-Schatten norm, is used.

The weak variance for p = ∞
Let E be a countable dense subset of the unit ball of L 2 (I). In the case p = ∞, we cannot use duality, but can still write Z and σ 2 as suprema over the countable set and achieve the same results as above.

The weak variance for Gaussian data
Similarly to the bounded case, we estimate E f ⊗4 − Σ ⊗2 p for Gaussian data. Consider f from a Gaussian process with mean zero and covariance Σ and define f s = f (s). Strictly speaking these variables are not norm bounded, but similar to concentration results for Gaussian random variables in R d (Giné and Nickl, 2016, Theorem 3.1.9), we found below that our methods still work well. The integral kernel can be written as (Isserlis, 1918) t,u and that the operator E f ⊗4 − Σ ⊗2 , which can be thought of as an Hilbert-Schmidt operator on the space Op(L 2 ), can be represented by the integral kernel u). These two terms are merely relabeled versions of Σ ⊗2 . Consequently, using the subadditivity of the norm, For example, for the Hilbert-Schmidt norm, Lemma 5.1 of Horváth and Kokoszka (2012) gives an explicit form of a covariance operator of Σ in terms of the eigenfunctions of Σ for Gaussian data in the Hilbert-Schmidt setting.
Hence, for any of the p-Schatten norms, Σ ⊗ Σ p = Σ 2 p . Note that in the above calculations, the weak variance depends on the unknown Σ. In practice, this can be replaced by the empirical estimateΣ.

k Sample Comparison
Testing for the equality of means among multiple sets of data is a common task in data analysis. In the functional setting, there has been recent work on performing such a test on covariance operators in order to test whether or not k sets of curves have similar variation. Panaretos et al. (2010) propose such a method for a two sample test on covariance operators given data from Gaussian processes. Similarly, Fremdt et al. (2013) propose a non-parametric two sample test on covariance operators. Both of these approaches make use of the Karhunen-Loève expansion and, hence, the underlying Hilbert space geometry. Pigoli et al. (2014) take a comparative look at a variety of metrics to rank their statistical power when used in a two sample permutation test.
Following from the results of Pigoli et al. (2014), our method uses the p-Schatten norms with the concentration inequality based confidence sets of the previous section to compare covariance operators. In the two sample setting, we are able to achieve similar statistical power to that of the permutation test after proper tuning of the coefficients in the inequalities. Furthermore, the analytic nature of the concentration approach leads to a significant reduction in computing time, which offers an even more significant savings for larger values of k.
From the confidence set constructed in the previous section, we can devise a test for comparing the empirical covariance operators generated from k samples of functional data. Let the k samples be f n k where for each sample i and all elements j = 1, . . . , n i , f (i) j has covariance Σ (i) . Our goal is to design a test for the following two hypotheses: To achieve this, a pooled estimate of the weak variance is computed as a weighted average of each sample's individual weak variance in similar style to that of a standard t-test. Let the total data size be N = n 1 + . . . + n k and σ 2 i be the weak variance for sample i, then the pooled variance is defined as σ 2 Given Gaussian data and the p-Schatten norm, for example, this reduces to σ 2 In practice, σ 2 pool is estimated from the data for the following confidence regions in order to have those regions only depend on the data.
Taking inspiration from the standard analysis of variance (Casella and Berger, 2002, Chapter 11), letΣ (i) be the empirical estimate of the covariance operator for the ith sample, and letΣ be the estimate of the covariance operator for the total data set. Making use of the confidence sets for covariance operators from Section 3.2 gives the rejection region which under the null hypothesis will have size no greater than the desired α.
The size of the test induced by this rejection region is significantly less than the target size α due to the use of multiple concentration inequalities. Hence, tuning the inequalities is required to yield a useful test. Many experiments were run on simulated data sets generated as samples from a Gaussian process with randomly generated covariance operators whose eigenvalues were chosen to decay at a variety of rates. In this setting, the coefficients of 1 − k −1/2 for the Rademacher term and (k + 2)/(k + 3) for the deviation term were determined experimentally to improve the size of the confidence region:

Classification of Operators
Classification of functional data has been an area of heavy research over the last two decades. James and Hastie (2001) extend linear discriminant analysis to functional data. Hall et al. (2001) and Glendinning and Herbert (2003)  One application of our method beyond classification of functional data is the classification of covariance operators. In the setting of speech analysis, consider multiple speakers and multiple samples of speech from each speaker. The speech samples can be combined into a single sample covariance operator for each speaker. Then, our method can be employed, for example, to classify the covariance operators by speaker gender or speaker language. Evidence that this is a fruitful approach can be found in the analysis of Pigoli et al. (2014) and Pigoli et al. (2015) where a variety of metrics are compared for their efficacy when performing inference on covariance operators. These articles detail the discrepancy between sample covariance operators produced by speakers of different romance languages.
Given k possible labels and n samples of labeled data (Y i , f i ) with label Y i ∈ {1, . . . , k} and observation f i ∈ L 2 (I), our goal is to determine the probability that a newly observed g ∈ L 2 (I) belongs to label Y = j. Given such a g, the Bayes classifier chooses the label y = arg max j pr(Y = j | g) where pr(Y = j | g) = pr(g | Y = j)pr (Y = j)/pr (g).
Beginning with a training set of n samples with n j samples of label j, the sample mean of each category is computed: with the goal of making a decision based on how much moref j differs from g thanf j differs from its expectation E f j . Similar techniques to those is Section 3 as used. Define the Rademacher sum, R j , and the empirical weak variance,σ 2 j , for label j to be, respectively, where ε i are independent and identically distributed Rademacher random variables. The tail bound for the above probability is then where U is an upper bound on f i L 2 . However, this can be approximated by the Gaussian tail exp −n j r 2 /2σ 2 j . In the simulations of Section 5.3, this approximation actually achieves a better correct classification rate on both Gaussian and t-distributed data. This specifically works on t-distributed data as the estimate in Equation 4.3 below is merely concerned with comparing the tail bounds rather than their specific values. Consequently, the tail for every category is underestimated in the t case, but the ratio remains valid for comparison purposes.
Assuming uniform priors on the labels, the estimate for the probability expression in the Bayes classifier is This can be extended to the case where an unlabeled observation is a collection of curves g 1 , . . . , g m by replacing f j − g L 2 in the above expression with Σ j −Σ g p whereΣ j is the sample covariance of the f i with label j andΣ g is the sample covariance of the g i . The Rademacher and weak variance terms would also be updated accordingly.

Clustering of Operator Mixtures
Closely related to the problem of classification is the problem of clustering. Given a sample of functional data, we want to assign one of a finite collection of labels to each curve. For example, in speech processing, one may want to cluster sound clips based the language of the speaker, or, to be discussed in Section 5.4, one may want to separate unlabeled phoneme curves into clusters.
There have been many recently proposed methods for clustering functional data. Many approaches begin by constructing a low dimensional representation of the data in some basis such as modelling the data with a B-spline basis followed by clustering the spline representations with k-means (Abraham et al., 2003). A similar approach makes use of the eigenfunctions of the covariance operator instead of B-splines (Peng and Müller, 2008). In contrast, we will attempt to cluster functions or operators directly via a concentration of measure approach similar to the previously described classification procedure.
Consider the same setting to the previous section of multiple observations from multiple categories. However, now the category labels are missing. This is a functional mixture model where each observed functional datum is a stochastic process with one of k possible covariance operators. In the below experiments, the data will be simulated from a Gaussian process. The goal is to correctly separate the data into k sets. To achieve this, an expectation-maximization style algorithm is implemented.
Let the observed operator data be S 1 , . . . , S n ∈ Op(L 2 ) where each S i = cov(f is a rank m i operator produced from m i functional observations. Let the latent label variables be Y 1 , . . . , Y n ∈ {1, . . . , k}. Assuming no prior knowledge on the proportions of data in each category, the algorithm is initialized with the Jeffreys prior for the Dirichlet distribution by randomly generating ρ (0) i,· ∼ Dirichlet (1/2, . . . , 1/2) , the initial probability vector that pr(Y i = * | f i ). Assuming t iterations of the algorithm have completed, we have a label probability vector ρ (t) i,· for each of the n observations. Given this collection of vectors, the expected proportions of each category can be estimated as τ Similarly, a weighted sum of the data,Σ (t+1) j , and a weighted Rademacher sum, , can be used to update the estimated covariance operators for each label j: Lastly, a pooled weak variance is required, which is used in place of each individual category weak variance. Otherwise, in practice, one single category captures all of the data points. By defining the pooled covariance operator asΣ , then the pooled weak variance in the Gaussian case, for example, is estimated by 2 Σ (t+1) pool p . As a result, the label probability vectors ρ (t) i,· can be updated given the t + 1st collection of estimated covariance operators, Rademacher sums, and the pooled covariance operator. From the previous section, Equation 4.3 can be used to determine ρ ), the probability that observation i belongs to the jth category. This process can be iterated until a local optimum is reached.

Simulated and phoneme data
To test each of the above three applications, experiments were first run on simulated data. These data sets were generated as observations from Gaussian or t-distributed processes. The covaraiance operators were randomly chosen given a specific decay rate for the eigenvalues.
Secondly, the phoneme data to be tested (Ferraty and Vieu, 2003;Hastie et al., 1995) is a collection of 400 logperiodograms for each of five different phonemes: /A/ as in the vowel of "dark"; /O/ as in the first vowel of "water"; /d/ as in the plosive of "dark"; /i/ as in the vowel of "she"; /S/ as in the fricative of "she". Each curve contains the first 150 frequencies from a 32 ms sound clip sampled at a rate of 16-kHz.

k Sample Comparison
The above confidence set in Equation 4.1 comparing k samples can be used to refute the null hypothesis that all covariance operators are equal. A two sample permutation test was performed in Pigoli et al. (2014). Given two samples of functional data, f (2) m with associated covariance operators Σ (1) and Σ (2) , respectively, the desired hypotheses to test are When using a permutation test, the labels are randomly reassigned M times, and each time, the distance between the two new covariance operators is computed. For sufficiently large M , this procedure will return the exact significance level of the observations with respect to the data set. A power analysis was performed between the permutation method and our proposed concentration approach using Equation 4.1. Given two different operators Σ (1) and Σ (2) and γ ≥ 0, an interpolation between the two operators is constructed as Σ (γ) = [(Σ (1) ) 1/2 + γ{S(Σ (2) ) 1/2 − (Σ (1) ) 1/2 }][(Σ (1) ) 1/2 + γ{S(Σ (2) ) 1/2 − (Σ (1) ) 1/2 }] * , where S is an operator minimizing the Procrustes distance, between Σ (1) and Σ (2) , which is d where Σ (i) = (R (i) )(R (i) ) * and U L 2 (I) is the space of unitary operators on L 2 (I) (Pigoli et al., 2014).
Monte Carlo simulations were run in order to estimate the power of each test. Two operators Σ (1) and Σ (2) with similar eigenvalue decay were compared. For sample size n = 50, γ ∈ {0, .1, .2, .3, .4, .5}, and for sample size n = 500, γ ∈ {0, .05, .1, .15, .2, .25}. For each γ, 5000 samples of size n were generated for Σ (1) and Σ (γ) . Equation 4.1 and the permutation method (Pigoli et al., 2014) were both implemented to estimate the empirical power. Figures 1 and 2 display the results for operators whose eigenvalues decay at a quartic and quadratic rate, respectively. The solid lines indicate the power of the permutation test, and the circle lines indicate the power of our concentration approach. The colors red, green, and blue correspond to the three norms trace, Hilbert-Schmidt, and operator, respectively.
In most cases, the concentration approach is able to achieve the same power to reject the null as does the permutation test. The notable exception is for the trace norm when the eigenvalues decay slowly. The added benefit to the concentration approach is the speed with which it executes. Across all of the Monte Carlo simulations, our concentration approach ran on average 28.14 times faster than the permutation method. This was computed by tracking the amount of computation time each method spent while producing the plots in Figures 1 and 2, which corresponds to 5 values of γ, 2 values of α, 3 different norms, 2 sample sizes, and 5000 replications each resulting in 300,000 function calls for both the permutation and concentration methods. Unlike the other norms, the Hilbert-Schmidt norm can be calculated without explicit computation of the eigenvalues. For each evaluation of the permutation test, 100 permutations of the data were generated, which corresponds to 100 random draws and 100 eigenvalue computations. More accuracy would require even more permutations. In comparison, our concentration approach requires only 3k eigenvalue computations and no random draws and hence is only dependent on the number of samples regardless of data size or α.
The proposed k-sample test was also used to compare samples of log-periodogram curves from the spoken phonemes /A/ and /O/ . As one can imagine, these vowels can be hard to distinguish; see Section 5.4 for further evidence of this. For k ∈ {2, 3, 4, 5, 6}, k − 1 disjoint sets of 40 /A/ curves and one set of 40 /O/ curves were randomly sampled from the data set. This was replicated 500 times, and each time Equation 4.1 was used to decide whether or not the k covariance operators were equivalent at the α = 0.05 level. The resulting estimated statistical power for each k is In the null setting, the above experiement was rerun except that every disjoint set of curves came from the /A/ set. The resulting experimentally computed test sizes are k 2 3 4 5 6 Size 0.00 0.00 0.00 0.004 0.072

Binary and trinary classification
Our concentration of measure (CoM) method is implemented on covariance operators making use of the trace norm · tr where for a covariance operator Σ with eigenvalues The trace norm was chosen based on the analysis of the preceding section as well as that of Pigoli et al. (2014) where it achieved the best performance when compared with the other p-Schatten norms. The CoM approach to classification of operators is tested in a variety of simulations against other standard approaches to functional classification. The methods used for comparison are k-nearest  . The red, green, and blue lines respectively correspond to the trace class, Hilbert-Schmidt, and operator norms. Table 1: A comparison of the performances of five classification methods. The first entry for each method corresponds to classifying the covariance operators as functions. The prime entry corresponds to classifying curves with a majority vote. The estimated percent of correct classification is listed in the table with the sample standard deviation in brackets. The top block comes from Gaussian process data, and the bottom comes from t-process data with 4 degrees of freedom. The highest percentage of each column is marked in bold. neighbours (Ferraty and Vieu, 2006), classification using kernel estimators (Ferraty and Vieu, 2003), general linear model (Müller and Stadtmüller, 2005), and regression trees. The first simulation asks each method to classify observed mean zero Gaussian process data or mean zero t-process data with 4 degrees of freedom. The two covariance operators in question, Σ 1 and Σ 2 , are the sample covariances of the male and of the females of the Berkeley growth curve data (Ramsay and Silverman, 2005). In particular, n collections of k curves were generated from each of Σ 1 and Σ 2 as a training set, and m collections of k curves were generated as a test set. The CoM method was trained on the set of n sample covariances and used to classify each of the m test covariances. The remaining classification methods were trained and tested in two separate ways: By treating each sample covariance as a function and classifying as usual, and by training on all n × k observations and testing each of the m collections by classifying each constituent curve individually and taking a majority vote with ties settled by a uniform random draw.
For group sizes k = 1, 2, 4, 8, 16, 100 simulations were run with n = 100 sets of k training curves. To compare the accuracy of each approach m = 100 sets of k testing curves were generated for each operator. The accuracy of each method is tabulated in Table 1.
The concentration method performed well against the alternatives. Its performance was on par with the kernel method applied to each covariance operator as a function. Our method was only consistently outperformed by the kernel method implementing the majority vote approach. However, the two operators in question have very similar weak variances. The next simulation demonstrates how the concentration method adapts naturally when the variances of each label significantly differ.
Continuing from the previous simulation, a third operator is constructed from Σ 1 and Σ 2 by averaging these two and then scaling up the non-principle eigenvalues by a factor of 5. This, in some sense, creates a third operator between the first two, but with higher variance. The simulation is carried out precisely as before, but incorporating all three operators. In this setting, our concentration approach demonstrates the best performance. The results are listed in Table 2.
These five methods tested on simulated data were also tested against phoneme data. Across 50 iterations, each set of 400 curves was partitioned at random into an 100 curve training set and a 300 curve testing set. The five classifiers were trained and run on each of the 300 × 5 curves individually. For our concentration of measure approach, the rank one operator associated to each individual curve was compared with the covariance operator formed from the 100 × 5 Table 2: A comparison of the performances of five classification methods as in Table 1, but with three potential classes from which to choose. The estimated percent of correct classification is listed in the table with the sample standard deviation in brackets. The top block comes from Gaussian process data, and the bottom comes from t-process data with 4 degrees of freedom. The highest percentage of each column is marked in bold.   Table 3. Our concentration of measure approach only uniformly outperforms the regression tree classifier, but has comparable performance to the other three methods, and none of the competing methods uniformly outperforms ours.

The expectation-maximization algorithm in practice
The experiments described and depicted below make use of the trace norm only. It was determined through experimentation that the expectation-maximization algorithm we propose in Section 4.3 does not perform well under the topology of either the Hilbert-Schmidt or operator norms as they give more emphasis to the principle eigenvalue at the expense of the others. The usual behavior under these norms is for all estimates to converge to the average of all of the data points. This is in contrast to the better performance of the algorithm making use of the trace norm, which is somewhat more uniform in its treatment of the eigenstructure. As a first test case, this algorithm was run given three target covariance operators, which were constructed by taking three randomly generated orthonormal bases U i and a diagonal operator D of eigenvalues decaying at a rate λ k = O(k −4 ) and multiplying Σ i = U i DU i T . Let the three target covariance operators be denoted as Σ a , Σ b , and Σ c . For each of these operators, 1000 rank four data points were generated from a zero mean Gaussian process. From the data, the algorithm initializes three estimatesΣ 3 , which attempt to locate the three target operators as the method iterates.  After 15 iterations, the original 3000 data points were perfectly separated into three groups. To make the problem harder, a second test case was run identical to the first except that the observed operators are all of rank one. The resulting clusters from both tests are in Table 4. The inaccuracy in the rank one setting is equivalent to the poor performance of classification of rank one operators detailed in Tables 1 and 2. For the phoneme data all 400 sample curves from each of the five phoneme sets were clustered individually as curves. The algorithm was run for 20 iterations and told to partition the data into five clusters. The results are in Table 5. Cluster 3 captured almost all of the vowels /A/ and /O/ , which, recalling their definition in Section 5.1, are quite similar in sound. Clusters 2 and 4 contain the majority of /S/ and /d/ curves, respectively. Lastly, Clusters 1 and 5 split the set of /i/ curves roughly in half. Rerunning the experiment with only four clustered nicely separated the data as displayed in Table 6. Cluster 3 again contains almost all of the /A/ and /O/ . The remaining clusters mostly partition the /d/ , /i/ , and /S/ with high accuracy. Hence, barring the ability to distinguish the very similar /A/ and /O/ curves, the proposed expectation-maximization algorithm is an effective method for the unsupervised clustering of phonemes.