Outlier Detection for Pandemic-Related Data Using Compositional Functional Data Analysis

With accurate data, governments can make the most informed decisions to keep people safer through pandemics such as the COVID-19 coronavirus. In such events, data reliability is crucial and therefore outlier detection is an important and even unavoidable issue. Outliers are often considered as the most interesting observations, because the fact that they differ from the data majority may lead to relevant ﬁndings in the subject area. Outlier detection has also been addressed in the context of multivariate functional data, thus smooth functions of several characteristics, often derived from measurements at different time points (Hubert et al. in Stat Methods Appl 24(2):177–202, 2015b). Here the underlying data are regarded as compositions, with the compositional parts forming the multivariate information, and thus only relative information in terms of log-ratios between these parts is considered as relevant for the analysis. The multivariate functional data thus have to be derived as smooth functions by utilising this relative information. Subsequently, already established multivariate functional outlier detection procedures can be used, but for interpretation purposes, the functional data need to be presented in an appropriate space. The methodology is illustrated with publicly available data around the COVID-19 pandemic to ﬁnd countries displaying outlying trends.


Introduction
The crisis caused by COVID-19 in almost all areas of life has also revealed that an accurate data collection is a challenge that cannot be easily resolved due to political or logistic problems. However, the availability of clean and reliable data is a key step in fighting a pandemic. On the one hand, knowing the real number of tested, newly infected and dead people allows to investigate the causes of the observed developments and to take appropriate measures to stop the spread of an infection. On the other hand, insurance companies offering a protection linked to some specific events during a pandemic would like to have reliable data to avoid the possibility of moral hazard.
Many countries report the number of cases, deaths, tests, and further parameters (variables) related to the COVID-19 pandemic regularly over time, and the data are accessible in public data repositories. Rather than treating the data with tools from time series analysis, it is common to consider them as functional data, so that the measurements are represented by smooth functions over time. One could then analyse the multivariate information contained in the functions for the different variables, and compare the countries with respect to this information. Thus, countries for which the multivariate information differs from the main trend given by the majority of the countries are possible outliers. Instead of directly considering the reported number (represented by the functions), one could also focus on analysing relative information. This can be done by taking (log-)ratios between the variables. Thus, the source of information for the analysis would not consist in the number of cases, death, tests, etc., for a particular day in a particular country, but in the (log-)ratios between these numbers. This is what is done in compositional data analysis, and outlier detection in this context will focus on atypical behaviour in the multivariate information of such (log-)ratios. For example, if the development of the number of cases over time is similar in some countries, but in one country the number of deaths develops more rapidly, this could be much better visible in a (log-ratio) than in the reported values. Thus, treating COVID-19 data as compositional data and analysing relative rather than absolute information can be very beneficial for outlier detection.
In this paper we consider a new method for the detection of outliers in the compositional functional data setting. The detection of outliers in the p-dimensional multivariate data case has been intensively investigated throughout the years and many methods have been developed. Denote by x k ∈ R p , for k = 1, ..., K , the observed samples. A popular approach considers an outlier of these samples as a point x k 0 for which the robustified version of the Mahalanobis distance, (x k 0 −m) C −1 (x k 0 −m), where m respectively C are robust estimators for the mean and the covariance matrix, is above a certain threshold and thus far away from the centre m with respect to the covariance structure C; see Rousseeuw (1985), Rousseeuw and Driessen (1999) and Hubert and Debruyne (2010). The idea of defining an outlier as a point being far away from the centre has been extended to more general measures related to statistical depth, see Tukey (1975), Serfling (2006) and Mosler (2012).
In recent years, many methods of multivariate statistics have been generalised to Functional Data Analysis (FDA). In FDA one considers data points to be whole functions, i.e. in the notation above, data points x k : I → R p are multivariate functions; for an overview of FDA we refer to Ramsay (2004), Ferraty and Vieu (2006) or Kokoszka and Reimherr (2017). Accordingly, the concept of outliers has been extended from the multivariate to the FDA setting, see Fraiman and Muniz (2001), Febrero et al. (2008), Sun and Genton (2011) and Hubert et al. (2015b).
In this paper we consider extending the ideas of outlyingness to functional data with image in the compositional data space. Thus, Sects. 12.1.1 and 12.1.2 provide a short introduction to the concepts of compositional data analysis and functional data, respectively. Further, in Sect. 12.2 we consider smoothing for functional data with image in the compositional space. In Sect. 12.3 we look at how one can detect outliers for the latter setting. That is, we extend the methods of detecting outliers from the non-compositional FDA case to the compositional one. Furthermore, Sect. 12.4 contains an application of the method presented. The data is comprised of COVID-19 data of different countries over time. Each country represents a functional data point. We finish in Sect. 12.5 with a summary and some conclusions.

Compositional Data Analysis Concepts
Assume we have given a D-dimensional random vector x for which each entry is strictly positive, i.e. x ∈ R D + , where R D + denotes the D-dimensional real number space with strictly positive entries. In the framework of compositional data analysis (CODA) it is assumed that the ratios x j x k , for any j, k ∈ {1, ..., D}, j = k, carry the relevant information, and thus only relative information is essential. As ratios do not change when multiplying x with a strictly positive scalar λ > 0, it holds that λx =: y carries the same information as x. This motivates defining the equivalence relation x ∼ y ⇐⇒ ∃λ > 0 λx = y for any x, y ∈ R D + which partitions the space R D + into equivalence classes. Choosing for each equivalence class the representative x = (x 1 , ..., x D ) satisfying D j=1 x j = 1, leads to the set of equivalence classes called the D-part simplex The space S D is turned into a Hilbert space-called the Aitchison geometry on the simplex, see Aitchison (1982)-by defining addition (perturbation), multiplication with a scalar (powering), an inner product and a norm for x = (x 1 , ..., x D ) , y = (y 1 , ..., y D ) ∈ S D and α ∈ R: • Perturbation: x ⊕ y := (x 1 y 1 , ..., x D y D ) • Powering: Furthermore, the Aitchison geometry is (bijectively) isometric to R D−1 . To show this, firstly define the centred log-ratio (clr) which satisfies the properties of being invariant under the above operations and the norm, i.e.
clr(x ⊕ y) = clr(x) + clr(y) (12.2) see Filzmoser et al. (2018). However, as for any x ∈ S D , the entries of clr(x) sum up to zero, D i=1 clr(x) i = 0, it follows that the clr mapping does not satisfy the property of being one-to-one onto R D . To obtain a bijective mapping, choose a D − 1 dimensional basis V = (v 1 , ..., v D−1 ), where v j ∈ R D , for j = 1, . . . , D − 1, are clr coefficients, and define the isometric log-ratio (ilr) mapping as The latter is a one-to-one mapping fulfilling (12.2), (12.3) and (12.4), see Filzmoser et al. (2018). As there are infinitely many possibilities to choose a basis V, ilr coefficients are frequently considered to express all relative information of a composition appropriately in the usual Euclidean geometry, for which the common statistical tools have been designed. If an interpretation is desirable, the relative information is often re-expressed in terms of clr coefficients by clr(x) = V ilr V (x), because they relate to the original compositional parts in terms of relative information of the part to an "average" (geometric mean), see (12.1).

Functional Data
In FDA we consider observations to be multivariate smooth functions f : [t 1 , t N ] → R D . In practice, such observations often originate as time series, measured at certain time points t i , with i = 1, . . . , N , and thus they are not necessarily forming smooth functions. In this case, a preprocessing step is needed to find an estimatef for f given (t i , y i ), with y i ∈ R D , i = 1, . . . , N , being noisy samples of f(t i ). We assume in the following Gaussian centred uncorrelated noise with equal variance. Although many methods exist to recover smooth functions, it is common thatf is estimated by smoothing spline methods. The literature on spline methods is vast and we refer to Reinsch (1967), Wood (2017) and Yee (2015) for a good overview. The main idea is that given multivariate data (t i , y i ) we find an estimatef which is, on the one hand, sufficiently smooth but, on the other, also a good approximation to the data. It is common to look at the following vector valued smoothing problem where λ > 0 is a fixed smoothing parameter, and · E denotes the Euclidean norm. The idea is that with increasing λ, the second derivative f is forced to zero, i.e. towards a linear function. From Problem (12.6) it can be deduced that the solution is of the form f(t) (2015), with b i being basis functions of the cubic spline space, and a i being fixed vectors in R D . Plugging this basis expansion into (12.6) shows that the penalty function acts as regularisation penalty on a i restraining the flexibility of the latter. In reality, one never uses the full basis expansion as given above, but rather a different and equally flexible expansion with less basis functions to save coefficients and avoid unnecessary computation in the case of a lot of data, for example a B-spline basis. Plugging in a specific basis expansion we can see that the problem is a convex problem, and solving this vector valued problem is discussed in Yee (2015).

Smoothing for CODA Time Series
In this section we consider functional observations with image in S D , i.e. functions u : [t 1 , t N ] → S D . As before, we assume that only a set of discrete samples (t i , x i ) is given, with i = 1, . . . , N and x i ∈ S D , where x i is a sample of u(t i ). To construct a smooth estimateû of u, we firstly define derivatives and smoothing splines in a compositional context. For a function u : [t 1 , t N ] → S D , its derivative at a time point t is defined as Accordingly, one can define higher order derivatives inductively, e.g. u (t) := (u ) (t). For a reference on compositional calculus we refer to Pawlowsky-Glahn and Buccianti (2011). In accordance with the previous section, defineû aŝ where λ > 0 is again a fixed smoothing parameter controlling the smoothness.
Using the continuity of ilr V and (12.2), it follows that holds. With the same arguments, the equation Therefore, defining f := ilr V (u), Problem (12.8) can be reformulated using the latter, as well as the properties (12.2) and (12.4): The latter is a vector valued smoothing problem in R D−1 for the data (t i , ilr V (x i )), see Problem (12.6), and it can be solved accordingly. Given a solutionf to (12.11), a solution to (12.8) is thenû = ilr −1 V (f) per definition of f. In the case that different solutions to (12.11) exist, e.g.f 1 andf 2 , we know from the equivalence chain before and from the fact that ilr V is isometric, that also ilr −1 V (f 1 ) and ilr −1 V (f 2 ) are different solutions to Problem (12.8). Equally, having two different solution of (12.8) leads to different solutions of (12.11). This means that if (12.11) is uniquely solvable for a chosen V, we get thatû is also uniquely determined. Therefore, the choice of V is irrelevant. With the exception of some very degenerate settings, Problem (12.11) is uniquely solvable in most applications.

Outlier Detection in Compositional FDA
In the univariate case we can think of outliers as observations being very far away from the main mass of the data set, thus far away from the data centre with respect to the scale (Maronna et al. 2006).
The outlyingness of a multivariate observation x ∼ P X , where P X denotes the distribution of a p-dimensional random vector X and x a realisation, can be built on the univariate case by means of projection onto a line defined by r ∈ R p , with r = 1, thus r X. As discussed in Donoho et al. (1992), the outlyingness of an observation x of the projection r x can be measured by where "mad" denotes the median absolute deviation, i.e. the median of |X − median(X)|. Taking the supremum of (12.12) over all r with r = 1 yields a measure of outlyingness for any x independent of the direction r. Adjusting (12.12) for skewness-see Hubert and Vandervieren (2008) for adjusted boxplots of skewed distributions in the univariate case-the adjusted outlyingness (AO) is defined as where w 1 and w 2 are functions that allow to adjust for the skewness of the univariate distributions, see Hubert et al. (2015b) for an exact definition of these two functions.
To obtain a measure of outlyingness in the FDA case, e.g. for the data (f : Hubert et al. (2015b) propose to use the functional adjusted outlyingness of a FDA point f: In a compositional functional data context, where the compositions are functions of the form u : [t 1 , t N ] → S D , with distribution P U , we propose to define the compositional functional adjusted outlyingness as ( 12.13) For Definition (12.13) to be a valid measure of outlyingness it needs to be checked that it is well defined, i.e., this measure needs to be independent of the choice of the basis matrix V. As V ilr V (x) = clr(x) holds by definition for a matrix with orthonormal columns V, we have for a different matrix V with orthonormal columns ilr see Filzmoser et al. (2018). As the matrix V V ∈ R (D−1)×(D−1) is of full rank D − 1, we get, for any fixed t where the last equality follows from the affine invariance property of AO, see Hubert and Van der Veeken (2008); affine invariance means that AO(x, P X ) = AO(Ax + b, P AX+b ) holds for any regular matrix A ∈ R p× p and b ∈ R p for x ∈ R p with x ∼ P X . As CFAO is defined as an integral over (12.16) it follows that the latter is equally invariant and thus well defined.
To visually find outliers in the FDA setting, Hubert et al. (2015a) introduced a functional outlier map (FOM). Assume that the evaluation of K multivariate functional data points f 1 , . . . , f K is given at time points t 1 , . . . , t n , and denote P K the sample distribution of the functional data points, and P t i the sample distribution of the evaluations at time point t i . The FOM is defined as a two dimensional graph, plotting F AO(f k , P K ) on the horizontal axis against on the vertical axis, for k = 1, . . . , K , where σ denotes the standard deviation. The motivation behind this map is that when a data point f k is a shift outlier, its according point in the FOM plot will be higher on the horizontal axis. If a data point f k displays an outlying high variability in time, this will result in a high value on the vertical axis in the FOM plot. The denominator in (12.17) is necessary to correct for the effect that when a data point is shifted further, this is reflected in the standard deviation accordingly, see Hubert et al. (2015a). Given the evaluation of the compositional functional data u 1 , . . . , u K , k = 1, . . . , K , at time points t 1 , . . . , t N , we suggest equivalently to plot C F AO(u k , P K ) on the horizontal axis, against on the vertical axis. Again, the latter is independent of the choice of V, because AO as well as C F AO are affine invariant, see the reasoning for (12.16) and its conclusion.

Application to COVID-19 Data
In this section we use data from https://covid.ourworldindata.org, which are publicly available. This page contains for most countries of the world daily information related to the COVID-19 pandemic. Here we focus on European countries only, and on the following information: • Total number of COVID-19 deaths per million inhabitants.
• Total number of COVID-19 tests per million inhabitants.
• Positive rate, i.e. share of total COVID-19 tests that were positive.
• Reproduction rate, referring to the expected number of cases directly generated by one case.
We select the time period from April 1 until December 31, 2020, because from April onwards the information was consistently collected in the data base. However, for some of the European countries the information on some of the variables was not available, so that finally only 35 European countries could be used. Still, for some countries there were missing values (or shorter time periods with missings), which have been imputed by a weighted moving average imputation method, implemented as function na_ma() in the R package imputeTS (Moritz and Bartz-Beielstein 2017).
As an example, Fig. 12.1 shows the data for Austria, and the data structure is similar in many of the other countries. Still, there might be countries with deviations in the multivariate data structure, and the task is to identify such countries. The focus here is on relative information in terms of log-ratios between the different variables. Figure 12.1 reveals that the total number of cases starts to grow quickly in October 2020, and the same is true for the total number of deaths (per million). The number of tests grows steadily over the time period. The positive rate decreases at the beginning of this selected time period, but it increases drastically in October, followed by a decline in November/December. The reproduction rate fluctuates more, and has higher values than one in the summer and fall.
Multivariate functional outlier detection is here first applied to the data expressed in relative information, i.e. as ilr coordinates. In a second stage we also compare with an analysis based on absolute information, as reported in Fig. 12.1 for Austria. Naturally, the different treatment of the data will very likely lead to different results. As an example for relative versus absolute information, we may consider just the number of cases and the number of deaths (per million). For most countries, an increase of cases also implies an increase of deaths, probably with a different time delay. If one looks at relative information in terms of a log-ratio, however, differences between the countries might get more clearly pronounced. We will come back to this issue later. For every country, the data are first ilr-transformed, resulting in time series of the ilr coordinates. Since the specific choice of the ilr coordinates is not relevant here, we use so-called pivot coordinates, where the first coordinate expresses all relative information of the first part to the remaining parts in the composition, see Filzmoser et al. (2018). Figure 12.2 shows the resulting ilr coordinates for the Austrian data; since there are 5 variables available, see Fig. 12.1, we end up with 4 ilr coordinates. Figure 12.2 also shows the lines after smoothing the data in ilr coordinates, thus after solving Problem (12.11). The information of these lines form the compositional functional data as they are used for multivariate outlier detection. Since we used pivot coordinates, only the first coordinate (denoted here by z 1 ) has a clear interpretation in terms of all relative information of the total cases to the remaining variables. This coordinate is in fact proportional to the first clr coefficient (Filzmoser et al. 2018). We will show and discuss the corresponding clr coefficients later in Fig. 12.5.
Once the smooth functions are estimated for every country, compositional functional outlier detection can be performed. Figure 12.3 shows the compositional functional outlier map (CFOM). Every point in the plot corresponds to a country, and the line indicates the outlier cutoff. It can be seen that one (red) point (Iceland) slightly exceeds the cutoff, and another point (Belarus) is just below the cutoff. The sorted compositional functional adjusted outlyingness is again shown in Fig. 12.4 (left), with the corresponding country names added. The values for Iceland and Belarus clearly stick out, and the next biggest value originates from the data from Luxembourg. These countries are not particularly outlying in their variability in time, since their values in Fig. 12.4 (right) are not unusual. Figure 12.5 is an attempt to identify the reason for outlyingness. The plots show the smoothed functional data in clr coefficients, which are simply obtained by a transformation from the functions in ilr coordinates, see Eq. (12.5). The function for Iceland is shown in red, and that for Belarus in blue. For example, the clr coefficients for the total cases (left plot) mainly show a strongly increasing trend at the beginning, and again at the end of the considered time period. This means that the cases have grown rapidly, relative to the remaining variables (on average). The function for Belarus (blue) shows a quite different behaviour, with very high values especially around May. This means that the total cases are very much dominating over the values of the other variables. The reason for this is not because of high values of   cases, but because of exceptionally low (reported) values of the remaining variables. Also the values for Iceland (red curves) are seen as atypical. For example, the clr coefficients of the total cases started to be the lowest in April, but then increased to be the highest in August. In a ratio, it can either be the change in the numerator or in the denominator, or in both, to get this behaviour, but in any case it turns out to be quite different compared to the other countries. As a comparison, the following analysis is based on absolute information. Thus, the smoothed curves are directly estimated from the raw input data without any trans- formation, see Eq. (12.6). Then multivariate functional outlier detection is applied, which results in the functional outlier map presented in Fig. 12.6. Here, one point clearly exceeds the outlier cutoff value, and this point is Luxembourg. Details are presented in Fig. 12.7, where the left plot are the sorted values from the horizontal axis, and the right plot the sorted values of the vertical axis from the FOM of Fig. 12.6. Indeed, Luxembourg appears with an exceptionally high value of FAO, and neither Iceland nor Belarus are atypical in any of these plots.
Finally, Fig. 12.8 shows the raw functional data. The outlier Luxembourg is shown by green curves, Iceland in red, and Belarus in blue. Luxembourg shows a very clear difference in the total tests, which might be the reason for the multivariate outlyingness. The countries Iceland and Belarus, which were clearly different in the  compositional analysis, follow the main data structure well and do no longer appear as atypical. This shows that both types of analysis indeed focus on different data aspects, and it will be based on the task and research question to determine which of the analysis is more appropriate.

Summary and Conclusions
Outlier detection has been a relevant task in data analysis already since the beginning of data collection, and it continues being important also for more complex data structures. The identified outliers may point at atypical events, and depending on the context even at possible cases of fraud; see, e.g., van Capelleveen et al. (2016) or Nian et al. (2016). Outlier detection methods are also useful for pandemic-related data, as they may guide policy makers to draw appropriate conclusions.
Here we have used publicly available time series data related to COVID-19, as they are reported from different countries. The multivariate information, here in terms of the number of cases, deaths, tests, the positive rate, and the reproduction rate, has been treated as compositional data, where relative rather than absolute values are processed in the analysis. Absolute values would refer to the data as they are reported, while relative information refers to the log-ratios between the values of the different variables. An outlier detection method which makes use of relative information thus will focus more on the differences of the developments over time between the variables, and not necessarily on extreme values in single variables. In fact, if there is a peak in one variable in a certain time period, and the peak also appears in another variable in the same period, the log-ratio would not show up as unusual. A temporal shift of the peaks, however, creates big log-ratios, and if the position or magnitude is different for one country compared to the others, this country will appear as a potential outlier.
The time trends of the COVID-19 data have been treated here as functional data. Functional data which are processed with tools from compositional data analysis commonly have a constant sum constraint, such as probability density functions or particle-size curves, see van den Boogaart et al. (2014) or Menafoglio et al. (2014).
Here we considered the single variables of the multivariate data information as parts of a composition, and since the information is derived continuously over a domain (here time), such data are regarded as multivariate compositional functional data. As functional data are supposed to be smooth functions, the concepts from compositional data analysis already need to be taken into account when generating the compositional functional data. Thus, the original data information, which usually needs to be smoothed in order to represent functions, has to be presented in the appropriate geometry. Since we deal with multivariate information, smoothing also needs to be done in a multivariate context. Here we have used isometric log-ratio coordinates to move the data from the simplex to the standard Euclidean geometry, and we have shown that the specific choice of these coordinates is not relevant for obtaining the smooth functions.
Once the multivariate compositional functional data are available and expressed in the appropriate geometry, standard tools for multivariate functional outlier detection can be used. The application of the methodology to the COVID-19 data revealed that the outlyingness values for the two countries Iceland and Belarus were clearly higher compared to the other investigated countries. Diagnostics in clr coefficients, again referring to relative information, has shown that some of the functions for these countries indeed deviated clearly, at least in certain time periods. Because clr coefficients refer to log-ratios of a specific variable to the geometric mean, deviations can be caused either by atypical values of this variable, or by atypical values of the geometric mean, representing an "average behaviour" of all analysed variables. The analyst would then have to compare this information to that from the other countries, or even go back to the original data source for such a comparison. There could be many reasons for outlyingness: data reporting is done differently (probably only for some of the variables), the policy of the restrictions in the context of the pandemic is very different, the behaviour of the people to deal with the pandemic is very different, etc.
We have also compared such an analysis with multivariate functional outlier detection using the absolute information, where outliers are, for example, countries with extreme values of a function in a certain time period. This analysis led to different outliers, and it finally will depend on the underlying task and research question which type of analysis is most appropriate.
There are many further methodological challenges, which are revealed when considering real data applications as, for instance, the full COVID-19 data set provided from the source mentioned in the paper: zero values, missings, poor data quality, some countries do not provide information for some of the characteristics, etc. These issues are relevant already for estimating the multivariate smooth functions, and subsequently also for the purpose of outlier detection. Our future research will be devoted to such tasks.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.