Depth-based reconstruction method for incomplete functional data

The problem of estimating missing fragments of curves from a functional sample has been widely considered in the literature. However, most reconstruction methods rely on estimating the covariance function or the components of its eigendecomposition, which may be difficult. In particular, the estimation accuracy might be affected by the complexity of the covariance function, the noise of the discrete observations, and the scarcity of complete discrete functional data. We introduce a non-parametric alternative based on depth measures for partially observed functional data. Our simulations point out that the benchmark methods perform better when the data come from a single population, the curves are smooth, and a large proportion of the sample is completely observed. However, our approach is superior for more complex covariance structures, non-smooth curves, and samples in which complete functions are scarce. Moreover, even in the most severe case, in which every function is incomplete, our method provides good estimates, whereas the competitors cannot produce any. The methodology is illustrated with two real data sets: the Spanish daily temperatures observed at different weather stations, and the age-specific mortality by prefecture in Japan. Both highlight the interpretability potential of the depth-based method.


Introduction
Partially observed functional data (POFD) are becoming more common, invalidating many of the existing methodologies (Ramsay and Silverman, 2005; Ferraty and Vieu, 2006). Diverse case studies motivate the development of statistical tools for these types of data. For example, in medical studies, many data sets are recorded through periodical check-ups, and patients who miss appointments or devices that fail to record are typical sources of censoring. Such situations arise in different types of monitoring, including ambulatory blood pressure, the health status of patients tested for the human immunodeficiency virus (HIV), growth curves, and the evolution of lung function (James et al., 2000; James and Hastie, 2001; Delaigle and Hall, 2013; Kraus, 2015; Delaigle and Hall, 2016). Sangalli et al. (2009, 2014) also consider POFD from aneurysm studies, where the censoring comes from a prior reconstruction of the sample and posterior processing to make the data comparable across subjects. In demography, age-specific mortality rates for older ages are often not completely observed due to the decreasing number of survivors (Human Mortality Database, 2021), and this cohort is the focus of actuarial science studies (D'Amato et al., 2011). Other examples involve electricity supply functions that may not be completely observed because suppliers and buyers typically agree on prices and quantities depending on the market conditions (Kneip and Liebl, 2020; Liebl and Rameseder, 2019).
Prominent literature has tackled the estimation of missing parts of partially observed functional data, providing several benchmark methods, among them those of Yao et al. (2005), Goldberg et al. (2014), Kraus (2015), Delaigle and Hall (2016), and Kneip and Liebl (2020). In this article, a new method is presented and compared with two of these benchmarks. We selected the two that achieved the best performance, in terms of mean squared error, on our simulations and case studies: the method of Kraus (2015) and the method of Kneip and Liebl (2020). Kraus (2015) proposes a functional linear ridge regression model for completing functional data based on principal component analysis: the principal component scores of the incomplete functions are estimated from the completely observed functions, and a completion procedure then recovers the missing part of each function from its observed part. Kneip and Liebl (2020), on the other hand, approach the problem by introducing an optimal linear reconstruction operator based on a local linear kernel, aiming to produce smoother results and to avoid artificial jumps between observed and reconstructed parts.
Our approach combines two novel functional tools: 1) a depth-based method for functional time series forecasting (Elías et al., 2021c); and 2) a functional depth for partially observed functional data (Elías et al., 2021b).
The new method is conceived for: 1) Scenarios where the estimation of the covariance function is complex or impossible. We remark that the available reconstruction methods strongly depend on a proper estimation of the covariance. Its estimators might be very sensitive and become unreliable for many analyses, in particular for principal component analysis (Hubert et al., 2005). In addition, a low number of complete functions in the sample also hampers the estimation procedures and makes them infeasible if all the sample functions are partially observed (Kneip and Liebl, 2020).
2) Reconstructing non-smooth functions. Some methods are designed to deal with smooth and unaligned functions and, in consequence, their output consists of smooth and aligned functions (Kneip and Liebl, 2020). Our goal is to provide a method whose reconstructed functions remain as faithful as possible to the roughness and variability of the data. 3) Adding interpretability. It might be useful to obtain, besides a precise estimation, some insight into the drivers of the final reconstruction. We consider simulated and real data to illustrate the issues listed above. On the one hand, we consider yearly curves of Spanish daily temperatures, gathered at different weather stations spread across the Spanish territory. On the other hand, we consider age-specific yearly mortality rates recorded in each Japanese prefecture (a political division of the territory).
The structure of this paper is as follows: Section 2 introduces the notation and the method. Section 3 shows a variety of simulated results based on Gaussian processes under different regimes of partial observability. In addition, we illustrate the method's performance with the Spanish daily temperatures data set and the Japanese age-specific mortality rates by prefecture. Section 4 draws some conclusions.

Definition and notation
Let X = {X(t) : t ∈ [a, b]} be a stochastic process with continuous trajectories and (X_1, . . ., X_n) independent copies of X. To simplify the notation, we assume without loss of generality [a, b] = [0, 1]. We consider the case where X_1, . . ., X_n are partially observed. Following Delaigle and Hall (2013), we model the partially observed setting through a random mechanism Q that generates compact subsets of [0, 1] on which the functional data are observed. Specifically, let O be a random compact set generated by Q, and let (O_1, . . ., O_n) be independent copies of O. Each X_i is observed only on O_i, and we denote by M_i = [0, 1] \ O_i the set where X_i is missing. Then, the observed and missing parts of X_i are (X_i, O_i) and (X_i, M_i), the restrictions of X_i to O_i and M_i, respectively.
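As a toy illustration of this notation, the observed and missing parts can be represented on a discrete grid, with a boolean mask playing the role of O_i. The generator below is our own construction for illustration only (not part of the method): it mimics Q by producing a union of random subintervals of [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense evaluation grid standing in for [0, 1].
grid = np.linspace(0.0, 1.0, 200)

def random_interval_set(n_intervals=3, min_half=0.05, max_half=0.15, rng=rng):
    """Generate O as a union of random subintervals of [0, 1],
    returned as a boolean mask over the grid."""
    mask = np.zeros_like(grid, dtype=bool)
    for _ in range(n_intervals):
        center = rng.uniform(0.0, 1.0)
        half = rng.uniform(min_half, max_half)
        mask |= (grid >= center - half) & (grid <= center + half)
    return mask

# A toy trajectory X_i and its observation set O_i.
x = np.sin(2 * np.pi * grid)
o_mask = random_interval_set()

x_observed = np.where(o_mask, x, np.nan)   # (X_i, O_i): NaN outside O_i
x_missing  = np.where(o_mask, np.nan, x)   # (X_i, M_i): the part to reconstruct
```

Since M_i is the complement of O_i, the two masked versions of the curve partition the grid: every point is observed in exactly one of them.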
The core idea of our method stems from the depth-based method for dynamic updating of functional time series (Elías et al., 2021c). In that context, but using the notation introduced here, the functional sample (X_1, . . ., X_n) was ordered in time, n being the most recent time period. X_n was only observed on O_n = [0, q], with 0 ≤ q < 1, and the rest of the sample curves {X_j : j < n} were fully observed on [0, 1]; that is, O_j = [0, 1] for all j < n. We may summarize the depth-based approach for dynamic updating as follows: if J_n ⊂ {1, ..., n − 1} and the band delimited by the curve segments {(X_j, O_n) : j ∈ J_n} captures both the shape and magnitude of (X_n, O_n), we may estimate (X_n, M_n) from {(X_j, M_n) : j ∈ J_n}.
In particular, point estimators of (X_n, M_n) were obtained by computing weighted averages of the curve segments {(X_j, M_n) : j ∈ J_n}. In our jargon, {(X_j, O_n) : j ∈ J_n} is called the envelope of (X_n, O_n).
In contrast to the dynamic updating framework described above, in the partially observed scenario that we consider the sample curves need not be temporally ordered. More importantly, every single curve may be partially observed. Hence, for enveloping the observed part of a curve, say (X_i, O_i), we can only rely on curve segments, namely {(X_j, O_i ∩ O_j) : j ≠ i}.
Similarly, we may only have curve segments, specifically {(X_j, M_i ∩ O_j) : j ≠ i}, for estimating the missing part of X_i, that is, (X_i, M_i). How to apply the depth-based approach in these cases is a challenging problem that we address here. The method can be explained in two steps: one concerning how to compute an envelope of (X_i, O_i), described in Section 2.2, and one concerning how to estimate, or reconstruct, (X_i, M_i) from the observed parts of the curves used for enveloping (X_i, O_i), described in Section 2.3.

A depth-based algorithm for focal-curve enveloping
The concept of depth arose as a way of ordering multivariate data from the center outward (Liu, 1990; Liu et al., 1999; Rousseeuw et al., 1999; Zuo and Serfling, 2000; Serfling, 2006; Li et al., 2012). Let F be the collection of all probability distribution functions on R, F ∈ F and x ∈ R. In the univariate context, a depth measure is a function D : R × F → [0, 1] such that, for any fixed F, D(x, F) reaches its maximum value at the median of F, that is, at x such that F(x) = 1/2, and decreases the farther x is from the median. Examples of such univariate depth measures are

D(x, F) = 1 − |1/2 − F(x)|    (1)

and

D(x, F) = 2 F(x) (1 − F(x)).    (2)

Denote by P the generating law of the process X and by P_t the marginal distribution of X(t), that is, P_t(x) = P(X(t) ≤ x). Given a univariate depth measure D, the Integrated Functional Depth (IFD) of X with respect to P is defined as

IFD(X, P) = ∫_0^1 D(X(t), P_t) w(t) dt,    (3)

w being a weight function that integrates to one (Claeskens et al., 2014). It is worth mentioning that when w ≡ 1 and D is defined by Equation (1), the integrated functional depth corresponds to the seminal Fraiman and Muñiz functional depth (Fraiman and Muniz, 2001). When D is defined by Equation (2), the corresponding IFD is the well-known Modified Band Depth with bands formed by two curves (López-Pintado et al., 2010).
Denote by F_{J(t)} the empirical distribution function of the univariate sample {X_j(t) : j ∈ J(t)}, where J(t) = {j : t ∈ O_j} indexes the curves available at time t and q(t) denotes its cardinality. This is the probability distribution that assigns constant mass equal to 1/q(t) to each available observation at time t. Then, for any pair (X_i, O_i) and a given univariate depth D, we consider the empirical Partially Observed Integrated Functional Depth restricted to O_i (Elías et al., 2021b), defined by

POIFD((X_i, O_i)) = ∫_{O_i} D(X_i(t), F_{J(t)}) w(t) dt,  with  w(t) = q(t) / ∫_{O_i} q(s) ds.    (4)

In line with the approach for dynamic updating introduced by Elías et al. (2021c), we search for a set J of sample curves, as big as possible, such that i ∉ J and:
1) (X_i, O_i) is deep inside the band delimited by the curve segments {(X_j, O_i ∩ O_j) : j ∈ J}. For measuring depth here we use the Partially Observed Integrated Functional Depth restricted to O_i defined in (4).
2) The band delimited by {(X_j, O_i ∩ O_j) : j ∈ J} covers (X_i, O_i) on a subset of O_i of maximal measure, λ being the Lebesgue measure on R.
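To make the definition concrete, the following sketch computes an empirical POIFD on a common evaluation grid, assuming the Fraiman–Muñiz-type univariate depth D(x, F) = 1 − |1/2 − F(x)| of Equation (1) and weights proportional to q(t). The function name and array layout are our own choices for illustration, not the fdaPOIFD implementation.

```python
import numpy as np

def poifd(i, curves, masks):
    """Empirical Partially Observed Integrated Functional Depth of curve i,
    restricted to its observation set O_i (a sketch on a common grid).

    curves : (n, T) array of functional values
    masks  : (n, T) boolean array; masks[j, t] is True iff X_j is observed at t
    """
    n, T = curves.shape
    o_i = masks[i]
    depth_t = np.zeros(T)
    q = np.zeros(T)                      # number of curves available at each t
    for t in np.flatnonzero(o_i):
        avail = masks[:, t]
        q[t] = avail.sum()
        # Empirical cdf of the available observations at time t, evaluated at X_i(t).
        F = (curves[avail, t] <= curves[i, t]).mean()
        depth_t[t] = 1.0 - abs(0.5 - F)  # univariate depth D(x, F)
    # Weight each t in O_i proportionally to q(t), normalised over O_i.
    w = np.where(o_i, q, 0.0)
    return (depth_t * w).sum() / w.sum()
```

On a fully observed sample of constant curves at levels 0, 1, ..., 4, the middle curve attains the largest depth, as expected of a center-outward ordering.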
3) {(X_j, O_j) : j ∈ J} contains as many curves near to (X_i, O_i) as possible. For measuring nearness, we use the mean L2 distance between POFD; that is,

d(X_i, X_j) = ( (1/λ(O_i ∩ O_j)) ∫_{O_i ∩ O_j} (X_i(t) − X_j(t))^2 dt )^{1/2}.

Algorithm 1 provides a set of curves with the three features above, which we call the i-curve envelope and denote by J_i from now on. The algorithm is a variation of Algorithm 1 of Elías et al. (2021c), adapted to POFD. It iteratively selects as many sample curves as possible, from the nearest to the farthest to (X_i, O_i), for enveloping (X_i, O_i) and increasing its depth.
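The greedy selection can be sketched as follows. This is a simplified reading of Algorithm 1, not the authors' exact procedure: nearness is the mean L2 distance on the common observed part, and, instead of the full POIFD criterion, a candidate is kept only when it enlarges the fraction of O_i on which the band of the selected segments covers X_i.

```python
import numpy as np

def l2_nearness(i, j, curves, masks):
    """Mean L2 distance between curves i and j on O_i ∩ O_j (inf if disjoint)."""
    both = masks[i] & masks[j]
    if not both.any():
        return np.inf
    diff = curves[i, both] - curves[j, both]
    return np.sqrt(np.mean(diff ** 2))

def envelope(i, curves, masks):
    """Greedy sketch of the envelope construction: scan candidates from the
    nearest to the farthest and keep those that increase the fraction of O_i
    on which the band [min, max] of the selected segments covers X_i."""
    n, T = curves.shape
    order = sorted((j for j in range(n) if j != i),
                   key=lambda j: l2_nearness(i, j, curves, masks))
    selected, coverage = [], -1.0
    for j in order:
        trial = selected + [j]
        # Pointwise band of the segments selected so far, on their observed parts.
        low = np.full(T, np.inf)
        high = np.full(T, -np.inf)
        for k in trial:
            low = np.where(masks[k], np.minimum(low, curves[k]), low)
            high = np.where(masks[k], np.maximum(high, curves[k]), high)
        inside = masks[i] & (curves[i] >= low) & (curves[i] <= high)
        new_cov = inside.sum() / masks[i].sum()
        if new_cov > coverage:
            selected, coverage = trial, new_cov
    return selected
```

For fully observed constant curves at levels 0, 1, ..., 4, the envelope of the middle curve is formed by the two bracketing neighbours, which already cover it entirely.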
Figure 1 illustrates how Algorithm 1 works. There, we consider the first, second and final iterations of two runs of the algorithm based on 1000 i.i.d. trajectories of a Gaussian process. For the run shown in the left panels, we generated partially observed curves by removing six random intervals of the observation domain. For the run shown in the right panels, we considered missing data uniformly distributed on the observation domain. On average, only 50% of each curve was observed in both runs. The partially observed function we intend to reconstruct is colored in red and plotted entirely in the bottom panels, jointly with the estimation that we describe how to compute below.

Reconstruction of missing parts
For estimating the unobserved part of X_i, that is (X_i, M_i), we follow the approach used for dynamic updating (Elías et al., 2021c). The estimator is a weighted functional mean of the data in the curve envelope; the difference here is that these functional data may themselves be partially observed. Specifically, at each point of M_i, the estimator averages the values of the envelope curves available at that point, with weights decreasing exponentially in their distance to (X_i, O_i). This estimator is a version of the envelope projection with exponential weights (see Elías et al., 2021c, Equation (2)), adapted to the partially observed data context. The parameter θ governing the exponential weights is chosen by minimizing the mean squared prediction error (MSE) on O_i, the observational set where X_i is available and the error can be computed. As an illustration, the bottom panels of Figure 1 show reconstructions of the missing parts in the two simulated cases discussed above. On average, only 50% of each curve was observed in both runs. The partially observed function that we reconstructed is colored in red and plotted entirely in the bottom panels, jointly with its estimation (in blue).
We compare the results obtained with the depth-based method against those of Kraus (2015) and Kneip and Liebl (2020). Kraus (2015) proposes a regularized regression model to predict the principal component scores (Reg. Regression), whereas Kneip and Liebl (2020) introduce a new class of optimal reconstruction operators (Opt. Operator). The two methods were implemented using the R codes available at https://is.muni.cz/www/david.kraus/web_files/papers/partial_fda_code.zip and https://github.com/lidom/ReconstPoFD. The depth for partially observed curves and the data generation settings are implemented in the R package fdaPOIFD of Elías et al. (2021a).
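A minimal sketch of the reconstruction step, under the same discrete-grid conventions as before: each missing value of X_i is a weighted average of the envelope curves available at that point, with weights exp(−θ d_j) in the L2 distances d_j to (X_i, O_i). Function names are ours, and in practice θ would be tuned by minimizing the MSE on O_i.

```python
import numpy as np

def reconstruct(i, envelope_idx, curves, masks, theta=1.0):
    """Sketch of the depth-based reconstruction of (X_i, M_i): at each missing
    point, average the envelope curves available there, with exponential
    weights in their L2 distance to the observed part of X_i."""
    est = np.array(curves[i], dtype=float)      # keep observed values as they are
    # Distance of each envelope curve to X_i on their common observed part.
    dists = []
    for j in envelope_idx:
        both = masks[i] & masks[j]
        if both.any():
            dists.append(np.sqrt(np.mean((curves[i, both] - curves[j, both]) ** 2)))
        else:
            dists.append(np.inf)
    w = np.exp(-theta * np.asarray(dists))
    for t in np.flatnonzero(~masks[i]):         # points of M_i
        avail = np.array([masks[j, t] for j in envelope_idx])
        if avail.any():
            wa = w[avail]
            vals = np.array([curves[j, t] for j in envelope_idx])[avail]
            est[t] = (wa * vals).sum() / wa.sum()
        else:
            est[t] = np.nan                     # nothing to average at t
    return est
```

As a sanity check, a constant curve at level 0.5 observed only on the first half of the grid, enveloped by constant curves at 0 and 1 (equidistant, hence equally weighted), is reconstructed exactly at 0.5 on the missing half.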
Subsection 3.1 introduces the simulation setting and the data generation process for POFD, and shows results with synthetic data. Subsection 3.2 uses the same simulation settings but applied to the AEMET temperatures data; additionally, it illustrates the reconstruction of some yearly temperature curves that are partially observed in reality. Finally, Subsection 3.3 presents another real case study, in which Japanese age-specific mortality functions are reconstructed.

Simulation study
Let us denote by c% the percentage of sample curves that are completely observed. We remark that the benchmark methods perform better as c is larger, because these reconstruction methods strongly depend on the information of the completely observed curves to estimate the covariance or the components of its eigendecomposition. In contrast, the depth-based method can handle the case c = 0, i.e., when there are no complete functions in the sample. Therefore, results for this case are reported without comparison.
For our simulation study, we considered two Missing-Completely-at-Random procedures for generating partially observed data. These procedures have been used previously in the literature (Elías et al., 2021b) and are in line with the partial observability of the real case studies. They are: Random Intervals, with which each incomplete curve is observed on a number m of random disjoint intervals of [0, 1]; and Random Points, with which each incomplete curve is observed on a very sparse random grid.
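A sketch of how such a sample can be generated under the Random Points pattern. The helper and its parameterization are our own: c% of the curves stay complete, and each remaining curve is observed on p% of the evaluation points, drawn uniformly at random.

```python
import numpy as np

def make_pofd_sample(curves, c=25, p=50, rng=None):
    """Apply the Random Points observability pattern to a sample (a sketch):
    c% of the curves stay complete; each remaining curve is observed on a
    sparse random grid covering p% of the evaluation points.

    Returns the (n, T) boolean observation masks."""
    rng = np.random.default_rng(rng)
    n, T = curves.shape
    masks = np.ones((n, T), dtype=bool)
    n_complete = int(round(n * c / 100))
    # Choose which curves are incomplete, then thin each to p% of the grid.
    incomplete = rng.choice(n, size=n - n_complete, replace=False)
    n_obs = max(1, int(round(T * p / 100)))
    for j in incomplete:
        keep = rng.choice(T, size=n_obs, replace=False)
        mask = np.zeros(T, dtype=bool)
        mask[keep] = True
        masks[j] = mask
    return masks
```

With n = 20, c = 25 and p = 50, exactly 5 curves remain complete and each of the other 15 keeps half of its grid points.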
First, we apply these observability patterns to simulated trajectories. Concretely, we consider a Gaussian process X(t) = µ(t) + ε(t), where ε(t) is a centered Gaussian process with covariance kernel ρ_ε(s, t) = α e^{−β|s−t|} for s, t ∈ [0, 1]. The functional mean µ(t) is a periodic function randomly generated by a centered Gaussian process with covariance ρ_µ(s, t) = σ e^{−2 sin^2(π|s−t|)/l^2}. Thus, each sample presents a different functional mean. The parameters used in our study were β = 2, α = 1, σ = 3 and l = 0.5. Examples of trajectories generated by this model are shown in Figure 1.
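The simulation design can be sketched as follows, discretizing [0, 1] on a grid and drawing the mean once per sample. The small jitter term is a numerical stabiliser of our own, not part of the model.

```python
import numpy as np

def simulate_gp_sample(n=50, T=100, alpha=1.0, beta=2.0, sigma=3.0, l=0.5, rng=None):
    """Sketch of the simulation design: X(t) = mu(t) + eps(t) on a grid of [0, 1],
    with eps a centred Gaussian process with exponential covariance
    alpha * exp(-beta |s - t|), and mu drawn once per sample from a centred GP
    with periodic covariance sigma * exp(-2 sin(pi |s - t|)^2 / l^2)."""
    rng = np.random.default_rng(rng)
    t = np.linspace(0.0, 1.0, T)
    d = np.abs(t[:, None] - t[None, :])
    cov_eps = alpha * np.exp(-beta * d)
    cov_mu = sigma * np.exp(-2.0 * np.sin(np.pi * d) ** 2 / l ** 2)
    jitter = 1e-8 * np.eye(T)                   # numerical stabiliser
    # One functional mean per sample, then n independent error trajectories.
    mu = rng.multivariate_normal(np.zeros(T), cov_mu + jitter)
    eps = rng.multivariate_normal(np.zeros(T), cov_eps + jitter, size=n)
    return t, mu + eps
```

Combined with the observability generators above, this reproduces the kind of partially observed Gaussian samples used in the study.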
We considered small and large sample sizes, n = 200 and n = 1000. We also considered different percentages c of completely observed curves; specifically, we tested c = 0, 25, 50 and 75. In addition, we considered different percentages of time on which the incomplete curves of a sample were observed; henceforth, we denote this percentage by p%. We considered p = 25, 50 and 75 for small samples, but only p = 25 and 50 for large samples, due to the computational cost of the benchmark methods when n = 1000 and p = 75. Note that this parameter setting implies the highest computational cost for estimating covariance functions. Finally, we generated 100 replicates of each setting and report median MSE values.
Table 1 presents results for n = 200. It shows the MSE for the Gaussian data and points out the superiority of the Reg. Regression method (Kraus, 2015) when the covariance function is simple to estimate, as is the case of the exponential decay covariance involved in these data (see the left panel of Figure 2). We remark that, even in this case, the depth-based method is slightly better than the Opt. Operator method (Kneip and Liebl, 2020). When all the functions of the sample are partially observed (c = 0%), only the depth-based method can provide a reconstruction and, remarkably, its MSE remains close to that of the cases with a significantly large proportion of complete functions (c = 25, 50, 75%). Similar results are obtained with larger sample sizes (Table 2). Table 3 shows the same simulation setup with the AEMET data, but under the Random Intervals setting and the small sample size. In this setting, we generate partially observed data for different numbers of observed intervals (m), percentages of completely observed curves (c), and mean observability percentages of each partially observed curve (p). The results show that when p = 25 the Reg. Regression method typically performs better than the competitors. This is because our implementation of the partially observed setting produces a sample of curves that is not as densely distributed at the extremes of the domain as it is in the middle. Then, for small p, extrapolating with the depth-based method towards the dense part of the domain worsens the results.
However, when p increases, our implementation produces a sample of partially observed curves covering the complete domain densely, and the depth-based method outperforms the competitors. In Figure 3, we plot the reconstructions obtained by the three methods under consideration from one random sample. This sample was obtained by randomly taking 1000 curves from the fully observed curves of the AEMET data. Then, we generated partially observed data by applying the Missing-Completely-at-Random procedure based on random intervals described above, with m = 4, p = 50, and c = 50. Finally, we randomly selected one function to reconstruct, namely VALLADOLID/VILLANUBLA-1956, where VALLADOLID/VILLANUBLA refers to the location of the station and 1956 to the observation year. To give a general view of its shape, VALLADOLID/VILLANUBLA-1956 is completely plotted in red. In contrast, the reconstructions are only plotted on the four intervals where the curve was observed in our simulation. The top panels of the figure show the output produced by the depth-based method (Depth-based, in blue), by Kneip and Liebl (2020) (Opt. Operator, in green), and by Kraus (2015) (Reg. Regression, in black). The depth-based method is superior to the benchmark methods. The bottom panel of the figure shows some descriptive statistics related to the depth-based method. We show the years used for the reconstruction (the years of the curves in the envelope); the frequency of each year (the number of curves in the envelope from that year) is represented by a proportionally sized blue bubble. Similarly, we show on the right side the locations of the curves in the envelope; to our understanding, all of them share similar geographical features. Finally, Figure 4 illustrates the actual case of BURGOS/VILLAFRÍA-1943, a station that probably started operating in the middle of the year 1943. Consequently, only the second half of the year is recorded (red curve in the top panel). We apply the three reconstruction methods to complete the first half of the curve
(Reg. Regression (Kraus, 2015) in black, Opt. Operator (Kneip and Liebl, 2020) in green, and the depth-based method in blue).

Case study: Reconstructing Japanese mortality
The Human Mortality Database (https://www.mortality.org) provides detailed mortality and population data for 41 countries and areas, mainly developed ones. Some countries also offer more granular information by territorial subdivision, providing challenging spatial and temporal information. In particular, the Japanese mortality data set (http://www.ipss.go.jp/p-toukei/JMD/index-en.asp) is available for the 47 prefectures, for males, females, and the total population.
A common FDA approach to analyzing mortality data is to consider each function as the yearly mortality curve across age cohorts (Shang and Hyndman, 2017; Shang and Haberman, 2018; Gao et al., 2019; Shang, 2019). With this configuration, and arranging the 47 prefectures together, we deal with a male, female, or total Japanese mortality data set of size 2007. Of course, the prefectures do not all have the same number of functions, and the range of observed years also differs; roughly, we have yearly mortality functions between 1975 and 2016. In this case study, the poor availability of complete functions rules out a resampling exercise like the one performed for the AEMET data set, so we can only illustrate some real situations. Figures 5 and 6 present two real reconstruction problems and the results obtained from the three methods. In Figure 5, we reconstruct the shortest available curve, "Saitama-2007", for which mortality rates were only available on a very short interval corresponding to the youngest cohorts. Reg. Regression (Kraus, 2015) and Opt. Operator (Kneip and Liebl, 2020) produce smooth results (black and green, respectively); the depth-based method produces spikier results, in concordance with the other available curves. The bottom panel presents the bubble plot illustrating the period of the envelope functions and the prefectures on the map. Figure 6 presents the case of "Tottori-2015", a function that is not observed on six fragments of the domain (where only the red curve is visible).

Conclusion
This article introduces a non-parametric method to reconstruct samples of incomplete functional data. Our proposal relies on the concept of depth for POFD to select a subset of sample curves that shares shape and magnitude with the observed part of the curve to predict. We call this subset of sample curves the envelope, and we use it to propose a point reconstruction method based on a weighted average. The weights depend only on a parameter that we set by minimizing the MSE on the part of the domain where the curve to predict is observed.
We compare the performance of the new method with other alternatives from the literature.
Our simulation exercises consider simulated and real data and various random procedures to generate incomplete data scenarios. The available reconstruction methods seem hard to beat in the settings where the covariance can be efficiently estimated; these favorable circumstances are exemplified by Gaussian processes with stationary covariance functions, provided a considerable proportion of the curves is completely observed.
In contrast, our method outperforms the alternatives when the covariance cannot be properly estimated, due to a richer covariance structure or highly scarce data. To show this, we test the methods under severe incomplete-data settings and introduce more complex covariance structures: in particular, we decrease the number of completely observed functions down to zero and consider real data with complex covariance structures, such as yearly age-specific mortality and yearly temperature data.
Finally, our simulation exercises show that our proposal can provide a reasonable reconstruction output when every function is partially observed or, in other words, when there are no complete functions in the sample.
The depth-based method requires the Missing-Completely-at-Random assumption, as is standard in the literature. This assumption implies that the partially observed functions densely cover the reconstruction domain and that the observability process is not conditional on external information. Future research could allow for specific relationships between the functional process and the process that generates the partial observability. In addition, the depth-based algorithm requires a notion of proximity between POFD, which we supply with an L2 distance between the observed segments; developments along this line would also be valuable for our proposal.
In summary, this article provides an alternative data-driven and model-free method to reconstruct partially observed functional data, preferable under challenging scenarios like the ones presented here. Last but not least, we believe that the interpretability of the results might help provide different insights into the data under analysis.

Figure 1 :
Figure 1: In the top panels, an illustration of how Algorithm 1 works. The sample curves correspond to 1000 i.i.d. trajectories of a Gaussian process. For the left panels, we generated partially observed curves by removing six random intervals from [0, 1]. For the right panels, we considered missing data uniformly distributed. On average, only 50% of each curve was observed in both runs. The partially observed function that we reconstructed is colored in red and plotted entirely in the bottom panels, jointly with its estimation (in blue).

Figure 2 :
Figure 2: Covariance estimates based on the completely observed functions. Left panel: Gaussian processes with an exponential decay covariance. Central panel: Spanish daily temperatures, with lower covariance values in the Spring and Autumn periods and higher covariance in Summer and Winter. Right panel: Japanese age-specific mortality rates, with higher correlations at the oldest ages.

Figure 3 :
Figure 3: Simulated reconstruction exercise for "VALLADOLID/VILLANUBLA-1956". The top panel presents the reconstructions by Kraus (2015) (black) and Kneip and Liebl (2020) (green). Middle panel: the reconstruction provided by the depth-based method. Bottom panel: the spatial (Spanish map) and temporal (bubble plot) descriptive analysis offered by the depth-based methodology and the envelope.
The depth-based methodology supports the reconstruction with the additional information provided by the most important curves of the envelope. The subsample contains a distant-past function from 1905 from the MADRID-RETIRO station, more recent functions from the 1990s from the same BURGOS/VILLAFRÍA station, and a bigger proportion of functions from 1943, the same year as the partially observed function to reconstruct.

Figure 4 :
Figure 4: Real partially observed function, "BURGOS/VILLAFRÍA-1943". Top panel: reconstructions by Kraus (2015) (in black), Kneip and Liebl (2020) (in green), and the depth-based method (in blue). The bottom panel shows the descriptive analysis of the most relevant curves for reconstructing "BURGOS/VILLAFRÍA-1943": on the left, the temporal analysis (bubble plot of the involved years); on the right, the spatial analysis (map with the most relevant stations).

Figure 5 :
Figure 5: Reconstruction of the most poorly observed function of the sample, "Saitama-2007". The top panel presents the reconstructions given by the Reg. Regression method of Kraus (2015) (black), the Opt. Operator method of Kneip and Liebl (2020) (green), and the depth-based method (blue). The bottom panel shows the years of the most important functions of the envelope and the map with the corresponding prefectures, in red the one to be reconstructed.

Figure 6 :
Figure 6: Real partially observed function of the sample, "Tottori-2015", observed in six fragments. The top panel presents the reconstructions given by the Reg. Regression method of Kraus (2015) (black), the Opt. Operator method of Kneip and Liebl (2020) (green), and the depth-based method (blue). The bottom panel shows the years of the most important functions of the envelope, and the right map shows their prefectures.

Table 1 :
Median values of mean squared errors over 100 pseudo-random replicates. Each replicate is composed of 200 curves. A dash (-) indicates that the method cannot produce any reconstruction. The partially observed samples are obtained by observing p% of the total discrete realization points (Random Points). The smallest error is bolded for each combination of c and p.

Table 2 :
Median values of mean squared errors over 100 replications. Each replicate is composed of 1000 or 200 curves (results for the latter in parentheses). A dash (-) indicates that the method cannot produce any reconstruction. The partially observed samples are obtained by observing p% of the total discrete realization points (Random Points).

Table 3 :
Median values of the mean squared error after 100 replications.This table summarizes the exercise considering random samples from the fully observed AEMET data set.Each replicate is composed of 200 functional observations.The partially observed samples are obtained by restricting each function to m intervals of total length p% of the domain (Random Intervals).