1 Introduction

With the advancement of information technology, a significant amount of sequential data has become accessible in the field of biomedical research. This includes data from microarray and RNA-seq in genetic studies and patient health-tracking information in disease studies. These data are often presented as time series, longitudinal, and functional data, with repeated measurements taken over a period of time for each study subject. This type of data has been shown to be highly valuable in gaining insights into the underlying mechanisms of diseases, developing new diagnostic techniques and treatment plans, and ultimately enhancing patient healthcare [1, 2].

The main focus of this paper is to develop an approach that simultaneously performs denoising and clustering for sequential data. In particular, we focus on time series data and hope our method can be generalized to other data types in future work. Data denoising and smoothing are well-recognized challenges due to the complex fluctuations and seasonal variations present in time series, including both short- and long-term changes. Several methods have been proposed in the literature to address this problem, including total variation denoising [3, 4], Gaussian process filters [5,6,7], wavelet transforms [8, 9], Kalman filters [10, 11], and kernel smoothers [12,13,14]. We focus on trend filtering [15,16,17], a nonparametric smoothing method that fits a piecewise polynomial model to the data. Compared to other approaches, trend filtering achieves a desirable balance between easy model interpretation and theoretically guaranteed estimation accuracy [18]. Moreover, trend filtering can be implemented efficiently using the alternating direction method of multipliers (ADMM) algorithm [19]. Thanks to these properties, trend filtering has been widely used in denoising time series data such as annual GDP [20] and global surface temperature deviations [21].

Despite its success in signal denoising and curve estimation, trend filtering does not directly handle heterogeneity, which is a critical issue in biomedical applications. For example, large heterogeneity is known to exist in patients’ health information, genomic profiles, and treatment effects, and there is often a need to identify patient subgroups to improve the accuracy of disease diagnosis and personalized treatment [22]. In epidemiology, it is well known that heterogeneity exists in many diseases (e.g., malaria) across geographical regions and social networks [23, 24]. Our goal in this paper is hence to fill this gap. In particular, we propose a clusterwise trend filtering approach that simultaneously identifies the clustering structure among study subjects, where each cluster has a different mean trend over time fitted by its own trend filtering model. The result is expected to provide more useful insights into data heterogeneity than a marginal homogeneous model can offer. For example, in a lung cancer mortality study (more details in Sect. 4.1), the annual mortality rate is monitored over 48 continental states in the US between 1969 and 2009. By studying how the mean trend changes over time for different states, we are able to reveal interesting spatial heterogeneity patterns and relate the spatial clusters to environmental factors.

Clustering has been extensively researched in the fields of statistics and machine learning, as indicated in a recent survey [25]. Our proposed method integrates trend filtering with distance-based clustering approaches. We use K-means [26] as an example and show that our method offers the best of both worlds, inheriting the nice properties of both K-means and trend filtering in terms of easy implementation and computational efficiency. Through simulation studies, we show that our method effectively recovers the unknown cluster structure and clusterwise trends. We further demonstrate the utility of our method using two real-world examples. The rest of this paper is organized as follows. Section 2 offers a brief review of the trend filtering method and presents our proposed method. In Sect. 3, we use simulations to compare our method with a few existing approaches. We present two data analysis examples in Sect. 4 and discuss a few future research directions in Sect. 5.

2 Methods

In this section, we give a brief review of trend filtering, and then present our method in Sect. 2.2.

2.1 Trend Filtering Estimation

Consider a time series with T time points \({\textbf {y}} = (y_1,\ldots ,y_T) \in \mathbb {R}^T\). Trend filtering [15,16,17] provides a useful way of smoothing the data by considering a piecewise polynomial approximation. In particular, for a given non-negative integer q, the qth order trend filtering estimates \(\hat{\beta } \in \mathbb {R}^T\) by solving the following optimization problem:

$$\begin{aligned} \hat{\beta } = \underset{\beta \in \mathbb {R}^T}{\textrm{argmin}} \frac{1}{2} \big \Vert {\textbf {y}} - \beta \big \Vert ^2_2 + \lambda \big \Vert D^{(q+1)} \beta \big \Vert _1, \end{aligned}$$
(1)

where \(\beta\) is encouraged to follow a qth-order piecewise polynomial, \(\lambda\) is a non-negative tuning parameter controlling the trade-off between the smoothness of \(\beta\) and the approximation error \(\Vert {\textbf {y}} - \beta \Vert _2\), and \(D^{(q+1)}\) is the discrete difference operator of order \(q + 1\). For example, when \(q = 0\), the fitted values \(\beta = (\beta _1,\ldots ,\beta _T)\) form a piecewise constant structure, and

$$\begin{aligned} D^{(q+1)} = D^{(1)}= \begin{pmatrix} -1 & 1 & 0 & \ldots & 0\\ 0 & -1 & 1 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \ldots & -1 & 1 \end{pmatrix}_{(T-1) \times T}, \end{aligned}$$

which means \(\Vert D^{(1)} \beta \Vert _1 = \sum _{i = 1}^{T-1} |\beta _i - \beta _{i+1}|\), i.e., (1) yields the one-dimensional fused lasso [27].

For \(q \ge 1\), the operator \(D^{(q+1)} \in \mathbb {R}^{(T-q-1) \times T}\) is defined recursively by \(D^{(q+1)} = D^{(1)} \cdot D^{(q)}\). For example, when \(q = 1\), \(\Vert D^{(2)} \beta \Vert _1 = \sum _{i=2}^{T-1} |\beta _{i-1} - 2\beta _{i} + \beta _{i+1}|\), which is related to the Hodrick–Prescott filtering [28]. In general, \(\beta\) forms a piecewise linear structure when \(q=1\) and a piecewise quadratic structure when \(q = 2\), with

$$\begin{aligned} D^{(2)}= \begin{pmatrix} 1 & -2 & 1 & 0 & \ldots & 0 \\ 0 & 1 & -2 & 1 & \ldots & 0 \\ 0 & 0 & 1 & -2 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & -2 & 1 \end{pmatrix}, \quad D^{(3)}= \begin{pmatrix} -1 & 3 & -3 & 1 & \ldots & 0 \\ 0 & -1 & 3 & -3 & \ldots & 0 \\ 0 & 0 & -1 & 3 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & -3 & 1 \end{pmatrix}. \end{aligned}$$

As shown in Eq. (1), trend filtering estimation is a generalized lasso problem with an identity design matrix \(X = I\) and a specific choice of penalty matrix \(D^{(q+1)}\). Thus, it also shares properties of the generalized lasso; e.g., the degrees of freedom of the trend filtering estimate are \(\text {df}(\hat{\beta }) = \mathbb {E}(\text {number of knots in } \hat{\beta }) + q + 1\) [29]. The knots in \(\hat{\beta }\) can be understood as the change points in the time series, and their number equals the number of non-zero entries of \(D^{(q+1)}\hat{\beta }\) in the second term of Eq. (1). In addition, because (1) is strictly convex, the trend filtering estimate \(\hat{\beta }\) is the unique minimizer for every \(q \ge 0\). In summary, trend filtering enjoys several nice properties, including local adaptivity, computational efficiency, and easy interpretation [18], which makes it an ideal tool for our analysis.
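To make the estimation step concrete, the following R sketch applies trend filtering to a single simulated series using the genlasso package (the package we use for implementation in Sect. 3); the simulated trend, noise level, and choice of \(\lambda\) are illustrative assumptions rather than settings used in this paper.

```r
# A minimal sketch of qth-order trend filtering on one noisy series;
# the data-generating trend, noise level, and lambda below are illustrative.
library(genlasso)

set.seed(1)
T  <- 100
mu <- c(rep(0, 40), seq(0, 3, length.out = 30), rep(3, 30))  # piecewise linear trend
y  <- mu + rnorm(T, sd = 0.5)                                # noisy observations

q   <- 1                          # polynomial order (q = 0 gives the fused lasso)
fit <- trendfilter(y, ord = q)    # computes the solution path over lambda

# Extract the fitted trend at one lambda value along the computed path
lam      <- median(fit$lambda)
beta_hat <- as.numeric(coef(fit, lambda = lam)$beta)

plot(y, col = "grey", ylab = "y")
lines(beta_hat, lwd = 2)
```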

2.2 Clusterwise Trend Filtering

Consider a dataset of n time series, \(Y = \{{\textbf {y}}_1, {\textbf {y}}_2,\ldots , {\textbf {y}}_n\}\), where each \({\textbf {y}}_i = (y_{i1},\ldots ,y_{iT}) \in \mathbb {R}^T\) is a time series observed over T time points. Our goal is to simultaneously smooth each \({\textbf {y}}_i\) and cluster the time series. To achieve this goal, we consider a partition of the index set \(\{1,\ldots ,n\}\), denoted by \(\mathcal {C} = \{C_1, C_2,\ldots , C_K \}\), such that within each cluster the time series are assumed to follow the same mean structure, modeled by a piecewise polynomial sequence obtained from trend filtering. In general, any distance-based clustering method can be used to infer the clustering structure \(\mathcal {C}\). For simplicity, we choose K-means to demonstrate our idea. We propose to solve the following two optimization problems:

$$\begin{aligned}&\mathcal {C} = \underset{C_1, ..., C_K}{\textrm{argmin}} \sum ^n_{i = 1} \sum ^K_{k = 1} \mathbbm {1}\{ c_i = k\} \big \Vert {\textbf {y}}_i -\hat{\beta }_k\big \Vert _2^2, \nonumber \\&\hat{\beta }_k = \underset{\beta _k \in \mathbb {R}^T}{\textrm{argmin}} \frac{1}{2} \big \Vert \bar{{\textbf {y}}}_k - \beta _k \big \Vert ^2_2 + \lambda \big \Vert D^{(q+1)} \beta _k \big \Vert _1,~~~~k= 1,\ldots ,K, \end{aligned}$$
(2)

where \(\bar{{\textbf {y}}}_k = |C_k|^{-1} \sum _{i \in C_k} {\textbf {y}}_i\) is the average of the time series belonging to cluster k, K is the pre-specified number of clusters, and \(c_i\) is the cluster index for \({\textbf {y}}_i\), \(i=1,\ldots ,n\). It can be seen that the first optimization is similar to that of the original K-means, treating \({\textbf {y}}_i\) as the input data point and \(\hat{\beta }_k\) as the center of cluster k, which is obtained by fitting trend filtering to the cluster average to aid interpretation. An alternative approach is to borrow the idea of the K-medoids algorithm, which was proposed as a variant of K-means to reduce the influence of outliers [30]. Unlike K-means, K-medoids does not use the mean value, but instead selects a data point as the center of each cluster. Because K-medoids is computationally more expensive than K-means, as it involves computing the distances between all pairs of data points at each iteration [31], we choose to use K-means in our numerical implementation.

The optimization in (2) can be conveniently solved by the following procedure:

(1) Initialization: Set the cluster number K, and generate an initial partition \(\mathcal {C} = (C_1,\ldots ,C_K)\) by fitting K-means to the dataset, treating each data point as a T-dimensional vector.

(2) Obtain the trend filtering estimator \(\hat{\beta }_k\) for each cluster \(C_k\), \(k=1,\ldots ,K\).

(3) Update the partition \(\mathcal {C}\) by assigning each time series to its closest center, i.e., the updated \(\hat{\beta }_k\) from step (2).

(4) Repeat steps (2) and (3) until convergence.

In practice, we stop the algorithm once the partition \(\mathcal {C}\) remains unchanged over a few consecutive updates. Our proposed approach inherits the simplicity and computational convenience of K-means and trend filtering. In particular, both methods can be conveniently implemented in standard software packages such as R, and so can our method. The computational complexity of trend filtering is at most \(O(n^{3/2})\) [15, 17] and the complexity of K-means is \(O(n^2)\). Hence our method has an \(O(n^2)\) computational complexity, since the number of clusters K is fixed in advance.
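For illustration, a minimal R sketch of the four-step procedure is given below. It assumes the genlasso package, a fixed \(\lambda\), and an \(n \times T\) data matrix; the function name clusterwise_tf and its defaults are ours and purely illustrative (e.g., empty clusters and multiple random starts are not handled).

```r
# Illustrative sketch of the clusterwise trend filtering procedure in (2).
# Y: n x T matrix of time series; K: number of clusters; q: polynomial order.
library(genlasso)

clusterwise_tf <- function(Y, K, q = 1, lambda = 10, max_iter = 20) {
  # Step (1): initialize the partition with ordinary K-means on the raw series
  labels  <- kmeans(Y, centers = K, nstart = 10)$cluster
  centers <- matrix(0, K, ncol(Y))
  for (iter in 1:max_iter) {
    # Step (2): trend-filter each cluster average to obtain the cluster center
    for (k in 1:K) {
      ybar <- colMeans(Y[labels == k, , drop = FALSE])
      fit  <- trendfilter(ybar, ord = q)
      lam  <- min(max(fit$lambda), max(lambda, min(fit$lambda)))  # keep lambda on the path
      centers[k, ] <- as.numeric(coef(fit, lambda = lam)$beta)
    }
    # Step (3): reassign each series to its closest filtered center
    # (Euclidean distance; minimizing it also minimizes its square)
    d <- as.matrix(dist(rbind(centers, Y)))[-(1:K), 1:K]
    new_labels <- apply(d, 1, which.min)
    # Step (4): stop once the partition no longer changes
    if (all(new_labels == labels)) break
    labels <- new_labels
  }
  list(cluster = labels, centers = centers)
}
```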

At the same time, our method also faces the same challenges as trend filtering and K-means do. For example, the objective function of K-means is non-convex, which means the algorithm may converge to a local minimum instead of the global optimum, and the results may be sensitive to the choice of initial values. We therefore use multiple initial values to explore the parameter space more fully. Another challenge is the choice of hyperparameters, including the cluster number K and the polynomial order q. In practice, we choose q either based on prior knowledge (e.g., the shape of the trend) or by letting q take values within a range and picking the value that minimizes the total sum of squared errors between the raw clusterwise data averages \(\bar{{\textbf {y}}}_k\) and the filtered trends \(\hat{\beta }_k\). For the choice of the cluster number K, several approaches are available in the literature, such as the elbow method [26], the Silhouette score [32], and cross-validation [33]. However, there is no universally agreed criterion for determining the optimal value of K, especially for large-scale datasets with overlapping or fuzzy clusters. In our data analyses, we consider a reasonably wide range of values for K and choose one that yields a convenient interpretation, depending on the nature of the scientific application and the computational cost.
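As an illustration of the order selection just described, the sketch below chooses q over a small grid by minimizing the total squared error between the clusterwise averages and the filtered trends; it relies on the hypothetical clusterwise_tf() function from the sketch above and is not part of our formal method.

```r
# Illustrative selection of the polynomial order q on a grid, following the
# criterion described above; clusterwise_tf() is the sketch given earlier.
pick_q <- function(Y, K, q_grid = 0:3, lambda = 10) {
  sse <- sapply(q_grid, function(q) {
    fit <- clusterwise_tf(Y, K, q = q, lambda = lambda)
    sum(sapply(1:K, function(k) {
      ybar <- colMeans(Y[fit$cluster == k, , drop = FALSE])
      sum((ybar - fit$centers[k, ])^2)   # squared error for cluster k
    }))
  })
  q_grid[which.min(sse)]                 # order with the smallest total error
}
```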

Our method can easily be generalized to integrate with other distance-based clustering methods. For example, one may consider hierarchical clustering: start by trend filtering each time series to form separate clusters, then calculate the pairwise distances and merge the two closest clusters, and repeat this process until a proper number of clusters is obtained. Other distance-based clustering methods can similarly be adapted to work with the smoothed versions of the individual time series.

3 Simulation

3.1 Setting

We conduct simulation studies to evaluate the empirical performance of our proposed approach. We generate data with a mean structure following a piecewise polynomial model under four settings: a constant scenario, a linear scenario, a quadratic scenario, and a mixed scenario. For example, when the order is 0, time series from different clusters all present a piecewise constant trend, which contains several unknown phases (varying over clusters) and takes a constant value within each phase. Under the mixed scenario, time series in different clusters follow piecewise polynomials of different orders. More specifically, under the first three scenarios there are five different types of mean trends, corresponding to five clusters; under the mixed scenario there are 15 types of mean trends in total. For instance, in the constant scenario, the number of phases, the length of each phase, and the signal values may vary from cluster to cluster. Under all scenarios, each cluster includes 10 time series observed over \(T=100\) time points. We then add Gaussian noise to the generated mean trends with a standard deviation ranging from 0.4 to 1.8. A demonstration is given in Fig. 1, where the colored solid lines are the generated mean trends for each cluster. In particular, there are 5 clusters and 50 time series in each of the first three panels under the piecewise constant, linear, and quadratic scenarios, and 15 clusters and 150 time series under the mixed scenario.
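For concreteness, the following R sketch generates data in the spirit of the piecewise constant scenario; the specific phase lengths and signal values are our own choices and not the exact settings used in the simulation.

```r
# Illustrative data generation for the piecewise constant (q = 0) scenario:
# 5 clusters, 10 series per cluster, T = 100 time points, Gaussian noise.
set.seed(2023)
T <- 100; n_per_cluster <- 10; sigma <- 1.2
trends <- rbind(                      # one piecewise constant mean trend per cluster
  rep(c( 0,  2,  1), times = c(30, 40, 30)),
  rep(c( 1, -1,  2), times = c(20, 50, 30)),
  rep(c(-2,  0,  1), times = c(40, 30, 30)),
  rep(c( 2,  0, -1), times = c(25, 25, 50)),
  rep(c( 0,  3,  0), times = c(35, 35, 30))
)
Y <- do.call(rbind, lapply(1:nrow(trends), function(k) {
  t(replicate(n_per_cluster, trends[k, ] + rnorm(T, sd = sigma)))
}))
true_labels <- rep(1:nrow(trends), each = n_per_cluster)   # 50 series in total
```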

Fig. 1
figure 1

Four simulation data generation scenarios (noise SD \(=1.2\)) (Color figure online)

We compare our proposed method with two alternative approaches: K-medoids [34] and functional K-means clustering [35, 36]. Functional K-means provides a useful way to identify common patterns and trends among different groups of functional data. All numerical experiments are conducted in R on a compute server (256 GB RAM, with 8 AMD Opteron 6276 processors operating at 2.3 GHz and 60 processing cores). The average running time is 9.3 (SD = 1.3) minutes for analyzing one simulated dataset. Our method is implemented using the genlasso package [37] for the trend filtering step, and K-medoids is implemented using the cluster package. In the simulation, we assume the polynomial order q is known for the first three scenarios. For the mixed scenario, we assume q is unknown; to determine its value, we let q take values between 0 and 3 and pick the one that minimizes the fitted squared error.

3.2 Results

We conduct the simulation for 1000 replications and summarize the percentage of times the true cluster structure is correctly identified in Table 1. In addition, we calculate the Rand index (RI), a metric for evaluating clustering performance that takes values between 0 and 1 (higher is better) [38]. The RIs for our method and the two competing approaches are summarized in Fig. 2, where the blue lines are for our method, and the green and red lines are for functional K-means and K-medoids, respectively.
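For completeness, the Rand index can be computed directly from its definition as the fraction of pairs of series on which two partitions agree; the short sketch below is written from that definition rather than taken from any particular package.

```r
# Rand index between two partitions a and b (vectors of cluster labels),
# computed from the definition: the fraction of pairs treated consistently.
rand_index <- function(a, b) {
  pairs  <- combn(length(a), 2)            # all pairs of indices
  same_a <- a[pairs[1, ]] == a[pairs[2, ]] # together in partition a?
  same_b <- b[pairs[1, ]] == b[pairs[2, ]] # together in partition b?
  mean(same_a == same_b)                   # proportion of agreeing pairs
}
```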

Table 1 Simulation results: percentage of correct cluster structure identification (and associated maximum standard errors) for our method (Trend Filtering), functional K-means (Functional), and K-medoids, based on 1000 replications
Fig. 2
figure 2

Simulation accuracy: Rand index and associated \(95\%\) confidence bands for three methods under different data generating scenarios and noise levels (Color figure online)

We find that our method achieves the highest accuracy in all scenarios and under all noise levels. The advantage over the competing methods becomes more pronounced as the noise level increases (e.g., noise SD \(\ge 0.8\)), which indicates that our method works especially well for more fluctuating curves. As the order of the polynomial becomes larger, the clustering accuracy deteriorates for all methods, as expected. It is also worth mentioning that for the most difficult case, the mixed scenario, our method achieves an RI above 0.90 even when the noise level is fairly large (\(\ge 1.4\)). For the piecewise constant case, our method maintains an RI above 0.90 over all noise levels, while the other two methods see their RIs drop to around 0.65. All these observations confirm the excellent performance of our method.

4 Real-Data Examples

4.1 Lung Cancer Mortality Rate

Cancer is a leading cause of death in the United States. Among all types of cancer, cancers of the lung and bronchus are associated with the highest number of deaths. In 2019, an estimated 0.6 million people died of cancer in the United States, with 0.14 million deaths due to lung cancer. Past studies supported by the National Institutes of Health have suggested the existence of geographical patterns in lung and bronchus cancers; e.g., the highest incidence was found in the South (76.0 per 100,000) and the lowest in the West (58.8 per 100,000) [39]. In addition, many studies have discovered temporal change patterns in lung cancer mortality [40,41,42,43,44].

We analyze the lung cancer mortality rate data collected by the American Cancer Society, which cover the annual age-adjusted death rate due to lung cancer in 48 states in the US (excluding Alaska and Hawaii) from 1969 to 2009. In other words, the data consist of 48 time series observed over 41 years. As shown in Fig. 4, the temporal trend for most states follows a parabolic trajectory. The mortality rate increases continuously over the first two decades until a peak around 1990, followed by a decrease over the next two decades, with some states stabilizing at roughly the same level and others experiencing significant declines in mortality. Our goal is to explore the heterogeneity in the state-level mortality rate curves. This is important because the resulting clusters provide insights into which factors may influence the mortality rates. For example, a spatial pattern can be seen in the mortality rate map. Some neighboring states, such as Washington and Oregon, show similar changes in their mortality rates, with a sharp decline after reaching the peak of the parabola. In comparison, some neighboring states in the Southeast, such as Mississippi and Alabama, have mortality rates that stabilized after an earlier upward trend. In addition, Utah’s pattern is distinctly different from those of its nearby states, as shown in Figs. 3 and 4.

Fig. 3
figure 3

Age-adjusted mortality rates of lung cancer from 1969 to 2009 for five selected states

Fig. 4
figure 4

Spaghetti plot of lung cancer mortality rates for the 48 contiguous states in the U.S. (Color figure online)

We apply our proposed method and set \(q=2\) to capture the parabolic trajectory of the curves. To determine the cluster number K, we fit the model letting K take values from 3 to 8 and consider several criteria, including the elbow method, the silhouette coefficient [32], the Calinski–Harabasz index [45], and the Gap statistic [46]. We choose \(K = 4\) since it is preferred by most of the criteria considered. The clustering result is shown in Fig. 5, where the four clusters are marked by four colors. There is a fairly clear geographic pattern in the result. For example, cluster A consists of spatially contiguous states located in the Rocky Mountains and the Mid-Atlantic region, and cluster D is mainly located in the mid-eastern and southern parts of the US, with the exceptions of Nevada and Maine. Utah forms its own cluster due to its low mortality rate compared to the rest of the U.S. Our clustering result is also presented in Fig. 4, where it is clear that cluster A has the lowest mortality rate (excluding Utah), while cluster D has the highest mortality rate and the fastest growth during 1970–1990. Cluster D also has higher variation compared to the other clusters.
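As an illustration of how such criteria can be computed, the sketch below evaluates the within-cluster sum of squares (for the elbow method) and the average silhouette width over the candidate range K = 3, ..., 8; it assumes the hypothetical clusterwise_tf() function from Sect. 2.2, the silhouette() function from the cluster package, and a 48 x 41 state-by-year data matrix Y.

```r
# Illustrative computation of two of the criteria mentioned above for K = 3:8;
# clusterwise_tf() is the earlier sketch, Y is the 48 x 41 mortality matrix.
library(cluster)

Ks <- 3:8
crit <- t(sapply(Ks, function(K) {
  fit <- clusterwise_tf(Y, K, q = 2)
  wss <- sum(sapply(1:K, function(k)            # within-cluster sum of squares
    sum(sweep(Y[fit$cluster == k, , drop = FALSE], 2, fit$centers[k, ])^2)))
  sil <- mean(silhouette(fit$cluster, dist(Y))[, 3])  # average silhouette width
  c(within_ss = wss, avg_silhouette = sil)
}))
rownames(crit) <- Ks    # look for an elbow in within_ss / a peak in avg_silhouette
```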

Fig. 5
figure 5

U.S. map with four clusters obtained by our proposed method (Color figure online)

We are also able to relate our clustering result to two main risk factors of lung cancer: smoking, which is the number one risk factor [47,48,49], and air pollution, which also contributes to lung cancer [50,51,52]. As a reference, we present the state-level plots for both risk factors in Fig. 6a and b. Utah has the lowest adult smoking rate in the country at 9%, while most states in cluster D have higher smoking rates [darker blue color in panel (a)], e.g., Arkansas at 22% and Kentucky at 23%. Similar findings can be obtained from Fig. 6b. For example, clusters A and C in general have lower air pollution compared to the other regions, which is consistent with the fact that these two clusters have lower mortality rates. Meanwhile, cluster D has the highest air pollution index and, hence, the highest mortality rate. These findings highlight the utility of our method in discovering meaningful clustering and temporal patterns in mortality rate curves.

Fig. 6
figure 6

U.S. map based on two leading risk factors associated with the lung cancer (Color figure online)

4.2 Suicide Rate Study

Next we consider a suicide mortality study. According to the World Health Organization (WHO), more than 0.7 million people die by suicide every year, and this number has kept increasing since the COVID-19 pandemic [53,54,55]. Many factors contribute to the risk of suicide, including mental illness, stigma, financial reasons, and alcohol and drug misuse [56]. In recent years, researchers have also discovered temporal and spatial patterns in suicide rates [57,58,59]. To verify the effectiveness of our method, we study a 30-year dataset on suicide mortality in the U.S. The data are available as a CDC Wide-ranging Online Data for Epidemiologic Research (WONDER) dataset, which provides the annual suicide mortality rates for the 48 contiguous states in the continental United States (excluding Alaska and Hawaii) from 1990 to 2019. As shown in Fig. 7a, the suicide rate exhibits a ‘V’ shape for most states, i.e., there are two phases over the observed 30-year period. During the first phase (the first 10–15 years), the suicide mortality rate keeps declining. This trend is especially obvious for states such as California, Nevada, Illinois, and New York. The next 15–20 years form the second phase, a strong rebound in which the mortality rate in many states had far exceeded the initial 1990 level by the end of 2010.

Fig. 7
figure 7

Suicide mortality rate clusters (Color figure online)

We apply our method and choose \(q=2\), i.e., a piecewise quadratic trend. The cluster number K is determined to be 3 according to a combination of the elbow method, the silhouette coefficient, and the Gap statistic. The clustering results are provided in Fig. 7a and the clusterwise average curves are given in Fig. 7b. The results exhibit a clear geographical pattern even though we did not include any spatial information in our analysis. For example, cluster II (green) consists of 11 contiguous states in the midwestern part of the U.S.; cluster III (blue) contains 29 states, the majority of which are in the eastern and mid-eastern US, with the exception of Washington; and cluster I is the smallest cluster, containing California, Illinois, New Jersey, and Massachusetts. As shown in Fig. 7b, the suicide mortality rate is highest in cluster II, followed by clusters III and I. One possible explanation is that a high suicide rate is often associated with low economic status. For example, the WHO reports that 77% of global suicides occur in low- and middle-income countries. This is reflected in our results; e.g., cluster I, despite having the fewest states, has the best economic and welfare development and hence the lowest suicide rate.

5 Discussion

In this paper, we propose a new time series clustering method that performs smoothing in the temporal direction and learns heterogeneity at the subject level. Our method builds on the ideas of K-means clustering and trend filtering, and can be extended to integrate with other distance-based clustering methods. Numerical results have confirmed the utility of our method in terms of cluster structure recovery and time series denoising. Our data analysis results suggest that the clustering results can provide useful guidance on the inclusion of covariates (e.g., spatial, environmental, and economic factors) in future analyses such as regression.

Several future research directions remain open for this topic. First, it will be of interest to generalize our method to analyze longitudinal and functional data where the observations are collected at non-equally spaced time points. Classical trend filtering cannot perform smoothing over irregular time intervals; instead, one may consider other smoothing methods such as wavelet or kernel approaches. Second, it will be of interest to develop Bayesian methods that account for the uncertainty associated with estimating the cluster number and polynomial order, for example by using Gaussian processes and their generalizations [60, 61]. In addition, studying theoretical properties, such as convergence analysis of the algorithm and risk analysis of the curve estimation in this context, is another important direction. Finally, developing a spatial clustering method that accounts for spatial dependence may help improve the performance of our method in our data examples.