1 Introduction

Given a sparse temporal tensor where one mode denotes time, how can we discover its latent factors? A tensor, or multi-dimensional array, has been widely used to model multi-faceted relationships in time-evolving data. For example, air quality data (Zhang et al., 2017), which contain measurements of contaminants collected from sensors at every time step, are modeled as a 3-mode temporal tensor (time, site, contaminant).

Tensor decomposition is a fundamental building block for effectively analyzing tensors by revealing latent factors between entities (Kolda & Sun, 2008; Kang et al., 2012; Oh et al., 2018; Park et al., 2017), and it has been extensively utilized in various real-world applications across diverse domains including recommender systems (Symeonidis, 2016), clustering (Sun et al., 2015), anomaly detection (Kolda & Sun, 2008), correlation analysis (Sun et al., 2006), network forensics (Maruhashi et al., 2011), and latent concept discovery (Kolda et al., 2005). CANDECOMP/PARAFAC (CP) (Harshman et al., 1970) decomposition is one of the most widely used tensor decomposition models; it factorizes a tensor into a set of factor matrices and a core tensor restricted to be diagonal.

Previous CP decomposition methods (Kolda et al., 2005; Sun et al., 2006, 2015; Kolda & Sun, 2008; Maruhashi et al., 2011; Matsubara et al., 2012; Kang et al., 2012; Symeonidis, 2016; de Araujo et al., 2017; Park et al., 2017; Oh et al., 2018) do not consider temporal properties when factorizing tensors, even though most time-evolving tensors exhibit time dependency. Recently, Yu et al. (2016) have tried to address the above problem, but their method considers only the past information of a time step to model the time dependency. A model needs to examine both the past and the future information to capture accurate temporal properties since information at each time point is heavily related to that in the past and the future. In addition, they do not consider the time-varying sparsity, one of the main properties of real-world temporal tensors. The main challenges in designing an accurate temporal tensor decomposition method for sparse temporal tensors are (1) how to harness the time dependency in real-world data, and (2) how to exploit the varying sparsity of time slices.

In this paper, we propose Time-Aware Tensor Decomposition (tatd), a tensor decomposition method for analyzing real-world temporal tensors. Our main observation is that values of adjacent time slices are mostly similar since time slices in a temporal tensor are closely related to each other. Based on this observation, tatd employs a kernel smoothing regularization to make time factor vectors reflect time dependency. Moreover, tatd imposes a time-dependent sparsity penalty to strengthen the smoothing regularization. The sparsity penalty modulates the amount of the regularization using the sparsity of time slices. tatd further improves accuracy using an effective alternating optimization scheme that incorporates an analytical solution and the Adam optimizer. Through extensive experiments, we show that tatd effectively considers the time dependency for tensor decomposition, and achieves higher accuracy compared to existing methods. Our main contributions are as follows:

  • Method We propose tatd, an accurate tensor decomposition method considering time dependency. tatd exploits a smoothing regularization to effectively model the time factor with time dependency, and adjusts it by utilizing the time-varying sparsity.

  • Optimization We propose an alternating optimization strategy suitable for our smoothing regularization. The strategy alternately optimizes the factor matrices with an analytical solution and the Adam optimizer.

  • Performance Extensive experiments show that exploiting time dependency and sparsity is crucial for accurate tensor decomposition of temporal tensors. tatd provides the state-of-the-art error (in terms of RMSE and MAE) for decomposing temporal tensors.

The rest of the paper is organized as follows. In Sect. 2, we explain preliminaries on tensor decomposition. Section 3 describes our proposed method tatd. We demonstrate our experimental results in Sect. 4. After reviewing related work in Sect. 5, we conclude in Sect. 6. The source code of tatd and datasets are available at https://github.com/snudatalab/TATD/.

2 Preliminaries

We describe the preliminaries of tensor and tensor decomposition. We use the symbols listed in Table 1.

Table 1 Table of symbols

2.1 Tensor and notations

Tensors are defined as multi-dimensional arrays that generalize one-dimensional arrays (or vectors) and two-dimensional arrays (or matrices) to higher dimensions. Specifically, the number of dimensions of a tensor is referred to as its order or way; the length of each mode is called its ‘dimensionality’ and denoted by \(I_1, \ldots , I_N\). We use boldface Euler script letters (e.g., \({\boldsymbol{{\mathcal{X}}}}\)) to denote tensors, boldface capitals (e.g., \({\mathbf{A}}\)) to denote matrices, and boldface lower cases (e.g., \(\mathbf {a}\)) to denote vectors. The \(\alpha = (i_1, \ldots , i_N)\)th entry of tensor \({\boldsymbol{{\mathcal{X}}}}\) is denoted by \(x_{\alpha }\).

A slice of a 3-order tensor is a two-dimensional subset of it. A 3-order tensor \({\boldsymbol{{\mathcal{X}}}}\) has horizontal, lateral, and frontal slices, denoted by \(\mathbf {X}_{i_1::}\), \(\mathbf {X}_{:i_2:}\), and \(\mathbf {X}_{::i_3}\), respectively. A tensor containing a mode representing time is called a temporal tensor.

A time slice in a 3-mode temporal tensor represents a two-dimensional subset disjointed by each time index. For example, \(\mathbf {X}_{i_t::}\) is an \(i_t\)th time slice when the first mode is the time mode. For brevity, we express \(\mathbf {X}_{i_t::}\) as \(\mathbf {X}_{i_t}\). Our proposed method is not limited to 3-mode tensors; a time slice in an N-order temporal tensor corresponds to an (N-1)-dimensional subset of the tensor sliced at each time index. We formally define a time slice \({\boldsymbol{{\mathcal{X}}}}_{i_{t}}\) as follows:

Definition 1

(Time slice \({\boldsymbol{{\mathcal{X}}}}_{i_{t}}\)) Given an N-order tensor \({\boldsymbol{{\mathcal{X}}}} \in \mathbb {R}^{I_1\times \cdots \times I_{N}}\) and a time mode t, we extract time slices of size \(I_1\times \cdots \times I_{t-1} \times I_{t+1}\times \cdots \times I_{N}\) by slicing the tensor \({\boldsymbol{{\mathcal{X}}}}\) so that an \(i_{t}\)th time slice \({\boldsymbol{{\mathcal{X}}}}_{i_{t}} \in \mathbb {R}^{I_1\times \cdots \times I_{t-1} \times I_{t+1}\times \cdots \times I_{N}}\) is an \((N-1)\)-order tensor obtained at time \(i_{t}\) where \(1 \le i_{t} \le I_{t}\). \(\square\)
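To make the notion concrete, the following is a minimal Python sketch (not from the paper) of extracting the observed entries of a time slice from a sparse tensor stored in coordinate (COO) form; the names `indices`, `values`, and `time_mode` are illustrative assumptions.

```python
import numpy as np

# Hypothetical COO representation of a sparse 3-mode tensor (time, site, contaminant):
# one row of `indices` and one entry of `values` per observed element.
def time_slice_nnz(indices, values, time_mode, i_t):
    """Return the observed entries that belong to the i_t-th time slice."""
    mask = indices[:, time_mode] == i_t
    return indices[mask], values[mask]

indices = np.array([[0, 1, 2], [0, 3, 1], [1, 0, 0], [2, 1, 2]])
values = np.array([0.7, 1.2, -0.3, 0.5])
slice_idx, slice_val = time_slice_nnz(indices, values, time_mode=0, i_t=0)
print(slice_idx, slice_val)  # the two entries observed at time index 0
```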

The Frobenius norm of a tensor \({\boldsymbol{{\mathcal{X}}}}\) \((\in \mathbb {R}^{I_{1} \times \cdots \times I_{N}})\) is given by \(||{\boldsymbol{{\mathcal{X}}}}||_F = \sqrt{\sum _{\alpha \in \Omega }{x^{2}_{\alpha }}}\), where \(\Omega\) is the set of indices of entries in \({\boldsymbol{{\mathcal{X}}}}\), \(\alpha = (i_1,\ldots , i_N)\) is an index included in \(\Omega\), and \(x_{\alpha }\) is the \((i_1, \ldots , i_N)\)th entry of the tensor \({\boldsymbol{{\mathcal{X}}}}\).

2.2 Tensor decomposition

We provide the definition of CP decomposition (Harshman et al., 1970; Kiers, 2000), which is one of the most representative decomposition models. Figure 1 illustrates the CP decomposition of a 3-way sparse tensor. Our model tatd is based on CP decomposition.

Definition 2

(CP decomposition) Given a rank K and an N-mode tensor \({\boldsymbol{{\mathcal{X}}}} \in \mathbb {R}^{I_{1} \times \cdots \times I_N}\) with observed entries, CP decomposition approximates \({\boldsymbol{{\mathcal{X}}}}\) by finding latent factor matrices \(\{{\mathbf{A}}^{(n)} \in \mathbb {R}^{I_n \times K}\,|\,1 \le n \le N\}\). The factor matrices are obtained by minimizing the following loss function:

$$\begin{aligned} {\boldsymbol{{\mathcal{L}}}} \left( {\mathbf{A}}^{(1)}, \ldots , {\mathbf{A}}^{(N)}\right)&= \sum _{\forall \alpha \in \Omega }{ \left( {x}_{\alpha }-\sum _{k=1}^{K} \prod _{n=1}^{N}a^{(n)}_{i_{n}k}\right) ^{2}} \end{aligned}$$
(1)

where \(\Omega\) indicates the set of the indices of the observed entries, \(x_\alpha\) indicates the \(\alpha = (i_1, \ldots , i_N)\)th entry of \({\boldsymbol{{\mathcal{X}}}}\), and \(a^{(n)}_{i_{n}k}\) indicates \((i_{n}, k)\)th entry of \({\mathbf{A}}^{(n)}\). \(\square\)
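As an illustration, a minimal NumPy sketch of the loss in Eq. (1) over the observed entries follows; `factors[n]` plays the role of \({\mathbf{A}}^{(n)}\), and all names are ours, not the paper's.

```python
import numpy as np

def cp_loss(indices, values, factors):
    """Squared reconstruction error of Eq. (1) over the observed entries.

    factors[n] is an I_n x K matrix playing the role of A^(n)."""
    K = factors[0].shape[1]
    loss = 0.0
    for alpha, x in zip(indices, values):
        prod = np.ones(K)                     # elementwise product of the selected rows
        for n, i_n in enumerate(alpha):
            prod *= factors[n][i_n]
        loss += (x - prod.sum()) ** 2         # sum over the K rank-one components
    return loss

# toy example: a 2 x 3 x 4 tensor of rank 2 with three observed entries
rng = np.random.default_rng(0)
factors = [rng.random((I_n, 2)) for I_n in (2, 3, 4)]
indices = np.array([[0, 1, 3], [1, 2, 0], [1, 0, 2]])
values = np.array([0.4, 1.1, 0.2])
print(cp_loss(indices, values, factors))
```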

Fig. 1 CP decomposition of a 3-way sparse tensor into K components

The standard CP decomposition method is not specifically designed to deal with time dependency; thus CP decomposition does not discover accurate temporal patterns in a time-evolving tensor. Although a previous work (Yu et al., 2016) has tried to capture temporal interaction, it does not (1) capture time dependency between adjacent time steps, and (2) exploit the sparsity of time slices. Our proposed tatd carefully captures temporal information and considers sparsity of time slices for better accuracy in decomposing temporal tensors.

3 Proposed method

In this section, we propose tatd (Time-Aware Tensor Decomposition), an accurate tensor decomposition method for sparse temporal tensors. We first introduce the overview of tatd in Sect. 3.1. We then explain the details of tatd in Sects. 3.2 and 3.3, and the optimization technique in Sect. 3.4.

3.1 Overview

tatd is a tensor decomposition method designed for sparse temporal tensors. There are several challenges in designing an accurate tensor decomposition method for temporal tensors.

  1. Model time dependency. Time dependency is an essential structure of a temporal tensor. How can we design a tensor decomposition model to reflect the time dependency?

  2. Exploit sparsity of time slices. A time-evolving tensor has varying sparsity across its time slices. How can we exploit the temporal sparsity for better accuracy?

  3. Optimization. How can we efficiently train our model and minimize its loss function?

To overcome the aforementioned challenges, we propose the following main ideas.

  1. Smoothing regularization (Sect. 3.2). We propose a smoothing regularization on the time factor to capture time dependency.

  2. Time-dependent sparsity penalty (Sect. 3.3). We propose a time-dependent sparsity penalty to further improve the accuracy.

  3. Careful optimization (Sect. 3.4). We propose an optimization strategy utilizing an analytical solution and the Adam optimizer to efficiently and accurately train our model.

Figure 2 illustrates the overview of tatd. We observe that adjacent time slices in a temporal tensor are closely related to each other due to a temporal trend of the tensor. Based on this observation, tatd uses smoothing regularization such that time factor vectors of adjacent time slices become similar. We also observe that different time slices have different densities. Instead of applying the same amount of regularization to all the time slices, we control the amount of regularization based on the sparsity of time slices such that sparse slices are affected more by the regularization. It is also crucial to efficiently optimize our objective function. We propose an optimization strategy exploiting alternating minimization to expedite training and improve the accuracy.

Fig. 2 Illustration of smoothing regularization and sparsity penalty by tatd. tatd accurately trains the time factor matrix with the smoothing regularization considering the time-varying sparsity. For a sparse time slice at \(t_1\), tatd fits the time factor with strong regularization to actively consider nearby slices. For a dense time slice at \(t_2\), tatd fits the time factor with weak regularization, paying little attention to nearby slices

3.2 Smoothing regularization

We describe how we formulate the smoothing regularization on tensor decomposition to capture time dependency. Our main observation is that temporal tensors have temporal trends, and adjacent time slices are closely related. For example, consider an air quality tensor containing measurements of pollutants at specific times and locations; it is modeled as a 3-mode tensor \({\boldsymbol{{\mathcal{X}}}}\) (time, location, type of pollutants; measurements). Since the amounts of pollutants at nearby time steps are closely related, the time slice \({\boldsymbol{{\mathcal{X}}}}_{t}\) at time t is closely related to the time slices \({\boldsymbol{{\mathcal{X}}}}_{t-1}\) at time \(t-1\) and \({\boldsymbol{{\mathcal{X}}}}_{t+1}\) at time \(t+1\). This implies that the time factor matrix obtained by tensor decomposition should have similar rows for adjacent time steps.

Based on the observation, our objective function is as follows. Given an N-order temporal tensor \({\boldsymbol{{\mathcal{X}}}} \in \mathbb {R}^{I_1 \times \cdots \times I_N}\) with observed entries \(\Omega\), the time mode t, and a window size S, we find factor matrices \({\mathbf{A}}^{(n)} \in {\mathbb {R}}^{I_{n}\times K}\) for \(1 \le n \le N\) that minimize

$$\begin{aligned} {\boldsymbol{{\mathcal{L}}}} = \sum _{\alpha \in \Omega }{\left( {x}_{\alpha }-\sum _{k=1}^{K} \prod _{n=1}^{N}a^{(n)}_{i_{n}k}\right) ^{2}} + \lambda _t \sum _{i_{t}=1}^{I_{t}}\Vert {\mathbf {a}_{i_{t}}^{(t)}}-{{\tilde{\mathbf {a}}_{i_{t}}^{(t)}}}\Vert _{\text {2}}^{2} + \lambda _r\sum _{n \ne t}^{N}\Vert {{\mathbf{A}}^{(n)}}\Vert _{\text {F}}^{2} \end{aligned}$$
(2)

where we define

$$\begin{aligned} {\tilde{\mathbf {a}}_{i_t}^{(t)}} = \sum _{i_s \in \mathscr {N}({i_t}, S)}{w(i_t,i_s)}{\mathbf {a}_{i_s}^{(t)}}, \end{aligned}$$
(3)

and \(\mathscr {N}(i_t, S)\) indicates adjacent indices \(i_s\) of \(i_t\) in a window of size S. \(\lambda _t\) and \(\lambda _r\) are regularization constants to adjust the effect of time smoothing and weight decay, respectively. \(\tilde{\mathbf {a}}_{i_t}^{(t)}\) in Eq. (3) denotes the smoothed row of the time factor. The \(\sum _{i_{t}=1}^{I_{t}}\Vert {\mathbf {a}_{i_{t}}^{(t)}}-{{\tilde{\mathbf {a}}_{i_{t}}^{(t)}}}\Vert _{\text {2}}^{2}\) term in Eq. (2) means that we regularize the \(i_t\)th row of the time factor toward the smoothed vector obtained from its neighboring rows in the factor. The weight \(w(i_t, i_s)\) determines how much the \(i_s\)th row of the time factor matrix contributes when smoothing the \(i_t\)th row of the time factor.

An important question is how to determine the weight \(w(i_t, i_s)\). We use the Gaussian kernel for the weight function for the following two reasons. First, it does not require any parameters to tune, and thus we can focus more on learning the factors in tensor decomposition. Second, it fits our intuition that a row closer to the \(i_t\)th row should be given a higher weight. In Sect. 4, we show that tatd with the Gaussian kernel outperforms all the competitors; however, we note that other weight functions could replace the Gaussian kernel to further improve the accuracy, and we leave this as future work.

Given a target row index \({i_t}\), an adjacent row index \({i_{s}}\), and a window size S, the weight function based on the Gaussian kernel is as follows:

$$\begin{aligned} w(i_{t},i_{s}) = \frac{{\mathcal {K}}(i_{t},i_{s})}{{\sum _{i_{s'} \in \mathscr {N}({i_t}, S)}} {\mathcal {K}}(i_{t}, i_{s'})} \end{aligned}$$
(4)

where \({\mathcal {K}}\) is defined by

$$\begin{aligned} {\mathcal {K}}(i_{t}, i_{s}) =\exp \left( -\frac{(i_{t}-i_{s})^2}{2\sigma ^2}\right) \end{aligned}$$

Note that \(\sigma\) affects the degree of smoothing; a higher value of \(\sigma\) imposes more smoothing. For each \(i_t\)th time slice, the model constructs a smoothed time factor vector \(\tilde{\mathbf {a}}_{i_t}\) based on nearby factor vectors \(\mathbf {a}_{i_s}\) and the weights \(w(i_{t},i_{s})\).

Our model then aims to reduce the smoothing loss between the time factor vector \(\mathbf {a}_{i_{t}}^{(t)}\) and the smoothed one \({{\tilde{\mathbf {a}}_{i_{t}}^{(t)}}}\).
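For concreteness, the following is a hedged NumPy sketch of Eqs. (3) and (4) and the resulting smoothing term in Eq. (2). It assumes a symmetric window of total size S around \(i_t\) that excludes \(i_t\) itself, which is one possible reading of \(\mathscr{N}(i_t, S)\); function and variable names are illustrative.

```python
import numpy as np

def gaussian_weights(i_t, neighbors, sigma=0.5):
    # Eq. (4): Gaussian kernel values normalized over the indices in the window
    k = np.exp(-((i_t - np.asarray(neighbors)) ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def smoothing_penalty(A_time, S=3, sigma=0.5):
    """Sum over i_t of || a_{i_t} - smoothed a_{i_t} ||_2^2 (second term of Eq. (2))."""
    I_t, _ = A_time.shape
    half = S // 2
    penalty = 0.0
    for i_t in range(I_t):
        neighbors = [i_s for i_s in range(max(0, i_t - half), min(I_t, i_t + half + 1))
                     if i_s != i_t]
        if not neighbors:
            continue
        w = gaussian_weights(i_t, neighbors, sigma)
        smoothed = (w[:, None] * A_time[neighbors]).sum(axis=0)   # Eq. (3)
        penalty += np.sum((A_time[i_t] - smoothed) ** 2)
    return penalty

A_time = np.random.rand(100, 5)        # a 100 x 5 time factor matrix
print(smoothing_penalty(A_time, S=3))
```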

3.3 Sparsity penalty

We describe how to further improve the accuracy of our method by considering the sparsity of time slices. The loss function (2) uses the same smoothing regularization penalty \(\lambda _t\) for all the time factor vectors. However, different time slices have different sparsity due to the differing numbers of nonzeros in time slices (see Fig. 3), and it is thus desirable to control the degree of the regularization penalty depending on the sparsity. For example, consider the 3-mode air quality tensor \({\boldsymbol{{\mathcal{X}}}}\) (time, location, type of pollutants; measurements), introduced in Sect. 3.2, containing measurements of pollutants at specific times and locations. Assume that the time slice \({\boldsymbol{{\mathcal{X}}}}_{t_1}\) at time \(t_1\) is very sparse, containing few nonzeros, while the time slice \({\boldsymbol{{\mathcal{X}}}}_{t_2}\) at time \(t_2\) is dense with many nonzeros. The factor row \(\mathbf {a}_{i_{t_2}}^{(1)}\) at time \(t_2\) can be updated easily using its many nonzeros. However, the factor row \(\mathbf {a}_{i_{t_1}}^{(1)}\) at time \(t_1\) does not have enough nonzeros in its corresponding time slice, and thus it is hard to train \(\mathbf {a}_{i_{t_1}}^{(1)}\) using only its few nonzeros; we need to actively use nearby slices to make up for the lack of data. Thus, it is desirable to impose more smoothing regularization at time \(t_1\) than at time \(t_2\).

Fig. 3 Time-varying density of five real-world datasets. The horizontal axis represents the number of nonzero entries in a time slice, and the vertical axis represents the number of time indices with that many nonzeros. Note that time slices have varying densities

Based on the motivation, tatd controls the degree of smoothing regularization based on the sparsity of time slices. Let the time sparsity \(\beta _{i_{t}}\) of the \(i_t\)th time slice be defined as

$$\begin{aligned} \beta _{i_{t}}&= 1 - d_{i_t} \end{aligned}$$
(5)

where a time density \(d_{i_t}\) is defined as follows:

$$\begin{aligned} d_{i_t}&= (0.999-0.001) \frac{ \omega _{i_{t}} - \omega _\textit{min}}{\omega _\textit{max} - \omega _\textit{min}} + 0.001 \end{aligned}$$
(6)

\(\omega _{i_{t}}\) indicates the number of nonzeros in the \(i_{t}\)th time slice; \(\omega _\textit{max}\) and \(\omega _\textit{min}\) are the maximum and the minimum values of the number of nonzeros in time slices, respectively. The time density \(d_{i_t}\) can be thought of as a min-max normalized version of \(\omega _{i_{t}}\), with its range rescaled to [0.001, 0.999].
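A minimal NumPy sketch of Eqs. (5) and (6) follows, assuming the per-slice nonzero counts \(\omega_{i_t}\) have already been computed; the names are illustrative.

```python
import numpy as np

def time_sparsity(nnz_per_slice):
    """Eqs. (5)-(6): min-max normalize nonzero counts into [0.001, 0.999], then invert."""
    w = np.asarray(nnz_per_slice, dtype=float)
    d = (0.999 - 0.001) * (w - w.min()) / (w.max() - w.min()) + 0.001   # time density
    return 1.0 - d                                                      # time sparsity

# Example: four time slices with 5, 50, 500, and 0 observed entries
print(time_sparsity([5, 50, 500, 0]))   # the sparser the slice, the closer beta is to 1
```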

Using the defined time sparsity, we modify our objective function as follows.

$$\begin{aligned} {\boldsymbol{{\mathcal{L}}}} = \sum _{\alpha \in \Omega }{\left( {x}_{\alpha }-\sum _{k=1}^{K} \prod _{n=1} ^{N}a^{(n)}_{i_{n}k}\right) ^{2}} +\sum _{i_{t}=1}^{I_{t}}\lambda _t \beta _{i_{t}} ||{{\mathbf {a}}_{i_{t}}^{(t)}}-{{\tilde{\mathbf {a}}_{i_{t}}^{(t)}}}||_{2}^{2} + \lambda _r\sum _{n \ne t}^{N}\Vert {{\mathbf{A}}^{(n)}}\Vert _{\text {F}}^{2} \end{aligned}$$
(7)

Note that the second term is changed to include the time sparsity \(\beta _{i_{t}}\); this makes the degree of the regularization vary depending on the sparsity of time slices.

Given the modified objective function in Eq. (7), we focus on minimizing the difference between \(\mathbf {a}_{i_t}^{(t)}\) and \(\tilde{\mathbf {a}}_{i_t}^{(t)}\) for time slices with a high sparsity rather than those with a low sparsity. tatd actively exploits the neighboring time slices when a target time slice is sparse, while it relies less on the neighbors for a dense time slice.

3.4 Optimization

To minimize the objective function in Eq. (7), tatd uses an alternating optimization method; it updates one factor matrix at a time while fixing all other factor matrices. tatd updates non-time factor matrices using the row-wise update rule (Shin et al., 2016) while updating the time factor matrix using the Adam optimizer (Kingma & Ba, 2014).

Updating non-time factor matrix. We note that updating a non-time factor matrix while fixing all other factor matrices reduces to a least squares problem, and we use the row-wise update rule (Shin et al., 2016) of ALS for it. The row-wise update rule is advantageous since it gives the optimal closed-form solution and allows parallel updates of factor rows. The update rule for the \(i_n\)th row of the nth factor matrix \({\mathbf{A}}^{(n)}\;(n \ne t)\) is given as follows:

$$\begin{aligned} \mathbf {a}_{i_n}^{(n)} \leftarrow \underset{\mathbf {a}_{i_n}^{(n)}}{{\text {arg}}\,{\text {min}}}\;{{\boldsymbol{{\mathcal{L}}}}\left( {\mathbf{A}}^{(1)},\ldots ,{\mathbf{A}}^{(N)}\right) } = \left[ {\mathbf{B}}_{i_n}^{(n)}+\lambda _r \mathbf {I}_{K}\right] ^{-1} \mathbf {c}_{i_n}^{(n)} \end{aligned}$$
(8)

where \({\mathbf{B}}_{i_n}^{(n)}\) is a \({K \times K}\) matrix whose entries are

$$\begin{aligned} {({\mathbf{B}}_{i_n}^{(n)})}_{k_1 k_2} = \sum _{\forall \alpha \in \Omega _{i_n}^{(n)}} \prod _{l \ne n} a^{(l)}_{i_l k_1}\prod _{l \ne n} a^{(l)}_{i_l k_2}, \forall k_1, k_2, \end{aligned}$$
(9)

\(\mathbf {c}_{i_n}^{(n)}\) is a length K vector whose entries are

$$\begin{aligned} {({\mathbf{c}}_{i_n}^{(n)})}_{k} = \sum _{\forall \alpha \in \Omega _{i_n}^{(n)}}x_{\alpha }\prod _{l \ne n} a^{(l)}_{i_l k}, \forall k \end{aligned}$$
(10)

and \(\Omega _{i_n}^{(n)}\) denotes the subset of \(\Omega\) whose nth mode’s index is \(i_n\).
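A minimal NumPy sketch of the row-wise update in Eqs. (8)–(10) follows; it loops over the observed entries for clarity rather than speed, and all names are illustrative.

```python
import numpy as np

def update_nontime_row(indices, values, factors, n, i_n, lam_r):
    """Closed-form update of the i_n-th row of A^(n) (n != time mode), Eqs. (8)-(10)."""
    K = factors[n].shape[1]
    B = np.zeros((K, K))   # B_{i_n}^{(n)} of Eq. (9)
    c = np.zeros(K)        # c_{i_n}^{(n)} of Eq. (10)
    for alpha, x in zip(indices, values):
        if alpha[n] != i_n:
            continue                       # only entries whose n-th index is i_n
        prod = np.ones(K)
        for l, i_l in enumerate(alpha):
            if l != n:
                prod *= factors[l][i_l]    # product of the other modes' factor rows
        B += np.outer(prod, prod)
        c += x * prod
    return np.linalg.solve(B + lam_r * np.eye(K), c)
```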

Updating time factor matrix. Updating the time factor matrix while fixing all other factor matrices is no longer a least squares problem, and thus we turn to gradient-based methods. We use the Adam optimizer, which has shown superior performance in recent machine learning tasks. We verify in Sect. 4 that using the Adam optimizer only for the time factor leads to faster convergence compared to other optimization methods.

Overall training. Algorithm 1 describes how we train tatd. We first initialize all factor matrices (line 1). For each iteration, we update a factor matrix while keeping all others fixed (lines 4–11). The time factor matrix is updated with Adam optimizer (line 7) until the validation RMSE increases, which is our convergence criterion (line 8) for Adam. Each of the non-time factor matrices is updated with the row-wise update rule (line 11) in ALS. We repeat this process until the validation RMSE continuously increases for five iterations, which is our global convergence criterion (line 12).

Algorithm 1 Overall training of tatd
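The following PyTorch sketch conveys the alternating structure of Algorithm 1 under simplifying assumptions: the tensor is small and dense, the smoothing term uses a fixed window of size 3 with uniform weights instead of Eqs. (3)–(4), the sparsity penalty is omitted, and validation-based early stopping is replaced by fixed iteration counts. It illustrates the ALS+Adam idea; it is not the released implementation.

```python
import torch

def cp_recon(A, B, C):
    # dense CP reconstruction of a 3-mode tensor
    return torch.einsum('ik,jk,lk->ijl', A, B, C)

def smoothing_term(A):
    # stand-in for the smoothing penalty: window of size 3, uniform weights
    return (A[1:-1] - 0.5 * (A[:-2] + A[2:])).pow(2).sum()

def train_sketch(X, K=5, lam_t=1.0, lam_r=0.01, outer_iters=10, adam_steps=20):
    I, J, L = X.shape
    A = torch.randn(I, K, requires_grad=True)    # time factor, trained with Adam
    B, C = torch.randn(J, K), torch.randn(L, K)  # non-time factors, closed-form updates
    opt = torch.optim.Adam([A], lr=0.01)
    for _ in range(outer_iters):
        # (1) Adam steps on the time factor with the smoothing regularization
        for _ in range(adam_steps):
            opt.zero_grad()
            loss = (X - cp_recon(A, B, C)).pow(2).sum() + lam_t * smoothing_term(A)
            loss.backward()
            opt.step()
        # (2) ridge-regularized least-squares updates of the non-time factors (cf. Eq. (8))
        with torch.no_grad():
            M = torch.einsum('ik,lk->ilk', A, C).reshape(-1, K)    # Khatri-Rao of A and C
            X2 = X.permute(1, 0, 2).reshape(J, -1)                 # mode-2 unfolding
            B = torch.linalg.solve(M.T @ M + lam_r * torch.eye(K), M.T @ X2.T).T
            M = torch.einsum('ik,jk->ijk', A, B).reshape(-1, K)    # Khatri-Rao of A and B
            X3 = X.permute(2, 0, 1).reshape(L, -1)                 # mode-3 unfolding
            C = torch.linalg.solve(M.T @ M + lam_r * torch.eye(K), M.T @ X3.T).T
    return A.detach(), B, C

X = torch.randn(50, 8, 6)           # toy dense temporal tensor (time mode first)
A, B, C = train_sketch(X)
```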

4 Experiment

We perform experiments to answer the following questions.

Q1: Accuracy (Sect. 4.2). How accurately does tatd factorize real-world temporal tensors compared to other methods?

Q2: Effect of smoothing regularization (Sect. 4.3). How accurately does tatd generate the time factor matrix and non-time factor matrices?

Q3: Effect of data sparsity (Sect. 4.4). How does the sparsity of input tensors affect the decomposition performance of tatd and other methods?

Q4: Running time (Sect. 4.5). How fast is tatd compared to competitors?

Q5: Effect of optimization (Sect. 4.6). How effective is our proposed optimization approach for training tatd?

Q6: Hyper-parameter study (Sect. 4.7). How do the different hyper-parameter settings affect the performance of tatd?

4.1 Experimental settings

4.1.1 Machine

All experiments are performed on a machine equipped with Intel Xeon E5-2630 CPU.

4.1.2 Datasets

We evaluate tatd on five real-world datasets summarized in Table 2.

  • Beijing Air Quality (Zhang et al., 2017) is a 3-mode tensor (hour, locations, atmospheric pollutants) containing measurements of pollutants. It was collected from 12 air-quality monitoring sites in Beijing from 2013 to 2017.

  • Madrid Air Quality is a 3-mode tensor (day, locations, atmospheric pollutants) containing measurements of pollutants in Madrid from 2011 to 2018.

  • Radar Traffic is a 3-mode tensor (hour, locations, directions) containing traffic volumes measured by radar sensors from 2017 to 2019 in Austin, Texas.

  • Indoor Condition (Candanedo et al., 2017) is a 3-mode tensor (10 min, locations, ambient conditions) containing measurements. There are two ambient conditions defined: humidity and temperature. We construct a fully dense tensor from the original dataset and randomly sample 70% of the elements to make a tensor with missing entries. In Sect. 4.4, we sample from the fully dense version of it.

  • Server Room is a 4-mode tensor (second, air conditioning, server power, locations) containing temperatures recorded in a server room. The “air conditioning” mode denotes the air conditioning temperature setup (24, 27, and 30 degrees Celsius); the “server power” mode indicates the server power usage scenario (\(50\%\), \(75\%\), and \(100\%\)).

Table 2 Summary of real-world tensors used for experiments

Before applying tensor decomposition, we z-normalize the datasets. Each dataset is randomly split into training, validation, and test sets with the ratio of 8:1:1; the validation set is used for determining an early stopping point.

4.1.3 Competitors

We compare tatd with the following competitors which use only the observed entries of a given tensor.

4.1.4 Metrics

We evaluate the performance using RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) defined as follows.

$$\begin{aligned} \text {RMSE} = {\sqrt{\frac{1}{|\Omega |}\sum _{\forall \alpha \in \Omega }{\left( x_{\alpha }-\hat{x}_{\alpha }\right) ^{2}}}} \quad \quad \text {MAE} = \frac{1}{|\Omega |}{\sum _{\forall \alpha \in \Omega }{|x_{\alpha }-\hat{x}_{\alpha }|}} \end{aligned}$$

\(\Omega\) indicates the set of the indices of observed entries. \(x_\alpha\) stands for the entry with index \(\alpha\) and \(\hat{x}_{\alpha }\) is the corresponding reconstructed value. In addition, we evaluate the quality of generated factor matrices from decomposition methods using Factor Match Score (FMS) and Time Factor Match Score (TFMS) defined as follows.

$$\begin{aligned} \text {FMS}&= \min _k \left( \left( 1 - \frac{|\xi _k - \hat{\xi }_k|}{\max (\xi _k, \hat{\xi }_k)}\right) \prod _{n=1, n \ne t}^{N}|\mathbf {a}^{(n) \intercal }_{:k}\hat{\mathbf {a}}^{(n)}_{:k}| |\mathbf {a}^{(t)\intercal }_{:k}\hat{\mathbf {a}}^{(t)}_{:k}|\right) \\ \text {TFMS}&= \min _k \left( \left( 1 - \frac{|\xi _k - \hat{\xi }_k|}{\max (\xi _k, \hat{\xi }_k)}\right) |\mathbf {a}^{(t)\intercal }_{:k}\hat{\mathbf {a}}^{(t)}_{:k}|\right) \end{aligned}$$

where \(\mathbf {a}^{(n)}_{:k}\) and \(\mathbf {a}^{(t)}_{:k}\) denote the kth columns, normalized to unit norm, of the original non-time factor matrix \({\mathbf{A}}^{(n)}\;(n \ne t)\) and the time factor matrix \({\mathbf{A}}^{(t)}\), respectively, and \(\xi _k\) denotes the weight of the kth column factor for \(k = 1, \ldots , K\). Similarly, \(\hat{\mathbf {a}}^{(n)}_{:k}\) and \(\hat{\mathbf {a}}^{(t)}_{:k}\) are the normalized kth columns of the extracted non-time factor matrix \(\hat{{\mathbf{A}}}^{(n)}\;(n \ne t)\) and the time factor matrix \(\hat{{\mathbf{A}}}^{(t)}\), respectively, and \(\hat{\xi }_k\) denotes the corresponding weight of the kth column factor. Note that FMS and TFMS are closer to 1 if the extracted factors and the original ones are more similar (see Acar et al., 2011 for the details of the metrics).
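For reference, a small NumPy sketch of RMSE, MAE, and TFMS follows; it assumes the columns of the true and extracted factors have already been matched (permutation-aligned) and that the component weights \(\xi_k, \hat{\xi}_k\) are given. Names are illustrative.

```python
import numpy as np

def rmse_mae(x, x_hat):
    # RMSE and MAE over the observed entries; x and x_hat are aligned 1-D arrays
    err = x - x_hat
    return np.sqrt(np.mean(err ** 2)), np.mean(np.abs(err))

def tfms(A_t, A_t_hat, xi, xi_hat):
    """Time Factor Match Score for already-matched columns.

    A_t, A_t_hat: true and extracted time factor matrices (I_t x K);
    xi, xi_hat: length-K component weights."""
    a = A_t / np.linalg.norm(A_t, axis=0)          # unit-normalize each column
    b = A_t_hat / np.linalg.norm(A_t_hat, axis=0)
    weight_term = 1 - np.abs(xi - xi_hat) / np.maximum(xi, xi_hat)
    return np.min(weight_term * np.abs(np.sum(a * b, axis=0)))
```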

4.1.5 Hyper-parameter

We use the hyper-parameters in Table 3 for tatd, except in Sect. 4.7 where we vary them. We use 0.5 for \(\sigma\), which adjusts the smoothing level in the kernel function. We vary the window size \(S \in \{1, 3, 5, 7, 9\}\) and find the optimal value for each dataset. For competitors, we follow the default hyper-parameter settings suggested in their papers and tune them using grid search. Specifically, we find regularization parameters from \(\{0.1, 1, 10, 100\}\) for TRTF. For CoSTCo, we find the learning rate from \(\{0.001, 0.01\}\) and the batch size from \(\{128, 256, 512\}\). We also change the activation function used in CoSTCo from ReLU to ELU. For all methods, we choose the optimal hyper-parameter settings using evaluation results on the validation sets.

Table 3 Default hyper-parameter setting

4.2 Accuracy (Q1)

We compare tatd with competitors in terms of RMSE and MAE in Table 4. tatd-0 indicates tatd without the sparsity penalty. tatd consistently gives the best accuracy for all the datasets; tatd-0 provides the second-best performance. tatd achieves up to \(1.08\times\) lower RMSE and \(1.37\times\) lower MAE compared to the second-best methods. Note that the Indoor Condition dataset exhibits strong smoothness and the least amount of noise in its values. The gap in RMSE between tatd and CP-ALS on the Indoor Condition dataset demonstrates that the smoothing regularization effectively decomposes real-world tensors by capturing temporal patterns and adjusting for noise.

Table 4 Performance of tensor decomposition by tatd and competitors

4.3 Effect of smoothing regularization (Q2)

We evaluate the effect of smoothing regularization of tatd, by comparing factor matrices extracted by tatd and CP-ALS with true factor matrices. For evaluation, we construct a synthetic tensor from true factor matrices, decompose it, and then compare factor matrices extracted from tatd and CP-ALS with the true ones. We create a 3-order synthetic temporal tensor of size \(900 \times 30 \times 30\) with factor matrices of rank 5. To generate the tensor, we first make a time factor matrix \({\mathbf{A}} \in \mathbb {R}^{900 \times 5}\) by randomly sampling five time-series from Indoor Condition dataset, and non-time factor matrices \({\mathbf{B}},\, {\mathbf{C}} \in \mathbb {R}^{30 \times 5}\) having random values from [0, 1). We then create a synthetic temporal tensor from those factor matrices with Eq. (1), add a noise tensor having random values from [0, 0.001) to it, and randomly sample \(90\%\) of the tensor to make it sparse.
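A hedged sketch of this synthetic-tensor construction follows; here the time factor is filled with random values as a stand-in for the five time-series sampled from the Indoor Condition dataset, and the exact noise and sampling procedure is a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, L, K = 900, 30, 30, 5

A = rng.random((I, K))                 # stand-in for the five sampled time-series
B, C = rng.random((J, K)), rng.random((L, K))

X = np.einsum('ik,jk,lk->ijl', A, B, C)            # dense CP reconstruction (Eq. (1))
X += rng.random((I, J, L)) * 0.001                 # additive noise in [0, 0.001)
mask = rng.random((I, J, L)) < 0.9                 # keep roughly 90% of the entries
indices, values = np.argwhere(mask), X[mask]
```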

Table 5 shows that tatd achieves higher FMS and TFMS than CP-ALS in decomposing the synthetic tensor. With the smoothing regularization, tatd precisely generates the time factor matrix, achieving \(1.11\times\) higher TFMS in the synthetic tensor. Furthermore, it leads to generating accurate non-time factor matrices, providing \(1.59\times\) higher FMS. Our smoothing regularization allows tatd to generate the time factor matrix and non-time factor matrices accurately.

Table 5 FMS and TFMS of tatd and CP-ALS for a synthetic tensor

4.4 Effect of data sparsity (Q3)

We evaluate the performance of tatd with varying data sparsity. We sample the data with the ratio of \(\{10, 30, 50, 70, 90\}\%\) to identify how accurately the method decomposes real-world tensors even when the data are highly sparse. Figure 4 shows the errors of tatd and competitors for five datasets. Note that tatd precisely decomposes the tensors even when they are highly sparse, compared to competitors. There are two reasons for the best performance of tatd as the sparsity increases. First, tatd is designed to learn the factor for a target slice by using its neighboring slices; this is especially useful when the target slice is extremely sparse and has no information to train its factor. Second, tatd explicitly considers sparsity in its model through the sparsity penalty, and imposes more regularization for sparser slices.

Fig. 4 Test RMSE of tatd and competitors for varying data sampling ratios. tatd shows the smallest errors when decomposing highly sparse tensors due to the careful consideration of sparsity

4.5 Running time (Q4)

We compare the running times of tatd and competitors. tatd uses the ALS+Adam optimizer (see Sect. 4.6), while tatd w/ Adam uses the Adam optimizer. For a stopping condition, we set the maximum iterations to 500 and use the early stopping technique.

Figure 5 shows the running times of all methods. Note that tatd w/ Adam shows the fastest running time; CoSTCo and tatd follow after that. Although tatd is slower than tatd w/ Adam and CoSTCo, tatd gives the smallest errors (see Table 4 and Figure 6), and thus provides more accurate results. Note that the running times of the methods vary over different datasets, since the convergence behavior of each method depends on the dataset.

Fig. 5 Comparison of the running times between tatd and competitors

4.6 Effect of optimization (Q5)

We evaluate our proposed optimization strategy in terms of error and running time. We call our strategy ALS+Adam and compare it with the following optimization strategies for tatd.

  • SGD a standard stochastic gradient descent method which is widely used for optimization.

  • Adam a recent gradient-based method using momentum and controlled learning rate.

  • Alternating Adam an alternating minimization method which updates a single factor matrix with Adam while fixing other factor matrices.

  • ALS + SGD an alternating minimization method which updates a time factor matrix with SGD and non-time factor matrices with the least square solution.

We use the same stopping condition mentioned in Sect. 4.5. We use the learning rates in Table 3 for the optimization schemes using Adam and the learning rate from \(\{10^{-3}, 10^{-4}, 10^{-5}\}\) for the schemes using SGD.

Figure 6 shows that our proposed strategy ALS+Adam and Adam show the best trade-off between the error and the running time. ALS+Adam produces the smallest error, but takes more time than Adam to achieve the better accuracy. On the other hand, Adam is faster but less accurate than ALS+Adam. We select ALS+Adam as our optimization strategy since our main focus is to achieve high accuracy.

Fig. 6 Comparison of optimization strategies in tatd. Our proposed strategy ALS+Adam and Adam show the best trade-off between the error and the running time. Note also that ALS+Adam provides the smallest RMSE

4.7 Hyper-parameter study (Q6)

We evaluate the performance of tatd with regard to hyper-parameters: smoothing regularization penalty and rank size.

4.7.1 Smoothing regularization penalty

The smoothing regularization penalty \(\lambda _t\) plays an important role in the performance of the proposed tatd; thus we vary the smoothing regularization penalty \(\lambda _t\) and evaluate the test RMSE in Fig. 7. Note that neither too small nor too large values of \(\lambda _t\) give the best results; a too small value of \(\lambda _t\) leads to overfitting, and a too large value leads to underfitting. The results show that the right amount of smoothing regularization gives the smallest error, verifying the effectiveness of our proposed idea.

Fig. 7 Effect of the smoothing regularization penalty parameter \(\lambda _t\) in tatd. Note that too small or too large values of \(\lambda _t\) lead to overfitting and underfitting, respectively. The right amount of smoothing regularization gives the smallest error, verifying the effectiveness of our proposed idea

4.7.2 Rank

We increase the rank K from 5 to 50 and evaluate the test RMSE in Fig. 8. We have two main observations. First, tatd shows a stable performance improvement with increasing ranks, compared to CP-ALS which shows unstable performance. Second, the error gap between tatd and competitors increases with increasing ranks. Higher ranks may make models overfit to a training dataset; however, tatd works even better for higher ranks since it exploits rich information from neighboring rows when regularizing a row of the time factor matrix. TRTF also works well for higher ranks on several datasets since it learns the parameters for each column factor. However, tatd achieves a similar or better effect without using excessive parameters, thanks to the smoothing regularization.

Fig. 8 Effect of rank on the performance of tatd. tatd works even better for higher ranks since it exploits rich information from neighboring rows when regularizing a row of the time factor matrix

5 Related work

We describe previous works that are closely related to our work, focusing on CP decomposition, one of the major tensor decomposition models.

5.1 Tensor decomposition

In the early stages, CP decomposition methods (Kang et al., 2012; Jeon et al., 2015; Choi et al., 2014) have been widely used for analyzing large-scale real-world tensors. Kang et al. (2012) and Jeon et al. (2015) propose distributed CP decomposition methods running on the MapReduce framework. Choi et al. (2014) propose a scalable CP decomposition method by exploiting properties of a tensor operation used in CP decomposition. Battaglino et al. (2018) propose a randomized CP decomposition method which reduces the overhead of computation and memory. However, these methods are not appropriate for highly sparse tensors since they assume that non-observed entries are zero.

Several CP decomposition methods have been developed to handle sparse tensors without setting the values of the non-observed entries to zero. Papalexakis et al. (2012) propose ParCube to obtain sparse factor matrices using a sampling technique in parallel systems. Beutel et al. (2014) propose FlexiFaCT, which performs a coupled matrix-tensor decomposition using Stochastic Gradient Descent (SGD) update rules. Shin et al. (2016) propose CDTF and SALS, which are scalable CP decomposition methods for sparse tensors. Smith and Karypis (2017) improve the efficiency of CP decomposition for sparse tensors by exploiting a compressed data structure. The above CP decomposition methods do not consider time dependency and time-varying sparsity, which are crucial for temporal tensors. On the other hand, tatd improves accuracy for temporal tensors by exploiting time dependency and time-varying sparsity.

Applications. CP decomposition has been used for various applications. Kolda et al. (2005) analyze a hyperlink graph modeled as a 3-way tensor using CP decomposition. Tensor decomposition is also applied to tag recommendation (Rendle et al., 2009; Rendle & Schmidt-Thieme, 2010). Sun et al. (2009) develop a content-based network analysis framework for finding higher-order clusters. Lebedev et al. (2015) exploit CP decomposition to compress convolution filters of convolutional neural networks (CNNs). Several works (Lee et al., 2018; Perros et al., 2017, 2018) use tensor decomposition for analyzing Electronic Health Record (EHR) data.

5.2 Tensor decomposition on temporal tensors

Tensor decomposition methods have been used to deal with diverse real-world temporal tensors. Dunlavy et al. (2011) propose a tensor decomposition method with an exponential smoothing technique for temporal link prediction. Matsubara et al. (2012) propose a tensor decomposition method with a probabilistic inference to discover hidden topics of web-click logs and perform multi-level analysis for long-term forecasting with this method. Yu et al. (2016) propose a matrix/tensor decomposition method with an autoregressive temporal regularization to handle general time-series. de Araujo et al. (2017) present a non-negative coupled tensor decomposition for forecasting future links in evolving social networks.

In addition, there exist various tensor decomposition methods to analyze spatio-temporal tensors that include spatial information in addition to time information. Zhou et al. (2015) propose a tensor decomposition method with a spatio-temporal regularization to predict missing entries in real-world traffic data. Afshar et al. (2017) propose a non-negative tensor decomposition method to discover interpretable patterns in spatio-temporal tensors. Liu et al. (2019) propose a general tensor decomposition method by exploiting the expressive power of convolutional neural networks to model non-linear interactions inside spatio-temporal tensors.

Beyond static tensors, many researchers have developed online tensor decomposition methods to deal with tensors collected in real time. Kasai (2016) proposes an online tensor decomposition method for sparse tensors corrupted by noise. Song et al. (2017) propose a dynamic tensor decomposition method for temporal multi-aspect streaming tensors. Zhou et al. (2018) propose a fast and memory-efficient online algorithm for sparse tensor decomposition.

However, those approaches are not designed for modeling time dependency from both past and future information, whereas tatd obtains a time factor considering neighboring factors for both past and future time steps, giving an accurate tensor decomposition result. Moreover, they do not exploit the temporal sparsity, a common characteristic of real-world temporal tensors, while tatd actively exploits the temporal sparsity.

6 Conclusion

We propose tatd (Time-Aware Tensor Decomposition), an accurate tensor decomposition method for sparse temporal tensors. To capture time dependency and sparsity in real-world temporal tensors, we design a smoothing regularization on the time factor, and adjust the amount of the regularization according to the sparsity of time slices. Moreover, we accurately optimize tatd with a carefully designed optimization strategy. Extensive experimental results show that tatd achieves higher accuracy in decomposing real-world tensors compared to competitors. Future work includes extending tatd to an online or a distributed setting.