This section describes UberNet, a deep learning CNN-based approach for the short-term demand prediction of ride-hailing services. The basic architecture of UberNet is based on WaveNet (Van Den Oord et al. 2016a), which was initially proposed for generating raw audio waveforms.
Architecture
Figure 1 depicts the network architecture of UberNet, including the embedding input, convolutional layers, residual blocks, activation functions, and output. The deep CNN architecture of UberNet takes an input embedding \(\mathbf{X}= \{\mathbf{X}(t_0),\mathbf{X}(t_1),\ldots , \mathbf{X}(t_s)\}\) and outputs \(\hat{y}(\mathbf{X}) = p(t_{s+\delta })\), where \(p(t_{s+\delta })\) is the predicted number of ride-hailing service pickups (e.g., Uber pickups) at time \(t_{s+\delta }\), given measurements up to time \(t_s\), via a number of hidden convolutional layers of increasing abstraction. Here \(\{\mathbf{X}(t_0),\mathbf{X}(t_1),\ldots , \mathbf{X}(t_s)\}\) is a vector time series comprising the pickups of ride-hailing services (e.g., Uber) \(\{p(t_0),p(t_1),\ldots , p(t_s)\}\) and a number of temporal and spatial features f that have been found to explain demand in ride-hailing services. It should be noted that f can be further divided into two distinct types, namely space-independent features (e.g., Feature set A in Table 1, see “Deep learning for Uber demand prediction in NYC”) and space-dependent ones (e.g., Feature sets B, C, and D in Table 1, see “Deep learning for Uber demand prediction in NYC”). Space-independent features are created by averaging all of their values within each time interval. Space-dependent features are created by averaging over both time (e.g., 15- and 30-min intervals) and the boroughs, so that they capture both temporal and spatial variation.
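As a minimal illustration of this preprocessing (our own sketch, not the authors' code; the column names timestamp, borough, pickups, temperature, precipitation and traffic_speed are hypothetical), the two feature types could be constructed along the following lines:

```python
import pandas as pd

# Hypothetical raw records: one row per observation with a timestamp,
# a borough label, the pickup count and some explanatory features.
raw = pd.read_csv("uber_pickups.csv", parse_dates=["timestamp"])

# Space-independent features (e.g., weather): average within each 15-min interval.
space_independent = (
    raw.set_index("timestamp")
       .resample("15min")[["temperature", "precipitation"]]
       .mean()
)

# Space-dependent features (e.g., pickups, traffic): average within each
# 15-min interval *and* each borough, so both time and space variation is kept.
space_dependent = (
    raw.groupby([pd.Grouper(key="timestamp", freq="15min"), "borough"])
       [["pickups", "traffic_speed"]]
       .mean()
)
```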
UberNet embeds the previous \(s+1\) timestamps as an \((s+1) \times f\) matrix via an embedding look-up operation, where f is the embedding vector size (c.f., the first layer in Fig. 1). Thus, each row of the matrix is mapped into the latent features of one timestamp. The embedding matrix is the “cache” of the \(s+1\) timestamps in the f-dimensional embedding space. Intuitively, CNN models that have been successfully applied to sequence transformation can be adapted to model the “cache” of a time-dependent traffic demand. However, real-world demand sequences entail a large number of “caches” of different sizes, for which conventional CNN structures with fixed receptive-field filters usually fail. Moreover, the filters that are most effective for text applications cannot easily be transferred to modelling demand sequence “caches”, since these (row-wise oriented) filters often fail to learn full-width representations (see “Dilated convolution”).
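To make the look-up concrete, the following PyTorch sketch (our illustration; the discretization of the normalised inputs into integer indices is an assumption, as the text does not spell this step out) builds an \((s+1) \times f\) “cache” and reshapes it for 1D convolution:

```python
import torch
import torch.nn as nn

s, f = 95, 32                      # window of s+1 = 96 timestamps, embedding size f
num_levels = 512                   # assumed number of discretized input levels

# Assumed preprocessing: each timestamp is discretized into an integer index
# (e.g., by binning the normalised pickups/features), giving an (s+1,)-index vector.
indices = torch.randint(0, num_levels, (s + 1,))

embed = nn.Embedding(num_levels, f)     # the look-up table
cache = embed(indices)                  # (s+1, f) "cache": one row per timestamp
x = cache.t().unsqueeze(0)              # reshape to (1, f, s+1) for 1D convolution
```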
Given the normalised time series, we propose to use filters (Asiler and Yazıcı 2017) that traverse the full columns of the sequence “cache”, i.e., each filter spans the entire feature width. Specifically, the width of each filter is equal to the width of the input “cache”, while its height varies and determines how many timestamps the sliding window covers at a time. To this end, UberNet operates directly on a vector time series sequence \(\mathbf{X}= \{\mathbf{X}_1,\ldots , \mathbf{X}_{s+1}\}\). The joint probability \(P(\mathbf{X})\) of the waveform \(\mathbf{X}\) is factorised as a product of conditional probabilities as follows:
$$\begin{aligned} P(\mathbf{X}) =\prod _{t=1}^{s+1}P\,(\mathbf{X}_t\, |\, \mathbf{X}_1, \ldots , \mathbf{X}_{t-1}), \end{aligned}$$
(1)
where \(P\,(\mathbf{X}_t\, |\, \mathbf{X}_1, \ldots , \mathbf{X}_{t-1})\) are conditional probabilities. Each datapoint \(\mathbf{X}_t\) is, therefore, conditioned on the values of all previous timesteps. Here, every conditional distribution is modelled by a stack of convolutional layers (see Fig. 1). To learn the conditional distributions \(P(\mathbf{X}_t\, |\, \mathbf{X}_1,\ldots , \mathbf{X}_{t-1})\) over the individual timesteps, a mixture density network or a mixture of conditional Gaussian scale mixtures can be employed (Theis and Bethge 2015). However, a softmax distribution at the final layer can give superior performance, even if the data is only partially continuous (Asiler and Yazıcı 2017) (as is the case for special events or holidays in Uber demand data).
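A minimal sketch of such a softmax output (ours, not the authors' implementation; it assumes the demand values have been discretized into a finite number of levels) is:

```python
import torch
import torch.nn as nn

num_levels = 512                       # assumed number of discretized demand levels
hidden_dim = 32

# Final layer: map the last hidden state to a categorical (softmax) distribution
# over the discretized demand levels, instead of a Gaussian-mixture output.
to_logits = nn.Linear(hidden_dim, num_levels)

hidden = torch.randn(8, hidden_dim)         # batch of 8 hidden states
logits = to_logits(hidden)                  # (8, num_levels)
probs = torch.softmax(logits, dim=-1)       # predictive distribution per sample

# Training with the usual cross-entropy loss on the true (discretized) level:
target = torch.randint(0, num_levels, (8,))
loss = nn.CrossEntropyLoss()(logits, target)
```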
Unlike traditional CNN structures that treat the input matrix as a 2D “cache” during the convolution operation, UberNet stacks the “cache” together through a look-up table mapping. As evidenced in Kim and Kwan (2018), optimal results are achieved when the embedding size is set to 2k, where k is the size of the inner channel of the CNN network. In addition, to better capture the spatio-temporal interactions in the 2D embedding input, we conduct a reshape operation, which is a prerequisite for applying the 1D convolutions in UberNet. We have one dilated filter of size \(1 \times 3\) and two regular filters of size \(1 \times 1\). The \(1 \times 1\) filters are introduced to change the channel size, which reduces the number of parameters to be learned by the \(1 \times 3\) kernel. The first \(1 \times 1\) filter changes the channel size from 2k to k, while the second \(1 \times 1\) filter performs the opposite transformation so as to maintain the spatial dimensions for the next stacking operation (see the residual blocks in Fig. 1). Since filters of different lengths lead to variable-length feature maps, a max pooling operation is performed over each cache, which selects only its largest value, resulting in a \(1 \times 1\) cache that is fed into the feedforward layers. The caches from these filters are concatenated to form a feature embedding, which is then fed into a softmax layer (see the last layer in Fig. 1) that yields the probabilities of the next timestamp.
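The following PyTorch sketch of this filter arrangement is our own illustration (the causal left-padding and the omission of activations and pooling are simplifying assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    """Sketch of the 1x1 -> dilated 1x3 -> 1x1 arrangement with a skip connection."""
    def __init__(self, k: int, dilation: int):
        super().__init__()
        self.reduce = nn.Conv1d(2 * k, k, kernel_size=1)                   # 2k -> k channels
        self.dilated = nn.Conv1d(k, k, kernel_size=3, dilation=dilation)   # 1x3 dilated filter
        self.expand = nn.Conv1d(k, 2 * k, kernel_size=1)                   # k -> 2k channels
        self.dilation = dilation

    def forward(self, x):                     # x: (batch, 2k, s+1)
        h = self.reduce(x)
        # left-pad so the 1x3 dilated convolution stays causal and preserves the length
        h = F.pad(h, (2 * self.dilation, 0))
        h = self.dilated(h)
        h = self.expand(h)
        return x + h                          # residual (skip) connection

block = BottleneckBlock(k=16, dilation=2)
out = block(torch.randn(1, 32, 96))           # (1, 2k, s+1) in and out
```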
The core building block of UberNet is the dilated causal convolution layer (see Fig. 1), which exploits key techniques such as gated activations and skip connections (see below). In the sequel, we first explain this type of convolution (causal and dilated) and then provide details on how the residual layers are implemented through skip connections. Causal convolution (see “Causal convolution”) is employed to handle temporal data, while dilated convolution (see “Dilated convolution”) is used to properly handle long-term dependencies.
Causal Convolution
In a traditional 1-dimensional convolution layer, we slide a receptive field of weights across an input series and apply it to the overlapping regions of the series. Let us assume that \(\hat{y}_0, \hat{y}_1, \ldots , \hat{y}_s\) are the outputs predicted at the time steps that follow the input series values \(\mathbf{X}(t_0), \mathbf{X}(t_1),\ldots\), \(\mathbf{X}(t_s)\). If \(\mathbf{X}(t_1)\) influences the output \(\hat{y}_0\), then future values are used to predict the past, which causes serious problems. Using future data to influence the interpretation of the past makes sense in the context of text classification, since later sentences can still influence the interpretation of previous ones. In the context of time series, however, we must generate future values in a sequential manner. To address this problem, the convolution is designed to explicitly prohibit the future from influencing the past: an input can only be connected to the outputs of future time steps in a causal structure. In practice, this causal 1D structure is easy to implement by shifting the outputs of a traditional convolution by a number of timesteps.
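A minimal sketch (ours, not the paper's code) of how such a causal convolution can be realised by left-padding, i.e., shifting the outputs so that no future value leaks into the prediction:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Illustrative causal 1D convolution: output at time t only sees inputs up to t."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.left_pad = kernel_size - 1     # shift that prevents future leakage

    def forward(self, x):                   # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))    # pad only on the past side
        return self.conv(x)

causal = CausalConv1d(channels=32, kernel_size=3)
y = causal(torch.randn(1, 32, 96))          # same length as the input: (1, 32, 96)
```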
Dilated Convolution
One way to handle long-term dependencies is to add one additional layer per time step to reach farther back in the series (i.e., to increase the output’s receptive field) (Neville and Jensen 2000). With a time series that extends over a year, using simple causal convolutions to learn from the entire history would dramatically increase the computational and statistical complexity. In UberNet, instead of employing standard convolutions, we design dilated convolutions to create the generative model, where a dilated layer applies its filter to a field broader than its original length by skipping input values with a fixed step (equivalently, by dilating the filter with zeros). This operation makes the model more efficient as it requires fewer parameters. Another advantage is that a dilated layer does not change the spatial dimensions of the input, so that stacking the convolutional layers and the residual structures is much faster. Formally, for a vector time series \(\mathbf{X}(t_0),\ldots ,\mathbf{X}(t_l)\) and a filter \(f:\{0,\ldots ,k-1\} \rightarrow {\mathbb {R}}\), the dilated convolution F on element t is given as
$$\begin{aligned} F(t) = \sum _{i=0}^{k-1}f(i) \cdot \mathbf{X}(t-d \cdot i) \end{aligned}$$
(2)
where d is the dilation factor, k is the filter size, and \(t-d \cdot i\) represents the direction of the past. Dilation thus works as a fixed step between every two adjacent filter taps. If \(d=1\), a dilated convolution boils down to a regular convolution.
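For concreteness, Eq. (2) can be evaluated directly as in the following NumPy sketch (ours; dropping boundary terms with negative indices is an assumption, since the text does not specify how the series start is handled):

```python
import numpy as np

def dilated_value(x: np.ndarray, f: np.ndarray, t: int, d: int) -> float:
    """Direct evaluation of Eq. (2): F(t) = sum_i f(i) * x[t - d*i] (illustrative only)."""
    k = len(f)
    return sum(f[i] * x[t - d * i] for i in range(k) if t - d * i >= 0)

x = np.arange(16, dtype=float)         # toy univariate series
f = np.array([0.5, 0.3, 0.2])          # filter of size k = 3
print(dilated_value(x, f, t=10, d=1))  # d = 1: boils down to a regular convolution
print(dilated_value(x, f, t=10, d=4))  # d = 4: taps at x[10], x[6], x[2]
```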
The dilated convolution operation is better suited to modelling long-range time series, and thus does not require large filters or additional layers. Practically speaking, the structure of Fig. 1 is stacked multiple times to further improve the model’s capacity. In addition, we wrap the convolutional layers in residual blocks, so as to ease the optimization of the deep neural network.
Masked Residual
The logic of residual learning is that several convolutional layers can be stacked as a block (see the residual blocks layer in Fig. 1). This allows multiple blocks to communicate with each other through the skip connection scheme by passing on the signature features of each block. The skip connection scheme directly trains the residual mapping instead of the conventional identity mapping. This scheme not only preserves the input information but also increases the values of the propagated gradients, alleviating the vanishing gradient issue. A residual block comprises a branch leading out to several transformations \(\tau\), the outputs of which are added to the input x of the block:
$$\begin{aligned} o = \mathrm{Activation}(x+\tau (x)) \end{aligned}$$
(3)
These operations allow the layer to learn the modifications of the identity mapping rather than the entire transformation, which has been known to be useful in deep learning networks.
UberNet employs two residual modules, as shown in Fig. 1 (Niepert et al. 2016). Each dilated convolutional layer is encapsulated into a residual block. The input layer and the convolutional one are stacked through a skip connection (i.e., the identity line in Fig. 1). Each block is structured as a pipeline of several layers, i.e., a normalization layer, an activation layer, a convolutional layer, and a softmax connection, arranged in a specific order. In this work we place the state-of-the-art normalization layer before each activation layer, which has shown superior performance to batch normalization when it comes to sequence processing. The residual connection allows each block’s input to bypass the convolution stage and then adds that input to the convolution output.
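A sketch of one such block following Eq. (3) is given below (our illustration: layer normalization and a ReLU activation are assumptions, since the text only names a “state-of-the-art” normalization and the gated activation is introduced separately in the next subsection):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Sketch of one residual block: normalize -> activate -> convolve, plus a skip connection."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.norm = nn.LayerNorm(channels)          # assumed choice of normalization
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.pad = (kernel_size - 1) * dilation

    def forward(self, x):                                    # x: (batch, channels, time)
        h = self.norm(x.transpose(1, 2)).transpose(1, 2)     # normalize over channels
        h = torch.relu(h)                                    # activation (ReLU assumed here)
        h = self.conv(F.pad(h, (self.pad, 0)))               # causal (dilated) convolution
        return x + h                                         # identity (skip) connection

out = ResidualBlock(32)(torch.randn(1, 32, 96))              # shape preserved: (1, 32, 96)
```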
Activation Function
UberNet employs the following gated activation unit in the residual blocks when stacking multiple residual blocks:
$$\begin{aligned} \mathbf{z}= \tanh (\mathbf{W}_{f,k} * \mathbf{X}) \odot \sigma (\mathbf{W}_{g,k} * \mathbf{X}), \end{aligned}$$
(4)
where \(*\) is the convolution operator, \(\odot\) is an element-wise multiplication, \(\sigma (\cdot )\) represents the nonlinear sigmoid function, k denotes the layer index, f and g denote the filter and the gate, respectively, and \(\mathbf{W}\) is a learnable convolution filter (Van den Oord et al. 2016b). The non-linearity and gated activation unit in (4) can outperform a rectified linear activation function, \(\max \{x,0\}\), as shown in Van den Oord et al. (2016b).
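A minimal PyTorch sketch of Eq. (4) (ours; the causal trimming of the padded convolution outputs is an implementation choice, not taken from the paper):

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """Gated activation unit of Eq. (4): tanh(W_f * X) elementwise-times sigmoid(W_g * X)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size - 1)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size - 1)

    def forward(self, x):                          # x: (batch, channels, time)
        t = x.size(-1)
        f = self.filter_conv(x)[..., :t]           # keep only the causal outputs
        g = self.gate_conv(x)[..., :t]
        return torch.tanh(f) * torch.sigmoid(g)    # element-wise gating

z = GatedActivation(32)(torch.randn(1, 32, 96))    # (1, 32, 96)
```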
Training of the Neural Network
The neural network weights can be trained using deterministic or stochastic gradient descent, with the ultimate goal of reducing the root mean square error (RMSE). Alternatively, the neural network outputs can be optimized by maximizing the log-likelihood of the input data with respect to the weights. To overcome overfitting, i.e., large weight values caused by noisy data that can make the neural network unstable, we employ \(\mathcal{L}_2\) regularization (weight decay). The cost function under optimization can be expressed as
$$\begin{aligned} E(\mathbf{w}) = \frac{1}{T} \sum _{t=1}^{T}\left( x_t - \hat{y}_t(x_t)\right) ^2+\frac{\lambda }{2}||w||^2, \end{aligned}$$
(5)
where T is the number of training data points, \(\mathbf{w}\in \mathbb {R}^q\) is a q-dimensional vector of weights, \(\lambda \in \mathbb {R}_{\ge 0}\) is a regularization (or penalty) parameter, and \(\hat{y}_t(x_t)\) denotes the forecast (output) of \(x_t\) using input data \(x_1,\ldots , x_{t-1}\). Intuitively, if \(\lambda = 0\) the regularization term vanishes and the cost function reduces to just the error term (the RMSE criterion), while \(\lambda > 0\) ensures that \(\mathbf{w}\) does not grow too large. Equation 5 is optimized using deterministic gradient descent and leads to a choice of weights that strikes a balance between overfitting and underfitting the training data. The \(\mathcal{L}_2\) regularization keeps the weights in an appropriate range so that the model performs better on unobserved data. Note that \(\mathcal{L}_2\) regularization can be combined with \(\mathcal{L}_1\) regularization for better results.
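As an illustration of Eq. (5) (a toy sketch with a stand-in linear model, not the UberNet training code), the \(\mathcal{L}_2\) penalty corresponds to the standard weight-decay option of a gradient-descent optimizer:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)                     # stand-in for the forecasting network
criterion = nn.MSELoss()                    # squared-error term of Eq. (5)

# weight_decay implements the (lambda/2) * ||w||^2 penalty of Eq. (5)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

x = torch.randn(64, 8)                      # toy inputs
y = torch.randn(64, 1)                      # toy targets

for _ in range(100):                        # plain (deterministic) gradient descent
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```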
The matrix in the last layer of the convolution structure (c.f., Fig. 1) has the same size as the input embedding. The output is a matrix containing the probability distributions of all timestamps in the output series, where this probability distribution is the one that actually generates the prediction results. For a practical neural network with tens of millions of timestamps, a negative sampling strategy can be applied to avoid computing the full softmax distribution. Once the sampling size is properly tuned, the performance obtained with such negative sampling strategies is almost the same as with the full softmax method (Kaji and Kobayashi 2017).
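The sketch below shows one common form of negative sampling (word2vec-style, as analysed in Kaji and Kobayashi 2017); it is our own illustration, and the vocabulary size, embedding dimension, and number of negatives are arbitrary placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def negative_sampling_loss(hidden, out_embed, target_idx, num_neg=20):
    """Score the true output class against a few random negatives
    instead of computing the full softmax over all classes."""
    batch = hidden.size(0)
    neg_idx = torch.randint(0, out_embed.num_embeddings, (batch, num_neg))
    pos_score = (out_embed(target_idx) * hidden).sum(-1)                         # (batch,)
    neg_score = torch.bmm(out_embed(neg_idx), hidden.unsqueeze(-1)).squeeze(-1)  # (batch, num_neg)
    return -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(-1)).mean()

out_embed = nn.Embedding(100_000, 32)          # assumed large output vocabulary of classes
hidden = torch.randn(8, 32)                    # final hidden states of a batch
target = torch.randint(0, 100_000, (8,))
loss = negative_sampling_loss(hidden, out_embed, target)
```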