Let’s apply our analysis to the data of Sect. 1.3 (i.e., Donovan et al. (2016)), which reverse-engineers estimates of traffic counts from origin-destination pairs for taxi trips. Our taxi dataset is an illustrative (and perhaps scaled) proxy for true traffic counts, but we recognize that it is potentially biased. Our methodology could readily be adapted when unbiased counts (possibly available from private companies or from dedicated traffic-counting sensors installed by municipalities; see also [pem]) are available.
The dataset contains hourly traffic data in the interval
$$\begin{aligned}{}[\text {2011-01-01 00:00},\,\text {2012-01-01 00:00}). \end{aligned}$$ (13)
We thus have
$$\begin{aligned} T= \underbrace{365}_{\text {days per year}} \times \underbrace{24}_{\text {hours per day}}= 8,760 \end{aligned}$$
time records. Arranging them in order, we get \({\mathcal{T}}\) of Sect. 2.1.
The dataset contains estimates of taxi counts for \(L_\circ \overset{\text {def}}{=}260,855\) one-directional links (roadways) in New York City; a two-directional road segment is represented as two one-directional links. We use OpenStreetMap labels. Table 1 gives a statistical summary of these links.
Table 1 Statistics of link lengths (in meters)

In total, the dataset of Donovan et al. (2016) thus has
$$\begin{aligned} 8,760\times 260,855 \approx 2.3\times 10^9 \end{aligned}$$
entries. About \(95\%\) of these (2,181,923,208) are zero. Since New York City is an urban environment and is unlikely to have empty roads at any time of day, we reset these zero entries to \(\text {NaN}\). To restrict our calculations to a subset for which we have somewhat reliable data, we consider only those links with at most 720 missing hours (30 days' worth) of data in total. We get \({\mathcal{L}}\) with \(L\overset{\text {def}}{=}2,302\). Selecting more links would incorporate more traffic data into our analysis, but at the cost of a higher number of missing data entries. The distribution of entries is shown in Fig. 1. Multiplying these links by the time horizon of 8,760 hours gives over \(2\times 10^7\) data points. Our goal is to appropriately identify and interpret complex patterns in this data. Figure 1 shows that the 2,302 most-traveled links capture \(19\%\) of the entries in D. Adding more (sparsely traveled) links would correspond to adding sparse columns to D, perhaps with diminishing returns for data interpretation.
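As an illustration, a minimal sketch of this filtering step, assuming the hourly counts have been loaded into a pandas DataFrame `counts` with the 8,760 hours as rows and the 260,855 links as columns (the file name and variable names are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical loader: hourly taxi counts, rows = hours, columns = links.
counts = pd.read_csv("taxi_counts.csv", index_col=0, parse_dates=True)

# Zero counts are implausible for an urban network, so treat them as missing.
counts = counts.replace(0, np.nan)

# Keep only links with at most 720 missing hours (30 days' worth) of data.
missing_hours = counts.isna().sum(axis=0)
reliable_links = missing_hours[missing_hours <= 720].index
D = counts[reliable_links]        # data matrix D: 8,760 x L (here L = 2,302)
```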
Constants, Initialization, and Computational Considerations
For our year-long taxi traffic dataset, on the links of \({\mathcal{L}}\), we use \(\eta =4171\) together with the \((N,\beta )\) pairs compared in Table 2.
These values stem from a grid search over runs of the algorithm of Sect. 2.5. The initial entries of W and H (i.e., the initial conditions) are taken to be i.i.d. Uniform(0, 1) random variables, thus ensuring that W and H start with full rank. We disregard runs which lead to zero columns of W; such zero columns are invariant under our multiplicative update rule and indicate sub-optimal use of rank. Note that, by (5) and (6), an iteration leading to a zero column of W will produce a zero row of H, and vice versa. The algorithm is relatively insensitive to changes in \(\eta\), and the above value of \(\eta\) leads to values of W which are not too large. Fixing \(\eta\) at this value, we carry out a refined grid search over the \((N, \beta )\) parameters. Table 2 gives specific numerical results for two pairs of \((N,\beta )\) parameters.
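A minimal sketch of the initialization and the zero-column check is given below; the multiplicative updates (5) and (6) of Sect. 2.5 are abstracted as a user-supplied `update_step`, and the function name and iteration count are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_factorization(D, N, update_step, n_iter=500):
    """One run from a random start; update_step stands in for the
    multiplicative updates (5)-(6) of Sect. 2.5 (not reproduced here)."""
    T, L = D.shape
    # i.i.d. Uniform(0, 1) initial conditions, so W and H start with full rank.
    W = rng.uniform(0.0, 1.0, size=(T, N))
    H = rng.uniform(0.0, 1.0, size=(N, L))
    for _ in range(n_iter):
        W, H = update_step(D, W, H)
    # Disregard runs that end with a zero column of W (sub-optimal use of rank);
    # by (5)-(6), such a column would stay zero under further updates.
    if np.any(W.sum(axis=0) == 0.0):
        return None
    return W, H
```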
By thresholding H, we can forcibly express our approximation of each column of D as a linear combination of as few signatures as possible. For each column of H, we set all entries below the 40th percentile of that column to zero. In practice, this reduces each link to a linear combination of at most eight signatures. Table 3 updates Table 2 once we have thresholded.
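A minimal sketch of this thresholding step, assuming H is held as a NumPy array with one column per link (the function name is ours):

```python
import numpy as np

def threshold_H(H, q=40):
    """Zero out, in each column of H, the entries below that column's q-th percentile."""
    cutoffs = np.percentile(H, q, axis=0)          # one cutoff per column (link)
    return np.where(H >= cutoffs, H, 0.0)
```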
Table 2 Comparison of two \((N,\beta )\) choices without thresholding \(H\)

The results of the grid search in \((N,\beta )\) are summarized in Figs. 2 and 3. These figures confirm a natural tradeoff: higher N allows lower error, while higher \(\beta\) leads to better sparsity; error and sparsity are, however, competing objectives. We generally consider a sparsity value between 0.8 and 1.0 to be sufficient, as it allows for only a handful of dominant signatures per link.
Table 3 Updated comparison of two \((N,\beta )\) choices after thresholding \(H\)

Informally, higher values of \(\beta\) are likely to mean sparser H, and higher values of N mean a larger universe of signatures. Hence, in cases where the algorithm produces a zero signature, we take this to mean that the matrix factorization needs a smaller rank.
To focus on the interpretation of the output of the algorithm, let us now fix \(N=50\) (the smaller value of N) and \(\beta =5000\) for the rest of this section. The output consists of two matrices W and H of sizes \(8,760\times 50\) and \(50\times 2,302\), respectively. The columns of W are \(L^1\)-normalized and the columns of H are sparse.
Recall that columns of W represent traffic signatures over time. Each signature is a time-series for the entire year and hence need not be periodic. For example, the signatures can capture traffic anomalies during holidays, hurricanes, and blackouts.
Furthermore, the entries of a column of H are coefficients for the linear decomposition of a link into distinct signatures. For example, if the 4-th column of H is \(\left( 0,7,2,0,\ldots ,0\right) ^T\), then the traffic in link 4 of \({\mathcal{L}}\) can be written as seven times the second signature plus two times the third signature. This decomposition allows us to identify spatial patterns in traffic across the city. These matrices and the patterns derived from them can then aid in making specific observations about the large-scale behavior of traffic (as detailed in Sect. 4.4).
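As an illustration, a small sketch (assuming NumPy arrays W and H as above; the function name is ours) of how a single link's approximate traffic is rebuilt from its coefficients:

```python
import numpy as np

def reconstruct_link(W, H, link_index):
    """Approximate hourly traffic of one link as a linear combination of signatures."""
    h = H[:, link_index]          # sparse coefficient vector for this link
    return W @ h                  # 8,760-hour reconstructed time series

# Example: if H[:, 3] = (0, 7, 2, 0, ..., 0), then reconstruct_link(W, H, 3)
# equals 7 * W[:, 1] + 2 * W[:, 2] (0-based indexing).
```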
Independence of Signatures
To quantify the linear independence of the signatures obtained from the algorithm, we can compute the condition number of W (Hagen et al. 2000). A high condition number (in the thousands) would indicate that the rank should be reduced. By performing several runs of our algorithm for rank \(N=50\) with different (W, H)-initializations, we determine
$$\begin{aligned} \text {Condition Number for }W = 24\pm 2 \end{aligned}$$
This number is low enough to give us confidence in the W returned by the algorithm, and it further validates our choice of \(N=50\).
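A minimal sketch of this check (NumPy's `cond` computes the 2-norm condition number, i.e., the ratio of the largest to the smallest singular value of W):

```python
import numpy as np

# Condition number of W; values in the thousands would suggest reducing the rank N.
# Repeating this over several runs with different initializations gives the
# mean and spread reported above.
cond_W = np.linalg.cond(W)
```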
Robustness of Algorithm
The factors produced by our algorithm are not unique and can differ by permutations. If \(W_1\) and \(W_2\) are produced by two runs of the algorithm (starting with different initial conditions), we can calculate the Pearson product-moment correlation coefficient between the columns of \(W_1\) and the columns of \(W_2\). A coefficient value close to one implies that the signatures follow the same pattern up to a scale factor. We can construct a greedy algorithm for finding a permutation which maximizes these correlations: we sequence through the columns of \(W_1\), find the column of \(W_2\) which maximizes the correlation with the selected column of \(W_1\), and then remove that column of \(W_2\) from future computations. We can thus permute the columns of \(W_2\) to better match those of \(W_1\). After doing this, we can construct a heatmap which shows the correlation between the columns of \(W_1\) and the columns of (the permuted) \(W_2\). Figure 4 shows this heatmap for two runs of the algorithm of Sect. 2.5 with different (random) initial conditions and with the prevailing values of \(\eta =4171\), \(N=50\), and \(\beta =5000\) (see Sect. 4.1). Entry (i, j) of Fig. 4 gives the correlation between the i-th column of \(W_1\) and the j-th column of the permuted \(W_2\). We see high correlation on the diagonal, meaning that the original matrix \(W_2\) is very close to a permutation of \(W_1\).
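A sketch of this greedy matching, assuming \(W_1\) and \(W_2\) are stored as NumPy arrays W1 and W2 of shape 8,760 × N (the function name is ours):

```python
import numpy as np

def greedy_match(W1, W2):
    """Greedily permute the columns of W2 to maximize Pearson correlation with W1."""
    N = W1.shape[1]
    available = list(range(N))
    permutation = []
    for i in range(N):
        # Pearson correlation of column i of W1 with each remaining column of W2.
        corrs = [np.corrcoef(W1[:, i], W2[:, j])[0, 1] for j in available]
        best = available[int(np.argmax(corrs))]
        permutation.append(best)
        available.remove(best)    # each column of W2 is matched at most once
    return W2[:, permutation], permutation
```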
We make further observations about the low-rank decomposition in Sect. 4.4.
Periodicity and Anomalous Observations
We note that the columns of D (and hence the signatures) are roughly periodic, with a period of 7 days. A power spectrum periodogram of D (averaged over all columns) is shown in Fig. 5.
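A sketch of such a column-averaged periodogram using SciPy (hourly sampling, so frequencies are in cycles per hour and the 7-day period appears near 1/168; the missing-value fill is an assumption of the sketch):

```python
import numpy as np
from scipy.signal import periodogram

# Fill missing entries of D (here: by each link's mean), since the
# periodogram requires complete time series.
D_filled = np.where(np.isnan(D), np.nanmean(D, axis=0), D)

freqs, power = periodogram(D_filled, fs=1.0, axis=0)   # fs = 1 sample per hour
mean_power = power.mean(axis=1)                        # average over all links
# The dominant peak is expected near 1/168 cycles/hour (a 7-day period).
```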
In light of this periodicity, we can look at one day of the week across the entire year (e.g., all Mondays), compute the hour-wise median traffic for that day, and then identify anomalous behavior. See Figs. 6, 7a, 8a, and 8b (in dark red). In gray, we plot the relative taxi counts, i.e., entries of the normalized signatures. We then determine which dates have relative taxi counts that differ significantly from the hour-wise median traffic, in the sense that those (signature, day) pairs have the highest sum of absolute deviations from the median weekly traffic. In the subsections below, we identify some possible origins of anomalous behavior.
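A sketch of this anomaly ranking for a single signature, assuming it is stored as a pandas Series `sig` indexed by the 8,760 hourly timestamps of 2011 (the function name is ours; weekday encoding 0 = Monday follows pandas conventions):

```python
import pandas as pd

def rank_anomalous_days(sig, weekday):
    """Rank all days falling on a given weekday (0 = Monday) by their total
    absolute deviation from that weekday's hour-wise median profile."""
    s = sig[sig.index.dayofweek == weekday]
    # Hour-wise median traffic for this weekday, taken over the whole year.
    median_profile = s.groupby(s.index.hour).median()
    deviations = (s - median_profile.reindex(s.index.hour).values).abs()
    # Sum the absolute deviations per calendar day; the largest sums flag
    # the most anomalous (signature, day) pairs.
    return deviations.groupby(s.index.date).sum().sort_values(ascending=False)
```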
Hurricane Irene
Figure 6 shows Signature 0, which captures a near-shutdown of taxi traffic on August 27, 2011. This may have been caused by Hurricane Irene hitting NYC. There was an early warning and all subways and buses were shut down at noon on Saturday, August 27. A zoned taxi system was implemented at 9 am and taxis were thereafter running flat fares instead of meters [wny]. All other signatures also show similar behavior on and around August 28.
Wisconsin Labor Rally
Figure 7a shows the behavior of Signature \(21\) on February 26, 2011. The traffic deviates from the median Saturday trend. This may have been caused by a Labor Rally that took place near the New York City Town Hall [wis]. Figure 7b shows that Signature 21 is used by links near the Town Hall, which can be seen as further evidence connecting the rally to this traffic deviation.
Christmas Day
Anomalous behavior was also observed on Christmas Day. This can be seen in Fig. 8a for Signature 0 and Fig. 8b for Signature 4. Note how the Christmas Day traffic is reduced by about half throughout the day.
Endemic Signatures
After thresholding as discussed in Sect. 4.1, we can show the support \(\{\ell \in {\mathcal{L}}: H_{n,\ell } > 0\}\) of signature n on a map, for any given \(n\in \{1,2,\dots ,N\}\). See Fig. 9. We note that of the 50 signatures, some tend to be geographically restricted (called endemic), while others are spread out over larger areas (called dispersed). For example, Signature 0, as seen in Fig. 9a, is dispersed. Endemic signatures might sometimes explain traffic densities only on a single but long stretch of road. For example, Fig. 9b shows that Signature 10 is largely used by the northbound 3rd Avenue and streets like Bowery, Lafayette St. and the southernmost part of Broadway that feed into 3rd Avenue. Similarly, Fig. 9d shows Signature 40 being used exclusively by a small section of the south-bound Broadway traffic near Central Park.
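A small sketch of extracting such a support set from the thresholded H (the mapping `links` from column indices to OpenStreetMap link identifiers is assumed; drawing the map itself is not shown):

```python
import numpy as np

def signature_support(H_thr, n, links):
    """Links on which signature n has a positive coefficient after thresholding."""
    active_columns = np.flatnonzero(H_thr[n, :] > 0)
    return [links[i] for i in active_columns]
```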
In some other cases, signatures can be seen as having a lateral sphere of influence in that they affect not only one street but also others feeding into or out of the street transversally—Signature 24, for example, as seen in Fig. 9c.