Spatiotemporal random fields: compressible representation and distributed estimation
Abstract
Modern sensing technology enables enhanced monitoring of dynamic activities in business, traffic, and homes, to name just a few domains. The increasing volume of sensor measurements, however, poses a challenge for efficient data analysis. This is especially true when sensing targets interoperate—in such cases we need learning models that can capture the relations among sensors, possibly without collecting or exchanging all data. Generative graphical models, namely Markov random fields (MRFs), fit this purpose: they can represent complex spatial and temporal relations among sensors and produce interpretable answers in terms of probabilities. Their main drawback is the cost of inference and of storing and optimizing a very large number of parameters—not uncommon when we apply them to real-world problems.
In this paper, we investigate how discrete probabilistic graphical models can be made practical for predicting sensor states in a spatiotemporal setting. A set of new ideas allows us to keep the advantages of such models while achieving scalability. We first introduce a novel parametrization that enables us to compress the parameter storage by removing uninformative parameters in a systematic way. For finding the best parameters via maximum likelihood estimation, we provide a separable optimization algorithm that can be performed independently, in parallel, at each graph node. We show that the prediction quality of our suggested method is comparable to that of the standard MRF and of a spatiotemporal k-nearest neighbour method, while using far fewer computational resources.
Keywords
Regularization · Graphical models · Spatiotemporal

1 Introduction
Sensor-based monitoring and prediction has become a hot topic in a large variety of applications. Following the slogan Monitor, Mine, Manage (Campbell 2011), series of data from heterogeneous sources are to be put to good use in diverse applications. A view of data mining towards distributed sensor measurements is presented in the book on ubiquitous knowledge discovery (May and Saitta 2010). There are several approaches to distributed stream mining, e.g., those of Wolff et al. (2009) and Sagy et al. (2011). The goal in these approaches is a general model (or function) built on the basis of local models while restricting communication costs. Most often, the global model allows answering threshold queries, but clustering of nodes is sometimes handled as well. Even where the function is more complex, the model is global and not tailored to the prediction of measurements at a particular location. In contrast, we want to predict some sensor's state at some point in time given relevant previous and current measurements of itself and other sensors.
Since his influential book (Luckham 2002), David Luckham has successfully promoted complex event processing. Detecting events in streams of data has accordingly been modeled, e.g., in the context of monitoring hygiene in a hospital (Wang et al. 2011). In our case, however, the monitoring does not imply certain events. We do not aim at finding patterns that define an event, although they may show up as a side effect. Rather, we want to predict a certain state at a particular sensor or set of sensors, taking into account the context of other locations and points in time. Although related, the tasks differ.
The analysis of mobile sensor measurements has been framed as spatiotemporal trajectory mining by, e.g., Giannotti et al. (2007). There, frequent patterns are mined from movements of pedestrians or cars. The places are not given a priori, but interesting places can be derived from frequent crossings. Trajectory mining delivers neither state predictions nor probabilities; it is a task complementary to state prediction.

Given the traffic densities of all roads in a street network at discrete time points t_1, t_2, t_3 (e.g., Monday, Tuesday, Wednesday at 8 o'clock): indicate the probabilities of traffic levels on a particular road A at another time point, not necessarily following the given ones (e.g., Thursday at 7 o'clock).

Given a traffic jam at place A at time t_s: output other places with a probability higher than 0.7 for the state "jam" in the time interval t_s < t < ρ.
1.1 Previous work
In this section, an overview of previous contributions to spatiotemporal modeling is given. The task of traffic forecasting is often solved by simulations (Marinosson et al. 2002), which presupposes a model instead of learning it. In urban traffic control, events that are already observed are merely propagated, e.g., a jam at a particular highway section results in a jam at another highway section, or the prediction is based on a physical rule that predicts a traffic jam from a particular congestion pattern (Hafstein et al. 2004). Many approaches apply statistical time series methods like autoregression and moving averages (Williams and Hoel 2003). These do not take spatial relations into account but restrict themselves to predicting the state at one location given a series of observations at that particular location. An early approach is presented by Whittaker et al. (1997), using a street network topology that represents spatial relations; the training relies on simple Kalman filters, which is not as expressive as is necessary for queries like the ones above. A statistical relational learning approach to traffic forecasting uses explicit rules for modeling spatiotemporal dependencies, like congestion(+s_1, h) ∧ next(s_1, s_2) ⇒ congestion(+s_2, h+1) (Lippi et al. 2010). Training is done by a Markov logic network delivering conditional probabilities of congestion classes. The discriminative model is restricted to binary classification tasks, and the spatial dependencies need to be given by hand-tailored rules. Moreover, the model is not sparse and training is not scalable: even for a small number of sensors, training takes hours of computation. When the estimation of models for spatiotemporal data on ubiquitous devices is considered, e.g., learning to predict smartphone usage patterns based on time and visited places, training times on the order of minutes are demanded.
Hence, even this advanced approach does not yet meet the demands of the spatiotemporal prediction task in resource-constrained environments.
Some geographically weighted regression or nonparametric k-nearest neighbour (kNN) methods model a task similar to spatiotemporal state prediction (Zhao and Park 2004; Gong and Wang 2002; May et al. 2008). The regression expresses the temporal dynamics and the weights express spatial distances. Another way to introduce spatial relations into the regression is to encode the spatial network into a kernel function (Liebig et al. 2012). The kNN method by Lam et al. (2006) models correlations in spatiotemporal data not only by their spatial but also by their temporal distance. As required by the spatiotemporal state prediction task, the particular place and time in question need not be known in advance, because the lazy learner kNN determines the prediction at query time. However, this approach does not deliver probabilities along with the predictions. For some applications, for instance traffic prognoses for car drivers, a probabilistic assertion is not necessary; in applications of disaster management, however, the additional information of likelihood is wanted.
As is easily seen, generative models fit the task of spatiotemporal state prediction. For notational convenience, let us assume just one variable x. The generative model p(x,y) allows us to derive \(p(y\mid x)= \frac{p(x,y)}{p(x)}\) as well as \(p(x\mid y)=\frac{p(x,y)}{p(y)}\). In contrast, the discriminative model p(y∣x) must be trained specifically for each y. In our example, a distinct model would need to be trained for each place; hence, a huge set of discriminative models would be necessary to express one generative model. A discussion of discriminative versus generative models can be found in the study by Ng and Jordan (2002). Here, we refer to the capability of generative models to interpolate (e.g., between points in time) and to their informativeness in delivering probability estimates instead of mere binary decisions.
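The point about generative models can be made concrete with a toy joint distribution. The following sketch (the table values and variable names are illustrative, not from the paper) derives both conditionals from a single joint p(x, y) by normalization, whereas a discriminative model would have to be trained separately for each conditioning direction:

```python
# A toy joint distribution p(x, y) over two binary variables; values are
# illustrative only. Both conditionals come from the same table.
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

def p_y_given_x(joint, x):
    """p(y|x) = p(x, y) / p(x), with p(x) obtained by marginalizing over y."""
    p_x = sum(p for (xv, _), p in joint.items() if xv == x)
    return {y: joint[(x, y)] / p_x for (xv, y) in joint if xv == x}

def p_x_given_y(joint, y):
    """p(x|y) = p(x, y) / p(y), from the same joint table."""
    p_y = sum(p for (_, yv), p in joint.items() if yv == y)
    return {x: joint[(x, y)] / p_y for (x, yv) in joint if yv == y}
```

One generative table thus answers queries in both directions, which is exactly the property used for conditioning on arbitrary vertex subsets later in the paper.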
Spatial relations are naturally expressed by graphical models. For instance, discriminative graphical models—such as conditional random fields (CRFs)—have been used for object recognition over time (Douillard et al. 2007), but generative graphical models such as the Markov random field (MRF) have also been applied to video and image data (Yin and Collins 2007; Huang et al. 2008). The number of training instances does not influence the model complexity of an MRF. However, the number of parameters can easily exceed millions. In particular, when using MRFs for spatiotemporal state prediction, the many spatial and temporal relations soon lead to inefficiency:

the original parametrization is not well suited for producing sparse models,

trained models tend to overfit to the training data, and

training high-dimensional models is not feasible.
In the following, we recapitulate graphical models (Sect. 1.2) and regularization methods (Sect. 1.3), so that we can then introduce a new method for spatiotemporal state prediction that no longer suffers from the listed disadvantages.
1.2 Graphical models
1.3 Regularization
As we can see, the number of parameters in θ grows quite rapidly as we consider more complex graphical models. A large number of parameters is generally not preferable, since it may lead to overfitting, not to mention that it becomes hard to implement a memory-efficient predictor. Therefore some regularization is necessary to achieve a sparse and robust model.
Popular choices of regularizers are the ℓ_1 and ℓ_2 norms of the parameter vector, ∥θ∥_1 and ∥θ∥_2. By minimizing the ℓ_1 norm, we coerce the values of less informative parameters to zero (as in the LASSO (Tibshirani 1996)), and with the ℓ_2 norm we find smooth functions parametrized by θ (as in penalized splines (Pearce and Wand 2006)). Using both together is often referred to as the elastic net (Zou and Hastie 2005), which we also use in our work. For graphical models, elastic nets have been used for the task of structure learning (estimating the neighborhoods) by Cucuringu et al. (2011), in a manner similar to the approach of Meinshausen and Buehlmann (2005). For the state prediction task, there exist two short workshop papers (Piatkowski 2012; Piatkowski et al. 2012) using the elastic net; however, the analytical and empirical validation of that approach is rather limited there.
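The complementary effect of the two norms can be sketched with the elastic-net proximal step (the parameter values and step size below are assumptions for illustration, not taken from the paper): the ℓ_1 part sets small coordinates exactly to zero, while the ℓ_2 part shrinks the remaining ones smoothly.

```python
import numpy as np

def elastic_net_prox(theta, lam1, lam2, step=1.0):
    """Proximal operator of step * (lam1*||.||_1 + (lam2/2)*||.||_2^2):
    soft-thresholding (l1) followed by uniform shrinkage (l2)."""
    shrunk = np.sign(theta) * np.maximum(np.abs(theta) - step * lam1, 0.0)
    return shrunk / (1.0 + step * lam2)

theta = np.array([0.05, -0.8, 0.3, -0.02])
sparse = elastic_net_prox(theta, lam1=0.1, lam2=0.5)
# entries with |theta| <= lam1 become exactly zero; the rest are shrunk
```

This is the same mechanism that later makes the reparametrized model compressible: coordinates that carry little information are driven to exact zeros rather than merely small values.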
1.4 Overview

an interpretable model that truly captures spatiotemporal structures,

a new parametrization that, combined with regularization, results in sparse models,

and scalable training even for high dimensional models.
2 From linear chains to spatiotemporal models
Sequential undirected graphical models, also known as linear chains, are a popular method in the natural language processing community (Lafferty et al. 2001; Sutton and McCallum 2007). There, consecutive words or corresponding word features are connected to a sequence of labels that reflects an underlying domain of interest, such as entities or part-of-speech tags. If we consider a sensor network G that generates measurements over space as a word, then it is appealing to think of the instances of G at different time points, like words in a sentence, as forming a temporal chain G_1−G_2−…−G_T. We now present a formalization of this idea, followed by a discussion of its drawbacks; afterwards we discuss how to tackle those drawbacks and derive a tractable class of generative graphical models for the spatiotemporal state prediction task.
We first define the part of the graph corresponding to time t as the snapshot graph G_t = (V_t, E_t), for t = 1, 2, …, T. Each snapshot graph G_t replicates a given spatial graph G_0 = (V_0, E_0), which represents the underlying physical placement of sensors, i.e., the spatial structure of the random variables that does not change over time. We also define the sets of spatiotemporal edges E_{t−1;t} ⊂ V_{t−1} × V_t for t = 2, …, T, and E_{0;1} = ∅, which represent dependencies between adjacent snapshot graphs G_{t−1} and G_t; we assume a Markov property among snapshots, so that E_{t;t+h} = ∅ whenever h > 1, for any t. Note that the actual time gap between any two time frames t and t+1 can be chosen arbitrarily.
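The construction above can be sketched in a few lines. This is a simplified sketch, not the paper's implementation: G_0 is given as plain node and edge lists, node (v, t) denotes the copy of vertex v in snapshot G_t, and for the spatiotemporal edges E_{t;t+1} we connect each vertex only to its own copy in the next layer (the paper's E_{t−1;t} ⊂ V_{t−1} × V_t may contain further pairs).

```python
def build_spatiotemporal_graph(V0, E0, T):
    """Replicate the spatial graph G0 = (V0, E0) into T snapshot layers and
    link consecutive layers with temporal edges (v, t) -- (v, t+1)."""
    nodes = [(v, t) for t in range(1, T + 1) for v in V0]
    snapshot_edges = [((u, t), (v, t)) for t in range(1, T + 1) for (u, v) in E0]
    temporal_edges = [((v, t), (v, t + 1)) for t in range(1, T) for v in V0]
    return nodes, snapshot_edges + temporal_edges

# Tiny illustrative spatial graph: a path a - b - c, unrolled over T = 3 layers.
V0 = ["a", "b", "c"]
E0 = [("a", "b"), ("b", "c")]
nodes, edges = build_spatiotemporal_graph(V0, E0, T=3)
```

For |V_0| vertices and |E_0| spatial edges, this yields T·|V_0| nodes and T·|E_0| + (T−1)·|V_0| edges, which makes explicit how quickly the problem dimension grows with T.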
This model now truly expresses temporal and spatial relations between all locations and points in time for all features. However, the memory requirements of such models are quite high due to the large problem dimension. Even loading or sending models may cause issues when mobile devices are considered as a platform. Furthermore, the training does not scale well because of step-size adaptation techniques that are based on sequential (i.e., non-parallel) algorithms.
3 Spatiotemporal random fields
Now we describe how we modify the naive spatiotemporal graphical model discussed above. We have two goals in mind: (i) to achieve compact models retaining the same prediction power, and (ii) to find the best of such models via scalable distributed optimization.
3.1 Toward better sparsification
The memory consumption of the MRF is dominated by the size of its parameter vector: the graph G can be stored within \(\mathcal {O}(|V|+|E|)\) space (temporal edges do not have to be constructed explicitly), and the intermediate variables required for inference take \(\mathcal {O}(2|E||\mathcal {X}_{v}|)\) space. That is, if \(|\mathcal {X}_{v}|\geq2\) for all v, the dimension d in (5), and therefore the memory consumption of the parameter vector, is always a dominant factor. Also, since each parameter is usually accessed multiple times during inference, it is desirable to keep the parameters in fast storage, e.g., a cache memory.
An important observation about the parameter subvector θ(t) is that it is unlikely to be a zero vector when it models an informative distribution. For example, if the nodes can take one of the two states {high, low}, suppose that the corresponding parameters at time t satisfy [θ(t)]_v = 0 for all v, and likewise for all edge weights. This implies P(v = high) = P(v = low), a uniform marginal distribution. The closer the parameters of a classical MRF are to 0, the closer the corresponding marginals are to the uniform distribution.
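The observation can be checked directly for an isolated node (a simplifying assumption; with edge couplings the marginal also depends on the neighbours): in an exponential-family node model the marginal is proportional to exp of the state's parameter, so an all-zero parameter vector yields exactly the uniform distribution.

```python
import math

def node_marginal(theta):
    """Marginal over states of an isolated node: softmax of its parameters."""
    weights = [math.exp(t) for t in theta]
    z = sum(weights)  # node-local normalization constant
    return [w / z for w in weights]

uniform = node_marginal([0.0, 0.0])   # zero parameters for {high, low}
skewed = node_marginal([1.0, -1.0])   # non-zero parameters: informative marginal
```

Zero parameters thus encode "no information", which is why sparsity in the classical parametrization conflicts with modelling informative distributions.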
When all consecutive layers are sufficiently close in time, the transition of distributions over the layers will be smooth in many real-world applications. But the optimal θ is likely to be a dense vector, which requires large memory and possibly a long time for prediction as we deal with large graphical models. This calls for another parametrization.
3.1.1 Reparametrization
We note that our reparametrization with Z introduces some overhead, due to the summation in (6), compared to the classical parametrization with θ. In particular, whenever an algorithm has to read a value from θ, it has to be decompressed on the fly, which adds asymptotic complexity \(\mathcal {O}(T)\) to every access. However, if we obtain a sparse representation with Z, then it can be stored in a small amount of memory (possibly even in CPU cache memory), and the chances of cache misses or memory swapping are therefore reduced. This becomes an important factor when we deploy a learned model to applications running on mobile devices, for instance.
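The trade-off can be sketched as follows, under the assumption (consistent with the summation in (6)) that column t of Z stores the change of the parameters from layer t−1 to layer t, so that θ(t) is the prefix sum of the first t columns. When consecutive layers are similar, most columns of Z are zero and the representation compresses well, while every read of θ(t) costs a summation over up to T columns.

```python
import numpy as np

def decompress(Z, t):
    """Recover theta(t) = sum_{i<=t} Z[:, i]; O(t) work per access."""
    return Z[:, :t].sum(axis=1)

# Three layers whose parameters barely change over time: the per-layer
# theta vectors are dense, but the delta representation Z is mostly zero.
thetas = np.array([[1.0, 1.0, 1.2],
                   [0.5, 0.5, 0.5]])
Z = np.diff(np.hstack([np.zeros((2, 1)), thetas]), axis=1)  # columnwise deltas
```

Here `thetas` has six nonzero entries while `Z` has only three, illustrating why smooth temporal dynamics make the Z representation the compressible one.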
3.1.2 Regularizers revisited
The two regularizers induce sparsity and smoothness, respectively, as discussed in Sect. 1.3. The difference is that, due to the reparametrization, the differences between the parameters θ(t−1) and θ(t) are now penalized, not the actual values therein, which are unlikely to be zero in general.
3.2 Highdimensional scalable learning
3.2.1 Separable subproblems
The parameter estimation problem in (8) is a convex minimization problem, because of the convexity of A(θ(Z)), ∥Z∥_1 and \(\|\boldsymbol {Z}\|_{F}^{2}\) in Z. In order to speed up the computation, distributing the optimization over several computing units would be desirable. However, in its current form the optimization of (8) is not separable over the components of Z.
Following the general framework of the SpaRSA approach (Wright et al. 2009), we construct subproblems with a second-order approximation to the nonseparable but smooth part f(Z) of the objective function, keeping the nonsmooth part ∥Z∥_1 intact.
In order to achieve separability, we need the property that the gradient of f(Z) for each node or edge in the spatial graph G _{0} can be computed independently, since otherwise we have to incur costly communication among graph elements in every iteration. The next lemma shows that we do have this property.
Lemma 1
Proof
When we use \(\boldsymbol {D}_{j}^{k} > 0\), then the objective in (11) is strongly convex, and therefore the new iterate \(\boldsymbol {Z}_{j\cdot}^{k+1}\) is uniquely determined for each row.
3.2.2 Estimation of curvature
The subproblem (11) is constructed using a linear approximation of the nonseparable smooth function f(Z), to create a separable objective. The second term \(\frac{1}{2} (\boldsymbol {Z}_{j \cdot}-\boldsymbol {Z}_{j \cdot}^{k})^{T} \boldsymbol {D}^{k}_{j} (\boldsymbol {Z}_{j \cdot}-\boldsymbol {Z}_{j \cdot}^{k})\) in the objective ensures that the next iterate \(\boldsymbol {Z}_{j\cdot}^{k+1}\) is not too far from the current iterate \(\boldsymbol {Z}_{j\cdot}^{k}\), since the linear approximation becomes less accurate at farther points.
Since the spectrum computed above is only an approximation, we project each value onto an interval defined by 0 < D_min < D_max to avoid numerical issues. In the worst case, for a certain j, all (over time) curvature estimates could equal D_min (or D_max)—in such cases our method performs steepest descent optimization for Z_{j⋅}.
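A minimal sketch of this safeguard follows (the interval bounds and the particular estimator are assumptions for illustration): a Barzilai–Borwein-style curvature estimate from successive iterates and gradients, projected onto [D_min, D_max]. When the projection is active everywhere, the quadratic term degenerates to a fixed multiple of the identity and the update reduces to a steepest-descent step.

```python
import numpy as np

def clipped_curvature(z_new, z_old, g_new, g_old, d_min=1e-4, d_max=1e4):
    """Barzilai-Borwein-style scalar curvature estimate s^T y / s^T s,
    projected onto [d_min, d_max] to avoid numerical issues."""
    s, y = z_new - z_old, g_new - g_old
    denom = float(np.dot(s, s))
    estimate = float(np.dot(s, y)) / denom if denom > 0 else d_max
    return min(max(estimate, d_min), d_max)
```

Note the reference list already cites Barzilai and Borwein (1988) for this type of two-point step-size estimate.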
3.2.3 Approximate line search
In fact, we can separate the sums into partial sums over the elements of G_0 as before, and check whether the partial sum corresponding to each graph element becomes negative, performing the line search in parallel as well. This imposes a stronger condition than the lower bound itself becoming negative, but at least in our experiments it seems to compensate for the inaccuracy of our approximate line search.
3.2.4 Stopping criterion and convergence
For the result that our separated optimization eventually finds an optimal solution, we refer to Theorem 1 of Wright et al. (2009). The theorem requires that the smooth part of the objective be convex and Lipschitz-continuously differentiable, and our f(Z) satisfies these conditions (the second condition is easily verified from our Lemma 1 and from the fact that \(\|\boldsymbol {\mu }(t) - \hat {\boldsymbol {\mu }}(t)\|_{\infty}\le1\) by construction).
4 Experiments
We evaluate the performance of our suggested method on two real-world data sets, each described by a spatial graph G_0 = (V_0, E_0) with a set of sensors V_0 and connections E_0, and a set of historical sensor readings \(\mathcal {D}\). We compare two approaches: the Markov random field with the original parametrization (MRF) and the spatiotemporal random field^{2} (STRF) proposed in the present paper.
First we discuss model training. We investigate the prediction quality and sparsity of the resulting models with respect to the regularization parameters; the impact of separable optimization on training time is also presented. Next, the quality of prediction on the test sets is discussed with regard to the sparsity (and thereby the memory size) of the trained models. Finally, a qualitative analysis of our model in terms of interpretability (non-overfitting) is presented.
All experiments have been performed on a Linux system with four Intel Xeon CPUs (each with 8 cores at 2.00 GHz and 18 MB cache) and 256 GB of main memory, except for measuring test performance in Sect. 4.4—there we have used a Linux machine with smaller (16 GB) amount of memory and a single commodity Intel i7 CPU (at 3.40 GHz and with 8 MB cache), in order to better simulate lowmemory situations.
Throughout the experiments, our STRF algorithm produced solutions satisfying our target optimality of σ(Z) < 10^{−5} in terms of the measure (12) within ten iterations.
4.1 Data sets
Traffic
The traffic data from the highways of German North Rhine-Westphalia are available at the Online Traffic Information System (http://autobahn.nrw.de) and consist of the number of vehicles, their average speed per minute, and the occupancy rate of the highway region covered by each sensor. Due to the amount of data, scalability is a particular issue with this set. Here, the data from July to December 2010 are used, altogether more than 200 million sensor readings.
The highways are naturally partitioned into several segments by their departing locations. To prepare a spatial graph, we create nodes for notable segments and consider the remaining segments as edges connecting the nodes. If more than one sensor is assigned to the same vertex, we average the sensor readings as the measurement for that vertex. The different types of measurements are combined and discretized into four states following Marinosson et al. (2002): green, defined by high average speed and low traffic density; yellow, which allows a slightly higher traffic density; orange, which additionally limits the average speed; and red, which represents a traffic jam. After removing malfunctioning sensors, the final spatial graph contains 174 nodes and 218 edges. It is assumed that each week of collected traffic data is generated by the same underlying distribution. Thus, the spatiotemporal graph has 144×7 = 1008 layers (i.e., each weekday is sampled at 144 time points), consists of 174×1008 = 175392 nodes, and has ((174+218×3)×1007+218) = 834014 edges. Notice that the corresponding MRF model has more than 10^7 parameters, if counted by (5).
Temperature
The second data set was collected in March 2004 from the sensors deployed in the Intel Berkeley Research lab (the data are available at http://db.csail.mit.edu/labdata/labdata.html). The measurements consist of humidity, temperature, light, and voltage values captured every 31 seconds. One half of the two million sensor readings was used for training, the rest for testing.
The spatial layer is constructed as a nearest-neighbour graph, where each node represents a sensor, and edges are removed whenever a wall separates neighbouring sensors. We also excluded the sensors reported as faulty by Apiletti et al. (2011). The final spatial graph contains 48 nodes and 150 edges. We use the temperature measurements every 30 minutes, since this was the finest resolution without too many missing values, and discretize them into 21 equally sized bins. We assume that each day of collected temperature data is generated by the same underlying distribution. As a result, the spatiotemporal graph models exactly one day of temperature measurements; it has 48 layers and contains 2304 nodes and 23556 edges. Although the number of layers is smaller than for the traffic data, the corresponding model still has more than 10^7 parameters, due to the larger state space.
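The discretization step can be sketched as follows, assuming "equally sized bins" means equal-width bins over the observed range (the sample temperatures below are illustrative, not taken from the data set):

```python
import numpy as np

def discretize(values, n_bins=21):
    """Map continuous readings to equal-width bin indices in 0..n_bins-1."""
    lo, hi = values.min(), values.max()
    edges = np.linspace(lo, hi, n_bins + 1)
    # np.digitize against the interior edges gives indices 0..n_bins; clip the
    # maximum so the largest value falls into the last bin.
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

temps = np.array([15.0, 18.2, 21.5, 24.9, 36.0])  # illustrative readings in °C
states = discretize(temps)
```

Each node variable then ranges over 21 discrete states, which is what drives the large state space mentioned above.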
4.2 Measures
Each data set has been split into a training set \(\mathcal {D}_{\text{Train}}\) and a test set \(\mathcal {D}_{\text{Test}}\) in the obvious way: the states of future time points (the test set) are predicted given those of the past (the training set).
Owing to the generative nature of our method, each prediction can be conditioned on an arbitrary subset of vertices U ⊆ V whose values are known at prediction time. Notice that U = ∅ is a perfectly valid choice and yields the most probable joint state of all vertices in the graph.
For kNN, the same spatiotemporal graph as for STRF and MRF is considered. For two nodes u and v, their spatiotemporal distance d(v, u) is given by the number of edges on the shortest path that connects the two nodes. The kNN prediction for x_v ∣ x_U (with v ∉ U) is made as follows: (i) compute the distance d(x^i, x_U) from each training instance x^i to x_U, which is simply the cardinality of U minus the number of matching values in x^i and x_U; (ii) for each vertex u in each training instance x^i, compute the distance d_u := d(v, u) + d(x^i, x_U); (iii) sort all vertices by d_u and return the top k. If more than one of these k states has the maximum frequency, one of them is selected at random. This simple algorithm looks appealing, but it has the obvious drawback that it needs access to the complete training data for prediction. Lastly, the random prediction method selects a uniformly distributed random state for each vertex.
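Steps (i)–(iii) can be sketched directly. This is an illustrative reconstruction, not the paper's implementation: `train` is a list of instances, each a mapping from vertex to state; `d_graph[u]` is the precomputed shortest-path distance d(v, u) from the query vertex v; and ties are resolved deterministically by `Counter` rather than at random.

```python
from collections import Counter

def knn_predict(train, observed, d_graph, k):
    """Predict the state at the query vertex from the k closest vertex copies."""
    scored = []
    for x in train:
        # (i) instance distance: |U| minus the number of matching observed values
        d_inst = sum(1 for u, s in observed.items() if x.get(u) != s)
        # (ii) per-vertex distance combines graph distance and instance distance
        for u, dist in d_graph.items():
            scored.append((dist + d_inst, x[u]))
    # (iii) sort all vertices by distance and take a majority vote over the top k
    scored.sort(key=lambda p: p[0])
    votes = Counter(state for _, state in scored[:k])
    return votes.most_common(1)[0][0]
```

The sketch also makes the stated drawback visible: every prediction iterates over the complete training set.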
4.3 Regularized training of spatiotemporal random fields
In our model, the ℓ_2 regularizer imposes "smoothness" on the dynamics of the parameters over time, providing a controllable way to avoid overfitting noisy observations. The degree of smoothness is controlled by λ_2, whereas the compression ratio is controlled by λ_1. Positive values of λ_2 also help our method numerically, since the curvature estimation in Sect. 3.2.2 becomes better conditioned.
4.3.1 Sparsity of trained models and their training accuracy
Since the number of edge parameters is the dominant factor in the dimension d of the parameter space, it is desirable that STRF compress the edge parameters well. Considering the NNZ ratios of vertex and edge parameters separately, it turns out that STRF has this property: with the good parameter values above, the NNZ ratio of the vertices is about 0.95, whereas that of the edges is about 0.09.
4.3.2 Scalability of separable optimization
The number of parameters can grow quite rapidly with the size of the spatial graph, the number of states, and the number of layers. Therefore the scalability of training such models is important for practical applications, where models may have to be updated frequently for new data.
4.4 Prediction on test sets
Here we investigate (i) the test set performance of the sparse models obtained with the good values of λ_1 and λ_2 found in training, and (ii) how the sparsity of the trained models affects the testing time.
4.4.1 Prediction quality of sparse models
4.4.2 The effect of sparsity on prediction time
As discussed in Sect. 3.1.1, our reparametrization brings some computational overhead, adding up to \(\mathcal {O}(T)\) to each message computation of the belief propagation (BP) algorithm, which is used for answering queries. However, this is not the entire picture. For example, STRF can produce a tiny model that fits into the CPU caches, whereas MRF produces a much larger model that has to be swapped out. Then fetching the model parameters takes much longer for MRF than for STRF, rendering the extra computation time of STRF negligible, which is a likely scenario on small ubiquitous devices.
4.5 Smooth probabilistic modelling
Graphical models can produce probabilistic answers to queries, which is one motivation for choosing such models. Instead of the MAP prediction, the marginal probabilities p_v(x_v) may be considered for analysis.
These very smooth results are achieved by considering the reparametrization \(\bar{\boldsymbol {\theta }}(t):=\sum_{i=1}^{t} \boldsymbol {Z}_{\cdot i}\) mentioned in Sect. 3.1.1 in conjunction with strong regularization, i.e., λ_1 = λ_2 = 10. Note that such models represent strong generalizations of the data, and as a result the corresponding test accuracies are strictly worse than with the reparametrization (6): compared to the STRF results from Sect. 4.4.1, the model just mentioned achieves around 20 % lower accuracy. We also note that these robust models cannot be achieved with the classical MRF parametrization, due to its rigidity under regularization.
5 Conclusions
We presented an improved graphical model designed for efficient probabilistic modelling of spatiotemporal data. It is based on a novel parametrization that allows, for the first time, a regularization of spatiotemporal graphical models such that the estimated parameters are sparse and the estimated marginal probabilities are smooth, without losing prediction accuracy. We investigated the sparsity, smoothness, prediction accuracy, and scalability of our model on two real-world data sets. The experiments showed that our model retains almost the same prediction accuracy at around 10 % of the original model size. Our method is designed to run in parallel and scales very well with an increasing number of CPUs. Future research will consider other reparametrizations of graphical models as well as specialized inference algorithms for spatiotemporal data.
Footnotes
 1.
In general, one may consider indicator functions not only for nodes and edges, but for all cliques (fully connected subgraphs) in G. Our description still applies to higher order models, since we can convert them into models using solely nodes and edges (Wainwright and Jordan 2007, Appendix E).
 2.
Our C++ source code is available at http://sfb876.tudortmund.de/strf.
Acknowledgements
We thank the reviewers for their suggestions that have helped us improve our manuscript. This work has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 “Providing Information by ResourceConstrained Data Analysis”, projects A1 and C1.
References
 Apiletti, D., Baralis, E., & Cerquitelli, T. (2011). Energysaving models for wireless sensor networks. Knowledge and Information Systems, 28, 615–644. CrossRefGoogle Scholar
 Barzilai, J., & Borwein, J. M. (1988). Twopoint step size gradient methods. IMA Journal of Numerical Analysis, 8(1), 141–148. MathSciNetzbMATHCrossRefGoogle Scholar
Campbell, D. (2011). Is it still Big Data if it fits in my pocket? In Proceedings of the VLDB endowment (Vol. 4, p. 694).
Cucuringu, M., Puente, J., & Shue, D. (2011). Model selection in undirected graphical models with the elastic net.
Darroch, J. N., & Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5), 1470–1480.
Douillard, B., Fox, D., & Ramos, F. T. (2007). A spatio-temporal probabilistic model for multi-sensor object recognition. In IEEE/RSJ international conference on intelligent robots and systems (pp. 2402–2408).
Giannotti, F., Nanni, M., Pinelli, F., & Pedreschi, D. (2007). Trajectory pattern mining. In Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 330–339).
Gong, X., & Wang, F. (2002). Three improvements on KNN-NPR for traffic flow forecasting. In Proceedings of the 5th international conference on intelligent transportation systems (pp. 736–740).
Hafstein, S. F., Chrobok, R., Pottmeier, A., & Mazur, M. S. F. (2004). A high-resolution cellular automata traffic simulation model with application in a freeway traffic information system. Computer-Aided Civil and Infrastructure Engineering, 19(5), 338–350.
Heinemann, U., & Globerson, A. (2011). What cannot be learned with Bethe approximations. In Proceedings of the 27th conference on uncertainty in artificial intelligence, Barcelona, Spain.
Huang, R., Pavlovic, V., & Metaxas, D. (2008). A new spatio-temporal MRF framework for video-based object segmentation. In The 1st international workshop on machine learning for vision-based motion analysis.
Kschischang, F. R., Frey, B. J., & Loeliger, H. A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th international conference on machine learning (pp. 282–289).
Lam, W. H. K., Tang, Y. F., & Tam, M. (2006). Comparison of two non-parametric models for daily traffic forecasting in Hong Kong. Journal of Forecasting, 25(3), 173–192.
Liebig, T., Xu, Z., May, M., & Wrobel, S. (2012). Pedestrian quantity estimation with trajectory patterns. In Lecture notes in computer science: Vol. 7524. Machine learning and knowledge discovery in databases (pp. 629–643). Berlin: Springer.
Lippi, M., Bertini, M., & Frasconi, P. (2010). Collective traffic forecasting. In Lecture notes in computer science: Vol. 6322. Machine learning and knowledge discovery in databases (pp. 259–273). Berlin: Springer.
Luckham, D. (2002). The power of events: an introduction to complex event processing in distributed enterprise systems. Reading: Addison-Wesley.
Marinosson, S. F., Chrobok, R., Pottmeier, A., Wahle, J., & Schreckenberg, M. (2002). Simulation framework for the autobahn traffic in North Rhine-Westphalia. In Cellular automata: 5th international conference on cellular automata for research and industry (pp. 2977–2980). Berlin: Springer.
May, M., & Saitta, L. (Eds.) (2010). Lecture notes in artificial intelligence: Vol. 6202. Ubiquitous knowledge discovery. Berlin: Springer.
May, M., Hecker, D., Körner, C., Scheider, S., & Schulz, D. (2008). A vector-geometry based spatial kNN-algorithm for traffic frequency predictions. In Data mining workshops, international conference on data mining (pp. 442–447).
Meinshausen, N., & Buehlmann, P. (2005). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3), 1436–1462.
Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14, 841–848.
Nocedal, J., & Wright, S. J. (2006). Numerical optimization (2nd ed.). Berlin: Springer.
Pearce, N. D., & Wand, M. P. (2006). Penalized splines and reproducing kernel methods. American Statistician, 60(3), 233–240.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible inference. San Mateo: Morgan Kaufmann Publishers Inc.
Piatkowski, N. (2012). iST-MRF: interactive spatio-temporal probabilistic models for sensor networks. In International workshop at ECML PKDD 2012 on instant interactive data mining.
Piatkowski, N., Lee, S., & Morik, K. (2012). Spatio-temporal models for sustainability. In Proceedings of the SustKDD workshop at ACM KDD.
Sagy, G., Keren, D., Sharfman, I., & Schuster, A. (2011). Distributed threshold querying of general functions by a difference of monotonic representation. In Proceedings of the VLDB endowment (Vol. 4).
Sutton, C., & McCallum, A. (2007). An introduction to conditional random fields for relational learning. In Introduction to statistical relational learning. Cambridge: MIT Press.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Wainwright, M., Jaakkola, T., & Willsky, A. (2005). A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7), 2313–2335.
Wainwright, M. J., & Jordan, M. I. (2007). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2), 1–305.
Wang, D., Rundensteiner, E. A., & Ellison, R. T. (2011). Active complex event processing of event streams. In Proceedings of the VLDB endowment (Vol. 4).
Whittaker, J., Garside, S., & Lindveld, K. (1997). Tracking and predicting a network traffic process. International Journal of Forecasting, 13(1), 51–61.
Williams, B., & Hoel, L. (2003). Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: theoretical basis and empirical results. Journal of Transportation Engineering, 129(6), 664–672.
Wolff, R., Bhaduri, K., & Kargupta, H. (2009). A generic local algorithm for mining data streams in large distributed systems. IEEE Transactions on Knowledge and Data Engineering, 21(4), 465–478.
Wright, S. J., Nowak, R. D., & Figueiredo, M. A. T. (2009). Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57, 2479–2493.
Yin, Z., & Collins, R. (2007). Belief propagation in a 3D spatio-temporal MRF for moving object detection. In IEEE conference on computer vision and pattern recognition.
Zhao, F., & Park, N. (2004). Using geographically weighted regression models to estimate annual average daily traffic. Journal of the Transportation Research Board, 1879(12), 99–107.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.