Machine Learning for Vision-Based Motion Analysis, pp. 263-274

# Spatio-Temporal Motion Pattern Models of Extremely Crowded Scenes

## Abstract

Extremely crowded scenes present unique challenges to motion-based video analysis due to the large quantity of pedestrians within the scene and the frequent occlusions they produce. The movements of pedestrians, however, collectively form a spatially and temporally structured pattern in the motion of the crowd. In this work, we present a novel statistical framework for modeling this structured pattern, or steady-state, of the motion in extremely crowded scenes. Our key insight is to model the motion of the crowd by the spatial and temporal variations of local spatio-temporal motion patterns exhibited by pedestrians within the scene. We divide the video into local spatio-temporal sub-volumes and represent the movement through each sub-volume with a local spatio-temporal motion pattern. We then derive a novel, distribution-based hidden Markov model to encode the temporal variations of local spatio-temporal motion patterns. We demonstrate that by capturing the steady-state of the motion within the scene, we can naturally detect unusual activities as statistical deviations in videos with complex activities that are hard for even human observers to analyze.

## Keywords

Optical Flow, True Positive Rate, Training Video, Crowded Scene, Query Video

## 1 Introduction

The large number of surveillance cameras currently deployed has created a dire need for computational methods that can assist or ultimately replace human operators. Surveillance cameras record large amounts of video that is typically only viewed after an incident occurs. Live monitoring of video feeds requires human personnel who are frequently tasked with observing multiple cameras simultaneously. Vision-based video analysis systems attempt to augment human security personnel by analyzing surveillance videos computationally, enabling the automatic monitoring of large numbers of video cameras.

Despite the general interest and active research in video analysis, an important area of video surveillance has been overlooked. Extremely crowded scenes, such as those shown in Fig. 2, are perhaps most in need of computational methods for analysis. Crowded, public areas require monitoring of a large number of individuals and their activities, a significant challenge for even a human observer. Videos with this level of activity have yet to be analyzed via a computational method. Common scenes for video analysis, such as the PETS 2007 database [15], contain fewer than one hundred individuals in even the most crowded videos. Extremely crowded scenes, however, contain hundreds of people in any given frame and possibly thousands throughout the duration of the video.

The people and objects that compose extremely crowded scenes present an entirely different level of challenges due to their large quantity and complex activities. The sheer number of pedestrians results in frequent occlusions that make modeling the movement of each individual extremely difficult, if not impossible. The movement of each pedestrian, however, contributes to the overall motion of the crowd. In extremely crowded scenes, the crowd’s motion may vary spatially across the frame, as different areas contain different degrees of traffic, and temporally throughout the video due to natural crowd variations. As a result, analysis of extremely crowded scenes must account for the spatial and temporal variations in the local movements of pedestrians that comprise the crowd.

In this work, we present a novel statistical framework to model the steady-state motion of extremely crowded scenes. Our key insight is that the crowd’s motion is a spatially and temporally structured pattern of the local motion patterns exhibited by pedestrians. We model the spatial and temporal variations of the motion in local space-time sub-volumes to capture the latent motion structure (i.e., the steady-state) of the scene. We use this model to describe the typical motion of the crowd, that is, usual events within the scene, and demonstrate its effectiveness by identifying unusual events in videos of the same scene.

First, we divide the video volume into local spatio-temporal sub-volumes, or cuboids, defined by a regular grid. Next, we use a local spatio-temporal motion pattern to describe the possible complex movements of pedestrians within each cuboid. We then identify prototypical local spatio-temporal motion patterns that describe the typical movements within the scene, and estimate a distribution of local motion patterns to compactly represent the entire video. Finally, we capture the spatial and temporal variations in the motion of the crowd by training a novel, distribution-based hidden Markov model (HMM) on the local spatio-temporal motion patterns at each spatial location of the video volume. In other words, we model the steady-state motion of the crowd by a spatially and temporally varying structured pattern of the movements of pedestrians in local areas.

## 2 Related Work

Motion-based video analysis characterizes the scene by the movement of the scene’s constituents within a video sequence. Trajectory-based approaches [5, 8, 9], for example, track the objects within the scene and describe their motion by their changing spatial location in the frame. These techniques are suitable for scenes with few moving objects that can easily be tracked, such as infrequent pedestrian or automobile traffic. Trajectory-based approaches focus on each subject individually, but the behavior of extremely crowded scenes depends on the concurrent movement of multiple pedestrians. In addition, the frequent occlusions in extremely crowded scenes make tracking considerably more difficult.

Other approaches estimate the optical flow [2, 3] or the motion within spatio-temporal volumes [4, 6, 7, 10]. Flow-based approaches have used HMMs [2] or Bayes’ classifiers [3] to represent the overall motion within the scene. The large number of people in extremely crowded scenes, however, makes modeling specific activity by the collective optical flow difficult. Extremely crowded scenes may contain any number of concurrent, independent activities taking place in different areas of the same frame. This makes global approaches [2, 20] that model the motion over the entire frame unsuitable since the local events of interest would not be discernible from the rest of the scene.

Spatio-temporal approaches [4, 6, 7, 10] directly represent the motion in local sub-volumes of the video. Though these representations are well suited to extremely crowded scenes, their use has been limited to volume distance [4, 10] or interest points [7], requiring the explicit modeling of each event to be detected. In other words, they have not been used to represent the overall motion of a scene, just specific events. In addition, most spatio-temporal representations assume that the sub-volume contains motion in a single direction. In extremely crowded scenes, however, the motion within each cuboid may be complex and consist of movement in multiple directions.

Other works have modeled the motion of the crowd by flow fields [1], topical models [17], or dynamic textures [12]. Ali and Shah [1] model the expected motion of pedestrians in order to track pedestrians at a distance. More recently, Rodriguez et al. [17] model a fixed number of possible motions at each spatial location using a topical model. These approaches, however, fix the number of possible motions at each location within the frame. In extremely crowded scenes the pedestrian motions vary depending on the spatial and temporal location within the video. Pedestrian movements may be severely limited in one area of the frame and highly variable in others. As such, models that fix the number of possible movements may not represent the rich variations in motion that can occur in extremely crowded scenes.

## 3 Local Spatio-Temporal Motion Patterns

The motion of the crowd in extremely crowded scenes is formed by the collective movement of the pedestrians. The movement of each pedestrian depends on the physical structure of the scene, surrounding pedestrians, and the individual’s goals. As a result, the motion of the crowd changes naturally spatially across the frame and temporally over the video. It is exactly these spatial and temporal variations in the local movements of pedestrians that we model to characterize the motion of the crowd.

For each pixel *i* in the cuboid, we compute the spatio-temporal gradient \(\nabla I_{i}\), a 3-dimensional vector representing the gradient in the horizontal, vertical, and temporal dimensions. Previous work on analyzing persistent motion patterns [19] and correlating video sequences [18] has used the collection of spatio-temporal gradients in the form of the structure tensor matrix

\[\mathbf{G} = \sum_{i=1}^{N} \nabla I_{i} \nabla I_{i}^{T},\]

where *N* is the number of pixels in the cuboid. These methods assume that the spatio-temporal gradients lie on a plane in 3D gradient space, and thus the cuboid contains a single motion vector [18]. In extremely crowded scenes, however, a cuboid may contain complex motion caused by movement in multiple directions such as a pedestrian changing direction or two pedestrians moving past one another.

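Instead, we model the distribution of the spatio-temporal gradients within each cuboid as a 3D Gaussian \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\varSigma})\) with

\[\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N} \nabla I_{i}, \qquad \boldsymbol{\varSigma} = \frac{1}{N}\sum_{i=1}^{N} \left(\nabla I_{i} - \boldsymbol{\mu}\right)\left(\nabla I_{i} - \boldsymbol{\mu}\right)^{T}.\]

For each spatial location *n* and temporal location *t*, the local spatio-temporal motion pattern is represented by \(\boldsymbol{\mu}_{t}^{n}\) and \(\boldsymbol{\varSigma}_{t}^{n}\). Intuitively, by modeling the distribution of spatio-temporal gradients, we represent the possibly multiple motion vectors that may occur within the cuboid by estimating the shape of the 3D gradients, which may not lie on a single plane.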

The amount of pedestrian motion represented depends on the size of the cuboid. Since the camera recording the scene is fixed, we set the cuboid size manually. We consider this an acceptable cost of our approach since the size of pedestrians remains similar over the duration of the scene.
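The representation described in this section can be sketched in a few lines. The following is a minimal illustration, assuming NumPy, a grayscale video stored as a `(frames, height, width)` array, and finite-difference gradients from `numpy.gradient`; the function names `iter_cuboids` and `motion_pattern` are hypothetical and the gradient operator actually used in the chapter is not specified.

```python
import numpy as np

def iter_cuboids(video, size=(20, 40, 40)):
    """Yield ((t, row, col), cuboid) over a regular grid; partial cuboids are dropped."""
    dt, dy, dx = size
    frames, height, width = video.shape
    for t in range(frames // dt):
        for r in range(height // dy):
            for c in range(width // dx):
                yield (t, r, c), video[t * dt:(t + 1) * dt,
                                       r * dy:(r + 1) * dy,
                                       c * dx:(c + 1) * dx]

def motion_pattern(cuboid):
    """Fit a 3D Gaussian to the spatio-temporal gradients of one cuboid.

    Returns (mu, sigma): the 3-vector mean and 3x3 covariance of the
    per-pixel gradients ordered as (horizontal, vertical, temporal).
    """
    cuboid = np.asarray(cuboid, dtype=float)
    # np.gradient returns one array per axis, in axis order (t, y, x).
    g_t, g_y, g_x = np.gradient(cuboid)
    grads = np.stack([g_x.ravel(), g_y.ravel(), g_t.ravel()], axis=1)  # (N, 3)
    mu = grads.mean(axis=0)
    centered = grads - mu
    sigma = centered.T @ centered / grads.shape[0]  # maximum-likelihood covariance
    return mu, sigma
```

A cuboid whose intensity is a linear ramp has the same gradient at every pixel, so its fitted covariance is zero; a cuboid containing two opposing motions would instead spread the gradients and produce a full-rank covariance.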

## 4 Prototypical Motion Patterns

By directly modeling the video as a collection of local spatio-temporal motion patterns, we reduce the size of the video representation from a set of raw pixels to a collection of Gaussian parameters. This representation, however, is still quite large. For example, a one minute video with resolution 720×480 will have 19,440 cuboids of size 40×40×20, resulting in 233,280 parameters. As shown in Fig. 1(c), we further reduce the size of this representation by identifying common local spatio-temporal motion patterns. We extract prototypical local spatio-temporal motion patterns (prototypes) that represent the characteristic movements of pedestrians within the scene. Note that similar local spatio-temporal motion patterns can occur at disjoint space-time locations in the video, and it is this recurrence that forms the underlying steady-state of the motion of the crowd.
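The counts quoted above can be checked with a short calculation. This sketch assumes a frame rate of 30 frames per second (not stated in the text) and counts all nine entries of the 3×3 covariance together with the three entries of the mean.

```python
width, height = 720, 480
fps, seconds = 30, 60            # assumed frame rate
cub_w, cub_h, cub_t = 40, 40, 20

tubes = (width // cub_w) * (height // cub_h)   # spatial grid locations
per_tube = (fps * seconds) // cub_t            # cuboids along the time axis
cuboids = tubes * per_tube
params_per_cuboid = 3 + 3 * 3                  # mean vector + full covariance matrix

print(tubes, cuboids, cuboids * params_per_cuboid)  # 216 19440 233280
```

The 216 spatial locations agree with the number of tubes reported for the synthetic sequence in the experiments.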

We measure the similarity between two local spatio-temporal motion patterns *A* and *B* by a distance \(\tilde{d}(A, B)\) based on the Kullback-Leibler (KL) divergence. Here \(d(A, B)\) is the KL divergence, \(\mathcal{K}(\cdot)\) is the condition number of the matrix, and \(d_{\Sigma}\) and \(d_{\mu}\) are limits on the norms to ensure the distributions are reasonably similar. We refer to this measure as the KL distance for the remainder of this paper. By using the KL divergence, we distinguish the possibly complex movements of pedestrians from different cuboids directly from the 3D Gaussian distributions.
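For reference, the KL divergence between two multivariate Gaussians has a well-known closed form, sketched below with NumPy. This shows only the plain, one-directional divergence and a condition-number check; whether the chapter uses a symmetrized variant, and exactly how the limits on the norms and the condition number enter the KL distance, is not recoverable from the text, so the helper names and their use are assumptions.

```python
import numpy as np

def kl_divergence(mu_a, sigma_a, mu_b, sigma_b):
    """KL(A || B) for Gaussians A = N(mu_a, sigma_a) and B = N(mu_b, sigma_b)."""
    mu_a, mu_b = np.asarray(mu_a, float), np.asarray(mu_b, float)
    sigma_a, sigma_b = np.asarray(sigma_a, float), np.asarray(sigma_b, float)
    k = mu_a.shape[0]
    inv_b = np.linalg.inv(sigma_b)
    diff = mu_b - mu_a
    # slogdet is more stable than log(det(...)) for small covariances.
    _, logdet_a = np.linalg.slogdet(sigma_a)
    _, logdet_b = np.linalg.slogdet(sigma_b)
    return 0.5 * (np.trace(inv_b @ sigma_a) + diff @ inv_b @ diff
                  - k + logdet_b - logdet_a)

def is_well_conditioned(sigma, limit):
    """True when the covariance's condition number is below `limit` (hypothetical check)."""
    return np.linalg.cond(sigma) < limit
```

Two unit-covariance Gaussians whose means differ by one unit along a single axis have a divergence of 0.5, and the divergence of a distribution from itself is zero.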

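We identify the prototypes with an online clustering procedure. If the KL distance between an observation \(O_{t}^{n}\) and a prototype exceeds a threshold, \(d_{\text{KL}}\), for all prototypes \(\{P_{s} \mid s = 1,\ldots,S\}\), then the cuboid is considered a new prototype. Otherwise, the prototype \(P_{s}\) is updated with the new observation \(O_{t}^{n}\) by

\[P_{s} = \frac{1}{N_{s}} O_{t}^{n} + \left(1 - \frac{1}{N_{s}}\right) \widetilde{P}_{s}, \tag{4}\]

where \(N_{s}\) is the total number of observations associated with the prototype \(P_{s}\) at time *t* and \(\widetilde{P}_{s}\) is the previous value of \(P_{s}\). The set of prototypes is initially empty.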
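The thresholded assignment just described can be illustrated with a generic online clustering loop. This is a sketch rather than the chapter's procedure: it assumes that the prototype chosen for update is the nearest one, and it uses scalars with an absolute-difference distance as stand-ins for Gaussian motion patterns and the KL distance. The names `cluster_online` and `running_mean` are hypothetical.

```python
def cluster_online(observations, distance, merge, threshold):
    """Greedy online clustering with a distance threshold.

    distance(obs, proto) -> float
    merge(proto, obs, count) -> updated prototype, where count is the number
    of observations assigned to the prototype including obs.
    Returns (prototypes, counts, labels).
    """
    prototypes, counts, labels = [], [], []
    for obs in observations:
        dists = [distance(obs, p) for p in prototypes]
        if not dists or min(dists) > threshold:
            # Too far from every existing prototype: start a new one.
            prototypes.append(obs)
            counts.append(1)
            labels.append(len(prototypes) - 1)
        else:
            s = dists.index(min(dists))  # assumption: update the nearest prototype
            counts[s] += 1
            prototypes[s] = merge(prototypes[s], obs, counts[s])
            labels.append(s)
    return prototypes, counts, labels

# Toy usage with scalars standing in for Gaussian motion patterns.
running_mean = lambda proto, obs, n: obs / n + (1 - 1 / n) * proto
protos, counts, labels = cluster_online(
    [0.0, 0.2, 5.0, 5.2, 0.1], lambda a, b: abs(a - b), running_mean, threshold=1.0)
```

In the toy run the five values fall into two groups, one near 0.1 and one near 5.1, with the prototype of each group tracking the running mean of its members.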

Since the prototypes \(P_{s}\) are multivariate Gaussian distributions, (4) should reflect the spatio-temporal gradients that the local spatio-temporal motion patterns represent. To solve (4) with respect to the KL divergence, we use the expected centroid presented by Myrvoll and Soong [13], which is expressed in terms of \(\boldsymbol{\mu}_{t}^{n}\), \(\boldsymbol{\varSigma}_{t}^{n}\), \(\boldsymbol{\mu}_{s}\), and \(\boldsymbol{\varSigma}_{s}\), the means and covariance matrices of \(O_{t}^{n}\) and \(P_{s}\), respectively. Thus, we compute prototypes that represent the collection of spatio-temporal gradients for each typical movement within the scene. By extracting prototypical local spatio-temporal motion patterns, we construct a canonical representation of the video as a collection of the characteristic movements of pedestrians in local areas.
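One common way to average Gaussians so that the result still summarizes the pooled underlying samples is to match first and second moments. The sketch below applies that idea with the running weights \(1/N_{s}\) and \(1 - 1/N_{s}\); it is offered as an assumption about what a centroid of this kind looks like, not as a reproduction of the expressions in [13] or in the chapter, and `merge_gaussian` is a hypothetical name.

```python
import numpy as np

def merge_gaussian(mu_prev, sigma_prev, mu_obs, sigma_obs, n_s):
    """Fold one observed Gaussian into a prototype by moment matching.

    n_s counts the observations assigned to the prototype, including the new one.
    """
    w = 1.0 / n_s
    mu = w * mu_obs + (1 - w) * mu_prev
    # Weighted average of the second moments E[x x^T] of the two Gaussians.
    second = (w * (sigma_obs + np.outer(mu_obs, mu_obs))
              + (1 - w) * (sigma_prev + np.outer(mu_prev, mu_prev)))
    return mu, second - np.outer(mu, mu)
```

Merging two unit-covariance Gaussians centered at -1 and +1 on the first axis yields a zero mean and a covariance stretched along that axis, which reflects that the prototype now covers movement in both directions.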

## 5 Distribution-Based Hidden Markov Models

While the set of prototypes provides a picture of similar movements of pedestrians within the scene, it does not capture the relationship between their occurrences. By modeling the temporal relationship between sequential local spatio-temporal motion patterns, we characterize a given video by its temporal variations. We assume that cuboids in the same spatial location exhibit the Markov property in the temporal domain since the scene is comprised of physically moving objects. We create a single HMM for each tube of observations in the video as shown in Fig. 1(e) to model the temporal evolution of local spatio-temporal motion patterns in each local region. Since each local spatio-temporal motion pattern is a 3D Gaussian of spatio-temporal gradients, we derive an HMM that can handle observations that are distributions themselves.

Ordinary HMMs are defined by the parameters \(M = \{H, \mathbf{o}, \mathbf{b}, \mathbf{A}, \boldsymbol{\pi}\}\), where *H* is the number of hidden states, \(\mathbf{o}\) the possible values of observations, \(\mathbf{b}\) a set of *H* emission probability density functions, \(\mathbf{A}\) a transition probability matrix, and \(\boldsymbol{\pi}\) an initial probability vector. We model a single HMM \(M^{n} = \{H^{n}, \mathbf{O}^{n}, \mathbf{b}^{n}, \mathbf{A}^{n}, \boldsymbol{\pi}^{n}\}\) for each spatial location \(n = 1,\ldots,N\) and associate the hidden states \(H^{n}\) with the number of prototypes \(S^{n}\) in the tube. The set of possible observations \(\mathbf{O}^{n}\) is the range of 3D Gaussian distributions of spatio-temporal gradients. Complex observations for HMMs are often quantized for use in a discrete HMM. Such quantization, however, would significantly reduce the rich motion information that the local spatio-temporal motion patterns represent. Using a distribution-based HMM allows the observations to remain 3D Gaussian distributions. Therefore, the emission probability density function for each prototype must be a distribution of distributions.

The emission probability of an observation \(O_{t}^{n}\) given the hidden state *s* is computed from the distance \(\tilde{d}(O_{t}^{n}, P_{s})\) and a standard deviation \(\sigma_{s}\), where \(P_{s}\) is the prototype given in (4), and \(\tilde{d}(\cdot)\) is the KL distance. This retains the rich motion information represented by each local spatio-temporal motion pattern and provides a probability calculation consistent with our distance measure. We compute the standard deviation by the maximum likelihood estimator (8), in which \(N_{s}\) is the number of local spatio-temporal motion patterns associated with the prototype \(P_{s}\). In practice, however, there may be too few cuboids in a specific group to estimate \(\sigma_{s}\). On such occasions, we use a 99.7 percent confidence window around \(d_{\text{KL}}\), letting \(\sigma_{s}= 3d_{\text{KL}}\).
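A density over distances of this kind can be sketched as a zero-mean Gaussian evaluated at the KL distance to the prototype. The Gaussian shape, the zero mean, and the `min_count` cutoff are assumptions made for illustration; only the fallback \(\sigma_{s} = 3d_{\text{KL}}\) is taken from the text.

```python
import math

def emission_density(distance, sigma_s):
    """Zero-mean Gaussian density evaluated at the KL distance to a prototype."""
    return (math.exp(-distance ** 2 / (2 * sigma_s ** 2))
            / (sigma_s * math.sqrt(2 * math.pi)))

def estimate_sigma(distances, d_kl, min_count=2):
    """Maximum-likelihood standard deviation of the distances about zero.

    Falls back to 3 * d_kl when fewer than min_count samples are available
    (min_count is a hypothetical cutoff).
    """
    if len(distances) < min_count:
        return 3 * d_kl
    return math.sqrt(sum(d * d for d in distances) / len(distances))
```

Observations close to their prototype receive a high density and observations far from every prototype receive a low one, so the HMM's likelihood stays tied to the same distance used for clustering.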

The distribution-based HMM represents the temporal variations of local spatio-temporal motion patterns in a sound statistical framework. The emission probability distributions are created using (4) and (8) for each prototype in the scene. Note that, while a single HMM is trained on the local spatio-temporal motion patterns in each tube, the prototypes are created using samples from the entire video volume. The parameters \(\mathbf{A}^{n}\) and \(\boldsymbol{\pi}^{n}\) are estimated by expectation maximization.

Let \(\mathcal{T}^{n}_{t}\) denote the likelihood of the *t*th temporal sequence of observations within a given video tube *n*. Thus

\[\mathcal{T}^{n}_{t} = p\left(O^{n}_{t},\ldots,O^{n}_{t+w} \mid M^{n}\right),\]

where *w* is the sequence length, and \(O^{n}_{t},\ldots,O^{n}_{t+w}\) is a subsequence of observed local spatio-temporal motion patterns at location *n*. Ideally, we would like to measure the likelihood of each individual cuboid. Since \(\mathcal{T}^{n}_{t}\) is calculated for every sequence within the tube of length *w*, each observation can be associated with *w* likelihood measures by sliding a window of size *w* over the entire video. We define an ensemble function that selects a measure from the set of likelihoods associated with the observation. We use a window size of 2 and let the ensemble function maximize over the likelihoods. This correctly classifies the cuboids except when a usual cuboid lies temporally between two unusual cuboids, a case that is rare and errs on the side of caution (a false positive). Since extremely crowded scenes may contain larger variations in one location than another, we normalize the likelihood of each observation by the minimum likelihood value of the training data in each spatial location *n*.
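The sliding-window scoring can be sketched with the standard forward algorithm. The example assumes NumPy, takes precomputed per-state emission densities as a `(T, H)` array, and reuses the initial distribution \(\boldsymbol{\pi}\) at the start of every window, which is a simplification; the per-location normalization by the minimum training likelihood is omitted. The function names are hypothetical.

```python
import numpy as np

def sequence_likelihood(emissions, A, pi):
    """Forward algorithm: likelihood of a sequence given per-step emission densities."""
    alpha = pi * emissions[0]
    for b in emissions[1:]:
        alpha = (alpha @ A) * b
    return alpha.sum()

def cuboid_likelihoods(emissions, A, pi, w=2):
    """Score each observation by the maximum likelihood over all
    length-w windows that contain it (requires len(emissions) >= w)."""
    T = len(emissions)
    windows = [sequence_likelihood(emissions[t:t + w], A, pi)
               for t in range(T - w + 1)]
    scores = []
    for t in range(T):
        # Windows starting at t-w+1 .. t contain observation t; clip to valid starts.
        covering = windows[max(0, t - w + 1):min(t, T - w) + 1]
        scores.append(max(covering))
    return np.array(scores)
```

With a window of 2, a single low-density observation lowers only its own score, because each neighbor still belongs to one window that avoids it. A usual observation sandwiched between two unusual ones has no such window, which is the false-positive case noted above.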

## 6 Experimental Results

We demonstrate the effectiveness of our model of the crowd’s motion by detecting unusual events in videos of three scenes: one simulated crowded scene and two real-world extremely crowded scenes. For each scene, we train a collection of distribution-based HMMs on a finite length training video that represents the typical motion of the crowd. We then use these HMMs to detect unusual activity on query videos of the same scene. To quantitatively evaluate our results, we manually annotate cuboids containing pedestrians moving in unusual directions. We fix the cuboid size to 40×40×20 for all experiments, as such a size captures the distinguishing characteristics of the movement of the pedestrians. The threshold values are selected empirically since they directly depend on the variations in the motion within the specific scene.

We generate a synthetic crowded scene by translating a texture of a crowd across the frame, resulting in large motion variations and nonuniform motion along border areas. The image sequence consists of 216 tubes and 9,072 total cuboids. We then insert several smaller images moving in arbitrary directions to simulate unusual pedestrian movements. The thresholds used in this experiment are \(d_{\text{KL}}=0.02\), \(d_{\Sigma}=5\), \(d_{\mu}=1\), and \(d_{\mathcal{K}}=400\). The Receiver Operating Characteristic (ROC) curve for this example is shown in Fig. 3, and is produced by varying the classification threshold. Our approach achieves a false positive rate of 0.009 and a true positive rate of 1.0. The false positives occur in cuboids lacking motion and texture.
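An ROC curve of this kind is traced by sweeping the classification threshold over the per-cuboid likelihoods and recording the two rates at each setting. The sketch below assumes that lower likelihood means more unusual and that a cuboid is flagged when its likelihood is at or below the threshold; the function name and the toy values are illustrative, not results from the chapter.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    scores: likelihoods, lower means more unusual.
    labels: True for cuboids annotated as unusual.
    Both classes must be present in labels.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s <= thr and y)
        fp = sum(1 for s, y in zip(scores, labels) if s <= thr and not y)
        points.append((fp / negatives, tp / positives))
    return points

# Toy usage: two unusual cuboids with low likelihood, two usual ones with high.
curve = roc_points([0.1, 0.2, 0.8, 0.9], [True, True, False, False])
```

In the toy case the curve reaches a true positive rate of 1.0 before any false positive appears, which corresponds to the upper-left corner of the ROC plot.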

We evaluate our method on videos\(^{1}\) from two real-world extremely crowded scenes. For both scenes, we set \(d_{\Sigma}=1\), \(d_{\mu}=1\), and \(d_{\mathcal{K}}=1000\). The first scene, shown on the top in Fig. 2, is from an extremely crowded concourse of a subway station. The scene contains a large number of moving and loitering pedestrians and employees directing traffic flow. The query video contains station employees walking against the flow of traffic. The training video contains 3,020 frames and the query video contains 380 frames. The threshold \(d_{\text{KL}}\) is 0.06. The second real-world extremely crowded scene, shown on the bottom in Fig. 2, is a wide-angle view of pedestrian traffic at the station’s ticket gate. The motion of the crowd occurs in a more constant direction than in the concourse, but still contains excessive occlusions. The query video contains instances of people reversing direction or stopping in the ticket gate. The training video contains 2,960 frames and the query video contains 300 frames. The threshold \(d_{\text{KL}}\) is 0.05.

False positives occur in both experiments for slightly irregular motion patterns such as after pedestrians exit the gate, and areas of little motion such as where the floor of the station is visible. The few false negatives in both real-world examples occur adjacent to true positives, which suggests they are harmless in practical scenarios. Intuitively, most unusual behavior that warrants personnel intervention would be less subtle than those detected here, and as such would result in a smaller number of errors. The ability to detect subtle unusual events, however, is made possible by training the HMMs on the local spatio-temporal motion patterns themselves.

… \(\sigma_{s}\) in (8) when there are insufficient observations for specific prototypes and the inclusion of disjoint local motion patterns in the prototypes. The performance with such small training data reflects the rich descriptive power of the distribution-based HMMs.

## 7 Conclusion

In this work, we introduced a novel statistical framework using local spatio-temporal motion patterns to represent the motion of the crowd in videos of extremely crowded scenes. We represented the movement of pedestrians in local areas by a local spatio-temporal motion pattern in the form of a multivariate Gaussian. We then identified prototypical local spatio-temporal motion patterns to canonically represent the characteristic movements within the video. Using a novel distribution-based hidden Markov model, we learned the statistical temporal variations of local motion patterns from a training video of an extremely crowded scene. Finally, we used this model of the motion of the crowd to identify unusual events as statistical anomalies. Our results indicate that local spatio-temporal motion patterns are a suitable representation for analyzing extremely crowded scenes. We evaluated our approach on videos of real-world extremely crowded scenes and successfully detected unusual local spatio-temporal motion patterns including movement against the normal flow of traffic, loitering, and traffic congestion. We believe our proposed framework plays an important role in the analysis of dynamic video sequences with spatially and temporally varying local motion patterns.

## Footnotes

- 1. The original videos are courtesy of Nippon Telegraph and Telephone Corporation.

## Notes

### Acknowledgements

This work was supported in part by Nippon Telegraph and Telephone Corporation and the National Science Foundation grants IIS-0746717 and IIS-0803670.

## References

- 1. Ali, S., Shah, M.: Floor fields for tracking in high density crowd scenes. In: Proc. of European Conference on Computer Vision (2008)
- 2. Andrade, E., Blunsden, S., Fisher, R.: Modelling crowd scenes for event detection. In: Proc. of International Conference on Pattern Recognition, pp. 175–178 (2006)
- 3. Black, M.: Explaining optical flow events with parameterized spatio-temporal models. In: Proc. of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 326–332 (1999)
- 4. Boiman, O., Irani, M.: Detecting irregularities in images and in video. In: Proc. of IEEE International Conference on Computer Vision, pp. 462–469 (2005)
- 5. Dee, H., Hogg, D.: Detecting inexplicable behaviour. In: Proc. of British Machine Vision Conference, pp. 477–486 (2004)
- 6. DeMenthon, D., Doermann, D.: Video retrieval using spatio-temporal descriptors. In: Proc. of the 11th ACM International Conference on Multimedia, pp. 508–517 (2003)
- 7. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: International Workshop on Performance Evaluation of Tracking and Surveillance, pp. 65–72 (2005)
- 8. Hu, W., Xiao, X., Fu, Z., Xie, D., Tan, T., Maybank, S.: A system for learning statistical motion patterns. IEEE Trans. Pattern Anal. Mach. Intell. **28**(9), 1450–1464 (2006)
- 9. Johnson, N., Hogg, D.: Learning the distribution of object trajectories for event recognition. In: Proc. of British Machine Vision Conference, pp. 583–592 (1995)
- 10. Ke, Y., Sukthankar, R., Hebert, M.: Event detection in crowded videos. In: Proc. of IEEE International Conference on Computer Vision, pp. 1–8 (2007)
- 11. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. **22**(1), 79–86 (1951)
- 12. Ma, Y., Cisar, P.: Activity representation in crowd. In: Proc. of the Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition, pp. 107–116 (2008)
- 13. Myrvoll, T., Soong, F.: On divergence based clustering of normal distributions and its application to HMM adaptation. In: Proc. of European Conference Speech Communication and Technology, pp. 1517–1520 (2003)
- 14. Nishino, K., Nayar, S.K., Jebara, T.: Clustered blockwise PCA for representing visual data. IEEE Trans. Pattern Anal. Mach. Intell. **27**(10), 1675–1679 (2005)
- 15. PETS: 10th IEEE International Workshop on Performance Evaluation of Tracking and Surveillance. http://www.pets2007.net/ (2007)
- 16. Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE **77**(2), 257–286 (1989)
- 17. Rodriguez, M., Ali, S., Kanade, T.: Tracking in unstructured crowded scenes. In: Proc. of IEEE International Conference on Computer Vision (2009)
- 18. Shechtman, E., Irani, M.: Space-time behavior based correlation. In: Proc. of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 405–412 (2005)
- 19. Wright, J., Pless, R.: Analysis of persistent motion patterns using the 3D structure tensor. In: IEEE Workshop on Motion and Video Computing, pp. 14–19 (2005)
- 20. Zhong, H., Shi, J., Visontai, M.: Detecting unusual activity in video. In: Proc. of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 819–826 (2004)