# Modeling Crowd Flow for Video Analysis of Crowded Scenes


## Abstract

In this chapter, we describe a comprehensive framework for modeling and exploiting the crowd flow to analyze videos of densely crowded scenes. Our key insight is to model the characteristic patterns of motion that arise within local space-time regions of the video and then to identify and encode the statistical and temporal variation of those motion patterns to characterize the latent, collective movements of the people in the scene. We show that this statistical crowd flow model can be used to achieve critical analysis tasks for surveillance videos of extremely crowded scenes such as unusual event detection and pedestrian tracking. These results demonstrate the effectiveness of crowd flow modeling in video analysis and point to its use in related fields including simulation and behavioral analysis of people in dense crowds.

## Keywords

Hidden Markov Model · Optical Flow · Training Video · Crowded Scene · Query Video

## 10.1 Introduction

Computer vision research, in the past few decades, has made significant strides toward efficient and reliable processing of the ever-increasing video data. These advances have mainly been driven by the need for automatic video surveillance that persistently monitors security-critical areas from fixed viewpoints. Many methods have been introduced that successfully demonstrate the extraction of meaningful information regarding the scene contents and their dynamics, including detecting people, tracking objects and pedestrians, recognizing specific actions by people and scene-wide events, and recognizing interactions among people and other scene contents.

Automated visual analysis of crowded scenes, however, remains a challenging task. As the number of people in a scene increases, nuisances that play against conventional video analysis methods surge. This is particularly true for methods that fundamentally rely on the ability to extract and track individuals. In videos of crowded scenes, the whole body of each person is hardly visible to the camera, people occlude each other and other contents in the scene, the notion of foreground and background starts to meld together, and, most importantly, the behavior of people changes to accommodate the tightness and clutter in the scene. These are nuisances not only to the computer algorithms for automated analysis but also to human operators who have to squint through the clutter for hours and days to find a single adverse activity. As such, paradoxically, automated video analysis is most needed where it is actually hardest to do.

The large number of people in a crowd, however, does in turn give rise to invaluable visual cues regarding the scene dynamics. The sheer number of people and their appearance add texture to the collective movements of the people, which we refer to as the crowd flow in this chapter. The crowd flow embodies the latent, coherent motions of individuals, which also vary dynamically across the scene and change as time passes. If we can model the crowd flow while faithfully encoding its variability both in space and time, we may use it to extract critical contextual information from the dynamic, cluttered scene.

In this chapter, we describe a comprehensive framework for modeling and exploiting crowd flow to analyze videos of densely crowded scenes. Each individual in a crowded scene is not a mere autonomous agent dictated by a set of simple rules, but is an intelligent being that makes judgments on its own movement based on local sensory input with a global perspective in mind. The movements of individuals result in the intricate yet coherent motion that organically evolves in the scene. We will model them as a structured motion field that dynamically changes its form both in space and time. In other words, our approach argues for a scene-centric representation of crowd flow modeling. This is a large departure from conventional object- or people-centric approaches that capture crowd flow as a collection of individuals and their paths.

Our key insight is to exploit the dense local motion patterns created by the large number of people and model their spatio-temporal relationships, representing the underlying intrinsic structure they form in the video. In other words, we model the variations of local spatio-temporal motion patterns to describe common behavior within the scene, and then identify the spatial and temporal relationships between motion patterns to characterize the behavior of the crowd as a whole in the specific scene of interest. We show that modeling the crowd flow helps solve critical video analysis tasks that are otherwise challenging to achieve on crowded scenes. Most importantly, we show that a scene-centric representation of the crowd flow can augment object-centric individual models to track each individual in a highly dense crowd.

We demonstrate the effectiveness of modeling and using the crowd flow for video analysis in two fundamental surveillance tasks: unusual event detection and pedestrian tracking. The experimental results show that exploiting the aggregated movements of people enables robust detection of anomalous behaviors and accurate tracking of individuals. We believe these results have direct implications for human behavior analysis, as they enable accurate tracking of individuals in dense crowds, which is essential for longitudinal “in-situ” observations of people in real-world scenes. These results also point to a novel approach to crowd simulation in which the collective movements of people are driven by statistical models learned from observations.

## 10.2 Related Work

Past work on video analysis has mostly relied on the assumption that the scene content to be analyzed can be reliably extracted in each frame. This is usually achieved by maintaining a background model, subtracting the background from the video frames to extract foreground objects (e.g., pedestrians), and then tracking each of the moving foreground objects. Subsequent analysis then relies on the paths or locations of the tracked foreground objects. Although this paradigm has been largely successful in many video analysis applications, it naturally is limited to videos of relatively sparse scenes where people and other scene contents including static background can be clearly discerned from each other.

Video analysis of crowded scenes has recently attracted interest in the computer vision community, especially to reach beyond such simple scenes and to achieve automated surveillance in more complex, cluttered scenes. Here we review some of the representative approaches to modeling such crowded scenes.

Ali and Shah [2007, 2008] model the crowd motion by averaging the observed optical flow. Their approach assumes that the crowd does not change over time, and uses the same video clips for learning and applications. Similar work by Mehran et al. [2010] uses “streaklines,” a concept from hydrodynamics, to segment videos of crowds and track pedestrians. Though streaklines encode more temporal information than the average optical flow, they do not encode the temporal relationship between consecutive observations. In contrast, we model the temporal dynamics over local areas, and use our learned crowd model to analyze videos of the same scene recorded at a different time.

Often, the term “motion pattern” is used to describe motions within the scene that are part of the same physical process [Hu et al, 2008a]. Hu et al. [2008b] identify motion patterns in crowded scenes by clustering optical flow vectors in similar spatial regions. Such work is applicable to scenes where the motion of the crowd has large, stable patches of heterogeneous flow. In near-view crowded scenes, however, a single physical process (such as the crowd) may be heterogeneous and dynamically varying. Even a single pedestrian may exhibit flow vectors in multiple directions due to their articulated motion. In contrast, we represent the motion in small space-time areas with a local motion pattern, and capture the dynamically varying heterogeneous crowd with a collection of HMMs.

Andrade et al. [2006a,b] also use a collection of hidden Markov models. The observations to the HMM are vectors of pixel locations and optical flow estimates. While these may be viewed as a form of local motion patterns, they do not directly encode the variability in motion that can occur due to poor texture or aperture problems. In contrast, our representations of local motion patterns are directional distributions of optical flow that directly encode the uncertainty in the optical flow estimate, as we later demonstrate. In addition, the dimension of their representation increases with the resolution of the video. Though they scale the frame size, such a lengthy representation still requires more training data to properly capture the covariance of the observations. In contrast, our directional distributions are parameterized by a single 3D flow vector and a concentration parameter, as we will show, which reduces the dimensionality of the representation while retaining the variance in the flow.

Other work views frequently occurring motion patterns as an annoyance. Yang et al. [2009] argue that high-entropy words, i.e., motions that occur frequently, are not useful for activity recognition since they represent noisy optical flow or areas without motion. Though noise and areas without motion are a factor, they are not the only motion patterns that can occur frequently. In extremely crowded scenes, it is exactly the high-frequency local motion patterns that define the characteristic movement of pedestrians within the crowd. In addition, the minor differences between the instances of frequently occurring motion patterns are typically ignored. Hospedales et al. [2009], for example, quantize optical flow vectors into one of four primary directions. This disregards the valuable variations of motion patterns that may be used to robustly represent different movements of pedestrians.

Other work that describes the motion in local space-time volumes assumes each cuboid contains motion in a single direction [Ke et al, 2007; Shechtman and Irani, 2005]. Often, the optical flow vectors are quantized into a number of discrete directions [Hospedales et al, 2009; Wang et al, 2009]. These representations disregard the valuable variations of motion patterns that may be used to robustly represent different movements of pedestrians.

Histograms of oriented gradients (HoG) features have been used to describe space-time volumes for human detection [Dalal and Triggs, 2005] and action recognition [Laptev et al, 2008]. The HoG feature is computed on the spatial gradients (the temporal gradient is not used), though it has been extended to the 3D space-time gradient [Kläser et al, 2008]. The orientation of spatial gradients encodes the structure of the pedestrian’s appearance, and thus is not suitable when only motion is necessary. Rather than modeling a distribution of the gradients’ orientations, we use the relationship between spatio-temporal gradients and optical flow to estimate a directional distribution of optical flow vectors that represent the possible motion within the cuboid.

## 10.3 Crowd Flow as a Collection of Local Motion Patterns

The key idea underlying our approach is to view the crowd flow as a collection of local spatio-temporal motion patterns in the scene and to model their variation in space and time with a collection of statistical models. In other words, we model the crowd flow as a dynamically evolving structure of local motions in the scene and time. This enables us to encode the global and local characteristics of the aggregate movements of people in a scene with a concise analytical expression. Such a model becomes crucial in achieving higher level analysis of the scene contents based on the stationary behavior of the whole.

Figure 10.1 shows an overview of our model. First, as shown in Fig. 10.1a, we divide a training video into spatio-temporal sub-volumes, or “cuboids,” defined by a regular grid. Second, as shown in Fig. 10.1b, we model the motion of pedestrians through each cuboid (i.e., the local motion pattern) with a 3D directional distribution of optical flow. Next, as shown in Fig. 10.1c, we train a hidden Markov model (HMM) over the local motion patterns at each grid location. This implies that we assume that the crowd will generate motion patterns that conform to first-order Markov processes at local space-time regions, which may not necessarily be true. Nevertheless, we found that the temporal variation of local crowd motion can be captured well with hidden Markov models, which also enable efficient inference of its parameter values. The hidden states of the HMMs encode the multiple possible motions that can occur at each spatial location. The transition probabilities of the HMMs encode the time-varying dynamics of the crowd motion. We represent the crowd motion by the collection of HMMs, encoding the spatially and temporally varying motions of pedestrians that comprise the entire crowd flow.

Our model has three unique characteristics that distinguish it from other methods. First, our model encodes the variability of the crowd flow both in space and time. The collection of HMMs captures the variations of the motion of pedestrians throughout the entire video volume, making it more robust and dynamically adjustable to different crowd behaviors. Second, we model the crowd flow by starting with local motion patterns. This enables the model to scale with the modality of different crowd behaviors, rather than the number of pedestrians. Finally, since our model is a set of statistical models, it may be learned from an example video of the scene and be used to analyze videos of the same scene recorded at a different time.

### 10.3.1 Modeling Local Motion Patterns

In our method, the video is viewed as a spatio-temporal volume which is subdivided into small volumes that typically span 30 pixels in horizontal and vertical spatial domain as well as 30 frames in the temporal domain. We refer to these small spatio-temporal volumes that collectively form the video as *cuboids*. We first seek to represent the motion in each cuboid in the video volume, i.e., the local motion pattern. The optical flow can be reliably estimated when the cuboid contains motion in a single direction with constant velocity and good texture. The motion in cuboids from real-world crowded scenes, however, may be difficult to estimate reliably. A cuboid may contain complex motion, i.e., motion exhibited by multiple objects moving in multiple directions or a single object that changes direction or speed. In addition, cuboids may contain little or no texture and have indeterminable motion. To handle these different cases, we model each local motion pattern with a distribution of *potential* optical flow vectors whose variance encodes the uncertainty in the optical flow estimate. These potential optical flow vectors can be directly computed from the 3D spatio-temporal gradients observed in the cuboid.
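The subdivision of the video volume into cuboids can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cuboid_grid`, the 720 × 480 frame size, and the choice to drop partial cuboids at the borders are all assumptions made for the example.

```python
def cuboid_grid(width, height, num_frames, size_xy=30, size_t=30):
    """Return the (x0, y0, f0) origin of every full cuboid in the video volume.

    Partial cuboids at the borders are dropped (an assumption of this sketch).
    """
    origins = []
    for f0 in range(0, num_frames - size_t + 1, size_t):
        for y0 in range(0, height - size_xy + 1, size_xy):
            for x0 in range(0, width - size_xy + 1, size_xy):
                origins.append((x0, y0, f0))
    return origins


# Hypothetical video: 720 x 480 pixels, 90 frames, 30 x 30 x 30 cuboids.
grid = cuboid_grid(width=720, height=480, num_frames=90)
```

Each origin identifies one cuboid, so every grid location (x0, y0) yields a temporal sequence of cuboids indexed by f0.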

Let \(\nabla I(x,y,f)\) denote the 3D spatio-temporal gradient of the video intensity *I* at pixel location (*x*, *y*) at frame *f*. The constant brightness constraint [Horn and Schunck, 1980] dictates the relationship between this 3D spatio-temporal gradient and the 3D optical flow vector **q**:

\[\nabla I(x,y,f)^{T}\,\mathbf{q} = 0. \qquad (10.1)\]

The 3D optical flow vector **q** has two degrees of freedom due to the ambiguity of global scaling and is estimated as a unit vector.

Estimating **q** from a single gradient estimate is ill-posed. For this reason, it is usually assumed that the flow is constant in the space-time area around (*x*, *y*, *f*), and surrounding gradients are used to estimate the optical flow **q**. Let \(\left \{\nabla I_{i}\vert i = 1\ldots N\right \}\) be a set of *N* spatio-temporal gradients (we have dropped *x*, *y*, *f* for notational convenience) computed at the different pixel locations of the cuboid. From the collection of the spatio-temporal gradients in a cuboid, one can estimate the optical flow vector from its Gram matrix (or the structure tensor [Wright and Pless, 2005])

\[\mathbf{G} =\sum _{i=1}^{N}\nabla I_{i}\,\nabla I_{i}^{T}.\]

The optical flow vector **q** can easily be computed as the eigenvector of **G** with the smallest eigenvalue. Note that this optical flow vector is a unit 3D vector which encodes both the direction and speed.
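As a rough sketch of this estimate, the code below accumulates the Gram matrix of the cuboid's gradients and extracts its smallest eigenvector. To stay within the Python standard library it uses power iteration on trace(**G**)·I − **G**, which is a stand-in for a library eigensolver and not something the chapter prescribes. The function names and the toy gradients are hypothetical.

```python
import math


def gram_matrix(gradients):
    """G = sum_i g_i g_i^T for 3D spatio-temporal gradients g_i = (Ix, Iy, If)."""
    G = [[0.0] * 3 for _ in range(3)]
    for g in gradients:
        for r in range(3):
            for c in range(3):
                G[r][c] += g[r] * g[c]
    return G


def smallest_eigenvector(G, iterations=2000):
    """Unit eigenvector of the symmetric PSD matrix G with the smallest eigenvalue.

    Power iteration on M = trace(G) * I - G: the dominant eigenvector of M is
    the smallest eigenvector of G. Assumes the start vector (1, 1, 1) is not
    orthogonal to the answer.
    """
    trace = G[0][0] + G[1][1] + G[2][2]
    if trace == 0.0:
        raise ValueError("no gradient energy: the motion is indeterminable")
    M = [[(trace if r == c else 0.0) - G[r][c] for c in range(3)] for r in range(3)]
    q = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        q = [sum(M[r][c] * q[c] for c in range(3)) for r in range(3)]
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    # Resolve the sign ambiguity so the temporal component is non-negative.
    return q if q[2] >= 0 else [-v for v in q]


# Toy gradients that all satisfy Ix * 1 + Iy * 0 + If * 1 = 0,
# i.e. motion of one pixel per frame along x.
gradients = [(1.0, 0.0, -1.0), (0.0, 1.0, 0.0), (2.0, 1.0, -2.0)]
q = smallest_eigenvector(gram_matrix(gradients))
```

For these gradients the result is proportional to (1, 0, 1), so the spatial and temporal components of the unit vector jointly encode direction and speed.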

The single optical flow vector computed from all the spatio-temporal gradients, however, is not a faithful representation of the motion within the cuboid. It represents the dominant motion within the cuboid but assumes that all movements align with that single direction and speed. In reality, the cuboid will contain various motions in different directions and speeds that may be roughly aligned with that single dominant vector but with significant variability. Encoding this variation of motion within each cuboid is critical in arriving at an accurate analytical model of the crowd flow. To capture this variability, we consider the *potential* optical flow vectors that would have arisen from the spatio-temporal gradient vector computed at each pixel in the cuboid.

Consider a spatio-temporal gradient \(\nabla I_{i}\) which is not necessarily on the plane defined by the 3D optical flow vector **q**. Such a point suggests that the actual motion within the cuboid may be in another direction. The potential optical flow vector \(\mathbf{v}_{i}\) for this gradient is orthogonal to \(\nabla I_{i}\), and thus satisfies the optical flow constraint in Eq. (10.1) for \(\nabla I_{i}\).
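One way to obtain a unit vector that is orthogonal to a given gradient while staying close to the dominant flow is to project **q** onto the plane orthogonal to that gradient and renormalize. The sketch below does exactly that. The projection is an assumption made for illustration: the text above only states that each potential flow vector is orthogonal to its gradient, and the function name `potential_flow` is hypothetical.

```python
import math


def potential_flow(q, gradient):
    """Project the unit flow q onto the plane orthogonal to `gradient`.

    Returns a unit vector v with gradient . v = 0, or None when the
    projection is undefined (zero gradient, or q parallel to the gradient).
    """
    gg = sum(g * g for g in gradient)
    if gg == 0.0:
        return None  # no gradient, so no constraint on the motion
    scale = sum(a * b for a, b in zip(q, gradient)) / gg
    v = [a - scale * g for a, g in zip(q, gradient)]
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return None
    return [c / norm for c in v]


# Dominant flow "no motion" (purely temporal) and a gradient that disagrees.
v = potential_flow((0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
```

A gradient that already satisfies the constraint for **q** leaves **q** unchanged, so the spread of the resulting vectors reflects how much the gradients in the cuboid disagree with the dominant flow.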

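We model the potential optical flow vectors with a von Mises-Fisher distribution on the unit sphere,

\[\mathrm{p}(\mathbf{v};\boldsymbol{\mu },\kappa ) = c(\kappa )\exp \left (\kappa \,\boldsymbol{\mu }^{T}\mathbf{v}\right ),\]

where \(\boldsymbol{\mu }\) is the mean direction, *c*(*κ*) is a normalization constant, and *κ* is the concentration parameter. We fit a von Mises-Fisher distribution to the potential optical flow vectors \(\{\mathbf{v}_{i}\,\vert \,i = 1,\ldots,N\}\). Mardia and Jupp [1999] show that the sufficient statistic for estimating \(\boldsymbol{\mu }\) and *κ* is

\[\mathbf{r} = \frac{1}{N}\sum _{i=1}^{N}\mathbf{v}_{i},\]

from which \(\boldsymbol{\mu } = \mathbf{r}/\vert \mathbf{r}\vert\) and \(\kappa = A_{3}^{-1}(\vert \mathbf{r}\vert )\), where \(A_{3}(\kappa ) =\coth \kappa - 1/\kappa\).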
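A minimal sketch of this fit is given below, assuming that **r** is the mean of the unit vectors and that \(A_{3}(\kappa ) =\coth \kappa - 1/\kappa\), the standard form for the sphere. The inverse has no closed form, so the sketch inverts \(A_{3}\) by bisection. The bisection, the cap `kappa_max`, and the function names are choices made for this example.

```python
import math


def a3(kappa):
    """A_3(kappa) = coth(kappa) - 1/kappa, increasing from 0 towards 1."""
    if kappa < 1e-4:
        return kappa / 3.0  # leading series term, avoids cancellation
    return 1.0 / math.tanh(kappa) - 1.0 / kappa


def fit_vmf(vectors, kappa_max=1e4):
    """Fit (mu, kappa) to unit 3D vectors; mu is None if the vectors cancel."""
    n = len(vectors)
    r = [sum(v[d] for v in vectors) / n for d in range(3)]
    r_len = math.sqrt(sum(c * c for c in r))
    if r_len == 0.0:
        return None, 0.0  # uniform over the sphere
    mu = [c / r_len for c in r]
    if a3(kappa_max) <= r_len:
        return mu, kappa_max  # nearly identical vectors: cap the concentration
    lo, hi = 0.0, kappa_max
    for _ in range(200):  # bisection on the monotone function A_3
        mid = 0.5 * (lo + hi)
        if a3(mid) < r_len:
            lo = mid
        else:
            hi = mid
    return mu, 0.5 * (lo + hi)


# Two orthogonal unit vectors give a moderately concentrated distribution.
mu, kappa = fit_vmf([(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)])
```

Identical vectors drive |**r**| to 1 and the concentration to the cap, while vectors spread evenly over the sphere drive |**r**| and the concentration towards 0.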

As illustrated in Fig. 10.3, the concentration parameter *κ* characterizes the uncertainty in the optical flow estimate within the cuboid. Cuboids containing motion in a single direction have a high concentration parameter, yielding a narrow distribution. Cuboids with complex motion have a wide distribution, indicating motion may occur in different directions. Cuboids with little or no texture have distributions across the entire sphere, indicating that motion may be occurring in any direction. Each local spatio-temporal motion pattern *O* is defined by a mean 3D optical flow vector \(\boldsymbol{\mu }\) and a concentration parameter *κ* that encodes the uncertainty of the estimate.

### 10.3.2 Modeling the Dynamics of Local Motion Patterns

Now that we have a representation of the local motion patterns, we model their variation in space and time with a collection of hidden Markov models (HMMs) to encode the crowd flow in the video.

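The hidden state \(s_{t}\) of each HMM takes one of *J* values. Each HMM is defined by a *J* × 1 initial state probability vector \(\boldsymbol{\pi }\), a *J* × *J* state transition matrix **A**, and the emissions densities \(\left \{\mathrm{p}(O_{t}\vert s_{t} = j)\,\vert \,j = 1,\ldots,J\right \}\). The likelihood of starting in a specific state *j* is encoded by the initial probability vector, \(\boldsymbol{\pi }(j) =\mathrm{ p}(s_{1} = j)\), and the likelihood of transitioning from state *i* to state *j* is represented by the state transition matrix, \(\mathbf{A}(i,j) =\mathrm{ p}(s_{t+1} = j\vert s_{t} = i)\).

The posterior of the hidden states is computed with the forwards step of the Forwards-Backwards algorithm [Rabiner, 1989]. Let \(\mathbf{b}_{t}\) be a *J* × 1 vector of likelihoods where \(\mathbf{b}_{t}(j) =\mathrm{ p}(O_{t}\vert s_{t} = j)\), and let the forwards message \(\hat{\boldsymbol{\alpha }}_{t}\) be the *J* × 1 vector of posteriors \(\hat{\boldsymbol{\alpha }}_{t}(j) =\mathrm{ p}(s_{t} = j\vert O_{1},\ldots,O_{t})\), where \(O_{1},\ldots,O_{t}\) are the observations up to time *t*. After the first observation, the message is initialized

\[\hat{\boldsymbol{\alpha }}_{1} = \frac{\boldsymbol{\pi }\odot \mathbf{b}_{1}}{\mathbf{1}^{T}\left (\boldsymbol{\pi }\odot \mathbf{b}_{1}\right )},\]

and each subsequent observation updates it by

\[\hat{\boldsymbol{\alpha }}_{t} = \frac{\mathbf{b}_{t} \odot \left (\mathbf{A}^{T}\hat{\boldsymbol{\alpha }}_{t-1}\right )}{\mathbf{1}^{T}\left (\mathbf{b}_{t} \odot \left (\mathbf{A}^{T}\hat{\boldsymbol{\alpha }}_{t-1}\right )\right )}, \qquad (10.14)\]

where \(\odot\) denotes the element-wise product.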

An important aspect of the forwards step is that it may be computed online. When each new observation *O* _{ t } becomes available, the new posterior \(\hat{\boldsymbol{\alpha }}_{t}\) may be computed efficiently by Eq. (10.14). We use this characteristic of HMMs in our applications to achieve online operation.
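The online character of the forwards step can be illustrated with the standard normalized recursion for a discrete-state HMM. The sketch below assumes row-stochastic transitions `A[i][j]` (from state *i* to state *j*) and takes the emission likelihoods of the current observation as given. The function names and the two-state numbers are hypothetical.

```python
def forward_init(pi, b1):
    """Normalized forwards message after the first observation."""
    alpha = [p * b for p, b in zip(pi, b1)]
    total = sum(alpha)
    return [a / total for a in alpha]


def forward_update(alpha_prev, A, b_t):
    """One online step: alpha_t(j) is proportional to
    b_t(j) * sum_i A[i][j] * alpha_prev(i)."""
    J = len(alpha_prev)
    predicted = [sum(A[i][j] * alpha_prev[i] for i in range(J)) for j in range(J)]
    alpha = [b * p for b, p in zip(b_t, predicted)]
    total = sum(alpha)  # assumed non-zero: some state explains the observation
    return [a / total for a in alpha]


# Hypothetical two-state model and emission likelihoods for two observations.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
alpha1 = forward_init(pi, [0.8, 0.2])
alpha2 = forward_update(alpha1, A, [0.1, 0.9])
```

Each update touches only the previous message and the newest observation, which is why the posterior can be maintained as the video streams in.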

Next, we turn our attention to the form of the emission density of an HMM \(\left \{\mathrm{p}(O_{t}\vert s_{t} = j)\,\vert \,j = 1,\ldots,J\right \}\). Each observation \(O_{t} = \left \{\boldsymbol{\mu }_{t},\kappa _{t}\right \}\) is a local motion pattern, defined by the 3D mean optical flow vector \(\boldsymbol{\mu }_{t}\) and the concentration parameter *κ* _{ t }. Often complex observations are quantized using a codebook, making the emission densities discrete. This can decrease the training time, but reduces the amount of information represented by each emission density.

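We instead use continuous emission densities that retain the information in both \(\boldsymbol{\mu }_{t}\) and \(\kappa _{t}\). To achieve this, we assume that the mean vector \(\boldsymbol{\mu }_{t}\) and concentration parameter \(\kappa _{t}\) are statistically independent,

\[\mathrm{p}(O_{t}\vert s_{t} = j) =\mathrm{ p}(\boldsymbol{\mu }_{t}\vert s_{t} = j)\,\mathrm{p}(\kappa _{t}\vert s_{t} = j).\]

We model \(\mathrm{p}(\kappa _{t}\vert s_{t} = j)\) as a Gamma distribution with shape parameter \(a^{j}\) and scale parameter \(\theta ^{j}\). We model \(\mathrm{p}(\boldsymbol{\mu }_{t}\vert s_{t} = j)\) as a von Mises-Fisher distribution (i.e., the conjugate prior on \(\boldsymbol{\mu }_{t}\) [Mardia and El-Atoum, 1976]) defined by a mean direction \(\boldsymbol{\mu }_{0}^{j}\) and a concentration parameter \(\kappa _{0}^{j}\).

The maximum likelihood estimate of the shape parameter \(a^{j}\) has no closed form, and thus we use the numerical technique from Choi and Wette [1969]. Given an estimate of \(a^{j}\), maximum likelihood is used to estimate the scale parameter \(\theta ^{j}\).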
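Under the independence assumption, the emission log-likelihood of an observation is the sum of a von Mises-Fisher term for the mean direction and a Gamma term for the concentration. The sketch below evaluates it with the standard densities, using \(c(\kappa ) =\kappa /(4\pi \sinh \kappa )\) on the sphere and the shape/scale form of the Gamma density. The dictionary keys (`mu0`, `kappa0`, `a`, `theta`) and the parameter values are hypothetical.

```python
import math


def log_vmf(x, mu, kappa):
    """Log density of a 3D von Mises-Fisher distribution at unit vector x."""
    if kappa < 1e-8:
        return -math.log(4.0 * math.pi)  # uniform on the sphere
    # log sinh(kappa) written so that large kappa does not overflow.
    log_sinh = kappa + math.log(-math.expm1(-2.0 * kappa)) - math.log(2.0)
    log_c = math.log(kappa) - math.log(4.0 * math.pi) - log_sinh
    return log_c + kappa * sum(a * b for a, b in zip(mu, x))


def log_gamma_pdf(k, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at k > 0."""
    return ((shape - 1.0) * math.log(k) - k / scale
            - math.lgamma(shape) - shape * math.log(scale))


def log_emission(obs_mu, obs_kappa, state):
    """log p(O_t | s_t = j) for O_t = (obs_mu, obs_kappa)."""
    return (log_vmf(obs_mu, state["mu0"], state["kappa0"])
            + log_gamma_pdf(obs_kappa, state["a"], state["theta"]))


# Hypothetical hidden state: mostly static content with moderate concentration.
state = {"mu0": (0.0, 0.0, 1.0), "kappa0": 5.0, "a": 2.0, "theta": 3.0}
aligned = log_emission((0.0, 0.0, 1.0), 6.0, state)
opposed = log_emission((0.0, 0.0, -1.0), 6.0, state)
```

Working in the log domain keeps the product of the two densities numerically stable when the concentration is large.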

## 10.4 Using the Crowd Flow Model

The collection of hidden Markov models now encodes the spatial and temporal variation of the local motion patterns and captures their dynamics both locally and globally. The crowd flow is encoded in this collection of HMMs as the spatially and temporally stationary behaviors of the local motion patterns. Now we are in a position to exploit this statistical crowd flow model to achieve challenging tasks in highly cluttered scenes. We will demonstrate the power of the model in two important video analysis applications, namely unusual event detection and pedestrian tracking.

### 10.4.1 Detecting Unusual Local Events

Unusual event detection is a key application in automatic surveillance systems. The sheer number of surveillance cameras deployed produces an abundance of video that is often only viewed after an incident occurs. By automatically detecting disturbances within the scene, the automatic surveillance system can alert security personnel as soon as an incident occurs.

Large-scale unusual events, such as stampedes, incidents of violence, and crowd panic, are rare, even though they are a primary motivation for automatic video surveillance. While these large crowd disturbances are an area of interest, they are not the only disturbance that may need to be detected in crowded scenes. Since crowded scenes may contain any number of moving objects, a key application is the detection of activities by one or few of the scene’s constituents that happen in local areas. Detecting such local anomalies is of great interest, especially in very crowded scenes since they can easily go unnoticed or disguised due to the heavy clutter within the scene.

To detect local unusual events, we identify local motion patterns in a query video of the same scene that statistically deviate from the learned model. Specifically, we detect local motion patterns that have low likelihood given the spatio-temporal dynamics of the crowd. We demonstrate with real-world data that the method enables the detection of subtle yet important anomalous activities in high-density crowds, such as individuals moving against the usual flow of traffic or stopping in otherwise high-motion areas. Such unusual activity may only have a subtle effect on the entire crowd, but still be a disturbance that requires intervention from security personnel.

#### 10.4.1.1 Finding Deviations from the Crowd Flow

The collection of HMMs represents the underlying steady-state motion of the crowd by the spatial and temporal variations of local motion patterns. We seek to identify whether a specific local motion pattern contains unusual pedestrian activity. For the purpose of this work, we consider a local motion pattern unusual if it occurs infrequently or is absent from the training video. We derive a probability measure of how much a specific local motion pattern deviates from the crowd motion in order to identify unusual events.

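Given a query video clip \(O_{1},\ldots,O_{T}\) at a grid location, where *T* is the last local motion pattern in the video clip, we measure the likelihood of each local motion pattern \(O_{t}\). Exploiting the statistical independence properties of the HMM yields a likelihood measure in terms of the emission densities \(\mathrm{p}(O_{t}\vert s_{t} = j)\) and the posteriors \(\boldsymbol{\gamma }\) of the hidden states. The emission likelihood is low for every state *j* when \(O_{t}\) was absent from the training data.

Computing \(\boldsymbol{\gamma }\) requires the entire clip. The measure may instead be computed from the forwards messages \(\boldsymbol{\alpha }\), which only requires the observations up to time *t* to be available, but does not consider the transition out of the observation at time *t*. As such, some cuboids are incorrectly classified but anomaly detection can be performed online.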

Often, the crowd can display different modalities at different spatial locations of the scene. For example, some areas may regularly contain no motion, while others contain motion in multiple directions. As a result, the ideal threshold value may change with the spatial location. We account for this by dividing our likelihood measure \(\tilde{\mathcal{T}}_{t}\) by the average likelihood of the training data.
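The effect of this normalization can be sketched with a few made-up numbers. The function name `unusual_mask`, the likelihood values, and the threshold are all hypothetical: the point is only that dividing by the location's average training likelihood lets one threshold serve locations with very different typical likelihoods.

```python
def unusual_mask(likelihoods, training_mean, threshold):
    """Flag observations whose normalized likelihood falls below the threshold."""
    return [lik / training_mean < threshold for lik in likelihoods]


# A busy location (high typical likelihood) and a quiet one (low typical
# likelihood) share one threshold after normalization.
busy = unusual_mask([0.40, 0.02], training_mean=0.50, threshold=0.2)
quiet = unusual_mask([0.008, 0.0004], training_mean=0.01, threshold=0.2)
```

Without the division, a single raw threshold of, say, 0.1 would flag every observation at the quiet location.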

#### 10.4.1.2 Experimental Results

After training the crowd flow models on videos of normal activities of target scenes, we detect unusual movements of pedestrians in query videos of the same scene recorded at a different time. For this chapter, we detect anomalies in a concourse and a ticket gate area of a station. The length of the training videos varied between 540 and 3,000 frames, depending on the specific example. We use cuboids of size 30 × 30 × 20 for the ticket gate scene and 40 × 40 × 20 for the concourse scene.

Figure 10.4 shows successful detection of unusual movements of pedestrians in local areas. Figure 10.4a, from the ticket gate scene, shows detection of pedestrians reversing directions in the turnstiles. Figure 10.4b shows successful detection of pedestrians in the concourse scene moving from left to right against the regular crowd traffic. The training video used for the examples consists of pedestrians moving in many different directions, but not from the left side of the scene to the right. These examples illustrate the unique ability of our approach to detect irregular local motion patterns within a crowded scene comprised of diverse movements of pedestrians.

The type of detected events depends entirely on the training data. Figure 10.5 shows detection of pedestrians loitering in otherwise high-traffic areas. Since the training video contains typical crowd motion, the lack of pedestrians (e.g., the empty turnstiles in the ticket gate scene) deviates from the model. This dependency on the training data is not only expected, it is desirable. It allows users of our approach to decide which particular local movements of pedestrians they consider usual by including them in the training video.

It is unreasonable to expect that all possible typical local motion patterns will be contained in the training video. Inevitably some typical local motion patterns will not be captured by the training data, and result in incorrect classifications such as false positives. These are exacerbated by the fact that the events being detected are subtle, local movements of pedestrians. Events that are dramatically different from the training sequence, such as global crowd disturbances, will result in fewer false positives. As shown in Fig. 10.5, the few false negatives in both scenes always occur adjacent to true positives, which suggests they are harmless in practical scenarios.

Figure 10.6 shows the receiver operating characteristic (ROC) curves (generated by varying the likelihood threshold) for all of the clips. Our approach performs with significant accuracy on each of the example videos. In video C5, the upper bodies of the loitering pedestrians move left and right and exhibit motion patterns similar to that of the crowd. This failure indicates that our approach associates similar motion patterns that may be caused by dissimilar movement, a side effect caused by the robustness of the prototypical distributions.

Figure 10.7 shows the detection accuracy using the online likelihood measure (computed from \(\boldsymbol{\alpha }\)) compared with the full likelihood measure \(\boldsymbol{\gamma }\). The online method achieves comparable accuracy to the offline computation.

Figure 10.8 shows the effects of increasing the training data size for video C1. As expected, the performance increases with longer training data, and achieves good performance with 100 observations, or 2,000 frames of video. Using only 50 observations the model achieves significant accuracy with a false positive rate (ratio of false positives to total negatives) of 0.17 and a true positive rate (ratio of true positives to total positives) of 0.88. This strong performance with few observations directly results from the crowd’s high density. Since the scene contains a large number of pedestrians, significant variations in local motion patterns occur even in short video clips.
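The two rates defined above, and an ROC curve obtained by sweeping the likelihood threshold, can be computed as in the following sketch. The scores and ground-truth labels are invented for illustration, and the function names are hypothetical. A cuboid is flagged as unusual when its score falls below the threshold.

```python
def rates(scores, labels, threshold):
    """Return (true positive rate, false positive rate) at one threshold.

    labels[i] is True when cuboid i is truly unusual.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def roc_curve(scores, labels):
    """(fpr, tpr) points obtained by sweeping the threshold over the scores."""
    thresholds = sorted(set(scores)) + [float("inf")]
    return [rates(scores, labels, th)[::-1] for th in thresholds]


# Invented normalized likelihoods and ground truth for five cuboids.
scores = [0.1, 0.4, 0.3, 0.8, 0.9]
labels = [True, True, False, False, False]
tpr, fpr = rates(scores, labels, threshold=0.35)
curve = roc_curve(scores, labels)
```

At the threshold 0.35 one of the two unusual cuboids is detected and one of the three usual cuboids is falsely flagged.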

### 10.4.2 Using the Crowd to Track Individuals

Tracking objects or people is a crucial step in video analysis with a wide range of applications including behavior modeling and surveillance. Conventional tracking methods typically assume a static background or easily discernible moving objects, and as a result are limited to scenes with relatively few constituents. Videos of crowded scenes present significant challenges to tracking due to the large number of pedestrians and the frequent partial occlusions that they produce.

We can leverage the learned crowd flow model to track individual pedestrians in videos of crowded scenes. Specifically, we leverage the crowd flow as a prior in a Bayesian tracking framework. We use the crowd motion to predict local motion patterns in videos containing the pedestrian that we wish to track. Next, we use the predicted local motion pattern as a prior on the state-transition distribution in a particle filter framework to track individuals. We show that our approach accurately predicts the motion that a target will exhibit during tracking and leads to accurate tracking of individuals, which is otherwise extremely challenging.

#### 10.4.2.1 Predicting Motion Patterns

We train the crowd flow model on a video of a crowded scene containing typical crowd behavior. Next, we use it to predict the local motion patterns at each location of a different video of the same scene. Note that, since we create a scene-centric model based on the changing motion in local regions, the prediction is independent of which individual is being tracked. In fact, we predict the local motion pattern at all locations of the video volume given only the previous frames of the video.

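The predictive distribution of the local motion pattern \(O_{t}\) given the previous observations is

\[\mathrm{p}(O_{t}\vert O_{1},\ldots,O_{t-1}) =\sum _{j=1}^{J}\mathrm{p}(O_{t}\vert s_{t} = j)\sum _{i=1}^{J}\mathrm{p}(s_{t} = j\vert s_{t-1} = i)\,\hat{\boldsymbol{\alpha }}_{t-1}(i), \qquad (10.26)\]

where \(\hat{\boldsymbol{\alpha }}_{t-1}\) is the posterior of the hidden states after the previous observation and \(\mathrm{p}(s_{t} = j\vert s_{t-1} = i)\) is given by the state transition matrix **A**. As such, the second summation in Eq. (10.26) may be represented by the *j*th element of \(\mathbf{A}^{T}\hat{\boldsymbol{\alpha }}_{t-1}\). The expected concentration parameter under state *j* is the mean \(a^{j}\theta ^{j}\) of its Gamma distribution. Thus the predicted local motion pattern \(\tilde{O}_{t}\) is defined by a mean direction \(\tilde{\boldsymbol{\mu }}_{t}\) and a concentration parameter \(\tilde{\kappa }_{t}\).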

During tracking, we use the previous frames of the video to predict the local motion pattern that spans the next *M* frames (where *M* is the number of frames in a cuboid). Since the predictive distribution is a function of the HMM’s transition probabilities and the hidden states’ posteriors, the prediction may be computed on-line and efficiently during the forward phase of the Forwards-Backwards algorithm [Rabiner, 1989].
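A hedged sketch of such a prediction is shown below. It assumes that the predicted pattern is summarized by expected values under the predictive distribution: the hidden-state probabilities for the next step are obtained from the transition matrix and the current posterior, the concentration is the probability-weighted Gamma mean (shape times scale), and the direction is the renormalized probability-weighted average of the per-state mean directions. The exact estimator used by the authors is not recoverable from the text, and the names `predict_pattern`, `mu0`, `a`, and `theta` are hypothetical.

```python
import math


def predict_pattern(alpha_prev, A, states):
    """Predict (mu, kappa) of the next local motion pattern.

    alpha_prev: posterior over hidden states after the latest observation.
    A[i][j]:    probability of moving from state i to state j.
    states:     per-state dicts with keys mu0 (unit 3D vector), a, theta.
    """
    J = len(alpha_prev)
    # Probability of each hidden state at the next time step.
    w = [sum(A[i][j] * alpha_prev[i] for i in range(J)) for j in range(J)]
    # Expected concentration: the Gamma mean a * theta, averaged over states.
    kappa = sum(w[j] * states[j]["a"] * states[j]["theta"] for j in range(J))
    # Expected direction: weighted average of state directions, renormalized.
    m = [sum(w[j] * states[j]["mu0"][d] for j in range(J)) for d in range(3)]
    norm = math.sqrt(sum(c * c for c in m))
    mu = [c / norm for c in m] if norm > 0.0 else None
    return mu, kappa


# Hypothetical two-state model, currently certain to be in state 0.
states = [
    {"mu0": (1.0, 0.0, 0.0), "a": 2.0, "theta": 3.0},
    {"mu0": (0.0, 1.0, 0.0), "a": 4.0, "theta": 1.0},
]
A = [[0.5, 0.5], [0.0, 1.0]]
mu_pred, kappa_pred = predict_pattern([1.0, 0.0], A, states)
```

Because the inputs are only the transition matrix and the latest forwards message, the prediction costs one matrix-vector product per grid location.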

#### 10.4.2.2 Crowd Flow Bayesian Tracking

We now use the predicted local motion pattern to track individuals in a Bayesian framework. Specifically, we use the predicted local motion pattern as a prior on the parameters of a particle filter. Our crowd flow model enables these priors to vary in space-time and dynamically adapt to the changing motions within the crowd.

In Bayesian tracking, we estimate the state \(\mathbf{x}_{f}\) of the target at time *f* given past and current measurements \(\mathbf{z}_{1:f} = \left \{\mathbf{z}_{i}\vert i = 1\ldots f\right \}\). Note that the index of each frame *f* is different from the temporal index *t* of the local motion patterns (since the cuboids span many frames). We define the state \(\mathbf{x}_{f}\) as a four-dimensional vector \({\left [x,y,w,h\right ]}^{T}\) containing the tracked target’s 2D location (in image space), width, and height, respectively. Tracking is performed by maximizing the posterior distribution

\[p\left (\mathbf{x}_{f}\vert \mathbf{z}_{1:f}\right ) \propto p\left (\mathbf{z}_{f}\vert \mathbf{x}_{f}\right )\int p\left (\mathbf{x}_{f}\vert \mathbf{x}_{f-1}\right )p\left (\mathbf{x}_{f-1}\vert \mathbf{z}_{1:f-1}\right )d\mathbf{x}_{f-1},\]

where \(\mathbf{z}_{f}\) is the frame at time *f*, \(p\left (\mathbf{x}_{f}\vert \mathbf{x}_{f-1}\right )\) is the transition distribution, \(p\left (\mathbf{z}_{f}\vert \mathbf{x}_{f}\right )\) is the likelihood, and \(p\left (\mathbf{x}_{f-1}\vert \mathbf{z}_{1:f-1}\right )\) is the posterior from the previous tracked frame. The transition distribution \(p\left (\mathbf{x}_{f}\vert \mathbf{x}_{f-1}\right )\) models the motion of the target between frames *f* − 1 and *f*, and the likelihood distribution \(p\left (\mathbf{z}_{f}\vert \mathbf{x}_{f}\right )\) represents how well the observed image \(\mathbf{z}_{f}\) matches the state \(\mathbf{x}_{f}\). Often, the distributions are non-Gaussian, and the posterior distribution is estimated using a sequential Monte Carlo method such as a particle filter [Isard and Blake, 1998] (please refer to [Arulampalam et al, 2002] for an introduction to particle filters).
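To make the roles of the transition distribution, the likelihood, and the previous posterior concrete, here is a minimal sketch of one sampling-importance-resampling update, a common particle filter variant. It is a toy under stated assumptions, not the tracker described in this chapter: the static target, the random-walk transition, the Gaussian-shaped likelihood, and all function names and parameter values are hypothetical, and the state is reduced to a 2D location.

```python
import math
import random


def sir_step(particles, transition, likelihood, rng):
    """One sampling-importance-resampling update of an equally weighted particle set."""
    # Sampling: push each particle through the transition distribution.
    proposed = [transition(p, rng) for p in particles]
    # Importance: weight each proposal by the observation likelihood.
    weights = [likelihood(p) for p in proposed]
    if sum(weights) == 0.0:
        return proposed  # no particle explains the observation; keep the proposals
    # Resampling: draw with replacement in proportion to the weights.
    return rng.choices(proposed, weights=weights, k=len(proposed))


# Toy setup (all values hypothetical): a static target observed at (10, 5).
TARGET = (10.0, 5.0)


def random_walk(p, rng, step=0.5):
    return (p[0] + rng.gauss(0.0, step), p[1] + rng.gauss(0.0, step))


def gaussian_likelihood(p, sigma=1.0):
    d2 = (p[0] - TARGET[0]) ** 2 + (p[1] - TARGET[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))


rng = random.Random(0)
particles = [(rng.uniform(0.0, 20.0), rng.uniform(0.0, 10.0)) for _ in range(400)]
for _ in range(10):
    particles = sir_step(particles, random_walk, gaussian_likelihood, rng)

# Summarize the particle set by its mean as a simple point estimate.
estimate = (sum(p[0] for p in particles) / len(particles),
            sum(p[1] for p in particles) / len(particles))
```

After resampling, the particle set is an equally weighted approximation of the posterior and serves as the prior for the next frame.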

As shown in Fig. 10.9, we impose priors on the transition distribution \(p\left (\mathbf{x}_{f}\vert \mathbf{x}_{f-1}\right )\) using the predicted local motion pattern at the space-time location defined by \(\mathbf{x}_{f-1}\). For computational efficiency, we use the cuboid at the center of the tracked target to define the priors, although the target may span several cuboids across the frame.

##### Transition Distribution

We use the predicted local motion pattern to hypothesize the motion of the tracked target between frames *f* − 1 and *f*, i.e., the transition distribution \(p\left (\mathbf{x}_{f}\vert \mathbf{x}_{f-1}\right )\). Let the state vector \(\mathbf{x}_{f} ={ \left [\mathbf{k}_{f}^{T},\mathbf{d}_{f}^{T}\right ]}^{T}\) where \(\mathbf{k}_{f} = \left [x,y\right ]\) is the target’s location (in image coordinates) and \(\mathbf{d}_{f} = \left [w,h\right ]\) is the size (width and height) of a bounding box around the target. We focus on the target’s movement between frames and use a second-degree auto-regressive model [Pérez et al, 2002] for the transition distribution of the size **d** _{ f } of the bounding box.

The transition distribution of the location \(\mathbf{k}_{f}\) reflects the target’s motion between frames *f* − 1 and *f*. We model this using the von Mises-Fisher distribution defined by the predicted local motion pattern \(\tilde{O}_{t} = \left \{\tilde{\boldsymbol{\mu }}_{t},\tilde{\kappa }_{t}\right \}\) at space-time location \(\mathbf{k}_{f-1}\). In the particle filter, a set of *N* sample locations (i.e., particles) \(\{\mathbf{k}_{f-1}^{i}\vert i = 1,\ldots,N\}\) is drawn from the prior \(p\left (\mathbf{x}_{f-1}\vert \mathbf{z}_{1:f-1}\right )\). For each sample \(\mathbf{k}_{f-1}^{i}\), we draw a 3D flow vector \({\mathbf{v}}^{i} = [v_{x}^{i},v_{y}^{i},v_{t}^{i}]\) from the predicted local motion pattern at space-time location \(\mathbf{k}_{f-1}^{i}\). We use these 3D flow vectors to update each particle.
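The sampling step can be sketched as follows. Drawing from a von Mises-Fisher distribution on the unit sphere uses the closed-form inverse CDF for the cosine of the angle to the mean direction, which is available in three dimensions. The particle update itself is an assumption made for illustration: each particle is shifted by the per-frame image displacement \((v_{x}/v_{t}, v_{y}/v_{t})\) implied by its sampled flow vector. The mean direction, the concentration, and all function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(c / n for c in v)


def sample_vmf_3d(mu, kappa, rng):
    """Draw one unit vector from a von Mises-Fisher distribution (unit mu, kappa > 0)."""
    # Cosine of the angle to the mean direction, by inverting its CDF.
    u = 1.0 - rng.random()  # in (0, 1], keeps the logarithm finite
    w = 1.0 + math.log(u + (1.0 - u) * math.exp(-2.0 * kappa)) / kappa
    w = max(-1.0, min(1.0, w))
    s = math.sqrt(1.0 - w * w)
    phi = 2.0 * math.pi * rng.random()  # uniform angle around the mean direction
    # Orthonormal basis (e1, e2, mu) so that the sample is centered on mu.
    helper = (1.0, 0.0, 0.0) if abs(mu[0]) < 0.9 else (0.0, 1.0, 0.0)
    e1 = normalize(cross(mu, helper))
    e2 = cross(mu, e1)
    return tuple(s * math.cos(phi) * e1[k] + s * math.sin(phi) * e2[k] + w * mu[k]
                 for k in range(3))


def move_particle(k, v, min_vt=1e-3):
    """Assumed update: shift (x, y) by the per-frame displacement (v_x/v_t, v_y/v_t)."""
    vx, vy, vt = v
    if abs(vt) < min_vt:
        return k  # nearly no temporal component; leave the particle in place
    return (k[0] + vx / vt, k[1] + vy / vt)


rng = random.Random(0)
mu = normalize((1.0, 0.0, 1.0))  # hypothetical predicted mean direction (x, y, t)
kappa = 50.0                     # hypothetical concentration
particles = [(100.0, 50.0)] * 200
flows = [sample_vmf_3d(mu, kappa, rng) for _ in particles]
moved = [move_particle(k, v) for k, v in zip(particles, flows)]
```

A large concentration keeps the sampled flows close to the predicted mean direction, while a small one spreads the particles more widely, which is how the prior adapts to how consistent the local motion has been.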

##### Likelihood Distribution

The likelihood distribution is typically modeled with a template *T* that represents the target’s characteristic appearance in the form of a color histogram [Pérez et al, 2002] or an image [Ali and Shah, 2008]. A template *T* and the region *R* (the bounding box defined by state \(\mathbf{x}_{f}\)) of the observed image \(\mathbf{z}_{f}\) are used to model the likelihood distribution, where *σ* is the variance selected empirically, *d*(⋅) is a distance measure, and *Z* is a normalization constant.

We define the template *T* as an image of the individual’s spatio-temporal gradients. This representation is more robust to appearance variations caused by noise or illumination changes. We use a weighted sum of the angles between the spatio-temporal gradient vectors in the observed region and the template to define the distance measure, where *M* is the number of pixels in the template, \(\mathbf{t}_{i}\) is the normalized spatio-temporal gradient vector in the template, \(\mathbf{r}_{i}\) is the normalized spatio-temporal gradient vector in the region *R* of the observed image at frame *f*, and \(\rho _{i}^{f}\) is the weight of the pixel at location *i* and frame *f*.

We estimate an error \(E_{i}^{f}\) during tracking to account for a pedestrian’s changing appearance. In the error at frame *f* and pixel *i*, *α* is the update rate (set to 0.05) and \(\mathbf{t}_{i}\) and \(\mathbf{r}_{i}\) are again the gradients of the template and observed region, respectively. To reduce the contributions of frequently changing pixels to the distance measure, the weight at frame *f* and pixel *i* is inversely proportional to the error, where *Z* is a normalization constant such that \(\sum _{i}\rho _{i}^{f} = 1\). To account for changes in appearance, the template is updated each frame by a weighted average.
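The appearance model described above can be sketched in a few lines. The distance follows the description directly (a weighted sum of angles between normalized gradient vectors). The exact forms of the error update and of the weights are assumptions made for illustration: the error is taken to be an exponential moving average of the per-pixel angle with rate *α*, and the weights are the normalized reciprocals of the errors. The three-pixel template and all function names are hypothetical.

```python
import math

ALPHA = 0.05  # update rate quoted in the text


def angle(t, r):
    """Angle in radians between two unit 3D gradient vectors."""
    d = sum(a * b for a, b in zip(t, r))
    return math.acos(max(-1.0, min(1.0, d)))


def update_errors(errors, template, region, alpha=ALPHA):
    # Assumed form: exponential moving average of the per-pixel angular error.
    return [alpha * angle(t, r) + (1.0 - alpha) * e
            for e, t, r in zip(errors, template, region)]


def weights_from_errors(errors, eps=1e-6):
    # Weights inversely proportional to the error, normalized to sum to one.
    inv = [1.0 / (e + eps) for e in errors]
    z = sum(inv)
    return [v / z for v in inv]


def distance(template, region, weights):
    # Weighted sum of angles between template and observed gradient vectors.
    return sum(w * angle(t, r) for w, t, r in zip(weights, template, region))


# Hypothetical three-pixel template and observed region (unit gradient vectors).
template = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
region = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 1.0, 0.0)]  # last pixel differs

errors = [0.1, 0.1, 0.1]  # hypothetical initial per-pixel errors
for _ in range(20):
    errors = update_errors(errors, template, region)
rho = weights_from_errors(errors)
d = distance(template, region, rho)
```

In this toy run the third pixel keeps disagreeing with the template, so its error grows and its weight shrinks, and it contributes little to the distance.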

#### 10.4.2.3 Experimental Results

We evaluated our method on videos of four scenes: the concourse and ticket gate scenes, and the sidewalk and intersection scenes from the UCF dataset [Ali and Shah, 2007]. We use a sampling importance re-sampling particle filter as in [Isard and Blake, 1998] with 100–800 particles (depending on the subject) to estimate the posterior in Eq. (10.32). We learn a crowd flow model on a video of each scene, and use it to track pedestrians in videos of the same scene recorded at a different time. The training videos for each scene have 300, 350, 300, and 120 frames, respectively. The training videos for the concourse, ticket gate, and sidewalk scenes have a large number of pedestrians moving in a variety of directions. The video for the intersection scene has fewer frames due to the limited length of video available. In addition, the training video of the intersection scene contains only a few motion samples in specific locations, as many of the pedestrians have moved to other areas of the scene at that point in time. Such sparse samples, however, still result in a useful model since most of the pedestrians are only moving in one of two directions (either from the lower left to the upper right, or from the upper right to the lower left).

Due to the perspective projection of many of the scenes, which is a common occurrence in surveillance, the sizes of pedestrians vary immensely. As such, the initial location and size of the targets are selected manually. Many methods exist for automatically detecting pedestrians and their sizes [Dalal and Triggs, 2005] even in crowded scenes [Leibe et al, 2005] and may be used to initialize the tracker.

The motion represented by the local motion pattern depends directly on the size of the cuboid. Ideally, we would like to use a cuboid size that best represents the characteristic movements of a single pedestrian. Cuboids the size of a single pedestrian would faithfully represent the pedestrian’s local motion and therefore enable the most accurate prediction and tracking. The selection of the cuboid size, however, is entirely scene-dependent, since the relative size of pedestrians within the frame depends on the camera and physical construction of the scene. In addition, a particular view may capture pedestrians of different sizes due to perspective projection. We use a cuboid of size 10 × 10 × 10 on all scenes so that a majority of the cuboids are smaller than the space-time region occupied by a moving pedestrian. By doing so, the cuboids represent the motion of a single pedestrian but still contain enough pixels to accurately estimate a distribution of optical flow vectors. Note that, since the cameras recording the scenes are static, the sizes must be determined only once for each scene prior to training. Therefore, the cuboid sizes may be determined by a semi-supervised approach that approximates the perspective projection of the scene.
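The relation between the frame index *f* and the cuboid index *t* can be made concrete with a small lookup helper. This is a sketch under assumptions not stated in the chapter: cuboids are taken to be non-overlapping, laid out on a regular grid starting at the first pixel and first frame, and the location stored in the state is taken to be the target’s center. The function names are hypothetical.

```python
CUBOID = (10, 10, 10)  # width, height, frames, as used in the experiments


def cuboid_index(x, y, frame, size=CUBOID):
    """Map a pixel location and frame number to (column, row, t) cuboid indices."""
    return (int(x) // size[0], int(y) // size[1], frame // size[2])


def target_cuboid(state, frame, size=CUBOID):
    """Cuboid containing the location stored in the state [x, y, w, h]."""
    x, y, _w, _h = state
    return cuboid_index(x, y, frame, size)


# A target centered at (105.5, 42.0) in frame 37 falls in cuboid (10, 4, 3).
idx = target_cuboid([105.5, 42.0, 20.0, 60.0], 37)
```

Under this layout ten consecutive frames share one temporal index, so the same predicted local motion pattern serves as the prior for every frame within a cuboid.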

We measure the accuracy of the predicted local motion patterns by the angle between the predicted flow \(\tilde{\mu }_{t}\) and the observed optical flow. Figure 10.10 shows the angular error averaged over the entire video for each spatial location in all four scenes. Noisy areas with little motion, such as the concourse’s ceiling, result in higher error due to the lack of reliable gradient information. High motion areas, however, have a lower error that indicates a successful prediction of the local motion patterns. The sidewalk scene contains errors in scattered locations due to the occasional visible background in the videos and close-up views of pedestrians. There is a larger amount of error in high-motion areas of the intersection scene since a relatively short video was used for training.
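The angular error measure can be sketched as below. The functions compute the angle between a predicted and an observed 3D flow vector and average it over time at one spatial location. Reporting the angle in radians and skipping frames where either vector has zero length are choices made for this illustration, and the function names are hypothetical.

```python
import math


def angular_error(predicted, observed):
    """Angle in radians between a predicted and an observed 3D flow vector."""
    d = sum(p * o for p, o in zip(predicted, observed))
    norm = (math.sqrt(sum(p * p for p in predicted))
            * math.sqrt(sum(o * o for o in observed)))
    if norm == 0.0:
        return None  # direction undefined when either vector has zero length
    return math.acos(max(-1.0, min(1.0, d / norm)))


def mean_angular_error(predicted_seq, observed_seq):
    """Average error over time at one spatial location, skipping undefined frames."""
    errs = [angular_error(p, o) for p, o in zip(predicted_seq, observed_seq)]
    errs = [e for e in errs if e is not None]
    return sum(errs) / len(errs) if errs else None


# One matching frame and one frame that is off by a right angle.
avg = mean_angular_error([(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                         [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
```

Clamping the cosine to [−1, 1] guards against rounding errors that would otherwise make `math.acos` raise for nearly parallel vectors.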

Figure 10.11 shows the predicted optical flow, colored by key in the lower left, for four frames from the sidewalk scene. Pedestrians moving from left to right are colored red, those moving right to left are colored green, and those moving from the bottom of the screen to the top are colored blue. As time progresses, our space-time model dynamically adapts to the changing motions of pedestrians within the scene as shown by the changing cuboid colors over the frames. Poor predictions appear as noise, and occur in areas of little texture such as the visible areas of the sidewalks or pedestrians with little texture.

Figure 10.12 shows a specific example of the changing predicted optical flow on six frames from the sidewalk scene. In the first two frames the predicted flow is from the left to the right, correctly corresponding to the motion of the pedestrian. In later frames the flow adjusts to the motion of the pedestrian at that point in time. Only by exploiting the temporal structure within the crowd motion are such dynamic predictions possible.

Figure 10.13 shows a visualization of our tracking results on videos from each of the different scenes. Each row shows four frames of our method tracking different targets whose trajectories are shown up to the current frame by the colored curves. The different trajectories in the same spatial locations of the frame demonstrate the ability of our approach to capture the temporal motion variations of the crowd. For example, the green target in row 1 is moving in a completely different direction than the red and pink targets, although they share the spatial location where their trajectories intersect. Similarly, the pink, blue, red, and green targets in row 2 all move in different directions in the center part of the frame, yet our method is able to track each of these individuals. Such dynamic variations that we model using an HMM cannot be captured by a single motion model. Spatial variations are also handled by our approach, as illustrated by the targets concurrently moving in completely different directions in rows 5 and 6. In addition, our method is robust to partial occlusions as illustrated by the pink target in row 1, and the red targets in rows 3, 5, and 6.

Figure 10.14 shows a failure case due to a severe occlusion. In these instances our method begins tracking the individual that caused the occlusion. This behavior, though not desired, shows the ability of our model to capture multiple motion patterns since the occluding individual is moving in a different direction. Other tracking failures occur due to poor texture. In the sidewalk scene, for example, the occasional viewable background and lack of texture on the pedestrians cause poorly-predicted local motion patterns. On such occasions, a local motion pattern that describes a relatively static structure, such as black clothing or the street, is predicted for a moving target. This produces non-smooth trajectories, such as the pink and red targets in row 5, or the red target in row 6 of Fig. 10.13.

Occasionally, an individual may move in a direction not captured by the training data. For instance, the pedestrian shown on the left of Fig. 10.15 is moving from left to right, a motion not present in the training data. Such cases are difficult to track since the crowd flow model cannot predict the pedestrian’s motion. On such occasions, the posteriors (given in Eq. (10.27)) are nearly identical (since the emission probabilities are all close to 0), and thus the predicted optical flow is unreliable. This does not mean the targets cannot be tracked, as shown by the correct trajectories in Fig. 10.15, but the tracking depends entirely on the appearance model.

We hand-labeled ground truth tracking results for 40 targets, 10 from each scene, to quantitatively evaluate our approach. Each target is tracked for at least 120 frames. The ground truth includes the target’s position and the width and height of a bounding box. The concourse and ticket gate scenes contain many pedestrians whose lower bodies are not visible at all over the duration of the video. On such occasions, the ground truth boxes are set around the visible torso and head of the pedestrian. Given the ground truth state vector **k** _{ t }, we measure the error of the tracking result \(\hat{\mathbf{k}}_{t}\) as \(\vert \vert \mathbf{k}_{t} -\hat{\mathbf{k}}_{t}\vert \vert _{2}\).
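The per-target error used in the quantitative comparison can be sketched as follows. The code assumes that the error is computed on the (x, y) location only and that a target’s score is the mean of its per-frame errors; the function names and the example tracks are hypothetical.

```python
import math


def tracking_error(truth, estimate):
    """Euclidean distance between ground-truth and estimated (x, y) locations."""
    return math.hypot(truth[0] - estimate[0], truth[1] - estimate[1])


def mean_tracking_error(truth_track, estimated_track):
    """Average per-frame error for one target."""
    errors = [tracking_error(k, k_hat)
              for k, k_hat in zip(truth_track, estimated_track)]
    return sum(errors) / len(errors)


# Two frames: a perfect estimate, then one that is 5 pixels off.
err = mean_tracking_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```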

Figure 10.16 shows the error of our method for each labeled target, averaged over all of the frames in the video, compared to a particle filter using a color-histogram likelihood and second-degree auto-regressive model [Pérez et al, 2002] (labeled as Perez). In addition, we show the results using our predicted state-transition distribution with a color-histogram likelihood (labeled as Transition Only). On many of the targets our state transition distribution is superior to the second-degree autoregressive model, though nine targets have a higher error. Our full approach improves the tracking results dramatically and consistently achieves a lower error than that of Pérez et al. [Pérez et al, 2002].

Figure 10.17 compares our approach with the “floor fields” method by Ali and Shah [Ali and Shah, 2008] and the topic model from Rodriguez et al. [Rodriguez et al, 2009]. Since the other methods do not change the target’s template size, we only measured the error in the *x*, *y* location of the target. Our approach more accurately tracks the pedestrians’ locations in all but a few of the targets. The single motion model by Ali and Shah completely loses many targets that move in directions not represented by their single motion model. The method of Rodriguez et al. [Rodriguez et al, 2009] models multiple possible movements, but is still limited since it does not include temporal information. Our temporally varying model allows us to track pedestrians in scenes that exhibit dramatic variations in the crowd motion.

Figure 10.18 shows the tracking error over time, averaged over all of the targets, using our approach, that of Ali and Shah [Ali and Shah, 2008], and that of Rodriguez et al. [Rodriguez et al, 2009]. The consistently lower error achieved by our approach indicates that we may track subjects more reliably over a larger number of frames. Our temporally varying model accounts for a larger amount of directional variation exhibited by the targets, and enables accurate tracking over a longer period of time.

## 10.5 Summary

In this chapter, we introduced a novel, space-time statistical model of the crowd flow in the image space and demonstrated its use in important video analysis applications of crowded scenes. The experimental results show that the model is able to accurately encode the inherent structural patterns of local motions that constitute the crowd flow in concise analytical forms that can then be used to evaluate conformity and predict the motion of local space-time regions in target videos. The results also show that this information can be successfully used to identify local unusual events and track individuals in videos of high density crowds. We believe the idea of modeling crowd flow from observation has strong implications in applications beyond the two we have demonstrated. In particular, we are hopeful that it will pave the way to finding realistic yet concise models of individual behaviors in crowded scenes that can directly be used in simulating large crowds and validating or even discovering new insights in behavioral studies.

## Notes

### Acknowledgements

This work was supported in part by National Science Foundation grants IIS-0746717 and IIS-0803670, and Nippon Telegraph and Telephone Corporation. The authors thank Nippon Telegraph and Telephone Corporation for providing the train station videos.

## References

- Ali, S., Shah, M.: A Lagrangian particle dynamics approach for crowd flow segmentation and stability analysis. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1–6 (2007)
- Ali, S., Shah, M.: Floor fields for tracking in high density crowd scenes. In: Proceedings of European Conference on Computer Vision (2008)
- Andrade, E., Blunsden, S., Fisher, R.: Modelling crowd scenes for event detection. In: Proceedings of International Conference on Pattern Recognition, pp. 175–178 (2006)
- Andrade, E.L., Blunsden, S., Fisher, R.B.: Hidden Markov models for optical flow analysis in crowds. In: Proceedings of International Conference on Pattern Recognition, pp. 460–463 (2006)
- Arulampalam, S.M., Maskell, S., Gordon, N.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. **50**, 174–188 (2002)
- Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2007)
- Choi, S.C., Wette, R.: Maximum likelihood estimation of the parameters of the gamma distribution and their bias. Technometrics **11**(4), 683–690 (1969)
- Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (2005)
- Horn, B.K.P., Schunck, B.G.: Determining optical flow. Tech. rep., Cambridge, MA (1980)
- Hospedales, T., Gong, S., Xiang, T.: A Markov clustering topic model for mining behaviour in video. In: Proceedings of IEEE International Conference on Computer Vision (2009)
- Hu, M., Ali, S., Shah, M.: Detecting global motion patterns in complex videos. In: Proceedings of International Conference on Pattern Recognition (2008)
- Hu, M., Ali, S., Shah, M.: Learning motion patterns in crowded scenes using motion flow field. In: Proceedings of International Conference on Pattern Recognition, pp. 1–5 (2008)
- Isard, M., Blake, A.: CONDENSATION-conditional density propagation for visual tracking. Int. J. Comput. Vis. **29**(1), 5–28 (1998)
- Ke, Y., Sukthankar, R., Hebert, M.: Event detection in crowded videos. In: Proceedings of IEEE International Conference on Computer Vision, pp. 1–8 (2007)
- Kläser, A., Marszałek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-Gradients. In: Proceedings of British Machine Vision Conference, pp. 995–1004 (2008)
- Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (2008)
- Leibe, B., Seemann, E., Schiele, B.: Pedestrian detection in crowded scenes. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 878–885 (2005)
- Mardia, A., El-Atoum, S.: Bayesian inference for the Von Mises-Fisher distribution miscellanea. Biometrika **63**(1), 203–206 (1976)
- Mardia, K.V., Jupp, P.: Directional Statistics. Wiley, Chichester (1999)
- Mehran, R., Moore, B.E., Shah, M.: A streakline representation of flow in crowded scenes. In: European Conference on Computer Vision (ECCV) (2010)
- Pérez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-based probabilistic tracking. In: Proceedings of European Conference on Computer Vision, pp. 661–675 (2002)
- Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE **77**(2), 257–286 (1989)
- Rodriguez, M., Ali, S., Kanade, T.: Tracking in unstructured crowded scenes. In: Proceedings of IEEE International Conference on Computer Vision (2009)
- Shechtman, E., Irani, M.: Space-time behavior based correlation. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 405–412 (2005)
- Wang, X., Ma, X., Grimson, W.E.L.: Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. IEEE Trans. Pattern Anal. Mach. Intell. **31**, 539–55 (2009)
- Wright, J., Pless, R.: Analysis of persistent motion patterns using the 3D structure tensor. In: IEEE Workshop on Motion and Video Computing, pp. 14–19 (2005)
- Yang, Y., Liu, J., Shah, M.: Video scene understanding using multi-scale analysis. In: Proceedings of IEEE International Conference on Computer Vision (2009)