# Pedestrian Behavior Understanding and Prediction with Deep Neural Networks


## Abstract

In this paper, a deep neural network (Behavior-CNN) is proposed to model pedestrian behaviors in crowded scenes, which has many applications in surveillance. A pedestrian behavior encoding scheme is designed to provide a general representation of walking paths, which can be used as the input and output of CNN. The proposed Behavior-CNN is trained with real-scene crowd data and then thoroughly investigated from multiple aspects, including the location map and location awareness property, semantic meanings of learned filters, and the influence of receptive fields on behavior modeling. Multiple applications, including walking path prediction, destination prediction, and tracking, demonstrate the effectiveness of Behavior-CNN on pedestrian behavior modeling.

## Keywords

Receptive Field; Association Strategy; Displacement Volume; Deep Neural Network; Walking Pattern

## 1 Introduction

Pedestrian behavior modeling is gaining increasing attention and can be used for various applications including behavior prediction [1, 2, 3, 4], pedestrian detection and tracking [5, 6, 7], crowd motion analysis [8, 9, 10, 11], and abnormal detection [12, 13, 14].

Modeling pedestrian behaviors is challenging. Pedestrian decision making is complex and can be influenced by various factors. The decision making process of individuals [15], the interactions among moving and stationary pedestrians [4, 16], and historical motion statistics of a scene all provide information for predicting future behaviors of pedestrians. While existing works focused on some of these aspects with simplified rules or energy functions [15, 17], our proposed model takes all these factors into account through a deep convolutional neural network (Behavior-CNN) and makes more reliable predictions.

Suppose two pedestrians *A* and *B* at time \(t-1\) move to occlude each other at location *C* at time *t*. The two flow vectors \((A\rightarrow C)\) and \((B\rightarrow C)\) describe the associations between \(t-1\) and *t*. If the two pedestrians then move to locations *D* and *E* at \(t+1\) with flow vectors \((C\rightarrow D)\) and \((C \rightarrow E)\), the association ambiguities between (*A*, *B*) and (*D*, *E*) cannot be resolved from the flow vectors alone. This implies an important information loss when flow maps are used as the input representation. A motion encoding scheme is therefore proposed: displacement volumes are used as the input/output of Behavior-CNN to address association ambiguity across multiple frames and to avoid cumulative errors during prediction. As shown in Fig. 1(a), the input to our system is encoded from the previous walking paths of all the pedestrians in the scene (blue dots), while the output of Behavior-CNN recovers the future walking paths of all these pedestrians (red dots).

The contributions of this paper can be summarized in three folds. (1) Long-term pedestrian behaviors are modeled with a deep CNN. In-depth investigations of the proposed Behavior-CNN are conducted on the learned location map and location awareness property, the semantic meanings of learned filters, and the influence of receptive fields on behavior modeling. (2) A pedestrian behavior encoding scheme is proposed to encode pedestrian walking paths into sparse displacement volumes, which can be directly used as input/output for deep networks without association ambiguities. (3) The effectiveness of Behavior-CNN is demonstrated through applications to path prediction, destination prediction, and tracking.

## 2 Related Work

### 2.1 Pedestrian Walking Behavior Modeling

There have been a large number of works on modeling motion patterns. Topic models [18, 19, 20, 21] were widely used for modeling crowd flows based on spatio-temporal dependency. Trajectory clustering was another way of learning motion patterns [22, 23]. These methods only learned general historical motion statistics of a scene, without modeling the decision making process of each individual.

Kitani’s work [24] focuses on path planning of a single target based on static scene structures. It does not model person-to-person interactions and cannot quickly adapt to varying scene dynamics.

Agent-based models [12, 15, 17, 25, 26] could model the decision making process of individuals and their interactions, and were used for simulation, prediction, and abnormal detection. However, historical motion statistics of scenes were not well utilized. Moreover, most agent-based methods used predefined rules, and there was no guarantee that such rules were properly designed to describe the complex pedestrian behaviors of a particular scene.

### 2.2 Deep Learning

Deep CNNs have shown impressive performance on various vision tasks [27], such as image classification [28], object detection [21, 29, 30], object tracking [31], and image segmentation [32, 33]. However, no deep model has been specially designed for pedestrian behavior modeling. The main difficulty arises from how to design the network input and output, which properly encode pedestrian behavior information and are also suitable for the CNN.

## 3 Pedestrian Behavior Modeling and Prediction

The overall framework is shown in Fig. 2. The input to our system is pedestrian walking paths in previous frames (colored curves in Fig. 2(a)). They could be obtained by simple trackers such as KLT [41]. They are then encoded into a displacement volume (Fig. 2(b)) with the proposed walking behavior encoding scheme. Behavior-CNN in Fig. 2(c) takes the encoded displacement volume as input and predicts an output displacement volume (Fig. 2(d)) for all the pedestrians simultaneously. A behavior decoding scheme then translates the output displacement volume into the future walking paths of all individuals (Fig. 2(e)).

The pedestrian walking behavior encoding scheme is introduced in Sect. 3.1, and Behavior-CNN is discussed in Sect. 3.2. The walking behavior decoding is the inverse process of the encoding. The loss function and training schemes are introduced in Sect. 3.3.

### 3.1 Pedestrian Walking Behavior Encoding

The encoding process is illustrated in Fig. 3. Let \(p_1,..., p_N\) be *N* pedestrians in a scene, \(t_1,...,t_M\) be *M* uniformly sampled time points to be used as input for behavior encoding, and \(t_M\) be the current time point. The normalized spatial location of \(p_i\) (\(i \in [1,N]\)) at time point \(t_m\) (\(m \in [1,M]\)) is denoted as \(\mathbf {l}_i^m= [x_i^m/X,y_i^m/Y]\), where \(x_i^m \in [1, X]\), \(y_i^m \in [1,Y]\) are the spatial coordinates of \(p_i\) at time \(t_m\), and [*X*, *Y*] is the spatial size of the input frames. The locations are grid based and thus discrete. A 2*M*-dimensional displacement vector \(\mathbf {d}_i = {[\mathbf {l}_i^M - \mathbf {l}_i^1, \mathbf {l}_i^M - \mathbf {l}_i^2,..., \mathbf {l}_i^M - \mathbf {l}_i^{M-1}, \mathbf {l}_i^M - \mathbf {l}_i^{M}]^T} \in \mathbb {R}^{2M}\) is used to describe pedestrian \(p_i\)’s walking path in the past *M* frames with respect to \(t_M\) (Fig. 3(b)).

The input of the CNN is constructed as a 3D displacement volume \(\mathcal {D} \in \mathbb {R}^{X \times Y \times 2M}\) based on \(\mathbf {d}_i\). For each pedestrian \(p_i\), all the 2*M* channels of \(\mathcal {D}\) at \(p_i\)’s current location \((x_i^M, y_i^M)\) are assigned the displacement vector \(\mathbf {d}_i\): \(\mathcal {D}(x_i^M,y_i^M,:) = \mathbf {d}_i + \mathbf {1}^T\), where \(\mathbf {1}^T\) represents an all-one vector. All the remaining entries of \(\mathcal {D}\) are set to zero. The elements of \(\mathbf {d}_i\) are within the range \((-1,1)\). By adding 1, \(\mathbf {d}_i\) is transformed to be in the range (0, 2) before being assigned to \(\mathcal {D}\), so that pedestrians with no movement (displacement value 1 in \(\mathcal {D}\)) can be distinguished from background locations (displacement value 0 in \(\mathcal {D}\)).
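As an illustration, this encoding can be sketched in plain Python (a toy-scale sketch: the grid size and paths are made-up values, and a dense nested list stands in for the sparse 3D tensor):

```python
def encode_paths(paths, X, Y, M):
    """Encode pedestrian walking paths into a displacement volume.

    paths: one location list per pedestrian, [(x_1, y_1), ..., (x_M, y_M)],
    with grid coordinates in [1, X] x [1, Y].
    Returns an X x Y x 2M volume (nested lists); all entries are zero
    except at each pedestrian's current location, which holds the
    shifted displacement vector d_i + 1 (values in (0, 2)).
    """
    D = [[[0.0] * (2 * M) for _ in range(Y)] for _ in range(X)]
    for path in paths:
        xM, yM = path[M - 1]                  # current location at t_M
        lM = (xM / X, yM / Y)                 # normalized current location
        d = []
        for (x, y) in path:                   # displacements w.r.t. t_M
            d += [lM[0] - x / X, lM[1] - y / Y]
        # shift by +1 so a stationary pedestrian (value 1) is
        # distinguishable from the background (value 0)
        D[xM - 1][yM - 1] = [v + 1.0 for v in d]
    return D
```

The final two channels at an occupied location always hold the shifted zero displacement \(\mathbf {l}_i^M - \mathbf {l}_i^M + 1 = 1\), which is what separates a present-but-stationary pedestrian from empty background.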

With the proposed encoding process, pedestrian walking path information is well aligned to the current location of each pedestrian (\(\mathbf {l}_i^M\) in Fig. 3(c)). All the pedestrians in the scene and their spatial relationships are preserved in \(\mathcal {D}\). Importantly, this encoding and its inverse decoding scheme avoid association ambiguity when describing pedestrian walking paths.

### 3.2 Behavior-CNN

Let \(t_1,...,t_M\) be the *M* previous time points, and \(t_{M+1},...,t_{M+M^*}\) be the \(M^{*}\) future time points to predict. As shown in Fig. 4, Behavior-CNN contains three bottom convolution layers (Fig. 4(b)), one max-pooling layer and an element-wise addition layer (Fig. 4(c)), three top convolution layers (Fig. 4(d)), and one deconvolution layer (Fig. 4(e)). conv1-5 are followed by ReLU nonlinearity layers.

The three bottom convolution layers, conv1, conv2, and conv3, are convolved with the input data of size \(X \times Y \times 2M\). conv1 contains 64 filters of size \(3 \times 3 \times 2M\), while conv2 and conv3 each contain 64 filters of size \(3 \times 3 \times 64\). Zeros are padded to each convolution input to guarantee that the feature maps of these layers have the same spatial size as the input. The three bottom convolution layers are followed by a max-pooling layer, max-pool, with stride 2, whose output size is \(X/2 \times Y/2 \times 64\). In this way, the receptive field of the network is doubled. A large receptive field is necessary for pedestrian walking behavior modeling because each individual’s behavior is significantly influenced by his/her neighbors. A learnable location bias map of size \(X/2 \times Y/2\) is added channel-wise to the pooled feature maps: every spatial location has one independent bias value shared across channels. With the location bias map, location information of the scene can be automatically learned by the proposed Behavior-CNN. As for the three top convolution layers, conv4 and conv5 contain 64 filters of size \(3 \times 3 \times 64\), while conv6 contains \(2M^*\) filters of size \(3 \times 3 \times 64\) to output the predicted displacement volume. Zeros are also padded to each convolution input to keep the output spatial size unchanged. High-level walking path information and complex walking behaviors of pedestrians are expected to be encoded in the output volume of conv6. Finally, a deconvolution layer upsamples the output of conv6 to the same spatial size as the input displacement volume, *i.e.* \(\mathcal {D^*} \in \mathbb {R}^{X \times Y \times 2M^*}\).
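As a sanity check on these sizes, the layer-by-layer feature-map shapes can be traced in plain Python (bookkeeping only, not a trainable implementation; layer names follow Fig. 4):

```python
def behavior_cnn_shapes(X, Y, M, M_star):
    """Trace feature-map shapes (width, height, channels) through
    Behavior-CNN as described in Sect. 3.2."""
    shapes = [("input", (X, Y, 2 * M))]
    for name in ("conv1", "conv2", "conv3"):          # zero-padded 3x3 convs
        shapes.append((name, (X, Y, 64)))             # spatial size preserved
    shapes.append(("max-pool", (X // 2, Y // 2, 64))) # stride-2 pooling
    shapes.append(("eltwise-add", (X // 2, Y // 2, 64)))  # + location bias map
    for name in ("conv4", "conv5"):
        shapes.append((name, (X // 2, Y // 2, 64)))
    shapes.append(("conv6", (X // 2, Y // 2, 2 * M_star)))
    shapes.append(("deconv", (X, Y, 2 * M_star)))     # upsample back to input size
    return shapes
```

With \(X=Y=256\) and \(M=M^*=5\) (the settings in Sect. 4), the trace ends at a \(256 \times 256 \times 10\) output volume, matching \(\mathcal {D^*}\).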

### 3.3 Loss Function and Training Schemes

The training samples of pedestrian walking paths can be obtained in multiple possible ways. Two strategies are tested in this paper. The annotated pedestrian locations are first used for both model training and evaluation to investigate the properties of the learned Behavior-CNN. Moreover, in order to handle real-world scenarios, our model is also trained with keypoint tracking results by the KLT tracker [41] while the human annotations are only used for evaluation.

Due to the high sparsity of input data, the network may converge to a bad local minimum if all the parameters are trained together from random initialization. Thus a layer-by-layer training strategy is adopted. A simpler network with three convolution layers is first randomly initialized and trained until convergence. Afterwards, the trained convolution layers are used as the bottom layers of Behavior-CNN (conv1-3). The following layers (max-pool, eltwise-add, conv4-6, deconv) are then appended and parameters of the newly added layers are trained from random initialization. Lastly, all the layers are jointly fine-tuned.

Stochastic gradient descent is adopted for training, and the model converges at around 10k iterations. The optimal model is chosen based on a validation set, which is a subset of the training samples.

## 4 Data and Evaluation Metric

Behavior-CNN is evaluated mainly on two datasets. Dataset I is the Pedestrian Walking Route Dataset proposed in [1]. It is 4,000 s in length, and 12,684 pedestrians are annotated. Dataset II is collected and annotated by us, following the same annotation strategy as in [1]. The complete trajectory of each of the 797 pedestrians, from the time point he/she enters the scene to the time he/she leaves, is annotated every 20 frames.

To prepare training and testing samples, \(M+M^*\) frames at time \(t_1,...,t_M\), \( t_{M+1},...,t_{M+M^*}\) are uniformly sampled from input videos, and resized to the size of \(256 \times 256\) (\(X=Y=256\)). The first *M* frames at time \(t_1,...,t_M\) are encoded to the input displacement volumes \(\mathcal {D}\) as introduced in Sect. 3.1, which are the input of the Behavior-CNN. The following \(M^*\) frames at time \(t_{M+1},...,t_{M+M^*}\) are encoded to the output displacement volume \(\mathcal {D^*}\) as the ground truth.

The encoding of \(\mathcal {D}^*\) is similar to that of \(\mathcal {D}\). A \(2M^*\)-dimensional displacement vector \(\mathbf d _i^* \in \mathbb {R}^{2M^*}\) is used to capture the future path of pedestrian \(p_i\) with respect to the current time point \(t_M\), \(\mathbf {d}_i^* = {[\mathbf {l}_i^M - \mathbf {l}_i^{M+1}, \mathbf {l}_i^M - \mathbf {l}_i^{M+2},..., \mathbf {l}_i^M - \mathbf {l}_i^{M+M^*}]}\), where \(\mathbf {l}_i^{m}\) is the normalized spatial location of pedestrian \(p_i\) at time \(t_m\) (\(m \in [M+1,M+M^*]\)). \(\mathcal {D^*} \in \mathbb {R}^{X \times Y \times 2M^*}\) are constructed by assigning \(\mathbf {d}_i^*\) to \(\mathcal {D}^*\), \(\mathcal {D^*}(x_i^M,y_i^M,:) = \mathbf {d}_i^* + \mathbf {1}^T\). With such encoding, future walking path information of each individual is also aligned to the pedestrian current location at time \(t_M\).
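The decoding of Sect. 3 is the inverse: read the \(2M^*\)-channel vector at each pedestrian's current location, undo the +1 shift, and un-normalize. A minimal sketch, assuming the volume is stored as nested lists with grid coordinates in \([1, X] \times [1, Y]\):

```python
def decode_future(D_star, cur_locs, X, Y, M_star):
    """Decode predicted future locations from an output displacement volume.

    cur_locs: each pedestrian's current grid location (x_M, y_M); the
    2*M_star channel vector at that position is read out, the +1 shift
    from encoding is removed, and l^{M+k} = l^M - d is un-normalized.
    """
    futures = []
    for (xM, yM) in cur_locs:
        vec = D_star[xM - 1][yM - 1]
        path = []
        for k in range(M_star):
            dx = vec[2 * k] - 1.0        # undo the +1 shift
            dy = vec[2 * k + 1] - 1.0
            # d = l^M - l^{M+k}, so l^{M+k} = l^M - d (normalized coords)
            path.append((round(xM - dx * X), round(yM - dy * Y)))
        futures.append(path)
    return futures
```

Because decoding reads each pedestrian's channels at his/her own current location, no cross-frame association step is needed, which is the point of the encoding scheme.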

By setting different *M* and \(M^*\), Behavior-CNN can make prediction at different time scales. In our current implementation, *M* and \(M^*\) are both set to 5, *i.e.* five time points are uniformly sampled as input and five future locations of each pedestrian are predicted. The sample interval is 20 frames (0.8 s) for both input and output. That is to say, based on the output result, our model predicts the pedestrian paths in the coming 4 s. Longer-term behaviors can be predicted by recurrently using output again as new input of Behavior-CNN (detailed in Sect. 6.2). With larger *M* values and more computation cost, performance should be slightly improved because more information is given.
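The recurrent long-term scheme mentioned above can be sketched as follows; `predict` is a hypothetical stand-in for a trained Behavior-CNN forward pass (history window in, future points out):

```python
def predict_long_term(predict, paths, M, steps):
    """Extend the prediction horizon by recurrently feeding the last M
    points of each path back as a new input window (cf. Sect. 6.2).

    predict: callable mapping a list of M-point histories (one per
    pedestrian) to a list of future-point lists; a stand-in here.
    """
    history = [list(p) for p in paths]
    for _ in range(steps):
        windows = [p[-M:] for p in history]   # most recent M points
        futures = predict(windows)
        for p, f in zip(history, futures):
            p.extend(f)                       # output becomes new input
    return history
```

With the paper's settings (5 input points, 5 predicted points, 0.8 s apart), each recurrent step extends the horizon by another 4 s, at the cost of compounding prediction error.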

4990 short clips are uniformly segmented from Dataset I, and one sample is obtained from each clip. For Dataset II, 550 samples are generated. On both datasets, the first \(90\,\%\) of the samples are used for training and the remaining for testing.

As the evaluation metric, the mean squared error (MSE) between the predicted locations and the ground-truth locations of all *N* pedestrians at all the \(M^*\) predicted time points is computed.
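One assumed reading of this metric (squared normalized-coordinate distances averaged over all pedestrians and predicted points; the exact normalization in the paper may differ):

```python
def mse(pred, gt, X, Y):
    """MSE between predicted and ground-truth locations, averaged over
    all pedestrians and all predicted time points. Locations are grid
    coordinates, normalized by the frame size (X, Y) before comparison."""
    total, count = 0.0, 0
    for p_path, g_path in zip(pred, gt):
        for (px, py), (gx, gy) in zip(p_path, g_path):
            total += ((px - gx) / X) ** 2 + ((py - gy) / Y) ** 2
            count += 1
    return total / count
```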

## 5 Investigations on Behavior-CNN

Table 1. (a) Prediction results with/without the location bias map. (b) Prediction results of different flipping strategies.

| Investigations on | MSE | Dataset I | Dataset II |
|---|---|---|---|
| (a) Location bias map | With | 2.421 % | 2.348 % |
| | Without | 2.703 % | 2.628 % |
| (b) Flipping strategies | No flipping | 2.421 % | 2.348 % |
| | Horizontal flipping | 2.470 % | 2.592 % |
| | Vertical flipping | 2.468 % | 2.585 % |
| | Horizontal and vertical flipping | 2.502 % | 2.668 % |

### 5.1 Bias Map and Location Awareness Property of Behavior-CNN

For a specific scene, different locations generally have different traffic patterns because of scene structures. The proposed bias map helps capture such information. Experiments are conducted to investigate the effect of the location bias map. The errors of the proposed method with/without the bias map are listed in Table 1(a). Without the bias map, prediction errors increase for both datasets.

One more experiment is conducted to validate the location awareness of Behavior-CNN. With the trained model (with location bias map) fixed, testing samples are flipped horizontally and/or vertically, and the results of the different flipping strategies are reported in Table 1(b). If the prediction of our model were location invariant, flipping all the pedestrian paths at all locations in the same way would make no difference to the prediction errors. However, Table 1(b) shows that the prediction error increases when testing samples are flipped, which indicates that different locations have different dependence on moving directions.

### 5.2 Learned Feature Filters of Behavior-CNN

From the feature maps generated by filters in different layers, strong correlations between specific walking patterns and filter responses can be observed. Generally speaking, the three bottom convolution layers (conv1-3) take all the pedestrian behaviors as input and gradually classify them into finer and finer categories according to various criteria. In the top layers, the influences of all the different categories are combined to generate the prediction.

For bottom convolution layers, different pedestrians are roughly classified by filters based on their walking behaviors. Examples are shown in Fig. 6(a–c). Two feature maps generated from filter \(\#33\) and filter \(\#59\) of conv1 are shown in Fig. 6(a). The high-response pedestrians in the two feature maps are visualized in Fig. 6(b). It is observed that most pedestrians with high response to filter \(\#33\) move down-leftwards, while pedestrians with high response to filter \(\#59\) move upwards. In this way, the input pedestrian paths can be classified into some rough categories by the filters in conv1. We computed the correlations between the feature maps by the two filters (Fig. 6(a)) and the locations of all moving down-leftwards/upwards pedestrians at different training iterations. As shown by the correlation curves in Fig. 6(c), the two filters gradually learned to capture these specific motion patterns during training.

Some high-response pedestrians by filters of conv2 and conv3 are shown in Fig. 6(d–e). These filters generally classify pedestrians into finer and more specific categories compared with those of conv1. In Fig. 6(d), down-leftward/upward pedestrians in Fig. 6(a) are further classified based on spatial locations, such as the left-bottom corner and the left-up corner. In Fig. 6(e), pedestrians are more meticulously classified based on precise moving directions.

For filters in higher-level layers, they generally encode more complex behaviors. As shown by one example in Fig. 6(f), stationary pedestrians are assigned with high-responses by the filter \(\#19\) of conv4, which demonstrates that stationary crowds could influence other pedestrians’ walking patterns.

### 5.3 Receptive Fields

We observe that pedestrian walking behaviors are significantly influenced by nearby pedestrians. By increasing the size of the receptive field, the sensing range of the network can be increased and the predictions are more reliable. The current receptive field size is around \(10\,\%\) of the scene, which is large enough to capture the pedestrians and activities within their nearby regions.

Two alternative net structures are designed to decrease the receptive field size. (a) The filter size of all layers is changed from \(3 \times 3\) to \(1 \times 1\); to keep the number of parameters the same, the numbers of filters are all increased 9 times accordingly. (b) The proposed net structure (3conv+pool+3conv+deconv) is simplified to 3conv+pool+3conv and 3conv+3conv by removing layers. These alternatives demonstrate the benefit of a large receptive field when predicting future pedestrian walking behaviors.

Table 2. Prediction results (MSE) of different net structures on Dataset I.

| | \(3 \times 3\) (ours) | \(1 \times 1\) |
|---|---|---|
| 3conv+pool+3conv+deconv (ours) | 2.421 % | 2.555 % |
| 3conv+pool+3conv | 2.431 % | 2.571 % |
| 3conv+3conv | 2.468 % | 2.858 % |

## 6 Experiments

### 6.1 Pedestrian Walking Path Prediction

The prediction results of the proposed Behavior-CNN are evaluated quantitatively and qualitatively on both Dataset I and Dataset II. For each dataset, two trained models were evaluated: one trained with the human-annotated pedestrian locations and the other with KLT trajectories. The KLT trajectories are not verified and may contain mistakes. All models are evaluated against the annotated ground-truth pedestrian walking paths. Due to the insufficient training samples of Dataset II, the models trained on Dataset I were used as initialization when training the models for Dataset II.^{1}

Table 3. Prediction results (MSE) of different methods trained on the annotated pedestrian walking paths or the KLT trajectories on Dataset I and Dataset II.

| | Dataset I (Annotation) | Dataset II (Annotation) | Dataset I (KLT) | Dataset II (KLT) |
|---|---|---|---|---|
| Behavior-CNN | 2.421 % | 2.348 % | 2.517 % | 3.816 % |
| Constant velocity | 6.091 % | 6.468 % | 5.864 % | 5.635 % |
| Constant acceleration | 9.899 % | 9.428 % | 6.619 % | 7.656 % |
| SVM regression | 4.639 % | 4.276 % | 5.053 % | 5.327 % |
| SFM [15] | 4.280 % | 5.921 % | 4.447 % | 5.044 % |
| LTA [17] | 4.723 % | 4.571 % | 4.346 % | 4.639 % |
| TIM [2] | 4.075 % | 4.141 % | 4.790 % | 4.790 % |

MSE introduced in Sect. 4 was evaluated and the results are reported in Table 3. Behavior-CNN achieves the best performance among all the comparisons. This is because the learned feature representations of Behavior-CNN are much more powerful and can capture complex pedestrian behaviors. The model trained with annotations (2.421 %) performs only slightly better than that trained with KLT (2.517 %) on Dataset I, which also demonstrates the robustness of the proposed method to KLT errors.

Several examples of prediction results are visualized in Fig. 7. Behavior-CNN can successfully predict complex walking patterns, such as changing walking direction, slowing down, and speeding up (Pedestrian A in Fig. 7(a)). It also learns the scene layout, which the other two methods cannot learn from training samples. Taking Pedestrian B in Fig. 7 as an example, our prediction avoids scene obstacles, while the predictions of the other two methods have the pedestrian walking into a concrete wall.

In order to validate prediction robustness, the proposed method is also evaluated on five more datasets, *i.e.* ETH [17], Hotel [17], ZARA01 [42], ZARA02 [42], and UCY [42]. Following the same experimental setup and evaluation criteria as [43], leave-one-out validation is adopted and average displacement errors of our proposed method on the five datasets are 0.35, 0.18, 0.20, 0.23, and 0.25, while [43] achieves 0.50, 0.11, 0.22, 0.25, and 0.27.

### 6.2 Application I: Pedestrian Destination Prediction

The long-term prediction results can be used for destination prediction. The destination is determined as the nearest exit to the predicted future walking path. Prediction performance was evaluated on Dataset I, where ten entrance/exit regions are labeled [1] as shown in Fig. 8(c). The *top-N* accuracy (ground truth is within the top-*N* predictions) was adopted for evaluation.
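The nearest-exit rule can be sketched as follows (exit regions are reduced to hypothetical center points for illustration; the ranked list supports the top-*N* accuracy used in evaluation):

```python
def predict_destination(future_path, exits):
    """Rank exit regions by distance to the end of the predicted path.

    future_path: predicted future locations [(x, y), ...].
    exits: list of (x, y) exit-region centers (an assumed representation
    of the labeled entrance/exit regions).
    Returns exit indices sorted nearest-first, for top-N evaluation.
    """
    fx, fy = future_path[-1]                       # farthest predicted point
    dists = [((fx - ex) ** 2 + (fy - ey) ** 2, i)
             for i, (ex, ey) in enumerate(exits)]
    dists.sort()
    return [i for _, i in dists]
```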

Three existing methods were used for comparison: the energy map modeling approach (EMM) [1], where destinations were predicted by minimizing an energy function; MDA [8], where predictions were made based on trajectory properties; and an unsupervised visual prediction approach (UVP) [44], where destinations were predicted as the nearest exit to the predicted trajectories. For fair comparison, all methods use the previous 5 frames (4 s in length) as input. Estimation results are reported in Table 4. Our method performs better as it can better predict long-term motion patterns.

### 6.3 Application II: Predictions as Tracking Prior

Table 5. Results of pedestrian tracking on Dataset I.

| Methods | KLT+Behavior-CNN | KLT+RFT [45] | KLT |
|---|---|---|---|
| Error (\(L_2\) distance) | 83.79 | 228.33 | 411.71 |

The average \(L_2\) distance between ground-truth walking paths and tracking results of 1000 pedestrians in Dataset I was used for evaluation. The results of both strategies, together with those of the baseline KLT tracking, are listed in Table 5. The proposed association strategy significantly decreases the tracking error compared with RFT [45]. As shown by the examples in Fig. 9, our method generates correct and complete trajectories, while the RFT method made wrong associations and lost the tracking targets.

## 7 Conclusion

Behavior-CNN is proposed to model pedestrian behaviors. A behavior encoding scheme is adopted to encode pedestrian behavior into sparse displacement volumes which can be directly used as network input. Behavior-CNN is thoroughly investigated in terms of the learned location map and the location awareness property, semantic meanings of learned filters, and influence of receptive fields. The effectiveness is demonstrated through multiple applications, including walking path prediction, destination prediction, and improving tracking.

## Footnotes

- 1.
The model trained solely with annotations on Dataset I yields a 4.18 % error when directly tested on Dataset II, which is still better than the comparison methods. With the bias map removed, however, the error decreases to 3.42 %. This indicates that the bias map hinders the model's transfer ability to a certain degree.

## Notes

### Acknowledgment

This work was supported in part by the Ph.D. Programs Foundation of China under Grant 20130185120039, in part by the Hong Kong Innovation and Technology Support Programme under Grant ITS/221/13FP, in part by the National Natural Science Foundation of China under Grant 61371192 and Grant 61301269, and in part by the General Research Fund through the Research Grants Council, Hong Kong, under Grant CUHK14206114, Grant CUHK14205615, Grant CUHK419412, and Grant CUHK14203015.

## References

- 1. Yi, S., Li, H., Wang, X.: Understanding pedestrian behaviors from stationary crowd groups. In: Proceedings of CVPR (2015)
- 2. Cancela, B., Iglesias, A., Ortega, M., Penedo, M.: Unsupervised trajectory modelling using temporal information via minimal paths. In: Proceedings of CVPR (2014)
- 3. Alahi, A., Ramanathan, V., Fei-Fei, L.: Socially-aware large-scale crowd forecasting. In: Proceedings of CVPR (2014)
- 4. Yi, S., Li, H., Wang, X.: Pedestrian behavior modeling from stationary crowds with applications to intelligent surveillance. TIP **25**(9), 4354–4368 (2016)
- 5. Tang, S., Andriluka, M., Milan, A., Schindler, K., Roth, S., Schiele, B.: Learning people detectors for tracking in crowded scenes. In: Proceedings of ICCV (2013)
- 6. Shu, G., Dehghan, A., Oreifej, O., Hand, E., Shah, M.: Part-based multiple-person tracking with partial occlusion handling. In: Proceedings of CVPR (2012)
- 7. Leal-Taixé, L., Fenzi, M., Kuznetsova, A., Rosenhahn, B., Savarese, S.: Learning an image-based motion context for multiple people tracking. In: Proceedings of CVPR (2014)
- 8. Zhou, B., Wang, X., Tang, X.: Understanding collective crowd behaviors: learning a mixture model of dynamic pedestrian-agents. In: Proceedings of CVPR (2012)
- 9. Nascimento, J.C., Marques, J.S., Lemos, J.M.: Modeling and classifying human activities from trajectories using a class of space-varying parametric motion fields. TIP **22**(5), 2066–2080 (2013)
- 10. Kim, K., Lee, D., Essa, I.: Gaussian process regression flow for analysis of motion trajectories. In: Proceedings of ICCV (2011)
- 11. Chang, M.C., Krahnstoever, N., Ge, W.: Probabilistic group-level motion analysis and scenario recognition. In: Proceedings of ICCV (2011)
- 12. Mehran, R., Oyama, A., Shah, M.: Abnormal crowd behavior detection using social force model. In: Proceedings of CVPR (2009)
- 13. Lu, C., Shi, J., Jia, J.: Abnormal event detection at 150 FPS in Matlab. In: Proceedings of ICCV (2013)
- 14. Mahadevan, V., Li, W., Bhalodia, V., Vasconcelos, N.: Anomaly detection in crowded scenes. In: Proceedings of CVPR (2010)
- 15. Helbing, D., Molnar, P.: Social force model for pedestrian dynamics. Phys. Rev. E **51**(5), 4282 (1995)
- 16. Yi, S., Wang, X., Lu, C., Jia, J., Li, H.: L0 regularized stationary-time estimation for crowd analysis. TPAMI **PP**(99), 1 (2016). doi:10.1109/TPAMI.2016.2560807
- 17. Pellegrini, S., Ess, A., Schindler, K., Van Gool, L.: You’ll never walk alone: modeling social behavior for multi-target tracking. In: Proceedings of ICCV (2009)
- 18. Kuettel, D., Breitenstein, M.D., Van Gool, L., Ferrari, V.: What’s going on? Discovering spatio-temporal dependencies in dynamic scenes. In: Proceedings of CVPR (2010)
- 19. Wang, X., Ma, X., Grimson, W.E.L.: Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. TPAMI **31**(3), 539–555 (2009)
- 20. Hospedales, T.M., Li, J., Gong, S., Xiang, T.: Identifying rare and subtle behaviors: a weakly supervised joint topic model. TPAMI **33**(12), 2451–2464 (2011)
- 21. Basharat, A., Gritai, A., Shah, M.: Learning object motion patterns for anomaly detection and improved object detection. In: Proceedings of CVPR (2008)
- 22. Morris, B.T., Trivedi, M.M.: Trajectory learning for activity understanding: unsupervised, multilevel, and long-term adaptive approach. TPAMI **33**(11), 2287–2301 (2011)
- 23. Wang, X., Ma, K.T., Ng, G.W., Grimson, W.E.L.: Trajectory analysis and semantic region modeling using nonparametric hierarchical Bayesian models. IJCV **95**(3), 287–312 (2011)
- 24. Kitani, K.M., Ziebart, B.D., Bagnell, J.A., Hebert, M.: Activity forecasting. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 201–214. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33765-9_15
- 25. Bonabeau, E.: Agent-based modeling: methods and techniques for simulating human systems. PNAS **99**(Suppl 3), 7280–7287 (2002)
- 26. Helbing, D., Farkas, I., Vicsek, T.: Simulating dynamical features of escape panic. Nature **407**(6803), 487–490 (2000)
- 27. Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. **2**(1), 1–127 (2009)
- 28. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of NIPS (2012)
- 29. Girshick, R.: Fast R-CNN. In: Proceedings of ICCV (2015)
- 30. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of NIPS (2015)
- 31. Wang, N., Yeung, D.Y.: Learning a deep compact image representation for visual tracking. In: Proceedings of NIPS (2013)
- 32. Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. TPAMI **35**(8), 1915–1929 (2013)
- 33. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of CVPR (2015)
- 34. Reddy, N.D., Singhal, P., Krishna, K.M.: Semantic motion segmentation using dense CRF formulation. In: Proceedings of ICVGIP (2014)
- 35. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Proceedings of NIPS (2014)
- 36. Shao, J., Kang, K., Loy, C.C., Wang, X.: Deeply learned attributes for crowded scene understanding. In: Proceedings of CVPR (2015)
- 37. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. TPAMI **35**(1), 221–231 (2013)
- 38. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of CVPR (2014)
- 39. Yan, X., Chang, H., Shan, S., Chen, X.: Modeling video dynamics with deep dynencoder. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 215–230. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10593-2_15
- 40. Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs (2015). arXiv preprint arXiv:1502.04681
- 41. Tomasi, C., Kanade, T.: Detection and tracking of point features. School of Computer Science, Carnegie Mellon University, Pittsburgh (1991)
- 42. Lerner, A., Chrysanthou, Y., Lischinski, D.: Crowds by example. In: Computer Graphics Forum, vol. 26, pp. 655–664 (2007)
- 43. Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social LSTM: human trajectory prediction in crowded spaces. In: Proceedings of CVPR (2016)
- 44. Walker, J., Gupta, A., Hebert, M.: Patch to the future: unsupervised visual prediction. In: Proceedings of CVPR (2014)
- 45. Zhou, B., Wang, X., Tang, X.: Random field topic model for semantic region analysis in crowded scenes from tracklets. In: Proceedings of CVPR (2011)