Motion pattern-based crowd scene classification using histogram of angular deviations of trajectories

Automated crowd behaviour analysis and monitoring is a challenging task due to the unpredictable nature of the crowd within a particular scene and across different scenes. Prior knowledge of the type of scene under consideration provides crucial mid-level information that can be utilized to develop robust crowd behaviour analysis systems. In this paper, we propose an approach to automatically detect the type of a crowded scene based on the global motion patterns of the objects within the scene. Three types of scenes whose global motion pattern characteristics vary from uniform to non-uniform are considered in this work, namely structured, semi-structured, and unstructured scenes. To capture the global motion pattern characteristics of an input crowd scene, we first extract the motion information in the form of trajectories using a key-point tracker and then compute the average angular orientation feature of each trajectory. This paper utilizes these angular features to introduce a novel feature vector, termed the Histogram of Angular Deviations (HAD), which depicts the distribution of the pair-wise angular deviation values between the trajectory vectors. Since angular deviation information is resistant to changes in scene perspective, we consider it a key feature for distinguishing the scene types. To evaluate the effectiveness of the proposed HAD-based feature vector in classifying crowded scenes, we build a crowd scene classification model by training classical machine learning algorithms on the publicly available Collective Motion Database. The experimental results demonstrate the superior crowd classification performance of the proposed approach as compared to the existing methods.
In addition, we propose a technique based on quantizing the angular deviation values to reduce the feature dimension, and subsequently introduce a novel crowd scene structuredness index to quantify the structuredness of an input crowded scene based on its HAD.


Introduction
The rapid advances in technology combined with the continuous growth in the human population have increased the need to develop efficient automated video surveillance-based technologies. As a result, automated crowd video surveillance has become a popular research area with the motive of ensuring crowd safety. Crowd behaviour analysis [1][2][3], crowd density estimation/crowd counting [4][5][6], crowd anomaly detection [7][8][9], and group detection [10][11][12] are some of the widely researched areas within the crowd video surveillance domain. In most of these research areas, the performance of the proposed approaches is primarily dependent on the nature and type of the crowded scene [13]. In other words, an approach that performs well for one scene does not guarantee the same performance for a different scenario, especially when the scene dynamics change.
Analysing the pattern of object motion (people/traffic) within a given crowded scene is one of the effective methods to understand the changes in scene dynamics. According to [14] and [15], crowded scenes can be categorized into structured and unstructured based on the motion patterns of objects within the scene. A structured scene consists of uniform spatio-temporal motion patterns generated by coherently moving objects across the entire scene (Fig. 1a). In other words, in a structured scene, each spatial location contains the same motion pattern, and the direction of motion remains constant most of the time. In contrast, non-uniform spatio-temporal motion patterns generated by randomly or chaotically moving objects with unpredictable and frequently varying motion directions make up an unstructured scene (Fig. 1c). Scenes with motion patterns that are neither uniform nor chaotic are called semi-structured (Fig. 1b).

Fig. 1 Examples of: a structured scene and its motion patterns, b semi-structured scene and its motion patterns, and c unstructured scene and its motion patterns. Note that the motion patterns are represented in the form of trajectories, with each colour representing a different motion pattern (best viewed in colour)
Identification of the type of a crowded scene provides crucial mid-level information about the scene under consideration. This information helps in the development of an efficient crowd behaviour analysis model. Additionally, this prior knowledge could be applied to crowd monitoring within a particular scene by re-assessing the scene type at regular intervals of time to keep track of stable-unstable changes in the state of the crowd. This paper proposes an approach to classify a given crowded scene as structured, semi-structured, or unstructured based on the motion patterns represented in the form of trajectories. The proposed approach is an extension of our previous work [11] on crowd motion pattern segmentation using spatio-angular features of the trajectories and an improved density-based clustering algorithm. Compared to [11], the proposed approach utilizes only the angular features obtained from the trajectories (computed using the gKLT tracker [16]) to compute pair-wise angular deviations between the trajectories. In this work, we additionally compute the histogram of angular deviations (HAD), which depicts the global motion structure of a scene. To evaluate the HAD feature's ability to classify a given scene as structured, semi-structured, or unstructured, we use the publicly available Collective Motion Database to train different classifiers and compare our classification model with the state-of-the-art crowd scene classification approaches, which are based on the collectiveness measure. Furthermore, we perform experiments on reducing the original feature dimension by quantizing the angular deviation values into different levels. Finally, using the proposed HAD-based feature vector and a reference histogram for a structured scene, we introduce a measure to quantify the structuredness of a given input scene.
The following are the contributions of the proposed work: (i) a novel HAD-based feature vector combined with a robust classifier for efficient crowd scene classification, (ii) an effective quantization-based feature reduction technique for the proposed HAD-feature vector, and (iii) a novel crowd scene structuredness index to quantify the structuredness of a given scene based on its HAD.

Related works
While numerous works exist on motion pattern-based crowd analysis [3,12,17-20], only a few of them focus on classifying a scene into the aforementioned three categories (structured, semi-structured, and unstructured). Among them, Zhou et al. [16,21] introduced a descriptor to quantify a crowded scene based on its 'collectiveness', which is defined as "the degree of individuals acting as a union in collective motion". Three levels of collectiveness were introduced in their work, namely high, medium, and low collectiveness. By this definition, a high degree of collectiveness is a typical characteristic of a structured scene, a low degree of collectiveness is seen in unstructured scenes, and semi-structured scenes have a medium degree of collectiveness. Since the concepts of high, medium, and low collectiveness match the characteristics of structured, semi-structured, and unstructured crowded scenes, respectively, we compare our work with the collectiveness-based crowd scene classification approaches in the literature. In Zhou et al.'s approach [16,21], a k-Nearest Neighbour (k-NN) graph is initially constructed with the edge-weights representing the velocity correlation between trajectory vectors. To capture the global behaviour between the trajectory points, the concept of path-based similarity within the weighted graph is employed. Finally, the crowd collectiveness descriptor is computed by aggregating the individual path-based similarities between the nodes in the graph, and this collectiveness information is subsequently used to train a classifier to predict the scene type. As a part of this work, Zhou et al. introduced the Collective Motion Database (refer Sect. 4.1) for validating the performance of their descriptor. In a similar work, Ren et al. [22] refined the technique for aggregating the topological path-based similarity (introduced in [16]) by using an exponent generating function to produce an improved collectiveness measure.
In another work, Shao et al. [23,24] used a Markov chain-based approach and proposed a Collective Transition (CT) prior to model crowd group behaviour. The CT prior is then used to define the group-level collectiveness of the crowd. In contrast, Li et al. [25] used a point selection strategy to refine the feature points to be tracked, followed by a manifold ranking approach to compute the crowd collectiveness. In an extended work, Li et al. [10,26] modelled the motion intention of individuals in a scene by proposing an intention-aware model, which is combined with a manifold ranking strategy to compute the collectiveness of the crowd. Recently, Roy et al. [27] proposed an approach to classify a given crowded scene into structured, semi-structured, and unstructured based on the definitions presented in [15].
Roy et al. [27] used the direction property of the crowd and divided each frame into non-overlapping blocks to compute angle/orientation histograms for each block from the trajectory data. Before combining all the block-level orientation histograms, a Gaussian averaging and an angular value-based quantization approach are used to refine each of them.
Most of the aforementioned approaches measure a frame-by-frame collectiveness value and subsequently average it over all frames to compute the total collectiveness of a given video. Such frame-by-frame approaches lead to huge variations in the measured collectiveness because of the continuous change in the motion patterns of the trajectory key-points between frames. These approaches are also heavily dependent on their model initialisation parameters and are computationally complex. In contrast, we propose a multi-frame approach that simply averages the trajectory data over a fixed set of frames, which not only captures the history of motion but also produces a more stable feature vector (to quantify crowd collectiveness) compared to the frame-by-frame approaches. Furthermore, even though our approach is similar to [27] in using a histogram based on the direction property of the crowd, the essential difference lies in the use of a more robust and distinctive histogram of angular deviations instead of a histogram of angular orientations (explained in Sect. 3). Finally, our approach is parameter-free and computationally inexpensive.

Crowd scene classification using Histogram of Angular Deviations
According to [16,28-30], from the macroscopic point-of-view, a moving crowd exhibits a high degree of collective or structured behaviour if the majority of its participants move together uniformly in the same direction. Based on this property of the crowd, we propose an approach to classify a given crowded video as structured, semi-structured, or unstructured by computing the angular orientation information of the moving crowd. The proposed approach is detailed in the following paragraphs according to the block diagram shown in Fig. 2.

Extract Trajectories (Tracking): Given an input crowd video, the first step is to capture the motion of the crowd. This is done using the accurate and computationally efficient generalized KLT (gKLT) tracker introduced by Zhou et al. [16], which detects and tracks key-points (corner features) between consecutive frames of the video. Factors such as occlusion and illumination variations can produce noisy trajectories with very short lengths or with zero-displacement values for the major part of the trajectory. Such noisy trajectories are discarded using pre-defined thresholds (determined empirically).

Compute Average Angular Orientation: A trajectory is a set of 2D coordinate points that depict the movement of a key-point (belonging to an object in the scene) across a set of consecutive frames. An effective way to capture the direction of the moving crowd from the trajectory data is to first compute the average of the frame-by-frame displacements for each trajectory. The average angular orientation θ^t_i of each trajectory is then obtained from this average displacement vector by projecting it onto a unit vector in the horizontal direction (along the x-axis), as demonstrated in [11].

Compute Histogram of Angular Deviations (HAD): For each trajectory, the average angular orientation value summarizes the history of the trajectory's movement over a period of time.
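The filtering and averaging steps above can be sketched as follows. This is a minimal illustration, assuming trajectories are given as plain lists of (x, y) points; the gKLT tracker itself and the exact empirical thresholds are outside the scope of this sketch, and the helper names are our own:

```python
import numpy as np

def filter_trajectories(trajectories, min_len=5):
    """Discard noisy trajectories: too short, or zero displacement throughout."""
    kept = []
    for traj in trajectories:
        pts = np.asarray(traj, dtype=float)        # shape (T, 2)
        if len(pts) < min_len:
            continue                               # too short to be reliable
        if np.allclose(np.diff(pts, axis=0), 0.0):
            continue                               # key-point never moved
        kept.append(pts)
    return kept

def average_angular_orientation(trajectory):
    """Average frame-by-frame displacement -> orientation w.r.t. the x-axis.

    Returns the angle in degrees in [0, 360). Note: with image coordinates
    (y pointing down) the sign convention of the angle flips, but the
    pair-wise deviations used later are unaffected.
    """
    pts = np.asarray(trajectory, dtype=float)
    mean_disp = np.diff(pts, axis=0).mean(axis=0)  # average displacement vector
    return float(np.degrees(np.arctan2(mean_disp[1], mean_disp[0])) % 360.0)
```

For example, a key-point moving steadily to the right yields an orientation of 0°, and one moving diagonally up-right yields 45°.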
If all the averaged trajectory vectors in the scene point approximately in the same direction, the scene is said to be structured. If the trajectory vectors are scattered in different directions, the scene is unstructured. Therefore, analysing the distribution of orientation values, i.e. the histogram of angular orientations, gives a clear picture of the global movement characteristics of objects within a scene. However, as shown in Fig. 3c, even though the histogram of angular orientations for two different structured scenes clearly characterizes the structured behaviour, it produces peaks at different locations of the histogram. Therefore, the histogram of angular orientations cannot be used as a distinguishing feature in its original form. Hence, in this work, we compute a robust histogram of angular deviations instead. The angular deviation Δθ^t_ij between two averaged trajectory vectors θ^t_i and θ^t_j is defined as follows [11]:

Δθ^t_ij = min(|θ^t_i − θ^t_j|, 360° − |θ^t_i − θ^t_j|)    (1)

so that Δθ^t_ij lies in the range [0°, 180°].
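With the deviation restricted to [0°, 180°] as above, the HAD can be sketched as follows; normalizing the bin counts to [0, 1] by dividing by the histogram maximum is an assumption of this sketch:

```python
import numpy as np
from itertools import combinations

def angular_deviation(theta_i, theta_j):
    """Pair-wise angular deviation, in degrees, restricted to [0, 180]."""
    d = abs(theta_i - theta_j) % 360.0
    return min(d, 360.0 - d)

def histogram_of_angular_deviations(orientations):
    """Build the 181-level HAD over all pairs of trajectory orientations,
    with bin counts normalized to [0, 1]."""
    devs = [angular_deviation(a, b) for a, b in combinations(orientations, 2)]
    hist, _ = np.histogram(devs, bins=181, range=(0.0, 181.0))  # 1-degree bins
    hist = hist.astype(float)
    if hist.max() > 0:
        hist /= hist.max()                       # normalize counts to [0, 1]
    return hist
```

For a structured scene, where all orientations nearly coincide, every pair-wise deviation is close to 0° and the HAD collapses into a single peak at the 0° bin.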

Dataset
The Collective Motion Database (introduced by Zhou et al. [16]) is used for evaluating the effectiveness of the proposed approach. The dataset also contains ground truth labels for each scene (the labels belong to the set {0, 1, 2}), where '0' refers to scenes with low collectiveness (unstructured scenes in the context of our work), '1' refers to scenes with medium collectiveness (semi-structured scenes), and '2' refers to scenes with high collectiveness (structured scenes). The ground truth labels are generated, for each scene, by majority voting over the labels manually decided by 10 human subjects. The data statistics in Table 1 show that 52% of the dataset consists of unstructured scenes, while 22% are structured and the remaining 26% are semi-structured, which means that the dataset is imbalanced.

Table 2 Comparison of binary classification performances (measured using the performance metrics Precision (P), Recall (R), and F1-Score) of the proposed approach with the state-of-the-art on the Collective Motion Database (the best score is marked in bold)

Experimental set-up
For each scene of the Collective Motion Database, a set of 3000 key-points (as per [16]) is detected and tracked for the entire duration of the video using the gKLT tracker. The average displacement is computed for each set of 30 consecutive frames. For ease of evaluation, we experimentally choose the 30 consecutive frames from the entire trajectory data that best represent the crowd behaviour (as per [24]). The noisy trajectories are then filtered by discarding trajectories with a total length of less than 5 frames, or with zero displacement throughout the trajectory's span (occurring due to tracking/motion estimation errors). After computing the HAD, the count values of the histogram are normalized to the range [0, 1]. Since the dataset is imbalanced, the classifiers' performance is evaluated by employing stratified 10-fold cross-validation with Precision, Recall, and F1-Score chosen as the evaluation metrics [34]. The essential hyperparameters (determined empirically based on repeated experiments) used to configure the three classifiers are: (i) the weighted k-NN model uses the Bhattacharya distance measure [35] with optimal results obtained for the k-
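As an illustrative sketch of distance-weighted k-NN classification over normalized HAD vectors, the following uses a 1 − BC dissimilarity (BC being the Bhattacharya coefficient). This is a monotone variant of the Bhattacharya distance −ln BC [35], so the neighbour ordering, and hence the k-NN decision, is unchanged; the value k = 3 below is purely illustrative, not the paper's tuned setting:

```python
import numpy as np

def bhattacharyya_dissimilarity(p, q, eps=1e-12):
    """1 - Bhattacharya coefficient between two histograms (renormalized).

    Zero when the histograms match; a monotone transform of -ln(BC),
    so k-NN neighbour ordering is identical to the Bhattacharya distance.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return 1.0 - float(np.sum(np.sqrt(p * q)))

def weighted_knn_predict(X_train, y_train, x, k=3, eps=1e-12):
    """Distance-weighted k-NN vote over HAD feature vectors."""
    dists = np.array([bhattacharyya_dissimilarity(x, t) for t in X_train])
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        # closer neighbours get larger voting weight
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (dists[i] + eps)
    return max(votes, key=votes.get)
```

A stratified 10-fold split (e.g. scikit-learn's StratifiedKFold) can then be wrapped around this predictor to reproduce the evaluation protocol.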

Results & discussion
Firstly, we compare the classification performance of the proposed HAD-based approach with the state-of-the-art approaches [16], [24], [10], which report only binary classification results for the pair-wise combinations of structured vs. unstructured, structured vs. semi-structured, and semi-structured vs. unstructured scenes. The results shown in Table 2 (results for the existing approaches are obtained from [10]) clearly indicate the superior performance of the proposed HAD-based approach, which also performs consistently well for each of the three selected classifiers. The state-of-the-art approaches work by modelling complex interactions among the trajectories to generate a collectiveness measure. Since this measure is a single value and is heavily dependent on the scene dynamics and, subsequently, the model parameters, a small amount of anomaly can adversely affect the collectiveness value, which is a major reason for the low scores of most state-of-the-art approaches.
On the contrary, the proposed HAD-based approach depends only on the accuracy of the tracker (which applies to the state-of-the-art approaches as well) and comprises a multi-valued feature vector in which a small amount of anomaly does not create a drastic change. It is also observed that the proposed approach is most effective in classifying structured and unstructured scenes. This is because, as demonstrated in Sect. 3, the range of possible angular deviation values for a structured scene is very small and close to 0°, whereas an unstructured scene contains objects moving in different directions, due to which a wide variety of angular deviation values is possible. However, there are several instances of misclassification in the case of structured vs. semi-structured and semi-structured vs. unstructured classification, due to the close resemblance of semi-structured scenes to both structured and unstructured scenes. This is also evident from Tables 3 and 4, where we list the three-class (structured, semi-structured, unstructured) classification performance of the proposed approach for the selected classifiers. Furthermore, since Roy et al. [27] use a different proportion of scenes for each class (as compared to Table 1) and their binaries are not available, we are unable to quantitatively compare their results. Nonetheless, the dimension of the proposed HAD-based feature vector is 20 times smaller than that of the feature vector proposed by Roy et al. [27]. More importantly, the proposed HAD-based approach uses the angular deviation measure, which is not only globally consistent within a scene but also consistent across different scenes (Figs. 3 and 4), as compared to the angular orientation-based feature used in [27].
Secondly, we examine the effect of reducing the feature dimension by quantizing the angular deviation values. For this purpose, we choose the weighted k-NN classifier based on the Bhattacharya distance measure [35], since it generalizes well to reductions in feature dimension. The original angular deviation values in the range [0°, 180°] (having 181 levels with bin width = 1) are quantized into different levels, viz. 90, 60, 45, 36, 18, 12, 9, and 6, with each level having a histogram bin width of 2, 3, 4, 5, 10, 15, 20, and 30, respectively (the last bin has width bin width + 1). As a result of this quantization, adjacent angular deviation values within a particular range (defined by the bin width corresponding to the quantization level) contribute to a single histogram bin count. The changes in the HAD structure after the quantization operation (for different levels) can be visualized in Fig. 5. It is observed that the HADs, for all levels of quantization, retain the same global structure. Figure 6 shows the performance metrics for each of the reduced feature vectors applied over the Collective Motion Database, where it is observed that increasing the bin width does not drastically decrease the classification performance. More importantly, the reduced feature vectors can distinguish structured and unstructured scenes effectively. This is primarily because of the nature of the two scene types. In structured scenes, the pair-wise angular deviation between the majority of trajectory vectors is minimal, resulting in a peak close to the 0° bin in the histogram, whereas in unstructured scenes the pair-wise angular deviations are not minimal and can take up a variety of values, resulting in values distributed across the histogram with no clear peaks. Hence, increasing the bin width does not affect the major trend in the HAD, and thus preserves the capability of discriminating structured from unstructured scenes.
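The rebinning described above (adjacent 1° bins merged into coarser bins, with the leftover 181st level absorbed by the last bin) can be sketched as follows; the function name is our own:

```python
import numpy as np

def quantize_had(had, bin_width):
    """Merge adjacent bins of a 181-bin HAD into coarser bins.

    The last bin absorbs the leftover level(s), so its effective width is
    bin_width + 1, matching the quantization scheme described in the text
    (e.g. 181 levels -> 90 bins of width 2, last bin of width 3).
    """
    had = np.asarray(had, dtype=float)
    n_bins = 181 // bin_width                    # e.g. 90 bins for width 2
    out = had[:n_bins * bin_width].reshape(n_bins, bin_width).sum(axis=1)
    out[-1] += had[n_bins * bin_width:].sum()    # fold leftover into last bin
    return out
```

Since merging bins only sums adjacent counts, the total mass of the histogram is preserved at every quantization level.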
However, it can also be observed that decreasing the feature dimension limits the capability of the classifier in discriminating subtle variations in the scenes, especially in the case of some semi-structured scenes whose HADs are similar to those of either unstructured or structured scenes. Since a reduction in feature dimension results in faster computations, we focus on choosing a feature dimension from Fig. 6 that does not significantly reduce the classification performance. Based on the data in Fig. 6, we draw the following two conclusions: (i) the classification performance of the reduced feature vector with dimension 90 (bin width = 2) is equivalent to the overall performance of the original feature vector (for 2-class and 3-class classification); thus, the HAD with feature dimension 90 is considered an optimal choice for crowd classification; and (ii) the HAD with feature dimension 9 can be used for structured-unstructured scene classification.

Crowd scene structuredness index
Based on the Collective Motion Database and its HAD data (for each scene), we empirically define a measure, termed the crowd scene structuredness index, to quantify the structuredness of a given scene (on a scale of 0 to 1). For this purpose, we take as the reference histogram the HAD of the scene (shown in Fig. 7) within the Collective Motion Database that is closest to an ideal structured scene. Then, for a given input scene, we compute the HAD (here, we use the HAD with feature vector dimension 90) and compare it with the reference histogram using the Bhattacharya coefficient measure, as shown in Eq. 2.
φ(q, r) = Σ_{b∈B} √(q(b) · r(b))    (2)

where the crowd scene structuredness index is represented by the function φ, which quantifies the amount of structuredness of the computed HAD (denoted q) when compared to the reference HAD (denoted r), and B denotes the set of all bins in the histogram. The right-hand side of Eq. 2 is the Bhattacharya coefficient, a part of the Bhattacharya distance metric, which is widely used to measure the similarity of two distributions. The statistical details of this experiment are shown in Fig. 8, which clearly indicate the separability between the structured and unstructured classes. The following can be concluded from Fig. 8: (i) scenes with a φ-value greater than 0.55 have structured motion patterns (the greater the φ-value, the greater the structuredness), (ii) scenes with a φ-value less than or equal to 0.55 have unstructured motion patterns (the lower the φ-value, the greater the unstructuredness), and (iii) scenes with values between 0.40 and 0.65 are semi-structured, with values greater than 0.55 indicating a small amount of structuredness within the scene and values less than or equal to 0.55 indicating a small amount of unstructuredness within the scene.
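A sketch of the index, assuming the HADs are renormalized to sum to 1 so that φ is guaranteed to lie in [0, 1] (the paper normalizes bin counts to [0, 1]; treating the histograms as discrete distributions is an assumption of this sketch):

```python
import numpy as np

def structuredness_index(q, r, eps=1e-12):
    """Crowd scene structuredness index: Bhattacharya coefficient between
    the input HAD q and the reference (structured-scene) HAD r."""
    q = np.asarray(q, dtype=float)
    r = np.asarray(r, dtype=float)
    q = q / (q.sum() + eps)              # treat HADs as discrete distributions
    r = r / (r.sum() + eps)
    return float(np.sum(np.sqrt(q * r)))  # phi in [0, 1]; 1 = identical HADs
```

An input HAD identical to the reference yields φ ≈ 1 (fully structured), while a HAD with no overlapping bins yields φ = 0 (fully unstructured).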

Conclusion
In this paper, a feature vector based on the histogram of angular deviations (HAD) of averaged trajectory vectors was proposed to classify a given crowded scene into structured, semi-structured, and unstructured, based on global motion patterns. The proposed HAD-based feature vector is composed of values that depict the count of each possible angular deviation value between 0° and 180°. Since a structured scene contains objects moving uniformly in the same direction, the pair-wise angular deviations between the majority of trajectory vectors are minimal. This creates a well-distinguishable peak in the HAD close to the 0° bin.
In contrast, for an unstructured scene, the angular deviation values are distributed across various bins of the HAD, due to the motion of the objects in different directions. Based on this notion, the experiments performed on the publicly available Collective Motion Database using classical machine learning algorithms prove the robustness of the proposed HAD feature in distinguishing different scene types, specifically structured and unstructured scenes. Furthermore, the proposed approach outperforms the state-of-the-art approaches in binary classification of different combinations of the various scene types. The experiments conducted on quantizing the angular deviation values to reduce the feature dimension showed that the reduced feature vector with dimension 90 (bin width equal to 2) performs as well as the original feature vector of dimension 181, making it an optimal feature dimension for classifying crowded scenes based on motion patterns. Finally, by comparing the proposed HAD with a reference HAD depicting an ideal structured scene, we define a crowd scene structuredness index (based on the Bhattacharya coefficient) which quantifies the amount of structuredness in a given scene. As future work, we intend to improve the computation of the HAD by introducing a penalty function that penalizes unstructured behaviour, thereby improving the 3-class classification performance and increasing the separation between the three classes. Furthermore, in order to better distinguish between all three types of crowded scenes, we intend to explore the possibility of including additional features beyond the angular deviation feature.
Funding Open access funding provided by Manipal Academy of Higher Education, Manipal.

Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.