Fig. 1

Examples of: a a structured scene and its motion patterns, b a semi-structured scene and its motion patterns, and c an unstructured scene and its motion patterns. Note that the motion patterns are represented in the form of trajectories, with each colour representing a different motion pattern (best viewed in colour)

1 Introduction

The rapid advances in technology combined with the continuous growth in the human population have increased the need for efficient automated video surveillance technologies. As a result, automated crowd video surveillance has become a popular research area, with the motive of ensuring crowd safety. Crowd behaviour analysis [1,2,3], crowd density estimation/crowd counting [4,5,6], crowd anomaly detection [7,8,9], and group detection [10,11,12] are some of the widely researched areas within the crowd video surveillance domain. In most of these research areas, the performance of the proposed approaches is primarily dependent on the nature and type of the crowded scene [13]. In other words, an approach that performs well for one scene does not guarantee the same performance for a different scenario, especially when the scene dynamics change.

Analysing the pattern of object motion (people/traffic) within a given crowded scene is one of the most effective methods to understand changes in scene dynamics. According to [14] and [15], crowded scenes can be categorized into structured and unstructured based on the motion patterns of objects within the scene. A structured scene consists of uniform spatio-temporal motion patterns generated by coherently moving objects across the entire scene (Fig. 1a). In other words, in a structured scene, each spatial location contains the same motion pattern, and the direction of motion remains constant most of the time. In contrast, an unstructured scene is made up of non-uniform spatio-temporal motion patterns generated by randomly or chaotically moving objects with unpredictable and frequently varying motion directions (Fig. 1c). Scenes with motion patterns that are neither uniform nor chaotic are called semi-structured (Fig. 1b).

Identifying the type of a crowded scene provides crucial mid-level information about the scene under consideration. This information helps in the development of an efficient crowd behaviour analysis model. Additionally, this prior knowledge could be applied for crowd monitoring within a particular scene by re-assessing the scene type at regular intervals to keep track of stable-unstable changes in the state of the crowd. This paper proposes an approach to classify a given crowded scene as structured, semi-structured, or unstructured based on the motion patterns represented in the form of trajectories. The proposed approach is an extension of our previous work [11] on crowd motion pattern segmentation using spatio-angular features of the trajectories and an improved density-based clustering algorithm. Compared to [11], the proposed approach utilizes only the angular features obtained from the trajectories (computed using the gKLT tracker [16]) to compute pair-wise angular deviations between the trajectories. In this work, we additionally compute the histogram of angular deviations (HAD), which depicts the global motion structure of a scene. To evaluate the ability of the HAD feature to classify a given scene as structured, semi-structured, or unstructured, we use the publicly available Collective Motion Database to train different classifiers and compare our classification model with the state-of-the-art crowd scene classification approaches, which are based on the collectiveness measure. Furthermore, we perform experiments on reducing the original feature dimension by quantizing the angular deviation values into different levels. Finally, using the proposed HAD-based feature vector and a reference histogram for a structured scene, we introduce a measure to quantify the structuredness of a given input scene. The contributions of the proposed work are: (i) a novel HAD-based feature vector combined with a robust classifier for efficient crowd scene classification, (ii) an effective quantization-based feature reduction technique for the proposed HAD feature vector, and (iii) a novel crowd scene structuredness index to quantify the structuredness of a given scene based on its HAD.

Fig. 2

Block diagram of the proposed approach. A generalized KLT (gKLT) tracker [16] is used to extract a set of n trajectories (represented as \(\lbrace t_{i} \rbrace \), where \(i=1:n\)) from the input video. The average angular orientation features (represented as \(\lbrace {\overline{\theta }}_{t_{i}} \rbrace \)) computed from the trajectories are then used to compute the angular deviation between each pair of average angular features \({\overline{\theta }}_{t_{i}}\) and \({\overline{\theta }}_{t_{j}}\). The histogram of angular deviations (HAD) is subsequently used to train a classifier to predict the scene type

2 Related works

While numerous works address motion pattern-based crowd analysis [3, 12, 17,18,19,20], only a few of them focus on classifying a scene into the aforementioned three categories (structured, semi-structured, and unstructured). Among them, Zhou et al. [16, 21] introduced a descriptor to quantify a crowded scene based on its ‘collectiveness’, defined as “the degree of individuals acting as a union in collective motion”. Three levels of collectiveness were introduced in their work, namely high, medium, and low. By this definition, a high degree of collectiveness is a typical characteristic of a structured scene, a low degree of collectiveness is seen in unstructured scenes, and semi-structured scenes have a medium degree of collectiveness. Since the concepts of high, medium, and low collectiveness match the characteristics of structured, semi-structured, and unstructured crowded scenes, respectively, we compare our work with the collectiveness-based crowd scene classification approaches in the literature. In Zhou et al.’s approach [16, 21], a k-Nearest Neighbour (k-NN) graph is initially constructed with the edge weights representing the velocity correlation between trajectory vectors. To capture the global behaviour between the trajectory points, the concept of path-based similarity within the weighted graph is employed. Finally, the crowd collectiveness descriptor is computed by aggregating the individual path-based similarities between the nodes in the graph. As a part of this work, Zhou et al. introduced the Collective Motion Database (refer Sect. 4.1) for validating the performance of their descriptor. In a similar work, Ren et al. [22] refined the technique for aggregating the topological path-based similarity (introduced in [16]) by using an exponent generating function, producing an improved collectiveness measure. In another work, Shao et al. [23, 24] used a Markov chain-based approach and proposed a Collective Transition (CT) prior to model crowd group behaviour. The CT prior is then used to define the group-level collectiveness of the crowd. In contrast, Li et al. [25] used a point selection strategy to refine the feature points to be tracked, followed by a manifold ranking approach to compute crowd collectiveness. In an extended work, Li et al. [10, 26] modelled the motion intention of individuals in a scene by proposing an intention-aware model, which is combined with a manifold ranking strategy to compute the collectiveness of the crowd. Recently, Roy et al. [27] proposed an approach to classify a given crowded scene into structured, semi-structured, and unstructured based on the definitions presented in [15]. Roy et al. [27] used the direction property of the crowd and divided each frame into non-overlapping blocks to compute angle/orientation histograms for each block from the trajectory data. Before all the block-level orientation histograms are combined, each of them is refined using Gaussian averaging and an angular value-based quantization approach.

Most of the aforementioned approaches measure a frame-by-frame collectiveness value and subsequently average these values over all frames to compute the total collectiveness of a given video. Such frame-by-frame approaches lead to huge variations in the measured collectiveness because of the continuous change in the motion patterns of the trajectory key-points between frames. These approaches are also heavily dependent on their model initialisation parameters and are computationally complex. In contrast, we propose a multi-frame approach that simply averages the trajectory data over a fixed set of frames, which not only captures the history of motion but also produces a more stable feature vector (to quantify crowd collectiveness) compared to the frame-by-frame approaches. Furthermore, even though our approach is similar to [27] in using a histogram based on the direction property of the crowd, the essential difference lies in the use of a more robust and distinctive histogram of angular deviations instead of a histogram of angular orientations (explained in Sect. 3). Finally, our approach is parameter-free and computationally inexpensive.

Fig. 3

Histogram of Angular Orientations versus Histogram of Angular Deviations. Each row shows data for one of two different structured scenes: a Frames from the scene, b Averaged displacement vectors overlaid on the frame, c Histogram of angular orientations, d Histogram of angular deviations

3 Crowd scene classification using Histogram of Angular Deviations

According to [16, 28,29,30], from a macroscopic point of view, a moving crowd has a high degree of collective or structured behaviour if the majority of the participants in the crowd move together uniformly in the same direction. Based on this property of the crowd, we propose an approach to classify a given crowded video as structured, semi-structured, or unstructured by computing the angular orientation information of the moving crowd. The proposed approach is detailed in the following paragraphs according to the block diagram shown in Fig. 2.

Extract Trajectories (Tracking): Given an input crowd video, the first step is to capture the motion of the crowd. This is done using the accurate and computationally efficient generalized KLT (gKLT) tracker introduced by Zhou et al. [16], which detects and tracks key-points (corner features) between consecutive frames of the video. Factors like occlusion and illumination variations can produce noisy trajectories with very short lengths or zero-displacement values for the major part of a trajectory. Such noisy trajectories are discarded using pre-defined thresholds (determined empirically).
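A minimal sketch of this filtering step is given below, assuming each trajectory is stored as an array of (x, y) key-point positions; the threshold values shown are illustrative placeholders, not the empirically determined ones used in our experiments:

```python
import numpy as np

def filter_trajectories(trajectories, min_length=5, min_total_disp=1.0):
    """Discard noisy trajectories: too short, or nearly static overall.

    `trajectories` is a list of (T_i, 2) arrays of (x, y) key-point
    positions; both threshold values are illustrative placeholders.
    """
    kept = []
    for traj in trajectories:
        traj = np.asarray(traj, dtype=float)
        if len(traj) < min_length:            # too short to be reliable
            continue
        step_disp = np.linalg.norm(np.diff(traj, axis=0), axis=1)
        if step_disp.sum() < min_total_disp:  # essentially zero displacement
            continue
        kept.append(traj)
    return kept
```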

Compute Average Angular Orientation: A trajectory is a set of 2D coordinate points that depicts the movement of a key-point (belonging to an object in the scene) across a set of consecutive frames. An effective approach to capture the direction of the moving crowd from the trajectory data is to first compute the average of the frame-by-frame displacements for each trajectory. The average angular orientation \({\overline{\theta }}_{t_{i}}\) for each trajectory is then obtained from the computed average displacement vector by projecting it onto a unit vector in the horizontal direction (along the x-axis), as demonstrated in [11].
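As a sketch, this quantity can also be computed with atan2, a standard equivalent of the projection-based formulation in [11] (the trajectory format follows the filtering sketch above):

```python
import numpy as np

def average_orientation(trajectory):
    """Average angular orientation (degrees in [0, 360)) of a trajectory.

    The frame-by-frame displacements are averaged first, and the angle
    of the resulting mean displacement vector is measured with respect
    to the horizontal (x) axis.
    """
    traj = np.asarray(trajectory, dtype=float)
    mean_disp = np.diff(traj, axis=0).mean(axis=0)   # average displacement vector
    angle = np.degrees(np.arctan2(mean_disp[1], mean_disp[0]))
    return angle % 360.0                             # map (-180, 180] to [0, 360)
```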

Table 1 Statistics for the Collective Motion Database
Fig. 4

a Structured scene and its HAD, b Semi-structured scene and its HAD, and c Unstructured scene and its HAD

Compute Histogram of Angular Deviations (HAD): For each trajectory, the average angular orientation value summarizes the history of the trajectory's movement over a period of time. If all the averaged trajectory vectors in the scene point approximately in the same direction, the scene is said to be structured. If the trajectory vectors are scattered in different directions, the scene is unstructured. Therefore, analysing the distribution of orientation values, i.e. the histogram of angular orientations, gives a clear picture of the global movement characteristics of objects within a scene. However, as shown in Fig. 3c, even though the histogram of angular orientations for two different structured scenes clearly characterizes the structured behaviour, it produces peaks at different locations of the histogram. Therefore, the values of the histogram of angular orientations cannot be used as a distinguishing feature in their original form. Hence, in this work, we compute a robust histogram of angular deviations instead. The angular deviation (\(\varDelta {\overline{\theta }}_{t_{ij}} \)) between two averaged trajectory vectors \({\overline{\theta }}_{t_{i}}\) and \({\overline{\theta }}_{t_{j}}\) is defined as follows [11]:

$$\begin{aligned} \varDelta {\overline{\theta }}_{t_{ij}} = min(|{\overline{\theta }}_{t_{i}}-{\overline{\theta }}_{t_{j}}|, 360 - |{\overline{\theta }}_{t_{i}} - {\overline{\theta }}_{t_{j}}|) \end{aligned}$$
(1)

where the range of \(\varDelta {\overline{\theta }}_{t_{ij}}\) is between \(0^{\circ }\) and \(180^{\circ }\) (inclusive of both). Computing the histogram of pair-wise angular deviation values (for each trajectory vector against every other trajectory vector) provides a global/scene-level picture of the correlation of the trajectory vectors in terms of angles. If the histogram peaks close to the angular deviation value of \(0^{\circ }\), the majority of trajectory vectors in the scene move in the same direction (structured scene). Therefore, as shown in Fig. 3d, the HAD produces consistent peaks for different scenes of the same type, unlike the histogram of angular orientations. Given an input scene, we compute the HAD and use its values as a feature vector representing the scene type. Since the range of \(\varDelta {\overline{\theta }}_{t_{ij}}\) is \([0^{\circ },180^{\circ }]\), the dimension of the HAD feature vector is 181. The HADs for three different types of scenes, shown in Fig. 4, justify the discriminative capability of the proposed HAD feature vector.
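A compact sketch of the HAD computation (the per-trajectory average orientations are assumed to be given in degrees; the binning for widths larger than 1 anticipates the quantization rule of Sect. 4.3, where the last bin is one degree wider; the sum-to-one normalization is one plausible reading of the normalization in Sect. 4.2):

```python
import numpy as np

def histogram_of_angular_deviations(avg_angles, bin_width=1):
    """HAD feature vector over pair-wise angular deviations (Eq. 1).

    `avg_angles`: 1-D array of per-trajectory average orientations in
    degrees. With bin_width = 1 the dimension is 181 (deviations 0..180);
    larger widths yield 181 // bin_width bins, the last bin absorbing
    the remaining deviation values.
    """
    a = np.asarray(avg_angles, dtype=float)
    diff = np.abs(a[:, None] - a[None, :])      # |theta_i - theta_j| for all pairs
    dev = np.minimum(diff, 360.0 - diff)        # Eq. 1, values in [0, 180]
    pairs = dev[np.triu_indices(len(a), k=1)]   # each unordered pair counted once
    n_bins = 181 // bin_width
    edges = np.append(np.arange(0, n_bins * bin_width, bin_width), 181.0)
    hist, _ = np.histogram(pairs, bins=edges)
    total = hist.sum()
    # Normalize counts to [0, 1] (sum-to-one here; an assumption).
    return hist / total if total else hist.astype(float)
```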

Classification: The HAD features are used to build a supervised machine learning model for classifying a given crowded scene into one of the three categories. For this purpose, we use the Collective Motion Database [16] (refer Sect. 4.1 for more details of the dataset). Since this dataset contains relatively few samples (413 videos) and the proposed HAD feature vector dimension is large, we choose nonlinear classical machine learning algorithms for our experimentation. Based on exhaustive experimentation with different classical machine learning algorithms, we observe that the proposed HAD feature vector can efficiently discriminate most of the scenes in the Collective Motion Database. The results of this experimentation are presented in Sect. 4, where, for the sake of simplicity, we have selected the best-performing classical machine learning algorithms, namely Weighted k-Nearest Neighbours (Weighted k-NN) [31], \(\nu \)-Support Vector Machines (\(\nu \)-SVM) [32], and tree ensembles based on an efficient implementation of gradient boosting called eXtreme Gradient Boosting (XGBoost) [33].
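For illustration, the three classifiers could be instantiated roughly as follows with scikit-learn and the xgboost package, using the hyperparameter values reported later in Sect. 4.2; the exact Bhattacharyya-distance variant shown for the weighted k-NN is an assumption:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import NuSVC
from xgboost import XGBClassifier

def bhattacharyya_distance(p, q, eps=1e-12):
    # Distance derived from the Bhattacharyya coefficient of two
    # normalized histograms; the exact variant used is an assumption.
    bc = np.sum(np.sqrt(p * q))
    return -np.log(max(bc, eps))

# Hyperparameter values follow Sect. 4.2.
knn = KNeighborsClassifier(n_neighbors=10, weights='distance',
                           metric=bhattacharyya_distance)
svm = NuSVC(nu=0.3, kernel='rbf')
xgb = XGBClassifier(n_estimators=30, max_depth=3, learning_rate=0.1)
```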

Table 2 Comparison of binary classification performances (measured using the performance metrics—Precision (P), Recall (R), and F1-Score) of the proposed approach with the state-of-the-art on the Collective Motion Database (The best score is marked in bold)

4 Experimental results & discussion

4.1 Dataset

The Collective Motion Database (introduced by Zhou et al. [16]) is used for evaluating the effectiveness of the proposed approach. The dataset contains a total of 413 crowd video sequences, with each video containing 100 frames. The dataset also contains a ground truth label for each scene (the labels belong to the set \(\{0,1,2\}\)), where ‘0’ refers to scenes with low collectiveness (unstructured scenes in the context of our work), ‘1’ refers to scenes with medium collectiveness (semi-structured scenes), and ‘2’ refers to scenes with high collectiveness (structured scenes). The ground truth label for each scene is generated by majority voting over labels assigned manually by 10 human subjects. The data statistics in Table 1 show that 52% of the dataset consists of unstructured scenes, while 22% are structured and the remaining 26% are semi-structured, which means that the dataset is imbalanced.

4.2 Experimental set-up

For each scene of the Collective Motion Database, a set of 3000 key-points (as per [16]) is detected and tracked for the entire duration of the video using the gKLT tracker. The average displacement is computed for each set of 30 consecutive frames. For ease of evaluation, we experimentally choose the set of 30 consecutive frames from the entire trajectory data that best represents the crowd behaviour (as per [24]). The noisy trajectories are then filtered by discarding trajectories with a total length of less than 5 frames or zero displacement throughout the trajectory's span (occurring due to tracking/motion estimation errors). After computing the HAD, the count values of the histogram are normalized to the range [0, 1]. Since the dataset is imbalanced, classifier performance is evaluated using stratified 10-fold cross-validation with Precision, Recall, and F1-Score as the evaluation metrics [34]. The essential hyperparameters (determined empirically through repeated experiments) used to configure the three classifiers are: (i) the Weighted k-NN model uses the Bhattacharyya distance measure [35], with optimal results obtained for a k-value of 10, (ii) the \(\nu \)-SVM model uses a Radial Basis Function (RBF) kernel with a \(\nu \) value of 0.3, and (iii) the XGBoost-based tree ensemble model uses 30 trees per ensemble with the maximum depth set to 3 and a learning rate of 0.1.
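A sketch of this evaluation protocol, reusing the `knn` classifier configured in the earlier sketch (the variable names `X` and `y`, holding the normalized HAD features and ground-truth labels, are assumptions):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import classification_report

# X: (413, 181) array of normalized HAD features, y: labels in {0, 1, 2}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(knn, X, y, cv=cv)   # out-of-fold predictions
print(classification_report(
    y, y_pred,
    target_names=['unstructured', 'semi-structured', 'structured']))
```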

Table 3 3-Class Classification (Structured, Semi-structured, and Unstructured) performance for the proposed HAD-based approach using the selected classifiers (The best score is marked in bold)
Table 4 Confusion matrix for 3-Class Classification (0: Unstructured, 1: Semi-structured, 2: Structured) using the Weighted k-NN classifier
Fig. 5

Visualization of the HADs for each level of quantization of the angular deviation values. Each row shows the HADs for a structured scene (top row) and an unstructured scene (bottom row). From left to right: frames from the input scene with trajectory vectors, the original HAD with feature vector dimension 181 (bin width = 1), and some of the quantized HADs with feature vector dimensions 36 (bin width = 5), 18 (bin width = 10), and 6 (bin width = 30)

Fig. 6

Binary and 3-class classification performance of the proposed approach for varying feature dimension using the Weighted k-NN classifier, where the feature dimension is varied by quantizing the angular deviation values (the F1-scores for all the approaches are highlighted in red font for easier interpretation)

4.3 Results & discussion

Firstly, we compare the classification performance of the proposed HAD-based approach with the state-of-the-art approaches [10, 16, 24], which report only binary classification results for the pair-wise combinations structured vs. unstructured, structured vs. semi-structured, and semi-structured vs. unstructured. The results shown in Table 2 (results for the existing approaches are taken from [10]) clearly indicate the superior performance of the proposed HAD-based approach, which also performs consistently well for each of the three selected classifiers. The state-of-the-art approaches work by modelling complex interactions among the trajectories to generate a collectiveness measure. Since this measure is a single value and is heavily dependent on the scene dynamics and, consequently, the model parameters, a small amount of anomaly can adversely affect the collectiveness value, which is a major reason for the low scores of most state-of-the-art approaches. On the contrary, the proposed HAD-based approach depends only on the accuracy of the tracker (which applies to the state-of-the-art approaches as well) and comprises a multi-valued feature vector in which a small amount of anomaly does not create a drastic change in the overall feature vector. It is also observed that the proposed approach is most effective in classifying structured and unstructured scenes. This is because, as demonstrated in Sect. 3, the range of possible angular deviation values for a structured scene is very small and concentrated close to \(0^{\circ }\), whereas an unstructured scene contains objects moving in different directions, so a wide variety of angular deviation values is possible. However, there are several instances of misclassification in the structured vs. semi-structured and semi-structured vs. unstructured cases, due to the close resemblance of semi-structured scenes to both structured and unstructured scenes. This is also evident from Tables 3 and 4, where we list the three-class (structured, semi-structured, unstructured) classification performance of the proposed approach for the selected classifiers. Furthermore, since Roy et al. [27] use a different proportion of scenes for each class (compared to Table 1) and their binaries are not available, we are unable to quantitatively compare their results. Nonetheless, the dimension of the proposed HAD-based feature vector is 20 times smaller than that of the feature vector proposed by Roy et al. [27]. More importantly, the proposed HAD-based approach uses the angular deviation measure, which is not only globally consistent within a scene but also consistent across different scenes (Figs. 3 and 4), in contrast to the angular orientation-based feature used in [27].

Secondly, we examine the effect of reducing the feature dimension by quantizing the angular deviation values. For this purpose, we choose the Weighted k-NN classifier based on the Bhattacharyya distance measure [35], since it generalizes well to reductions in feature dimension. The original angular deviation values in the range \([0^{\circ },180^{\circ }]\) (having 181 levels with \(bin~width = 1\)) are quantized into different levels, viz. 90, 60, 45, 36, 18, 12, 9, and 6, with the levels having histogram bin widths of 2, 3, 4, 5, 10, 15, 20, and 30, respectively (the last bin of each quantized histogram has a width of \(bin~width + 1\)). As a result of this quantization, adjacent angular deviation values within a particular range (defined by the bin width corresponding to the quantization level) contribute to a single histogram bin count. The changes in the HAD structure after the quantization operation (for different levels) can be seen in Fig. 5. It is observed that the HADs, for all levels of quantization, retain the same global structure. Figure 6 shows the performance metrics for each of the reduced feature vectors applied over the Collective Motion Database, where it is observed that increasing the bin width does not drastically decrease the classification performance. More importantly, the reduced feature vectors can distinguish structured and unstructured scenes effectively. This is primarily because of the nature of the two scene types. In structured scenes, the pair-wise angular deviation between the majority of trajectory vectors is minimal, resulting in a peak close to the \(0^{\circ }\) bin of the histogram. In unstructured scenes, the pair-wise angular deviations are not minimal and can take arbitrary values, resulting in values distributed across the histogram with no clear peaks. Hence, increasing the bin width does not affect the major trend in the HAD and thus the capability of discriminating structured and unstructured scenes. However, it can also be observed that decreasing the feature dimension limits the capability of the classifier in discriminating subtle variations in the scenes, especially in the case of some semi-structured scenes whose HADs are similar to those of either unstructured or structured scenes. Since reduction in feature dimension results in faster computation, we focus on choosing a feature dimension from Fig. 6 that does not significantly reduce the classification performance. Based on the data in Fig. 6, we draw the following two conclusions: (i) the classification performance of the reduced feature vector with dimension 90 (\(bin~width = 2\)) is equivalent to the overall performance of the original feature vector (for both 2-class and 3-class classification); thus, the HAD with feature dimension 90 is considered an optimal choice for crowd scene classification, and (ii) the HAD with feature dimension 9 can be used for structured vs. unstructured scene classification.
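The re-binning itself is straightforward; a sketch that reduces an existing 181-dimensional HAD (the widened last bin follows the rule above):

```python
import numpy as np

def quantize_had(had, bin_width):
    """Reduce a 181-dim HAD by summing adjacent bins.

    Produces 181 // bin_width bins; the leftover deviation values are
    folded into the last bin, giving it a width of bin_width + 1.
    """
    had = np.asarray(had, dtype=float)
    n_bins = 181 // bin_width              # e.g. 90 for width 2, 36 for width 5
    cut = (n_bins - 1) * bin_width         # start of the widened last bin
    head = had[:cut].reshape(n_bins - 1, bin_width).sum(axis=1)
    tail = had[cut:].sum()                 # last bin absorbs the remainder
    return np.append(head, tail)
```

For example, `quantize_had(had, 2)` yields the 90-dimensional feature vector identified above as the optimal choice.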

4.4 Crowd scene structuredness index

Based on the Collective Motion Database and the HAD data for each of its scenes, we empirically define a measure, termed the crowd scene structuredness index, to quantify the structuredness of a given scene (on a scale of 0 to 1). For this purpose, we take as reference the histogram of the scene in the Collective Motion Database (shown in Fig. 7) that is closest to an ideal structured scene. For a given input scene, we then compute the HAD (here, we use the HAD with feature vector dimension 90) and compare it with the reference histogram using the Bhattacharyya coefficient, as shown in Eq. 2.

$$\begin{aligned} \phi (r,q) = \sum _{x \in B} \sqrt{r(x)q(x)} \end{aligned}$$
(2)
Fig. 7

a The reference structured scene chosen from the Collective Motion Dataset (scene name: “startRunning5”), and b its HAD with feature dimension = 90, bin width = 2

Fig. 8

Box plot of the crowd scene structuredness index (\(\phi \)-value) for the structured, semi-structured, and unstructured scene categories (the \(\phi \)-value is computed for all scenes of the Collective Motion Database, and the box plots are computed category-wise)

where the crowd scene structuredness index is represented by the function \(\phi \), which quantifies the structuredness of the computed HAD (denoted as q) relative to the reference HAD (denoted as r), and B denotes the set of all bins in the histogram. The right-hand side of Eq. 2 is the Bhattacharyya coefficient, a component of the Bhattacharyya distance metric, which is widely used to measure the similarity of two distributions. The statistical details of this experiment are shown in Fig. 8, which clearly indicates the separability between the structured and unstructured classes. The following can be concluded from Fig. 8: (i) scenes with a \(\phi \)-value greater than 0.55 have structured motion patterns (the greater the \(\phi \)-value, the greater the structuredness), (ii) scenes with a \(\phi \)-value less than or equal to 0.55 have unstructured motion patterns (the lower the \(\phi \)-value, the greater the unstructuredness), and (iii) scenes with values between 0.40 and 0.65 are semi-structured, with values greater than 0.55 indicating a small amount of structuredness within the scene and values less than or equal to 0.55 indicating a small amount of unstructuredness within the scene.
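A direct sketch of Eq. 2 (the variable `reference_had`, holding the pre-computed 90-dimensional HAD of the reference scene in Fig. 7, is an assumption):

```python
import numpy as np

def structuredness_index(q, r):
    """Crowd scene structuredness index (Eq. 2): the Bhattacharyya
    coefficient between the input HAD q and the reference HAD r,
    both normalized histograms of equal dimension."""
    q, r = np.asarray(q, dtype=float), np.asarray(r, dtype=float)
    return float(np.sum(np.sqrt(r * q)))

# Example usage: quantize a 181-dim HAD to 90 bins, then score it;
# phi > 0.55 suggests structured motion patterns (see Fig. 8).
# phi = structuredness_index(quantize_had(had, 2), reference_had)
```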

5 Conclusion

In this paper, a feature vector based on the histogram of angular deviations (HAD) of averaged trajectory vectors was proposed to classify a given crowded scene into structured, semi-structured, and unstructured categories, based on global motion patterns. The proposed HAD-based feature vector is composed of values that count each possible angular deviation value between \(0^{\circ }\) and \(180^{\circ }\). Since a structured scene contains objects moving uniformly in the same direction, the pair-wise angular deviations between the majority of trajectory vectors are minimal, creating a well-distinguishable peak in the HAD close to the \(0^{\circ }\) bin. For an unstructured scene, on the other hand, the angular deviation values are distributed across the various bins of the HAD, due to the motion of objects in different directions. Based on this notion, the experiments performed on the publicly available Collective Motion Database using classical machine learning algorithms demonstrate the robustness of the proposed HAD feature in distinguishing different scene types, especially structured and unstructured scenes. Furthermore, the proposed approach outperforms the state-of-the-art approaches in binary classification of the different pair-wise combinations of scene types. The experiments on quantizing the angular deviation values to reduce the feature dimension showed that the reduced feature vector with dimension 90 (bin width equal to 2) performs as well as the original feature vector of dimension 181, making it an optimal feature dimension for classifying crowded scenes based on motion patterns. Finally, by comparing the proposed HAD with a reference HAD depicting an ideal structured scene, we defined a crowd scene structuredness index (based on the Bhattacharyya coefficient) that quantifies the amount of structuredness in a given scene. As future work, we intend to improve the computation of the HAD by introducing a penalty function that penalizes unstructured behaviour, thereby improving the 3-class classification performance and increasing the separation between the three classes. Furthermore, in order to distinguish between all three types of crowded scenes even better, we intend to explore the possibility of including additional features beyond the angular deviation feature.