We evaluate the proposed approach by combining the proposed features (see Sect. 3) on the UMN and UCSD datasets. Our experiments are divided into two parts. The first part compares our classification method using 1 and 4 classification regions. The second part compares our results with other methods published in the literature.
Experimental Setup and Datasets: The final feature vector contains two features describing the horizontal and vertical optical flow fields (velocity), two optical acceleration features (horizontal and vertical directions), three optical flow texture features, and an eight-dimensional histogram of optical flow gradient components (four components for each of the horizontal and vertical directions). Therefore, we have a fifteen-dimensional feature vector for each spatio-temporal patch. In addition, we experimentally set a fixed spatio-temporal patch size of \(20 \times 20 \times 7\). Detection accuracy of abnormal events was evaluated at the frame level; to measure the performance of the proposed method, we computed the ROC curve, the area under the curve (AUC), and the equal error rate (EER).
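As a minimal sketch of this frame-level evaluation, assuming per-frame anomaly scores and binary ground-truth labels are available (the function and variable names below are illustrative, not part of our implementation):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def frame_level_metrics(scores, labels):
    """Compute frame-level AUC and EER from anomaly scores.

    scores: one anomaly score per frame (higher = more abnormal)
    labels: ground truth per frame (1 = abnormal, 0 = normal)
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)
    # EER: operating point where the false positive rate
    # equals the miss rate (1 - TPR)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return roc_auc, eer
```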
The UMN dataset consists of three different scenes of crowded escape events at a \( 320 \times 240 \) resolution. The normal events are pedestrians walking randomly, and the abnormal events are people spreading out and running at the same time. There are a total of 11 video clips in the dataset. In the training phase, the first 300 frames of each video were used to model normal events, and the remaining frames were used in the test phase. The UCSD dataset includes two sub-datasets, Ped1 and Ped2. The crowd density varies from sparse to very crowded. The training sets contain only normal events, while the testing sets include abnormal events such as cars, bikes, motorcycles, and skaters. Ped1 contains 34 video clips for training and 36 video clips for testing at a \( 158 \times 238 \) resolution, and Ped2 contains 16 video clips for training and 12 video clips for testing at a \( 360 \times 240 \) resolution.
Global Abnormal Event Detection: We use the UMN dataset to detect global abnormal events (GAE). The results are shown in Fig. 2, where columns (a), (b), and (c) show the results on the UMN dataset (scenes 1, 2, and 3, respectively). The first row shows the results for 1 classification region, while the second row shows the results for 4 regions. The ROC curves of the proposed method are shown in Fig. 3(a).
In Table 1, we show the quantitative results of our experiment on the UMN dataset using the two classification schemes (1 and 4 regions). In scenes 1 and 3, our method with single-region classification achieves AUCs of 0.9985 and 0.9954, respectively, outperforming the 4-region classification. In scene 2, however, the 4-region classification achieves an AUC of 0.9486, outperforming the single-region classification.
This happens because the videos in scene 2 suffer from perspective distortion. Table 2 provides quantitative comparisons with state-of-the-art methods. The AUCs of our method in scenes 1 and 3, 0.998 and 0.995, respectively, surpass the state-of-the-art methods. However, the AUC in scene 2 is only comparable to the results in the literature.
Local Abnormal Event Detection: We use the UCSD dataset to detect local abnormal events (LAE). The results are shown in Fig. 2, where columns (d) and (e) show the results on Ped1 and Ped2, respectively. The classifications with 1 and 4 regions are shown in the first and second rows, respectively. The ROC curves of our method on the UCSD dataset are shown in Fig. 3(b).
Table 3 shows the quantitative results of our experiment on the UCSD dataset using the two classification schemes (1 and 4 regions). On Ped1, the 4-region classification achieves an EER of 29.28% and an AUC of 0.7923, outperforming the single-region classification due to the perspective distortion present in this dataset. Conversely, on Ped2, the single-region classification achieves an EER of 7.24% and an AUC of 0.9778, outperforming the 4-region classification. Table 4 shows a quantitative comparison of the proposed method with state-of-the-art methods on the UCSD dataset. On Ped1, our method achieves an EER of 29.2% and an AUC of 0.792, which is competitive with most methods reported in the literature. On Ped2, our method achieves an EER of 7.2% and an AUC of 0.977, outperforming all reported results. It is important to emphasize that the state-of-the-art methods based on deep learning [18, 19] achieve better results on Ped1; however, our proposal surpasses all methods on Ped2.
In our experiments, we observed that on video sequences without perspective distortion, the performance of our method surpasses all state-of-the-art methods, including deep learning techniques. This is because our proposal extracts and combines motion and appearance information. However, our results degrade when videos exhibit perspective distortion. To address this problem, we propose classification by local regions, which improves the performance of our method, as can be observed in Fig. 3(a), (b).
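The following sketch illustrates the idea behind the 4-region classification under our own naming assumptions; `classify_by_regions` and the per-region threshold calibration are hypothetical placeholders, not the exact implementation:

```python
import numpy as np

def classify_by_regions(frame_scores, region_thresholds, n_rows=2, n_cols=2):
    """Flag a frame as abnormal using per-region thresholds.

    frame_scores: 2D array of patch-level anomaly scores for one frame.
    region_thresholds: one threshold per region, assumed to be estimated
        beforehand from normal training videos (hypothetical calibration).
    The frame is split into n_rows x n_cols regions (2 x 2 = 4 regions),
    each judged against its own threshold, so near-camera regions with
    large apparent motion are not compared against far-away regions.
    """
    h, w = frame_scores.shape
    abnormal, k = False, 0
    for i in range(n_rows):
        for j in range(n_cols):
            region = frame_scores[i * h // n_rows:(i + 1) * h // n_rows,
                                  j * w // n_cols:(j + 1) * w // n_cols]
            # Any region exceeding its own threshold flags the frame
            if region.max() > region_thresholds[k]:
                abnormal = True
            k += 1
    return abnormal
```

With a single region (n_rows = n_cols = 1), one global threshold applies to the whole frame, which is why that variant works best on scenes without perspective distortion.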