Abstract
Developing a vision-based approach for identifying crowd panic in video surveillance systems is a complex task due to the difficulty of gathering enough real-world event recordings for training. The use of synthetic data can mitigate this issue, but the domain gap between synthetic and real-world samples needs to be managed to achieve precise results. We present a method to train these systems effectively by combining synthetic and real data to differentiate between normal and panic states. Our method learns domain-invariant spatio-temporal visual cues of the scenes along with supplementary descriptive attributes of crowd directions for the panic state classification. Experimental results show its potential with respect to alternative state-of-the-art methodologies and how it can effectively leverage synthetic data to train this kind of system with high accuracy.
Introduction
Public spaces such as shopping malls, transportation hubs, and entertainment venues are often crowded environments in which it can be challenging for security personnel to monitor the safety of everyone present. CCTV cameras play a crucial role in monitoring the safety of people in crowded public spaces. Artificial intelligence (AI) can enhance the effectiveness of CCTV cameras by using advanced algorithms and machine-learning techniques to provide real-time footage analysis. However, this task is challenging due to the complexity and diversity of scenarios, occlusion and motion of people, and noise and quality of video streams.
Obtaining real-world data that meet these requirements can be difficult for two principal reasons: the data collection process must comply with privacy-related regulations (e.g. the EU's GDPR [1]), implementing data protection principles such as anonymization; and it demands considerable time and cost, given the substantial quantity, variety, and balanced representation of labelled visual appearances needed to cover all possibilities during training. A common technique to overcome data scarcity is using synthetic data to simulate real-world scenarios. This provides AI algorithms with a rich and diverse training dataset, improving the model's performance. However, challenges related to the domain gap must be considered, such as (i) ensuring the synthetic data accurately represent real-world scenarios, (ii) balancing the proportion of real and synthetic data in the training set, and (iii) avoiding potential biases introduced by the data generation process. In this paper, the main contributions include:
- Presenting a methodology to train a model with real and synthetic data while avoiding domain gap issues, and discussing different strategies for incorporating synthetic data when real data are scarce.
- Explaining a novel approach to visual-based panic detection by learning domain-invariant spatio-temporal visual cues.
- Delving into the specifics of the data used to train and test the model, and detailing how the model learns characteristics that avoid domain gap issues.
- Comparing the results of different experiments to identify the most effective approach and hyperparameter configuration.
- Benchmarking the results against the current state of the art.
Related Work
Panic Detection Methods
Due to the limited number of studies that focus specifically on panic detection, the state of the art is based on anomaly detection strategies and their application to panic detection. Anomaly detection involves two different strategies:
1. Hand-crafted features: used for a long time to detect anomalies, with several approaches available. One approach is analysing group behaviour, where different phenomena can be examined, such as collectiveness, stability, or uniformity [2]. Another option is to analyse crowd density, as proposed by [3], where the crowd density and the motion of individuals within the crowd are used as features. Other strategies are based on spatio-temporal analysis, using the gradient sum of the frame difference as a feature [4]. One of the most widely used strategies is the use of optical flow to extract features; a recent study [5] proposes an optical flow framework based on a GAN and uses transfer learning to detect behavioural abnormalities in large-scale crowd scenes. Another option, proposed by [6], is to use entropy-based methods, where experimental results show that panic crowd motion states have higher entropy, while normal crowd states have lower entropy.
2. Automatic features: can be extracted using deep neural networks (DNNs). One widely used method for detecting anomalies is autoencoders [7]; another approach is the use of convolutional neural networks (CNNs) [8].
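The entropy-based cue mentioned in item 1 can be illustrated with a short sketch: given motion vectors (e.g. from optical flow, which we assume has already been computed), the Shannon entropy of their orientation histogram is low for coherent normal motion and high for the scattered motion typical of panic. The function below is an illustrative reconstruction of this idea, not the exact formulation of [6].

```python
import numpy as np

def direction_entropy(flow, bins=8):
    """Shannon entropy of the orientation histogram of 2D motion vectors.

    flow: (N, 2) array of motion vectors (dx, dy), e.g. from optical flow.
    Coherent crowd motion concentrates in few bins (low entropy), while
    scattered panic motion spreads over many bins (high entropy).
    """
    angles = np.arctan2(flow[:, 1], flow[:, 0])           # orientations in [-pi, pi]
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    p = hist / hist.sum()
    p = p[p > 0]                                          # drop empty bins, avoid log(0)
    return float(-(p * np.log2(p)).sum())

# Coherent motion (everyone moving right) vs. scattered motion
rng = np.random.default_rng(0)
coherent = np.tile([1.0, 0.0], (100, 1)) + 0.05 * rng.standard_normal((100, 2))
scattered = rng.standard_normal((100, 2))
assert direction_entropy(coherent) < direction_entropy(scattered)
```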
Training with Synthetic and Real Data
There are some strategies to train a model with real and synthetic data:
1. A simple method is to train the model simultaneously with synthetic and real data, building batches with images from both domains. When defining a ratio, the real images should dominate the distribution [9].
2. Another approach is to pre-train the model on synthetic data and then fine-tune it on real data [10], allowing the model to learn general patterns from the synthetic data and then adapt to the real world during fine-tuning. Both this approach and the previous one depend heavily on the quality and amount of real data.
3. Another method is to add a domain classifier that predicts whether the image belongs to the real or synthetic domain, aiming to learn features that are useful for both domains but that the domain classifier cannot distinguish, thereby extracting domain-invariant features from the data [11]. These methods usually follow two steps: (i) learn features that minimize the loss of the target task and (ii) learn features that maximize the loss of a domain classifier.
4. A completely different approach is to use image-to-image translation techniques based on generative adversarial networks (GANs) [12]. The idea is to make the synthetic images look more realistic so the model can learn from both domains without confusion. These methods are making remarkable advances, but they still tend to introduce artefacts or distortions in the translated images, which can affect the model's performance. Therefore, this strategy was not considered in this study.
Our proposed strategy combines the extracted domain-invariant features with spatio-temporal cues to represent the direction of motion of the crowd.
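Strategy (1), building batches that mix both domains with real samples dominating, can be sketched as follows. The batch size and the 70/30 ratio are illustrative assumptions, not values taken from any of the cited works.

```python
import numpy as np

def mixed_batch(real_data, synth_data, batch_size=32, real_ratio=0.7, rng=None):
    """Draw one training batch mixing real and synthetic samples.

    real_ratio controls the domain distribution; following [9], the real
    samples should dominate. Returns the batch plus a per-sample domain
    label (0 = real, 1 = synthetic) for a discriminator branch.
    """
    if rng is None:
        rng = np.random.default_rng()
    n_real = int(round(batch_size * real_ratio))
    n_synth = batch_size - n_real
    real_idx = rng.choice(len(real_data), n_real, replace=False)
    synth_idx = rng.choice(len(synth_data), n_synth, replace=False)
    batch = np.concatenate([real_data[real_idx], synth_data[synth_idx]])
    domains = np.concatenate([np.zeros(n_real, int), np.ones(n_synth, int)])
    perm = rng.permutation(batch_size)                    # shuffle within the batch
    return batch[perm], domains[perm]

real = np.random.rand(100, 8)      # stand-ins for frame-sequence features
synth = np.random.rand(300, 8)
x, d = mixed_batch(real, synth, batch_size=32, real_ratio=0.7)
assert x.shape == (32, 8) and (d == 0).sum() == 22 and (d == 1).sum() == 10
```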
Training Methodology
Multi-task learning has shown that dedicating some networks to classifying specific features related to the principal task can improve the model's performance [13]. Our training strategy applies this principle, adding classifiers that identify the direction in which the crowd runs during panic, along with a domain classifier, to outperform a conventional panic detector model. The proposed training strategy extracts domain-invariant features from the data and captures both the input sequence's spatio-temporal dynamics and the crowd motion information. The architecture consists of three different components (see Fig. 24.1):
1. Panic detection model: a standard model with an input layer that receives frame sequences, a backbone network, and a classifier layer.
2. Discriminator: a domain classifier that assigns the input sample to its corresponding domain (real or synthetic) using a fully connected layer and a two-class output layer with softmax activation. The purpose of adding a discriminator is to help the backbone learn domain-invariant features from the data. This is achieved by taking the loss of the discriminator and applying it to the backbone in reverse through a gradient reversal layer. In the end, the discriminator cannot accurately classify the domain of the input, because the reversed signal has driven the backbone away from extracting domain-specific features.
3. Direction classifiers: classifiers consisting of a fully connected layer and a two-class output layer with softmax activation, each attached to the backbone of the panic detection model. Their purpose is to determine the direction of the panic runs. Each classifier outputs panic when panic occurs in its direction; otherwise, it outputs no panic. When there is no panic in any direction, all classifiers should predict no panic.
During the testing phase, the discriminator and direction classifiers are removed, using only the panic detection model.
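The gradient reversal behaviour used by the discriminator branch (identity in the forward pass, negated and scaled gradient in the backward pass) is in practice implemented as a custom autograd operation in a deep learning framework; the framework-free sketch below captures only the core behaviour.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; multiplies the incoming gradient by
    -lambda in the backward pass, so the backbone is pushed to *maximize*
    the discriminator loss and thus learn domain-invariant features."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                       # features pass through unchanged

    def backward(self, grad_from_discriminator):
        return -self.lam * grad_from_discriminator

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
g = np.array([0.2, 0.4, -0.6])
assert np.allclose(grl.forward(x), x)
assert np.allclose(grl.backward(g), [-0.1, -0.2, 0.3])
```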
During training, we create batches that include sequences from both domains, each with six labels: panic, domain, and four panic directions (top, bottom, right, and left). This approach narrows the domain gap by learning domain-invariant features from the data, allowing us to incorporate more synthetic data and have more flexibility in choosing the ratio of distribution of data from both domains. However, a highly imbalanced dataset could yield worse results, and the threshold ratio that determines the trade-off between performance and data balance is still an open question.
Experimental Results
Datasets
Real datasets. There are few real datasets for panic detection events using the CCTV camera’s point of view (see Fig. 24.2). The available datasets are:
- UMN [14]: a common benchmark created by the University of Minnesota with ten sequences of three scenes, recorded with a fixed camera at 30 fps and 480 × 640 resolution. It is used for testing abnormal event detection in crowds: people walk with no regular motion pattern and then suddenly run away in different directions.
- MED (Motion Emotion Dataset) [15]: videos of people walking on individual walkways, filmed from above with a stationary camera at 30 fps, with labels for the emotions and behaviours of the crowd. The videos begin with normal scenes and end with abnormal events (panic, fights, fallen people, abandoned objects, etc.).
- PETS2009 [16]: multiple cameras, actors, and some abnormal events where people run away, but a low frame rate of 8 fps. It is well suited to estimating the number and density of people, tracking individuals, and detecting different flows and events.
Synthetic dataset. We use a synthetic dataset [17] composed of recordings showing the behaviour of crowds in normal and panic situations. Each recording shows a group of pedestrians walking randomly with different density levels (low, mid, and high). The dataset contains sequences generated in six places (see Fig. 24.3), using different camera configurations (angle and position), weather conditions, and pedestrian locations for each simulation. Each simulation has at least four cameras placed in different positions and angles to observe the scene from different perspectives. The resulting dataset contains videos of 375 frames each. The recordings cover 6 different places and 31 different CCTV cameras in total (with slight variations in each simulation), yielding 320 different videos with various weather conditions.
Training dataset. Our proposed architecture uses spatio-temporal data to make predictions, so the frame rates of the training and testing datasets must match. We discarded the PETS2009 dataset due to its low frame rate. The UMN dataset is a single video composed of different panic scenes, which we separated into independent videos. The labels are printed over the frames when the panic begins; to avoid the text box being detected as panic, we cropped all the frames, removing the top part of the image. The MED dataset contains a variety of anomalous situations, so we selected only the sequences containing panic (1, 2, 3, 4, 5, 8, 9, 10, and 11) for our training and testing. After this process, the UMN dataset offers 11 videos and the MED dataset offers 9, for a total of 20 available videos. As there are insufficient data to train our model, we also use our synthetic dataset.
We annotated the three filtered datasets considering the crowd directions during the panic and created different versions of each training dataset by extracting sequences of varying length (30, 15, or 10 frames) and frame rate (30 or 15 fps) from the filtered video datasets. Any sequence containing both normal and panic frames is excluded; we also avoid any overlap between the sequences. After this process, we obtained 164 normal and 17 panic sequences from the UMN dataset; 1343 normal and 31 panic sequences from the MED dataset; and 11,888 normal and 5944 panic sequences from the synthetic dataset.
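The filtering rule above, fixed-length non-overlapping windows that discard any sequence mixing normal and panic frames, can be sketched as follows (frame labels are assumed to be 0 for normal and 1 for panic):

```python
def extract_sequences(frame_labels, seq_len=15):
    """Split per-frame labels into non-overlapping windows of seq_len frames,
    keeping only windows whose frames are all normal (0) or all panic (1).
    Returns the start indices of the normal and panic sequences."""
    normal, panic = [], []
    for start in range(0, len(frame_labels) - seq_len + 1, seq_len):
        window = frame_labels[start:start + seq_len]
        if all(l == 0 for l in window):
            normal.append(start)
        elif all(l == 1 for l in window):
            panic.append(start)
        # mixed windows (the normal-to-panic transition) are discarded
    return normal, panic

# 35 normal frames, then 25 panic frames; the transition falls in one window
labels = [0] * 35 + [1] * 25
normal, panic = extract_sequences(labels, seq_len=15)
assert normal == [0, 15]          # two all-normal windows
assert panic == [45]              # one all-panic window; frames 30-44 are mixed
```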
Implementation Details
Model selection. The proposed model for action classification is MoviNet-A0 [18], a state-of-the-art model designed for efficiency and accuracy in video recognition tasks. MoviNet is based on 3D CNNs and can handle challenges such as varying camera angles, lighting conditions or background changes. By using MoviNet, we expect high accuracy and efficiency in detecting panic and normal events, as we believe it is suitable for our problem due to its ability to extract spatio-temporal features.
Training parameters. The selection of sequence length and frame rate is crucial in our approach; a consistent frame rate between training and inference is essential. The loss weight is a key parameter of our proposed methodology, determining the contribution of each branch to the overall model learning. It can be adjusted to balance the learning of the different branches and prevent one branch from dominating the others. It is important to monitor the training process and evaluate the performance of each classifier, ensuring that all classifiers, except the discriminator, learn to classify the sequences correctly. The discriminator should have an accuracy close to 0.5, indicating that it cannot differentiate between real and synthetic sequences.
We experimented with different frame rates and sequence lengths to find the optimal dataset for testing our methodology. Using 15 frames at 15 fps produced comparable results to using 30 frames at 30 fps, but with lower computational cost. Using frame rates lower than 15 fps or sequences shorter than 1 s resulted in worse performance. We also tested different image sizes and learning rates, selecting 256 × 256 pixels and 0.001 for the learning rate.
Testing method. In the testing phase, we use a real dataset that was not used during training to evaluate the performance of the model on unseen data. We use a sliding window with a step of 1 and respect the training frame rate: since the model operates at 15 fps, we add one of every two frames of the video.
A sequence is considered panic when all its frames are panic. The panic ground truth labels have been shifted by 30 frames (1 s, the time covered by a sequence). To evaluate the performance of our model, we use the area under the curve (AUC) metric. It is reliable for evaluating our model's performance, as it considers both the true positive and false positive rates: it measures how well the model can distinguish between the two classes, making it useful when dealing with unbalanced datasets.
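The AUC can be computed without ROC plotting via its rank-statistic (Mann-Whitney) formulation: it equals the probability that a randomly chosen panic sequence scores higher than a randomly chosen normal one, with ties counting half. A minimal sketch:

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank formulation: the probability
    that a random positive (panic) scores above a random negative (normal).
    Insensitive to class imbalance, as noted in the text."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3]
assert abs(auc(labels, scores) - 2 / 3) < 1e-9   # 4 of 6 positive/negative pairs ranked correctly
```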
Analysis
Importance of the data distribution. Several tests were conducted using the configuration explained in the Implementation Details section, one of them to determine the data distribution (between real and synthetic data). The model was trained with the panic detector and discriminator to verify that the selected number of synthetic sequences was appropriate for training. The desired behaviour was that the discriminator would not improve its performance while the panic detector learned to identify panic properly. It was found that the difference between the numbers of synthetic and real sequences was too large, so the former was reduced until the training worked as expected. The original synthetic dataset had 17,832 sequences, while the filtered dataset had only 3155. The red experiment (see Fig. 24.4), which used the entire synthetic dataset, learned to identify the domain, as shown by its increasing accuracy and decreasing loss. The orange experiment showed that the discriminator could not learn the domain, its accuracy dropping to approximately 0.38. As a result, the reduced synthetic dataset was selected for the rest of the tests.
Loss weight. To evaluate the impact of the loss weight variable on the model's performance, a series of tests were conducted. The model was trained using a combination of one real dataset and the reduced synthetic dataset and tested on the real dataset not used for training. This process was repeated, alternating the UMN and MED datasets. The complete proposed architecture, including the panic model, the discriminator, and the direction classifiers, was used for this experiment.
The results shown in Fig. 24.5 correspond to a model trained with the MED dataset mixed with the reduced synthetic dataset and tested on the UMN dataset. Four different models were trained, each with a different set of loss weight values per branch: one for the discriminator, one for the direction classifiers (all sharing the same loss weight value), and one for the panic classifier.
The influence of the loss weight is significant. After evaluating the models trained with both MED and UMN datasets (combined with the synthetic dataset), it was found that the best loss weight configuration was to assign 0.5 for the panic detection classifier, 0.0001 for the discriminator, and 0.4 for the direction classifiers.
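With the loss weight configuration found above, the overall objective is a weighted sum of the per-branch losses. The sketch below applies the 0.4 weight to each of the four direction heads, which is our reading of "all sharing the same loss weight value"; the individual branch loss values shown are purely illustrative.

```python
def total_loss(branch_losses, weights):
    """Weighted sum of the per-branch losses; the loss weights balance
    learning so that no branch dominates the shared backbone."""
    return sum(weights[name] * loss for name, loss in branch_losses.items())

# Best configuration found in our experiments
weights = {
    "panic": 0.5,
    "domain": 0.0001,          # kept tiny: the reversed gradient only nudges the backbone
    "dir_top": 0.4, "dir_bottom": 0.4, "dir_right": 0.4, "dir_left": 0.4,
}

# Illustrative per-branch loss values from one training step
losses = {"panic": 0.30, "domain": 0.69, "dir_top": 0.25,
          "dir_bottom": 0.40, "dir_right": 0.10, "dir_left": 0.55}
assert abs(total_loss(losses, weights) - 0.670069) < 1e-9
```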
Comparison of Methods
In this section, we compare the different strategies, referred to as: "Public", trained only with real data; "Mixed", trained with real and synthetic data; "Finetuned", first trained with synthetic data and then fine-tuned with real data; "Domain", the panic model with the discriminator; and finally "Direction", our proposed method.
After testing all the models with both public datasets (see Fig. 24.6), it can be observed that our proposed method ("Direction", light blue) outperforms the other alternatives when the model is trained with MED, followed by the "Domain" proposal (yellow). Analysing the models trained with UMN, the performance of both the "Domain" and "Direction" methods is superior, with the "Direction" model performing slightly better. Although the selected discriminator loss weight may seem low, comparing the results of the "Mixed" method (orange), which has no discriminator, with those of the "Domain" method, which has one, makes it clear that adding a discriminator improves the model's performance. The average AUC of each model on each dataset is given in Table 24.1, along with the average over both datasets. Our proposed method is the best option among the tested models.
Comparison with State of the Art
As can be seen in Table 24.2, our method offers acceptable results when tested on the UMN dataset but is surpassed by the other SOTA methods. On the MED dataset, only one of the methods reports results; in this case, our method improves on it, establishing a new state of the art.
We aim to compare our solution with other available methods. However, it is important to note that a direct comparison may not be entirely fair, as each state-of-the-art model was designed to solve different tasks, with some focusing on anomaly detection. Additionally, these models may not have been developed under the same conditions or to address the specific challenges we are tackling in this work, such as reducing the domain gap. It is worth highlighting the robustness of our method, as it offers good results on both datasets, demonstrating that it is capable of generalizing and functioning in different environments and situations.
Conclusion
We have presented a novel methodology for training a model with real and synthetic data while avoiding domain gap issues, demonstrating that it is the most effective option among the tested strategies for combining synthetic and real data. Our approach shows the feasibility of utilizing synthetic data even when real data are limited, achieving results that are competitive with the current state of the art. Additionally, we have shown that multi-task learning during training can enhance the performance of the primary task. By implementing a domain classifier equipped with a gradient reversal layer and properly configured hyperparameters, we can extract domain-invariant features during training, effectively addressing the domain gap present in the datasets used. These findings underscore the potential and efficacy of our methodology in overcoming the challenges of combining synthetic and real data in model training.
References
European Parliament, & Council of the EU. (2016). Regulation (EU) 2016/679 (GDPR). Official Journal of the European Union, L 119(1).
Afiq, A., Zakariya, M., Saad, M., et al. (2019). A review on classifying abnormal behavior in crowd scene. Journal of Visual Communication and Image Representation.
Ammar, H., & Cherif, A. (2021). DeepROD: A deep learning approach for real-time and online detection of panic behavior in human crowds. Machine Vision and Applications.
Ilyas, Z., Aziz, Z., Qasim, T., et al. (2021). A hybrid deep network based approach for crowd anomaly detection. Multimedia Tools and Applications, 80, 24053–24067.
Alafif, T., Alzahrani, B., Cao, Y., et al. (2022). Generative adversarial network based abnormal behavior detection in massive crowd videos: A hajj case study. Journal of Ambient Intelligence and Humanized Computing. https://doi.org/10.1007/s12652-021-03323-5
Zhang, X., Shu, X., & He, Z. (2019). Crowd panic state detection using entropy of the distribution of enthalpy. Physica A: Statistical Mechanics and Its Applications. Elsevier.
Xu, M., Yu, X., Chen, D., Wu, C., & Jiang, Y. (2019). An efficient anomaly detection system for crowded scenes using variational autoencoders. Applied Sciences, 9(16), 33–37.
Singh, K., Rajora, S., Vishwakarma, D. K., et al. (2020). Crowd anomaly detection using aggregation of ensembles of fine-tuned convnets. Neurocomputing, 371, 188–198.
Ros, G., Sellart, L., Materzynska, J., et al. (2016). The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings IEEE CVPR (pp. 3234–3243).
Shafaei, A., Little, J. J., & Schmidt, M. (2016). Play and learn: Using video games to train computer vision models. In Proceedings BMVC.
Tonutti, M., Ruffaldi, E., Cattaneo, A., & Avizzano, C. A. (2019). Robust and subject-independent driving manoeuvre anticipation through domain adversarial recurrent neural networks. Robotics and Autonomous Systems, 115, 162–173.
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE international conference on CV.
Rodriguez, A. M., Unzueta, L., Geradts, et al. (2023). Multi-task explainable quality networks for large-scale forensic facial recognition. IEEE JSTSP, 17(3), 612–623.
MultiMedia LLC. (n.d.). Unusual crowd activity dataset of University of Minnesota. Retrieved from http://mha.cs.umn.edu
Rabiee, H., Haddadnia, J., Mousavi, H., Kalantarzadeh, M., Nabi, M., & Murino, V. (2016). Novel dataset for fine-grained abnormal behavior understanding in crowd. In IEEE international conference on advanced video and signal based surveillance.
Ferryman, J., & Shahrokni, A. (2009). PETS2009: Dataset and challenge. In IEEE international workshop on performance evaluation of tracking and surveillance.
Calle, J., Leskovsky, P., Garcia, J., & Sanchez, M. (2023). Synthetic dataset for panic detection in human crowded scenes. Eurographics 2023 – Posters.
Kondratyuk, D., Yuan, L., Li, Y., Zhang, L., Brown, M., & Gong, B. (2021). MoViNets: Mobile video networks for efficient video recognition. In Proceedings IEEE CVPR.
Acknowledgements
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 101021981, APPRAISE project.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2025 The Author(s)
Cite this chapter
Calle, J., Unzueta, L., Leskovsky, P., García, J. (2025). Learning Domain-Invariant Spatio-Temporal Visual Cues for Video-Based Crowd Panic Detection. In: Gkotsis, I., Kavallieros, D., Stoianov, N., Vrochidis, S., Diagourtas, D., Akhgar, B. (eds) Paradigms on Technology Development for Security Practitioners. Security Informatics and Law Enforcement. Springer, Cham. https://doi.org/10.1007/978-3-031-62083-6_24
DOI: https://doi.org/10.1007/978-3-031-62083-6_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-62082-9
Online ISBN: 978-3-031-62083-6