Introduction

Public spaces such as shopping malls, transportation hubs, and entertainment venues are often crowded environments in which it can be challenging for security personnel to monitor the safety of everyone present. CCTV cameras play a crucial role in monitoring the safety of people in crowded public spaces. Artificial intelligence (AI) can enhance the effectiveness of CCTV cameras by using advanced algorithms and machine-learning techniques to provide real-time footage analysis. However, this task is challenging due to the complexity and diversity of scenarios, the occlusion and motion of people, and the noise and variable quality of video streams.

Obtaining real-world data that meet these requirements can be difficult for two principal reasons: the data collection process must comply with privacy regulations (e.g. the EU’s GDPR [1]), implementing data protection principles such as anonymization; and it demands considerable time and cost, given the substantial quantity, variety, and balanced representation of labelled visual appearances needed to cover all possibilities during training. A common technique to overcome data scarcity is to use synthetic data to simulate real-world scenarios. This provides AI algorithms with a rich and diverse training dataset, improving the model’s performance. However, challenges related to the domain gap must be considered, such as (i) ensuring the synthetic data accurately represent real-world scenarios, (ii) balancing the proportion of real and synthetic data in the training set, and (iii) avoiding potential biases introduced by the data generation process. The main contributions of this paper are:

  • Presenting a methodology to train a model with real and synthetic data while avoiding domain gap issues, and discussing different strategies for incorporating synthetic data when real data are scarce.

  • Explaining a novel approach to visual-based panic detection by learning domain-invariant spatio-temporal visual cues.

  • Delving into the specifics of the data used to train and test the model, and detailing how the model learns characteristics that avoid domain gap issues.

  • Comparing the results of different experiments to identify the most effective approach and configuration of hyperparameters.

  • Benchmarking the results against the current state of the art.

Related Work

Panic Detection Methods

Due to the limited number of studies that focus specifically on panic detection, the state of the art is based on anomaly detection strategies and their application to panic detection. Anomaly detection involves two different strategies:

  1. Hand-crafted features: used for a long time to detect anomalies, with several approaches available. One approach is to analyse group behaviour, examining phenomena such as collectiveness, stability, or uniformity [2]. Another option is to analyse the crowd density, as proposed in [3], which uses crowd density and the motion of individuals within the crowd as features. Other strategies are based on spatio-temporal analysis, using the gradient sum of the frame difference as a feature [4]. One of the most widely used strategies is the use of optical flow to extract features. A recent study [5] proposes an optical-flow framework based on a GAN and uses transfer learning to detect behavioural abnormalities in large-scale crowd scenes. Another option, proposed in [6], is to use entropy-based methods; their experimental results show that panic crowd motion states have higher entropy, while normal crowd states have lower entropy (a minimal sketch of this idea follows the list).

  2. Automatic features: can be extracted using deep neural networks (DNNs). One widely used method for detecting anomalies is autoencoders [7]. Another approach is the use of convolutional neural networks (CNNs) [8].
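
To make the entropy-based idea concrete, here is a minimal sketch (not the implementation of [6]) that estimates optical flow with OpenCV’s Farneback method and computes the Shannon entropy of the flow-orientation histogram; the magnitude threshold and bin count are illustrative assumptions.

```python
# Sketch of an entropy-based crowd-motion feature, in the spirit of [6].
# Farneback optical flow is a stand-in; the cited work may use other estimators.
import cv2
import numpy as np

def motion_entropy(prev_gray: np.ndarray, gray: np.ndarray, bins: int = 16) -> float:
    """Shannon entropy of the optical-flow orientation histogram between two frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    moving = magnitude > 0.5                      # ignore near-static pixels (assumed threshold)
    hist, _ = np.histogram(angle[moving], bins=bins, range=(0, 2 * np.pi))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())         # higher entropy -> more chaotic motion
```

Under this measure, a crowd scattering in many directions yields a flatter orientation histogram and therefore higher entropy, matching the reported behaviour of panic states.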

Training with Synthetic and Real Data

There are several strategies for training a model with both real and synthetic data:

  1. A simple method is to train the model simultaneously on synthetic and real data, building batches with images from both domains; when defining the mixing ratio, the real images should dominate the distribution [9].

  2. Another approach is to pre-train the model on synthetic data and then fine-tune it on real data [10], allowing the model to learn general patterns from the synthetic data and then adapt to the real world during fine-tuning. Like the previous strategy, this approach depends heavily on the quality and amount of real data.

  3. Another method is to add a domain classifier that predicts whether an image belongs to the real or synthetic domain. The aim is to learn features that are useful for the main task but that the domain classifier cannot use to tell the domains apart, i.e., domain-invariant features [11]. These methods usually follow two steps: (i) learn features that minimize the loss of the target task and (ii) learn features that maximize the loss of the domain classifier (a minimal gradient-reversal sketch follows this list).

  4. A completely different approach is to use image-to-image translation techniques based on generative adversarial networks (GANs) [12]. The idea is to make the synthetic images look more realistic so the model can learn from both domains without confusion. These methods are making remarkable advances, but they still tend to introduce artefacts or distortions in the translated images, which can affect the model’s performance. Therefore, this strategy was not considered in this study.

    Our proposed strategy combines the extracted domain-invariant features with spatio-temporal cues to represent the direction of motion of the crowd.
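
To ground the third strategy, which we build on, the following is a minimal TensorFlow sketch of a gradient reversal layer in the spirit of [11]; it is an illustration under our own naming, not the exact layer used in this work. The forward pass is the identity, while the backward pass multiplies the incoming gradient by -lam, so a single backward pass performs both steps (i) and (ii) described above.

```python
# Minimal gradient reversal layer (sketch, in the spirit of [11]).
# Forward: identity. Backward: gradient scaled by -lam, so training the
# domain classifier simultaneously pushes the backbone away from
# domain-specific features.
import tensorflow as tf

class GradientReversal(tf.keras.layers.Layer):
    def __init__(self, lam: float = 1.0, **kwargs):
        super().__init__(**kwargs)
        self.lam = lam

    def call(self, x):
        @tf.custom_gradient
        def _flip(x):
            def grad(dy):
                return -self.lam * dy  # reverse (and scale) the gradient
            return tf.identity(x), grad
        return _flip(x)
```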

Training Methodology

Multi-task learning has shown that dedicating some networks to classifying specific features related to the principal task can help improve the model’s performance, as shown in [13]. Our training strategy uses this principle and adds classifiers that identify the direction in which the crowd runs during panic, along with a domain classifier, to outperform a conventional panic detector. The proposed training strategy extracts domain-invariant features from the data and captures both the input sequence’s spatio-temporal dynamics and the crowd motion information. The architecture consists of three components (see Fig. 24.1; a sketch of how they could be wired together follows the list):

  1. Panic detection model: a standard model with an input layer that receives frame sequences, a backbone network, and a classifier layer.

  2. Discriminator: a domain classifier that assigns the input sample to its corresponding domain (real or synthetic) using a fully connected layer and a two-class output layer with softmax activation. Its purpose is to help the backbone learn domain-invariant features from the data. This is achieved by taking the loss of the discriminator and applying it to the backbone in reverse through a gradient reversal layer. In the end, the discriminator cannot accurately classify the domain of the input, because the reversed gradient discourages the backbone from extracting domain-specific features.

  3. Direction classifiers: classifiers consisting of a fully connected layer and a two-class output layer with softmax activation. Each one is attached to the backbone of the panic detection model and determines the direction of the panic runs: it outputs panic when panic occurs in its direction, and no panic otherwise. When there is no panic in any direction, all classifiers should predict no panic.
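
Below is a hedged sketch of how these three components could be wired together in Keras, reusing the GradientReversal layer from the previous sketch. The small Conv3D stack merely stands in for the MoviNet-A0 backbone described later; all layer sizes and names are illustrative assumptions, not the authors’ exact configuration.

```python
# Sketch: assembling the backbone with the panic, domain, and direction heads.
# The Conv3D stack is a placeholder for MoviNet-A0; sizes/names are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

def build_training_model(seq_len=15, size=256):
    inputs = tf.keras.Input(shape=(seq_len, size, size, 3))

    # Backbone placeholder: extracts spatio-temporal features.
    x = layers.Conv3D(16, 3, strides=2, activation="relu")(inputs)
    x = layers.Conv3D(32, 3, strides=2, activation="relu")(x)
    features = layers.GlobalAveragePooling3D()(x)

    # Panic detection head.
    panic = layers.Dense(128, activation="relu")(features)
    panic = layers.Dense(2, activation="softmax", name="panic")(panic)

    # Discriminator behind a gradient reversal layer (previous sketch).
    rev = GradientReversal(lam=1.0)(features)
    domain = layers.Dense(128, activation="relu")(rev)
    domain = layers.Dense(2, activation="softmax", name="domain")(domain)

    # Four direction classifiers: top, bottom, right, left.
    directions = [
        layers.Dense(2, activation="softmax", name=f"dir_{d}")(
            layers.Dense(64, activation="relu")(features))
        for d in ("top", "bottom", "right", "left")
    ]
    return tf.keras.Model(inputs, [panic, domain, *directions])
```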

Fig. 24.1 Proposed training architecture, including a panic detection model, a discriminator, and four direction classifiers

During the testing phase, the discriminator and direction classifiers are removed, and only the panic detection model is used.

During training, we create batches that include sequences from both domains, each with six labels: panic, domain, and four panic directions (top, bottom, right, and left); a minimal batch-construction sketch follows. This approach narrows the domain gap by learning domain-invariant features from the data, allowing us to incorporate more synthetic data and giving us more flexibility in choosing the distribution ratio between the two domains. However, a highly imbalanced dataset could yield worse results, and the threshold ratio that determines the trade-off between performance and data balance remains an open question.
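
The following tf.data sketch illustrates this batch construction; the dummy datasets, head names, and the 50/50 mixing ratio are all illustrative assumptions (the paper leaves the optimal ratio open).

```python
# Sketch: mixed-domain batches carrying the six labels (illustrative names).
import tensorflow as tf

def dummy_domain(domain, n=4, seq_len=15, size=32):
    """Stand-in for a real video-sequence pipeline; shapes are illustrative."""
    seqs = tf.random.uniform((n, seq_len, size, size, 3))
    zeros = tf.zeros((n,), tf.int32)
    return tf.data.Dataset.from_tensor_slices(
        (seqs, zeros, tf.fill((n,), domain), zeros, zeros, zeros, zeros))

def add_labels(seq, panic, domain, top, bottom, right, left):
    return seq, {"panic": panic, "domain": domain, "dir_top": top,
                 "dir_bottom": bottom, "dir_right": right, "dir_left": left}

real_ds, synth_ds = dummy_domain(domain=0), dummy_domain(domain=1)
mixed = tf.data.Dataset.sample_from_datasets(      # per-batch domain mixing
    [real_ds.map(add_labels), synth_ds.map(add_labels)], weights=[0.5, 0.5])
batches = mixed.shuffle(256).batch(16).prefetch(tf.data.AUTOTUNE)
```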

Experimental Results

Datasets

Real datasets. Few real datasets of panic events exist from a CCTV camera’s point of view (see Fig. 24.2). The available ones are:

  • UMN [14]: a common benchmark created by the University of Minnesota, with ten sequences of three scenes recorded with a fixed camera at 30 fps and 480 × 640 resolution. It is used for testing abnormal event detection in crowds: pedestrians walk with no regular motion pattern and then suddenly run away in different directions.

  • MED (Motion Emotion Dataset) [15]: contains videos of people walking on individual walkways, filmed from above with a stationary camera at 30 fps, with labels for the emotions and behaviours of the crowd. The videos begin with normal scenes and end with abnormal events (panic, fights, fallen people, abandoned objects, etc.).

  • PETS2009 [16]: has multiple cameras, actors, and some abnormal events where people run away, but a low frame rate of 8 fps. It is suitable for estimating the number and density of people, tracking individuals, and detecting different flows and events.

Fig. 24.2 Samples from real datasets: (left) UMN, (centre) MED, (right) PETS2009

Synthetic dataset. We use a synthetic dataset [17] composed of recordings showing the behaviour of crowds in normal and panic situations. Each recording shows a group of pedestrians walking randomly, with different density levels (low, mid, and high). The dataset contains sequences generated in six places (see Fig. 24.3), using different camera configurations (angle and position), weather conditions, and pedestrian locations for each simulation. Each simulation has at least four cameras placed at different positions and angles to observe the scene from different perspectives. The resulting dataset contains videos of 375 frames each: 6 different places and 31 different CCTV cameras in total (with slight variations in each simulation), yielding 320 videos under various weather conditions.

Fig. 24.3 Synthetic dataset samples showing the six scenarios

Training dataset. Our proposed architecture uses spatio-temporal data to make predictions, so the frame rates of the training and testing datasets must match. We discarded the PETS2009 dataset due to its low frame rate. The UMN dataset is a single video composed of different panic scenes, which we separated into independent videos. Labels are printed over the frames when the panic begins; to avoid the text box being detected as panic, we cropped all the frames, removing the top part of the image. The MED dataset contains a variety of anomalous situations, so we selected only the sequences containing panic (1, 2, 3, 4, 5, 8, 9, 10, and 11) for training and testing. After this process, the UMN dataset offers 11 videos and the MED dataset offers 9 videos, for a total of 20 available videos. As this is insufficient to train our model, we also use our synthetic dataset.

We annotated the three filtered datasets with the crowd directions during panic and created different versions of each training dataset by extracting sequences of varying length (30, 15, or 10 frames) and frame rate (30 or 15 fps) from the filtered videos (a minimal extraction sketch follows). Any sequence containing both normal and panic frames is excluded, and we avoid any overlap between sequences. After this process, we obtained 164 normal and 17 panic sequences from the UMN dataset; 1343 normal and 31 panic sequences from the MED dataset; and 11,888 normal and 5944 panic sequences from the synthetic dataset.
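
A minimal sketch of this extraction rule under the stated constraints (fixed-length, non-overlapping windows, optional temporal subsampling, mixed windows excluded); all names are illustrative.

```python
# Sketch: fixed-length, non-overlapping sequence extraction with subsampling.
# rate_step=2 turns a 30-fps video into a 15-fps sequence.
import numpy as np

def extract_sequences(frames, frame_labels, seq_len=15, rate_step=2):
    """frames: (N, H, W, 3) array; frame_labels: (N,) with 0 = normal, 1 = panic."""
    span = seq_len * rate_step
    sequences, labels = [], []
    for start in range(0, len(frames) - span + 1, span):    # no overlap
        idx = list(range(start, start + span, rate_step))
        window = frame_labels[idx]
        if window.min() != window.max():                    # mixed labels -> exclude
            continue
        sequences.append(frames[idx])
        labels.append(int(window[0]))
    if not sequences:
        return np.empty((0, seq_len) + frames.shape[1:]), np.empty((0,), int)
    return np.stack(sequences), np.asarray(labels)
```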

Implementation Details

Model selection. The proposed model for action classification is MoviNet-A0 [18], a state-of-the-art model designed for efficiency and accuracy in video-recognition tasks. MoviNet is based on 3D CNNs and can handle challenges such as varying camera angles, lighting conditions, or background changes. We expect MoviNet to deliver high accuracy and efficiency in detecting panic and normal events, as its ability to extract spatio-temporal features makes it well suited to our problem.

Training parameters. The selection of sequence length and frame rate is crucial in our approach; a consistent frame rate for both training and inference is essential. The “loss weight” is a key parameter of our proposed methodology: it determines the contribution of each branch to the overall model learning and can be adjusted to balance the learning of the different branches, preventing one branch from dominating the others (see the sketch below). It is important to monitor the training process and evaluate the performance of each classifier, ensuring that all classifiers except the discriminator learn to classify the sequences correctly. The discriminator should have an accuracy close to 0.5, indicating that it cannot differentiate between real and synthetic sequences.
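
The Keras loss_weights argument implements exactly this per-branch weighting; the sketch below reuses the head names and build_training_model from the earlier assembly sketch, the Adam optimizer is an assumption, and the weight values shown are those reported as best in the Analysis section.

```python
# Sketch: per-branch loss weights in Keras (head names from the earlier sketch).
# Adam is an assumed optimizer; weights are the best values reported below.
import tensorflow as tf

heads = ("panic", "domain", "dir_top", "dir_bottom", "dir_right", "dir_left")
model = build_training_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss={h: "sparse_categorical_crossentropy" for h in heads},
    loss_weights={"panic": 0.5, "domain": 0.0001, "dir_top": 0.4,
                  "dir_bottom": 0.4, "dir_right": 0.4, "dir_left": 0.4},
)
```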

We experimented with different frame rates and sequence lengths to find the optimal dataset for testing our methodology. Using 15 frames at 15 fps produced results comparable to using 30 frames at 30 fps, but at lower computational cost. Frame rates lower than 15 fps or sequences shorter than 1 s resulted in worse performance. We also tested different image sizes and learning rates, selecting an input size of 256 × 256 pixels and a learning rate of 0.001.

Testing method. In the testing phase, we use a real dataset that was not used during training to evaluate the performance of the model on unseen data. We use a sliding window with a step of 1 and respect the training frame rate: we add one of every two frames of the video to the window, giving an effective frame rate of 15 fps.

A sequence is considered panic when all of its frames are panic. The panic ground-truth labels have been shifted by 30 frames (1 s, the time span covered by a sequence). To evaluate the performance of our model, we use the area under the curve (AUC) metric (a sliding-window scoring sketch follows). It is reliable for evaluating our model’s performance because it considers both the true positive and false positive rates, measuring how well the model distinguishes between the two classes, which makes it useful when dealing with unbalanced datasets.
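
A minimal sketch of this evaluation loop follows; model and array names are illustrative, and for simplicity the panic head of the full training model is indexed directly (in practice the auxiliary heads are removed at test time, as noted above).

```python
# Sketch: sliding-window panic scoring and AUC (illustrative names).
import numpy as np
from sklearn.metrics import roc_auc_score

def panic_scores(model, frames, seq_len=15, rate_step=2):
    """frames: (N, H, W, 3) video; one of every two frames feeds the window."""
    span = seq_len * rate_step
    scores = []
    for start in range(len(frames) - span + 1):          # window step of 1
        window = frames[start:start + span:rate_step]    # effective 15 fps
        outputs = model.predict(window[None], verbose=0)
        scores.append(float(outputs[0][0, 1]))           # panic head, class "panic"
    return np.asarray(scores)

# Ground-truth labels are shifted by 30 frames (1 s) as described above:
# auc = roc_auc_score(shifted_labels, panic_scores(model, video_frames))
```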

Analysis

Importance of the distribution of data. Several tests were conducted using the configuration described in the Implementation Details section, one of them to determine the data distribution between real and synthetic data. The model was trained with the panic detector and the discriminator to verify that the selected number of synthetic sequences was appropriate for training. The desired behaviour was that the discriminator did not improve its performance while the panic detector learned to identify panic properly. It was found that the difference between the numbers of synthetic and real sequences was too large, and the former was reduced until the training worked as expected. The original synthetic dataset had 17,832 sequences, while the filtered dataset had only 3155. The red experiment (see Fig. 24.4), which used the entire synthetic dataset, learned to identify the domain, as shown by its increasing accuracy and decreasing loss. The orange experiment showed that the discriminator could not learn the domain, its accuracy dropping to approximately 0.38. As a result, the reduced synthetic dataset was selected for the rest of the tests.

Fig. 24.4 Domain classifier training metrics: the loss (left) and the accuracy (right) of models trained with different data distributions

Loss weight. To evaluate the impact of the loss weight on the performance of the model, a series of tests was conducted. The model was trained on a combination of one real dataset and the reduced synthetic dataset, and tested on the real dataset not used for training. This process was repeated, alternating the UMN and MED datasets. The complete proposed architecture, including the panic model, the discriminator, and the direction classifiers, was used for this experiment.

The results shown in Fig. 24.5 correspond to a model trained on the MED dataset mixed with the reduced synthetic dataset and tested on the UMN dataset. Four models were trained, each with a different set of per-branch loss weight values: one for the discriminator, one for the direction classifiers (all sharing the same value), and one for the panic classifier.

Fig. 24.5 Comparison of model performance (panic score vs. frame number) with different loss weight values

The influence of the loss weight is significant. After evaluating the models trained with both the MED and UMN datasets (combined with the synthetic dataset), the best loss weight configuration was found to be 0.5 for the panic detection classifier, 0.0001 for the discriminator, and 0.4 for the direction classifiers.

Comparative of Methods

In this section, we compare the different strategies, referred to as: “Public”, trained only with real data; “Mixed”, trained with real and synthetic data; “Finetuned”, trained first with synthetic data and fine-tuned with real data; “Domain”, the panic model with the discriminator; and finally “Direction”, our proposed method.

After testing all the models with both public datasets (see Fig. 24.6), it can be observed that our proposed method (“Direction”, light blue) outperforms the other alternatives when the model is trained with MED, followed by the “Domain” proposal (yellow). Analysing the models trained with UMN, the performance of both the “Domain” and “Direction” methods is superior, with the “Direction” model performing slightly better. Although the selected discriminator loss weight may seem low, comparing the results of the “Mixed” method (orange), which has no discriminator, with those of the “Domain” method, which has one, shows a clear performance improvement, demonstrating that the addition of a discriminator works. The average AUC results of each model on each dataset can be seen in Table 24.1, along with the average over both datasets. Our proposed method is the best option among the tested models.

Fig. 24.6 Models’ AUC by video: (left) models trained with MED, predicting on the UMN dataset; (right) models trained with UMN, predicting on the MED dataset

Table 24.1 AUC results of the compared methods

Comparison with State of the Art

As can be seen in Table 24.2, our method offers acceptable results when tested on the UMN dataset but is surpassed by the other state-of-the-art (SOTA) methods. On the MED dataset, only one of the methods reports results; our method improves on it, becoming the SOTA.

Table 24.2 Comparison of our proposed method with the state-of-the-art methods

We aim to compare our solution with other available methods. However, it is important to note that a direct comparison may not be entirely fair, as each state-of-the-art model was designed to solve different tasks, with some focusing on anomaly detection. Additionally, these models may not have been developed under the same conditions or to address the specific challenges we are tackling in this work, such as reducing the domain gap. It is worth highlighting the robustness of our method, as it offers good results on both datasets, demonstrating that it is capable of generalizing and functioning in different environments and situations.

Conclusion

We have presented a novel methodology for training a model with real and synthetic data while avoiding domain gap issues, demonstrating that it is the most effective of the tested options for combining both data sources. Our approach has shown the feasibility of using synthetic data even in situations where real data are limited, achieving results that are competitive with the current state of the art. Additionally, we have shown that multi-task learning during training can enhance the performance of the primary task. By implementing a domain classifier equipped with a gradient reversal layer and properly configured hyperparameters, we can extract domain-invariant features during training, effectively addressing the domain gap present in the datasets used. These findings underscore the potential and efficacy of our methodology in overcoming the challenges of combining synthetic and real data in model training.