A Novel Voronoi-based Convolutional Neural Network Framework for Pushing Person Detection in Crowd Videos

Analyzing the microscopic dynamics of pushing behavior within crowds can offer valuable insights into crowd patterns and interactions. By identifying instances of pushing in crowd videos, a deeper understanding of when, where, and why such behavior occurs can be achieved. This knowledge is crucial for creating more effective crowd management strategies, optimizing crowd flow, and enhancing overall crowd experiences. However, manually identifying pushing behavior at the microscopic level is challenging, and existing automatic approaches cannot detect such microscopic behavior. Thus, this article introduces a novel automatic framework for identifying pushing at the microscopic level in videos of crowds. The framework comprises two main components: i) feature extraction and ii) video labeling. In the feature extraction component, a new Voronoi-based method is developed for determining the local regions associated with each person in the input video. These regions are then fed into an EfficientNetV1B0 Convolutional Neural Network to extract each person's deep features over time. In the second component, a fully connected layer with a Sigmoid activation function analyzes these deep features and annotates the individuals involved in pushing within the video. The framework is trained and evaluated on a new dataset created from six real-world experiments and their corresponding ground truths. The experimental findings indicate that the proposed framework outperforms seven baseline methods employed for comparison.


Introduction
With the rapid development of urbanization, dense crowds have become widespread in various locations, such as religious sites, train stations, concerts, stadiums, malls, and famous tourist attractions. In such highly dense crowds, pushing behavior can easily arise. Such behavior can increase the crowd's density, potentially posing a threat not only to people's comfort but also to their safety [1][2][3][4][5]. People in crowds start pushing for different reasons: saving their lives from fire [6][7][8] or other hazards, catching a bargain at a sale, simply accessing an overcrowded subway train [9,10], or gaining access to a venue [3,4,[11][12][13]. Understanding the microscopic dynamics of pushing plays a pivotal role in effective crowd management, helping safeguard the crowd from tragedies and promoting overall well-being [1,14]. This has led to several studies aiming to comprehend pushing dynamics, especially at crowded event entrances [15][16][17][18][19][20]. Lügering et al. [15] defined pushing as a behavior that pedestrians use to reach a target (like accessing an event) faster. This behavior involves pushing others using arms, shoulders, elbows, or the upper body, as well as exploiting gaps among neighboring people to move forward more quickly.
The study [15] introduced a manual rating method to understand pushing dynamics at the microscopic level. The method relies on two trained psychologists who classify pedestrians' behaviors over time in a crowd video into pushing or non-pushing categories, helping to establish when, where, and why pushing behavior occurs. However, this manual method is time-consuming, tedious, and prone to errors in some scenarios. Additionally, it requires trained observers, which may not always be feasible. Consequently, there is an increasing demand for an automatic approach to identify pushing at the microscopic level within crowd videos. Detecting pushing behavior automatically is a demanding computer vision task. The challenge arises from several factors, such as the dense crowds gathering at event entrances, the varied manifestations of pushing behavior, and the significant resemblance and overlap between pushing and non-pushing actions.
Recently, machine learning algorithms, particularly Convolutional Neural Network (CNN) architectures, have shown remarkable success in various computer vision tasks, including face recognition [21], object detection [22], and abnormal behavior detection [23]. One key reason for this success is that CNNs can learn relevant features [24][25][26] automatically from data without human supervision [27,28]. Following CNN's success in abnormal behavior detection, which is closely related to pushing detection, some studies have started to automate pushing detection using CNN models [16,17,29]. For instance, Alia et al. [16,30] introduced a deep learning framework that leverages deep optical flow and CNN models for pushing patch detection in video recordings. Another study [29] introduced a fast GPU-based hybrid deep neural network model to increase the speed of video analysis and pushing patch identification. Similarly, the authors of [17,31,32] developed an intelligent framework that combines deep learning algorithms, a cloud environment, and live camera stream technology to accurately annotate pushing patches in real time. Yet, the current automatic methods focus on identifying pushing behavior at the level of regions (macroscopic level) rather than at the level of individuals (microscopic level), where each region can contain a group of persons. In other words, the automatic approaches reported in the literature cannot detect pushing at the microscopic level, limiting their contribution to comprehending pushing dynamics in crowds. For example, they cannot accurately determine the relationship between the number of individuals involved in pushing and the onset of critical situations, thereby hindering a precise understanding of when a situation may escalate to a critical level.
To overcome the limitations of the aforementioned methods, this article introduces a novel Voronoi-based CNN framework for automatically identifying instances of microscopic pushing behavior in crowd video recordings. The proposed framework comprises two components: feature extraction and labeling. The first component utilizes a novel Voronoi-based EfficientNetV1B0 CNN architecture for feature extraction. The Voronoi-based method [33] is used to identify the local region of each person over time, and the EfficientNetV1B0 model [34] then extracts deep features from these regions. In this article, the local region is defined as the zone focusing only on a single person (the target person), including their surrounding space and physical interactions with their direct neighbors. This region is crucial in guiding the proposed framework to focus on microscopic behavior. The second component employs a fully connected layer with a Sigmoid activation function to analyze the deep features and detect the pushing persons. The framework (CNN and fully connected layer) is trained from scratch on a dataset of labeled local regions generated from six real-world video experiments with their ground truths [35].
The main contributions of this work are summarized as follows: 1) To the best of our knowledge, this article presents the first framework for automatically identifying pushing at the individual level in videos of human crowds. 2) It introduces a novel feature extraction method for characterizing microscopic behavior in crowd videos, particularly pushing behavior. 3) It creates a new dataset derived from local regions, based on six real-world experiments paired with corresponding ground truths; this dataset represents a valuable resource for future research in this domain.
The remainder of this article is organized as follows. Section 2 reviews automatic approaches to abnormal behavior detection in crowd videos. The architecture of the proposed framework is introduced in Section 3. Section 4 presents the processes of training and evaluating the framework. Section 5 discusses experimental results and comparisons. Finally, the conclusion and future work are summarized in Section 6.

Related Work
This section begins by providing an overview of CNN-based approaches for automatic video analysis and abnormal behavior detection in crowds. It then discusses methods for automatically detecting pushing patches in crowd videos.

CNN-based Abnormal Behavior Detection
Typically, behavior is considered abnormal when seen as unusual in a specific context. Some researchers designed and trained customized CNNs to extract features and label samples, utilizing datasets comprising both normal and abnormal samples. In another study, Alafif et al. [45] introduced two approaches for detecting abnormal behaviors in crowd videos, varying in scale from small to large. For detecting anomalous behaviors in a small-scale crowd at the object level, the first method utilizes a hybrid approach that combines a pre-trained CNN model with a random forest classifier. The second method employs a two-step approach to identify abnormal behaviors in a large-scale crowd: initially, a pre-trained model is used as the first classifier to identify frames containing abnormal behaviors; subsequently, a second classifier, specifically You Only Look Once (version 2), analyzes the identified frames and detects abnormal behaviors exhibited by individuals. Nevertheless, constructing an accurate CNN classifier requires a substantial training dataset, which is often unavailable for many human behaviors.
To address the limited availability of large datasets containing both normal and abnormal behaviors, some researchers have employed one-class classifiers using datasets that consist exclusively of normal behaviors; creating or acquiring such a dataset is comparatively easier than obtaining one that includes both normal and abnormal behaviors [46,47]. The fundamental concept behind the one-class classifier is to learn exclusively from normal behaviors, thereby establishing a class boundary between the normal and undefined (abnormal) classes. For example, Sabokrou et al. [46] utilized a pre-trained CNN to extract motion and appearance information from crowded scenes, then employed a one-class Gaussian distribution to build the classifier using datasets of normal behavior. Similarly, in [47,48], the authors constructed one-class classifiers leveraging datasets composed exclusively of normal samples. In [47], Xu et al. employed a convolutional variational autoencoder to extract features, followed by multiple Gaussian models to detect abnormal behavior. Meanwhile, in [48], a pre-trained CNN model was employed for feature extraction, while one-class support vector machines were utilized for detecting abnormal behavior. In a separate study, Ilyas et al. [49] utilized a pre-trained CNN along with the gradient sum of the frame difference to extract meaningful features, and subsequently trained three support vector machines on normal behavior data to identify abnormal behaviors. In general, one-class classifiers are frequently employed when the target behavior class is rare or lacks a clear definition [50]. However, pushing behavior is well-defined and not rare, particularly in high-density and competitive situations. Furthermore, this type of classifier considers new normal behavior as abnormal.
To address the limitations of CNN-based and one-class classifier approaches, multiple studies have explored combining multi-class CNNs with one or more handcrafted feature descriptors [23,49]. In these hybrid approaches, the descriptors extract valuable information from the data; the CNN then learns relevant features and performs classification based on the extracted information. For instance, Duman et al. [37] employed the classical Farnebäck optical flow method [51] and a CNN to identify abnormal behavior: they used Farnebäck and the CNN to estimate direction and speed information, then applied a convolutional long short-term memory network to build the classifier. Hu et al. [52] employed a combination of the histogram of gradient and a CNN for feature extraction, with a least-squares support vector machine for classification. Direkoglu [23] utilized the Lucas-Kanade optical flow method and a CNN to extract relevant features and identify "escape and panic behaviors". Almazroey et al. [53] used Lucas-Kanade optical flow, a pre-trained CNN, and feature selection methods (specifically neighborhood component analysis) to extract relevant features, which were then used to train a support vector machine classifier. In another study [54], Zhou et al. introduced a CNN-based approach for detecting and localizing anomalous activities, integrating optical flow with a CNN for feature extraction and utilizing a CNN for classification.
Hybrid approaches could be more suitable for automatically detecting pushing behavior given the limited availability of labeled pushing data. Nevertheless, most of the reviewed hybrid approaches for abnormal behavior detection may be inefficient for detecting pushing because: 1) the descriptors used in these approaches can extract only limited essential data from high-density crowds to represent pushing behavior, and 2) some CNN architectures commonly utilized in these approaches may not cope well with the increased variation within pushing behavior (intra-class variance) and the substantial resemblance between pushing and non-pushing behaviors (high inter-class similarity), which can result in misclassification.

CNN-based Pushing Behavior Detection
More recently, a few approaches merging effective descriptors with robust CNN architectures have been developed for detecting pushing regions in crowds. For example, Alia et al. [16] introduced a hybrid deep learning and visualization framework to aid researchers in automatically detecting pushing behavior in videos. The framework combines deep optical flow and visualization methods to extract visual motion information from the input video. This information is then analyzed using an EfficientNetV1B0-based CNN and false-reduction algorithms to identify and label pushing patches in the video.

Fig. 1 The architecture of the proposed framework. In f_t, f signifies an extracted frame, while t indicates its timestamp in seconds, counted from the beginning of the input video (with t taking values 1, 2, 3, ...). For a target person i at f_t, L_i(f_t) denotes the local region, while N_i(f_t) represents the direct neighbors. FC stands for fully connected layer, while GAP refers to global average pooling.
The framework has a drawback in terms of speed, as its motion extraction relies on a CPU-based optical flow method, which is slow. Another study [29] presented a fast hybrid deep neural network model that labels pushing patches in short videos lasting only two seconds. The model is based on an EfficientNetB1-based CNN and GPU-based deep optical flow.
To support the early detection of pushing patches within crowds, the study [17] presented a cloud-based deep learning system. The primary goal of such a system is to offer organizers and security teams timely, valuable information that enables early intervention and mitigates hazardous situations. The proposed system relies mainly on a fast and accurate pre-trained deep optical flow model, an adapted version of an EfficientNetV2B0-based CNN, a cloud environment, and live stream technology. The optical flow model extracts the crowd's motion characteristics from the live video stream while the classifier analyzes this motion to label pushing patches directly on the stream. Moreover, the system stores the annotated data in cloud storage, which is crucial for assisting planners and organizers in evaluating their events and improving future plans.
To the best of our knowledge, current pushing detection approaches in the literature focus primarily on identifying pushing at the patch level rather than at the individual level. However, identifying the individuals involved in pushing would be more helpful for understanding pushing dynamics. Hence, this article introduces a new framework for detecting pushing individuals in videos of crowds. The following section provides a detailed discussion of the framework.

Proposed Framework Architecture
This section describes the proposed framework for automatic pushing person detection in videos of crowds. As depicted in Fig. 1, the framework consists of two components: the feature extraction component, which identifies each person's local regions over time and extracts deep features from them, and the labeling component, which analyzes the extracted deep features and annotates the pushing persons within the input video. The following sections discuss both components in more detail.

Feature Extraction Component
This component aims to extract deep features of each individual's behavior, which are then used to classify pedestrians as pushing or non-pushing. To accomplish this, the component consists of two modules: Voronoi-based local region extraction and EfficientNetV1B0-based deep feature extraction. The first module selects one frame per second from the input video and identifies the local region of each person within these extracted frames. The second module then extracts deep features from each local region and feeds them to the next component for pedestrian labeling. Before diving into these modules, let us first define the local region at a single frame.
A frame f_t is captured every second from the input video. Here, t represents the timestamp, in seconds, since the start of the video and ranges from 1 to T, where T is the total duration of the video in seconds. We can analyze individual pedestrians within each such frame f_t. For instance, consider a pedestrian i positioned at ⟨x, y⟩_i. Let N_i denote the set of pedestrians whose Voronoi cells are adjacent to that of pedestrian i; specifically, pedestrian j belongs to N_i if and only if their Voronoi cells share a boundary. The local region of pedestrian i at f_t, L_i, forms a two-dimensional closed polygon defined by the positions of all pedestrians in N_i. As illustrations, Fig. 2a provides examples of both N_i (left image) and L_i (right image).
The region L_i encapsulates the crowd dynamics around individual i, reflecting potential interactions between i and its neighbors N_i. Notably, the characteristics around a pushing individual may diverge from those around a non-pushing one, a distinction pivotal for highlighting pushing behaviors. Fig. 2b showcases examples of such L_i regions for pushing and non-pushing individuals. The following section introduces a novel method for extracting L_i.

Voronoi-based Local Region Extraction
This section presents a novel method for extracting the local regions of pedestrians from the input video over time. The method consists of several steps: frame extraction, dummy point generation, direct neighbor identification, and local region extraction; the first step is sketched below.
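As a minimal illustration of the frame extraction step, the following sketch samples one frame per second with OpenCV. The function name and the seek-based sampling are illustrative assumptions rather than the paper's exact implementation.

```python
import cv2

def extract_frames_per_second(video_path):
    """Yield one frame per second of video, with its timestamp t = 1, 2, 3, ..."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)  # 25 for the experiments used here
    t = 1
    while True:
        # Jump to the frame recorded t seconds after the start of the video.
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(t * fps))
        ok, frame = cap.read()
        if not ok:  # past the end of the video
            break
        yield t, frame
        t += 1
    cap.release()
```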
Based on the definition of L_i presented earlier, the determination of each i's regional boundary is contingent upon N_i at f_t, denoted N_i(f_t). Nonetheless, this definition does not always guarantee that every i is enclosed within their own local region. This is particularly evident when i at f_t lacks neighboring points in some directions, as exemplified by person 37 in Fig. 3a.
To address this issue, we introduce a dummy point generation step. This involves adding points around each i at f_t in areas where they lack direct neighbors, ensuring every i remains enclosed within their local region, as illustrated by person 37 in Fig. 3c. For this purpose, as depicted in Fig. 3b and Algorithm 1, this step first reads the trajectory data of i corresponding to f_t (Algorithm 1, lines 1-8). Concurrently, the area surrounding every i is divided into four equal square regions, each able to accommodate at least one person (Algorithm 1, lines 9-17). The location ⟨x, y⟩_i corresponds to the first 2D coordinate of each region (Algorithm 1, lines 12-13), while the remaining 2D coordinates (⟨x1, y1⟩, ⟨x2, y2⟩, ⟨x3, y3⟩, ⟨x4, y4⟩) required for identifying the regions can be determined by

⟨x1, y1⟩ = ⟨x − r, y + r⟩, ⟨x2, y2⟩ = ⟨x + r, y + r⟩, ⟨x3, y3⟩ = ⟨x − r, y − r⟩, ⟨x4, y4⟩ = ⟨x + r, y − r⟩, (1)

where r is the dimension of each square region. Subsequently, each region is checked to verify whether it contains any pedestrians; if a region is empty, a dummy point at its center is appended to the input trajectory data. Fig. 3b illustrates an example of four regions surrounding person 37 and two dummy points (yellow dots in the first and second empty regions); see Algorithm 1, lines 18-24.
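A minimal Python sketch of this dummy point generation step follows. The corner ordering follows Eq. (1); details such as the handling of pedestrians lying exactly on region borders are illustrative assumptions, not a line-by-line reproduction of Algorithm 1.

```python
import numpy as np

def generate_dummy_points(positions, r):
    """Append a dummy point at the center of each empty square region.

    positions: (n, 2) array of pedestrian coordinates at one frame f_t.
    r: dimension of each square region (same unit as the trajectory data).
    """
    pts = np.asarray(positions, dtype=float)
    dummies = []
    # The four squares share the corner <x, y>_i; their opposite corners
    # are the four points of Eq. (1).
    offsets = [(-r, r), (r, r), (-r, -r), (r, -r)]
    for x, y in pts:
        for dx, dy in offsets:
            lo = np.minimum([x, y], [x + dx, y + dy])
            hi = np.maximum([x, y], [x + dx, y + dy])
            inside = np.all((pts >= lo) & (pts <= hi), axis=1)
            if inside.sum() - 1 == 0:  # only the target person itself inside
                dummies.append([x + dx / 2.0, y + dy / 2.0])  # region center
    return np.vstack([pts, dummies]) if dummies else pts
```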
After generating the dummy points for all i at f_t, the trajectory data is forwarded to the next step, direct neighbor identification; Fig. 3c shows a crowd with dummy points in a single f_t. This third step employs a combination of the Voronoi Diagram [33] and the Convex Hull [55] to find N_i(f_t) from the input trajectory data with dummy points. A Voronoi Diagram is a method for partitioning a plane into several polygonal regions (named Voronoi cells, V) based on a set of objects/points (called sites) [33]. Each V contains edges and vertices, which form its boundary. Fig. 4a depicts an example of a Voronoi Diagram with 51 cells for 51 sites, where black and yellow dots denote the sites. In the same figure, the set of sites contains all ⟨x, y⟩_i (dummy points included) at a specific f_t; each V_i includes exactly one site ⟨x, y⟩_i, and all points within V_i are closer to ⟨x, y⟩_i than to any other site ⟨x, y⟩_q, where q ranges over all pedestrians at that f_t and q ≠ i.
Furthermore, V_i and V_q at f_t are considered adjacent if they share at least one edge or two vertices. For instance, as seen in Fig. 4, V_4 and V_34 are adjacent, while V_4 and V_3 are not. Since the Voronoi Diagram contains unbounded cells, determining the adjacent cells for each V_i at f_t may yield inaccurate results; for instance, most cells of the yellow points located at the scene's borders are unbounded, as depicted in Fig. 4a. More precisely, V_i(f_t) becomes unbounded when i is a vertex of the convex hull of all pedestrians at f_t. As a result, the Voronoi Diagram alone may not provide accurate results when determining adjacent cells, which is a crucial factor in identifying N_i(f_t). To overcome this limitation, the Convex Hull [55] is utilized to bound the unbounded Voronoi cells, as shown in Fig. 4b. The Convex Hull is the minimum convex shape that encompasses a given set of points, forming a polygon that connects the outermost points of the set while ensuring that all internal angles are less than 180° [56]. For this purpose, the intersection of each V_i(f_t) with the Convex Hull of all pedestrians at f_t is calculated, and the cells are updated based on these intersections to obtain the bounded Voronoi Diagram (Algorithm 2, lines 5-12). In more detail, the Convex Hull of all pedestrians at f_t is computed first (Algorithm 2, line 8); after that, the intersection between each V_i(f_t) and the Convex Hull is calculated, yielding the bounded cells.
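A minimal sketch of this direct neighbor identification, using SciPy's Voronoi Diagram and Convex Hull together with Shapely for the cell-hull intersection, might look as follows. Deriving adjacency from shared Voronoi ridges is assumed here to match the edge/vertex-sharing criterion above, and the clipping of unbounded cells is only outlined.

```python
import numpy as np
from scipy.spatial import ConvexHull, Voronoi
from shapely.geometry import Polygon

def direct_neighbors(points):
    """N_i for every site i, derived from Voronoi adjacency.

    points: (n, 2) positions at one f_t, dummy points included. Two sites
    are treated as direct neighbors when their Voronoi cells share a ridge.
    """
    vor = Voronoi(points)
    neighbors = {i: set() for i in range(len(points))}
    for p, q in vor.ridge_points:  # each ridge separates exactly two sites
        neighbors[p].add(q)
        neighbors[q].add(p)
    return neighbors

def bounded_cell(vor, i, hull_poly):
    """Clip the Voronoi cell of site i by the convex hull polygon."""
    region = vor.regions[vor.point_region[i]]
    if -1 in region or not region:
        return None  # unbounded cell; full clipping is omitted in this sketch
    return Polygon(vor.vertices[region]).intersection(hull_poly)

# Usage sketch: hull_poly = Polygon(points[ConvexHull(points).vertices])
```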

Algorithm 1 Pseudo code for generating dummy points.

Inputs:
tr: a file of pedestrian trajectory data over frames, where each record represents [person Id, frame order, x-coordinate, y-coordinate].
fps: the frame rate of the input video, measured in frames per second.
r: the dimension of each square region.

Outputs:
tr_dummy: a file of pedestrian trajectory data (over seconds) with dummy points.

The last step, local region extraction, extracts the local region of each i at f_t, where i is not a dummy point. The step first finds L_i(f_t) based on each ⟨x, y⟩_j with j ∈ N_i(f_t) (Fig. 3c). Then, L_i(f_t) is cropped from the corresponding f_t and passed to the next module, which is discussed in the next section. Fig. 2b displays examples of cropped local regions.

Regarding the MBConv blocks inside the EfficientNetV1B0 backbone used by the deep feature extraction module: the main difference between MBConv6 and MBConv1 is the depth of the block and the number of operations performed in each block, with MBConv6 performing six times as many as MBConv1. Note that MBConv6, 5×5 performs the same operations as MBConv6, 3×3, but applies a 5×5 kernel instead of a 3×3 kernel.
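Returning to the cropping in the last step: a rough sketch of how a polygonal local region might be cut out of a frame is shown below, assuming pixel coordinates and using the convex hull of the neighbors' positions as the polygon. The paper's exact cropping routine is not specified, so this is illustrative.

```python
import cv2
import numpy as np

def crop_local_region(frame, neighbor_positions):
    """Crop L_i(f_t), the polygon spanned by the positions of N_i(f_t).

    neighbor_positions: (m, 2) pixel coordinates of the direct neighbors.
    Returns the masked axis-aligned crop of the polygon.
    """
    # Convex hull of the neighbor positions as a proxy for the closed polygon.
    poly = cv2.convexHull(np.asarray(neighbor_positions, dtype=np.int32))
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    cv2.fillPoly(mask, [poly], 255)             # white inside the polygon
    region = cv2.bitwise_and(frame, frame, mask=mask)
    x, y, w, h = cv2.boundingRect(poly)         # tight bounding box
    return region[y:y + h, x:x + w]
```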

Labeling Component
The objective of the labeling component is to analyze the feature maps obtained from the previous component and identify the pushing individuals in the input video. This is accomplished through a binary classification task followed by an annotation process. To carry out the classification task, as shown in Fig. 1, a 1 × 1 convolution operation, global average pooling 2D, a fully connected layer, and a Sigmoid activation function are combined. The 1 × 1 convolution is used to increase the number of channels in the feature maps, yielding more information; the new dimension of the feature maps for each L_i(f_t) is 7 × 7 × 1280. After that, global average pooling 2D transforms the feature maps to 1 × 1 × 1280 and feeds them to the fully connected layer. Then, the fully connected layer with a Sigmoid activation function produces the probability δ of the pushing label for the corresponding i at f_t. Finally, the classifier uses a threshold τ to identify the class of i at f_t, as in Eq. (2):

label(i, f_t) = pushing if δ ≥ τ, non-pushing otherwise. (2)

Algorithm 2 Pseudo code of the direct neighbor identification step.
Inputs: tr_dummy: a file of pedestrian trajectory data (over seconds) with dummy points.
Outputs: direct_neighbor: a file of direct neighbors for pedestrians over seconds.

By default, the threshold value for binary classification is set to 0.5, which works well for datasets with a balanced distribution. Unfortunately, the new pushing dataset created in Section 4.1 for training and evaluating the proposed framework is imbalanced, and using the default threshold may lead to poor performance of the trained classifier on that dataset [62]. Therefore, the threshold of the trained classifier must be adjusted to obtain better accuracy for both the pushing and non-pushing classes; the methodology for finding the optimal threshold is explained in detail in Section 4.3. After training and threshold adjustment, the classifier can categorize individuals i as pushing or non-pushing. At the same time, the annotation process draws a red circle around the head of each pushing person in the corresponding frames f_t and finally generates an annotated video.
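A minimal Keras sketch of this classification head follows. The incoming feature-map shape of 7 × 7 × 320 (EfficientNetV1B0's final MBConv output for 224 × 224 crops) and the swish activation on the 1 × 1 convolution are assumptions; only the 7 × 7 × 1280 expansion, the pooling, and the Sigmoid output are taken from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_labeling_head(input_shape=(7, 7, 320)):
    """Classification head: 1x1 conv -> GAP -> Dense with Sigmoid."""
    inputs = layers.Input(shape=input_shape)
    # 1x1 convolution expands the channels to 1280 (activation assumed).
    x = layers.Conv2D(1280, kernel_size=1, activation="swish")(inputs)
    x = layers.GlobalAveragePooling2D()(x)            # 1x1x1280
    delta = layers.Dense(1, activation="sigmoid")(x)  # pushing probability
    return models.Model(inputs, delta)

# Decision rule of Eq. (2); the tuned threshold reported later is 0.038.
def classify(delta, threshold=0.5):
    return "pushing" if delta >= threshold else "non-pushing"
```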
The following section discusses the training and evaluation processes of the proposed framework.

Training and Evaluating the Framework
This section introduces a novel labeled dataset and presents the parameter setup for the training process, the evaluation metrics, and the methodology for improving the framework's performance on an imbalanced dataset.

A Novel Dataset Preparation
This section aims to create the labeled dataset for training and evaluating the proposed framework.
The dataset consists of a training set and a validation set for the learning process, and two test sets for the evaluation process. These sets comprise L_i(f_t) samples labeled as either pushing or non-pushing.
In this context, each pushing L_i(f_t) means that i engages in pushing at f_t, while each non-pushing L_i(f_t) indicates that i at f_t follows the social norm of queuing. The following discusses the data sources and methodology used to prepare the sets. The dataset preparation is based on three data sources: 1) six videos of real-world experiments at crowded event entrances, 2) pedestrian trajectory data, and 3) ground truths for pushing behavior. Six video recordings of experiments, with their corresponding pedestrian trajectory data, were selected from the data archive hosted by Forschungszentrum Jülich [35,61]; this data is licensed under the CC Attribution 4.0 International license. The experimental setups mimic crowded event entrances, and static top-view cameras recorded the experiments at a frame rate of 25 frames per second. For clarity, Fig. 6 shows overhead views of exemplary experiments, and Table 1 summarizes the characteristics of the chosen experiments. Additionally, ground truth labels constructed with the manual rating system [15] are used as the third data source. In this system, social psychologists observe and analyze the video experiments frame-by-frame to manually identify individuals who are pushing over time. The experts use the PeTrack software [63] to manage the manual tracking process and generate the annotations as a text file. For further details on the manual system, readers can refer to Ref. [15].
Here, the methodology used for preparing the dataset is described. As shown in Fig. 7, it consists of two phases: local region extraction, and local region labeling and set generation. The first phase aims to extract local regions (samples) from the videos while avoiding duplicates. To accomplish this, the phase initially extracts frames from the input videos second by second; after that, it employs the Voronoi-based local region extraction module to identify and crop the samples from the extracted frames. Table 2 reports the number of extracted samples from each video, and Fig. 2b shows several examples of local regions. Preventing duplicate samples across the training, validation, and test sets is crucial for a reliable evaluation of the model. Therefore, this phase removes identical and slightly different samples before proceeding to the next phase. It uses a pre-trained MobileNet CNN model to extract deep features/embeddings from the samples, and cosine similarity to find duplicate or near-duplicate samples based on these features [64]. This technique is more robust than comparing pixel values, which can be sensitive to noise and lighting variations [65]. Table 2 reports the number of removed duplicate samples.
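A minimal sketch of this embedding-based deduplication is shown below. The 224 × 224 input size and the similarity threshold are assumptions, as the paper does not report these values.

```python
import numpy as np
import tensorflow as tf

def near_duplicate_pairs(crops, threshold=0.97):
    """Flag (near-)duplicate local-region samples.

    crops: RGB crops resized to 224x224, shape (n, 224, 224, 3).
    Embeddings come from an ImageNet-pretrained MobileNet; pairs whose
    cosine similarity exceeds the threshold are flagged.
    """
    backbone = tf.keras.applications.MobileNet(
        include_top=False, pooling="avg", weights="imagenet")
    x = tf.keras.applications.mobilenet.preprocess_input(
        np.asarray(crops, dtype=np.float32))
    emb = backbone.predict(x, verbose=0)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize
    sim = emb @ emb.T                                   # cosine similarities
    idx = np.argwhere(np.triu(sim, k=1) >= threshold)   # upper triangle only
    return [tuple(p) for p in idx]
```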
The local region labeling and set generation phase, in turn, is responsible for labeling the extracted samples and producing the sets: one training set, one validation set, and two test sets. This phase utilizes the ground truth label of each i at f_t to label the samples L_i(f_t). Another test set (test set 2) is developed from the labeled samples extracted from the complete video of experiment 50. Table 2 shows a summary of the generated sets.
To summarize, four labeled sets were created: the training set, with 2160 pushing and 6112 non-pushing samples; the validation set, with 466 pushing and 1254 non-pushing samples; test set 1, with 441 pushing and 1284 non-pushing samples; and test set 2, with 317 pushing and 344 non-pushing samples.

Parameter Setup
Table 3 shows the parameters used during the training process. They were chosen experimentally to obtain optimal performance on the new dataset. To prevent overfitting, training was halted if the validation accuracy did not improve for 20 epochs.
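In Keras, this stopping criterion might be expressed as follows; restoring the best weights is an assumption, and the remaining hyperparameters are those of Table 3.

```python
import tensorflow as tf

# Halt training when validation accuracy has not improved for 20 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=20, restore_best_weights=True)

# Usage sketch (epoch budget is illustrative):
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stop])
```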

Evaluation Metrics and Performance Improvement
This section discusses the metrics chosen for evaluating the performance of the proposed framework. Additionally, it explores the methodology employed to enhance the performance of the classifier trained on the imbalanced dataset, thereby improving the overall effectiveness of the framework.
Given the imbalanced distribution of the generated local region dataset, the framework exhibits a bias towards the majority class (non-pushing). Consequently, it is crucial to employ appropriate metrics for evaluating the performance of the imbalanced classifier. Therefore, a combination of metrics was adopted, including macro accuracy, True Pushing Rate (TPR), True Non-Pushing Rate (TNPR), and the Area Under the receiver operating characteristic Curve (AUC), on both test set 1 and test set 2. The following provides a detailed explanation of these metrics.
TPR, also known as sensitivity, is the ratio of correctly classified pushing samples to all pushing samples:

TPR = TP / (TP + FNP), (3)

where TP and FNP denote correctly classified pushing persons and incorrectly predicted non-pushing persons, respectively. TNPR, also known as specificity, is the ratio of correctly classified non-pushing samples to all non-pushing samples:

TNPR = TNP / (TNP + FP), (4)

where TNP and FP stand for correctly classified non-pushing persons and incorrectly predicted pushing persons, respectively. Macro accuracy, or balanced accuracy, is the average proportion of correct predictions for each class individually. This metric gives each class equal significance, irrespective of its size or distribution within the dataset; it is simply the average of TPR and TNPR:

Macro accuracy = (TPR + TNPR) / 2. (5)

AUC is a metric that represents the area under the Receiver Operating Characteristics (ROC) curve. The ROC curve illustrates the performance of a classification model across various threshold values, plotting the false positive rate (FPR) on the horizontal axis against the true positive rate (TPR) on the vertical axis. AUC values range from 0 to 1, where a perfect model achieves an AUC of 1, while a value of 0.5 indicates that the model performs no better than random guessing [66]. Fig. 8a shows an example of a ROC curve with its AUC value.

As mentioned above, the binary classifier employs a threshold to convert the calculated probability into a predicted class: the pushing class is predicted if the probability exceeds the threshold; otherwise, the non-pushing label is predicted. The default threshold is typically set at 0.5. However, this value leads to poor performance of the introduced framework because the EfficientNetV1B0 and the classification head were trained on an imbalanced dataset [62]; in other words, the default threshold yields a high TNPR and a low TPR. To address this issue and enhance the framework's performance, it is necessary to determine an optimal threshold that achieves a better balance between TPR and FPR (1 − TNPR). To accomplish this, the ROC curve over the validation set is used to identify the threshold value that maximizes TPR and minimizes FPR. First, TPR and TNPR are calculated for several thresholds ranging from 0 to 1; then, the threshold that yields the minimum value of the objective function in Eq. (6), which jointly rewards a high TPR and penalizes a high FPR, is considered the optimal threshold. As shown in Fig. 8a, the red point marks the optimal threshold of the classifier used in the proposed framework, which is 0.038.
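A sketch of these metrics and the threshold search is given below. The specific objective used (distance to the ideal corner of the ROC plane) is a common choice assumed here and may differ from the paper's Eq. (6).

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def evaluate(y_true, y_prob):
    """TPR, TNPR, macro accuracy, AUC, and an ROC-based threshold choice.

    y_true: 1 for pushing, 0 for non-pushing; y_prob: predicted delta.
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    # Threshold closest to the ideal point (FPR = 0, TPR = 1); assumed
    # criterion, jointly maximizing TPR and minimizing FPR.
    best = np.argmin(np.sqrt((1 - tpr) ** 2 + fpr ** 2))
    tau = thresholds[best]
    y_true = np.asarray(y_true)
    pred = (np.asarray(y_prob) >= tau).astype(int)
    tpr_at_tau = (pred[y_true == 1] == 1).mean()    # Eq. (3), sensitivity
    tnpr_at_tau = (pred[y_true == 0] == 0).mean()   # Eq. (4), specificity
    macro_acc = (tpr_at_tau + tnpr_at_tau) / 2      # Eq. (5)
    return tau, tpr_at_tau, tnpr_at_tau, macro_acc, roc_auc_score(y_true, y_prob)
```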

Evaluation and Results
Here, several experiments were conducted to evaluate the performance of the proposed framework. Initially, the performance of the framework itself is assessed; subsequently, it is compared with five other CNN-based frameworks. The influence of the deep feature extraction module on the framework's performance is also investigated, and finally, the impact of the local region extraction module is explored. All experiments and implementations were performed on Google Colaboratory Pro, using the Python 3 programming language with the Keras, TensorFlow 2.0, and OpenCV libraries. In Google Colaboratory Pro, the hardware setup comprises an NVIDIA GPU with 15 GB of memory and 12.7 GB of system RAM. Moreover, the framework and all the baselines developed for comparison were trained using the same sets (Table 2) and hyperparameter values (Table 3).

Performance of the Proposed Framework
The performance of the proposed framework was evaluated on the generated dataset (Table 2) using various metrics, including macro accuracy, TPR, TNPR, and AUC. We first trained the proposed framework's EfficientNetV1B0-based deep feature extraction module and labeling component on the training and validation sets; subsequently, the framework's performance on test set 1 and test set 2 was assessed.
Table 4 shows that the introduced framework, with the default threshold, obtained a macro accuracy of 83 %, a TPR of 74 %, and a TNPR of 92 % on test set 1, and 82 % macro accuracy, 88 % TNPR, and 76 % TPR on test set 2. It is clear that the TPR is significantly lower than the TNPR on both test sets; see Fig. 9a and c. To balance TPR and TNPR and improve the TPR, the optimal threshold of 0.038 is applied, as shown in Fig. 8a. This threshold increases TPR by 12 % and 7 % on test set 1 and test set 2, respectively, without harming the accuracy (see Fig. 9b and d); in fact, the framework's accuracy improved by 2 % on test set 1. The ROC curves with AUC values for the framework on the two test sets are shown in Fig. 8b, with AUC values of 0.92 on test set 1 and 0.9 on test set 2.
To summarize, with the optimal threshold, the proposed framework achieved an accuracy of 85 %, TPR of 86 %, and TNPR of 84 % on test set 1, while obtaining 82 % accuracy, 81 % TPR, and 83 % TNPR on test set 2. The next section will compare the framework's performance with five baseline systems for further evaluation.

Comparison with Baseline CNN-based Frameworks
In this section, further empirical comparisons are presented to evaluate the framework's performance against five baseline systems. Specifically, they explore the impact of the EfficientNetV1B0-based deep feature extraction module on the overall performance of the framework. To achieve this, EfficientNetV1B0 in the deep feature extraction module was replaced with other CNN architectures: EfficientNetV2B0 [67] (baseline 1), Xception [68] (baseline 2), DenseNet121 [69] (baseline 3), ResNet50 [70] (baseline 4), and MobileNet [71] (baseline 5). To ensure fair comparisons, the five baselines were trained and evaluated using the same sets, hyperparameters, and metrics as the proposed framework.

Before comparing the results, it is essential to note that CNN models renowned for their performance on some datasets may perform poorly on others [72]. This discrepancy becomes more apparent when datasets differ in size, clarity of relevant features among classes, or overall data quality. Powerful models can be prone to overfitting, while simpler models may struggle to capture relevant features in complex datasets with intricate patterns and relationships. Therefore, it is crucial to carefully select or develop an appropriate CNN architecture for a specific problem. For instance, EfficientNetV2B0 demonstrates superior performance compared to EfficientNetV1B0 across various classification tasks [67], including on the ImageNet dataset. Moreover, it surpasses the previous version in identifying regions that exhibit pushing persons in motion information maps of crowds [16,17]. These outcomes can be attributed to the efficient blocks employed for feature extraction, namely the Mobile Inverted Residual Bottleneck Convolution [57] and the Fused Mobile Inverted Residual Bottleneck Convolution [73]. Nevertheless, the presence of these efficient blocks does not guarantee the best performance in identifying pushing individuals based on local regions. Hence, the impact of six of the most popular and efficient CNN architectures on the performance of the proposed framework was studied empirically: EfficientNetV1B0 was used within the framework, while the remaining architectures were employed in the baselines.

The performance results of the proposed framework and the baselines are presented in Table 5 and visualized in Fig. 10. The findings indicate that EfficientNetV1B0 with the optimal threshold leads the framework to superior macro accuracy and AUC, with balanced TPR and TNPR, compared to the CNNs used in baselines 1-5. This can be attributed to the architecture of EfficientNetV1B0, which relies primarily on the Mobile Inverted Residual Bottleneck Convolution with relatively few parameters; this design proves particularly suited to the generated local-region dataset. Fig. 11 shows the optimal threshold values for the baselines. These thresholds, as shown in Table 5 and Fig. 10, mostly improved the macro accuracy and TPR and balanced TPR and TNPR in the baselines. For example, baseline 1 with its optimal threshold achieved 84 % macro accuracy, roughly similar to the proposed framework. However, it fell short of achieving a balanced TPR and TNPR along with an improved TPR on both test sets as effectively as the framework. To provide further clarity, baseline 1 achieved an 80 % TPR with an 8 % difference between TPR and TNPR, whereas the proposed framework attained an 86 % TPR with a 2 % difference on test set 1. Similarly, on test set 2, the framework achieved an 81 % TPR, while baseline 1 achieved 74 %.
Compared to the other baselines with optimal thresholds on test set 1, the proposed framework outperformed them in macro accuracy, TPR, and TNPR. Similarly, on test set 2, the framework surpasses all baselines except the ResNet50-based baseline (baseline 4). However, that baseline only achieved a better TNPR, whereas the introduced framework excels in macro accuracy and TPR; as a result, the framework emerges as the superior choice on test set 2 as well. For further clarity, Fig. 12 shows the ROC curves with AUC values for the framework and its baselines on test set 1, and Fig. 13 depicts the same for test set 2. The AUC values show that the proposed framework achieved better performance than the baselines on both test sets. Moreover, they substantiate that EfficientNetV1B0 is the most suitable CNN for extracting deep features from the generated local region samples. In conclusion, the experiments demonstrate that the proposed framework, utilizing EfficientNetV1B0, achieved the highest performance compared to baselines relying on other CNN architectures on both test sets. Furthermore, the optimal thresholds in the developed framework and the baselines led to a significant performance improvement across both test sets.
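To make the comparison concrete, the following sketch shows how such backbone variants might be constructed in Keras, assuming 224 × 224 RGB local-region crops, training from scratch (weights=None), and a recent TensorFlow in which all of these application models are available.

```python
import tensorflow as tf

# Mapping from variant name to backbone constructor; names are illustrative.
BACKBONES = {
    "framework": tf.keras.applications.EfficientNetB0,    # EfficientNetV1B0
    "baseline1": tf.keras.applications.EfficientNetV2B0,
    "baseline2": tf.keras.applications.Xception,
    "baseline3": tf.keras.applications.DenseNet121,
    "baseline4": tf.keras.applications.ResNet50,
    "baseline5": tf.keras.applications.MobileNet,
}

def build_variant(name):
    """Backbone + the Sigmoid labeling head, trained from scratch."""
    base = BACKBONES[name](include_top=False, weights=None,
                           input_shape=(224, 224, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    delta = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(base.input, delta)
```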

Impact of Deep Feature Extraction Module
This section investigates how the deep feature extraction module affects the framework's performance. For this purpose, a new baseline (baseline 6) was developed, incorporating only the Voronoi-based local region extraction module and the labeling component; in other words, the deep feature extraction module is removed from the proposed framework to construct this baseline. Table 6 demonstrates that this baseline exhibited poor performance, with a macro accuracy of 67 % on test set 1 and 59 % on test set 2. Additionally, Fig. 12 and Fig. 13 show AUC values of 72 % on test set 1 and 61 % on test set 2 for baseline 6. Comparing this baseline with the weakest baseline in Table 5, which utilizes ResNet50, it is evident that deep feature extraction leads to a macro accuracy improvement of at least 8 % on test set 1 and at least 20 % on test set 2; similarly, it enhances the AUC values by at least 11 % on test set 1 and more than 24 % on test set 2. In summary, the deep feature extraction module significantly enhances the performance of the framework.

Impact of Local Region Extraction
The primary goal of this section is to evaluate the impact of the Voronoi-based local region extraction module on the performance of the proposed framework. To accomplish this, baseline 7 was created, which replaces this module with one that relies on static dimensions to extract a square local region for each individual. In this new module, the target person's position serves as the center of the extracted area, and each square region's side corresponds to roughly 60 cm on the ground. This dimension is sufficient for the region to contain the target person and his/her surrounding space; a minimal sketch of this static square crop follows.
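A minimal sketch of such a static square crop, assuming the pixel-per-centimeter scale of the top-view camera is known so that roughly 60 cm can be converted to a pixel side length:

```python
import numpy as np

def square_region(frame, center_xy, side_px):
    """Static square local region centered on the target person.

    side_px: the square's side in pixels, corresponding to roughly 60 cm
    on the ground (pixel/cm scale assumed known from camera calibration).
    """
    x, y = (int(round(c)) for c in center_xy)
    h = side_px // 2
    y0, y1 = max(0, y - h), min(frame.shape[0], y + h)  # clamp to frame
    x0, x1 = max(0, x - h), min(frame.shape[1], x + h)
    return frame[y0:y1, x0:x1]
```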
Fig. 15b shows an example of a square local region for a target person i. A new dataset was then generated using the same video experiments and the same splitting technique used in preparing the local region dataset (Table 2) to train and evaluate baseline 7; the main difference is that the samples in this new dataset are static square regions (Fig. 15b) instead of dynamic polygonal regions (Fig. 15a). According to Table 7, baseline 7 achieved a macro accuracy of 79 % on test set 1 and 62 % on test set 2. This indicates that the Voronoi-based method yields a 6 % improvement in accuracy on test set 1 and a significant 20 % improvement on test set 2. Additionally, Fig. 14 demonstrates that the module enhanced the AUC value by 11 % on test set 1 and 13 % on test set 2.
In summary, the Voronoi-based local region extraction module enhanced the accuracy of the proposed framework by a minimum of 6 %, underscoring its contribution to the overall performance.

The proposed framework has some limitations. First, it was designed to work exclusively with top-view camera video recordings that include trajectory data. Second, it was trained and evaluated on a limited number of real-world experiments, which may limit its generalizability to a broader range of scenarios. Our future goals include improving the framework in two key areas: 1) enabling it to detect pushing persons from video recordings without requiring trajectory data as input, and 2) improving its performance in terms of macro accuracy, true pushing rate, and true non-pushing rate by utilizing video recordings of additional real-world experiments and transfer learning techniques.
Acknowledgments
The authors are grateful for the valuable discussions and the manual annotation of the pushing behavior in the videos of the experiments.

Declarations
Conflict of interest The authors declare that there is no conflict of interest regarding the publication of this article.

Ethical approval
The experiments used in the dataset were conducted according to the guidelines of the Declaration of Helsinki and approved by the ethics board at the University of Wuppertal, Germany. Informed consent was obtained from all subjects involved in the experiments.

Fig. 2 An illustration of direct neighbors (a) and examples of local regions (b). The red circles represent individuals engaged in pushing, while the green circles represent individuals not involved in pushing. Direct neighbors j of a person i are indicated with blue circles.

Fig. 3 An illustration of the effect of dummy points on creating the local regions, as well as a sketch of the dummy point generation technique. a) L_37 and L_3 without dummy points. b) A sketch of the dummy point generation technique. c) L_37 and L_3 with dummy points. The white polygon represents the border of the local regions. Yellow small circles refer to the generated dummy points, while black points in b denote the positions of pedestrians. r is the dimension of each square.

Fig. 6 Overhead views of exemplary experiments. a) Experiment 270; Experiments 50, 110, 150, and 280 used the same setup but with entrance-area widths ranging from 1.2 to 5.6 m depending on the experiment [35]. b) Experiment entrance 2 [61]. The entrance gate's width is 0.5 m in all setups.

Fig. 7 Pipeline of dataset preparation. In the part 'Local Region Labeling and Set Generation', red refers to the pushing class and pushing samples, while the non-pushing class and non-pushing samples are represented in green.

Fig. 8 ROC curves for the introduced framework. a) ROC curve with the optimal threshold on the validation set. b) ROC curves with AUC values on test set 1 and test set 2. TPR stands for true pushing rate, while FPR refers to false pushing rate.

Fig. 10 Comparison of the framework (based on EfficientNetV1B0) with the baseline frameworks based on other popular CNN architectures.

Fig. 11 ROC curves with optimal thresholds for the baselines over the validation set. TPR stands for true pushing rate, while FPR refers to false pushing rate. ROC stands for Receiver Operating Characteristics.

Fig. 12 ROC curves with AUC values on test set 1. Comparison of the introduced framework (based on EfficientNetV1B0) with five baselines based on different CNN architectures, as well as the baseline without the deep feature extraction module (baseline 6). TPR stands for true pushing rate, while FPR refers to false pushing rate. ROC represents Receiver Operating Characteristics. AUC stands for the area under the ROC curve.

Fig. 13 ROC curves with AUC values on test set 2. Comparison of the framework (based on EfficientNetV1B0) with five baselines based on different CNN architectures, as well as the baseline without the deep feature extraction module (baseline 6). TPR stands for true pushing rate, while FPR refers to false pushing rate. ROC represents Receiver Operating Characteristics. AUC stands for the area under the ROC curve.

Fig. 15 a) An example of a polygonal local region based on the bounded Voronoi Diagram. b) An example of a square local region based on static dimensions. i stands for the target person.

Table 1
Characteristics of the chosen experiments. The same names as reported in [35,61] are used. m stands for meter, and s refers to second.

Table 2
Summary of the prepared sets.

Table 3
The hyperparameter values used in the training process.

Table 4
Performance of the proposed framework on both test sets.

Table 5
Comparative analysis of the developed framework and the five CNN-based frameworks.

Table 6
Performance results of the baseline 6.

Table 7
Comparison to baseline 7.