
1 Introduction

Video surveillance cameras have become ubiquitous in our cities. However, their usefulness for preventing crimes is often questioned, due to the lack of adequately trained personnel to monitor the large number of videos captured simultaneously, and to the loss of attention surveillance operators suffer after a few tens of minutes of inspecting such videos [1].

This has attracted great attention from the computer vision community, with the aim of developing techniques to automatically detect abnormal behaviors in videos, which may help preserve safety and possibly prevent crimes.

Although the proposed methods have achieved significant results, they are still far from being applicable to real-world scenarios. In particular, the biggest challenge lies in the definition of abnormality, as it is strongly context dependent. In most cases, violence and panic are considered abnormal behaviors; however, even people running or walking in some areas of a scene may be considered an abnormal event in particular situations. In video surveillance scenarios, the most well-known approach to detecting abnormalities is to codify pedestrians' behaviors by means of sociological models, the most notable example being the social force model (SFM) [2], which was successfully employed for abnormality (mostly panic) detection in crowd scenes [3]. Specifically, the SFM describes local crowd interactions using Newtonian mechanics.

Although many variants of the SFM have been proposed in the social psychology literature [4–6], the central tenet of all such models is the ability to describe different crowd scenarios (e.g., cross walk, panic, and evacuation) by calibrating a set of physical forces on empirical observations [7]. Despite the interesting performance of SFM-based models [2], recent social psychology studies argued that they are too simplified [7, 8] to capture complex crowd behaviors, besides being heavily affected by poor generalization power, meaning that a model calibrated on one set of empirical observations may often fail to deal with a different set of observations.

To address these limitations, recent works exploit a set of simple, yet effective, behavioral heuristics to describe the complex individual behaviors observed in crowded scenarios, while using physics-based equations to quantify such rules on crowd videos [7–9]. Unlike SFM-based models, which aim at describing complex crowd movements by calibrating a set of forces on empirical observations, this class of approaches defines a set of behavioral heuristics formulated using concepts such as velocity and acceleration borrowed from Newtonian mechanics [7]. The effectiveness of such heuristics for modeling complex human (re)actions and decision-making has been well noted in the psychology literature [10–12]; they share the common characteristic of being fast and frugal [13]: fast because of their low computational complexity, and frugal because they rely on only a few pieces of information [13]. Readers may refer to [7, 8] for a full treatment of the above methods from psychological and sociological perspectives.

Fig. 1. Overview of the proposed framework, from behavioral heuristic rules to the Visual Information Processing Signature (VIPS) descriptor.

In this work, taking inspiration from the socio-psychological studies mentioned above, we propose to employ cognitive heuristics together with physical equations for detecting violence in video sequences. To the best of our knowledge, this is the first attempt in computer vision to investigate the use of heuristic rules for violence detection in crowd scenarios. More specifically, (I) we extended the heuristics of the cognitive model proposed in [7, 8] to model violence in crowds; (II) we formalized the heuristics with mathematical equations; (III) we showed how to efficiently approximate and extract them from a video sequence; and (IV) we use the estimated heuristic maps to form a video descriptor, called the Visual Information Processing Signature (VIPS), which strongly outperforms the social force model, a ConvNet baseline, and other state-of-the-art descriptors on the violence classification task.

Figure 1 depicts an overview of our framework. First, we define three behavioral heuristic rules based on social-psychology studies [7, 8]. Then, we compute motion information from pairs of successive frames, together with particle advection to track particles (so as to capture individual subject motion in crowd scenes as far as possible). This is followed by computing physics-based feature maps from each behavioral heuristic rule. Finally, following the standard bag-of-words paradigm, we sample P patches and encode them over a number of cluster centers; the resulting histograms are concatenated to form the VIPS descriptor and fed into a classifier to detect/quantify violent behaviors.

The rest of the paper is organized as follows. In Sect. 2, we review the state of the art on violence detection using computer vision techniques. Section 3 presents the proposed cognitive model and describes the envisaged heuristic rules. In Sect. 4, we illustrate how to estimate the formulated forces from video sequences; this involves extracting a set of maps from the heuristics, which we further exploit to define the VIPS descriptor for crowd violence detection. In Sect. 5, we evaluate our approach on several benchmark datasets, comparing with prior dominant techniques and descriptors. Finally, Sect. 6 draws conclusions and presents future work.

2 Related Works

The first work on detecting violence in videos was proposed in [14]. This approach focused on two-person fight episodes and employed motion trajectory information of individual limbs for fight classification. It required limb segmentation and tracking, which are very challenging tasks in the presence of occlusions and clutter, especially in crowded situations.

More recent methods [15–19] mainly differ in the feature descriptor used, the sampling strategy, and the classifier adopted. For example, [15] used the Spatial Temporal Interest Point (STIP) detector and descriptor along with linear Support Vector Machines (SVMs). Nievas et al. [20] applied STIP, Histogram of Oriented Gradients (HOG), and Motion SIFT (MoSIFT) descriptors along with a Histogram Intersection Kernel SVM [20] for violence detection. Other approaches derived local motion patterns from optical flow. For instance, Solmaz et al. [21] analyzed motion flows (derived from optical flow) to identify a particular set of simple crowd behaviors (e.g., bottlenecks and lanes). The statistics of flow-vector magnitudes changing over time are exploited in [18] to represent motion patterns for the task of violence detection. The social force model [3] and its variations [22–24] represented motion patterns using physics concepts such as attractive and repulsive forces, motion equations, and interaction energy. The success of this class of methods, however, is heavily dependent on the video quality and the density of people involved, and they may not be capable of capturing a wide range of complex crowd behaviors.

3 Formulation of Heuristic Rules

In this section, we first define a set of heuristic rules, inspired by socio-psychological studies [7–9], describing how individuals behave in violent crowds. Then, we explain how to formulate these rules using physics equations and basic visual information extracted from the observed scenes.

Our proposed framework consists of the following heuristic rules:

H1: An individual chooses the direction that allows the most direct path to a destination point, adapting his/her movement to the presence of obstacles.

H2: In crowd situations, the movement of an individual is influenced by his/her physical body contacts with surrounding persons.

H3: In violent scenes, an individual mainly moves towards his/her opponents to display violent actions.

The first heuristic rule (H1) is inherited from the socio-psychological literature [7] and encompasses an individual's internal motivation towards a goal while avoiding obstacles or other individuals. The second heuristic rule (H2), on the other hand, states that an individual's movement in a crowd is governed not only by his/her internal motivation but also by unintentional physical body interactions with surrounding individuals. This is especially true in overcrowded situations, where crowd dynamics are unstable and body contacts frequently occur. The third heuristic rule (H3) defines behavioral patterns within violent scenes, where two or more parties (e.g., police and rioters) fight and display violent behaviors towards each other.

We formulate the above heuristic rules using visual information about individuals, such as their spatial coordinates and velocity flows, following [7–9]. For each individual i, we consider its position \((x_i,y_i)\) in the 2D image plane and its velocity \(\varvec{v}_{i}\). The scalar \(d_{ij}\) denotes the distance between i and j, and \(\varvec{n}_{ji}\) is a normalized unit vector pointing from the coordinates of j to those of i. The visual motion information of i with respect to j is captured by the angle between the velocity vectors \(\varvec{v}_{i}\) and \(\varvec{v}_{j}\), which we denote \(\phi _{ij}\). Based on these visual cues, the heuristic rules are formulated as follows.

Heuristic rule H1: In normal situations, individual i chooses the most direct path towards a destination with a desired velocity \(\varvec{v}_{i}^{\;des}\). It is common, however, for individual i to deviate from the desired velocity \(\varvec{v}_{i}^{\;des}\) to an actual velocity \(\varvec{v}_{i}(t)\) due to an unexpected obstacle at time t [2]. This heuristic can be formulated as:

$$\begin{aligned} \frac{d\varvec{v}_{i}}{dt} = \frac{\varvec{v}_{i}^{\;des}-\varvec{v}_{i}(t)}{\tau } \end{aligned}$$
(1)

where \(\tau \) is the time individual i requires to adjust its velocity when facing an unexpected obstacle. If the velocity is constant over time, \( \frac{d\varvec{v}_{i}}{dt} = 0\), meaning the individual is approaching his/her target destination without facing any obstacle. Otherwise, the presence of an obstacle implies a change in the individual's velocity.

Heuristic rule H2: Heuristic H1 is, however, only valid in sparse crowd scenarios (e.g., walking in a street), where individuals have enough time and space to keep a safe distance from other pedestrians and to change their desired velocity in response to unexpected obstacles. This is not the case in crowded situations (e.g., riots), where individuals do not have enough time and space to control their movements. Hence, they are subject to unintentional physical body contacts that may strongly affect their movements. Borrowing from [7, 25], the body contact force imposed on i by j is formulated as:

$$\begin{aligned} \varvec{F}_{ij}^{bc} = \varvec{n}_{ji} \cdot \mathbf{g}_i(j) \end{aligned}$$
(2)

where \(\mathbf{g}_i(j)\) is a function that returns zero if i and j are not close enough to have body contact, and otherwise a scalar value inversely proportional to their spatial distance \(d_{ij}\).

Heuristic rule H3: In violent situations, individual j may exhibit an action (verbal, emotional, or physical) towards individual i that triggers i to move towards j in reaction [26]. This is what heuristic H3 aims to model. We name this the aggression force \( \varvec{F}_{ij}^{\;agg}\), defined as:

$$\begin{aligned} \varvec{F}_{ij}^{agg} = \varvec{n}_{ji}\cdot \frac{1}{2}\Big (1- \frac{\varvec{v}_{i}\cdot \varvec{v}_j}{\Vert \varvec{v}_j \Vert \cdot \Vert \varvec{v}_i \Vert }\Big )\cdot \mathbf{f}_i(j) \end{aligned}$$
(3)

where \(\mathbf{f}_i(j)\) returns 1 for each individual j who is in the field of view of individual i, regardless of their distance, and 0 otherwise. The term \(\frac{1}{2}(1- \frac{\varvec{v}_{i}\cdot \varvec{v}_j}{\Vert \varvec{v}_j \Vert \cdot \Vert \varvec{v}_i \Vert })\) is referred to as the aggression factor and measures how much individual i is stimulated to move towards j, based on the angle between the velocity vectors \(\varvec{v}_j\) and \(\varvec{v}_i\). The aggression factor lies in the interval [0, 1]: it is 1 when individuals i and j are moving against each other (the angle between \(\varvec{v}_j\) and \(\varvec{v}_i\) is \(\pi \)), and 0 when they are moving in the same direction (the angle between \(\varvec{v}_j\) and \(\varvec{v}_i\) is 0). \(\varvec{n}_{ji}\) encodes the spatial relation of i and j and gives the aggression force a vector form (with direction and magnitude).
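Before turning to estimation from video, note that Eq. 3 can be evaluated directly whenever positions and velocities of individuals are available. The following is a minimal illustrative sketch in Python (the function name, the array layout, and the `fov` callback standing in for \(\mathbf{f}_i(j)\) are our own assumptions, not part of the method's implementation):

```python
import numpy as np

def aggression_force(pos, vel, i, fov):
    """Direct evaluation of Eq. 3: sum of F_ij^agg over all j in i's view field.

    pos, vel: (N, 2) arrays of positions and (non-zero) velocities.
    fov(i, j): returns 1.0 if j lies in i's field of view, else 0.0.
    """
    F = np.zeros(2)
    for j in range(len(pos)):
        if j == i or fov(i, j) == 0.0:
            continue
        n_ji = (pos[i] - pos[j]) / np.linalg.norm(pos[i] - pos[j])  # unit vector j -> i
        cos_phi = vel[i] @ vel[j] / (np.linalg.norm(vel[i]) * np.linalg.norm(vel[j]))
        F += n_ji * 0.5 * (1.0 - cos_phi)   # aggression factor in [0, 1]
    return F
```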

4 Estimating Heuristic Rules from Videos

In this section, we quantify each heuristic rule on video sequences. This provides a set of maps (one map for each rule) which will be further used to define our video descriptor for violence detection.

Assume that the goal is to quantify the heuristic rules on a gray-level video \(\mathbf {V} = \{I^1, ..., I^T\}\) with T frames of size \(h \times w\). To this end, we need to compute the basic variables in Eqs. 1–3, including each individual's spatial coordinates and velocity. This can be done by detecting and tracking individuals over the video frames, which is, however, very challenging in crowd videos with severe occlusions and clutter. An alternative that avoids individual detection and tracking is particle advection [3], where a grid of particles is placed over each frame and moved according to the video flow field computed from the optical flow (OF) [27]. The velocity vector of each particle i located at \((x_i,y_i)\) in frame t is approximated by averaging OF vectors in its neighborhood using a Gaussian kernel in the spatial and temporal domains, i.e., \(\varvec{o}_i = \big <\varvec{\mathbf {OF}} (x_i,y_i,t)\big >_{avg}\). More details about particle advection can be found in [3]. From now on, we will use the term particle(s) instead of individual(s), and optical flow instead of velocity.
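For concreteness, a minimal Python sketch of this step is given below. It uses OpenCV's Farneback flow as a stand-in for the optical flow method of [27] and applies only the spatial Gaussian average (the temporal averaging is omitted); function names and parameters are illustrative:

```python
import cv2
from scipy.ndimage import gaussian_filter

def particle_velocities(prev_gray, next_gray, sigma=2.0):
    """Approximate particle velocities o_i as Gaussian-averaged optical flow.

    Particles are taken to sit on the pixel grid; the spatial Gaussian
    emulates the neighborhood averaging <OF(x_i, y_i, t)>_avg of [3].
    prev_gray, next_gray: 8-bit single-channel frames.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # (h, w, 2)
    flow[..., 0] = gaussian_filter(flow[..., 0], sigma)  # smooth x-component
    flow[..., 1] = gaussian_filter(flow[..., 1], sigma)  # smooth y-component
    return flow
```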

Estimation of heuristic rule H1. The formulation of heuristic rule H1 estimates the change of a particle's velocity over time, i.e., the particle's acceleration \(\varvec{a}_i\). Borrowing from [3], we estimate Eq. 1 by computing the derivative of the OF vectors with respect to time:

$$\begin{aligned} \frac{d\varvec{v}_{i}}{dt} = \frac{\varvec{v}_i^{\,des} - \varvec{v}_i(t)}{\tau } \simeq \frac{\varvec{o}_i^{\,t+\varDelta t} - \varvec{o}_i^{\,t}}{\varDelta t} \end{aligned}$$
(4)

where the superscript denotes the time (frame index). If we set \(\varDelta t = 1\) (two successive frames), then the particle's acceleration \((\frac{d\varvec{v}_{i}}{dt})\) at frame \(t+1\) can be efficiently estimated by subtracting two successive OF vectors:

$$\begin{aligned} \frac{d\varvec{v}_{i}}{dt} \simeq \varvec{a_i}^{t+1} = \varvec{o}_i^{\,t+1} - \varvec{o}_i^{\,t} \end{aligned}$$
(5)
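In code, with \(\varDelta t = 1\) this is simply a difference of two successive flow fields; a sketch building on `particle_velocities` above:

```python
def acceleration_map(flow_t, flow_t1):
    """Eq. 5: a_i^{t+1} = o_i^{t+1} - o_i^t on the particle grid.

    flow_t, flow_t1: (h, w, 2) flow fields of two successive frame pairs.
    The H1 map used later is the magnitude of this array.
    """
    return flow_t1 - flow_t
```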

Estimation of heuristic rule H2. According to Eq. 2, the formulation of the body contact force involves computing the unit vector \(\varvec{n}_{ji}\) for all particle pairs i and j, which is not computationally efficient (it is quadratic in the number of particles). Observe that a body interaction occurs when individual j moves toward individual i and contacts his/her body at time t. This implies that, in the case of body contact, the moving direction of j toward i (\(\varvec{v}_j\)) is similar to the direction of \(\varvec{n}_{ji}\) (by definition). Furthermore, body contact changes the velocity \(\varvec{v}_j\) of individual j at time t (individual i acts as an obstacle). As a result, \(\varvec{n}_{ji}\) can be effectively estimated by the acceleration (which corresponds to a velocity change) at very low computational cost. Based on the above, we estimate the contact force on particle i caused by its neighboring particles j as:

$$\begin{aligned} \varvec{F}_{i}^{bc} = \frac{\sum _{j} \varvec{a}_j\cdot \mathbf{g}_{i}(j)}{\sum _{j}\mathbf{g}_{i}(j)} \end{aligned}$$
(6)

where \(\varvec{a}_j\) is the acceleration vector of particle j (Eq. 5), and \(\mathbf{g}_i(j)\) is defined as a Gaussian function with bandwidth R, \(\mathbf{g}_i(j) = \frac{1}{\pi R^2} \exp \big (\frac{-d_{ij}^2}{R^2}\big )\), where \(d_{ij}\) is the Euclidean distance between particles i and j. In practice, Eq. 6 can be estimated for all particles simultaneously by convolving a precomputed 2D Gaussian kernel with the acceleration map. The magnitude of the body contact force, which is the map of H2, is referred to as body compression.
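Since a normalized Gaussian filter already divides by the kernel mass, Eq. 6 reduces to smoothing each acceleration component. A minimal sketch follows, treating the bandwidth R as the Gaussian \(\sigma\) for simplicity (strictly, the paper's \(\exp (-d^2/R^2)\) corresponds to \(\sigma = R/\sqrt{2}\)); names are illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def body_compression(accel, R=8.0):
    """Eq. 6: Gaussian-weighted average of neighboring accelerations.

    gaussian_filter uses a unit-mass kernel, so smoothing each component
    realizes sum_j a_j g_i(j) / sum_j g_i(j) for all particles at once.
    Returns the H2 map, i.e., the magnitude of the body contact force.
    """
    fx = gaussian_filter(accel[..., 0], sigma=R)
    fy = gaussian_filter(accel[..., 1], sigma=R)
    return np.hypot(fx, fy)
```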

Estimation of heuristic rule H3. To estimate the accumulated aggression force imposed on particle i by its opponent particles, we reformulate Eq. 3 as:

$$\begin{aligned} \varvec{F}_{i}^{agg} = \sum _j \Big ( \varvec{n}_{ji}\cdot \mathbf {w}_{ij} \cdot \mathbf{f}_{\varvec{o}_i}^{\alpha }(j) \Big ) \end{aligned}$$
(7)

where, using OF, the aggression factor \(\mathbf {w}_{ij}\) is defined as:

$$\begin{aligned} \mathbf {w}_{ij} = \frac{1}{2}\cdot (1- \frac{\varvec{o}_{i}\cdot \varvec{o_j}}{\Vert \varvec{o_i} \Vert \cdot \Vert \varvec{o_j} \Vert }) = \frac{1}{2}\cdot ( 1 - \cos \phi _{ij} ) \end{aligned}$$
(8)

such that \(\phi _{ij}\) is the angle between the optical flows \(\varvec{o}_j\) and \(\varvec{o}_{i}\) of the \(j^{th}\) and \(i^{th}\) particles, respectively.

Fig. 2. (a) The windowing function \(\mathbf{g}(\cdot )\) returns a non-zero value for particles inside the circle and zero for the rest. The windowing function \(\mathbf{f}(\cdot )\) simulates the view field of the green particle and returns non-zero values for the particles in the view field. The particles marked in red are considered opponents approaching the green particle. (b) Top: two approximations of \(\mathbf {w}\), varying the direction of \(\varvec{v}_j\) (bottom x-axis) with respect to the fixed direction of \(\varvec{v}_{i}\) (top x-axis), Eq. 9. Bottom: the binary filters \(\mathbf{f}_ {q}^\alpha \) modeling the particle's view field, for \(Q = 8 \) and \(\alpha = 120 ^ \circ \) (best viewed in color).

Computing the aggression factor \(\mathbf {w}_{ij}\) for particle i requires calculating the cosine between \(\varvec{o}_j\) and \(\varvec{o}_{i}\) for every pair i and j, which is quadratic in the number of particles. To reduce the computation, we therefore propose two approximations of \(\mathbf {w}_{ij}\) over Q quantized bins of OF orientations, \(\theta ^q\), \(q=1,...,Q\), instead of computing it exhaustively. \(\theta ^q_i\) indicates the bin to which the orientation of the OF vector \(\varvec{o}_i\) (with respect to a fixed reference axis) belongs. In the first approximation, denoted \(\tilde{\mathbf {w}}_{i,j}^{[1]}\), we set \(\mathbf {w}_{ij} = 1\) when \(\theta _i^q = -\theta _j^q\) and zero otherwise. This implies that the aggression factor of particle i depends only on neighboring particles approaching it from exactly the opposite quantized direction. In the second approximation, denoted \(\tilde{\mathbf {w}}_{i,j}^{[2]}\), we set \(\mathbf {w}_{ij} = 1\) when the orientations of \(\varvec{o}_i\) and \(\varvec{o}_j\) do not fall in the same quantized bin, \(\theta _i^q \ne \theta _j^q\), and zero otherwise. This approximation, on the other hand, considers any particle approaching particle i from a different orientation (bin) in the aggression factor. The two approximations, \(\tilde{\mathbf {w}}_{i,j}^{[1]}\) and \(\tilde{\mathbf {w}}_{i,j}^{[2]}\), are defined as follows:

$$\begin{aligned} \tilde{\mathbf {w}}_{i,j}^{[1]} = {\left\{ \begin{array}{ll} 1 & \text {if} \;\; \theta _i^q = -\theta _j^q \\ 0 & \text {otherwise} \end{array}\right. } \qquad \qquad \tilde{\mathbf {w}}_{i,j}^{[2]} = {\left\{ \begin{array}{ll} 1 & \text {if} \;\; \theta _i^q \ne \theta _j^q \\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(9)

Figure 2(b, top) illustrates the real values of \(\mathbf {w}_{i,j}\) (Eq. 8) and its approximations \(\tilde{\mathbf {w}}_{i,j}^{[\cdot ]}\) (Eq. 9), where the black arrows indicate the directions of particle i and its neighboring particles j. Note that \(\tilde{\mathbf {w}}_{i,j}^{[1]}\) is 1 only for particles approaching i from the opposite direction (and zero for all other directions), while \(\tilde{\mathbf {w}}_{i,j}^{[2]}\) is 1 for particles approaching i from any direction different from i's direction.

According to heuristic rule H3, the windowing function \(\mathbf{f}_{\varvec{o}_i}^{\alpha }(\cdot )\) should reflect what each particle sees (i.e., the individual's view field). Therefore, we define it as a wedge-shaped window that resembles one's field of view, oriented in the direction of the particle's optical flow \(\varvec{o}_i\). Here we make the reasonable assumption that a pedestrian looks in his/her walking direction, which is especially valid in crowd scenarios. We set the angle of view \(\alpha \) to \(120^\circ \), as in human vision. The definitions of the angle of view \(\alpha \) and the length of the view field L in \(\mathbf{f}_{\varvec{o}_i}^{\alpha }(\cdot )\) are illustrated in Fig. 2(a). In practice, we model the particle's field of view on the image plane using a fixed filter bank of Q binary masks, \(\{ \mathbf{f}_ {q}^\alpha \}_{q=1}^{Q}\), where each filter corresponds to a quantized orientation bin, as illustrated in Fig. 2(b, bottom) for \(Q=8\) and \(\alpha = 120^\circ \). As for H2, computing \(\varvec{n}_{ji}\) is quadratic in the number of particles. According to H3, individual i moves towards individual j, i.e., from the coordinates of i to those of j, which implies that the direction of \(\varvec{v}_{i}\) is opposite to \(\varvec{n}_{ji}\). Therefore, to reduce complexity, we approximate \(\varvec{n}_{ji} \simeq -\varvec{v}_{i}\).

Taking into account all the approximations, the aggression force on particle i is estimated as:

$$\begin{aligned} {\varvec{F}}_i^{agg} \simeq -\varvec{o}_{i} \cdot \sum _{q =1}^Q \Big ( [\theta _i^q = q] \cdot \big (\tilde{\mathbf {w}}_{i}^{[\cdot ]} \star \mathbf{f}_ {q}^\alpha \big ) (i)\Big ) \end{aligned}$$
(10)

where \([\cdot ]\) is an indicator function that returns 1 if \(\theta _i^q = q\) and zero otherwise, and \(\star \) is the convolution operator, which performs exactly the summation over neighboring particles j in Eq. 7. \(\big (\tilde{\mathbf {w}}_{i}^{[\cdot ]} \star \mathbf{f}_ {q}^\alpha \big ) (i)\) is the value of the convolution at the coordinates of particle i. By the Convolution Theorem [28], Eq. 10 can be efficiently computed in the Fourier domain. We call the magnitude of the aggression force the aggressive drive.
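The following sketch illustrates Eq. 10 with NumPy/SciPy; `fftconvolve` performs the convolution in the Fourier domain, and the wedge-filter construction and bin conventions are our own illustrative choices rather than the exact implementation:

```python
import numpy as np
from scipy.signal import fftconvolve

def wedge_filter(L, q, Q, alpha_deg=120.0):
    """Binary wedge mask f_q^alpha: radius L, opening alpha, oriented along bin q."""
    ys, xs = np.mgrid[-L:L + 1, -L:L + 1]
    r = np.hypot(xs, ys)
    ang = np.arctan2(-ys, xs)                                 # image y-axis points down
    diff = np.angle(np.exp(1j * (ang - 2 * np.pi * q / Q)))   # wrap to [-pi, pi]
    return ((r > 0) & (r <= L)
            & (np.abs(diff) <= np.deg2rad(alpha_deg) / 2)).astype(float)

def aggressive_drive(flow, Q=4, L=32, approx=1):
    """Approximate Eq. 10: per-particle magnitude of the aggression force."""
    mag = np.linalg.norm(flow, axis=-1)
    # Quantize flow orientations into Q bins (Q assumed even for approx 1).
    theta = np.round(np.arctan2(-flow[..., 1], flow[..., 0])
                     / (2 * np.pi / Q)).astype(int) % Q
    moving = mag > 1e-3
    out = np.zeros_like(mag)
    for q in range(Q):
        if approx == 1:   # w^[1]: opponents come from the opposite bin only
            w = (moving & (theta == (q + Q // 2) % Q)).astype(float)
        else:             # w^[2]: any bin different from q counts
            w = (moving & (theta != q)).astype(float)
        resp = fftconvolve(w, wedge_filter(L, q, Q), mode='same')
        sel = moving & (theta == q)
        out[sel] = mag[sel] * resp[sel]   # |F_agg| ~ |o_i| * opponents in view
    return out
```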

Visual Information Processing Signature (VIPS). Each heuristic captures a different aspect of the visual information processed by individual cognition in crowd scenarios. To define a single informative feature, we simply combine acceleration, body compression, and aggressive drive into a feature we call the Visual Information Processing Signature, in short VIPS.

More specifically, we employ the standard bag-of-words (BOW) paradigm separately for each of the three maps (Eqs. 5, 6 and 10). For each video clip we sample P patches of size \(5\times 5\times 5\) from locations where the corresponding optical flow is non-zero, and we build a visual dictionary of size K using K-means clustering. Under the BOW assumption, each video is encoded as a bag; to compute such bags we assign each of the P patches to the closest codeword and pool all the patches together to generate a histogram over the K visual words. The final VIPS is obtained by concatenating the histograms resulting from acceleration, body compression, and aggressive drive. This process is illustrated in the right-most part of Fig. 1.
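A minimal sketch of this encoding step for one heuristic map, using scikit-learn's KMeans as a stand-in for the clustering (names and sizes are illustrative; the dictionary is learned once on training patches):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_patches, K=100):
    """Learn a K-word visual dictionary from flattened 5x5x5 training patches."""
    return KMeans(n_clusters=K, n_init=10).fit(train_patches)

def encode_map(codebook, patches):
    """Encode P patches of one map as an L1-normalized K-bin histogram."""
    words = codebook.predict(patches)                        # (P,)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# VIPS = concatenation of the per-map bags, e.g.:
# vips = np.concatenate([encode_map(cb_acc, p_acc),
#                        encode_map(cb_body, p_body),
#                        encode_map(cb_aggr, p_aggr)])       # length 3K
```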

Depending on which approximation of the aggressive drive we employ, \(\tilde{\mathbf {w}}_{i,j}^{[1]}\) or \(\tilde{\mathbf {w}}_{i,j}^{[2]}\) (see Eq. 9), we will refer to our descriptor in the experiments as VIPS\(^{[1]}\) or VIPS\(^{[2]}\), respectively. Finally, to further validate the aggressive drive, we also considered a third baseline version of \(\tilde{\mathbf {w}}\) in which we did not remove any orientation (Eq. 9), and simply filtered the quantized OF with the wedge filters of Fig. 2(b, bottom). We refer to this baseline as VIPS\(^{[*]}\).

Fig. 3. The first three columns show frame samples taken from the Violence in Crowds (VIC), Violence in Movies (VIM), and BEHAVE datasets, respectively; the last column shows samples from the Violence-Cross (VC) dataset. The reader is referred to the text for details.

5 Experiments

We evaluate our approach on three standard benchmarks, namely the Violence in Crowds (VIC) [18], Violence in Movies (VIM) [20], and BEHAVE [29] datasets. In particular, VIC is the only available dataset specifically assembled for classifying acts of violence in crowd scenes, while VIM allows us to evaluate the robustness of our approach in person-on-person violent scenes. We also selected the BEHAVE dataset, which contains several complex group activities (e.g., walking together, splitting, escaping, and fighting). Moreover, we noticed that the behavior most similar to our first approximation (\(VIPS^{[1]}\)) is "crowd crossing", in which people cross a road in opposite directions. Therefore, to show the robustness of the proposed method in distinguishing violent from crossing behaviors in normal situations, we created a new dataset called Violence-Cross (VC), whose videos are gathered from the VIC dataset and the CUHK dataset [30]. It includes 300 videos, equally divided into three classes (100 videos each). Class 1 consists of videos of violent behaviors, Class 2 contains videos of people walking in opposite directions (cross walk), and Class 3 contains videos showing actions other than violent and crowd-crossing behaviors (e.g., marathons, crowds walking in the same direction). The last column of Fig. 3 shows some sample frames of this new dataset.

Effect of varying filter size. We examined the performance of the body compression force (\({F}^{bc}\)) and the aggression force (\({F}^{agg}\)) with respect to different lengths of the view field L and filter sizes (Gaussian bandwidths) R, respectively, on the VIC dataset. We set the number of random patches to 1000 and varied R and L as \(\beta \cdot max(h,w)\) pixels (for both R and L), where \(\beta \in \left\{ 0.025,0.05,0.075,0.1,0.15\right\} \) and \(h \times w\) is the dimension of a video frame (\(320 \times 240\) in this case). Figure 4(a) shows that a larger length of the field of view results in better performance for the aggression force (\({F}^{agg}\)). In contrast, we observed that increasing the size of the Gaussian filter decreases the performance of the body compression (\({F}^{bc}\)). This is consistent with our definition of the body contact force, where only particles (individuals) that are really close may impose body compression forces.

Fig. 4. Evaluating (a) the effect of filter size on the aggression and body compression force values, and (b) the effect of varying the number of randomly sampled patches.

Effect of the number of randomly sampled patches. We evaluated the performance of VIPS varying P, the number of random patches extracted from each video or clip. We empirically set \(\beta \) for R and L to 0.025 and 0.1, respectively (i.e., \(R = 8\) and \(L = 32\) pixels), and varied \(P \in \left\{ 50,100,200,400,800,1000\right\} \). Figure 4(b) summarizes the results. As expected, the accuracy on VIC and VIM improves as the number of sampled patches P increases. Interestingly, VIPS\(^{[1]}\) outperformed VIPS\(^{[2]}\) and VIPS\(^{[*]}\) for all values of P on all datasets. This supports our choice of considering individuals approaching from the opposite direction as opponents. Finally, the results show the superiority of VIPS over the optical flow and interaction force (SFM) [3] methods for all numbers of sampled patches.

Comparison with the state of the art. We compared our approach with the Interaction Force (SFM) [3], Acceleration Measure Vector (AMV) [14], optical flow [27], and ViF [18] as baselines, as well as with state-of-the-art descriptors used for detecting violent acts in crowd videos, including MoSIFT [17, 20] and the Substantial Derivative (SD) approach [31]. Moreover, to demonstrate the effectiveness of the proposed method, we compared against a ConvNet. Although no pre-trained ConvNet exists for violence detection, mainly due to the scarcity of training examples, we evaluated a model pre-trained on the WWW-crowd dataset [32], which is the most relevant pre-trained model for crowd behavior analysis. We construct the feature vector by averaging the deep feature vectors of 10 jittered samples of the original image; we then \(L_{2}\)-normalize the feature vectors and evaluate their performance on the VIC, VIM, and BEHAVE datasets.

We performed violence classification at the video level for the VIC, VIM, and VC datasets. For the first two datasets, we followed the standard training-testing splits that come with each dataset, whilst for VC we equally divided each class, using 150 videos for testing (50 video sequences per class) and the rest for training. We then compute VIPS for each video and adopt a Support Vector Machine (SVM) with Histogram Intersection Kernel [17] for video classification. For the BEHAVE dataset, however, the associated task is temporal detection, assigning either a normal or an abnormal (violent) label to each frame of a video. For this purpose, we computed VIPS at the frame level. Since abnormal data are not available at training time, following the standard procedure of [3], we employed Latent Dirichlet Allocation (LDA) [33] to generatively model normal crowd behaviors. To compensate for the effect of random sampling, we repeated each experiment 10 times and report the mean performance. It is also worth mentioning that, for all the experiments, we employed four quantized orientations to compute the aggression force, i.e., \(Q=4\) in Eq. 10; larger values of Q did not improve the results. We set the filter sizes to \( L = 32 \) and \(R = 8 \) and selected \(P=1000\) patches of size \(5\times 5\times 5\).

Table 1 reports the comparison with the state-of-the-art methods, as well as the performance of each component of the VIPS descriptor, on the VIC, VIM, and BEHAVE datasets. As is immediately visible, in dense (VIC) and moderately crowded (BEHAVE) scenes, the first approximation of the aggression force (\(F^{agg}-W^{[1]}(H^{[1]}_{3})\)) performs better than acceleration (\(H_{1}\)), body compression (\(H_{2}\)), and the second approximation of the aggression force (\(F^{agg}-W^{[2]}(H^{[2]}_{3})\)). In addition, we observe that for person-to-person violent situations, \(H_{2}\) and \(H^{[1]}_{3}\) show very similar performance, and their combination with acceleration (VIPS\(^{[1]}\)) improved the overall performance of the classifier in all scenarios, including the moderately crowded scene (BEHAVE dataset). However, compared to the Energy Potential descriptor [23], VIPS\(^{[1]}\) does not achieve a significant improvement. We believe that this is mainly due to our sampling strategy, and that the results could be improved using a trajectory-based method.
Nonetheless, we conclude that VIPS\(^{[1]}\) has a strong discriminative power for detecting violent behaviors regardless of scene crowdedness (from dense to moderately crowded scenes, as well as person-to-person fights). As an example, MoSIFT [17, 20] obtained very promising accuracy (the second best after our approach) on VIM (person-to-person), but poor performance on VIC (which is characterized by a dense crowd). This indicates that, unlike our approach, MoSIFT is sensitive to crowd density. Moreover, SFM and AMV obtained very competitive accuracy on VIM, while their performance on VIC decreased drastically. This supports the discussion in the socio-psychology literature [7, 25] reporting that social force models perform poorly in overcrowded situations, since they are not capable of modeling the complex behavioral patterns of such scenarios. In addition, one can observe that the ConvNet-based approach obtained significantly inferior performance compared with the hand-crafted competitors on BEHAVE and VIM, but comparable performance on VIC. This is understandable, since the characteristics of VIC are closer to those of the source dataset used to train the network than those of VIM and BEHAVE.

Finally, we evaluated the robustness of our descriptors in distinguishing acts of violence from crossing behaviors. In particular, we conducted experiments on each component of VIPS to show its contribution to the final performance. Moreover, we selected the ViF [18] descriptor, which was designed for detecting violent behaviors in crowds, and SFM [3], which is considered one of the most well-known descriptors for detecting abnormality in crowds. Figure 5 shows the confusion matrices of these two state-of-the-art methods and of the components of the proposed method. We observe that ViF performs well at detecting acts of violence compared to SFM; however, its overall accuracy is low, since it often confuses violent with normal and crossing behaviors. On the other hand, similar to what we observed in the previous experiments, \(H^{[1]}_{3}\) plays an important role in distinguishing violent behaviors, which results in the significantly high performance of \(VIPS^{[1]}\), which discriminates well among the three classes.

Fig. 5. Average accuracy on the Violence-Cross dataset. Class 1, Class 2, and Class 3 correspond to violent, cross walk, and normal behaviors, respectively. ViF [18]: \(57\,\%\) overall accuracy; SFM [3]: \(69\,\%\); acceleration (\(H_{1}\)): \(74\,\%\); body compression (\(H_{2}\)): \(75\,\%\); aggression force (\(H^{[1]}_{3}\)): \(89\,\%\); aggression force (\(H^{[2]}_{3}\)): \(80\,\%\); \(VIPS^{[1]}\): \(92\,\%\); \(VIPS^{[2]}\): \(86\,\%\).

Table 1. Average accuracy over 10 repeated trials on the VIC, VIM, and BEHAVE datasets.
Fig. 6. Runtime performance. (a) Running time of VIPS with respect to the number of sampled patches and video resolution. (b) Accumulated accuracy on 21 videos as a function of the distance from the violence outbreak.

Runtime performance. The final experiment evaluated the complexity (runtime) of computing the proposed video signature compared to the real-time violent-flows descriptor [18]. The time for BOW encoding is not considered in this experiment (the real-time efficiency of BOW encoding is shown in [34]). First, we measured the computational time of our method relative to violent-flows [18]. Figure 6(a) shows the ratio between the times needed to process a clip with VIPS and with violent-flows, as a function of the number of sampled patches and of the video resolution. For both methods we employed the same implementation of optical flow [27]. The results show that our method is roughly 1.5 to 2 times slower than [18]. During the experiments, we observed that the dominant computational cost of our method lies in the optical flow computation, in particular for medium-to-high resolutions, whereas the convolutions (in the frequency domain) add a negligible computational burden. Second, we evaluated the accuracy and detection time of both methods. For this purpose, following [18], we selected 21 videos from the VIC dataset that start with non-violent behavior and turn into violent situations mid-way through the video. The goal is to detect the violence as close as possible to its annotated start point (outbreak). Figure 6(b) summarizes the results: our approach (VIPS\(^{[1]}\) and VIPS\(^{[2]}\)) obtained higher accumulated accuracy for all the admitted detection delays. Overall, this test shows that our approach outperforms ViF at a slightly higher computational cost. The ViF curve plateaus after five seconds, meaning its accuracy does not improve further.

6 Conclusions

This paper introduced a novel framework to identify violent behaviors in crowd scenes. In particular, we proposed three behavioral heuristic rules to model a wide range of complex actions underlying crowd scenarios. We explained how to formulate the behavioral heuristics in computational terms and how to estimate them with very low complexity from video sequences. Experimental results showed that the proposed approach is not only computationally efficient, but also highly robust to various conditions in terms of crowd density and different crowd behaviors (such as crossing and fighting), as well as imaging conditions, occlusions, and camera motions, to name a few. Moreover, we observed that the proposed aggressive drive has a considerable ability to localize regions of conflict at the pixel level compared to other descriptors such as optical flow and SFM; however, due to the lack of annotated data, we were not able to fully present this type of evaluation. A potential weakness of this work is the use of fixed-size filters regardless of scene properties and imaging conditions, which may negatively impact performance. Both of the latter aspects require further investigation and will be the subject of future work.