Effective Crowd Anomaly Detection Through Spatio-temporal Texture Analysis

Abnormal crowd behaviors in high-density situations can pose great danger to public safety. Despite the extensive installation of closed-circuit television (CCTV) cameras, it is still difficult to achieve real-time alerts and automated responses with current systems. Two main contributions are reported in this research. Firstly, a spatio-temporal texture extraction algorithm is developed, which extracts video textures rich in crowd motion detail by selecting the Gabor-filtered textures with the highest information entropy values. Secondly, a novel scheme for defining crowd motion patterns (signatures) is devised to identify abnormal behaviors in the crowd by employing an enhanced gray level co-occurrence matrix model. In the experiments, various classic classifiers are utilized to benchmark the performance of the proposed method. The results obtained exhibit detection and accuracy rates which are, overall, superior to other techniques.


Introduction
Closed-circuit television (CCTV) cameras are widely installed in city centers, along main roads and highways, at fixed and/or moving locations inside stadiums, concert halls, shopping malls, and other key installations for ensuring public welfare and safety. The live video feeds are often sent to various control centers for processing and storage. If the monitored crowds exhibit unusual behavioral (motion) patterns, immediate actions can be taken in response, to avoid potential damage or even casualties. For example, when the population density of a crowd in a public event is rapidly increasing and reaching a threshold, measures might need to be taken quickly to avoid a stampede; or, when people in a tightly packed tube station suddenly disperse and run away, an alarm needs to be immediately triggered in the control room. However, the main operational mode today in many countries still relies on human operators to constantly monitor live video streams from multiple sources. This is often done on a multi-screen monitor wall, a tedious job that easily leads to fatigue, slow response or even oversight, not to mention the cost of staffing. The primary goal of this research is to design an automatic detection system which could alert human operators to the occurrence of abnormal crowd events, or even predict them.
Many approaches have been proposed for designing crowd behavioral analysis algorithms over the last two decades [1][2][3][4][5][6][7]. The analysis of crowd behaviors focuses on two topics: global scale (or macroscopic) analysis and local scale (or microscopic) analysis. In global scale analysis, a crowd of similar motions is treated as a single entity. Its main goal is to recognize the dominant and/or anti-dominant patterns of this entity, without concerning itself with any individual behaviors. For example, congestion or stampede scenarios are a convergence of a crowd's locomotion. The global scale analysis, therefore, concentrates on the overall tendencies of the critical mass rather than specific behavior such as waving or jumping. In local scale analysis, the detection of an individual's behavior, or more specifically, actions, among other crowd entities becomes the focus, which is challenging, especially when crowd density is high. The difficulties include, e.g., occlusions that make the segmentation of a particular individual a challenging task.
For global feature-based approaches, feature patterns such as optical flow are often extracted from entire video footage, and corresponding histograms are constructed. In the bag of visual words (BoW) technique [8], histograms with similar patterns are clustered to train a dictionary, and then the crowd behavior in a testing video is classified with its histogram. Solmaz et al. [1] proposed an algorithm to identify crowd behaviors based on optical flow information. In their research, the optical flow method is reproduced and evaluated, and then optimization work is carried out to introduce the particle angles as a new parameter for sorting and clustering the so-called regions of interest (RoI) model. By investigating the signature values calculated from the Jacobian matrix of pixel values in each RoI, different behavioral types can then be determined. Krausz and Bauckhage [2] followed a different route in tackling the problem by computing the histograms of the motion direction and magnitude extracted from the optical flow through applying non-negative matrix factorization (NMF). The obtained histograms are then readily clustered. The essence of the process relies on a signature, named the symmetry value, being calculated on the averaged histograms to check whether the current cluster is in a congested state or otherwise. For local feature-based approaches, each individual is treated as a single agent and its motion is analyzed independently. One typical approach is the social force model (SFM) proposed by Helbing and Molnar [9]. The assumption of SFM is that the behaviors of each agent in a crowd are determined by multiple types of interaction forces. The extracted flow-based feature is mapped to each agent according to the rules of SFM to define individuals' abnormal behaviors. Yan et al. [10] proposed a technique using SFM to detect sudden changes in crowd behavior. In this approach, the interaction force in SFM is directly calculated from the code stream to increase efficiency, then the BoW algorithm is applied to generate histograms on intensity and angles of interaction force flow. With the histograms obtained, the crowd's moving state can be distinguished to detect the anomalies.
Despite the variety of the approaches mentioned above, their common pitfall is the heavy time consumption of calculating optical flow for every frame [11]. In order to maintain the detection accuracy while keeping the workload as low as possible, spatio-temporal information is explored in this research, with the aim of developing a practical crowd anomaly detection and classification framework.
Spatio-temporal information is widely used for single human action recognition, such as gesture, gait, and pose estimation. Niyogi and Adelson [12] used spatio-temporal texture (STT) to analyze human walking patterns, such as gaits at the ankle level. In this research, the key patterns of gaits were firstly defined as various braided streaks extracted from STT, and then the rough estimation of the walker's pattern was refined using snakes (modeled streaks) proposed by Kass et al. [13]. The walker's body was modeled by merging the snake contours into one before the general combinatory contour was classified using the predefined gait signatures. In Wang's research [14], dynamic events and actions were modeled and represented by various geometrical and topological structures extracted from identified spatio-temporal volumes (STV) in a scene. Similar to an individual's behavior, crowd behavior would also generate abundant motion patterns in the spatio-temporal space. Hence, by extracting the spatio-temporal information from regions of interest (RoIs) in a crowd, background and irrelevant information can be culled, thus saving precious computational time. In recent research by Van Gemeren [15], a novel model is proposed to detect the interaction of two persons in unsegmented videos using spatio-temporal localization. In this research, the spatio-temporal information is utilized to help model the person's body pose and motion in detailed coordination with designed part detectors. The researcher claims to have obtained robust detection results when training on only small numbers of behavioral sequences. Ji et al. [16] introduced an approach using the combination of local spatio-temporal features and global positional distribution information to extract 3-dimensional (3D) scale-invariant feature transform (SIFT) descriptors on detected points of interest. Then, a support vector machine (SVM) is applied to the descriptors for human action classification and recognition.
An abstract pipeline of the crowd anomaly detection framework proposed in this research is shown in Fig. 1. Once the raw video data is acquired, the first phase of the procedure is to perform the preprocessing operations, including noise filtering and background subtraction. Initial steps for the construction of STVs from raw video data also occur at this stage. In the second phase, main crowd features and patterns are extracted from the filtered data, where the features are modeled as descriptors (or signature vectors) for the classification/recognition purpose. In the third phase, extracted crowd patterns are sorted using various machine-learning models such as classifiers and templates. Once the crowd behaviors are identified, the abnormal ones can be treated as anomalies in further studies such as semantic analysis.
This paper is organized as follows: Section 2 introduces a novel model for identifying and extracting spatio-temporal textures (STT) from video footage. Section 3 defines a salient STT signature using a gray level co-occurrence matrix (GLCM) for crowd anomaly labeling. Section 4 presents the experimental results of using the proposed GLCM signature on various classifiers. Section 5 concludes the paper.

Effective spatio-temporal texture extraction
Because automatic classification of crowd patterns includes abrupt and abnormal changes, a novel approach for extracting motion "textures" from dynamic STV blocks formulated by live video streams has been proposed. This section starts by introducing the common approach for STT construction and the corresponding spatio-temporal texture extraction techniques. Next, the crowd motion information contained within the random STT slices is evaluated based on information entropy theory to cull the static background and noise occupying most of the STV space. A preprocessing step using Gabor filtering for improving the STT sampling efficiency and motion fidelity has been devised and tested. The technique has been applied on benchmarking video databases for proof-of-concept and performance evaluation. Preliminary results have shown encouraging outcomes and promising potential for real-world crowd monitoring and control applications, as detailed in Section 4.

STV-based motion encapsulation and STT feature representation
STV was first proposed by Adelson and Bergen [17]. Fig. 2 illustrates the STV construction process. The live video signal is first digitized and stored as continuous and evolving 3-dimensional (3D) STV blocks. The construction of a typical STV block from video can be described as the stacking up of consecutive video frames to a fixed time capsule (normally of a few seconds) that consists of evenly spread grey-scale (for black-and-white video) or colored (for color video) mini-cubes over the 3D space, enclosed by the borders of the frame and the length (decided by the STV length in seconds and the video frame rate) along the time axis (Fig. 2(a)). These cubes are in fact the 2D pixels of each frame "stretched" into 3D voxels (volumetric pixels) filling up the STV block (Fig. 2(b)). Compared to 2D frames, a STV block naturally encapsulates dynamic information, such as object movements, as well as static scene information in its structure. 2D neighboring frame-based tracking techniques such as the optical flow [18] study consecutive frame pairs for gradual object motions, which works well for continuous human and vehicle tracking. However, this technique has major drawbacks when it comes to evaluating sudden changes, especially concerning a large group of fast moving objects within a dense crowd. In order to further process the constructed STVs, slices of a STV called spatio-temporal textures (STTs) can be extracted to learn the patterns recorded in each texture, resembling the medical operations of 3D ultrasonic scan or magnetic resonance imaging (MRI). For example, Niyogi and Adelson [12] used STTs to analyze the gait (walking style) of individual pedestrians. STV and STT techniques have been widely studied in the last two decades. Bolles et al. [19] used STV for geometric and structure recovery from static scenes. Baker et al. [20,21] used STV for 3D scene segmentation. Ngo et al. [22] used STT techniques for the detection of camera cuts, wipes and dissolves in a video sequence. In this approach, a STT was analyzed by first convolving with the first derivative Gaussian, and then processed using Gabor decomposition, in which the real components of multiple spatial-frequency channel envelopes were retrieved to form the texture feature vector. A Markov energy-based image segmentation algorithm was then used to locate the color and texture discontinuities at region boundaries. The approach was tested on different types of videos, including news and movies. The results show sound performance on "cut" detection with accuracy reaching 95%, but only 64% for the "wipe" detection.
Because of the way a STV block is constructed and the random nature of real-life events, the "useful" information distributed over a STV space is usually uneven and irregular. Thus, one important problem is how to obtain the STT slices with the highest information density from a STV block. Core to the challenge is how to differentiate useful information, such as voxels formed by crowd movement, from noise such as static background. In this research, instead of an even cut and computation on all STT slices from a STV block, an optimized technique is developed to obtain the specific STT with rich motion information, as shown in Fig. 3.
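The stacking and slicing described in this section can be sketched in a few lines. This is a minimal illustration only: the function names, the `stv[t][y][x]` indexing and the toy frames are assumptions made for the example, not details taken from the paper.

```python
def build_stv(frames):
    """Stack equally sized gray scale frames into an STV indexed as stv[t][y][x]."""
    height, width = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must share the same size")
    return [[row[:] for row in frame] for frame in frames]


def horizontal_stt(stv, y):
    """Cut along the time axis at image row y; the slice is indexed [t][x]."""
    return [frame[y][:] for frame in stv]


def vertical_stt(stv, x):
    """Cut along the time axis at image column x; the slice is indexed [t][y]."""
    return [[row[x] for row in frame] for frame in stv]


# Toy clip (hypothetical data): 3 frames of 2x4 pixels, one bright pixel moving right.
frames = [
    [[0, 0, 0, 0], [9, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 9, 0, 0]],
    [[0, 0, 0, 0], [0, 0, 9, 0]],
]
stv = build_stv(frames)
moving_row = horizontal_stt(stv, 1)  # motion shows up as a diagonal streak
static_row = horizontal_stt(stv, 0)  # static background gives a flat slice
```

In the toy clip the moving pixel leaves a diagonal trail in `moving_row`, which is the kind of "ribbon" the later sections look for, while `static_row` stays constant over time.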

Implementation strategy
A typical pipeline of a crowd abnormality detection system contains three processing phases [23], as shown in Fig. 1. In the first phase, video data acquisition, the raw video signals are collected and stored in suitable digital formats. Then, static or dynamic features contained within the information packets are extracted. Finally, predefined feature patterns describing signal-level, statistical-level, and/or even semantic-level explanations of the "video events" are used to evaluate the similarities and differences of the features extracted from the live feeds [24][25][26].
In this research, at the STT extraction phase, an information entropy evaluation model has been devised to help the sampling and selection of "meaningful" feature containers before feeding them into the feature (crowd patterns) extraction module. This design ensures that the STT containing most of the crowd dynamics will be selected, based on the magnitude and richness of motion "trails" along the time axis in the continuously evolving STV blocks. After that, motion features are extracted from the selected STTs and are modeled into feature vectors (signatures). In the last step of the devised framework, the identified STT RoIs are classified according to their motion signatures.

Information entropy-based STT selection
Information entropy (also referred to as Shannon entropy) was proposed by Shannon [27]. It is a concept from information theory that quantifies how much information there is in an event. The information gain is a measure of the probability of a certain result to occur [28]. Liang et al. [29] proposed an approach to detect encoded malicious web pages based on their information entropy counts. Zhang et al. [30] used information entropy to detect mobile payment anomalies by recursively training a devised entropy mechanism using verified data. The idea of information entropy can also be used as an index to measure the informational value of the extracted STTs. If a STT has higher entropy, it is likely to contain more motion and scene update information.
As illustrated in Fig. 2, multiple horizontal and vertical cuts can be applied to a STV block for obtaining STTs. All of the cuts are along the time axis. The sampling density of the cuts is customizable and depends on the actual application scenario. When the density is set to a higher value, the result can be expected to be closer to optimal, yet the computational burden will increase. In the third step of Fig. 2, once the STTs are obtained, the information entropy is calculated for each STT. The slice with the highest information entropy will then be selected as the target STT for crowd behavior analysis.
The information entropy can be expressed as (1).

$$H = -\sum_{i=1}^{N} p_i \log p_i, \qquad p_i = \frac{n_i}{\sum_{k=1}^{N} n_k} \qquad (1)$$

In (1), $N$ represents the total number of different gray scale levels in a STT, $n_i$ represents the amount of pixels of the gray scale level $i$ in it, $p_i$ represents the probability of gray scale level $i$ in the STT, and $H$ is the calculated information entropy.
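As a hedged sketch of the selection step, the snippet below computes the Shannon entropy of the gray level histogram of each candidate STT and keeps the slice with the largest value. The base-2 logarithm, the function names and the two toy slices are assumptions made for illustration.

```python
import math
from collections import Counter


def stt_entropy(stt):
    """Shannon entropy (base 2, an assumption) of the gray levels in one STT."""
    pixels = [value for row in stt for value in row]
    counts = Counter(pixels)
    total = len(pixels)
    # p = n / total is the probability of one gray level in the slice.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def select_richest_stt(stts):
    """Return (index, entropy) of the slice with the highest entropy."""
    entropies = [stt_entropy(stt) for stt in stts]
    best = max(range(len(stts)), key=entropies.__getitem__)
    return best, entropies[best]


flat = [[5, 5, 5, 5], [5, 5, 5, 5]]      # static background only
streaked = [[0, 1, 2, 3], [3, 0, 1, 2]]  # four equally frequent gray levels
best_index, best_entropy = select_richest_stt([flat, streaked])
```

The uniform slice carries no information (entropy 0), whereas four equally likely gray levels give 2 bits, so the second slice is selected.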
Fig. 4 shows the calculated information entropy values of a group of STTs extracted from a single STV. The STTs are displayed in descending order of the calculated entropies. It can be observed that STTs with higher information entropy show abundant motion information, as indicated by the ribbon-shaped trajectories.
However, when directly applied to a test video database as shown in Table 1, the approach does not yield consistent and satisfactory outcomes: against intuition, UMN3, UMN5 and UMN6 show higher entropy values yet contain fewer motion features than UMN1 and UMN2.

Optimization through Gabor filtering
In Section 2.3, the information entropy is calculated on all extracted STTs, and the STT with the largest entropy is selected as the target for further pattern analysis. However, preliminary tests have shown a poor match between the STT slices with high entropy values and the ones actually containing more crowd motion "ribbons". Close inspection revealed that the main cause of the problem is the traces left on STTs by non-moving objects and background regions, especially those with high color contrast. For example, the sample STTs obtained from the UMN3 to UMN8 patches show explicit parallel stripes caused by the background. To address this issue, in this research, Gabor wavelet filtering is exploited for removing the STT background. Fig. 5 shows the revised process. Instead of applying the information entropy calculation directly on the extracted STTs, they are firstly converted into gray scale images. Then, the background of the STTs is removed by convolving the STTs with the Gabor filter before the entropy measures are calculated.
The Gabor transformation is a special case of the short-time Fourier transformation. Because the response of the Gabor wavelet is very similar to that of a single cell in the human visual system to a visual stimulus, it is sensitive to the borders in an image, but not so much to changes of light, which makes it ideal in many application areas in image processing and computer vision. Panda and Meher [31] introduced a hierarchical algorithm for both block-based and pixel-based background subtraction approaches based on the Gabor transformed magnitude feature. Zhou et al. [32] extracted features using circular Gabor filters at five different frequencies, to tackle challenges with which conventional background subtraction algorithms struggle.
In the spatial domain, a two-dimensional Gabor filter is the product of a sinusoidal function and a Gaussian function, the latter also being called the window function. In practice, the Gabor filter can extract features from multiple scales and orientations. For this research, it is expressed as

$$G(x, y, \theta, f) = \exp\left(-\frac{1}{2}\left[\left(\frac{x'}{S_x}\right)^2 + \left(\frac{y'}{S_y}\right)^2\right]\right)\cos(2\pi f x') \qquad (2)$$

In (2), $S_x$ and $S_y$ are the window sizes along the $x$ and $y$ axes, the value of $x$ varies from $-S_x$ to $S_x$, and the value of $y$ varies from $-S_y$ to $S_y$. $\theta$ defines the orientation of the extraction process, and $f$ defines the frequency of the sinusoidal function. And

$$x' = x\cos\theta + y\sin\theta, \qquad y' = y\cos\theta - x\sin\theta \qquad (3)$$

The convolution of the Gabor filter and an original STT is then applied to obtain the filtered version.
In a real-life scenario, the motion of a crowd recorded in a STV block could be towards any direction, thus the Gabor filtering is applied in eight directions (like the notions of N, S, E, W, NE, SE, NW and SW on a map) to increase the accuracy (Fig. 6). By using this method, the long computational time of calculating flow-based information in every frame can be greatly shortened. The extraction of flow-based information involves a calculation on every pixel in the video data. For a STV block with $n$ pixels along each of its three dimensions, the amount of pixels needing to be analyzed is $n^3$, therefore the computational complexity is $O(n^3)$. The proposed algorithm only has to collect several STTs at certain positions, so the amount of pixels needing to be analyzed is reduced to $k n^2$ for $k$ slices, thus the overall computational complexity is $O(n^2)$. Also, because patterns of STTs with varied signature values exhibit different behavioral types, carefully selected patterns can be modeled into a feature signature for further texture classification. Unlike a pure change detection algorithm, the classification of textures is capable of potentially labeling different scenarios in input video streams.
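The filtering step can be sketched as below. The kernel uses a common real-valued form (a Gaussian envelope times a cosine carrier on rotated coordinates), which is assumed here for illustration; the window size, the frequency, the zero padding and the function names are likewise hypothetical choices, not values reported in the paper.

```python
import math


def gabor_kernel(sx, sy, f, theta):
    """Real Gabor kernel sampled on x in [-sx, sx] and y in [-sy, sy].

    Assumed form: exp(-0.5 * ((x'/sx)**2 + (y'/sy)**2)) * cos(2*pi*f*x'),
    where (x', y') are the coordinates rotated by theta.
    """
    kernel = []
    for y in range(-sy, sy + 1):
        row = []
        for x in range(-sx, sx + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = y * math.cos(theta) - x * math.sin(theta)
            envelope = math.exp(-0.5 * ((xr / sx) ** 2 + (yr / sy) ** 2))
            row.append(envelope * math.cos(2 * math.pi * f * xr))
        kernel.append(row)
    return kernel


def convolve_same(image, kernel):
    """Zero-padded 2D convolution that keeps the input size."""
    kh, kw = len(kernel), len(kernel[0])
    ch, cw = kh // 2, kw // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + ch - i, c + cw - j  # flipped kernel index
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


# Eight orientations, 45 degrees apart (window size and frequency are made up).
bank = [gabor_kernel(2, 2, 0.25, k * math.pi / 4) for k in range(8)]
```

One property worth noting under this assumed even-symmetric form: rotating by 180 degrees leaves the kernel unchanged, so opposite compass directions (for example E and W) yield the same filter response.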

GLCM signaturing for classification
In order to achieve automatic warning of hazardous crowd behaviors, a spatio-temporal volume (STV) signature modeling method is proposed to detect crowd abnormality recorded in CCTV streams using the texture extraction algorithm proposed in Section 2. Once the optimal STTs are extracted, the gray level co-occurrence matrix (GLCM) can be formulated to measure the crowd behaviors identified. In this section, the proposed STT signatures based on the GLCM indices are defined. The proposed model has shown promising accuracy and efficiency in detecting abnormal crowd behaviors. The experiments indicate that the STT signatures are suitable descriptors for detecting certain crowd events, which provides an encouraging direction for real-time surveillance applications.

STT feature categorization
Depending on different construction patterns, STT features can be roughly classified into statistical texture features, model type texture features and signal domain texture features, according to Junior et al. [33]. Statistical texture features are obtained by transforming the gray scale values between a target pixel and its neighbors in a first-order, second-order or even higher-order filtering process to denote information, often described in the conventional terms of contrast, variance, etc. The most frequently used statistical texture feature is the grey level co-occurrence matrix (GLCM) [34], which will be discussed in the next section. The model type texture features assume that a texture can be described by certain parameters controlled by probabilistic distribution models. How to recover the most accurate parameter values is the core issue of this approach. Benezeth et al. [35] proposed an algorithm using a hidden Markov model (HMM) associated with a spatio-temporal neighborhood co-occurrence matrix to describe the texture feature. In the signal domain texture features, textures are defined in a transformational domain by certain filters such as the wavelet [36]. This approach is based on the assumption that the energy distribution within the frequency domain can be used to classify textures.
The grey level co-occurrence matrix (GLCM), also known as the grey tone spatial dependency matrix, was first proposed by Haralick et al. [34]. By definition, the GLCM is a statistical tabulation of how often different combinations of pixel grey scale values occur in an image. In brief, assuming the gray scale of the current image is divided into three levels, the GLCM will store the counts of all the neighboring pairs of these three levels.
In this research, the GLCM patterns have been explored to test their performance on STT signature identification. The main strategy of this approach is to extract raw GLCM texture features from relevant STTs. Once these features are acquired, a signature can be modeled for classification purposes. A five-stage process flow of this approach is shown in Fig. 7.

In order to obtain the GLCM indices from a STT, the very first step is to transform the STT from an RGB image to gray scale, and then the raw GLCM, labeled as $G$, can be calculated based on the algorithm introduced in [37]. In most cases, the gray scale value distribution of STTs is irregular, thus the obtained $G$ is often asymmetric. According to the GLCM definition, the transposed matrix $G^{T}$ represents the co-occurrence matrix along the opposite direction, and then the symmetric matrix $S$ can be obtained by adding $G$ and $G^{T}$, to represent the complete relations along this direction. The next step is the normalization, where the probability matrix $P$ is obtained from $S$ by using (4).

$$P_{i,j} = \frac{S_{i,j}}{\sum_{i,j=0}^{N-1} S_{i,j}} \qquad (4)$$

where $i$ and $j$ are the row and column indices of the matrices $S$ and $P$. The obtained probability matrix $P$ has two properties: 1) According to the definition of the GLCM algorithm, assuming that the gray scale value of the original image is divided into $N$ levels, then the column and row numbers are also $N$. Thus, the more levels the gray scales are divided into, the larger $N$ will be, which means the size of the GLCM will be larger. Also, the range of $N$ is usually from 3 to 10. If it is too large, the GLCM will be sparse and its descriptive ability will be affected. In order to reduce the computation time and to avoid overly sparse GLCMs, a proper value of $N$ should be selected. In this research, the value of $N$ is set to 8 based on experiments. 2) $P$ is symmetric along the diagonal. The diagonal elements represent pixel pairs which do not have gray level differences, and the farther away from the diagonal, the greater the differences between the pixel gray levels. According to this property, patterns like the contrast can be readily retrieved in a look-up table style.
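The three matrices involved in this step (raw counts, symmetric counts, probabilities) can be sketched as follows. The function names, the quantization rule and the 4-level toy image are assumptions made for illustration; the offset `(dx, dy) = (1, 0)` corresponds to a horizontal direction with a step of one pixel.

```python
def quantize(image, levels, max_value=255):
    """Map gray values in [0, max_value] to integer levels 0..levels-1."""
    return [[min(v * levels // (max_value + 1), levels - 1) for v in row]
            for row in image]


def glcm_probabilities(levels_image, levels, dx=1, dy=0):
    """Return (G, S, P) for the pixel offset (dx, dy).

    G counts ordered pairs (reference level i, neighbor level j),
    S = G + G^T adds the opposite direction, and P normalizes S to sum to 1.
    """
    G = [[0] * levels for _ in range(levels)]
    h, w = len(levels_image), len(levels_image[0])
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                G[levels_image[y][x]][levels_image[ny][nx]] += 1
    S = [[G[i][j] + G[j][i] for j in range(levels)] for i in range(levels)]
    total = sum(map(sum, S))
    if total == 0:
        raise ValueError("no pixel pairs exist for this offset")
    P = [[S[i][j] / total for j in range(levels)] for i in range(levels)]
    return G, S, P


# Toy image that is already quantized to 4 levels (hypothetical data).
toy = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
G, S, P = glcm_probabilities(toy, 4)
```

For an 8-bit STT, `quantize(stt, 8)` would first reduce the image to the eight levels used in this research before the counting step.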

Next, texture patterns can be calculated from the probability matrix $P$. The resulting low level texture patterns are named here as contrast patterns, orderliness patterns, and descriptive statistical patterns.

Contrast patterns of GLCM
Contrast patterns describe how the gray scale value of the current image varies in terms of contrast, dissimilarity, homogeneity and similarity. The farther a pixel pair is from the central diagonal line in $P$, the bigger the difference it represents within the gray scale, thus the contrast can be obtained by (5).

$$\mathrm{CON} = \sum_{i,j=0}^{N-1} P_{i,j}(i-j)^2 \qquad (5)$$

Similar to contrast, dissimilarity also represents the difference in gray scale values, except that its weight increases linearly instead of quadratically. Dissimilarity can be obtained by (6).

$$\mathrm{Dissimilarity} = \sum_{i,j=0}^{N-1} P_{i,j}\,|i-j| \qquad (6)$$

Homogeneity is also called the inverse difference moment (IDM). Contrary to contrast, homogeneity represents how consistent the gray scale is: when the contrast of an image is low, the value of its homogeneity will be large. Equation (7) shows how to calculate homogeneity.

$$\mathrm{Homogeneity} = \sum_{i,j=0}^{N-1} \frac{P_{i,j}}{1+(i-j)^2} \qquad (7)$$

Similar to dissimilarity, the linear version of homogeneity, named similarity, can be obtained by (8).

$$\mathrm{Similarity} = \sum_{i,j=0}^{N-1} \frac{P_{i,j}}{1+|i-j|} \qquad (8)$$
Table 2 gives a comparison of the contrast related patterns for sample images. The GLCM window size is set to 50 by 50 pixels, where the direction is set to horizontal with the step size fixed at 1 pixel. The gray scale level number is set to 8. The patch in Table 2(a) is less contrastive than the one in Table 2(d), thus the result shows that the patch in Table 2(a) has lower GLCM contrast and dissimilarity values, and larger homogeneity and similarity values.
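A minimal sketch of the four contrast-related patterns is given below, assuming the usual GLCM definitions: squared and absolute level differences as weights for contrast and dissimilarity, and their reciprocal forms for homogeneity and similarity. The function name and the two 2-level probability matrices are hypothetical.

```python
def contrast_patterns(P):
    """Contrast-group patterns of a normalized GLCM given as a list of rows."""
    n = len(P)
    cells = [(i, j, P[i][j]) for i in range(n) for j in range(n)]
    return {
        "contrast": sum(p * (i - j) ** 2 for i, j, p in cells),
        "dissimilarity": sum(p * abs(i - j) for i, j, p in cells),
        "homogeneity": sum(p / (1 + (i - j) ** 2) for i, j, p in cells),
        "similarity": sum(p / (1 + abs(i - j)) for i, j, p in cells),
    }


smooth = [[0.5, 0.0], [0.0, 0.5]]  # every pixel pair shares the same level
rough = [[0.0, 0.5], [0.5, 0.0]]   # every pixel pair differs by one level
smooth_patterns = contrast_patterns(smooth)
rough_patterns = contrast_patterns(rough)
```

As expected from the discussion of Table 2, the less contrastive matrix has the lower contrast and dissimilarity and the higher homogeneity and similarity.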

Orderliness patterns of GLCM
Orderliness related patterns describe how orderly or regular the distribution of gray scale values in an image is, including angular second moment, energy and entropy. The concept of angular second moment (ASM) comes from physics [21] for measuring rotational acceleration. ASM can be obtained using (9). Its value increases when the distribution is more orderly.

$$\mathrm{ASM} = \sum_{i,j=0}^{N-1} P_{i,j}^{2} \qquad (9)$$

The energy equals the square root of ASM, as in (10). It is often used in fingerprint recognition [38] and plant classification [39].

$$\mathrm{Energy} = \sqrt{\mathrm{ASM}} \qquad (10)$$

Contrary to energy, entropy describes how irregular the current gray scale distribution is, where the value of entropy increases when the distribution is less orderly. Entropy can be expressed as (11).

$$\mathrm{ENT} = \sum_{i,j=0}^{N-1} P_{i,j}(-\ln P_{i,j}) \qquad (11)$$

In Table 2, the orderliness of six different images is measured. The patch in Table 2(a) clearly shows more regular patterns than the patch in Table 2(d), so it can be expected that the entropy of the patch in Table 2(a) is less than that of the one in Table 2(d).
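The orderliness patterns can be sketched as below, assuming the standard definitions (ASM as the sum of squared probabilities, energy as its square root, entropy with the natural logarithm). Treating `0 * ln(0)` as 0 is a convention adopted for the sketch, and the toy matrices are made up.

```python
import math


def orderliness_patterns(P):
    """ASM, energy and entropy of a normalized GLCM given as a list of rows."""
    values = [p for row in P for p in row]
    asm = sum(p * p for p in values)
    # Empty cells are skipped, i.e. 0 * ln(0) is taken as 0.
    entropy = sum(-p * math.log(p) for p in values if p > 0)
    return {"asm": asm, "energy": math.sqrt(asm), "entropy": entropy}


regular = [[1.0, 0.0], [0.0, 0.0]]        # a single pair type: fully ordered
irregular = [[0.25, 0.25], [0.25, 0.25]]  # all pair types equally likely
regular_patterns = orderliness_patterns(regular)
irregular_patterns = orderliness_patterns(irregular)
```

The fully ordered matrix gives the maximum ASM and zero entropy, while the uniform matrix gives a lower ASM and a higher entropy, matching the direction of the comparison made for Table 2.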

Descriptive statistical patterns of GLCM
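Descriptive statistical patterns consist of statistics derived from a GLCM, including mean, variance and correlation. It needs to be emphasized that these patterns describe the statistics of pixel pair relations, not the gray scale values themselves. Two GLCM mean values can be obtained by using (12). Note that because the probability matrix $P$ is symmetric, the two mean values are identical.

$$\mu_i = \sum_{i,j=0}^{N-1} i\,P_{i,j}, \qquad \mu_j = \sum_{i,j=0}^{N-1} j\,P_{i,j} \qquad (12)$$

The GLCM variance and standard deviation can be obtained through (13).

$$\sigma_i^2 = \sum_{i,j=0}^{N-1} P_{i,j}(i-\mu_i)^2, \qquad \sigma_j^2 = \sum_{i,j=0}^{N-1} P_{i,j}(j-\mu_j)^2, \qquad \sigma_i = \sqrt{\sigma_i^2}, \qquad \sigma_j = \sqrt{\sigma_j^2} \qquad (13)$$

Finally, according to the calculated mean and variance, the GLCM correlation can be obtained by (14).

$$\mathrm{Correlation} = \sum_{i,j=0}^{N-1} P_{i,j}\,\frac{(i-\mu_i)(j-\mu_j)}{\sqrt{\sigma_i^2\,\sigma_j^2}} \qquad (14)$$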
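The descriptive statistics can be sketched as follows, assuming the usual GLCM definitions of mean, variance and correlation over the row index `i` and the column index `j`. Returning NaN when a variance is zero is a choice made for the sketch, and the toy matrices are hypothetical.

```python
import math


def descriptive_patterns(P):
    """GLCM mean, variance, standard deviation and correlation."""
    n = len(P)
    cells = [(i, j, P[i][j]) for i in range(n) for j in range(n)]
    mean_i = sum(i * p for i, j, p in cells)
    mean_j = sum(j * p for i, j, p in cells)
    var_i = sum(p * (i - mean_i) ** 2 for i, j, p in cells)
    var_j = sum(p * (j - mean_j) ** 2 for i, j, p in cells)
    if var_i == 0 or var_j == 0:
        correlation = float("nan")  # undefined when only one level occurs
    else:
        covariance = sum(p * (i - mean_i) * (j - mean_j) for i, j, p in cells)
        correlation = covariance / math.sqrt(var_i * var_j)
    return {
        "mean": (mean_i, mean_j),
        "variance": (var_i, var_j),
        "std": (math.sqrt(var_i), math.sqrt(var_j)),
        "correlation": correlation,
    }


aligned = descriptive_patterns([[0.5, 0.0], [0.0, 0.5]])  # neighbors always equal
opposed = descriptive_patterns([[0.0, 0.5], [0.5, 0.0]])  # neighbors always differ
```

For a symmetric probability matrix the two means and the two variances coincide, which the returned pairs make easy to check.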

GLCM signature modeling
In this section, patterns of GLCM matrices are modeled as signatures for crowd motion classification. Six STT patches are extracted at different parts of the STV model in Table 2. A texture patch with normal behavior usually has a higher ASM value than patches in an abnormal state. Moreover, among all patterns, contrast, ASM, entropy and variance show the most significant changes between normal and abnormal states. Thus, these four GLCM-based patterns are selected as the most appropriate measures for detecting abnormal crowd states, and are denoted accordingly in Table 2. Fig. 8(a) displays the gray scale image transformed from a STT obtained in Fig. 2(d); the actual test video is chosen from the University of Minnesota (UMN) dataset. All videos from this dataset start with a normal crowd scene followed by an abnormal event, mostly panic behavior. The ground truth of normal and abnormal behaviors is manually marked on Fig. 8. Variance, standard deviation and correlation values also change slightly. Hence, in summary, the contrast (CON), angular second moment (ASM), entropy (ENT) and variance (VAR) are selected as candidates for forming the STT signature vector for classification due to their salient variation magnitude. As the linear version of contrast, dissimilarity is discarded to control the dimension of the signature; the same decision process has been applied to the standard deviation and correlation. The final signature (SIG) for classification is modeled as (15).

$$\mathrm{SIG} = [\mathrm{CON}, \mathrm{ASM}, \mathrm{ENT}, \mathrm{VAR}] \qquad (15)$$
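Assembling the four selected patterns into one vector can be sketched as below. The element order (CON, ASM, ENT, VAR) simply follows the order in which the patterns are named in the text, and the function name and example matrix are assumptions; the individual formulas are the standard GLCM definitions.

```python
import math


def glcm_signature(P):
    """Return the tuple (CON, ASM, ENT, VAR) of a normalized, symmetric GLCM."""
    n = len(P)
    cells = [(i, j, P[i][j]) for i in range(n) for j in range(n)]
    mean = sum(i * p for i, _, p in cells)  # row mean; equals column mean if symmetric
    con = sum(p * (i - j) ** 2 for i, j, p in cells)
    asm = sum(p * p for _, _, p in cells)
    ent = sum(-p * math.log(p) for _, _, p in cells if p > 0)
    var = sum(p * (i - mean) ** 2 for i, _, p in cells)
    return (con, asm, ent, var)


# Hypothetical 2-level probability matrix of one texture patch.
signature = glcm_signature([[0.5, 0.0], [0.0, 0.5]])
```

Each 50 by 50 patch then contributes one such four-dimensional vector to the classifier.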

Test and evaluations
In this section, an experimental system equipped with the devised signature model and process pipeline has been constructed to classify the crowd motion videos as shown in Fig. 9. The extracted STT is firstly filtered with the six-orientation Gabor transform to amplify the motion details. Once a STT is processed, it is divided into a collection of texture patches, and the patterns are extracted from these patches to model the signature for classification. In the classification phase, the texture patches are classified with a trained classifier using the modeled signatures. TAMURA texture patterns [40] are also utilized to model a signature for performance comparison. The values of coarseness, contrast, line-likeness and regularity are modeled as a four-dimensional TAMURA signature.
Several classifiers are implemented on these two signatures to assess the performance, including the K nearest neighbor (KNN), Naïve Bayes, discriminant analysis classifier (DAC), random forest and support vector machine (SVM). In the training stage, extracted STTs with congestion and panic scenarios are divided into manually labeled texture patches to train the classifier. The texture patches for training are categorized into four different types: empty, normal, congested and panic. The empty texture contains no pedestrians but only background. The normal texture contains pedestrians walking casually in a scene. The congested texture contains pedestrians moving slowly at high density. The panic texture contains pedestrians escaping at high velocity.
Once the classifier is trained, STTs for testing are firstly divided into patches and the patterns are extracted to model the signature for classification. The details of the parameter settings for the classifiers are as follows. The size of the patches is set to 50 by 50 pixels. For the KNN, the number of neighbors is set to 4, since in the training phase only four types of anomaly are defined. For the random forest classifier, the number of trees is set to 5. The parameters of Naïve Bayes, DAC and SVM are set to their defaults. One of the classification results is shown in Fig. 10, where the KNN classifier is applied on the GLCM signature. The blue line grid marks the boundary of each divided patch. The dash stands for the empty texture, the cross for the normal texture, the triangle for the congested texture, and the oblique cross for the panic texture. Since an agent's velocity is higher in an uncongested state, its spatial shift over time will be larger than when it is in a congested state. As a consequence, the texture stripes in the STT will have a larger slope value. On the contrary, textures containing congested behaviors will have parallel stripes with a gentle slope. Therefore, visually, a texture patch with more horizontal stripes stands for the congestion behavior, and one with more vertical stripes stands for the normal state. In summary, congested texture patches have relatively smaller contrast, entropy and variance values and a larger angular second moment value.
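The KNN step with four neighbors can be sketched as follows. Everything in the example is hypothetical: the training signatures are made-up numbers in the order (CON, ASM, ENT, VAR), the distance is plain Euclidean without feature scaling, and ties are resolved in favor of the label whose closest member is nearest, a rule chosen for the sketch rather than taken from the paper.

```python
import math
from collections import Counter


def knn_classify(train, query, k=4):
    """Classify a signature by majority vote among its k nearest neighbors.

    train is a list of (signature, label) pairs; query is a signature tuple.
    """
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest = [label for _, label in ranked[:k]]
    votes = Counter(nearest)
    # Ties go to the label whose closest member ranks first.
    return max(votes, key=lambda label: (votes[label], -nearest.index(label)))


# Made-up training signatures (CON, ASM, ENT, VAR) for the four patch types.
train = [
    ((0.1, 0.90, 0.20, 0.10), "empty"),
    ((0.2, 0.85, 0.30, 0.20), "empty"),
    ((0.1, 0.88, 0.25, 0.15), "empty"),
    ((0.8, 0.45, 1.20, 1.00), "congested"),
    ((0.9, 0.40, 1.30, 1.10), "congested"),
    ((0.7, 0.50, 1.10, 0.90), "congested"),
    ((2.0, 0.20, 2.50, 3.00), "normal"),
    ((2.2, 0.18, 2.60, 3.20), "normal"),
    ((1.9, 0.22, 2.40, 2.90), "normal"),
    ((5.0, 0.05, 3.80, 6.00), "panic"),
    ((5.3, 0.04, 3.90, 6.30), "panic"),
    ((4.8, 0.06, 3.70, 5.80), "panic"),
]
label_a = knn_classify(train, (0.85, 0.44, 1.25, 1.05))
label_b = knn_classify(train, (5.1, 0.05, 3.8, 6.1))
```

With an even number of neighbors, ties between two classes are possible, which is why the sketch needs an explicit tie-breaking rule.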
The TAMURA signatures for the same STT set have also been fed to the KNN classifier to compare performance. The result is shown in Fig. 11. Compared with Fig. 10, a number of texture patches with no motion patterns are marked as normal, and some with normal pedestrian behavior are marked as congested, as highlighted in Fig. 11. The comparison indicates that the GLCM-based signature (feature) outperforms TAMURA in detecting crowd motion patterns. The detection of panic scenes is also carried out. In Fig. 12, STTs extracted from the UMN dataset are processed using the proposed procedure, and the GLCM and TAMURA texture patterns are compared. Fig. 12(a) shows the detection result using GLCM, and Fig. 12(b) shows the detection result using TAMURA. As in Figs. 10 and 11, agents exhibiting panic behavior are likely to move faster. Thus, texture patches with panic behavior show stripes with a larger slope.
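The KNN step used in both comparisons can be sketched as a majority vote among the nearest training signatures. This is a minimal illustration under stated assumptions: signatures are plain numeric vectors, distance is Euclidean, and ties among the k = 4 neighbors go to the label of the closest tied neighbor. The paper does not specify the distance metric or the tie rule, and the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=4):
    """Return the majority label among the k training vectors nearest to query.

    Ties are resolved in favor of the label whose member is closest to the
    query (an assumption of this sketch).
    """
    if len(train_vectors) != len(train_labels) or not train_vectors:
        raise ValueError("training vectors and labels must match and be non-empty")
    # Indices of training samples, sorted by Euclidean distance to the query.
    order = sorted(range(len(train_vectors)),
                   key=lambda i: math.dist(train_vectors[i], query))
    nearest = order[:k]
    votes = Counter(train_labels[i] for i in nearest)
    best = max(votes.values())
    # Walk neighbors from closest to farthest; the first label holding the
    # top vote count wins, which implements the tie rule described above.
    for i in nearest:
        if votes[train_labels[i]] == best:
            return train_labels[i]
```

Because four neighbors vote over four texture types, ties are possible, so an explicit tie rule is needed in any implementation of this setting.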
In order to measure the performance, all sample test patches are manually labeled with the four texture types defined in the training phase. If the classification result of a patch equals its labeled ground truth, it is counted as a correct detection and its label value is set to 1; otherwise it is counted as a failed detection and its label value is set to 0. The detection accuracy can then be calculated using (16). Table 3 shows the accuracy for various combinations of signatures and classifiers.
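The 0/1 scoring described above can be sketched as follows. Equation (16) is not reproduced in this section, so the sketch assumes that the accuracy is simply the fraction of patches whose predicted type matches the ground truth. The function name `detection_accuracy` is hypothetical.

```python
def detection_accuracy(predicted, ground_truth):
    """Fraction of patches whose predicted type equals the labeled ground truth.

    Each patch scores 1 for a correct detection and 0 for a failed one;
    averaging these scores is an assumed reading of the accuracy measure.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        raise ValueError("at least one patch is required")
    scores = [1 if p == g else 0 for p, g in zip(predicted, ground_truth)]
    return sum(scores) / len(scores)


# Example: one of four patches is misclassified (normal marked as congested).
accuracy = detection_accuracy(
    ["empty", "normal", "congested", "panic"],
    ["empty", "normal", "normal", "panic"],
)
```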

Conclusions and future work
Real-time and effective monitoring of high density crowds for public safety is in increasing demand in the real world. In this research, a novel crowd anomaly detection framework is proposed that supports the continuous feed-in of spatio-temporal information from live CCTVs. Novel STT selection, filtering, and feature modelling techniques have been devised and tested. Evaluation against state-of-the-art benchmarking systems yields satisfactory results. High level semantic studies of the identified motion features will also be investigated in the future.

Fig. 1 A general structure of crowd abnormal behavior detection system

In Fig. 2(c), the STV is sliced either horizontally or vertically at a certain position along the time axis to obtain STTs, and Fig. 2(d) shows an example of extracted STTs describing pedestrians' motion through time.
(a) Consecutive frames (b) Stacked frames to form spatio-temporal volume (c) Vertical STV slice along time axis (d) Obtained spatio-temporal texture

Fig. 2 Procedures to obtain STV and STT from raw video data

Fig. 3
Fig. 4 Entropy values of random STTs
[...] shows the detailed steps of the procedure. The first and second rows illustrate the filtered STTs in eight orientations. Note that the parameters of the Gabor filter are adjusted accordingly. In this case, the values of [...] and [...] are set to 2, and [...] is set to [...] on Fig. 6(b)-6(e) and [...] on Fig. 6(g)-6(j). Once the filtering steps are completed, all 8 filtered STTs are accumulated to form a combined one, as shown in Fig. 6(f), where Fig. 6(a) is the original STT.

Fig. 5 Updated structure of the proposed STT extraction technique

Fig. 6 Gabor filtering results along eight directions

Fig. 7 Structure of the proposed approach
Patches (a)-(c) are obtained from texture with normal motion, and Patches (d)-(f) are obtained from texture with abnormal motion. By comparing the pattern values of normal and abnormal patches, the following patterns can be identified. Firstly, a texture patch in a normal state usually has lower contrast, entropy and variance, e.g., Patches (a)-(c) all have lower contrast than Patches (d)-(f).
[...] (a), by using color bars at the bottom of the figure. The grey color indicates the normal state and the black color indicates the abnormal state. It can be observed that the different visual patterns of this figure match the labeled ground truth. It is expected that the differences in patterns will be reflected in the defined STT signatures too. According to the definition of the STT, the column index represents the frame index in the original video; thus, by summing up each column of values calculated from the GLCM texture features, the change of GLCM feature patterns over time can be quantified and evaluated.

Figs. 8(b)-8(e) show the trends of the contrast patterns of the STT in Fig. 8(a). As the anomaly occurs, patterns that describe pixel pair dissimilarity, such as contrast and dissimilarity, increase rapidly. However, patterns describing pixel pair similarity, such as homogeneity and similarity, do not change significantly. Figs. 8(f)-8(h) show the trends of the orderliness patterns of Fig. 8(a). When the anomaly occurs, patterns describing image irregularity, such as entropy, increase quickly, while the angular second moment drops significantly, though the energy holds steady. Figs. 8(i)-8(l) show the trends of the statistics-related pattern measures. When the anomaly occurs, the mean value does not change significantly.

Fig. 8 Trends of GLCM patterns along time

Fig. 9 Structure of proposed classification approach

Table 1
Results of selected target STTs' information entropy values

Table 2
Comparison between texture patterns of STT patches

Table 3
Accuracy of multiple signatures and classifiers combination