1 Introduction

A crowd is defined as a gathering of humans in the same area. If the number of individuals exceeds normal conditions, congestion becomes a concern regarding safety, health, or what may affect human choices due to herd culture. The concept of congestion differs depending on the culture of communities.

For example, gathering over 100 people in India is considered normal, while in other countries like Canada it may be considered a crowd [1]. Analysis, detection, management, and monitoring of crowds are a growing trend in computer sciences to study the behavior of human crowds in any event. Human gatherings may happen due to religious rituals (such as Hajj for Muslims at Makkah, Saudi Arabia, and Kumbh Mela for Hindus at Haridwar, India), sporting events (such as FIFA World Cup and Olympic Games), concerts, or annual carnivals (such as Carnival Parade and Riyadh Season [2]). Moreover, demonstrations, popular protests, and political or social riots (such as a political rally in Los Angeles [3]) are considered gatherings that may impact a crowd risk index and bring many unexpected reactions. Each type of human gathering has its own features: purposes of the individuals, behavior, place, and time. To avoid accidents, the organizers must perform prior analyses and studies for the mass gatherings.

Crowd analysis is one of the most crucial tools for crowd management [4]. Hence, crowd management requires profound and comprehensive plans in advance and flexible and agile strategies to take vigorous and accurate action if unexpected incidents occur.

Examples of common general issues in crowd management in the context of urban planning for smart cities are traffic, crowded pedestrians, pollution, energy consumption, etc. In urban planning for smart cities, it can be possible to predict crowd behavior abnormalities and the future evolution of these situations in order to prevent them and do the best decision-making and planning. In addition, the capacity to monitor, control, and predict the behavior of crowds is a fundamental enabling driver. Where predicting such notifications can be of effective help in a large variety of situations, such as organizing events, organizing pedestrians, managing situations of emergency, or even tracking how the pandemic spread through the urban areas. A violation of crowd management will lead to several consequences that may result in a loss of lives or property, besides the loss of people's confidence in the organization responsible for organizing the event in the future.

Billions of people around the world now have accounts on social media platforms to freely express their beliefs, opinions, or impressions about some things. This huge streamed data gives an opportunity for researchers in the data analysis domain to explore about behavior of people through their text content [5]. We believe big data may open other horizons in crowd management. Abnormal behavior detection or crowd-counting is now possible through these data. However, a lot of works of crowd management lack to attention the textual data analysis aspect of social media, especially in local crowd management works.

This review has conducted extensive examinations in this area, however, a lot of works for crowd management still have limited in using one particular data namely visual data. The main drawbacks and limitations faced by current crowd management are discussed as follows:

  • Existing works for crowd management are currently recurrent, meaning they do not take into account the collected data sources changes about the people during the behavior detection. Hence, current models may not fully apply on the multi-dataset, limiting their effectiveness with similar scenarios.

  • Collecting the datasets for existing models requires consideration of equipment or hardware such as installed cameras, live streaming channels, etc., that increases the cost of running the processes and maintenance besides the computational cost of the models.

  • Most of the existing research uses the same video dataset to study the behavior detection of crowds. Therefore, limiting their model's effectiveness with learning new patterns.

  • There is no dependency on data of social networks as one of the sources of data collection.

For these reasons, the objectives of this SLR are an overall survey of the concept of crowd management from two perspectives, crowd management in various world events and Hajj events especially. The authors endeavor through this work to make it a great reference point for other researchers in the crowd management domain. This systematic review provides a theoretical understanding of deep-learning techniques used in the various branches of crowd management. The review will also highlight factors that impact crowd modeling works such as limited patterns of datasets, applications generalizability, and evaluation metrics. Moreover, the review highlights the importance of urban planning integration, which leads to improving the quality of life for individuals and society. This work emphasizes is also necessary to draw attention to exploiting the various big data in social media as an important tributary of building novel datasets with a diversity of knowledge and patterns. The main contribution of the paper is to examine and summarize the state-of-art technologies and methodologies in the behavior detection of a crowd to apply them in Hajj research to improve the experience of pilgrims and provided services. Hence, the review scope determines three main points as follows: First, our survey summarizes the various technologies, approaches, and models that have been utilized to design and execute solutions to detect behaviors that control and monitor the crowds during any world events. Next, the survey summarizes and review the methods used for Hajj issues. Last, our survey covers shortcomings or defects in previous Hajj studies and how other research related to various crowd management may help improve the management, monitoring, and control of crowds during the Hajj season.

This paper aims to highlight these three points through a comprehensive literature survey and focuses on crowd behavior detection for crowd surveillance and prediction. The purpose of RQs is to give a high level of precise topics which is extremely focused on examining the previous literature. The RQs have extreme significance in an SLR, due to controlling the distinguishing and identification of primary research needed to be involved in the review. Consequently, well-defined, logical, interesting, and relevant research questions should be articulated [6], at the discretion of the authors [7]. The review's contributions will answer the following research questions (RQs):

RQ1: What is the taxonomy of crowd analysis for Deep Learning-based works?

RQ1 aims to find the taxonomies of previous studies based on deep learning (DL) approaches. The answer is explained in Sect. 3: “Related Work.”

RQ2: What are the approaches used in crowd management works?

RQ2 aims to identify the DL approaches used for crowd management at various places around the world. The answer is explained in Sect. 4: “A Comprehensive Study of Crowd Management.”

RQ3: What are the approaches used in Hajj crowd management works?

RQ3 aims to identify the DL approaches used for crowd management during the Hajj season. The answer appears in Sect. 4.6: “Crowd Management at Hajj Event.”

RQ4: What are the most commonly used techniques and algorithms in prior works? and the challenges faced them?

RQ4 aims to discover the most used techniques and algorithms of DL in prior works for crowd management at global around the world and local scope during the Hajj season. The answer is explained in Sect. 5: “Analysis of Comprehensive Study of Crowd Management”.

In conclusion, a systematic review of the research studies in terms of global perspectives on crowd management can help provide insights into the scope and development of this field in Hajj events and establish a comprehensive conceptual framework, which can ultimately improve the pilgrims' experience and the religious rites practices comfortably. The rest of this paper is organized as follows. Section 2 explains the research methodology. Taxonomies of previous studies based on DL approaches appear in Sect. 3. A summarization of the current methodologies of crowd management at various events appears in Sect. 4. Section 5 summarizes the current approaches utilized during the Hajj events from 2010 until 2023. Section 5 displays analysis of comprehensive study of crowd management also analysis of crowd management at hajj event. The authors discuss gaps and directions in Hajj studies and compare them with state-of-the-art research on other events in Sect. 6. Finally, Sect. 7 presents a conclusion of the survey and discuss future work approaches are presented.

2 Research methodology

This section offers the methodology followed to complete this research. The paper [8] has presented several steps for writing a systematic literature review (SLR). An SLR draws a thoughtful methodology to determine the mechanisms of exclusion and inclusion criteria for scientific papers articles. Moreover, the guidelines of SLR identify gaps in current research and extracts final results based on our RQs. This review was performed in four phases, and the following sections explain each phase. Figure 1 shows the review protocol that illustrates the plan to complete this paper.

Fig. 1
figure 1

Protocol of review to complete this paper

2.1 Phase (1): preliminary search

Initially, verification of previous related work. The authors checked that no SLR covers the topic of crowd management by the analysis of textual data of users. The enormous amount of data spread every second across various Social Media Platforms (SMP), such as Twitter, Facebook, Instagram, and many others, is adequate evidence that aspect extraction of textual data to study it has become urgent. Therefore, the authors use this review of the outputs from past related works to address their gaps and shortcomings.

Second, identification of relevant online databases. The authors selected the superior databases that are interested in computer science: SpringerLink, Science-Direct, IEEE Xplore Digital Library, ACM Digital Library, MDPI, Google Scholar and Web of Sciences. Next, the authors determined the starting and ending publishing dates for the articles in the review. This review selected 2010 as the starting date and 2023 as the ending date. This timeframe was chosen because it is the growth of AI research. In the early 2010s, researchers began to use neural networks for speech recognition and image processing, which has significantly improved performance and then spread neural networks widely in the commercial, healthcare, finance, transportation, and crowd control fields. In 2013, the field of computer vision began to transition using neural networks. The same transition occurred in natural language processing in 2016 until today [9]. In the future, similar revolutions will occur in visual robotics and many other AI fields. The searches were narrowed to journals published during the desired span. Table 1 presents the number of scientific paper articles obtained from each database and clarification for the initial and final results of the search.

Table 1 The overall number of papers obtained from the database

Third, detection of keywords and their synonyms used in crowd management. Keywords of research that have been applied for finding articles in these databases are as follows: Crowd AND (“Management” OR “Analysis” OR “Tracking” OR “Monitoring” OR “Controlling” OR “Counting” OR “Density Estimation” OR “Abnormality Detection” OR “Behavior Analysis” OR “Crowd flow” OR “Mass Gathering” OR “Congestion Analysis Detection” OR “Predicting Human Behaviors” OR “Pedestrian”) AND (“Sentiment Analysis” OR “Opinion mining”) AND (“Deep Learning” OR “Machine Learning” OR “Convolutional Neural Network” OR “CNN”) AND (“Social Media” OR “Twitter”) AND (“Hajj” OR “Makkah” OR “Mecca”).

2.2 Phase (2): investigative search

Criteria of inclusion and exclusion. We selected strict criteria to pick studies to be included in our review and those that must be excluded. The objective of inclusion criteria is to choose all papers describing the concept of opinions mining of crowds through DL techniques. Otherwise, it will be exclusion criteria of papers in order to limit the scope of the review and remain focused on the targeted RQs. Inclusion criteria are as follows:

  1. (1)

    Papers that were published from the year 2010 to 2023.

  2. (2)

    Papers written in the English language.

  3. (3)

    Papers selected for publication in a journal.

In terms of exclusion criteria are as follows:

  1. (1)

    Papers that are from a conference or a book.

  2. (2)

    Papers that do not extract specific databases.

  3. (3)

    Duplicate papers.

  4. (4)

    Papers that contain irrelevant keywords.

Figure 2 illustrates the criteria of exclusion and inclusion followed for this review.

Fig. 2
figure 2

Criteria of exclusion and inclusion for this review

2.3 Phase (3): study quality assessment

Quality assessment (QA) of selected studies is a critical strategy for data synthesis and analysis to avoid bias and increase the selection of literature. The QA questions estimate the relevance, truthfulness, and rigorousness of the selected studies. Every one of the questions has only three optional answers derived from the study in [10], where “YES” = 1, “NO” = 0, and “Partly” = 0.5. as shown in Table 2. Besides the QA questions, it has placed other criteria to prevent potential biases. For instance, clarification of studies included and excluded accurately. Comprehensive examination during the selection and publication stages several times. Formulating review protocols according to the sober methodology [8]. The assessment selection was from one of the researchers of this paper. The researchers have followed the mentioned standards rigorously to avoid the dominance of individual personal opinions and potentially biased decisions. The included papers should be achieved at least 2 of QA, otherwise, it will be overridden, as shown in Table 3.

Table 2 Questions of quality assessment of this review
Table 3 The included papers according to standers of quality assessment

2.4 Phase (4): analysis of search

Transparency during the assessment process is conceived as a non-functional quality of the stakeholders of projects. Transparency is an essential factor that can be performed to ensure the stakeholder's satisfaction with the quality of assessment [11]. Therefore, transparency requirements should be clarified regarding the inclusion and exclusion criteria that are used for selecting the primary studies, which have to fulfill them to sustainability for quality and transparency of research. Consequently, our methodology identified 45 studies that we applied to the assessment of objectives of this paper. Figure 3 illustrates the peak appearance of research in the publication year 2021. Whereas Fig. 4 displays the percentage of papers obtained from each database.

Fig. 3
figure 3

Distribution of the papers from 2010 to 2023

Fig. 4
figure 4

Percentage of papers obtained from each database

According to our criteria of exclusion and inclusion papers, as shown in the Figs. 3 and 4, the papers started spread from 2014 to 2023, the researchers note that the publication was at the highest levels in 2019, 2020, and 2021 years. In addition, the SpringerLink database was achieved highest published, whereas ACM database was got the lowest published than other databases. Finally, after applying the above filters of standards did not obtain unique papers in both Google Scholar and Web of Science. Most of the existing papers were duplicates of papers in another publishing database or did not meet our requirements and standards.

3 Related work

Since the last decade, the preceding reviews have illustrated that crowd analysis is studied from several different aspects. For instance, there are computer science [12], sociology-based [13], biology-based [14], and physics-based [15] approaches. Some of these works concentrate on the research axis, and others concentrate on various sides of the research axes as subtopics. In terms of computer science, there are two main types: traditional approaches from the period of pre-DL methods and DL methods [16]. DL techniques are a valuable addition to constructing the ideal models in many fields like Defect Prediction in Software (DeP) [17], improving Search-Based Software Testing (SBST) [18], improving the mechanisms of Detection of DDoS Attack[19], remotely imagery classification for unmanned aerial vehicles (UAV) [20]. Generally, achieving high-level intelligence, high robustness, high accuracy, big data, and low power consumption for artificial intelligence approaches are considered the critical challenges that faces the researchers. The authors in [21,22,23,24] have sought to address these issues. In this section, our review discusses the most other important reviews. Those that focus on the DL side and large datasets. DL algorithms are more properly suited and effective to address concerns related to the variety, volume, and accuracy of big data analytics. Furthermore, DL algorithms inherently exploit the availability of enormous amounts of data to explore and understand the higher-level complexities of various data patterns. Thus, minimizing the need for human experts to extract features from data [25].

The reviews aim to offer a panoramic vision of crowd analysis in the deep learning domain. Each previous survey was studied and organized into subsections to classify its authors.

Grant and Flynn [26] divided crowd analysis into two wide classes, crowd behavior analysis and crowd counting, which include several subsections. Crowd behavior analysis has four subsections: abnormal behavior analysis, dominant motion extraction, crowd analysis and tracking, and group behavior analysis. It focuses on behavior detection of individual scenes at first. Then, it describes group behavior within a crowd, crowd motion, and detection of an abnormal event. On the other hand, crowd counting contains six subsections: density mapping, joint detection and counting, line counting, texture-level analysis, object-level analysis, and pixel-level analysis. It focuses on behavior detection of individual scenes at first. Then, it describes group behavior within a crowd, crowd motion, and detection of an abnormal event. On the other hand, crowd counting contains six subsections: density mapping, joint detection and counting, line counting, texture-level analysis, object-level analysis, and pixel-level analysis.

It discussed the metrics used to estimate the density of a crowd, the Level of Service (LoS), and traffic flow. Moreover, they displayed datasets available according to crowd activity video research. Datasets fell into five categories: crowd counting (UCF_CC_50 dataset [69], UCSD dataset [70], and WorldExpo’10 Dataset [71]), group detection (Collective Motion dataset, The Museum Visitors dataset, student003 dataset, The Mall dataset [72], and the Grand Central Station dataset), behavior understanding (PETS2009 dataset [73], Collective Activity dataset, and The Unusual Crowd Activity dataset), holistic crowd movement (Chinese University of Hong Kong dataset (CUHK) [74, 75], The Meta-Tracking dataset, Data-Driven Crowd Analysis dataset, and Crowd Segmentation dataset), and synthetic (The Agoraset dataset, Seven Environments/scenes).

Tripathi et al. [1] concentrated on studies that included Convolutional Neural Networks (CNNs). The authors have divided the previous studies into four classes: The first class summarizes influential portions of the CNN for handling crowd behavior analysis. The second class summarizes the primary studies proposed that focus on CNNs. The third summarizes studies that use CNNs incorporated with other architectures from deep learning. It includes four types, crowd counting, crowd density estimation, crowded abnormality analysis, and crowded scene analysis. The fourth summarizes studies that use CNNs to extract features and classifiers. Moreover, the authors highlighted opportunities, features, and challenges for future research in the crowd analysis domain. Furthermore, the authors displayed some of the datasets used in CNN-based crowd analysis: WorldEx po10, PETS2009 [73], WorldExpo’10 [71], Pedestrian dataset, UCLA, Dyntex++, DynTex, (WWW) crowd dataset, BEHAVE, NUS-HGA, UCF_CC_50 [69], ShanghaiTech [76], UMN [77], Mall [72], Rare Events Dataset (RED) [78], and City Dataset [79].

Li et al. [80] summarized the main concepts of crowd behavior analysis in terms of the Crowd Dynamics concept. It considers a crowd as either a set of individuals such as the Social Force Model or a fluid such as concepts of thermodynamics and statistical mechanics by computer vision. The survey divided the reviewed studies into three classes, anomaly detection, motion pattern segmentation, and behavior recognition. First, crowd motion pattern segmentation analyzes motion patterns in areas of crowded scenes. Several methods have been proposed based on the cluster of the motions or segment principle. For instance, flow-based segmentation, similarity-based clustering, and probability-model-based clustering. Next, crowded anomaly detection has been classified into two sections, global anomaly detection and local anomaly detection, i.e., where does the anomaly occur? Does the scene include an anomaly case or not? Lastly, crowd behavior recognition is classified into object and holistic-based.

Kiran et al. [81] discussed the detection and prediction of anomalies by defining rare events and detecting unseen objects. Furthermore, the authors present the related works that used DL, unsupervised and semi-supervised methods for anomaly detection in video scenes. They classified their survey according to detection criteria and types of models (deep generative models, predictive models, and reconstruction learning models). Each of these types has several subtypes. Representation learning for reconstruction uses models and methods of normal behavior in surveillance videos to represent deviations in poorly reconstructed anomalies. Examples include principal component analysis, autoencoders, convolutional autoencoders (CAEs), CAEs for video anomaly detection, contractive autoencoders, and other deep models (like stacked DAEs (SDAEs), de-noising autoencoders (DAE), and deep belief networks (DBNs)). Predictive modeling contains four subsections, composite model, convolutional Long Short-Term Memory (LSTM), 3D-autoencoder and predictor, and slow feature analysis (SFA). SFA is used to view video frames as time series or temporal patterns to predict the existing frame or its encoded representation utilizing the previous frames. Lastly, Deep generative models consist of eight subsections: Generative vs. Discriminative, Variational Autoencoders (VAEs), Anomaly Detection Using VAE, Generative Adversarial Networks (GANs), GANs for Anomaly Detection in Images, Adversarial Discriminators Using Cross-Channel Prediction, Adversarial Autoencoders (AAEs), and Controlling Reconstruction for Anomaly Detection. They are employed to model the probability of samples of normal video in a deep learning framework.

Bendali-Braham et al. [16] proposed a novel taxonomy for crowd analysis that includes two branches, crowd behavior analysis and crowd statistics. Crowd statistics determine the number of people currently in a scene. It includes two subbranches, crowd counting and density estimation. Crowded scene analysis is divided into crowd behavior recognition, motion tracking and prediction, and group behavior recognition for human behavior analysis in a crowded scene. Furthermore, crowd activities and motion patterns are described in video scenes and when crowd statistics determine the LoS. Al-Shaery et al. [82] tackled an inclusive review of crowd management, from the discovery of crowded places to crowd monitoring and management. They focus attention on systems of crowd management that require a well-designed decision support system (DSS), as well as the systems that have early warning capabilities to realize the primary goal of gatherings which is crowd safety. They divided their taxonomy into two branches: crowd detection and crowd monitoring and tracking analysis. The last section includes the crowd management and control stage that leads to the crowd DSS stage. They considered the crowd management stage as the intermediate between monitoring and the Crowd DSS stage. Ebrahimpour et al. [62] reviewed the studies of crowd analysis based on various data sources. They divided their taxonomy into three classes, crowd social media analysis, crowd spatiotemporal analysis, and crowd video analysis with some subsections. Crowd spatiotemporal analysis uses a data source generated by transportation that is monitored with Global Positioning System (GPS), such as shared bikes or buses. In terms of crowd social media analysis, it exploits check-in data that have been taken from geo-tagged social microblogs for crowd analysis. The data analysis process contains four steps, discovery, gathering, preparation, and analysis. Finally, crowd video analysis includes two sections with subsections inside them: crowd video behavior analysis (microscopic modeling and macroscopic modeling) for generating trustworthy trajectories for pedestrians as well as crowd video action recognition (single person action recognition and group activity recognition) for single or group activity surveillance, tracking people, objects, sports video analysis, and action recognition.

In summary, the studies and surveys above used various taxonomy according to their perspectives. Most of these studies focused on crowd behavior and motion analysis based on the captured video scenes. One of them focused on crowd spatiotemporal analysis based on GPS data, owing to the ability to collect data automatically remotely by mobile sensing and mobile computing [83]. Obviously, there is no exploit on social media data, this paper investigates this scope with the best technologies. Our review is distinguished from others that we study general cases of crowd management and analysis, in addition to local studies in the Hajj season to discover the flaws and difficulties facing the Hajj authorities in order to avoid disasters and accidents among crowds.

4 A comprehensive study of crowd management

Literature that discusses crowd management in various past universal events. Through a methodical literature review, they have classified crowd analysis into several types according to the purpose of the study. Crowd scene analysis, social media-based analysis, and crowd sound emotion recognition are the main types of crowd management. Each one has some subsections below, according to our taxonomy in Fig. 5. Studies discussing the analysis of crowds with various purposes employ DL algorithms. Researchers seek to use the newest of these technologies to achieve the highest performance and accuracy possible. Table 4 illustrates the statistics for papers obtained from each subsection.

Fig. 5
figure 5

Taxonomy of crowd management

Table 4 Statistics for papers obtained in each subsection

4.1 Crowd detection

To avoid accidents, it is crucial to know when the people will gather. Then, the organizers must perform in-depth prior analyses and develop comprehensive plans for these mass gatherings. Crowd analysis is a vital tool for crowd management [4]. Ordinarily, there will be an advance notice for well-known human gatherings, either religious, sports, carnival events, or always-crowded places such as airports, train stations, stadiums, etc.

Every human gathering has special features regarding the purpose, location, and time as well as the behavior of the people, their beliefs, affiliations, and race. For instance, in 1987, a group of Iranian pilgrims rioted during the performance of the Hajj rituals at Makkah, Saudi Arabia, and, as a result, 402 people were killed and injured [61]. To give another example of religious events in India, Hindus gather to bathe at the Ganga River, Saraswati River, Kshipra River, and Godavari River, where heavy crowds are expected at specific times. On 31 December 2014, on Shanghai New Year’s Eve, there was a stampede, resulting in 36 individuals killed and 47 others injured [60]. Table 5 summarizes the tragedies that happened previously. Therefore, it is critical to adopt crowd management and propose rigorous and flexible strategies to prepare for unforeseen occurrences at any time. If crowd management fails, it will lead to a loss of lives or properties.

Table 5 Summary Of the tragedies that happened previously

4.2 Crowd statistics

Crowd counting and density estimation are characteristic types of crowd analysis. Calculation of crowd counting, and density can be beneficial in planning crowd security and safety. If the crowd size can be estimated at crowded places, such as temples, stadiums, airports, or metro stations, in advance, it would be extremely beneficial for planning alternate strategies for crowd control.

4.2.1 Crowd counting

Several methods have been developed for crowd counting, which include three classes under the methodologies of DL: (1) CNN-based methods [36, 42]; (2) detection-based methods [87]; and (3) regression-based methods [88, 89]. Briefly, detection-based methods utilize detection algorithms, which consider that a crowd consists of the sliding-window detector and individual entities to compute the number of object instances in the detected image [33, 87].

Regression-based methods exist to solve the problem of occlusion. The main ideas of this method are learning a density map and extracting its features from an image to estimate crowd density [33, 88, 89].

Lastly, many works have been developed by CNN-based methods in the crowd counting field due to their successful applications in computer vision.

Kang et al. [27] proposed an adaptive convolutional neural network (ACNN)-based model for counting. It improves the counting precision compared to an ordinary CNN with a similar number of parameters.

Marsden et al. [28] had developed a previous model in [76] of convolutional crowd counting for the high-density crowd. They added several contributions, including a training set increase to minimize redundancy between samples of training to improve counting performance. They also use a single column, deep, fully convolutional network (FCN) for analyzing images with any aspect and resolution ratio.

Sheng et al. [54] proposed a framework based on locality-aware features (LAF) integrated with CNN features to capture more semantic spatial and attributes of the image. Furthermore, they used a vector of locally aggregated descriptors (VLAD) which consider the weights of the coefficients.

Hu et al. [47] used a convolutional neural network (convNet or CNN) structure to extract features of a crowd in a single image to estimate the crowd count. Their approach was based on CNN and appropriate for a mid-level or high-level crowd. Similarly, Kumagai [29] adopted CNNs with fixed weights to reduce the fault rate when counting a crowd.

Dai et al. [84] proposed improved approaches to crowd flow prediction, whose goal is to count the incoming and outgoing numbers of people in urban regions. The approaches were based on a spatiotemporal attention mechanism with a simplified deep spatiotemporal residual network. The first one captures information about the spatial correlations on crowd flows and finds the regions with positive impacts. The second one reduces training time and gives the best prediction performance compared with similar approaches.

Gong et al. [30] used existing images on social media to estimate the number of people in crowds at city events. This study is the first to count crowds from this side, unlike prior studies that used datasets from popular sources such as video surveillance data. They constructed a novel dataset of images collected from social media for diverse events and major activities in the city. Each image is annotated with its characteristics and the size of the crowd. They applied four methods of two types, direct methods (Faceplusplus and Darknet Yolo) and indirect methods (Cascaded method A and B), to crowd size estimation analysis. The results showed that direct methods achieve higher accuracy than indirect methods. Specifically, Darknet Yolo achieves the best accuracy in estimating the crowd size level (72.01%) and the number of people (38.09%). This study provides a novel method to count people via the advantage of their visual posts on social media.

Huang et al. [31] solved the problem of noise in the areas with different densities, which appeared in a previous study that used a multi-column convolutional neural network (MCNN) method. The authors proposed a novel method named a segmentation-aware prior network (SAPNet). Using a map of coarse head-segmentation, they produced a map of high-quality density without noise. SAPNet contains two networks, CR-CNN as the back end and FS-CNN as the front end. They are a crowd-regression convolutional neural network and a foreground-segmentation convolutional neural network, respectively. FS-CNN produces a map of coarse head-segmentation, then this map is inputted to CR-CNN to perform a highly accurate crowd counting to produce a high-quality density map. The four datasets that tested their approach were WorldExpo’10 [71], UCF-CC-50 [90], UCSD [70], and ShanghaiTech [76]. It has achieved high performances on the UCF-CC-50 and ShanghaiTech part B datasets. However, the WorldExpo’10 dataset [71] was unsuitable for their method because the raw images are of low precision. Furthermore, a poor Canny-edge map can lead to the generation of a faulty segmentation map. This study succeeds in an efficient solution to the problem of noise in areas with different densities. It will be very beneficial in high-congestion places such as train stations, stadiums, religious gathering.

In the same context, Jiang et al. [32] produced a novel PSDENet method, the people segmentation-based density estimation network. At first, the PSDENet model performs learning and pre-training on virtual synthetic data, then, it transfers these tests to real data. The proposed method has proven effective even though it uses two independent networks, PSDENet and people segmentation network (PSNet). It requires the consumption of much computation.

Zhang et al. [33] proposed a two-task convolutional neural network (T2CNN). It is a novel method for crowd counting that concomitantly learns two tasks, the density map estimation of images and the classification of the tasks of dense degree. Each image has different degrees of density, and local regions inside them have different degrees of density. Determining the density degrees of images helps the estimation of the density maps. For this purpose, researchers incorporate the module of T2CNN with dense degree classification (DDC). T2CNN takes the scale of the adaptive CNN as the density maps estimator, then classifies images into several categories based on degrees of density. Therefore, that model is an efficient way to treat the perspective and scale variations in crowd images, according to experimental results performed on common datasets: WorldExpo’10 [71], UCF_CC_50 [69], and ShanghaiTech [76].

Shang et al. [34] developed a new architecture to deal with the perspective variation problems for estimating the number of people in images on the web. The proposed approach has two-stage processing: policy network and count network. A policy network is an estimation of perspective by a regular CNN, while a counting network is a normalization of perspective for the input patches into a scale-specific CNN. Then, given the arranged inputs, they adjusted the scale-specific counting network and their approach to deal with a large perspective variation in web images. In this context, the evaluation metrics were used to verify the model of Xu et al. [91], which gives an average enhancement of 4.68% of Grid Average Mean Absolute Error (GAME), 6.7% of Mean Squared Error (MSE), and 3.68% of Mean Absolute Error (MAE). Also, their experiments were performed on datasets following UCF_CC_50 [69], UCF-QNRF [92], RGBT-CC [93], and ShanghaiTech [76].

Jiang and Jin [35] discussed estimating high-quality crowd density maps and counting crowds by revisiting the design of CNNs to get high-quality density maps as well as high resolution on datasets of crowd counting. For instance, these datasets include UCF_CC_50 [69], UCSD [70], and ShanghaiTech datasets [76]. Their proposed method, multilayer perception counting (MPC), realized high results in a high-quality density map, which is better than counting the crowd. Their method relies on diverse deep supervision (DDS) rather than general supervision, which uses all the intermediate layers or hierarchical in the network. Moreover, MPC is considered the ideal way for cases requiring prediction in real-time.

Khan and Basalamah [36] proposed a unified model to detect human heads in visual images for crowds using regression models with CNNs. The model is based on DenseNet, which contains 174 layers. It handles a wide range of scale differences by integrating scale-specific detectors within the network. Therefore, the network parameters are improved in an end-to-end fashion. The model was applied to difficult benchmark datasets, such as UCSD [70] and UCF-QNRF, and achieved the best results.

Liu et al. [52] proposed a global density feature to add to the multi-column convolution neural network (MCNN) to improve its performance using the cascaded learning method. This model differs from existing works because it concentrates on uneven crowd distribution. Furthermore, deconvolutional layers and the max pooling were utilized to generate a thorough density map and to restore the missing details of the accuracy of the density map during the down-sampling process. The results of experiments prove that this model has higher accuracy and stability when applied to ShanghaiTech [76] and UCF_CC_50 datasets [69].

Kizrak and Bolat [4] used video images or static images to estimate the number of people in a crowd by utilizing CNN with modules of capsule network-based attention. They have proposed a 75,442 VOLUME to crowd analysis using a CNN and two-column cascade and CapsNet as an attention module. The positive impact of the Capsule attention was proven to detect the number of people in images of a crowd. However, this method is still not effective in terms of computational complexity.

Elharrouss et al. [53] provided two contributions, a new method using CNN and the creation of a novel crowd counting dataset taken from the Football Supporters Crowd (FSC-Set). It contains 6000 annotated images of various scenes. FSC-Set can be used for other domains such as localization of individuals, image supporter recognition, and face recognition. The proposed method named FSCNet used several modules: channel-wise attention, spatial-wise attention, and context-aware attention for crowd counting. The results were satisfactory on all the datasets. This research provides a solution to counting people in crowded places based on several attributes. This method can be contributed to aid other studies of crowd counting.

Khan et al. [55] developed a framework using end-to-end semantic scene segmentation (SSS) based on CNN for counting people in a densely crowded image. The framework consists of three components: Density Estimation (DE), classification using optimized CNN, and SSS. Moreover, to solve the problem of scaling variations in images, they used four fields that had sixteen filters to feed output at every stage. Their method has validated four standard datasets such as Shanghai Tech, World Expo, NWPU_Crowd [94], and UCF_CC_50 [69]. Furthermore, they claimed that the crowd counting domain is still an immature research area due to limited data in deep learning.

Zou et al. [67] proposed a model to address ignoring the massive temporal information among consecutive frames when process each video frame independently. The model namely, temporal channel-aware (TCA), it realizes exploiting the temporal interdependencies between video sequences through fusion of 3D kernels of convolution in order to encode local spatio-temporal attributes.

Du et al. [68] redesigned a classical multi-scale neural network to treat challenging of crowd counting. The scheme merges multi-scale density maps. The network uses both the local counting map and the crowd density map to optimization. The experiments results proved that the novel scheme fulfills the state-of-the-art performance on five public datasets such as UCF_CC_50, JHU-CROWD++, ShanghaiTech, Trancos, and NWPU-Crowd.

At the end, Most of studies above developed their architectures based on CNN features to count crowd. Moreover, they used the famous benchmarks datasets such as Shanghai Tech, World Expo and others to perform experimentation on these architectures.

4.2.2 Crowd density estimation

Density estimation of a crowd is an extended part of crowd counting. Density computation is important to support preset plans and strategies to avoid overcrowding. The authors of [51] discussed the flow patterns of a crowd. They used an unsupervised methodology to cluster people patterns in large public infrastructures. The proposed approach has been applied to an international airport. Their approach successfully summarized the representative patterns and provided the required data for airport management.

The work of [44] proposed a model to estimate crowd density via an optimized ConvNet. The model has two ConvNet classifiers to improve its speed and accuracy. In the same context, the work of [63] used LSTM-combined Node2Vec graph embedding to extract spatial features.

4.3 Crowd scene analysis

Crowd Scene Analysis is most important to study normal or abnormal human behavior. This aspect includes Crowd Motion Analysis and Tracking using the most common approach is video surveillance to detect alarms and anomalies.

4.3.1 Crowd monitoring and tracking

[95] developed a new framework for an online gating neural network. It consists of two phases: the offline training phase and the online predicting phase. In the first phase, their training set is trained daily using a gated recurrent unit-based predictor of human mobility. In the second phase, they constructed an online adaptive predictor of human mobility. Moreover, it switched between offline pre-trained and online adaptive human predictors using a gating neural network. They have adopted a real-world GPS-log dataset for training Tokyo and Osaka cities, where this approach realized a higher prediction accuracy for this approach. This framework can be employed for several purposes, for instance, incorporating additional data such as event information or weather data to predict human mobility. The framework minimizes unnecessary information by performing more than one online training simultaneously. Moreover, the used dataset is considered a little representation of the real world. However, that system is unstable due to the sparse data.

Shi et al. [77] proposed a novel model for the trajectory prediction of pedestrians in highly crowded scenarios. The model relies on using LSTM and contains a trained decoder and encoder by truncated backpropagation. The experiments used data from the trajectory train station in Tokyo, Japan. This model has proven stable concerning predictions of varying lengths. In addition, it realized an average for both Evaluation Metrics Of The Prediction Errors (Average Displacement Error And Final Displacement Error) Of 21.0%.

In the same context, [57] studied the prediction of the trajectories of foreign tourists using lstm. Nevertheless, there is a difference. The first layer of lstm is fed with the input sequence, and every other layer of lstm is fed with the layer's output that precedes it. They claimed that the proposed method outperformed classical approaches.

Zhang et al. [79] studied monitoring passenger flow in a passenger metro by creating a cnn-based platform. The proposed method has three parts: the first is a cnn group used to extract features from images. Then, the second is a module of feature extraction utilized to enhance multiscale. Finally, transposed convolution is applied to the sample to create a high-quality density map.

Lastly, some of these works used cnns with lstm methods to extract images feature in order to examine and analyze crowd scene, they have accomplished high-quality. In every case, the integration of cnns with lstm is considered an effective method to produce a high-quality density map, and thus it can give good results.

4.3.2 Crowd behavior analysis

Swathi et al. [37] developed a vigorous model, which integrates features of deep learning AlexNet (alippi, disabato and roveri, 2018) with high-dimensional features of the gray-level co-occurrence matrix (GLCM) that have hybrid deep statistical features. Moreover, it used a multi-feed forward neural network model (MFNN) to execute multi-category classification. AlexNet and GLCM provide a wealth of information on spatiotemporal features to make ideal classification decisions. The MFNN algorithm helps ideal multi-class classification. The model has achieved an accuracy rate of crowd behavior classification of 91.35%, 89.92% precision, 89.12% f-measure, and recall of 88.34%.

Zhang et al. [65] proposed a framework to predict crowd behavior in complex scenarios. The framework consists of three components: the module of scene feature extraction, the discriminator, and the generator. The first component captures the environment's visual signal, the spatial layout, and the interrelationship of pedestrians. The second component measures the similarities between the real trajectories and the generated ones. The third component consists of the encoder and the decoder parts that use lstm for inputting. Experiments are executed on the standard crowd benchmarks datasets, such as the chinese university of hong kong(cuhk) crowd (shao, change loy and wang, 2014; shao, loy and wang, 2016), the eth zurich university(eth) datasets [97], the crowd-flow, and the university of cyprus (ucy)datasets [98]. These experiments confirm that the proposed framework successfully predicts the behaviors of crowds in complex scenarios.

According to above, integration alexnet features with glcm have achieved a good accuracy rate for classification.

4.3.3 Crowd abnormality detection

Abnormal behavior is an unusual event occurring in overcrowded scenes. Therefore, crowd abnormality detection in crowded areas plays a pivotal role in preventing any disasters due to riots. The domain of anomaly detection has gained the interest of researchers in computer science in recent years.

Video anomaly detection (VAD) uses algorithms of temporal video segmentation to detect shot boundaries in sequential frames of video [99]. VAD challenges relate to crowded and complex scenes, small anomaly datasets, and anomaly localization [49]. Moreover, the challenge of false-positive detection results is that the system incorrectly discovers normal events as abnormal ones [49]. For these reasons, deep learning methods are more suitable than traditional methods [69, 95, 100]. In particular, unsupervised deep learning methods are the best solution [49].

Ganokratanaa et al. [49] proposed a new unsupervised deep residual spatiotemporal translation network (named DR-STN). The proposed approach has embedded with DR-cGAN and OHNM, which refer to Deep Residual conditional Generative Adversarial Network and Online Hard Negative Mining, respectively. The authors claim that their approach reduces the detection of a false-positive anomaly. Furthermore, it increases anomaly localization accuracy with a rate of 96.73%.

Wang et al. [38] proposed a novel algorithm to solve the problem of visual abnormality detection in crowd scenes. The abnormal frame is called a global abnormal event (GAE). However, determining the abnormal area in one frame is called a local abnormal event (LAE). This process uses a feature descriptor extraction of MHOFO (motion descriptor, namely a multi-frame descriptor). The motion information is represented by this descriptor after capturing it as a multi-frame. After that, captured samples are trained via a cascade deep autoencoder (CDA) as a generative network to detect abnormal behavior. Their experiment was performed on three benchmark datasets, University of California, San Diego (UCSD) [70], PETS2009 [73], and UMN [77]. They have proven that their algorithm shows competitive results. Although their model is slower than the SCL method, it is better in terms of performance. The SCL is the fastest method in the published papers for anomaly detection.

Ammar and Cherif [39] proposed a model to treat the problem of panic behavior detection in abnormal situations, which is named DeepROD. This technique worked in real time, online, and offline. It relies on statistical characterization and LTMS neural networks to predict future values of features. They claimed that their model is proven by experiments on well-known datasets (both public databases and livestreaming sources). Specifically, online training has given a better performance than offline training for the crowded scenes. Furthermore, it provided good processing time and accuracy. Nevertheless, DeepROD has lower accuracy when tested on a livestreaming source, such as a festival video.

Khan et al. [59] proposed an AlexNet-based crowd anomaly detection model to detect the anomaly in the image frame. Their model was comprised of three fully connected layers, four convolution layers, with additional the rectified linear unit (ReLU) was used as an activation function. The experiment has been performed on a personal computer using fewer computational resources, it appeared that the proposed model outperformed other studies and fulfilled 98%.

Basalamah et al. [50] proposed a Bi-LSTM framework using motion information to detect congestion rather than count pedestrians.

4.3.4 Group activity detection

Vahora et al. [45] proposed a novel model using a deep neural network for the recognition of group activity via video monitoring. The model has a multi-layer deep architecture, which integrates CNN with RNN. CNN model was used to capture information, feature, and level semantics from the scene for recognizing mysterious group activities. The RNN model used the LSTM model and gated recurrent unit (GRU) model to handle the problem of long-term dependency for the RNN model.

4.4 Social media-based analysis

Over the last few years, several smartphone social network applications (apps) have come to market. These apps enable users to exchange their information, location, and temporal data, usually called check-in data. Day by day, social network apps have become more utilized by people. Especially popular are apps based on geotagged social microblogs and location-based social networks (LBSNs) such as Twitter, Facebook, Instagram, and LinkedIn. An advantage of social media (SM) is that users share their interests and purposes when, where, and why they go out. A result is an enormous source of data that may help researchers in different domains to perform crowd analysis, such as sports, religious, and carnival events, as well as in the marketing domain and trend detections. From another perspective, the data sources of SM open other horizons in the analysis domain, including counting people, computing individual tracks, and detecting normal and abnormal behavior in a crowd [101]. Moreover, they show how much impact these data have to create a crowd or influence their behavior [102]. Figure 6 illustrates trends of analysis via social media.

Fig. 6
figure 6

Analysis trends via Social Media

4.4.1 Opinion/sentiment analysis

Öztürk and Ayvaz [46] collected all tweets in English and Turkish languages that discuss humanitarian issues concerning the Syrian refugee crisis to perform sentiment analysis on them. They used the twitter package to collect data from twitter. Then, they utilized the Rsentiment package. Both packages were developed in the r programming language. Rsentiment contains a comprehensive sentiment dictionary in English and provides a sentiment score. Whereas in the Turkish language, the authors have developed a sentiment lexicon of 5405 words. Finally, the results of these analytics, overall sentiments were positive about Syrian issues. While only 12% of the tweets in English were positive, the tweets in Turkish were equally distributed among neutral, negative, and positive sentiments. It is good to adopt this model to suit different languages.

In the same context, Malik et al. [64] created an alert system for Pakistan government authorities in order to determine the public emotions of people against upcoming anti-government.

Duan et al. [60] studied a stampede on shanghai new year’s eve in 2014. This study investigated the reasons for this crowd behavior through the viewpoint of social media data. The authors developed a framework using check-in data of the Weibo platform of three trends, the emotional fluctuations of citizens, the topic changes in posts, and the collection level of check-in data. The framework processes are executed as follows. At first, the location information of check-in data is taken from Weibo to analyze the spatial and temporal using Moran’s i index. Next, the textual data of Weibo is analyzed through topic modeling using the Latent Dirichlet Allocation (LDA) method. Finally, sentiment analysis is analyzed and divided into five groups to extract percentages of negative and positive sentiments. As a result of this study, the geographical features can directly reflect changes in crowd flow, as well as the psychological states of people before and after accidents. However, it still faced some challenges.

4.4.2 Geo-located analysis

Redondo et al. [48] proposed a hybrid solution based on clustering techniques and entropy analysis to early detect unexpected behaviors in social media. Data is collected from Location-based Social Networks (LBSNs). The authors used the Instagram platform for this study because it is a good source for geo-located data. Moreover, the APIs of some social media platforms impose limitations on the access of visual data by developers.

Finally, previous works has studied crowd flow behavior by analyzing textual data. Thus, Duan et al. [18] claimed that prior knowledge of people's psychological and behavioral states may help in understanding crowd behavior.

4.5 Crowd sound emotion recognition

Franzoni et al. [40] introduced the first model to study sound emotions for the crowd. The model integrates CNN and spectrogram-based techniques. According to their claims, they have not compared the results of their experiments with any prior study with a similar domain (crowd). However, they compared these results with studies focused on analyzing individual-speech emotions. The model has a 10% improvement in average accuracy. Their study proved that the AlexNet-CNN spectrogram-based method is appropriate to analyze the sound emotions of the crowds.

At the end of this section, this paper demonstrates the limitations of the above papers. There are some problems during the process of data extraction. The information inferred was insufficient due to a lack of data on procedure, methods, and performance, which may be reflected by the QA.

4.6 Crowd management at Hajj event

This part will display all studies that support Hajj research. Hajj ritual is the fifth pillar of Islamic, every Muslim should visit to the holy places in Makkah, Saudi Arabia once at least in his life. They should able financially and physically to perform worships of Hajj. The period of Hajj is between 8 and 12th of the 12th month every year, it is called (Dhulhijjah) in the Islamic (lunar) calendar. Hajj's crowd of up to three million people comprised of pilgrims from all over the world in one sacred spot. The geographic area of the holy sites for performing Hajj rituals is not exceeding 33 km2. This makes the Hajj authorities to face a great challenge to deal with the overcrowding of the Hajj in a specific period and place, firstly relating to the security and safety of pilgrims. The objective of our work is to discover the efficient practices of crowd management in the holy biggest event in Saudi Arabia, it is the Hajj ritual [103, 104]. All methods currently applied in the field of Hajj crowd management are still lack of attention and development from researchers, especially in terms of exploiting of textual data to analysis of crowd's emotions, sentiments, or opinions. The two papers are picked below according to our criteria in this survey.

Farooq et al. [41] presented a novel method for abnormal behavior detection for crowds that may lead to dangerous disasters, such as a stampede. The model captures motion in the form of images, then classifies these images according to crowd divergence behavior using a CNN method, where the CNN has been trained on motion-shape images (MSIs). Moreover, the finite-time Lyapunov exponent (FTLE) domain is acquired when the optical flow (OPF) is computed first. LCS (Lagrangian coherent structure) in the FTLE domain represents dominant motion for the crowd. Finally, a scheme of ridge extraction transforms the LCS-to-grayscale MSIs. The model is tested on six real-world low and high-density datasets. They claimed that the experiments produced effective results for their method in terms of detecting divergence accurately, as well as detecting starting points of congestion at high and at low density. Furthermore, they presented two new datasets, including video scenes of normal and abnormal behaviors for a high-density crowd. In the Hajj case, the authors have applied their model to pilgrims’ crowds at Makkah, Saudi Arabia. It used recorded Video data (the PILGRIM dataset) taken from a live broadcast of the Makkah TV channel. They have generated three behavior videos from every single video. The proposed method also has outperformed this dataset.

Habib et al. [61] developed a novel framework to identify abnormal activity for pilgrims at Makkah. A lightweight CNN model was trained on the dataset of pilgrims. This dataset was captured from installed CCTV et al.-Haram. The images’ frames were passed to the proposed model for the extraction of spatial features. Then, an LSTM network was created for the extraction of temporal features. Lastly, the system will make an alarm when an emergency occurs, such as an accident or violent activity, to inform the authorities to take the appropriate action. They have performed experiments on two violent activity datasets: Hockey Fight and Surveillance Fight. The model achieved good accuracies of 98.00 and 81.05, respectively. However, this model suffers from the shortcoming of recognizing violent activity from one perspective only. And it is the best that recognizes violent activity from multiple perspectives to obtain insights into the activities.

Finally, the review concludes from Sects. 4 and 5 that most of these studies have been concerned with CNN methodology and integrated with other techniques to benefit further and improve the accuracy of model performance, Table 6 summarizes all studies that used CNN approach with their advantages and disadvantages. Table 7 displays that used the methodologies of RNN such as LSTM approach. While Table 8 illustrates studies that used different methodologies to create models.

Table 6 Overall, studies of CNN approach with their advantages and disadvantages
Table 7 Overall, studies of RNN approach with their advantages and disadvantages
Table 8 Overall, studies with different approaches with their advantages and disadvantages

According to our viewpoint, the models of crowd counting still need development for several problems as follows: Detecting large objects as people, but do not detect small objects. Most of models cannot be generalized on all datasets, where they give good results with some datasets and inefficient results with others. Classifying some objects as people by mistake. The accuracy of captured image/video varies according to the installed camera angles. Hence, the challenges can be minimized as possible, installing cameras at every angle in the place to ensure the monitoring of all people and improving the accuracy of image/video pixels for the extraction of features efficiently. At last, internet of things (IoT) devices can improve counting crowds besides DL models. In terms of analysis of social media, geographic locations feature should be exploited for processes analytic of crowds. Furthermore, the various languages should be supported just like the English language. Consideration of the backgrounds of psychological, social, and beliefs of people when studying their expression on social media.

5 Results analysis of comprehensive study of crowd management

In this study, the comprehensive study is divided into two domains. The first one is crowd management in various events in the world. The second one is crowd management in local area named Hajj event in Makkah, Saudia Arabia. All these studies focus on approaches that applied DL techniques. In terms of comprehensive study of crowd management in various events, the related works are divided into four sections, everyone has two or more subsections. The sections are crowd statistics, crowd scene analysis, social media-based analysis, and crowd sound emotion recognition.

5.1 Crowd statistics

Many new hybrid approaches have been developed by CNN-based methods to improve crowd counting and crowd density estimation fields. The outcomes of combining two or more methods have confirmed that hybrid techniques enhance performance, increase accuracy, and prevent many obstacles in computer vision projects. This has been displayed using the adaptive convolutional neural network (ACNN)-based model, locality-aware features (LAF) integrated with CNN features, multi-column convolutional neural network (MCNN) method, two-task convolutional neural network (T2CNN) with dense degree classification (DDC), and end-to-end semantic scene segmentation (SSS) based on CNN for calculation of crowd counting and density.

5.2 Crowd scene analysis

Many works have attempted to find the successful solutions for normal or abnormal human behavior analysis. Videos scene analysis for human behavior detection is faced many challenges relate to crowded and complex scenes, anomaly localization, small anomaly datasets, and false-positive detection [56]. To solve these reasons, DL techniques are more suitable and successful solutions than traditional methods to treat problems and challenges. The researches have been developed by CNN-based methods integrated with other techniques. The works of Shi et al. [50] and Crivellari and Beinat [51] have used backpropagation processing through the LSTM technique. LSTM in the first layer feeds subsequent layers and every other layer of LSTM is provided with the layer's output that precedes it. Furthermore, the model of Zhang et al. [54] used LSTM in third component for inputting stage to improve accuracy rate for classification. Moreover, Ammar and Cherif [61] have integrated LTMS neural networks and statistical characterization to predict future values of features. Vahora et al. [31] used LSTM beside GRU to handle the problem of long-term dependency, also used CNN to capture features from the scenes.

Finally, some of these works used CNNs with RNN methods such as LSTM and GRU to treat long-term dependency problems, increase improve extraction of scene features, and accomplish high-quality to anomaly behavior detection.

5.3 Social media-based analysis

Some of the works focus on studying crowd management through text data analysis is less compared to analyzing image and video scenes. Due to of the difficulties associated with natural language processing that make it difficult to understand the intentions of people's feelings and emotions towards events and situations. Moreover, It would be excellent if there is prior knowledge of people's social, psychological and behavioral states in order to understand crowd behavior to prevent emergency cases proactively to ensure crowd safety [43].

5.4 Crowd sound emotion recognition

There is one research that has presented to study emotion detection by sounds, where it used AlexNet-CNN spectrogram-based method to analyze the sound emotions of the crowds. Spectrogram-based method is appropriate for crowd sounds analysis.

5.5 Crowd management at Hajj event

In terms of crowd management at hajj event, the flow of data during a Hajj period is huge, whether it is visual, text, or audio data. This digital wealth must be greatly exploited by the Hajj authorities to reduce the terrible effects that may be when proactive solutions are not developed to control crowds. Like previous studies, CNN with LSTM have been used to effectively extract visual attributes to classify anomalous or anomalous crowd behavior, this achieved excellent accuracy results.

5.6 Addressing of used methodologies limitations

It is clear from the review of existing works that crowd management is plagued by drawbacks and challenges that have restricted its rapid improvement in recent years. Hence, we made several important considerations when building crowd models.

It can be noticed most of the works of literature is that they still remain density-dependent. It means their models developed for macro-analysis independently from micro-analysis, while applications of real-world require crowd analysis to be conducted starting at macro-level and branching down into the micro-level. Therefore, it is important in future works in terms of modelling surveillance, behavioral understanding of crowds, must concern on the enhancement at both macro-and micro-levels and integrated between them.

Furthermore, it can be observed that most of the literature based on computer vision is performed under strong and restrictive conditions, for example, the perspective of the installed camera in the place, surrounding environment, estimating density of crowds, noise, etc. It is vital to realize that these requirements are inherited from the computer vision field since they are viewed as extension techniques for crowd modeling. There is a common sense of acceptance of these challenges for researchers. Whereas in this work, we recommend the integration of some techniques that collect data about people to reduce absolute dependence on traditional equipment. We claim that utilizing various data on social media will make a vital source for video surveillance domain and counting crowd density, behavior, or abnormality crowd detection, which increases the cognitive diversity and learning new patterns for datasets. Hence, the crowd models will be generalizability on the different environments. It has increased the number of large-scale events in the world; thus, the organizers should benefit from the deepest insights about attendees’ characteristics besides events' characteristics. It becomes possible to describe the behavior of people during crowd events using social media data, this is paving the path for crowd monitoring and management by using real-time applications [105]. The work of [106] is a good example of exploiting the data in streaming channels and social media during the Hajj season. We believe that, in the future, social media data related to expressing people’s daily lives will become close to understanding the behavior of crowds during events.

The future research must be concerned with the complementarity of the models to solve challenges and drawbacks rather than with minor developments to increase the accuracy of the model only. For these reasons, it is important to understand the differences between the kinds of supervised and unsupervised DL techniques. Many of the relevant concepts may confused together when building large or complex models. Table 9 clears the most important strengths and weaknesses of ML and DL algorithms.

Table 9 Strengths and weaknesses points between ML and DL

One of the main problems with the majority of local works during Hajj season is that most research is performed in isolation as urban planning for smart cities and the variety of needs regarding crowd management. Urban crowd management is an integral activity for any event, such as crowd flow, estimating density, monitoring street grid, movement of buses [107], crowd trajectory [108], impact of a pandemic on crowds [109], etc. Integration Urban provides good decision support for the development of the city in all respects. Big data and advanced intelligence computational techniques can help the planning, design, management, analysis, and simulation of smart cities. For instance, early planning of safety evacuations in the midst of a natural disaster incident based on location data of mobile phones leveraging both machine learning approaches [110]. Use crowd-harvested data to study the population's sentiments, traffic patterns, and perceptions of neighborhoods, in addition, to simulating the model urban systems more realistically, which is crowdsourcing effective the analyzing and modeling of urban morphology at much finer social scales, temporal, and spatial [111]. Hence, smart cities will be able to control and monitor dynamic changes as they happen inside the city during crowded events. For instance, using the FOPID controllers controls systems with nonlinear dynamics, also improves the complex systems performance in various applications [112], and using DETDO optimizer to solve real-world engineering design problems [113].

6 Discussion

In this SLR, about 45 DL-based articles are reviewed. According to the detailed analysis of various crowd management approaches and their state-of-the-art performance in this survey, our survey forecasts that DL-based methods will predominate future research in the crowd analysis and management fields. It can be noticed that most methods have been integrated with other DL methods, such as CNN with LSTM, to increase the accuracy performance. Moreover, variations in types of input, layers, or fed to in the CNN. It utilizes popular datasets or creates a novel dataset for performing testing on proposed approaches.

Most of these papers focused on crowd scene analysis in the computer vision field. Therefore, the major challenge for crowd management is the lack of sentiment analysis of crowd-based big data on social media. There is also a lack of custom datasets to feed textual data analysis. Owing to the challenges related to natural language processing, it makes difficult to understand people's emotions towards events and situations by textual expression. Furthermore, studying of people's psychological and behavioral it may reduce the severity of these challenges. Furthermore, its open scope for a greater understanding of what is behind the meanings and words. Thus, investigation of the crowd’s behavior from all aspects is greatly crucial for crowd safety, also to prevent dangerous emergency situations before they happened [43].

6.1 Current study vision discussion

This section discusses our vision of this study compared to previous state-of-the-art studies. The main objective of this survey is to spotlight the shortcomings or defects of previous papers. New approaches must provide effective solutions for crowd video analysis in real-time, while traditional approaches are not able to handle efficient solutions in a time-bounded manner. Traditional approaches are insufficient for crowd analysis cause the size of the crowd is huge and dynamic in real-world scenarios. In addition, the behavior and actions of individuals are difficult to identify. The shortcomings can be identified in existing approaches as follows: real-world dynamics, time complexity, bad weather conditions, overlapping of objects [119], and unexpected incidents. All existing approaches were handling the shortcomings independently. It can be observed that there seem to be almost no concerns about the lack of research in sentiment analysis of crowd-based big data on social media in the world, especially during Hajj events. It can be perceived as a missed chance to learn from different visions. Thus, this study seeks to change imbalance research by configuring new Integrative frameworks and methodologies and highlighting the prior good practices in this domain. It is significant that the researcher community realizes these gaps when constructing existing systems and continuing to monitor the development of integrated research in crowd management.

This paper aims to support Hajj research through the enhancement of the pilgrims' behavior analysis and work to cover the above aspects. Moreover, we will provide datasets of pilgrims taken from social media and will make them publicly available to be useful to other researchers. The authors are seeking “actionable SM-based crowd management”. In this sense, traditional crowd management needs a new multidimensional conception. To build a new robust infrastructure, it must be integrated as follows; “AI algorithm + computing power + big data = smart service” [120]. It is significant that utilize the best optimizers to improve the performance of complex systems such as FOPID [112], DETDO [113], Genghis Khan shark [121], Geyser Inspired Algorithm [122], Prairie Dog Optimization Algorithm [123], Dwarf Mongoose Optimization Algorithm [124], and Gazelle Optimization Algorithm [125].

The advantage of this work is the crowd management domain may rise to another sophisticated level if considering the attention on social media big data. Presenting a new taxonomy of crowd management based on deep learning algorithms, including all domains of analyzing data: visual, audio, and textual. Presenting the comprehensive examination of the global crowd management works in order to benefit Hajj authorities to apply the best practices locally. The following research [53, 126] can be contributed to addressing weaknesses in Hajj research regarding counting crowds, where the model focuses on the attention of CNN channel-wise, spatial-wise attention, and context-aware. In addition, the model can be used for other purposes, such as image supporter recognition, localization of individuals, and face recognition. Furthermore in [46], the authors provide a good model for the analysis of textual data and understanding the emotions of users, but it needs some modifications to suit other languages like Arabic.

There are some lessons that can be learnt from this SLR. First, our practice of SLR has emphasized the lack of Hajj data analysis topics. SLR employment is valuable to stay informed about those topics to support the data or knowledge of Hajj. Second, this SLR discovered that the future direction of data analysis and prediction depends on the development of CNN models. Third, crowd analysis studies focus on the analysis aspect of video surveillance more than textual data.

Limitations of this SLR include intentionally ignoring conference papers because they contain incomplete models or studies. Thus, we limited our sources to academic articles only. During data extraction, we found insufficient information related to the environment, evaluation, accuracy, and procedure in some papers, which may be reflected by the QA. As a result, some inferred data may have inaccuracies due to unclear information in the papers. Table 10 shows the summarization of the learning lesson, limitations, and future research directions.

Table 10 The summarization of the learning lesson, limitations, and future research directions

7 Conclusion

In this SLR, the researchers explored comprehensive crowd management from the aspect of DL methods. This survey has performed a wide investigation for relevant related works published in the interval 2010 to 2023. Moreover, the survey elicited pivotal information based on our research questions (RQs). The research goals have been fulfilled effectively through these RQs that were established to examine and analyze the scope of the research. Moreover, ensuring that the key findings and contributions have been performed usefulness for future researchers, also the gaps and obstacles that faced crowd management have been discussed in the Sects. 5.6 and 6. The four RQs raised in this SLR, and their findings are as follows:

  1. (1)

    RQ1: Most previous works have been classified into two categories, crowd scene analysis and crowd statistics. However, these previous works omit the opinion mining of users on social media to predict future crowd actions.

  2. (2)

    RQ2: Supervised and unsupervised DL methods provide high accuracy in general, which supported computer vision in crowd analysis in many studies, especially in architectures based on the CNN model. Because of its high efficiency and accuracy, it has become the most reliability model for researchers.

  3. (3)

    RQ3: We observed there is the fewest number of studies regarding crowd management at Hajj. As well they focused on the detection of abnormalities in crowded scenes.

The primary aim of our review is to investigate crowd analysis fields in every gathering around the world. Furthermore, the local gathering, especially crowd analysis in a Hajj event for example. The paper aims to illustrate the dilemmas and obstacles that have been faced in every study. Moreover, it aims to find research gaps existing to focus on it in future studies. Finally, we observe that most studies were about crowd scene analysis via private/publicly available datasets or live-streaming surveillance, whether using supervised or unsupervised techniques. They ignored the behavior analytics and predicted it by textual data on social media. In addition, the literature indicates the lack of Hajj research, especially in sentiment analysis and the study of the pilgrims' behavior. Overall, the systematic literature review links the widespread knowledge transfer debate of crowd management in terms of the study behavior of entities via SM big data and predicting the actions. Thus, the current study enriches the research communities and academic discourse.