1 Introduction

Transport companies can benefit from reliable real-time people-counting systems to estimate travel demand, improve customer satisfaction, and optimize routes (Mehmood et al. 2019; Reichl et al. 2018). In recent years, thanks to the progress of Information and Communication Technologies (ICT), the number of Internet of Things (IoT) devices has soared. The increase in the number of Wi-Fi access points and client devices has enabled the collection of vast amounts of data, making people counting easier, since many of these devices are consumer devices (Cisco 2020).

Public Transport (PT), despite its sustainability, is still used predominantly by students and senior citizens (Kos-Łabędowicz 2019), but improvements in service quality, such as higher frequency, increased comfort, and better accessibility to stops, could make it more attractive to the working population. The perception of the quality of public transportation in terms of location, waiting times, and equipment (such as benches) is influenced significantly by the characteristics of bus stops. To determine whether bus stops are optimally located, properly equipped, and served by a frequency of buses that matches demand, it is essential to monitor the flow of people at the stop. However, transport studies involving people counting have largely been limited to automatic onboard passenger counting (Fihn and Finndahl 2011; Mccarthy et al. 2021; Mehmood et al. 2019; Nitti et al. 2020), and little attention has been given to the importance of bus stops in improving transport infrastructure.

Traditionally, counting people at bus stops was done manually using hand-held counters, or via manual customer surveys once or twice a year. Such approaches are generally expensive and time-consuming, and only provide limited snapshots on the days that the counts/surveys are completed (Reichl et al. 2018). But with time and the advance of machine learning and artificial intelligence techniques, new methods have been developed for automatic data collection using treadle mats, cameras, Wi-Fi, Bluetooth, and other sensors (Charansonney 2018; Fihn and Finndahl 2011; Oberli et al. 2010; Rakebrandt 2015; Yang et al. 2010).

A number of Wi-Fi-based passenger counting systems have been studied in the last few years (Choi et al. 2022; Hidayat et al. 2018; Mccarthy et al. 2021; Nitti et al. 2020), and have been shown to have varying levels of accuracy. To the best of our knowledge, most validation of these Wi-Fi-based counting systems has been done either in controlled environments or only over short periods (Nitti et al. 2020; Oransirikul et al. 2014). Some studies have been done on people-counting systems based on Wi-Fi probe requests, but very few of these studies have focused on counting in the vicinity of local transport infrastructure in an open environment, like a bus stop. Also, while there have been attempts to use machine learning and deep learning (DL) methods for automatic people-counting, using source data based, for example, on link-blockage time in a Wi-Fi network (Ibrahim et al. 2019), on Channel State Information (CSI) from Wi-Fi signals (Choi et al. 2022; Liu et al. 2019), and on images from video cameras (Baumann et al. 2022), no studies, to the best of the authors’ knowledge, have used Wi-Fi probe request packets as the source data for a CRNN-based DL method.

In this research we sought to estimate the number of people waiting at bus stops by detecting wireless devices in the vicinity, and to process the collected information in order to determine whether DL could provide good results in this context. The structure of this paper is as follows: Sect. 2 provides an overview of various methods employed for automated people counting, specifically focusing on Wi-Fi sniffing and data processing approaches. The methodology used for device detection is presented in Sect. 3, followed by the results in Sect. 4. Finally, Sect. 5 concludes the paper by examining the significance of the proposed model, discussing the study's limitations, and suggesting future research directions.

2 Literature review

Traditionally, mat sensors have been used to count passengers by measuring their weight when boarding the bus (Basalamah 2016). However, these systems are expensive and require periodic maintenance, and they also have low accuracy (Baumann et al. 2022). Infrared sensors or cameras have been proposed as alternatives, but these have a number of limitations, such as blind spots, weak performance in poor visibility (Bernini et al. 2014; Liu et al. 2019; Saponara et al. 2016; Yahiaoui et al. 2010), and a high cost (Liu et al. 2019). Bluetooth signals have also been explored, but they are becoming less useful as a consequence of mobile devices automatically disabling Bluetooth functionality (Nishide and Takada 2013), their short transmission range, and their ability to operate in “hidden” mode (Schauer et al. 2014). Oransirikul et al. (2014) tested Bluetooth device detection as a way of counting passengers and were able to show that this approach is less reliable than Wi-Fi sniffing. Collaborative applications have been proposed as another option, but they require access to mobile device functions like GPS or microphones (Carrel et al. 2015; Gao et al. 2017).

Wi-Fi based detection has become popular owing to its lower energy usage, shorter discovery time, and a higher likelihood of devices having an active Wi-Fi interface (Kurkcu and Ozbay 2017; Lee et al. 2012; Schauer et al. 2014; Singh et al. 2021).

The ubiquity of smartphones, tablets, and wearables has led to the widespread use of Wi-Fi (Cisco 2020), which allows these devices to search for wireless access points in their immediate vicinity. This search is performed using probe requests that contain device-specific information, including the unique Media Access Control (MAC; footnote 1) address. Even when Wi-Fi is switched off, devices continue to send out these requests to calculate device location based on the location of known Wi-Fi access points nearby (KODY 2018). By analysing these probe requests, it is possible to count the number of devices in the vicinity and estimate the number of people (Mikkelsen et al. 2016; Myrvoll et al. 2017; Reichl et al. 2018). Transport for London has reported (footnote 2) that analysing data from free Wi-Fi connections can offer valuable insights into customer movement at underground stations.

Oransirikul et al. (2014) attempted to use Wi-Fi signals to count the number of devices at a bus stop. However, the quantity of data collected was limited, as the experiment took place only over a short period and did not account for MAC address randomization. Because of privacy concerns, smartphone manufacturers have implemented a mechanism for randomizing the MAC address between probe requests, making it difficult to uniquely identify a device (Fenske et al. 2021; Matte 2017; Purvis and Dementyev 2020; Vanhoef et al. 2016). As a result, automatic people-counting systems based on unique MAC addresses in Wi-Fi probe requests are becoming obsolete.

Some authors have, nevertheless, proposed ways of overcoming the problem of address randomization. These include the iABACUS system proposed by Nitti et al. (2020), which uses a de-randomization algorithm. Other solutions involve machine learning models, such as those proposed by Guillen-Perez and Cano (2019) and Uras et al. (2020), but these have not been tested in scenarios with a large number of people.

Baumann et al. (2022) judged that the progress in DL over the past few years has unlocked fresh opportunities for businesses, thanks to the rise in storage capacity, wider availability of data, and increased computing speed. Specifically, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have shown impressive advances in performing various tasks (Liu et al. 2019). CNN algorithms apply multiple filters over the input data to detect patterns; more information about how CNNs work can be found in Lecun et al. (1998). RNN models use a loop structure that connects different layers in order to process sequential data; for more information on RNNs, see Rumelhart et al. (1986). RNN models are particularly suitable for applications that involve sequence data, such as speech recognition (Graves et al. 2013) and handwriting alignment (Raue et al. 2015). However, traditional RNNs suffer from the vanishing gradient problem when dealing with long sequences. To overcome this, the Long Short-Term Memory (LSTM) architecture is commonly used in conjunction with RNNs (DiPietro and Hager 2020; Hochreiter and Schmidhuber 1997). A few studies have combined the benefits of both CNN and RNN, extracting information from features and then connecting the extracted features to produce estimations (Liang et al. 2018; Wang et al. 2019).

Various DL approaches have been tried in recent years for counting people. Ibrahim et al. (2019) proposed an RNN-based approach called "CrossCount" that maps a sequence of temporal link-blockage patterns to a human count, using a single Wi-Fi link and achieving an accuracy of about 55%. The accuracy of their model decreased as the number of people increased, which could be due to limitations in data collection and to the method of superimposition that they used to simulate multi-person data from single-person data.

Liu et al. (2019) proposed a crowd counting method called "DeepCount" that counts the number of people using the Channel State Information (CSI) from Wi-Fi signals, applying CNN. Its accuracy was 86.4% in a confined environment, but the authors did not report the performance and accuracy of their model when the number of people is greater than 5. Also, their model was designed and tested in an indoor environment, and it is doubtful that the data they collected would be relevant to all scenarios in an outdoor environment.

Similarly, Choi et al. (2022) constructed machine learning and Deep Neural Network (DNN) models for crowd counting using Wi-Fi Channel State Information (CSI) in a closed environment. Although the DNN models showed a slightly better performance, the authors ultimately preferred machine learning methods because of the high cost of model training required to achieve the system's best performance with DNN. Singh et al. (2020) proposed the use of DL for time-series forecasting of crowds over a short time horizon, based on time-stamped crowd data, but DL was not used in their study to address MAC address randomization. Liu et al. (2022) proposed a spatial analysis model to understand passenger flow on a Bus rapid transit (BRT) system. The authors used an approach of counting the number of unique devices based on the probe request packets, but the paper did not account for the phenomenon of MAC address randomization.

Although DL models have been used by a number of researchers for people counting, these models have been limited to smaller training datasets or synthetic data for the training phase. The information sources used in these studies were Wi-Fi CSI and link-blockage time, rather than Wi-Fi probe requests. Moreover, the methods were assessed in a controlled environment and their performance was reported for small crowds.

There have been a few other studies relating to people counting that address topics that we explore, although the focus of these studies has been different. The research by Myrvoll et al. (2017), while having some relevance to our work, did not deal with MAC address randomization. The study by Nitti et al. (2020) centers on de-randomization without integrating deep learning, and its contextual scope is limited to a simulated bus environment, which differs in significant respects from the real-world context of a bus stop. In the studies by Uras et al. (2020) and Vanhoef et al. (2016) emphasis is placed on de-randomization and uniquely identifying devices via fingerprinting, complemented by clustering algorithms like DBSCAN and OPTICS (we should mention that Vanhoef et al. (2016) relates neither to people counting nor to public transport). Although both of the studies just mentioned, like ours, use probe request data to overcome MAC address randomization, our study also seeks to utilize the patterns and inherent variability of different features in estimating the number of passengers, and it does not attempt to identify and individuate unique devices. This has the additional benefit of protecting user privacy. It is a change in approach that led us to opt for Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) rather than clustering algorithms.

Tu et al. (2021) also take a different path, encompassing network events beyond probe request packets and involving different data collection methods. In fact, the authors place a greater reliance on network events linked to user interactions via a router than on probe request data. As part of this observable shift in emphasis away from probe requests, the present study has an important contribution to make, addressing gaps mentioned above through the application of deep learning techniques to real-world Wi-Fi probe request data within the dynamic context of a bus stop.

In the context of bus stops, where it is essential to take the capacities of stops, together with temporal patterns, into account, as underscored by Voß et al. (2020), there have been very few studies that have used either Wi-Fi probe request packets or DL methods to count passengers. Designing bus stops that balance functionality and safety hinges on a nuanced grasp of their utilization dynamics. Furthermore, studying bus stop crowding can provide insights into the dynamics of bus stop usage, have useful spin-offs relating to bus stop design, and potentially even help combat congestion by reducing dwell times of buses at stops (Tirachini 2014). A recent study by Jee et al. (2023) has also demonstrated that Wi-Fi sensors can be used at bus stops for estimating passenger queue lengths and waiting times, offering a cost-effective method to assess transit demand.

The study addresses gaps in people counting research by applying deep learning, more specifically CRNN, to real-world Wi-Fi probe request data within a bus stop environment, overcoming limitations of previous studies in relation to data size, sources, and methodologies. The research also seeks to use these methods to estimate passenger numbers, while protecting user privacy. It thus offers a singular contribution in the field of transportation-oriented passenger counting.

In summary, the paper makes valuable contributions as outlined below:

  • The paper contributes to the existing literature by addressing limitations in people counting research through the use of deep learning on real-world Wi-Fi probe request data, avoiding the use of limited or synthetic data sources.

  • In contrast to previous research centered on device de-randomization and identification, the study adopts a distinct approach: utilizing inherent patterns within diverse features to estimate passenger numbers while preserving user privacy.

  • Within the context of bus stop dynamics and utilization, where Wi-Fi probe requests and DL methods have not been extensively harnessed, the paper offers substantial insights through the application of deep learning in a real-world setting, effectively bringing theory and practice together.

3 Methodology

The design of our proposed system using wireless devices for estimating the number of people waiting at bus stops is illustrated in Fig. 1. A device is installed at a bus stop to detect waiting passengers, and the data is saved to a cloud database. An algorithm is then applied to this data to estimate the number of people waiting for a bus, and the output is stored in the same database.

Fig. 1 Overview of the system

The methodological approach comprises two steps:

  • Choice of device and data collection;

  • Data analysis and development of algorithms.

The proposed system also includes a dashboard for displaying the people-count information in an easily understandable format. The dashboard was developed using Flutter (footnote 3), an open-source user interface software development kit. The code is written in the Dart programming language. This framework was chosen since it is highly customizable, powerful, and cross-platform, meaning that it can be installed on a large selection of devices and operating systems. It can be stored on a server and is accessible from the web via standard browsers.

3.1 Device choice and data collection

The hardware board selected for Wi-Fi probe request sniffing at bus stops must have dual Wi-Fi capabilities, be low-cost and energy-efficient, have small dimensions for ease of installation, and possess reasonable memory capacity to collect and send data. These requirements ensure that the device will be able to sniff and upload data, be cost-effective for large-scale installation, occupy minimal space, and have sufficient memory to store data temporarily. We identified the best options available on the market, and these are presented in Table 1.

Table 1 Suitable marketed hardware solutions

The cloud system for Wi-Fi probe request sniffing at bus stops must have the capability to store sniffed packets in a database, process Hypertext Transfer Protocol (HTTP; footnote 4) requests for data transfer, maintain a database for the entire system, and allow for high scalability. The device connects to the internet using Wi-Fi provided over a mobile hotspot that uses LTE. The system design means that it can easily be extended to new stops, and transport operators can integrate data smoothly into their systems or create a custom dashboard. The system uses a REpresentational State Transfer (REST; footnote 5) Application Programming Interface (API; footnote 6), and a multi-platform dashboard has been developed for easy access to collected and processed data. Scalability is critical for the quick and cost-effective installation of sniffing devices at a large number of bus stops in a city, and to this end a plug-and-play approach is used based on a configuration file.

Our collection of data at bus stops implied a number of choices:

  (a) Choice of bus stops. These were selected using two criteria:

    • The bus line needed to include bus stops at different types of locations, that is to say crowded central urban areas in addition to less populated suburban areas (line 55, operated by Gruppo Trasporti Torinese (GTT; footnote 7), is an example of a line meeting this requirement);

    • Stops had to be sufficiently distant from traffic intersections to avoid an excessive level of “noise” in the form of Wi-Fi probe request packets from devices in nearby stationary vehicles;

  (b) Choice of the period for data collection (for automated and manual counting). Data collection took place on 9 days between March 1st and April 10th 2022, including weekends and public holidays, in three time slots: morning peak hours (7:30–10:30), midday peak hours (12:00–15:00), and evening peak hours (17:00–19:00);

  (c) Choice of the characteristics of the Wi-Fi probe request packets to be used for automated counting. Some of the features used as input for the sequential approach (Sect. 3.2.2) and the neural network approach (Sect. 3.2.3) are reported in Table 2;

  (d) Choice of the manual counting method. The manual counting was carried out using the “People Counter” mobile app, created for this purpose, that allows the user to increment, decrement, and reset the number of people recorded as waiting at the bus stop. The number of people, as soon as it changes, is immediately sent to Firebase Firestore (footnote 8) for storage. The back-end system integrates the “ground truth” with the data taken from the sniffing device. Some screenshots from the app are shown in Fig. 2. People waiting were counted manually in this way when they came within 11 m of the device (although at some stops people were waiting for a bus beyond this 11-m radius).

Table 2 Features of Wi-Fi request packets

Fig. 2 People Counter app screenshots: (a) Search screen, (b) Count screen, (c) Plot screen

3.2 Data analysis and algorithm development

To analyze the data coming from the IoT sensors and estimate the number of people waiting at the bus stop, different approaches were tested. In what follows, an “event” will be considered as a single packet sniffed by the IoT device and posted to the cloud database. Sniffed packets coming from Wi-Fi devices are asynchronous events occurring on a continuous-time basis. To process the information collected by the IoT device through synchronous discrete-time algorithms, there must be a discretization of time in the processing pipeline. The discretization that was implemented groups together events occurring inside different time windows ti of length ∆t, and the algorithm takes a fixed number of these time windows N∆t as input. As an illustration, consider the depiction in Fig. 3. Here we have four discrete time windows—t1, t2, t3 and t4—each of length ∆t. The first three events, from two distinct devices, are assigned to time window t1. Subsequently, the fourth event is assigned to time window t2, and so on, depending on the time of arrival of the new event. So, for each time window, the input data will be a matrix with the number of columns corresponding to the number of features from Table 2 and the number of rows corresponding to the number of probe request packets received in that time window. When events are used to estimate the number of people at the bus stop, multiple time windows are grouped together to form a “visibility period” and the data is provided in the form of multiple matrices, where the number of matrices is equal to the number of time windows N∆t. The average stop occupancy is estimated for the visibility period and compared with the ground truth. In order to facilitate alignment with the ground truth data, the ground truth information is segmented into time windows of equivalent duration.

Fig. 3 Data preparation for the processing algorithms
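The discretization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event layout (a timestamp in seconds paired with a feature dictionary), the function name `group_into_windows`, and the sample values are hypothetical and are not taken from the system's code.

```python
def group_into_windows(events, start, delta_t, n_windows):
    """Assign each event to the time window it falls in.

    events: iterable of (timestamp_seconds, features) pairs (hypothetical layout).
    Returns a list of n_windows lists; windows with no events stay empty.
    """
    windows = [[] for _ in range(n_windows)]
    for timestamp, features in events:
        index = int((timestamp - start) // delta_t)
        if 0 <= index < n_windows:  # ignore events outside the visibility period
            windows[index].append(features)
    return windows


# Made-up events: three packets from two devices in t1, one packet in t2.
events = [
    (0.5, {"mac": "aa", "rssi": -60}),
    (1.2, {"mac": "bb", "rssi": -70}),
    (4.0, {"mac": "aa", "rssi": -61}),
    (6.3, {"mac": "cc", "rssi": -75}),
]
windows = group_into_windows(events, start=0.0, delta_t=5.0, n_windows=4)
sizes = [len(w) for w in windows]
```

Each inner list plays the role of one per-window matrix (one row per packet), and the outer list corresponds to one visibility period of N∆t windows.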

At a given bus stop, each time window ti of length ∆t has a corresponding value Pi, that is the number of people at the stop at the end of the time window. The ground truth values are calculated as shown in Eq. (1) for a given visibility period, which can be described as the time period during which the actual average stop occupancy is compared with the people count calculated as described below.

$$\hat{R} = \frac{1}{N_{\Delta t}} \sum_{i=1}^{N_{\Delta t}} P_{i}$$
(1)

where:

  • ti: Time window of length ∆t;

  • Pi: number of people at the bus stop in each time window ti;

  • N∆t: Number of time windows of length ∆t considered for obtaining ground truth for a visibility period, which can be denoted in seconds as ∆t* N∆t;

  • \(\hat{R}\): Ground truth people count for the visibility period (∆t* N∆t).
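As a small worked example of Eq. (1), the ground truth for a visibility period is simply the mean of the per-window counts. The function name and the sample counts below are illustrative assumptions.

```python
def ground_truth_average(people_per_window):
    """Eq. (1): mean of the per-window counts P_i over one visibility period."""
    if not people_per_window:
        raise ValueError("a visibility period needs at least one time window")
    return sum(people_per_window) / len(people_per_window)


# Four time windows with 3, 4, 4 and 5 people counted at the end of each window.
r_hat = ground_truth_average([3, 4, 4, 5])
```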

The number of people may be estimated from the detected packets using either of two approaches: a packet-content independent approach, or a packet-content dependent approach. As regards the three different methods proposed in this study, the first method, which uses a packet-content independent approach, is presented in Sect. 3.2.1, while the other two methods, using a packet-content dependent approach, are presented in Sects. 3.2.2 and 3.2.3.

3.2.1 Matched digital filter

In a packet-content independent approach, the content of the detected packets is ignored. Packets are treated simply as “events” to be counted, and then filtered as in a classical signal-processing task. The input of these algorithms is the number of events (sniffed packets) at each time.

The hypothesis is that the number of events observed in a given time window (ti) is determined by the sum of two factors: the number of people (Pi) at the bus stop and a random variable (η) that encompasses various external factors such as the interference of nearby devices, fixed Wi-Fi antennas, and other perturbations. This relationship is expressed by Eq. (2):

$$\mu ={P}_{i} +\eta$$
(2)

where:

  • μ corresponds to the number of events (probe requests) captured in time window ti;

  • Pi corresponds to the number of people at the bus stop at the end of the time window ti;

  • η corresponds to a random variable representing noise.

Pi and η are not zero-centered signals. The first pre-processing stage serves to center the signal μ to the value E(\({P}_{i}\)) that can be determined from the training data, producing a pre-processed output ye. The idea is to take the centered signal of the random variable as a “noise” signal to be filtered through a filter h, to produce the estimation of the number of persons as in Eq. (3).

$$\widehat{P}={y}_{e}*h$$
(3)

where:

  • \(h\) – filter to be applied to pre-processed input signal;

  • \({y}_{e}\) – input signal for the matched filter (output of pre-processing);

  • \(\widehat{P}\) – estimated people count at a stop during time window ti.

The choice of testing a matched filter is in fact to find a baseline performance that can be obtained with a simple packet-content independent approach, and then to check whether a packet-content dependent approach can outperform this baseline.
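The baseline of Eqs. (2) and (3) can be illustrated with the sketch below, which centers the event counts and then applies a discrete convolution. Several parts are assumptions made for illustration only: centering is modelled as a constant offset estimated on training data, the two-tap filter `h` is a placeholder rather than the filter used in the study, and all numbers are made up.

```python
def center_signal(mu, mean_events_train, mean_people_train):
    """Shift the event counts so that their average matches E(P_i).

    Assumption: centering is a constant offset estimated on training data.
    """
    offset = mean_people_train - mean_events_train
    return [m + offset for m in mu]


def convolve_causal(signal, kernel):
    """Discrete convolution y[n] = sum_k kernel[k] * signal[n - k].

    The output is truncated to len(signal); samples before the start of the
    signal are treated as zero.
    """
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h_k in enumerate(kernel):
            if n - k >= 0:
                acc += h_k * signal[n - k]
        out.append(acc)
    return out


mu = [12, 14, 10, 16, 12]  # events per time window (made-up numbers)
y_e = center_signal(mu, mean_events_train=12.0, mean_people_train=4.0)
h = [0.5, 0.5]             # hypothetical two-tap averaging filter
p_hat = convolve_causal(y_e, h)
```

The sketch only shows the shape of the computation in Eq. (3); how the filter coefficients are obtained is outside its scope.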

3.2.2 Sequential processing algorithm

The packet-content dependent approach analyses the sequence of logged packets to estimate the number of people at a bus stop. The input to the algorithm is the sequence of packets sniffed in each time window. Before analyzing the relevant features of the probe request packets, the input data are filtered based on the Received Signal Strength Indicator (RSSI) contained in the sniffed packets. Only packets with an RSSI greater than −78 dB are considered, in order to eliminate possible extraneous sources of noise.

The sequential processing algorithm is used to analyze the most relevant features of the probe request packets across N∆t time windows, each of length ∆t. To estimate the number of people at the bus stop, the algorithm counts the number of unique MAC addresses in each time window. Because MAC address randomization (APPLE Inc. 2021; Fenske et al. 2021) lets a single device appear under several addresses, a MAC address de-randomization approach is applied first. This approach considers the supported rates and RSSI (see Table 2) of each device within a time window. If the supported rates are unchanged and the RSSI is within ±4 of the RSSI of a previously detected MAC address within the same time window, then the device is considered to be the same as the previously detected device, as shown in Eq. (4).

$$MAC_{j} = \begin{cases} MAC_{i}^{a}, & \text{if } SR_{i}^{a} = SR_{j} \text{ and } RSSI_{i}^{a} - 4 \le RSSI_{j} \le RSSI_{i}^{a} + 4 \text{ and } 1 \le a \le \mu \\ MAC_{j}, & \text{otherwise} \end{cases}$$
(4)

where:

  • \({MAC}_{i}^{a}\): the MAC address of any detected device a in time window ti;

  • ti,tj: successive time windows where j = i + 1;

  • µ: number of probe request events captured in time window ti;

  • \(RSSI_{i}^{a}\) and \(SR_{i}^{a}\): RSSI value and supported rates of the corresponding detected device (see Table 2) in time window ti;

  • MACj: the MAC address of a device being currently checked in time window tj;

  • RSSIj and SRj: RSSI value and supported data rates at the time of detection for the device with MAC address MACj.

The people count is then computed as the number of unique devices in each time window ti, as shown in Eq. (5).

$$\hat{P} = \sum_{a=1}^{\mu} \begin{cases} 1, & \text{if } MAC^{a} \text{ has not been seen so far in the time window} \\ 0, & \text{if } MAC^{a} \text{ has been seen before in the time window} \end{cases}$$
(5)

where:

  • \(\widehat{P}\): Estimated count of people in time window ti;

  • \(\upmu\): number of events captured in time window ti;

  • \({MAC}^{a}\): MAC address of each detected event after applying Eq. (4).
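The RSSI filter, the de-randomization rule of Eq. (4), and the unique-device count of Eq. (5) can be sketched together for a single time window as follows. The packet layout (keys `mac`, `rssi`, `rates`), the function names, and the sample packets are hypothetical. The sketch also assumes that each new packet is compared with the first packet seen from every already-accepted device, a detail the text does not specify.

```python
RSSI_THRESHOLD = -78  # packets at or below this value are discarded
RSSI_TOLERANCE = 4    # the +/- range used in Eq. (4)


def _same_device(known, packet):
    """Return True if the packet is attributed to an already-seen device."""
    if known["mac"] == packet["mac"]:
        return True
    # Eq. (4): identical supported rates and RSSI within the tolerance.
    return (known["rates"] == packet["rates"]
            and abs(known["rssi"] - packet["rssi"]) <= RSSI_TOLERANCE)


def count_devices(packets):
    """Estimate the number of distinct devices in one time window (Eq. 5)."""
    devices = []  # first packet seen from each distinct device
    for packet in packets:
        if packet["rssi"] <= RSSI_THRESHOLD:
            continue  # too weak, treated as extraneous noise
        if any(_same_device(known, packet) for known in devices):
            continue  # already counted
        devices.append(packet)
    return len(devices)


packets = [
    {"mac": "a1", "rssi": -60, "rates": (1, 2, 5.5, 11)},
    {"mac": "b7", "rssi": -62, "rates": (1, 2, 5.5, 11)},  # merged with a1 by Eq. (4)
    {"mac": "c3", "rssi": -70, "rates": (6, 12, 24)},      # new device
    {"mac": "a1", "rssi": -50, "rates": (1, 2, 5.5, 11)},  # same MAC as before
    {"mac": "d9", "rssi": -85, "rates": (6, 9)},           # dropped by RSSI filter
]
count = count_devices(packets)
```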

3.2.3 Convolutional and Recurrent Neural Networks (CRNN)

The input data are in the form of probe request packets with features (or columns) as listed in Table 2. These data are further partitioned into distinct time windows, a process illustrated in Fig. 3. The size of the input data is not fixed, but variable, as a result of fluctuations in the number of probe requests received in different time windows. These fluctuations are contingent upon several elements, including the number of nearby devices, and factors dependent on device manufacturers and models. The current operational state of the device—be it idle mode, screen-on status, or Wi-Fi connectivity—likewise contributes to these fluctuations (Pintor and Atzori 2021).

In order to fix the size of the input data, transmit it as sequential data to the adjacent layers, and then process the sequential data, a combination of CNN and RNN is used. CNN is good for feature extraction due to its ability to identify spatial patterns, while RNN is adept at handling sequential data such as time series, making this combination well-suited for capturing both spatial and temporal aspects inherent in predicting people counts over time (Hasan et al. 2019). LSTM cells are used for the RNN layer because of their capabilities in terms of long-term memory and their better handling of the vanishing gradient problem (DiPietro and Hager 2020; Yu et al. 2019). A CRNN algorithm (Liang et al. 2018; Wiest 2017) is used with a first convolutional layer that fixes the input data size for the following layers by applying a set of filters to each collected packet, as described in the algorithm reported in Table 3. The number of filters is denoted by Fc and is one of the parameters that is optimized during the model training phase. This first convolutional operation is shown in Fig. 4.

Table 3 Convolutional layer operation algorithm

Fig. 4 Convolutional input layer

The convolutional filters at the CNN layer typically aid in capturing patterns within the features found in each probe request packet. Any of these filters may be of use in capturing conditions or patterns relating to RSSI values, SSID patterns, patterns with MAC address values and so on. These conditions help in identifying how many packets come from unique client devices belonging to passengers at the bus stop. Then by summing the outputs, the final vector for each window is prepared and concatenated to produce the final output matrix of fixed size N∆t × Fc, with the rows containing data from the different time windows, and the columns representing the filters that were applied to the data from the different time windows.
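The way a variable number of packets is turned into a fixed-size N∆t × Fc matrix can be illustrated with the sketch below. It is not the algorithm of Table 3: it assumes that each filter is a weight vector applied to the numeric features of a single packet, that a ReLU is applied to each response before summing, and that packets are already encoded as lists of numbers. The weights and packets are made up.

```python
def conv_layer(windows, filters):
    """Map windows of packets to a fixed-size matrix.

    windows: list of N windows, each a list of packets (lists of numeric features).
    filters: list of F_c weight vectors, one weight per feature.
    Returns an N x F_c matrix as a list of rows.
    """
    output = []
    for packets in windows:
        row = [0.0] * len(filters)
        for packet in packets:
            for f, weights in enumerate(filters):
                response = sum(w * x for w, x in zip(weights, packet))
                row[f] += max(0.0, response)  # ReLU here is an assumption
        output.append(row)
    return output


windows = [
    [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]],  # three packets in the first window
    [[2.0, 1.0]],                          # one packet in the second window
    [],                                    # no packets in the third window
]
filters = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # F_c = 3 hypothetical filters
matrix = conv_layer(windows, filters)
```

Whatever the number of packets per window, every row has Fc entries, which is what allows the rows to be fed one by one to the recurrent layer.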

Each vector produced by the convolutional layer for each time window is then sequentially provided as input to a recurrent layer (composed of LSTM cells) that will learn the dynamics of how the patterns corresponding to the convolutional filters change over time, to produce a final estimation of the average stop occupancy via a final Rectified Linear Unit (ReLU; footnote 9) neuron, as shown in Fig. 5. The final output contains the model’s estimation of the number of people at the bus stop during the visibility period.

Fig. 5 Long Short-Term Memory (LSTM) and Rectified Linear Unit (ReLU) layer
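For readers who want the recurrent stage spelled out, the standard LSTM update (Hochreiter and Schmidhuber 1997) followed by a single ReLU output can be written as below. The notation is an assumption introduced here for illustration and is not the paper's own: \(x_t\) is the row of the N∆t × Fc matrix for time window t, \(h_t\) and \(c_t\) are the hidden and cell states of size Ln, \(\sigma\) is the logistic sigmoid, \(\odot\) is the element-wise product, \(v\) and \(\beta\) are the weights and bias of the output neuron, and \(\tilde{R}\) denotes the estimated average occupancy. Reading the output from the last hidden state is likewise an assumption.

$$\begin{aligned} i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\ f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\ o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\ \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\ h_t &= o_t \odot \tanh(c_t), \\ \tilde{R} &= \max\left(0,\; v^{\top} h_{N_{\Delta t}} + \beta\right) \end{aligned}$$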

The only parameters for this architecture are the number of filters Fc and the number of LSTM cells Ln, that are optimized during the “Training campaign” using stochastic gradient descent optimization (Hardt et al. 2016).

4 Results

Among the suitable devices identified (Table 1), the one selected was the WiPy 3.0 module in combination with the PyTrack 2.0 X electronic board equipped with a microcontroller. WiPy 3.0 was used to capture Wi-Fi probe request broadcast packets. Probe requests are management frames sent by devices equipped with a Wi-Fi interface. The device was programmed to emit light signals as follows:

  • Blue, when in operation;

  • Green, when sending an information packet to MongoDB;

  • Red, when sending a packet to MongoDB fails. In this case, the data collected up to that point are saved in memory and the device is forced to restart. Once restarted, before starting to capture other probe request packets, the device sends everything it has previously saved in the memory.

The operations performed on the device are shown in Fig. 6. For every ten packets received, a JavaScript Object Notation (JSON) file is produced, which is sent to the MongoDB database. This is done to reduce the number of database connection requests the device makes.
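The batching and failure handling described above can be sketched as follows. This is a simplified model, assuming a hypothetical `send` callable that returns True on success; the class name, the `led` attribute, and the in-memory backlog are illustrative stand-ins, and the forced device restart is not modelled.

```python
import json

BATCH_SIZE = 10  # packets per JSON payload, as described in the text


class PacketBatcher:
    """Collect sniffed packets and upload them in batches of BATCH_SIZE."""

    def __init__(self, send, batch_size=BATCH_SIZE):
        self.send = send          # hypothetical uploader: JSON string -> bool
        self.batch_size = batch_size
        self.buffer = []          # packets waiting for the next batch
        self.backlog = []         # payloads whose upload failed
        self.led = "blue"         # blue: in operation

    def add(self, packet):
        self.led = "blue"
        self.buffer.append(packet)
        if len(self.buffer) >= self.batch_size:
            payload = json.dumps({"packets": self.buffer})
            self.buffer = []
            self._upload(payload)

    def _upload(self, payload):
        if self.send(payload):
            self.led = "green"    # green: packet sent to the database
        else:
            self.led = "red"      # red: sending failed, keep the data
            self.backlog.append(payload)

    def flush_backlog(self):
        """Resend everything saved earlier, before capturing new packets."""
        pending, self.backlog = self.backlog, []
        for payload in pending:
            self._upload(payload)


sent = []


def fake_send(payload):
    """Stand-in for the real upload; records the payload and reports success."""
    sent.append(payload)
    return True


batcher = PacketBatcher(fake_send)
for n in range(25):
    batcher.add({"mac": "02:00:00:00:00:%02x" % n, "rssi": -60})
```

With 25 packets, two payloads of ten packets are uploaded and five packets remain buffered, so the number of database connections is a tenth of the number of packets.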

Fig. 6 Operations performed on device

All the cloud functions are stored and run on a Raspberry Pi 3 Model B. It was chosen because the resources available for this research are limited, and faster, more expensive hardware would have been over-dimensioned for the task. Moreover, all the relevant software was designed to compile and work on any other Linux-based machine, which would, if necessary, allow migration to a more powerful machine with little effort. CherryPy is used to handle Representational State Transfer (REST) requests for all the components of the architecture. This makes the system highly scalable, since all transfers of data are handled using standard JSON files and GET/POST requests (GET and POST are the two most common HTTP request methods).

Data are stored in MongoDB, an open-source NoSQL document database used mainly for high-volume data storage (Győrödi et al. 2015), which also provides scalability and flexibility. An overview of the cloud system is shown in Fig. 7.

Fig. 7 Cloud system overview

For data storage in MongoDB, three types of collection were set up: (1) input collection, storing all the raw data from the sniffing boards with the addition of a timestamp; (2) output collection, storing the results evaluated by the counting algorithm together with a timestamp; and (3) device collection, including information about all the active sniffing devices.
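As an illustration of the three collections, the sketch below builds one example document for each as a plain Python dictionary. All field names (device_id, packets, stop_id, people_count, active) and the identifier "board-01" are hypothetical, since the actual schema is not given here; no database driver is involved.

```python
from datetime import datetime, timezone


def utc_now():
    """Timestamp added when a document is stored (ISO 8601, UTC)."""
    return datetime.now(timezone.utc).isoformat()


def input_document(device_id, packets):
    """Input collection: raw data from a sniffing board plus a timestamp."""
    return {"device_id": device_id, "packets": packets, "timestamp": utc_now()}


def output_document(stop_id, people_count):
    """Output collection: result of the counting algorithm plus a timestamp."""
    return {"stop_id": stop_id, "people_count": people_count, "timestamp": utc_now()}


def device_document(device_id, stop_id, active=True):
    """Device collection: information about an active sniffing device."""
    return {"device_id": device_id, "stop_id": stop_id, "active": active}


raw = input_document("board-01", [{"mac": "da:a1:19:00:ff:01", "rssi": -71}])
estimate = output_document(3287, 4.0)
board = device_document("board-01", 3287)
```

Keeping raw inputs and computed outputs in separate collections means the counting algorithm can be rerun on stored inputs without touching earlier results.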

Data were collected both via the device installed on the bus stop bench and manually, at the Ferrucci bus stop 3287 on Corso Vittorio Emanuele II in Turin (Fig. 8), according to the schedule given in Table 4. On March 5th, artificially generated data were collected in order to provide at least one dataset from a controlled environment: a room from which all Wi-Fi capable devices apart from the test devices had been removed. The data for March 12th, 2022 and April 22nd, 2022 were collected at bus stops 624 and 39, respectively, to check that the model works in different contexts and so avoid overfitting (Gupta and Sharma 2022) to the selected stop.

Fig. 8 Ferrucci bus stop. Source: screen capture from Google Maps

Table 4 Schedule of the manual data collection

More than 22,700 records of sniffed packets and more than 1700 ground truth labels were obtained from the collected data. The mean, minimum and maximum stop occupancy in the ground truth data for the different bus stops are listed in Table 5. Stop number 39 exhibits a crowding pattern that is notably different from the other stops in the dataset. This variation in crowding patterns provided diversity for the training and testing of the model.

Table 5 Distribution of collected ground truth values

4.1 Data analysis and modelling

Converting the input data into a processable format allows feature selection and preparation for the analysis. Data processing algorithms need numeric data as input, and therefore sniffed packets containing information in string format, such as the SSID name and the MAC address, must be converted before they can be processed. In addition, the literature shows that for some devices, MAC addresses are not randomized across all their fields, but only in a subset of them (Purvis and Dementyev 2020). So that the neural network can learn this behaviour, the MAC address was split into a vector of six numbers, as described in Table 6.
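A minimal sketch of this conversion, assuming the six numbers are simply the six octets of the address read as integers (any scaling applied before training is not shown). The second helper is an addition not taken from this study: it checks the locally administered bit of the first octet, which randomized addresses normally set.

```python
def mac_to_features(mac):
    """Split a MAC address string into its six octets as integers (0-255)."""
    parts = mac.replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError("expected six octets, got %r" % mac)
    return [int(part, 16) for part in parts]


def is_locally_administered(mac):
    """True if the locally administered bit (0x02 of the first octet) is set."""
    return bool(mac_to_features(mac)[0] & 0x02)


features = mac_to_features("DA:A1:19:00:ff:01")         # [218, 161, 25, 0, 255, 1]
randomized = is_locally_administered("DA:A1:19:00:ff:01")  # True
```

Keeping the octets as separate inputs lets a model treat the stable and the randomized parts of an address differently.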

Table 6 Pre-processed features

Using the collected data, the algorithms described in the methodology were implemented and tested iteratively with different parameters, to select the parameters that produced the best results. The results obtained in terms of mean absolute error are shown in Table 7. Some studies in the literature have suggested the use of the Mean Absolute Error (MAE) over the Root Mean Square Error (RMSE) (Brassington 2017; Willmott and Matsuura 2005), especially for datasets with errors of varying magnitude and potential outliers. Therefore, to maintain consistency among all tested methods, the MAE was chosen as the evaluation metric. It can be seen that the best-performing algorithm is the CRNN, which has the lowest MAE among the compared algorithms. The CRNN algorithm was therefore selected for further fine-tuning with a larger training set, varying its parameters to find the best result.
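The difference between the two metrics can be seen in a small worked example. The counts below are invented for illustration and are not taken from the dataset; the last prediction is a deliberate outlier.

```python
import math


def mae(predicted, actual):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root Mean Square Error: squaring gives large errors more weight."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


actual = [3, 5, 2, 8, 4]
predicted = [4, 5, 3, 8, 14]       # absolute errors: 1, 0, 1, 0, 10

mae_value = mae(predicted, actual)    # 12 / 5 = 2.4
rmse_value = rmse(predicted, actual)  # sqrt(102 / 5), about 4.52
```

A single error of 10 persons pulls the RMSE to almost twice the MAE, which is why the MAE is easier to read as a typical miscount when outliers are present.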

Table 7 Performance evaluation of the tested algorithms during the preliminary analysis

After selecting CRNN as the best algorithm for our use case, we trained models with different numbers of neurons for each layer and different values of Fc, Δt and NΔt in order to determine which performed best. Two parameters of the model play an important role in handling devices that come close to the bus stop but do not remain there: the length of the time window ∆t and the number of time windows N∆t that form the visibility period. If the time window is too short, devices at the stop may be missed; if it is too long, there is a greater likelihood of counting devices that are simply passing by the stop, and there are also risks associated with MAC address randomization. The reference time window duration values were established based on prior research (Matte 2017) and on in-house experiments on the inter-arrival times of probe requests from client devices. Variations from the reference value, which ranged from 30 to 60 s, were explored during the hyperparameter optimization phase of model training, to account for fluctuations caused by device usage and model differences. The use of multiple time windows helps to smooth out minor fluctuations caused by passing devices.
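The relationship between ∆t, N∆t and the visibility period can be sketched as a simple bucketing of packet timestamps. The function name, the (timestamp, payload) tuple layout, and the default values are assumptions chosen for illustration; timestamps are in seconds.

```python
def split_into_windows(packets, end_time, delta_t=30, n_windows=10):
    """Group packets from the last delta_t * n_windows seconds into time windows.

    packets: iterable of (timestamp_seconds, payload) tuples.
    Returns a list of n_windows lists, oldest window first.
    """
    start = end_time - delta_t * n_windows
    windows = [[] for _ in range(n_windows)]
    for timestamp, payload in packets:
        # Packets outside the visibility period are ignored.
        if start <= timestamp < end_time:
            index = int((timestamp - start) // delta_t)
            windows[index].append(payload)
    return windows


packets = [(-1, "too old"), (0, "a"), (29, "b"), (30, "c"), (299, "d"), (300, "too new")]
windows = split_into_windows(packets, end_time=300)  # visibility period: 0-300 s
```

A device that only passes by tends to appear in one or two windows, whereas a waiting passenger's device appears across many, which is the signal the recurrent layer can exploit.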

The data at hand enabled us to train more than 40 models, each using distinct parameter values. We identified the best model as the one whose parameter values yielded the lowest error on the testing dataset, as presented in Table 8. Although more than 40 configurations were tested, Table 8 reports only a few selected configurations, along with the configuration that produced the best result.

Table 8 Best parameter values for the neural network training

The learning curve on the testing set for the best model, which has a mean absolute error of 1.14 (Fig. 9a), shows that the algorithm improves its performance during training without overfitting. The final distribution of the error (computed as error = prediction − real) is shown in Fig. 9b.

Fig. 9 a Learning curve and b final error density distribution on the testing set

The final algorithm achieves an MAE of ~ 1.2 persons on the testing set. Its visibility period (the period taken into account in generating the people count) has length Δt × NΔt = 5 min, based on Δt = 30 s and NΔt = 10, as reported in Table 8. As can be seen from the density plot for the error (Fig. 9b), the algorithm has a positive bias, so it tends to overestimate the number of people at the bus stop. This can also be seen in Fig. 10, which shows the output over time for one of the experiments. The tendency to overestimate might be due to a number of factors, such as the model requiring a larger training dataset to learn underlying patterns, or practical considerations, such as passengers carrying multiple devices. Overestimation caused by passengers carrying multiple devices is difficult to overcome within the scope of this study, especially if the devices are of different types (smartphones, tablets, and laptops) and if more than one device is in use at around the same time.
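The notion of a positive bias can be made concrete with the signed error defined earlier (error = prediction − real): the MAE measures the size of the errors, while the mean signed error shows their direction. The numbers below are invented for illustration and are not the study's results.

```python
def signed_errors(predicted, real):
    """Per-sample error, using the convention error = prediction - real."""
    return [p - r for p, r in zip(predicted, real)]


def mean_error(predicted, real):
    """Mean signed error: positive values indicate overestimation on average."""
    errors = signed_errors(predicted, real)
    return sum(errors) / len(errors)


def visibility_period_minutes(delta_t_seconds, n_windows):
    """Length of the visibility period, delta_t * n_windows, in minutes."""
    return delta_t_seconds * n_windows / 60


real = [5, 4, 7, 4]
predicted = [6, 4, 9, 3]                       # errors: +1, 0, +2, -1
bias = mean_error(predicted, real)             # 0.5, i.e. overestimation
period = visibility_period_minutes(30, 10)     # 5.0 minutes
```

A model can have a small MAE and still be biased, so both quantities are worth reporting.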

Fig. 10 Output in time for one of the testing datasets

4.2 Data visualisation through the dashboard

To facilitate the monitoring of crowding at bus stops, the dashboard has two main screens. The first is a "search interface" (Fig. 11) featuring a dropdown menu that lists all the cities and bus stops currently active in the system. Alternatively, the user may search for and filter bus stops by city name, bus stop name, or bus number. As shown in Fig. 12, the query is global, meaning that a number or a few letters can be entered into a single text box, and the system automatically searches all the fields.
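The global query can be sketched as a case-insensitive substring match across all searchable fields. This is an assumption about how such a search might behave, written in Python rather than in the dashboard's own framework; the field names and the example stops are hypothetical.

```python
def global_search(stops, query):
    """Return stops whose city, stop name, or any bus number contains the query."""
    needle = query.strip().lower()
    if not needle:
        return list(stops)  # an empty query lists every stop
    matches = []
    for stop in stops:
        haystack = [stop["city"], stop["stop_name"]] + list(stop["bus_numbers"])
        if any(needle in str(value).lower() for value in haystack):
            matches.append(stop)
    return matches


# Hypothetical entries, not stops from the study.
STOPS = [
    {"city": "City A", "stop_name": "Central Station", "bus_numbers": ["12", "58"]},
    {"city": "City A", "stop_name": "Market Square", "bus_numbers": ["7"]},
    {"city": "City B", "stop_name": "Harbour", "bus_numbers": ["12"]},
]

by_number = global_search(STOPS, "12")
by_name = global_search(STOPS, "mar")
by_city = global_search(STOPS, "city b")
```

Because every field is searched, the user does not need to say whether "12" is a bus number or part of a stop name.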

Fig. 11 Search interface (dropdown menu)

Fig. 12 Global search screen

Results are shown in real time and the user can select the entry of interest, which then loads on the next screen. This screen shows the people count with an indicator of how crowded the stop is. The indicator takes one of three colours: green, yellow, or red. Since stops differ in size, the transport operator can specify for each individual stop how numbers of people are mapped to colours. On the left there is a dynamic map with a location pin marking the position of the bus stop. Finally, a plot shows the situation over the preceding three hours. The overview screen is shown in Fig. 13. The use of Flutter allows the dashboard to be adapted easily to different screen characteristics and operating systems.
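The per-stop mapping from people count to colour can be sketched as two thresholds per stop. The threshold values, the stop identifiers, and the fallback defaults below are hypothetical; in practice they would be set by the transport operator.

```python
# Hypothetical defaults used when a stop has no specific configuration.
DEFAULT_THRESHOLDS = {"yellow_from": 5, "red_from": 15}

# Hypothetical per-stop settings chosen by the operator.
STOP_THRESHOLDS = {
    "small-stop": {"yellow_from": 3, "red_from": 8},
    "large-stop": {"yellow_from": 10, "red_from": 25},
}


def crowding_colour(people_count, thresholds):
    """Map a people count to green, yellow, or red."""
    if people_count >= thresholds["red_from"]:
        return "red"
    if people_count >= thresholds["yellow_from"]:
        return "yellow"
    return "green"


def colour_for_stop(stop_id, people_count):
    """Look up the stop's thresholds, falling back to the defaults."""
    thresholds = STOP_THRESHOLDS.get(stop_id, DEFAULT_THRESHOLDS)
    return crowding_colour(people_count, thresholds)


small = colour_for_stop("small-stop", 9)   # "red"
large = colour_for_stop("large-stop", 9)   # "green"
other = colour_for_stop("unknown", 9)      # "yellow" (default thresholds)
```

The same count of nine people is thus shown as crowded at a small stop and as uncrowded at a large one.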

Fig. 13 Dashboard overview

5 Discussion and conclusions

Automated people counting at bus stops has several practical benefits for public transport management. It provides information about the number of people waiting for buses, which can help to ensure efficient resource allocation based on demand. Focusing on those waiting at stops, rather than on those who have actually boarded buses, can help to better understand specific stop-related demand. It can also offer valuable insights into passenger behaviours relating, for example, to waiting times, seasonal fluctuations, and boarding tendencies. This kind of information is crucial for optimizing service planning and resource distribution. An accurate count of passengers can guide informed decisions about resource allocation, even within budget constraints. It assists in selecting appropriate locations for resource installation, considering costs and maintenance needs.

For instance, with limited budgets, decisions about installing digital displays for real-time information or providing shelter for waiting passengers can have significant implications for installation and maintenance costs. These decisions can be informed by data from people-counting systems and by an understanding of crowding patterns at stops. Decisions about bus stop design can in turn affect bus dwell times and contribute to reducing congestion (Tirachini 2014). Overall, automated passenger counting at bus stops contributes to more effective public transport management and informed decision making.

To this end, the goal of this study was to estimate the number of people waiting at a bus stop, with Wi-Fi probe request packets emitted by modern ICT devices as source data, and using deep learning (DL) methodologies. The system that was developed has a number of advantages: it employs a robust, low-cost device that is easy to install, requires little maintenance, and makes additional bulky devices unnecessary. The best-performing DL model in the current study, a Convolutional Recurrent Neural Network (CRNN) with a Long Short-Term Memory (LSTM) architecture for the Recurrent Neural Network (RNN) component, predicted the number of people waiting at the stop with a mean absolute error of 1.2 persons. This is a promising preliminary outcome, given the complexity of the experiment.

However, there are factors that can adversely affect the performance of the algorithm: not everyone owns a Wi-Fi capable device; the Wi-Fi interface may not be enabled; multiple Wi-Fi devices may be linked to the same user; and transmissions may go undetected when mobile devices pass through an area quickly, as also highlighted by Li et al. (2020). Additionally, environmental factors, such as noise and uncontrolled surroundings, can negatively impact the system's performance. In order to reduce the impact of MAC address randomization and background noise, and to obtain an initial model capable of operating in a more controlled environment, the main bus stop in this study was located at some distance from a road intersection, providing a reference case for future applications in more complex environments. Other bus stops with a more complex environment were also used in the course of this study, including one in front of the city's central railway station, further enriching the breadth of the considered scenarios. Nevertheless, it is important to recognize that most of the data collection was carried out at a single location, which may limit the generalizability of the results.

A big challenge in using the DL approach is obtaining a ground truth dataset that is sufficiently large to train the model. The "People Counter" app was designed to make manual counting more efficient and less resource-intensive, and to remove the need to translate manually collected data into digital format. The reliability and efficiency of this app allowed us to devote fewer resources to counting, and to increase the quantity and the accuracy of the collected data. Previous studies on automatic people counting evaluated their systems with limited numbers of people, such as up to 10 people (Choi et al. 2022; Ibrahim et al. 2019), or up to 5 people in an indoor environment (Liu et al. 2019). In contrast, the system evaluated in the present study was tested in an open environment with up to 30 people.

A further advance over previous studies, intended to avoid compromising the reliability of the DL models, was the extended duration of the data collection compared, for example, with Oransirikul et al. (2014), who collected data for only 70 min, with 60 min for counting people at the bus stop. The reason is that limited data collection can create an imbalanced data distribution across count classes and restricts the amount of training data available for deep learning models (Ibrahim et al. 2019). However, making valid comparisons of the accuracy of our proposed people-counting system against that of existing crowd counting systems is far from straightforward, given the different testing approaches involved. Using real rather than synthetic data is also an advantage for balancing the training phase during model building, allowing an increase in accuracy, precision, recall, and F1 score as well as lower rates of false positive and false negative predictions (Rankin et al. 2020). The quality of synthetic data also depends on the source data, which can introduce bias and miss random behaviours of real-world data. Our study was able to confirm the system's accuracy using over 14 h of manual counting data as ground truth, with no use of synthetic data.

Given the various limitations mentioned above that are inherent in Wi-Fi sniffing, a significant factor in improving our proposed system will be the amount of data collected for training the neural network. The data must be acquired over a long period, at different times of the day, and at various types of stops. Territorial variables, such as the classification of stops based on the number of people present and the type of territory, may potentially be used as inputs to enhance the performance of the network. Potential future research avenues include testing this system in a variety of scenarios, including counting passengers at different types of stops, and extending it to transport systems other than buses, such as trams, metros, and train stations. Future research could also explore using real-time passenger counts at bus stops to refine existing models that predict passenger arrival times, queue lengths (Jee et al. 2023) and patterns. This could facilitate long-term planning, route scheduling, and the optimization of headways and timetables (Olivo et al. 2019). In future work, we could also consider associating the measurements of our system with real-time bus passing times to monitor additional metrics.

Finally, we may assume that wireless technologies will be subject to future developments, in the same way that MAC address randomization was introduced a few years ago, and it is therefore important to keep abreast of new technologies and trends with a view to proposing automated counting systems that are alternatives to the well-established but more expensive camera-based systems.