On cleaning strategies for WiFi positioning to monitor dynamic crowds

Monitoring open crowded areas is fundamental for policy makers to set up the proper measures for people's security and safety. Different techniques have been developed to tackle this issue. The most relevant approaches currently available to estimate the number of attendees are based on the size of the area hosting the event, the count of people passing through specific points, and the employment of satellite images. All these techniques roughly estimate static crowds, and they may be limited by different factors such as the availability of satellite images, the cost of dedicated unmanned vehicles, and the capability to set up multiple counting points. In order to fill this gap, a tool to monitor dynamic crowds, based on WiFi positioning, is presented. The tool allows not only the assessment of the total number of people attending an event, but also the monitoring of their spatiotemporal distribution. In particular, the impact of the cleaning strategies on both the estimated number of participants and their spatiotemporal distribution is analyzed. The proposed approach is demonstrated using real data collected during the JRC Open Day 2016. The results show the need for a clear strategy to identify real users in order to avoid misleading results. Moreover, a proper setting of the thresholds used for the identification criteria is required. Such thresholds need to be set according to the dimension of the site, the geography of the WiFi network, and the duration of the event.


Introduction
Monitoring open crowded areas is fundamental for policy makers to set up the proper measures for people's security and safety (Doig 2009; Oberschall 1973; Nardo 1985; Lohmann 1994); of foremost importance are the number of people (Jacobs 1967; Krewson 2012) and their distribution in time and space. In Cariveau (2006) and Rabaud and Belongie (2006), a review of the common techniques adopted for estimating the number of people attending an event is presented. In McPhail and McCarthy (2004) and Yip et al. (2010), the estimation of the number of people participating in a demonstration is attempted. Many different techniques have been proposed to estimate the size of a crowd. The most relevant techniques currently available to assess the number of people attending an event are based on the following:
- the size of the area hosting the event (Jacobs 1967; Krewson 2012): the total number of participants is obtained by multiplying the area by a factor depending on the season during which the event takes place;
- the actual count of the people passing through specific points (Watson 2011);
- the employment of satellite images and density analysis (Sirmacek and Reinartz 2011; Wallace and Parlapiano 2017).
All the abovementioned approaches are mainly used for the estimation of static crowds, and they are limited by different factors such as the availability of satellite images, the cost of images from dedicated unmanned vehicles, and the capability to set specific counting points. For dynamic crowd monitoring, several approaches are currently in use; each of them presents specific benefits and limitations. Two of the most common approaches for dynamic crowd monitoring in outdoor scenarios are mentioned in the following. One of the most widely adopted is image-video processing supported by a dense network of cameras (Zhan et al. 2008; Li et al. 2015). Its performance depends on the placement of the cameras, the resolution of the acquired images, the characteristics of the adopted image processing algorithm, and the applied tracking techniques. Its main limitations are possible camera occlusion and weather and environmental conditions (e.g., rain, fog, smoke bombs, and low light), which may limit the applicability of the approach. A comprehensive survey on crowd monitoring based on video and images is available in Lamba and Nain (2017). In Yuan et al. (2013) and Yuan (2014), the authors propose an RF-based crowd density estimation for indoor scenarios, using mobile phones, together with considerations on its extension to outdoor scenarios. This approach does not present the limitations of the camera-based one and it provides excellent performance, but it requires data from telecom providers and faces strong privacy constraints, which may limit its application (EC-GDPR 2016).
In this paper, a tool to monitor dynamic crowds is presented; such a tool allows not only the estimation of the total number of people attending an event, but also their spatiotemporal distribution.
The capability to locate and track users exploiting WiFi data has been already proven in several works (Bobescu and Alexandru 2015;Kotaru et al. 2015;Biswas and Veloso 2010). The approach is based on two implicit assumptions (Petre et al. 2017): everyone uses a smartphone whose WiFi is enabled all the time. Starting from these two assumptions, the WiFi positioning and tracking exploit the fact that smartphones repeatedly broadcast probe requests to identify known networks. The probe request contains the device unique identifier: the media access control (MAC) address (Alessandrini et al. 2017). Therefore, in order to be detected, a smartphone does not have to be connected to a WiFi network, but it just needs to be within the range of a WiFi node, when the probe request is sent. Having said that, one can easily imagine that by carefully dislocating WiFi nodes within a given area, it is possible to collect MAC addresses of most of the users passing through that area and, through a proper processing, to track these users.
Nowadays, the applications of WiFi positioning and tracking are numerous and growing. They include the analysis of visitor movements and flows (queue times, dwell times, wait times, and first-time/repeat visitors); the implementation of geofencing systems to define boundaries around areas of interest, triggering alerts when registered mobile devices enter or exit them; and the optimization of the deployment of security and safety assets during events that gather critical masses of people.
Obviously, the exploitation of WiFi data is not the only available solution to track users. Among the other techniques, global positioning system (GPS) tracking is the most common and extensively used. Nevertheless, differently from GPS tracking, WiFi tracking is feasible also indoors (Mingkhwan 2006). Moreover, whereas GPS data are owned by telecommunications providers, an ad hoc network of WiFi nodes is both cheap and easily deployable, allowing direct collection of the necessary data. On the other hand, one of the main drawbacks of WiFi tracking is the need of a pre-processing, namely "data-cleaning," which is necessary to identify actual users, but it may limit the real-time application of the method. Another key aspect that favors a tracking method rather than another is how it complies with privacy rules.
Privacy-related aspects are a fundamental topic when it comes to locating people through WiFi. Some relevant considerations about this subject have been made in Alessandrini et al. (2017), but the discourse on privacy and personal data has definitely grown within and outside the EU since the enforcement of the General Data Protection Regulation (GDPR) in May 2018. How does the GDPR affect the use of WiFi data? Primarily, companies will no longer be able to provide free WiFi to consumers in exchange for their browsing data: a very common practice for the profiling and targeting of potential customers. The GDPR also states that organizations must "implement appropriate technical and organizational measures to ensure a level of security appropriate to the risk" of the provided service. In other words, companies should implement best practices to secure publicly available WiFi networks (Meyer 2018). What is still unclear is how the GDPR will affect WiFi tracking. The whole discourse revolves around the definition of MAC address. In fact, the MAC address, apart from being unique, does not contain any information about the user. Nevertheless, since it identifies a given device, which in the case of a smartphone is always kept by the user, the MAC address could be considered as indirect personal information, or "pseudonymous" data (Maldoff 2016). "Pseudonymization" is a new concept introduced by the GDPR, consisting in the separation of data from direct identifiers so that linkage to an identity is not possible without additional information that is held separately. Much debate surrounds the extent to which pseudonymized data can be re-identified. This issue is of critical importance because it determines whether a processing operation will be subject to the provisions of the Regulation.
The detection of a device using WiFi relies on the uniqueness of its MAC address. Nevertheless, once a device is detected, one needs to identify whether that device corresponds to an actual user, i.e., to a physical person, or to a static or fake device. In fact, smartphones are not the only devices using WiFi. There are plenty of new-generation devices that exploit a WiFi connection, such as printers, hard drives, music players, surveillance cameras, and smart thermostats; even some digital photo frames use WiFi. Moreover, some devices use a fake MAC address when they broadcast probe requests to identify available WiFi networks. The identification of actual users is the main goal of the abovementioned "data cleaning" procedure. The data cleaning is immediately followed by the "positioning" of the detected users. The estimation of the user position within a given area can be done according to different methods, which can be based on the strength of the received WiFi signal, the number of times that a WiFi receiver registers a given user, etc.
This paper focuses on both cleaning and user positioning procedures, starting from the data collection of the 2016 Open Day of the Joint Research Centre (JRC) in Ispra (Italy) (Alessandrini et al. 2017) and comparing different approaches on the basis of the results described in Gioia et al. (2017). Data cleaning is conducted following different approaches: a first screening of the devices can be made using the number of times a device is registered; the second criterion is based on the number of base stations registering the presence of the user; and a third approach to identify real users is developed considering the dispersion of the estimated user position. The explored positioning strategies are the following: proximity principle based on received signal strength (RSS), proximity principle using the number of records, and weighted centroid approach.
The paper is organized as follows: the next section describes the cleaning (the "Cleaning approaches" section) and the positioning (the "Positioning approaches" section) approaches; the "Experiment" section briefly illustrates the data collection experiment; the results are summarized in the "Results" section, which is followed by concluding remarks in the "Conclusions" section.

Methods
In the following sections, the cleaning strategies and the localization algorithms are described.

Cleaning approaches
WiFi data include static devices (printers, personal computers, etc.), fake devices, and devices outside the test site. In order to remove the data related to devices that are not real users, three cleaning strategies are implemented in the measurement and position domains. The first one is based on the minimum number of times a device is recorded and the second one considers the number of stations by which a device is registered, whereas in the position domain, the distribution of the estimated user positions is analyzed.
A first screening of the devices can be made using the number of times a device is registered:
$$\text{real user}_i \ \text{if} \ num_{record,i} \geq th_{rec} \quad (1)$$
where $num_{record,i}$ is the number of records of the ith device and $th_{rec}$ is the threshold set for the detection of the real users.
The criterion based only on the number of records for a given device screens out devices which are registered for a very limited period of time, for example, devices which were in the proximity of the test field without entering it. However, this criterion does not exclude static devices (printers, PCs, etc.), which are registered for the whole duration of the experiment. Hence, another selection criterion is adopted, based on the number of base stations registering the presence of the given user. Using this approach, a device is classified as "actual user" if its identifier is recorded by a number of access points (APs) at least equal to a fixed threshold:
$$\text{real user}_i \ \text{if} \ num_{base,i} \geq th_{base} \quad (2)$$
where $num_{base,i}$ is the number of APs registering the presence of the ith device and $th_{base}$ is the threshold defining the minimum number of APs required to classify a device as a real user.
A different approach for identifying real users is developed considering the dispersion of the estimated user positions. The standard deviation of the user coordinates is computed and compared with a given threshold in order to evaluate the dispersion of the user position:
$$\text{real user} \ \text{if} \ \sigma(pos_{user}) \geq th_{pos} \quad (3)$$
where $\sigma(pos_{user})$ is the standard deviation of the estimated user positions and $th_{pos}$ is the threshold used to identify a real user. The criterion represented in Eq. 3 is very general: it can be applied to the single coordinates separately:
$$\text{real user} \ \text{if} \ \sigma(pos_{user,x}) \geq th_{pos,x}$$
$$\text{real user} \ \text{if} \ \sigma(pos_{user,y}) \geq th_{pos,y}$$
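The three screening criteria can be illustrated with a minimal Python sketch. The record layout (`(mac, ap_id)` tuples and per-device lists of East/North estimates), the function names, and the use of the population standard deviation are assumptions made for illustration; the text does not specify them.

```python
from collections import defaultdict
from statistics import pstdev


def clean_by_counts(records, th_rec, th_base):
    """Keep devices with >= th_rec records seen by >= th_base distinct APs.

    records: iterable of (mac, ap_id) tuples, one per registered record
    (hypothetical layout).
    """
    n_records = defaultdict(int)
    aps_seen = defaultdict(set)
    for mac, ap_id in records:
        n_records[mac] += 1
        aps_seen[mac].add(ap_id)
    return {mac for mac in n_records
            if n_records[mac] >= th_rec and len(aps_seen[mac]) >= th_base}


def clean_by_dispersion(positions, th_pos):
    """Keep devices whose East and North dispersions both reach th_pos.

    positions: dict mac -> list of (east, north) estimates in metres.
    Population standard deviation is an assumption of this sketch.
    """
    users = set()
    for mac, pts in positions.items():
        sigma_e = pstdev([p[0] for p in pts])
        sigma_n = pstdev([p[1] for p in pts])
        if sigma_e >= th_pos and sigma_n >= th_pos:
            users.add(mac)
    return users


records = [("aa", 1), ("aa", 2), ("aa", 3),
           ("bb", 1), ("bb", 1), ("bb", 1), ("cc", 2)]
positions = {"aa": [(0, 0), (100, 80), (200, 160)],
             "bb": [(5, 5), (5, 5), (5, 5)]}
by_counts = clean_by_counts(records, th_rec=3, th_base=2)
by_dispersion = clean_by_dispersion(positions, th_pos=50)
```

In this toy data set, device "bb" behaves like a static device (many records, one AP, no movement) and "cc" like a passer-by recorded once; only "aa" passes both screenings.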

Positioning approaches
In this section, the methodologies developed to compute the user position are described. Specifically, three different strategies have been adopted to track users:
- proximity received signal strength indicator (RSSI)-based (Manandhar et al. 2008; Dempster 2009);
- proximity occurrence-based;
- weighted centroid (WeC) (Wang et al. 2013; Borio et al. 2016; Gioia and Borio 2014a, b).
These approaches are commonly used with simultaneous measurements, i.e., the object to be localized is simultaneously connected to two or more nodes (Manandhar et al. 2008; Dempster 2009; Borio et al. 2016; Gioia and Borio 2014a). During the performed test, the objects to be localized are usually seen by only one node at a time; this is due to the size of the site where the experiment has been carried out and to the typical area coverage of the APs adopted for the experiment, together with their geographical placement within the site. Hence, the traditional algorithms have been modified to compute the position of the tracked object after accumulating measurements during a given time interval. The time interval used to estimate the user position is 3 min; this value has been selected considering the following factors:
- The size of the site, the geometry of the network, and the typical dynamics of the users: during the experiment, the mean distance between the APs is almost 250 m along the East direction and some 160 m along North; assuming that a pedestrian moves at approximately 4 km/h (1.1 m/s), about 200 m are covered in 3 min.
- The heterogeneity of the devices, in particular their different data rates. A fundamental element to set the value is the update rate of the measurements; the cumulative distribution function of the number of users as a function of the update rate is shown in Fig. 1. From the figure, it can be noted that 99% of the devices have an update rate higher than one measurement every 3 min.
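The accumulation of measurements over 3-min intervals, and the walking distance used to justify that interval, can be sketched as follows. The measurement layout `(timestamp_s, ap_id, rssi_db)` and the function names are hypothetical; only the 3-min interval and the 4 km/h walking speed come from the text.

```python
WINDOW_S = 3 * 60  # accumulation interval from the text, in seconds


def window_index(timestamp_s, start_s=0):
    """Index of the 3-min window that contains a timestamp (seconds)."""
    return int((timestamp_s - start_s) // WINDOW_S)


def group_by_window(measurements, start_s=0):
    """Group the measurements of one device by accumulation window.

    measurements: list of (timestamp_s, ap_id, rssi_db) tuples
    (hypothetical layout).
    """
    windows = {}
    for t, ap_id, rssi in measurements:
        windows.setdefault(window_index(t, start_s), []).append((ap_id, rssi))
    return windows


# Distance walked during one window at 4 km/h, in metres.
walked_m = 4000 / 3600 * WINDOW_S

grouped = group_by_window([(10, 1, -60), (170, 2, -70), (185, 2, -65)])
```

Each value of `grouped` is the list of measurements that a positioning algorithm would receive for one interval.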
In Fig. 2, the three positioning approaches are shown together with the relative inputs. The input of the occurrence-based positioning method is only the list of the stations which recorded the user presence in the time interval, whereas the other two approaches exploit also the power of the received signal.
As mentioned above, three approaches have been implemented: two are derived from the proximity concept and the third exploits the centroid concept.

Proximity RSSI-based
The first algorithm is based on the proximity principle and exploits the RSS measurements. The position of the tracked object is associated with the position of the station recording the signal of the user with the highest RSSI in the specific time period ΔT:
$$pos_u = pos_s^{MaxRSSI}$$
where $pos_u$ is the vector containing the user coordinates and $pos_s^{MaxRSSI}$ is the vector containing the coordinates of the station that registered the signal of the user with the maximum RSSI.
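A minimal sketch of the RSSI-based proximity rule is given below. The AP coordinates and the `(ap_id, rssi_db)` measurement layout are invented for illustration.

```python
def proximity_rssi(window, ap_coords):
    """Return the coordinates of the AP with the highest RSSI in the window.

    window: list of (ap_id, rssi_db) collected during one interval.
    ap_coords: dict ap_id -> (east, north) in metres (hypothetical values).
    """
    best_ap, _ = max(window, key=lambda m: m[1])
    return ap_coords[best_ap]


ap_coords = {1: (0.0, 0.0), 2: (250.0, 0.0), 3: (0.0, 160.0)}
window = [(1, -80), (2, -55), (2, -70), (3, -75)]
position = proximity_rssi(window, ap_coords)
```

Here the strongest measurement (-55 dB) comes from AP 2, so the user is placed at the coordinates of AP 2.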

Proximity occurrence-based
The RSS is strongly affected by multipath and fading phenomena; these effects are intrinsic characteristics of the propagation environment and they can amplify or reduce the received signal power. The multipath-induced variation of the RSS could lead to erroneous object localization; to mitigate this problem, an algorithm based on the proximity principle, but exploiting the number of times a user is recorded by a station during a specific time period ΔT, is proposed. In this case, the position of the tracked object is associated with the position of the node that registers the presence of the user the largest number of times.
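The occurrence-based variant can be sketched in the same way. The data layout and AP coordinates are again hypothetical, and the tie-breaking rule (first AP encountered) is a choice of this sketch, not something stated in the text.

```python
from collections import Counter


def proximity_occurrence(window, ap_coords):
    """Return the coordinates of the AP that recorded the device most often.

    window: list of (ap_id, rssi_db); the RSSI values are ignored.
    Ties are resolved in favour of the AP seen first (sketch assumption).
    """
    counts = Counter(ap_id for ap_id, _ in window)
    best_ap, _ = counts.most_common(1)[0]
    return ap_coords[best_ap]


ap_coords = {1: (0.0, 0.0), 2: (250.0, 0.0), 3: (0.0, 160.0)}
# One strong, possibly anomalous, record from AP 1 and three from AP 2.
window = [(1, -50), (2, -70), (2, -72), (2, -75)]
position = proximity_occurrence(window, ap_coords)
```

With this window the single strong record from AP 1 does not decide the outcome: AP 2 wins on the number of records, which is the robustness to RSS fluctuations that motivates the method.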

WeC
The third approach is the WeC, which is an extension of the proximity principle; in this case, the user position is a linear combination of the node coordinates. The mathematical expression of the WeC is that of Borio et al. (2016) and Gioia and Borio (2014a):
$$pos_u = \frac{\sum_i w_i P_{s,i}}{\sum_i w_i}$$
where $P_{s,i} = (x_i, y_i)$ is the vector containing the coordinates of the ith station and $w_i$ is the weight associated with the ith node. In this work, the weights are related to the RSSI of the received signal; in particular, a weighting function of $(RSSI)_i$ is adopted, where $(RSSI)_i$ is the RSS of the ith received signal expressed in dB.
The user position is obtained as the WeC of the nodes coordinates. If the time interval is reduced and only one measurement is obtained within the considered interval, the WeC solution converges to the RSSI-based proximity solution.
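The weighted centroid can be sketched as below. The weighting function used here, `10 ** (rssi / 10)` (conversion from dB to linear power), is an assumption chosen only to make the example runnable; it is not necessarily the weighting function adopted in this work. The AP coordinates and data layout are also hypothetical.

```python
def weighted_centroid(window, ap_coords):
    """Weighted centroid of the AP coordinates seen in one interval.

    window: list of (ap_id, rssi_db).
    ASSUMPTION: weights are the linear power 10 ** (rssi_db / 10).
    """
    weights = [10 ** (rssi / 10) for _, rssi in window]
    total = sum(weights)
    east = sum(w * ap_coords[ap][0] for w, (ap, _) in zip(weights, window)) / total
    north = sum(w * ap_coords[ap][1] for w, (ap, _) in zip(weights, window)) / total
    return east, north


ap_coords = {1: (0.0, 0.0), 2: (250.0, 0.0), 3: (0.0, 160.0)}
midpoint = weighted_centroid([(1, -60), (2, -60)], ap_coords)
single = weighted_centroid([(2, -60)], ap_coords)
```

Two equally strong measurements from APs 1 and 2 place the user halfway between them, while a single measurement returns the coordinates of the recording AP, i.e., the RSSI-based proximity solution.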
In Table 1, the three implemented positioning algorithms are summarized, together with their main pros and cons.

Experiment
In this section, the experimental setup is briefly described; a more comprehensive description of the experiment is available in Alessandrini et al. (2017).
Table 1 (surviving fragment): more dynamics; potentially affected by anomalous RSSI measurements; a proper weighting function needs to be adopted.
The experiment was carried out on 28 May 2016, during the JRC open day event (JRC Open Day 2016). Thanks to the large attendance (more than 7500 participants) and to its duration (some 11 h), the event was a unique opportunity to collect a large amount of data. For the experiment, 20 APs were placed within the JRC Ispra site, as shown in Fig. 3, where the AP locations and the relative identification numbers are reported. The figure also shows the theoretical coverage of the APs (the effect of obstacles limiting the range of the APs is not considered). The blue markers identify APs located in proximity to the main entrance: two devices (AP numbers 1 and 2) were placed close to the gate reserved for the general public, which remained closed until the official opening of the event (9:00 AM), whereas AP 3 was placed close to the gate reserved for the access of volunteers, which was open from 7:30 to 9:00. The yellow markers show the positions of the APs in the central area of the site; among the yellow markers, AP 7 was located at the Brebbia gate, which was reserved for volunteer entrance and was open only for 1 h and a half before the official opening. Finally, the red markers are used to identify APs close to the exit gate, located in the west part of the site. The event was a unique opportunity not only because of the huge amount of data collected, but also because of the nature of the site: with only one access and one exit for the general public, and with the security staff counting the accesses, the security report allows a partial verification of the results obtained (Sousa 2016). This gave us the possibility to perform a "qualitative" check of the results obtained through the proposed methods. The check could only be "qualitative" for two main reasons:
- Although the security office reported 7623 accesses during the entire event (Sousa 2016), there is no way to know how many of the participants held a mobile device with WiFi enabled. In fact, the two implicit assumptions "everyone uses a smartphone whose WiFi is enabled all the time" are not entirely true: among the participants, there were children and elderly people who might not have had a mobile, and it is reasonable to assume that there were people whose mobile had the WiFi switched off. Therefore, the estimation of the number of participants carried out with the proposed method can be compared with the actual count of accesses reported by the security service only through an assumption on the share of participants without a mobile or with WiFi disabled.
- Analogously, the participants were not actually tracked with a different system during the event; therefore, the results relative to their flow within the site can be only qualitatively interpreted according to the schedule of the event, the geography of the site, and the location of the different exhibitions.
However, having said that, the impossibility to perform a more accurate ("quantitative") validation of the results with the ground truth does not represent a limitation to the potential of the proposed method as a crowd-monitoring tool, especially when it is compared with the currently exploited techniques and with their accuracy.

Results
In this section, the results obtained combining the diverse cleaning strategies and positioning approaches are presented. The results are at first analyzed in terms of estimated number of real users; then, the concentration of the identified users is shown; finally, the movements of the users among the nodes of the network are considered.

Cleaning based on the number of base stations and records
The estimated number of real users using the criteria described by Eqs. 1 and 2 is discussed in this section. In Fig. 4, the estimated number of real users as a function of the threshold value is shown. In the upper box, the criterion based on the number of stations which have recorded a device is considered. From the analysis, it can be noted that the number of real users decreases exponentially: the grey line identifies the exponential trend. The relation between the estimated number of users and the minimum number of stations which recorded the device is:
$$\ln(\text{number of real users}) = a \cdot th_{base} + b$$
where a and b are the coefficients of the fit. The total number of unique devices is 51,096, which is reduced by some 11,000 if a user is defined as a device recorded by at least two stations (the number of devices recorded by at least two stations is 39,880). Increasing the threshold, the number of users is reduced; only 79 users were recorded by all the APs.
The estimated number of users, as a function of $th_{rec}$, is shown in the bottom graph of Fig. 4, where the number of users for 21 different values of $th_{rec}$ is explicitly reported. The number of users decreases when increasing $th_{rec}$, according to the function:
$$\ln(\text{number of real users}) = c \cdot \ln(th_{rec}) + d$$
with c = −0.622 and d = 11.2; the R-squared is 0.989. From the same graph, it can be noted that there were some 8000 devices (51,096 − 43,077) which were recorded only once. These records are probably due to devices generating fake identifiers before connecting to a node (Zebra Technologies 2015). It can also be noted that only 448 devices out of 51,096 were recorded more than 2000 times.
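The fitted relation between the number of users and the record threshold is a power law, which can be evaluated with a few lines of Python. The coefficients are those reported in the text; the function name is hypothetical, and the fitted curve only approximates the measured counts.

```python
import math

C, D = -0.622, 11.2  # fit coefficients reported in the text


def fitted_users(th_rec):
    """Number of users predicted by ln(N) = C * ln(th_rec) + D."""
    return math.exp(C * math.log(th_rec) + D)


# Equivalent power-law form: N = exp(D) * th_rec ** C
for th in (1, 50, 2000):
    print(th, round(fitted_users(th)))
```

Writing the fit as `exp(D) * th_rec ** C` makes explicit that doubling the threshold scales the predicted number of users by the constant factor `2 ** C`.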
As mentioned in the "Cleaning approaches" section, the two methods, individually considered, are not able to exclude all the devices which are not associated with real users. In order to enhance the exclusion capability of the algorithm, the two criteria need to be used together; the results obtained combining the two criteria are shown in Fig. 5. In the figure, the size of the square indicates the number of real users identified by the combination of the two criteria. From the graph, it can be noted that the squares with the largest size are in the upper left, indicating a low exclusion capability. The number of exclusions increases moving from top to bottom and from left to right; in the two corners, upper left and lower right, the extreme values recorded are 51,096 and 307, respectively. Using $th_{base}$ = 3 and $th_{rec}$ = 50, a total of 6009 potential real users is estimated. The estimate strongly depends on the thresholds, which have to be set taking into account both the geometry of the site and the duration of the phenomenon to monitor. In our case, considering the geometry of the site, three can be chosen as a reasonable threshold for the minimum number of stations, because three APs were installed in the proximity of the main gate.
Fig. 5 Estimated number of real users considering the two criteria together: a device is classified as a real user if it is recorded by a minimum number of stations and for a minimum number of times
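A grid of user counts over pairs of thresholds, similar in spirit to the one plotted in Fig. 5, could be produced with a sketch like the following. The `(mac, ap_id)` record layout and the function name are assumptions, and the toy records are invented.

```python
from collections import Counter, defaultdict


def count_grid(records, base_thresholds, rec_thresholds):
    """Number of devices passing both criteria for every threshold pair.

    records: iterable of (mac, ap_id) tuples (hypothetical layout).
    Returns a dict keyed by (th_base, th_rec).
    """
    records = list(records)
    n_rec = Counter(mac for mac, _ in records)
    aps = defaultdict(set)
    for mac, ap_id in records:
        aps[mac].add(ap_id)
    return {
        (tb, tr): sum(1 for mac in n_rec
                      if len(aps[mac]) >= tb and n_rec[mac] >= tr)
        for tb in base_thresholds
        for tr in rec_thresholds
    }


records = [("aa", 1), ("aa", 2), ("bb", 1), ("bb", 1), ("bb", 1), ("cc", 3)]
grid = count_grid(records, base_thresholds=[1, 2], rec_thresholds=[1, 3])
```

The entry with both thresholds equal to one counts every unique device, and the counts can only decrease as either threshold grows.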
In order to further investigate the impact of the minimum number of stations, the estimated number of users as a function of the minimum number of stations, for different values of the minimum number of records, is shown in Fig. 6. Each line represents the estimated number of users as a function of the minimum number of stations. From the graph, it is evident that in the considered case (at least 3 base stations and at least 50 records), there is only a limited impact when passing from 1 to 4 stations (i.e., the number of users is reduced only by 8). If the minimum number of records is reduced, a larger impact of the minimum number of stations can be observed. In all the curves, a plateau is present in the first part of the line; the plateau becomes longer as the minimum number of records is increased. This is reasonable because a user registered for a longer time is more likely to be registered by a larger number of stations. Hence, the dominant criterion in the considered case is the minimum number of records. If a different set of thresholds is used, the effect of the two criteria varies.

Cleaning based on the analysis of users movements
The potential number of users can be estimated with a criterion based on the user movements, as described in the "Cleaning approaches" section. In this section, the results relative to this criterion are discussed; the localization of the users has been performed using the three positioning strategies described in the "Positioning approaches" section.
In Figs. 7, 8 and 9, the estimated number of potential real users is shown; the positions of the users are computed using WeC, proximity occurrence-based, and proximity RSSI-based, respectively. In the upper boxes, the estimated number of users is obtained by applying the exclusion criterion to the East coordinates, while in the central boxes the criterion is applied to the North coordinates; finally, in the lower boxes, the number of users is obtained from the intersection of the users satisfying both criteria (East and North). In the three figures, the estimated number of users is plotted as a function of $th_{pos}$.
For all the cases, a linear relation (gray line) can be seen between the estimated number of users and the threshold values:
$$\text{number of real users} = m \cdot th_{pos} + q$$
The values of m and q for the different cases are reported in the corresponding table. The criteria allow the exclusion of the static devices and of the devices registered only one time. In fact, when the value of $th_{pos}$ passes from zero to 5 m, the total number of devices reduces from 51,096 to some 13,000 for the RSSI-based and occurrence-based positioning, and to some 15,000 for the WeC positioning.
The threshold for the screening criteria has to be set in accordance with the size of the site to monitor and with the geometry of the monitoring network. In the specific case, a threshold of 300 m (corresponding to about half of the mean distance among the nodes of the network) was adopted. From the results, it can be noted that the criteria based on the East and North components were too stringent and the results obtained are not consistent. For example, using the proximity RSSI-based algorithm (Fig. 9), a total of 6596 real users were estimated using the criterion based on the East coordinate, whereas only 4595 users satisfied the condition on the North component. Finally, only 1673 devices satisfied both criteria. This relatively high inconsistency is due to the fact that the criteria do not properly represent the user motion: usually, a user does not move only along a single axis or within a square.
In order to remove the dependence on the direction, a criterion based on the horizontal component (second equation in Eq. 5) is applied; the estimated number of real users is shown in Fig. 10. The mean horizontal distance among the stations is some 600 m; hence, 300 m has been identified as a suitable value of $th_{pos}$ for the horizontal case.
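The difference between the per-axis criteria and a direction-independent one can be illustrated with a short sketch. The horizontal dispersion is computed here as the root sum of squares of the East and North standard deviations; this definition, the use of the population standard deviation, and the sample track are assumptions made for illustration, since the horizontal criterion is not spelled out in the text available here.

```python
import math
from statistics import pstdev


def axis_dispersion(points):
    """Standard deviations of the East and North coordinates (metres)."""
    sigma_e = pstdev([p[0] for p in points])
    sigma_n = pstdev([p[1] for p in points])
    return sigma_e, sigma_n


def horizontal_dispersion(points):
    """ASSUMED definition: sqrt(sigma_east**2 + sigma_north**2)."""
    sigma_e, sigma_n = axis_dispersion(points)
    return math.hypot(sigma_e, sigma_n)


TH_POS = 300.0  # threshold adopted in the text, in metres

# Hypothetical user walking due East only.
track = [(0.0, 0.0), (400.0, 0.0), (800.0, 0.0)]
sigma_e, sigma_n = axis_dispersion(track)
passes_both_axes = sigma_e >= TH_POS and sigma_n >= TH_POS
passes_horizontal = horizontal_dispersion(track) >= TH_POS
```

A user moving along a single axis fails the intersection of the East and North criteria but passes the horizontal one, which is the inconsistency that the horizontal criterion is meant to remove.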
For this value, the estimated number of users varies between 5534 (WeC, yellow markers) and 8233 (proximity RSSI-based, red markers).
In order to complement and compare the results shown in Figs. 8, 9, and 10, a sample of the plotted data is provided in Table 3.
In order to further investigate the impact of the value of the threshold on the number of estimated real users, the cumulative distribution function (CDF) of the excluded devices as a function of the horizontal position dispersion threshold is shown in Fig. 11. From the figure, the impact of the threshold value clearly emerges; considering the current value, the percentage of excluded devices is 86%, 84%, and 83% for the three positioning approaches. Considering the approaches based on RSSI and on the number of occurrences, some 75% of the total number of devices were completely static; this percentage is slightly lower (some 70%) for the WeC. This discrepancy is due to the nature of the positioning approach used. For example, if a small oscillation in the RSSI of a device is present, the methods based on the RSSI and on the number of occurrences are robust with respect to this phenomenon, while the same anomaly produces small variations in the positions estimated using the WeC.
The estimated number of real users is a fundamental figure; however, it is important to verify if the different cleaning procedures identify the same devices as real users; hence, the intersection among the possible user lists, using the three different positioning methods, is computed and plotted as a function of th pos in Fig. 12.
The set of users estimated using the proximity RSSI-based approach includes the users identified using the WeC approach; this clearly emerges from the fact that the blue continuous line is almost flat and close to one (users identified using the WeC are also identified using the proximity RSSI-based approach), while the blue dashed line is lower than the continuous one (not all the users identified using the proximity RSSI-based approach are included in the user list obtained using the WeC approach). However, a significant overlap among the sets of real users can be observed, since all the curves are higher than 50%.
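The comparison between user lists reduces to a set intersection normalized by the size of one of the lists. A minimal sketch, with invented device identifiers and a hypothetical function name, is:

```python
def overlap_fraction(reference, other):
    """Share of the devices in `reference` that also appear in `other`."""
    if not reference:
        return 0.0
    return len(reference & other) / len(reference)


# Hypothetical user lists obtained with two positioning methods.
users_wec = {"a", "b"}
users_rssi = {"a", "b", "c", "d"}

wec_in_rssi = overlap_fraction(users_wec, users_rssi)
rssi_in_wec = overlap_fraction(users_rssi, users_wec)
```

The measure is not symmetric: when one list is contained in the other, the fraction is one in one direction and lower in the opposite direction, which is why two curves are needed for each pair of methods.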

User concentration and node connections
In order to evaluate the user distribution after the exclusion of the devices not classified as real users, heat maps of the user concentration are shown in Fig. 13. The heat maps are computed using the user position estimates obtained exploiting the three localization algorithms and two cleaning approaches, one based on the horizontal position dispersion and one based on the number of stations and records. The maps have been generated considering a time span of 1 h between 11:00 and 12:00. The combinations of positioning methods and cleaning strategies show that the users are mainly concentrated in the central area of the site; only small differences can be observed considering the same localization algorithm and a different cleaning strategy (i.e., comparing the heat maps in the same columns). On the other hand, a noticeable difference can be observed when comparing the heat maps obtained using different localization algorithms (i.e., comparing the heat maps in the same rows): the user concentration is very similar using the WeC and the proximity RSSI-based approaches, whereas a different user concentration is obtained when using the proximity occurrence-based approach. This is probably due to the effect of the environment on RSSI measurements.
In Fig. 14, the heat maps of the device exclusions are shown; the concentration has been calculated considering the whole duration of the event. The heat maps are built on the exclusions performed using the configurations described above. A high consistency among the heat maps can be noted: almost all the heat maps show a high number of excluded devices in correspondence of APs 10 and 21, whereas only few devices are excluded in the proximity of the other APs. The most probable explanation for these results is that AP 10 is located in the core of the administrative zone of the site (where most of the remote devices, such as printers, are present), whereas AP 21 is close to the west fence, which separates the site from the busiest road in the area (therefore, it is possible that part of the users registered by this AP never entered the site).
Fig. 11 CDF of the excluded devices as a function of the horizontal position dispersion threshold
The user flow between the nodes of the network is plotted in Figs. 15 and 16. Each node is colored differently, from blue to yellow; if a line connects two nodes, it means that at least one user moved from one node to the other. The color of the line is the same as that of the node from which the user moved, while the width of the line represents the total number of users moving between the nodes.
Both figures are built considering 1 h of data; specifically, Fig. 15 refers to the time frame 8:00-9:00 (before the official opening) and Fig. 16 refers to the time frame 12:00-13:00. The configurations used to build the graphs are obtained combining the proximity positioning approaches with the exclusion criterion based on the horizontal position dispersion and the one considering the number of stations and records.
From Fig. 15, it emerges that real users are moving from only three nodes, specifically numbers 3, 7, and 10: presumably, the identified users were volunteers accessing the site before the official opening of the event. The first two APs (3 and 7) were located in the proximity of the gates reserved for volunteers, while AP 10 was at the main gate. The connections of APs 3 and 10 are almost identical, because almost all the devices passing AP 3 also had to pass in front of AP 10. Nodes 3 and 10 are connected with almost all the nodes of the network, and their curves are wider than those departing from node 7. This is due to the fact that the AP of the Brebbia gate registered only 47 users, as described in Gioia et al. (2017).
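The quantities behind such flow graphs can be sketched as a count of directed transitions between APs. The example below is only illustrative: the track layout (a time-ordered list of AP numbers per user) and the user identifiers are hypothetical, and counting each user at most once per directed pair is an assumption consistent with the line width representing a number of users.

```python
from collections import Counter


def count_transitions(tracks):
    """Count users moving between pairs of nodes.

    `tracks` maps a user identifier to the time-ordered list of AP
    numbers the user was associated with. Consecutive repeats of the
    same AP are not movements and are skipped. Each user contributes
    at most once per directed (from_ap, to_ap) pair.
    """
    flows = Counter()
    for ap_sequence in tracks.values():
        moves = {(a, b) for a, b in zip(ap_sequence, ap_sequence[1:]) if a != b}
        flows.update(moves)
    return flows


# Hypothetical tracks, for illustration only.
tracks = {
    "u1": [3, 10, 10, 5],
    "u2": [3, 10],
    "u3": [7, 7],  # never leaves AP 7, so it produces no line
}
flows = count_transitions(tracks)
```

In a plot, the first element of each pair would select the line color (the starting node) and the count would set the line width.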

Conclusions
Monitoring open crowded areas is fundamental for people security and safety, and of foremost importance is the number of people attending an event and their distribution. In this paper, a monitoring tool exploiting big data and WiFi positioning is presented. One of the issues related to this type of activity is the identification of the devices belonging to real users. This aspect is fundamental to avoid misinterpretation of results; for example, devices broadcasting fake identifiers before connecting to the network can generate false users; analogously, static devices, such as printers and PCs, which continuously connect to the monitoring network, can generate a false overcrowded area, masking the real user concentration.
To fill this gap, an approach to identify real users among a set of devices is proposed. The developed approach combines seven exclusion criteria with three WiFi localization algorithms. The exclusion criteria are based on the following:
- the minimum number of APs recording a device;
- the minimum number of times a device is registered by the network of APs;
- the combination of minimum number of APs and minimum number of records;
- user East-West position dispersion;
- user North-South position dispersion;
- user East-West and North-South position dispersion;
- horizontal user position dispersion.
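The first three criteria, which act in the measurement domain, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout, the device names, and the choice of treating thresholds as inclusive lower bounds are assumptions.

```python
from collections import defaultdict


def screen_by_measurements(records, min_aps=None, min_records=None):
    """Return the device identifiers passing the measurement-domain criteria.

    `records` is an iterable of (device_id, ap_id) pairs, one per detection.
    A criterion left as None is not applied; passing both thresholds
    applies their combination.
    """
    aps_seen = defaultdict(set)
    n_records = defaultdict(int)
    for device_id, ap_id in records:
        aps_seen[device_id].add(ap_id)
        n_records[device_id] += 1

    kept = set()
    for device_id in n_records:
        if min_aps is not None and len(aps_seen[device_id]) < min_aps:
            continue
        if min_records is not None and n_records[device_id] < min_records:
            continue
        kept.add(device_id)
    return kept


# Hypothetical detections: a static printer, a moving visitor, a passer-by.
records = (
    [("printer", 10)] * 5
    + [("visitor", 3), ("visitor", 10), ("visitor", 5), ("visitor", 5)]
    + [("passerby", 21)]
)
only_records = screen_by_measurements(records, min_records=4)
combined = screen_by_measurements(records, min_aps=3, min_records=4)
```

In this toy data set, the record-count criterion alone keeps the static printer, whereas the combined criterion retains only the moving visitor.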

Results discussion
The proposed localization algorithms are based on the proximity and WeC principles. The traditional algorithms exploit measurements collected simultaneously, but this condition is seldom verified when an object is tracked within a wide area. To fill this gap, the traditional proximity and WeC algorithms have been modified. In the proposed versions, the algorithms estimate the user positions using measurements collected during a given time interval; the time interval adopted is 3 min. The three developed localization approaches are as follows:
- proximity occurrence-based: the user is located in correspondence of the AP which recorded the user most often during the time interval;
- proximity RSSI-based: the user position is associated with that of the AP which recorded the signal of the user with the highest power during the reference time interval;
- WeC: the user position is estimated as a linear combination of the positions of the APs recording the presence of the user in the time interval.
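The three approaches can be illustrated on the observations of a single user within one time interval. The sketch below makes several assumptions that should be read as such: the observation layout, the AP coordinates, and in particular the WeC weights, which are taken here as proportional to the number of records per AP because the weights of the linear combination are not restated in this section.

```python
from collections import Counter


def proximity_occurrence(observations, ap_positions):
    """Position of the AP that recorded the user most often in the window."""
    counts = Counter(ap_id for ap_id, _ in observations)
    best_ap, _ = counts.most_common(1)[0]
    return ap_positions[best_ap]


def proximity_rssi(observations, ap_positions):
    """Position of the AP that recorded the strongest signal in the window."""
    best_ap, _ = max(observations, key=lambda obs: obs[1])
    return ap_positions[best_ap]


def weighted_centroid(observations, ap_positions):
    """Linear combination of AP positions; weights = record counts (assumed)."""
    counts = Counter(ap_id for ap_id, _ in observations)
    total = sum(counts.values())
    east = sum(n * ap_positions[ap][0] for ap, n in counts.items()) / total
    north = sum(n * ap_positions[ap][1] for ap, n in counts.items()) / total
    return (east, north)


# Hypothetical AP coordinates (metres) and one 3-min window of
# (ap_id, rssi_dbm) observations for a single user.
ap_positions = {1: (0.0, 0.0), 2: (100.0, 0.0), 3: (0.0, 200.0)}
observations = [(1, -70), (1, -72), (2, -55), (1, -68)]

pos_occurrence = proximity_occurrence(observations, ap_positions)
pos_rssi = proximity_rssi(observations, ap_positions)
pos_wec = weighted_centroid(observations, ap_positions)
```

Even in this small case the three estimates differ: the occurrence-based method selects AP 1, the RSSI-based method selects AP 2, and the WeC estimate falls between the two.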
The screening algorithm has been tested using a unique data set, which was collected during the JRC Open Day 2016. More than 7500 people attended the event and almost 11 h of data was collected using 20 WiFi APs; the data set is unique because of the extent and restricted-access nature of the site and of the availability of a program of the event, which allows the results to be verified, at least in a qualitative way. This is usually one of the weak points of big data analysis.

Main considerations
From the results, it can be concluded that:
• In the measurement domain, the two criteria taken individually are not able to properly identify the real users; hence, their combination should be adopted. In addition, the thresholds of the criteria have to be set according to the distribution of the nodes of the monitoring network and to the duration of the event.
  - For the specific case analyzed here, the threshold for the minimum number of stations was set to 3 (corresponding to the number of stations in the proximity of the main entrance gate), while the minimum number of records for a given device was set to 50.
  - Using these thresholds, a total of 6009 users were identified, about 79% of the actual number of people attending the event (7623 according to the report of the security office).
• In the position domain, the threshold for the screening criteria has to be set in accordance with the size of the site and with the geometry of the monitoring network.
  - In the specific case, a threshold of 300 m (corresponding to about half of the mean distance between the nodes) was adopted.
  - The criteria based on the East-West and North-South components individually were much too stringent, and the results obtained are not consistent. For example, using the proximity RSSI-based algorithm, a total of 6596 real users were identified using the criterion based on the East coordinate, whereas only 4595 users satisfied the condition on the North component. Finally, only 1763 devices satisfied both criteria. This is due to the fact that the criteria do not properly represent the user motion; usually, a user does not move only along a single axis or within a square.
  - A possible solution is to consider the horizontal user position dispersion criterion. Using this screening method with the three positioning approaches, the estimated number of real users was 5534, 6269, and 8283 for the WeC, proximity occurrence-based, and proximity RSSI-based approaches, respectively. The overestimation obtained with the proximity RSSI-based approach is probably due to the interference effect on the RSSI measurements.
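The difference between the per-axis and the horizontal criteria can be illustrated with a user moving diagonally. The sketch below is hedged in several ways: the dispersion measures (coordinate range per axis, largest distance between two estimates for the horizontal case), the rule that a device is kept when its dispersion reaches the threshold, and the sample track are all assumptions made for illustration; only the 300 m threshold comes from the text.

```python
import math


def axis_ranges(positions):
    """East-West and North-South extent (max minus min) of the estimates."""
    easts = [e for e, _ in positions]
    norths = [n for _, n in positions]
    return max(easts) - min(easts), max(norths) - min(norths)


def horizontal_dispersion(positions):
    """Largest distance between any two estimates (assumed definition)."""
    return max(math.dist(p, q) for p in positions for q in positions)


TH_POS_M = 300.0  # threshold quoted in the text

# Hypothetical user walking diagonally across the site (metres).
track = [(0.0, 0.0), (125.0, 125.0), (250.0, 250.0)]

ew, ns = axis_ranges(track)
passes_ew = ew >= TH_POS_M
passes_ns = ns >= TH_POS_M
passes_horizontal = horizontal_dispersion(track) >= TH_POS_M
```

Under these assumptions the user covers about 354 m in the horizontal plane and passes the horizontal criterion, while failing both per-axis criteria, each of which sees only 250 m of motion.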
After the screening procedure, the distribution of the users was computed. From the distribution of the excluded users, it emerges that the main part of the excluded devices was concentrated in the proximity of two APs; this result is consistent among all the approaches used. Finally, the connections among the nodes of the network were analyzed: a high consistency can be noted among the diverse methods adopted.
In conclusion, the feasibility of crowd monitoring through WiFi positioning has been demonstrated. Different cleaning strategies have been adopted and the relative results compared. A general rule, applicable to all possible WiFi positioning scenarios, cannot be stated. In fact, the particular geography of the AP network within the area to be monitored and the coverage range of the available APs are fundamental to the setting of thresholds necessary for a proper data cleaning.
Another point to be noted is that the results obtained are derived from a posteriori processing, which has been carried out on the whole data collection. Therefore, the real-time implementation of the methods still remains to be investigated. This will most probably be the direction of our future research.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.