Orthogonal projection for anomaly detection in networking datasets

In recent years, the impressive growth of new wireless technologies, together with the appearance of new requirements in applications and services, is progressively changing the use of networks. Due to the high mobility required, the network must adapt to the infrastructure to meet the demands of the users. As a result, service providers currently have to over-provision network capacity, which is costly. In addition, considering efficient resource planning in advance involves a lot of labor-intensive efforts. Consequently, network usage analysis is a very useful tool that allows network administrators to find patterns and anomalies. Whilst pattern detection provides administrators the ability to define the infrastructure, anomaly detection provides rich and valuable information for certain applications, for example, to avoid network saturation in urban areas during peak hours. This article proposes a new methodology based on orthogonal projections over Call Detail Records (CDR) for anomaly detection to help in the dynamic management of the network in an urban area. The method is evaluated in a real scenario provided by an Italian telecommunications operator, considering different locations in the Milan metropolitan area, differentiated by the geographic resolution of the data, reaching F1 scores above 0.8. In addition, a new ground truth is presented, hoping it will become a reference data set for the community, in the form of a set of locations that have been corroborated for use in evaluating anomaly detection techniques.


Introduction
5G and beyond (5G and B5G) networks are expected to meet the multiple challenges that have emerged in this mobile generation. Complex network architectures with ultra-dense deployments and multiple services with strict requirements are a common profile in this new generation of networks. The standard is developed to support 1000× the connection density, 3× the spectrum efficiency, 10× the experienced data rate and 10× the energy efficiency, among others. These new network capabilities are supported by new technologies such as virtual architectures, coexistence between traditional radio networks and new wave bands with massive MIMO, network slicing and "network softwarization" (Barakabitze et al. 2020).

These paradigms focus on the suitability of the network to share data and enable the exchange of information and knowledge. They are also combined with the ability to adapt to the requirements of every service or user. As a result, the user experience improves in terms of network speed, latency and reliability, and new services can be deployed on the network, such as the autonomous car, tactile Internet or remote surgery (Paolucci et al. 2021).
In this challenging context, the analysis of network information such as that contained in Call Detail Records (CDR) by means of Artificial Intelligence (AI) techniques contributes to improving network performance, the Quality of Service (QoS) provided by the network and the Quality of Experience (QoE) perceived by users, by analyzing their behaviour (Naboulsi et al. 2021). Therefore, CDR can be used to analyze human activity and mobility patterns, obtaining useful information in a wide range of fields, such as population distribution, city planning, tourism or event detection, among others (Jones et al. 2018; Zhao et al. 2016; Phithakkitnukoon et al. 2012).
Moreover, B5G and 6G standards aim to combine AI with the new technologies introduced in 5G to transform traditional network management into a self-adapting mechanism where the network reacts to changing conditions with minimal human intervention (Benzaid and Taleb 2020; Moysen and Giupponi 2018). In this context, the identification and analysis of unexpected or abnormal user behaviours is an important task that must be addressed to prevent network performance degradation or even to self-adapt the available network resources to manage these events. In this sense, anomaly detection can be considered a useful tool for optimizing network performance (Sultan et al. 2018).
In order to increase the efficiency of anomaly detection, the addition of network information metadata to the geographic coordinates provided by the CDR database offers an interesting mechanism to improve the precision of the data, revealing particular behaviours of the network. Furthermore, most of the techniques proposed in the literature for analyzing network-collected data (Garroppo and Niccolini 2018; Zhu and Sun 2020; Sultan et al. 2019) focus on examining a single parameter or the sum of the set of characteristics, so including additional information about a particular area should lead to a greater understanding of it, which can be extracted through its analysis. With this view towards future progress, the independent data contained in a CDR can be represented as a set of different features in the form of a data cube to accurately describe the associated geographic area. This concept is widely used in other areas of knowledge (Racetin and Krtalić 2021). Nevertheless, it has only recently been applied to the analysis of communication networks (Cortés-Polo et al. 2019), hence it opens new paths in terms of its practical application. In particular, this work extends the use of the Orthogonal Projection Network Analysis (OPNA) technique proposed in previous works (Cortés-Polo et al. 2019) to the detection of anomalous situations that may occur in the course of activity of an urban area, in order to provide a better understanding of which events could be considered a potential risk and to coordinate the provision of services of a telecommunications network. In addition, this document provides a qualitative and quantitative study of the methodology applied to a CDR of the Milan metropolitan area, adding new information about the events that occurred during the period in which this data set was collected. An additional objective of this work is to establish these collected facts as a reference for other works on this data set in this line of research.
The remainder of this document is organized as follows. Section 2 presents the existing research related to this study. Section 3 enumerates and describes the different steps of the orthogonal projection-based methodology. Section 4 describes the CDR of the metropolitan area of Milan and, in particular, the cells selected for analysis, as well as the metrics used to evaluate the method. Section 5 describes the experiments intended to evaluate the accuracy achieved by our implementation in the context of multi-feature data analysis. Finally, Sect. 6 concludes this work with some remarks and hints at plausible future research lines.

Background
CDR provides network providers with invaluable information about around 4.9 billion active Internet users (ITU 2021), as it is used in many fundamental processes such as value-added services development, customer billing, network efficiency measurements, and churn or fraud detection. The information contained in CDR includes anonymous user data such as the start time stamp, duration and cellular sector of origin of each voice, text message and data activity of each device, among others. Moreover, CDR can organize this information in patterns, such as a grid (Barlacchi et al. 2015). This may complicate the analysis, as some behaviours are masked in the order of the data itself.
The data contained in CDR provides significant information to analyze mobility patterns, developing frameworks as in Zhao et al. (2016), to understand human mobility, or in Wang and Chen (2018), where the analysis reduces the location uncertainty and oscillation of the data. In addition, frameworks can be used to examine data from other perspectives, such as gendered urban mobility as in Gauvin et al. (2020); to extract other parameters, as in the forensic analysis of Abba et al. (2019); or to study the activities of the base stations in a mobile cellular network (Jiang et al. 2020). Recently, many machine learning techniques have been included in frameworks oriented to analyzing and classifying the information of a CDR (Sultan et al. 2019) or to applying it to concrete areas such as churn prediction (Garimella et al. 2021).
Those frameworks, which analyse CDR using machine learning techniques, can also be applied to detecting unusual events of critical significance in the network and analysing the users' activities. As explained in the previous section, CDR contains spatial and temporal information about the network, providing an approximate location and temporal record, which are very useful when detecting particular events, an important peculiarity in anomaly detection analysis. In this sense, Sultan et al. (2018) describe a mechanism to identify anomalies using the K-means clustering algorithm and how to forecast future traffic for a user. Two different approaches are developed in Garroppo and Niccolini (2018) to detect on-the-fly traffic anomalies, using the Spatial Analysis and the Time Analysis separately. In Zhu and Sun (2020), the anomalies are detected using machine learning to extract traffic patterns through clustering algorithms and forecast the traffic of the network using autoregressive models. The anomaly detection research area is drawing more and more attention from academia and industry, due to the importance of detecting undesired events in the network (Dasari et al. 2021; Jiang and Liu 2022; Shajari et al. 2022).
This paper differs from the previous works in the methodology used to analyze the CDR, which characterizes each area as a linear combination of different unknown network features or comportments. Each comportment is a unique description of a particular activity in the network in a specific area. In addition, the sequence of characterizations of a location, ordered according to its temporal record, provides significant information to identify associated behaviour patterns, and this activity is analyzed using a statistical approach. This work also analyzes a new set of points of interest in order to validate the results obtained by the OPNA technique.

Orthogonal projection network analysis
The information contained in the CDR dataset provides valuable information, which can be used in many applications. The OPNA technique proposes a methodology to extract the descriptive comportments of each interval and use them to analyze network usage. Figure 1 depicts the two phases of the proposed methodology, composed of the OPNA technique and the analysis of some relevant locations to detect events which can be considered anomalous. This second phase will be covered in the following sections.
The OPNA methodology analyzes the CDR database, extracting the users' comportments for each zone and time interval. We consider each CDR record in an interval as a linear combination of the characteristic vectors multiplied by their respective abundances and modified by an error or correction factor, as shown in Eq. (1):

$$y_i = \sum_{j} c_{ij}\, a_j + e_i \qquad (1)$$

where $y_i$ is the measured value of the feature $i$, $c_{ij}$ defines the value of the $j$th comportment in feature $i$, $a_j$ is the fractional abundance of the $j$th comportment, and $e_i$ represents the error or correction term for the feature $i$ in this cell.
The model which describes a single interval can be generalized to the other intervals n = 1, 2, … , N. Therefore, Eq. (1) can be rewritten in compact form as in Eq. (2):

$$Y = C\,A + E \qquad (2)$$

where Y represents a complete CDR interval, the comportment set is defined by C, A is a matrix containing the fractional abundances for each comportment, and E is the accumulated error in matrix form.
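As a minimal illustration of the linear model in Eqs. (1) and (2), the following NumPy sketch builds synthetic observations from made-up comportments and abundances. The array sizes, the random values and the choice of abundances that are non-negative and sum to one are assumptions made for illustration; none of them is taken from the data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: features per observation, comportments, observations.
I, J, N = 5, 3, 4

C = rng.uniform(0.0, 10.0, size=(I, J))   # column j holds comportment j
A = rng.dirichlet(np.ones(J), size=N).T   # column n: abundances (assumed to sum to 1)
E = rng.normal(0.0, 0.01, size=(I, N))    # small error / correction term

Y = C @ A + E                             # compact form, Eq. (2)

# Eq. (1) for a single feature i of a single observation n, as an explicit sum.
i, n = 2, 1
y_in = sum(C[i, j] * A[j, n] for j in range(J)) + E[i, n]
```

The explicit sum for one entry and the matrix product give the same value, which is all the compact form expresses.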
The second step of the OPNA methodology shown in Fig. 1 is the comportment extraction. The proposed methodology (Cortés-Polo et al. 2019) uses the Orthogonal Subspace Projection (OSP) algorithm (Harsanyi and Chang 1994) as one of the most efficient algorithms to extract the comportment contained in the CDR data set. Nevertheless, other algorithms can be used as presented in Cortés-Polo et al. (2021). The algorithm aims to find the extreme points of the I-dimensional space of each interval formed by the feature array for each sub-area of the data cube. This array represents the comportment of a cell at a particular time in the data set.
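The extreme-point search can be sketched as follows. This is a simplified OSP-style iteration (take the observation with the largest norm, project the data onto the subspace orthogonal to the vectors already selected, and repeat), written under the assumption that observations are stored as columns. The function name `extract_extremes` and the stopping rule (a fixed number `k` of vectors) are hypothetical, and the sketch does not reproduce the implementation of Cortés-Polo et al. (2019).

```python
import numpy as np

def extract_extremes(Y, k):
    """Return the indices of k extreme columns of Y (features x observations)."""
    n_features = Y.shape[0]
    # Start from the observation with the largest norm.
    selected = [int(np.argmax(np.linalg.norm(Y, axis=0)))]
    while len(selected) < k:
        U = Y[:, selected]                         # vectors found so far
        # Projector onto the subspace orthogonal to the columns of U.
        P = np.eye(n_features) - U @ np.linalg.pinv(U)
        residual = np.linalg.norm(P @ Y, axis=0)   # part of each column U cannot explain
        selected.append(int(np.argmax(residual)))
    return selected

# Toy data: three "pure" columns plus two mixtures of them.
pure = np.array([[10.0, 0.0, 0.0],
                 [0.0, 8.0, 0.0],
                 [0.0, 0.0, 6.0]])
mixes = pure @ np.array([[0.5, 0.2],
                         [0.3, 0.3],
                         [0.2, 0.5]])
Y = np.hstack([mixes, pure])   # the pure columns sit at indices 2, 3 and 4
idx = extract_extremes(Y, 3)
```

On this toy input the mixtures are never selected, because after each projection their residual is smaller than that of the remaining pure columns.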
Each extreme element obtained is considered part of the set of comportments that will characterize the complete CDR. These comportment vectors are sorted by their norms in ascending order, and vectors similar enough to be considered representatives of the same intensity level are removed. Once ordered and reduced, the comportments are compared with each cell contained in the CDR, and the comportment that minimizes the Euclidean distance to the cell is selected. With this information, a network usage map of each of the intervals is obtained, as shown in the third stage of the OPNA methodology in Fig. 1.
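A minimal sketch of this post-processing is given below. It assumes that comportments are stored as rows and that "similar" means a Euclidean distance below a tolerance `tol`; the tolerance, the function names and the toy values are assumptions for illustration, since the similarity criterion is not specified here.

```python
import numpy as np

def order_and_reduce(comportments, tol):
    """Sort comportment vectors by norm (ascending) and drop near-duplicates."""
    order = np.argsort(np.linalg.norm(comportments, axis=1))
    kept = []
    for vec in comportments[order]:
        # Keep a vector only if it is farther than tol from every vector already kept.
        if all(np.linalg.norm(vec - k) > tol for k in kept):
            kept.append(vec)
    return np.array(kept)

def usage_map(cells, comportments):
    """For each cell vector, index of the comportment at minimum Euclidean distance."""
    # dists[c, j] is the distance between cell c and comportment j.
    dists = np.linalg.norm(cells[:, None, :] - comportments[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

comportments = np.array([[9.0, 9.0], [1.0, 1.0], [1.1, 0.9], [5.0, 4.0]])
levels = order_and_reduce(comportments, tol=0.5)

cells = np.array([[0.8, 1.2], [4.0, 5.0], [10.0, 8.0]])
labels = usage_map(cells, levels)
```

Because the comportments are sorted by norm before the assignment, the resulting label of a cell can be read as an intensity level, with higher indices meaning heavier network usage.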
The OPNA methodology can be helpful in analyzing the CDR data as a whole and locating different events, abnormal behaviours or misused resources. As can be observed in the flowchart, the technique proposed in this work is composed of two stages. Firstly, the OPNA analysis is applied as explained before, to extract the network usage of each cell of the data set. The second stage requires the selection of some locations, based on their expected behaviour, to detect events held in those areas by statistical analysis of the median absolute deviation (MAD).

Data set and metrics
Telecom Italia and SpazioDati published the first open-source telecommunications data through the first Big Data Challenge in 2014 (Barlacchi et al. 2015), with the aim of promoting new ideas in the field through Big Data techniques. The data set used in this work summarizes the activity of Telecom Italia's telecommunications network during a period of two months, namely November and December 2013, and specifically the activity of the city of Milan and its metropolitan area, as well as part of the adjacent towns. The activity of the network is described through five representative characteristics of the use that was made of telecommunications in those years, which are described below.
• Received / Sent SMS: the information is generated each time a user receives or sends an SMS.
• Incoming / Outgoing Call: a new record is generated each time a user receives or issues a call.
• Internet: a record is generated each time a user initiates or finishes an Internet connection. During the same connection, a CDR is generated when the connection lasts for more than fifteen minutes or when the user transferred more than 5MB.
Altogether, the data set has dimensions of 100 × 100 cells with five components in each interval, each of them being ten minutes long. The total number of intervals during the analyzed period is 8784 (144 intervals × 61 days). For this work, the set of intervals has been reduced to a total of 1488, grouping them in complete hours.

Points of interest analyzed
Due to the very nature of the anomaly detection problem, some key locations in the city of Milan where events are known to have occurred have been compiled for analysis over the period studied. A large part of this work focuses on the presentation of several testable and verifiable events in three of the four analyzed locations, with the aim of extending the known ground truth of the data set and expanding the functionality of the techniques that can be applied to it. Figure 2 depicts the four locations that have been examined, which are described in detail below.

The first cell of interest considered in this work is the Fiera Milano Rho, one of the two complexes that comprise the Fiera di Milano, the main trade exhibition in Italy and one of the most important in the European Union. The cell analyzed for this study is the one located above the entrance of the enclosure, which is particularly interesting as its activity is distributed among several pavilions and depends on the exhibition. In this regard, some pavilions may not be used. In addition, these events last longer than those examined in other cells, which are directly related to the attendance of the event. Table 1 displays the events that occurred during the analyzed period, taking into account that the influx of people may vary greatly from one event to another.
Another interesting location that has been taken into account in this work is the Giuseppe Meazza Stadium, also known as the San Siro Stadium, with a capacity for 80,000 spectators. It is a sports venue located in the city of Milan, where AC Milan and FC Internazionale play their home games. This point has been used in other works (Garroppo and Niccolini 2018; Parwez et al. 2017) with the same purpose of detecting abnormal behaviours, as its use is very sporadic (only when there is a match, since the training sessions of the Milan teams are held at the Milanello Sports Center and the Angelo Moratti Sports Center) and the surrounding cells are devoid of activity unless there is an event in the stadium. In addition, the dates of the matches played during the network analysis period are well known, as shown in Table 2, which lists the dates of the ten games played in the stadium by AC Milan, Inter Milan and the Italian national team.

Despite their differences, the previous locations share a similar cell pattern, in which the intensity of use of the telecommunications network affects the adjacent cells. That is why we also analyze two locations whose influence is confined to a single CDR cell. The first is the Mediolanum Forum, a sports hall located in the Italian town of Assago on the outskirts of Milan. With a capacity for 12,700 spectators, it is the court where Pallacanestro Olimpia Milano of the Italian national basketball league plays its home games. In terms of this analysis, it is interesting due to the large number of events held during the period examined, as shown in Table 3, including a variety of concerts and basketball games.
Last but not least, the cell where the Teatro degli Arcimboldi is located will be analyzed. The theatre has a capacity for approximately 2400 spectators hosting opera performances and other cultural events. The cell is particularly interesting given its proximity to the University Milano Bicocca and to a theatre school, which can generate distortions for the detection of the events listed in Table 4 due to the differences between the typical school and the cultural leisure schedule.
All the events listed in Tables 1, 2, 3 and 4 have been corroborated through the official websites of each location and of the people involved, audiovisual material on the Internet and user records on Twitter.

Metrics
To analyze the cells and extract those intervals where an event occurs, the MAD is used. The metric involves finding the median of the absolute deviations from the median, and can be defined as follows:

$$\mathrm{MAD} = b \cdot \mathrm{Median}\left(\left| x_i - \mathrm{Median}(X) \right|\right)$$

where Median(X) is the median of the analyzed cell's observations, and $x_i$ is each analyzed interval of the cell. Finally, b is a constant related to the assumed normality of the data, disregarding the abnormality induced by outliers (Rousseeuw and Croux 1993). In particular, to select the value of b, the distribution of the data must be assumed, where b is calculated as b = 1∕Q(0.75), considering Q(0.75) as the 0.75 quantile of the underlying distribution. If the data are assumed to follow a normal distribution, 1∕Q(0.75) = 1.4826 (Leys et al. 2013). The selected value is a key aspect because it defines the outlier values of the distribution. Figure 3 describes an example, where one of the aforementioned cells is analyzed, depicting the comportment of a Regular Day, with the network usage inside the MAD interval, and an Event Detection, where the MAD analysis shows an abnormal behaviour compared to other days.

For each analyzed cell, precision, recall and F1 scores are compared. Precision is the fraction of the detections made that are correct. Recall is the true positive rate, and the F1 score is the harmonic mean of precision and recall, which balances both measures. The metrics are calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP is the number of true positives, FN is the number of false negatives and FP is the number of false positives. The F1 score ranges from 0 to 1, where 0 is the worst possible score and 1 is a perfect score indicating that the model predicts each observation correctly. What counts as a good F1 score depends on the type of data. In this work we consider an F1 score above 0.8 to be good.
By extension, we have established the ranges detailed in Table 5 to assess the results obtained qualitatively.
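The MAD rule and the scores defined in this section can be sketched in a few lines of plain Python. The decision threshold of three MADs, the function names and the toy series are assumptions made for illustration; the text above fixes only b = 1.4826 for normally distributed data and does not state the threshold that was used.

```python
from statistics import median

def mad(values, b=1.4826):
    """Median absolute deviation scaled by b (1.4826 assumes normal data)."""
    med = median(values)
    return b * median(abs(x - med) for x in values)

def flag_anomalies(values, threshold=3.0, b=1.4826):
    """Flag values farther than threshold * MAD from the median (threshold is assumed)."""
    med = median(values)
    limit = threshold * mad(values, b)
    return [abs(x - med) > limit for x in values]

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy hourly usage levels for one cell: a stable baseline with one burst.
usage = [10, 12, 11, 13, 12, 11, 40, 12, 10, 13]
flags = flag_anomalies(usage)

# Toy confusion counts, not taken from Table 6.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

In the toy series the median is 12 and the scaled MAD is 1.4826, so only the burst of 40 falls outside the interval. Because the median is used instead of the mean, the burst itself barely shifts the interval it is compared against.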

Experiments results
The experiments are conducted according to certain parameters for the validation of the results. The events taken as a reference are bounded by the start and end hours announced a priori by the organization. With the data divided into one-hour time slices, any continuous block detected as an anomaly by the method that totally or partially coincides with any of the established references is considered a hit; an anomaly detected by the method that has no counterpart in the established references is considered an unverified success; conversely, when the technique does not detect an anomaly in a time segment where there is an established reference, it is considered a failure. In this sense, the four cells described in Sect. 4.1 have been analyzed, evaluating their results individually and obtaining the precision, recall and F1 measurements that are studied as a whole to validate the effectiveness of the method.

Table 6 describes the results obtained using the OPNA proposal to analyze the network usage of each cell of the data set and to measure the deviation of each data point from the median of its cell using the MAD approach. Figure 4 depicts all the reference events in comparison to the anomalies detected for each location described above. The graphs show the events in digital (on/off) form over a common time frame for ease of comparison; each of them is analyzed in detail below.

… into different well-defined pavilions and there are no other potential users in the surrounding areas.

As shown in Fig. 4(b), all but one of the reference events for the San Siro stadium are detected with the OPNA + MAD approach. According to Cortés-Polo et al. (2019), the cell usage is generally low in all matches, except in the International Football Match Italy vs. Germany on 15th of November and the derby match between FC Internazionale Milano and AC Milan on 22nd of December.
In this context, it is noteworthy that users' behaviour during this type of event has changed since the dataset was obtained in 2013, due to the rise of various social networks, and only when attendance was close to full occupancy is the usage of the network relevant. It is also necessary to mention two unverified events on November 4th and 6th.
The previous locations have in common that the resolution of the mesh into which the data set is divided matches the space occupied by the enclosures in question. This results in a purer analysis of the method used. In contrast, the following cells encompass an area in which the analyzed location is more blurred. As a result, we can check how the introduction of "noise" affects the proposed method. Figure 4(c) focuses on the events that occurred in the Mediolanum Forum Sports Hall, which comprises not only the stadium but also a sports area with a bowling alley, an indoor swimming pool and a gym. Events held in this multisport area can be detected as unverified events in the analysis and decrease the precision measure. In the figure, it can be clearly observed that there are six unverified events and seven missed events, the latter mainly related to the Disney on Ice event, the Cirque du Soleil show and a Mario Biondi concert.
Finally, Fig. 4(d) shows the events detected in cell 7468, which contains the Teatro degli Arcimboldi. This cell has mixed usages, as the theatre only takes up around 40% of the cell. As shown in Table 6, the precision obtained in this analysis is around 65% as a direct consequence of the unverified events, located on the weekdays from 4th to 8th, 11th to 15th, 18th to 22nd, 25th and 26th of November from 9 to 20 h. In addition to the theatre, the cell is also occupied by a music and theatre school called Scuola del Teatro Musicale (STM). This school is open during the intervals in which the unverified events occur, so these events could mainly be attributed to the operation of the school, as can be seen in the Annual Report of the STM. If those events were counted as verified, the precision and the F1 score of the algorithm would increase to 95% and 94%, respectively. We consider it necessary to include this location, even though the results could not be accurately confirmed, precisely because of the composition of this cell and the wide margin for improvement that it offers to new techniques that might be applied in the future.
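The scoring rule set out at the start of this section (hit, unverified success, failure) can be sketched as an interval-overlap count. Representing events as half-open `(start_hour, end_hour)` pairs and the names `overlaps` and `score_blocks` are assumptions for illustration, and the toy intervals are not taken from Tables 1 to 4.

```python
def overlaps(a, b):
    """True when the half-open hour intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def score_blocks(detected, reference):
    """Count hits, unverified detections and failures.

    hit        : a detected block that fully or partly overlaps a reference event
    unverified : a detected block with no overlapping reference event
    failure    : a reference event that no detected block overlaps
    """
    hits = sum(any(overlaps(d, r) for r in reference) for d in detected)
    unverified = len(detected) - hits
    failures = sum(not any(overlaps(d, r) for d in detected) for r in reference)
    return hits, unverified, failures

reference = [(20, 23), (68, 71), (116, 119)]   # hypothetical event windows, in hours
detected = [(19, 22), (40, 42), (69, 70)]      # hypothetical anomalous blocks
hits, unverified, failures = score_blocks(detected, reference)
```

Hits are counted per detected block in this sketch; how several blocks matching the same event should be counted is not specified above, so that choice is also an assumption. The three counts play the roles of TP, FP and FN in the precision, recall and F1 formulas.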

Conclusions
This paper explores the application of a novel methodology to analyze and classify real user traffic in order to detect network anomalies, using a real CDR dataset from Telecom Italia. Leveraging the knowledge gained from previous analyses, extracting the comportments in each cell of the analyzed dataset provides an interesting way to find events that require some attention to handle the required network traffic. With this methodology, network operators can take advantage of the information contained in their CDR in order to improve both network performance and work efficiency, implement a new QoS rule or increase the QoE of a specific service in a zone, and analyse the behaviour of their users over time. Likewise, this work includes new points of interest and verified references of events that occurred during the period covered by the data set, in order to provide the community with the necessary elements for experimentation and testing of other techniques. A qualitative and quantitative evaluation of the proposed method has been carried out through a series of experiments that demonstrate its potential.
As future lines of action, further analysis will be performed to introduce new algorithms to detect anomalies in this dataset, which could lead to new techniques, such as supervised anomaly detection algorithms or deep learning algorithms. The results obtained by these applications should be applied to other data sets with similar characteristics such as that of the province of Trento.