Abstract
Continuing global urbanization, with rapid and dynamic changes in traffic conditions across highly populated cities, makes data collection and communication difficult. Data collection for millions of vehicles is hindered by various problems, i.e., higher costs in energy, time, space, and storage resources. Moreover, higher data traffic results in higher delays, lower throughput, excessive bottlenecks, and frequent repetition of data. To address these challenges, we propose a lightweight Machine Learning based data collection protocol named ML-TDG that handles higher data volumes in a real-time traffic environment while placing the least burden on the network and using less space, time, and energy. ML-TDG is built on Apache Spark, a data processing engine that indexes the data based on two logs: first, old (frequent/daily) commuters and, second, new (occasional) commuters. The main idea of the proposed protocol is to use real-time traffic, distinguish the indexes in parallel based on the two assigned log criteria to train the network, and collect data with the fewest sources. For energy and time optimization, dynamic segmentation switching is introduced, which is an intelligent division and switching of road segments for reducing bottlenecks and replication. ML-TDG is tested and verified on the M50, the busiest motorway in Dublin, Ireland. ML-TDG performs data collection, data sorting, and network training together to decide the next execution, improving optimization with every run. The experimental results verify that the proposed protocol attains higher performance with lower resource requirements, along with rich, time-efficient, and sustainable data collection clusters, in comparison with baseline protocols.
Introduction
Computationally integrated vehicles that operate smartly by making informed, coordinated, and better use of the transport network are known as an Intelligent Transport System (ITS) [1, 2]. A transport system independent enough to achieve better traffic efficiency while minimizing problems for drivers can be regarded as a key feature of ITS [3]. The simplest description of ITS is the integration of information and communication technologies in transport [4]. A country’s growth is often measured through the quality of its transportation network [5, 6]. Advanced network infrastructure with better traffic facilities results in better economic (trade, cross-country commutes, e-commerce businesses, etc.), civil, security, and technology-assisted benefits [7].
The number of vehicles on Irish roads has reached its highest level since 2008. The total number of licensed vehicles in Ireland stood at 2.2 million in 2019 [8], while it was only 1,985,000 (approximately 2 million) back in 2008–10. The number of vehicles has increased by 179% [9]. The Irish government has set a target of 1 million electric vehicles by 2030, which indicates that this growth is expected to reach a new peak. Ireland had 439 passenger cars per 1,000 inhabitants in 2016 [10]. This was the sixth lowest in the EU. The total number of vehicles on the road increased by 22.2% in 2021 [9]. The annual number of newly licensed goods vehicles rose by 32.1%, and that of private cars by 20.8%. On average, an increase of approximately 47% is being witnessed every month in comparison with the prior month [9].
Because the number of vehicles added daily is expected to keep increasing, data collection for the computational operations linked with Intelligent Transport Systems is becoming a complex and tedious task to accomplish smoothly [11,12,13,14]. Irish licensed vehicles traveled 47.1 billion kilometers in 2019, compared with 36.2 billion kilometers in 2020 and 41.9 billion kilometers in 2021 under COVID-19 restrictions [9]. These figures would be expected to be considerably higher without COVID-19 restrictions. Road traffic volumes are largely impacted by the drastically increasing number of vehicles, which highlights the dire need for an intelligent, self-sufficient data collection and communication protocol to manage these numbers effectively.
Various ITS projects in Ireland are currently working on enhancing traffic features and experiences. ITS-based projects are active at both academic and industrial levels [15,16,17,18,19]. Despite the strong research interest in ITS among academics, there still exists a gap in collaborating on and practically implementing academically proposed ITS protocols for industrial gains [12, 20, 21]. These research gaps include integration with legacy traffic control systems, cost-effectiveness, increased congestion that results in higher travel times, higher environmental damage, increased global warming, poor communication technologies, poorly controlled accident response, and minimal infrastructure control [22,23,24,25].
To resolve the aforementioned challenges and gaps, we practically implemented the previously proposed Real-time Data Gathering (TDG) Protocol [26] on Dublin’s M50 motorway, Ireland, with many changes incorporated into TDG. The revised and modified protocol is named the Machine Learning based Data Gathering (ML-TDG) Protocol. Before discussing the ML-TDG modules, operations, and functionalities, an overview of TDG [26] is as follows: TDG is lightweight and dynamically designed for collecting and forwarding data packets based on current and rapidly evolving traffic conditions to reduce network and data communication overhead while incorporating real-time data collection time constraints. A data aggregation scheme is implemented for data analysis to fetch information based on location, speed, vehicle id, and neighbor count. A data extraction scheme is also integrated to increase data retrieval and data utilization effectiveness in an intelligent way at the base station. The proposed solution outperformed existing data-gathering protocols in effectiveness, efficiency, delay, communication overhead, and data transmission rate. More details can be found in [26].
Before proceeding to the objectives, motivations, and contributions, we state the research questions as follows: (1) Is it possible to integrate a Machine Learning based data engine (Apache Spark) into VANETs and ITS to induce learning within vehicles so that they operate on the basis of previously processed data for maximum benefit? (2) Is sorting data systems (based on labels) to deal with huge traffic flows, data processing and analysis, real-time communication, ITS, and smart cities practically possible in Dublin’s M50 environment? (3) Is it attainable to self-train a network model efficiently from a live stream of traffic data and from the data information system?
The concept of integrating machine learning into VANETs and ITS, and specifically into ML-TDG, is to induce learning within vehicles so that they operate on the basis of previously processed data for maximum benefit [27]. Machine Learning combined with VANETs is opening new horizons in addressing major issues, i.e., dealing with huge traffic flows, data processing and analysis, real-time communication, ITS, and smart cities [28, 29]. ML-TDG uses machine learning concepts to deal with massive traffic flows and data congestion for better and optimized vehicular communication.
The primary objectives and motivations of the proposed ML-based Intelligent Transport System protocol are listed as follows:
(1) VANETs face rapidly changing traffic trends, unpredictable vehicle scenarios, and the limited availability of database stations. The main objective is to propose and design a protocol that provides timely and updated data collection, in which the desired data are collected efficiently via self-training of the network model.
(2) Outdated and even slightly delayed data consume memory and energy and offer the least usability for Intelligent Transport Systems. Another primary objective is to design a protocol that overcomes the memory and energy issues within a tolerable delay, i.e., in real time.
(3) The VANET environment is highly mobile and topologically constrained by roads, neighboring vehicles, and traffic signals. Therefore, vehicles on roads do follow some pattern. In the case of high density, vehicles usually move closer together and naturally form clusters or groups on motorways. Hence, knowledge of road structure, motorway entries/exits, daily vehicle commutes, junction usage, positioning, and neighbor count can be considered informative parameters for developing a clustering-based, real-time, traffic-aware data-gathering protocol that pre-feeds the network to avoid repeating processes that take energy, space, and time.
(4) Finally, the most critical objective of this paper is to design, propose, and implement an architecturally centered protocol with the ability to learn automatically from daily vehicular commutes while delivering automated and real-time synced data collection solutions with enhanced time, storage, and processing features.
The highlighted contributions of the paper are enumerated as follows:
(1) A lightweight Machine Learning based data collection protocol is proposed for communicating and forwarding data packets based on real-time learning of rapidly evolving traffic conditions, reducing data communication cost and overhead while integrating large-scale data collection.
(2) Integration of Dynamic Segmentation Switching via machine learning (Spark Streaming) is proposed, which significantly reduces data communication cost, traffic data congestion, and data overriding during the execution of the protocol.
(3) The proposed protocol is applied over the known architecture of a heavily congested motorway infrastructure to train the model, which comparatively produces quicker results, enables real-time data communication and collection, and takes less energy and storage under real-time traffic conditions.
(4) A vehicular network is trained with real-time evolving traffic conditions based on daily statistics of the motorway, which makes the protocol applicable to larger scale data without increasing the mean execution time.
(5) Long-term and bigger clusters are achieved in response to the trained protocol, which makes this protocol independent, efficient, and cost-effective.
(6) Extensive empirical evaluations, state-of-the-art tools, and careful simulations using real-time traffic scenarios and data are employed to demonstrate the higher performance of the protocol.
Based on the above-mentioned objectives, motivations, and contributions, the main idea of the proposed protocol is to design a lightweight, energy-efficient Machine Learning based data collection protocol that effectively deals with higher data volumes in a real-time vehicular traffic environment through an organized information log (system) with optimal usage of resources (space, time, and energy).
The rest of the paper is organized as follows: “Literature review” covers relevant literature, “Methodology” discusses the Machine Learning based Data Gathering Protocol (ML-TDG) with its counterparts, and “Results and analysis” presents and analyzes the results with suitable illustrations. “Advantages and challenges of proposed methodology” covers the advantages and challenges of the proposed methodology followed by the conclusion in “Conclusion”. The paper provides Future work in “Future work” and then References.
Literature review
Vehicular Ad hoc Networks (VANETs) significantly assist the in-depth study of vehicular communication [23, 27, 30]. Every vehicle in the vehicular network is bound to exchange data for communication, infotainment, safety, and many other critical factors to keep the traffic flow smooth alongside maintaining roadside infrastructure [31]. Better vehicle-to-infrastructure (V2I) communication brings efficiency to ITS by improving vehicle communication without disruptions and collisions [25].
V2I communication poses multiple challenges in VANETs in terms of frequent disruptions, network intruders, data loss, vehicle collisions, malicious vehicles, and data interruption [32]. Machine Learning (ML) provides potential solutions to these V2I challenges. ML is an artificial intelligence component that gives machines [like Vehicles and Road Side Units (RSUs)] the ability to learn automatically, without being explicitly programmed, for operational and functional performance factors [33]. Many approaches and protocols based on Machine Learning have been proposed to resolve data collection and communication challenges in VANETs. A few such approaches are discussed in this section.
A conceptual objective of ML is to let the vehicle learn and improve its operations from previously processed data. One such scheme is the Efficient Clustering Routing approach using a new clustering algorithm based on Density Peaks (ECRDP) [34], which applies Particle Swarm Optimization (PSO) and Density Peaks Clustering (DPC) algorithms to determine Cluster Heads (CH) for reliable links among connected vehicles. In this scheme, CHs are selected and supported through a systematic maintenance phase that updates and redistributes the vehicles into clusters under updated CHs.
Another ML-based scheme for V2I communication is proposed to facilitate local communication among multiple vehicles via Software Agents (SAs) [35]. The proposed agent-based model is designed to coordinate static and mobile agents through a decision tree and a Q-learning algorithm for the identification of events such as critical, non-critical, and destination vehicles. Critical and non-critical events are identified via the Event Decision Agent (EDA), which is a decision tree algorithm that uses vehicle sensor outputs. The Road Side Unit Management Agent (RSUMA) and RSU Information Agent (RSUIA) are designed through Q-Learning for vehicle tracing and neighbor selection.
An ML-based data collection scheme named Authentic Vehicle Node with FOG Computing (AVNFC) [36] works on continuous time-sensitive data exchanges to assist intelligent infrastructure. This protocol stores, communicates, and computes data frames in real time. The scheme uses an ML technique based on Lagrange polynomial interpolation for node authentication via a fog-enabled VANET architecture.
The Machine Learning based Misbehavior Detection System (ML-MDS) [37] is another scheme for cognitive software-defined multimedia vehicular networks (CSDMV) in smart cities that operates before data communication starts for better misbehavior detection. A Trust Value (TV) is used as the standard, e.g., communication happens only if the TV of a vehicle is higher than a set threshold. A pipeline of ML-based algorithms is designed, i.e., decision tree, Support Vector Machine (SVM), Neural Network (NN), and Logistic Regression (LR) algorithms, for behavior detection accuracy.
YOLOv3 (You Only Look Once v3) is a recently proposed method that particularly targets the capability of cross-scale detection and focuses on the valuable area [38]. The proposed method performs multi-scale road object detection via the K-means-GIoU algorithm. This algorithm is designed to generate a priori boxes whose shapes are close to real boxes, followed by training. Evaluation on the KITTI dataset shows that the proposed method maintains a fast detection speed and increases the mAP (mean average precision) value. The YOLOv3 strategy is somewhat complex because of complications that arise in object detection, such as varying object sizes, removal of background targets, and strengthening the network’s attention. Moreover, YOLOv3 is not storage efficient (the objects are large) and requires a lot of resources (time and energy) to train and process. In contrast, we achieve the time, space, and energy efficiency targets by maintaining a log corresponding to every target’s index that trains and matures for better performance accuracy every time. Second, our proposed methodology is implemented in real time in Dublin City, which makes it reliable and worthy of real-time consideration.
A model is proposed for parameter optimization and feature metric-based fault diagnosis that serves as an unknown matching network model to solve issues corresponding to data sets in real industrial environments, mainly arising from sparse fault samples and cross-domain data sets [39]. The proposed model operates on a meta-learning network that extracts optimization information for parametric optimization. This methodology provides an effective solution for cross-domain problems of various connected devices that usually occur in response to changes in equipment operating conditions and production requirements.
Compacted Area with Effective Links (CAEL) [40] is another recent protocol designed to decrease overhead while maintaining smooth communication between selected nodes on the basis of geographical location and adequate references to already existing links among vehicles, with the inclusion of a reliability factor. The link expiration time is a key element throughout for achieving real-time communication in terms of removing malicious nodes, selecting trustworthy nodes, and holding suspicious activity. Another recently proposed preassigned performance control scheme [41] ensures that all nonlinear systems subject to the closed-loop system are practically finite-time bounded, including that the tracking error converges to a preassigned area within a finite time. The time-bound factor is also addressed in our scheme, as the time decreases when log indexes increase, i.e., as the training matures.
A framework called Iterative learning control (ILC) [42] is a high-performance discrete linear time-invariant (LTI) system that works toward the objective of minimizing energy while maintaining the required tracking accuracy. The framework is verified by a twin-rotor aerodynamic system (TRAS) model for operations defined within a finite duration. Another important category of ML-based data collection protocols is the one with Apache Spark integration [43,44,45,46,47,48,49]. In the Apache Spark cluster environment, R, a statistical computing and graphics software available within Spark, gives the user the capability to construct statistical and prediction models using the traffic data. It monitors the data of the subsystem while giving analyses of the current condition/strategy during the execution of the desired state and strategy [50,51,52,53,54].
Apache Spark intelligently responds to a planned strategy [55]. For example, if any of the scheme modules stays high for an extended time, the subsystem will automatically add one node to the cluster pool to bring the platform and module back to the normal state [56,57,58,59]. It also enables low-latency storage and access, which is very beneficial in transportation design and in the planning and execution of protocols. It is of great help for multiple types of information that need to be processed in multiple logs [60,61,62].
One such example is the IDS framework [63,64,65], which is designed to deploy big data engines to provide an efficient end-to-end detection solution that reduces the impact of heavier density traffic flow on network performance. This machine learning based framework uses random forest model training and runs on Apache Spark for data acquisition, anomaly detection, traffic logging, and data visualization for DoS/DDoS attacks [66, 67].
A traffic prediction system using Prophet and Spark Streaming is designed on Apache Spark, which provides the features of a big data processing framework for processing huge amounts of data [46]. Spark Streaming enables real-time forecasting of the traffic flow, while the Prophet model captures long-range temporal sequences of data to predict traffic flow [68]. Integration of Apache Spark in this protocol facilitates handy features such as processing huge amounts of data, precise prediction, and prompt real-time forecasting of the traffic flow and software critical systems simultaneously [69]. Management of the software systems is briefly elaborated in [70, 71].
Another ITS system, designed to predict the total traffic count of streaming data on various routes for traffic congestion reduction, uses the Spark Streaming engine for live processing of data [72]. Apache Spark processes data and updates systematically using the total traffic count of predicted traffic via connected vehicles. Spring Boot is used to display the total traffic count in a dashboard. Through timely and prompt display analysis, the congestion problem is addressed with real-time road traffic data streaming enabled in the Apache framework [73, 74].
Detailed features are illustrated in Fig. 1. Figure 1 shows all the challenges and gaps of Intelligent Transport Systems that are now potentially covered by Apache Spark with greater ease. These gaps were once considered a tedious task to accomplish. In our proposed ML-TDG, we have overcome the issues and challenges mentioned in Fig. 1 through Apache Spark. Details are discussed in “Methodology”.
Methodology
According to Transport Infrastructure Ireland (TII) [Reference], Ireland’s busiest road is the M50, which is a 45.5 km road with eight lanes and 17 junctions that carries an average of 142,496 vehicles a day. Some of the M50 routes have been recorded with 51 million journeys per year with 400,000 unique journeys (commuters) every day [75]. Considering these facts and figures, we have designed the Machine Learning based Data Gathering (ML-TDG) Protocol (a modified version of TDG [26]) while incorporating Apache Spark [76], which is a data management and analysis technique.
Apache Spark [20] is a machine learning based data processing engine that works with the MLlib library [77] [78] to run and train rich and extensive data models. The data coming from traffic flows keep streaming and queuing continuously. In the case of VANETs, when data (traffic flows) are subject to unpredictable state changes, it is difficult to analyze traffic in real time. Apache Spark helps in analyzing each recorded vehicle flow and labeling it accordingly. We have created two labels, i.e., New Path Vehicle (NPV) and Old Path Vehicle (OPV). Recorded and collected data flows are maintained against each label through logs by indexing. The labeling results are saved via Elasticsearch [23] [79] [80] and can be searched and retrieved from the Kibana dashboard [81] [82] [28] for future reference and analysis.
Figure 2 illustrates the various components used for data collection. One of the main components is Apache Spark, a streaming engine that supports Structured Streaming for streaming processes and pipelines. Structured Streaming allows data collection operations that would otherwise require batch mode to be expressed with Spark’s structured APIs and processed in a streaming fashion. This feature facilitates reduced latency and incremental processing of data flows in real time. Structured Streaming produces values rapidly in response to a batch or a streaming job. Incremental execution and processing of the real-time data collection through structured features are shown in Fig. 2.
Figure 2 shows the model structure considered for real-time data collection. It shows five inter-connected modular components, including Spark Streaming, which is a scalable fault-tolerant streaming data processing system. As illustrated, it supports both batch (OPV and NPV) and streaming workloads. Spark processes real-time data from various sources, i.e., RSUs and vehicles. Apache Kafka is a scalable messaging system that works with Spark for data streaming analytics in high-throughput traffic processing. Apache Kafka is deployed for the message queues coming from vehicles and RSUs to transmit the collected data from Spark. In the proposed protocol, streaming (mostly vehicle) and static (mostly RSU) data sources are considered for batches of input data.
The processed data are disseminated to live dashboards via discrete streams or small batches to categorize them under the different logs and data labels used in the protocol. Spark Streaming is integrated with MLlib to implement a Machine Learning algorithm using Spark’s micro-batch system, adding two key functional operations, i.e., training the model with real-time data and using the trained model simultaneously. Spark DataFrames are added to satisfy the processing needs of data indexing and labeling used by the proposed protocol, as explained in detail in later sub-sections.
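The micro-batch idea can be illustrated without a Spark cluster. The following is a minimal plain-Python sketch, not the Spark API: the record fields (`ts`, `source`, `vehicle_id`), the helper names, and the 1-second interval are all assumptions made for illustration. It groups a time-ordered stream into small batches and summarizes each batch by source type.

```python
from collections import Counter
from typing import Iterable, Iterator


def micro_batches(records: Iterable[dict], interval_s: float) -> Iterator[list]:
    """Group time-ordered records into consecutive windows of interval_s seconds."""
    batch: list = []
    window_end = None
    for rec in records:
        if window_end is None:
            # The first window starts at the first record's timestamp.
            window_end = rec["ts"] + interval_s
        if rec["ts"] >= window_end:
            yield batch  # close the current window
            batch = []
            while rec["ts"] >= window_end:  # skip over empty windows
                window_end += interval_s
        batch.append(rec)
    if batch:
        yield batch


def summarize(batch: list) -> Counter:
    """Count records per source type ("vehicle" or "rsu") in one micro-batch."""
    return Counter(rec["source"] for rec in batch)


# Hypothetical sample stream: ts is in seconds, source is "vehicle" or "rsu".
stream = [
    {"ts": 0.0, "source": "vehicle", "vehicle_id": "V1"},
    {"ts": 0.4, "source": "rsu", "vehicle_id": "V2"},
    {"ts": 1.2, "source": "vehicle", "vehicle_id": "V1"},
    {"ts": 3.5, "source": "vehicle", "vehicle_id": "V3"},
]
for batch in micro_batches(stream, interval_s=1.0):
    print(len(batch), dict(summarize(batch)))
```

In an actual deployment, the windowing would be handled by the streaming engine itself; the sketch only shows why small batches allow each window to be labeled and logged as soon as it closes.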
A real-time vehicle network depiction of Ireland’s M50, which is a C-shaped orbital motorway in Dublin and the busiest motorway in Ireland with 17 junctions, is created in the SUMO simulator. The vehicle density is mimicked based on the data available at [83], i.e., 51 million journeys per year with 400,000 unique journeys every day, along with the peak-hour (busiest hours) ratios and times as per the statistics given at [84]. These real-time implementations based on actual data demonstrate the feasibility and monitored practicality of ML-TDG in Dublin, Ireland.
Scheme plotting
The scheme plotting is based on three cases considered for systematic results analysis of the proposed protocol. OPV stands for all the vehicles that pass along the M50 multiple times a week while following an identical path/route. NPV, on the other hand, are the vehicles that are on the M50 for the very first time. An NPV, after accumulating a series of consecutive logs, eventually becomes an OPV. This feature does not increase the data burden, because the record is simply shifted from one log to another. Old Path Vehicles are the vehicles driven by people who follow the same path, e.g., a mother dropping her kids at the same school every day for a couple of years, a man who works in an organization on a 3-year contract, a business person visiting the same site every day, and a student going to university for a degree program. Considering 51 million journeys per year means 1.4 million journeys per day and 400,000 unique (NPV) journeys per day. Based on these facts, detailed data flow scenarios/cases considered to justify ML-TDG functionalities for smart data analysis and management are covered in “Results and analysis”.
Proposed architecture
In this section, we discuss the proposed protocol’s architecture and explain each module. Dublin’s M50 Motorway (M50) is considered, which is divided into 17 Junctions (J1, J2, and J3 \(\ldots \) J17) along a 45.5 km road comprising eight lanes (L1, L2, and L3 \(\ldots \) L8). The real-time implementation considered for structural illustration is shown in Fig. 5. In Fig. 5, 17 junctions can be seen that are exactly replicated from Dublin’s M50 motorway to implement and test the proposed ML-TDG. Every junction is given the same name, curves, rows, and passage to analyze real-time situations better. Results gathered from these junctions are considered in the results and analysis section.
The protocol functionality is divided into three major modules:
(1) Vehicles logging
(2) ML-TDG data collection
(3) Integration and revised training.
These modules are further divided into various sub-modules discussed below.
Vehicles logging
This module enables real-time traffic management for smart decision management based on vehicle logging. Vehicles Logging is described as recording daily commuters based on traffic flows and creating logs in two categories: (1) New Path Vehicles (NPV) and (2) Old Path Vehicles (OPV). The NPV log and OPV log are indexed by storing the input data via Elasticsearch. The NPV and OPV logs are not designed to act in parallel at all times, because NPV ratios on the M50 are comparatively lower than OPV ratios. The NPV log becomes active only when a new path vehicle is detected and needs to be recorded.
To establish OPV and NPV logging, an Apache Spark Cluster is used in which OPV and NPV are Spark workers and the Spark Master is the primary data streaming and processing node. The Spark framework uses a master–worker architecture that runs across the cluster while processing batches of OPV and NPV in real time. Spark manages the cluster to accomplish the traffic logging task while coordinating batch processes for traffic analysis, as shown in Fig. 3. Figure 3 shows that traffic logging tasks are processed in batches, mainly by the Spark worker for each index, i.e., OPV and NPV, which is synced with the Spark master. The Spark master processes old commuters (OPV) while simultaneously making repeated NPV (new commuters) a part of OPV to maintain an updated log. This feature is also useful in avoiding duplication of data, e.g., the same data coming from OPV and NPV.
The open-source Elasticsearch is applied for data storing, indexing, and visualization within the cluster. Results within the cluster are treated as databases for indexing and storing the input data by Elasticsearch. The data collected from traffic are passed through Kafka and then indexed in Elasticsearch for log management. The traffic logging/indexing is performed in parallel to speed up the whole process in real time.
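The queue-then-index flow described above can be sketched in plain Python. This is an illustrative stand-in only: `queue.SimpleQueue` plays the role of a Kafka topic and the hypothetical `LogIndex` class plays the role of two Elasticsearch indices; the message fields and the document id format are assumptions, not part of the protocol specification.

```python
import json
from collections import defaultdict
from queue import SimpleQueue


class LogIndex:
    """In-memory stand-in for two indices ("opv" and "npv") with duplicate rejection."""

    def __init__(self) -> None:
        self._docs = {"opv": defaultdict(list), "npv": defaultdict(list)}
        self._seen = set()  # document ids already stored in either index

    def index(self, label: str, doc_id: str, doc: dict) -> bool:
        """Store doc under its label; return False if doc_id was already indexed."""
        if doc_id in self._seen:
            return False
        self._seen.add(doc_id)
        self._docs[label][doc["vehicle_id"]].append(doc)
        return True

    def search(self, label: str, vehicle_id: str) -> list:
        """Return all documents logged for one vehicle under one label."""
        return list(self._docs[label].get(vehicle_id, []))


def drain(messages: SimpleQueue, index: LogIndex) -> int:
    """Consume all queued JSON messages and index them; return how many were stored."""
    stored = 0
    while not messages.empty():
        msg = json.loads(messages.get())
        doc_id = f'{msg["vehicle_id"]}@{msg["ts"]}'  # assumed unique per observation
        if index.index(msg["label"].lower(), doc_id, msg):
            stored += 1
    return stored


# Hypothetical messages; the second one is a repeat of the first observation.
topic = SimpleQueue()
for m in (
    {"vehicle_id": "V1", "ts": 100, "label": "OPV", "junction": "J5"},
    {"vehicle_id": "V1", "ts": 100, "label": "NPV", "junction": "J5"},
    {"vehicle_id": "V2", "ts": 101, "label": "NPV", "junction": "J7"},
):
    topic.put(json.dumps(m))

logs = LogIndex()
stored_count = drain(topic, logs)
```

Keying duplicates on the observation rather than on the label means the same record arriving under both OPV and NPV is stored once, which mirrors the duplication-avoidance goal stated for the logging module.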
- Data Labels Exchanges (DLE): Data Labels Exchanges perform the continuous bridging between the NPV and OPV logs. A dynamic log shift is created to handle the switching of NPV to OPV. All the newly detected and joined vehicles around all junctions, after being recorded multiple times, are shifted to the OPV log based on the consistency of their commutes. These data labels help in generating flux for the OPV log to update and react accordingly. The significance of the data labels exchange in the proposed structure is discussed further in the results and analysis section; in short, DLE brings stability while growing the OPV logs for better, longer, and more stable clustering possibilities.
- Smart Data Management: This factor enables estimation of the expected traffic clearance time at each junction while considering traffic flow rates as inputs. This step also reduces the chances of congestion and any possible clustering mishaps.
- Data Visualization and Analysis: To keep the data logs, data labels exchanges, and smart data management accountable and clearly visible for the analysis of their designated features, Kibana is integrated as the endpoint of the system for data visuals. The database (Elasticsearch) is queried by Kibana, which retrieves results with the help of the indexing (assigned while maintaining logs) and then displays updated metrics accordingly.
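The label exchange between the two logs can be sketched as a small counter-based rule. This is a minimal sketch under stated assumptions: the source does not specify how many consistent commutes trigger the shift, so the `promote_after` threshold of 3, the `LabelExchange` class, and the route encoding (`"J5->J9"`) are all hypothetical.

```python
from collections import defaultdict


class LabelExchange:
    """Shift a (vehicle, route) pair from the NPV log to the OPV log after repeated sightings."""

    def __init__(self, promote_after: int = 3) -> None:
        # promote_after is an assumed threshold; the source only says "multiple times".
        self.promote_after = promote_after
        self.sightings = defaultdict(int)  # (vehicle_id, route) -> number of recordings
        self.opv = set()                   # pairs currently held in the OPV log

    def record(self, vehicle_id: str, route: str) -> str:
        """Record one commute and return the label ("NPV" or "OPV") that now applies."""
        key = (vehicle_id, route)
        self.sightings[key] += 1
        if key not in self.opv and self.sightings[key] >= self.promote_after:
            self.opv.add(key)  # the entry moves between logs; nothing is duplicated
        return "OPV" if key in self.opv else "NPV"


dle = LabelExchange(promote_after=3)
labels = [dle.record("V1", "J5->J9") for _ in range(4)]
other = dle.record("V1", "J2->J3")  # same vehicle, unfamiliar route
```

With these assumptions, a vehicle is labeled NPV on its first two commutes along a route and OPV from the third onward, while a new route for the same vehicle starts again as NPV.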
ML-TDG data collection
The second module implements the Smart Machine Learning-based Traffic Data Gathering protocol (ML-TDG) for data collection. This protocol proposes dynamic segmentation switching for smart handling of communication limitations, e.g., bottlenecks, flooding, spamming, and jamming. ML-TDG is a lightweight data collection module that is dynamically designed for collecting and forwarding data packets based on current and rapidly evolving traffic conditions via vehicle log classifications. This module operates through the stages described below.
- Dynamic Segmentation Switching (DSS): Since we consider Ireland’s M50, our model is based on multi-directional lanes (Lm) with 45.5 km of length (Ln). As the road is considerably long and includes various turns and multiple directions, we have considered the division into 17 Junctions (J1, J2, and J3 \(\ldots \) J17) to test the implementation of the proposed protocol. Each junction is considered and tested, and the result counts are measured by averaging the accumulated results gathered from all junctions simultaneously. Considering that there are higher chances of multiple area occurrences in the vehicular network, we have taken into account the density of the network based on directions, vehicle density, and collection area. Directions stand for the M50’s 8 lanes, with 4 lanes in each direction, abbreviated as Da1, Da2, Da3, Da4, Db1, Db2, Db3, and Db4. Density stands for variable vehicle ranges as stated by RSA, approximately including both NPV and OPV. The number of BS deployments N(BS) is calculated as follows:
$$\begin{aligned} N(BS)= ln/x \end{aligned}$$ (1)
where ln is the M50 length and x is the coverage range in km of one BS. N(BS) is the number of base stations deployed in the respective collection regions. Each set of 4 lanes (a single direction) is treated as a road for the distribution of the respective segments. Each road is divided into two kinds of virtual segments, tagged as Collector Segments (CS) \(\in \) CS1, CS2, CS3 \(\ldots \) CSn, and Silent Segments (SS) \(\in \) SS1, SS2, SS3 \(\ldots \) SSn. CSs perform data collection and communicate with the BS, while SSs are no-communication zones. To divide the road into virtual segments of equal length, the number of segments (VLs) is calculated as follows:
$$\begin{aligned} VLs= lm/CR \end{aligned}$$ (2)
where lm is the total distance of the collection area and CR is the communication range of vehicles on the virtually created segments. The integration of dynamic segmentation switching is significant in reducing bottlenecks, network congestion, message spam, and collisions. Conventional road segmentation remains static, whereas DSS allows each segment to switch dynamically, taking control from a Collector Segment while assigning control to a Silent Segment. This switching is time-driven: virtual segments are allocated a time \(\Delta \)t after which they switch, allowing the maximum number of vehicles to take part in the data collection process. The time factor makes the segments switch alternately, turn by turn, i.e., CS converts into SS and SS into CS. Vehicles Speed (VS) with respect to time can be calculated as follows:
$$\begin{aligned} V_{s} = (CS_{l}/AVS) \times \Delta t \end{aligned}$$ (3)
where CSl is the collection segment length, AVS is the average vehicle speed on the road, and \(\Delta \)t is the time that sets the limit for data collection. The switching of CS and SS segments enables the detection of any blockage present. This makes DSS a favorable choice for determining and highlighting possible areas of accidents and blockage. Figure 4 illustrates DSS.
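The sizing rules in Eqs. (1) and (2) and the time-driven CS/SS alternation can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the rounding up to whole base stations and segments, the strict even/odd alternation of roles, and the example BS range (5 km), communication range (0.5 km), and \(\Delta \)t (30 s) are all hypothetical. Only the 45.5 km road length comes from the text.

```python
import math
from typing import List


def num_base_stations(road_length_km: float, bs_range_km: float) -> int:
    """Eq. (1): N(BS) = ln / x. Rounding up is an assumption so the road is fully covered."""
    return math.ceil(road_length_km / bs_range_km)


def num_virtual_segments(collection_length_km: float, comm_range_km: float) -> int:
    """Eq. (2): VLs = lm / CR. Rounding up is an assumption."""
    return math.ceil(collection_length_km / comm_range_km)


def segment_roles(n_segments: int, t: float, delta_t: float) -> List[str]:
    """Return the role of each virtual segment at time t.

    Segments alternate between Collector Segment (CS) and Silent Segment (SS),
    and every delta_t seconds all roles are swapped (hypothetical alternation rule).
    """
    phase = int(t // delta_t) % 2  # 0 before the first switch, 1 after it, and so on
    return ["CS" if (i + phase) % 2 == 0 else "SS" for i in range(n_segments)]


# Example with the M50 length from the text and hypothetical ranges.
print(num_base_stations(45.5, 5.0))
print(num_virtual_segments(45.5, 0.5))
print(segment_roles(4, t=0.0, delta_t=30.0))
print(segment_roles(4, t=30.0, delta_t=30.0))
```

Keeping the role assignment a pure function of the current time means no switching messages need to be exchanged in this sketch, which is in the spirit of the reduced overhead that DSS aims for.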
-
Real-time cluster head election (R-CHE) Clustering is considered an effective mechanism for managing inter-vehicular communications among a set of vehicles moving together in a targeted region. Sustainable and sizeable clusters yield better, more energy-efficient results. The previous version of ML-TDG (named TDG [26]) offered clustering based on beacon messages that the BS periodically sent to all vehicles driving in range. On receiving neighborhood information, vehicles informed the BS about their neighbor count, current position, and direction of movement. The BS declared the vehicle with the largest neighbor count as CH because of its better position surrounded by other vehicles. This CH election and clustering mechanism was effective in covering the best possible number of vehicles. However, a large number of messages had to be exchanged before a cluster could start data collection and communication. Second, this type of cluster head election also needs a self-induced clustering mechanism in which a vehicle with no beacon invitation declares itself as CH to start a collection. This feature requires more energy, more data capacity, more message exchanges, and ultimately more time. ML-TDG accomplishes R-CHE by distinguishing between OPVs and NPVs and the junctions in a particular area. Let us assume the scenario of J1, in which vehicles can only enter the M50 through the R131, joining via two conjoint roads with no possibility of exit and no other roads joining further on. In J1, vehicles in Da1, Da2, Da3, and Da4 move in one direction, and vehicles in Db1, Db2, Db3, and Db4 move in the other. The log labeling and indexing already built in Module 1 have distinguished Y OPVs and Z NPVs. The OPVs are set to form clusters based on N possible ranges.
The NPVs act as a source push, i.e., they push the required data to the nearest OPV, which removes the dependence on RS and its possible availability and thus makes the scheme infrastructure-independent. The logging and indexing provide higher stability for the clustering process while eliminating the need for beacon messages and self-induced clustering, and avoiding neglected vehicles and replicated data from the same vehicle at a given time. All vehicles V1, V2, V3 \(\ldots \) Vn that are given an index Ixn within the database of the respective OPV and NPV logs, and that have predetermined/pre-recorded indexes based on the junction they pass through, are declared as a cluster of range CR, where CR is the communication range. There are 17 junctions (J1, J2, J3 \(\ldots \) J17) that are replicated in SUMO, as shown in Fig. 5.
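The log-driven clustering described above can be illustrated with a short sketch. It assumes, purely for illustration, that each vehicle is reduced to a one-dimensional position along the road, that OPVs falling into the same window of length CR form one cluster, and that an NPV pushes its data to the OPV closest in position. The `Vehicle` class and both function names are hypothetical and do not come from the ML-TDG code base.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Vehicle:
    vid: str            # vehicle identity taken from the log index
    position_km: float  # position along the road (assumed 1-D)
    label: str          # "OPV" or "NPV", as assigned by the logs


def cluster_opvs(vehicles: List[Vehicle], comm_range_km: float) -> Dict[int, List[str]]:
    """Group OPVs into clusters, one per window of length CR (assumed rule)."""
    clusters: Dict[int, List[str]] = {}
    for v in vehicles:
        if v.label == "OPV":
            window = int(v.position_km // comm_range_km)
            clusters.setdefault(window, []).append(v.vid)
    return clusters


def nearest_opv(npv: Vehicle, vehicles: List[Vehicle]) -> Optional[str]:
    """Return the id of the OPV closest to the given NPV, or None if there is no OPV."""
    opvs = [v for v in vehicles if v.label == "OPV"]
    if not opvs:
        return None
    return min(opvs, key=lambda v: abs(v.position_km - npv.position_km)).vid


fleet = [
    Vehicle("V1", 0.1, "OPV"),
    Vehicle("V2", 0.3, "OPV"),
    Vehicle("V3", 0.7, "OPV"),
    Vehicle("V4", 0.6, "NPV"),
]
print(cluster_opvs(fleet, comm_range_km=0.5))
print(nearest_opv(fleet[3], fleet))
```

No beacon exchange appears anywhere in the sketch: cluster membership follows directly from the labels and indexes that are already stored in the logs.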
-
Real-Time Data Aggregation Every data packet entry corresponding to a given index is interpreted through its data packet number. Data packets are concatenated via the maintained logs and then sent to the BS. The data logs reset the array and then start concatenating data for aggregation. This reset prevents an empty array from being aggregated. If data are already present, upcoming data are concatenated with the previous data. Resetting the array also prevents redundant or expired data from being concatenated with newly arrived data, which makes data delivery updates more efficient. Every aggregated data array contains commas and semi-colons as part of it. Therefore, both are treated as delimiters that break the data at every point where either a comma or a semi-colon occurs. This continues until every parameter is separated and a distinct value is obtained for each set parameter. Delimiters are limit-setting symbols applied to the collected data so that the assigned values to be retrieved can be distinguished. The selected symbols are already part of the data that reaches the BS.
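The reset-and-concatenate behavior and the delimiter-based splitting can be sketched as below. This is a minimal sketch under assumptions: the text does not say which delimiter separates fields and which separates records, so the sketch assumes a comma between the fields of one packet and a semi-colon between packets. The `Aggregator` class and `split_payload` function are hypothetical names.

```python
from typing import List, Sequence


class Aggregator:
    """Concatenates packet records into one payload and resets after each send."""

    def __init__(self) -> None:
        self.buffer = ""

    def add(self, packet_fields: Sequence[object]) -> None:
        # Fields of one packet are joined with commas (assumed layout).
        record = ",".join(str(f) for f in packet_fields)
        # Packets are joined with semi-colons; new data follow any data already present.
        self.buffer = f"{self.buffer};{record}" if self.buffer else record

    def flush(self) -> str:
        # Hand over the payload and reset, so expired data never mix with new data.
        payload, self.buffer = self.buffer, ""
        return payload


def split_payload(payload: str) -> List[List[str]]:
    """Break a payload at every semi-colon (records) and comma (fields)."""
    if not payload:
        return []  # an empty array is never parsed
    return [record.split(",") for record in payload.split(";") if record]


agg = Aggregator()
agg.add(["V1", 12.5, 80])
agg.add(["V2", 13.0, 75])
payload = agg.flush()
print(payload)
print(split_payload(payload))
```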
Integration and revised training
The ML model is set to train during log exchanges, as NPVs shift to OPVs. The training data sets (i.e., the OPV data set and the NPV data set) are built from the flow of records collected in real time. The trained sets are labeled and included to analyze the packets sent and the packets received. By storing the captured real-case data in Elasticsearch, it is possible to re-train on the logs.
Results and analysis
This section covers the simulation environment and gives a detailed analysis of the results to evaluate the performance of the ML-TDG protocol. The base scheme, TDG [26], is implemented in NS-2.35 on Ubuntu 16.04 alongside the newly proposed ML-TDG protocol. Vehicles are deployed using Tool Command Language (TCL) scripts covering road-specific scenarios. For TDG, message initiation, node configuration, and the CH and sink nodes are also handled through TCL. ML-TDG additionally uses Python for the implementation of the machine learning modules, whereas the modules carried over from TDG are written in C.
The data packet exchanged and communicated for data collection includes position, velocity, sequence number, identity, source, and destination. Sending and receiving data functions are performed separately, i.e., CH manages the selection of recipient vehicles and aggregation of data is performed at the sink node. AWK scripts are used to extract the end-to-end delay, Packet Delivery Ratio (PDR), efficiency, and effectiveness from the generated trace files. Simulation parameters are given in Table 1.
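As an illustration of the kind of post-processing performed on the trace files, the following sketch computes PDR and mean end-to-end delay from send and receive timestamps keyed by sequence number. It is written in Python rather than AWK and assumes the trace has already been parsed into two dictionaries; the parsing step, the function name, and the sample values are hypothetical.

```python
from typing import Dict, Tuple


def pdr_and_delay(sent: Dict[int, float], received: Dict[int, float]) -> Tuple[float, float]:
    """Return (packet delivery ratio, mean end-to-end delay in seconds).

    sent maps sequence number -> send time; received maps sequence number -> receive time.
    """
    delivered = [seq for seq in sent if seq in received]
    pdr = len(delivered) / len(sent) if sent else 0.0
    delays = [received[seq] - sent[seq] for seq in delivered]
    mean_delay = sum(delays) / len(delays) if delays else 0.0
    return pdr, mean_delay


# Hypothetical sample: four packets sent, three delivered.
sent_times = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}
recv_times = {1: 0.5, 2: 1.5, 4: 3.5}
print(pdr_and_delay(sent_times, recv_times))
```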
KPIs: Key Performance Indicators: These indicators serve as the benchmarks against which the network performance of ML-TDG is determined. The KPIs include Efficiency, Effectiveness, Average Efficiency (cumulative), Average Effectiveness (cumulative), and Cluster Stability. They are explained and discussed below.
Efficiency: Efficiency is calculated in AWK scripts by the formula
where V(n) is the amount of data sent by vehicles and received by the BS, and N is the total number of vehicles participating in data sharing.
Effectiveness: Effectiveness is calculated in trace files by the formula
where V(n) is the number of vehicles whose data are delivered to the BS and S(n) is the number of vehicles whose data should be delivered to the BS. Efficiency in ML-TDG can be described as the protocol's ability to exchange data within a given time. The efficiency of ML-TDG is also known as communication efficiency, i.e., less communication reported at a junction, relative to the real-time number of vehicles, indicates less efficient communication. The effectiveness of ML-TDG, in turn, indicates the ability of the protocol to accomplish the designated data communication while completing all designed activities at the right time, cost, and speed in the least expensive way.
Efficiency and effectiveness are calculated over the 17 Junctions (Jn) and their average is considered to map the percentage through the given formulas below
where avg indicates the average, V is the total number of vehicles, \(\Sigma \) denotes the sum over the vehicles within the junctions, V(n) is the number of vehicles whose data are delivered to the BS, S(n) is the number of vehicles whose data should be delivered to the BS, and Jn indicates a junction.
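The per-junction indicators and their averages can be sketched as follows. The sketch assumes that both indicators are plain percentage ratios of the quantities defined above (data received over participating vehicles for efficiency, vehicles delivered over vehicles expected for effectiveness) and that the cumulative values are arithmetic means over the junctions. These forms, the function names, and the sample counts are assumptions made for illustration only.

```python
from typing import Dict, List


def percentage(numerator: float, denominator: float) -> float:
    """Ratio expressed as a percentage; 0.0 when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else 0.0


def junction_kpis(data_received: float, participants: float,
                  delivered: float, expected: float) -> Dict[str, float]:
    """Efficiency and effectiveness of one junction (assumed ratio forms)."""
    return {
        "efficiency": percentage(data_received, participants),
        "effectiveness": percentage(delivered, expected),
    }


def average_kpi(per_junction: List[Dict[str, float]], key: str) -> float:
    """Arithmetic mean of one indicator over all junctions."""
    return sum(j[key] for j in per_junction) / len(per_junction)


# Two hypothetical junctions.
junctions = [
    junction_kpis(data_received=80, participants=100, delivered=90, expected=100),
    junction_kpis(data_received=60, participants=100, delivered=70, expected=100),
]
print(average_kpi(junctions, "efficiency"))
print(average_kpi(junctions, "effectiveness"))
```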
Based on Formulas (4), (5), (6), and (7), efficiency and effectiveness are calculated and shown in Fig. 6. In Fig. 6, the efficiency and effectiveness trend lines (orange for effectiveness and blue for efficiency) vary in percentage across the junctions. Notably, Jn3 and Jn4 recorded close to 70% and 61%, respectively. A possible reason for this slightly lower efficiency is that Jn3 and Jn4 have the maximum number of diversions and bi-directional entries and exits, which may affect efficiency (refer to Fig. 5). Jn5 to Jn9 recorded between approximately 65% and 69%. The recorded efficiency is slightly lower than expected, as the multi-directional, multi-lane layout and the multiple entries and exits still affect it. On the other hand, the remaining junctions, i.e., Jn10–Jn17, range from 88% to 89%, indicating credibility and success in achieving higher efficiency. Out of all 17 junctions, the efficiency of 8 junctions is above 88%, 3 junctions are above 70%, and 3 junctions are above 68%, while only 2 junctions (Jn4 and Jn5) are around 61% and 65%.
The effectiveness of ML-TDG shows the reliability of the protocol: all junctions attain percentages ranging from 79% to 94%, with most junctions showing effectiveness above 92% and only 1 junction (Jn3) at 79%. All the remaining junctions are above 82%. A noticeable fact about Jn3 is that its efficiency is 71% and its effectiveness is 79%. This is the lowest in both parameters. The reason is that Jn3 is architecturally different from all the other junctions. Traffic reaches it differently, mostly asynchronously. Unlike any other junction, Jn3 has more than 8 entries and exits with 2 roundabouts. The complex nature of the junction affects efficiency and effectiveness, making it the least performing junction. However, it still operates within acceptable ratios.
In Fig. 7, the efficiency and effectiveness of TDG are given in comparison with ML-TDG. It is evident that TDG, without machine learning modular components, yields on average 5 to 10% lower percentages than ML-TDG. One of the primary factors behind its slightly lower efficiency and effectiveness is the larger set of vehicles passing along the motorway. Moreover, TDG was designed and implemented in real-time environments without training data sets. TDG was also infrastructure-independent, which adds a time factor to producing better efficiency. Large traffic flows potentially cause bottlenecks and larger overheads in TDG, whereas ML-TDG takes full benefit of larger volumes to train and then executes data collection.
Cluster Stability: Cluster stability is an indicator of how much a clustering remains unaltered under any change reported by the vehicles or the respective network. Under real-time weighted traffic conditions, vehicles may frequently join and leave a cluster. ML-TDG is designed for better cluster stability. For example, in Fig. 8, cluster stability is shown over a period of 24 h at intervals of 3 h, based on Maximum Clustering (MT) percentages from the start of the junction to the end of the junction, giving a standard response on average throughout the day. Based on the clustering frequencies considered, it is concluded that a high frequency of vehicles yields better and longer-lasting clusters while facilitating efficient data collection. During the first 3 h, considering the real-time midnight traffic flow from 12 am to 3 am, the sparse and less frequent flow of cars gives lower clustering stability. As time moves into the early morning range with increased vehicular flow, ML-TDG produces better cluster sizes and durations owing to potential early commuters. Soon after the early morning peak hours, the clusters begin to shrink and last for less time. The trend rises again from 12 pm to 3 pm, keeps increasing with time, and starts dropping around and after 10 pm. The machine learning based protocol benefits from higher vehicular flows, i.e., the more vehicles there are, the better the cluster sizes, strength, and duration.
In Fig. 9, the cluster stability of TDG is shown, which indicates smaller yet more numerous clusters working closely together based on traffic conditions. The primary reason behind the smaller functioning clusters is the TDG CH election criterion, in which the vehicle with more neighboring vehicles is declared CH. The range of neighboring vehicles is therefore limited, which results in smaller clusters. This factor results in more data frame exchanges and higher throughput in comparison with ML-TDG. In contrast, ML-TDG uses logs and indexes, which is why it encapsulates bigger clusters for data collection.
Rate of vehicle logs shifting from NPV to OPV: NPV and OPV are allocated randomly based on the daily commutes given in [85], and some of the statistics are obtained from [86]. The final distribution is given in Fig. 14. As illustrated in Fig. 14, the rate of change from New Path Vehicle (NPV) to Old Path Vehicle (OPV) for Jn1–Jn17 is shown as percentages of the exchange rate. At Jn1, the share of OPV (89%) is greater than that of NPV (11%), indicating a low conversion rate in terms of adding new commuters. Jn2 has 25% NPV and Jn3 has 31%. It can be concluded that NPV ratios throughout the junctions are much lower than OPV ratios. It is very common on any motorway for old commuters to account for a larger proportion than new commuters. As per the statistics given in [85, 86], daily commuters consist largely of people who use the motorway more often than new commuters who join it once in a while. Jn17 has 47% new commuters, making it the junction with the highest share of new commuters, ahead of Jn11 and Jn12 with 44% and 37%.
In contrast, Jn13 holds 93% OPV, ahead of Jn14 with 89% and Jn6 with 83%. One of the critical aspects of these ratios is the inter-conversion factor, as ML-TDG is designed to convert every NPV into an OPV once the vehicle has been reported a couple of times. This makes the job easier: as the data set for OPV grows, ML-TDG functions better and more time-effectively. Lower NPV ratios require less time for the execution of the ML-TDG modular components. If an NPV is spotted, it has to go through one additional step in which the NPV index is highlighted and counted, and a new index is then added under OPV so that the vehicle operates from there, avoiding repeated identification that might cause delays. This factor makes ML-TDG responsive and avoids repetitions.
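The NPV-to-OPV conversion step can be sketched as a small sighting counter. The text only states that a vehicle is converted after being reported "a couple of times", so the threshold of two sightings, the `CommuterLog` class, and its method names are assumptions made for illustration.

```python
from collections import Counter


class CommuterLog:
    """Tracks sightings and promotes a vehicle from NPV to OPV after repeated reports."""

    def __init__(self, promote_after: int = 2) -> None:
        self.promote_after = promote_after  # assumed threshold ("a couple of times")
        self.sightings: Counter = Counter()  # NPV index: vehicle id -> number of reports
        self.opv: set = set()                # OPV index: ids of known commuters

    def report(self, vid: str) -> str:
        """Record one sighting and return the vehicle's label after the update."""
        if vid in self.opv:
            return "OPV"  # already indexed, no repeated identification needed
        self.sightings[vid] += 1
        if self.sightings[vid] >= self.promote_after:
            self.opv.add(vid)        # add a new index under OPV
            del self.sightings[vid]  # drop the entry from the NPV log
            return "OPV"
        return "NPV"


log = CommuterLog()
for vid in ["V1", "V1", "V1", "V2"]:
    print(vid, log.report(vid))
```

Once a vehicle sits in the OPV index, later reports return immediately, which mirrors the point above that a growing OPV set shortens the additional NPV handling step.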
Based on the ML-TDG operations, four cases are considered below to highlight what makes the proposed protocol favorable in complicated scenarios. We use these four cases to extensively test the large number of real-time vehicular projections on Dublin's M50 motorway and to optimize the time needed to test the model over all the junctions considered:
Case I: when half of the commutes are NPV and the other half are OPV: The data collection success ratios when half of the commutes are NPV and the other half are OPV are given in Fig. 10. By storing the indexes of the assigned OPV and NPV captured through real-time data in Elasticsearch, it is conveniently possible to re-train the ML-TDG model periodically to execute the system within the network environment. When the system is given an equal proportion of NPV and OPV, i.e., NPV = OPV, the success ratio across all 17 junctions lies between 70% and 80%, which is a remarkable achievement. The lowest scoring junction (Jn5) is at 70% and the highest scoring junction (Jn4) is at 79.72%. When the already trained OPVs (indexes) collect data along with NPVs, this neutralizes the NPV indexing time. However, NPVs becoming part of OPV is a continuous process. Every new commuter is set to eventually become part of OPV, which increases the data collection ratios by increasing the size of the OPV data, a plus for machine learning based algorithms.
Case II: When the majority of commutes are OPV and lesser NPV: Data collection ratios are better in this case, when the majority of the vehicles are deployed as OPV (already discovered, indexed, and trained) and NPVs are in a smaller ratio. The ratio is set at 80% OPV and 20% NPV to qualify for this case. In this case, the junctions perform extremely well, as only a small ratio of NPV pushes OPV to adjust the new indexing for training. This case enables OPV to function at its maximum capacity. Here, all junctions perform within a range of 80% to 89%, with Jn5 at 80.41% and Jn4 at 89%. The details of the stats are given in Fig. 11. This case is straightforward and fairly predictable because the efficiency and success ratio of the already trained vehicular model are highly accurate. However, frequent indexing and training sessions (executing modules) with rapidly changing logs may affect the accuracy of the model. The high performance reported in this case is obtained without considering the interchangeable log and index formats.
Case III: When the majority of commutes are NPV and lesser OPV: A higher ratio of new commuters does not largely affect the data collection ratio in comparison with a higher OPV ratio. The factor that enables higher accuracy during data collection under a higher NPV ratio is that our model is designed to become functional only when a new path vehicle is detected and needs to be recorded. However, the traffic logging of NPV and OPV cannot be designed to act in parallel, because NPV ratios on the M50 are comparatively lower than OPV ratios. In this case, which is deliberately designed for higher NPV ratios, NPV logging stays active to record the series of consecutive logs that can eventually become OPV. In the end, the majority of the NPVs are added to the OPV logs and data collection resumes normally, normalizing the efficiency and time factor. Following this approach, this case provides considerably good percentages ranging from 70% to 74%. In the previous cases, however, a few of the junctions were close to or above 80%. The slight decrement reflects the time taken while logs are converted and indexes are assigned. The primary outcome in this case is still appreciable, given that it does not fall below 70%. Details of the stats are given in Fig. 14.
Case IV: Real case with 1.4 million OPV and 400,000 NPV: This case measures the strength of our designed model in an actual, real-time scenario to establish its suitability. The data collection success lies within a span of 70% to 77%. Jn9 at 75%, Jn12 at 76%, Jn13 at 77%, Jn15 at 74%, Jn16 at 73%, and Jn17 at 72% are recorded, demonstrating uninterrupted data collection. Jn14 at 71%, and Jn7 and Jn1 at 70%, reflect the architectural complexity of junctions that hinders the execution of components and produces gaps of a few percent. The detailed stats are given in Fig. 13. In Fig. 13, most ranges are reported with a marginal difference of 25%, which indicates a steady and progressive flow of events (Fig. 14).
Advantages and challenges of proposed methodology
As per the above-mentioned results and analysis based on the proposed methodology, there are various significant advantages of ML-TDG.
-
ML-TDG can save a considerable amount of time that is taken by excessive data exchanges for clustering and identification of vehicles for large-scale data collection. Reduced data communication cost and overhead based on real-time learning of rapidly evolving traffic conditions is the primary advantage of the proposed scheme.
-
Second, issues corresponding to delays, larger throughputs, and excessive bottlenecks can be smartly handled through the proposed algorithms based on the integration of dynamic segmentation switching and smart vehicular information processing and management.
-
Frequent repetition of data and energy taken for identification of a vehicle multiple times at multiple locations can also be overcome through the proposed methodology.
-
Energy and time optimization for data communication is one of the integral elements achieved through the implemented scheme.
-
Fewer resources are required, as the proposed methodology exhibits better cluster lifetime and ratio along with better optimization of results every time. ML-TDG keeps maturing with every log indexing, which reduces bottlenecks and data replication.
However, despite the above-mentioned advantages, the proposed algorithm may face problems arising from environmental, physical, mechanical, and electrical constraints that are unavoidable in real-time environments. For example, real-time traffic depends on ambulances, emergency vehicles, and accidental blockages that are neither predictable nor avoidable. Situations such as the path coverage of ambulances and emergency vehicles can affect the real-time live streaming of the traffic data feed and pose a challenge to network data training, structured data processing, and real-time displays.
Conclusion
In this paper, we propose a lightweight machine learning-based data collection protocol to deal with higher data volumes, followed by its implementation and testing on the real-time vehicular traffic environment of Dublin's M50 motorway. The protocol operates on a known architecture or pre-fed map. The methodology is aligned with a data processing engine, Apache Spark, which is further coordinated with a smart algorithm known as Dynamic Segmentation Switching (DSS). DSS divides the road into segments and switches them intelligently to reduce bottlenecks and replication, resulting in energy and time optimization. Further, a data aggregation scheme is implemented for data analysis to fetch information based on location, speed, vehicle id, and neighbor count. A data extraction scheme is also integrated to intelligently increase data retrieval and data utilization effectiveness at the base station. The protocol induces learning within vehicles so that they operate on the basis of previously processed data to gain maximum benefits in terms of effectiveness, efficiency, communication delay, communication overhead, bottlenecks, and data transmission rate.
The method proposed in this research is shown to be useful for dealing with huge traffic flows, bulk data processing, and analysis with real-time communication. As the traffic flow is collected from a vehicular network topology based on reality, the potential impact of the system on network performance is studied by comparing the network throughput. The results verify that the system is a lightweight solution that places little burden on the network. Apache Spark is used for timely and prompt display analysis, and the congestion problem is addressed with real-time road traffic data streaming. Ireland's most famous, busiest, essential, and extremely congested road (motorway) is used for testing and verifying the authenticity of our proposed protocol.
The proposed protocol is rigorously tested on various possible cases for systematic results analysis, i.e., the rate of vehicle logs shifting from NPV to OPV; half of the commutes being NPV and the other half OPV; the majority of commutes being OPV with fewer NPV, and vice versa; and the real case with 1.4 million OPV and 400,000 NPV. The results show that ML-TDG is potentially useful for higher data volumes and massive traffic flows. Because the vehicular network is trained with real-time evolving traffic conditions based on daily stats of the motorway, the protocol is applicable to larger-scale data without increasing the mean execution time.
Future work
The protocol proposed in this research is shown to be useful. However, there are still some problems and uncertainties that need to be investigated further and are part of future work. For example, the first limitation of the proposed protocol is its prerequisite of a known highway/road infrastructure to implement the functional design. This factor is considered minor, as most of the world's road infrastructures are predetermined and amendments can be made based on design and structural changes. ML-TDG can collect data based on real-time vehicle interactions; however, adapting to infrastructure based on real-time factors is not possible. We are working on this factor to make the protocol infrastructure-friendly or infrastructure-independent in the near future.
Second, the efficiency of the results gradually increases as the databases grow, train, and mature. This warm-up can be eliminated, and early efficiency achieved, by feeding the network with preprocessed data taken directly from the Road and Safety Department Ireland [87], which may require formal permission to use the already recorded departmental data. This element will also improve the energy and time parameters owing to little-to-no real-time log recording and maintenance. With pre-recorded data, OPVs will not need to be recorded, while the NPV log can remain intact to handle any possible new vehicle.
Third, in the near future, ML-TDG is set to be slightly modified with different machine learning models to test whether any other ML framework can help increase time, storage, energy, and communication efficiency, with the possibility of incorporating security factors as well. Various statistical and mathematical functions are also under consideration to check feasibility and improve the results.
Data availability
The datasets used and analysed during the current study are available from the corresponding author on request.
References
Meena G, Sharma D, Mahrishi M (2020) Traffic prediction for intelligent transportation system using machine learning. In: 2020 3rd international conference on emerging technologies in computer engineering: machine learning and internet of things (ICETCE), IEEE, pp 145–148
Kumar R, Kumar P, Tripathi R, Gupta GP, Kumar N, Hassan MM (2021) A privacy-preserving-based secure framework using blockchain-enabled deep-learning in cooperative intelligent transport system. IEEE Trans Intell Transport Syst
Khatri S, Vachhani H, Shah S, Bhatia J, Chaturvedi M, Tanwar S, Kumar N (2021) Machine learning models and techniques for vanet based traffic management: implementation issues and challenges. Peer-to-Peer Netw Appl 14(3):1778–1805
Haydari A, Yılmaz Y (2020) Deep reinforcement learning for intelligent transportation systems: a survey. IEEE Trans Intell Transport Syst 23(1):11–32
Chen M-Y, Chiang H-S, Yang K-J (2022) Constructing cooperative intelligent transport systems for travel time prediction with deep learning approaches. IEEE Trans Intell Transport Syst
Al-Suqri MN, Gillani M (2022) A comparative analysis of information and artificial intelligence toward national security. IEEE Access 10:64420–64434
Chavhan S, Gupta D, Nagaraju C, Rammohan A, Khanna A, Rodrigues JJ (2021) An efficient context-aware vehicle incidents route service management for intelligent transport system. IEEE Syst J 16(1):487–498
Statista (2022) Number of registered passenger cars in the republic of ireland from 2010 to 2019. https://www.statista.com/statistics/452305/ireland-number-of-registered-passenger-cars/. Accessed 31 Nov 2022
Central Statistics Office, Ireland (2022) Environmental indicators Ireland 2018, transport. https://www.cso.ie/en/releasesandpublications/ep/p-eii/eii18/transport/. Accessed 31 Nov 2022
Independent.ie (2016) Number of private cars on our roads hits two million. https://www.independent.ie/life/motoring/car-news/number-of-private-cars-on-our-roads-hits-two-million-34460268.html. Accessed 31 Nov 2022
Yuan T, da Rocha Neto W, Rothenberg CE, Obraczka K, Barakat C, Turletti T (2022) Machine learning for next-generation intelligent transportation systems: a survey. Trans Emerg Telecommun Technol 33(4):e4427
Hlaing SS, Tin MM, Khin MM, Wai PP, Sinha G (2020) Big traffic data analytics for smart urban intelligent traffic system using machine learning techniques. In: 2020 IEEE 9th global conference on consumer electronics (GCCE), IEEE, pp 299–300
Haghighat AK, Ravichandra-Mouli V, Chakraborty P, Esfandiari Y, Arabi S, Sharma A (2020) Applications of deep learning in intelligent transportation systems. J Big Data Anal Transport 2(2):115–145
Yuan T, da Rocha Neto WB, Rothenberg C, Obraczka K, Barakat C, Turletti T (2019) Harnessing machine learning for next-generation intelligent transportation systems: a survey. In: Proceedings of the computational intelligence, communication systems and networks (CICSyN)
Lécué F, Tallevi-Diotallevi S, Hayes J, Tucker R, Bicer V, Sbodio M, Tommasi P (2014) Smart traffic analytics in the semantic web with star-city: scenarios, system and lessons learned in Dublin city. J Web Semant 27:26–33
Jia Y, Wu J, Ben-Akiva M, Seshadri R, Du Y (2017) Rainfall-integrated traffic speed prediction using deep learning method. IET Intell Transport Syst 11(9):531–536
Dusparic I, Monteil J, Cahill V (2016) Towards autonomic urban traffic control with collaborative multi-policy reinforcement learning. In: 2016 IEEE 19th international conference on intelligent transportation systems (ITSC), IEEE, pp 2065–2070
Taparia A, Brady M (2021) Bus journey and arrival time prediction based on archived avl/gps data using machine learning. In: 2021 7th international conference on models and technologies for intelligent transportation systems (MT-ITS), IEEE, pp 1–6
Philip AO, Saravanaguru R (2018) A vision of connected and intelligent transportation systems. Int J Civ Eng Technol 9(2):873–882
Nama M, Nath A, Bechra N, Bhatia J, Tanwar S, Chaturvedi M, Sadoun B (2021) Machine learning-based traffic scheduling techniques for intelligent transportation system: opportunities and challenges. Int J Commun Syst 34(9):e4814
Lansky J, Rahmani AM, Hosseinzadeh M (2022) Reinforcement learning-based routing protocols in vehicular ad hoc networks for intelligent transport system (its): a survey. Mathematics 10(24):4673
Alonso F, Faus M, Tormo MT, Useche SA (2022) Could technology and intelligent transport systems help improve mobility in an emerging country? Challenges, opportunities, gaps and other evidence from the Caribbean. Appl Sci 12(9):4759
Creß C, Knoll AC (2021) Intelligent transportation systems with the use of external infrastructure: a literature survey. arXiv: 2112.05615
Njoku JN, Nwakanma CI, Amaizu GC, Kim D-S (2022) Prospects and challenges of metaverse application in data-driven intelligent transportation systems. IET Intell Transport Syst
Gillani M, Niaz HA, Farooq MU, Ullah A (2022) Data collection protocols for vanets: a survey. Complex Intell Syst 1–30
Gillani M, Niaz HA, Ullah A, Farooq MU, Rehman S (2022) Traffic aware data gathering protocol for vanets. IEEE Access 10:23438–23449
Seth I, Guleria K, Panda SN (2022) Introducing intelligence in vehicular ad hoc networks using machine learning algorithms. ECS Trans 107(1):8395
Chaymae T, Elkhatir H, Otman A (2022) Recent advances in machine learning and deep learning in vehicular ad-hoc networks: a comparative study. In: International conference on electrical systems & automation. Springer, pp 1–14
Gillani M, Ullah A, Niaz HA (2018) Trust management schemes for secure routing in vanets—a survey. In: 2018 12th international conference on mathematics, actuarial science, computer science and statistics (MACS), IEEE, pp 1–6
Kashinath SA, Mostafa SA, Mustapha A, Mahdin H, Lim D, Mahmoud MA, Mohammed MA, Al-Rimy BAS, Fudzee MFM, Yang TJ (2021) Review of data fusion methods for real-time and multi-sensor traffic flow analysis. IEEE Access 9:51258–51276
Pandey MK (2022) Advance automated highway systems and their impact on intelligent transport systems. J East China Univ Sci Technol 65(2):631–640
Nazib RA, Moh S (2021) Reinforcement learning-based routing protocols for vehicular ad hoc networks: a comparative survey. IEEE Access 9:27552–27587
Gillani M, Niaz HA, Tayyab M (2021) Role of machine learning in wsn and vanets. Int J Electr Comput Eng Res 1(1):15–20
Kandali K, Bennis L, El Bannay O, Bennis H (2022) An intelligent machine learning based routing scheme for vanet. IEEE Access 10:74318–74333
Sataraddi MJ, Kakkasageri MS (2021) Machine learning based vehicle-to-infrastructure communication in vanets. In: 2021 IEEE 18th India council international conference (INDICON), IEEE, pp 1–6
Devi A, Kait R, Ranga V (2022) Automated cluster head selection in fog-vanet via machine learning. In: Communication and intelligent systems. Springer, pp 1169–1179
Nayak RP, Sethi S, Bhoi SK, Sahoo KS, Nayyar A (2022) Ml-mds: machine learning based misbehavior detection system for cognitive software-defined multimedia vanets (csdmv) in smart cities. Multim Tools Appl 1–21
Shen L, Tao H, Ni Y, Wang Y, Vladimir S (2023) Improved yolov3 model with feature map cropping for multi-scale road object detection. Meas Sci Technol
Tao H, Cheng L, Qiu J, Stojanovic V (2022) Few shot cross equipment fault diagnosis method based on parameter optimization and feature metric. Meas Sci Technol 33(11):115005
Kazi AK, Khan SM, Farooq U, Hina S (2022) Compacted area with effective links (cael) for data dissemination in vanets. Sensors 22(9):3448
Sun P, Song X, Song S, Stojanovic V (2023) Composite adaptive finite-time fuzzy control for switched nonlinear systems with preassigned performance. Int J Adapt Control Signal Process 37(3):771–789
Zhou C, Tao H, Chen Y, Stojanovic V, Paszke W (2022) Robust point-to-point iterative learning control for constrained systems: a minimum energy approach. Int J Robust Nonlinear Control 32(18):10139–10161
Mohandu A, Kubendiran M (2021) Survey on big data techniques in intelligent transportation system (its). Mater Today Proc 47:8–17
Balisi AN, Jula H, Chassiakos A (2021) Smart cities: a focus on intelligent transportation systems. In: 2021 IEEE green energy and smart systems conference (IGESSC), IEEE, pp 1–7
Mulerikkal J, Thandassery S, Rejathalal V, Ayyappan B et al (2021) Jp-dap: an intelligent data analytics platform for metro rail transport systems. IEEE Trans Intell Transport Syst
Nguyen N-L, Vo H-T, Lam G-H, Nguyen T-B, Do T-H (2022) Real-time traffic congestion forecasting using prophet and spark streaming. In: International conference on intelligence of things. Springer, pp 388–397
Sengul MK, Tarhan C, Tecim V (2022) Application of intelligent transportation system data using big data technologies. In: 2022 innovations in intelligent systems and applications conference (ASYU), IEEE, pp 1–6
Alazzam H, AbuAlghanam O, Sharieh A (2022) Best path in mountain environment based on parallel a* algorithm and apache spark. J Supercomput 78(4):5075–5094
Azeroual O, Nikiforova A (2022) Apache spark and mllib-based intrusion detection system or how the big data technologies can secure the data. Information 13(2):58
Mohyuddin S, Prehofer C (2021) A scalable data analytics framework for connected vehicles using apache spark. In: 2021 international symposium on electrical, electronics and information engineering, pp 322–329
Jain M, Vasdev D, Pal K, Sharma V (2022) Systematic literature review on predictive maintenance of vehicles and diagnosis of vehicle’s health using machine learning techniques. Comput Intell 38(6):1990–2008
Nagy E, Lovas R, Pintye I, Hajnal Á, Kacsuk P (2021) Cloud-agnostic architectures for machine learning based on apache spark. Adv Eng Softw 159:103029
Ali Mohamed M, El-Henawy IM, Salah A (2021) Usages of spark framework with different machine learning algorithms. Comput Intell Neurosci 2021
JayaLakshmi A, Kishore KK (2022) Performance evaluation of dnn with other machine learning techniques in a cluster using apache spark and mllib. J King Saud Univ Comput Inf Sci 34(1):1311–1319
Prajapati GL, Raghuwanshi R (2021) Study of big data analytics tool: Apache spark. In: Big data analytics in cognitive social media and literary texts. Springer, pp 65–100
Kumar K, Sharma NA, Ali AS (2021) Machine learning solutions for investigating streams data using distributed frameworks: literature review. In: 2021 IEEE Asia-Pacific conference on computer science and data engineering (CSDE), IEEE, pp 1–6
Perr-Sauer J, Phillips C, Duran A, Van Roijen A (2021) Code artifact for: clustering analysis of commercial vehicles using automatically extracted features from time series data, technical report, National Renewable Energy Lab. (NREL), Golden, CO (United States)
Prathilothamai M, Viswanathan V (2022) Traffic prediction system using iot cluster based evolutionary under sampling approach. Int J Artif Intell Tools 2240024
Kozicki TM (2022) The usage of Apache Spark for dynamic open data processing. PhD thesis, Wydział Matematyki i Nauk Informacyjnych
Prehofer C (2021) Challenges of big data and vehicle data. In: 2021 IEEE international conference on autonomic computing and self-organizing systems companion (ACSOS-C), IEEE, pp 287–288
Shrivastava A, Verma JPV, Jain S, Garg S (2021) A deep learning based approach for trajectory estimation using geographically clustered data. SN Appl Sci 3(6):1–17
Park G-M, Heo YS, Kwon H-Y (2021) Trade-off analysis between parallelism and accuracy of slic on apache spark. In: 2021 IEEE international conference on big data and smart computing (BigComp), IEEE, pp 5–12
Zeng Y, Gu H, Wei W, Guo Y (2019) Deep-full-range: a deep learning based network encrypted traffic classification and intrusion detection framework. IEEE Access 7:45182–45190
Aloqaily M, Otoum S, Al Ridhawi I, Jararweh Y (2019) An intrusion detection system for connected vehicles in smart cities. Ad Hoc Netw 90:101842
Liu H, Lang B (2019) Machine learning and deep learning methods for intrusion detection systems: a survey. Appl Sci 9(20):4396
Linhares T, Patel A, Barros AL, Fernandez M (2022) Sdntruth: innovative ddos detection scheme for software-defined networks (sdn)
Malliga S, Kogilavani S, Sowmya R (2022) Deep discover: deep learning models for detecting distributed denial of service (ddos) attacks. In: AIP Conference Proceedings, vol 2393, AIP Publishing LLC, p 020191
Jiang W, Luo J (2022) Big data for traffic estimation and prediction: a survey of data and tools. Appl Syst Innov 5(1):23
Gillani M, Ullah A, Niaz HA (2018) Survey of requirement management techniques for safety critical systems. In: 2018 12th international conference on mathematics, actuarial science, computer science and statistics (MACS), IEEE, pp 1–5
Gillani M, Niaz HA, Ullah A (2022) Integration of software architecture in requirements elicitation for rapid software development. IEEE Access 10:56158–56178
Gillani M, Niaz HA, Ullah A (2020) Multi-cyclic requirement engineering for educational and industrial models in software development. In: 2020 IEEE 23rd international multitopic conference (INMIC), IEEE, pp 1–6
Ouhssini M, Afdel K, Idhammad M, Agherrabi E (2021) Distributed intrusion detection system in the cloud environment based on apache kafka and apache spark. In: 2021 fifth international conference on intelligent computing in data sciences (ICDS), IEEE, pp 1–6
Abushwereb M, Alkasassbeh M, Almseidin M, Mustafa M (2022) An accurate iot intrusion detection framework using apache spark. arXiv: 2203.04347
Rathore MM, Attique Shah S, Awad A, Shukla D, Vimal S, Paul A (2021) A cyber-physical system and graph-based approach for transportation management in smart cities. Sustainability 13(14):7606
The Irish Times (2022) M50 blues: Ireland’s busiest road, Dublin’s biggest car park. https://www.irishtimes.com/life-and-style/people/m50-blues-ireland-s-busiest-road-dublin-s-biggest-car-park-1.3259694. Accessed 31 Aug 2022
Apache Spark (2022) Spark streaming. https://spark.apache.org/. Accessed 25 Dec 2022
Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S et al (2016) Mllib: machine learning in apache spark. J Mach Learn Res 17(1):1235–1241
Ali AH, Abbod MN, Khaleel MK, Mohammed MA, Sutikno T (2021) Large scale data analysis using mllib. Telkomnika (Telecommunication Computing Electronics and Control) 19(5):1735–1746
Kononenko O, Baysal O, Holmes R, Godfrey MW (2014) Mining modern repositories with elasticsearch. In: Proceedings of the 11th working conference on mining software repositories, pp 328–331
Gormley C, Tong Z (2015) Elasticsearch: the definitive guide: a distributed real-time search and analytics engine. O’Reilly Media, Inc
Sharma V (2016) Getting started with kibana. In: Beginning Elastic Stack. Springer, pp 29–44
Takase W, Nakamura T, Watase Y, Sasaki T (2017) A solution for secure use of kibana and elasticsearch in multi-user environment. arXiv: 1706.10040
The Society of the Irish Motor Industry (2022) National vehicle statistics. https://www.simi.ie/en/motorstats/national-vehicle-statistics. Accessed 31 Oct 2022
Transport Infrastructure Ireland (2022) Irish toll data statistics. https://www.tii.ie/roads-tolling/tolling-information/tolling-dashboards/. Accessed 31 Oct 2022
Transport Infrastructure Ireland (TII) (2022) Transport infrastructure Ireland. https://www.tii.ie/. Accessed 25 Dec 2022
M50 Concession Limited (2022) Live travel times & traffic. https://www.m50concession.com/live-travel-times-traffic/
Road Safety Division, Department of Transport (2019) Road safety division. https://www.gov.ie/en/organisation-information/9d873d-road-safety-division/. Accessed 5 June 2023
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work is supported by University College Dublin, Ireland.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gillani, M., Niaz, H.A. Machine learning based data collection protocol for intelligent transport systems: a real-time implementation on Dublin M50, Ireland. Complex Intell. Syst. 10, 1879–1897 (2024). https://doi.org/10.1007/s40747-023-01241-x