Introduction

An Intelligent Transport System (ITS) is a computationally integrated system of vehicles and infrastructure that enables informed, coordinated, and more efficient use of the transport network [1, 2]. A key feature of ITS is a transport system autonomous enough to improve traffic efficiency while minimizing problems for drivers [3]. The simplest description of ITS is the integration of information and communication technologies in transport [4]. A country’s growth is often measured by the quality of its transportation network [5, 6]. Advanced network infrastructure with better traffic facilities yields economic (trade, cross-country commutes, e-commerce businesses, etc.), civil, security, and technology-assisted benefits [7].

The number of vehicles on Irish roads has reached its highest level since 2008. The total number of licensed vehicles in Ireland stood at 2.2 million in 2019 [8], compared with 1,985,000 (approximately 2 million) in 2008–10. The number of vehicles has increased by 179% [9]. The Irish government has set a target of 1 million electric vehicles by 2030, which suggests that this growth is expected to reach a new peak. Ireland had 439 passenger cars per 1,000 inhabitants in 2016 [10], the sixth lowest figure in the EU. The total number of vehicles on the road increased by 22.2% in 2021 [9]. The annual number of newly licensed goods vehicles rose by 32.1% and that of private cars by 20.8%. On average, an increase of approximately 47% is observed every month in comparison with the prior month [9].

Because the number of vehicles is expected to keep growing daily, data collection for the computational operations of Intelligent Transport Systems is becoming a complex and tedious task [11,12,13,14]. Irish licensed vehicles traveled 41.9 billion kilometers in 2021, 36.2 billion kilometers in 2020, and 47.1 billion in 2019 with COVID-19 restrictions [9]. These figures would have been considerably higher without COVID-19 restrictions. Road traffic volumes are strongly affected by the rapidly increasing number of vehicles, which highlights the need for an intelligent, self-sufficient data collection and communication protocol to manage these volumes effectively.

Various ITS projects in Ireland are currently working on enhancing traffic features and experiences. ITS-based projects are active at both academic and industrial levels [15,16,17,18,19]. Despite the strong academic research interest in ITS, a gap remains in collaborating on and practically implementing academically proposed ITS protocols for industrial gains [12, 20, 21]. The open issues include integration with legacy traffic control systems, cost-effectiveness, increased congestion that results in higher travel times, greater environmental damage, increased global warming, poor communication technologies, poorly controlled accident response, and minimal infrastructure controls [22,23,24,25].

To address the aforementioned challenges and gaps, we practically implemented the previously proposed Real-time Data Gathering (TDG) Protocol [26] on Dublin’s M50 motorway, Ireland, with substantial changes incorporated into TDG. The revised protocol is named the Machine Learning based Data Gathering (ML-TDG) Protocol. Before discussing the ML-TDG modules, operations, and functionalities, we give an overview of TDG [26]. TDG is a lightweight protocol dynamically designed for collecting and forwarding data packets based on current and rapidly evolving traffic conditions, reducing network and data communication overhead while respecting real-time data collection time constraints. A data aggregation scheme is implemented for data analysis to fetch information based on location, speed, vehicle id, and neighbor count. A data extraction scheme is also integrated to increase the effectiveness of data retrieval and data utilization at the base station. TDG outperformed existing data-gathering protocols in effectiveness, efficiency, delay, communication overhead, and data transmission rate. Further details are given in [26].

In working toward the objectives, motivations, and contributions, we pose the following research questions: (1) Is it possible to integrate a Machine Learning based data engine (Apache Spark) into VANETs and ITS to induce learning within vehicles so that they operate on the basis of previously processed data for maximum benefit? (2) Is sorting data systems (based on labels) to deal with huge traffic flows, data processing and analysis, real-time communication, ITS, and smart cities practically possible in Dublin’s M50 environment? (3) Is it attainable to self-train a network model efficiently from live-streamed traffic data and from the data information system?

The concept of integrating machine learning into VANETs and ITS and specifically in ML-TDG is to induce learning within vehicles to operate on the basis of previously processed data for gaining maximum benefits [27]. Machine Learning along with VANETs is achieving new horizons of advancements in terms of sorting out the major issues, i.e., dealing with huge traffic flows, data processing and analysis, real-time communication, ITS, and smart cities [28, 29]. ML-TDG is utilizing machine learning concepts to deal with massive traffic flows and data congestion for better and optimized vehicular communication.

The primary objectives and motivations of the proposed ML-based Intelligent Transport System protocol are listed as follows:

(1) VANETs face rapidly changing traffic trends, unpredictable vehicle scenarios, and limited availability of database stations. The main objective is to propose and design a protocol for timely and up-to-date data collection in which the desired data are collected efficiently via self-training of the network model.

(2) Outdated and even slightly delayed data consume memory and energy and offer little usability for Intelligent Transport Systems. Another primary objective is to design a protocol that overcomes the memory and energy issues within a tolerable delay, i.e., in real time.

(3) The VANET environment is highly mobile and topologically constrained by roads, neighboring vehicles, and traffic signals. Therefore, vehicles on roads follow some pattern. In the case of high density, vehicles usually move closer together and naturally form clusters or groups on motorways. Hence, knowledge of road structure, motorway entries/exits, daily vehicle commutes, junction usage, positioning, and neighbor count can be considered informative parameters for developing a clustering-based real-time traffic-aware data-gathering protocol that pre-feeds the network to avoid repeating processes that consume energy, space, and time.

(4) Finally, the most critical objective of this paper is to design, propose, and implement an architecturally centered protocol able to learn automatically from daily vehicular commutes while delivering automated, real-time synced data collection solutions with enhanced time, storage, and processing features.

The highlighted contributions of the paper are enumerated as follows:

(1) A lightweight Machine Learning based data collection protocol is proposed for communicating and forwarding data packets based on real-time learning of rapidly evolving traffic conditions, reducing data communication cost and overhead while integrating large-scale data collection.

(2) Integration of Dynamic Segmentation Switching via machine learning (Spark Streaming) is proposed, which significantly reduces data communication cost, traffic data congestion, and data overriding during the execution of the protocol.

(3) The proposed protocol is applied over the known architecture of a heavily congested motorway infrastructure to train the model, which produces comparatively quicker results, enables real-time data communication and collection, and takes less energy and storage under real-time traffic conditions.

(4) A vehicular network is trained with real-time evolving traffic conditions based on daily statistics of the motorway, which makes the protocol applicable to larger-scale data without increasing the mean execution time.

(5) Long-lived and larger clusters are achieved by the trained protocol, which makes the protocol independent, efficient, and cost-effective.

(6) Extensive empirical evaluations and simulations are performed with state-of-the-art tools using real-time traffic scenarios and data to demonstrate higher performance.

Based on the above-mentioned objectives, motivations, and contributions, the main idea of the proposed protocol is a lightweight, energy-efficient Machine Learning based data collection protocol that deals effectively with higher data volumes in a real-time vehicular traffic environment through an organized information log (system) with optimal usage of resources (space, time, and energy).

The rest of the paper is organized as follows: “Literature review” covers relevant literature, “Methodology” discusses the Machine Learning based Data Gathering Protocol (ML-TDG) with its counterparts, and “Results and analysis” presents and analyzes the results with suitable illustrations. “Advantages and challenges of proposed methodology” covers the advantages and challenges of the proposed methodology followed by the conclusion in “Conclusion”. The paper provides Future work in “Future work” and then References.

Literature review

Vehicular Ad hoc Networks (VANETs) significantly assist in understanding the in-depth study of vehicular communication [23, 27, 30]. Every vehicle in the vehicular network is bound to exchange data for communication, infotainment, safety, and many other critical factors to keep the traffic flow smooth alongside maintaining roadside infrastructure [31]. A better vehicle-to-infrastructure (V2I) communication brings efficiency to ITS by improving vehicles communication without disruptions and collisions [25].

V2I communication poses multiple challenges in VANETs in terms of frequent disruptions, network intruders, data loss, vehicle collisions, malicious vehicles, and data interruption [32]. Machine Learning (ML) provides potential solutions to these V2I challenges. ML is an artificial intelligence component that gives machines [like Vehicles and Road Side Units (RSUs)] the ability to learn automatically, without being explicitly programmed, to improve operational and functional performance [33]. Many ML-based approaches and protocols have been proposed to resolve data collection and communication challenges in VANETs. A few such approaches are discussed in this section.

A conceptual objective of ML is to let the vehicle learn and improve its operations from previously processed data. One such scheme is the Efficient Clustering Routing approach using a new clustering algorithm based on Density Peaks (ECRDP) [34], which applies Particle Swarm Optimization (PSO) and Density Peaks Clustering (DPC) algorithms to determine Cluster Heads (CHs) for reliable links among connected vehicles. In this scheme, CHs are selected and supported through a systematic maintenance phase that updates and redistributes the vehicles into clusters under updated CHs.

Another ML-based scheme for V2I communication facilitates local communication among multiple vehicles via Software Agents (SAs) [35]. The proposed agent-based model coordinates static and mobile agents through a decision tree and a Q-learning algorithm to identify events such as critical, non-critical, and destination vehicles. Critical and non-critical events are identified by the Event Decision Agent (EDA), a decision tree algorithm that uses vehicle sensor outputs. The Road Side Unit Management Agent (RSUMA) and the RSU Information Agent (RSUIA) are designed with Q-learning for vehicle tracing and neighbor selection.

An ML-based data collection scheme named Authentic Vehicle Node with FOG Computing (AVNFC) [36] works on continuous time-sensitive data exchanges to assist intelligent infrastructure. This protocol stores, communicates, and computes data frames in real time. The scheme uses Lagrange polynomial interpolation for node authentication via a fog-enabled VANET architecture.

The Machine Learning based Misbehavior Detection System (ML-MDS) [37] is another scheme for cognitive software-defined multimedia vehicular networks (CSDMV) in smart cities that works before data communication starts for better misbehavior detection. A Trust Value (TV) is used as the criterion: communication takes place only if the TV of a vehicle is higher than a set threshold. A set of ML-based algorithms, i.e., decision tree, Support Vector Machine (SVM), Neural Network (NN), and Logistic Regression (LR), is applied for behavior detection accuracy.

YOLOv3 (You Only Look Once v3) is a recently proposed method that particularly targets the capability of cross-scale detection and focuses on the valuable area [38]. The proposed method performs multi-scale road object detection via the K-means-GIoU algorithm. This algorithm is designed to generate a priori boxes whose shapes are close to real boxes, followed by training. Evaluation on the KITTI dataset shows that the proposed method maintains a fast detection speed and increases the mAP (mean average precision) value. The YOLOv3 strategy is somewhat complex because of complications that arise in object detection: varying object sizes, removal of background targets, and strengthening the network’s attention. Moreover, YOLOv3 is not storage efficient (the objects are large) and requires a lot of resources (time and energy) to train and process. In contrast, we achieve the time, space, and energy efficiency targets by maintaining a log corresponding to every target’s index, which trains and matures for better performance accuracy every time. In addition, our proposed methodology is implemented in real time in Dublin city, which makes it reliable and suitable for real-time consideration.

A model is proposed for parameter optimization and feature metric-based fault diagnosis that serves as an unknown matching network model to solve issues with data sets in real industrial environments, mainly arising from sparse fault samples and cross-domain data sets [39]. The proposed model operates on a meta-learning network that extracts optimization information for parametric optimization. This methodology provides an effective solution for cross-domain problems of various connected devices that usually occur in response to changes in equipment operating conditions and production requirements.

Compacted Area with Effective Links (CAEL) [40] is another recent protocol designed to decrease overhead while maintaining smooth communication between selected nodes on the basis of geographical location and adequate references to already existing links among vehicles, with the inclusion of a reliability factor. The link expiration time is a key element throughout for achieving real-time communication in terms of removing malicious nodes, selecting trustworthy nodes, and holding suspicious activity. Another recently proposed preassigned performance control scheme [41] ensures that all signals of the closed-loop nonlinear system are practically finite-time bounded and that the tracking error converges to a preassigned area within a finite time. The time-bound factor is addressed in our scheme as well, since the time decreases as the log indexes increase, i.e., as the training matures.

An Iterative Learning Control (ILC) framework [42] for high-performance discrete linear time-invariant (LTI) systems aims to minimize energy while maintaining the required tracking accuracy. The framework is verified on a twin-rotor aerodynamic system (TRAS) model for operations defined within a finite duration. Another important category of ML-based data collection protocols comprises those with Apache Spark integration [43,44,45,46,47,48,49]. In the Apache Spark cluster environment, R, a statistical computing and graphics software available within Spark, gives the user the capability to construct statistical and prediction models from the traffic data. It monitors the data of the subsystem while analyzing the current condition/strategy during the execution of the desired state and strategy [50,51,52,53,54].

Apache Spark responds intelligently to the planned strategy [55]. For example, if the load of any scheme module stays high for an undesirably long time, the subsystem automatically adds one node to the cluster pool to bring the platform and module back to the normal state [56,57,58,59]. It also enables low-latency storage and access, which is very beneficial in the design, planning, and execution of transportation protocols. It is well suited to multiple types of information that need to be processed in multiple logs [60,61,62].

One such example is IDS framework [63,64,65] which is designed to deploy big data engines to provide efficient end-to-end-detection solution to reduce the impact on network performance from the heavier density traffic flow. This machine learning based framework uses random forest model training and runs on Apache Spark for data acquisition, anomaly detection, traffic logging, and data visualization for DoS/DDoS attack [66, 67].

A traffic prediction system using Prophet and Spark Streaming is designed on Apache Spark that provides the features of a big data processing framework for processing huge amounts of data [46]. Spark Streaming is enabling real-time forecasting of the traffic flow, while the Prophet model is capturing long-range temporal sequences of data to predict traffic flow [68]. Integration of Apache Spark in this protocol is facilitating handy features like huge amounts of data processing, precise prediction, and prompt real-time forecasting of the traffic flow and software critical systems simultaneously [69]. Management of the software systems is briefly elaborated in [70, 71].

Another ITS system, designed to predict the total traffic count of streaming data on various routes to reduce traffic congestion, uses the Spark Streaming engine for live processing of data [72]. Apache Spark processes the data and systematically updates the total count of predicted traffic via connected vehicles. Spring Boot is used to display the total traffic count in a dashboard. With timely and prompt display and analysis, the congestion problem is addressed by streaming real-time road traffic data into the Apache framework [73, 74].

Fig. 1 ITS challenges potentially covered by Apache Spark

Detailed features are illustrated in Fig. 1. Figure 1 shows all the challenges and gaps of Intelligent Transport Systems that are now potentially covered by Apache Spark with greater ease. These gaps were once considered a tedious task to accomplish. In our proposed ML-TDG, we have overcome the issues and challenges mentioned in Fig. 1 through Apache Spark. Details are discussed in “Methodology”.

Methodology

According to Transport Infrastructure Ireland (TII) [Reference], Ireland’s busiest road is the M50, a 45.5 km road with eight lanes and 17 junctions that carries an average of 142,496 vehicles a day. Some M50 routes have recorded 51 million journeys per year, with 400,000 unique journeys (commuters) every day [75]. Considering these facts and figures, we have designed an Advanced Machine Learning based Data Gathering (A-TDG) Protocol (a modified version of TDG [26]) that incorporates Apache Spark [76], a data management and analysis engine.

Apache Spark [20] is a machine learning based data processing engine that works with the MLlib library [77] [78] to run and train rich and extensive data models. The data coming from traffic flows are continuously streamed and queued. In VANETs, where data (traffic flows) are subject to unpredictable state changes, it is difficult to analyze traffic in real time. Apache Spark helps by analyzing each recorded vehicle flow and labeling it accordingly. We have created two labels, i.e., New Path Vehicle (NPV) and Old Path Vehicle (OPV). Recorded and collected data flows are maintained against each label through indexed logs. The labeling results are saved via Elasticsearch [23] [79] [80] and can be searched and retrieved from the Kibana dashboard [81] [82] [28] for future reference and analysis.

Figure 2 illustrates the components used for data collection. One of the main components is Apache Spark, a streaming engine that supports Structured Streaming for streaming processes and pipelines. Structured Streaming takes data collection operations written for batch mode with Spark’s structured APIs and runs them in a streaming fashion. This reduces latency and enables incremental processing of data flows in real time. Structured Streaming produces values rapidly in response to a batch or a streaming job. Incremental execution and processing of the real-time data collection through structured features are shown in Fig. 2.

Fig. 2 ML-TDG system model for real-time data collection

Figure 2 shows the model structure considered for real-time data collection. It comprises five inter-connected modular components, including Spark Streaming, a scalable fault-tolerant streaming data processing system. As illustrated, it supports both batch (OPV and NPV) and streaming workloads. Spark processes real-time data from various sources, i.e., RSUs and vehicles. Apache Kafka is a scalable messaging system that works with Spark on data streaming analytics for high-throughput traffic processing. Apache Kafka is deployed to handle the message queues coming from vehicles and RSUs and to transmit the collected data from Spark. In the proposed protocol, streaming (mostly vehicles) and static (mostly RSU) data sources are considered for batches of input data.

The processed data are disseminated to live dashboards via discrete streams or small batches and categorized into the different logs and data labels used in the protocol. Spark Streaming is integrated with MLlib to implement a Machine Learning algorithm using Spark’s micro-batch system, adding two key functional operations, i.e., training the model with real-time data and using the trained model simultaneously. Spark DataFrames are added to satisfy the data indexing and labeling needs of the proposed protocol, which are explained in detail in later sub-sections.
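The micro-batch idea of using a model while it is still being trained can be sketched in plain Python, without Spark. The sketch below is only an illustration under stated assumptions: the record fields `junction` and `speed`, the class `RunningSpeedModel`, and the batch size are hypothetical, and the running mean stands in for whatever MLlib model the protocol actually trains.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield consecutive lists of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


class RunningSpeedModel:
    """Toy incremental model (hypothetical): mean observed speed per junction."""

    def __init__(self):
        self.count = {}
        self.mean = {}

    def predict(self, junction):
        # Returns None until the junction has been observed at least once.
        return self.mean.get(junction)

    def update(self, batch):
        # Incremental mean update, one record at a time.
        for rec in batch:
            j = rec["junction"]
            n = self.count.get(j, 0) + 1
            m = self.mean.get(j, 0.0)
            self.count[j] = n
            self.mean[j] = m + (rec["speed"] - m) / n


def run(records, batch_size=2):
    """Use the current model on each micro-batch, then train on that batch."""
    model = RunningSpeedModel()
    predictions = []
    for batch in micro_batches(records, batch_size):
        predictions.extend(model.predict(r["junction"]) for r in batch)
        model.update(batch)
    return model, predictions
```

Each batch is first scored with the model as it stands and only then used for training, so predictions never see the records they are evaluated on.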

A real-time depiction of the vehicle network of Ireland’s M50, a C-shaped orbital motorway in Dublin and the busiest motorway in Ireland with 17 junctions, is created in the SUMO simulator. The vehicle density is mimicked based on the data available in [83]: 51 million journeys per year with 400,000 unique journeys every day, along with peak-hour (busiest hours) ratios and times as per the statistics given in [84]. These implementations based on actual data demonstrate the feasibility and monitored practicality of ML-TDG in Dublin, Ireland.

Scheme plotting

The scheme plotting is based on three cases considered for systematic analysis of the results of the proposed protocol. OPVs are all the vehicles that pass along the M50 multiple times a week while following an identical path/route. NPVs, on the other hand, are the vehicles that are on the M50 for the very first time. An NPV eventually becomes an OPV after a series of consecutive logs. This feature does not increase the data burden, because the entry is simply shifted from one log to another. Old Path Vehicles are driven by people who follow the same path, e.g., a mother dropping her kids at the same school every day for a couple of years, a man who works in an organization on a 3-year contract, a business person visiting the same site every day, and a student going to university for a degree program. Considering 51 million journeys per year means 1.4 million journeys per day and 400,000 unique (NPVs) journeys per day. Based on these facts, the detailed data flow scenarios/cases considered to justify the ML-TDG functionalities for smart data analysis and management are covered in “Results and analysis”.

Proposed architecture

In this section, we discuss the proposed protocol’s architecture and explain each module. Dublin’s M50 motorway is considered, which is divided into 17 junctions (J1, J2, J3 \(\ldots \) J17) along a 45.5 km road comprising eight lanes (L1, L2, L3 \(\ldots \) L8). The real-time implementation considered for structural illustration is shown in Fig. 5. In Fig. 5, 17 junctions can be seen that are exactly replicated from Dublin’s M50 motorway to implement and test the proposed ML-TDG. Every junction is given the same name, curves, rows, and passage to analyze real-time situations better. Results gathered at these junctions are presented in the results and analysis section.

The protocol functionality is divided into three major modules:

(1) Vehicles logging

(2) ML-TDG data collection

(3) Integration and revised training.

These modules are further divided into various sub-modules discussed below.

Vehicles logging

This module enables real-time traffic management for smart decision-making based on vehicle logging. Vehicles logging means recording daily commuters based on traffic flows and creating logs in two categories: (1) New Path Vehicles (NPV) and (2) Old Path Vehicles (OPV). The NPV log and the OPV log are indexed by storing the input data via Elasticsearch. The traffic logging of NPV and OPV cannot be designed to act in parallel, because the share of NPVs on the M50 is comparatively smaller than that of OPVs. The NPV log becomes active only when a new path vehicle is detected and needs to be recorded.

To establish OPV and NPV logging, an Apache Spark cluster is used in which OPV and NPV are Spark workers and the Spark master is the primary data streaming and processing stream. The Spark framework uses a master–worker architecture that runs across the cluster while processing batches of OPV and NPV in real time. Spark manages the cluster to accomplish the traffic logging task while coordinating batch processes for traffic analysis, as shown in Fig. 3. Figure 3 shows that traffic logging tasks are processed in batches, mainly by the Spark worker for each index, i.e., OPV and NPV, which is synced with the Spark master. The Spark master processes old commuters (OPV) and simultaneously moves repeated NPVs (new commuters) into OPV to maintain an updated log. This feature is also useful in avoiding duplication of data, e.g., the same data coming from OPV and NPV.

The open-source Elasticsearch is applied for data storing, indexing, and visualization within Cluster. Results within the cluster are considered databases for indexing and storing the input data by Elasticsearch. The data collected through traffic are iterated via Kafka and then indexed in Elasticsearch for log management. The traffic logging/indexing is performed in parallel to speed up the whole process in real time.

Fig. 3 Traffic logging task for batches execution and indexing

  • Data Labels Exchanges: Data Labels Exchanges (DLE) perform continuous bridging between the NPV and OPV logs. A dynamic log shift is created to handle the switching of an NPV to an OPV. All newly detected vehicles that have joined around any junction are shifted to the OPV log after being recorded multiple times, based on the consistency of their commutes. These data labels help generate flux for the OPV log to update and react accordingly. The significance of the data labels exchange is discussed further in the results and analysis section; in short, DLE brings stability while growing the OPV logs, enabling better, longer, and more stable clustering.

  • Smart Data Management: This factor provides the expected traffic clearance time at each junction, taking traffic flow rates as inputs. This step also reduces the chances of congestion and of possible clustering mishaps.

  • Data Visualization and Analysis: To keep the data logs, data labels exchanges, and smart data management accountable and clearly visible for analysis of their designated features, Kibana is integrated as the endpoint of the system for data visuals. Kibana queries the database (Elasticsearch), retrieves results with the help of the indexing (assigned while maintaining the logs), and then displays updated metrics accordingly.
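The log shifting described for Data Labels Exchanges can be sketched as follows. This is a minimal in-memory illustration, not the protocol's Spark and Elasticsearch implementation: the class `VehicleLogs`, the threshold `promote_after`, and the representation of a route as a tuple of junction names are assumptions made for the example, since the text only says that an NPV is shifted after a series of consistent commutes.

```python
from dataclasses import dataclass, field


@dataclass
class VehicleLogs:
    """In-memory stand-in for the indexed NPV and OPV logs (hypothetical)."""

    promote_after: int = 3  # assumed number of consecutive identical commutes
    opv: dict = field(default_factory=dict)  # vehicle_id -> route
    npv: dict = field(default_factory=dict)  # vehicle_id -> (route, count)

    def record(self, vehicle_id, route):
        """Record one commute and return the label held afterwards."""
        if self.opv.get(vehicle_id) == route:
            return "OPV"
        # Assumption: a known OPV on a different route is treated as
        # a new-path vehicle again.
        self.opv.pop(vehicle_id, None)
        prev_route, seen = self.npv.get(vehicle_id, (route, 0))
        seen = seen + 1 if prev_route == route else 1
        if seen >= self.promote_after:
            # Shift the entry from the NPV log to the OPV log, so the
            # vehicle is never stored in both logs at once.
            self.npv.pop(vehicle_id, None)
            self.opv[vehicle_id] = route
            return "OPV"
        self.npv[vehicle_id] = (route, seen)
        return "NPV"
```

Because promotion moves the entry rather than copying it, the total number of stored entries does not grow when a label changes, which matches the remark that the shift adds no data burden.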

Algorithm 1: Vehicles logging/data labeling/data log/data flow

Algorithm 2: Dynamic segmentation switching and aggregation

ML-TDG data collection

The second module implements the Smart Machine Learning-based Traffic Data Gathering protocol (ML-TDG) for data collection. This protocol proposes dynamic segmentation switching for smart handling of communication limitations, e.g., bottlenecks, flooding, spamming, and jamming. ML-TDG is a lightweight data collection module that is dynamically designed for collecting and forwarding data packets based on current and rapidly evolving traffic conditions via vehicle log classifications. This module operates through the stages described below.

  • Dynamic Segmentation Switching (DSS): Since we consider Ireland’s M50, our model is based on multi-directional lanes (Lm) with a length (Ln) of 45.5 km. As the road is considerably long, with various turns and multiple directions, we consider its division into 17 junctions (J1, J2, J3 \(\ldots \) J17) to test the implementation of the proposed protocol. Each junction is considered and tested, and the reported results are the average of the results gathered from all junctions simultaneously. Considering that there are higher chances of multiple area occurrences in the vehicular network, we take into account the density of the network based on directions, vehicle density, and collection area. The two directions correspond to the M50’s 8 lanes, with 4 lanes in each direction, abbreviated as Da1, Da2, Da3, Da4, Db1, Db2, Db3, and Db4. Density stands for variable vehicle ranges, as approximately stated by the RSA, including both NPV and OPV. The number of deployed BSs, N(BS), is calculated as follows:

    $$\begin{aligned} N(BS)= ln/x \end{aligned}$$
    (1)

    where ln is the M50 length and x is the coverage in km of one BS (the range covered by a BS). N(BS) is the number of base stations deployed in the respective collection regions. Every 4 lanes (a single direction) are taken as one road for the distribution of the respective segments. Each road is divided into two kinds of virtual segments, tagged as Collector Segments (CS) \(\in \) CS1, CS2, CS3 \(\ldots \) CSn, and Silent Segments (SS) \(\in \) SS1, SS2, SS3 \(\ldots \) SSn. CSs perform data collection and communicate with the BS, while SSs are no-communication zones. To divide the road into virtual segments of equal length, the number of segments (VLs) is calculated as follows:

    $$\begin{aligned} VLs= (lm)/CR \end{aligned}$$
    (2)

    where lm is the total distance of the collection area and CR is the Communication Range of vehicles on the virtually created segments. The integration of dynamic segmentation switching is significant in reducing bottlenecks, network congestion, message spam, and collisions. Conventional segmentation of roads remains static, whereas DSS allows each segment to switch dynamically, taking control from a Collector Segment and assigning it to a Silent Segment. This switching is time-driven: virtual segments are allocated a time \(\Delta \)t after which they switch, allowing the maximum number of vehicles to become part of the data collection process. The time factor makes the segments switch alternately, turn by turn, i.e., CS is converted into SS and SS into CS. Vehicles Speed (VS) with respect to time can be calculated as follows:

    $$\begin{aligned} V_{s} = (CS_{l}/AVS) \times \Delta t \end{aligned}$$
    (3)

    CSl is the collection segment length, AVS is the average vehicle speed on the road, and \(\Delta \)t is the time limit for data collection. The switching of CS and SS segments enables the detection of any blockage present, which makes DSS a favorable choice for identifying and highlighting possible areas of accidents and blockage. Figure 4 illustrates DSS.

  • Real-time cluster head election (R-CHE) Clustering is considered an effective mechanism for managing inter-vehicular communication among a set of vehicles moving through a targeted region. Sustainable and sizeable clusters yield better, more energy-efficient results. The previous version of ML-TDG (named TDG [26]) based clustering on beacon messages sent periodically by the BS to all vehicles in range. On receiving neighborhood information, vehicles inform the BS about their neighbor count, current position, and direction of movement. The BS declares the vehicle with the largest neighbor count as CH because it is best positioned among surrounding vehicles. This CH election and clustering mechanism was effective in covering the largest possible number of vehicles. However, a large number of messages must be exchanged before a cluster can start data collection and communication. In addition, this type of election needs a self-induced clustering mechanism in which a vehicle that receives no beacon invitation declares itself CH to start a collection. This requires more energy, more data capacity, more message exchanges, and ultimately more time. ML-TDG accomplishes R-CHE by distinguishing between OPVs and NPVs and by the junctions in a particular area. Consider the scenario of J1, in which vehicles can only enter the M50 through the R131, joining via two conjoined roads, with no possibility of exit and no further roads joining. In J1, vehicles in Da1, Da2, Da3, and Da4 move in one direction and vehicles in Db1, Db2, Db3, and Db4 move in the other. The built-in log labeling and indexing described in Module 1 has already distinguished Y OPVs and Z NPVs. The OPVs form clusters based on N possible ranges, while NPVs act as source push: they push the required data to the nearest OPV, which removes the dependence on an RS and its availability and thus makes the protocol infrastructure-independent. Logging and indexing make clustering more stable while eliminating the need for beacon messages and self-induced clustering, and they avoid neglected vehicles and replicated data from the same vehicle at a given time. All vehicles V1, V2, V3 \(\ldots \) Vn that are given an index Ixn in the database of the respective OPV or NPV, with predetermined/pre-recorded indexes based on the junction they pass through, are declared a cluster of range CR, where CR is the communication range. There are 17 junctions (J1, J2, J3 \(\ldots \) J17), which are replicated in SUMO, as shown in Fig. 5.

  • Real-Time Data Aggregation Every data packet entry corresponding to a given index is identified by its data packet number. Data packets are concatenated via the maintained logs and then sent to the BS. The data log resets the array and then starts concatenating data for aggregation again. This reset prevents an empty array from being aggregated. If data are already present, incoming data are concatenated with the previous data. Resetting the array also prevents redundant or expired data from being concatenated with newly arrived data, which makes data delivery more efficient. Every aggregated data array contains commas (,) and semi-colons (;). Both are therefore used as delimiters, breaking the data at every point where a comma or a semi-colon occurs, until every parameter is separated and yields a distinct value. Delimiters are symbols applied to the collected data to separate the values to be retrieved. The selected symbols are already part of the data that reaches the BS.
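To make the R-CHE item in the list above more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes vehicles are described by a one-dimensional position along the road, that OPVs are grouped greedily into clusters no longer than the communication range CR, and that each NPV pushes to the OPV closest to it. The names `Vehicle`, `cluster_opvs`, and `nearest_opv` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vehicle:
    index: str          # log index of the vehicle (hypothetical format)
    position_km: float  # assumed 1-D position along the road
    is_opv: bool        # True for an Old Path Vehicle, False for an NPV


def cluster_opvs(vehicles, comm_range_km):
    """Greedily group OPVs so each cluster spans at most comm_range_km."""
    opvs = sorted((v for v in vehicles if v.is_opv), key=lambda v: v.position_km)
    clusters = []
    for v in opvs:
        # Extend the current cluster while v stays within CR of its first member.
        if clusters and v.position_km - clusters[-1][0].position_km <= comm_range_km:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def nearest_opv(npv, vehicles):
    """Return the OPV closest to the given NPV, or None if no OPV exists."""
    opvs = [v for v in vehicles if v.is_opv]
    if not opvs:
        return None
    return min(opvs, key=lambda v: abs(v.position_km - npv.position_km))
```

Because cluster membership here depends only on already logged indexes and positions, no beacon exchange is modelled, which is the property the text attributes to R-CHE.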

Algorithm 3: Data transmission and aggregation
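The reset-and-concatenate behaviour and the delimiter-based extraction described for real-time data aggregation can be sketched as follows. This is an illustration only: the record layout (vehicle id, position, speed), the use of a comma between fields and a semi-colon between records, and the names `AggregationLog` and `parse_payload` are assumptions, not details given in the paper.

```python
from typing import Optional


class AggregationLog:
    """Concatenates packet payloads until they are flushed to the BS."""

    def __init__(self) -> None:
        self._records = []

    def add(self, vehicle_id: str, position: float, speed: float) -> None:
        # Fields inside one record are comma-separated (assumed layout).
        self._records.append(f"{vehicle_id},{position},{speed}")

    def flush(self) -> Optional[str]:
        """Return the aggregated payload and reset the array.

        Returns None when nothing was collected, so an empty array is
        never aggregated and sent.
        """
        if not self._records:
            return None
        payload = ";".join(self._records)  # records are semi-colon-separated
        self._records = []  # reset so expired data never mixes with new data
        return payload


def parse_payload(payload: str) -> list:
    """Split a payload at the BS: ';' between records, ',' between fields."""
    rows = []
    for record in payload.split(";"):
        vehicle_id, position, speed = record.split(",")
        rows.append({"id": vehicle_id, "position": float(position), "speed": float(speed)})
    return rows
```

For example, adding two records and calling `flush()` yields one string that `parse_payload` turns back into two dictionaries, and a second `flush()` returns `None` because the array was reset.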

Fig. 4 Dynamic segmentation switching
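As a rough illustration of Eqs. (1) and (2) and of the time-driven CS/SS alternation in DSS, consider the sketch below. The BS range and communication range used in the usage note are hypothetical values chosen only for the arithmetic, rounding up to whole stations and segments is an assumption (the equations give a plain ratio), and the function names are not from the paper. Eq. (3) is not covered.

```python
import math

M50_LENGTH_KM = 45.5  # road length stated in the text


def base_station_count(road_length_km: float, bs_range_km: float) -> int:
    """Eq. (1): N(BS) = ln / x, rounded up so the whole road is covered."""
    return math.ceil(road_length_km / bs_range_km)


def virtual_segment_count(collection_length_km: float, comm_range_km: float) -> int:
    """Eq. (2): VLs = lm / CR, rounded up to a whole number of segments."""
    return math.ceil(collection_length_km / comm_range_km)


def segment_roles(n_segments: int, elapsed_s: float, delta_t_s: float) -> list:
    """Alternate CS and SS along the road and swap roles every delta_t seconds."""
    phase = int(elapsed_s // delta_t_s) % 2
    return ["CS" if (i + phase) % 2 == 0 else "SS" for i in range(n_segments)]
```

With an assumed BS range of 5 km, `base_station_count(M50_LENGTH_KM, 5.0)` gives 10 stations, and with an assumed communication range of 0.5 km, `virtual_segment_count(M50_LENGTH_KM, 0.5)` gives 91 segments. Calling `segment_roles(4, 0, 30)` and then `segment_roles(4, 30, 30)` shows every collector segment becoming silent and vice versa after one switching interval.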

Integration and revised training

The ML model is trained during log exchanges as vehicles shift from NPV to OPV. The training data sets (i.e., the OPV data set and the NPV data set) are built from the flow of records collected in real time. The trained sets are labeled and used to analyze packets sent and packets received. Because the captured real-case data are stored in Elasticsearch, the model can be re-trained on the logs.

Results and analysis

This section covers the simulation environment and a detailed analysis of the results to evaluate the performance of the ML-TDG protocol. The base protocol TDG [26] is also implemented in NS-2.35 on Ubuntu 16.04, along with the newly proposed ML-TDG. Vehicles are deployed using Tool Command Language (TCL) scripts covering road-specific scenarios. For TDG, message initiation, node configuration, and the CH and sink nodes are also handled through TCL. ML-TDG additionally uses Python to implement the machine learning modules, whereas the modules carried over from TDG are written in C.

Fig. 5 ML-TDG system architecture illustration

The data packet exchanged and communicated for data collection includes position, velocity, sequence number, identity, source, and destination. Sending and receiving data functions are performed separately, i.e., CH manages the selection of recipient vehicles and aggregation of data is performed at the sink node. AWK scripts are used to extract the end-to-end delay, Packet Delivery Ratio (PDR), efficiency, and effectiveness from the generated trace files. Simulation parameters are given in Table 1.

Key Performance Indicators (KPIs): These indicators serve as the benchmarks by which the network performance of ML-TDG is assessed. They include Efficiency, Effectiveness, cumulative Average Efficiency and Average Effectiveness, and Cluster Stability. Each is explained and discussed below.

Efficiency: Efficiency is calculated in AWK scripts by the formula

$$\begin{aligned} \text {Efficiency} = \frac{\text {Total}\,V (n)}{N(\text {Vehicles})},\end{aligned}$$
(4)

where V(n) is the amount of data sent by vehicles and received by the BS, and N is the total number of vehicles participating in data sharing.

Effectiveness: Effectiveness is calculated from trace files by the formula

$$\begin{aligned} \text {Effectiveness} = \frac{\text {Total}\,V (n)}{S_n(\text {Vehicles})}\end{aligned}$$
(5)

where V(n) is the number of vehicles whose data are delivered to the BS and S(n) is the number of vehicles whose data should be delivered to the BS. Efficiency in ML-TDG describes the protocol’s ability to measure the information exchanged within a given time. It is also referred to as communication efficiency, i.e., less communication reported at a junction relative to the real-time vehicle ratio indicates less efficient communication. Effectiveness, by contrast, indicates the ability of the protocol to accomplish the designated data communication while completing all designed activities at the right time, cost, and speed in the least expensive way.

Efficiency and effectiveness are calculated over the 17 junctions (Jn), and their averages are expressed as percentages using the formulas below

$$\begin{aligned}{} & {} \text {Efficiency}{_{(\text {avg})}} = \frac{\text {Total}\,V \,|\, \Sigma \,{_{jn1 \ldots jn17}}\,| {V_{(n)}}}{\Sigma \,{J_{n}}| \,{N_{\text {Total}\,\text {Vehicles}}}} \end{aligned}$$
(6)
$$\begin{aligned}{} & {} \text {Effectiveness}_{(\text {avg})} = \frac{\text {Total}\,V \,|\,\Sigma \,_{jn1 \ldots jn17}\,| V_{(n)}}{\Sigma \,S_{n\text {Vehicles}}}\end{aligned}$$
(7)

where avg indicates average, V is the total number of vehicles, \(\Sigma \) is the sum over the vehicles within the junctions, V(n) is the number of vehicles whose data are delivered to the BS, S(n) is the number of vehicles whose data should be delivered to the BS, and Jn indicates a junction.
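The ratios in Formulas (4) to (7) can be sketched as follows. This is an illustrative reading rather than the AWK scripts used in the paper: it assumes that the averaged forms pool the counts of all junctions before dividing, and the class and function names are hypothetical. The counts in the usage note are made-up inputs, not results from the simulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JunctionCounts:
    delivered: int      # V(n): deliveries that reached the BS
    participating: int  # N: vehicles taking part in data sharing
    expected: int       # S(n): vehicles whose data should reach the BS


def efficiency(c: JunctionCounts) -> float:
    """Formula (4): delivered data over participating vehicles."""
    return c.delivered / c.participating


def effectiveness(c: JunctionCounts) -> float:
    """Formula (5): delivered vehicles over expected vehicles."""
    return c.delivered / c.expected


def pooled_efficiency(junctions) -> float:
    """Assumed reading of Formula (6): sum over junctions, then divide."""
    return sum(c.delivered for c in junctions) / sum(c.participating for c in junctions)


def pooled_effectiveness(junctions) -> float:
    """Assumed reading of Formula (7): sum over junctions, then divide."""
    return sum(c.delivered for c in junctions) / sum(c.expected for c in junctions)
```

For a junction with 70 deliveries, 100 participating vehicles, and 80 expected vehicles, `efficiency` returns 0.7 and `effectiveness` returns 0.875. Pooling the counts weights busy junctions more heavily than a plain mean of per-junction percentages would, so the choice between the two should follow whichever definition the evaluation actually intends.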

Table 1 Parameters used in ML-TDG

Based on Formulas (4), (5), (6), and (7), efficiency and effectiveness are calculated and shown in Fig. 6, where the trend lines (orange for effectiveness and blue for efficiency) vary across junctions. Notably, Jn3 and Jn4 recorded close to 70% and 61%, respectively. A possible reason for this slightly lower efficiency is that Jn3 and Jn4 have the largest number of diversions and bi-directional entries and exits, which may affect efficiency (refer to Fig. 5). Jn5 to Jn9 are recorded at roughly 65% to 69%. This is slightly lower than the expected efficiency, again because multiple directions, lanes, entries, and exits affect efficiency. The remaining junctions, i.e., Jn10–Jn17, range from 88% to 89%, which indicates credibility and success in achieving higher efficiency. Out of all 17 junctions, 8 junctions have an efficiency above 88%, 3 junctions are above 70%, and 3 junctions are above 68%, while only 2 junctions (Jn4 and Jn5) are at around 61% and 65%.

The effectiveness of ML-TDG shows the reliability of the protocol: all junctions attain percentages ranging from 79% to 94%, most junctions show effectiveness above 92%, and only 1 junction (Jn3) is at 79%. All other junctions are above 82%. Jn3 is notable for its efficiency of 71% and effectiveness of 79%. This is the lowest in both parameters, because Jn3 is architecturally different from all other junctions. Traffic reaches it differently, mostly asynchronously. Jn3 has more than 8 entries and exits with 2 roundabouts, unlike any other junction. The complex nature of the junction affects efficiency and effectiveness, making it the least performing junction. However, it still operates within acceptable ratios.

In Fig. 7, the efficiency and effectiveness of TDG are given for comparison with ML-TDG. TDG, which lacks the machine learning components, scores on average 5 to 10% lower than ML-TDG. One of the primary factors behind the slightly lower efficiency and effectiveness is the larger set of vehicles passing the motorway. TDG was designed and implemented for real-time environments without training data sets. TDG was also infrastructure-independent, which adds a time factor to producing better efficiency. Large traffic flows can cause bottlenecks and larger overheads in TDG, whereas ML-TDG takes full advantage of larger volumes to train and then executes data collection.

Fig. 6 Efficiency and effectiveness of ML-TDG

Fig. 7 Efficiency and effectiveness of TDG

Cluster Stability: Cluster stability measures how far a clustering remains unaltered under changes reported by the vehicles or the network. Under real-time weighted traffic conditions, vehicles may be added to and removed from a cluster frequently. ML-TDG is designed to achieve better cluster stability. For example, Fig. 8 shows cluster stability over a period of 24 h at intervals of 3 h, based on Maximum Clustering (MT) percentages from the start to the end of a junction, giving an average response throughout the day. From the clustering frequencies considered, we conclude that a high frequency of vehicles yields better and longer-lasting clusters while facilitating efficient data collection. During the first 3 h, from 12 am to 3 am, the sparse and less frequent flow of cars gives lower cluster stability. As time moves into the early morning and vehicular flow increases, ML-TDG produces larger and longer-lived clusters because of early commuters. Soon after the early morning peak hours, clusters become smaller and shorter-lived. The trend rises again from 12 pm to 3 pm, keeps rising with time, and starts dropping around and after 10 pm. The machine learning based protocol benefits from higher vehicular flows, i.e., the more vehicles, the better the cluster size, strength, and lifetime.

In Fig. 9, the cluster stability of TDG is shown, which indicates smaller yet more numerous clusters working closely together based on traffic conditions. The primary reason for the smaller clusters is the TDG CH election criterion, under which the vehicle with the most neighboring vehicles declares itself CH. The range of neighboring vehicles is therefore limited, which results in smaller clusters. This leads to more data frame exchanges and higher throughput in comparison with ML-TDG. In contrast, ML-TDG uses logs and indexes and therefore forms bigger clusters for data collection.

Fig. 8 Cluster stability of ML-TDG

Fig. 9 Cluster stability of TDG

Rate of vehicle logs shifting from NPV to OPV: NPV and OPV are allocated randomly based on the daily commutes given in [85], with some of the statistics obtained from [86]. The final distribution is given in Fig. 14, which shows, for Jn1–Jn17, the rate of change from New Path Vehicle (NPV) to Old Path Vehicle (OPV) as percentages. At Jn1, the share of OPV (89%) is greater than that of NPV (11%), indicating a low conversion rate in terms of adding new commuters. Jn2 has 25% NPV and Jn3 has 31%. NPV ratios across the junctions are much lower than OPV ratios. It is common on any motorway for old commuters to account for a larger proportion than new commuters. As per the statistics in [85, 86], daily traffic consists largely of people who use the motorway often rather than new commuters who join once in a while. Jn17 has 47% new commuters, the highest share of any junction, ahead of Jn11 and Jn12 with 44% and 37%.

However, Jn13 holds 93% OPV, ahead of Jn14 with 89% and Jn6 with 83%. A critical aspect of these ratios is the inter-conversion factor: ML-TDG is designed to convert every NPV into an OPV once the vehicle has been reported a couple of times. As the OPV data set grows, ML-TDG functions better and more time-effectively. Lower NPV ratios require less time for the execution of the ML-TDG modular components. When an NPV is spotted, it goes through one additional step in which the NPV index is highlighted and counted, and a new index is then added under OPV so that the vehicle operates from there, avoiding repeated identification that might cause delays. This makes ML-TDG responsive and avoids repetition.
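The NPV-to-OPV conversion described above can be sketched as a small log that counts sightings and promotes a vehicle once it has been reported often enough. The threshold of two reports is an assumption standing in for "a couple of times", and the class name `VehicleLog` and its return labels are hypothetical.

```python
class VehicleLog:
    """Tracks NPV sightings and promotes vehicles to the OPV index."""

    def __init__(self, promote_after: int = 2) -> None:
        self.promote_after = promote_after  # assumed promotion threshold
        self.npv_sightings = {}             # vehicle id -> number of reports
        self.opv_index = set()              # ids already indexed as OPV

    def report(self, vehicle_id: str) -> str:
        """Record one sighting and return the vehicle's resulting status."""
        if vehicle_id in self.opv_index:
            # Already indexed: no extra identification step is needed.
            return "OPV"
        count = self.npv_sightings.get(vehicle_id, 0) + 1
        if count >= self.promote_after:
            # Extra step for NPVs: move the index under OPV.
            self.npv_sightings.pop(vehicle_id, None)
            self.opv_index.add(vehicle_id)
            return "PROMOTED"
        self.npv_sightings[vehicle_id] = count
        return "NPV"
```

Reporting the same vehicle three times returns `"NPV"`, `"PROMOTED"`, and then `"OPV"`, so the additional indexing step is paid once per vehicle and later sightings take the shorter OPV path.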

Based on the ML-TDG operations, four cases are considered below to highlight what makes the proposed protocol favorable in complex scenarios. To extensively test the large real-time vehicular projections of Dublin’s M50 motorway, and to optimize the time needed to test the model over all the junctions considered, we use the four cases shown and discussed below:

Fig. 10 Case I: when half of the commutes are NPV and other half are OPV

Case I: when half of the commutes are NPV and the other half are OPV: The data collection success ratios for this case are given in Fig. 10. Because the indexes of assigned OPV and NPV captured from real-time data are stored in Elasticsearch, the ML-TDG model can be re-trained periodically to run the system within the network environment. When the system is given an equal proportion of NPV and OPV, i.e., NPV = OPV, the success ratio across all 17 junctions lies between 70% and 80%, which we consider a strong result. The lowest scoring junction (Jn5) is at 70% and the highest scoring (Jn4) is at 79.72%. When already trained OPV indexes collect data alongside NPV, they offset the indexing time that NPV execution requires. NPV becoming part of OPV is, however, a continuous process: every new commuter eventually becomes part of OPV, which raises the data collection ratios by increasing the size of the OPV data, a benefit for machine learning based algorithms.

Case II: When the majority of commutes are OPV and lesser NPV: Data collection ratios are better in this case, in which the majority of vehicles are deployed as OPV (already discovered, indexed, and trained) and NPV make up a smaller share. The ratio is set at 80% OPV and 20% NPV for this case. The junctions perform very well because only a small share of NPV forces OPV to adjust to new indexing for training, so OPV can operate at its maximum capacity. All junctions perform within a range of 80% to 89%, with Jn5 at 80.41% and Jn4 at 89%. The details are given in Fig. 11. This case is straightforward and fairly predictable because an already trained vehicular model has a high success ratio. However, frequent indexing and training sessions (executing modules) with rapidly changing logs may affect the accuracy of the model. The high performance reported in this case does not take the interchangeable log and index formats into account.

Case III: When the majority of commutes are NPV and lesser OPV: A higher ratio of new commuters does not greatly affect the data collection ratio in comparison with a higher OPV ratio. What enables higher accuracy during data collection under higher NPV is that the NPV log in our model becomes functional only when a new path vehicle is detected and needs to be recorded. However, the traffic logging of NPV and OPV cannot be designed to act in parallel, because NPV ratios on the M50 are comparatively lower than OPV ratios. In this case, which is deliberately designed for higher NPV ratios, NPV logging stays active to record the series of consecutive logs that can eventually become OPV. In the end, the majority of NPV are added to the OPV logs and data collection resumes normally, normalizing the efficiency and time factor. With this approach, this case gives considerably good percentages ranging from 70% to 74%. In the previous cases, a few junctions were close to or above 80%. The slight decrease reflects the time taken to convert logs and assign indexes. The primary outcome is still appreciable given that it is not below 70%. Details of the stats are given in Fig. 12.

Fig. 11 Case II: When the majority of commutes are OPV and lesser NPV

Fig. 12 Case III: When the majority of commutes are NPV and lesser OPV

Case IV: Real case with 1.4 million OPV and 400,000 NPV: This case measures the strength of our model in an actual, real-time scenario to establish its suitability. Data collection success falls within a span of 70% to 77%. Jn9 at 75%, Jn12 at 76%, Jn13 at 77%, Jn15 at 74%, Jn16 at 73%, and Jn17 at 72% demonstrate uninterrupted data collection. Jn14 at 71%, and Jn7 and Jn1 at 70%, reflect the architectural complexity of junctions that hinders the execution of components and causes gaps of a few percentage points. The detailed stats are given in Fig. 13. In Fig. 13, most ranges are reported with a marginal difference of 25%, which indicates a steady and progressive flow of events (Fig. 14).

Fig. 13 Case IV: Real Case with 1.4 million OPV and 400,000 NPV

Fig. 14 Rate of change of NPV to OPV

Advantages and challenges of proposed methodology

As per the above-mentioned results and analysis based on the proposed methodology, there are various significant advantages of ML-TDG.

  • ML-TDG can save a considerable amount of time that is taken by excessive data exchanges for clustering and identification of vehicles for large-scale data collection. Reduced data communication cost and overhead based on real-time learning of rapidly evolving traffic conditions is the primary advantage of the proposed scheme.

  • Second, issues corresponding to delays, larger throughputs, and excessive bottlenecks can be smartly handled through the proposed algorithms based on the integration of dynamic segmentation switching and smart vehicular information processing and management.

  • Frequent repetition of data and energy taken for identification of a vehicle multiple times at multiple locations can also be overcome through the proposed methodology.

  • Energy and time optimization for data communication is one of the integral elements achieved through the implemented scheme.

  • Resource requirements are lower, as the proposed methodology shows better cluster lifetime and ratio along with better optimization of results over time. ML-TDG keeps maturing with every log indexing, which reduces bottlenecks and data replication.

However, alongside the above-mentioned advantages, the proposed algorithm may face problems arising from environmental, physical, mechanical, and electrical constraints that are unavoidable in real-time environments. For example, real-time traffic is affected by ambulances, emergency vehicles, and accidental blockages that are neither predictable nor avoidable. Situations such as path coverage for ambulances and emergency vehicles can affect the real-time live streaming of the traffic data feed and pose a challenge to network data training, structured data processing, and real-time displays.

Conclusion

In this paper, we propose a lightweight machine learning-based data collection protocol to deal with higher data volumes, and we implement and test it on the real-time vehicular traffic environment of Dublin’s M50 motorway. The protocol operates on a known architecture or pre-fed map. The methodology is aligned with the Apache Spark data processing engine, which is coordinated with a smart algorithm known as Dynamic Segmentation Switching (DSS). DSS divides the road into segments and switches them intelligently to reduce bottlenecks and replication, resulting in energy and time optimization. Further, a data aggregation scheme is implemented for data analysis to fetch information based on location, speed, vehicle id, and neighbor count. A data extraction scheme is also integrated to increase the effectiveness of data retrieval and data utilization at the base station. The protocol induces learning within vehicles so that they operate on the basis of previously processed data to gain maximum benefit, i.e., improved effectiveness and efficiency, lower communication delay and overhead, fewer bottlenecks, and a better data transmission rate.

The method proposed in this research proves useful for dealing with huge traffic flows, bulk data processing, and analysis with real-time communication. As traffic flow is collected from a vehicular network topology based on reality, the potential impact of the system on network performance is studied by comparing the network throughput. The results verify that the system is a lightweight solution that adds little burden to the network. Apache Spark is used for timely and prompt display analysis, and the congestion problem is addressed with real-time road traffic data streaming. Ireland’s most famous, busiest, essential, and extremely congested motorway is used to test and verify our proposed protocol.

The proposed protocol is rigorously tested on various cases for systematic analysis of results, i.e., the rate of vehicle logs shifting from NPV to OPV; half of the commutes NPV and the other half OPV; a majority of OPV with fewer NPV and vice versa; and the real case with 1.4 million OPV and 400,000 NPV. The results show that ML-TDG is potentially useful for higher data volumes and massive traffic flows. Because the vehicular network is trained on evolving real-time traffic conditions based on daily motorway statistics, the protocol is applicable to larger-scale data without increasing the mean execution time.

Future work

The protocol proposed in this research is proven to be useful. However, some problems and uncertainties still need further investigation and are part of future work. The first limitation of the proposed protocol is that it requires the infrastructure of the highways/roads to be known in order to implement the functional design. This is considered a minor limitation, as most of the world’s road infrastructure is predetermined and amendments can be made following design and structural changes. ML-TDG can collect data based on real-time vehicle interactions; however, adapting to infrastructure based on real-time factors is not possible. We are working on this factor to make the protocol infrastructure-friendly or infrastructure-independent in the near future.

Second, the efficiency of results increases gradually as the databases grow, train, and mature. This delay can be eliminated, and early efficiency achieved, by feeding the network with preprocessed data taken directly from the Road and Safety Department Ireland [87], which may require formal permission to use the already recorded departmental data. This would also improve energy and time parameters because little-to-no real-time log recording and maintenance would be needed. With pre-recorded data, OPV would not need to be recorded, while NPV handling can remain intact for any new vehicle.

Third, in the near future, ML-TDG will be modified to use different machine learning models to test whether another ML framework can improve time, storage, energy, and communication efficiency, possibly with security factors incorporated as well. Various statistical and mathematical functions are also under consideration to check feasibility and improve results.