1 Introduction

Autonomous Vehicular Systems (AVSs) have recently seen rapid growth in many respects, driven by the development of smart cities and of the Intelligent Transport Systems (ITSs) that serve them. One example is the widespread use of embedded systems and wireless communication (e.g., 4G LTE and 5G) in the modern Internet of Vehicles, which ultimately improves user safety and comfort. However, the growing interest in Connected Autonomous Vehicles (CAVs) and ITSs has introduced new security challenges and vulnerabilities in AVSs that have a great impact on the smart environments of smart cities. Moreover, classical computer security solutions are not applicable under automotive industry standards for in-vehicle, vehicle-to-vehicle (V2V), and vehicle-to-everything (V2X) communications, mainly because of real-time performance requirements, constrained computational resources, and differences among heterogeneous networks and their configurations [1].

Various recent reports describe cases in which cybercriminals have successfully demonstrated practical remote attacks on key functions of automotive vehicles (as depicted in Fig. 1), through either V2V or V2X communication, including disconnecting the engine and the brakes [2,3,4,5].

CryptoLocker, WannaCry, and Petya are among the most prominent and widely used attacks against sensitive IT systems [6]. In the past, ransomware attacks affected entities such as personal computers, public and private organizations, the health sector, mobile phones, and similar devices. However, the focus of ransomware attacks has now shifted towards smart vehicles and smart cities, posing a significant threat to both human lives and financial stability [7, 8]. Moreover, researchers have shown that malware is one of the key emerging security threats that can be launched by exploiting the wireless communication system of AVSs [9, 10], for instance by exploiting known vulnerabilities in the design and implementation of onboard communication systems, embedded software, and application software [11,12,13], as depicted in Table 1. Reports in [2, 14, 15] have shown, by hijacking the steering and brakes of a Ford Escape and a Toyota Prius, that an AVS is not just a simple machine. It is therefore critical to understand that AVSs are now networks of computers that can be hacked using classical cyber threat mechanisms. For instance, in 2015 approximately 1.5 million vehicles were recalled by Fiat Chrysler, mainly because cybercriminals could remotely take control of a Jeep’s digital system over the Internet [3]. In another report [4], a team of cybercriminals remotely hijacked a Tesla Model S from a distance of approximately a dozen miles. In a more recent study [5], the authors identified 14 vulnerabilities in the infotainment system of several BMW series. Moreover, a Tesla Model S and a Tesla Model X were targeted by cybercriminals in November 2019 via the Wi-Fi attack vector [6].
All of the above-mentioned incidents show that security is integral to the core functions of AVSs and to making smart transportation secure. It must therefore be addressed so that vehicles are protected and can operate safely.

Fig. 1 Typical V2V, V2X cyber threat scenario in smart autonomous vehicles

Table 1 Various attacks to CAVs

The key to the success of the aforementioned remote attacks on AVSs is information sharing by vehicles over a wireless medium, which increases their susceptibility to various security and malware attacks. Consequently, securing data exchange (including input and output data) and protecting the Electronic Control Units (ECUs) inside AVSs are among the most significant security issues for intelligent vehicles [10, 19]. Specifically, the most damaging cyber threats are emerging as vehicles connect to the Internet, provide onboard Wi-Fi hotspot services, communicate with other vehicles and ITS infrastructure, and support advanced applications such as over-the-air (OTA) ECU firmware updates [9]. As discussed above, many modern attacks do not require physical access to a vehicle; instead, they can be carried out remotely over wireless links by exploiting communication vulnerabilities among vehicles and other connected network services. This allows attackers to compromise more vehicles with relative ease, and a compromised vehicle can later be used to attack other vehicles.

Considering the performance requirements of AVSs, it is important to detect malware in real time to prevent physical and financial damage and loss of human lives [20, 21]. Current approaches to detecting such malware employ either static analysis or dynamic analysis techniques [22, 23]. Static analysis techniques include:

  • Signature-based detection uses predefined patterns or signatures to identify known malware based on specific characteristics or sequences of code [24,25,26,27,28].

  • Heuristic analysis uses predefined rules or algorithms to identify potentially malicious code by analyzing its structure, behavior, or attributes [26, 29,30,31].

  • Code structure analysis examines the structure and syntax of the code to identify suspicious or malicious patterns that may indicate the presence of malware [32,33,34,35].

  • String analysis examines the strings or text within the code to detect hardcoded URLs, IP addresses, encryption keys, or other indicators of malicious behavior [36,37,38,39,40].

  • Metadata analysis examines the file’s metadata, such as file size, creation date, or digital signatures, to identify anomalies or signs of tampering [40,41,42,43].

  • Control flow analysis examines the flow of instructions within the code to detect unusual or malicious behavior, such as code obfuscation, anti-analysis techniques, or hidden functionality [44,45,46,47].

  • Sandbox analysis runs the code or file in a controlled environment (sandbox) to observe its behavior, monitor system interactions, and detect malicious activities or suspicious network communications [39, 41, 48, 49].

Similarly, the most common dynamic analysis techniques include:

  • Behavior analysis monitors the runtime behavior of the code or file to identify suspicious or malicious activities, such as unauthorized system modifications, file system changes, or network communications [41, 44, 50].

  • API monitoring observes the interactions between the code and Application Programming Interfaces (APIs) to detect abnormal or malicious API calls that may indicate malicious intent [48, 51, 52].

  • Network traffic analysis captures and analyzes the network traffic generated by the code or file during execution, looking for communication with known malicious servers, unusual data transfers, or suspicious network behavior [53,54,55,56,57].

  • Dynamic code analysis examines the code’s behavior at runtime, including function calls, memory operations, and system calls, to identify malicious or suspicious activities [41].

  • System call monitoring observes the system calls made by the code or file to the operating system, detecting unusual or unauthorized calls that may indicate malicious behavior [58,59,60].

  • Sandboxing executes the code or file in a controlled virtual environment (sandbox) to observe its behavior while isolating it from the host system, thus preventing potential harm to the system [39, 41, 48, 49].

  • Emulation and virtualization emulate or virtualize the target system environment in which the code or file executes, allowing its behavior and interactions to be analyzed without directly affecting the host system [61,62,63,64].

These techniques, often used in combination, help in identifying and classifying malware, enhancing the security of systems and networks.
Static techniques are good at detecting active malware, i.e., malware that directly targets some unauthorized resource or feature of the vehicle; however, they fail to detect passive malware that exploits a system vulnerability by monitoring the run-time data of the vehicle. Dynamic techniques are more robust and rigorous, as they can detect any variant of malware by observing the run-time behavior of systems [65], but they typically require more computational resources. In autonomous vehicles this requirement is less of an obstacle. First, autonomous vehicles have specialized hardware, such as Application-Specific Integrated Circuits (ASICs) or Graphics Processing Units (GPUs), designed to handle the specific computations of autonomous tasks efficiently. Second, advanced algorithms and machine learning techniques specific to autonomous vehicles have enabled more efficient processing of data. Lastly, many autonomous vehicle systems leverage cloud computing to offload intensive computational tasks: by using remote servers with powerful computing resources, the computational load on the vehicle itself can be reduced. This enables the vehicle to rely on the cloud for resource-demanding tasks such as high-definition mapping, complex route planning, or deep-learning-based processing such as malware detection [66,67,68]. Therefore, the proposed hybrid approach (using both static and dynamic techniques) can help detect malware by leveraging the advantages of both in a single model. Alternatively, some approaches install vehicle gateways that allow only authorised communication with the vehicle and introduce vehicle Intrusion Detection Systems (IDSs) to detect abnormal behavior in the Controller Area Network (CAN) [69].
However, it is difficult for a gateway or IDS to block these actions in advance, as most malware and adware are behavior-based. Therefore, to detect unknown malware threats, it is vital to introduce a methodology that can detect suspicious behaviors and analyze anomalous indicators rigorously (i.e., negligible false alarms) and efficiently (i.e., in real-time).
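As an illustration of the string analysis technique listed among the static methods above, the following minimal Python sketch extracts printable strings from a binary blob and searches them for hard-coded URLs and IPv4-like literals. The function names, the regular expressions, and the four-character minimum are assumptions made for this example; they are not taken from the cited works.

```python
import re

# Illustrative patterns only; a real analyzer would use a curated indicator list.
PRINTABLE_RE = re.compile(rb"[\x20-\x7e]{4,}")  # runs of >= 4 printable ASCII bytes
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_strings(blob: bytes) -> list[str]:
    """Return printable ASCII runs, similar in spirit to the Unix `strings` tool."""
    return [m.decode("ascii") for m in PRINTABLE_RE.findall(blob)]


def find_indicators(strings: list[str]) -> dict[str, list[str]]:
    """Collect hard-coded URLs and IPv4-looking literals from extracted strings."""
    urls, ips = [], []
    for text in strings:
        urls.extend(URL_RE.findall(text))
        ips.extend(IPV4_RE.findall(text))
    return {"urls": urls, "ips": ips}


if __name__ == "__main__":
    sample = b"\x00\x01MZ\x90\x00http://evil.example/pay\x00\x00connect 10.0.0.5\x00ab\x00"
    print(find_indicators(extract_strings(sample)))
```

Because the file is never executed, a check of this kind is cheap, but it sees only what is stored in plain form; packed or encrypted strings would not be found.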

The rest of the paper is structured as follows: Sect. 2 provides background on autonomous vehicles, while Sect. 3 sketches the state of the art in rigorous malware detection techniques. Section 4 explains our malware detection methodology, while Sect. 5 presents the experimental setup, experimental results, and a critical discussion. Finally, we conclude in Sect. 6.

2 Background and motivation

Modern smart AVSs will strikingly change the worldwide transport industry and smart environments. While improving the standard of smart living and road safety, AVSs also need to communicate wirelessly with other vehicles and devices to plan safe travel efficiently and securely. The number of traffic accidents is decreasing day by day. In addition, people with disabilities can benefit significantly from smart cities and ITS technology, which can also prevent injuries and deaths in combat [70]. However, because wireless communication among them is unreliable, such vehicles are an easy target for malware attacks that may compromise vehicles’ autonomy, increase inter-vehicle communication latency, and drain vehicles’ power. Such compromises may result in traffic congestion, threaten the safety of passengers, and cause financial loss. Therefore, real-time detection of such attacks is key to safe smart transportation and ITSs. With the growth of the Internet of Things (IoT), ITSs aim to improve the efficiency and safety of AVS networks [71]. ITSs in societies that are turning into smart cities become more vulnerable to cyber-threats and cyber-terrorism [72], and different types of ITSs are vulnerable to attacks. The success of remote attacks on autonomous vehicles stems from information sharing by the vehicles over a wireless medium, which increases their susceptibility to various security and malicious attacks. Consequently, securing data exchange (including input and output data) and protecting the ECUs inside AVSs are among the most significant security issues for intelligent vehicles. ECUs are embedded systems that monitor electrical systems or subsystems in a conventional vehicle, for instance energy conversion, the air conditioner, vehicle speed, and the warnings on the instrument panel [73].

An AV is not just a massive car with four wheels; it is made up of networked embedded computers that perform different tasks in a smart and timely manner. An AV is therefore a diverse and complex environment comprising several types of Operating System (OS), which differ among vehicles, as shown in Fig. 2. Although ECUs act as the brain of AVSs and are considered minicomputers, they vary in size, purpose, and the OS they run. We can thus divide ECUs into two categories: those managed by a real-time operating system (RTOS) and those managed by a general-purpose operating system (GPOS). In addition, the Robot Operating System (ROS) is also used. ROS is not an operating system but an open-source robotics framework comprising a collection of software for robot software development. Tesla, a leading new-energy vehicle maker, owns a self-developed OS [74] and is now testing Windows OS [75], and a Tesla patent appears to work on the Windows operating system [76].

Fig. 2 Types of operating systems (OS) used in smart autonomous vehicles

Numerous research efforts focus on using Machine Learning (ML) techniques to identify malware by exploiting the dynamic or runtime aspects of running applications. These efforts employ classification methods that consider diverse features such as Windows API calls, Registry Key Operations, File System Operations, File Extension-based operations, Directory Operations, Dropped Files, and Strings. In addition to static and dynamic approaches, contemporary practice uses Hardware Performance Counters, which accurately reflect the execution behavior of an application, to measure the performance of the software under investigation [77]. However, none of the existing dynamic and ML malware detection techniques uses hardware performance counters for malware classification specifically in autonomous vehicles, although [78] employs a dynamic approach to classify malware based on hardware performance counters and [79] uses hardware performance counters for ransomware classification on the Windows platform. No existing work considers all of these important aspects in a single methodology. We believe that considering all of the above aspects together can significantly improve malware detection rates in AVSs. Therefore, this study develops an efficient malware detection mechanism in the form of a hybrid approach that uses static as well as dynamic analysis and focuses on hardware performance counters to analyze runtime behavior. Moreover, this work shows how accurately hardware performance counters can classify malware in AVSs.

3 Related work

The scholarly community has presented numerous static and dynamic analysis techniques to detect and classify malware. Both kinds of technique have their own benefits and limitations. This section reviews state-of-the-art techniques that pertain to malware analysis.

In [80], the authors proposed analyzing malware on x86-based IoT devices in an autonomous driving setting, using features obtained through static analysis together with machine learning to avoid the resource overhead of dynamic analysis. In [81], the authors used a Bayesian Network (BN) model to analyze cyber risk in AVSs by introducing variables and causal relationships derived from the Common Vulnerability Scoring System (CVSS). The model was then applied to the GPS system of connected AVSs without cryptographic authentication.

Besides other malware attacks, ransomware attacks are emerging, and the scholarly community now analyzes them widely. In [82], the authors presented a case study of the CryptoLuck ransomware to highlight the importance of behavior-based ransomware detection. In [83], the authors statically analyzed process monitoring of file events, processor usage, and I/O rates. In [84], the authors suggested that the static detection technique used by [85] can help in evading anti-virus (AV) software. In [86], the authors performed a behavioral analysis of 14 strains of ransomware on the Windows platform and observed the individual behavioral pattern of each. In [87], the authors presented an automated detection and analysis approach that monitors the dynamic behavior of ransomware by generating API calls and a Control Flow Graph (CFG). The authors of [88] developed a dynamic analysis system (UNVEIL) designed specifically to detect ransomware by automatically generating an artificial user environment.

Several other research efforts follow Machine Learning (ML) based approaches to detect malware by exploiting the dynamic or runtime features of executing applications. One such study, conducted by [84], performed dynamic analysis with machine learning by monitoring file system activity on the Windows platform. The authors used a classification technique that considers a wide range of features, such as Windows API calls, Registry Key Operations, File System Operations, file operations performed per File Extension, Directory Operations, Dropped Files, and Strings, to classify malware.

Besides static, dynamic, and ML approaches, hardware performance counters (which represent the true execution behavior of an application) are now typically employed to measure the performance of the software under investigation [77]. However, none of the existing dynamic and ML malware detection techniques uses hardware performance counters for malware classification specifically in autonomous vehicles, although [78] employs a dynamic approach to classify malware based on hardware performance counters and [79] uses hardware performance counters for ransomware classification on the Windows platform.

The literature shows that most techniques [84] observe only one of System/API calls [86, 87, 89], file operations [88], processor usage [83], or registry activities [90]. Some studies are based on static analysis [82], whereas other proposed techniques mainly focus on dynamic analysis for classification. Many solutions have been developed against malware and ransomware, as well as for classifying ransomware into families, and they significantly improve users’ security. A few studies [91,92,93,94] have shown that there is a lack of behavioral analysis that uses a hybrid technique to classify malware in AVSs based on API calls, file operations, registry keys, and hardware performance counter features (i.e., processor usage, cache misses, memory usage, page faults, instructions, branches, etc.). So far, hardware-based features have been analyzed on malware and non-malware apps, but have not been considered for AVSs. No existing work considers all of these important aspects in a single methodology. We believe that considering all of the above aspects together can significantly improve malware detection rates in AVSs. Therefore, this study develops an efficient malware detection mechanism in the form of a hybrid approach that uses static as well as dynamic analysis and focuses on hardware performance counters to analyze runtime behavior. Moreover, this work shows how accurately hardware performance counters can classify malware in AVSs.

4 Methodology

Autonomous vehicular systems (AVSs) have become a core constituent of the smart transportation system [95]. The computational power of AVSs is gradually increasing, and they must exchange a large amount of information with the smart components of the transportation system. Information exchange with malicious counterparts in these smart systems could produce catastrophic results such as a change of drive plan, a sudden halt, or ignoring obstacles on the road. Generally, malware exploits vulnerabilities in different parts of a computer system (i.e., hardware platform, operating system, and application software). However, considering the drastic implications of malicious activity in AVSs, we should formulate a holistic approach that considers handling precision, vehicle efficiency, and digital security.

With static analysis, malware detection can take place efficiently, before application execution, by merely matching known application features such as signatures, and it requires few computational resources. Static analysis therefore provides early detection that mitigates malicious activity during autonomous vehicle operation. However, static analysis does not cover zero-day attacks or obfuscated malicious applications (those with hidden or purposefully crafted features, such as packed or compressed programs or indirect addressing [96]). To address these issues, a dynamic-analysis-based mechanism can be employed that exploits the run-time behavior of executing applications (including system hardware, operating system interactions, etc.) to classify and detect malicious behavior. However, the strong detection capabilities of dynamic analysis come with high resource consumption (CPU, memory, energy cost, etc.). Additionally, in the AVS context, it would be too risky to rely directly on dynamic analysis because of its potentially higher false-positive rate compared to static analysis.
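The trade-off described above can be made concrete with a minimal Python sketch of signature matching, assuming a hypothetical in-memory store of SHA-256 digests (the store, its contents, and the function name are illustrative and not part of the proposed system). A lookup of this kind is cheap and runs before execution, but changing a single byte of the binary, as packing or obfuscation does, already defeats it.

```python
import hashlib

# Hypothetical signature store: SHA-256 digests of known-malicious binaries.
KNOWN_BAD_SHA256 = {
    hashlib.sha256(b"known malicious payload").hexdigest(),
}


def static_verdict(binary: bytes) -> str:
    """Return 'malware' on a signature hit, otherwise 'unknown'.

    'unknown' does not mean benign: the sample may be a zero-day or an
    obfuscated variant, which is where dynamic analysis has to take over.
    """
    digest = hashlib.sha256(binary).hexdigest()
    return "malware" if digest in KNOWN_BAD_SHA256 else "unknown"


if __name__ == "__main__":
    print(static_verdict(b"known malicious payload"))   # exact match
    print(static_verdict(b"known malicious payload "))  # one byte appended
```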

Therefore, this study develops an efficient malware detection mechanism in the form of a hybrid approach that uses static as well as dynamic analysis. Traditionally, such models are built using a basic hybrid mechanism, i.e., a single hybrid approach in which distinctive aspects of both the pre-execution and in-execution phases of the applications are obtained for analysis and detection. For mandatory requirements such as efficient and thorough malware detection with a reduced false-positive rate, the hybrid approach is appropriate and recommended.

The proposed security module for AVSs, i.e., the hybrid mechanism called the Combined Hybrid Analyzer (CHA), is shown in Fig. 3. CHA follows the technical concept of hybridization to bring together heterogeneous parameters (heterogeneous in terms of execution requirements, i.e., parameters extracted pre-execution and in-execution). As discussed above, the use of this model has certain operational consequences that hinder its practical use.

Fig. 3 Combined hybrid analyzer (CHA)

Table 2 Microsoft Windows-based services for automotive vehicles

We now discuss the architecture of the model in detail. The proposed CHA model takes applications and data as input and performs pre-execution and in-execution feature extraction simultaneously. The extracted features fall into two categories: static features, which can be extracted without executing the application, and dynamic features, which are extracted while the application executes within an operating system. The static features include embedded command strings and the usage of libraries that manipulate the operating system. The dynamic features are activity logs related to system-wide low-level configuration manipulation, invocation of the system call interface to gain privileged access, manipulation of operating system resources, file-system activity, and hardware execution profiles (i.e., low-level hardware performance counters). The information extracted by static and dynamic analysis is initially raw data. We converted this data into CSV (Comma-Separated Values) format using data processing techniques such as parsing and extraction, data transformation, and delimiter handling. First, we extracted the relevant information by identifying specific patterns or structures within the data and extracting the desired fields or attributes; specialized parsing libraries can assist in this process. Next, the data was normalized by handling missing values. Finally, the file was converted into CSV format, in which each attribute or feature corresponds to a column and each record to a row. Various programming languages, libraries, or data processing software, such as Python with pandas, R, or Excel, can facilitate this conversion. The features are then combined into feature vectors used for both training and validation.
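The conversion described above can be sketched in a few lines of Python. The sketch merges per-sample static and dynamic feature dictionaries into one row per application and writes them as CSV. The feature names, the sample names, and the policy of filling missing values with 0 are assumptions made for illustration; the text does not state how missing values were actually handled. Only the standard library is used here, although pandas would serve equally well.

```python
import csv
import io


def build_rows(static_feats, dynamic_feats, labels):
    """Merge static and dynamic feature dicts into flat, equally shaped rows.

    Missing values are filled with 0 (an assumed policy for this sketch).
    """
    columns = sorted(
        {k for d in static_feats.values() for k in d}
        | {k for d in dynamic_feats.values() for k in d}
    )
    rows = []
    for sample_id in sorted(labels):
        merged = {**static_feats.get(sample_id, {}), **dynamic_feats.get(sample_id, {})}
        row = {"sample": sample_id}
        row.update({c: merged.get(c, 0) for c in columns})
        row["label"] = labels[sample_id]  # 1 = malware, 0 = non-malware
        rows.append(row)
    return ["sample", *columns, "label"], rows


def to_csv(header, rows) -> str:
    """Serialize rows so that each feature is a column and each sample a row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical feature values for two samples.
    static = {"a.exe": {"str_http": 1, "dll_crypt32": 1}, "b.exe": {"dll_crypt32": 0}}
    dynamic = {
        "a.exe": {"hpc_cache_misses": 5200},
        "b.exe": {"hpc_cache_misses": 310, "api_RegSetValue": 2},
    }
    header, rows = build_rows(static, dynamic, {"a.exe": 1, "b.exe": 0})
    print(to_csv(header, rows))
```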
The Machine-Learning (ML) model training and validation strategies, along with the feature selection mechanisms, are discussed in Sects. 4.2, 4.3 and 4.4. The ML classifiers J48, Naive Bayes (NB), Gradient Boosting, XGBoost, and Random Forest (RF) are used to classify applications into malware and non-malware classes. We chose these classifiers because their results are better and they are efficient in terms of time and computational complexity. J48 is well suited to categorical and mixed data. Naive Bayes is often considered one of the fastest classifiers owing to its simplicity and computational efficiency; it typically has shorter training and prediction times than more complex models such as random forests or gradient boosting. Gradient Boosting and XGBoost, in contrast, are generally more computationally intensive and can take more time to train and to make predictions, which matters for AVSs that require a quick response. These algorithms build an ensemble of weak models iteratively, which can be time-consuming compared to simpler algorithms such as Naive Bayes and J48. Random Forest and Gradient Boosting (including XGBoost) are often recognized for their high accuracy in classification tasks such as malware classification. For IoT-related malware detection, algorithmic complexity should be lower, as IoT devices have battery consumption constraints.
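To illustrate why Naive Bayes is considered computationally cheap, the following Python sketch implements a Bernoulli variant for binary features with Laplace smoothing. It is a didactic stand-in, not the classifier implementation used in the experiments, and the toy feature vectors are invented for the example. Training is a single counting pass over the data and prediction is one sum of logarithms per class.

```python
import math


class BernoulliNB:
    """Tiny Bernoulli Naive Bayes for 0/1 features with Laplace smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_features = len(X[0])
        self.log_prior = {}
        self.p = {}  # p[c][j] estimates P(feature j = 1 | class c)
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            self.p[c] = [
                (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                for j in range(n_features)
            ]
        return self

    def predict(self, x):
        """Return the class with the highest log-posterior for one sample."""
        def score(c):
            return self.log_prior[c] + sum(
                math.log(pj if xj else 1 - pj) for pj, xj in zip(self.p[c], x)
            )
        return max(self.classes, key=score)


if __name__ == "__main__":
    # Invented toy data: three binary features per application, label 1 = malware.
    X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0], [0, 1, 1]]
    y = [1, 1, 1, 0, 0, 0]
    model = BernoulliNB().fit(X, y)
    print(model.predict([1, 1, 0]), model.predict([0, 0, 1]))
```

Ensemble methods such as gradient boosting fit many trees in sequence, which is where the additional training and prediction cost mentioned above comes from.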

For the initial investigation and proof of concept, we used a dataset of executable applications for the MS Windows platform. We chose a Windows-based dataset for several reasons. Most major initiatives in the automotive industry use Windows-based services (see Table 2) for their live communication, which is certainly a key source of threats to such services and eventually to the vehicles [106, 107]. Furthermore, as reported in [108], Microsoft services and platforms are helping automakers create smart connected-car solutions that seamlessly address their customers’ unique needs, competitively differentiate their products, and generate new and sustainable revenue streams. Microsoft services not only offer the right tools but also allow automakers to keep their data, provide a secure and compliant cloud platform, and operate at a truly global scale (given that most automotive brands operate in many countries). Importantly, 85% of Fortune 500 companies already rely on Microsoft’s cloud for the aforementioned reasons. In principle, by using such platforms, automakers and suppliers can benefit from the billions of dollars that Microsoft has already invested in cloud services. For instance, Azure already offers more than 200 services in 38 regions worldwide, with robust security measures and support for the global compliance and privacy regulations required for connected cars, letting automakers focus on innovation rather than on building their own cloud-based infrastructure. Consequently, Microsoft aspires to empower automakers in their goals for fully autonomous driving with elegant machine learning and artificial intelligence capabilities as well as advanced mapping services. For instance, Microsoft has more recently partnered with TomTom, HERE, and Esri to create more intelligent location-based services across Microsoft [101].

Furthermore, pseudo-code for the proposed security module for AVSs, i.e., the hybrid mechanism CHA, is shown in Algorithm 1. Table 3 lists the notation used in the pseudo-code for CHA.

Algorithm 1 Pseudo-code for CHA
Table 3 Abbreviations used in pseudo-code for CHA

4.1 Dataset

We used a dataset of 1000 malware applications from different families (e.g., crypto, petya, locker) downloaded from the Virusshare.com repository [109]. Similarly, 1000 non-malware applications (freely available apps) were included, resulting in a dataset of 2000 applications. We use a three-step ML-based mechanism: (i) feature extraction, (ii) feature selection, and (iii) application classification.

4.2 Feature extraction

Choosing a good feature set is the initial phase of any data mining approach. A few of the extracted features are inspired by previous work [79, 84]; however, this research adds further features, i.e., hardware performance counters [78, 79], DLLs [110], and strings [18, 84, 111]. We extracted a total of 1713 features during static analysis and 10,985 features during dynamic analysis. Cuckoo Sandbox, running on a Linux platform, was selected for automated dynamic analysis of Windows executable malware. It automatically runs and analyzes files and collects comprehensive analysis results that outline what the malware does while running inside an isolated operating system. All processes and file changes are tracked and logged, and Cuckoo records the generated logs and behavioral analysis reports. For validation, we used K-fold (k=10) cross-validation and compared the malware detection accuracy of different classifiers to make sure that the dataset is used uniformly and without bias. This yields unbiased training and testing cycles, producing results from which we can draw conclusions with confidence. For each cycle of training/testing and validation, a partition of 70% training, 20% testing, and 10% validation was employed. The extracted features are listed in Tables 6 and 7 in the “Appendix”.
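The k-fold procedure mentioned above can be sketched as follows in Python. The sketch generates stratified folds, so that every fold keeps the 50/50 malware to non-malware ratio of the dataset. Stratification, the fixed seed, and the function name are assumptions of this example; the text specifies only k=10. The sketch also does not reproduce the additional 70/20/10 partition within each cycle.

```python
import random


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs with class ratios preserved per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal indices round-robin across folds
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train_idx, test_idx


if __name__ == "__main__":
    # 1000 malware (1) and 1000 non-malware (0) samples, as in the dataset above.
    labels = [1] * 1000 + [0] * 1000
    for fold, (train_idx, test_idx) in enumerate(stratified_kfold(labels)):
        positives = sum(labels[i] for i in test_idx)
        print(fold, len(train_idx), len(test_idx), positives)
```

Each sample appears in exactly one test fold, so every application contributes to both training and evaluation across the ten cycles.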

4.3 Feature selection

Reducing the number of features improves ML model performance with minor or negligible effects on classification decisions. Moreover, feature selection reduces over-fitting and the time required for training/testing, improves accuracy, and yields simpler, more interpretable models. For this purpose, we employ the information gain criterion [112], applied through the InfoGainAttributeEval attribute evaluator from Weka. The information gain value determines how important a given attribute of the feature vectors is by assigning weights that reflect the effectiveness of the features. Accordingly, the top 25 of the 1713 static features and the top 47 of the 10,985 dynamic features were selected using the information gain algorithm. Figure 4 depicts the top 10 static features ranked by the information gain method, where the X-axis shows the rank of the feature.
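The information gain criterion can be written out in a short Python sketch. It computes IG(Y; X) = H(Y) - sum over v of P(X = v) H(Y | X = v) for discrete features and ranks feature columns by that value. This is an independent illustration and not Weka's implementation; it assumes the features are already discrete, and the column names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum_v P(X=v) * H(Y | X=v) for a discrete feature X."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Rank named feature columns by information gain, highest first."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    columns = {  # hypothetical binary features
        "perfect": [1, 1, 1, 1, 0, 0, 0, 0],
        "noise": [1, 0, 1, 0, 1, 0, 1, 0],
        "partial": [1, 1, 1, 0, 1, 0, 0, 0],
    }
    print(top_k_features(columns, labels, k=2))
```

A feature that splits the classes perfectly receives a gain of one bit on balanced binary labels, while a feature that is independent of the class receives zero, which is why low-ranked features can be dropped with little effect on the classification decision.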

Fig. 4 Top-10 ranked static features

4.4 Model selection and training

Considering the nature of the employed dataset (i.e., categorical and mixed data), this study was conducted using five well-known ML classifiers: Naive Bayes (NB) [113], which is often considered one of the fastest classifiers owing to its simplicity and computational efficiency and typically has shorter training and prediction times than more complex models; Random Forest (RF) [114, 115] and Decision Tree (J48) [116, 117], which are well suited to categorical and mixed data; and Gradient Boosting and XGBoost. The area under the ROC curve [118] is a common measure of the performance of an ML classifier. A higher value (i.e., close to 1) reflects better classification capability. Table 4 shows the ROC values for the CHA. As shown in Table 4, RF and XGBoost stand out among the ML classifiers, attaining an area under the ROC curve of up to 0.9816 for both classes (i.e., malware and non-malware). This indicates that RF and XGBoost are the best-performing classification models compared to the other three.
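For readers unfamiliar with the metric, the area under the ROC curve can be computed directly from classifier scores as the probability that a randomly chosen malware sample receives a higher score than a randomly chosen non-malware sample. The Python sketch below uses this pairwise (Mann-Whitney) formulation; the scores in the example are invented and unrelated to the values in Table 4.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as one half (the Mann-Whitney U formulation). The pairwise
    loop is quadratic, which is acceptable for a small illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Invented malware-probability scores with their true labels.
    scores = [0.9, 0.8, 0.4, 0.3, 0.2]
    labels = [1, 1, 0, 1, 0]
    print(roc_auc(scores, labels))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.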

Table 4 Combined hierarchy analyzer (CHA) values of class 1 and class 0 for Area under ROC
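
To make the area under the ROC curve concrete, the following minimal sketch computes it through the rank formulation: the probability that a randomly chosen malware sample receives a higher score than a randomly chosen non-malware sample. The scores and labels are hypothetical and are not the values reported in Table 4.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Counts, over all positive/negative pairs, how often the positive sample
    is scored higher; ties contribute one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores (estimated probability of malware)
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

auc_malware = roc_auc(scores, labels)

# Treating non-malware as the positive class with the complementary score
auc_benign = roc_auc([1 - s for s in scores], [1 - y for y in labels])

print(auc_malware, auc_benign)
```

In the binary case the two per-class values coincide, which is consistent with a single figure being quoted for both classes.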

4.5 Zero-day attack detection

Zero-day attacks refer to vulnerabilities or exploits that are unknown or not yet discovered by security researchers. Hybrid methods for detecting zero-day attacks typically combine multiple techniques and approaches to enhance detection capabilities. While these methods cannot directly detect zero-day attacks that are unknown to the model, our proposed model can still provide some level of protection through the following mechanisms:

  • Behavior-based Detection: The proposed hybrid method employs behavior-based detection techniques, which focus on monitoring the behavior of software or systems to identify anomalies or suspicious activities. Once a baseline of normal behavior has been established, any deviation from the expected patterns can trigger alerts or raise suspicion, even for zero-day attacks. Behavioral analysis can help detect novel attack patterns or malicious activities that were not explicitly known to the model.

  • Anomaly Detection: The proposed hybrid approach can incorporate anomaly detection techniques to identify deviations from normal behavior or expected patterns. By modeling the normal behavior of the system or application, any unusual activities can be flagged as potential threats. Anomaly detection can be effective in detecting previously unseen attack vectors or exploits, including zero-day attacks.

It’s important to note that while hybrid methods provide enhanced detection capabilities, they cannot guarantee complete protection against all zero-day attacks. Zero-day attacks, by nature, exploit unknown vulnerabilities, and it takes time for security solutions to catch up. Nevertheless, employing a hybrid approach with multiple detection techniques significantly improves the overall security posture and helps mitigate the risks associated with zero-day attacks.
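
The baseline-and-deviation idea behind the two mechanisms above can be sketched as follows. This is an illustrative assumption rather than the detection logic of the proposed model: it uses a simple per-feature z-score, and the feature names, sample values, and threshold are hypothetical.

```python
import statistics


class BaselineAnomalyDetector:
    """Flags samples that deviate strongly from a learned benign baseline."""

    def __init__(self, threshold=3.0):
        # Number of standard deviations tolerated before raising an alert
        self.threshold = threshold
        self.means = {}
        self.stdevs = {}

    def fit(self, normal_samples):
        """Learn per-feature mean and standard deviation from benign behavior."""
        for name in normal_samples[0]:
            values = [sample[name] for sample in normal_samples]
            self.means[name] = statistics.mean(values)
            # Guard against zero spread so the z-score stays finite
            self.stdevs[name] = statistics.pstdev(values) or 1e-9
        return self

    def deviations(self, sample):
        """Absolute z-score of each feature relative to the baseline."""
        return {
            name: abs(sample[name] - self.means[name]) / self.stdevs[name]
            for name in self.means
        }

    def is_anomalous(self, sample):
        """True if any feature lies beyond the threshold."""
        return any(z > self.threshold for z in self.deviations(sample).values())


# Hypothetical runtime measurements collected during benign operation
baseline = [
    {"syscalls_per_sec": 100, "bytes_sent": 2000},
    {"syscalls_per_sec": 110, "bytes_sent": 2100},
    {"syscalls_per_sec": 90, "bytes_sent": 1900},
    {"syscalls_per_sec": 105, "bytes_sent": 2050},
    {"syscalls_per_sec": 95, "bytes_sent": 1950},
]
detector = BaselineAnomalyDetector(threshold=3.0).fit(baseline)

benign_like = {"syscalls_per_sec": 102, "bytes_sent": 2020}
suspicious = {"syscalls_per_sec": 104, "bytes_sent": 9000}

print(detector.is_anomalous(benign_like), detector.is_anomalous(suspicious))
```

Because the detector only needs examples of normal behavior, it can flag an unseen pattern without a signature for it, at the cost of false alarms when benign behavior drifts.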

5 Experimental setup, results and discussion

We performed the experiments on a stand-alone machine with the specifications shown in Table 5.

Table 5 System configuration

For performance evaluation of selected classifiers, we employed the following metrics.

5.1 Accuracy

We have used accuracy to evaluate the results. Accuracy is the fraction of all applications that are correctly classified as malware or non-malware, where TP, TN, FP, and FN stand for True Positive, True Negative, False Positive, and False Negative, respectively.

$$\begin{aligned} Accuracy= \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(1)

5.2 Precision

Precision denotes the proportion of applications predicted as malware that are actually malware, i.e., the correctly predicted positives out of all predicted positives.

$$\begin{aligned} Precision= \frac{TP}{TP + FP} \end{aligned}$$
(2)

5.3 Recall

Recall is the fraction of actual malware applications that are correctly classified as malware, i.e., the correctly predicted positives out of all actual positives.

$$\begin{aligned} Recall= \frac{TP}{TP + FN} \end{aligned}$$
(3)

5.4 F-measure

F-measure is the harmonic mean of precision and recall. It summarizes both in a single value, so a high F-measure requires that the model performs well on precision and recall at the same time.

$$\begin{aligned} F\text{-}Measure= 2 \times \frac{Precision \times Recall}{Precision + Recall} \end{aligned}$$
(4)
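
Equations (1) to (4) can be evaluated directly from the four confusion-matrix counts, as in the minimal sketch below. The counts are hypothetical and chosen only for illustration; they are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure as defined in Eqs. (1)-(4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Return 0.0 instead of dividing by zero when a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for 200 applications (100 malware, 100 non-malware)
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that TN enters only the accuracy; precision, recall, and F-measure are determined by TP, FP, and FN alone.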

For evaluation, we report accuracy, i.e., the fraction of all applications that are correctly classified as malware or non-malware [119]. Figure 5 shows the accuracy results of the proposed CHA model for all five ML classifiers. The results show that the CHA achieves high accuracy, indicating that a large share of known malware can be identified using a time- and cost-efficient and safer mechanism, rather than risking autonomous vehicle operations by running dynamic analysis on all potential applications.

Based on the confusion-matrix counts (TP, TN, FP, and FN), we have calculated precision, recall, and F-measure for the CHA approach. The precision and recall obtained with all five classifiers of the CHA approach are shown in Fig. 5. The results show that RF and XGBoost improve precision by 32.7% and 5.5% compared to NB and J48, respectively. The precision values for RF, NB, Gradient Boosting, XGBoost, and J48 are 0.96, 0.723, 0.87, 0.96, and 0.91, respectively. RF and XGBoost attained the highest values of precision and recall.

Fig. 5 Precision, recall and F-measure of CHA
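
The quoted precision improvements appear to be relative differences between the reported precision values. Under that assumption, which is not stated explicitly in the text, the arithmetic is:

$$\begin{aligned} \frac{0.96 - 0.723}{0.723} \approx 0.328, \qquad \frac{0.96 - 0.91}{0.91} \approx 0.055 \end{aligned}$$

That is, roughly 32.8% over NB and 5.5% over J48, which agrees with the reported 32.7% and 5.5% up to rounding.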

6 Conclusion and future directions

With the advancement of technology and the growing use of smart connected vehicles, cybercriminals have already demonstrated their intent by exploiting several vulnerabilities in the smart transportation systems of the automotive ecosystem, and we expect to see a dramatic increase in cyber attacks against them. The vulnerabilities in the software of AVSs may prove far more dangerous than malware that appears on personal computers and mobile devices. Malicious applications can endanger the lives of drivers and passengers as well as people who are not using AVSs. In this paper, we performed a comprehensive analysis of the cybersecurity threat of malware targeting the smart transportation systems of connected and autonomous vehicles by proposing the hybrid model CHA. The experimentation discussed in the article provides a proof of concept for securing AVSs in general and automotive CPSs in particular that is adaptive, lightweight, and promises accurate results.

For future work, we plan to develop intelligent transportation systems for smart cities that can efficiently detect high-priority attacks using intrusion detection systems (IDS), and to evaluate their effectiveness using simulations. In addition, network feature analysis can be considered together with communication protocols, such as encryption and authentication mechanisms that ensure that the vehicle’s communication channels are protected from unauthorized access or tampering. Moreover, implementing IDS within the vehicle’s network infrastructure can help identify unauthorized or malicious attempts to access or manipulate the vehicle’s systems, since IDS can detect patterns of known attacks or suspicious network traffic. Lastly, our proposed hybrid method can incorporate machine learning algorithms that adapt to and learn from new data and emerging threats. By continuously updating the model with new information and training data, the system can improve its detection capabilities over time and become more adept at identifying previously unknown attack patterns, including zero-day attacks.