1 Introduction

The spread of mobile devices, IoT technologies, ad hoc networks, and intelligent sensors, together with the rapid development of global networking and communication technologies, has driven a surge in global Internet usage. The year 2020 saw a particularly sharp shift because the COVID-19 pandemic relocated many daily activities, such as social networking, e-banking, and e-commerce, into cyberspace. According to worldwide statistics, roughly 5.25 billion people use the Internet regularly, and Internet use grew by about 1,355 percent between 2000 and 2022, against a global population now estimated at 7.9 billion [1].

As a result of this exponential growth in Internet and digital technology usage, web applications have become pervasive across commercial sectors and a vital part of people's everyday activities worldwide. At the same time, the Internet's open, anonymous, and unsupervised design provides a convenient platform for cyber assaults and exposes network systems and computer users to many significant security risks.

In today's environment, malicious URLs are a pervasive threat to many Internet businesses. Although the World Wide Web has become one of the most essential services on the Internet, it has also dramatically increased the likelihood of cyber attacks. Hostile actors use the web to carry out harmful operations such as phishing, spamming, and malware distribution. Phishing, for example, involves sending an email that appears to come from a trustworthy source to trick people into clicking a URL (Uniform Resource Locator) embedded in the email that leads to a bogus webpage. More generally, a malicious URL is created to promote scams, attacks, and fraud; by clicking on an infected URL, malware can be downloaded onto the device, or the user can be tricked into providing sensitive and private information on a bogus website. Many such sites are either vulnerable to web defacement attacks or are designed and controlled by attackers, as in scam websites and phishing schemes. Detecting websites containing harmful code is therefore critical to limit the spread of malware and to keep end users from becoming victims.

In practice, the best-known attacks involving malicious URLs are spam and phishing. Phishing is a form of cybercrime in which the intruder attempts to obtain private and critical information from the end user, such as login credentials or account details, through email or fake websites that look legitimate. For instance, Figure 1 shows a phishing website that mimics Twitter, allowing the attacker to manipulate users into entering their credentials.

Fig. 1

Phishing website example

The most recent APWG Phishing Activity Trends Report indicated a steady flow of phishing and confirmed attack sites in the first three months of 2022 [2]. The number of malicious URLs and their trend are displayed in Fig. 2, which covers the period from the first quarter of 2019 through the first quarter of 2022. Website builders and free web hosting companies have seen a rise in phishing attacks. According to the report, total phishing attacks for the first quarter of 2022 reached 1,025,968, making it the worst phishing quarter the APWG has ever recorded. In March 2022, the APWG recorded 384,291 attacks, its highest monthly total to date.

Fig. 2

Number of phishing URLs reported to the APWG from the first quarter of 2019 to the first quarter of 2022

URLs are usually divided into two types: benign and malicious. Malicious URLs are further grouped according to the type of attack they support, such as spam, phishing, or malware. Detecting malicious URLs has been a major concern for cybersecurity professionals for decades, and several strategies have been proposed to detect them and protect end users from becoming victims of an attack [3,4,5,6,7,8]. These solutions are classified according to their detection type: feature-based detection or blacklist-based detection. In feature-based detection, features representing URLs are automatically extracted and examined, whereas blacklist-based detection depends on user reports and expert analysis.

Malicious URL detection is critical because attackers frequently distribute malicious links through genuine channels, such as social media platforms and email. Furthermore, certain harmful URLs are distributed via malware downloads, which can infect a content-based detector while it is crawling. In addition, because some malicious web content, such as phishing and fraudulent websites, closely resembles genuine content, malicious URL detection is more efficient and accurate than web content detection.

URL-based detection is therefore the preferred method: it is a proactive and safe strategy for the detecting machines because it can flag malicious URLs before the user visits them.

Classifying malicious URLs by attack type is significant because it can indicate the direction and priority of network security measures. For instance, defaced URLs may require a faster response than spam URLs because credentials may be breached; spam may be easier to tolerate, whereas a malware infestation requires immediate action. Identifying and classifying URLs that lead to dangerous websites is thus a significant problem with real-world implications. A business can protect itself with a machine learning model by blocking incoming emails and employee web requests that contain harmful URLs. Moreover, identification and classification of malicious URLs are particularly valuable for real-time threat detection and for resource-constrained applications such as mobile and Internet of Things (IoT) devices.

Consequently, this paper proposes an intelligent identification and classification system for malicious URLs. The main goals of this system are to detect malicious URLs and to identify the attack type a malicious URL attempts to launch, namely phishing, spam, malware, or defacement. The system therefore performs two tasks: the first is binary classification (benign vs. malicious), and the second is multi-class classification; a minimal label-encoding sketch for both tasks follows the list below. The dataset includes information from five distinct URL categories, which are as follows:

  • Benign: trustworthy websites that offer specific services and include no potentially hazardous data or information.

  • Phishing: the act of a website masquerading as a trustworthy entity in electronic communication to obtain personal information such as usernames, passwords, and credit card numbers.

  • Malware: software created to disrupt or damage a computer system, steal confidential data, or gain unauthorized access to it.

  • Defacement: a form of online vandalism in which an unauthorized person modifies the appearance of a website by various means.

  • Spam: the dissemination of unsolicited and irrelevant content for advertising, phishing, malware distribution, and similar purposes.
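As a rough illustration of how these five categories feed the two tasks described above, the following sketch derives the binary and multi-class targets from a labeled URL table. It is a minimal sketch only; the column name "url_type" and the DataFrame layout are our assumptions, not details from the dataset.

```python
# Illustrative only: "url_type" is a hypothetical label column name.
import pandas as pd

URL_CLASSES = ["benign", "spam", "phishing", "malware", "defacement"]

def make_task_labels(df: pd.DataFrame, label_col: str = "url_type") -> pd.DataFrame:
    """Derive the two targets used in this work from the five URL categories."""
    out = df.copy()
    out["multi_class"] = out[label_col].str.lower()               # task 2: five classes
    out["binary"] = (out["multi_class"] != "benign").astype(int)  # task 1: 0 = benign, 1 = malicious
    return out
```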

2 Literature review

Several studies have addressed how to detect malicious URLs and keep users from falling victim to an attack. This section examines URL-based detection solutions and summarizes studies that analyze malicious URLs. In addition, several patterns observed in the literature and the remaining research gaps are explored.

Sayamber and Dixit [9] proposed a model using a Naïve Bayes (NB) classifier to identify and classify four classes (spam, malware, phishing, and benign); clustering and classification techniques were applied to assess the NB classifier. In [10], the authors used the Weka tool as the ML infrastructure, applying five classification techniques (PRISM, Multi-Layer Perceptron (MLP), Naïve Bayes (NB), KStar (K*), and Random Forest (RF)) to a dataset acquired from the NASA repository. Their classification process was divided into six main steps: identifying the required data, feature selection, identifying the training set, classifier selection, training, and classifier evaluation on the test set. Two experiments were conducted using tenfold cross-validation, one with feature selection and one without. In [11], the authors built an intelligent system for detecting phishing attacks that uses data mining techniques to classify websites as authentic or fraudulent. Several ML classifiers were applied to a phishing-website dataset obtained from the UCI repository: Rotation Forest (RoF), K-Nearest Neighbor (kNN), C4.5 Decision Tree, Support Vector Machines (SVM), Artificial Neural Networks (ANN), and Random Forests (RF).

In [12], the authors proposed an approach that uses natural language processing (NLP) techniques to detect phishing attacks by performing semantic analysis of the text of phishing emails. The proposed system, SEAHound, examines a document one sentence at a time and returns "True" if the document contains a social engineering attack. A topic blacklist, a list of (verb, direct object) pairs, was generated using ML by training a multinomial Naïve Bayes model that produces a prediction label for each pair together with a confidence score in (0, 1). They used 2,000 emails (1,000 phishing and 1,000 non-phishing) for training and 10,014 emails (5,014 phishing and 5,000 non-phishing) for testing.

In [13], the authors discuss three approaches for detecting phishing URLs and websites using ML: analyzing several features of the URL against black/white lists, checking who hosts and manages the website (heuristics), and analyzing the visual appearance of the website (visual similarity). They developed a system that combines all three approaches: monitoring all HTTP traffic, comparing the domain of every URL with the white/black list, analyzing the website's features, checking the similarity between the suspected phishing page and the target page, extracting and comparing the CSS of suspicious and legitimate pages, applying ML classification algorithms such as logistic regression, decision tree, and random forest to the collected data, and computing a similarity score that is checked against a pre-defined threshold. D. Patil and J. Patil [14] reported that prior work in this field relied solely on binary classification and did not employ multi-class classification to determine the nature of URLs; that is, URLs were labeled only as malicious or benign rather than by attack type. The basic concept behind their work was therefore to develop a multi-class classification method, based on confidence-weighted learning (a multi-class classifier), for detecting harmful attacks, and they extracted 42 new phishing, malware, and spam features from malicious URLs using a supervised machine learning strategy for training.

Adebowale et al. [15] used integrated features of images, texts, and frames to build a web-phishing detection and prevention scheme based on a robust Adaptive Neuro-Fuzzy Inference System (ANFIS); SVM was used for classification and detection. Ping Yi et al. [16] developed a framework for detecting phishing websites using Deep Learning (DL). Two main feature types were defined for phishing websites, original features and interaction features, and Deep Belief Networks (DBN) form the basis of the proposed detection model.

In [17], the authors used an ML stacking model in their proposed framework to detect phishing websites. They started by applying feature-reduction algorithms (Relief-F, Information Gain [IG], Recursive Feature Elimination [RFE], and Gain Ratio [GR]) to the dataset to obtain two subsets, the strongest features and the weakest features, then normalized the data and fed it into Principal Component Analysis (PCA). Afterward, ML algorithms (Naïve Bayes, random forest [RF], bagging, neural network [NN], support vector machine [SVM], and k-nearest neighbor [KNN]) were applied to the two subsets. Stacking was then performed by merging the best-performing algorithms; two stacking models were used, one combining RF + Bagging + NN and the other RF + Bagging + KNN. The authors chose tenfold cross-validation as the training scheme and used the performance evaluation measures to assess the detection rate.

In [18], Sahingoz et al. proposed a real-time anti-phishing system that employs seven distinct classification algorithms and natural language processing (NLP)-based features. Their system is claimed to perform real-time detection, detect new phishing websites, distinguish a large number of legitimate/phishing websites, and be both third-party independent and language independent. They used KNN, SMO, Decision Tree, KStar, Random Forest, AdaBoost, and Naïve Bayes as ML algorithms and ran each with three feature classes: NLP features, word vectors, and a hybrid of the two. Rao et al. [19] used ensemble ML to classify benign and malicious URLs with the XGBoost algorithm (a gradient-boosted decision tree implementation). One-hot encoding was used to convert categorical values into a machine-readable form, which improved building and prediction speed. Rohan [20] used distributed modern machine learning technologies based on Spark MLlib; the model uses logistic regression and Support Vector Machine (SVM) techniques to estimate the trustworthiness of a URL.

In [21], the authors presented a data-driven phishing detection framework using a deep learning multilayer perceptron approach. The framework consists of several steps: data acquisition; data preprocessing, in which the most effective features are selected and analyzed using IG, RFE, GR, and Relief-F; data normalization; and PCA. The classification phase then trains the model using tenfold cross-validation with the Multilayer Perceptron (MLP) model.

In [22], the authors proposed an intelligent phishing detection model employing a Multilayer Perceptron (MLP), with 70% of the data used for training and 30% for testing. They begin with data acquisition, data cleaning, and normalization. In the processing phase, the features are ranked by significance using the MATLAB independent-significance feature test (IndFeat()) and evaluated using Correlation-based Feature Subset Selection (CfsSubsetEval()). To use the best features for classification, the intersection of the outputs of IndFeat() and CfsSubsetEval() is taken. Finally, the MLP classifier learns the correlation between the features, and the trained model is applied to the testing dataset.

In [23], the authors used ML and binary visualization to build an automated detection process. In the learning phase, a TensorFlow model is trained on sample images; in the detection phase, submitted URLs are tested against the samples in the database. To distinguish between legitimate and phishing websites, the model visualizes the scraped HTML file as a 2D image (using the BinVis tool), which is processed by the TensorFlow model and analyzed against the training model. The system stores every URL that passes through it in a database so it can be visualized in the background. When a user visits a URL, the system checks whether it already exists in the database; if not, it scrapes the HTML code, stores it, visualizes the HTML file as a 2D image, and analyzes it to perform the detection.

Xiao et al. [24] combined a CNN with multi-head self-attention (MHSA) in their proposed model, called CNN-MHSA, for detecting phishing websites. First, the CNN learns the URL features automatically, taking into account the inner dependencies among characters in the URLs; MHSA then exploits these relationships to assign weights to the CNN-learned features. In [25], Shirazi et al. employed an Adversarial Autoencoder (AAE) to augment existing datasets used by machine learning algorithms. The AAE mimics phishing websites by generating samples and provides a metric to measure the quality of those samples. The samples were evaluated against models trained on real-world data and were able to evade the detection model; some of these samples were then used in training. The main contribution is the AAE and the synthesized data added to the training set.

In [26], the authors present a system for detecting phishing websites by analyzing URLs for patterns, using ML techniques to learn the patterns in URL data. The Phishing Websites Dataset (PWS) was used, which consists of a small balanced dataset and a large unbalanced dataset. The phases of this system are preprocessing (data collection, feature selection, data conversion, encoding, normalization, shuffling, augmentation, and distribution), training (ML algorithms for pattern characterization, with 70% of the data for training and 30% for testing), and detection (model deployment and prediction). A Wide Shallow Neural Network (W-SNN), a Narrow Shallow Neural Network (N-SNN), and an Optimizable Decision Tree/GentleBoost Ensemble (ODT) were used in the learning and optimization module, and a sigmoid classifier in the detection module.

In [27], Maini et al. built a model to classify legitimate and phishing websites using several ML classifiers: AdaBoost, decision tree, KNN, logistic regression, random forest, Extreme Gradient Boosting (XGBoost), Support Vector Machines (SVM), and Naïve Bayes; an ensemble model was then built to improve the detection rate. Wang et al. [28] proposed a model using a Dynamic Convolutional Neural Network (DCNN) for malicious URL detection, arguing that it can prevent attackers from evading the current detection model and that it solves the problem of malicious features not being extracted properly. They replaced the pooling layer of the original multilayer convolutional network with a new folding layer and a k-max pooling layer that is adjusted dynamically depending on the URL length and the depth of the convolution layer. A new embedding method was proposed to learn vector representations by building word embeddings on top of character embeddings. Several experiments were conducted (character embedding alone, word embedding alone, and word embedding based on character embedding), showing that their model outperformed the alternatives overall. In [29], Manoj et al. proposed an intelligent ML model to detect phishing websites by assessing the characteristics of the phishing site and selecting a suitable combination of systems for classifier training; logistic regression, SVM, XGBoost, and autoencoders were used as classifiers.

In [30], the authors rely on ML for email phishing classification. PCA was used as an attribute selector, Ranker Search as the search method, and Multilayer Perceptron (MLP), decision tree (DT), and Logistic Model Tree (LMT) as classification algorithms. PCA and Ranker were used to determine the correctly classified instances, incorrectly classified instances, and Kappa statistics for each algorithm; the classification algorithms were then applied in pairs (PCA + MLP), (PCA + J48), and (PCA + LMT) to the resulting data. To sum up, Table 1 provides a comparative summary of all the surveyed studies.

Table 1 Summary of review-related research

Examining prior research and solutions that analyze URLs for harmful content, together with the patterns observed in the literature, shows that most existing work has concentrated only on malicious URL detection. Although detection is itself a substantial challenge, classifying malicious URLs is crucial because it can indicate the direction and priority of the measures an organization should take to secure its network. There is widespread agreement that blacklisting malicious URLs is ineffective because it does not address issues such as obfuscation or new attacks. Consequently, the majority of current research focuses on applications of machine learning techniques. Among the most popular algorithms are Random Forest (RF), decision trees, and XGBoost, with RF appearing to be one of the most effective.

3 Proposed classification system

Identifying malicious URLs and classifying them by attack type is a substantial task because it informs the appropriate network security measures and policies; defaced URLs, for example, require a swifter response than spam URLs since they compromise credentials. Therefore, a number of intelligent and statistical techniques have been utilized to develop an autonomous, self-reliant detection system that can recognize and categorize malicious URLs and support network administrators in security-related decision-making. In this work, we propose a detection and classification system that uses different ensemble learning approaches to deliver high-performance detection and classification of malicious URLs. The proposed system is illustrated in Fig. 3 and is decomposed into four main processing phases: Preparation, Learning, Assessment, and Deployment. The dataset used in this study, known as the URL dataset or ISCX-URL2016 [31], is composed of URL samples distributed as benign URLs (over 35,300), spam URLs (over 12,000), phishing URLs (around 10,000), malware URLs (over 11,500), and defacement URLs (over 45,450). The dataset is formulated using 79 features, including the query length, domain token count, and path token count, and two class labels (binary and multi-class).

Fig. 3

The dataflow diagram of the proposed Malicious URL Detection and Classification System
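For concreteness, a minimal loading-and-inspection sketch is given below. The CSV file name ("All.csv") and the label column name ("URL_Type_obf_Type") are assumptions about how the ISCX-URL2016 export is organized, not details taken from the paper.

```python
# Sketch: load the ISCX-URL2016 feature table and inspect the class distribution.
# "All.csv" and "URL_Type_obf_Type" are assumed names; adjust to the actual export.
import pandas as pd

df = pd.read_csv("All.csv")
print(df.shape)                                   # expect 79 feature columns plus the label
print(df["URL_Type_obf_Type"].value_counts())     # benign, spam, phishing, malware, defacement
```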

3.1 Preparation phase

This stage is responsible for preparing the dataset for the learning process through the following operations:

  • Data Cleaning: This process provides an error-free dataset ready to feed the learning phase. It includes removal of unwanted observations (deleting duplicate/redundant or irrelevant records), handling of missing data, fixing of structural errors (mislabeled classes, typos in feature names, the same attribute appearing under different names, and so on), and management of unwanted outliers (values that do not fit the dataset). The data cleaning process is illustrated in Fig. 4A [32].

  • Feature Selection: This process reduces the input features to the learning model by retaining only relevant features and eliminating irrelevant ones. The process of feature selection is illustrated in Fig. 4B. In this work, 59 features are selected out of 79 using the minimum redundancy maximum relevance (MRMR) algorithm [33], in which each feature is ranked for minimum redundancy and maximum relevance and assigned an importance score [34].

  • Dataset Shuffling: This process randomly shuffles the records of the dataset while preserving the logical relationships between columns (features) [35]. The process of dataset shuffling is illustrated in Fig. 4C.

  • Dataset Division: This process splits the dataset into a training dataset used to train the learning model and a testing dataset used to evaluate the learning model's performance. In this research, we split the dataset into 70% for training and 30% for testing. To ensure efficient validation, we implemented fivefold cross-validation, which repeats the random 70:30 split five times; the model performance is evaluated at every fold, and the overall performance metrics are calculated by averaging the results obtained from the five experiments (folds/iterations). The process of dataset division using fivefold cross-validation is illustrated in Fig. 4D [36]; a condensed sketch of the full preparation phase is given after Fig. 4.

Fig. 4

Data preparation phases (from top-left toward bottom-right): A Data cleaning, B Feature Selection, C Dataset Shuffling, and D Dataset Division
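The sketch referenced in the Dataset Division step above condenses the four preparation operations under stated assumptions: mutual information stands in for the relevance term of the MRMR ranking used in this work [33, 34], and five repeated stratified 70:30 splits approximate the described fivefold procedure. Column and file names are hypothetical.

```python
# A condensed, illustrative preparation pipeline (not the paper's actual code).
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import StratifiedShuffleSplit

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Data cleaning: drop duplicate records and rows with missing values."""
    return df.drop_duplicates().dropna()

def mrmr_rank(X: pd.DataFrame, y: pd.Series, k: int = 59) -> list:
    """Greedy MRMR-style selection: high relevance to y, low redundancy with picks."""
    relevance = pd.Series(mutual_info_classif(X, y), index=X.columns)
    selected, remaining = [], list(X.columns)
    while remaining and len(selected) < k:
        if not selected:
            best = relevance[remaining].idxmax()
        else:
            redundancy = X[remaining].apply(
                lambda col: X[selected].corrwith(col).abs().mean())
            best = (relevance[remaining] - redundancy).idxmax()
        selected.append(best)
        remaining.remove(best)
    return selected

# Dataset shuffling and division: five repeated, stratified 70:30 train/test splits.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.30, random_state=42)

# Usage (hypothetical column names):
# df = clean(pd.read_csv("All.csv"))
# X, y = df.drop(columns=["URL_Type_obf_Type"]), df["URL_Type_obf_Type"]
# features = mrmr_rank(X.select_dtypes("number"), y, k=59)
# for train_idx, test_idx in splitter.split(X[features], y):
#     ...  # train and evaluate one fold
```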

3.2 Learning phase

This phase is the main subsystem of the proposed malicious URL detection and classification system since it is responsible for creating the intelligent portion of the system. This is achieved by training and validating the system with supervised learning methods that develop behavioral profiles for the different types of URLs (benign, spam, phishing, malware, and defacement). In this work, ensemble learning techniques were employed to provide both detection and classification of URLs. There are two main reasons to use ensemble learning: (1) performance (an ensemble can make better predictions and perform better than any single contributing model), and (2) robustness (an ensemble reduces the spread or dispersion of the predictions and of model performance) [37, 38]. Specifically, four ensemble learning models were established and assessed to compare their performance and select the most performant model: the ensemble of bagging trees (En_Bag) approach, the ensemble of k-nearest neighbors (En_kNN) approach, the ensemble of boosted decision trees (En_Bos) approach, and the ensemble of subspace discriminants (En_Dsc) approach. Table 2 summarizes the implemented models and their descriptions.

Table 2 Summary of ensemble learning models utilized in this work
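As one possible concrete realization of these four ensembles, a scikit-learn sketch is shown below. It assumes scikit-learn ≥ 1.2 and the prepared feature matrices from the previous phase; hyperparameters such as the number of learners are illustrative rather than the paper's settings.

```python
# Illustrative scikit-learn analogues of the four ensemble approaches.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

models = {
    # En_Bag: bootstrap-aggregated (bagged) decision trees
    "En_Bag": BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=30),
    # En_kNN: ensemble of k-nearest-neighbour learners over random feature subspaces
    "En_kNN": BaggingClassifier(estimator=KNeighborsClassifier(),
                                n_estimators=30, bootstrap=False, max_features=0.5),
    # En_Bos: boosted shallow decision trees
    "En_Bos": AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                                 n_estimators=30),
    # En_Dsc: random-subspace ensemble of (linear) discriminant classifiers
    "En_Dsc": BaggingClassifier(estimator=LinearDiscriminantAnalysis(),
                                n_estimators=30, bootstrap=False, max_features=0.5),
}

# Training/evaluation on one fold (X_train, y_train, X_test, y_test assumed to exist):
# for name, clf in models.items():
#     clf.fit(X_train, y_train)
#     print(name, clf.score(X_test, y_test))
```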

3.3 Assessment phase

In this research, we evaluated the performance of the proposed models using conventional machine learning assessment metrics. These include confusion matrix analysis (reporting TP, TN, FP, and FN), from which we calculate the number of correctly classified samples (NCS) and the number of misclassified samples (MCS) and derive the other standard metrics: detection/classification accuracy, precision, recall (sensitivity), and the harmonic mean (F score) [39]. Finally, we also measure every developed model's detection/classification overhead (prediction time). To sum up, Figure 5 summarizes the calculations for the performance assessment metrics utilized in this paper.

Fig. 5

Summary of the performance assessment system of measurement
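For completeness, the standard definitions behind the metrics summarized in Fig. 5 can be written as:

\[
\mathrm{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP+FP},
\]
\[
\mathrm{Recall} = \frac{TP}{TP+FN}, \qquad
F\ \mathrm{Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\]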

3.4 Deployment phase

System deployment entails the transition of the developed service (such as the malicious URL detection system) to the eventual end users and the transition of support and maintenance responsibilities to the post-deployment support organization or organizations. The proposed malicious URL detection system is to be deployed alongside an intrusion detection/prevention system (IDS/IPS), as a host-based IDS, a network-based IDS, or both. Standard industry practice is to deploy network-based IDS first and host-based IDS second, which guarantees that both the network and the host devices are protected [40]: the primary asset of any corporation is the network infrastructure, followed by the devices within those networks, and IDS should be deployed in the same order. Figure 6 illustrates a typical deployment of IDS/IPS in an enterprise.

Fig. 6

The typical deployment of IDS/IPS on an enterprise

4 Results and analysis

Like any autonomous system, we evaluate our system using the conventional evaluation metrics described above. First, we provide the system evaluation using the four ensemble learning models, namely the ensemble of bagging trees (En_Bag), the ensemble of k-nearest neighbors (En_kNN), the ensemble of boosted decision trees (En_Bos), and the ensemble of subspace discriminants (En_Dsc). Then, we identify the model that can optimally provide detection and classification of malicious URLs and investigate further results to gain insight into the solution approach. Finally, we benchmark our best results against state-of-the-art models to highlight the advantages of the proposed solution over existing models.

4.1 Comparative evaluation of the ensemble models

Table 3 and Fig. 7 summarize the experimental results of the performance evaluation of the proposed system using the four learning models: Model 1 (En_Bag), Model 2 (En_kNN), Model 3 (En_Bos), and Model 4 (En_Dsc). The comparative study considers the detection/classification accuracy, precision, sensitivity (recall), harmonic mean (F score), and time. Our empirical assessment reveals that the ensemble of bagging trees (En_Bag) provides better performance rates than the other ensemble models for both the detection and classification tasks. The ensemble of k-nearest neighbors (En_kNN), on the other hand, provides the highest inference speed of the ensemble models for both tasks, confirming its appropriateness for high-bandwidth networks. Similarly, the ensemble of boosted trees (En_Bos) delivers competitive performance rates with low detection time. The lowest performance rates were reported for the ensemble of subspace discriminants (En_Dsc), whose accuracy falls 10–20% below the best model. Overall, the En_Bag model provided the best outcomes in terms of all performance indicators along with a very low detection/classification time (6.67–11.77 µSec), which makes it suitable for real-time communication in Internet of Things (IoT) and Cyber-Physical Systems (CPS) networks [43, 46].

Table 3 Summary of experimental results for the performance of Model 1 (En_Bag), Model 2 (En_kNN), Model 3 (En_Bos), and Model 4 (En_Dsc)
Fig. 7

Summary of experimental results for the performance of Model 1 (En_Bag), Model 2 (En_kNN), Model 3 (En_Bos), and Model 4 (En_Dsc)

4.2 Further investigation of the best model

In this subsection, we analyze further results for the best model, En_Bag. Figure 8 shows the confusion matrix of the En_Bag-based system for (a) the two-class detection model [0 for benign and 1 for anomaly] and (b) the multi-class classification model (considering the five classes benign, spam, phishing, malware, and defacement). According to the detection matrix, most samples are correctly classified as benign (TP = 7666, against FP = 115) or malicious (TN = 28825, against FN = 101). Similarly, for multi-class classification, the majority of samples are correctly classified (the blue diagonal), with slight differences in the confusion rates between the classes. One notable observation is that the classes for the multi-class classifier are almost perfectly balanced in terms of samples per class, whereas the classes for the two-class classifier are imbalanced; ensemble learning is an appropriate learning and sampling technique for both balanced and imbalanced data [47].

Fig. 8

Confusion matrix of the En_Bag-based system for a the two-class detection model [0 for benign and 1 for anomaly] and b the five-class classification model
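As a quick sanity check (our own arithmetic, not an additional experiment), the detection-matrix entries quoted above can be converted into the derived metrics directly:

```python
# Worked check from the quoted two-class confusion matrix (benign taken as the
# positive class, per the figure's labeling of 0 = benign, 1 = anomaly).
TP, FP = 7666, 115     # benign correctly predicted / anomaly predicted as benign
TN, FN = 28825, 101    # anomaly correctly predicted / benign predicted as anomaly

accuracy  = (TP + TN) / (TP + TN + FP + FN)                  # ~0.9941
precision = TP / (TP + FP)                                   # ~0.9852
recall    = TP / (TP + FN)                                   # ~0.9870
f_score   = 2 * precision * recall / (precision + recall)    # ~0.9861
print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f_score:.4f}")
```

These values are roughly in line with the averaged detection results of about 99.3% reported in Table 4.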

Table 4 and Fig. 9 summarize the experimental results of the detection system using the ensemble of bagging trees, giving the performance evaluation metrics for the two binary classes and overall. The performance results for the "anomaly" class are slightly higher than those for the "benign" class. This can be explained by the much larger number of "anomaly" samples seen by the model during training, which improved the detectability of the "anomaly" class across the different evaluation metrics, especially detection accuracy. Overall, the system exhibited very strong performance indicators, scoring an average of 99.30% for detection accuracy, precision, recall, and F score. Such results are very attractive and competitive; thus, the model can be deployed effectively to uncover malicious URLs for different web applications.

Table 4 Summary of experimental results of the detection system using the ensemble of bagging trees: Performance evaluation metrics for the binary classes and the overall
Fig. 9

Summary of experimental results of the detection system using the ensemble of bagging trees: Performance evaluation metrics for the binary classes and the overall

Moreover, Table 5 and Fig. 10 summarize the experimental results of the classification system using the ensemble of bagging trees, giving the performance evaluation metrics for the individual classes (benign, spam, phishing, malware, and defacement) and overall. The "spam" and "malware" classes record the highest accuracy and precision rates, averaging 99%, slightly higher than their peers on the same metrics. The system records its highest sensitivity for the "benign" class, with a recall of 98.6%, marginally higher than its peers on the same metric. The harmonic mean of recall and precision (F score) is almost the same for all classes, averaging 98.1%, except for the "phishing" class, which records the relatively lowest performance metrics of all classes. This is also visible in the confusion matrix analysis, which shows that the "phishing" class incurs the largest false rates (FP, FN). Overall, the system exhibited very strong performance indicators, scoring an average of 97.8% for classification accuracy, precision, recall, and F score. Such results are very attractive and competitive; thus, the model can be deployed effectively to uncover malicious URLs in different web applications and further classify the target malicious URL class.

Table 5 Summary of experimental results of the classification system using the ensemble of bagging trees: Performance evaluation metrics for the individual classes and the overall
Fig. 10

Summary of experimental results of the classification system using the ensemble of bagging trees: Performance evaluation metrics for the individual classes and the overall

Finally, Table 6 compares our proposed malicious URL detection and classification system with other up-to-date state-of-the-art malicious URL detection and/or classification systems developed with machine learning (ML) or deep learning (DL) models. The table contrasts our best experimental outcomes, attained with the En_Bag-based model, with the corresponding outcomes reported for the existing state-of-the-art models. In addition to the year of publication, the comparison covers five analogous design and performance dimensions: the detection scheme (ML or DL model), the modeling task (detection for two-class classifiers or classification for multi-class classifiers), the number of target classes (2 for detection or more than 2 for classification), the testing accuracy rates, and the harmonic mean (F score) rates of the detection systems.

Table 6 Comparison with other state-of-the-art models

The table covers ten different malicious URL detection systems developed during the last five years (from 2017 to 2022), in conjunction with our proposed malicious URL detection and classification system (which relies on the ensemble of bagging trees (En_Bag) as its principal learning model). The reported detection systems encompass the following supervised learning techniques: the random forest classifier with feature selection (RFC_FS) employed by Ibrahim et al. [41], the random forest classifier (RFC) used by Subasi et al. [42], the Multinomial Naïve Bayes (MNB) classifier utilized by Peng et al. [43], the random forest classifier (RFC) applied by Patil et al. [43], the hybrid model of k-nearest neighbors, random forest, and bagging trees (KNN-RF-Bag) implemented by Zamir et al. [44], the multilayer perceptron (MLP) used by Saha et al. [21], the multilayer perceptron (MLP) used by Odeh et al. [22], the MobileNet convolutional neural network (MB_CNN) utilized by Barlow et al. [45], the ensemble of optimizable decision trees (En_ODT) used by Abu Al-Haija et al. [26], and the logistic model trees (LMT) employed by Rana et al. [30].

Based on the discussion above and the results provided in Table 6, our malicious URL detection and classification system stands out with the highest performance records among the compared state-of-the-art schemes. The proposed model improves the validation accuracy by 1.9–12.7% over the evaluated models. In addition, the proposed system can be productively implemented in real-time ecosystems thanks to its low prediction delay; the intelligent model needs only 6.67 µSec and 11.77 µSec to provide the detection and classification outcomes, respectively.

5 Conclusions and remarks

This paper presented a new detection and classification system for identifying and categorizing malicious uniform resource locators (URLs), which was developed, trained, validated, evaluated, and discussed. The proposed system makes use of four ensemble supervised machine learning approaches, namely the ensemble of bagging trees (En_Bag), the ensemble of k-nearest neighbors (En_kNN), the ensemble of boosted decision trees (En_Bos), and the ensemble of subspace discriminants (En_Dsc). The implemented models were evaluated on the ISCX-URL2016 dataset, a modern and global multi-class dataset of malicious URLs. First, we identify URLs as either benign or malicious using a binary classifier. Second, we classify the URLs, based on their features, into five classes: benign, spam, phishing, malware, and defacement. Five performance indicators, accuracy, precision, recall, F score, and detection time, were evaluated and compared to characterize the models. The simulation results showed that the En_Bag approach reported the best overall performance, scoring accuracy rates of 99.3% and 97.92% for the two-class and five-class tasks, respectively. Moreover, the comparison with existing state-of-the-art models showed that our best results outperform all prior results.