IoT for measuring road network quality index

Egypt has long been working to ensure road safety, reduce accidents, and preserve the lives of its citizens. Precisely identifying road conditions, followed by effective and timely maintenance and rehabilitation measures, increases the safety level and lifespan of the road network. This paper presents a multi-input deep learning framework that combines BiLSTM and depthwise separable convolution, working in parallel, for automatic recognition of road surface quality and different road anomalies. Furthermore, we compare deep network approaches against traditional approaches using real-time data sensed and collected from the Egyptian road network. The proposed deep model achieved an average accuracy of 93.1%, which is superior to the other evaluated approaches. Finally, we used the proposed model to estimate a road quality index for Egyptian cities.


Introduction
Poor road conditions such as cracked pavement, potholes, and bumps can result in traffic bottlenecks, accidents, and vehicle damage. Utilizing sustainable road construction materials alone is not sufficient to guarantee a safe road system; providing safe roads for safe commuting also requires real-time observation and continuous collection of data about the routes. It is therefore both significant and urgent to provide effective road surface recognition so that authorities can undertake effective control and maintenance to reduce deaths and injuries due to road traffic accidents.
There are three major approaches to detecting road surface quality: (a) laser scanning, (b) computer vision, and (c) sensor-based methods [1]. The laser scanning-based method is more accurate than the other two, but the required equipment is bulky and expensive. Furthermore, because this approach scans roads at low speeds, it may cause traffic jams. In the computer vision approach, road surface images are used to distinguish road surface conditions; its performance may vary with both the time at which images are captured and the weather conditions. In the sensor-based approach, motion-tracking sensor data are frequently used for road condition monitoring, whether they come from a smartphone, from an accelerometer connected to a micro-processing unit such as a Raspberry Pi, or from an accelerometer installed in a vehicle [2,3]. Other sensors, including gyroscopes, steering angle sensors, and wheel speed sensors, have also been used in related studies. The integration of artificial intelligence with the vast amount of real-time sensor data can enhance adaptive decision-making, so many studies have sought intelligent and cost-effective solutions that make use of sensors embedded in everyday devices and vehicles [4].
In this context, this study aims to use real-time sensed data collected from the Egyptian road network to automatically recognize road surface quality and different road anomalies, and to estimate a road quality index based on road surface quality and anomalies that can guide authorities in maintaining road networks and undertaking corrective measures accordingly.
Generally, sensor-based road surface recognition is a challenging multivariate time-series classification task that involves identifying road surface quality from sensor data. Furthermore, sensor data are vulnerable to errors due to noise and signal sensitivity. Traditional classification methods rely heavily on heuristic hand-crafted feature extraction, which is often restricted by human expertise. Moreover, those methods can only learn shallow features. As a result, the performance of traditional techniques in terms of classification accuracy and model generalization is limited [5]. In this context, deep neural networks such as CNNs and RNNs have achieved state-of-the-art results on time-series classification tasks [4], owing to their ability to learn features automatically, rather than requiring manual extraction, and to make accurate predictions directly from raw data [6]. We expect that a novel approach that fuses signal processing techniques with deep neural networks can yield a high-level representation better suited to complex multivariate time-series classification problems. To summarize, the main contributions of this paper are as follows:
• Collecting a real-time dataset sensed from the Egyptian road network.
• Investigating multiple machine learning models, both traditional and deep learning, applied to multivariate time-series data.
• Proposing a novel deep learning architecture that uses depthwise separable convolutions and BiLSTM in parallel to automatically recognize road surface quality and different road anomalies using both time-domain and frequency-domain features.
• Proposing a road quality index estimator to guide authorities in maintaining road networks.
This paper is organized as follows: a review of recent literature is presented in Sect. 2. In Sect. 3, the investigated methods, including Random Forest, RNN, GRU, LSTM, 1D CNN, and a voting ensemble, are presented. Section 4 presents our proposed framework architecture, followed by the road quality index estimator in Sect. 5. Section 6.1 introduces the dataset collection process, preprocessing, and preparation. Section 6.2 shows the results and an experimental analysis. Section 7 presents the discussion. Finally, we conclude our work with future recommendations in Sect. 8.

Literature review
In this section, we provide an overview of the main approaches related to our work in detecting road surface quality and anomalies. First, in Sect. 2.1, we present the deep learning techniques utilized for multivariate time series classification problems. Then, in Sect. 2.2, we review the recent literature on road surface quality recognition involving data from sensors such as accelerometers, gyroscopes, and GPS.

Deep learning for time series classification
Over the past decade, Time Series Classification (TSC) has become one of the most challenging problems, as the availability of temporal data has increased significantly with more sensors being embedded in everything around us [4,6]. Researchers have adopted NLP methods to solve TSC problems because both deal with sequential data. At the same time, deep learning models have progressed significantly and pushed research and development in both computer vision and natural language processing [4,6]. Hence, deep learning models such as the recurrent neural network (RNN) and its variants LSTM and GRU have been widely applied to TS classification and have demonstrated their ability to capture complex temporal features from raw data compared to traditional machine learning techniques [7][8][9]. Moreover, several architectures based on RNNs have been proposed, such as Echo State Networks (ESN), which can produce remarkable results after a fast training process [6]. CNN deep learning models, which have achieved great success in various fields, have also been applied to time-series data [10]. CNNs are widely applied to the TSC problem because of their capability to capture both spatial and temporal features through their filters. Although several new architectures derived from CNNs have been proposed, such as InceptionTime, which can handle different feature sizes and speeds up the training process [11], using CNNs for multivariate time series classification remains a challenging task [12].
In addition, researchers have been inspired by the impressive results of Transformers in speech recognition, natural language processing, and computer vision [13], and have recently started to investigate their use in TSC challenges due to their high ability to capture long-range dependencies. However, they usually need a lot of data to avoid overfitting [13].
From the above, we conclude that despite all the successes deep learning has achieved in most fields, its dominance in the time-series community is far from proven [6], and it remains a challenge that requires more research effort.

Vibration-based methods for road surface quality and anomalies classification
With regard to the approaches applied to process acceleration data, we can differentiate between (a) threshold approaches and (b) machine learning approaches. In threshold approaches, events are detected when the accelerometer oscillation surpasses a certain threshold [14].
Recently, various studies have been conducted to classify road surface conditions using both machine learning and deep learning approaches [3,15,16]. Reference [17] proposed a fixed threshold-based system for detecting potholes and speed bumps using acceleration data. In [18], road surfaces were classified as potholes, cracks, or smooth roads. The data were collected using a smartphone's accelerometer, gyroscope, and GPS sensors. The extracted features were then passed to several machine learning models such as SVM, Decision Tree, and MLP neural networks, which achieved accuracies of 88.55%, 88.35%, and 91.90%, respectively. The authors also showed that using all 3 axes of the sensor data improves accuracy compared with using only 1 axis. The authors of [15] applied various deep learning models such as CNN, LSTM, and reservoir computing models to identify different road surface types as well as potholes and speed bumps. The CNN model produced the best results, with an accuracy rate of more than 95%. In [19], ''RoadSurP'' was developed to be robust under crowdsensing scenarios; the average accuracy was 98% for speed breakers and 92% for potholes over a smooth road, and 92% for speed breakers and 90% for potholes over a rough road. In [20], data were collected using a cyclist's smartphone sensors and used to train artificial neural networks (DNN, CNN, and LSTM). The trained networks were then utilized to detect abnormalities in the road surface, such as potholes and bumps. The experiments indicate that for input data (acc, gyro, diffAcc, and diffGyro), a CNN provided the best maximum accuracy of 93.88%. The authors in [21] developed a mobile application to record sensor data while driving on Egyptian roads. They labeled these records automatically using DBSCAN into two clusters (road anomalies or normal road surface).
Finally, using the automatically labeled datasets, an SVM model was trained, which achieved a 96% accuracy. The authors in [22] collected data using an Android smartphone accelerometer and GPS while travelling at a constant speed of roughly 30 km/h; the surface quality of roads was classified into three road quality categories (smooth road, rough road, or bumpy road). Reference [24] suggested ''Pitfree'', a pothole recognition system based on smartphone acceleration and GPS data. Potholes were divided into three levels: low, medium, and high. The K-Means algorithm was used to cluster the data into potholes and other categories, and various supervised machine learning classifiers were investigated, of which SVM had the best accuracy. In [25], a supervised machine learning framework was presented for the recognition and georeferencing of speed bumps using vehicle-mounted smartphones. Supervised classification was accomplished by supplying a set of input data (vehicle pitch, vehicle roll, and forward and lateral acceleration) and two classes (no defect, speed bump). In [26], the authors proposed real-time road pavement condition monitoring using vibration-based data. They used short-time Fourier analysis to extract signal energy information, and then applied DT, SVM, and KNN machine learning models to detect short-time distress, long-time distress, and no-distress classes. They state that SVM was the most accurate, with 97%, 84%, and 97% accuracy for short-time distress, long-time distress, and no-distress, respectively. Reference [27] proposed a 1D CNN based on accelerometer data for detecting potholes. The experimental results show that the proposed CNN approach has a significant advantage over traditional models; the authors investigated architectures with 1 to 7 hidden layers, with accuracy around 95%. The next sections discuss the implemented machine learning approaches, the dataset collection, and the results.

Multivariate timeseries classification approaches
In this section, we present both the traditional machine learning and the deep learning classification approaches that were implemented and evaluated in our work. In multivariate time series classification, we have multiple time-series features with a label associated with each instance. The objective is to learn the relationship between the multivariate time series data and the labels, and to accurately predict the label of any new time-series instance.

Traditional machine learning methods
Traditional classification problems differ from time series classification problems, so traditional classification techniques cannot be directly used with time series data: traditional methods assume that there is no relationship between the observations and that each timestep is independent of the others. In the last few years, a large number of new time series classification techniques have appeared [28]. According to [28], these techniques are classified as distance-based, interval-based, dictionary-based, frequency-based, or shapelet-based algorithms. Distance-based classifiers classify data using time-series-specific distance functions, while interval-based classifiers focus on extracting features from intervals of each series. Dictionary-based classifiers locate words in a time series using sliding windows and discretization, then categorize series based on their word distribution. Spectral classifiers use frequency-domain features, while shapelet-based algorithms use phase-independent subseries.

K-nearest neighbors timeseries classifier
We adapt KNN to time-series data using dynamic time warping (DTW) as a baseline for evaluating time series classification algorithms, because it is simple and does not require extensive hyperparameter tweaking. The time series K-neighbors classifier is a distance-based classifier in which DTW measures the similarity between two temporal sequences that may not be identical in time, speed, or length.
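The baseline above can be sketched in a few lines of numpy. This is an illustrative implementation, not the exact code used in the study; the toy series and labels are made up for the example.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences,
    computed with the standard dynamic-programming recurrence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_dtw_predict(train_X, train_y, query, k=1):
    """Classify a query series by majority vote of its k DTW-nearest neighbours."""
    dists = np.array([dtw_distance(x, query) for x in train_X])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy example: class 0 = "flat" signals, class 1 = "ramps" of varying length.
train_X = [np.array([0., 0., 0., 0.]),
           np.array([0., 1., 2., 3., 4.]),
           np.array([0.1, 0., 0.1, 0.]),
           np.array([0., 2., 4.])]
train_y = np.array([0, 1, 0, 1])
print(knn_dtw_predict(train_X, train_y, np.array([0., 1., 2., 3.])))  # → 1
```

Note that DTW lets the ramp of length 4 match ramps of lengths 3 and 5, which a plain Euclidean distance could not do.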

Time series forest classifier
One of the most well-known interval-based approaches is the time series forest (TSF) classifier. TSF seeks to extract basic summary features from time series intervals. TSF samples these intervals using a random forest technique, with three summary statistics (mean, standard deviation, and slope) as features. To create a new dataset, the summaries of each interval are concatenated into a single feature vector for each timeseries. Finally, this new dataset is used to create a decision tree [28].
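The interval-feature idea above can be sketched as follows. The three summary statistics (mean, standard deviation, slope) are from the text; the random-interval sampling scheme and the downstream forest are simplified here for illustration.

```python
import numpy as np

def interval_features(series, n_intervals=3, rng=None):
    """TSF-style features: sample random intervals of a series and summarise
    each with (mean, std, slope); concatenate into one feature vector."""
    rng = rng or np.random.default_rng(0)
    feats = []
    for _ in range(n_intervals):
        start = int(rng.integers(0, len(series) - 2))
        end = int(rng.integers(start + 2, len(series) + 1))
        window = series[start:end]
        slope = np.polyfit(np.arange(len(window)), window, 1)[0]  # linear trend
        feats.extend([window.mean(), window.std(), slope])
    return np.array(feats)

series = np.sin(np.linspace(0, 6, 69))   # one 69-step channel, as in our data
fv = interval_features(series)
print(fv.shape)  # → (9,) : 3 intervals × (mean, std, slope)
```

In the full TSF, many such feature vectors (one per tree, with different random intervals) feed an ensemble of decision trees.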

MUltivariate Symbolic Extension (MUSE)
MUSE is a multivariate dictionary-based classifier that builds a bag of patterns using symbolic Fourier approximation (SFA) for different window lengths; a classifier then analyzes this bag. MUSE forms a multivariate feature vector by applying a sliding window to each variable of the multivariate time series and extracting discrete features per window and dimension. The feature vector is then passed through feature selection to remove non-discriminative features, and finally logistic regression analyzes these data [29].
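As a rough illustration of the dictionary idea only: the sketch below slides a window over one variable, discretizes each window into a symbolic word, and counts the words. MUSE proper uses SFA (Fourier-based symbols) rather than the simple quantile binning assumed here.

```python
import numpy as np
from collections import Counter

def bag_of_words(series, window=8, n_bins=4, word_len=4):
    """Simplified dictionary features: slide a window, reduce it to word_len
    segment means (piecewise aggregate approximation), map each mean to a
    symbol via global quantile bins, and count the resulting words."""
    edges = np.quantile(series, np.linspace(0, 1, n_bins + 1)[1:-1])
    words = Counter()
    for start in range(len(series) - window + 1):
        w = series[start:start + window]
        segs = np.array_split(w, word_len)
        symbols = np.digitize([s.mean() for s in segs], edges)
        words["".join(map(str, symbols))] += 1
    return words

hist = bag_of_words(np.sin(np.linspace(0, 12, 64)))
print(sum(hist.values()))  # → 57 windows counted
```

In MUSE, histograms like this (one per variable and window length) are concatenated, filtered by feature selection, and classified with logistic regression.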

MrSEQLClassifier
MrSEQL is a time-series shapelet-based classifier that learns from multiple symbolic representations at multiple resolutions and domains. Shapelets are time series subsequences that are discriminative of class membership; they can detect phase-independent localized similarity between series of the same class [30].

Deep learning methods
Recently, deep learning has emerged as one of the most effective methods for solving supervised classification tasks, particularly in the field of computer vision [6]. In this context, the main objective of this section is to develop deep networks specifically constructed for the time series classification.

Recurrent Neural Network (RNN)
Sequential neural networks are utilized to solve temporal problems and achieve state-of-the-art results in many applications such as language translation, sentiment classification, and image captioning. Unlike conventional neural networks, in which inputs and outputs are independent of one another, the output of a sequential neural network usually depends on prior inputs [8].
The RNN model structure is composed of an input layer, three recurrent layers, and a dense layer for output prediction. The input layer receives an input tensor with a shape of 69 × 7. Each block is composed of a bidirectional RNN layer of 32 units, to learn the long-term temporal dependencies between time steps of a data sequence, followed by a dropout layer. Dropout with rate 0.5 is used to avoid overfitting by ignoring randomly selected neurons during training. After the recurrent layers, the output passes through a fully connected dense layer with 32 units and softmax activation on top of the last hidden state to classify the road surface quality.

Bidirectional long short term memory (BiLSTM)
The long short-term memory (LSTM) is a type of recurrent neural network that can learn long-term sequences [31]. The major issue with plain RNNs is that they find it difficult to preserve information over long sequences; the LSTM was designed to overcome these gradient problems [31]. Bidirectional LSTMs are an extension of LSTMs that provide the network with more context, resulting in faster and fuller learning on sequence classification problems [32]. A BiLSTM learns bidirectional long-term dependencies between sequence data, which help the network learn about the past and future of each timestep. A BiLSTM trains two LSTMs on the input sequence instead of one, feeding the first layer the input sequence as-is and the second layer a reversed copy of the input sequence, as shown in Fig. 1.
The first step is to feed multivariate time series data into the BiLSTM model. The expected structure has the dimensions in the form of 3D tensors with shape (batch size, timesteps, features). Thus, we constructed batches of the timeseries data, where the batch size is 30 and the input layer has 69 timesteps with 7 features. See Fig. 2 for further explanation.
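The batching step above can be sketched in numpy; the random values below are stand-ins for the real recordings, and the 7 features per timestep correspond to the 3-axis acceleration, 3-axis orientation, and speed.

```python
import numpy as np

# Hypothetical raw recording: one row per timestep,
# columns (ax, ay, az, pitch, roll, yaw, speed).
raw = np.random.default_rng(0).normal(size=(30 * 69, 7))

# Segment into fixed-length sequences of 69 timesteps, batched 30 at a time:
# the BiLSTM expects 3-D input of shape (batch_size, timesteps, features).
sequences = raw.reshape(-1, 69, 7)
print(sequences.shape)  # → (30, 69, 7)
```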
The multilayer Bidirectional LSTM architecture is illustrated in Fig. 3. The network begins with a sequence input layer that receives the raw data, followed by 3 Bidirectional LSTM layers and finally ends with a fully connected neural network with four output classes. The overview of the training phase is summarized as follows: The input sequences are fed into the first Bi-LSTM layer, which captures the essential features of timeseries data in both forward and backward directions. Then, the outputs of the first layer are passed through the two remaining Bi-LSTM layers. Moreover, after each BiLSTM layer a dropout layer is added to prevent overfitting. Finally, the fully connected layer is applied to reduce the feature dimensions to predict road surface class.

GRU
GRU is a type of RNN that is considered an improvement over the LSTM; it uses fewer parameters than the LSTM, as it lacks an output gate, which makes training faster [33]. The GRU model is composed of an input layer, three bidirectional recurrent layers, and a dense layer for output prediction. It has the same structure described for the RNN (Sect. 3.2.1), exchanging RNN layers for GRU layers.

1D Convolutional neural network
1D CNNs are among the most successful deep learning methods used for sensory data [6,34]. A 1D CNN layer applies convolutional filters to extract local temporal patterns that are useful for exploring the relationships within multivariate time-series data. The 1D CNN expects a two-dimensional input: the first dimension is the number of timesteps per sample and the second is the number of parallel time-series features. The kernel can only move in one direction, along the time axis. The 1D CNN model used here contains three deep convolutional layers followed by dense layers; its detailed architecture is shown in Fig. 4. The combination of a pair of convolution layers and a max-pooling layer is referred to as a deep convolution layer. In the first deep convolution layer, the input series of dimension 69 × 7 is convolved with 64 filters of size 3, with stride 1 and 'same' padding. The output is convolved further in the next convolution layer, and finally a 1D max-pooling layer of size 2 and stride 2 is applied to reduce the learned features to the most significant ones. For further training, we repeated the deep convolutional layer twice to provide more abstraction of the input signals. Finally, the features extracted by the third deep convolution layer are flattened and fed into dense layers, which perform the classification task. The final output layer of the network is a softmax classifier with four outputs, one per class.
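The shape arithmetic of the first deep convolution layer can be verified with a minimal numpy sketch (the random kernels stand in for learned weights; convolution here means the deep-learning convention, i.e., cross-correlation):

```python
import numpy as np

def conv1d_same(x, kernels):
    """'same'-padded, stride-1 1-D convolution. x is (timesteps, channels);
    kernels is (n_filters, kernel_size, channels)."""
    n_filters, k, _ = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], n_filters))
    for t in range(x.shape[0]):
        window = xp[t:t + k]                      # (k, channels)
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

def maxpool1d(x, size=2, stride=2):
    """Max-pooling along the time axis."""
    steps = (x.shape[0] - size) // stride + 1
    return np.stack([x[i * stride:i * stride + size].max(axis=0)
                     for i in range(steps)])

rng = np.random.default_rng(0)
x = rng.normal(size=(69, 7))                      # one input sequence
h = conv1d_same(x, rng.normal(size=(64, 3, 7)))   # 64 filters of size 3
print(maxpool1d(h).shape)  # → (34, 64)
```

The 'same' padding keeps the 69 timesteps after convolution, and pooling with size 2, stride 2 roughly halves them.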

1D CNN as Feature Extractor and RF model
CNNs are the most widely used architectures in deep learning [34] and are usually designed to extract features from raw data, so in this model we use one as a feature-extraction stage before applying a Random Forest model. The 1D CNN feature extractor is composed of three convolutional blocks. The first and second blocks each consist of a pair of 1D CNN layers with 64 filters of size 3, stride 1, 'same' padding, and ReLU activation, followed by 1D max-pooling of size 2. The third block is the same as the previous ones, except that the 1D max-pooling is replaced by 1D global average pooling, which performs dimensionality reduction by taking the average over the time dimension. The aim of the convolutional blocks is to extract features and patterns efficiently from the input dataset. Finally, the extracted features are fed to the RF model; see Fig. 5.

Average voting ensemble
In this section, we present a soft voting ensemble classifier that integrates multiple classifiers' predictions, where every classifier contributes to the final one. This ensemble classifier can overcome the limitations of the individual classifiers and provide better generalization performance. To build this ensemble model, we combine three of the previously mentioned approaches (i.e., random forest, BiLSTM, and 1D CNN).
In the soft voting ensemble approach, predictions are weighted based on the importance of each classifier and then merged to create a sum of weighted probabilities; the target label with the greatest sum of weighted probabilities is chosen, see Fig. 6a. Furthermore, we used the weighted average ensemble approach, which is an extension of the averaging ensemble model. One drawback of the latter is that each model contributes equally to the ensemble's final prediction; customized weights can instead give more importance and involvement to each classifier, as shown in Fig. 6b. We adjusted these weights using a grid search over the classifiers.
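The weighted soft vote can be sketched as follows; the class probabilities and weights below are made-up illustrative values, not the grid-searched weights from the study.

```python
import numpy as np

def weighted_soft_vote(probas, weights):
    """Combine per-classifier class probabilities with custom weights and
    pick the class with the largest weighted sum."""
    probas = np.asarray(probas, float)        # (n_classifiers, n_classes)
    weights = np.asarray(weights, float)
    combined = (weights[:, None] * probas).sum(axis=0)
    return int(np.argmax(combined)), combined

# Hypothetical outputs of RF, BiLSTM, and 1D CNN for one sample (4 classes).
probas = [[0.10, 0.60, 0.20, 0.10],   # RF
          [0.05, 0.30, 0.55, 0.10],   # BiLSTM
          [0.10, 0.25, 0.50, 0.15]]   # 1D CNN
label, _ = weighted_soft_vote(probas, weights=[0.2, 0.5, 0.3])
print(label)  # → 2
```

With equal weights this reduces to the plain averaging ensemble; the grid search simply tries weight combinations and keeps the one with the best validation accuracy.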

Transformers
Recently, Transformers have shown impressive results in speech recognition, natural language processing, and computer vision [13]. They have been investigated recently in TSC problems such as human activity recognition (HAR) based on acceleration data from smartphones [13,35]. To our knowledge, Transformers have not yet been investigated for the detection of road surface quality based on time series data from IMU sensors. The model developed in this section has the same structure described in [35]. First, the time-series sequences, of shape (sequence length, input channels), pass through the convolutional backbone block, which consists of four 1D CNN layers, each with 64 filters of size 1 and GELU activation. This block transforms the raw time-series data into a higher dimension, generating latent features. The latent sequence representation is further enriched by learning a positional embedding for each position in the sequence. Both the latent representation and the positional embedding feed into the transformer encoder block, which aggregates them through several encoder layers stacked on top of each other. Each encoder layer performs multi-head self-attention on its input. The output of the transformer encoder is provided as input to a classifier head, which consists of layer normalization and two dense layers with GELU nonlinearity and dropout. Finally, LogSoftMax is applied to the output of the top dense layer.
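The core of each encoder layer, scaled dot-product self-attention, can be sketched for a single head (the full model uses multiple heads plus residual connections and layer normalization; the random weight matrices here stand in for learned parameters):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence x of
    shape (timesteps, d_model): every timestep attends to every other."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])             # (timesteps, timesteps)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted mix of values

rng = np.random.default_rng(0)
x = rng.normal(size=(69, 64))                          # latent sequence
W = [rng.normal(size=(64, 64)) * 0.1 for _ in range(3)]
print(self_attention(x, *W).shape)  # → (69, 64)
```

This pairwise attention over all timesteps is what gives Transformers their ability to capture long-range dependencies, at the cost of needing substantial training data.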

The proposed approach: multi-input DSC-BiLSTM
Time series classification problems are a natural fit for deep learning algorithms [6,36]. Convolutional neural networks (CNNs) have been extensively used in time-series classification and have achieved outstanding performance in many applications [32]. However, their increasing model size and computation make them difficult to operate in embedded systems with limited hardware resources. Depthwise separable convolutions have recently been studied in computer vision and have demonstrated significant improvements in feature extraction, improving performance while reducing computation. These results encouraged us to adopt depthwise separable convolutions instead of conventional convolutions for road surface condition recognition.
In this section, we present a novel architecture that combines the strengths of depthwise separable convolutions and BiLSTM, exploiting the time-series information and learning the sensor data representation for road surface condition recognition. The first step is to prepare the dataset. The preprocessing extends the dataset by converting the 3-axis acceleration and 3-axis orientation time-series data to the frequency domain and taking the absolute values of the complex components of the Fourier transform. The usefulness of applying the Fourier transform lies in obtaining a reduced representation of the original data, which also removes noise without smoothing away the main features of the data. Finally, normalization is applied across the features to ensure the data are zero-mean with unit standard deviation, and the data are reshaped to fit the models.
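The frequency-domain preprocessing can be sketched as follows, assuming a one-sided FFT (which turns a 69-step window into 69 // 2 + 1 = 35 frequency bins, matching the BiLSTM branch's 35-timestep input described below); the random window is a stand-in for real data.

```python
import numpy as np

def fft_features(seq):
    """Convert a (timesteps, channels) window of acceleration/orientation
    data to the magnitudes of its one-sided FFT, then z-score normalise
    each channel (zero mean, unit standard deviation)."""
    mags = np.abs(np.fft.rfft(seq, axis=0))     # (timesteps // 2 + 1, channels)
    mags = (mags - mags.mean(axis=0)) / (mags.std(axis=0) + 1e-8)
    return mags

window = np.random.default_rng(0).normal(size=(69, 6))  # 3-axis acc + 3-axis ori
print(fft_features(window).shape)  # → (35, 6)
```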
The general architecture of the multi-input DSC-BiLSTM is shown in Fig. 7. It consists of two parallel networks that act independently, a DSC network and a BiLSTM network, which focus on the time domain and the frequency domain, respectively, to infer features from the data sequences. The features obtained by each network are then fused and forwarded through fully connected layers to classify the different road surface classes. This model requires input in the form of a list of two elements, one input tensor per network for each sample. The DSC network expects a 3D tensor of shape (batch, features, timesteps); therefore, we reshape each sequence before feeding it to the network. In our case, we have 69 timesteps with 7 features per timestep: 3-axis acceleration, 3-axis orientation, and speed. The input to the BiLSTM network, of shape (batch, timesteps, features), represents 35 timesteps with 6 features per timestep: the absolute values of the FFT of the 3-axis acceleration and the 3-axis orientation.
The raw sensor data are passed through four DSC blocks, each consisting of a DSC layer followed by batch normalization, ReLU activation, and dropout to accelerate training and reduce the risk of overfitting. A DSC breaks the convolution into two stages: (a) a depthwise convolution (filtering stage), where the input tensor is split into channels and a convolution is applied to each input channel independently to learn deep features per channel, and (b) a pointwise convolution (combination stage), which stacks the outputs back together. At the end of the DS convolutional layers, the output is flattened and fed into two fully connected layers, with ReLU and dropout between them, as illustrated in Fig. 7. In parallel, the FFT dataset batches are fed into the BiLSTM network, which consists of three bidirectional LSTM layers with 128 hidden units and dropout of 0.7. Finally, a fusion layer combines the learned features from both networks, which are then fed to dense layers with softmax activation to derive the higher-order features and perform the road surface condition classification task.
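The two DSC stages can be sketched in numpy for a single sequence (random kernels stand in for learned weights; batch normalization, ReLU, and dropout are omitted for brevity):

```python
import numpy as np

def depthwise_separable_conv1d(x, depth_k, point_k):
    """Depthwise stage: one kernel per input channel, applied independently.
    Pointwise stage: a 1×1 convolution mixing the channels.
    x is (timesteps, C_in); depth_k is (k, C_in); point_k is (C_in, C_out);
    'same' padding, stride 1."""
    k = depth_k.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # Depthwise: convolve each channel with its own filter only.
    # (kernel pre-flipped so np.convolve performs cross-correlation)
    dw = np.stack([np.convolve(xp[:, c], depth_k[::-1, c], mode="valid")
                   for c in range(x.shape[1])], axis=1)
    # Pointwise: linearly combine channels at each timestep.
    return dw @ point_k

rng = np.random.default_rng(0)
x = rng.normal(size=(69, 7))
out = depthwise_separable_conv1d(x, rng.normal(size=(3, 7)),
                                 rng.normal(size=(7, 64)))
print(out.shape)  # → (69, 64)
```

The parameter saving is visible directly: this layer uses 3·7 + 7·64 = 469 weights, versus 3·7·64 = 1344 for a standard convolution with the same kernel size and output channels.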

Road quality index estimator
The World Economic Forum (WEF) releases an annual Global Competitiveness Index [37], which includes a road quality measure. It represents a rating of a country's road quality based on the WEF's Executive Opinion Survey, which polled over 14,000 company leaders from 144 countries. Only one question determines the road quality indicator score: respondents are asked to rank the roads in their country on a scale of 1 (underdeveloped) to 7 (efficient), and a country's score is calculated by combining the individual responses. The problem with this indicator is its subjectivity. Meanwhile, understanding road deterioration is a crucial component of road asset management, which requires considerable effort and resources to collect data regularly, particularly on road conditions. Moreover, collecting data may have negative impacts on crew safety and traffic flow [38]. This research aims at estimating an objective RQI by establishing a relation between road surface roughness and anomalies on one hand and road quality on the other, so that the developed road quality condition model can be used to estimate a road index that guides decision-makers in enhancing the quality of road networks.
Due to the lack of roughness data for Egyptian roads, a systematic approach was used to derive a reasonable correlation between the road quality index (RQI) and road condition. The RQI is a scale showing how good the roads are, along with the risks associated with each rating; it would increase the capabilities of a road network asset management system. Road roughness, potholes, and speed bumps are the major factors that affect the RQI. To reflect the overall quality index of each kilometer of road, we used the rates of speed bumps (SPR), potholes (PHR), and road segment roughness (RSRR) per kilometer as penalties, as shown in the following equations. Each of the three factors that make up the RQI must be calculated; RSRR and PHR are relatively straightforward, while SPR requires some additional steps.
Road Segment Roughness Rate (RSRR) represents the number of road segments classified as bad, relative to the total number of sequences per kilometer:

RSRR = (Number of bad sequences) / (Total number of sequences)    (1)
Potholes Rate (PHR) represents the number of potholes, relative to the total number of sequences per kilometer:

PHR = (Number of pothole sequences) / (Total number of sequences)    (2)
Speedbumps Rate (SPR) represents the number of speed bumps, relative to the total number of sequences per kilometer. We classify speed bumps into two classes, good and bad, using the observation that the height of a speed bump increases the amplitude of vehicle vibrations in the vertical direction. To determine the type of speed bump, we calculate the third quartile Q_AccY of the vertical acceleration over our dataset's ''speedbump'' class, which was equal to 0.1012. SPR is calculated in two steps: first, we determine the bump type, where x is the maximum value of vertical acceleration in the speed-bump sequence; then we compute the rates of good and bad speed bumps (GSP, BSP) relative to the total sequences per kilometer. After obtaining the factors, the index is determined by combining the three of them; in this formulation, the index changes in direct proportion to changes in all three factors. The contribution of the RSRR to the RQI score is greater than that of the other two factors, and the penalty for bad bumps is greater than for good bumps. The divisor W_Max normalizes the resultant values to a range of 0 to 1, where a higher number indicates better road quality. As shown in Table 1, the RQI is classified into five groups corresponding to index ranges from poor to very high.
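The steps above can be sketched end to end as follows. The Q_AccY threshold (0.1012) is from the text, but the weights, the exact combining formula, and the normalization are assumptions made for illustration, since only their relative ordering (RSRR weighted most, bad bumps more than good) is stated here.

```python
Q_ACC_Y = 0.1012  # 3rd quartile of vertical acceleration over "speedbump" sequences

def rqi(labels, peak_acc_y, w_rsrr=3.0, w_bsp=2.0, w_gsp=1.0, w_phr=2.0):
    """Illustrative RQI for one km of road. `labels` holds the per-sequence
    classifier output ('normal', 'bad', 'pothole', 'speedbump'); `peak_acc_y`
    holds each sequence's peak vertical acceleration. All weights and the
    1 - penalty / W_Max form are assumed, not taken from the paper."""
    n = len(labels)
    rsrr = labels.count("bad") / n        # Eq. (1)
    phr = labels.count("pothole") / n     # Eq. (2)
    bumps = [a for l, a in zip(labels, peak_acc_y) if l == "speedbump"]
    n_bad = sum(a > Q_ACC_Y for a in bumps)
    bsp = n_bad / n                       # bad-bump rate (peak above Q_AccY)
    gsp = (len(bumps) - n_bad) / n        # good-bump rate
    penalty = w_rsrr * rsrr + w_phr * phr + w_bsp * bsp + w_gsp * gsp
    w_max = w_rsrr + w_phr + w_bsp        # assumed worst-case normaliser
    return 1.0 - penalty / w_max          # higher = better road

labels = ["normal"] * 7 + ["bad", "pothole", "speedbump"]
peaks = [0.0] * 9 + [0.25]               # the bump's peak vertical acceleration
print(round(rqi(labels, peaks), 2))  # → 0.9
```

A perfectly smooth kilometer scores 1.0; the penalty terms pull the score down in proportion to how much of the kilometer is rough or contains anomalies.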

Dataset collection
Because of their low cost, minimal memory requirements, and resilience to light conditions, we utilized a vibration-based method for identifying road quality and anomalies, relying on data acquired from these sensors. The multi-sensory dataset used is collected using an onboard device that continually records in-vehicle information while it travels across some Egyptian asphalt roads at different speeds, under real scenarios; the collected samples come from a diverse set of roads, including normal roads, bad roads, potholes, and speed bumps, as shown in Fig. 8. For the data collection, we used an MPU9250 9-DoF IMU, a Neo-6M GPS, and a Raspberry Pi 4, which was mounted to the dashboard of the vehicle. The MPU9250 is an inertial sensor comprising a 3-axis gyroscope, a 3-axis accelerometer, and a 3-axis magnetometer. Fusing the data generated from these sensors is essential for yielding meaningful and reliable information. Thus, we used the calibration technique proposed in [39], which combines the data from the gyroscope and accelerometer with heading data from the magnetometer to overcome the gyroscope drift and the accelerometer noise. Finally, a Kalman filter is used to produce a trustworthy orientation estimate.
The raw data captured by a sensor are commonly represented as a time series, where the values captured by the sensor are represented as points recorded at regular intervals. As shown in Fig. 8, a set of timeseries is presented for roads with 4 different conditions: normal, bad, speed bump, and pothole. For each road type, 6 timeseries are shown: 3 for the captured acceleration values in the X, Y, and Z directions, and another 3 for pitch, roll, and yaw. Fig. 8 indicates that when vehicles pass over road anomalies, both the angles and the acceleration change significantly. Comparing good and bad roads, the acceleration signals largely follow the same pattern, although some differences can be observed, especially in the amplitude of the signal. Furthermore, both the 3-axis acceleration and the 3-axis rotation provide insight into the detection of road anomalies and surface condition. However, it is obvious that orientation is more helpful in detecting speed bumps and potholes than in distinguishing between road surface conditions. The map of the Egyptian route from which these data were collected is shown in Fig. 8e. The time series in Fig. 8d shows an example of the data collected from a bad road. The amplitude of both the 3-axis acceleration and the 3-axis orientation is higher than in the rest of the data at these moments, because the car vibrated more as it passed over bad road segments during this time period. Similarly, Fig. 8b illustrates the effect of speed bumps on the captured sensor data. As observed, the vibrations are wider than for the other classes.
The next stage requires the road surface segments to be labeled to obtain the ground truth for our supervised techniques. The dataset was labeled manually using videos recorded throughout the experiments. The data collection process adopted for building this dataset considers four classes of road: normal road, bad road, pothole, and speed bump. Figure 9 illustrates the multivariate timeseries data representation. Because each series comprises observations at the same time steps, the elements of the input time series are fed in parallel by integrating these seven timeseries into a single dataset, with each row representing a time step and each column representing a different time series. Furthermore, location-based windows of sensor data streams were formed. Since GPS data is collected once per second and IMU data roughly once every 14 ms, the GPS data collection points act as boundaries around the IMU data. Each window contains approximately 69 (65–74) data points in total, bounded by the GPS coordinates. Thus, all timeseries that contain fewer than 65 samples were removed (these occur mainly at the beginning and end of each session); then we used a resampling process so that each sequence fits the 69 timesteps (the most frequent length for location windows), by interpolating the information from the closest data points. This structure enables the division of roads into segments, and therefore each road segment can be classified individually. In the context of time series classification, it is important to provide the features and labels so that we can learn how to classify the road surface. This dataset has 20,769 samples in each of the accelerometer, gyroscope, and car speed streams. Thus, the timeseries dataset has a 3D structure, as shown in Fig. 9: it is composed of 301 sequences, and each sequence's shape is (69, 7), where 69 and 7 are the number of timesteps and features, respectively.
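The window construction described above can be sketched as follows; `resample_window` and `build_sequences` are hypothetical helper names that mimic the described behavior (drop windows shorter than 65 samples, linearly interpolate the rest to 69 timesteps).

```python
import numpy as np

TARGET_LEN = 69   # most frequent location-window length
MIN_LEN = 65      # windows shorter than this are discarded

def resample_window(window, target_len=TARGET_LEN):
    """Linearly resample a (T, n_features) IMU window to target_len timesteps."""
    t_old = np.linspace(0.0, 1.0, num=window.shape[0])
    t_new = np.linspace(0.0, 1.0, num=target_len)
    # Interpolate each feature column independently onto the new time grid.
    return np.column_stack([np.interp(t_new, t_old, window[:, j])
                            for j in range(window.shape[1])])

def build_sequences(windows):
    """Keep GPS-bounded windows of >= MIN_LEN samples, resampled to a fixed length."""
    return np.stack([resample_window(w) for w in windows if w.shape[0] >= MIN_LEN])
```

With seven feature columns (3-axis acceleration, pitch/roll/yaw, and speed), the result is the (n_sequences, 69, 7) tensor described in the text.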
Each sequence is labeled as either 'Normal', 'Bad', 'Bump', or 'Pothole' (multiclass classification). Also, for each sequence of input features, we created a 1D array containing only the labels that we are attempting to predict. As observed, road surface datasets collected in natural scenes are often imbalanced between categories. Some categories may contain a large number of samples, while others contain only a few, such as potholes and speed bumps.

Results
All models developed in this paper were programmed in Python 3. The classical machine learning models were developed using the tslearn and sktime libraries, while the deep learning models were built using the Keras and PyTorch frameworks. The training was performed on Google Colaboratory, using a GPU with 12 GB of RAM. All the applied machine learning models used for the classification of road quality and anomalies have been evaluated using the collected data described earlier in this paper. The dataset sequences were split into 80% for training and 20% for testing, and to keep the independence between training and testing samples, they were collected from different travel sessions. In addition, we adjusted all experiments so that all the models are trained and tested on the same data sequences.
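The session-level split can be illustrated with a small helper (the session identifiers here are hypothetical; the point is that no travel session contributes sequences to both the training and test sets).

```python
def split_by_session(sequences, session_ids, test_sessions):
    """Split sequences so that train and test come from disjoint travel sessions."""
    train = [s for s, sid in zip(sequences, session_ids) if sid not in test_sessions]
    test = [s for s, sid in zip(sequences, session_ids) if sid in test_sessions]
    return train, test
```

Choosing `test_sessions` so that they hold roughly 20% of the sequences reproduces the 80/20 split while preserving independence between the two sets.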

Traditional machine learning techniques
Regarding the traditional machine learning techniques, we found that the Timeseries Forest model has the highest average accuracy. The detailed results for the traditional machine learning models are shown in Table 2.
The Kneighbors Timeseries classifier performs poorly, with an average accuracy of 55%. The behaviors of MrSEQL and MUSE are very similar, at 60% and 61%, respectively. Timeseries Forest, however, shows strong performance and relatively high speed, reaching an accuracy of 75%.
Moreover, by observing the confusion matrices in Fig. 10, we notice that the Timeseries Forest classifier has more true positives and true negatives than false negatives and false positives when considering the normal, bad, and pothole road surface classes. From the results in Table 2, the Timeseries Forest has the best f1-score for three road surface classes: it classified normal roads with an f1-score of 86%, potholes with 69%, bad roads with 70%, and speed bumps with 68%. The highest f1-score for the speedbump class was achieved by the MrSEQL model, with 74% (Fig. 11).

Deep learning techniques
In all deep learning techniques, the experiments were validated using stratified k-fold cross-validation with 10 folds. In regards to the sequential neural networks, all the models were trained for 200 epochs with early stopping (patience of 50), a batch size of 30, and a fixed learning rate of 0.0001. The RNN and GRU models were trained with three bidirectional layers of 32 hidden units each. Each layer is followed by a dropout of 0.5 and, finally, a 23 × 4 fully connected layer, while we defined the BiLSTM layers with 256 hidden states followed by a dropout of 0.5, and the output layer was a 256 × 4 fully connected layer. We used the Softmax activation function on the output to predict the class, and the CrossEntropyLoss function with the Adam optimizer to update the weights and learn new features.
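A minimal PyTorch sketch of the BiLSTM classifier described above (256 hidden states, dropout of 0.5, and a fully connected output over the 4 classes); the exact layer wiring is an assumption based on the description, not the authors' code.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, n_features=7, hidden=256, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True,
                            bidirectional=True)
        self.dropout = nn.Dropout(0.5)
        # Bidirectional: forward and backward states are concatenated.
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):            # x: (batch, timesteps, features)
        out, _ = self.lstm(x)        # (batch, timesteps, 2 * hidden)
        # Use the last timestep's representation; logits suit CrossEntropyLoss.
        return self.fc(self.dropout(out[:, -1, :]))

model = BiLSTMClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss()
```

Feeding a batch of (69, 7) sequences yields one logit vector per sequence, matching the four road-surface classes.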
Results indicate their capability to capture long-term dependencies between time steps of sequence data without a heavy domain-specific feature engineering process, with average accuracies of 85%, 81.6%, and 75.6% for BiLSTM, GRU, and RNN, respectively. The bidirectional LSTM model shows superiority over RNN and GRU, increasing accuracy by 9.4% and 3.4%, respectively, see Fig. 11.
To train the 1D CNN model, Adam was used to optimize the model with the sparse categorical cross-entropy loss function. For each deep layer, we used 64 filters of size 3 with a stride of 1, 'same' padding, and the ReLU activation function; each is followed by 1D max-pooling of size 2 with stride 2. For the dense layers, the numbers of units are 512, 256, 128, 64, and 32, respectively. In terms of average accuracy, the CNN-based model was slightly better than the LSTM-based model, with a 1% increase.
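The 1D CNN described above can be sketched in PyTorch as follows. The number of convolutional blocks (two here) is an assumption, since the text does not state it; the filter count, kernel size, 'same' padding, pooling, and dense layer widths follow the description.

```python
import torch
import torch.nn as nn

class CNN1D(nn.Module):
    def __init__(self, n_features=7, n_timesteps=69, n_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=3, stride=1, padding=1),  # 'same'
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),   # 69 -> 34
            nn.Conv1d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),   # 34 -> 17
        )
        dense_units = [512, 256, 128, 64, 32]        # widths from the text
        layers, in_dim = [nn.Flatten()], 64 * 17
        for units in dense_units:
            layers += [nn.Linear(in_dim, units), nn.ReLU()]
            in_dim = units
        layers.append(nn.Linear(in_dim, n_classes))  # class logits
        self.head = nn.Sequential(*layers)

    def forward(self, x):                            # x: (batch, features, timesteps)
        return self.head(self.conv(x))
```

Note the channels-first input layout `(batch, 7, 69)`, which is PyTorch's convention for `Conv1d`.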
We further used the CNN as a feature extractor and fed the output features to a random forest algorithm, which reached 79% accuracy. Moreover, we investigated the average ensemble of the random forest, BiLSTM, and 1D CNN classifiers, which achieved an accuracy of 89.1%; the best result for the weighted average ensemble was 91.3%, with weights adjusted to [0.0, 0.1, 0.3] for the random forest, BiLSTM, and 1D CNN classifiers, respectively. We found that the performance of the weighted soft voting ensemble method is comparable with the three stand-alone classifiers.
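Weighted soft voting combines the class-probability outputs of the base classifiers; a small NumPy sketch using the reported weights [0.0, 0.1, 0.3] (which effectively drop the random forest and favor the 1D CNN).

```python
import numpy as np

def weighted_soft_vote(prob_list, weights):
    """Weighted-average ensemble over class probabilities.

    prob_list: one (n_samples, n_classes) array per classifier.
    """
    w = np.asarray(weights, dtype=float)
    # Normalize the weights, then take the weighted average of the stacks.
    avg = np.tensordot(w / w.sum(), np.stack(prob_list), axes=1)
    return avg.argmax(axis=1)
```

With weights `[0.0, 0.1, 0.3]`, the random forest's votes are ignored and the 1D CNN dominates, matching the adjusted weights reported above.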
In regards to the transformers model, a dropout of p = 0.1 is used for both the Transformer encoder and the classifier head, and LogSoftmax is applied to the output vector, in conjunction with the Adam optimizer and cross-entropy loss. A batch size of 35 is employed, and an initial learning rate of 0.0001 is used and reduced using the One-Cycle Policy. The model is trained for up to 500 epochs with early stopping. The average accuracy over the 10-fold cross-validation is 86.7%.
Concerning the proposed model, Table 3 summarizes the DSC network parameters: we used a total of [64, 128, 256, 512] filters with sizes of [40, 40, 40, 41] and a dropout value of 0.2, after conducting several tests with various filters and sizes. All strides were defined as 4, and the padding parameters were set to 2.
The model is trained for several mini-batch sizes and hidden unit counts, and the best training accuracy is obtained for the LSTM with 128 hidden units and a dropout rate of 0.7. The model was trained using Adam and sparse categorical cross-entropy as the loss function, which is useful for our imbalanced multi-class classification problem. Moreover, we used the One-Cycle Policy scheduling function to reduce the learning rate as the training progresses, with an initial learning rate of 0.001 [42]. Kaiming weight initialization [43] was used to avoid the vanishing gradient problem, as it shows better stability than random initialization. The early stopping technique is also used so that the model stops automatically if the validation error does not decrease after a certain number of epochs; we used early stopping with a patience of 200 to train our proposed model, to reduce overfitting and improve performance, with a batch size of 35 sequences, see Fig. 12a. The average accuracy is 93.1%, with a standard deviation of 0.034. Figure 12b and c indicate the validation loss and validation accuracy for fold 10, respectively. The model training took ≈45.24 s using the GPU, whereas the CPU took ≈4 min.
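The early-stopping behavior used throughout these experiments can be illustrated with a small framework-agnostic helper (a sketch, not the exact Keras/PyTorch callback).

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=200):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            # Improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Inside a training loop, `if stopper.should_stop(val_loss): break` ends training once the patience budget is exhausted.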
In the performance context, we used the total number of parameters and floating-point operations (FLOPs) to evaluate the architectural and computational complexity of these models [44], see Table 4. FLOPs count the total number of computations required to run a single sample, so the lower the better. The BiLSTM model has the worst FLOPs, with 512.3 MFLOPs, which is 7.7 times that of the proposed model, and the largest number of parameters, while RNN, GRU, and 1D CNN are the fastest. The proposed model achieves the highest accuracy, followed by the transformers model, while the transformers model requires 30% fewer floating-point operations than the proposed model.
By investigating the proposed model, we found that around 98.5% of the operations are related to the BiLSTM network, while the depthwise separable convolution network required around 1 M floating-point operations. Furthermore, 78% of the total number of parameters belongs to the BiLSTM network, see Table 5. All the measurements were done using the deepspeed, ptflops, and keras_flops libraries.
Furthermore, we evaluate the mean performance difference resulting from the cross-validation using statistical hypothesis testing. The statistical significance of the differences between the deep learning models' performances was assessed and reported using a paired t-test. The experimental results show that the proposed model performed significantly better than all the other classifiers for road quality detection; since p < 0.05, we can reject the null hypothesis that both models perform equally well on this dataset, see Table 6.
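The paired t-statistic over matched cross-validation folds can be computed directly; a minimal sketch (the per-fold scores in the usage below are invented for illustration, not the paper's measurements).

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic over matched cross-validation folds."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance of differences
    return mean / math.sqrt(var / n)                  # compare to t with n-1 dof
```

The resulting statistic is compared against the t-distribution with n-1 degrees of freedom to obtain the p-value; a large positive value over 10 folds corresponds to the p < 0.05 result reported above.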
In summary, after evaluating the classification techniques mentioned above on our sensor-based collected dataset, deep learning techniques achieved a better average accuracy than the evaluated traditional approaches. Furthermore, we found that an ensemble of the two deep models, in addition to the traditional random forest, achieved even slightly better performance. The highest average accuracy was reached by DSC-BiLSTM, with 93.1%. Based on all the previous experiments, it is concluded that using both the frequency domain and time domain sensor data looks promising for dealing with variations in road surfaces, and the proposed model outperforms the conventional CNN and RNN.
Finally, we estimated the RQI for an approximately five-kilometer trip through Egyptian roads. The results were plotted using Mapbox maps, as shown in Fig. 13. First, we applied our model to obtain the condition type of the road sequences; then we computed the cumulative distance between each pair of location points to obtain 1-km road segments and calculated the relative RQI using the equation proposed in Sect. 5. Both red rectangles indicate road deterioration, where the driver used an alternative route because of maintenance work. According to the RQI value and Table 1, both are classified as low-quality roads, which confirms the effectiveness of the proposed RQI calculation.

Discussion
Determining road surface conditions is an important research topic, as it helps to automatically assess road quality and to identify road segments that require maintenance. Moreover, it enables the driver to adapt and make early decisions that result in safe driving. In this paper, we proposed a set of models for sensor-based road surface condition detection, including both traditional and deep learning models, and we evaluated the performance of all of them. Compared to traditional techniques, deep learning approaches decrease the reliance on human-crafted feature extraction and improve accuracy. The experimental studies demonstrate that combining the time and frequency domain representations of sensor data empowers the model to capture diverse hidden features. Benefiting from the combination of depthwise separable convolution and BiLSTM, the proposed model consistently outperforms the other methods. The experimental results demonstrate that the fusion of deep learning networks and signal processing techniques is effective for feature extraction in timeseries classification problems, even though our timeseries dataset was imbalanced and the training dataset was small. The proposed model achieves relatively high accuracy but has relatively high complexity.

Conclusion and future work
We proposed a low-cost framework for sensor-based road surface quality and anomaly detection problems. Performance is evaluated using real data, collected by sensors located on the car's dashboard under different road conditions throughout the Egyptian road network and labeled manually. We feed both the time and frequency domain representations of the sensor data to a multi-input deep learning framework that combines BiLSTM and depthwise separable convolution working in parallel. Results show the ability of the proposed model to detect different road surface conditions and anomalies with high classification rates. Furthermore, an objective road quality index is estimated based on real-time measurements of road surface conditions. To the authors' knowledge, it is the first objective road quality index, and it is expected to help decision-makers enhance the quality of road networks and prevent car accidents. In the future, we believe that extending this strategy to a digital twin platform for road networks based on the RQI will be highly effective in ensuring a safer commuting experience for Egyptian citizens and helping the government maintain roads effectively. More road surface types and qualities can be considered, and more complex anomalies can be added, such as concrete roads and cracks. Furthermore, we need to optimize inference time by reducing the size and FLOPs of the proposed model.

Appendix
To further clarify the superiority of the proposed model, we evaluated it on the Human Activity Recognition (HAR) benchmark dataset [45]. HAR was collected from 30 participants who performed six activities in the same environment and conditions: ''Walking,'' ''Jogging,'' ''Sitting,'' ''Standing,'' ''Stairs down,'' and ''Stairs up.'' It consists of accelerometer and gyroscope sensor data collected using a smartphone. The HAR dataset was partitioned randomly into two datasets, where the training dataset covers 70% of the volunteers and the rest form the test dataset. Like our dataset, HAR is a multivariate timeseries classification problem, vibration-based, and imbalanced. Figure 14 and Table 7 show the confusion matrix and evaluation metric results for the proposed and transformers models. We used the same parameters for our proposed model and the same parameters proposed in [35] for the transformers model. The results indicate the superiority of the proposed model over the transformers model, with an 8% increase in accuracy. Both models have the same classification accuracy on the ''Sitting'' and ''Stairs up'' activities, but the proposed model increases the accuracy by 16, 9, 16, and 7% for ''Walking,'' ''Jogging,'' ''Standing,'' and ''Stairs down,'' respectively [35].
Funding Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Data Availability The datasets generated during and analyzed during the current study are available from the corresponding author on reasonable request.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.