1 Introduction

Software development and maintenance involve high costs, especially as systems become more complex. Code smells indicate design or programming issues that make it difficult to change and maintain software [1,2,3], and they are one of the most widely accepted indicators of design problems in source code. Detecting code smells is a significant step that guides the subsequent steps in the refactoring process, and software quality can be enhanced when code smells are identified at the class and method levels in the source code. Several studies have examined the impact of code smells on software [4,5,6] and showed their adverse effects on software quality.

Previous work provided several tools for code smell detection. These tools rely on detection rules that compare the values of relevant metrics extracted from source code against empirically identified thresholds to discriminate code smells. A key limitation of these tools is that their performance is strongly influenced by the thresholds used to separate smelly from nonsmelly instances. When a threshold is set too low, the tool can generate false positives, i.e., incorrect indications that a part of the code is a code smell. Conversely, when a threshold is set too high, false negatives may occur, meaning that valid code smells are not detected by the tool. To overcome these limitations, researchers have recently adopted and developed many ML and DL techniques and conducted many experimental studies to detect code smells, obtaining different results on the same case study; in these studies, a classifier is trained on previous releases of the source code by exploiting a set of independent variables (e.g., structural, historical, or textual metrics) [2, 7]. ML and DL have been used in several recent works on code smell detection [8]. DL is a type of ML technique that allows computational models consisting of multiple processing layers to learn representations of data with multiple levels of abstraction [9]. DL architectures have been widely used to solve many detection, classification, and prediction problems [10, 11]. Long Short-Term Memory (LSTM) and GRU are particular types of Recurrent Neural Networks (RNN) used in DL, designed to recognize patterns in data sequences. LSTM and GRU have proven particularly effective for classification and detection problems [10, 12, 13].

A training data set with an uneven distribution of classes is said to be imbalanced [14]. Using an imbalanced data set to train classification algorithms can lead to misclassification, as the classifier may be biased and fail to classify instances of the minority label correctly [13]. In binary classification, class imbalance biases performance toward the majority class [15]. Most ML and DL techniques predict better when each class's instances are roughly equal, so data imbalance is one of the most significant problems faced by ML and DL techniques. This problem severely hinders the efficiency of these techniques and produces unbalanced false-positive and false-negative results [16].

This study selects public benchmark datasets containing 74 open-source systems from the Qualitas Corpus for experimental purposes. These datasets are imbalanced, which motivates applying data balancing techniques to help develop an efficient and accurate model for code smell detection. Although several experiments in previous studies [10, 12, 17, 18] were conducted on this dataset using many ML and DL algorithms, very few combined ML and DL algorithms with data balancing techniques.

Data balancing techniques aim to address the class imbalance problem to allow the training of robust and well-fit ML and DL models. Based on studies that have applied data balancing techniques with ML and DL models in detecting code smells [2, 7], the authors indicate a positive effect of data balancing techniques on the ML and DL model’s performance. Therefore, this study aims to apply data balancing techniques to address the problem of imbalanced data and investigate the impact of data balancing techniques on the performance of DL models in detecting code smells. Firstly, we apply data balancing techniques to balance the data sets. Secondly, we train and test the proposed models using balanced datasets. Finally, we evaluate the results based on many performance measures: accuracy, precision, recall, f-measure, MCC, AUC, AUCPR, and MSE. In summary, the goal and main contributions of our study are summarized as follows:

  1. (i)

    The paper identifies data imbalance as a major challenge for ML and DL techniques in detecting code smells. This recognition paves the way to propose a novel solution to improve the accuracy of code smell detection.

  2. (ii)

    To address data imbalance issues and improve the accuracy of code smell detection, this paper presents a novel method that combines two DL algorithms (Bi-LSTM and GRU) with two data balancing techniques (random oversampling and Tomek Links) to detect four code smells (God Class, Data Class, Feature Envy, and Long Method).

  3. (iii)

    To demonstrate the capability, effectiveness, and efficiency of the proposed method for code smell detection, we conducted extensive experiments and compared the results of the method with the results of state-of-the-art methods in code smell detection based on various performance measures.

The rest of the paper is organized as follows: Sect. 2 discusses related work. Section 3 presents background on code smells, bidirectional LSTM, gated recurrent units, and data balancing techniques. Section 4 presents the hypothesis and research questions. After that, our research methodology is presented in Sect. 5. Section 6 presents the experimental results and discussion, followed by conclusions in the last section.

2 Related work

Most approaches for code smell prediction use object-oriented metrics to determine whether a software system contains code smells. Our previous work implemented several ML and DL models combined with various data balancing techniques for code smell detection [2, 13]. This study extends that work by applying two DL algorithms combined with two data balancing techniques to investigate the role of different data balancing techniques in improving code smell detection.

Various methodologies and techniques, such as classical ML algorithms, neural networks, and DL algorithms, have been proposed in previous work to detect code smells [2, 4, 12, 13, 18,19,20].

Some approaches [4, 6, 21,22,23] applied classical ML algorithms to detect code smells. Arcelli Fontana et al. [4] and Mhawish and Gupta [6] presented methods using different ML algorithms and software metrics to detect code smells based on 74 software systems. The experimental results showed that ML techniques have high potential for predicting code smells, but imbalanced data caused varying performance that needs to be addressed in future studies. Dewangan et al. [21] and Jain and Saha [22] proposed methods using several ML algorithms to predict code smells. The performance of the proposed methods was evaluated with different performance metrics, and two feature selection techniques were applied to enhance prediction accuracy. The experimental results of Dewangan et al. showed that the best performance was 100%, while the worst was 95.20%. The experimental results of Jain and Saha showed that, among several models, boosted decision trees and Naive Bayes gave the best performance after dimensionality reduction. Pontillo et al. [23] proposed a novel test smell detection approach based on six ML algorithms to detect four test smells. They assessed their models' capabilities in within- and cross-project scenarios and compared them with state-of-the-art heuristic-based techniques. The results showed that the proposed approach is significantly better than heuristic-based techniques.

Some approaches [2, 7, 13, 18] investigated the role of data balancing methods with ML and DL algorithms in improving the accuracy of code smell detection. Khleel and Nehéz [2, 13] presented various ML and DL algorithms with oversampling methods (random oversampling and the synthetic minority oversampling technique (SMOTE)) to detect four code smells (God Class, Data Class, Feature Envy, and Long Method). The results were compared based on different performance measures, and the experiments showed that oversampling techniques improve the accuracy of the proposed models and provide better performance for code smell detection. Pecorelli et al. [7] investigated five data balancing techniques to mitigate data imbalance issues and to understand their impact on ML algorithms for code smell detection. The experiment was based on five code smell datasets from 13 open-source systems, and the results showed that the ML models relying on SMOTE achieve the best performance. This method solved the class imbalance problem by applying sampling methods, yet it achieved low prediction accuracy on some datasets. Hadj-Kacem and Bouassida [18] proposed a hybrid approach based on a deep autoencoder and artificial neural network algorithms to detect code smells. The approach was evaluated using four code smells extracted from 74 open-source systems and addressed the class imbalance problem by applying feature selection techniques. The experimental results showed highly accurate recall and precision values.

Some approaches [10, 12, 17] applied neural network algorithms to detect code smells. Sharma et al. [10, 12] proposed a new method for code smell detection using Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). The experiments were conducted on C# code samples and showed that it is feasible to detect smells using DL methods and that transfer learning can detect code smells with a performance similar to that of direct learning. Kaur and Singh [17] suggested a neural network model based on object-oriented metrics for detecting bad code smells. The model was trained and tested over multiple epochs to find twelve bad code smells, and the experimental results showed the relationship between bad smells and object-oriented metrics.

Some approaches [9, 20, 24] applied DL algorithms to detect code smells. Liu et al. [9] proposed a new DL-based approach to detecting code smells, evaluated on four types of code smell: Feature Envy, Long Method, Large Class, and Misplaced Class. The experimental results showed that the proposed approach significantly improves on the state-of-the-art. Das et al. [20] proposed a DL-based approach to detect two code smells (Brain Class and Brain Method). The proposed system was evaluated using thirty open-source Java projects from GitHub repositories, and the experiments demonstrated high accuracy for both code smells. Xu and Zhang [24] proposed a novel DL approach based on abstract syntax trees to detect multi-granularity code smells of four types. Experimental results showed that the approach yields better results than the latest methods for detecting code smells with different granularities.

After reviewing previous studies on code smell detection, we noticed that most proposed methods ignore the class imbalance problem. In the studies that did address class imbalance, the authors point out that data balancing methods play an essential role in improving the accuracy of code smell detection [2, 7, 13, 18]. The key takeaway from recent studies is therefore that ML and DL combined with data balancing techniques can improve prediction accuracy. Accordingly, our study focuses on solving the class imbalance problem using the random oversampling and Tomek Links methods.

3 Background

This section presents the background of code smells, Bidirectional LSTM, Gated Recurrent units, and data balancing techniques.

3.1 Code smells

Code smells are warning signs referring to certain patterns or characteristics observed in source code that indicate potential design flaws or violations of basic design principles such as abstraction, hierarchy, and encapsulation [21, 25]. While code smells are not necessarily indicative of bugs and the code will still work, they can make future development more difficult and increase the risk of defects; code smells often suggest that a given part of the source code should be refactored to improve its clarity, maintainability, or efficiency [22,23,24,25]. As dependent variables, we chose code smells with high frequency that may have the most significant negative impact on software quality and that can be recognized by available detection tools [26]. Code smell detection is the primary requirement for guiding the subsequent steps in the refactoring process [18, 27]. Detection rules are approaches used to detect code smells through a combination of different software metrics with predefined threshold values, and most current detectors need the specification of thresholds to distinguish smelly and nonsmelly code [28]. Many approaches have been presented for uncovering smells in software systems; detection methodologies range from manual and visualization-based studies to semi-automatic, automatic, empirical, and metrics-based detection of smells. Most techniques used to detect code smells rely on heuristics and discriminate code artifacts affected (or not) by a specific type of smell by applying detection rules that compare the values of metrics extracted from source code against empirically identified thresholds. Researchers have recently adopted ML and DL to detect code smells to avoid thresholds and decrease the false-positive rate of code smell detection tools [29, 30]. Table 1 lists the four specific code smells that we have investigated.

Table 1 The four specific code smells that we have investigated [4]
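To make the threshold-based detection strategy described above concrete, the following minimal sketch flags a God Class candidate from three class-level metrics. The metric names (WMC, ATFD, TCC) follow common detection strategies in the literature, and the threshold values are illustrative assumptions rather than those used by any particular tool.

```python
# Minimal sketch of a threshold-based detection rule for a God Class candidate.
# WMC: weighted methods per class, ATFD: access to foreign data, TCC: tight class
# cohesion. The threshold values below are illustrative assumptions only.

def is_god_class_candidate(wmc: float, atfd: float, tcc: float,
                           wmc_high: float = 47.0,
                           atfd_few: float = 5.0,
                           tcc_low: float = 0.33) -> bool:
    """Flag a class that is complex, accesses many foreign attributes,
    and has low cohesion."""
    return wmc >= wmc_high and atfd > atfd_few and tcc < tcc_low


# A large, low-cohesion class that touches many foreign attributes is flagged.
print(is_god_class_candidate(wmc=60, atfd=12, tcc=0.15))  # True
```

As the text above notes, the quality of such a rule depends entirely on the chosen thresholds, which is what motivates the ML- and DL-based approaches studied here.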

3.2 Long short-term memory (LSTM), and bidirectional LSTM

Long Short-Term Memory (LSTM) was introduced to avoid or handle long-term dependency problems without being affected by an unstable gradient [10]. This problem frequently occurs in regular RNNs when connecting previous information to new information. A standard LSTM unit comprises a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. Due to the ability of the LSTM network to recognize longer sequences of time-series data, LSTM models can provide high predictive performance in code smell detection. Bidirectional Long Short-Term Memory (Bi-LSTM) extends the capabilities of LSTM networks by using two separate hidden layers to process the input data twice, in the forward and backward directions, as shown in Fig. 1. With a regular LSTM, the input flows in one direction, either backward or forward. A Bi-LSTM provides the network with sequence information in both directions (a sequence processing model consisting of two LSTMs): one takes the input in the forward direction (past to future) and the other in the backward direction (future to past). The idea behind Bi-LSTM is to exploit spatial features to capture bidirectional temporal dependencies from historical data and overcome the limitations of traditional RNNs [28]. A standard RNN takes sequences as inputs, and each sequence step refers to a certain moment [10, 12]. For a certain moment t, the output \({o}_{t}\) not only depends on the current input \({x}_{t}\) but is also influenced by the output from the previous moment \(t-1\). The following equations show the output at moment t.

$$\begin{aligned} h_{t} & = f\left( U \times x_{t} + W \times h_{t-1} + b \right) \\ o_{t} & = g\left( V \times h_{t} + c \right) \\ \end{aligned}$$
(1)

where U, V, and W denote the weights of the RNN, b and c represent the biases, and f and g are the activation functions of the neurons. The cell state carries the information from the previous moments and flows through the entire LSTM chain, which is the key to the LSTM's long-term memory. The forget gate determines what information from the prior moment should be filtered out; the output of the forget gate can be formulated as the following equation:

$$f_{t} = \sigma \left( W_{{\text{f}}} \cdot \left[ h_{t-1}, x_{t} \right] + b_{{\text{f}}} \right)$$
(2)

where σ denotes the activation function, \(W_{{\text{f}}}\) and \(b_{{\text{f}}}\) denote the weights and bias of the forget gate, respectively. The input gate determines what information should be kept from the current moment, and its output can be formulated as the following equation:

$$i_{t} = \sigma \left( W_{i} \cdot \left[ h_{t-1}, x_{t} \right] + b_{i} \right)$$
(3)

where σ denotes the activation function, \(W_{i}\) and \(b_{i}\) denote the weights and bias of the input gate, respectively. With the information from the forget gate and input gate, the cell state \(C_{t-1}\) is updated through the following formula:

$$\begin{aligned} \check{C}_{t} & = \tanh \left( W_{c} \cdot \left[ h_{t-1}, x_{t} \right] + b_{c} \right) \\ C_{t} & = f_{t} \times C_{t-1} + i_{t} \times \check{C}_{t} \\ \end{aligned}$$
(4)
Fig. 1
figure 1

Interacting layers of the repeating module in a Bi-LSTM “Figure adapted from Verma [31]”

\(\check{C}_{t}\) is a candidate value that is going to be added to the cell state, and \(C_{t}\) is the current updated cell state. Finally, the output gate decides what information should be output according to the previous output and the current cell state.

$$\begin{aligned} o_{t} & = \sigma \left( W_{o} \cdot \left[ h_{t-1}, x_{t} \right] + b_{o} \right) \\ h_{t} & = o_{t} \times \tanh \left( C_{t} \right). \\ \end{aligned}$$
(5)
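To make Eqs. (2)-(5) concrete, the sketch below implements a single LSTM cell step in NumPy. The weight shapes and toy dimensions are illustrative assumptions; in practice the weights W_f, W_i, W_c, W_o and the biases are learned during training.

```python
# Minimal NumPy sketch of one LSTM cell step, following Eqs. (2)-(5).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate, Eq. (2)
    i_t = sigmoid(W_i @ z + b_i)           # input gate, Eq. (3)
    c_hat = np.tanh(W_c @ z + b_c)         # candidate cell state, Eq. (4)
    c_t = f_t * c_prev + i_t * c_hat       # updated cell state, Eq. (4)
    o_t = sigmoid(W_o @ z + b_o)           # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)               # hidden state, Eq. (5)
    return h_t, c_t

# Toy dimensions: 4 input features, hidden size 3 (illustrative only).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W_f, W_i, W_c, W_o = (rng.standard_normal((n_hid, n_hid + n_in)) for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid),
                 W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o)
```

A Bi-LSTM layer simply runs two such recurrences, one over the sequence in the forward direction and one in the backward direction, and combines their hidden states.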

3.3 Gated recurrent unit

The GRU network is one of the optimized structures of the RNN. Due to the long-term dependency problems that arise when the input sequence is too long, an RNN cannot guarantee a long-term nonlinear relationship, meaning that learning over long sequences suffers from vanishing and exploding gradients. Many optimization theories and improved algorithms have been introduced to solve this problem, such as LSTM, GRU networks, Bi-LSTM, echo state networks, and independent RNNs [12]. The GRU network aims to solve the long-term dependence and gradient vanishing problems of the RNN. A GRU is like an LSTM with a forget gate, but it has fewer parameters than an LSTM and uses an update gate and a reset gate, as shown in Fig. 2 [28]. The update gate helps the model determine how much of the past information (from previous time steps) needs to be passed along to the future, and the reset gate helps the model decide how much of the past information to forget [10]. The update gate of the GRU network is calculated as shown in the equation below.

$$z\left( t \right) = \sigma \left( {W\left( z \right)\cdot \left[ {h\left( {t - 1} \right), x\left( t \right)} \right]} \right)$$
(6)

\(z\left( t \right)\) represents the update gate, \(h\left( {t - 1} \right)\) represents the output of the previous neuron, \(x\left( t \right)\) represents the input of the current neuron, \(W\left( z \right)\) represents the weight of the update gate, and \(\sigma\) represents the sigmoid function. The reset gate model in the GRU neural networks is calculated as shown in the equation below.

$$r\left( t \right) = \sigma \left( {W\left( r \right)\cdot\left[ {h\left( {t - 1} \right), x\left( t \right)} \right]} \right)$$
(7)

\(r\left( t \right)\) represents the reset gate, \(h\left( {t - 1} \right)\) represents the output of the previous neuron, \(x\left( t \right)\) represents the input of the current neuron, \(W\left( r \right)\) represents the weight of the reset gate, and \(\sigma\) represents the sigmoid function. The output value of the GRU hidden layer is shown in the equation below.

$$\check{h}\left( t \right) = \tanh \left( {W\check{h} \cdot \left[ {rt*h\left( {t - 1} \right), x\left( t \right)} \right]} \right)$$
(8)

\(\check{h}\left( t \right)\) represents the candidate output value to be determined in this neuron, \(h\left( {t - 1} \right)\) represents the output of the previous neuron, \(x\left( t \right)\) represents the input of the current neuron, \(W\check{h}\) represents the weight of the candidate hidden state, and tanh(.) represents the hyperbolic tangent function. \(r\left( t \right)\) is used to control how much past memory is retained. The hidden layer information of the final output is shown in the equation below.

$$h\left( t \right) = \left( {1 - z\left( t \right)} \right)*h\left( {t - 1} \right) + z\left( t \right)* \check{h}\left( t \right)$$
(9)
Fig. 2
figure 2

Interacting layers of the repeating module in GRU “Figure adapted from Christopher [32].”
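For concreteness, the sketch below implements one GRU step in NumPy following Eqs. (6)-(9); the toy dimensions and random weights are illustrative assumptions, and bias terms are omitted as in the equations above.

```python
# Minimal NumPy sketch of one GRU step, following Eqs. (6)-(9).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]))          # update gate, Eq. (6)
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]))          # reset gate, Eq. (7)
    h_hat = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state, Eq. (8)
    return (1.0 - z_t) * h_prev + z_t * h_hat                   # new hidden state, Eq. (9)

# Toy dimensions: 4 input features, hidden size 3 (illustrative only).
rng = np.random.default_rng(1)
n_in, n_hid = 4, 3
W_z, W_r, W_h = (rng.standard_normal((n_hid, n_hid + n_in)) for _ in range(3))
h = gru_step(rng.standard_normal(n_in), np.zeros(n_hid), W_z, W_r, W_h)
```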

3.4 Data balancing techniques

Most code smell datasets are imbalanced, meaning that the target group (smelly classes) has a lower distribution than the nonsmelly classes. Classifying this type of data is one of the most challenging problems facing ML and DL algorithms [2, 15]. Several data balancing techniques have been developed to overcome the class imbalance problem, including subset methods, cost-sensitive learning, algorithm-level implementations, ensemble learning, feature selection methods, sampling methods, etc. [16]. The most common techniques used in previous work to deal with the class imbalance problem are sampling methods (oversampling and under-sampling) [8]. Sampling methods adjust the prior distribution of the majority and minority classes in the training data to obtain a balanced class distribution [2, 15]. An oversampling method adds instances of the minority class to the dataset, while an under-sampling method eliminates samples of the majority class to obtain a balanced dataset; despite its simplicity, random under-sampling is one of the most effective resampling methods [2, 16]. Oversampling methods are generally more effective than under-sampling methods in terms of prediction accuracy [2, 7]. Random oversampling increases the size of the training data set by making multiple copies of minority-class instances; Tomek Links is an under-sampling method that identifies pairs of nearest-neighbor instances belonging to opposite classes and removes the majority-class instance of each pair [15].
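For illustration, the sketch below applies both techniques with the imbalanced-learn package; the feature matrix and labels are synthetic stand-ins with the same 140/280 class ratio as the original datasets, not the actual code smell data.

```python
# Sketch of random oversampling and Tomek Links using imbalanced-learn.
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import TomekLinks

rng = np.random.default_rng(42)
X = rng.random((420, 10))            # synthetic metrics: 420 instances, 10 features
y = np.array([1] * 140 + [0] * 280)  # 140 smelly, 280 nonsmelly (same ratio as the datasets)

X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
X_tomek, y_tomek = TomekLinks().fit_resample(X, y)

print(Counter(y))        # original class distribution
print(Counter(y_over))   # minority class duplicated up to the majority size
print(Counter(y_tomek))  # majority instances participating in Tomek links removed
```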

4 Hypothesis and research questions

Our hypothesis in this study is that if data balancing techniques are applied to balance the original data sets, the classification performance of the proposed models in detecting code smells will improve. To investigate this hypothesis, we used a paired t-test to determine whether there is a statistically significant difference, in terms of accuracy, between our models on the original and balanced datasets and between our models on the balanced datasets and existing approaches. The formula for the paired t-test is shown in Eq. 10 below. To statistically assess the impact of data balancing techniques on the performance of DL algorithms, the hypotheses are formulated as follows:

H1

There is no difference in the accuracy of models when there are no data balancing techniques and when the data balancing techniques are used.

H2

There is a difference in models’ accuracy when there are no data balancing techniques and when the data balancing techniques are used.

$$t = \frac{m}{s/\sqrt n }$$
(10)

where m is the mean of the differences, n is the sample size (i.e., the number of pairs), and s is the standard deviation of the differences.
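As an illustration, the snippet below computes the paired t-test of Eq. (10) with SciPy and verifies it against the formula; the accuracy values are placeholders, not the results reported later in the paper.

```python
# Paired t-test (Eq. 10) on per-dataset accuracy values (placeholder numbers).
import numpy as np
from scipy import stats

acc_original = np.array([0.95, 0.95, 0.95, 0.98])   # hypothetical accuracies, original data
acc_balanced = np.array([0.96, 0.99, 0.96, 1.00])   # hypothetical accuracies, balanced data

t_stat, p_value = stats.ttest_rel(acc_balanced, acc_original)

# Manual computation of Eq. (10): t = m / (s / sqrt(n)).
d = acc_balanced - acc_original
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
print(t_stat, p_value, t_manual)
```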

Based on our hypothesis, this study aims to understand the impact of data balancing techniques on the performance of DL algorithms in detecting code smells. In particular, we aim to address the following research questions.

RQ1

Do data balancing techniques improve DL models’ accuracy in detecting code smells?

This RQ aims to investigate whether data balancing techniques improve DL models' accuracy in detecting code smells.

RQ2

Which data balancing technique is the most effective at improving the accuracy of DL techniques?

This RQ aims to verify the best data balancing technique for detecting code smells.

RQ3

Does the proposed method outperform the state-of-the-art methods in detecting code smells?

This RQ aims to investigate the performance of the proposed method in detecting code smells compared to the state-of-the-art methods.

5 Proposed methodology

The main objective of this study is to perform an empirical study on the role of data balancing techniques in improving DL accuracy for detecting code smells. A series of steps has been taken and is described below: data modeling and collection, data pre-processing, feature selection, class imbalance handling, model building, and evaluation. Figure 3 illustrates the overview of the proposed methodology for code smell detection, where each step is described in the following sections.

Fig. 3
figure 3

Overview of the proposed methodology for code smells detection

5.1 Data modeling and collection

Dataset selection is an essential task in ML and DL problems, and classification models perform better if the dataset is more relevant to the problem. The code smell detection models in this study use a supervised learning task that relies on a large set of software metrics as independent variables. A large number of systems is fundamental for training ML and DL models and for allowing the generalization of the results. We considered the Qualitas Corpus (QC) of systems collected by Tempero et al. [33] to perform the analysis. The QC comprises 111 systems written in Java, characterized by different sizes and belonging to different application domains, as shown in Table 2. The authors utilized a tool called Design Features and Metrics for Java (DFMC4J) to parse the source code of Java projects through the Eclipse JDT library and to calculate all relevant metrics necessary to define code smells. The reasons for selecting these datasets are that (1) the QC is the largest curated corpus for code analysis studies, with the current version having 495 code sets representing 100 unique systems; the corpus has been successful in that groups outside its original creators now use it, and the number and size of code analysis studies have significantly increased since it became available; and (2) the systems allow metric values to be calculated correctly. Moreover, these data sets are freely available, so researchers can iterate, compare, and evaluate their studies. The selected metrics in the QC systems are at the class and method levels; the set of metrics comprises standard metrics covering different aspects of the code, i.e., complexity, cohesion, size, encapsulation, coupling, and inheritance [4].

Table 2 Summary of project characteristics [4]

5.2 Software metrics

Software metrics play a vital role in building prediction models to improve software quality by predicting as many software defects as possible. Software metrics help identify patterns and indicators associated with software bugs or code smells [34]. They are widely used indicators of software quality, and many studies have shown that these metrics can be used to estimate the presence of vulnerabilities in code [35]. Furthermore, researchers have shown that software metrics can be effectively used to assess software reusability [36]. Software metrics can be classified as static code metrics and process metrics. Static code metrics can be extracted directly from source code. Process metrics can be extracted from the source code management system based on historical changes in the source code over time; these metrics reflect modifications over time, e.g., changes in source code, the number of code changes, developer information, etc. Several researchers in the primary studies used McCabe and Halstead metrics as independent variables in code smell studies. McCabe metrics were first used to characterize code features related to software quality; McCabe considered four basic software metrics: cyclomatic complexity, essential complexity, design complexity, and lines of code. Halstead considered that software metrics fall into three groups: base measures, derived measures, and lines-of-code measures [4]. The computed metrics for all 74 systems of the QC are displayed in Table 3.

Table 3 The computed metrics for all 74 systems of the QC [4]

5.3 Data pre-processing

Pre-processing the collected data is one of the critical stages before constructing the model. To generate a good model, data quality needs to be considered [2]. Not all collected data are suitable for training and model building, and the inputs significantly impact the model's performance and later affect the output. Data pre-processing is a group of techniques applied to the data before model building to improve data quality by removing noise and unwanted outliers, dealing with missing values, converting feature types, etc. The data set used in this study is a clean copy [2, 6]. Normalization is necessary to convert the values into scaled values (scaling the numeric variables to the range of 0 to 1) to increase the model's efficiency [6]. Therefore, the data set was normalized using Min–Max normalization based on the Python framework (TensorFlow). The formula for calculating the normalized score is described in (11).

$$x_{i} = \left( {x_{i} - X\min } \right)/\left( {X\max - X\min } \right)$$
(11)

where \(X\max\) and \(X\min\) represent the maximum and minimum values of the attribute, respectively, and \(x_{i}\) is the value being scaled.
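The normalization step can be reproduced, for example, with scikit-learn's MinMaxScaler or directly from Eq. (11); the toy values below are placeholders.

```python
# Min-Max normalization (Eq. 11) of metric values to the range [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 800.0]])               # toy metric values

X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Equivalent column-wise computation of Eq. (11).
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
assert np.allclose(X_scaled, X_manual)
```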

5.4 Feature selection

Feature Selection (FS) is a crucial step in selecting the most discriminative features from the list of features using appropriate FS methods [37]. FS aims to choose, from high-dimensional features, the features most relevant to the target class and to remove redundant and uncorrelated features [38]. Feature extraction facilitates the conversion of pre-processed data into a form that the classification engine can use [39]. FS methods fall into three categories: (i) filter methods, which are model agnostic, i.e., variables are selected independently of the ML or DL algorithm; these methods are faster and less computationally expensive; (ii) wrapper methods, which are greedy and choose the best feature subset in each iteration according to the ML or DL algorithm; this is a continuous process of finding a feature subset, and these methods are very computationally expensive and often unrealistic if the feature space is vast; (iii) embedded methods, in which FS is part of building the ML or DL algorithm, i.e., feature selection is integrated into the model training process itself [22]. Unlike filter methods, which pre-process data before model training, or wrapper methods, which select features based on the performance of a specific model, embedded methods embed feature selection directly within the model training process and select the best possible feature subset for the ML or DL model to be implemented [40]. In this study, we applied embedded methods because they are faster and less computationally expensive than wrapper methods and fit naturally with ML and DL models; a feature scaling technique was used to bring the outputs to the same standard.
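As an illustrative sketch of an embedded method (not the paper's exact configuration), feature importances learned while fitting a model can be used to keep only the most relevant metrics, e.g., with scikit-learn's SelectFromModel; the random forest estimator, the threshold, and the synthetic data below are assumptions.

```python
# Sketch of embedded feature selection: importances learned during model fitting
# decide which metrics to keep. Estimator, threshold, and data are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.random((420, 61))            # synthetic stand-in: 420 instances, 61 metrics
y = rng.integers(0, 2, size=420)     # synthetic smelly/nonsmelly labels

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",              # keep metrics above the median importance
)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
```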

5.5 Class imbalance

Class imbalance in classification models refers to situations where the number of examples of one class is much smaller than that of another. The class imbalance problem prevents classification models from effectively predicting minority modules [2, 13]. The dataset chosen for code smell detection in this study is highly imbalanced [4]. The original datasets comprise 561 smelly instances and 1119 nonsmelly instances. The first two datasets concern code smells at the class level: God Class (140 smelly and 280 nonsmelly instances) and Data Class (140 smelly and 280 nonsmelly instances). The other two datasets concern code smells at the method level: Feature Envy (140 smelly and 280 nonsmelly instances) and Long Method (141 smelly and 279 nonsmelly instances). To solve the class imbalance problem and increase the realism of the data, we modified the original datasets by changing the distribution of instances with the random oversampling and Tomek Links algorithms. After balancing with random oversampling, the datasets contain 280 smelly and 280 nonsmelly instances for God Class, Data Class, and Feature Envy, and 279 smelly and 279 nonsmelly instances for Long Method. After balancing with Tomek Links, the datasets contain 140 smelly and 263 nonsmelly instances for God Class, 140 smelly and 256 nonsmelly instances for Data Class, 140 smelly and 261 nonsmelly instances for Feature Envy, and 141 smelly and 270 nonsmelly instances for Long Method. Figures 4 and 5 show the distribution of learning instances over the original and balanced datasets.

Fig. 4
figure 4

Distribution of learning instances over the original datasets

Fig. 5
figure 5

Distribution of learning instances over the balanced datasets

5.6 Models building and evaluation

Different ML and DL algorithms are used to build code smell prediction models, and each algorithm has its benefits [2, 12]. An empirical study has been conducted to demonstrate the effectiveness and accuracy of the proposed DL models for code smell detection. The proposed models were trained and tested using a set of large open-source Java projects to obtain more accurate results. The choice of paradigm or framework within a given programming language can significantly impact the development process, development speed, code complexity, and compatibility with other systems [3]. Therefore, after an extensive review of the existing literature, the decision to utilize a dynamically typed language like Python, coupled with a robust open-source DL framework such as TensorFlow, emerges as a prudent and advantageous choice for building and training DL models [2]. Regarding implementation frameworks, we use Keras as a high-level API on top of TensorFlow to build our models for simplicity and correctness; training is performed with 80% of the dataset (randomly selected), while the remaining 20% is used for validation and testing. The architecture of the proposed LSTM and GRU models consists of several components, each playing a crucial role in the model's functionality. First, the input phase uses software metrics as inputs, which serve as the foundational data for the subsequent layers to analyze. Second, the LSTM and GRU layers are the pivotal elements, employing multiple layers of LSTM and GRU units; these layers are designed to capture sequential dependencies within the code, facilitating the model's ability to learn intricate patterns and long-range dependencies. The third component, the output layer, is a dense layer with sigmoid activation responsible for producing the final predictions based on the learned features. In the fourth component, the loss function and optimization phase, MSE is employed as the loss function, while the Adam optimizer is used for efficient parameter updating during training; this combination ensures that the model is effectively trained to minimize errors and enhance predictive accuracy. The fifth component, hyperparameter tuning, plays a critical role in optimizing model performance and generalization: hyperparameters such as the learning rate, batch size, number of layers, hidden units, and dropout regularization between LSTM and GRU layers are tuned to prevent overfitting, optimize the model's performance, and enhance its ability to generalize to unseen data. Each model was developed separately, as shown in Table 4. Moreover, cross-validation is a vital technique in ML and DL used to evaluate the performance and generalizability of predictive models. It involves partitioning a dataset into subsets, typically referred to as folds, and systematically training and evaluating the model multiple times [5, 39]. Cross-validation helps mitigate issues like overfitting, supports hyperparameter tuning and model selection, and provides a more reliable assessment of how well a model will perform on unseen data, ensuring the model's generalization across different subsets of the dataset.
Cross-validation comes in various forms, such as K-Fold Cross-Validation, Stratified K-Fold Cross-Validation, Leave-One-Out Cross-Validation, and Leave-P-Out Cross-Validation, to suit different data characteristics and modeling objectives. K-Fold Cross-Validation and Stratified K-Fold Cross-Validation are the most standard methods [21, 38]. Stratified K-Fold Cross-Validation is a variation of standard K-Fold Cross-Validation that maintains the class distribution in each fold, which is beneficial for imbalanced datasets and addresses the potential issue of imbalanced class distributions [38]. Therefore, we applied Stratified K-Fold Cross-Validation to evaluate the performance of our proposed predictive models. This study used a set of standard performance measures based on the confusion matrix, together with MCC, AUC, AUCPR, and MSE as the loss function.
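The sketch below shows how such a pipeline can be assembled: a small GRU classifier built with Keras, trained with MSE loss and the Adam optimizer, and evaluated with Stratified K-Fold Cross-Validation. The layer sizes, epoch count, and synthetic data are illustrative assumptions; the actual hyperparameter settings are listed in Table 4.

```python
# Sketch: Keras GRU classifier evaluated with Stratified K-Fold Cross-Validation.
# Layer sizes, epochs, and the synthetic data are illustrative assumptions.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow import keras
from tensorflow.keras import layers

def build_gru(n_metrics: int) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(1, n_metrics)),   # one timestep holding the metric vector
        layers.GRU(64),
        layers.Dropout(0.2),                  # dropout regularization
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
    return model

rng = np.random.default_rng(0)
X = rng.random((200, 1, 20)).astype("float32")   # synthetic: 200 instances, 20 metrics
y = rng.integers(0, 2, size=200)

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = build_gru(n_metrics=20)
    model.fit(X[train_idx], y[train_idx], epochs=5, batch_size=16, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    scores.append(acc)
print("mean CV accuracy:", float(np.mean(scores)))
```

An analogous Bi-LSTM model can be obtained by replacing the GRU layer with layers.Bidirectional(layers.LSTM(64)).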

Table 4 Parameters setting of the models

A confusion matrix is a specific table used to measure the performance of a model [6]. It summarizes the results of the testing algorithm and reports (1) true positives (TP), (2) false positives (FP), (3) true negatives (TN), and (4) false negatives (FN). From the values in the confusion matrix, various performance metrics can be derived, such as accuracy, precision, recall, and F-measure, shown in the equations below. These metrics provide insights into the model's strengths and weaknesses, especially in scenarios where class imbalance is present [6, 13].

$${\text{Accuracy}} = \left( {{\text{TP}} + {\text{TN}}} \right)/\left( {{\text{TP}} + {\text{FP}} + {\text{FN}} + {\text{TN}}} \right)$$
(12)
$${\text{Precision}} = {\text{TP}}/\left( {{\text{TP}} + {\text{FP}}} \right)$$
(13)
$${\text{Recall}} = {\text{TP}}/\left( {{\text{TP}} + {\text{FN}}} \right)$$
(14)
$${\text{F - Measure}} = \left( {{2}*{\text{Recall}}*{\text{Precision}}} \right)/\left( {{\text{Recall}} + {\text{Precision}}} \right)$$
(15)

The MCC is a performance metric for binary classification. MCC is used for model evaluation by measuring the difference and describing the correlation between the predicted and actual values [2]. The MCC formula is shown in the equation below:

$${\text{MCC}} = \frac{{\text{TP}} \times {\text{TN}} - {\text{FP}} \times {\text{FN}}}{\sqrt {\left( {\text{TP}} + {\text{FP}} \right)\left( {\text{TP}} + {\text{FN}} \right)\left( {\text{TN}} + {\text{FP}} \right)\left( {\text{TN}} + {\text{FN}} \right)} }$$
(16)

The ROC curve shows the performance of a classification model across all classification thresholds and is plotted based on two parameters, the true positive rate (TPR) and the false-positive rate (FPR); the AUC summarizes this curve [22]. The AUC formula is shown in the equation below:

$${\text{AUC}} = \frac{{\sum\nolimits_{{{\text{ins}}_{i} \in {\text{Positive}}\;{\text{Class}}}} {{\text{rank}}\left( {{\text{ins}}_{i} } \right) - \frac{{M\left( {M + 1} \right)}}{2}} }}{M \cdot N}$$
(17)

where \(\sum\nolimits_{{{\text{ins}}_{i} \in {\text{Positive}}\;{\text{Class}}}} {{\text{rank}}\left( {{\text{ins}}_{i} } \right)}\) it is the sum of the ranks of all positive samples, and M and N are the numbers of positive and negative samples, respectively.

AUCPR is a curve that plots the Precision versus the Recall or a single number summary of the information in the precision–recall curve [2]. The AUCPR formula is shown in the equation below:

$${\text{AUCPR}} = \mathop \smallint \limits_{0}^{1} {\text{Precision}}\left( {\text{Recall }} \right)d\left( {{\text{Recall}}} \right)$$
(18)

MSE is a metric that measures the amount of error in the model. It assesses the average squared difference between the actual and predicted values. The MSE formula is shown in the equation below:

$${\text{MSE}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {x\left( i \right) - y\left( i \right)} \right)^{2}$$
(19)

where n is the number of observations, x(i) is the actual value, and y(i) is the predicted value for the \(i{\text{th}}\) observation.
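For reference, all of these measures can be computed from predicted labels and probabilities with scikit-learn, as in the sketch below; the predictions are placeholder values.

```python
# Computing the evaluation measures of Eqs. (12)-(19) with scikit-learn.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             matthews_corrcoef, roc_auc_score, average_precision_score,
                             mean_squared_error, confusion_matrix)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # placeholder labels
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])    # predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)                           # thresholded predictions

print(confusion_matrix(y_true, y_pred))                        # [[TN, FP], [FN, TP]]
print("accuracy ", accuracy_score(y_true, y_pred))             # Eq. (12)
print("precision", precision_score(y_true, y_pred))            # Eq. (13)
print("recall   ", recall_score(y_true, y_pred))               # Eq. (14)
print("f-measure", f1_score(y_true, y_pred))                   # Eq. (15)
print("MCC      ", matthews_corrcoef(y_true, y_pred))          # Eq. (16)
print("AUC      ", roc_auc_score(y_true, y_prob))              # Eq. (17)
print("AUCPR    ", average_precision_score(y_true, y_prob))    # Eq. (18)
print("MSE      ", mean_squared_error(y_true, y_prob))         # Eq. (19)
```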

6 Experimental results and discussion

In this section, we evaluate the effectiveness and efficiency of our proposed DL models. We experimented with our method by selecting two suitable DL algorithms and testing them on the generated datasets based on four code smells. The experiments were conducted in a Python environment. The performance of the prediction models on the original and balanced datasets is reported in Tables 5, 6, 7 and 8, and Figs. 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, and 23.

Table 5 Evaluation results for the original datasets
Table 6 Evaluation results for the balanced datasets—random oversampling
Table 7 Evaluation results for the balanced datasets—Tomek links
Table 8 Confusion matrix of our best model (GRU Model) on balanced datasets using Tomek links
Fig. 6
figure 6

Training and validation accuracy on the original datasets using the Bi-LSTM model

Fig. 7
figure 7

Training and validation accuracy on the original datasets using the GRU model

Fig. 8
figure 8

Training and validation loss on the original datasets using the Bi-LSTM model

Fig. 9
figure 9

Training and validation loss on the original datasets using the GRU model

Fig. 10
figure 10

Training and validation accuracy on the balanced datasets using Bi-LSTM model-random oversampling

Fig. 11
figure 11

Training and validation accuracy on the balanced datasets using Bi-LSTM model-Tomek links

Table 5 presents the results of our Bi-LSTM and GRU Models on the original datasets in terms of accuracy, precision, recall, F-Measure, MCC, AUC, AUCPR, and MSE. We notice that the accuracy values of the Bi-LSTM model range from 0.95 to 0.98, the precision values range from 0.93 to 1.00, the recall values range from 0.83 to 0.96, the F-Measure values range from 0.90 to 0.96, the MCC values range from 0.88 to 0.94, the AUC values range from 0.97 to 0.99, the AUCPR values range from 0.95 to 0.99, and the MSE values range from 0.023 to 0.044 across all datasets. The accuracy values of the GRU model range from 0.93 to 0.98, the precision values range from 0.86 to 0.97, the recall values range from 0.86 to 0.96, the F-Measure values range from 0.89 to 0.96, the MCC values range from 0.84 to 0.94, the AUC values range from 0.95 to 0.99, the AUCPR values range from 0.89 to 0.99, and the MSE values range from 0.020 to 0.065 across all datasets.

Table 6 presents the results of our Bi-LSTM and GRU Models on the balanced datasets using random oversampling in terms of accuracy, precision, recall, F-Measure, MCC, AUC, AUCPR, and MSE. We notice that the accuracy values of the Bi-LSTM model range from 0.96 to 1.00, the precision values range from 0.94 to 1.00, the recall values range from 0.98 to 1.00, the F-Measure values range from 0.97 to 1.00, the MCC values range from 0.92 to 1.00, the AUC values range from 0.97 to 1.00, the AUCPR values range from 0.96 to 1.00, and the MSE values range from 0.005 to 0.037 across all datasets. The accuracy values of the GRU model range from 0.96 to 1.00, the precision values range from 0.95 to 1.00, the recall values range from 0.98 to 1.00, the F-Measure values range from 0.97 to 1.00, the MCC values range from 0.92 to 1.00, the AUC values range from 0.96 to 1.00, the AUCPR values range from 0.93 to 1.00, and the MSE values range from 0.002 to 0.033 across all datasets.

Table 7 presents the results of our Bi-LSTM and GRU Models on the balanced datasets using Tomek links in terms of accuracy, precision, recall, F-Measure, MCC, AUC, AUCPR, and MSE. We notice that the accuracy values of the Bi-LSTM model range from 0.95 to 0.99, the precision values range from 0.85 to 1.00, the recall values range from 0.87 to 1.00, the F-Measure values range from 0.92 to 0.98, the MCC values range from 0.88 to 0.97, the AUC values range from 0.97 to 0.99, the AUCPR values range from 0.92 to 0.98, and the MSE values range from 0.013 to 0.044 across all datasets. The accuracy values of the GRU model range from 0.96 to 0.99, the precision values range from 0.94 to 1.00, the recall values range from 0.87 to 1.00, the F-Measure values range from 0.93 to 0.98, the MCC values range from 0.90 to 0.97, the AUC values range from 0.98 to 0.99, the AUCPR values range from 0.97 to 0.99, and the MSE values range from 0.018 to 0.038 across all datasets.

Table 8 shows the confusion matrix of our best model, the GRU model, on the balanced datasets using the Tomek links technique. For the God Class dataset, there are 58 actual nonsmelly instances, all of which our model predicted correctly, and 23 actual smelly instances, of which our model predicted 20 correctly. For the Data Class dataset, there are 57 actual nonsmelly instances, of which our model predicted 56 correctly, and 23 actual smelly instances, all of which our model predicted correctly. For the Feature envy dataset, there are 52 actual nonsmelly instances, of which our model predicted 51 correctly, and 29 actual smelly instances, all of which our model predicted correctly. For the Long method dataset, there are 54 actual nonsmelly instances, of which our model predicted 52 correctly, and 29 actual smelly instances, all of which our model predicted correctly.

Figures 6, 7, 8 and 9 show the training and validation (accuracy and loss) of the proposed models on the original datasets.

Figures 6 and 7 show the training and validation accuracy of the models on the original datasets. The vertical axis presents the accuracy of the models, and the horizontal axis illustrates the number of epochs. Accuracy is the fraction of predictions that our models got right.

Figure 6 shows the accuracy values of the Bi-LSTM model. As shown in the figure, the model achieved 95% accuracy for God Class, 95% for Data Class, 95% for Feature envy, and 98% for Long method at the 100th epoch.

Figure 7 shows the accuracy values of the GRU model. As shown in the figure, the model achieved 93% accuracy for God Class, 96% for Data Class, 93% for Feature envy, and 98% for Long method at the 100th epoch.

Figures 8 and 9 show the training and validation loss of the models on the original datasets. The vertical axis presents the loss of the models, and the horizontal axis illustrates the number of epochs. The loss indicates how bad a model's prediction was.

Figure 8 shows the loss values of the Bi-LSTM model. As shown in the figure, the model loss is 0.035 for God Class, 0.037 for Data Class, 0.044 for Feature envy, and 0.023 for Long method at the 100th epoch.

Figure 9 shows the loss values of the GRU model. As shown in the figure, the model loss is 0.063 for God Class, 0.026 for Data Class, 0.065 for Feature envy, and 0.020 for Long method at the 100th epoch.

As shown in these figures, the training and validation accuracy increases and the loss decreases with increasing epochs. Given the high accuracy and low loss obtained by the proposed models, we note that both models are well trained and validated. Additionally, we note that the models fit approximately perfectly, with no overfitting or underfitting.

Figures 10, 11, 12, 13, 14, 15, 16 and 17 show the training and validation (accuracy and loss) of the proposed models on the balanced datasets.

Fig. 12
figure 12

Training and validation accuracy on the balanced datasets using GRU model-random oversampling

Fig. 13
figure 13

Training and validation accuracy on the balanced datasets using GRU model-Tomek links

Fig. 14
figure 14

Training and validation loss on the balanced datasets using Bi-LSTM model-random Oversampling

Fig. 15
figure 15

Training and validation loss on the balanced datasets using Bi-LSTM model-Tomek links

Fig. 16
figure 16

Training and validation loss on the balanced datasets using GRU model-random oversampling

Fig. 17
figure 17

Training and validation loss on the balanced datasets using GRU model-Tomek links

Figures 10, 11, 12 and 13 show the training and validation accuracy of the models on the balanced datasets. The vertical axis presents the accuracy of the models, and the horizontal axis illustrates the number of epochs. Accuracy is the fraction of predictions that the models got right.

Figure 10 shows the accuracy values of the Bi-LSTM model with the Random Oversampling technique. As shown in the figure, the model achieved 96% accuracy for God Class, 99% for Data Class, 96% for Feature envy, and 100% for Long method at the 100th epoch.

Figure 11 shows the accuracy values of the Bi-LSTM model with the Tomek links technique. As shown in the figure, the model achieved 96% accuracy for God Class, 95% for Data Class, 98% for Feature envy, and 99% for Long method at the 100th epoch.

Figure 12 shows the accuracy values of the GRU model with the Random Oversampling technique. As shown in the figure, the model achieved 96% accuracy for God Class, 98% for Data Class, 97% for Feature envy, and 100% for Long method at the 100th epoch.

Figure 13 shows the accuracy values of the GRU model with the Tomek links technique. As shown in the figure, the model achieved 96% accuracy for God Class, 99% for Data Class, 99% for Feature envy, and 98% for Long method at the 100th epoch.

Figures 14, 15, 16 and 17 show the training and validation loss of the models on the balanced datasets. The vertical axis presents the loss of the models, and the horizontal axis illustrates the number of epochs. The loss indicates how bad a model's prediction was.

Figure 14 shows the loss values of the Bi-LSTM model with the Random Oversampling technique. As shown in the figure, the model loss is 0.035 for God Class, 0.006 for Data Class, 0.037 for Feature envy, and 0.005 for Long method at the 100th epoch.

Figure 15 shows the loss values of the Bi-LSTM model with the Tomek links technique. As shown in the figure, the model loss is 0.037 for God Class, 0.044 for Data Class, 0.020 for Feature envy, and 0.013 for Long method at the 100th epoch.

Figure 16 shows the loss values of the GRU model with the Random Oversampling technique. As shown in the figure, the model loss is 0.033 for God Class, 0.023 for Data Class, 0.032 for Feature envy, and 0.002 for Long method at the 100th epoch.

Figure 17 shows the loss values of the GRU model with the Tomek links technique. As shown in the figure, the model loss is 0.038 for God Class, 0.018 for Data Class, 0.021 for Feature envy, and 0.025 for Long method at the 100th epoch.

As shown in these figures, the training and validation accuracy increases and the loss decreases with increasing epochs. Given the high accuracy and low loss obtained by the proposed models, we note that both models are well trained and validated. Additionally, we note that the models fit approximately perfectly, with no overfitting or underfitting.

Figures 18 and 19 show the ROC curves for both models on the original datasets. The vertical axis presents the true positive rate of the models, and the horizontal axis illustrates the false-positive rate. The AUC indicates the performance of the model: the larger the AUC, the better the model performance. Based on these figures, the values are very encouraging and indicate our proposed models' efficiency in detecting code smells.

Fig. 18
figure 18

ROC curves for the original datasets—Bi-LSTM model

Fig. 19
figure 19

ROC curves for the original datasets—GRU model

Figure 18 shows the AUC values of the Bi-LSTM model. As shown in the figure, the AUC values are 99% on God Class, 99% on Data Class, 95% on Feature envy, and 99% on the Long method.

Figure 19 shows the AUC values of the GRU model. As shown in the figure, the AUC values are 97% on God Class, 99% on Data Class, 89% on Feature envy, and 99% on the Long method.

Figures 20, 21, 22 and 23 show the ROC curves for both models on the balanced datasets. The vertical axis presents the true positive rate of the models, and the horizontal axis illustrates the false-positive rate. The AUC indicates the performance of the model: the larger the AUC, the better the model performance.

Fig. 20
figure 20

ROC curves for the balanced datasets—Bi-LSTM model-random oversampling

Fig. 21
figure 21

ROC curves for the balanced datasets—Bi-LSTM model-Tomek links

Fig. 22
figure 22

ROC curves for the balanced datasets—GRU Model-random oversampling

Fig. 23
figure 23

ROC curves for the balanced datasets—GRU Model-Tomek links

Figure 20 shows the AUC values of the Bi-LSTM model with the Random Oversampling technique. As shown in the figure, the AUC values are 98% on God Class, 100% on Data Class, 97% on Feature envy, and 100% on the Long method.

Figure 21 shows the AUC values of the Bi-LSTM model with the Tomek links technique. As shown in the figure, the AUC values are 98% on God Class, 97% on Data Class, 99% on Feature envy, and 98% on the Long method.

Figure 22 shows the AUC values of the GRU model with the Random Oversampling technique. As shown in the figure, the AUC values are 96% on God Class, 99% on Data Class, 97% on Feature envy, and 100% on the Long method.

Figure 23 shows the AUC values of the GRU model with the Tomek links technique. As shown in the figure, the AUC values are 98% on God Class, 99% on Data Class, 99% on Feature envy, and 99% on the Long method.

6.1 Results of RQ1

6.1.1 RQ1 Do data balancing techniques improve DL models’ accuracy in detecting code smells?

To answer RQ1, we test our proposed models on four types of code smells. The performance of the prediction models for the four code smells datasets is reported in Figs. 24, 25 and 26, and Tables 9 and 10.

Fig. 24
figure 24

Boxplots representing performance measures obtained by models on the original datasets

Fig. 25
figure 25

Boxplots representing performance measures obtained by models on the balanced datasets-random oversampling

Fig. 26
figure 26

Boxplots representing performance measures obtained by models on the balanced datasets-Tomek links

Table 9 Comparison of the proposed models in terms of accuracy using paired t-test- based on the original and balanced datasets (using random oversampling)
Table 10 Comparison of the proposed models in terms of accuracy using paired t-test- based on the original and balanced datasets (using Tomek Links)

Boxplots are very useful for describing the distribution of results and providing raw results for comparing different techniques. Therefore, we aggregated the achieved results to get a more accurate overview of the quality of the results using boxplots.

Figure 24 shows the boxplots of the performance measures on the original datasets. For the Bi-LSTM model, the highest accuracy is 98% (Long method) and the lowest is 95% (God Class, Data Class, and Feature envy); the highest precision is 100% (Data Class) and the lowest is 93% (Feature envy); the highest recall is 96% (Long method) and the lowest is 83% (Data Class); the highest F-measure is 96% (Long method) and the lowest is 90% (Data Class); the highest MCC is 94% (Long method) and the lowest is 88% (Data Class); the highest AUC is 99% (God Class, Data Class, and Long method) and the lowest is 97% (Feature envy); and the highest AUCPR is 99% (God Class, Data Class, and Long method) and the lowest is 95% (Feature envy).

In contrast, for the GRU model, the highest accuracy is 98% (Long method) and the lowest is 93% (God Class and Feature envy); the highest precision is 97% (God Class) and the lowest is 86% (Feature envy); the highest recall is 96% (Data Class and Long method) and the lowest is 86% (God Class); the highest F-measure is 96% (Long method) and the lowest is 89% (Feature envy); the highest MCC is 94% (Long method) and the lowest is 84% (Feature envy); the highest AUC is 99% (Data Class and Long method) and the lowest is 95% (Feature envy); and the highest AUCPR is 99% (Data Class and Long method) and the lowest is 89% (Feature envy).

Figure 25 shows the boxplots of the performance measures on the balanced datasets using random oversampling. For the Bi-LSTM model with random oversampling, the highest accuracy is 100% (Long method) and the lowest is 96% (God Class and Feature envy); the highest precision is 100% (Long method) and the lowest is 94% (Feature envy); the highest recall is 100% (Data Class, Feature envy, and Long method) and the lowest is 98% (God Class); the highest F-measure is 100% (Long method) and the lowest is 97% (God Class and Feature envy); the highest MCC is 100% (Long method) and the lowest is 92% (God Class and Feature envy); the highest AUC is 100% (Data Class and Long method) and the lowest is 97% (Feature envy); and the highest AUCPR is 100% (Data Class and Long method) and the lowest is 96% (Feature envy).

In contrast, for the GRU model with random oversampling, the highest accuracy is 100% (Long method) and the lowest is 96% (God Class); the highest precision is 100% (Long method) and the lowest is 95% (God Class and Feature envy); the highest recall is 100% (Feature envy and Long method) and the lowest is 98% (God Class and Data Class); the highest F-measure is 100% (Long method) and the lowest is 97% (God Class); the highest MCC is 100% (Long method) and the lowest is 92% (God Class); the highest AUC is 100% (Long method) and the lowest is 96% (God Class); and the highest AUCPR is 100% (Long method) and the lowest is 93% (God Class).

Figure 26 shows the boxplots of the performance measures on the balanced datasets using Tomek links. For the Bi-LSTM model with Tomek links, the highest accuracy is 99% (Long method) and the lowest is 95% (Data Class); the highest precision is 100% (God Class) and the lowest is 85% (Data Class); the highest recall is 100% (Data Class and Long method) and the lowest is 87% (God Class); the highest F-measure is 98% (Long method) and the lowest is 92% (Data Class); the highest MCC is 97% (Long method) and the lowest is 88% (Data Class); the highest AUC is 99% (Feature envy) and the lowest is 97% (Data Class); and the highest AUCPR is 98% (Feature envy) and the lowest is 92% (Data Class).

In contrast, for the GRU model with Tomek links, the highest accuracy is 99% (Data Class and Feature envy) and the lowest is 96% (God Class); the highest precision is 100% (God Class) and the lowest is 94% (Long method); the highest recall is 100% (Data Class, Feature envy, and Long method) and the lowest is 87% (God Class); the highest F-measure is 98% (Data Class and Feature envy) and the lowest is 93% (God Class); the highest MCC is 97% (Data Class and Feature envy) and the lowest is 90% (God Class); the highest AUC is 99% (Data Class, Feature envy, and Long method) and the lowest is 98% (God Class); and the highest AUCPR is 99% (Data Class, Feature envy, and Long method) and the lowest is 97% (God Class).

Table 9 presents the statistical analysis results (paired t-test) of the proposed models on the original and balanced datasets (using random oversampling) in terms of mean, standard deviation (STD), minimum, maximum, and p value. The mean accuracy of the Bi-LSTM model is 0.95 on the original datasets and 0.97 on the balanced datasets; the mean accuracy of the GRU model is likewise 0.95 and 0.97. The STD of the Bi-LSTM model is 0.01 on the original datasets and 0.02 on the balanced datasets, whereas the STD of the GRU model is 0.02 and 0.01, respectively. The minimum of the Bi-LSTM model is 0.95 on the original datasets and 0.96 on the balanced datasets, and the minimum of the GRU model is 0.93 and 0.96. The maximum of both models is 0.98 on the original datasets and 1.00 on the balanced datasets. The p value is 0.06 for the Bi-LSTM model and 0.01 for the GRU model. Since the p value of the GRU model is below 0.05, there is a statistically significant difference between its results on the original and balanced datasets.

Table 10 presents the statistical analysis results (paired t-test) of the proposed models on the original and balanced datasets (using Tomek links) in terms of mean, standard deviation (STD), minimum, maximum, and p value. The mean accuracy of the Bi-LSTM model is 0.95 on the original datasets and 0.97 on the balanced datasets, while the mean accuracy of the GRU model is 0.95 and 0.98. The STD of the Bi-LSTM model is 0.01 on both the original and balanced datasets, whereas the STD of the GRU model is 0.02 and 0.01, respectively. The minimum of the Bi-LSTM model is 0.95 on both the original and balanced datasets, and the minimum of the GRU model is 0.93 and 0.96. The maximum of both models is 0.98 on the original datasets and 0.99 on the balanced datasets. The p value is 0.14 for the Bi-LSTM model and 0.09 for the GRU model. Since both p values are above 0.05, the difference between the results on the original and balanced datasets is not statistically significant in this case.
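For clarity, the sketch below illustrates the kind of paired t-test summarized in Tables 9 and 10, using scipy; the accuracy values are placeholders for the four smell datasets, not the exact values of the study.

```python
# Sketch: paired t-test comparing a model's accuracy on the original and
# balanced datasets, as summarised in Tables 9 and 10. The accuracy values
# below are placeholders for the four smell datasets (God Class, Data Class,
# Feature envy, Long method).
import numpy as np
from scipy.stats import ttest_rel

acc_original = np.array([0.95, 0.95, 0.95, 0.98])   # placeholder: original data
acc_balanced = np.array([0.96, 0.97, 0.96, 1.00])   # placeholder: balanced data

t_stat, p_value = ttest_rel(acc_original, acc_balanced)
print(f"mean original = {acc_original.mean():.2f}, "
      f"mean balanced = {acc_balanced.mean():.2f}, p = {p_value:.2f}")

# A p value below 0.05 is read as a statistically significant difference
# between the two sets of results.
```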

6.2 Results of RQ2

6.2.1 RQ2 Which data balancing technique is the most effective at improving the accuracy of DL techniques?

To answer RQ2, the results of the models on the four code smell datasets are reported in Table 11; the best values are indicated in bold.

Table 11 Comparison of proposed models based on balanced datasets using the average of many performance measures

6.3 Results of RQ3

6.3.1 RQ3 Does the proposed method outperform state-of-the-art methods in detecting code smells?

To answer RQ3, the results produced by our models and those reported in previous studies are summarized in Tables 12 and 13. We compared our results with those of previous studies using two performance measures: accuracy and AUC. Table 12 reports the comparison based on accuracy, and Table 13 reports the comparison based on AUC. The best values are indicated in bold, and "-" marks approaches that did not use data balancing techniques or did not report results for a particular dataset. According to Tables 12 and 13, some results of the previous studies are better than ours, but in most cases our method outperforms the other state-of-the-art approaches and provides better predictive performance.

Table 12 Comparison of the proposed models with other existing approaches based on the accuracy
Table 13 Comparison of the proposed models with other existing approaches based on AUC

Implications of the findings Our results have implications for researchers and practitioners in code smell detection who are interested in quantitatively understanding the effectiveness and efficiency of applying data balancing techniques with DL models; researchers are also concerned with the qualitative perspective of the results. To summarize the main findings with respect to the research questions, we discuss implications related to effectiveness, efficiency, and comparison with previous work, as follows:

Concerning RQ1, we observe from Figs. 24, 25 and 26 and Tables 9 and 10 that the proposed models perform better on the balanced datasets than on the original datasets. This indicates that data balancing techniques enhance the accuracy of DL models in detecting code smells.

Concerning RQ2, Table 11 shows that the results obtained on the balanced datasets using random oversampling are better than the results obtained using Tomek links. This suggests that random oversampling is the most effective technique for improving the accuracy of DL techniques in the detection of code smells.
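The sketch below illustrates how the two balancing techniques compared in RQ2 can be applied with the imbalanced-learn library; X and y stand for the code-metric features and smell labels of one dataset, and the synthetic data here is only a placeholder.

```python
# Sketch: applying the two data balancing techniques compared in RQ2 with
# imbalanced-learn. X is assumed to be the matrix of code metrics and y the
# smelly / non-smelly labels of one code smell dataset.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification

# Placeholder imbalanced data standing in for a code smell dataset.
X, y = make_classification(n_samples=500, weights=[0.85, 0.15], random_state=42)

X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X, y)  # duplicate minority instances
X_tl, y_tl = TomekLinks().fit_resample(X, y)                          # remove Tomek-link pairs

print("original:", Counter(y), "oversampled:", Counter(y_ros), "Tomek links:", Counter(y_tl))
```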

Regarding RQ3, the comparison of our proposed models with existing approaches, based on the balanced datasets obtained with random oversampling and Tomek links, is presented in Tables 12 and 13. The differences between the results indicate that our proposed models achieve better average accuracy than the existing approaches.

Threats to validity This subsection discusses the threats to the validity of our study and the limitations of the experiments, and how we mitigate them. We assess construct, internal, and external validity, as well as the experiment limitations, in particular constraints on the search process and deviations from standard practice.

Construct validity concerns the study's design and its ability to reflect the actual goal of the research. To avoid threats in the study design, we applied a systematic literature review procedure. To ensure that the research area is relevant to the study goal, we cross-checked the research questions and adjusted them several times to address the business needs. Another threat is the construction of the DL models, for which we considered several aspects that could have influenced the study, such as data pre-processing, which features to consider, and how to train the models. However, the procedures followed in this respect are precise enough to ensure the study's validity.

Threats to internal validity are related to the correctness of the experiment's outcome and the study's process. The main threat to internal validity is the datasets. The datasets used in our study are constructed from the datasets published by Arcelli Fontana et al. [4]. These reference datasets are imbalanced and do not reflect the real distribution of code smells and metrics. We mitigate this threat by modifying the class distribution of the original datasets with two data sampling methods, making the data more realistic in terms of the actual presence of smells in software systems.

External validity concerns the possibility of generalizing the study to a broader range of applications. We used four code smell datasets constructed from 74 open-source Java projects for our experimentation. However, we cannot claim that our results generalize to other programming languages, practitioners, or industrial practices. Future replications of this study are necessary to confirm the generalizability of our findings.

The limitations of the experiments are summarized as follows. First, the number of code smells used in our experiments is limited to two class-level and two method-level smells. Second, our findings may not generalize to all software in the industrial domain.

7 Conclusion

Code smells in software systems indicate problems that can negatively affect software quality and make it hard to maintain, reuse, and expand the software. Detecting code smells is therefore essential to enhance software quality, improve maintainability, and reduce the risk of system failure. This study presented a methodology based on DL algorithms combined with data balancing techniques and software metrics to detect code smells in software projects, considering four types of code smells over a dataset of 6,785,568 lines of code from 74 open-source software systems. We applied two data balancing techniques, random oversampling and Tomek links, to address the data imbalance problem, then conducted a set of experiments using two DL models on Java projects from different application domains and evaluated smell detection accuracy using various performance measures. The results of the proposed models were compared with state-of-the-art approaches in code smell detection. On average, the results obtained by both models on the balanced datasets using random oversampling were better than the results on the original datasets by 2%, while the results obtained on the balanced datasets using Tomek links were better than the results on the original datasets by 3%; the best average accuracy was thus obtained on the balanced datasets, and the models relying on random oversampling achieved the best overall performance. The experimental results show that data balancing techniques can improve the performance of DL models for code smell detection, and that the proposed models significantly improve the average accuracy compared with recent work on code smell detection. In future work, we will evaluate our models on additional datasets and investigate their accuracy in detecting other code smell types. We also intend to combine more ML and DL algorithms with data balancing techniques to further improve the accuracy of code smell detection.