1 Introduction

Refactoring involves changing the internal structure of existing object-oriented code without altering its external behavior [1]. The key goal of such internal restructuring is to enhance code quality attributes, such as comprehensibility and maintainability [2, 3]. Fowler [1] proposed several refactoring scenarios and explained how they can be performed. One of the proposed scenarios is move method refactoring (MMR), which involves moving a method from one class to the class in which the method is used most. In practice, MMR is one of the most frequently performed refactoring scenarios [4,5,6], and it has a great impact on class design. Therefore, it is necessary to carefully investigate whether such refactoring indeed achieves its intended goal, which is quality improvement.
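
To make the scenario concrete, consider the minimal sketch below (the studied systems are Java, but the sketch uses Python for brevity, and all class and method names are invented): a method that mostly manipulates another class's data is moved to that class, and a delegation method is left behind so that existing callers are unaffected.

```python
# Before MMR: Order.shipping_cost uses two Customer fields but only one
# Order field, so Customer is the class in which the method is used most.
class Customer:
    def __init__(self, region: str, is_premium: bool):
        self.region = region
        self.is_premium = is_premium

class Order:
    def __init__(self, customer: Customer, weight_kg: float):
        self.customer = customer
        self.weight_kg = weight_kg

    def shipping_cost(self) -> float:
        base = 0.0 if self.customer.is_premium else 4.99
        rate = 1.5 if self.customer.region == "remote" else 1.0
        return base + rate * self.weight_kg

# After MMR: the method lives in Customer, and Order keeps a small
# delegation method so the external behavior is unchanged.
class CustomerRefactored:
    def __init__(self, region: str, is_premium: bool):
        self.region = region
        self.is_premium = is_premium

    def shipping_cost(self, weight_kg: float) -> float:
        base = 0.0 if self.is_premium else 4.99
        rate = 1.5 if self.region == "remote" else 1.0
        return base + rate * weight_kg

class OrderRefactored:
    def __init__(self, customer: CustomerRefactored, weight_kg: float):
        self.customer = customer
        self.weight_kg = weight_kg

    def shipping_cost(self) -> float:
        return self.customer.shipping_cost(self.weight_kg)  # delegation

order = OrderRefactored(CustomerRefactored("remote", False), 2.0)
print(order.shipping_cost())  # 7.99, the same result as before the move
```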

Al Dallal and Abdin [7] performed a systematic literature review (SLR) to examine the impact of refactoring on code quality. They identified several general limitations in existing studies and recommended performing additional empirical studies to fill some of the research gaps and overcome the identified limitations. One of the key limitations is that most of the existing studies did not statistically explore the significance of the change in quality when refactoring is performed. This limitation potentially makes the obtained results less reliable and trustworthy in terms of determining whether refactoring achieves its main goal of improving code quality. They also found that only 10 of 76 studies (13%) applied statistical techniques to study the significance of the change in quality values. The SLR included 15 studies exploring the impact of MMR on quality, and none of these studies applied a statistical technique to study the significance of the change in quality values when MMR was applied. Instead, these studies reported only the amount or percentage of difference, which makes their conclusions questionable and decreases confidence in whether it is worthwhile, from a quality perspective, to perform MMR at all. Comparing two sets of values by simply subtracting corresponding values from each other carries no statistical meaning. Instead, researchers are advised to apply statistical techniques to compare the sets of quality values before and after refactoring and report the statistical significance of the changes [7]. One of the key factors that affects statistical significance results is the sample size. Relatively small applications with relatively few considered refactoring cases may cause the findings to be statistically insignificant. The direct impact of a single MMR activity on quality might be limited to the few classes involved in the refactoring activity. Therefore, it is essential to apply MMR to a relatively large number of classes to study the statistical significance of the change in quality values.

In general, to perform a study that explores the impact of a certain refactoring activity on code quality, researchers typically follow one of two approaches. The first approach involves applying an existing or proposed technique or tool to identify refactoring opportunities in the system selected for study, applying the considered refactoring activity, and comparing the quality of the code before and after refactoring. This approach suffers from the limitation that, in some cases, there may be a lack of agreement on the correctness of the applied techniques or tools. In most cases, studies in which developers checked the correctness of the refactoring decisions based on the applied techniques and tools showed that the decisions were not 100% correct. For example, Bavota et al. [8] asked the developers of two systems under study whether they agreed to apply the refactoring suggestions recommended by an existing tool and a proposed technique. The results for the existing tool were as follows: 27% answered either no or definitely no, and 58% answered maybe. The results for the proposed technique were as follows: 33% answered either no or definitely no, and 26% answered maybe. When such “doubtful” refactoring activities are performed and the quality is compared before and after refactoring, the results become unreliable. Alternatively, the second approach is based on mutating existing code by applying refactoring activities to code that supposedly does not require refactoring and then comparing the quality before and after mutation. In this case, the code after mutation is considered the code in need of refactoring, and this code is expected to have worse quality than the code before mutation. The only threat to validity for this approach arises when the piece of code selected for mutation actually needs refactoring and is, by chance, mutated in exactly the way in which the code should actually be refactored. In this rare case, the code in need of refactoring is the original code rather than the mutated code. We found that all of the existing studies on the impact of MMR on quality follow the first approach, and none of them applied the second, which has a better chance of producing correct refactoring decisions.

Given the limitations of existing studies and the need for replicated studies to support or refute existing results [9,10,11], we performed an empirical study to investigate the impact of MMR on internal code quality. The study involved seven systems with a total of approximately 4 K classes. In addition, we considered 30 measures covering eight internal quality attributes and different aspects of those attributes. We mutated the systems by moving methods from their proper classes to other classes; the mutated systems are thus those in need of MMR. We then applied the Wilcoxon paired test [12] to statistically explore the significance of the change in quality values when the mutation was applied. In addition, we applied two machine learning (ML) techniques, namely, logistic regression analysis [13] and Chi-square Automatic Interaction Detection (CHAID) analysis [14], to the mutated code to build MMR prediction models and explore whether the selected measures, considered individually or in combination, significantly contribute to MMR prediction.

The major contributions of this paper are as follows:

  1. An empirical exploration of whether MMR significantly improves object-oriented code quality, which is quantified using several measures and considers key quality attributes.

  2. An ML-based investigation of the ability of quality measures, considered individually or in combination, to predict classes that include methods in need of MMR.

This paper is organized as follows: Sect. 2 provides an overview of related work. Section 3 explains how the empirical study is designed. Section 4 reports and discusses the Wilcoxon analysis results. Sections 5 and 6 present and discuss the MMR ML-based prediction results. Section 7 discusses the empirical study’s threats to validity. Finally, Sect. 8 concludes the paper and outlines possible future work.

2 Related Work

In this section, we provide an overview of the considered internal quality attributes and measures and discuss relevant research exploring the relationship between MMR and code quality.

2.1 Internal Quality Attributes

Software quality attributes are classified into internal and external attributes [15]. Internal quality attributes, such as size, coupling, and cohesion, are those that can be measured using software artifacts. On the other hand, external quality attributes, such as maintainability, reusability, and reliability, require knowledge of the environment to be measured. Table 1 provides definitions for 30 quality measures of eight internal quality attributes. These quality attributes were addressed by Bansiya and Davis [16], and they include size, coupling, cohesion, inheritance, messaging, encapsulation, composition, and complexity.

Table 1 Definitions of the considered internal quality measures (adapted from [30])

The most widely used sets of internal quality measures, including Chidamber and Kemerer (CK) [17] and the Quality Model for Object-Oriented Design (QMOOD) [16], are included in Table 1. The list includes additional measures to cover a wider range of the selected quality attributes’ aspects. For the size attribute, NOA and NOOA are added to consider size in terms of attributes. In addition, NOP and NOOP are added to consider size in terms of method parameters. For coupling, Ca and Ce are added to differentiate between afferent and efferent coupling, and IC and CBM are added to consider the coupling caused by inherited class members. For cohesion, LCOM2 and Coh are listed to consider cohesion based on the accessibility of the attributes by the methods in the class. LSCC, CC, and SCOM are added because they measure the degree of cohesion between each pair of methods caused by the sharing of attributes. NHD considers cohesion based on the sharing of method parameter types. The LSMC, TMPC, and MCoh measures are MMR indicators designed based on cohesion measures, and the considered inheritance measures, DIT, NOC, and MFA, cover different inheritance aspects. The remaining measures, related to the other attributes, come from the two widely used sets, CK and QMOOD.
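
To illustrate how such measures are computed, the sketch below implements LCOM1 under one common definition, the number of method pairs that share no attributes (definitions of the LCOM variants differ across the literature, so this is illustrative rather than the exact formulation used in Table 1):

```python
from itertools import combinations

def lcom1(method_attrs: dict[str, set[str]]) -> int:
    """Number of method pairs that reference no attribute in common
    (higher values indicate a less cohesive class)."""
    return sum(
        1
        for m1, m2 in combinations(method_attrs, 2)
        if not (method_attrs[m1] & method_attrs[m2])
    )

# Toy class: methods a and b share attribute x; method c shares nothing.
print(lcom1({"a": {"x"}, "b": {"x", "y"}, "c": {"z"}}))  # -> 2
```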

2.2 Investigating the Impact of MMR on Code Quality

Analysis of object-oriented systems is typically performed at several levels, including the system, package, cluster, class, and method levels [31]. Fowler [1] introduced several refactoring scenarios that are applicable at different levels. MMR is a refactoring scenario applicable at the class level, where a method is moved from one class to another. ML-based techniques have been widely applied to predict refactoring opportunities at the method and class levels (e.g., [32,33,34,35]).

Several studies have explored the impact of MMR on code quality. In Table 2, we list the studies along with the considered quality attributes and measures, and we provide the number of applications used in each study. Table 3 lists the quality attributes considered in the studies listed in Table 2. For each attribute, the table reports the number of studies in which the attribute was considered and the total number of applications used in these studies to explore the impact of MMR on the quality attribute. In addition, the last three columns of Table 3 summarize the key results reported in the studies, where “+,” “−,” and “0” indicate an increase, a decrease, and no change in the quality attribute values, respectively. In these columns, the number of votes is reported, where a vote is given for each measure and each application. For example, two studies considered the maintainability quality attribute (i.e., [36, 37]). The former used one maintainability measure applied to three applications, and the values of the measure increased when MMR was applied to each application; therefore, the study has three “+” votes for maintainability. The latter used one maintainability measure applied to two applications, and the values of the measure increased when MMR was applied to each application; therefore, the study has two “+” votes for maintainability. Accordingly, the total number of “+” votes is five, as reported in Table 3. When a study combines the results for all of the considered applications, as in [51], only one application is added to the third column of Table 3, and a vote is given to each measure.

Table 2 Summary of MMR-related studies
Table 3 Summary of the results reported in the existing studies

To explore the impact of MMR on code quality, the existing studies considered several quality attributes and several measures. As shown in Table 3, these attributes include maintainability, coupling, cohesion, complexity, size, inheritance, composition, and information hiding. Although the considered attributes seem diverse, the considered measures are very limited. For example, several existing approaches measure cohesion, but the existing studies considered measures that follow only two of these approaches and ignored the rest. This omission means that the reported results did not account for some cohesion aspects. The same observation applies to the considered coupling, size, and complexity measures. Only seven of the 19 studies that considered MMR examined relatively large applications; the rest considered only small- to medium-sized applications. We found that seven studies considered small applications only, which raises questions about the reliability of the obtained results.

All of the studies listed in Table 2 used the same study design. This approach involves applying tools or proposed techniques to identify MMR opportunities, applying MMR to the identified opportunities, and comparing the values of the considered quality measures before and after MMR application. This approach has the limitations discussed in Sect. 1. Table 2 shows that most of the studies used a small number of applications, and Table 3 shows that most of the quality attributes were considered by only one or two studies. Most of the studies considered the cohesion and coupling attributes, and a considerable number of studies accounted for the complexity attribute. The same observation applies to the number of applications used. Most of the results show that MMR potentially improves maintainability, coupling (shown by the decrease in the coupling values), and cohesion. The results for the other attributes are either inconsistent or rely on a very small number of votes. However, all of the studies listed in Table 2 compared the values of the measures before and after performing MMR without investigating whether the changes in the quality values were statistically significant, which raises the question of whether applying MMR was useful in terms of improving code quality and whether software engineers can rely on the number of votes reported in Table 3.

Fernandes et al. [50] performed an extensive investigation of the relationship between different refactoring methods, including MMR, and internal quality metrics. While the novelty of their work lies in examining both refactoring and re-refactoring efforts, the aggregated manner in which they report the results raises further questions and leaves the results open to alternative interpretations. They report whether an attribute improved using two methods. The first is the “most” method, in which the refactoring is considered to have improved the internal quality metrics of an attribute if an improvement is observed in most of the metrics comprising that attribute. The second is the “at least one” method, in which an attribute is considered improved if an improvement is observed in at least one of the metrics comprising the attribute. The validity of both methods is questionable given that they may report conflicting results, and neither considers the magnitude of improvement in individual metrics.

Hamdi et al. [51] explored the impact of several refactoring scenarios on internal quality attributes using 300 open-source Android applications. The refactoring scenarios were considered individually and MMR was one of them. The size of the selected applications was not provided. In addition, the applications were not considered individually in the analysis. Instead, the results from all applications were combined and used in the statistical analysis as a unit.

Indeed, the close association of refactoring and MMR with quality has resulted in studies examining this association from different perspectives. For example, some studies predicted code quality based on refactoring activities, such as the RIPE technique introduced by Chaparro et al. [53]. Other studies explored developers’ intentions to engage in refactoring actions, including MMR, and found that developers recognize the improvement of quality metrics as the main driver for refactoring [54]. MMR was found to be most used by developers in refactoring activities targeting code reusability [55], code readability [56], and the feature envy code smell [57], addressing self-admitted technical debt [58], and improving energy consumption [59, 60]. Developers should be aware of some of the unwanted consequences of using MMR, such as breaking the class API in some instances, especially when MMR is used in changes that introduce new features or bug fixes [61]. In some instances, MMR can also introduce bad code smells [62]. AlOmar et al. [63] performed a systematic literature mapping and found the behavior preservation of certain refactoring scenarios, including MMR, to be understudied.

Several studies considered MMR but did not explore its impact on code quality (e.g., [4, 25, 64]). Abid et al. [65] found that MMR improved not only quality metrics but also security metrics. Al Dallal [66] reported an SLR of studies related to determining refactoring opportunities, including MMR and other refactoring activities. MMR was predicted to occur with changes that target bug fixing, introduce a new feature, or perform general maintenance [67, 68]. It was also found to occur in association with increases in CBO, LOC, and LCOM [58].

Couto et al. [69] proposed an approach that recommends MMR refactoring operations to improve QMOOD metrics. Aniche et al. [70] built a model that leveraged code, process, and ownership metrics to predict the use of MMR to maintain or improve quality. Shahidi et al. [71] built a model that leveraged network-based coupling and cohesion measures to identify and fix long methods using MMR. Liu et al. [72] built synthetic training data for a refactoring model where they deliberately injected code smells into existing code to train models. Their work resulted in a model that suggests using MMR to fix feature envy.

Some of the refactoring tools that performed MMR were fully automated. Rebai et al. [73] built a genetic algorithm-based tool that improved QMOOD metrics based on a recommendation of sequences that typically include MMR. Morales et al. [59] built an automated refactoring tool that utilized MMR in many instances to improve the effectiveness, extensibility, and power consumption of Android apps. Wang et al. [74] built a tool that mostly utilized move and extract operations, including MMR, to improve cohesion and coupling, resulting in improved code understandability, flexibility, reusability, and maintainability.

3 Overall Research Methodology

This empirical study aims to explore the impact of MMR on several software internal quality attributes using ML-based techniques. This section lists the research questions and provides an overview of the data collection, manipulation, and extraction processes and the considered analysis techniques.

3.1 Research Questions

The following research questions are considered:

RQ1: Which quality attributes and which quality measures are significantly affected by MMR?

RQ2: Which of the quality measures, when considered individually, is a significant MMR predictor?

RQ3: Which of the quality measures, when combined with other measures in an MMR ML-based prediction model, makes a significant contribution to the model, and to what extent?

3.2 Data Collection

The study involved seven systems selected based on the following criteria: (1) they are Java systems, (2) they vary in size, (3) their source code is openly available, and (4) they belong to different domains. The first system, S1, is Ant 8.0.5 [75], a Java build tool with 254 K LOC and 1242 classes. The second system, S2, is FreePlane 1.3.12 [76], a project management system consisting of 111 K LOC and 926 classes. The third system, S3, is Gantt 2.0.10 [77], a project scheduling system with 51 K LOC and 419 classes. The fourth system, S4, is JHotDraw 5.2 [78], Java development software comprising 25 K LOC and 148 classes. The fifth system, S5, is OmegaT 2.6.3.11 [79], a text processing tool with 121 K LOC and 377 classes. The sixth system, S6, is PDFSAM 2.2.2 [80], a management system for PDF files consisting of 68 K LOC and 277 classes. The last system, S7, is TVbrowser 3.4.1.0 [81], an entertainment tool that consists of 90 K LOC and 543 classes.

3.3 Data Manipulation and Extraction

To explore the impact of MMR on software quality, we mutated the selected systems by moving some methods from their proper classes (source classes) to other randomly selected classes (target classes). Therefore, the mutated version of a system is the version considered to be in need of MMR. Both the source and target classes are randomly selected. The methods selected to be moved adhere to the following conditions: (1) they are not constructor, setter, getter, or delegation methods, and (2) their movement follows the behavior preservation and compilation preconditions declared by Tsantalis and Chatzigeorgiou [43]. In addition, any class can be either a source or a target, but not both, and any target class is a target for at most one method. These restrictions are necessary to reduce the study’s threats to validity. A research assistant with nine years of experience in conducting software refactoring-related research and a B.Sc. in computer science performed the required refactoring activities. The first author checked all mutated pieces of code to ensure that the mutation process was performed as intended. The resulting code was compiled and tested to check that the mutation did not alter the systems’ behaviors. The mutation process mutated 44.5% of the classes (source and target classes combined). Based on this mutation process, the mutated version is the version in need of MMR, and the original code is considered the refactored version; therefore, the original code is expected to have better quality. Table 4 reports the percentage of target classes in each of the seven selected systems.

Table 4 Percentage of target classes
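
The sketch below outlines the selection procedure described above. It is a simplified illustration rather than the scripts used in the study: the exclusion predicate and the behavior-preservation check are placeholders, and each source class contributes at most one moved method.

```python
import random

def is_excluded(method):
    # Condition (1): no constructor, setter, getter, or delegation methods.
    return method["kind"] in {"constructor", "setter", "getter", "delegation"}

def preserves_behavior(method, target_class):
    # Stand-in for the behavior preservation and compilation
    # preconditions of Tsantalis and Chatzigeorgiou [43].
    return True

def select_mutations(methods_by_class, seed=42):
    """Pick (method, source, target) moves such that a class is a source
    or a target but never both, and each target receives at most one method."""
    rng = random.Random(seed)
    sources, targets, moves = set(), set(), []
    for source in methods_by_class:
        if source in targets:
            continue
        candidates = [m for m in methods_by_class[source] if not is_excluded(m)]
        if not candidates:
            continue
        free = [c for c in methods_by_class
                if c != source and c not in sources and c not in targets]
        rng.shuffle(free)
        method = rng.choice(candidates)
        for target in free:
            if preserves_behavior(method, target):
                sources.add(source)
                targets.add(target)
                moves.append((method["name"], source, target))
                break
    return moves

# Toy input: class B's only method is a getter and is therefore excluded.
demo = {
    "A": [{"name": "m1", "kind": "plain"}],
    "B": [{"name": "m2", "kind": "getter"}],
    "C": [],
}
print(select_mutations(demo))  # one move of m1 from A to B or C
```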

To explore the impact of MMR on quality, we selected the 30 measures described in Table 1. These measures include eleven cohesion, six coupling, six size, and three inheritance measures. In addition, they include a single measure for each of the messaging, encapsulation, composition, and complexity quality attributes. As explained in Sect. 2.1, the measures were selected to cover a variety of the considered quality attributes’ aspects and to include the CK [17] and QMOOD [16] sets of measures that are applicable to classes, in addition to other measures. We extended an existing CKJM tool [22] that analyzes the bytecode of a Java class and obtains the CK and QMOOD measures. The tool was extended to obtain the values of all of the other considered measures, to produce the results for all classes in one run, and to report the results in an Excel sheet instead of producing the results for each class separately. We applied the tool to the classes of each version (i.e., original and mutated) of the seven systems. Table 5 reports the descriptive statistics of the measures for all systems considered together, before and after mutation.

Table 5 Descriptive statistics of the considered measures for all systems considered together
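
Assuming the extended tool’s output is exported to one file per version, with one row per class and one column per measure (the file names below are invented), the descriptive statistics in Table 5 can be reproduced along these lines:

```python
import pandas as pd

# Hypothetical exports of the extended CKJM tool.
original = pd.read_csv("measures_original.csv", index_col="class")
mutated = pd.read_csv("measures_mutated.csv", index_col="class")

# Count, mean, std, min, quartiles, and max for every measure, per version.
stats = pd.concat(
    {"original": original.describe().T, "mutated": mutated.describe().T},
    axis=1,
)
print(stats)
```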

3.4 Analysis Techniques

To explore the impact of MMR on quality and test whether that impact is statistically significant, we considered two approaches. The first approach compares the values of each measure for all of the classes in a system before and after mutation. To perform this comparison, we first applied the Shapiro‒Wilk normality test [82] and found that none of the considered measures followed a normal distribution. Therefore, we selected the Wilcoxon paired test [12], a standard nonparametric statistical technique, to compare the set of values of a specific measure before and after mutation. If the resulting p-value is below the typical threshold (α = 0.05), mutation, and consequently MMR, is considered to have a significant impact on the quality measure. For each case, we calculated the mean values for the original and mutated versions to explore whether MMR improved code quality in that respect. For example, if the mean LSCC value of the classes in the original version is greater than that of the mutated version (i.e., the version in need of MMR), we conclude that MMR potentially improves cohesion. This analysis aims to answer RQ1.
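
A minimal sketch of this first analysis, using SciPy and the hypothetical per-version exports introduced above:

```python
import pandas as pd
from scipy.stats import shapiro, wilcoxon

original = pd.read_csv("measures_original.csv", index_col="class")
mutated = pd.read_csv("measures_mutated.csv", index_col="class")

for measure in original.columns:
    x, y = original[measure].align(mutated[measure], join="inner")  # pair by class
    if (x == y).all():           # e.g., DIT and NOC are unchanged by the mutation
        print(f"{measure}: no change")
        continue
    # Normality check; in the study, no measure followed a normal distribution.
    normal = shapiro(x).pvalue > 0.05 and shapiro(y).pvalue > 0.05
    stat, p = wilcoxon(x, y)     # paired, nonparametric
    sign = "+" if x.mean() > y.mean() else "-"
    print(f"{measure}: normal={normal}, sign={sign}, "
          f"{'significant' if p < 0.05 else 'not significant'} (p={p:.3g})")
```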

The second approach, which aims to answer RQ2, relies on applying two ML-based techniques: univariate logistic regression (ULR) [13] and CHAID [14]. The ULR technique is applied to the mutated version of the systems. In this analysis, the dependent binary variable is set to “1” if the class is a target class (i.e., has a method in need of MMR because of the mutation process) and “0” otherwise. Considering all of the classes in a system and a specific measure X, the application of the ULR analysis results in the following equation:

$$\pi (X) = \frac{1}{1 + e^{-(C_{0} + C_{1} X)}},$$

where π represents the probability that the class has a method in need of MMR. The absolute value of C1 indicates the strength of the impact of X on MMR, and the sign of this value indicates the direction of the impact. That is, if the sign of C1 is negative, the probability that a mutated class is in need of MMR decreases as the value of X increases. In other words, if an increase in the value of X indicates a quality improvement, a negative C1 indicates that not being in need of MMR potentially implies that the code has good quality. To compare the impact of different measures on MMR, we report standardized C1 values, obtained by standardizing each measure (subtracting its mean and dividing by its standard deviation) before fitting the model.

In this ULR analysis, we considered the typical threshold (α = 0.05) for the p-value to indicate the significance of the considered measure’s impact on MMR. We applied a boxplot statistical technique [83] to detect outliers. A few outliers were detected, but none were removed because we found that this removal did not cause any significant change in the results.
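
The ULR step for a single measure can be sketched with statsmodels as follows (the labeled data layout, with a binary is_target column, is hypothetical):

```python
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("measures_mutated_labeled.csv")
y = data["is_target"]            # 1 = target class, 0 = otherwise

for measure in ["LSCC"]:         # repeated for each of the 30 measures
    x = data[measure]
    xs = (x - x.mean()) / x.std()            # standardize for comparability
    model = sm.Logit(y, sm.add_constant(xs)).fit(disp=0)
    c1, p = model.params[measure], model.pvalues[measure]
    print(f"{measure}: std. C1 = {c1:.3f}, "
          f"{'significant' if p < 0.05 else 'not significant'} (p={p:.3g})")
```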

The CHAID classification tree algorithm uses a Chi-square test to determine which variables are most important for splitting the data. In our context, the nodes in the classification tree are created by testing for statistically significant differences between groups of cases based on the selected quality measure. In this analysis, we considered the resulting Chi-square test p-value to indicate the significance of the considered measure’s impact on MMR. We set the maximum tree depth to 3, the significance level to 0.5, and the merge threshold to 0.05.
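
The sketch below isolates only the core of this univariate test; full CHAID also merges statistically similar categories and applies a Bonferroni-style adjustment, which is omitted here. The measure is discretized, and a chi-square test checks whether the distribution of the MMR label differs across the resulting groups.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chaid_style_pvalue(x: pd.Series, y: pd.Series, bins: int = 4) -> float:
    """Chi-square p-value for splitting the binary label y by a binned
    version of measure x (a simplification of the CHAID node test)."""
    groups = pd.qcut(x, q=bins, duplicates="drop")
    table = pd.crosstab(groups, y)
    _, p, _, _ = chi2_contingency(table)
    return p

data = pd.read_csv("measures_mutated_labeled.csv")   # as in the ULR sketch
p = chaid_style_pvalue(data["LSCC"], data["is_target"])
print("significant" if p < 0.05 else "not significant", f"(p={p:.3g})")
```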

To explore the impact of the combination of measures on MMR, we applied multivariate logistic regression (MLR) and CHAID analyses. The MLR analysis results in the following equation:

$$\pi (X_{1}, X_{2}, \ldots, X_{n}) = \frac{1}{1 + e^{-(C_{0} + C_{1} X_{1} + C_{2} X_{2} + \cdots + C_{n} X_{n})}},$$

where the Xi terms represent the quality measures, and the Ci terms denote the coefficients obtained using logistic regression analysis. To construct the MLR ML-based model, we applied a backward selection technique in which all of the considered measures were initially included in the model. After constructing the model, we removed the measure with the highest p-value and constructed the model again using the remaining measures. We repeated this process until all of the remaining measures had p-values below the typical threshold (α = 0.05). We validated the constructed ULR and MLR models using 10 rounds of tenfold cross-validation, where the dataset is partitioned into 10 subsamples and the model is constructed and tested 10 times; each time, one subsample is used as the testing set, and the remaining subsamples are combined into the training set. When applying the multivariate CHAID analysis, we set the maximum tree depth to 6, the significance level to 0.5, and the merge threshold to 0.05. We evaluated the MLR and CHAID classification performance using four measures: precision, recall, F-measure, and area under the curve (AUC) [13].
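
The backward selection and cross-validation steps can be sketched as follows, again against the hypothetical labeled data (one of the 10 cross-validation repetitions is shown):

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def backward_select(X: pd.DataFrame, y: pd.Series, alpha: float = 0.05) -> list[str]:
    """Drop the measure with the largest p-value until every remaining
    measure satisfies p < alpha (the procedure described above)."""
    cols = list(X.columns)
    while cols:
        model = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < alpha:
            return cols
        cols.remove(worst)
    return []

def tenfold_performance(X: pd.DataFrame, y: pd.Series, cols: list[str]) -> dict:
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    clf = LogisticRegression(max_iter=1000)
    prob = cross_val_predict(clf, X[cols], y, cv=cv, method="predict_proba")[:, 1]
    pred = (prob >= 0.5).astype(int)
    return {"precision": precision_score(y, pred), "recall": recall_score(y, pred),
            "F-measure": f1_score(y, pred), "AUC": roc_auc_score(y, prob)}

data = pd.read_csv("measures_mutated_labeled.csv")
X, y = data.drop(columns=["class", "is_target"]), data["is_target"]
kept = backward_select(X, y)
print(kept)
print(tenfold_performance(X, y, kept))
```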

4 Wilcoxon Analysis Results and Discussion

Table 6 reports the Wilcoxon analysis results. The “+” sign indicates that the mean value of the measure in the original classes is greater than that in the mutated version. This sign indicates that when the mutation process is reversed (i.e., when MMR is applied to the mutated version), the mean value of the measure increases. For simplicity, in the rest of this paper, we consider the original version to be the version created by applying MMR to the mutated version. Although we did not actually apply this MMR, this claim is valid because applying MMR to the moved methods in the mutated version reproduces the original version. The “−” sign indicates that the mean value of the measure decreased when MMR was applied, and “0” indicates that MMR did not change the mean value of the measure. Parentheses indicate that the change in the mean values based on the Wilcoxon analysis is not statistically significant (i.e., p-value > 0.05), whereas a value without parentheses indicates a significant change. For example, the last column in Table 6 shows that when considering all of the classes in the seven selected systems, the mean LOC value significantly increased when MMR was performed (i.e., it significantly decreased when the classes were mutated).

Table 6 Summary of the Wilcoxon analysis results

The results reported in Table 6 lead to the following observations:

  1. Except for LOC, all of the size measures were significantly and consistently decreased by MMR for all of the systems, whether considered individually or all together. These results are consistent with the intuition that MMR improves quality because a reduction in system size potentially indicates a reduction in system complexity and an improvement in system quality. The results for the LOC measure are inconsistent in terms of the sign and significance of the change. Therefore, the impact of MMR on LOC is unclear.

  2. The results for most of the coupling measures and most of the selected systems show that the values were significantly decreased when MMR was applied. This result is consistent with the intuition that MMR potentially reduces coupling when a method is moved to a more related class and thus improves code quality. The key exception is JHotDraw, where most of the coupling measures were insignificantly affected by MMR. This result potentially arises because JHotDraw is the smallest selected system in terms of the number of classes and, consequently, the number of target classes, which makes the corresponding data potentially insufficient for drawing a clear conclusion. IC and CBM measure the coupling caused by inherited methods. These two measures were insignificantly affected by MMR, potentially because few methods featuring inheritance coupling were selected during the mutation process.

  3. As expected, the results for the lack-of-cohesion measures (i.e., LCOM1, LCOM2, and NHD) and the measures designed to indicate a need for MMR based on cohesion aspects (i.e., LSMC, TMPC, and MCoh) indicate that lack of cohesion significantly decreases, and consequently, cohesion significantly increases, when MMR is applied. This result is consistent for most of the selected systems. MMR had different impacts on the remaining cohesion measures. That is, MMR caused the CAMC and CC values to significantly increase for most of the selected systems, which follows the intuition that moving a method to a more related class potentially improves the cohesion of both the source and target classes. However, MMR had either insignificant or contradictory impacts on Coh, LSCC, and SCOM across the selected systems. Al Dallal [29] showed that Coh and LSCC must be adapted to become suitable for indicating MMR; the adapted versions of these measures are MCoh and LSMC, which were significantly changed by MMR. SCOM and LSCC measure the same cohesion aspect using two different but closely related ideas. Therefore, SCOM is expected to behave similarly to LSCC when MMR is applied.

  4. The inheritance measures DIT and NOC are unaffected by MMR because MMR does not alter the class hierarchy. MFA measures the ratio of the number of inherited methods to the total number of accessible methods, and a high MFA value indicates that the methods are not concentrated in the subclasses. Instead, they are potentially well distributed among the classes within the class hierarchy, which indicates good quality. The results show that the value of MFA increased when MMR was applied, which indicates that MMR improved this quality aspect. This result agrees with intuition because MMR is expected to move a method to its proper class within a class hierarchy.

  5. Consistently, for all of the selected systems, the value of CIS, which indicates the messaging degree, significantly decreased when MMR was applied. CIS counts the number of public methods, and the empirical study’s design caused this value to decrease because the study was based on the mutation idea. In this case, the original code is mutated by moving a method to a target class and replacing it with a delegation method in the source class. Therefore, the mutated version has more methods than the original version, which causes CIS to decrease when comparing the mutated version to the original version. The opposite is expected to occur when actual MMR is applied, where a delegation method is placed in the source class to invoke the moved method. In this case, the CIS value is expected to increase, indicating quality improvement.

  6. The degree of encapsulation, indicated by DAM, significantly increased in all systems when MMR was applied, which indicates quality improvement. This result agrees with intuition because, in some cases, the attributes accessed by the moved method must be copied to the target class. When these attributes are declared private, the insertion of the copied attributes increases DAM, which measures the ratio of private attributes.

  7. MOA measures the degree of the aggregation relationship between a class and other classes caused by attribute types. As with coupling, a decrease in this value indicates quality improvement. The results show that this value significantly and consistently decreased when MMR was applied. The results agree with intuition because, in some cases, a misplaced method requires access to an attribute that is an object of the class to which the method is more related. Moving the method to its proper class potentially results in deleting this attribute if it is not used by any other method in the class.

  8. Performing MMR results in adding delegation methods in the source classes. The size of the delegation methods, in terms of LOC, is typically small, which causes the average method size, measured by AMC, to potentially decrease. As explained in Observation 5, due to the applied mutation idea, this decrease in the AMC value is reversed, which explains the significant and consistent increase in the AMC value when applying MMR. The opposite is expected to occur when actual MMR is applied. In this case, the AMC value is expected to decrease, indicating a complexity reduction and quality improvement.

In summary, most of the results provide evidence that MMR improves code quality, as indicated by different attributes and measures. Exceptions occur either because of the empirical study’s design (i.e., the applied mutation idea) or for special reasons related to a measure’s definition or a certain characteristic of a selected system.

5 Logistic Regression Analysis Results and Discussion

The Wilcoxon analysis provided evidence for whether MMR caused the value of any of the considered measures to significantly change. However, finding a measure that is significantly affected by MMR does not imply that it is a significant predictor for MMR. To test the prediction ability, we performed ULR and MLR ML-based analyses. The ULR analysis results are summarized in Table 7, which reports the standardized C1 values. As explained in Sect. 3.4, in our context, a positive value indicates a positive impact (i.e., the probability of a class needing MMR increases as the value of the measure increases), and a negative value indicates a negative impact. The parentheses indicate that the measure was found to be a nonsignificant predictor for MMR (i.e., p-value > 0.05), and a value without parentheses indicates that the measure was found to be a significant MMR predictor.

Table 7 Standardized coefficients of the ULR models

The results provided in Table 7 lead to the following observations:

  1. The NOA and NOOA size measures were consistently found to be insignificant MMR predictors. For most of the considered systems, LOC and WMC were found to be insignificant MMR predictors. Only the NOP and NOOP size measures were found to be significant MMR predictors, in most cases, and they have positive impacts on MMR. These results indicate that although MMR potentially affects the LOC, WMC, NOA, and NOOA size measures, the size they measure is potentially not a sign that the class has a method in need of MMR. That is, having a large class in terms of LOC, the number of methods, and the number of attributes does not indicate that there is a method in need of MMR. This result might be an indication of the need for other refactoring activities, such as Extract Class refactoring, which splits a class into two classes. However, having a large number of parameters (NOP) or a large number of parameters of object types (NOOP) potentially indicates that the functionality of the method relies on inputs not provided by the attributes but potentially related to other classes, which might indicate that such a method requires MMR. The direction of impact for MMR on NOP and NOOP is consistent with this intuition and with the corresponding Wilcoxon results. Both statistical and ML-based analysis techniques indicate that the NOP and NOOP values potentially decrease due to MMR, which indicates a quality improvement. The strength of the impact of NOOP on MMR is consistently found to be higher than that for NOP, because objects are more complex than primitive types, and they can encapsulate several attributes.

  2. Except for Ce and RFC, each of the considered coupling measures was found to be an insignificant ML-based predictor for MMR in most cases. The Ce and RFC measures are locally measured within the class of interest; they measure the number of classes referenced by the class of interest and the number of methods called by the class of interest, respectively. The results indicate that these measured aspects are potential ML-based predictors for MMR opportunities. As expected, the coupling measures insignificantly affected by MMR (i.e., IC and CBM) are also insignificant MMR predictors in most cases. In most cases, measures that quantify the coupling from other classes to the class of interest (i.e., Ca and CBO) were found to be insignificant MMR predictors. This result arises because several factors have a greater effect on this type of coupling than the existence of methods in need of MMR. For example, such classes might be core system classes referenced by many other classes in the system. The strength of the impact of Ce on MMR is consistently higher than that for RFC. This observation indicates that efferent coupling potentially plays a greater role in indicating MMR opportunities than the number of invocations of other classes’ methods.

  3. Except for the lack-of-cohesion measures LCOM1 and LCOM2, the considered cohesion measures are significant ML-based predictors for MMR opportunities in all or most cases. In addition, their direction of impact on MMR is always consistent with intuition. That is, lower cohesion indicates a higher probability of needing MMR, and vice versa. NHD measures the lack of cohesion, and therefore, the results expectedly suggest that a higher NHD value indicates a higher probability of needing MMR, and vice versa. The LSMC, TMPC, and MCoh measures are MMR indicators designed based on cohesion measures; a higher value for any of these measures indicates a higher probability of needing MMR, and vice versa. These results are consistent with intuition because a method in need of MMR is more related to another class than to the class in which it is placed. This observation potentially indicates that the relationship between such a method and the class in which it is placed is relatively weak. The definitions of cohesive relationships in LCOM1 and LCOM2 are less precise than those of the other considered cohesion measures, which potentially explains their insignificant MMR prediction abilities. There is no consistency across the systems regarding which of Coh, LSCC, CC, SCOM, CAMC, and NHD has the strongest impact on MMR. However, the cohesion-based MMR indicators, LSMC, TMPC, and MCoh, always have a stronger impact on MMR than the other cohesion measures. This observation is expected because these three measures are designed specifically to indicate MMR. The results provide empirical evidence that these measures fulfill their intended purpose, and they were found to have a stronger impact on MMR than any other measure of any quality attribute considered in this study.

  4. In most cases, the inheritance, composition, and complexity measures were found to be insignificant ML-based predictors for MMR opportunities. As explained in Sect. 4, inheritance and MMR are unrelated. In terms of composition, MOA measures the number of attributes of object types. Although this number is potentially affected by MMR, it is more dependent on the systems’ design aspects. Therefore, it cannot solely indicate the need for MMR. Similarly, the average complexity of the methods measured by AMC relies on many factors, and therefore, it does not solely indicate the need for MMR.

  5. Messaging and encapsulation measures (i.e., CIS and DAM) are significant ML-based predictors for MMR opportunities in most cases. CIS measures the number of public methods. As this number increases, the intuition is that the probability that the class has a method in need of MMR potentially increases, which is confirmed by the positive sign of the coefficients. DAM measures the percentage of private attributes. The results indicate that higher DAM values indicate a higher probability of the existence of a method in need of MMR. This is potentially due to one of the reasons for misplacing methods. That is, developers sometimes misplace a method in a class because it accesses some private attributes of that class even though the method is more related to another class. When MMR is applied, the code of the moved method is revised by changing these direct accesses to the private attributes into invocations of the private attributes’ accessor methods (i.e., setter and getter methods).

In summary, and in response to RQ2, the results indicate that parameter-based size measures, coupling measures based on local artifacts, most cohesion measures, messaging measures, and encapsulation measures were found to be significant ML-based predictors for MMR opportunities.

To answer RQ3, we constructed an MLR model considering all measures and using all classes of all systems together. As explained in Sect. 3.4, this ML-based model was constructed in a backward stepwise fashion. Table 8 provides the coefficients of the measures that significantly contribute to the MLR model. The model includes 22 measures, and it has relatively high classification performance (i.e., precision = 95.8%, recall = 61.0%, F-measure = 74.5%, and AUC = 94.5%). The measures contributing to this ML-based model are ordered in Table 8 by their strength of contribution (i.e., the absolute value of the standardized coefficient) from highest to lowest. Based on the standardized coefficient values (Std. Ci), LSMC and MCoh make the highest contributions to this ML-based model, whereas MFA makes the weakest contribution.

Table 8 Coefficients of the multivariate MMR prediction model

6 CHAID Analysis Results and Discussion

To answer research question RQ2 using CHAID, which is the other ML-based technique considered in this study, we constructed a CHAID-based prediction model using each of the quality measures considered individually. Table 9 reports the significance results based on the resulting p-values. Most of these results (82%) are identical to those reported in Table 7. Dissimilar results are highlighted in boldface in Table 9. In summary, WMC, Ca, CBO, LCOM1, LCOM2, MFA, MOA, and AMC were significant MMR predictors, in most cases, in the CHAID ML-based analysis, whereas they were insignificant predictors, in most cases, in the ULR analysis. The CHAID results for the remaining measures were generally similar to those of ULR.

Table 9 The CHAID-based significance results (significant/insignificant)

To answer research question RQ3 using CHAID, we built a CHAID multivariate model. Each of the considered measures contributed to some extent to the model, except WMC, LSCC, CC, NHD, and TMPC. These five measures were not involved in this ML-based model due to their multicollinearity with other measures involved in the model. The model featured relatively high classification performance (i.e., precision = 76.9%, recall = 81.2%, F-measure = 79.0%, and AUC = 96.6%). This ML-based model has worse precision but better recall, F-measure, and AUC values than the MLR model described in Sect. 5. Differences between the results of different prediction techniques are normal and stem from differences between the algorithms these techniques apply.

7 Validity Threats

Several factors may restrict the generality and limit the interpretation of our results. All of the selected systems are Java open-source systems. This limitation was imposed because we extended a tool that obtains the values of quality measures for Java systems. One of the key differences between Java and other object-oriented languages, such as C++, is that Java allows single inheritance only. However, MMR and inheritance are hypothetically unrelated, and this hypothesis is empirically confirmed in this study. We relied on open-source systems because source code availability is essential for performing the required mutation process. It is also important to note that using open-source systems in empirical studies is a common practice in the research community [84]. Although we considered systems of different sizes and from different domains, generalizing the results requires more studies that consider more systems, different languages, and industrial systems [85]. This study considered 30 measures. Although many more measures are available in the literature, the selected measures cover a variety of quality attributes and a variety of aspects of the key quality attributes. This study is much larger than any existing similar study in terms of the number of selected classes and measures.

Although the mutation process was performed manually, the systems were tested after mutation to ensure that the mutation did not cause any behavior change. In addition, all mutations were checked by the first author to ensure that the mutation guidelines were correctly followed. The mutation idea caused some measures, including CIS and AMC, to be inversely affected. Such an impact is reported and thoroughly discussed in Sect. 4.

8 Conclusions and Future Work

This paper reports an empirical study that uses statistical and ML-based techniques to investigate the impact of MMR on internal quality attributes. It overcomes the limitations identified in the relevant existing studies. The study considered 30 measures of eight different quality attributes and involved seven systems of different sizes with a total of approximately 4 K classes. This study is the first to explore the significance of the change in quality values caused by MMR. In addition, it is based on the mutation idea, which is subject to fewer threats to validity than the alternative approach, as explained in Sect. 1.

The reported empirical study explored the impact of MMR on quality in two different ways. The first compared the values of quality measures before and after MMR to investigate whether MMR causes the quality of the code to improve or worsen and whether this improvement or worsening is statistically significant. The results confirmed the expectations in most cases and for most of the considered quality attributes and measures. The second built logistic regression and CHAID models based on the selected measures and explored whether the measures are significant predictors. The results indicated that most cohesion measures, the considered messaging and encapsulation measures, and only a few size and coupling measures are significant ML-based predictors for MMR opportunities.

One of the key conclusions of this study is that, in general, it is inaccurate to claim that a refactoring scenario has an impact on quality without studying this impact for each quality attribute. This conclusion is reached because a refactoring scenario can have different impacts on different quality attributes; it might cause some of them to improve, leave others unaffected, and adversely affect yet others. For example, the results of this study indicated an improvement in quality in terms of some coupling and cohesion aspects, whereas the inheritance attribute was unaffected. Another important conclusion is that it is inaccurate to generalize the results for a certain quality attribute from a few selected measures. Different measures considering different aspects of the quality attribute must be selected. For example, size measures based on structured data (e.g., the number of attributes or methods) yielded different results from LOC, which is not based on structured data; the latter was found to be insignificantly affected by MMR. Similarly, coupling measures related to inheritance (i.e., IC and CBM) were found to be insignificantly affected by MMR, whereas other coupling measures were significantly affected.

The dimensions of the data considered in the two analyses are different. The first analysis is based on comparing the original and mutated versions, whereas the second is based on comparing the values of the measures across the mutated and unmutated classes within the mutated version. Therefore, a measure found to be significantly affected by MMR is not necessarily a significant MMR predictor, and vice versa. This study reports several examples that support this hypothesis. For example, NOA, NOOA, Ca, CBO, LCOM1, LCOM2, MFA, MOA, and AMC were found to be significantly affected by MMR in most cases, but they were found to be insignificant MMR predictors, using logistic regression ML-based models, in most cases. On the other hand, LSCC and SCOM were found to be insignificantly affected by MMR in most cases but were found to be significant MMR predictors in most cases. In addition, some measures, such as NOP, NOOP, Ce, RFC, CC, CAMC, NHD, LSMC, TMPC, MCoh, CIS, and DAM, were found to be both significantly affected by MMR and significant ML-based predictors for MMR opportunities in most cases.

This empirical study can be extended by considering external quality attributes, more internal quality attributes, and extra measures. In addition, it can be extended by involving more systems of different programming languages and different domains and including large industrial systems to validate or invalidate the obtained results. Performing a qualitative analysis for the impact of MMR on quality measures is left open for future work.