Measurement in software development has its own specific challenges. A developer can easily count all the lines of code in a program, yet that number says nothing about the quality of the code, which is far more complex and depends on many aspects beyond code size. To understand the effects of actions taken during software development, and to learn how future development processes can be improved, measurement needs a clearly stated purpose. This can be achieved with the GQM model, which allows goal-focused metrics to be selected from all possible variants. However, even after applying the GQM model, too many metrics may remain; collecting and interpreting all of them would reduce process efficiency without benefiting the project. A recommender system makes it possible to choose the most useful metrics among them without wasting resources.

4.1 Introduction

Software engineering is a unique phenomenon: it stands so close to formal, rigorous mathematics and, at the same time, to art that it resists any single definition or frame. Software tasks have no standard algorithmic solutions, and it is very difficult to formalize the degree of quality of the final product. That is why, as noted earlier in this book, every software project needs properly chosen metrics. They help to evaluate processes, products, and resources in the early stages of development and to set the right direction accordingly.

However, there are several problems to consider when choosing metrics. First, they must be selected carefully and competently: incorrectly defined metrics can lead the project away from its goal and drain the time and resources allocated to it. Second, there is an effectively unlimited number of metrics and ways to categorize them. An overload of metrics can shift priorities and cause a loss of focus on the project.

4.2 Concept of Goal-Question-Metric

Because the problem of metrics selection is so important and complex, the development team usually hires a specialist who works through a long list of requirements and an equally long list of metrics and allocates a basic scope of metrics for further use. However, this is costly both financially and in terms of time. Consequently, several structured, formal approaches to deriving metrics have emerged to simplify this process. Among them, the best known in the scientific community is the Goal-Question-Metric model, introduced by Victor Basili and David Weiss. This technique solves the first problem mentioned in the Introduction: it excludes the possibility of deviating from the project goals through inappropriate metrics.

GQM, according to Basili et al. (1994), stands for "goal, question, metric": it defines the goals to achieve, clarifies them through questions about how each goal can be reached, and answers those questions with collected data. For that purpose, GQM defines a measurement model on three levels:

1. Goal—The goal for which all the work is done and all the artifacts and processes are produced.

2. Question—Several questions, defined based on the goal, that outline a way to achieve it.

3. Metric—A set of metrics that answer a given question in a measurable way.

The GQM model can be represented in the form of a tree, the root of which is the goal. The branches of the tree are represented by questions that allow the goal to be made more specific. And the leaves are the metrics, the measurable outcome of the whole model (Fig. 4.1).

Fig. 4.1 Hierarchical structure of the GQM model

Defining the GQM model—namely, its questions and metrics—helps to determine why and how the goal can be achieved. Consequently, moving from the root of the tree down to the metric leaves makes the abstraction itself more understandable.
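The tree can be captured in a small data model. The following is a minimal Python sketch (class and field names are our own), instantiated with the change-request example discussed later in this chapter:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Question:
    text: str
    metrics: List[str] = field(default_factory=list)  # leaves of the tree

@dataclass
class Goal:  # root of the tree
    purpose: str
    issue: str
    process: str
    viewpoint: str
    questions: List[Question] = field(default_factory=list)  # branches

    def all_metrics(self) -> List[str]:
        """Walk root -> branches -> leaves and collect every metric."""
        return [m for q in self.questions for m in q.metrics]

# Illustrative instance of the model
goal = Goal(
    purpose="Improve", issue="timeliness",
    process="change request processing", viewpoint="project manager",
    questions=[Question(
        "What is the current change request processing speed?",
        ["Average cycle time",
         "Percentage of cases outside of the upper time limit"])])
```

Traversing the tree with `goal.all_metrics()` yields exactly the measurable outcome of the model: the set of metric leaves.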

4.3 The Goal-Question-Metric Process

According to Solingen et al. (1999), a goal-oriented measurement process has emerged around the construction of the GQM model, consisting of four phases:

1. Planning phase—The first step, covering project selection, initial artifact creation, and planning.

2. Definition phase—Construction and documentation of the GQM model.

3. Data collection phase—Collection of data according to the defined GQM model.

4. Interpretation phase—Interpretation of the collected data with regard to the defined metrics, which yields answers to the stated questions. With the gathered results, the achievement of the goal can be evaluated.

In the first phase, the team prepares the ground for further steps by identifying the project area, its clients, and their needs and by creating the initial documentation.

All management, training, and project planning activities take place during this phase. Once the plan is made, the definition phase starts, during which the GQM deliverable is developed; the information for it is acquired from sources such as interviews, analyses, and articles. The data collection phase comprises the measurement itself: the data is defined, gathered, and stored. The interpretation phase then begins, when the measurements are used to answer the stated questions, and the answers are in turn used to evaluate the stated goals (Fig. 4.2).

Fig. 4.2 Four phases of the Goal-Question-Metric method

In more detail, the steps of the GQM method are:

1. Develop business goals—Develop a set of corporate, division, and project business goals, with associated measurement goals for productivity and quality. In other words, the aims to be reached or investigated should be stated at this step in a way that makes it easy to find the questions that define them. A goal usually specifies the purpose of measurement, the object to be measured, the issue to be measured, and the viewpoint from which the measure is taken. For instance, a goal can be stated as "Improve (the purpose) the timeliness (the quality issue) of change request processing (the process) from the project manager's point of view (the viewpoint)."

2. Generate questions—Generate questions (based on models) that define those goals as completely as possible in a quantifiable way. Questions usually break the goal down into its major components. They should also be specific enough to be refined into metrics in the next step. An example question for the goal stated above is "What is the current change request processing speed?"

3. Specify the measures—Specify the measures that need to be collected to answer those questions and to track process and product conformance to the goals. Note that a metric can help answer more than one question, that is, it can be used two or more times. Example metrics for the question above are "Average cycle time" and "Percentage of cases outside of the upper time limit."

4. Define mechanisms for data collection—Choose the mechanisms best suited to collecting the necessary data. They range from invasive methods, in which processes may even be paused to capture some metrics, to noninvasive ones, in which the processes are observed from outside without any intervention. For example, the metrics considered above can be collected by analyzing log files produced by the system.

5. Collect, validate, and analyze the data—Process the data in real time to provide feedback to projects so that the process can be corrected or improved. This step often includes processing the collected data with methods and tools from statistics and probability theory.

All this helps to ensure that the selected metrics are aligned with the goal and will guide the project in the right direction.

4.4 Recommender Systems

A recommender system can become the solution to the second issue. Automation is the main engine of progress in our century, and recommender systems were created to automate our choices. So that users do not have to search through a multitude of options that do not interest them, recommender systems choose the most interesting ones on their behalf. This mechanism is extremely useful in modern online stores such as Amazon, content platforms such as YouTube, and search engines such as Google. All the platforms in these examples are very popular and have a large user base, which points to one of the main prerequisites of recommender systems: with a large number of users, services can collect huge amounts of data, which can then be analyzed to build recommendations. Collected datasets can consist of any information a computer can process—text, numbers, boolean values, and much more.

Because of their widespread use in different fields, many types of recommender algorithms have appeared over time. The first and best known of them is collaborative filtering (Schafer et al. 2007), of which there are several variants. User-based filtering finds the most similar users by analyzing their preferences and makes recommendations based on that comparison (Wang et al. 2021). The item-based method, on the other hand, relies on item analysis: it compares item ratings with one another and produces a result based on similarity (Sarwar et al. 2001). In general, collaborative filtering is effective for problems that lack a detailed list of characteristics for each item. However, it also has disadvantages. For example, it cannot make a good recommendation if there are no similar users or similarly rated items. One should also keep in mind that the more data there is, the better this method works, but also the longer the algorithm takes to run (Isinkaye et al. 2015).
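The item-based variant can be illustrated with a minimal sketch: columns of a rating matrix are compared by cosine similarity, and a missing rating is predicted as a similarity-weighted mean of the user's other ratings. All users, items, and numbers below are invented for illustration.

```python
import math

# Toy user-item rating matrix (0-5 scale); absent key = unrated.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 3, "B": 1, "C": 2},
    "carol": {"A": 4, "B": 3, "C": 5},
    "dave":  {"A": 1, "B": 5},          # has not rated item C
}

def item_vector(item):
    """The ratings column for one item, keyed by user."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(v, w):
    """Cosine similarity over the users who rated both items."""
    shared = set(v) & set(w)
    if not shared:
        return 0.0
    dot = sum(v[u] * w[u] for u in shared)
    return dot / (math.sqrt(sum(v[u] ** 2 for u in shared)) *
                  math.sqrt(sum(w[u] ** 2 for u in shared)))

def predict(user, item):
    """Similarity-weighted mean of the user's ratings for other items."""
    sims = [(cosine(item_vector(item), item_vector(other)), rating)
            for other, rating in ratings[user].items() if other != item]
    total = sum(s for s, _ in sims)
    return sum(s * r for s, r in sims) / total if total else 0.0

score = predict("dave", "C")  # dave's predicted rating for item C
```

Because item C is rated similarly to item B by the other users, dave's high rating for B pulls the prediction for C upward.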

In order to generate recommendations when there are detailed descriptions of items but little collected data, the second type of recommender algorithm—content-based filtering—was invented (Pazzani et al. 2007). As the name suggests, the focus shifts from the user to the recommended product. This method is somewhat similar to filtering by parameters in online stores. Although it does not require a huge dataset of information about users, it does require a broad and detailed description of each product (Lops et al. 2011).
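A content-based recommender can be sketched just as compactly. Here, item descriptions (invented for illustration) are compared by word overlap, standing in for the richer feature vectors a real system would use.

```python
# Toy catalog of items with textual descriptions (illustrative).
items = {
    "lib-a": "fast json parsing library",
    "lib-b": "json serialization with schema validation",
    "lib-c": "terminal user interface toolkit",
}

def words(text):
    """Bag-of-words feature set for an item description."""
    return set(text.lower().split())

def jaccard(a, b):
    """Overlap between two word sets, normalized by their union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(liked_item, candidates):
    """Return the candidate whose description best matches the liked item."""
    profile = words(items[liked_item])
    scored = [(jaccard(profile, words(items[c])), c)
              for c in candidates if c != liked_item]
    return max(scored)[1]

best = recommend("lib-a", items)
```

The user who liked `lib-a` (a JSON library) is pointed to `lib-b`, the only other item whose description shares content with it; no rating history from other users is needed.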

Since both approaches have their advantages and disadvantages, a combination of the two—hybrid filtering—has appeared. This mixing makes it possible to bypass the problems of both (Çano 2017).

Of particular interest is the use of recommender systems in software engineering. Such systems are used in all phases of development (Gašparič et al. 2015), as well as in various subfields of programming. For example, they can be used to assign statuses to pull requests (Azeem et al. 2020) and tags to questions (Zhang et al. 2018a), for API selection (Cai et al. 2019; Thung et al. 2013; Xie et al. 2020), forum recommendation (Castro-Herrera et al. 2009), package suggestions (McMillan et al. 2012), bug detection (Ashok et al. 2009; Gomez et al. 2015), task selection (Wang et al. 2020), refactoring recommendation (Lin et al. 2016), and many others. However, recommender systems have not yet been applied to define a set of metrics for a software project.

4.5 Metrics Recommender

Since data collectors such as Innometrics gather data and present metrics to the user in the form of graphs, a recommender system was needed to sift unnecessary information out of the application dashboard. The solution works as follows. The user enters the application, opens the tab with the GQM model, fills it in—namely, enters the goals and questions for their project—and then clicks the "Generate Metrics" button. The system processes all the text data it has previously received from other users and passes it to the recommender algorithm, which analyzes the data and displays the answer to the user by assigning the necessary metrics to the questions. For this analysis, the system needs three things: a dataset, a way to process the data, and the recommender algorithm itself. The next sections examine these three components in more detail.
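This flow can be sketched end to end. Everything below is a simplified stand-in for the actual Innometrics implementation (which is written in Java): function names, the trivial tokenizer, and the history data are all illustrative.

```python
def preprocess(text):
    """Stand-in for the real preprocessing pipeline described later."""
    return text.lower().split()

def recommend_metrics(questions, history):
    """For each new GQM question, attach the metrics of the most
    word-similar question previously collected from other users."""
    suggestions = {}
    for q in questions:
        q_tokens = set(preprocess(q))
        best = max(history,
                   key=lambda past: len(q_tokens & set(preprocess(past))))
        suggestions[q] = history[best]
    return suggestions

# Previously collected question -> metrics pairs (illustrative).
history = {
    "what is the change request processing speed": ["Average cycle time"],
    "how many defects are found per release": ["Defect density"],
}

out = recommend_metrics(
    ["What is the current change request processing speed?"], history)
```

The new question is matched against the stored ones, and the metrics of the closest match are suggested; the real system replaces the word-overlap matcher with the trained multi-label classifier described below.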

4.5.1 Dataset

Since the GQM-based metrics recommender relies on the analysis of textual data, this data first needs to be processed—brought to a common form to lower its sparsity. Preprocessing is the data-handling step that solves exactly this problem, but there are many ways to process textual data. To choose the best one for our problem, we used a dataset of English sentences from Kaggle: https://www.kaggle.com/theoviel/improve-your-score-with-some-text-preprocessing/notebook. With this dataset, we conducted several experiments, described in the next section.

Next, to evaluate the effectiveness of different recommender algorithms, we compiled our own second dataset, since no such dataset had previously been published in the public domain (Fitzgerald et al. 2011). To collect it, we interviewed 35 developers from Innopolis, ranging from 21 to 32 years old, with 1.5 to 6 years of work experience in different fields of software development.

4.5.2 Preprocessing

During the research, we found that a limited number of steps are used for text preprocessing, namely (1) TF-IDF, (2) stop words removal, (3) tokenization, (4) stemming, (5) vectorization, (6) PoS tagging and lemmatization, and (7) non-letter symbols removal. To determine the most suitable sequence for our problem, we conducted an experiment: each possible sequence composed of these preprocessing steps was run 1000 times on 500 sentences from the Kaggle dataset. The machine used in the experiment had an Intel Core i5-8250U CPU at 4 GHz and 7862 MiB of RAM. The resulting mean runtimes and standard deviations are shown in Table 4.1.

Table 4.1 Execution time for each combination

Multiple pairwise comparisons using Tukey's method with a family-wise error rate of 0.05 (Lee et al. 2018) show that the most efficient sequence is the one labeled B: (1) non-letter symbols removal, (2) tokenization, (3) stop words removal, (4) PoS definition, (5) lemmatization, and (6) TF-IDF. This is the sequence we chose for our system.
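The chosen sequence can be sketched as follows. Steps 4 and 5 (PoS definition and lemmatization) are stubbed out here, since a real pipeline would delegate them to an NLP library such as NLTK or spaCy, and the stop-word list is a tiny illustrative subset.

```python
import math
import re

STOP_WORDS = {"the", "is", "a", "of", "to", "and"}  # tiny illustrative list

def preprocess(sentence):
    """Steps 1-3 of sequence B (PoS tagging and lemmatization omitted)."""
    letters_only = re.sub(r"[^a-zA-Z ]", " ", sentence)  # 1. non-letter removal
    tokens = letters_only.lower().split()                # 2. tokenization
    return [t for t in tokens if t not in STOP_WORDS]    # 3. stop-word removal

def tf_idf(corpus):
    """Step 6: TF-IDF weight of each term in each document."""
    docs = [preprocess(s) for s in corpus]
    n = len(docs)
    vocab = {t for d in docs for t in d}
    idf = {t: math.log(n / sum(t in d for d in docs)) for t in vocab}
    return [{t: d.count(t) / len(d) * idf[t] for t in set(d)} for d in docs]

weights = tf_idf(["The speed of processing is low.",
                  "Processing time grows with the number of requests."])
```

Terms that occur in every document (here, "processing") receive zero weight, while terms specific to one question (here, "speed") are weighted up, which is exactly what makes the vectors discriminative for the classifier in the next section.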

4.5.3 Recommender Algorithm

The metrics recommender problem we identified earlier is a multi-label classification problem (Zhang et al. 2014), in which multiple metrics can be assigned to each question. We therefore need to consider multi-label classification algorithms when building the recommender. To quantify the effectiveness of each of them, we used the dataset collected from Innopolis employees, described in the "Dataset" subsection.

Multi-label classification algorithms can be divided into two categories: problem transformation and algorithm adaptation (Tsoumakas et al. 2009). To choose among representatives of these two groups, we split the dataset described earlier into a 90% training set and a 10% test set and evaluated all the multi-label algorithms listed in Table 4.2 on it. The results are shown in the same table.

Table 4.2 Multi-label algorithms comparison

The table shows that binary relevance performs best, so it was used for the implementation of the recommender algorithm.
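Binary relevance is a problem-transformation method: it trains one independent binary classifier per label and unions the positive predictions. The following is a minimal sketch, with a toy keyword-presence classifier standing in for the statistical learner the actual system would use; all training texts and labels are illustrative.

```python
class KeywordClassifier:
    """Toy binary learner: predicts positive when any word seen in a
    positive training example appears in the input."""
    def fit(self, texts, flags):
        self.keywords = {w for t, f in zip(texts, flags) if f
                         for w in t.lower().split()}
        return self

    def predict(self, text):
        return bool(self.keywords & set(text.lower().split()))

class BinaryRelevance:
    """One independent binary classifier per label."""
    def fit(self, texts, label_sets, all_labels):
        self.models = {
            label: KeywordClassifier().fit(
                texts, [label in ls for ls in label_sets])
            for label in all_labels}
        return self

    def predict(self, text):
        return {l for l, m in self.models.items() if m.predict(text)}

labels = {"Average cycle time", "Defect density"}
br = BinaryRelevance().fit(
    ["how fast are change requests processed",
     "how many defects per release"],
    [{"Average cycle time"}, {"Defect density"}],
    labels)
pred = br.predict("processing speed of change requests")
```

Each label gets its own yes/no decision, so any subset of metrics can be assigned to a question; this independence is also binary relevance's known limitation, since correlations between labels are ignored.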

4.5.4 Conclusion

Based on our experiments, we developed a recommender algorithm for the Innometrics system. It was written in Java as a REST API, with one endpoint containing all the logic of the recommender system. Data received from users is first processed by the following sequence of preprocessing steps: (1) non-letter symbols removal, (2) tokenization, (3) stop words removal, (4) PoS definition, (5) lemmatization, and (6) TF-IDF. The data is then passed to the recommender algorithm, which analyzes it and generates metric suggestions for a new user. Thus, the user does not need to search for appropriate metrics among all those available in the system: the algorithm automatically generates a goal-focused set of metrics that best fits each individual.