HybridRec: A recommender system for tagging GitHub repositories

Software repositories are increasingly essential to support the management of typical artifacts building up projects, including source code, documentation, and bug reports. GitHub is at the forefront of this kind of platforms, providing developer with a reservoir of code contained in more than 28M repositories. To help developers find the right artifacts, GitHub uses topics, which are short texts assigned to the stored artifacts. However, assigning inappropriate topics to a repository might hamper its popularity and reachability. In our previous work, we implemented MNBN and TopFilter to recommend GitHub topics. MNBN exploits a stochastic network to predict topics, while TopFilter relies on a syntactic-based function to recommend topics. In this paper, we extend our work by building HybridRec, a recommender system based on stochastic and collaborative-filtering techniques to generate more relevant topics. To deal with unbalanced datasets, we employ a Complement Naïve Bayesian Network (CNBN). Furthermore, we apply a preprocessing phase to clean and refine the input data before feeding the recommendation engine. An empirical evaluation demonstrates that HybridRec outperforms three state-of-the-art baselines, obtaining a better performance with respect to various metrics. We conclude that the conceived framework can be used to help developers increase their projects’ visibility.

i.e., README files, of the repository of interest. However, such a tool can recommend only featured topics, i.e., a set of topics, which are curated by GitHub. 2 Subsequently, we successfully developed another approach, called Top-Filter [7], that relies on a collaborative filtering technique to extend the recommendation capabilities of MNBN, taking into consideration non-featured topics. Given an initial set of topics already assigned to the GitHub repository of interest, a project-topic matrix is created starting from a graph-based format that encodes the reciprocal relationships between projects and relationships. Then, the underpinning recommendation algorithm relies on a syntactic similarity function based on featured vectors to recommend the most similar topics.
In our recent work [7], we show that the combined use of MNBN and TopFilter (named TopFilter + hereafter 3 ) can improve the results obtained by MNBN. However, according to further investigations, there is still room for improvement for TopFilter + . In particular, it is necessary to deal with unbalanced datasets (as typically occur in real contexts) as well as to enhance the quality of the recommended items. In this work, we further advance the existing tools by proposing a hybrid recommender system, named HybridRec, which retrieves topics for repositories using a combination of stochastic and collaborative filtering recommendation strategies (thereby yielding the name hybrid), being capable of dealing with unbalanced and heterogeneous datasets. To enable both techniques we select the most used artifacts 4 by relying on popular topics. A crucial step lies in the preprocessing phase of the dataset and topics, which dramatically differs from the one set in place for the GitHub repositories. In particular, the type of available metadata for each project can affect the recommendation outcomes as well as the efficiency of the underpinning techniques.
To evaluate HybridRec, we perform a series of experiments, exploiting real-world datasets collected from GitHub. Moreover, we compare HybridRec with MNBN [8], TopFilter, and TopFilter + [7], which can be considered as among state-of-the-art approaches to the recommendation of topics for GitHub repositories. The experimental results show that HybridRec performs better than the baselines, recommending highly relevant topics in most cases. In this sense, the contributions of this paper are summarized as follows: -A comprehensive discussion on open challenges in retrieving relevant information from the GitHub ecosystem; -A hybrid recommender system, called HybridRec, built on top of (i) a Complement Naïve Bayesian Network [23], (ii) a collaborative-filtering technique, and (iii) a rule-based preprocessing phase to suggest topics for GitHub repositories; -An empirical study using real-world datasets collected from GitHub and MVN Repository, comparing HybridRec with three state-of-the-art baselines, i.e., MNBN [8], TopFilter, and TopFilter + [7]; -The tools and datasets developed and curated through our work have been published to foster future research. 5 The paper is structured into the following sections. In Section 2, we motivate the problem addressed in this paper and the research objective we aimed to achieve. The conceived approach is presented and evaluated in Section 3 and Section 4, respectively. Afterward, the obtained results are quantitatively and qualitatively analyzed in Section 5. Next, we discuss the probable threats to the validity of the performed experiments in Section 6, while the related work is reviewed in Section 7. Finally, we draw some perspective work and conclude the paper in Section 8.

Background and motivation
This section provides an introduction to the importance of topics in GitHub. We identify several challenges that arise during the mining process of the ecosystem. The research problem addressed in this paper is then introduced.

Problem statement
Software developers use open-source software (OSS) repositories to publish and disseminate their work. GitHub is amongst the most popular platforms that offer open environments where developers can share their code and interact with each other. To assist developers in approaching projects of their interest, GitHub provides users with tools and techniques that help narrow down the search scope and increase the visibility of its projects. In particular, to characterize the stored artifacts, GitHub leverages featured topics, 6 i.e., a list of the most popular and active topics, which are periodically monitored and updated. Such a public list indicates the community's trend in terms of the most used topics. Developers have manually assigned GitHub topics according to their experience as well as to the content of the repositories at hand. Nevertheless, topics assigned in such a way might be inaccurate or represent wrong concepts, thus compromising the visibilities of projects. Figure 1 represents an explanatory example consisting of the longclaw repository, 7 which implements extensions of Wagtail CMS 8 to enable the development of typical ecommerce functionalities. By looking at the list of topics, users can easily understand that the project employs django, a python web framework. Though the given topics characterize the repository's features, some of them might be considered redundant, e.g., python3, python-2, and wagtail-cms. Moreover, only three topics are considered to be featured as they are given in Table 1.
The example in Fig. 1 shows that on the one hand, userspecified topics are usually more extensive than the featured ones. On the other hand, two different topics can express the same concept, and a larger set of topics could affect the prediction accuracy of automatic approaches that might rely on such data. Summary. Topics are an effective means of summarizing the main features of a GitHub repository. Furthermore, the proper use of topics facilitates the searchability and discoverability of different items. Thus, there is the need for techniques and tools to automatically generate topics to prevent them from being misleadingly/wrongly established and not correctly reflecting the contents of the considered projects.

Challenges in mining the GitHub ecosystem
According to existing work [5,15], extracting valuable data from GitHub is a daunting task and exhibits several pitfalls. We identify the following challenges to be undertaken to conceive the desired recommender system: C1: Data redundancy. GitHub topics are specified (or even created) by users to classify their repositories. However, this manual process results in inaccurate labeling in some situations. For instance, a user can define both python and python-3 as topics for its repository, which are redundant. Refining the list of topics by removing such kinds of topics can improve the discoverability of the repository; C2: Structure of available metadata. Concerning the available sources of knowledge, a standard GitHub project is described by one or more README file(s), a brief description, and possibly by a Wiki. Furthermore, there are different kinds of accessible metadata, e.g., commits, issues, forks, and stars. Though GitHub provides publicly access in most of the cases, extracting valuable information from the data mentioned above requires a set of tailored preprocessing techniques; C3: Popularity mechanisms. GitHub provides users with the star and forking mechanisms to assess the popularity of a given repository [3,14]. The former is used to improve the visibility of a project. The latter is typically employed when a developer wants to "freely experiment with changes without affecting the original project." 9 Moreover, GitHub groups the most popular projects under a curated list, i.e., featured topics. In such a way, the popularity of a repository helps the mining process filter out unuseful artifacts, e.g., toy and dummy projects. Though many software artifact repositories provide popularity mechanisms, there is no uniform way to define the popularity of an artifact. For instance, GitHub includes many elements, i.e., star, forking, number of committers, etc., to assess the repository popularity, while Maven defines the popularity of an artifact by relying on the number of usages, i.e., when another project employs the artifact;

C4:
Crawling and Data Dump. The availability of input data is a crucial aspect of the recommendation process.
To gather them from a platform that stores OSS projects we can (i) use a crawler or (ii) rely on a data dump stored in some database format. Concerning the crawling activity, GitHub exposes its API to obtain all needed data by exploiting different libraries, e.g., JGit, 10 PyGitHub, 11 to name a few. Furthermore, GHTorrent [11] offers a regularly updated data dump of the entire platform in several formats, e.g., SQL, and MongoDB. It is worth noting that GHTorrent does not include the actual content of each repository, but only stores metadata; C5: Configurations of the underpinning recommender systems. Depending on the features of the considered OSS ecosystem, the employed recommendation algorithm must be adapted accordingly by changing its internal configurations. Such a phase can rely on different parameters that vary according to the nature of the used algorithms.
Research objective. Considering the need mentioned above and the corresponding challenges, we propose a workable solution to the recommendation of topics to GitHub repositories. We conceptualize a hybrid use of a complement naïve bayesian network and a collaborative-filtering technique.

Proposed approach
In this section, we present in detail HybridRec, the proposed recommender system to provide topics for GitHub repositories. HybridRec parses textual contents, e.g., README files, Wiki contents, and commit messages collected from GitHub, and employs a Complement Naïve Bayesian Network [23], or CNBN for short, to extract preliminary topics. The predicted topics are then fed as input to retrieve additional relevant topics by means of a collaborative-filtering technique. The outcome of this phase is combined with the preliminary topics to yield the final recommendations. Figure 2 illustrates the architecture of HybridRec, which consists of two main components, namely (i) ST: Probability based recommendation using a stochastic network [23]; 10 https://www.eclipse.org/jgit/ 11 https://pygithub.readthedocs.io/en/latest/index.html and (ii) CF: Collaborative-filtering based recommendation [21,27]. The following subsections describe these components in greater detail.

ST: The stochastic-based component
The architecture of the first component is depicted in Fig. 3. Data is crawled from GitHub and then vectorized by the TF-IDF encoding component. The obtained data is used to feed the CNBN component, which returns a set of most probable topics. By resembling the choice made in the original work [8], we consider only the featured topics, i.e., the most popular according to GitHub's statistics. In such a way, we can train CNBN with the projects labeled with featured topics that are representative in terms of coverage. Furthermore, various NLP techniques are applied to both predicted and actual topics to improve the quality of the outcomes, as we explain as follows.

TF-IDF encoding
Given a repository, this module vectorizes the textual content of its artifact descriptions, using the scikit-learn library 12 that provides all the functionalities to encode textual content by finding the most representative terms. Encoder computes the inverse document-frequency using the following formula: where n is the total number of documents in the document set; df(t) is the number of documents in the document set that contain term t. A previous study [16] showed that applying such a weighting scheme should possibly enhance the quality of predicted items of the Bayesian classifier. Thus, we decide to apply this additional preprocessing step to improve the quality of the recommended topics.

CNBN-based prediction
In our previous work [8], a Multinomial Naïve Bayesian network (MNBN) has been used to extract topics, and the results were encouraging. In this paper, in an attempt to improve the recommendation capability as well as to deal better with unbalanced datasets, we adopt an enhanced version of the Multinomial Naïve Bayesian Network called Complement Naïve Bayesian Network to compute the predictions. The term "naïve" refers to the assumption that all the features are conditionally independent. In other Fig. 2 Overview of the HybridRec approach words, the network augments its prediction capabilities when each element has weak or no ties with the others. This model has been successfully employed to classify multinomial distributed data. Though MNBN and CNBN are similar from the structural point of view, their underpinning mechanisms used to compute the outcomes are completely different. In particular, MNBN employs a stochastic model that predicts a certain class c by relying on its training data. In contrast, CNBN makes use of the training data coming from all classes except c. In such a way, the performed estimation is more effective as the model uses a more even amount of training data per class.
To be concrete, CNBN computes predictions by means of the following formula. 13 where the summations are over all documents j not in class c, d ij is the tf-idf value of term i in document j and a i is a smoothing parameter used to compute the final weight w ci . This mechanism impacts positively on the computation of the weights by using the same input data. In fact, using CNBN instead of MNBN leads to better accuracy considering unbalanced datasets [23]. After this phase, CNBN produces a list of top-N topics according to their probability. 13 We made use of the Python implementation of CNBN embedded in the scikit-learn library (https://scikit-learn.org/stable/modules/ generated/sklearn.naive bayes.ComplementNB.html).

Post natural language processing
The following lightweight post-processing steps are performed on the obtained topics, with the aim of further enhancing the prediction capabilities: -Dash removal: All the dash occurrences in each topics are removed. For instance, the build-system topics becomes buildsystem. In this way, we enlarge the possible set of recommended items; -Stemming: Using Porter's stemmer implementation provided by the nltk language processing library, 14 we squeeze each term to its root by cutting the suffix. Thus, terms such as monitoring or testing collapse to monitor and test, respectively.
The topics retrieved by ST are fed as input for the collaborative-filtering component to further refine the recommendation results, as we explain in detail in the next subsection.

CF: The collaborative filtering-based component
Though CNBN can recommend relevant topics by using textual contents, it provides only featured topics, i.e., a set of topics curated by GitHub. Therefore, it is necessary to have machinery to provide also non-featured topics. Being inspired by the TopFilter system [7], in this work we further extend the recommendation capabilities of the ST component to non-featured topics, using a collaborativefiltering recommendation engine. Figure 4 provides an overview of the CF component. The Preprocessing module is used to obtain a filtered dataset. Given an initial set of topics already assigned to the repository of interest, Data Encoder encodes it in a graph-based structure to represent the mutual relationships between repositories and topics. From this, a projecttopic matrix is created by following the typical user-item structure used in existing collaborative filtering applications [27]. Then, Similarity Calculator computes similarities among all the considered artifacts. Finally, the Recommendation Engine module retrieves a ranked list of top-N topics that are suggested by using userbased collaborative filtering technique [35]. The functionalities and preprocessing techniques implemented in CF are described in detail as follows.

TopFilter preprocessing
We adapt semantic mapping rules proposed in a recent work [13], revise them as well as propose our own rules, with the aim of enhancing the quality of the retrieved topics. These rules have been manually defined by building direct relationships among the terms. For instance, topics firefox and chrome are rooted to a common term, i.e., browser. In particular, we adopt two types of rules, i.e., Aggregating Rewriting Rules (ARR)s and Extending Rewriting Rules (ERR)s. The former are used to restrict the dictionary of possible topics by rewriting the topics in a different manner, e.g., exporter and exporting are aggregated in export that preserves their semantic. Meanwhile, the latter provide additional information about the topic, e.g., the repository containing the sysmodule topic are augmented by adding the system and module topics. Given a repository, ERRs extend the list of topic by adding new ones. Specifically, we perform the following ordered preprocessing steps:

Data encoder
This component represents the relationships between projects and topics in a matrix, whose rows and columns represent all the projects and all the corresponding topics. Thus, the cell (i, j ) is set to 1 if the artifact in the i th row is labeled with the topic in the j th column, 0 otherwise. The matrix is eventually constructed by preprocessing raw topics with the same NLP techniques used in MNBN to filter out possible biased terms (see Section 3.1.2).
To illustrate how Data Encoder works, we consider a set of four GitHub repositories P = {p 1 , p 2 , p 3 , p 4 } together with a set of topics T = {t 1 = junit; t 2 = testing; t 3 = specs; t 4 = module; t 5 = mocking, t 6 = mock}. Moreover, the repositories-topics inclusion relationships is denoted as . A parsing of the projects reveals the following inclusions: 5 . After the NLP normalization steps, t 5 and t 6 collapse on the same term which is named as t 6 . The final project-topic matrix is shown in Table 2. Table 2 The artifact-topic matrix for the example

Similarity calculator
Given the encoded data, this module computes the similarity among the projects by considering the mutual relationships. Two nodes in a graph are considered to be similar if they share the same neighbours by considering their edges. We represent a set of projects and their labels in a graph, so as to calculate the similarities among the projects. For instance, Fig. 5 depicts the graph-based representation of the project-topic matrix in Table 2.
Considering a project p that has a set of neighbor nodes (t 1 , t 2 ,.., t l ), the features of p are represented by a vector φ = (φ 1 , φ 2 , .., φ l ), with φ i being the weight of node t i computed where f t i is the number of occurrences of t i with respect to p, it can be either 0 and 1. |P | is the total number of considered projects; a t i is the number of projects connecting to t i via corresponding edges. Intuitively, the similarity between two projects p and q with their corresponding feature vectors φ = {φ i } i=1,..,l and ω = {ω j } j =1,..,m is computed as the cosine of the angle between the two vectors as given below.
where n is the cardinality of the union of topics by p and q.

Recommendation engine
Given an input project p, and an initial set of related topics decided by the developer, the inclusion of additional topics can be predicted from the projects that are similar to p. The collaborative-filtering recommender works based on the assumption that "if projects share some topics, then they will probably share additional topics." In other words, it predicts topics' presence by means of those collected from the top-k similar projects using the following formula [21]: where r p and r q are the mean of the ratings of p and q, respectively; q belongs to the set of top-k most similar projects to p, denoted as topsim(p); sim(p, q) is the similarity between the active project and a similar project q, and it is computed using (3). The next sections present the experiments performed to evaluate HybridRec as well as to compare it with MNBN, TopFilter and TopFilter + .

Evaluation
This section describes the process conducted to study the performance of HybridRec. In particular, three research questions are presented in Section 4.1 to address different aspects of the quantitative and qualitative evaluations. Afterward, an informative description of the datasets exploited in the evaluation is given in Section 4.2. Sections 4.3 and 4.4 describe the evaluation metrics and process, respectively.
To facilitate future research, we made available the HybridRec tool together with the related data in GitHub. 15 15 https://github.com/MDEGroup/HybridRec

Research questions
We study the performance of our proposed approach by answering the following research questions:

Data extraction
To answer the four research questions, we curate four different datasets, namely D 1 , D 3 , D 2 , and D M as shown in Table 3 and described as follows.
-D 1 is the original dataset already used in our previous work [7,8] consisting of 11,694 GitHub repositories with 19,337 topics; -D 2 : As discussed in our previous work [7], infrequent topics negatively affect the prediction outcomes. In this way, we removed infrequent elements from the dataset to analyze their impacts on the overall recommendation phase. We firstly filtered the initial set of topics using their frequencies counted on the entire GitHub dataset. Afterward, topics that occur in less than 20 repositories were removed, obtaining D 2 with 6,253 repositories and 455 topics;

Metrics
In this work, success rate, precision, recall, top rank, and catalog coverage are employed to study the considered systems, as these metrics have been widely exploited by related research [21,24]. First, the following notations are introduced: t is the frequency cut-off value of input labels, i.e., all topics that occur less than t times are removed from the dataset; -τ is the number of topics fed as input to TopFilter; -N corresponds to the cut-off value for the ranked list of recommended items; k specifies the number of top-similar neighbor projects TopFilter incorporated to predict topics; -GT(p) is defined as a half of the extracted topics for a testing project p using as ground-truth data; -REC N (p) is the top-N suggested topics sorted in a descending order; -a recommended topic rt to a repository p is marked as a match if rt ∈ REC(p); -match N (p) is the set of items in REC N (p) that match with those in GT(p) for repository p. -T is the set of all the available topics. 17 https://www.crummy.com/software/BeautifulSoup/ Using the aforementioned notations, the success rate, accuracy and coverage metrics are defined as follows.
Success rate@N Given a set of testing projects P, SR@N is defined as the fraction of projects having at least a matched topic among the total number of queries.
Accuracy Given a list of top-N libraries, precision and recall are utilized to measure the accuracy of the recommendation results. In particular, precision is the ratio of the top-N recommended topics found in the ground-truth data, whereas recall is the ratio of the ground-truth topics belonging to the N recommended items [6]: Top rank It measures the percentage of the first top element in the user's topics: where TpRank(r) returns 1 if the first predicted element belongs to GT(p), 0 otherwise.

Catalog coverage
Given the set of projects, we compare the number of recommended topics with the global number of the available ones. This metric measures the suitability of the delivered topics considering all the possible set of values.

Evaluation process
We use the ten-fold cross-validation technique [17] to analyze the performance of our proposed approach. Figure 6 depicts the evaluation process consisting of three consecutive steps, i.e., Data Preparation, Recommendation Production, and Outcome Evaluation, which are explained as follows.
Data preparation This phase is conducted to collect repositories from GitHub that match the requirements defined in previous section during the Data collection step. The dataset is then split into a training and a testing set using Split ten-fold. Since the recommender systems are different in terms of input data, i.e., CNBN requires README files as input and training data, whilst HybridRec uses a set of assigned topics as input and for training, testing and training sets need to be specifically parsed for each individual approach. The Split labels activity simulates a real development scenario when a developer has already assigned some topics to her repository, and waits for recommendations, i.e., Query labels. For instance, given a GitHub repository and its topics, we fed the CNBN component using the repository README file annotated with the corresponding topics. Meanwhile, the testing data represent unlabeled README files that need to be categorized by the network. Contrariwise, TopFilter requires only training topics and an initial set of labels extracted from the input repository, namely Query labels since the underpinning engine makes use of a collaborative filtering technique that suffers from the cold start problem.
Recommendation production To enable the evaluation of HybridRec, we extracted a part of the topics for each testing project, resulting in the ground-truth data. The remaining part is used as a query to produce recommendations. We parsed and encoded text files in vectors using the TF-IDF weighting scheme to provide input to CNBN.

Outcome evaluation
We evaluate HybridRec and compare it with MNBN, TopFilter, and TopFilter + , analyzing the recommendation results by comparing them with those stored as ground-truth data to compute the quality metrics, i.e., Success rate, Precision, Recall, Top rank, and Catalog coverage.

Results and discussion
In this section, we study the performance of our proposed approach by answering the research questions presented Section 4.1. First, Section 5.1 takes a concrete example to illustrate how TopFilter + and HybridRec recommend topics to a repository. Afterward, Sections 5.2 and 5.3 report and analyze the experimental results.

Explanatory example
Before answering the research questions, we illustrate the recommendations provided by TopFilter + and HybridRec through a running example. As shown in Table 4, the maintainers of the repository markdown-viewer 18   with nine labels as listed on the Original topics column.
The column Processed topics, shows the outcomes obtained from the preprocessing step on the original topics. The last two columns report the ranked recommendations provided by TopFilter + and HybridRec. Moreover, for each topic, Table 4 lists the corresponding frequencies over the D 2 and D 3 datasets. For instance, the split token rule (see Section 3.2.1) rewrites chromeextension as chrome and extension, while the alias substitution rule generates browser from chrome and firefox. In other words, the rewriting rules refine the mined topics to generate more exhaustive and coherent ones. For instance, it is evident that firefox-extension, chromeextension, and browser-extension can be better represented by their corresponding short forms: browser, firefox, chrome and extension. By looking to the original topics and the preprocessed ones, we can see that the semantics of the topics are preserved but their frequencies are increased. TopFilter + and HybridRec return as output a ranked list of items, and we take the first top-20 topics, and match them with the ground-truth data. It is evident that HybridRec provides more matched items compared to TopFilter + . In the following experiments we show that the preprocessing steps reduce the usage of different terms that have a very close semantics and increase the topic frequency.
The values of the considered quality metrics, i.e., success rate, precision, recall, top rank, and catalog coverage, reported in the next subsections are obtained by calculating their average values on all the folds from the cross-validation process.

RQ 1 : How does HybridRec perform compared to TopFilter + ?
We compare HybridRec with TopFilter + on all the considered datasets, i.e., D 1 , D 2 , and D 3 . Given a testing project p, a certain number of topics τ is used as input, and the remaining ones are saved as ground truth data GT(p). In our experiments, τ is always considered half of the number of topics already assigned to the project under analysis, and the number of neighbor projects k is set to 20. This was identified as the configuration that brings the best performance to TopFilter + [7]. Moreover, we try with different values of N to find out how the size of recommended items impacts the prediction performance.
The average success rate, precision, and recall scores obtained by running the ten-fold cross-validation technique with HybridRec and TopFilter + on D 1 , D 2 and D 3 are reported in Table 5, considering consecutive cut-off values, i.e., N = {1, . . . , 10}. In the table, a better performance -corresponding to higher scores -is printed in bold. Altogether, it is evident that HybridRec outperforms Top-Filter + for any value of N except the success rate for N = 1. For instance, the best success rate obtained by TopFilter + is 0.479 when N = 10, while the corresponding score Answer to RQ 1 . In comparison to the baseline TopFilter , HybridRec yields a substantial performance improvement in terms of success rate, precision, and recall.

RQ 2 : In comparison to MNBN, does CNBN contribute to a better HybridRec performance?
In our previous work [7], we utilized a balanced dataset manually curated from GitHub to improve MNBN's performance as the network is not able to handle unbalanced datasets. Nevertheless, this does not essentially resemble a real-world situation in the context of OSS platforms, where the distribution of topics and repositories is usually not balanced. Thus, we employ an enhanced version of MNBN, namely CNBN, to deal with realistic settings. This research question aims to validate such a hypothesis. To compare with MNBN, we run HybridRec using only the ST component (see Section 3.1), without triggering the subsequent collaborative filtering phase. Figure 7 shows the success rate considering different cutoff values, i.e., N = {1, . . . , 10}. It is evident that using CNBN obtains a better success rate with compared to using MNBN, given that an unbalanced dataset is considered. For instance, the improvement is around 10% with N = 2, i.e., MNBN and CNBN achieve 0.50 and 0.67 respectively in recommending at least one correct topic. As expected, increasing the number of recommended items leads to a better performance, i.e., CNBN's success rate reaches 0.90 with N = 9 and N = 10. This holds also for the MNBN, which however achieves worst performance since its success rate is always lower than 0.65.
To better study the performance of both techniques, i.e., MNBN and CNBN, we compute the precision and recall scores and show them in Table 6. Overall, CNBN improves the results obtained by MNBN for all the considered metrics. By considering all the cut-off values, the precision scores are increased by 10% on average. In particular, precision reaches its maximum at N=2, i.e., 26.57 and 35.00 for MNBN and CNBN, respectively; The minimum value obtained by MNBN is 9.41, while the corresponding score by CNBN is 15.44. Concerning recall, we observe a remarkable improvement when CNBN is adopted, i.e., it increases MNBN's results 30% of the time with N = 10. As expected, increasing the number of the returned items positively affects the performance of both networks. Still, CNBN improves recall from 40.14 to 74.07, while MNBN gains 45.38 as its maximum value. In other words, the usage of CNBN reduces the impact of false negatives during the prediction phase. Finally, the advantage of CNBN is eventually confirmed by the Top-rank metric, which measures the accuracy of the two networks in recommending the first item, i.e., the most probable one.
In particular, the computation shows that CNBN obtains a better performance also in this case by increasing the top rank prediction up to 65.85 while MNBN gets only 49.55.
Answer to RQ 2 . On an unbalanced dataset, compared to MNBN, CNBN improves the prediction performance. As data sources are unbalanced by their nature, adopting CNBN helps obtain a more precise prediction in the field.

RQ 3 : How do the preprocessing steps impact on the HybridRec performance?
We measure the impact of the preprocessing steps on the collaborative filtering recommendation engine on the three datasets collected from GitHub, i.e., D 1 , D 2 and D 3 . As described in Section 4.2, D 2 is obtained from D 1 by filtering out topics that occur in less than 20 repositories, while D 3 is extracted from D 1 by applying the preprocessing rules defined in Section 3.2.1. To compare with the original TopFilter approach [7], we run HybridRec using only the CF component (see Section 3.2).
Given a testing project p, a certain number of topics is used as input, i.e., τ = 5, and the remaining ones are saved as ground truth data, i.e., GT(p) (cf. Section 4). Moreover, we investigate various number of recommended items N = {5, 10, 15, 20} to understand how the size of recommended items impacts the prediction performance of TopFilter. The average success rates obtained by running the ten-fold cross-validation technique with HybridRec on D 2 and D 3 are depicted in Fig. 8.
It is evident that HybridRec yields a better success rate when preprocessed datasets, i.e., D 2 and D 3 , are considered. For instance, it gets SR@1 of 0.715 for D 3 , while for To further study the performance of HybridRec, we depict in Fig. 9 the precision/recall curves (PRCs). For this setting, we varied the recommended items N from 1 to 20, aiming to study the performance for a more detailed recommendation list. Each dot in a curve represents the precision and recall scores obtained for a specific value of N. Furthermore, we fixed k = 20 since this number of neighbors brings the best prediction outcomes among others, while it allows HybridRec to maintain a reasonable execution time. As a PRC close to the upper right corner corresponds to a higher precision and recall, suggesting a better performance [21], Fig. 9 shows that by considering a preprocessed dataset, HybridRec gains a better prediction. In particular, the worst precision-recall relationship is seen by the raw dataset D 1 , while DGP obtains the best one. Overall, these results are consistent with those presented in Fig. 8, i.e., using the preprocessing steps given in Section 3.2.1 enables HybridRec to enhance its performance substantially. Finally, we investigate if HybridRec can provide a wide range of topics to repositories, taking into consideration catalog coverage. This metric measures the percentage of the recommended topics in the training data that the model recommends to a test set, and a higher value corresponds to better coverage. Table 7 reports the average coverage values got from running HybridRec on the datasets, i.e., D 1 , D 2 , and D 3 . The table suggests an evident outcome: we get better coverage by considering longer lists of items. We also studied the impact of the preprocessing step on the coverage results. The catalog values range from 0.013 (obtained for D 1 with N = 1) to 0.765 (obtained for D 2 with N = 20). It is important to recall that D 2 and D 3 are very different with respect to the number of topics, i.e., 455 compared to 6,442. At the same time, the sets of considered repositories are very similar (only 239 projects are discarded in D 2 ) (see Section 4.2). Altogether, we see that using a denser dataset for training, i.e., projects with more topics is beneficial to success rate, accuracy but not to catalog coverage.
Answer to RQ 3 . The preprocessing steps are beneficial to the recommendation process as they allow HybridRec to retrieve more relevant topics, thus improving the recommendation performance.

RQ 4 : What are the key differences between GitHub and MVN Repository, and how do they impact on the whole HybridRec recommendation process?
To investigate HybridRec's performance in a different type of repositories, we run it on a different dataset, namely D M collected from MVN Repository (see Section 4.2). The artifacts available in MVN Repository are labeled with tags, which express similar concepts to GitHub topics, i.e., summarizing their key functionalities. Moreover, each artifact comes with a textual description that can be used as a README file.
There are differences in the nature of topics in GitHub with respect to tags in MVN Repository. In particular, Fig. 9 Precision/recall curves topics for a GitHub repositories are manually assigned by the owner(s). Thus, the resulting set of the given topics might include issues, e.g., duplicate drop, and word spelling, which necessarily require some pre-processing steps. In our proposed approach, we refined the mined GitHub topics by means of the rewriting rules defined in Section 3.2.1. In contrast, MVN Repository tags are well-maintained as they are not freely assigned by developers but presumably audited by a watchdog. This implies that MVN repositories are more curated than those hosted in GitHub, thus limiting errors that might happen due to the manual activities performed by developers. Altogether, this helps enhance the list of possible tags by automatically suggesting new tags, thus improving the reachability of the artifacts.
We conduct experiments for the D M dataset using the same settings applied on the GitHub datasets and compute the quality metrics accordingly. Figure 10 represents the success rate scores obtained by running HybridRec on the Maven dataset, using different values of k, the number of neighbour projects (see Section 3.2). As we can see, the usage of a more curated dataset to make predictions contributes to improving the overall performance, i.e., the maximum success rate obtained by HybridRec reaches almost 0.90 by most of the cut-off values, and eventually, it goes beyond the 0.90 threshold with N = 10.
Concerning the neighborhood parameter, the results show that the best success rate is achieved with k = 15. This means that starting from k = 15, adding more neighbour projects to compute recommendations does not bring any matched tags. Overall, running HybridRec on D M yields a better prediction performance than the results obtained with the D 1 dataset. It is worth noting that no preprocessing has been applied on Maven tags since the hand-written rules are tailored for GitHub topics that need some cleaning due to duplicates or incorrect terms. In contrast, D M does not require such a process since tags are already fine-tuned, and their constituent terms are concretely defined. For instance, there is no need to predict the language programming tag since all projects are written in Java.
We compute precision and recall scores and show them in Fig. 11. Similar to the previous results, running HybridRec on D M contributes to a better performance since the precision and recall are higher, compared to those obtained with the D 1 dataset in Table 5. The precision scores increase by 10% on average with the best configuration, i.e., k = 5. Similarly, the recall scores improve from 0.191 to 0.70 when more recommended items are considered in the ranked list.
The gain in performance is further confirmed with the catalog coverage scores in Fig. 12. HybridRec achieves a maximum catalog coverage of 0.47 with N = 10. This means that the probability that HybridRec recommends correct items is higher if D M is considered. Referring to Table 3, we see that the number of topics in D M is much smaller than that of D 1 , i.e., 489 compared to 19,337. Similarly, D M contains less projects than D 1 , i.e., 2,932 compared to 11,694. Altogether, this demonstrates that even on a small Maven dataset, HybridRec can still obtain a good performance.
Answer to RQ 4 . Compared to topics in GitHub, tags in MVN Repository are more well-defined as they are audited by a central authority. Due to this reason, running HybridRec on input data curated from Maven repositories brings in a better prediction performance.

Discussion
HybridRec has been developed by combining a complement naïve bayesian network with a collaborative-filtering technique. Moreover, we employed a tailored preprocessing phase on the considered topics to clean and refine the input data before feeding it to the recommendation engine. Altogether, the boosting mechanisms help HybridRec retrieve more relevant topics. The empirical evaluation has shown that HybridRec is able to improve the recommendation results compared to the baselines.
As discussed in Section 2.2, projects coming from GitHub exhibit different features that could affect the recommendation outcomes. In this section, by relying on the previously performed experiments, we analyze the five challenges highlighted in Section 2.2.  Concerning Challenge C1 (data redundancy), we have the following investigations. Topics for a GitHub project are manually assigned by the owner(s). Thus, the resulting set of given topics might include issues (e.g., duplicate drop, and word spelling) that necessarily require some preprocessing steps. In the proposed approaches, we refined the mined GitHub topics by means of NLP techniques and word embeddings, e.g., GloVe [22], word2vec [20], and FastText [12]. Furthermore, we adopt a well-structured set of rules to consider frequent patterns as well as to refine them. We have proven that such pre-processing steps improve the overall HybridRec's prediction capabilities.
Another aspect that may impact the outcomes is the topic's distribution. To analyze to what extent this can affect the recommendation items, we make use of a GitHub dataset employed in our previous work [7]. Figure 13 gives an overview on the distribution of topics among repositories for the raw D 1 dataset. From this picture, we can observe that many topics are infrequent, i.e., 14,175 among 15,743 topics are used in less than 11 projects, and just a few topics are widely adopted by more than 200 projects. In other words, the long tail effect has a profound impact on the GitHub datasets.
Challenge C2 involves the structure of available metadata that an OSS ecosystem can provide to perform the recommendation activities. A GitHub repository offers a lot of metadata about the activities of developers as well as their interactions. For instance, the platform keeps track of any user who makes a pull request or modifies the project by adding files through commits. 19 Furthermore, each repository is characterized by textual information, e.g., README files, or Wiki content. In other words, OSS mining tasks can be characterized by two dimensions: the employed recommendation technique and the proper metadata that can be extracted from the input data. Table 8 describes these two dimensions in terms of the techniques and datasets presented in this work. As shown in the table, the collaborative-filtering component relies solely on the list of initial terms to perform the recommendations considering GitHub platform since all needed relations concerning projects are encoded in the matrix used to fed the recommendation engine. Furthermore, it employs the initial list of terms to enable the recommendation engine. As stated in Section 5.3, using a larger number of inputs definitively improves the tool's performance. In contrast, the ST component needs textual files to predict GitHub topics, namely README files.
In the context of collaborative development, the availability of popularity mechanisms play an important role to foster the project's visibility. As we mentioned in the description of C3, GitHub is reliant on a well-defined popularity mechanism that considers stars, forks, and featured topics. Both the ST and CF components benefit from using popular projects as the initial dataset, i.e., we observed the best results with respect to various quality metrics. Another important factor to consider is the fact the GitHub exposes the selected featured topics in a dedicated repository 20 and web pages. 21 Despite this, the crawling phase requires a well-structured process to extract the needed information properly.
Concerning C4 (Crawling and Data Dump), we see that even when the required data is available, the gathering process could be a daunting task. Data dumps can simplify this activity by offering all the information in a structured way, for instance with a database. As we mentioned before, data dumps for GitHub have been provided by an external tool, i.e., GHTorrent [11]. Nevertheless, data contained in these collections is not suitable for supporting automatic tagging activities. Therefore, we made use of a crawler that can download the data belonging to the repositories. In our previous studies, we exploited the PyGitHub library since GitHub offers all the required APIs.
With respect to Challenge C5 (Configurations of the underpinning recommendation systems), besides the chosen OSS platform, the capabilities of the underpinning recommendation system are affected by the internal parameter configurations. The possible tuning parameters depend on the algorithm employed to perform the recommendation as well as the metadata offered by the platform. Table 9 describes the parameters employed by the two considered components, i.e., ST and CF, to bring the best prediction performance.
As described in Section 5.4, we have these values to reach a balanced dataset which is crucial to earn a better performance. This process has been conducted manually for the GitHub datasets as they exhibit heterogeneous support for each featured topic in terms of projects. Concerning the collaborative-filtering component, this consideration mainly impacts the selection of the cut-off values t. In particular, to cope with different impacts of the long-tail effect, we 20 https://github.com/github/explore 21 https://github.com/topics varied t, for example, to evaluate the CF component with the GitHub dataset we changed t from 5 to 20 with 5 as the step, while t has been shifted from 1 to 5 with 1 as the step. Moreover, because of the different levels of similarities between the projects of GitHub we used a different number of neighbors k when TopFilter recommends suitable labels.

Threats to validity
In this section, we highlight possible threats that may have an impact on the outcomes of the conducted experiments. We also list the countermeasures adopted to mitigate these issues.
Internal validity The overall recommendation capabilities of the presented approaches could be compromised by the dataset features, i.e., the number of unique tags, lack of support for each category. To cope with these risks, we analyzed several configurations obtained by means of a well-defined data preprocessing phase. These settings allow for a more comprehensive evaluation of the two approaches. External validity As specified in Section 4.2, the data extraction phase has been conducted by relying on the GitHub API. This could compromise the quality of the obtained information as we cannot access the entire knowledge due to the rate limits. Such a threat has been mitigated by considering popular repositories. In this way, we chose the most representative projects that help minimize any potential bias. with HybridRec might be susceptible to bias. We carefully examined existing work in the domain, but no comparable tool is available with an available sound replication package. Thus, we used the same evaluation conducted to evaluate MNBN in the original work, bringing the best results. We adapted the structure of the two techniques to mitigate any possible pitfalls in the overall evaluation process.

Related work
In this section, we review related studies to our work, focusing on the following two main topics: (i) recommending GitHub topics; (ii) categorization of open source software projects; and (iii) advanced techniques for recommender systems.

Recommendation of GitHub topics
Before the introduction of topics, GRETA (Graph-Based Tag Assignment for GitHub Repositories) was the very first attempt to automatize GitHub tagging [4]. This approach exploited two concepts, namely ETG (entity-tag graph) and a random walk with a restart algorithm. The former is a graph that uses StackOverflow and GitHub data. In contrast, the latter is an algorithm that iteratively explores the global structure of the network to estimate the proximity (affinity score) between two nodes. This aims to build the graph by linking GitHub repositories with StackOverflow posts which share user and lexical similarities. Afterwards, tags are propagated using the walk algorithm. Repo-Topix [10] was released right after the introduction of GitHub topics as a means to automatically recommend topics by relying on README files and the textual content of a given repository. Standard NLP techniques and a regression model have been applied at the early stage of the process to exclude biased terms from the recommendation process. In addition, the tool makes use of an adapted version of the Jaccard distance to enhance the quality of retrieved items further. Repo-Topix has been preliminarily evaluated using the n-gram ROUGE-1 metrics to count the number of overlapping terms between the recommendation items and the original repository description.
Recently, Izadi et al. [13] proposed Repologue, a multilabel classification tool that combines various state-of-theart classifiers, i.e., MNB, logistic regression, FastText, and DistilBERT. The first step involves topic augmentation using hand-written association rules to derive more information from a given repository and its tags. Then, Repologue compares the mentioned models employing several word embeddings techniques to discover the best one. This can be a potential baseline for our approach. Unfortunately, the tool is not available when writing this paper. 22 GHTRec [36] exploits the BERT model to recommend personalized trending repositories 23 by using GitHub topics. First, the underpinning neural network is fed with preprocessed README files to predict topics given a repository. Afterwards, the system captures the user's topic preferences by relying on its commits. The retrieved list is eventually reranked by computing two similarity methods on the topic vectors, i.e., cosine similarity and shared similarity between the developer and a trending repository.
Sally [31] is an automated approach aiming to categorize Maven projects by relying on bytecode analysis and tags extracted from StackOverflow. Given a JAR file as the input, the tool obtains relevant data related to the project, i.e., class names, class fields, and method names. Such results are filtred considering tags excerpted from StackOverflow. In parallel, a weighted graph is computed to identify and encode project dependencies. Then, primary tags are computed by considering only the input project. In contrast, the weighted graph is used to obtain dependencies' tags. These two kinds of tags are combined to retrieve final recommendations in a tag cloud format.
Velázquez-Rodríguez and De Roover [32] have recently proposed MUTAMA, an automated multi-label tagging approach to support Maven projects. Different from our approach, the project's tags are extracted from bytecode, employing a well-founded tool. For each project, the tool takes as the input a pair of groupId-artifactId-version and the corresponding set of tags. In such a way, MUTAMA learns which projects have been tagged similarly by considering the Java classes and methods employed. Then, several machine learning algorithms provided by the MEKA tool are fed with the extracted tags and the vectors obtained from the analyzed source code.
Zhang et al. [34] solved the task of keyword-driven hierarchical classification, proposing a tool with three main modules as follows: (i) HIN (Heterogeneous Information Network) construction and embedding; (ii) key-word enrichment; and (iii) topic modeling and pseudo document generation. During the first step, HIN creates a graph to capture all the interactions between the core elements of GitHub repositories, like Users, Words, Names, to name a few. The keyword enrichment step is devised to cope with the problem of scarcity and bias of the keyword provided by users. The machinery in the last step is employed to feed a classifier with pre-labeled repositories and the overfitting related to the usages of the keywords only. The tool has been tested on two different datasets, a machine learning taxonomy with 1,600 examples and a bioinformatics one with 876 projects.
LabelGit [26] is a dataset for the classification of Java software projects. The authors proposed an approach to 23 https://github.com/trending excerpt information directly from source code and dependency graphs. The former is obtained by a combination of techniques like code2vec and fastText, while the latter is a graph describing the dependencies between classes. Similarly, ClassifyHub [28] is a GitHub projects classification algorithm based on ensemble learning. The approach has been evaluated on a dataset consisting of 681 projects and obtained a precision of about 60% and a recall of 58%.

Automated categorization of OSS projects
With Lascad (Language-Agnostic Software Categorization and Similar Application Detection) [2], the authors proposed a tool based on information retrieval techniques, i.e., LDA (Latent Dirichlet Allocation) and hierarchical clustering. First, the tool extracts terms from the dataset's source code, followed by a refining phase where stop words and similar code-specific terms are removed. The second step is the most important one; LDA is applied on the terms corpus to get the topics, then these topics are grouped by using the hierarchical clustering technique. Finally, projects are assigned to each group and then labeled. Based on this result, during the third step, given an application as input, LASCAD retrieves corresponding software ranked by similarity.
Five different machine learning algorithms have been used to classify OSS projects [19], i.e., SVM, Naive Bayesian, Decision Tree, and IBK classifier. The training phase for each model relies on bytecode analysis performed by JClassInfo, a well-founded tool. Each model is fed with API methods, classes, and packages excerpted from 3,286 projects labeled belonging to SourceForge. The results show that SVM outperforms the other models in terms of precision, recall, and success rate. Different from our work, this study compares five machine learning models considering only 22 different categories.
Wang et al. [33] propose a tag recommendation based on a semantic graph (TRG) to classify projects extracted from two OSS communities, i.e., OhLoh and Freecode. The first step is to analyze semantic correlations between the software description and tags and encode them in a hierarchical graph. Then, a list of tags is retrieved given an input project by relying on the constructed graph. TRG eventually ranks the final list of recommendations according to their probability.
An agglomerative hierarchical clustering for taxonomy construction name AHCTC has been proposed [18] to categorize software projects stored on the OhLoh platform. To this end, the approach integrates the technique mentioned above with a topic model based on the well-established LDA algorithm to identify thematic spaces composed of similar tags. Finally, the proposed clustering technique can extract the most central tag from the thematic space to classify the given project properly.

Boosting techniques for recommender systems
Research in recommender systems has amassed momentum in recent years, and various techniques have been proposed to equip recommender systems with the ability to retrieve highly relevant items, improving their prediction capability. In this section, we review the most notable studies in this domain, and associate them with our work.
Fan et al. [9] conceived an end-to-end framework to improve recommender systems by means of a knowledge graph, which can capture users' preferences. To this end, a user preference matrix (UPM) was built to project refined item embeddings from their latent space into the user embedding space. They evaluated the potential probability of the user preferences towards the target items through inner product results of user embedding with the projected item embedding representation. Such a technique can come in handy for the collaborative-filtering module of HybridRec, where it is used to further improve its ability to find relevant topics.
Taraghi et al. [29] presented a hybrid recommender system for open journal systems, which uses content based filtering as the basic system extended by the results of a collaborative-filtering approach. The collaborative filtering algorithm works on the basis of weighted user paths, and the authors conceived a method to reduce the undesirable effect of the navigation path, allowing the collaborativefiltering module to retrieve relevant serendipitous and novel recommendations. We anticipate that such a technique can be employed to improve the collaborative-filtering component of HybridRec, and this task is in our calendar for future research.
Tran et al. [30] provided a systematic overview of existing research on recommender systems in the healthcare domain, paying attention to different recommendation scenarios and approaches. Among the issues raised in the paper, we assume that the problem of constructing user profile is interesting and worth being investigated in the scope of the HybridRec work. We plan to improve the similarity computation by enriching projects with additional metadata such as developers, stargazers, or source code.
Given that computing similarity is important in the whole recommendation process, Al-Shamri [1] studied existing similarity modifiers and proposed various similarity modifiers that consider the active and training users' statistical parameters. In fact, computing similarity also plays a crucial role in recommender systems in software engineering [25], and we anticipate that the incorporation of new similarity algorithms -like the ones proposed by Al-Shamri [1] -into the internal design of HybridRec might help it further improve its performance.
According to a careful observation, we see that HybridRec is among the first hybrid recommender systems in software engineering, as the system works by combining different algorithms for producing recommendations. This paves the way for further improvements by considering various boosting algorithms as we have introduced in this subsection. We consider this issue as our future work.

Conclusion and future work
OSS platforms play an important role in aggregating and handling projects developed by a prolific community. Automatic tagging is one of the most valuable techniques to improve their discoverability, even though it requires a lot of effort to produce useful outcomes. In this paper, we conceived HybridRec, a hybrid recommender system working on top of a stochastic network and a collaborativefiltering technique to recommend topics. We performed an empirical evaluation on real-world datasets to study HybridRec by comparing it with state-of-the-art toots. The results showed that the newly conceived approach improves our former recommender systems substantially. More importantly, we demonstrated that HybridRec can increase its prediction performance on well-curated data sources.
For future work, we plan to improve the proposed framework further by analyzing the source code of OSS projects to compute similarity for HybridRec. Furthermore, we plan to conduct a well-structured user study by involving developers to evaluate HybridRec's outcomes. Last but not least, we will create a taxonomy of GitHub topics by grouping tags with a pairwise ranking algorithm. As a shortterm plan, we plan to propose an integrated approach to generate a taxonomy of software domains by inspecting GitHub topics. In such an approach, GitHub repositories are fetched to collect different topics. Then, various filtering steps will be used to reduce the number of topics. Afterward, a reconciliation step is applied, where the selected topics are linked to Wikidata in order to help with the disambiguation of these terms. The final step is to create a discrete rank of the selected application domains by clustering algorithms.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/. Riccardo Rubei is a PhD student at the University of L'Aquila, Italy. He is working on mining techniques to analyse open source software with the aim of providing developers with useful real-time recommendations.