PHANTOM: Curating GitHub for engineered software projects using time-series clustering

Pickerill, Peter; Jungen, Heiko Joshua; Ochodek, Mirosław; Maćkowiak, Michał; Staron, Miroslaw

doi:10.1007/s10664-020-09825-8


Open access
Published: 27 May 2020

Volume 25, pages 2897–2929, (2020)

Empirical Software Engineering


Abstract

Context

Within the field of Mining Software Repositories, there are numerous methods employed to filter datasets in order to avoid analysing low-quality projects. Unfortunately, the existing filtering methods have not kept up with the growth of existing data sources, such as GitHub, and researchers often rely on quick and dirty techniques to curate datasets.

Objective

The objective of this study is to develop a method capable of filtering large quantities of software projects in a resource-efficient way.

Method

This study follows the Design Science Research (DSR) methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is represented as a feature vector for clustering using the k-means algorithm.

Results

Using the ground truth from a previous study, PHANTOM was shown to be able to rediscover the ground truth on the training dataset, and was able to identify “engineered” projects with up to 0.87 Precision and 0.94 Recall on the validation dataset. PHANTOM downloaded and processed the metadata of 1,786,601 GitHub repositories in 21.5 days using a single personal computer, which is over 33% faster than the previous study which used a computer cluster of 200 nodes. The possibility of applying the method outside of the open-source community was investigated by curating 100 repositories owned by two companies.

Conclusions

It is possible to use an unsupervised approach to identify engineered projects. PHANTOM was shown to be competitive compared to the existing supervised approaches while reducing the hardware requirements by two orders of magnitude.


1 Introduction

Software project analysis used to be performed on small corpora of projects, as seen in industry-based case studies (Feldt et al. 2013; Staron et al. 2013a, 2013a). However, since the widespread use of code sharing sites such as GitHub,^{Footnote 1} researchers now have access to a massive corpus of software projects that can be analysed. Consequently, we see more and more studies in the area of software quality that base their research on mining GitHub repositories. Unfortunately, the quality of these repositories is unclear. It has been shown that most repositories can be considered to be of low quality, and could therefore skew analysis (Kalliamvakou et al. 2016).

For instance, in the recent editions of the Mining Software Repositories (MSR) conference, a number of studies performed analyses based on GitHub repositories. The reported dataset sizes ranged from one to over 80,000 repositories (e.g. Cito et al. (2017), Gonzalez et al. (2017), Noten et al. (2017), Sadat et al. (2017), Zhu et al. (2017), Rausch et al. (2017), Macho et al. (2017)), so even the largest covers just 0.08% of the repositories currently hosted on GitHub. Most of these studies needed to filter out low-quality repositories from the collection, where low-quality denotes those repositories that do not fit into the desired sample. One of the most frequently used approaches to filtering such repositories is to rely on project popularity. Unfortunately, this approach has been shown to perform poorly (Munaiah et al. 2017). It is clear, then, that for researchers to make use of the large number of repositories available on GitHub, new filtering methods are required.

Munaiah et al. (2017) proposed a filtering method that outperformed traditional filtering approaches by using supervised classification. With this framework, over 1.8 million GitHub repositories have been analysed, which makes it one of the largest datasets (Cosentino et al. 2017). While this is an impressive achievement, a number of key issues with the method remain. First, the method analyses multiple artefacts to calculate the necessary metrics, i.e. Git logs, configuration files, source code, GitHub issues, license, and unit tests. Consequently, the method has high computing-resource demands (in the study, a computer cluster with over 200 nodes was used to process the largest dataset, and the task still required a month to complete). Second, some artefacts require dedicated processing, depending on the programming language and libraries used in the project. As a result, the proposed tool might require further work to adapt it to changing trends in programming technologies. Finally, the method requires multiple project artefacts to be downloaded and analysed; some of them might be of no interest to many studies that would use the method to pre-process their data.

With this in mind, in this paper, we propose a new method called PHANTOM^{Footnote 2} (Project History Analysis of Time-Series Method) that has the same purpose as the method proposed by Munaiah et al. (2017). It uses unsupervised learning to distinguish between “engineered” and “not engineered” projects at very large scale, solely based on their development history, while using commodity hardware. We argue that it is possible to achieve results comparable to those of Munaiah et al. (2017) (which we refer to as the baseline study) by applying the proposed unsupervised-learning method to filter repositories. Since the proposed method is unsupervised, it could also help researchers to establish a ground truth by automatically grouping projects which have similar characteristics. By using the development history exclusively, we simplify the acquisition of data used for analysis, making it possible to perform the analysis on commodity hardware. We validate PHANTOM on the datasets published in the baseline study (Munaiah et al. 2017), extending them with an additional 200 repositories. The method’s efficacy is shown by using this dataset consisting of 850 labelled repositories from open source and 100 additional repositories owned by companies. PHANTOM’s applicability for large-scale analyses is shown when it is applied to the dataset of over 1.8 million software repositories.

The structure of this paper is as follows. Section 2 presents and discusses related studies. The problem statement and the design of the validation study are presented in Section 3. Section 4 describes the proposed method, which is further validated in Section 5. We discuss the findings from the validation study in Section 6. Finally, we conclude our findings in Section 7.

2 Related Work

Large-scale project analyses (including code analysis) are mainly performed on open-source code repositories. With over 100 million Git repositories available, GitHub is becoming one of the most important sources of data that could be used to study software source code and project-related phenomena. Unfortunately, the quality of the data available on GitHub is questionable. For instance, Kalliamvakou et al. (2016) performed a quantitative analysis of project metadata and a manual investigation of a sample of 434 projects stored on GitHub, and learned that the assumption that every repository available on GitHub contains a software project does not hold. In the studied sample, only 63.4% of the repositories were related to software development. Also, most of the projects were personal: 67% of the projects had only 1 committer, while 87% had 2 or fewer. Finally, most repositories remained inactive or showed low activity (only 25% of the studied repositories were active for over 100 days). Therefore, there is a need for filtering repositories that contain engineered software projects in order to reduce the chance of biasing the results of MSR studies by poorly engineered projects.

The simplest and most often used approach to filtering unwanted repositories is to rely on project popularity, e.g. GitHub Stargazers (e.g. Padhye et al. (2014), Ray et al. (2014), Casalnuovo et al. (2015), Silva et al. (2016), and Russell et al. (2018)). The rationale behind using popularity as a filtering criterion is that it is assumed to be positively correlated with quality. However, as Sajnani et al. (2014) have shown, this might not be true.

Many studies used multi-stage pipelines to select a desirable sample of projects from GitHub. Most of them query publicly available services such as GHTorrent.^{Footnote 3} For instance, Vasilescu et al. (2015) and Yu et al. (2015) selected projects by sequentially querying GHTorrent and Travis API in a funnel-like manner to study continuous integration practices. A similar approach was used by Gharehyazie et al. (2019) who studied software clones.

Some studies go beyond querying the metadata collected by GHTorrent by processing project repositories and extracting metadata from their artefacts. Good examples of studies implementing project analysers capable of processing such repositories are Gabel and Su (2010) and Hebig et al. (2016). They feature a high level of automation, yet the resources required (e.g. time, computing power) are large. Gabel and Su (2010) conducted a study about the uniqueness of source code within C++, C# and Java applications taken from SourceForge.^{Footnote 4} They approached the question of “how unique is software?” by performing lexical analysis on a dataset of 6000 projects (in total 420 million lines of code). Following this, the percentage of unique code within 30 selected projects was measured. The study identified a general lack of uniqueness within software, where most programs are made from code snippets found in other software. The analysis took four months, with source code compared at the token level using an optimised tool. This shows that source-code analysis requires a lot of time even for a small number of projects. Similarly, in the study by Hebig et al. (2016) that aimed to find GitHub repositories containing UML models, the process of downloading and processing the repositories took 6 weeks.

Cosentino et al. (2017) looked into 80 studies that mined GitHub. They found that the two largest data sources, GHTorrent and GitHub API^{Footnote 5}, were criticised by researchers. The GitHub API was said to be a source of problems, given request limitations and errors in the data returned. They state that “the GitHub API request limit acts as a barrier to getting data from GitHub”, which affects curated datasets (such as GHTorrent) and individual researchers. Many researchers criticise the size and up-to-dateness of services like GHTorrent. Cosentino et al. report that, of the studies they explored, only three looked at more than 100,000 repositories. These findings show that despite using these services, researchers struggle to collect up-to-date data at large scale.

Robles et al. (2017) identified similar issues with GitHub when collecting information about twelve million repositories. First, due to API limits, they calculated that it would have taken fourteen months to collect all of the data using a single API key, adding that “[…] this would have made the data gathering unfeasible.” To gather the data in a feasible time, twenty keys were used. Second, 25% of the twelve million repositories had been moved or deleted between the time GHTorrent collected its data and the researchers’ requests to the GitHub API, which wasted keys and analysis time. This shows that data collection time can be reduced by using volunteered keys for data collection, a practice employed by GitHub mirroring sites, like GHTorrent and Boa.^{Footnote 6}

Kalliamvakou et al. (2016) observed that there are different ways to merge commits and GitHub cannot detect all of them. Therefore, some merges are not reported through the API. A further peril when using the GitHub API is that, unlike cloning with Git, the GitHub API does not redirect requests when a repository has been moved. Accessing a moved repository with the API will result in a “not found” status code.

Nuñez-Varela et al. (2017) conducted a systematic review of 226 papers studying source-code metrics. They identified that most studies considered one programming language and paradigm. Over 85% of the studies use object-oriented metrics. This is reflected in the available public datasets and metric-extraction tools. While the paper does not reason why researchers focus on one language, it indicates that cross-language analysis of source code may not be straightforward. They claim that although the use of metrics “theoretically can be applied to any language”, in practice, it is complex and tools do not support all languages.

Recently, Munaiah et al. (2017) proposed a new automatic method along with a tool called reaper that can be used to curate GitHub for engineered software projects. The method classifies GitHub repositories using seven dimensions: Community, Continuous integration, Documentation, History, Issues, License, and Unit testing. reaper was used to download and process 1.8 million GitHub repositories. The analysis required a computer cluster with over 200 nodes and took over a month to complete. In the next step, the authors sampled the collected dataset and manually labelled 500 repositories, of which 200 were used for validation. In addition, they analyzed and manually labelled 150 repositories owned by organizations such as Amazon, Apache, Facebook, Google, and Microsoft. They trained and validated four classifiers (Random Forest and custom score-based classifier) using the collected data and applied them to predict the number of repositories hosted on GitHub that contain engineered software projects. Depending on the classification algorithm and training dataset, the predictions of the number of such repositories ranged between 6% and 70%. The most accurate classifier predicted that only around 24% of the repositories contained engineered projects. Also, they showed that the proposed classifiers outperform filtering repositories by their popularity.

Our study proposes a light-weight method for filtering projects that complements the methods proposed in the above studies. The method processes only one project artefact, the Git log, and extracts five measures directly from it (Integration Frequency, Commit Frequency, Integrator Frequency, Committer Frequency, and Merge Frequency). Each of the measures (being a time series) is characterised by a set of over forty manually-designed features (e.g. duration, amplitude, positive and negative gradients). Since the method directly analyses the Git logs, it does not suffer from the limitations of GitHub API or GHTorrent. Secondly, it is programming language agnostic because it does not analyse the source code. Its closest counterpart is the filtering method proposed by Munaiah et al. (2017), since both methods are based on machine-learning algorithms and have the purpose of filtering engineered software projects by a defined quality. The main difference between the two approaches is that they use different artefacts to extract information about the projects and different machine-learning algorithms. The approaches to filtering projects proposed by other authors were developed to select projects having particular, explicitly known characteristics that were interesting for a given study (e.g. Vasilescu et al. (2015) selected projects not being forks, written in certain programming languages, and having 200+ pull requests). In the case of distinguishing between engineered and not engineered projects, it is difficult to explicitly provide such characteristics. However, even for such studies and others that require processing project artefacts other than Git logs, the proposed method could be used as a scanning tool in the first step to quickly identify and reject repositories containing not engineered projects, and so avoid wasting resources on downloading and processing their artefacts.

3 Research Methodology

The research conducted in this paper follows the Design Science Research (DSR) methodology (Hevner et al. 2004). In particular, we decided to follow the guidelines provided by Wieringa (2014).

DSR is a problem-solving paradigm that focuses on creating and evaluating artefacts and solutions for practical purposes. In design science, research follows the engineering cycle—it is conducted iteratively until the objectives are reached. The engineering cycle consists of five steps:

1. Problem investigation: understanding the problem and outlining the steps to solve the problem and evaluate the solution.
2. Treatment design: designing an artefact which is used in the study (e.g. a computer program).
3. Treatment validation: evaluating the artefact in a context similar to the context where it is to be used (e.g. lab environment).
4. Treatment implementation: introducing the artefact into the context where it is to be used (e.g. software development organization).
5. Implementation evaluation: evaluating the effects of the introduction of the artefact on the context (e.g. checking whether the program improved the process of software development).

The first three steps of the engineering cycle are called the design cycle. According to Wieringa, the design cycle is what is usually performed by researchers when designing an artefact while the remaining two steps (treatment implementation and evaluation) can be done once the artefact is in the hands of its intended users.

The design cycle begins with an investigation of the problem to gain an in-depth understanding of the causes. The acquired knowledge is then used to design a treatment. Treatment is defined as the interaction of an artefact with a problem context. The goal of the third step, treatment validation, is to confirm that the designed treatment satisfies all the requirements and to determine whether it is able to treat the problem.

In our study, we performed a full design cycle to create a new method, PHANTOM. Here, the artefact is PHANTOM, a method for analysing software projects, and the problem context is filtering engineered projects from GitHub. First, we investigate the problem and define the requirements for the method by analysing the limitations of the reaper tool proposed in the study by Munaiah et al. (2017). Then we perform a validation of the method by applying it to the datasets provided in the baseline study and comparing the results between PHANTOM and reaper.

The goal of the two remaining steps in the engineering cycle is to apply the treatment to its real-world context and investigate how the two interact. Treatment evaluation is a scalable process and usually requires multiple applications of the treatment to get a full understanding of its usefulness. Although we do not aim to evaluate the method in this study, we apply PHANTOM to replicate the task of filtering nearly 1.8 million real software repositories stored on GitHub in the baseline study, which could be perceived as an initial evaluation of the method in its real-life context, and apply it to recognise engineered projects in industrial Git repositories.

In the remaining part of this section, we investigate the problem of filtering software repositories that justifies the need for designing PHANTOM. Then, we discuss the validation and evaluation procedures used in this study.

3.1 Problem investigation

Munaiah et al. (2017) provide an abstract definition of an engineered software project as “a software project that leverages sound software engineering practices in one or more of its dimensions such as documentation, testing, and project management.” Later in their study, they customise it by stating that “(a) an engineered software project is similar to the projects contained within repositories owned by popular software engineering organizations such as Amazon, Apache, Microsoft and Mozilla and^{Footnote 7} (b) an engineered software project is similar to the projects that have a general-purpose utility to users other than the developers themselves.” Since we partially base our study on the dataset provided by Munaiah et al. (2017), we follow these definitions of an engineered software project.

The existing methods for filtering projects (and in particular filtering engineered projects) make trade-offs between the depth of analysis and performance. The lightweight approaches use community-based filtering strategies; for example, popularity, issues, and forks are commonly used but do not measure the internal quality of a project. Unfortunately, these approaches are typically less accurate than those relying on analysing project artefacts (Munaiah et al. 2017) and cannot be used on repositories without community interaction, for instance, those that are hosted on private servers. The methods that perform deeper analyses of project artefacts (e.g. reaper proposed by Munaiah et al. (2017)) are more accurate but require high computing power. These methods often analyse multiple project artefacts, including source code, which is time-consuming and requires implementing and maintaining dedicated analysis tools (e.g. to analyse code written in different programming languages). To mitigate performance problems, these methods often rely on the metadata collected by publicly available services such as GHTorrent. Unfortunately, multiple studies have reported problems and limitations of this approach. Also, the existing methods perform filtering based on cross-sectional assessments of projects, i.e. without taking into account their history. Finally, the methods that base filtering on supervised machine-learning algorithms require manual labelling of the training data, which is time-consuming.

Therefore, it seems that anyone creating a new filtering method that analyses project artefacts should focus on reducing hardware requirements. Also, it is worth considering basing the filtering on chronological measures, which reflect not only the current status of the project but also its history. Finally, it would be preferable if the method did not require manual labelling of data and instead used unsupervised methods, such as clustering.

Chronological measures taken from repositories can be represented as a time-series. In the field of time-series clustering, a number of techniques are available. Euclidean Distance and DTW (Ratanamahatana and Keogh 2004) are most commonly used; however, they cannot compare time-series of very different lengths. Feature-based approaches have been shown to be an effective way to compare time-series of different lengths and to reduce the dimensionality of the data significantly (e.g. Wang et al. (2006), Deng et al. (2013), Fulcher and Jones (2014), Guo (2008)). However, the extracted features must describe the time-series correctly to ensure accurate clustering.
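To illustrate why feature-based comparison copes with time-series of different lengths, the following minimal Python sketch reduces a series of any length to a fixed-size set of features. The four features are loosely modelled on the examples named in this paper (duration, amplitude, positive and negative gradients), but their exact definitions here are assumptions made for illustration and not the definitions used by PHANTOM.

```python
from typing import Dict, List


def extract_features(series: List[float]) -> Dict[str, float]:
    """Summarise a time-series of any length as a fixed-size feature dict.

    The feature definitions below are illustrative assumptions.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    # Differences between consecutive observations (empty for a single point).
    gradients = [b - a for a, b in zip(series, series[1:])]
    return {
        "duration": float(len(series)),
        "amplitude": float(max(series) - min(series)),
        "max_positive_gradient": float(
            max((g for g in gradients if g > 0), default=0.0)
        ),
        "max_negative_gradient": float(
            min((g for g in gradients if g < 0), default=0.0)
        ),
    }


short = extract_features([0, 3, 1])
longer = extract_features([2, 2, 5, 9, 4, 4, 0, 1])
# Both series map to vectors with the same keys, so they can be compared
# directly even though their lengths differ.
assert short.keys() == longer.keys()
```

Because every series is mapped to the same keys, a distance between two projects can be computed on the feature vectors rather than on the raw series.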

Git logs seem to be among the most frequently analysed artefacts for extracting features used to filter projects. They also allow the extraction of chronological measures describing trends in the development of a software project (e.g. commit frequency, number of contributors). Commit frequency can tell much about the process and practices employed by a software development team and reflects the changes made in the ways of working (Zhao et al. 2017). Also, the frequency of commits correlates with the number of bug-introducing changes in software (Eyolfson et al. 2014). The relationship between commit frequency and overall quality of a project is also observed by Kolassa et al. (2013), who stated that “the commit frequency is a fast indicator to determine if the project is healthy because it has regular contributions and if the developers are productive by checking whether they contribute regularly.” Kalliamvakou et al. (2015) observed that industrial projects have more merges resulting from pull requests. Even the number of reverted commits could be an indicator of how the project team operates. According to Shimagaki et al. (2016), many reverted commits could be avoided if a team has good communication practices and high change awareness. Therefore, it is reasonable to assume that commit frequency might be a useful source of information when filtering engineered projects. We could expect that engineered projects will exhibit characteristics of commit frequency over time (including merge commits) that are not present in not engineered projects.
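As a rough illustration of how a commit-frequency time-series could be derived from commit timestamps, the Python sketch below counts commits per fixed-width window. The one-week window and the function name are assumptions chosen for this example; the passage above does not prescribe a window size.

```python
from collections import Counter
from typing import Iterable, List

SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # assumed window size


def weekly_commit_frequency(timestamps: Iterable[int]) -> List[int]:
    """Count commits per week, starting from the first commit.

    `timestamps` are Unix timestamps in seconds. Weeks without commits
    are kept as zeros so that the series stays evenly spaced.
    """
    stamps = sorted(timestamps)
    if not stamps:
        return []
    start = stamps[0]
    counts = Counter((t - start) // SECONDS_PER_WEEK for t in stamps)
    n_weeks = (stamps[-1] - start) // SECONDS_PER_WEEK + 1
    return [counts.get(week, 0) for week in range(n_weeks)]


# Two commits in the first week, one in the second, none in the third,
# and one in the fourth.
series = weekly_commit_frequency(
    [0, 10, SECONDS_PER_WEEK, 3 * SECONDS_PER_WEEK + 5]
)
assert series == [2, 1, 0, 1]
```

The same windowing could be applied to merge commits or to distinct committers per window to obtain further chronological measures.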

Taking the above into account, we define the following requirements for a new project-filtering method, which we call PHANTOM:

Req1: All measures must be extractable from a Git log. We want to limit the number of artefacts that need to be processed to a single one, the Git log.
Req2: Time-series must be characterised by feature vectors accurately. We want to extract chronological measures from the Git log; since feature-based clustering seems to be the best option for comparing time-series of different lengths, we need to find a set of features that will correctly capture the most important characteristics of the time-series (Esling and Agon 2012).
Req3: The established ground truth can be discovered using unsupervised learning (without knowing the decision classes). We have to verify that the proposed method is able to identify clusters that are meaningful from the perspective of the problem of filtering repositories containing engineered software projects.
Req4: The method performs well on commodity hardware at large scale. This is the key requirement; it addresses the most important limitation of the existing filtering methods that analyse project artefacts.
Req5: The method provides comparable accuracy to supervised methods. The existing approaches make a trade-off between accuracy and performance; the proposed method is supposed to provide similar accuracy while reducing the hardware requirements.
Req6: The method can be used to filter projects in different contexts. There is a threat that the performance of an unsupervised method can be specific to a single dataset only; therefore, we need to perform studies on different datasets coming from both the open-source community and industry to mitigate that risk.
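Req3 relies on clustering feature vectors without access to labels. The following pure-Python sketch of the standard k-means procedure (Lloyd's algorithm) shows the general idea on toy two-dimensional vectors. It is only an illustration: the toy data, the random initialisation, and the parameter defaults are assumptions, and it is not PHANTOM's implementation.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(
    points: List[Vector], k: int = 2, iterations: int = 50, seed: int = 0
) -> Tuple[List[int], List[List[float]]]:
    """Cluster `points` into `k` groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct input points (an assumption;
    # other initialisation schemes exist).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: _sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centroids.append(_mean(members) if members else centroids[c])
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids


# Toy feature vectors forming two well-separated groups.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, _ = kmeans(toy, k=2)
assert labels[0] == labels[1] == labels[2] != labels[3]
```

Note that k-means only returns anonymous cluster indices; deciding which cluster corresponds to engineered projects requires a separate step, for example comparing cluster membership against an existing ground truth.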

3.2 Treatment validation

In order to validate PHANTOM, we performed a series of simulation studies on the datasets published in the baseline study.

Munaiah et al. (2017) published five datasets. These datasets are formatted as collections of GitHub repository URLs. Four of these datasets (Organization^{Footnote 8}, Utility, Negative Instances, and Validation) are used as ground truth in the validation of the proposed method, with the fifth being referred to as the Large dataset (a collection of over 1.85 million URLs).

To create the ground truth datasets, Munaiah et al. followed a manual curation process in order to label repositories. Each repository was independently judged by two or three researchers as either engineered or not, according to agreed guidelines. If the judgement about a repository differed, it was either discussed further or discarded. The datasets are summarised in the list below:

Organization: a set of 150 engineered repositories. Engineered projects are defined as similar to those of popular software engineering companies such as Amazon, Apache and Facebook. The researchers manually investigated repositories to find those projects that matched the definition.
Utility: another set of 150 engineered repositories. It defines an engineered project as one with a general-purpose utility, that is to say, a repository that has value to users other than the developers. The repositories were randomly sampled from 1,857,423 GitHub repositories.
Negative Instances: 150 repositories that are not engineered. The repositories do not conform to either of the definitions of an engineered project. The dataset resulted from the selection process of the Utility dataset, which means that it contains the first 150 repositories that both authors rejected.
Validation: 100 engineered and 100 not engineered project repositories. The selection process is similar to that of the Utility dataset and shares its definition of what is engineered and what is not.
Large dataset: a collection of 1,857,423 GitHub URLs. In contrast to the other datasets, there is no ground truth, meaning the quality of the repositories is unknown.

Since the four ground-truth datasets contained labelled data, they allowed us to validate PHANTOM’s accuracy and compare it with the accuracy of reaper proposed in the baseline study. To evaluate the accuracy of the methods, we used the popular prediction quality metrics Precision, Recall, F-Measure, and Matthews Correlation Coefficient (MCC).

When predicting the label of an object and comparing the predicted label to the actual label there are four possible outcomes: the object is correctly classified to a positive class—true positive (TP), the object is falsely classified to a positive class—false positive (FP), the object is correctly recognised as not belonging to the positive class—true negative (TN) and the object is falsely recognised as not belonging to the positive class—false negative (FN). We calculate Precision, Recall, F-Measure, and Matthews Correlation Coefficient (MCC) by aggregating information about the outcomes for all classified cases.
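For reference, the four metrics can be computed from the TP, FP, TN and FN counts using their standard definitions, as in the Python sketch below. The counts in the usage example are made up for illustration and are not results from this study; returning 0.0 when a denominator is zero is a convention assumed here.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, Recall, F-Measure and MCC from confusion-matrix counts.

    Undefined ratios (zero denominator) are reported as 0.0 by convention.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=90, fn=10)
assert abs(metrics["precision"] - 0.9) < 1e-9
assert abs(metrics["mcc"] - 0.8) < 1e-9
```

Unlike Precision, Recall and F-Measure, MCC also takes the true negatives into account, which makes it informative when the two classes differ in size.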

We extended the previous study by comparing the accuracy of the baseline and PHANTOM models on another sample of projects from the Large dataset. The goal of this analysis was two-fold. First, it allowed us to compare the accuracy of the models on a new dataset. Second, we were able to verify the validity of predictions regarding the number of engineered software projects hosted on GitHub reported in the baseline study and the ones obtained with the PHANTOM models.

We classified all instances from the Large dataset using the baseline and PHANTOM models (we used the best-performing PHANTOM model for each of the measures considered by PHANTOM). Since we wanted to understand the similarities and differences between the predictions made by the models, we used stratified sampling to select 5 samples, each containing 50 project instances (250 instances in total):

True/True: a sample of instances which the best-performing baseline models and the PHANTOM models trained on the Utility dataset^{Footnote 9} unanimously classified as engineered projects.
False/False: a random sample of instances that were unanimously^{Footnote 10} classified as not engineered projects.
False/True: a random sample of instances which were unanimously predicted to be not engineered projects by the PHANTOM models and unanimously predicted to be engineered projects by the best-performing baseline models.
True/False: a random sample of instances which were unanimously predicted to be engineered projects by the PHANTOM models and unanimously predicted to be not engineered projects by the best-performing baseline models.
Mixed: a random sample of the remaining repositories not fitting into any of the above categories. These were instances for which there was only partial agreement between the prediction models.
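The five strata above can be sketched in Python as follows. Each repository is assumed to carry one boolean vote per PHANTOM model and per baseline model (True meaning "engineered"); the data layout, function names, and fixed seed are hypothetical and serve only to make the stratification rule concrete.

```python
import random
from typing import Dict, List, Sequence, Tuple

Votes = Tuple[Sequence[bool], Sequence[bool]]  # (PHANTOM votes, baseline votes)


def stratum(phantom_votes: Sequence[bool], baseline_votes: Sequence[bool]) -> str:
    """Name the stratum as PHANTOM/baseline; assumes non-empty vote lists."""
    phantom_all, phantom_none = all(phantom_votes), not any(phantom_votes)
    baseline_all, baseline_none = all(baseline_votes), not any(baseline_votes)
    if phantom_all and baseline_all:
        return "True/True"
    if phantom_none and baseline_none:
        return "False/False"
    if phantom_none and baseline_all:
        return "False/True"
    if phantom_all and baseline_none:
        return "True/False"
    return "Mixed"  # partial agreement within or between the model groups


def stratified_sample(
    predictions: Dict[str, Votes], per_stratum: int = 50, seed: int = 0
) -> Dict[str, List[str]]:
    """Draw up to `per_stratum` repositories at random from each stratum."""
    groups: Dict[str, List[str]] = {}
    for repo, (phantom_votes, baseline_votes) in predictions.items():
        groups.setdefault(stratum(phantom_votes, baseline_votes), []).append(repo)
    rng = random.Random(seed)
    return {
        name: rng.sample(sorted(repos), min(per_stratum, len(repos)))
        for name, repos in groups.items()
    }


assert stratum([True, True], [True]) == "True/True"
assert stratum([False, False], [True, True]) == "False/True"
assert stratum([True, False], [True]) == "Mixed"
```

Sampling the same number of instances from every stratum deliberately over-represents the disagreement cases, which are the ones most informative about how the two families of models differ.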

All of the selected instances were then independently labelled by two or three authors using the same criteria as in the baseline study for the Utility dataset. By using this dataset, we were able to analyse how much the baseline and PHANTOM models complement or contradict each other. Consequently, we were able to assess the impact of the differences in the way they classify repositories on their predictions of the number of engineered projects hosted on GitHub.

Finally, we applied the best-performing PHANTOM models to filter industrial projects to investigate whether it is possible to use PHANTOM in different contexts. We introduced a new dataset (Industry) that contains 100 repositories belonging to two companies. Company A is one of the fastest-growing software agencies in the European Union. With its presence on the market for more than ten years, it has successfully delivered over 1400 projects. Currently, it employs more than 500 employees and develops products for multiple sectors, e.g. FinTech, Healthcare, Tourism, E-commerce, Entertainment, and e-Government. It maintains around 500 Git repositories in a private GitHub space. Company B develops embedded software for infrastructure projects. The dataset was labelled by the employees of the companies.

4 PHANTOM—A Developed Artefact

PHANTOM (Project History Analysis of Time-Series Method) is a software-repository filtering method that addresses the problems identified in Section 3.1. The process used in PHANTOM is presented in Fig. 1 and its steps are explained in the subsequent paragraphs.

The input to the method is a collection of repository URLs, which locate the repositories to be analysed. Each repository is cloned to the machine using Git (Step A). Next, the Git log of the cloned repository is generated (Step B) in the format defined in Table 1, where each column is separated by a comma (","). From now on, Git log will refer to this format.
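Steps A and B could be approximated with standard Git commands, as in the Python sketch below. The column layout used here (commit hash, parent hashes, author timestamp, committer timestamp) is an assumption made for illustration and is not the layout of Table 1; the helper names are hypothetical. The `git clone --bare`, `git -C`, `git log --all` and `--pretty=format:` options and the `%H`, `%P`, `%at`, `%ct` placeholders are standard Git features.

```python
import subprocess
from typing import Dict, List

# Assumed column layout, NOT the paper's Table 1:
# commit hash, parent hashes (space-separated), author UTS, committer UTS.
LOG_FORMAT = "%H,%P,%at,%ct"


def clone_command(url: str, target_dir: str) -> List[str]:
    """Step A: a bare clone fetches the history without a working tree."""
    return ["git", "clone", "--bare", "--quiet", url, target_dir]


def log_command(repo_dir: str) -> List[str]:
    """Step B: emit one comma-separated line per commit on all branches."""
    return ["git", "-C", repo_dir, "log", "--all", f"--pretty=format:{LOG_FORMAT}"]


def fetch_log(url: str, target_dir: str) -> List[str]:
    """Clone a repository and return its log lines (requires Git installed)."""
    subprocess.run(clone_command(url, target_dir), check=True)
    result = subprocess.run(
        log_command(target_dir), check=True, capture_output=True, text=True
    )
    return result.stdout.splitlines()


def parse_log_line(line: str) -> Dict[str, object]:
    """Parse one line of the assumed format into a small record."""
    commit, parents, author_ts, committer_ts = line.split(",")
    parent_list = parents.split()
    return {
        "commit": commit,
        "parents": parent_list,
        "author_ts": int(author_ts),
        "committer_ts": int(committer_ts),
        "is_merge": len(parent_list) > 1,  # merge commits have several parents
    }


record = parse_log_line("c3,a1 b2,1500000000,1500000100")
assert record["is_merge"] is True
```

Because only the commit history is needed, a bare clone avoids checking out any source files, which is in line with the goal of processing a single artefact.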

Table 1 The format of the generated Git logs; each column is separated by a comma (UTS refers to Unix Timestamp, GUID refers to Globally Unique Identifier)
