1 Introduction

Security breaches have become regular occurrences, with devastating consequences and costs for organisations and society (Ponemon-Institute, IBM-Security, 2017). Unfortunately, security as a non-functional requirement is under-addressed in many software development projects. Cois and Kazman (2015) analysed 400,000 projects on GitHub and found that only 1.4% explicitly identify security-related issues. It is therefore relevant to understand the security picture of a project by assessing product security and the project’s management capabilities (Ellison et al., 2010).

Researchers (Cois & Kazman, 2015; Cleland-Huang et al., 2006; Hindle et al., 2013; Ray et al., 2016) have investigated security concerns in software management repositories (e.g. issue trackers and version control systems), seeking relevant quantitative measures that could be derived from security analysis of such repositories. Such measures could assist project managers and development teams in making informed decisions regarding the security posture of a project by providing answers to questions such as: How many security-related changes have been made in the system? How many security-related bugs are left unresolved? What is the average window-of-exposure (in days) for security-related issues in a project? However, these studies are project specific, and we do not know how their results generalise beyond the environments studied.

Researchers have constructed text classification models by building a document term matrix (DTM) from a project’s document corpus and selecting promising features from the DTM (Sparck Jones, 1972; Salton et al., 1983; Salton & McGill, n.d.; Salton & Buckley, 1988; Wu et al., 2008). This approach limits the model to the environment where it is constructed. Gegick et al. (2010) advised against applying a text classification model for identifying security messages trained on one system to a different system; in their findings, cross-project classification suffered a significant reduction in performance. In Anvik et al. (2006), the precision of the model decreased from 64 to 6%, with generally low recall, when applied to another project’s dataset, showing that developing a generic text classification model that is practically useful across different systems is non-trivial.

Many benefits could be derived from a model that works across heterogeneous projects. First, it could eliminate the need to create a unique model for each project in an organisation. Organisations sometimes maintain several hundred projects, making the development of per-project classification models messy and difficult. Second, in many projects and small businesses (Unterkalmsteiner et al., n.d.), time-to-market is critical, and as a result, security activities are not prioritised. An off-the-shelf model that can be quickly leveraged to identify security-related issues in such projects would reduce the time and effort required to build a model for each project. Third, this could provide the research community with a benchmark for security-related issues across heterogeneous projects and organisations.

The goal of this work is to investigate the hypothesis that a generic text classification model can be developed for classifying security-related messages in software development project communications. Generalising to unseen datasets is the goal of a classification model. Text classification models in the security domain are typically constructed from an oracle dataset (Nigam et al., 2000) containing experts’ (human or system) classifications of a sample of the population that a classification algorithm should learn from. The requirement for an oracle limits generalisation to unseen data.

We address the problem of generalisation by augmenting the training set with security keywords and using these terms as features to train a text classification model. Our research question is:

RQ: Does our classification model outperform models built using project-specific terms when generalising to new projects? In other words, can we build a model on one project and transfer it to another project using a transfer project prediction (TPP) approach (e.g., Peters et al., 2017)?

Our contributions in this paper are:

  • A new and promising approach to weighting and training a generalised text classification model for identifying security messages across projects.

  • A list of security terms that can be used independently to identify security-related issues in any repository.

  • A new approach of using security terms divided into different categories to build classification models.

  • A security classifier framework and an add-on for the JIRA platform (footnote 1), based on our best-performing trained model.

The remainder of the paper is organised as follows: In Sect. 2, we discuss studies related to our work. In Sect. 3, we describe our dataset selection and analysis approach to answer our research questions. In Sect. 4, we present the results of our study. We provide discussion in Sect. 5. We discuss various threats to the validity of our study in Sect. 6. Lastly, we conclude in Sect. 7.

2 Related work

Researchers have applied text classification approaches to topics such as text filtering, document categorisation and automatic text indexing, and to various text sources such as the World Wide Web, Internet news feeds, electronic mail, corporate databases, medical patient records, and digital libraries (Nigam et al., 2000; Sebastiani, 2002). These approaches leverage supervised (Debole & Sebastiani, 2004) or unsupervised learning techniques (Le & Mikolov, 2014). A supervised technique uses an already labelled dataset to train a classification algorithm. In an unsupervised approach, a dataset is labelled using certain heuristics, such as distance measures, to cluster related texts (Yan et al., 2017).

Our problem is one of text classification for predicting security messages. We discuss related works in text classification and specifically its application to predicting security messages in software development repositories such as issue tracking systems and version control systems. Researchers have applied a variety of approaches to identify and characterise security-related messages in software development repositories. A problem in this domain is the need for a labelled and large enough training dataset sample from which a classifier can learn accurately (Nigam et al., 2000).

One approach for selecting a training dataset is to select a sample of text from a corpus by using a set of security keywords (Cleland-Huang et al., 2006; Pletea et al., 2014; Ray et al., 2014). Pletea et al. (2014) mined security discussions in commits and pull requests on GitHub by using a selected set of security keywords and then used the dataset to train a classifier to understand emotions in security-related versus non-security-related discussions. Ray et al. (2014) used a restricted set of keywords covering different quality attributes (e.g. security) to select a training dataset from commit messages on GitHub as a sample representation to determine the code quality of different programming languages. We have applied text classification to develop a generic model for identifying security-related messages in different software development projects.

An alternative to generic security keyword lists is to mine them from existing repositories where security issues have been labelled. Such a training sample can be obtained from messages labelled by developers, security champions in an organisation, or security researchers (Gegick et al., 2010; Ohira et al., 2015; Riaz et al., 2014). This approach can be fraught with subjectivity and inaccuracies (Gegick et al., 2010; Scandariato et al., 2014). Existing training samples can also be mined from security advisory reports (Zaman et al., 2011) (e.g. Mozilla Firefox Security Advisories, Apache Security Advisories, or CVEs), or from reports generated by a static analysis tool (Scandariato et al., 2014). The MITRE CWE framework has also been used to manually categorise and label security-related messages (Tyo, 2016).

Gegick et al. (2010) used text classification models constructed on manually labelled bug reports to identify security bug reports that were manually mislabelled as non-security bug reports in a large Cisco project. Their approach was based on document term matrix construction to identify classification features. Their model identified 78% of the security bug reports previously misclassified as non-security bug reports by the Cisco engineers, with 98% probability. However, the model performance decreased significantly when applied to other systems.

Wijayasekara et al. (2014) applied text mining to the Linux bug database to detect ‘hidden impact’ bugs (HIB). Their approach is based on term-frequency determination from the Linux bug reports and mapping to the publicly disclosed vulnerability data from MITRE CVE within the same period. Our study differs from Gegick et al. (2010) and Wijayasekara et al. (2014) by exploiting security domain knowledge to extract promising features from relevant security sources.

Cois and Kazman (2015) developed text classifiers that could identify security-related messages in issue trackers and applied their models on the large-scale open source Chromium software project. They have used features constructed as n-grams from the text artefacts of Chromium and recorded an average of over 85% in recall, precision and f-score. Our approach differs from Cois and Kazman (2015) because we have manually defined our set of features whereas they have used terms obtained automatically from the project’s text artefacts.

Zaman et al. (2011) studied the characteristics of security vs. performance bugs in terms of fix rates, change rates, and who fixes the bugs. They have used security bugs from the Mozilla Foundation Security Advisory to represent a security dataset. Our study aims to develop a generic model for identifying security-related messages in bug reports and related repositories.

Riaz et al. (2014) manually classified requirement documents in the healthcare domain to identify security-relevant requirements. They applied machine-learning algorithms to develop automated tools, built on their previously classified oracle dataset, to assist requirements engineers during analysis. The results show 79% recall and 82% precision.

Cleland-Huang et al. (2006) developed a text classifier using term frequency to identify non-functional requirements in requirement specification documents that were developed by MS students. When applied to an industrial dataset using terms from the initial documents, the model suffered a reduction in performance. Their results improved significantly when the terms were augmented with terms from the industrial dataset.

Hindle et al. (2013) performed topic labelling of various non-functional requirements using semi-unsupervised and supervised techniques on three large-scale relational databases. They defined a set of generic keywords to build models that could generalise on the different projects. Their results show an average f-score between 39 and 48% and a range between 4 and 90%. Unlike the studies above that addressed a broad range of non-functional requirements, we have focused primarily on security.

Scandariato et al. (2014) applied a static code analysis tool to collect security-related records of 20 Android applications. By applying a text mining approach to software source code, they formulated their features as a set of source components (actual source files) with their associated frequencies and then built a prediction model to predict whether a component is vulnerable or not. That study is based on mining the actual source code, which differs from our work focusing on the natural language contained in software development repositories.

Peters et al. (2017) observed that the presence of security-related keywords in both security and non-security bug reports (SBR, NSBR) can lead to mislabelling of SBRs and reduce prediction performance for within project prediction (WPP) and transfer project prediction (TPP). They proposed a framework, FARSEC, to filter and rank bug reports so as to reduce NSBRs containing security-related keywords before building a prediction model. They applied their framework to data and security classifications drawn from Ohira et al. (2015) and from the Chromium project. Their results showed a 38% reduction in mislabelled SBRs. The goal of our work is similar to that of Peters et al. (2017). However, our work differs in the approach used to construct a generalised model: while Peters et al. (2017) use filtering techniques on the dataset, we use a harvested set of security keywords for our models. On the other hand, we investigate how to improve a model’s performance by increasing the weight of security terms in the training dataset, which bears some similarity to their work.

3 Methodology

In this section, we first present and describe our datasets. Next, we present the project-specific approach for extracting features to train a text classification model for a project. Then, we present our proposed generalised modelling approach. Finally, we describe our experiment setup, the metrics, and our approach for answering the research questions.

3.1 Dataset selection

For testing our models, we have used published datasets from nine different sources (Table 1), containing messages that have been manually tagged as security related. We now describe the bug report (BR) datasets we have used in this study.

  1. Industrial systems (Comm): We collected data from two projects recorded in the JIRA production environment of one of our industrial partners. The organisation has a security champion and has created a security field in JIRA where they manually label created issues (bug, story, task, or improvement) as security related or not. This classification has been applied to one of the projects. We downloaded 946 records, of which 20 are labelled as security related.

  2. The high-impact bug dataset by Ohira et al. (2015) (HIB): This publicly available dataset has been manually classified by a team of researchers to allow its use in research studies. The four projects covered by this dataset are Apache Camel, Apache Derby, Apache Ambari and Apache Wicket. The dataset contains a total of 4000 records, with 160 records manually labelled as security related.

  3. Chromium dataset: The Chromium dataset was prepared and used in Peters et al. (2017). The dataset is available online (footnote 2); we downloaded the entire dataset and reused it as-is in our environment.

  4. Apache projects dataset (Apache): We downloaded data from Apache JIRA open-source projects where developers have manually labelled their recorded issues as security related. We searched the Apache JIRA instance at ‘https://issues.apache.org/jira’ using the search criteria ‘labels = security OR labels = ‘security’ OR labels = security-issue’. The query yielded a total of 1283 records for different projects in Apache; these records form our security bug reports (SBRs). We separately downloaded 10,000 records for 10 different products using the search criteria ‘labels is EMPTY’, from which we randomly extracted 1283 records as our non-security-related bug reports (NSBRs).

  5. Mozilla dataset (Mozilla): We downloaded security bugs from the Mozilla bug tracking system at ‘https://bugzilla.mozilla.org’ using the search criteria ‘group: security’, yielding a total of 3276 security bug reports (SBRs). We separately downloaded 10,000 records for 31 different products using the search criteria ‘status: all’, from which we randomly extracted 3276 records as our non-security-related bug reports (NSBRs).

  6. Orthogonal defect classification-vulnerabilities (ODCV): ODCV (Morrison et al., 2018a) is a publicly available dataset (https://sites.google.com/a/ncsu.edu/odc-v/) of 583 security vulnerabilities and 583 non-security defects collected from three open-source projects (Chrome, Firefox, phpMyAdmin), classified according to a security-enhanced version of Orthogonal Defect Classification (Chillarege et al., 1992).

Table 1 Datasets used in our study

3.2 Design and analysis approach

3.2.1 Feature selection approaches

In this section, we describe three feature selection approaches that we have used in our experiments:

3.2.1.1 tf and its variants—project-specific modeling approach

Term frequency (tf) is a standard metric for determining features in document classification problems (Manning et al., 2008). Term frequency is computed by tokenizing each document in a document corpus and counting the number of occurrences (frequency) of each unique term in the entire corpus. A document term matrix (DTM) can then be constructed from the document corpus using these terms. As an example, consider three documents:

D1 = ‘owasp top software errors’

D2 = ‘buffer overflow is dangerous’

D3 = ‘software errors include buffer overflow errors’

A DTM is represented by tokenizing the documents and generating a term frequency for each text token as:

$$DTM= \left[\begin{array}{cccccccccc} & \mathrm{owasp}& \mathrm{top}& \mathrm{software}& \mathrm{errors}& \mathrm{buffer}& \mathrm{overflow}& \mathrm{is}& \mathrm{dangerous}& \mathrm{include}\\ {D}_{1}& 1& 1& 1& 1& 0& 0& 0& 0& 0\\ {D}_{2}& 0& 0& 0& 0& 1& 1& 1& 1& 0\\ {D}_{3}& 0& 0& 1& 2& 1& 1& 0& 0& 1\end{array}\right]$$

where the topmost row represents the unique tokens in the entire document corpus, and the leftmost column represents each document. The values in the matrix represent the frequency of terms (number of times the term appears). An algorithm can then use term weighting or normalisation techniques (Manning et al., 2008; Sparck, 1972; Forman, 2003; Sebastiani, 2002) to determine the most interesting terms that could represent each document from the document corpus.
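For illustration only, the minimal Python sketch below (ours, not the authors' code) builds the term-frequency DTM for the three example documents; the whitespace tokenisation is an assumption made for simplicity.

```python
from collections import Counter

# Example corpus from the text (D1-D3).
docs = {
    "D1": "owasp top software errors",
    "D2": "buffer overflow is dangerous",
    "D3": "software errors include buffer overflow errors",
}

# Tokenise by whitespace and count term frequencies per document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary (columns of the DTM) is the set of unique tokens in the corpus.
vocab = sorted({term for c in counts.values() for term in c})

# Each row of the DTM holds the raw term frequency tf(t, d).
dtm = {name: [counts[name][term] for term in vocab] for name in docs}

print(vocab)
for name, row in dtm.items():
    print(name, row)
```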

In this study, we have considered the normalised variant of term frequency-inverse document frequency (tf-idf) (Manning et al., 2008) as used in Peters et al. (2017).

The normalised tf-idf is computed as:

Normalised term frequency: \({ntf}_{t,d}=0.5+ \frac{0.5\times {tf}_{t,d}}{{max}_{t}({tf}_{t,d})}\) (1)

where \({tf}_{t,d}\) represents the number of times, term t appears in document d, and \({max}_{t}\) represents the frequency of the term with the maximum occurrence in document d. The smoothing term, 0.5, is used to avoid a large swing in \({ntf}_{t,d}\) from a small change in \({tf}_{t,d}\).

Inverse document frequency: \({idf}_{t}=log\frac{N}{{df}_{t}}\) (2)

where N = number of documents in the entire document corpus, and \({df}_{t}\) represents the number of documents where term t appears. The intuition behind idf is that terms that are frequent in all documents may not discriminate very well and will thus be penalised with low idf. Conversely, terms that occur in a few documents may be more interesting for the documents where they appear and will thus be weighted with higher idf.

Term frequency-inverse document frequency: \({tf-idf}_{t}={ntf}_{t,d}\times {idf}_{t}\) (3).

The tf-log is computed as

Term frequency (Logarithm): \({tf-log}_{t}=1+\mathrm{log}({tf}_{t,d})\) (4).

A project-specific approach can use any of tf, ntf, tf-idf or tf-log metrics based on a threshold (e.g. top n terms) to determine the most promising features. For example, Peters et al. (2017) use the top 100 terms with the highest tf-idf values as their feature set.
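As a worked illustration of Eqs. 1–4, the sketch below (ours, not the authors' implementation) computes ntf, idf, tf-idf and tf-log for the example corpus and keeps the top-n terms by tf-idf; the base-10 logarithm, the value of n, and the use of the maximum tf-idf over documents to aggregate a per-term score are assumptions.

```python
import math

# Raw term frequencies per document (from the DTM example above).
tf = {
    "D1": {"owasp": 1, "top": 1, "software": 1, "errors": 1},
    "D2": {"buffer": 1, "overflow": 1, "is": 1, "dangerous": 1},
    "D3": {"software": 1, "errors": 2, "include": 1, "buffer": 1, "overflow": 1},
}
N = len(tf)  # number of documents in the corpus

def ntf(t, d):
    """Eq. 1: normalised term frequency with 0.5 smoothing."""
    return 0.5 + 0.5 * tf[d].get(t, 0) / max(tf[d].values())

def idf(t):
    """Eq. 2: inverse document frequency (base-10 log assumed)."""
    df = sum(1 for d in tf if t in tf[d])
    return math.log10(N / df)

def tf_idf(t, d):
    """Eq. 3: normalised tf weighted by idf."""
    return ntf(t, d) * idf(t)

def tf_log(t, d):
    """Eq. 4: logarithmic term frequency."""
    return 1 + math.log10(tf[d][t])

# Project-specific feature selection: rank terms by their best tf-idf
# score over all documents and keep the top n (n = 5 for illustration).
all_terms = {t for d in tf for t in tf[d]}
term_scores = {t: max(tf_idf(t, d) for d in tf if t in tf[d]) for t in all_terms}
top_n = sorted(term_scores, key=term_scores.get, reverse=True)[:5]
print(top_n)
```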

3.2.1.2 Our generalised modelling approach

In this approach, we first manually define a security keyword list by extracting security keywords from popular and relevant security sources, as depicted in Fig. 1. The feature set used in this work is based primarily on keywords extracted from the RFC4949 (Internet Security Glossary, footnote 3) and NVD (footnote 4) datasets. We also used MITRE CWE (footnote 5), OWASP (footnote 6), SANS (footnote 7), and programming language exceptions (e.g. PHP, footnote 8). We parsed and tokenized the full text, removed English stop words, and manually reviewed the collected terms. We reviewed the other listed sources (CWE, OWASP and SANS) manually for security keywords by examining their high-level documents. For example, for CWE, we mainly reviewed the ‘Research Concepts’ view, where the graph gives a high-level description of the CWE errors. Similarly, for OWASP, we reviewed annotated categories such as attack and control, and for SANS, we reviewed the top 25 software errors and extracted relevant terms. Next, we classified the keywords into four categories, namely:

  • Assets: set of terms that represents data or information that should be protected by an entity (e.g. PII, credit card, ecash, token, keys, and session)

  • Control: list of terms that represents implemented security controls or mitigations (e.g. authentication, checksum, captchas and encryption)

  • Attack/threats: list of terms that represents security attacks or threats to a system (e.g. xss, backdoor, injection, vulnerability, untrusted data)

  • Implicit: list of terms that do not fall under the other categories but are often implicitly security related (e.g. incorrect, wrong, invalid, restrict, null pointer exception)

Fig. 1 Extracting security feature list from security data sources

Our approach relies on this set of keywords that we have defined to form a feature set. A key point in machine learning is that better features tend to outperform a cleverer algorithm (Sparck, 1972; Salton & Buckley, 1988; Wu et al., 2008; Forman, 2003). Domain knowledge can improve data representation and thus improve the model. In addition, the feature space is reduced, making the modelling operations less computationally expensive (Wijayasekara et al., 2014) than when using frequent terms computed from a document term matrix (Feinerer, 2013).
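For illustration only, a small fragment of such a categorised keyword list might look like the sketch below; the keywords shown are taken from the category descriptions above, and a real list would be far larger (see Table 7). The helper `feature_set` is ours.

```python
# A tiny, illustrative subset of the harvested security keywords,
# grouped into the four categories described above.
SECURITY_KEYWORDS = {
    "asset":    ["pii", "credit card", "ecash", "token", "keys", "session"],
    "control":  ["authentication", "checksum", "captcha", "encryption"],
    "attack":   ["xss", "backdoor", "injection", "vulnerability", "untrusted"],
    "implicit": ["incorrect", "wrong", "invalid", "restrict",
                 "null pointer exception"],
}

def feature_set(categories):
    """Combine selected categories (e.g. ['attack', 'control']) into one feature set."""
    return sorted({kw for c in categories for kw in SECURITY_KEYWORDS[c]})
```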

3.2.1.3 Classification models and performance metrics

Based on the approaches described in Sects. 3.2.1.1 and 3.2.1.2, we derive four different classification models based on two feature selection methods (see Fig. 2):

  1. Model fsec-ext+ uses the harvested security keywords as the feature set and includes the security keywords from the attack/threat, control, asset and implicit lists as records in the training dataset. This approach aims to increase the weight of the security terms by including them in our training set and labelling them as SBRs (see Table 2).

  2. Model fsec-tfidf uses the tf-idf metric (Eq. 3), preset to the top 100 terms as used in Peters et al. (2017). This metric is project specific, as the terms are determined from the security bug reports of the project. Following the argument that the top 100 terms span nearly all the feature families of vulnerability reports (Bozorgi et al., 2010), we adopt these terms for our experiments, which also allows us to compare our results.

  3. Model fsec-ext uses the harvested security keywords as the feature set, as in the fsec-ext+ model, but excludes the features from the training dataset.

  4. Model fsec-tfidf+ uses the tf-idf metric (Eq. 3), as in fsec-tfidf, but includes the features in the training dataset.

Fig. 2 Experiment setup and modelling approach

Table 2 Example of text and security keywords added to form our training dataset
3.2.1.4 Performance metrics

We report the recall, precision, probability of false alarm (pf), f-score, g-measure and area under the receiver operating characteristic curve (aucroc) (Powers, 2011) to show the performance of the text classification models (Sebastiani, 2002; Peters et al., 2017). We included pf and g-measure in order to compare our work with Peters et al. (2017). The g-measure is the harmonic mean between recall (true positive rate) and true negative rate (100 − pf), while the f-score is the harmonic mean between recall and precision. The roc curve measures the relationship between the true positive rate and the false positive rate (aka pf) of a classifier at different thresholds, and the auc computes a single metric giving the area under the roc curve. This allows a fair comparison of the performance of one model against another. The metrics are computed from true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) where:

TP = number of security records correctly identified as security records

TN = number of non-security records correctly identified as non-security records

FP = number of non-security records incorrectly identified as security records

FN = number of security records incorrectly identified as non-security records

as follows

$$\mathrm{Recall}= \frac{TP}{TP+FN}$$
$$\mathrm{Precision}=\frac{TP}{TP+FP}$$
$$pf=\frac{FP}{FP+TN}$$
$$\text{F-score}=\frac{2 \times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
$$\text{g-measure}=\frac{2 \times \mathrm{Recall} \times (100-pf)}{\mathrm{Recall}+(100-pf)}$$
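A direct transcription of these definitions into code could look like the following sketch (ours, for reference); values are returned as percentages to match the 100 − pf term in the g-measure, and degenerate cases (empty denominators) are not handled.

```python
def metrics(tp, tn, fp, fn):
    """Compute recall, precision, pf, f-score and g-measure (all in %)."""
    recall = 100.0 * tp / (tp + fn)
    precision = 100.0 * tp / (tp + fp)
    pf = 100.0 * fp / (fp + tn)                      # probability of false alarm
    f_score = 2 * precision * recall / (precision + recall)
    g_measure = 2 * recall * (100 - pf) / (recall + (100 - pf))
    return {"recall": recall, "precision": precision, "pf": pf,
            "f-score": f_score, "g-measure": g_measure}
```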
3.2.1.5 Constructing training and test matrices from feature set

We use the hand-crafted keywords (custom dictionary list) for the fsec-ext model as the feature vector. In the case of fsec-tfidf, the 100 terms forming the dictionary list are generated from the SBRs of the training dataset, as used in Peters et al. (2017). In both cases (fsec-ext and fsec-tfidf), the derived features are used to construct the training and test matrices as shown in Fig. 2. We use the tf-log metric (Eq. 4) when a word matches the feature vector; otherwise, we record a ‘0’. The example below demonstrates how the document term matrix is constructed using a predetermined dictionary list.

Assume a document contains 2 texts with classification categories as follows:

$$\left[\begin{array}{cc}\mathrm{Text}& \mathrm{Classification}\\ \mathrm{Diffucult\;to\;place\;caret\;at\;the\;end\;of\;a\;line\;in\;Show\;All\;Tags\;view}&\;0\\ \mathrm{Allow\;Personal\;Security\;Password }> 30\;characters& 1\end{array}\right]$$

Given a set of security keywords, F determined by fsec-ext, or fsec-tfidf, such that

F = {password, xmli, security, overflow}, we arrive at vectors:

V(text1)F = [0,0,0,0]

V(text2)F = [1,0,1,0]

and a matrix: \(M\left({text}_{1},{text}_{2}\right)=\left[\begin{array}{cccc}0& 0& 0& 0\\ 1& 0& 1& 0\end{array}\right] \to \left[\begin{array}{c}0\\ 1\end{array}\right]\)
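The following sketch (ours, not the authors' implementation) reproduces this construction: each document is scored against a fixed dictionary, recording the tf-log weight (Eq. 4) when a dictionary word occurs and 0 otherwise. The alphabetic tokenisation is an assumption for illustration.

```python
import math
import re

def doc_vector(text, dictionary):
    """Vectorise one document against a fixed dictionary using tf-log weights."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    vector = []
    for term in dictionary:
        tf = tokens.count(term)
        vector.append(1 + math.log10(tf) if tf > 0 else 0.0)  # Eq. 4, else 0
    return vector

dictionary = ["password", "xmli", "security", "overflow"]
texts = [
    "Diffucult to place caret at the end of a line in Show All Tags view",
    "Allow Personal Security Password > 30 characters",
]
matrix = [doc_vector(t, dictionary) for t in texts]
print(matrix)  # [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
```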

3.2.1.6 Text cleaning and transformation approaches

We have used the four HIB datasets from Ohira et al. (2015) and the cleaned and scrubbed version of the Chromium dataset from Peters et al. (2017). In addition to the preprocessing already applied to these datasets, we remove stop words from all nine datasets and remove unwanted terms from the Chromium and HIB datasets using the list from Peters et al. (2017). We then retain only alphabetic text from each dataset by removing characters matching the regular expression [^a-z A-Z]. After these steps, we tokenise the text and transform the documents (SBR or NSBR) into a document term matrix as described in Sect. 3.2.1.5.
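A minimal version of this cleaning step, under the assumption that the stop words come from a generic English list (the exact list used in the study is not reproduced here), might be:

```python
import re

# Placeholder stop-word list; the study uses a standard English stop-word list.
STOP_WORDS = {"the", "a", "an", "to", "of", "in", "is", "at", "and", "or"}

def clean(text):
    """Keep alphabetic text only (regex [^a-z A-Z]) and drop stop words."""
    text = re.sub(r"[^a-z A-Z]", " ", text)
    return [t.lower() for t in text.split() if t.lower() not in STOP_WORDS]

print(clean("Allow Personal Security Password > 30 characters"))
# ['allow', 'personal', 'security', 'password', 'characters']
```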

3.2.1.7 Generating training and test dataset

We split the Ohira and Chromium datasets by date, exactly as reported in Peters et al. (2017), into 50% past and 50% present. The training dataset represents the past, while the test dataset represents the present. In the case of the ODCV, Apache, Mozilla, and Comm datasets, we sort by the ‘Summary’ field and split into a 50% training and 50% test dataset. Table 3 lists the properties of our training and test datasets in terms of the number of security bug reports (SBR) and non-security bug reports (NSBR). The test dataset is unseen data that is used to evaluate the true performance of the constructed model.

Table 3 Properties of the training and test data
3.2.1.8 Class imbalance and sampling techniques

The Wicket, Derby, Camel, Ambari, Chromium and Comm datasets are highly skewed, with SBRs accounting for 1%, 8.8%, 3.2%, 2.9%, 0.5%, and 2.1% of the total bug reports respectively. Imbalanced data is known to reduce the performance of a classifier on the class where the data is sparse. The BR datasets are highly skewed, with very few reports labelled as SBR. We deal with this imbalance in our training dataset by either oversampling or undersampling (He & Garcia, 2009) the NSBRs depending on the ratio of SBRs to NSBRs in the project. We treat the sampling ratio of NSBR to SBR in our training dataset as a treatment in our experiment, to determine which sampling ratio produces an adjusted training dataset that can best be learned by the classifier. In other words, we do not change the number of SBRs in our training dataset; rather, we adjust the number of NSBRs by a factor of the SBR count, as listed in Algorithm 1. In line 6, instances from the NSBRs are randomly picked by using a random number generator that returns an integer bounded by the size of the NSBR set.

Algorithm 1 Adjusting the NSBR sample relative to SBR in the training dataset
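Algorithm 1 is presented as a figure in the original paper and is not reproduced here; the sketch below is our interpretation of the sampling it describes, where the NSBR set is randomly over- or undersampled to a chosen multiple of the SBR count. The helper name and the use of sampling with replacement when oversampling are assumptions.

```python
import random

def adjust_nsbr(sbr, nsbr, ratio, seed=1):
    """Randomly resample NSBRs so that len(result) = int(ratio * len(sbr)).

    ratio == 0 keeps the original NSBR set; ratio == 1.0 gives SBR:NSBR of 1:1.
    """
    if ratio == 0:
        return list(nsbr)
    rng = random.Random(seed)
    target = int(ratio * len(sbr))
    if target <= len(nsbr):
        # Undersample: pick 'target' distinct NSBRs.
        return rng.sample(nsbr, target)
    # Oversample: draw random indices bounded by the size of NSBR (cf. line 6).
    return [nsbr[rng.randrange(len(nsbr))] for _ in range(target)]
```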
3.2.1.9 Training parameters and treatments

Some terms must be weighted higher than others. Term weighting is a well-known technique in machine learning (Salton & Buckley, 1988; Wu et al., 2008). The objective is to allow the algorithm to learn a function that prioritises the terms or features with higher weights. Our approach is to give a higher weight to the security features (attack/threat, asset, control, and implicit terms). To achieve this, we label every term in our feature set as security related. These terms are then included as SBRs in our training dataset, as shown in Table 2.
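In code, this weighting step amounts to appending each harvested keyword as an extra single-word training record labelled as an SBR, roughly as in the sketch below (our illustration of Table 2, not the authors' code).

```python
def augment_training_set(records, keywords):
    """Append every harvested security keyword as an extra SBR record.

    `records` is a list of (text, label) pairs with label 1 for SBR, 0 for NSBR.
    """
    return records + [(keyword, 1) for keyword in keywords]

train = [
    ("Diffucult to place caret at the end of a line in Show All Tags view", 0),
    ("Allow Personal Security Password > 30 characters", 1),
]
train = augment_training_set(train, ["xss", "injection", "authentication", "token"])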

For the fsec-ext feature selection method, we experiment with different combinations of the security categories attack/threat (T), asset (A), control (C) and implicit (I) to train our classification model. We devise seven combinations from these categories (see Table 4): T, C, TC, TA, CA, TCA, TCAI. The intuition is that different projects may report on different security categories, which in turn may influence the quality of the learner. In addition, this approach can provide insight into the dominant security category and issues for a project. Furthermore, we experiment with including and excluding these security categories as SBRs in our training dataset. Lastly, we experiment with the class imbalance ratio by adjusting (over- or undersampling) the NSBR size relative to the size of SBR. For instance, a value of 0 means the original ratio of SBR to NSBR in the training dataset is kept, and a value of 1.0 means the ratio SBR:NSBR is 1:1. Table 4 lists the parameters and the corresponding values we have adopted in our experiments.

Table 4 Treatments and parameters for training dataset
3.2.1.10 Experiment setup and modelling approach

Figure 2 shows our experiment setup and modelling approach. To answer our research question, we investigate two prediction targets: Within Project Prediction (WPP), where a project’s prediction model is used to predict on the project’s own test dataset, and Transfer Project Prediction (TPP), where a project’s prediction model is used to predict on the test datasets of other projects. In total, using the treatments listed in Table 4, we constructed 3600 prediction models for each project’s dataset, 400 models for WPP and 3200 models for TPP, as listed in Table 5. All data, source code and analysis scripts for our experiments are available online (footnote 9). We have used five common text classification algorithms that have been used extensively in the vulnerability and defect prediction community (Joachims, 1998; Peters et al., 2017), namely support vector machine (SVM), random forest (RF), Naïve Bayes (NB), logistic regression (LR) and K-nearest neighbour (KNN). We have used the Weka (Witten & Frank, 2005) machine learning library version 3.8.4 with the default parameter settings for the five algorithms.
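Conceptually, the experiment enumerates the treatment grid of Table 4 for every source/target project pair. The sketch below outlines that loop; the specific ratio values and the `train_and_score` helper are placeholders (assumptions), since the actual runs used Weka with its default settings.

```python
from itertools import product

# Treatment grid (values illustrative; the actual levels are listed in Table 4).
ALGORITHMS = ["SVM", "RF", "NB", "LR", "KNN"]
FEATURE_METHODS = ["fsec-ext", "fsec-ext+", "fsec-tfidf", "fsec-tfidf+"]
CATEGORIES = ["T", "C", "TC", "TA", "CA", "TCA", "TCAI"]
RATIOS = [0, 0.5, 1.0]  # SBR:NSBR adjustment factors (illustrative)

def run_experiment(projects, train_and_score):
    """Enumerate WPP and TPP runs over the full treatment grid.

    `train_and_score(source, target, algo, method, cats, ratio)` is a
    placeholder for building a model on `source` and evaluating it on
    `target`'s test set.
    """
    results = []
    for algo, method, cats, ratio in product(
            ALGORITHMS, FEATURE_METHODS, CATEGORIES, RATIOS):
        for source in projects:
            for target in projects:  # source == target is WPP, otherwise TPP
                results.append(train_and_score(source, target,
                                               algo, method, cats, ratio))
    return results
```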

Table 5 Experiment details
3.2.1.11 Complexity of our approach

A DTM typically yields a large number of terms that are used as features for training a prediction model. Both the size of the feature set and the size of the training dataset influence the complexity of learning algorithms. For instance, the best-case time complexity of random forest (Louppe, 2014) for building a forest of M randomised trees on N observation points, using K randomly drawn features (variables) from a total of p variables (where K ≤ p), is O(MKN log² N), and the worst case is O(MKN² log N).

Assume the number of features in fsec-ext is P, and N is the total number of bug records (SBR + NSBR) in the training dataset (i.e. observation points); fsec-ext+ includes the P features as records in the training dataset, increasing the observation points to N + P. This implies a complexity that depends on N or P, whichever is higher. P > N holds when the number of features exceeds the number of training bug reports; examples are the Ambari, Wicket, Derby, Camel, Comm, ODCV and Apache datasets. N > P holds when the number of training bug reports exceeds the number of features; examples are the Chromium and Mozilla datasets. Using random forest, which has the worst running time in our experiments, the worst-case time complexity when N > P is O(MKN² log(N + P)), and when P > N, it is O(MKP² log(N + P)).

3.2.1.12 Comparing with Peters et al. (2017)

We compare our results to Peters et al. (2017) where applicable. Peters et al. (2017) use a filtering technique named FARSEC to remove NSBRs that contain security crosswords from the training dataset. A security crossword is a feature (or dictionary term) that exists in both the SBRs and the NSBRs. FARSEC filtering works by first identifying the 100 terms with the highest tf-idf, as defined in Eq. 3, to form a dictionary. Next, for each term in this dictionary, FARSEC computes a score, which is the probability of the term appearing in SBRs and NSBRs. FARSEC then scores each bug report by retrieving the scores of all the dictionary terms found in the bug report and producing an overall score or ranking for the report. Finally, FARSEC filters out NSBRs with an overall ranking \(\ge\) 0.75 from the training dataset. Rather than focusing on filtering NSBRs, we take a different approach, focusing on feature selection methods, term weighting and sampling of the imbalanced dataset. To have a fair comparison with their work, we have used the top 100 terms identified using the same tf-idf metric defined in Eq. 3 to generate our feature set for the model fsec-tfidf.
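The outline below sketches this filtering flow for orientation only. The per-term scores and their combination into a bug-report ranking follow exact formulas given in Peters et al. (2017) that are not reproduced in this paper; here `term_score` is taken as given and the mean of matching term scores is used as a simplifying stand-in for FARSEC's actual combination step.

```python
def farsec_filter(train_sbr, train_nsbr, term_score, threshold=0.75):
    """FARSEC-style filtering outline: drop noisy NSBRs before training.

    `train_sbr` and `train_nsbr` are lists of token lists; `term_score`
    maps each dictionary term to its score (derived in FARSEC from the
    term's frequency in SBRs and NSBRs).
    """
    def rank(report_tokens):
        scores = [term_score[t] for t in report_tokens if t in term_score]
        return sum(scores) / len(scores) if scores else 0.0

    # Keep all SBRs; drop NSBRs whose ranking reaches the threshold.
    kept_nsbr = [r for r in train_nsbr if rank(r) < threshold]
    return train_sbr + kept_nsbr
```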

Table 6 lists the security keywords generated by fsec-tfidf for each project compared with the list in Peters et al. (2017). We observe some slight differences between the two lists. We speculate that the differences may be due to differences in the programming language used and, especially, in how tokens are derived from the documents. For example, the token ‘interfacerequest’ from Peters et al. (2017) in Wicket is not a single word but a substring of, e.g., BookmarkableListenerInterfaceRequestTarget. Such differences can account for the disparity between our list and that of Peters et al. (2017).

Table 6 Security-related keywords derived from using the normalised tf-idf (Manning et al., 2008)
3.2.1.13 Approach to research questions

We now define our research questions and hypotheses based on our overall question. Broadly, we investigate whether using features based on external data sources can produce prediction models that improve transfer project predictions on other projects. We then investigate whether including security features in the training dataset can improve the performance of a model. Lastly, we investigate if these modelling approaches have significant effect on within-project predictions.

We formulate our research questions and hypotheses as follows:

RQ1: Do externally sourced features generalise better than project-specific features for TPP?

We test the hypothesis: H1: fsec-ext/fsec-ext+ can generalise on other projects significantly better than fsec-tfidf/fsec-tfidf+ for TPP.

RQ2: Does including security features as SBRs in training dataset improve model’s performance over models that exclude them for both WPP and TPP?

We test the hypothesis: H2: fsec-ext+ significantly outperforms fsec-ext, and fsec-tfidf+ significantly outperforms fsec-tfidf.

RQ3: Do models that utilise external sources for features outperform model that uses project-specific features in within-project prediction (WPP)?

We test the hypothesis: H3: fsec-ext/fsec-ext+ significantly outperforms fsec-tfidf/fsec-tfidf+ for WPP.

We investigate the performance measures of each model type we have defined on its own dataset (WPP) as well as on other projects’ datasets (TPP) and compare the results. Both goals are complementary and important to our study, as a model that performs well on its own dataset but poorly on another dataset would not be useful for generalisation purposes. Conversely, a model that performs well on other datasets but poorly on its own dataset may likewise be of little use. We use summary statistics showing the minimum, maximum and average of the performance metrics to explain our results. A well-generalised and robust model will show high minimum and mean g-measure, recall and f-score, and low standard deviations. To capture this generalisation effect for a target, we collect the prediction performance results from each source per treatment (algorithm, selection method, ratio and security category). For eight (8) sources, this yields a total of 8 performance results per target per treatment. We then compute the minimum, maximum, mean and standard deviations for g-measure, f-score and recall. These statistics allow us to assess how well a given model generalises across all projects.

We follow the guidelines of Demsar (2006) and He and Garcia (2009) and use the Wilcoxon signed-rank test (a non-parametric test) at the 95% confidence level to compare (1) whether the performance of each model (fsec-ext or fsec-tfidf) improves significantly when security features are included (fsec-ext+ or fsec-tfidf+) compared with when they are excluded, and (2) whether the performance of model fsec-ext is statistically significantly better than that of model fsec-tfidf. In addition, we perform an effect size check on our results. As noted in Kampenes et al. (2007), effect size quantifies the size of the difference between two groups and allows us to judge whether the conclusions drawn from our hypothesis testing are meaningful or not. It is possible for the effect size to be negligible even when the statistical test is significant, and vice versa. We apply the Hedges, g standardised effect size measure calculated as:

$$\mathrm{Hedges},g=\frac{\bar{X}_{1}-\bar{X}_{2}}{S_{p}}$$

where \(\bar{X}_{1}\) and \(\bar{X}_{2}\) represent the sample means of the classification measures (recall, f-score and g-measure) for the model pairs per project (e.g. fsec-ext vs. fsec-tfidf), and \(S_{p}\) represents the pooled standard deviation computed from the standard deviations s1 and s2 of the two groups. Effect size results can be interpreted in different ways, as explained in Kampenes et al. (2007). For fsec-ext+ vs fsec-ext and fsec-tfidf+ vs fsec-tfidf, we use paired data; that is, each data pair contains the metric values of a model before and after treatment. The treatment in this case is including security features as SBRs in the training dataset; all other variables for both groups (i.e. algorithm, ratio and security category) are kept constant. For fsec-ext+ vs fsec-tfidf+ and fsec-ext vs fsec-tfidf, we use unpaired data by collecting the top 20 results for each model type and then testing which group’s mean is significantly higher than the other. We compare our results with those reported for the Software Engineering empirical studies categorised in Table 9 of Kampenes et al. (2007), where the size categories for 284 estimated Hedges, g values are given as Small: 0.00–0.376, Medium: 0.378–1.000 and Large: 1.002–3.40. We have used standard statistical packages in R (2008) for these purposes.
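For reference, the effect size computation can be written as in the sketch below (ours, not the paper's analysis script). The sample-size-weighted pooled standard deviation is an assumption, since the paper only states that Sp is computed from s1 and s2; the example values are made up.

```python
import math
import statistics

def hedges_g(group1, group2):
    """Hedges' g: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = statistics.stdev(group1), statistics.stdev(group2)
    pooled_sd = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd

# Example: g-measures of two model variants over several runs (made-up values).
print(hedges_g([0.82, 0.79, 0.85, 0.80], [0.64, 0.70, 0.61, 0.66]))
```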

4 Results

4.1 Harvested security keywords

We have mined a total of 54 keywords for the asset category, 334 keywords for the control category, 329 keywords for the attack category and 203 keywords for the implicit category. We present a sample of the security words we have mined in Table 7. We caution that this list is not, and cannot be, exhaustive, as additional keywords can be extracted from other sources.

Table 7 Sample of security keywords extracted from different sources and grouped into 4 categories

We present the results of the models (fsec-ext and fsec-tfidf). We have reported the best results and the summary statistics for TPP. The best result is determined by sorting first by the g-measure, followed by the f-score and then by recall. Table 8 presents the summary statistics (i.e. mean, minimum, maximum and standard deviations) of the most generalised model for Transfer Project Prediction (TPP). Table 9 lists the results with the best g-measure for TPP. Table 10 lists the statistical tests for comparing the two models for TPP. Table 11 presents the statistical tests for comparing the effect of including and excluding features during training on both WPP and TPP. Table 12 lists the results with the best g-measure for WPP. Figure 3 shows the mean of top 20 results of each model for WPP. Table 13 lists the statistical tests that compare the two models for WPP. Finally, Tables 14 and 15 list the results with the best g-measure from all the models for WPP and TPP respectively in comparison with the best results from Peters et al. (2017).

Table 8 Summary statistics of best generalised models for TPP
Table 9 TPP results with the best g-measure
Table 10 H1-Wilcoxon and effect size tests comparing models fsec-ext and fsec-tfidf (TPP)
Table 11 H2—Wilcoxon and effect size tests comparing models fsec-ext and fsec-tfidf when feature is included or excluded
Table 12 Best g-measure per project for WPP
Fig. 3
figure 3

Mean of recall, f-score and g-measure of top 20 results for WPP

Table 13 H3—Wilcoxon and effect size tests comparing models fsec-ext and fsec-tfidf (WPP)
Table 14 WPP results with the best g-measure in comparison to the best results from Peters et al. (2017)
Table 15 TPP results with the best g-measure in comparison to the best results from Peters et al. (2017)

4.2 RQ1: Do externally sourced features generalise better than project-specific features for TPP?

To answer RQ1, we use the summary statistics listed in Table 8, complemented by the results in Table 9, which lists the TPP results with the best g-measure. The summary statistics show the treatments (ratio, security category, algorithm) and the feature selection method that yield the best results when using the different sources (8) to predict a target. For example, for Chromium, random forest (RF) combined with a 0.0 ratio and the threat and control security categories produces the best and most stable results when applied on each source project to predict Chromium. Predictions using TPP show model fsec-ext+ with an average recall from 55 to 86%, minimum recall from 43 to 71% and maximum recall from 60 to 100% across all projects. Model fsec-ext has an average recall from 23 to 64%, minimum recall from 2.4 to 44% and maximum recall from 42 to 90%. Model fsec-tfidf+ has an average recall from 38 to 82%, minimum recall from 19 to 50% and maximum recall from 60 to 100%. Model fsec-tfidf has an average recall from 11 to 60%, minimum recall from 0 to 43% and maximum recall from 36 to 60%.

In comparison with model fsec-tfidf+, model fsec-ext+ increased recall on average by 4–17%, and by 26–44% when compared with fsec-tfidf. The average f-score for model fsec-ext+ is between 3.4 and 88%, which is higher by 1.6–28% in all projects when compared with model fsec-tfidf+. One reason for the low precision and consequent low f-score is the scarcity of SBRs in the Commercial, Ambari, Derby, Camel and Wicket datasets. Although the f-score is lower, the average g-measure for fsec-ext+ is at least 66% across all projects and higher than that of model fsec-tfidf+ by 19–28% across the projects. This shows that the true negative rate is relatively high and that the cost of misclassification can be manageable. For instance, the TPP result of Camel \(\to\) Wicket in Table 9 gives TP = 6, FN = 0, TN = 412, and FP = 82. This confusion matrix produces a recall of 100%, an f-score of 13% and a g-measure of 91%. Based on this model’s prediction, a total of 88 bug reports (TP = 6 and FP = 82) would have to be examined by a security/quality assurance team that relies on such a model, which is equivalent to reviewing 17.6% (88 out of 500) of the total bug reports for security-related bugs.
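As a worked check of this example, the metric definitions in Sect. 3.2.1.4 give (values rounded):

$$\mathrm{Recall}=\frac{6}{6+0}=100\%,\quad \mathrm{Precision}=\frac{6}{6+82}\approx 6.8\%,\quad pf=\frac{82}{82+412}\approx 16.6\%$$
$$\text{F-score}=\frac{2\times 6.8\times 100}{6.8+100}\approx 13\%,\quad \text{g-measure}=\frac{2\times 100\times (100-16.6)}{100+(100-16.6)}\approx 91\%$$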

A striking observation is that fsec-ext+ produces the most stable model, generalising better across all projects, with the exception of Ambari, where fsec-ext is slightly better in g-measure and f-score. The standard deviations of the measures for fsec-ext+ are also lower in most cases compared with the other models. This suggests that including security features can produce a more robust, stable and generalisable model. Regarding the AUC of the ROC, model fsec-ext+ has an average between 69 and 89%, fsec-ext between 60 and 78%, fsec-tfidf+ between 55 and 72%, and fsec-tfidf between 54 and 70%. For all the projects except Ambari, the AUC of the ROC of fsec-ext+ is the highest and most stable, ranging between 67 and 94%. In comparison with the other models, which have minimum AUC-ROC below 45%, the fsec-ext+ models show the best classification results at different thresholds.

In terms of security category used for training, the threat (T) and control (C) categories dominate the best result. The exceptions are Ambari and Camel that include the asset (A) category and Mozilla with the asset (A) and implicit (I) categories. Camel, for example, contains security bugs that feature terms in the asset category such as ipaddress, token and password. This is plausible as Camel is a rule-based routing and mediation engine where such keywords frequently feature.

The results in Table 9 complement those in Table 8, with fsec-ext+ producing the best g-measure in seven out of nine projects. However, model fsec-tfidf+ fares slightly better for Ambari and Camel. The ratio of SBR to NSBR varies, as do the machine learning algorithms; we cannot infer any specific pattern, as different ratios and algorithms work for different projects and feature methods.

The results in Table 8 show that model fsec-ext+ performs well for the projects Chromium, Ambari, Derby, Wicket, Camel and Comm, where security bug reports are scarce, as well as for projects with a balanced SBR to NSBR ratio (Apache, ODCV and Mozilla). In general, we can infer that model fsec-ext+ generalises better than models fsec-ext and fsec-tfidf on all unseen and independent datasets. As observed in the results (Table 8), including the security keywords as records in the training dataset produced models that can generalise to unseen security instances in a different project’s dataset.

Hypothesis testing—H1: fsec-ext generalises significantly better than fsec-tfidf

We test the null hypothesis H0: fsec-ext+ ≤ fsec-tfidf+ for transfer project prediction (TPP). We use Table 10 to present our results; it lists the p values for the Wilcoxon test performed on the hypothesis and the Hedges, g values for the effect size. The results show that the mean values for recall, f-score and g-measure for model fsec-ext+ are higher than those for model fsec-tfidf+ and statistically significantly so, with the exception of the recalls for Ambari and Camel. The effect sizes are medium to large, between 0.84 and 4.54, across all the metrics. The effect size test validates the result of our hypothesis testing, showing that the differences in the means are indeed not trivial, and suggests that model fsec-ext+ is a better predictor of security bug reports in different software projects compared with models fsec-tfidf and fsec-tfidf+. Similarly, the significance tests between fsec-ext and fsec-tfidf show fsec-ext to have average recalls, f-scores and g-measures that are higher and statistically significant compared with model fsec-tfidf, mostly with large effect sizes (1.22–4.66). Three cases (f-score for Chromium, recall for Ambari and recall for Camel) have medium effect sizes.

In summary, using the statistical test results, we can strongly suggest that models that use the harvested security keywords and include those keywords in the training sample can, on the average, predict better than models that use project-specific terms.

4.3 RQ2: Does including security features as SBRs in training dataset improve model’s performance over models that exclude them for both WPP and TPP?

To answer this research question, we perform a hypothesis testing on results reported by fsec-ext+ versus fsec-ext and fsec-tfidf+ versus fsec-tfidf for both WPP and TPP.

Hypothesis testing—H2: models that include security terms in the training set during modelling significantly outperform models that do not include them.

We test the null hypotheses H0: fsec-ext+ ≤ fsec-ext and H0: fsec-tfidf+ ≤ fsec-tfidf. We use Table 11 to answer RQ2. It lists the p values for the Wilcoxon test performed on the hypotheses and the Hedges, g values for the effect size.

4.3.1 TPP

We observe a clear pattern of significant results for fsec-ext+ and fsec-tfidf+ when using TPP. All recall results in TPP are significant, with medium to large effect sizes. The exception is Ambari, where there is a small effect size showing that the mean recall of fsec-ext+ is only slightly higher than that of fsec-ext. G-measures show medium to large effect sizes for six projects (Chromium, Wicket, Camel, Derby, ODCV, Mozilla), indicating a fair model with a relative balance between recall and true negative rate. We observe worse test and effect size results for the f-score of model fsec-ext+ for projects with few SBRs (Chromium, Wicket, Ambari, Camel, and Derby), whereas the results are better for fsec-tfidf+ in some cases, for example Chromium, Camel and Derby. One reason for this difference is the project-specific nature of the features used by fsec-tfidf+; fsec-ext+ uses externally sourced features, resulting in more false positives and consequently lower precision and lower f-scores. Further studies could investigate how to combine project-specific and externally sourced features to maximise both recall and f-score.

Clearly, the test results demonstrate that including the features in the training dataset produces superior results compared with models that exclude them when transferring a model from one project to another.

4.3.2 WPP

Here, the results are mixed. For fsec-ext+, we observe that including security features in training clearly worsens the results in some of the projects (Ambari, Derby, ODCV). For instance, the result for Ambari indicates that the metrics for fsec-ext fare better, as the mean values are significantly greater than those for fsec-ext+, as indicated by the negative Hedges, g values. Model fsec-ext+ also improves results in a few projects (Comm and Camel). The rest of the projects (Chromium, Wicket, Apache and Mozilla) show mixed results. We also observe from the results that fsec-ext+ increases recall in six out of the nine projects, albeit with mixed effect sizes ranging from negligible to medium.

Model fsec-tfidf+ significantly improves on fsec-tfidf across all metrics in four projects (Wicket, ODCV, Mozilla, and Comm), while it worsens f-scores in Ambari, Camel, Derby and Apache. Model fsec-tfidf+ also significantly improves the recall and g-measure results in Chromium, with medium and large effect sizes. However, the f-score result is not significant and the effect size is negligible, although the values are slightly higher than for fsec-tfidf. Our observation from the WPP results is that the model built with project-specific features (fsec-tfidf) fares better under treatment (fsec-tfidf+) than the model built with externally sourced features (fsec-ext) does under treatment (fsec-ext+). One reason may be that a WPP model captures project-specific terms that are not available among externally sourced features and can therefore predict better on its own test dataset.

Overall, using the statistical test results, we can strongly suggest that models that include security keywords in the training sample can, on average, generalise better than models that exclude them during model construction. When predicting within a project, including security keywords can improve results, especially when using project-specific features.

4.4 RQ3: Do models that utilise external sources for features outperform the model that uses project-specific features in within-project prediction (WPP)?

As listed in Table 12, fsec-ext+ has the best g-measure in seven out of the nine projects. In ODCV, fsec-tfidf+ records a slightly higher g-measure than fsec-ext+ and fsec-ext. The mean of the top 20 values for the metrics (see Fig. 3) further reveals that fsec-tfidf+ performs better than fsec-ext+ for ODCV. Similarly, fsec-tfidf+ produces higher mean recalls for Chromium and Derby. Otherwise, fsec-ext+ records higher mean values for the metrics in the remaining projects. Different combinations of security categories work for the different fsec-ext models, demonstrating the usefulness of separating the categories. In addition, the SBR to NSBR ratio differs across the models, suggesting that finding a useful model requires searching for the SBR to NSBR ratio that is best for a given dataset and feature selection type. The AUC of the ROC shows a pattern similar to the other metrics, with the mean AUC being higher for fsec-ext+ or fsec-ext. We can establish that models that use externally sourced features are more robust, with the highest AUC at different thresholds.

In general, the results show that models that use the harvested security-related keywords as a feature set perform reasonably well on their own oracle dataset in terms of recall, f-score and g-measure. The f-scores are generally lower for the projects with fewer security bug reports in their dataset, although some f-scores can be judged to be relatively high. An example is Derby with 42 SBRs, where 26 SBRs are correctly classified and 32 NSBRs are incorrectly classified as SBRs, giving a 52% f-score. This translates to reviewing 58 bug reports if we are concerned with type I errors. Similarly, the g-measure also indicates a relatively useful model, where the majority of security bug reports are correctly classified (true positive rate) and the majority of non-security bug reports are also correctly classified (true negative rate). As an example, the Comm performance shows 93% recall and a true negative rate of 87% (59 misclassified non-security bug reports out of 459).

Hypothesis testing—H3: using externally sourced features significantly outperform project-specific features for WPP

We test the null hypotheses H0: fsec-ext+ ≤ fsec-tfidf+ and H0: fsec-ext ≤ fsec-tfidf for WPP. Without treatment, models built with externally sourced features (fsec-ext) significantly outperformed those built with project-specific features (fsec-tfidf) in all projects. Across all metrics, the differences in mean values also have large effect sizes, showing that the differences in the mean values are not trivial. Only the recalls for ODCV and Mozilla have medium effect sizes. We can conclude that, in the absence of treatments or noise filtering before training, models built on externally sourced features are the better choice by default.

With treatment, fsec-ext+ shows significantly higher mean across the three metrics (g-measure, f-score, recall) than fsec-tfidf+ in five projects—Wicket, Camel, Apache, Mozilla and Comm—and with large effect sizes. However, fsec-tfidf+ outperformed fsec-ext+ only in ODCV and the recalls for Chromium and Derby as reflected by the negative effect sizes (see also Fig. 3). Otherwise, fsec-ext+ outperformed fsec-tfidf+ in the remaining metrics and with large effect sizes.

The results suggest that using externally sourced features can reliably predict SBRs when used in WPP, with better results in most cases than project-specific features. The mixed results in Chromium and Derby, however, hint at the potential for improved results over both models when the features are combined. This remains possible future work.

4.5 Comparing our results with Peters et al. (2017)

To compare our results fairly with Peters et al. (2017), we remove the other four projects (ODCV, Apache, Mozilla, and Comm) before we run our analysis for this comparison. We report the result with the best g-measure when using the feature selection method of Peters et al. (2017) with treatment, fsec-tfidf+ (i.e. including features in the training set and adjusting the ratio of NSBR relative to SBR), and without treatment, fsec-tfidf. We note here that fsec-tfidf (without treatment) corresponds to ‘Peters et al. (2017) (train)’. We report the best overall results from Peters et al. (2017) when their filtering techniques are used, as well as the best result for the ‘train’ benchmark. We also report the best results of our generalised feature selection method (fsec-ext). We then compare the results of our best models on the five projects: Chromium, Wicket, Ambari, Camel and Derby. We use the same metrics, namely the g-measure, f-score, probability of false alarm (pf) and recall (pd), to compare which result is better.

In the case of WPP (Table 14), fsec-ext+ improves the g-measure by 1–21% over the best results reported in Peters et al. (2017) in all five projects (100%). Similarly, for recall (pd), our model outperformed Peters et al. (2017) in all projects by 14–20%, and for f-score it scores better in three of the five projects. For TPP (Table 15), fsec-ext+ improves the g-measure by 13–27% over the best results reported in Peters et al. (2017) in all five projects (100%). Interestingly, for Ambari, applying our feature inclusion and ratio-adjustment treatment to the Peters et al. feature selection method (fsec-tfidf+) also improves on fsec-tfidf by 13.7%. For recall, fsec-ext+ improves by 19–50% over the best results in Peters et al. (2017), and for f-score our model scores better in three of the five projects.

Note also that applying our treatment to the same feature selection method used in Peters et al. improved both the g-measure and recall in 90% of cases over the best results reported in Peters et al. (2017). However, in cases where business users care more about precision, we use the TP of Peters et al. (2017) as a threshold to compute the equivalent FP for our best model. For WPP, fsec-ext+ reports 3.4% precision for Chromium (see footnote in Table 14) against the 6.7% reported by Peters et al. (2017); similarly, for TPP, fsec-ext+ reports a precision of 2.2% (see footnote in Table 15) against their 5.1%. In this context, Peters et al. (2017) score better than our model. For Ambari, however, our model is better than Peters et al. (2017) when the same threshold is applied.

Overall, the results from our experiment demonstrate that the fsec-ext+ model can be robust, generalisable and superior to existing state-of-the-art prediction models for classifying security bug reports in software development projects. It remains to be seen whether combining fsec-ext+ with FARSEC filtering approaches can further improve it; we reserve this investigation for future work.

5 Discussion

5.1 Project-specific vocabularies and explicit security keywords

We assessed a sample of the records that fsec-ext predicted as false negatives and found that many of these messages use project-specific security terms. Examples include 'SecureServerTest', 'NSSecurityMechanismTest', 'testOSReadOnly', 'DefaultShutdownStrategy' and 'camel-xmlsecurity'. In Mozilla, among issues classified as false negatives, we found many that the team identified as security related but that lack traditional security terms, for example 'Link Visitedness can be detected by redraw timing', 'Cross-domain drag and drop across IFrames' and 'Page can stomp on PRIMARY clipboard by calling select() at the right time', all classified as medium security issues. These cases are a limitation of our approach because the project-specific terms are missing and would need to be represented during model construction (Morrison et al., 2018b). One approach we hope to investigate further is to detect project-specific vocabularies with the fsec-tfidf feature selection approach and combine them with the externally harvested keywords to improve the model's performance.
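One possible realisation of that combination, sketched below, is to take the top-N tf-idf terms mined from the project's own corpus and union them with the harvested keyword list to form the classifier's vocabulary. The parameter choices and the use of scikit-learn (version 1.0 or later for get_feature_names_out) are assumptions for illustration.

```python
# Sketch: union externally harvested keywords with the top tf-idf
# terms mined from the project itself, producing a combined vocabulary
# that a vectoriser/classifier could be restricted to.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def combined_vocabulary(project_texts, external_keywords, top_n=100):
    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform(project_texts)
    scores = np.asarray(matrix.sum(axis=0)).ravel()   # aggregate tf-idf weight per term
    terms = np.array(tfidf.get_feature_names_out())
    top_terms = terms[np.argsort(scores)[::-1][:top_n]]
    return sorted(set(external_keywords) | set(top_terms))
```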

5.2 Security categories

Our results point to the usefulness of separating the security keywords into categories when building prediction models: different combinations of the categories work best for different projects. To the best of our knowledge, no prior study has separated features this way—into threat, control, asset and implicit—and combined them to build a classifier. The breakdown can also provide insight into the dominant security issues in a project; for instance, the best model for Camel is based on control and asset features. This technique offers a new way to organise features before applying them to build models. An interesting extension would be to automatically separate project-specific features into these categories before combining them with our externally sourced features; we aim to investigate word embeddings (Dai et al., 2017) for this purpose.
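The category-combination search can be stated compactly, as in the sketch below: enumerate every non-empty combination of the four keyword categories, build and evaluate a model on each, and keep the combination with the best g-measure. The example keywords and the evaluate callback are hypothetical placeholders, not our actual lists or training routine.

```python
# Sketch: pick the best-performing combination of keyword categories.
from itertools import combinations

CATEGORIES = {                                   # illustrative keywords only
    "asset":    ["password", "credential", "certificate"],
    "threat":   ["overflow", "injection", "spoof"],
    "control":  ["encrypt", "sanitize", "authenticate"],
    "implicit": ["crash", "leak"],
}

def best_category_combination(evaluate):
    """evaluate(features) -> g-measure; stands in for train/validate."""
    best, best_g = None, -1.0
    names = list(CATEGORIES)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            features = sorted({kw for c in combo for kw in CATEGORIES[c]})
            g = evaluate(features)
            if g > best_g:
                best, best_g = combo, g
    return best, best_g
```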

5.3 When data scarcity is an issue

As noted by Peters et al. (2017), there exists a core set of security-related keywords that can predict security bug reports on any project. However, we establish that the predictive power of these core keywords varies from project to project, and we show that separating them into security categories allows models to be further refined per project. Our results show that, by using our harvested security-related terms and including them in the training dataset, it is possible to build prediction models on small datasets that transfer successfully to other projects irrespective of size. This is the case for all the projects where security bug reports are scarce—Camel, Derby, Wicket, Ambari and Comm. Our approach can also prove useful for projects with no prior labels for security bug reports in their bug repositories: as Cois and Kazman (2015) found in their analysis of 400,000 GitHub projects, very few (1.4%) explicitly identify security-related issues. Our set of security keywords can be used for auto-labelling, after which a WPP model can be constructed to classify newly reported bug reports.
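A minimal sketch of the auto-labelling idea follows: mark a bug report as SBR if it mentions any harvested security keyword, then use those bootstrap labels to train a WPP classifier. The keyword list and the simple substring matching rule are illustrative assumptions.

```python
# Sketch: bootstrap SBR/NSBR labels from a security keyword list.
import re

def auto_label(reports, security_keywords):
    pattern = re.compile("|".join(re.escape(k) for k in security_keywords),
                         re.IGNORECASE)
    return [(text, "SBR" if pattern.search(text) else "NSBR")
            for text in reports]

labelled = auto_label(
    ["Buffer overflow when parsing header", "Fix typo in README"],
    ["overflow", "xss", "injection", "authentication"],
)
print(labelled)   # first report labelled SBR, second NSBR
```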

5.4 Implications for practice and research

5.4.1 Practice

We have deployed the project (see Footnote 10) as open source on both GitHub and Bitbucket to provide an off-the-shelf add-on for the JIRA platform, allowing immediate installation and usage. Transferring the research results via a widely and popularly used platform is intended to increase visibility and impact on security consciousness across projects. We believe the add-on can be used to assess the security portfolio of a system prior to a buy-or-build decision. In addition, the model/add-on can be used to collect useful metrics such as security-related issues that are still open, security-related issues that have been changed, and the window of exposure of identified security-related records. Our approach also allows adaptation to include 'implicit' security-related terms and to generate an organisation-based classification model that can be used across projects within an organisation.
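As an illustration of one such metric, the sketch below computes an average window of exposure, taken here as the days between an issue's creation and its resolution (still-open issues measured against the current date). The field names are illustrative and do not correspond to JIRA's actual API.

```python
# Sketch: average window of exposure for security-labelled issues.
from datetime import date

def window_of_exposure(issues, today=None):
    """issues: dicts like {"created": date, "resolved": date or None}."""
    today = today or date.today()
    days = []
    for issue in issues:
        end = issue["resolved"] or today       # open issues exposed until today
        days.append((end - issue["created"]).days)
    return sum(days) / len(days) if days else 0.0

issues = [
    {"created": date(2024, 1, 10), "resolved": date(2024, 2, 1)},
    {"created": date(2024, 3, 5),  "resolved": None},
]
print(window_of_exposure(issues, today=date(2024, 3, 15)))   # 16.0 days
```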

5.4.2 Research

Researchers have long conducted studies to develop text classification models. Our modelling approach and the resulting add-on open further research possibilities in this direction: first, by making it possible to quickly collect security-related messages from different projects; second, by making it possible to apply the same approach to prediction in related areas where domain knowledge exists and a feature set can be extracted upfront.

6 Threats to validity

Manual term extraction: we manually harvested security-related terms from several popular sources. This list is not exhaustive; additional security-related keywords could be extracted from many other sources.

Manual labelling of security-related records: manually labelled security records, including those reported in publicly disclosed venues such as the MITRE CVE, have been argued to be subjective and sometimes inaccurate (Gegick et al., 2010; Scandariato et al., 2014; Christey & Martin, n.d.; Dai et al., 2017), and Massacci and Nguyen (2010) showed that using different vulnerability databases can yield different results. Further studies replicating our work on the dataset we have used would therefore be useful.

Modelling: we used default parameter settings for the algorithms during model construction; tuning the parameters might yield different results (Xia et al., n.d.). As mitigation, we performed all experiments under the same conditions, and we would expect further tuning to produce results in the same direction as those reported here. Additionally, we explored only a subset of imbalance ratios; it is possible that higher SBR-to-NSBR ratios worsen or improve the results.

Categorisation of terms: our categorisation of security-related terms is subjective and prone to inaccuracies. Some terms we placed in the implicit category may be directly related to the attack/threat or control categories.

Generalisation: although we have evaluated our approach on eight major open source datasets and one commercial dataset, we still cannot claim that the results generalise to all types of datasets. Further empirical studies are needed to validate our results on new datasets in domains not covered in this study.

7 Conclusion

We have investigated an approach for training a text classification model to identify security messages in heterogeneous software project repositories such as issue tracking and version control systems. Because security instances differ across projects and may be unseen at training time, we proposed harvesting security keywords from relevant security sources as features and including them as security-related records in the training set during modelling, in contrast to typical text classification approaches that derive promising features from the project's own dataset. We give meaning to our feature set by grouping the extracted terms into the categories asset, attack/threat, control/mitigation and implicit, and we use these groups as treatments in our experiments by combining them to derive classification models.

Our evaluation shows that using a broad set of security keywords as the feature set during model construction can outperform the approach that uses project-specific features. Further, including these security keywords as SBRs in the training dataset produced models that generalised better to unseen and heterogeneous datasets. In addition, breaking the security features into categories reveals that different projects perform best with different combinations of security categories. Finally, we compared our results with a state-of-the-art baseline for both WPP and TPP; our modelling approaches produced superior results in all of the projects evaluated.

In future work, we aim to investigate (1) using an unsupervised learning approach, combined with sentiment analysis of discussions and fsec-tfidf features, to augment the generic security keywords with project-specific keywords, and (2) using our approach for risk estimation during software development and maintenance. Our idea is that the categories we have formulated can underpin a risk model for a project. For instance, it might be possible to compute risk = likelihood × impact, where the likelihood is approximated from the vulnerabilities and existing mitigations, and the impact is derived from the assets and the threats/attacks against them. It might also be possible to assign weights to the categories during classification, with keywords in the asset and attack categories weighted higher than control and implicit keywords. We believe this could lay a good foundation for building classification tools with useful feedback and interesting use cases. Lastly, we want to investigate whether building a model on all the remaining projects' training data to predict a target produces better results, and whether a bellwether effect exists in our TPP setting, as this effect has been observed in many transfer learning tasks in software engineering.
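Purely as an illustration of this future-work idea, the sketch below derives a crude risk score from counts of issues classified into each category, approximating likelihood from threat/attack hits net of control hits and impact from asset (and implicit) hits. The weights and normalisation are invented for illustration and are not part of our current model.

```python
# Illustrative sketch of the proposed risk estimation: risk = likelihood * impact,
# with likelihood and impact approximated from category-level classification counts.
def project_risk(category_counts, weights=None):
    w = weights or {"threat": 1.0, "control": 0.5, "asset": 1.0, "implicit": 0.25}
    likelihood = max(0.0,
                     w["threat"] * category_counts.get("threat", 0)
                     - w["control"] * category_counts.get("control", 0))
    impact = (w["asset"] * category_counts.get("asset", 0)
              + w["implicit"] * category_counts.get("implicit", 0))
    return likelihood * impact

print(project_risk({"threat": 12, "control": 8, "asset": 5, "implicit": 3}))  # 46.0
```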