1 Introduction

In this digital era, software applications play an important role in providing a range of services for people in their daily lives (e.g. browsing information and using online learning platforms). People leave digital footprints, such as search history and personal details, when interacting with these applications. This raises critical concerns about privacy and personal data protection. Many data protection and privacy laws and policies around the world govern the processing of personal data (e.g. the General Data Protection Regulation (EU) Official Journal of the European Union (2016) and the California Consumer Privacy Act (US) State of California Department of Justice (2018)). These regulations set out requirements for handling personal data in organisations. They also grant individuals rights to manage their personal data (e.g. the right to be informed and the right of access). Organisations that must comply with these regulations are required to consider privacy compliance in their software systems. Failing to comply may cause reputational damage and financial hardship Data Privacy Manager (2020); European Commission (2019); CNET (2019); Privacy Affairs (2020).

In addition, privacy breaches and vulnerabilities have been increasing rapidly. These breaches have occurred not only in small organisations but also in the world's top leading companies (e.g. Google and Marriott) BBC (2020); Reuters (2020); Swinhoe (2020). Such incidents damage organisations' reputations and expose individuals. Hence, there is an urgent need to ensure that privacy and personal data protection are taken into consideration when developing software systems. It is, however, challenging for organisations to integrate privacy and personal data protection requirements into existing processes, especially for deployed systems.

As agile, issue-driven software development has been increasingly adopted in most of today's software projects, issues have become a main source of requirements for a software project Choetkiertikul et al. (2021). Issues are lodged in issue tracking systems (e.g. JIRA), which are accessible to all stakeholders involved in a project. For each development iteration, the development team selects a set of issues to work on. Hence, issue reports are the first source of requirements and project tasks considered by the development team. Issues contain important information about new requirements (i.e. feature requests), change requests for existing requirements (i.e. improvements), requirements not being properly met (i.e. bugs) or work that has to be done (i.e. tasks) Choetkiertikul et al. (2018, 2021).

Fig. 1 An example of Google Chrome (top) and Moodle (bottom) issue reports

Issue reports are normally written in natural language (see Fig. 1). They contain information (e.g. issue key or ID, summary, description, type, priority, status and components) that describes scenarios and states actions that software engineers need to attend to. Given the thousands of issue reports in large software projects, it is challenging for software engineers to identify privacy-related issues. In addition, privacy concerns vary depending on the functionalities and context provided by a software system. For example, a web browser (e.g. Chrome) may prioritise the user's search history, while an online learning platform (e.g. Moodle) focuses on protecting user profiles and personal information. Hence, there is an emerging need to identify relevant privacy requirements in issue reports.

A recent study developed a taxonomy of privacy requirements from data protection regulations and privacy frameworks Sangaroonsilp et al. (2023). This taxonomy provides a set of fundamental privacy requirements for developing privacy-aware software applications. One of its uses is to classify privacy-related issues in a software project into relevant privacy requirements. This classification helps software development teams identify the privacy requirements concerned in issue reports and ensure that those requirements are properly addressed. In addition, the process would enable privacy compliance checking, which requires demonstrating that the privacy needs and concerns set out in legislation are addressed in a software system. Since the taxonomy contains a large set of privacy requirements, the classification process is labour intensive and time consuming. Hence, automated support is much needed for this task. The support could be provided in a “just-in-time” manner: once an issue is created or modified, the machinery (integrated with an issue tracking system) would automatically classify the issue into the appropriate privacy categories and requirements.

One prominent option for developing this automated solution is to leverage machine learning (ML) and natural language processing (NLP) techniques. Information in an issue report (e.g. title and description) is extracted into features. Those features are then used to build machine learners that are capable of learning from historical data to classify new data. There is a range of state-of-the-art techniques for extracting and learning these so-called textual features. In this study, we explore a wide range of ML and NLP techniques that can be used to automatically classify privacy requirements in issue reports. This paper provides the following contributions:

  • We evaluate the performance of traditional word embedding techniques (i.e. BoW Tirilly et al. (2008), N-gram IDF Shirakawa et al. (2015), TF-IDF Blei et al. (2003) and Word2Vec Mikolov et al. (2013); Google (2013)) and deep learning techniques (i.e. CNN Lee et al. (2017) and BERT Devlin et al. (2018)) in classifying privacy requirements in issue reports. We use the labelled dataset published by Sangaroonsilp et al. (2023) as the input data in this empirical study. We employ Mean Reciprocal Rank (MRR) and recall at k (recall@k) to compare the performance of each method. In addition, we identify the best performing technique for assisting a software team in identifying privacy requirements in issue reports. The results confirm that N-gram IDF is the best performer, with MRR of 0.6093 and 0.5838 in the Google Chrome and Moodle projects respectively. TF-IDF also performs well on both MRR and recall@5 in both projects: it achieves an MRR of 0.6093 in the Google Chrome project and the highest recall@5, at 0.7866 and 0.6027 in Google Chrome and Moodle respectively.

  • We perform a Wilcoxon test to investigate whether the classification results of all the methods are statistically significantly different. We found that, in the Google Chrome project, the recall@k results of the random guessing method differ statistically significantly from those of all the traditional word embedding and deep learning techniques, with effect sizes greater than 0.95. In the Moodle project, only the recall@k results between random guessing and BoW, N-gram IDF and TF-IDF, and between N-gram IDF and BERT, are statistically significantly different, with effect sizes greater than 0.95.

A full replication package containing all the artefacts generated by our studies is made publicly available at Sangaroonsilp et al. (2022). The remainder of this paper is structured as follows. We introduce background and related work on a taxonomy for privacy requirements classification and text classification techniques in Sect. 2. In Sect. 3, we discuss a motivating example and explain the approaches implemented in our study. The details of the dataset, experimental setting, performance measures and evaluation results are presented in Sect. 4. We address the threats to validity of our study in Sect. 5. Finally, we conclude and discuss future work in Sect. 6.

2 Background and related work

Personal data protection and privacy have attracted attention from individuals and organisations globally in recent years. After the announcement of the GDPR in 2016, over a hundred countries around the world have developed their own data protection and privacy legislation United Nations Conference (2020). As people interact with software applications and systems, it is necessary to develop them in compliance with those legislations to ensure personal data protection. However, it is challenging for software engineers to understand and implement privacy requirements when developing software applications in practice Gürses et al. (2011).

Several studies have identified privacy requirements and constructed privacy requirement taxonomies based on privacy policies and privacy standards Antón and Earp (2004); Ayala-Rivera and Pasquale (2018). Antón and Earp (2004) developed a privacy goal taxonomy from website privacy policies. The study adopted a Goal-based Requirement Analysis Method (GBRAM) to extract privacy goals and requirements from 24 online healthcare privacy policies. This taxonomy provides a useful set of requirements that website developers could use to reduce web vulnerabilities. Ayala-Rivera and Pasquale (2018) proposed an approach to identify the requirements that should be implemented in a software system. The requirements were elicited by mapping the GDPR to the privacy controls in ISO/IEC standards. Although both studies are relevant to privacy requirements elicitation and classification, neither investigated privacy requirements in issue reports or studied automating the classification process.

A number of existing studies used ML and NLP methods to assist in regulatory compliance checking Müller et al. (2019); Torre et al. (2020). Müller et al. (2019) studied a method for checking the compliance of companies' privacy policies with the GDPR. The statements in privacy policies were extracted and classified into five categories (i.e. Data Protection Officer (DPO), Purpose, Acquired data, Data sharing and Rights). The study applied three different word embedding techniques (i.e. Word2Vec, FastText Bojanowski (2017) and ELMo Peters et al. (2018)) in combination with three classifiers (i.e. Support Vector Machines (SVMs), Logistic Regression (LR) and Neural Networks (NN)) to automatically classify the extracted statements into those five categories, and employed F-measure to evaluate classification performance. Similarly, Torre et al. (2020) proposed a solution for completeness checking of privacy policies against the GDPR. The study used a pre-trained word-vector model to transform the sentences in privacy policies into vector representations Pennington et al. (2014), then built ML classifiers to classify the sentences into metadata types; precision and recall were used as performance measures. Compared to our study, these studies focused on privacy policies, employed fewer word embedding techniques and classified data into a smaller set of categories.

A number of studies have applied ML, NLP and deep learning models to issue report classification in different applications Fan et al. (2017); Pandey et al. (2017); Choetkiertikul et al. (2021); Cho et al. (2022). Fan et al. (2017) performed a large-scale study on the issue reports of 80 popular projects on GitHub, classifying issues into bug and non-bug types. The study evaluated the classification performance of four traditional text-based classifiers (SVM, LR, Multinomial Naive Bayes (MNB) and Random Forest (RF)) using average F-measure, and also proposed a new framework that improves the performance of the traditional classification methods. Pandey et al. (2017) studied various classification algorithms for classifying the issue reports of three open-source software projects by type (i.e. bug or improvement). The algorithms comprise Naive Bayes (NB), linear discriminant analysis, k-nearest neighbours, SVM with various kernels, decision tree and RF. The authors employed F-measure, average accuracy and weighted average F-measure to evaluate classification performance, and reported that RF performed best in this setting.

Choetkiertikul et al. (2021) proposed a predictive model that recommends the most relevant software components for new issue reports. The model was built using the deep learning Long Short-Term Memory (LSTM) technique, with issue reports from a collection of 11 open-source projects across four repositories as inputs, and its performance was evaluated using recall@k. Cho et al. (2022) proposed a method that automatically maps issue reports to software feature descriptions in the user manuals of three software projects: Notepad++, Visual Studio Code and Komodo. The authors evaluated the classification performance of 8 approaches combining two deep learning models (i.e. CNN and Recurrent Neural Network (RNN)) with four word embedding techniques (i.e. an embedding layer, Word2Vec, GloVe and FastText), using precision, recall and F-measure. The best performing word embedding techniques were FastText and GloVe, while CNN performed better than RNN in that experimental setting. Although the existing studies mentioned above investigated various issue report classification problems, they focused on type/component classification, not privacy-related issue reports. In addition, our study employs some different word embedding techniques, deep learning models and performance measures compared to those studies.

Recent work by Sangaroonsilp et al. (2023) developed a taxonomy of privacy requirements aiming to support the development of privacy-aware software systems. The work identified privacy requirements in issue reports and mapped them to the relevant privacy requirements in the taxonomy. The taxonomy development process consists of three main steps: privacy requirements identification, refinement and classification. In the identification step, privacy requirements were derived from four well-established data protection regulations and privacy frameworks (i.e. the General Data Protection Regulation (GDPR) Official Journal of the European Union (2016), the Thailand Personal Data Protection Act (PDPA) National Legislative Assembly (2019), ISO/IEC 29100 ISO/IEC (2011) and the Asia-Pacific Economic Cooperation (APEC) privacy framework Asia-Pacific Economic Cooperation (APEC) (2017)).

The GDPR is a data protection regulation developed by the European Union (EU). It provides a set of principles and conditions for protecting personal data and individual rights with which applicable organisations need to comply. The Thailand PDPA, based on the GDPR, is a country-specific personal data protection regulation that governs the use and protection of personal data. ISO/IEC 29100 and the APEC privacy framework define sets of privacy principles used to manage personal data processing activities, and provide guidelines for controlling the processing of personal data. The processing of personal data includes collection, use, storage, dissemination and disposal.

The privacy requirements derived from multiple sources can be redundant and/or inconsistent; the refinement step was therefore designed to remove duplicate requirements and resolve inconsistent ones. The refined privacy requirements were then classified into categories in the classification step. In this step, the authors identified privacy requirements that have common characteristics or address similar privacy concerns, grouped them together, and formed the categories. This created a taxonomy of 71 privacy requirements in 7 categories: (1) user participation, (2) notice, (3) user desirability, (4) data processing, (5) breach, (6) complaint/request and (7) security.

To validate the taxonomy, the authors performed a study on how the issue reports of two large open-source software projects (Google Chrome and Moodle) address the privacy requirements in the taxonomy. The Google Chrome and Moodle projects were selected for the accessibility of their issue reports, their large scale, their popularity and their representativeness. To identify and classify privacy requirements in issue reports, the authors first collected issue reports from the issue tracking systems of Chrome and Moodle: 896 Chrome and 478 Moodle issue reports, all marked as closed and privacy-related.

Then, the issue classification was performed by three coders, who were the authors of the paper and had also been involved in the taxonomy development. Each issue report was annotated by two coders, who followed three steps to classify an issue report to the relevant privacy requirements. First, the coders identified the concerned personal data (e.g. name, email address and bank account details) described in the issue report. Then, they identified the functions/properties related to that personal data. Finally, they mapped the issue report to one or more relevant privacy requirements. The coders took approximately 138 person-hours to classify 1,374 issue reports. They also performed an inter-rater reliability assessment and a disagreement resolution to mitigate subjective judgements. This process produced a dataset that contains Google Chrome and Moodle issue reports associated with the concerned privacy requirements.

Although the work in Sangaroonsilp et al. (2023) provides a valuable dataset for mining and classifying privacy requirements in issue reports, an automated approach has not been studied. As can be seen above, the manual classification process is labour intensive and time consuming. Thus, we aim to automate this process by exploring and evaluating multiple feature extraction techniques and reporting the best technique for this setting. An automated approach would reduce resources and effort as well as minimise human errors.

3 Classifying privacy requirements in issue reports

3.1 Motivating example

The following example demonstrates how issue reports were manually classified into relevant privacy requirements in Sangaroonsilp et al. (2023). Issue 123403 in Google Chrome reported that users cannot delete individual cookies (see Fig. 1). The coders read the issue summary and description and followed the three classification steps described above: they identified the concerned personal data described in the issue report, identified the functions/properties related to that personal data, and mapped the issue report to one or more relevant privacy requirements. For this issue, the coders identified cookies as the concerned personal data, as cookies may contain login details, search history and identifiable information, and delete as the function related to the cookies. The issue report was therefore classified into requirement R44 ALLOW the data subjects to erase their personal data under Category 1 (User participation) in the taxonomy (see Appendix A). This scenario reflects that the browser should allow users to select the cookies that they want to delete Sangaroonsilp et al. (2023). This manual classification process can be automated using ML and NLP techniques as well as deep learning models, which is what we study in this work.

3.2 Approaches

Our study explores feature extraction techniques that can be used to automatically classify privacy requirements in issue reports. We use the textual description (i.e. summary and description) of an issue report as input to a model that classifies it into relevant privacy requirements (see Fig. 2). We treat this as a multi-label, multi-class classification problem since an issue can be classified into multiple privacy requirements. We acquire the privacy-related issues from Sangaroonsilp et al. (2023) (see Sect. 4.1 for more details). We first extract the summary and description of the issues, with the label(s) of the relevant privacy requirements, from the dataset. We then apply different textual feature extraction and learning techniques to form a vector representation of the texts. Those vectors and the privacy requirement labels are then fed to a classifier for training and validation.

Fig. 2 An overview framework for developing a classification model

The classification process involves three essential steps: textual data pre-processing, textual feature extraction and training a classification model. The first important step is to eliminate noise from the texts. In our study, we remove stop words from the input texts and apply lemmatisation. Stop words are words that occur commonly in texts but carry little meaning (e.g. a/an/the), while lemmatisation reduces the variation of words by converting them into their root form (e.g. played to play); a pre-processing sketch follows this paragraph. Note that stop word removal and lemmatisation are not applied for BERT. The pre-processed texts are then encoded into vectors using a textual feature extraction technique; we describe each technique in Sects. 3.2.1–3.2.5 below. It is important to note that our models take as input only the summary/title and description of an issue. This is the minimal information that must be provided when an issue is created in any issue tracking system. Thus, the automated support provided by these classification models is readily available from the issue creation time. In addition, this makes our study applicable to a wide range of issue tracking systems.
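To make this step concrete, the sketch below shows one possible pre-processing pipeline using NLTK. This is an illustrative assumption: the paper does not name its pre-processing library, and the `preprocess` helper is ours; a production pipeline would also POS-tag tokens before lemmatising.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-off downloads of the required NLTK resources
for resource in ("stopwords", "wordnet", "punkt"):
    nltk.download(resource, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text: str) -> str:
    """Lower-case, tokenise, drop stop words and lemmatise an issue text."""
    tokens = nltk.word_tokenize(text.lower())
    kept = []
    for tok in tokens:
        if not tok.isalnum() or tok in STOP_WORDS:
            continue
        # Lemmatise as a noun first, then as a verb (e.g. "played" -> "play")
        kept.append(LEMMATIZER.lemmatize(LEMMATIZER.lemmatize(tok), pos="v"))
    return " ".join(kept)

print(preprocess("Users cannot delete the individual cookies"))
```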

The issue feature vectors derived from each method are then fed into a classifier. We first employed two well-known classifiers, RF and SVM, which have performed best in several document classification problems Moraes et al. (2013); Breiman (2001). However, we found that RF performed better than SVM for all the traditional word embedding techniques in our context, so we use only the RF classifier for the traditional word embedding techniques in this study. RF is a randomised ensemble method in which a classification is made by voting among weak learners (i.e. decision trees) Breiman (2001). We can thus derive a probability distribution over the class labels and recommend requirements in order of probability, as sketched below. We note that BERT differs in its implementation steps; we describe our BERT implementation in Sect. 3.2.6.
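The following sketch shows how an RF classifier can produce a ranked list of requirements from multi-label training data with scikit-learn. The hyper-parameters and the helper name `train_and_rank` are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_and_rank(X_train, Y_train, X_test, top_k=5):
    """Train one RF on a binary indicator matrix Y (n_issues x n_requirements)
    and return, per test issue, requirement indices ranked by probability."""
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, Y_train)  # scikit-learn's RF supports multi-label targets
    probs = []
    for j, per_label in enumerate(rf.predict_proba(X_test)):
        # per_label has one column per observed class; take P(label == 1),
        # or 0 if the requirement never appeared in the training data
        classes = list(rf.classes_[j])
        probs.append(per_label[:, classes.index(1)] if 1 in classes
                     else np.zeros(per_label.shape[0]))
    probas = np.stack(probs, axis=1)               # (n_test x n_requirements)
    return np.argsort(-probas, axis=1)[:, :top_k]  # top-k requirement indices
```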

3.2.1 Bag of words

Bag of Words (BoW) is the simplest model; it represents texts as vectors of word frequencies Tirilly et al. (2008). It creates a vocabulary for the entire corpus, then constructs a vocabulary-sized vector representation for each document, whose elements contain the frequency of each word. BoW is implemented in our study as follows. After the issues are pre-processed, we first tokenise every issue in our dataset to convert it from sentences into a collection of words (i.e. terms). We then build our dictionary (i.e. vocabulary) from the list of unique words. Next, we generate a vector for each issue by counting the occurrences of each vocabulary word in that issue. Finally, we acquire a vocabulary-sized vector representation for each issue, whose elements contain the word occurrences; a minimal sketch is shown below. These vectors are then used to train the model.
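A minimal sketch of this construction with scikit-learn's CountVectorizer, which performs the tokenisation, vocabulary building and counting described above (the example texts are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

issues = ["cannot delete individual cookies",   # pre-processed issue texts
          "modify cookies settings"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(issues)         # (n_issues x vocab_size) counts
print(vectorizer.get_feature_names_out())    # the vocabulary
print(X.toarray())                           # word occurrences per issue
```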

However, BoW has two major weaknesses. Firstly, it does not capture the semantics of words. For example, it treats “Chrome”, “browser” and “school” equally, while semantically “Chrome” and “browser” are more related than “browser” and “school” or “Chrome” and “school”. Secondly, the order of words in texts is ignored. For instance, BoW produces the same vector representation for the two different sentences “Chrome is good” and “Is Chrome good”.

3.2.2 N-gram IDF

N-gram IDF is an extension of the IDF weighting scheme, developed to measure the weights of phrases (i.e. terms of any length greater than 1) appearing in a collection of documents Shirakawa et al. (2015). It gives more weight to phrases (N-gram terms) that occur in fewer documents. We adopt the N-gram library in our implementation. N-grams help us distinguish the frequency of different phrases that contain the same set of words; for example, “Chrome is good” and “Is Chrome good” can be distinguished with a 3-gram setting. Combined with IDF, this also identifies significant phrases occurring in our issue reports.

In our implementation, we first construct a vocabulary set of N-gram terms. We specify the number of grams (i.e. words in a term) that we would like to extract to our dictionary. For example, suppose an issue summary states “Modify cookies settings”. If we create a dictionary using 1-grams, we get “Modify”, “cookies” and “settings” (3 terms). If we use 2-grams, we get the 2-word terms “Modify cookies” and “cookies settings” (2 terms). Therefore, when we set a range of 1-word to 2-word terms, our vocabulary set contains the 1-word and 2-word terms “Modify”, “cookies”, “settings”, “Modify cookies” and “cookies settings” (5 terms). We set the length of terms from 1 to 10 words in our experimental setting, which includes every 1-word to 10-word term in our issue reports in the dictionary (a sketch of this enumeration is shown below). We then find the occurrences of all the terms in each issue report. The weights of the N-gram terms are computed by combining IDF with the Multiword Expression Distance (MED) in the space of information distance Shirakawa et al. (2015). Finally, we create vectors to store the results.
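Our implementation uses the dedicated N-gram IDF library cited above; the sketch below only approximates the first step, enumerating the 1- to 10-word terms, with scikit-learn. The MED-weighted N-gram IDF scores of Shirakawa et al. (2015) are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Enumerate 1-word to 10-word terms, mirroring the dictionary setting above
vectorizer = CountVectorizer(ngram_range=(1, 10))
vectorizer.fit(["modify cookies settings"])
print(vectorizer.get_feature_names_out())
# ['cookies' 'cookies settings' 'modify' 'modify cookies'
#  'modify cookies settings' 'settings']
```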

N-gram IDF suffers from long computational times when processing large corpora. It also fails to process natural language texts that do not have spaces between words (e.g. Chinese and Japanese).

3.2.3 TF-IDF

TF-IDF is a frequency-based embedding in which the vector representation of a text is computed from term weights measured across the documents in a corpus Blei et al. (2003). This technique evaluates how important a term is to a document within a collection of documents. The TF-IDF weight of a term-document pair is the product of two values, term frequency (tf) and inverse document frequency (idf), as shown below:

$$\begin{aligned} w_{t,d} = tf_{t,d} \times \log _{10} \left( \frac{N}{df_t}\right) \end{aligned}$$

where \(w_{t,d}\) is a tf-idf weighted term-document of term t in document d, \(tf_{t,d}\) is a term frequency of term t in document d, N is a number of documents in a collection, and \(df_t\) is a number of documents in a collection that contains term t.

In our implementation, documents are issue reports and terms (t) are the words in our dictionary. To construct a TF-IDF vector for an issue, we first create a dictionary from all the terms in our issue reports, extracting terms of 1 to 10 words as in our N-gram IDF setting. Next, we define N as the number of issue reports in each project. For each project, we count the number of times term t appears in an issue, divide by the total number of terms in the issue, and keep this as the term frequency. We then count the number of issue reports that contain each term in our dictionary, and keep this as the document frequency of that term. Finally, we calculate the TF-IDF values of the terms and store them as the elements of the vector representation of each issue report (see the sketch below).
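A hedged sketch of this step with scikit-learn; note that TfidfVectorizer's default idf uses the natural logarithm with smoothing rather than the plain \(\log_{10}\) form above, so the weights differ by a constant factor.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

issue_texts = ["cannot delete individual cookies",  # illustrative inputs
               "modify cookies settings"]

# ngram_range=(1, 10) mirrors the 1- to 10-word term setting described above
vectorizer = TfidfVectorizer(ngram_range=(1, 10))
X = vectorizer.fit_transform(issue_texts)  # (n_issues x n_terms) tf-idf matrix
```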

TF-IDF overcomes a major weakness of BoW: common terms that appear frequently across issue reports, and thus carry little discriminative power, are down-weighted in the vectors. TF-IDF also captures basic linguistic notions of terms (e.g. synonymy) Blei et al. (2003). However, semantic similarity between words and the order of words in texts are still not captured Kowsari et al. (2019).

3.2.4 Word2Vec

Word2Vec is a two-layer neural network that learns word features (e.g. the transitional probabilities between words). It represents words as numerical vectors in a vector space (i.e. a word embedding) Mikolov et al. (2013). The dot product of two word vectors in the vector space reflects the similarity between the two words. Word2Vec was developed to tackle the following problems:

  • Coverage: Co-occurring words may not occur consecutively. Two model architectures were proposed to address this: Continuous Bag-of-Words (CBoW) Harris (1954) and Skip-gram Mikolov et al. (2013).

  • Space: The aforementioned techniques create vectors of vocabulary size, but most of the elements in each vector are zero, which leads to sparse representations.

  • Speed: The computation is expensive when the vector space is large.

In our implementation, we employ Google's pretrained Word2Vec model. We first tokenise all the issue reports in the dataset and store the tokens in a vocabulary set. Next, we generate one-hot encoding vectors for all the issue reports as inputs and feed them into the Word2Vec architecture. The two layers of Word2Vec are the hidden layer and the output layer: the hidden layer is an encoder receiving input vectors, while the output layer is a decoder storing word probabilities. In the training process, the input is the word embedding of each issue report while the output is the privacy requirement labels of that issue report. The model is trained by adjusting the weight and bias vectors in the hidden layer from the given input and output vectors. We then use the trained hidden layer to classify privacy requirements on our test set: we feed the word embedding of each issue report in the test set to the trained model to get the classification results. A sketch of deriving issue embeddings from the pretrained model is shown below.
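As a hedged sketch, one common way to obtain a fixed-size issue embedding from Google's pretrained model is to average the vectors of its tokens with gensim; the paper's exact aggregation and training wiring may differ, and the `issue_vector` helper is ours.

```python
import numpy as np
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")   # Google's pretrained 300-d model

def issue_vector(text: str) -> np.ndarray:
    """Average the pretrained vectors of in-vocabulary tokens."""
    tokens = [t for t in text.lower().split() if t in w2v]
    if not tokens:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[t] for t in tokens], axis=0)

vec = issue_vector("cannot delete individual cookies")   # shape: (300,)
```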

3.2.5 Convolutional neural network

A Convolutional Neural Network (CNN) is a deep learning architecture; our implementation follows the design in Lee et al. (2017). It consists of an input layer, hidden (convolutional) layers and an output layer. We adapt this method to the multi-label multi-class classification task by using a sigmoid activation function and a dedicated output unit for each requirement (i.e. category). We adopt Keras and Google's pretrained Word2Vec model; the pre-trained model is used to create the word embedding for each input vector. We use dropout (with a rate of 0.5) to avoid overfitting Srivastava et al. (2014). Binary cross-entropy, provided by Keras, is used as the network's loss function. As the optimiser, we employ Adam with a learning-rate adjustment technique (ReduceLROnPlateau) to train the model. We connect the output layer, which holds the privacy requirement labels, to the model and train it after fixing all the settings; a sketch under these settings follows.
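A minimal sketch of such a network under the settings above (sigmoid outputs, binary cross-entropy, dropout 0.5, Adam with ReduceLROnPlateau); the layer sizes and filter width are illustrative assumptions, and the embedding layer would be initialised from the pretrained Word2Vec weights in the actual set-up.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

NUM_REQUIREMENTS = 71                 # one sigmoid unit per privacy requirement
VOCAB_SIZE, EMB_DIM = 20000, 300      # illustrative vocabulary/embedding sizes

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMB_DIM),   # pretrained Word2Vec in practice
    layers.Conv1D(128, 5, activation="relu"),              # convolutional layer
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),                                   # dropout rate 0.5
    layers.Dense(NUM_REQUIREMENTS, activation="sigmoid"),  # multi-label output
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="binary_crossentropy")

reduce_lr = callbacks.ReduceLROnPlateau(monitor="val_loss",
                                        factor=0.5, patience=2)
# model.fit(X_train, Y_train, validation_data=(X_val, Y_val),
#           epochs=20, batch_size=32, callbacks=[reduce_lr])
```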

3.2.6 BERT

In addition to the feature extraction techniques mentioned above, we also explore Bidirectional Encoder Representations from Transformers (BERT) as a candidate for classifying privacy requirements in issue reports. BERT is one of the recent breakthroughs in ML and NLP architectures. We use the Huggingface Transformers library Huggingface (2020) and the Tensorflow Keras API TensorFlow (2020) in our implementation.

Fig. 3 An overview of the privacy requirements classification process using BERT

We follow the process shown in Fig. 3 to develop a recommendation model using BERT Devlin et al. (2018). We concatenate the issue summary and description of each issue report and use them as the input text. We then perform text pre-processing; in this step, we only remove non-alphanumeric characters from the input texts. We do not perform stop word removal or lemmatisation in the BERT implementation, since these processes would alter the context of the original texts (e.g. negations and parts of speech), and the BERT tokeniser handles text pre-processing by itself Huggingface (2020); Reina (2020). The dataset is then split into training and test sets (see Sect. 4.2).

We employ the \(\hbox {BERT}_{{BASE}}\) model, which has 12 transformer blocks (i.e. encoders), a hidden size of 768 and 12 attention heads. We load the BERT tokeniser and configure the model parameters (e.g. maximum token length, regularisation, optimiser, classification metrics, batch size and learning epochs). We calculated the average number of words in our input texts (113 words for Google Chrome and 90 words for Moodle) and hence set the maximum token length to 120 for both projects, which is sufficient for our input. N in Fig. 4 denotes the maximum token length, so N for our model is 120. After specifying all parameters, we perform the tokenisation process to transform the input texts into tokens (represented as Toki in Fig. 4, where i is the order of the token). In this step, the tokeniser also automatically adds a special token (the [CLS] token) to each original input text (see Fig. 4); the [CLS] token is the starting point of each input text Devlin et al. (2018). Finally, a complete sequence of input tokens is generated.

Fig. 4 An architecture of our BERT model (adapted from Devlin et al. (2018))

The next step is to set up the model input layer, the main BERT layer and the model output layer. From the previous step, each issue report is transformed into a sequence of input tokens; thus, our model input layer takes these sequences. We load the BERT model with all configurations as the main BERT layer. Finally, we create an individual output for each privacy requirement (i.e. R1, R2, ..., R71). Since we perform multi-label multi-class classification, the model predicts a class label for each individual privacy requirement of each issue report; we therefore have 71 model outputs.

Once the sequence of tokens is fed into the model, the model creates an input embedding \((E_{i})\) for each token, as shown in Fig. 4. The token embeddings are then forwarded through the hidden states in each transformer block, which functions as an encoder. The \(\hbox {BERT}_{{BASE}}\) model consists of 12 transformer blocks; Fig. 4 shows only the first and last layers for illustration purposes. Next, the model is trained on the training set with the settings above and evaluated on the validation set during training. We later evaluate the performance of the model on the test set. A sketch of this set-up is shown below.
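A hedged sketch of this set-up with the Huggingface Transformers library and the Keras API. For brevity, the 71 per-requirement outputs are collapsed into a single 71-unit sigmoid layer, which is functionally equivalent; the learning rate and other hyper-parameters are illustrative assumptions.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

MAX_LEN, NUM_REQUIREMENTS = 120, 71   # maximum token length N = 120, as above

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32,
                                name="attention_mask")
# Use the pooled [CLS] representation as the embedding of the whole issue
cls = bert(input_ids, attention_mask=attention_mask).pooler_output
outputs = tf.keras.layers.Dense(NUM_REQUIREMENTS, activation="sigmoid")(cls)

model = tf.keras.Model([input_ids, attention_mask], outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="binary_crossentropy")

# The tokeniser adds the [CLS] token and pads/truncates to MAX_LEN
enc = tokenizer(["summary and description of an issue ..."],
                padding="max_length", truncation=True,
                max_length=MAX_LEN, return_tensors="tf")
# model.fit([enc["input_ids"], enc["attention_mask"]], Y_train, ...)
```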

4 Evaluation

We first describe the dataset used in this study and the performance measures used for model evaluation. We then present and discuss the results of privacy requirements classification in the Google Chrome and Moodle projects.

4.1 Dataset

We use issue reports from two large and widely used software projects (i.e. Google Chrome and Moodle). This dataset was originally collected and annotated by Sangaroonsilp et al. (2023). The authors first collected 1,604 issue reports in total from both projects. After manual examination, 230 issues were filtered out due to limited information for the classification task. The final dataset contains 1,374 issues (896 from Google Chrome and 478 from Moodle), all explicitly assigned the “privacy” component in their issue tracking systems. In the dataset, each issue report contains an issue ID, issue summary, issue description and its privacy requirement labels (see Fig. 5).

Fig. 5 An example of a Google Chrome issue report with labels at the requirement and category levels. A category/requirement irrelevant to a particular issue report is represented by 0, and a relevant one by 1. *The issue description is not shown in full due to space limitations

As described in Sangaroonsilp et al. (2023), each issue report was labelled with its relevant privacy requirement(s); we use the term “labelled with” to denote this relationship. Note that some privacy requirements belong to more than one category in the taxonomy. Each issue report can be represented at two levels: the category level and the requirement level. For example, issue 123403 in Google Chrome is classified into requirement R44 (see Fig. 5a). Based on the taxonomy in Sangaroonsilp et al. (2023), requirement R44 is categorised into category 1 (User participation) (see Fig. 5b).

We also show the distribution of issue reports by the number of requirements they were classified into. For both projects, the majority of the issues relate to one requirement (see Fig. 6). Figure 7 shows the occurrences of all the privacy requirements in both projects. Out of 71 privacy requirements, the issue reports were classified into 39 requirements in Google Chrome and 40 in Moodle. We note that our problem follows a long-tail distribution in which a few top requirements account for most occurrences, which is challenging for the classification task.

Fig. 6 The distribution of the number of privacy requirements identified in Google Chrome and Moodle issue reports

Fig. 7 Privacy requirement occurrences in Google Chrome and Moodle issue reports

4.2 Experimental setting

We perform the experiments for each project separately. The issue reports are sorted by creation date, using their issue IDs. The issues are then split into mutually exclusive sets, where the issues in the training set were created before the issues in the test set, to ensure that the models learn from historical data: 60% of the issues go to the training set, 20% to the validation set and 20% to the test set (a sketch of this split is shown below).
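A minimal sketch of this chronological split; since issue IDs increase with creation date, sorting by ID preserves the time order (the `id` field name is an assumption for illustration).

```python
def chronological_split(issues, labels):
    """Split issues 60/20/20 into train/validation/test in creation order."""
    order = sorted(range(len(issues)), key=lambda i: issues[i]["id"])
    n = len(order)
    train = order[:int(0.6 * n)]
    val = order[int(0.6 * n):int(0.8 * n)]
    test = order[int(0.8 * n):]
    pick = lambda idx: ([issues[i] for i in idx], [labels[i] for i in idx])
    return pick(train), pick(val), pick(test)
```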

4.3 Performance measures

We use two widely used measures, mean reciprocal rank (MRR) and recall at k (recall@k), to evaluate the performance of our models. We also employ a non-parametric hypothesis test (the Wilcoxon rank-sum test) and an effect size measure to compare the classification benchmarks between each pair of models applied in our study (see Tables 2 and 3).

  • MRR Almhana et al. (2016); Ye et al. (2019) is the mean of the reciprocal of the rank at which the first relevant answer is recommended. It is calculated as:

    $$\begin{aligned} MRR = \frac{1}{q}\sum _{i = 1}^{q}\frac{1}{rank_{i}} \end{aligned}$$
    (1)

    where q is the number of issue reports and \(rank_{i}\) is the rank position of the first relevant item for issue report i.

    For example, assume that we have two issue reports (i.e. A and B). Issue report A is labelled with requirement R5, and the classifier returns requirements R1, R5, R12 for it. Issue report B is labelled with requirements R7 and R3, and the classifier returns requirement R7 for it. The rank of issue report A is 2, as requirement R5 is in the second position of the list returned by the classifier, so its reciprocal rank is \(\frac{1}{2}\). The rank of issue report B is 1, as requirement R7 is in the first position of the returned result, so its reciprocal rank is 1. Hence, the MRR is:

    $$\begin{aligned} \begin{aligned} MRR = \frac{1}{2} \left( \frac{1}{2} + 1\right)&= \frac{3}{4} \end{aligned} \end{aligned}$$
    (2)
  • Recall@k returns the number of correctly retrieved results over all the relevant results within the top k elements recommended by our models. It is calculated as:

    $$\begin{aligned} Recall@k = \frac{1}{n}\sum _{i=1}^{n}\frac{|Rec_{i} \cap Label_{i}|}{|Label_{i}|} \end{aligned}$$
    (3)

    where n is the number of issue reports in the test set, \(Rec_{i}\) is the set of top-k privacy requirements recommended for issue report i, and \(Label_{i}\) is the set of privacy requirement(s) actually labelled in issue i. Assume that we have two issue reports (i.e. A and B) and three privacy requirements (i.e. R1, R2 and R3). Issue report A is actually labelled with requirements R1, R2 and R3, while issue report B is actually labelled with requirement R1. The classifier recommends requirements R1, R2 for issue report A, and requirements R1, R3 for issue report B. Then recall@2 is:

    $$\begin{aligned} \begin{aligned} Recall@2&= \frac{1}{2} \left( \frac{ |\left\{ R1, R2 \right\} \cap \left\{ R1, R2, R3 \right\} |}{|\left\{ R1, R2, R3 \right\} |} + \frac{ |\left\{ R1, R3 \right\} \cap \left\{ R1 \right\} |}{|\left\{ R1 \right\} |} \right) \\&= \frac{1}{2}\left( \frac{2}{3} + \frac{1}{1} \right) = \frac{5}{6} \end{aligned} \end{aligned}$$
    (4)
  • Wilcoxon test The Wilcoxon rank-sum test (also known as the Mann–Whitney U test) is a non-parametric hypothesis test that compares the statistical difference between two independent observations Wild (1997). We employ it to compare the significance of the performance of two classification models based on their recall@k, for k = 1 to k = 20. In addition, we compute the common language effect size (CL) to reflect the proportion of paired observations in which the first observation (x) is higher than the second observation (y) McGraw and Wong (1992). Vargha and Delaney (2000) later proposed a generalisation of CL called the measure of stochastic superiority (\(\hat{A}_{XY}\)). This version does not require normally distributed data and works with ordinal data; it has also been recommended and employed in previous work to assess effect size when performing non-parametric hypothesis tests Arcuri and Briand (2014); Choetkiertikul et al. (2018). Therefore, it is suitable for our observations. \(\hat{A}_{XY}\) is calculated as:

    $$\begin{aligned} \hat{A}_{XY} = \frac{\left[ \# \left( X > Y \right) + \left( 0.5 \times \#\left( X = Y \right) \right) \right] }{n} \end{aligned}$$
    (5)

    where \(\#(X > Y)\) is the number of observations in which the recall@k of X is greater than the recall@k of Y, \(\#(X = Y)\) is the number of observations in which the recall@k of X equals that of Y, and n is the total number of observations (i.e. n = 20). Reference implementations of these measures are sketched after this list.
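Minimal reference implementations of the three measures, written for clarity rather than performance (the function names are ours). `ranked` is a per-issue list of requirement IDs ordered by classifier confidence, and `labels` is the corresponding list of ground-truth sets.

```python
from scipy.stats import mannwhitneyu

def mrr(ranked, labels):
    """Mean reciprocal rank of the first relevant recommendation, Eq. (1)."""
    total = 0.0
    for recs, gold in zip(ranked, labels):
        rank = next((pos + 1 for pos, req in enumerate(recs) if req in gold),
                    None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked)

def recall_at_k(ranked, labels, k):
    """Average fraction of true requirements found in the top k, Eq. (3)."""
    return sum(len(set(recs[:k]) & gold) / len(gold)
               for recs, gold in zip(ranked, labels)) / len(ranked)

def a_xy(x, y):
    """Vargha-Delaney effect size over paired recall@k observations, Eq. (5)."""
    gt = sum(a > b for a, b in zip(x, y))
    eq = sum(a == b for a, b in zip(x, y))
    return (gt + 0.5 * eq) / len(x)

# Wilcoxon rank-sum test over recall@1..recall@20 of two methods:
# stat, p = mannwhitneyu(recalls_of_method_x, recalls_of_method_y)
```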

Table 1 Evaluation results of classifying privacy requirements in Google Chrome and Moodle issue reports

4.4 Results

The methods we use in this work fall into three groups: naive baselines (i.e. random guessing and a frequency-based method), traditional word embedding techniques (i.e. BoW, N-gram IDF, TF-IDF and Word2Vec) and deep learning techniques (i.e. CNN and BERT). We report the results by answering the following research questions:

  • RQ1: Do the traditional word embedding techniques outperform naive baselines? This research question examines how the selected traditional word embedding techniques perform compared to naive baselines. We performed a sanity check by comparing the classification performance of the traditional word embedding techniques against two common naive baselines: random guessing and a frequency-based method. Random guessing randomly selects an issue from the training set and recommends the privacy requirements of the selected issue. The frequency-based method recommends privacy requirements based on the frequency of the assigned privacy requirements in the training set. We note that the data used in this study is heavily imbalanced and has a long-tail distribution (see Fig. 7); the frequency-based method is thus a strong baseline against which the models should be compared as a sanity check of whether the techniques are suitable for classifying privacy requirements. The goal of this RQ is to confirm that the selected traditional word embedding techniques outperform the naive baselines and can be used to automatically classify privacy requirements in issue reports.

    Table 1 shows the MRR and recall@k results of all the techniques and the naive baselines on the Google Chrome and Moodle datasets. All the traditional word embedding techniques outperform both random guessing and the frequency-based method on MRR in both projects. All the techniques except random guessing achieve over 0.7 recall@5 in the Google Chrome project; in other words, more than 70% of the issue reports were classified into the correct privacy requirements when five requirements were suggested. Note that when k equals the number of privacy requirements (i.e. 71), recall becomes 1 (i.e. every issue report is classified into every requirement). On recall@k, the traditional techniques also perform better than the random guessing baseline in both projects, although in the Google Chrome project they underperform the frequency-based baseline on average recall@k for k = 1 to k = 20. N-gram IDF and TF-IDF are the best performers, with an MRR of 0.6093, in the Google Chrome project; TF-IDF also achieves the highest recall@5. In the Moodle project, the traditional techniques perform much better on MRR and recall@k: N-gram IDF achieves the highest MRR at 0.5838, BoW and TF-IDF achieve the highest recall@5 at 0.6027, and all the traditional techniques perform much better than the random guessing and frequency-based methods. The differences in MRR and recall@k among the four traditional techniques are very small (less than 0.05). In summary, N-gram IDF is the best performer in both projects.

  • RQ2: Do deep learning techniques outperform the naive baselines and traditional word embedding techniques? Since deep learning techniques have attracted a lot of attention from the research community and have been used in various classification tasks Sarker (2021), we aim to investigate whether the selected deep learning techniques can beat the traditional word embedding techniques and naive baselines in issue report classification. We evaluated the classification performance of the selected deep learning techniques and compared them with the four traditional word embedding techniques and the two naive baselines. We found that CNN and BERT outperform the random guessing and frequency-based methods on MRR in both projects, except for BERT in the Moodle project. Although they achieve higher recall@k, for k = 1 to k = 20, than the random guessing method, they cannot beat the frequency-based method in either project. The deep learning techniques achieve lower MRR than the traditional word embedding methods in both projects. CNN and BERT underperform the traditional methods on recall@k in the Moodle project, but outperform them on recall@k in the Google Chrome project. These two techniques may not be suitable for classifying issue reports into privacy requirements in this setting, as the amount of input data may not be sufficient for deep learning models.

  • RQ3: Are the results statistically significantly different? This RQ investigates whether the results of all the classification methods are statistically significantly different. The results of the naive baselines, the traditional word embedding techniques and the deep learning techniques reported in the previous research questions differ; however, we do not know how statistically significant those differences are or how large the effects are. Thus, we employed the Wilcoxon rank-sum test to answer this RQ. Tables 2 and 3 show the results of the Wilcoxon test and the effect sizes evaluated on recall@k, for k = 1 to k = 20, in Google Chrome and Moodle respectively. In the Google Chrome project, the Wilcoxon results confirm a significant difference between the recall@k results of random guessing and all the other techniques, with effect sizes greater than 0.95. The recall@k results of BERT also differ statistically significantly from those of all four traditional word embedding techniques, with effect sizes of 0.98. The recall@k results among the traditional techniques are not statistically significantly different. In the Moodle project, the recall@k results of the techniques are not statistically significantly different, except between random guessing and BoW, N-gram IDF and TF-IDF, and between N-gram IDF and BERT; the effect sizes of those statistically different pairs are greater than 0.95.

Based on our investigation, we discuss a few factors that should be considered and provide some suggestions for researchers classifying privacy requirements in issue reports. The first factor is the dataset. Our results show that all the selected traditional word embedding techniques performed better than the selected deep learning-based techniques on the datasets we used; the deep learning-based methods, especially BERT, may suffer from small datasets. Another factor is the selection of feature extractors and classifiers. We suggest that researchers perform a preliminary experiment to briefly assess the performance of privacy requirements classification on their issue reports, since classification performance varies with the amount, quality and distribution of the data and with the experimental settings. In our setting, N-gram IDF with RF performs well on the Google Chrome and Moodle datasets; researchers can adopt this approach and check whether it works well with their own datasets (Fig. 8).

Fig. 8 The performance of the textual feature extraction techniques employed in this study (evaluated on recall@k)

Table 2 Comparison on classification benchmarks using Wilcoxon test and \(\hat{A}_{XY}\) effect size (on the right column with 2 decimal points) on Google Chrome project
Table 3 Comparison on classification benchmarks using Wilcoxon test and \(\hat{A}_{XY}\) effect size (on the right column with 2 decimal points) on Moodle project

5 Threats to validity

There are several threats to the validity of our study, which are discussed below.

5.1 Construct validity

We performed privacy requirements classification on issue reports of the Google Chrome and Moodle projects. The dataset used in our study is publicly available and has been used by previous work Sangaroonsilp et al. (2023). The issue reports in that dataset were extracted from the active issue tracking systems of two large and widely used software systems, and both projects have a strong emphasis on privacy concerns. We however acknowledge a threat to construct validity in relying on the labelled dataset of the previous study: since the dataset involved subjective judgements and was manually annotated, it may contain errors caused by human biases.

5.2 Internal validity

We are also aware that parameter configurations could affect the performance of the techniques applied in the experiments. We tried to minimise this threat by setting the same values for all the common parameters across the different techniques in both the Google Chrome and Moodle projects. In addition, our dataset has a long-tail distribution, which may favour the frequency-based method. However, we used a range of different text features in the experiments and compared their performance, and the results show that several of them performed better than the frequency-based method.

5.3 External validity

We acknowledge that our dataset may not be representative of software applications in other domains. Further investigation is required to explore projects in different domains (e.g. e-health software systems and mobile applications), which we will pursue in future work.

6 Conclusion and future work

User privacy in software development has attracted attention from the software development community in recent years. Together with the enactment of data protection regulations and laws, it is essential for software development teams to properly address privacy concerns in their software systems. In this paper, we have explored a wide range of textual feature extraction techniques that can be used to automatically classify privacy requirements in issue reports. We performed our study on issue reports of Google Chrome and Moodle, and used MRR and recall@k to evaluate the performance of the techniques. The evaluation results showed that BoW, N-gram IDF, TF-IDF and Word2Vec are suitable for privacy requirements classification, and that N-gram IDF is the best performer in our experiments. We believe that this study provides an insightful reference for choosing textual feature extraction techniques in future work. Our future work involves studying the robustness of different textual feature extraction techniques and analysing them in more depth. We plan to extend our study to privacy-related issue reports of other software projects (e.g. health and mobile applications). We also plan to explore this line of research further and develop an automated tool to support privacy requirements identification in issue tracking systems (e.g. JIRA).