1 Introduction

Human values such as inclusiveness, social justice, and privacy, or ‘what people hold important in their life’ (Schwartz 2012; Rokeach 1973), have received increasing attention in the software industry in recent years. Recent incidents have demonstrated that people are aware of their values and react strongly when software violates them. For instance, changes to WhatsApp’s privacy policy in early 2021 led millions of its users to migrate to alternative messaging apps (Best 2021). One of the reasons was fear of another privacy breach by WhatsApp’s parent company, Facebook, which would gain access to user data once the new policy came into effect (Best 2021). In this case, the trust of Facebook users did not appear to have recovered after the infamous Cambridge Analytica case in 2015 (Confessore 2018). In another example, digital and human rights groups protested the use of facial recognition systems in justice systems because these systems introduce racial bias (Schapiro and Bacchi 2020). This bias was suspected to stem from the use of datasets that underrepresent minorities (Schapiro and Bacchi 2020). The bias made people of colour more likely to be detected as offenders and increased fear of unfair treatment (Schapiro and Bacchi 2020). These events align with a characteristic of human values: people feel threatened when their values are jeopardised (Schwartz 2012). To avoid such situations, it is necessary to consider human values, i.e., to be attentive to their implications, in an application or software because they can influence user acceptance (Wang et al. 2013; Harris et al. 2016; Fu et al. 2013). In this paper, we use the terms human values dashboard and values dashboard interchangeably.

Addressing human values is difficult because of their subjective nature (Winter et al. 2018) and because their definition depends on the context in which they are applied (Kujala and Väänänen-Vainio-Mattila 2009; Mougouei et al. 2018). Several solutions have been proposed to support practitioners in addressing values in software. These solutions commonly take the form of frameworks, techniques, practices, and guidelines, such as Value-based Requirements Engineering (Thew and Sutcliffe 2018), Value Sensitive Design (Friedman et al. 2008; Friedman et al. 2013), or Continual Value(s) Assessment (Perera et al. 2020). However, these solutions aim to consider values at a specific phase of software development, such as the requirements or design phase, or to satisfy a specific type of practitioner (e.g. designer). We argue that providing a human values dashboard can bridge this gap and help practitioners effectively and efficiently understand and handle values throughout the software development lifecycle.

In software development, dashboards are commonly used to support decision making (Ivanov et al. 2018a, 2018b), promote awareness within a project (Treude and Storey 2010; Baysal et al. 2013), and monitor development activities (Leite et al. 2015). It is common for a software development dashboard to use software development artefacts as its source because these artefacts capture the software development process. For instance, Leite et al. developed a dashboard that used commit history to detect unusual events (Leite et al. 2015). Several other dashboards have also been developed using artefacts from software repositories (GitHub 2021a, 2021b; Cauldron 2021; Mautic 2022). Recent studies also suggest that human values, such as security (Fischer et al. 2017; Viega et al. 2002; Pletea et al. 2014; Alqahtani and Rilling 2017), privacy (Kim et al. 2012; Li et al. 2015; Gibler et al. 2012; Naseri et al. 2019; Kuznetsov et al. 2016; Slavin et al. 2016; Sharma et al. 2014), or energy efficiency (Bao et al. 2016; Pereira et al. 2017), can be found in software development artefacts. Although these works did not specifically address security, privacy, or energy efficiency as values, they show the possibility of discovering values in the artefacts. A dashboard is suitable for our purpose because it allows information to be visually displayed to ‘facilitate understanding’ (Wexler et al. 2017). We believe that a dashboard can help clarify the less known and abstract concept of values (Mougouei et al. 2018; Perera et al. 2020) for software practitioners.

Figure 1 presents our vision of a human values dashboard that uses software development artefacts as its data source and displays the values identified in those artefacts to support practitioners in addressing them in the software. To this end, we propose a human values dashboard consisting of a back end and a front end. The back end of the dashboard provides functionality to identify values in software development artefacts. The identification of values could be done manually (e.g. by the development team) or using an automated approach. The back end is necessary because these artefacts do not naturally have values identified in them. For example, Fig. 2 shows a user of an open source application expressing, in an issue discussion, their wish for inclusiveness to be present in the application. Based on this example, we consider that a human value can be identified in a software development artefact if there is a notion of that value in the artefact. This argument is supported by recent work that discovered human values in issue discussions as an example of software development artefacts (Nurwidyantoro et al. 2021b). The front end of the dashboard displays the values identified from various artefacts in different views (for different roles).

Fig. 1 Proposed human values dashboard

Fig. 2 Example of values identified in an issue discussion

This study, which aims to design, implement, and evaluate a human values dashboard, consists of three stages, shown in Fig. 3. First, we conducted an exploratory study by developing a prototype of the dashboard and interviewing software practitioners to obtain their perceptions of such a dashboard and what it would require (i.e., exploration stage). Second, we developed a human values dashboard as a proof of concept based on the findings of the exploratory study (i.e., dashboard development stage). Finally, we presented the human values dashboard to software practitioners and asked for their feedback and suggestions to improve it (i.e., feedback stage).

Fig. 3 Methodology of this study

Our results reveal that the participants agree that it is important to consider human values in software, e.g., by developing an application that is aligned with its users’ cultural background. Participants also believe that a values dashboard can benefit various software development roles (e.g. project manager, developer, tester), primarily by raising awareness of values and supporting values-based decision making in project management (e.g. prioritising issues). Our participants also acknowledge that software development artefacts are suitable as a source for the dashboard. Among those artefacts, requirements documents and issue discussions are deemed the most suitable. This study also identified two potential challenges for the adoption of the dashboard: first, incorrect issue descriptions that can mislead the automated values identification in the dashboard; second, a company’s lack of willingness to adopt the dashboard.

We have previously reported the design and findings of the exploratory stage of our research at the ESEM conference in 2021 (Nurwidyantoro et al. 2021a). This paper significantly extends the previously published work by adding the design and development of the human values dashboard and its components. We also conducted 10 further interviews (i.e., feedback stage) to see practitioners’ perspectives on the usefulness of the dashboard in practice.

The remaining sections of this paper are organised as follows. Section 2 summarises the background of this study. The next three sections describe the methodological details of each stage in this study, namely, the exploration stage, the dashboard development stage, and the feedback stage. Section 3 describes the exploration stage to understand the potential benefits and what is necessary for a human values dashboard. Section 4 explains the development of the human values dashboard. Section 5 describes the feedback stage to obtain practitioners’ perceptions and suggestions toward the dashboard. Section 6 discusses the findings and potential future directions. Section 7 describes the threats to the validity of this study. Finally, Section 8 concludes the paper and proposes future work for this study.

2 Background

This section introduces the theoretical background and previous work related to this study. First, we describe the definition of human values and their models in social science. Second, we present related work on human values in software. Finally, we present previous work on dashboards for software development.

2.1 Human Values

Human values, such as achievement or benevolence, are defined by Schwartz as ‘things that people hold important in their life’ (Schwartz 2012). Meanwhile, Rokeach defined values as ‘a belief that a particular way of doing something is personally or socially preferable to the opposite ways’ (Rokeach 1973). Studies in the social sciences suggest that each person assigns a degree of relative importance to these values (Rokeach 1973; Schwartz 2012; 2017). Because values are intertwined with feelings (Schwartz 2012), a threat to someone’s values can upset that person; conversely, consideration of a person’s values brings them enjoyment. For example, in a software engineering context, a user of an application who values inclusiveness can expect this value to be present in the application (Fig. 2).

The social sciences have proposed several models that identify human values and divide them into several categories (Rokeach 1973; Bird 1998; Cheng et al. 2010; Schwartz 2012; 2017). Among those models, Schwartz’s model (Schwartz 2012) is considered the most complete as it covers the largest number of values (Cheng et al. 2010). Schwartz’s model, also known as Schwartz’s theory of basic values, categorises human values into 10 types based on their motivation. These types are self-direction, universalism, benevolence, conformity, tradition, security, power, achievement, hedonism, and stimulation (Schwartz 2012). Schwartz also provides ‘exemplary specific values that primarily represent each value type’ (Schwartz 1994), or value items (Schwartz 2012), such as privacy, choosing own goals, and equality for the self-direction value type. The model is organised in a circular manner so that mutually supporting values are adjacent to each other, while conflicting values are diametrically opposite (Schwartz 2012) (see Fig. 4). For example, pursuing freedom could conflict with the value of respecting tradition. In this study, we used Schwartz’s model to introduce values to software practitioners during data collection.

Fig. 4 The Schwartz model of basic human values (Schwartz 2012), taken from Holmes et al. (2011)

2.2 Human values in SE

2.2.1 Solutions to integrate values in software

Several solutions have been proposed to integrate human values into software engineering. For example, Value-based Requirements Engineering (VBRE) was introduced for the requirements engineering stage to elicit values from users and stakeholders (Thew and Sutcliffe 2018). Another approach, Value Sensitive Design (VSD) (Friedman et al. 2008; Friedman et al. 2013), was proposed to integrate the consideration of values into the design process of a system. A framework called Continual Value(s) Assessment (CVA) was also proposed to extend an application’s set of functionalities based on an evaluation of the value implications of the existing functionalities (Perera et al. 2020).

These solutions propose that values be considered in specific stages of software development, especially the early stages, such as requirements and design. We believe it is also possible to support the integration of values in the later stages of development (e.g. implementation). For example, Hussain et al. identified several places to introduce values throughout the software development phases of the SAFe Agile framework (Hussain et al. 2022). In addition, these works proposed solutions in the form of methods or frameworks. We argue that, to be practical, a solution could also take the form of a tool, yet few studies have proposed a tool to support values in software. Our study addresses this gap by envisioning a dashboard as a solution. We believe that our idea of a human values dashboard has the potential to support various stages of software development by utilising artefacts generated during software development as its data source.

2.2.2 Human values in software development artefacts

Software development activities normally generate artefacts. For instance, requirements documents are written as a result of requirements-gathering activities, and development teams discuss issue reports within repositories. These artefacts have been used in previous work to investigate human values. Recent studies mainly considered values that are more familiar in software engineering, such as security, privacy, and energy efficiency. For example, some studies investigated the notion of security in source code (Fischer et al. 2017; Viega et al. 2002) and issue discussions (Pletea et al. 2014; Alqahtani and Rilling 2017). Privacy has received a lot of attention through investigations of various artefacts, such as source code and configuration files (Kim et al. 2012; Li et al. 2015; Gibler et al. 2012; Naseri et al. 2019; Kuznetsov et al. 2016), application programming interfaces (APIs) (Slavin et al. 2016), and project documentation (Sharma et al. 2014). Other studies focused on the energy efficiency of an application (Bao et al. 2016; Pereira et al. 2017). Although these works are related, they do not specifically consider security, privacy, or energy efficiency as values. Complementing previous work, a recent study demonstrated that human values as defined in the social sciences are present in issue discussion artefacts (Nurwidyantoro et al. 2021b). These studies support our idea that values are present in software development artefacts, and therefore that development artefacts are suitable as a data source for a values dashboard.

2.3 Dashboard for software development

A dashboard is generally used to monitor progress (Wexler et al. 2017) and support decision making (Janes et al. 2013) in an organisation. In software development, recent studies have demonstrated the use of dashboards to make decisions (Ivanov et al. 2018a, 2018b) and to promote awareness of the software project within the development team (Treude and Storey 2010; Baysal et al. 2013). For example, Leite et al. proposed a dashboard to alert developers to unusual events in repositories (Leite et al. 2015). Another study used a dashboard to visualise concerns in the context of software evolution (Treude and Storey 2009). In practice, software projects use dashboards to monitor the development activities of a project (Cauldron 2021; Mautic 2022; GitHub 2021b; 2021a). In this study, our objective was to leverage this awareness-promoting benefit to make practitioners aware of values during software development.

Our study proposes a human values dashboard to promote awareness of values in software development. In this regard, recent work has developed dashboards that include various indicators to support awareness about a software project, such as code quality and non-blocking code (López et al. 2021), and project size, issue density, and productivity (Thiruvathukal et al. 2018). Unlike our work, these works focused more on the technical aspects of the software. For non-technical aspects, other work proposed dashboards, outside software contexts, for online discussions; these dashboards visualise team dynamics (Vivian et al. 2015) and provide suggestions for inclusive meetings (Samrose and McDuff 2021). Unlike these previous works, our dashboard highlights human values, as a non-technical aspect, in software development.

In terms of evaluating a dashboard, previous work primarily used interviews (e.g. Ivanov et al. 2018b; Baysal et al. 2013; Leite et al. 2015; Samrose and McDuff 2021). Some of these studies asked participants to interact with the dashboard before the interviews (Ivanov et al. 2018b; Samrose and McDuff 2021). Other works used surveys or questionnaires to evaluate their dashboard (e.g. López et al. 2021) or combined them with interviews (e.g. Treude and Storey 2009). This study followed the approach of allowing participants to interact with the dashboard, followed by interviews.

3 Exploration stage

This stage aims to explore whether our envisioned human values dashboard would be useful to support the consideration of human values during software development. To understand this, first, it is necessary to understand whether software practitioners consider human values important. Second, it is necessary to explore the possible benefits of that tool for different roles in software development. As the dashboard uses software development artefacts as its source, it is also important to understand which artefacts are considered by practitioners to be the most suitable. Finally, the dashboard was intended to help software practitioners in incorporating human values during software development. Therefore, it is also necessary to obtain requirements from practitioners for the dashboard. Based on these, the following research questions were developed:

  1. RQ1

    What are the perceptions of practitioners towards human values in software development?

  2. RQ2

    Who will benefit from and what is the benefit of a human values dashboard?

  3. RQ3

    Which artefacts are suitable for the dashboard?

  4. RQ4

    What is needed for a human values dashboard to be helpful in software development?

In this stage, we first developed a prototype of our envisioned human values dashboard. The prototype was developed using static HTML and presents manually values-labelled issues in three different views. The labelling was carried out by the authors following the method presented in Nurwidyantoro et al. (2021b). We then conducted interviews with 15 software practitioners (i.e., P01–P15). The interview questions for this stage are available in Nurwidyantoro et al. (2022a). Finally, we analysed the interviews to address the research questions. We used the thematic analysis approach (Braun and Clarke 2012) to analyse the interviews in this stage and, later, in the feedback stage (Section 5). Table 1 shows examples of how parts of the interview transcripts were analysed into themes and sub-themes; the first and second examples are from the exploration stage interviews, and the third example is from the feedback stage interviews. This stage of the study has been published in Nurwidyantoro et al. (2021a). The remainder of this section summarises the findings of the exploration stage.

Table 1 Examples of the analysis from interviews to a theme and a sub theme

3.1 Practitioners’ Perceptions of Human Values (RQ1)

Participants indicated that human values are important in software. However, their understanding of human values is limited to those that are well known in software engineering, such as security or privacy. The participants found that other values, such as achievement or ability, are not as easily understood or translated into software engineering. Nevertheless, they stated that they had considered some human values in their software development activities; for example, developing an application to align with the cultural background of its users. In terms of the importance of human values, participants believed that it depends on the nature of the software being developed, although some values, such as security and privacy, are considered important regardless of the functionality of the software.

3.2 Benefits of a Human Values Dashboard (RQ2)

The participants suggested that a human values dashboard can benefit all roles across several stages of software development. A human values dashboard can be used, in particular, to determine values-driven priorities in a project and to raise awareness of values within the software development team. For example, the project manager could use the dashboard to discuss the project’s values priorities with the product owner. The dashboard could also help other roles, such as requirements engineers or developers, to be aware of values that need to be addressed in their tasks. In open source projects, information on the presence of human values could help users of the project assess whether they would like to use the application, i.e., whether the values present align with their own values.

3.3 Artefacts as the Datasource of the Dashboard (RQ3)

The participants proposed several artefacts, namely market research documents, requirements documents, design documents, feature specification documents, issue discussions, and pull request discussions, as potentially suitable sources for a human values dashboard. The participants chose these artefacts for the following reasons: (a) values can potentially be identified within these artefacts; and (b) these artefacts are used and referred to during software development. Among these artefacts, the participants identified requirements documents and issue discussions as the most suitable sources for the dashboard.

3.4 Requirements for a Human Values Dashboard (RQ4)

The participants suggested six high-level requirements for a human values dashboard, shown in Table 2. First, the dashboard should be able to identify the presence of human values within the artefacts automatically. Second, the dashboard should be able to trace the identified values back to their artefact source. The third and fourth requirements concern determining the values priority and displaying the artefacts based on that priority. The last two requirements relate to the ongoing development of a project: the dashboard should present updates on the artefacts and provide different views concerning the values in the artefacts.

Table 2 Proposed requirements for the dashboard

4 Dashboard Development Stage

This stage involves developing a human values dashboard as a proof of concept. The results of the exploration stage were considered during the development of the dashboard. The dashboard development used the prototype views (Figure 4 in Nurwidyantoro et al. (2021a)) as the basis. This stage began with designing the components and functionality of the dashboard. Subsequently, we evaluated machine learning techniques to automate the detection of human values as a component of the dashboard (see Table 2). The dashboard was then implemented and populated with artefacts from open-source projects hosted on GitHub.

4.1 Analysis and Design

The analysis considered the perspectives of the practitioners on what is required of a human values dashboard in the exploration stage (Table 2). The analysis of those high-level requirements is described as follows:

  1. R1

The identification of values in the dashboard shall be conducted automatically. To address this requirement, a human values detector is used as a component in the back end of the dashboard. To support automatic detection, the human values detector utilises machine learning models. The experiments to determine which machine learning model to use are presented in Section 4.2.

  2. R2

    The dashboard should maintain the traceability between the identified values and their artefact source. This requirement was addressed by storing the web page URLs of the artefacts in a database. These URLs would be displayed in the front end along with the artefacts. Using this approach, practitioners could use the URLs to refer to the actual location of the artefacts in the repository.

  3. R3

The dashboard shall allow the development team to determine the values priority of a project. The machine learning model presented in Section 4.2 was used to detect the presence of human values in artefacts. At this time, this model is unable to detect the presence of specific values (e.g. privacy or inclusiveness). Due to this limitation, the dashboard only displays whether human values in general are present.

  4. R4

The dashboard shall display the artefacts based on the values priority determined in a project. As explained in the previous requirement, the presence of any human values was assumed to be the priority. To address this requirement, the dashboard provides a filtering mechanism on the front end that allows it to display only the issues in which values have been identified. To inform users of the latest updates on the artefacts, the dashboard displays the date and time the artefacts were reported and closed. The dashboard also notifies users when the human values detector finds human values in an artefact.

  5. R5

    The dashboard shall provide different views for various roles to support addressing values in software development. In the dashboard prototype (Nurwidyantoro et al. 2021a), the dashboard provided three views for various roles in software development. The development of the dashboard included these views with some adjustments based on the availability of the artefacts (i.e. issue discussions) and the capability of the human values detector (i.e. in detecting whether any human values were present).

After considering the high-level requirements from the practitioners in the exploration stage, this stage continued with designing the components of the human values dashboard. Similar to the prototype, the human values dashboard was designed with a back end and a front end. The back end provides automated downloading of artefacts from project repositories and automated labelling of human values in the artefacts. The dashboard’s front end provides three views similar to those of the previous prototype, with some adjustments based on the practitioners’ suggestions in the exploration stage. Figure 5 shows the components of the human values dashboard. The first component on the back end, the artefacts downloader, allows development teams to specify repository URLs and download the corresponding artefacts. The downloaded artefacts are stored in the database. The human values detector can then be used to automatically detect the presence of human values in the downloaded artefacts. It uses pre-trained models from the human values detection experiments (Section 4.2) and stores the results in the same database. The views in the front end provide visualisations of the detection results and their corresponding artefacts.

Fig. 5 The components and simplified flow of the human values dashboard

To download artefacts, a software practitioner specifies the project and artefacts they want to download. The artefacts downloader then connects to the project repository via the GitHub API and downloads the specified artefacts to the dashboard database. For this study, we chose issue discussions as the dashboard source because it is one of the artefacts suggested by our participants in the exploration stage and is also supported by previous work (Fischer et al. 2017; Viega et al. 2002; Pletea et al. 2014; Alqahtani and Rilling 2017; Kim et al. 2012; Li et al. 2015; Gibler et al. 2012; Naseri et al. 2019; Kuznetsov et al. 2016; Slavin et al. 2016; Sharma et al. 2014; Bao et al. 2016; Pereira et al. 2017; Nurwidyantoro et al. 2021b). From this process, the database stores the project information, issues, and corresponding posts. To obtain the results of values detection in the issues, a practitioner runs the human values detector against the issues. The human values detector uses pre-trained models to detect the presence of human values in the downloaded issues and stores the results in the database. Issues and their detected values are then displayed in the front end.
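To illustrate this flow, the sketch below shows how issues could be fetched via the GitHub REST API and stored in a local database for later detection. This is a minimal illustration under our own assumptions (the requests library, a SQLite file named dashboard.db, and the table layout are ours); it is not the dashboard’s actual implementation, which uses the GitHub API as described in Section 4.3.

```python
import sqlite3
import requests

ISSUES_API = "https://api.github.com/repos/{owner}/{repo}/issues"

def download_issues(owner, repo, token, db_path="dashboard.db"):
    """Fetch all issues of a repository and store them for later values detection."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS issues (
                       project TEXT, number INTEGER, title TEXT, body TEXT,
                       state TEXT, url TEXT, has_values INTEGER)""")
    page = 1
    while True:
        resp = requests.get(ISSUES_API.format(owner=owner, repo=repo),
                            headers={"Authorization": f"token {token}"},
                            params={"state": "all", "per_page": 100, "page": page})
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        for issue in batch:
            if "pull_request" in issue:  # the issues endpoint also lists pull requests
                continue
            con.execute("INSERT INTO issues VALUES (?, ?, ?, ?, ?, ?, NULL)",
                        (f"{owner}/{repo}", issue["number"], issue["title"],
                         issue["body"] or "", issue["state"], issue["html_url"]))
        page += 1
    con.commit()
    con.close()
```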

4.2 Automating the Detection of Human Values

To address the automated detection requirement from the exploration stage (Section 3.4), we formulated the detection of human values as a classification problem: whether human values are present in software development artefacts. We used a dataset on the presence of human values in issue discussions from our previous work (Nurwidyantoro et al. 2021b). This dataset consists of 1,097 issues manually labelled with the presence of values. The labelling of this dataset followed the same human values concepts used in this study, making it suitable for providing human values perspectives to the dashboard. Software practitioners identified issue discussions as one of the appropriate artefacts for the human values dashboard (see Section 3.3). Finally, we evaluated well-known machine learning techniques, which have previously been used to classify software development artefacts, to detect human values in the issue discussions.

4.2.1 Methodology

To automate the detection of human values, we first preprocessed the dataset. Second, we extracted the classification features from the issues. Finally, we conducted experiments to evaluate four well-known machine learning techniques for detecting the presence of human values in issue discussions. These steps are described below.

Preprocessing

Two preprocessing activities were performed, namely content abstraction and data cleansing. The content abstraction process abstracts contents to their types (Prana et al. 2019). For example, a mention detected in an issue discussion was replaced with a ^mention^ string. Table 3 shows the abstracted contents and their string abstractions found in the issues. The data cleansing process removed punctuation, numbers, source code, stop words, and HTML tags from the dataset. The removal of stop words was done using the Natural Language Toolkit (NLTK) library (Bird et al. 2021).

Table 3 Content types found in issues and their abstractions
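The two preprocessing steps above can be sketched as follows. This is a minimal illustration: the regular expressions and abstraction strings are stand-ins for the patterns summarised in Table 3, and the stop-word removal uses NLTK as described.

```python
import re
from nltk.corpus import stopwords  # requires a one-off nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def abstract_content(text):
    """Content abstraction: replace content types with abstraction strings."""
    text = re.sub(r"https?://\S+", " ^url^ ", text)  # links
    text = re.sub(r"@\w+", " ^mention^ ", text)      # user mentions
    # ...further patterns (e.g. code snippets, images) would be handled similarly
    return text

def cleanse(text):
    """Data cleansing: remove HTML tags, punctuation, numbers, and stop words."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^A-Za-z\^ ]", " ", text)
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)

def preprocess(issue_text):
    return cleanse(abstract_content(issue_text))
```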

Feature extraction

Two statistical features and a sentiment feature were extracted from the preprocessed dataset. The two statistical features, BoW (bag of words) and TF-IDF (term frequency-inverse document frequency), have been used in previous studies for the classification of human values and related concepts in software engineering (e.g. Jha and Mahmoud 2019; Rezaei Nasab et al. 2021; Ishita et al. 2010; Ortu et al. 2016). BoW represents each issue by its terms and their number of occurrences in that issue (Schütze et al. 2008). Meanwhile, TF-IDF considers the importance of each term in the dataset by multiplying the frequency of a term t in an issue d by the inverse of the number of issues in which that term is present (Schütze et al. 2008):

$$ \text{TF-IDF}_{t,d} = tf_{t,d} \times \log\frac{N}{df_{t}}, $$

where:

  • tf_{t,d} is the frequency of term t in issue d,

  • N is the number of issues in the dataset, and

  • df_{t} is the number of issues in the dataset that contain term t.

In addition to these two features, we extracted a sentiment feature from the issue discussion dataset. The sentiment feature was derived from the results of sentiment analysis of the issues, as suggested by Nurwidyantoro et al. (2021b), who also provided the dataset we used. Sentiment analysis aims to ‘analyse people’s opinions, sentiments, and emotions towards entities (e.g. products)’ (Liu 2020). To determine the sentiment score of each issue, this study used SentiStrength (Thelwall et al. 2010) because it supports sentiment analysis of informal text communication (Thelwall et al. 2010). The SentiStrength tool provides two sentiment strengths: a positive sentiment score scaled from 1 (less positive) to 5 (extremely positive) and a negative sentiment score scaled from -1 (less negative) to -5 (extremely negative). These scores were used as the sentiment feature in this study.
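A minimal sketch of how these feature sets could be built with scikit-learn is shown below. The function is our own illustration; the sentiment scores are assumed to have been computed beforehand with SentiStrength as described above.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_features(texts, sentiments=None, use_tfidf=True):
    """texts: preprocessed issue texts; sentiments: (positive, negative) score pairs per issue."""
    vectoriser = TfidfVectorizer() if use_tfidf else CountVectorizer()  # TF-IDF or BoW
    X = vectoriser.fit_transform(texts)
    if sentiments is not None:
        # Append the SentiStrength scores as two extra columns (BoW+Sentiment / TF-IDF+Sentiment).
        X = hstack([X, csr_matrix(np.asarray(sentiments, dtype=float))])
    return X, vectoriser
```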

Classification experiments

This study formulated the detection of human values as a binary classification problem to identify whether there are human values present in issue discussions. We started with binary classification for the presence of any values rather than for specific values because the dataset contains a small number of cases for each value (see Nurwidyantoro et al. 2021b). Furthermore, the dataset is quite unbalanced, with the number of issues in which the values were identified being only one-third of the total issues in the dataset.

The experiments evaluated four well-known supervised learning methods, namely support vector machines, random forest, multi-layer perceptron, and logistic regression. We chose these methods because an earlier study on the identification of human values in text documents reported that a deep learning approach performs less well on smaller datasets and ‘achieve[s] good results in data-rich settings’ (Ishita et al. 2019). All of these methods have been used in previous studies to classify the content of GitHub repositories (Golzadeh et al. 2021; Arya et al. 2019; Fan et al. 2017; Eluri et al. 2019; Trockman et al. 2019; Munaiah et al. 2017; Kikas et al. 2016; Song and Chaparro 2020). The experiments used the implementations of these methods in the scikit-learn library (Pedregosa et al. 2011).

In the experiments, the performance of the classifiers was evaluated using ten-fold cross-validation, which is commonly used to evaluate classifier performance, including in the software engineering field (Ding et al. 2018; Ma et al. 2018). This technique splits the dataset into ten equal-sized parts. A classifier is then trained on nine parts of the dataset and evaluated on the remaining one. This training and evaluation process is repeated 10 times so that each part is evaluated once. The average and standard deviation of the results were then calculated as the final scores. As performance measures, this study used precision, recall, F1, and Matthews correlation coefficient (MCC). The first three metrics are considered standard performance measures for classification problems (e.g. Jha and Mahmoud 2019; Ding et al. 2018; Ma et al. 2018). The MCC was included because recent studies argue that it provides an unbiased measure of performance (Yao and Shepperd 2020; 2021).
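For reference, MCC is computed from the confusion-matrix counts of true/false positives (TP, FP) and true/false negatives (TN, FN); unlike F1, it also takes true negatives into account, which is why it is argued to be less biased on unbalanced data:

$$ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} $$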

The experiments considered three parameters, namely resampling technique, feature set, and classification method, explained as follows:

  1. 1.

Resampling technique. This parameter was considered because of the unbalanced nature of the dataset. An unbalanced dataset can bias the performance of a classifier toward the majority class (Padurariu and Breaban 2019). In the experiments, we evaluated the performance of the classifiers without any resampling technique and compared it with the use of oversampling and undersampling techniques. An oversampling technique attempts to balance a dataset by generating new samples for the under-represented class (Mohammed et al. 2020). We used SMOTE (synthetic minority oversampling technique) (Chawla et al. 2002), one of the prominent oversampling techniques used in classification experiments (e.g. Arya et al. 2019; Beyer et al. 2020; Catolino et al. 2019). In contrast, an undersampling technique balances the dataset by selecting a subset of the class with the majority of samples (Mohammed et al. 2020). It has been used in software engineering research as an alternative way to handle unbalanced datasets (e.g. Biswas et al. 2019; Canedo et al. 2020). In these experiments, we randomly selected a subset of samples using the RandomUnderSampler implementation in the imbalanced-learn library (Lemaître et al. 2017).

  2. 2.

    Feature set. This parameter investigated how the features influence the performance of the classifiers. In the experiments, we compared the performance of the classifiers using BoW, TF-IDF, and the combination of each statistical feature with the sentiment feature (i.e. BoW+Sentiment and TF-IDF+Sentiment features).

  3. 3.

Classification method. We experimented with four classification methods, namely support vector machine (SVM), random forest (RF), multi-layer perceptron (MLP), and logistic regression (LR). To obtain the best parameters for each classification method (hyper-parameter tuning), a grid search was performed over a set of values for the methods’ parameters. This approach has been used in previous work on classification experiments (e.g. Golzadeh et al. 2021; Arya et al. 2019; Song and Chaparro 2020). The arguments and their values used in the experiments are shown in Table 4; a sketch of such a tuning pipeline is given after the table. We then selected the best result for each classification method and compared them with each other.

Table 4 Arguments for the classification methods
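The sketch below illustrates how resampling and grid search can be combined so that resampling is applied only to the training folds of the cross-validation; it uses the imbalanced-learn pipeline with the scikit-learn implementations mentioned above. The specific grid values are our own placeholders, not the exact values from Table 4.

```python
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

# The resampler runs inside the pipeline, so it is fitted on the training folds only.
pipeline = Pipeline([
    ("resample", RandomUnderSampler(random_state=42)),  # swap for SMOTE(...) to oversample
    ("clf", MLPClassifier(max_iter=1000, random_state=42)),
])

param_grid = {  # illustrative grid; the actual argument values are listed in Table 4
    "clf__activation": ["tanh", "relu"],
    "clf__solver": ["lbfgs", "adam"],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1",
                      cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42))
# search.fit(X, y)  # X: feature matrix from Section 4.2.1, y: value-presence labels
```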

4.2.2 Experiments Results

The experiments used the F1 score, which provides ‘the balance between precision and recall’ (Arya et al. 2019), as the primary metric to determine the performance of the classifiers. The remaining metrics were used to provide different perspectives on the classification results. This approach is common in classification studies, including those in software engineering (e.g. Arya et al. 2019; Fan et al. 2017; Prana et al. 2019). Table 5 shows the best performance of each classification method.

Table 5 The best performance of each classification method

The best performance of the SVM method was achieved using the undersampling technique and the TF-IDF feature with the radial basis function kernel. The BoW feature combined with the sentiment feature and the oversampling technique performed best for the RF method; the corresponding hyper-parameter setup used the entropy information gain and 1,000 decision trees. The best F1 performers for these two methods had the same F1 score (0.619). However, the RF method offered better precision but slightly lower recall than the SVM method.

For the MLP method, Table 5 shows that the best performance was achieved using the undersampling technique and the TF-IDF feature. The best hyper-parameter setup for this method used the hyperbolic tangent (tanh) activation function and the L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) solver. Similarly, for the LR method, the best performer used the undersampling technique and the TF-IDF feature combined with the sentiment feature, with the stochastic average gradient (SAG) solver. The F1 scores for the MLP and LR methods were very close (0.001 difference). However, the precision of the MLP was slightly better than that of the LR method, while its recall was slightly lower.

Comparing the precision of all methods, Table 5 shows that the RF method was the best, while the LR method had the best recall. The MCC scores were largely aligned with the F1 scores. The MCC score for the RF method was higher than that of the SVM method, although their F1 scores were the same, which means the RF method offered better overall predictions than the SVM method. Nevertheless, the MLP method was the best performer among all these methods, with the highest F1 and MCC scores.

Table 6 shows the average confusion-matrix values over all 10 folds for the best-performing MLP classifier in Table 5. The Total Actual column of this table reflects the imbalanced nature of the testing set (i.e. 37 issues with values vs. 73 issues without values), which confirms that undersampling was only applied to the training set. The confusion matrix shows that the MLP classifier correctly identified the majority of the issues in which values were found (i.e. 27 out of 37 issues). However, the classifier performed less well in detecting issues in which no values were found (i.e. 53 out of 73 issues): it incorrectly identified 20 issues as having values (i.e. false positives), while only 10 issues were incorrectly identified as having no values (i.e. false negatives). This results in a higher recall (0.74) and a lower precision (0.58). The complete comparison and results of these experiments can be found in Nurwidyantoro (2022). We used the best classifier, i.e. the multi-layer perceptron with undersampling, in the human values dashboard.

Table 6 The average values of confusion matrix of all folds in the MLP (Precision= 0.58, Recall= 0.74, F1 = 0.65, MCC= 0.53)
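As a rough consistency check, plugging the averaged counts from Table 6 into the standard definitions gives values close to the reported per-fold averages (small differences arise because the table shows rounded averages over the folds):

$$ \text{Precision} = \frac{TP}{TP+FP} \approx \frac{27}{27+20} \approx 0.57, \qquad \text{Recall} = \frac{TP}{TP+FN} \approx \frac{27}{27+10} \approx 0.73 $$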

4.3 Dashboard Implementation

The dashboard was implemented using Flask, a web framework written in Python. A Python-based framework was chosen to facilitate the integration of the human values detector into the back end. The implementation of the dashboard focused on the use of issue discussions, as suggested by the empirical findings of a previous study (Nurwidyantoro et al. 2021b) and by practitioners in the exploration stage (Section 3.3).
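As an illustration of this integration, the sketch below shows how a Flask back end could load a saved detector model and expose detection results to the front-end views. The route names, database layout (matching the earlier downloader sketch), and use of joblib to persist the trained scikit-learn pipeline are our assumptions, not the dashboard’s exact code.

```python
import sqlite3
import joblib
from flask import Flask, jsonify

app = Flask(__name__)
DB_PATH = "dashboard.db"                       # hypothetical database written by the downloader
detector = joblib.load("values_detector.pkl")  # assumed: full pipeline (vectoriser + classifier)

@app.route("/detect/<int:number>")
def detect(number):
    """Run the human values detector on one stored issue and persist its label."""
    con = sqlite3.connect(DB_PATH)
    (body,) = con.execute("SELECT body FROM issues WHERE number = ?", (number,)).fetchone()
    label = int(detector.predict([body])[0])   # preprocessing is assumed to happen in the pipeline
    con.execute("UPDATE issues SET has_values = ? WHERE number = ?", (label, number))
    con.commit()
    con.close()
    return jsonify({"issue": number, "values_present": bool(label)})

@app.route("/summary")
def summary():
    """Counts behind the pie chart in the summarised values overview (OV)."""
    con = sqlite3.connect(DB_PATH)
    rows = con.execute("SELECT has_values, COUNT(*) FROM issues GROUP BY has_values").fetchall()
    con.close()
    return jsonify({str(flag): count for flag, count in rows})
```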

Front-end views were developed using the Chart.js library. The three views proposed in the exploration stage were retained because participants in the exploration stage considered those views useful for various roles in software development. We made some adjustments to the three views in the implementation because of the limitation of the automated human values identification (Section 4.2), which only allowed us to display whether human values are present in the issues, without specifying which specific values, such as security or inclusiveness, are present. The adjustments made are explained below.

  1. 1.

    Summarised values overview (OV). This view displays the number of issues where human values were present and not present in a pie chart. To allow for comparisons between projects, this view provides two of these pie charts side by side. This view also displays the number of issues where values are detected based on the status of the issues (i.e., open or closed). This view aims to provide insight into the number of values-labelled issues that need to be addressed. Figure 6 shows the OV implemented on the dashboard.

  2. 2.

    Values-labelled list (LI). This view shows a list of issues similar to how issues are displayed on GitHub. This view displays a label as the result of the human values detector indicating the presence of human values in a particular issue. An issue is labelled with either the ‘Values’ label if the human values detector finds human values in that issue or the ‘No Value’ label if the human values detector does not find human values in that issue. This view also includes information on when the issue was opened, by whom, whether it is open or closed, and the number of posts. There is a filtering capability for issues and a link to the original webpage of the issue, as suggested in the exploration stage (Section 3.4). The filtering feature allows software practitioners to view issues in which human values are identified. Figure 7 shows this view on the dashboard.

  3. 3.

Values-labelled timeline (TM). This view shows the issues chronologically according to the date they were opened. A bar graph shows the monthly number of open and closed issues where values are present (Fig. 8a). At the bottom, this view presents a timeline of issues where values are present, with two colours indicating whether the issues are open (orange) or closed (yellow) (Fig. 8b). Figure 8 shows both visualisations in the timeline view.

Fig. 6 Dashboard summarised overview (OV). (Boxes in the red outline are not part of the dashboard)

Fig. 7 Dashboard values-labelled list (LI). (Boxes in the red outline are not part of the dashboard)

Fig. 8 Dashboard timeline (TM). (Boxes in the red outline are not part of the dashboard)

On the back end, there are two interfaces for the artefacts downloader and human values detector components. The first interface allows the development team to specify a range of issue numbers to download from the project repository (Fig. 9a). The issue downloader was implemented using the GitHub API and the github3.py library. The second interface enables the development team to specify a range of issue numbers to be processed by the human values detector (Fig. 9b). The specified range of issues is then used as a parameter to run either the download or the detection as a background task on the server. Figure 9 shows these interfaces of the dashboard back end. Two open source projects, Signal Android and K9 Mail, were used as examples in the feedback stage. The human values dashboard is available online.

Fig. 9 User interfaces for the dashboard’s back end

5 Feedback Stage

This stage involved presenting the human values dashboard developed in the previous stage to software practitioners to gather their feedback and determine whether the dashboard is useful. Additionally, because the human values dashboard uses an automated technique to detect human values (i.e. the human values detector, Fig. 5), it is also necessary to understand the practitioners’ opinions on the performance of the detector. We believe this is necessary, regardless of the current classifier’s performance, to learn what level of performance is acceptable to practitioners. Therefore, for this feedback stage, the following sub-research questions were defined:

  1. RQ5

    To what extent do practitioners find the human values dashboard useful?

  2. RQ6

    How do practitioners perceive the performance of the automated human values detection?

In addition to the answers to these sub-research questions, the participants’ suggestions were also collected to improve the dashboard in the future.

5.1 Interview Guide Development

An interview guide was developed for the feedback interview to obtain the practitioners’ feedback and suggestions on the human values dashboard. This semi-structured interview consisted of two parts. The first part asked for the demographic information of the participants, such as their roles and experience. The second part started with an introduction to human values concepts and a demonstration of the values dashboard, and then asked the practitioners’ opinions regarding the dashboard and its usage. This part also asked questions related to the human values detector, i.e. the component that automatically detects the presence of values. The interview guide was discussed with the supervisory team and other group members, and was adjusted by incorporating their suggestions.

5.2 Data Collection

Participant selection criteria. The selection criteria used in this stage were similar to those used in the exploration stage. This interview sought practitioners who had been involved in a software development project and were familiar with artefacts from software repositories. This stage involved a new set of participants, i.e. practitioners who had not been involved in the exploration stage. This choice was made to investigate whether different practitioners viewed the human values dashboard as acceptable.

Participant recruitment. Participants were recruited by inviting contributors of open-source projects hosted on GitHub via email; these contributors had made their email addresses publicly available on their GitHub pages. Interested participants were asked to reply to the email invitation. An invitation to participate was also published on the group web page and LinkedIn. In addition, our colleagues were asked to broadcast the invitation to their networks. Interested practitioners were asked to inform us of their email addresses through our colleagues or to fill out an online form on the group web page. These candidates were then contacted via email to request their consent and arrange an interview session.

Profile of the participants. Table 7 shows the profiles of the participants for the feedback interview. Please note that the participants for this stage are different from the participants in the exploration stage. Participants mostly had developer roles. Most of them had less than 10 years of experience in software development; 4 had 10 or more years of experience. The participants were mostly located in Asia, with one participant located in Europe and another in Australia.

Table 7 Profile of the participants

Interview protocol. Before the interview session, participants were asked to read the explanatory statement and complete the interview consent form. The informed consent document is available in Nurwidyantoro et al. (2022b). All interview sessions were conducted in English, and we ensured that all participants had adequate English proficiency. The interview consisted of two parts. The first part focused on obtaining the professional backgrounds of the participants. The second part started by explaining the human values concepts and the study. The participants were then given the link to access the dashboard developed in Section 4.3. The interviewer demonstrated the dashboard and provided 10-15 minutes for the participants to interact with and evaluate the dashboard and its contents. The participants were then asked for their perspectives on the usefulness of the dashboard (e.g. ‘Would the dashboard be useful for you in software development? How?’) and on the performance of the values detection (e.g. ‘At what level is the dashboard accuracy tolerable for you? Why?’). Before asking these questions, we explained and discussed the background of each question to ensure that the participants understood. This second part of the interview also asked for their suggestions and feedback on the dashboard (e.g. ‘Does the information provided in the dashboard prototype sufficient to help you?’). The interview questions for this study are available in Nurwidyantoro et al. (2022a).

The interviews in this stage were recorded using a video conference system with the permission of the participants. Similar to the exploration phase, the number of interviews had not been set in advance. Recruitment and interviews were conducted in parallel with data analysis until data saturation was reached (Beitin 2012; Ournani et al. 2020). The convergence of answers and ideas became apparent in the data analysis after 10 interviews. The mean duration of the interviews was 30 minutes and 49 seconds. Professional transcription services transcribed all the audio recordings of the interviews.

5.3 Data Analysis

The interview data was analysed using the thematic analysis approach (Braun and Clarke 2012). Similar to the data analysis in the exploratory stage of this work and other previous studies (Tomasdottir et al. 2017; Tómasdóttir et al. 2020), the first author performed a large portion of the analysis, which was followed by reviews and discussions with the other authors. In this analysis process, the supervisory team was also consulted in the event of doubts or difficulties. The first author started to familiarise himself with the interview data by reading the transcriptions and listening to the audio recordings. Then, the first author generated codes and themes from the analysis of the transcriptions. Subsequently, the first author had several discussions with the supervisory team to review the identified codes and themes and determine their relations. The first author then assigned a name and definition to each theme. The resulting themes were presented to the other authors for feedback. The themes were then adjusted by incorporating that feedback.

5.4 Results

This section presents the results of the feedback stage. First, this section describes the usefulness of the human values dashboard and the challenges of deploying it in a company. Second, this section presents the practitioners’ perceptions of the human values detector. Finally, this section lists suggestions from the practitioners to improve the dashboard further.

5.4.1 Usefulness of the Dashboard (RQ5)

To understand the practitioners’ perspectives on the extent to which the human values dashboard could be helpful, the interview started by presenting the dashboard and letting the participants explore it. The interview then asked the participants about the usefulness of the dashboard in supporting their development activities. The analysis of the interviews suggested that the practitioners agreed that the dashboard could be useful for them, although some participants pointed out potential challenges in implementing the dashboard in their company. These findings are described below.

The human values dashboard was considered useful. The participants agreed that the human values dashboard could be useful to support them in software development activities. Identifying the values present in issues would help the practitioners focus their attention on the issues, which, in turn, would ensure these issues were addressed. A developer mentioned:

‘... Developers will pay their attention to that one [the LI view]. So, if we make sure that we have covered all possible scenarios in the issue list to take down [address] that human values in those tickets.’. (P16–Developer)

In addition to focusing the development team’s attention, the participants believed that the dashboard’s values labels would provide human values perspectives in addition to the well-known technical perspectives. A developer mentioned:

‘When we look at an issue right now, so we do not think about any values aspect, like human values normally. We just think about it from a technical side usually. This would be helpful to understand there is another aspect for the ticket there.’. (P20–Developer)

The human values perspectives would subsequently help them prioritise their tasks. This usefulness would be apparent if there were a substantial backlog of issues:

‘... Especially when there is a huge backlog of issues, I think it is very hard to kind of prioritise and a lot of issues get lost in the backlog, and we file it during one time and then it kind of gets lost and then it does not come up or it just that. So, if there is some sort lot of, let us say, a subjective value, let us say morals assigned to an issue. I think it would help to kind of prioritise it.’. (P22–Software Architect)

Some participants believed that the dashboard could also inform the team’s performance. The dashboard summarised overview (Fig. 6) could also be used to compare the progress between projects. A developer mentioned:

‘I have a company, and I am running several projects. Okay. So, I can measure the team performance by this tool easier, and also the complexity of the project I can understand from this.’.(P19–Developer)

A project manager suggested that linking values-labelled issue posts to their contributors could help identify values champions. The participant referred to a values champion as ‘anyone who aligns themselves with human values in the organisation’. The participant mentioned:

‘In this dashboard, you can see who is the champion of these values or maybe what is the level of “do not do evil” in the discussion, inside the repository and the issue tracker.’. (P17–Project Manager)

Potential challenges in adopting the dashboard. When the dashboard was presented, some participants reacted by suggesting potential challenges in adopting the dashboard in their environment. A developer mentioned that a contributor might not describe the issue correctly and that this could influence the result of the human values detector:

‘Because, in my experience, I have gone through some issues that may be the QA developers, ... I mean, QA when raising these issues, but they are not correctly describing the issue in the field.’. (P16–Developer)

Another challenge concerned the willingness of a company to use the dashboard. The participants suggested some reasons that could hinder the use of the dashboard in a company. First, a company may not be familiar with the concept of human values. This situation could lead to a lack of awareness of human values in the company, with the company tending to focus on the financial aspects of the business:

‘Although it has some significant impact while I am developing something or not, but sometimes the management or the [project] plan, and does not bother [with] that type of issues or that type of thing. They only think about money and business.’.(P19–Developer)

Even if a company is aware of human values, it must decide how to address conflicting values from different users. Additional effort may be necessary to determine what needs to be done:

‘These are two issues that we need to prioritise. Are (users from) China our main priority or (users from) [the] US our main priority? The domain is specific. So, how can I prioritise these two issues by these two (users)? Is it possible?’. (P19–Developer)

Second, a corporation could argue that the consideration of human values is not required because it is unregulated. A project manager suggested that a company itself is in a position to decide whether it wants to support the consideration of values:

‘This is [an] area where the company, right now within the US or maybe international law is not compulsory. It is more like the company does assessments on their intentions, on their diversity, and so on, as a public campaign, but not regulated.’. (P17–Project Manager)


5.4.2 Perceptions of the Performance of the Human Values Detector (RQ3.6)

The use of an automated approach to identify human values in a dashboard may lead to inaccuracies. This interview stage used the term ‘accuracy’ to simplify the communication with the participants regarding the correct or incorrect identification of values. To understand how practitioners perceive the automated human values detector, the interview probed the extent to which the performance of the detector was tolerable to the participants.

The analysis of the interviews indicated that the practitioners understood that inaccuracies could occur in the identification of human values. However, the level of tolerance for inaccuracy varied among practitioners. One practitioner required at least 90% accuracy to trust the identification results:

‘To have that kind of level of trust, I think at least 90% accuracy is needed. Less than 90%, usually we do not trust the tools, we do not put any action point on the tools.’. (P17–Project Manager)

Meanwhile, another practitioner considered 50% accuracy to still be tolerable:

‘This is a machine learning thing, so there will be some issues. It cannot give an exact solution, so I think 50 is enough and it will develop after some time.’. (P18– Developer)

The analysis of the interviews also revealed that all participants preferred the detector to produce false positives rather than false negatives. This finding meant that it was acceptable for the human values detector to identify an issue as having values present even when that was not correct. All participants agreed that false positives were better than missing critical issues because the detector had failed to detect the presence of values. A developer mentioned:

‘It says there is no value, but actually there is a value. We can neglect this since it notifies that this has no value, and we neglect it without further investigating the issue.’. (P17–Developer)


5.4.3 Suggestions for Improving the Dashboard

To obtain feedback, each view in the dashboard (Figs. 6, 7, and 8) was demonstrated to the participants. Then, the participants were probed for suggestions to improve the dashboard. The first author, as the main analyst of the interviews, collected the participants’ feedback and suggestions on each view and on the overall dashboard.

Table 8 shows the feedback from the participants on the dashboard. In G1, the participants wanted the dashboard to display which specific values were detected in the issues. This would give a development team an opportunity to address issues based on their values priorities. One participant also wanted functionality, such as a to-do list or a planner, to manage the issues that a developer wants to address (G2). Some of these functions are provided in GitHub (2021d). In G3, it was found that each team or practitioner had a preference for the label colour. A colour customisation feature could be developed to address this suggestion. Practitioners also suggested that the dashboard display the progress of each issue (G4). This information could help them plan or predict application delivery. Additionally, they wanted the dashboard to allow them to specify which values they wanted to prioritise (G5). This suggestion could be addressed by having the human values detector detect the presence of specific values (e.g. privacy or longevity). In G6, the practitioners suggested additional criteria for ranking the issues on top of the presence of human values. The urgency level or the popularity of an issue could be indicators for this ranking. In G7, a practitioner suggested indicating not only the presence of human values but also violations of values in an issue. The practitioner stated that this indication would help prioritise the issues.

Table 8 Suggestions for the overall dashboard

Table 9 lists the practitioners’ suggestions for the summarised overview (OV) of the dashboard. In OV1, the participants suggested that the dashboard should provide an additional categorisation based on the type of issue. An example of such a categorisation is by the roles in the team that could address the issues, such as UI issues for the UI designer. A practitioner also suggested that the OV include a reporting functionality to support the decision-makers of a software project (OV2).

Table 9 Suggestions for the summarised overview in the dashboard

For the list view of the dashboard, the practitioners had some suggestions, which are listed in Table 10. For example, a practitioner suggested that the issues’ assignees be displayed in the dashboard (LI1). The practitioner mentioned that this information would help filter out issues that still need someone to work on them. In LI2, a practitioner suggested including the topics of the issues. This information would provide the development team with a quick summary of what all the issues are about. Related to the accuracy of the human values detector (Section 5.4.2), a participant mentioned that it would be helpful if the confidence level of detection were displayed for each issue (LI3). This confidence level is the prediction score produced by the classifier when classifying an issue, e.g., whether the issue contains values or not. This suggestion could help practitioners prioritise issues with a higher confidence level of detection (see the sketch after this paragraph). To further help practitioners in addressing issues, one participant also requested that the dashboard search for and display relevant solutions from search engines (e.g., Google) or question-and-answer forums (e.g., Stack Overflow) for each issue (LI4).
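As an illustration of how such a per-issue confidence level could be obtained, the following is a minimal sketch rather than the implementation used in this study. It assumes a scikit-learn-style text classifier (here a hypothetical TF-IDF plus logistic regression pipeline with toy training data) whose predicted probability for the ‘values present’ class is used to rank issues in the list view.

```python
# Minimal sketch (not the study's implementation) of surfacing a detection
# confidence per issue, assuming a scikit-learn-style binary classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: issue texts with binary 'has values' labels.
train_texts = ["Users complain the app shares location data without consent",
               "Fix typo in README"]
train_labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

def values_confidence(issue_text: str) -> float:
    """Probability, according to the classifier, that the issue contains human values."""
    return float(clf.predict_proba([issue_text])[0][1])

# The list view could then sort issues so that those with a higher
# detection confidence appear (and can be prioritised) first.
issues = ["Add dark mode toggle",
          "Personal data is sent to a third-party analytics service"]
for text in sorted(issues, key=values_confidence, reverse=True):
    print(f"{values_confidence(text):.2f}  {text}")
```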

Table 10 Suggestions for the list view in the dashboard

In the timeline view of the dashboard, the participants provided several suggestions related to the time perspective, as shown in Table 11. First, the practitioners requested a customisable time range (TM1). The practitioners felt that this customisation would highlight recent issues depending on the frequency of issues in a project. The remaining two suggestions were related to the duration of time that had passed since an issue was first reported (TM2) and the duration needed for an issue to be completed (TM3). The former suggestion (TM2) would highlight an issue that had not been addressed for a period of time, while the latter (TM3) would provide analytics of issue completion to the project managers.

Table 11 Suggestions for the timeline view in the dashboard

6 Discussions

This section highlights and discusses the findings of the exploration and feedback stages. First, it discusses the findings on the awareness of values. Second, it discusses the possibility of using software artefacts as the source from which to identify values for the dashboard. Third, it discusses the possibility of providing a human values dashboard for users. Finally, it discusses the limitations and challenges of the dashboard.

6.1 Awareness of Values

The analysis in the exploration stage showed that software practitioners are familiar with only a limited set of values, such as security and accessibility (Section 3.1). This finding strengthened the findings of Perera et al. (2020), which highlighted that only a few values have been discussed in recent academic publications on software engineering. This lack of awareness was also found in the feedback stage as a potential challenge to the adoption of the human values dashboard in a company (Section 5.4.1). One possible reason is a lack of understanding of these values in the software engineering context. Furthermore, participants in the exploration stage thought that the values they were familiar with were important and believed that they had already considered these values during software development. The other values, with which they were not familiar, became ‘nice to have’ in an application. It could be argued that the values that participants are familiar with are similar to non-functional requirements related to the quality properties or characteristics of software (Glinz 2007; Mairiza et al. 2010; Barn 2016). This might also be the reason why practitioners are not familiar with some other values. This argument makes sense to us because, based on our previous work (Nurwidyantoro et al. 2021b), we believe that human values have a much broader scope that includes non-functional requirements. These findings also showed the need to increase awareness of values not only among practitioners but also among companies. A possible solution could be to provide a contextualised software engineering definition for each of these values (Mougouei et al. 2018; Perera et al. 2020), as presented in Nurwidyantoro et al. (2021b). Furthermore, as suggested in the findings, a tool such as a human values dashboard could be used to introduce values and to increase awareness of them among companies and their development teams. These findings were in line with a previous study that suggested that a dashboard has the benefit of increasing awareness (Treude and Storey 2009).

6.2 Artefacts as the Source for the Dashboard

The software practitioners in the exploration stage considered requirements documents and issue discussions to be the artefacts most suitable for mining human values. This paper focused on one of these artefacts, namely issue discussions, as the source for the dashboard. Future work could extend this research by investigating the presence of human values in requirements documents. If such a study finds values present in requirements documents, they could then be incorporated as another artefact source for the human values dashboard.

Based on the results of the exploration stage, a human values dashboard was developed. This dashboard labels the presence of human values in issue discussions. The feedback stage found that the participants agreed that the dashboard could be helpful in focussing their attention and prioritising issues. Nevertheless, it is still possible to enhance the developed human values dashboard by adding other artefacts. Therefore, future research could investigate the presence of human values in other artefacts to incorporate them into the dashboard. Some of the participants’ suggestions could be accomplished by integrating the dashboard with existing software repositories. More studies could be conducted to investigate to what extent this integration is possible.

6.3 A Human Values Dashboard for Users

The exploration stage results suggested that one of the main benefits of a human values dashboard is promoting awareness of values. This awareness could trigger discussions among stakeholders on which values must be considered in an application. Then, as suggested by the findings in the feedback stage, the development team could focus on the prioritised values and ensure these values are addressed during development. This study focused on software practitioners involved in software development and did not include end users of an application as one of the stakeholders. Application users are indirectly involved in application development by providing feedback. Giving them access to a human values dashboard would help users evaluate the values of an application (Kujala and Vȧȧnȧnen-Vainio-Mattila 2009), which in turn could guide them in choosing their preferred application (Wang et al. 2013; Harris et al. 2016; Fu et al. 2013). However, to understand the dashboard’s usefulness for users, a future study involving users needs to be carried out.

6.4 Limitations and Challenges of the Dashboard

The human values dashboard (Fig. 5) may have several limitations. First, the dashboard depends on the availability of artefacts (e.g. issue discussions). A project may not have all the artefacts mentioned in Section 3.3, depending on how it is managed.

Second, an automated approach was chosen to identify the presence of values because it reduces manual effort, but identification using automated approaches has accuracy limitations. Similar performance limitations appeared in prior studies on the detection of human values in text documents. These studies initially reported low performance (an F1 score of 0.45 (Ishita et al. 2010)), but a series of subsequent studies (Takayama et al. 2013; Takayama et al. 2014; Takayama et al. 2015) resulted in better performance (an F1 score of 0.74 (Takayama et al. 2016)). These works demonstrated that classifying human values is not a trivial task; the abstract nature of human values may contribute to this challenge. Regarding accuracy, although the tolerance level for inaccuracies varied between participants, the participants understood that inaccuracies could occur. The findings also showed that practitioners preferred false positives to false negatives. This suggests that the classification methods evaluated in Section 4.2.2 could be considered tolerable by practitioners. Furthermore, evaluation metrics that penalise false negatives more heavily than false positives, such as the F2 score (Jha and Mahmoud 2019), could be used to evaluate the performance of the automated human values detector. In this case, the F2 scores for the classifiers mentioned in Table 5 are 0.65 for the support vector machine, 0.61 for the random forest, 0.70 for the multi-layer perceptron, and 0.71 for the logistic regression classifier. Under this metric, the logistic regression classifier performed slightly better than the multi-layer perceptron, although the difference was not significant. In addition, in the feedback stage, the practitioners suggested displaying the confidence level for the detection of human values (Section 5.4.3). This information could help practitioners prioritise issues with a higher level of confidence. A small numerical illustration of the F2 measure is provided below.

Third, the automated approach used in the dashboard is currently only capable of detecting whether any human values are present, without specifying which values. This limitation was due to the limited number of cases in which specific values were discovered (see Nurwidyantoro et al. (2021b)). The unbalanced nature of the dataset was also found in previous studies of human values analysis in text documents (e.g. Ishita et al. 2010; Takayama et al. 2013; Takayama et al. 2014). Future work could expand the datasets for specific values by targeting specific types of applications. For example, the hedonism value could potentially be discovered in the issue discussions of computer games. Furthermore, the detection of human values could be improved in the future by using emerging approaches such as deep learning techniques, although the datasets may need to be expanded to cater to such approaches. Alternatively, other methods that do not require large datasets, such as keyword-based or rule-based approaches, could also be used. Another way to improve automated detection could be to ask practitioners, as users of the dashboard, to add or correct the values labels on the artefacts. These additions and corrections could then be incorporated as feedback to retrain the classification model and improve the identification over time.
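As a concrete illustration of the F2 measure referred to above (the precision and recall figures here are hypothetical and not taken from the study’s evaluation), the following sketch shows that, with beta = 2, recall is weighted more heavily than precision, so a detector that avoids false negatives at the cost of some false positives obtains a higher score:

```python
# Minimal illustration of the F-beta measure; the numbers are hypothetical
# and are not the precision/recall values obtained in this study.
from sklearn.metrics import fbeta_score

def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta from precision and recall: (1 + b^2) * P * R / (b^2 * P + R)."""
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# Two hypothetical detectors with the same F1 score (~0.69) but different
# error profiles; under F2 the recall-oriented one (fewer false negatives)
# is rewarded, matching the practitioners' stated preference.
print(f_beta(precision=0.60, recall=0.80))  # ~0.75
print(f_beta(precision=0.80, recall=0.60))  # ~0.63

# The same measure computed directly from labels with scikit-learn.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]  # precision 0.60, recall 0.75
print(fbeta_score(y_true, y_pred, beta=2))  # ~0.71
```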

Despite these limitations, the findings showed that the dashboard would be beneficial for software development. However, the practitioners highlighted two potential challenges in adopting the dashboard. The first challenge was an unclear or incorrect description of an issue provided by the reporter. One way to address this challenge is to provide issue reporting guidelines. Additionally, practitioners could ask for clarification in a post on that particular issue. The second challenge was related to the willingness of a company to adopt the dashboard. To address this challenge, efforts must be made to increase awareness of human values. Providing regulations and standards (e.g. the GDPR (Wolford 2021)) is one potential way to increase this awareness.

7 Threats to Validity

This section discusses the potential threats arising from the research method and the findings. This section uses the following validation criteria, which are considered suitable for qualitative research (Guba 1981; Stol et al. 2014; Cruzes and Dybå 2011):

Credibility: Possible threats to the credibility of this study could arise from the procedures used to collect data, develop interview questions, or select participants. Although the data was collected from only one source (interviews), the initial step of examining the literature before developing the dashboard prototype could increase the plausibility of our findings. To mitigate threats from the interview questions, open-ended questions were used, and follow-up questions tailored to each participant’s responses were asked. The use of issue discussions in the prototype for the exploration stage interviews may have introduced bias into the participants’ responses. This threat was mitigated by probing the participants to consider the possibility of using other artefacts as the dashboard’s source.

To reduce the possible threats resulting from the selection of participants, this study relied on the criteria for the recruitment of participants. The list of participants consisted of software practitioners with diverse roles, experiences, and work locations. Therefore, the participants had the right competencies to provide insights for the study. To mitigate the uneven number of participants in each role, the participants were also asked to share their opinions from other roles’ perspectives. This approach allowed for cross-validation of the findings across different roles.

Confirmability: A possible threat to the confirmability of this study might have been introduced by the definitions of human values, which have not been specifically developed for software engineering. To mitigate this, some examples were provided to the participants to describe what a value could mean in software engineering contexts (e.g. ‘A user who values privacy may not choose an application with a bad privacy reputation’).

Participants were also allowed to reflect on and translate values into contextualised software engineering definitions based on their experiences. The data analysis could have introduced another possible threat to confirmability, as it was carried out primarily by the first author. This threat was mitigated by having the other authors review and validate the codes/themes in several discussions.

Transferability: This study accepted that the findings cannot be generalised to all software organisations and practitioners. Different results might have been discovered if another group of participants had been included. However, this threat was reduced by involving a reasonable number of participants with various development roles and work locations. Furthermore, the data reached saturation during the parallel conduct of interviews and data analysis. This study also accepted that the relative importance of some specific values over others cannot be generalised because the entire list of values was not presented to the participants. This threat was mitigated by not concluding that some values are more important than others. We also accepted that the results of the classification experiments are limited to the dataset that we used. Different results might have been obtained if other datasets had been used for the experiments. This study also acknowledged that the use of only open source repositories means that the results cannot be generalised to other types of projects.

8 Conclusion and Future Work

This study envisioned a values-driven dashboard and investigated whether it would help software practitioners address values during software development. The study consisted of three stages: exploration, development, and feedback. The exploration stage was conducted by providing a prototype of the dashboard and interviewing 10 software practitioners. This stage found that the participants acknowledged that a human values dashboard would be beneficial to them. The dashboard could raise awareness of values among development teams and inform values-based decision-making in project management. Supporting the idea of using artefacts as the dashboard’s source, practitioners suggested requirements documents and issue discussions as the most suitable artefacts for values identification in the dashboard. This stage also elicited suggestions that served as a set of requirements for developing the envisioned dashboard.

In the development stage, a human values dashboard was developed as a proof of concept based on the requirements suggested in the previous stage. Feedback interviews were then conducted with 10 other practitioners to obtain their opinions on the dashboard. This study found that the human values dashboard could help focus attention and prioritise issues, in line with the findings from the exploration stage. Practitioners also identified several potential challenges, such as a possibly unclear or incorrect description of an issue by the reporter and a company’s lack of willingness due to the extra effort required to deploy the dashboard. Regarding the performance of the human values detector, the practitioners had different levels of tolerance, but all agreed that false positives were preferable to false negatives. Participants also made 16 suggestions to improve the dashboard.

The suggestions of practitioners and the results of this study could further improve the human values dashboard. Future studies could extend the dashboard to include other development artefacts suggested by interview participants. This direction has the potential to create a more comprehensive dashboard that covers the software development life cycle. Alternatively, because some suggestions can be addressed by integrating the dashboard with an existing software repository, a future study could explore to what extent this integration is possible. We also realised that the performance of automated human values detection is still quite low. Therefore, future studies could focus on improving the performance of the classifier, for example by evaluating other classification approaches such as deep learning, newer classification features such as word embeddings (Wang et al. 2021), or newer data imbalance handling approaches such as transfer learning (Al-Stouhi and Reddy 2016). More work is also necessary to develop classifiers for specific values, such as face or hedonism. In terms of dashboard evaluation, future studies could evaluate it in a company setting using additional approaches, such as an observational study or a controlled experiment. These approaches could complement interviews to obtain a comprehensive evaluation of the dashboard in real-world settings.