To address RQ1 (Does the choice of the sentiment analysis tool introduce a threat to conclusion validity in a software engineering study?), we replicate the original studies by Pletea et al. (2014) and Calefato et al. (2018b) using the four SE-specific sentiment analysis tools reviewed in Section 2.2. The mapping of the tool output to consistent polarity labels is described in Section 4.1. The results of the two replications are reported in Sections 4.2 and 4.3, respectively.
Mapping the Output of Tools to Polarity Labels
In our replications, we use all the SE-specific tools as ‘off-the-shelf’ resources. Since the tools produce heterogeneous outputs, we need to map them to polarity labels that enforce the same operationalization of sentiment adopted in the original studies.
In their study, Pletea et al. model sentiment using a single variable with three polarity classes, namely negative, positive, and neutral (see Section 3.1). The output of Senti4SD is consistent with this schema, so we do not need to apply any changes. Similarly, the scores produced by DEVA can be directly mapped to the corresponding negative (score < 1), positive (score > 1), and neutral (score = 1) labels. Because the output of SentiStrength-SE has the same structure as the output of SentiStrength, we adopt the mapping implemented for SentiStrength in the earlier replication by Jongeling et al. (2017): a text is considered positive when p + n > 0, negative when p + n < 0, and neutral if p = -n and p < 4. Texts with scores p = -n and p ≥ 4 are considered to have an undetermined sentiment and are therefore removed from the datasets. Finally, we translate the categorical scores of SentiCR into negative (score = -1) and non-negative (score = 0).
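To make the mapping concrete, the rules above can be expressed as follows (a minimal sketch in Python; all function names are ours and are not part of any tool's API):

```python
from typing import Optional

def map_deva(score: float) -> str:
    # DEVA rule described above: <1 negative, >1 positive, =1 neutral.
    if score < 1:
        return "negative"
    if score > 1:
        return "positive"
    return "neutral"

def map_sentistrength_se(p: int, n: int) -> Optional[str]:
    # Jongeling et al. (2017) mapping for SentiStrength-style scores,
    # with p in [1, 5] and n in [-5, -1]. None marks undetermined texts,
    # which are removed from the dataset.
    if p + n > 0:
        return "positive"
    if p + n < 0:
        return "negative"
    return "neutral" if p < 4 else None  # here p == -n

def map_senticr(score: int) -> str:
    # SentiCR is binary: -1 negative, 0 non-negative.
    return "negative" if score == -1 else "non-negative"
```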
Calefato et al. use two Boolean values (Positive Sentiment and Negative Sentiment) to indicate the presence of positive and negative emotions, respectively. Neutral is modeled by assigning ‘false’ to both values; to represent mixed cases, both values are set to ‘true’ (see Section 3.2). For Senti4SD and DEVA, we implemented a direct mapping by setting the ‘true’/‘false’ values based on the output of the tool. For SentiStrength-SE, positive and negative sentiment are set to ‘true’ if the positive score > 1 and the negative score < -1, respectively, in line with the original study. Finally, the categorical scores of SentiCR are used to indicate the presence of negative sentiment. Conversely, positive sentiment cannot be modeled in this empirical setting using SentiCR, as the tool does not distinguish between positive and neutral.
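Under the same assumptions as the sketch above, the Boolean encoding can be written as (again, names are ours):

```python
from typing import Optional, Tuple

def booleans_from_sentistrength_se(p: int, n: int) -> Tuple[bool, bool]:
    # Returns (positive_sentiment, negative_sentiment), as in the original
    # study: positive if p > 1, negative if n < -1. (False, False) encodes
    # neutral; (True, True) encodes a mixed case.
    return p > 1, n < -1

def booleans_from_senticr(score: int) -> Tuple[Optional[bool], bool]:
    # SentiCR only detects negative sentiment; the positive flag cannot
    # be modeled and is left undefined (None).
    return None, score == -1
```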
Replication of Pletea et al.
As in the original study, we compute the proportions of negative, neutral, and positive sentiment in the security-related comments and discussions for both commits and pull requests. The resulting values are compared with the proportions of negative, neutral, and positive sentiment in the comments and discussions revolving around other topics, i.e., the rest of the texts in the dataset. We report the results of the replication and the comparison with the original work by Pletea et al. in Tables 3 and 4 for commits and pull requests, respectively. For the sake of completeness, we also report the results of the earlier replication of the work of Pletea et al. performed by Jongeling et al. (2017), which used both NLTK and SentiStrength.
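The comparison boils down to contrasting label distributions across the two groups of texts; a minimal sketch, assuming a dataframe with a Boolean security flag and the per-text polarity label issued by a given tool (column names are hypothetical):

```python
import pandas as pd

# One row per comment (or discussion), flagged as security-related or not,
# with the polarity label issued by the tool under analysis.
df = pd.DataFrame({
    "is_security": [True, True, False, False, False],
    "label": ["negative", "neutral", "positive", "neutral", "negative"],
})

# Per-group proportions of negative/neutral/positive labels.
proportions = (
    df.groupby("is_security")["label"]
      .value_counts(normalize=True)
      .unstack(fill_value=0)
)
print(proportions)
```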
From Tables 3 and 4, we notice that there is always a larger proportion of negative polarity for security topics, regardless of the tool and the type of text. These results confirm the findings of the original study, as well as the findings of the replication by Jongeling and colleagues. At the same time, we observe that the different tools issue different proportions of positive, negative, and neutral labels. Despite such differences, the original conclusion of Pletea et al. still holds: whether we consider comments or discussions, commits or pull requests, the proportion of negative sentiment among security-related texts is higher than among non-security-related texts. As such, we can claim that, for this study, different SE-specific tools do not lead to contradictory conclusions. Furthermore, the choice of the sentiment analysis tool does not affect the conclusion validity of the previously published results.
However, beyond the confirmation of the high-level findings of the original study, the differences in the proportions of positive, negative, and neutral labels issued by the various tools indicate the need for further reflection on the possible threats to conclusion validity that might be due to the ‘off-the-shelf’ use of sentiment analysis tools. In both Tables 3 and 4, we observe such differences. First of all, we observe that the SE-specific sentiment analysis tools all tend to classify pull request and commit comments as predominantly neutral. This is in line with previous evidence (Calefato et al. 2018a) showing that SE-specific tools are able to overcome the negative bias of general-purpose ones, which is due to the inability of the latter to deal with technical jargon and the domain-specific semantics of words, such as ‘kill’ or ‘save,’ which are considered non-neutral outside the technical domain. As such, we do not observe the prevalence of negative comments reported in the original study by Pletea and colleagues, which used the general-purpose tool NLTK. Even if the SE-specific tools agree on classifying the comments as mainly neutral, we still observe differences in the percentages. This pertains to the problem of assessing the agreement between the SE-specific tools (RQ3), which we address in Section 5.
As for pull request and commit discussions, overall we observe a lower proportion of neutral labels than for individual comments, even when the same tool is adopted. For example (see the lower half of Table 3), SentiStrength-SE classifies 68.84% of security commit comments as neutral, 15.46% as negative, and 15.70% as positive. The situation changes when discussions are analyzed as a whole, i.e., as the group of comments belonging to the same thread originating from the commit (see the upper part of Table 3). In fact, SentiStrength-SE classifies 43.96% of discussions as neutral, with correspondingly higher percentages of negative (24.73%) and positive (31.30%) labels. The same considerations hold for the other tools and for the pull request analysis (Table 4). This evidence suggests that different findings can be derived depending also on the granularity of the unit of analysis (comments vs. discussions, in this case).
Replication of Calefato et al.
In our replication, we use the dataset of the original study. As in the original study, we apply a logistic regression to estimate the correlation of each factor with the probability of success of a Stack Overflow question. In particular, we use the output of the four SE-specific sentiment analysis tools to recompute the metrics associated with the Affect factor of their framework, i.e., the positive and negative sentiment scores. Conversely, we do not recompute the metrics associated with the remaining factors and use those extracted in the original study, as they are distributed with the original dataset. Since SentiCR only distinguishes between negative and non-negative sentiment, in the replication using this tool we could only include the Negative Sentiment among the predictors. In line with the original study, we treat the success of a question (i.e., the presence of an accepted answer) as the dependent variable and the metrics that operationalize each factor as independent variables.
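A minimal sketch of this setup, assuming the replication dataset is loaded into a dataframe (the column and file names below are hypothetical; the actual dataset uses its own naming scheme):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("replication_dataset.csv")  # hypothetical file name

# Success (accepted answer present) as the dependent variable; the
# recomputed sentiment scores plus the original metrics as predictors.
model = smf.logit(
    "has_accepted_answer ~ positive_sentiment + negative_sentiment"
    " + reputation + body_length + has_code_snippet",
    data=df,
).fit()
print(model.summary())  # coefficient estimates and p-values
```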
In Table 6, we report the results of the original study as well as the outcome of the replications with the four SE-specific sentiment analysis tools. For each predictor, we report the coefficient estimate and the odds ratio (OR), and indicate statistical significance of the correlation (p-value < .05) in bold. The sign of the coefficient estimate indicates the positive/negative association of the predictor with the success of a question. The odds ratio, computed as the exponential of the coefficient estimate, weighs the magnitude of this impact, with values close to 1 indicating a small impact. A value lower than 1 corresponds to a negative coefficient, and vice versa. Technically speaking, an OR = x indicates that the odds of the positive outcome are x times greater than the odds of the negative outcome. OR is an asymmetric metric: for positive associations it varies from 1.0 to infinity, whereas for negative associations it is bounded below by 0. To further investigate the outcome of the classification performed with the different tools, in Table 5 we report the label distributions for the predictions issued by the SE-specific tools and compare them with those obtained with SentiStrength in the original study.
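For a concrete illustration of the relation between coefficients and ORs (the numbers are hypothetical):

```python
import math

beta = -0.25                     # hypothetical coefficient estimate
odds_ratio = math.exp(beta)      # OR = e^beta
print(f"OR = {odds_ratio:.2f}")  # OR = 0.78: negative coefficient, OR < 1
```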
As for RQ1, we confirm the original findings related to Reputation, Time, and Presentation Quality, for which we observe comparable coefficients and ORs in all settings. In more detail, we confirm that Reputation is the most influential factor for the success of questions, with Trusted users having the highest probability of getting an accepted answer. As for Time, evening and night are confirmed as the most effective times of day. As far as Presentation Quality is concerned, the presence of code snippets remains the strongest predictor in all settings. The results from the replications also confirm that Body Length is negatively correlated with the success of questions. Conversely, we could not confirm the positive correlation between a low uppercase ratio and the success of questions. As for the impact of the Affect factor, we confirm the negative impact of both positive and negative sentiment on the success of questions when Senti4SD is used. Conversely, for SentiStrength-SE, DEVA, and SentiCR, we do not find empirical support for this claim. As already observed in the replication of the study by Pletea and colleagues (see Tables 3 and 4 in Section 4.2), the proportions of polarity labels vary depending on the tool. As such, we conclude that the choice of the sentiment analysis tool leads to partially contradictory results for this study. The original findings about the impact of factors on the success of questions are mostly confirmed, with a couple of exceptions. Specifically, the sentiment-related findings of the original work are fully confirmed only when Senti4SD is used for the sentiment analysis of questions, while the impact of writing in uppercase is not confirmed in any setting.