1 Introduction

Large-scale online service systems have become indispensable to people’s work and everyday life. They are also growing increasingly complex to support their users’ ever-growing needs for new and more powerful functionalities. The scale and complexity of such services, as well as the diversity of the environments in which the services are invoked, have made it more challenging than ever for developers to ensure the services always behave as expected. Despite the tremendous amount of time and effort developers invest in testing and debugging such online service systems, it is almost inevitable that some bugs escape the developers’ attention, get released into the field, and negatively impact users’ experience with the services. It is, therefore, extremely important for service providers to discover issues in their systems in a timely manner based on information gathered from users.

In view of that, Zheng et al. [45] recently proposed the iFeedback approach to detecting issues based on user feedback. While the approach has been deployed to help detect issues in large-scale online service systems and has successfully detected severe issues, the overall precision of its results is relatively low, 76.2% to be exact [45]. We conjecture there are three reasons for that. First, iFeedback extracts word combinations from feedback texts as indicators of issues. Since word combinations only capture the lexical, rather than semantic, characteristics of feedback texts, they, as issue indicators, tend to be overly sensitive to the wording of user feedback. Second, iFeedback detects anomalies at the level of time intervals based on all the user feedback gathered during those intervals, which is too coarse-grained. Since a wide range of different types of user feedback, concerning issues or not, may get reported during each time interval, iFeedback’s judgment is more likely to be influenced or even misled by user feedback that does not report any issues. Third, iFeedback applies an unsupervised algorithm to cluster the feedback during anomalous time intervals based on the word combinations and their contexts. While unsupervised clustering algorithms are less expensive to apply, they tend to produce less precise results than supervised algorithms in general [36].

To address these limitations of iFeedback and improve the quality of issue detection results, we propose in this paper a novel approach, named SkyNet, to automatically detecting issues in online service systems based on multi-channel user input, including both user feedback and messages posted on social media platforms. More concretely, SkyNet first employs a cascading classifier to label the user feedback texts based on an input hierarchical label system for different types of user experiences. Then, it applies time-series data analysis to predict, based on historical data, a threshold for the normal frequency of user feedback reporting each known type of negative user experience; it reports an issue when more feedback of the same type than allowed by the threshold is gathered from the users. Meanwhile, for user feedback reporting negative experiences of previously unknown types, SkyNet reports an issue when an anomalous amount of such user feedback concerns similar negative user experiences. The semantic embedding of feedback texts and the customized issue detection process adopted by SkyNet enable it to detect more real issues in service systems and to prune out most false positives. In view that social media platforms have become important and popular venues for users to share their experiences with various services and products, SkyNet also monitors and analyzes messages posted on social media platforms to detect issues before they generate large amounts of user feedback or attract considerable unwanted public attention.

We have implemented the SkyNet approach into a tool with the same name. To empirically evaluate SkyNet’s effectiveness, we applied it to detect issues for three real-world, large-scale online service systems based on historical data gathered over a ten-month period. SkyNet reported in total 2790 issues, 93.0% of which were confirmed by operators and developers as reflecting real problems that deserve their close attention. Besides, SkyNet was able to detect 58 of the 62 severe issues that occurred during that period of time. Such results suggest SkyNet is highly effective and accurate in issue detection.

Contributions. This paper makes the following contributions:

  • We propose the SkyNet technique that analyzes both user feedback gathered from specific channels and public posts collected from social media platforms to accurately detect issues in large-scale online service systems.

  • We develop SkyNet into a tool with the same name.

  • We empirically evaluate SkyNet by applying it to detect issues for three real-world service systems based on historical data. The results produced suggest that SkyNet is highly effective and accurate.

2 Related Work

Our work is closely related to existing work in the following areas.

Anomaly detection based on backend monitoring. In view that many issues in online service systems affect performance attributes like “disk queue length” and “network retransmission rate” of the backend systems, people often monitor the corresponding key performance indicators (KPIs) of the systems and rely on the values to detect anomalies in those services [15, 18, 21, 22, 23, 25, 26, 39, 44]. For instance, Laptev et al. [21] proposed the EGADS system that combines a collection of anomaly detection and forecasting models to detect anomalies in time-series KPI data. Liu et al. [25] proposed the Opprentice system that trains a random forest with labeled KPI features to select appropriate parameters and thresholds for existing detectors. Xu et al. [44] proposed an unsupervised anomaly detection algorithm, named Donut, to effectively detect anomalies in seasonal KPIs. Given that online service systems automatically generate issue reports and alerts when the monitored indicators exhibit anomalous values, techniques have also been developed to mine attribute collections of issue reports [15, 24] to characterize and detect incidents [22].

Issue detection based on user feedback. Many issues in those systems, e.g., user interface defects and silent back-end issues, are however not reflected by pre-defined KPIs [45]. In view of that, and the fact that user opinions coming in different forms (e.g., user feedback, tweets, and forum posts) contain valuable information to support software development and maintenance [12, 13, 29, 30, 41, 42], Zheng et al. [45] proposed the iFeedback approach to detecting issues based on user feedback on-the-fly. iFeedback first extracts word combination-based indicators to represent an issue and collects each indicator’s historical occurrence trend (HOT); the long-term and short-term windows of the HOTs are then fed to a binary classifier to identify anomalous time intervals; in the end, user feedback from time intervals containing issues is clustered as reporting different issues. SkyNet improves on iFeedback from three perspectives. First, iFeedback extracts word combinations from feedback texts as indicators of issues, which captures only the lexical characteristics of feedback texts, while SkyNet employs the ALBERT-Tiny model to encode user feedback so that the semantics of user feedback can be taken into account during the issue detection process. Second, iFeedback detects anomalies at the level of time intervals based on all the gathered user feedback, which is often too coarse-grained and increases the chance of coincidental non-issue-reporting feedback influencing and misleading the issue detection process. In contrast, SkyNet employs a cascading classification algorithm to label user feedback based on a hierarchical label system and only takes feedback that reports negative user experiences into account in the remaining issue detection process. Third, SkyNet also monitors and analyzes messages posted on social media platforms to detect issues in a timely manner, which complements user-feedback-based issue detection.

Learning from user opinions in other forms. User opinions in other forms have also been utilized to support various types of activities in software development. Gao et al. [14] proposed the IDEA framework that detects issues from review texts of apps. Stanik et al. [38] proposed an approach to identify aspects of software systems to improve based on user comments received on Twitter. While those identified aspects may indeed need improvement, they are not necessarily issues in the corresponding software systems. Guzman et al. [16] proposed the ALERTme approach that automatically classifies, groups, and ranks tweets to facilitate the analysis of application-related tweets. Williams and Mahmoud [43] conducted a study on leveraging Twitter as a main source of software user requirements.

Fig. 1. An overview of the issue detection process with SkyNet.

Johann et al. [19] proposed the SAFE approach that extracts keywords from app feature descriptions written by developers and app reviews on app stores to better characterize the apps. Compared with these works, SkyNet focuses on detecting issues in online service systems based on user feedback and social media posts.

3 The SkyNet Approach

Figure 1 depicts an overview of the issue detection process with SkyNet. SkyNet leverages deep learning algorithms to detect issues based on multi-channel data, and it combines two loosely coupled processes: the main process is designed for detecting issues based on user feedback texts gathered through dedicated channels that are embedded in the service systems, while the auxiliary process complements the main process and aims to detect issues using posts collected from social media platforms. Each issue detected by SkyNet is associated with a collection of user feedback, a social media post (when the issue is the main concern of such a post), and a list of ten keywords extracted from the user feedback and post using the TF-IDF method [6]. While the keywords help provide a rough idea about an issue, developers must examine the associated user input to determine whether the reported issues reflect real problems in the service systems. In the rest of this section, we explain in detail the steps in SkyNet’s main and auxiliary issue detection processes.
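For illustration, the following minimal sketch shows how such a top-ten keyword list could be produced with TF-IDF. It uses scikit-learn as an assumed stand-in, since the paper only cites the TF-IDF method [6] without naming an implementation; Chinese feedback texts would first need to be segmented (e.g., with Jieba, see Section 4.2).

```python
# Hedged sketch: rank the terms of an issue's associated texts by their
# aggregate TF-IDF scores and keep the top k (k=10 in SkyNet).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(texts, k=10):
    vectorizer = TfidfVectorizer()
    scores = np.asarray(vectorizer.fit_transform(texts).sum(axis=0)).ravel()
    terms = vectorizer.get_feature_names_out()
    return [terms[i] for i in np.argsort(scores)[::-1][:k]]

print(top_keywords(["video export fails with an error",
                    "cannot export the video after editing"]))
```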

Note that, as in other model-based approaches, we periodically review the input user feedback and social media posts as well as the detected issues, manually rectify the incorrect detection results if any, and use the new data to fine-tune the models that SkyNet utilizes so as to keep the models fit for the updated business situation and to prevent model degradation. Also note that, although sometimes users include images in their feedback and social media posts to help explain the problems they have encountered, SkyNet does not utilize such information in its current implementation. We leave the development of new techniques that exploit the extra image information to facilitate issue detection for future work.

3.1 Hierarchical Classification of User Feedback

The first step in issue detection with SkyNet is to decide the type of user experience that each piece of the gathered user feedback reports. SkyNet makes such decisions on the basis of a hierarchical label system, where the labels characterize with different levels of detail the types of (negative) user experiences that users report in their feedback.

SkyNet differentiates three broad categories of user feedback in issue detection, namely feedback reporting negative user experiences of a known type, feedback reporting negative user experiences of unknown types, and feedback not reporting negative user experiences. User feedback from the first two categories is collectively called negative experience reporting feedback. Note that not all negative user experiences are caused by issues in service systems. For example, although a user’s access to an online service will be blocked if her device is offline due to a hardware failure, the experience does not indicate anything problematic in the online service system.

Feedback Encoding Since SkyNet is designed to detect issues in large-scale online service systems and may need to process large volumes of user feedback under tight time constraints, we use ALBERT-Tiny [20] to encode the user feedback. BERT [11] is a pre-trained, state-of-the-art language representation neural network model with strong semantic comprehension capability.

Fig. 2. A sample hierarchical label system (in blue) and some examples of the associated user feedback.

ALBERT [20] is a lite BERT architecture: by sharing parameters across layers and reducing the embedding dimensions of words, it lowers the memory consumption and increases the training speed of BERT without significantly sacrificing BERT’s semantic comprehension ability. ALBERT-Tiny [20] is the smallest version of ALBERT and is about 10 times faster than BERT at inference.
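The encoding step can be sketched as follows with the HuggingFace transformers library; the checkpoint name is an assumption (the exact ALBERT-Tiny weights used by SkyNet are not public), and publicly released Chinese ALBERT-Tiny models typically ship a BERT-style vocabulary, hence the BERT tokenizer.

```python
# Hedged sketch of feedback encoding with a compact ALBERT model; the
# checkpoint "voidful/albert_chinese_tiny" is an illustrative stand-in.
import torch
from transformers import AlbertModel, BertTokenizerFast

CKPT = "voidful/albert_chinese_tiny"
tokenizer = BertTokenizerFast.from_pretrained(CKPT)
encoder = AlbertModel.from_pretrained(CKPT)
encoder.eval()

def encode(texts):
    """Return one fixed-size vector per feedback text ([CLS] pooling)."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0, :]
```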

Hierarchical Label System Correctly deciding which type of user experience each piece of user feedback reports is crucial, since incorrect decisions made at this step may mislead the downstream steps and cause the whole issue detection task to fail. SkyNet employs an existing hierarchical label system to facilitate making those decisions. In the system, each label corresponds to a particular type of user experience that users may have with the target online service system.

Designing a label system to properly characterize user experiences is a challenging task. SkyNet adopts a hierarchical, rather than flat, label system mainly because it is extremely difficult, if not impractical, to decide a priori on the right granularity level for the labels in a flat system so as to strike a good balance between the accuracy and the value of the classification results based on that label system. On the one hand, a coarse-grained label system often makes it easier for a classifier to correctly label the input data, but the classification results may not be very useful since each label encodes little extra information. On the other hand, a fine-grained label system typically makes it harder for a classifier to correctly label the input data, but a correct label in this case can be highly valuable since it encodes abundant extra information. In the context of user feedback classification for issue detection, coarse-grained labels provide relatively vague information about the user experience, which may not be sufficient to help developers effectively confirm or understand the underlying issues.

Figure 2 displays part of the hierarchical label system that SkyNet uses for classifying the user feedback on an online video editing system. In the hierarchical label system, labels at the top level classify all the user feedback into broad categories concerning aspects like “Functionality” and “User Account” of the online system, labels at the intermediate level partition the broad categories into smaller, finer-grained ones, while labels at the bottom level correspond to specific types of experiences that users may have when using the online system. Two top-level labels in the hierarchical label system, namely “Unknown” and “Non-negative”, are special in the sense that they do not have subordinate labels because they are for user feedback texts that report negative user experiences of previously unknown types and that do not report negative user experiences, respectively. Since some user experiences of previously unknown types may still reveal important issues of the systems, SkyNet conducts extra analysis on the related feedback to determine if they report any issues. Section 3.2 gives more details about the analysis. User feedback classified as “Non-negative” will not be further processed by SkyNet.

Figure 2 also lists some example feedback snippets from users of the online video editing system and associates the snippets with their corresponding labels. Two things from the examples are worth noting. First, users often use different words to describe the same issue. For example, snippets 1-1 and 1-2 use the words “save” and “export”, respectively, to refer to the action of exporting a video. Second, different words with similar meanings may be used to describe user experiences of distinct types. For example, the word “save” was used in both snippets 2-2 and 3-2, which report different types of negative user experiences. Due to such flexibility in natural language expressions, using word combinations like (“save” and “video”) to characterize and group user feedback, as was done in previous work [45], may often produce results of low precision. In view of that, SkyNet extracts the semantics of the experiences reported in user feedback via deep learning and classifies user feedback based on their semantics.

Fig. 3. The process of hierarchical user feedback classification in SkyNet.

We do not consider the requirement for an input hierarchy of user feedback labels as a major restriction to SkyNet’s applicability for two reasons. First, although not every service system readily has a dedicated hierarchy of user feedback labels, hierarchies from similar systems could be used instead to bootstrap the application of SkyNet on a new service system since, according to our experience, systems with similar functionalities often share hierarchies of user feedback labels. Second, a collection of appropriate issue labels is essential for the effective management of issues in large online service systems. Developers need to devise the labels with or without tool support, and the labels can be organized into a hierarchy to drive SkyNet. While the construction of such a hierarchical label system may require some manual effort, such investment is worthwhile in the long term since a high-quality label system can greatly improve the result accuracy of feedback classification and issue detection.

Cascading Classification SkyNet employs cascading classification to associate user feedback with the labels from the hierarchical label system. Cascading is a particular case of ensemble learning based on the concatenation of several sub-classifiers [2]. In SkyNet’s cascading classification for hierarchical labels, each sub-classifier targets only the labels at a particular level, and the output of a high-level sub-classifier is used as additional input to drive lower-level sub-classifiers in the cascade. In such a setting, it is relatively easy for high-level sub-classifiers to produce proper classification results, since the number of labels they need to consider is small and the differences between instances from different classes are large; it is also relatively easy for low-level sub-classifiers to achieve precise classification results, since they only need to focus on the labels subordinate to those output by high-level sub-classifiers [35].

Figure 3 shows the cascade classifier SkyNet employs to categorize the user feedback on the online video editing system described in Section 3.1. The classifier contains three sub-classifiers, one for each level of the label hierarchy. Each sub-classifier is a two-layer fully connected network, and it takes the outputs of all its parent-level sub-classifiers, if any, as additional input for the classification at its own level. For instance, the top-level sub-classifier classifies user feedback into the highest-level labels like “Functionality” and “User Account” based on the input text embedding, while the bottom-level sub-classifier takes both the text embedding and the outputs of the two higher-level sub-classifiers as input to conduct the most fine-grained classification. The connections between sub-classifiers help preserve the cascade relationship between multi-level labels and improve classification accuracy.

Particularly, each sub-classifier is a multi-class classifier with a loss function defined as \(L = \frac{1}{N}\sum _{i=1}^{N}\sum _{c=1}^{C}loss(y_{ic},\hat{y}_{ic})\), where N is the number of samples, C is the total number of classes in the classification, \(\hat{y}_{ic}\) is the predicted probability of the \(i\)th training example belonging to the \(c\)th class, \(y_{ic}\) is a binary indicator representing the ground-truth label, and \(loss(y_{ic},\hat{y}_{ic})\) is the cross-entropy loss between the classification result and the ground truth. Cross-entropy loss [10] is a common loss function for classification tasks, and its value increases as the predicted probability diverges from the actual label.

The loss function for the overall cascading classification model is defined as \(L_{overall}=\alpha L_1 + \beta L_2+ \gamma L_3\). That is, the overall loss \( L_{overall} \) of the model is the weighted sum of the loss \( L_n \) at the \( n \)-th cascading level (\(1\le n\le 3\)), with \( \alpha \), \( \beta \), and \( \gamma \) being the weights of the corresponding levels. We assign the decreasing values 0.8, 0.6, and 0.4 to \( \alpha \), \( \beta \), and \( \gamma \), respectively, based on the intuition that an incorrect label at any level will lead to incorrect labels at all the levels beneath it. With the cascading connections, the weights of the first-level sub-classifier are adjusted with respect to the loss of all classifiers at the three levels during back-propagation, and the weights of the second-level sub-classifier are adjusted with respect to the loss of the sub-classifiers at the second and third levels.
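To make the construction concrete, the following PyTorch sketch instantiates the cascade with its weighted loss; the embedding size, label counts, and hidden width are illustrative assumptions rather than the paper’s actual configuration.

```python
# Hedged sketch of the three-level cascading classifier and its loss.
import torch
import torch.nn as nn

class CascadeClassifier(nn.Module):
    def __init__(self, emb_dim=312, n_top=8, n_mid=30, n_bot=100, hidden=128):
        super().__init__()
        # Each sub-classifier is a two-layer fully connected network; lower
        # levels also consume the logits produced by all higher levels.
        self.top = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_top))
        self.mid = nn.Sequential(nn.Linear(emb_dim + n_top, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_mid))
        self.bot = nn.Sequential(nn.Linear(emb_dim + n_top + n_mid, hidden),
                                 nn.ReLU(), nn.Linear(hidden, n_bot))

    def forward(self, x):                      # x: ALBERT-Tiny text embedding
        t = self.top(x)
        m = self.mid(torch.cat([x, t], dim=-1))
        b = self.bot(torch.cat([x, t, m], dim=-1))
        return t, m, b

def overall_loss(logits, labels, weights=(0.8, 0.6, 0.4)):
    """L_overall = alpha*L1 + beta*L2 + gamma*L3, cross-entropy per level."""
    ce = nn.CrossEntropyLoss()
    return sum(w * ce(l, y) for w, l, y in zip(weights, logits, labels))
```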

3.2 Issue Detection Based on User Feedback

While it is useful to classify feedback texts based on the types of user experiences they report, it is neither necessary nor practical to manually examine all the user feedback that reports negative experiences. On the one hand, not all user feedback reporting negative experiences reflects issues in online service systems that demand manual inspection by developers. On the other hand, user feedback reporting negative experiences with popular service systems often arrives in overwhelming numbers, and it can therefore be prohibitively expensive to manually handle all of it.

To help developers better allocate their time and effort for issue handling, SkyNet only reports issues for negative experiences shared by a large number of users. Particularly, SkyNet employs a time series forecasting technique to dynamically predict a threshold for the frequency of each known type of negative user experience. An alert indicating the discovery of an issue that needs to be handled is raised if negative user experiences of the related type get reported more often than allowed by the threshold.

Issues of Known Types When SkyNet classifies a piece of user feedback text to a known type of negative user experience, we say the feedback is an instance of that user experience type. By counting the instances of a known negative user experience type within each time unit and ordering the counts chronologically, we form time-series data about the frequency of that type of user experience. Based on the hypothesis that a rising issue of a known type will cause outliers in the time-series data of its corresponding label, SkyNet determines that there is an issue when the number of user feedback reporting a particular known type of negative experience in a time period exceeds a threshold.

Since the normal frequency of each type of negative user experience is closely related to several factors that vary across experience types and over time, adopting a fixed threshold for all negative user experience types would be too rigid. First, different types of negative experiences naturally occur at different frequencies. For example, in our experience, it is normal for a few hundred users of a large-scale service system to report each day that they cannot receive the verification code, and the reasons often include things like typos in their phone numbers, unstable connections of their phones, and the low response speed of their network operators, none of which is indicative of issues in our systems. On the contrary, the daily number of users reporting problems with uploading files is typically much smaller, and when that number increases significantly, it is highly likely that an issue in our system is the cause. Second, the normal frequency of any type of negative user experience fluctuates at different times in a day, a week, or a month. For instance, most negative experiences occur more often during the day, when most users are active, than at midnight, when most users have fallen asleep. Since predicting a dynamic threshold with historical data is a widely accepted way to detect issues [21, 33], SkyNet naturally formulates the issue detection problem as a time series forecasting problem that predicts the normal frequency range for each label based on historical data.

More concretely, we apply a sliding window strategy for the segmentation of each label’s historical data, and we adopt a classical bidirectional long short-term memory (BiLSTM) [17] network to learn the historical trends of individual labels. The window size is set to 50 time units in the current implementation, and the window slides with a stride length of one time unit. Note that all outliers (data points outside the interquartile range [4]) in the time series are removed, and Min-Max normalization [31, 32] is applied for feature scaling before training.

BiLSTM is a recurrent neural network that takes historical time series data as input to make a prediction based on the trend. To predict a value \( y'_{t} \) for time \( t \), the model takes a series of historical data \( [x_{t-50},...,x_{t-1}] \) as input, where \( x_i \) represents the feature vector for time unit \( i \); that is, the model predicts the value for the time unit immediately after the input window. During training, the model loss is the mean squared error between the actual value \( y_{t} \) and the predicted value \( y'_{t} \) for time \( t \).
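A minimal PyTorch sketch of this forecaster is given below; the hidden size is an illustrative assumption, while the window size of 50 and the stride of one follow the description above.

```python
# Hedged sketch of the BiLSTM frequency forecaster.
import torch
import torch.nn as nn

class FreqForecaster(nn.Module):
    def __init__(self, in_dim=1, hidden=64):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                            # x: (batch, 50, in_dim)
        out, _ = self.bilstm(x)
        return self.head(out[:, -1, :]).squeeze(-1)  # predicted y'_t

def make_windows(series, size=50):
    """Slide a size-50 window with stride 1; each window predicts the next value."""
    xs = [series[i:i + size] for i in range(len(series) - size)]
    ys = [series[i + size] for i in range(len(series) - size)]
    return xs, ys

# Training minimizes nn.MSELoss() between y_t and y'_t, as described above.
```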

Based on the predicted frequency \(y'_t\) for a label, SkyNet calculates the threshold \( th_{t} \) for the label as \(y'_{t}*dr\), where dr is a dynamic ratio calculated as \(\log (std([x_{t-50},...,x_{t-1}])/mean([x_{t-50},...,x_{t-1}]))\). The rationale behind the calculation of the threshold is that the magnitude of acceptable frequency fluctuations should be proportional to the absolute value of the frequency prediction for the label. For example, when the occurrence count of a label increases by ten, the fluctuation is relatively smaller if the label’s regular frequency \( y_{t} \) is ten thousand rather than a hundred. We apply a log transformation when calculating \( dr \) to keep it relatively small.
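Transcribed directly from the formulas above, the threshold computation looks as follows; the alert condition, comparing the observed frequency with \( th_t \), is our reading of the surrounding text.

```python
# Hedged sketch of the dynamic threshold th_t = y'_t * dr.
import numpy as np

def dynamic_threshold(y_pred, history):
    """history holds the window [x_{t-50}, ..., x_{t-1}] of past frequencies."""
    dr = np.log(np.std(history) / np.mean(history))   # dynamic ratio
    return y_pred * dr

def is_anomalous(observed, y_pred, history):
    return observed > dynamic_threshold(y_pred, history)
```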

Fig. 4. Expansion of frequency data with feedback type ID, which enables the prediction of multiple thresholds with a unified BiLSTM model.

Predicting Multiple Thresholds with a Unified BiLSTM Model Usually, predicting the normal frequency of a particular type of user feedback requires training a specialized model with the historical frequency data associated with that type. Training one specialized model for each prediction task, however, would incur high application and maintenance costs for SkyNet. To reduce those costs, we expand the values in the time series data for each type of user feedback with the identity of that type and use the expanded time series data of all feedback types to train a unified BiLSTM model. The unified model is then able to predict the normal frequencies of different types of user feedback.

Particularly, we expand the feedback frequency data in three steps, as depicted in Figure 4. We first apply one-hot encoding to produce a unique value as the identity of each type of user feedback. Since one-hot type IDs generated in this way are typically sparse, we then transform them into dense vectors via a fully-connected network \(g(\cdot )\). Afterward, the frequency data and the dense vector are combined to form the expanded frequency data. That is, given the one-hot ID \(\delta \) of a user feedback type and the vectorized frequency \(\overline{x_t}\) of this user feedback type at time t, the expanded frequency is constructed as \(\overline{x_t}\oplus g(\delta )\), where \(\oplus \) indicates vector concatenation. Here, the transformation of one-hot type IDs into dense vectors is necessary because, without it, all but one dimension of the input data would be devoted to the feedback type ID, and it would be extremely hard for the BiLSTM model to learn meaningful knowledge about the feedback frequency.
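The three-step expansion can be sketched as follows; the number of feedback types and the dense ID dimension are illustrative assumptions.

```python
# Hedged sketch of the expansion x_t ⊕ g(δ) from Figure 4.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_types, id_dim = 120, 8                 # illustrative assumptions
g = nn.Linear(n_types, id_dim)           # the fully connected network g(.)

def expand(freq_vec, type_index):
    """Concatenate the vectorized frequency with the dense type ID."""
    delta = F.one_hot(torch.tensor(type_index), num_classes=n_types).float()
    return torch.cat([freq_vec, g(delta)], dim=-1)

expanded = expand(torch.tensor([42.0]), type_index=3)   # shape: (1 + id_dim,)
```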

Fig. 5. Detecting issues of unknown types by clustering user feedback.

Evaluation results of SkyNet on three real-world large-scale online service systems, as detailed in Section 4, show that such unification improves the efficiency of threshold prediction in SkyNet without significantly sacrificing its effectiveness.

Issues of Unknown Types Recall that all feedback reporting previously unknown types of negative user experiences will be classified into the “Unknown” category, and such feedback may also reveal issues if many of them concern similar experiences. In view of that, SkyNet clusters user feedback in category “Unknown” periodically (e.g., every half an hour) and raises an issue when the number of feedback in a cluster exceeds a threshold. Figure 5 depicts the main steps SkyNet takes to detect issues of unknown types based on clustering.

To increase the chance that user feedback reporting similar user experiences gets placed into one cluster, it is important that the embedding properly captures the semantic characteristics of the feedback texts. To that end, SkyNet uses the fine-tuned ALBERT-Tiny model to generate the deep semantic embedding of these feedback texts. Feedback clustering solely based on that embedding, however, may suffer from the overfitting problem and miss issues of unknown types, because the ALBERT-Tiny model was fine-tuned w.r.t. the input hierarchical label system. Therefore, SkyNet also incorporates the shallow semantics extracted with Word2Vec [27, 28] and Smooth Inverse Frequency (SIF) [9] to facilitate the clustering. Word2Vec is a pre-trained model that learns word associations from a large corpus of text, while SIF embeds a sentence with the vector calculated as the weighted average of all its word vectors. Given a piece of feedback text, SkyNet first applies Word2Vec to produce the embedding for each token in the text and then converts the token embeddings to a sentence embedding with SIF. Afterward, the overall embedding of the feedback, combining its shallow and deep semantic information, is formed by concatenating the embeddings produced by ALBERT-Tiny and SIF, respectively.
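A simplified sketch of this combined embedding is shown below, using gensim’s Word2Vec; full SIF also removes the first principal component of the sentence vectors, which is omitted here for brevity, and the deep vector is assumed to come from the ALBERT-Tiny encoder sketched in Section 3.1.

```python
# Hedged sketch: shallow (Word2Vec + SIF weighting) plus deep embedding.
import numpy as np
from gensim.models import Word2Vec

A = 1e-3   # the SIF smoothing constant "a"

def sif_sentence_vec(tokens, w2v, word_prob):
    """Weighted average of word vectors with weight a / (a + p(w))."""
    pairs = [(w2v.wv[w], A / (A + word_prob.get(w, 1e-6)))
             for w in tokens if w in w2v.wv]
    if not pairs:
        return np.zeros(w2v.vector_size)
    vecs, weights = zip(*pairs)
    return np.average(np.stack(vecs), axis=0, weights=weights)

def overall_embedding(tokens, deep_vec, w2v, word_prob):
    # Concatenate the deep (ALBERT-Tiny) and shallow (SIF) embeddings.
    return np.concatenate([deep_vec, sif_sentence_vec(tokens, w2v, word_prob)])
```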

Fig. 6. The cross-domain decision mechanism. A valid public opinion post is used to retrieve, from the feedback database and within a time window, user feedback that is lexically or semantically similar to the post; the retrieved feedback then goes through a statistical judgment to decide whether an issue alert should be raised.

With the overall semantic embedding as input, SkyNet employs the K-means algorithm to cluster “Unknown” feedback into groups. Note that, since the “Unknown” user feedback usually concerns a wide range of user experiences without concentrating on any specific types, we expect the resultant clusters to be small in size. Correspondingly, when those user feedback texts form large groups, it is highly likely that the feedback in those groups reveals issues in the system. Specifically, SkyNet reports an issue if the size of a cluster exceeds a threshold \(H_f=\text{MAX}(\alpha \cdot N_{total}/m, \beta )\), where \( N_{total} \) is the total number of feedback being clustered, \( m \) is the (predefined) number of clusters to produce, while both \(\alpha \) and \(\beta \) are constants. In other words, an alert will be raised if the number of feedback in a cluster is larger than both \( \alpha \) times the average cluster size and a fixed value \(\beta \). We conservatively set \( \alpha \) to 5 in SkyNet since, according to our experience, an issue often causes the size of its corresponding feedback cluster to increase by 10 times or even more. \( \beta \) is introduced to avoid reporting issues merely because the value of \(\alpha \cdot N_{total}/m\) is very small, e.g., when the total number of user feedback to be clustered is small, and we empirically set it to 10.
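This detection rule can be sketched with scikit-learn’s K-means as follows; the cluster count \(m\) is an assumed value, while \(\alpha =5\) and \(\beta =10\) follow the text above.

```python
# Hedged sketch of unknown-type issue detection via K-means clustering.
import numpy as np
from sklearn.cluster import KMeans

def detect_unknown_issues(embeddings, m=20, alpha=5, beta=10):
    labels = KMeans(n_clusters=m, n_init=10).fit_predict(embeddings)
    h_f = max(len(embeddings) / m * alpha, beta)    # threshold H_f
    sizes = np.bincount(labels, minlength=m)
    # Report the clusters whose sizes exceed the threshold H_f.
    return [c for c in range(m) if sizes[c] > h_f]
```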

3.3 Issue Detection Based on Social Media Data

Due to the potentially high costs and damage that negative public opinions may cause when they are overlooked, SkyNet dedicates an auxiliary process to detecting issues reflected by posts on social media platforms.

Compared with user feedback collected from dedicated channels, which is more informative and comes with labeled historical data for training, social media posts usually contain noisy data, are less structured, and often cover a wide range of topics, making it more challenging to extract issue-related information from them. In view of that, SkyNet adopts a two-stage denoising process to prune out most posts that are either not directly related to the service system under consideration or not reporting experiences likely associated with issues.

More concretely, during the two-stage denoising process, SkyNet first applies keyword-based search to filter out posts that do not mention the name of the target service system, and then applies a binary classification model constructed with ALBERT-Tiny to further filter out posts not reporting negative user experiences. To train the classification model, we collected product-related posts and manually labeled them according to whether they report negative user experiences. We refer to all the social media posts that are retained after the two-stage denoising process as relevant posts.

To identify social media posts that report negative experiences likely associated with issues, SkyNet employs a cross-domain joint-decision-making process based on both user feedback and social media posts. As depicted in Figure 6, for each relevant social media post, SkyNet first retrieves similar user feedback from past time windows. We consider two types of similarities between user feedback and social media posts. The lexical similarity is calculated using the Lucene correlation algorithm that comes with ElasticSearch [3], which is based on the classic BM25 algorithm [8]. We consider a piece of user feedback to be a lexical match of a social media post if the BM25 score between them is higher than a threshold of 40. The semantic similarity is calculated as the Euclidean distance between the ALBERT-Tiny embeddings of the user feedback and the social media post. We consider a piece of user feedback to be a semantic match of a social media post if the distance is smaller than a threshold of 0.4. A piece of user feedback is considered a match for a social media post if it is a lexical or semantic match for the post; naturally, a piece of user feedback can be both a lexical and a semantic match of the same post.
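The per-feedback matching rule boils down to the following check; the thresholds are those stated above, and the BM25 score is assumed to come from an Elasticsearch query.

```python
# Hedged sketch of the lexical-or-semantic matching rule.
import numpy as np

BM25_THRESHOLD = 40.0    # lexical match: BM25 score above 40
DIST_THRESHOLD = 0.4     # semantic match: Euclidean distance below 0.4

def matches(bm25_score, feedback_vec, post_vec):
    lexical = bm25_score > BM25_THRESHOLD
    semantic = np.linalg.norm(feedback_vec - post_vec) < DIST_THRESHOLD
    return lexical or semantic
```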

Given a relevant social media post p, let \(N_{h}\) and \(N_{d}\) be the total numbers of matching user feedback for p in the past hour and day, respectively. SkyNet raises an issue if \(N_{h}\) exceeds the threshold \(H_h = MAX(\alpha _h*\overline{N_{h}}, \beta _h)\) or \(N_{d}\) exceeds the threshold \(H_d = MAX(\alpha _d*\overline{N_{d}},\beta _d)\), where \( \overline{N_{h}} \) and \( \overline{N_{d}} \) are the average numbers of matching user feedback for p in each hour and day of the past week, respectively, while \(\alpha _h\), \(\alpha _d\), \(\beta _h\), and \(\beta _d\) are constants. Intuitively, an alert will be generated if (1) the number of similar user feedback in the past hour is larger than both \(\alpha _h\) times the hourly average across the past week and a fixed value \(\beta _h\), or (2) the number of similar user feedback in the past day is larger than both \(\alpha _d\) times the daily average across the past week and a fixed value \(\beta _d\). We empirically assign 3, 3, 5, and 10 to \(\alpha _h\), \(\alpha _d\), \(\beta _h\), and \(\beta _d\), respectively, in the current implementation of SkyNet, and we leave the development of more sophisticated techniques for predicting the threshold values for future work.
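Transcribed from the formulas above, the cross-domain alert rule is simply:

```python
# Hedged sketch of the cross-domain alert rule (constants as stated above).
def should_alert(n_h, n_d, avg_h, avg_d,
                 alpha_h=3, alpha_d=3, beta_h=5, beta_d=10):
    h_h = max(alpha_h * avg_h, beta_h)   # hourly threshold H_h
    h_d = max(alpha_d * avg_d, beta_d)   # daily threshold H_d
    return n_h > h_h or n_d > h_d
```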

4 Experimental Evaluations

We experimentally evaluated the effectiveness of SkyNet and the usefulness of its components based on its application results produced on real-world online service systems. Our evaluation aims to address the following research questions:

  1. RQ1:

    How effective is SkyNet in detecting issues in industry-level online service systems? In RQ1, we assess the effectiveness of SkyNet in issue detection in terms of the precision and recall it achieves from a user’s perspective.

  2. RQ2:

    How useful are the individual component mechanisms of SkyNet for the overall issue detection? Recall that SkyNet integrates three components to effectively detect issues in large-scale online service systems, namely a component \(C_k\) that applies cascading classification and time series analysis to detect issues of known types based on user feedback, a component \(C_u\) that applies the K-means clustering algorithm to detect issues of unknown types based on user feedback, and a component \(C_p\) that applies joint decision making to detect issues based on social media posts. In RQ2, we investigate how much each of these components contributes to the overall effectiveness of SkyNet.

We were not able to experimentally compare SkyNet with iFeedback for two reasons. First, the implementation of iFeedback is not publicly available. Second, faithfully re-building the tool is hardly viable because important information regarding its implementation is missing from the related publication. For example, we only know from the publication that iFeedback employs an XGBoost-based model to classify whether a time interval contains an issue, and it applies a hierarchical algorithm to cluster the user feedback as reporting different issues [45], but no information about the settings and parameters of the model and algorithm adopted in their implementation was given in the publication, although those settings and parameters may greatly affect iFeedback’s issue detection capabilities.

Table 1. Industry-level online service systems used as the subjects in our experiments.

4.1 Subject Systems

In our experiments, we applied SkyNet to three industry-level online service systems. Table 1 summarizes the basic information about the systems. For each system, the table gives its ID, a brief description, its number of monthly active users (MAUs) in millions, and the average number of user feedback items received per day for the system. System S1 is an online video-sharing social media platform, system S2 is an online video editing system, and system S3 is an online beauty camera platform. The subjects include systems of different types for different users, with different magnitudes of MAUs, and receiving different amounts of user feedback. The diversity in the subject systems helps to ensure that the experiments are representative of SkyNet’s behavior in different situations.

4.2 Model Training

Since all three subject systems mainly target Chinese users, we configured SkyNet to utilize a pre-trained ALBERT model [1], the DSG embedding corpora [7], and the Jieba text segmentation library [5] for processing texts in Chinese. Meanwhile, we configured SkyNet to utilize the texts posted on Weibo, one of the biggest social media platforms in China, for issue detection in the experiments.

For each system, we utilized historical user feedback with labels manually assigned by the system developers over a one-month period to fine-tune the ALBERT-Tiny model and to train the cascading classification model as a whole. To prepare the hierarchical label system, first, we invited the system developers to decide which labels associated with negative user experience reporting feedback should be retained as the bottom layer labels. Then, following the principles described in Section 3.1, the developers were asked to group and summarize the bottom layer labels to form the intermediate and top layer labels. Finally, all the other labels indicating negative user experiences were converted to “Unknown”, and the remaining labels were converted to “Non-negative”. In this way, we prepared for each online service a hierarchical label system and a large number of user feedback associated with those labels. For each constructed hierarchical label system, Table 1 gives the numbers of labels at its three different layers.

Afterward, we followed the standard practice [34] to tune the hyperparameters to be used with the classification and BiLSTM models. Particularly, for each service system, we selected via random search a group of 10 hyperparameters that enables the classification model to correctly label the most historical user feedback texts, and then we looked for adjacent values via grid search [34] that produced the highest number of correct labels and used those values for the classification model in our experiments. The BiLSTM model was trained through stochastic gradient descent [37] on the time series data derived from the given historical feedback data. For example, for the experiments on service system S1, the cascading classification model used the following non-default hyperparameters: batch_size=24; dropout=0.1; learning_rate=\(2e{-5}\); warm_up_proportion=0.1; max_epoch=10, while the BiLSTM model used the following non-default hyperparameters: dropout=0.1; max_epoch=50; sequence_len=50; learning_rate=0.1; batch_size=24.

4.3 Experimental Setup

We applied SkyNet to detect issues in each subject system based on historical data collected over a ten-month period of time. Each detected issue was checked manually by operators and developers of the systems to confirm whether it indicates a real problem that needs to be handled. Moreover, the operators and developers also assessed the severity of each issue based on the functionalities it may impact, the costs it may incur, and the extent to which users’ experience may be jeopardized. An issue is called a severe issue if its impact in at least one of those aspects is substantial.

To answer RQ1, we collected all the issues reported by SkyNet for the subject systems as well as the results of manual inspections on the issues. Following the practice in previous work [45], we measure the effectiveness of SkyNet in terms of the precision and recall of the issue detection results produced by the tool. In particular, the precision is calculated as the percentage of real issues among all the detected issues, i.e., \(N^i_c/N^i_d\), where \(N^i_c\) and \(N^i_d\) are the numbers of issues confirmed by developers and detected by SkyNet, respectively; the recall is calculated as the ratio of detected severe issues to all the severe issues recorded for the whole experiment period, i.e., \(N^s_d/N^s_r\), where \(N^s_d\) and \(N^s_r\) are the numbers of severe issues detected by SkyNet and recorded by developers, respectively. Note that the recall metric concerns only severe issues because severe issues, due to their high impact, will eventually be reported even if SkyNet fails to detect them, whereas there is no practical way for us to find out the exact total number of real issues in those systems.

To answer RQ2, we ran SkyNet two more times on all the user feedback data and the social media posts to detect issues for the systems, the first time with component \(C_p\) being disabled and the second time with both components \(C_p\) and \(C_u\) being disabled. Then, we compared the issue detection results from the three runs in the number of issues detected as well as the precision and recall of the corresponding results.

4.4 Experimental Results

In this section, we report on the results produced in the experiments and answer the research questions.

RQ1: Effectiveness Table 2 lists the basic information about the issue detection results SkyNet produced on the systems. For each system, the table lists its system ID, the numbers of issues detected by SkyNet and confirmed by developers, the numbers of severe issues detected by SkyNet and recorded by developers, and the precision (prec) and recall (reca) achieved accordingly.

SkyNet detected 2790 issues in total, 2595 of which were manually confirmed to be true issues, yielding a precision of 93.0%. As for severe issues, developers recorded in total 62 cases for the three systems in ten months, and 58 of them were detected by SkyNet, yielding a recall of 93.5%. In comparison, iFeedback [45] achieved 76.2% and 93.2% for precision and recall, respectively, in its evaluation. SkyNet managed to significantly outperform iFeedback in terms of precision while slightly improving the recall. Such results suggest that SkyNet is both effective and accurate in issue detection.

Table 2. Issue detection results produced by SkyNet on the subject systems.

To understand why SkyNet missed severe issues, we manually inspected all four severe issues that escaped its detection. Three of the four severe issues were missed due to minor fluctuations in the number of associated user feedback. For instance, one severe issue that SkyNet missed occurred during AB-testing [40] of a service system. Since only a small number of users were involved in the AB-test, while the issue seriously damaged the user experience of the system, the total number of users affected was relatively small compared with the number of users that routinely access the service provided by the system. Hence, no alert was triggered. The severe issue could have been detected had SkyNet predicted the threshold frequency of issue-reporting feedback texts as a ratio of the total number of users with access to the relevant system feature. SkyNet missed the other severe issue, which was of a previously unknown type, due to imprecise clustering of feedback texts. Since various users’ descriptions of the issue were quite different, SkyNet’s unsupervised model was not able to group all the user feedback reporting the same issue into one cluster. This is not completely unexpected since, although we have considered both the lexical and semantic characteristics of feedback texts in their embedding, it is not a perfect solution yet. We plan to devise more powerful embedding and clustering techniques to facilitate the detection of issues of unknown types in the future.


RQ2: Usefulness of Component Mechanisms Table 3 shows the results produced by SkyNet with various components being disabled in issue detection. For each system identified by its SID, the table gives the issue detection results from using just component \(C_k\), using both components \(C_k\) and \(C_u\), and using all three components of SkyNet. In each setting, the table lists the numbers of issues detected by the tool (\(N^i_d\)) and confirmed by developers (\(N^i_c\)), the number of severe issues detected by the tool (\(N^s_d\)), and the precision (P) and recall (R) achieved accordingly.

Table 3. Usefulness of SkyNet’s individual components for issue detection.

When \(C_k\) is the only component enabled, SkyNet was able to detect 2749 issues, among which 2560 were manually confirmed, and 33 severe issues for the systems, achieving an overall precision and recall of 93.1% and 53.2%, respectively. To put it in perspective, that is 98.7% (=2560/2595) of the real issues and 56.9% (=33/58) of the severe issues the tool can ever detect with all its components enabled. Such results clearly show that both the cascade feedback classification and the dynamic threshold prediction of SkyNet were effective in detecting issues based on user feedback. Although the recall that \(C_k\) achieved in detecting severe issues is relatively low, it is understandable since many severe issues are of previously unknown types and hence beyond the detection capability of \(C_k\).

Component \(C_u\) helped capture 29 (=2589-2560) real issues and 19 (=52-33) severe issues that component \(C_k\) failed to detect, which caused the precision of the overall result to drop slightly to 93.0% but helped raise the recall of the overall result to 83.9%. The drop in the result precision is understandable since \(C_u\) essentially detects issues of previously unknown types via unsupervised learning, and the precision of unsupervised learning results is relatively low in general. Compared with a few false positives, i.e., reported issues that were manually ruled out as not being real issues, the 19 severe issues detected by component \(C_u\) are significantly more important for the developers. Therefore, we believe component \(C_u\) is a valuable complement to component \(C_k\). Note that only feedback items that report negative user experiences of previously unknown types are processed by component \(C_u\).

The issue detection results produced by components \(C_k\) and \(C_u\) also enable us to directly compare SkyNet’s and iFeedback’s issue detection capabilities solely based on user feedback. As shown in Table 3, when only having access to user feedback, i.e., when component \(C_p\) is disabled, SkyNet was able to detect 2784 issues, among which 2589 were confirmed to be real ones and 52 were considered severe. The precision and recall achieved are therefore 93.0% and 83.9%, respectively. Recall that the precision and recall iFeedback achieved were 76.2% and 93.2%, respectively. The differences suggest that SkyNet and iFeedback make different tradeoffs between issue detection precision and recall. iFeedback is more lenient in reporting issues: many issues it reported turned out to be false positives, but it managed to detect more severe issues. SkyNet is stricter in reporting issues: it reported fewer false positives, but it missed a few more severe issues.

SkyNet makes up for its relatively low recall in issue detection based on user feedback by also taking into account users’ posts on social media platforms. Although component \(C_p\) only detected 6 more real issues in our experiments, all of them turned out to be severe, and missing any of these issues might have caused great damage to the company. Therefore, although this component only slightly improved the overall recall, we consider it a crucial and indispensable part of SkyNet.


Threats to Validity In this section, we discuss possible threats to the validity of our findings and show how we mitigate them.

Construct validity. In our evaluation, a reported issue could be manually confirmed or rejected as a real or severe issue, but different people may provide different assessments. To mitigate this threat, we directly reused the independent issue assessment results from the developers of the service systems.

Internal validity. SkyNet makes use of a list of parameters, including, e.g., the size of the sliding window for BiLSTM and the similarity threshold for matching social-media posts with user feedback texts. We set the parameters based on our experience in the current implementation of SkyNet. Experimental evaluation conducted on three industry-level online service systems produced very promising results, suggesting the chosen parameter values are appropriate. Having said that, we are aware that different values for the parameters may influence SkyNet’s effectiveness, and therefore we plan to conduct more experiments in the future to systematically evaluate the possible influence.

We were not able to experimentally compare SkyNet with iFeedback for the reasons stated at the beginning of Section 4. As a result, we compared the two tools based on the results they produced on the subject systems in their corresponding evaluations. For the comparison to be as fair as possible, we evaluated SkyNet on service systems of similar scales from various categories of applications. Moreover, the comparison was based on the common metrics precision and recall, instead of measurements like the numbers of issues and severe issues detected, which greatly depend on the experimental setup.

External validity. The subject service systems adopted in our experiments were real-world services of different scales and from different application domains. These characteristics help mitigate the risk that our evaluation overfits the subjects. In the future, we will continue monitoring the execution of SkyNet on existing service systems, and we will also deploy SkyNet on more service systems. We see no intrinsic limitations that would prevent SkyNet from working reliably on different online service systems.

5 Conclusions

This paper presents the SkyNet technique and tool that utilize user data gathered from multiple channels to detect issues for large-scale online service systems. The technique has been applied to detect issues for three real-world online services based on historical data gathered over a ten-month period of time. The produced results suggest that SkyNet is both effective and accurate in detecting issues and severe issues for large-scale online service systems.