Evaluation Design
For the evaluation, we collected out-of-sample data that is not linked to the constructed learning base and was therefore not part of the model training procedures. It served both as a basis for benchmarking alternative text classifiers and for comparing the different evaluation items. In particular, we collected 60 real-world problem statements, equally distributed among the three target classes, based on problem descriptions derived from our own industrial DSA projects as well as selected DM competitions from online platforms such as Kaggle. When gathering the set of problem statements, we took care to ensure (i) that the underlying scenarios originated from a wide range of application domains, (ii) that the keywords and key phrases signaling a specific class of DM method contained a sufficient degree of variability, and (iii) that the descriptions came with varying degrees of filler information and noise. The complete list of problem descriptions can be found in Appendix G.
To evaluate our system design artifact, we measured its advice quality, that is, its ability to provide a correct mapping between problem statements expressed in domain-specific natural language and DM methods. We grounded the evaluation on a performance comparison of different evaluation items, which constitute the test and reference elements for multiple design hypotheses. Table 2 provides an overview of these items.
Table 2 Reference and test items of the evaluation design
According to the derived design requirements, the TbIAS should be able to provide advice that is of higher quality than random guessing assuming a discrete uniform distribution across all possible DM methods, which determines the lower limit of any reference line. A second reference line for assessing the artifact's usefulness can be obtained by directly measuring the judgement capacity of a potential user group for whom the assistance system has been designed (cf. Sect. 6.2). The third reference item is the baseline combination of instantiable design principles. Here, we followed the basic idea of incrementally activating individual design principles, resulting in different design configurations whose effects can be measured separately (Meth et al. 2015).
Note that in our case, the design principles are sequentially interdependent. For example, without an underlying learning base (DP3), no text classifiers for automated NL request processing (DP1) can be applied, and vice versa. Similarly, the use of embedding models for automated context extraction (DP2) enables different types of text classifiers, which results in alternative feature instantiations of DP1 with and without (*) the use of embeddings. Therefore, our baseline configuration consists of the constructed learning base and several standard text classification models (cf. Sect. 6.3). By contrast, the test item represents the full configuration based on the entire system of design principles DP1 to DP3 and their associated design features (cf. Sect. 6.4). In this case, DP1 was instantiated with more advanced text classifiers, as outlined in Sect. 5.3.
Based on the four evaluation items, we propose three hypotheses. First, at the very minimum, we assume that the full configuration performs better than pure guessing when signaling a match between problem statements and DM method classes. Thus, we hypothesize:
H1
Using a TbIAS that is built on an automatically constructed learning base (DP3), supports automated natural language request processing (DP1), and allows automated context extraction (DP2) will result in higher advice quality for DM method class selection than a selection by random guessing.
Second, we expect that the full TbIAS configuration based on all three design principles is also able to outperform the judgement capacity of DM novices and therefore provides useful assistance when sufficient DM experience is not available. Consequently, we hypothesize:
H2
Using a TbIAS that is built on an automatically constructed learning base (DP3), supports automated natural language request processing (DP1), and allows automated context extraction (DP2) will result in higher advice quality for DM method class selection than a selection based on the judgement capacity of DM novices.
Lastly, we expect that the full TbIAS configuration based on all three design principles outperforms the baseline configuration due to the additional capability of automatically extracting relevant context. Thus, we hypothesize:
H3
Using a TbIAS that is built on an automatically constructed learning base (DP3), supports automated natural language request processing (DP1), and allows automated context extraction (DP2) will result in higher advice quality for DM method class selection than a TbIAS that is built on an automatically constructed learning base (DP3) and supports automated natural language request processing (DP1(*)).
To measure and report the advice quality of each item, we rely on standard metrics for the evaluation of classification problems that are straightforward to interpret. Specifically, we use the overall accuracy as the proportion of correctly classified cases among the total number of cases, and the recall as the proportion of correctly classified positive cases among the total number of positive cases (Metz 1978). Here, a positive class refers to a specific class of DM methods under consideration (e.g., cluster analysis) in contrast to the remaining method classes. Furthermore, to assess the inter-group differences between the items at the case level, we consider confidence scores instead of binary decisions to express how confidently a case was assigned to a particular DM method.
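To make the reporting concrete, the following minimal sketch (not the authors' original code) shows how accuracy, per-class recall, and case-level confidence scores could be computed with scikit-learn; all labels and probability values are hypothetical.

# Minimal sketch: accuracy, per-class recall, and case-level confidence scores.
# Labels and probabilities below are hypothetical placeholders.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

CLASSES = ["CL", "PR", "FPM"]                      # DM method classes as abbreviated in the paper

y_true = np.array(["CL", "FPM", "PR", "FPM"])      # gold assignments of four example cases
y_pred = np.array(["CL", "FPM", "FPM", "FPM"])     # assignments produced by a classifier
proba = np.array([[0.70, 0.20, 0.10],              # confidence scores per case and class
                  [0.05, 0.15, 0.80],
                  [0.20, 0.30, 0.50],
                  [0.10, 0.10, 0.80]])

accuracy = accuracy_score(y_true, y_pred)                            # share of correct cases
recall = recall_score(y_true, y_pred, labels=CLASSES, average=None)  # one value per class

# Confidence assigned to the correct class of each case (used for inter-group comparisons).
conf_correct = proba[np.arange(len(y_true)), [CLASSES.index(c) for c in y_true]]

print(accuracy, dict(zip(CLASSES, recall)), conf_correct)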
Novice Assessment
To obtain a representative reference item reflecting the judgement capacity of DM novices, we collected survey data from master-level students at a public research university in Germany. Specifically, we recruited 20 students attending an advanced DSA module as subjects; they were still at the beginning of their education and had only little experience in the application of DM methods. Hence, we consider them representative DM novices. As the module is an elective, we assume that the subjects were intrinsically motivated to answer the questions conscientiously.
The survey consisted of two parts. In the first part, we introduced the survey and asked the subjects to rate their DM knowledge. Specifically, we asked them to provide descriptive data about their study background, their general knowledge of DM, and their particular experience with the three method classes of interest (cf. Table 3).
Table 3 Student subjects' descriptive data
To introduce the second part, the instructor gave a brief overview of the three method classes using an overview slide to ensure that the subjects had a basic understanding of all method classes. Then, the subjects were asked to read the 60 randomly ordered problem statements carefully and to select the DM method class they considered best suited. To avoid any distortion of the results by pure guessing, we asked the subjects to indicate whenever they were not sure about a selection, which was offered as a fourth response option. Answering the second part took between 25 and 38 min. The full questionnaire can be found in Appendix H.
The analysis of the survey data shows that, on average, the DM novices assigned 55% of the problem descriptions correctly. Considering the self-assessed method experience in Table 3, one could assume that the majority of incorrectly assigned problem statements belonged to the class of FPM. However, there was no notable difference between the three classes when considering their average recall scores {CL: 0.62, PR: 0.48, FPM: 0.56}. As a pre-test, we had given the questionnaire to three graduate students with more advanced DM experience, who achieved an average accuracy of 0.91. This confirms that, with a certain level of DM experience, the mapping task based on the given validation data can be performed unambiguously.
Baseline Configuration
To compare the artifact with baseline text classification functionality, we implemented several standard classifier algorithms representing activated DP3 and DP1(*) in the absence of DP2. For this purpose, we considered an SVM with a radial basis function kernel and a multilayer perceptron (MLP) with three hidden layers, fifty neurons per hidden layer, a sigmoid activation function, and a dropout layer. We chose these algorithms to best resemble the text classifiers used in the full design. We omitted the LSTM and GRU architectures from the baseline configuration because the baseline is trained only at the document level. Hence, there are no word- or sentence-level embeddings available as sequential inputs due to the absence of DP2, which renders the use of a sequential model superfluous. Additionally, to represent out-of-the-box behavior, we considered the algorithms only in their standard configuration without hyperparameter optimization. For model training, we built a vector representation of all documents by using all words from the corpora as features with term frequency-inverse document frequency (TF-IDF) weighting for each word in each document (Salton and Buckley 1988). We used only the non-augmented (DEF) and augmented (AUG) datasets for training and withheld the validation data for evaluation purposes.
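The following sketch illustrates how such a baseline could be assembled with scikit-learn using default hyperparameters; the corpus, labels, and validation documents are placeholders, and the dropout layer of the original MLP is not reproduced here, as it would require a deep learning framework.

# Sketch of the baseline configuration (DP3 + DP1(*) without DP2): TF-IDF document
# vectors fed into an SVM with RBF kernel and an MLP with three hidden layers of
# fifty neurons. Documents and labels are placeholders; no dropout is modeled.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

train_docs = ["cluster customers into homogeneous segments",
              "predict the churn probability of subscribers",
              "find products that are frequently bought together"]   # stand-in for DEF/AUG corpus
train_labels = ["CL", "PR", "FPM"]
test_docs = ["group stores with similar sales patterns"]             # stand-in for validation data

svm_baseline = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf"))
mlp_baseline = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(50, 50, 50),   # three hidden layers, fifty neurons each
                  activation="logistic"),            # sigmoid activation
)

svm_baseline.fit(train_docs, train_labels)
mlp_baseline.fit(train_docs, train_labels)
print(svm_baseline.predict(test_docs), mlp_baseline.predict(test_docs))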
Table 4 shows that the accuracies achieved with the AUG dataset are superior to those achieved with the DEF dataset for both classifiers. This especially affects the SVM, which falls back to the level of random guessing with the non-augmented data, while improving slightly with the augmented data. The same effect can be observed for the MLP classifier, with an even higher magnitude. We expected these results, since the three hidden layers most likely generate abstract features that describe the documents better than the SVM's input representation. We can also observe a skewed result distribution that leans towards the class of FPM. This effect would cause the TbIAS to favor one class over another when unsure. Additionally, we can see that the baseline models perform only at a level close to random guessing for the class of cluster analysis.
Table 4 Evaluation results of the baseline models trained on different datasets
Full Configuration
Finally, we evaluated the full configuration of our system design artifact based on the advanced text classifiers, which were trained on distinct embedding models and the two datasets DEF and AUG. Table 5 reveals that the SVM (on DAN:DEF) and the LSTM (N–1) (on FastText:AUG) show the best performances and produce accurate results. In contrast, the KNN and Topic-KNN models (on USE) exhibit the lowest accuracies and are not suitable for realizing an adequate mapping. Generally, we observe that the results of the pre-trained USE model lag behind those of the other embedding models and that the concept of transfer learning produces no useful effect in this context. Further, we can see that methods using single paragraph vectors as input achieve better results with DAN models, whereas the models with N–1 architectures perform better with FastText vectors. This underlines the usefulness of separate models for the inference of word and paragraph vectors. Lastly, we see that the DL architectures generally perform better on the augmented data, whereas the classical approaches perform better on the non-augmented data.
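To illustrate the general idea of feeding embedding-based document representations into such classifiers, the following sketch trains FastText word vectors with gensim, averages them into simple document vectors, and passes them to an SVM; the corpus is a placeholder, and the paper's DAN/USE paragraph embeddings as well as the sequential LSTM/GRU inputs are not reproduced here.

# Illustrative sketch only: FastText word vectors (gensim) averaged into document
# vectors and classified with an SVM. Corpus and labels are placeholders; the
# DAN/USE paragraph embeddings and sequence models of the full design are omitted.
import numpy as np
from gensim.models import FastText
from sklearn.svm import SVC

docs = ["cluster customers into homogeneous segments",
        "predict the churn probability of subscribers",
        "find products that are frequently bought together"]
labels = ["CL", "PR", "FPM"]

tokens = [d.lower().split() for d in docs]
ft = FastText(sentences=tokens, vector_size=100, window=5, min_count=1, epochs=10)

def doc_vector(words):
    # Average the word vectors of a document (a simple stand-in for a paragraph vector).
    return np.mean([ft.wv[w] for w in words], axis=0)

X = np.vstack([doc_vector(t) for t in tokens])
clf = SVC(kernel="rbf").fit(X, labels)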
Table 5 Evaluation results of the full design configuration trained on different embeddings and datasets
In addition to the depicted text classifiers based on the embedding models, we also examined the two LDA topic models. The first approach, which is used for direct classification with three topics, reaches an accuracy of 0.70, whereas the second approach, based on seven topics and a subsequent SVM classifier, only shows a poor accuracy of 0.46, demonstrating that the latter approach is not suitable for the given task.
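A minimal sketch of the first, direct LDA approach is given below: a three-topic model is fitted on bag-of-words counts, and each document is assigned the class associated with its dominant topic. The corpus and the topic-to-class mapping are illustrative assumptions.

# Sketch of the direct LDA classification: three topics, each mapped to one DM
# method class. Corpus and topic-to-class mapping are illustrative only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cluster customers into homogeneous segments",
        "predict the churn probability of subscribers",
        "find products that are frequently bought together"]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)

topic_to_class = {0: "CL", 1: "PR", 2: "FPM"}        # assumed mapping after inspecting the topics
dominant = lda.transform(counts).argmax(axis=1)       # dominant topic per document
predicted = [topic_to_class[t] for t in dominant]
print(predicted)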
Ultimately, to produce an even more accurate classifier, we built weighted averaging ensembles (Sagi and Rokach 2018) based on the best-performing classifiers of each model type. The results of the three top ensembles are shown in Table 6. Accuracies of up to 90% can be achieved with the combination of an SVM, an LSTM (N–1), and a GRU (1–1). Consequently, this combination was implemented in the final prototype.
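The following sketch illustrates the weighted averaging step itself; the confidence vectors and weights are hypothetical, with the class order [CL, PR, FPM].

# Minimal sketch of a weighted averaging ensemble over the confidence vectors of
# the three members (SVM, LSTM (N-1), GRU (1-1)). All numbers are hypothetical.
import numpy as np

classes = ["CL", "PR", "FPM"]
p_svm  = np.array([0.60, 0.25, 0.15])    # confidence vector of the SVM for one statement
p_lstm = np.array([0.55, 0.30, 0.15])    # confidence vector of the LSTM (N-1)
p_gru  = np.array([0.40, 0.35, 0.25])    # confidence vector of the GRU (1-1)

weights = np.array([0.40, 0.35, 0.25])   # e.g., proportional to each member's validation accuracy
p_ensemble = np.average(np.vstack([p_svm, p_lstm, p_gru]), axis=0, weights=weights)

advice = classes[int(np.argmax(p_ensemble))]   # final DM method class suggested by the TbIAS
print(p_ensemble, advice)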
Table 6 Evaluation results of the ensemble models based on weighted averaging
Performance Comparison and Hypothesis Testing
After assessing the individual evaluation items, a comparison of the scores reveals that the full configuration based on all three design principles dominates the other reference items. Table 7 summarizes the recall and accuracy results for all four items. We considered only the best-performing TbIAS configurations.
Table 7 Overall performance comparison for the different evaluation items
To provide even more reliable statements about the inter-group differences and to test our design hypotheses H1–H3, we additionally considered the confidence scores for each classification decision. These scores express how certain an algorithm is about a decision. While, for a general evaluation, we want the algorithm to make the right decisions, we also want it to be confident about them. For example, confidences of {CL: 0.32, PR: 0.32, FPM: 0.36} produce the same decision as confidences of {CL: 0.01, PR: 0.01, FPM: 0.98}; in the first case, however, the resulting decision to classify the problem as FPM is far less reliable.
For random guessing, we set equal confidences of 0.33 for each class, whereas the scores for the novice assessment were calculated as the relative frequency of subjects voting for the correct DM method class. For both TbIAS configurations, we used the scores returned by the classifiers. An overview of all confidence scores for each problem statement can be found in Appendix G.
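The construction of these reference confidence scores can be sketched as follows; the vote counts are hypothetical.

# Sketch of the confidence scores for the two reference items; vote counts are hypothetical.
import numpy as np

n_statements, n_subjects, n_classes = 60, 20, 3

# Random guessing: equal confidence of 1/3 for each class and problem statement.
conf_random = np.full(n_statements, 1 / n_classes)

# Novices: relative frequency of subjects voting for the correct class per statement.
votes_for_correct = np.random.default_rng(0).integers(0, n_subjects + 1, n_statements)
conf_novices = votes_for_correct / n_subjects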
To test our hypotheses, we conducted a two-stage analysis. First, we performed an ANOVA with the evaluation item as the independent variable and the confidence as the dependent variable. We applied the Bartlett test, the Levene test, and the Brown–Forsythe test for unequal variances to check the prerequisites for the ANOVA (Blanca et al. 2018). The tests indicated unequal variances. We therefore applied a robust version of the standard ANOVA by Wilcox (1989) to account for this. The results of the ANOVA are depicted in Table 8.
After the ANOVA returned a significant overall result, indicating that at least two evaluation items differ, we performed post hoc independent t tests with Bonferroni adjustment to compare them pairwise. The t tests returned significant results for H1 and H2 at the 0.01 level and for H3 at the 0.05 level. This supports all three hypotheses and confirms that our design principles indeed increase the advice quality for natural language problem descriptions. Table 9 shows the results of the tests.
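The two-stage analysis can be sketched with SciPy as follows. The per-item confidence vectors are placeholders; since SciPy offers no direct implementation of Wilcox's (1989) robust ANOVA, the standard one-way ANOVA is used here merely as a stand-in for the overall test, and Welch's variant is used for the pairwise comparisons.

# Sketch of the two-stage analysis. Confidence vectors are placeholders; the
# standard one-way ANOVA stands in for the robust ANOVA of Wilcox (1989), and
# Welch's t test is used for the Bonferroni-adjusted pairwise comparisons.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
conf_random  = rng.uniform(0.25, 0.40, 60)    # placeholder per-case confidence scores
conf_novices = rng.uniform(0.30, 0.80, 60)
conf_base    = rng.uniform(0.30, 0.90, 60)
conf_full    = rng.uniform(0.50, 1.00, 60)
groups = [conf_random, conf_novices, conf_base, conf_full]

# Prerequisite checks for the ANOVA (indication of unequal variances).
print(stats.bartlett(*groups))
print(stats.levene(*groups, center="mean"))      # Levene test
print(stats.levene(*groups, center="median"))    # Brown-Forsythe variant

# Overall test that at least two evaluation items differ.
print(stats.f_oneway(*groups))

# Post hoc pairwise t tests for H1-H3 with Bonferroni adjustment.
pairs = {"H1": (conf_full, conf_random),
         "H2": (conf_full, conf_novices),
         "H3": (conf_full, conf_base)}
for name, (a, b) in pairs.items():
    t, p = stats.ttest_ind(a, b, equal_var=False)          # Welch's t test
    print(name, t, min(p * len(pairs), 1.0))                # Bonferroni-adjusted p value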
Table 9 Post-hoc t test results of hypotheses H1–H3
Robustness Checks
In addition to the evaluation based on the fixed validation data, we conducted several robustness checks with the full TbIAS design configuration to assess the transferability of the results to circumstances beyond those covered by the considered problem descriptions. Specifically, we investigated the impact on the confidence scores of (i) replacing method-centric keywords, (ii) replacing domain entities, and (iii) modifying the length of the problem descriptions. Several examples for each check can be found in Appendix I.
The first check revealed that the choice of keywords has a high impact on the confidence scores: keywords with a stronger semantic connection to a certain DM method generally increased the confidence scores of the correct class, whereas weaker keywords resulted in a decrease. Likewise, keywords associated with a contrary DM method caused a problem statement to be assigned to another class.
For the second check, we systematically replaced characteristic domain entities. We found that the problem's domain context also has a certain impact on the confidence scores. For example, the TbIAS generally tended to drift towards FPM whenever a problem statement contained sales-related terms such as "client" or "selling". Presumably, this distortion is caused by the fact that a predominant portion of the academic articles forming the foundation of the learning base investigates FPM in the context of sales problems. This bias is thus an apparent limitation of the current system, which should be able to classify problems independently of the underlying domain.
In the third check, we iteratively modified the length of the problem descriptions by either reducing them to their central statement or adding noise. Here, we noticed an influence of additional noise, but despite a decreasing ratio of keywords to total words, the confidence values remained relatively stable.
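Such checks can also be automated. The sketch below fits a small illustrative classifier, swaps a method-centric keyword in a problem statement, and compares the resulting confidence vectors; the classifier, texts, labels, and keyword substitution are hypothetical stand-ins, not the actual TbIAS components.

# Sketch of an automated robustness check: swap a method-centric keyword and
# compare the resulting confidence vectors. All texts and labels are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["segment customers into homogeneous groups",
        "predict the churn probability of subscribers",
        "find products that are frequently bought together"]
labels = ["CL", "PR", "FPM"]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(docs, labels)

original  = "We want to segment our clients by purchasing behavior."
perturbed = original.replace("segment", "group")     # weaker method-centric keyword
before, after = clf.predict_proba([original, perturbed])
print("confidence shift per class:", after - before)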