Prototype specification
How should the RM respond to a user’s decision? As mentioned, different approaches can be used to determine the feedback. To allow for experimentation with (a combination of) different approaches, a modular design is required. Robbins proposed a classification of critic modules that serve purposes similar to those of an RM (Robbins, 1998). These modules were designed to analyse different parts of a problem and compare them to a submitted user solution. In line with Robbins’ classification, multiple modules for an RM are described below. For the sake of readability, we focus on the two medical use cases; the RM design, however, applies likewise to the law cases.
Modules 0a and 0b serve as a control group, using unintelligent methods to challenge the user’s solution. Modules 1, 2 and 3 present more intelligent and informed methods that aim to produce more relevant and helpful feedback than that of the control group:
- 0a: The RM should at least produce “Are you sure?”.
  This is an uninformed control case, from here on referred to as the ‘uninformed’ module. In the abdominal pain example presented earlier, the RM would simply pose this question to the GP.
- 0b: The RM should ask about aspects specific to the case.
  This is an informed control case, from here on referred to as the ‘informed’ module. In the abdominal pain example, the informed module would for instance ask “Are the child’s teeth breaking through?”.
- 1: The RM should check for alarming discrepancies between the diagnosis of the DSS and that of the general practitioner.
  Assuming that the DSS is good enough to be used in the heavily regulated domain of healthcare, its solution should be as close to a ground truth as we can get. This module is thus akin to a correctness critic, from here on referred to as the ‘correctness’ module. It would for instance note that whereas the DSS based its decision on data points about “the school situation”, the GP mainly decided on the basis of “eating habits”, and ask a question related to this discrepancy.
- 2: The RM should check for alarming discrepancies between the diagnosis of the practitioner and the symptoms described by the patient.
  This module aims to verify that the diagnosis is consistent with the patient’s symptoms and is akin to a consistency critic, from here on referred to as the ‘consistency’ module. It would for instance note that the GP’s diagnosis of winter flu does not fit well with the absence of fever, and ask a question about that.
- 3: The RM might produce a differential diagnosis of its own to compare with the user’s solution.
  This module is closely related to an alternative critic, from here on referred to as the ‘differential’ module. In contrast with the GP’s focus on the child’s diet, the RM would for instance raise the alternative of child abuse by asking, e.g., “Was it checked whether the child has any bruises?”.
Unlike Robbins’ critic modules, where each module produces valuable critiques of the user’s solution, the RM modules might not all produce useful insight. In addition, unlike a critic, the goal of the RM is not to improve the user’s solution but rather to increase the user’s involvement in the decision-making process while working with a DSS, thus increasing meaningful human control. Hence, to implement and evaluate a simple prototype and to avoid the risk of overloading the user with information, the output of the RM was limited to a single counter question. The diagram in Fig. 1 describes how the different modules can work alongside one another while the RM produces only one output.
Our RM (particularly the correctness and consistency modules, 1 and 2) needs to ‘reason’ about possible solutions to a given problem. For this, the design of an RM can learn from the design of DSSs. Classic DSSs base their decisions on three main components, which Phillips-Wren et al. (2009) identified as a knowledge base, a database, and a model base. Since the model base, which classically contains formal models of decision making, depends largely on the content and representation of those formal models, this component will vary across implementations and problem domains.
For this simple prototype, the knowledge base and database are combined in one table. This yields four inputs for the machine, grouped into a single data structure in the sketch after this list:
- The textual case information as presented to the human expert,
- A one-word textual representation of the solution as proposed by the DSS. (Though not strictly required for questioning the user’s solution, it makes sense to include this input in the context of responsibility gaps.) For each of the four tasks, the recommendation by the DSS was determined during the design of the case studies, choosing an arbitrary (not ground truth) multiple-choice answer,
- A one-word textual representation of the solution as submitted by the human user (similarly, the RM can function without this input to question a DSS, though in the context of responsibility gaps that is not the goal),
- A lookup table that holds information on possible solutions and their related features.
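To make these inputs concrete, they could be grouped into a single structure. The following is a minimal Python sketch; the names (`RMInput`, `support_solution`, and so on) are illustrative assumptions, not the prototype’s actual identifiers.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class RMInput:
    """The four RM inputs; field names are illustrative, not the prototype's."""
    case: str                                  # textual case information shown to the expert
    support_solution: str                      # one-word solution proposed by the DSS
    user_solution: str                         # one-word solution submitted by the human user
    lookup_table: Dict[str, Dict[str, float]]  # solution -> {feature: association weight}
```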
Using these inputs, the prototype should produce feedback for the user. To this end, n modules map the input to intermediate output. To function within the prototype, every module must satisfy the following constraint, effectively mapping a subset of the above-defined input to a list of potential questions and a set of weights representing their value:
$$\left(\subseteq \left[\text{case},\ \text{support solution},\ \text{user solution},\ \text{lookup table}\right]\right) \rightarrow \left[\left(\text{question},\ \text{keyword},\ \text{confidence},\ \text{multiplier}\right),\ \ldots\right]$$
where a complete entry in the list of RM questions is defined as a tuple of (see the code sketch after this list):
- A generated question, as it would be presented to the user,
- An associated feature (keyword) that helps the aggregator identify similar questions,
- A critiquing weight. The exact meaning of this weight may vary across implementations and problem domains. For this prototype, the weight is a combination of two components. The first is a measure of confidence in the user’s solution, where a score of 100% means the module is certain the user’s answer is correct, and a score of 0% means the module is certain the user is wrong. The second is a multiplier derived from the feature itself. Without this multiplier, all features pertaining to a single solution would carry the same critiquing weight. With it, features of greater significance (i.e., those showing a greater difference between compared solutions) are more likely to be used for falsification than features of lesser significance.
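A module satisfying this constraint could be typed as follows. This is a hypothetical sketch that reuses the `RMInput` structure above; the class and method names are assumptions, not the prototype’s code.

```python
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class CandidateQuestion:
    """One entry in a module's intermediate output."""
    question: str      # the question as it would be presented to the user
    keyword: str       # associated feature, used by the aggregator to group similar questions
    confidence: float  # confidence in the user's solution: 1.0 = certainly right, 0.0 = certainly wrong
    multiplier: float  # feature-derived multiplier; more distinctive features weigh more heavily

class Module(Protocol):
    """A module maps (a subset of) the RM input to a weighted list of questions."""
    def run(self, rm_input: "RMInput") -> List[CandidateQuestion]: ...
```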
Finally, the intermediate output of the modules is aggregated to determine the single question to ask the user. This could be done, for instance, by averaging the weights that different modules assign to each question, by combining the output of multiple modules, or by taking the single best option presented by one module individually, depending on which output proves most desirable and effective. Since this is largely unknown, for this prototype the weights are used to determine the single best question to present to the user.
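Building on the sketch above, one possible aggregation step is shown below. The scoring rule (1 − confidence) × multiplier is our assumption of how the two weight components could be combined, not the prototype’s definitive method.

```python
def aggregate(candidates: List[CandidateQuestion]) -> str:
    """Pick the single question to present to the user.

    Assumed scoring rule: (1 - confidence) * multiplier, so the question
    targeting the feature the RM trusts least, weighted by how distinctive
    that feature is, wins.
    """
    best = max(candidates, key=lambda c: (1.0 - c.confidence) * c.multiplier)
    return best.question
```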
Creating domain knowledge
The domain knowledge available to the prototype was a manually created data set based on the case descriptions, as explained above. The relations between solutions and features were extended where necessary with data from trusted Dutch self-help websites such as https://richtlijnen.nhg.org/ and https://www.thuisarts.nl/. The domain knowledge can be represented by a simple m by n matrix with columns for each known feature (e.g., symptoms in a medical task) and rows for each known solution. Numerical values record the relation of each feature to each solution. Table 1 shows an example of such a matrix.
Table 1 Example of domain knowledge represented in an m by n matrix

Positive numbers indicate that a feature is known to be related to a solution, negative numbers that it is known to be unrelated, and zeros denote unknown relationships. The bounds of the weights are arbitrary and will likely not be identical across implementations. Since they affect how an RM produces feedback, the bounds should be consistent across all data sets used with a particular implementation.
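Such a matrix is straightforward to hold in code. In the following sketch, the feature names, solution names, and values are invented for illustration only.

```python
import numpy as np

# Rows are solutions, columns are features (all names and values invented).
# Positive = related, negative = unrelated, 0 = unknown relationship.
features = ["fever", "abdominal pain", "reduced appetite"]
solutions = ["winter flu", "constipation", "teething"]

domain_knowledge = np.array([
    [ 0.9, -0.6,  0.4],  # winter flu
    [-0.5,  0.8,  0.3],  # constipation
    [ 0.2,  0.0,  0.7],  # teething
])
```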
The distribution of (either positively or negatively) associated features per case and solution is shown in Fig. 2. This distribution is not perfect but suffices for the implementation. The DSS suggests the following solution per case:

- Case 1: D
- Case 2: B
- Case 3: C
- Case 4: D
For this implementation, the values of the matrix were assumed to represent how often a feature i is associated with a solution a, meaning that the weights of the lookup table can be defined as:
$$weight = \frac{x}{n}$$

where $x$ is the number of times feature $i$ was associated with solution $a$, and $n$ is the number of recorded instances of solution $a$.
Since the actual values of x and n are unknown, the values were simulated to represent the created data set while maintaining the [0, 1] bounds. Each non-zero relation from the data set was assigned randomly from a truncated normal distribution (Burkardt, 2014):
$$\varphi \left(\mu ,\delta ,x\right) = \frac{1}{\delta \sqrt{2\pi }}\,{e}^{-\frac{{\left(x-\mu \right)}^{2}}{2{\delta }^{2}}}$$
Using a μ (mean) of 0.5 and a δ (standard deviation) of 0.5, all values drawn from the truncated normal distribution lie within our bounds [0, 1], with an overall mean of 0.5.
$$weights \leftarrow \varphi \left(0.5,\ 0.5,\ x\right), \quad x \in \left[0.0,\ 1.0\right]$$
This method produces a randomisable set of simulated associations between features and solutions, while adhering to the known relations between features and solutions shown in Fig. 2.
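Assuming δ denotes the standard deviation, this sampling step can be reproduced with SciPy’s `truncnorm`, which expects the truncation bounds in standardised units. A minimal sketch:

```python
import numpy as np
from scipy.stats import truncnorm

def simulate_weights(n_relations: int, mu: float = 0.5, sigma: float = 0.5) -> np.ndarray:
    """Draw simulated association weights from a normal distribution
    truncated to [0, 1], with mean 0.5 (sigma taken as standard deviation)."""
    a, b = (0.0 - mu) / sigma, (1.0 - mu) / sigma  # bounds in standardised units
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, size=n_relations)

weights = simulate_weights(10)  # e.g. ten non-zero relations from the data set
```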
Determining counter questions
To arrive at a helpful counter question, the RM takes its confidence in the human’s decision into account and identifies features that might have been overlooked, misinterpreted, or that point towards other solutions. For this, each intelligent module (1, 2 and 3) compares the user’s solution with part of the RM input: for the correctness module (1), the DSS solution; for the consistency module (2), the case information; for the differential module (3), a solution generated by the RM itself based on the case information. Since the purpose of this research was not to generate the best alternative solution, the differential module (3) received the DSS support solution instead. This distribution of inputs means each module receives two groups of features: either two solutions with associated features in the domain knowledge, or a solution and the complete case description, which can be matched against the domain knowledge table. For the sake of readability, the group of matching features from a case description used by the consistency module (2) will also be referred to as a ‘solution’.
To determine the value of a single feature for effective RM feedback, two quantities must be computed: a measure of confidence in the user’s solution, and a weight describing the value of a specific feature of that solution. The goal is to find the feature that the RM has least confidence in (or, alternatively, that best explains why a solution is ‘bad’). To obtain a measure of confidence in a solution a, each of its features can be compared to those of a competing solution b. The sum of the absolute differences between the features gives a total distance between solutions a and b. Dividing the total distance by the total number of evaluated features n gives a degree of dissimilarity between solutions a and b, and subtracting that from 1 produces a similarity score:
$$distance\,\left(a,b\right) = \sum_{i=0}^{n} \left|\,a\left[i\right] - b\left[i\right]\,\right|$$
$$similarity\,\left(a,b\right)=1- \frac{distance\,(a,b)}{n}$$
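In code, these two definitions amount to a few lines. This sketch assumes the solutions are given as NumPy feature vectors of equal length n.

```python
import numpy as np

def distance(a: np.ndarray, b: np.ndarray) -> float:
    """Total absolute difference between two feature vectors."""
    return float(np.abs(a - b).sum())

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric similarity: 1 minus the mean per-feature distance."""
    return 1.0 - distance(a, b) / len(a)
```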
A greater similarity score implies more shared features and can thus be seen as a verdict of confidence in solution a. This method yields a symmetric similarity measure (i.e., the distance from a to b equals the distance from b to a). However, for us it was more interesting to use an asymmetric measure as this allowed for experimentation within modules and different aggregation methods. To obtain this asymmetric similarity score, we only measure the distance for features of which the value for a is known, i.e., not 0:
$$similarity\,\left(a,b\right) = 1 - \frac{distance\,\left(a\left[\text{where } a \neq 0\right],\ b\left[\text{where } a \neq 0\right]\right)}{length\,\left(a\left[\text{where } a \neq 0\right]\right)}$$
Because the lookup table is bound to [0, 1] and case information mentioning the known absence of a feature is translated to a continuous value between −1 and 0, the maximal distance per feature is 1 − (−1) = 2 and the minimal distance is x − x = 0. For computing the overall confidence, a bound of [0, 2] is not suitable: the maximal distance between solutions a and b would then be 2 times the number of features, which results in a similarity score of −1 rather than a minimal score of 0:
$$similarity\,\left(a,b\right)=1- \frac{2n}{n}= -1$$
Hence, while the distance per feature is bound to [0, 2], it is more practical to bound the total distance to [0, n], which in turn bounds the similarity score to [0, 1]. To achieve this, the distance per feature can be divided by 2 before addition, or the total distance can simply be divided by 2. Finally, multiplying the similarity value by 100% yields a percentage confidence measure that is both rationally bound to its meaning and useful for determining the value of a module’s feedback.
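Putting the asymmetric measure and the halved distance together, the confidence computation could look as follows. This is a sketch under the stated bounds, with a and b again as NumPy feature vectors.

```python
import numpy as np

def confidence(a: np.ndarray, b: np.ndarray) -> float:
    """Asymmetric confidence in solution a, as a percentage.

    Only features known for a (non-zero) are compared. Per-feature
    distances lie in [0, 2] (lookup weights in [0, 1], known-absent case
    features in [-1, 0]), so the total distance is halved to keep the
    similarity, and hence the confidence, within [0, 100].
    """
    known = a != 0                                   # restrict to features known for a
    dist = np.abs(a[known] - b[known]).sum() / 2.0   # bound total distance to [0, n]
    return (1.0 - dist / known.sum()) * 100.0
```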