With all the information from Sect. 3, we claim that explaining ML practice has two components:
(a) a description of the way SaMD-ML has been developed and constructed;
(b) a motivation for the fundamental technical choices made by the ML practitioners.
Complying with (a) alone will yield a long list of technical requirements and specifications, spelled out as neutral, step-by-step recipes. Formulating (b) means also providing reasons why the technical choices made are the best ones and why they result in an overall effect function that actually realizes the purpose function. It has been shown that documenting technical choices is a neglected practice [15, 29], and motivating them is even more so. Moreover, the nature of technical choices reveals that the practice of ML is replete with value judgements. How we motivate technical choices is shaped by technical constraints, but it is not limited to them: we claim that value-laden judgements are inevitable in ML practice, given that technical choices are underdetermined [13]. In this section we first describe a pipeline showing how (a) can be formulated (Sect. 4.1), then clarify in which sense technical choices are necessarily shaped by values (Sect. 4.2), describe (b) in detail (Sect. 4.3), and finally characterize the limitations of our approach (Sect. 4.4).
Documenting ML practices
Decomposing the overall effect function by identifying modules, based upon the practices used to develop SaMD-ML for a specific purpose, can be done in different ways. Chen et al. [5] characterize a pipeline for healthcare, and we extend it with best practices from data science, e.g., by distinguishing data understanding and preparation from data collection and model development. This builds upon the FDA’s AI Model Development and gives a functional decomposition of the aspects of ML that can explain an overall effect function of SaMD-ML from the point of view of design choices. We characterize the design and development of SaMD-ML as six stages, which we identify as ‘modules of the overall effect function’ (Fig. 2).
First, there is what Chen et al. [5] call problem selection (i.e. understanding and definition). In developing SaMD-ML, one has to choose carefully which prediction or other ML tasks the SaMD-ML will perform in meeting the purpose. Although the ML task is driven by the sociotechnical processes constraining it, the first step of the process is to understand those needs and constraints and to formally define the problem to be addressed, i.e., to define the purpose and overall effect function. This choice has important ramifications for the other phases. As a working example, consider the problem of building an ML system to extract diagnostic information from Electronic Health Records (EHR). The purpose could be to extract an explicit diagnosis; to extract information sufficient for diagnosis (such as test results), even if not previously coded; to create cohorts or subpopulations for targeted treatment; to select individuals (or medical institutions) for clinical trials or retrospective investigations; etc. The effect function specifies a diagnosis given the data, typically with a predictive model (Footnote 5).
The second module is data acquisition. Acquiring training data for ML adds complexity beyond the task taken in isolation, depending on the problem one has selected. Even if data exists in some form within organizations, additional steps are needed to select the data and ensure it is fit for the purpose. For instance, if one needs data to train a model for identifying the possibility of a rare disease from medical records, then one must train on many examples of records with the disease, far more than would be proportionate to the population. Moreover, depending upon the purpose, one may need different kinds of negative examples: e.g., the occurrence and results of different tests would vary in their relevance depending upon whether the system is extracting a diagnosis or predicting a disease in the absence of a formal diagnosis.
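To make the rare-disease example concrete, here is a minimal sketch, in Python, of how one might check whether a candidate training set over-represents a rare condition relative to its population prevalence. The file name, column name, prevalence, and target share are illustrative assumptions, not part of the example above.

```python
# Minimal sketch: compare the share of positive (rare-disease) records in a
# candidate training set against the population prevalence. The file and column
# names ("ehr_training_candidates.csv", "has_rare_disease") are hypothetical.
import pandas as pd

POPULATION_PREVALENCE = 0.001   # assumed prevalence of the rare disease
TARGET_TRAINING_SHARE = 0.20    # share of positive records we decide we need

records = pd.read_csv("ehr_training_candidates.csv")
positive_share = (records["has_rare_disease"] == 1).mean()

print(f"Population prevalence:        {POPULATION_PREVALENCE:.3%}")
print(f"Positives in candidate data:  {positive_share:.1%}")
if positive_share < TARGET_TRAINING_SHARE:
    print("Acquire or oversample more positive records before training.")
```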
The third stage/module is data understanding and preparation, which includes characterizing the data, especially with respect to quality (e.g. cleaning, transforming, reducing), to prepare it for analysis and modeling. Although glossed over by Chen et al., this stage is well characterized in data science and data mining [27]. Data preparation is also constrained by the requirements of the modeling algorithms to be used and, for supervised approaches, by the need to divide the data into developmental (e.g., model training and tuning) and testing (or clinical validation) sets. In unsupervised ML, the data scientist may need to understand the biases that can occur, such as gender or race imbalances, and adjust or augment the datasets.
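The following is a minimal sketch of this splitting step, using scikit-learn, together with a check of whether a sensitive attribute is roughly balanced across the splits. The column names ("label", "sex") and the split proportions are illustrative assumptions.

```python
# Minimal sketch: split prepared data into developmental (training + tuning)
# and held-out test sets, then inspect the balance of a sensitive attribute.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("prepared_records.csv")   # hypothetical cleaned dataset

dev, test = train_test_split(data, test_size=0.2, stratify=data["label"], random_state=0)
train, tune = train_test_split(dev, test_size=0.25, stratify=dev["label"], random_state=0)

for name, split in [("train", train), ("tune", tune), ("test", test)]:
    # A large discrepancy here may call for re-sampling or data augmentation.
    print(name, split["sex"].value_counts(normalize=True).round(2).to_dict())
```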
The fourth module is model development, the phase where the algorithm is run on the data to create the model or other ML construct, i.e., where it is trained and tuned (in the narrower, technical sense of training) with the developmental data sets. This is the module/stage where most of the choices about the ‘model architecture design’ take place. In designing the architecture, parameters and hyperparameters are fundamental. Parameters are usually defined as the internal values of the model, estimated from the data during training. Hyperparameters, by contrast, are set before training: the number of parameters that can be estimated from the data would be a hyperparameter, and hyperparameters should be appropriate to the dataset size and complexity, since they influence how parameters are learned during training. Examples include k in k-nearest neighbors or k-means, the number and size of hidden layers in a neural network, etc. Opening up the black box of training requires explaining why certain hyperparameters were chosen, e.g., whether a value was a convenient default or the result of substantial tuning efforts with a specific aim.
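As an illustration of the kind of choice that needs documenting, the sketch below tunes k for a k-nearest-neighbors classifier by cross-validation instead of accepting the library default. The dataset is a public stand-in for developmental data, and the scoring metric is a placeholder whose choice is itself value laden (see Sect. 4.3).

```python
# Minimal sketch: record whether a hyperparameter (here k in k-nearest
# neighbors) was a convenient default or the result of deliberate tuning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in for the developmental data

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 11, 15]},
    cv=5,
    scoring="recall",   # the scoring metric is itself a value-laden choice
)
search.fit(X, y)
print("Chosen k:", search.best_params_["n_neighbors"],
      "(selected by 5-fold cross-validation rather than the default k = 5)")
```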
Next, there is the stage of validation of performance and interpretation for the intended use (testing, or clinical validation), where metrics similar to those used in tuning are evaluated on the test set, but with additional emphasis on clinical results. Chen et al. clarify the distinction between validation in ML (what they and we call tuning) and validation in a clinical setting, though we would add that clinical criteria should also inform the selection of the tuning metrics used in model development. Thus, clinical validation criteria can also affect model development, especially via model tuning.
The last module/stage is assessment of model impact and deployment and monitoring. As Chen et al. note, “a performant model alone is insufficient to create clinical impact” (p 413). From a data science perspective, deployment is a prerequisite for assessing impact. Reconciling the clinical and ML perspectives makes clear that the model should be deployed in as realistic a setting as possible to assess impact, and that many follow-up steps may be required for regulatory approval. What we have to show here is how design choices, e.g., of a visualization or user interface for a ‘dashboard’, afford a range of usage modalities. Moreover, design choices can facilitate or impede the integration of the tool into specific clinical workflows.
Value judgement in training
Section 4.1 shows that many aspects of ML practice have to be documented. We have mentioned throughout this article that some of the choices made in the process are pervaded by value judgements. What does this mean exactly, and why does it matter? To explain this, we draw a parallel between the context of this article and the problem of theory choice in philosophy of science. The context of this article is understanding the reliability of ML tools in medicine. We have translated this problem into the problem of establishing the best design for a SaMD-ML given a specific context of deployment and a given intended use. This problem, in turn, can be addressed by understanding which technical choices practitioners should make in order to design the best SaMD-ML. We think these questions can be approached by considering design/technical choices through the lens of the problem of theory choice in philosophy of science, and this will show in which senses technical choices are value laden.
How scientists choose among theories/hypotheses is highly contentious. Ideally, theory choice is determined by criteria that scientific theories should meet in order to be considered ‘good scientific theories’. Various lists are present in the literature, but they are all variations on Kuhn’s preliminary list [22], which includes predictive accuracy, internal coherence, external consistency, unifying power, and fertility. When we say that theory choice is ideally determined by these criteria, we mean that it would be very convenient if they functioned as rules: a theory/hypothesis with more unifying power is better than one with less, etc. If theory choice functioned as an algorithmic procedure, one would be able to apply those criteria unambiguously. However, Kuhn argues that this is a misleading idealization, because the criteria are imprecise (i.e. individuals disagree on how to apply them in concrete cases) and they conflict with one another. Therefore, Kuhn concludes that “the criteria of theory choice with which I began function not as rules, which determine choice, but as values which influence it” [22, p 362]. McMullin, in a well-known paper [28], makes similar considerations: theory choice is a procedure close to value judgement, meaning that it is not an unambiguous procedure determining which choice is best, but a propensity to consider certain characteristics as more desirable. In addition to the difficulties of deciding which epistemic value is most important for a theory in a given context, it has also been shown that nonepistemic values, such as social and moral values, shape theory/hypothesis choice. In particular, because of the gap between hypothesis and evidence, one faces inductive risk, namely the risk of accepting a hypothesis that turns out to be false or rejecting one that turns out to be true; and deciding whether to accept or reject the hypothesis is usually a function of the seriousness of making that mistake, where ‘seriousness’ can be evaluated from the point of view of nonepistemic values [11, 12, 18, 31].
Both the arguments from epistemic and from nonepistemic values point to an underdetermination: any body of data is insufficient to determine which theory is best, and we need to resort to pragmatic considerations (informed by various values, both epistemic and nonepistemic) to justify theory choice convincingly. But there is evidence that ML faces analogous problems [13]: throughout the ML pipeline, technical choices are underdetermined and practitioners resort to value-laden considerations. Thus, a step-by-step description will not suffice to document the development of SaMD-ML. To understand what is really behind a tool, we must understand the values shaping and constraining its development, because any design is necessarily shaped by technical choices that are influenced by values.
But what exactly is a value here? McMullin argues that something counts as a value in a specific entity if “it is desirable for an entity of that kind” (p 5). Why something is desirable can vary a great deal; in the case of epistemic values in science, they are values because they are conducive to truth. But here we are in a different context. Technical choices are value laden when the reasons for making them are based on the expectation that they will promote certain valuable characteristics of the process or, in our case, of the SaMD-ML. There are two kinds of valuable characteristics of the practices that we think are relevant:
- there are characteristics of ML practices that simply make SaMD-ML a more reliable tool given certain performance goals. Some of these will be subordinated to the purpose function (e.g. usability, as we will see), while others will be independent of it (e.g. the metrics London refers to). In other words, some technical choices will allow SaMD-ML to meet certain performance metrics better than others. We call these values performance-centered values;
- there are characteristics that we find desirable because they result in an overall effect function that does not harm data subjects, or from which data subjects may benefit. Characteristics that we find desirable from this perspective are shaped by social, political, and moral values (Footnote 6).
In the next section we will identify some of these values in the different modules of the ML practice (see Table 1 for a summary).
Table 1 Examples of values in the training process

Identifying values in ML practices
Let us now see in detail how values and technical choices influence each other in every single phase (i.e. module) described in Sect. 4.1.
The first module (i.e., problem understanding and definition) requires understanding what is needed and making choices about how to define the problem. The overall effect function aims to solve the problem statement (or answer the posed questions) for the stated purpose. Performance-centered values include consistency between the constructs of the problem statement and external needs, as well as its internal coherence. Moreover, choosing the problem to address incorporates a number of human biases and assumptions (including well-reasoned clinical ones) that need to be made explicit, especially as the project most likely requires cross-disciplinary collaboration.
In data acquisition, data availability is a performance-centered value playing a key role. But we want to ensure not only that there is enough data to train properly, but also that the data is sufficiently representative, given the particular goal. Going back to the example of asthmatic patients, it may be difficult to find data sets about patients that are independent of the medical care received, and one may need to redefine the problem to account for the diagnostic information and resulting treatments actually available. How much value is placed on available data versus acquiring more data will determine how SaMD-ML is developed to address a problem. In addition, the racial and ethnic diversity of training data often fails to match the diversity of the population, a problem in both ML and clinical trials [16, 20]. Moreover, given that there are different ways of thinking about representativeness that cut across different social factors, the values of data availability and representativeness/inclusiveness (understood as an ethical and social value) can sometimes stand in tension. Although data may appear sufficiently representative, it might come only from well-served and wealthy areas. But collecting data from underserved areas may mean processing poor-quality data, possibly obtained via optical character recognition (OCR) of scanned records, with many gaps. We would then have to justify the use of this problematic data by saying that we want to promote an inclusiveness that cuts across not only different ethnic groups, but also groups with different incomes. Moreover, in the example of rare diseases in Sect. 4.1, rather than just listing the dataset used, one should sometimes explain the decision to oversample from a disease group (or other underrepresented subpopulation) in terms of the desired effect on the model. Finally, data acquisition requires identifying legal issues around ownership, ethical concerns about privacy and implicit bias, and technical challenges in balancing data representation and identifying underlying changes over time.
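To illustrate how availability and representativeness can be weighed against each other, here is a minimal sketch that compares the demographic makeup of acquired data with a reference population. The group labels, reference shares, and file/column names are illustrative assumptions.

```python
# Minimal sketch: flag demographic groups that are under-represented in the
# acquired data relative to a reference population. All names and figures
# below are hypothetical.
import pandas as pd

reference_shares = {"group_a": 0.60, "group_b": 0.25, "group_c": 0.15}

records = pd.read_csv("acquired_records.csv")
observed = records["ethnic_group"].value_counts(normalize=True)

for group, expected in reference_shares.items():
    got = observed.get(group, 0.0)
    status = "UNDER-REPRESENTED" if got < 0.5 * expected else "ok"
    print(f"{group}: expected {expected:.0%}, observed {got:.0%} -> {status}")
```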
In data understanding and preparation, conflicting values may lead to different choices. The value of inclusion/representativeness can stand in tension with ‘data quality’ (a performance-centered value): varying the threshold separating acceptable from low-quality data may lead to a more or less inclusive SaMD-ML. Another important aspect is how we define, especially in the medical context, a reference truth for our tools; this, according to Chen et al., requires clinical judgement, and we should make explicit the performance-centered values determining what counts as a good reference truth. Eliminating statistical outliers may lead to cleaner data and more useful models, but the corresponding outliers in the overall population might indicate subpopulations that should be considered separately.
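The following minimal sketch illustrates the point about outliers: a conventional z-score cut-off cleans the data but may disproportionately remove records from particular subgroups. The threshold, file, and column names are assumptions made for illustration.

```python
# Minimal sketch: apply a conventional z-score cut-off for outliers, then check
# which subgroups the cut disproportionately affects before discarding records.
import numpy as np
import pandas as pd

data = pd.read_csv("lab_results.csv")   # hypothetical dataset
z = (data["lab_value"] - data["lab_value"].mean()) / data["lab_value"].std()

kept = data[np.abs(z) <= 3]
dropped = data[np.abs(z) > 3]

print("Records dropped:", len(dropped))
print(dropped["ethnic_group"].value_counts(normalize=True).round(2))
```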
In model development, we face similar conundrums. In choosing an algorithm, we may favor one that generates complex models, but if we do not have enough data, we may want to opt for a less complex algorithm. Here the performance-centered values of data availability and complexity constrain one another. But choosing an algorithm also depends on the data modalities involved (2D images, 3D volumes, lab measurements, texts, etc.). For data consisting of simple or engineered features such as clinical characteristics, linear/logistic regression, support vector machines, and decision trees are considered accurate, while for images convolutional neural networks seem to be the norm [5]. Which characteristics are we looking for in an algorithm given a particular data modality? Other values central to developing a model are similarity (i.e. the model should accurately reflect the phenomena of interest) and generality (i.e. the model performs robustly with respect to novel data), and sometimes these stand in a tradeoff relation. Because the output of the model contributes directly to the overall effect function, the adaptability and opacity of ML models are central to what needs explaining. Our approach differs significantly in this respect from London and other approaches to transparency or XAI. Rather than decompose how the model generates the effect function, or calibrate it to empirical clinical results, we emphasize justifying the inputs to training, following good practices in selecting and tuning the algorithm, arguing that the model is appropriate for the purpose, and validating the model outputs both in ML terms and clinically.
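A minimal sketch of the complexity-versus-data-availability trade-off mentioned above, using a public dataset as a stand-in for a small clinical sample: with few examples, a simpler model may cross-validate as well as or better than a more complex one, and the comparison itself is something to document. The sample size and models are illustrative assumptions.

```python
# Minimal sketch: compare a simple and a more complex model on a small sample
# by cross-validation. The dataset and sample size are illustrative stand-ins.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_small, _, y_small, _ = train_test_split(X, y, train_size=80, stratify=y, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=5000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    auc = cross_val_score(model, X_small, y_small, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```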
The stage of validation is replete with ethical and social values shaping performance-centered values, especially considerations of inductive risk. Metrics will change depending on how the SaMD-ML deals with conditions and levels of risk. One may choose metrics minimizing either false positives or false negatives depending on both the purpose function and ethical and social values, e.g. minimizing false positives when diagnosing nonthreatening diseases with costly follow-up, or minimizing false negatives when there is effective early treatment for a serious or contagious disease. It is frequently necessary to trade off reducing false positives against reducing false negatives, as it may not be possible to reduce both simultaneously, and these tradeoffs can be captured by the metric pairs precision–recall and sensitivity–specificity. The two tradeoffs are similar, with sensitivity mathematically identical to recall, and both increase with a reduction in false negatives. The difference between the tradeoffs is that, while specificity and precision both increase with fewer false positives, specificity takes true negatives into account whereas precision takes true positives into account. Precision works well in information retrieval to measure relevance when retrieving a few documents out of potentially millions, but specificity works better when modeling a high-prevalence disease to identify healthy individuals (i.e., true negatives for the disease). There is no technical rationale for choosing among these metrics, and how to adjust the tradeoffs may depend heavily upon performance-centered as well as social/moral values. If the purpose of the SaMD-ML involves diagnosing a disease, then sensitivity–specificity may be the better tradeoff, while if the system is retrieving disease information from patient records, then precision–recall might be better. Design of the SaMD-ML involves not only selecting one of the tradeoffs, but also justifying why that tradeoff was selected for model validation over the other.
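To make the relationship between the two metric pairs explicit, here is a minimal sketch with illustrative confusion-matrix counts (not taken from any real system):

```python
# Minimal sketch: sensitivity equals recall, while specificity and precision
# differ in whether true negatives or true positives enter the calculation.
# The counts below are invented for illustration.
tp, fp, tn, fn = 80, 40, 860, 20

sensitivity = recall = tp / (tp + fn)   # both rise as false negatives fall
specificity = tn / (tn + fp)            # credits correctly identified negatives
precision = tp / (tp + fp)              # credits correctly identified positives

print(f"sensitivity / recall = {sensitivity:.2f}")
print(f"specificity          = {specificity:.2f}")
print(f"precision            = {precision:.2f}")
```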
Finally, there is model impact and deployment and monitoring. In this module, the user interface of the tool is mostly driven by the performance-centered value of usability. In other words, designers should motivate the interface by describing how it affords usability. But usability is not everything. We also want to make sure that the interface does not mislead users into thinking that they have obtained a particular piece of information when in fact they have not; this is a matter of graphical integrity. When presenting a relative quantity, such as a relative difference, ratio, or percentage, it is important that the reference value reflect the user’s expectation rather than a modeling artifact, because otherwise the purpose function will be impacted. Therefore, motivating why the interface does not mislead users is fundamental.
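A minimal numeric sketch of the graphical-integrity point: the same model output reads very differently depending on which reference value is used for a relative change. All numbers below are invented for illustration.

```python
# Minimal sketch: the same estimated risk presented against two different
# reference values. All numbers are invented for illustration.
model_risk = 0.03           # risk estimated by the model for this patient
population_baseline = 0.01  # reference the clinician likely expects
cohort_baseline = 0.025     # internal modeling artifact (e.g. training-cohort mean)

print(f"vs population baseline: {100 * (model_risk / population_baseline - 1):.0f}% higher")
print(f"vs cohort baseline:     {100 * (model_risk / cohort_baseline - 1):.0f}% higher")
```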
Limitations of the account
This overview of how values can influence different choices must not be thought of as complete and definitive. There are at least three limitations to what we have described in this section. First, we have characterized the relations between values mostly in terms of tradeoffs. However, values can stand in many different types of relation. Let us indulge for a moment in an analogy and think of values as mathematical variables: just as there are different types of relationships between mathematical variables (e.g. linear, nonlinear, discontinuous, nonmonotonic), so there can be different types of relationships between values. The second limitation is that it is not entirely clear how the values we described have been selected. We base the identification of values on our professional and research experience, but we acknowledge that introspection is not a substitute for a real methodology, at least not in the long term. Finally, our characterization of the interplay between values and technical choices is an idealization, especially because we have represented the data-aware machine learning pipeline as a process in which one person makes all the decisions; however, the practice of ML is a social practice involving different actors and stakeholders [2, 26], which means there will be negotiations of values among different individuals. These limitations can be overcome in future work. This is a philosophical framework for a much bigger, empirical project. We hope later to use qualitative methods (e.g. interviews, ethnographies) to enrich our list of values, the relationships among them, and our characterization of negotiations among stakeholders.
Despite these limitations, we think that our framework has substantial advantages over existing frameworks that aim to provide tools for identifying values in the practice of data science. Here we mention only a few of these frameworks, with no presumption of completeness. For instance, Loi et al. [25] provide a contribution similar to ours, in particular with the idea of ‘design explanation’, which is an explanation of the goals of the system together with information about why the design of the system is the way it is and the norms and values guiding it. Another example is Selbst and Barocas [32]. They argue that we should understand the “values and constraints that shape the conceptualization of the problem (..) how these (…) inform the development of machine learning models (…) how the outputs of models inform final decisions” (p 47). They propose, as others do [23, 30], to consider ‘documentation’ of the training process as a way to explain and regulate ML tools. These are all valuable contributions, but when they address values, it seems to us that they refer especially to social/ethical values, while technical/performance-centered values are neglected. What generally counts as a ‘technical preference’ is seen as less problematic than ethical/social values, even though in principle such preferences are values too. This assumes that practitioners largely agree on ‘technical values’, while social/ethical values are somehow arbitrary and hence need to be discussed more thoroughly. However, this is largely misleading: ‘technical values’ (i.e. performance-centered values) are no less values than ethical/social values (Sect. 4.2), and they pose the same problems as ethical/social values. They are preferences, they can be seen as arbitrary, they can be endorsed without awareness, etc. The only exceptions to this trend that we could identify are the work of Birhane et al. [3], who provide a list of 67 values that influence ML research to various degrees, including technical values, and that of van de Poel [35], who formalizes technical norms within sociotechnical AI systems. While their work shows the possibilities opened up by more ‘empirical’ approaches, we think it does not engage enough with the connections between specific technical choices and the values they identify.