Who Is a Good Data Scientist? A Reply to Curzer and Epstein

A central distinction in Curzer and Epstein (2022) is the one between “protect the disadvantaged” and “protect the data”. This can open up discussions about the relationship between ethics and epistemology in the practice of science. Focusing on the disadvantaged to the exclusion of good scientific practices, Curzer and Epstein argue, can harm everyone impacted by medical science, including the disadvantaged. For this reason, they propose that “ethical data scientists should strive for accurate data and scientifically sound data analysis” (2022, p 2) with attention to minimizing data processing errors, bias, and outside influence, and that includes identifying errors caused by tendencies to neglect disadvantaged and historically underrepresented communities and groups. While we agree with several points made by Curzer and Epstein, we also have three main points of concern.

"Ratti and Graves' use of the capabilities approach as well as their examples, leave the impression that ethical data science primarily consists in identifying and avoiding data management decisions that inappropriately negatively impact the disadvantaged" (p 2).
First, it should be noted that our framework has the explicit goal of shaping all phases of the data science pipeline, and data management phases are only a fraction of it. In fact, we expand our analysis in . Next, our focus on the "disadvantaged" is an artifact of the proof-of-concept that we develop: it is easy to illustrate the impact on the capabilities of those who are blatantly and unequally affected by mechanisms of social injustice. Moreover, the "inappropriately" is out of place: when describing moral attention, we explicitly say that it is not a full-blown virtue, as "moral attention may not be effective in choosing and acting, but only in reasoning" (Ratti & Graves, 2021, p 1828). This means that moral attention consists only in realizing that there is an impact on capabilities; whether this impact is positive or negative (appropriate or inappropriate) is left open, and it will depend on either other virtues or the participatory design sketched at the end of the article.
On the same page, Curzer and Epstein also add that "It is also that focusing on a subgroup distorts results of data management and undermines its goals of advancing knowledge." (p 2) We agree that focusing on "a subgroup" at the expense of others can harm everyone, but we would argue that the process still needs to focus on impinged capabilities. In agreement with the commentary authors, we would argue that data scientists should not focus on the group they initially believe is disadvantaged, but should focus on the impinged capabilities of everyone, recognizing that those with the greatest impinged capabilities would, as a consequence, be considered disadvantaged.

Ethics and Epistemology, or Science and Values
In the commentary, emphasis is placed on accuracy for the integrity of data science, and on how this is important for advancing knowledge (p 2). However, the authors treat "advancing knowledge" as a purely epistemic goal achievable by purely epistemic means, and we view this as a controversial thesis. Because science always operates in a situation of uncertainty, risks of epistemic error arise everywhere in scientific practice. There is a rich literature in philosophy of science (known through various labels such as 'inductive risk' and 'epistemic risk') focusing on connections between scientific choices and ethical considerations (Douglas, 2009; Elliott & Richardson, 2017). Risks must be managed and balanced in light of values and interests, and this makes value-laden choices in science inevitable: scientists proceed by balancing and managing risks and uncertainty via values (Ward, 2021). The case of choosing which data set to process first, whether EMR or OCR, is a good example. Assuming that resources are limited (which is a defining feature of the context in which medical data scientists operate), you cannot deal with EMR and OCR data sets with the same level of accuracy: this epistemic desideratum will lean in a certain direction. You have to make a choice. This can be based on efficacy ("I'll go with EMR because those data sets are bigger and easier to deal with") or on concerns about complete representation, which would include the capabilities of the disadvantaged ("It's likely that OCR data will be about them"). In both cases, we treat data accurately (or to the best of the means available), but the direction is shaped by values.
To make the same point a bit differently, how accuracy is modulated, and which direction should be pursued, is not a purely epistemic matter. Of course, if we had all the data in the world, all the computational power of the universe, and billions of data scientists, then we could strive for complete accuracy, but that is just not possible. This argument makes "values" inevitable, given our situation as limited beings with limited information. One can go further and argue that, beyond the problem of uncertainty, science per se has an ethical dimension intertwined with the epistemic one, as it can be value-promoting, in the sense that scientific choices promote certain values while simultaneously obfuscating others (Russo, 2021). Philosophy of technology has already explored this territory in the context of power relations (Winner, 1980). Even if we want to debate the value-free ideal of science on its own terms, we should consider that the over-reliance on epistemic characteristics, such as accuracy, is problematic. In fact, Curzer and Epstein seem to assume that accuracy has one and only one meaning, that there is one way to measure it, and that data scientists will agree on all these things. The history of science and technology has shown that this is not the case: epistemic desiderata are understood and operationalized in many, and sometimes mutually exclusive, ways (Kuhn, 1977). This means that epistemic virtues such as accuracy require value judgments in order to be operationalized (McMullin, 1983): such desiderata are indeed values, and one can even question the distinction between epistemic and non-epistemic values (Rooney, 1992). But even assuming we do have one notion, consider this. One cannot know to include something like transportation conversion factors in the model unless one realizes they could be a significant factor. It is debatable whether that requires moral attention or just heightened awareness of the social factors determining health, but it still requires cultural and/or moral awareness within the data science process that goes beyond pure technical proficiency.
However, Curzer and Epstein seem to imply, even further, that there is a dichotomy between ethics and epistemology. Putting ethics and epistemology in opposition through the "protect the data" approach has an interesting consequence. Given that ethical considerations must appear at some point, these will be external to the practice of data science. This is well formulated on page 4: "Rather our claim is that the appropriate point at which moral concern about the methodology and application of the study comes into play is not during the study, but rather before or after the study". We disagree that moral attention should only occur "before" or "after" the data work. We see this as an externality model of science and values of the kind that Longino (1990) criticized decades ago, arguing against the assumptions that ethics is completely external to science and that science within its internal activities is value-free: value-free science is not only descriptively false but also normatively problematic. The case of the diabetes intervention in our article is an example of why moral attention is needed during the project. It is clear that attention to the ethical dimension of data subjects also improves the study from the point of view of epistemic considerations alone, such as brute performance metrics. The same holds for missing data analysis.

The Social Context of Data Science
On pages 2-3, Curzer and Epstein say: "Recognizing that scientific errors can impede the health agency of large numbers of people in sometimes unpredictable ways, good data scientists try to imagine what scientific errors might be introduced by a proposed data management choice, and then take steps to avoid or ameliorate these errors". This claim is useful to introduce the importance of the social context in which data science is going to make a difference, and why ethics and epistemology are necessarily intertwined because of that context. If "scientific" is understood as "technical integrity," which the authors seem to identify with "data accuracy," then even processing accurate data may negatively affect the health agency of a large number of individuals. If you do this in a society where health injustice is systematic, like the USA, then data science tools will simply provide predictions that are informed by the same patterns of systematic injustice, and hence will impact the substantial freedoms of data subjects, as we have documented in our article. This is problematic because of the role that medical data scientists increasingly have. We claim that the profession of the medical data scientist inherits the same ethical and epistemic obligations that any member of the medical community has, which is to promote the well-being of patients. But because of the well-documented ripple effects on substantial freedoms that data science tools have, being a medical data scientist also implies a "protecting human agency" perspective, given that human agency as a substantial freedom is a necessary component of well-being. This can be preserved by our use of the capability approach and the sketch of the participatory design at the end of our article.
In other words, the "protect the data" position does not sufficiently acknowledge the values influencing all scientific endeavors, the broader social systems in which data science takes place, and the obligations of data scientists qua members of the medical community towards those social systems. We might agree with Curzer and Epstein that a scientist should not commit extra resources to one group over another beyond developing a representative model. However, we argue that the false separation of scientific and ethical practices fails to acknowledge that scientific practice occurs within a system that has ingrained biases, and that data scientists have ethical obligations with respect to those biases, as they impact the well-being of data subjects by being detrimental to their substantial freedoms. In healthcare in particular, a data scientist is tasked with developing models that represent the population and its healthcare needs, and that means not incorporating the systemic biases into the modeling framework.