Introduction

Data science has become a mature field of enquiry only recently, propelled by the proliferation of data and computing infrastructure. While many have written about the philosophical problems in data science, such problems are rarely unified into a holistic “epistemology of data science” (we avoid the more generic expression “philosophy of data science”; more on this presently). In its current state, this epistemology is vibrant but chaotic. For this reason, in this article, we review the relevant literature to provide a unified perspective of the discipline and its gaps; assess the state of the debate; offer a contextual analysis of the significance, relevance, and value of various topics; and identify neglected or underexplored areas of philosophical interest. We do not discuss data science’s GELSI (governance, ethical, legal, and social implications). These have already received considerable attention, and their analysis would lie beyond the scope of the present work, even if, ultimately, we shall point to obvious connections. We also limit our review to the epistemological debate concerning data science, without entering the related, significant, but distinct debates about the epistemology of data and the role of data in epistemology (including philosophy of science and scientific research). It seems clear that the epistemology and ethics of data science (in the inclusive sense of GELSI indicated above) may need to find a unified framework. Still, this article would be the wrong context to attempt such a unification.

Methodologically, we determined the scope of this epistemological analysis by a structured literature search in the sense indicated by Grant and Booth (2009) and detailed in the “Appendix”. Its content was therefore determined empirically. This empirically driven approach has a twofold motivation. First, there is now an area of enquiry explicitly labelled “data science”, which is something more than simply a general study of data, or a science which investigates “data” (indeed, almost anything might be understood as data science in one way or another, with such a broad characterisation). Whatever “data science” is (see Sects. 2 and 3), it has an associated epistemology, and fundamental philosophical issues have only recently begun to be explored, at least in a critical fashion, as a philosophy of data science. Second, since, as we noted above, this philosophy is still relatively fragmented, an empirically driven approach to what is already available in terms of current debates makes possible an overview of this emerging discipline, returning at least a first image of its landscape and contours.

Our findings partition the epistemology of data science into five areas, and the article is structured accordingly. We present them in a logical rather than chronological order. We begin in Sect. 2 with the characterisation of data science (what it is or should be), covering descriptive and normative accounts of this new discipline, including a reference to what data scientists do and should do. In Sect. 3, we analyse the debate about what kind of enquiry data science is. Next, in Sect. 4, we analyse the related debate about the nature and genealogy of the knowledge that data science produces. In Sect. 5, we concentrate on one of the most significant methodological issues in data science, the so-called “black box” problems, such as interpretability and explainability. Still on the methodological side, in Sect. 6, we outline the various explorations of the epistemically revolutionary new frontier raised by data science: the so-called “theory-free” paradigm in scientific methodology. In Sect. 7, we briefly summarise our analysis.

The characterisation of data science

This section reviews the most relevant definitions of data science proposed in the literature, spanning descriptive and normative alternatives. It concludes by offering a proposal that synthesises the most valuable elements of each.

Minimalist and maximalist characterisations

At the dawn of the twentieth century, statistics came to be recognised as an academic discipline worthy of its own journals and university departments. Technological advances in subsequent decades marked a definite break from theory-driven and inferential classical statistics. New approaches, such as bootstrapping and Markov chain Monte Carlo simulations, replaced strong parametric assumptions with brute computational power. Viewed from this perspective, machine learning algorithms—which automatically detect and exploit subtle patterns in large datasets—are simply the next logical step in a centuries-long progression toward ever more automated forms of empirical reasoning.

The question of when precisely these early forays into quantitative modes of analysis crystallised into what we now call “data science” presupposes that the discipline has some yet unspecified essential character. Although we are sceptical of any purported “solution” to the so-called “problem” of demarcation—in this area, as in science more generally—we observe two broad trends in the literature on this topic, which we shall label the “minimalist” and “maximalist” accounts (more on this below). As we shall see, minimalists aim for necessary conditions, as weakly constraining as possible but still carving out a unique space for data science. Maximalists strive for sufficient conditions with detailed ontologies and methodological taxonomies. Minimalist approaches characterise early debates on the nature of data science. Contemporary analyses tend to embrace maximalist approaches, identifying in data science a means to develop causal knowledge directly connected to the object of investigation.

Minimalist conceptions do not commit data science to any method or subject(s), and do not make any specific claims about what kind of discipline data science is. They focus only on the pedagogical aspects and their dependency on information and data. Chambers (1993) and Carmichael and Marron (2018) provide two examples of minimalist accounts. Chambers (1993) presents a “greater statistics” view of data science, characterised as “everything related to learning from data” (Chambers, 1993, p. 182, italics in the original). Similarly, Carmichael and Marron (2018, p. 117) claim that data science is “the business of learning from data” and that a data scientist is someone who “uses data to solve problems”.

Maximalist accounts are more fine-grained and qualify the discipline’s relation to data in general. Breiman (2001) gives a maximalist account with a single qualification, characterising data science as a subject interested in two broad classes, “prediction” and “information”. Prediction is “to be able to predict what the responses are going to be to future input variables”, while information is “to extract some information about how nature is associating the response variables to the input variables”. According to Breiman, statisticians (taken to include data scientists) may be interested in making correlative predictions from data and also extracting information about any associated underlying natural causal mechanisms. Therefore, he makes an epistemic distinction between correlative/predictive knowledge on the one hand and causal knowledge on the other. That Breiman’s vision of data science is concerned with these two products means that his account is narrower than the minimalist accounts above.

Another maximalist account is provided by Mallows (2006, p. 322), who gives statistics an essentially practical nature. As he writes, “Statistics concerns the relation of quantitative data to a real-world problem, often in the presence of variability and uncertainty. It attempts to make precise and explicit what the data has to say about the problem of interest.” Mallows emphasises the primacy of problem-solving, that is, of applying data to the “real world”, over general and more abstract intellectual enquiry. Thus, his vision of statistics is somewhat removed from a purely mathematical and formal conception [this point is also stressed by Blei and Smyth (2017), discussed below]. A unique aspect of Mallows’ account is his explicit mention of variability and uncertainty, which data-scientific and statistical methods must confront. This embodies an implicit commitment to the separation of the noisy real world and the idealised constructs familiar to the natural and social sciences. This separation is essential. It means statistics is characterised as a fundamental epistemic method concerned with how human beings relate to the world around them. Statistics forms a kind of epistemic bridge between the two worlds.

Donoho (2017, p. 746) also supports a maximalist approach. His account of data science has a distinctively sociological dimension, referencing the Data Science Association’s “professional code of conduct”: “‘Data Scientist’ means a professional who uses scientific methods to liberate and create meaning from raw data [our italics].” The words “liberate” and “create” indicate that Donoho’s account is consistent with two broad leanings in the philosophy of science: realist and antirealist. That data science can liberate meaning assumes a realist position, according to which we uncover a particular objective and independent ontology through scientific enquiry. It suggests that data originates from phenomena and processes that are inherently amenable to a systematic study and comprehension. However, the term “create” implies a more antirealist conception, according to which we superimpose artificial, context-related ontologies upon data as means to our particular ends. The extent to which data science creates or liberates meaning will depend on one’s position in such a debate. Conceptions of data science, like Donoho’s, seek to accommodate both. However, Donoho’s account remains problematic because of its unqualified and underdeveloped reliance on the concept of “raw data”. Strictly speaking, data are never entirely devoid of interpretation, so there is no such thing as “raw data”. Data are always loosely interpreted, at least because they have been selected instead of other data, and because they are framed by paradigms, hypotheses, theoretical needs, background assumptions, and so forth. As Donoho writes from within the era of big data, his assumption that “raw data” is a suitable base from which to liberate and create meaning is a consequence of the contemporary attitude that data can be, are, and will be recorded in sufficient depth, breadth, and quality for any problem domain. However, we shall see in Sect. 6 that this assumption is questionable.

Leonelli (2020, Sect. 1) offers an account of data science in her attempt at an initial definition of Big Data: “Perhaps the most straightforward characterisation is as large datasets that are produced in a digital form and can be analysed through computational tools”. Here, she draws attention to the computational methodology characteristic of modern data-scientific investigation, and underscores the close relationship between data science (a mode of inquiry) and Big Data (a class of information). Blei and Smyth (2017, p. 8691) agree with Leonelli’s emphasis on computational methodology, but further qualify it with a practical component: “data science blends statistical and computational thinking… It connects statistical models and computational methods to solve discipline-specific problems.” They prioritise statistical and computational methods, thus emphasising a practical rather than pedagogical priority. However, this characterisation does not specify information—broadly conceived—as data science’s object of interest, nor does it mark specific disciplines as parents or patients of data science.

Descriptive taxonomies

Some authors have attempted to characterise data science by providing descriptive, procedural taxonomies of the discipline. The analysis of three descriptive accounts written at different times over the last six decades offers a diachronic perspective.

Let us begin with Tukey’s (1962) account. This appears to be the first descriptive taxonomy of “data analysis”, focusing on: “procedures for analysing data and techniques for interpreting the results of such procedures; ways of planning the gathering of data to make its analysis easier, more precise, or more accurate; all the machinery and results of (mathematical) statistics which apply when analysing data” (Tukey, 1962, p. 2). Tukey intended to give a transparent description of what actually occurs in the analysis of data. As we shall discuss in Sect. 3, the orthodox view at his time of writing was that data analysis was applied statistics, and hence primarily mathematical. By describing its nature plainly and accurately, Tukey’s account was a transgression of the status quo: breaking off the concept of data analysis from applied statistics into its own field.

Some years after Tukey, Wu (1997) presented a threefold descriptive taxonomy centred on data collection (experimental design, sample surveys); data modelling and analysis; and problem understanding/solving and decision making. Like Tukey’s, this description came as part of a broader project to move mathematical statistics in a scientific direction. Wu proposed renaming “statistics” as “data science” or “statistical science”, and we note the inclusion of the manifestly scientific “experimental design”.

More recently, Donoho (2017) has provided an extensive taxonomy which cites the University of Michigan’s “Data Science Initiative” programme: “[Data Science] involves the collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational, and interdisciplinary applications” (Donoho, 2017, p. 745). A brief comparison with Tukey’s and Wu’s accounts highlights the maturation and growth of data science: the procedural pipelines have coevolved with intermediary stages between inputs and products.

Earlier accounts ought not to be faulted for missing a moving target, as they may not have foreseen the growing demands and affordances of the digital era. However, we may still identify a trade-off between constraint specificity and contemporaneity in descriptivist accounts of data science, which has evolved along with computation. Any account that isolates data science from its computational context risks obsolescence. However, any account that does not must grapple with a massively entangled and evolving digital sphere, with all its attendant mechanisms and requirements. Therefore, going forward, we distinguish ones-and-zeroes data science from pen-and-paper statistics by its digital and computationally intensive nature.

Normative taxonomies

Some researchers thought that the status quo conception of data science of their time was inadequate to meet the demands placed on society by the proliferation of data. This led them to develop revisionist accounts, often proposing normative taxonomies of data science. Here, we consider what are arguably the four most influential revisionary accounts offered so far: Chambers (1993), Breiman (2001), Cleveland (2001), and Donoho (2017).

Chambers (1993) remarks that, at the time of his analysis, there was a trend in academic statistics towards what he calls “lesser statistics”—mathematical statistics filtered through journals, textbooks, and conferences—rather than engaging in real-world applications to data. In this context, he presents the following tripartite taxonomy (Chambers, 1993, p. 182) of the composition of his “greater statistics”, referring to the concept mentioned earlier as “everything related to learning from data”:

  1. Preparing data (planning, collection, organization, and validation);

  2. Analysing data (by models or other summaries);

  3. Presenting data (in written, graphical or other form).

Chambers’ taxonomy delineates the processes and products of data science from the decision-making and outcomes that result from those products. The promotion of data preparation to stand equal to analysis and presentation is remarkably prescient. The subprocesses of planning, collection, organisation, and validation anticipate, respectively, the sourcing, volume, diversity, and quality of data required of practical data science, as opposed to the abstract concerns of “lesser statistics”. Taken together with the descriptions of the analysis and presentation of data, this reveals a conception of human limitations in the face of data, with data science seen as the epistemic endeavour to exceed those limitations.

Breiman (2001) echoes the need for statistics to move towards the real world. Like some of the maximalist statements of data science analysed in Sect. 2.1, his account too emphasises that data analysis collaborates with, and thus acts on, specific disciplines, supplying them with analytical tools. To understand Breiman’s radical normative conception of data science as disinterested in truth, in favour of practical knowledge, one needs to engage in a brief historical and sociological detour. In homage to C. P. Snow, Breiman remarks that preference for truth or action characterises two contrasting “cultures” in statistics: the predictive camp, which he estimated at his time of writing in 2001 contained only ~2% of academic statisticians and data analysts, and the inference camp containing the rest. Those in the former camp are primarily interested in generating accurate labels on unseen data. Those in the latter focus on revealing mechanisms and estimating parameters. This is a distinction we shall revisit in Sect. 4. Breiman’s revisionism becomes manifest when he argues that the emphasis on inference over prediction has led to a distracting obsession with “irrelevant theory” and the drawing of “questionable conclusions”, thereby keeping statisticians “from working on a large range of interesting current problems”. Today, the relative sizes of the two cultures are nearly reversed (see, for example, Anderson, 2008). Breiman’s vision of a theory-free data science marks a significant deviation from the classic epistemological project of “understanding understanding”.

Cleveland (2001, pp. 22–23) considered the teaching programs of his time to be deficient, producing data practitioners unprepared for the demands of an increasingly data-rich society. In this sociological context, he proposed the following taxonomy:

  1. Multidisciplinary investigations (data analysis in the context of different discipline-specific areas)

  2. Models and methods for data (statistical models, model-building methods, estimation methods, etc.)

  3. Computing with data (hardware, software, algorithms)

  4. Pedagogy (curriculum planning, school/college/corporate training)

  5. Tool evaluation (descriptive and revisionary analysis of tools and their methods of development)

  6. Theory (foundational and theoretical problems in data science)

Cleveland’s taxonomy puts forward a conception of data science as a fundamentally computational, fully-fledged, scientific discipline in its own right. Four points are particularly relevant in this case. First, the elevation of “computing” alongside “models and methods for data” marks data science as fundamentally digital, separating it from statistics at large. In contrast to Chambers’ taxonomy, computers are by now explicitly recognised as the vehicle that makes data science possible. Second, under “pedagogy”, there is recognition of the necessity to preserve and propagate data science as an academic and commercial field. Third, the novel inclusion of “tool evaluation” and “theory”, absent in previous accounts, signals a conception of data science as self-reflective and progressive. Fourth, the fact that “multidisciplinary investigations” is placed on the same footing as the other five taxa indicates a relative deprioritisation of application. This is a significant shift from preceding accounts, like Breiman’s, that treat data science merely as a means to an end.

More recently, Donoho (2017, p. 755) has given a comprehensive revisionary taxonomy to meet current needs. Emulating Chambers’ terminology, “greater data science” is set in contrast to some of the descriptive taxonomies described in Sect. 2.1, which he calls “lesser data science”. Greater data science consists of:

  1. Data gathering, preparation and exploration

  2. Data representation and transformation

  3. Computing with data

  4. Data modelling

  5. Data visualisation and presentation

  6. Science about data science

In contrast to Cleveland’s taxonomy, Donoho’s focuses just on data science qua field and means of enquiry. Two aspects of this taxonomy are epistemologically interesting. First, mirroring Cleveland, it retains a sixth, metascientific category: data science should reflect and conduct science on itself in order to improve and develop. Second, Donoho’s description is procedurally complete, beginning with data exploration and gathering, and going through all the analytical steps from origins to final products. This ambitious scope contributes to the normative force of Donoho’s proposal.

Considering these previous points, it seems reasonable to propose the following working definition of data science, which we shall use in the rest of this article:

Data science is the study of information systems (natural or artificial), by probabilistic reasoning (e.g., inference and prediction) implemented with computational tools (e.g., databases and algorithms).

This definition is inclusive enough to cover all of machine learning, as well as more generic procedures that typically fall under the umbrella of statistics, such as scatter plotting to inspect trends and bootstrapping to quantify uncertainty. It may or may not exclude some edge cases, depending on one’s interpretation of constituent terms. For instance, it covers deterministic systems if one holds that these are a subset of probabilistic systems. It covers hand-calculated regression models if one holds that human cognition is a kind of computation. Yet these are grey areas, even if the former may be an obvious case of computer science and the latter an obvious case of statistics. Data science stretches across both disciplines, emphasising different aspects.
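To make the reference to bootstrapping concrete, the following is a minimal Python sketch of a percentile bootstrap interval for a sample mean. It is an illustration only: the function name `bootstrap_mean_ci`, its parameters, and the sample values are assumptions introduced here, not taken from any of the works reviewed. The sketch shows how repeated resampling by computer stands in for a parametric assumption about the sampling distribution.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement from the observed data only.
        resample = rng.choices(sample, k=n)
        means.append(statistics.fmean(resample))
    means.sort()
    # Read the interval off the empirical distribution of resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up measurements, used purely for illustration.
observations = [4.1, 5.0, 3.8, 6.2, 5.5, 4.9, 5.1, 4.4, 6.0, 5.3]
low, high = bootstrap_mean_ci(observations)
print(f"sample mean = {statistics.fmean(observations):.2f}")
print(f"95% bootstrap interval = ({low:.2f}, {high:.2f})")
```

Nothing in the procedure assumes a particular distribution for the data; the uncertainty estimate comes entirely from computation over the observed sample, which is why such a procedure falls under the working definition above.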

We can now turn to the question of what kind of enquiry data science may be.

Kind of enquiry

Critics may allege that data science is not an academic discipline, but a set of tools, bundled together through pragmatic functions. At issue is whether data are the “right kind of thing” to stand as the subject matter of a discipline. If “data” is a concept too insubstantial or the methods of data science are too heterogeneous, then any attempt to carve out a unified data science seems doomed to fail. There is a growing demand for data science, not just in the business world, but also in academia, as evidenced by a proliferation of university courses and programs, specialised teaching positions, dedicated conferences, journals, and industry positions. Therefore, for the sake of argument, one may assume that data science may be on its way to becoming an entrenched and mature discipline. If this is the case, the next question is of what kind. The literature offers three main answers: a sort of academic statistics, where statistics is a formal, theoretical part of applied mathematics; statistics, but appropriately expanded to bring it outside of applied mathematics and into a proto-science; and a full-blown science in itself. Let us examine each alternative in detail.

Data science as statistics

The first two approaches take data science to be some form of statistics. For example, Donoho (2017, p. 746) provides a comprehensive collection of papers, talks, and blogs whose authors argue that data science simply is statistics by a different name. This stance further speciates according to whether one takes statistics to be part of, or separate from, applied mathematics. Arguing for the former case, Wu (1997) cites a dictionary definition of statistics: “the mathematics of the collection, organisation and interpretation of numerical data”. This narrow view of data analysis does not have many contemporary proponents. Most of the current literature either accepts that data analysis is part of an extended statistics, which itself is no longer seen as strictly formal mathematics (cf. Chambers’ greater statistics), or grants data analysis the status of a standalone field, external but related to statistics, which is considered a narrow part of formal, applied mathematics. Breiman (2001) and Mallows (2006) take the former stance, calling for the expansion of statistics to include scientific elements and engage with real-world disciplines. This does not entail that statistics is itself a full-bodied science. Data analysis, in this view, remains statistics, even though it begins to transcend strictly formal, mathematical, or deductive inference and practices.

Data science as science

Other authors locate data analysis as a scientific discipline (with a characteristic concern for systematising the material world) rather than a mathematical one (with a typical concern for exploring the deductive consequences of mathematical axioms). Carmichael and Marron (2018, p. 120) claim that a manifestly scientific understanding of data science is a “reaction to the narrow understanding of [Chambers’] lesser statistics” [our italics]. Two main strategies support the claim that data analysis is a science.

The first is to formulate demarcation criteria for whatever we already call science (cf. Popper, 1959), and then show that data science satisfies them. Tukey (1962, p. 5) made this attempt, setting out three paradigmatic demarcation criteria for science: “intellectual content”, “organization into an understandable form”, and “reliance upon the test of experience as the ultimate standard of validity”. By testing the data science of his day against these criteria, Tukey concluded that whatever makes other disciplines scientific also applies to data science. In a similar way, Donoho (2017) focuses on a paradigmatic scientific feature of a subject: the formulation of empirically accountable questions which are solved through scientifically rigorous techniques. Since there is conceptual room for a field of this nature that operates on data and information, he concludes that there is space for a forthcoming genuine science of data analysis.

However, this first strategy struggles with the heterogeneity of science and has waned in popularity [see also Laudan’s (1983) decisive critique of Popper’s falsificationism]. Considering this, an alternative strategy is to demonstrate relevant similarities between data science and paradigmatic sciences, and argue that these similarities warrant an extension of the general concept. For example, Wu (1997) cites a series of important similarities between his descriptive taxonomy of statistics and paradigmatic sciences. These similarities include: the “empirical—physical” approach of statistics, in which we use induction to infer knowledge from observations and deduction to infer implications of theories; the primacy of experimental design and data collection; and the use of Bayesian reasoning to evaluate models and evidence. However, there are notable ways in which data science diverges from paradigmatic sciences. Such dissimilarities include the kind of knowledge it generates (see Sect. 4), the modes of logical inference by which it proceeds (see Sect. 4), and the status it endows to hypotheses (see Sect. 6).

A further dissimilarity may be that data science sits alongside normal sciences, providing them with the tools and resources needed to make more profound, discipline-specific discoveries. If these dissimilarities are regarded as sufficiently significant, it becomes plausible that data science might not be a science at all, or may be a transcendental science. We turn to this topic in the next section.

Data science as something else

The debate above overlooks the possibility that data science may be best understood neither as an applied mathematical statistics nor an empirical science but as something else altogether. Wiggins, for example, has expressed this thought in private communication with Donoho, claiming that “Data science is not a science… It is a form of engineering, and the doers in this field will define it, not the would-be scientists” (Donoho, 2017, p. 764). Wiggins takes the pragmatic and social ends of data science to distinguish it from both mathematics and empirical science, with a closer affinity to the essentially pragmatic interdiscipline of engineering. Perhaps a similar claim could be made about computer science, which is rooted in mathematics but sufficiently specialised, interdisciplinary and practically oriented to constitute its own individuated field of enquiry.

Another possibility would be understanding data science as a discipline that operates on and assists other forms of enquiry. This was, for example, how Wittgenstein came to view philosophy, as a set of tools and methods for resolving linguistic confusions both within philosophy itself and in other areas of thought, like mathematics or psychology. A similar function might be attributed to data science, as an essentially operative discipline that works in tandem with various disciplines (natural science, social science, history and the humanities, etc.) to solve their own problems. If one has a general enough conception of data, one might even understand this operative relationship as, in fact, one of data science as an external basis for other forms of enquiry. Data science might be conceived as serving a transcendental function for other sciences, as the condition for the possibility of their scientific inquiry as such. There is nothing essentially different between the structures of understanding across disciplines, like Linnaeus’ taxonomies and the hierarchical ontologies familiar to database managers. Tycho Brahe’s journals are essentially a high-quality dataset of one kind. Newton’s laws of motion can be understood as an algorithm, obtained from empirical data, and verified against them, for predicting values for some physical variables based on the values of others. We shall not pursue this approach in this article, offering it as a suggestion to be explored (in terms of gap analysis of the literature reviewed) rather than a thesis to be defended.

Whatever data science is, and irrespective of what kind of science it may be, the epistemological debate also focuses on the nature of knowledge it generates. This is the topic of the next section.

The knowledge generated by data science

This section examines the debate about the knowledge generated by data science. The analysis is structured into two related parts. One focuses on positions that privilege the process, or how (concerning modes of inference). The other gives an account of positions that prefer the product, or what (referring to the epistemic products).

Modes of inference

Different means of enquiry have differing affinities to the three typical modes of inference: deduction, induction, and abduction. The epistemology of data science reflects on the extent to which data scientists deploy these various modes.

Deductive inferences are present in data science through the heavy reliance on mathematical and logical reasoning. Probability theory, differential calculus, functional analysis, and theoretical computer science are all purely deductive disciplines widely used to derive the properties of algorithms and design new learning procedures with little concern for empirical behaviour. For instance, the backpropagation algorithm used to optimise parameters in neural networks combines elements of linear algebra and multivariable calculus to compute the exact gradient of an objective function, which gradient-based optimisers then follow, under suitable conditions, towards a local optimum. No datasets are required to derive this result.
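The deductive character of such derivations can be illustrated with a minimal Python sketch. It assumes a deliberately tiny network with one hidden unit, y_hat = w2 * tanh(w1 * x), and a squared-error loss; the network, the function names, and the numbers are illustrative assumptions rather than anything drawn from the literature reviewed. The gradient is obtained purely from the chain rule, and a finite-difference approximation is included only as an independent check.

```python
import math


def loss(w1, w2, x, y):
    """Squared error of a one-hidden-unit network: y_hat = w2 * tanh(w1 * x)."""
    h = math.tanh(w1 * x)
    return 0.5 * (w2 * h - y) ** 2


def backprop_gradient(w1, w2, x, y):
    """Gradient of `loss` with respect to (w1, w2), derived via the chain rule."""
    h = math.tanh(w1 * x)                  # forward pass: hidden activation
    error = w2 * h - y                     # dL/dy_hat
    grad_w2 = error * h                    # dL/dw2 = dL/dy_hat * dy_hat/dw2
    grad_h = error * w2                    # dL/dh, propagated backwards
    grad_w1 = grad_h * (1.0 - h * h) * x   # tanh'(z) = 1 - tanh(z)^2
    return grad_w1, grad_w2


def numerical_gradient(w1, w2, x, y, eps=1e-6):
    """Central finite differences, used only as an independent sanity check."""
    g1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
    g2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
    return g1, g2


# Arbitrary numbers, not drawn from any dataset.
w1, w2, x, y = 0.3, -1.2, 2.0, 0.5
analytic = backprop_gradient(w1, w2, x, y)
numeric = numerical_gradient(w1, w2, x, y)
print("chain rule: ", analytic)
print("finite diff:", numeric)
```

The two results agree up to numerical error for any choice of inputs, because the analytic expression follows from the definitions alone rather than from observation.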

Inductive inferences are also of central importance. Any data are a finite sample of the world. Data science then identifies structures in the data and distils them into information that applies beyond the data itself. This is achieved by projecting the patterns and structures found in data to new contexts, going beyond the antecedent domain. This projection is an inductive inference. Harman and Kulkarni (2007) argue that statistical learning theory represents a principled and sophisticated defence of induction. Similar remarks can be found in Frické (2015), who observes that “Inductive algorithms are a central plank of the Big Data venture.” More recently, Schurz (2019) has argued that formal results from reinforcement learning demonstrate the optimality of meta-induction, thereby solving Hume’s problem on a priori grounds. In other words, this represents a defeasible solution to Hume’s problem of induction, whereby statistical testing can provide stronger or weaker evidence in favour of particular hypotheses (Mayo, 1996, 2018).

One can distinguish between two canonical types of inductive inference, object and rule induction. The first is the informed prediction of singular unobserved instances: hypotheses of the form “the next observed instance of X will be Y” based on previous data of the co-instantiation of X’s and Y’s. This is known as object induction. Rule induction, by contrast, posits universal claims of the form “all X’s are Y’s”, based on the same data. Data-scientific investigation involves both. Singular predictive instances are commonplace in any application of supervised learning, where the goal is to learn a function from inputs to outputs. These are the kinds of inductions that interest Breiman’s (2001) “first culture” of statistics. At the same time, one of the purposes of data science is to identify underlying structures and mechanisms. The project of causal inference, which we revisit in Sect. 4.2, is devoted to such forms of rule induction.

Turning to abductive inference, Alemany Oliver and Vayre (2015) have emphasised the importance of abductive reasoning in data science methods, particularly in how data science is embedded into broader scientific practice (see Sect. 6 for further discussion). They argue that the tools of data science are useful first in exploring data to determine its internal structure, and second in identifying the best hypotheses to explain this structure. This inference from structure to an explanatory hypothesis is an abductive inference. The view that science is essentially abductive can be traced back to Peirce, though modern adherents abound (Harman, 1965; Lipton, 1991; Niiniluoto, 2018). The status of abduction in a data-intensive context is further elevated by the theoretical virtue of explanatory unification (Kitcher, 1989). In the philosophy of science, a common virtue of a theory is its explanatory power, with some authors maintaining that such power is grounds to choose one of two empirically equivalent theories (cf. van Fraassen’s (1980) discussion of pragmatic virtues). One dimension of explanatory power is the extent of the diversity and heterogeneity of phenomena that a theory can explain simultaneously (cf. Kitcher, 1976). If the methods of data science make possible the identification of patterns in a diverse and heterogeneous range of phenomena, then perhaps we will develop a broader and more nuanced picture of the explanatory power of our theories. For those theories that can unify many phenomena, abductive reasoning confers more robust support on them in light of the patterns that data science techniques uncover.

In addition to being an end in itself, epistemological reflection on modes of inference also sheds light on the connections between data science, mathematics, and science. The similarities between these disciplines, such as their relevance (Floridi, 2008), explanatory power, practical utility, and degree of success, are precisely what is in question when we look to extend the categories coherently. For example, mathematical proofs are formulated deductively. But given the importance of non-deductive inferences in data science, one needs to recognise an important difference between the two and refrain from placing data science strictly within applied mathematics. Likewise, natural sciences use a mixture of deduction, induction, and abduction in everyday practice, with more formal sciences making more frequent use of deduction, and more applied sciences relying more on abduction. Other sciences assign different weightings to differing modes of inference. For example, abduction is commonplace in the social, political, and economic sciences. Cognitive science is another example that relies on abduction, given the frequency of empirically equivalent, underdetermined theories. It seems that data science, if it is a science, is in good company.

Epistemic products

The trichotomy of machine learning—which spans supervised, unsupervised, and reinforcement learning algorithms—helps delineate the kind of knowledge generated by data science and its techniques.

Supervised learning models predict outcomes based on observed associations. They automate the process of inductive reasoning at scales and resolutions that far exceed the capacity of humans. However, large datasets and powerful algorithms are insufficient to overcome the fundamental challenges inherent to this mode of inference. A model that does well in one environment may fail badly in another, if data no longer conform to the observed patterns. For instance, a classifier trained to distinguish cows from camels may struggle when presented with a cow in the desert or a camel on grass, presuming the training set only contains images of both animals in their natural habitats. Since the background was a reliable indicator of the outcome in training, the model could be forgiven for assuming the same would hold at test time. More dramatic examples have emerged in the context of adversarial learning, where models are trained with the express goal of fooling another model into misclassifying images (Goodfellow et al., 2014). In this case too, the adversarial example succeeds in its deception precisely because it comes from a different data generating process than the original data (although this may not be clear to the naked eye). Such fallibility is inherent in all inductive reasoning, which nevertheless helps us accomplish many important epistemic goals.
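The cow and camel example can be made concrete with a deliberately simplified sketch. We assume two binary cues per image, "background" and "shape", and a learner that keeps whichever single cue best predicts the label in training. The data, the feature names, and the one-feature learner are all hypothetical simplifications introduced for illustration; they are not how real image classifiers are built.

```python
def make_example(label, background, shape):
    """One toy 'image': label 0 = cow, 1 = camel; cues are binary."""
    return {"label": label, "background": background, "shape": shape}

# Training set: every cow (0) is on grass (0), every camel (1) on sand (1).
# The shape cue is informative but noisy: it is wrong for 1 image in 10.
train = []
for i in range(100):
    label = i % 2
    shape = label if i % 10 != 0 else 1 - label
    train.append(make_example(label, background=label, shape=shape))

# Shifted test set: cows on sand and camels on grass; shape is reliable.
test = [make_example(i % 2, background=1 - i % 2, shape=i % 2)
        for i in range(100)]

def accuracy(feature, data):
    """Accuracy of the rule 'predict label = value of this feature'."""
    return sum(ex[feature] == ex["label"] for ex in data) / len(data)

def fit_stump(data, features=("background", "shape")):
    """Keep the single feature that best predicts the label in training."""
    return max(features, key=lambda f: accuracy(f, data))

chosen = fit_stump(train)            # the background cue wins in training
train_acc = accuracy(chosen, train)  # perfect on the training distribution
test_acc = accuracy(chosen, test)    # collapses once the habitat changes
```

Under these assumptions the learner behaves exactly as induction licenses: it projects the most reliable observed regularity, which happens to be the wrong one once the data-generating process changes.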

Unsupervised learning is a more heterogeneous set of methods, broadly united by their tendency to infer structure without predefined outcome variable(s). Examples include clustering algorithms, autoencoders, and generative models. At their best, these tools can shed light on latent properties—how samples or features reflect underlying facts about some natural or social system. For instance, cancer research commonly uses clustering methods to categorise patients into subgroups based on biomarkers. The idea is that an essential fact (e.g., that cancer manifests in identifiable subtypes) is reflected by some contingent property (e.g., gene expression levels). The risk of overfitting—i.e., “discovering” some structure in training data that does not generalise to test data—is especially high in this setting, as there is no outcome variable against which to evaluate results.
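A minimal sketch may help to show both the promise and the risk. The code below runs a two-cluster k-means procedure on scalar values; the "expression" numbers are invented, and the initialisation at the minimum and maximum is a simplifying assumption made so that the result is deterministic. The same procedure is then applied to evenly spaced values that contain no subgroups at all.

```python
def kmeans_1d(xs, iters=20):
    """Two-cluster k-means on scalars, initialised at the min and max."""
    centroids = [min(xs), max(xs)]
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1
                  for x in xs]
        # Update step: each centroid moves to the mean of its members.
        for k in (0, 1):
            members = [x for x, lab in zip(xs, labels) if lab == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return labels, centroids

# Hypothetical biomarker levels with two well-separated subgroups.
expression = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
labels, centroids = kmeans_1d(expression)

# Evenly spaced values with no subgroups: the algorithm still returns two.
flat = [float(i) for i in range(8)]
flat_labels, _ = kmeans_1d(flat)
```

In both cases the algorithm reports two clusters. Nothing in the output distinguishes the case where the partition reflects an underlying fact from the case where it is an artefact of the method, which is why the absence of an outcome variable makes evaluation difficult.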

In reinforcement learning, one or more agent(s) must navigate an environment with little guidance beyond a potentially intermittent reward signal. The goal is to infer a policy that maximises rewards and/or minimises costs. A good example of this is the multi-armed bandit problem. An agent must choose among a predefined set of possible actions (i.e., must “pull” some “arm”) without knowing the rewards or penalties associated with each. Therefore, an agent in this setting must strike a balance between exploration (randomly pulling new arms) and exploitation (continually pulling the arm with the highest reward thus far). Reinforcement learning has powered some of the most notable achievements of data science in recent years, such as AlphaZero, an algorithm that is currently the world’s best player of Go, chess, and several other classic boardgames. The epistemic product of such algorithms is neither associations (as in supervised learning) nor structures (as in unsupervised learning), but policies: methods for making decisions under uncertainty.
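The balance between exploration and exploitation can be sketched with an epsilon-greedy rule, one simple and widely used strategy for the bandit problem (it is our choice for illustration; the text above does not commit to any particular algorithm). The reward probabilities, the number of steps, and the value of epsilon are hypothetical.

```python
import random

def epsilon_greedy(true_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit; return pull counts, estimates, mean reward."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms      # how often each arm has been pulled
    values = [0.0] * n_arms    # running estimate of each arm's mean reward
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return counts, values, total / steps

# The agent never sees these probabilities, only the rewards they generate.
counts, values, avg_reward = epsilon_greedy([0.2, 0.5, 0.8])
```

What the agent ends up with is not a description of the arms but a rule of action: pull the arm currently estimated to be best, while occasionally testing the others.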

On their own, these methods do not necessarily provide causal knowledge. However, some of the most important research on AI of the last 20 years has focused on causal reasoning (Imbens & Rubin, 2015; Pearl, 2009; Peters et al., 2017; Spirtes et al., 2000). Such research demonstrates how probabilistic assumptions can combine with observational and/or interventional data to infer causal structure and treatment effects. Remarkably, this literature is only just beginning to gain traction in the machine learning community. Recent work in supervised learning has shown how causal principles can improve out-of-distribution performance (Arjovsky et al., 2019), while complex algorithms such as neural networks and gradient boosted forests are increasingly used to infer treatment effects in a wide range of settings (Chernozhukov et al., 2018; Künzel et al., 2019). The task of learning causal structure from observational data is a quintessential unsupervised learning problem. This has been an active area of research since at least the 1990s and remains so today [see Glymour et al. (2019) for a recent review]. Reinforcement learning—perhaps the most obviously causal of all three branches, given its reliance on interventions—has been the subject of intense research in the last few years (Bareinboim et al., 2021). Various authors have shown how causal information can improve the performance of these algorithms, which in turn helps reveal causal structure.
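The gap between association and causation can be shown exactly in a toy model. We assume a binary confounder Z that influences both X and Y, while X has no effect on Y whatsoever; the structure and the probabilities are invented for illustration. The interventional quantity is computed with the standard adjustment formula for this graph, which averages over Z rather than conditioning on X.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical model: Z -> X and Z -> Y, with no arrow from X to Y.
P_Z = {0: F(1, 2), 1: F(1, 2)}

def p_x_given_z(x, z):
    return F(9, 10) if x == z else F(1, 10)

def p_y_given_z(y, z):
    # Y depends on Z only; X does not appear.
    return F(9, 10) if y == z else F(1, 10)

def joint(z, x, y):
    return P_Z[z] * p_x_given_z(x, z) * p_y_given_z(y, z)

def p_y1_given_x(x):
    """Observational quantity P(Y=1 | X=x), by conditioning."""
    num = sum(joint(z, x, 1) for z in (0, 1))
    den = sum(joint(z, x, y) for z, y in product((0, 1), repeat=2))
    return num / den

def p_y1_do_x(x):
    """Interventional quantity P(Y=1 | do(X=x)): setting X cuts Z -> X,
    so Z keeps its marginal distribution and Y still ignores X."""
    return sum(P_Z[z] * p_y_given_z(1, z) for z in (0, 1))

observational_gap = p_y1_given_x(1) - p_y1_given_x(0)
interventional_gap = p_y1_do_x(1) - p_y1_do_x(0)
```

Under these assumptions the observational gap is large while the interventional gap is exactly zero: a purely predictive model would treat X as highly informative about Y, yet manipulating X would change nothing.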

These methods can, in principle, be used to infer natural laws. Schmidt and Lipson (2009) have proposed what appear to be algorithmically obtained laws of classical mechanics. Their method involved analysing the motion-data of various dynamical systems using algorithms without prior physical knowledge of mechanics. They claim to obtain the Lagrangian and Hamiltonian of those dynamical systems, together with various conservation laws. This is an important result for those hoping for the possibility of the autonomous discovery of natural laws. More recent work on symbolic metamodels provides a more general strategy for deriving interpretable equations from complex machine learning models (Alaa & van der Schaar, 2019). We shall discuss the roles of correlation and causation in science, and of autonomous, theory-free science in Sect. 6. Leonelli (2020, Sect. 7) also considers the nature of the epistemic products of data science (the extent to which they are predictions or knowledge of causations) and their consequences on scientific investigation.

Black box problems

Tools that produce more successful (more efficient, accurate, deployable, etc.) outcomes are adopted in virtue of their utility, often without reflection on how we are to understand them and the mechanisms by which they work. This has led to questions about the opacity of these tools, where opacity is here understood as our inability to find intelligible mechanisms through which some outcome is achieved. Opaque tools have come to be known as “black boxes” due to their lack of epistemic transparency. In this section, we consider and evaluate the epistemological debate in the literature on data science about a variety of black box problems.

It may be helpful to begin with some clarification. Burrell (2016) has proposed that there are three ways in which data science algorithms become opaque. The first is their intentional concealment for commercial or personal gain. The second is the opacity that arises from technological literacy and proficiency being a necessary condition to understand sophisticated algorithms. And the third is the inherent complexity that arises from algorithmic optimisation procedures that exceed the capacity of human cognition. The first two of these problems are pragmatic problems that occur when data science is embedded in wider society [see Tsamados et al. (2020) for recent work on these issues]. They are not the kind of in-principle epistemological problems that concern us here. Thus, we will focus only on the last problem. There have been many “technical solutions”, or proto-solutions, to various instances of black box problems which proceed through fine-grained mathematical investigations, and do not attempt to integrate black boxes into the ordinary understanding. Again, we are not concerned with these here, because we are focused on the level of abstraction of concern to philosophy, which is above such technical investigations. In this section, we provide only a brief, comparative overview to illustrate (dis)similarities, or instances where putatively different problems may collapse into one.

Black box problems can be placed into two broad categories, which we shall organise as conceptual and non-conceptual. Conceptual problems concern whether and when the concepts belonging to our ordinary understanding can be employed sensibly and intelligibly in discussions of black boxes and their workings. For example, a conceptual problem is whether the term “explainability” may be coherently and unambiguously defined in a machine learning context. Non-conceptual problems, in contrast, do not concern the appropriate use of ordinary concepts in machine learning contexts, but the broader problems that result from these concepts. Within these non-conceptual problems, we will restrict our focus to those in the domain of epistemology. However, it is worth recognising that further non-conceptual problems arise elsewhere, for example, in ethics or politics.

Conceptual problems

Some black box problems arise from our ordinary concepts being inadequate or unclear when projected into machine learning contexts. Lipton (2018) has acknowledged this imprecision over the use of “interpretation”. He observes that “the task of interpretation appears underspecified. Papers provide diverse and sometimes non-overlapping motivations for interpretability and offer myriad notions of what attributes render models interpretable” (Lipton, 2018, p. 36). Similarly, Doshi-Velez and Kim (2017) have remarked on the lack of agreement on a definition of “interpretability”, and further about how it is to be evaluated. They identify two paradigmatic uses of “interpretability” in the literature: interpretability in the context of an application and interpretability through a quantitative proxy. Rigorous definitions of both are found lacking.

There have been a few attempts to respond to such conceptual problems. One important first step is to construct a clear taxonomy of how problematic concepts like interpretability are used, and what the desiderata and methodologies of interpretability research are. This is the kind of project in which Lipton (2018) engages. A second attempt has been to refine these concepts, or at least conduct the groundwork to facilitate this refinement. Doshi-Velez and Kim (2017) engage in this kind of project, laying the groundwork for the subsequent rigorous definition and evaluation of interpretability. Authors have also refined the concepts of interpretability and its cognates by making fine-grained distinctions within them, adding to their structure. Doshi-Velez and Kim (2017) distinguish between local and global interpretability to avoid confusion. The former applies to individual predictions and the latter to the entire decision boundary or regression surface. Watson and Floridi (2020) make a similar distinction between local (token) and global (type) explanations, though in a more formal mathematical context.

Further work on the representations deployed in black box problems concerns the relationship between various roughly synonymous terms: words like “interpretability”, “explainability”, “understandability”, “opacity”, and so on. It is of philosophical interest whether any or all of these terms overlap in whole or part. Some commentators take a coarse-grained approach to such cognates. Krishnan (2020), for example, takes them to be negligibly different, arguing that these terms all define one another in a circular fashion that does little to clarify imprecise concepts. Others take a more fine-grained approach. Tsamados et al. (2020) emphasise the difference between explainability and interpretability. The former applies to experts and non-experts alike. For example, an expert data scientist might need to explain the mechanics of some algorithm to a non-expert client. In contrast, the latter is restricted to experts (interpretability as interpretability-in-principle). Thus, in their view, explainability presupposes interpretability but not vice versa.

Non-conceptual problems

Non-conceptual problems and their solutions do not address deficiencies in representations themselves. In this section, we will discuss three epistemological problems that have received less attention.

Ratti and López-Rubio (2018) have argued that interpretability is crucial to distil causal explanations from the correlations identified by data science techniques, as may be the case in a data-rich scientific context. Using the paradigm of mechanistic biological models, they observe that for biologists to turn data-scientific correlative models into causal models with explanatory power, the correlative models must be interpretable. This stems from a general trade-off: the more complex a model is, the less explanatory it becomes. Since the predictive powers of data-scientific models are positively correlated with their complexity, they conclude that there is a genuine epistemological black box problem. In Sect. 6, we shall see that this epistemological concern is more significant when considering the nature of the scientific method in a data-driven paradigm. Insofar as we expect computational, data-led investigations to become increasingly important for scientific progress, the algorithms and tools we deploy in these investigations ought to yield discoveries accessible to scientists and integrate readily with the wider scientific epistemic community. Thus, we agree with Ratti and López-Rubio (2018) on the need to address such opacity problems, given the importance of a secure foundation for a data-driven science.

Watson and Floridi (2020) have construed overfitting as a different kind of epistemological black box problem, as a kind of algorithmic Gettier case. Overfitting occurs when a machine learning model performs well in training but fails at test time. As an example, they cite the results of Lapuschkin et al. (2016), in which pictures of horses shared a subtle, distinctive watermark. The resultant image classifier strongly associated that watermark with the label “horse”, and thus failed to classify horses in a test set in which the watermark was absent. This, Watson and Floridi propose, is importantly analogous to the Gettier cases of classical epistemology, which illustrate how epistemic luck can pull apart notions of knowledge and justified true belief. The model is justified to infer that watermarks entail horses, since the former was both necessary and sufficient evidence for the label in training. Moreover, when presented with a new image of a horse containing the tell-tale watermark, the prediction is true. Finally, we may come to believe in the model due to its high accuracy rate, at least on data from this training regime. However, the apparent accuracy is a mirage. The overfit model is right for the wrong reasons, and hence only accidentally true. Watson and Floridi argue that better tools for model interpretability can reveal such errors, mitigating the potential damage of overfitting.

We recognise some notable differences with classical Gettier cases, though not ones that undermine the epistemic consequences of opaque algorithms. One such difference is that, due to the very threat of overfitting, we might avoid attributing genuine reliability to some machine learning algorithm in virtue of only its training performance. Rather, it remains possible that such algorithms are only deemed reliable when they have performed sufficiently well in non-training environments, so that overfitting is demonstrably and empirically insignificant. Differences aside, there appears to be a significant epistemological problem regarding accidentally true classifications that may be met through developments of machine interpretability.

Krishnan (2020) has remarked on the broader epistemological point that, insofar as machine learning algorithms might have a pedagogical dimension (that we can learn from the mistakes that algorithms might make), they must be interpretable or understandable for us to learn anything at all. Lipton (2018, Sect. 2.4) (citing Kim et al., 2015) makes a similar remark on the informativeness of algorithms. Thus, there are significant epistemic benefits to greater algorithmic transparency.

The discussion above gives the impression that these problems are substantial and worth solving. However, not all commentators agree. There are two main kinds of objections. Some concede that black boxes are opaque but deny that the correct way to proceed is to try to explain or interpret their inner workings. Instead, they argue that black boxes should be replaced altogether by equally capable non-black boxes (however, this strategy must answer tough questions of attainability). Others deny that black boxes are problematic at all. Let us look at each position in turn.

Rudin (2019) has expressed an objection of the first kind. She agrees that the lack of interpretability of machine learning algorithms is a problem. However, she takes this not as motivation to construct better post hoc interpretability methods, but instead as a reason to reject opaque models altogether. She rejects the commonplace assumption that accuracy and interpretability are inversely related. In her view, black box problems should be dissolved (rather than solved) by globally transparent models that perform comparably to black box competitors. Rudin’s solution can be criticised for assuming that it is realistic and practical to expect comparable alternatives to black boxes. One of the reasons black box problems exist is that the top performing models for many complex tasks are opaque. It is a chief difficulty of the problem that, in many cases, alternatives are either non-existent or impractical.

Zerilli et al. (2019) have expressed an objection of the second kind, arguing that the opacity of black boxes is not a genuine problem. They see the explainability debate as evidence of a pernicious double standard. They point out that we do not demand explicit, transparent explanations from human judges, doctors, managers, military generals, or bankers. Rather, justification is found simply in past reliability: demonstrated and sustained accuracy and success. If we impose the same norms on algorithms, then the explainability problem is dissolved.

Similarly, Krishnan (2020) has argued that our concerns about interpretability and its cognates are unnecessarily inflated. The inherent imprecision of these terms prevents them from doing the work required of them: “Interpretability and its cognates are unclear notions… We do not yet have a grasp of what concept(s) any technical definitions are supposed to capture—or indeed, whether there is any concept of interpretation or interpretability to be technically captured at all” (Krishnan, 2020, p. 490). But unlike Doshi-Velez and Kim, Krishnan does not take this as motivation to sharpen such concepts for subsequent progress, for worrying about them distracts from our real needs. Krishnan contends that most of the de facto motivations for treating interpretability as an epistemological problem in the first place are due to other ends (e.g., social, political, etc.). For example, algorithmic bias audits use explainability as a means to avoid unethical consequences.

We are sympathetic to Krishnan’s overall project. Many authors uncritically assume that black box problems are necessarily important, and epistemological concerns about concepts like interpretability are indeed often means to other ends. However, we disagree that this observation undermines the status of purely epistemological concerns, as the examples from Sect. 5.2 attest. It might be the case that worrying about black box problems is an inefficient and suboptimal use of philosophical effort (particularly in the hyper-pragmatic context in which data science methods are mostly deployed). However, black box problems qua objects of epistemological interest remain relevant to at least some parts of a complete philosophy of data science.

Normal science in a data-intensive paradigm

So far, we have considered foundational and epistemological issues in the philosophy of data science. We may now broaden the investigation to consider how data science might relate to science and the philosophy of science more generally.

The classical conception of the scientific method involves a specific gnostic relation between hypotheses and evidence. Hypotheses are derived from existing scientific theory (e.g. Peter Higgs’s 1960s prediction of the Higgs Boson) and then confirmed or falsified through experiments (e.g. the CERN confirmation of the Higgs Boson). The relation is gnostic because scientists within the paradigm posit specific connections between phenomena in advance of experiments.

However, it has recently been proposed that the proliferation of data has inaugurated a new era of agnostic science. Here, scientific knowledge can be generated, and mathematical and data-scientific methods deployed without prior knowledge or understanding of phenomena or their interrelations. A putatively agnostic science is one where experiments are in some sense “blindly” performed, and large amounts of data amassed. Then algorithms retrospectively seek correlations in this data, from which underlying causal laws and scientific generalisations can be extracted. Kitchin (2014) has compiled Gray’s work (found in Hey et al. (2009)) to elucidate the nature of this new paradigm and locate it in the history of science. This section explores the extent and implications of agnostic science.

| Paradigm | Nature | Form | When |
| --- | --- | --- | --- |
| First | Experimental | Empiricism; describing natural phenomena | pre-Renaissance |
| Second | Theoretical | Modelling and generalization | pre-computers |
| Third | Computational | Simulation of complex phenomena | pre-Big Data |
| Fourth | Exploratory | Data-intensive; statistical exploration and data mining | Now |

Table 1. Scientific paradigms, taken from Kitchin (2014, p. 3), compiled from Hey et al. (2009)

Agnosticism about the application of mathematics

One identification of agnosticism is provided by Napoletani et al. (2018), who observe that the de facto application of mathematical techniques in science is undergoing an agnostic transformation. They remark that classical methods required both the prior understanding of phenomena and interconnections between elements in datasets. This is the case, for example, if one wishes to model some biological population using differential equations. The nature of the models one uses, which parameters to include, and so on, require the scientist to have antecedent knowledge and understanding of population biology, multivariate calculus, etc. They also need the scientist to know the basic structure of the dataset. Matters are very different in contemporary data analysis. There, the scientist can remain, to a great extent, agnostic or uninformed about any underlying scientific theory and the structure of their data. With the tools of contemporary data science, raw data can be parsed, and structure exploited more or less automatically.

After observing that this appears to be an important direction in scientific practice, Napoletani et al. raise the second-order question of why mathematics and data have such an effective synergy. They claim that a common response is to appeal to a Wignerian-like resignation to “unreasonable effectiveness” (Wigner, 1960). In this view, big data has a sort of omnipotence that grants unreasonable success to disparate and heterogeneous data-scientific tools. However, Napoletani et al. (2018) reject this response, arguing that the question can be reformulated into the more general question of whether the success of mathematical methods in an agnostic normal science is due to a similarity between the structure of those methods and the structure of the phenomena themselves captured in data corpora. This is a question that deserves further attention in the debate. While Napoletani et al. observe the increasing possibility of employing mathematical techniques agnostically, others have engaged in a more radical debate about whether this agnosticism heralds the end of theory choice in science altogether. This is the topic of the next section.

Theory-free science

Anderson (2008) has argued that classical theory-driven science is becoming obsolete. In his view, the density and plurality of correlations yielded by the analysis of extraordinarily large amounts of data will become more useful than the causal generalisations provided by classical science. [Such views are also discussed in Cukier and Mayer-Schoenberger (2013), Leonelli (2020), Prensky (2009), and Steadman (2013).] Kitchin (2014) provides a more formal characterisation of this view, which he calls a new type of empiricism, and Schmidt and Lipson’s (2009) aforementioned reconstruction of classical mechanics via machine learning is a provocative example of theory-free science in action.

Critics object that this is sensationalist, over-optimistic and inflated. Kitchin (2014) presents a fourfold attack on the kind of empiricism that Anderson (2008) champions. His first contention is that, as much as large data corpora can try to exhaust information in a whole domain, they are nonetheless coloured by the technology used in their generation and manipulation, the data ontology in which they exist, and the possibility of sampling bias. Indeed, “all data provide oligoptic views of the world” (Kitchin, 2014). Second, following Leonelli (2014), he remarks that even the agnostic distillation of structure and patterns from data cannot occur in vacuo from all scientific theory. Due to their deep embedding in society, scientific theories and training always provide the scaffolding around data collection and analysis. Third, insofar as normal science is cumulative, he argues that the individual results of data-scientific investigations will always require interpretation and framing by scientists equipped with knowledge of scientific theories. And fourth, if data and the results of its analysis are interpreted free of any background theory, they risk becoming fruitless. It will be difficult for them to contribute to any fundamental understanding of the nature of phenomena since it “lacks embedding in wider … knowledge” (Kitchin, 2014). Frické (2015) presents a similar view against this extreme kind of agnosticism. He objects that one needs antecedent theoretical insight to decide which data to provide inductive algorithms in the first place. Theory cannot be removed from science, even in a data-driven paradigm.

Kitchin’s first point is echoed by Symons and Alvarado (2019), whose epistemological analysis of data and computer simulations expresses doubt that experimental simulations can ever be purely epistemically justified by the pragmatic success of their correlative predictions. The authors argue that in addition to these pragmatic successes, data-driven scientific investigation only becomes credible if it is entrenched in a variety of other factors, including well-curated data, established scientific theory, empirical evidence, and good engineering practices (Symons & Alvarado, 2019, p. 0.3). Their view therefore underscores the mutually reinforcing relationship between theory and data in science, opposing a data-dominated conception. It also reiterates the significance, apparent in Sect. 4 and elsewhere, of high-quality and diverse data sets. Elsewhere, Symons and Alvarado (2016, p. 3) further attack the notion of a theory-free science by connecting it to problems of epistemic opacity. They examine the Google Flu Trends web service and observe a series of limitations within the resultant data set. These limitations, they contend, have at best only inadequately ambiguous sources; hence they question the possibility of a genuinely agnostic science, and further underscore the epistemological significance of black-box problems.

We believe that these arguments can be supplemented with further reasons against this total agnosticism. The first reason relates to the critical issue in the philosophy of science of the theory-ladenness of observation, which holds that what one observes is influenced by one’s theoretical and pre-theoretical commitments (see also Leonelli, 2020, Sect. 2). This is especially true for data science, where observations are gathered, labelled, and processed according to pre-existing categories and analysis routines. For any data to be manipulated and ultimately rendered intelligible to a human being, they must be represented under one particular concept or another. And in this process of conceptual subsumption, one’s theoretical commitments will always prevent any kind of total agnosticism. Second, it is plausible that Anderson’s claim that correlations will be sufficient for the future of science is too naïve a conception of the scientific enterprise. It reminds one of Bacon’s untenable view that Nature would speak by itself if adequately interrogated. Agnostic data science may generate a predictive science without knowledge of underlying natural laws or causal mechanisms, but prediction is not the only goal of the scientific enterprise. Explaining phenomena by knowing the underlying causal structure of the world, and helping to plan and intervene, are also two important goals of science. A similar line of argument is offered by Cukier and Mayer-Schoenberger (2013, p. 32). Third, and as perhaps Kuhn (1970) showed better than anyone else, science is still a fundamentally social project. Scientific paradigms do not exist in a vacuum. Science is developed by human experts embedded in a rich and complex sociocultural environment. Discourse about science, and scientific pedagogy, are indispensable aspects of what science is. Consequently, one might question whether a genuinely agnostic science would be recognisable as a science at all. Such a science would have a significantly different intellectual structure from contemporary versions of the discipline, and it is not clear whether such a difference could be accommodated.

Total agnosticism, therefore, seems too extreme. The task, then, is to determine how to integrate agnostic data-scientific practices into scientific methodology. Kitchin (2014) proposes a humbler account of this integration. He calls it “data-driven science”, which takes the form of a rebalancing of the three modes of inference discussed in Sect. 4.1. He argues that contemporary normal science has an experimental-deductive dimension in which hypotheses are deduced from more fundamental hypotheses and then offered up for confirmation or refutation by experiment. In contrast, science in a data-driven paradigm elevates the status of inductive logic in this process of hypothesis formation, with experimental hypotheses generated from correlations identified by data-scientific methods rather than by deduction from parent hypotheses. However, in contrast to the naïve empiricist, Kitchin’s data-driven science does not involve the absolute primacy of induction. Theories and their deductions play an essential role, for example, in framing data, directing which data-scientific processes to deploy, embedding results in wider knowledge, generating causal explanations, and so on. A picture of a new science emerges, involving a shift towards a more inductive enterprise, while maintaining many paradigmatic and realistic similarities to our current model of normal science.

There have been further remarks about the introduction of data-scientific methods into the social sciences. Lazer et al. (2014) stress the emergence of “computational social science”, and Miller (2010) observes the proliferation of data in the context of regional and urban science. In both cases, the potential for data to reshape social-scientific practices is acknowledged. However, authors have noted the dissimilarities between natural and social sciences, which likely mean that the impact of data on the two categories will differ.

It is likely that the future of data-intensive science will still be theory-based, though agnostic, data-scientific methods will sometimes be used to assist in theory generation. Since Reichenbach (1938), there has been a popular distinction made in the philosophy of science between the contexts of discovery and justification: where a theory came from is irrelevant to whether the theory is sound. Consequently, it has become orthodox to consider scientific theories only for their own content, independent of their origins. The genealogy of our scientific knowledge has, classically, never been of epistemic relevance.

This distinction may be called into question by the possibility of agnostic science in a data-intensive paradigm. In such a paradigm, the genealogy of agnostic knowledge that is generated autonomously from data becomes important: its epistemic standing supervenes on the tools and algorithms of data science that generated it and on the quality of the antecedent data. Thus, the reliability of automated inferences depends on the quality of the underlying data and the algorithm(s) used to extract information from them. Such questions about theory genealogy are perhaps too often ignored by modern philosophies of science that inform “gnostic” paradigms. A philosophy of science in a data-intensive paradigm may be forced to address them more directly.

In Sect. 5, we highlighted the relationship between algorithmic opacity and agnostic aspects of science. There, we discussed how opaque, uninterpretable algorithms may prevent underlying causal connections between phenomena from being inferred from correlations in data. If algorithms become responsible on a wide scale for recognising correlations in data, their interpretability becomes essential to understanding the explanatory grounds of those correlations. Thus a theory-free scientific paradigm, or a paradigm in which algorithms start to play a more autonomous role in the scientific method, ought to concern itself with developing frameworks for addressing black-box problems and their implications.

Conclusion

In this article, we have provided a systematic and integrated review of the current landscape of the epistemology of data science. We have focused on critically evaluating it and on identifying and characterising some of its pressing or obvious gaps of philosophical interest. We have structured this reconstruction into five areas: descriptive and normative accounts of the composition of data science; reflections upon the kind of enquiry that data science is; the nature and genealogy of the knowledge that data science produces; “black box” problems; and the nature and standing of a new frontier within the philosophy of science that is raised by data science. Each of these areas is home to a variety of important issues and active debates, and each area interacts with the others. The resulting picture is a rich, interconnected, and flourishing epistemology, which will continue to expand as both philosophical and technological progress is made, possibly influencing other interconnected views about the nature of science and its foundations.