1 Introduction

Machine learning (ML) algorithms are increasingly integrated into every aspect of our lives. Whether it is recommending movies, diagnosing diseases, or defending against cyberattacks, ML has emerged as one of the dominant technologies of our age, powering countless advances in diverse fields such as robotics, particle physics, and personal finance. Practitioners typically distinguish between three main branches of ML: supervised, unsupervised, and reinforcement learning.Footnote 1 The first pertains to prediction problems, e.g., labeling images. The third pertains to policy optimization, e.g., learning board games. The subject of this article is the second branch, for which pithy descriptions are not nearly so straightforward.

To the extent that contemporary philosophers have engaged with ML at all, the focus has overwhelmingly been on supervised and reinforcement learning. For instance, many ethicists have critically evaluated the social, economic, and political impact of automated decision systems, typically identified with supervised learning models for scoring, ranking, and classification (Gabriel, 2020; Mittelstadt et al., 2016; Tsamados et al., 2021; Zimmermann & Lee-Stronach, 2021). Philosophers of science have exploited the mathematical guarantees of supervised and reinforcement learning to defend or refine favored theories of knowledge and complexity (Corfield et al., 2009; Harman & Kulkarni, 2007; Schurz, 2019; Sterkenburg & Grünwald, 2021). Reinforcement learning has become a popular tool in evolutionary game theory, social epistemology, and other settings where interactions between agents and environments can lead to unexpected results (Barrett et al., 2019; LaCroix, 2020; Mayo-Wilson & Zollman, 2021; Skyrms, 2010). This sampling is no doubt imperfect and incomplete, but it serves to capture general trends and highlight an obvious lacuna.

Unsupervised learning is simply not a mainstream subject of contemporary philosophical discourse. Exceptions to the trend are notable primarily for their scarcity. Clustering algorithms have inspired some theoretical analyses (Ackerman & Ben-David, 2009; Hennig, 2015), primarily among practicing statisticians and computer scientists—more on this in Section 2. A class of generative models known as “deepfakes” has attracted understandable concern among ethicists (de Ruiter, 2021; Öhman, 2022), as well as reflections on the nature of synthetic media (Millière, 2022)—more on this in Sections 4 and 5. The success of adversarial methods, which fool high-performance classifiers by adding strategic perturbations to input data, has led some to speculate that these tools can detect genuine features of our reality that are beyond human understanding (Buckner, 2020), an intriguing but so far underdeveloped proposal. In a recent article, Pääkkönen and Ylikoski (2021) argue that topic modeling—an unsupervised learning algorithm for text mining—can help ground hermeneutic interpretation in objective evidence. Their analysis is limited to text data, with a focus on social scientific applications. Several recent bibliometric studies have used topic modeling to analyze large corpora of philosophical publications (Kinney, 2022; Malaterre et al., 2021; Noichl, 2021). Though unsupervised learning methods are used in these works, they are not themselves the subject of inquiry. Unsupervised learning has received some attention in cognitive science and philosophy of mind, but even here the subject tends to get short shrift. For instance, proponents of predictive processing have suggested that their framework may accommodate unsupervised learning under certain conditions (Clark, 2017; Hohwy, 2020). This claim is typically made in passing, with little or no evidence, followed by copious supervised and reinforcement learning examples.

It is unclear why unsupervised learning algorithms have been so philosophically neglected, especially given their widespread use and impressive performance. Such technologies underlie popular models for fraud detection, machine translation, cancer subtyping, image compression, and countless more applications across various domains. One possible reason for this silence is that philosophers are generally partial to well-defined concepts and clear success criteria, neither of which is immediately forthcoming in this subfield. By far the least cohesive branch of ML, unsupervised learning is a heterogeneous mix of methods left over after we have defined away the neater areas of supervised and reinforcement learning. According to the authors of one famous ML textbook, the field is characterized by “learning without a teacher….The goal is to directly infer the properties of [a] probability density without the help of a supervisor…providing correct answers or degree-of-error for each observation” (Hastie et al., 2009, p. 486). In other words, unsupervised learning algorithms seek to find structure in data without recourse to labels (as in supervised learning) or reward signals (as in reinforcement learning). Just what “structure” amounts to, and how we should determine better or worse structures without any external feedback, are problems that each unsupervised learning algorithm must resolve in its own way.

I shall argue that unsupervised learning poses profound philosophical questions that are distinct from those raised by alternative branches of ML. These questions pertain to how and whether we can identify natural kinds, infer essential and contingent properties, and imagine unrealized possibilities—all in a fully adaptive, data-driven manner that eschews supervision. I examine three canonical unsupervised learning tasks—clustering (Section 2), abstraction (Section 3), and generative modeling (Section 4)—and show how each forms a metaphysical hypothesis about its training data, and by extension the target system. I submit that unsupervised learning is ontologically fundamental—more so, at any rate, than supervised or reinforcement learning algorithms, which necessarily presume some a priori division of the world into endogenous inputs (features) and exogenous outputs (errors/rewards). Following this critical exegesis, I examine the ethical implications and epistemological limits of unsupervised learning (Section 5). I argue that the risk of self-delusion is especially acute with these algorithms, since their “fully automated” nature may seduce us into believing strange, potentially dangerous conclusions. I close with a summary and proposals for future research (Section 6).

Before going any further, I must make a few brief comments on notation. Though I will strive to keep statistical jargon to a minimum, I am afraid some formalisms are unavoidable. Unsupervised learning generally begins with a matrix of training data X, consisting of n samples and d features. A single observation is represented by a vector \({{\varvec{x}}}_{i}, i\in \left\{1, \dots , n\right\}\), which can be considered a point in d-dimensional space. Features \({X}_{j}, j\in \{1, \dots ,d\},\) may in principle be continuous or discrete. Since mixed covariates pose challenges for the methods below—challenges that are far from insurmountable but frankly irrelevant for our purposes—we will assume for simplicity that all features are continuous. The domain of \({X}_{j}\) is denoted with calligraphic \({\mathcal{X}}_{j}\), while the complete domain is \(\mathcal{X}={\mathcal{X}}_{1}\times \dots \times {\mathcal{X}}_{d}\). Data are distributed according to some distribution P with probability density function p. Though some unsupervised learning tasks (e.g., density ratio estimation) explicitly assume that data are generated according to several distinct processes, we will simplify matters by assuming that samples are independent and identically distributed.
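For readers who think in code, this setup can be summarized in a few lines of Python; the dimensions and the Gaussian draw below are placeholders with no substantive significance, chosen only to mirror the notation above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n, d = 500, 4                  # n samples, d continuous features (placeholders)
X = rng.normal(size=(n, d))    # training matrix X: one row per observation

x_i = X[0]                     # a single observation: a point in d-dimensional space
print(X.shape, x_i.shape)      # (500, 4) (4,)
```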

2 Clustering

Clustering is a basic statistical task with a historical pedigree that long predates the latest AI hype cycle. However, just as supervised learning outgrew its origins in linear regression through advances in computing and statistics, clustering has undergone a revolution of sorts in the last 10–20 years. This is partly due to technological developments—it is possible now to efficiently cluster enormous datasets in high dimensions with massively parallel processing, as well as to resample data for greater cluster stability—but also to theoretical progress. Inferential techniques for data clustering have put the field on firmer footing (Tibshirani et al., 2001; John et al., 2020), while approximate solutions have led to dramatic speedups (Abboud et al., 2019; Cohen-Addad et al., 2019). Meanwhile, impossibility results have strictly delimited what such algorithms can and cannot achieve (Kleinberg, 2002; Ben-David & Ackerman, 2008; Cohen-Addad et al., 2018).

The basic goal of any clustering algorithm is to identify meaningful subgroups in a collection of data. These may correspond to different hand-written digits, subspecies of iris plant, or any other grouping of interest. As a running example, let us say that X is a large collection of photographs, each depicting either a cat or a dog. Unlike a supervised classifier, which learns to label new datapoints based on observed instances (“This is a cat, that is a dog”), clustering algorithms must learn the categories themselves (“There are two different animals here, let’s call them cat and dog.”). Though many variants exist, I will focus on two popular algorithms that hopefully convey the basic logic behind clustering analysis and give some sense of the diversity among such methods: k-means and hierarchical clustering.

The k-means algorithm takes as input the data matrix X and some target number of clusters k. The goal is to find k representative d-length vectors called centers, where the jth coordinate is the mean value of the variable \({X}_{j}\) among samples in the corresponding cluster. Each center can be thought of as a sort of prototype, exemplifying the traits we most associate with a given cluster. For example, these may describe the most feline cat or the most canine dog. Note that, because each coordinate in the center is given by a groupwise mean, it is likely that the resulting prototype is not identical to any observed sample. Instead, the center describes a hypothetical datapoint, a sort of Platonic ideal against which all others are compared. These centers are computed so as to minimize the sum of squared distances from each sample to its nearest center, i.e., the within-cluster variance. This problem is NP-hard and thus exact solutions are generally infeasible. However, iterative approximations work well in practice and tend to converge on local optima in reasonable time (Hartigan, 1975; Lloyd, 1982). See Fig. 1 for a visual summary. Typically, we try out some range of values for k and evaluate the fit by means of one or several scoring metrics, potentially including a penalty for large k.

Fig. 1 Visual overview of the k-means algorithm. Random centers are selected in the first step (A), in this case with k = 3. Clusters are formed by associating each point with its nearest center (B). A new center is calculated for each resulting cluster (C). Steps (B) and (C) are repeated until convergence (D). From Wikipedia (2022)
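For concreteness, here is a minimal sketch of the procedure described above, using scikit-learn's KMeans. The three synthetic Gaussian blobs, the range of candidate k, and the silhouette criterion are illustrative assumptions on my part, not features of the algorithm itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy data: three Gaussian blobs standing in for the groups of interest.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 3, 6)])

# Try out a range of values for k and score each fit, as described in the text.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))

# The fitted centers are the "prototypes": groupwise means that need not
# coincide with any observed sample.
centers = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).cluster_centers_
```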

Hierarchical clustering differs from k-means in several respects. First, it transforms the input data X into a symmetric \(n\times n\) matrix \(\mathbf{D}\), where entry \({\mathbf{D}}_{ij}\) represents the dissimilarity between samples \({{\varvec{x}}}_{i}\) and \({{\varvec{x}}}_{j}\). This determines the geometry of the problem, which may differ substantially from that of k-means if we use a non-Euclidean distance measure. Second, as the name suggests, the method assumes a nested structure among the clusters. For instance, in our running example, the optimal cutoff for k = 2 should clearly distinguish cats from dogs. But say our dataset includes two distinct breeds of each animal. Then, the optimal value of k may in fact be 4, although each smaller cluster is a proper subset of one of the larger ones. Unlike k-means, hierarchical clustering algorithms do not take k as input but instead compute the full solution path for \(k\in \{2,\dots ,n\}\), recursively partitioning the remaining samples as k increases. Results are typically visualized as a dendrogram, with the full dataset included at the root and individual observations in the leaves (see Fig. 2). Optimal tree height is determined either visually or via some scoring heuristic. Hierarchical clustering arguably stretches our notion of optimality, since it can be informative to know which samples are co-clustered at varying resolutions. Results for our pet example with k = 2 and k = 4 are both “right” at different levels of granularity. While k-means can in principle also return these same clusters at k = 2 and k = 4, there is no guarantee that its assignments form a recursive partition since the algorithm starts anew at each value of k. Thus, hierarchical clustering automatically imposes a sort of consistency condition that is generally violated by k-means.

Fig. 2 Example dendrogram trained on Fisher’s iris dataset, with colors corresponding to the best three-cluster partition
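A corresponding sketch for hierarchical clustering, using SciPy's common agglomerative (bottom-up) implementation rather than the divisive variant; the Euclidean metric, Ward linkage, and iris data are illustrative defaults, and the nested structure of the cuts is the point of interest.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.datasets import load_iris

X = load_iris().data

# Pairwise dissimilarity matrix D (Euclidean here, but any metric will do).
D = pdist(X, metric="euclidean")

# Agglomerative clustering builds the full merge tree in a single pass.
Z = linkage(D, method="ward")

# Cutting the tree at different heights yields nested partitions:
labels_k2 = fcluster(Z, t=2, criterion="maxclust")   # coarse grouping
labels_k4 = fcluster(Z, t=4, criterion="maxclust")   # finer grouping, nested in the first

# dendrogram(Z) draws the kind of tree shown in Fig. 2.
```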

The question of how to divide the world into proper units—to “carve nature at the joints,” in Socrates’s immortal words—is an ancient philosophical problem, better known today as the problem of natural kinds. The literature on this topic is enormous and varied—for a good overview, see Bird and Tobin (2022)—but two clear strands have emerged. One is focused on the ontological question: What is a natural kind? The other seeks to answer the epistemological question: How can we learn to identify natural kinds? I shall argue that clustering algorithms offer practical answers to both, although the two are not equally convincing.

Take the epistemological question first, as the solution here is more direct. Consider the following reply:

EC: We learn to identify natural kinds via clustering algorithms, or something very much like them.

Whether such algorithms are implemented via computer chips or human brains is irrelevant, as functionalists have long argued.Footnote 2 Both strategies described above—clustering by prototypes or by recursive partitioning—are common and legitimate in different domains. For instance, grouping by prototypes is an example of case-based reasoning, in which we reduce a collection to an ideal type that represents the whole. This approach is widely used in areas including clinical medicine, law, and sociology (Kolodner, 1993). Meanwhile, the method of recursive partitioning is familiar to anyone who has ever scanned a phylogenetic tree, where diverse evolutionary lineages are traced back to common ancestors. Creating and debating these taxonomies is an ancient tradition, going back at least as far as Plato’s method of collection and division. The practice was later popularized by Linnaeus as a strategy for sorting biological species, and served as an inspiration for Darwin’s theory of natural selection. These examples help delimit the scope of EC, as they remind us that natural kinds (if indeed anything corresponds to the name) should permit inductive inferences and participate in natural laws. Of course, clustering algorithms on their own cannot guarantee this. However, they can offer data-driven strategies for forming hypotheses that can subsequently be tested through standard scientific methods.

This analysis suggests one plausible reply to the ontological question.

OC: Natural kinds are what clustering algorithms ought to find under ideal conditions.

OC does little to resolve substantive issues around natural kindhood without further elaboration—the modal “ought” smuggles in all manner of normative baggage, while “ideal conditions” are underspecified and potentially question-begging—but the basic strategy is familiar. Much like how Turing (1950) famously replaced a metaphysical question (“Can machines think?”) with a methodological one (“How can we tell?”), clustering algorithms reduce murky debates about identity and essence to formal procedures for grouping elements together based on precise notions of similarity and difference. If two people disagree on whether some elements form a natural kind, it may well be that they are applying different clustering algorithms, or perhaps the same algorithm with different inputs. Computational methods can make underlying assumptions explicit, thereby promoting critical discussion and hopefully greater consensus.

This reductive move is tempting and often successful. However, OC fails in general unless we impose some further ad hoc restrictions on natural kinds to guarantee their effective computability.Footnote 3 Perhaps the relevant notions of distance are hopelessly imprecise, or the complexity of the requisite clustering procedure is so great that it will not complete before the heat death of the universe. Even under a generous interpretation of “ideal conditions,” some natural kinds may remain unidentifiable due to their intrinsic ineffability, or perhaps some fundamental limitation that prevents us from gathering and processing the evidence required to discover them. For this reason, I judge EC to be more convincing than OC. Clustering algorithms can help us find natural kinds, but the purported methodological reduction carries little or no ontological weight.

I will revisit issues concerning the reliability and testability of unsupervised learning algorithms in Section 5. For now, I will simply point out that this proposed reduction by no means entails some radical brand of constructivism. The claim is not that natural kinds come into being if and only if some agent happens to implement a particular computational process. Just as the mathematical realist insists that two plus two equaled four long before anyone was around to add numbers together, we may readily acknowledge that natural kinds, if they exist at all, surely predate any actual clustering method. But this is a mere historical coincidence. The algorithm itself is an abstraction, a sequence of steps which, when applied, produces some output. OC merely suggests that learnability via clustering is a necessary condition for natural kindhood, regardless of whether such learning ever in fact takes place. Yet there is no a priori reason to believe that all natural kinds are effectively computable via clustering algorithms, even ones that have yet to be devised. In arguing against OC, I conclude that identifiability via clustering algorithms is not a necessary condition for natural kindhood.

But is it sufficient? Any affirmative answer must provide further details on the purported “ideal conditions” requirement of OC, which will no doubt strike skeptics as inherently subjective. Ideal for whom? Toward what end? A full treatment of these issues is beyond the scope of this article, but I can sketch one possible direction. I take a broadly pragmatic view of natural kinds that is liable to disappoint the realist every bit as much as it does the antirealist. Any grouping of elements into kinds is undertaken within some context and for some purpose. That is why clustering algorithms may succeed at varying levels of abstraction, e.g., grouping species together at k = 2 and breeds at k = 4. Each solution is the right answer to a different question. Of course, this does not mean that any context and purpose will do. For starters, the “natural” in “natural kinds” places a nontrivial constraint on candidate inputs (pianos and laptops do not form natural kinds). This goes back to the aforementioned criterion that natural kinds should permit inductive inferences and participate in natural laws. However, the granularity and scope of such inferences and laws is itself a function of context and purpose.Footnote 4 It may be possible to describe, say, the migratory patterns of birds in terms of atoms and electrons, but such a Herculean feat would be profoundly unilluminating. The task of the statistician is to ensure that inputs align with analytical goals, giving the clustering algorithm the best chance to discover what it is we want to know. In this way, natural kinds and clustering algorithms may even be co-constitutive. This accords with a sort of Kantian metaphysics, in which the world gains meaning only through mental (or computational) representation.

I will go to no great lengths defending this brand of pragmatism or debating the proper interpretation of transcendental idealism, matters which, despite their evident interest, are clearly exogenous to the present discussion. I merely raise them to illustrate how fluid the borders become between metaphysical and methodological matters once we begin to seriously reflect on the mechanics of unsupervised learning. No matter one’s position on these issues, there seems little question that clustering is a basic function of human thought and language. This is a claim not about how infants come to learn nouns—that is more accurately described by supervised or reinforcement learning—but about how language itself came into being in the first place. Dividing one’s perceptual field into a collection of things with names that persist under some range of conditions was a key step in the development of complex concepts for individuals and collaborative communication for groups. If we accept Chomsky’s contention that the subject-predicate structure is part of the so-called universal grammar (Cook & Newson, 2007), then some capacity for clustering may be literally hardwired into human brains. For better or worse, clustering is inescapable.

3 Abstraction

I confess that my nomenclature in this section is somewhat unorthodox. Abstraction has a different meaning in the formal methods literature (Wang & Tepfenhart, 2019), while the algorithms I describe below go by various other names in ML (e.g., embedding, autoencoding, projection, coarsening). I refer to them collectively as abstraction algorithms to avoid the specific assumptions and problem setups associated with those terms, as well as to highlight the connections between these techniques and related subjects in philosophy. This usage has roots in Floridi’s (2008) levels of abstraction. More recently, Buckner (2018) has studied how deep convolutional neural networks operationalize modes of transformational abstraction that were originally conceived (but never fully explained) by classical empiricists such as Locke and Hume. The operative sense of “abstract” is the following, from Merriam-Webster: “to consider apart from application to or association with a particular instance.” Simply put, abstraction algorithms (as I shall use the term) are a family of methods for learning simplified representations of data. These may be latent variables defined via transformations of input features, or coarse-grained variables forged by some data-driven discretization process. In either case, the result is typically a dimensionality reduction of the input space.Footnote 5 In classical statistics, this was achieved via parametric methods such as principal component analysis (Jolliffe, 2002). More recently, the area has been dominated by adaptive alternatives, especially those based on deep neural networks (Goodfellow et al., 2016).

The most straightforward introduction to abstraction comes from information theory.Footnote 6 Consider the images depicted in Fig. 3. Each is made by the same data generating process, with varying degrees of additive noise. In each case, the dataset consists of a \(20\times 20\) matrix of \(1\)’s and \(0\)’s, i.e., \(400\) bits. But more efficient encodings are possible. Rather than spell out the values of each cell in the matrix, we could exploit the evident checkerboard pattern to obtain a lossless compression of Fig. 3A with a suitably digitized version of “Alternating \(5\times 5\) patches of \(1\)’s and \(0\)’s,” resulting in a \(16\)-bit compression. The result is an abstraction—a description of the high-level pattern, rather than the low-level details. The same high-level pattern can in principle be used to describe Fig. 3B—but notice that if we use the same encoding in this case, the reconstruction is no longer lossless, for it fails to account for the noise. Thus, the deviations that distinguish Fig. 3A from Fig. 3B would be lost in such a compression. In some applications, we may not care about such errors. In fact, if we are more interested in patterns than pixels, then we may prefer the abstraction to the original. In other settings, details matter a great deal. The tradeoff between compression efficiency and resulting error rates is unavoidable in information theory, and the optimal balance is inevitably context-dependent. A similar tradeoff is evident in supervised learning, where we must strike some balance between accuracy (minimizing bias) and complexity (minimizing variance).Footnote 7

Fig. 3 Binary grids implementing noiseless (A) and noisy (B) patterns, respectively
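The checkerboard example can be reproduced directly. The noise level and the particular way of digitizing the compact description are assumptions of mine; the point is simply that the same short description reconstructs the clean grid exactly but discards the deviations in the noisy one.

```python
import numpy as np

# Noiseless grid (Fig. 3A): alternating 5x5 patches of 1's and 0's, 20x20 = 400 raw bits.
patch_layout = np.indices((4, 4)).sum(axis=0) % 2
A = np.kron(patch_layout, np.ones((5, 5), dtype=int))

# Noisy grid (Fig. 3B): the same pattern with a small fraction of cells flipped.
rng = np.random.default_rng(0)
flips = rng.random(A.shape) < 0.05
B = np.where(flips, 1 - A, A)

# The compact description "alternating 5x5 patches, starting with 0" suffices
# to reconstruct A exactly (lossless), but not B (the noise is discarded).
A_reconstructed = np.kron(np.indices((4, 4)).sum(axis=0) % 2, np.ones((5, 5), dtype=int))
print((A_reconstructed == A).all())    # True: lossless for the clean grid
print((A_reconstructed == B).mean())   # roughly 0.95: lossy for the noisy grid
```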

One approach to abstraction in ML is based on coarsening. This is a conceptually simple procedure in which we describe the data in larger units, e.g., molecules instead of atoms or cities instead of neighborhoods. Techniques for learning such partitions are typically based on clustering algorithms of the sort described in Section 2. Popular examples include superpixels (Stutz et al., 2018), which segment images into contiguous patches. The approach has special appeal in cases where macro-objects may be more meaningful or useful for practitioners than micro-objects. This is the case, for instance, in some causal modeling tasks, where our ability to measure complex systems in fine detail exceeds the precision of possible interventions. Just because we can observe RNA abundance at the single-cell level does not mean we can reliably stimulate individual transcripts. On the contrary, drug therapies are often messy affairs, activating and deactivating regulatory pathways in a complex, unpredictable cascade. Causal feature learning is an unsupervised data coarsening technique aimed at recovering high-level variables more amenable to intervention than low-level data (Beckers et al., 2019; Chalupka et al., 2017; Kinney & Watson, 2020). Such abstractions have obvious appeal for planning and design. Indeed, the merits of abstraction are well-appreciated in the sciences, where modeling complex systems always involves a degree of idealization (Potochnik, 2017).

Neural networks take an entirely different approach to the problem of abstraction, most famously through a family of unsupervised learning algorithms called autoencoders. First, an input layer of data \(\mathbf{X}\) is mapped to an internal layer with fewer nodes. In deep autoencoders, a sequence of such mappings compresses the data still further. The result is a dense matrix \(\mathbf{Z}\in {\mathbb{R}}^{n\times {d}_{z}}\), with \({d}_{z}<d.\) This object is called an embedding or a representation. So far, this more or less tracks the architecture of a supervised network, where internal layers are optimized to predict some outcome variable. In autoencoders, by contrast, the goal is to reconstruct the original data. In other words, we have two functions: an encoder \(f:\mathcal{X}\mapsto \mathcal{Z}\) and a decoder \(g:\mathcal{Z}\mapsto \mathcal{X}\), which together form a sort of hourglass structure. With linear activations and a fully connected architecture, autoencoders instantiate a form of classical principal component analysis; with nonlinear activations, autoencoders can capture a far richer set of complex interactions. When models are successful, simple geometric or arithmetic operations in the latent space often have clear semantics at the input level. To cite a famous example, word embeddings—mathematical representations of text data—have been known to capture analogical relations (Mikolov et al., 2013). This results in equations of the sort “king” − “man” + “woman” \(\approx\) “queen,” where the words are replaced with their latent vector representations as learned by the autoencoder.
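To fix ideas, here is a minimal autoencoder sketch in PyTorch. The layer widths, activation functions, and training schedule are placeholder assumptions; what matters is the hourglass structure of encoder f, bottleneck Z, and decoder g, trained on a reconstruction objective with no labels or rewards.

```python
import torch
import torch.nn as nn

d, d_z = 100, 8     # input dimension and (smaller) latent dimension

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder f: X -> Z (compression)
        self.encoder = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, d_z))
        # Decoder g: Z -> X (reconstruction)
        self.decoder = nn.Sequential(nn.Linear(d_z, 32), nn.ReLU(), nn.Linear(32, d))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X = torch.randn(512, d)                          # placeholder training data

for _ in range(200):
    loss = nn.functional.mse_loss(model(X), X)   # reconstruction error only
    opt.zero_grad(); loss.backward(); opt.step()

Z = model.encoder(X)                             # learned embedding, shape (512, d_z)
```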

Though originally conceived for generic statistical tasks, autoencoders and other latent variable methods have been repurposed for causal discovery and inference. Early work in this area includes the LiNGAM algorithm, which uses a form of independent component analysis to learn a graphical model over observed covariates (Shimizu et al., 2006). More recently, causal representation learning has become a major research area, with ML practitioners increasingly acknowledging the utility of causal tools for transfer and generalization (Schölkopf et al., 2021). The strategy here is very different to coarsening, where we simply chunk the space into larger buckets. Instead, causal disentanglement algorithms such as \(\beta\)-VAE aim to learn latent factors that correspond to independent axes of variation, e.g., the style and content of an image (Higgins et al., 2017; von Kügelgen et al., 2021). In scientific applications, such embeddings may help to design interventions or estimate treatment effects when randomized control trials are impossible.

The success of autoencoders has sparked a great deal of discussion—and perhaps a bit of wonderment—among researchers. How can they be so effective with no supervision? One suggestion is the manifold hypothesis, which states that even in seemingly complex, high-dimensional domains such as text and genomics, data typically lie on a low-dimensional manifold that is far simpler than the original observations suggest. This hypothesis has received theoretical and empirical support (Fefferman et al., 2016). If true, it may go some way toward explaining the unreasonable effectiveness of autoencoders. By learning the most efficient and accurate compression, we essentially approximate the true target manifold. That is why, for example, two-dimensional projections of transcriptomic profiles (with typical input dimensions of about 40,000) are often sufficient to linearly separate disease and healthy samples in cancer studies. With the proper encoding scheme, information loss is minimal even under such dense compression.
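A stylized simulation conveys the logic. Here the data are deliberately fabricated so that the relevant variation lives on a two-dimensional subspace embedded in thousands of features; the dimensions, noise level, and choice of PCA as the compression step are arbitrary assumptions, not a claim about any real transcriptomic dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 200, 5000
# Two groups whose difference lives on a 2-dimensional latent space...
latent = rng.normal(size=(n, 2)) + np.repeat([[0, 0], [3, 3]], n // 2, axis=0)
# ...embedded in a huge, noisy feature space.
X = latent @ rng.normal(size=(2, d)) + 0.5 * rng.normal(size=(n, d))
y = np.repeat([0, 1], n // 2)

# Project to two dimensions, then test linear separability of the groups.
Z = PCA(n_components=2).fit_transform(X)
acc = cross_val_score(LogisticRegression(), Z, y, cv=5).mean()
print(round(acc, 2))   # typically near 1.0 when a low-dimensional structure exists
```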

The nearest analog to the manifold hypothesis in the philosophical canon is Plato’s cave. Just as the prisoners in that famous allegory are forced to stare at shadows, limiting their reality to a dark, confusing jumble, so we are deceived into believing that the world is far more complex and chaotic than it really is. In fact, our observations are little more than distorted reflections of a relatively parsimonious reality that lies just beyond our powers of perception. In this analogy, abstraction algorithms serve the function of philosophers, providing us with tools to access the luminous substrata that explain the vagaries of our phenomenal experience. (Fortunately, in this version, we are not compelled to murder anyone.) For Plato, the hidden light corresponds to the Forms, immaterial essences that animate the imperfect imitations we find in the material realm. We need not follow Plato quite so far out of the cave to acknowledge a categorical distinction between essential and contingent properties, i.e., those that an object must have and those it could do without, respectively. Modern essentialism comes in many forms—see Robertson and Atkins (2020) for a good overview—but realists about essences (e.g., Kripke (1980), Ellis (2001), and Williamson (2013)) are unified in their belief that some such distinction obtains. For instance, I could have long or short hair depending on my taste. I could not, however, fail to be human. This is because my hairstyle is contingent, while my humanity is not—a non-human version of me would simply not be me.

I submit that abstraction algorithms can and often do help to infer essential properties. Just as with the discussion of natural kinds in Section 2, this claim has both epistemological and ontological dimensions. Specifically, we have the following thesis:

EA: We learn to identify essential properties via abstraction algorithms, or something very much like them.

It is worth reiterating that I take a functionalist view of computation on which the details of physical implementation are basically irrelevant. A pragmatic interpretation of constituent terms once again avoids the Scylla of antirealism and the Charybdis of realism. That said, proponents of each camp may find something to like in these theses—the former by arguing that essential properties are co-constituted by certain methods of abstraction, the latter by arguing that essences not only exist but may be learned from data. If any object has essential properties, then there must be some simplified representation thereof—stripping away the details and focusing instead on form, e.g., by ignoring irrelevant distinctions (coarsening) or reducing the whole to some relatively small set of independent components (projection). This procedure, however it is carried out, amounts to an abstraction algorithm.

But does this license an ontological reduction? Consider the claim:

OA: Essential properties are what abstraction algorithms ought to find under ideal conditions.

This thesis may seem attractive to those eager to supplant metaphysics with methodology. However, OA fails in general for the same reason that OC fails. Even if we could identify satisfactory, noncircular “ideal conditions” for this problem, there can be no guarantee that all essential properties are effectively computable. Just because there is light behind the shadows does not mean that we will always find it. On the contrary, some objects may defy all efforts to reduce them to their essential form. This is a claim not about the limits of our cognition or technology but about the limits of our world. Abstraction algorithms may be sufficient to identify some essential properties, but essential properties are not necessarily identifiable via abstraction algorithms.

Though nothing in the mechanics of thought or language commits us to any particular theory of essence and contingency, an ability to appreciate a plurality of grammatical moods—to understand that though the world is this way, it might have been some other way—is a hallmark of intelligent behavior. Planning and other forms of higher order cognition require predication within and across modalities, skills that are inconceivable without some capacity for abstraction. The question of whether efficient and accurate encodings can in fact be learned—i.e., whether the manifold hypothesis applies—is clearly an empirical matter that must be evaluated on a case-by-case basis. It certainly appears to hold in many important areas of interest, though it would be naïve to assume that this entails some universal law. In summary, abstraction algorithms raise much the same philosophical issues for essentialism that clustering algorithms raise for natural kinds.

4 Generative Models

Generative models create synthetic data that should, ideally, be indistinguishable from the real thing. Such methods can be used for data imputation or augmentation in supervised, reinforcement, or self-supervised learning tasks. They can also be used for deceptive purposes, as in adversarial attacks or so-called deepfakes. The former occurs when samples are generated with the express intent of fooling another model, e.g., by adding noise to an image in order to trick a classifier into mislabeling samples. The latter refers to a class of models designed to output highly realistic image, audio, or video data of people saying and doing things they never in fact said or did. I defer a discussion of the ethical aspects of this technology to Section 5. For now, the point to emphasize is that generative models are a powerful and relatively novel innovation in ML research. Though density estimation procedures have been in use for decades, methods were far too crude for realistic data synthesis in high dimensions until quite recently. Current models, by contrast, can make convincing forgeries of landscapes in the style of Cezanne or photorealistic portraits of nonexistent people (see Fig. 4). I will focus in particular on generative adversarial networks (GANs) and tree-based approaches.

Fig. 4 Synthetic portraits created by StyleGAN. From Karras et al. (2019)

GANs are a popular class of deep learning algorithms originally proposed by Goodfellow et al. (2014). Many variations exist—for overviews, see Creswell et al. (2018) and Gui et al. (2021)—but the unifying idea behind them all is a zero-sum game between two networks, a generator f and a discriminator g. First, f samples some synthetic data, typically just noise to start. This fake data is then fed along with real data to g, whose job is to classify samples as genuine or synthetic. In the early rounds, f’s attempted forgeries are poor, and g has no trouble distinguishing between samples. However, network weights are updated after each round, resulting in improved performance for the generator f. Meanwhile, the discriminator g updates its weights to try and keep up with its opponent. With sufficient network capacity, the forgeries will continue improving until g can do no better than random guessing. At this point, we say that f has learned the true data distribution, and the competitors have converged on the game’s Nash equilibrium.
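The adversarial game can be written down compactly. The following PyTorch sketch uses the standard binary cross-entropy objective on toy two-dimensional data; the network sizes, learning rates, and training schedule are illustrative assumptions rather than settings drawn from any cited work.

```python
import torch
import torch.nn as nn

d, d_noise = 2, 8
G = nn.Sequential(nn.Linear(d_noise, 32), nn.ReLU(), nn.Linear(32, d))          # generator f
D = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # discriminator g
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_data = torch.randn(10_000, d) * 0.5 + 3.0   # stand-in for the "true" distribution

for step in range(2_000):
    real = real_data[torch.randint(0, len(real_data), (128,))]
    fake = G(torch.randn(128, d_noise))

    # Discriminator: label real samples 1, synthetic samples 0.
    loss_d = bce(D(real), torch.ones(128, 1)) + bce(D(fake.detach()), torch.zeros(128, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: try to make the discriminator call its forgeries "real".
    loss_g = bce(D(fake), torch.ones(128, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Near equilibrium, D's outputs hover around 0.5: no better than random guessing.
```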

Another class of generative models is based on classification and regression trees. These are the so-called weak learners, which split the input data X at some value of a (potentially random) feature \({X}_{j}\) according to an optimization objective, e.g., variance minimization. A sequence of such splits results in a partition of the input space into relatively small nodes called leaves. Though individual trees tend to be poor predictors for complex datasets, forest models—i.e., ensembles of trees—are among the most popular and effective of all ML algorithms. When trees are grown independently on perturbed datasets, results may be aggregated via averaging to create a random forest (Breiman, 2001). Alternatively, trees can be grown iteratively to improve upon the predictions of the previous tree, in a method known as boosting (Friedman, 2001). These algorithms are widely used for supervised learning, although they also have applications to unsupervised tasks. Generative models based on random forests go back at least a decade (Criminisi et al., 2012), with several more recent proposals showing strong results (Correia et al., 2020; Watson et al., 2023). Conditional density estimation is especially common with trees, as shown by numerous popular methods for data imputation and related tasks (Lundberg et al., 2020; Stekhoven & Bühlmann, 2012; Tang & Ishwaran, 2017). Forests can be readily compiled into probabilistic circuits (Choi et al., 2020), a large and well-studied family of computational graphs for generative modeling and inference.
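As a toy illustration of the "sample from the leaves" idea, and emphatically not a reimplementation of any of the cited methods, the following sketch grows a single tree on a random pseudo-target and then draws synthetic points from a Gaussian fitted within each leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 1000), rng.gamma(2.0, 1.0, 1000)])  # toy data

# Grow one "unsupervised" tree by regressing on a random projection of X,
# so that the splits track the joint structure of the features.
pseudo_target = X @ rng.normal(size=X.shape[1])
tree = DecisionTreeRegressor(min_samples_leaf=50, random_state=0).fit(X, pseudo_target)

# Fit a simple Gaussian in each leaf, then sample new points leaf by leaf.
leaves = tree.apply(X)
synthetic = []
for leaf in np.unique(leaves):
    members = X[leaves == leaf]
    mu, sd = members.mean(axis=0), members.std(axis=0) + 1e-6
    synthetic.append(rng.normal(mu, sd, size=members.shape))
X_synth = np.vstack(synthetic)   # same size as X, drawn from the piecewise model
```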

The logic behind GANs and forests is entirely different. The former effectively turns the problem into a reinforcement learning task, with the discriminator network g providing a sort of inadvertent reward signal to the generator f.Footnote 8 The more uncertain g is about the provenance of some sample, the better f has done creating realistic data. This is akin to a forger honing his craft by seeing what does and does not evade detection by the authorities; or, to take a less criminal example, a child learning to play an instrument by mimicking a melody until a blindfolded listener cannot distinguish between their version and a professional recording. Confounding a would-be discriminator is a familiar method in computer science, with origins in the Turing test. Crucially, in that case, the discriminating agent is human. But for many purposes, artificial agents do just as well. The idea has applications in statistics as well, where a generic form of the classic two-sample test is achieved by evaluating whether a classifier can distinguish between the generating distributions (Kim et al., 2021).
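The classifier two-sample test mentioned at the end of this paragraph is easy to sketch. The logistic regression, five-fold cross-validation, and informal reading of the accuracy below are simplifying assumptions of mine; a rigorous version would calibrate the test statistic against its null distribution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def classifier_two_sample_test(X1, X2):
    """Crude check: can a classifier tell the two samples apart?"""
    X = np.vstack([X1, X2])
    y = np.r_[np.zeros(len(X1)), np.ones(len(X2))]
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=5, scoring="accuracy").mean()
    return acc   # near 0.5 suggests the two samples share a distribution

rng = np.random.default_rng(0)
same = classifier_two_sample_test(rng.normal(size=(500, 3)), rng.normal(size=(500, 3)))
diff = classifier_two_sample_test(rng.normal(size=(500, 3)), rng.normal(0.5, 1, (500, 3)))
print(round(same, 2), round(diff, 2))   # near 0.5 vs. clearly above 0.5
```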

Forests, by contrast, take a more impressionistic approach to data synthesis. They break the input space down into a collection of subregions, effectively solving many simple problems instead of one hard problem. By sampling from the fitted densities in each leaf of the tree, we may approximate distributions of arbitrary complexity. The visual effect is not unlike a pointillist painting, where structure gradually emerges from a plurality of tiny uniformities. However, whereas the works of Georges Seurat have a distinctive fuzziness owing to deliberate pixelation, forest algorithms aim instead for maximal realism in their outputs. In this sense, they are less Seurat and more Chuck Close, who created massive, high-resolution portraits by laying a fine grid over a photograph and painstakingly recreating every cell (see Fig. 5). Because trees are independent in a random forest, the entire process can be efficiently computed with parallel processing. (Close, on the other hand, would routinely spend over a year on a single portrait.) It is worth noting that these visual analogies are merely illustrative. GANs and forests are by no means limited to image data.

Fig. 5 Georges Seurat’s “A Sunday on La Grande Jatte” (A) and Chuck Close’s “Big Self-Portrait” (B). Both artists implemented techniques similar to the tree-based generative modeling approach

Generative modeling is a uniquely creative task for AI. It pushes beyond the common goals of prediction and optimization into something more like imagination and fantasy. These are vital aspects of human cognitive development, as children learn to circumscribe the realm of the possible by recombining past experiences in novel ways. The gains do not end in childhood. Imagination is a key part of the methodological toolkit for physicists and philosophers, who often reason by way of thought experiment (Stuart et al., 2018). It can inspire artistic works of genius, or simply random doodles. Both styles of imagination described above—the top-down approach of GANs, which requires some abstraction as a basis, and the bottom-up approach of forests, which focuses instead on the particulars—are evident in human creativity. The former is clearest in dreaming or game play, where whole scenes unfold with few explicit rules but clear thematic unity. The latter is found not just in the visual arts, as described above, but also in literature, where an author may start with an outline that breaks the narrative down into a collection of scenes, working separately on each and only merging them into a single structure once the individual components are complete.

In one sense, it may seem that data synthesis is simpler than classical supervised learning. After all, there is a single right answer to the question of what output we should expect for a given input (subject to some irreducible uncertainty, itself precisely quantifiable), whereas many datasets could be plausible samples from a given distribution. Yet uniqueness is a poor heuristic for complexity, especially when there are so many more ways for the generative model to err. This should be apparent as soon as we consider the difference between (a) classifying images of cats and dogs, and (b) painting realistic images of cats and dogs. Small children can execute (a) with 100% accuracy; only devoted artists with years of training could possibly perform (b) with sufficient skill to convince viewers the result may be a photograph.

The difficulty goes back again to the subtle task of distinguishing essence from contingency. A properly trained generator network f has not only learned an abstraction \(\mathbf{Z}\) that approximates the target manifold; it has also learned how to project that low-dimensional representation back to the high-dimensional realm in novel, realistic ways. In other words, it has not just captured the essence of a domain, but additionally modeled its contingencies. To return to the allegory of the cave, generative algorithms must both capture the light and sample the shadows. These are dual tasks, just as necessity and possibility are dual operators in modal logic. However, whereas logic deals in binary evaluations, generative models must learn a probability distribution over the space of features that make up a possible world. In so doing, they delimit a horizon of possibility that constitutes a form of knowledge unto itself. This is what Williamson (2016) refers to as “knowing by imagining,” a capacity he argues serves clear evolutionary purpose as it helps agents to plan, empathize, and avoid unnecessary risks. If generative models can successfully imagine unrealized possibilities, then their achievement is both metaphysical, insomuch as the algorithm models probable contingencies in the world, and epistemological, insomuch as imagination is itself a form of knowledge.

These points can be spelled out in a pair of theses with a now familiar structure. First, consider the epistemic claim:

EG: We identify unrealized possibilities via generative models, or something very much like them.

It is difficult to imagine how the skeptic might rebut EG, short of taking a hard actualist line and simply denying that there are any such things as unrealized possibilities. If there are, however, then any method for reasoning about them, rightly or wrongly, must rely on some sort of mental representation of the relevant circumstances. On the functionalist philosophy of cognition that I have espoused throughout, this mental act requires a generative model more or less by definition. Of course, some models will be better than others—people err in their assumptions about possibilia all the time, just as some GANs may overfit or underperform, producing unconvincing samples that are noticeably synthetic. But even false judgments about unrealized possibilities require some generative reasoning to get off the ground.

Next, consider the ontological thesis:

OG: Unrealized possibilities are what generative models ought to learn under ideal conditions.

The attentive reader will likely anticipate my skepticism about OG, which cannot be affirmed without making the vaguely fantastical assumption that all unrealized possibilities are effectively computable. However, the methodological reductionist may insist that only effectively computable possibilities are true possibilities, the rest being mere whimsy or idle speculation. This move appears ad hoc and circular. It also runs into certain hard physical limits, given that faithful simulations of subatomic events require sophisticated quantum computing techniques that may prove intractable beyond relatively small-scale phenomena.

5 Discussion

In the preceding sections, I introduced a variety of unsupervised learning algorithms. My aim was not only to give some sense of the diversity of the field, but also to underscore the philosophical significance of these approaches, which provide data-driven methods for discovering natural kinds, learning essences, and sampling contingencies. Deep connections with ancient debates, not to mention basic functions of human language and cognition, substantiate my contention that unsupervised learning is not just some hodgepodge of statistical methods left over after we have analyzed the more substantive domains of supervised and reinforcement learning. On the contrary, unsupervised methods are fundamental and important, if chronically overlooked.

However, we should not be too sanguine about what these algorithms can achieve. Along with their distinctive metaphysical implications, unsupervised learning methods raise unique epistemological and ethical risks that merit closer inspection. It can be tempting to assume that the results of an unsupervised learning analysis are somehow “objective,” since they are untainted by human biases. For instance, if a clustering algorithm trained on gene expression data separates treated from untreated subjects, then this is arguably better evidence that the treatment causes transcriptomic effects than a supervised analysis that explicitly aimed to distinguish between the two classes. The hypothesis of genomic stratification is baked into the latter model, whereas the former has no such preconceptions. The point is valid, as far as it goes. However, any claims the algorithm may have to “objectivity” are conditioned upon user-supplied inputs and mediated by user-interpreted outputs. There is nothing automatic about the model’s data or hyperparameters, which inevitably encode certain assumptions and biases. Moreover, the reasoning behind algorithmic outputs is often opaque and inscrutable, raising major issues of trust and reliability.

Take hyperparameters first. As discussed above, clustering algorithms do not generally estimate the optimal number of clusters k. Some (like k-means) require this value upfront; others (like hierarchical clustering) compute results for all possible k. In both cases, biases can creep in. For k-means, issues may arise ex ante, as users neglect to test a sufficient range of values for k; for hierarchical clustering, the danger is primarily post hoc, as users more or less eyeball what they deem a reasonable cut point for the resulting dendrogram. A range of other hyperparameters may also influence results (e.g., maximum iterations or preferred distance measure). Moreover, since clustering algorithms do not test against the null hypothesis (i.e., k = 1), they may produce false positives, indicating cluster structure when in fact there is none. Some post-processing methods are designed to mitigate this (Tibshirani et al., 2001), but heuristic alternatives are still common.
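One partial remedy is to make the choice of k an explicit, testable step. The sketch below compares the best achievable silhouette score on the observed data against reference datasets with no cluster structure, loosely in the spirit of the gap statistic of Tibshirani et al. (2001); the uniform reference distribution, the silhouette criterion, and the number of replicates are my own simplifications.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_silhouette(X, k_max=8, seed=0):
    """Best silhouette score over a grid of candidate k."""
    return max(silhouette_score(X, KMeans(k, n_init=10, random_state=seed).fit_predict(X))
               for k in range(2, k_max + 1))

def null_check(X, n_ref=20, seed=0):
    """Compare the observed score against reference data with no clusters,
    drawn uniformly over the observed range of each feature."""
    rng = np.random.default_rng(seed)
    observed = best_silhouette(X)
    reference = [best_silhouette(rng.uniform(X.min(0), X.max(0), X.shape))
                 for _ in range(n_ref)]
    # A small exceedance proportion is evidence against the null of k = 1.
    return observed, float(np.mean([r >= observed for r in reference]))
```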

Similar concerns apply for abstraction algorithms and generative models, both of which require users to fix hyperparameters that govern network architecture, batch size, regularization penalties, split rules, stopping criteria, and various other components. There is no way to know how sensitive results are to these configurations without extensive trial and error. In practice, hyperparameters are often left at arbitrary defaults, or else chosen via grid search over some bounded set of candidates, which may themselves be arbitrary defaults. Interpretation is especially difficult for abstraction algorithms and generative models when the input space is relatively unstructured, as with generic tabular data. Whereas humans have a decent grasp of how to evaluate “right” and “wrong” patterns in text and images, the task becomes much harder in, say, chemometrics, where our understanding of the relevant laws is limited and intuition is a poor guide. This ties into more general concerns regarding model opacity, as the logic behind particular cluster assignments, abstractions, or generative sampling procedures may be too complex for users to understand or interrogate.

Another source of potential bias lies in the data itself. There is nothing “objective” about how datasets are collected, maintained, or disseminated. On the contrary, they are the product of individuals and groups with certain goals and beliefs operating in particular contexts. To borrow Leonelli’s (2016) memorable phrase, they take idiosyncratic “data journeys” through time and space, often under the aegis of large institutions working in their own interests. Whether the topic is economics, climate science, or particle physics, all datasets are selective in scope. When access is restricted, we cannot use the data without submitting to some external authority, further obscuring claims of objectivity under a dense web of social relations and power dynamics. Matters are no better when datasets are simulated, for then the sufficient statistics that govern the generating distribution are either arbitrary, in which case the previous objections regarding algorithmic hyperparameters apply, or derived from some prior dataset(s), in which case the present concerns over provenance and access remain. In short, pragmatic exigencies and contingent social arrangements permeate every stage of the modeling pipeline.

An extreme takeaway from these observations would be to conclude that unsupervised learning algorithms are simply inference engines for our own assumptions and biases. According to this view, these methods are basically vacuous, since they do little more than spell out the statistical consequences of our own choices—they do not, strictly speaking, add anything new. To this critique, three replies are in order. First, the speed and scalability of unsupervised learning algorithms are sufficiently great that it would be a mistake to suggest that they merely amplify preexisting beliefs. On the contrary, they represent a step change in our analytical capacity, enabling large-scale data mining procedures that can reveal unexpected patterns and generate novel hypotheses. To object that these conclusions could in principle have been deduced with pen and paper is akin to dismissing the printing press as essentially redundant, since all those bibles might just as well have been handwritten.

Second, the claim that “anything goes” when it comes to unsupervised learning is either false (if we restrict attention to actual practice) or irrelevant (if we consider exotic algorithms as a speculative exercise). Take clustering, for instance. The skeptic is right to point out that users may design distance functions that guarantee any arbitrary partition of the data. But is this a serious concern? Baroque and overly convenient feature mappings are a sure sign of statistical malpractice, and any critical investigation of the procedure will reveal this. Perhaps that is why practitioners generally stick with a small handful of common, simple distance measures. This still permits a plurality of plausible clusterings for any given dataset, but far short of the combinatorial explosion we would expect in a fully unrestricted setting. Among these choices, rational agents may disagree on which best reflects the true structure of the data. The same is true of many scientific problems, where competing models provide comparable fit to the evidence. In such cases, we must either suspend judgment, in true Popperian fashion, or rely on alternative criteria (e.g., parsimony or consistency with auxiliary facts).

This raises a third and final reply to the vacuity objection—unsupervised learning models do not generally work in isolation. Instead, they are used in conjunction with a range of other methods to build evidence for or against particular conclusions. For instance, in the cancer subtyping literature, the clusters output by algorithms like k-means are often validated by comparing survival curves for the hypothesized groups via a log-rank test (Kleinbaum & Klein, 2012). Significant differences are interpreted as evidence in favor of the purported subtypes; insignificant differences suggest the clusters may have overfit. In this way, the assumptions and biases that drive individual model results can be independently corroborated. By incorporating unsupervised learning into the discovery pipeline, along with supervised methods and classical inference procedures, we can protect against allegations of ostensible vacuity.

These epistemological concerns take on ethical significance when unsupervised learning algorithms are applied in socially sensitive contexts. Marginalized groups cannot benefit from the purported advantages of these technologies if they are not represented in the data. To take one prominent example, autoencoders are widely used for dimensionality reduction in genome-wide association studies (GWAS). If the resulting abstractions in fact capture some essential properties of the human genome, they could advance our biological understanding and pave the way for major medical breakthroughs. However, genome banks do not represent a random sample of humanity. Some 80% of all GWAS subjects are individuals of white European ancestry, even though this group represents just 16% of the global population (Martin et al., 2019). Such disparities raise legitimate concerns that the advances of precision medicine will only exacerbate existing health inequalities.

Unsupervised learning algorithms threaten to do more than just amplify existing problems, however. They may create new ones all their own. Insisting on the objectivity of these models is like claiming that the invention of the camera makes it impossible for journalism to mislead. On the contrary, photography creates entirely new risks that were inconceivable before the medium was invented, since doctored photographs can be far more convincing than a mere rumor. This danger arguably reaches its apotheosis in deepfakes, which could be used for slander, blackmail, sabotage, propaganda, or other nefarious purposes. In an extreme scenario, deepfakes could potentially erode trust in all visual and audio evidence, making it vastly more difficult to hold bad actors to account. These concerns have attracted considerable scholarly and media attention, and justifiably so.

By the same token, a myopic focus on ethical risks can blind us to ethical opportunities. For example, generative modeling techniques offer a creative solution to thorny problems surrounding data privacy. Policies for handling sensitive data are notoriously (and understandably) onerous. For their landmark study on geographic factors of social mobility, a team from UC Berkeley practically had to join the Internal Revenue Service to access the tax records of some 40 million Americans (Chetty et al., 2014). They submitted to fingerprinting and background checks, and agreed to conduct all data analysis on an air-gapped machine at the Treasury Department headquarters in Washington, DC. Generative models could render such arrangements obsolete. With sufficient capacity, these algorithms can synthesize highly realistic data of a potentially sensitive nature for mass dissemination without infringing on the privacy of any real-world data subjects. Global properties would be preserved, allowing for legitimate research findings, while local details may vary unpredictably. Expanding access to sensitive data could accelerate important discoveries in medicine, economics, and beyond. Generative models can in principle provide a secure way to manage such an expansion while respecting privacy rights. However, research has shown that synthetic data is not necessarily immune to membership inference attacks, so the privacy-utility tradeoff remains delicate (Stadler et al., 2022).
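
The following sketch illustrates the basic idea with deliberately simple ingredients: a Gaussian mixture model (standing in for the far more capable deep generative models discussed here) is fit to a public dataset treated as if it were sensitive, synthetic samples are drawn, and coarse global statistics are compared. The exact-match check at the end is only a crude sanity check, not a privacy guarantee, as the membership inference results cited above make clear.

```python
# Toy synthetic-data workflow: fit a simple generative model, release only
# synthetic samples, and compare coarse global statistics. A Gaussian mixture
# stands in for the deep generative models discussed in the text, and the
# breast cancer dataset stands in for genuinely sensitive records.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

X = StandardScaler().fit_transform(load_breast_cancer().data)

gm = GaussianMixture(n_components=5, covariance_type="full", random_state=0).fit(X)
X_synth, _ = gm.sample(n_samples=len(X))

# Global structure (feature means and correlations) is approximately preserved.
print("max abs difference in feature means:",
      round(np.abs(X.mean(axis=0) - X_synth.mean(axis=0)).max(), 3))
print("max abs difference in correlations:",
      round(np.abs(np.corrcoef(X.T) - np.corrcoef(X_synth.T)).max(), 3))

# Crude sanity check (not a privacy guarantee): no synthetic row reproduces
# a real record verbatim.
identical = (X_synth[:, None, :] == X[None, :, :]).all(axis=2).any()
print("any synthetic row identical to a real record:", bool(identical))
```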

The issues raised in this section—hyperparameter optimality, data quality, model opacity, and opportunities for (mis)use—echo similar debates in supervised learning, where unjustified confidence in results can also lead to grave harms. Existing frameworks for the ethical use of ML can and should therefore be expanded to cover unsupervised learning tasks, even if the extension is not always straightforward. For example, concerns over the reliability of inputs can be addressed with classical tools of statistical inference such as sensitivity analysis and hypothesis testing. Mayo’s (1996, 2018) emphasis on severity is especially instructive in this regard. According to Mayo, our belief in some hypothesis h is justified only to the extent that h has passed severe tests, i.e., tests that would detect flaws or discrepancies from h with high probability. Practitioners rarely subject unsupervised learning models to the same scrutiny as their supervised or reinforcement learning counterparts, in part because standards for such tests are not well developed. While familiar ML notions such as “loss” or “regret” may not apply in these settings, alternative desiderata such as stability and generalization do. More importantly, they are testable. Resampling procedures like bootstrapping and subsampling provide ready-made tools for evaluating the robustness of clusters and abstractions (Fisher et al., 2016; Monti et al., 2003). Supervised learning methods can also be of service. For example, stable clusters should generalize to unseen data, a hypothesis that can be evaluated with the aid of standard ML classifiers (Dudoit & Fridlyand, 2002; Tibshirani & Walther, 2005).
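
To make the proposal concrete, the following sketch implements a bare-bones version of the classifier-based generalization check, in the spirit of (though not identical to) the procedures discussed by Dudoit and Fridlyand (2002): cluster each half of a dataset independently, train a classifier on one half’s cluster labels, and ask whether its predictions recover the other half’s clusters.

```python
# Bare-bones generalization check: clusters found on one half of the data
# should be recoverable, via a supervised classifier, on the other half.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import adjusted_rand_score

X = load_iris().data
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)

# Cluster each half independently.
train_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_train)
test_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_test)

# Train a classifier on the training-half clusters, predict on the test half.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, train_labels)
predicted = clf.predict(X_test)

# The adjusted Rand index is permutation-invariant, so it measures whether the
# two labelings carve up the test half in the same way: values near 1 suggest
# stable, generalizable clusters; values near 0 suggest overfitting to noise.
print("ARI:", round(adjusted_rand_score(predicted, test_labels), 2))
```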

When generative models provide explicit densities, their performance can be evaluated via out-of-sample log-likelihoods. This is the case, for instance, with tree-based models. Systematic evaluation is more difficult for GANs, however, which do not provide densities by default. Once again, supervised learning methods can help. One standard pipeline for scoring GANs is to train one or several models to predict some outcome on the input data \(\mathbf{X}\) (picking some random \({X}_{j}\) as the dependent variable), then to build another set of models for the same task on synthetic data \(\widetilde{\mathbf{X}}\) (Ravuri & Vinyals, 2019). If sample sizes match, then the performance of models trained on \(\mathbf{X}\) provides an upper bound (up to sampling error) on the performance of models trained on \(\widetilde{\mathbf{X}}\). The best generative models should produce synthetic datasets that closely approximate the mutual information structure of the original data, as evaluated by the predictive performance of the supervised learning algorithm(s). Error-statistical reasoning can therefore help to improve standards for unsupervised learning and ensure more reliable results, even in the face of legitimate concerns over hyperparameters and data quality.
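
The following sketch is a schematic rendering of this pipeline, not the protocol of Ravuri and Vinyals (2019) itself: a Gaussian mixture again stands in for the GAN, one column is arbitrarily chosen as the dependent variable, and the same supervised model is trained on real and on synthetic data, with both then scored on a held-out real test set.

```python
# Schematic downstream-task evaluation of a generative model: train the same
# supervised learner on real and on synthetic data, then score both on a
# held-out real test set. A Gaussian mixture stands in for the GAN here.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

X = StandardScaler().fit_transform(load_breast_cancer().data)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Fit the generative model on real training data and draw a synthetic sample
# of matching size.
gm = GaussianMixture(n_components=5, random_state=0).fit(X_train)
X_synth, _ = gm.sample(n_samples=len(X_train))

j = 0  # arbitrarily chosen column to serve as the dependent variable


def downstream_score(data):
    """Train on `data`, then predict column j on the held-out real test set."""
    y, features = data[:, j], np.delete(data, j, axis=1)
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(features, y)
    return r2_score(X_test[:, j], model.predict(np.delete(X_test, j, axis=1)))


print("trained on real data:     ", round(downstream_score(X_train), 2))
print("trained on synthetic data:", round(downstream_score(X_synth), 2))
```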

Model opacity can be addressed through methods from explainable AI (XAI), a fast-evolving if somewhat controversial research area devoted to making the predictions of supervised learning models more intelligible for end users.Footnote 9 Leading examples include feature attributions, which explain model predictions as a sum of input features, and rule lists, which do so through a sequence of if–then statements. The applicability of such methods is perhaps most obvious in the case of clustering, which is similar in principle to classification and therefore amenable to the same XAI techniques developed for that domain. However, recent work has extended the logic of feature attributions to more general abstraction tasks (Crabbé & van der Schaar, 2022), an approach that could in principle be applied to generative models with an explicit embedding step.
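
One simple way to realize this idea for clustering is to fit a shallow decision tree to the cluster labels and read off the resulting rule list, as in the sketch below; the tree is only a surrogate explanation of the clustering, and its fidelity should be reported alongside the rules.

```python
# Surrogate explanation of a clustering: fit a shallow decision tree to the
# cluster labels and read off the resulting rule list.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, feature_names = data.data, list(data.feature_names)

# Unsupervised step: propose clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Explanation step: a depth-2 tree approximates the cluster assignments with
# a short list of if-then rules.
surrogate = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)
print(export_text(surrogate, feature_names=feature_names))

# Fidelity of the surrogate to the clustering it explains.
print("surrogate fidelity:", round(surrogate.score(X, labels), 2))
```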

Issues surrounding potential misuse are more complicated, for they raise larger questions of algorithmic governance. How should policymakers respond to rapid improvements in generative AI, which can be used for beneficial or diabolical purposes? This is a classic conundrum in the philosophy of technology, perhaps most salient in the debates on nuclear energy, which could either represent humanity’s salvation (in the form of a sustainable alternative to fossil fuels) or devastation (through apocalyptic weaponry delivered via intercontinental ballistic missiles). History offers no simple solutions to such dilemmas. In the present case, it is hard to see how regulatory measures could limit the proliferation of generative models. However, punitive sanctions could be implemented to disincentivize the creation or knowing dissemination of malicious deepfakes or other forms of misinformation. Meanwhile, technical solutions designed for algorithmic content moderation could be extended to cover such material (Gorwa et al., 2020). These are admittedly imperfect tools, but they may make potential violators think twice before deployment. Regulatory and enforcement mechanisms will no doubt continue to evolve as generative models improve.

6 Conclusion

Unsupervised learning algorithms are widely used in practice but rarely discussed in philosophy. In an effort to stimulate more critical reflection on this neglected branch of ML, I examined three common classes of unsupervised learning algorithms: clustering, abstraction, and generative models. I argued that these technologies raise distinctive metaphysical questions pertaining to natural kinds, modal ascriptions, and the limits of imagination. I showed how unsupervised learning can add clarity and precision to potentially muddled debates, replacing ontological speculation with methodological hypotheses that can be efficiently computed and severely tested.

However, we must be careful not to overstate the power of these algorithms. I cautioned against the temptation to believe that unsupervised learning produces results that are somehow “objective” or otherwise privileged. These tools require input data and hyperparameters that are inevitably selected and evaluated by humans with finite resources. Their internal operations are often opaque. Questions of purpose and context cannot be ignored without incurring major epistemic and ethical risks. Yet this does not license the antithetical extreme that unsupervised learning is a vacuous reflection of user biases. On the contrary, these algorithms can reveal structural insights and generate novel hypotheses as part of complex discovery pipelines. The solution, I submit, is a pragmatic approach that acknowledges unavoidable limitations while simultaneously demanding high standards of statistical rigor and algorithmic transparency. Too often, unsupervised learning methods are exempted from the critical scrutiny traditionally applied to supervised and reinforcement learning models. Such a reevaluation is long overdue.