One of Mark Twain’s lesser-known works, Letters from the Earth, is narrated by Satan. This narrator is not the scary, devil-in-a-burning-hell Satan, but rather a sarcastic and slightly perturbed fallen angel sent down to Earth to report on what we humans are up to. Satan’s bitingly witty observations, written in a series of letters, are both true and slightly off. In the course of our interdisciplinary research on intersectionality in structured synthetic data, we have been reminded of this amusing and critical book.

Synthetic data have, of course, been used for a long time, but AI-generated versions of synthetic data are having a moment, partly riding on the coattails of the current conversations around AI, and partly because they now solve several of the problems that people who work with tabular data often face, including privacy issues, data portability, and the need for large amounts of data to feed ML algorithms.

One of the claims made for AI-generated synthetic data is that a neural network will learn the essential statistical distributions of an existing dataset and then produce a new (synthetic) dataset with approximately those distributions. This new dataset is the same but different. And it is precisely because it is slightly different that it can be shared with others without risking the exposure of sensitive data. It can avoid regulation under national or international data laws, or sidestep portability restrictions within an organization or across research groups. Or it can be added to a smaller dataset to amplify it into a big dataset.
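To make that claim concrete, here is a minimal sketch of the fit-then-sample workflow, assuming the open-source ctgan package and a toy stand-in table (the column names and proportions below are placeholders for illustration, not our study data):

```python
import numpy as np
import pandas as pd
from ctgan import CTGAN  # pip install ctgan

rng = np.random.default_rng(0)
n = 1000

# A toy stand-in for an "original" tabular dataset.
real = pd.DataFrame({
    "age": rng.integers(17, 90, size=n),
    "sex": rng.choice(["Female", "Male"], size=n),
    "income": rng.choice(["<=50K", ">50K"], size=n, p=[0.76, 0.24]),
})

# The network learns the joint distribution of the training table...
model = CTGAN(epochs=10)
model.fit(real, discrete_columns=["sex", "income"])

# ...and then samples a new table with approximately those
# distributions: the same, but different.
synthetic = model.sample(len(real))
print(synthetic.head())
```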

But, despite the connotations conjured up by its name, synthetic data are not just a ‘plastic’ version of the original; they are a new version of a dataset, made by AI technologies and meant to reproduce the essential statistical aspects of the original. Our claim, in this curmudgeon complaint, is that it matters what those essential aspects are and where the differences between the original and the synthetic data appear, that these questions are not currently given enough attention in the discourse around synthetic data, and that we need a vocabulary to speak about them.

First, our slightly perturbed assertion: A structured dataset’s essential aspects appear in the intersections—the relations—contained within the data. Those relations are often (always?) of great interest. This is true for any structured data, including multimodal datasets. It is why people collect and organize data. But for the sake of clarity, we make our point here with population data, since the intersectional aspects of population data are intuitive for us humans to understand.

To explore the dynamics of intersectional relations, we synthesized a new version of the 1990 US Adult Census Data. We did this several times, using different GANs and applying different constraints in the process. Then we compared the synthetic datasets with the original data.
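A sketch of this setup, assuming the publicly available UCI ‘Adult’ census extract and, again, the ctgan package (our exact GAN configurations and constraints are not reproduced here):

```python
import pandas as pd
from ctgan import CTGAN

# Column names per the UCI "Adult" documentation.
cols = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week",
        "native-country", "income"]

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "adult/adult.data")
original = pd.read_csv(url, names=cols, skipinitialspace=True)

# Treat every string column as categorical for the GAN.
discrete = [c for c in cols if original[c].dtype == "object"]

# Fit and sample as in the sketch above; in practice this is repeated
# with different GANs and constraint settings, and each synthetic
# table is then compared against the original.
model = CTGAN(epochs=300)
model.fit(original, discrete_columns=discrete)
synthetic = model.sample(len(original))
```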

One of our first questions to the synthetic data was whether we were losing edge cases. True to form, we were. For example, with one of the first GANs we used, we had 40 countries of birth in the original data but only 31 in the synthetic data. Countries with very little representation in the original data disappeared from the synthetic data. This was to be expected, given the way GANs work, but we found work-arounds and GANs that let us control for edge cases in curated ways.
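Checking for vanished edge cases is straightforward once both tables are in hand; a sketch, reusing the original and synthetic frames from above:

```python
# Which birth countries survived synthesis, and which disappeared?
orig_countries = set(original["native-country"].unique())
synth_countries = set(synthetic["native-country"].unique())

lost = sorted(orig_countries - synth_countries)
print(f"{len(orig_countries)} countries in the original, "
      f"{len(synth_countries)} in the synthetic; lost: {lost}")

# The casualties are typically the rarest categories in the original.
print(original["native-country"].value_counts().tail(10))
```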

Our second question to the synthetic datasets was how well the diversity of the population was being represented. Was, for example, the age distribution in the synthetic data similar to that in the original data? Often, at the single-column level, the statistical distributions were pretty close. Not identical, of course, because then the data would not be synthetic, but pretty close.
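One crude way to put a number on ‘pretty close’ at the single-column level is the total variation distance between the two value distributions; a sketch (the helper and the bin edges are ours, for illustration):

```python
import numpy as np
import pandas as pd

def tv_distance(a, b, bins=None):
    """Total variation distance between two empirical distributions
    (0 = identical, 1 = completely disjoint)."""
    if bins is not None:          # bin continuous columns such as age
        a, b = pd.cut(a, bins), pd.cut(b, bins)
    p = a.value_counts(normalize=True)
    q = b.value_counts(normalize=True)
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * float(np.abs(p - q).sum())

age_bins = list(range(10, 101, 10))
print("age:", tv_distance(original["age"], synthetic["age"], bins=age_bins))
print("sex:", tv_distance(original["sex"], synthetic["sex"]))
```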

But we are more than just our age, or just our gender, or just our income level. Intersectionality theory from within the social sciences has shown us that our identities and subject positions are created through the complex intersections of power dynamics in society (Crenshaw 1989). Translating these power dynamics into a series of columns of population data is fraught (Monk 2022; Bouk and Boyd 2021). But for work like this, census data are what we have. And while far from perfect, they do at least give us a hint of the complexity of different subject positions—of different data points—a complexity that is very relevant for any analysis or use of population data. Given this, intersectional complexity in synthetic data is also important.

We started looking at intersectional representations in the synthetic data. For example, we compared the intersection of age–income–gender in one of the datasets and saw that the synthetic dataset represented this intersection fairly well. Again, it was not identical (which is good, or else it would not be synthetic data), but pretty close—something we are calling intersectional fidelity.
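The same distance generalizes from single columns to intersections: treat each combination of (binned) age, income, and gender as one cell of a joint distribution, and compare cell by cell. A sketch of the kind of check we mean; the total variation score here is an illustrative stand-in, not a standard fidelity metric:

```python
def joint_tv(df_a, df_b, columns, age_bins=tuple(range(10, 101, 10))):
    """Total variation distance over the joint distribution of
    several columns taken together, i.e., over one intersection."""
    def cells(df):
        tmp = df[columns].copy()
        if "age" in columns:      # bin the continuous age column
            tmp["age"] = pd.cut(tmp["age"], list(age_bins))
        return tmp.groupby(columns, observed=False).size() / len(tmp)
    p, q = cells(df_a).align(cells(df_b), fill_value=0.0)
    return 0.5 * float(np.abs(p - q).sum())

# A low distance at this intersection is what we mean by
# "intersectional fidelity" for age-income-gender.
print(joint_tv(original, synthetic, ["age", "income", "sex"]))
```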

But then we started looking at some other intersections. In the dataset just mentioned, for example, we also looked at the intersection of marital status, occupation, and gender. At this intersection in the data, we saw that there were many more females in the synthetic data than in the original data. Looking even more closely, we saw that in the original dataset there was 1 husband who was also classified as female. In the synthetic dataset there were 259 husbands who were also female. Today, and in some countries, that is perfectly fine. But remember, we were trying to synthesize the 1990 US census data. Those 259 female husbands are what we call an intersectional hallucination.
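Hallucinated combinations can be hunted mechanically: cross-tabulate the columns of interest in both tables and flag cells that are (near-)empty in the original but populated in the synthetic data. In the Adult extract, the female-husband case lives in the relationship and sex columns; the thresholds below are arbitrary choices for illustration:

```python
# Count every relationship-by-sex combination in both tables.
orig_counts = original.groupby(["relationship", "sex"]).size()
synth_counts = synthetic.groupby(["relationship", "sex"]).size()
o, s = orig_counts.align(synth_counts, fill_value=0)

# Cells that are (near-)absent in the original but well populated in
# the synthetic table are candidate intersectional hallucinations.
hallucinated = s[(o <= 1) & (s > 10)]
print(hallucinated)  # e.g., ("Husband", "Female") appearing hundreds of times
```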

The implication of this intersectional hallucination is that this particular synthetic dataset might be useful for research or policy that is addressing age, income, and gender (where there was intersectional fidelity). But it would be less valuable—even incorrect—for research or policy decisions related to marital status, occupation, and gender (where we identified an intersectional hallucination).

Intersectional hallucinations can be controlled to some extent (we have produced larger and smaller ones using different GANs and fairness constraints, and we have watched them shift around in our data). And we suspect techniques will quickly develop to allow the more precise production of controlled hallucinations and, simultaneously, to ensure fidelities around particular intersections. However, intersectional hallucinations must always exist in synthetic data, because they are the essence of the ‘synthetic’ in synthetic data. Without them, one would have a dataset that correctly reproduced every intersectional relation of the original tabular data. This would make for great data fidelity, but it would mean one had a mirror of the original data. Such data would not assure privacy, avoid regulation, or be portable. It is in the intersectional hallucinations that the ‘synthetic’ is made.

Thus, knowing exactly which intersections are hallucinated and which are actually true to the original data is necessary when considering how and where to use a synthetic dataset. This insight shows how ludicrous it is to think that synthetic data are going to be an easily portable and sharable commodity, deposited in an open repository for all to use. For synthetic data to be valuable, the intersectional hallucinations that are an essential part of the data must first be made visible and labeled. The intersectional fidelities of the synthetic data must likewise be labeled before the data can be safely used for other purposes. And this information must follow the synthetic dataset wherever it goes and whenever it is used—including when it is added to original data to amplify a dataset for ML purposes.
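What might such labeling look like in practice? At minimum, a machine-readable report listing, for each intersection checked, a fidelity score and any hallucinated cells, shipped alongside the synthetic table. A hypothetical sketch follows; the schema and the numbers are illustrative, not measured:

```python
import json

# A hypothetical "intersection report" that could travel with a
# synthetic dataset; schema and values are illustrative only.
report = {
    "source": "UCI Adult census extract",
    "generator": "CTGAN, 300 epochs, no fairness constraints",
    "intersections": [
        {"columns": ["age", "income", "sex"],
         "tv_distance": 0.04,              # illustrative value
         "verdict": "fidelity"},
        {"columns": ["relationship", "sex"],
         "tv_distance": 0.18,              # illustrative value
         "hallucinated_cells": [["Husband", "Female"]],
         "verdict": "hallucination"},
    ],
}

with open("synthetic_adult.fidelity.json", "w") as f:
    json.dump(report, f, indent=2)
```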

Synthetic data promise to do things we need. But for that to happen, we must know how the synthetic data are intersectionally similar to, and different from, the original data—regardless of whether those intersections involve human data or other structured data. Knowing that a synthetic dataset contains oak trees with pine needles or nonpolar covalent water molecules is just as important as knowing the dataset contains 6-year-old medical doctors or, as ours did, nearly 200 male wives working in administrative jobs, a combination that did not exist in the original data.

We cannot be satisfied with comparing statistical fidelity within individual columns, or even in simple two-part relations, if we want to future-proof synthetic data, because the devil, so to speak, is in the intersectional details, and in how those details are recorded and reported. Just like letters written by a fallen angel, some of those reports are going to be more faithful to the relations in the original data than others. Knowing a synthetic dataset’s intersectional fidelities and intersectional hallucinations is the only way to ensure that research results based on the use of synthetic data are reliable.