1 Introduction

Running online controlled experiments, also known as A/B tests, has become an increasingly important aspect of data-driven decision-making for companies. It is also a powerful tool to evaluate the impact of changes made to software products and services (Jiang et al., 2019; Kohavi et al., 2020; Rajkumar et al., 2022; Siroker & Koomen, 2013; Thomke, 2020). A/B testing is an umbrella term covering a wide spectrum of use cases and best practices for experimentation (see Kohavi et al., 2020 for a review). The key feature of any A/B test, or online controlled experiment, is that it is “controlled”: users are randomly exposed to one of two variants, Control (A) or Treatment (B). Online controlled experiments, exactly like offline controlled experiments, therefore adopt randomization as the best scientific design for establishing a causal relationship between changes and their influence on behavior. To clarify the phenomenon and establish a shared technical vocabulary, let us consider a typical scenario for an A/B test run by a social network SN, in which a team of data scientists is testing a new feed ranking algorithm. Simplifying a bit, to run an experiment at SN, the data science team must answer the following questions:

(1) What is the team’s business goal? That is, what is SN trying to achieve? In this case, let us assume that it is an increase in advertising revenue.

(2) What is the evaluation criterion? That is, what metrics are considered good proxies for SN’s business goal? In this case, let us assume that it is engagement with the news feed, as measured by likes. This is typically known as a key performance indicator (KPI).

After a standard power analysis to check whether enough users are available to detect the effect of interest in a given amount of time t, data scientists at SN launch the experiment. Users are randomly exposed to different conditions—that is, “treatment (T)” (the new algorithm) and “control (C)” (the old algorithm)—at some fixed ratio. Ideally, the two groups C and T should both be representative of the wider SN userbase. At the end of the trial period, researchers at SN review the results and estimate the average treatment effect (ATE): the average difference in the KPI between the treatment and control groups. The ATE will determine whether the intervention could be beneficial for SN, significantly improving the KPI. To estimate it, researchers combine statistical estimation with significance testing (Veytsman, 2020). Although this example is based on an A/B test conducted by a fictional social network, it represents the fundamental methodology adopted by organizations across multiple domains.
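To make this workflow concrete, the sketch below shows how the power analysis and the ATE estimate might be computed in Python. It is a minimal illustration only: the baseline rate, lift, sample sizes, and simulated data are invented for the purpose of the example, and statsmodels and SciPy are just one possible toolchain.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import NormalIndPower

# Power analysis: users needed per arm to detect a 2% relative lift
# (baseline rate and effect size below are hypothetical).
baseline_rate = 0.10                 # e.g. baseline likes-per-view rate
mde = 0.02 * baseline_rate           # minimum detectable effect (absolute)
sd = np.sqrt(baseline_rate * (1 - baseline_rate))
n_per_arm = NormalIndPower().solve_power(
    effect_size=mde / sd,            # standardized effect size (Cohen's d)
    alpha=0.05, power=0.8, ratio=1.0, alternative="two-sided",
)
print(f"Users needed per arm: {int(np.ceil(n_per_arm)):,}")

# After the trial: the ATE is the difference in mean KPI between arms;
# a Welch t-test gauges statistical significance. Data simulated here.
rng = np.random.default_rng(0)
control = rng.binomial(1, 0.100, size=400_000)
treatment = rng.binomial(1, 0.102, size=400_000)
ate = treatment.mean() - control.mean()
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"ATE = {ate:+.4f}, p = {p_value:.4f}")
```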

While participant protection protocols are considered the norm in behavioral, medical, and social research, the situation is different when it comes to company-sponsored A/B testing. Research institutions must submit any research involving human subjects to Institutional Review Boards (IRBs) for ethical evaluation and informed consent review (Burris & Moss, 2006). Companies, by contrast, are not required to meet the same standards for A/B tests: experiments occasionally undergo internal reviews, but rarely an ethics review.

Many practical challenges in running experiments at scale have been discussed (Kohavi et al., 2020). In the context of social media research, Grimmelmann (2015) highlights potential ethical risks in widespread A/B testing by private companies, including IRB laundering (the sidestepping of ethical reviews by academic researchers through collaborations with corporate partners) and the waiving of informed consent to obtain unbiased results. The author argues that existing oversight mechanisms such as the Common Rule, IRBs, and academic journals can be leveraged to mitigate these risks. Benbunan-Fich (2017) distinguishes between front-end A/B testing, i.e., changes to the user interface, and back-end C/D experimentation, where code is altered to intentionally deceive users. For the latter case, several recommendations are proposed to address the resulting ethical concerns: the development of an ethical code of conduct for online experiments, the design of a tool to obtain explicit consent from participants (similar to the one already in place for website cookies), and the creation of an independent user advocacy board to promote education and receive complaints about unethical conduct. So far, however, the ethical implications of A/B testing have rarely been acknowledged, let alone thoroughly investigated. Providing clear, actionable, and principled ethical guidelines for responsible A/B testing is therefore especially timely and relevant. Overlooking risks and ignoring fundamental safeguards for the protection of human subjects who unknowingly become participants in A/B tests can be dangerous. After a decade in which tech companies were celebrated for empowering ordinary users, problems have been mounting over the past few years. Many digital companies have been shown to exploit behavioral biases, deception, and addictive tendencies (Costa & Halpern, 2019). While such manipulation has long been central to the business model of the gambling and gaming industries (Dow Schüll, 2012), these practices are becoming more widespread (Wendel, 2020). In interface design on web pages or in games, this manipulation uses what are called “dark patterns” (Mathur et al., 2021; Waldmann, 2020). It is reasonable to conjecture that many of the dark patterns now commonly found online were adopted after A/B tests involving human subjects (Kramer et al., 2014).

In this article, we argue that the rise of an experimentation culture in industry brings unprecedented opportunities to businesses—but also significant responsibilities that have been overlooked for too long. We do not maintain that all online controlled experiments raise ethical concerns, but rather that it is always important to ask whether any ethical risks are involved. To facilitate this assessment, we propose a set of principles that should be adopted to encourage ethical and responsible experimentation, protecting users, customers, and society (for ease of reference, we call it the DEC methodology, from the Digital Ethics Center).

Notably, practitioners regularly consider, assess, measure, and report on the impact that their experiments have on their business. We argue that practitioners should, in a similar fashion, also identify, manage, and mitigate potential ethical risks. The prompting questions in the Appendix are meant to provide practitioners with support and guidance in doing so.

The recommendations and analyses provided in this article are relevant to different audiences: from practitioners conducting online experiments to other corporate stakeholders (e.g. executives involved in ESG reporting), and from scholars in the field of ethics of technology (both as authors and gatekeepers) to policy-makers and legislators.

This article aims to establish a new field of research on the ethics of A/B testing, stimulate further questions, and encourage more work on how to operationalize principles and recommendations. Although not all details are fleshed out here, the article already offers a wide range of practical recommendations.

The paper is structured as follows. In Sect. 2, we introduce a new soft ethics and governance framework that organizations should strive to implement throughout the A/B test lifecycle, from planning to the communication of results. In Sect. 3, we discuss the effectiveness of several relevant mechanisms in educating, governing, and incentivizing companies conducting online controlled experiments. Section 4 concludes the article. In the Appendix, we offer a complementary list of questions specifically designed to help empower practitioners by stimulating specific ethical deliberations; this constitutes a starting point in the development of an ethical code of conduct for A/B testing, as recommended by Benbunan-Fich (2017). Different audiences are expected to have diverse levels of familiarity with the content covered in different sections. For example, ethicists may already be familiar with many of the concepts and principles presented in Sect. 2, whereas industry practitioners may find them less familiar. At the same time, the article aims to engage all the relevant audiences by making the discussion of principles, policies, and mechanisms accessible to all relevant stakeholders.

2 A/B Testing Meets Ethics

A/B testing should strive to respect a critical set of ethical principles, that is, general considerations that must be applied when doing research. A natural approach to developing such a framework is offered by Beauchamp and Childress (Beauchamp & Childress, 2001; henceforth B&C). The moral framework laid out in their Principles has had an enormous impact on academics and practitioners across a wide variety of disciplines. B&C define common morality as the set of universal and constant norms shared by all persons committed to morality, while acknowledging that issues of moral status may vary significantly over time and between cultures. Following B&C, the key ethical principles that should guide any A/B testing are:

(1) Autonomy—Respect the right of individuals to make their own choices.

(2) Fairness—Treat individuals with fairness and equality.

(3) Non-maleficence—Do not harm individuals.

(4) Beneficence—Be beneficial to people and the environment.

Before commenting on each principle in the following sections, it is worth explaining here that a framework inspired by B&C’s work is valuable for a number of reasons.

First, it corrects the error, often made in the literature on ethics and technology, of equating ethics with “fairness”. This is a common reductionist approach, arguably attributable to Rawls (1971) and the contract tradition in political philosophy, that is also encountered in the context of AI ethics. For instance, it is easy to associate AI ethics with initiatives from major tech companies to issue so-called “fairness toolkits”, such as IBM’s “AI Fairness 360”, Meta’s “Fairness Flow”, and Google’s “What-If Tool” (Bellamy et al., 2018). However, as we shall see below, the ethical concerns involved in A/B testing go beyond fairness, and so do the actions that practitioners should undertake. A benefit of adopting B&C’s principles is that they provide a framework for acknowledging, understanding, and systematizing the breadth of the ethical concerns involved in A/B testing. However, since theirs is not the only available framework that does not equate ethics with fairness, this benefit of their approach is best considered in combination with other reasons.

Second, a pragmatic reason for adopting B&C’s principles is that they have been successfully used in the adjacent, burgeoning literature on AI ethics. In particular, Floridi and Cowls (2019) report the results of a fine-grained analysis of several of the highest-profile sets of ethical principles for AI. Based on their comparative analysis, the authors argue that the four bioethical principles mentioned above adapt well to the fresh ethical challenges posed by AI, although a new principle is needed in addition: explicability. Further, Jobin et al. (2019) reviewed 84 documents produced by several actors in the field and offered a classification of AI ethics guidelines. Their classification focused on a seemingly broader set of values than the one considered by B&C: transparency, justice and fairness, non-maleficence, responsibility, and privacy. However, both Floridi and Cowls (2019) and Jobin et al. (2019) highlight that the core of B&C’s framework is effectively applicable beyond the field of bioethics. Floridi and Cowls (2019) introduce a fifth principle, but this is justified by the specific demands arising in the field of AI. Further, the range of ethical considerations discussed by Jobin et al. (2019) is only apparently broader: as will become evident in the remainder of this paper, concerns about trust and transparency are related to the four general principles introduced above. Overall, this constitutes another reason, albeit a preliminary one, for considering B&C’s principles when mapping the ethical domains of A/B testing.

Third, since the principles identified by B&C and at the heart of their proposal seek to be representative of different ethical perspectives, traditions, paradigms, and moral beliefs, they provide an ideal starting point for discussion among relevant stakeholders (Beauchamp, 2003). The greatest appeal of their approach lies precisely in its ecumenical and pluralistic nature: it is based on a set of norms affirmed by people from a wide variety of traditions, and it combines many of the most plausible elements of different theories into a clear and commonsensical framework. Importantly, however, the list of relevant ethical principles is not set in stone. B&C’s framework offers a very fruitful starting point for the ethics of A/B testing and to categorize concerns and expectations, but the list of principles can be revised if needed, precisely as Floridi and Cowls (2019) did when they introduced a fifth principle in the field of AI ethics.

Having reviewed some reasons for adopting B&C’s principles, it is also worth clarifying that the framework offered by the four principles is coherent with the soft ethics approach adopted in this article. “Soft ethics” (like soft law) is post-compliance ethics (Floridi, 2018): it applies after the application of the relevant legislation, such as the General Data Protection Regulation (GDPR) in the European Union. It differs from hard ethics, which may not be aligned with—and indeed may even oppose—some legislation. In this article, we are not concerned with hard ethics and how it may shape, criticize, or contrast with particular pieces of legislation. We assume that current law, at least in the EU, may be ethically acceptable and ask what more can and should be done over and above it. For instance, we acknowledge that non-discrimination law offers potential pathways to deal with instances of company-sponsored experimentation. An interesting example is Article 21 of the EU Charter of Fundamental Rights (European Union, 2012), which establishes that “any discrimination based on any ground such as sex, race, colour, ethnic or social origin, genetic features, language, religion or belief, political or any other opinion, membership of a national minority, property, birth, disability, age or sexual orientation shall be prohibited.” Similarly, we also welcome initiatives such as California’s ban on “dark patterns” (Akhtar, 2021). What we argue is that, even assuming that current legislation is ethically sound, much more can and needs to be done, ethically speaking, whenever we deal with actions and practices that are legally uncharted, about which the law needs to be interpreted ethically, or where compliance is insufficient.

Before starting our discussion of each of B&C’s principles, it is worth discussing two possible objections to the suggested approach. They concern the merits and applicability of B&C’s principles. Let us consider them in turn.

2.1 Other Frameworks Are Better

The first objection argues that, despite the appeal of B&C’s principles, there are alternative frameworks available which should be preferred. As it turns out, however, the presence of alternative frameworks does not really challenge the soundness of the approach suggested in this article. First, it is not obvious that adopting an alternative option would lead to different conclusions. For instance, consider the influential approach proposed by Friedman et al. (2013). The authors propose a list of thirteen values that are important for the design of information systems. In their list we encounter human welfare, privacy, freedom from bias, trust, autonomy, informed consent, and accountability, all elements that, as we will see, can be fruitfully explored using B&C’s principles. The few elements in their list that are less obviously captured within B&C’s framework, such as environmental sustainability, seem orthogonal to the present discussion of A/B testing. Admittedly, there are further relevant accounts that one could consider. For instance, one could object that human rights provide a better framework than the one put forth by B&C (Fukuda-Parr & Gibbons, 2021), where human rights represent an approach that highlights human dignity as the ground for our moral status. But alternative frameworks tend to converge upon closer scrutiny, meaning that other frameworks will likely accord with the conclusions of this paper (Baker, 2001).

Second, there is no framework that has not attracted objections in the literature. Consider, for example, another influential approach, the capabilities approach originally proposed by Sen (e.g. Sen, 1985) and more recently further developed by Nussbaum and others. This approach stresses the importance of evaluating human welfare using the metric of what people are able to do and be. Sen emphasizes capabilities broadly, whereas Nussbaum proposed a more specific list of capabilities that are required for a human life to be “not so impoverished that it is not worthy of the dignity of a human being” (Nussbaum, 2000, p. 72). This framework has influenced a number of policies (Bondi et al., 2021). But even this approach has been criticized (Jaggar, 2006; Nelson, 2008) and charged, for instance, with being paternalistic (Claassen, 2014).

2.2 B&C’s Principles Are Not Useful in Practice

The second objection is that B&C’s principles may seem appealing but are not useful in practice, for two main reasons. First, because they are too broad to be action-guiding for practitioners, as only principles that are narrower and more specific are likely to be useful in practice (Whittlestone et al., 2019). Second, because B&C’s framework does not really help us resolve conflicts between principles. Let us examine them in turn.

Indeed, high-level principles are difficult to translate into practice. However, as will become more apparent in the remainder of this article, the goal is not merely to provide high-level principles, but to unpack them, teasing apart different notions and conceptualizations, and to further enable and assist practitioners in their decision-making by providing them with a detailed list of prompting questions in the Appendix. More work will be needed, and is encouraged, to further operationalize different ethical principles and recommendations, for instance by introducing a set of ethical guardrails to complement the more traditional metrics used as organizational guardrails to protect the business and as technical guardrails to ensure the internal validity and trustworthiness of the results. These ethical guardrails will also need to be contextualized, as they will arguably be specific to different industries, contexts, and use cases.

The objection that B&C’s approach does not in itself provide a universalizable method for prioritizing the four principles has been raised a number of times in the literature (e.g. Clouser & Gert, 1990), and might appear more scathing. More precisely, B&C’s latest approach was committed to reflective equilibrium as a methodology (Rauprich, 2008): a process by which our considered responses to actual cases influence our moral principles, and those principles then provide guidance for our response to further cases. The reality, however, is that although the framework does not in itself provide a universalizable method for prioritizing the four principles, this is not a shortcoming but an advantage. Given that there is no widely acceptable universalizable method for prioritizing these principles (pace the competing claims to the contrary), it is a positive feature of this approach that it allows practitioners to give different weights to the principles when they conflict, in relation to different ethical, legal, and cultural circumstances. Further, noting a tension between two values does not necessarily mean we are forced to choose between them: often, we may be able to find a tradeoff to get more of both things we value. In the Appendix, the paper offers prompting questions that also ask whether any conflicts between different ethical principles were observed and, if so, what weight was given to different considerations. This helps better capture any tensions that arise when high-level principles are applied to concrete cases. Though most of these tensions cannot be resolved straightforwardly, articulating them more clearly and explicitly will help further operationalize the principles.

Let us now turn to the four principles.

2.3 Autonomy

Individual autonomy refers to the capacity to be one’s own person, to live one’s life according to reasons and motives that are one’s own and not the product of manipulative or distorting external forces (Calvo et al., 2020). The growth of unregulated A/B testing may undermine human autonomy, as it can result in widespread, systematic omission of appropriate information about the risks, benefits, and alternatives of experiments, and in deceptive, opaque, and unintelligible practices that do not have the individual’s best interest in mind.

Informed consent is closely related to the concepts of “autonomy” and “autonomous choice”. The requirement to secure informed consent is the cornerstone of human subject protection (Resnik, 2018). Its main objective is to make prospective participants aware of the research and give them the option to opt out of the study. However, online companies automatically acquire implicit consent for research when a user accepts the terms of service (TOS); yet these agreements are complex and difficult to read, which raises doubts about the validity of such “informed consent” (Luger et al., 2013). Consider the example of Facebook’s experiment in 2014 (Kramer et al., 2014). The study set out to test whether emotions were contagious via online social networks. For a week, Facebook showed some users fewer positive posts and others fewer negative posts in their News Feed, and then measured how many positive or negative words they included in their own posts. People who saw fewer positive posts (a more depressing feed) used 0.1% fewer positive words in their posts—their status updates were slightly less happy. People who saw fewer negative posts (a happier feed) used 0.07% fewer negative words—their updates were slightly more positive. Technically, Facebook had consent from all users. Yet that was arguably a weak form of consent, as participants did not know that they were in the experiment, were not provided with any way to opt out, and were not informed about its scope or intent, its potential risks, or whether data would be kept confidential. This is in stark contrast with the consent required by offline experiments. Because of their length, most people fail to read TOS agreements and are unaware of their content (Obar & Oeldorf-Hirsch, 2020). The result was that participants were de facto uninformed and prevented from opting out, and the manipulation of emotions risked causing psychological harm to some users exposed to these practices, without any efforts to ensure their well-being.

Further, in line with Turilli and Floridi (2009), we do not consider transparency to be an ethical principle per se but rather a moral enabler, part of an ethical infrastructure or infraethics (Floridi, 2017). In particular, both informational transparency and opacity can be at odds with the concept of autonomy introduced above. Of course, in many cases, “complete understanding” of the systems with which we interact is neither desired nor required: we are perfectly happy to use technology by adopting “intentional” or “design stances” (rather than the more complete but cumbersome “physical stance”) so long as the system functions correctly (Dennett, 1987). But given that we increasingly and preferentially trust and interact with software and intelligent systems, and that we rely on them to make decisions in a variety of socially significant and morally weighty contexts, the call for transparency in the design of services and goods has acquired an ethical dimension too. Importantly, the notion of transparency is highly relevant to many discussions around A/B tests, as these can be opaque when run with little or no human control or oversight. On this point, it is worth noting that not all A/B tests are the same.

First, some kinds of manipulation raise serious problems of transparency, especially in what we call “back-end A/B testing”. Back-end A/B testing can be defined as manipulation that, unlike traditional surface-level testing aimed at improving the design of websites or the user interface (UI), does not affect the layout or design of a web application. Instead, back-end A/B testing modifies the inner workings of the algorithms that power certain portions of the UI, e.g. recommender systems. Consider the hypothetical data science team at SN which we introduced in Sect. 1. There is a difference between a scenario in which the team decides to test the impact of some new functionality on the layout or design of their website/app, such as adding a new “double thumb” option to rate content, and one in which they seek to introduce new machine learning algorithms to power the news feed. These latter types of changes are more opaque, as they operate largely beneath the surface of what the user can immediately detect. While users are becoming better at detecting and handling “dark patterns” (Shaw, 2019), they are nevertheless ill-equipped to scrutinize back-end changes and to hold companies accountable. We argue that cases of back-end testing raise unique ethical challenges, which means that informed consent should be a priority.

Second, it is important to make a distinction between two different scenarios:

(i) Treatment and control involve two different models with different parameters, but maximizing the same function.

(ii) Treatment and control involve two different models maximizing different functions, e.g. one for engagement and one for long-term value (LTV).

The majority of experiments tend to fall under type (i), as A/B tests are the paradigmatic example of exploitation (in the exploration/exploitation jargon of reinforcement learning): once the data science team at SN realizes that a personalized news feed is a valuable feature for engagement, it is “just” a matter of finding the right “knobs” to turn in the relevant algorithm. For technical reasons ranging from data drift to causal inference (users are more likely to click what is recommended, ceteris paribus), A/B tests are essential to determine the optimal parameter configuration. Whatever harm or benefit the relevant model is causing users, type (i) testing would typically make it a bit higher or lower, but not structurally different. A more interesting ethical case is type (ii) experiments, which involve deep changes in either the problem framing or the entire experience. For example, data scientists at SN may shift their attention from predicting likes to predicting lifetime value, that is, the total years on the platform. The new algorithm will be less readily comparable to the existing one, making the A/B test harder to interpret and arguably carrying a much higher risk of significantly disadvantaging a portion of users, even if temporarily. A special case of type (ii) is when a new model introduces a new UI experience altogether. For example, Yu and Tagliabue (2020) introduced a model-based query refinement tool as an alternative to a standard search bar with no pre-existing functionality. Since bigger changes may interfere with users’ intentions in new ways, it is imperative that information and communication be handled more responsibly. The sketch below illustrates the distinction between the two types.
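A minimal sketch of the two experiment types follows; every name, item, and scoring function is a hypothetical stand-in rather than an actual production ranker.

```python
# Toy candidate items with pre-computed (hypothetical) model scores.
items = [
    {"id": 1, "pred_likes": 0.9, "recency": 0.2, "pred_ltv": 0.1},
    {"id": 2, "pred_likes": 0.4, "recency": 0.9, "pred_ltv": 0.8},
    {"id": 3, "pred_likes": 0.6, "recency": 0.5, "pred_ltv": 0.5},
]

# Type (i): both arms maximize the same objective (predicted likes);
# the arms differ only in a tuning "knob" (here, a recency weight).
def rank_type_i(items, recency_weight):
    return sorted(items,
                  key=lambda it: it["pred_likes"] + recency_weight * it["recency"],
                  reverse=True)

control_ranking = rank_type_i(items, recency_weight=0.1)
treatment_ranking = rank_type_i(items, recency_weight=0.3)  # knob turned

# Type (ii): the arms optimize structurally different objectives,
# e.g. short-term engagement vs. predicted lifetime value (LTV).
def rank_type_ii(items, arm):
    key = "pred_likes" if arm == "control" else "pred_ltv"
    return sorted(items, key=lambda it: it[key], reverse=True)
```

In a type (ii) test the two rankings can diverge completely, which is precisely why such experiments are harder to interpret and riskier for users.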

2.4 Fairness

Justice and fairness are closely related terms, often used interchangeably. In the literature on A/B testing, justice is mainly expressed in terms of fairness (Saint-Jacques et al., 2020). Fairness is a critical concern in the context of A/B testing because of the high stakes and risks involved: A/B tests can drive software changes with serious impact on society and daily practices, mediating personal and professional interactions. For example, if left unregulated, experiments can end up reinforcing existing social (dis)advantages or stigmatization in targeted groups. Several different definitions of fairness have been provided in the literature (Saxena et al., 2019; Verma & Rubin, 2018).

It is unlikely that there will be a universal definition of fairness appropriate across all applications and experiments. However, many recent studies have investigated primarily two notions of fairness. Group fairness focuses on some sort of statistical parity (e.g. between outcomes) for members of different groups (e.g. gender), whereas individual fairness focuses on whether people who are similar with regard to the task receive similar outcomes (Dwork et al., 2012; Pedreschi et al., 2008). We agree with recent literature showing that these characterizations of fairness may be mutually compatible (Binns, 2020). Specifically, within the context of A/B testing, individual and group fairness constraints become important and relevant at different stages of the A/B testing cycle. Individual fairness concerns the distribution of benefits and risks and should be preserved at the time of traffic allocation, by relying on hashing that randomly assigns visitors to groups A and B (sketched below). The protection of members of different protected groups becomes more relevant at the stage of final acceptance. Let us examine these points in turn.
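As a brief aside, the hash-based assignment just mentioned is commonly implemented along the following lines; this is a minimal sketch, and the salting scheme, function names, and traffic share are illustrative assumptions.

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user into 'control' or 'treatment'.

    Salting the hash with the experiment id keeps assignments
    independent across concurrent experiments.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # approximately uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# The same user always lands in the same arm for a given experiment.
assert assign_variant("user_42", "feed_ranker_v2") == \
       assign_variant("user_42", "feed_ranker_v2")
```

Because the bucket depends only on a salted hash of the user id, assignment is statistically independent of any user attribute, which is what preserves individual fairness at allocation time.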

In healthcare and clinical contexts, the principle of equipoise (a.k.a. the “uncertainty principle”) holds that a user should partake in an A/B test only if there is genuine uncertainty about which condition is most likely to benefit the participant (compare the discussion of opacity in Sect. 2.3) (Friedman & Nissenbaum, 1996; MacKay, 2018). Slightly more formally, to be in equipoise between two conditions A and B is to be cognitively indifferent between the statement “A is strictly more effective than B” and its negation. Equipoise regarding A and B is often considered sufficient for an assignment to be fair. However, the original definitions of equipoise, introduced in the context of medical randomized controlled trials (RCTs), would substantially impede A/B testing if interpreted literally. The definition should therefore be softened in the context of A/B testing, in the following way: A/B testing departs from principles of fairness whenever there is clear available evidence that one condition would lead to better outcomes and most users would not be indifferent regarding the condition to which they are assigned.
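The indifference at the heart of equipoise can be stated more formally; the following gloss is our own illustration, not drawn from the cited sources. Writing $c$ for the experimenter's credence and $\mu_A$, $\mu_B$ for the expected KPI outcomes under conditions A and B, equipoise holds when

$$c(\mu_A > \mu_B) = c(\mu_A \le \mu_B) = \tfrac{1}{2},$$

and the softened criterion deems an assignment unfair only when $c(\mu_A > \mu_B)$ (or its converse) is close to 1 and users are not indifferent about which condition they receive.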

Fairness is closely related to the concept of bias: a biased system “systematically and unfairly discriminates against certain individuals or groups of individuals in favor of others” (Friedman & Nissenbaum, 1996). Issues of bias do not obviously arise at the stage of traffic allocation; they become more relevant at the stage of final acceptance. When we ask whether a new algorithm improves a KPI (e.g., number of likes), it is imperative that we consider all the relevant groups within a population. For example, consider ranking and information retrieval, where ranking systems have a responsibility both to their users and to the items being ranked. Importantly, people in current information retrieval systems are not only the ones issuing search queries; increasingly, they are also the ones being searched. This is especially important, as a number of studies have shown that ranked lists produced by a biased machine learning model can result in unfairly limited visibility for an already disadvantaged group (Geyik et al., 2019; Imana et al., 2021). Given the important role that ranking systems have come to play on websites such as Airbnb and Uber, or on human resource matchmaking platforms such as LinkedIn, changes of rank can have a tangible impact on people’s lives. Thus, controlling for key socioeconomic variables when evaluating the treatment effect of A/B tests may prove a sensible approach to uncovering these asymmetric effects, as the sketch below illustrates.
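The subgroup check suggested above can be sketched in a few lines of pandas; the table, column names, and numbers are hypothetical.

```python
import pandas as pd

# Toy per-user results; 'group' could be any socioeconomic or
# protected attribute available for analysis (hypothetical data).
df = pd.DataFrame({
    "variant": ["control", "treatment"] * 4,
    "group":   ["A", "A", "A", "A", "B", "B", "B", "B"],
    "likes":   [10, 12, 11, 13, 10, 8, 9, 7],
})

# The overall ATE looks at the population as a whole...
overall_ate = (df.loc[df.variant == "treatment", "likes"].mean()
               - df.loc[df.variant == "control", "likes"].mean())

# ...while a per-group breakdown can reveal asymmetric effects.
means = df.groupby(["group", "variant"])["likes"].mean().unstack("variant")
ate_by_group = means["treatment"] - means["control"]

print(f"Overall ATE: {overall_ate:+.2f}")  # +0.00: looks neutral overall
print(ate_by_group)                        # group A: +2.0, group B: -2.0
```

Here a flat overall ATE hides the fact that the treatment benefits one group at the expense of another, exactly the kind of asymmetric effect the text warns about.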

Considering the sensitive nature of the topic, it is no surprise that there are no published papers outlining evidently biased and discriminatory conditions included in A/B tests. Yet these concerns are not at all far-fetched. Take recent arguments to the effect that Google’s search engine is algorithmically biased (Noble, 2018). It has been shown that searching for keywords like “Black girls” directed users to adult sites where women of color are hypersexualized. Other minorities and even religious groups are often associated with harmful stereotypes when searched for on commercialized search sites. As it turns out, biases become a threat to the fair treatment of users in the context of A/B testing as well. The soft ethics framework introduced here contends that instances of A/B testing that amplify biased behavior are unethical and must be avoided.

Discrimination refers to a difference in how individuals are treated based on their membership in a group. Instances of discrimination are broader than instances of bias; further, while bias is a dimension of the process, discrimination describes the effects of the process. Notably, in many cases discrimination is neither illegal nor obviously problematic. Yet it can frequently raise concerns. A case in point is personalized pricing, a form of discrimination in which costs are scaled to an individual’s (predicted) willingness to pay. To economists, personalized pricing can be a desirable feature, with the potential to improve allocative efficiency (Inderst & Shaffer, 2009), although it should be noted that allocative efficiency does not necessarily entail social welfare (Bergemann et al., 1996). However, people’s perceptions of fairness are often at odds with dynamic and personalized pricing, as shown in seminal work by Kahneman et al. (1986) and confirmed more recently by Inderst and Shaffer (2009).

If these practices are conducted using opaque means, there is also a risk that they reduce trust and create a perception of unfairness. While our framework does not specifically recommend against experimenting on personalized pricing, there is a general recommendation here that experimenters should be wary of A/B tests that would be deemed unfair and unethical by users should they become public.

2.5 Non-maleficence

This principle emphasizes the importance of not harming individuals. A/B testing should minimize the risk of causing psychological or emotional harm. For example, consider experiments on engagement on social networking websites. Addiction to social networks is not formally recognized as a diagnosis (Moqbel & Kock, 2018), but psychological dependency on such sites may interfere with important duties and activities. Notably, social media use has been associated with negative consequences such as reduced productivity, unhealthy relationships, and reduced life-satisfaction (Ponnusamy et al., 2020).

In the context of experimental research with human subjects, it is customary to accept that additional safeguards must be included in experiments involving vulnerable subjects such as children, prisoners, pregnant women, mentally disabled persons, or economically or educationally disadvantaged persons (Resnik, 2018). For instance, adults with mental disabilities or diseases that impair decision-making need additional protections in research because their ability to consent to research participation may be compromised. In the context of A/B testing, experimentation should be governed thoughtfully to protect the most vulnerable populations, and additional safeguards must be included to protect the rights and welfare of these subjects. The issue and principle have gained special importance considering recent controversies based on the claim that Facebook (Meta) knew that Instagram was proving toxic for teenagers (Wells et al., 2021). It is critical that companies involved in A/B testing put in place screening practices to help exclude members of vulnerable groups from experiments, just as vulnerable subjects are screened and excluded from clinical RCTs (see for instance the United States Code of Federal Regulations Title 45, Part 46, subparts B, C, and D). Protections for vulnerable populations should be put in place in addition to, not in lieu of, overall protections for all users, as vulnerability may be context-specific.

While physical harm might not seem to be a primary concern in the context of traditional cases of A/B testing, it is worth noting that the relevance of this ethical concern should not be dismissed too quickly. For example, companies such as Lyft have introduced A/B testing in the context of hardware (Drayna et al., 2021), raising further kinds of ethical concerns about the safety and physical protection of participants.

2.6 Beneficence

The promotion of beneficial A/B testing can arguably be perceived as placing an unreasonable expectation on company-sponsored experimentation. On this view, non-maleficence is sufficient for ethical A/B testing, because tech industry settings differ from academic and medical research settings in several ways. For instance, medical practice is bound by the Hippocratic Oath, but there is no equivalent industry-wide oath for technology. However, considering the increasingly important role of Corporate Social Responsibility (CSR), it can be argued that a light, basic duty of beneficence should be understood as relevant to today’s experimentation practices. After all, discussions on the importance of the so-called “triple bottom line”—the need to care not just for profit but also for people and the planet—are not new (Elkington, 1997). In the broadest sense, the term “beneficence” refers here to the principle of considering and advancing the well-being of users. Beneficence generally means doing good or engaging in acts of kindness. Over and above refraining from harming others, the principle of beneficence thus requires companies to promote users’ welfare.

3 Incentive Mechanisms for a Soft Ethics Framework

Companies should not be left alone in trying to elevate their standards of ethical experimentation. The engineers and developers often involved in experiments are not systematically trained in ethics, may perceive ethical considerations as unnecessary red tape, and must grapple with unavoidable conflicts of interest due to the close link between business and science. To foster compliance with ethical principles, companies need to be properly educated, governed, and incentivized. Arguably, however, ensuring ethical treatment of human subjects in the context of A/B testing is too complex to be addressed with a single, simplistic solution. On the contrary, several strategies need to be in place and several players need to be involved. Our suggestions are not meant to conclusively resolve these debates, but rather to serve as a starting point in what will surely be a long ethical journey for the community. As complex problems typically require complex solutions, we submit several candidate mechanisms that should prove helpful in fostering the adoption of soft ethics. What follows is a non-exhaustive outline of plausible recommendations.

3.1 Institutional Review Boards

A first mechanism that needs to be considered to promote a framework of soft ethics is the Institutional Review Board (IRB). An IRB is essentially a panel of experts who review proposed research and determine whether any potential ethical concerns it might pose are sufficiently mitigated by the methodology or nature of the specific project. In the United States, IRBs in medicine were introduced to manage the ethical risks commonly faced in human subjects research, some of which led to particularly horrific conduct. One notorious example is the Tuskegee experiments, in which doctors refrained from treating Black men with syphilis, despite the availability of penicillin, so that they could study the disease’s unmitigated progression (Alsan & Wanamaker, 2018). More generally, the goals of an IRB include upholding the core ethical principles of respect for persons, beneficence, and justice. IRBs carry out their function by approving, denying, and suggesting changes to proposed research projects.

Some countries, like the United States, require research institutions to have an IRB as a condition for federal funding. Before conducting a given study, a researcher submits it to the IRB at her university, and only after IRB approval may the research begin. If the IRB declines to authorize the study, the researcher must work with the IRB to alter the study’s nature or methods to address the IRB’s concerns. If the researcher is unable to meet the IRB’s demands, then the research, in theory, must not be conducted. None of this applies to company-sponsored research.

IRBs can help identify and mitigate ethical risks in A/B testing. Just as in medical research, IRBs should not only approve and reject proposals but also make ethical risk-mitigation recommendations to researchers and product developers.

While an A/B testing IRB would obviously be welcome as an important mechanism to minimize ethical risk, this solution may raise new dilemmas. Companies such as Microsoft and Meta have been launching internal IRBs in recent years, but it is unclear to what extent these boards can be truly independent (Wong & Floridi, 2023). The issue is especially relevant considering that the social contagion study mentioned above was approved by Facebook’s IRB (Kramer, 2014). Some have plausibly argued that letting Facebook conduct and approve an ethical review of the study was like leaving the fox to guard the henhouse (Boesel, 2014). At the same time, opting for external IRBs would raise another set of worries. Lengthy review times for IRBs are a well-known barrier to research, and A/B tests are often time-sensitive (Liberale & Kovach, 2017). Unsurprisingly, there have been calls to improve the efficiency of the review process (Spellecy et al., 2018). But arguably the reform should be very ambitious, as a recent study of IRBs revealed that only 6% had tools sufficient for considering the area of internet research (Zimmer, 2020).

Considering this, we welcome the rise of IRBs internal to corporations and believe that these mechanisms can become critical for approving research that involves more than minimal risk. We are aware that corporate IRBs may not benefit from the same degree of independence that academic IRBs do; yet this is precisely the issue that should be addressed. Current IRB turnaround times, policies, and procedures are generally perceived to be hardly compatible with corporate expectations, business needs, expected agility, and the ubiquity of A/B tests. In light of these concerns, we outline some tentative solutions aimed at increasing accountability without overlooking feasibility.

To begin with, companies could be required to make IRB reviews and deliberations public upon request, or perhaps to publish reviews of what experiments were conducted alongside their context and rationale. Arguably, this can add an extra layer of accountability should the IRB approve some dubious experiment, or potentially expose internal pressures that drove poor decisions. Further, it is advisable that IRBs include both company employees and external members (e.g. from academia), in order to at least partially mitigate the “foxes shouldn’t guard henhouses” problem and the “IRB laundering” mentioned in Grimmelmann (2015). This would help ensure that unethical experiments are not approved in the first place.

In all, IRBs are critical when research involves more than minimal risk. Neither a completely internal nor a completely external review board would seem ideal: the former would likely face conflicts of interest, while the latter could lack the agility and speed required in industry settings. A mixed board that includes both internal and external members appears to be a promising avenue. However, solutions must also be feasible and realistic to gain wide adoption. Hence, to further reduce turnaround times with external board members, we suggest that internal members be granted some kind of preemptive right to approve an A/B test if certain conditions are met (e.g. review time exceeds n months, external members are not cooperating) and if there is agreement among all internal members.

3.2 Soft Ethics from the Perspective of ESG

Soft ethics can also be explored from the angle of Environmental, Social, and Corporate Governance (ESG). ESG has increasingly become a CEO-/CFO-level topic, underlining its importance for the entire organization and its implications for risk management as well as differentiation. According to IDC (2021), almost 75% of companies have already integrated, or are currently in the process of integrating, ESG considerations into their business approach. While this suggests that there may be a compelling business case for adopting a soft ethics framework that goes beyond the mere purpose of “doing good”, ethical experimentation is not mentioned in CSR or ESG reports when accounting for companies’ social footprint. Interestingly, however, AI ethics has recently been approached from the ESG angle (Owe & Baum, 2021). We maintain that there is an opportunity here for companies seeking a first-mover advantage to differentiate themselves in the marketplace by adopting a soft ethics framework in the context of A/B testing.

3.3 The Role of Conferences, Journals and Editorial Guidelines

Journals and associations can also play a critical role in facilitating compliance with ethical principles of experimentation. In particular, scientific publication has become increasingly popular among researchers in big tech, resulting in a growing number of corporate-affiliated papers published every year in journals and conference proceedings (e.g. SIGIR, KDD, WSDM, WWW, RecSys, CIKM). The fact that researchers frequently communicate via peer-reviewed publications matters substantially, because research can be significantly shaped by journals’ editorial decisions and policies. A few well-intentioned guidelines have been published. For example, bodies such as the Association for Computing Machinery (ACM) have long maintained ethical guidelines. However, the impact of these guidelines has been modest, and they have remained virtually invisible to a large part of the A/B testing community. It is good news that things have started to change. In our experience, an increasing number of conferences (e.g. NeurIPS, EMNLP) are encouraging reviewers to raise ethical concerns. This is important, as journals’ and conferences’ editorial decisions end up influencing the kind of projects that researchers will carry out (Horvat et al., 2015; Polonioli, 2017). Journal and conference guidelines are thus relevant, as the prospect of manuscript rejection or article retraction may be an important driver of compliance with ethical standards. More generally, we argue for an increasingly important role of conferences and journals in promoting ethical A/B testing. Besides educating practitioners and acting as gatekeepers by requiring relevant information about the experiments involved, conference and journal editors could also become members of mixed IRBs (as suggested in Sect. 3.1) and host the release of proceedings through which companies disclose information regarding the online experiments they conducted.

3.4 The Use of Participant Compensation

Another incentive mechanism is to require companies to compensate study participants, a common practice in academic research with human subjects. For instance, in fields such as experimental economics, monetary rewards for participants have been the norm for decades (Hertwig & Ortmann, 2001). However, compensation need not be monetary: access to a platform’s premium features or new designs could be a viable alternative (similar to what is done with beta testing in software engineering). This could be helpful in at least two ways. First, participants would need to be made aware that they are part of the experiment, thus truly enforcing the informed consent principle (and giving them the option to opt out). Second, companies would need to compensate participants, thus potentially reducing the number of unnecessary and potentially harmful A/B tests. To mitigate problems with the latter consequence, it could be argued that compensation should be required only when assessment by the IRB reveals that the test poses significant risks (e.g., Kramer et al.’s (2014) experiment would require compensation and informed consent, while a change in website layout could be carried out without compensation). Compensation has raised several issues in RCTs before, such as the exploitation of vulnerable populations (Pandya & Desai, 2013). Although we do not deny that the use of compensation may raise issues as well, it seems an interesting avenue to explore for mitigating and regulating the use of online experiments, which as of now are entirely unregulated. Further, we have already argued in Sect. 2.5 that measures should be in place to protect vulnerable populations.

3.5 Prompting Questions for an Ethical Use of A/B Testing in Industry Settings: A DEC Methodology

While all the mechanisms discussed so far may play a role in elevating the ethical standards of online experimentation, we also wish to offer a complementary tool specifically designed for practitioners to stimulate ethical deliberations. In the Appendix, we provide a list of questions to employ across the experimentation lifecycle, meant to motivate deeper reflection on whether A/B testing is being used responsibly and ethically. We suggest that companies follow this list of questions and start compiling documentation that provides a concise, holistic picture of an A/B test. Documentation should be aimed at both internal and external audiences.

More precisely, we encourage the creation of “A/B Test Cards” inspired by Google’s “Model Cards”. Mitchell et al. (2019) introduced the idea of Model Cards: a “one-pager” summing up what we know about a given model, covering its inputs and outputs, of course, but also its accuracy on a test set, biases and limitations, best practices for its use, and further relevant information. In line with the data-centric AI movement, the importance of extending documentation to the context surrounding a model—data, training, operations—has led to the creation of Directed Acyclic Graph (DAG) Cards, which emphasize “documentation as code” as a best practice for developers, and causal lineage for reporting and debugging (Tagliabue et al., 2021). We suggest that companies draw inspiration from this approach to reporting and start documenting their online experiments by offering brief summaries of their findings in clear and simplified formats, focusing on scoping, design, implementation, and dissemination.

In the Appendix, we provide the full “A/B Test Card” in line with the ethical analysis developed in this article. We propose a set of sections that an A/B test card should ideally include, and we provide a list of 15 questions, each accompanied by a description of its relevance and guidance on how to answer it. We acknowledge that not all sections and questions have the same importance; in particular, those regarding the preservation of autonomy and the avoidance of harm are more important than others, for instance those concerning pre-registration. Further, we appreciate that the weight attributed to each question and section might to some extent depend on the specific use case. For example, for experiments involving particularly novel and ethically risky hypotheses, it becomes even more important to conduct thorough research and formulate detailed hypotheses. Although this inevitably introduces some elements of subjectivity, we believe that the A/B test card greatly facilitates the evaluation process and fosters compliance with best practices to mitigate ethical risks and increase accountability. We encourage practitioners not only to rate their compliance with best practices but also to assess the relevance and weight of all checklist items.
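As a rough illustration of what such a card might contain, consider the sketch below; the fields are our own condensation for illustration and do not reproduce the Appendix’s sections or its 15 questions.

```python
from dataclasses import dataclass

@dataclass
class ABTestCard:
    """A one-page summary of an online experiment (illustrative sketch)."""
    experiment_id: str
    hypothesis: str                   # scoping: what is tested and why
    kpi: str                          # evaluation criterion
    variants: dict                    # design: control and treatment
    traffic_allocation: str           # implementation: how users are assigned
    consent_mechanism: str            # autonomy: how participants are informed
    vulnerable_groups_excluded: list  # non-maleficence: screening applied
    fairness_checks: list             # fairness: subgroup analyses planned
    known_risks: list                 # anticipated harms and mitigations
    results_summary: str = ""         # dissemination: filled in after the test

card = ABTestCard(
    experiment_id="feed_ranker_v2",
    hypothesis="New ranking model lifts engagement without harming well-being",
    kpi="likes per session",
    variants={"control": "current ranker", "treatment": "new ML ranker"},
    traffic_allocation="50/50 hash-based split on user_id",
    consent_mechanism="in-app notice with opt-out",
    vulnerable_groups_excluded=["minors"],
    fairness_checks=["ATE by age band", "ATE by region"],
    known_risks=["short-term engagement drop in treatment arm"],
)
```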

3.6 Other Mechanisms

A possible objection to our recommendations and analyses accepts the overall sensibility of the approach but argues that it is unlikely to have a tangible impact, since not all companies will be inclined to follow our recommendations, and the external mechanisms discussed so far are not strong enough. To be sure, we agree that other external mechanisms could play an important role in driving compliance. For instance, one of the biggest tools users and consumers have is the ability to scrape, monitor, and inspect platforms. ProPublica’s Facebook Political Ad Collector is a case in point. ProPublica (a non-profit newsroom focusing on investigative journalism) developed a browser extension to collect political ads on Facebook in a crowdsourced manner. The investigative journalists at ProPublica were able to purchase housing ads that specifically excluded African Americans, mothers of high school kids, people interested in wheelchair ramps, Jews, expats from Argentina, and Spanish speakers (Larson, 2017). Notably, many of these forms of discrimination are not only ethically challenging but even illegal, insofar as they involve protected categories of ads (housing and job) and persons (ethnicity and disability). We believe that similar initiatives, and more generally the possibility of crawling, scraping, and auditing platforms (Jiang et al., 2019), will also be powerful mechanisms to empower users and society.

4 Conclusion

Protection protocols have become the norm in both medical research and the social and behavioral sciences. However, the use of human subjects in research that is not federally or publicly funded—such as privately funded A/B testing, often affecting millions of potentially unaware people—has remained unregulated. Unfortunately, the growing literature on A/B testing has not paid sufficient attention to the practice’s ethical dimensions. This article fills this gap by introducing a new governance framework that explicitly recognizes how the rise of an experimentation culture in industry settings brings not only unprecedented opportunities to businesses but also responsibilities. The ethical framework recommended in this article is meant to be actionable, reasonable, scalable, and legally compliant. The ethics of A/B testing and responsible causal inference is a nascent area of research, and we encourage readers to take a critical view of our assertions and use them as a point of departure for further thought and exploration.