What is the Value of Experimentation and Measurement?

Experimentation and Measurement (E&M) capabilities allow organizations to accurately assess the impact of new propositions and to experiment with many variants of existing products. However, until now, the question of measuring the measurer, or valuing the contribution of an E&M capability to organizational success, has not been addressed. We tackle this problem by analyzing how, by decreasing estimation uncertainty, E&M platforms allow for better prioritization. We quantify this benefit in terms of expected relative improvement in the performance of all new propositions and provide guidance on how much an E&M capability is worth and when organizations should invest in one.


Introduction
The value of making data-driven or data-informed decisions has become increasingly clear in recent years. The key to making data-driven decisions is the ability to accurately measure the impact of a given choice and to experiment with possible alternatives. We define Experimentation & Measurement (E&M) capabilities as the knowledge and tools necessary to run experiments (controlled or otherwise) with different products, services, or experiences, and measure their impact. The capabilities may be in the form of an online-controlled experiment framework, a team of analysts, or a system capable of performing machine learning-aided causal inference.
The value of E&M is currently best reflected in the success of major organizations that have adopted and advocated for them in the past decade. A large number of major technology companies report having mature infrastructure for online-controlled experiments (OCEs, e.g., Google [1], Linkedin [2], and Microsoft [3]) and/or are heavily investing in state-of-the-art techniques (e.g., Airbnb [4], Netflix [5], and Yandex [6]). Amazon [7] and Facebook [8] have also reported the use of various causal inference techniques to measure the incrementality of advertising campaigns. A number of startups (e.g., Optimizely [9] and Qubit [10]) have also recently been established purely to manage OCEs for businesses.
While mature E&M capabilities can quantify the value of a proposition, it remains a major challenge to "measure the measurer": to quantify the value of the capabilities themselves. To the best of our knowledge, there is no work that addresses the question "should we invest in E&M capabilities?" or how to value these capabilities, making it difficult to build a compelling business case to justify investment in the related personnel and infrastructure. We address this problem by calculating both the expected value and the risk, allowing the Sharpe ratio [11] for an E&M capability to be calculated and compared to other potential investments.
The value created by E&M capabilities can be divided into three classes: (1) recognizing value, (2) prioritizing propositions, and (3) optimizing individual propositions.
1. Recognizing value: E&M capabilities enable value to be attributed to a product, proposition, or service. They also prevent damage from propositions that have negative value. This is important for dynamic organizations with large numbers of propositions, as the damage caused by individual roll-outs can be compartmentalized and contained, in a similar fashion to unit and integration testing in software development.
2. Prioritization: Without E&M capabilities, prioritization is based on back-of-envelope estimates or gut feel, which carry high uncertainty. E&M reduces the magnitude of the noise arising from estimation, enabling prioritization based on estimates that are closer to the true values and improving long-term decision making (see Fig. 1).
3. Optimization: E&M capabilities allow large numbers of variants to be evaluated against each other and the best to be selected efficiently. Without such capabilities, propositions can be experimented with sequentially, but this is slow and introduces noise from the changing environment.
The value of (1) comes from rolling back negative propositions. Given an E&M capability, it can be calculated by summing the negative contributions of unsuccessful propositions. In the absence of a capability, it can be estimated from the value distribution of propositions, which is given across industries in [10,12]. The value of (3) is the difference between the maximum and the mean value for each variant summed over the number of propositions. This can be estimated by placing Gaussian distributions over variants for each proposition or evaluated in the case that an E&M capability exists.
While quantifying the values of (1) and (3) is relatively straightforward, quantifying the value of (2) is more interesting and is the subject of the remainder of this paper. E&M capabilities improve prioritization by reducing uncertainty in the value estimates of each proposition. This is a form of ranking under uncertainty, a well-studied problem in statistics and operations research. However, in all previous work, either the variance is assumed to be a fixed constant, or it is changed without the resulting value being measured. Here, we wish to understand the value of variance reduction through E&M.
Our contribution is as follows. We
1. Specify the first model that values the contribution of an E&M capability in terms of better prioritization due to reduced estimation noise for propositions (Sects. 3, 4);
2. Derive the variance of our estimate, allowing a Sharpe ratio to be calculated to guide organizations considering investment in E&M (Sect. 5); and finally
3. Provide two case studies based on large-scale meta-analyses that reflect how our model can be applied in real-world practice (Sect. 7), and two extensions that open the door to future work in this area (Sect. 8).

Related Work
There is a large literature on the use of controlled or natural experiments. A number of works are dedicated to running trustworthy online-controlled experiments [13], choosing good metrics [14], and designing experiments where samples are dependent due to external confounders [15,16]. While important contributions, these works assume the existence of E&M capabilities. However, to the best of our knowledge, there is no literature that helps organizations justify the acquisition of E&M capabilities. We believe that filling this gap is necessary for wider adoption, and that increased participation will accelerate the development of the field. This paper is related to existing work in statistics and operations research, in particular, on decision making under uncertainty, which has been extensively studied since the 1980s. Notable work includes proposals for additional components in a decision maker's utility function [17], alternate risk measures [18], and a general framework for decision making with incomplete information (i.e., uncertainty) [19]. These works assume the inability to change the noise associated with estimation and/or measurement.

Fig. 1: Prioritizing four projects (the fruits) according to their value (x-axes). The semi-opaque icons represent the projects' true value, and the solid icons represent possible project value estimates under some level of uncertainty (horizontal lines) in the estimation process. (Top) Under a noisy estimation process, projects with a low true value (e.g., project apple) may appear to have a high value and be prioritized erroneously. (Bottom) With E&M, the estimation noise is reduced, which enables better prioritization with value estimates that are closer to the truth.
The sub-problem of ranking under uncertainty has also attracted considerable attention, partially due to the advent of large databases and the need to rank results whose relevance is uncertain [20]. While Zuk et al. [21] measured the influence of noise levels in their work, they focused on the quality of the ranks themselves rather than the value associated with the ranks.
The project selection problem is a related problem in optimization, where the goal is to find the optimal set of propositions using mixed integer linear programming, possibly under uncertainty. Work in this domain generally seeks methods that cope with existing risk/noise [22], and to the best of our knowledge, there is no work that considers the value gained from reducing risk. While Shakhsi-Niaei et al. [23] have discussed lowering the uncertainty level during the selection process, they refer to the uncertainty of decision parameters rather than the general noise level.

Mathematical Formulation
We formulate the prioritization problem, and the value gained from E&M capabilities, by considering $M$ propositions that must be selected from $N$ candidates, where $M < N$. The estimated value of each proposition is given by

$$Y_n = X_n + \epsilon_n, \quad n = 1, \ldots, N, \tag{1}$$

where the $X_n$ are the true (unobserved) values that are estimated with error $\epsilon_n$. The propositions are labeled in ascending order of estimated value $Y_n$ to get the order statistics $Y_{(1)} \le Y_{(2)} \le \cdots \le Y_{(N)}$, where $I(\cdot)$ denotes the index function that maps a rank to the index of the corresponding proposition. We define the mean true value of the $M$ selected propositions as

$$V = \frac{1}{M} \sum_{r=N-M+1}^{N} X_{I(r)}, \tag{2}$$

where a good prioritization maximizes $V$. Part of the value of E&M capabilities arises from the observation that $V$ increases when the magnitude of the uncertainties arising from estimation ($\epsilon_n$) decreases. We are interested in the value gained by reducing estimation uncertainty without changing the set of propositions (i.e., retaining all $X_n$), as the true values of the propositions do not depend on the measurement method used:

$$D = V_2 - V_1, \tag{3}$$

where $V_1$ and $V_2$ denote the mean true value of the selected propositions under the original and the reduced estimation noise, respectively.
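This setup is easy to simulate. The sketch below (illustrative, with hypothetical parameter values; not the paper's released code) draws true values $X_n$, ranks them by their noisy estimates $Y_n$, and reports the mean true value $V$ of the top-$M$ selection under two noise levels:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_true_value(n_candidates, n_selected, sigma_x, sigma_noise, n_runs=2000):
    """Simulate V: the mean true value of the M propositions whose
    noisy estimates Y_n = X_n + eps_n rank highest."""
    vs = np.empty(n_runs)
    for i in range(n_runs):
        x = rng.normal(0.0, sigma_x, n_candidates)          # true values X_n
        y = x + rng.normal(0.0, sigma_noise, n_candidates)  # estimates Y_n
        top = np.argsort(y)[-n_selected:]                   # indices I(r) of the top M
        vs[i] = x[top].mean()                               # V for this run
    return vs.mean()

v_noisy = mean_true_value(100, 10, sigma_x=1.0, sigma_noise=2.0)
v_clean = mean_true_value(100, 10, sigma_x=1.0, sigma_noise=0.5)
print(v_noisy, v_clean)  # V is larger when the estimation noise is smaller
```

The same true values are recoverable under both noise levels; only the quality of the ranking changes, which is exactly the effect $D$ captures.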

Modeling Values with Statistical Distributions
To value an E&M capability, which is a generic framework that can be applied in many different ways across diverse organizations, it is first necessary to make some simplifying assumptions about the statistical properties of the propositions under consideration. We assume the values of the propositions ($X_n$) and the estimation noises ($\epsilon_n$) are randomly distributed:

$$X_n \sim (\mu_X, \sigma_X^2), \qquad \epsilon_n \sim (\mu_\epsilon, \sigma^2), \tag{4}$$

where $\epsilon_n \perp X_m \;\forall\, n, m$ (see the LHS of Fig. 2). We note two special cases, one when both the values and the noises are assumed to be normally distributed:

$$X_n \sim N(\mu_X, \sigma_X^2), \qquad \epsilon_n \sim N(\mu_\epsilon, \sigma^2), \tag{5}$$

and the other when both the values and the noises are assumed to follow generalized Student's t-distributions:

$$X_n \sim \mu_X + \sigma_X \sqrt{\frac{\nu-2}{\nu}}\, t_\nu, \qquad \epsilon_n \sim \mu_\epsilon + \sigma \sqrt{\frac{\nu-2}{\nu}}\, t_\nu, \tag{6}$$

where $t_\nu$ is a standard Student's t-distribution with $\nu$ degrees of freedom. The location and scaling parameters ensure $X_n$ and $\epsilon_n$ have the mean and variance specified in (4). (The RHS of Fig. 2 shows the setting where the noise level is changed from $\sigma_1^2$ to $\sigma_2^2$, yielding two sets of observed values, $Y_n$ and $Z_n$, one for each noise level.)

These two cases are particularly relevant as meta-analyses compiled on the results of 6700 e-commerce [10] and 432 marketing experiments [12], respectively, indicate that the uplifts measured by the experiments, and hence the values of the propositions under some estimation noise, exhibit the following properties:
1. They can be positive or negative,
2. They are usually clustered around an average instead of spreading uniformly across a certain range, and
3. The distributions are heavy tailed.
The normal assumptions cover only the first two properties, yet they enable one to draw on the wealth of results in order statistics and Bayesian inference related to normal distributions to get started. The t-distributed assumptions also cover property 3, though valuation under such assumptions is more complicated as t-distributions do not have conjugate priors.
For brevity, we subsume the valuation under t-distributed assumptions under the general case. We will, however, present empirical results in Sect. 8.1 showing that the value gained under t-distributed assumptions has a higher mean and variance, demonstrating that the model can capture the "higher risk, higher reward" concept.
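The difference between the two assumptions is easy to see by sampling. Below is a small sketch (parameter values are illustrative only) that draws values under (5) and (6) with matched means and variances and compares their tail mass:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_x, var_x, nu = 0.0, 0.006**2, 3  # illustrative: 0.6% spread, 3 degrees of freedom

# Normal draws with the moments specified in (4)
x_norm = rng.normal(mu_x, np.sqrt(var_x), 100_000)

# Generalized Student's t draws: a standard t_nu has variance nu/(nu - 2),
# so scaling by sqrt(var_x * (nu - 2) / nu) matches the variance in (4)
x_t = mu_x + np.sqrt(var_x * (nu - 2) / nu) * rng.standard_t(nu, 100_000)

tail = 3 * np.sqrt(var_x)  # three standard deviations
print(np.mean(np.abs(x_norm) > tail))  # ~0.003 under the normal model
print(np.mean(np.abs(x_t) > tail))     # noticeably larger under t_3: heavy tails
```

Both samples share the same first two moments; only the tail behavior (property 3) distinguishes them.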

Key Results
In the next two sections, we will derive the expected value and variance for V, the mean true value of the top M propositions selected after being ranked by their estimated value (as defined in (2)), as well as the expected value and the variance of D, the value gained when the estimation noise is reduced.
We will also provide two key insights. Firstly, the expected mean true value of the selected propositions ($V$) increases when the estimation noise ($\sigma^2$) decreases, and the relative increase in value depends on how much noise we can reduce. Secondly, when $M$ is small, reducing the estimation noise may not lead to a statistically significant improvement in the true value of the propositions selected. As a result, improvements in prioritization driven by E&M may only be justified for larger organizations.

Calculating the Expectation
We first derive the expected value for $D$. This requires the expected values of, in order:
1. $Y_{(r)}$: the estimated value of the $r$th proposition, ranked in increasing estimated value;
2. $X_{I(r)}$: the true value of the $r$th proposition, ranked by increasing estimated value; and
3. $V$: the mean of the true value for the $M$ most valuable propositions, ranked by their estimated values.
To obtain the expected value of $Y_{(r)}$, we apply a result by Blom [24], which states that the expected value of the order statistic $Y_{(r)}$ can be approximated as

$$\mathbb{E}(Y_{(r)}) \approx F_Y^{-1}\!\left(\frac{r - \alpha}{N - 2\alpha + 1}\right), \tag{7}$$

where $F_Y^{-1}$ denotes the quantile function for $Y_n$, and $\alpha \approx 0.4$ [25].
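Blom's approximation is straightforward to check numerically. In this sketch (parameters chosen purely for illustration) we compare (7) against a Monte-Carlo estimate of $\mathbb{E}(Y_{(r)})$ for standard normal $Y_n$:

```python
import numpy as np
from scipy import stats

def blom(r, n, quantile_fn, alpha=0.4):
    """Blom's approximation to E[Y_(r)] for a sample of size n."""
    return quantile_fn((r - alpha) / (n - 2 * alpha + 1))

N, r = 100, 95
approx = blom(r, N, stats.norm.ppf)  # Y_n ~ N(0, 1) for illustration

rng = np.random.default_rng(2)
samples = np.sort(rng.standard_normal((20_000, N)), axis=1)
empirical = samples[:, r - 1].mean()  # Monte-Carlo E[Y_(95)]
print(approx, empirical)  # agree to roughly two decimal places
```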
The expected value of $X_{I(r)}$ is obtained by using a result from [26] (Eq. 6.8.3a):

$$\mathbb{E}(X_{I(r)}) = \mu_X + \frac{\sigma_X^2}{\sigma_X^2 + \sigma^2}\left(\mathbb{E}(Y_{(r)}) - \mu_X - \mu_\epsilon\right). \tag{8}$$

Equation (8) shows that decreasing the estimation noise $\sigma^2$ will lead to an increase in $\mathbb{E}(X_{I(r)})$ for any $r > (N+1) \cdot F_Y(\mu_X + \mu_\epsilon)$. It follows that the mean true value of the top $M$ propositions, selected according to their estimated value, will increase in the presence of a lower estimation noise. We show this by applying the expectation function to $V$ defined in (2) to obtain

$$\mathbb{E}(V) = \frac{1}{M} \sum_{r=N-M+1}^{N} \mathbb{E}(X_{I(r)}). \tag{9}$$

We finally consider the improvement when we reduce the estimation noise from $\sigma^2 = \sigma_1^2$ to $\sigma_2^2$. This will be the expected value gained by having better E&M capabilities:

$$\mathbb{E}(D) = \mathbb{E}(V_2) - \mathbb{E}(V_1), \tag{10}$$

where $\mathbb{E}(V_1)$ and $\mathbb{E}(V_2)$ are given by evaluating (9) under $\sigma_1^2$ and $\sigma_2^2$, respectively.

Expectation Under Normal Assumptions
In the special case where the $Y_n$ are normally distributed (with mean $\mu_X + \mu_\epsilon$ and variance $\sigma_X^2 + \sigma^2$), the expected value of the normal order statistic $Y_{(r)}$ is approximately

$$\mathbb{E}(Y_{(r)}) \approx \mu_X + \mu_\epsilon + \sqrt{\sigma_X^2 + \sigma^2}\;\Phi^{-1}\!\left(\frac{r - \alpha}{N - 2\alpha + 1}\right), \tag{11}$$

where $\Phi^{-1}$ denotes the quantile function of a standard normal distribution. It is worth noting that decreasing the estimation noise $\sigma^2$ will decrease $\mathbb{E}(Y_{(r)})$ for any $r > \frac{N+1}{2}$, appearing to lower the average value of the top $M$ propositions. This is a common pitfall: it is not the estimated value of a proposition that is being optimized; what actually matters is the true, yet unobserved, value of that proposition, $X_{I(r)}$, as shown below.
For $X_{I(r)}$, we can simplify (8) either by substituting in (11), or from first principles by noting a standard result in Bayesian inference, which states that the posterior distribution of $X_n$ once $Y_n$ is observed is also normally distributed with mean

$$\frac{\sigma_X^2 (Y_n - \mu_\epsilon) + \sigma^2 \mu_X}{\sigma_X^2 + \sigma^2}, \tag{12}$$

and applying the law of iterated expectations to obtain

$$\mathbb{E}(X_{I(r)}) = \mu_X + \frac{\sigma_X^2}{\sqrt{\sigma_X^2 + \sigma^2}}\;\Phi^{-1}\!\left(\frac{r - \alpha}{N - 2\alpha + 1}\right). \tag{13}$$

Here, decreasing the estimation noise $\sigma^2$ will lead to an increase in $\mathbb{E}(X_{I(r)})$ for any $r > \frac{N+1}{2}$. The value of the propositions chosen ($V$) under normal assumptions then evaluates, by substituting (13) into (9), to

$$\mathbb{E}(V) = \mu_X + \frac{\sigma_X^2}{\sqrt{\sigma_X^2 + \sigma^2}} \cdot \frac{1}{M} \sum_{r=N-M+1}^{N} \Phi^{-1}\!\left(\frac{r - \alpha}{N - 2\alpha + 1}\right). \tag{14}$$

Note the complete absence of $\mu_\epsilon$ in (14), which suggests that systematic bias in estimation will not affect the true value of the chosen propositions in the normal case.
Finally, the expression for the expected value of $D$ when we reduce the estimation noise from $\sigma^2 = \sigma_1^2$ to $\sigma_2^2$ is much neater under normal assumptions, as many terms cancel out in (10), leading to

$$\mathbb{E}(D) = \sigma_X^2 \left(\frac{1}{\sqrt{\sigma_X^2 + \sigma_2^2}} - \frac{1}{\sqrt{\sigma_X^2 + \sigma_1^2}}\right) \cdot \frac{1}{M} \sum_{r=N-M+1}^{N} \Phi^{-1}\!\left(\frac{r - \alpha}{N - 2\alpha + 1}\right). \tag{15}$$

If we further assume that $\mu_X = 0$ (i.e., the true values of the propositions are centered around zero), then the relative gain depends entirely on $\sigma_X^2$, $\sigma_1^2$, and $\sigma_2^2$:

$$\frac{\mathbb{E}(D)}{\mathbb{E}(V_1)} = \sqrt{\frac{\sigma_X^2 + \sigma_1^2}{\sigma_X^2 + \sigma_2^2}} - 1. \tag{16}$$

To calculate the relative improvement in prioritization delivered by E&M under these assumptions, plug into Eq. (16):
1. The estimated spread of the values ($\sigma_X^2$),
2. The estimated variance of the current estimation process ($\sigma_1^2$), and
3. The estimated variance of the estimation process upon acquisition of E&M capabilities ($\sigma_2^2$),
to get an estimate of how much one will gain from acquiring such capabilities.
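Under these normal assumptions, the relative gain reduces to a one-line function, which we can sanity-check by simulation. The sketch below uses hypothetical values (a 0.6% spread of true values, noise reduced from 1% to 0.3%); it is an illustration of our reading of (15) and (16), not the paper's released code:

```python
import numpy as np

def relative_gain(var_x, var_1, var_2):
    """E[D] / E[V1] under normal assumptions with mu_X = 0, per Eq. (16)."""
    return np.sqrt((var_x + var_1) / (var_x + var_2)) - 1.0

formula = relative_gain(0.006**2, 0.01**2, 0.003**2)

# Monte-Carlo check: rank by noisy estimates, keep the top M, average true values
rng = np.random.default_rng(3)
N, M, runs = 1000, 50, 500
x = rng.normal(0.0, 0.006, (runs, N))  # true values, one row per run

def mean_v(noise_sd):
    y = x + rng.normal(0.0, noise_sd, x.shape)
    top = np.argsort(y, axis=1)[:, -M:]
    return np.take_along_axis(x, top, axis=1).mean()

v1, v2 = mean_v(0.01), mean_v(0.003)
print(formula, v2 / v1 - 1.0)  # the two relative gains agree closely
```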

Calculating the Variance
To make effective investment decisions, it is important to understand both the expected value and the risk, or uncertainty, with which this value is delivered. Having derived the expected value in (10), in this section we address the investment risk, given by the variance of $D$.
The variance calculation presents new challenges in addition to those identified in the section above, the most prominent of which concerns the interactions between quantities generated under different estimation noise levels. While these interactions do not affect the expected value, they do influence the variance via the covariance terms. Failure to account for the covariance terms may lead to a large error in the variance estimate.
To address these challenges, we first extend the notation to clarify the interactions. We define two noise levels, $\sigma_1^2$ (assumed to be the higher one) and $\sigma_2^2$, in place of $\sigma^2$ in Sect. 3. The estimated value of each item is then given by

$$Y_n = X_n + \epsilon_{1n}, \qquad Z_n = X_n + \epsilon_{2n}, \tag{17}$$

where $\mathrm{Var}(\epsilon_{1n}) = \sigma_1^2$ and $\mathrm{Var}(\epsilon_{2n}) = \sigma_2^2$. The setup is illustrated in the RHS of Fig. 2.
Having obtained two sets of estimated values, we rank and trace the corresponding indices for each set separately. For the Ys, we denote by $Y_{(r)}$ the $r$th order statistic of $Y_n$, the estimated value of the $r$th-ranked item under noise level $\sigma_1^2$, and by $X_{I(r)}$ the concomitant [26] of $Y_{(r)}$, i.e., the true value of the $r$th item ranked by its estimated value. We repeat the process for the Zs: we denote by $Z_{(s)}$ the $s$th order statistic of $Z_n$ and by $X_{J(s)}$ the concomitant of $Z_{(s)}$. We also define the mean true value of the top $M$ items, ranked by their estimated value, under both noise levels as follows:

$$V_1 = \frac{1}{M} \sum_{r=N-M+1}^{N} X_{I(r)}, \qquad V_2 = \frac{1}{M} \sum_{s=N-M+1}^{N} X_{J(s)}, \tag{18}$$

where $V_1$ is the mean true value under $\sigma_1^2$ and $V_2$ is the mean true value under $\sigma_2^2$. Finally, we denote the difference between the mean true values as

$$D = V_2 - V_1. \tag{19}$$

Deriving the variance is similar to deriving the expectation: one has to obtain the variances for (in order) $Y_{(r)}$/$Z_{(s)}$, $X_{I(r)}$/$X_{J(s)}$, $V_1$/$V_2$, and $D$. The relationship between these quantities is shown in Fig. 3. We note that the expressions for the first three pairs of quantities are very similar to each other, with only the noise-level terms and the indices changed. Thus, we only present the expressions for $Y_{(r)}$, $X_{I(r)}$, and $V_1$ below. The expressions for $Z_{(s)}$, $X_{J(s)}$, and $V_2$ can easily be obtained by substituting in the corresponding quantities and indices ($Z$ for $Y$, $s$ for $r$, $\sigma_2^2$ for $\sigma_1^2$, etc.).

Var($Y_{(r)}$)
We apply a result from David and Johnson [27], which states that the variance of $Y_{(r)}$ can be approximated as

$$\mathrm{Var}(Y_{(r)}) \approx \frac{p_r (1 - p_r)}{(N+2) \left[f_Y\!\left(F_Y^{-1}(p_r)\right)\right]^2}, \qquad p_r = \frac{r}{N+1}, \tag{20}$$

where $f_Y$ and $F_Y^{-1}$ are the probability density function and quantile function of $Y_n$, respectively. In the special case where $X_n$ and $\epsilon_{1n}$ are normally distributed, the variance is

$$\mathrm{Var}(Y_{(r)}) \approx \frac{p_r (1 - p_r)\,(\sigma_X^2 + \sigma_1^2)}{(N+2) \left[\phi\!\left(\Phi^{-1}(p_r)\right)\right]^2},$$

where $\phi$ is the probability density function and $\Phi^{-1}$ is the quantile function of a standard normal distribution.
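The first-order approximation in (20) can be checked directly; the following sketch (illustrative parameters) compares it against the sample variance of a simulated normal order statistic:

```python
import numpy as np
from scipy import stats

def dj_variance(r, n, pdf, ppf):
    """David-Johnson first-order approximation to Var(Y_(r))."""
    p = r / (n + 1)
    return p * (1 - p) / ((n + 2) * pdf(ppf(p)) ** 2)

N, r = 200, 180
approx = dj_variance(r, N, stats.norm.pdf, stats.norm.ppf)  # Y_n ~ N(0, 1)

rng = np.random.default_rng(4)
empirical = np.sort(rng.standard_normal((20_000, N)), axis=1)[:, r - 1].var()
print(approx, empirical)  # close agreement; the approximation tends to sit slightly low
```

The small downward bias of the first-order expansion is consistent with the coverage results reported in Sect. 6.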

Var($X_{I(r)}$)
The variance of $X_{I(r)}$ is then obtained using properties of the concomitants of order statistics [28]:

$$\mathrm{Var}(X_{I(r)}) = \rho_{XY}^4\,\mathrm{Var}(Y_{(r)}) + \sigma_X^2 \left(1 - \rho_{XY}^2\right), \tag{21}$$

where $\rho_{XY} = \sigma_X / \sqrt{\sigma_X^2 + \sigma_1^2}$ denotes the correlation between $X_n$ and $Y_n$.

Var($V_1$)
To derive the variance of $V_1$, we require the covariances between pairs of $Y_{(\cdot)}$s and between pairs of $X_{I(\cdot)}$s, respectively. This is necessary as the terms of $V_1$ (see (18)), being the result of removing noise from successive order statistics, are highly correlated.
David and Nagaraja [26] have provided a formula to estimate the covariance between $Y_{(r)}$ and $Y_{(s)}$ for any $r < s \le N$:

$$\mathrm{Cov}(Y_{(r)}, Y_{(s)}) \approx \frac{p_r (1 - p_s)}{(N+2)\, f_Y\!\left(F_Y^{-1}(p_r)\right)\, f_Y\!\left(F_Y^{-1}(p_s)\right)}, \qquad p_r = \frac{r}{N+1}. \tag{22}$$

To obtain the covariance between $X_{I(r)}$ and $X_{I(s)}$ for any $r, s \le N$, we again refer to [28] (Eq. 2.3d):

$$\mathrm{Cov}(X_{I(r)}, X_{I(s)}) = \rho_{XY}^4\,\mathrm{Cov}(Y_{(r)}, Y_{(s)}). \tag{23}$$

Fig. 3: Relationship between different variances/covariances used to calculate the variance of D, the value gained when the estimation noise is reduced. An arrow from quantity A to B means the value of B is dependent on the value of A.

Equation (23) affirms the claim that the $X_{I(\cdot)}$ are positively correlated: unlike the $X_n$, which are independent by definition, they become correlated in the presence of ranking information. We can now state the variance of $V_1$. Applying the variance function to (18), we get

$$\mathrm{Var}(V_1) = \frac{1}{M^2} \left[\, \sum_{r=N-M+1}^{N} \mathrm{Var}(X_{I(r)}) + 2 \sum_{N-M+1 \le r < s \le N} \mathrm{Cov}(X_{I(r)}, X_{I(s)}) \right], \tag{24}$$

where the constituent variances and covariances are derived in (21) and (23), respectively.

Var(D)
Finally, we derive the variance of $D$. In addition to the variances of $V_1$ and $V_2$ derived in (24), we require the covariance between these two terms. This, in turn, requires the covariance between $Y_{(r)}$ and $Z_{(s)}$, and that between $X_{I(r)}$ and $X_{J(s)}$.
The covariance between $Y_{(r)}$ and $Z_{(s)}$ can be derived using results in [26]:

$$\mathrm{Cov}(Y_{(r)}, Z_{(s)}) = \rho_{XY}\,\rho_{XZ}\,\mathrm{Cov}(X_{(r)}, X_{(s)}) \approx \rho_{XY}\,\rho_{XZ}\, \frac{p_r (1 - p_s)}{(N+2)\, f_X\!\left(F_X^{-1}(p_r)\right)\, f_X\!\left(F_X^{-1}(p_s)\right)}, \quad r \le s, \tag{25}$$

where $f_X$ and $F_X^{-1}$ are the probability density function and quantile function for $X_n$, respectively, and $\rho_{XZ} = \sigma_X / \sqrt{\sigma_X^2 + \sigma_2^2}$ is the correlation between $X_n$ and $Z_n$.
Deriving the covariance between $X_{I(r)}$ and $X_{J(s)}$ is perhaps the most challenging problem within this work, as it takes two forms depending on the indices:

$$\mathrm{Cov}(X_{I(r)}, X_{J(s)}) =
\begin{cases}
\mathrm{Var}(X_{I(r)}) = \mathrm{Var}(X_{J(s)}) & \text{if } I(r) = J(s), \\
\rho_{XY}^2\,\rho_{XZ}^2\,\mathrm{Cov}(Y_{(r)}, Z_{(s)}) & \text{otherwise},
\end{cases} \tag{26}$$

where the second case is a standard Bayesian inference result.
The problem arises as the $r$th-ranked $Y_n$ and the $s$th-ranked $Z_n$ can be generated by the same $X_i$ for some $i$. This is not possible if we have only the Ys or only the Zs (see Fig. 4 for an example). In this case, when we consider the covariance of the concomitants $X_{I(r)}$/$X_{J(s)}$, both the existing variance of $X_n$ and the ranking information provided by $Y_{(r)}$ and $Z_{(s)}$ have to be taken into account. If the order statistics are generated by different Xs, we only need to take into account the latter, as the Xs are assumed to be independent and hence uncorrelated. As we are interested in the overall behavior, we only need to derive the two cases on the RHS of Equation (26) and weight them using the probability that $I(r) = J(s)$, without worrying about which case each $(r, s)$ pair falls under. The first case (when $I(r) = J(s)$) can be evaluated using the law of total variance with multiple conditioning random variables:

$$\mathrm{Var}(X_{I(r)}) = \mathrm{Var}(X_{J(s)}) = \mathbb{E}\!\left(\mathrm{Var}(X_{I(r)} \mid Y_{(r)}, Z_{(s)})\right) + \mathbb{E}\!\left(\mathrm{Var}\!\left(\mathbb{E}(X_{I(r)} \mid Y_{(r)}, Z_{(s)}) \mid Y_{(r)}\right)\right) + \mathrm{Var}\!\left(\mathbb{E}(X_{I(r)} \mid Y_{(r)})\right). \tag{27}$$

The second case can be derived by substituting (25) into (26).

For the weighting probability $\mathbb{P}(I(r) = J(s))$, we see its derivation as an interesting and potentially important problem in its own right, yet to the best of our knowledge no proper treatment has been given to it. In this work, we approximate the probability using beta-binomial distributions, with parameters derived from quantities calculated above. To avoid distracting readers from the main question of quantifying the value and risk of E&M capabilities, we relegate the detailed discussion on approximating this quantity to the "Appendix".

Fig. 4: Relationship between different quantities in a three-item generative model. $X_n$, $Y_n$/$Z_n$, and $Y_{(r)}$/$Z_{(s)}$ represent the true value, the unranked noisy estimates, and the ranked noisy estimates of the items, respectively, for $n, r, s \in \{1, 2, 3\}$. (L) Under one estimation noise level, there exists a bijection between $X_n$ and $Y_{(r)}$. (R) With two noise levels, $Y_{(r)}$ and $Z_{(s)}$ may be generated by the same $X_n$ for some $r$ and $s$.
With the three components for the covariance between $X_{I(r)}$ and $X_{J(s)}$ in place, we can finally derive $\mathrm{Cov}(V_1, V_2)$ and $\mathrm{Var}(D)$ by applying the covariance and variance functions to the definitions of $V_1$, $V_2$, and $D$ (see (18) and (19)) to obtain

$$\mathrm{Cov}(V_1, V_2) = \frac{1}{M^2} \sum_{r=N-M+1}^{N} \sum_{s=N-M+1}^{N} \mathrm{Cov}(X_{I(r)}, X_{J(s)}), \tag{28}$$

$$\mathrm{Var}(D) = \mathrm{Var}(V_1) + \mathrm{Var}(V_2) - 2\,\mathrm{Cov}(V_1, V_2), \tag{29}$$

where the first two terms on the RHS of (29) are derived in (24).
We conclude this section by observing that $M$ and $N$ have a large influence on $\mathrm{Var}(D)$. In particular, $\mathrm{Var}(D)$ is generally large when $M$ and $N$ are small, with other parameters fixed. This is crucial as, even in cases where $\mathbb{E}(D)$ is positive, the limited capacity of an organization to introduce new propositions may mean that the Sharpe ratio [11], defined as

$$\frac{\mathbb{E}(D) - r}{\sqrt{\mathrm{Var}(D)}}, \tag{30}$$

where $r$ is a small constant, may not be high enough to justify investment in an E&M capability.
The exact threshold at which an organization should consider acquiring such capabilities depends on multiple factors, including its size (which affects $M$), the size of its backlog ($N$), the nature of its work ($\mu_X$ and $\sigma_X^2$), and how good it already is at estimation ($\sigma_1^2$). We refrain from providing a one-size-fits-all recommendation, but give examples in Sect. 7.
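As an illustration of this Sharpe-ratio argument, the sketch below (all parameter values hypothetical) estimates $(\mathbb{E}(D) - r)/\sqrt{\mathrm{Var}(D)}$ by Monte Carlo for a small and a larger capacity $M$:

```python
import numpy as np

rng = np.random.default_rng(5)

def sharpe_of_gain(n, m, sigma_x, s1, s2, runs=4000, r_const=0.0):
    """Monte-Carlo Sharpe ratio of D = V2 - V1: same true values X_n,
    two independent noise levels s1 (before E&M) and s2 (after)."""
    d = np.empty(runs)
    for i in range(runs):
        x = rng.normal(0.0, sigma_x, n)
        v1 = x[np.argsort(x + rng.normal(0.0, s1, n))[-m:]].mean()
        v2 = x[np.argsort(x + rng.normal(0.0, s2, n))[-m:]].mean()
        d[i] = v2 - v1
    return (d.mean() - r_const) / d.std()

# a small team (M = 3) vs a larger one (M = 30) choosing from N = 100 ideas
small_m = sharpe_of_gain(100, 3, 1.0, 1.0, 0.3)
large_m = sharpe_of_gain(100, 30, 1.0, 1.0, 0.3)
print(small_m, large_m)  # the larger capacity earns a higher Sharpe ratio
```

Even though the expected gain per selected proposition is larger at small $M$, the spread of $D$ shrinks faster than its mean as $M$ grows, improving the risk-adjusted case for investment.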

Experiments
Having performed theoretical calculations for the expectation and variance of the value E&M systems deliver through enhanced prioritization, here we verify those calculations using simulation results. All code used in the experiments, case studies, and extensions is available on GitHub.

We verify the results derived in Sects. 4 and 5 empirically, in particular under the normal assumptions. For each quantity of interest (the mean and variance of $V_1$/$V_2$ and $D$, as well as the covariance between different pairs of order statistics and their concomitants) we run multiple statistical tests. In each statistical test we randomly select and fix the values of the parameters (i.e., $N$, $M$, $\mu_X$, $\mu_\epsilon$, $\sigma_X^2$, $\sigma_1^2$, $\sigma_2^2$, as well as $r$ and $s$, the latter two for the covariances of the order statistics and concomitants only), and compare the theoretical value of the quantity to the centered 95% confidence interval (CI) generated from multiple empirical samples. If the theoretical value derived above is exact, the 95% CI should contain the theoretical value in around 95% of the statistical tests, and the histogram of the percentile rank of the theoretical quantity in relation to the empirical samples should follow a uniform distribution [29]. Each empirical sample is generated using one of the following two methods, depending on the quantity we are evaluating:

(a) Bootstrap resampling: This is used for generating a sample mean/variance for $V_1$/$V_2$ and $D$. We first generate the initial samples for $V_1$, $V_2$, and $D$ by performing 10,000 simulation runs (see below). We then resample the initial samples and calculate the mean/variance of the resample to obtain an empirical sample. We repeat the latter step 2000 times to obtain a representative empirical distribution.
(b) Sampling for order statistics: The bootstrapping approach is unlikely to work for the covariance between the order statistics (e.g., $Y_{(r)}$, $Z_{(s)}$) and their concomitants (e.g., $X_{I(r)}$, $X_{J(s)}$), as the ranking information may not be preserved during resampling. Hence, for these quantities, we opt for a more naïve sampling approach. We generate 200 samples for $Y_{(r)}$, $Z_{(s)}$, $X_{I(r)}$, and $X_{J(s)}$ via the same number of simulation runs, and calculate the covariance between these quantities to obtain an empirical sample. The process is repeated 1000 times to yield representative samples.
Finally, in each simulation run, we obtain one sample for each of $Y_{(r)}$/$Z_{(s)}$, $X_{I(r)}$/$X_{J(s)}$, $V_1$/$V_2$, and $D$ w.r.t. the parameters via the following four-step process:
1. Take $N$ samples from $N(\mu_X, \sigma_X^2)$, referred to as $x_n$ hereafter, with $n$ being the index;
2. Take $N$ samples from $N(\mu_\epsilon, \sigma_1^2)$, and sum the $n$th-indexed sample with $x_n$ for all $n$ to obtain the $y_n$;
3. Take another $N$ samples from $N(\mu_\epsilon, \sigma_2^2)$, and sum the $n$th-indexed sample with $x_n$ for all $n$ to obtain the $z_n$; and
4. Rank the $y_n$ and $z_n$ to obtain the order statistics $y_{(r)}$/$z_{(s)}$ and their concomitants $x_{I(r)}$/$x_{J(s)}$, from which $V_1$, $V_2$, and $D$ follow per their definitions in (18) and (19).

The results are shown in Table 1 and Fig. 5. We observed that the 95% CIs of the quantities $\mathbb{E}(V)$, $\mathbb{E}(D)$, $\mathrm{Var}(V)$, $\mathrm{Cov}(Y_{(r)}, Z_{(s)})$, and $\mathrm{Cov}(X_{I(r)}, X_{J(s)})$ contain the derived theoretical value roughly 90%, 94%, 74%, 94%, and 96% of the time, respectively. While these numbers are expected for the expectations and covariances, considering they are approximations, they are on the low side for the variances. Upon further investigation, we realized that the majority of the out-of-CI cases have a theoretical variance below the CI, suggesting a slight underestimate in our variance derivation. We believe that this is due to the omission of higher-order terms when using the formulas in [27], leading to a small bias. The bias is more apparent when $N$ and $M$ are small. Otherwise, we are satisfied with the soundness of the derived quantities.
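The four-step run and the bootstrap check in (a) can be sketched as follows (illustrative parameter values; the theoretical benchmark uses Eq. (15) with Blom's $\alpha = 0.4$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
N, M, sx, s1, s2 = 200, 20, 1.0, 0.8, 0.2  # one arbitrary parameter setting

# Theoretical E[D] under normal assumptions (Eq. (15))
ranks = np.arange(N - M + 1, N + 1)
S = stats.norm.ppf((ranks - 0.4) / (N + 0.2)).mean()
theory = sx**2 * (1 / np.sqrt(sx**2 + s2**2) - 1 / np.sqrt(sx**2 + s1**2)) * S

# Steps 1-4, repeated to build the initial samples of D
d = np.empty(5000)
for i in range(d.size):
    x = rng.normal(0.0, sx, N)      # step 1: true values
    y = x + rng.normal(0.0, s1, N)  # step 2: estimates under the higher noise
    z = x + rng.normal(0.0, s2, N)  # step 3: estimates under the lower noise
    d[i] = x[np.argsort(z)[-M:]].mean() - x[np.argsort(y)[-M:]].mean()  # step 4

# Bootstrap the mean of D and form a centered 95% CI
boots = rng.choice(d, (1000, d.size), replace=True).mean(axis=1)
lo, hi = np.quantile(boots, [0.025, 0.975])
print(theory, (lo, hi))  # the theoretical value should land in or near the CI
```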

Case Study
"What do e-commerce/marketing companies gain by acquiring experimentation & measurement capabilities?" It is difficult to verify any model that seeks to ascertain the value of E&M capabilities with real data. This is not only because of the inability to observe the true value of a proposition/product/service without any measurement error, but also because of the lack of published measurements from organizations. The closest proxies are meta-analyses, including those compiled by Browne and Johnson [10] and Johnson et al. [12], which contain statistics on the measured uplift (in relative %) over a large number of e-commerce and marketing experiments for many organizations.
The information presented by the two groups of researchers, which we describe in more detail below, is sufficient for us to ask the following question: If all the experiments presented by Browne and Johnson/Johnson et al. had been conducted for the same organization, how much value would the E&M capabilities have added due to improved prioritization? We present results under normal assumptions in this section, and will revisit the question when we discuss the model under t-distributed assumptions in Sect. 8.1.

e-Commerce Companies
In [10], Browne and Johnson reported running 6700 A/B tests in e-commerce companies, with an overall effect in relative conversion rate (CVR) uplift centered at around zero, and the 5th and 95th percentiles at around ±1.2–1.3%. Dividing this range by $z_{0.95} \approx 1.645$, the 95th percentile of a standard normal, we estimate that the reported distribution has a standard deviation of around 0.75%. Based on this information, we take $\mu_X = 0$ and $\sigma_X^2 = (0.6\%)^2$, taking into account that the reported distribution incorporates some estimation noise, and hence the spread of the true values should be slightly lower.
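The back-calculation of the standard deviation works as follows (numbers as reported above):

```python
from scipy import stats

# 5th/95th percentiles reported at roughly +/-1.2-1.3% around a zero mean;
# for a normal distribution these sit z_0.95 standard deviations out
z95 = stats.norm.ppf(0.95)  # ~1.645
sigma = 0.0125 / z95        # using the midpoint 1.25% of the reported range
print(sigma)                # ~0.0076, i.e. a standard deviation of ~0.75%
```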
Given that an A/B test on CVR uplift run by the largest organizations (e.g., one with five million visitors and a 5% CVR) carries an estimation noise of around $(0.28\%)^2$, we explore scenarios where we reduce the noise level from $\sigma_1^2 \in \{(1\%)^2, (0.8\%)^2, (0.6\%)^2\}$ to lower values of $\sigma_2^2$. We set $\mu_\epsilon = 0$ as we do not assume any systematic bias during estimation in this case.
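The quoted noise level can be reproduced with the usual two-proportion standard error, if we read "five million visitors" as per arm (an assumption on our part):

```python
import numpy as np

n_per_arm, cvr = 5_000_000, 0.05  # assumed: 5M visitors in each arm, 5% baseline CVR

# Standard error of the difference in conversion rates between two equal arms
se_abs = np.sqrt(2 * cvr * (1 - cvr) / n_per_arm)

# Expressed as a relative uplift (divide by the baseline CVR)
se_rel = se_abs / cvr
print(se_rel)  # ~0.0028, consistent with an estimation noise of (0.28%)^2
```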
The results are reported in Fig. 6, which shows the relationship between different $M$s and the value gained under different magnitudes of estimation-noise reduction. One can observe that the expected gain in value actually decreases with $M$. This is expected: as an organization increases its capacity, it runs out of the most valuable work and has to settle for less valuable work with many acceptable replacements of similar value, limiting the value E&M capabilities bring.
We can also see an inverse relation between the size of $M$ and the uncertainty of the value gained. As a result, while the expected value gain decreases with increasing $M$, the uncertainty drops more quickly, such that at some $M$ we will see a statistically significant increase in value gained and/or an acceptable Sharpe ratio that justifies investment in E&M capabilities. The specific value that tips the balance is heavily dependent on individual circumstances.

Marketing Companies
In the second case study, we repeat the process applied to e-commerce in Sect. 7.1 for the marketing experiments described in [12]. In that work, Johnson et al. reported running 184 marketing experiments that measure relative CVR uplift, with a mean relative uplift of 19.9% and a standard error of 10.8%. This suggests the use of $\mu_X = 19.9\%$ and $\sigma_X^2 = (10\%)^2$, the latter slightly reduced to account for the estimation noise being included in the reported standard error.
Johnson et al. also noted that the average sample size in these experiments is over five million, which keeps the estimation noise low. However, the design of marketing experiments often comes with extra sources of noise compared to standard A/B tests [8,30]; hence, we keep the estimation noise levels in our scenarios the same as above.

Figure 7 shows the results. We can see that, in the presence of a larger variability in the true uplift of the advertising campaigns ($\sigma_X^2$) and a lower capacity ($M$), the level of estimation-noise reduction that gave a statistically significant value gain in the e-commerce example is no longer sufficient. One needs a larger noise reduction, or an increased capacity, to effectively control the risk of investing in E&M capabilities. Otherwise, one may be better off focusing resources on improving a limited number of existing propositions.

Empirical Extensions
We also provide two extensions, evaluated empirically, that open the door for future work in this area.

Valuation Under Independent t-distributed Assumptions
So far we have largely assumed that both the true value of the propositions and the estimation noise are normally distributed. While the normal model possesses convenient mathematical properties, it is insufficient to explain the heavy tail in the distribution of uplifts shown in [10] or [12].
In this section, we model the true value of the propositions, as well as the estimation noise, as generalized Student's t-distributions (see (6)). It is difficult to derive the exact theoretical quantities under such model assumptions because Student's t-distributions do not have conjugate priors (see e.g., [31]). We instead simulate the empirical distribution of the value gained under different parameter combinations to understand whether this model is a better alternative to that under normal assumptions. The sampling procedure is similar to that described in Sect. 6, with steps modified such that the samples are generated from standard t-distributions, then scaled and located as specified by (6). We compare the value gain estimates obtained under t-distributed assumptions and normal assumptions as follows. For each comparison, we randomly sample values for N, M, μ_X, σ²_X, σ²_1, and σ²_2, and perform 1000 simulation runs of the four-step sampling procedure in Sect. 6 to obtain samples of D using both the t_3 and normal distributions. We then compare the expected values, as well as the 5th and 95th percentiles, of the value gained under the two distributions.

Fig. 6 The value gained by having some experimentation and measurement (E&M) capabilities (x-axes, in percent) under different capacity M (y-axes, in log scale) in the case study on 6700 e-commerce experiments reported by Browne and Johnson [10] (see Sect. 7.1). In each plot, the dot represents the mean, and the error bar represents the 5th-95th percentile of the empirical value distribution. The subcaption denotes the estimation noise before and after acquisition of E&M capabilities (i.e., σ²_1, σ²_2). We fix μ_X = 0, σ²_X = (0.6%)², and N = 6700.
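The modification to the sampling procedure, drawing from a standard t-distribution and then scaling and locating to match a given mean and variance, can be sketched as follows (an illustrative reconstruction, not the paper's code):

```python
import numpy as np

NU = 3  # degrees of freedom used in the comparisons

def scaled_t(rng, mean, var, size, nu=NU):
    """Location-scale t samples with the specified mean and variance.
    For nu > 2, Var(t_nu) = nu / (nu - 2), so we rescale accordingly."""
    scale = np.sqrt(var * (nu - 2) / nu)
    return mean + scale * rng.standard_t(nu, size)

def value_gained_t(N, M, mu_X, var_X, var_1, var_2, runs=1000, seed=0):
    """Four-step procedure of Sect. 6 with t_3 draws in place of normal ones."""
    rng = np.random.default_rng(seed)
    gains = np.empty(runs)
    for r in range(runs):
        X = scaled_t(rng, mu_X, var_X, N)   # true values, heavy-tailed
        e1 = scaled_t(rng, 0.0, var_1, N)   # original estimation noise
        e2 = scaled_t(rng, 0.0, var_2, N)   # reduced estimation noise
        gains[r] = (X[np.argsort(X + e2)[-M:]].mean()
                    - X[np.argsort(X + e1)[-M:]].mean())
    return gains
```

Matching the first two moments to the normal model is what makes the comparison in the next paragraph meaningful: any difference in the value gained is then attributable to the heavier tails alone.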
We observed from 840 comparisons that overall the value gained under the t-distributed assumptions has a higher mean (7% higher on average) and variance (a 7% higher 95th percentile on average) than that under normal assumptions. This arises despite the mean and variance of the true value and estimation noise under the t-distributed assumptions being set to match those under the normal assumptions. This suggests that the model under t-distributed assumptions is able to capture the "higher risk, higher reward" concept. Individual comparisons paint a more nuanced picture, and this is perhaps best illustrated by revisiting the case study in Sect. 7 under t-distributed assumptions. We select a number of scenarios featured in the previous section, and overlay the value gained by having E&M capabilities under t-distributed assumptions on that under normal assumptions in Fig. 8. One can see that while t-distributed assumptions generally yield a higher value gained, this is not always the case: in the e-commerce case, as M increases the value gained decreases more quickly under t-distributed assumptions than under normal assumptions. This shows that model assumptions can play an important role in the valuation of E&M capabilities.

Partial Estimation/Measurement Noise Reduction
There are many situations in which not all propositions are immediately measurable upon the acquisition of E&M capabilities. This may be due to the extra work required to integrate additional capabilities into certain legacy systems, or the ability to run experiments on online but not offline activities. In the case where there is a single backlog, we ask: will an organization still benefit from a partial noise reduction, where some propositions' values are obtained under reduced uncertainty while others remain subject to the original noise level? We address this by establishing the relationship between the expected improvement in mean true value of the selected propositions and the proportion of propositions that benefit from a reduced estimation noise (denoted p ∈ [0, 1]). The sampling procedure is similar to that described in Sect. 6, with Step 3 modified: instead of generating all samples under the lowered estimation noise σ²_2, we generate a proportion p of the samples with noise variance σ²_2 and the remaining 1 − p with the original noise variance σ²_1.
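The modified Step 3 can be sketched as follows (the function name and parameter values are illustrative):

```python
import numpy as np

def value_gained_partial(N, M, p, mu_X, sigma_X, sigma_1, sigma_2,
                         runs=1000, seed=0):
    """Step 3 modified: a fraction p of propositions is measured under the
    reduced noise sigma_2, the remaining 1 - p under the original sigma_1.
    Returns the improvement over measuring everything under sigma_1."""
    rng = np.random.default_rng(seed)
    k = int(round(p * N))  # number of propositions with reduced noise
    gains = np.empty(runs)
    for r in range(runs):
        X = rng.normal(mu_X, sigma_X, N)
        mixed_noise = np.concatenate([rng.normal(0, sigma_2, k),
                                      rng.normal(0, sigma_1, N - k)])
        baseline = X + rng.normal(0, sigma_1, N)  # original noise throughout
        gains[r] = (X[np.argsort(X + mixed_noise)[-M:]].mean()
                    - X[np.argsort(baseline)[-M:]].mean())
    return gains
```

Because the true values are i.i.d., it does not matter which k propositions receive the reduced noise in the simulation.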
We run the procedure above under various scenarios, including a large/small N, a large/small ratio between an organization's capacity and backlog (M/N), and a large/small magnitude of noise reduction upon acquisition of E&M capabilities (σ²_1 − σ²_2). Figure 9 shows the result. We can see that under most scenarios, the expected value gained increases at least linearly with p, while in a few scenarios the expected improvement in mean true value of the selected propositions curves upward for increasing p. This shows that while there are incentives for organizations to acquire E&M capabilities that cover the majority of their work, in many scenarios a partial acquisition yields proportional benefits. Potential experimenters need not see the acquisition as a zero-one decision, or worry about a steep initial investment required to unlock returns.

Conclusion
We have addressed the problem of valuing E&M capabilities. Such capabilities deliver three forms of value to organizations: (1) improved recognition of the value of propositions, (2) enhanced capability to prioritize, and (3) the ability to optimize individual propositions. Of these, the most challenging to address is improved prioritization. We have established a methodology to value better prioritization through reduced estimation error using the framework of ranking under uncertainty. The key insight is that E&M capabilities reduce the estimation error in the value of individual propositions, allowing prioritization to follow more closely the optimal order of projects that would hold were the true values of propositions observable. We have provided simple formulas that give the value of E&M capabilities and the Sharpe ratio governing investment decisions, and have provided guidelines for conditions under which such investments are not appropriate.

Fig. 9 The near-linear relationship between p (the proportion of propositions whose value is obtained under a lower estimation noise, x-axes) and the improvement in mean true value of the selected propositions (y-axes) under the normal assumptions. In each plot, the dot represents the sample mean, and the error bar represents the 5th-95th percentile of the sample value gained. All figures assume σ²_X = 1, while the left four figures assume σ²_1 = 0.5² and σ²_2 = 0.4² (corresponding to a small reduction in estimation noise), and the right four figures assume σ²_1 = 0.8² and σ²_2 = 0.2² (corresponding to a large reduction in estimation noise).

where f_{Z_n} and f′_{Z_n} are the probability density function of Z_n and its derivative, respectively. We observe that the first-order approximation (a special case of the delta method) is insufficiently accurate when compared against simulation results. This is likely due to F_{Z_n} being non-linear. We thus recommend using a second- or higher-order approximation.
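As a generic illustration of this recommendation (the exact form of F_{Z_n} is not reproduced here, so g below stands in for a smooth function of a random variable W with mean μ_W and variance σ²_W), the first-order delta method keeps only g(μ_W), whereas the second-order expansion adds a curvature correction:

```latex
\mathbb{E}\left[g(W)\right] \approx g(\mu_W) + \tfrac{1}{2}\, g''(\mu_W)\, \sigma_W^2
```

When g is markedly non-linear over the bulk of W's distribution, the second term is non-negligible, which is why the first-order approximation falls short against simulation.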
Finally, we obtain the beta(-binomial) distribution parameters via the method of moments.
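A sketch of the method-of-moments step, assuming the standard beta parameterization with parameters alpha and beta (the original parameter symbols do not survive in this text, so these conventional names are our assumption):

```python
def beta_method_of_moments(mean, var):
    """Method-of-moments estimates for beta distribution parameters
    (standard alpha/beta parameterization, assumed here).
    Requires 0 < mean < 1 and 0 < var < mean * (1 - mean)."""
    common = mean * (1 - mean) / var - 1
    alpha = mean * common
    beta = (1 - mean) * common
    return alpha, beta
```

For example, a sample mean of 0.5 and variance of 0.05 yields alpha = beta = 2, recovering the specified moments exactly.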

Estimation Under Normal Assumptions
To complement the main text, we also discuss how the quantities derived above behave under normal assumptions. Firstly, (33) and (34)

evaluate to

We then recall from (17) that Z_n