1 Introduction

In this paper, we develop, validate, and illustrate a practicable approach to deal with non-random samples whose underlying cause is data requirements, a pervasive issue in accounting research. Examples include requirements for analyst following, database inclusion (e.g., Execucomp includes S&P 1500 firms), and stock price above a threshold, such as $5. We examine how non-randomnessFootnote 1 of the dependent variable in data-restricted samples affects results of empirical association tests, propose and validate a nonparametric resampling technique (“distribution-matching”) to adjust for the effects of non-randomness, and thereby increase the generalizability of results and apply the technique to tests of associations between realized returns and implied cost of equity (CofE) metrics. Our goal is to assist accounting researchers in constructing more powerful and less biased test samples, thereby increasing the generalizability (decreasing the sample-dependency) of results.

As discussed later, distribution-matching differs from selection models and multiple imputations, which are sometimes used in the analysis of data-restricted samples. A Heckman-type selection model approach assumes the selection model can be reliably estimated on a random sample of the population, but practical research settings typically involve a trade-off between selection model fit and data requirements for the selection model variables. A selection model approach may therefore simply transfer the data-restrictions issue from the test model to the selection model. In our setting, we find that introducing a selection model does not resolve the issue of non-randomness that we address using distribution matching; in contrast, applying multiple imputation yields results that converge to the distribution-matching results.

The starting point of the distribution-matching technique is a common requirement in archival accounting research that the sample contains only observations with complete data for all variables of interest (“complete cases”). We first examine a pairwise association in a stylized simulation setting consisting of a reference sample with full information on one variable (y) and restricted data on the second variable (x). Complete-case analyses effectively impose any data restrictions of x on y and do not use information about y in incomplete observation pairs. If missingness of x is even weakly correlated with values of y, complete-case samples are non-random samples of y and association-test results would not generalize to the reference sample. Distribution-matching uses information about the marginal distribution of y (the reference distribution) and resamples complete-pairs of observations (x and y) from the data-restricted sample to match the reference distribution as closely as possible. The goal is to construct samples that appear as if they were drawn randomly from the reference distribution, despite the data restrictions, and using only the observations of the data-restricted sample. Using simulated data with known induced levels of statistical associations and three types of non-randomness, we show that association tests yield biased results in the non-random samples; applying distribution-matching substantially reduces or even eliminates the bias. We view the results of these simulation analyses as providing evidence that distribution-matching can address the issue of non-random sampling in a generic research setting.

To illustrate distribution-matching in a specific research setting, we apply the technique to an archival setting characterized by stringent data requirements and inconsistent/counterintuitive results (e.g., Easton and Monahan 2005), namely, tests of the association between realized returns and CofE metrics. By definition, all firms have a cost of equity, but a researcher cannot observe the cost of equity for some firms, typically because of data requirements.Footnote 2 The theory linking realized returns and CofE metrics applies to the reference sample defined in note 1 (CRSP firms with at least 12 months of realized returns during February 1976–July 2009). This reference sample contains complete data on one variable (y, returns) and not on the other variable (x, CofE).Footnote 3 We show (1) the unadjusted sample with data on CofE metrics is a non-random sample of the reference sample (the CofE sample returns have substantially lower standard deviation, skewness and kurtosis)Footnote 4; and (2) tests of the CofE-returns association based on the unadjusted CofE sample—that is, a sample that would typically be used in accounting research—produce weak or negative associations between realized returns and implied cost of equity metrics, consistent with previous empirical research (e.g., Easton and Monahan 2005) and inconsistent with theory.

In contrast, tests using a distribution-matched CofE sample produce statistically reliable evidence of the theoretically-predicted positive association between CofE metrics and realized returns. Distribution-matching does not involve creating new data; the technique resamples CofE-sample observations (CofE and returns pairs) so that the returns distribution in the distribution-matched CofE sample mimics the returns distribution in the reference sample. We apply two approaches. The first uses the non-parametric Kolmogorov-Smirnov (KS) test of general sample differences. The KS statistic rejects, at the 0.10 (0.05) [0.01] level, the hypothesis of distribution equality in the unadjusted CofE sample, compared to the reference sample in 401 (401) [393] of the 402 sample months. We distribution-match by constructing monthly subsamples using only CofE-sample observations (CofE and returns pairs) so that the deviation between the returns distribution of the resulting distribution-matched CofE sample and the reference distribution of returns, as captured by the KS statistic, is minimized. This resampling procedure is effective and requires few assumptions but imposes a substantial computational burden. To address this practical concern, the second approach sorts the returns distributions of both the reference sample and the CofE sample into researcher-defined strata (“bins”) of the continuous variable and applies a form of stratified resampling that aims to match the standard deviation of the reference returns distribution. We find that distribution-matching can result in smaller samples than the original non-random samples; despite a possible loss of power, correlations between realized returns and CofE metrics are with one exception reliably positive in distribution-matched samples obtained using both approaches.Footnote 5

The methodological inference from our results is that selection criteria yielding samples with outcome distributions differing from the reference distribution can materially affect the results of association tests, including producing results that do not generalize to the reference sample to which the tested hypothesis applies. We extend this inference in several ways. To illustrate the inference in other settings with restrictive data requirements, we show that imposing several plausible selection criteria on the reference sample (S&P 500 membership, NYSE listing, availability of a dispersion measure of analyst earnings forecasts, and stock price at least $5) can change the distribution of realized returns and lead to biased estimates in association tests of realized returns with risk factor premia. We reach the same inference when we directly induce changes in the distribution of realized returns and show the sensitivity of risk factor premia to these changes. To illustrate that applying distribution-matching does not produce false results, we apply the technique to Richardson et al.’s (2005) analysis of the association between returns and accruals, a setting in which previous research shows results consistent with theory, and do not overturn their inferences. To separate the effects of reduced sample size from the effects of non-randomness, we compare the association between realized returns and asset-pricing factor betas for a random subsample from the full returns sample (to capture the effects of reduced sample size) and the actual CofE sample (to capture the combined effects of reduced sample size and non-randomness). We reason that imposing an unnecessary data restriction on samples used in an association test with risk metrics (factor betas) that can be performed on the entire reference sample allows us to separate the effects of non-randomness from the effects of a reduced sample size. The coefficients on factor betasFootnote 6 for the random sample of equal size as the CofE sample are similar in magnitude and statistical significance to those for the reference sample, while for the actual CofE sample there is no reliable association between realized returns and any factor beta. These results suggest that, (1) in our setting, efficiency losses due to reduced sample sizes alone have little effect on qualitative inferences and (2) the CofE sample should not be assumed to be a random subsample of the reference sample. Finally, we show that inferences from analysis of our main CofE sample, restricted by the data requirements of all five CofE metrics we consider, are qualitatively similar when we re-do our main association tests on a purely IBES-based CofE sample.

We believe our findings support a conclusion that results obtained using unadjusted non-random samples may not support generalizations to a researcher-selected reference sample. In fact, our analysis of the CofE sample highlights that maximizing the size of the non-random sample, after imposing data requirements, may conflict with the goal of obtaining a random sample, which is fundamental for the generalizability of results. We also believe our analyses provide a practical solution to this issue, in the form of distribution-matching.

2 Motivation for and validation of distribution-matching

2.1 Motivation and intuition

In accounting research settings, data constraints often mean that analyses can be performed only on a subsample of observations, even though the results are intended to generalize to a population or reference sample to which the tested hypothesis logically applies.Footnote 7 The setting we consider is association tests (regression coefficients or correlation coefficients) between two variables, of which one (y) is available for all firms in the researcher-defined reference sample, while the second variable (x) is often missing.Footnote 8 A common treatment in the accounting literature is restricting the test sample to observations with complete information, yielding a data-restricted sample. The data constraints on x are imposed on y, causing information about the unrestricted distribution of y to be lost. We hypothesize (1) this deletion leads to non-random test samples and (2) association tests using these samples yield results that may not be generalizable to the reference sample. We address this problem by incorporating information about the reference distribution (of the complete variable y) into the association test. We resample observations from the data-restricted non-random sample to create an adjusted sample that mimics the reference distribution of the complete variable and appears randomly drawn from the reference sample with respect to y, despite data constraints on x.

The intuition for this approach is as follows. Consider the estimate of a Pearson correlation coefficient \( \hat{\rho} \) between two continuous random variables x and yFootnote 9:

$$ \hat{\rho}=\int \int {x}_i\;{y}_i\;f\left({x}_i,{y}_i|{s}_i=1\right)\; dxdy=\int \int {x}_i\;{y}_i\;f\left({x}_i|{y}_i,{s}_i=1\right)\;f\left({y}_i|{s}_i=1\right)\; dxdy, $$
(1)

where xi and yi are standardized (demeaned and divided by their respective standard deviations) realizations of x and y, f(.) denotes the density function, and si is an observation-level indicator for membership in the data-restricted test sample. For simplicity, subscripts for time t are suppressed.

The true correlation in the reference sample, assuming availability of complete data, is given by:

$$ {\rho}^{\ast }=\int \int {x}_i\kern0.28em {y}_i\kern0.28em f\left({x}_i,{y}_i\right)\kern0.28em dxdy=\int \int {x}_i\kern0.28em {y}_i\kern0.28em f\left({x}_i|{y}_i\right)\kern0.28em f\left({y}_i\right)\kern0.28em dxdy. $$
(2)

\( \hat{\rho} \) is a consistent estimator of the true ρ only if the joint distribution in the restricted sample equals the joint distribution in the reference sample, f(xi, yi| si = 1) = f(xi, yi) or, equivalently, f(xi| yi, si = 1) f(yi| si = 1) = f(xi| yi) f(yi). This condition implies that the unobserved data are missing completely at random (MCAR); only then would a restricted sample (i.e., a sample after deletion of observations because of missing data) be a random subsample of the reference sample.

In our stylized setting, as well as in some other accounting research settings, it is possible to assess the difference in the marginal distributions of the fully observed variable yi between the restricted sample, f(yi| si = 1), and the reference sample, f(yi) and reject the assumption of MCAR. Differences between these marginal distributions mean the restricted sample is non-random and consistency of \( \hat{\rho} \) is less likely.

If the MCAR assumption is rejected, it must be replaced with a weaker assumption: either the data are missing at random, conditional on observed variables (MAR), including y (realized return in our application), or the data are not missing at random (NMAR), which implies that missingness also depends on unobserved data. While it is possible to reject the MCAR condition, the unavailability of missing data precludes testing whether data are MAR or NMAR. Our main analyses extend the common approach of constructing test samples by list-wise deletion under the MCAR assumption. We begin by showing differences in the marginal distributions of realized returns between a full-returns sample and a CofE subsample and that results of association tests (with factor betas) also differ qualitatively between these two samples. We therefore focus on methods under the MAR assumption, specifically, distribution-matching and multiple imputations. In Section 5, we assess the impact of a possible NMAR assumption using a Heckman-type selection model.

Referring to Eqs. (1) and (2), the distribution-matching approach requires the distribution of x, conditional on yi, is unchanged in the data-restricted sample, compared to the reference sample:

$$ f\left({x}_i|{y}_i,{s}_i=1\right)=f\left({x}_i|{y}_i\right). $$
(3)

Assuming complete data on y, the MAR assumption implies Eq. (3) holds.Footnote 10 Then \( \hat{\rho} \) will converge to ρ as f(yi, si = 1) approaches f(yi) via distribution-matching. In the context of our archival analysis, condition (3) implies the CofE metrics are not systematically biased in the restricted sample, conditional on the value of the future realized return. It seems unlikely that the probability of having the analysts’ forecasts required to construct an implied CofE metric depends on the value of realized returns, which can be assessed only ex post.

Regardless of concerns specific to our application, condition (3) contrasts with and is arguably weaker than the assumption that data are missing at random, essentially equating f(xi, yi| si = 1) with f(xi, yi), particularly when differences in the marginal distributions of returns between the reference sample and the CofE-restricted sample, f(yi) versus f(yi| si = 1), are knowable from the data. We use the marginal distribution of the dependent variable of the reference sample as opposed to simply deleting observations with incomplete data. That is, under condition (3), our approach focuses on the marginal distribution of y in the data-restricted and possibly non-random sample. We resample systematically only from observations in this sample with complete data on both variables, so that, in the limit, the marginal distribution of yi matches the marginal reference distribution of yi:

$$ f\left({y}_i|{s}_i=1\right)\to f\left({y}_i\right). $$
(4)

While the convergence in (4) is achievable in the limit, the effectiveness of distribution-matching in a given research setting is a function of, among other things, the number of restricted-sample observations and the size of the common support of the restricted-sample and reference distributions. A smaller restricted sample means fewer observations to resample from and a smaller common support of the distributions means f(yi) is more severely truncated in the restricted sample. In addition, the resampling approach may be unnecessary if only a few observations are missing, making the restricted sample (nearly) equal to the reference sample.

To measure similarity in the cumulative distributions between the reference sample and the non-random data-restricted sample, we use the non-parametric Kolmogorov-Smirnov (KS) statistic, which computes the percentage maximum absolute distance between two cumulative empirical distributions.

$$ KS=\underset{i}{\max}\left|{F}^{NRS}\left({y}_i\right)-{F}^{POP}\left({y}_i\right)\right|\kern0.24em \;\;\;\mathrm{where}\ i=1,2,\dots, n, $$
(5)

where FNRS(yi), FPOP(yi) are the cumulative distributions of y in the non-random sample and population, respectively. We use the KS statistic and its associated asymptotic p value for a test of distribution equality between a subsample and the reference sample and to assess the degree of convergence in (4) within the KS-based distribution-matching approach. Information on the setup of the simulation is available from the corresponding author.

2.2 Relation of distribution-matching to traditional matching approaches and other missing-data approaches

Traditional matching approaches The distribution-matching approach differs substantially in goal and implementation from traditional matching techniques that might be used to address issues of endogeneity and sample selection on observable determinants. Traditional matching relies on an observation-by-observation comparison (for example, matched pairs), while distribution-matching aims to reweight the observations in a sample distribution so that the resulting distribution approximates an (empirical or theoretical) reference distribution. Distribution-matching focuses on the outcome variable, while traditional matching focuses on (possibly multiple) independent variables, for example, through sorting or propensity scoring observations.

Approaches applicable in a MAR setting

Missing-data approaches under the MAR assumption include multiple imputation (MI) and full-information maximum likelihood (FIML) estimation.Footnote 11 FIML incorporates information about the marginal distribution f(yi) by including the observations in the likelihood calculation, even if data on some variables are missing. MI uses a stochastic regression framework to impute possible values for the missing data multiple times, after which the completed (“imputed”) datasets can be independently analyzed and the results aggregated. Complete variables, that is, the marginal distribution of returns in our setting, are preserved from the reference sample and considered in the analysis. MI uses the entire reference sample, so it is more efficient than distribution-matching in cross-sectional analyses; it can be applied when data are missing for more than one variable; and it can incorporate the use of auxiliary variables that are either informative about missingness or correlated with the missing data.Footnote 12 However, distribution-matching is non-parametric while both MI and FIML rely on multivariate normality. Descriptive statistics in Table 2 suggest the normality assumption is unlikely to hold in our example setting with realized returns. Based on theoretical arguments in Schafer (1997), supported by simulation evidence of Demirtas et al. (2008), that MI appears to be less susceptible to deviations from multivariate normality than maximum likelihood, we repeat our main tests using subsamples-based forms of MI and find results similar to those obtained using distribution-matching.

Other missing-data approaches

We clarify the intuition of distribution-matching by contrasting it with three other approaches: estimations that take account of truncation; incomplete post-stratification and selection models. First, assuming the true distribution of the complete outcomes is normal, Tobin (1956) derives closed-form solutions when the outcome variable is truncated at a known upper or lower bound (see also Wooldridge 2010). Rather than focus on truncation, we emphasize that non-randomness likely manifests in a restricted sample with a different shape than the reference distribution, even if the common support is large or complete.Footnote 13 Also, rather than making assumptions about the reference distribution of the outcome variable, we estimate its shape from the reference sample with complete data.Footnote 14 Second, the survey literature uses incomplete post-stratification, which involves reweighting observations according to their marginal weights in a reference distribution or population. The weights are typically constructed based on discrete and exogenous variables, such as gender, not an outcome variable. The similarity to distribution-matching arises because survey respondents need not be representative of the population, necessitating re-weighting responses if the goal is to generalize results to the population.Footnote 15 To that end, both post-stratification and distribution-matching import information about the marginal reference distribution. In fact, for the common support region of the sample and reference distributions, distribution-matching is essentially a form of post-stratification that treats the variable as continuous (each yi is its own stratum) and does not require ex-ante grouping of observations into strata. In addition, the sample distribution may not only be non-random within the common support region but also truncated when compared to the reference distribution. Intuitively, the effect of truncation is mitigated by oversampling from the tails of the sample distribution. In short, distribution-matching tries to combine the notions of stratified sampling and overcoming biases from truncation.

Third, distribution-matching assumes data are missing at random, conditional on observables, while Heckman-type selection models assume data are not missing at random, necessitating a first-stage probit selection model to capture the mechanism that selects observations into the restricted sample. Under certain conditions,Footnote 16 the bias in the test model can be alleviated by incorporating the inverse Mills ratio from the selection model. In contrast, distribution-matching disregards the selection-mechanism and uses information about its consequences by assessing and minimizing the difference of the sample distribution to a reference distribution. In Section 5.1, we apply the selection model approach and find that results using the restricted non-random samples, both on CofE and on factor betas, are little affected by including the inverse Mills ratio.

2.3 Validity tests on simulated data

We use simulated data to validate our distribution-matching approach by showing that correlation estimates from distribution-matched samples converge to their true values, even though these samples consist only of non-randomly drawn observations from the reference sample.Footnote 17 We generate populations of data for two variables (y and x) with known correlations and draw from these simulated populations both randomly and in three non-random ways, with selection probabilities based on the marginal distribution of y. For each of the three types of non-random samples drawn from the simulated populations, we resample with replacement to create distribution-matched samples. Using only observations from the respective non-random sample, distribution-matching is designed to mimic the marginal distribution of variable y in the population as closely as possible.

The outcome variable of interest is the estimated correlation between yi and xi in the non-random samples before and after distribution-matching. We examine population correlations specified at 0.5, −0.5, and 0 (to rule out the possibility that distribution-matching induces a correlation where none exists). We examine both negative and positive true correlations to provide evidence that the effectiveness of distribution-matching does not depend on the either the sign of the true association or the sign of the bias in the association estimate. We draw three kinds of non-random samples. In Non-random sample I, the selection probability of a given observation is decreasing in the absolute distance from the mean, leading to fewer observations that include extreme y values. Non-random sample II samples observations based on the uniform distribution over the entire interval of y observations, leading to higher selection probabilities for observations in the tails. These two symmetric non-random samples are expected to yield either a negative bias (Non-random sample I) or a positive bias (Non-random sample II) in the estimates of associations. Non-random sample III is a form of non-symmetric selection probability, in that selection probability of an observation is increasing in y.

Figure 1 presents visual evidence of the effectiveness of distribution-matching for the three types of non-random samples, drawn from normally distributed population data. The figure depicts example distributions of y for a single randomly-chosen simulation run, for the unadjusted non-random samples (on the left), and the distribution-matched samples (on the right) of size m = 1000. The benchmark distribution, which appears in both right and left graphs, is from the population in that particular run (n = 5000). For all three non-random draws, the left-side graphs illustrate that the distributions deviate from the population benchmark. After distribution-matching, the right-side graphs show the sample distributions closely follow the population distribution and are indistinguishable for large regions of y.

Fig. 1
figure 1

Simulated (univariate) cumulative distributions before and after distribution-matching. Panel A: Non-Random Sample I: Before (left) and after (right) distribution-matching. Panel B: Non-Random Sample II: Before (left) and after (right) distribution-matching. Panel C: Non-Random Sample III: Before (left) and after (right) distribution-matching. Figure 1 shows the empirical cumulative distribution of yi for three types of non-random samples and for the corresponding distribution-matched samples, for one run of the results reported in Table 1, Panel A. The dashed line marks the population distribution (benchmark) and is constant in all six graphs. Sample distributions are depicted with a continuous line, whereby the left (right) graphs are before (after) distribution-matching

Table 1 reports numerical results of the simulations analysis. We focus on the results in Panel A (normally distributed variables). Results in Panel B (non-normally distributed variables) are qualitatively similar, suggesting the effectiveness of the non-parametric distribution-matching approach does not depend on the shape of the marginal distributions, in particular, a normality assumption. We first verify that empirical correlation estimates in the population and random samples (CORR) are close to the specified (true) correlations (CORR*), and there are no meaningful differences between the population and the random sample. For all three levels of true correlation, the KS statistic for random samples is about 2.4%, and p values for the difference between the random sample and the population are about 0.69. Estimated correlations differ from true correlations by 0.005 or less, confirming that a reduction in sample size, even to 20% of the population (m = 1000), is unimportant for the association-test point estimates as long as the sample selection is random.

Table 1 Simulation results from forced non-random samples and corresponding distribution-matched samples

In contrast, and by construction, non-random samples have a distribution of y that differs sharply from the population distribution. For Non-random sample I (Non-random sample II) [Non-random sample III], KS statistics are about 13.9% (25.5%) [26.7%]. To assess the significance of the bias in correlation estimates for these samples, we report the percentile of the mean non-random sample correlation in the distribution spanned by the 1000 correlations as ‘Percentile (Random Sample).’ A small (large) percentile corresponds to a low (high) estimate. When the true correlations are 0.5 or −0.5, the estimated correlations are biased towards zero in magnitude in Non-random samples I and III and are upward biased in Non-random sample II. The bias is highly significant, with percentiles of either 0.0 (i.e., below the distribution of 1000 correlations from random samples) or 99.9 or higher (i.e., only one or fewer of the 1000 correlations is higher). Distribution-matching reduces or eliminates the effects of non-random sampling: Across all three non-random samples, the corresponding distribution-matched samples exhibit correlations much closer to the true correlations. Biases are close to zero (never exceeding 0.0169 in magnitude) and are insignificant in all cases, with percentiles ranging from 30.1 to 77.5.

Based on these simulation results, we conclude that distribution-matching is effective in reducing the bias in correlation estimates in non-random samples of yi; results for zero correlations show that distribution-matching does not induce an (apparent) correlation where none exists. The result for Non-random sample I is of particular interest. Because the result is based on simulated data and an imposed criterion in the sample construction, we view this finding as suggesting the kind of bias in association tests if data availability requirements bias a sample toward firms with typically-less-extreme returns realizations (described by prior research as relatively “stable” firms; see Footnote 4), as is often the case in accounting research situations. Our next tests investigate whether this simulation result applies to a specific well-studied empirical-archival setting.

3 Application to the association of realized returns and implied cost of equity metrics

We test for a bias in association test results when samples are restricted to firms with available data to calculate analyst-based implied CofE metrics. Specifically, we re-examine the correlation between realized returns and CofE metrics, as they have been used to test for expected-returns associations. Settings include voluntary disclosure (Botosan 1997), AIMR scores (Botosan and Plumlee 2002), earnings attributes (Francis et al. 2004), restatements (Hribar and Jenkins 2004), legal institutions/security regulation (Hail and Leuz 2006), shareholder taxes (Dhaliwal et al. 2007), mandatory IFRS adoption (Li 2010), earnings quality and information asymmetry (Bhattacharya et al. 2012), and financial constraints and taxes (Dai et al. 2013). We believe the construct validity of analyst-based CofE metrics is of interest to many researchers, so that an application of the distribution-matching approach in this setting provides insights in its own right. Section 3.1 summarizes previous research, and Section 3.2 explains why we chose this setting to illustrate distribution-matching.

3.1 Research on the association between realized returns and implied CofE metrics

Realized returns are the dependent variable in a variety of association tests, including two-stage cross-sectional asset pricing tests (associations of realized returns with risk factor betas, Fama and MacBeth 1973) and cost of equity tests (associations of realized returns with implied CofE metrics). The latter are predicated on the view that both realized returns and CofE metrics are potentially noisy or confounded proxies for unobservable expected returns. Intuitively, a firm’s expected return should be commensurate with its riskiness. Realized returns are ex-post outcome measures, affected by the arrival of information during the return measurement period and therefore contain an expected return component and a potentially non-zero unexpected return component caused by news about cash flows and news about the expected return itself (Campbell and Shiller 1988; Campbell 1991).

Researchers typically assume either that (1) realized returns are a reasonable proxy for expected returns (that is, the unexpected return component is small, cancels out through aggregation, or both) or, (2) even in broad samples, the unexpected return component is a key non-cancelling component of realized returns (e.g., Elton 1999; Vuolteenaho 2002).Footnote 18 Adopting the latter perspective, researchers have developed several CofE metrics derived independently of realized returns (e.g., Gebhardt et al. 2001; Claus and Thomas 2001; Botosan and Plumlee 2002; Easton 2004; Brav et al. 2005; Ohlson and Jüttner-Nauroth 2005),Footnote 19 or, alternatively, researchers have empirically purged the realized return of its unexpected (“news”) component. For example, Campbell (1991) and Vuolteenaho (2002) propose a variance decomposition method that pre-specifies expected returns as a linear combination of firm characteristics; Botosan et al. (2011) and Ogneva (2012) control for a specific kind of fundamental (earnings) news in realized returns to identify the cash-flow news component of realized returns.Footnote 20 A stream of research that views realized returns and CofE metrics as alternative and imperfect proxies for expected returns aims to validate CofE metrics jointly with realized returns in association tests. Easton and Monahan (2005) and Guay et al. (2011) find the association between realized returns and several commonly used CofE estimates is often insignificant or even significantly negative. Botosan et al. (2011) find the association varies between positive and negative over time and is, on average, weak.

The contrast of these results with economic intuition leads some researchers to propose and test approaches to increase the association. One approach attributes the weak association to realized returns. Using variance decomposition to control for non-expected return components in realized returns, Easton and Monahan (2005) find no, or significantly negative, associations between “news-purged” realized returns and four of the seven CofE estimates they consider. In contrast, Botosan et al. (2011) use different empirical proxies for unexpected return components (similar to Ogneva 2012) and find their CofE estimates have significant positive associations with “news-purged” returns. However, they also document their news-purged returns measure has either no association or counter-intuitive associations with the risk-free rate, beta, book-to-market, and other proxies for risk, leading them to question the validity of their “news-purged” realized returns as a proxy for expected returns and to express caution about the approach. In terms of our research setting, we note there seems to be a trade-off in that adjustments to CofE metrics may worsen the relation between CofE metrics and other proxies for risk, such as beta (Botosan et al. 2011). We address this potential concern in Section 4.4.2 by showing the coefficients on risk factors are little affected by our distribution-matching approach.

Other papers attribute the weak association to the analyst forecast-inputs. Guay et al. (2011) find that modifying analyst-based CofE metrics to account for “analyst sluggishness” improves the associations between some CofE proxies and realized returns but does not always result in statistically reliable associations.Footnote 21 Other studies adopt a portfolio design: Gode and Mohanram (2003), for example, find positive spreads in realized returns. In portfolio-level tests, Hou et al. (2012) show that returns spreads increase when they replace analyst forecasts with regression-based earnings forecasts.Footnote 22 Li and Mohanram (2014) use the Hou et al. (2012) approach to derive earnings forecasts and show positive associations on the firm level as well.

3.2 Analysis of the CofE setting

We analyze a different and possibly co-existing explanation for the weak and inconsistent results in tests of association between realized returns and CofE metrics. Rather than modify the metrics or consider other alternatives proposed in the literature, we leave the original CofE metric constructions in place and propose an explanation that derives from known features of samples used to estimate those metrics. That is, we choose the most conservative starting point (the problem as first examined by Easton and Monahan (2005)) and use unmodified implied CofE definitions and unmodified realized returns.

Because the data requirements for estimating CofE metrics eliminate firms with no analyst following, negative book value of equity, or negative or declining earnings forecasts, firms in a CofE sample tend to be larger and more profitable, hence likely more stable, than the CRSP population. For example, Francis et al. (2004, Table 1) report that the aggregate market capitalization of their sample of Value Line-followed firms, averaging 790 firms per year, is just over 47% of the CRSP market capitalization. We posit these data requirements result in CofE samples that are non-random draws from the population of CRSP firms, with a returns distribution that differs from the returns distribution in that population (or a random sample thereof).Footnote 23 We further posit that association estimates based on such a non-random sample are difficult to generalize to the population, that is, the external validity of the results is questionable. We do not dispute previous findings but rather use distribution-matching to arrive at results that we believe can be more justifiably generalized to the population of listed firms. To summarize, we believe the CofE-realized returns association has the following desirable characteristics for an empirical examination of the effects of non-random sampling: (1) the full distribution of returns is available for the reference sample; (2) data on the CofE metric are missing for many reference sample firms, but conceptually all sample firms have a cost of equity; and (3) previous research shows the characteristics of returns for missing firms differ from the characteristics of returns for included firms.

To illustrate how data requirements may result in non-random samples, we let the requirements for five CofE metrics dictate the sampling bias in returns. While intuition suggests these data requirements are likely to bias CofE samples towards more stable firms with less dispersion in returns than the returns of the reference sample, intuition does not necessarily suggest an effect on associations of CofE metrics with these returns. We triangulate the effects of non-random returns on associations by using implied risk factor premia that are estimable for the entire reference sample. Specifically, our data contain complete information on both realized returns and asset pricing factor betas (loadings) for all observations in the reference sample. In a CAPM world, cross-sectional variation in beta is equivalent to cross-sectional variation in expected returns. Therefore we can use association tests on factor loadings to gauge the non-randomness of the CofE sample by first performing factor loading association tests on the reference sample and then artificially imposing the CofE sample restriction into the same test. Differences in results would suggest the CofE sample is a non-random sample of the reference sample. Using Fama-MacBeth two-stage tests of the association of returns with risk factor betas, we show the CofE returns sample is indeed a non-random sample of the reference sample.

4 Test design and non-randomness of the CofE sample

4.1 Sample and descriptive data

Table 2 describes the archival data used in our empirical tests. We first identify all firms with monthly CRSP returns data during February 1976 to July 2009 (402 months). These data are used for our cross-sectional asset pricing tests. The reference sample (the Full Returns sample) includes all firms with returns data in the current and preceding 11 months; a firm is required to have 12 consecutive monthly returns observations to enter the Full Returns sample in Month t.Footnote 24 The Full Returns sample contains 2,460,998 firm-month observations (24,657 unique firms). Table 2, Panel A, shows the average monthly cross section contains 6122 firms, with an average (median) monthly realized raw return of 1.30% (0.20%). Monthly excess returns, (realized return less the month-specific risk-free rate) are 0.83% (mean) and −0.27% (median). The average cross-sectional standard deviation of both raw and excess returns is 16.15%, and the interquartile ranges are about 12%–13%, indicating that realized returns are quite dispersed.Footnote 25

Table 2 Descriptive statistics of monthly (cross-sectional) distributions

Panel B of Table 2 reports average cross-sectional statistics for the sample of firms with data to estimate the five CofE measures. On average, those cross sections contain 955 firms (383,955 monthly observations for 3989 unique firms), a potentially non-random subsample of the Full Returns sample. Value Line cost of equity (VL CofE) estimates are derived from Value Line target prices and dividend forecasts, are re-calculated each month and are de-annualized to the month level.Footnote 26 We calculate four other implied CofE estimates following Claus and Thomas (2001, CT); Gebhardt et al. (2001, GLS); Easton (2004, MPEG), and Ohlson and Jüttner-Nauroth (2005, OJN).Footnote 27 The CofE metrics require analyst following in general, and Value Line coverage in particular, as well as positive and increasing earnings forecasts.

As reported in Panel B of Table 2, the mean (median) values of the CofE estimates range from 0.0071 to 0.0121 (0.0067 to 0.0118). The mean (median) monthly realized excess returns for the CofE sample are 0.74% (0.41%). The Full Returns sample is more dispersed, more positively skewed, and more leptokurtic than the CofE sample. With regard to dispersion, the standard deviation of excess returns for the CofE sample is 8.96%, a 44.5% reduction compared to the standard deviation of the Full Returns sample, and the interquartile range of excess returns of the CofE sample is 9.87%, a reduction of about 22% relative to the Full Returns sample. The Full Sample returns are positively skewed, with skewness coefficient of 3.74; the skewness coefficient of the CofE sample is 0.6034 (a perfectly symmetric distribution has zero skewness). Finally, the Full Sample returns are leptokurtic, with a kurtosis coefficient of 82.75 on average, while the average CofE sample kurtosis is 6.16.

4.2 Benchmark results of associations between CofE metrics and realized returns

We first estimate cross-sectional Pearson correlations and slope coefficients from regressions of realized (excess) returns on each of the five CofE metrics; intercepts (not tabulated, in the interest of brevity) are included in all regressions. We use Eq. (6) to estimate slope coefficients for each Month t using all complete returns-CofE observations available for that month.

$$ {R}_{i,t}-{R}_{f,t}={\delta}_{0,t}+{\delta}_{1,t}{CofE}_{t-1}+{\varepsilon}_{i,t}. $$
(6)

The averages of the monthly coefficient estimates δ1, tover the sample period measure the association between realized excess returns and a specific CofE metric. Following Fama and MacBeth (1973), the test statistic for the significance of the associations is the average monthly coefficient estimate relative to the time-series standard error of the monthly estimates over the sample period.Footnote 28 Results are reported in Table 3 for the unmodified CofE sample resulting from deletion when data on any of the CofE metrics is missing (analogous to Easton and Monahan 2005). These results are broadly consistent with prior literature on the association between CofE estimates and realized returns, if not more negative.Footnote 29 Correlations show either no reliable relation between realized returns and CofE, in the case of VL CofE, or a negative relation, in the case of the other four CofE metrics (t-statistics range from −0.52 to −6.11). Regression coefficients show a similar picture, with three of the five metrics showing significantly (at the .05 level or better) negative slope coefficients. All five coefficients are reliably different from their theoretical value of 1, with t-statistics (not tabulated) between −7.25 and −12.36.

Table 3 Average correlation and regression coefficients of realized returns and CofE metrics

4.3 Benchmark results of factor beta tests

To exploit the richness of the cost of capital setting we use tests of factor betas using two-stage asset-pricing tests; one purpose is to serve as an input to our demonstration that the CofE sample is a non-random sample from the reference sample. In the first stage, we estimate slope coefficients (factor betas) in a firm-specific time-series regression of excess returns on each risk factor:

$$ {R}_{i,t}-{R}_{f,t}={a}_{i,t}+{b}_{i,t}^F{F}_t+{\varepsilon}_{i,t}, $$
(7a)

where Ri, t − Rf, t is the excess return for firm i for Month tFt = a risk factor, specifically, the market excess return (market factor), the size factor or book-to-market factor (SMBt, HMLt) from Fama-French (1993) or the accruals quality factor (AQfactort) from Francis et al. (2005);\( {b}_{i,t}^F \) = the factor beta for risk factor F; t subscripts the sample month. Equation (7a) is estimated over a rolling 12-month window ending in Month t.Footnote 30

In the second stage, we estimate cross-sectional regressions of the firm-specific excess returns in Month t on the univariate first-stage factor loadings \( \hat{b_{i,t}^F} \) (the risk factor betas) obtained from estimating Eq. (7a):

$$ {R}_{i,t}-{R}_{f,t}={\gamma}_{0,t}+{\gamma}_t^F\hat{b_{i,t}^F}+{\vartheta}_{i,t}. $$
(7b)

Equation (7b) is estimated for each Month t. The full sample tests use all firms with the necessary observations to estimate first stage betas. The second-stage coefficient estimates (\( {\gamma}_t^F \)) are interpretable as implied risk factor premia in Month t (implied by the first-stage loadings). Following Fama and MacBeth (1973), the test statistic for the significance of the implied risk factor premia is the average monthly coefficient estimate relative to the time-series standard error of the monthly estimates over the sample period. Theory predicts the sign (positive), but not the magnitudes of the second-stage coefficient estimates (the magnitudes of the implied factor premia). Following previous research, we test whether the \( {\gamma}_t^F \) estimates are reliably different from zero.

We use the samples described in Table 2 to establish benchmark associations between excess returns and factor betas. The tests are motivated by the idea that both CofE metrics and factor betas are supposed to capture risk and the fact that tests on factor betas can be performed on both the reference sample and on subsamples, such as the CofE sample. Table 4, column 1, shows the second stage coefficient estimates from Eq. (7b) and t-statistics based on the time-series standard error of the monthly estimates. Our interest is not in the significance of specific risk factors but rather in using the Full Sample results as a benchmark for comparing subsample results. The association between returns and market beta is positive (the coefficient estimate corresponds to a risk premium of 0.52% per month; t-statistic = 2.03) as is the association for the AQfactor beta (risk premium of 0.77% per month; t-statistic = 2.44). The coefficient on the SMB beta is 0.0025, t-statistic = 1.69, significant at the 0.05 level, one-tailed. The association between returns and HML beta is not reliably different from zero at the 0.05 level.Footnote 31

Table 4 Association tests on reference sample, random subsamples, and CofE subsamples

As previously described, we aim to shed light on how differences in the distributional properties of estimation samples of realized returns and, by implication, how differences in sample selection criteria affect the results of association tests between realized returns and both risk factor betas and CofE estimates. We first consider sample size versus sample non-randomness, using the Full Returns sample as the proxy for the population and the CofE sample as a potentially non-random subsample. With regard to sample size, the monthly average is 6122 firms in the Full Returns sample and 955 firms in the CofE sample, a reduction of about 84%. With regard to distributional properties, as captured by dispersion, skewness, and kurtosis, the Full Returns sample is more extreme on all three distributional properties.

Because factor betas are available for both the Full Returns sample and the CofE sample, asset pricing tests can be used to illustrate that the CofE sample differs from the Full Returns sample with respect to the association between realized returns and risk proxies. To illustrate the effects of sample size, columns 2 and 3 of Table 4 present results of association tests between risk factor betas and realized returns from 1000 randomly drawn subsamples of the Full Returns sample (Random Subsample) of the same size each month as the actual CofE sample (monthly average of 955 firms).Footnote 32 Column 2 reports average slope coefficients and t-statistics, and column 3 contains the range of values across the 1000 random draws. The coefficient estimates of the Full Returns sample (column 1) and the average random subsample (column 2) are nearly identical (differences are between 1 and 4 basis points); the Full Returns results fall well into the range of values (column 3). The reduced sample size means the monthly coefficients are estimated with less precision; as expected, the time-series t-statistics are lower by amounts between 0.10 (market risk premium) and 0.12 (AQFactor premium). Turning to the effects of non-randomness, results of the asset pricing tests using the actual CofE sample are shown in the rightmost column of Table 4. None of the factor betas evidences a significant association with excess returns, and all factor premia are reduced in magnitude by at least 50%. The results from the CofE sample fall outside the range of values spanned by the random subsamples.

We draw three conclusions from the results in Table 4. First, even substantial reductions in sample size (84% in the average cross section) have a modest effect on the efficiency of the estimation. Second, distributional differences in either realized returns or factor betas have substantial effects on the results. We interpret these results as supporting the notion that the CofE sample is a non-random subsample of the Full Returns sample for purposes of testing associations with proxies for risk. Third, in such non-random samples, if factor betas fail to load significantly, insignificant results concerning CofE metrics should not be surprising.

4.4 Demonstration of distribution-matching on simulated CofE-calibrated data and on archival data

4.4.1 CofE-calibrated simulations

Table 5 shows simulation results when the data approximate the size and shape of the distribution of the reference sample excess returns and the Value Line CofE metric. We use empirically determined parameters of the actual distributions of excess returns and of a CofE metric to create simulated data similar to the archival data. We are able to mimic the first four moments of the variable distributions. We induce correlations of 0.25, 0.10, and 0. We report results for Non-Random Sample I, constructed to be less extreme than the random sample by setting the selection probability to decrease in the absolute distance to the variable mean. For the simulated Full Returns data (first line of Table 5) and for random samples of the same size as the actual CofE sample (second line), estimated correlations are similar to the induced population correlations. This finding buttresses our conclusion that even sharply diminished sample sizes do not obscure or shift estimates away from true correlations in the data, as long as the samples are randomly drawn from the reference sample. In contrast, the third line indicates that for the intentionally less-extreme non-random sample, the KS test statistic rejects similarity of the distributions at better than the 0.0001 level. Estimated associations between the simulated returns and simulated CofE metrics are negative and highly significant even though true correlations are positive: when the true correlation is 0.25 (0.10), the estimated correlation is −0.17 (−0.06), with a t-statistic of −61.8 (−33.9). When the true correlation is zero, the estimated correlation for the non-random sample is also zero (point estimate 0.0004, t-statistic 0.23). When we distribution-match the non-random sample, the KS test statistics decline to about 0.03 with significance levels of about 0.50. When the true correlation is 0.25 (0.10) [0], the estimated correlation is 0.17 (0.07) [0.00], with a t-statistic of 25.40 (9.07) [0.30].

Table 5 Simulation results from CofE-calibrated samples

We believe these simulations support two conclusions. First, sample marginal distributions, not sample size per se, affect the ability to empirically detect the true correlations between two variables. In particular, the sign differences reported in Table 5—negative correlation estimates when true correlations are positive—highlight the potentially substantial bias in results when the marginal distribution is non-randomly drawn from the reference sample. Second, despite extreme differences in the characteristics of marginal distributions, distribution-matching yields an adjusted sample with correlations similar in sign and magnitude to the true correlations.

4.4.2 Distribution-matching the actual CofE sample

We construct distribution-matched samples from data on realized returns and the five CofE metrics. Results so far suggest the returns of the actual CofE sample are a non-random sample of the Full Returns sample returns. Table 2 shows excess returns for the CofE sample have a similar mean/median, and noticeably smaller standard deviation, skewness and kurtosis, as compared to the Full Returns sample. We interpret these findings as raising the question of whether the negative or weak correlation between CofE estimates and realized returns reported in Table 3 is generalizable to the reference sample or arises from the effects of data requirements.

To accommodate the substantial differences in the shape of the CofE sample returns distribution, as compared to the reference sample returns distribution, we change the implementation of the distribution-matching approach used in the simulation in two ways. We first apply an iterative procedure that starts with a base sample, draws an additional observation, and keeps that additional observation only if the resulting sample shows a smaller KS statistic. The approach still aims to minimize the KS-based statistic, even in months where insignificant KS statistics cannot be achieved because of large initial differences between the CofE sample and Full Returns sample.Footnote 33 The minimization does not require a pre-specified sample size but rather lets the iteration determine the optimal sample when the KS statistic cannot be further minimized. This approach is conceptually grounded but inefficient and computationally burdensome for large samples. The second, less computationally demanding implementation matches the CofE sample to the Full Returns sample using a variant of stratified resampling (post-stratification), which tries to match the dispersion of the returns distribution. We describe both approaches next.

Method 1: Kolmogorov-Smirnov-based distribution-match

We start by randomly sampling either 20% of the month-specific sample or, separately, 100 unique firms from the CofE sample in a given month. We compute the KS statistic for this initial draw against all returns from the reference sample in that month.Footnote 34 Our previous analyses suggest the KS test on this initial sample is likely to reject the hypothesis that the Full Returns distribution is equal to the returns distribution of, for example, the 100 initially selected firms. We start our iteration to minimize the KS statistic by randomly adding one observation (# 101), re-compute the KS statistic, and again compare to the reference distribution of returns that month. If the KS statistic using 101 observations (against the reference distribution) is equal to or greater than the KS statistic using the original 100 observations sample (against the reference distribution), we dismiss the 101st observation and replace it with another randomly chosen, with replacement, 101st observation from the CofE sample. If the KS statistic using 101 observations is lower than the KS statistic using the 100 observations sample, we keep the 101st observation, draw a 102nd observation, and evaluate the inclusion of the 102nd observation. We repeat this step 1000 times, thereby allowing for KS-based distribution-matched samples to increase by a maximum of 1000 observations each month.Footnote 35 Because convergence to a minimum KS statistic depends both on the initial 20% (or 100) observations drawn and on the order of additions, we repeat the procedure 30 times and retain the final sample with the lowest KS statistic (i.e., with the minimal difference as compared to the Full Returns distribution). We repeat the construction of KS-based samples for each month. When the iteration begins with 20% of the CofE sample firms, the final distribution-matched sample contains an average of 242.2 firms (about 25.4% of the 955 firms in the average CofE sample month), with a time-series standard deviation of 54 firms. When the iteration begins with 100 firms each month, the average cross section consists of 124 firms (with a standard deviation of 14 firms).Footnote 36

Method 2: Description of bin-based distribution-match

To reduce the computational burden of Method 1, we create bin-based distribution-matched samples. Bin-based matching is similar to post-stratification except that the distribution is also truncated (some population strata are not represented in the sample), requiring an additional weighting scheme for the tails of the distribution. For the common support region, we first place the returns of the Full Returns sample and the CofE sample into bins with width of 100 basis points (bp) and calculate the sample proportion of observations in each bin for both samples.Footnote 37 To distribution-match, we re-weight (by resampling return-CofE pairs with replacement) each bin in the CofE sample, so that the sample proportion matches the proportion in the corresponding reference sample bin. For example, if the realized returns bin [0.10; 0.11[contains 5% of the CofE sample observations and 10% of the reference sample observations, we resample the CofE sample bin to increase its percentage to 10% of the sample size in that month.Footnote 38 At the extremes of the reference sample distribution, we encounter bins without corresponding observations in the CofE sample. To address this issue, at both the upper and lower extremes of the CofE sample, we re-weight the most extreme positive and most extreme negative returns, with equal weighting at both extremes in the following form (month subscripts omitted):

$$ {w}_{i_{CofE}}=\Big\{{\displaystyle \begin{array}{c}{w}_{i_{RS}}\sum \limits_{j=\min \left({i}_{RS}\right)}^{\min \left({i}_{CofE}\right)}{\left[\mathit{\min}\left({i}_{CofE}\right)-{i}_{j, RS}+1\right]}^{\gamma}\kern1.08em \forall \kern0.48em {i}_{RS}=\min \left({i}_{CofE}\right)\\ {}{w}_{i_{RS}}\kern8.399994em \forall \kern0.48em \min \left({i}_{CofE}\right)<{i}_{RS}<\max \left({i}_{CofE}\right)\\ {}{w}_{i_{RS}}\sum \limits_{j=\max \left({i}_{CofE}\right)}^{\max \left({i}_{RS}\right)}{\left[{i}_{j, RS}-\max \left({i}_{CofE}\right)+1\right]}^{\gamma}\kern0.72em \forall \kern0.48em {i}_{RS}=\max \left({i}_{CofE}\right)\end{array}} $$
(8)

\( {w}_{i_{CofE}} \)(\( {w}_{i_{RS}} \)) is the sampling proportion of Bin [i;i+0.01] in the CofE sample and the reference sample, respectively. The product of sampling proportion \( {w}_{i_{CofE}} \) and overall sample size in Month t is the bin-specific number of draws that month. We numerically solve, within the sampling procedure, for the constant weight parameter γ until the standard deviation for the distribution-matched CofE sample is statistically indistinguishable from that of the Full Returns sample. After this calibration, the time-series average of the differences in cross-sectional standard deviations between the Full Returns sample and the distribution-matched CofE sample is 0.0009 (t-statistic = 0.62) at γ = 2.15, and −0.0008 (t-statistic = −0.56) at γ = 2.20. Figure 2a illustrates the approach by plotting the average distribution of excess returns for the CofE sample before and after distribution-matching as well as the reference distribution of returns. The dashed continuous distribution of returns in the CofE sample is sorted into bins of pre-determined widths and then resampled, such that the sample proportion of each bin is equal to that bin in the reference distribution. The procedure is effective if the heights of the resulting light bars, representing the strata, match the heights of the dark reference strata. As previously mentioned (note 10), Fig. 2b plots the average Value Line-based CofE metric by returns-bin, before and after bin-based distribution matching; across the 402 sample months, in none of the 101 returns bins is the difference significantly different from zero at the 0.01level (results not tabulated). Differences using the other CofE metrics (not plotted) are, if anything, generally smaller in absolute magnitude.

Fig. 2
figure 2

Graphical evidence of distributional properties before and after bin-based distribution-matching. Panel A: Average Monthly Returns Distribution Before and After Bin-based Distribution-matching. Panel B: Average (Value Line) CofE Before and After Bin-based Distribution-matching. Fig. 2, Panel A, shows the empirical distribution (density) of excess returns for three samples: the actual CofE sample (dashed line), the reference (Full Returns) sample (dark bars), and the bin-based distribution-matched sample with γ = 2.20 (light bars). The width of each bin is 100 basis points. Data in the figure are bin-specific average sample proportions across 402 months and are truncated at −/+50%. The figure shows distribution-matched-sample returns from a single randomly chosen run of the resampling procedure. Panel B depicts the average Value Line CofE metric, by realized-returns bin, before (dark solid line) and after distribution-matching (dashed line), and the associated differences (light solid line)

Association tests using distribution-matched samples

Table 6 reports results of correlation tests between realized returns and CofE metrics for the distribution-matched samples under both the Kolmogorov-Smirnov and bin-based approaches. The average KS statistic (average p value) of the unadjusted CofE sample is about 14% (0.0009) and the test rejects similarity of distributions in 401 of 402 sample months at the 0.10 level or lower. After KS-based distribution-matching using 20% of firms (100 firms), the average KS statistic is just under 6% for both initializations, with an average p value of 0.5287 (0.7789), and the test rejects similarity of distributions in 42 (5) of 402 months at the 0.10 level or better. We conclude from these results that the KS-based approach to distribution-matching is effective.

Table 6 Association tests in distribution-matched samples

The top portion of Table 6 reports correlations between five CofE measures and realized returns, using the KS-based distribution-matched samples (columns 2 and 3) and the bin-based distribution-matched samples (columns 4 and 5). For KS-based matching, the time-series average correlations across 402 months are reliably positive in 9 of the 10 specifications, with t-statistics between 2.23 and 6.08 (the exception is the association with the CT CofE metric with 100 firms as initial sample where the correlation is positive and insignificant). These results indicate reliably positive associations between four implied CofE metrics and realized returns, in contrast to the benchmark results on the unmodified CofE sample (reproduced in the first column), where four correlations are significantly negative. For bin-based matching, we report correlations for γ = 2.15 and γ = 2.20, where the overall difference in standard deviations is insignificant at conventional levels. Because the focus is on matching standard deviations only, we do not expect the bin-based distribution-match to be entirely effective in lowering the KS statistics for general similarity of distributions. Table 6, columns 4 and 5, shows that the average KS statistic decreases, relative to the unmodified CofE sample, and remains significant for 372 (375) months for γ = 2.15 (γ = 2.20). In these analyses, all five CofE measures have reliably positive correlations with realized returns, ranging from 0.021 (CT CofE measure) to 0.054 (the VL measure), with t-statistics between 2.07 (CT CofE measure, γ = 2.15) and 4.28 (VL CofE measure, γ=2.20).

The bottom portion of Table 6 contains regression coefficients for the KS-based and bin-based distribution-matched samples. In contrast to the results reported in Table 3, the coefficients are generally significantly positive, with t-statistics usually exceeding 2.0, and statistically indistinguishable from 1 in 16 of the 20 specifications. The coefficient on the CT metric is significantly smaller than 1 in two of the KS-based samples, and the GLS CofE metric shows significantly larger coefficients for the bin-based samples.

To examine the effect of distribution-matching on the implied factor premia from asset pricing tests, we repeat the Fama-MacBeth-type tests reported in Table 4 using the KS-based distribution-matched samples with the initialization set at 20% of the CofE sample (results not tabulated) and using one bin-based distribution-matched sample (γ = 2.20; results not tabulated). The factor premia from these samples are qualitatively similar to the Full Returns sample results in Table 4 in that the market factor, SMB factor, and AQfactor are positive and statistically significant and the HML factor is insignificant at conventional levels. In sum, the results on implied factor premia in the distribution-matched samples are qualitatively similar to results in the reference sample as reported in Table 4.

We next address a concern that arises in part from Botosan et al.’s (2011) finding that news-purged realized returns, which should measure expected returns, have either no associations or counter-intuitive associations with risk proxies such as beta. In our setting, the concern is that distribution-matching the CofE sample increases the association between CofE metrics and excess returns at the cost of a diminished association between CofE metrics and other risk proxies, specifically risk factor betas. We test for a decline in the associations between CofE metrics and risk factor betas, using (1) the sample composition from the KS-based matching in Table 6 with initial sample size equal to 20% of the CofE sample, as compared to (2) a random sample from the CofE sample of the same size in any given month. For both samples, we regress the five CofE metrics on lagged risk factor betas from Eq. (7a). If distribution-matching decreases the association between CofE metrics and risk factor betas, the associations will be smaller for the distribution-matched sample (1) than for the random sample (2). Our test is based on the time-series of the difference between the 402 month-specific KS-based sample results and the 402 month-specific results from the equal-sized random samples. We repeat the procedure 100 times and evaluate the differences using the average Fama-MacBeth-type t-statistics across the 100 outcomes. In untabulated results, we find that for 19 of 20 coefficient estimates (five CofE metrics times four risk factor betas), differences between the two sets of associations are insignificant at conventional levels, with t-statistics between −0.59 and 1.46. The exception is the coefficient on the market beta in the VL CofE regression, which shows a small, reliably positive difference of 0.0001 (t = 2.09). In all cases, coefficients from the KS-based sample are numerically similar to coefficients from the random samples; they are always of the same sign and always significant at comparable levels.

Combined with previous results, we interpret the weight of the evidence in Table 6 as demonstrating that differences in the shape of the returns distribution between the Full Sample and the CofE sample have a marked effect on the results of association tests. We draw three inferences from these results. First, selection criteria that yield estimation samples with different returns distributions, as compared to a reference sample, decrease the ability to detect theoretically-predicted associations between realized returns and CofE estimates. Second, adjusting the distribution of the outcome variable (in this case, realized returns) in the non-random sample to mimic that of the reference sample provides at least a partial solution. Third, the finding that the distribution-matched sample may be smaller than the original, non-random sample suggests that attempts to achieve generalizability to a reference sample by maximizing the size of a data-restricted sample may not be effective.

5 Extensions

5.1 Relating distribution-matching to selection models and multiple imputation

Selection models

Both selection models and distribution-matching seek to incorporate information into the test model beyond the information in the complete-data subsample. The latter focuses only on the outcome variable, whose empirical distribution in the reference sample can be derived by the researcher or is empirically estimable, and aims to construct a test sample that appears randomly selected with respect to the outcome variable. In contrast, a selection model (1) operates under the assumption that data are not missing at random, conditional on observed data, and (2) requires explicit modelling of the missingness mechanism using additional explanatory variables, which might impose even more stringent data restrictions than the actual test model. Thus the sample restriction issue at the heart of our analysis does not arise in the approach developed by Heckman (1979), because the exogenous covariates in the first-stage selection model are attainable for all observations, or, equivalently, attainable for a random subsample of the population.Footnote 39 Consequently, results from a Heckman-type model are generalizable only to the sample for which the selection model variables are available, and increasing selection-model fit by including more explanatory variables is likely to impose increasingly stringent sample restrictions due to data requirements. Exacerbating the data availability problem is the exclusion restriction on the explanatory variables set in the test model, compared to the explanatory variables set in the selection model. To avoid collinearity of the test model variables and the inverse Mills ratio, the recommended approach is to include at least one additional variable in the selection model not contained in the test model of interest and, in theory, not associated with the outcome variable.Footnote 40

Distribution-matching is, by design, non-parametric and based on a reference distribution of the outcome variable. In contrast, the derivation of the Heckman correction for sampling biases relies on the assumption that the residuals from the selection model and the test model are jointly normally distributed. The normality assumption allows for a closed-form solution for the sampling bias in OLS coefficients as a function of the inverse Mills ratio, the standard deviation of the test model residual, and the correlation between test model residuals and selection model residuals. The normality assumption is crucial for the parameter estimates in the test model; descriptive statistics for excess returns reported in Table 2 cast doubt on this assumption in our setting. At a minimum, we caution that Heckman test results with realized returns as the dependent variable are likely biased (in an unknown direction) by violations of the normality assumption.

Despite these concerns, we implement the Heckman model, subject to the constraint of avoiding, as much as possible, additional sample restrictions, at the potential cost of not maximizing the selection model fit. We restrict our analysis to selection models with explanatory variables available for all, or at least the vast majority of, observations in the Full Returns sample and include some or all of the following: firm size (CRSP market capitalization at the end of the prior month), firm age (number of months between the first month on CRSP and the current month), CRSP trading volume, the book-to-market ratio (calculated from the Compustat annual file), and the four univariate risk factor betas from the asset pricing regressions, eq. (7a).Footnote 41

Table 7 reports semi-partial correlation coefficients between CofE metrics and excess returns and goodness-of-fit measures for the probit selection models estimated. The model including only size has a pseudo-R2 of 0.48, with no additional sample loss; adding variables increases the pseudo-R2 to a maximum of 0.52, with a sample loss of 2.1% when the model includes the log of the book-to-market ratio. The reported semi-partial correlations are averages of the 402 month-specific (cross-sectional) semi-partial correlations, obtained by controlling for the inverse Mills ratio in the respective CofE metric first and then computing the correlation between the returns and the residualized CofE metric. Similarly, regression coefficients (bottom portion of the table) are averages of 402 cross-sectional slope coefficients from regressions of excess returns on both the CofE metric in question and the inverse Mills ratio from the selection model.

Table 7 Results from Heckman-type selection models

The inverse-Mills ratio-adjusted semi-partial correlations and the adjusted regression coefficients are negative or, in the case of the VL CofE metric, indistinguishable from zero. Across CofE metrics, point estimates appear slightly lower and are more statistically different from zero, as compared to unadjusted correlations or slopes.Footnote 42 With the caveat that the effects of the inverse Mills ratio on the semi-partial correlations and regression coefficients might be due to violations of the normality assumption, an inadequate fit of the selection model, or some combination of the two, we conclude that Heckman-type selection models do not change the conclusion from results obtained using the unadjusted CofE sample.

Multiple imputations

Footnote 43A standard implementation of multiple imputation will fail to recognize differences in the functional form connecting returns and CofE. Specifically, correlation and regression coefficients from a single imputation model for the entire cross-section of returns are qualitatively similar to the actual CofE sample results reported in Table 3, except that coefficients tend to be more negative (farther from the theoretical value) and standard errors are larger because of additional variance from the imputed CofE data. However, when we modify the approach to allow for group-wise imputation models (two groups divided at the monthly cross-sectional median; three groups or five groups) to improve the overall model fit, 12 of the 15 regression slopes (five CofE metrics times 3 different sample groupings) are positive, significant at the 0.10 level or better and indistinguishable from 1. Results for correlations are generally positive but weaker and insignificant in eight of the 15 specifications and especially for the CT CofE.

We assess the sensitivity of these results in two ways. First, we preclude the imputation of negative valuesFootnote 44 by using log transformations before imputing and find qualitatively comparable results for two and three imputation groups and stronger results for five imputation groups per cross section. Qualitative inference changes only for the MPEG CofE, with a higher coefficient of 0.65 (t-statistic against 0 = 1.83, t-statistic against 1 = −0.97). Second, we impute data for the factor tests. We construct a dataset that deletes the loadings estimates from eq. (7a) for observations without CofE metrics and then impute the now missing (by construction) values for the loadings before we estimate the implied factor premia using eq. (7b). Untabulated results show that, for imputations of the full cross section, only the market risk premium is significantly positive (t=1.92). For imputations using two, three and five groups results are qualitatively similar to the results from the Full Returns sample. We interpret the weight of this evidence as suggesting that multiple imputation can be a viable alternative to distribution-matching, albeit one that imposes a normality assumption and that may require additional adjustments in a specific research setting, for example, precluding inadmissible imputed values.

5.2 Asset pricing tests on returns of samples that meet selection criteria used in accounting research

Using the CRSP population of firms with at least 12 consecutive monthly returns during our sample period and the subsample of those returns associated with firms for which CofE measures can be calculated, we have analyzed how differences in returns distributions between the two samples affect results of association tests. We next consider whether results of asset pricing tests of the association between risk factor betas and realized returns are sensitive to the following cross-sectional selection criteria that likely yield non-random samples: S&P 500 membership, a potential screen in compensation researchFootnote 45; NYSE listing, a screen in some intra-day trading studies; availability of the standard deviation of analysts’ earnings forecasts, required for research examining forecast dispersion; or a stock price of at least $5. We apply each criterion separately to the Full Sample, report the proportions of firms that do and do not meet the criterion and re-estimate Eq. (7b), separately, for observations meeting and not meeting the criterion.

Results are reported in Table 8, Panel A. The selection criteria generally result in unequal proportions of firms in the Full Returns sample that do and do not meet each criterion. The difference in proportions is, unsurprisingly, most extreme for the S&P 500 criterion (8.44% meet the criterion). KS statistics for tests of equality of distributions show that, for three of the four selection criteria, percentage deviations between the Full Returns sample and the subsample meeting the criterion exceed the deviations for the subsample not meeting the criterion. That is, the returns distributions of firms not selected by these three criteria more closely resemble the returns distribution of the Full Returns sample.Footnote 46 The exception is the price of at least $5 criterion.

Table 8 Univariate associations between (excess) returns and risk factor betas (implied factor premia)

These findings suggest asset pricing tests may yield results that are more theory-consistent as well as more consistent with results for the Full Sample for firms not included in the sample resulting from the application of plausible selection criteria. Specifically, Table 8 shows that both point estimates and t-statistics are more similar to Full Sample results for firms that do not meet the selection criteria. We view these results as indicative, but not dispositive, that the distributional issues we identified and analyzed for the CofE sample generalize to other research situations where data-constrained samples consist of large, stable firms and, as a consequence, have returns distributions that are not random draws from the population.

5.3 Association tests between realized returns and factor betas using forced non-random samples

To illustrate the sensitivity of association test results to (small) changes in non-random sampling, we split the Full Sample realized returns distribution into positive and negative returns and reweight both subsamples differentially to varying degrees. This test is motivated by the conjecture that data requirements might lead (implicitly) to a similar and less extreme reweighting of positive and negative returns. We resample 20 times, by month, for each of 402 sample months. In each month, with Nt firms in each month, we resample 20 times with replacement Nt firms. We use these 402 months of resampled data to illustrate the effect on the returns-factors betas association, as these data are available for the entire reference sample.

Results are reported in Table 8, Panel B. The 0% column shows results when we resample preserving the population proportions of positive and negative returns. These results coincide with the Table 4 Full Returns sample results; small differences result from sampling with replacement as opposed to using the full sample. The columns labeled −2.5%, −5%, −10%, and −25% show the Full Returns sample results when our resampling procedure decreases the portion of positive returns sampled by the specified percentages and increases the portion of negative returns sampled by the same percentages. The columns labeled +2.5%, +5%, +10%, and +25% show results when resampling increases (decreases) the portion of positive (negative) returns sampled by the specified percentages. The results suggest that increasing the proportion of positive returns increases the significance of results of asset pricing tests.Footnote 47 For example, the t-statistic on the implied SMB factor premium increases from 0.38 (25% decrease in positive returns) to 1.67 (unbiased sample) to 2.98 (25% increase in the proportion of positive returns). Factor premia are differentially sensitive to these changes, with the market factor apparently relatively more robust compared to other factors, although the trend exists also for it. We infer that results of association tests are sensitive to the distributional properties of estimation samples and therefore sensitive to differences in sample selection criteria, with the degree of sensitivity differing with the nature of the selection criteria.

Overall, the results in Table 8, Panel B, illustrate that the outcome of all four returns-betas association tests is sensitive to a pre-specified characteristic of the returns distribution, namely whether the sample contains more positive returns. We directly manipulated the sample distribution; however, this sample characteristic may also be implicitly influenced by researcher-chosen selection criteria, data availability, or sample partitions.

5.4 Applying distribution-matching to Richardson et al. (2005)

We apply distribution-matching to Richardson et al.’s (2005) analysis of the association between annual returns and accruals to illustrate empirically that distribution-matching does not produce false results.Footnote 48 Richardson et al. find results consistent with prior research (notably Sloan 1996) and with their theory that lower-reliability accruals lead to lower earnings persistence that investors appear to not fully understand. We therefore expect that applying distribution-matching should similarly produce results consistent with prior research and theory; that is, distribution-matching should not falsify or bias these results. We follow procedures outlined in Section 3 and footnote 8 of Richardson et al. to obtain a sample as close as practicable to theirsFootnote 49 (the unadjusted accruals sample) and replicate the analysis presented in their Table 8, Panel B. Before distribution-matching, the average cross-sectional KS statistic rejects similarity of the $5-price-filtered reference distribution of returns (as described in footnote 49) and the accruals-sample distribution of returns at the 0.0917 level; differences are significant in 30 of the 40 sample years. After applying the KS-based distribution-matching approach, the average annual KS statistic is reduced to 0.0525, with an average p value of .6377. Thus the sample of returns obtained by requiring specified accounting data appears to be non-random relative to the $5-price-filtered reference distribution of returns, and distribution-matching lets the distribution approximate a randomly-drawn sample distribution. The distribution-matched sample has fewer observations (approximately 23,000 as opposed to over 105,000 in the unadjusted accruals sample).

The key test variables reported in Richardson et al.’s Table 8, Panel B, are ROA (coefficient = 0.09, t = 1.69), change in working capital (∆WC; coefficient = −0.30, t = −7.54), change in net non-current operating assets (∆NCO; coefficient = −0.27, t = −6.77) and change in net financial assets (∆FIN; coefficient = −0.054, t = −1.94). After distribution matching, we find the following (not tabulated): the coefficient on ROA is 0.16, t = 2.27; the coefficient on ∆WC is −0.31, t = −4.28; the coefficient on ∆NCO is −0.17, t = −3.01; the coefficient on ∆FIN is −0.02, t = −0.29. Using two-sample t-tests of differences in slope coefficient estimates from the full sample and the distribution-matched sample, we do not reject equality at lower than the 0.18 level. In other words, the distribution-matched sample yields results that are statistically indistinguishable from the full-sample results of Richardson et al.’s returns test. We believe these results complement our previous analyses by supporting an inference that distribution matching does not produce false results.

5.5 Eliminating the requirement of Value Line data

We re-estimate the Table 6 association tests after dropping the requirement that CofE firms be followed by Value Line (results not tabulated). For these tests, the CofE sample contains firms with the necessary IBES data to calculate four CofE metrics; the monthly average is 1980 firms, a substantial increase from the monthly average of 955 firms when we impose the Value Line requirement.Footnote 50 However, the larger IBES sample returns distribution remains reliably different from the reference sample of returns: the KS statistic for the unadjusted IBES CofE sample is 11.21% with average p value = 0.0018; 400 of 402 months have a p value of 0.10 or less. After distribution-matching using the 20% initial draw setting, the KS statistic is 0.0585 (average p value = 0.35; 135 months have a p value = 0.10 or less); when the initial draw is 100 firms the average KS statistic is 0.0496 (average p value = 0.89; one month has a p value = 0.10 or less). Average cross-sectional regression coefficients for all four IBES CofE metrics in the unadjusted IBES CofE sample are reliably negative (t-statistics between −2.96 and −5.03) and reliably different from 1 (t-statistics between −2.30 and −4.86). After distribution-matching using either 20% of the sample or 100 firms, average cross-sectional regression coefficients are reliably positive (t-statistics between 2.20 and 2.91) and, with the exception of the CT CofE metric, are not reliably different from 1. In summary, dropping the requirement of Value Line data increases the sample size, does not eliminate the non-randomness of the resulting CofE sample returns and does not systematically alter the associations of the CofE metrics with realized returns in the unadjusted sample.

6 Conclusions

This paper proposes and illustrates a practical solution to a pervasive issue in empirical-archival accounting research, namely, data-restricted samples that are non-randomly drawn from the reference sample to which the researcher would like to generalize results. Paired with an objective of maximizing the number of observations with values for all variables (“complete cases”), these non-random samples are effectively dictated by data availability. We describe, validate, and illustrate a distribution-matching technique that can be used to align the distribution of a non-random estimation sample with that of a reference sample. The foundation for this approach is resampling from the data-restricted non-random subsample to minimize the distance between the marginal sample distribution and the marginal reference distribution.

We illustrate the effectiveness of the distribution-matching approach in tests of associations between returns and five popular implied cost of equity (CofE) estimates. This setting is of interest in its own right, given the practical and theoretical importance of the association and the weak and mixed results in previous research. Our analysis shows that associations between realized returns and CofE metrics are influenced by the properties of the realized returns distribution used to estimate the associations. Our reference sample is CRSP firms with at least 12 consecutive monthly returns during 1976–2009; our test sample is firms with sufficient data to calculate the CofE measures. The latter sample is a substantially smaller, non-random subsample of the former. We first show that associations between realized returns and CofE metrics are weak or negative, as in prior research. After distribution-matching, so that the resulting returns distribution mimics the returns distribution in the reference sample, we find reliably positive correlations between realized returns and most CofE measures, as predicted by theory. This result suggests that several implied CofE measures used in the accounting literature have greater construct validity than prior results suggest.Footnote 51 We also discuss two alternative approaches: multiple imputation (which performs well as a potential alternative to distribution-matching in our setting, albeit at the cost of additional assumptions) and selection-type models (which do not perform well in our setting).

Viewed broadly, our analysis implies that non-randomness of samples resulting from data requirements may lead to conclusions that do not generalize to a reference sample selected by the researcher. We demonstrate how to use available information about a marginal reference distribution of one variable of interest (in our setting, realized returns) to construct samples that mimic a reference distribution more closely than can an unmodified sample whose composition is dictated by data requirements. Highlighting an important caveat to the goal of maximizing the size of a data-constrained research sample, our analysis suggests maximization of a data-constrained sample may not be goal-congruent with increasing the generalizability from such a sample.

Our findings suggest researchers might benefit, in terms of increasing the generalizability of results, from examining the impact of data requirements on the empirical distribution of the test model variables, in particular the variable whose distribution is most affected by the availability of other variables of interest. We believe the approach we discuss, modified to suit the specific research context, will assist future research by providing an explanation for weak or counter-intuitive results from data-restricted samples. Distribution-matching might also benefit future research by helping to coordinate across studies that address either similar questions using different samples. Comparisons of results across studies would be facilitated to the extent many researchers can define, construct and analyze a common reference sample.