Linear models enable powerful differential activity analysis in massively parallel reporter assays
Abstract
Background
Massively parallel reporter assays (MPRAs) have emerged as a popular means for understanding noncoding variation in a variety of conditions. While a large number of experiments have been described in the literature, analysis typically uses ad hoc methods. There has been little attention to comparing the performance of methods across datasets.
Results
We present the mpralm method, which we show is calibrated and powerful by analyzing its performance on multiple MPRA datasets. We show that it outperforms existing statistical methods for the analysis of this data type, in the first comprehensive evaluation of statistical methods on several datasets. We investigate theoretical and real-data properties of barcode summarization methods and show an unappreciated impact of summarization method for some datasets. Finally, we use our model to conduct a power analysis for this assay and show substantial improvements in power by performing up to 6 replicates per condition, whereas sequencing depth has a smaller impact; we recommend always using at least 4 replicates. An R package is available from the Bioconductor project.
Conclusions
Together, these results inform recommendations for differential analysis, general group comparisons, and power analysis and will help improve design and analysis of MPRA experiments.
Keywords
Massively parallel reporter assays; Statistics; Enhancer
Abbreviations
FET: Fisher’s exact test
LCL: Lymphoblastoid cell line
MPRA: Massively parallel reporter assay
Background
There are three broad categories of MPRA applications: (1) characterization studies, (2) saturation mutagenesis, and (3) differential analysis. (1) Characterization studies examine thousands of different putative regulatory elements that have a wide variety of sequence features and try to correlate these sequence features with measured activity levels [3, 4, 5, 6, 7, 8, 9, 10]. Typical statistical analyses use regression to study the impact of multiple features simultaneously. They also compare continuous activity measures or categorized (high/low) activity measures across groups using paired and unpaired t, rank, Fisher’s exact, and chi-squared tests. (2) Saturation mutagenesis studies look at only a few established enhancers and examine the impact on activity of every possible mutation at each base as well as interactions between these mutations [11, 12, 13, 14, 15, 16, 17]. Analyses have uniformly used linear regression where each position in the enhancer sequence is a predictor. (3) Differential analysis studies look at thousands of different elements, each of which has two or more versions. Versions can correspond to allelic versions of a sequence [18, 19, 20] or different environmental contexts [21], such as different cell or tissue types [22]. These studies have compared different sequence versions using paired t-tests, rank sum tests, and Fisher’s exact test (FET) (by pooling counts over biological replicates).
Despite the increasing popularity of this assay, guiding principles for statistical analysis have not been put forth. Researchers use a large variety of ad hoc methods for analysis. For example, there has been considerable diversity in the earlier stages of summarization of information over barcodes. Barcodes are viewed as technical replicates of the regulatory element sequences, and groups have considered numerous methods for summarizing barcode-level information into one activity measure per enhancer. On top of this, a large variety of statistical tests are used to make comparisons.
Recently, a method called QuASAR-MPRA was developed to identify regulatory sequences that have allele-specific activity [23]. This method uses a beta-binomial model to model RNA counts as a function of DNA counts, and it provides a means for identifying sequences that show a significant difference in regulatory activity between two alleles. While it provides a framework for two-group differential analysis within MPRAs, QuASAR-MPRA is limited in this regard because experiments might have several conditions and involve arbitrary comparisons.
To our knowledge, no method has been developed that provides tools for general-purpose differential analysis of activity measures from MPRA. General-purpose methods are ones that can flexibly analyze data from a range of study designs. We present mpralm, a method for testing for differential activity in MPRA experiments. Our method uses linear models as opposed to count-based models to identify differential activity. This approach provides desired analytic flexibility for more complicated experimental designs that necessitate more complex models. It also builds on an established method that has a solid theoretical and computational framework [24]. We show that mpralm can be applied to a wide variety of MPRA datasets and has good statistical properties related to type I error control and power. Furthermore, we examine proper techniques for combining information over barcodes and provide guidelines for choosing sample sizes and sequencing depth when considering power. Our method is open source and freely available in the mpra package for R on the Bioconductor repository [25].
Results
The structure of MPRA data and experiments
MPRA data consist of activity measurements for a set of putative regulatory sequences, henceforth referred to as “elements”. First, a plasmid library of oligos is constructed, where each element is coupled with a number of short DNA tags, or barcodes. This plasmid library is then transfected into one or more cellular contexts, either as free-floating plasmids or integrated into the genome [21]. Next, RNA output is measured using RNA sequencing, and DNA output, a proxy for element copy number, is measured using DNA sequencing (occasionally, element copy number is unmeasured), giving the data structure shown in Fig. 1. The log-ratio of RNA to DNA counts is commonly used as an activity outcome measure.
Since each element is measured across a number of barcodes, it is useful to summarize these data into a single activity measure a for a single element in a single sample. Multiple approaches have been proposed for this summarization step. We consider two approaches. First is averaging, where a log-ratio is computed for each barcode and then averaged across barcodes. This treats the different barcodes as technical replicates. The second approach is aggregation, where RNA and DNA counts are each summed across barcodes, followed by formation of a log-ratio. This approach effectively uses the barcodes to simply increase the sequencing counts for that element.
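In code, the two summarization strategies can be sketched as follows. This is a minimal illustration with made-up counts for one element in one sample; the mpra package's actual implementation differs in details:

```python
import numpy as np

def average_estimator(rna, dna):
    """Average estimator: per-barcode log-ratios, then mean across barcodes."""
    return np.mean(np.log(rna / dna))

def aggregate_estimator(rna, dna):
    """Aggregate estimator: sum counts across barcodes, then one log-ratio."""
    return np.log(rna.sum() / dna.sum())

# Hypothetical barcode-level counts for a single element in a single sample.
rna = np.array([12.0, 30.0, 25.0, 8.0])
dna = np.array([10.0, 20.0, 20.0, 10.0])
avg = average_estimator(rna, dna)
agg = aggregate_estimator(rna, dna)  # agg = log(75/60) = log(1.25)
```

Note that the two estimators generally disagree: averaging weighs each barcode's log-ratio equally, while aggregation implicitly weighs barcodes by their counts.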
Datasets
| Dataset | Description | Cell culture | Replicates | Barcodes |
|---|---|---|---|---|
| *Differential analysis: alleles* | | | | |
| Tewhey | Study of 39,479 oligos coming from 29,173 variants to follow up on prior eQTL results. Large initial oligo pool: 79k. Second pool: 7.5k. | NA12878 (LCL), NA19239 (LCL), HepG2 | NA12878: 5; NA19239: 3; HepG2: 5 | 79k pool: ∼73; 7.5k pool: ∼350 |
| Ulirsch | Study of 2756 variants in strong LD with 75 main variants to identify loci that affect RBC traits. | K562, K562 with GATA1 overexpr. | K562: 6; K562+GATA1: 4 | 14 |
| *Differential analysis: conditions* | | | | |
| Inoue | Comparison of episomal and lentiviral MPRA. | HepG2 | 3 | Max: 99 |
| Shen | Study of tissue specificity of cis-regulatory elements in vivo in mouse. | Mouse retina and cerebral cortex | 3 | ∼8 |
| *Saturation mutagenesis* | | | | |
| Melnikov | Two inducible enhancers: (1) a synthetic cAMP-regulated enhancer and (2) the virus-inducible interferon-beta enhancer. Single-hit scanning alters one base at a time; multi-hit sampling alters several bases at a time. | HEK293T | Single: 2; Multi: 2 | Single: 13; Multi: 1 |
| Kheradpour | Study of 2104 wild-type sequences and 3314 variant sequences containing targeted motif disruptions to understand base-level effects in motifs. | K562, HepG2 | 2 | 10 |
The variability of MPRA data depends on element copy number
It is well established that count data from RNA sequencing studies exhibit a mean-variance relationship [26]. On the log scale, low counts are more variable across replicates than high counts, at least partly due to inherent Poisson variation in the sequencing process [27, 28]. This relationship has been leveraged in both count-based analysis methods [29, 30] and, more recently, linear model-based methods [24]. For count-based methods, this mean-variance relationship helps improve dispersion estimates, and for linear model-based methods, the relationship allows for estimation of weights reflecting inherent differences in variability for count observations in different samples and genes.
Statistical modeling of MPRA data
To model MPRA data, we propose to use a simple variant of the voom methodology [24], proposed for the analysis of RNA sequencing data. This methodology is based on standard linear models, which are coupled with inverse variance weights representing the mean-variance relationship inherent in RNA sequencing data. The weights are derived from smoothing an empirical mean-variance plot. Similar to voom, we propose to use linear models to model log-ratio activity data from MPRAs, but we estimate weights by smoothing the relationship between the empirical variance of the log-ratios and log-DNA copy number, as depicted in Fig. 2. We call this approach mpralm.
The current literature on analysis of MPRA experiments contains many variant methods (see Background). To evaluate mpralm, we compare the method to the following variants used in the literature: QuASAR-MPRA, t-tests, and Fisher’s exact test (FET). QuASAR-MPRA is a recently developed method targeted at the differential analysis of MPRA data [23]. It specifically addresses a two-group differential analysis where the two groups are elements with two alleles, and it uses the base-calling error rate in the model formulation. It collapses count information across samples to create three pieces of information for each element: one count of RNA reads for the reference allele, one count of RNA reads for the alternate allele, and one proportion giving the fraction of DNA reads corresponding to the reference allele. Fisher’s exact test similarly collapses count information across samples. To test for differential activity, a 2-by-G table is formed with RNA/DNA designation along one dimension and condition designation (with G groups) along the second dimension. The t-test operates on the log-ratio outcomes directly; we use the aggregate estimator to summarize over barcodes. Either a paired or unpaired t-test is used, based on the experimental design.
Both edgeR and DESeq2 are popular methods for the analysis of RNA-sequencing data represented as counts. The two methods are built on negative binomial models, and both attempt to borrow information across genes. These methods allow for the inclusion of an offset. Because both methods use a logarithmic link function, including log-DNA as an offset allows for the modeling of log-ratios of RNA to DNA. This makes these methods readily applicable to the analysis of MPRA data, and they carry many of the same advantages as mpralm. In addition to QuASAR, t-tests, and Fisher’s exact test, we examine the performance of edgeR and DESeq2 for differential activity analysis in our evaluations. We point out that although our application of edgeR and DESeq2 to MPRA data is straightforward, they have not been used in this way so far in the literature. Tewhey et al. [19] use DESeq2 to perform differential expression analysis of RNA counts relative to DNA counts within a single condition. This assesses whether the regulatory elements have activating or repressive activity, but it does not assess whether the activity of regulatory elements differs between conditions. We remind the community of the ability to use offsets in the edgeR and DESeq2 negative binomial models and explore a new use of these models for MPRA data.
mpralm is a powerful method for differential analysis
First, we focus on evaluating the performance of mpralm for differential analysis. We compare it to QuASAR-MPRA, t-tests, Fisher’s exact test, edgeR, and DESeq2. We use four of the previously discussed studies: the Tewhey, Inoue, Ulirsch, and Shen studies. Two of these studies (Tewhey, Ulirsch) focus on comparing the activity of elements with two alleles, whereas the other two (Inoue, Shen) compare the activity of each element in two different conditions. For the allelic studies, we use a random effects model for mpralm and paired t-tests. In the random effects model, we estimate the correlation between the multiple alleles for a given single nucleotide polymorphism (SNP). This correlation estimate improves the estimation of element-wise variances used in testing for differences between conditions. Both Tewhey et al. [19] and Ulirsch et al. [18] compare alleles in different cellular contexts; we observe similar behavior in all contexts across all evaluations (data not shown) and have therefore chosen to depict results from one cellular context for each of these studies. For the Tewhey dataset, we depict results both from a large pool of elements used for initial screening and from a smaller, targeted pool.
Observed type I error rates for low nominal error rates
| Nominal error rate | edgeR | DESeq2 | QuASAR | mpralm (agg) | mpralm (avg) | FET | t-test |
|---|---|---|---|---|---|---|---|
| **Tewhey large** | | | | | | | |
| 1.62e-05 | 8.21e-04 (5/6088) | 8.22e-04 (5/6084) | 1.97e-03 (12/6088) | 1.64e-04 (1/6088) | NA | 2.82e-01 (1718/6088) | 0 (0/6088) |
| 1.03e-05 | 6.57e-04 (4/6088) | 8.22e-04 (5/6084) | 1.64e-03 (10/6088) | 0 (0/6088) | NA | 2.73e-01 (1663/6088) | 0 (0/6088) |
| 7.25e-06 | 4.93e-04 (3/6088) | 8.22e-04 (5/6084) | 1.64e-03 (10/6088) | 0 (0/6088) | NA | 2.66e-01 (1618/6088) | 0 (0/6088) |
| 6.25e-06 | 4.93e-04 (3/6088) | 6.57e-04 (4/6084) | 1.64e-03 (10/6088) | 0 (0/6088) | NA | 2.62e-01 (1597/6088) | 0 (0/6088) |
| **Tewhey targeted** | | | | | | | |
| 1.62e-05 | 5.71e-04 (2/3503) | 5.71e-04 (2/3503) | 1.71e-03 (6/3503) | 5.71e-04 (2/3501) | NA | 2.57e-02 (90/3503) | 0 (0/3503) |
| 1.03e-05 | 5.71e-04 (2/3503) | 5.71e-04 (2/3503) | 1.71e-03 (6/3503) | 5.71e-04 (2/3501) | NA | 2.37e-02 (83/3503) | 0 (0/3503) |
| 7.25e-06 | 5.71e-04 (2/3503) | 5.71e-04 (2/3503) | 1.43e-03 (5/3503) | 5.71e-04 (2/3501) | NA | 2.14e-02 (75/3503) | 0 (0/3503) |
| 6.25e-06 | 5.71e-04 (2/3503) | 5.71e-04 (2/3503) | 1.43e-03 (5/3503) | 5.71e-04 (2/3501) | NA | 2.00e-02 (70/3503) | 0 (0/3503) |
| **Inoue** | | | | | | | |
| 1.62e-05 | 1.64e-03 (4/2440) | 2.05e-03 (5/2440) | 4.10e-04 (1/2440) | 0 (0/2440) | 4.10e-04 (1/2440) | 3.69e-03 (9/2440) | 4.10e-04 (1/2440) |
| 1.03e-05 | 1.64e-03 (4/2440) | 2.05e-03 (5/2440) | 4.10e-04 (1/2440) | 0 (0/2440) | 4.10e-04 (1/2440) | 3.28e-03 (8/2440) | 4.10e-04 (1/2440) |
| 7.25e-06 | 8.20e-04 (2/2440) | 2.05e-03 (5/2440) | 4.10e-04 (1/2440) | 0 (0/2440) | 4.10e-04 (1/2440) | 2.87e-03 (7/2440) | 4.10e-04 (1/2440) |
| 6.25e-06 | 8.20e-04 (2/2440) | 2.05e-03 (5/2440) | 4.10e-04 (1/2440) | 0 (0/2440) | 4.10e-04 (1/2440) | 2.46e-03 (6/2440) | 4.10e-04 (1/2440) |
| **Ulirsch** | | | | | | | |
| 1.62e-05 | 0 (0/2756) | 0 (0/2756) | 1.09e-03 (3/2756) | 0 (0/2756) | 0 (0/2756) | 4.14e-01 (1140/2756) | 0 (0/2756) |
| 1.03e-05 | 0 (0/2756) | 0 (0/2756) | 7.26e-04 (2/2756) | 0 (0/2756) | 0 (0/2756) | 4.06e-01 (1119/2756) | 0 (0/2756) |
| 7.25e-06 | 0 (0/2756) | 0 (0/2756) | 7.26e-04 (2/2756) | 0 (0/2756) | 0 (0/2756) | 3.96e-01 (1092/2756) | 0 (0/2756) |
| 6.25e-06 | 0 (0/2756) | 0 (0/2756) | 7.26e-04 (2/2756) | 0 (0/2756) | 0 (0/2756) | 3.93e-01 (1083/2756) | 0 (0/2756) |
| **Shen** | | | | | | | |
| 1.62e-05 | 2.45e-03 (8/3264) | 1.67e-02 (51/3052) | 2.98e-04 (1/3358) | 0 (0/3286) | 3.00e-04 (1/3337) | 7.39e-01 (2412/3264) | 0 (0/3264) |
| 1.03e-05 | 2.45e-03 (8/3264) | 1.54e-02 (47/3052) | 0 (0/3358) | 0 (0/3286) | 3.00e-04 (1/3337) | 7.33e-01 (2393/3264) | 0 (0/3264) |
| 7.25e-06 | 2.14e-03 (7/3264) | 1.31e-02 (40/3052) | 0 (0/3358) | 0 (0/3286) | 3.00e-04 (1/3337) | 7.30e-01 (2384/3264) | 0 (0/3264) |
| 6.25e-06 | 1.84e-03 (6/3264) | 1.21e-02 (37/3052) | 0 (0/3358) | 0 (0/3286) | 0 (0/3337) | 7.29e-01 (2379/3264) | 0 (0/3264) |
To investigate the tradeoff between observed power (number of rejected tests) and type I error rates, we combine these quantities in two ways: (1) we look at the number of rejections as a function of observed type I error rates and (2) we look at estimated FDR as a function of the number of rejections. Specifically, we use the error rates we estimate using the simulation (Fig. 4) to adjust the comparison of the observed number of rejections (Fig. 3).
In conclusion, we observe that Fisher’s exact test has an inflated type I error rate and that QuASAR-MPRA is underpowered; based on these results, we cannot recommend either method. t-tests perform better than these two methods but are still outperformed by mpralm, edgeR, and DESeq2. These three methods have similar performance, but mpralm appears to perform slightly better than the latter two in terms of consistency of type I error calibration and power.
Comparison of element rankings between methods
While power and error calibration are important evaluation metrics for a differential analysis method, they do not relate directly to element rankings, which are often of practical importance. For the top-performing methods in our evaluations (mpralm, t-test, edgeR, and DESeq2), we examine rankings in more detail.
Accuracy of activity measures and power of differential analysis depends on summarization technique over barcodes
MPRA data initially contain count information at the barcode level, but we typically desire information summarized at the element level for the analysis stage. We examine the theoretical properties of two summarization methods: averaging and aggregation. Under the assumption that DNA and RNA counts follow a count distribution with a mean-variance relationship, we first show that averaging results in activity estimates with more bias. Second, we examine the real-data performance of these summarization techniques.
Let R_{b} and D_{b} denote the RNA and DNA count, respectively, for barcode b=1,…,B for a putative regulatory element in a given sample. We suppress the dependency of these counts on sample and element. Typically, B is approximately 10 to 15 (for examples, see Table 1). We assume that R_{b} has mean μ_{r} and variance k_{r}μ_{r} and that D_{b} has mean μ_{d} and variance k_{d}μ_{d}. Typically the constants k_{d} and k_{r} are greater than 1, modeling overdispersion. Negative binomial models are a particular case with k=1+ϕμ, where ϕ is an overdispersion parameter. Also let N_{d} and N_{r} indicate the library size for DNA and RNA, respectively, in a given sample. Let p_{d} and p_{r} indicate the fraction of reads mapping to element e for DNA and RNA, respectively, in a given sample so that μ_{r}=N_{r}p_{r} and μ_{d}=N_{d}p_{d}. Let a be the true activity measure for element e defined as a:= log(p_{r}/p_{d}). When performing total count normalization, the RNA and DNA counts are typically scaled to a common library size L.
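A small simulation under this count model illustrates the bias difference between the two estimators. The parameter values below (μ_r = 60, μ_d = 40, ϕ = 0.05, B = 10) are illustrative assumptions, not values from the studies analyzed here:

```python
import numpy as np

rng = np.random.default_rng(0)

def nb_draw(mu, phi, size):
    # Negative binomial with mean mu and variance mu * (1 + phi * mu),
    # matching the k = 1 + phi * mu parameterization in the text.
    r = 1.0 / phi
    return rng.negative_binomial(r, r / (r + mu), size=size)

B, nsim = 10, 100_000               # barcodes per element, simulated elements
mu_r, mu_d, phi = 60.0, 40.0, 0.05  # illustrative means and overdispersion
true_a = np.log(mu_r / mu_d)        # true activity a = log(p_r / p_d)

rna = nb_draw(mu_r, phi, (nsim, B))
dna = nb_draw(mu_d, phi, (nsim, B))
avg = np.mean(np.log(rna / dna), axis=1)         # average estimator
agg = np.log(rna.sum(axis=1) / dna.sum(axis=1))  # aggregate estimator
bias_avg = avg.mean() - true_a
bias_agg = agg.mean() - true_a
```

Because aggregation sums counts before taking the logarithm, the nonlinearity of the log acts on much larger (hence relatively less variable) quantities, shrinking the Jensen-type bias relative to averaging per-barcode log-ratios.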
Aggregation technique affects power in a differential analysis. In the last three columns of Figs. 3, 4, 5, and 6, we compare aggregation to averaging using mpralm. The two estimators have similar type I error rates but very different detection rates between datasets. The average estimator is more favorable for the Ulirsch and Shen datasets, and the aggregate estimator is more favorable in the Inoue dataset.
Recommendations for sequencing depth and sample size
Discussion
The field of MPRA data analysis has been fragmented and consists of a large collection of study-specific ad hoc methods. Our objective in this work has been to provide a unified framework for the analysis of MPRA data. Our contributions can be divided into three areas. First, we have investigated techniques for summarizing information over barcodes. In the literature, these choices have always been made without justification and have varied considerably between studies. Second, we have developed a linear model framework, mpralm, for powerful and flexible differential analysis. To our knowledge, this is the second manuscript evaluating methods for statistical analysis in MPRA studies. The first proposed the QuASAR-MPRA method [23], which we show to have worse performance than mpralm. We provide the largest and most comprehensive comparison of analysis methods so far; earlier work used only two datasets for comparisons. Third, we have analyzed the impact of sequencing depth and number of replicates on power. To our knowledge, this is the first mathematically based power investigation, and we expect this information to be useful in the design of MPRA studies.
The activity of a regulatory element can be quantified with the log-ratio of RNA counts to DNA counts. In the literature, groups have generally taken two approaches to summarizing barcode information to obtain one such activity measure per element per sample. One approach is to add RNA and DNA counts from all barcodes to effectively increase sequencing depth for an element. This is termed the aggregate estimator. Another approach is to compute the log-ratio measure for each barcode and use an average of these measures as the activity score for an element. This is termed the average estimator, and we have shown that it is more biased than the aggregate estimator. Because of this bias, we caution against the use of the average estimator when comparing activity scores in enhancer groups (often defined by sequence features). However, it is unclear which of the two estimators is more appropriate for differential analysis.
In addition to barcode summarization recommendations, we have proposed a linear model framework, mpralm, for the differential analysis of MPRA data. Our evaluations show that it produces calibrated p-values and is as or more powerful than existing methods used in the literature.
While the count-based tools edgeR and DESeq2 would seem like natural methods for the analysis of MPRA data, they have not yet been used for differential analysis of MPRA activity measures between groups. There has been some use of DESeq2 to identify (filter) elements with regulatory activity (differential expression of RNA relative to DNA) [19, 32]. However, these tools have not been used for comparisons of activity measures between groups. In this work, we propose the use of log-DNA offsets as a sensible use of this software for differential analysis. In our evaluations, we see that this approach is most competitive with mpralm. For the allelic studies [18, 19], we observe that the degree of within-sample correlation affects the power of mpralm relative to comparison methods. In particular, there is little difference in the performance of the different methods for the Tewhey large pool experiment, which had overall low within-sample correlation. Both the Tewhey targeted pool experiment and the Ulirsch experiment had larger within-sample correlations, and we observe that mpralm has increased power over the comparison methods for these datasets. We expect that mpralm will generally be more powerful for paired designs with high within-pair correlations.
In terms of element rankings, mpralm, edgeR, and DESeq2 are similar. However, we observe a substantial difference in ranking between t-tests and mpralm and believe that top-ranked mpralm elements exhibit better properties than those from t-tests.
Linear models come with the analytic flexibility necessary to handle diverse MPRA designs. Paired designs involving alleles, for example, are easily handled with linear mixed effects models, which remain computationally tractable in this framework. The studies we have analyzed here only consider two alleles per locus. It is possible to have more than two alleles at a locus; such a situation cannot be addressed with paired t-tests but is easily analyzed using mpralm. This is important because we believe such studies will eventually become routine for understanding results from genome-wide association studies. We note that for allelic studies, it is often of interest to filter detections of significant differences to those cases where at least one allele appears to show regulatory activity. This is not inherent in the mpralm method, but it is possible to screen for regulatory activity with a conventional count-based differential analysis of RNA counts versus DNA counts using methods such as voom, edgeR, or DESeq2.
While we have focused on characterizing the mpralm linear model framework for differential analysis, it is possible to include variance weights in the multivariate models used in saturation mutagenesis and characterization studies. We expect that modeling the copy number-variance relationship will improve the performance of these models.
For power, we find a substantial impact of even small increases in sample size. This is an important observation because many MPRA studies use 2 or 3 replicates per group, and our results suggest that power can be substantially increased with even a modest increase in sample size. We caution that using fewer than 4 replicates can leave a study quite underpowered.
In short, the tools and ideas set forth here will aid in making rigorous conclusions from a large variety of future MPRA studies.
Conclusions
We have observed differences in the MPRA activity estimates resulting from the averaging and aggregation methods of summarizing counts across barcodes. For this reason, we recommend that practitioners perform sensitivity analyses for their results by using both of these estimation procedures. The mpralm linear model framework appears to have calibrated type I error rates and to be as or more powerful than the t-tests and Fisher’s exact tests that have primarily been used in the literature. mpralm has similar performance to the variations of the edgeR and DESeq2 methods that we introduce here. These variations involve including DNA counts as offsets in the RNA differential analysis procedures. We recommend any of these three methods for unpaired differential analysis settings (such as tissue comparison studies), but we recommend mpralm for allelic studies due to its ability to better model the paired nature of the alleles with mixed models. Finally, we recommend that practitioners use at least 4 samples per condition for reasonable power to detect differences for top-ranking elements.
Methods
Data
See Table 1. Dataset labels used in figures are accompanied by short descriptions below.
Melnikov: Study of the base-level impact of mutations in two inducible enhancers in humans [12]: a synthetic cAMP-regulated enhancer (CRE) and a virus-inducible interferon-beta enhancer (IFNB). We do not look at the IFNB data because it contains only one sample. We consider 3 datasets:
Melnikov: CRE, single-hit, induced state: Synthetic cAMP-regulated enhancer, single-hit scanning, induced state.
Melnikov: CRE, multi-hit, uninduced state: Synthetic cAMP-regulated enhancer, multi-hit sampling, uninduced state.
Melnikov: CRE, multi-hit, induced state: Synthetic cAMP-regulated enhancer, multi-hit sampling, induced state.
Kheradpour: Study of the base-level impact of mutations in various motifs [15]. Transfection into HepG2 and K562 cells.
Tewhey: Study of allelic effects in eQTLs [19]. Transfection into two lymphoblastoid cell lines (NA12878 and NA19239) as well as HepG2. In addition, two pools of plasmids are considered: a large screening pool and a smaller, targeted pool designed based on the results of the large pool. We use data from both the large and the targeted pool in NA12878.
Inoue: chromosomal vs. episomal: Comparison of episomal and chromosomally integrated constructs [21]. This study uses a wild-type and a mutant integrase to study the activity of a fixed set of putative regulatory elements in an episomal and a chromosomally integrated setting, respectively.
Ulirsch: Study of allelic effects in GWAS to understand red blood cell traits [18]. Transfection into K562 cells as well as K562 with GATA1 overexpressed. We use the data from K562.
Shen: mouse retina vs. cortex: Comparison of cis-regulatory elements in vivo in mouse retina and cerebral cortex [22]. Candidate CREs that tile targeted regions are assayed in vivo in these two mouse tissues with adeno-associated virus delivery.
Simulation of data for type I error rate estimation
1. Compute log-ratio activity measures from the original RNA and DNA counts.
2. Calculate and save the element-wise residual standard deviations of the log-ratios after mean-centering them in each group. Calculate and save the mean of the original log-ratios in group 1. These element-wise means will become the mean for both groups in the new null data.
3. Standardize the log-ratios in each group to have mean zero and unit variance.
4. The standardized log-ratios from both groups all have mean zero and unit variance. Resample these standardized residuals without replacement for each element in each sample. For paired (allelic) studies, resample each allele-allele pair without replacement.
5. Multiply the resampled residuals by the element-wise residual standard deviations, and add the original group 1 element-specific means. This creates identically distributed log-ratios in both comparison groups.
6. Retain the original DNA counts from all samples.
7. Compute RNA counts using the original DNA counts and the resampled log-ratio activity measures.
8. The original DNA counts and the new RNA counts form the new synthetic dataset.
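The steps above can be sketched as follows. This is one possible reading of the procedure (in particular the exact permutation unit in step 4 is an assumption), with simulated counts standing in for real data:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_null(rna, dna, group):
    """Generate a null dataset (no group differences) from observed counts.

    rna, dna: (elements x samples) count matrices; group: 0/1 label per sample.
    """
    logratio = np.log(rna / dna)                      # step 1
    centered = logratio.copy()
    for g in (0, 1):                                  # step 2: center per group
        cols = group == g
        centered[:, cols] -= centered[:, cols].mean(axis=1, keepdims=True)
    resid_sd = centered.std(axis=1, ddof=1, keepdims=True)
    mean_g1 = logratio[:, group == 0].mean(axis=1, keepdims=True)
    standardized = centered / resid_sd                # step 3
    # Step 4: resample standardized residuals without replacement.
    permuted = rng.permutation(standardized.ravel()).reshape(standardized.shape)
    new_logratio = permuted * resid_sd + mean_g1      # step 5
    new_rna = dna * np.exp(new_logratio)              # steps 6-8
    return new_rna, dna

# Simulated example: 50 elements, 3 samples per group.
rna = rng.poisson(100, size=(50, 6)) + 1.0
dna = rng.poisson(100, size=(50, 6)) + 1.0
group = np.array([0, 0, 0, 1, 1, 1])
new_rna, new_dna = simulate_null(rna, dna, group)
```

By construction, the DNA counts are untouched and the new log-ratios are identically distributed in both groups, so any rejection by a method run on the synthetic data counts as a type I error.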
Count preprocessing
DNA and RNA counts are scaled to have the same library size before running any methods. We perform minimal filtering on the counts to remove elements from the analysis that have low counts across all samples. Specifically, we require that DNA counts be at least 10 in all samples to avoid instability of the log-ratio activity measures. We also remove elements in which these log-ratios are identical across all samples; in practice, this only happens when the RNA counts are zero in all samples. Both filtering steps remove clear outliers in the copy number-variance plot (Fig. 2).
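A minimal sketch of this preprocessing, assuming total count normalization to a common library size (the choice of the common target and the order of operations are assumptions; the mpra package's implementation may differ in details):

```python
import numpy as np

def preprocess(rna, dna, min_dna=10.0):
    """Total-count normalize, then drop low-information elements (a sketch)."""
    L = dna.sum(axis=0).mean()                      # common library size target
    rna = rna * L / rna.sum(axis=0)                 # scale each RNA sample
    dna = dna * L / dna.sum(axis=0)                 # scale each DNA sample
    keep = (dna >= min_dna).all(axis=1)             # DNA at least 10 everywhere
    with np.errstate(divide="ignore"):
        logratio = np.log(rna / dna)
    constant = (logratio == logratio[:, :1]).all(axis=1)
    keep &= ~constant                               # drop flat log-ratio rows
    return rna[keep], dna[keep]

# Simulated example with two elements rigged to be filtered out.
rng = np.random.default_rng(3)
rna = rng.poisson(80, size=(100, 4)).astype(float)
dna = rng.poisson(80, size=(100, 4)).astype(float)
rna[0] = 0.0   # no RNA reads anywhere: log-ratios identical, removed
dna[1] = 2.0   # low DNA in all samples: removed
rna_f, dna_f = preprocess(rna, dna)
```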
Modeling
The square root of the standard deviation of the log-ratios over samples is taken as a function of average log-DNA levels over samples, and this relationship is fit with a lowess curve. Predicted variances are inverted to form observation-level precision weights. Log-ratio activity measures and weights are used in the voom analysis pipeline. For the allelic studies, a mixed model is fit for each element using the duplicateCorrelation module in the limma Bioconductor package [33].
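The weighting step can be sketched in Python with a lowess smoother. This is an analogue of the procedure just described, not the mpralm implementation itself (which uses the limma/voom machinery in R), and the data are simulated so that variability shrinks with DNA copy number:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(4)

# Simulated log-ratios whose spread shrinks as DNA copy number grows.
n_elements, n_samples = 300, 5
log_dna = rng.uniform(3.0, 8.0, size=(n_elements, n_samples))
mean_ldna = log_dna.mean(axis=1)
scale = 2.0 * np.exp(-0.3 * mean_ldna) + 0.1
logratio = rng.normal(0.0, scale[:, None], size=(n_elements, n_samples))

# Smooth sqrt(sd of log-ratios) against mean log-DNA with a lowess curve.
sd = logratio.std(axis=1, ddof=1)
trend = lowess(np.sqrt(sd), mean_ldna, frac=0.5, return_sorted=True)

# Interpolate the trend at each observation's log-DNA; invert to weights.
pred_sqrt_sd = np.interp(log_dna, trend[:, 0], trend[:, 1])
weights = 1.0 / pred_sqrt_sd ** 4   # precision weight = 1 / predicted variance
```

The resulting weights down-weight observations with low DNA copy number, where log-ratios are noisiest, before the linear model is fit.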
This linear model approach has a number of advantages. (1) It is flexible to different functional forms of the variancecopy number relationship. (2) It allows for a unified approach to modeling many different types of MPRA design using the power of design matrices. (3) It allows for borrowing of information across elements using empirical Bayes techniques. (4) It allows for different levels of correlation between elements using random effects.
mpralm enables modeling for complex comparisons
For example, to compare allelic effects between two conditions, we can model the activity Y of element e for allele a in sample i as

Y_{eia} = β_{0e} + β_{1e} x_{a} + β_{2e} z_{i} + β_{3e} x_{a} z_{i} + b_{i} + ε_{eia},

where x_{a} indicates the alternate allele and z_{i} indicates the condition. This is a linear mixed effects model for activity measures, where b_{i} is a random effect that induces correlation between the two alleles measured within the same sample. We can perform inference on the β_{3e} parameters to determine differential allelic effects. Such a model is easy to fit within the mpralm framework, since our framework supports model specifications by general design matrices. In contrast, this question cannot be formulated in the QuASAR, t-test, and Fisher’s exact test frameworks. Neither edgeR nor DESeq2 supports the fitting of mixed effects models.
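As an illustration, an interaction model of this kind can be fit with a generic mixed-model routine. The Python sketch below uses statsmodels' MixedLM as an analogue; it is not the duplicateCorrelation approach mpralm actually uses, and the data, effect sizes, and variance components are simulated assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# Simulated activity for one element: two alleles per sample, two conditions.
rows = []
for i in range(10):
    cond = 0 if i < 5 else 1
    b_i = rng.normal(0.0, 0.3)              # sample-level random effect
    for allele in (0, 1):
        # beta3 = 0.4: the allelic effect differs between conditions.
        mu = 0.5 + 0.2 * allele + 0.1 * cond + 0.4 * allele * cond
        rows.append({"sample": i, "allele": allele, "cond": cond,
                     "y": mu + b_i + rng.normal(0.0, 0.1)})
df = pd.DataFrame(rows)

# A random intercept per sample induces correlation between paired alleles;
# the allele:cond coefficient estimates the differential allelic effect.
fit = smf.mixedlm("y ~ allele * cond", df, groups="sample").fit()
beta3 = fit.params["allele:cond"]
```

Inference on `beta3` then answers whether the allelic effect differs between the two conditions, the comparison that cannot be expressed in the QuASAR, t-test, or FET frameworks.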
Running mpralm, QuASAR, t-test, Fisher’s exact test, edgeR, and DESeq2
For all methods, DNA and RNA counts were first corrected for library size with total count normalization. For edgeR and DESeq2, DNA counts were included as offset terms on the log scale before standard analysis. For the t-test, we computed the aggregate estimator of the log-ratio as the outcome measure. For Fisher’s exact test, we summed DNA and RNA counts in the two conditions to form a 2-by-2 table as input to the procedure. For QuASAR-MPRA, we summed RNA counts in each condition to get one reference condition count and one alternative condition count per element. We also summed DNA counts in all samples and in the reference condition to get one DNA proportion for each element. These were direct inputs to the method.
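The pooling used for Fisher's exact test can be sketched for a two-condition comparison of a single element (the counts are made up):

```python
import numpy as np
from scipy.stats import fisher_exact

# Per-replicate counts for one element in two conditions.
rna = {"cond1": np.array([120, 95, 110]), "cond2": np.array([60, 70, 55])}
dna = {"cond1": np.array([100, 90, 105]), "cond2": np.array([95, 100, 90])}

# Pool counts over replicates within each condition to form a 2-by-2 table:
# rows are RNA/DNA, columns are the two conditions.
table = [[rna["cond1"].sum(), rna["cond2"].sum()],
         [dna["cond1"].sum(), dna["cond2"].sum()]]
odds_ratio, p_value = fisher_exact(table)
```

Because pooling discards replicate-to-replicate variability, the resulting test treats all counts as if they came from a single replicate, which is one source of the inflated type I error rates seen in our evaluations.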
Metrics used for method comparison

Shape of p-value distributions. Calibrated differential analysis methods produce a characteristic p-value distribution: the majority of p-values, corresponding to null comparisons, are uniformly distributed, with a peak at low p-values for the truly differential comparisons. We compare the methods with regard to this expected shape.

Type I error rates. For all methods and datasets, we estimate via realistic simulation (described above) the proportion of truly null comparisons in which we reject the null hypothesis.

Number of rejections as a function of type I error rate. To some degree, this is a comparison of power between methods. We cannot compare power by comparing the heights of the peaks near zero in the p-value distributions, because those plots say nothing about type I error rates. Instead, we fix the type I error rate and compare the number of rejections made by the different methods. For a given nominal type I error rate, we computed (1) the number of rejections made by a method and (2) the estimated true type I error rate. For example, a conservative method might have a true type I error rate of 0.03 at a nominal level of 0.05. Quantity (2) is plotted on the x-axis of Fig. 5, and quantity (1) on the y-axis. Curves that lie above others indicate more detections for a fixed type I error rate.

False discovery rates. These are defined as the fraction of rejections that are false positives. The estimation of these FDRs is described below.

Metrics that describe top-ranking elements. The metrics above focus on type I error rates and power. To compare the element rankings produced by the different methods, we took elements that were ranked in the top 200 by one method but not the other (comparisons were performed pairwise, with mpralm always one of the two methods compared). The metrics measured the magnitude of the RNA counts, DNA counts, estimated log-ratio activity measures, effect sizes (differences in activity between groups), and residual standard deviations of the activity measures. Ideally, the RNA counts, DNA counts, log-ratios, and effect sizes are high in top-ranking elements. Ideally, top-ranking elements also have low residual standard deviations, but not so low as to have been underestimated; variability is commonly underestimated when counts are uniformly low across samples.
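The rejections-versus-type-I-error comparison above reduces to a simple calculation when, as in the simulations, the truly null tests are known. This sketch (hypothetical function, not from the paper's code) pairs the estimated true type I error rate with the rejection count at each nominal level:

```python
def rejection_curve(null_pvals, all_pvals, alphas):
    """For each nominal level alpha, return (estimated true type I error
    rate, number of rejections). The true type I error rate is the
    fraction of truly null p-values at or below alpha; rejections are
    counted over all tests. Plotting the pairs gives curves where, at a
    fixed true type I error rate, higher curves detect more elements."""
    curve = []
    for a in alphas:
        true_t1e = sum(p <= a for p in null_pvals) / len(null_pvals)
        n_reject = sum(p <= a for p in all_pvals)
        curve.append((true_t1e, n_reject))
    return curve
```

Plotting true (rather than nominal) type I error on the x-axis is what makes anti-conservative methods comparable with calibrated ones.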
Estimation of FDR
The proportion of truly null hypotheses, π_{0}, for each dataset was estimated using the “lfdr” method of the propTrueNull function in limma [31]. As is common with π_{0} estimation procedures, the p-values resulting from a statistical analysis are used in the estimation. Specifically, π_{0} was estimated from the p-values produced by mpralm, the t-test, QuASAR, edgeR, and DESeq2, and the median of these estimates was used as the estimate of π_{0} for a given dataset. Fisher’s exact test was excluded because it gave a π_{0} estimate considerably smaller than the other methods, which was dubious in light of its uncontrolled type I error rate. We multiply the number of tests by these π_{0} estimates to estimate the number of truly null hypotheses. We then multiply by our estimate of the true dataset-specific type I error rate (as shown in Fig. 4) to estimate the number of false positives. Dividing by the number of rejected hypotheses at a given nominal significance level gives the estimated FDRs in Fig. 6.
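The arithmetic of this FDR estimate is compact enough to state directly. The function below is a sketch of the recipe just described (the names are illustrative):

```python
def estimated_fdr(n_tests, pi0, true_type1_rate, n_rejections):
    """Estimated FDR: the expected number of false positives (number of
    truly null tests, n_tests * pi0, times the estimated true type I
    error rate) divided by the number of rejected hypotheses."""
    false_positives = n_tests * pi0 * true_type1_rate
    return false_positives / n_rejections
```

For example, with 1000 tests, an estimated π_{0} of 0.8, a true type I error rate of 0.05, and 100 rejections, the estimated FDR is 800 × 0.05 / 100 = 0.4.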
Bias and variance of estimators
We suppress the dependency of these parameters on sample and element. Library sizes are denoted N, the fraction of reads coming from a given element is denoted p, dispersion parameters are denoted k, and the common library size resulting from total count normalization is denoted L. The true activity measure of a given element is a := log(p_{r}/p_{d}), where p_{r} and p_{d} are the RNA and DNA read fractions, respectively.
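As a small illustration of this notation (an added example, not from the paper): under total count normalization, the expected normalized count for an element is L times its read fraction p, so plugging expected counts into the log-ratio recovers the true activity measure a.

```python
import math

# Common library size L and element read fractions for RNA and DNA.
L = 1_000_000
p_r, p_d = 2e-4, 1e-4

# True activity measure a := log(p_r / p_d).
a = math.log(p_r / p_d)

# Plug-in log-ratio at the expected normalized counts L*p_r and L*p_d;
# the library size cancels, leaving the true activity measure.
a_hat = math.log((L * p_r) / (L * p_d))
```

With observed rather than expected counts, a_hat is only an estimator, and its bias and variance depend on N, p, and the dispersion k, which is the subject of this section.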
Acknowledgements
The authors would like to thank Michael Ziller for sharing his insights.
Funding
Research reported in this publication was supported by the National Cancer Institute and the National Institute of General Medical Sciences of the National Institutes of Health under award numbers U24CA180996 and R01GM121459. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Availability of data and material
The mpralm method described here is implemented in the mpra package for R on the Bioconductor repository [25].
All of the data used in this study are available on GEO. The Tewhey data is available under GSE75661, Inoue data under GSE83894, Melnikov data under GSE31982, Kheradpour data under GSE33367, Shen data under GSE68247, and Ulirsch data under GSE87711.
Authors’ contributions
LM and KDH developed the mpralm method and conceived the evaluations. LM carried out the analyses and evaluations. DA and LG provided guidance for handling allelic studies. All authors contributed to writing the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
 1. White MA. Understanding how cis-regulatory function is encoded in DNA sequence using massively parallel reporter assays and designed sequences. Genomics. 2015; 106:165–70. https://doi.org/10.1016/j.ygeno.2015.06.003.
 2. Melnikov A, Zhang X, Rogov P, Wang L, Mikkelsen TS. Massively parallel reporter assays in cultured mammalian cells. J Vis Exp. 2014. https://doi.org/10.3791/51719.
 3. Grossman SR, Zhang X, Wang L, Engreitz J, Melnikov A, Rogov P, Tewhey R, Isakova A, Deplancke B, Bernstein BE, Mikkelsen TS, Lander ES. Systematic dissection of genomic features determining transcription factor binding and enhancer function. PNAS. 2017; 114:1291–300. https://doi.org/10.1073/pnas.1621150114.
 4. Maricque BB, Dougherty J, Cohen BA. A genome-integrated massively parallel reporter assay reveals DNA sequence determinants of cis-regulatory activity in neural cells. Nucleic Acids Res. 2017; 45:16. https://doi.org/10.1093/nar/gkw942.
 5. Ernst J, Melnikov A, Zhang X, Wang L, Rogov P, Mikkelsen TS, Kellis M. Genome-scale high-resolution mapping of activating and repressive nucleotides in regulatory regions. Nat Biotechnol. 2016; 34:1180–90. https://doi.org/10.1038/nbt.3678.
 6. White MA, Kwasnieski JC, Myers CA, Shen SQ, Corbo JC, Cohen BA. A simple grammar defines activating and repressing cis-regulatory elements in photoreceptors. Cell Rep. 2016; 5:1247–54. https://doi.org/10.1016/j.celrep.2016.09.066.
 7. Farley EK, Olson KM, Zhang W, Brandt AJ, Rokhsar DS, Levine MS. Suboptimization of developmental enhancers. Science. 2015; 350:325–8. https://doi.org/10.1126/science.aac6948.
 8. Kamps-Hughes N, Preston JL, Randel MA, Johnson EA. Genome-wide identification of hypoxia-induced enhancer regions. PeerJ. 2015; 3:e1527. https://doi.org/10.7717/peerj.1527.
 9. Mogno I, Kwasnieski JC, Cohen BA. Massively parallel synthetic promoter assays reveal the in vivo effects of binding site variants. Genome Res. 2013; 23:1908–15. https://doi.org/10.1101/gr.157891.113.
 10. White MA, Myers CA, Corbo JC, Cohen BA. Massively parallel in vivo enhancer assay reveals that highly local features determine the cis-regulatory function of ChIP-seq peaks. PNAS. 2013; 110(29):11952–7. https://doi.org/10.1073/pnas.1307449110.
 11. Patwardhan RP, Lee C, Litvin O, Young DL, Pe’er D, Shendure J. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nat Biotechnol. 2009; 27:1173–5. https://doi.org/10.1038/nbt.1589.
 12. Melnikov A, Murugan A, Zhang X, Tesileanu T, Wang L, Rogov P, Feizi S, Gnirke A, Callan CG Jr, Kinney JB, Kellis M, Lander ES, Mikkelsen TS. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat Biotechnol. 2012; 30:271–7. https://doi.org/10.1038/nbt.2137.
 13. Patwardhan RP, Hiatt JB, Witten DM, Kim MJ, Smith RP, May D, Lee C, Andrie JM, Lee SI, Cooper GM, Ahituv N, Pennacchio LA, Shendure J. Massively parallel functional dissection of mammalian enhancers in vivo. Nat Biotechnol. 2012; 30:265–70. https://doi.org/10.1038/nbt.2136.
 14. Kwasnieski JC, Mogno I, Myers CA, Corbo JC, Cohen BA. Complex effects of nucleotide variants in a mammalian cis-regulatory element. PNAS. 2012; 109:19498–503. https://doi.org/10.1073/pnas.1210678109.
 15. Kheradpour P, Ernst J, Melnikov A, Rogov P, Wang L, Zhang X, Alston J, Mikkelsen TS, Kellis M. Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome Res. 2013; 23:800–11. https://doi.org/10.1101/gr.144899.112.
 16. Birnbaum RY, Patwardhan RP, Kim MJ, Findlay GM, Martin B, Zhao J, Bell RJA, Smith RP, Ku AA, Shendure J, Ahituv N. Systematic dissection of coding exons at single nucleotide resolution supports an additional role in cell-specific transcriptional regulation. PLoS Genet. 2014; 10:e1004592. https://doi.org/10.1371/journal.pgen.1004592.
 17. Zhao W, Pollack JL, Blagev DP, Zaitlen N, McManus MT, Erle DJ. Massively parallel functional annotation of 3’ untranslated regions. Nat Biotechnol. 2014; 32:387–91. https://doi.org/10.1038/nbt.2851.
 18. Ulirsch JC, Nandakumar SK, Wang L, Giani FC, Zhang X, Rogov P, Melnikov A, McDonel P, Do R, Mikkelsen TS, Sankaran VG. Systematic functional dissection of common genetic variation affecting red blood cell traits. Cell. 2016; 165:1530–45. https://doi.org/10.1016/j.cell.2016.04.048.
 19. Tewhey R, Kotliar D, Park DS, Liu B, Winnicki S, Reilly SK, Andersen KG, Mikkelsen TS, Lander ES, Schaffner SF, Sabeti PC. Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell. 2016; 165(6):1519–29. https://doi.org/10.1016/j.cell.2016.04.027.
 20. Vockley CM, Guo C, Majoros WH, Nodzenski M, Scholtens DM, Hayes MG, Lowe WL Jr, Reddy TE. Massively parallel quantification of the regulatory effects of noncoding genetic variation in a human cohort. Genome Res. 2015; 25:1206–14. https://doi.org/10.1101/gr.190090.115.
 21. Inoue F, Kircher M, Martin B, Cooper GM, Witten DM, McManus MT, Ahituv N, Shendure J. A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. Genome Res. 2017; 27:38–52. https://doi.org/10.1101/gr.212092.116.
 22. Shen SQ, Myers CA, Hughes AEO, Byrne LC, Flannery JG, Corbo JC. Massively parallel cis-regulatory analysis in the mammalian central nervous system. Genome Res. 2016; 26:238–55. https://doi.org/10.1101/gr.193789.115.
 23. Kalita CA, Moyerbrailean GA, Brown C, Wen X, Luca F, Pique-Regi R. QuASAR-MPRA: accurate allele-specific analysis for massively parallel reporter assays. Bioinformatics. 2017. https://doi.org/10.1093/bioinformatics/btx598.
 24. Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014; 15:R29. https://doi.org/10.1186/gb-2014-15-2-r29.
 25. The mpra package. https://bioconductor.org/packages/mpra.
 26. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012; 40:4288–97. https://doi.org/10.1093/nar/gks042.
 27. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008; 18:1509–17. https://doi.org/10.1101/gr.079558.108.
 28. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010; 11:94. https://doi.org/10.1186/1471-2105-11-94.
 29. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139–40. https://doi.org/10.1093/bioinformatics/btp616.
 30. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15:550. https://doi.org/10.1186/s13059-014-0550-8.
 31. Phipson B. Empirical Bayes modelling of expression profiles and their associations. PhD thesis, University of Melbourne; 2013.
 32. Gisselbrecht SS, Barrera LA, Porsch M, Aboukhalil A, Estep PW 3rd, Vedenko A, Palagi A, Kim Y, Zhu X, Busser BW, Gamble CE, Iagovitina A, Singhania A, Michelson AM, Bulyk ML. Highly parallel assays of tissue-specific enhancers in whole Drosophila embryos. Nat Methods. 2013; 10:774–80. https://doi.org/10.1038/nmeth.2558.
 33. Smyth GK, Michaud J, Scott HS. Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics. 2005; 21:2067–75. https://doi.org/10.1093/bioinformatics/bti270.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.