Analytical and Bioanalytical Chemistry Pepr: Pipelines for Evaluating Prokaryotic References Pipelines for Evaluating Prokaryotic References Supplemental Material Candidate Reference Material Sequencing and Genome Assembly the S. Aureus Candidate Reference Materials Were Sequenced Using Three Orthog

Sequencing data for the NIST candidate reference material RM8376, genomic DNA from Staphylococcus aureus strain NRS100 isolate COL (Biosample SAMN02854573) was selected as a model system to demonstrate how PEPR is used to characterize a microbial genomic material. The strain was isolated from a clinical sample by Children's National Hospital (Biosample SAMN02700075). The genomic material represents a large homogeneous batch of extracted DNA, with 1500 vials with 3µg DNA in each. The extracted DNA was prepared from single colony of the initial culture stab was in-cubated overnight at 37 °C on an agar plate and a single colony was used to inoculate a new plate. One colony from the new plate was grown in 20 mL Luria-Bertani (LB) broth at 37 °C. The culture was used to inoculate 15 X 150 mm plates which were incubated at 37 °C for 16 hours. DNA was isolated by lysing the bacteria in a solution containing NaCl, Tris, EDTA and lysostaphin (25µg/ml) and SDS. Proteinase K and RNase A were used to treat protein and RNA. Ammonium acetate was used to remove protein. DNA was recovered by isopropanol precipitation as follows, the DNA was washed with 70% alcohol, drained, and dissolved in TE buffer (Tris 10 mM, EDTA 0.1 mM, pH8.0). 1 Figure 1: Comparison of optical map data to genome assembly. Alignment of in-silico genome map generated from the PacBio HGAP assembly to the OpGen optical map. Blue bars in map represent NocI restriction sites, black lines indicate co-linear regions. vials were randomly sampled from the lot of 1500 vials. For MiSeq, two technical replicate libraries were prepared for each of the eight vials using the Nextera DNA Sample Prep Kit 3 ; samples were barcoded using the Nextera Index Kit and sequenced using the MiSeq 600 cycle Reagent kit v3 for 2 X 300 bp reads. The 16 libraries were pooled and sequenced in a single run. Single barcoded 400 bp Ion Torrent PGM libraries were prepared for each of the eight vials using the Ion Xpress Plus kit 2. The vials were barcoded using the IonXpress Kit 3 , sequencing template was prepared using the Ion PGM Template OT2 400 kit and the IonPGM400 kit was used for sequencing on a 318C chip. The raw sequence data is archived in the Genbank Sequence Read Archive 1 of main paper for accession numbers. A genome map was obtained for the candidate …


Material Development
Sequencing data for the NIST candidate reference material RM8376, genomic DNA from Staphylococcus aureus strain NRS100 isolate COL (Biosample SAMN02854573) was selected as a model system to demonstrate how PEPR is used to characterize a microbial genomic material. The strain was isolated from a clinical sample by Children's National Hospital (Biosample SAMN02700075). The genomic material represents a large homogeneous batch of extracted DNA, with 1500 vials with 3µg DNA in each. The extracted DNA was prepared from single colony of the initial culture stab was incubated overnight at 37°C on an agar plate and a single colony was used to inoculate a new plate. One colony from the new plate was grown in 20 mL Luria-Bertani (LB) broth at 37°C. The culture was used to inoculate 15 X 150 mm plates which were incubated at 37°C for 16 hours. DNA was isolated by lysing the bacteria in a solution containing NaCl, Tris, EDTA and lysostaphin (25µg/ml) and SDS. Proteinase K and RNase A were used to treat protein and RNA. Ammonium acetate was used to remove protein. DNA was recovered by isopropanol precipitation as follows, the DNA was washed with 70% alcohol, drained, and dissolved in TE buffer (Tris 10 mM, EDTA 0.1 mM, pH8.0).

Sequencing Experimental Design
The S. aureus candidate reference materials were sequenced using three orthogonal sequencing platforms: Pacific Biosciences RSII (PacBio) 1 Table 1 of main paper for accession numbers. A genome map was obtained for the candidate reference material from optical mapping data generated using the Argus Optical Mapping System 4 . Agarose plugs generated from the same culture stock as that used to generate the DNA reference material were used for optical mapping measurement. Restriction enzyme NcoI was used for the restriction digest. These steps occured prior to our PEPR pipeline.

S. aureus Reference Assembly
PacBio long read data were used to generate a de novo genome assembly for use in the Genome Evaluation Pipeline. PacBio Reads were assembled using SMRTAnalysis software version 2.3 1 to apply the HGAP assembly algorithm [1]. To identify potential errors in the assembly, an in silico digest of assembly was compared the genome map obtained from the optical mapping data using OpGen MapSolver software 4 . The de novo assembly was supported by Optical mapping data (Fig. 1). The plasmid is too small for OpGen optical mapping technology, and therefore the plasmid assembly was not validated with the optical mapping data.

Sequencing Error Adjusted and Multiple Comparison Corrected Read Depth Estimation
Introduction A critical component in planning Next Generation Sequencing (NGS) experiments is determining appropriate read depth. This is particularly important for reliably detecting what if any differences exist between two biological samples, for example normal versus tumor tissue. The problem of determining appropriate read depth can be thought of as a sample size estimation problem with two important considerations. The first is the inclusion of systematic sequencing errors (SSE) on read accuracy; this can include quality score and other metrics. The second is controlling for multiple comparisons, which in the context of NGS is especially important as we are comparing millions/billions of locations between the two samples.
In this analysis we develop a Bayesian method for helping to determine read depth and associated power estimates, taking into account both sequencing error and false discovery rate.
Our specific focus will be to provide read depth estimates in the context of sample homogeneity assessment; i.e. looking at two sequences grown from the same batch and determining how deeply those sequences need to be sequenced in order to detect whether any differences exist.

Sampling Distribution
As this type of analysis is done prospectively to help determine appropriate read depth coverage for the final sample preparations, we work with simulated data only as no measurements have been taken. In order to arrive at an estimate of read depth we need a way to simulate observed purities, i.e. adjusted for SSEs and other distributional factors associated with an NGS experiment. Our objective is to compare the purities, simulated from two vials and determine if differences could be detected as we slowly change the underlying true purity of one vial, while holding the other fixed. Stated in terms of a hypothesis test we have ‫ܪ‬ : purity of vial ଵ = purity of vial ଶ vs. ‫ܪ‬ : vial purities not equal.
To generate these purities we used a Bayesian sampling scheme as a first step in our analysis, this is done prior to any adjustment for multiple comparisons which we describe in the following section. We now describe the sampling procedure. We begin by defining the terms of our model (note that these all pertain to the sampling of a purity associated with a base at a given position in the genome for a single vial), letting denote ‫ݎ‬ read depth we have ‫ܤ‬ = The event that the reference base ܾ ∈ ሼ‫,ܣ‬ ‫,ܥ‬ ܶ, ‫ܩ‬ሽ is matched in read ݅, ݅ = 1, … , ‫,ݎ‬ ‫ܧ‬ = The occurence of a sequencing error at read ݅, ݅ = 1, … , ‫,ݎ‬ For each of these we assume the following probability distributions: The conditional distributions of the data can be understood as follows. For the case given by ሺ‫ܤ‬ ‫ܧ|‬ = 0, ‫‬ ሻ, i.e. where no sequencing errors have occurred, the probability of observing the correct reference base is related to the purity of that base, ‫‪݈݈݅ሺ‬ݑ݊ݎ݁ܤ‬ ሻ distribution. For the case where a sequencing error ሺ‫ܤ‬ ‫ܧ|‬ = 1, ‫‬ ሻ the conditional distribution captures the event that the incorrect base is sampled but is identified as the reference base. For this reason the appropriate d capturing this event is ‫݈݈݅ݑ݊ݎ݁ܤ‬ below.

Fig. 1 Model for purity of a base and sequencing errors
Of primary interest is to compute the posterior of the error parameter ‫ܧ‬ , which can be shown to be In order to sample ‫‬ ௦ we must first generate ou sampling a set of ‫ݎ‬ sequencing errors that these errors could be allowed to vary these we then sample our observed data as Of primary interest is to compute the posterior of ‫‬ given the observed data , which can be shown to be … , ‫ݎ‬ሻ we must first generate our observed data ‫ܤ‬ . To do this we begin by sequencing errors ‫ܧ‬ with a fixed error rate ߳ (though it should be noted that these errors could be allowed to vary at each position based on quality scores, etc.
sample our observed data as ributions of the data can be understood as follows. For the case given by , i.e. where no sequencing errors have occurred, the probability of observing the correct reference base is related to the purity of that base, p b and follows a distribution. For the case where a sequencing error has occurred, the conditional distribution captures the event that the incorrect base is sampled but is identified as the reference base. For this reason the appropriate distribution for our model is shown in Figure 1 given the observed data ‫ܤ‬ conditional on . To do this we begin by (though it should be noted at each position based on quality scores, etc.

Generating null and alternative distributions
With a way to sample the observed purities ‫‬ ௦ , our objective is to estimate the probability of detecting differences in purities between vials as we vary the true purity ‫‬ (i.e. the power of the test). To do this we begin by estimating the null distribution, i.e. when the distribution of ‫‬ is the same for both vials 1 and 2, specifically ߙ ଵ = ߙ ଶ and ߚ ଵ = ߚ ଶ , here the subscripts denote the corresponding vials parameters for the Beta distribution (note, this sampling is for a given read depth ‫.)ݎ‬ Recall that the expected value of the beta distribution is ఈ ఈାఉ , so that the ratio of these parameters defines the underlying pure probability. In our analysis we consider two scenarios ሺ2ሻ to be the difference in the observed purities and ݀ = ሺ݀ ଵ , … , ݀ ሻ the collection of these differences. In a similar fashion we make draws from the alternative distribution, altering the expectation for vial 2 as defined above; for this case define ݀ ଵ = ሺ݀ ଵ ଵ , … , ݀ ଵ ሻ as the collection of these differences. With ݀ and ݀ ଵ we can now estimate the distribution of the differences for the null and alternative hypotheses. To do this we use the binned kernel density estimate function, bkde in R.

Computing the power of the test
Pictorially the power of our test, i.e. the probability of detecting a difference when one exists is shown below in Figure 2. Here the starting point of the green region (the power) is determined by the pre-determined level of significance ߛ, i.e. the probability of rejecting the null when no differences exist. Typically this value is taken to be ߛ = 0.05, but as will be discussed in the following section, in order to account for multiple comparisons we use an adjustment of this to determine the starting point for our power estimate. Irrespective the cutoff associated with a particular value of quantile from the binned kernel density fit to the sampled differences in alternative distribution for ݀ ଵ Fig. 2 Pictorial description of the power to detect significant distributions

Adjusting for multiple comparisons
In order to adjust for multiple comparisons we use the well adjustment for controlling False Discovery Rate (FDR). FDR procedures are designed to control the expected proportion of incorrectly rejected null hypotheses, i.e. false discoveries. In contrast, multiple comparison correction using family w such as the Bonferroni correction, seek to reduce the probability of even one false discovery, as opposed to expected proportion of false discoveries. Thus FDR procedures have great the cost of increased rate false positives. In its most common form the BH adjustment calls for taking the computed p ordering, smallest to largest as positions in the genome we are compu proceeds by finding the largest ீ as representing the number of pairwise differences we might expect to see Put another way, if some number of difference are observed between vials at various positions in the genome, this proportion informs how many of those differences we expect to be versus artifacts of the large number of comparisons or feature of the experi for example, taking ீ = 1.0 real. The impact that this adjustment has on our powe As the number of actual differences between sequences s the ratio ீ , however, certain "real" differences do occur differences may be detected, largely as a result of the massive number of comparisons being made, we believe only a small number of these following section, in order to account for multiple comparisons we use an adjustment of this to determine the starting point for our power estimate. Irrespective the cutoff associated with a particular value of ߛ is established by finding the uantile from the binned kernel density fit to the sampled differences in ݀ ଵ is then computed and corresponds to our estimate of power Pictorial description of the power to detect significant difference between two

Adjusting for multiple comparisons
In order to adjust for multiple comparisons we use the well-known Benjamini adjustment for controlling False Discovery Rate (FDR). FDR procedures are designed to control the expected proportion of incorrectly rejected null hypotheses, i.e. false discoveries. In contrast, multiple comparison correction using family wise error rate (FWER) based procedures, such as the Bonferroni correction, seek to reduce the probability of even one false discovery, as opposed to expected proportion of false discoveries. Thus FDR procedures have great the cost of increased rate false positives. In its most common form the BH adjustment calls for taking the computed p ordering, smallest to largest as ܲ ሺଵሻ ܲ ሺଶሻ ⋯ ܲ ሺீሻ , where ‫ܩ‬ denotes the number of positions in the genome we are computing pairwise comparisons for. The procedure then proceeds by finding the largest ݇ such that ܲ ሺሻ ீ ߛ. Intuitively we can think of the proportion as representing the number of pairwise differences we might expect to see another way, if some number of difference are observed between vials at various positions in the genome, this proportion informs how many of those differences we expect to be versus artifacts of the large number of comparisons or feature of the experi would basically say that we believe all detected differences are real. The impact that this adjustment has on our power estimate is shown in Figure  number of actual differences between sequences should be small so to should , however, certain "real" differences do occur so that ݇ differences may be detected, largely as a result of the massive number of comparisons being made, we believe only a small number of these reflect actual differences. For our purposes we following section, in order to account for multiple comparisons we use an adjustment of this to is established by finding the ݀ . The area under the and corresponds to our estimate of power.
rence between two known Benjamini-Hochberg (BH) adjustment for controlling False Discovery Rate (FDR). FDR procedures are designed to control the expected proportion of incorrectly rejected null hypotheses, i.e. false discoveries. In ise error rate (FWER) based procedures, such as the Bonferroni correction, seek to reduce the probability of even one false discovery, as opposed to expected proportion of false discoveries. Thus FDR procedures have great power at In its most common form the BH adjustment calls for taking the computed p-values and denotes the number of ting pairwise comparisons for. The procedure then . Intuitively we can think of the proportion as representing the number of pairwise differences we might expect to see correctly rejected. another way, if some number of difference are observed between vials at various positions in the genome, this proportion informs how many of those differences we expect to be real versus artifacts of the large number of comparisons or feature of the experimental process. So, would basically say that we believe all detected differences are r estimate is shown in Figure 3. hould be small so to should 0. So, while many differences may be detected, largely as a result of the massive number of comparisons being reflect actual differences. For our purposes we conservatively set ீ = 0.05, i.e. of the differences that are detected, we believe that only 5% may be attributable to real events (and Fig. 3 Pictorial depiction of decrease in power to detect differen comparisons , i.e. of the differences that are detected, we believe that only 5% may be attributable to real events (and many of these may be related to sequencing artifacts). orial depiction of decrease in power to detect differences due to multiple , i.e. of the differences that are detected, we believe that only 5% related to sequencing artifacts).
ces due to multiple