Quality Control Metrics for Extraction-Free Targeted RNA-Seq Under a Compositional Framework

LaRoche, Dominic; Billheimer, Dean; Michels, Kurt; LaFleur, Bonnie

doi:10.1007/978-3-319-67386-8_21

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 218)

Included in the following conference series:

Midwest Biopharmaceutical Statistics Workshop

Abstract

The rapid rise in the use of RNA sequencing technology (RNA-Seq) for scientific discovery has led to its consideration as a clinical diagnostic tool. However, as a new technology, the analytical accuracy and reproducibility of RNA-Seq must be established before it can realize its full clinical utility (SEQC/MAQC-III Consortium, 2014; VanKeuren-Jensen et al. 2014). We respond to the need for reliable diagnostics, quality control metrics, and improved reproducibility of RNA-Seq data by recognizing and capitalizing on the relative frequency nature of RNA-Seq data.

Problems with sample quality, library preparation, or sequencing may result in a low number of reads allocated to a given sample within a sequencing run. We propose a method, based on outlier detection of Centered Log-Ratio (CLR) transformed counts, for objectively identifying problematic samples based on the total number of reads allocated to the sample.

Normalization and standardization methods for RNA-Seq generally assume that the total number of reads assigned to a sample does not affect the observed relative frequencies of probes within an assay. This assumption, known as Compositional Invariance, is an important property for RNA-Seq data because it enables the comparison of samples with differing read depths. Violations of the invariance property can lead to spurious differential expression results, even after normalization. We develop a diagnostic method to identify violations of the Compositional Invariance property.

Batch effects arising from differing laboratory conditions or operator differences have been identified as a problem in high-throughput measurement systems (Leek et al. in Nat Rev Genet 11:733–739, 2010 [14]; Chen et al. in PLoS One 6, 2011 [10]). Batch effects are typically identified with a hierarchical clustering (HC) method or principal components analysis (PCA). For both methods, the multivariate distance between the samples is visualized, either in a biplot for PCA or a dendrogram for HC, to check for the existence of clusters of samples related to batch. We show that CLR-transformed RNA-Seq data are appropriate for evaluation in a PCA biplot and improve batch effect detection over current methods.

As RNA-Seq makes the transition from the research laboratory to the clinic, there is a need for robust quality control metrics. The realization that RNA-Seq data are compositional opens the door to the existing body of theory and methods developed by Aitchison (The statistical analysis of compositional data, Chapman & Hall Ltd., 1986) and others. We show that the properties of compositional data can be leveraged to develop new metrics and improve existing methods.
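The abstract does not give formulas, so the following is only a minimal sketch of the ideas it names, not the chapter's actual procedure. It assumes the CLR of a vector is its log minus the mean of its logs, treats the per-sample read totals of one run as a composition, flags low-read samples with a lower Tukey fence (the fence multiplier `k = 1.5` and the pseudocount of 0.5 for zero counts are illustrative assumptions), and computes PCA scores from CLR-transformed probe counts via a singular value decomposition. The function names and the toy count matrix are hypothetical.

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform along the last axis.

    The pseudocount is an assumed, simple way to avoid log(0);
    it is not taken from the chapter.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

def flag_low_read_samples(total_reads, k=1.5):
    """Flag samples whose CLR-transformed read total falls below
    the lower Tukey fence, Q1 - k * IQR (k is an assumed default)."""
    z = clr(total_reads, pseudocount=0.0)
    q1, q3 = np.percentile(z, [25, 75])
    return z < q1 - k * (q3 - q1)

def pca_scores(counts, n_components=2):
    """Sample scores from a PCA of CLR-transformed counts.

    Rows are samples, columns are probes. Columns are centered
    before the SVD, as in an ordinary PCA.
    """
    z = clr(counts)
    z = z - z.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy run: six samples by five probes; the fourth sample received few reads.
counts = np.array([
    [200, 50, 1000, 20, 400],
    [210, 45,  980, 25, 390],
    [190, 55, 1020, 18, 410],
    [  2,  1,   10,  0,   4],
    [205, 48,  990, 22, 405],
    [198, 52, 1010, 19, 395],
])

totals = counts.sum(axis=1)            # reads allocated to each sample
low_read = flag_low_read_samples(totals)
scores = pca_scores(counts)            # one row of scores per sample
```

On this toy matrix only the fourth sample is flagged. Plotting the two columns of `scores`, colored by batch label, would give the sample side of the kind of PCA display the abstract refers to; a full biplot would also draw the probe loadings.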

References

Aitchison, J.: The Statistical Analysis of Compositional Data. Chapman & Hall, Ltd. (1986). http://dl.acm.org/citation.cfm?id=17272
Aitchison, J.: On criteria for measures of compositional difference. Math Geol 24(4), 365–379 (1992). https://doi.org/10.1007/BF00891269. http://link.springer.com/10.1007/BF00891269
Aitchison, J., Greenacre, M.: Biplots of compositional data. J R Stat Soc Series C (Appl Stat) 51(4), 375–392 (2002). https://doi.org/10.1111/1467-9876.00275. http://doi.wiley.com/10.1111/1467-9876.00275
Aitchison, J., Barceló-Vidal, C., Martín-Fernández, J.A., Pawlowsky-Glahn, V.: Logratio analysis and compositional distance. Math Geol 32(3), 271–275 (2000). https://doi.org/10.1023/A:1007529726302
Aitchison, J., Shen, S.: Logistic-normal distributions: some properties and uses. Biometrika 67(2), 261–272 (1980). https://doi.org/10.1093/biomet/67.2.261. https://www.researchgate.net/publication/229099731_Logistic-Normal_Distributions_Some_Properties_and_Uses
Anders, S., Huber, W.: Differential expression analysis for sequence count data. Genome Biol 11(10), R106 (2010). https://doi.org/10.1186/gb-2010-11-10-r106. http://www.biomedcentral.com/content/pdf/gb-2010-11-10-r106.pdf
Ben-Gal, I.: Outlier detection. In: Data Mining and Knowledge Discovery Handbook, pp. 117–130. Springer, US (2009). https://doi.org/10.1007/978-0-387-09823-4_7. http://link.springer.com/10.1007/978-0-387-09823-4_7
Billheimer, D., Guttorp, P., Fagan, W.F.: Statistical interpretation of species composition. J Am Stat Assoc 96(456), 1205–1214 (2001). https://doi.org/10.1198/016214501753381850. http://www.jstor.org/stable/3085883
Bolstad, B.M., Irizarry, R.A., Astrand, M., Speed, T.P.: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (Oxford, England) 19(2), 185–193 (2003). http://www.ncbi.nlm.nih.gov/pubmed/12538238
Chen, C., Grennan, K., Badner, J., Zhang, D., Gershon, E., Jin, L., Liu, C.: Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One 6(2) (2011). https://doi.org/10.1371/journal.pone.0017238
Dillies, M.A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, N.S., Castel, D., Estelle, J., Guernec, G., Jagla, B., Jouneau, L., Laloe, D., Le Gall, C., Schaeffer, B., Le Crom, S., Guedj, M., Jaffrezic, F.: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings Bioinform 14(6), 671–683 (2013). https://doi.org/10.1093/bib/bbs046
Hawkins, D.M.: Identification of Outliers. Springer Netherlands, Dordrecht (1980). https://doi.org/10.1007/978-94-015-3994-4. http://link.springer.com/10.1007/978-94-015-3994-4
Law, C.W., Chen, Y., Shi, W., Smyth, G.K.: voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15(2), R29 (2014). https://doi.org/10.1186/gb-2014-15-2-r29. http://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29
Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly, K., Irizarry, R.A.: Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11(10), 733–739 (2010). https://doi.org/10.1038/nrg2825. http://dx.doi.org/10.1038/nrg2825
Lovell, D., Müller, W., Taylor, J., Zwart, A., Helliwell, C.: Proportions, percentages, PPM: do the molecular biosciences treat compositional data right? In: Compositional Data Analysis: Theory and Applications, pp. 191–207. Wiley (2011). https://doi.org/10.1002/9781119976462.ch14. http://dx.doi.org/10.1002/9781119976462.ch14
Lovell, D., Pawlowsky-Glahn, V., Egozcue, J.J., Marguerat, S., Bähler, J.: Proportionality: a valid alternative to correlation for relative data. PLoS Comput Biol 11(3), e1004075 (2015). https://doi.org/10.1371/journal.pcbi.1004075. http://www.ncbi.nlm.nih.gov/pubmed/25775355
Luo, J., Schumacher, M., Scherer, A., Sanoudou, D., Megherbi, D., Davison, T., Shi, T., Tong, W., Shi, L., Hong, H., Zhao, C., Elloumi, F., Shi, W., Thomas, R., Lin, S., Tillinghast, G., Liu, G., Zhou, Y., Herman, D., Li, Y., Deng, Y., Fang, H., Bushel, P., Woods, M., Zhang, J.: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J 10(4), 278–291 (2010). https://doi.org/10.1038/tpj.2010.57. http://www.ncbi.nlm.nih.gov/pubmed/20676067. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2920074
Martín-Fernández, J.A., Barceló-Vidal, C., Pawlowsky-Glahn, V.: Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math Geol 35(3), 253–278 (2003). https://doi.org/10.1023/A:1023866030544. http://link.springer.com/article/10.1023/A%3A1023866030544
Martín-Fernández, J.A., Barceló-Vidal, C., Pawlowsky-Glahn, V., Buccianti, A., Nardi, G., Potenza, R.: Measures of difference for compositional data and hierarchical clustering methods. In: Proceedings of IAMG, vol. 98, no. 1, pp. 526–531 (1998)
Martín-Fernández, J.A., Hron, K., Templ, M., Filzmoser, P., Palarea-Albaladejo, J.: Bayesian multiplicative treatment of count zeros in compositional data sets. Stat Model 15(2), 134–158 (2015)
Pearson, K.: Mathematical contributions to the theory of evolution. On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc R Soc Lond 60, 489–498 (1896). https://archive.org/details/philtrans00847732
Robinson, M.D., Oshlack, A.: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11(3), R25 (2010). https://doi.org/10.1186/gb-2010-11-3-r25
Robinson, M.D., Smyth, G.K.: Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9(2), 321–332 (2007). https://doi.org/10.1093/biostatistics/kxm030. http://biostatistics.oxfordjournals.org/cgi/doi/10.1093/biostatistics/kxm030
Sanford, R.F., Pierson, C.T., Crovelli, R.A.: An objective replacement method for censored geochemical data. Math Geol 25(1), 59–80 (1993). https://doi.org/10.1007/BF00890676. http://link.springer.com/10.1007/BF00890676
Sims, D., Sudbery, I., Ilott, N.E., Heger, A., Ponting, C.P.: Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15(2), 121–132 (2014). https://doi.org/10.1038/nrg3642. http://www.nature.com/doifinder/10.1038/nrg3642
Tarazona, S., García-Alcalde, F., Dopazo, J., Ferrer, A., Conesa, A.: Differential expression in RNA-seq: a matter of depth. Genome Res 21(12), 2213–2223 (2011). https://doi.org/10.1101/gr.124321.111. http://www.ncbi.nlm.nih.gov/pubmed/21903743. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC3227109
Tukey, J.W.: Exploratory Data Analysis. Addison-Wesley Publishing Co. (1977)

Author information

Authors and Affiliations

HTG Molecular Diagnostics, Inc., Tucson, AZ, USA
Dominic LaRoche, Kurt Michels & Bonnie LaFleur
Department of Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona, Tucson, AZ, USA
Dean Billheimer

Corresponding author

Correspondence to Dominic LaRoche.

Editor information

Editors and Affiliations

Statistical Innovation and Consultation group, Takeda Pharmaceuticals, Cambridge, MA, USA
Ray Liu
Division of Biometrics VI, CDER, U.S. Food and Drug Administration, Silver Spring, MD, USA
Yi Tsong

Cite this paper

LaRoche, D., Billheimer, D., Michels, K., LaFleur, B. (2019). Quality Control Metrics for Extraction-Free Targeted RNA-Seq Under a Compositional Framework. In: Liu, R., Tsong, Y. (eds) Pharmaceutical Statistics. MBSW 2016. Springer Proceedings in Mathematics & Statistics, vol 218. Springer, Cham. https://doi.org/10.1007/978-3-319-67386-8_21

DOI: https://doi.org/10.1007/978-3-319-67386-8_21
Published: 13 June 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67385-1
Online ISBN: 978-3-319-67386-8
eBook Packages: Mathematics and Statistics (R0)