Abstract
The rapid rise in the use of RNA sequencing technology (RNA-Seq) for scientific discovery has led to its consideration as a clinical diagnostic tool. However, because RNA-Seq is a new technology, its analytical accuracy and reproducibility must be established before it can realize its full clinical utility (SEQC/MAQC-III Consortium, 2014; VanKeuren-Jensen et al. 2014). We respond to the need for reliable diagnostics, quality control metrics and improved reproducibility of RNA-Seq data by recognizing and capitalizing on the relative frequency nature of RNA-Seq data.
Problems with sample quality, library preparation, or sequencing may result in a low number of reads allocated to a given sample within a sequencing run. We propose a method, based on outlier detection of Centered Log-Ratio (CLR) transformed counts, for objectively identifying problematic samples from the total number of reads allocated to each sample.
Normalization and standardization methods for RNA-Seq generally assume that the total number of reads assigned to a sample does not affect the observed relative frequencies of probes within an assay. This assumption, known as Compositional Invariance, is an important property of RNA-Seq data because it enables the comparison of samples with differing read depths. Violations of the invariance property can lead to spurious differential expression results, even after normalization. We develop a diagnostic method to identify violations of the Compositional Invariance property.
Batch effects arising from differing laboratory conditions or operator differences have been identified as a problem in high-throughput measurement systems (Leek et al. 2010; Chen et al. 2011). Batch effects are typically identified with a hierarchical clustering (HC) method or principal components analysis (PCA). For both methods, the multivariate distance between the samples is visualized, either in a biplot for PCA or a dendrogram for HC, to check for the existence of clusters of samples related to batch. We show that CLR transformed RNA-Seq data is appropriate for evaluation in a PCA biplot and improves batch effect detection over current methods.
As RNA-Seq makes the transition from the research laboratory to the clinic, there is a need for robust quality control metrics. The realization that RNA-Seq data are compositional opens the door to the existing body of theory and methods developed by Aitchison (1986) and others. We show that the properties of compositional data can be leveraged to develop new metrics and improve existing methods.
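As an illustration of the ideas summarized in the abstract, the following is a minimal Python sketch, not the chapter's implementation. It assumes a samples-by-probes matrix of strictly positive counts (zeros would need a replacement step first). The function names `clr`, `pca_scores` and `low_depth_samples` are hypothetical, and the Tukey-fence rule on log total reads is only a simple stand-in for the outlier detection described in the abstract, whose details are not given there.

```python
import numpy as np


def clr(counts):
    """Centered log-ratio transform of a samples x probes matrix.

    Assumes every count is strictly positive.
    """
    log_counts = np.log(np.asarray(counts, dtype=float))
    # Subtract each sample's mean log count (the log of its geometric mean).
    return log_counts - log_counts.mean(axis=1, keepdims=True)


def pca_scores(clr_data, n_components=2):
    """Sample scores on the leading principal components, computed by SVD."""
    # Center each probe (column) across samples before the decomposition.
    centered = clr_data - clr_data.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def low_depth_samples(counts, k=1.5):
    """Flag samples whose log total reads fall below Tukey's lower fence.

    Illustrative stand-in only; k = 1.5 is the conventional fence multiplier.
    """
    log_totals = np.log(np.asarray(counts, dtype=float).sum(axis=1))
    q1, q3 = np.percentile(log_totals, [25, 75])
    return log_totals < q1 - k * (q3 - q1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: 8 samples x 5 probes; the +1 is a crude guard against zeros.
    demo = rng.poisson(100, size=(8, 5)) + 1
    demo[0] = rng.poisson(2, size=5) + 1  # one sample with very few reads
    print(low_depth_samples(demo))
    print(pca_scores(clr(demo)))
```

Because the CLR subtracts each sample's mean log count, multiplying a sample's counts by a constant leaves its CLR values unchanged. That is the sense in which samples with different read depths become comparable, provided Compositional Invariance holds. The scores returned by `pca_scores` could then be plotted and colored by batch to look for batch-related clusters.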
References
Aitchison, J.: The Statistical Analysis of Compositional Data. Chapman & Hall, Ltd. (1986). http://dl.acm.org/citation.cfm?id=17272
Aitchison, J.: On criteria for measures of compositional difference. Math Geol 24(4), 365–379 (1992). https://doi.org/10.1007/BF00891269. http://link.springer.com/10.1007/BF00891269
Aitchison, J., Greenacre, M.: Biplots of compositional data. J R Stat Soc Series C (Appl Stat) 51(4), 375–392 (2002). https://doi.org/10.1111/1467-9876.00275. http://doi.wiley.com/10.1111/1467-9876.00275
Aitchison, J., Barceló-Vidal, C., Martín-Fernández, J.A., Pawlowsky-Glahn, V.: Logratio analysis and compositional distance. Math Geol 32(3), 271–275 (2000). https://doi.org/10.1023/A:1007529726302
Aitchison, J., Shen, S.: Logistic-normal distributions: some properties and uses. Biometrika 67(2), 261–272 (1980). https://doi.org/10.1093/biomet/67.2.261. https://www.researchgate.net/publication/229099731_Logistic-Normal_Distributions_Some_Properties_and_Uses
Anders, S., Huber, W.: Differential expression analysis for sequence count data. Genome Biol 11(10), R106 (2010). https://doi.org/10.1186/gb-2010-11-10-r106. http://www.biomedcentral.com/content/pdf/gb-2010-11-10-r106.pdf
Ben-Gal, I.: Outlier detection. In: Data Mining and Knowledge Discovery Handbook, pp. 117–130. Springer, US (2009). https://doi.org/10.1007/978-0-387-09823-4_7. http://link.springer.com/10.1007/978-0-387-09823-4_7
Billheimer, D., Guttorp, P., Fagan, W.F.: Statistical interpretation of species composition. J Am Stat Assoc 96(456), 1205–1214 (2001). https://doi.org/10.1198/016214501753381850. http://www.jstor.org/stable/3085883
Bolstad, B.M., Irizarry, R.A., Astrand, M., Speed, T.P.: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (Oxford, England) 19(2), 185–193 (2003). http://www.ncbi.nlm.nih.gov/pubmed/12538238
Chen, C., Grennan, K., Badner, J., Zhang, D., Gershon, E., Jin, L., Liu, C.: Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One 6(2) (2011). https://doi.org/10.1371/journal.pone.0017238
Dillies, M.A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, N.S., Castel, D., Estelle, J., Guernec, G., Jagla, B., Jouneau, L., Laloe, D., Le Gall, C., Schaeffer, B., Le Crom, S., Guedj, M., Jaffrezic, F.: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings Bioinform 14(6), 671–683 (2013). https://doi.org/10.1093/bib/bbs046
Hawkins, D.M.: Identification of Outliers. Springer Netherlands, Dordrecht (1980). https://doi.org/10.1007/978-94-015-3994-4. http://link.springer.com/10.1007/978-94-015-3994-4
Law, C.W., Chen, Y., Shi, W., Smyth, G.K.: voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15(2), R29 (2014). https://doi.org/10.1186/gb-2014-15-2-r29. http://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29
Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly, K., Irizarry, R.A.: Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11(10), 733–739 (2010). https://doi.org/10.1038/nrg2825. http://dx.doi.org/10.1038/nrg2825
Lovell, D., Müller, W., Taylor, J., Zwart, A., Helliwell, C.: Proportions, percentages, PPM: do the molecular biosciences treat compositional data right? In: Compositional Data Analysis: Theory and Applications, pp. 191–207. Wiley (2011). https://doi.org/10.1002/9781119976462.ch14. http://dx.doi.org/10.1002/9781119976462.ch14
Lovell, D., Pawlowsky-Glahn, V., Egozcue, J.J., Marguerat, S., Bähler, J.: Proportionality: a valid alternative to correlation for relative data. PLoS Comput Biol 11(3), e1004075 (2015). https://doi.org/10.1371/journal.pcbi.1004075. http://www.ncbi.nlm.nih.gov/pubmed/25775355
Luo, J., Schumacher, M., Scherer, A., Sanoudou, D., Megherbi, D., Davison, T., Shi, T., Tong, W., Shi, L., Hong, H., Zhao, C., Elloumi, F., Shi, W., Thomas, R., Lin, S., Tillinghast, G., Liu, G., Zhou, Y., Herman, D., Li, Y., Deng, Y., Fang, H., Bushel, P., Woods, M., Zhang, J.: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J 10(4), 278–291 (2010). https://doi.org/10.1038/tpj.2010.57. http://www.ncbi.nlm.nih.gov/pubmed/20676067. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2920074
Martín-Fernández, J.A., Barceló-Vidal, C., Pawlowsky-Glahn, V.: Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math Geol 35(3), 253–278 (2003). https://doi.org/10.1023/A:1023866030544. http://link.springer.com/article/10.1023/A%3A1023866030544
Martín-Fernández, J.A., Barceló-Vidal, C., Pawlowsky-Glahn, V., Buccianti, A., Nardi, G., Potenza, R.: Measures of difference for compositional data and hierarchical clustering methods. In: Proceedings of IAMG, vol. 98, no. 1, pp. 526–531 (1998)
Martín-Fernández, J.A., Hron, K., Templ, M., Filzmoser, P., Palarea-Albaladejo, J.: Bayesian multiplicative treatment of count zeros in compositional data sets. Stat Model 15(2), 134–158 (2015)
Pearson, K.: Mathematical contributions to the theory of evolution.-On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc R Soc Lond 60, 489–498 (1896). https://archive.org/details/philtrans00847732
Robinson, M.D., Oshlack, A.: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11(3), R25 (2010). https://doi.org/10.1186/gb-2010-11-3-r25
Robinson, M.D., Smyth, G.K.: Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9(2), 321–332 (2007). https://doi.org/10.1093/biostatistics/kxm030. http://biostatistics.oxfordjournals.org/cgi/doi/10.1093/biostatistics/kxm030
Sanford, R.F., Pierson, C.T., Crovelli, R.A.: An objective replacement method for censored geochemical data. Math Geol 25(1), 59–80 (1993). https://doi.org/10.1007/BF00890676. http://link.springer.com/10.1007/BF00890676
Sims, D., Sudbery, I., Ilott, N.E., Heger, A., Ponting, C.P.: Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15(2), 121–132 (2014). https://doi.org/10.1038/nrg3642. http://www.nature.com/doifinder/10.1038/nrg3642
Tarazona, S., García-Alcalde, F., Dopazo, J., Ferrer, A., Conesa, A.: Differential expression in RNA-seq: a matter of depth. Genome Res 21(12), 2213–2223 (2011). https://doi.org/10.1101/gr.124321.111. http://www.ncbi.nlm.nih.gov/pubmed/21903743. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC3227109
Tukey, J.W.: Exploratory Data Analysis. Addison-Wesley Publishing Co. (1977)
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
LaRoche, D., Billheimer, D., Michels, K., LaFleur, B. (2019). Quality Control Metrics for Extraction-Free Targeted RNA-Seq Under a Compositional Framework. In: Liu, R., Tsong, Y. (eds) Pharmaceutical Statistics. MBSW 2016. Springer Proceedings in Mathematics & Statistics, vol 218. Springer, Cham. https://doi.org/10.1007/978-3-319-67386-8_21
DOI: https://doi.org/10.1007/978-3-319-67386-8_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67385-1
Online ISBN: 978-3-319-67386-8
eBook Packages: Mathematics and Statistics (R0)