Quality Control Metrics for Extraction-Free Targeted RNA-Seq Under a Compositional Framework

  • Conference paper
Pharmaceutical Statistics (MBSW 2016)

Part of the book series: Springer Proceedings in Mathematics & Statistics ((PROMS,volume 218))

Abstract

The rapid rise in the use of RNA sequencing technology (RNA-Seq) for scientific discovery has led to its consideration as a clinical diagnostic tool. However, because RNA-Seq is a new technology, its analytical accuracy and reproducibility must be established before it can realize its full clinical utility (SEQC/MAQC-III Consortium, 2014; VanKeuren-Jensen et al. 2014). We respond to the need for reliable diagnostics, quality control metrics, and improved reproducibility by recognizing and capitalizing on the relative frequency nature of RNA-Seq data. Problems with sample quality, library preparation, or sequencing may result in a low number of reads allocated to a given sample within a sequencing run. We propose a method, based on outlier detection of Centered Log-Ratio (CLR) transformed counts, for objectively identifying problematic samples from the total number of reads allocated to each sample. Normalization and standardization methods for RNA-Seq generally assume that the total number of reads assigned to a sample does not affect the observed relative frequencies of probes within an assay. This assumption, known as Compositional Invariance, is an important property of RNA-Seq data because it enables the comparison of samples with differing read depths. Violations of the invariance property can lead to spurious differential expression results, even after normalization. We develop a diagnostic method to identify violations of the Compositional Invariance property. Batch effects arising from differing laboratory conditions or operator differences have been identified as a problem in high-throughput measurement systems (Leek et al. in Nat Rev Genet 11, 733–739 [14]; Chen et al. in PLoS One 6 [10]). Batch effects are typically identified with a hierarchical clustering (HC) method or principal components analysis (PCA). For both methods, the multivariate distance between the samples is visualized, either in a biplot for PCA or a dendrogram for HC, to check for clusters of samples related to batch. We show that CLR-transformed RNA-Seq data are appropriate for evaluation in a PCA biplot and improve batch effect detection over current methods. As RNA-Seq makes the transition from the research laboratory to the clinic, there is a need for robust quality control metrics. The realization that RNA-Seq data are compositional opens the door to the existing body of theory and methods developed by Aitchison (The statistical analysis of compositional data, Chapman & Hall Ltd., 1986) and others. We show that the properties of compositional data can be leveraged to develop new metrics and improve existing methods.
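The two CLR-based ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure, which the abstract does not spell out. The sketch assumes that the per-sample read totals within a run are treated as a composition, that "outlier" means falling below Tukey's lower fence (multiplier 1.5), and that zeros in the probe counts are handled with a pseudocount of 0.5. The function names `clr`, `flag_low_read_samples`, and `clr_pca_scores`, and the example read totals, are hypothetical. The CLR itself is standard: the log of each part minus the mean of the logs over all parts.

```python
import numpy as np

def clr(x, axis=-1):
    """Centered log-ratio: log of each part minus the mean log over the parts."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=axis, keepdims=True)

def flag_low_read_samples(total_reads, k=1.5):
    """Flag samples whose CLR-transformed read total is below Tukey's lower fence.

    Assumption: the read totals of all samples in one run form a composition.
    """
    z = clr(total_reads)
    q1, q3 = np.percentile(z, [25, 75])
    return z < q1 - k * (q3 - q1)

def clr_pca_scores(counts, pseudocount=0.5, n_components=2):
    """PCA scores of CLR-transformed counts (rows = samples, columns = probes).

    Assumption: a fixed pseudocount replaces zeros before taking logs.
    """
    z = clr(np.asarray(counts, dtype=float) + pseudocount, axis=1)
    z = z - z.mean(axis=0, keepdims=True)          # center each probe
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]  # sample coordinates

# Hypothetical run: five typical samples and one with very few reads.
totals = np.array([1.2e6, 0.9e6, 1.1e6, 1.0e6, 1.3e6, 2.0e4])
low = flag_low_read_samples(totals)  # only the last sample is flagged
```

Plotting the first two columns of `clr_pca_scores(counts)` and coloring the points by batch gives the sample view of a PCA biplot. Because the CLR is unchanged when a sample's counts are multiplied by a constant, differences in read depth alone do not move a sample in that plot.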

References

  1. Aitchison, J.: The Statistical Analysis of Compositional Data. Chapman & Hall, Ltd. (1986). http://dl.acm.org/citation.cfm?id=17272

  2. Aitchison, J.: On criteria for measures of compositional difference. Math Geol 24(4), 365–379 (1992). https://doi.org/10.1007/BF00891269. http://link.springer.com/10.1007/BF00891269

  3. Aitchison, J., Greenacre, M.: Biplots of compositional data. J R Stat Soc Series C (Appl Stat) 51(4), 375–392 (2002). https://doi.org/10.1111/1467-9876.00275. http://doi.wiley.com/10.1111/1467-9876.00275

  4. Aitchison, J., Barceló-Vidal, C., Martín-Fernández, J.A., Pawlowsky-Glahn, V.: Logratio analysis and compositional distance. Math Geol 32(3), 271–275 (2000). https://doi.org/10.1023/A:1007529726302

  5. Aitchison, J., Shen, S.: Logistic-normal distributions: some properties and uses. Biometrika 67(2), 261–272 (1980). https://doi.org/10.1093/biomet/67.2.261. https://www.researchgate.net/publication/229099731_Logistic-Normal_Distributions_Some_Properties_and_Uses

  6. Anders, S., Huber, W.: Differential expression analysis for sequence count data. Genome Biol 11(10), R106 (2010). https://doi.org/10.1186/gb-2010-11-10-r106. http://www.biomedcentral.com/content/pdf/gb-2010-11-10-r106.pdf

  7. Ben-Gal, I.: Outlier detection. In: Data Mining and Knowledge Discovery Handbook, pp. 117–130. Springer, US (2009). https://doi.org/10.1007/978-0-387-09823-4_7. http://link.springer.com/10.1007/978-0-387-09823-4_7

  8. Billheimer, D., Guttorp, P., Fagan, W.F.: Statistical interpretation of species composition. J Am Stat Assoc 96(456), 1205–1214 (2001). https://doi.org/10.1198/016214501753381850. http://www.jstor.org/stable/3085883

  9. Bolstad, B.M., Irizarry, R.A., Astrand, M., Speed, T.P.: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (Oxford, England) 19(2), 185–193 (2003). http://www.ncbi.nlm.nih.gov/pubmed/12538238

  10. Chen, C., Grennan, K., Badner, J., Zhang, D., Gershon, E., Jin, L., Liu, C.: Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One 6(2) (2011). https://doi.org/10.1371/journal.pone.0017238

  11. Dillies, M.A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, N.S., Castel, D., Estelle, J., Guernec, G., Jagla, B., Jouneau, L., Laloe, D., Le Gall, C., Schaeffer, B., Le Crom, S., Guedj, M., Jaffrezic, F.: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings Bioinform 14(6), 671–683 (2013). https://doi.org/10.1093/bib/bbs046

  12. Hawkins, D.M.: Identification of Outliers. Springer Netherlands, Dordrecht (1980). https://doi.org/10.1007/978-94-015-3994-4. http://link.springer.com/10.1007/978-94-015-3994-4

  13. Law, C.W., Chen, Y., Shi, W., Smyth, G.K.: voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15(2), R29 (2014). https://doi.org/10.1186/gb-2014-15-2-r29. http://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29

  14. Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly, K., Irizarry, R.A.: Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11(10), 733–739 (2010). https://doi.org/10.1038/nrg2825. http://dx.doi.org/10.1038/nrg2825

  15. Lovell, D., Müller, W., Taylor, J., Zwart, A., Helliwell, C.: Proportions, percentages, PPM: do the molecular biosciences treat compositional data right? In: Compositional Data Analysis: Theory and Applications, pp. 191–207. Wiley (2011). https://doi.org/10.1002/9781119976462.ch14. http://dx.doi.org/10.1002/9781119976462.ch14

  16. Lovell, D., Pawlowsky-Glahn, V., Egozcue, J.J., Marguerat, S., Bähler, J.: Proportionality: a valid alternative to correlation for relative data. PLoS Comput Biol 11(3), e1004075 (2015). https://doi.org/10.1371/journal.pcbi.1004075. http://www.ncbi.nlm.nih.gov/pubmed/25775355

  17. Luo, J., Schumacher, M., Scherer, A., Sanoudou, D., Megherbi, D., Davison, T., Shi, T., Tong, W., Shi, L., Hong, H., Zhao, C., Elloumi, F., Shi, W., Thomas, R., Lin, S., Tillinghast, G., Liu, G., Zhou, Y., Herman, D., Li, Y., Deng, Y., Fang, H., Bushel, P., Woods, M., Zhang, J.: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J 10(4), 278–291 (2010). https://doi.org/10.1038/tpj.2010.57. http://www.ncbi.nlm.nih.gov/pubmed/20676067 www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2920074

  18. Martín-Fernández, J.A., Barceló-Vidal, C., Pawlowsky-Glahn, V.: Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math Geol 35(3), 253–278 (2003). https://doi.org/10.1023/A:1023866030544. http://link.springer.com/article/10.1023/A%3A1023866030544

  19. Martín-Fernández, J.A., Barceló-Vidal, C., Pawlowsky-Glahn, V., Buccianti, A., Nardi, G., Potenza, R.: Measures of difference for compositional data and hierarchical clustering methods. In: Proceedings of IAMG, vol. 98, no. 1, pp. 526–531 (1998)

  20. Martín-Fernández, J.A., Hron, K., Templ, M., Filzmoser, P., Palarea-Albaladejo, J.: Bayesian multiplicative treatment of count zeros in compositional data sets. Stat Model 15(2), 134–158 (2015). http://ezproxy.library.arizona.edu/login?url=https://search-proquest-com.ezproxy1.library.arizona.edu/docview/1673859465?accountid=8360

  21. Pearson, K.: Mathematical contributions to the theory of evolution. On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc R Soc Lond 60, 489–498 (1896). https://archive.org/details/philtrans00847732

  22. Robinson, M.D., Oshlack, A.: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11(3), R25 (2010). https://doi.org/10.1186/gb-2010-11-3-r25

  23. Robinson, M.D., Smyth, G.K.: Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9(2), 321–332 (2007). https://doi.org/10.1093/biostatistics/kxm030. http://biostatistics.oxfordjournals.org/cgi/doi/10.1093/biostatistics/kxm030

  24. Sanford, R.F., Pierson, C.T., Crovelli, R.A.: An objective replacement method for censored geochemical data. Math Geol 25(1), 59–80 (1993). https://doi.org/10.1007/BF00890676. http://link.springer.com/10.1007/BF00890676

  25. Sims, D., Sudbery, I., Ilott, N.E., Heger, A., Ponting, C.P.: Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15(2), 121–132 (2014). https://doi.org/10.1038/nrg3642. http://www.nature.com/doifinder/10.1038/nrg3642

  26. Tarazona, S., García-Alcalde, F., Dopazo, J., Ferrer, A., Conesa, A.: Differential expression in RNA-seq: a matter of depth. Genome Res 21(12), 2213–2223 (2011). https://doi.org/10.1101/gr.124321.111. http://www.ncbi.nlm.nih.gov/pubmed/21903743 www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC3227109

  27. Tukey, J.W.: Exploratory Data Analysis. Addison-Wesley Publishing Co. (1977)

Author information

Corresponding author

Correspondence to Dominic LaRoche.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

LaRoche, D., Billheimer, D., Michels, K., LaFleur, B. (2019). Quality Control Metrics for Extraction-Free Targeted RNA-Seq Under a Compositional Framework. In: Liu, R., Tsong, Y. (eds) Pharmaceutical Statistics. MBSW 2016. Springer Proceedings in Mathematics & Statistics, vol 218. Springer, Cham. https://doi.org/10.1007/978-3-319-67386-8_21
