Abstract
In modern plant biology, progress is increasingly defined by scientists’ ability to gather and analyze data sets of high volume and complexity, otherwise known as “big data”. Arguably, the largest increase in the volume of plant data sets over the last decade is a consequence of the application of next-generation sequencing and mass-spectrometry technologies to the study of experimental model and crop plants. The increase in the quantity and complexity of biological data brings challenges, mostly associated with data acquisition, processing, and sharing within the scientific community. Nonetheless, big data in plant science create unique opportunities to advance our understanding of complex biological processes at an unprecedented level of accuracy, and establish a foundation for plant systems biology. In this chapter, we summarize the major drivers of big data in plant science and big data initiatives in the life sciences, with a focus on the scope and impact of iPlant, a representative cyberinfrastructure platform for plant science.
Acknowledgements
This work was supported by the National Science Foundation (project IOS-1025642 to S.C.P.) and by CNCSIS-UEFISCDI (project PN-II-PT-PCCA-2011-3.1-1350 to G.V.P.). C.N. was supported by the iPlant Collaborative grant DBI-0735191.
Copyright information
© 2016 Springer Science+Business Media New York
About this protocol
Cite this protocol
Popescu, G.V., Noutsos, C., Popescu, S.C. (2016). Big Data in Plant Science: Resources and Data Mining Tools for Plant Genomics and Proteomics. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_27
DOI: https://doi.org/10.1007/978-1-4939-3572-7_27
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3570-3
Online ISBN: 978-1-4939-3572-7
eBook Packages: Springer Protocols