Abstract
With the ever-growing emphasis on open science and its benefits to the scientific community, a research project no longer simply ends with a scientific publication. Data sharing and reproducibility of results have taken center stage in life science research, supported by the FAIR principles, which firmly underline the importance of open data. Current data-intensive, multidisciplinary research has also highlighted the significance of how data are mined and managed. Here we describe some of the features adopted by EMBL-EBI data resources to support data mining, data quality, and data management. We also highlight how EMBL-EBI has responded to the COVID-19 pandemic through its data resources.
Acknowledgments
The authors would like to thank Gaia Cantelli and Jessica Vamathevan for help with preparing the manuscript and Spencer Phillips for artwork.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Zadissa, A., Apweiler, R. (2022). Data Mining, Quality and Management in the Life Sciences. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 2449. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2095-3_1
DOI: https://doi.org/10.1007/978-1-0716-2095-3_1
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-2094-6
Online ISBN: 978-1-0716-2095-3
eBook Packages: Springer Protocols