Genomics is a relatively young discipline: in the last two decades, after the introduction of Next Generation Sequencing (NGS) technologies, DNA/RNA sequencing has benefited from notable cost and time reductions. NGS data is employed at three levels. Primary analysis produces raw datasets of nucleotide bases, reaching a typical size of 200 gigabytes per single human genome when stored. Secondary analysis produces regions of interest, including mutations (positions where the code of an individual differs from the code of the "reference" human genome), possibly associated with genetic diseases and cancers; gene expression (indicating in which conditions genes are active); and epigenomic signals (phenotype changes not involving alterations in the genetic sequence). Finally, tertiary analysis aggregates and combines heterogeneous datasets produced during the preceding phases, trying to "make sense" of the data and unveil complex biological mechanisms.
To boost this last, and most interesting, type of analysis, thousands of datasets become available every day, typically produced within the scope of large cooperative efforts and made openly available for secondary research use; these include the Encyclopedia of DNA Elements (ENCODE), the Genomic Data Commons (GDC), the Gene Expression Omnibus (GEO), Roadmap Epigenomics, and the 1000 Genomes Project. In addition to these well-known sources, we are witnessing the birth of several population-specific and nation-scale sequencing initiatives.
In the following, we focus on processed genomic datasets, which include experimental observations—representing regions along the genome chromosomes with their properties—and metadata, carrying information about the observed biological phenomena. The integration of genomic data and of their describing metadata is a challenge that is at the same time important (as a wealth of public data repositories is available to drive biological and clinical research), difficult (as the domain is complex and there is no agreement among the various data formats and definitions), and well-recognized (because, in the common practice, repositories are accessed one by one, with tedious and error-prone efforts). Although the potential collective amount of available information is huge, the effective combination of genomic datasets is hindered by their heterogeneity (in terms of download protocols, formats, notations, attribute names, and values) and lack of interconnectedness.
Motivating Example Let us consider a researcher who is looking for data to perform a comparison study between a human non-healthy breast tissue, affected by carcinoma, and a healthy sample coming from the same tissue type. Exploiting her previous experience, the researcher locates three portals having interesting data for this analysis (see Fig. 1). On GDC, one or more cases can be retrieved with the query "disease = Breast Invasive Carcinoma". To compare such data with references, the researcher chooses additional datasets coming from cell lines, a standard benchmark for investigations. On the GEO web interface, tumor cell line data is found by browsing thousands of human samples (e.g., the "T47D-MTVL" cell line exhibits the disease "breast cancer ductal carcinoma"). Finally, on ENCODE, the researcher chooses both a tumor cell line ("MCF-7", affected by "Breast cancer (adenocarcinoma)") and a normal cell line ("MCF-10A", widely considered the non-tumorigenic counterpart), to make a control comparison. As can be noted, when searching for disease-related information, both attribute names and values take many forms, which only possibly point to comparable samples. This kind of information is not encoded in a unique way over data sources and is often missing. Considerable external knowledge is necessary to find appropriate connections; this cannot be obtained on the mentioned portals, but needs to be retrieved manually by querying specific databases, dedicated forums, or specialized ontologies.
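The heterogeneity just described can be made concrete with a small sketch: the same disease concept appears under different attribute names and value spellings on each portal (the values below are taken from the example; the attribute names and the normalization table are our illustrative assumptions, not anything the portals provide).

```python
# Illustrative sketch: the same disease concept under different attribute
# names and spellings. Attribute keys and the mapping are hypothetical.
source_records = [
    {"source": "GDC", "disease": "Breast Invasive Carcinoma"},
    {"source": "GEO", "sample": "T47D-MTVL",
     "disease state": "breast cancer ductal carcinoma"},
    {"source": "ENCODE", "biosample": "MCF-7",
     "health status": "Breast cancer (adenocarcinoma)"},
]

# Naive, manually curated normalization table (assumed for illustration).
NORMALIZED = {
    "Breast Invasive Carcinoma": "breast carcinoma",
    "breast cancer ductal carcinoma": "breast carcinoma",
    "Breast cancer (adenocarcinoma)": "breast carcinoma",
}

def normalized_disease(record):
    """Return a shared disease term, looking under source-specific keys."""
    for key in ("disease", "disease state", "health status"):
        if key in record:
            return NORMALIZED.get(record[key], record[key])
    return None

terms = {normalized_disease(r) for r in source_records}
print(terms)  # all three records collapse to one shared term
```

A hand-written table like this obviously does not scale; the point is precisely that connecting such values in practice requires external knowledge, such as specialized ontologies.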
In this chapter we describe the research carried out in the context of the Genomic Computing ERC project, concerned with designing and building a repository of processed NGS genomic datasets using a systematic and repeatable approach:
Model: we analyze the domain state of the art (including scouting of online resources/documentation and testing of data retrieval methods). Data is studied with the goal of proposing a conceptual model for the main characteristics shared by relevant data sources in the field, targeting completeness but favoring simplicity, so as to produce easy-to-use systems for biologists and genomic experts.
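The kind of conceptual model described above, in which every processed dataset pairs genomic regions with descriptive metadata, can be sketched as follows; the class and field names are our own illustrative choices, not the project's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    # Genomic coordinates shared by all processed datasets.
    chrom: str
    start: int
    stop: int
    strand: str = "*"
    # Source-specific features (e.g., a mutation or a signal value).
    features: dict = field(default_factory=dict)

@dataclass
class Sample:
    # Experimental observations: regions along the genome chromosomes.
    regions: list = field(default_factory=list)
    # Metadata: free attribute-value pairs describing the phenomenon.
    metadata: dict = field(default_factory=dict)

s = Sample(
    regions=[Region("chr1", 100, 200, "+", {"signal": 3.5})],
    metadata={"assay": "ChIP-seq", "disease": "breast carcinoma"},
)
```

Keeping regions and metadata as two simple, loosely coupled parts reflects the stated trade-off: complete enough to cover heterogeneous sources, simple enough for non-programmer domain experts.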
Integrate and build: we select interesting open data sources for the domain, build solid pipelines to download data from them, and transform it into a standard interoperable format, obtaining a repository of homogenized data, to be used seamlessly from a unique endpoint, allowing integrative biological queries.
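A minimal sketch of this integration step, assuming invented record shapes for the sources: each source gets its own transformer that maps raw records into one shared (regions, metadata) format, so the resulting repository can be queried from a single endpoint. The field names and coordinate values below are hypothetical.

```python
# Hypothetical per-source transformers into a shared repository format.
def transform_gdc(raw):
    return {
        "regions": [(raw["chr"], raw["start"], raw["end"])],
        "metadata": {"source": "GDC", "disease": raw["project_disease"]},
    }

def transform_encode(raw):
    return {
        "regions": [(raw["chrom"], raw["chromStart"], raw["chromEnd"])],
        "metadata": {"source": "ENCODE", "disease": raw["health_status"]},
    }

# Invented raw records, standing in for downloaded source files.
repository = [
    transform_gdc({"chr": "chr17", "start": 7668402, "end": 7687550,
                   "project_disease": "Breast Invasive Carcinoma"}),
    transform_encode({"chrom": "chr17", "chromStart": 7668402,
                      "chromEnd": 7687550,
                      "health_status": "Breast cancer (adenocarcinoma)"}),
]
# Both records now share one schema, independently of their origin.
```

The design choice illustrated here is that all source-specific logic lives in the transformers; everything downstream of the repository sees one interoperable format.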
Search: we target the end-users of the repository, i.e., experts of the domain who browse the repository in search of datasets to prove or disprove their research hypotheses. Interfaces need to take into account their background: considerable biological knowledge, but limited understanding of programming languages.
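For users with limited programming background, search can reduce to matching attribute-value pairs over metadata, hiding any query language. The sketch below is a hypothetical illustration of this idea; the sample records are invented.

```python
# Hypothetical metadata search: filters are plain attribute=value pairs.
def search(repository, **filters):
    """Return samples whose metadata matches all given attribute-value pairs."""
    return [s for s in repository
            if all(s["metadata"].get(k) == v for k, v in filters.items())]

repository = [
    {"id": "S1", "metadata": {"source": "ENCODE", "biosample": "MCF-7",
                              "disease": "breast carcinoma"}},
    {"id": "S2", "metadata": {"source": "ENCODE", "biosample": "MCF-10A",
                              "disease": "none"}},
]

hits = search(repository, disease="breast carcinoma")
print([s["id"] for s in hits])  # → ['S1']
```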
During the first phase of the COVID-19 pandemic, in March and April 2020, we responded proactively to the call to arms issued by the broad scientific community. We conducted an extensive requirement analysis by engaging in interdisciplinary conversations with a variety of scientists, including virologists, geneticists, biologists, and clinicians. This preliminary activity convinced us of the need for a structured proposal for viral data modeling and management. We thus reapplied the previously proposed methodology: modeling a data domain, integrating many sources to build a global repository, and finally making its content searchable to enable further analysis. This experience suggests that our approach is general enough to be applied to any domain of life sciences and encourages broader adoption.
Chapter organization. Section 2 describes our data integration proposal in the field of genomic tertiary analysis; Sect. 3 is dedicated to understanding the world of viral sequences and their descriptions (collected samples, host characteristics, variants and their impact on the related disease); Sect. 4 envisions how the approaches of the two previous sections could be combined into a single system to drive powerful biological discovery.