eUTOPIA: solUTion for Omics data PreprocessIng and Analysis
The application of microarrays in omics technologies enables the simultaneous quantification of many biomolecules. Microarrays are widely used to observe, by quantitative comparison, the positive or negative effect on biomolecule activity in a perturbed state versus the steady state. Community resources such as Bioconductor and CRAN host R-based tools that have become standard for high-throughput analytics. However, applying these tools is technically challenging for generic users and requires specific computational skills. There is a need for an intuitive and easy-to-use platform to process omics data and to visualize and interpret the results.
We propose an integrated software solution, eUTOPIA, that implements a set of essential processing steps as a guided workflow presented to the user as an R Shiny application.
eUTOPIA allows researchers to preprocess and analyze microarray data through a simple and intuitive graphical interface while using state-of-the-art methods.
Keywords: Transcriptomics analysis, Microarray, Gene expression, R Shiny
Omics data have become an integral part of biological studies as researchers leverage these techniques to obtain a broader perspective of complex biological phenomena. Scientific community resources such as CRAN and Bioconductor contain an extensive collection of computational tools developed in the R programming language to process omics data. Applying these tools, however, requires a deep understanding of the computational and statistical aspects of the methods employed, while integrating them into a workflow requires a certain degree of proficiency in computer programming. We tasked ourselves with creating a solution that implements “state of the art” practices to preprocess, statistically analyze, and visualize microarray data from multiple platforms. eUTOPIA’s workflow caters to researchers with varied levels of computational and statistical skills. This solution fills a void in the research environment by allowing experimental biologists to process microarray data while avoiding critical errors caused by a lack of proficiency in programming and pipeline development. The guided interface simplifies the complex tasks of setting up the study and allows more focus on the biological interpretation of the results than on the technical challenges of executing statistical methods and generating visual representations.
There are seven main steps in the workflow: ‘DATA INPUT’ (Additional file 1, p. 8–13), ‘QUALITY CONTROL’ & ‘FILTERING’ (Additional file 1, p. 14), ‘NORMALIZATION’ (Additional file 1, p. 15), ‘BATCH CORRECTION & ANNOTATION’ (Additional file 1, p. 20–29), and ‘DIFFERENTIAL ANALYSIS’ (Additional file 1, p. 30–31).
eUTOPIA requires the user to provide a detailed phenotype information file (Additional file 2) containing all biological and technical variables of the samples in the experiment.
A quality control report can be generated from the raw microarray data with the affyQCReport, yaqcaffy, arrayQualityMetrics, or shinyMethyl R package, depending on the microarray platform. It is essential that poor-quality probes are omitted from the experimental data prior to normalization. On gene expression platforms, this is accomplished by estimating the robustness of the probe signals against the background (negative control probes). For Illumina methylation platforms, the minfi R package computes a detection p-value by comparing the total DNA signal (methylated + unmethylated) against the background signal. The expression value or p-value threshold determined from the background signal is used to evaluate each probe across a user-specified percentage of samples. Finally, the probes failing this evaluation are considered unreliable and are filtered out.
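The filtering rule described above can be illustrated with a minimal Python sketch. It is not eUTOPIA's own code (the tool is implemented in R); the function name, the threshold of 0.01, the required sample fraction of 0.75, and the p-values are all hypothetical and chosen only for illustration.

```python
def filter_probes(detection_p, p_threshold=0.01, min_sample_fraction=0.75):
    """Return the probes whose detection p-value passes in enough samples.

    detection_p maps a probe name to its detection p-values, one per sample.
    Both default thresholds are illustrative assumptions, not eUTOPIA defaults.
    """
    kept = []
    for probe, p_values in detection_p.items():
        # Count the samples in which the probe signal is distinguishable
        # from the background at the chosen threshold.
        passing = sum(1 for p in p_values if p < p_threshold)
        if passing / len(p_values) >= min_sample_fraction:
            kept.append(probe)
    return kept


# Made-up detection p-values for three probes across four samples.
detection_p = {
    "probe_A": [0.001, 0.002, 0.0005, 0.003],  # passes in 4 of 4 samples
    "probe_B": [0.001, 0.20, 0.35, 0.002],     # passes in 2 of 4 samples
    "probe_C": [0.004, 0.006, 0.05, 0.001],    # passes in 3 of 4 samples
}
kept_probes = filter_probes(detection_p)
print(kept_probes)  # ['probe_A', 'probe_C']
```

With these assumed settings, probe_B is discarded because it is detected in only half of the samples, below the required fraction.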
Normalization of the distribution of expression and methylation signals across the samples is performed with methods from the limma and minfi R packages, respectively. The scale, quantile, and loess methods perform normalization on log2-scaled intensities and ratios, while the vsn method uses a variance-stabilizing transformation, which performs better for weakly expressed features. Illumina methylation arrays are normalized with methods from the minfi R package, with options for background subtraction, control feature measure, dye bias, and quantile normalization.
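To make the idea behind the quantile method concrete, the following Python sketch forces every array to share the same distribution of values. It is a simplified illustration and not the limma implementation: ties are broken by position rather than averaged, missing values are not handled, and the input values are made up and assumed to be already on the log2 scale.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length arrays (one list per sample).

    Simplified sketch: ties are broken by position, not averaged.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)
    ]
    normalized = []
    for col in columns:
        # Indices of the features ordered from lowest to highest value.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            # Replace each value by the reference value of the same rank.
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Three hypothetical arrays measuring the same three features.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [3.0, 4.0, 8.0]]
normalized = quantile_normalize(arrays)
for col in normalized:
    print(col)
```

After normalization, each array contains exactly the same set of values, and only the assignment of values to features differs between arrays.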
In microarray data analysis, a fundamental step is to attenuate the effects associated with technical variables (batch effects) while retaining the variation associated with biological variables. Batch effects can arise for multiple reasons, most commonly when the experiments are conducted in multiple batches and the data are pooled for processing. These batches can contribute to the variability of the features and could introduce a systematic error in their assessment, leading in the worst case to incorrect results. Batch effects can be caused by known variables (e.g., dye, RNA quality, experiment date) or by hidden sources of variation not explained by the known variables. Known biological (e.g., treatment, disease status, age, tissue) and technical (e.g., dye, array) variables are provided by the user in the phenotype information, while unknown sources of variation can be identified by using the sva function from the sva R package. First, the impact of the technical variables is computed with the prince function from the swamp R package, and the correlation between biological and technical variables is evaluated from the confounding plot, which is generated by using the confounding function from the same package. This information is used to identify batch variables: known technical variables or surrogate variables that are associated with strong sources of variation and are not correlated with the biological variables of interest. These batch variables can justifiably be corrected to remove technical noise from the data. Finally, the correction is performed with the ComBat function from the sva R package, which employs an empirical Bayes approach to estimate systematic batch biases affecting many genes. The batch correction is carried out by specifying the variable of interest, any biological covariates, and a set of known batches or surrogate variables (obtained from the sva function described above).
The batch correction process implemented in eUTOPIA applies the ComBat function iteratively to remove one batch covariate at a time while the rest are modeled as covariates of interest. The ComBat function can process only one batch covariate at a time, and this process of blocking other batches ensures clear separation of variation adjustment effects for each batch without any interference from the adjustment of other batches.
Linear models make it possible to model the covariate dependencies between samples. The differential analysis is performed with the linear model implementation in the limma R package. The lmFit function fits gene-wise linear models to the microarray data. The user defines the design of the model by providing the biological variable of interest and the covariates (biological and technical batch variables). The contrasts of interest are then specified to obtain contrast-specific coefficients from the original coefficients of the linear model. The eBayes function is applied to assess differential expression by using the fitted model with the contrast coefficients. Final reporting of the differentially expressed genes is performed with the topTable function, where p-values adjusted for multiple comparisons can be obtained by specifying one of the methods “Holm”, “Hochberg”, “Hommel”, “Bonferroni”, “Benjamini & Hochberg”, “Benjamini & Yekutieli” or “False Discovery Rate”. Differential analysis results for the comparisons defined by the user are reported in tabular format together with several visualizations. The dynamic plots support a preliminary interpretation of the analysis results. Volcano plots show the distribution of the differential features by fold-change magnitude and significance for a chosen contrast. The heatmap shows the expression profile of a user-specified number of top significant features from one or more contrasts. Venn diagrams or UpSet plots represent the comparison of differential features from different contrasts by set intersections. Finally, box plots show the distribution of signal for one or more genes of interest across sample annotations (e.g., experimental condition). A user manual with sample data analysis and plot descriptions is provided in Additional file 1.
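As an illustration of the multiple-comparison adjustment step, the sketch below computes Benjamini & Hochberg adjusted p-values in plain Python. It mirrors the standard step-up procedure and is only a stand-in for the adjustment that limma performs in R; the function name and the raw p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini & Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Visit the p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank 1 corresponds to the smallest p-value
        # Scale by m / rank, then enforce monotonicity with a running minimum.
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for four features.
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(raw_p)
print([round(p, 4) for p in adjusted_p])  # [0.04, 0.0533, 0.0533, 0.2]
```

The third feature has a smaller raw p-value than the second, yet both receive the same adjusted value because the procedure keeps the adjusted values monotone in the ranks.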
Results and discussion
The functional capabilities of eUTOPIA are showcased here by processing the publicly available dataset GSE92900, obtained from the GEO repository. The phenotype table provided in Additional file 2 is used for sample annotations.
In Fig. 2a, array 10 has large distances to the rest of the arrays, as represented by the bright and dull yellow cells. The same information can be read from the hierarchical clustering in the right margin of Fig. 2a, where array 10 clusters separately from the others. Outliers are detected from this distance information as arrays with an exceptionally large distance to all other arrays. A summarized distance measure is determined for each array by summing its distances to all other arrays, and this sum is checked against an outlier threshold derived from the well-established interquartile range (IQR) rule known as Tukey's fences. The outliers can be observed in Fig. 2b, where array 10 has a much larger summarized distance than the rest of the arrays and is clearly recognizable as an outlier.
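The outlier rule described above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the code of the underlying quality control package: the fence multiplier k = 1.5 is the conventional Tukey value, the quartiles use Python's "inclusive" method (other quartile definitions give slightly different fences), and the distance matrix is built from made-up one-dimensional positions.

```python
import statistics


def summed_distance_outliers(distance_matrix, k=1.5):
    """Flag arrays whose summed distance exceeds the upper Tukey fence.

    Returns (outlier indices, summed distances, upper fence).
    """
    # Summarize each array by the sum of its distances to all other arrays.
    totals = [sum(row) for row in distance_matrix]
    q1, _, q3 = statistics.quantiles(totals, n=4, method="inclusive")
    upper_fence = q3 + k * (q3 - q1)
    outliers = [i for i, total in enumerate(totals) if total > upper_fence]
    return outliers, totals, upper_fence


# Toy data: six arrays placed on a line, the last one far from the others.
positions = [0.0, 1.0, 2.0, 3.0, 4.0, 20.0]
distances = [[abs(a - b) for b in positions] for a in positions]

outliers, totals, upper_fence = summed_distance_outliers(distances)
print(totals)       # [30.0, 26.0, 24.0, 24.0, 26.0, 90.0]
print(upper_fence)  # 35.75
print(outliers)     # [5]
```

Only the last toy array exceeds the fence, which matches the intuition that an outlier is far from every other array rather than from a single one.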
The difference in the distribution of expression values in different arrays can be observed in the box plot before normalization (Fig. 3a), suggesting the need for adjustment of the distributions for fair comparison across the set of arrays. The distribution of expression values is harmonious across the arrays in the box plot after normalization with the quantile method (Fig. 3b).
The difference in the distribution of expression values from individual channels of different arrays can be observed from the smoothed curves in the density plot (Fig. 3c) before correction. Only a single smoothed curve is visible in the density plot after normalization (Fig. 3d) since the channels from all arrays have the same distribution of expression values as a result of quantile normalization.
The larger scale of the log2 ratios (M-values on the y-axis) observed in the MD plot before normalization (Fig. 3e) shows that the data points lie far from a log2 expression ratio of zero, suggesting bias. The average log2 expression values (x-axis) also span a large scale before normalization. The smaller scale of the M-values (y-axis) observed in the MD plot after normalization (Fig. 3f) shows that the data points are much closer to a log2 expression ratio of zero because the bias has been adjusted. The plot also shows that the average log2 expression values (x-axis) span a much smaller scale after normalization.
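For readers unfamiliar with the two axes of an MD plot, the following Python sketch computes them for a two-channel array. It assumes the usual definitions, with M as the log2 ratio of the two channels and the x-axis value as their average log2 intensity; the channel names and intensities are hypothetical.

```python
import math


def md_values(red, green):
    """Return (M, A) for paired channel intensities.

    M is the log2 ratio of the channels (y-axis of the MD plot);
    A is the average log2 intensity (x-axis of the MD plot).
    """
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a


# Made-up raw intensities for three features in the two channels.
red = [1024.0, 256.0, 64.0]
green = [256.0, 256.0, 256.0]
m_values, a_values = md_values(red, green)
print(m_values)  # [2.0, 0.0, -2.0]
print(a_values)  # [9.0, 8.0, 7.0]
```

A feature with equal intensity in both channels has M = 0, so a cloud of points centered on zero after normalization indicates that systematic differences between the channels have been reduced.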
eUTOPIA allows users to reliably process microarray data and generate visual interpretations of the results seamlessly via a simple yet intuitive interface. It is focused on the preprocessing of microarray data, thus providing an agile and robust alternative to more comprehensive tools. Both commercially available and free software for microarray data preprocessing either provide limited methodological options or force the user to design an appropriate pipeline from the tools provided. This can be a daunting task for researchers with limited experience in omics data analysis. eUTOPIA is designed to balance both reliability and flexibility.
The case study showcases the features of eUTOPIA that enable the user to perform array data preprocessing and analysis. The guided workflow makes it easy to preprocess the data, identify sources of unwanted variation, remove them, and evaluate the sanity of the corrected data. The graphical user interface makes it easy to define a linear model to test the data, and the wide choice of dynamic plots helps to classify and characterize conditions by sets of differential features and expression patterns.
We compared eUTOPIA against a set of microarray analysis tools that are free for academic use and have a graphical interface, namely AGA, shinyMethyl, MeV, O-miner, Chipster, and Babelomics. The comparison table (Additional file 4) evaluates the tools over a list of implemented analytical steps and supported data platforms. One major feature that most of the compared tools do not support is the batch correction of known variables and, more noticeably, of surrogate variables. eUTOPIA’s analysis workflow integrates the visual representation of sample annotation and principal components of variation to identify batch variables, along with the ability to correct batch effects seamlessly. Furthermore, it is the only tool in the comparison that incorporates the identification, visualization, and correction of surrogate variables (hidden batches). This process of batch identification and correction is of extreme importance for microarray analysis because it helps to isolate technical noise from the biological signal. In this comparison, Chipster has the most comprehensive toolbox; it provides the most features, which even extend beyond microarray analysis. However, these features are presented as separate tools, with no specific workflows or guidelines for choosing the optimal set of tools. This can pose a challenge to users who need to design analysis workflows by combining these tools appropriately. In contrast, eUTOPIA incorporates a specific set of tools in a streamlined workflow to ensure intuitiveness and ease of use. eUTOPIA does not impose the technical challenges of workflow design on the user, who can thus focus more on the biological aspects of the data and results.
Availability and requirements
Project name: eUTOPIA
Project home page: https://github.com/Greco-Lab/eUTOPIA
Operating system(s): Platform independent
Programming language: R Shiny
License: GNU GPL 3.
Any restrictions to use by non-academics: none
We would like to thank Nanna Fyhrquist (University of Helsinki, Karolinska Institute) and Marit Ilves (University of Helsinki) for providing valuable user feedback during the development process.
This study was supported by the Academy of Finland (grant agreements 275151 and 292307) and EU H2020 LIFEPATH (grant agreement 633666).
Availability of data and materials
VM implemented the stepwise guided workflow for microarray data analysis as an interactive R Shiny app and wrote the manuscript. GS defined the framework for Illumina methylation analysis, performed evaluation of the implemented R Shiny app, and critically evaluated the manuscript. PK contributed to the case study and performed critical evaluation of the manuscript. AS contributed to the implementation of the R Shiny app, performed evaluation of the implemented R Shiny app, and critically evaluated the manuscript. HA performed evaluation of the implemented R Shiny app. VF developed the methods and framework for the Agilent microarray analysis, developed the methodology of batch effect mitigation, and critically evaluated the manuscript. DG conceived and supervised the project, contributed to the development of the analysis framework and R Shiny app implementation, and critically evaluated the manuscript. All authors reviewed and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 4. Parman C, Halling C, Gentleman R. affyQCReport: QC Report Generation for affyBatch objects; 2017.
- 5. Gatto L. yaqcaffy: Affymetrix expression data quality control and reproducibility analysis; 2017.
- 7. Fortin J-P, Fertig E, Hansen K. shinyMethyl: interactive quality control of Illumina 450k DNA methylation arrays in R. F1000Res. 2014;3.
- 16. Considine M, Parker H, Wei Y, Xia X, Cope L, Ochs M, et al. AGA: interactive pipeline for reproducible gene expression and DNA methylation data analyses. F1000Res. 2015;4.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.