Background

Repositories of genome-wide expression studies such as ArrayExpress [1] have grown rapidly over the last few years and continue to do so. The more experimental data are deposited in these repositories, the more likely it becomes that some of them can provide a meaningful biological context for planning and analyzing new studies. Retrieving experiments based only on their textual description and experimental design has two main shortcomings. First, a textual description of an experiment or its results is not as information-rich as the data themselves. Second, information about the experimental design alone is of limited use for retrieving biologically relevant data because it does not reflect the results, which contain the bulk of the information and may reveal unexpected relationships. We introduce novel retrieval methods that incorporate the actual gene expression measurements into the search process, along with visualization tools for interpreting and exploring the results [2].

Methods

We developed a two-stage procedure, first identifying differentially active gene sets in each experiment using a recent nonparametric statistical method [3], and then combining gene set activation patterns into higher-level structures, so-called biological topics, using a state-of-the-art probabilistic model [4]. The probabilistic formulation enables the use of a natural and rigorous metric for assessing the similarity between two experiments. For interpreting and exploring retrieval results, we have developed visualization methods that also provide insight into the model used to perform the retrieval.
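
To make the retrieval step concrete, the following is a minimal sketch of one way a probabilistic relevance score between experiments could be computed from a fitted topic model. It is an illustration under stated assumptions, not the implementation used in this work: we assume that each experiment is summarized by a mixture over biological topics (theta), that each topic is a probability distribution over gene sets (phi), and that relevance is the log-probability of the query's gene set activation counts under another experiment's mixture. All names and numbers (query_log_likelihood, rank_experiments, exp_A, the toy values) are hypothetical.

```python
import math

def query_log_likelihood(query_counts, topic_mix, phi):
    """Log-probability of the query's gene set counts under one
    experiment's topic mixture (assumed scoring rule, see lead-in)."""
    total = 0.0
    for g, count in enumerate(query_counts):
        if count == 0:
            continue  # inactive gene sets contribute nothing
        # Probability of gene set g under this experiment: sum over topics.
        p_g = sum(topic_mix[t] * phi[t][g] for t in range(len(topic_mix)))
        # Assumes p_g > 0, i.e. the topic distributions are smoothed.
        total += count * math.log(p_g)
    return total

def rank_experiments(query_counts, theta, phi):
    """Return experiment names ordered from most to least relevant."""
    scores = {
        name: query_log_likelihood(query_counts, mix, phi)
        for name, mix in theta.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy model: 2 topics over 3 gene sets (hypothetical numbers).
phi = [
    [0.7, 0.2, 0.1],  # topic 0 favors gene set 0
    [0.1, 0.2, 0.7],  # topic 1 favors gene set 2
]
# Topic mixtures of three stored experiments (hypothetical).
theta = {
    "exp_A": [0.9, 0.1],
    "exp_B": [0.1, 0.9],
    "exp_C": [0.5, 0.5],
}
# Query experiment: gene set 0 is strongly active, gene set 1 weakly.
query_counts = [5, 1, 0]

ranking = rank_experiments(query_counts, theta, phi)
print(ranking)
```

In this toy example the experiment dominated by topic 0 ranks first, because topic 0 assigns the highest probability to the gene set that is most active in the query. A real application would estimate theta and phi from the gene set activation data rather than fixing them by hand.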

Results

We show that the gene sets corresponding to each biological topic form highly coherent and holistic components. Several case studies performed on a subset of ArrayExpress show that, provided sufficient data are available, our method can retrieve experiments relevant to a biological question. The case studies also highlight relations between experiments, either because the same biological questions were targeted or because of unexpected relationships that were confirmed in the literature. The visualization methods allow us both to interpret the model efficiently and to put retrieval results in the context of the whole set of experiments (see Figure 1 for an example).

Figure 1

2D NeRV projection of retrieval results when the model is queried with a malignant melanoma experiment. Each experiment is represented as a striped glyph. Colors indicate biological topics. Stripe widths indicate the predominance of each biological topic in each experiment. Glyph size indicates relevance to the malignant melanoma query experiment [5].

Conclusion

Using a combination of existing and novel methods for modeling and visualizing a heterogeneous collection of gene expression experiments, we were able to decompose and relate experiments via biologically meaningful components. Our approach allows search within a gene expression database to be driven by actual measurement data.