Background

DNA microarray technology has evolved rapidly to become the most popular platform for high-throughput gene expression analysis, as it allows biologists to measure the expression of entire transcriptomes at relatively high speed and low cost. This makes microarrays ideal for applications like sample clustering/fingerprinting, genome annotation, detection of differential gene expression, detection of polymorphisms and re-sequencing [1, 2]. Microarrays contain oligonucleotides (probes) that can hybridise with the labelled reverse complement of mRNA. Since the probes are immobilised on the surface of an array at known locations, the signal at a given spot can be used as a measure of gene expression. This requires that probes are unique for their target genes, and hence optimal microarray design requires 1) a completely sequenced reference genome, 2) complete annotation of this reference genome to know which parts may be expressed and 3) complete knowledge of the natural variation amongst the sampled individuals.

Unfortunately there is currently not a single species for which such complete information is available. Although some reference genomes are now close to completion, annotation of these reference genomes as well as information on how individuals differ from these reference genomes is far from complete. Hence, microarray design is currently sub-optimal even for species with a rather complete reference genome. Probe design based on incomplete or erroneous data can lead to serious problems like non-specific probes causing cross hybridisation, orphan probes designed for non-existing targets, missing probes and misleading probes due to erroneous annotation.

Therefore, it is important to update the annotation for arrays regularly to improve the functional annotation of the targets as well as the reliability of probe-target assignments. Several tools have been developed for this purpose [3-12], but these either provide limited annotation, require complicated local installations with many dependencies, do not scale well or do not support our species of interest. We have developed OligoRAP (Oligo ReAnnotation Pipeline) to overcome these issues.

Implementation

The pipeline consists of 5 steps: I. Convert oligo library data into BioMoby objects, II. Align oligos with a reference genome assembly and with a set of unmapped transcripts (UMTs), III. Analyse oligo annotation, IV. Analyse oligo quality and V. Make summary charts (see Figure 1). Implementation details are described and illustrated in Additional files 1, 2, 3, 4, 5, 6. In this section we will only focus on the key advantages of OligoRAP.

Figure 1

Summary flowchart of OligoRAP. Blue blocks represent user input, green blocks databases, pink blocks output and orange blocks one or more BioMoby web services. For a more detailed description see Additional files 1, 2, 3, 4, 5, 6.

Firstly, OligoRAP does not rely solely on a reference genome or solely on transcripts (or sequences derived from them), but uses both where possible. For the genome, OligoRAP uses reference assemblies and annotation as provided by the Ensembl [13] project. Ensembl was chosen as the primary annotation source because it is the largest and richest resource of its kind, with support for most popular model species in the animal kingdom. In addition to reference assemblies, OligoRAP uses a set of unmapped transcripts (UMTs) to get a more complete picture. The UMT set contains RefSeq [14] and UniGene [14] entries that failed to map to the reference assembly. Where available, annotation derived from Ensembl (for hits on the genome) and from RefSeq or UniGene (for hits on UMTs) can be expanded with links to Entrez Gene [14] and GO [15]. Supplementing the reference genome with UMTs provides annotation that is as complete as possible for well-annotated species whilst keeping redundancy to a minimum. At the same time this strategy is flexible enough to support less well-annotated species, even if no reference assembly is available. In that case all of a species' transcripts simply become part of the UMT set.
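The way the UMT set complements the reference assembly can be sketched in a few lines of Python. This is only an illustration of the strategy described above, not OligoRAP's actual code: the function name `build_umt_set`, its parameters and the identifiers in the example are hypothetical.

```python
from typing import Iterable, Set


def build_umt_set(
    transcript_ids: Iterable[str],
    mapped_to_assembly: Iterable[str],
    has_reference_assembly: bool = True,
) -> Set[str]:
    """Return the transcripts to search in addition to the reference genome.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    all_transcripts = set(transcript_ids)
    if not has_reference_assembly:
        # No genome available: every transcript becomes part of the UMT set.
        return all_transcripts
    # Keep only the transcripts that failed to map to the reference assembly.
    return all_transcripts - set(mapped_to_assembly)


# Made-up identifiers, for illustration only.
transcripts = ["T1", "T2", "T3"]
print(sorted(build_umt_set(transcripts, mapped_to_assembly=["T1", "T3"])))
print(sorted(build_umt_set(transcripts, [], has_reference_assembly=False)))
```

Treating the UMT set as a plain set difference keeps redundancy low: a transcript that already maps to the assembly is annotated through the genome hit and is not searched a second time.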

Secondly, OligoRAP provides annotation for all hits instead of only for the best hit. This allows OligoRAP to provide not only updated annotation, but also oligo target specificity based on the number and type of hits. OligoRAP can differentiate between primary hits (high hybridisation potential) and secondary hits (low hybridisation potential). Hybridisation potential is determined using three filters, which users can adjust based on their experimental setup. Based on their target specificity, oligos are divided into six target specificity classes (TSCs): 1. Gene-specific probes with maximum signal potential, 2. Gene-specific probes with reduced signal potential, 3. Non-specific probes with maximum signal potential, 4. Non-specific probes with mixed signal potential, 5. Non-specific probes with reduced signal potential and 6. Orphan probes with background signal potential.
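The six classes can be illustrated with a minimal Python sketch. The exact decision rules and the three hybridisation filters are not spelled out in this section, so the mapping from primary and secondary hits to TSCs used below is an assumption made for illustration; `Hit` and `classify_tsc` are hypothetical names and not part of OligoRAP.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Hit:
    """One hit of an oligo on a target (hypothetical structure)."""
    gene_id: str   # identifier of the gene or transcript that was hit
    primary: bool  # True: high hybridisation potential, False: low


def classify_tsc(hits: Iterable[Hit]) -> int:
    """Return a target specificity class (1-6) under the assumed rules."""
    # Record per gene whether at least one of its hits is a primary hit.
    best: Dict[str, bool] = {}
    for hit in hits:
        best[hit.gene_id] = best.get(hit.gene_id, False) or hit.primary

    if not best:
        return 6  # orphan: no hits at all
    n_genes = len(best)
    n_primary = sum(best.values())
    if n_genes == 1:
        # Gene-specific: maximum (1) or reduced (2) signal potential.
        return 1 if n_primary == 1 else 2
    if n_primary == n_genes:
        return 3  # non-specific, all targets with maximum signal potential
    if n_primary == 0:
        return 5  # non-specific, all targets with reduced signal potential
    return 4      # non-specific, mixed signal potential


# Made-up example: one primary and one secondary target gene.
print(classify_tsc([Hit("geneA", True), Hit("geneB", False)]))
```

Because the class depends only on how many distinct targets are hit and on their hybridisation potential, changing the filter settings that decide between primary and secondary hits can move an oligo to a different class without any change in the underlying alignments.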

Finally, each of the steps is implemented as one or more web services [16], which were built using the BioMoby framework [17, 18]. These web services provide remote programmatic access and can be glued together using a variety of BioMoby clients like the Taverna Workbench [19] or custom code built with the BioMoby Perl or Java framework. Using web services we created a highly customisable and modular annotation pipeline with a robust interface. This allows OligoRAP to be embedded in microarray data analysis workflows for improved scalability, without tedious local installations and their complex dependencies.

Results and discussion

OligoRAP was used to update annotation and target specificity for the subset of 791 oligos from the ARK-Genomics 20 K chicken array (see methods in Additional file 1). Figure 2 shows how these oligos are distributed over OligoRAP's target specificity classes (TSCs), with transcriptome-based target specificity (TbTS) in Figure 2A and genome-based target specificity (GbTS) in Figure 2B.

Figure 2

Distribution of Oligos over Target Specificity Classes (TSCs). Distribution of the 791 oligos selected for the workshop over the 6 TSCs with transcriptome-based (A) and genome-based target specificity (B). The status of the link between the oligo and the accession number/identifier it was originally designed for is indicated by a tint difference in the colour for TSC 1, 3 and 4: accession/id still present in the annotation, hence "target unchanged" (dark tint) or accession/id absent, hence "target changed" (light tint). For TSC 2, 5 and 6 the target status is always "changed".

Transcriptome-based versus genome-based target specificity

Until recently the transcriptome of higher eukaryotes was thought to comprise only a very small fraction of the genome. For example, in Ensembl 50 less than 5% of the chicken genome is annotated as exon. Since only potentially expressed sequences can hybridise to probes on a microarray, most oligo design and annotation efforts have focused on known and/or predicted transcripts without taking the rest of the genome into account. Apart from a few structural elements like the centromeres and telomeres, it is still not clear what the function of the other 95% or more of the DNA is, but evidence is slowly accumulating that the size of the transcriptome is vastly underestimated. In particular, the pilot phase of the ENCODE project showed that the human "genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts" [20]. It remains unclear whether all these transcripts are biologically functional or whether they just represent noise, but it is clear that all transcripts can potentially hybridise with the oligos on microarrays. Therefore it is probably more appropriate to evaluate target specificity in the context of the entire genome rather than only what is currently annotated as transcriptome.

Looking at TbTS and GbTS for the 791 ARK-Genomics chicken oligos, the share of gene-specific oligos differs by only 2.3 percentage points: 69.5% and 67.2%, respectively. Hence taking the entire genome into account, rather than only the transcriptome, does not lead to a dramatic decrease in gene-specific probes. Unfortunately, roughly one third of the probes are not gene-specific. For these problematic probes the TbTS and GbTS pictures look quite different.

Annotation quality

For most of the oligos it is extremely difficult to verify their predicted target specificity; the orphan oligos of TSC 6 are the exception. The 791 oligos selected as starting material for this EADGENE/SABRE workshop were picked because they show a high differential signal on the microarrays. Hence these oligos clearly bind labelled cDNA derived from one or more target genes, yet OligoRAP classifies 3.5% and 16.1% of the oligos as orphans with GbTS and TbTS, respectively. These numbers indicate that OligoRAP's TSC assignments currently say more about the relatively immature status of the chicken genome assembly and its annotation than about target specificity.

Furthermore, for almost half of the oligos, the sequence identifier they were originally designed for is no longer present in their updated annotation, which is indicated with "target changed" in Figure 2. The fact that these identifiers no longer link to these oligos does not necessarily mean that the oligo no longer represents expression of the same gene as before, but it does indicate at least major changes in the annotation. On the other hand, annotation associated with certain identifiers may have received considerable "minor" updates that kept the sequence identifier intact. Hence, the large number of oligos with changed targets is still an underestimate of the total amount of changed annotation.
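The "target changed" / "target unchanged" distinction amounts to a membership test, which the following Python sketch makes explicit. It assumes that identifiers are compared as exact strings; the function names and the example identifiers are hypothetical and not taken from OligoRAP.

```python
from typing import Iterable, List, Tuple


def target_status(original_id: str, updated_ids: Iterable[str]) -> str:
    """Label an oligo by whether its design identifier survived re-annotation."""
    # Assumption: identifiers are compared as exact strings.
    if original_id in set(updated_ids):
        return "target unchanged"
    return "target changed"


def fraction_changed(oligos: List[Tuple[str, List[str]]]) -> float:
    """Fraction of oligos whose original identifier is no longer linked."""
    if not oligos:
        raise ValueError("need at least one oligo")
    changed = sum(
        target_status(original_id, updated_ids) == "target changed"
        for original_id, updated_ids in oligos
    )
    return changed / len(oligos)


# Made-up data: (identifier the oligo was designed for, identifiers it hits now).
example = [
    ("ID_A", ["ID_A", "ID_X"]),  # original identifier still present
    ("ID_B", ["ID_Y"]),          # original identifier replaced
    ("ID_C", []),                # orphan: no hits at all
]
print(fraction_changed(example))
```

A test like this only detects identifiers that disappeared. Annotation that was revised under an unchanged identifier passes as "target unchanged", which is why the fraction of changed targets underestimates the total amount of changed annotation.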

Future work

Although the ENCODE pilot study covered only approximately 1% of the human genome, it is clear that our view of the transcriptome will change dramatically over the next years. This will have a big impact on oligo annotation and target specificity, making it more important than ever to be able to update oligo annotation quickly and regularly. In addition to regular updates of the data, annotation pipelines like OligoRAP will themselves need to be updated to adapt their annotation strategies to our changing insights into gene expression.

Conclusion

Microarray probes are designed based on incomplete data. Therefore it is important to update probe annotation and estimate target specificity regularly. OligoRAP provides such functionality for Ensembl species and, because its design is based on BioMoby web services, can easily be embedded in customised applications for microarray data analysis. The rather high number of oligos with changed targets shows the importance of updated annotation and reflects the limited amount and quality of the annotation available at the time the ARK-Genomics 20 K chicken array was designed.

Further information

ZIP-archive containing the final results of the OligoRAP pipeline run as well as all intermediate results. See included README for details.

https://www.bioinformatics.nl/phenolink/home/OligoRAP/datasets/Ensembl50_RefSeq30/OligoRAP_RIGG791_20081222_BMC_Proceedings.zip