Background

With rapid accumulation of gene expression data in public repositories such as NCBI GEO[1], integrated analysis of multiple studies is receiving increased attention. The integrated analysis increases statistical power, generalizability, and reliability, while decreasing the cost of analysis, since it exploits publicly available data for related studies, which are often from different platforms [2, 3]. Rhodes et al. identified a set of differentially expressed genes between prostate cancer patients and healthy subjects from an integrated study of four different datasets and discovered that some genes were consistently dysregulated in prostate cancer but were not reported in the individual studies [4]. Warnat's group performed a classification study of cancer patients with six different datasets and achieved higher accuracy over single-set analysis [5]. Both studies were conducted across different gene expression platforms.

Despite the well-known benefits, conducting a cross-platform analysis of gene expression data involves many intricate issues at different layers. A recent guideline discussed several key issues when conducting an integrated microarray data analysis: annotating probes of the individual dataset, resolving the many-to-many relationship between probes and genes, aggregating multiple measurements into a single gene-level value, and combining study-specific estimates [2]. Some authors noted that the interpretation of the biological results could be improved with re-annotated and filtered probes in microarray studies [6, 7]. These probes would ideally be at no risk of cross-hybridization to multiple genes and would not contain any SNPs or repeats in its sequence [8, 9].

Several tools were developed to resolve the aforementioned hurdles for cross-platform analysis of gene expression data. CrossChip http://www.crosschip.org provides comparative analysis between different generations of Affymetrix arrays [10]. It utilizes architectural information of probe, i.e., the minimum sequence overlap length and the minimum probe pairs per probe-set to enable cross-platform comparison, but the scope is limited to Affymetrix platforms. Another tool, EXALT http://seq.mc.vanderbilt.edu/exalt, allows the user to upload his/her data and searches for homologous data sets obtained from public repositories. However, the user is still responsible for ascertaining the quality of the probe and its impact at the gene expression level [11]. Furthermore, neither of these tools takes into account the biological characteristics of probes, such as presence of SNPs or repeat sequences. EXALT recommends the use of GDS (processed data by NCBI)--pre-processed data derived from GSE (raw data submitted by authors)--but the derived measurement values are problematic as they still contain undesirable probes that map to multiple genes, are specific to a certain transcript, or may contain a SNP. Furthermore, only a fraction of all studies in GEO have a corresponding GDS. In our recent study of 58,432 GEO microarray samples from six different diseases, only 19.7% of them were included in GDS [12]. Another available tool, A-MADMAN http://compgen.bio.unipd.it/bioinfo/amadman, performs integration of cross-platform microarray data obtained from GEO [13]. However, its input is limited to Affymetrix platform microarrays and the probe annotation relies on available chip description files, which are known to have errors or are outdated, as biological knowledge is updated [6, 7]. CPTRA http://people.tamu.edu/~syuan/cptra/cptra.html, another tool for cross-platform analysis of gene expression data [14], allows two different platforms to be combined, but the focus of this software is on the species with limited genome information, such as horseweed [14]. Hence, one of the inputs must be a long-read sequence with proper annotation. In contrast to CPTRA, our analysis tool, AnyExpress focuses on well-studied species like human, mouse, fruitfly or Arabidopsis, where reference genome information and the transcriptome database are well-maintained and available.

Our approach is to start the analysis from raw files, such as .fastq (Roche 454 or Illumina GA), .csfastq (ABI SOLiD color-space), or .fasta (microarray platforms, SAGE or MPSS) to remove undesirable probes before pre-processing. For example, we summarize multiple probe level measurements into a single target-level value, where the target is a user-defined expression unit (e.g., gene/isoform/exon). None of the existing tools can handle integration of NGS and microarray data from different platforms. Thus, we developed a practical, integrated toolkit for cross-platform analysis of gene expression data serving all NGS and microarray platforms for any species. Previously, our group demonstrated that sequence-based probe matching improved the consistency of measurements across different platforms, compared to the widely-used identity-based matching method at that time[1517]. We also developed DSGeo, a software collection for analyzing microarray data deposited in GEO [18]. We now extend our previous work, by integrating a novel interval-matching algorithm [1820] and developing a suite of software tools, called AnyExpress. Our suite of tools automates the matching of NGS, microarray, SAGE and MPSS, and also allows users to define reference target and probe quality filters.

Implementation

Architecture

AnyExpress is a software suite for cross-platform analysis of gene expression data. It allows two sources of inputs: (i) genomic position files, obtained from the external alignment software and (ii) probe-level sample measurements files. AnyExpress returns a target-by-sample text file as an output. We define 'tag' as a string of nucleotide sequences used for measuring gene expression abundance. This string is commonly called 'probe' or 'read' for microarray or NGS platform, respectively. Throughout this article, we use tag, probe, and read interchangeably. Next we define 'platform' as a set of tags. Then, we classify platforms into one of two classes, based on the availability of knowledge in a tag's sequence. When the tag sequence was predetermined, as in a microarray or catalysed reporter deposition (CARD) FISH, the platform was considered to be closed-platform[21, 22]. If the tag sequence is determined at the time of sequencing, as it is in NGS, serial analysis of gene sequence (SAGE), or differential display (DD), the platform is considered to be open-platform[21, 22]. While closed-platform can have multiple samples (e.g., 20.cel files of the same platform, an Affymetrix U133A microarray), the open-platform has a 1-to-1 relationship with the sample (e.g., six Illumina GA NGS .fastq files from six corresponding patients). AnyExpress is capable of dealing with gene expression data from all platforms in contrast to the existing tools that focus on a single platform. A schematic workflow of AnyExpress is displayed in Figure 1. The gene expression data of one closed-platform (Affymetrix U133A) and two open-platforms (Illumina GA and ABI SOLiD) are combined. A summarization file is created per platform as a result of the SUMMARIZE process, then the multiple summarization files are merged into a single gene-by-sample text file through a JOIN process within a COMBINE process. Before running core processes of AnyExpress (shown as blue rectangles), the user needs to build target and reference features (indicated by yellow rectangles) to generate 'SYSTEM DATA' and manually perform sequence alignment using external software tools such as Bowtie (indicated by pink trapezoids) [23].

Figure 1
figure 1

Workflow of AnyExpress. An outline of data flow is depicted with input, output, and intermediate files. Core processes in AnyExpress are drawn as blue rectangles. Pre-processing is represented by yellow rectangles. ALIGN is an external process (pink trapezoid) that runs via software such as Bowtie or RMAP. The standard input to AnyExpress is a Browser Extensible Data (BED) file or a tab-delimited multi-column file. The output is a target-by-sample combined file. A gene-by-sample is used in the final output of this figure, but the user can choose his/her own target (e.g., 'RefGene isoform' or 'Ensembl gene') by running anyexpress Build.

Reference target

We refer to a target as a biologically meaningful expression unit against which tag will be matched using genomic positions. Each target is a collection of five attributes: chromosome, strand, start position, end position, and identifier. AnyExpress accepts the target as a .BED file where the five fields are separated by tabs. In most cases a single target has multiple associated tags; hence, multiple measurement values will be summarized into a single aggregated value. The target identifier must consist of two substrings concatenated by '@', i.e., targetID = 'superID' + '@' + 'subID'. For 'BRCA1' gene, its identifier (with the corresponding target) could be represented as 'BRCA1@Exon2' (official gene symbol), 'NC_007294@Exon2' (RefSeq), or 'ENSG00000012048@Exon2' (Ensembl). The target information will be updated as knowledge of the genome and genes evolve. Species, source database, and build-date are three factors defining a target. Example of .BED format files are as following: 'Human_Ensembl_Feb2009.BED,' 'Human_UCSCKnownGene_Feb2009.BED,' 'Human_RefSeq_Feb2009,' and 'Human_RefSeq_Mar2006.BED'. Unlike existing custom reannotation data approaches in which the user's analysis is limited by a particular type of target that the annotator has predefined, AnyExpress allows users to define their own reference target for any species. In Figure 1, RefGene2010 was selected as a target and the corresponding system data was created by running Anyexpress Build before running Anyexpress Combine all in a command-line.

Exclusion features to identify undesirable tags

Exclusion features allow users to apply a biological filter applied against the tags to filter out undesirable ones. Previous studies have shown the negative effect of low quality microarray probes on measurement of gene expression abundance and consequently on the interpretation of the results [6, 8, 9, 24, 25]. A probe that hybridizes to more than one reference target is referred to as a 'cross-hybridization' or 'multi-target' probe. This type of probe often results in ambiguous interpretation of results, negatively affecting downstream analysis such as statistical testing, clustering, or enrichment analysis on Gene Ontology or pathways [6, 25]. The presence of SNPs within the probe sequence would cause incorrect estimation of mRNA abundance [6, 8]. It has been reported that the removal of undesirable tags resulted in increased statistical power to detect differentially expressed genes [9, 25]. Existing tools or custom CDF files restrict users to a predefined set of filters, sources, and build dates according to external annotators [9, 26]. For example, a SNP alone can have several characteristics: class of variant (single, in-del, or unknown), functional category (coding-synonymous, intron, noncoding-synonymous, near-3', near-5', or unknown), validation status (by-cluster, by-frequency, by-hapmap, or unknown), and average heterozygosity [24]. AnyExpress offers flexible solutions where the user can define desired characteristics and selectively apply tag-filters at the time of data integration.

Interval matching algorithm

AnyExpress takes the outputs of external alignment software as inputs (e.g., Bowtie for NGS), which consist of a list of attributes of genomic position (chromosome, strand, start, and end). Probes and reads are matched against targets. Matching two entities based their genomic positions is a core part of data integration process in AnyExpress. While naïve comparison of all intervals of target and tag (e.g., RefSeq vs. NGS read) is a computationally-intensive task with time complexity O(n2 ), AnyExpress adopts a fast interval matching algorithm called PositionMatcher, developed by our group [20]. PositionMatcher performs "genomic walking" by iterating linearly along the positional stamp of a genome, keeping track of overlapping intervals in a hybrid data structure of stack and queue to achieve time complexity down to O(nlogn + n). In a previous study [20], we showed that the execution time of PositionMatcher was over 20 times faster than that of all-pairwise comparison methods using the Illumina NGS data reported in Marioni et al.[27] as an example.

Figure 2 shows an example of how MATCH process is performed: the first match relates tag to target and the second match relates tag to SNP. The result of matching is a list of two objects, (target, tag) or (target, SNP). Then, ANNOTATE process generates a tag annotation file to report each of the tag's associated targets and the presence/absence of SNPs in the sequence. Based on this annotation file, EXCLUDE and AGGREGATE processes produce a summarization file for each platform to feed into JOIN process (Figure 1).

Figure 2
figure 2

Illustration of two processes: MATCH and ANNOTATE. The MATCH process takes as input two position files (each with three columns: start, end, and identifier) and generates as an output, a list of ID pairs, tag ID, and target ID. While MATCH on the left was between tag and target, MATCH on the right occurred between tag and SNP. Then ANNOTATE merged the two outputs of MATCH to create a tag annotation file with associated target ID and indication of SNP presence/absence.

Platform-level summarization

As illustrated in Figure 1, the output of SUMMARIZE process, per platform, is a target-by-measurement value text file where multiple measurements are aggregated into a single numeric value per target. For a closed-platform, multiple tag-level signals are summarized into a single number per target-sample pair. We used Tukey's median-polish algorithm, a widely used summarization technique in microarray data, to introduce the robust multi-array averaging (RMA) method [28]. For an open-platform, multiple associated tags were aggregated into a single 'Reads Per Kilobase exon Model per million mapped reads' (RPKM) value per tag [29].

Auxiliary tools with high scalability to create input data

In early microarray studies, the number of samples for Affymetrix was small (less than 20), so it was easy to create a single column-bound file. But recent studies involve a large number of samples, often exceeding 200, which makes data loading impossible using R, Matlab, or stand-alone software due to limited memory size. Although the number of tags is relatively lower in microarrays than in NGS (1 million versus tens of millions), currently, the number of samples is larger due to the maturity and inexpensive cost of this technology. For example, 186 Affymetrix microarray .cel files were used in a lung cancer classification study [30] and 286 .cel files were used in a breast cancer study [31]. These individual studies are already large and the integrated analysis incorporating those will evidently be even larger. Simple loading of individual .cel files using conventional computers fails even before normalization or summarization. Solutions using parallel computing are being proposed, but these are useful only to users with advanced skills and access to high performance computing resources [32]. Thus, we propose AnyExpress as a solution. It is developed to serve the average user, one that has access to 4 ~ 8 GB of memory and a 2 ~ 3 GHz processor.

Binding a large number of Affymetrix .cel files

We defined the input format of closed-platform samples for AnyExpress as a single column-bound, tab-delimited text file where the first column is a probe identifier followed by measurement values of the samples in the second column. This is a common data format for microarrays in non-Affymetrix platforms. However, in Affymetrix, each sample is a .cel file and needs to get column-bound. AnyExpress provides a scalable binding tool, anyexpress BindAffyCel (in a command-line), to create this single column-bound file from a large number of Affymetrix .cel files. We tested the capability of AnyExpress in binding a different number of .cel files downloaded from GEO. For binding, AnyExpress extracts probe identifier as probeID = 'x-coordinate' + ':' + 'y-coordinate' from the .cel file. The user is required to place .cel files of the same platform in the same directory. Only the probe identifiers that are common across all samples will be represented in an output file. In Figure 1 (top left), four Affymetrix .cel files are bound to a single text file '4samples.txt'.

Quantile-normalization for a column-bound data of microarray samples

Quantile-normalization [28] is a widely used pre-processing procedure for microarray data, but its processing is severely limited by certain hardware. The column-bound file can be directly used in anyexpress Combine, but it is highly recommended to perform between-sample normalization of this data to remove systematic bias to enable fair comparison among samples [13, 33]. Of the different normalization algorithms, the quantile-normalization was shown to be superior [13, 33]. Quantile-normalization is a rank-invariant transformation of measurement values that have identical distribution of measurement values across all samples once they get processed. The same scalability issue applies to quantile-normalization. A single .cel file contains over 500,000 probe-level measurement values. When hundreds of samples need to be combined, existing tools can hardly perform quantile-normalization. AnyExpress solves this issue with a highly scalable tool, anyexpress NormalizeColumnBoundSamples.

Coverage plot

Visualization plays a critical role in data validation, interpretation, and hypothesis generation during analysis [34]. Software tools for visualization should be able to manage a large number (e.g., millions) of tags. We developed a tool that can create a coverage plot along the genome for all the platforms used in a single AnyExpress run. The output file is a .bedGraph text file. The user needs to upload this file onto the UCSC Genome Browser http://genome.ucsc.edu through his/her own web-browser. The user can draw a plot by typing in five parameters: a directory of user, Project; chromosome; strand ('forward' or 'backward'); start position; and end position. Each platform, closed or open, in the user's Project is drawn as a track in the .bedGraph file. In each track, vertical bars are drawn along the genomic region of interest. The height of the vertical bar is either the number of the reads covering each base in open-platforms or the average signal intensity of the probe covering each base. As a default reference track, the RefSeq gene model is displayed at the bottom of the plot. The user can freely add, hide, or modify the plot through the UCSC Genome Browser, e.g., adjust scales, change color, or add biological reference tracks.

Operation

AnyExpress is composed of an executable wrapper (shell script or .exe file), a collection of Java classes and pre-processed data (reference targets and exclusion features). Once an archived file (.zip) is extracted to the user's local machine, AnyExpress is ready to execute after ENVIRONMENT and PATH variables are set, as in any other command-line software for a Unix-like environment or Windows. Tools available in AnyExpress are summarized in Table 1. Instructions on installation, configuration, and usage are detailed in the accompanying webpage http://anyexpress.sourceforge.net. Among seven tools, COMBINE is the main process that performs data integration. Figure 3 explains the option parameters by showing an example of running anyexpress Combine in a command-line, to combine closed-platform data (microarray: Affymetrix U133A) and two open-platform data sets (NGS: Illumina GA and ABI SOLiD). Tags from these three platforms were matched against 'RefGene2010' using the PositionMatcher algorithm. The tags were also matched against targets 'multiTarget' and 'dbsnp131' for filtering. The exclusion feature 'multiTarget' is automatically generated during the ANNOTATE process. For example, in Figure 2, tag5 is matched to two genes, geneY and geneZ (bottom left table in Figure 2). Once such tag-to-target pairs are obtained, a 'multiTarget.txt' file that contains a list of undesirable tags, such tag5 in Figure 2, is created. The final output 'combinedExpression.txt' is created under the user-specified directory (specified as 'myProject' in Figure 3) and also contains summary statistics.

Table 1 Summary of AnyExpress tools
Figure 3
figure 3

Command line example. A command line to execute AnyExpress Combine is illustrated. The command can run in any Unix-like environment or MS-DOS in Windows. A backslash ('\') character is used to continue the command onto the next line. All .BED and .txt files are assumed to be located in the current working directory. If they are not, the full path should be used. Also, the target 'RefGene_2010 and two exclusion features 'multiTarget' and 'dbsnp131' are assumed to be created before running anyexpress Combine.

Tested platforms

AnyExpress was implemented in Java, shell script, and Python and it runs on Unix, Linux, Mac OS X, and MS-DOS in Windows. AnyExpress successfully worked with three different configurations: (i) a 64-bit Linux server with a 2.13 GHz Intel Core™ 2 Duo CPU and 16 GB memory, (ii) a 32-bit iMac with a 2.66 GHz Dual-core Intel Zion and 4 GB memory, and (iii) a 32-bit Windows 7 with a 1.8 GHz Intel Core CPU and 4 GB memory. The executables, the source code, the example data, and the manual are available at http://anyexpress.sourceforge.net.

Results

Combining NGS and microarray data

We applied AnyExpress to human gene expression data from Marioni et al.[27]. It consists of six microarray samples (Affymetrix HG U133A) and six Illumina GA NGS samples. We downloaded raw microarray .cel files from the NCBI GEO database (accession number: GSE11045) and raw NGS .fastq files from NCBI Sequence Read Archive (submission number: SRP000225). The six samples, all human tissue, were obtained from kidney (3 samples) and liver (3 samples). For pre-processing, six microarray .cel files were bound into a single column-bound file after running AnyExpress:

$anyexpress BindAffyCel ~/celfiles 6sample.txt

To create a closed-platform .BED file, we downloaded a probe sequence file (.fasta) of Affymetrix U133 Plus2 from the Affymetrix Support webpage http://www.affymetrix.com and processed it to have a probe identifier as in 'x coordinate' + ':' + 'y-coordinate', used by Thompson et al.[25]. Then we aligned the probe against the genome sequence to obtain genomic positions for the probes using two external tools, Bowtie [23] and AWK [35]:

$ bowtie ~/indexes/hg19 -t -n 0 -B 1 hg19 -f U133PLUS2.fasta U133PL2.bowtie

$ awk '{ FS="\t"; OFS="\t"; print $3, $4, $4+length($5)-1,\ $1, $2 }' U133PLUS2.bowtie > U133PLUS2.BED

Bowtie is a fast and memory-efficient algorithm and tool for short sequence alignment [23] and AWK is a convenient Unix-like environment tool for processing a text file [35]. We chose these tools for their popularity and convenience, but users can freely use other tools or their own code to process their .fasta files to obtain the .BED format file. For Windows users, we provide an AWK-equivalent tool, awk.exe, through the AnyExpress webpage. Open-platform files were aligned and processed in the same way as closed-platform files. The only difference was to replace the Bowtie option from '-f' to '-q' because NGS data used the .fastq format. The following Bowtie-awk running was repeated for all six NGS files (SRR002320.fastq through SRR002325.fastq):

$ bowtie -t -n 0 ~/indexes/hg19 -q SRR002320.fastq SRR002320.bowtie

$ awk 'BEGIN {FS = "\t"; OFS="\t"} {print $3, $4, $4+length($5)-1,

$1, $2 }' SRR002320.bowtie > SRR002320.BED

We built target and an exclusion features into the system using AnyExpress:

$ anyexpress BuildTarget RefSeq_Gene.BED

$ anyexpress BuildExclusionFeature dbsnp131.BED

The resulting files were all created in the '$ANYEXPRESS_HOME/sys/target' and '$ANYEXPRESS_HOME/sys/exclusionFeature' directories.

We combined all 7 platforms (1 closed-platform + 6 open-platforms) of the Marioni data with AnyExpress, using multiTarget and dbsnp131 as exclusion features:

$anyexpress Combine -c U133PLUS2.BED 6samples.txt -o SRR002320.BED SRR002321.BED SRR002322.BED SRR002323.BED SRR002324.BED SRR002325.BED -t RefSeq_Gene -e multiTarget dbsnp131 -p /user/jkim/myProject

Run-time messages during the AnyExpress execution are shown in Figure 4. The start and end of tasks are displayed in a step-by-step manner. The final combined file is a target-by-sample text file that can be used in downstream analyses, such as identification of differentially expressed genes, classification, clustering or enrichment analysis on Gene Ontology and pathways [36]. At the bottom of Figure 4, coverage statistics of tag and target are added along with the execution time (2,671 seconds).

Figure 4
figure 4

AnyExpress run-time message log. Run-time messages are shown during the analyses, with statistics and execution times shown at the bottom.

We calculated Spearman correlation coefficients (CC) for the combined data to assess reproducibility. The within-platform CCs were very high in both kidney and liver (mean CC = 0.980; sd CC = 0.011), while cross-platform CCs were moderate (mean CC = 0.733; sd CC = 0.001). These results confirmed the results of the original study [27]. The cross-platform CC = 0.733 is similar to our previous results for cross-platform microarray studies [16, 17] and similar to results from the MAQC consortium [17]. Although we observed lower cross-platform CCs, it is known that a decrease in correlation could be due to tag-effect differences in each platform [37].

For visualization, we drew a coverage plot in the genome regions of gene GPX3 (chr5:150,395,999 - 150,410,551):

$anyexpress Plot/user/jkim/myProject chr5 forward 150395999 150410551

This gene is shown to have tissue-specific expression, higher in kidney but lower in liver [38]. The resulting .bedGraph file was uploaded to a custom track of the UCSC Genome Browser for visualization. We selected four representative tracks out of the original twelve due to page limitations and adjusted the browser setting for clearer viewing. Figure 5 displays the difference between the two technologies. As expected from Affymetrix's original probe design scheme, microarray probes were only found in the last exon. In contrast, Illumina GA reads spread across all exons. Figure 5 demonstrates that differential expression between two tissues, kidney (red) vs. liver (green), is well-conserved within each platform.

Figure 5
figure 5

Coverage plot. An example of a coverage plot generated from the anyexpress Plot module in the human genomic region of the GPX3 gene (from 150,395,999 to 150,410,551 in chr5). The top two tracks represent kidney (red) and liver (green) samples measured by microarrays (MA). The bottom two tracks represent kidney (red) and liver (green) measured by NGS. Height rescaling and color changes were applied in the UCSC Genome Browser for a better view.

Effect of exclusion features

We ran AnyExpress on the Marioni data with four different exclusion feature settings: 'none' = apply no exclusion feature, 'snp'= remove SNP containing tags, 'multiTarget' = remove tags matched to more than one target, and 'both' = apply both 'snp' and 'multiTarget'. We assessed the effect of exclusion features on gene coverage and correspondence of highly expressed genes across the platforms. Figure 6 displays gene coverage of seven platforms of microarray (MA), six NGS (NGS.*) and the final combined expression (Combined). The coverage was calculated as the number of genes that remained after filtering divided by the total number of genes in the RefSeq transcriptome database (total = 21,505). Microarray had the highest coverage value and the combined file had the lowest since it only keeps genes from the intersection of the other six platforms. (AnyExpress also allows 'union' as a set operation.) Application of exclusion features resulted in slightly lower coverage per platform. Within NGS, overall coverage was higher in kidney (Kid) than in liver (Liv).

Figure 6
figure 6

Gene coverage. Gene coverage per platform is displayed with different exclusion feature settings. The coverage was calculated using the RefSeq transcriptome database as a reference, which had a total of 21,505 genes. Each platform had 4 coverage values obtained from the corresponding exclusion features: 'none' = apply no exclusion feature, 'snp'= remove tags containing SNPs, 'multiTarget' = remove tags matched to more than one target, and 'both' = apply both 'snp' and 'multiTarget'. Both microarray (MA) and next-generation sequencing (NGS) had replicates from two tissues: kidney (Kid) and liver (Liv). The coverage that resulted from applying all filters is shown at the bottom of the graph (Combined).

In Figure 7, cross-platform agreement for highly expressed genes is assessed with the correspondence at the top (CAT) plot, first introduced by Irizarry et al.[37]. Correlation coefficients were shown to be inadequate to assess correspondence between studies or platforms, due to a small number of differentially expressed genes [3]. Hence, other authors have suggested that cross-platform agreement should be evaluated on genes which are likely to be differentially expressed [3, 37]. Previously we used this plot in a cross-platform study of microarray and MPSS [16]. The CAT plot has also been used in several similar studies [3, 39, 40]. We created lists of highly expressed genes, size n, sorted by fold-change in decreasing order, varying n from 50 to 2000 by 50. For each top-n genes from NGS, we counted the number of genes that were in common with the top-n genes from microarray and divided this number by n . As expected, the proportion that was in common between the two platforms increased with an increase in n. The agreement proportions in 'none' and 'snp' were similar when the list size was smaller than 900, but the proportion was higher in 'snp' than in 'none' when the list size was above 900. 'multiTarget' and 'both' outperformed 'none' for all list sizes. Overall, the CAT plot demonstrated that filtering by exclusion features produced higher agreement between the two platforms. We also assessed the cross-platform correspondence with a modified CAT plot where genes were ranked by a false discovery rate (FDR) adjusted q-value [41], instead of fold-change (Additional file 1). 'snp' and 'none' showed similar correspondence, but overall we observed the same effect of larger correspondence with filtering.

Figure 7
figure 7

CAT plot using NGS as reference. This CAT plot depicts the agreement between NGS and microarray data on the detection of differently expressed genes. The X-axis is the number of top genes in NGS data, ranked by log2 fold-changes of kidney vs. liver. The Y-axis is the proportion of genes from the microarray that is in common with top-ranked genes from NGS. Four CAT plots are drawn using none, snp, multiTarget, and both (snp and multiTarget simultaneously) exclusion features.

Execution time with a large number microarray samples

We performed stress testing of AnyExpress with a different number of .cel files under different memory sizes (Table 2). The number of .cel files was increased per memory size until failure (i.e., encounter of memory allocation error). Pre-processing processes (BIND or NORMALIZE) took longer than the actual COMBINE process. We found that AnyExpress can manage up to 500 .cel files with 8 GB memory. The user needs to have a memory size larger than 8 GB to process more than 500 .cel files. At the time of writing this manuscript, the price of a 4 GB memory was around 100 US dollars. Considering the cost of high-performance computing, running AnyExpress with additional memory on an average PC or laptop computer is cost-effective for a large-scale cross-platform analysis of gene expression data.

Table 2 Stress testing with Affymetrix .cel files

Future work

AnyExpress currently has some limitations as it is based on position matching between tag and reference. Hence it misses exon-spanning tags during the COMBINE process. In the Marioni data, about 4% (or around 1,000) of transcript-matched reads were exon-spanning tags. Although these were not counted in the current version of AnyExpress because of their relatively small representation, we are currently working on developing post-processing modules to rescue these tags.

AnyExpress performs within-platform normalization and quantile normalization [28] for closed-platforms, and the RPKM-like method [29] for an open-platform normalization. However, the current version of AnyExpress does not offer cross-platform normalization. Systematic biases may originate from different platforms, hybridization protocols, time of day when an assay was performed, replicates, and/or amplification reagents. Some investigators have proposed pre-processing methods to remove systematic biases: Singular Value Decomposition (SVD) [42], Distance Weighted Discrimination (DWD) [43], and an empirical Bayes method [44]. However, these methods focus on microarray, not NGS, data and only a small number of arrays are considered. Or, they perform "over-normalization" to the point that biological variations of interest may be lost [44]. NGS technology is still new and a thorough investigation of NGS-specific systematic biases is needed. AnyExpress is modular and open-source, so it is easy to extend and modify. The above-mentioned sources of systematic bias can occur in many of the different analysis steps depicted in Figure 1. However, we implemented AnyExpress in a modular fashion so that users can easily make changes to the current source code to handle the systematic bias in each step. AnyExpress targets an audience with some computational knowledge and hardware with at least 8 GB memory. We have shown that AnyExpress successfully combines 500 .cel files and six NGS data. Currently, we are extending AnyExpress in a distributed computing environment to accommodate a larger study.

Conclusions

We developed AnyExpress, a toolkit that combines and filters cross-platform gene expression data. With sequence-oriented tag mapping and a fast interval algorithm, AnyExpress uniquely offers all of the following features: (i) combine cross-platform gene expression data at a user-defined gene expression unit level (gene, isoform, or exon), (ii) process gene expression data from both open- and closed-platforms, (iii) select a preferred custom target reference, (iv) exclude undesirable tags based on custom-defined biological features, (v) create a coverage plot along the genomic regions of interest, (vi) bind a large number of Affymetrix .cel files into a single text file, and (vii) perform quantile-normalization with a large number of microarray samples.

Availability and requirements

  • Project name: AnyExpress

  • Project home page: http://anyexpress.sourceforge.net

  • Operating system: Linux, Unix, Mac OS X, or Windows

  • Programming language: Java, shell script, and Python

  • License: Apache License version 2.0