Background

The majority of DNA microarrays in use today are created from single genomes that do not reflect the genetic diversity of a group of heterogeneous entities. Mixed-DNA microarrays offer an alternative for "capturing" genetic diversity and can be used for classification purposes such as identifying pathogens or determining genetic relationships for epidemiology studies [14]. DNA from one or more reference strains or plasmids is shotgun-cloned, and a mixed-genome or mixed-plasmid microarray is generated from randomly selected, PCR-amplified clone inserts [2, 3]. Unlike most fingerprinting tools, the mixed-array format permits identification of informative probes that can be retrieved from the clone library for sequencing [5]. However, redundant sequences and limited representation of diversity can limit the application of these tools [3, 4]. Fortunately, a growing public database of genomes offers a new opportunity to incorporate non-redundant and diverse sequences into a mixed-microarray format. These arrays can be used to quickly assess the distribution of genetic diversity across multiple species and niches.

This work focuses on the optimal design of classification arrays. By optimal we mean minimizing the complexity and cost of an array by using as few probes as possible while still rendering sufficient information to discriminate between strains or groups of organisms and to avoid bias; the goal is to remove irrelevant probes (probes that contain no useful information) and reduce the number of redundant probes (probes that contain the same information) in such a way that the chosen probes will allow us to perform the desired classification task accurately. Selection of an optimal set of probes is a key factor in designing a successful mixed microarray to suit a particular need. The effects of probe length and the number of probes per gene have been discussed in [6]. A method for finding unique and valid oligonucleotides or probes was proposed in [7], which tries to identify probes for a gene such that there is no similar occurrence in other locations of a genome. A tool for choosing optimal DNA oligos is reported in [8], which identifies oligo sequences that occur in members of the target group but not in the non-target group. However, these methods are used for genome-wide probe selection and are not intended to identify minimum probe sets for classification problems.

A number of methods have been introduced for designing optimal probe sets. Pre-filtering methods [9] use clustering of all probes to find similar probe groups. Similar probes are discarded; the remaining probes are ranked, and top-ranked probes are kept for further analysis. A similar method [10] uses K-means to cluster all genes, and the means of different gene clusters are used as prototype genes. The limitation of these methods is that the number of clusters must be specified. A hybrid approach [11] ranks the probes first and selects a set of top-ranked probes. Hierarchical clustering is then used on these probes to generate a dendrogram. The optimal probes are selected by collapsing dense clusters. In this manner a small set of probes is identified that has a similar prediction accuracy to one that uses more probes.

The methods described above identify optimal probes using training data when the structure of the data is given. Such information, however, is usually unavailable for microarray data sets. A tool is still needed to help design mixed microarrays when prior knowledge of a microarray data set is unavailable. The focus of this paper is a software program, PLASMID, used for selecting an optimal set of probe sequences without a priori knowledge that will enable correct classification of groups of plasmids or bacteria. Data used to identify probe candidates can be either existing microarray data (or similar hybridization data) or sequence data from a public database such as GenBank. The latter are converted to "probe" sequences, and virtual hybridization is used to generate data for probe selection [1]. To demonstrate the generality of PLASMID, we include an example whereby the program can also be applied to develop a minimum probe set to distinguish between two classes of leukemia using data from an expression array.

Methods

Finding meaningful clusters in hybridization data

Finding meaningful clusters of samples (e.g., plasmids) from a given set of hybridization or sequence data is the starting point for the design of an optimal microarray; our tool provides several clustering options. Clustering methods can be divided into two general groups: distance-based methods and model-based methods. Distance-based methods are either non-hierarchical or hierarchical, and each method has its particular strengths and weaknesses. Currently our tool includes the K-means non-hierarchical clustering algorithm and hierarchical clustering by means of Unweighted Pair Group Method with Arithmetic mean (UPGMA), neighbor joining, or Ward's minimum variance method, all of which are widely used in microarray data analysis [12]. Two distance metrics have been implemented, Euclidean distance and Pearson's correlation coefficient, from which users can choose. The distance-based methods listed above are standard clustering techniques. In addition to these, we have also implemented the model-based clustering method described below.
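
As an illustration only (this is not code taken from PLASMID), the two distance metrics can be sketched in Python as follows. The function names are hypothetical, and converting Pearson's correlation coefficient r into a distance as 1 - r is an assumption of this sketch; the text above does not state which conversion the tool uses.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length hybridization profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson r (the '1 - r' conversion is an assumption of this sketch)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("Pearson correlation is undefined for a constant profile")
    return 1.0 - cov / math.sqrt(var_a * var_b)


def distance_matrix(rows, metric):
    """Symmetric matrix of pairwise distances between samples (one row each)."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = metric(rows[i], rows[j])
    return d
```

A matrix produced by `distance_matrix` is the usual input to a hierarchical method such as UPGMA; the clustering step itself is not shown here.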

Model-based genetic clustering

Distance-based methods are simple to use, and the clustering results are easy to explain. However, it is hard to obtain information about the number of clusters, the confidence level of the clustering results, and so on, from these methods. To avoid some of these issues, model-based clustering methods can be used as an alternative. Model-based clustering methods assume that the data can be clustered according to a set of underlying distributions. These underlying distributions can be modeled, and finding a suitable model can be construed as an optimization problem. We assume that M is the underlying model for a data set represented by a matrix X where each row of the matrix represents the data for a given sample (e.g., plasmid). The best clustering result is represented by partition P of X. A measure is used to determine which P is most likely for X. In our tool the measure is the likelihood of all possible partitions P. A number of different optimization methods can be used to find the solution for P. In our tool, we have chosen to use a genetic algorithm because of its simplicity and efficiency in addition to its ability to find the optimal solution. Usually model-based clustering methods are based on the Expectation-Maximization (EM) method. However, EM algorithms tend to break down for microarray data because an inversion of the covariance matrix must be performed. In genetic algorithms, a search method is used to circumvent the need for this computation, thereby making genetic model-based methods more stable.

To find the best partition P we want to maximize the posterior probability f(P|X). According to Bayes' theorem, $f(P \mid X) = \frac{f(X \mid P)\, f(P)}{f(X)}$, where f(P) is the prior probability. Recasting Bayes' theorem in terms of the likelihood $\mathcal{L}(X \mid P)$ gives $f(P \mid X) \propto \mathcal{L}(X \mid P)\, f(P)$; that is, the posterior probability is proportional to the product of the likelihood and prior probability. Now if we assume a uniform distribution for P, then f(P) is constant and maximizing the posterior probability f(P|X) is equivalent to maximizing the likelihood $\mathcal{L}(X \mid P)$.

If we assume the rows of the matrix X in each cluster of the partition are independent and identically distributed, we can compute the likelihood of a partition. For this work, we assume the rows in each cluster are normally distributed with mean $\mu_i$ and variance $\sigma_i^2$, and we assume a normal distribution for all $\mu_i$ and an inverse-Γ distribution for all $\sigma_i^2$. This leads to:

$$\mathcal{L}(X \mid P) = \prod_{k} \prod_{j} \frac{2\sigma_0^2}{\Gamma(1)} \, \frac{(2\pi)^{-(n_k/2)}}{\sqrt{n_k + 1}} \, \frac{\Gamma(n_k/2 + 1)}{\left( 2\sigma_0^2 + 0.5 \left( \sum_{i} x_{k_i j}^2 + \mu_0^2 - \frac{\sum_{i} (x_{k_i j} + \mu_0)^2}{n_k + 1} \right) \right)^{(n_k/2 + 1)}}$$
(1)

where k is the index of clusters, j is the index of probes, $n_k$ is the number of samples in the kth cluster, $k_i$ is the index of samples in the kth cluster, and $\mu_0$ and $\sigma_0^2$ are the overall mean and variance of all the data [13].
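
Because Eq. (1) is a product of many small factors, it is more convenient numerically to work with its logarithm. The following Python sketch transcribes Eq. (1) term by term as printed; it is not PLASMID's own code. The function names and the use of the population variance for $\sigma_0^2$ are assumptions of this sketch.

```python
import math


def overall_mean_variance(X):
    """Mean and population variance of every value in X (variance form assumed)."""
    values = [v for row in X for v in row]
    mu0 = sum(values) / len(values)
    return mu0, sum((v - mu0) ** 2 for v in values) / len(values)


def log_partition_likelihood(X, partition, mu0, sigma0_sq):
    """Logarithm of Eq. (1); X has one row per sample and one column per probe.

    partition[i] is the cluster index of row i. sigma0_sq must be positive.
    """
    clusters = {}
    for row, label in zip(X, partition):
        clusters.setdefault(label, []).append(row)

    total = 0.0
    for rows in clusters.values():
        n_k = len(rows)
        for j in range(len(X[0])):
            col = [row[j] for row in rows]
            sum_sq = sum(x * x for x in col)
            # Transcribed as printed in Eq. (1): sum_i (x + mu0)^2 / (n_k + 1).
            # The textbook normal/inverse-gamma marginal squares the sum instead,
            # i.e. (sum(col) + mu0) ** 2; this sketch keeps the printed form.
            shifted = sum((x + mu0) ** 2 for x in col) / (n_k + 1)
            base = 2.0 * sigma0_sq + 0.5 * (sum_sq + mu0 ** 2 - shifted)
            total += (
                math.log(2.0 * sigma0_sq) - math.lgamma(1.0)
                - (n_k / 2.0) * math.log(2.0 * math.pi)
                - 0.5 * math.log(n_k + 1)
                + math.lgamma(n_k / 2.0 + 1.0)
                - (n_k / 2.0 + 1.0) * math.log(base)
            )
    return total
```

The value depends only on which samples share a cluster, not on the cluster labels themselves, so relabeling the clusters of a partition leaves it unchanged.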

Using this as a measure, the genetic algorithm is used to find the partition that maximizes the likelihood. The steps of the genetic algorithm are summarized as follows:

  1. Generate N random partitions. Each partition is represented by a vector [1 2 1 ⋯] where each term is the index of a cluster.

  2. Prior knowledge of pairs of samples highly unlikely to be in the same cluster can be incorporated into the partition likelihood by creating a text file with each pair of samples, together with a small weighting factor, on one line. The weighting factor must be smaller than 1, but how much smaller has to be determined empirically based on the end result. A weighting factor of zero indicates that the pair cannot be in the same cluster.

  3. Compute the likelihood $\mathcal{L}$ for all partitions.

  4. Repeat the following steps until the maximum number of iterations (Max) has been reached or the difference between the likelihoods of two successive iterations is less than ε, where Max and ε are given.

     (a) Select the two partitions with the highest scores.

     (b) Do crossover and mutation on these two partitions to generate new partitions. Crossover is accomplished by randomly selecting sections of equal length from each partition and exchanging them. Mutation is performed following crossover and is accomplished by randomly selecting one term in each of the partitions and changing it to a different value.

     (c) Compute the likelihood $\mathcal{L}$ for these two new partitions (offspring).

     (d) Replace the two lowest-ranked partitions with the offspring.

Other measures can be used, including the Bayesian information criterion and minimum description length. These measures will be included in future versions of PLASMID.
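
The genetic search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not PLASMID's implementation: the function name, the default parameter values, and the `patience` argument (the number of consecutive rounds with an improvement below ε that are tolerated before stopping) are all hypothetical, and the pairwise weighting factors of step 2 are omitted. The scoring function is passed in, so any measure for which higher values are better can be used.

```python
import random


def genetic_partition_search(n_samples, n_clusters, score, pop_size=20,
                             max_iter=500, eps=1e-9, patience=50, seed=0):
    """Return (best_partition, best_score); higher score(partition) is better."""
    rng = random.Random(seed)
    # Step 1: N random partitions, each a vector of cluster indices.
    population = [[rng.randrange(n_clusters) for _ in range(n_samples)]
                  for _ in range(pop_size)]
    # Step 3: score every partition and keep the list sorted, best first.
    ranked = sorted(((score(p), p) for p in population),
                    key=lambda t: t[0], reverse=True)
    stalled = 0
    for _ in range(max_iter):
        previous_best = ranked[0][0]
        # (a) Copy the two highest-scoring partitions.
        child_a, child_b = ranked[0][1][:], ranked[1][1][:]
        # (b) Crossover: swap two randomly placed sections of equal length.
        length = rng.randint(1, n_samples)
        start_a = rng.randrange(n_samples - length + 1)
        start_b = rng.randrange(n_samples - length + 1)
        section_a = child_a[start_a:start_a + length]
        child_a[start_a:start_a + length] = child_b[start_b:start_b + length]
        child_b[start_b:start_b + length] = section_a
        # (b) Mutation: change one random term of each offspring to another label.
        for child in (child_a, child_b):
            pos = rng.randrange(n_samples)
            others = [c for c in range(n_clusters) if c != child[pos]]
            if others:
                child[pos] = rng.choice(others)
        # (c), (d) Score the offspring and overwrite the two lowest-ranked partitions.
        ranked[-2:] = [(score(child_a), child_a), (score(child_b), child_b)]
        ranked.sort(key=lambda t: t[0], reverse=True)
        # Stop once the best score has improved by less than eps for `patience` rounds.
        stalled = stalled + 1 if ranked[0][0] - previous_best < eps else 0
        if stalled >= patience:
            break
    return ranked[0][1], ranked[0][0]
```

Setting `patience=1` gives the most literal reading of the stopping rule in step 4. Because the offspring only ever replace the two lowest-ranked partitions, the best score in the population never decreases from one round to the next.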

Probe ranking for classification

In a DNA microarray data set there are usually many more probes than the number of samples (e.g., plasmids) to be classified, and often some probes either convey no useful information or convey the same information. Thus, in the design of an optimal probe set for sample classification, one objective is to identify and remove irrelevant and redundant probes. In this section, we describe our method for removing irrelevant probes; in the next section redundancy reduction is described.

Irrelevant probes are removed using probe ranking on the clusters of samples obtained in the previous step. There are two basic approaches to probe ranking: filter techniques and wrapper techniques. Because of their simplicity, filter procedures are used most commonly for DNA microarrays. The filter procedure ranks each probe using a metric based on its classification relevance. Top-ranked probes are then selected to perform classification. Numerous filter metrics are described in the literature [14]: probabilistic and distance metrics, dependence measures, scores based on information theory, etc. In our tool, filter metrics are determined using two different statistical tests, the ANOVA-F and Brown-Forsythe tests. Other tests considered were the Welch, adjusted Welch, Cochran, and Kruskal-Wallis test statistics [15].

The test statistic is used as a metric to evaluate the discriminating power of a probe. Higher values represent more discriminating probes. For some applications, clusters may include an insufficient number of samples for meaningful statistical analysis. Such cases can be handled by generating random samples that differ only slightly from the original samples. These samples can be included in the statistical analysis and then discarded without compromising the probe ranking procedure. The purpose of adding these samples is for computational convenience only; they do not add more information.

The end result of the probe ranking function is a list of all probes ranked by their classification relevance. At this point, the user can either stop and use some chosen number of the top-ranked probes for the array probe set or continue with probe reduction and stepwise discriminant analysis to remove redundant probes and assign weights to the probes.
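
To make the filter procedure concrete, the following Python sketch ranks probes by the one-way ANOVA F statistic computed over the sample clusters. It is an illustration only: the function names are hypothetical, the Brown-Forsythe alternative is not shown, and the handling of a zero within-cluster sum of squares is an assumption of this sketch.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one probe.

    groups holds one list of values per cluster; at least two clusters and
    more samples than clusters are required.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        # Assumed convention: a perfectly separating probe gets an infinite score.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_probes(X, labels):
    """Probe (column) indices of X sorted from most to least discriminating."""
    cluster_ids = sorted(set(labels))
    scored = []
    for j in range(len(X[0])):
        groups = [[row[j] for row, lab in zip(X, labels) if lab == c]
                  for c in cluster_ids]
        scored.append((anova_f(groups), j))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [j for _, j in scored]
```

Taking the first few entries of the list returned by `rank_probes` corresponds to keeping a chosen number of top-ranked probes.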

Stepwise discriminant analysis

Probe ranking is used to remove irrelevant probes that convey little or no information. Nevertheless, while the top-ranked probes are informative, at least some of them are likely to convey redundant information. The next task is to remove this unnecessary redundancy. K-means clustering is usually used to cluster samples (e.g., plasmids) as described in an earlier section, but here we use it in a novel way to cluster probes. A set of top-ranked probes is clustered into κ groups where the value of κ is evaluated empirically to maximize classification accuracy; probes in the same group are highly correlated with each other but uncorrelated or loosely correlated with probes in other groups. The probe closest to the center of a group is chosen to be representative of that group, and the κ representative probes are used with stepwise discriminant analysis (SDA) [16] which identifies the optimal probe set $\mathcal{G}$ from the κ probes. At each step of the SDA, an $\mathcal{F}$ statistic is computed for each probe; this value is used to determine whether including the probe in or excluding the probe from $\mathcal{G}$ will significantly improve sample differentiation. The SDA process starts with an empty probe set $\mathcal{G}$, and an iterative process of adding a probe to $\mathcal{G}$ or removing a probe from $\mathcal{G}$ continues until no probes can be added or removed. $\mathcal{F}_{remove}$ is used for the probes in $\mathcal{G}$, and $\mathcal{F}_{enter}$ is used for the probes not in $\mathcal{G}$. The probe in $\mathcal{G}$ with the smallest value of $\mathcal{F}_{remove}$ less than a chosen threshold value, usually 1.0, is removed; the probe not in $\mathcal{G}$ with the largest value of $\mathcal{F}_{enter}$ greater than the threshold value is added to $\mathcal{G}$. The formulas used to compute $\mathcal{F}$ are:

$\mathcal{F}$ values:

$$\mathcal{F}_{remove} = \frac{n - r - q + 1}{q - 1} \, \frac{\Lambda(\mathcal{G} \setminus p) - \Lambda(\mathcal{G})}{\Lambda(\mathcal{G})}$$
(2)

$$\mathcal{F}_{enter} = \frac{n - r - q}{q - 1} \, \frac{\Lambda(\mathcal{G}) - \Lambda(\mathcal{G} \mid p)}{\Lambda(\mathcal{G} \mid p)}$$
(3)

Wilks' Λ:

$$\Lambda(\mathcal{G}) = \frac{\det(W)}{\det(T)}$$
(4)

Within-group covariance matrix:

$$W(\mathcal{G}) = \sum_{m=1}^{q} \sum_{k=1}^{n_m} (x_{mki} - x_{mi\cdot})(x_{mkj} - x_{mj\cdot})$$
(5)

Total covariance matrix:

$$T(\mathcal{G}) = \sum_{m=1}^{q} \sum_{k=1}^{n_m} (x_{mki} - x_{i\cdot\cdot})(x_{mkj} - x_{j\cdot\cdot})$$
(6)

where q is the number of clusters, $n_m$ is the number of samples in cluster m, $x_{mki}$ is the value of the ith probe for the kth sample in the mth cluster, $x_{mi\cdot}$ is the mean of the ith probe over the samples in cluster m, $x_{i\cdot\cdot}$ is the mean of the ith probe over all samples, n is the total number of samples, r is the number of probes currently included in $\mathcal{G}$, $\mathcal{G} \mid p$ denotes a new group of probes which is obtained by adding the probe p to $\mathcal{G}$, and $\mathcal{G} \setminus p$ denotes a new group of probes which is obtained by removing the probe p from $\mathcal{G}$.
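
Equations (2) through (6) can be checked numerically with a short Python sketch such as the one below. It is an illustration only, with hypothetical function names. Taking Λ of the empty probe set to be 1 is an assumption of this sketch; it makes $\mathcal{F}_{enter}$ for the first probe equal to the one-way ANOVA F statistic. The sketch also assumes that det(T) is nonzero.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[pivot][c] == 0:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def scatter(rows, probes):
    """Sums of cross-products of deviations from the mean of `rows`."""
    n = len(rows)
    means = [sum(r[p] for r in rows) / n for p in probes]
    return [[sum((r[pi] - mi) * (r[pj] - mj) for r in rows)
             for pj, mj in zip(probes, means)]
            for pi, mi in zip(probes, means)]


def wilks_lambda(X, labels, probes):
    """Eq. (4): det(W) / det(T), restricted to the probe indices in `probes`."""
    if not probes:
        return 1.0  # assumed convention for the empty probe set
    size = len(probes)
    W = [[0.0] * size for _ in range(size)]
    for c in set(labels):
        S = scatter([r for r, lab in zip(X, labels) if lab == c], probes)
        for i in range(size):
            for j in range(size):
                W[i][j] += S[i][j]  # Eq. (5): within-group sums
    T = scatter(X, probes)          # Eq. (6): deviations from the overall mean
    return det(W) / det(T)


def f_enter(X, labels, probes, p):
    """Eq. (3) for a probe p that is not in `probes`."""
    n, q, r = len(X), len(set(labels)), len(probes)
    lam = wilks_lambda(X, labels, probes)
    lam_p = wilks_lambda(X, labels, probes + [p])
    return (n - r - q) / (q - 1) * (lam - lam_p) / lam_p


def f_remove(X, labels, probes, p):
    """Eq. (2) for a probe p that is in `probes`."""
    n, q, r = len(X), len(set(labels)), len(probes)
    lam = wilks_lambda(X, labels, probes)
    lam_without = wilks_lambda(X, labels, [g for g in probes if g != p])
    return (n - r - q + 1) / (q - 1) * (lam_without - lam) / lam
```

With these definitions, the statistic for removing a probe from a one-probe set equals the statistic that admitted it to the empty set, which is a useful sanity check on the two formulas.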

At the conclusion of SDA, the optimal probe set is determined based on the prediction accuracy of the selected probes. Because there are typically a small number of samples associated with microarray data, prediction accuracy is computed using the leave-one-out (LOO) cross validation method [11, 15]. The set of probes associated with the highest LOO prediction accuracy is written to a file together with its associated weights. It is important to note that when SDA is used to obtain the final probe set, the weights associated with the probes must be used for classification of new empirical data obtained using the probes. The probes should not be treated with equal weight.
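
The LOO procedure itself can be sketched in a few lines of Python. The classifier used below, a nearest-centroid rule with equal probe weights, is only a stand-in chosen to keep the example short; it is not the weighted discriminant function produced by SDA, and the function names are hypothetical.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Label of the class centroid closest to x (stand-in classifier)."""
    centroids = {}
    for lab in set(train_y):
        rows = [r for r, l in zip(train_X, train_y) if l == lab]
        centroids[lab] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))


def loo_accuracy(X, y, probes):
    """Fraction of samples classified correctly when each is held out in turn."""
    hits = 0
    for i in range(len(X)):
        # Train on every sample except i, using only the selected probes.
        train_X = [[r[p] for p in probes] for k, r in enumerate(X) if k != i]
        train_y = [lab for k, lab in enumerate(y) if k != i]
        held_out = [X[i][p] for p in probes]
        hits += nearest_centroid_predict(train_X, train_y, held_out) == y[i]
    return hits / len(X)
```

Comparing `loo_accuracy` across candidate probe sets mirrors the selection of the set with the highest LOO prediction accuracy.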

Probe selection for a classification microarray

In summary, the steps in our design of an optimal probe set are:

  1. Cluster the samples (e.g., plasmids) using microarray or sequence data and select clusters of interest using a hierarchical, non-hierarchical, and/or model-based method. A priori clustering is also permitted.

  2. Use the probe ranking procedure with the sample clusters to rank the probes for relevance.

  3. Repeat K-means clustering of probes for probe reduction until satisfied:

     (a) Select j top-ranked probes.

     (b) Repeat for κ in a chosen range:

         i. Cluster the j top-ranked probes into κ clusters.

         ii. Choose κ representative probes, one from each cluster.

         iii. Use SDA to find a set of probes from the κ representative probes and compute the LOO prediction accuracy.

  4. Save the set of probes associated with the highest LOO prediction accuracy together with its weights. After constructing the optimized microarray, a set of independent control samples should be hybridized to empirically assess the accuracy of the microarray results.
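
Steps 3(b)i and 3(b)ii above, clustering the top-ranked probes and choosing one representative per cluster, can be sketched in Python as follows. This is an illustration under stated assumptions rather than PLASMID's code: the function name is hypothetical, squared Euclidean distance between probe profiles is assumed, and the initial centers are drawn at random from the probes.

```python
import random


def kmeans_representatives(profiles, k, n_iter=50, seed=0):
    """Cluster probes with K-means and return one representative id per cluster.

    profiles maps a probe id to its list of values across samples. Fewer than
    k ids are returned if a cluster ends up empty.
    """
    rng = random.Random(seed)
    ids = list(profiles)
    centers = [profiles[i][:] for i in rng.sample(ids, k)]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    assign = {}
    for _ in range(n_iter):
        # Assign each probe to its nearest center.
        new_assign = {i: min(range(k), key=lambda c: dist2(profiles[i], centers[c]))
                      for i in ids}
        if new_assign == assign:
            break
        assign = new_assign
        # Move each center to the mean profile of its members.
        for c in range(k):
            members = [profiles[i] for i in ids if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]

    representatives = []
    for c in range(k):
        members = [i for i in ids if assign[i] == c]
        if members:
            # The probe closest to the cluster center represents the cluster.
            representatives.append(
                min(members, key=lambda i: dist2(profiles[i], centers[c])))
    return representatives
```

The returned probe ids would then be passed to SDA, and the loop over κ repeated with a different value of `k`.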

A flowchart of the process is shown in Fig. 1. It should be pointed out that the optimal number of probes computed by this process does not take into account the effects of noise and other random experimental effects. The sample-to-feature ratio (SFR) gives the minimum number of probes that should be used to create a microarray. The rule of thumb is given by [17]:

$$\mathrm{SFR} = \frac{\text{number of samples}}{\text{number of features}} \le \frac{1}{5}.$$

In this paper we refer to features as probes. The SFR should be used in conjunction with the results to choose the optimal probe set.

Figure 1

Flowchart of PLASMID. Flowchart of the probe selection process using PLASMID.

System Overview and Implementation

Our software tool PLASMID is implemented as a Java application. The NetBeans platform was chosen for development because new functions are easy to add and because it provides many of the tasks common to desktop applications. These include user interface management (e.g., menus and toolbars), user settings management, storage management (saving and loading any kind of data), window management, and a wizard framework (supporting step-by-step dialogs). Each function is implemented as a NetBeans module and can be installed or removed easily without affecting existing functions. Because Java is platform independent, PLASMID should be relatively easy to adapt to other operating systems even though it was developed on Windows; we intend to extend it to both Linux and Mac OS X. In addition to Java, PLASMID uses code written in C++ for computationally intensive tasks that require greater speed and efficiency. The use of two programming languages is transparent to the user.

PLASMID provides an integrated environment for designing an optimal classification microarray. As such, PLASMID v0.91 includes the following services:

  1. Loading and management of different kinds of input data, including plasmid sequence data, hybridization data, virtual hybridization data, and probe sequences. Data may be in tab-delimited or comma-delimited text format or in Microsoft Excel spreadsheet format.

  2. Different methods for processing hybridization data. The tool provides several data preprocessing methods, including normalization and noise filtering. It also provides hierarchical, non-hierarchical, and model-based methods for clustering samples; two different statistical tests for ranking probes; use of K-means clustering for reduction of probe redundancy; and stepwise discriminant analysis with assignment of weights to probes.

  3. Design of mixed arrays using existing hybridization data or virtual hybridization data. An optimal set of probes is identified, and weights associated with each probe are stored for analysis of experimental results.

  4. Construction of virtual microarrays to obtain virtual hybridization data using genomes from the National Center for Biotechnology Information (NCBI) database. Genomes for probes can be chosen by accession number or by gene sequence.

  5. Visualization of microarray data and data processing results, including dendrograms, heat maps, and scatter plots. Plots can be saved in different image formats.

  6. Automatic probe design after the user has specified the parameters. A step-by-step wizard guides the user through the various steps.

Experimental data obtained from microarrays designed using PLASMID can be used as input data and analyzed using the weighted classification function obtained in item 3.
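To illustrate how stored probe weights might be applied to new experimental data, the sketch below evaluates one linear discriminant score per group and assigns the sample to the highest-scoring group. The form of the function (one weight vector and one constant per group), the names, and all numbers are assumptions made for the example; the file format and exact discriminant functions used by PLASMID are not described here.

```python
def discriminant_scores(sample, weights, constants):
    """Linear score per group: constant + sum of weight * probe intensity."""
    return {
        group: constants[group] + sum(w * x for w, x in zip(weights[group], sample))
        for group in weights
    }

def classify(sample, weights, constants):
    """Assign the sample to the group with the largest discriminant score."""
    scores = discriminant_scores(sample, weights, constants)
    return max(scores, key=scores.get)

# Hypothetical weights for a two-probe array and two groups (made-up values).
weights = {"group1": [2.0, -1.0], "group2": [-1.0, 2.0]}
constants = {"group1": 0.0, "group2": 0.0}

first = classify([3.0, 0.5], weights, constants)   # strong signal on probe 1
second = classify([0.2, 4.0], weights, constants)  # strong signal on probe 2
```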

Results and Discussion

In this section we present results obtained using PLASMID to analyze a mixed-plasmid microarray data set [4] and a simulated mixed-genome microarray data set [1]. We also present results for publicly available leukemia expression array data [18]. For this latter data set, clusters (i.e., types of leukemia) are pre-assigned, so only probe ranking, reduction of probe redundancy, and stepwise discriminant analysis (SDA) are used to determine the optimal probe set. PLASMID's performance in probe selection is evaluated using the leave-one-out (LOO) approach, in which one sample is excluded and the remaining samples are used to obtain the discriminant functions. Each sample is excluded in turn, and the corresponding set of discriminant functions is used to classify it. The prediction accuracy, the percentage of times the withheld samples are correctly classified, is used as the performance metric.
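The LOO evaluation can be sketched as below. To keep the example short, a nearest-centroid classifier stands in for the discriminant functions obtained by SDA; this substitution, the function names, and the toy data are assumptions made for illustration only.

```python
import math

def nearest_centroid(train_x, train_y, x):
    """Stand-in classifier (not SDA): pick the class whose mean profile is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))

def loo_accuracy(samples, labels, classify=nearest_centroid):
    """Percentage of withheld samples that are classified correctly."""
    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Withhold sample i and train on all remaining samples.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, x) == y:
            correct += 1
    return 100.0 * correct / len(samples)

# Toy data: one probe, six samples, two well-separated groups.
samples = [(0.1,), (0.2,), (0.3,), (2.0,), (2.1,), (2.2,)]
labels = ["A", "A", "A", "B", "B", "B"]
accuracy = loo_accuracy(samples, labels)  # every withheld sample is recovered
```

Any classifier with the same `(train_x, train_y, x)` signature can be passed as `classify`, so the same loop applies when the discriminant functions come from SDA.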

Mixed-plasmid microarray data

A mixed-plasmid microarray has been used to compare the genetic composition of plasmids [4]. The microarray consists of 576 probes composed of randomly selected fragments of plasmid DNA, and the data were obtained from hybridization experiments with 43 plasmids. The data are composed of hybridization signal intensities for each microarray probe [see Additional file 1].

First we used Ward's minimum-variance hierarchical clustering algorithm to create a dendrogram. To test the two-class problem, we divided the dendrogram into two clusters. One cluster consisted of 15 plasmids which, with one exception (the peSSuTet plasmid), have the blaCMY-2 antibiotic resistance gene; the other cluster consisted of 28 plasmids. We then used the probe ranking function, choosing the ANOVA-F test statistic, and generated a scatter plot (Fig. 2). The scatter plot shows that the majority of the probes have statistical values close to zero and, thus, that ANOVA-F test statistics can be used to distinguish between informative (F > 0) and uninformative (F ≈ 0) probes. This result also highlights the need for optimization algorithms, as the majority of probes provide limited discrimination.

Figure 2

Scatter plot of ANOVA-F test statistics for the mixed-plasmid microarray probes. The scatter plot shows that the majority of the probes have statistical values close to zero and, thus, that ANOVA-F test statistics can be used to distinguish between informative (F > 0) and uninformative (F ≈ 0) probes.
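As an illustration of ranking probes by the ANOVA-F statistic, the sketch below computes the standard one-way F statistic (between-cluster mean square divided by within-cluster mean square) for each probe and sorts the probes accordingly. The function names and the toy intensities are assumptions for the example and are not taken from PLASMID or from the plasmid data set.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic for one probe across labelled samples."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-group and within-group sums of squares.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))

def rank_probes(data, labels):
    """Return probe ids sorted from most to least informative."""
    return sorted(data, key=lambda p: anova_f(data[p], labels), reverse=True)

# Toy intensities for six samples in two clusters (made-up values).
labels = ["c1", "c1", "c1", "c2", "c2", "c2"]
data = {
    "flat": [3.0, 3.1, 2.9, 3.0, 3.1, 2.9],         # same in both clusters, F near 0
    "informative": [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],  # differs between clusters, F large
}
ranking = rank_probes(data, labels)
```

A probe whose intensities look the same in every cluster yields F close to zero, which is the pattern seen for most probes in the scatter plot.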

For the two-cluster case, we chose 1, 20, and 200 top-ranked probes for comparison. Using reduction of probe redundancy and SDA, we found that a single probe (5-E3, a transposase gene associated with the blaCMY-2 element [19]) correctly classified all but two of the plasmids [4]. Interestingly, in the original study one of these two plasmids (pe1171sT) was classified with plasmids that harbor the blaCMY-2 gene even though it does not carry this gene. Analysis with PLASMID separated pe1171sT from the blaCMY-2 plasmids. In addition, a different plasmid (pe7594T) that harbors the blaCMY-2 gene was classified with other blaCMY-2 positive plasmids. Thus, analysis using PLASMID more accurately reflects the phenotypic properties of the plasmids included in the study. The one exception was the peSSuT plasmid that was consistently classified with blaCMY-2 plasmids while not harboring this gene [8].

Next we divided the original dendrogram into five plasmid clusters and ranked probes as before. As expected, the number of probe clusters κ specified for the reduction of probe redundancy affects the prediction accuracy (Table 1). Small values of κ certainly reduce redundancy, but they also reduce specificity. The optimal set of probes is identified using SDA with the LOO method to determine the highest prediction accuracy. In this case, the smallest number of probes from the top-most ranked probes with the highest prediction accuracy is 10. Thus, PLASMID analysis reduced the original data set of 576 probes to 10 probes that are needed to accurately classify plasmids into one of five groups. Non-hierarchical clustering followed by probe ranking, probe reduction, and SDA gave similar results (data not shown).

Table 1 Classification accuracy of mixed-plasmid data using hierarchical clustering with five sample (plasmid) clusters. PA is the prediction accuracy.

In addition to hierarchical and non-hierarchical clustering methods, we can obtain classification results using our model-based method, which is based on a genetic algorithm. The genetic algorithm predicted that the most likely number of plasmid clusters is five (Table 2). Comparison of Tables 1 and 2 shows that prediction accuracies depend on the initial clustering method used. For this case, the prediction accuracies for the model-based clustering method are higher for a given number of probe clusters than those obtained via the hierarchical method. Furthermore, the variance in prediction accuracies is lower as a function of the number of top-ranked probes when clusters are initially assigned using the model-based method. For other data sets, however, another clustering model might give the best results.

Table 2 Classification accuracy of mixed-plasmid data with model-based clustering. PA is the prediction accuracy.

Based on the sample-to-feature ratio (SFR), at least 9 probes (features) are required for classifying 43 plasmids (samples). Tables 1 and 2 show several choices for 10 probes with equivalent performance. When additional information is available, it should be used to assist with the choice of a final set.
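The figure of 9 probes for 43 samples corresponds to one probe for every five samples, rounded up. The short sketch below assumes that reading of the rule of thumb; the function name and the `samples_per_probe` parameter are hypothetical and are not part of PLASMID.

```python
import math

def min_probes(n_samples, samples_per_probe=5):
    """Smallest whole number of probes giving at least one probe per five samples."""
    # Assumption: the rule of thumb is applied as ceil(samples / 5).
    return math.ceil(n_samples / samples_per_probe)

plasmid_minimum = min_probes(43)  # 43 / 5 = 8.6, rounded up to 9
```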

Virtual Streptococcus mixed-genome microarray data

A virtual Streptococcus mixed-genome microarray was constructed by Wan et al. [1]. To create the equally represented, 4000-probe virtual array, 800 gene segments, each 600 bp long, were randomly selected for each of five bacterial species from the genomes of fifteen strains; that is, each species was represented by 800 different probes. Virtual hybridization was accomplished using BLAST scores as proxies for array probe intensities [see Additional file 2], and PLASMID was used to analyze the data. In the initial analysis one bacterial species was excluded from the study because it was represented by only a single strain (S. mutans UA159). Because we knew a priori that the samples belonged to four different species, the goal was to find an optimal set of probes to classify these four. ANOVA-F tests were used to rank the 4000 probes, and LOO analysis was performed on different numbers of the highest-ranked probes. We found the LOO prediction accuracy to be 100% for differentiating the four species using only the single top-ranked probe. On examination we found that the hybridization values (BLAST scores) for this probe were well separated between strains from different groups and very similar for strains from the same group. Although a single probe appears sufficient when classification relies on differences in hybridization signal, the inherent sources of variation in microarray hybridization make it prudent to include additional probes to increase classification confidence for empirical data. For example, according to the SFR the minimum recommended probe set in this case would be 3.

In the second analysis, our model-based clustering method identified two clusters, one with the two S. pneumoniae strains and the other with the remaining 13 strains. After probe ranking, reduction of probe redundancy, and SDA, a single probe could be used to differentiate these two groups. We also used non-hierarchical clustering of the samples followed by probe ranking, probe reduction, and SDA. When the number of clusters was chosen to be k = 2, the result was identical to the result obtained using our model-based clustering method. When the number of clusters was chosen to be k = 3 or k = 4, the two S. pneumoniae genomes were placed into different groups. A dendrogram constructed using the neighbor joining method shows a clear distinction between the two S. pneumoniae samples and the remaining bacteria (Fig. 3). When these two samples are excluded, PLASMID groups the remaining thirteen samples correctly into four species clusters. The results shown in Table 3 were obtained using non-hierarchical clustering, probe ranking, probe reduction, and SDA. As this table illustrates, only 2 probes are needed to obtain 100% prediction accuracy by species. These 2 probes are from the genomes of S. pneumoniae TIGR4 and either S. pyogenes M1 GAS or S. pyogenes MGAS5005. Based on the SFR rule of thumb, at least 3 probes are needed, and several choices satisfy this condition.

Figure 3

Dendrogram for Streptococcus MGM data. The dendrogram constructed using the neighbor joining method shows a clear distinction between the two S. pneumoniae samples and the remaining bacteria.

Table 3 Classification accuracy using mixed-genome array data with non-hierarchical clustering for four sample (bacterial species) clusters. PA is the prediction accuracy.

For virtual microarrays, BLAST scores are used to obtain hybridization intensities, and the accuracy of the scores will affect the choice of an optimal probe set. While error could be modeled from real data, the best measure of reliability will be obtained using actual hybridization experiments.

Public ALL/AML leukemia data

The ALL/AML leukemia data set, obtained from expression arrays, has been widely used in the literature. It consists of two classes of leukemia, acute lymphoblastic leukemia (ALL) and acute myeloblastic leukemia (AML), with 72 samples (47 ALL and 25 AML) and 7129 probes. Table 4 shows prediction accuracy results after probe ranking, probe redundancy reduction, and SDA have been performed. When the top 50 probes were selected, the highest accuracy was achieved when probes were clustered into 10 groups. A set of 10 probes was identified with a prediction accuracy of 97.22%. Using additional probes does not lead to improvement. According to the SFR rule of thumb, at least 20 probes should be used in the actual microarray design; several choices of 20 probes exist and all produce robust prediction results (Table 4).

Table 4 Classification accuracy using ALL/AML leukemia data. PA is the prediction accuracy.

Conclusion

In this paper we describe a new software tool, PLASMID, for selecting an optimal set of probes for the design of a classification microarray. The tool provides the user with several clustering methods, a probe ranking method, probe redundancy reduction, and probe selection using stepwise discriminant analysis. Images can be saved in several different formats, and weights generated using SDA can be stored for use in analysis of experimental data. In addition, PLASMID can be used to construct virtual microarrays with genomes from public databases; these can then be used to determine an optimal probe set for use in actual microarray experiments. The software package has been applied to data from a mixed-plasmid microarray, a virtual mixed-genome microarray, and an expression microarray. Robust results have been obtained for all three sets of data.

Although many methods are available for determining a set of probes for a given microarray data set, these methods require the classification information to be known in advance. PLASMID was designed to be used prior to implementation of a microarray when no such information is available, although the program can also be used when clusters are known a priori.

PLASMID can be obtained by following the link from http://www.vetmed.wsu.edu/research_vmp/MicroArrayLab/.

Availability and requirements

  • Project name: PLASMID

  • Project home page: http://www.vetmed.wsu.edu/research_vmp/MicroArrayLab/

  • Operating system: Windows (ports to Linux and Mac OS X are planned)

  • Programming languages: Java and C++ (with gcc compiler)

  • Other requirements: Java Runtime Environment

  • License: Free to academic and nonprofit organizations