Background

Pigeonpea (Cajanus cajan (L.) Millsp.) is one of the major grain legume crops of the tropical and subtropical regions of the world [1]. It is the only cultivated food crop of the Cajaninae sub-tribe and has a diploid genome with 11 pairs of chromosomes (2n = 2× = 22) and an estimated genome size of 858 Mbp [2]. The genus Cajanus comprises 32 species, most of which are found in India and Australia, and one of which is native to West Africa. Pigeonpea is a major food legume crop in South Asia and East Africa, with India as the largest producer (3.5 Mha) followed by Myanmar (0.54 Mha) and Kenya (0.20 Mha) [3]. It plays an important role in food security, balanced diet and poverty alleviation because of its diverse uses as food, fodder and fuel wood [4]. Several abiotic stresses (e.g. drought, salinity and water-logging) and biotic stresses (e.g. diseases like Fusarium wilt and sterility mosaic, and pod borer insects) are serious challenges for sustainable pigeonpea production to meet the demands of resource-poor people in several African and Asian countries.

Fusarium wilt (FW), caused by Fusarium udum, is an important biotic constraint on pigeonpea production in the Indian subcontinent, where it results in crop losses of 16-47% [5]. The fungus enters the host vascular system at the root tips through wounds or through invasion sites made by nematodes, leading to progressive chlorosis of leaves and branches, wilting, and collapse of the root system [6]. In India alone, the loss due to this disease is estimated at US$71 million, and disease incidence varies from 5.3 to 22.6% [7].

Sterility mosaic disease (SMD), caused by pigeonpea sterility mosaic virus (PPSMV) and transmitted by an eriophyid mite (Aceria cajani Channabasavanna), is one of the widespread diseases of pigeonpea. The disease is characterized by a bushy and pale green appearance of plants, followed by a reduction in size, an increase in the number of secondary branches, mosaic mottling of leaves and, finally, partial or complete cessation of reproductive structures. Some parts of the plant may show disease symptoms while other parts remain unaffected [8].

Owing to the above-mentioned factors, combined with limited water availability in the fields of the semi-arid tropics where the crop is grown, productivity has remained stagnant at around 0.7 t/ha during the past two decades [1]. With the advent of genomic tools such as molecular markers and genetic maps, conventional plant breeding has been greatly facilitated, and improved genotypes/varieties with enhanced resistance/tolerance to biotic/abiotic stresses have been developed in several crop species [9, 10]. In the case of pigeonpea, however, only a very limited number of genomic tools are available so far [11, 12]. For instance, only 156 microsatellite or simple sequence repeat (SSR) markers [13-16] and 908 expressed sequence tags (ESTs) were available for pigeonpea at the time this study was undertaken. Transcriptome sequencing to generate ESTs is a fast approach for enhancing the genomic resources of pigeonpea. ESTs, which are generated by large-scale single-pass sequencing of randomly picked cDNA clones, are a cost-effective and valuable resource for efficient and rapid identification of novel genes and development of molecular markers [17]. Further, ESTs have been employed in bioinformatic analyses to identify genes that are differentially expressed in various tissues, cell types or developmental stages of the same or different genotypes [18, 19].

In view of the above, this study was undertaken to obtain a comprehensive resource of FW- and SMD-responsive ESTs in pigeonpea with the following objectives: (i) generation of FW- and SMD-responsive ESTs, (ii) functional annotation of assembled unigenes, (iii) in silico identification of putative FW- and SMD-responsive genes, and (iv) development of novel SSR and SNP markers in pigeonpea.

Results

Root tissue is the site of infection by Fusarium udum, the fungal agent causing Fusarium wilt in pigeonpea. To evaluate the transcriptional responses after infection of roots by F. udum, six unidirectional cDNA libraries were constructed from FW-infected root tissues of each of the resistant ('ICPL 20102') and susceptible ('ICP 2376') genotypes, one for each stage sampled, viz. 6, 10, 15, 20, 25 and 30 days after inoculation (DAI). Infected roots were examined by light microscopy upon harvest at the different stages. The severity of wilt disease in both the susceptible and resistant genotypes was observed in longitudinal sections of the stem and root vascular regions at 15 and 30 DAI (Figure 1). Likewise, for SMD, leaf tissue is the specific site of infection; therefore leaf samples of the SMD-infected genotypes 'ICP 7035' (SMD-resistant) and 'TTB 7' (SMD-susceptible) were harvested at 45 and 60 days after sowing (DAS). RNA was extracted and unidirectional cDNA libraries were subsequently constructed (see Additional file 1).

Figure 1

Fusarium wilt (FW) challenged pigeonpea seedlings at 30 days after inoculation (DAI). a) FW-challenged pigeonpea genotypes 'ICPL 20102' and 'ICP 2376' at 30 DAI; b & c) Microscopic examination of the FW-resistant pigeonpea genotype ('ICPL 20102') showing no disease symptoms on shoot and root vascular tissues; d & e) Microscopic examination of the FW-susceptible pigeonpea genotype ('ICP 2376') showing severe wilt symptoms on shoot and root vascular tissues.

Generation of FW- and SMD-responsive ESTs

A total of 16 unidirectional cDNA libraries were constructed from the four genotypes, i.e. 'ICPL 20102' and 'ICP 2376', and 'ICP 7035' and 'TTB 7', which represent the parents of mapping populations segregating for FW and SMD, respectively. Using the Sanger sequencing approach, 3,168 ESTs were generated from root cDNA libraries of 'ICPL 20102' and 2,880 from those of 'ICP 2376'. Similarly, 1,920 ESTs were generated from the leaf cDNA libraries of each of the SMD-responsive genotypes, 'ICP 7035' and 'TTB 7'. Details of EST generation from the different cDNA libraries are given in Figure 2. In brief, a total of 9,888 ESTs were generated and, after stringent screening to remove short (<100 bp) and poor-quality sequences, 9,468 high-quality ESTs were obtained with an average read length of 514 bp (Figure 2). All EST sequences were deposited in the dbEST division of GenBank under accession numbers GR463974 to GR473857 and GR958228 to GR958231.
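
The length-screening step can be illustrated with a minimal Python sketch. It only covers the <100 bp cut-off mentioned above; the quality screening and vector trimming used in the study are not described in this section and are not reproduced here. The function names and the toy FASTA input are hypothetical.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=100):
    """Keep reads of at least min_len bases (reads < 100 bp are discarded)."""
    return [(h, s) for h, s in records if len(s) >= min_len]


def mean_length(records):
    """Average read length of the retained records (0.0 for an empty list)."""
    return sum(len(s) for _, s in records) / len(records) if records else 0.0


# Hypothetical toy input: one short read (80 bp) and two acceptable reads.
demo = ">est1\n" + "A" * 80 + "\n>est2\n" + "ACGT" * 30 + "\n>est3\n" + "G" * 200 + "\n"
kept = filter_by_length(parse_fasta(demo))
print(len(kept), mean_length(kept))
```

In this toy case the 80 bp read is dropped and the mean length is computed over the two remaining reads.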

Figure 2

Summary of total ESTs generated from FW- and SMD-responsive pigeonpea genotypes. Generation and analysis of ESTs from 16 cDNA libraries of pigeonpea subjected to Fusarium wilt (FW) and sterility mosaic disease (SMD) stresses; (A) Clustering and assembly of 2,943 and 2,737 HQS (high quality sequences) derived from FW-responsive cDNA libraries of pigeonpea genotypes 'ICPL 20102' and 'ICP 2376', respectively, resulted in 3,316 unigenes (UG-I); (B) Clustering and assembly of 1,894 HQS from each of the SMD-responsive pigeonpea genotypes 'ICP 7035' and 'TTB 7' resulted in 1,308 unigenes (UG-II); (C) The 9,468 HQS generated from all four genotypes in the study, as shown in (A) and (B), were analyzed together and provided a set of 4,557 unigenes (UG-III); (D) Clustering analysis of the ESTs generated in this study along with 908 public domain pigeonpea ESTs resulted in 5,085 unigenes (UG-IV). RS: Raw sequences; VS/ET: Vector trimmed/EST trimmed sequences; HQ: High quality sequences; PD: Public domain pigeonpea sequences from NCBI.

Pigeonpea EST assembly

To minimize redundancy, clustering and assembly were carried out on different EST datasets to define unigenes for (a) FW-responsive ESTs, (b) SMD-responsive ESTs, (c) FW- and SMD-responsive ESTs, and (d) the entire set of pigeonpea ESTs including those from the public domain. These unigene (UG) sets are referred to as UG-I, UG-II, UG-III and UG-IV, respectively. UG-I comprised 3,316 unigenes (389 contigs and 2,927 singletons) obtained by clustering 5,680 high-quality ESTs. Similarly, for UG-II, clustering of 3,788 high-quality sequences resulted in 1,308 unigenes (328 contigs and 980 singletons). Based on clustering of all 9,468 high-quality sequences generated in this study, UG-III was defined with 4,557 unigenes (697 contigs and 3,860 singletons). Cluster analysis of the 908 ESTs available in the public domain together with the 9,468 pigeonpea ESTs from this study resulted in UG-IV, which included 5,085 unigenes (871 contigs and 4,214 singletons). The number of ESTs in a contig ranged from 2 to 573, with an average of 7 ESTs per contig. As expected, contigs with two EST members accounted for a higher percentage (46.7%) than contigs with three or more EST members (Figure 3).
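
The arithmetic behind these unigene sets can be checked with a short Python sketch. The counts are those reported above; the sketch assumes that every EST is either a singleton or a member of exactly one contig, and that all 908 public domain ESTs entered the UG-IV clustering. The names `UNIGENE_SETS` and `summarize` are illustrative only.

```python
# (ESTs clustered, contigs, singletons, unigenes reported) for each set.
UNIGENE_SETS = {
    "UG-I": (5680, 389, 2927, 3316),
    "UG-II": (3788, 328, 980, 1308),
    "UG-III": (9468, 697, 3860, 4557),
    "UG-IV": (9468 + 908, 871, 4214, 5085),  # assumes all public ESTs were clustered
}


def summarize(ests, contigs, singletons):
    """Unigenes = contigs + singletons; the remaining ESTs sit inside contigs."""
    ests_in_contigs = ests - singletons
    return {
        "unigenes": contigs + singletons,
        "ests_in_contigs": ests_in_contigs,
        "mean_ests_per_contig": ests_in_contigs / contigs,
    }


for name, (ests, contigs, singletons, reported) in UNIGENE_SETS.items():
    stats = summarize(ests, contigs, singletons)
    print(name, stats["unigenes"] == reported, round(stats["mean_ests_per_contig"], 2))
```

Under these assumptions the contig and singleton counts add up to the reported unigene totals, and the UG-IV mean of about 7 ESTs per contig agrees with the average quoted in the text.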

Figure 3

Frequency and distribution of pigeonpea ESTs among assembled contigs. The number of ESTs in each EST class (on the X-axis) is as follows: Class 1 = 2, Class 2 = 3, Class 3 = 4, Class 4 = 5, Class 6 = 7, Class 7 = 8, Class 8 = 9, Class 9 = 10, Class 10 = 11, Class 11 = 12, Class 12 = 13, Class 13 = 14, Class 15 = 16, Class 16 = 17, Class 17 = 18, Class 18 = 19, Class 19 = 20, Class 20 = 21, Class 21 = 22, Class 22 = 23, Class 23 = 24, Class 24 = 25, Class 25 = 27, Class 26 = 28, Class 27 = 29, Class 28 = 30, Class 29 = 31, Class 30 = 34, Class 31 = 35, Class 32 = 37, Class 33 = 43, Class 34 = 46, Class 35 = 54, Class 36 = 60, Class 37 = 61, Class 38 = 66, Class 39 = 89, Class 40 = 92, Class 41 = 94, Class 42 = 125, Class 43 = 165, Class 44 = 172, Class 45 = 234, Class 46 = 248, Class 47 = 314 and Class 48 = 573.

Comparison of pigeonpea unigenes with other plant EST databases

All four sets of unigenes, i.e. UG-I, UG-II, UG-III and UG-IV, were subjected to BLASTN similarity searches against the available EST datasets of legume species, namely chickpea (Cicer arietinum), pigeonpea (Cajanus cajan), soybean (Glycine max), Medicago (Medicago truncatula), Lotus (Lotus japonicus) and common bean (Phaseolus vulgaris), and of three model plant species, namely Arabidopsis (Arabidopsis thaliana), rice (Oryza sativa) and poplar (Populus alba). An E-value significance threshold of ≤ 1E-05 was used to define a hit. Detailed results of the BLASTN analyses for all four unigene sets are given in Table 1. For instance, analysis of UG-III found the highest identity with soybean (60.3%), followed by cowpea (43.6%), Medicago (43.0%), common bean (42.2%) and Lotus (37.2%), and the least identity with chickpea (23.2%). Comparative BLASTN analysis of pigeonpea unigenes against the EST databases of the model plant species showed the highest identity with poplar (35.4%), followed by Arabidopsis (33.7%), and the least similarity with rice (28.3%). Of the 4,557 unigenes, 2,839 (62.2%) showed significant identity with ESTs of at least one of the plant species analysed, while 227 (4.9%) showed significant identity across all the plant EST databases in this study. It is also interesting to note that 39 unigenes did not show any homology with the legume species examined.

Table 1 BLASTN analyses of pigeonpea unigenes against legume and model plant ESTs

To identify the putative functions of the unigenes compiled in this study, the unigenes from all four sets (UG-I, UG-II, UG-III and UG-IV) were compared against the non-redundant UniProt database using the BLASTX algorithm. At a significance threshold of E-value ≤ 1E-08, 1,005 (30.30%) of the UG-I, 638 (48.77%) of the UG-II, 1,603 (35.17%) of the UG-III and 1,777 (34.94%) of the UG-IV unigenes showed significant similarity with known proteins (Figure 4). Details of the BLASTX and BLASTN analyses against the UniProt database for all four unigene sets are provided in Additional files 2, 3, 4 and 5.
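
As an illustration of how such hit counts can be derived, the following Python sketch counts the queries with at least one hit at or below an E-value threshold. It assumes the results are available in the standard 12-column tabular BLAST output, in which the first column is the query ID and the eleventh is the E-value; the study's actual output format and scripts are not described here. The function names and the toy records (`contig1`, `subjA`, etc.) are hypothetical.

```python
def queries_with_significant_hit(tabular_text, max_evalue=1e-8):
    """Return the set of query IDs with at least one hit at E-value <= max_evalue."""
    significant = set()
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            significant.add(query_id)
    return significant


def percent_with_hits(significant, total_queries):
    """Percentage of all queries that have a significant hit."""
    return 100.0 * len(significant) / total_queries


# Hypothetical toy records: qseqid, sseqid, pident, length, mismatch, gapopen,
# qstart, qend, sstart, send, evalue, bitscore.
rows = [
    ["contig1", "subjA", "85.0", "200", "30", "0", "1", "200", "1", "200", "1e-30", "150"],
    ["contig1", "subjB", "70.0", "60", "18", "0", "1", "60", "1", "60", "2e-3", "40"],
    ["contig2", "subjC", "75.0", "90", "22", "0", "1", "90", "1", "90", "1e-5", "50"],
    ["contig3", "subjD", "80.0", "100", "20", "0", "1", "100", "1", "100", "5e-9", "60"],
]
demo = "\n".join("\t".join(r) for r in rows)
hits = queries_with_significant_hit(demo)
print(sorted(hits), percent_with_hits(hits, total_queries=4))
```

Passing `max_evalue=1e-5` instead would reproduce the less stringent threshold used for the BLASTN comparisons.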

Figure 4

BLASTX analysis of pigeonpea unigenes against UniProt database. BLASTX homology search was performed for all the four unigene datasets (UG-I, UG-II, UG-III and UG-IV) against the non-redundant UniProt database. The values against each bar represent total number of unigenes, total number of hits, significant hits at ≤ 1E-08 and no hits for each unigene set.

Functional categorization of pigeonpea unigenes

The unigenes from all four sets that showed a significant hit (≤ 1E-08) against the UniProt database were further classified into functional categories. As a result, 640 (63.6%) of the UG-I, 448 (70.2%) of the UG-II, 997 (62.1%) of the UG-III and 1,119 (62.9%) of the UG-IV unigenes were assigned to the three principal GO categories, i.e. biological process, molecular function and cellular component. As in earlier studies of this nature, one gene could be assigned to more than one principal category; thus the total number of GO mappings across the categories exceeded the number of unigenes analyzed. The full lists of gene annotations for the significant hits of the four unigene sets are given in Additional files 6, 7, 8 and 9. For instance, of the 1,603 (35.1%) unigenes of UG-III with significant hits, only 997 (21.8%) were assigned to the three principal categories. As a result, a total of 132 were grouped under biological process, 132 under molecular function and 153 under cellular component (Figure 5). Under the biological process category, cellular process accounted for 101 unigenes, followed by metabolic process (82), biological regulation (32) and response to stimulus (21). In the cellular component category, 160 unigenes coded for cell part, 112 for organelle and 70 for organelle part. In the last category, molecular function, the majority of the unigenes were involved in binding (95) and catalytic activity (44). The remaining 606 unigenes, which could not be classified into any of the three GO categories, were grouped as "unclassified". The distribution of unigenes (UG-III) along with the corresponding Gene Ontology (GO) categories is provided in Additional file 10. Based on the GO annotation, enzyme commission IDs were also retrieved from the UniProt database to obtain an overview of the unigenes (UG-III) putatively annotated as enzymes. The largest group of unigenes fell under oxidoreductases (107), followed by transferases (91), hydrolases (90), lyases (36), ligases (21) and isomerases (18). Similar patterns of distribution were observed in all the remaining unigene sets.

Figure 5

Gene Ontology (GO) assignment of pigeonpea unigenes (UG-III) by GO annotation. Functional categorization and distribution of 997 unigenes (UG-III) among the three GO categories, i.e. biological process, cellular component and molecular function, according to the UniProt database.

In silico expression analysis

Differentially expressed genes among specific cDNA libraries of FW- and SMD-responsive genotypes were identified on the basis of EST counts in each contig using the web-based statistical tool IDEG.6. As a result, 19 genes were identified as differentially expressed between the 'ICPL 20102' (FW-resistant) and 'ICP 2376' (FW-susceptible) genotypes. Similarly, 20 genes were differentially expressed between the 'ICP 7035' (SMD-resistant) and 'TTB 7' (SMD-susceptible) genotypes (Figures 6 and 7).
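
To illustrate how EST counts can be compared between two libraries, the sketch below implements a two-sided Fisher's exact test on a 2 × 2 table (ESTs of a given contig versus all other ESTs, in each of two libraries). This is shown only as one commonly used test for count comparisons of this kind; which test and settings were applied within IDEG.6 in this study is not stated in this section. The library sizes are the high-quality sequence counts reported for the two FW genotypes, whereas the contig counts (12 and 1) are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    a, c: ESTs of the contig in libraries 1 and 2;
    b, d: all remaining ESTs in libraries 1 and 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x contig ESTs falling in library 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Library sizes from the text (2,943 and 2,737 HQS); contig counts are hypothetical.
p_value = fisher_exact_two_sided(12, 2943 - 12, 1, 2737 - 1)
print(p_value)
```

A small p-value for a contig suggests that its EST abundance differs between the two libraries more than expected by chance, subject to the usual multiple-testing caveats.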

Figure 6

Differential gene expression between FW-responsive genotypes using the IDEG.6 web tool. Differentially expressed genes between libraries of FW-resistant ('ICPL 20102') and susceptible ('ICP 2376') genotypes. Cells with different degrees of blue color represent the extent of gene expression.

Figure 7

Differential gene expression between SMD-responsive genotypes using the IDEG.6 web tool. Differentially expressed genes between libraries of SMD-resistant ('ICP 7035') and susceptible ('TTB 7') genotypes. Cells with different degrees of blue color represent the extent of gene expression.

To assess the relatedness of the libraries and of the expressed genes in terms of expression pattern, a cluster analysis based on EST abundance in each contig was performed [20]. Of the 697 contigs (UG-III) that were subjected to R-statistics [21], only 71 contigs passed the threshold for true positive significance (R > 8) and were eventually subjected to hierarchical clustering analysis (Additional file 11). The correlated gene expression pattern of these 71 contigs/genes is displayed in Figure 8. All 12 FW-derived libraries were grouped into a single cluster, while all four SMD-challenged libraries were grouped into another cluster. About 49 genes were more highly expressed in SMD-challenged libraries than in FW-challenged libraries, which may be attributable to high accumulation of defence proteins during SMD infection. In the cluster of FW-challenged libraries, the 'ICPL 20102'-30 DAI library was placed between the FW-susceptible challenged libraries 'ICP 2376'-6 DAI and 'ICP 2376'-30 DAI. Each cluster represents a different pattern of gene expression, as shown in Figure 8. Based on the clustering pattern and library specificity, Clusters I and IV were further divided into sub-clusters (represented by different colour bars). These results indicate that the pattern and percentage of gene expression varied according to the severity of the stress in a specific library.
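
The R > 8 filter can be illustrated with a minimal Python sketch, assuming that the R statistic referred to here is the log-likelihood ratio statistic commonly used for comparing EST abundance across cDNA libraries, R = Σ x_i ln(x_i / (N_i f)), where x_i is the number of ESTs of a contig in library i, N_i is the size of library i, and f is the pooled frequency of the contig over all libraries. This definition, the function name and the example counts are assumptions made for illustration and are not taken from this section.

```python
from math import log


def r_statistic(counts, library_sizes):
    """Log-likelihood ratio statistic for one contig across several libraries.

    counts[i]: ESTs of the contig observed in library i.
    library_sizes[i]: total number of ESTs sequenced from library i.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    pooled_freq = total / sum(library_sizes)  # expected frequency under no difference
    r = 0.0
    for x, n in zip(counts, library_sizes):
        if x > 0:  # a zero count contributes nothing (x * ln x -> 0)
            r += x * log(x / (n * pooled_freq))
    return r


# Hypothetical counts for one contig in four equally sized libraries.
example = r_statistic([12, 0, 1, 0], [500, 500, 500, 500])
print(round(example, 2), example > 8)
```

A contig whose ESTs are spread in proportion to library size gives R = 0, whereas a contig concentrated in one library gives a large R and would pass the filter.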

Figure 8

Hierarchical clustering analysis of differentially expressed genes from 16 libraries of pigeonpea using HCE version 2.0 beta web tool. Clusters of genes highly expressed in different libraries of pigeonpea genotypes subjected to FW and SMD stress. Columns represent different cDNA libraries and their relationship in a dendrogram. Clustering of highly expressed ESTs (normalized using R statistics, R>8) into four major clusters (indicated in vertical colour bars), and their cluster sub groups based on their library specificity. Colour scale represents the range of expression pattern by different genes with respect to libraries.

Cluster I grouped 11.3% (8) of the total genes and was further subdivided into two groups accounting for 2.8% (2) and 8.5% (6) of the genes, respectively. Similarly, Cluster II and Cluster III accounted for 4.2% (3) and 15.5% (11) of the genes, and the largest cluster, Cluster IV, included 69.0% (49) of the total genes in three sub-groups, IVa, IVb and IVc, accounting for 14.0% (10), 10% (7) and 45% (32) of the genes, respectively. Cluster analysis also showed high-level expression of genes encoding chloroplast/photosystem-related proteins (22.5%), developmental proteins (19.7%), cellular proteins (15.4%), metabolic proteins (14.0%), defence/stimulus-responsive proteins (4.3%), protein-specific binding proteins (2.8%) and a few uncharacterized proteins (19.8%).

Marker discovery

Unlike other classes of genetic markers, EST-based markers can assay functional genetic variation and hence were targeted for marker development [22]. The unigene set based on the ESTs generated in this study as well as those available in the public domain was used for the development of simple sequence repeat (SSR) and single nucleotide polymorphism (SNP) markers.

Identification and development of genic microsatellite markers

The entire set of 5,085 pigeonpea unigenes (UG-IV) was used to identify SSRs using the MISA (MIcroSAtellite) tool [23]. As a result, a total of 3,583 SSRs were identified at a frequency of 1/800 bp in coding regions (Table 2). In all, 698 ESTs contained more than one SSR and 1,729 SSRs were found as compound SSRs. In terms of the distribution of the different classes of SSRs, i.e. mono-, di-, tri-, tetra-, penta- and hexa-nucleotide repeats, mononucleotide SSRs contributed the largest proportion (3,498, 97.6%). Only a limited number of SSRs of the other classes were found. For instance, di- and tri-nucleotide SSRs accounted for 40 (1.1%) and 33 (0.9%), respectively, while 9 tetrameric, 2 pentameric and 1 hexameric microsatellites were present (Figure 9). When the criteria for Class I (> 20 nucleotides in length) and Class II SSRs (< 20 nucleotides in length) used by Temnykh and colleagues [24] and Kantety and colleagues [25] were applied to all SSRs, 641 SSRs represented Class I while 2,942 SSRs represented Class II (Table 2).
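
The kind of search performed in this step can be sketched in Python with regular expressions. This is not the MISA tool itself; it is a simplified illustration that finds perfect repeats only and does not detect compound SSRs. The minimum repeat numbers in `MIN_REPEATS` are assumed values, since the search parameters are not given in this section, and SSRs of exactly 20 nucleotides are assigned to Class I here as an assumption because the text leaves that boundary case open.

```python
import re

# Minimum number of repeat units per motif length (assumed values).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'AGAG')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return perfect SSRs as dicts with motif, repeat number, 1-based start, length and class."""
    seq = seq.upper()
    found = []
    for unit_len, min_rep in min_repeats.items():
        # A motif of unit_len bases followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if not is_primitive(motif):
                continue  # e.g. a poly-A run should not be reported as an (AA)n repeat
            length = match.end() - match.start()
            found.append({
                "motif": motif,
                "repeats": length // unit_len,
                "start": match.start() + 1,
                "length": length,
                "class": "I" if length >= 20 else "II",
            })
    return sorted(found, key=lambda s: s["start"])


# Hypothetical test sequence with a poly-A run, an (AG)10 repeat and a (CTT)5 repeat.
demo_seq = "GGC" + "A" * 12 + "CGT" + "AG" * 10 + "TTC" + "CTT" * 5 + "GA"
for ssr in find_ssrs(demo_seq):
    print(ssr)
```

On this toy sequence the sketch reports three SSRs, of which only the 20-nucleotide (AG)10 repeat falls in Class I.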

Figure 9

EST-SSR motifs derived from pigeonpea unigenes (UG-IV). Number of EST-SSR repeat motifs (excluding monomers) derived from unigenes (UG-IV) of pigeonpea cDNA libraries subjected to FW and SMD stress.

Table 2 Features of SSRs identified in ESTs

In general, mononucleotide SSRs are not included in primer design and synthesis. However, because only a very limited number of SSR markers are currently available for pigeonpea in the public domain, and because some mononucleotide SSRs were found to be polymorphic in a separate study [15], primer pairs were designed for 383 SSRs, including mononucleotide SSRs. A total of 94 primer pairs were considered for validation after excluding the primers for monomeric SSR motifs and for compound SSRs with mononucleotide repeats. However, based on repeat-number criteria, such as a minimum of 5 repeats for di-, tri-, tetra- and penta-nucleotide motifs, primer pairs were synthesized for 84 SSRs. The details of the newly developed pigeonpea EST-SSR primers, along with the corresponding SSR motif, primer sequence, annealing temperature and product size, are provided in Additional file 12.

The 84 newly synthesized markers were analyzed on 40 elite pigeonpea genotypes (Additional file 13). As a result, 52 (61.9%) primer pairs provided scorable amplified products and 26 primer pairs produced a number of faint bands indicative of non-specific amplification. A total of 15 (28.8%) of the scorable markers showed polymorphism, with 2-7 alleles and an average of 4 alleles per marker in the genotypes examined. These markers showed moderate polymorphism information content (PIC) values, ranging from 0.20 to 0.70 with an average of 0.40 (Table 3). To evaluate the genetic variability within a diverse collection of pigeonpea accessions that are parents of different mapping populations segregating for important agronomic traits, and to determine the genetic relationships among them, phylogenetic analysis on the basis of dissimilarities was performed using the NTSYS software package. The UPGMA cluster diagram showed clear separation of wild and cultivated species (Figure 10).
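
For readers unfamiliar with PIC, the following Python sketch computes it from allele calls, assuming the widely used definition PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², where p_i is the frequency of allele i. The exact formula used in the study is not stated in this section (some SSR studies use the simpler 1 − Σ p_i²), so this is an illustration rather than the authors' calculation. The allele labels are hypothetical.

```python
from collections import Counter


def allele_frequencies(calls):
    """Allele frequencies from a list of allele calls (one call per genotype)."""
    counts = Counter(calls)
    n = sum(counts.values())
    return [c / n for c in counts.values()]


def pic(freqs):
    """Polymorphism information content for a marker with the given allele frequencies."""
    p = list(freqs)
    if abs(sum(p) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(x * x for x in p)
    cross = sum(2 * p[i] ** 2 * p[j] ** 2
                for i in range(len(p)) for j in range(i + 1, len(p)))
    return 1.0 - homozygosity - cross


# Hypothetical marker scored on 40 genotypes with two equally frequent alleles.
calls = ["allele_180bp"] * 20 + ["allele_186bp"] * 20
print(pic(allele_frequencies(calls)))
```

A monomorphic marker gives PIC = 0, and the value rises as alleles become more numerous and more evenly distributed.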

Figure 10

Dendrogram of elite pigeonpea accessions based on UPGMA analysis. Unweighted Pair Group Method using arithmetic average dendrogram showing relatedness among the forty elite pigeonpea genotypes representing 8 wild species and 32 cultivated genotypes. The scale at the bottom of the dendrogram indicates the level of similarity between the genotypes.

Table 3 Characteristics of pigeonpea EST-SSR markers

SNP discovery and identification of CAPS markers

SNPs are an important class of molecular markers that have become increasingly popular in recent times. To enhance the reliability of SNP identification, only SNPs occurring in contigs with ≥ 5 ESTs derived from more than one genotype were considered. In silico analysis revealed a total of 102 SNPs in 37 contigs (27,659 bp), i.e. a frequency of 1/271 bp (Table 4). To validate these in silico identified SNPs, 10 contigs were used as examples to generate PCR amplicons, which were sequenced in four genotypes, namely 'ICPL 20102', 'ICP 2376', 'ICP 7035' and 'TTB 7'. Scorable and sequenceable amplicons were obtained for six contigs (contig 210, contig 433, contig 535, contig 555, contig 620 and contig 718), but not for the other four (contig 67, contig 330, contig 587 and contig 632). Sequencing of the amplicons of all four genotypes for all six contigs showed the occurrence of SNPs as predicted in silico (Additional file 14). For instance, for contig 433, a comparison of the amplified DNA sequences of the four genotypes ('ICPL 20102', 'ICP 2376', 'ICP 7035' and 'TTB 7') with the 5 EST sequences coming from two genotypes ('ICP 7035' and 'TTB 7') showed the occurrence of the same G to C SNP between 'ICP 7035' and 'TTB 7' (Figure 11).
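
The selection criterion described above can be sketched in Python as follows. The sketch assumes that the ESTs of a contig are already aligned to equal length (with '-' for gaps) and labelled by genotype; it reports columns where each genotype is internally consistent but the genotypes differ. This is an illustrative simplification, not the SNP detection procedure actually used, and the function name and toy reads are hypothetical.

```python
from collections import Counter


def candidate_snps(reads, min_reads=5):
    """Find columns where genotypes carry different bases.

    reads: list of (genotype, aligned_sequence) tuples of equal length.
    Contigs with fewer than min_reads ESTs or a single genotype are skipped.
    """
    genotypes = {g for g, _ in reads}
    if len(reads) < min_reads or len(genotypes) < 2:
        return []
    snps = []
    for pos in range(len(reads[0][1])):
        bases_by_genotype = {}
        for genotype, seq in reads:
            base = seq[pos].upper()
            if base in "ACGT":  # ignore gaps and ambiguous bases
                bases_by_genotype.setdefault(genotype, Counter())[base] += 1
        if len(bases_by_genotype) < 2:
            continue
        # Require every genotype to show a single base at this column.
        if any(len(c) != 1 for c in bases_by_genotype.values()):
            continue
        alleles = {g: next(iter(c)) for g, c in bases_by_genotype.items()}
        if len(set(alleles.values())) > 1:
            snps.append((pos + 1, alleles))  # 1-based column position
    return snps


# Hypothetical toy contig: five ESTs from two genotypes with a G/C difference.
toy_reads = [
    ("ICP 7035", "ATGGCA"), ("ICP 7035", "ATGGCA"), ("ICP 7035", "ATGGCA"),
    ("TTB 7", "ATGCCA"), ("TTB 7", "ATGCCA"),
]
print(candidate_snps(toy_reads))

# SNP frequency reported in the text: 102 SNPs over 27,659 bp of contig sequence.
print(round(27659 / 102))
```

The last line reproduces the reported frequency of roughly one SNP per 271 bp.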

Figure 11

A snapshot of the sequence alignment of EST sequences and amplicons for contig 433 validating the in silico predicted SNP. (a) CAP3 alignment of ESTs generated from 'ICP 7035' and 'TTB 7' in contig 433 showing a SNP between these genotypes and (b) multiple sequence alignment of amplicon sequences generated from genomic DNA of four genotypes ('ICPL 20102', 'ICP 2376', 'ICP 7035' and 'TTB 7') with the primer pairs for the assembled contig 433. The in silico identified SNP in the EST contig 433 was confirmed in the amplicon sequences.

Table 4 Summary of SNPs and CAPS markers identified from pigeonpea ESTs

To enable cost-effective and robust genotyping of the 102 SNPs detected in 37 contigs, efforts were made to identify restriction enzymes that can be used to assay the SNPs via the cleaved amplified polymorphic sequence (CAPS) assay. The results indicated that the SNPs present in the 37 contigs can be assayed using CAPS (Table 4).
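
The principle of CAPS marker identification can be sketched in Python: a SNP is CAPS-convertible when it creates or abolishes a restriction site, so that the two alleles carry different numbers of recognition sites for some enzyme. The sketch below uses a small panel of well-known enzymes with palindromic recognition sites; the enzymes and software actually used in the study are not named in this section, and the allele sequences are hypothetical.

```python
# Small illustrative panel of restriction enzymes and their recognition sites.
ENZYMES = {
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "BamHI": "GGATCC",
    "TaqI": "TCGA",
    "MspI": "CCGG",
    "AluI": "AGCT",
}


def count_sites(seq, site):
    """Count (possibly overlapping) occurrences of a recognition site in seq."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - len(site) + 1) if seq[i:i + len(site)] == site)


def caps_candidates(allele_a, allele_b, enzymes=ENZYMES):
    """Enzymes whose number of recognition sites differs between the two alleles.

    Only the forward strand is scanned, which is sufficient here because all
    sites in the panel are palindromic.
    """
    return sorted(
        name for name, site in enzymes.items()
        if count_sites(allele_a, site) != count_sites(allele_b, site)
    )


# Hypothetical amplicon alleles differing by a single C/G SNP inside an EcoRI site.
allele_1 = "ACGTGAATTCACGT"
allele_2 = "ACGTGAATTGACGT"
print(caps_candidates(allele_1, allele_2))
```

In practice the predicted fragment sizes would also need to be resolvable on a gel, which this sketch does not check.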

Discussion

Plants are known to have developed integrated defence mechanisms against fungal and viral infections that involve spatial and temporal transcriptional changes. The EST approach has been successfully utilized for the identification of disease-responsive genes from various tissues and growth stages in chickpea [26], Lathyrus [27], soybean [28], rice [29] and ginseng [30]. Many earlier studies have shown that resistant genotypes have efficient mechanisms for stress perception and enhanced expression of defence-responsive genes, which maintain cellular survival and recovery [31]. Hence, the present study was undertaken to catalogue the defence-related genes expressed in response to FW and SMD infection in pigeonpea by generating ESTs from different stress-challenged tissues at various time intervals.

Generation of cDNA libraries and unigene assemblies

Roots provide structural and physiological support for plant interactions with the soil environment by transporting water, ions and nutrients. Plants encounter many biotic stress factors, including bacterial, fungal and viral infections, and roots and leaves are the primary sites of infection by these organisms. Therefore, a total of 16 cDNA libraries were generated at different time intervals to specifically target roots infected with Fusarium udum and leaves affected by SMD. In total, 5,680 high-quality ESTs were generated from FW-challenged genotypes and 3,788 high-quality ESTs from SMD-challenged genotypes. At the time of analysis in November 2008, the public domain contained only 908 ESTs for pigeonpea. Thus the present study contributes an approximately 10-fold increase in the pigeonpea EST resource and an addition of 4,557 pigeonpea unigenes (UG-III).

Functional annotation of pigeonpea unigenes

Homology searches (BLASTN and BLASTX) against other plant ESTs and functional characterization were carried out for all the defined unigene datasets (UG-I, UG-II, UG-III and UG-IV). Of the 5,085 unigenes (UG-IV) assembled from all the pigeonpea ESTs, 3,280 (64.5%) had significant identity with ESTs of at least one of the plant species analyzed and 299 (5.8%) showed significant identity with ESTs of all the plant species analyzed in the study, while 41 (0.8%) were found to be novel to pigeonpea. The highest significant identity was observed with soybean (56.3%) and the lowest with chickpea (22.7%) (Table 1). Similar BLASTN results were observed for the remaining three unigene sets (UG-I, UG-II and UG-III) against the ESTs of the plant species surveyed. Comparative analysis of the newly defined UG-III dataset (4,557) with the 908 public domain pigeonpea ESTs showed that only 508 (11.1%) shared identity, indicating that our EST sequencing study identified a new set of 4,049 (88.9%) pigeonpea unigenes. Relatively low similarity was observed with Lotus (36.5%) and Medicago (42.3%) compared with soybean and cowpea. These observations are in accordance with the phylogenetic relationships of legumes [32].

The pigeonpea ESTs showed higher similarity to the EST databases of the legume species (22.7-56.3%) than to those of monocot species (27.3-33.4%). Comparative analysis of pigeonpea ESTs with monocot species like rice (27.3%) showed that the percentage of significant hits is much lower than for the other legume species, in spite of the larger EST repository. This can be attributed to the phylogenetic divergence between dicots and monocots in the course of evolution. These comparisons also indicate that several unigenes that were absent in the non-legumes analysed but present in all the legume species may be specifically confined to legumes.

BLASTX analyses indicated that ESTs without significant identity to any protein sequence in the existing database may be novel and involved in plant defence responses. Hence, this novel EST collection represents a significant addition to the existing pigeonpea EST resources and provides valuable information for further prediction/validation of gene functions in pigeonpea.

A comprehensive comparison of the functionally categorized unigenes of all four unigene datasets (UG-I, UG-II, UG-III and UG-IV) showed a similar distribution. Among the significantly annotated unigenes, a large number were involved in cell part, organelle, binding, organelle part, metabolic process and cellular process. These observations are consistent with earlier functional categorization studies in rice [29], soybean [33], barley [34] and tall fescue [35]. However, the numbers of sequences encoding activities in categories such as biological regulation and response to stimulus were 28 and 20, respectively, in the case of FW-responsive ESTs, compared with 0 and 2 in the case of SMD-responsive ESTs. This was possibly because the ESTs generated from FW-challenged root libraries were mostly involved in the response to pathogenesis, whereas the ESTs derived from SMD-stressed libraries mostly encoded chloroplast binding proteins. Earlier studies by Lee and colleagues [36] and Ablett and colleagues [37] also reported that photosynthesis-related proteins were the most prevalent in aerial parts of the plant, where they supply energy for activities such as cell division, growth, elongation and development. Similarly, in this study, photosynthesis-related genes were identified in a larger proportion (30%) in the SMD-responsive cDNA libraries derived from leaf tissues.

In silico differential gene expression

Pathogen invasion results not only in the expression of novel genes/transcripts but also in altered abundances of different ESTs, reflecting induction or repression. This was evident from the differential expression of 19 genes between the FW-responsive genotypes and 20 genes between the SMD-responsive genotypes. It is, however, important to mention that in silico analysis of gene expression is not the ideal method for identifying differentially expressed genes. Nevertheless, as large-scale EST data were generated from the FW-responsive and SMD-responsive genotypes, an effort was made, as in some earlier studies [18-21, 34], to identify putative genes differentially expressed between FW- and SMD-resistant and sensitive genotypes. Validation of these candidate genes by Northern analysis or real-time quantitative PCR analysis is essential before they are deployed in other studies.

A significant number of unigene sequences related to proteins such as kinases, phosphatases, peroxidases, ribonucleases, endochitinases and glucanases, and to hormone-responsive genes such as abscisic acid responsive (ABR) genes, were identified as differentially expressed; these are known to play a vital role in defence mechanisms. For example, cell wall degrading enzymes like endochitinases (EC: 3.2.1.14) constitute a major defence mechanism against pathogens [27]. Similarly, kinases play a major role in the plant's recognition of pathogens [38, 39]. For instance, chitinase protein (UniProt ID: P23472), which belongs to a class of pathogenesis-related (PR) proteins with a bi-functional lysozyme/chitinase activity and is involved in random hydrolysis of N-acetyl-beta-D-glucosaminide (1→4)-beta-linkages in chitin and chitodextrins during systemic acquired resistance (SAR), was expressed at higher levels in the FW-resistant genotype ('ICPL 20102') than in the susceptible genotype ('ICP 2376'). The high expression levels of chitinase in the resistant genotype indicate its effectiveness within a narrow range of pathogenesis [40, 41].

Similarly, the ABA-responsive protein ABR18 (UniProt ID: Q06930) is involved in stimulus response and cell localization during plant development, and one of its vital roles is in defence during biotic stress signaling. The gene encoding this protein was expressed at relatively higher levels in the SMD-resistant pigeonpea genotype 'ICP 7035' than in the susceptible genotype 'TTB 7'. During pathogen infection, ABA inhibits the transcription of a basic β-1,3-glucanase (EC: 3.2.1.39) that can degrade the β-1,3-glucan callose, which forms a physical barrier to viral spread through plasmodesmata. This down-regulation of β-1,3-glucanase by ABA can be termed a resistance factor in plant-pathogen interactions [42]. In our study, significant expression signals were observed in the SMD-resistant genotype 'ICP 7035' during viral infection. This positive correlation between ABA levels and disease resistance has been reported in plant species such as common bean [43], rice [44] and tobacco [45]. Different enzymes such as methyltransferases (HMT3) (UniProt ID: Q8LAX0) and dehydrogenases (G3PC) (UniProt ID: P34921) are putatively involved in the synthesis of lignin in cell walls. These enzymes also play a major role in defence during pathogen interaction [46, 47].

An in silico hierarchical clustering analysis of 71 differentially expressed genes across 16 cDNA libraries was performed using HCE version 2.0 to infer potential relationships among the co-expressed genes. The profiles of some of the interesting gene families and genes that could play an important role in the stress response are described below.

In Cluster I, 6 of the 8 contigs were highly expressed in FW-challenged libraries of the susceptible genotype. The cluster includes genes encoding proteins involved in mitochondrial DNA (mtDNA) stability, a Na+/H+ exchanger and a few uncharacterized proteins. Sub-cluster Ia includes two genes that are highly expressed in the 'ICPL 20102' libraries (15 DAI and 25 DAI). The genes in sub-cluster Ib are highly expressed in the 'ICP 2376' libraries (6 DAI and 15 DAI). One of the putative proteins, TAR1 (Transcript Antisense to Ribosomal RNA), is a mitochondrial protein known to be involved in regulation and respiratory metabolism. Over-expression of this protein suppresses the respiration-deficient petite phenotype of a point mutation in mitochondrial RNA polymerase that affects mitochondrial gene expression and mtDNA stability. This dysfunction of mitochondria might occur in response to biotic or abiotic stress [48]. The over-expression of these genes was observed only in the 15 DAI libraries of both FW-responsive genotypes. Their immediate disappearance in the later stages of infection in the resistant genotype libraries, together with their continued expression in the susceptible genotype, supports the hypothesis that continual expression of this protein may lead to mitochondrial dysfunction and subsequent cell degeneracy.

Another protein, Na+/H+ exchanger 7 (UniProt ID: Q8BLV3), is a ubiquitous ion transporter that serves multiple cell physiological processes such as intracellular pH homeostasis and electroneutral exchange of protons for Na+ and K+ across endomembranes. Biochemical studies suggest that Na+/H+ exchangers in the plasma membrane of plant cells contribute to cellular sodium homeostasis during salt stress. Although this protein is expressed in plants under high salt stress, it may also be expressed during biotic stress. Its high expression in the 6 and 15 DAI libraries of the susceptible genotype 'ICP 2376' reflects the severity of stress during fungal pathogenesis. In the remaining FW- and SMD-challenged libraries it shows normal expression.

Cluster II genes include hevamine (EC: 3.2.1.14) and Leucoanthocyanidin dioxygenase (EC: 1.14.11.19) specifically expressed in 'ICPL 20102' 30 DAI library. The important protein hevamine represents a new class of polysaccharide-hydrolyzing (βα)8 barrel enzyme belonging to families of plant chitinases and lysozymes, which are vital for plant defence against pathogenic bacteria and fungi. Recent results indicate that these enzymes may be involved not only in defence-related process or general stress response but also in growth and development processes [49]. The high expression of these proteins in the late phase of Fusarium infection indicates their prolonged defensive role against fungal pathogenesis.

The genes in Cluster III are highly expressed in the FW-challenged 'ICP 2376' libraries at 25 and 30 DAI. This cluster comprises genes related to endoprotease activity, beta-glucan synthesis, carboxypeptidases, alkaloid biosynthesis, secologanin biosynthesis, beta-galactosidase activity, microtubule-stabilizing activity, nucleic acid binding protein and ribonucleoprotein, as well as a few uncharacterized proteins. For instance, antifungal proteins such as beta-glucanases (EC: 3.2.1.39) and Hansenula mrakii killer toxin-resistant proteins (UniProt ID: P41809), located in epidermal leaf cells, are believed to be involved in cell differentiation and defence against fungal pathogens [50]. Plants made deficient in these enzymes by antisense transformation showed markedly reduced resistance to viral and fungal infection. Similarly, carboxypeptidases (EC: 3.4.16 - 3.4.18) are another class of proteins with diverse functions, ranging from catabolism to the regulation of biological processes, including defence against pathogen attack [51].

The majority of genes (49) grouped in Cluster IV and were highly expressed in the SMD-responsive cDNA libraries derived from leaf tissues. As expected, photosynthesis related transcripts were abundantly represented in sub-clusters IVa, IVb and IVc, and these include putative transcripts such as ribosomal proteins, mitochondrial proteins, chloroplast precursor proteins, and photosystem I and II reaction centre proteins. The expression pattern of photosynthesis related proteins observed in this study is also consistent with experimental observations in barley [34].

Overall, the differentially expressed genes are involved in diverse pathways and display complex expression patterns. The different clusters of gene expression patterns suggest that various pathways responding to biotic stress exist in pigeonpea and that their interaction can lead to differential stress tolerance. The co-expressed uncharacterized transcripts could be a repository of novel proteins, and their further characterization may reveal a significant role in plant stress responses [52].

Development of functional markers

One of the primary goals of our research programme is to develop molecular markers based on expressed sequences and to screen them for polymorphism. During the last decade, microsatellites or SSRs have proven to be useful markers in plant genetic research and have been used for marker-assisted breeding. The presence of SSRs in coding regions suggests their importance as functional or gene-based markers [1, 11, 53]. Unfortunately, the development of microsatellite markers from genomic libraries is expensive, labor intensive and time consuming [54]. Mining microsatellite markers from EST data can be a cost-effective alternative. The cost of mining EST libraries is far lower than that of traditional methods, and SSR development by EST data mining has been successful [22, 23, 53-56]. SSR motifs with more than eight repeats for di-nucleotides, six for tri-nucleotides and five for tetra-nucleotides were considered. Dimeric repeat motifs (40) were more abundant than trimeric repeats (33), while tetra-, penta- and hexameric repeat motifs were considerably less represented. A total of 94 SSR markers were synthesized and characterized in a polymorphism survey. However, there are some distinct contrasts in the frequency and distribution of SSRs between ESTs and genomic survey sequences (GSSs). In general, di-nucleotide SSRs of all repeat lengths are more common in GSSs and tri-nucleotide SSRs are more common in ESTs [22, 23, 56, 57]. In contrast to these reports, we observed that di-nucleotide repeats were more abundant than tri-nucleotide repeat motifs [58, 59]. This observation is not unexpected, however, as the frequency and distribution of SSRs depend on several factors such as the size of the dataset and the tools and criteria used for SSR discovery [22].

In this study, a total of 15 polymorphic EST-SSR primer pairs were validated and used for a diversity study on forty pigeonpea genotypes representing 32 cultivated (C. cajan) and 8 wild accessions (six C. scarabaeoides and two C. platycarpus). All markers detected at least one allele in all genotypes tested, suggesting transferability of all markers across the Cajanus genus. In addition to high transferability, EST-SSRs are good candidates for the development of conserved orthologous sequence (COS) markers for genetic analysis and breeding of different species [10]. However, EST-SSRs have been reported to be less polymorphic than genomic SSRs in crop plants owing to greater DNA sequence conservation in transcribed regions [22, 60]. For instance, the 15 SSR loci provided only 60 alleles, with an average of 4 alleles per locus and an average PIC value of 0.43. Similar diversity features were observed in earlier SSR-based diversity studies in pigeonpea [14, 15].

EST-SSR profiles obtained on 40 pigeonpea genotypes were used to compute pair-wise genetic distances among the genotypes and to construct a dendrogram based on UPGMA clustering. The resulting tree grouped the 40 pigeonpea genotypes into three major clusters (Figure 10). Cluster I, comprising 32 cultivated genotypes, is the largest cluster, followed by Cluster III containing six wild genotypes (C. scarabaeoides); Cluster II is the smallest, with two wild genotypes belonging to C. platycarpus. This reveals a clear separation of the cultivated and the wild species. Less genetic variation was detected within the cultivated species, with only nine markers detecting polymorphism and a total of 35 alleles. The low genetic variability among cultivars compared with the wild species genotypes suggests that natural and artificial selection have contributed to the selection of specific alleles and to changes in allelic frequencies at specific loci, as reported by Odeny and colleagues [14]. The distinctness of C. platycarpus from the C. scarabaeoides accessions observed in this study correlates well with earlier studies [61]. It is also important to note that the 'ICPL 20097' and 'ICP 2376' genotypes were found to be closely related, with high genetic similarity, as both genotypes belong to the same geographic region. In conclusion, the EST-SSR markers developed in this study complement the currently available or ongoing efforts on the development of genomic SSRs, and will be a valuable resource for linkage map development and marker-assisted selection in pigeonpea [12].

SNPs and indels are an essentially inexhaustible resource of polymorphic markers for high-resolution genetic mapping of traits and for association studies. Although a variety of molecular markers are available, SNPs are comparatively advantageous because of their abundance and amenability to high-throughput approaches [62]. In addition, SNPs offer several advantages such as high-throughput and cost-effective genotyping [63] and identification of functional/gene-based markers for complex traits through linkage map development or association genetics [9, 10, 64, 65]. Although SNP discovery was a costly task in the past, advances in next generation sequencing technologies have made it cheaper and faster [66]. However, when ESTs are available from more than one genotype of a given species, in silico mining of ESTs is still a very inexpensive and fast approach for SNP discovery [17, 64], and we therefore used this approach for mining SNPs in this study.

Using an in silico mining approach on a total of 871 contigs derived from 10,376 ESTs (9,888 generated in this study and 908 available in the public domain), 102 potential SNPs were identified in 37 contigs that consisted of ≥ 5 ESTs. Smaller contigs were not considered for SNP mining, as such contigs are prone to errors owing to insufficient read depth, as reported by Wang and colleagues [67]. Sequence analysis of PCR products for a subset of 6 out of 10 contigs confirmed the occurrence of SNPs in all cases. As PCR products could not be generated for the remaining four contigs, the presence of SNPs could not be confirmed in those cases. Furthermore, as SNP genotyping is another important criterion in breeding programmes, identification of CAPS markers for the 37 contigs will facilitate SNP genotyping even in low-tech laboratories [63].

Conclusion

This study has contributed a new and significant set of 9,888 ESTs that, together with 908 public domain ESTs, provides a unigene set of 5,085 sequences for pigeonpea. Detailed analysis of these datasets has revealed several important features of the pigeonpea transcriptome, such as conserved genes (across legumes and model plant species) as well as possible pigeonpea-specific genes, assignment of pigeonpea genes to different GO categories, and identification of differentially expressed genes in response to FW and SMD stresses. In terms of the applied use of the developed resource in breeding, this study has demonstrated the development and application of gene-based molecular markers, i.e. SSRs, SNPs and CAPS. In summary, it is anticipated that this study is a significant contribution towards enhancing genomic resources in a so-called orphan legume crop that will eventually impact pigeonpea breeding [11, 12].

Methods

Plant material

Four pigeonpea genotypes, namely 'ICPL 20102' (FW-resistant), 'ICP 2376' (FW-susceptible), 'ICP 7035' (resistant to SMD) and 'TTB 7' (highly susceptible to SMD), were used for constructing the cDNA libraries and generating the ESTs. Seeds of two genotypes ('ICPL 20102' and 'ICP 2376') were procured from the Legume Pathology section at ICRISAT, and seeds of the remaining two genotypes ('ICP 7035' and 'TTB 7') were obtained from Dr. M Byregowda, University of Agricultural Sciences, Bangalore, India.

A total of 40 genotypes, including 32 genotypes from the cultivated species (C. cajan) and 8 genotypes from 2 wild species (C. platycarpus and C. scarabaeoides), were used for validation and diversity analysis with the new set of EST-SSR markers. These genotypes were obtained from Pigeonpea Breeding (Dr. KB Saxena) and Genebank (Dr. HD Upadhyaya) and are listed in Additional file 13.

Inoculation treatment for FW and SMD

Seeds of the FW-tolerant ('ICPL 20102') and FW-susceptible ('ICP 2376') genotypes were germinated in 15-inch deep polythene covers filled with sterile soil and sand (1:1) in a glass house at 23 ± 3°C under 80% relative humidity. The root, being the primary target of the pathogen Fusarium udum and the possible site of the initial defence response, was selected as the tissue of study. Ten-day-old seedlings were uprooted from the pots, and the root system was thoroughly washed in running tap water and rinsed with distilled water. Seedlings of each genotype were inoculated by immersing the roots for 2 min in fungal inoculum (Fusarium udum culture). The spore suspension, at 6 × 10⁵ conidia/ml, was made by adding fungal spores from several culture plates (Fusarium was grown on potato dextrose media supplemented with 0.25 μg/ml tetracycline). Immediately following inoculation, the seedlings were transplanted to a sterilized sand and soil mixture (1:1) in pots and transferred to the glass house.

In order to capture the genes expressed in resistant and susceptible genotypes at different time periods after inoculation, six stages, i.e. 6, 10, 15, 20, 25 and 30 days after inoculation (DAI), were selected arbitrarily for constructing the cDNA libraries. From 6 to 15 DAI, chlorosis symptoms were observed on the leaves and aerial parts of the plants, indicating the severity of Fusarium wilt disease. Furthermore, at each stage, shoot and root sections were cut and examined for fungal penetration into the vascular tissues. Based on microscopic observations at the different stages, initial symptoms of fungal infection were noticed in root vascular tissue at the 15 and 20 DAI stages. Beyond 20 DAI, although the fungus penetrates deeper into the vascular tissues of both the susceptible and, to some extent, the resistant variety, the susceptible variety shows complete Fusarium symptoms whereas the resistant variety prevents most of the attacking fungus from reaching maturity and causing symptoms.

For the SMD study, highly susceptible ('TTB 7') and resistant ('ICP 7035') pigeonpea genotypes that are parents of a mapping population segregating for resistance to SMD were chosen. Forty seeds from each accession were sown in plastic bags filled with sterilized soil and were maintained in a glass house under the conditions described above. Ten days after sowing, leaves infected with the mosaic virus were stapled to the aerial parts of the seedlings. The viral disease is caused by pigeonpea sterility mosaic virus (PPSMV) and transmitted by an eriophyid mite, Aceria cajani Channabasavanna [7]. The disease, which is characterized by a bushy and pale green appearance of plants, slowly spreads from the aerial parts into the vascular tissues through the mite population. Based on the severity of disease symptoms, leaves with visible SMD lesions were harvested at 45 and 60 days after sowing (DAS) for construction of cDNA libraries.

cDNA library construction

Root and leaf tissue samples were collected from the FW- and SMD-responsive genotypes at different time points until the infection reached a stagnant phase. RNA was isolated from these two tissue types according to the protocol described by Schmitt and colleagues [68]. RNA quality was assessed using formamide gel electrophoresis, and poly (A)+ RNA was isolated with the poly (A) tract mRNA isolation system IV (Promega, Madison, WI, USA) as described by the manufacturer. Double-stranded cDNA was constructed using the Super SMART™ PCR cDNA Synthesis kit (Clontech®, Mountain View, CA, USA) following the manufacturer's instructions. The resulting cDNA was size-fractionated on a 1.2% agarose gel, and cDNA fractions containing fragments greater than 500 bp were selected for library construction. Subsequently, the cDNA was ligated into the pGEM® Easy vector (Promega®, Madison, WI, USA), and ligation was allowed to proceed overnight at 14°C. The resulting plasmids were electroporated into One Shot® Top 10 Electrocomp™ cells (Invitrogen, Carlsbad, CA, USA). The transformants were spread on LB agar plates containing 100 mg/ml ampicillin for direct picking. Based on blue/white screening, recombinant clones were picked into Nunc-Immuno™ 96 MicroWell™ Plates (Nunc™, Roskilde, Denmark) containing LB broth with 100 μg/ml ampicillin and grown overnight at 37°C on a rotary shaker at 220 rpm. Glycerol stocks in 96-well format were prepared by combining 38 μl of 60% (v/v) glycerol with 150 μl of culture and were frozen at -80°C.

EST sequencing, editing and assembly

Clones were randomly selected; on average, 500 clones per library were sequenced for the FW-response study and 1000 clones per library for the SMD-response study. Plasmid DNA from these clones (i.e. colonies) was extracted using a 96-well alkaline lysis method prior to sequencing [69]. Plasmid DNA sequencing was performed by a commercial DNA sequencing service provider (Macrogen Inc., Korea) using the standard M13 forward primer.

The FASTA files containing the raw sequences were edited with the software Sequencher™ 4.0 (Gene Codes Corporation, Ann Arbor, MI, USA) to remove the vector sequences. The vector-screened sequences were subjected to EST trimmer [70] to trim poly-A ends and low quality sequences. High quality sequences of >100 bp were selected for further sequence analysis. ESTs were clustered and aligned into contigs and singletons using the CAP3 program [71].
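The trimming and length-filtering step can be illustrated with a minimal Python sketch. It is not the EST trimmer tool used in this study: the function names and the minimum poly-A/poly-T run length of 10 bases are assumptions made for illustration, and quality-based trimming is not reproduced. Only the >100 bp cut-off is taken from the text.

```python
def trim_poly_a(seq, min_run=10):
    """Remove a trailing poly-A run or a leading poly-T run.

    min_run is a hypothetical threshold, not a value from the study.
    """
    s = seq.upper()
    tail = len(s) - len(s.rstrip("A"))   # length of trailing A run
    if tail >= min_run:
        s = s[:len(s) - tail]
    head = len(s) - len(s.lstrip("T"))   # length of leading T run
    if head >= min_run:
        s = s[head:]
    return s


def filter_ests(ests, min_len_exclusive=100):
    """Trim each EST and keep only sequences longer than 100 bp."""
    trimmed = {name: trim_poly_a(seq) for name, seq in ests.items()}
    return {name: seq for name, seq in trimmed.items()
            if len(seq) > min_len_exclusive}


if __name__ == "__main__":
    demo = {"est_long": "ACGT" * 30 + "A" * 15,
            "est_short": "ACGT" * 10 + "A" * 20}
    print(sorted(filter_ests(demo)))
```

In this toy input, the first read keeps 120 bp after its poly-A tail is removed, while the second falls below the length cut-off and is discarded.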

In order to assess the number of unique and overlapping transcripts among the 16 libraries, four datasets were generated: ESTs derived from libraries constructed from the FW-responsive genotypes (UG-I); ESTs derived from libraries constructed from the SMD-responsive genotypes (UG-II); the combined dataset of FW- and SMD-responsive ESTs (UG-III); and the public domain sequences combined with all ESTs generated in this study (UG-IV). In addition to the assembly of these unigene sets, CAP3 analysis was also performed individually on the libraries derived from the FW-resistant, FW-susceptible, SMD-resistant and SMD-susceptible genotypes.

Homology search and functional annotation

The unigene sequences were also characterized for nucleotide homology search against the EST datasets of selected legume species [pigeonpea (Cajanus cajan)-908, chickpea (Cicer arietinum)-7,097, soybean (Glycine max)-880,561, Medicago (Medicago truncatula)-249,625, common bean (Phaseolus vulgaris)-83,448, cowpea (Vigna unguiculata)-183,757 and Lotus (Lotus japonicus)-183,153] and selected model plant species [rice (Oryza sativa)-1,240,613, Arabidopsis (Arabidopsis thaliana)-1,527,298 and poplar (Populus alba)-418,223] available at National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov) using BLASTN algorithm [72]. A match was considered significant at E-value ≤ 1E-05.

Each unigene dataset was subjected to BLASTX analysis against the non-redundant protein database of UniProt to deduce a putative function. Sequence similarity was considered as significant at E-value ≤ 1E-08. Each unigene was assigned a putative cellular function based on the significant database hit with lowest e-value. Subsequently, unigenes that showed a significant BLASTX hit were used for functional annotation based on Gene Ontology categories from UniProt database (UniProt-GO). This process allowed assignment of unigenes to the GO functional categories of biological process, cellular component and molecular function. Distribution of unigenes was further investigated in terms of their assignment to sub-categories of the main GO categories.
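The rule of assigning each unigene the significant hit with the lowest E-value can be sketched as follows. The sketch assumes that the BLASTX results are available as 12-column tab-separated lines (query ID in column 1, subject ID in column 2, E-value in column 11), as written by the tabular output of BLAST+. This input format and the function name `best_hits` are assumptions; the original analysis may have parsed a different report format.

```python
def best_hits(tabular_lines, max_evalue=1e-8):
    """Return {query: (subject, evalue)} keeping the lowest E-value hit.

    Hits with an E-value above max_evalue are ignored, mirroring the
    E-value <= 1E-08 significance threshold stated in the text.
    """
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


if __name__ == "__main__":
    # Hypothetical hits; the UniProt IDs are only used as labels here.
    rows = [
        "contig1\tP23472\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200",
        "contig1\tQ06930\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-12\t90",
        "contig2\tQ8LAX0\t60.0\t50\t20\t0\t1\t50\t1\t50\t1e-05\t40",
    ]
    print(best_hits(rows))
```

Unigenes without any hit at or below the threshold simply do not appear in the result, which corresponds to sequences left without a putative function.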

In silico expression and hierarchical clustering

In order to identify the differentially expressed genes in FW- and SMD-responsive genotypes, 389 contigs from the FW-responsive genotypes and 328 contigs from the SMD-responsive genotypes were analyzed using the IDEG.6 web interface tool [73, 74]. The IDEG.6 web tool allows six different statistical analyses to be run for the detection of differentially expressed genes in multiple tag experiments. For pair-wise comparisons, the Audic and Claverie test, Fisher exact test and chi-square (χ²) test were used, and for multiple comparisons the R statistic test, Greller and Tobin test and chi-square (χ²) test were used [73, 74].
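As an illustration of one of the pair-wise tests named above, the following sketch computes the Audic and Claverie probability of observing y ESTs for a gene in a library of N2 clones, given x ESTs in a library of N1 clones: p(y|x) = (N2/N1)^y (x+y)! / [x! y! (1 + N2/N1)^(x+y+1)]. This is an independent sketch and not the IDEG.6 implementation; the function names and the one-sided tail summary are assumptions.

```python
import math


def audic_claverie_prob(x, y, n1, n2):
    """p(y | x) for EST counts x, y from libraries of n1 and n2 clones."""
    ratio = n2 / n1
    log_p = (y * math.log(ratio)
             + math.lgamma(x + y + 1)      # log (x + y)!
             - math.lgamma(x + 1)          # log x!
             - math.lgamma(y + 1)          # log y!
             - (x + y + 1) * math.log(1 + ratio))
    return math.exp(log_p)


def audic_claverie_tail(x, y, n1, n2):
    """Smaller of the two cumulative tails that include y (an assumed,
    one-sided way of summarising how extreme y is given x)."""
    lower = sum(audic_claverie_prob(x, k, n1, n2) for k in range(y + 1))
    upper = 1.0 - lower + audic_claverie_prob(x, y, n1, n2)
    return min(lower, upper)


if __name__ == "__main__":
    # Hypothetical counts: 1 EST in one library versus 20 in another,
    # both libraries containing 500 clones.
    print(audic_claverie_tail(1, 20, 500, 500))
```

The log-gamma function is used instead of explicit factorials so that the calculation remains stable for contigs with many ESTs.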

Further, gene expression analysis was performed with the Hierarchical Clustering Explorer (HCE) version 2.0 beta software [75] using transcript abundance data from the UG-III set, which includes 697 contigs derived from both sets of stress-responsive libraries. As a pre-requisite for HCE analysis, all 697 contigs were subjected to the R statistic test (R > 8), and only those contigs (71) were selected that had (i) a minimum of 5 ESTs and (ii) differential abundance of ESTs across libraries. The matrix file based on the frequency of ESTs in each of the 71 contigs was used as the input file for the HCE tool.
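The contig selection step can be sketched with the R statistic in its commonly cited log-likelihood-ratio form, R_j = Σ_i x_ij ln(x_ij / (N_i f_j)), where x_ij is the number of ESTs of contig j in library i, N_i is the size of library i and f_j = Σ_i x_ij / Σ_i N_i. The use of the natural logarithm, the function names and the toy counts are assumptions; only the cut-offs (R > 8 and a minimum of 5 ESTs) come from the text.

```python
import math


def r_statistic(counts, library_sizes):
    """Log-likelihood-ratio R statistic for one contig across libraries."""
    total_count = sum(counts)
    if total_count == 0:
        return 0.0
    freq = total_count / sum(library_sizes)   # pooled frequency f_j
    r = 0.0
    for x, n in zip(counts, library_sizes):
        if x > 0:                             # zero counts contribute nothing
            r += x * math.log(x / (n * freq))
    return r


def select_contigs(abundance, library_sizes, min_ests=5, r_cutoff=8.0):
    """Keep contigs with at least min_ests ESTs and R above r_cutoff."""
    return [contig for contig, counts in abundance.items()
            if sum(counts) >= min_ests
            and r_statistic(counts, library_sizes) > r_cutoff]


if __name__ == "__main__":
    sizes = [100, 100]                        # hypothetical library sizes
    table = {"c1": [20, 0], "c2": [5, 5], "c3": [3, 0]}
    print(select_contigs(table, sizes))
```

A contig whose ESTs are spread in proportion to library size gives R close to zero, whereas a contig concentrated in one library gives a large R and passes the filter.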

Identification and development of SSR markers

A total of 5,085 unigenes (unigene set UG-IV), developed from the 9,888 ESTs generated in this study and 908 public domain ESTs, were searched with the Perl script MISA (MIcroSAtellite) [23, 76] for identification and localization of SSRs. Di-, tri-, tetra-, penta- and hexa-nucleotide motifs repeated more than five times were used as the SSR search criteria in the MISA script. The Primer3 programme [77] was used to design primer pairs for the SSRs, and the primers were custom synthesized by MWG (MWG-Biotech AG, Bangalore, India).
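A microsatellite search of this kind can be approximated with regular expressions, as in the sketch below. It is not the MISA script: the function names are hypothetical, and the default of six repeats is one reading of "more than five times". Motifs that are themselves repetitions of a shorter motif (for example "ATAT") are skipped so that each SSR is reported once under its shortest unit.

```python
import re


def is_primitive(unit):
    """True if the motif is not a repetition of a shorter motif."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k)
                   for k in range(1, n))


def find_ssrs(seq, min_repeats=6, unit_sizes=(2, 3, 4, 5, 6)):
    """Return sorted (start_1_based, motif, repeat_count) tuples."""
    seq = seq.upper()
    hits = []
    for size in unit_sizes:
        # One motif of the given size followed by at least
        # (min_repeats - 1) further copies of the same motif.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):
                hits.append((match.start() + 1, unit,
                             len(match.group(0)) // size))
    return sorted(hits)


if __name__ == "__main__":
    print(find_ssrs("GGC" + "AT" * 7 + "CCG"))
```

Different minimum repeat numbers per motif class, such as those mentioned in the Discussion, could be applied by calling the function separately for each unit size.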

The primer pairs for SSRs were tested for their utility as potential genetic markers on 40 elite genotypes of pigeonpea (Additional file 13). PCR amplifications were performed in 5 μl reactions containing 5 ng of genomic DNA, 1× SE-Taq DNA polymerase buffer (including 1.5 mM MgCl2), 2 mM dNTPs, 10 pmol of each primer and 0.1 U Taq DNA polymerase (SibEnzyme, Novosibirsk, Russia) with the following touch down profile; 3 min at 95°C; 5 cycles of 20 sec at 94°C, 20 sec at 60°C minus 1°C/cycle, 30 sec at 72°C; 40 cycles of 20 sec at 94°C, 20 sec at 56°C, 30 sec at 72°C; and 20 min at 72°C for final extension. PCR products were separated on 6% non-denaturing polyacrylamide gels for 3 h at 600 V and visualized by silver staining. The polymorphism information content (PIC) of individual EST-SSR markers was calculated by using the standard formula [62]. Only data from polymorphic SSR loci were used for diversity analysis. Genetic similarities between any two genotypes were estimated according to Nei and Li [78]. All 40 genotypes were clustered with the Unweighted Pair Group Method using arithmetic average (UPGMA) in the SAHN procedure of the NTSYS-PC v2.10t [79].
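The two summary statistics used here can be sketched in a few lines. The sketch assumes the simplified PIC formula, PIC = 1 − Σ p_i², where p_i is the frequency of the i-th allele (the exact formula used in [62] may include an additional correction term), and the Nei and Li similarity 2N_ab / (N_a + N_b), where N_ab is the number of bands shared by two genotypes. Function names and allele labels are hypothetical, and the UPGMA clustering performed in NTSYS-PC is not reproduced.

```python
from collections import Counter


def pic(alleles):
    """Simplified PIC for one locus from a list of observed alleles."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def nei_li_similarity(bands_a, bands_b):
    """Nei and Li similarity between two genotypes' band sets."""
    a, b = set(bands_a), set(bands_b)
    if not a and not b:
        return 0.0  # assumption: similarity is set to 0 when no bands exist
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    # Hypothetical allele calls at one locus for four genotypes.
    print(pic(["150bp", "150bp", "160bp", "160bp"]))
    # Hypothetical band sets (marker_allele labels) for two genotypes.
    g1 = {"m1_150", "m1_160", "m2_200"}
    g2 = {"m1_150", "m2_200", "m2_210"}
    print(nei_li_similarity(g1, g2))
```

A pair-wise matrix of such similarities over all 40 genotypes is the kind of input that a UPGMA procedure would then cluster.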

SNP detection and their conversion into CAPS

All 871 contigs obtained from the collection of 5,085 unigenes (UG-IV) were searched for putative SNPs/indels using an integrated pipeline for large scale SNP discovery [80, 81]. The pipeline uses the CAP3 output files as input to detect SNPs/indels based on nucleotide redundancy in the multiple sequence alignments. The text file generated by the auto SNP pipeline includes the contig ID, number of sequences in the contig, consensus length, number of SNPs, mutation type and SNP frequency. The threshold for identification of SNPs was based on the number of sequences (≥ 5) in each consensus sequence and two or more sequences from different genotypes. In order to verify the SNPs at the sequence level, the PCR amplicons of all four genotypes were sequenced using the corresponding forward and reverse primers for a set of 10 contigs (see Additional file 14). The amplicons were purified and sequenced as described [80]. The sequence data, along with the sequences of the ESTs that provided the SNPs initially, were aligned and analyzed using the BioEdit programme http://www.mbio.ncsu.edu/BioEdit/bioedit.html.
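The redundancy-based idea behind this kind of SNP mining can be sketched as a column-wise scan of an alignment. The sketch below is not the pipeline used in this study. It assumes that the reads of a contig are supplied as equal-length, upper-case aligned strings, that gap and ambiguity characters are ignored (so indels are not reported), and that the minor allele must be supported by at least two reads, which is one interpretation of the threshold described above. Genotype labels are not tracked.

```python
from collections import Counter


def call_snps(aligned_reads, min_depth=5, min_minor=2):
    """Return [(position_1_based, {base: count})] for candidate SNPs."""
    if len(aligned_reads) < min_depth:
        return []  # contig has too few sequences to be considered
    snps = []
    for pos, column in enumerate(zip(*aligned_reads), start=1):
        counts = Counter(base for base in column if base in "ACGT")
        if sum(counts.values()) < min_depth:
            continue  # not enough informative reads at this column
        ranked = counts.most_common(2)
        # Require a second allele supported by at least min_minor reads,
        # so that isolated sequencing errors are not reported.
        if len(ranked) == 2 and ranked[1][1] >= min_minor:
            snps.append((pos, dict(counts)))
    return snps


if __name__ == "__main__":
    reads = ["ACGTA", "ACGTA", "ACGTA", "ACTTA", "ACTTA"]  # toy alignment
    print(call_snps(reads))
```

Requiring redundancy for the minor allele is what distinguishes a putative polymorphism from a single-read sequencing error in this scheme.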

For converting SNPs into cleaved amplified polymorphic sequence (CAPS) markers, the SNPs present in 37 contigs were analyzed to identify recognition sites for any of 725 commercially available restriction enzymes [82] using the integrated SNP2CAPS pipeline [80].
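The principle of CAPS conversion, i.e. finding an enzyme whose recognition site is present with one SNP allele but absent with the other, can be sketched as follows. This is not the SNP2CAPS program: the three-enzyme table stands in for the 725-enzyme list, the function name and example sequences are hypothetical, and only palindromic sites are included so that a single-strand search is sufficient.

```python
# Small illustrative enzyme table (all three sites are palindromic).
ENZYMES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC"}


def caps_candidates(flank5, allele_a, allele_b, flank3, enzymes=ENZYMES):
    """Return [(enzyme, allele_with_more_sites)] for discriminating enzymes."""
    seq_a = (flank5 + allele_a + flank3).upper()
    seq_b = (flank5 + allele_b + flank3).upper()
    result = []
    for name, site in sorted(enzymes.items()):
        n_a, n_b = seq_a.count(site), seq_b.count(site)
        if n_a != n_b:
            # The two alleles give different fragment patterns on digestion.
            result.append((name, allele_a if n_a > n_b else allele_b))
    return result


if __name__ == "__main__":
    # Hypothetical SNP (T/C) with flanking sequence; the T allele
    # completes an EcoRI site (GAATTC).
    print(caps_candidates("TTGGAA", "T", "C", "TCAAGG"))
```

Counting site occurrences rather than testing mere presence allows for amplicons in which the same enzyme also cuts elsewhere in the flanking sequence.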