The first draft of the pigeonpea genome sequence
- Cite this article as:
- Singh, N.K., Gupta, D.K., Jayaswal, P.K. et al. J. Plant Biochem. Biotechnol. (2012) 21: 98. doi:10.1007/s13562-011-0088-8
Pigeonpea (Cajanus cajan) is an important grain legume of the Indian subcontinent, South-East Asia and East Africa. More than eighty-five percent of the world's pigeonpea is produced and consumed in India, where it is a key crop for food and nutritional security. Here we present the first draft of the genome sequence of a popular pigeonpea variety ‘Asha’. The genome was assembled using long sequence reads of 454 GS-FLX sequencing chemistry with mean read lengths of >550 bp and >10-fold genome coverage, resulting in 510,809,477 bp of high quality sequence. A total of 47,004 protein coding genes and 12,511 transposable element-related genes were predicted. We identified 1,213 disease resistance/defense response genes and 152 abiotic stress tolerance genes in the pigeonpea genome that make it a hardy crop. In comparison to soybean, pigeonpea has relatively fewer genes for lipid biosynthesis and more genes for cellulose synthesis. The sequence contigs were arranged into 59,681 scaffolds, which were anchored to the eleven chromosomes of pigeonpea with 347 genic-SNP markers of an intra-species reference genetic map. The eleven pigeonpea chromosomes showed low but significant synteny with the twenty chromosomes of soybean. The genome sequence was used to identify a large number of hypervariable ‘Arhar’ simple sequence repeat (HASSR) markers, 437 of which were experimentally validated for PCR amplification and a high rate of polymorphism among pigeonpea varieties. These markers will be useful for fingerprinting and diversity analysis of pigeonpea germplasm and for molecular breeding applications. This is the first plant genome sequence completed entirely through a network of Indian institutions led by the Indian Council of Agricultural Research, and it provides a valuable resource for pigeonpea variety improvement.
Keywords: Pigeonpea; Genome sequence; Disease resistance; SSR markers; Legumes
HASSR: hypervariable ‘Arhar’ simple sequence repeats
SNP: single nucleotide polymorphism
AKI: Agricultural Knowledge Initiative
Knowledge of the genetic basis of yield, quality and stress tolerance is important for genetic improvement of pigeonpea. Until a couple of years ago pigeonpea was considered an orphan legume crop, but a substantial amount of genomic resources has now been generated, largely owing to the efforts of projects funded by the Indo-US Agricultural Knowledge Initiative (AKI), NSF and GCP (Varshney et al. 2009, 2010a; Dutta et al. 2011; Bohra et al. 2011). Pigeonpea cultivars have a narrow genetic base due to limited breeding efforts and poor utilization of wild pigeonpea species. Availability of the genome sequence will accelerate the utilization of pigeonpea germplasm resources in breeding (Yang et al. 2006; Saxena 2008; Varshney et al. 2010b). Development of molecular markers tightly linked to important agronomic traits is a prerequisite for undertaking molecular breeding in plants. However, the molecular basis of most agronomic traits in pigeonpea remains unexplored because of the low level of DNA polymorphism in the primary gene pool and the limited number of validated molecular markers (Ratnaparkhe et al. 1995; Yang et al. 2006; Odeny et al. 2009; Dutta et al. 2011; Bohra et al. 2011).
The aims of the present study were: (a) to decode the pigeonpea genome using next generation sequencing technologies and analyse its gene and repeat DNA contents; (b) to generate chromosome-specific sequence by anchoring the sequence scaffolds to a high density reference molecular linkage map and to compare it with the soybean genome; and (c) to develop SSR markers for gene discovery and molecular breeding applications. Pigeonpea variety ‘Asha’, selected for this purpose, is a popular variety with one of the highest breeder seed indents in India and is resistant to two common diseases of pigeonpea, namely Fusarium wilt and sterility mosaic disease.
Materials and methods
Pigeonpea variety ‘Asha’ (ICPL87119) was used for genome sequencing and validation of newly designed HASSR markers. To identify informative HASSR markers, a set of 8 genotypes namely Asha, UPAS 120, HDM 04–1, Pusa Dwarf, H2004-1, Bahar, Maruti and TTB7 was screened for marker polymorphism. The seeds were obtained originally from IARI, New Delhi, ICRISAT Hyderabad, IIPR Kanpur and CCSHAU Hisar.
Genome sequence assembly and submission to NCBI GenBank
High quality genomic DNA was isolated from the leaves of a single plant of variety ‘Asha’ using the CTAB method (Murray and Thompson 1980). Sequencing of 19 plates of whole genome shotgun libraries of short DNA fragments was carried out using GS-FLX Phase D chemistry, and 3 plates of paired end sequences from a library of 20 kb long fragments of pigeonpea genomic DNA were sequenced using GS-FLX Titanium chemistry (Margulies et al. 2005). Filtered high quality sequence reads were assembled using “Newbler GS De Novo assembler version 2.5.3” (Roche Inc. Germany) with: Overlap minimum match length = 25 bp, Large genome = True, Number of CPU used = 0 (all), Exclude contigs of <500 bp. The GS Assembler is designed to compare all sequence reads in a pairwise fashion. Reads that overlap one another are joined into contigs. The consensus sequence for a contig is computed by taking an average of all aligned reads at a specific nucleotide position. The paired end reads were used for making scaffolds of sequence contigs. The large sequence contigs were quality checked, and contaminating sequences were identified and removed. The quality-checked FASTA files containing 510,809,477 bp of pigeonpea genome sequence were further processed using command line software of NCBI to generate a .sqn file (http://www.ncbi.nlm.nih.gov/HTGS/tbl2asninfo.html), which was submitted to GenBank as draft genome version 1 using the Genomes Macro Send direct submission tool.
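The consensus step described above can be illustrated with a minimal Python sketch. It is not the Newbler algorithm, whose internals are not described here; it only assumes that reads have already been aligned and padded to a common length with a hypothetical gap character, and it swaps the "average" for a simple per-column majority vote.

```python
from collections import Counter

def column_consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over reads already padded to equal length.

    Assumption: `gap` marks positions a read does not cover; columns that
    no read covers are skipped. This is an illustration, not Newbler.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        bases = [base for base in column if base != gap]
        if not bases:
            continue  # no read covers this column
        # Take the most frequent base among the reads covering this column.
        consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)

# Three overlapping toy reads; the third carries one discordant base.
reads = [
    "ACGTAC---",
    "-CGTACGT-",
    "--GAACGTA",
]
print(column_consensus(reads))
```

In this toy alignment the discordant base in the third read is outvoted by the two reads that agree at that position.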
The whole genome large sequence contigs were passed through the FGENESH tool of MOLQUEST software (www.softberry.com) using Arabidopsis thaliana gene models as reference. Of all predicted genes, only those >500 bp in size were taken for further analysis. The genes were BLAST searched against the NCBI non-redundant database using optimized search parameters of gap opening penalty (G) = 4, gap extension penalty (E) = 1, mismatch score (q) = −1, match score (r) = 1, word size (W) = 11 and e-value <e−20 (Singh et al. 2004). Low complexity regions were included in the search. The BLAST search output was processed using BLAST Parser software (http://geneproject.altervista.org/). All the hits having bit scores of >100 and e values of <e−20 were tabulated in Microsoft Excel. Gene annotations were manually curated and categorized based on their functions. Details of the pigeonpea transcriptome assemblies used for validation of predicted gene models have been described earlier (Dutta et al. 2011). The predicted genes were manually curated with different keywords/phrases using auto filters to find R-like and defense response genes and categorize them into five main classes (Hulbert et al. 2001; The Rice Chromosomes 11 and 12 sequencing consortia 2005): (a) NBS-LRR (matching with NBS-LRR, but not with LZ-NBS-LRR and LRR, CC-NBS-LRR, Rp 1-d8, Lr10, Mla 1 and rust resistance), (b) LZ-NBS-LRR (matching with LZ-NBS-LRR, but not with NBS-LRR, CC-NBS-LRR, LRR and RPM1), (c) LRR-TM (matching with serine/threonine kinases and Cf2/Cf5 resistance), (d) miscellaneous category (matching with disease resistance, viral resistance, LRR, but not with NBS-LRR, CC-NBS-LRR, LZ-NBS-LRR), (e) defense response genes (matching with glucanases, chitinases and thaumatin like proteins). Similarly, genes for abiotic stress tolerance, lipid metabolism, sugar and starch biosynthesis, cellulose synthesis and transcription factors were also identified and categorized.
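The keyword-based categorization above was done manually with spreadsheet filters. As a rough illustration only, the following Python sketch applies a simplified, ordered reading of those rules to annotation strings; the function name and the exact precedence of the checks are assumptions, and the finer exclusions listed in the text (for example Rp1-d8, Lr10, Mla1, RPM1) are not modelled.

```python
def classify_resistance_gene(description):
    """Assign a BLAST hit description to one of five classes, or None.

    Simplified assumption: rules are tested in a fixed order so that the
    most specific keyword wins. The original curation was manual.
    """
    text = description.lower()
    if "lz-nbs-lrr" in text:
        return "LZ-NBS-LRR"
    if "nbs-lrr" in text:
        return "NBS-LRR"
    if "serine/threonine kinase" in text or "cf2" in text or "cf5" in text:
        return "LRR-TM"
    if any(key in text for key in ("glucanase", "chitinase", "thaumatin")):
        return "defense response"
    if any(key in text for key in ("disease resistance", "viral resistance", "lrr")):
        return "miscellaneous"
    return None

# Hypothetical annotation strings, used only to exercise the rules.
examples = [
    "putative NBS-LRR type disease resistance protein",
    "LZ-NBS-LRR class RGA",
    "serine/threonine kinase-like protein",
    "class I chitinase",
    "disease resistance protein RPM1",
    "cellulose synthase",
]
for annotation in examples:
    print(annotation, "->", classify_resistance_gene(annotation))
```

Testing the most specific pattern first is one way to reproduce the "matching with X, but not with Y" logic without listing every exclusion explicitly.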
Annotation of transposable elements and repeats
Both de novo and homology based approaches were used for the identification of repeats in the large sequence contigs of the pigeonpea genome. We used the Repeat Modeler software pipeline for the construction of a repeat library using RECON and Repeat Scout software (Benson 1999; Bao and Eddy 2002). Repeat Masker software was used for annotation using RMBLAST as the search engine (Wootton and Federhen 1993; Lander et al. 2001; Waterston et al. 2002). The same strategy was used for the identification of repeats in genetically anchored scaffolds. We developed and added two Perl scripts (Split masker, Masked table) to Repeat Masker to break the large data set into individual files and run the complete data set in one go. The Masked table script calculated the percentage of masked elements in each scaffold and exported it to Microsoft Excel.
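The Perl scripts themselves are not reproduced here. As a hedged sketch of what a per-scaffold "masked table" might compute, the Python below reads FASTA records and reports the percentage of masked bases in each. It assumes hard-masked output in which masked positions appear as N or X; the function names are hypothetical, and scaffold gaps written as N would also be counted under this assumption.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier before the first whitespace.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def percent_masked(sequence, mask_chars="NnXx"):
    """Percentage of positions that carry a masking character."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base in mask_chars)
    return 100.0 * masked / len(sequence)

# Toy masked FASTA content standing in for a real masked scaffold file.
fasta = [">s1 example scaffold", "ACGTNNNN", "NNAC", ">s2", "ACGT"]
table = {name: percent_masked(seq) for name, seq in parse_fasta(fasta)}
print(table)
```

The resulting dictionary could be written out as tab-separated text for import into a spreadsheet.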
For analysis of ribosomal RNA genes we downloaded all plant rDNA data from NCBI and used BLASTN search to find 28S, 18S and 5.8S rRNA genes in the pigeonpea genome. 5S rRNA genes were searched using a pigeonpea sequence obtained by cloning of the Cot1 repeat fraction. tRNAscan software (Schattner et al. 2005; Lowe and Eddy 1997) was used for prediction of transfer RNA genes. The miRNA genes were identified using BLASTN search (e >1*10−5, top hits) of sequences present in the miRNA database, allowing no more than three mismatches (miRBase release 17.0, Griffiths-Jones 2004; Griffiths-Jones et al. 2006, 2008; Kozomara and Griffiths-Jones 2011). The Rfam database (version 10.1, May 2011, Gardner et al. 2010; Griffiths-Jones et al. 2005) was used for identification of ribosomal, small nuclear and small nucleolar RNA genes. For snRNA only those families having 100% identity and e values of <0.001 were selected, whereas for snoRNA those having 80% identity and e values of <0.001 were selected.
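The "no more than three mismatches" criterion for miRNA matches can be illustrated with a naive scan. This sketch replaces BLASTN with an ungapped sliding-window comparison on the forward strand only, so it is a simplification rather than the search actually used; the query below is an illustrative 20-nt sequence and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_mirna_sites(contig, mirna, max_mismatches=3):
    """Return (start, mismatches) for every ungapped window of the contig
    that differs from the miRNA at no more than `max_mismatches` positions.

    Assumptions: forward strand only, no gaps, RNA 'U' treated as DNA 'T'.
    """
    query = mirna.upper().replace("U", "T")
    target = contig.upper()
    hits = []
    for start in range(len(target) - len(query) + 1):
        window = target[start:start + len(query)]
        mismatches = hamming(window, query)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

mirna = "UGACAGAAGAGAGUGAGCAC"                # illustrative 20-nt query
contig = "CCCTGACAGAAGAGAGTGAGCACGGG"         # toy contig with an exact site
print(find_mirna_sites(contig, mirna))
```

A real search would also need to scan the reverse complement and would normally rely on an indexed aligner rather than a linear scan.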
Anchoring of sequence scaffolds to pigeonpea chromosomes
The sequence scaffolds were anchored to a high density linkage map of genic-SNP markers of an intra-species reference mapping population derived from Asha/UPAS120. The linkage map was based on two Illumina multiplex SNP assays of 1536-plex and 768-plex SNPs identified by comparing deep coverage transcriptome assemblies of the parental lines Asha and UPAS 120 (Dutta et al. 2011 and our unpublished results). The 59,681 sequence scaffolds assembled from the 454 GS-FLX sequence data were used to create a local database. A total of 366 genic-SNP marker sequences genetically mapped to the eleven pigeonpea chromosomes were BLASTN searched against this database at a cutoff bit score of ≥100 and e-value of <e−20. Gene density per 50 kb of anchored scaffolds was plotted for each chromosome at the respective genetic map positions (cM) using Microsoft Excel. Anchored scaffolds were also scanned for the identification and annotation of repeat elements (RE) using Repeat Modeler and Repeat Masker software, respectively. The percentage of RE in each scaffold was plotted against the gene density. The TE related genes in the scaffolds were identified using BLAST search in the NCBI-NR database.
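The anchoring cutoffs and the gene density measure described above can be sketched as follows. The sketch assumes BLAST results in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score); the original workflow used BLAST Parser and Microsoft Excel, and the function names and example rows here are hypothetical.

```python
def filter_hits(tabular_lines, min_bitscore=100.0, max_evalue=1e-20):
    """Keep hits with bit score >= min_bitscore and e-value < max_evalue.

    Assumption: tab-separated 12-column BLAST output, with the e-value in
    column 11 and the bit score in column 12.
    """
    kept = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        evalue, bitscore = float(fields[10]), float(fields[11])
        if bitscore >= min_bitscore and evalue < max_evalue:
            kept.append((fields[0], fields[1], evalue, bitscore))
    return kept

def genes_per_50kb(n_genes, scaffold_bp):
    """Gene density normalised to a 50 kb window."""
    return n_genes / (scaffold_bp / 50_000)

# Made-up marker-to-scaffold hits: the first passes both cutoffs, the second fails.
rows = [
    "snp1\tscaffold12\t99.5\t200\t1\t0\t1\t200\t500\t699\t1e-90\t350",
    "snp2\tscaffold7\t85.0\t80\t12\t0\t1\t80\t10\t89\t1e-10\t60",
]
print(filter_hits(rows))
print(genes_per_50kb(12, 150_000))
```

Normalising to a fixed window makes scaffolds of different lengths comparable when density is plotted against genetic map position.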
Comparison between pigeonpea and soybean genomes
A total of 42,094 non-TE related genes were predicted from the pseudomolecules of the twenty chromosomes of soybean (Glycine max) using the same approach as described above, and a local database was created. Genes in the anchored scaffolds of pigeonpea were searched against this database using BLASTN with optimized search parameters (Singh et al. 2004). The output was parsed using BLAST Parser software (http://geneproject.altervista.org/) and tabulated in Microsoft Excel. Chromosomal positions of both pigeonpea and soybean genes were retained in the gene headers for analysis of synteny. The numbers of hits with bit scores ≥100 for each of the eleven pigeonpea chromosomes were counted in soybean and tabulated using Microsoft Excel. A similar comparison was made using single copy pigeonpea genes against the soybean chromosomes, and a circular synteny map was plotted according to Krzywinski et al. (2009). To identify single copy genes, a local database of all the predicted pigeonpea genes was created using the ‘formatdb’ script of the NCBI local BLAST (Altschul et al. 1990). All genes in the database were searched against themselves to find their copy numbers in the genome.
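The self-against-self search for single copy genes can be sketched as below. The sketch assumes that hits have already been filtered by the chosen cutoff and reduced to (query, subject) pairs, and that a gene counts as single copy when it has no significant hit other than itself; the exact copy-number criterion used in the study is not stated, so this rule is an assumption.

```python
from collections import defaultdict

def single_copy_genes(all_genes, hits):
    """Return genes whose only significant hit is themselves.

    `hits` is an iterable of (query, subject) pairs that already passed
    the score cutoff. Hits are treated as symmetric (an assumption).
    """
    partners = defaultdict(set)
    for query, subject in hits:
        if query != subject:
            partners[query].add(subject)
            partners[subject].add(query)
    return sorted(gene for gene in all_genes if not partners[gene])

# Toy gene set: g2 and g3 hit each other, g1 and g4 hit only themselves.
genes = ["g1", "g2", "g3", "g4"]
hits = [("g1", "g1"), ("g2", "g2"), ("g2", "g3"), ("g4", "g4")]
print(single_copy_genes(genes, hits))
```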
In silico mining, primer design and validation of genomic-SSR markers
All assembled contigs were screened for the presence of SSRs using MISA software (http://pgrc.ipk-gatersleban.de/misa). MISA created two types of files, namely 454AllContigs.fna.misa and 454AllContigs.fna.misa.statistics. MISA files were transferred to Microsoft Excel, where SSRs were classified into mono-, di-, tri-, tetra-, penta- and hexa-nucleotide and compound repeats. The minimum repeat number was set at 10 for mono-, 6 for di-, and 5 for tri-, tetra-, penta- and hexa-nucleotides. Compound SSRs were defined as those loci having ≥2 SSRs interrupted by ≤100 bp of non-repetitive sequence. Class I SSRs with repeat lengths of ≥20 bp and hypervariable SSRs with repeat lengths of ≥50 bp were extracted according to Temnykh et al. (2001) and Singh et al. (2010), respectively. The markers were named HASSR1-HASSR437, using the prefix H for hypervariable and A for “Arhar” (pigeonpea) followed by the SSR identification number, on the same pattern as described earlier for pigeonpea genic-ASSR markers (Dutta et al. 2011). Primer pairs flanking the repeats were designed using Primer3 software (http://frodo.wi.mit.edu/). The target amplicon size was set to 100–260 bp, annealing temperature to 60°C, primer length to 20 bp and GC content to 50%. The primers were BLAST searched against the whole genome sequence to identify those with unique binding sites. For marker validation, genomic DNA of eight pigeonpea genotypes was adjusted to a final concentration of 25 ng/μl. A total of 437 genomic HASSR loci were first tested for PCR amplification using genomic DNA from Asha in a PTC225 Gradient Cycler (Bio-Rad). PCR was carried out in a 15 μl reaction volume containing 1.5 μl of 10× reaction buffer, 0.20 μl of 10 mM dNTPs (133 μM), 1.5 μl each of forward and reverse primers (10 pmol), 2.5 μl (62.5 ng) of template genomic DNA and 0.15 μl (0.75 U) of Taq DNA polymerase (Vivantis Technologies).
The PCR cycling profile was: initial denaturation at 94°C for 5 min, followed by 35 cycles of 94°C for 1 min, 55°C for 1 min and 72°C for 1 min, and a final extension at 72°C for 7 min. Primers that did not amplify under these conditions were re-screened by sequentially decreasing the annealing temperature by 1°C, and primers producing multiple bands were re-screened by sequentially increasing the annealing temperature by 1°C. The optimized SSR markers were then used for genotyping of eight varieties to check the level of polymorphism. PCR products were separated by electrophoresis in 4% Metaphor agarose gels (Lonza, Rockland USA) containing 0.1 μg/ml ethidium bromide in 1× TBE buffer at 130 V for 4 h, then visualized and photographed in a FluorChem™ 5500 gel documentation system (Alpha Innotech Corp., USA).
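The repeat thresholds and length classes used for SSR mining can be illustrated with a small scanner. This is a simplified stand-in for MISA, not its implementation: it finds perfect repeats only, does not detect compound SSRs, and the function names are hypothetical. The thresholds (10 for mono-, 6 for di-, 5 for tri- to hexa-nucleotides; class I at ≥20 bp, HASSR at ≥50 bp) are those stated in the text.

```python
# Minimum repeat number per motif length, as given in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

def is_redundant(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))

def classify(length_bp):
    """Length classes: HASSR (>= 50 bp), class I (>= 20 bp), otherwise 'other'."""
    if length_bp >= 50:
        return "HASSR"
    if length_bp >= 20:
        return "class I"
    return "other"

def find_ssrs(sequence):
    """Return sorted (start, motif, repeats, class) tuples for perfect SSRs."""
    seq = sequence.upper()
    found = []
    for unit, min_rep in MIN_REPEATS.items():
        i = 0
        while i + unit <= len(seq):
            motif = seq[i:i + unit]
            if set(motif) - set("ACGT") or is_redundant(motif):
                i += 1
                continue
            reps = 1
            # Extend while the next unit-sized block repeats the motif.
            while seq.startswith(motif, i + reps * unit):
                reps += 1
            if reps >= min_rep:
                found.append((i, motif, reps, classify(reps * unit)))
                i += reps * unit
            else:
                i += 1
    return sorted(found)

toy = "GC" + "AT" * 12 + "GGC" + "A" * 10 + "C"
print(find_ssrs(toy))
print(find_ssrs("AG" * 25))
```

Skipping motifs that are repeats of a shorter unit prevents a dinucleotide run from being reported again as a tetra- or hexa-nucleotide SSR.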
Results and discussion
Pigeonpea genome assembly
Gene content of the pigeonpea genome
Table 1 Summary of gene prediction statistics in the genome sequence of pigeonpea variety ‘Asha’
Parameters: size of the assembled genome sequence (bp); number of large sequence contigs; number of protein coding genes; number of TE-related genes; largest, smallest and average gene size (bp); total number of exons; largest and average exon size (bp); maximum number of exons in a gene; total number of introns; largest and average intron size (bp)
Table 2 Frequency of some major categories of genes in the pigeonpea genome in comparison to soybean genome
Columns: gene category; no. of genes in pigeonpea; no. of genes in soybean; detailed supplementary material
Rows: disease resistance and defense response; abiotic stress tolerance; sugars and starch synthesis
Schmutz et al. (2010) identified 1,127 putative acyl lipid metabolism genes in the oilseed crop soybean. Our analysis of lipid metabolism genes identified only 269 such genes in the pigeonpea genome, compared with 536 in soybean (Table 2, Supplementary Table S3). Apart from the seed storage lipids, these genes are involved in the metabolism of membrane lipids and in various kinds of lipo-protein, glyco-lipid and mineral-lipid interactions. In contrast, the pigeonpea genome has 43 cellulose synthase genes compared with only 37 in the soybean genome, which may be important for its woody plant architecture (Supplementary Table S4). The pigeonpea genome has 108 genes for the synthesis of various kinds of sugars, sugar transporters and starches, including granule bound starch synthase, soluble starch synthase, and starch branching and debranching enzymes. These have important implications for grain yield and biomass accumulation (Supplementary Table S5). We identified 1,470 genes for different transcription factors and regulatory proteins in the pigeonpea genome (Table 2, Supplementary Table S6). These transcription factors play pivotal roles in the developmental regulation of gene expression and in the response of plants to various biotic and abiotic stresses. The most predominant transcription factors in the pigeonpea genome were AP2 domain-containing proteins, NAC domain-containing proteins, WRKY transcription factors, zinc finger proteins and MYB transcription factors.
Repeat elements in the pigeonpea genome
Table 3 Different types of repeat elements in the 511 Mb of pigeonpea genome sequence
Columns: type of repeat element; number of elements; sequence length (bp); percent of repeats
Rows: 1. Interspersed repeats; 1.1 Class I (retrotransposons); 1.2 Class II (DNA transposons); 2. Simple repeats; 3. Low complexity repeats
Table 4 Major classes of repeat elements (RE) in the pigeonpea genome in comparison to ten other sequenced plant genomes
Entries: genome sequence (Mb); RE in genome (%); 1. Interspersed repeats; 1.1 Class I (retrotransposons); 1.2 Class II (DNA transposons), including the hAT superfamily; 2. Simple repeats; 3. Low complexity repeats
DNA transposons constituted 2.99% of the pigeonpea genome, which is higher than apple (1.31%) but much lower than rice (37.25%), soybean (26.83%) and Brachypodium (16.98%). However, these proportions might be revised upwards after all the pigeonpea REs are classified. We identified 6,572 copies of hAT-Ac-like families, which had the highest frequency among the DNA transposons, followed by En-Spm (5,166 copies) and TcMar-Pogo (153 copies). Helitrons constituted only ~0.03% of the total RE in pigeonpea, while sorghum showed the highest percentage of 1.3% (Table 4). The unclassified RE sequences had the highest copy number of 623,425, covering 216 Mb of the available genome sequence. The interspersed repeats constituted 303 Mb (92.78%) of all RE in the pigeonpea genome, which was similar to soybean (95.53%). In contrast to the interspersed transposable elements, simple repeats and low complexity repeats contributed only 2.57% and 4.63% of the pigeonpea genome, respectively. These values were higher than the 0.75% and 1.77% reported for the soybean genome (Schmutz et al. 2010).
Non-coding RNA genes in the pigeonpea genome
Genomes of higher plants contain thousands of copies of genes for non-coding RNA, including rRNA, tRNA, miRNA, snRNA and snoRNA, which play an important role in the cellular protein synthesis machinery and in the regulation of expression of protein coding genes. In the pigeonpea genome we identified 35 copies of 28S rRNA genes, 66 copies of 18S rRNA (largest match of 2,346 bp in contig number 7,811) and 77 copies of 5.8S rRNA (largest match of 2,166 bp in contig number 77,111). We identified 270 copies of 5S rRNA genes using pigeonpea specific rDNA probes. We expect more copies of rRNA genes in the finished genome. The tRNAscan-SE software identified 671 tRNA genes. Of these, twenty were pseudogenes and two had undetermined anticodon isotypes. The remaining 649 tRNA genes have 50 different anticodons, representing all twenty amino acids (Supplementary Table S7AB). The largest numbers of genes were for leucine tRNAs (49), followed by serine (47), arginine (45) and glycine (45). Thirty-six of the pigeonpea tRNA genes contain introns.
Anchoring of pigeonpea sequence scaffolds to genetic map
Table 5 Gene and repeat densities in the pigeonpea genome scaffolds anchored with 347 genetically mapped genic-SNP markers
Columns: no. of markers; size of scaffolds (bp); no. of genes; no. of genes per 50 kb; size of repeats (bp); percent repeats in scaffolds
Comparative analysis of pigeonpea and soybean genomes
Pigeonpea and soybean belong to the same clade Millettieae of the plant family Fabaceae (Wojciechowski 2003). Both are important crop plants but have quite different plant architecture and seed composition. Pigeonpea is a shrub grown as an annual crop that has high seed protein and starch contents but minimal oil content. Soybean, on the other hand, is an annual herb with seeds rich in oil and protein but low in carbohydrates. We were therefore interested in the differences in genome organization and gene content between the two species. The 47,004 protein coding genes of pigeonpea were compared with 42,094 protein coding genes of soybean using BLAST search with default parameters. A total of 31,937 (67.94%) of the pigeonpea genes showed matches with soybean genes at a cutoff bit score of 100, whereas 9,067 genes were unique to pigeonpea. Similarly, out of 42,094 genes predicted in soybean, 40,392 showed significant matches with pigeonpea genes, whereas 1,702 genes were unique to soybean. This shows that pigeonpea has a substantially higher number of unique genes that differentiate it from soybean.
Development and validation of hyper variable HASSR markers for pigeonpea
Table 6 Frequency of SSRs in the 511 Mb of pigeonpea genome sequence
Columns: type of SSR; total no. of SSRs; class I SSR (n ≥20 bp); HASSR (n ≥50 bp)
A search for class I SSRs (n ≥20 bp, Temnykh et al. 2001) and hypervariable HASSRs (n ≥50 bp, Singh et al. 2010) revealed that class I SSRs are most prevalent in the di-nucleotide category, whereas HASSRs are most abundant in the compound SSR category (Table 6). Based on the SSR length criteria, 46,501 loci were classified as class I SSRs and 11,711 of these were HASSRs. All the SSR loci belonging to the tetra-, penta-, hexa- and compound categories were class I SSRs, while more than half (10,891) of the compound SSRs were of the HASSR type. In contrast, mononucleotide repeats never reached a size of more than 50 bp; however, this could be partly due to the limitation of the 454 sequencing technology in dealing with long homopolymers. Because of their higher polymorphism, longer SSR loci are more useful for routine genetic diversity analysis, fingerprinting, QTL mapping and molecular breeding applications in laboratories that lack sophisticated fragment analysis and SNP genotyping platforms but have simple agarose gel electrophoresis facilities (Singh et al. 2010).
Table 7 Wet lab validation of the PCR amplification and polymorphism of 437 HASSR markers designed from pigeonpea genome sequence information
Columns: no. of loci; unexpected size bands; % polymorphism
The work presented here is the first draft of the whole genome sequence of pigeonpea and the first report of a plant genome sequenced entirely in India. The number of protein coding genes predicted in the pigeonpea genome (47,004) is similar to that in soybean, potato and tomato, but significantly higher than in Arabidopsis and rice. Of the predicted genes, 99.9% were supported by RNA expression data, suggesting that these are true genes. A small proportion of genome scaffolds were genetically anchored with 347 mapped SNP markers, which provide nucleation points for further finishing of the genome into large pseudomolecules of the eleven chromosomes. A comprehensive set of 46,501 class I SSRs and 11,711 hypervariable HASSR loci was identified, and 437 HASSR markers were experimentally validated for amplification and a higher rate of polymorphism. HASSR markers have high potential utility in genetic diversity analysis, fingerprinting and molecular breeding for efficient utilization of pigeonpea germplasm resources in breeding improved varieties. The network partners under the Indo-US AKI have already developed an EMS-mutagenized population and more than 24 recombinant inbred line populations for mapping of important agronomic traits, including Fusarium wilt, sterility mosaic disease, flooding tolerance, seed size and number, plant type, drought tolerance and Dal (milling) quality of pigeonpea.
We are grateful to the Indian Council of Agricultural Research (ICAR), New Delhi for financial support through the Indo-US Agricultural Knowledge Initiative (AKI) and Network Project on Transgenics in Crops (NPTC) projects. SD is grateful to the Council of Scientific and Industrial Research, Government of India for financial support (Grant no. 09/083/(0342)/2011/EMR-I).