Data description

A locally funded genomic sequencing project provided the first phase of genome sequencing of the Puerto Rican parrot (Amazona vittata) (see Developing of the Local Community Involvement in Additional file 1). DNA was purified from a blood sample of a female A. vittata (see Additional file 2: Table S1), and sequencing began with the construction of two genome libraries: a short fragment library (~300 bp inserts), which accounted for the majority of the sequencing, and a long fragment library (~2.5 kb inserts), which was used to generate scaffolds. Raw Illumina HiSeq reads were processed and filtered using the Genome Analyzer Pipeline software (with default parameters, as per the manufacturer’s instructions). Of the 309,060,168 paired-end reads and the 180,079,956 mate-pair reads, 86.48% and 85.14%, respectively, passed QC, under the condition that if one read from a pair failed QC, the entire pair was filtered out. Based on the total number of base pairs generated (see Additional file 3: Table S2) and the predicted genome size of 1.58 Gb [1], we calculated a total genome coverage of 26.89x depth, with 17.08x coverage from short fragment reads and 9.8x from mate pairs (Table 1 and Additional file 3: Table S2) (see Sample Collection and Genome Sequencing in Additional file 1).
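The pair-level QC rule and the depth calculation described above can be illustrated with a minimal Python sketch. This is not the Genome Analyzer Pipeline itself: the function names and the toy QC flags are hypothetical, and the only value taken from the text is the predicted genome size of 1.58 Gb.

```python
from typing import Iterable, List, Tuple

GENOME_SIZE_BP = 1.58e9  # predicted genome size quoted in the text [1]

# Each pair is (read 1 passed QC, read 2 passed QC); a hypothetical representation.
Pair = Tuple[bool, bool]


def filter_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep a pair only if both of its reads passed QC."""
    return [(r1, r2) for r1, r2 in pairs if r1 and r2]


def depth(total_bases: float, genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


# Toy input: two of the four pairs contain at least one failed read.
toy_pairs = [(True, True), (True, False), (False, False), (True, True)]
kept = filter_pairs(toy_pairs)
pass_rate = len(kept) / len(toy_pairs)  # fraction of pairs that survive QC
```

Under this definition, the depths of the two libraries are additive, because each is a base count divided by the same genome size.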

Table 1 Average coverage of the Puerto Rican parrot genome in the current study based on the predicted genome size of 1.58 Gb [1]

We carried out two separate de novo assemblies, using Ray [2] (Table 2) and SOAPdenovo [3] (Additional file 4: Table S3), and selected the Ray assembly for all further analyses. Our genome coverage was approximately 76%, which may be a slight overestimate, given that some of the scaffolds may overlap but could not be properly assembled (see Assembly in Additional file 1). We evaluated the assembly by comparing it against the entire collection of transcripts listed for G. gallus in the NCBI Entrez Gene database using local BLAST [4], and found that >70% of the chicken transcripts were present and that as many as 11% of scaffolds shared similarity with at least one G. gallus sequence, at an average density of 1.39 genes/kbp (Table 3; Additional file 5: Figure S1).

Table 2 Results of the genome assembly by Ray [2]
Table 3 Annotation summary

RepeatMasker software (http://www.repeatmasker.org) was used to search the scaffolds for known repeat classes; known repeats were found on 59% of the scaffolds (see Annotation in Additional file 1). In addition, we used manual annotation, both by annotating scaffolds for genes and repeat elements and by annotating known genes, to validate the high-throughput annotation, and we designed and carried out a student development program around this work (see Genome Annotation and Education in Additional file 1).

Comparative analyses of the A. vittata scaffolds against the chicken (Gallus gallus) [5] and zebra finch (Taeniopygia guttata) [6] genomes using local BLAST [4] yielded alignments totaling 93.4 Mbp to the chicken genome, with 82.7% identity on average (average bit score 577.3), and alignments totaling 41.7 Mbp to the zebra finch genome, with 84.5% identity on average (average bit score 431.1).
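Summary statistics of this kind (total alignment length, average identity, average bit score) can be computed from BLAST hits as in the sketch below. It assumes the hits were exported in BLAST's standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that the averages are unweighted means over hits; the text specifies neither, and the function name is hypothetical.

```python
import csv
import io
from typing import Dict


def summarize_blast_tabular(tabular_text: str) -> Dict[str, float]:
    """Summarize hits given as tab-separated text in the 12-column layout."""
    total_length = 0
    identities = []
    bit_scores = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        identities.append(float(row[2]))   # pident: percent identity
        total_length += int(row[3])        # length: alignment length in bp
        bit_scores.append(float(row[11]))  # bitscore
    n = len(identities)
    if n == 0:
        return {"total_length_bp": 0, "mean_identity": 0.0, "mean_bit_score": 0.0}
    return {
        "total_length_bp": total_length,
        "mean_identity": sum(identities) / n,    # unweighted mean (assumption)
        "mean_bit_score": sum(bit_scores) / n,
    }
```

A length-weighted mean identity would be an equally plausible choice and would give more influence to long alignments.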

The top BLAST alignments were sorted by the average of their locations, and their frequencies were calculated in 1 Mbp bins and plotted along all of the chromosomes of both the G. gallus and T. guttata genomes using Circos [7] (Figure 1). Coverage of the chicken genome was higher on average (109 scaffolds per Mbp vs. 72 in zebra finch), and the chicken genome also had more locations with higher coverage. As many as 57% of the scaffolds could be partially aligned to one or both of the genomes: 21.7% aligned only to G. gallus, 10.6% aligned exclusively to T. guttata, and 25% aligned to both genomes (Figure 2). These data are presented and summarized for chicken in Additional file 6: Table S4.A, for zebra finch in Additional file 7: Table S4.B, and in full in Additional file 8: Table S4.C.
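The binning step can be sketched in Python as follows. This is an illustration only: it assumes that "the average of their locations" means the midpoint of each alignment's start and end coordinates on the reference chromosome, and the function name and input layout are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, Tuple

BIN_SIZE_BP = 1_000_000  # 1 Mbp bins, as described in the text

Hit = Tuple[str, int, int]  # (reference chromosome, start, end)


def bin_hits(hits: Iterable[Hit], bin_size: int = BIN_SIZE_BP) -> DefaultDict[str, Counter]:
    """Count alignments per chromosome and per bin, using each hit's midpoint."""
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for chrom, start, end in hits:
        midpoint = (start + end) / 2  # assumed meaning of "average location"
        counts[chrom][int(midpoint // bin_size)] += 1
    return counts


# Toy hits; the second one straddles the boundary between bins 0 and 1.
toy_hits = [
    ("chr1", 100, 900),
    ("chr1", 999_900, 1_000_300),
    ("chr2", 2_500_000, 2_500_500),
]
density = bin_hits(toy_hits)
```

Assigning each alignment to a single bin by its midpoint avoids double counting hits that cross a bin boundary.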

Figure 1

Density of the A. vittata scaffolds that shared similarity with fragments of chicken and zebra finch genomes. (Top) Chicken (G. gallus) genome (per Mbp) and (Bottom) zebra finch (T. guttata) genome (per Mbp). Different chromosomes are represented by different colors as shown in the legend on the right. Chromosomal locations, lengths and quality of alignments to the two genomes by BLAST are presented in Additional file 6: Table S4.

Figure 2

Proportion of sequences with some similarity across the two avian genomes (G. gallus and T. guttata). A. vittata scaffolds are classified into five categories: (A) unmapped – those for which no similar sequence was found; (B) chicken only – those that shared similarity only with a fragment of the G. gallus genome; (C) finch only – those that shared similarity only with a fragment of the T. guttata genome; (D) mismatched – those that shared similarity with sequences of both the G. gallus and T. guttata genomes but mapped to different chromosomes in the two species; (E) matched – those that mapped to the same chromosome in the reference genomes of the two avian species. Proportions are represented as totals (pie chart), absolute numbers (top) and proportions per chromosome (bottom). The associated data are provided in Additional file 9: Table S5.
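The five categories in the caption above amount to a simple decision rule, sketched here in Python. The function, its arguments, and the optional chromosome-name map are hypothetical. In particular, whether zebra finch chromosomes 1A and 1B were counted as matching chicken chromosome 1 is not stated in the text, so that mapping is left to the caller.

```python
from typing import Dict, Optional


def classify_scaffold(
    chicken_chrom: Optional[str],
    finch_chrom: Optional[str],
    finch_to_chicken: Optional[Dict[str, str]] = None,
) -> str:
    """Assign a scaffold to one of the five categories (A) to (E).

    Each argument is the chromosome of the scaffold's top alignment in that
    species, or None if no similar sequence was found there.
    """
    if chicken_chrom is None and finch_chrom is None:
        return "unmapped"
    if finch_chrom is None:
        return "chicken only"
    if chicken_chrom is None:
        return "finch only"
    if finch_to_chicken is not None:
        # Optional, caller-supplied translation of finch names to chicken names.
        finch_chrom = finch_to_chicken.get(finch_chrom, finch_chrom)
    return "matched" if chicken_chrom == finch_chrom else "mismatched"
```

For example, `classify_scaffold("1", "1A")` returns "mismatched" under plain name comparison, but "matched" when called with the illustrative map `{"1A": "1", "1B": "1"}`.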

Although a large proportion of scaffolds shared some similarity with the two avian genomes, there was also discordance, as only 12.6% of the scaffolds (2.8% of the total number of scaffolds) aligned to the same chromosome in both species (Figure 2, top and Additional file 9: Table S5), and the proportion of discordance varied across chromosomes, with the lowest value on chromosome 11 (Figure 2, bottom and Additional file 9: Table S5). While this lack of synteny could point to extensive rearrangements during evolutionary history, the proportions of scaffolds discordantly aligned between chromosomes seemed to be distributed similarly relative to chromosome lengths, indicating a significant random component (Figure 3). To test this, we selected the 200 longest scaffolds and independently queried their 500 bp ends against the chicken genome. Of these, only 10 scaffolds (5%) showed discordance, with opposite ends aligning to two or more different chicken chromosomes (see Comparative Analysis in Additional file 1).
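The end-based test can be sketched in Python as follows. This is a simplified illustration rather than the procedure in Additional file 1: it assumes that each 500 bp end is reduced to the chromosome of its best chicken hit (or None if it has no hit), and that a scaffold counts as discordant only when both ends have hits on different chromosomes. All function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

END_LENGTH_BP = 500  # length of each scaffold end queried, as in the text


def scaffold_ends(sequence: str, end_length: int = END_LENGTH_BP) -> Tuple[str, str]:
    """Return the first and last end_length bases of a scaffold."""
    if len(sequence) < 2 * end_length:
        raise ValueError("scaffold too short for non-overlapping ends")
    return sequence[:end_length], sequence[-end_length:]


def is_discordant(left_chrom: Optional[str], right_chrom: Optional[str]) -> bool:
    """True if both ends have a hit and the hits lie on different chromosomes."""
    return left_chrom is not None and right_chrom is not None and left_chrom != right_chrom


def discordance_rate(end_hits: Iterable[Tuple[Optional[str], Optional[str]]]) -> float:
    """Fraction of scaffolds whose two ends map to different chromosomes."""
    flags = [is_discordant(left, right) for left, right in end_hits]
    return sum(flags) / len(flags) if flags else 0.0
```

With 200 scaffolds of which 10 are discordant, `discordance_rate` gives 10/200 = 0.05, matching the 5% reported above.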

Figure 3

Synteny of alignment of the A. vittata scaffolds to two avian reference genomes (G. gallus and T. guttata). The connecting lines show the proportion of scaffolds that mapped both to T. guttata chromosomes (left side) and to G. gallus chromosomes (right side). The chromosomes are shown in order from top to bottom and designated in the same color for both species. For simplicity, different colors are used only for the three largest chromosomes. Chromosome 1 in G. gallus corresponds to chromosomes 1, 1A and 1B in T. guttata, shown in different shades of blue.

In summary, these data represent the first assembly of a genome sequence for a parrot endemic to the United States, and also the first genome of a species from the diverse and ecologically important genus Amazona, native to South America and the Caribbean. The assembled sequence provides a starting point towards completing and annotating a draft genome sequence. The data available at this coverage will be helpful in designing future sequencing efforts, and can also be used for annotation and comparative genomic studies across the growing body of avian genome data [5, 6, 8], which is essential given the growing rate of extinction among avian species worldwide.

Availability of supporting data

The raw reads are available at the ENA (accession #PRJEB225). Scaffolds and the assembly parameters have been submitted to GenBank (accession #PRJNA171587), and all data, including FASTA files of contigs, scaffolds, corresponding assembly parameters, and annotation data, are available in GigaDB [9]. The links to all the supplementary tables and databases are listed in Additional files 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16, and can also be accessed at http://genomes.uprm.edu/gigascience/SupplementaryTables/.

Note from the editors

A related commentary by Stephen O’Brien on the issues surrounding this work is published alongside this article [10].