Introduction

Alectoris rufa, also known as the red-legged partridge, is a game bird of significant ecological and economic importance for rural areas in southwestern Europe1. Habitat degradation, captive breeding, and hunting management have created a complex situation for the species, affecting both the ecosystems and the society of the region. Across various hunting grounds, wild, farmed, and hybrid partridges coexist in varying proportions. While these partridges differ in behavior, physiology, morphology, anatomy, and genetics, the absence of a reference genome hinders our ability to molecularly differentiate these ecotypes, which span from wild to domestic2. The haploid genome of A. rufa has 9 macro chromosomes and 30 micro chromosomes3,4. The advent of Next-Generation Sequencing (NGS) technologies, mainly based on short-read sequencing data, combined with decreasing DNA sequencing costs, led to an increase in the number of available genome sequences. However, those genomes were still highly fragmented due to the limitations inherent to short reads, where, for example, repetitive regions can lead to genome misassembly. The emergence of third-generation sequencing technologies partially overcame those limitations by generating long-read sequencing data. These long reads helped to reduce assembly fragmentation and increase contiguity, greatly improving the quality of whole-genome assemblies5. Still, early long-read technologies had base-calling error rates of 10–14%, which are much higher than the error rate of less than 1% found in short-read technologies6. In addition, the error profiles of the two technologies differ. Errors in short reads are mostly nucleotide substitutions, while errors in long reads mostly involve insertions and deletions7,8. This difference makes long-read errors more complex to resolve, requiring an error correction step prior to genome assembly.
The error correction problem has been addressed either by self-correction, aligning long-reads against each other, or by a hybrid approach in which long-reads are corrected using short-reads. The latter approach is known to achieve more accurate genome assemblies than genomes assembled based only on short- or long-read technologies9,10.

In this context, the quality of reference genome assemblies benefited from the combination of Illumina short-read sequencing with third-generation sequencing platforms such as Pacific Biosciences (PacBio)11 or Oxford Nanopore Technologies (ONT)12. Application of these technologies improved contiguity, completeness, and accuracy compared to assemblies based on short-read sequencing alone13,14. In general, the number of contigs and scaffolds was significantly reduced, and N50 values increased, leading to better genome annotation and identification of more genes, including non-coding RNA genes, pseudogenes, and transposable elements15,16. Examples of avian genomes assembled using hybrid approaches include the Tibetan partridge15, the Indian peafowl17, the domestic turkey14, and the common pheasant18.

The first effort to sequence the red-legged partridge genome, from a male individual, was published in 2021 under the accession number GCA_019345075.1 (ref. 19). It was based on Illumina paired-end short-read sequence data and resulted in a highly fragmented assembly, with 10,598 scaffolds, a contig/scaffold N50 of 11.57 Mb, and an L90 of 131. A more recent version of A. rufa's genome, based on ONT and short reads, was released at the NCBI under the accession number GCA_947331505.1. That version has 426 scaffolds with an N50 of 34 Mb and an L90 of 32 (ref. 20). Both genomes lack detailed annotation. The contiguity of the GCA_947331505.1 assembly (~ 500 contigs) is approximately twenty-five times better than that of assembly GCA_019345075.1 (~ 10,000 contigs). Finally, the BUSCO completeness assessment of the two assemblies reveals that assembly GCA_947331505.1 is missing approximately 500 single-copy BUSCO orthologs with respect to GCA_019345075.1 and has approximately twenty times more duplicated gene copies. These discrepancies may lead to errors in gene order conservation (synteny) and contribute to large-scale assembly inaccuracies. To overcome some of the challenges and limitations found in the earlier genome assemblies of A. rufa and move towards a well-annotated chromosome-level assembly, we combined short- and long-read sequencing data in a hybrid approach. Here we report the resulting scaffold-level assembly and its annotation. We validated the assembly by comparing it to the reference genomes of chicken (Gallus gallus, NCBI reference GCF_016699485.2) and quail (Coturnix japonica, NCBI reference GCF_001577835.2), two closely related species. Overall, we provide a valuable resource for comparative and population genomics, improving our understanding of avian evolution, biogeography, and demography.

Results

Estimation of genome size and heterozygosity rate

We conducted genome profiling on sixty A. rufa individuals using k-mer analysis of short-read sequence data, finding an estimated genome size between 1 and 1.06 Gb and a heterozygosity between 0.1% and 0.4% (Fig. 1).

Figure 1

Assessment and profiling of Alectoris rufa’s genome assembly. A Distribution of estimated genome size and heterozygosity levels across sixty individuals of A. rufa sequenced with Illumina short reads. B Genome size and ploidy level estimation using long reads from two A. rufa individuals sequenced with ONT. Left: Genome size and heterozygosity plot. Right: genome ploidy inference. C Assessment of scaffold completeness using a 21-mer spectrum approach. Scaffold completeness is estimated to be 92%. D Comparison of the completeness of the genome assembly and the genome annotation, based on recovered core genes from the aves_odb10 dataset of BUSCO.

A. rufa genome assembly, annotation and quality assessment

We tested and evaluated various pipelines to assemble the genome of the red-legged partridge. The NextDenovo pipeline produced a primary assembly with the best metrics. This assembly comprised 116 contigs, with an N50 length of 74 Mb and an N90 of 10 Mb (Supplementary Table S1). We further refined this assembly, recovering 96.8% (8078 out of 8332) of the single-copy genes found in the BUSCO dataset of avian single copy orthologous genes (aves_odb10, N = 8332 genes) (Supplementary Table S2). The contigs were then used as the basis for genome scaffolding, resulting in a final genome assembly of 115 scaffolds and 1.03 Gb. Table 1 summarizes the most relevant contiguity metrics of this assembly and its annotation.

Table 1 Statistic for the Alectoris rufa genome assembly and annotation.

The final assembly improves substantially on the contiguity metrics of the earlier assemblies (Table 2). Our L90 is 23, closer to the 9 macro-chromosomes present in the haploid genome of A. rufa, and at least five times smaller than that for assemblies GCA_019345075.1 (based on short reads) and GCA_947331505.1 (based on long reads). Our N50 (74 Mb) is twice that of the GCA_947331505.1 assembly and about seven times that of the GCA_019345075.1 assembly. Our assembly contained 96.78% (n = 8053 genes) of complete and single-copy genes without duplications present in the BUSCO avian dataset, surpassing both the short-read (95.1%; n = 7933 genes) and the long-read (96.58%; n = 7378) genome assemblies. Table 2 summarizes the main differences in genome contiguity and completeness metrics between those assemblies.
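For readers less familiar with the contiguity metrics compared above, the following minimal Python sketch shows how Nx and Lx values such as N50 and L90 are conventionally derived from a list of sequence lengths. The function name `nx_lx` and the toy lengths are illustrative assumptions; the statistics reported in this work were calculated with QUAST, not with this code.

```python
def nx_lx(lengths, fraction):
    """Return (Nx, Lx) for a list of contig or scaffold lengths.

    Nx is the length of the sequence at which the running total of the
    lengths, sorted from longest to shortest, first reaches `fraction`
    of the total assembly size. Lx is the number of sequences needed.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must not be empty")


# Hypothetical toy assembly (lengths in Mb), not the A. rufa scaffolds.
toy_lengths = [40, 30, 10, 8, 6, 4, 2]
n50, l50 = nx_lx(toy_lengths, 0.5)
n90, l90 = nx_lx(toy_lengths, 0.9)
print(f"N50 = {n50} Mb, L50 = {l50}, N90 = {n90} Mb, L90 = {l90}")
```

For the toy lengths, half of the 100 Mb total is reached with the two longest sequences (N50 = 30 Mb, L50 = 2) and 90% with five (N90 = 6 Mb, L90 = 5). A smaller L90 therefore means that fewer sequences hold most of the assembly.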

Table 2 Comparing Alectoris rufa genome assemblies.

Additionally, we compared our genome assembly against those of eleven birds and one reptile, all of which have chromosome-level genome assemblies (Supplementary Table S3). Our assembly has the fifth highest scaffold N50 value among the bird genomes analyzed here (Fig. 2A; see also Supplementary Fig. S1 for the contig N50 statistics). Moreover, in terms of avian orthologs, our assembly also ranks within the top five for the number of both complete and single-copy orthologs. The NCBI’s Foreign Contamination Screening revealed no significant contamination in the assembly of those 115 scaffolds (Supplementary Table S4).

Figure 2

Assessing the completeness and correctness of the A. rufa assembly in comparison to closely related bird species. A Scaffold N50 statistic for each genome assembly. B Completeness of each assembly based on BUSCO results with the aves_odb10 dataset.

Annotation of transposable elements

RepeatMasker21 annotated 13% of the A. rufa genome as repetitive sequences. Table 3 summarizes the analysis of transposable elements (TE), which revealed a higher percentage of repetitive elements, when compared to the previous draft genome based on short reads alone19. Long interspersed nuclear elements (LINE) are the most frequent transposable elements in the genome, representing 7.74% of the whole genome sequence. These elements have a greater divergence rate in comparison to other DNA transposable elements (Supplementary Fig. S2) identified in the genome. DNA transposons (2.33%) and long terminal repeat (LTR) elements (1.76%) are the second and third most abundant classes of transposable elements in the genome, respectively.

Table 3 Comparative statistics of repetitive elements between short read and hybrid A. rufa genome assemblies.

Annotation of RNA and protein-coding genes

We validated 10,757 annotated protein genes through comparison of their intron–exon structure with G. gallus or C. japonica orthologs (Supplementary data file S1). To do so, we BLASTed our annotated A. rufa proteome against that of G. gallus to identify pairs of orthologs with conserved intron–exon structure. Then, we repeated the process between A. rufa and C. japonica. An additional 8,509 genes were also validated through mapping of the full transcript (Supplementary Table S5). This generated a high-confidence data set of 19,266 predicted protein genes. An additional 11,010 genes were annotated with lower confidence, making a total of 30,236 protein-coding genes in our assembly. We summarize the statistics for all these predicted genes in Table 4.

Table 4 Summary of features annotated in the genome of A. rufa.

We identified known homologs for 95% (28,862) of the predicted protein genes in a non-redundant database merging the complete protein datasets downloaded from SwissProt, TrEMBL and NCBI. Of these, 18,865 (62.1%) proteins were simultaneously and consistently annotated among the three databases (Supplementary Fig. S3). We were able to assign InterProScan family and subfamily domains to 25,978 (85.9%) predicted genes, and GO biological functions to 13,371 (57.1%) genes (Supplementary data file S1).

A KEGG-based functional annotation mapped 12,377 of our protein-coding genes predicted with high confidence to their representative functional KEGG ortholog (KO) genes (Supplementary data file S1). The largest number of genes were mapped to genetic information processing (2968 genes), environmental information processing (1785 genes), and molecular function-related signaling and cellular processes (1664 genes). The top five KEGG metabolic pathways were carbohydrate metabolism (342 genes), lipid metabolism (306 genes), glycan biosynthesis and metabolism (220 genes), amino acid metabolism (180 genes), and nucleotide metabolism (148 genes) (Supplementary data file S1, Supplementary Fig. S4).

We report the annotation profile of non-coding RNAs (ncRNA) in the assembled genome with respect to their Rfam families. We identified 305 transfer RNA (tRNA) genes with tRNAscan-SE. Additionally, using Infernal, we identified 246 micro-RNA (miRNA), 135 ribosomal RNA (rRNA) and 315 small nuclear RNA (snRNA) genes (Supplementary data file S1).

Synteny analysis of the genome structures of A. rufa, C. japonica and G. gallus

A. rufa belongs to the Phasianidae (pheasants, partridges, chickens, turkeys, etc.) family of the Galliformes clade. While many relationships within Galliformes are well-supported, some uncertainties remain, particularly regarding the branching order within the species-rich Phasianidae family. One of the uncertainties in this family is the relationship between A. rufa, C. japonica, and G. gallus. The three birds are closely related and share a karyotype of n = 39 chromosomes22. This karyotype similarity motivated us to compare the sequence of the largest 23 scaffolds of A. rufa (containing at least 90% of the assembled genome) across the three species. Figure 3 highlights significant syntenic regions across the three genomes. Scaffolds 2 and 5 of A. rufa align with chromosome 1 in the two other species. Similarly, scaffolds 1, 3 and 4 align to chromosomes 2, 4 and 3 of both birds, respectively. Furthermore, A. rufa scaffolds 6 and 10 display near complete synteny with C. japonica’s sex chromosome Z, while scaffold 10 shows synteny with G. gallus’ Z chromosome. Scaffolds 7 and 15 of A. rufa display considerable synteny with chromosome 5 of the other birds. The remaining 14 A. rufa scaffolds exhibit strong synteny with individual chromosomes of the other two bird species. Notably, 12 micro chromosomes from C. japonica and 20 micro chromosomes from G. gallus did not exhibit significant homology with any of the assembled A. rufa scaffolds.

Figure 3

Circos plots comparing sequence homology between the largest 23 A. rufa scaffolds and the reference chromosomes of A C. japonica, and B G. gallus. Each line within the circle represents 10 kb of sequence homology. Chromosomes are color coded to facilitate visualizing the syntenic regions between A. rufa and the other two birds. There are 248,386 regions of strong homology between A. rufa and C. japonica, compared to 154,686 regions of strong homology between the genome of A. rufa and that of G. gallus.

Pairwise analysis of the chromosomal rearrangements between A. rufa and C. japonica or G. gallus

The scaffold-to-chromosome alignments revealed significant large-scale genomic rearrangements between A. rufa and both C. japonica and G. gallus genomes (Supplementary Fig. S5, Supplementary Table S6). Scaffold 2 exhibits a small 2.52 Mb inversion within the 105.87–108.07 Mb region of C. japonica's chromosome 1. Scaffold 5 presents two similar-sized inversions, occurring at regions 19.06–20.95 Mb and 50.02–57.48 Mb of chromosome 1. Scaffold 1 displays a substantial inversion in its center relative to the centromeric region of C. japonica's chromosome 2 (42.9–77.77 Mb). Scaffold 3 features two inversions near one of its ends compared to chromosome 3. Similarly, scaffolds 4 and 18 exhibit inversions when aligned to chromosomes 4 and 15, respectively.

Pairwise alignment of our scaffolds with G. gallus chromosomes unveiled repeated inversions, particularly at telomeric regions. Notably, scaffold 4 included two inversions totaling 4.37 Mb within regions 1.76–4.29 Mb and 0.02–1.77 Mb of chromosome 4. Similarly, scaffold 8 exhibited three inversions totaling 3.23 Mb between regions 7.3–8.46 Mb, 9.97–11.06 Mb, and 11.81–12.72 Mb, aligning with chromosome 6 of the G. gallus genome. Additionally, scaffold 11 featured a substantial 8.35 Mb inversion relative to the 0.06–8.07 Mb region of chromosome 8.

Overall, these results suggest that A. rufa’s genome is more similar to that of C. japonica than to that of G. gallus. The similarities in genomic structure and rearrangements imply a closer evolutionary relationship between A. rufa and C. japonica than between A. rufa and G. gallus.
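To make the notion of an inversion call in this section concrete, the sketch below flags candidate inversions from a list of scaffold-to-chromosome alignment blocks: blocks whose strand disagrees with the dominant orientation of the scaffold are collected and neighbouring ones are merged. This is only an illustration under stated assumptions. The `Block` type, the `candidate_inversions` function, the `max_gap` parameter and the toy coordinates are hypothetical and do not reproduce the alignment procedure used for Supplementary Table S6.

```python
from typing import NamedTuple


class Block(NamedTuple):
    ref_start: float  # start on the reference chromosome (Mb)
    ref_end: float    # end on the reference chromosome (Mb)
    strand: str       # "+" or "-" orientation of the scaffold in this block


def candidate_inversions(blocks, max_gap=0.5):
    """Return merged reference intervals aligned against the dominant strand."""
    plus = sum(b.ref_end - b.ref_start for b in blocks if b.strand == "+")
    minus = sum(b.ref_end - b.ref_start for b in blocks if b.strand == "-")
    # The strand covering less aligned sequence is treated as inverted.
    minority = "-" if plus >= minus else "+"
    flipped = sorted(b for b in blocks if b.strand == minority)

    merged = []
    for block in flipped:
        if merged and block.ref_start - merged[-1][1] <= max_gap:
            # Close enough to the previous inverted block: extend it.
            merged[-1][1] = max(merged[-1][1], block.ref_end)
        else:
            merged.append([block.ref_start, block.ref_end])
    return [(start, end) for start, end in merged]


# Hypothetical toy alignment of one scaffold to one chromosome.
toy_blocks = [
    Block(0.0, 10.0, "+"),
    Block(10.0, 12.0, "-"),
    Block(12.2, 13.0, "-"),
    Block(13.0, 40.0, "+"),
    Block(40.0, 41.0, "-"),
]
for start, end in candidate_inversions(toy_blocks):
    print(f"candidate inversion at {start}-{end} Mb ({end - start:.2f} Mb)")
```

In the toy data the two adjacent reverse-strand blocks are merged into a single 3 Mb candidate, while the isolated block at the end of the chromosome is reported separately. As discussed later in the text, such candidates still need to be checked against the raw long reads before they are interpreted as true rearrangements.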

Comparative proteome of A. rufa, C. japonica, G. gallus, and M. gallopavo

Comparing the ortholog clusters of protein-coding genes in the high-confidence dataset between the four species reveals 10,111 shared orthologous gene families (Fig. 4A). We also identified 113 gene families that are exclusive to A. rufa. Among these, 101 genes could be functionally associated with general biological processes using GO (Supplementary data file S1, summarized in Fig. 4B). Among the gene families linked to more specific GO components, 1 gene was associated with membranes, and 2 genes were associated with structural activities. The set of genes unique to A. rufa (Fig. 4C) is significantly enriched in genes related to viral processes (5 genes), regulation of immune response (8 genes), and microtubule depolymerization (16 genes).
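The shared and species-specific family counts above are, in essence, set operations on the ortholog clusters of each species. The following sketch illustrates that logic on invented family identifiers. The function name `shared_and_unique` and the toy data are assumptions made for illustration only, and the clustering step that produces the families is not shown.

```python
def shared_and_unique(families_by_species, focal):
    """Return (families present in every species, families only in `focal`)."""
    shared = set.intersection(*families_by_species.values())
    others = set().union(
        *(fams for species, fams in families_by_species.items() if species != focal)
    )
    unique = families_by_species[focal] - others
    return shared, unique


# Hypothetical family identifiers, not the clusters reported in Fig. 4A.
toy_families = {
    "A. rufa": {"fam1", "fam2", "fam3", "fam9"},
    "C. japonica": {"fam1", "fam2", "fam4"},
    "G. gallus": {"fam1", "fam2", "fam3"},
    "M. gallopavo": {"fam1", "fam2", "fam5"},
}
shared, unique = shared_and_unique(toy_families, "A. rufa")
print(sorted(shared), sorted(unique))
```

Here `fam1` and `fam2` are shared by all four species, and `fam9` is the only family found exclusively in the focal species.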

Figure 4

Functional comparison of the protein genes annotated with higher confidence in A. rufa’s proteome to the annotated NCBI proteomes of C. japonica, G. gallus, and M. gallopavo. A Comparison of orthologous gene families between A. rufa, C. japonica, G. gallus and M. gallopavo. B Generic GO enrichment terms for gene families that are unique to A. rufa. C Specific GO enrichment terms for gene families that are unique to A. rufa. Only GO categories that are associated to more than one gene were included in panels (B) and (C).

Phylogenetic analysis of A. rufa within the Galliformes clade

The phylogenetic tree (Fig. 5), constructed through the alignment of 8212 single-copy BUSCO genes found across thirteen genomes (Supplementary Table S3 and our A. rufa assembly), dates key divergence events in million years ago (Mya). The divergence between birds and reptiles occurred roughly 300 Mya. Anseriformes and Galliformes parted ways around 75 (95% credibility interval 46.14–106.85) Mya, with the Guinea fowl diverging from the main Galliformes lineage approximately 56 (95% credibility interval 33.41–78.34) Mya. The clade containing G. gallus, C. japonica and A. rufa separated from the rest of the Galliformes approximately 49 (95% credibility interval 27.88–69.01) Mya, with their last common ancestor estimated at roughly 35 (95% credibility interval 9.86–57.87) Mya. The divergence between C. japonica and A. rufa happened approximately 20 (95% credibility interval 0.0011–41.44) Mya. These predicted divergence timelines are consistent with the findings we report from the pairwise analysis of the chromosomal rearrangements between the three birds (Fig. 3). Because of the large credibility intervals for the divergence times, we also calculated individual maximum likelihood gene trees for the single-copy BUSCO orthologs identified in all genomes, using IQ-TREE. We then used the multi-species coalescent model approach in ASTRAL to build the species tree from the individual gene-based trees (Supplementary Fig. S6). The branching structure of the two trees is consistent.

Figure 5

Phylogenetic analysis of A. rufa. The phylogenetic tree was reconstructed from concatenated single-copy orthologous genes of the complete genomes of 11 birds plus A. rufa using IQ-TREE. The lizard Anolis carolinensis was used as outgroup. A. rufa is closer to C. japonica than to G. gallus. Numbers at each node represent the estimated divergence time in million years. Blue lines indicate the 95% credibility interval for those estimates. Only branching points supported by 100% of bootstrapped trees are shown.

Discussion and conclusions

We achieved a highly contiguous genome assembly for A. rufa by integrating accurate short reads from Illumina sequencing with lower-accuracy ultra-long reads from Oxford Nanopore Technologies (ONT). The resulting assembly is scaffold-level, comprising 115 DNA scaffolds, with an L90 of 23. Our approach demonstrates superior contiguity and scaffolding accuracy compared to previous assemblies relying solely on either short-read19 or long-read data (accession number GCA_947331505.1 at the NCBI), further validating the efficacy of the combined sequencing approach for de novo genome assembly in non-model organisms. Additionally, the sequences from sixty A. rufa individuals provide a valuable reference for future genetic studies characterizing genome size, ploidy, and heterozygosity rates in different A. rufa populations. Our assembly contributes to the collection of avian genomes and highlights the effectiveness of integrating long-read data with high-quality short-read data from Illumina10,16,23.

Notably, the contiguity statistics for the A. rufa genome are above average with respect to the other eleven fully sequenced Galliformes genomes analyzed (Fig. 2A). Still, we note that the Bird10K genome sequencing initiative is having tremendous success in generating highly contiguous genomes and these have better contiguity statistics than ours24,25,26,27. We expect to further improve our contiguity by generating and using HiC data to improve the assembly in the future. Assessment of our assembly’s completeness using BUSCO28 shows that it has the highest number of single-copy orthologous genes identified with respect to the other analyzed genomes (Fig. 2B). We note that the BUSCO completeness of the gene annotation (87.9% completion rate, Fig. 1D) is lower than that of the assembled genome. This discrepancy between the recovered BUSCO genes and the annotated gene set is consistently observed in similar cases29. The highly contiguous assembly facilitated a comprehensive genome annotation by leveraging diverse functionally annotated sequence databases and pre-existing transcriptomic data. As a result, we could use sequence homology to assign biological function to over 95% of all genes identified in our assembly. Overall, our assessment of the annotation quality using RNA sequencing data showed a complete alignment with gene models, with no missed single exons (Supplementary Table S5). Supplementary data file S1 contains all details of the annotation.

We identified 19,226 protein genes with high confidence. Of these, 10,757 protein genes were verified to maintain their intron–exon structure when compared to G. gallus or C. japonica orthologs (Supplementary data file S1), suggesting these genes are also correctly annotated. The remaining 8,509 genes were verified through mapping the full transcript. Notably, transcriptomic data is currently limited to the spleen and skin tissues, yet it aligns well with the annotated gene models. Despite varying parameters during the process of masking DNA transposable elements, we observed a minimal impact of those changes on the number of annotated protein-coding genes. Given these findings, we anticipate that incorporating transcriptomic data from additional tissues will refine the gene models specific to A. rufa and potentially reduce the overall count of annotated genes, mirroring observations in other model organisms30. Additional ab initio annotation identifies 11,010 genes with lower confidence.

The genomic annotation of TEs in the A. rufa genome shows a high abundance of LINE (7.74% of the genome) and LTR repeat elements (1.76% of the genome). These numbers are higher than those found in the genomes of G. gallus31 (~ 3% LINE and ~ 0.5% LTR) and C. japonica32 (~ 5.60% LINE, and ~ 0.60% LTR, Supplementary Table S7). The genomes of C. californica33 and C. virginianus both have a percentage of LINE (6.9% and 5.6% respectively) and LTR (5.6% and 1.73% respectively) more similar to that found in A. rufa. Given that transposable elements were found to influence color34,35 in insects, mammals, birds and other vertebrates, a future analysis of the genome should reveal if any genes involved in color determination are found within regions containing TEs.

Notably, 13% of the annotated genes are associated with metabolic functions, while 11% are involved in processing environmental information, including 9% dedicated to signal transduction tasks. The distribution of tRNA genes in the A. rufa genome indicates that 11% code for alanine-tRNA and 9% for serine-tRNA (Supplementary data file S1, Supplementary Table S8). Our gene enrichment analysis suggests that A. rufa evolved a distinct set of regulatory genes and viral response proteins, likely shaped by species-specific infections and pressures. These findings align with previous transcriptomic analyses that highlighted heightened immune responses in A. rufa36.

A. rufa, C. japonica, and G. gallus (Phasianidae family) have a diploid genome with 78 chromosomes while C. virginianus and C. californica (Odontophoridae family) have 82 and 84, respectively. A structural genomic comparison between the three Phasianidae birds using chromosome-mapping approaches shows that chromosomal coverage and synteny are stronger between A. rufa and C. japonica than between A. rufa and G. gallus. Still, several chromosomal inversions (Supplementary Fig. S5, Supplementary Table S6) highlight that the divergence between A. rufa and C. japonica is not recent. For example, aligning scaffold 1 of A. rufa to chromosome 2 of C. japonica reveals an inversion that contains the centromeric region of the chromosome. Our sequence comparison between scaffold 4 of A. rufa and chromosome 4 of G. gallus reveals another centromeric inversion. This inversion was previously reported based on cytogenetic analysis4. Still, we note that inversions detected close to centromeres and telomeres may result from mis-assemblies, due to the higher DNA repeat content in those genomic regions. However, a more detailed analysis of those inverted regions shows that they were all associated with DNA mobile elements, rather than with tandem repeats (Supplementary Table S6). In addition, we realigned the raw long reads to our assembly, and this alignment is consistent with the assembly direction. As such, we strongly believe that those inversions are not an artefact of assembly. In fact, they are also consistent with similar massive inversions observed within independent populations of C. coturnix, another quail species, and associated with an expansion of phenotypic diversity between populations37. Such genomic rearrangements were reported to be associated with adaptive divergence in other animal species38. These and other observations in our analysis emphasize the potential interest of future research focusing on A. rufa's evolutionary chromosomal rearrangements.
We are currently working to generate HiC data, which would provide a map of physical interactions and allow us to generate a chromosome-level assembly. This would help to rule out the possibility that the chromosomal rearrangements are assembly artifacts.

The genome assembly provided here is also of interest for phylogenetic studies. Phylogeny proposes an evolutionary tree that aids our comprehension of species divergence over time, drawing upon evidence from paleontology, biogeography, and genetics39,40,41. The integration of both mitochondrial and nuclear markers significantly advanced the accuracy of those studies42,43. Still, phylogenetic trees based on individual genes may be biased44,45,46, due to factors such as incomplete lineage sorting, gene flow dynamics, and horizontal gene transfer. Coalescent-based methods are often helpful in reducing that bias47. Still, combining genome-based trees with estimates of divergence time gleaned from fossil records and genetic clocks has produced robust phylogenies that can be used to generate strong hypotheses about speciation events48,49,50. We also combined fossil record-based divergence times, concatenated gene-based trees, and coalescent-based trees to reconstruct the phylogeny of A. rufa in the Galliformes order. We found the phylogenetic tree topologies to be robust for the alternative approaches.

Our tree suggests a divergence time around 75 Mya for the Galliformes clade, consistent with previous estimates42,48,51. Notably, within the bird group and the order Galliformes, a variety of studies using a limited set of genetic markers proposed multiple and often coinciding clade formation hypotheses51,52. Our results are consistent with those hypotheses. However, divergence times are slightly different because earlier efforts used a smaller number of genetic markers and a larger number of species. Still, those earlier estimates are well within the 95% credibility intervals for our divergence times. Leveraging future assemblies of A. rufa’s and other bird genomes to create genome wide alignments from which to create phylogenetic trees will likely enable a more accurate understanding of the evolutionary history of life.

Overall, our assembly and annotation provide a significant contribution towards a reference genome of the red-legged partridge, which will aid in developing genetics applied to phylogeny, zoology, demography, and ecology of the species. This near-chromosome assembly provides a foundation upon which to anchor future comparative genomics research between different A. rufa populations and across Phasianidae species. It is a valuable resource, potentially enabling the development of more effective strategies for management and conservation of A. rufa and wildlife.

Methods

Genome sequencing data

Total DNA was obtained from the muscle of sixty frozen A. rufa individuals (muscle from 30 wild birds obtained from hunters’ bags and 30 farm birds obtained from slaughtered partridges from a farm in Ciudad Real) for whole-genome sequencing on the NovaSeq6000 Illumina platform, producing short paired-end reads with a read length of 151 bp as described previously2. Additionally, blood was collected from the brachial vein in the wing of two live individuals (one male and one female, no anesthetics were used) using a sterile syringe with a 20 G needle. We then extracted high molecular weight (HMW) DNA from that blood for library preparation with the genomic DNA sequencing kit of Oxford Nanopore Technologies (ONT) and then sequenced the libraries using a GridION platform. Sequencing one bird of each sex had the purpose of facilitating an assembly of both sex chromosomes when HiC data becomes available. The study was conducted in full compliance with Spanish laws and regulations, including the licence of “Las Ensanchas” for sampling shot partridges. The protocol was approved by the Committee on the Ethics of Animal Experiments of the University of Lleida (Ref. 1998–2012/05). The ten essential ARRIVE guidelines were followed in designing and reporting this study.

Processing sequence data

The Illumina sequencing yielded an average of 218 million raw reads per individual, with an average sequencing depth of 32X per sample. We assessed the quality of those reads using FastQC53. The per-base quality scores were consistently high across all samples, and no adapter content was found within the reads. Thus, we determined that additional cleaning and adapter removal procedures were unnecessary.
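The relationship between read count, read length and sequencing depth quoted above can be checked with a short calculation. The sketch below assumes a genome size of 1.03 Gb, taken from the final assembly reported in the Results; the function name `sequencing_depth` is ours and is used only for illustration.

```python
def sequencing_depth(n_reads, read_length_bp, genome_size_bp):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


# 218 million reads of 151 bp per individual; genome size assumed to be 1.03 Gb.
illumina_depth = sequencing_depth(218e6, 151, 1.03e9)
print(f"Estimated Illumina depth: {illumina_depth:.1f}X")
```

Under that assumption the estimate is close to the 32X average depth reported for the short-read data.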

We generated 2 million raw ultra-long reads with Oxford Nanopore Technologies (ONT) sequencing, yielding 48 Gb with an average read length of 20.68 kb (Supplementary Table S9). We used Porechop v0.2.4 (ref. 54) with default parameters to scan for known Nanopore adapters and trim them from the long reads, ensuring a high-quality dataset free of adapter contamination. We assessed the quality of this dataset using NanoPlot v1.40.2 (part of the NanoPack software suite)55. We then used Filtlong56 to split the reads into two subsets applying different criteria. For the first subset, we prioritized read length over average read quality, selecting a coverage depth of 40X (--min_length 15 kb -t 40 Gb). For the second subset, we prioritized average read quality over read length, generating a coverage depth of 20X (--min_mean_q 12 -t 20 Gb). By using these two different subsets, we aimed to improve genome contiguity while also correcting structural errors, ensuring a more reliable and accurate analysis of the sequencing data.
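The logic behind the two long-read subsets can be illustrated with a simplified selection routine: reads failing a threshold are discarded, the remainder are ranked by the prioritized property, and reads are kept until a target number of bases is reached. This is a sketch of the idea only and not Filtlong's actual scoring scheme. The `Read` type, the `select_reads` function and the toy reads are hypothetical.

```python
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    length: int    # read length in bases
    mean_q: float  # mean Phred quality of the read


def select_reads(reads, target_bases, min_length=0, min_mean_q=0.0, key="length"):
    """Keep the best-ranked reads passing the thresholds until target_bases is reached."""
    passing = [r for r in reads if r.length >= min_length and r.mean_q >= min_mean_q]
    ranked = sorted(passing, key=lambda r: getattr(r, key), reverse=True)
    kept, total = [], 0
    for read in ranked:
        if total >= target_bases:
            break
        kept.append(read)
        total += read.length
    return kept


# Hypothetical toy reads; real targets would be on the order of 40 Gb and 20 Gb.
toy_reads = [
    Read("a", 30_000, 10.0),
    Read("b", 20_000, 14.0),
    Read("c", 16_000, 9.0),
    Read("d", 12_000, 15.0),
    Read("e", 5_000, 13.0),
]
# Subset 1: favour length (minimum length 15 kb).
long_subset = select_reads(toy_reads, 45_000, min_length=15_000, key="length")
# Subset 2: favour accuracy (minimum mean quality 12).
accurate_subset = select_reads(toy_reads, 30_000, min_mean_q=12, key="mean_q")
print([r.name for r in long_subset], [r.name for r in accurate_subset])
```

With the toy reads, the length-first subset retains reads a and b, whereas the quality-first subset retains reads d and b. The same read pool therefore yields a longer but noisier subset and a shorter but more accurate one.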

Genome size estimation

We used a 21-mer-based approach in Jellyfish v2.2.10 (ref. 57) to estimate k-mer frequency histograms from the Illumina paired-end sequencing data of each of the sixty individual birds. The output of Jellyfish was then used in GenomeScope2 (ref. 58) to estimate genome size and heterozygosity level for the genome of each bird. In addition to the genome profiling with GenomeScope2 on those short reads, Smudgeplot58 was used to estimate the ploidy level using Nanopore long-read sequencing data.
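The principle behind k-mer-based genome size estimation can be sketched in a few lines: k-mers are counted, a histogram of their multiplicities is built, and the total number of non-error k-mers is divided by the depth of the main histogram peak. The code below is a deliberately simplified illustration. GenomeScope2 fits a statistical mixture model rather than using a single peak, Jellyfish is normally run on canonical k-mers, and the function names and the `min_depth` cutoff are assumptions introduced here.

```python
from collections import Counter


def kmer_histogram(reads, k=21):
    """Return {multiplicity: number of distinct k-mers} for a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Genome size ~ total k-mers above the error cutoff / depth of the main peak."""
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth


# Hypothetical histogram: many singleton (error) k-mers and a peak at depth 10.
toy_histogram = {1: 5000, 9: 50, 10: 100, 11: 50}
print(estimate_genome_size(toy_histogram))
```

For the toy histogram, the singleton k-mers are discarded as likely sequencing errors, the peak lies at a depth of 10, and the 2000 remaining k-mer occurrences give an estimate of 200 distinct genomic positions.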

Hybrid genome assembly

Supplementary Fig. S7 summarizes the pipeline we employed to create a de novo assembly of the A. rufa genome using a hybrid approach. The raw ONT long reads were assembled de novo with Flye59, Canu60, Wtdbg261, and NextDenovo v2.2.462. To select the best primary assembly for further processing, we compared the performance of the four assemblers. We used QUAST v5.2.063 to calculate the contiguity statistics of each assembly and the aves_odb10 dataset of Benchmarking Universal Single-Copy Orthologs (BUSCO) v.5.4.328 to assess their completeness. Based on these metrics, we chose the NextDenovo contig-level assembly for further improvement.
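Among the contiguity statistics used to compare assemblers, N50 is the most widely reported. A minimal Python sketch of its definition is given below: N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the toy contig lengths are illustrative only.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Contig length at which the sorted cumulative sum reaches half the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```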

We combined long- and short-read information to improve the contig-level assembly. This hybrid approach comprised two main polishing steps followed by scaffolding. First, we mapped the 40X subset of ONT long reads (minimum read length of 15 kb) to the contig-level assembly using minimap264. This alignment was then input into RACON v.1.5.165 for one polishing iteration, which corrected several structural assembly errors and improved the contiguity of the contig-level assembly. Then, we aligned the short reads from the sixty A. rufa individuals to the RACON-improved assembly using BWA-mem2 v2.2.166. This alignment was the input for Polypolish v0.5.067, which we used to polish the RACON-improved draft and fix small substitution and indel errors, leveraging the high coverage of the short reads to generate a high-quality consensus assembly that represents the genetic diversity of the A. rufa genome. We completed the scaffolding of the assembly using the REDUNDANS pipeline68, run with the "--non-reduction --nogapclosing" parameters. To improve scaffold accuracy, the input to this step was a subset of the original raw reads: the 20X subset of accurate ONT long reads combined with the short reads of the two animals with the highest genome coverage. The final scaffold-level assembly served as the foundation for downstream genome annotation and comparative analysis.

Genome screening for contamination sequences

Before annotating the assembled genome, we screened it to identify and remove any contaminant sequences. To do this, we employed two of NCBI’s Foreign Contamination Screening (FCS) tools69, FCS-adapter and FCS-GX. We used FCS-adapter to detect adapter and vector sequences, and FCS-GX to identify foreign DNA contamination by aligning our assembly against the NCBI database of genomes. We ran each tool independently using default settings, except for the taxonomic identifier, which was set to that of A. rufa (NCBI: txid 9079).

Genome annotation

Annotation of transposable elements

We used EDTA v2.1.170 to annotate the DNA transposable elements (TEs) in our assembled genome. EDTA integrates a set of open-source programs for TE annotation based on homology and/or ab initio search methods. We supplied two independent datasets to increase the accuracy of the EDTA annotation. First, we downloaded a curated library from the gold-standard database of repetitive sequences msRepDB71. This library contained DNA transposable sequences for six closely related bird species (Alectoris barbara, Alectoris philbyi, Alectoris melanocephala, Coturnix japonica, Meleagris gallopavo, and Gallus gallus; Supplementary Table S10). Second, we downloaded the CDS sequences of G. gallus from ENSEMBL release 10972 to remove gene-related sequences. In parallel, we used RepeatModeler V2.0.373 with default parameters for additional ab initio annotation of repetitive elements. Finally, we combined the results from EDTA and RepeatModeler to build a non-redundant library of repetitive elements using our in-house scripts. This custom TE library was used as input to RepeatMasker v4.1.421,74 for soft masking of the A. rufa genome. We ran RepeatMasker using the following parameters: “-e ncbi -gff -s -a -inv -no_is -norna -xsmall -nolow -div 40”, against the Dfam75 and RepBase update 18. We then used the soft-masked genome for further annotation.

Divergence distribution of transposable elements

We analyzed RepeatMasker’s alignment output file using the parseRM.pl script v5.8.2 available at74. We determined the percentage of divergence from the consensus for each TE fragment using the Kimura 2-Parameter divergence metric, corrected for the elevated mutation rate at CpG sites. This divergence percentage serves as a proxy for the age of the TE fragments, as older TE insertions have accumulated more mutations. We further categorized TE fragments by age, organizing them into bins of 1 million years, based on the substitution rate calculated by parseRM.pl. We then plotted the TE divergence landscape using a custom R script.
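For reference, the Kimura 2-Parameter distance is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transition and transversion differences between a TE fragment and its consensus. The Python sketch below computes this distance and converts it into a 1-million-year age bin. It does not reproduce parseRM.pl: the CpG correction is omitted, and the substitution rate passed to `age_bin_myr` is a hypothetical placeholder, not the rate used in this study.

```python
import math


def kimura_2p(transitions: float, transversions: float) -> float:
    """Kimura 2-Parameter distance from transition (P) and transversion (Q) proportions."""
    p, q = transitions, transversions
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(a * math.sqrt(b))


def age_bin_myr(distance: float, rate_per_site_per_myr: float) -> int:
    """Age bin (in whole million years) of a fragment with the given divergence.

    Divergence is measured against the consensus, so age = distance / rate.
    The rate is an assumed input.
    """
    return int(distance / rate_per_site_per_myr)


d = kimura_2p(0.10, 0.05)
print(round(d, 4), age_bin_myr(d, rate_per_site_per_myr=0.002))
```

Note that the corrected distance is always at least as large as the raw proportion of differing sites (0.15 in the toy example), because it accounts for multiple substitutions at the same site.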

Gene structure annotation

We combined three strategies to annotate the protein coding genes in the soft-masked genome: homology-based, transcriptome-based, and ab initio predictions:

  1. We ran Miniprot v.0.10-r22576 for homology-based gene prediction. A dataset comprising 3,044,546 protein sequences was generated. These sequences were obtained from the NCBI reference sequence of proteins (accessed on April 15, 2023). Specifically, we focused on the Aves NCBI:txid8782 lineage to ensure retrieval of only avian proteins. Additional details about this dataset can be found in Supplementary Table S11.

  2. We ran PASApipeline v.2.5.377 to perform gene prediction based on the transcriptional evidence provided by the transcriptome assembly of A. rufa published in 201736.

  3. For ab initio gene prediction, we ran BRAKER2 v.2.1.678, training it with the same dataset we used for Miniprot.

The annotation results of the three approaches were then combined using EVidenceModeler v.2.1.077 to produce a consensus gene set for the assembled A. rufa genome. The pipeline is summarized in Supplementary Fig. S8.

We then BLASTed the annotated proteome of A. rufa against the annotated proteome of G. gallus to identify all pairs of orthologs, requiring e-values ≤ 1e-30, reciprocal best BLAST hits, and alignments covering more than 80% of both the query and the target protein. Finally, we mapped each ortholog to its corresponding genome to compare intron structures between orthologous genes. We repeated this comparison between A. rufa and C. japonica.
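The ortholog filter described above amounts to a reciprocal best hit search with e-value and coverage thresholds. The sketch below illustrates that logic in Python on pre-parsed hits. The tuple layout, function names, and toy identifiers are assumptions, and ranking hits by e-value (rather than, for example, bit score) is an illustrative choice that may differ from the scripts actually used.

```python
from typing import Dict, List, Tuple

# (query id, target id, e-value, query coverage, target coverage); hypothetical layout
Hit = Tuple[str, str, float, float, float]


def best_hits(hits: List[Hit], max_evalue: float = 1e-30,
              min_cov: float = 0.8) -> Dict[str, str]:
    """Best target per query among hits passing the e-value and coverage filters."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, target, evalue, qcov, tcov in hits:
        # Coverage must exceed 80% on both proteins.
        if evalue > max_evalue or qcov <= min_cov or tcov <= min_cov:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (target, evalue)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


rufa_vs_gallus = [("r1", "g1", 1e-50, 0.95, 0.90), ("r1", "g2", 1e-40, 0.90, 0.90),
                  ("r2", "g2", 1e-35, 0.85, 0.82), ("r3", "g3", 1e-10, 0.99, 0.99)]
gallus_vs_rufa = [("g1", "r1", 1e-48, 0.90, 0.95), ("g2", "r1", 1e-41, 0.90, 0.90),
                  ("g3", "r3", 1e-60, 0.99, 0.99)]
orthologs = reciprocal_best_hits(rufa_vs_gallus, gallus_vs_rufa)
print(orthologs)
```

In the toy data, r3 and g3 are rejected because the forward hit fails the e-value threshold, and r2 is rejected because its best target g2 points back to r1.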

Non-coding RNA gene annotation

We also annotated non-coding RNA genes (ncRNAs) in our genome assembly. We used tRNAscan-SE2 v.2.0.1179 to identify transfer RNAs (tRNAs). Infernal v.1.1.480 was run to identify microRNAs (miRNAs), ribosomal RNAs (rRNAs) and small nuclear RNAs (snRNAs), based on the Rfam database (release 14.0)81.

Functional annotation

We assigned functions to the predicted gene models by combining several approaches. A standard e-value cutoff of 1e-6 was applied for sequence comparisons, unless otherwise specified. First, we used eggnog-mapper v.2.1.1082 against the eggNOG database83 to assign Gene Ontology terms. Subsequently, BLASTP v2.12.0+ was run against the SwissProt, TrEMBL84, and NCBI NR85 databases for homology-based functional annotation (all the public protein databases mentioned above were downloaded on April 15, 2023). Priority was given to matches with over 95% identity from SwissProt and TrEMBL, as annotation of proteins in these databases is more reliable because of manual curation. The resulting functional annotations were combined with those of InterProScan v5.64-96.086. InterProScan identified protein domains, families, and superfamilies in annotated protein-coding genes using the specified parameters “-m diamond --sensmode fast --go_evidence non-electronic”. KofamKOALA v1.3.087 assigned KEGG orthologs (KO)88 and pathways with an e-value cutoff of 1e-9. Functional annotations from the UniProt database (minimum BLASTP homologue identity match of 95%) were integrated into the final genome annotation file using the GAG tool v2.0.189.

Quality assessment of genome assembly and annotation

We used QUAST v5.2.0 to calculate correctness and contiguity metrics for the genome assembly. We used BUSCO against the aves_odb10 v2019-11-20 database to assess the completeness of both the assembly and the set of predicted protein-coding genes.

Furthermore, we used Merqury v1.390 to assess the accuracy of the genome. This involved comparing the original raw ONT reads with the final version of the genome assembly, which had been polished using high-quality data from 60 partridge samples. This analysis yielded the consensus quality value (QV) and an estimate of the accuracy of the consensus sequence.
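The QV reported by Merqury is a Phred-scaled estimate of the per-base error rate derived from k-mers: if a fraction of the assembly's k-mers is absent from the read set, the per-base error rate is E = 1 - (1 - K_asm_only / K_total)^(1/k), and QV = -10 log10(E). The following Python sketch applies that formula; the default k = 21 and the toy k-mer counts are assumptions for illustration, not values taken from this study.

```python
import math


def kmer_based_qv(asm_only_kmers: int, total_asm_kmers: int, k: int = 21) -> float:
    """Phred-scaled consensus quality from k-mers found only in the assembly."""
    if not 0 <= asm_only_kmers <= total_asm_kmers or total_asm_kmers == 0:
        raise ValueError("invalid k-mer counts")
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers, error rate estimated as zero
    if asm_only_kmers == total_asm_kmers:
        return 0.0  # every k-mer unsupported, error rate estimated as one
    unsupported_fraction = asm_only_kmers / total_asm_kmers
    # E = 1 - (1 - f)^(1/k), written with log1p/expm1 for numerical stability.
    error_rate = -math.expm1(math.log1p(-unsupported_fraction) / k)
    return -10.0 * math.log10(error_rate)


qv = kmer_based_qv(asm_only_kmers=21_000, total_asm_kmers=1_000_000_000)
print(round(qv, 2))
```

A QV of 60 corresponds to roughly one erroneous base per million, which is the order of magnitude produced by the toy counts above.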

Quality assessment of the genome annotation using RNA sequencing

As part of evaluating the accuracy of the gene model annotations, we downloaded RNA sequencing (RNA-seq) datasets from the spleen and the skin of the red-legged partridge (A. rufa) that are deposited in the NCBI SRA database (Supplementary Table S12). We used the STAR aligner v2.7.10b91 to map these reads to the soft-masked version of the assembly. Subsequently, we performed reference-guided transcript assembly for each sample using Stringtie v2.2.1. The spliced transcripts from all samples were merged with Stringtie into a single master list of transcripts, which was exported in GFF format for downstream analysis. Next, we used the GffCompare v2.12.6 tool92 to compare this list of transcripts with the final annotated gene set. This comparison identified the number of spliced transcripts that were absent from our gene set, contributing to our assessment of gene annotation quality.

Comparison to the reference genomes of C. japonica and G. gallus

We used MUMMER v.493 to perform whole-genome alignment between our assembly and the fully sequenced genomes of C. japonica and G. gallus. The genome pairwise alignment results and synteny blocks of 10 kb were visualized with DOT-PLOT viewer94 and Circos v.0.69-895.

Gene family analysis

We used the OrthoVenn3 pipeline96 to compare gene families between A. rufa, C. japonica, G. gallus, and M. gallopavo. In brief, OrthoFinder97 was used to compute the orthologs between the species of interest and to cluster gene families based on GO functional annotation categories. We also used the pipeline to automatically conduct GO term enrichment analysis, taking into account the evolutionary relationship between the four species.

Phylogenomic analysis and divergence time tree building

We performed a phylogenetic analysis to infer the divergence time of A. rufa with respect to other birds with fully sequenced genomes within the Galliformes order, following an approach similar to previous reports14,98,99. We included all Galliformes reference genomes available in the NCBI RefSeq database at the time of submission. In addition to A. rufa, we included the protein sequences of 8 Galliformes species, of which 7 belong to the Phasianidae family and one to the Numididae family (Numida meleagris). We also included one species from the Anseriformes order (Anas platyrhynchos) and one from the Falconiformes order (Falco cherrug). As an outgroup we used Anolis carolinensis100 from the Reptilia class.

Genome assemblies for the birds and the outgroup (Anolis carolinensis) were downloaded from the NCBI. Detailed information about those species can be found in Supplementary Table S3. We started by using the aves_odb10 database of the BUSCO tool28 to identify shared single-copy genes in the twelve analyzed genomes. The aves_odb10 database contains 8338 genes. We used the custom Python script available at https://github.com/jamiemcg/BUSCO_phylogenomics.git to extract the 8212 single-copy orthologs shared by all species. We independently created a multiple alignment for each of these orthologs using MUSCLE101. We concatenated the resulting multiple alignments to create a supermatrix alignment. To ensure alignment quality, we applied trimAl102 to remove unreliably aligned sites and gaps.
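The supermatrix step can be illustrated with a short Python sketch: per-gene alignments are concatenated species by species, and gap-containing columns are then removed. The column filter is a crude stand-in for trimAl, whose actual trimming heuristics are more elaborate, and all function names and toy sequences are hypothetical.

```python
from typing import Dict, List


def concatenate(alignments: List[Dict[str, str]], taxa: List[str]) -> Dict[str, str]:
    """Concatenate per-gene alignments into one supermatrix, one row per taxon."""
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError("sequences in an alignment must have equal length")
        width = widths.pop()
        for taxon in taxa:
            # A taxon missing from a gene is padded with gaps.
            parts[taxon].append(alignment.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def drop_gap_columns(matrix: Dict[str, str]) -> Dict[str, str]:
    """Remove every column in which at least one taxon has a gap."""
    taxa = list(matrix)
    length = len(matrix[taxa[0]]) if taxa else 0
    keep = [i for i in range(length) if all(matrix[t][i] != "-" for t in taxa)]
    return {t: "".join(matrix[t][i] for i in keep) for t in taxa}


gene1 = {"A_rufa": "MK-L", "G_gallus": "MKAL"}
gene2 = {"A_rufa": "GG", "G_gallus": "GA"}
supermatrix = concatenate([gene1, gene2], taxa=["A_rufa", "G_gallus"])
trimmed = drop_gap_columns(supermatrix)
print(supermatrix, trimmed)
```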

Subsequently, a phylogenetic tree was constructed using IQTREE v103, incorporating 1000 bootstrap replicates. The best model for tree construction was determined using the ModelFinder package99 from the IQTREE suite. Then we used ASTRAL v5.7.3104 to account for the possibility of incomplete lineage sorting, which can affect gene-based trees. Finally, to estimate the divergence time of A. rufa in relation to the other birds, we used the MCMCtree tool from the PAML package105. MCMCtree used the phylogenetic tree generated by IQTREE and the alignment file to estimate divergence times. Three calibration times from TimeTree5106 were employed for divergence estimation: G. gallus–C. japonica (≈ 32.9–46.1 Mya), Numida–mallard (≈ 72.5–85.4 Mya), and the divergence between birds and reptiles (≈ 250–300 Mya)107. We ran MCMCtree on protein-coding sequences, sampling 20,000 times with a sampling frequency of 10, following a burn-in of 2000 iterations. We used default parameters for the other settings.
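As a quick check of the chain length implied by these settings, the short Python sketch below computes the total number of MCMC iterations, assuming the usual convention that the chain runs for the burn-in plus the number of samples multiplied by the sampling frequency. The function name is illustrative.

```python
def mcmc_total_iterations(burn_in: int, n_samples: int, sample_freq: int) -> int:
    """Total iterations: burn-in plus one retained sample every sample_freq iterations."""
    return burn_in + n_samples * sample_freq


total = mcmc_total_iterations(burn_in=2000, n_samples=20_000, sample_freq=10)
print(total)
```

Under this convention the settings above correspond to 202,000 iterations, of which the first 2000 are discarded.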

Supplementary Table S13 summarizes all bioinformatics pipelines, tool versions, and settings used for genome assembly, annotation, and the related analyses in this work.

Ethics approval and consent to participate

The study was conducted in full compliance with Spanish laws and regulations, including the licence of “Las Ensanchas” for sampling shot partridges. The protocol was approved by the Committee on the Ethics of Animal Experiments of the University of Lleida (Ref. 1998–2012/05). The ten essential ARRIVE guidelines were followed in designing and reporting this study.