Background & Summary

Chickens are among the most important farm animals supplying eggs, meat, and other products to humans. After long-term domestication and human-driven selection, hundreds of distinctive chicken breeds are now cultivated worldwide, and chicken meat has been the largest meat resource since 20191. In particular, more than one hundred local breeds have been identified in China, constituting almost half of the broiler market. Compared to commercial broilers (e.g., Arbor Acre, Ross, and Cobb), local chickens exhibit markedly improved meat quality and flavour, adaptation to the environment, and disease resistance2,3. However, local breeds have not undergone intense artificial selection for productivity traits, which may explain their lower production performance compared to commercial breeds4,5. However, compared to pigs or cattle, chickens are more efficient and environmentally friendly livestock. Their feed conversion ratio is ten times that of cattle, and the carbon emissions of broilers are only 1/10 of those of cattle6. Therefore, it is necessary to investigate genomic markers and genetic mechanisms related to economic traits via whole-genome sequencing to bridge the gap in production performance and accelerate breeding progress7.

Simultaneously, domestic chickens are desirable models for investigating genetic adaptation and diversity and disease-related markers due to the advantages of their short reproductive cycle, small body size, and identical ancestors. After domestication from red jungle fowl (Gallus gallus spadiceus)8, chickens were cultivated with the aim of meeting various human demands, and meat-type (e.g., Xiaoshan chicken), egg-type (e.g., Xianju chicken), ornamental-type (e.g., Silkie chicken), and game fowl breeds (e.g., Luxi gamecocks, Henan gamecocks) were developed9,10,11,12, as well as the commercial broilers (e.g., broiler line B, Cobb)13. Therefore, it is relatively easy to unambiguously infer the driving factors of phenotypic or behavioural changes in these chickens. Additionally, domestic chickens can be used to generate special populations, such as F2 populations14,15 and advanced intercross lines16, which contributes to exploring the genetic mechanisms of chicken complex traits and provides new insights into genomic breeding.

Genomic analysis can reveal the demographic history of different chicken breeds and reconstruct gene flow among them, which contributes to a better understanding of the domestic history and potential mechanisms of some economic traits, such as breast muscle yield17, yellow skin4,18. The analysis of genome-wide variants in chickens distributed in different regions can be used to investigate genetic adaptation and diversity, especially adaptation to environmental conditions, such as altitude, temperature, and anoxic environments19,20,21,22. Moreover, selective sweep analysis based on single nucleotide polymorphisms (SNPs) is an effective method for identifying genetic markers and mechanisms underlying chicken production performance, reproduction, immunity, etc. This approach provides important insights into modern breeding systems20,21,22,23,24,25. In addition to SNPs and insertion‒deletions (InDels), genome-wide sequencing has also been used to identify copy number variants (CNVs)26 and structural variants (SVs)27, although the detection rate and efficiency of SV calling are relatively low compared to those of PacBio sequencing28. Therefore, genomic analysis using the whole-genome variants of local chickens is an effective approach for elucidating genetic diversity and selective signatures during long-term domestication.

This study provides a whole-genome sequencing dataset from hundreds of local chickens (n = 209) of 13 local chicken breeds in China, including meat-type, egg-type, and ornamental-type chickens. A total of more than 19 million SNPs and 1.98 million InDels were identified on chromosomes 1–39 and one sex chromosome (Z) by aligning the sequencing reads against the most recent reference genome of chickens (assembly bGalGal1.mat.broiler. GRCg7b, GCF_016699485.2). The summarization of high-quality sequences (Q30 > 90%, mapping rate > 99%) confirmed the high accuracy and resolution of this genetic analysis. This dataset is expected to be useful in many studies, such as those aimed at identifying putative regions of positive selection, inferring demographic history, analysing gene flow, detecting candidate genes related to economic traits, determining genetic adaptation to local environmental factors, discovering breed-specific variants or markers, analysing genetic diversity, or developing SNP genotyping arrays for use in chicken breeding systems or species identification. Moreover, the whole-genome variants of most chicken breeds (except for Beijing You and Silky-feather chickens) included in the present study have not been reported. Therefore, this dataset provides an ideal resource for population genetics and evolutionary analyses. Furthermore, this database represents an important supplement to the chicken variant database and plays a crucial role in reconstructing the demographic and domestication history of chickens.

Methods

Sampling

Blood sampling from 13 chicken breeds was performed in Zhejiang Province, China. The following breeds were included (Supplement Table S1): BE, Baier chicken; BJY, Beijing You chicken; DXB, Dongxiang Black chicken; JSW, Jiangshan white-feathered chicken; LH, Luhua chicken; LY, Longyou chicken; SF, Silky-feather chicken; SYJ, Songyang Jin chicken; WL, Wenling chicken; XJ, Xianju chicken; XS, Xiaoshan chicken; XX, Xiaoxiang chicken; and YD, Yandang chicken. We collected a 1 ml blood sample from the wing vein of each individual and stored it in a 2 ml anticoagulation tube at −20 °C. All procedures associated with the chickens used in this study were consistent with the standards of the Laboratory Animal Guidelines for the Ethical Review of Animal Welfare and were approved by the Animal Use Committee of Zhejiang Academy of Agricultural Sciences (No. 2022ZAASLA68).

Genomic DNA extraction and quality control

The workflow from sampling to variant filtration is shown in Fig. 1. Genomic DNA was extracted from blood samples using the phenol‒chloroform method. DNA quality control was performed as follows: 1) DNA degradation and contamination were monitored on 1% agarose gels; 2) the OD 260/280 ratio was determined with a NanoDrop instrument to check the purity of the DNA; and 3) the DNA concentration was measured with a Qubit® DNA Assay Kit on a Qubit® 3.0 Fluorometer (Invitrogen, USA). Finally, more than 0.2 μg of DNA fragments with no degradation or contamination and an OD value of 1.8~2.0 were used for library construction.

Fig. 1
figure 1

The workflow of the library preparation, sequencing, genome mapping, and variant calling and filtration. The pipeline was consistent with the GATK recommended protocol for variant identification.

Library preparation and sequencing

A sequencing library was created with the NEB Next UltraTM DNA Library Prep Kit for Illumina (NEB, USA). A mean size of 350 bp was achieved by shearing the genomic DNA. Before sequencing, the DNA fragments were subsequently submitted to end polishing, A-tailing, and adapter addition. PCR amplification and purification were then carried out using an AMPure XP system (Beckman Coulter, Beverly, USA). Using an Agilent 2100 Bioanalyzer and Qubit 3.0 Fluorometer (Invitrogen, USA), the quality of the library was evaluated based on the insert size and DNA concentration. Real-time PCR (>2 nM) was used to quantify the results. Ultimately, a flow cell containing the qualifying DNA nanospheres was filled, and a DNBSEQ-T7 platform was used for sequencing.

Data quality control, mapping, and variant calling

Clean reads were produced by trimming the raw reads using FASTP v0.21 software29, and reads meeting any of the following conditions were discarded: 1) one read of a pair contained adapter contamination (>10 nucleotides aligned to the adapter, allowing ≤10% mismatches); 2) more than 10% of bases were uncertain in either read; and 3) the proportion of low-quality (Phred quality <5) bases was over 50% in either read. The trimmed clean reads were evaluated by FastQC v0.12.1 software. Then, the clean reads were aligned against the most recent chicken reference genome (GRCg7b, GCF_016699485.2) using the Burrows‒Wheeler Aligner (BWA) v0.7.17 with the MEM algorithm30. The results were sorted, indexed, and converted to BAM files using SAMtools v1.12 software31. After removing polymerase chain reaction (PCR) duplicates, SNP calling was performed using the joint calling strategy, realized with the HaplotypeCaller, CombineGVCFs, and GenotypeGVCFs functions in the Genome Analysis Toolkit (GATK) v4.2.232. Next, we removed SNPs using GATK v4.2.2 according to specific standards: QualByDepth <2.0, FisherStrand >40.0, StrandOddsRatio >3.0, MappingQuality <40.0, and ReadPosRankSum <−8.0. A total of 19,820,641 biallelic SNPs and 1,980,710 biallelic InDels identified on autosomes after further quality control using VCFtools v0.1.1333 (--max-alleles 2 --min-alleles 2 --min-meanDP 3 --maf 0.01) were retained for subsequent analysis. The missing alleles were imputed by Beagle v5.1 software34 with the default parameters.

Variant annotation

The identified variants were annotated using ANNOVAR software35, the SNPs/InDels were classified according to the genomic location, such as exonic, intergenic, downstream, upstream, and others.

Variant density, genetic diversity, and polymorphism information content (PIC) estimation

The variant density was calculated as the ratio of the total length of the genome (with N bases removed) to the variant number. The SNP/InDel density was calculated and visualized with the CMplot package36. The genetic diversity (π) was calculated using VCFtools v0.1.1333 with window size of 50 kb. The PIC was estimated using the following formula: PIC = 1 − pi2qi2, where pi indicates the frequency of the minor allele of SNPi and qi indicates the frequency of the major allele of SNPi.

Data Records

Raw FASTQ files for whole-genome sequencing were deposited in the NCBI Sequence Read Archive (SRA) and have been assigned BioProject accession number PRJNA942350 (https://identifiers.org/ncbi/insdc.sra:SRP426730)37. The raw VCF file containing SNPs and InDels was deposited in the EVA database with accession number PRJEB71347 (https://identifiers.org/ena.embl:PRJEB71347)38. The quality control results of raw reads and annotation files of variants have been deposited in the the Figshare database with the following digital object identifier: https://doi.org/10.6084/m9.figshare.24751956.v239. The relationship between the chicken ID in the VCF files and the SRA database was shown in Supplement Table S2.

Technical Validation

Quality control of sequencing results

For each individual, an average of 17.3~28.5 Gb of raw data from were obtained, with 91.4% of the data achieving a Phred quality score of 30 (indicating sequencing accuracy of 99.9%) on average (Table 1). A stable GC content (42.6%) was demonstrated for the sequence (Table 1, Supplement Table S3). A genome mapping rate greater than 99% and an average genome coverage of 97.1% (with N bases removed) were obtained (Table 2, Supplement Table S4).

Table 1 Summary statistics for quality control of sequences1.
Table 2 Summary statistics for genome alignment analysis1.

Filtration of SNP and InDel datasets

The joint calling strategy was used in this procedure, and a total of 27 million raw SNPs and 2.75 million raw InDels were identified in the population of 13 chicken breeds. To exclude low-quality variants, we used the VariantFiltration function in GATK software32. The specific standards used are described in the Methods section above. After the first round of quality control, the SNPs and InDels were further filtered using VCFtools v0.1.13 software33, and only biallelic variants with a minor allele frequency of 0.01 and a minimum sequencing depth of 3 were retained. We calculated statistics for SNP types, and T:A > C:G-type mutations were mainly identified in this population (Fig. 2). Figure 3 shows the relationships among SNP quality, supported read number, SNP percentage, and neighbouring SNP distance. We detected a positive correlation between SNP quality and percentage, and most SNPs were supported by at least 20 reads (Fig. 3a,b). This indicates the high quality of the identified SNPs.

Fig. 2
figure 2

Statistics for the SNP number of different mutation types.

Fig. 3
figure 3

Statistics for the SNP percent in support reads number (a), quality (b), neighbouring SNP distance (c). The different colour indicated the various individual.

Summary statistics of SNPs and InDels across the whole genome

The high-quality SNPs and InDels were distributed across the genome with an average density of 1 SNP every 52 bases and 1 InDel every 521 bases (Table 3). Figures 3c, 4a show the high density of SNPs across the whole genome. The density of InDels was relatively low (Fig. 4b). Chromosomes 16, 29–32, and 34–38 were defined as dot chromosomes, while chromosomes 22, 25 and 33 were defined as microchromosomes40. We found that the SNP density on these chromosomes exhibited a polarized distribution (Table 3). Based on variants annotation, SNPs/InDels were mainly distributed in intronic and intergenic regions (Fig. 5). And more SNPs rather than InDels were located in exons (Fig. 5a). The PIC and genetic diversity π of each breed are shown in Fig. 6, the genomes of the SF and YD chickens exhibited higher polymorphism, and LH chickens were found significant lower genetic diversity.

Table 3 Summary statistics of SNPs and InDels in each chromosome 1.
Fig. 4
figure 4

Distribution of SNP and InDel across the whole-genome of 13 local chicken breeds. (a) SNP density statistics across the whole-genome. (b) InDel density statistics across the whole-genome.

Fig. 5
figure 5

Annotation result for SNPs (a) and InDels (b).

Fig. 6
figure 6

Estimation of genomic PIC (a) and π (b) based on SNPs of 13 chicken breeds.

Usage Notes

Whole-genome sequencing allows us to obtain the SNPs and InDels across the whole genome using the bioinformatic pipeline of this study. However, the CNVs and SVs were not included in the current study. As same as the previous report in human and chickens8,41,42,43, we did not consider the ploidy of variants in sex chromosome ZW. And the variants in chromosome W were discarded due to the uncertain gaps in chromosome. Although SNPs are more widely used for investigating investigate the genetic diversity and dissecting the genetic mechanism of economic traits, the heritability of traits of interest cannot be fully explained by SNPs, namely, missing heritability. These traits are also influenced by epigenetic factors. The effect of missing heritability could be relieved by increasing the sample size and sequencing depth and by considering the epistatic effects and SVs and CNVs, which also contributed to the identification of additional novel loci44. Additionally, we aligned the sequencing data to the reference genome (GRCg7b, GCF_016699485.2), which was assembled based on one broiler and did not include all the variants in the reference genome. Therefore, the data produced from this study may produce an incomplete explanation of the genetic background due to the missing alignment. Overall, our data provide insights into the evaluation of chicken population structure, and are an efficient dataset for identifying the CNVs and SVs. These data can be used in the construction of chicken pangenomes or graph pangenomes together with the Pacbio and Oxford Nanopore Technology data.