Introduction

Domestic buffalo (Bubalus bubalis) is an important multi-purpose agricultural animal in tropical and subtropical regions. China has the third-largest buffalo population in the world, following India and Pakistan (Zhang et al. 2007), and holds a long history in domestication and utilization of buffalo for thousands of years (Yue et al. 2013). In the long period of utilization, a variety of buffalo breeds were produced and retained through artificial selection for desirable performance traits, such as milk or meat production, labour force and so on (Zhang et al. 2020; Liang et al. 2021). Sichuan Basin in southwestern China is positioned to the northeast of the eastern Himalayas and is a typical agricultural production region with mountains encircled. Due to the special geographical location and climatic environment, Sichuan Basin spawned two indigenous buffalo varieties, Yibin buffalo and Dechang buffalo. Yibin buffalo are better adapted to high heat environments, and Dechang buffalo grows faster. However, limited attention has been given to them, which is apparently insufficient for mining functional genes associated with important traits and the genetic breeding of buffalo.

Single nucleotide polymorphisms (SNPs) are the most abundant molecular markers and have contributed significantly to address genetic diversity and positive selection sites of many species (Anello et al. 2016; Wang et al. 2013). Restriction site-associated DNA sequencing (RADseq) is a powerful approach for obtaining genetic variation on a genome-wide scale by a reduced representation sequencing of regions adjacent to restriction cut sites. Since the ability for providing numerous information on genetic variation and relatively low cost (Baird et al. 2008), RADseq has been widely used for markers discovery and genotyping in many species (Orita et al. 2021; Liu et al. 2019; Yang et al. 2017; Wang et al. 2018; Junge et al. 2019).

In present study, we employed RADseq approach to discover genome-wide SNPs and explored positively selected location of two representative indigenous buffalo breeds in the Sichuan Basin. These results of the present study undoubtedly facilitate a better understanding of the genetic composition of Sichuan indigenous buffaloes and provide a comprehensive set of candidate genetic regions for improving management strategies for buffalo breeding selection.

Materials and methods

Samples collection and DNA isolation

Blood samples from 20 buffaloes were collected from two representative indigenous buffalo breeds, Yibin buffalo (GYBS, n = 10, 27° 50′–29° 16′ N, 103° 36′–105° 20′ E) and Dechang buffalo (XCS, n = 10, 27° 05′–27° 36′ N, 101° 54′–102° 29′ E), in Sichuan Province. Genomics DNA for genotyping was extracted using the Qiagen DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany). The DNA qualification was performed by using three methods as follows: (1) DNA purity and integrity were evaluated with agarose gel electrophoresis, (2) DNA purity was tested using Nanodrop (Wilmington, DE, USA) (OD260/OD280) and (3) DNA concentration was measured by Qubit® DNA Assay Kit in Qubit® 2.0 Fluorometer (Life Technologies, CA, USA).

GBS library construction and high-throughput sequencing

After a preliminary in silico analysis of the buffalo genome sequences, the genomic DNA from individual samples was first normalized to a concentration of 50 ng/μl and digested with restriction enzyme EcoRI. RAD libraries were prepared as the protocol described in Baird et al. (Baird et al. 2008). Briefly, after ligating with a uniquely barcoded sequencing adaptor (Solexa P1 Adapter), samples were pooled and randomly sheared. Fragments of 300 bp to 700 bp were collected by gel purification and ligated to a second adapter (P2) with divergent ends. Products were further amplified by polymerase chain reaction (PCR), and amplicons were sequenced on the Illumina HiSeq Platform with 150 bp paired-end reads (Novogene Co. Ltd., Beijing).

Quality filtering, SNP calling, and annotation

According to the official pipeline (https://www.illumina.com/), initials image data obtained by high throughput sequencer were converted into sequence files in FASTQ format. To ensure the reliability of sequence data, raw data were trimmed using the following criteria: removing reads containing adapter sequences, removing reads containing unidentified nucleotides (N) representing ≥ 10% of total length; removing reads with > 50% of bases with quality scores < 5. If any sequence of the paired reads was identified as low quality, both pairs were discarded. At the same time, Q20, Q30, GC content, and duplication level of the sequencing data were also calculated. After these steps, clean reads with high quality were obtained and used for the following analyses.

High-quality reads remaining were mapped to the reference genome (Low et al. 2019), https://www.ncbi.nlm.nih.gov/assembly/GCF_003121395.1) using BWA software (Li and Durbin 2009) with the command 'mem -t 4 -k 32 -M'. SNP calling was performed on a population scale by using Bayesian methods implemented in the SAM tools package (Li et al. 2009). Filtering was performed to improve the quality of SNP obtained as follows: SNP quality value was more than twenty, the number of reads supported by SNP was in the range of 1/3 to 5 × of average sequencing depth, and the distance of two adjacent SNPs ≥ 5. Annotating and prioritizing of all genetic variants in genome sequencing data were carried out by ANNOVAR software (Wang et al. 2010).

PCA analysis

Based on the high-quality SNPs dataset, genetic diversity and relationship among populations were calculated. The phylogenetic tree was constructed by using the neighbor-joining method in software TreeBest v1.9.2 with 1000 bootstrap replicates. The principal component analysis (PCA) was performed based on SNPs among individual genomes using GCTA (Yang et al. 2011).

Screening for selection signatures and candidate genes analysis

Genetic diversity and population differentiation patterns of two populations in genome-wide were characterized by using nucleotide diversity (π) and population differentiation statistic (Fst) calculated by a sliding window approach (a 40 kb window sliding with a step size of 20 kb). Selection signatures of population-specific genomic regions between XCS and GYBS were measured as the genomic regions with a significantly high Fst (top 5%) and θπ ratio (log2 (π1) − log2 (π2), top 5%) values. Genes located in these regions were annotated for functional ontology terms and biological pathways by using Gene Ontology (GO) (Gene et al. 2000) and Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa and Goto 2000).

Results

Sequencing and variation calling

To investigate the genetic structure and discover the potential regions under artificial selection in the formation process of indigenous breeds, 20 buffalo individuals from two representative indigenous breeds, GYBS and XCS, in Sichuan Basin were selected for RAD-tags sequencing. After raw data filtering and quality control, a total of 1.06 billion paired-end clean reads were generated, ranging from 44.03 million to 64.49 million, with a mean of 53.10 million reads per individual (Additional file 1). Calling of SNP genotypes was carried out by aligning each individual sequence reads to the buffalo reference genome published previously (Low et al. 2019) with the BWA software (Li and Durbin 2009). After removal of redundancy by using SAMtools (Li et al. 2009), over 97.9% of the generated sequence reads mapped to reference genome ranging from 97.01 to 98.94% for each sample. The average sequence depth was ~ 11 × within a range of 10.25 × to 12.56 × per individual and coverage was approximately 30% (27.67–31.29%) of the total length of reference genome with 1 × coverage depth (Additional file 1).

Single nucleotide polymorphisms (SNPs) detection for each sample was performed by using SAMtools (Li et al. 2009) based on this high-quality sequencing data. After strict quality control and filtration, 2,110,077 SNPs were identified from 20 individuals. Based on the well-assembled reference genome, clean SNPs were classified to chromosomes (n = 2,092,682) and un-anchored contigs (17,395). The detailed SNP distribution information on the chromosome of the reference genome was showed in Fig. 1a. The number of SNPs was the most on chromosome 1 (NC_037545.1) and the least on chromosome 24 (NC_037568.1), accounting for 8.21% and 1.67% of the SNPs, respectively. SNP density varied among chromosomes, and chromosome 13 (NC_037557.1) had the highest density with 91.9 SNPs per 100 Kb, and chromosome X (NC_037569.1) had the lowest density with 31.9 SNPs per 100 Kb. The transitions and transversions SNPs were 1,206,600 and 903,477, respectively, resulting in an overall transition/transversion ratio of 1.34 (Fig. 1b). The SNP was further annotated by ANNOVAR software (Wang et al. 2010) (Table 1) to get more detailed information on these variants. Among these variants detected, most SNPs were identified as located in intergenic regions (1,270,286) and intronic regions (786,162). Further, 10,812 SNPs were located upstream and 11,889 downstream within 1 Kb regions of genes. Since their positions are in the transcription start or end, these SNPs probably play a role in transcriptional regulation (Guo et al. 2016). Coding regions exhibited relatively lower diversity levels with 11,737 SNPS identified in exonic regions, including 5002 non-synonymous mutations, 6650 synonymous and 85 exonic-stop gain or loss, resulting in a non-synonymous to synonymous substitution ratio of 0.75.

Fig. 1
figure 1

SNP calling and statistical metrics. A The SNP density distributions among 25 chromosomes. B Statistics of SNPs substitution types

Table 1 Summary and annotation of SNPs in buffaloes

Population genetic structure analysis

In order to explore the phylogenetic relationships and genetic differentiation within and between populations, the high-quality SNPs information was used to build a phylogenetic tree by the neighbor-joining methods (bootstrap values = 1000) based on the distance matrix calculated by TreeBest version 1.9.2 software (Fig. 2a). The results showed that two distinct taxa were clustered with the obvious separation of accessions from GYBS and XCS. This result indicated that two breeds have probably undergone different artificial selection during domestication.

Fig. 2
figure 2

Genetic relationships and population structure of two indigenous buffalo breeds based on SNPs. A Phylogenetic tree of the 20 buffalo individuals. The scale bar represents the proportional similarity. B Principal component analysis plots based on the first two principal components (PC). C Genetic structure and individual ancestry derived from the FRAPPE program. The length of each colored segment represents the proportion of the individual's genome from K = 2–5 ancestral populations

PCA is a statistical method that could convert a large number of measurements into a few independent factors and explain the differences of different groups by the divergence of these factors (Reich et al. 2008). To access the genetic separation of two buffalo breeds and discovery of population structure, PCA analysis was also calculated by using the genome wide SNPs data in this study by using GCTA software (Yang et al. 2011) (Fig. 2b). PCA analysis revealed a clear separation of individuals collected from GYBS and XCS by the PC1 and was consistent with the result from phylogenetic tree analysis. In the PCA plot, individuals from XCS exhibit a wider range of distribution than those from GYBS in PC2, suggesting more genetic diversity among these samples.

Screening for selective regions

To investigate the potential artificial selection between two indigenous buffalo breeds during the domestication, we focused on variation in genes or genomic regions that were possibly under selection in buffalo breeds. A population genetic methodology of fixation (Fst) and genetic diversity (θπ) was selected and calculated for each population by using a 50 kb sliding window with a 10 kb step size. As described previously (Wang et al. 2021) that the genomic regions undergone positive selection invariably exhibit specific patterns of variation as high FST and lower levels of π, the selected regions for the GYBS and XCS populations were detected with a significantly high Fst (top 5%) and θπ ratio (log2 (π1) − log2 (π2), top 5%) values. Results suggested a low level of differentiation in most of the genomic areas. Further analysis indicated 995 and 910 protein-coding genes were located in the selected regions in GYBS and XCS, respectively (Fig. 3a and b). To thoroughly understand domestication process and selective site, GO and KEGG enrichment analysis were conducted after functional annotation (Fig. 3c–f, Additional file 2). GO enrichment analysis showed 229 and 160 GO terms were significantly enriched for GYBS and XCS, respectively (Additional file 2). As to the GYBS (Additional file 2a), genes involved in protein processing were significantly represented in the biological process, as the 'protein processing' and 'protein maturation' were the most significantly enriched terms in the biological process, which probably suggesting the protein processing were significantly selected in this breed. It is noteworthy that many biotic and abiotic stimulus-related GO terms were also exhibited strong positive selection signal, including 'response to external stimulus', 'detection of temperature stimulus' and so on. This result probably indicates that GYBS were confronted with a complex living environment and underwent relatively strong environmental pressure, including biotic and abiotic originated during the domestication. Moreover, the neuron development and mitochondria-related functions were also highly presented with several GO terms significantly enriched. In the cellular component, membrane-related functional terms were mostly enriched, including 'membrane', 'membrane part' and 'integral component of membrane'. For the molecular function classification, 'antioxidant activity' and 'adenosine deaminase activity' were the most enriched GO terms, which probably also indicated the capacity for immunity was selected since both the 'antioxidant activity' and 'adenosine deaminase activity' were closely related to organism immune (Li et al. 2015; Hitoglou et al. 2001). The top thirty significant GO terms were listed in Fig. 3c. KEGG is an important resource for understanding higher-order functional meanings (Kanehisa et al. 2004). To gain the major pathway under selection, KEGG enrichment analysis were also executed. The result showed 'linoleic acid metabolism' pathway, which was known for its function in fat content regulation, meat quality improvement (Li et al. 2012), was significantly selected (Fig. 3d).

Fig. 3
figure 3

Selective sweep analysis of the GYBS and XCS populations. (A and B) Distribution of pi ratios (GYBS/XCS, XCS/GYBS) and Fst values. Red points were identified as selected regions. C GO enrichment analyses of the positively selected genes in GYBS. D KEGG pathway enrichment analysis of the positively selected genes in GYBS. E GO enrichment analyses of the positively selected genes in XCS. F KEGG pathway enrichment analysis of the positively selected genes in XCS. In the GO enrichment analysis, the Top 30 of the most enriched GO terms were listed. * means P < 0.05

The same analyses were conducted for XCS simultaneously. Enriched GO terms of genes positively selected in XCS (Fig. 3e and Additional file 2b) showed relatively obvious differences with those in GYBS. In biological process, 'assembly of actomyosin apparatus involved in 'cytokinesis' and 'actomyosin contractile ring assembly' were the two most significant categories. Several actomyosin and cellular processes GO terms, including 'cytokinetic process', 'regulation of cytokinetic process', and et al., were also enriched in addition and accounted for the vast majority proportion of total enriched functional terms in biological process. This result is obviously different from that in GYBS and probably suggests the growth and development progress received significantly positive selection. It’s worth noting that a ‘NFκB’ related GO term, which has been confirmed to play a vital role in cell proliferation and is related to many human diseases (Chandrakesan et al. 2013), exhibited a strong positive selection signal. Additionally, nervous and perception system, membrane process and energy related biological process were also significantly presented. For cellular component, ‘Roundabout signaling pathway’, followed by ‘ciliary basal body’ and ‘membrane part’ were the top 3 significant functional terms. Besides membrane-related function terms, ciliary-related components were also enriched, which perhaps suggests cell morphogenesis was selected during breed formation. As to molecular function, ‘ErbB-2 class receptor binding’ was the most significant term, followed by ‘phospholipid scramblase activity’ and ‘guanyl-nucleotide exchange factor activity’. Since ERBB receptors are thought to mediate signals for cell proliferation, differentiation, migration, or apoptosis (Tvorogov and Carpenter 2004), this result further confirmed the growth and development progress was undergone strong positive selection. KEGG analysis showed ‘Phenylalanine, tyrosine and tryptophan biosynthesis’ pathway was significantly enriched (Fig. 3f). As reported previously, this pathway appeared closely related to brain development (Fernstrom 2013; Parker and Brotchie 2011), presumably, the nervous system development was probably an underlying important domestication trait.

Discussion

Sichuan Basin is located in southwestern China and is surrounded by Tibet Plateau, Yunnan-Kweichow Plateau, Daba mountain, and Huaying mountain. Since the wide range of elevation ladder and complex terrain, Sichuan harbors substantial diversity in geography and climate. Meanwhile, it is also a historic agricultural area. During the long-term practice of agriculture, Sichuan Basin spawns diverse local breeds of animals to match complex geographical environments and applications. Water buffalo is a representative one of the numerous domestic draught animals in this area and always employed as an important labor force in the rice planting process, as well as a provider of milk, meat, horns, and skin (Yue et al. 2013; Nanda and Nakao 2003). It is recognized that close inbreeding frequently occurred during breed formation, which would result in the fixation of some key characteristics (Gorssen et al. 2020). During the long process of local breeding process, buffaloes have undergone intense selection pressures for this development of interesting traits and gradually formed self-specific characteristics distinct among breeds. Thus, these distinctive buffalo breeds undoubtedly fixed diverse precious gene resources for desirable performance traits and favors, a great potential for animal breeding.

In present study, to clarify the genetic population structure and selective signals of buffalo in the Sichuan Basin, we collected 20 individuals representing two different indigenous buffalo breeds with ten samples of each. We conducted a RAD-Seq to analyze population structure and positive selection region during breeding process. A total of 2,110,077 SNP loci were detected in all samples. As to our knowledge, this is the first time to provide such a huge SNP data set in indigenous buffalo breeds in Sichuan and further illustrate the abundant diversity of indigenous buffalo breeds. Additionally, the population structure was determined with the phylogenetic tree and principal component analysis. The results showed that the GYBS and XCS were clustered in two different branches on the phylogenetic tree map. The PCA map further supported the result with GYBS and XCS separate obviously by PC1, suggesting a significant genetic differentiation between two populations. It is noteworthy that the XCS exhibits a much wider distribution on the PCA plot, which probably indicates a higher genetic diversity of XCS. Although the limited number of samples and populations were analyzed in the present study, all the results implied a distinct genetic differentiation among the different populations, and these indigenous breeds are likely to harbor huge genetic varieties resources, which deserve our further exploration and discussion.

To detect positive selection signatures in different buffalo breeds, we searched the genome of the buffalo for regions with a significantly high Fst (top 5%) and θπ ratio (log2 (π1) − log2 (π2), top 5%) values. As a result, 995 and 910 protein-coding genes in GYBS and XCS were identified as located in the selected regions. Functional annotation and enrichment results showed these two breeds underwent distinctly different evolutionary scenarios under different selections. As to GYBS, functional enrichment analysis revealed that many candidate genes positively selected are related to protein processing, meat quality improvement, and stress response (Additional file 2a), which likely reflect the main objective of selection for this breed is to improve the meat quality and increase the adaption capability for complex living environment and disease resistance. While in the XCS (Additional file 2b), these genes were significantly enriched in cell proliferation and cell morphogenesis, suggesting the primary selection pressure of XCS was presumably derived from the speed requirement of growth and development. Additionally, energy metabolism and neuron development appeared to be selected in both breeds.

Conclusions

This study collected two indigenous buffalo breeds in Sichuan Basin and employed RADseq to explore selection signatures across genome. Based on the 2,110,077 high-quality SNPs identified from two populations, the genetic structure was analyzed. Result showed significant genetic differentiation between two breeds. Analysis of selective regions identified 995 and 910 positive selection genes in GYBS and XCS, respectively. Functional annotation and enrichment analysis showed these two breeds seemingly undergone different selective pressure with that the GYBS were mainly selected in functions closely related to meat quality, complex living environment adaption, and disease resistance, while selection pressure in XCS possibly derived from speed requirement of organism growth and development during breed formation. Overall, the majority of genome-wide SNP markers among two breeds of Sichuan indigenous buffaloes, as well as the considerable number of genes for traits of interest reported here, will serve as a valuable resource to accelerate genetic research and precision breeding programs in buffaloes.