Background

Because farm animals are economically valuable, their genomics, and whole genome sequencing in particular, is an important field of research. The results of this research have already had an impact, and will continue to do so, on the production of meat, milk, fibre and other products, the environmental effects of animal husbandry, breeding, animal health, feeding, and even human medical issues such as xenotransplantation and disease modelling [1, 2]. Accordingly, the genomes of a number of agriculturally important animal species have been or are being completed [3–11].

The pig is one of the most important farm animals, providing about 103 million tonnes of pork for consumption worldwide in 2012 [12]. Moreover, pigs can be used as a model for human diseases, such as arthritis, cardiovascular diseases, diabetes and obesity, because pigs are more similar to humans at the physiological and gene level than rodent models are [2]. According to different sources, the estimated number of pig breeds and lines ranges from 350 to 730 [13, 14]. Most of these breeds are local, with only 25 found in multiple regions of a country, and a further 33 spread to more than one country [13]. Despite the large number of pig breeds, only six (Large White, Duroc, Landrace, Hampshire, Berkshire and Pietrain) dominate the pork industry [13].

In the last decade, enormous efforts have been made to exploit the genetic and genomic resources of pigs. Genome sequencing of swine goes back to the early 2000s, when the Sino-Danish Pig Genome Project was initiated and a 0.66× coverage genome survey, based on shotgun sequencing, was subsequently published [15]. Deeper coverage sequencing of the pig genome was initiated by the Swine Genome Sequencing Consortium [16]. The Sscrofa9 genome assembly was released in 2009 [17] and the pig genome sequence was recently published [9]. These genome resources, together with specialised sequencing projects such as parallel sequencing, have greatly widened our knowledge of the pig genome, including SNP identification and genotyping [18–20], GC variance [21], the muscle transcriptome [22, 23], the pig interactome [24], domestication/selection [25], evolution/domestication [9], and a number of other recently published research topics [26].

Despite the large number of local pig breeds, only a few of them (for example Angler Sattelschwein, British Saddleback, Cinta Senese, Manchado de Jabugo, Basque and Guadyerbas) have been included in genome sequencing projects. In addition to the major industrial and the few local breeds, Asian and European wild boars, several Asian pig breeds and several other species of the Sus genus have also been included [9, 27–29]. However, other local breeds, many of which are endangered, should also be of great interest for genomic studies because of their importance for biodiversity, conservation, local communities and even pork production [14, 30]. Mangalica is an example of a local/rare breed with a characteristic curly hair phenotype; it is indigenous to Hungary and was developed in the 19th century [14]. Mangalicas are fatty-type pigs [31] with high intramuscular fat content [32]. Mangalicas have three colour variants, Blond, Red and Swallow-belly, which are considered separate breeds based on microsatellite studies [33]. As the history of the three Mangalica breeds indicates [14], the Blond was bred first from old Hungarian pig races and pigs of Mediterranean origin, and it then contributed to the two newer breeds, Red and Swallow-belly Mangalicas. Reproduction studies are quite numerous in Mangalica [34–38], but genetic studies are rare [39]. We have previously described that the mtDNA D-loop sequences of Mangalicas display low diversity, but the maternal lineages that they represent are genetically distant from cosmopolitan breeds kept in Hungary [14] and very likely originate from one particular ancient European line [40].

In order to explore how the genomes of Mangalicas differ from the reference pig genome, we have sequenced a male individual of each of the three Mangalica breeds along with a male Duroc individual of Hungarian origin. The genome sequence of Mangalicas can serve as a basis for future conservation of the breeds and for an extended Mangalica pork industry.

Results

Genome sequencing

Three Mangalica male pigs with a Mangalica-specific mitochondrial D-loop haplotype were selected [40] for genome sequencing. These animals were kept at Emőd, Hungary, registered at the Hungarian Mangalica gene-bank as pedigree sires. They were previously assessed as Blond, Red and Swallow-belly Mangalicas, respectively, under the Hungarian Mangalica Standard and by microsatellite analysis. A Duroc male of Hungarian origin was also sequenced, because we have found previously that Duroc pigs of international or Hungarian origin belong to different maternal lineages [40] and Mangalica × Duroc F1 hybrids are processed at industrial scale in Hungary for pork products.

Genome sequencing resulted in 6.27 × 10⁸, 4.15 × 10⁸, 4.06 × 10⁸ and 3.32 × 10⁸ reads for the genomes of the Blond, Red and Swallow-belly Mangalica and the Duroc animals, respectively (Table 1). Given the 500 bp average fragment size of the libraries used for the 2 × 100 bp paired-end sequencing, a 300 bp spacer between the reads was predicted. Mapping of the reads to the reference pig genome Sscrofa 10.2 showed an excellent correspondence between the expected and observed spacer lengths (Additional file 1). The proportion of mapped reads was 77.3, 83.3, 82.8 and 82.5%, resulting in 19×, 14×, 14× and 11× median autosomal coverage, respectively, for the four sequenced individuals (Table 1). The coverage for the individual autosomes varied between 10× and 21×, while for the sex chromosomes about half of the autosomal coverage was obtained (Figure 1). In addition, large numbers of reads for the Blond (260,270), Red (98,832) and Swallow-belly (104,478) Mangalicas and the Duroc (100,663) individual resulted in 1,571×, 602×, 638× and 615× coverage of the pig reference mitochondrial genome [41], respectively.
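The spacer length and sequencing depth quoted above follow from simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 2.8 × 10⁹ bp genome length is an assumed round figure rather than a value reported in this study, and the result is a rough mean depth, not the median autosomal coverage given in Table 1.

```python
def expected_spacer(fragment_bp: int, read_bp: int) -> int:
    """Unsequenced gap between the two mates of a paired-end fragment."""
    return fragment_bp - 2 * read_bp


def mean_depth(n_reads: float, read_bp: int, mapped_fraction: float, genome_bp: float) -> float:
    """Rough mean depth: mapped bases divided by genome length."""
    return n_reads * read_bp * mapped_fraction / genome_bp


# 500 bp fragments sequenced as 2 x 100 bp leave a 300 bp spacer.
print(expected_spacer(500, 100))

# Blond Mangalica: 6.27e8 reads, 77.3% mapped.
# ASSUMPTION: 2.8e9 bp is a placeholder genome size, not taken from the paper.
print(round(mean_depth(6.27e8, 100, 0.773, 2.8e9), 1))
```

Because the estimate ignores read trimming, unplaced scaffolds and uneven coverage, it should only be expected to land in the same range as the reported median values.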

Table 1 Sequencing statistics
Figure 1

Sequence coverage of the auto- and sex-chromosomes in four pig individuals.

Identification of genetic variants

To identify SNP and INDEL variants we used the SAMtools and GATK pipelines. In each animal, SAMtools and GATK provided a very similar number of SNPs and the proportion of concordant variations was high. In contrast, GATK detected more INDELs than SAMtools, and thus the proportion of common INDELs was lower compared either to the SNPs or to the total numbers of INDELs identified by the pipelines (Additional file 2). We analysed only the concordant variants further. More than seven million SNP and INDEL variants were identified by comparing the genome of each Mangalica individual to the Sscrofa 10.2 genome assembly. The genome sequence of the Duroc male also contained almost 6.5 million SNPs and INDELs when compared with the reference genome, which was assembled predominantly from a Duroc female animal [20]. SNPs outnumbered INDEL variations in all four animals by about 10-fold. In the Blond Mangalica, more homozygous than heterozygous SNPs were identified; in the Red Mangalica their numbers were about the same, while in the Swallow-belly Mangalica there were more heterozygous than homozygous SNPs. In the Duroc animal, there were also more heterozygous than homozygous SNPs. In each individual, more homozygous than heterozygous INDELs were found and their ratio was also about the same. SNP transitions were more numerous than transversions in all four individuals by about 2-fold. A summary of these statistics is shown in Table 2.
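As an illustration of the two quantities discussed above, the following minimal sketch computes the concordant calls of two pipelines and the transition/transversion ratio. It is not the script used in this study; the function names and the toy coordinates are invented for the example, and variants are assumed to be represented as (chromosome, position, reference allele, alternative allele) tuples.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)


def concordant(calls_a, calls_b):
    """Variants reported identically (chrom, pos, ref, alt) by both callers."""
    return set(calls_a) & set(calls_b)


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions among (chrom, pos, ref, alt) SNPs."""
    ts = sum(1 for _, _, ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")


# Toy call sets (made-up coordinates, for illustration only).
samtools_calls = [("1", 100, "A", "G"), ("1", 250, "C", "T"), ("2", 40, "A", "C")]
gatk_calls = [("1", 100, "A", "G"), ("1", 250, "C", "T"),
              ("2", 40, "A", "C"), ("3", 7, "G", "T")]

shared = concordant(samtools_calls, gatk_calls)
print(len(shared), ts_tv_ratio(shared))
```

In the toy data two of the three shared SNPs are transitions, which mirrors the roughly 2-fold excess of transitions reported for the four genomes.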

Table 2 Categories of sequence variations

Filtering the SNP variations using stringent criteria (see Methods) resulted in 6.2 × 10⁶, 6.3 × 10⁶, 6.2 × 10⁶ and 5.4 × 10⁶ SNPs in the Blond, Red and Swallow-belly Mangalica and the Duroc individuals, respectively (Additional file 3). Approximately 9 to 13% of the filtered SNPs were revealed as novel (Additional file 3) when compared with the 28.6 million SNPs in the pig dbSNP138 database. The filtered SNPs were grouped into main and sub-categories according to their intergenic or genic position and synonymous or non-synonymous nature (Additional file 4). It was observed that Mangalicas, in contrast to the Duroc animal, had more homozygous than heterozygous variations in almost all SNP categories. A comparison of both synonymous and non-synonymous exonic SNP variants revealed 12,448 SNPs that were common to the four animals, and approximately 5,200 to 9,500 unique SNPs for each individual (Figure 2).

Figure 2

Venn diagram of exonic SNPs in the sequenced animals. D, Duroc; BM, Blond Mangalica; SM, Swallow-belly Mangalica; RM, Red Mangalica.

The detection of large INDELs was not within the scope of the current study, and so only INDELs shorter than 52 bp were identified. For the genomes of the Blond, Red and Swallow-belly Mangalicas and the Duroc pig, approximately 6.9 × 10⁵, 6.2 × 10⁵, 6.1 × 10⁵ and 4.5 × 10⁵ such INDELs were identified, respectively. Of these, 99.9% were novel compared to the dbSNP138 database. With respect to the size distribution of the INDELs among the four genomes, single base-pair INDELs were the most abundant (Additional file 5). Exonic INDELs were sorted into eight categories: frame-shift deletions, frame-shift insertions, frame-shift block substitutions, non-frame-shift deletions, non-frame-shift insertions, non-frame-shift block substitutions, stop-gains and stop-losses (Additional file 6). In exonic INDELs, apart from the relatively large number of one base-pair variations that cause ORF shifts, +/− 3 base-pair changes, which do not affect the ORF, were identified in higher numbers than two or four base-pair variations (Additional file 7). An elevated number of one base-pair INDELs when compared with other sizes has also been reported by others [42, 43]. Our comparison with the platinum human exonic data obtained from Illumina’s BaseSpace (https://basespace.illumina.com/datacentral) provided the same result (data not shown), suggesting that our analysis of the pig genome is reliable.
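The distinction between frame-shift and non-frame-shift INDELs made above depends only on whether the length change is a multiple of three. A minimal sketch, assuming VCF-style reference and alternative alleles and using hypothetical function names and invented toy alleles, could look like this:

```python
from collections import Counter


def indel_length(ref: str, alt: str) -> int:
    """Signed length change of a VCF-style INDEL (insertion > 0, deletion < 0)."""
    return len(alt) - len(ref)


def shifts_frame(ref: str, alt: str) -> bool:
    """An exonic INDEL preserves the ORF only if its length is a multiple of three."""
    return indel_length(ref, alt) % 3 != 0


# Toy alleles (invented for illustration): +1, -3, +2 and -1 bp.
toy_indels = [("A", "AT"), ("GCAG", "G"), ("C", "CAA"), ("TG", "T")]
size_counts = Counter(abs(indel_length(r, a)) for r, a in toy_indels)
frame_shifting = [shifts_frame(r, a) for r, a in toy_indels]
print(size_counts, frame_shifting)
```

The annotation in this study was produced with ANNOVAR (see Methods); the sketch only shows why 3 bp changes fall into the non-frame-shift categories.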

Copy number variants (CNVs) were identified that were common amongst the three sequenced Mangalicas. Only CNV gains were analysed further due to the effect of sequence coverage depth on CNV losses [44]. One thousand and forty-one CNV gains with a copy number of three or more were identified across all chromosomes (Figure 3). The minimum and maximum size of the CNVs was 1,000 and 135,735 bp, respectively, with an average of 3,529 bp. Of the 1,041 Mangalica CNVs, 485 and 160 had no positional overlap with either the 3,118 CNV gains described by Paudel and colleagues [44] or the 145,857 CNVs identified in the Duroc animal in this study, respectively, while the numbers of overlapping CNVs were 556 and 881, respectively. We note here that the very large number of CNVs in the Duroc animal arises because no statistical test could be performed on data from one individual. Porcine genes could be annotated to 155 CNVs, while 886 CNVs did not contain any gene (Additional file 8). Of the 155 genes, 150 were unique since five genes contained two CNVs. An overrepresentation analysis identified 16 of the 150 unique genes in the overrepresented Molecular function (GO:0003674) category (P value = 1.25 × 10⁻⁷). One of the 16 genes, HOXB8, encoding a homeobox protein, is present neither in the literature [44] nor in the sequenced Duroc animal used in this study (Additional file 8).

Figure 3

Distribution of CNVs across Mangalica chromosomes. Short vertical lines represent the position of CNVs, which are present in all three sequenced Mangalicas.

Analysis of genes with exonic, non-synonymous SNPs

Functional, QTL and pathway annotation of the genes

Due to the importance of the Mangalica × Duroc hybrids to the Hungarian pork industry, the 2,328 exonic, non-synonymous SNPs common to all three Mangalica breeds but absent from the sequenced Duroc animal (Figure 4) and the reference pig genome were selected for functional analysis. These SNPs in the coding regions of genes, which result in amino acid changes in proteins, may be of great importance as they could be the polymorphisms underlying phenotypic variation. The 2,328 SNPs were mapped to 1,389 unique genes of the Sscrofa10.2 assembly, as certain genes had multiple SNPs (Additional file 9), and their annotation into biological process (BP) categories by the web-based software PANTHER [45] revealed that they belong to twelve major GO groups (Figure 5). Since the SNPs were identified by comparing Mangalicas, which are fatty-type pigs, and Duroc, which is a lean-type breed, we were particularly interested in those SNP-harbouring genes that might be involved in fat-related biological processes. Amongst the 1,389 unique genes with exonic, non-synonymous SNPs, we identified 52 genes that belonged to Lipid metabolic process (GO:0006629). Although this category appeared in an overrepresentation analysis, in contrast to when two sets of 1,389 randomly chosen genes were used as controls, it was not overrepresented under the strict Bonferroni correction (Additional file 10). As another control, we found no overrepresentation using the full pig gene set. Despite the lack of overrepresentation, we still consider that the identified genes might be of great importance, since the amino acid changes caused by the SNPs in them may affect the structure and, consequently, the function of the encoded protein, and such functional alterations of proteins remain hidden in gene expression studies. The importance of our SNP-based gene identification approach is illustrated by the fact that, for example, the proteins encoded by the PNLIP and PNLIPRP2 genes, which have not previously been associated with fatness phenotypes in pigs, are targets of Orlistat (tetrahydrolipstatin), a drug used for treating obesity in humans (data not shown). The possible effect of exonic SNPs on protein function is discussed below using FASN as an example.

Figure 4

Venn diagram of exonic, non-synonymous SNPs in the sequenced animals. D, Duroc; BM, Blond Mangalica; SM, Swallow-belly Mangalica; RM, Red Mangalica.

Figure 5

Biological process ontology of genes with exonic SNPs found in Mangalica breeds. Of the 1,389 genes, 1,372 resulted in 2,130 total hits in processes. Percentages indicate the share of genes in one process relative to the total number of process hits.

To study the possible relationship between the 52 genes in the lipid metabolic process GO category and QTLs, the chromosomal position of each gene was compared to the positions of the “Fatness” and “Fat composition” QTLs downloaded from the QTLdb, Release 19 [46]. Forty-nine genes are in one or more fat-related QTLs, with 14 genes on chromosome 14 overlapped by 15 fat-associated QTLs (Additional file 11). Because of this large proportion (~28%) of genes on chromosome 14, we performed an enrichment analysis for the 14-gene set and a control set of 1,282 genes, both located in the same region of chromosome 14 delimited by the 15 QTLs. The corrected P-value for lipid metabolic genes in the control set and in our set was 4.80 × 10⁻³ and 2.95 × 10⁻¹⁹, respectively, indicating that the enrichment of the 14 genes in these QTLs deviates significantly from random.
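The gene-to-QTL comparison was carried out manually in this study (see Methods). For readers who wish to automate a comparable positional check, a minimal sketch is given below; the `Region` type, the function names and all coordinates are hypothetical and are not taken from QTLdb or Ensembl.

```python
from collections import namedtuple

Region = namedtuple("Region", "name chrom start end")


def overlaps(a: Region, b: Region) -> bool:
    """Two closed intervals on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def qtls_per_gene(genes, qtls):
    """Map each gene name to the names of all QTLs it overlaps."""
    return {g.name: [q.name for q in qtls if overlaps(g, q)] for g in genes}


# Hypothetical coordinates, for illustration only.
genes = [Region("geneA", "14", 1_000, 5_000), Region("geneB", "2", 100, 900)]
qtls = [Region("fatness_QTL1", "14", 4_000, 90_000),
        Region("fat_comp_QTL2", "14", 6_000, 7_000)]

hits = qtls_per_gene(genes, qtls)
in_any_qtl = [name for name, found in hits.items() if found]
print(hits, in_any_qtl)
```

Counting the genes with a non-empty hit list gives the kind of tally reported above (49 of 52 genes in at least one fat-related QTL).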

The fatty acid composition of meat is an important dietetic and health issue for pork consumers. We therefore compared the genes located in saturated and unsaturated fatty acid QTLs and found that nine genes were common to both fatty acid categories, while the saturated and unsaturated QTL groups each contained two unique genes: NKX2-3 and EPHX2, and OMA1 and FAM135B, respectively (Additional file 12).

Of the 52 lipid metabolic process-associated genes, we could map 41 to one or more pathways using the KEGG database. Almost 44% (18) of the mapped genes were associated with lipid metabolic pathways (Figure 6), while others contribute to glycan and carbohydrate metabolism, biochemical processes at the interface of lipid and other metabolic pathways, and the regulation of lipid metabolism (Additional file 11). Of the 41 mapped genes, two are particularly important. One is FASN, which encodes an enzyme involved in a number of steps in the synthesis of 8 to 16 carbon-chain fatty acids in the fatty acid biosynthesis pathway [KEGG:ssc00061]. The FASN protein is a homodimeric multifunctional enzyme with six catalytic domains, which carry out the different steps of the cyclic elongation of fatty acids [47]. The other gene is SLC27A6, a member of a gene family that is expressed in the liver, heart and subcutaneous backfat of the pig [48]. The encoded protein is a fatty acid transporter and one of the two membrane proteins of the PPAR signalling cascade [KEGG:hsa03320], which regulates lipid and fatty acid metabolism, bile acid biosynthesis and adipocyte differentiation, amongst other processes [49].

Figure 6

Fat metabolic pathways and participating genes with Mangalica-specific exonic, non-synonymous SNPs. Lines represent the interconnections of the pathways. Arrows indicate where signalling or metabolites (name above the line) affect genes in other pathways.

Genotyping SNPs in other breeds

The 90 SNPs in the 52 genes described above were present in all three sequenced Mangalicas, but absent from the sequenced Duroc and the reference genome. To learn about their wider occurrence, we “e-genotyped” 55 animals whose genomes had been sequenced [9] for these SNPs. The results indicate that the frequencies of these SNPs vary amongst the 55 individuals (Additional file 13). Clustering of the average frequencies revealed four clusters among the individuals, where Mangalica represents a separate cluster and European, international/Hungarian Duroc, and non-European pigs and/or wild boars comprise the three other related groups (Additional file 14). The clear separation of Mangalicas from other breeds by these 90 SNPs might have potential in practical applications, such as whole genome selection in breeding.

Four SNPs were found to be present only in Mangalicas and in none of the e-genotyped individuals (Additional file 13). All of these SNPs are in one gene, MOGAT2 (ENSSSCG00000014861), which encodes a monoacylglycerol O-acyltransferase 2 enzyme and lies in several back- and belly-fat QTLs and in the “Fat digestion and absorption” (KEGG: 04975) pathway (Additional file 11). It is possible, therefore, that this gene has a particular role in the development of the fatty-pig phenotype of Mangalicas.

Some studies have highlighted the importance of the FASN gene in pig fatness [50, 51]. In this gene, we identified two non-synonymous SNPs that are present in the three sequenced Mangalicas, but not in the reference genome or the sequenced Duroc individual used in this project. They also differ from the three SNPs that have been genotyped previously [50]. SNP1 is in exon 9 (chromosome 12, position 1,028,766) and is a G•C (reference) to A•T (Mangalica) transition, which causes an R443Q amino acid change, while SNP2 is a C•G (reference) to T•A (Mangalica) transition in exon 21 (chromosome 12, position 1,025,096) resulting in a T1088I change in the FASN protein. The frequency of these two SNPs is quite diverse in the genome-sequenced animals, including the three Mangalicas and one Duroc individual sequenced in this study (Additional file 15). We therefore genotyped 72 Mangalica and 21 Duroc pigs for both SNPs to obtain more information about these SNPs in the two breeds. We found that the A (“Mangalica”) alleles (SNP1 A•T or SNP2 T•A) occur at a much higher frequency than the B alleles in Mangalica, whereas the B alleles (SNP1 G•C or SNP2 C•G, “non-Mangalica”) are more prevalent in Duroc (Table 3). Additionally, we found that for SNP1, 62 and 10 Mangalicas and 1 and 20 Duroc animals were AA and BB homozygous, respectively; no heterozygotes were found. For SNP2, 65 Mangalicas and eight Durocs had the AA genotype, five Durocs had BB, and seven Mangalicas and eight Durocs had AB; no Mangalica with the BB genotype was found.
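Allele frequencies follow directly from genotype counts such as those listed above. The sketch below is a generic illustration with a hypothetical function name; the usage example plugs in the SNP1 counts quoted in the text, and the same function can be applied to the SNP2 counts.

```python
def allele_frequencies(n_aa: int, n_ab: int, n_bb: int):
    """Return (freq_A, freq_B) for a biallelic SNP from genotype counts."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return freq_a, 1 - freq_a


# SNP1 genotype counts as quoted in the text (AA, AB, BB).
mangalica_snp1 = allele_frequencies(62, 0, 10)
duroc_snp1 = allele_frequencies(1, 0, 20)
print(mangalica_snp1, duroc_snp1)
```

Each homozygote contributes two copies of its allele and each heterozygote one copy of each, so the denominator is twice the number of genotyped animals.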

Table 3 Genotyping the FASN gene

Discussion

The genomes of one individual from each of the three Mangalica breeds (Blond, Red and Swallow-belly) and of a Duroc animal from a Hungarian herd were sequenced and analysed. More than 100 million reads were obtained from the genome of each animal. On average across the four genomes, 81% of the reads were mapped to the reference genome, resulting in 14.5× median autosomal coverage. Millions of SNPs and hundreds of thousands of INDELs were identified in each of the three Mangalica genomes and the Duroc genome when compared to the reference pig genome assembly Sscrofa 10.2. Filtering the SNPs left about five to six million variations, about one-tenth of which were novel compared to the dbSNP138 database (Additional file 3).

For functional analysis, we selected 2,328 exonic non-synonymous SNPs present in each sequenced Mangalica individual, but absent from either the reference genome or the Hungarian Duroc animal. These SNPs were mapped to 1,389 pig genes present in the Ensembl database. Since Mangalicas are fatty-type pigs, and the SNPs were identified in comparison with Duroc, a lean-type pig, we were particularly interested in fat-related genes in this set. Fifty-two genes were found belonging to lipid-related metabolic process categories and were further analysed using QTL and pathway data-mining. Of the 52 genes, 49 and 41 are associated with fat-related QTL regions and KEGG pathways, respectively (Additional file 11).

Some of the 52 genes, for example ACACA, ANKRD23, GM2A, KIT, MOGAT2, MTTP, FASN, SGMS1, SLC27A6 and RETSAT, which we have highlighted here, have been previously described in the context of fat-related characteristics in pigs [50–54]. Of these genes, FASN, which encodes a fatty acid synthase, has been shown to be associated with a cis-11-Eicosenoic acid (C20:1) percentage QTL in a Guadyerbas × Landrace cross, although none of the identified SNPs had any putative effect on the protein structure [50]. The FASN protein is a homodimeric, multifunctional enzyme with six catalytic domains, which are required for the cyclic elongation of fatty acids [47], and it catalyses 32 reactions in the fatty acid biosynthesis [KEGG:ssc00061] pathway. Targeted mutagenesis of the FASN gene and inhibition of the FASN protein in mice resulted in reduced total body fat [55] and body weight [56], respectively. We have identified two SNPs in this gene in Mangalicas that result in an R443Q (SNP1) and a T1088I (SNP2) amino acid change. The amino acid in position 443 is part of the α-helix in the protein’s inter-domain linker. Since glutamine is more hydrophilic than arginine, the amino acid substitution may affect the relative position of the two functional domains by modulating the flexibility of the linker connecting them [57]. The amino acid in position 1,088 is part of the dehydratase domain of the FASN protein. This domain catalyses the conversion of β-hydroxyacyl-ACP to β-enoyl-ACP in the cyclic elongation of fatty acids [47]. T1088 is in close vicinity to the active site of the dehydratase domain, which contains an open-ended hydrophobic tunnel [57]. Predicting the hydrophobicity of amino acids along the FASN polypeptide revealed that the substituting I1088 is strongly hydrophobic, while T1088 is hydrophilic (data not shown). It is possible, therefore, that in the FASN T1088I protein the substrate-binding nature of the active site is altered, which may influence the dehydration step of the cyclic elongation of fatty acids. This might be particularly important in Mangalicas, where no BB homozygotes were found. Thus the active site in the catalytic domain of their FASN protein is expected to be hydrophobic, although allele-specific expression of the FASN gene in heterozygotes might influence this.

It is known that feeding regimes influence fatty acid composition and meat marbling in Mangalicas [31, 58], as in other pig breeds and farm animals. In lipid metabolism, the “Fat digestion and absorption” and “Bile secretion” pathways are involved in the metabolism of dietary fats. These two pathways are connected to the “Glycerolipid metabolism”, “Fatty acid metabolism” and “Fatty acid biosynthesis” pathways. Our study highlighted a number of genes in these metabolic pathways and in the PPAR signalling pathway (Figure 6). We identified one gene, MOGAT2 (ENSSSCG00000014861), with seven SNPs, of which four are present in Mangalicas but not in the other 56 sequenced pig individuals (see Results). The MOGAT2 protein catalyses the conversion of 1-acylglycerol obtained from dietary fat into diacylglycerol in the smooth endoplasmic reticulum of the small intestinal epithelial cells, and thus participates in the production of chylomicrons (“Fat digestion and absorption” pathway, KEGG:04975). Chylomicrons affect the PPAR signalling pathway, which in turn regulates a number of lipid metabolic processes (Figure 6). It is possible, therefore, that polymorphisms affecting genes in this complex network of pathways, which are also part of relevant QTLs, may be responsible for the differences in fattening, fat composition and related phenotypes that were observed between breeds in response to different feeding regimes. For example, the MOGAT2 gene was found to be part of the lipid concentration biological function, modulated in backfat [54].

Conclusions

The discovery of genes behind agriculturally important traits is a difficult task in farm animals, in particular when the intermediate- or end-phenotypes are determined by QTLs. In this study, we described the genome sequencing and analysis of three Hungarian Mangalica individuals representing each of the three Mangalica breeds, which are local, fatty type pigs with a niche role in the pork market. After filtering, millions of SNPs were identified in each animal compared to the reference genome, and about 10% of them are novel compared to the porcine SNP entries of the dbSNP138 database. This finding highlights that sequencing genomes of individuals of rare/local breeds can provide large amounts of data identifying genomic variations relative to the reference genome of the same species. These variations can be the basis for gene discoveries. With special emphasis on pig fatness, by annotating and comparing exonic, non-synonymous Mangalica-specific SNPs to QTLs and pathways, we identified a number of candidate genes, which can serve for future genotyping, expression, structure-function, and biological network studies and in applications, such as molecular breeding and meat identification or tracing in both Mangalica and other breeds.

Methods

Genome sequencing

Pig blood samples were obtained from the MANGFOOD consortium’s Biobank at the Agricultural Biotechnology Center, Gödöllő, Hungary. Total DNA was extracted using the Duplicα® Prep Automatic Extraction System and the Duplicα® Blood DNA kit (EuroClone, Milan, Italy). DNA concentration was measured using the Quant-iT™ PicoGreen dsDNA® Assay (Life Technologies, Budapest, Hungary). Preparation of 500 bp fragment libraries and 2 × 100 bp Illumina paired-end genome sequencing was performed by Aros Applied Biotechnology (Aarhus, Denmark) as a custom service, using Illumina’s HiSeq2000 platform.

Data analyses

The Sus scrofa reference genome sequence 10.2 was indexed using the “bwtsw” algorithm option of BWA 0.5.9rc1 [59] followed by mapping the short sequence reads to the indexed genome using the default settings and the paired-end method of the same software. The obtained BAM files were sorted and indexed for further analyses.

To detect small genetic variants (SNPs and INDELs), the SAMtools [60] and GATK (version: 2.3-9-ge5ebf34) [61] variant calling pipelines were employed. In SAMtools, base-calling was performed using the “mpileup” command and the “-E -D -S -u” parameters of SAMtools 0.1.18. The “view” command of BCFtools was used to call the variants using the “-bvcg” parameters. VCF files were then generated by the “vcfutils.pl” script using the “varFilter” option, and SNPs and INDELs were extracted. Finally, SNPs that had a Phred score higher than 30 (i.e. a base-calling accuracy greater than 99.9%) and a high-quality read coverage of at least three were retained using a custom script. INDELs were used in downstream analyses without filtering. For GATK, the dbSNP138 data were used as a training set. Other settings followed the GATK best practice online documentation. Results obtained by the two pipelines were compared using the BEDTools [62] “intersectBed” module for SNPs and our custom script for INDELs; only concordant variations were processed further.
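The custom filtering script is not reproduced in this article. As a hedged sketch of what such a filter might look like, the code below keeps VCF records with QUAL above 30 and at least three high-quality reads. It assumes that the high-quality read coverage is the sum of the four counts in the DP4 INFO tag written by SAMtools; that interpretation, the function names and the example record are assumptions, not the authors' implementation.

```python
def phred_to_accuracy(q: float) -> float:
    """Convert a Phred-scaled score to accuracy: 1 - 10^(-Q/10)."""
    return 1 - 10 ** (-q / 10)


def parse_dp4(info: str):
    """Extract the four DP4 counts from a VCF INFO string, or None if absent."""
    for field in info.split(";"):
        if field.startswith("DP4="):
            return [int(x) for x in field[4:].split(",")]
    return None


def passes(vcf_line: str, min_qual: float = 30.0, min_hq_reads: int = 3) -> bool:
    """Keep a record if QUAL > min_qual and high-quality depth >= min_hq_reads."""
    cols = vcf_line.rstrip("\n").split("\t")
    qual = float(cols[5])          # VCF column 6: QUAL
    dp4 = parse_dp4(cols[7])       # VCF column 8: INFO
    depth = sum(dp4) if dp4 else 0  # ASSUMPTION: depth = sum of DP4 counts
    return qual > min_qual and depth >= min_hq_reads


# Invented example record, for illustration only.
record = "12\t1000\t.\tG\tA\t45.0\t.\tDP=9;DP4=0,0,4,3;MQ=50"
print(phred_to_accuracy(30), passes(record))
```

A Phred score of 30 corresponds to an error probability of 1 in 1,000, which is the 99.9% accuracy threshold mentioned above.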

Copy number variations (CNVs) were detected as described by Paudel and coworkers [44] using the mrCaNaVaR (version 0.51) software [63]. The window size was set to 1,000 bp. We selected windows where the copy number and the standard deviation were greater than three and 0.7, respectively, for the three Mangalicas. The selected regions were then chained.
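The chaining step is not described in detail here. One plausible reading, shown below purely as an assumption, is that selected 1,000 bp windows that touch or overlap on the same chromosome are merged into a single CNV region. The function name, the `max_gap` parameter and the half-open coordinate convention are hypothetical.

```python
def chain_windows(windows, max_gap=0):
    """Merge (chrom, start, end) windows that touch or overlap on one chromosome."""
    chained = []
    for chrom, start, end in sorted(windows):
        last = chained[-1] if chained else None
        if last and last[0] == chrom and start - last[2] <= max_gap:
            # Extend the current region instead of opening a new one.
            last[2] = max(last[2], end)
        else:
            chained.append([chrom, start, end])
    return [tuple(region) for region in chained]


# Toy windows (half-open coordinates, invented for illustration).
selected = [("1", 1000, 2000), ("1", 0, 1000), ("1", 5000, 6000), ("2", 0, 1000)]
print(chain_windows(selected))
```

Under this reading, the smallest possible CNV is a single 1,000 bp window, which matches the minimum CNV size reported in the Results.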

To determine novel variants in our sequence data, we compared the identified SNPs and INDELs with the dbSNP138 data using BEDTools [62] and annotated the detected genetic variants using ANNOVAR [64]. Following the ANNOVAR analysis, non-synonymous exonic SNPs that were present only in Mangalicas were determined by the BEDTools “multiIntersectBed” module. Genes carrying these variants were identified using a custom script. The comparison of SNPs in the lipid metabolism genes amongst genome-sequenced animals (this study and reference [44]) was also performed using the “multiIntersectBed” module of BEDTools.

Gene ontology analysis was performed by the web-based software PANTHER [45]. For overrepresentation analyses, BioMart’s [65] enrichment analysis option with a P-value cut-off of 0.05 was employed using the Sscrofa 10.2 reference genome as background. Random sets of genes were generated by a custom Python script. Fat-related pig QTLs and their positions were downloaded from the QTLdb (Release 19) database [46], and their extent was compared manually with the positions of the SNPs of selected genes. Genes were annotated into pathways using the KEGG database.
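The custom Python script for random gene sets is not included in the article. A minimal sketch of how such control sets could be drawn, together with a Bonferroni adjustment of the kind referred to in the Results, is given below; the function names, the seed and the placeholder gene identifiers are assumptions made for illustration.

```python
import random


def random_gene_sets(all_genes, set_size, n_sets, seed=0):
    """Draw n_sets random gene sets (without replacement within each set)."""
    rng = random.Random(seed)  # fixed seed so the control sets are reproducible
    return [rng.sample(all_genes, set_size) for _ in range(n_sets)]


def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Placeholder identifiers standing in for the full pig gene list.
background = [f"GENE{i}" for i in range(5000)]
controls = random_gene_sets(background, set_size=1389, n_sets=2)
print(len(controls), len(controls[0]), bonferroni([0.01, 0.5]))
```

Two control sets of 1,389 genes correspond to the design described in the Results, where the random sets served as a comparison for the Mangalica-specific gene set.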

Data from Ensembl were retrieved using BioMart [65]; Venn diagrams were generated using the software Venny [66]; clustering was performed using CIMminer [67] with Manhattan distance and complete linkage clustering settings.
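To make the clustering settings concrete, the sketch below implements Manhattan distance with complete-linkage agglomerative clustering in plain Python. It is not the CIMminer implementation; the function names and the toy allele-frequency profiles are invented for illustration.

```python
def manhattan(u, v):
    """Sum of absolute coordinate differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(u, v))


def complete_linkage(profiles, n_clusters):
    """Agglomerative clustering; cluster distance = largest pairwise member distance."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(manhattan(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy SNP-frequency profiles (made-up values, three SNPs per animal).
toy = {
    "M1": [1.0, 1.0, 0.9],
    "M2": [1.0, 0.9, 1.0],
    "D1": [0.0, 0.1, 0.0],
    "W1": [0.1, 0.0, 0.2],
}
print(complete_linkage(toy, n_clusters=2))
```

Complete linkage merges the two clusters whose most distant members are closest, which tends to produce compact groups such as the separate Mangalica cluster described in the Results.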

Genotyping

To genotype the two Mangalica-specific SNPs in the FASN gene, High Resolution Melting (HRM) analysis was performed with a Rotor-Gene Q 5plex HRM Platform using a saturating dye (EvaGreen) technology (Qiagen, Hilden, Germany). PCR reactions were performed in 25 μl reaction volumes using 60 ng total DNA as template and the Type-it HRM PCR kit (Qiagen, Hilden, Germany), according to the instructions of the manufacturer. The primers for FASN SNP1 and SNP2 were FASN1_F: 5′ CGCGATCTCGTTGAGCAT 3′, FASN1_R: 5′ GTGCAGACCCTGCTGGAG 3′ and FASN2_F: 5′ GGATAGGCTTGAGATGCTCTT 3′, FASN2_R: 5′ GTGGTGGTGGACAGGAATCT 3′, respectively. Reactions were carried out with an initial denaturation step at 95°C for 5 min, followed by 35 cycles of 95°C for 15 sec, 60°C for 30 sec and 72°C for 10 sec, and then HRM curves were generated by acquiring fluorescence data between 80 and 91°C. Homozygous and heterozygous genotypes were assigned to individuals according to their HRM curves, as determined by the Rotor-Gene software and visual inspection.

Availability of supporting data

The data sets supporting the results of this article are included within the article and its additional files. Sequence data are deposited to the NCBI Sequence Read Archive under identifier SRP039012.