Objective

The Black Bengal goat (BBG) belongs to the Bovidae family and is found throughout Bangladesh and in the West Bengal, Bihar, and Orissa regions of northeastern India. It is estimated that more than 90% of the goat population in Bangladesh consists of Black Bengal goats, the remainder being Jamunapari and their crosses [1]. High prolificacy and fertility, resistance to common diseases, adaptability to adverse environmental conditions, early maturity, seasonality, and superior litter size are some of the outstanding features of the BBG. In addition, it produces tender, flavorful meat with low intramuscular fat, and fine skin of extraordinary quality, for which there is high demand worldwide [1, 2]. Moreover, it plays a vital role in the economy of Bangladesh, contributing 1.66% of the gross domestic product (GDP) (DLS 2017).

Fortunately, market demand for the Black Bengal goat is growing. This gives breeders of original and rare breeds an opportunity to expand the stock and preserve its genetic diversity. One of the primary goals in managing goat populations is to maintain a high level of genetic diversity and a low level of inbreeding. To estimate the future breeding potential of a goat breed, it is necessary to characterize its genetic structure and evaluate the level of genetic diversity within the breed. Moreover, a long-term genetic approach can be used to improve the valuable economic characteristics of the BBG [3].

Therefore, genetic characterization of the entire BBG genome is essential for understanding its economic traits and its adaptive capability. With a whole genome sequence available, the target areas for genetic improvement are prolificacy, growth rate, meat quality, skin quality, disease resistance, and survivability. A complete and accurate reference for the goat genome is an essential component of advanced genomic selection for product characteristics.

Data description

A healthy 3-year-old male Black Bengal goat (BBG) without known genetic diseases was selected for blood collection. Genomic DNA was isolated from EDTA-treated blood using the Addprep genomic DNA extraction kit (South Korea) (detailed methodology in Data file 1, Table 1). The quality and quantity of the DNA were assessed with a Qubit fluorometer (Invitrogen, Carlsbad, CA, USA) and an Infinite F200 microplate reader (TECAN), according to the manufacturers' instructions. The integrity of the DNA was inspected visually by 0.8% agarose gel electrophoresis. Purified genomic DNA was sent for library preparation (detailed methodology in Data file 1, Table 1) and whole genome sequencing (WGS) at BGI Group (Shenzhen, Guangdong, China). A total of 40 Gb (gigabase pairs; 14-fold coverage) of subread bases with a read length of 150 bp was generated using next-generation sequencing (NGS) technology on an Illumina HiSeq 2500 platform (detailed methodology in Data file 1, Table 1).
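The fold coverage quoted above is the ratio of sequenced bases to genome size. The following Python sketch shows that arithmetic only. The text does not state which genome size estimate underlies the 14-fold figure, so the 2.9 Gb value below is an assumption chosen for illustration, and the function name is hypothetical.

```python
def fold_coverage(total_bases: float, genome_size_bp: float) -> float:
    """Mean sequencing depth = total sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases / genome_size_bp


TOTAL_BASES = 40e9           # 40 Gb of sequenced bases, as reported in the text
ASSUMED_GENOME_SIZE = 2.9e9  # ASSUMPTION: illustrative genome size, not given in the text

depth = fold_coverage(TOTAL_BASES, ASSUMED_GENOME_SIZE)
print(f"approximate depth: {depth:.1f}x")
```

Under this assumed genome size the result rounds to the reported 14-fold coverage. A different size estimate would shift the figure proportionally.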

Table 1 Overview of data files/data sets

After sequencing, the quality of the raw sequencing reads was inspected using FastQC version 0.11.8 [4]. Quality control, including removal of adaptor sequences, contamination, and low-quality reads, was performed on the raw reads using Trimmomatic v0.32 [5]. A total of 247,325,362 clean reads were included in the assembly. De novo assembly with the ABySS v2.1.5 assembler [6] generated 3,294,295 contigs (minimum contig size 200 bp). Next, the ABACAS v1.3.1 pipeline was used with the reference genome ARS1 (GCA_001704415.1) [7] to arrange, order, and orient the assembled contigs [8]. The genome assembly data have been deposited in NCBI GenBank under the accession number GCA_001704415.1 (Data file 2, Table 1). The final assembled genome of the BBG is 3.04 Gb in size, with 724.80 Mb (megabase pairs) of gaps and a GC content of 41.77%. Completeness of the genome was assessed with Benchmarking Universal Single-Copy Orthologs (BUSCO) version 3.0.2 [9], which showed 82.5% completeness.
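Summary figures such as assembly size, gap bases, number of gap regions, and GC content can be recomputed from an assembly FASTA file. The Python sketch below is one way to do so and is not the authors' procedure. It assumes that gaps are encoded as runs of N and that GC content is computed over non-gap bases only; the function names and the toy sequences are hypothetical.

```python
import re
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def assembly_stats(handle):
    """Return total length, gap bases, gap runs, and GC percent of an assembly."""
    total = gap_bases = gap_runs = gc = 0
    for _, seq in read_fasta(handle):
        total += len(seq)
        gap_bases += seq.count("N")
        # ASSUMPTION: each maximal run of N is one unassembled region.
        gap_runs += len(re.findall(r"N+", seq))
        gc += seq.count("G") + seq.count("C")
    called = total - gap_bases  # bases that are not gaps
    return {
        "total_bp": total,
        "gap_bp": gap_bases,
        "gap_runs": gap_runs,
        # ASSUMPTION: GC content is reported over non-gap bases.
        "gc_percent": 100.0 * gc / called if called else 0.0,
    }


# Toy input standing in for a real assembly file (hypothetical sequences).
toy_fasta = ">seq1\nACGTNNNNGGCC\nNNAT\n>seq2\nGGGGAAAA\n"
stats = assembly_stats(StringIO(toy_fasta))
print(stats)
```

For a real assembly, the same function can be called on an opened file, for example `assembly_stats(open(path))`, where `path` points to the FASTA file.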

Genes were annotated using the Maker version 3.0 pipeline [10], which identified 26,458 gene models. RepeatMasker v4.0.9 [11], run with the latest version of the Repbase database [12], identified repeat elements in 31.85% of the genome. Finally, InterProScan v5.33-72.0 [13] was used to assign gene ontology (GO) terms; it identified a total of 12,589 GO terms, and 8173 genes have at least one associated GO term. The whole genome sequence data have been submitted to NCBI GenBank under the accession numbers SMSF01000001–SMSF01003972 (Data file 3, Table 1).
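The two GO counts above can be tallied from InterProScan's tab-separated output, in which GO annotations appear in the 14th column as a list separated by `|` when GO lookup is enabled. The Python sketch below assumes that layout and that each protein accession in the first column corresponds to one gene. The text does not say whether the 12,589 figure counts distinct terms, so this sketch counts distinct ones. The function names and example records are hypothetical.

```python
from io import StringIO

GO_COLUMN = 13  # 0-based index of the GO annotation column (14th column)


def summarize_go(handle):
    """Return (number of distinct GO terms, number of proteins with >= 1 GO term)."""
    go_terms = set()
    annotated = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= GO_COLUMN:
            continue  # record has no GO column
        terms = [t for t in fields[GO_COLUMN].split("|") if t.startswith("GO:")]
        if terms:
            annotated.add(fields[0])
            go_terms.update(terms)
    return len(go_terms), len(annotated)


def make_row(protein, go):
    """Build a hypothetical 14-column record; only columns 1 and 14 matter here."""
    fields = [protein, "md5", "100", "Pfam", "PF00000", "desc", "1", "50",
              "1e-5", "T", "01-01-2019", "IPR000000", "desc", go]
    return "\t".join(fields)


toy_tsv = "\n".join([
    make_row("gene1", "GO:0005515|GO:0003677"),
    make_row("gene1", "GO:0005515"),  # repeated term, same protein
    make_row("gene2", "-"),           # no GO annotation
    make_row("gene4", "GO:0016020"),
]) + "\n"

n_terms, n_genes = summarize_go(StringIO(toy_tsv))
print(n_terms, n_genes)
```

Sets are used so that a term or protein reported by several signature matches is counted once.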

Limitations

The assembly contains 3943 unassembled regions (gaps), which together account for 724,808,570 bp.