Introduction

The genus Capsicum, belonging to the Solanaceae family, includes both sweet (peppers) and hot (chilies) cultivars (D’Agostino et al. 2018). It is considered a model crop due to its significant diversity, which leads to abundant phenotypic variation (Park et al. 2016). Capsicum species, initially cultivated extensively in the Americas, gained global distribution following Columbus’ voyages (Perry et al. 2007). Starch fossils of Capsicum spp. were found at seven sites from the Bahamas to southern Peru, dating back 6000 years before the first contact with Europeans (Perry et al. 2007). The USDA Genetic Resources Information Network (GRIN) currently recognizes 38 Capsicum species, including five domesticated species. Some species, such as Capsicum chinense Jacq., Capsicum baccatum L. and Capsicum pubescens R. & P., are widely cultivated in South America but are less common elsewhere (Tripodi et al. 2021). Genetic, archaeological, and contemporary plant distribution evidence suggests distinct domestication origins: (i) C. chinense in the northern Amazon lowlands, (ii) C. baccatum in the Bolivian lowlands, (iii) Capsicum annuum in Mexico or northern Central America, (iv) Capsicum frutescens in the Caribbean, and (v) C. pubescens in the mid-elevation Southern Andes (Perry et al. 2007; Kraft et al. 2014). These five domesticated species are diploid, self-pollinating crops and facultative inbreeders. They are closely related to tomato, potato, eggplant, petunia and tobacco (Kim et al. 2014). Capsicum species feature flowers with stellate corollas, distinct pigmentation patterns, and fleshy berries varying in size, shape, and color. A unique characteristic within the Solanaceae family is their entire, cup-shaped calyx (Carrizo García et al. 2016).

The Capsicum genus has been an important dietary component throughout the New World (Russo 2012) and is widely used as a natural food colorant, as an ingredient in pharmaceuticals, and as a very spicy extract used for self-defense sprays, animal repellents, and insecticides (Bosland 2016). Hot chili pepper also provides many essential nutrients, vitamins, and minerals that are of great importance for human health. The most important character in Capsicum is the pungency of the fruit, which is due to the production of capsaicinoids, a group of alkaloids synthesized in the placenta of the fruits (Stewart et al. 2007). Capsaicinoids have health benefits for humans: they are considered anticancer agents (Mori et al. 2006; Ito et al. 2004) and analgesics for arthritis and other pain (Fraenkel et al. 2004), and they reduce appetite (Westerterp-Plantenga et al. 2005).

Next-generation sequencing (NGS) technologies speed the generation of high-throughput data (Park et al. 2016; Kim et al. 2014; Jo et al. 2011), and several investigations have used this technology. Jo et al. (2011) analyzed the complete chloroplast genome sequence of C. annuum. They identified large indels, including insertions in the accD and rpl20 gene sequences, as well as frequent tandem repeat sequences. Kim et al. (2014) reported the whole-genome sequencing and assembly of a Mexican landrace of C. annuum. They also resequenced two cultivated peppers and de novo sequenced C. chinense, and found that the genome of the hot chili pepper was about four times larger than that of its close relative, the tomato. They further reported that the genome showed an accumulation of elements of the Gypsy and Caulimoviridae families.

The genome assembly process plays a fundamental role in genomics, where computational tools called assemblers are used to reconstruct genomic sequences from sequencing data. These assemblers vary in their computational approaches and strategies, which can significantly influence their performance across different types of data and environmental conditions (Forouzan et al. 2018). By understanding the strengths and limitations of each assembler, researchers can make informed decisions on selecting tools for their genomic analysis projects (Forouzan et al. 2017).

Unfortunately, genomic research employing Peruvian Capsicum germplasm is still very limited. Here we report the first genome sequence of the Arnaucho chili pepper (ACP), a landrace cultivated in northern Lima. This autochthonous cultivar is widely used in Peruvian gastronomy and has cultural and economic importance. Deciphering the ACP genome provides an evolutionary view of genome expansion and makes it possible to develop molecular markers that allow the selection of superior individuals at early stages. This would be the first step toward establishing modern genetic breeding programs in Peru. In addition, it will reveal the location of the underlying genes or chromosomal regions that could be useful for the introgression of traits of interest. In the future, the analysis of the ACP genome could support the implementation of gene editing, which has produced impressive results worldwide in the development of new cultivars with improved horticultural, nutritional and medicinal value in Capsicum species.

Materials and methods

Sample collection and DNA extraction

We collected leaves of Arnaucho chili pepper from the Supe Valley, in the north of the Lima department (−10.8099889, −77.6953950). Young fresh leaves were stored in paper envelopes inside plastic containers with silica gel to preserve the samples during transport to the National Institute of Agricultural Innovation (INIA, for its acronym in Spanish) for genomic DNA extraction. The specimen was deposited at the Germplasm Bank of INIA (https://www.gob.pe/inia, drgb@inia.gob.pe) under the voucher number PER1002642. Genomic DNA was extracted using the CTAB method (Doyle and Doyle 1987) with slight modifications for this species. DNA quantity was evaluated using the Qubit™ 4 Fluorometer (Invitrogen, Waltham, MA, USA) and quality was evaluated on a 1% agarose gel.

DNA sequencing and estimation of genome size

Paired-end DNA sequencing was conducted using the Illumina HiSeq 2500 system. Read quality was assessed with FastQC v.0.11.9 (Andrews 2014). Reads were then trimmed and filtered (Phred Q > 25) using Trimmomatic v0.36 (Bolger et al. 2014) and TrimGalore v.0.6.10 (Krueger 2012) with default parameters. We used the Illumina short-read data for k-mer analysis with Jellyfish v.2.0 and GenomeScope v1.0.0 (Cold Spring Harbor Laboratory, Laurel Hollow, US) (Vurture et al. 2017) to estimate the features of the genome, including genome size, repeat content, and heterozygosity rate. The k-mer depth was estimated to identify a single-peak pattern in the k-mer frequency distribution.
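
The idea behind the k-mer based size estimate can be illustrated with a minimal Python sketch. It is not the Jellyfish or GenomeScope implementation, which fit a statistical model to the k-mer spectrum; it only applies the simpler rule of dividing the total number of k-mers by the depth of the main peak. The function names and the `min_depth` error cutoff are assumptions made for illustration.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return {depth: number of distinct k-mers observed exactly `depth` times}."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    Depths below `min_depth` are treated as sequencing errors and ignored;
    this cutoff is an assumption of the sketch, not a published setting.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=lambda d: usable[d])
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


if __name__ == "__main__":
    # Toy spectrum: an error tail at depth 1 and a single peak at depth 21.
    toy = {1: 1000, 20: 50, 21: 100, 22: 50}
    size, peak = estimate_genome_size(toy)
    print(f"peak depth = {peak}, estimated size = {size:.0f} k-mer positions")
```

On real data the histogram would come from a dedicated k-mer counter, and a single clear peak is what makes the division by peak depth meaningful.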

Genome assembly

De novo assembly was performed with two assemblers: SOAPdenovo2 v.2.04 (Luo et al. 2012) and MaSuRCA v.4.0.6 (Zimin et al. 2013). These assemblers were selected for their distinct methodological approaches, enabling a comparative evaluation to identify the most suitable algorithm for this genome. SOAPdenovo2 rapidly assembles short reads into contigs with high efficiency, while MaSuRCA employs a hybrid approach to generate longer contigs and scaffolds, particularly in complex genomic regions. Next, we used QUAST v.5.2.0 to compute assembly statistics. The MaSuRCA assembly showed better statistics and was therefore passed to Samba Scaffolder v.1.0 [21] for scaffolding and gap-filling. For the reference-based scaffolding, we used the C. chinense genome (GenBank: GCA_002271895.2). We then ran QUAST again on the output of the scaffolding. The assembly was validated using two approaches: (i) filtered paired-end Illumina reads were remapped to the assembly to detect errors using Bowtie2 v.2.4.2 (Langmead and Salzberg 2012) and SAMtools v.1.7 (Li et al. 2009), and (ii) the BUSCO strategy (Simão et al. 2015) was used to test the completeness of the genome assembly and gene space using a plant lineage-specific profile. This approach makes use of single-copy genes expected to be present in plants (4,104 genes). We used JCVI vecscreen (https://github.com/tanghaibao/jcvi), which uses the UniVec database (https://ftp.ncbi.nlm.nih.gov/pub/UniVec/), to detect vectors, and mapped the scaffolds against the nt/nr database of the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/) using BLAST v.2.2.26 (Altschul et al. 1990) to identify contamination. Mitochondrial DNA sequences were removed after BLAST searches against databases of mitochondrial sequences. Finally, we discarded contaminated scaffolds and vectors before submitting the assembly to the NCBI database.
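
The assemblies were compared mainly through contiguity statistics such as N50. As a minimal sketch of what that metric means (not QUAST's code; the function names are illustrative), N50 is the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the assembly length:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def assembly_summary(lengths):
    """Collect the basic contiguity figures usually reported for an assembly."""
    return {
        "sequences": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Hypothetical contig lengths in bp, for illustration only.
    print(assembly_summary([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because a few very long sequences dominate the running total, reference-guided scaffolding can raise N50 sharply without changing the number of assembled bases much.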

Repeat annotation and SSR Data mining

To identify repetitive elements, we used de novo and homology-based methods. For the de novo approach, we used RepeatModeler v.1.08 (Abrusán et al. 2009) with default parameters to generate a de novo ACP repeat library, which was subsequently used in RepeatMasker v4.0.7 (Bedell et al. 2000) to annotate repeats. For the homology-based approach, we used Repbase v4.0.7 (Bao et al. 2015), RepeatMasker and RMBlast v2.2.27 (Korf 2004). All repeat results were merged, and the final summary was reported using RepeatMasker. SSRs were identified in the genome using the MISA Perl script (http://pgrc.ipk-gatersleben.de/misa/) (Beier et al. 2017) with the following settings: monomer (one nucleotide, n > 12), dimer (two nucleotides, n > 6), trimer (three nucleotides, n > 4), tetramer (four nucleotides, n > 3), pentamer (five nucleotides, n > 3), and hexamer (six nucleotides, n > 3). For SSR comparison, we added the genomes of Capsicum rhomboideum (GenBank: GCA_031232105.1), C. annuum (GenBank: GCA_002878395.3), C. baccatum var. pendulum (GenBank: GCA_030864225.1), C. pubescens (GenBank: GCA_030864215.1), C. chinense (GenBank: GCA_026120095.1), C. annuum (GenBank: GCA_021292125.1), C. annuum var. annuum (GenBank: GCA_033026575.1), C. annuum var. glabriusculum (GenBank: GCA_000950795.1), and C. baccatum (GenBank: GCA_002271885.2). These genomes were analyzed with the MISA script using the same parameters as for ACP.
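
The repeat-count thresholds listed above can be illustrated with a small regular-expression search. This is a simplified sketch of perfect-SSR detection, not the MISA script itself, and it does not handle compound SSRs. It assumes that "n > 12" means strictly more than 12 repeats, so the minimum counts in `MIN_REPEATS` are each one higher than the stated number; if the thresholds were meant inclusively, the table should be lowered by one.

```python
import re

# Minimum repeat count per motif length (assumption: "n > x" is strict).
MIN_REPEATS = {1: 13, 2: 7, 3: 5, 4: 4, 5: 4, 6: 4}


def is_redundant(motif):
    """True if the motif is itself a repeat of a shorter unit, e.g. 'ATAT'."""
    n = len(motif)
    for size in range(1, n):
        if n % size == 0 and motif == motif[:size] * (n // size):
            return True
    return False


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Matches whose motif reduces to a shorter unit are skipped so that a run
    of 'A' is reported once as a mononucleotide SSR. Overlapping edge cases
    are not resolved; this is a sketch only.
    """
    sequence = sequence.upper()
    hits = []
    for unit_len, minimum in sorted(min_repeats.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, minimum - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_redundant(motif):
                continue
            repeats = len(match.group(0)) // unit_len
            hits.append((match.start(), motif, repeats))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical toy sequence with one mono-, one di- and one trinucleotide SSR.
    toy = "G" + "A" * 13 + "C" + "AT" * 7 + "G" + "AGC" * 5 + "T"
    for start, motif, repeats in find_ssrs(toy):
        print(start, motif, repeats)
```

Applying one threshold table to every genome, as described above, is what makes the SSR counts comparable across species.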

Results

The whole-genome sequencing data were deposited in the Sequence Read Archive (SRA) database under accession number SRR21189431, BioSample SAMN30481898, BioProject PRJNA873099. The raw data comprised 1,062,221,314 paired-end reads with a mean length of 150 bp, a GC content of 35%, and a total output of 330.467 GB of sequencing data. After trimming, we retained 95% of the sequencing data: 1,008 million high-quality reads totaling approximately 313 GB. We did not detect overrepresented sequences and trimmed adapters. The average quality per read was 36.

Genomic survey

The genome showed low heterozygosity (0.078429%) and was slightly repetitive (37.6%), and the estimated genome size (2.358 Gb) was relatively close to that of the reported reference genome of C. chinense (GenBank: GCA_002271895.2; 3.07 Gb) (Table 1). Based on the estimated draft genome size, subsequent de novo assembly and genome annotation were performed with a sequencing depth of approximately 63.06X (Fig. 1).
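
Sequencing depth relates the amount of sequence data to the genome size. One common definition, sketched below, divides the total number of sequenced bases by the estimated genome size. The manuscript does not state how its depth figure was derived, so the function name and the example inputs are hypothetical and serve only to show the arithmetic.

```python
def sequencing_depth(read_count, mean_read_length_bp, genome_size_bp):
    """Return average depth as total sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return read_count * mean_read_length_bp / genome_size_bp


if __name__ == "__main__":
    # Hypothetical values: 1 million reads of 150 bp over a 3 Mb genome.
    depth = sequencing_depth(1_000_000, 150, 3_000_000)
    print(f"{depth:.1f}X")
```

Depth can also be read from the main peak of the k-mer spectrum, which generally gives a lower value than the base-level calculation because each read contributes fewer k-mers than bases.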

Table 1 Genomic survey of Arnaucho chili pepper genome

Fig. 1 Distribution of K-mers in the draft genome of Arnaucho chili pepper

Assembly de novo and validation

We obtained different assemblies from SOAPdenovo2 and MaSuRCA. We continued our analysis with MaSuRCA because it produced a better N50 and longer contigs (Supplementary Data 1). The MaSuRCA assembly was scaffolded, yielding a total length of 2.98 Gb in 11,930 contigs (≥ 1000 bp) with a GC content of 34.99%. The longest contig was 275,254,545 bp. In addition, 99.33% of the raw paired-end reads generated from the small-insert libraries aligned to our final assembled genome. The scaffolding approach also improved the N50 from 13.22 kb to 237.34 Mb (Table 1). For the Solanales order, the number of complete BUSCOs (Benchmarking Universal Single-Copy Orthologs) increased from 5201 to 5895 (Table 2), and the number of fragmented BUSCOs decreased from 951 to 139. The genome sequence is openly available in GenBank (NCBI) under the accession number JAPEIP000000000.1.

Table 2 Statistics of the completeness of the de novo assembly of Arnaucho chili pepper genome

Simple sequence repeat data mining

The predominant microsatellite motif in the Arnaucho genome comprised mononucleotide repeats, constituting 57.74% of the total SSRs. Dinucleotide repeats accounted for 28.52%, trinucleotide repeats for 14.69%, and tetranucleotide repeats for 1.80%. This distribution closely mirrors the microsatellite motif patterns observed in other pepper species, including GCA_000710875.1, GCA_002271885.2, GCA_000950795.1, GCA_033026575.1, GCA_030864215.1, GCA_030864225.1, GCA_002878395.3, and GCA_031323105.1 (Fig. 2).

Fig. 2 Distribution of SSRs by repeat type and total SSRs by type in the Arnaucho chili pepper genome compared to other species

A total of 590,103 microsatellite loci were identified in the assembled Arnaucho draft genome sequence, showing a frequency of 0.14 SSR/Mb. This frequency is comparable to that of GCA_000950795.1 (0.18 SSR/Mb) but lower than those of GCA_000710875.1 (0.2 SSR/Mb), GCA_002271885.2 (0.19 SSR/Mb), GCA_033026575.1 (0.22 SSR/Mb), GCA_030864215.1 (0.22 SSR/Mb), GCA_030864225.1 (0.23 SSR/Mb), GCA_002878395.3 (0.2 SSR/Mb), and GCA_031323105.1 (0.23 SSR/Mb) (Table 3). Additionally, the composite matrix of the Arnaucho genome contains a total of 76,301 SSRs, which is lower than in all the other genomes compared.

Table 3 Summary of the repetitive DNA of the Arnaucho chili pepper genome

Discussion

The selection of sequencing technique and assembly program played a crucial role in generating the assembled genome of Arnaucho chili pepper and ensuring its quality. The importance of assembly program selection for the quality of the assembled genome has been previously highlighted (Forouzan et al. 2018), and in this study the contrast between SOAPdenovo2 and MaSuRCA revealed significant differences in the quality and extent of the assembled genome. MaSuRCA produced an improved N50 and longer contigs than SOAPdenovo2, so the analysis continued with the MaSuRCA assembly. This observation underscores the significant influence of assembly program selection on the quality and extent of the assembled genome (Martínez-García et al. 2016), emphasizing the importance of this stage in obtaining a representative and reliable genome for subsequent analysis and annotation.

The genomes of Capsicum annuum and C. chinense exhibited a notably high proportion of divergence, indicating considerable genetic variation between these species. Additionally, the Arnaucho chili pepper genome showed a higher abundance of transposable elements (TEs), including long terminal repeat (LTR) elements and elements from the Caulimoviridae family, compared to the pepper genome. These TEs are widely dispersed throughout the Arnaucho chili pepper genome and significantly contribute to its expansion. Furthermore, there is a differentiation in the accumulation of Gypsy and Copia elements between both species, potentially contributing to the speciation of the Arnaucho chili pepper (Kim et al. 2014).

The analysis of repetitive DNA in the Arnaucho chili pepper genome reveals complex and diverse dynamics of retrotransposable elements. Retroelements represent a significant fraction of repetitive sequences, with LTRs being the major fraction, occupying between 20% and 80% of the genome depending on the species (Yañez-Santos et al. 2021). The prevalence of TEs, notably LTR retrotransposons, underscores the substantial impact of these elements on the repetitive sequences. Furthermore, the observed variability in the proportion of LTRs among species underscores the genomic diversity within Capsicum (De Assis et al. 2020). This variation in retroelement quantity has been identified as one of the main contributors to the variability in genome size within the botanical family Solanaceae (Park et al. 2011). The presence of complete and intact retroelements, which constitute a minor fraction of the genome, reveals their ability to induce changes in genomic structure and promote genome divergence and evolution (Yañez-Santos et al. 2021). Notably, the Gypsy retrotransposon emerges as one of the most abundant elements in the repetitive DNA of the Arnaucho chili pepper, reinforcing its significance in the genomic configuration and structure of this species. These findings align with comparative research between chili pepper and tomato, underscoring the importance of differential accumulation of repetitive elements in the expansion of euchromatic regions (De Assis et al. 2020).

The uniformity in the distribution of microsatellite motifs shared with other pepper cultivars such as Pepper Zunla 1, Pepper Chiltepin, and those mentioned previously, suggests coherence in the genomic patterns of microsatellites within the Capsicum genus. This could indicate a conservation of these genetic patterns, revealing a crucial aspect for genomic stability (Li et al. 2002) within the Capsicum genus.

The microsatellite frequency in this genome, expressed as 0.14 SSR/Mb, provides an enlightening measure of the density of these sequences in its genome. This variable becomes even more significant when contrasted with other cultivars, highlighting variability in the abundance of microsatellites among the studied species, suggesting potential differences in genomic structure and genetic diversity (Srivastava et al. 2019; Fischer et al. 2017) among pepper species. Regarding the lower number of SSRs in the Arnaucho genome compared to other pepper cultivars, several factors can be considered. It is possible that evolutionary processes specific to the Arnaucho cultivar have led to a reduction in the number of SSRs, potentially through selection pressures or genetic drift (Bagshaw 2017). Further research into the genomic and evolutionary mechanisms underlying these differences could provide valuable information on the genetic diversity and adaptation of pepper cultivars. 

Conclusion

The selection of sequencing techniques and assembly programs played a pivotal role in shaping the quality and comprehensiveness of the assembled Arnaucho chili pepper genome. The preference for the MaSuRCA assembler was based on superior metrics such as N50 and contig length, highlighting the significance of program choice in obtaining a reliable genome for subsequent analyses. Additionally, the consistent distribution of microsatellite motifs across diverse Capsicum species, including Arnaucho, hints at stable genomic patterns within this genus, potentially preserving essential genetic elements for genomic stability. Furthermore, the observed variability in microsatellite abundance, highlighted by a microsatellite frequency of 0.14 SSR/Mb compared to other studied species, implies potential distinctions in genomic structure and genetic diversity among pepper species. These findings offer insights into the genomic intricacies of Capsicum species, providing a foundation for further exploration within this field.