Background

Rubus is the largest genus in the family Rosaceae, with approximately 700 species. Plants are widely distributed in the temperate and sub-tropical regions of the Northern Hemisphere, with only a few species found in the Southern Hemisphere. More than 208 species are found in China, 139 of which are endemic [1]. Interest in the genus Rubus in China is growing, as reflected in the abundance of regional taxonomic treatments, new species, and new accounts [2,3,4,5,6,7,8,9,10,11,12].

Rubus has been used in traditional Chinese medicine since ancient times. The fruits are well-known in Asia, Europe, and North America and have a long history of usefulness in these regions, especially in some European countries. Some natural species, such as R. corchorifolius, are widely cultivated for their medicinal, edible, and ornamental value.

Rubus is well-known for its diversity in leaf shape, inflorescence architecture, reproductive pattern, and other features. The plant is typically armed with bristles, prickles, or glandular hairs; the leaf exhibits simple, palmate, or pinnate shapes; flowers are pentamerous and mostly bisexual; achenes are drupelets or drupaceous and are aggregated on a peduncle as semispherical, conical, or cylindrical fruits [1]. Rubus is one of the most difficult taxa of flowering plants to classify because of its highly variable morphology and its complicated apomixis, polyploidy, and hybridization [13, 14]. Therefore, it is essential to construct a phylogenetic tree using molecular evidence to better understand the relationships between species and improve the development and utilization of wild germplasm resources.

The most widely accepted taxonomic system of the genus Rubus was built by Focke [15,16,17]. Focke separated Rubus into 12 subgenera, the largest being Rubus (132 species), Idaeobatus (117 species), and Malachobatus (115 species). According to Flora of China (FOC), subgenus Rubus is found mainly in Europe and North America, while Malachobatus and Idaeobatus are typically found throughout Asia, especially China [1].

Several molecular phylogenetic studies have tried to resolve the genetic relationships of Rubus [13, 18,19,20,21,22]. Wang et al. (2016) selected three chloroplast markers (rbcL, rpl20-rps12, and trnG-trnS) and three nuclear markers (nrITS, GBSSI-2, and PEPC) to resolve the phylogenetic relationships of 142 Chinese Rubus species; the resulting phylogenies showed a certain degree of incongruence between the chloroplast and nuclear markers.

To date, there are few reports on the chloroplast (cp) genomes of Rubus [23,24,25,26,27,28,29,30]. Recently, a comparative analysis of the characteristics of the Rubus cp genome was reported and, together with about ten other species, was used to construct a phylogenetic tree, but the species sampled were distributed mainly in Taiwan [31]. Here, we sequenced, assembled, and comparatively analyzed eight cp genomes of Rubus. We also performed a phylogenetic analysis that combined them with 38 previously reported cp genomes downloaded from the organelle genome database at the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov) [32]. Our results, including gene content, size, nucleotide variable sites, identified SSRs, and phylogenetic analysis, may improve our understanding of cp genome structure in the genus Rubus and provide resources for genetic diversity and phylogenetic analyses in future studies.

Results and discussion

General features of Rubus chloroplast genomes

For the eight newly sequenced species, Illumina PE sequencing generated 3,408,285,600 (R. innominatus) to 9,832,178,700 (R. trianthus) clean reads, with mean coverage ranging from 831× in R. innominatus to 1229× in R. trianthus. The newly assembled plastomes of the eight Rubus samples each had a quadripartite structure forming a circular molecule, ranging from 155,546 bp (R. trianthus) to 156,321 bp (R. lambertianus) in length (Table 1). The eight cp genomes comprised a large single copy (LSC) region (85,028–85,883 bp) and a small single copy (SSC) region (18,710–18,874 bp), separated by two copies of inverted repeats (IRs) (25,761–25,994 bp) (Fig. 1, Table 1). The guanine-cytosine (GC) content of the eight cp genomes differed slightly, from 37.0% (R. trianthus and R. innominatus) to 37.3% (R. coreanus, R. hirsutus and R. parvifolius) (Table 1). The GC content of the coding sequence (CDS) ranged from 37.8 to 38.0%. Consistent with the GC content previously reported for Rubus in Taiwan, the GC content of the IR regions (42.8–42.9%) was higher than that of the LSC (34.8–35.2%) and SSC regions (30.9–31.4%) [31].
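As a rough illustration of how region-wise GC content follows from the quadripartite layout, the sketch below splits a linearised plastome into LSC, IRb, SSC and IRa and reports the GC percentage of each. It is not the procedure used in this study: the function names, the assumed region order and start position, and the toy sequence are all hypothetical.

```python
def gc_percent(seq: str) -> float:
    """GC content (%) computed over unambiguous bases (A, C, G, T) only."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def region_gc(genome: str, lsc_len: int, ir_len: int, ssc_len: int) -> dict:
    """Report GC% for LSC, IRb, SSC and IRa of a linearised plastome.

    Assumption (hypothetical): the sequence starts at the first base of the
    LSC and the regions occur in the order LSC, IRb, SSC, IRa.
    """
    # A quadripartite plastome has length LSC + SSC + 2 * IR.
    if len(genome) != lsc_len + ssc_len + 2 * ir_len:
        raise ValueError("region lengths do not add up to the genome length")
    irb_end = lsc_len + ir_len
    ssc_end = irb_end + ssc_len
    bounds = {
        "LSC": (0, lsc_len),
        "IRb": (lsc_len, irb_end),
        "SSC": (irb_end, ssc_end),
        "IRa": (ssc_end, len(genome)),
    }
    return {name: gc_percent(genome[a:b]) for name, (a, b) in bounds.items()}


# Toy sequence only; real plastomes are roughly 155 kb long.
toy = "AT" * 5 + "GCGC" + "AAAT" + "GCGC"
print(region_gc(toy, lsc_len=10, ir_len=4, ssc_len=4))
```

With real data, the region lengths would come from the annotation (for example the values in Table 1) rather than being typed in by hand.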

Table 1 Accession numbers and features of the eight Rubus plastomes in the present study
Fig. 1 Gene maps of the complete cp genomes of eight species of Rubus. Genes on the inside of the circle are transcribed clockwise, while those on the outside are transcribed counterclockwise. Genes are colored according to their functional groups. The darker gray in the inner circle corresponds to GC content, whereas the lighter gray corresponds to AT content

In total, 134 genes were annotated in the plastomes of Rubus tephrodes, R. coreanus, R. trianthus and R. hirsutus, including 89 CDS, 37 transfer RNA genes (tRNAs), and eight ribosomal RNA genes (rRNAs). Because the infA gene was not annotated, the four other Rubus species (R. lambertianus, R. parvifolius, R. innominatus and R. hunanensis) contained 133 genes, including 88 protein-coding genes, eight rRNAs, and 37 tRNAs. Overall, 18 or 16 genes were duplicated in the IR regions: seven or five protein-coding genes, seven tRNAs and four rRNAs (Table 1 and Table 2). Seventeen genes contained one intron (rps16, rpoC1, petB, petD, rpl16, rpl22, rpl2, ndhA, ndhB, and eight tRNA genes) and two genes contained two introns (clpP and ycf3) (Table S1). The trnK-UUU gene had the longest intron, at 2488–2518 bp, in the eight Rubus plastomes.

Table 2 Gene contents in the cp genomes of Rubus species

IR contraction and expansion

Chloroplast genome structures, including gene content and order, were compared and analyzed among the eight Rubus species (Table 2 and Table S1). The results showed that the eight newly assembled chloroplast genomes were relatively conserved at the boundaries of the four regions (LSC, SSC and two IRs) (Fig. 2). The length of the IR region in the 46 Rubus plastomes ranged from 25,758 to 25,993 bp, versus 26,238 bp in Fragaria (Fig. S1). The LSC-IRb border was located between the genes rps19 and rpl2 in ten chloroplast genomes, within the rps19 gene in Rubus niveus, and between the genes trnH and rpl2 in R. leucanthus. Three types of SSC-IRa borders were detected among the twelve plastomes. In R. corchorifolius and R. boninensis, the ycf1 gene was situated in the IRa region, 191 bp and 2 bp from the SSC-IRa border, respectively. In R. tephrodes, R. niveus and R. coreanus, the ycf1 gene was found entirely in the SSC region, 0–192 bp away from the SSC-IRa border. In the other seven plastomes, the SSC-IRa border was located in the coding region of ycf1. The length of the ycf1 gene ranged from 4437 bp to 5750 bp, and the gene was typically found in the SSC region. The ndhF gene was located in the SSC region at the SSC-IRb border in all but two species (R. takesimensis and Fragaria chiloensis). The LSC-IRa border was between the rpl2 and trnH genes. The trnH gene was found in the LSC region, as has also been reported in other dicots [33, 34].

Fig. 2 Comparison of the four regions (LSC, SSC and two IRs) among twelve cp genomes. Numbers above or near the colored genes indicate distances between the gene and the region border. The figure is not to scale

Comparative plastome sequence divergence and hotspot regions

Collinearity detection was carried out to analyze and compare the chloroplast genomes. Mauve alignment of the plastomes showed no gene rearrangements within the chloroplast genomes of the eight Rubus species (Fig. 3). To determine interspecific divergence, the eight newly assembled chloroplast genomes were compared in mVISTA, using the annotated R. tephrodes as the reference cp sequence (Fig. 4). The results show that the inverted repeat regions were more stable than the single copy regions, which is consistent with other studies [25, 35, 36]. The most divergent regions were the intergenic spacers, including rps16-trnQ, trnL-trnT, and rpl32-trnL-ccsA.

Fig. 3 Mauve alignment of eight Rubus cp genomes revealing no interspecific rearrangements

Fig. 4 Visualized alignment of the Rubus cp genome sequences with annotated R. tephrodes as reference, using mVISTA. The x-axis represents the base sequence of the alignment and the y-axis represents the pairwise percent identity within 50–100%. Grey arrows represent positions and directions of the genes

The nucleotide variability (Pi) averaged 0.008 and ranged from 0 to 0.0313. The SSC and LSC regions were highly variable, whereas the IR regions were relatively conserved (Fig. 5). Twelve regions (trnK-rps16, rps16-psbK, psbI-trnS-trnG, trnG-atpA, petN-psbM-trnD, trnE-psbD, rps4-trnL, petA-psbF, rpl16-rps3, ndhF, rpl32-trnL-ccsA and ycf1) were highly variable, with Pi values > 0.02; the first nine are located in the LSC region and the remaining three in the SSC region. To date, only two of these highly variable regions, trnS-trnG and ndhF, have been used to reconstruct the phylogeny of the genus Rubus (other regions used in previous studies: rpl16, trnL-trnF, rbcL, rpl20-rps12) [18, 19, 21, 22, 37,38,39]. The highly variable regions detected by comparing entire chloroplast genomes may be useful markers for further phylogenetic study.

Fig. 5 Sliding window analysis of the entire cp genome of eleven Rubus species (window length: 600 bp; step size: 200 bp). X-axis: position of the midpoint of a window; Y-axis: nucleotide diversity of each window

Synonymous (Ks) and nonsynonymous (Ka) substitution rate analysis

The ratios of nonsynonymous (Ka) to synonymous (Ks) substitutions were calculated for the shared unique protein-coding genes (PCGs) in the eight Rubus cp genomes, with R. tephrodes as the reference (Fig. 6, Table S2). For 31 of the 79 shared genes, the ratio could not be calculated, either because the sequences were identical or because nonsynonymous or synonymous substitutions were absent. Most of the Ka/Ks ratios were less than one, except for rpl22 in R. trianthus (1.1892) and rpl16 in R. innominatus (1.27177), R. parvifolius (1.27177), R. lambertianus (1.27177) and R. hunanensis (1.27177). This is consistent with the general expectation that the Ka/Ks ratio of most genes is less than one [40]. The results also suggest that the two exceptions (rpl16 and rpl22) may be under positive selection and that some mutations of these two genes in these species may be advantageous.

Fig. 6 Ka/Ks values of protein-coding genes of the seven comparative combinations. Ka, nonsynonymous; Ks, synonymous

SSR polymorphisms and long repeat structure

MISA was used to detect simple sequence repeats (SSRs) in all 46 Rubus plastomes (Fig. 7, Table S3). In total, 2243 SSRs were found in the 46 plastomes of Rubus, of which 3621 SSRs (83.55%) were located in the LSC region, 671 SSRs (15.48%) in the SSC region, and 42 SSRs (0.97%) in the IR region. The number of SSRs detected among the 46 species ranged from 38 (R. parvifolius) to 63 (R. trianthus) (Table S4). Mononucleotide repeats were the most frequently identified SSRs. A/T repeats were the most abundant, while AT/TA and AAT/TAA repeats were the most common dinucleotide and trinucleotide types, respectively. These results show that A/T repeats are common in the cp genomes, consistent with previous studies [35, 41,42,43]. The SSRs are potential molecular markers for genetic diversity and phylogenetic studies of Rubus and its related species.

Fig. 7 Number and type of simple sequence repeats in the 46 Rubus cp genomes. A, Total number of SSRs detected in each species. B, Frequencies of identified SSRs in the large single-copy (LSC), small single-copy (SSC), and inverted repeat (IR) regions. C, Numbers and types of SSRs detected in each species

In total, 2300 long repeat structures were identified in the 46 plastomes (Fig. 8, Table S5), including 895 (38.91%) forward repeats, 907 (39.43%) palindromic repeats, 457 (19.87%) reverse repeats and 41 (1.78%) complement repeats. Most of these repeats (1648, 71.65%) were distributed in the non-coding regions.
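The four repeat classes can be made concrete with a small sketch. The code below is not REPuter and does not search a genome; it only classifies how one fragment relates to another of equal length, allowing up to three mismatches (the Hamming distance setting given in the methods). The function names and the 30 bp example fragment are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    return sum(x != y for x, y in zip(a, b))


def repeat_type(first: str, second: str, max_mismatch: int = 3):
    """Classify `second` relative to `first`, or return None if unrelated.

    forward      second matches first as written
    reverse      second matches first read backwards
    complement   second matches the complement of first
    palindromic  second matches the reverse complement of first
    """
    candidates = {
        "forward": first,
        "reverse": first[::-1],
        "complement": first.translate(COMPLEMENT),
        "palindromic": first[::-1].translate(COMPLEMENT),
    }
    for name, expected in candidates.items():
        if hamming(expected, second) <= max_mismatch:
            return name
    return None


# Hypothetical 30 bp fragment (the minimum repeat size used in this study).
frag = "AAGCTTGGACCTGAAGTCCGATTGCAGGCA"
print(repeat_type(frag, frag[::-1].translate(COMPLEMENT)))
```

For very short fragments a three-mismatch tolerance makes the classes overlap, which is one reason a minimum repeat size is applied.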

Fig. 8 Number and type of long repeat sequences in the 46 Rubus cp genomes. A, Numbers and types of longer repeats in each species. B, Frequency of each repeat type. C, Presence of longer repeats in coding regions and non-coding regions

Phylogenetic analyses

Fifty-one Rubus chloroplast genomes and seven other plastomes of Rosaceae were used to examine the utility of the chloroplast genome in phylogenetic analysis. The maximum likelihood (ML) tree constructed with RAxML and the Bayesian inference (BI) tree constructed with MrBayes were topologically congruent and highly supported (Fig. 9). The relationships among the tribes within Rosaceae were congruent with previously reported results [25]. The tree also strongly supported the monophyly of the genus Rubus. The Rubus clade showed that some species of Idaeobatus were likely the original taxa and that the polyploid group may have originated from those primitive species. Other sections of the tree may have evolved from these taxa via different evolutionary events. The main objective of our study was to test the discriminatory power of chloroplast genome sequences in the genus Rubus. Additional studies with broader sampling strategies are needed to test how well the regions identified by our study clarify the phylogeny of the genus Rubus.

Fig. 9 The maximum likelihood (ML) and Bayesian inference (BI) phylogenetic tree based on whole chloroplast genome data. The support values (bootstrap value [BS]/posterior probability [PP]) are indicated at the branches. BS and PP of 100% are indicated by an asterisk

Materials and methods

Material sampling, DNA isolation, and sequencing

Eight species (Rubus tephrodes, R. coreanus, R. trianthus, R. lambertianus, R. hirsutus, R. parvifolius, R. hunanensis and R. innominatus) were chosen for sequencing because of their special taxonomic status and because their plastomes were absent from, or not well assembled in, NCBI. Young, disease-free leaves of wild seedlings were collected and quickly dried with silica gel (Table 3). Voucher specimens were collected for each species and deposited in the herbarium of Huanggang Normal University (formerly, Herbarium of Biology Department of Huanggang Teachers College, HGTC). The specimens were identified by Prof. Hongjin Dong (Huanggang Normal University) before being deposited. Total genomic DNA was extracted from the leaf tissues using the CTAB method [44] with minor modifications and stored at −80 °C. The total genomic DNA was sheared into fragments of about 300 bp to construct libraries according to the manufacturer’s instructions (Illumina, Hayward, CA, USA). Sequencing was performed on the Illumina HiSeq 2500 Sequencing System at BGI-Wuhan.

Table 3 Sampled species and their voucher specimens used in this study

Chloroplast genome assembly and annotation

Raw reads containing adapter sequences or low-quality sequences were filtered out using the SOAPnuke software developed by BGI [45]. The high-quality PE reads were then used for subsequent analyses. The chloroplast genomes were assembled de novo with the GetOrganelle pipeline (http://github.com/Kinggerm/GetOrganelle; [46]). The output assembly graph file (“gfa”) was checked in Bandage v. 0.8.1 [47], and the final sequence paths were selected where the minimum contig depth was above 100× and the minimum length was > 300 bp. To check the assembled cp sequences for errors, raw sequencing reads were mapped to the assembled plastomes using the Bowtie2 [48] plug-in in Geneious ver. 8.0.2 [48]. The assembled cp genome sequences of the eight Rubus samples were annotated using the Perl script PGA [49]. The annotation results were summarized and the final annotations were manually checked using Geneious ver. 8.0.2 [48]. The assembly and annotation were completed by mapping to the reported plastomes of other well-annotated Rubus species. The whole cp sequences with annotation information were submitted to GenBank under accession numbers MT478113–MT478115 and OK127882–OK127886. The physical maps of the annotated cp genomes were drawn using the online program OGDRAW [50].

Comparative plastome sequence divergence analysis

Gene order in the newly assembled Rubus plastomes was compared using the Mauve v.1.0.0 [51] plugin in Geneious v.8.0.2 [48]. We compared the complete plastomes of the eight Rubus species using mVISTA in Shuffle-LAGAN mode [52], with R. tephrodes as the reference. To compare contraction and expansion of the inverted repeat (IR) regions, detailed information on the boundaries between the IR and single copy (SC) regions was obtained manually in Geneious [48]. The chloroplast genome sequence data sets used for the final analysis were aligned using the Windows version of MAFFT [53]. The output data matrix was visualized and manually edited using Geneious [48] or BioEdit [54]. The nucleotide diversity (Pi) of the plastome sequences was calculated across the whole cp genomes using DnaSP v. 6.10 [55]. We used a step size of 200 bp and a window length of 800 bp for the sliding window analysis.
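The sliding-window calculation can be sketched as follows. This is a simplified stand-in for DnaSP, not its implementation: Pi is taken here as the mean number of pairwise differences per site, and columns containing gaps or ambiguous bases are skipped, which is an assumption about gap handling. The function names and the toy alignment are hypothetical, and the window and step sizes are left as arguments.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site for equal-length aligned sequences.

    Assumption: alignment columns with gaps or ambiguous bases are excluded.
    """
    n = len(seqs)
    if n < 2:
        return 0.0
    valid = [i for i in range(len(seqs[0])) if all(s[i] in "ACGT" for s in seqs)]
    if not valid:
        return 0.0
    diffs = sum(
        sum(1 for i in valid if a[i] != b[i]) for a, b in combinations(seqs, 2)
    )
    pairs = n * (n - 1) // 2
    return diffs / (pairs * len(valid))


def sliding_pi(seqs, window, step):
    """Return (window midpoint, Pi) for each full window along the alignment."""
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        block = [s[start:start + window] for s in seqs]
        results.append((start + window // 2, nucleotide_diversity(block)))
    return results


# Toy alignment of three sequences; real input would be the MAFFT alignment.
alignment = ["AAAA", "AAAT", "AATT"]
for midpoint, pi in sliding_pi(alignment, window=2, step=1):
    print(midpoint, round(pi, 4))
```

Windows whose Pi exceeds a chosen threshold (0.02 in the results above) would then be reported as candidate hotspot regions.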

Gene selective pressure analysis of eight Rubus cp PCGs

To analyze variation in the evolutionary rates of chloroplast genes, KaKs_Calculator 2.0 was used to calculate the nonsynonymous substitution rates (Ka), synonymous substitution rates (Ks), and their ratios (Ka/Ks). Before calculation, the shared unique protein-coding gene sequences (PCGs) were aligned in MEGA [56] (version 10.1.6) using MUSCLE (codons) mode. The gene data matrix was then saved in Clustal (.aln) or Phylip (.phy) format.

Simple sequence repeats and repeat structure analysis

The Perl script MISA [57] was used to identify microsatellites (mono-, di-, tri-, tetra-, penta-, hexanucleotide repeats), with the following parameters (unit size, min repeats): 10 for mononucleotide, 5 for dinucleotide, 4 for trinucleotide, and 3 for tetra-, penta-, and hexanucleotide. The online REPuter program was used to detect four types of long repeat sequences (forward, reverse, palindromic and complement) in Rubus plastomes with a hamming distance of 3 and a minimum repeat size of 30 bp [58].
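The SSR thresholds above can be illustrated with a short regular-expression sketch. It is not MISA itself and will not reproduce its output in every edge case (compound SSRs, for example, are not handled); the function names and the toy sequence are hypothetical. Motifs that are themselves repeats of a shorter unit, such as "AA" or "ATAT", are skipped so that each repeat is reported once under its shortest unit.

```python
import re

# Minimum number of repeats per unit size, as stated in the methods.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_redundant(motif: str) -> bool:
    """True if the motif is itself a repeat of a shorter unit (e.g. 'ATAT')."""
    return any(
        len(motif) % k == 0 and motif == motif[:k] * (len(motif) // k)
        for k in range(1, len(motif))
    )


def find_ssrs(seq: str):
    """Return sorted (start, motif, repeat_count) tuples, 0-based start."""
    seq = seq.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        # One unit followed by at least (min_rep - 1) further copies of it.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_redundant(motif):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)


# Toy sequence containing one mono-, one di- and one trinucleotide SSR.
toy_seq = "GGC" + "A" * 12 + "CG" + "AT" * 6 + "GCC" + "AAT" * 4 + "G"
for start, motif, count in find_ssrs(toy_seq):
    print(start, motif, count)
```

Counting the hits per plastome region would give regional SSR tallies of the kind summarized in Fig. 7.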

Phylogenetic analyses

The newly assembled cp genomes of Rubus, together with those of related taxa downloaded from NCBI, were aligned with MAFFT for phylogenetic analysis [53]. In total, fifty-eight plastomes were used to construct the phylogenetic tree (Table S3). RAxML (Version 8 for Windows) was used to run the maximum likelihood (ML) analysis [59] with 1000 bootstrap replicates. The general time-reversible (GTR) model with gamma-distributed rate heterogeneity was used with default settings. The Bayesian inference (BI) tree was generated in MrBayes version 3.2 [60] as implemented on the Cyberinfrastructure for Phylogenetic Research (CIPRES) Science Gateway (http://www.phylo.org/, [61]) using the default settings. The best model was determined for each sequence partition after comparing 24 models of nucleotide substitution using jModelTest v.2.1.10 [62]. FigTree v1.4 [63] was used to visualize and adjust the ML trees. The graph generated from FigTree was further revised with Adobe Illustrator (Adobe Systems, Mountain View, CA, USA).

Conclusions

The complete chloroplast sequences of Rubus tephrodes, R. coreanus, R. trianthus, R. lambertianus, R. hirsutus, R. parvifolius, R. hunanensis and R. innominatus, of the sections Lampobatus and Idaeobatus, were reported in this study. The comparative analysis of forty-six Rubus plastomes indicated that the structure was relatively conserved. However, the SSC-IR and LSC-IR borders were variable among the chloroplast genomes, and the IR region was less variable than the SC regions. We identified the locations of SSR sites and highly variable regions, which may be used as markers in future studies of Rubus species. The ML and BI phylogenetic trees constructed from whole chloroplast sequences illustrated the phylogenetic relationships and were consistent with the results of previous studies. Our results indicate that the whole plastome may be used as a reliable marker in phylogenetic studies of this genus.