Sample preparation and Illumina sequencing
RNA-seq was performed on ripe fruit of Rubus sp. var. Lochness to gather information about genes expressed at the time and place most important to breeders of blackberry. Traits such as colour (anthocyanin biosynthesis), sweetness (sugar metabolism) and healthfulness (polyphenol metabolism) are determined by metabolic pathways active in ripe fruit. Total RNA was isolated from two independent samples of ripe fruit, Ripe Fruit1 (RF1) and Ripe Fruit2 (RF2), to characterize the Rubus sp. transcriptome and enhance sequence coverage. After cleaning and quality checks, two independent rounds of Illumina sequencing (RF1 and RF2) generated 44,166,280 and 45,562,458 clean reads, encompassing 4,416,628,000 and 4,556,245,800 nucleotides (nt), respectively (Table 1). These data sets are available in the EBI database (accession number: PRJEB6680).
The two independent samples (RF1 and RF2) were collected and, after DNase treatment, RNA integrity was confirmed by a triple check: Nanodrop™, the Experion™ Automated Electrophoresis System, and gel electrophoresis.
De novo assembly of sequence reads without a reference genome
Reads were assembled using Trinity, and the resulting sequences were then clustered using the TIGR Gene Indices clustering tools (TGICL). TGICL was used to join further sequences and remove any redundant sequences.
Assembly generated 68,768 and 68,357 raw sequences for RF1 and RF2, respectively; after clustering, 41,770 and 41,881 consensus sequences were obtained (Table 1).
Gene family clustering divided the consensus sequences into two classes. One class comprised clusters, which were given the prefix CL followed by the cluster id and the number of contigs in the cluster (Additional file 1: Table S1). Each cluster contained several consensus sequences that shared more than 70% similarity. The other class comprised singletons, which were given the prefix Unigene.
Altogether, considering both repetitions, 42,062 different consensus sequences were detected. Among them 21,903 were singletons, and 20,159 others were grouped into 7,610 different clusters.
The diagram in Additional file 2 shows the distribution of raw sequences and of the consensus sequence lengths ranging from 200 bp to more than 3,000 bp in both samples. The most abundant raw sequences were 200 bp (over 38,000) and the least abundant were 3000 bp (121); sequences over 3000 bp were grouped together. For the consensus sequences, the most abundant were 200 bp (over 7000), and the least abundant were 300 bp (150). The number of sequences decreased as the length increased (Additional file 2).
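The binning behind a length distribution of this kind can be sketched as follows. This is a minimal Python illustration, not the procedure used in the study: the 100 bp bin width, the function name `length_histogram` and its parameters are assumptions, and only the 200 bp lower bound and the pooling of sequences over 3000 bp come from the text.

```python
from collections import Counter


def length_histogram(lengths, bin_width=100, first_bin=200, cap=3000):
    """Count sequences per length bin; everything above `cap` is pooled.

    Bins are labelled by their lower bound (200, 300, ..., 3000), and
    sequences longer than `cap` share the single label ">3000".
    The bin width is an assumption, not a value stated in the text.
    """
    counts = Counter()
    for n in lengths:
        if n < first_bin:
            continue  # shorter than the smallest reported length class
        if n > cap:
            counts[f">{cap}"] += 1
        else:
            lower = first_bin + ((n - first_bin) // bin_width) * bin_width
            counts[lower] += 1
    return counts


# Hypothetical lengths, for illustration only.
example = length_histogram([200, 250, 299, 300, 3000, 3001, 5000, 150])
```

Applied to the raw and the consensus sequences of each sample separately, such a function would give the counts per length class that the diagram summarises.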
Consensus sequences were aligned with Blastdb using BLASTX (E-value < 0.00001). Sequence orientations were determined according to the best hit in the database. The orientation and CDS of sequences that had no BLAST hit were predicted using ESTScan.
Annotation and classification of Rubus sp. consensus sequences
For annotation, the consensus sequences were first searched using BLASTX against the NCBI ‘non-redundant’ database (Nr) using a cut-off E-value of 0.000001.
To find the maximum number of similar genes, after the Nr search the NCBI nucleotide database (NT), the Swiss Institute of Bioinformatics database (Swiss-Prot), the Kyoto Encyclopedia of Genes and Genomes (KEGG), Clusters of Orthologous Groups of proteins (COG) and Gene Ontology (GO) databases were also used, so that each gene was annotated against several databases. In each database, two criteria were applied: the alignment score and the E-value. The thresholds were set to discard alignments without statistical significance (NCBI: minimum score = 58, E-value = 0.000001; Swiss-Prot: minimum score = 30, E-value = 0.00001). Each gene was analyzed independently, and the annotation was made according to these criteria; these data are shown in Additional file 1. The KEGG PATHWAY database records networks of molecular interactions in cells, and variants of them specific to particular organisms. Pathway-based analysis helped to further understand the biological functions of genes. Pathway information for all annotated sequences was obtained from KEGG pathway annotations.
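The two-criterion filter on score and E-value can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Hit` record, its field names, the `THRESHOLDS` table and the choice to treat both limits as inclusive are hypothetical; only the numeric thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str     # consensus sequence id (hypothetical naming)
    subject: str   # database entry matched
    score: float   # alignment score
    evalue: float  # alignment E-value


# Thresholds quoted in the text; the dictionary keys are illustrative labels.
THRESHOLDS = {
    "NCBI": {"min_score": 58, "max_evalue": 1e-6},
    "Swiss-Prot": {"min_score": 30, "max_evalue": 1e-5},
}


def passes(hit, database):
    """True if the hit satisfies both criteria (limits assumed inclusive)."""
    t = THRESHOLDS[database]
    return hit.score >= t["min_score"] and hit.evalue <= t["max_evalue"]


def best_hits(hits, database):
    """Keep, for each query, the highest-scoring hit that passes the filter."""
    best = {}
    for h in hits:
        if not passes(h, database):
            continue
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    return best


# Hypothetical hits, for illustration only.
hits = [
    Hit("Unigene1", "a", 60, 1e-8),
    Hit("Unigene1", "b", 80, 1e-10),
    Hit("Unigene2", "c", 40, 1e-7),
    Hit("Unigene3", "d", 100, 1e-3),
]
ncbi = best_hits(hits, "NCBI")
swiss = best_hits(hits, "Swiss-Prot")
```

Because each database has its own thresholds, the same sequence can be annotated in one database and left without annotation in another, which is consistent with the differing annotation counts per database.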
COG is a database in which orthologous gene products are classified. Every protein in COG is assumed to have evolved from an ancestral protein, and the whole database is built on genes encoding proteins from species with complete genome sequences as well as the evolutionary relationships between bacteria, algae and eukaryotes. All consensus sequences were aligned to the COG database to predict and classify possible functions. Gene Ontology (GO) functional annotation was derived from the NR annotation. GO offers three ontologies: molecular function, cellular component and biological process. The basic unit of GO is the GO-term, and every GO-term belongs to one of these ontologies. Based on the NR annotation, the Blast2GO program was used to obtain the GO annotation of all consensus sequences. WEGO software was then used for GO functional classification and to understand the distribution of gene functions of the species at a macro level.
For functional annotation, 33,040, 32,762, 21,932, 20,134, 13,676 and 24,168 consensus sequences were annotated using the NR, NT, Swiss-Prot, KEGG, COG and GO databases, respectively; in total, 34,552 annotated sequences were identified. For protein prediction analysis, the number of CDS that mapped to the protein database was 32,540.
Among the annotated sequences, the species with the highest number of best hits were wild strawberry (Fragaria vesca subsp. vesca) (73.56% of matched genes) and peach (Prunus persica) (15.25% of matches) (Table 2). These results are consistent with expectation, since strawberry and peach are the species with sequenced genomes closest to Rubus sp., all belonging to the family Rosaceae.
Based on sequence homology, 24,168 Rubus sp. sequences were categorized into 40 functional groups belonging to the three main GO ontologies: molecular function, cellular component and biological process. Results showed a high proportion of genes in the categories “cellular process”, “metabolic process”, “cell”, “organelle”, “catalytic” and “binding”, with only a few genes related to “biological adhesion”, “immune system processes”, “growth”, “rhythmic process”, “nucleoid”, “antioxidant activity” and “nutrient reservoir activity”. No genes were clustered as “extracellular”, “virion”, “channel regulator activity”, “protein tag” or “translation regulator activity” (Figure 1).
To identify active biological pathways in ripe fruit of Rubus sp., the sequences were mapped to the reference canonical pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG). In total, 20,134 sequences were assigned to 128 KEGG pathways. The pathways with most representation were “metabolic pathways” (4,371 members), “Biosynthesis of secondary metabolites” (2,005 members), “plant-pathogen interaction” (1,471 members) and “RNA transport” (1,011 members) (Additional file 1). The 2,005 genes in the “Biosynthesis of secondary metabolites” category expressed in blackberry fruits will be useful for defining metabolic pathways for synthesis and turnover of compounds potentially beneficial to human health, and modifiable by plant breeding in blackberry.
To further differentiate the NCBI nucleotide sequences and assembled sequences at the protein level, COG classification was undertaken to analyse the NCBI sequences.
The 13,676 assembled sequences were divided into 25 clusters according to NCBI COG classification (Additional file 3). The groups with the highest representation were found in the clusters R “general function prediction only”, K “transcription” and L “replication, recombination and repair” (Additional file 3).
To determine differential expression, once all reads were assembled and annotated, the expression level of each gene was normalized to its length for each replicate (C1 and C2). The gene expression level was calculated using the RPKM method (Reads per kilobase transcriptome per million mapped reads), with the following formula: $\mathrm{RPKM} = \dfrac{10^{6}\,C}{N L / 10^{3}}$, which defines the expression of gene A, where $C$ is the number of reads that are uniquely aligned to gene A, $N$ is the total number of reads that are uniquely aligned to all genes, and $L$ is the number of bases in gene A. The RPKM method eliminates the influence of different gene lengths and sequencing discrepancies in the calculation of gene expression. Therefore, the calculated RPKM values can be directly compared between genes and, for any given gene, between samples. Normalized data from C1 were plotted against data from C2; low dispersion in the plot indicated high reproducibility of expression between samples. Gene expression levels showed high similarity between the biological replicates RF1 and RF2 (Additional files 3 and 4). Most genes showed no significant differences between the samples, suggesting the results were reliable.
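As a worked illustration of the RPKM normalization, the calculation can be sketched in Python. The function names `rpkm` and `rpkm_table` and the example counts and lengths are hypothetical; only the formula itself (10^6 · C divided by N · L / 10^3) comes from the text.

```python
def rpkm(c, n, l):
    """Return RPKM = 10^6 * C / (N * L / 10^3).

    c: reads uniquely aligned to the gene
    n: total reads uniquely aligned to all genes
    l: gene length in bases
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    return 1e6 * c / (n * l / 1e3)


def rpkm_table(counts, lengths):
    """Compute RPKM for every gene from per-gene read counts and lengths."""
    n = sum(counts.values())  # total uniquely aligned reads
    return {gene: rpkm(c, n, lengths[gene]) for gene, c in counts.items()}


# Hypothetical example: equal read counts, different gene lengths.
counts = {"geneA": 500, "geneB": 500}
lengths = {"geneA": 1000, "geneB": 2000}
values = rpkm_table(counts, lengths)
```

In this example geneB is twice as long as geneA and carries the same number of reads, so its RPKM is half that of geneA, which is the length correction the method is designed to provide.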
Finally, SSRs were detected using MISA software, using the sequences as a reference (Additional file 5). The predominant SSRs were dinucleotides (over 4000), followed by trinucleotides (over 3000), mononucleotides (1200), hexanucleotides (365) and similar numbers of tetra- and pentanucleotides (Additional file 5). Despite the importance of these sequences for predicting variability in different organisms, no further analysis of these data has been undertaken in the present study, but the sequences are available in the EBI databases (PRJEB6680) for use as markers for improvement of blackberry quality. Such SSRs will be useful as molecular markers for assaying functional diversity in natural populations or germplasm collections, for evolutionary studies and for breeding projects.
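The principle of SSR detection can be illustrated with a short regular-expression search in Python. This sketch is not MISA and does not reproduce its options; the function names and the minimum repeat numbers per motif length in `MIN_REPEATS` are assumptions chosen for illustration, not the settings used in this study.

```python
import re
from collections import Counter

# Assumed minimum number of repeats per motif length (illustrative values).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    found = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "ATAT");
            # they are reported by the pass for the shorter motif length.
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            found.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(found)


def count_by_motif_length(ssrs):
    """Tally SSRs by motif length (1 = mono-, 2 = di-, 3 = trinucleotide...)."""
    return Counter(len(motif) for _, motif, _ in ssrs)


# Hypothetical sequence containing (AT)7 and (CAG)5.
example = "GGG" + "AT" * 7 + "CCGT" + "CAG" * 5 + "T"
ssrs = find_ssrs(example)
```

Tallying the results by motif length, as `count_by_motif_length` does, gives the kind of per-class summary reported for di-, tri-, mono-, tetra-, penta- and hexanucleotide repeats.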
De novo assembly of sequence reads using the reference genome from strawberry
Since the species distribution of the NR annotation showed that 25,418 genes (77.5%) had the highest similarity with Fragaria vesca subspecies vesca, a reanalysis was carried out, aligning the blackberry reads to this reference genome to obtain a more accurate analysis of the ripe blackberry transcriptome.
Primary sequencing data produced by the Illumina HiSeq™ 2000 (raw reads) were subjected to quality control (QC) to determine whether a resequencing step was needed. Raw reads were filtered into clean reads and aligned to the reference sequences with SOAPaligner/SOAP2. The distribution of reads on the reference genes and the coverage were then analysed. The quality control was positive for both samples (Additional file 6), and therefore further analysis was undertaken.
The genome map rate and gene map rate were very low (lower than 7%) because, even though strawberry and blackberry both belong to the family Rosaceae, they are quite distinct species and the alignment using the SOAP software was very strict (no more than 5 mismatches were allowed in the alignment) (Table 3). The alignment parameters were strict because we wanted to detect only the most similar genes, in order to compare this analysis with that undertaken with the first strategy. Although the number of genes was not as high as expected (12,077 genes had high similarity to strawberry genes), a sufficient number were detected to allow comparative analyses. Gene Ontology (GO) enrichment analysis and pathway enrichment analysis were undertaken, but the results were neither as representative nor as complete as in the first analysis.
The expression levels of sequences were similar in both replicates RF1 and RF2 (Figure 2); only 31 genes had significantly different values (0.24%), suggesting highly reproducible results.
Single-nucleotide polymorphism (SNP) analysis was done with SOAPaligner/SOAP2. In samples RF1 and RF2, 67,521 SNPs and 67,845 SNPs were detected, respectively (Additional file 7).
Comparison of strategies used to analyse the blackberry transcriptome de novo
Our initial analysis strategy (alignment using BLASTX with any plant sequence in the databases) produced a large number of annotated genes: 34,552 from a total of 42,062 assembled genes (82.14% of genes). This provides a significant database for berry breeders. All the classifications (COG, KEGG, etc.) provide new tools and resources for research on fruit development and bioactives. However, functional assignment of genes based on similarity to genes in other plant species should be undertaken with caution, especially if the comparator species are taxonomically distant from blackberry, such as Populus balsamifera subsp. trichocarpa, Medicago truncatula and Lycopersicon esculentum, which all showed some genes with high similarity to those of blackberry (147, 139 and 107 genes, respectively).
The second analysis (alignment with the closest sequenced genome, Fragaria vesca subspecies vesca) resulted in a lower number of expressed genes, 12,077. Since very high stringency was set for this alignment with strawberry (on average fewer than 5 mismatches per gene), it is very likely that matched sequences have equivalent functions in the two species.
The combination of the two strategies for assembly and analysis of RNA-seq data adds value to the dataset for diverse applications.
Study of putative chimeras
Rubus sp. var. Lochness is a tetraploid hybrid, and consequently there is a risk of chimeric contigs arising when the NGS data are assembled. However, Trinity is reliable in assembling genes from different chromosomes and avoiding chimeras, especially when the hybrid has been derived from different species.
To test for putative chimeras, the CDS of one gene encoding Chalcone Synthase (CHS) was selected as a representative example because of its role in the biosynthesis of flavonols and anthocyanins, which accumulate to high levels in blackberries. The CDS was cloned from fresh tissue by designing primers (Additional file 8) for both ends of the known sequence; the CDS was cloned in pGEMT and several clones were sequenced.
All the sequences from the clones aligned with high scores (99%) with the two CHS contigs from the RNA-seq data; however, 33 nucleotides (2.9%) differed between the cloned sequences and the CHS contigs (Additional file 9). These differences could be due to SNPs or to errors introduced during amplification by PCR or during sequencing of the genes. These sequence differences were clustered around 500 nucleotides from each end of the CDS.
Despite the high reliability of the software used to align the sequences and distinguish homologs on different chromosomes, and despite our results, which suggest that CHS is not chimeric, this represents a single test case. A deeper analysis of more genes should be carried out to rule out the occurrence of chimeric genes resulting from mistakes in the alignment of transcripts in this tetraploid variety.
Expression of the contigs estimated by qRT-PCR
RNA-seq analysis showed that more than 13,000 genes of the blackberry transcriptome are clustered in different contigs. This could be problematic for primer design for RT-qPCR analysis, since design of primers that amplify only one of the contigs encoding a specific protein, instead of all the copies of that gene, could give misleading expression data. To check if this is a real problem, three pairs of primers were designed for the CHS gene. The first two pairs were designed using the zone with high SNP frequency between the two contigs encoding CHS in blackberry (the first 500 bp, Additional file 9). Consequently, these primers should monitor the transcript levels of each contig encoding CHS but not the combined expression of both genes (Additional file 8). The third pair of primers was designed within the sequence conserved between the two genes; accordingly this third pair should report the total expression of the CHS genes.
RT-qPCR showed that the expression reported by this third primer pair (Contig1 + 2) was equal to the sum of the RT-qPCR products of the two primer pairs which amplified Contig1 and Contig2 separately, during three stages of ripening of blackberry fruit (green, red and black) (Figure 3). These data illustrate how gene expression analysis is best undertaken for tetraploid varieties such as blackberry var. Lochness.
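The additivity check described here can be expressed as a small Python sketch. The function name `is_additive`, the 10% relative tolerance and the expression values per ripening stage are all hypothetical and serve only to show the comparison; they are not measurements from this study.

```python
def is_additive(contig1, contig2, combined, rel_tol=0.1):
    """True if contig1 + contig2 matches `combined` within rel_tol.

    The tolerance is an assumed value; in practice it would depend on
    the technical variability of the RT-qPCR assay.
    """
    return abs((contig1 + contig2) - combined) <= rel_tol * combined


# Hypothetical relative expression values per ripening stage:
# (Contig1 primers, Contig2 primers, shared Contig1 + 2 primers)
stages = {
    "green": (1.0, 2.0, 3.1),
    "red": (4.0, 5.5, 9.2),
    "black": (8.0, 6.0, 14.5),
}
report = {stage: is_additive(*values) for stage, values in stages.items()}
```

A stage for which the check fails would indicate that the contig-specific primers do not account for all CHS transcripts, or that one primer pair amplifies with a different efficiency.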
Although these studies represent assays of a single gene for chimeras, the degree of polymorphism between the two CHS contigs (10 mismatches per contig of 1200 nt) was such that this gene likely represents the top end of the problem, where single nucleotide differences would impact the encoded proteins, since on average 5 SNPs were found per contig. Consequently, chimeras in other contig pairs are less likely to affect the sequence of the encoded protein than in the case of CHS.