Introduction

Cork oak (Quercus suber L.) is a forest tree species native to the Mediterranean region with a significant economic importance. Cork trees are found in human managed cork oak woodlands (known as “montado” in Portugal, and “dehesa” in Spain); these are important habitats harboring a wide range of biodiversity, including many rare, endemic, and endangered plant and animal species (Berrahmouni et al. 2009). The long lifespan and high activity of the cork cambium in this species allow the sustainable exploitation of its outer bark (cork) for a wide range of industrial applications (Leite and Pereira 2017), harnessing its unique physical and chemical properties.

The Mediterranean region is considered one of the most affected by climate change, and the intensification of extreme climatic events is imposing a significant pressure on cork productivity and tree survival (Nunes et al. 2021; Pérez-Girón et al. 2022). Events of sudden death or progressive decline have long been reported in cork oak and are thought to become more frequent due to the combined occurrence of extreme drought coupled with biotic stress (Camilo-Alves et al. 2017). Recent advances in silviculture practices have been proposed as a way to minimize the impact of these extreme events on tree development in short term (Camilo-Alves et al. 2020); however, a comprehensive knowledge of the cork oak genetic heritage is critical to plan long-term strategies for plant improvement.

The draft nuclear genome of cork oak was made available in 2018 (Ramos et al. 2018), providing a reference resource for genomic studies in this species, and the third genome assembly available for the Quercus genus (Sork et al. 2016; Plomion et al. 2018). The three genome assemblies revealed a high similarity, despite the long evolutionary distance (approx. 35 MYA) from a common ancestor (Hipp et al. 2020; Sork et al. 2022). The draft cork oak genome assembly was produced using Illumina paired-end (PE) and mate-pair (MP) sequencing. It was fragmented in 23,344 scaffolds, with 94.6% of the genome represented in 4730 scaffolds larger than 10 Kbp, including 79,752 genes with complete open reading frames (Ramos et al. 2018). The predicted size was 953.3 Mbp in close agreement with previous estimates using flow cytometry (Zoldos et al. 1998).

The availability of a reference genome for cork oak paved the way for multiple transcriptomic studies aimed at unveiling key molecular mechanisms regulating secondary growth (Leal et al. 2021) and cork development (Lopes et al. 2020; Fernández-Piñán et al. 2021; Pires et al. 2022). Using additional transcriptomic datasets already available for different tissues, which included EST sequencing (Pereira-Leal et al. 2014), other authors focused on genome-wide or targeted characterization of genes involved in epigenetic regulation (Inácio et al. 2018; Silva et al. 2020), triterpenoid synthesis (Busta et al. 2020), and response to biotic stress (Coelho et al. 2021). In addition, the development of genetic markers for early selection of more productive and resilient genotypes would be extremely useful for this species, and different authors have started to explore the occurrence of genetic diversity associated with cork quality (Mendes et al. 2022) and adaptation to climate change (Vanhove et al. 2021).

In the present paper, we describe the development of the most recent version of the cork oak genome, obtained using a combined assembly of Illumina and PacBio sequencing reads. We also report for the first time the sequence assembly of the cork oak plastidial and mitochondrial genomes.

Materials and methods

Plant material and DNA extraction

Cork oak leaves were collected from the same HL8 tree used to generate the draft version of the cork oak reference genome assembly (Ramos et al. 2018) in order to produce long-read sequences. This specimen is over 50 years old and considered a producer of very high-quality cork. Total nuclear DNA extraction of the newly collected leaves was obtained with the NGS-Grade gDNA Prep optimized for PacBio long reads and performed by Amplicon Express (USA). Additional leaves were also submitted to organelle DNA isolation using the protocol described by Lang et al. (2011). The innuPREP Plant DNA Kit (Analytik Jena, Jena, DE) was used for DNA extraction.

Sequencing

The nuclear DNA sequencing was carried out using the PacBio RS II platform at the Yale Center for Genomic Analysis, Yale University (USA). Organelle DNA sequencing was performed using the Illumina HiSeq 4000 platform, at the Beijing Genomics Institute (BGI) producing PE reads of 100 bp in length with an insert size of 300 bp.

Moreover, the Illumina PE and MP reads and the RNA-Seq data produced to build the draft genome version of cork oak were also used in this work (Supplementary Table 1).

Illumina raw reads obtained from the nuclear DNA libraries and the RNA-Seq reads were preprocessed with Trimmomatic v.0.36 (Bolger et al. 2014). Adapter sequences and low-quality bases with less than an average quality threshold of 20 over a sliding window of 10% of the read length were trimmed from the end of the read; reads shorter than 80% of the original read length were then removed. Finally, the PacBio long raw reads were self-corrected using Canu v1.5 with the following parameter specification: rawErrorRate=0.25, correctedErrorRate=0.04, corOutCoverage=90, and userGrid=false (Koren et al. 2017).

The RNA-Seq reads were preprocessed with Trimmomatic using the same criteria described above.

Genome assembly and annotation

To remove chloroplast and mitochondrial sequencing reads, the preprocessed reads were mapped against the Quercus suber chloroplast and mitochondrial genomes assembled during this work and described further below, using the BWA-mem algorithm (Li 2013). Unmapped reads from such alignments were used for downstream assembly of the nuclear genome.

The bioinformatics pipeline followed is described in Fig. 1. As a first step, two independent genome assemblies based on short reads were produced using Ray assembler (Boisvert et al. 2010). While a k-mer size of 81 bp was defined for the assembly based on reads of 100 bp, a k-mer size of 121 bp was set for the assembly based on reads of 150 bp. Possible alternative heterozygous sequences were removed from each assembly using Redundans (Pryszcz and Gabaldón 2016). Then, assembly reconciliation was performed with GARM (Soto-Jimenez et al. 2014) as previously described (Ramos et al. 2018).

Fig. 1
figure 1

Bioinformatics pipeline followed for cork oak genome assembly. *Raw set of reads

The MP reads were mapped against the final set of contigs obtained from the assembly reconciliation with BWA-mem. Only the reads with a minimum mapping quality score of 10 were subsequently used for the scaffolding process using the SSPACE Basic (Boetzer et al. 2011), with the MP libraries used in ascending order relative to their respective insert size. In order to reduce the number of gaps of the genome, a first round of gap closing was performed using the long error-corrected reads of PacBio with the SOAPdenovo2 (Luo et al. 2012). Afterwards, three additional rounds of (1) gap closing with two PE libraries from all available insert sizes (170 bp, 300 bp, 500 bp, and 800 bp), (2) scaffolding using the long error-corrected PacBio reads with the SSPACE-LongRead (Boetzer and Pirovano 2014), and (3) scaffolding with the MP using SSPACE Basic were performed. In each round, the set of PE libraries used for the gap closing step was different (Fig. 1, red square). Lastly, a screening for potential contaminants was performed with FCS-GX from the Foreign Contamination Screening (FCS) tool (https://github.com/ncbi/fcs).

The gene prediction of the cork oak genome was performed executing two rounds of annotation with Maker (Campbell et al. 2014), using several sources of external hints generated to maximize annotation accuracy (see Supplementary Materials). In the first round, Maker was run to predict gene coordinates based on the evidences provided from all the hints generated. The set of genes obtained from the first round was used to train Augustus and SNAP (Korf 2004; Stanke et al. 2008). Once both kinds of software were trained, the second round of Maker was performed in order to do an ab initio gene prediction keeping the predictions made in the first round.

Functional annotation of the final set of genes was performed using BLAST (Camacho et al. 2009), against the NCBI non-redundant plants and the Swiss-Prot databases (Boeckmann et al. 2003). Additionally, conserved protein domains, Gene Ontology (GO) terms, and KEGG mappings were identified using InterProScan (Jones et al. 2014). Also, eggNOG-mapper (Huerta-Cepas et al. 2017) was used to assign orthologues based on precomputed Viridiplantae phylogenies.

The quality of the annotation was assessed by the annotation edit distance (AED) metric which quantifies the congruency between a gene annotation and its supporting evidence.

Transposable and repetitive elements (TEs and REs)

The first set of TEs and REs was identified and annotated running RepeatMasker v.4.0 (Smith et al. 2013) using the eudicotyledons subset of RepBase. Additionally, due to the highly variable sequences of TEs across species and to accurately identify the diverse set of TEs distributed in the cork oak genome sequence, RepeatModeler2 (Flynn et al. 2020) was used to produce a de novo reference library of TEs.

Completeness of the genome assembly

CEGMA and BUSCO v5.3.0 (Parra et al. 2007; Manni et al. 2021) were used to investigate the presence of highly conserved orthologous genes in the nuclear genome assembly. CEGMA contains a total of 248 core eukaryotic genes. BUSCO was run over the plant set (embryophyta_odb10), which includes a total of 1614 ortholog groups, defining Arabidopsis thaliana as the model species for the gene prediction performed by Augustus within the BUSCO pipeline.

Chloroplast genome assembly and annotation

For the chloroplast (cp) genome assembly, NOVOPlasty v2.6.3 (Dierckxsens et al. 2017) was used, setting insert Range and Insert Range strict parameters to 1.3 and 1.1, respectively. The complete CDS sequence of Quercus suber ribulose 1,5-bisphosphate carboxylase/oxygenase large subunit (GenBank: AB125027.1) was used as a seed for the de novo assembly of the cp genome, following software recommendations. By using a conserved organelle gene sequence as a seed to start the assembly, followed by sequence extension, the evidence of genome circularization is better achieved than in more common genome assembly strategies, which prioritize the assembly of the entire dataset at once. Furthermore, given that organelles have homologous sequences, a more targeted approach like the one presented here may reduce the cross contamination of sequences between organelles during genome assembly. Additionally, the complete cp genome of Quercus variabilis (GenBank: NC_031356.1, (Yang et al. 2016)) was used as a reference genome to resolve highly repetitive regions.

The cp genome sequence was then blasted against NCBI non-redundant database to confirm its integrity and completeness compared with other Quercus cp genomes available.

The cp genome was annotated using the web-based tool DOGMA (Wyman et al. 2004) with default parameters. Results were manually inspected, and the precise coordinates of each gene were confirmed using the annotation of Q. variabilis as a guide. For genes not automatically detected, but present in Q. variabilis annotation, a manual search for Q. variabilis gene sequence was performed, and the presence of the gene was then confirmed or discarded. New putative genes present in our annotation, but not in Q. variabilis, were verified by homology with chloroplast proteins from other species through blastx in NCBI non-redundant database.

Thirty-nine complete cp genome sequences were used for phylogenetic analysis, including the new chloroplast genome of Q. suber and 35 cp genomes of different Quercus species publicly available. Arabidopsis thaliana (NC_000932), Populus trichocarpa (NC_009143), and Fagus sylvatica (NC_041437) were used as outgroup species. The details of the Quercus species used are provided in Supplementary Table 2. The 39 cp genomes were aligned with MAFFT v.7 with default parameters (Katoh et al. 2018). The alignment was trimmed using trimAl v1.2rev22 (Capella-Gutiérrez et al. 2009) with options -cons 60 and -gt 0.9 and manually adjusted. Phylogenetic analysis was conducted by maximum likelihood (ML) using RAxML v8.2.12 (Stamatakis 2014) with the GTRGAMMA model and 1000 bootstrap replicates. The tree was rooted at midpoint.

Mitochondrial genome assembly and annotation

The mitochondrion (mt) genome was assembled following the same approach described for the chloroplast genome, using NOVOPlasty. However, in this case, the partial CDS sequence of Orfx gene (GenBank: DQ340806.1) was used to initiate the assembly. The complete mt genome of Betula pendula (GenBank: LT855379.1) was used as reference to resolve highly repetitive regions. Furthermore, the chloroplast genome obtained previously was also considered in order to exclude chloroplast reads that could be present in the data. To reduce the fragmentation of the genome, a de novo assembly based only on the organelle-unfiltered error-corrected PacBio reads was performed with Canu v.1.5 (for more details see Supplementary Materials). Then, the contigs obtained from NOVOPlasy were aligned with the PacBio assembly using LAST v885 (Kiełbasa et al. 2011). Additionally, the preprocessed MP reads were mapped with BWA-mem against the improved mt genome assembly to search for evidences of junction between contigs. Both organelle genomes were also aligned against each other with LAST to check for cross contamination of sequences.

MITOFY analysis web server (Alverson et al. 2010) was used with default parameters to find protein coding and rRNA genes for Q. suber mitochondrial genome. This annotation was then manually confirmed gene by gene in the web server by comparison with other land plant species available at MITOFY database. Transfer RNA (tRNA) genes were detected using tRNAscan-SE tool (Lowe and Chan 2016) with default parameters.

Results and discussion

Nuclear genome assembly and annotation

The percentage of reads kept after preprocessing ranged from 56.2 to 93.3% (Table 1). The Illumina preprocessed reads were assembled separately by their original read length, 100 and 150 bp, resulting in two assemblies with 249,707 and 214,524 contigs, respectively. Then, a haplotype reduction was performed over each assembly which allowed for a reduction of the number of contigs by 55.50% and 56.74%, respectively. The assembly reconciliation procedures were applied at the end producing a single assembly comprising 111,877 contigs. More details about the assemblies are provided in Supplementary Table 3. These contigs were then used as the basis of the scaffolding and gap filling processes with MP and corrected PacBio reads, producing a final genome assembly with 2479 scaffolds. The assembly statistics through the scaffolding and gap filling procedures are shown in Table 2. Lastly, after screening for potential contaminants, 128 sequences were removed due to a high homology with bacterial and fungal genomes.

Table 1 Summary of total reads per library before and after pre-processing
Table 2 Assembly metrics after each scaffolding and gap filling steps

The final assembly of cork oak genome contained 2351 scaffolds greater than 1000 bp, representing an assembly length of 765.7 Mbp with a 4.3% of undetermined nucleotides. The N50 obtained was 1.0 Mbp, and the largest scaffold was 4,440,151 bp in length. Most of the genome was represented in the largest scaffolds. For instance, scaffolds equal to or larger than 250 Kbp (N=823) comprise approximately 92.5% of the genome (Table 3), being the L90 equal to 755.

Table 3 Final assembly metrics for the cork oak nuclear genome

Using two rounds of an automated annotation pipeline based on Maker, 40,131 genes were predicted (23,512 in the first round). The full set of predicted genes represented 21.1% of the total genome size, with a density of 0.52 genes/10 kb, which is similar to the densities of Quercus robur (GCA_900291515) and Quercus lobata (GCA_001633185) (0.32 and 0.63 genes/10 kb, respectively), two species of the same genus and similar genome size (814.3 Mbp and 845.94 Mbp, respectively) (Plomion et al. 2018; Sork et al. 2022).

From the whole set of predicted genes, 33,006 (82.2%) and 39,504 (98.4%) were functionally annotated against Swiss-Prot and NCBI-nr plant databases, respectively. Additionally, the search for protein signatures against InterPro resulted in 37,184 (92.6%) genes with similarity to at least one protein domain. Regarding the GO terms and KEGG pathways, 16,198 (40.3%) genes were assigned to at least one GO term and 2727 (6.8%) genes to an Enzyme Commission Number.

The annotation edit distance (AED) was used to evaluate the annotation quality of the cork oak genome. AED values range from 0 to 1, with 0 denoting an extrinsic evidence and 1 denoting no extrinsic evidence supporting the annotation (Campbell et al. 2014). Figure 2 shows the cumulative distribution of AED values of the predicted genes. A total of 38,513 genes (95.9%) have an AED < 0.5, and 27,409 (68.3%) have at least one recognizable Pfam protein domain. Based on the rule of thumb defined by Campbell et al. (Campbell et al. 2014), the cork oak genome annotation can be considered well annotated.

Fig. 2
figure 2

The cumulative annotation edit distance (AED) distributions of the predicted genes

CEGMA and BUSCO results

The results obtained by CEGMA showed that the completeness of the genome was 98.8% (98.4% of complete matches) while with BUSCO, it was 91.6% (88.6% of complete matches). Comparing these results against the results obtained in the draft version of the cork oak genome (Ramos et al. 2018), the completeness reported by CEGMA remains the same, while the one reported by BUSCO was slightly lower in the current version. Although the standard metrics suggested that duplicate orthologues were better resolved in this version (6.8% vs. 10.4%), the number of missing orthologues increased. This increase could be a fallout of the application of heterozygous sequence reduction, process from which Pryszcz and Gabaldón (2016) emphasize that the resulting assembly represents a variety of fragments which are randomly chosen from each of the haploid genomes.

Genome size and gene representation

To better understand the difference in the cumulative size of the genome between the draft version and the genome version reported here, which is about 187.6 Mb smaller in size, both genomes were aligned with LAST using the current version as reference. The alignment results showed that 95.4% of the current genome sequence is represented in the draft version, with 2337 scaffolds out of 2351 showing alignments with 22,877 scaffolds from the draft genome. A total of 17.5 Kbp of sequence was recovered from the current genome version and not assembled in the draft version. The incorporation of the long PacBio reads allowed the capture of more challenging sequence architecture. Additionally, together with the removal of possible alternative heterozygous sequences immediately after the assembly allowed improving the scaffolding rearrangement. However, 65.8 Kbp of the draft sequence is not present in the current version of the cork oak genome. This could be directly related to the smaller genome size and slightly higher number of fragmented and missing ortholog genes, as a result of different methodologies applied for genome assembly.

Regarding the number of genes predicted (and gene content), the gene sequences from the draft were blasted against the current genome at the nucleotide level, with an e-value of 1e−5. From the 79,372 genes, 70.0% (55,627) aligned against the current genome version, although only 45.7% aligned in regions with annotated genes. It should be kept in mind that fragmented assemblies lead to an overestimated number of annotated genes (Denton et al. 2014). In fact, the number of high-confidence genes identified in the draft version (33,658) is closer to the number of genes predicted in the updated version (40,131).

Transposable and repetitive elements

The percentage of TEs and REs covering the genome based on RepBase was 11.6%, for a total of 88,930,547 bp of the 765.7 Mbp (732.8 Mbp excluding Ns). Retroelements account for 6.9% of the genome while DNA transposons and simple repeats covered 0.8% and 2.9%, respectively (Supplementary Table 4). Similar results were found in the draft version of the genome (Ramos et al. 2018). Additionally, to better assess the tandem repeats present in the cork oak genome, the modeling step based on the combination of RepeatModeler2 + RepeatMasker was essential. With this strategy, we could predict a level of repetitiveness of 53.4%, in agreement with the level reported for Quercus mongolica (53.7%) (Ai et al. 2022) and other Quercus species (Sork et al. 2022) (Supplementary Table 5).

Chloroplast genome

The chloroplasts are the plant organelles responsible for the conversion of light energy into chemical energy through the process of photosynthesis (Finkeldey and Gailing 2013). The cp genome presents inheritance patterns, predominantly of maternal nature (Greiner et al. 2015), and its characterization has been useful for evolutionary studies and phylogenetic relationships (Wu et al. 2020). In this study, we assembled and characterized the complete cp genome sequence of Q. suber.

From the initial 13,519,700 whole-genome Illumina PE reads, only 274,918 (2.03%) were de novo assembled as cp genome, representing an average coverage depth of 253x. This assembly yielded one single circular molecule of 161,179 bp (Fig. 3), which is among the range of sizes known for the thirty five cp genomes of Quercus species (from 160,415 to 161,394 bp, in Quercus aquifolioides and Quercus bawanglingensis, respectively), available at NCBI organelle genome Resources (https://www.ncbi.nlm.nih.gov/genome/organelle/) and summarized in Supplementary Table 2. The GC content was 36.8%, also very similar to other Quercus, which range between 36.79 and 36.96% depending on the species. The cork oak cp genome annotation presented 90 protein coding genes, 40 tRNA and 8 rRNA genes. The comparison of these results with the annotation of five cp genomes reported by Yang et al. (2016) reveals that the annotation of Q. suber cp genome is in complete agreement with other Quercus species.

Fig. 3
figure 3

Circular map of cork oak chloroplast genome. The color of the genes corresponds to their function as per the legend. Genes transcribed counter-clockwise are indicated inside of the outer circle, while genes transcribed clockwise are displayed outside of the circle. The inner circle represents the GC content as a histogram in darker gray. The gray line in the inner circle represents 50% GC content. The black border line of the inner circle represents the different regions of chloroplast genome (IRA, IRB: inverted repeats; SSC: small single copy; LSC: large single copy). This plot was generated with OGDRAW tool (Greiner et al. 2019)

Quercus species have been classified into two main lineages known as the “New World” and the “Old World” oak clades (Manos et al. 1999). The “New World” oak clade is formed by the subgenus Quercus which comprises five sections (Quercus, Lobatae, Protobalanus, Ponticae, and Virentes), while the “Old World” oak clade is formed by the subgenus Cerris comprised by three sections (Cerris, Cyclobalanospsis, and Ilex). Considering this classification, the phylogenetic tree based on the complete cp genomes showed that the Quercus section formed a single clade. The Ilex section was split into two clusters. The first cluster is formed by species distributed across Tibet, Nepal, Bhutan, Myanmar, and Thailand regions (Q. spinose, Q. tungmaiensis, Q. aquifolioides, and Q. pannosa) together with Q. guyavifolia, present in nearby China. The second one is comprised by Quercus species occurring in Japan and Southern and Central China. The exceptions were Q. baronii and Q. acrodonta that appeared to be closely related to section Cerris, where Q. suber belongs. This reflects the monophyletic relation between Cerris and Ilex, which was well established at the nuclear level before (Denk and Grimm 2010). Cork oak grouped relatively close to Q. variabilis, both cork-producing species for commercial use. Additionally, the sections from the “Old Word” are clearly distinct from the clade comprising Quercus, Lobatae, and Virinetes sections (Fig. 4, Supplementary Table 2).

Fig. 4
figure 4

Phylogenetic tree obtained for 39 complete chloroplast genomes, 36 corresponding to Quercus species. Black circle: section Cerris; black square: section Ilex; black triangle: section Cyclobalanospsis; black rhombus: section Quercus; white circle: section Lobatae; white square: section Virentes; white rhombus: outgroup species; nothing: undetermined section

Mitochondrial genome

A total of 612,018 out of 11,269,248 whole-genome Illumina PE reads (5.43%) were assembled into 23 contigs, from which 16 showed evidence of overlap. The contig sizes varied from 1123 to 81,375 bp, for a total genome length of 796,392 bp. The self-alignment of these 23 contigs showed high homology between different contigs, which suggested (1) duplicate regions of mt genome or (2) a poor assembly by the NOVOPlasty software. In order to improve these results, we took advantage of a de novo assembly of Q. suber nuclear genome previously described in this work, which used error-corrected PacBio reads unfiltered for organelle contamination. The alignment of this PacBio assembly against the contigs obtained with NOVOPlasty identified 12 PacBio contigs with high homology. The manual inspection of the alignment coordinates allowed the overlap of several PacBio contigs after confirming sequence identity. Three contigs were obtained for mitochondrion assembly: one large contig with 442,094 bp, followed by two smaller contigs of 52,064 bp and 37,700 bp. The MP reads of different insert sizes were further mapped to this assembly, but no evidence of connections between contigs was found. These contigs, which showed no evidence of circularization, sum a total genome length of 531,858 bp. The genome length is larger than other Quercus mitochondrion genomes recently deposited, namely, Q. robur (390,906 bp GenBank: OW028777.1), Q. variabilis (412,886 bp; Genbank: MN199236.1), and Q. acutissima (448,694; GenBank: MZ636519.1). The annotation of Q. suber mt genome revealed a total of 66 genes, being 40 protein coding, 23 tRNA, and 3 rRNA. The completeness of the annotation is corroborated by the presence of all major organelle genes except for RPS9 and RPS13. This annotation presents more genes than any other Quercus mitochondrion genome, which is in agreement with its increased genome length.

Comparison with other Quercus genomes

Our assembly was compared against five other available genomes for Quercus genus (Table 4, data accessed on 30/05/2022). The L90 metric, which describes the minimum number of sequences representing 90% of the genome sequence, is expected to be equal or closer to the number of chromosomes when genome assembly contiguity is very high, being a good method to better compare and assess genome contiguity. Four genomes had an L90 between 11 and 12 and were organized in chromosomes (Q. aquifolioides, Q. lobata, Q. robur, Q. mongolica, and Q. variabilis) and one in scaffolds (Quercus wislizeni). All assemblies were aligned against the cork oak genome using LAST v885. The alignment results revealed a high similarity between Q. suber and the other Quercus species, with 98.16% of Q. aquifoloides, 96.70% of Quercus wisilizeni, 95.94% of Q. robur, 95.72% of Q. mongolica, and 95.43% of Q. lobata genomes, aligned against the cork oak genome.

Table 4 Nuclear genome assembly and annotation statistics of Quercus species (Q. aquifolioides v.1 (GCA_019022515); Q. lobata v.3.2 (GCA_001633185); Q. robur v3.1 (GCA_932294415, (Bodénès et al. 2016)); Q. wislizeni v.1 (GCA_023055345); and Q. mongolica v.1 (GCA_011696235 (Ai et al. 2022)))

BUSCO analysis

The Q. suber genome reported here is slightly less complete than the other genomes represented in Table 5 (Q. robur v.3.1 (98.0%); Q. wislizeni (97.7%); Q. lobata (96.8%); Q. mongolica (95.4%), Q. aquifolioides (95.1%); and Q. suber (88.6%)) but still has a high-level of completeness.

Table 5 Summary of BUSCO analysis performed over the nuclear genomes available for Quercus spp

Conclusions

High-quality genome assemblies are of great need, since initial “drafts” may fail to capture important genomic regions involved in specific adaptations of the species. In this work, the addition of long-sequences from PacBio allowed to improve sequence connectivity and reduce fragmentation of the cork oak reference genome (Ramos et al. 2018), observing a high level of sequence assembly homology between the two versions. However, genome completeness was slightly lower mostly due to the differences in the pipeline applied which resulted in a smaller genome size.

Another important goal of this study was the assembly of the cork oak plastidial and mitochondrial genomes, which was accomplished for the first time. These newly reported genomes will provide valuable information for genetic evolution and molecular breeding studies of Quercus species.

Given the importance of cork oak and the efforts required to study its genetic structure, it is essential to be able to use all the information resources available. Therefore, the current version of the cork oak genome will be deposited in the CorkOakDB (Arias-Baldrich et al. 2020) to integrate and share open access to the information in a comprehensive, accessible, and intuitive format.

High-quality genomes allow the documenting of the deep evolutionary history of species. The current cork oak genome version is a step forward towards a better and less fragmented genome sequence, and further improvements can be achieved making use of recent sequence technologies such as HiFi (high fidelity) or Hi-C (high-throughput chromosome conformation capture technique) sequencing data. Hence, our research team has additional ongoing work to develop a better version of the cork oak genome at the chromosome level, which will lead to additional improvements in the quality and completeness of the cork oak genome.