Significant taxon sampling gaps in DNA databases limit the operational use of marine macrofauna metabarcoding

Significant effort is spent on monitoring of benthic ecosystems through government funding or indirectly as a cost of business, and metabarcoding of environmental DNA samples has been suggested as a possible complement or alternative to current morphological methods to assess biodiversity. In metabarcoding, a public sequence database is typically used to match barcodes to species identity, but these databases are naturally incomplete. The North Sea oil and gas industry conducts large-scale environmental monitoring programs in one of the most heavily sampled marine areas worldwide, and the region could therefore be considered a “best-case scenario” for macrofaunal metabarcoding. As a test case, we investigated the database coverage of two common metabarcoding markers, the mitochondrial COI gene and the ribosomal 18S rRNA gene, for a complete list of 1802 macrofauna taxa reported from the North Sea monitoring region IV. For COI, species level barcode coverage was 50.4% in GenBank and 42.4% for public sequences in BOLD. For 18S, species level coverage was 36.4% in GenBank and 27.1% in SILVA. To see whether rare species were underrepresented, we investigated the most commonly reported species as a separate dataset but found only minor coverage increases. We conclude that compared to global figures, barcode coverage is high for this area, but that a significant effort remains to fill barcode databases to levels that would make metabarcoding operational as a taxonomic tool, including for the most common macrofaunal taxa.


Introduction
Advances in molecular biology and computer technology have redefined the way most areas of natural sciences are carried out, but assessment of biodiversity has yet to take full advantage of these technology leaps (e.g., Taberlet et al. 2012; Bourlat et al. 2013). In particular, the practice of using estimates of biodiversity change to understand the extent of anthropogenic impact on the environment still routinely uses century-old sampling techniques. In the marine environment, most benthic community impact assessments are based on changes in the macrofauna assemblage sampled by replicate sediment grab samples sieved to retain fauna larger than 0.5 or 1 mm (Bean et al. 2017). This fauna is subsequently determined to the lowest possible taxonomic level using morphological characters, based on available taxonomic expertise and literature.
Valid and important reasons to continue with this method include the fact that the continuity of data enables tracking changes over time (Obst et al. 2018), but there are major challenges associated with this approach, including increasing lack of taxonomic expertise, lack of taxonomic literature for large areas of the world, subjective and inconsistent taxonomy, and cryptic diversity (Knowlton 1993; Schander and Willassen 2005; Ellingsen et al. 2017). Furthermore, macrofauna is only one part of a community also comprising microeukaryotes and other inconspicuous taxa, such that sampling only macrofauna lowers sensitivity to changes in the community and risks type II errors in impact assessments (Leray and Knowlton 2015; Lanzén et al. 2016). Finally, morphological taxonomy is very time consuming and thereby costly, restricting the possible sampling effort for any given study (Bean et al. 2017).
Communicated by K. Kocot. Electronic supplementary material: The online version of this article (https://doi.org/10.1007/s12526-020-01093-5) contains supplementary material, which is available to authorized users.
The implementation of metabarcoding in biodiversity studies, that is, inferring community composition through high-throughput amplicon sequencing, may be useful to overcome some of the limitations of existing methods (Taberlet et al. 2012; Bourlat et al. 2013). Metabarcoding can be applied to environmental DNA (eDNA) samples from water or sediment (Valentini et al. 2016; Sakata et al. 2020) or DNA extracts from a homogenate of the fauna present in a sample (Taberlet et al. 2012). The resulting barcode sequence reads can subsequently be matched with sequences assigned to taxon names accessed from databases such as NCBI GenBank (Sayers et al. 2020), the Barcode of Life Data System (BOLD) (Ratnasingham and Hebert 2007), or the ribosomal RNA SILVA database (Quast et al. 2012). The most common metazoan species barcode marker is the mitochondrial cytochrome c oxidase subunit I (COI) gene (Hebert et al. 2003). COI is sufficiently variable to distinguish the vast majority of metazoan taxa at the species level and offers the potential to discriminate among cryptic species and enhance diversity measures relative to morphology (Tang et al. 2012; Leray and Knowlton 2016; Sinniger et al. 2016). Due to the lack of truly universal primers, it is often paired in metabarcoding applications with other markers, such as the 18S rRNA gene (18S), to account for taxonomic groups that do not amplify well. While 18S tends to underestimate macrofaunal diversity relative to morphology due to its lower rate of evolution (Hartmann et al. 2010), conserved primer sites still make it a good candidate for universal eukaryote applications. Though not treated here, other complementary or alternative metabarcoding markers have also been suggested, including the ribosomal 28S and mitochondrial 12S RNA genes.
The use of metabarcoding is limited by the lack of reference sequences in the barcode repositories, reported at an average of 80% to 94% across metazoan groups (Kvist 2013). Many marine shelf, deep water, and polar areas are notoriously undersampled (McClain and Schlacher 2015). As an example, for abyssal plains (54% of Earth's surface), a staggering 90% of the infauna species found in a typical survey are new to science (Ebbe et al. 2010). Recent work in the central Pacific Ocean, where DNA barcoding has not been implemented to any noteworthy degree, points to the significant effort needed to establish even a preliminary taxonomic baseline for DNA-based species identification (Glover et al. 2016; Wiklund et al. 2017; Wiklund et al. 2019). In contrast to most deep ocean basins, shelf areas are better understood. In a recent study comparing morphology and COI metabarcoding in the Bay of Biscay, Aylagas et al. (2016) found a database coverage of 23%, up from the global 6% reported by Kvist (2013), but still severely limiting the translation to taxonomy and the use of metabarcode data in understanding biodiversity. It has been suggested that rare or otherwise peripheral taxa dominate the fraction of species that have not yet been barcoded. The absence of these taxa in the reference databases would thus have a more limited impact on the results from metabarcoding studies than raw percentages would suggest (Hebert et al. 2016). To the extent that fungi are a fit object of comparison, such claims do not seem to hold true, however (Tedersoo et al. 2014, 2017). Ongoing developments in DNA extraction, sequencing, and analytical methods coupled with an ever-growing body of pertinent scientific publications push for the introduction of metabarcoding to routine environmental monitoring programs (e.g., Bourlat et al. 2013; Aylagas et al. 2014; Bohmann et al. 2014; Pawlowski et al. 2014; Rees et al. 2014; Thomsen and Willerslev 2015; Lanzén et al. 2016; Valentini et al. 2016).
While the taxonomic gaps in the global barcode repositories may appear overwhelming, assessments of regional coverage can help to understand the progress of database coverage and to identify cases and situations where metabarcoding may be advantageous over other approaches. The North Sea is one of the most heavily sampled marine areas in the world (Hebert et al. 2016). The North and Norwegian Seas are regions where taxonomic work has been carried out for a long time, including 10-15 marine laboratories and another 10 or so in surrounding areas, some of which rank among the oldest in the world (Dean 1893; Lasserre et al. 1994). Dense spatial and temporal sampling of macrofauna due to oil and gas exploitation impact monitoring has also provided unique samples for taxonomic work (e.g., Petersen and George 1991). Though many new species are still found (e.g., Glover et al. 2005; Wiklund et al. 2009a; Strand et al. 2014; Dietrich et al. 2015), the marine macrofauna is well known and documented in published faunas relative to other areas.
The objective of this paper is to investigate the current state of publicly accessible barcode repositories, including GenBank, BOLD, and SILVA, for metabarcode taxon matching (the barcode repository gap) for the COI and 18S genes in marine benthic macrofauna taxa using a large dataset from a densely surveyed area. The North Sea is an area where major effort is applied to monitor environmental impact from the oil and gas extraction industry, and there is substantial stakeholder interest in the performance of metabarcoding technology. We hypothesize that the North Sea is a "best-case scenario" in terms of both taxonomic baseline and macrofauna barcode coverage. We further assess this gap for each of the major invertebrate macrofauna phyla Echinodermata, Mollusca, Annelida, and Arthropoda and look at the most common taxa reported in the dataset to investigate the validity of the claim that rare taxa are overrepresented among species lacking barcodes.

Material and methods
A dataset of macrofauna taxon names was compiled based on survey reports from monitoring Region IV (≈ 20,000 km²) in the northern part of the North Sea (Fig. 1) by downloading taxon lists from the Environmental Monitoring (MOD) database (DNV GL 2020), a central repository of monitoring data for the Norwegian offshore oil and gas industry. The dataset includes around 140 sampling stations (≈ 700 grab samples) at depths ranging between 127 and 385 m from the years 1996, 1999, 2002, 2005, 2008, 2011, and 2014. In accordance with Norwegian regulatory standards (Norwegian Environment Agency 2015), each station was sampled using 4-5 replicate van Veen grabs resulting in 0.5 m² total sampled area, sieved through a 1 mm sieve. Morphological identifications were originally performed by environmental consultant companies, with most taxa assigned to macrofaunal phyla such as Annelida, Arthropoda, Echinodermata, and Mollusca. Meiofauna < 1 mm such as nematodes are not considered part of the dataset proper for the purposes of assessing biodiversity in these surveys and were reported at a low level or as presence/absence only. To assess the coverage of the most common taxa, a list of the ten most common reported taxa from each sampling station was also compiled as a separate dataset. Taxon names were updated and classified using the World Register of Marine Species (WoRMS) database (Horton et al. 2020). To mitigate differences in taxon names between WoRMS and the sequence databases, both names considered valid by WoRMS and originally reported names from the MOD database (where different) were used as synonyms for searches of the barcode repositories.
Three public repositories (one general, one COI specific, and one 18S specific) were used to check taxon barcode coverage at species or higher taxonomic rank: NCBI GenBank, a general, lightly curated sequence database (Sayers et al. 2020); the Barcode of Life Data System (BOLD), a heavily curated COI specific sequence database (Ratnasingham and Hebert 2007); and SILVA, a curated ribosomal RNA sequence database (Quast et al. 2012). 18S sequences were confirmed by taxon name batch searches of GenBank and the SILVA database. Due to the way searches were returned from BOLD, total taxon records (public and non-public) were recovered at all taxonomic levels, but public sequences were recovered for species level only. Online searches were carried out in January 2020.
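The synonym-aware name matching described above can be illustrated with a minimal sketch. It assumes that the taxon names with at least one barcode record in a repository have already been exported to a plain set of strings. The names `Taxon`, `is_covered`, and `coverage`, as well as the fictitious taxon names and repository content, are hypothetical and are not taken from the study's workflow.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Taxon:
    """One survey taxon: the WoRMS-valid name plus originally reported names."""
    valid_name: str
    reported_names: frozenset = field(default_factory=frozenset)

    def search_names(self):
        # Valid and originally reported names are both treated as synonyms.
        return {self.valid_name, *self.reported_names}


def is_covered(taxon, repository_names):
    """True if any synonym of the taxon has a record in the repository."""
    return bool(taxon.search_names() & repository_names)


def coverage(taxa, repository_names):
    """Percentage of taxa with at least one matching repository name."""
    if not taxa:
        return 0.0
    hits = sum(is_covered(t, repository_names) for t in taxa)
    return 100.0 * hits / len(taxa)


# Fictitious names, invented for illustration only (not real taxa or results).
taxa = [
    Taxon("Exemplum primum"),
    Taxon("Exemplum secundum", frozenset({"Vetus secundum"})),
    Taxon("Exemplum tertium"),
    Taxon("Exemplum quartum"),
]
# Hypothetical set of names that have a barcode record in one repository.
repository = {"Exemplum primum", "Vetus secundum"}

print(f"{coverage(taxa, repository):.1f}% covered")
```

In this toy case the second taxon is counted as covered only because its originally reported name is searched alongside the valid name, which is the situation the synonym searches are meant to catch.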

Results
The original complete list of survey taxon names at all taxonomic levels comprised 1902 names of which 1568 were identified to species level. Taxa not identified to species level were assigned at various levels from phylum (e.g., "Nematoda" and "Sipuncula") to genus level by the original taxonomists. The merged list of the ten most common taxa from each survey station comprised 240 taxon names (188 species level names). Validating and updating taxon names and classification in WoRMS reduced the number of valid taxon names to 1802 and valid species names to 1474 (Supplementary Table 1). For the ten most common taxa reported from each station, WoRMS indicated 236 valid taxon names including 184 taxa at the species level (Supplementary Table 2). The independent taxonomic curation of the databases, while lagging behind WoRMS, was found to be relatively up to date, even for GenBank, with synonym hits being rare.
Of the 1802 taxa in the total dataset, 56.5% were represented by a publicly available COI barcode in GenBank at the same level as that reported in the morphological dataset. The corresponding figures for 18S were 45.5% for GenBank and 29.2% for SILVA. For the 1474 taxa at the species level, 50.4% were represented by a COI sequence in GenBank and 42.4% by a COI sequence in BOLD. Species coverage exclusive to GenBank was 11.2%, while 3.2% was exclusive to BOLD. Corresponding results for 18S were 36.4% for GenBank and 27.1% for SILVA. Reducing the sensitivity of the search to genus level increased coverage to 69.3% in GenBank for COI and 77.7% in GenBank and 56.2% in SILVA for 18S. Total taxon record coverage in BOLD (rather than publicly available sequences) was 70.1% at the species level and 89.5% at the genus level (Table 1).
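The species-level COI figures above also determine how much GenBank and BOLD overlap. As a worked check, the shared and combined coverage can be derived by simple arithmetic from the reported percentages; these derived values are computed here and are not stated in the study, and the function name `overlap_summary` is hypothetical.

```python
def overlap_summary(cov_a, cov_b, only_a, only_b):
    """Derive shared and combined coverage (%) from totals and exclusive parts."""
    shared_from_a = cov_a - only_a   # part of A that is also in B
    shared_from_b = cov_b - only_b   # part of B that is also in A
    combined = shared_from_a + only_a + only_b  # covered in A or B (or both)
    return round(shared_from_a, 1), round(shared_from_b, 1), round(combined, 1)


# Reported species-level COI coverage for the whole dataset (n = 1474):
# GenBank 50.4%, BOLD 42.4%, GenBank only 11.2%, BOLD only 3.2%.
shared_gb, shared_bold, combined = overlap_summary(
    cov_a=50.4, cov_b=42.4, only_a=11.2, only_b=3.2
)
print(shared_gb, shared_bold, combined)
```

Both routes give the same shared coverage, so the reported figures are internally consistent, and the combined value is the share of species with a public COI barcode in at least one of the two repositories.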
For the 236 taxa comprising the ten most common species from each station, the same taxonomic level COI coverage was 61.0% in GenBank, while the 18S coverage was 52.1% in GenBank and 40.7% in SILVA. At species level, COI coverage was 52.2% in GenBank and 47.8% in BOLD. Fourteen COI sequences (7.6%) were only found in GenBank, while six (3.3%) were only found in BOLD. The 18S coverage was 41.3% in GenBank and 33.7% in SILVA. At the genus level, COI coverage in GenBank increased to 83.9% while 18S coverage increased to 77.9% in GenBank and 67.1% in SILVA. Total taxon record coverage in BOLD (rather than publicly available sequences) was 78.8% at the species level and 93.3% at the genus level (Table 2).
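To put the species-level figures for the most common species side by side with those for the whole dataset, the differences can be tabulated as below. This is only an arithmetic summary of the reported percentages; the variable and function names are hypothetical.

```python
# Species-level coverage (%) as reported: (all species, most common species).
reported = {
    "COI GenBank": (50.4, 52.2),
    "COI BOLD": (42.4, 47.8),
    "18S GenBank": (36.4, 41.3),
    "18S SILVA": (27.1, 33.7),
}


def gains(table):
    """Difference in percentage points between common and all species."""
    return {
        name: round(common - overall, 1)
        for name, (overall, common) in table.items()
    }


for name, gain in gains(reported).items():
    print(f"{name}: +{gain} percentage points")
```

The gains range from about 2 to 7 percentage points, depending on marker and repository.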
Divided by phylum for the most common macroinvertebrate phyla, coverage was roughly similar (45.0-51.1% at species level for the whole dataset in GenBank), except for Echinodermata, where coverage was substantially higher, with a COI coverage of 77% at species level for the whole dataset in GenBank. For the 18S gene, coverage between phyla was more uneven, with 10-20% lower coverage for Arthropoda compared to other phyla in the complete dataset (Tables 1 and 2).

Discussion
For taxonomy-based metabarcoding to be effective as a supplement to classic monitoring methods, open access barcode repositories that cover a significant fraction of the biota are a necessity. A global analysis recently estimated the paucity of metazoan COI barcodes in data repositories at 85%, including both public and private entries (Kvist 2013), suggesting that a major effort is needed to fill this gap. In a study from the Philippines, only 715 species were identified as barcoded out of an estimated 50,000 species native to the islands (Fontanilla et al. 2014). Even a study on a more restricted dataset (n = 138) of marine benthic macrofauna at similar depths from the Bay of Biscay found a paucity of 77% (Aylagas et al. 2016). In contrast, our study, based on a dataset of 1474 benthic marine macrofauna species at shelf depth in the North Sea, shows that COI DNA barcodes were available for 50.4% of macrofauna species through GenBank and 42.4% through the publicly accessible part of the BOLD database. In other words, if a metabarcoding project managed to sample and successfully sequence the COI barcode gene from the macrofauna species so far recorded in this region, around half could be determined to species level. This increased coverage highlights the critical importance of the cumulative effort of taxonomy and systematics research and well-funded barcode initiatives, including active Barcode of Life programs with marine components in surrounding countries.
The phylum with the highest COI coverage was Echinodermata, at 77.0% species level coverage. This is likely a result both of the taxon being represented by relatively few (n = 74), large species that are easy to recognize and identify and of an ongoing effort to barcode all Norwegian echinoderms as part of a broad initiative to map and barcode the entire Norwegian fauna (Bakken 2009). In contrast, the global number for echinoderm barcode database representation is only 5% (Kvist 2013). Sponges and benthic cnidarians have been identified as taxa where COI sequence data frequently fails to produce species-level resolution (McFadden et al. 2011; Vargas et al. 2012; Kvist 2013), but the low presence of these taxa in the North Sea dataset rules out any claim that this is a significant cause of the identified barcode gap.
While GenBank generally has higher COI coverage than BOLD, sequences are uncurated and lack morphological metadata, raising concerns regarding the accurate taxonomic assignment of sequences in this database and inflated diversity estimates for metabarcoding datasets. While substantial error rates have been shown for certain groups, such as a 20% species level misidentification of fungal sequences (Nilsson et al. 2006), metazoan sequences in particular seem to be surprisingly accurate, with a reported likely error rate less than 1% at genus level (Leray et al. 2019).
Total BOLD record coverage is significantly higher than BOLD records containing public barcodes: 15-40% higher for the different phyla in the complete list of species. While not all records signify the existence of actual sequence data for the recorded taxon, records usually indicate at least a barcoding attempt on the record in question. BOLD is a curated database with more stringent metadata and supplementary information requirements than, e.g., GenBank, meaning that records could represent taxa where barcoding was attempted but failed, taxa that are still being processed, or taxa that are withheld until after publication in a journal or project completion. In any case, the large discrepancy between reported taxon records and actually publicly available barcode sequences suggests that coverage of public sequences in BOLD will increase as more sequences are released into the public part of the database by data contributors.
The 18S gene evolves more slowly than mitochondrial COI and is therefore suitable to resolve higher rank relationships (Hillis and Dixon 1991) while COI saturates at family rank in most taxa (Wiklund et al. 2009b), and the 18S gene has been used to infer higher-rank diversity in metabarcoding projects using phylogenetic approaches (Lanzén et al. 2016; Fonseca et al. 2017). The 18S gene has also been extensively used to resolve higher metazoan phylogenies and is one of the most widely sequenced genetic markers among metazoans. This facilitates assignment of specimens belonging to taxonomic groups with poor COI coverage such as meiofauna (Kvist 2013). At species level in our dataset, the 18S gene (36.4%) had lower GenBank coverage than COI (50.4%), with even lower coverage (27.1%) in SILVA. In contrast, GenBank 18S genus level coverage (77.7%) was actually higher than COI (69.3%) (SILVA: 56.2%). However, most available sequences only cover part of the approximately 1800 base pair length of the 18S gene, meaning that actual sequence searches will only return a portion of the sequences identified as 18S in the databases. Still, the ability to resolve higher level taxa makes the 18S marker even more valuable in cases with lower database coverage at low taxonomic level. Compared to GenBank, SILVA 18S coverage was around 10-20% lower, probably due to the more stringent requirements and additional curation of the SILVA database than of the more lightly curated GenBank.
Table 1 Database coverage for the whole monitoring region IV dataset in GenBank (COI and 18S), BOLD (COI), and SILVA (18S). Results are given for all groups and separately for the major phyla Annelida, Arthropoda, Mollusca, and Echinodermata. Total dataset BOLD public sequence information is available at species level only. BOLD records are total records in BOLD, including non-public and incomplete information.
It has been suggested that lack of barcode coverage could be partly alleviated if rare and peripheral species are overrepresented among species lacking barcodes (Hebert et al. 2016). The publicly available COI barcode coverage for the list of the most common species in our dataset was 52.2% in GenBank and 47.8% in BOLD. For 18S, the corresponding GenBank coverage was 41.3%. We also note that of the 27 arthropod species recorded in the list of the most common taxa, only 4 (14.7%) were represented with an 18S sequence in GenBank. Our results indicate that the coverage of the most common species is only slightly higher than for all reported species, and thus, we could not see that rare species were underrepresented in the barcode repositories for our data.
Table 2 Database coverage for a merged list of the ten most common species from each station from monitoring region IV dataset in GenBank (COI and 18S), BOLD (COI), and SILVA (18S). Results are given for all groups and separately for the major phyla Annelida, Arthropoda, Mollusca, and Echinodermata. Total dataset BOLD public sequence information is available at species level only. BOLD records are total records in BOLD, including non-public and incomplete information.

Conclusions
The general state of marine fauna barcode coverage is poor, and in this regard, the North Sea stands out in a positive way: roughly half of the 1474 marine macrofauna shelf species from the North Sea dataset analyzed in this study had a COI barcode in a public repository (42.4-50.4%). Missing barcodes were not limited to lesser known or rare taxa, however: barcode coverage for the most common species was only slightly higher than for all recorded species (47.8-52.2%). 18S coverage (27.1-36.4%), while lower than COI, was still substantial. 18S is able to resolve higher level taxon groups and also targets the meio- and microeukaryote community, which makes it a good general complement to COI for metabarcoding studies. While a substantial repository gap remains even in the North Sea, this gap is shrinking, and the release of currently non-public sequences, together with targeting common marine macrofaunal species for barcoding, could make taxonomic assignment of most North Sea macrofauna viable within the foreseeable future. Worldwide, this is an atypical situation, however: in most areas, being able to resolve marine macrofauna identity in metabarcoding applications is a more remote prospect.
Acknowledgments The METAMON team, including Aud Larsen, Christofer Troedsson, Anders Lanzén, Eric Thompson, Katrine Sandnes Skaar, Christian Collin-Hansen, Juliette Diouma-Leyris, and Jessica Ray, is acknowledged for discussions regarding metabarcoding in marine benthic environments. The manuscript was improved by comments from the editor and three anonymous reviewers.
Funding information Open Access funding provided by NORCE Norwegian Research Centre AS. This work was funded by Statoil ASA (now Equinor ASA) as a delivery of the METAMON pre-project. Additional funding was received from the Norwegian Biodiversity Information Centre (Artsdatabanken).

Compliance with ethical standards
Conflict of interest The funders had no role in data collection and analysis, decision to publish, or preparation of the manuscript.
Ethical approval This article does not contain any studies with animals performed by any of the authors.
Sampling and field studies All necessary permits for sampling and observational field studies have been obtained by the authors from the competent authorities and are mentioned in the acknowledgements, if applicable.
Data availability All data used in this research is available in the associated supplementary files.
Author contribution TGD and AGG conceived and designed research. JTH, EE, TGD and RHN conducted data analyses. POJ contributed morphology based taxon lists from the North Sea oil and gas fields. TGD, JHT and RHN wrote the manuscript. All authors read and approved the manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.