Main

Biodiscovery—the exploration and use of genetic and biochemical properties of biological materials—has a long and rich history. For instance, centuries before the discovery of penicillin from mould in a laboratory, skin diseases were already being treated in the Kingdom of Jordan via red soils with potent antibacterial properties that have only recently been confirmed1. Other examples include traditional medicines extracted from evergreen shrubs for cancer treatment2, derivatives of the foxglove plant used to treat heart problems3, antimalarial quinine4 and fungi-extracted podophyllotoxin to treat sexually transmitted diseases5. However, recent advances in genetics and sequencing innovations have spurred an unprecedented growth in the scale of discoveries. Today, bioprospecting—the search for potential products with scientific and industrial value derived from biological resources such as animals, plants and microorganisms—often involves large-scale screening, analysis and prediction of prospective biological compounds through the exploration of databases with sequencing data, including DNA extracted directly from environmental samples6.

In this context, the ocean is considered a promising but largely untapped frontier for biodiscovery7. Marine organisms have evolved over millions of years to adapt to extreme conditions of temperature, salinity, light, pressure and water flow8. These conditions as well as a far longer evolutionary history have contributed to substantially greater taxonomic and functional diversity in marine habitats than in other biomes9. Nearly one million eukaryotic species are believed to inhabit the ocean10, and the number of archaea and bacteria may be ten thousand times higher11, yet most remain undescribed by science.

Despite these knowledge gaps, marine biotechnology, the use of marine organisms and their compounds for a wide range of applications in industrial sectors, has managed to distinguish itself from the broader biotechnology landscape. For instance, while nearly half of the approved pharmaceuticals are based on biological compounds produced by living organisms, success rates are two to four times higher for compounds from marine organisms7,12. Sales and licensing revenues from marine drugs have exceeded US$1 billion annually since 2011 (ref. 13), and prospects for greater commercial growth are substantial: in 2020 alone, more than 1,400 new compounds were isolated from marine species14. Biomolecules extracted from marine bacteria and other products developed from sequences of larger marine organisms are widely used in food production, diagnostics, bioremediation and disease treatment15. Some notable examples include the discovery of a thermostable enzyme required for the production of lactose-free milk in the archaeon Pyrococcus furiosus16, seawater cyanobacteria toxins developed into anticancer treatment products17 and the extensive use of green fluorescent protein found in the jellyfish Aequorea victoria18 as a molecular marker, both in medical and diagnostic contexts and in fundamental research.

Establishing a regulatory landscape that keeps pace with rapid advances in biotechnology, while also promoting transparency, equitable access and benefit-sharing mechanisms, has proven challenging19. The adoption of the Convention on Biological Diversity (CBD) in 1993 was a crucial milestone, as it defined genetic resources as ‘any material of plant, animal, microbial or other origin containing functional units of heredity of actual or potential value’, and established the fair and equitable sharing of benefits from their use as one of the convention’s three core objectives20. In 2014, the convention’s Nagoya Protocol provided a framework to regulate the access and benefit sharing of marine genetic resources (MGR) sampled in national jurisdictions21. Yet, some two-thirds of the ocean lies beyond national jurisdiction, and it was not until 2023, following protracted negotiations, that the ‘High Seas Treaty’ was agreed upon, including provisions to address MGR from areas beyond national jurisdiction (ABNJ)22.

Despite these encouraging developments, the actual and potential value of MGR for marine bioprospecting remains poorly understood. Studies have focused on counting referenced marine species in patents23 or GenBank24, examining sequences in international patent applications25,26,27,28 or exploring biological compounds for natural product discovery29,30. A common aspect to all these studies, however, is their lack of focus on the connection between the actors involved in the use of MGR and the potential sources for natural product discovery. They also suffer from limited information in patent and GenBank records about the geographical origin of gene sequences, which in many cases are referenced without naming the source species. The unevenness of these data presents a challenge for interpreting the true scale, scope and trajectory of marine bioprospecting.

Here we address these gaps by creating a comprehensive database of genetic sequences and related patent applications from 1989 to 2022 in marine bioprospecting. In addition to systematically compiling and presenting key data about the sequences, coded proteins, date of deposition and patent holders, we also address significant data gaps by developing and applying a BlastX sequence similarity model to consider sequences from unnamed species. We also assess the biodiversity data of species currently considered unique to ABNJ and highlight the special importance of deep-sea conservation for future biotechnology focused on the innovation and development of naturally derived products.

Results

Our analysis of patent filings revealed 29,065 nucleotide sequences from 1,474 disclosed marine species across 3,636 unique patents, representing approximately 1% of all gene patents submitted to the International Nucleotide Sequence Database Collaboration (INSDC). Many patents referenced multiple sequences, with a majority including both marine and non-marine sequences (Fig. 1a). Overall, marine sequences and species represented only 16% and 15%, respectively, of all sequences and species identified within the 3,636 patents (Fig. 1b,c). For comparison, approximately 242,000 marine species have been described to date (World Register of Marine Species (WoRMS), 2022), corresponding to roughly 10% of the 2.1 million species described by science31. This suggests considerable untapped potential of marine bioprospecting (Fig. 1b and Supplementary Table 1).

Fig. 1: Patent applications associated with marine species.

a, Number of patents that contain sequences associated with species of marine origin only, and patents that contain sequences from both marine and non-marine species (mixed origin). b, Number of unique marine and non-marine species referenced in patent applications that include at least one sequence of marine origin. c, Sequence frequency aggregated by species of origin. Only 16% of all sequences attached to patent applications originate from marine species. The top five species with the highest frequency in each class are indicated. Marine species: (1) Ciona intestinalis (2.96%), (2) Gadus morhua (1.76%), (3) Anguilla japonica (0.67%), (4) Salmo salar (0.34%) and (5) Oncorhynchus mykiss (0.26%). Non-marine species: (6) Arabidopsis thaliana (4.31%), (7) Zea mays (3.61%), (8) Glycine max (2.87%), (9) Homo sapiens (2.71%) and (10) Oryza sativa (2.60%).

Types of sequence in marine gene patents

The patent applicants who referenced the highest number of unique genetic sequences included both protein-coding and non-coding sequences, with the former having a higher potential for natural product discovery (Fig. 2). Most of the companies with a large number of applications referenced protein-coding genes that originate from multiple species, with an average length between 500 and 2,000 nucleotides. Some applicants specifically focused on MGR from a single species and predominantly referenced non-coding sequences. For instance, the Fisheries Research Agency of the National Research and Development Agency in Japan included 1,179 sequences in their patent applications, mostly originating from Japanese eel (Anguilla japonica), yet only 127 are protein-coding sequences. Similarly, the Japan Science and Technology Agency has referenced 5,190 sequences from the sea vase tunicate (Ciona intestinalis), only 150 of which are protein-coding genes.

Fig. 2: Key actors in marine biotechnology.

Applicants that submitted at least 25 nucleotide sequences in their patent claims (81% of all sequences) are shown. Companies that submitted at least 250 sequences are indicated. The size of the dots represents the number of patents submitted by each applicant. The dotted grey line indicates the shortest protein length estimation (150 base pairs). The continuous colour bar indicates the percentage of protein-coding sequences submitted in applicant claims. BASF [DE], BASF; JST [JP], Japan Science and Technology Agency; GENOME ATLANTIC [CA], Genome Atlantic; FRA [JP], Fisheries Research Agency; EW GROUP [DE], EW Group GmbH; UNIVERSITY OF UTAH [US], The University of Utah; YEDA [IL], Yeda Research and Development Company Ltd.; KIOST [KR], The Korea Institute of Ocean Science and Technology; DU PONT [US], DuPont; OGT [UK], Oxford Gene Technology Ltd.; KHK [JP], Kyowa Kirin Co., Ltd.; DSM [NL], DSM N.V.; KANEKA CORP [JP], Kaneka Corporation; GS [AU], Gene Stream Pty Ltd.; GENOMAR [NO], GenoMar.

Most short non-coding sequences of identical length, originating from the same species, exhibit a wide range of GC content (that is, the percentage of two DNA basic building blocks), which is typical for artificially modified sequences used in amplification or as probes for detecting specific sequences of DNA or RNA (Supplementary Fig. 1). Out of all the patents that include at least one sequence from disclosed marine species, 71% contain nucleotide sequences that are potentially protein-coding genes. This suggests that most MGR are used in bioprospecting (Fig. 3a). For sequences of particular interest (that is, those submitted to all patent systems), we provide examples illustrating the conversion of DNA molecules into products of value (Box 1).

Fig. 3: Patent applications in global marine bioprospecting.

a, Share of companies that submitted patents with at least one protein-coding sequence (bioprospecting patents) and patents with non-coding sequences only. b, Top 100 largest patent applicants in marine bioprospecting, aggregated by applicant type. The terms ‘multinational’ and ‘national’ denote companies present in more than two countries and in two or fewer countries, respectively. c, Top 100 largest patent applicants in marine bioprospecting, aggregated by country of origin (the country of headquarters).

Marine Bioprospecting Patent database

While INSDC records provide considerable insight into the genes referenced in patents, only 37.3% of records include the name of the source species; these were primarily filed under the World Intellectual Property Organization (WIPO), the European Patent Office, the Japan Patent Office and the Korean Intellectual Property Office. Most of the remaining records are from the US Patent and Trademark Office, which does not share species names in its records (Supplementary Fig. 2).

To address this gap, we developed a sequence similarity model and BlastX search tool to query all genetic sequences of unknown origin against the UniProtKB protein sequence database. This model retrieved an additional 60,636 sequences that can be attributed to marine organisms with a high degree of certainty. Together with the 31,914 protein-coding sequences of confirmed marine species, this resulted in a comprehensive database of 92,550 sequences, which forms the basis for all subsequent analysis in this paper and was used to construct the Marine Bioprospecting Patent (MABPAT) database (https://mabpat.shinyapps.io/main/).
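The similarity-based assignment of marine origin can be illustrated with a minimal Python sketch. It assumes that BlastX hits have been exported in the standard 12-column tabular layout (query ID, subject ID, percentage identity, ..., e-value, bit score), and that a lookup from subject accession to source organism and a set of marine species names are already available. The function name, the lookup structures and both thresholds are hypothetical and are not the settings used in this study.

```python
from typing import Dict, Iterable, Set, Tuple


def assign_marine_origin(
    blast_lines: Iterable[str],
    accession_to_organism: Dict[str, str],
    marine_species: Set[str],
    min_identity: float = 90.0,  # hypothetical threshold (% identity)
    max_evalue: float = 1e-10,   # hypothetical threshold
) -> Dict[str, str]:
    """Map each query to the organism of its best-scoring hit, if that organism is marine."""
    best: Dict[str, Tuple[float, str]] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        cols = line.split("\t")
        if len(cols) < 12:
            continue  # not a standard 12-column tabular row
        qseqid, sseqid = cols[0], cols[1]
        pident, evalue, bitscore = float(cols[2]), float(cols[10]), float(cols[11])
        if pident < min_identity or evalue > max_evalue:
            continue  # hit too weak to support an origin call
        # Keep only the highest bit score per query sequence.
        if qseqid not in best or bitscore > best[qseqid][0]:
            best[qseqid] = (bitscore, sseqid)

    assigned: Dict[str, str] = {}
    for qseqid, (_, sseqid) in best.items():
        organism = accession_to_organism.get(sseqid)
        if organism in marine_species:
            assigned[qseqid] = organism
    return assigned
```

In this sketch a query is labelled marine only when its single best hit comes from a marine organism, which is a deliberately conservative choice; other decision rules are possible.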

Key actors in marine biotechnology

We found that 100 applicants accounted for 58% of all patents that contain protein-coding sequences with identified marine origin (that is, bioprospecting patents). The remaining 42% were associated with applicants who filed fewer than two patents on average. For companies in the top 100 (Supplementary Table 3), the total number of patent applications would have been underestimated by at least one-third if we had not applied the sequence similarity model. Transnational corporations (1,675 applications) are the most frequent type of applicant, although roughly one-fifth of filings are from research institutes and their commercialization centres (634 applications) (Fig. 3b). In total, 78% of all bioprospecting patents filed by the top 100 were submitted by actors headquartered in the USA, Germany or Japan (Fig. 3c).

The number of patents registered by each applicant is correlated with the total count of unique species included in such patents (r = 0.8168, P = 2.17552 × 10⁻³¹⁸). To illustrate how much biological diversity each of these applicants is drawing upon, we connected patent holders and unique species included in patent claims and aggregated them at the domain (Fig. 4a) and phylum (Supplementary Fig. 6) levels of biological taxonomy. For each flow diagram, we also indicated whether the corresponding marine species had been observed in a deep-sea environment. The most active users of MGR are primarily dependent on sequences from bacteria and archaea (Fig. 4a). The ten largest actors, including eight multinational corporations and two public research bodies (Fig. 4a), collectively registered more than one-third of all patents in the top 100. Deep-sea marine species have attracted interest from all ten of the largest users of MGR.
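As an illustration of how a correlation between per-applicant patent counts and species counts can be computed, the following sketch implements Pearson's correlation coefficient. It assumes that the reported r is a Pearson coefficient, which the text does not state explicitly; the function name is hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)
```

Here `x` would hold the number of patents per applicant and `y` the number of unique species per applicant, in the same applicant order.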

Fig. 4: Species of interest to marine bioprospecting.

a, Species of interest in bioprospecting connected to a company of reference (top 10 largest patent applicants) in patent applications, grouped by domain and potential presence in ABNJ and deep-sea habitats. AIST, The National Institute of Advanced Industrial Science and Technology; UC, The University of California. The interactive version of this plot is available at https://mabpat.shinyapps.io/main/. b, Percentage of deep-sea species present exclusively in ABNJ and all species living in the ocean according to WoRMS. Credit: flags in a, flagpedia.net; icons in a, FreePik.com.

The opacity of marine bioprospecting in ABNJ

Issues of access and benefit sharing related to genetic material from ABNJ are of particular interest as they fall outside the scope of the Nagoya Protocol of the Convention on Biological Diversity and were at the core of negotiations for the High Seas Treaty adopted in June 2023. It is therefore notable that among 1,639 species of identified marine origin referenced in INSDC patent records, 281 species have been observed in ABNJ, only 5 of which are found exclusively there. This contrasts with the 5,889 species found exclusively in ABNJ, predominantly from the Arthropoda, Foraminifera and Nematoda phyla, according to our analysis of species observation data available in the Ocean Biodiversity Information System (OBIS), a global open-access database on marine biodiversity (https://obis.org). The complete taxonomic distribution is given in Supplementary Fig. 7. According to the records from the World Register of Deep-Sea Species (WoRDSS), 39% of the species found exclusively in ABNJ inhabit deep-sea environments, in contrast to only 15% of all species listed in WoRMS (Fig. 4b). The spatial distribution of ABNJ-specific species (Supplementary Table 4) is predominantly in the sub-Antarctic and Antarctic latitudes (Supplementary Fig. 8).

ABNJ account for 64% of the ocean surface area and 95% of its volume. Once thought to be largely devoid of life, the deep-sea habitats and the water column have been found to harbour many marine species. While many of these species are thought to be considerably cosmopolitan, hotspots of endemism are found throughout the deep sea, perhaps most strikingly around hydrothermal vent systems32. According to geolocations of active hydrothermal vents (721 in total), more than half (363) are located in ABNJ.

Discussion

Marine biotechnology is mainly focused on species that serve as model organisms in basic research and as a backbone in genetic engineering, allowing the creation of new drugs and increasing the efficiency of biotechnological processes for food and energy production, plant agriculture or the invention of new materials33. Marine species currently represent a small, but important, share that is used as a source for natural product discovery7,30. Unravelling the global scope of economic interest in MGR is a crucial first step towards understanding the value that rests in the biological functions encoded in genetic sequences and pathways to fair and equitable sharing of benefits from its use.

Patent data are a valuable source of information in examining innovation and technological advancements, which are widely acknowledged as key drivers of firm performance and economic growth34,35. Aggregate patent application counts in particular are useful for studying national patenting activity36. Patent data also provide insights into the scope of ‘pre-emptive patenting’ to block competitors, to increase the market price of existing products or to ensure operational freedom37—strategies that biotechnology corporations are known to use38. While estimating the market value of patents or establishing links to commercialization is challenging39, patent data are a useful indicator for gaining insights into the long-term economic interest of societal actors in MGR applications on a global level, in the form of either knowledge production or market control.

The MABPAT database offers a global catalogue of patent sequences derived from marine species over the past three decades. It includes in-depth information on patent applications, the genetic sequences attached to them and the marine species from which the sequences were derived, effectively connecting the resources and users of marine bioprospecting. In doing so, the MABPAT database not only fills an important research gap but also contributes to the transparency and interoperability of MGR use. By making it publicly available, we hope to enable further research efforts to inform improved policymaking. The analysis that generated this database also resulted in three key insights that are addressed below.

Rapid technological advances and data governance

Scholars have suggested that the earliest form of a patent system can be traced back 2,500 years to ancient Greece and that the first modern patent law dates back to the year 1474 (ref. 40). Little surprise then that the patent system has struggled to keep pace with the rapid advances in genetics and genomics research of the past decades, as seen, for instance, in the considerable variation in ground rules for patenting genetic sequences across jurisdictions27. Key developments over the past 30 years have focused on jurisdictional norms and compliance standards. In 1998, international applications introduced a mandatory data element for sequence description (‘organism’), which aimed to indicate biological origin41. Yet, current international standards42 still allow the inclusion of custom organism names not listed in the Integrated Taxonomic Information System (https://www.itis.gov/), including ‘unknown’, ‘unidentified’ and ‘artificial sequence’. The new requirements of INSDC43, announced in November 2021, aim to ensure correct origin disclosure for all incoming sequences. But the effect on the 24.6 million patent sequences already stored in the databases as well as new depositions remains uncertain given that it is ultimately up to patent offices to define standards for the sequences attached to patent applications (https://www.ncbi.nlm.nih.gov/education/patent_and_ip_faqs/).

The analysis of patents therefore often depends on either accepting considerable data gaps or developing methods to reconstruct missing data. In this study, for instance, 17.2 million sequences would have been excluded from the analysis owing to the lack of species names (primarily from the US Patent and Trademark Office, the largest repository of biological sequences and patents). Instead, our sequence similarity model allowed us to reasonably and more comprehensively estimate the patent shares across nation states and actor types. This reconstruction allowed us to identify marine origin, focusing on molecular similarities of biological molecules instead of relying on disclosed species names, and to confirm with higher confidence than previous work that Japan, the USA and Germany are the headquarters locations of the world’s primary MGR patent applicants25,27. The disproportionate importance of these three states suggests a corresponding responsibility to work towards innovative benefit-sharing and capacity-building mechanisms. These could include, for instance, the establishment of a multilateral fund for the equitable sharing of benefits between providers and users of digital sequence information (DSI), which has been agreed to be finalized at CBD COP16 (ref. 44).

Importance of microorganisms and deep-sea life for bioprospecting

Marine viruses, although recognized as highly prevalent in ocean ecosystems and as contributing the largest pool of genetic diversity45, have seen little commercial activity to date beyond a limited focus on those that affect commercial aquaculture production. However, the potential role of viruses in creating proteins of interest for marine bioprospecting may be greater than currently appreciated. Viruses have shaped the majority of the genomes of Archaea and Bacteria via horizontal gene transfer, the exchange of genetic material between organisms that do not form parent–offspring relationships46. Bacterial and archaeal species often live in symbiosis and exchange genes with microbial eukaryotes (protists)47, and together these groups constitute the vast majority of organisms used in marine bioprospecting. Importantly, many archaeal and bacterial species used in bioprospecting live in deep-sea habitats, most of which are located in ABNJ. The diversity of microbial marine species is still highly underrepresented in databases that document the distribution and abundance of marine life (Box 2). This underrepresentation may account for the lack of patenting interest in species found exclusively in ABNJ. However, even with limited data, our findings show that ABNJ-specific species are 2.5 times more likely to inhabit the deep ocean compared with marine species in general.

Our analysis of the past three decades of global gene patents indicates that deep-sea species have become an important source for marine bioprospecting. All of the ten largest actors in marine bioprospecting are already using deep-sea species. As a result, there is a logic for benefit sharing from MGR utilization to flow into conservation projects aimed at protecting at-risk deep-sea habitats48, not least as a vital source for future biotechnology focused on innovation and development of naturally derived products. More advanced biodiversity models that put emphasis on safeguarding entire communities with unique functional roles, including microbial species, should also be better integrated into conservation plans49.

With the successful conclusion of the High Seas Treaty and the recognition of DSI in the legally binding agreement, MGR used for bioprospecting and product discovery opens a new opportunity to protect biodiversity in deep-sea habitats. However, the INSDC database, the largest data repository of DSI, is currently missing from the biodiversity informatics landscape50; therefore, genetic diversity and information on the spatial origin of genetic information are not available on a global level. Adoption of the principles of Open and Responsible Data Governance and the development of MGR data repositories51 will be a necessary step to overcome the lack of information on MGR in ABNJ.

Intellectual property questions are not discussed within the High Seas Treaty, yet commercial sensitivities and national patent regulations are important for benefit sharing related to MGR sourced from the deep sea in ABNJ. While the agreed text of the treaty includes a voluntary mechanism to ensure traceability of MGR collected from ABNJ to end product, the treaty implementation will not affect sequences already used in marine bioprospecting to date. As there are no legal requirements for patent holders to disclose commercialization of their patents, the scale of commercial products developed and marketed from deep-sea organisms will remain poorly understood. A continued increase in corporate interest along current trajectories would lead to unequal opportunities for new developments in biotechnology.

Multi-stakeholder collaboration in MGR protection

Analysis of bioprospecting patents yielded an asymmetrical distribution of patent registrations, consistent with previous findings25,27. The sector is dominated by transnational corporations, which have a higher capacity to undertake genomic research. One-third of all patents were held by the ten largest actors, eight of which are large multinational corporations and none of which conduct marine research themselves but instead rely on public gene databases for sequences with potential commercial applications. While many multinational pharmaceutical companies have marine biology departments52, their total share of bioprospecting patents is modest (Supplementary Table 3). Still, a fair estimate of corporate engagement in marine species discovery is hard to calculate. Marine scientists who study microbial diversity often engage in collaboration with the oil and gas industry for the collection of samples in deep-sea oil wells53,54. With the rising popularity of using remotely operated vehicles for the inspection and maintenance of offshore oil and gas development sites, it is likely that more science–industry partnerships will emerge to support collection of biological data in the deep sea55.

The disproportionate role of a small number of actors also suggests the potential for science–industry collaboration in the spirit of previous efforts with so-called keystone actors, which consists in engaging the largest companies in a given sector to enable transformative change56. Constructive efforts to promote sustainable management in ABNJ have also been undertaken by partnerships such as the Deep Seas Project (https://www.deep-seas.eu) and the Common Oceans ABNJ Project57, as well as regional bodies such as the OSPAR Commission, the North East Atlantic Fisheries Commission and the Sargasso Sea Commission58, which have addressed challenges related to illegal, unreported and unregulated fishing, and pollution, based on integrated and holistic approaches. The International Seabed Authority, empowered by UNCLOS (Supplementary Text 1) to manage the resources of the seabed in ABNJ, has begun to apply tools such as Regional Environmental Management Plans (REMPs) and designated associated Areas of Particular Ecological Interest (APEIs) aimed at conserving ecosystem function and biodiversity. The impact of such measures could be further amplified by seeking a coordinated approach in accordance with overarching environmental goals59. Such initiatives can foster cross-sectoral dialogue and capacity-building activities that improve the capacity of national governments and local communities to engage in sustainable resource use in ABNJ.

Corporate efforts to safeguard intellectual property rights, significant data gaps and the heterogeneity of data standards have contributed to the use of ambiguous terminology and a lack of precision in discussions concerning MGR and bioprospecting in ABNJ. This has shaped perceptions of the scale and nature of commercial interest in MGR from ABNJ, feeding expectations of a lucrative ‘deep-sea gold rush’ without adequate empirical support for such claims60,61. While the conclusion of the High Seas Treaty has laid the foundation for improved management in ABNJ, its entry into force and full implementation are a remote prospect and, in the meantime, voluntary collaborative efforts based on the best available science can help inform future binding mechanisms to ensure conservation and sustainable use. By filling the crucial knowledge gap in understanding the potential of MGR, the MABPAT database represents a first step in that direction.

Methods

Summary statistics of patents that include MGR

The GenBank patent division, the European Bioinformatics Institute database (EMBL-EBI) and the DNA DataBank of Japan (DDBJ) exchange their data daily and together form the INSDC. Genetic sequences associated with patents were retrieved from the Patent division of GenBank from the NCBI (GenBank database) on 10 November 2022; this included 24,600,503 annotated sequences. All files (from gbpat1.seq.gz to gbpat254.seq.gz) were downloaded and processed following the methodology of ref. 25 to create database entries with information on the nucleotide sequence of DNA, species name, patent number, patent date and the party registering the patent. This was done by splitting each file into individual sequences and by extracting the data in the ‘origin’ field (nucleotide sequence), ‘organism’ field (species name) and ‘journal’ field (patent application number, year of application, patent system and patent applicant name) for each sequence. Unlike previous studies25,27 that restricted their analysis to sequences submitted in a given patent system, here we considered both patents submitted in national jurisdictions and those filed under the Patent Cooperation Treaty (‘international’ patents) of WIPO.
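The field extraction described above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline of ref. 25: it assumes decompressed GenBank flat-file text in which records are terminated by a line containing `//`, and the function name and returned dictionary keys are hypothetical.

```python
import re
from typing import Dict, Iterator


def parse_genbank_patent_records(text: str) -> Iterator[Dict[str, str]]:
    """Yield the organism, journal and sequence fields of each flat-file record."""
    # Records are separated by a line consisting of '//'.
    for record in re.split(r"^//\s*$", text, flags=re.MULTILINE):
        if not record.strip():
            continue
        organism = re.search(r"^[ \t]*ORGANISM[ \t]+(.+)$", record, re.MULTILINE)
        # JOURNAL may continue on lines indented by 12 spaces (e.g. applicant name).
        journal = re.search(
            r"^[ \t]*JOURNAL[ \t]+(.+(?:\n[ \t]{12}\S.*)*)", record, re.MULTILINE
        )
        # Everything after the ORIGIN line is sequence text with position numbers.
        parts = re.split(r"^ORIGIN.*$", record, maxsplit=1, flags=re.MULTILINE)
        sequence = re.sub(r"[^A-Za-z]", "", parts[1]).upper() if len(parts) == 2 else ""
        yield {
            "organism": organism.group(1).strip() if organism else "",
            "journal": " ".join(journal.group(1).split()) if journal else "",
            "sequence": sequence,
        }
```

Splitting the journal string further into patent number, date, patent system and applicant depends on the exact layout of that field and is left out of this sketch.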

As of November 2022, sequences from a total of 14,708 different species were included in the GenBank database. To determine the subset of ‘marine species’ within the database, the taxon match tool of the WoRMS was used for all database entries, resulting in a filtered list of 4,000 species. Web searches were conducted for each of these species to verify the marine origin and to collect further information about the nature of each species. More than half of the matched species were subsequently excluded as non-unique to marine environments, resulting in the list of 1,474 marine species, which was used to select patent records associated with disclosed marine species. See ref. 27 for details of marine origin determination and criteria for filtering.

The taxonomy (domain and phylum) of 879 marine species was retrieved from the WoRMS database. In cases in which such taxonomic levels were not available, we obtained species taxonomy from the NCBI taxonomy database (https://www.ncbi.nlm.nih.gov/taxonomy) and Wikipedia (https://en.wikipedia.org/wiki/) (220 and 356 species, respectively). We did not succeed in matching 19 of the marine species (predominantly marine bacterial strains) into related taxonomic groups owing to lack of certainty in organism names. The complete list of marine species selected for this study is given in Supplementary Table 5a.

MABPAT construction

Marine biotechnology pipelines usually focus on the search for biological compounds that encode a new functionality62. There are two types of nucleotide sequences encoded in DNA: protein-coding sequences and non-coding sequences. The former are DNA fragments that code for proteins involved in all cell functions, whereas the latter can have either a functional or a non-functional role in genome regulation. Except for short peptides like cone snail peptide toxins63, most natural products are derived from proteins, which are polypeptide chains of a certain length. While the shortest polypeptide chain length needed to form a protein is still controversial, it is currently estimated at 50 (ref. 64) to 100 (ref. 65) amino acids or 150 to 300 DNA base pairs, respectively.

Another important metric widely used to analyse genome composition variation in molecular biology and genomics is nucleotide usage, which is normally calculated as GC content: the percentage of bases in a DNA sequence that are guanine or cytosine, the two bases that form the stronger chemical bonds. Modern genetic engineering techniques such as CRISPR66 have proven to be very useful at enhancing important functions of proteins by altering DNA makeup. This could involve changing individual nucleotides or introducing short sequences that control gene regulation and protein synthesis. Because such changes are small, GC content for modified proteins with similar functionality remains largely the same. Short DNA sequences, below the shortest DNA length required for protein formation, have various functions, including in the amplification of a specific gene sequence (as PCR primers), and usually have a wide range of GC content.
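As a minimal sketch of the GC content metric, the function below computes the percentage of guanine and cytosine among the unambiguous bases of a sequence. Ignoring ambiguous symbols such as N is an assumption made here for illustration; the source does not state how such symbols were handled.

```python
def gc_content(sequence: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T or U)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    # Assumption: ambiguous symbols (for example N) are left out of the denominator.
    total = gc + seq.count("A") + seq.count("T") + seq.count("U")
    if total == 0:
        raise ValueError("sequence contains no unambiguous nucleotides")
    return 100.0 * gc / total
```

For example, `gc_content("ATGCGC")` counts four G or C bases out of six, that is, two thirds of the sequence.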

To predict whether genetic sequences are protein coding or not, we applied two filtering criteria: a sequence length threshold and the presence of an open reading frame (ORF), a gene region that has the potential to be transcribed into RNA and subsequently translated into protein. Sequences with an ORF longer than 150 base pairs were considered protein-coding sequences. As most natural products are derived from proteins, we reasoned that a patent application must include at least one protein-coding sequence to be related to marine bioprospecting. On this basis, we selected 31,914 protein-coding sequences associated with 1,039 marine species together with 112,115 other sequences that were submitted as part of the same applications.
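The two criteria can be illustrated with a small six-frame ORF scan. This is a sketch under stated assumptions, not the tool used in the study: an ORF is taken to run from an ATG start codon through the first in-frame stop codon (stop included in the length), both strands are scanned, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf_length(sequence: str) -> int:
    """Length in base pairs of the longest ATG-to-stop ORF over all six frames."""
    seq = sequence.upper().replace("U", "T")
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None  # index of the first ATG of the ORF currently open
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    best = max(best, i + 3 - start)
                    start = None
    return best


def is_protein_coding(sequence: str, min_orf_bp: int = 150) -> bool:
    """Apply the threshold from the text: an ORF longer than 150 base pairs."""
    return longest_orf_length(sequence) > min_orf_bp
```

Whether the stop codon counts towards the ORF length, and whether ORFs without a stop codon are accepted, are choices the text leaves open; they only affect sequences close to the threshold.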

For all companies that have registered patents associated with MGR, we counted the total number of nucleotide sequences and calculated the average sequence length (Fig. 2). Based on the shortest protein length estimation, the number of protein-coding or non-coding sequences for each company was identified. In each category, for the ten companies with the highest counts of genetic sequences attached to patent claims, we calculated the length and DNA composition (GC content) of each sequence, and coloured by distinct species origin (Supplementary Fig. 1).

For each sequence that was included in patent applications submitted in national jurisdictions as well as ‘international’ patents (sequences of special patenting interest), we collected the description of the invention and the protein function, if a nucleotide sequence search (BlastX) resulted in a significant match to a protein with annotated function. Web searches were conducted for each of these proteins to collect further information about protein function and potential application. The resulting information about the sequences of special patenting interest is available in Supplementary Table 2.

For patents owned by subsidiaries, the applicant name was replaced with the name of the ultimate owner as stated in the Orbis company database, which contains information on around 400 million companies worldwide (Orbis; https://orbis.bvdinfo.com/). For jointly owned patents, ownership was assigned to the first company on the list. After filtering and removing duplicate entity names and aggregating subsidiaries, we identified a total of 1,125 applicants and collected information about each through web searches, including the country where it is headquartered and the type of entity that it represents. Our classification resulted in five major entity types: multinational (presence in more than two countries) or national companies, universities and their commercialization centres, governmental agencies and ‘other’ (predominantly applications submitted by private individuals). We also included patent applications from 201 entities that contained protein-coding sequences with identified marine origins, which we were unable to classify under any specific entity type (‘none’).
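The two ownership rules (first-listed applicant for joint patents, ultimate owner for subsidiaries) can be sketched as follows. The mapping and company names are hypothetical placeholders; in the study the parent names come from Orbis, which is not queried here.

```python
from typing import Mapping, Sequence


def assign_owner(applicants: Sequence[str], ultimate_owner: Mapping[str, str]) -> str:
    """Credit a patent to one entity.

    Joint ownership goes to the first listed applicant; a known
    subsidiary is replaced by its ultimate owner.
    """
    if not applicants:
        raise ValueError("patent record lists no applicants")
    first = applicants[0].strip()
    return ultimate_owner.get(first, first)


# Hypothetical subsidiary-to-parent mapping.
PARENTS = {"Example Subsidiary GmbH": "Example Parent AG"}
```

With this mapping, `assign_owner(["Example Subsidiary GmbH", "Another Co"], PARENTS)` credits the patent to "Example Parent AG", while an applicant absent from the mapping is kept under its own name.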

Each record in the MABPAT database includes the following: (1) patent applicant name, (2) type of applicant, (3) country where it is headquartered, (4) year of application, (5) patent application number, (6) patent system, (7) genetic sequence identification, (8) marine species name associated with the sequence, (9 and 10) species taxonomy, (11) taxonomic source, (12) whether species can be classified as ‘deep-sea’ species, (13) source of deep-sea presence, (14) whether species were observed in ABNJ, (15) genetic sequence, (16) GC content, (17) sequence length, (18) whether the sequence originated from a marine organism, (19) whether the marine origin of the sequence was disclosed by the patent applicant or bioinformatically predicted, (20) whether the sequence contains protein-coding information and (21) sequence prediction source. If the marine origin was predicted, the following information about the most similar protein entry in the reference database is provided: (22) protein entry header, (23) protein entry sequence identification, (24) protein entry title, (25) E-value, (26) hit identity and (27) query coverage.

Deep-sea presence of marine species

The search for the presence of species in deep-sea habitats drew on multiple sources. For species in the Eukarya domain of life, we used the WoRDSS, a taxonomic database of deep-sea species. As Bacteria and Archaea species are not present in WoRDSS, we used web searches of the PubMed (https://pubmed.ncbi.nlm.nih.gov/) and Integrated Microbial Genomes and Microbiomes (https://img.jgi.doe.gov/) databases to establish their potential presence in deep-sea habitats, whether within or beyond national jurisdiction. Species sampled from deep-sea environments that have already been found to be associated with international patent applications27 are also marked as ‘deep-sea’ species. For the definition of deep-sea marine species, we followed the inclusion criteria in WoRDSS, that is, that the biological material was sampled at depths greater than 500 m.

BlastX sequence similarity model and patent share estimation

Sequence similarity models are widely used to identify newly sequenced data or unknown species67. To conduct sequence similarity BlastX searches (translated nucleotide versus protein) against the database of annotated protein sequences, we created the reference database of all proteins belonging to 627 genera of previously confirmed marine species in Supplementary Table 5a. A total of 24,024,531 proteins from all species within those genera were selected from UniProt Knowledgebase (UniProtKB/Swiss-Prot; UniProt Consortium 2023) which included Swiss-Prot (the expertly curated protein records) and TrEMBL (bioinformatically predicted proteins).

BlastX searches with a specific set of search parameters (E-value ≤ 10⁻⁵, query coverage ≥ 80%, hit identity ≥ 99%) were used to verify that marine sequences could be identified to a genus level with at least 95% confidence (correct hit) (Supplementary Fig. 3a). We also tested whether correct hits and searches with confidence below 95% tend to be included in certain patent applications, patented by certain actors or in certain patent systems, but did not find any preference (Supplementary Fig. 3b,c). Using the sequence search tool DIAMOND68, we queried 12,716 protein-coding sequences with disclosed marine origin against the selected records from UniProtKB, which resulted in 10,514 correct hits (82.68% recovery rate).
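The thresholds and the recovery rate can be expressed as a short filter. This is a sketch only: the `Hit` record and its field names are hypothetical, and it assumes each search hit has already been parsed into an E-value, a query coverage (per cent) and a hit identity (per cent).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One parsed search hit (hypothetical record layout)."""
    query_id: str
    subject_genus: str
    evalue: float
    query_coverage: float  # per cent
    identity: float        # per cent


def passes_thresholds(
    hit: Hit,
    max_evalue: float = 1e-5,
    min_coverage: float = 80.0,
    min_identity: float = 99.0,
) -> bool:
    """True if the hit meets all three search criteria given in the text."""
    return (
        hit.evalue <= max_evalue
        and hit.query_coverage >= min_coverage
        and hit.identity >= min_identity
    )


def recovery_rate(correct_hits: int, queried: int) -> float:
    """Share of queried sequences recovered as correct hits, in per cent."""
    return 100.0 * correct_hits / queried
```

Applied to the counts reported above, `recovery_rate(10514, 12716)` rounds to 82.68.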

We then queried 7,467,396 sequences with unknown taxonomic origin (‘unknown’, ‘unidentified’ and ‘synthetic construct’ species tags), which account for 62.7% of all GenBank records, against the selected records from UniProtKB, and found 234,836 sequences originating from 1,368 species not previously disclosed in patent records. All matched species were subsequently checked for exclusive presence in marine habitats, resulting in a final list of 561 additional marine species (Supplementary Table 5b). Overall, we recovered 60,636 previously unknown protein-coding sequences with marine origin and 144,545 other sequences that were submitted as part of the same patent applications (2,257 patent applications in total).

Finally, we compared summary statistics (number of sequences, number of patents and median year of application) for the ten largest patent applicants that referenced sequences with disclosed marine origin and the ten largest applicants that referenced sequences with predicted marine origin (Supplementary Figs. 4 and 5, respectively), and found that both lists contained the two largest patent applicants (Bayer and BASF).

Hydrothermal vent presence and ABNJ-unique species counts

The geolocation of hydrothermal vents was collected from the InterRidge Vents Database. The maritime boundary map of World High Seas was downloaded from Marine Regions (https://marineregions.org/). Each set of hydrothermal vent coordinates was checked for presence within any of the High Seas polygons. Spatial vector data were analysed with the R package sf version 1.0-9 (ref. 69).
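The containment check can be illustrated with a plain ray-casting test. This is not the sf workflow used in the study: it is a simplified planar stand-in that treats longitude and latitude as flat coordinates and ignores polygon holes, the antimeridian and points lying exactly on a boundary. Function names and the example square are assumptions.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(lon: float, lat: float, polygon: Sequence[Point]) -> bool:
    """Ray casting: count how many edges a ray cast eastwards from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge can only be crossed if its endpoints lie on opposite sides of `lat`.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def in_any_polygon(lon: float, lat: float, polygons: Sequence[Sequence[Point]]) -> bool:
    """True if the coordinates fall inside at least one polygon."""
    return any(point_in_polygon(lon, lat, poly) for poly in polygons)


# Hypothetical 10-by-10 degree square standing in for a High Seas polygon.
SQUARE = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
```

A geodesic-aware library remains the better choice for real maritime boundaries; the sketch only shows the logic of testing each vent against every polygon.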

To establish the list of species uniquely present in ABNJ, we used species geographical abundance data from OBIS. We first retrieved all 28,375 species with at least one occurrence record in ABNJ (https://obis.org/area/1). For each ABNJ-present species, we checked if it was also observed in the territorial waters of any country. Species with at least one occurrence record in territorial waters were excluded. Data were obtained from the OBIS database (2022) using the R packages robis version 2.11.0 (ref. 70) and parallel version 3.6.2 (ref. 71).
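The exclusion rule amounts to a set filter: keep the species seen in ABNJ that have no occurrence record in any country's waters. The sketch below assumes the occurrence counts have already been retrieved (the study used robis in R); the function name and species labels are placeholders.

```python
from typing import Iterable, Mapping, Set


def abnj_unique_species(
    abnj_species: Iterable[str],
    national_occurrences: Mapping[str, int],
) -> Set[str]:
    """Species recorded in ABNJ with zero occurrence records in national waters.

    `national_occurrences` maps a species to its number of records in the
    waters of any country; a missing key is treated as zero records.
    """
    return {
        species
        for species in abnj_species
        if national_occurrences.get(species, 0) == 0
    }


# Placeholder data: species B was also observed in national waters.
ABNJ = ["species A", "species B", "species C"]
NATIONAL = {"species B": 3, "species C": 0}
```

Treating a missing key as zero records is an assumption of this sketch; with incomplete occurrence data it would be safer to distinguish "no records" from "not checked".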

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.