Introduction

Natural products (NPs) are broadly defined as chemicals produced by living organisms. More precise definitions of NPs exist, but there is no consensus on them: some definitions include all small molecules that result from metabolic reactions, while others classify as “NP” only the products of secondary, or non-essential, metabolism. In this review, we chose to exclude molecules that participate in primary, or essential, metabolism, such as energy or anabolic pathways, and to consider only molecules smaller than 1500 Da that are produced by living organisms to accomplish a “higher” function, such as signalling or defence. However, as with most definitions in the life sciences, the line between primary and secondary metabolites is very thin and depends on the potential application of the molecule being categorised. This difficulty of categorisation justifies the need for dedicated NP databases or for proper annotation of NPs in generalistic molecular databases.

NPs have evolved over millions of years and acquired a unique chemical diversity, which in turn results in a diversity of biological activities and drug-like properties. Long before the rise of modern chemical pharmacology, NPs were therefore used for centuries as components of traditional medicines, in particular as the active components of herbal remedies. Nowadays, some traditional healing practices, such as Indian Ayurveda, traditional Chinese medicine or African herbal medicine, remain the primary treatment option for many people across the world, because of economic reasons, personal beliefs or difficulty in accessing pharmaceutical products. In modern pharmacology too, NPs have become one of the most important resources for developing new lead compounds and scaffolds [1,2,3]. Every week, peer-reviewed journals publish articles describing the positive effects of NPs on the healing of various human and animal diseases. Major classes of antibiotics and antifungals are based on NPs isolated from microorganisms. Drugs used in the treatment of various cancers, cardiovascular diseases, diabetes and other conditions are often NPs or their derivatives. For instance, between 1981 and 2014 over 50% of newly developed drugs were derived from NPs [1]. NPs and their derivatives are also actively studied in the food [4,5,6,7,8] and cosmetic [9, 10] industries and in agriculture, with the development of natural pesticides [11]. This growing interest in NPs and their applications has resulted in an uncontrolled growth in the number of published open and commercial databases, industrial catalogues, books of NPs and collections of structures provided in supplementary materials or in research articles, compiling NPs from various organisms, geographical locations, targeted diseases and traditional uses. Finding a complete and comprehensive open database for NPs has therefore become a real challenge.
Another major problem is the publication of structures in graphical format only, as in the annual reviews of Marine Natural Products [2]: such structures cannot easily be retrieved for computational analysis and are not automatically integrated into public molecular databases. Virtual NP collections are therefore required for virtual screening, which is the first step in all exploratory molecular analyses and, to some extent, in the discovery of NP-based drugs or other types of active components. For example, prior virtual screening of known NPs can avoid time lost extracting and purifying samples, by postponing the wet lab step until the best candidates have been identified theoretically. In this way, modern cheminformatics technologies make it possible to accelerate research and to save time and money. Previous reviews of NP databases either are outdated and do not reflect the current state of NP resources [12, 13], or focus on one particular application of such databases [14, 15] (in particular dereplication [16]) or on a particular geographic origin of NPs [17], or simply omit a significant part of the NP resources [18].

For this article, we reviewed a total of 123 resources listing NP structures that have been cited in the scientific literature since 2000. Of these, 92 are open, and only 50 contain molecular structures that we could retrieve in order to analyse their content and their overlap and to compile them. The quality of the molecular structures stored in these databases is also a challenge: stereochemistry, for example, plays a major role in the function of NPs and is at the centre of many research projects in the field. Despite this known importance, almost 12% of the collected molecules have stereocenters but lack stereochemical information. Finally, the non-redundant set of NPs from these open resources has been assembled in a MongoDB database, the COlleCtion of Open Natural prodUcTs (COCONUT).
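To make the notions of a non-redundant collection and of overlap between resources concrete, the following Python sketch merges several sources keyed on a structure identifier. It is a minimal illustration under stated assumptions, not necessarily the procedure used to build COCONUT: it assumes that every record already carries a standard InChIKey, the function names (`normalise`, `merge_sources`, `pairwise_overlap`) are hypothetical, and the keys and database names in `toy_sources` are made-up placeholders. The optional `connectivity_only` switch keeps only the first block of the InChIKey, which hashes the connectivity and ignores stereochemistry, so that entries differing only in stereochemical annotation collapse into one.

```python
from collections import defaultdict


def normalise(key, connectivity_only=False):
    """Clean an InChIKey; optionally keep only its first block,
    which hashes the connectivity and ignores stereochemistry."""
    key = key.strip().upper()
    return key.split("-")[0] if connectivity_only else key


def merge_sources(sources, connectivity_only=False):
    """Map each unique structure key to the sorted list of databases
    in which it occurs. `sources` maps a database name to its keys."""
    found_in = defaultdict(set)
    for db_name, keys in sources.items():
        for key in keys:
            found_in[normalise(key, connectivity_only)].add(db_name)
    return {key: sorted(dbs) for key, dbs in found_in.items()}


def pairwise_overlap(sources, connectivity_only=False):
    """Count the structure keys shared by each pair of databases."""
    names = sorted(sources)
    key_sets = {
        name: {normalise(k, connectivity_only) for k in sources[name]}
        for name in names
    }
    return {
        (a, b): len(key_sets[a] & key_sets[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Placeholder data: these keys are NOT real molecules.
toy_sources = {
    "db_a": ["AAAAAAAAAAAAAA-BBBBBBBBSA-N", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"],
    "db_b": ["AAAAAAAAAAAAAA-DDDDDDDDSA-N", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"],
}
print(len(merge_sources(toy_sources)))
print(len(merge_sources(toy_sources, connectivity_only=True)))
print(pairwise_overlap(toy_sources))
```

With the placeholder data, the full keys give three unique entries and the connectivity-only keys give two, which shows how the choice of identifier changes what counts as redundant.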

Natural products online resources: availability and characteristics

There is currently no globally accepted community resource for NPs where structures and annotations can be submitted, edited and queried by a wide public, as UniProt [19] is for proteins or NCBI Taxonomy [20] for the classification of living organisms. As a result, there is an impressive number (123) of resources for NP structures and their annotations, open and commercial, with different scopes and different structures. Mentions of NP databases, datasets and collections in publications from 2000 to 2019 and in omicX [21], a catalogue of scientific databases and software, were collected and are listed in Table 1 [22].

Table 1 List of Natural Products databases cited in the scientific literature since 2000. The list is ordered alphabetically by database name and contains extended metadata when available.

The databases are sorted alphabetically by name, and the table lists their main features: whether they are open or commercial; whether they are maintained and updated; the type and origin of the NPs they contain; the approximate number of molecular structures they contain; the most recent publication describing the collection; whether registration is required to access the data; whether extensive metadata is available (taxonomy of the organism producing the NP, tissue, geographical location where it was isolated, its application in (traditional) medicine, the diseases it targets, etc.); and whether the molecular structures can easily be downloaded for local use (such as virtual screening). These criteria were chosen to evaluate the “FAIRness” [23] (Findable, Accessible, Interoperable and Reusable) of the NP resources.
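As an illustration only, the criteria above can be captured in a small record type so that resources can be filtered programmatically. The Python sketch below is not part of Table 1: the class name `ResourceRecord`, its field names and the helper `retrievable_for_local_use` are hypothetical. The three example entries use only values stated elsewhere in this review, unknown values are left as `None`, and `free_bulk_download` is our reading of whether the text describes the structures as freely retrievable in bulk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceRecord:
    """One row of a Table 1-like inventory (hypothetical field names)."""
    name: str
    open_access: bool
    free_bulk_download: bool
    approx_np_count: Optional[int] = None         # None means "not stated"
    maintained: Optional[bool] = None             # None means "not stated"
    registration_required: Optional[bool] = None  # None means "not stated"


def retrievable_for_local_use(record: ResourceRecord) -> bool:
    """True if the resource is open and its structures can be
    downloaded in bulk, e.g. for local virtual screening."""
    return record.open_access and record.free_bulk_download


# Example entries; counts are the approximate NP numbers given in the text.
records = [
    ResourceRecord("ZINC", open_access=True, free_bulk_download=True,
                   approx_np_count=85_000),
    ResourceRecord("SuperNatural II", open_access=True, free_bulk_download=False,
                   approx_np_count=300_000, maintained=True),
    ResourceRecord("Reaxys", open_access=False, free_bulk_download=False,
                   approx_np_count=200_000),
]

usable = [r.name for r in records if retrievable_for_local_use(r)]
print(usable)
```

Keeping "not stated" distinct from "no" avoids silently treating missing information as a negative answer when resources are compared.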

For the purpose of this review, NP databases are first classified by their open or commercial access. Among the open-access databases, we then distinguish: databases of metabolites, which contain NPs but also products of primary metabolism; generalistic databases, which are not limited to a particular geographic location or taxonomic group; databases containing experimental spectra of NPs (NMR, mass spectrometry), which can be used for dereplication; thematic databases, which focus on traditional medicine, on drug-like NPs, on the biodiversity of a particular geographic region or on a particular taxonomic group; and, finally, open-access industrial catalogues, which are virtual collections of the NPs that chemical companies synthesize or isolate and sell. This classification is of course not the only possible one and was adopted here solely for readability.

Commercial databases

Commercial databases sell their data, access to it or a licence, and are in general quite expensive [24], even for academic use (from 6600 US$ per year for the Dictionary of Natural Products [25] to over 40,000 US$ for Reaxys [26] and SciFinder [27]).

The Chemical Abstracts Service (CAS) launched SciFinder [27] in 1995; it is a curated database of chemical information, compiled and maintained by the American Chemical Society. Originally available as desktop software, SciFinder has been available as a web version since 2008. Because CAS assigns a unique registry number to every chemical substance described in the scientific literature since 1957, SciFinder contains one of the biggest, if not the biggest, collections of curated chemicals and, consequently, of NPs. The number of NPs in SciFinder is estimated at over 300,000.

Reaxys [26] is a database of substances, reactions and documents compiled and maintained by the publisher Elsevier. It contains over 10^7 compounds in total, of which over 200,000 are NPs.

The Dictionary of Natural Products (DNP) [25] and its autonomous subsections, the Dictionary of Marine Natural Products (DMNP) [28] and the Dictionary of Food Compounds [29], are considered the most complete and best-curated resources for NPs.

NaprAlert [30] was created by researchers at the University of Illinois at Chicago and contains manually curated information on NPs from the literature, with rich metadata. It currently offers a limited number of free searches to academic researchers, under certain conditions.

The National Institute of Standards and Technology (NIST) library (version 17) [31] is one of the standard reference databases for mass spectrometry (MS) data and is developed and maintained by NIST in the USA. The main library contains over 250,000 molecules of natural origin (the separation between primary metabolites and NPs is not clearly marked) and can only be purchased on a compact disc.

MarinLit [32, 33] is a database of marine NPs based on literature reviews. It contains highly curated data that has been collected since the 1970s at the University of Canterbury, New Zealand, and has for several years been maintained by the Royal Society of Chemistry (RSC). AntiMarin [34, 35] is a historic database of marine NPs with a described antibiotic activity. While it is still widely cited in thematic studies, the database itself is no longer accessible, as it was apparently merged with MarinLit.

AntiBase [36] is a comprehensive database of more than 40,000 NPs from microorganisms and higher fungi, with very rich metadata collected from the literature and manually validated. It has not been updated since 2014 and is only available for purchase on Wiley’s website [37].

eBasis (Bioactive Substances in Food Information Systems) is an online, manually curated collection of 267 foods and 794 active compounds that they contain. The database offers rich and high-quality metadata on food NP activities and structures and limited free access to scientists to try the resource.

The Natural Product Discovery System (NADI) [38] contains over 3000 natural compounds from more than 15,000 Malaysian plant species. Despite being developed and maintained by Universiti Sains Malaysia, it is not open for academic use.

ChemTCM [39] is a database of NPs from plants used in traditional Chinese herbal medicine. The originality of this dataset lies not only in its very rich metadata but also in the predicted activity of the NPs against common Western therapeutic targets and their estimated molecular activity according to traditional Chinese herbal medicine categories. The database was developed at King’s College London, in the UK, in part with the support of Innovation China-UK.

The Natural Products Library (NPL) [40] was described in a paper by AstraZeneca, a famous pharmaceutical company, but the data, containing at the moment of publication over 800 well-curated and annotated NPs, only remained as an in-house collection.

The Ayurveda dataset [41] was initially a published database of NPs extracted from plants used in Indian traditional medicine. The link given in the publication still works but redirects to a website that provides software solutions for NP and chemistry research in general. The database may still be available together with the software, but access to it is by subscription only.

Berdy’s Bioactive Natural Products Database [42] is mentioned in publications from the 2000s and early 2010s but is no longer accessible, not even through the purchase of an older version. Originally, Berdy’s company sent out the database on paper and, with the rise of accessible digital storage, on a digital medium upon order. The company no longer seems to exist.

Open-access databases

We identified a total of 92 open-access NP resources in the literature of the last 20 years. The concept of “open access” encourages and prioritizes free and open online access to academic information, such as data and scientific publications. For a dataset, whether held in a database or attached to an article as additional information, it means that anyone can read, download, copy, distribute, print, search and re-use all or part of the data it contains. For this review, we have endeavoured to compile an exhaustive list of open-access NP resources that have been cited at least once in a peer-reviewed scientific publication after the year 2000. As the number of such sources is quite substantial (87), a thematic classification has been established for them. First, we present larger databases of organic molecules that also contain metabolites and NPs. These are followed by databases containing molecular spectra (mass spectrometry or NMR), which can be used for dereplication, that is, for the identification of organic molecules, and in particular NPs, in experimental data. Next, the scope is narrowed to databases containing only NPs, without any taxonomic, usage or geographic restriction. The most diverse category is the so-called “thematic” one: it contains databases of NPs that focus on a particular taxonomy (e.g. plants, bacteria, fungi), on a particular usage (e.g. Chinese, Indian or African traditional medicine, NPs found in food or toxic NPs) or on a particular geographic location (e.g. marine NPs, NPs from Brazilian and Mexican biodiversity). Finally, industrial catalogues of NPs are introduced. These are made available by chemical companies that synthesize or purify NPs on demand.

Databases of metabolites and chemicals

The natural starting points in a search for the structures of organic molecules are the big chemical libraries. They contain a wide range of organic compounds, among which metabolites and NPs can be identified. The reference libraries, widely accepted by the scientific community as sources of reliable molecular information, are ChEBI [43], ChEMBL [44], ChemSpider [45], PubChem [46] and ChemBank [47]. ChEBI is developed and maintained at the European Bioinformatics Institute (EBI) and its main focus is chemical ontologies, i.e. structural relationships between molecules; it contains over 15,000 clearly identified NPs. ChEMBL is also a product of the EBI, but it has a wider focus and is considered a repository for experimentally elucidated molecular structures, in particular drugs and drug-like chemicals; it contains over 1800 NPs, but this number is very probably an underestimate because molecules are not clearly labelled as NPs in this database. PubChem, an integrated platform of small molecules and biological activities, is an initiative of the US National Institutes of Health (NIH) and is one of the major sources for the discovery and submission of biomolecules. It contains over 3500 NPs, although, as for ChEMBL, this number is a considerable underestimate because compounds are not clearly labelled as NPs. ChemSpider is a chemical database offering very rich metadata, cross-references to many other chemical sources and advanced search. It is maintained by the Royal Society of Chemistry and contains over 9700 easily findable NPs. ChemBank was developed by the Broad Institute of Harvard and MIT and was dedicated to the storage of raw screening data for small organic molecules. This resource is unfortunately no longer available online because of maintenance difficulties; all of its data can still be obtained as a bulk download, which is, however, less convenient to search.

There are also databases that focus only on metabolites, i.e. chemicals that are produced by living organisms (generally, but not only, through enzyme-catalyzed reactions) and that are involved in primary and secondary metabolism. The two major and most comprehensive metabolite databases covering most of the domains of life are KEGG [48] and MetaCyc [49]. They contain comparable numbers of chemicals, including those involved in secondary metabolism, i.e. NPs, but organize their data from different points of view, and they have been widely compared in the literature [50]. The BRENDA database [51] focuses on enzyme activities but also contains the compounds involved in enzyme-catalyzed reactions, across most known domains of life. The distinctive features of this database are the manually validated compounds, reactions and enzyme activities in its main part and the exhaustive taxonomic origins given for enzymes and compounds; however, NPs and primary metabolites are not clearly separated in this resource, so it is difficult to estimate their respective numbers. The Chemical Structure Lookup Service (CSLS) [52] was developed for very rapid metabolite structure lookup in an aggregated collection of more than 80 databases, which comprised more than 27 million unique structures in 2007. It is no longer updated; the datasets can still be downloaded, but the lookup service is not available, so extracting only the NPs requires extensive data curation. The last database presented in this section is BiGG [53], a platform for highly curated genome-scale metabolic models. It contains metabolites as parts of the metabolic models, but the distinction between primary and secondary metabolism is not clear, so extracting information on NPs alone requires considerable effort.

Databases for dereplication

Dereplication is an important step in experimental NP discovery, as it prevents the re-isolation and re-characterization of already known molecules. It consists of looking up newly obtained experimental data in databases of annotated experimental data (mainly mass spectrometry (MS) and nuclear magnetic resonance (NMR) spectra) and annotating the new data when a spectral match is found. Databases used for dereplication fall into two broad categories according to the type of spectra they contain: MS and NMR.
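The lookup described above can be sketched in a few lines of Python for the simplest MS case. This is a minimal illustration under stated assumptions and does not reproduce the method of any database listed below: it matches only a single observed m/z value against reference m/z values within a relative tolerance in parts per million (ppm), whereas spectral comparison in practice involves more than one value per compound. The function names `build_index` and `dereplicate`, the default tolerance of 10 ppm and the reference entries are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(reference):
    """Sort (m/z, label) reference entries and split them into two
    parallel lists so that the m/z values can be range-searched."""
    entries = sorted(reference)
    return [mz for mz, _ in entries], [label for _, label in entries]


def dereplicate(observed_mz, index, ppm=10.0):
    """Return the labels of all reference entries whose m/z lies
    within +/- `ppm` parts per million of `observed_mz`."""
    mzs, labels = index
    delta = observed_mz * ppm / 1e6
    start = bisect_left(mzs, observed_mz - delta)
    end = bisect_right(mzs, observed_mz + delta)
    return labels[start:end]


# Made-up reference values, for illustration only.
index = build_index([
    (300.0000, "known compound B"),
    (195.0877, "known compound A"),
    (195.0900, "known compound C"),
])

print(dereplicate(195.0880, index, ppm=5))   # narrow window
print(dereplicate(195.0880, index, ppm=20))  # wider window
print(dereplicate(250.0000, index))          # no known match
```

An empty result suggests that the signal does not correspond to any compound in the reference list, which is the situation of interest when searching for new NPs; the binary search keeps each lookup fast even for large reference lists.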

Databases for dereplication for MS data

There are three distinct databases called “MassBank”: the MassBank of North America (MoNa) [54], the European MassBank [55] and the Japanese MSSJ MassBank [56]. All three contain reference MS spectra for metabolites and extensive metadata. MoNa tends to be favoured by the scientific community, as it integrates data from more sources than the other two, contains rich and community-curated metadata and facilitates the submission of new datasets.

METLIN [57] is both a database that allows the characterization of known metabolites and a technology platform for the identification of known and unknown metabolites and other chemical entities. It is a comprehensive resource containing over 1 million molecules, including primary metabolites, toxins, small peptides and NPs. METLIN’s high-resolution tandem mass spectrometry (MS/MS) database, which plays a key role in the identification process, has data generated from both reference standards and their labelled stable isotope analogues, facilitated by METLIN-guided analysis of isotope-labelled microorganisms. Access to the platform is free for academic use, but the data cannot easily be downloaded.

The Human Metabolome Database (HMDB) [58] is a metabolomic database containing comprehensive information on human metabolites, with very extensive metadata and reference spectra. It contains human-produced NPs together with NPs that are essential for the function of the human organism. However, as is the case for many of the previously described databases, separating NPs from primary metabolites is difficult.

The Yeast Metabolome Database (YMDB) [59], from the same institution, was created on the same pattern as the HMDB and therefore also contains very extensive metadata for baker’s yeast metabolites, the enzymes involved in their metabolism and reference spectra. Again, the separation between NPs and primary metabolites is difficult, so this dataset was not included in further analysis either.

The RIKEN MSn spectral database for phytochemicals (ReSpect) is a collection of in-house and literature MS spectra of plant NPs. The website is still maintained and usable, but the most recent dataset was added in 2013.

The Global Natural Products Social Molecular Networking (GNPS) [60] is a web-based knowledge base containing MS spectra for NPs only and is intended to be the base for the community-wide organization and sharing of raw, processed or identified data. In addition to providing access to spectra, it is also possible to download solely the structures of the NPs from this database.

Databases for dereplication for NMR data

NMRshiftDB [61] is an open and peer-reviewed database of organic molecule structures and their NMR spectra. It contains a large number of easily identifiable NP spectra, which makes it the reference tool for NP dereplication applications.

NMRdata [62] is a Chinese initiative for the storage and elucidation of NP structures from NMR data. Unfortunately, the main website is in Chinese and the English version is limited. Accessing the data requires an account at a university that participates in the NMRdata project. At the time of writing, NMRdata contains 1,167,468 spectra, which in theory makes it the biggest resource for NMR data in the world, but it is under-used because of the language barrier.

NAPROC-13 [63] is a database containing 13C spectral information for over 6000 natural compounds. All data is accessible and searchable online; however, it is not possible to download the corresponding structures.

Spektraris NMR database [64] is a collection of NMR spectra focused on plant NPs. The more than 400 spectra from more than 200 compounds in this database were manually transcribed from the literature. Spectra from this database are also submitted to NMRshiftDB, to benefit from the more advanced technical features of the latter.

Generalistic databases of natural products

Generalistic public databases for NPs are not specialized in any particular type, origin or usage of NPs. They are generally intended as catalogues for various purposes, such as in silico screening for activity prediction, molecular docking and so on. Seven generalistic public NP databases that have been active in the last 20 years have been identified from the literature.

SuperNatural II [65] is a database that contains over 300,000 NPs together with their 2D structures, computed physicochemical properties and predicted toxicity. It also provides references to chemical suppliers from which the molecules can be purchased, but not to other chemical databases. The database is maintained but is probably no longer updated, as some of the companies listed as selling molecules are no longer active (such as MDPI [66]). Unfortunately, SuperNatural II does not provide a bulk download, although MOL files can be downloaded for individual molecules. It also erroneously contains molecules that are not NPs (e.g. dodecahedrane, identified in this database as SN00136231), so this resource needs to be used with caution despite its wide fame in the scientific community.

The Universal Natural Products Database (UNPD) [67] was an effort to compile all known NPs in one collection for in silico drug screening. The last accessible version of the UNPD contains over 200,000 NP structures. The database is no longer accessible through the link provided in the original publication, but a copy of its molecular structures is still maintained on the ISDB [68] website (a database of in silico predicted MS/MS spectra for NPs).

ZINC [69] is a public-access database and toolset that was initially developed to enable easy access to chemical compounds for virtual screening and has since become widely used for a broad range of cheminformatics applications. It clearly separates molecules into catalogues, in particular according to their origin, and contains an easily searchable and retrievable collection of over 85,000 NPs.

The Natural Product Activity and Species Source Database (NPASS) [70] contains over 30,000 NPs from plants, bacteria, fungi and animals and is developed and maintained at the National University of Singapore. This database was created to provide a reliable source for highly curated NPs with structures, experimental activity values and the organisms that synthesize them.

RIKEN Natural Products Encyclopedia (NPEdia) [71] contains over 25,000 secondary metabolites isolated from various species and annotated with rich metadata, such as molecule origin and physicochemical and biological properties. The database is still available online but has not been updated since 2014.

3DMET [72] is a database that was created in 2005 at the National Institute of Agrobiological Sciences in Japan and is still maintained and updated. The idea for the database arose from the errors that occurred when converting NP structures from 2D to 3D and that required manual curation. Currently, the database contains over 18,000 entries, cross-referenced to the KEGG database [48], but unfortunately the structures cannot be downloaded.

The Chinese Natural Products Database (CNPD) [73] is a generalistic database created by Chinese researchers to facilitate the virtual screening of NPs for drug discovery purposes. This database was mentioned in over 120 papers up to 2010 but is impossible to locate, as no URL is provided in the original publication and the dataset was not attached to it as supplementary information. It is therefore probably inappropriate to cite this database as a data source for NPs, as the only possible sources found (from NeoTrident Technology Ltd) are in Chinese only.

A major drawback of ZINC, SuperNatural II and UNPD, the three biggest databases in terms of the number of NPs, is that neither the taxonomic nor the geographic origin of the organism that produced a compound can be identified; in general, they lack metadata and literature references.

For completeness, it is also necessary to cite two major tools for the discovery and prediction of NPs from protein sequence data: antiSMASH [74] and PRISM [75]. Both are trained on NP data, among others, but these data are not provided directly to the public.

Thematic databases

Thematic databases for NPs focus on one particular origin or application of these secondary metabolites. Here we list databases that contain NPs produced by a particular domain of life (e.g. plants, fungi, bacteria), NPs produced by organisms living in a particular geographical location (e.g. marine organisms, South American organisms) or NPs with a particular application (traditional medicines, food or drugs). With rare exceptions, thematic databases tend to be small (fewer than 3000 entries) and very specialized.

To avoid confusion about biological provenance, it should be noted that in some cases NPs isolated from plants and animals are actually synthesized by microorganisms that live on or in the host [76]. This is particularly the case for endophytes, bacteria that live inside plant cells and are very difficult to separate from them during preparation for metabolomics experiments [77]. Although such confusion is rare thanks to improved identification methods and genetic approaches, it can bias the reproducibility of NP isolation and therefore needs to be taken into account.

Natural products by the taxonomy of the synthesizing organism

Plants

KNApSaCK [78] is a comprehensive database for plant NPs that contains over 10,000 retrievable 2D and 3D structures and information on the relationships between the NPs and the organism(s) that produce them. It is rather difficult to navigate despite its original design choices, and it does not offer a bulk download of the dataset.

Collective Molecular Activities of Useful Plants (CMAUP) [79], a relatively new database, contains very extensive information on plants that are linked to human activities together with their chemical constituents, i.e. NPs. The database offers very rich metadata for NPs, such as the plants that produce them and their geographical distributions.

TriForC [80] is a European Union-funded project that aims for the “discovery and production of known and novel bioactive triterpenes for pharmaceutical and agrochemical development”. The database contains a pipeline for triterpene discovery and 266 NPs together with the enzymes and pathways leading to their production. It provides metadata for the compounds, but neither structures in a computer-readable format nor the possibility of downloading them.

Alkamid database [81] references over 300 N-alkylamides from plants, a promising group of bioactive compounds in drug and crop research. The database is fully open and offers rich metadata, in particular the taxonomic classification of the plants that produce the NPs, but does not allow bulk download of any information.

The Tea Metabolome Database (TMDB) [5] is a curated and literature-based database for tea components. Not accessible anymore, it contained over 1300 constituents found in tea.

Microorganisms

StreptomeDB [82] is a collection of NPs from bacteria of the Streptomyces genus, which is very important for the production of natural bioactive compounds such as antibiotics, antitumour and immunosuppressant drugs. These bacteria are of particular importance in pharmacological research, as around two-thirds of all known natural antibiotics are produced by them. While collecting data for this review, we encountered some difficulties accessing the website, but the data could be downloaded. In addition, an older dataset is available on ZINC.

The Natural Products Atlas (NP Atlas) [83] is maintained at Simon Fraser University in Canada and is curated by a consortium of data curators around the world. It is designed to cover NPs from microbes (bacteria, fungi, lichens and cyanobacteria) published in the peer-reviewed literature. The resource is actively updated, allows bulk download of all data and metadata and has been completely open since September 2019.

ProCarDB [84] is a database for carotenoids produced by bacteria. It contains over 300 compounds with rich metadata and structures but does not offer any download option.

PAMDB [85] is a comprehensive, well-curated Pseudomonas aeruginosa metabolome database with rich metadata and bulk download. However, it contains not only NPs but also products of primary metabolism, so it was not included in the COCONUT collection.

The Lichen Database [86] is a collection of over 200 metabolites that have been isolated and identified experimentally in lichens. The database is not available yet, but the data has already been published in the MetaboLights [87] repository for metabolomics experimental data.

Natural products by use

Traditional medicines

Between 1999 and 2009, the World Health Organization compiled a list of over 21,000 plants used for medicinal purposes all over the world [88, 89]. This effort was made for the proper identification of safe plants, as it is estimated that plant-based traditional medicines are used by 60% of the world’s population [90]. In addition to efforts to establish formal, DNA-based identification of such plants for wider use [91], collections of medicinal plant species, and in particular of phytochemicals (NPs produced by plants) together with their therapeutic activities and physicochemical properties, are being established around the world. This is particularly the case in Asia and Africa, where traditional medicines remain an important part of everyday life for cultural, traditional and economic reasons.

Traditional Chinese Medicine (TCM) is naturally part of the Chinese public health system [92, 93]. It is therefore not surprising that the scientific study of natural compounds from plants used in TCM is very advanced in China and receives strong governmental support, and that a plethora of databases containing NPs, their sources and their effects has been developed there.

The biggest database containing NPs used in TCM is TCM@Taiwan [94]. It contains over 58,000 entries and directly feeds iSMART [95], an integrated cloud computing web server for online virtual screening, evolution studies and drug design. Several other, smaller databases of NPs used in TCM can be cited. The Chinese Ethnic Minority Traditional Drug Database (CEMTDD) [96] is maintained but not updated, and contains 4000 NPs. The Chinese Traditional Medicinal Herbs Database (CHDD) [97] is not maintained anymore; according to its publication it contained over 30,000 entries, which are now inaccessible and probably lost to the scientific community. Other databases containing phytochemicals and other active compounds used in TCM include the Comprehensive Herbal Medicine Information System for Cancer (CHMIS-C) [98], which is not maintained anymore; the Encyclopaedia of Traditional Chinese Medicine (ETCM) [99], which is maintained, although the chemical structures it contains are not easily retrievable; the database of medicinal materials and chemical compounds in Northeast Asian TM (TM-MC) [100], which is maintained and updated and gives precise plant species for all compounds, but provides no structures; the Traditional Chinese Medicine Integrative Database (TCMID) [101], which is maintained but not updated anymore; and the Traditional Chinese Medicine Systems Pharmacology database and analysis platform (TCMSP) [102], which is also not maintained anymore but used to contain over 29,000 NPs. It quickly becomes clear that there are many databases focusing on chemical compounds used in TCM, and their creators recognize it: there is even a database called “Yet Another Traditional Chinese Medicine Database” (YaTCM) [103], published in 2018. These databases differ mainly in the number of compounds they cover, in the richness of their metadata and in the availability of the datasets they contain.

Another extremely important traditional medicine in Asia is Indian Ayurveda, which has also become widely popular worldwide over the past decade. There are, however, very few databases listing natural compounds from plants, insects and animals used in Ayurveda, and they do not contain as many entries as the Chinese ones. Only two are currently online and open. The first one, IMPPAT [104], is a manually curated database of over 10,000 phytochemicals extracted from 1700 Indian medicinal plants, with their phytochemistry and their therapeutic effects. The other, MedPServer [105], contains NPs from plants from North-East India used in traditional medicine. It aims to elucidate the therapeutic mechanisms of action of the 1124 NPs from these plants by integrating ligand-based and structure-based approaches. NeMedPlant [106] is a small (over 100 NPs) database of active compounds from plants used in North-East Indian traditional medicine, with rich metadata focused on the plants that produce the compounds, but it offers no way to download any information and is not updated anymore. Because it was cited in several peer-reviewed papers, we also need to mention TIM [90], created in 2011 for the Prediction of Biologically Active Natural Products from Ayurveda Traditional Medicine; however, the publication was never linked to an actual database, nor does its supplementary material list the NPs.

Phytochemica [107] is a small database of plant-derived chemicals from Himalayan plants used in both Chinese and Indian traditional medicines. There are also some databases of NPs that specialize in traditional medicines of other parts of Asia, such as the Database of Indonesian Medicinal Plants [108] and TIPdb [109] for plants from Taiwan, but most of them are relatively small and generally contain only a few hundred compounds.

African Traditional Medicine (ATM) is the other extremely rich and developed traditional medicine, with many modern efforts to study it, rationalize it and put its teachings to the benefit of modern medicine. As for TCM and Ayurveda, this requires inventorying the plants used by African traditional doctors, identifying the parts that are used as effective cures and then identifying the active components that they contain. There are also a certain number of databases focusing on NPs from plants used in traditional medicines on the African continent. Among those, the most famous and the most generalistic is AfroDB [110], although it is only accessible through the ZINC catalogues. The pan-African natural products library (p-ANAPL) also needs to be cited here, as it focuses on plants used in ATM and is available as the supplementary information of its publication [111]. Three datasets, AfroCancer [112], AfroMalariaDB [113] and Afrotryp [114], available as supplementary information of their respective publications, link NPs from plants used in traditional medicines to their potential targets involved in the treatment of cancer, malaria and Trypanosoma. Finally, there are country-specific and relatively small databases for NPs extracted from ATM plants, such as the Cameroon Medicinal Natural Products database (CamMedNP) [115], the Central African Medicinal Plants database (ConMedNP) [116] and the Ethiopian Traditional Medicine Database (ETM-DB) [117].

Databases of drug-like natural compounds

Not linked, at least directly, to the traditional medicines, there is a lot of pharmacological research around the therapeutic properties of NPs, and its results are compiled in the databases for drugs and drug candidates. In these databases, natural compounds are generally associated with a type of disease or with the molecular targets or receptors they interact with, and with a rich description of their molecular and overall effects on the state of a patient or of a healthy person. The reference database in this category is DrugBank [118]. Its latest version, which was greatly modified and curated compared to previous ones, contains over 10,000 drugs, among which 3732 are approved drugs and 200 are approved drugs that have been produced by a living organism. In order to select only the latter, one needs to search for “nutraceuticals” in the search bar of the DrugBank website [119]. The previous version of DrugBank, 4.0 [120], contained over 8000 nutraceuticals, and they were added to COCONUT.

BindingDB [121] is an interesting database for pharmaceutical research, as it contains measured binding affinities between proteins that are presumed drug targets and small drug-like molecules. Although it does contain NPs and their protein targets, they are not clearly distinguishable from synthetic drugs in this database.

The Novel Antibiotics Database [122], which is surprisingly still online, has not been updated since 2003 and contains 5430 compounds of natural origin with antibiotic activity that were published in the Journal of Antibiotics between 1947 and 2003. However, no structure is available for download, only compound names, their activity and the organisms they were isolated from.

ChemIDplus [123] is a database that is part of the TOXicology DataNETwork and covers chemicals that have a relationship with diseases, environment, environmental health and poisoning. It contains rich metadata for each chemical, including its physicochemical properties but also its impact on health and environment. A simple search for “natural product” returns more than 9000 entries; it is, however, not possible to bulk download the results of the query.

The Herbal Ingredient Targets (HIT) [124] and the Herbal Ingredients in vivo Metabolism (HIM) [125] databases are two inter-connected collections of NPs from mainly (but not only) Chinese plants. Neither is accessible online anymore, but the structures of the NPs they contained are available on ZINC. They contained very extensive metadata on the molecular targets of the herbal active ingredients, their toxicity, a wide range of pharmacologically relevant molecular descriptors and their therapeutic effects. Unfortunately, this metadata is not available on ZINC and is probably lost.

There are several databases that focus on collecting information on NPs with anticancer properties and their mechanisms of action. The first one, NPCARE [126], contains over 6000 NPs from plants, marine organisms, fungi and bacteria with validated anticancer activities, together with extensive metadata. The website is available and seems updated but is sometimes unreachable, probably due to server failures on the maintenance side. The Indian Plant Anticancer Compounds Database (InPACdb) [127] is not available anymore but used to contain very broad information covering pharmaceutical and physicochemical properties of 144 NPs, cancer types and molecular targets. Fortunately, the data is still available on GitHub [128]. Another database containing phytochemicals with anti-cancer properties, the Naturally Occurring Plant-based Anti-cancer Compound-Activity-Target (NPACT) database [129], is still maintained and accessible. It contains 1574 manually curated entries with rich metadata on NPs and their therapeutic mechanisms in different types of cancer. The US National Cancer Institute also maintains and makes freely available a number of small (390 compounds on average) natural compound datasets [130] that have been selected as being of interest in anticancer research and are currently undergoing tests in various research groups from the US NIH.

InflamNat [131] is a small (200 NPs) but well-curated dataset of NPs with anti-inflammatory activity. The dataset consists of NP structures, their type and origin and literature references, and is available as supplementary information for its publication.

BioPhytMol [132] is a manually curated database of natural compounds from plants that have an antibacterial effect. The database has over 2500 entries with very rich metadata, in particular regarding the plant species from which the compounds were extracted. The database is open and maintained but does not offer a bulk download option for further analyses.

The last database in this section is Open Source Malaria (OSM) [133], a fully open-source collaborative project for anti-malarial drug discovery that has already had some success [134]. Drug candidates tested in this project are often of natural origin, but as the focus of this database is to collect their effects, the origin is not always specified, so the content of OSM was not integrated into COCONUT.

Food

FooDB [8] is the reference database on chemical food constituents, associated with extremely rich and diverse metadata. It is developed by the Wishart research group and supported by the Canadian Institutes of Health Research. In total it contains over 22,000 NPs and offers a convenient bulk download of their structures.

BitterDB [6] collects bitter-tasting natural compounds associated with rich metadata on their receptors. However, it also contains synthetic molecules with a bitter taste, and in this database, it is difficult to separate them from the natural ones.

Phenol-Explorer [135] is a comprehensive database on polyphenol content in food. It currently contains over 800 phenol structures from over 400 foods. Data is derived from the scientific literature, and all data is associated with rich metadata and is available for download.

PhytoHub [136] is a database of dietary phytochemicals and the human and animal metabolites that derive from them. Over 1200 NPs from more than 350 foods are available in this resource, together with rich metadata and references to other chemical and spectral databases. Unfortunately, it does not offer a bulk download for the moment.

The SuperSweet database [4] is a collection of various molecules with a sweet taste, mainly of plant origin but also synthetic. Their structures, together with information on their number of calories, therapeutic uses and sweetness index, are available. The database is still maintained but has not been updated since 2011 and does not provide a bulk download of its content.

Toxins

A toxin is a substance that is toxic to one or more living organisms and that is of plant or animal origin. Despite this original definition, more and more resources on toxins also include molecules of non-organic origin that are massively present in the environment, as they too have a harmful effect on living organisms. For instance, Exposome-explorer [137] is a manually curated database of biomarkers of exposure to environmental and dietary factors, and it also contains these factors and their structures. Many of the toxic environmental and dietary factors in it are of natural origin, but approximately half of the compounds in this database are not NPs, which is reasonable since, for example, environmental pollution is anthropogenic. The T3DB [138], the toxin and toxin-target database, can be mentioned in the same way: it contains a number of toxins produced by living organisms, but its focus is on synthetic toxins and how human metabolism reacts to them.

The biggest database of animal toxins (over 1000 entries) was the Animal Toxin Database (ATDB) [139], designed originally to collect toxin structures, origins and effects, but it is not available anymore at the URL provided in the publication. More specialized databases were also published, such as the International Venom and Toxin Database [140], the Snake Neurotoxin Database [141], the Mollusk Toxin Database [142] or the Scorpion Toxin Database [143]. Unfortunately, most of these databases were based on unformatted text and lacked effective systems for data query, and none of them is accessible anymore. It is also unknown whether the data contained in these databases is lost or is still available in some generalistic resources.

The last in this section, the Toxic Plants—Phytotoxins Database (TPPT) [144], is accessible and is maintained and updated by the Agroscope in Switzerland. It contains over 1500 phytotoxins from Central Europe and offers high-quality metadata and a convenient bulk download.

Other

The two databases described next did not fit into any of the previous categories. The Carotenoids database [145] is a collection of NPs that are produced by a wide range of organisms and that share common substructures (polyene with possibly terminating rings) and properties, as they are all yellow, orange or red pigments. Carotenoids produced by plants have particular importance for the nutritional value of the consumed food [146], but plants are not the only producers of this molecular type, as the Carotenoids database demonstrates. This database is developed and maintained at the RIKEN institute. SuperScent [10] is a database of volatile compounds, essentially of organic origin, that can be smelled by humans and animals. It contains over 2000 compounds with their structures and properties but does not offer any download, and most of the compound pages are not working anymore. This database is maintained at Charité Berlin but has not been updated since 2010.

Natural products by the geographic origin of producing organisms

There are a number of country-level efforts to catalogue the biodiversity of NPs in particular geographical zones, generally defined by political borders. These databases are mainly plant-focused, but can also include NPs produced by insects and microorganisms, as well as animal toxins. In this part, the databases are cited in geographical order from West to East. The last part describes collections of NPs from organisms in marine and ocean environments.

BIOFAQUIM [147], a database published in 2019, offers for full download over 400 unique NPs from plants, fungi and propolis from Mexican flora and fauna, together with the species from which the compounds were extracted and their geographical location. The Nuclei of Bioassays, Ecophysiology and Biosynthesis of Natural Products Database (NUBBEDB) [148] is the first NP library from Brazilian biodiversity. It currently contains over 2000 highly curated NPs with good-quality metadata, and allows easy download of all or part of the data. The UEFS dataset [149] is a collection of NPs isolated from Brazilian plants, maintained by the State University of Feira de Santana in Bahia, Brazil. The NPs in this collection have been published separately, and there is neither a common publication nor a public database for it; it is, however, accessible via ZINC.

Three databases contain NPs from the African flora and fauna. The Northern African Natural Products Database (NANPDB) [150] contains over 4500 NPs from plants, endophytes, fungi and bacteria. The database provides rich metadata, literature references, cross-references to major chemical databases and an easy bulk download. The South African natural compound database (SANCDB) [151] is very similar to NANPDB in its quality and contains over 600 NPs isolated from South African biodiversity. It is also possible to submit new molecules and to participate in the curation of the database. The Mitishamba database [152] contains 1100 NPs isolated from Kenyan plants. The database is still maintained but does not seem to be updated and it is possible to download data from it only by requesting an account.

ChemDB [3] and the MAPS database [153] are two databases for natural compounds from Pakistani plants. Unfortunately, neither of them is accessible anymore. VIETHERB [154] is a database published in 2018 with the aim of providing high-quality, literature-based data on herbs and their active compounds. Although the database is recent, it is not accessible anymore.

The oceans cover 71% of the surface of the Earth, therefore databases that collect NPs from marine organisms are expected to be broad, complex and cover a wide range of organisms. Unfortunately, the biggest repositories for marine NP structures are commercial (e.g. MarineLit [33] and DMNP [28] presented above). In the marine NP community, the major trend is to publish newly discovered molecules in specialised journals (such as the Journal of Natural Products [155] or Marine Drugs [156]) as images and rich textual description that are not, for now, easily machine-retrievable.

In the last 20 years, four databases containing structures of marine NPs and their metadata were published. Two of them are not accessible anymore: the Marine Compound Database (MCDB) [157] and the Marine Natural Product Database (MNPD) [158]. Both contained only a few hundred entries according to their respective publications, but these came with rich metadata, which is now lost. The Dragon Exploration System on Marine Sponge Compounds Interactions (DESMCI) [159] is still accessible but seems not to be maintained, as the actual data, such as molecular structures and the corresponding metadata, is not visible when one tries to access it. The Seaweed Metabolite Database (SWMD) [160] is the only one that is really maintained; it contains 1110 entries, with only 423 unique structures. Molecular structures in this database are annotated with the species of algae that produce them, together with the geographical origin of the latter, the biological activity of the compound and its physicochemical properties.

Industrial catalogues

Many companies that synthesize and isolate chemical compounds offer a catalogue of their products, and in some cases these catalogues also contain structures and annotations. These catalogues are often cited in the scientific literature as sources of NP structures, so it was important to mention the most used ones in this review. Surprisingly, a non-negligible number of the cited catalogues of NP structures are accessible only to clients, on demand or to registered users. This is the case for the Ambinter-Greenpharma natural compound library [161], the ChemBridge diversity datasets [162] (their NP catalogue seems not to be available anymore), LOPAC1280 by Merck [163], Prestwick [164] and TargetMol [165]. Open NP catalogues are provided by the following: AnalytiCon Discovery [166], InterBioScreen [167], Indofine Chemical Company [168], Pi Chemicals Systems [169] and Specs [170]. The website of the latter no longer offers the download of its NP catalogue, but a dataset is available on ZINC [171]. Note that only the catalogues most famous and most cited in academic research are listed here; more industrial catalogues for NPs exist.

Problems

The biggest problem nowadays is that there are too many sources for NPs. A researcher with little experience in NPs (and even a more experienced one) will simply get lost in this variety and diversity of possible data sources. The next major problem is access to data and its maintenance. Indeed, many publications point to a website that is not maintained anymore. This is the case for the majority of animal toxin databases, but also for a number of small regional or traditional medicine databases. In the list of NP sources presented in Table 1, over 20% are not maintained anymore or are only intermittently accessible. In some rare cases, the information on the NP structures is still recoverable via the ZINC database, but this is not the case for more modern databases, and ZINC does not store any metadata from these collections, only the molecular structures encoded in SMILES. Also, the description and origins of the NPs (i.e. metadata), in addition to their structure, are generally lacking, especially in data aggregators, which are nevertheless the most commonly used. This leads to cases where in silico screening reveals a potentially interesting compound, but considerably more effort and investigation are required to identify its origin and the way to obtain it experimentally. Only 40% of NP databases offer an easy bulk download of the molecular structures that they contain for further analyses with local tools. The quality of the molecular structures might also require additional attention and curation efforts. Indeed, there are no standards for NP databases for the definition of stereochemistry, aromaticity or isotopes, which leads to a variety of possible versions of the same molecule.

This multiplicity of databases also comes from the publishing pressure on scientists, the infamous “publish or perish”. Nowadays, publishing a dataset or a database is a relatively easy publication and has the potential to generate a high number of citations. However, this trend generates a plethora of databases that are unmaintained beyond the publication time (as is the case for VIETHERB [154], for example, published only 1 year prior to the writing of the present review and already not accessible anymore), despite the journals’ requirements to keep published datasets and databases accessible for a number of years ahead.

Comparison and analysis of the content of open NP databases

The 50 NP collections from which NP structures could be downloaded were analysed in order to evaluate their overlap in terms of molecular structures and the coherence of their content. 19 physicochemical properties, such as molecular weight, NP-likeness [172, 173], logP, TPSA Efficiency, and Zagreb Index, were computed and their distributions are shown in an interactive graphic at https://npreview.naturalproducts.net. Due to the high number of databases to compare, a non-interactive figure would not be legible. Globally, the physicochemical properties of all datasets are comparable. The NP subset of DrugBank contains molecules that are less likely to be NPs, which can be explained by its high content of NP-derived drugs and the difficulty in dissociating the latter from synthetic ones. The average mass of all NPs in the assembled collection is 454 Da, and the Spektraris and TCM@Taiwan databases contain the heaviest molecules: in both, the average mass is 612 Da. The logP is a lipophilicity measure commonly used in analytical chemistry; the more positive it is, the more lipophilic the compound, and the more negative, the more hydrophilic. Here, the logP was computed with two algorithms, AlogP and XlogP, available in the CDK [174]. In general, NPs tend to be lipophilic, which allows them to have higher membrane penetration, but all datasets also contain, in lesser amounts, hydrophilic molecules. CarotenoidsDB and the SeaWeed Metabolites Database stand out from the others with their very lipophilic content. On the other hand, ReSpect contains more hydrophilic molecules than other datasets.

The overlap in terms of molecular structures between the databases was also calculated and is presented in Fig. 1 and in Additional file 1: Table S1. In Fig. 1, which represents a network of overlap between databases, there is a directed edge between database A and database B if more than 50% of the unique molecules from database A are present in database B. An interactive version of this network, where the user can change the percentage of similarity between databases to display is available at https://npreview.naturalproducts.net. It should be noted that 40 of the 50 open NP databases have an overlap of at least 50% with at least one other open database. Except for the Lichen Database, all datasets share at least 10% of their compounds with at least one other open dataset.

Fig. 1

Network of content similarity between the 50 open natural products databases. The network is directed, and there is an arrow from database A to database B if more than 50% of molecules in database A are also present in database B. The interactive version of this network is available at https://npreview.naturalproducts.net
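The edge rule used for the network in Fig. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the review: each database is assumed to be already reduced to a set of stereochemistry-free structure keys, and the function name `overlap_edges` and the toy database names are hypothetical.

```python
from itertools import permutations


def overlap_edges(databases, threshold=0.5):
    """Return directed edges (a, b, fraction).

    An edge a -> b is emitted when more than `threshold` of the unique
    molecules of database a are also present in database b.
    `databases` maps a database name to a set of structure keys.
    """
    edges = []
    for a, b in permutations(databases, 2):
        mols_a, mols_b = databases[a], databases[b]
        if not mols_a:
            continue  # an empty database cannot overlap with anything
        fraction = len(mols_a & mols_b) / len(mols_a)
        if fraction > threshold:
            edges.append((a, b, fraction))
    return edges


# Toy data (hypothetical keys K1..K9, not real identifiers).
toy = {
    "DB_A": {"K1", "K2", "K3", "K4"},
    "DB_B": {"K1", "K2", "K3", "K5", "K6", "K7", "K8", "K9"},
    "DB_C": {"K9"},
}
edges = overlap_edges(toy)
for source, target, fraction in edges:
    print(f"{source} -> {target}: {fraction:.0%}")
```

Because the fraction is computed relative to the source database, the relation is not symmetric: in the toy data, DB_A points to DB_B (3 of its 4 molecules are shared), but DB_B does not point back (only 3 of its 8 are shared). Changing `threshold` corresponds to the adjustable percentage in the interactive version of the network.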

In the majority of the databases, stereochemistry is defined for at least some of the content. Only three databases, TCMID, ReSpect and NPCARE, have no stereochemistry defined for any of their molecules. The fraction of NPs with stereochemistry in each database is given in Table 1. On average in the open NP databases, more than 50% of the molecules have a defined stereochemistry. When a 2D molecular structure is present in two databases and its stereochemistry has been elucidated, open databases generally agree on it: in a pairwise comparison of overlapping content, pairs of databases agree on the stereochemistry of, on average, 70% of the NPs that they share. The whole list of pairwise agreement between databases on the stereochemistry of their overlapping molecules can be found on FigShare (https://doi.org/10.6084/m9.figshare.11926047.v2).
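As an illustration of such a pairwise comparison, the sketch below computes the agreement between two databases on the molecules they share. The exact procedure of the review is not detailed in the text, so the data model here is an assumption: each database is a dictionary mapping a stereochemistry-free key to one stereo-aware identifier, or to `None` when no stereochemistry is recorded. The function name `stereo_agreement` and the toy identifiers are hypothetical.

```python
def stereo_agreement(db_a, db_b):
    """Fraction of shared, stereo-defined molecules on which both databases agree.

    Only molecules present in both databases AND carrying stereo information
    in both are compared. Returns None when there is nothing to compare.
    """
    comparable = [
        key
        for key in db_a.keys() & db_b.keys()
        if db_a[key] is not None and db_b[key] is not None
    ]
    if not comparable:
        return None
    agreeing = sum(1 for key in comparable if db_a[key] == db_b[key])
    return agreeing / len(comparable)


# Toy data: M1 agrees, M2 disagrees, M3 has no stereo in db_a, M4/M5 are not shared.
db_a = {"M1": "M1-R", "M2": "M2-S", "M3": None, "M4": "M4-R"}
db_b = {"M1": "M1-R", "M2": "M2-R", "M3": "M3-S", "M5": "M5-R"}
agreement = stereo_agreement(db_a, db_b)
print(agreement)
```

A real comparison would also have to decide how to treat molecules for which one database lists several stereoisomers; this sketch deliberately keeps one identifier per molecule.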

Five NPs are found in 34 of these 50 databases: apigenin, quercetin, kaempferol, catechin and naringenin. Interestingly, they all belong to the flavanol group, part of the flavonoid family, and share a common skeleton (Fig. 2a), differing only in their hydroxy groups. The top ten most frequent molecules in open databases include, in addition to more flavonoids, coumaric acid (Fig. 2b), gallic acid (Fig. 2c), scopoletin (Fig. 2d) and ellagic acid (Fig. 2e). According to the literature, all these compounds are well-known plant products; however, most of the flavanols, coumaric acid and scopoletin are also present in the bacterial NP database, StreptomeDB.

Fig. 2

Most frequent molecules in open databases. a Common biggest substructure in the top 5 most frequent molecules, found in 34 out of 50 open databases. b Coumaric acid; c gallic acid; d scopoletin; e ellagic acid

COlleCtion of Open NatUral producTs (COCONUT)

In its current version, COCONUT contains 411,621 unique molecules, unified on stereochemistry-free InChIKeys, that were collected from 50 open and accessible NP databases, listed in Table 1. This number is large, in part because the dataset still needs to undergo a curation process: despite their claims, some of the NP collections do not contain only natural compounds. 27.9% of molecules in COCONUT do not have stereocenters defined in any of the databases from which they were collected. Among the latter, 57.7% (66,374 unique molecules) truly have no stereocenters, and the remaining 48,611 NPs have at least one stereocenter, but this information is not provided.
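The percentages in this paragraph can be cross-checked from the counts it gives. The short Python snippet below does only that arithmetic; the variable names are ours.

```python
# Counts quoted in the text.
total_molecules = 411_621        # unique molecules in COCONUT
no_stereocenters = 66_374        # molecules that truly have no stereocenter
stereo_not_provided = 48_611     # molecules with >= 1 stereocenter, none defined

# Molecules without any defined stereochemistry in their source databases.
without_defined_stereo = no_stereo = no_stereocenters + stereo_not_provided

# Share of the whole collection, and share of truly achiral molecules among them.
share_of_collection = 100 * without_defined_stereo / total_molecules
share_truly_achiral = 100 * no_stereocenters / without_defined_stereo

print(without_defined_stereo)
print(round(share_of_collection, 1))
print(round(share_truly_achiral, 1))
```

The two counts sum to 114,985 molecules, which is 27.9% of the collection, and the 66,374 molecules without stereocenters make up 57.7% of that subset, consistent with the figures quoted above.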

50% of the unique molecules have only one stereochemical version of their 3D structure, and 22.1% have more than one. The latter could be different valid stereoisomers of the same base constitution or errors in the databases. Addressing those errors will be the subject of future curation of COCONUT. When a 2D molecule has several possible 3D structures, these can originate from the same public database, where stereochemistry is precisely defined, but also from different databases. Note that unknown NP structures or mixtures are not included in COCONUT. The collection is available as a MongoDB dump and a CSV file on Zenodo (https://doi.org/10.5281/zenodo.3547718), and a user-friendly web interface to browse it is under development. The aim of COCONUT is to make NP-related data as FAIR as possible.

Discussion

There are currently 123 data collections of natural products (NPs) that have been published and cited in the scientific literature between 2000 and 2019. Only 50 of them are open access or have their content accessible (in ZINC for example) and among them, the overlap of their content is significant, as 40 of these datasets share at least 50% of the compounds they contain with at least one other dataset.

There are several aggregators, such as the ZINC catalogue for NPs, SuperNatural II and UNPD (not maintained anymore), but they do not cover the entire space of known NPs and do not allow submissions of newly discovered compounds.

There is a need for an aggregator database for NPs that is commonly recognized, well organized and allows easy submission of newly found molecules, as UniProt does for proteins.

Conclusions

Natural products are important molecules for medical, chemical and social research. There is, for now, no universal, community-accepted database for NP discovery, screening and dereplication. Instead, there is an extremely high number of very diverse databases and datasets, not all of them maintained or open access in 2020, which represents a serious loss of knowledge. There is a need for a unified universal repository for NPs, to avoid the unnecessary duplication of online resources and to facilitate NP research. For the purpose of this review, a COlleCtion of Open Natural prodUcTs (COCONUT) has been assembled, analyzed and made available on Zenodo (https://doi.org/10.5281/zenodo.3547718). A web interface is currently under development for user-friendly querying, exploration and download of the known open NP space. In the future, the annotations of the molecules contained in COCONUT will be improved, in particular by systematically linking each compound to the first publication in which it was described and to the organisms that synthesize it.

Materials and methods

All databases in Table 1 were downloaded in July and September 2019. Molecular structures were processed with CDK 2.3 and, when available, annotations were parsed with Java (code available on GitHub https://github.com/mSorok/COCONUT). The resulting original and non-redundant collections of NPs are stored in a MongoDB database, available as a dump on Zenodo (https://doi.org/10.5281/zenodo.3547718). Redundancy was eliminated based on InChIKeys, computed without stereochemistry (JNI-inchi option of the InChI generator set to “Snon”, “ChiralFlagOff” and “AuxNone”). Stereochemistry was not taken into account during this unification step, as it is encoded differently between some databases and there are databases where it is not encoded at all. The overlap between databases in terms of similar stereochemistry was also computed with CDK 2.3. All network representations of overlaps between databases were made with Cytoscape [175]. Plots and comparative analyses were made with Python and the Plotly and Dash libraries. The code for the interactive plots is available on GitHub at https://github.com/mSorok/NPDBReviewDash.
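The unification step can be illustrated with a small Python sketch. It is not the pipeline used for the review, which generated stereochemistry-free InChIKeys in Java with CDK. As a stand-in, the sketch assumes that each record already carries a standard InChIKey and groups records on its first 14-character block, which hashes the connectivity of the molecule; this is only an approximation of the procedure described above. The function names and the placeholder keys are hypothetical.

```python
from collections import defaultdict


def connectivity_block(inchikey):
    """Return the first block of a standard InChIKey (connectivity hash)."""
    return inchikey.split("-")[0]


def merge_by_connectivity(records):
    """Group (source_database, inchikey) records by connectivity block.

    Returns a dict mapping each connectivity block to the set of source
    databases and the set of full keys (stereo variants) seen for it.
    """
    merged = defaultdict(lambda: {"sources": set(), "variants": set()})
    for source, inchikey in records:
        entry = merged[connectivity_block(inchikey)]
        entry["sources"].add(source)
        entry["variants"].add(inchikey)
    return dict(merged)


# Placeholder keys in InChIKey-like format; they do not denote real molecules.
records = [
    ("DB_A", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("DB_B", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),  # same skeleton, other stereo variant
    ("DB_B", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("DB_C", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"),
]
merged = merge_by_connectivity(records)
for block, entry in merged.items():
    print(block, sorted(entry["sources"]), len(entry["variants"]))
```

Keeping the set of full keys per merged entry makes it possible to count, for each unique molecule, how many stereochemical versions were collected and from which databases.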