For thousands of years, natural products have been an important source of drugs. They are produced by marine or terrestrial organisms (plants, vertebrates, invertebrates…) and microorganisms (fungi, bacteria, algae). Many studies in the literature discuss the importance of natural products in drug discovery [2–5]. They remain an important source of many drugs on the market (e.g. morphine, cocaine, penicillin, taxols…) and are also good lead compounds suitable for further modification during drug development. Bringing a new compound to market is a time-consuming and cost-intensive process [6, 7], in particular for natural products, so strategies that save time are welcome.
The discovery of natural products requires specific steps because they are synthesized by living organisms. For example, scientists need to determine which organisms produce interesting compounds and define the conditions of production. The produced compounds then have to be extracted from culture media or from natural environments. Finally, their chemical structures are determined. Those structures can then be mimicked, leading to artificial compounds. To reduce the time and cost of these steps, the ideal approach is to predict the compounds produced by an organism directly from its genome sequence. This strategy is particularly well suited to nonribosomal peptides.
Those peptides are synthesized by a ribosome-independent cell machinery. This alternative pathway produces peptides using large multi-enzymatic complexes called nonribosomal peptide synthetases (NRPSs). Those synthetases are composed of proteins organized in modules, each one being responsible for the incorporation of one specific amino acid into the final peptide. A relationship between specific signatures and a given incorporated amino acid has been determined from the protein sequences of NRPSs [9–12]. Thus, starting from a genome sequence, bioinformatics analysis makes it possible to extract the genes coding for NRPSs, to deduce their protein sequences and to predict the amino acids incorporated into the produced peptide. This predicted peptide can then be analyzed by bioinformatics tools to infer its putative activity.
We have collected nonribosomal peptides in Norine (http://bioinfo.lifl.fr/norine/), the first and still unique computational resource dedicated to nonribosomal peptides (NRPs). Each peptide has a unique Norine identifier of the form NOR followed by a five-digit number. The database contains more than 1,100 nonribosomal peptides extracted from the scientific literature, with manually curated annotations such as biological activity, producing organisms or bibliographic references and, most importantly, their monomeric structure. We use the general term monomer instead of amino acid because the entities encountered in those peptides include not only the 21 proteinogenic amino acids, but also derivatives and unusual amino acids; other compounds such as carbohydrates or lipids can also be incorporated. Norine currently references 526 different monomers occurring in the listed peptides. The monomeric structures are encoded as undirected labelled graphs, with nodes representing monomers and edges corresponding to the chemical bonds between them. One monomer can form more than two peptide bonds, and non-peptide bonds are also observed in NRPs, leading to peptides with cycles and/or branches. The database can be queried for peptides through their annotations as well as through their monomeric structures. It also contains a section dedicated to the monomers incorporated into the peptides stored in Norine.
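The graph encoding described above can be illustrated with a minimal Python sketch. It is not Norine's actual data model: the class name `MonomericStructure`, its methods, the identifier `NOR99999` and the monomer labels `A` to `D` are hypothetical and chosen only for illustration. The sketch checks the identifier format (NOR followed by five digits), stores monomers as labelled nodes and bonds as undirected edges, and detects cycles and branch points.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Identifier format as described in the text: "NOR" followed by five digits.
NORINE_ID = re.compile(r"NOR\d{5}")


@dataclass
class MonomericStructure:
    """Hypothetical container: undirected labelled graph of monomers."""

    norine_id: str
    monomers: list[str]  # node labels, addressed by position
    bonds: set[frozenset[int]] = field(default_factory=set)  # undirected edges

    def __post_init__(self) -> None:
        if not NORINE_ID.fullmatch(self.norine_id):
            raise ValueError(f"malformed identifier: {self.norine_id!r}")

    def add_bond(self, i: int, j: int) -> None:
        """Add an undirected bond between the monomers at positions i and j."""
        n = len(self.monomers)
        if i == j or not (0 <= i < n and 0 <= j < n):
            raise ValueError(f"invalid bond ({i}, {j})")
        self.bonds.add(frozenset((i, j)))

    def degree(self, i: int) -> int:
        """Number of bonds involving monomer i (more than 2 means a branch point)."""
        return sum(1 for bond in self.bonds if i in bond)

    def has_cycle(self) -> bool:
        """Detect a cycle with union-find: an edge joining two nodes that are
        already connected closes a cycle."""
        parent = list(range(len(self.monomers)))

        def find(x: int) -> int:
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for bond in self.bonds:
            i, j = bond
            root_i, root_j = find(i), find(j)
            if root_i == root_j:
                return True
            parent[root_i] = root_j
        return False


# Hypothetical peptide with placeholder monomer labels (not a real Norine entry).
peptide = MonomericStructure("NOR99999", ["A", "B", "C", "D"])
for i, j in [(0, 1), (1, 2), (2, 3), (3, 1)]:
    peptide.add_bond(i, j)
print(peptide.has_cycle(), peptide.degree(1))
```

In this example, the bond between positions 3 and 1 closes a ring, and the monomer at position 1 carries three bonds, which corresponds to a peptide that is both cyclic and branched.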
Owing to their particular mode of synthesis, nonribosomal peptides, which are produced by microbial cells (typically bacteria and fungi), display a wide range of structures and biological activities. NRPs may represent novel drugs for several pharmaceutical areas, including antibiotics (penicillin and cephalosporin, whose precursor is ACV, NOR00006), antitumor agents (actinomycin D, NOR00228), and immunosuppressive agents (cyclosporin A, NOR00033). They can also be exploited in biotechnological applications, for example as biosurfactants. Their varied and interesting biological activities mostly come from their original mode of synthesis, which offers great flexibility by allowing non-proteinogenic monomers, cycles and branching.
As they are small and exploited in pharmacology and biotechnology, nonribosomal peptides are usually represented by atomic structures and stored in chemical compound databases. Classical chemoinformatics tools are applied to them as part of generalist chemical databases to predict their activity or to perform structure searches or comparisons. Norine contains a few links to structural conformation databases such as PDB (25 NRPs). However, this data set is too small to be exploited for NRP comparison or activity prediction.
According to the similar property principle, structurally similar compounds are expected to exhibit similar properties and similar biological activities. This principle is exploited for in silico drug discovery. Chemical compounds are virtually screened either by docking into the active site of interest or by virtue of their similarity to a known active compound. Many studies suggest that knowledge about a target obtained from known bioactive ligands is as valuable as knowledge of the target structure for identifying novel bioactive scaffolds through virtual screening [15, 16].
However, NRPs exhibit specificities in comparison to typical synthetic compounds (synthesis pathway, complex structures). Published numerical representations for chemical compounds, such as fingerprints, may therefore not be the optimal choice to represent NRPs. Our monomeric approach opens new ways to analyze them. As first observations showing that some monomers are specific to a given activity were promising, we decided to further investigate the relationship between the monomeric structures of NRPs and their activity.
In this paper, a new fingerprint based on the monomeric composition of NRPs is introduced. The monomer composition fingerprint (MCFP) is a new method for obtaining a representative description of NRP structures from their monomer composition in fingerprint form. In this work, we present experiments that show the usefulness of the monomer composition fingerprint when used for similarity searching and activity prediction of NRPs.
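To make the idea of a composition-based fingerprint concrete, the following Python sketch shows one possible reading of it. It is an assumption, not the MCFP definition used in this work: the fingerprint is taken to be a vector of monomer counts over a fixed vocabulary, and similarity is computed with the min/max generalisation of the Tanimoto (Jaccard) coefficient to counts. The function names, the four-monomer vocabulary and the example peptides are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def composition_fingerprint(monomers: Sequence[str],
                            vocabulary: Sequence[str]) -> List[int]:
    """Count how often each vocabulary monomer occurs in the peptide.

    Assumption: monomers absent from the vocabulary are ignored.
    """
    counts = Counter(monomers)
    return [counts.get(monomer, 0) for monomer in vocabulary]


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Min/max generalisation of the Tanimoto coefficient to count vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(max(x, y) for x, y in zip(a, b))
    # Convention chosen here: two all-zero fingerprints are treated as identical.
    return shared / total if total else 1.0


def rank_by_similarity(query: Sequence[int],
                       library: Dict[str, Sequence[int]]) -> List[Tuple[str, float]]:
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical vocabulary and peptides, for illustration only.
vocabulary = ["Ala", "Gly", "Leu", "Val"]
query = composition_fingerprint(["Leu", "Leu", "Val", "Ala"], vocabulary)
library = {
    "peptide_1": composition_fingerprint(["Leu", "Val", "Gly"], vocabulary),
    "peptide_2": composition_fingerprint(["Leu", "Leu", "Val", "Ala"], vocabulary),
}
print(rank_by_similarity(query, library))
```

A count vector of this kind ignores how the monomers are bonded, so two peptides with the same composition but different topologies receive identical fingerprints; whether that loss of information matters for activity prediction is exactly the kind of question that experiments have to answer.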