An algorithm to identify functional groups in organic molecules
The concept of functional groups forms a basis of organic chemistry, medicinal chemistry, toxicity assessment, spectroscopy and also chemical nomenclature. All current software systems to identify functional groups are based on a predefined list of substructures. We are not aware of any program that can identify all functional groups in a molecule automatically. The algorithm presented in this article is an attempt to solve this scientific challenge.
An algorithm to identify functional groups in a molecule based on iterative marching through its atoms is described. The procedure is illustrated by extracting functional groups from the bioactive portion of the ChEMBL database, resulting in identification of 3080 unique functional groups.
A new algorithm to identify all functional groups in organic molecules is presented. The algorithm is relatively simple and full details with examples are provided, therefore implementation in any cheminformatics toolkit should be relatively easy. The new method allows the analysis of functional groups in large chemical databases in a way that was not possible using previous approaches.
KeywordsFunctional group Chemical functionality Organic chemistry Medicinal chemistry
The concept of functional groups (FGs)—sets of connected atoms that determine properties and reactivity of parent molecule, forms a cornerstone of organic chemistry, medicinal chemistry, toxicity assessment, spectroscopy and, last but not least, also chemical nomenclature. The study of common FGs forms substantial part of basic organic chemistry curriculum. Numerous scientific papers and books focus on properties and reactivity of various FGs. A well known example is the classical book series “Chemistry of functional groups” describing various classes of organic molecules  consisting of over 100 volumes. There is, however, surprisingly little attention paid to the study of functional groups from the cheminformatics point of view. The majority of theoretical studies are utilizing FGs as a basis of chemical ontologies, where FGs are “keys” that are used to hierarchically classify molecules into categories . An example of this type of publications is work by Bobach et al. describing a rule-based definition of chemical classes to classify compounds into classes  or the ClassyFire software  developed in the Wishart’s group allowing chemists to perform large-scale automated chemical classification based on a structure-based chemical taxonomy consisting of over 4800 categories.
Various substructure features are often used in cheminformatics in connection with machine learning to develop models to predict biological activity or properties of molecules . In this approach the substructure descriptors are generated by extracting groups of atoms from a molecule using a predefined algorithm. Examples of such descriptors are linear or atom centered fragments, topological torsions, pharmacophoric triplets and many others. Although such fragment descriptors are very useful, they do not provide description of functional groups. The fragments are generally strongly overlapping and are generated for all parts of a molecule without considering their potential chemical role.
One of the first software tools to identify FGs was the checkmol program written by Haider  that was able to identify about 200 FGs. Recently an extended version of the program containing 583 manually curated functionalities encoded as SMARTS was published . This list includes also numerous heterocyclic rings and general structural patterns (i.e. 5-membered aromatic ring with 1 heteroatom). These substructure features are used to develop QSAR models for prediction of toxicity and various molecular physicochemical properties. The well-known ZINC database and related web-based software suite  stores about 500, so called, chemical patterns, that speed-up substructure searches and allow estimation of molecule reactivity. The patterns include PAINS filters  that identify frequent hitters interfering with biochemical screens as well as some other substructures. Another widely used set of substructures used to identify potentially reactive or promiscuous molecule has been defined by Eli Lilly scientists based on their experience with internal screening campaigns . Recently a set of generic chemical functionalities called ToxPrint chemotypes that describe molecule substructure and reaction features and atom and bond properties was defined within the ToxPrint program . The main goal of the tool is to use these features in toxicity modelling.
We are not aware of any software system able to identify FGs that is not based on manually curated set of substructure features, but instead automatically identifies all functional groups in a molecule. The algorithm presented in this article is an attempt to solve this scientific challenge.
Identification and extraction of functional groups
The majority of FGs contain heteroatoms. Therefore our approach is based on processing heteroatoms and their environment with the addition of some other functionalities, like multiple carbon–carbon bonds.
mark all heteroatoms in a molecule, including halogens
- 2.mark also the following carbon atoms:
atoms connected by non-aromatic double or triple bond to any heteroatom
atoms in nonaromatic carbon–carbon double or triple bonds
acetal carbons, i.e. sp3 carbons connected to two or more oxygens, nitrogens or sulfurs; these O, N or S atoms must have only single bonds
all atoms in oxirane, aziridine and thiirane rings (such rings are traditionally considered to be functional groups due to their high reactivity).
merge all connected marked atoms to a single FG
extract FGs also with connected unmarked carbon atoms, these carbon atoms are not part of the FG itself, but form its environment.
The algorithm described above iterates only through non-aromatic atoms. Aromatic heteroatoms are collected as single atoms, not as part of a larger system. They are extended to a larger FG only when there is an aliphatic functionality connected (for example an acyl group connected to a pyrrole nitrogen). Heteroatoms in heterocycles are traditionally not considered to be “classical” FGs by themselves but simply to be part of the whole heterocyclic ring. The rationale for such treatment is enormous diversity of heterocyclic systems. For example in our previous study  nearly 600,000 different heterocycles consisting of 1–3 fused 5- and 6- membered rings were enumerated.
After marking all atoms that are part of FGs as described above, the identified FGs are extracted together also with their environment—i.e. connected carbon atoms, when the type of carbon (aliphatic or aromatic) is also preserved.
We do not claim that this algorithm provides an ultimate definition of FGs. Every medicinal chemist has probably a slightly different understanding about what a FG is. In particular the definition of activated sp3 carbons may create some discussion. In the present algorithm we restricted our definition only to classical acetal, thioacetal or aminal centers (i.e. sp3 carbons having at least 2 oxygens, sulfurs or nitrogens as neighbors) and did not consider other similar systems, i.e. alpha-substituted carbonyls or carbons connected to S=O or similar bonds. During the program development phase various such options have been tested, and this “strict” definition provided the most satisfactory results. Extension of FGs also to alpha-substituted carbonyls (i.e. heteroatom or halogen in alpha position to carbonyl) and similar systems more than triple the number of FGs identified, generating many large and rare FGs. Since our major interest was in comparing various molecular datasets and not in reactivity estimation we implemented this strict definition of acetal carbons. To assess the possible reactivity of molecules, various substructures filters are available, as for example already mentioned PAINS  or Eli Lilly rules .
Generalization of functional groups
environments on carbon atoms are deleted, the only exception are substituents on carbonyl that are retained (to distinguish between aldehydes and ketones)
- 2.all free valences on heteroatoms are filled by the “R atoms” (this atom may represent hydrogen or carbon) with exception of:
hydrogens on the –OH groups
hydrogens on the simple amines and thiols (i.e. FGs with just single central N or S atom) are not replaced, this allows to distinguish secondary and tertiary amines, and thiols and sulfides.
all remaining environment carbons (on heteroatoms and carbonyls) are replaced by the “R atoms”; exceptions are environments on single atomic N or O FGs with one carbon connected, where this carbon is retained also with its type (aliphatic or aromatic), this allows to distinguish between amines and anilines, and alcohols and phenols.
This scheme provides a good balance between preserving sufficient, chemically meaningful details on one side and generalization on the other side. Examples of generalized FGs created by this procedure are shown in the following section.
Results and discussion
The algorithm outlined above was implemented using the Novartis in-house cheminformatics system written in Java and was tested by identifying and collecting FGs for bioactive portion of the ChEMBL database , consisting of ~483,000 molecules with activity below 10 um on any ChEMBL target. Organometallic structures were discarded and all molecules were standardized by removing counterions and neutralizing atomic charges.
Comparison of frequency (in %) of FGs identified by checkmol  and by the presented algorithm
Alkyl aryl ether
Sec. aliphatic/aromatic amine
Tert. aliphatic amine
Sec. aliphatic amine
Prim. aromatic amine
Sec. aromatic amine
Prim. aliphatic amine
Tert. aliphatic/aromatic amine
Carboxlic acid ester
In summary, the agreement between the both approaches is very good. This shows that systems based on a well selected list of substructures (as apparently checkmol is) can provide useful information about the FG composition of general molecular datasets. The advantage of our algorithm of automatic identification of FGs is its ability to handle more complex FGs that are often present in specialized collections of molecules. Examples identified by this simple analysis are activated alkenes (that act as a Michael acceptors and are often used as covalent binders). Another advantage of the algorithmic approach over predefined set of substructures is more fine graining of the results (like the detailed classification of ureas mentioned previously, that may be important for analyzing properties like solubility) and the easy possibility to fine-tune the algorithm to fit the needs of a particular analysis (for example library diversity or QSAR studies).
A new algorithm to identify all functional groups in organic molecules is presented. The algorithm is relatively simple and full details with examples are provided, therefore implementation in any cheminformatics toolkit should be relatively easy. The author is willing to provide help in any such endeavor. The new method allows to analyze FGs in large chemical databases in a way that was not possible using previous approaches. Several such studies focusing on differences in distribution of FGs between bioactive and “average” molecules as well as identification of functionalities typical for different classes of natural products are underway.
The author wants to thank Dr. Richard Lewis for discussing the manuscript and for helpful comments.
Peter Ertl www.peter-ertl.com.
The author declares that he has no competing interests.
I performed work on this manuscript as an employee of Novartis. No external funding was received.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 1.Patai’s Chemistry of Functional Groups. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9780470682531
- 6.Haider N, The checkmol/matchmol homepage. http://merian.pch.univie.ac.at/~nhaider/cheminf/cmmm.html
- 9.Baell JB, Holloway GA (2010) New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays J Med. Chem 53:2719–2740Google Scholar
- 11.Yang C, Tarkhov A, Marusczyk J, Bienfait B, Gasteiger J, Kleinoeder T, Magdziarz T, Sacher O, Schwab CH, Schwoebel J et al (2015) New publicly available chemical query language, CSRML, to support chemotype representations for application to data mining and modeling. J Chem Inf Model 55:510–528CrossRefGoogle Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.