1 Introduction

Similarity of molecular structures is highly relevant to various realms of science such as toxicology, ecotoxicology, and pharmacology. The basic paradigm of quantitative structure-property/activity relationships (QSPR/QSAR) is that compounds with similar structure have similar properties. This implies smooth behavior in the relation between structure and property/activity, i.e., for any small change in the structure, the magnitude of a physico-chemical property or biological activity changes smoothly rather than abruptly, in an all-or-none fashion. Recently, much attention has been paid to so-called “activity cliffs” [1], which imply that this assumption of smooth behavior may not always be appropriate. However, we simply point out here that exactly such discontinuous behavior would be expected to result from descriptors that do not encode the relevant molecular properties adequately. This behavior is, for instance, well known when trying to define geometrical reaction coordinates for complex reactions. The concept of molecular similarity is always complicated by ambiguities in the representation of molecular structure and in the definition of similarity. In the computation of molecular similarity, a large number of mathematical functions can be used to derive measures of similarity for a pair of molecules starting from the same set of structural descriptors [2, 3]. Quantitative molecular similarity analysis (QMSA) uses global molecular descriptors, such as 2D-fingerprints [4] or topological indices [5], to identify molecular similarity for property prediction and for environmental risk assessment. However, similarity measures that concentrate on the relevant regions of molecules, as in established 3D QSAR methods like comparative molecular field analysis (CoMFA) [6], are often more appropriate. These methods rely on spatially resolved, rather than global, molecular similarity. 
They are potentially extremely powerful, but suffer from the disadvantage that they depend on the molecular conformation used, so that extensive conformational searches are necessary. This problem has been treated using conformationally weighted descriptors [7], which implies a dynamic description of flexible molecules. Methods are currently being developed [8] to use 3D information efficiently in drug design. A promising class of spatially resolved methods in computational pharmacology is derived from quantum-mechanical descriptions of molecules and of their potential interactions with the environment [9–12].

Recently, we have proposed intensity distribution moments as molecular descriptors [13–15]. They are related to the shapes of molecular spectra. We have shown that these descriptors can distinguish nitriles from amides [16].

In the present work, we consider another kind of distribution moments—spectral density distribution moments. We demonstrate the usefulness of these descriptors using an example of the infrared spectra of 76 chloronaphthalenes.

Chloronaphthalenes have been used in industry for many years. They are toxic and of considerable interest in environmental studies [17]. The IR spectrum of naphthalene, like the spectra of other polycyclic aromatic hydrocarbons, is also of great interest in astrophysics [18, 19].

We show that the new descriptors correctly represent the molecular structure—they clearly identify the number of chlorine atoms in the compounds. Moreover, we use statistical analysis as a tool for checking the correctness of the calculated spectra. We have used similar methodologies in astrophysics [20, 21]. We have shown that using methods of statistical spectroscopy it is possible to check the correctness of the existing classification of the stellar spectra. This methodology seems to be universal, and can be applied to many kinds of problems involving different classification schemes, as for example distinguishing between chaotic and periodic motions [22] or the classification of stars. We have also used this methodology to study the similarity of DNA sequences [23].

2 Theory

Let us consider a discrete frequency spectrum \(\nu _1, \nu _2, \ldots \nu _D\). A common approach in statistical spectroscopy is to describe this spectrum by the following distribution:

$$\begin{aligned} \rho (\nu )=\frac{1}{D}\sum _{i=1}^{D} \delta (\nu - \nu _i), \end{aligned}$$

where \(\delta (\nu - \nu _i)\) is the Dirac delta function. The function \(\rho (\nu )\) represents the density of the frequencies. It is called the spectral density distribution and is normalized to unity:

$$\begin{aligned} \int \limits _{-\infty }^{\infty } \rho (\nu ) d\nu =1. \end{aligned}$$

The aim of this work is to introduce spectral density distribution moments \(M_{\rho , q}, M^{\prime }_{\rho , q}, M^{\prime \prime }_{\rho ,q}\) as molecular descriptors. The q-th moment of \(\rho (\nu )\) is defined as

$$\begin{aligned} M_{\rho , q}=\int \limits _{-\infty }^{\infty } \nu ^q \rho (\nu ) d\nu . \end{aligned}$$

Using Eq. 1, the q-th spectral density distribution moment reads

$$\begin{aligned} M_{\rho , q}=\frac{1}{D}\sum \limits _{i=1}^D \nu _i^q. \end{aligned}$$
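For completeness, the intermediate step can be written out. Substituting the sum of delta functions into the integral definition of the moment and using the sifting property of the delta function, \(\int \nu ^q \delta (\nu -\nu _i) d\nu =\nu _i^q\), gives

$$\begin{aligned} M_{\rho , q}=\int \limits _{-\infty }^{\infty } \nu ^q \, \frac{1}{D}\sum \limits _{i=1}^{D} \delta (\nu - \nu _i)\, d\nu =\frac{1}{D}\sum \limits _{i=1}^{D} \int \limits _{-\infty }^{\infty } \nu ^q \delta (\nu - \nu _i)\, d\nu =\frac{1}{D}\sum \limits _{i=1}^D \nu _i^q. \end{aligned}$$

For \(q=0\) this reduces to \(M_{\rho ,0}=1\), which is the normalization condition stated above.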

The corresponding q-th spectral density scaled moments are

$$\begin{aligned} M^{\prime }_{\rho , q}&= \frac{1}{D}\sum \limits _{i=1}^D ( \nu _i - M_{\rho ,1}) ^q,\end{aligned}$$
$$\begin{aligned} M^{\prime \prime }_{\rho , q}&= \frac{1}{D}\sum \limits _{i=1}^D \left[\frac{(\nu _i-M_{\rho ,1})}{\sqrt{M_{\rho ,2}-(M_{\rho ,1})^2}}\right]^q. \end{aligned}$$

In the present work, we construct spectral density distributions \(\rho (\nu )\) from the frequencies \(\nu _i\) of the infrared (IR) spectra. The moments of these distributions supply information about the locations of \(\nu _i\) but not about the intensities. They describe various properties of the distribution. In particular, \(M_{\rho , 1}\) is the mean frequency, \(M^{\prime }_{\rho ,2}\) describes the width, \(M^{\prime \prime }_{\rho , 3}\) the asymmetry, and \(M^{\prime \prime }_{\rho ,4}\) the excess of this distribution. Higher-order moments do not have direct geometrical equivalents.
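The three families of moments defined above can be evaluated directly from a list of frequencies. The following is a minimal sketch, not the code used in this work; the function name `spectral_density_moments` is hypothetical. The standard deviation is computed from the centered second moment, which is algebraically equal to \(\sqrt{M_{\rho ,2}-(M_{\rho ,1})^2}\) but less prone to round-off.

```python
import math


def spectral_density_moments(frequencies, q):
    """Return (M_q, M'_q, M''_q) for a discrete frequency spectrum.

    M_q   : raw moment,                  (1/D) * sum(nu_i ** q)
    M'_q  : moment about the mean,       (1/D) * sum((nu_i - M_1) ** q)
    M''_q : standardized moment,         (1/D) * sum(((nu_i - M_1) / sigma) ** q)
    where sigma = sqrt(M_2 - M_1 ** 2).
    """
    d = len(frequencies)
    if d == 0:
        raise ValueError("at least one frequency is required")

    mean = sum(frequencies) / d
    raw = sum(v ** q for v in frequencies) / d
    centered = sum((v - mean) ** q for v in frequencies) / d

    # Centered form of the variance; equals M_2 - M_1**2.
    variance = sum((v - mean) ** 2 for v in frequencies) / d
    if variance == 0:
        raise ValueError("all frequencies are identical; M'' is undefined")
    sigma = math.sqrt(variance)
    standardized = sum(((v - mean) / sigma) ** q for v in frequencies) / d

    return raw, centered, standardized
```

Calling the function with \(q=1\) returns the mean frequency as its first element, and with \(q=2\) the width \(M^{\prime }_{\rho ,2}\) as its second element. The intensities never enter, in line with the remark that these moments depend only on the locations of \(\nu _i\).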

We show that spectral density distribution moments also provide information about the molecular structure. For this purpose we construct classification diagrams (see the subsequent section).

3 Results and discussion

We perform the calculations for \(76\) compounds: chloronaphthalenes containing from zero through eight chlorine atoms. They are listed in Table 1, where \(r=0,1,\ldots 75\) are the labels of the compounds. Several examples of the compounds are shown in Fig. 1.
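The labels \(r\) can be mapped to the number of chlorine atoms \(s\) if one assumes that Table 1 lists the compounds in order of increasing \(s\), which is consistent with the labels quoted in the text (for example \(r=27,\ldots ,48\) for four chlorine atoms). The sketch below rests on that assumption and on the standard numbers of positional isomers of chloronaphthalenes (1, 2, 10, 14, 22, 14, 10, 2, 1 for \(s=0,\ldots ,8\), which sum to 76). The names `ISOMER_COUNTS` and `chlorine_count` are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

# Number of positional isomers for s = 0, 1, ..., 8 chlorine atoms.
ISOMER_COUNTS = [1, 2, 10, 14, 22, 14, 10, 2, 1]

# Exclusive upper bound of the label range for each s:
# [1, 3, 13, 27, 49, 63, 73, 75, 76]
_UPPER_BOUNDS = list(accumulate(ISOMER_COUNTS))


def chlorine_count(r):
    """Return the number of chlorine atoms s for compound label r.

    Assumes the compounds are ordered by increasing s (an assumption
    about Table 1, not stated explicitly in the text).
    """
    if not 0 <= r < _UPPER_BOUNDS[-1]:
        raise ValueError("label r must be in the range 0..75")
    return bisect_right(_UPPER_BOUNDS, r)
```

Under this assumption the order of isomers within a group of equal \(s\) does not matter for the mapping.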

Table 1 Compounds
Fig. 1 Compounds: a \(r=1\); b \(r=13\); c \(r=49\); d \(r=73\)

We study spectral density distributions of the frequencies of the IR spectra of the chloronaphthalenes.

Fig. 2 Classification diagrams for 76 compounds based on spectral density distribution moments

The vibrational spectra were obtained from systematic high-level DFT (density functional theory) calculations. The hybrid B3LYP functional and the 6-311++G** basis set were used as implemented in the Gaussian 03 code [24]. A careful geometry optimization was performed prior to the calculation of the vibrational spectrum of each compound. In some cases an initial guess of a planar chloronaphthalene geometry led to transition states or saddle points; to avoid these artefacts, lower-symmetry, non-planar initial guess geometries were used. Thus, no calculated minima had imaginary frequencies.

Figure 2 shows classification diagrams based on the descriptors defined in Eqs. 4, 5, 6. The descriptors representing compounds with the same number of chlorine atoms are denoted by the same symbols in the plots. The number of chlorine atoms is written next to the symbols representing the corresponding descriptors. For example, the descriptors representing the 22 compounds with four chlorine atoms (\(r=27, 28, \ldots 48\)) are denoted by squares. All the squares in the figures are clustered: they lie nearly at a single point of the diagram. We observe that all the descriptors representing compounds with the same number of chlorine atoms are located in the same parts of the classification diagrams. This kind of behavior is observed in all the cases: \(M_{\rho ,1}-M^{\prime }_{\rho ,2}\) (Fig. 2a), \(M^{\prime \prime }_{\rho ,3}-M^{\prime \prime }_{\rho ,4}\) (Fig. 2b), \(M_{\rho ,1}-M^{\prime \prime }_{\rho ,3}\) (Fig. 2c), \(M^{\prime \prime }_{\rho ,5}-M^{\prime \prime }_{\rho ,6}\) (Fig. 2d), \(M^{\prime \prime }_{\rho ,7}-M^{\prime \prime }_{\rho ,8}\) (Fig. 2e), \(M^{\prime \prime }_{\rho ,5}-M^{\prime \prime }_{\rho ,7}\) (Fig. 2f).
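The clustering seen in a classification diagram can also be quantified. The sketch below is one possible way to do so and is not part of the analysis reported here: for each number of chlorine atoms it computes the centroid of the points and the largest distance of a member from that centroid, and then checks whether the resulting discs overlap. The function names are hypothetical. Because the two axes of a diagram generally carry different units, the coordinates are assumed to have been rescaled to comparable ranges beforehand.

```python
import math
from collections import defaultdict


def cluster_summary(points, chlorine_counts):
    """Group 2D descriptor points by chlorine count.

    Returns {s: ((centroid_x, centroid_y), radius)}, where radius is the
    largest distance of any member of the group from its centroid.
    """
    groups = defaultdict(list)
    for point, s in zip(points, chlorine_counts):
        groups[s].append(point)

    summary = {}
    for s, members in groups.items():
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        radius = max(math.hypot(x - cx, y - cy) for x, y in members)
        summary[s] = ((cx, cy), radius)
    return summary


def clusters_are_disjoint(summary):
    """True if no two cluster discs (centroid plus radius) touch or overlap."""
    keys = sorted(summary)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            (ax, ay), ra = summary[a]
            (bx, by), rb = summary[b]
            if math.hypot(ax - bx, ay - by) <= ra + rb:
                return False
    return True
```

Disjoint discs are a sufficient but not a necessary condition for visually separated clusters, so a `False` result only indicates that a closer look at the diagram is needed.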

Fig. 3 Spectral density distribution moments for 76 compounds

Figure 3 shows the first four moments of the spectral density distributions for all 76 compounds. In this figure, one can recognize particular descriptors corresponding to particular compounds numbered by \(r\) (see Table 1) on the horizontal axis. The behavior of these descriptors with the number of chlorine atoms is very regular. The average frequency (\(M_{\rho ,1}\)) decreases when the number of chlorine atoms increases. The width of the spectral density distribution (\(M^{\prime }_{\rho ,2}\)) also decreases when the number of chlorine atoms increases. The reverse trend is observed for the asymmetry coefficient (\(M^{\prime \prime }_{\rho ,3}\)) and for the excess (\(M^{\prime \prime }_{\rho ,4}\)), which increase when the number of chlorine atoms increases. The case of the molecule with eight chlorine atoms is different and the values of its descriptors are the smallest.

Fig. 4 IR spectra: a \(r=0\); b \(r=1\); c \(r=3\); d \(r=27\); e \(r=73\); f \(r=75\)

Figure 4 shows spectra of 6 compounds: (a) naphthalene \(r=0\), (b) 1 chlorine atom \(r=1\), (c) 2 chlorine atoms \(r=3\), (d) 4 chlorine atoms \(r=27\), (e) 7 chlorine atoms \(r=73\), (f) 8 chlorine atoms \(r=75\).

In [19] and in [25] one may find, respectively, the assignment of the naphthalene vibrational spectrum and an analysis of the IR spectrum of chloronaphthalene. The high frequency oscillations (\(>3000\,\text{ cm}^{-1}\)) correspond to C–H stretching vibrations. In chlorine substituted compounds these frequencies are gradually replaced by much lower frequency C–Cl stretching modes. This simple fact explains the shift of \(M_{\rho , 1}\) towards lower values as the number of chlorine atoms increases. The out-of-plane (oop) C–H vibrations, located in naphthalene in the region of 786–980 cm\(^{-1}\) (see for example the assignments in [26]), also gradually disappear as the number of chlorine substituents in the molecule increases. The oop modes for C–Cl fragments have quite low frequencies (around 100 cm\(^{-1}\), see Fig. 4) and thus lower \(M_{\rho , 1}\) values are expected.
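The size of this effect can be estimated from the definition of the first moment. If a single mode of frequency \(\nu _a\) is replaced by a mode of frequency \(\nu _b\) while all other frequencies and the number of modes \(D\) stay fixed, the mean frequency changes by \((\nu _b-\nu _a)/D\). As a rough illustration only, assume that all \(3N-6=48\) normal modes of the 18-atom molecule enter the distribution, and take the round values \(\nu _a=3100\,\text{cm}^{-1}\) for a C–H stretch and \(\nu _b=700\,\text{cm}^{-1}\) for a C–Cl stretch; these numbers are illustrative and are not taken from the calculated spectra:

$$\begin{aligned} \Delta M_{\rho ,1}=\frac{\nu _b-\nu _a}{D}, \qquad \text{e.g.}\quad \frac{(700-3100)\,\text{cm}^{-1}}{48}=-50\,\text{cm}^{-1}. \end{aligned}$$

Under these assumptions each substitution lowers the mean frequency by a few tens of cm\(^{-1}\) through the stretching mode alone, before the shifts of the remaining modes are taken into account.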

Fig. 5 Spectral density distribution moments for 76 compounds calculated using spectra with some errors for \(r=27, 28, \ldots 75\)

These two facts (the replacement of high frequency C–H stretching modes by low frequency C–Cl stretching modes and of oop C–H modes by oop C–Cl modes) are also related to the shift of \(M^{\prime }_{\rho ,2}\) to lower values when the number of chlorine atoms increases. The dispersion of the frequencies becomes smaller for the spectra of compounds with a larger number of chlorine atoms.

The unexpected order of the \(M^{\prime \prime }_{\rho ,3}\) descriptors [\(M^{\prime \prime }_{\rho ,3}\)(7 Cl) \(<\) \(M^{\prime \prime }_{\rho ,3}\)(6 Cl)] may be explained by the fact that for 7 Cl compounds only two non-equivalent systems are expected, namely those in which the only hydrogen atom is located either in position (1) (alpha) or in position (2) (beta). Thus, the symmetry of this distribution is relatively high. On the other hand, in 6 Cl derivatives there are 10 different combinations of 2 hydrogen atoms and these spectra have a less symmetric distribution of frequencies.

It appears that the values of the moments calculated for correct spectra, for a given number of chlorine atoms, agree to several significant figures irrespective of the positions of these atoms. However, this is not the case if the spectra contain errors. Though we still do not have a formal proof of this property, it may be used for the detection of errors in the calculated spectra. This may be seen by comparing Figs. 3 and 5. In Fig. 3 all the spectra are correct. In Fig. 5 the spectra are correct for \(s<4\), i.e., \(r=0,1,\ldots ,26\), and contain errors for \(s\ge 4\), i.e., \(r=27,28,\ldots ,75\), where \(s\) is the number of chlorine atoms in the compounds. In particular, irregular oscillations may be seen for \(M_{\rho ,1}\) and for \(M^{\prime \prime }_{\rho ,3}\) in the top panels of Fig. 5.
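This observation suggests a simple consistency check, sketched below under stated assumptions. It is an illustration of the idea and not the procedure used to produce Fig. 5. For one chosen moment, the values are grouped by the number of chlorine atoms and any compound whose value differs from the group median by more than a tolerance is flagged. The function name `flag_suspect_spectra` and the tolerance `abs_tol` are hypothetical; a suitable tolerance has to be chosen per moment, since the text only states agreement to several significant figures.

```python
from collections import defaultdict
from statistics import median


def flag_suspect_spectra(values, chlorine_counts, abs_tol):
    """Return the labels r whose moment deviates from its group median.

    values[r]          : one moment (for example M_1) of compound r
    chlorine_counts[r] : number of chlorine atoms of compound r
    abs_tol            : largest accepted absolute deviation from the
                         median of the group with the same chlorine count
    """
    groups = defaultdict(list)
    for r, (value, s) in enumerate(zip(values, chlorine_counts)):
        groups[s].append((r, value))

    suspects = []
    for members in groups.values():
        reference = median(value for _, value in members)
        suspects.extend(r for r, value in members
                        if abs(value - reference) > abs_tol)
    return sorted(suspects)
```

The check is only informative for groups with at least three members. A group with a single compound (zero or eight chlorine atoms) is never flagged, and in a group of two the median lies midway between the values, so both members are flagged or neither is.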

Summarizing, spectral density distribution moments clearly identify the number of chlorine atoms in the molecules. They can be used as a tool for checking the correctness of the spectra which are used for their creation.