Introduction

Nowadays, the majority of environmental studies focus on a group of chemicals called persistent organic pollutants (POPs), which pose a vast range of threats to human health and natural ecosystems. Due to their high lipophilicity and resistance to naturally occurring degradation processes, they are prone to bioaccumulate in human and animal tissues and to biomagnify in food chains [1]. Moreover, after entering the organism, they can induce a variety of toxic effects, including cancer, allergies and hypersensitivity, damage to the central and peripheral nervous systems, reproductive disorders, and disruption of the immune system [2, 3]. Therefore, according to the Stockholm Convention, the emission of POPs to the environment needs to be eliminated or reduced [1].

Typical representatives of POPs are chloronaphthalenes (CNs). This group consists of 75 congeners: chemicals that are based on the same skeleton (naphthalene) but differ in the number of chlorine atoms or in the substitution pattern [4]. Although the synthesis of CNs has been formally abandoned, some commercial products containing CNs (e.g., insulating materials, rubber belts) are still available [5]. Moreover, chloronaphthalenes are released to the environment during thermal-related processes (e.g., industrial waste incineration and domestic heating) [6], which are, in fact, assumed to be currently the main source of CNs in the environment [7]. Since the emission of CNs to the atmosphere, estimated for European countries alone, is still high, equal to 1.03 tons per year [8], there is an urgent need to perform comprehensive risk studies of these pollutants.

The main factors influencing the environmental behavior of CNs are their overall environmental persistence, mobility, and (eco)toxicity. The first two characteristics cannot be measured directly. They are usually determined by employing multimedia mass-balance (MM) models, and every MM model requires a set of physicochemical parameters (e.g., partition coefficients, half-lives, enthalpies of phase transfer, vapor pressure) as input data. These parameters can be obtained empirically. However, the high cost of experiments and the time required to perform them for large arrays of chemicals motivate the scientific community to search for alternative, non-experimental ways of obtaining the missing parameters.

Nowadays, the significance of a computational technique known as quantitative structure–property relationship (QSPR) modeling and its various applications in chemical risk assessment have been highlighted by many international organizations and regulations (e.g., REACH in Europe) [9]. This approach is based on the assumption that the physicochemical properties of chemical compounds are functions of so-called molecular descriptors, which represent structural features of the molecules. Thus, based on experimental data available for even a few compounds, it is possible to develop a mathematical equation describing the correlation(s) between their molecular structures and properties and, on this basis, to predict the missing information for other, structurally similar molecules [10].

There are many examples of successful applications of the QSPR approach for predicting environmentally relevant properties of CNs [7, 11, 12]. However, there is still a need to search for novel structural descriptors that would more appropriately express the molecular variance within particular groups of structurally similar POP congeners.

Intensity distribution moments, recently proposed by us as new molecular descriptors [13–15], have proved to be an efficient tool for identifying specific groups of molecules. For example, using these descriptors, one can distinguish nitriles from amides [16]. The general methodology used in this study, statistical spectroscopy, is known in many different areas of science. The basic quantities, the distribution moments, may be derived from atomic or molecular spectra. We have already applied similar methods of statistical spectroscopy in studies of stellar spectra [17, 18], in analyzing the properties of chaotic dynamical systems [19], and in bioinformatics [20]. Here, we continue investigating the usefulness of different kinds of moments as molecular descriptors, this time examining the spectral density distribution moments. In the present study, the moments are obtained from the frequencies (rather than from the intensities, as was done before [16]) of the infrared (IR) spectra of CNs. In general, however, statistical distributions may be created from any function describing the system under consideration. The new descriptors are applied to develop a QSPR model that predicts the logarithmic values of the subcooled liquid vapor pressure at 25 °C.

Theory

Let us consider a discrete frequency spectrum \(\nu_1, \nu_2, \ldots, \nu_D\) treated as a statistical ensemble. The density of the frequencies (spectral density distribution) is defined as:

$$ \rho(\nu)=\frac{1}{D}\sum _{{\it i}=1}^{D} \delta (\nu - \nu_{\it i}). $$
(1)

Convenient characteristics of distributions are their moments.

The q-th moment of ρ(ν) is defined as

$$ M_{\rho, q}=\int_{-\infty}^{\infty} \nu^q \rho(\nu) d\nu. $$
(2)

Using Eq. 1 we get

$$ M_{\rho, q}=\frac{1}{D}\sum\limits_{i=1}^D \nu_i^q. $$
(3)

The corresponding q-th spectral density scaled moments are

$$ M^{\prime}_{\rho, q}=\frac{1}{D}\sum\limits_{i=1}^D ( \nu_i - M_{\rho,1}) ^q, $$
(4)
$$ M^{\prime\prime}_{\rho, q}=\frac{1}{D}\sum\limits_{i=1}^D \left[\frac{(\nu_i-M_{\rho,1})} {\sqrt{M_{\rho,2}-(M_{\rho,1})^2}}\right]^q. $$
(5)

In the present study, we construct spectral density distributions \(\rho(\nu)\) from the frequencies \(\nu_i\) of the IR spectra. The aim of this study is to introduce the spectral density distribution moments \(M_{\rho, q}, \; M^{\prime}_{\rho, q}, \; M^{\prime\prime}_{\rho,q}\) as molecular descriptors. Usually, the four lowest moments are used in statistical investigations: \(M_{\rho,1}\) is the mean frequency, \(M^{\prime}_{\rho,2}\) describes the width, \(M^{\prime\prime}_{\rho, 3}\) the asymmetry, and \(M^{\prime\prime}_{\rho,4}\) the excess of the distribution. Higher-order moments do not have any direct geometrical equivalents and are usually neglected. In most cases they supply no new information about the system [15]. However, in some cases, they may also be useful [16].
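As an illustration of Eqs. 3–5, the following minimal sketch computes the moments for a list of frequencies. The function name `spectral_density_moments`, the dictionary keys, and the four toy frequencies are assumptions made for this example only; they are not taken from the calculations reported here.

```python
import numpy as np

def spectral_density_moments(freqs, q_max=4):
    """Moments of a discrete frequency spectrum (hypothetical helper).

    Returns M_{rho,1} (Eq. 3 with q = 1), M'_{rho,2} (Eq. 4 with q = 2)
    and M''_{rho,q} for q = 3..q_max (Eq. 5).
    """
    nu = np.asarray(freqs, dtype=float)
    m1 = nu.mean()                    # M_{rho,1}: mean frequency
    m2 = np.mean(nu ** 2)             # M_{rho,2}: second raw moment
    sigma = np.sqrt(m2 - m1 ** 2)     # scaling factor in Eq. 5
    centered = nu - m1
    moments = {"M1": m1, "Mp2": np.mean(centered ** 2)}
    for q in range(3, q_max + 1):
        moments[f"Mpp{q}"] = np.mean((centered / sigma) ** q)
    return moments

# Toy spectrum (illustrative values, not a computed IR spectrum).
toy = spectral_density_moments([100.0, 200.0, 300.0, 400.0])
print(toy)
```

For this symmetric toy spectrum the third scaled moment vanishes, as expected for a distribution without asymmetry.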

Results and discussion

We perform the calculations for 76 compounds: CNs containing from zero through eight chlorine atoms. They are listed in Table 1, where \(r=0,1,{\ldots} 75\) are the labels of the compounds.

Table 1 Experimental and predicted values of \(\log P_L\), leverage values, and \(M_{\rho,1}\) used as the molecular descriptor (T: training set, V: validation set)

We study spectral density distributions of the frequencies of the IR spectra of the CNs.

We obtained the vibrational spectra from density functional theory (DFT) calculations. The hybrid B3LYP functional and the 6-311++G** basis set were used as implemented in the Gaussian 03 code [21].

Figure 1 shows the first moments \(M_{\rho,1}\) for all 76 compounds. In this figure, one can identify the descriptors corresponding to particular compounds, which are numbered by r (see Table 1) on the horizontal axis. The descriptors corresponding to molecules with different numbers of chlorine atoms are represented by different symbols. The same symbols are used in Fig. 2, where \(M^{\prime}_{\rho,2}\) and \(M^{\prime\prime}_{\rho,q}\) for \(q=3,4,\dots, 10\) are shown. The patterns of \(M''_{\rho, 5}, M''_{\rho, 6}, \ldots, M''_{\rho, 10}\) are similar to each other (Fig. 2), which suggests strong correlations between these moments. Therefore, the four lowest moments, \(M_{\rho,1}\), \(M^{\prime}_{\rho,2}\), \(M^{\prime\prime}_{\rho,3}\), and \(M^{\prime\prime}_{\rho,4}\), carry sufficient characteristics of the compounds.

Fig. 1 \(M_{\rho, 1}\) for 76 compounds

Fig. 2 Spectral density distribution moments

In the present study, the spectral density distribution moments are applied as molecular descriptors for developing a QSPR model of the logarithmic values of the subcooled liquid vapor pressure (\(\log P_L\)) at 25 °C. Experimental data, available for 17 CNs (22% of the investigated group), were taken from [22]. The compounds for which experimentally derived \(\log P_L\) values were available were divided into two smaller sets: a training set (12 compounds) and a validation set (5 compounds). The splitting algorithm was as follows. The 17 compounds were sorted by decreasing \(\log P_L\) value, and then every third compound was assigned to the validation set, whereas the remaining ones formed the training set. This method produces two representative sets of compounds, since the compounds in each set are evenly distributed over the range of \(\log P_L\). The training set was then used for model development and calibration, whereas the validation set, in accordance with the gold standards [23] and the OECD recommendations for QSAR [24], was employed for external validation of the model.
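The splitting procedure can be sketched as follows. The function name `split_every_third` and the synthetic \(\log P_L\) values are assumptions for illustration. The text does not state at which position the selection starts; the sketch assumes that the 3rd, 6th, 9th, ... compounds of the sorted list are selected, which is the only starting offset that yields 12 training and 5 validation compounds out of 17.

```python
def split_every_third(values):
    """Sort indices by decreasing value; every third one goes to validation.

    Assumption: selection starts at the third compound of the sorted list.
    Returns (training_indices, validation_indices).
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    validation = order[2::3]
    training = [idx for rank, idx in enumerate(order) if rank % 3 != 2]
    return training, validation

# Synthetic placeholder values (NOT the experimental data from [22]).
log_p = [0.5 - 0.3 * k for k in range(17)]
training, validation = split_every_third(log_p)
print(len(training), len(validation))  # 12 5
```

Because the selection is made along the sorted list, both subsets span the whole range of the modeled property.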

Multiple linear regression (MLR), a standard statistical method, was selected for modeling. However, given the limited number of training compounds, the complexity of the model was restricted to at most one descriptor. Among all the descriptors taken into account (namely \(M_{\rho,1}, M^{\prime}_{\rho, 2}, M^{\prime\prime}_{\rho, 3}, \ldots, M^{\prime\prime}_{\rho, 10}\)), only one (\(M_{\rho,1}\), the spectral density distribution moment of the first order) enabled the construction of a statistically significant (p < 0.05) QSPR model (Eq. 6):

$$ \log P_L = -9.1601 + 0.0076 M_{\rho,1}. $$
(6)

The model was characterized not only by high goodness-of-fit (measured by the high correlation between the observed/experimental and predicted values of \(\log P_L\), \(R^2 = 0.991\)), but also by high robustness (expressed by the cross-validated correlation coefficient, \(Q^2_{CV} = 0.985\)) and, most importantly, by high predictive ability (based on its external validation coefficient, \(Q^2_{Ext} = 0.991\)). The values of the root mean square error (RMSE) of calibration (C), cross-validation (CV), and external prediction for the validation compounds (P) were as follows: \(RMSE_C = 0.14\), \(RMSE_{CV} = 0.18\), \(RMSE_P = 0.06\). Visual analysis of the correlation plot (Fig. 3) did not indicate any outlying results. Similarly, the analysis of the model's applicability domain with the Williams plot (Fig. 4) indicated that the model is able to make reliable predictions for CNs having leverage values \(h_i\) below \(h^* = 0.5\). This means that the model also predicts correctly for those CNs that have not previously been used for fitting (training). For more details on developing and validating QSAR/QSPR models, one should refer to [23] and [25]. The values of \(\log P_L\) predicted for all 76 CNs, together with the calculated leverage values, are listed in Table 1.
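The validation statistics discussed above can be illustrated with the following sketch for a one-descriptor linear model. All function names and the synthetic data are assumptions for this example; the data are not those of Table 1. The sketch also assumes leave-one-out cross-validation and the commonly used convention \(h^* = 3(p+1)/n\) for the critical leverage (p descriptors, n training compounds), neither of which is stated explicitly in the text. For p = 1 and n = 12 this convention gives \(h^* = 0.5\), the value quoted above.

```python
import numpy as np

def design(x):
    """Design matrix [1, x] for a one-descriptor linear model."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x])

def fit(x, y):
    """Ordinary least squares; returns the array (intercept, slope)."""
    coef, *_ = np.linalg.lstsq(design(x), np.asarray(y, dtype=float), rcond=None)
    return coef

def predict(coef, x):
    return design(x) @ coef

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

def q2_loo(x, y):
    """Leave-one-out Q^2 = 1 - PRESS / SS_total (assumed CV scheme)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    press = 0.0
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        press += (y[i] - predict(fit(x[keep], y[keep]), x[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def leverages(x_train, x_query):
    """h_i = q_i^T (X^T X)^{-1} q_i, evaluated via the pseudoinverse of X."""
    return np.sum((design(x_query) @ np.linalg.pinv(design(x_train))) ** 2, axis=1)

# Synthetic placeholder data (NOT the values from Table 1).
rng = np.random.default_rng(0)
x_train = np.linspace(900.0, 1300.0, 12)
y_train = -9.0 + 0.0075 * x_train + rng.normal(0.0, 0.1, 12)

coef = fit(x_train, y_train)
rmse_c = rmse(y_train, predict(coef, x_train))
q2_cv = q2_loo(x_train, y_train)
h_train = leverages(x_train, x_train)
h_star = 3 * (1 + 1) / len(x_train)   # assumed convention: 3(p + 1)/n
print(coef, rmse_c, q2_cv, h_train.max() < h_star)
```

A compound whose leverage, computed with `leverages(x_train, x_new)`, exceeds `h_star` would lie outside the applicability domain, and its prediction would be an extrapolation.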

Fig. 3 Plot of the observed (experimental) versus predicted values of \(\log P_L\). Training compounds are indicated by squares and validation compounds by circles

Fig. 4 Applicability domain of the model explored with the Williams plot. Training compounds are indicated by squares and validation compounds by circles. Standardized cross-validated residuals do not exceed the absolute value of 2.5. All the compounds from the training and validation sets are characterized by leverage values lower than the critical leverage \(h^* = 0.5\). Predictions for compounds for which \(h_i < h^*\) should be considered reliable, since they are the results of interpolation by the model

It is worth noting that the subcooled liquid vapor pressure (\(\log P_L\)) at 25 °C has already been successfully predicted for CNs with QSPR models [26]. Those models used other popular quantum-mechanical descriptors (averaged polarizability, dipole moment, etc.) calculated at the same level of theory (B3LYP/6-311++G**), combined with statistical modeling methods of different complexity, including MLR, principal component regression (PCR), partial least squares (PLS) regression, and two of its modifications: PLS regression with uninformative variable elimination (UVE-PLS) and PLS regression with variable selection by genetic algorithm (GA-PLS). However, by employing the spectral density distribution moments as novel molecular descriptors in the current study, it was possible to develop a model characterized by both lower complexity and better predictive ability than the best model obtained in the previous study. That model was developed with GA-PLS, used eight descriptors, and had a prediction error of \(RMSE_P = 0.108\).

In summary, the model presented in the current study was developed with a much simpler algorithm (MLR) and uses only one descriptor, with \(RMSE_P\) equal to 0.06. This confirms the usefulness of the proposed spectral density distribution moments in QSPR.

The proposed descriptors characterize statistical properties of the distributions of the frequencies (not of the intensities) used for their computation. Therefore, the descriptors defined in this study are useful for describing properties that are mainly determined by the frequency distributions, such as \(\log P_L\) of chloronaphthalenes. The frequency distributions are different if the molecules contain different numbers of chlorine atoms, but all isomers of CNs with a fixed number of substituents have nearly the same frequency distributions. Therefore, the moments shown in Figs. 1 and 2 are nearly constant for compounds with the same number of substituents. In order to distinguish different isomers, one has to use the intensity distribution moments [16]. For CNs, such descriptors are different for each compound, and they will be considered in a subsequent paper.