Introduction

The human and environmental exposome contains a multitude of chemicals, comprising natural chemicals, man-made chemicals, and their transformation products, including metabolites [1, 2]. These chemicals cover a wide range of molecular weights, functional groups or compound classes, physicochemical properties, and biological activities (i.e. toxicity) [2, 3]. Most of the chemicals in the human exposome are structurally unknown, and therefore little is known about their occurrence, fate, and potential health impact [4,5,6,7,8,9].

Non-target analysis (NTA) combined with high-resolution mass spectrometry (HRMS) is considered one of the most comprehensive strategies for the detection and identification of unknown chemicals of emerging concern (CECs) in complex biological and environmental samples [2, 4, 8, 10,11,12,13,14,15]. NTA experiments rely on generic experimental conditions, as they aim to cover as wide a portion of the sample's chemical space as possible [4, 8, 16, 17]. Moreover, NTA experiments have the ultimate goal of confidently identifying (i.e. structurally elucidating) all chemical constituents within the covered chemical space of the sample. This implies that NTA experiments tend to generate a large number (e.g. thousands) of high-resolution mass spectra per sample that must be structurally elucidated [8, 12, 14, 17,18,19,20].

In the past decade, considerable effort has been devoted to the development of digital open-source/access data processing tools to tackle the complex data generated by NTA assays [11, 12, 21,22,23,24,25,26,27,28]. These digital tools provide the means to perform a complete NTA workflow, from feature detection [25, 29,30,31] to componentization [21, 31, 32] and identification/annotation [28, 33,34,35,36]. These tools, though powerful, have been shown to be highly sensitive to data quality and to the parameters used during processing [9, 37,38,39,40], particularly when dealing with complex samples [24, 41,42,43]. In addition, the inherent variability in the data caused by the experimental conditions used during the analysis significantly increases the difficulty of assessing the confidence of the structures generated for the features in the chromatograms [8, 44,45,46]. Finally, a large number of the generated chromatographic features remain unidentified, due to data complexity, limited spectral databases [47], and limited structures in chemical databases (e.g. PubChem [48] and/or CompTox [49]), even though accurate mass spectral information has been collected for them during the analysis.

To mitigate the issues related to the identification of known and unknown unknowns, the addition of retention times and retention indices (r\(_{i}\)) as a complementary source of information has been tested previously [50,51,52,53,54,55,56,57,58]. For r\(_{i}\) measurements, a series of calibrants (i.e. chemicals with known retention behavior) is necessary. The simultaneous analysis of the samples and the r\(_{i}\) calibrants under specified conditions enables the measurement of r\(_{i}\) values of structurally unknown chemicals [59, 60]. These measured r\(_{i}\) values are then compared to r\(_{i}\) databases of structurally known chemicals to further increase the confidence associated with the generated identifications [59,60,61]. Additionally, recent studies have highlighted the use of quantitative structure retention relationship (QSRR) models, employing molecular descriptors, to predict and populate r\(_{i}\) databases [59,60,61,62]. QSRR prediction of r\(_{i}\) values for structurally known chemicals has served as a complementary strategy to experimentally determined r\(_{i}\) values. However, these approaches have major limitations: they require the chemical structure to be known; for unknown chemicals, they are applicable only under well-defined chromatographic conditions (for example, a specific organic modifier); and they depend on the experimental r\(_{i}\) calibrant information associated with each NTA experiment. The combination of r\(_{i}\) values measured via calibrant chemicals and values predicted for database entries has been utilized to reduce the number of potential candidates and thus increase the confidence levels associated with tentatively identified features [59,60,61]. However, for this workflow to be effective, the calibrants and the samples must be measured under exactly the same experimental conditions. This implies that any change in the experimental conditions (e.g. gradient, organic modifier, column temperature) warrants additional measurements of the calibrants with the new method. Moreover, it should be noted that in most already published NTA studies the r\(_{i}\) calibrants were not injected, and their retention times are therefore missing [4]. This limitation also hinders the alignment of chromatograms acquired under different experimental conditions (i.e. in different labs or with different experimental setups), ultimately slowing down the detection of chemicals of emerging concern [26]. These limitations greatly restrict the applicability of r\(_{i}\) values for unraveling the human and environmental exposome via NTA assays.
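For context, when a homologous calibrant series is co-analyzed with the sample, the r\(_{i}\) value of an analyte is typically obtained by interpolating its retention time between the two bracketing calibrants; a generic linear form of this calculation (an illustration of the principle, not necessarily the exact definition of the amide scale used in [59]) is:

$$\begin{aligned} r_{i}(x) = 100\,n + 100\,\frac{t_{R}(x)-t_{R}(C_{n})}{t_{R}(C_{n+1})-t_{R}(C_{n})} \end{aligned}$$

where \(t_{R}(C_{n})\) and \(t_{R}(C_{n+1})\) are the retention times of the calibrants with \(n\) and \(n+1\) carbon atoms that elute immediately before and after the analyte \(x\).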

Here we have developed and validated a novel machine learning algorithm to predict the r\(_{i}\) values of structurally unknown chemicals based on their measured fragmentation pattern. The developed models, for the first time, enable the prediction of r\(_{i}\) values without requiring the exact structure of the chemicals. For the model development, we selected the alkylamide homologous series as the r\(_{i}\) scale, based on its range of applications in NTA metabolomics studies [59, 61]. The r\(_{i}\) values of structurally known chemicals were predicted using both descriptors and the fragmentation patterns translated into cumulative neutral losses (CNL). The CNL values were obtained by calculating the difference between the precursor mass and the individual fragments within the HRMS spectra, assuming complete independence among fragments. The CNL-based model was validated employing both experimental r\(_{i}\) values and descriptor-based predicted r\(_{i}\) values. Finally, the validated CNL-based model showed accuracy in r\(_{i}\) prediction comparable to conventional descriptor-based models while relying only on the measured HRMS spectra (i.e. no information about the chemical structures).

Methods

Data

For the model development, validation, and testing, we employed two different datasets: a set of experimental r\(_{i}\) values based on the alkylamide homologous series, referred to as the amide dataset and consisting of 1488 chemicals [59]; and 26489 chemicals from the NORMAN SusDat database [26]. The alkylamide homologous series is one of the most commonly used r\(_{i}\) scales for C18 reversed-phase liquid chromatography (RP-LC), due to its ease of measurement with RP-LC-HRMS and its applications in the metabolomics field [59, 61]. The amide dataset consisted of measured r\(_{i}\) values for 1488 chemicals covering more than 40 different functional groups (from amine, aniline, pyridine, pyrrole, ether, ester, ketone, alcohol, carboxylic acid, and phenol to amide) and a molecular weight range between 79 Da and 609 Da. These r\(_{i}\) values were measured using a Zorbax SB-C18 column with water and acetonitrile as the mobile phase. More details on these measurements are provided elsewhere (Hall et al. 2016 [59]). The NORMAN dataset, on the other hand, was selected based on its high level of curation and the availability of experimental HRMS spectra through NORMAN MassBank [63].

We calculated 2757 1D, 2D, and 3D descriptors and PubChem fingerprints for both datasets using the PaDEL software package (Fig. 1) [64]. The curated amide dataset descriptors were employed for the development and validation of an r\(_{i}\) prediction model, because only 133 of the 1488 unique chemicals had experimental HRMS spectra available in public spectral databases. This descriptor based model was then utilized to predict the r\(_{i}\) values for the NORMAN dataset, which contained 3217 unique chemicals with experimental HRMS spectra (around 20871 measured spectra). We combined 30\(\%\) of the amide dataset, with experimental r\(_{i}\) values and HRMS spectra, with 85\(\%\) of the NORMAN dataset, with predicted r\(_{i}\) values (descriptor based model) and experimental HRMS spectra. This combination enabled us to minimize the impact of the first (descriptor based) model on the CNL based model, while still enabling an adequate validation of the model. Finally, the remaining 70\(\%\) of the amide dataset, with experimental r\(_{i}\) values and HRMS spectra, was employed to further test the performance of the CNL based model. This workflow is shown schematically in Fig. 1.

Additionally, we compared the chemical spaces covered by our training set and the NORMAN dataset. We performed a principal component analysis (PCA) using the curated descriptors of both datasets. More details of the PCA are provided in Section S2 of Additional file 1. The scores plot of the first two PCs indicates that our training set provides adequate coverage of the NORMAN dataset (see Additional file 1: Figs. S6 and S7).

Fig. 1

Workflow for setting up the models for predicting r\(_i\) values. A shows the construction of the descriptor model for predicting the NORMAN r\(_i\) values, whereas B shows the conversion of spectra to CNL values and the construction of the CNL model

Descriptor datasets

Descriptor generation: The PaDEL software was employed to calculate 2757 1D, 2D, and 3D descriptors and PubChem fingerprints for the amide and NORMAN datasets [64]. The calculated descriptors were saved as CSV files for the amide and NORMAN datasets, respectively, and can be found on FigShare (see Sect. "Potentials and limitations"). When performing the 3D descriptor calculations, PaDEL needs to optimize the chemical structures, which in some cases resulted in convergence issues and thus a failure of the descriptor calculation. The descriptor calculations converged for 1289 out of 1488 unique chemicals in the amide dataset and for 23012 out of 26489 unique chemicals in the NORMAN dataset.
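For illustration, such a descriptor calculation can be scripted through the padelpy wrapper mentioned in Sect. "Calculations"; the file names and option set below are placeholders rather than the exact settings used in this study:

```python
# Minimal sketch of descriptor generation with padelpy (a PaDEL wrapper).
# File names are placeholders; the options shown are assumptions, not the
# exact settings used for the amide/NORMAN datasets.
from padelpy import padeldescriptor

padeldescriptor(
    mol_dir="amide_dataset.smi",      # input structures (SMILES), hypothetical file
    d_file="amide_descriptors.csv",   # output CSV with descriptor values
    d_2d=True,                        # 1D/2D descriptors
    d_3d=True,                        # 3D descriptors (require geometry optimization,
                                      # which may fail to converge for some structures)
    fingerprints=True,                # fingerprints, including PubChem fingerprints
    retainorder=True,                 # keep the input compound order in the output
)
```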

Descriptor curation: We also performed a descriptor curation to ensure that only relevant and stable descriptors were included in the models. To assess the stability of the descriptors, we performed the calculations in triplicate for the amide dataset. Next, the descriptors were scaled based on the minimum and maximum value of each descriptor in order to compare them on the same scale. We then calculated the variance of each descriptor across the replicates and kept only the descriptors with a variance lower than 0.01. This resulted in 2363 stable descriptors to be used in the models. For the NORMAN dataset, the descriptors were calculated only once, due to the large number of chemicals. It was assumed that the descriptors deemed stable for the amide dataset had a high probability of also being stable for the NORMAN dataset.
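A minimal sketch of this curation step (min–max scaling of the triplicate descriptor tables followed by the 0.01 variance filter), under the assumption that the variance is computed across the three replicate calculations; file and variable names are placeholders:

```python
# Sketch of the descriptor stability filter: descriptors are min-max scaled,
# their variance across the triplicate calculations is computed, and only
# descriptors with a mean variance below 0.01 are retained.
import pandas as pd

# Load the three replicate descriptor tables (hypothetical file names).
replicates = [pd.read_csv(f"amide_descriptors_rep{i}.csv", index_col=0) for i in (1, 2, 3)]
stacked = pd.concat(replicates, keys=[1, 2, 3], names=["replicate", "compound"])

# Min-max scale each descriptor over all replicates so variances are comparable.
col_range = (stacked.max() - stacked.min()).replace(0, 1)   # avoid division by zero
scaled = (stacked - stacked.min()) / col_range

# Variance of each descriptor across the three replicates, averaged over compounds;
# descriptors with a mean variance below 0.01 are considered stable and retained.
variance = scaled.groupby(level="compound").var().mean()
stable_descriptors = variance[variance < 0.01].index
curated = replicates[0][stable_descriptors]
```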

The retention index values (r\(_{i}\))

While the r\(_{i}\) values for the compounds in the amide dataset were determined experimentally, this was not the case for the compounds in the NORMAN dataset. Therefore, r\(_{i}\) values for the NORMAN dataset were predicted using the descriptor based model (see Sect. "Modeling"). The prediction of the r\(_{i}\) values of the NORMAN dataset was performed only for the chemicals that were within the applicability domain (AD) of the descriptor based model (see Sect. "Applicability domain"). The AD filtering of the NORMAN dataset resulted in reliably predicted r\(_{i}\) values for 14567 unique chemicals. The distribution of the r\(_{i}\) values used in this study can be seen in Fig. 2.

Fig. 2

Distribution of r\(_i\) values for the descriptor amide dataset (A) and of predicted r\(_i\) values for the descriptor NORMAN dataset (B)

CNL dataset

For the CNL dataset, electrospray ionization (ESI) high-resolution (i.e. resolving power \(\ge \) 5000) spectra were obtained from MassBank EU [65] for both the amide and NORMAN compounds. For each of these chemicals, all corresponding experimental spectra were retrieved based on their InChIKeys [66] and SMILES [67]. For the amide dataset, a total of 862 mass spectra were found for 133 unique compounds, whereas for the NORMAN dataset a total of 23871 unique mass spectra were found for 3217 unique chemicals. These compounds were cross-referenced with the NORMAN descriptor dataset, from which it was concluded that reliable r\(_i\) values could be predicted for only 2734 unique compounds, resulting in a dataset of 20871 entries. The distributions of r\(_i\) values for the amide and NORMAN datasets can be seen in Fig. 3A and B, respectively. The retrieved HRMS spectra were generated by different labs, on different instruments, and under different experimental conditions. In order for our model to be able to handle the variance in the spectra coming from these different experimental settings, we kept redundant spectra of the same compound as separate entries. For example, for caffeine there were around 50 spectra measured with different instruments, collision energies, and instrumental setups (e.g. source geometry and temperature). However, we did not observe a direct relationship between the experimental parameters and the number, m/z values, and relative intensities of the generated fragments. For example, the spectrum generated with an Orbitrap at 10 eV (AU276601) contained 2 fragments, while a spectrum from a QToF instrument at 30 eV (KW107903) also contained 2 fragments. Keeping all these spectra as separate entries made the model robust to such instrument-related variance and enabled us to incorporate the instrument variability into our models without compromising the model accuracy.

For both the amide and NORMAN datasets, the CNLs were calculated for each individual spectrum by subtracting the fragment masses from the precursor ion mass. For each spectrum, the CNL values were converted to a bit vector corresponding to CNL masses from 0 to 1000 Da with a step size of 0.01 Da (i.e. ± 5 mDa mass tolerance), where 1 represented the presence and 0 the absence of a CNL within a spectrum. Additionally, CNL values larger than the precursor ion mass were encoded as -1 in the bit vector. This ensured that the model could distinguish between absent and impossible CNLs (i.e. neutral losses larger than the precursor mass). The precursor ion mass was added as an additional continuous feature to the dataset. The combination of the monoisotopic mass and the CNLs enabled us to incorporate the information provided by the individual fragments and neutral losses into our model while keeping the number of model variables to a minimum. For example, in the case of a CNL of 18 Da (i.e. the loss of water), the number of variables was reduced from \(\approx \) 3000 variables (i.e. individual fragments) to a single variable without any loss of information.
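A minimal sketch of this encoding, assuming each spectrum is available as a precursor m/z and a list of fragment m/z values (names and the example values are illustrative, not actual MassBank records):

```python
# Sketch of the CNL encoding described above: neutral losses are computed as
# precursor m/z minus fragment m/z, binned into 0.01 Da bins from 0 to 1000 Da,
# and stored as a vector of 1 (CNL present), 0 (absent), and -1 (impossible,
# i.e. the bin corresponds to a loss larger than the precursor mass).
import numpy as np

BIN_WIDTH = 0.01                  # Da, i.e. +/- 5 mDa tolerance
N_BINS = int(1000 / BIN_WIDTH)    # CNLs from 0 to 1000 Da -> 100,000 bins

def encode_spectrum(precursor_mz, fragment_mzs):
    """Encode one spectrum as [precursor mass, CNL bit vector]."""
    cnl_vector = np.zeros(N_BINS, dtype=np.int8)

    # Bins corresponding to losses larger than the precursor mass are impossible
    # and are encoded as -1 so the model can distinguish them from absent CNLs.
    first_impossible = int(np.ceil(precursor_mz / BIN_WIDTH))
    cnl_vector[first_impossible:] = -1

    # Cumulative neutral losses: precursor mass minus each fragment mass.
    for fragment_mz in fragment_mzs:
        cnl = precursor_mz - fragment_mz
        bin_index = int(round(cnl / BIN_WIDTH))
        if 0 <= bin_index < N_BINS:
            cnl_vector[bin_index] = 1

    # The precursor ion mass is kept as an additional continuous feature.
    return np.concatenate(([precursor_mz], cnl_vector))

# Illustrative call (placeholder m/z values).
features = encode_spectrum(195.0877, [138.0662, 110.0713])
```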

Fig. 3

Distribution of r\(_i\) values for the CNL amide dataset (A) and of the predicted r\(_i\) values for the CNL NORMAN dataset (B)

Modeling

For modeling, a gradient boosting regression model was implemented in Python 3.7.11 using CatBoost (v 0.3) [68]. CatBoost is a state-of-the-art approach for gradient boosting on decision trees for big data [69]. The main idea of gradient boosting is to consecutively combine many decision trees (weak learners) to create a strong model. Since the decision trees are fitted consecutively, each new tree learns from the mistakes of the previous trees to reduce the error. The process of adding new trees continues until the selected loss function no longer decreases or the maximum number of trees is reached.

Descriptor based model: To train the descriptor based model, the amide dataset was split into a training set (85%, n=1102) and a test set (15%, n=195). The descriptor based model used the calculated and curated descriptors of the amide dataset as input variables and the experimental r\(_{i}\) values as the output variable (see Fig. 2). To reflect the skewed nature of the r\(_{i}\) values, the data splitting was performed using stratified sampling, which ensured that both the training set and the test set were representative of the population. The stratified sampling was done using three classes: 200–440, 440–700, and 700–1041 r\(_{i}\) units.
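A sketch of such a stratified 85/15 split, using the three r\(_{i}\) classes as strata (file names, variable names, and the random seed are illustrative):

```python
# Sketch of the stratified data split using the three r_i classes as strata.
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder inputs: curated descriptors and experimental r_i values (hypothetical files).
descriptors = pd.read_csv("amide_descriptors_curated.csv", index_col=0)
ri = pd.read_csv("amide_ri.csv", index_col=0)["ri"]

# Three strata covering the r_i range reported above.
ri_class = pd.cut(ri, bins=[200, 440, 700, 1041],
                  labels=["low", "mid", "high"], include_lowest=True)

X_train, X_test, y_train, y_test = train_test_split(
    descriptors, ri,
    test_size=0.15,        # 85/15 split
    stratify=ri_class,     # preserve the skewed r_i distribution in both sets
    random_state=42,       # illustrative seed
)
```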

The model training and optimization were performed using the root mean square error (RMSE) loss function for a total of 450 iterations (one tree is constructed per iteration) with a learning rate of 0.03. The tree depth was set to 8, with a maximum of 256 leaves and a minimum of 1 data point per leaf. To prevent overfitting, the coefficient of the L2 regularization term was set to 10, which provided the needed balance between model accuracy and robustness. In addition, training was stopped if the error on the validation set did not decrease for more than 5 iterations. For the remaining parameters, the default values were used. Additional information regarding the hyperparameter selection can be found in Additional file 1: Section S3. We used 5-fold cross-validation to tune the model hyperparameters on the training set. Accordingly, the training set was split into five equally sized parts, which were used to construct 5 different training and validation splits. Additionally, from the optimized model we extracted the 40 features that explained the highest levels of variance in the training set. The distributions of these features are shown for both the amide and NORMAN datasets in Additional file 1: Section S4.1. The final model, which consisted of 448 trees, was then refitted on the whole training set using these 40 features and the optimized hyperparameters.
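As an illustration, the descriptor model with the hyperparameters reported above could be configured as follows (a sketch, not the authors' exact script; the loss-guided tree growth policy is an assumption made so that the leaf count and per-leaf minimum can be set explicitly, and a simple hold-out validation split stands in for the 5-fold cross-validation):

```python
# Sketch of the descriptor-based CatBoost regressor with the reported hyperparameters.
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

# Hold-out validation split for early stopping (the study used 5-fold CV instead).
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2,
                                             random_state=0)

model = CatBoostRegressor(
    loss_function="RMSE",
    iterations=450,
    learning_rate=0.03,
    depth=8,
    grow_policy="Lossguide",   # assumption: needed to set max_leaves / min_data_in_leaf
    max_leaves=256,
    min_data_in_leaf=1,
    l2_leaf_reg=10,
    random_seed=0,
)

# Training stops early if the validation error does not improve for 5 iterations.
model.fit(X_tr, y_tr, eval_set=(X_val, y_val),
          early_stopping_rounds=5, verbose=False)

# The most important features can then be extracted for the refitted 40-feature model.
top40 = model.get_feature_importance(prettified=True).head(40)
```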

To further assess the model performance, we took advantage of the test set that was unknown to the model during the training step.

CNL model: A CatBoost regression model was built using the CNL datasets as input in order to predict r\(_{i}\) values. The model was trained using the RMSE loss function for a total of 5000 iterations with a learning rate of 0.077. The tree depth was set to 6, with a maximum of 64 leaves and a minimum of 1 data point per leaf. The coefficient of the L2 regularization term was set to 3, and training was stopped if the error on the validation set did not decrease for more than 5 iterations, to prevent overfitting. Again, the default values were used for the remaining parameters. Additional information regarding the hyperparameter selection can be found in Additional file 1: Section S3. The model was trained on a combination of the amide and NORMAN datasets. The NORMAN dataset was split into a training set (85\(\%\), n=17740) and a test set (15\(\%\), n=3131), and the amide dataset was split into a training set (30\(\%\), n=258) and a test set (70\(\%\), n=604), resulting in a total training set of 17998 entries consisting of unique spectra. It should be noted that only a small fraction of the amide dataset was used for training, enabling the final testing of the model with experimental r\(_{i}\) values that were not included in the training set. The model was then trained and optimized using 5-fold cross-validation on the training set (as described in the paragraph above). This resulted in a final model with 5000 trees using 4220 of the 100,000 CNL features. Distributions of the 50 most important CNLs are shown in Additional file 1: Section S4.2 for both the amide and NORMAN datasets. Both test sets (i.e. the NORMAN test set from the data splitting and the withheld amide dataset) were used as external test sets after the training process to reliably assess model performance.
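For illustration, the assembly of the combined training set and the CNL model configuration reported above might look as follows (a sketch under the assumption that the CNL feature matrices and r\(_{i}\) vectors are available as arrays; variable names and the growth policy are assumptions):

```python
# Sketch of the CNL training-set assembly (85% of NORMAN + 30% of the amide
# entries) and the CNL CatBoost model with the reported hyperparameters.
import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

# X_norman / y_norman and X_amide / y_amide are assumed to hold the CNL feature
# matrices and the (predicted or experimental) r_i values per spectrum.
Xn_tr, Xn_test, yn_tr, yn_test = train_test_split(
    X_norman, y_norman, test_size=0.15, random_state=42)   # 85/15 NORMAN split
Xa_tr, Xa_test, ya_tr, ya_test = train_test_split(
    X_amide, y_amide, train_size=0.30, random_state=42)    # 30/70 amide split

X_train = np.vstack([Xn_tr, Xa_tr])                        # 17740 + 258 = 17998 entries
y_train = np.concatenate([yn_tr, ya_tr])

cnl_model = CatBoostRegressor(
    loss_function="RMSE",
    iterations=5000,
    learning_rate=0.077,
    depth=6,
    grow_policy="Lossguide",   # assumption, as for the descriptor model
    max_leaves=64,
    min_data_in_leaf=1,
    l2_leaf_reg=3,
    random_seed=0,
)

# A hold-out validation fold (5-fold CV in the study) drives the early stopping
# after 5 non-improving iterations.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2,
                                             random_state=0)
cnl_model.fit(X_tr, y_tr, eval_set=(X_val, y_val),
              early_stopping_rounds=5, verbose=False)
```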

Applicability domain (AD)

In order to assess whether a new entry is represented by the training set, we employed applicability domain (AD) calculations. The AD was determined using the leverage [70] (\(h_{ii}\)), which is defined as follows:

$$\begin{aligned} h_{ii} = \varvec{x}_{i}^{\top }\left( \textbf{X}^{\top } \textbf{X}\right) ^{-1} \varvec{x}_{i} \end{aligned}$$
(1)

where \(\textbf{X}\) is the matrix of descriptors for the compounds in the (e.g. amide) training set, and \(\varvec{x}_i\) is the vector of molecular descriptors for compound \(i\). The leverage score can be viewed as the weighted distance between \(\varvec{x}_i\) and the mean of \(\textbf{X}\), and therefore provides a measure of the applicability domain of the model. The acceptable leverage threshold was determined as the 95th percentile of the distribution of leverages generated using a leave-one-out approach on the training set (this was \(\approx \) 0.131). This threshold was used to assess whether a chemical was well covered by the training set. The same approach was employed for the AD assessment of the CNL model. All distributions of the training and test sets can be found in Additional file 1: Section S1. It should be noted that this AD assessment only takes into account the variables used in the final model, which could be inadequate for future entries.
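A minimal sketch of this AD check, following Eq. (1) and the leave-one-out threshold described above (variable names are illustrative):

```python
# Sketch of the leverage-based applicability domain check from Eq. (1).
import numpy as np

def leverage(x_new, X_train):
    """Leverage of a feature vector x_new with respect to the training matrix X_train."""
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)   # pseudo-inverse for numerical stability
    return float(x_new @ xtx_inv @ x_new)

def ad_threshold(X_train, percentile=95):
    """Leave-one-out leverages of the training compounds define the AD threshold."""
    loo = [leverage(X_train[i], np.delete(X_train, i, axis=0))
           for i in range(X_train.shape[0])]
    return np.percentile(loo, percentile)   # reported as ~0.131 in this study

def in_domain(x_new, X_train, threshold):
    """A new compound is inside the AD if its leverage does not exceed the threshold."""
    return leverage(x_new, X_train) <= threshold
```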

Calculations

All calculations were performed on a personal computer with an AMD Ryzen Threadripper 3970X CPU and 256 GB of RAM running Windows 10 Pro. All data processing and statistical analyses were performed using Python 3.7.11 and the Julia language 1.6.0. For the descriptor calculations, a Python 3 wrapper of the PaDEL software called padelpy (https://github.com/ecrl/padelpy) was employed.

Results and discussion

Descriptor based model

The first model developed in this study was a QSRR model relating the 2363 curated descriptors of 1289 unique chemicals to their experimental r\(_{i}\) values. This model was optimized and validated with an external test set. Next, the validated model was employed to predict the r\(_{i}\) values for the NORMAN dataset, providing a sufficiently large training set for the CNL based model.

Descriptor based model performance

The optimized descriptor based model was able to accurately predict r\(_{i}\) values, with a standard error of 4.9% and 7.5% for the training and test sets, respectively. The quality of the data fit is shown in Fig. 4. To assess the performance of the model, the coefficient of determination (\(R^2\)), the root mean squared error (RMSE), and the maximum error were evaluated. The model showed regression statistics with \(R^2 = 0.94\) for the training set and \(R^2 = 0.85\) for the test set. The RMSE of the CatBoost model was 44 r\(_{i}\) units for the training set and 67 r\(_{i}\) units for the test set, corresponding to the standard errors of 4.9% and 7.5% mentioned above. Interestingly, based on the distribution of residuals, the model seems to consistently overestimate low (200–400) r\(_{i}\) values and underestimate high (900–1000) r\(_{i}\) values. The worst prediction was off by 165 r\(_{i}\) units for the training set and 211 r\(_{i}\) units for the test set.

Fig. 4

Parity plot of the descriptor model predictions and the experimental r\(_i\) values for the training set (n=1102) (A) and the external test set (n=195) (B) with the coefficient of determination (\(R^2\)), root mean squared error (RMSE) and maximum error. In addition, marginal distributions of the experimental and predicted r\(_i\) are shown

Interpretation of selected descriptors

When looking into the 7 most important descriptors selected by the final model (shown in Additional file 1: Section S5), consistency in the information captured by the descriptors can be found. All 7 most important descriptors are 2D descriptors and are directly or indirectly related to the charge of the molecule, which is a highly important factor for separation with reversed-phase liquid chromatography (RP-LC) [50]. The first three descriptors describe variants of LogP (i.e. the partition coefficient), namely XLogP, Mannhold LogP, and Crippen LogP. The next most important descriptors are the number of basic groups (nBase), the lipoaffinity index, and the centered Moreau-Broto autocorrelation of lag 3 weighted by mass (ATSC3m), which contains information on the topological structure. These descriptors all represent the chemical interactions with the stationary and mobile phases. Lastly, the 7th most important variable is BCUTw-1h [71], which corresponds to the lowest eigenvalue obtained from information based on atomic charge, polarizability, and hydrogen-bond donor and acceptor capabilities. Overall, it is logical that descriptors containing information on the charge of the molecules have the highest contributions to the prediction of RP-LC r\(_{i}\) values.

CNL based model

Using the descriptor based model, we predicted the r\(_{i}\) values of the 2734 unique chemicals in the NORMAN dataset that were within the AD of our model. These 2734 chemicals corresponded to 20871 HRMS entries from MassBank EU, which were combined with 258 entries belonging to the amide dataset. The generated CNL matrix and the vector of r\(_{i}\) values were utilized to build a model able to predict the r\(_{i}\) values based only on HRMS spectra. Additionally, 604 entries from the amide dataset were used for further testing of the final model. These 604 entries had both the CNL matrix and experimentally determined r\(_{i}\) values and were completely unknown to our model.

CNL based model performance

The final model showed regression statistics with an \(R^2 = 0.96\) for the training set and \(R^2 = 0.91\) for the test set, while for the withheld 604 entries from the amide dataset our model produced an \(R^2 = 0.77\). The worst prediction error, 283 r\(_{i}\) units, was observed for the withheld amide test set. The RMSE of the model was 30 and 47 r\(_{i}\) units for the training and test sets, respectively, and 67 r\(_{i}\) units for the withheld amide test set. This further indicates that our model is able to accurately predict r\(_{i}\) values based on HRMS spectra alone.

When comparing the performance of the model on the different test sets, it is evident that the performance is lower for the withheld amide test set than for the conventional test set. This is not surprising, as the training set is mostly comprised of the NORMAN dataset (17740 out of 17998 entries), for which the r\(_{i}\) values were predicted using the descriptor based model, resulting in error propagation and thus lower accuracy. It is therefore rather impressive that the model is still able to explain more than 77% of the variance of the withheld amide test set. This is a strong confirmation that CNLs can effectively be used to predict r\(_{i}\) values, and it indicates that an increase in experimentally determined retention indices and HRMS spectra could further improve the model accuracy. Another point of attention is the number of features used by the model to make its predictions. As we are dealing with a setting in which there are more features (\(\approx \) 100,000) than measurements (17998 r\(_{i}\) values), care must be taken during model construction that the model uses fewer features than there are data points, to avoid a potential under-determination issue. This was addressed using L2 leaf regularization and early stopping, as described in Sect. "Methods". Yet, for the generalization of the model, it is important to incorporate many features (i.e. different CNLs), as a molecule can potentially fragment in different ways and the model should take these into account. Also, measurement noise may slightly change the CNL masses (and therefore the feature values), which should also be accounted for. Again, we emphasize that training on larger datasets could naturally capture this. We feel that our final model, trained on the dataset at hand and utilizing 4220 features, balances this trade-off well (Fig. 5).

Fig. 5

Parity plot of the CNL model predictions and the experimental r\(_i\) values for the training set (n=17998) (A) the external NORMAN test set (n=3131) (B) and the external amide test set (n=604) (C) with the coefficient of determination (\(R^2\)), root mean squared error (RMSE) and maximum error. In addition, marginal distributions of the experimental and predicted r\(_i\) are shown

Interpretation of selected CNL features

More information on the selected CNL features is shown in Additional file 1: Section S6. When investigating the most important features of the CNL based model, a few consistent substructures were found for specific CNLs by checking the annotated spectra on MassBank EU. Among those, the 3 most important CNLs for which such similarity was found are discussed below as examples. However, it should be noted that for larger CNLs it is generally more difficult to establish such relationships between the r\(_{i}\) values and the CNL values, due to the exponentially increasing number of possible structures.

One of the most important CNLs had a mass of 155.00 Da and was present in 365 spectra, corresponding to 30 unique chemical constituents. A majority of these chemicals contained the substructure C2(=CC=C(C=C2)N)[S](=O)=O, a common structural feature of several antibacterial chemicals. Two other examples are the CNLs of 65.97 and 56.06 Da, which were consistent with the loss of SO\(_2\) and C\(_4\)H\(_8\), respectively. Finally, the monoisotopic mass was also highly important to the CNL based model, which is understandable given the direct relationship between the molecular weight of a chemical and its retention behavior. Overall, these results show that CNLs contain enough structural information on functional groups and/or molecular substructures to be used for the prediction of r\(_{i}\) values; however, a larger experimental dataset may further improve these interpretations.

Potentials and limitations

In this work, we presented a novel approach for predicting r\(_{i}\) values of structurally unknown compounds using CNLs obtained from public HRMS spectra. The model requires no prior chemical or structural information, which enables its use in NTA, where both known and unknown chemicals are encountered. The developed model enables a calibrant-free use of r\(_{i}\) values in NTA experiments. In other words, the analyst can predict the r\(_{i}\) value of every single feature in an LC-HRMS chromatogram without knowing the chemical structure of the feature and without the need to measure r\(_{i}\) calibrants. Besides the possibility of obtaining r\(_{i}\) values for unknown chemicals from CNLs, the model can also be used, for example, to enhance the performance of library searching and the analysis of historical data, specifically for cases where no r\(_{i}\) calibrants were measured. Being able to predict the r\(_{i}\) values for these cases can enhance library searching by reducing the number of initial candidates through r\(_{i}\) filtering. Also, predicted r\(_{i}\) values for historical data enable their alignment independently of the experimental setup, which consequently provides the means for performing trend analysis and the detection of novel unknown CECs. Additionally, r\(_{i}\) values could be mapped across the full chromatogram (i.e. pixel by pixel), providing insight into the chemical space covered by the analysis method. This range also gives insight into which compounds can and cannot be analyzed with one method versus another. Finally, the model could also be used for cross-selectivity tracking. For example, if the current RP-LC r\(_{i}\) model were used for a HILIC method, a reversed order of retention indices would be expected due to the opposing selectivity modes. Overall, r\(_{i}\) prediction from CNLs could potentially be used for a variety of applications in resolving the human exposome.

One of the current limitations of the model is the number of chemicals with both measured r\(_{i}\) values and HRMS spectra, which impacts the model's AD. Expansion of such measurements will be part of a future study. To expand the AD and potentially improve the model confidence, larger spectral databases and more measured r\(_{i}\) values for a specific selectivity would be required. Additionally, the current version of the model is trained on clean spectra. Therefore, in the case of compounds that co-elute during an LC-HRMS measurement, the obtained r\(_{i}\) values could be less accurate due to the presence of false positive fragments in the spectrum. However, adequate spectral clean-up and deconvolution could mitigate this issue.