1 Introduction

MALDI-MS has come to play a unique role in the analysis of low-molecular-weight biological compounds, principally metabolites [1, 2]. It is well known that the range of compounds detectable by MALDI-MS depends strongly on the molecular species of the matrix. To date, extensive research has been devoted to elucidating the fundamental mechanism of MALDI [3]. However, determining whether a target molecular species can be sensitively detected by MALDI-MS still requires an experimental trial, because there is currently no decisive rationale for predicting which compounds will be ionizable with which matrices. This problem is largely attributable to the chemical and structural diversity of metabolites, which may hinder a rational understanding of the relationships between metabolites and the factors that potentially affect their ionization.

In the present study, we aimed to model the relationship between the structural properties of metabolites and their ionizability in MALDI. In targeted analyses, the merit of property modeling lies in predicting the probability of ionization for metabolites that have yet to be analyzed by MALDI-MS. In non-targeted analyses, on the other hand, the model would help to screen the chemical structures that can plausibly be assigned to a detected peak, even when compounds with similar m/z values are not distinguishable. Furthermore, the expected signal response calculated from the ionization efficiency model would provide insights into the abundance of the compound of interest. As a practical case study, we selected 9-AA as the matrix because it is one of the most frequently used matrices for metabolite analyses by MALDI-MS [4]. MALDI-MS analyses with 9-AA (9-AA-MALDI-MS) have been utilized in various studies, including high-throughput and highly sensitive metabolite analyses [5–8] as well as metabolite MS imaging [9, 10].

First, 200 metabolite standard compounds were selected to cover a wide range of structural diversity and biological importance, and their ionization profiles in MALDI-MS with 9-AA were examined. Second, a quantitative structure–property relationship (QSPR) analysis was performed to model the experimental results using molecular descriptors of the compounds. As hundreds of descriptors were available, the Random Forest method was employed because of its robust applicability to large multivariate data and its unbiased modeling performance [11]. The importance of the descriptors was estimated and discussed with regard to their relevance to the ionizability and ionization efficiency of the compounds.

2 Methods

The detailed methods for the MALDI-MS analysis and QSPR analysis are described in the Supplemental Materials.

2.1 MALDI-MS Analysis of Metabolite Standards

The ionizability and ionization efficiency of each standard compound in MALDI-TOF-MS analysis (AXIMA Confidence, Shimadzu, Japan) were assessed using 9-AA as the matrix. Ionization efficiency was expressed as the limit of detection (LOD) in ppm.

2.2 Summary of the QSPR Analysis

MDL Molfiles of the individual metabolites were acquired from the PubChem website (http://pubchem.ncbi.nlm.nih.gov), using a list of PubChem Compound IDs (CIDs) as the query. The acquired MDL Molfiles were used to calculate molecular descriptors with the PaDEL-Descriptor software program [12]. The types of molecular descriptors included 1D, 2D, and 3D descriptors and fingerprints. Descriptors with zero variance or with 95 % identical values (including NAs) were excluded from the subsequent analysis.
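The descriptor exclusion rule described above can be illustrated with a minimal sketch. It is not the code used in this study (the analyses were performed in R); the function names, the use of `None` to stand for NA, and the reading of the rule as "the most frequent value accounts for 95 % or more of the entries" are assumptions made for illustration.

```python
from collections import Counter


def keep_descriptor(values, max_identical_fraction=0.95):
    """Return True if a descriptor column is informative enough to keep.

    `values` holds one entry per compound; None stands for a missing
    value (NA) and is counted like any other value (an assumption).
    """
    if not values:
        return False
    counts = Counter(values)
    most_common_count = counts.most_common(1)[0][1]
    # Zero variance is the special case in which every entry is identical.
    return most_common_count / len(values) < max_identical_fraction


def filter_descriptors(table):
    """Keep only the informative columns of a {descriptor_name: values} mapping."""
    return {name: col for name, col in table.items() if keep_descriptor(col)}


# Hypothetical descriptor table for 20 compounds.
table = {
    "constant": [1.0] * 20,               # zero variance, dropped
    "nearly_constant": [0] * 19 + [1],    # 95 % identical, dropped
    "mostly_na": [None] * 19 + [2.5],     # 95 % NA, dropped
    "informative": list(range(20)),       # kept
}
kept = filter_descriptors(table)
```

In this sketch a zero-variance column is simply the limiting case in which the most frequent value covers all entries, so a single threshold handles both exclusion criteria.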

The LOD was used as the response variable, which can be considered an inverse measure of the ionization efficiency. In the classification model, the response variable was converted to a categorical value, denoted as ionized or not ionized, according to whether or not the LOD value could be evaluated. In the regression model, from which the not ionized observations were eliminated, the LOD values were expressed as molar concentrations. The relationships between the descriptors and the ionization profiles of the metabolites were modeled using the Random Forest method [11]. The importance of each variable in constructing a model was evaluated as the mean decrease in accuracy. All of the analyses were performed using the R language [13]. Random Forest and decision tree models were constructed with the party package [14]. The accuracy of a prediction model was evaluated based on the correct rate, given as the ratio of the number of correct predictions to the number of metabolites examined. The performance of a regression model was evaluated by Spearman’s rank correlation coefficient between the measured LODs and the predicted values.
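The two evaluation measures described above, together with the conversion of the LOD into a class label, can be sketched as follows. This is an illustrative reimplementation rather than the R code used in this study; the function names and the representation of an undetermined LOD as `None` are assumptions.

```python
def to_class(lod):
    """Class label: 'ionized' if an LOD could be evaluated, else 'not ionized'."""
    return "ionized" if lod is not None else "not ionized"


def correct_rate(observed, predicted):
    """Fraction of compounds whose predicted class matches the observed class."""
    hits = sum(o == p for o, p in zip(observed, predicted))
    return hits / len(observed)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because the rank correlation depends only on the ordering of the values, it is unaffected by any monotonic rescaling of the LOD, such as a logarithmic transformation.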

3 Results and Discussion

First, we investigated the ionizability and ionization efficiency of 200 compounds to clarify the coverage of 9-AA-MALDI-MS for metabolite analysis (Table 1). In the test analysis, 104 out of 200 compounds were detected as deprotonated peaks. The LOD values ranged from 0.00125 to 100 ppm. As chemical diversity defines the applicability of models constructed using the dataset, the taxonomy superclasses of the metabolites in the sample set are summarized in Table 1 (see the Supplemental Materials for the details of the experimental results). Interestingly, distinct ionization profiles were observed even for compounds with similar structures (e.g., alanine and β-alanine, or leucine and isoleucine; Figure 1a, b). In these cases, β-alanine and isoleucine exhibited concentration-dependent peak intensities in MALDI-MS analysis, whereas alanine and leucine were not detected. Generally, structurally similar low-molecular-weight compounds are expected to have similar physicochemical properties. In contrast, these observations strongly indicated that apparent properties of the molecule, such as the presence of functional groups, are insufficient to explain the diverse ionization profiles of the compounds.

Table 1 The Ionization Profiles of Metabolites in the 9-AA-MALDI-MS Analysis and the Predictive Accuracy of the Random Forest Ionizability Models
Figure 1

Distinct ionization profiles of structurally similar compounds in MALDI-MS analysis and their Random Forest prediction. (a) Structural formulas and LODs of four representative compounds with similar structures but distinct ionization profiles. (b) The prediction of ionizability by the 3D descriptor model for whole compounds (gray bar) and by the 3D descriptor amino-acid-specific model (blue bar), represented as the votes of the ensemble trees. When the ratio of positive votes (ionizable) exceeds 50 %, the corresponding compound is predicted to be ionizable.

The physicochemical factors of the metabolites that influenced the ionization profiles were of interest. To address these factors, we performed non-hypothesis-based statistical modeling, in which the determinants of efficient MALDI were sought among the molecular descriptors of the target compounds. First, we constructed a Random Forest QSPR model for the ionizability prediction (ionized or not ionized) using all of the descriptors provided by PaDEL-Descriptor (Global model). The overall accuracy of the prediction was 86.0 %, and there were no significant biases with regard to the estimation error and the metabolite class (Table 1, Global model for whole compounds).

The prediction model was then investigated to estimate the properties that are prerequisite for the ionization of a compound in a 9-AA-MALDI-MS analysis. In the Global model, the descriptors with higher importance represented the electrotopological state for the strength of potential hydrogen bonds and the area of the negatively charged surface (Supplemental Figure S-1a and Supplemental Table S-1). These descriptors belong to the 2D and 3D descriptors, respectively. The electrotopological state value (E-state value) is a kind of 2D descriptor that combines both the electronic characteristics and the topological environment of each skeletal atom in the molecule [15]. The importance of the E-state value indicated that the strength of possible hydrogen bonds was positively correlated with the ionizability in MALDI. This suggested that the ionization profiles were strongly influenced by intermolecular interactions. In addition to the Global model, which incorporated all of the available types of descriptors, the respective types of descriptors were used to construct Random Forest prediction models in order to investigate the relevance of each descriptor type to the prediction performance (Table 1). As a result, the 3D model exhibited the highest performance, followed by the 2D model (91.0 % and 88.5 % accuracy for whole compounds, respectively). Considering the variable importance of these models (Supplemental Figure S-1b, c), although the strength of hydrogen bonds represented the ionization profile well, the information on the charged surface area led to a better ionizability model. This result was reasonable because the charged surface area reflects the electron distribution within the molecule, which should cover the effect of hydrogen bond acceptors. A further function of the negatively charged surface area could be to make proton abstraction more effective in the interaction with the matrix molecule, 9-AA.

The prediction models constructed above exhibited relatively poor accuracy for amino acids (“Amino Acids, Peptides, and Analogues” class), even though they were a major class in our data set. Our models were effective for a broad spectrum of metabolites, but they still lacked the ability to model the rather subtle structural differences among amino acids. This shortcoming could be largely attributed to the relevance of hydrogen bonds. As both the amine and carboxyl groups in amino acids can form hydrogen bonds, the ionizabilities of amino acids could be overestimated. We attempted to improve the prediction performance for amino acids because they are one of the most important classes in metabolite analysis owing to their significant metabolic and regulatory versatility [16]. We thus developed new models specific to amino acids to improve the predictive accuracy and to investigate the relevant structural properties. Again, the models were constructed using all or the individual types of descriptors. As a result, the accuracy of model prediction improved for all types of descriptors (Table 1). In particular, the 3D model achieved a perfect prediction of the ionizability, even for the above-mentioned pairs of structurally similar amino acids (Figure 1b). Fingerprint descriptors still provided only moderate accuracy (86.4 % correct rate for the highest value, obtained by the MACCSFP model), indicating that the presence of substructures was insufficient to fully represent the ionizability of amino acids. Unlike in the class-independent model (whole-data model), the relevant 3D descriptors were not those related to the charged surface areas but Weighted Holistic Invariant Molecular (WHIM) descriptors [17] (Supplemental Figure S-1d). WHIM descriptors provide information about the whole 3D molecular structure in terms of size, shape, symmetry, and atom distribution. This result was intriguing because the shape of the molecule itself, rather than its electronic properties, was relevant. It has been reported that the cation affinities of amino acids are associated with the degree of linearity [18], which is a direct index of the flexibility of a molecule [19]. It was thus suggested that the shape properties of target compounds affect their interactions with other molecules to promote or inhibit their ionization.

The Random Forest method is also applicable to regression, in which the outputs of the decision trees are averaged [11]. The experimentally evaluated ionization efficiency, indicated by the LOD values, was therefore also modeled by the Random Forest method using the individual types of descriptors. While the Global and 3D ionization efficiency models both reached ρ = 0.77 (Supplemental Figure S-2a, b; the variable importance for the Global model is shown in Supplemental Figure S-1e), the best predictive performance was achieved with the 2D descriptors, evaluated as ρ = 0.78 (2D model, Figure 2; the variable importance is shown in Supplemental Figure S-1f). The MACCSFP model was also reasonably accurate, although less so than the 2D, 3D, and Global models (ρ = 0.69, Supplemental Figure S-2b). These results suggested that the fundamental trend of the ionization efficiency was reasonably modeled. The 2D model indicated that the quantitative extent of ionization was mainly associated with the E-state index of double-bonded oxygen and the strength of the potential hydrogen bonds (Supplemental Figure S-1f). Hence, the overall results indicated that a partial negative charge in the molecule could be a prerequisite for ionization, and that an abundance of carbonyl oxygen should be preferable for efficient negative-mode MALDI because of the basic conditions brought about by 9-AA. However, Sun et al. showed that the pH conditions altered the ionized metabolite profiles in a manner specific to the molecular species analyzed [20]. They also reported that multiplexed solvents could be used to optimize the analyte-matrix interaction during co-crystallization [8]. The formation of hydrogen bonds, which could be affected by the pH conditions, might result in specific crystal structures with an advantage in binding energy, leading to distinct MALDI efficiencies. Notably, the structural flexibility of the target compounds might play a special role in their specific interactions with other molecules, presumably the matrix molecules, to reduce ionization energies [21], which would determine their ionization profiles.
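For reference, the way in which the ensemble outputs are aggregated in the two model types can be sketched as follows: a compound is predicted to be ionizable when the ratio of positive votes exceeds 50 %, and the regression output is the average of the tree outputs. This is a minimal illustration only; the function names and the example values are hypothetical, and the individual trees are represented simply by their outputs.

```python
def vote_fraction(tree_votes):
    """Share of trees that vote 'ionizable' (True) for one compound."""
    return sum(1 for vote in tree_votes if vote) / len(tree_votes)


def predict_ionizable(tree_votes, threshold=0.5):
    """Classification: ionizable only if the positive-vote ratio exceeds the threshold."""
    return vote_fraction(tree_votes) > threshold


def predict_lod(tree_outputs):
    """Regression: the ensemble prediction is the mean of the tree outputs."""
    return sum(tree_outputs) / len(tree_outputs)


# Hypothetical outputs of a five-tree ensemble for a single compound.
votes = [True, True, False, True, False]
ionizable = predict_ionizable(votes)          # 3 of 5 positive votes
lod_estimate = predict_lod([1.0, 2.0, 3.0])   # mean of three tree outputs
```

Note that a tie (exactly 50 % positive votes) is treated as not ionizable in this sketch, following the wording "exceeds 50 %".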

Figure 2

The Random Forest regression model for the ionization efficiency in 9-AA-MALDI. The 2D model showed the best performance in terms of the regression for ionization efficiency. The rank correlation coefficient for the plot is indicated as ρ. The models of the other types (Global, MACCSFP, and 3D) can be found in Supplemental Figure S-2.

4 Conclusions

This study was primarily intended to lead to more rational and predictive MALDI-MS analyses. In contrast to empirical approaches, this study presents the first systematic analysis of ionization profiles in 9-AA-MALDI-MS. In a MALDI-MS analysis, the ionizability prediction model evaluates the likelihood of peak identification. The ionization efficiency model, on the other hand, would help to estimate the abundance of a metabolite based on the observed signal intensity. The relevant descriptors found in this study can be interpreted as structural preferences specific to 9-AA and/or negative-mode MALDI-MS analysis. The QSPR approach should also be applicable to other MALDI matrices to characterize the structural properties of target compounds that favor ionization. Such information will play an indispensable role in the strategic development of MALDI-MS-based studies.