Introduction

Electrospray ionization (ESI) tandem mass spectrometry, using ion-trap or beam-type collision cell instruments, is becoming a routine technique for compound identification in metabolomics, proteomics, lipidomics, and forensic science [1,2,3,4,5]. However, accurate identification of compounds from millions of mass spectra is highly challenging for data analysis [6,7,8,9,10,11]. Tandem mass spectral library searching matches acquired tandem mass spectra against reference spectra in libraries. This technique can provide an efficient and reliable method for identifying compounds in omics studies [12,13,14,15,16,17,18]. Nevertheless, the use of MS2 spectra alone for identifying a compound may be insufficient because of the lack of specificity of the product ions. Identification may be difficult when multiple compounds share similar fragments in their MS2 spectra. The use of MSn spectra, in which the fragmentation pathways can be thoroughly probed, can help in distinguishing between similar compounds. The major peaks in an MS2 spectrum are selected as precursors and fragmented to produce their MS3 spectra. The top peaks in these spectra are, in turn, selected to obtain their MS4 spectra. MSn spectra with isomer-specific peaks can be used to distinguish between compounds with the same formula but different structures. For example, mass spectral libraries including MSn were reported to aid identification of some specific types of compounds such as oligosaccharides and plant phenolics [19,20,21,22,23,24].

Simultaneously with ionization, an analyte may fragment within the ESI source, generating in-source fragment ions, which are observed at the same retention time as the target analyte. In-source fragment ions are often observed in various mass spectrometers [25,26,27]. Although in-source fragments are unwanted by-products in MS1, their MS2 spectra can be used to confirm the compound identification and must not be confused with those of an analyte having the same precursor mass and product ions. While MSn spectra are collected in the ion trap and their products are obtained at a single energy, the MS2 spectra of in-source fragments are recorded in the collision cell at a number of collision energies. Together, these two measurements provide more robust identification.

The present measurements are carried out with an Orbitrap Elite mass spectrometer, where both an ion trap and a collision cell are used. Supplementing MS2 spectra of the original compounds with the spectra of their in-source fragments can increase the confidence of metabolite identification in the sample. If an experimental spectrum matches with the spectrum of an in-source fragment, however, it is necessary to determine whether the identified compound really exists in the sample or is formed in the mass spectrometer.

We previously reported the construction of a tandem mass spectral library with multiple precursor ion types (such as [M + H]+, [M + H – H2O]+, [M + Na]+, [M – H]−, and [M + Cl]−) [12]. Herein we report on expanding this library with MS2 spectra of in-source fragments and MSn spectra of the most abundant fragment ions. Furthermore, consensus spectra were generated with a newly developed top-down hierarchical divisive clustering algorithm. These added features provide further signature peaks for metabolites, drugs, sugars, lipids, and other analytes. This is demonstrated by applying the extended library to the identification of E. coli metabolites, including the matching of spectra of in-source fragments.

Experimental

Acquisition of Mass Spectra

Each authentic compound was dissolved in a water:acetonitrile (v:v = 1:1) solvent containing 0.1% formic acid and was diluted to a concentration of ≈1 μmol/L. The solution was infused into an Orbitrap Elite mass spectrometer to acquire several types of spectra. Low-resolution collision-induced dissociation (CID) MS2, MS3, and MS4 spectra were acquired on the ion-trap (LTQ) section of the instrument, and high-resolution MS2 spectra were acquired in the Orbitrap section with fragmentation using either higher-energy collision-induced dissociation (HCD) in the collision cell or CID in the ion trap (IT-FT). Spectra were recorded in the profile mode. The spectra were obtained in a data-dependent mode (using experimental run time of 50 min and exclusion time of 90 s) where MS2 spectra were acquired for all major ions including in-source fragment ions. In the ion-trap measurements, MS3 and MS4 spectra were collected for the top three peaks in the corresponding MS2 and MS3 spectra. The resolution of MS1 and MS2 spectra acquired in the Orbitrap was set at 30,000.

Data Analysis

The procedure of constructing the tandem mass spectral library to include the MS2 spectra of in-source fragments and MSn spectra is summarized in Figure 1.

Figure 1

Procedure of building a tandem mass spectral library to include the MS2 spectra of in-source fragments and MSn spectra

Clustering Precursors of MS2 Spectra

Each acquired MS2 spectrum is associated with a precursor ion with a certain m/z value. To group the MS2 spectra associated with the same precursor ion, all the precursor m/z values for the same compound, acquired under the same experimental conditions, were collected and clustered using a simple density-based clustering algorithm as follows. First, for each precursor ion, the number of precursors whose m/z values fell within 10 ppm of it was counted (defined as the precursor density). Then the precursor with the highest precursor density was selected, and the precursors within 10 ppm of it were grouped with it. In this way, the MS2 spectra with similar precursor m/z values were grouped into the same cluster. This process was repeated until all observed precursor ions were clustered. Similarly, the precursors of MS3 or MS4 spectra, which were acquired in the ion trap and are therefore of lower resolution, were clustered within 0.3 Da from the same cluster of MS2 or MS3 spectra, respectively.
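The density-based grouping described above can be illustrated with a minimal Python sketch. It is not the implementation used to build the library; the function names, the example m/z values, and the choice to recompute densities on the remaining precursors after each cluster is removed are assumptions made for illustration.

```python
def ppm_diff(mz_a, mz_b):
    """Absolute difference between two m/z values, in ppm of mz_b."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def cluster_precursors(mzs, tol_ppm=10.0):
    """Group precursor m/z values around the densest remaining precursor.

    Assumption: densities are recomputed on the precursors that are
    still unclustered after each cluster is removed.
    """
    remaining = list(mzs)
    clusters = []
    while remaining:
        # Precursor density: how many precursors lie within tol_ppm.
        densities = [
            sum(1 for other in remaining if ppm_diff(other, mz) <= tol_ppm)
            for mz in remaining
        ]
        center = remaining[densities.index(max(densities))]
        members = [mz for mz in remaining if ppm_diff(mz, center) <= tol_ppm]
        clusters.append(sorted(members))
        remaining = [mz for mz in remaining if ppm_diff(mz, center) > tol_ppm]
    return clusters


# Hypothetical precursor m/z values, for illustration only.
example_mzs = [300.0000, 300.0010, 300.0020, 300.0100, 450.2000, 450.2005]
example_clusters = cluster_precursors(example_mzs)
```

For the low-resolution MS3 and MS4 precursors, the same loop would apply with the ppm test replaced by an absolute 0.3 Da window.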

The same procedure was used for clustering the precursor ions of in-source fragments, which were formed from the original target precursor. In this case, however, the spectra of the in-source fragment ions were retained only if the peak intensities of the corresponding ions in at least one MS2 HCD spectrum and the IT-FT spectrum of the original target precursor were at least 20% of the base peak.
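As a rough illustration of this retention rule, the following Python sketch checks whether a candidate in-source fragment appears at 20% or more of the base peak in at least one HCD spectrum and in the IT-FT spectrum of the original precursor. The data layout (lists of (m/z, intensity) pairs), the function names, and the use of a 10 ppm matching window are assumptions, not details given in the text.

```python
def relative_intensity(spectrum, target_mz, tol_ppm=10.0):
    """Strongest peak near target_mz as a fraction of the base peak."""
    if not spectrum:
        return 0.0
    base = max(intensity for _, intensity in spectrum)
    if base <= 0:
        return 0.0
    near = [
        intensity
        for mz, intensity in spectrum
        if abs(mz - target_mz) / target_mz * 1e6 <= tol_ppm
    ]
    return max(near) / base if near else 0.0


def retain_in_source_fragment(fragment_mz, hcd_spectra, itft_spectrum,
                              min_rel=0.20):
    """Keep the fragment only if it is a major product ion of the
    original precursor in at least one HCD spectrum AND in the
    IT-FT spectrum (assumed reading of the criterion)."""
    in_hcd = any(
        relative_intensity(s, fragment_mz) >= min_rel for s in hcd_spectra
    )
    in_itft = relative_intensity(itft_spectrum, fragment_mz) >= min_rel
    return in_hcd and in_itft
```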

Clustering MS2, MS3, and MS4 Spectra

Spectra of the precursor ions identified in the above step varied somewhat because of instrument instability. In order to associate all spectra of a single precursor together, a sensitive top-down clustering algorithm was developed to cluster the MS2, MS3, and MS4 spectra and to generate consensus spectra. This procedure is in the class of hierarchical divisive clustering algorithms [28]. Low quality spectra due to random mass shifts or peak intensity fluctuations were eliminated in this step. The basic idea of this algorithm is to first associate all spectra of the same precursor ion in one cluster and then separate them into sub-clusters of similar spectra as follows:

(a) In each spectrum, each peak intensity is normalized to the highest intensity in the spectrum.

(b) Individual spectra are gathered to form a composite spectrum, which may include multiple peaks with identical m/z and intensity values and in which each peak remains linked to its original spectrum.

(c) All the peaks in the composite spectrum are sorted by peak intensity in descending order (retaining the m/z value of each peak).

(d) Starting with the most intense peak in the composite spectrum, all the peaks within 10 ppm (0.3 Da for low-resolution ion-trap spectra) are compared with this most intense peak (if there is more than one most intense peak, the median m/z of these peaks is selected). Any spectrum in which the peaks in this m/z range are much lower than this most intense peak (<20% of its intensity), or which has no peak in this range, is eliminated from the composite. This procedure is repeated for the next most intense peak until the intensity of the query peak is lower than 10% of that of the most intense peak.

(e) Step (d) is applied to the unclustered spectra. This is repeated until no unclustered spectra remain.

(f) For each cluster with at least three spectra, one consensus spectrum is generated. In a similar way as in step (d), starting with the most intense peak, the median m/z and the median intensity of the peaks within 10 ppm (0.3 Da for low-resolution ion-trap spectra) are used as a peak of the consensus spectrum. This procedure is repeated for the next most intense peak until all peaks have been handled.

The best consensus spectrum is chosen for the library based on the quality control procedures described previously [12]. A consensus spectrum for the library is generally generated from 10 to 60 individual spectra.
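Steps (a) to (f) can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the library's actual code: spectra are lists of (m/z, intensity) pairs, query peaks are taken only from spectra still in the current cluster (which guarantees that every pass yields a non-empty cluster), ties between equally intense peaks are resolved by sort order instead of by the median m/z, and all function names are hypothetical.

```python
from statistics import median


def within(mz_a, mz_b, tol, ppm):
    """True if two m/z values agree within tol (ppm or Da)."""
    diff = abs(mz_a - mz_b)
    if ppm:
        return diff / mz_b * 1e6 <= tol
    return diff <= tol


def normalize(spectrum):
    """Step (a): scale intensities so the base peak equals 1.0."""
    top = max(intensity for _, intensity in spectrum)
    return [(mz, intensity / top) for mz, intensity in spectrum]


def split_one_cluster(spectra, ids, tol, ppm, keep_frac=0.20, stop_frac=0.10):
    """Steps (b)-(d): peel one cluster of similar spectra out of ids."""
    # Composite of (intensity, m/z, owner), most intense peak first.
    composite = sorted(
        ((i, mz, s) for s in ids for mz, i in spectra[s]),
        key=lambda p: (-p[0], p[1]),
    )
    cluster = set(ids)
    top_intensity = composite[0][0]
    for intensity, mz, owner in composite:
        if intensity < stop_frac * top_intensity:
            break
        if owner not in cluster:
            continue  # assumption: only surviving spectra supply queries
        for s in list(cluster):
            local = [i for m, i in spectra[s] if within(m, mz, tol, ppm)]
            if not local or max(local) < keep_frac * intensity:
                cluster.discard(s)
    return sorted(cluster)


def divisive_clusters(raw_spectra, tol=10.0, ppm=True):
    """Step (e): repeat step (d) on whatever is still unclustered."""
    spectra = [normalize(s) for s in raw_spectra]
    unclustered = list(range(len(spectra)))
    clusters = []
    while unclustered:
        cluster = split_one_cluster(spectra, unclustered, tol, ppm)
        clusters.append(cluster)
        unclustered = [s for s in unclustered if s not in cluster]
    return clusters


def consensus(raw_spectra, cluster, tol=10.0, ppm=True, min_spectra=3):
    """Step (f): median m/z and intensity per peak group, or None."""
    if len(cluster) < min_spectra:
        return None
    peaks = sorted(
        (p for s in cluster for p in normalize(raw_spectra[s])),
        key=lambda p: -p[1],
    )
    result = []
    while peaks:
        ref_mz = peaks[0][0]
        group = [p for p in peaks if within(p[0], ref_mz, tol, ppm)]
        result.append((median(m for m, _ in group),
                       median(i for _, i in group)))
        peaks = [p for p in peaks if not within(p[0], ref_mz, tol, ppm)]
    return sorted(result)
```

Because each spectrum is compared only with the current query peak, no pairwise similarity matrix is built. For low-resolution ion-trap spectra the sketch would be called with tol=0.3 and ppm=False.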

Identifying Precursor and Product Ions

For high resolution spectra, all potential fragment formulas were listed and their theoretical m/z values were calculated based on the formula of each authentic compound that was analyzed [12]. The precursor candidates and the product ions were identified by matching the m/z values with the theoretical fragment m/z values within 10 ppm. If a precursor m/z matched with more than one precursor ion, the most reasonable one was chosen in subsequent manual inspection, based on the compound structure and fragmentations in the MS2 and MSn spectra.
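The m/z matching step can be illustrated with a short Python sketch. The monoisotopic masses are standard values; the function names, the dictionary layout, and the observed m/z of 152.0570 are assumptions for illustration. Guanosine (C10H13N5O5) is used because its in-source fragment [M + H – C5H8O4]+ is discussed later in the text.

```python
# Monoisotopic masses (u) of the elements used in this example.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
ELECTRON = 0.00054858


def ion_mz(ion_formula, charge=1):
    """Theoretical m/z of an ion given its elemental composition.

    ion_formula already includes adduct atoms, e.g. the extra H of
    [M + H]+; the electron mass is removed (or added) per charge.
    """
    mass = sum(MONO[element] * n for element, n in ion_formula.items())
    return (mass - charge * ELECTRON) / abs(charge)


def match_ions(observed_mz, candidates, tol_ppm=10.0):
    """Labels of all candidate ions within tol_ppm of observed_mz.

    More than one label may be returned; as described in the text,
    such cases are left for manual inspection.
    """
    return [
        label
        for label, mz in candidates.items()
        if abs(observed_mz - mz) / mz * 1e6 <= tol_ppm
    ]


# Candidate ions for guanosine, C10H13N5O5.
guanosine_ions = {
    "[M + H]+": ion_mz({"C": 10, "H": 14, "N": 5, "O": 5}),
    "[M + H - C5H8O4]+": ion_mz({"C": 5, "H": 6, "N": 5, "O": 1}),
}
matched = match_ions(152.0570, guanosine_ions)  # hypothetical observed m/z
```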

Results and Discussion

Clustering and Identifying Precursor Ions

Precursor m/z values are clustered by using a density-based clustering algorithm. For example, 282 precursor candidates were found with different m/z values and abundances for 17β-carboxy-17α-formyloxydexamethasone (M = C22H27FO6) (Figure 2a). They were grouped into 16 clusters.

Figure 2

An example of clustering and identifying precursor ions and their tandem mass spectra. (a) Clustering and identifying precursor ions from a sample steroid. Each square corresponds to a precursor ion with m/z and intensity values. The dotted lines link similar precursors into the same cluster. Fragmentation spectra of the red-colored precursors are retained for the library. The blue-colored precursors are identified but not retained for the library because of low quality of spectra or low abundance. Unlabeled ones are unidentified. (b) MS2 spectra of the primary precursor ion [M + H]+ and the in-source fragment ions from the same compound. The precursors of spectra C and D are in-source fragments, which are also the major product ions (peak intensities >20% of the base peak) in the spectra of [M + H]+ (spectra A and B) acquired from IT-FT and HCD instruments

By comparing the m/z value of each cluster with the theoretical m/z values of the expected precursor ions and all possible in-source fragment ions, 12 precursor ions were identified for subsequent analysis. After quality control procedure and examination of the MS2 spectra of these ions at the various collision energies, only seven ions were found to have spectra of sufficient quality for inclusion in the library. Sample spectra of four of these ions are shown in Figure 2b and three others are shown in Supplementary Figure S1.

Clustering Mass Spectra

The spectra obtained with the Orbitrap Elite mass spectrometer include high-resolution MS2 spectra from the Orbitrap (HCD at different energies and IT-FT) and low-resolution MS2, MS3, and MS4 spectra from the ion trap (IT). These spectra were analyzed with our newly developed top-down hierarchical divisive clustering algorithm, and the resulting consensus spectra were inspected for possible inclusion in the library. When more than one consensus spectrum is obtained, the correct one can generally be chosen based on the relative abundance or the number of replicates. Occasionally, however, a more complex situation is encountered. For example, 31 IT-FT spectra acquired for 16α-hydroxyestrone were clustered into two consensus spectra with 11 and 20 spectra (A and B, respectively, in Figure 3). Spectrum B has a predominant peak at m/z 251.1431 (corresponding to loss of 2H2O from the precursor), whereas in spectrum A this peak is very small. Examination of the HCD spectra at various collision energies shows only one consensus spectrum at each energy, in which the m/z 251.1431 peak was clearly observed (for example, spectrum C in Figure 3). The ion-trap spectrum (D in Figure 3) also shows a peak at m/z 251.1. Therefore, we chose spectrum B for inclusion in the library. A detailed clustering process for this example is described in Supplementary Figure S3.

Figure 3

An example of generating consensus spectra after clustering MS2 spectra by using the top-down hierarchical divisive clustering algorithm. Two consensus spectra (A and B) were generated for the IT-FT spectra, and one consensus spectrum each was generated for the HCD spectra at each collision energy and for the ion-trap spectra

The top-down hierarchical divisive clustering algorithm was tested using the spectra of 4480 authentic compounds that were analyzed on the Orbitrap instrument. The results show that this algorithm can cluster the spectra efficiently and accurately. In contrast to the bottom-up hierarchical agglomerative clustering algorithm, this top-down clustering algorithm clusters similar spectra directly based on the peak intensities, without pairwise comparison. The purpose of clustering is to group similar spectra into the same cluster, so we only need to ensure that the spectra in the same cluster are similar. Hierarchical divisive clustering halts once all similar spectra are grouped. It is, therefore, unnecessary to determine the similarity between each pair of spectra.

MSn Spectra

MS3 and MS4 spectra were obtained in order to provide direct evidence of fragmentation pathways that can be used to distinguish similar compounds. As an example, Figure 4 shows the MS2 spectra of two compounds with the same formula but different structures, α-methylphenylalanine and N-methylphenylalanine. In both cases, the [M + H]+ ion dissociates to form the [M + H – CH2O2]+ ion as the predominant product in the MS2 spectra. However, the MS3 spectra of these product ions show that the [M + H – CH2O2]+ ion produced from α-methylphenylalanine undergoes loss of NH3, whereas the ion from N-methylphenylalanine undergoes loss of CH3 and forms additional product ions as well. The loss of NH3 vs. CH3 can be used to distinguish these two isomeric compounds. MS2, MS3, and MS4 spectra of these two compounds are provided in the library. The MS3 and MS4 spectra are linked to the MS2 and MS3 spectra, respectively.

Figure 4

An example of MS3 ion-trap spectra in the library to distinguish isomers

In-Source Fragmentation

High resolution spectra of in-source fragment ions help identify compounds in a similar manner as the MSn spectra. They provide better understanding of the fragmentation pathways of the compound in the mass spectrometer. Examples of MS2 spectra of in-source fragments from a sample steroid were discussed above (Figure 2b, and Supplementary Figure S1). As another example, the spectra obtained with β-(2-thienyl)alanine are shown in the supplement (Supplementary Figure S2).

The MS2 spectrum of an in-source fragment ion may, in certain cases, be equivalent to a spectrum of another compound, which helps in identifying the in-source fragment. For example, in Figure 5, spectrum A is a consensus spectrum that was generated from 31 individual HCD spectra of an in-source fragment ion of guanosine, [M + H – C5H8O4]+, whereas spectrum B is a consensus spectrum of guanine, which has the same precursor mass as that guanosine fragment. All spectra were acquired at a collision energy of 48 eV. These two spectra are almost identical and have the same quality. This comparison suggests that the in-source fragment of guanosine, [M + H – C5H8O4]+, is guanine. Therefore, the library contains two equivalent spectra, one of the authentic compound and the other of the in-source fragment ion from the larger compound. When an experimental spectrum is searched against the library spectra and guanine and guanosine are found at the same elution time, both spectra may have resulted from guanosine. Guanine can then be identified only if it is found at a different elution time or if guanosine is not present. Thus, spectra of in-source fragments in the library not only help identify compounds but also provide information for tuning the instrument to reduce in-source fragmentation and for identifying the highly duplicated “unknown” spectra that arise from in-source fragmentation.
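The elution-time reasoning above can be expressed as a small Python sketch. It is illustrative only: the function name, the data layout, the retention times, and the 0.1 min co-elution window are assumptions and are not taken from the library or from the NIST MS Search program.

```python
def flag_possible_in_source(hits, fragment_parents, rt_tol=0.1):
    """Mark hits that may be in-source fragments of a co-eluting parent.

    hits: list of (compound_name, retention_time_min) identifications.
    fragment_parents: maps a compound to the set of larger compounds
        whose in-source fragment has an equivalent library spectrum.
    rt_tol: assumed co-elution window in minutes.
    """
    flagged = []
    for name, rt in hits:
        parents = fragment_parents.get(name, set())
        co_eluting = any(
            parent in parents and abs(rt - parent_rt) <= rt_tol
            for parent, parent_rt in hits
        )
        flagged.append((name, rt, co_eluting))
    return flagged


# Hypothetical identifications and retention times, for illustration only.
example_hits = [("guanosine", 5.20), ("guanine", 5.22), ("guanine", 2.10)]
example_flags = flag_possible_in_source(
    example_hits, {"guanine": {"guanosine"}}
)
```

In this hypothetical run, only the guanine hit that co-elutes with guanosine is flagged; the guanine hit at a different elution time is not.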

Figure 5

Spectral comparison of in-source fragment from guanosine and authentic compound guanine

An Extended Tandem Mass Spectral Library and Its Application in the Identification of Unknowns

A total of 4480 compounds have been examined with the above methods. The numbers of spectra obtained for the [M + H]+ and [M – H]− ions were 53,880 and 15,261, respectively, and an additional 46,892 spectra were collected for other common precursor ions described previously [12]. By using the new processing method described in this paper, 51,960 MS2 spectra of 1476 new precursor ions for in-source fragments and 40,061 MSn spectra were identified and confirmed for these compounds. The spectral quality was ascertained by manual inspection and by following the NIST tandem mass spectral library quality control procedures [12]. The library was exported to an sdf-format file, including the annotated peak list, metabolite structure in molfile format, chemical name, synonym(s), formula, CAS number, precursor m/z, precursor ion, instrument type, etc. for each consensus spectrum. This comprehensive mass spectral library contains 652,475 high-quality reference spectra obtained from 15,243 authentic compounds, including the newly added MS2 spectra of in-source fragments and MSn spectra. The compounds include metabolites (of these, ~3500 are human metabolites), lipids, sugars, glycans, drugs, pesticides, surfactants, and other small molecules. The library also contains 84,446 MS2 spectra of all dipeptides, all tryptic tripeptides, and many bioactive peptides.

The extended library was tested by searching the data sets from E. coli extracts that were run on the Q-Exactive mass spectrometer using the NIST MS Search (ver. 2.2) program. For example, 527 experimental MS2 spectra were found to match the spectrum of oxidized glutathione (Figure 6a). In addition, 215 MS2 spectra obtained at the same elution time matched the spectrum of the in-source fragment [M + H – C5H7NO3]+ from this compound (Figure 6b). This in-source fragment identification confirms the identification of oxidized glutathione. However, this result also showed that the substructure was generated in the mass spectrometer, and was not present in the original sample. Searching the library, including the spectra from in-source fragments, helps users to confirm compound identification and avoid misidentifying in-source fragments as components.

Figure 6

An example of searching the NIST Tandem Mass Spectral Library for identifying E. coli metabolites. (a) Identifying a metabolite, oxidized glutathione. (b) Identifying its in-source fragment. In both cases, the top spectrum is an experimental spectrum and the bottom one is a reference spectrum in the library. The scores for these matches were 951 and 874, respectively (max = 999)

Conclusion

We describe a top-down hierarchical divisive clustering algorithm that can associate the MS2, MS3, and MS4 spectra of original precursor ions and of their in-source fragment ions efficiently and accurately. A density-based clustering algorithm was used for clustering precursor ions, which were subsequently identified by matching with the theoretical m/z values calculated from the formula of each analyzed authentic compound. We applied these algorithms to extend the tandem mass spectral library to include the spectra of in-source fragments and MSn spectra. The extended library can be used for more accurate compound identification by searching the spectra of in-source fragment ions and MSn spectra.