Introduction

Mass spectrometry is a standard technique for the analytical investigation of molecules and complex mixtures. It is important in determining the elemental composition of a molecule and in gaining partial structural insights using mass spectral fragmentations. The final structure confirmation of an unknown organic compound is always performed with a set of independent methods such as one- (1D) and two-dimensional (2D) nuclear magnetic resonance spectroscopy (NMR), infrared spectroscopy, X-ray crystallography, and other spectroscopic methods. The term structure elucidation usually refers to full de novo structure identification, and it results in a complete molecular connection table with correct stereochemical assignments. Such an identification process without any assumptions or pre-knowledge is commonly the domain of nuclear magnetic resonance spectroscopy. The term dereplication often refers to the rediscovery of known natural products by means of mass spectral library search or the interpretation of known mass spectral fragmentations.

Scope of this review

This review investigates theoretical and experimental structure elucidation techniques using mass spectrometry for organic molecules with a molecular mass less than 2,000 Da. The review covers newer techniques within the last 10–15 years; if none were available, then older material was included. Hyphenated separation techniques (gas chromatography coupled to mass spectrometry (GC-MS) and liquid chromatography coupled to mass spectrometry (LC-MS)) are covered due to the close relationship of those techniques with mass spectrometry. Detailed proteomics and peptide sequencing strategies along with the structure elucidation of large biomolecules, such as RNA, DNA, and oligosaccharides/glycans, are outside the scope of this review. The term “small molecules,” used throughout this review, thus refers to all small molecules excluding peptides. Approaches for inorganic mass spectrometry as well as elemental and organometallic analysis are only sparsely covered.

Mass spectral instrumentation and ionization techniques

The history of commercial mass spectrometry instrumentation covers more than 40 years. Brunnee covers the principles of common mass analyzers in a vibrant 1987 review [1]. Gelpi discusses over 130 different mass spectrometers built since 1965 in a series of two reviews [2, 3]. Only one totally new mass spectrometer type, the Orbitrap analyzer [4, 5], has been developed lately. Nevertheless, many new hybrid approaches, among them ion mobility coupled to time-of-flight (TOF) mass spectrometers, have been introduced to the market recently. A series of ionization techniques and figures of merit for mass spectrometers will be discussed in the following paragraphs.

Soft and hard ionization techniques

Electron ionization (EI) at 70 eV is historically seen as the oldest ionization technique for small-molecule investigations. Because of the selected constant ionization energy, this technique results in consistent and fragment-rich mass spectra. These mass spectra can be easily used for a mass spectral library search. Electron ionization is commonly used for GC-MS setups. A major disadvantage of mass spectra obtained under EI conditions is the low-abundance or missing molecular ion. An abundant molecular ion, however, is needed for the calculation of elemental compositions. Chemical ionization (CI) is a soft ionization technique mostly used in GC-MS setups to obtain molecular ion information [6, 7]. Supersonic molecular beam interfaces provide the ability to obtain fragment-rich electron ionization spectra together with abundant molecular ions [8].

The introduction of electrospray ionization (ESI) [9, 10] was a major breakthrough for the analysis of intact and large biomolecules. ESI is now the ionization method of choice for LC-MS in many laboratories worldwide [11]. Additionally, nanoelectrospray (nanoESI) [12] and chip-based nanoelectrospray ionization have been advanced during recent years [13–18]. The infusion of nanoliters of solvents using nanoESI allows for sustained analysis over long sample times with a minimum of sample material and increased sensitivity. These long infusion times are needed for structural identifications from data-dependent MSn fragmentations obtained by ion trap mass spectrometers. The use of a new spray nozzle for each injection prevents cross-contaminations (see Fig. 1) especially when multiple compounds are infused from 396 well plates. Recently, multi-nanoelectrospray emitters (nanoESI) have been developed, which may further enhance ion production and increase the dynamic range (see Fig. 2) [19, 20].

Fig. 1

Chip-based nanoelectrospray allows for sensitive and contamination-free mass spectral infusions (photo by Tobias Kind/FiehnLab)

Fig. 2

Nanoelectrospray emitter array for enhanced sensitivity of electrospray ionization mass spectrometry (reproduced with permission from Keqi Tang and Richard D. Smith/Pacific Northwest National Laboratory)

Atmospheric pressure chemical ionization (APCI) [21–24], atmospheric pressure photoionization (APPI) [25–28], and matrix-assisted laser desorption/ionization (MALDI) [29–31] are matured soft ionization techniques. Field desorption and field ionization are also soft ionization techniques, and both produce abundant molecular ions with few fragment ions [32–34]. Direct analysis in real time (DART) [35] is an ambient ionization technique [36] and allows for the real time analysis of the sample. The DART source has been widely used in “open access/walk-up” laboratories together with robotic sample handling [37]. Techniques for sampling molecules from surfaces have been extensively reviewed as well [38]. Secondary ion mass spectrometry (SIMS) and MALDI are used for mass spectrometric imaging [39], a new and exciting technology to gain spatial and structural insights from tissues and organs [40–42]. Several new surface-based ionization techniques including desorption electrospray ionization [43], desorption ionization on silicon, and nanostructure-initiator mass spectrometry [44] have been developed recently.

Multi-mode or simultaneous ion sources [45–48], as well as the pulsed and parallel use of different ionization techniques [49–52], are helpful to shorten analysis time and to obtain structural information from a wide range of different substance classes [53–55] (see Fig. 3). Although simultaneous positive and negative polarity switching is available within many ion source designs, the commercialization of dual- or multi-mode ion sources applying different ionization techniques is a more recent development [56–58].

Fig. 3

Coverage of molecule classes with different ionization methods (reproduced with permission from Oxford University Press [55])

Figures of merit of mass spectrometers

Mass spectrometers are typically designed for specific analytical aims: ion trap mass spectrometers as versatile instruments, quadrupole mass spectrometers as general work horses, triple quadrupole mass spectrometers as very sensitive instruments for targeted analysis, and Fourier transform instruments for measurements requiring high resolving power and high mass accuracy [59]. In addition to their technical and instrumental design, mass spectrometers can be classified using specific figures of merit [60, 61] (see Table 1). These figures of merit are combinations of hardware, software, and customer experience indicators.

Table 1 Important figures of merit for modern mass spectrometric systems

High mass resolving power is needed to resolve overlapping interferences by mass spectrometry only [62–64] (see Fig. 4). Up to one million resolving power can be achieved routinely with current commercially available Fourier transform ion cyclotron resonance (FT-ICR-MS) instruments [65]. A series of “world records” achieved by FT-ICR-MS [66] has been recorded. Hybrid instruments especially allow for the acquisition of high-resolution tandem mass spectra [67, 68] used for natural product structure elucidation. One drawback of FT-ICR-MS and Orbitrap instruments is the higher cycle time to acquire high-resolution broad band mass spectra [69]. At one million resolving power (FT-ICR-MS), a single scan can take 2 s or longer. New high-field Orbitrap analyzers can now reach resolving power in excess of 350,000 at m/z 524 (full width at half maximum) [70]. Modern TOF and Q-TOF instruments are routinely capable of higher than 10,000 mass resolving power with the latest instruments reaching up to 40,000 resolving power [71]. When coupled to ultra performance liquid chromatography and comprehensive two-dimensional GC×GC [72–74], the data acquisition rate (scan speed) and duty cycle of the mass selective detector are very important. The chromatographic peak width can be around 2–5 s or lower, and there needs to be enough time to perform additional data-dependent tandem mass spectra (MS/MS) or MSn scans [75, 76]. Several new TOF and hybrid quadrupole-TOF and iontrap-TOF mass analyzers have been introduced into the market to obtain accurate masses at the MSn level at a very high data acquisition rate [77, 78]. Additionally, new generation benchtop electrospray ionization time-of-flight analyzers can reach sub-ppm mass accuracy under routine conditions [79]. High mass accuracy together with high isotopic abundance accuracy is generally important to obtain only few molecular formula candidates from an accurate mass measurement [80, 81].
For structure elucidation purposes, the ability to perform multiple-stage MSn experiments is the most important feature to obtain additional structural information from small molecules [82, 83]. The ability to obtain tandem mass spectra under positive and negative ionization in a single run [84] can speed up the identification process of unknown chemicals [85–87]. Low machine maintenance and high robustness of the instrument operating under different temperatures and humidity ranges in high-throughput manner are additional important aspects. The software as one of the cornerstones for successful compound identification is just as important as the instrument itself. Fast software bug fixes, uncomplicated software updates, easy-to-use graphical user interfaces, and responsive software support are sometimes more important than certain instrument parameters. Documented software interfaces that allow programmers to access certain software functions and the support of open mass spectral exchange formats (netCDF, mzXML, and mzData) are equally important and discussed later in the article.

Fig. 4

The importance of mass resolving power showing a high-resolution FT-ICR-MS spectrum with lower resolution Q-TOF mass spectrum. Only the high-resolution instrument can resolve peaks with 0.0112 Da difference (reproduced with the permission from Ref. [63])
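
The resolving power needed to separate two neighboring peaks can be estimated from the common definition R = m/Δm. The following is a minimal sketch of that arithmetic; the function name and the m/z value of 400 are illustrative assumptions and are not taken from Fig. 4.

```python
def required_resolving_power(mz: float, delta_mz: float) -> float:
    """Return R = m / delta_m, the resolving power at which two peaks
    near `mz` that differ by `delta_mz` are just separated."""
    if delta_mz <= 0:
        raise ValueError("delta_mz must be positive")
    return mz / delta_mz


# Hypothetical doublet near m/z 400 with the 0.0112 Da spacing quoted above.
# The result is roughly 35,700, well above the 10,000 typical of older TOF systems.
print(round(required_resolving_power(400.0, 0.0112)))
```

Because the required resolving power scales linearly with m/z, the same 0.0112 Da doublet is harder to resolve at higher masses.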

Tandem mass spectrometers and modes of operation

Ion trapping instruments such as quadrupole ion traps and FT-ICR mass spectrometers can be used to create tandem mass spectra, and multiple-stage MSn experiments can be performed without instrument modification or couplings of different mass analyzers [88]. Other hybrid instrument types are discussed in Ref. [2, 3]. Orthogonal or hybrid mass spectrometers are favorable for structural elucidation because they either increase the total peak resolution or they introduce another separation dimension that can be used either to trigger or acquire additional mass spectrometric information [89, 90]. The different modes of operation, which include precursor ion scans, product ion scans, neutral loss scans and selected reaction monitoring, are discussed in De Hoffmann [91]. The MS/MS and MSn scans are usually triggered via data-dependent setups. Multiple precursor ions can be manually selected or the software can acquire tandem mass spectra when a certain peak abundance or signal/noise ratio is exceeded. For example, electrospray ionization with ion mobility mass spectrometry coupled to time-of-flight mass spectrometry (ESI-IMMS-TOF-MS) was used for metabolic profiling of Escherichia coli metabolites [92], phospholipid [93], and drug analysis [94].

Ion activation modes

Collision-induced dissociation (CID), or collisionally activated dissociation, is the most common technique to obtain tandem mass spectra. Precursor ion stability and internal energy under CID have been previously discussed [95]. A series of new fragmentation modes are aimed at improved protein and peptide identification rates by creating more specific fragmentations. These modes include electron capture dissociation (ECD) [96–98], electron transfer dissociation [99–101], and infrared multiphoton dissociation [102]. They are not fully exploited yet for small-molecule applications outside proteomics.

Two-dimensional, three-dimensional, hybrid, and orthogonal chromatographic approaches

Multiple dimension setups are possible on the chromatographic and mass spectrometric sides. On the chromatography side, the usual aim is directed at increasing the peak resolution, which therefore provides a better separation of overlapping compound peaks. The peak capacity can be increased by using different selective chromatographic phases in a two-dimensional or multi-column setup. These approaches are known for liquid chromatography and prominently used for protein identification by coupling an ion exchange column together with a reversed phase column, which coined the term multidimensional protein identification technology [103]. The difference between simple two-dimensional connections such as GC-GC compared with truly orthogonal approaches such as comprehensive two-dimensional GC (GC × GC) [104] lies in the fact that a modulator is used to accumulate parts of the sample from the first column and pulse the sample to the second shorter column with a different polarity of the stationary phase [105]. The detector must be a fast scanning detector with a high acquisition rate and an example of this is a time-of-flight mass analyzer. Sampling rates are usually between 100 and 200 spectra per second for GC×GC-TOF-MS [106] instruments. The resulting mass spectra have a very high signal to noise ratio and therefore represent cleaner mass spectra and give better mass spectral library search scores [107]. Miniaturization and the introduction of chip-based liquid chromatography [108] play a major role in high-throughput methods.

Mass spectral data handling

The following section discusses basic steps that have to be performed to obtain clean and background free mass spectra. Charge state deconvolution, accurate mass measurements, and software algorithms for elemental composition calculations are reviewed. Certain hardware specific setups are discussed when required.

Background and noise subtraction

Automatic background and noise subtraction are standard techniques to obtain clean and interference-free mass spectra. The Biller–Biemann algorithm [109] or similar algorithms by Dromey et al. [110] have been in use for more than 30 years. It is generally advisable to perform blank or solvent runs to obtain possible noise or contamination data. These infusion mass spectra or complete LC-MS and GC-MS runs must be subtracted from the real sample data [111–113]. Most modern mass spectrometry software tools have inbuilt algorithms to perform these tasks. Many of the mentioned algorithms have been developed for EI (70 eV) mass spectra. Several approaches have been introduced with the CODA algorithm of Windig et al. [114] for ESI and LC-MS data, and similar methods have been applied in drug discovery studies [115–117]. A more detailed discussion about automated approaches is covered in the mass spectral deconvolution and biotransformation sections.

Adduct formation and detection

Ionization techniques such as CI, MALDI, ESI, or APCI show not only single adduct ions but also sets of multiple adducts [118, 119]. The process of adduct formation can be studied using heuristic and computational methods [120, 121]. Solvent and buffer constitution, pKa, pH, substance proton donor and acceptor properties, and gas-phase acidities influence the formation of adducts [122, 123]. Different adducts also can result in different fragmentation pathways [124]. The correct adduct ion must be detected in order to obtain the accurate mass of the neutral molecule. One possible solution is to increase the concentration of specific ions in the liquid phase [125] to obtain preferably those adducts. When analyzing lipids, lithium is used as modifier [126] to obtain characteristic [M+Li]+ ions. An extended list of common electrospray adducts, including [M+H]+, [M+NH4]+, [M+Na]+, and [M−H]−, has been prepared [127]. In case of MALDI, metal cation adducts [M+Na]+ and [M+K]+ are often observed [29, 128]. Software tools such as CAMERA [129] and IntelliXtract [130], and tools for infusion spectra [131] can help detect adduct ions in mass spectra automatically. Currently, no software exists that can predict adduct probabilities based on a given compound structure for a specified ionization mode (CI, ESI, APCI, and APPI).
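
The relationship between adduct ions and the neutral mass can be illustrated with a short sketch. It is not the method used by CAMERA or IntelliXtract; it simply looks for pairs of positive-mode peaks that imply the same neutral monoisotopic mass. The function names, the tolerance, and the example peak list are assumptions made for illustration.

```python
from itertools import combinations

# Monoisotopic mass shifts (Da) of singly charged positive adducts relative
# to the neutral molecule M (atomic mass of the cation minus one electron).
POSITIVE_ADDUCTS = {
    "[M+H]+": 1.007276,
    "[M+Li]+": 7.015455,
    "[M+NH4]+": 18.033826,
    "[M+Na]+": 22.989221,
    "[M+K]+": 38.963158,
}


def neutral_mass(mz, adduct):
    """Neutral monoisotopic mass implied by a singly charged adduct ion."""
    return mz - POSITIVE_ADDUCTS[adduct]


def find_adduct_pairs(mzs, tol=0.005):
    """Return (mz_a, adduct_a, mz_b, adduct_b) for every pair of peaks whose
    adduct assignments point to the same neutral mass within `tol` Da."""
    pairs = []
    for mz_a, mz_b in combinations(sorted(mzs), 2):
        for ad_a in POSITIVE_ADDUCTS:
            for ad_b in POSITIVE_ADDUCTS:
                if ad_a == ad_b:
                    continue
                if abs(neutral_mass(mz_a, ad_a) - neutral_mass(mz_b, ad_b)) <= tol:
                    pairs.append((mz_a, ad_a, mz_b, ad_b))
    return pairs


# Hypothetical peak list: a hexose (M = 180.0634) seen as [M+H]+ and [M+Na]+.
print(find_adduct_pairs([181.0707, 203.0526, 250.0]))
```

A real tool would additionally use chromatographic co-elution and isotope patterns before accepting such an assignment.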

Charge state deconvolution

Charge state determinations play an important role in proteomics [132, 133] but are less frequently required in small-molecule investigations [132]. Many small organic molecules are usually singly charged. Certain molecule classes, such as cardiolipins, may occur as singly and doubly charged ions. The occurrence of multiply or doubly charged ions can be influenced by buffer concentration, analyte concentration, amount of organic modifier, or flow rate [134, 135]. Open-source software tools, such as Decon2LS [136], exist (see Fig. 5), which can automatically determine charge states. Most vendor mass spectrometry software has charge state determinations included.
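
One simple way to see how a charge state can be read from a spectrum is the spacing of the isotope peaks, which is about 1.00336/z on the m/z axis. The sketch below illustrates only this idea and is not the algorithm implemented in Decon2LS; the function name and the example values are assumptions.

```python
C13_C12_DIFF = 1.003355  # mass difference between 13C and 12C in Da


def charge_from_isotope_spacing(mz_mono, mz_next, max_charge=10):
    """Estimate the charge state z from the m/z spacing between the
    monoisotopic peak and the first isotope peak (spacing ~ 1.00336 / z)."""
    spacing = mz_next - mz_mono
    if spacing <= 0:
        raise ValueError("mz_next must be larger than mz_mono")
    z = round(C13_C12_DIFF / spacing)
    if not 1 <= z <= max_charge:
        raise ValueError("spacing does not correspond to a plausible charge state")
    return z


# Hypothetical isotope peaks: 0.5 Th apart means a doubly charged ion.
print(charge_from_isotope_spacing(500.0000, 500.5017))  # 2
print(charge_from_isotope_spacing(500.0000, 501.0034))  # 1
```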

Fig. 5

Charge state deconvolution with the freely available software Decon2LS (reproduced with permission from Ref. [136])

Accurate mass measurements

Accurate masses and isotope abundances are reported in an IUPAC report [137]. The statistical evaluation of measured mass accuracies should include the proper terminology and basic statistic tests [138]. An intercomparison study from 45 laboratories [139] showed that FT-MS and magnetic sector field instruments in peak matching mode routinely achieved less than 1 ppm mass accuracy. Quadrupole-TOF, TOF, and magnetic sector field instruments in magnet scan mode achieved between 5 and 10 ppm. Newer publications reported that time-of-flight instruments can reach around 1 ppm [140] or even sub-ppm [79] mass accuracies. Orbitrap technology in hybrid mode can routinely reach sub-ppm mass accuracy and in non-hybrid mode less than 2 ppm [141]. The importance of the inclusion/exclusion of the electron mass during accurate mass measurements was discussed in Ref. [142], and the impact on LC/TOF-MS mass accuracy was further outlined in Ref. [143]. A mass error of up to 3 ppm was reported if the electron mass is not included in calculations. The mass error introduced by this calculational error can be as high as 5 ppm at 100 m/z (see Fig. 6). The red line marks 300 ppb, which can be obtained from broadband FT-ICR-MS experiments. The current accurate electron mass is reported as: m(e−) = 0.00054857990924 u [144]. A recent approach used the ubiquitous presence of background ions to correct for small mass errors, and this was also used for accurate peak alignment and internal mass calibration [145]. The reported mass error on a LTQ-Orbitrap dropped from ±1–2 ppm to an absolute median error of 0.21 ppm. Another research article discussed a computational method to adjust for mass errors outside the lock mass range and intensity and reported error improvements from 20 down to 1 ppm [146]. The process of selected ion monitoring (SIM) stitching was investigated [147, 148]. 
The authors concluded that an average mass error of 0.18 ppm could be obtained routinely on a high-resolution FT-ICR mass spectrometer. If instruments are uncalibrated or out of tune, then an automated post-calibration routine [149] can be used to remove systematic precursor mass errors. The authors reason that in case of sample overload, the automatic gain control system (AGC) is not able to control the optimal number of ions to inject into the Orbitrap cell, which finally results in space charge effects causing noticeable systematic mass errors.
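
The size of the electron mass effect follows directly from the definition of the relative mass error in ppm. The following sketch reproduces that arithmetic using the electron mass quoted above; the function names and the example m/z values are assumptions made for illustration.

```python
ELECTRON_MASS = 0.00054857990924  # u, value quoted in the text


def ppm_error(measured_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def electron_neglect_error_ppm(mz):
    """Error (ppm) introduced for a singly charged ion when the mass of
    one electron is left out of the theoretical m/z calculation."""
    return ELECTRON_MASS / mz * 1e6


for mz in (100.0, 500.0, 1000.0):
    print(mz, round(electron_neglect_error_ppm(mz), 2))
```

The error is about 5.5 ppm at m/z 100 and falls to about 0.55 ppm at m/z 1,000, so it matters most for small ions measured on instruments with sub-ppm accuracy.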

Fig. 6

A mass error of up to 5 ppm is the penalty if the electron mass is not accurately included in accurate mass calculations. The lower red line marks 0.3 ppm mass accuracy, which can be reached by FT-ICR-MS

Higher mass accuracy on unit mass resolution instruments can be obtained using post-processing peak shaping algorithms as implemented in the MassWorks software (Cerno Biosciences) [150, 151]. These algorithms use an internal calibrant that is later used for post-calibration of mass accuracy errors. Unit resolution mass spectrometers (inaccurate mass spectrometers) can be converted into accurate mass spectrometers as long as mass spectral data are obtained in profile mode, which is required to perform the spectral peak shape correction. If data are obtained in centroid mode or stick mode, then no such post-correction can be performed. A correction for spectral accuracy can also be performed with high-resolution data [152]. Artificial neural network calibration [153] in conjunction with AGC and better peak centroiding can improve the mass accuracy on FT-MS instruments to reach 100 ppb for certain experiments [154].

Several unit mass resolution instruments, including ion traps and triple quadrupole instruments [155], allow a hardware-based high-resolution or an ultra-zoom scan [156]. This zoom scan can be used for accurate mass measurements or better charge state assignments. The resolving power usually can be increased by one order of magnitude, for example from 1,000 to 10,000. However, the m/z scan range is usually very limited, and the duty cycle is high for enhanced resolution scans.

Isotope abundance measurements and isotopic pattern calculations

The isotopic abundances of common monoisotopic (F, Na, P, and I) or polyisotopic (H, C, N, O, S, Cl, and Br) elements are listed [137]. Isotopic abundances are measured and have been utilized in mass spectrometric measurements since the beginning of mass spectrometry [157]. The most sensitive and accurate method for isotopic abundance measurements is accelerator mass spectrometry [158], and this method is used for age determination, forensics, and food monitoring [159]. Its precision is around 0.05% for the measurement of the 13C/12C ratio [160] requiring total combustion of the sample. The availability of commodity mass spectrometers delivering isotopic abundance errors less than ±5% was utilized for LC-MS-based screening approaches [161–164] and environmental screening applications [165–167].

To filter or match elemental compositions according to their experimental isotopic abundances, the high- or low-resolution isotopic envelopes of molecular formulas must be calculated. Several algorithms have been proposed to calculate the isotopic fine structures and allow the modeling of Gaussian peak shapes according to the selected resolving power of the instrument. Several of the algorithms implement either polynomial-based methods or Fourier transform-based methods (IsoDalton, MWTWIN, Mercury, IsotopeCalculator, IsoPro, emass/qmass, libmercury++, ISOMABS, and Decon2LS) [168–171]. Isotopic abundances from tandem mass spectra and multiple-stage MSn can yield additional information that can help during the structure elucidation process [172–174].
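
The polynomial-based idea can be sketched in a few lines: the isotope distribution of each atom is treated as a polynomial, and the pattern of the molecule is the product (convolution) of all atomic polynomials. The sketch below works at unit mass resolution only, does not model isotopic fine structure or peak shapes, and is not the implementation of any of the tools named above. The abundance values are rounded natural abundances, and the function names are assumptions.

```python
# Natural isotopic abundances indexed by nominal mass offset
# (index 0 is the lightest isotope, index 1 is one neutron heavier, ...).
ISOTOPES = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425, 0.0, 0.0001],
}


def convolve(a, b):
    """Multiply two abundance polynomials."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def isotope_pattern(formula, n_peaks=5):
    """Unit-resolution isotope pattern, normalized to the highest peak.
    `formula` maps element symbols to atom counts, e.g. {"C": 6, "H": 12, "O": 6}."""
    pattern = [1.0]
    for element, count in formula.items():
        for _ in range(count):
            # Truncation keeps the list short and does not affect lower peaks.
            pattern = convolve(pattern, ISOTOPES[element])[:n_peaks]
    top = max(pattern)
    return [p / top for p in pattern]


# Example: C6H12O6; the M+1 peak is close to 6.9% of the monoisotopic peak.
print([round(p, 4) for p in isotope_pattern({"C": 6, "H": 12, "O": 6})])
```

Comparing such a calculated pattern with the measured M+1 and M+2 abundances is the basis of the isotopic abundance filters discussed in this section.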

Elemental composition determination

The determination of the molecular formula or elemental composition requires a clean mass spectrum with no interfering noise or coeluting compounds. A process for elemental composition determination from electrospray data was described in Ref. [175]. The algorithm includes a decision making step for proton and alkali metal adducts, automated determination of charge states and overlapping peaks, and an isotopic pattern matching. It was validated with 220 pharmaceutical compounds and yielded a success rate of 90%. Isotope-enriched metabolites can be investigated using a method that includes spectral correlation methods along with mass accuracy and isotope ratio filters [176]. Another software tool uses isotopic abundance ratios to confirm or reject NIST mass spectral library search results [177]. A series of papers discusses the process of isotopic pattern matching for elemental formula determination in environmental chemistry [165–167], metabolic profiling experiments [178, 179], and geochemistry [180, 181]. The freely available software SIRIUS (Sum formula Identification by Ranking Isotope patterns Using mass Spectrometry) [182] has a user-friendly graphical interface and can be used on LINUX, MAC, and Windows platforms. The newer implementation “SIRIUS Starburst” also includes features such as peak intensity, number of hetero atoms in the molecular formula, neutral losses, and tandem mass spectral information [183].

The Seven Golden Rules [81] are a set of heuristic rules for elemental composition calculations, including the Senior and Lewis rules, element ratio rules, and an isotopic abundance matching filter. The rules were developed with a set of 68,237 existing elemental compositions and validated with 6,000 molecular formulae by means of an internal database of 432,968 existing elemental compositions. The freely available software was used to calculate the molecular formula space (elements CHNSOP; <2,000 u) covering more than eight billion elemental compositions, and it was deduced that only 623 million elemental compositions are highly probable (see Fig. 7).
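
How heuristic filters shrink the list of formula candidates can be illustrated with a small brute-force sketch. It is not the Seven Golden Rules software: it covers only the elements C, H, N, and O, applies a ring-and-double-bond check and an H/C ratio window, and omits the isotopic abundance filter. The element limits, the ratio bounds, and the function name are illustrative assumptions.

```python
# Monoisotopic masses (u) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def candidate_formulas(neutral_mass, ppm=5.0, max_c=20, max_n=5, max_o=10):
    """Enumerate (C, H, N, O) atom counts whose monoisotopic mass matches
    `neutral_mass` within `ppm` and that pass two simple heuristic checks."""
    tol = neutral_mass * ppm * 1e-6
    hits = []
    for c in range(1, max_c + 1):
        for n in range(max_n + 1):
            for o in range(max_o + 1):
                rest = neutral_mass - c * MONO["C"] - n * MONO["N"] - o * MONO["O"]
                if rest < -tol:
                    break  # more oxygen atoms only overshoot further
                h = round(rest / MONO["H"])
                if h < 0 or abs(rest - h * MONO["H"]) > tol:
                    continue
                # Ring and double bond equivalents (RDBE = C - H/2 + N/2 + 1)
                # must be a non-negative integer for a neutral molecule.
                twice_rdbe = 2 * c - h + n + 2
                if twice_rdbe < 0 or twice_rdbe % 2:
                    continue
                # Illustrative element ratio window for hydrogen to carbon.
                if not 0.2 <= h / c <= 3.1:
                    continue
                hits.append((c, h, n, o))
    return hits


# Example: neutral monoisotopic mass of a hexose, C6H12O6 (180.0634 u).
print(candidate_formulas(180.063388))
```

Widening the mass tolerance or adding more elements increases the number of candidates quickly, which is why additional filters such as isotopic pattern matching are needed in practice.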

Fig. 7

The molecular formula space below 2,000 Da (elements CHNSOP) covers more than eight billion elemental compositions and can be reduced to 600 million highly probable molecular formulas using the Seven Golden Rules [81]

The influence of spectral accuracy of molecular ions on elemental composition calculations was investigated on a high-resolution mass spectrometer [184]. The automated correction of isotope pattern abundance errors using peak shaping and correction algorithms resulted in better identification rates of the molecular formulas. An algorithm for isotopic pattern calculation that includes stable isotope markers (13C and 15N labeled) was developed [185]. Recently, an approach was developed that uses elemental formula calculations with database lookup and a subsequent in silico generation of CID mass spectra from the obtained isomer structures [186]. The obtained in silico tandem mass spectra (calculated by MassFrontier) were then compared with experimental CID spectra. This approach combined with additional filter constraints and possible MSn fragmentation information can be used for compound annotations (compound dereplication), provided that the structure is known in compound databases. Other prerequisites such as proper validation of the in silico prediction algorithms and use of larger datasets will be discussed in a later chapter.

Algorithms for formula calculation from high-resolution MS/MS data

If the mass spectrometer is capable of obtaining accurate mass multistage product ions (MSn), then this information should be utilized during the elemental composition determination. The possible elemental formulae for single peaks should be shown, and the algorithm should analyze if the elemental composition of the product ion could be combined to generate feasible elemental compositions of the complete molecule. Bruker (Billerica, MA, USA) developed the SmartFormula three-dimensional (3D) algorithm [187] that includes this information by using a recursive algorithm to exclude unfeasible molecular formulae from lower mass fragments (see Fig. 8). Tandem mass spectra obtained under EI can be used together with isotope abundance analysis to obtain correct elemental compositions [174]. Polynomial expansion algorithms to calculate the isotope patterns for precursor ion, neutral loss, and MSn product ion tandem mass spectra have been discussed in Ramaley and Herrera, and Rockwood et al. [173, 188].
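
The core consistency check, namely that every product ion formula must be contained in the formula of the complete molecule, can be sketched as follows. This is a simplified illustration and not the SmartFormula3D algorithm; the function names and the example formulas are assumptions.

```python
def is_subformula(fragment, precursor):
    """True if the fragment contains no more atoms of any element than the precursor."""
    return all(count <= precursor.get(element, 0) for element, count in fragment.items())


def filter_precursor_candidates(precursor_candidates, fragment_candidates_per_peak):
    """Keep only precursor formulas that can explain every product ion peak,
    i.e. each peak has at least one candidate formula that is a subformula."""
    kept = []
    for precursor in precursor_candidates:
        if all(
            any(is_subformula(fragment, precursor) for fragment in candidates)
            for candidates in fragment_candidates_per_peak
        ):
            kept.append(precursor)
    return kept


# Hypothetical example: two precursor candidates and two product ion peaks.
precursors = [{"C": 6, "H": 12, "O": 6}, {"C": 7, "H": 8, "N": 4, "O": 2}]
fragments = [
    [{"C": 6, "H": 10, "O": 5}],                         # candidates for peak 1
    [{"C": 3, "H": 6, "O": 3}, {"C": 4, "H": 6, "N": 2}],  # candidates for peak 2
]
print(filter_precursor_candidates(precursors, fragments))
```

In this example only the first precursor candidate survives, because the second one cannot contain the formula assigned to the first product ion.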

Fig. 8

Fragmentation pathway of paclitaxel and sum formulae for fragments from MS/MS and MS3 experiments calculated with the SmartFormula3D algorithm (reproduced with permission from Ilmari Krebs, Bruker Daltonik GmbH, Bremen [187])

Another approach used accurate masses from MS/MS product ions during the investigation of fragmentation processes of some natural products [189, 190]. Sirius Starburst [183] is a freely available software that combines MS/MS fragment and element ratio information with elemental composition determinations. A useful hardware-based approach [191], the acquisition of exact masses at high and low ionization energy MSE, can lead to more accurate elemental formula determinations.

Complex data-dependent setups including ion maps and ion trees

Data-dependent acquisition methods are used in most of today’s tandem mass spectrometers [87, 192–197]. The mass spectrometry software triggers MS/MS or MSn product ion scans based on specific events. The trigger can be set on specific events such as the highest abundant peaks, manually selected masses, specific neutral losses, or specific isotopic pattern [197].
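
The most common trigger, selecting the most abundant peaks of a survey scan, can be sketched as below. Vendor acquisition software is considerably more elaborate; the function name, the intensity threshold, and the exclusion list parameter are hypothetical and serve only to illustrate the selection logic.

```python
def select_precursors(peaks, top_n=3, min_intensity=1000.0, exclude=(), tol=0.01):
    """Pick precursor m/z values for fragmentation from a survey scan.

    peaks         -- list of (mz, intensity) tuples
    top_n         -- number of precursors to fragment per cycle
    min_intensity -- abundance threshold that must be exceeded
    exclude       -- m/z values that were already fragmented (hypothetical exclusion list)
    tol           -- m/z tolerance for matching the exclusion list
    """
    eligible = [
        (mz, intensity)
        for mz, intensity in peaks
        if intensity >= min_intensity and not any(abs(mz - x) <= tol for x in exclude)
    ]
    eligible.sort(key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in eligible[:top_n]]


survey = [(100.0, 500.0), (200.0, 5000.0), (300.0, 2000.0), (400.0, 9000.0)]
print(select_precursors(survey, top_n=2))                   # [400.0, 200.0]
print(select_precursors(survey, top_n=2, exclude=[400.0]))  # [200.0, 300.0]
```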

Specific data-dependent setups such as total molecule ion maps (see Fig. 9) are very powerful features for molecule fragmentation studies. The process to create ion maps has been known for more than 20 years [198]; however, it has not reached its full potential, mainly due to missing data handling options. Ion maps contain product ion mass spectra over the mass range of all precursor ions from 20 Da increasingly up to the molecular mass of the compound [199]. These ion maps can be obtained by a longer direct infusion process with autosamplers or better by nanoESI using Nanomate (Advion Inc.) robotic injections to allow long-enough scan times. The method should not be confused with spatial ion maps obtained from secondary ion mass spectrometry TOF-SIMS [200] or mass spectrometric imaging or ion maps that refer to retention time–m/z visualizations (LC-MS ion maps) [201]. The total ion map is a function of precursor m/z value versus product ion m/z value and intensity, and it can be represented in two- or three-dimensional space. The applications range from the investigation of single molecules to obtain deeper structural insights [202] to the investigation of complex petroleum mixtures [203] and natural compounds.

Fig. 9

A total ion map of tandem mass spectra from cobalamin (vitamin B12) created by a linear ion trap mass spectrometer and visualized by the Thermo Xcalibur software. For all precursor ions in the mass range between m/z 300 and 1,376, one MS/MS spectrum was acquired

Ion tree experiments [204–208] (see Fig. 10) are an even more powerful method to investigate mass spectral fragmentations and fragmentation pathways of molecules. A data-dependent ion tree contains multiple MS2 to MSn product ion spectra from a single molecule and represents the ultimate mass spectral fingerprint of a molecule. The methodology has been available for many years, and in principle, any mass analyzer capable of MSn fragmentation can make use of it. The technology is very attractive because it can be performed with inexpensive ion trap systems (tandem-in-time) using direct infusion experiments. Different ionization voltages and adduct-dependent fragmentations, as well as the use of high-resolution measurements and accurate mass MSn spectra from hybrid instruments, can reveal additional fragmentation pathways. However, these complex multidimensional setups were rarely used in the past due to data handling and software issues. Application examples include fragment studies of polyphenols [196], lipids [209–213], glycans [214], and carbohydrates [215].
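
Since data handling is named as the main obstacle, it may help to show how an ion tree can be represented in software. The following is a minimal sketch of one possible data structure; the class name, its fields, and the m/z values in the example are hypothetical and do not describe any vendor format.

```python
from dataclasses import dataclass, field


@dataclass
class IonTreeNode:
    """One product ion spectrum in an ion tree."""
    precursor_mz: float
    ms_level: int                                  # 2 for MS2, 3 for MS3, ...
    peaks: list = field(default_factory=list)      # (mz, intensity) product ions
    children: list = field(default_factory=list)   # spectra of fragmented product ions

    def add_child(self, precursor_mz, peaks=None):
        """Attach the spectrum obtained by fragmenting one product ion."""
        child = IonTreeNode(precursor_mz, self.ms_level + 1, peaks or [])
        self.children.append(child)
        return child

    def depth(self):
        """Highest MS level reached below this node."""
        if not self.children:
            return self.ms_level
        return max(child.depth() for child in self.children)

    def pathways(self, prefix=()):
        """Yield every fragmentation pathway as a tuple of precursor m/z values."""
        path = prefix + (self.precursor_mz,)
        if not self.children:
            yield path
        for child in self.children:
            yield from child.pathways(path)


# Illustrative tree with made-up m/z values.
root = IonTreeNode(609.3, ms_level=2)
branch = root.add_child(448.2)
branch.add_child(236.1)
root.add_child(397.2)
print(list(root.pathways()))  # [(609.3, 448.2, 236.1), (609.3, 397.2)]
print(root.depth())           # 4
```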

Fig. 10

An automatic data-dependent ion tree experiment with multiple-stage MSn spectra of selected precursor ions of reserpine acquired on a linear ion trap mass spectrometer. The information-rich ion tree represents the ultimate mass spectral fingerprint of a molecule

Mass spectral library search

Mass spectral library search is the first step in any mass spectral interpretation and is therefore discussed in greater detail. Mass spectral search can be performed with unit mass and high-resolution mass spectra of all stages (MS to MSn). The aim of a library search is either to obtain a correct structure hit for compounds already in the library or to obtain partial structural insights from compounds that nearly match. For that purpose, an experimental mass spectrum is searched against a large collection of already recorded mass spectra that are stored in a database. General reviews of mass spectral libraries [55] and mass spectral search algorithms [216, 217] are available.

MS and MS/MS and MSn libraries and search algorithms

Search algorithms for electron ionization spectra were developed first [218], and these include the INCOS algorithm, probability-based matching (PBM) [216], and the dot-product algorithm [217]. The size of publicly and commercially available MS/MS libraries is small compared with electron ionization libraries (Wiley and NIST) that cover several hundred thousand electron ionization mass spectra. Currently, the NIST08 MS/MS collection is a large commercially available database with 14,802 MS/MS spectra from 5,308 precursor ions. There are a variety of commercial libraries that have been generated for certain instrument types and settings. The publicly available Massbank [219, 220] and ReSpect database (RIKEN) [221–223] are databases currently covering 24,772 mass spectra and tandem mass spectra from 13,200 compounds. An electrospray tandem mass spectrometry (ESI-MS/MS) library for forensic applications covered 5,600 spectra of 1,253 compounds acquired at different ionization voltages using a hybrid tandem mass spectrometer coupled to a linear ion trap [224]. Smaller but specialized libraries are in use for toxicological screening and drug analysis [225, 226]. An in-house library of MS/MS spectra from 1,200 natural products with the majority of entries having [M+H]+ adducts and 95% of those compounds being able to ionize in positive mode was investigated in Ref. [227]. Tandem mass spectra are not as reproducible as electron ionization spectra when obtained from different instruments. However, the creation of reproducible and transferable MS/MS spectral libraries for use on multiple instrument types [228] is possible [229, 230]. A fragmentation energy index was proposed for LC-MS [231] to normalize collision energies and create reproducible spectra comparable to 70-eV electron ionization spectra. Another study compared tandem mass spectra obtained from quadrupole–quadrupole–time of flight, quadrupole–quadrupole–linear ion trap, quadrupole–quadrupole–quadrupole, and linear ion trap–Fourier transform ion cyclotron resonance mass spectrometers and came to the conclusion that platform-independent MS/MS spectra can be obtained with multiple fragmentation voltage settings [232–234].

Search algorithms for MS/MS spectra of small molecules can use approaches similar to those used for EI mass spectra [55, 235]. Peptide mass spectra usually show specific fragmentations, and a series of specialized search algorithms were developed for these purposes [236, 237]. MS/MS spectra can be searched according to spectral similarity [238], probability match (PBM) [216, 239], or dot-product algorithm search [217]. If the MS/MS spectra were obtained in data-dependent mode and precursor mass information is available, this precursor mass can be used as a powerful first filter for all subsequent MS/MS matches [240, 241]. The precursor m/z search window can be selected according to the experimental mass accuracy of the instrument. Well-calibrated unit mass resolution instruments can reach a mass accuracy of ±0.5 Da (or better with post-calibration methods). In this case, a precursor search window of ±0.5 Da can be set for MS/MS search. The subsequent MS/MS match uses a product ion search tolerance that is slightly wider due to possible hydrogen shifts. Well-established dot product, PBM, and reverse search algorithms are used to match the filtered MS/MS spectra. The accuracy, recall, precision, true and false discovery rates of the selected algorithm, and all other statistical parameters are best obtained from test sets with known spectra and decoy mass spectral datasets, as practiced in the proteomics community [242–245]. The freely available NIST Mass Spectral Search Program contains efficient algorithms to search accurate mass tandem mass spectra, including m/z precursor and product ion filtering. Moreover, NIST MS Search can handle and search molecular structures together with their associated mass spectra, which is an obligatory prerequisite for any advanced library search program.
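The two-stage search described above (precursor m/z filter first, then product ion matching) can be illustrated with a minimal Python sketch. It is not the NIST, PBM, or INCOS implementation; the function names, the greedy peak pairing, the library layout, and the tolerance defaults (±0.5 Da precursor window, ±0.6 Da product ion window) are assumptions chosen for illustration only.

```python
import math

def cosine_score(query, reference, tol=0.6):
    """Dot-product (cosine) similarity between two peak lists.

    Each spectrum is a list of (m/z, intensity) tuples. Peaks are paired
    greedily: every query peak takes the closest still-unused reference
    peak within +/- tol (assumed, simplified matching scheme).
    """
    used = set()
    dot = 0.0
    for q_mz, q_int in query:
        best, best_delta = None, tol
        for i, (r_mz, _) in enumerate(reference):
            delta = abs(q_mz - r_mz)
            if i not in used and delta <= best_delta:
                best, best_delta = i, delta
        if best is not None:
            used.add(best)
            dot += q_int * reference[best][1]
    norm_q = math.sqrt(sum(i * i for _, i in query))
    norm_r = math.sqrt(sum(i * i for _, i in reference))
    if norm_q == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_q * norm_r)

def library_search(query_precursor, query_peaks, library,
                   precursor_tol=0.5, product_tol=0.6):
    """Filter library entries by precursor m/z, then rank by cosine score.

    library is a list of dicts with the hypothetical keys
    'name', 'precursor_mz', and 'peaks'.
    """
    hits = []
    for entry in library:
        # Stage 1: precursor m/z window acts as the first filter.
        if abs(entry["precursor_mz"] - query_precursor) > precursor_tol:
            continue
        # Stage 2: product ion spectrum comparison.
        score = cosine_score(query_peaks, entry["peaks"], product_tol)
        hits.append((entry["name"], score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

In practice, intensity weighting, reverse search, and decoy-based score calibration would be layered on top of such a scheme; the sketch only shows how the precursor window reduces the number of spectra that have to be compared.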

Mass spectral trees combine multiple-stage mass spectra

Ion traps and hybrid mass spectrometers can be used to create multiple-stage mass spectra (MSn) by consecutively fragmenting precursor and all product ions. Usually, the abundance of the obtained product ions decreases with each stage, which sets a practical limit at MS6 to MS10. Furthermore, there must be enough time for trapping, or a direct infusion experiment has to be performed to generate enough ions. The feasibility of using MSn data for the investigation of drugs [246], monosaccharides [247], oligosaccharides [248–250], and other molecules has been shown. The use of multistage mass spectral libraries together with precursor ion fingerprinting for structure elucidation purposes has been investigated by Sheldon et al. [205]. The authors show that similar building blocks will have similar product ion mass spectra, and therefore, the utilization of MSn spectra of all stages can aid in structure elucidation of the core molecule structures. For example, if a set of molecules has different substitutions or side chains, then an accurate mass precursor search cannot identify these molecules as related. If the side chain is cleaved off or lost in a dissociation step, then the remaining core molecules generate similar product ion spectra and can therefore be matched among this set of similar compounds. The representation of a spectral tree (see Fig. 11) of compound mass spectra and their associated structures was obtained from MassFrontier (HighChem Ltd).

Fig. 11

A spectral tree diagram from MassFrontier representing multiple-stage MSn spectra, in-source CID spectra or zoom spectra. Any stage can be searched and is logically connected with different product ion spectra (reproduced with permission from Robert Mistrik/HighChem Ltd)
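A spectral tree of this kind maps naturally onto a recursive data structure in which every node holds one product ion spectrum and links to the spectra obtained by fragmenting its product ions further. The following Python sketch is only an illustration of that idea; the class name `SpectrumNode` and its fields are hypothetical and do not reflect the internal MassFrontier format.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

@dataclass
class SpectrumNode:
    """One spectrum in a multiple-stage (MSn) spectral tree."""
    precursor_mz: float                 # m/z of the ion that was isolated and fragmented
    ms_level: int                       # 2 for MS2, 3 for MS3, ...
    peaks: List[Tuple[float, float]]    # (m/z, intensity) of the product ions
    children: List["SpectrumNode"] = field(default_factory=list)

    def add_child(self, precursor_mz: float,
                  peaks: List[Tuple[float, float]]) -> "SpectrumNode":
        """Attach the spectrum obtained by fragmenting one product ion."""
        child = SpectrumNode(precursor_mz, self.ms_level + 1, peaks)
        self.children.append(child)
        return child

    def walk(self) -> Iterator["SpectrumNode"]:
        """Yield this node and all descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def nodes_at_level(self, level: int) -> List["SpectrumNode"]:
        """Collect all spectra of a given MS stage, e.g. all MS3 spectra."""
        return [node for node in self.walk() if node.ms_level == level]
```

Because every stage remains individually addressable, a library search can be run against any node, which is what allows the spectra of a shared core structure to be matched even when the intact precursors differ.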

Mass spectral interpretation

Many of the developments in mass spectral interpretation are deeply rooted in the era of electron ionization mass spectrometry from the 1970s and 1980s. Hence, mass spectral fragmentation interpretation rules are best developed for EI mass spectrometry. The red book entitled “Interpretation of mass spectra” written by Turecek and McLafferty [251], the book entitled “Introduction to Mass Spectrometry” by Watson and Sparkman [252], and “Understanding mass spectra: a basic approach” by Smith [253] are standard sources for mass spectrometrists investigating electron ionization spectra. These books contain very detailed explanations of reactions and fragmentation pathways, including rearrangement reactions, homolytic or heterolytic bond cleavages, hydrogen rearrangements, electron shifts, resonance reactions, and aromatic stabilizations. Any de novo interpretation without any pre-knowledge is still challenging, if not totally impossible, due to the high molecular diversity and many similar compound structures.

The even-electron rule states that even-electron ions usually fragment by loss of neutral molecules, but radical loss can also occur in the case of aromatic and nitroaromatic compounds [254, 255]. Under positive electrospray ionization (ESI), most fragment ions were reported to be even-electron ions, whereas the proportion of odd-electron ions formed under EI was significantly higher [256]. The Stevenson rule states that, upon bond cleavage, the charge preferentially remains on the fragment with the lower ionization energy, which therefore gains high peak abundance in the mass spectrum. The nitrogen rule should in principle only be used for unit resolution mass spectra because high-resolution and high-accuracy mass spectrometry can always calculate the correct number of nitrogen atoms. The Rings Plus Double Bonds Equivalent (RDBE) should not be used with elements that allow multiple valence counts (such as phosphorus and sulfur) [257] as otherwise only possible RDBE ranges can be obtained instead of unique solutions. Mass spectral visualization techniques such as van Krevelen or Kendrick plots, and spectral mappings using dimension reduction methods with principal component analysis [258], are helpful for the investigation of unresolved and complex organic matter (petroleum, coal, sediments, and fulvic acids) [259, 260].
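As a brief illustration of the RDBE and nitrogen rule, the following Python sketch uses the common textbook formula RDBE = C − H/2 + N/2 + 1, with halogens counted like hydrogen. It assumes the lowest valence state for every element (nitrogen and phosphorus trivalent, oxygen and sulfur divalent), which is exactly the limitation noted above for phosphorus and sulfur. The function names and the dictionary-based formula representation are assumptions made for this example.

```python
def rdbe(formula):
    """Rings plus double bond equivalents for a neutral molecule.

    formula maps element symbols to atom counts, e.g. {"C": 6, "H": 6}.
    Assumes lowest valence states; divalent O and S do not contribute.
    """
    tetravalent = sum(formula.get(e, 0) for e in ("C", "Si"))
    trivalent = sum(formula.get(e, 0) for e in ("N", "P"))
    monovalent = sum(formula.get(e, 0) for e in ("H", "F", "Cl", "Br", "I"))
    return tetravalent - monovalent / 2 + trivalent / 2 + 1

def nitrogen_rule_ok(nominal_mass, n_nitrogen):
    """Check the nitrogen rule for a neutral molecule (or M+. ion).

    An odd nominal mass requires an odd number of nitrogen atoms,
    an even nominal mass an even number (including zero).
    """
    return nominal_mass % 2 == n_nitrogen % 2

benzene = {"C": 6, "H": 6}
caffeine = {"C": 8, "H": 10, "N": 4, "O": 2}
print(rdbe(benzene))    # one ring plus three double bonds
print(rdbe(caffeine))   # two rings plus four double bonds
```

For protonated or deprotonated ions from soft ionization, the parity in the nitrogen rule is inverted and the RDBE value becomes half-integer, so both checks have to be applied to the neutral formula.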

Electron ionization and chemical ionization mass spectrometry

Electron ionization at 70 eV is a very hard ionization resulting in very complex rearrangements and fragmentations [261]. EI mass spectra themselves are very reproducible, which is important for a mass spectral library search. The ions in the gas phase have no “memory” of where they originate from, which renders the structural interpretation of full scan EI mass spectra very complex. Electron ionization MS/MS with accurate masses may ease that problem [262]. Several book chapters discuss the most important aspects of chemical ionization (CI) [6, 263]. One interesting aspect of chemical ionization is that multiple ionization gases with different proton acidities can be used, which results in different molecular ions for correct molecular ion and elemental composition determination. Although most GC-MS instruments are capable of performing CI analysis, the use of chemical ionization GC-MS is not common anymore. One reason may be the non-existence of chemical ionization mass spectral libraries and the lower sensitivity during chemical ionization GC-MS measurements. Nevertheless, chemical ionization GC-MS remains an attractive technique for structural identifications due to the capability of obtaining abundant molecular ions.

Electrospray and atmospheric pressure chemical ionization

The study of the fragmentation behavior of compounds under electrospray ionization (ESI) conditions [11, 264] is an important topic due to the wide availability of LC-MS devices with ESI interfaces. Using high-resolution CID data, compound substructures were ranked using a systematic bond disconnection approach [265]. In a similar approach for the structural investigation of MS/MS product ion spectra, the authors of a freely available software package used a brute-force ab initio combinatorial approach to generate possible fragment ions [266, 267], and they concluded that it is “a non-trivial task to accomplish.” Currently, only MassFrontier contains a large fragmentation reaction library as discussed in the section below. Different voltage settings should be selected for complete coverage of fragmentations. Automatic solutions such as CID voltage ramping exist [268] for obtaining maximum fragmentation patterns. A lookup table of common neutral losses during CID fragmentation has also been published [269], and typical fragmentations for atmospheric pressure ionization are discussed in Ref. [270].
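A neutral loss lookup of the kind mentioned above can be applied programmatically by comparing the mass difference between precursor and product ions with a table of monoisotopic neutral masses. The short Python sketch below is not the published table from Ref. [269]; it contains only four well-known losses, and the function name, the tolerance of 5 mDa, and the restriction to singly charged ions are assumptions for illustration.

```python
# Monoisotopic masses (Da) of a few common neutral losses.
NEUTRAL_LOSSES = {
    "H2O": 18.010565,
    "NH3": 17.026549,
    "CO": 27.994915,
    "CO2": 43.989829,
}

def annotate_neutral_losses(precursor_mz, product_mzs, tol=0.005):
    """Assign known neutral losses to product ions of a singly charged precursor.

    Returns a list of (product m/z, loss name) tuples for every product ion
    whose mass difference to the precursor matches a table entry within tol.
    """
    annotations = []
    for mz in product_mzs:
        delta = precursor_mz - mz
        for name, mass in NEUTRAL_LOSSES.items():
            if abs(delta - mass) <= tol:
                annotations.append((mz, name))
    return annotations
```

With unit resolution data, the tolerance would have to be widened to about 0.5 Da, at which point isobaric losses (for example CO versus C2H4) can no longer be distinguished.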

Determination of stereochemistry using mass spectrometry

The determination of stereochemical (absolute) configuration usually requires a separation technique such as GC, CE, or LC with chiral columns. ESI-MS was used to determine the binding affinities of ion-molecule reactions by performing CID experiments of host–guest complexes [271]. It is possible to determine the chirality of molecules without preseparation using chiral selector agents and ESI-MS/MS [272]. Additionally, traveling wave ion mobility spectrometry can be used to determine stereochemistry. The book titled “Applications of Mass Spectrometry to Organic Stereochemistry” [273] discusses practical approaches for stereochemical investigations of molecules.

Determination of 3D conformations using mass spectrometry

Although conformational changes of small molecules can be monitored using mass spectrometry, this approach was usually applied to high molecular weight compounds such as peptides and proteins [274] with the requirement of high resolving power. Mainly, protein folding and dynamics [275] have been studied in recent years. It has been reported that small-molecule mass spectra show differences depending on the 3D conformation of the molecule [276]. The determination of the conformational changes of small molecules is possible using ion mobility mass spectrometers or hybrids thereof. This approach requires the experimental determination of cross sections from known molecules and the use of such data for theoretical models [276, 277].

Biotransformation reactions and drug metabolism studies with mass spectrometry

Biotransformation and drug metabolism studies play a crucial role in all analytical studies targeted at drug design for phase I and phase II metabolites [278]. The tools and approaches discussed in this section aim to identify or predict in vivo metabolites from cytochrome P450 (CYP) enzymes and guide through preclinical drug metabolism and pharmacokinetics, and absorption, distribution, metabolism, and excretion/Tox studies. More than 50 CYPs are known in humans, and the CYP1A2, CYP2C9, CYP2C19, CYP2D6, CYP3A4, and CYP3A5 enzymes metabolize 90% of drugs [279]. In pharmacokinetics and metabolism studies, the pathway of one single drug and all related enzymatically transformed metabolites are investigated. Levsen et al. [280] discuss the utilization of tandem mass spectrometry for the investigation of phase II metabolites. In recent years, software expert algorithms for metabolite predictions have been developed, and this includes tools such as DEREK, Catabol, LHASA, MetaboGen, METEOR, and MetabolExpert [281–284]. The software works along known metabolic transformation rules and performs an in silico prediction of possible metabolites. Those metabolite structures can be identified later either by accurate mass shifts or by tandem mass spectrometry [285]. Specialized mass spectrometry centric software from vendors such as Metabolite ID (AB Sciex), Metabolynx (Waters), Metworks (Thermo), MassHunter/Metabolite ID (Agilent), and MetaboliteTools (Bruker) mostly uses a combination of accurate mass, neutral loss, and biotransformation rules with associated accurate masses for metabolite identification.

Ion trap and triple quadrupole mass spectrometers can be used to monitor and identify common neutral losses (including methylation, acetylation, and glucuronidation). Tables with common biotransformations, and lists of metabolic changes and their accurate masses can be found in Ref. [286] and Ref. [287] (see Table 2). With the broader availability of accurate mass spectrometers, the mass defect filter rule [288–291] could be applied. Using the nominal mass shift and a mass defect window with a width of several millidaltons (mDa), the matrix influence can be separated from the analytes of interest. A recent report discussed the integration of structure-based metabolism prediction with predicted and experimental MS/MS data [292]. The information from precursor and product ion spectra can be used to find common biotransformations from possible regioisomers [293]. The approaches presented here can also be used during pesticide screening, biotechnology, and bioengineering studies with enzymatic reaction systems and metabolic profiling studies.
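The principle of the mass defect filter can be sketched in a few lines of Python. The example keeps only ions whose mass defect lies close to that of the parent drug and whose mass lies within a plausible range around it. This is a simplified illustration, not a vendor algorithm: the function names, the default window of ±50 mDa, the ±200 Da mass range, and the approximation of the nominal mass by rounding are all assumptions, and the m/z values in the usage example are hypothetical.

```python
def mass_defect(mz):
    """Mass defect approximated as exact mass minus nearest integer mass.

    Rounding is a simplification that holds for small molecules with a
    mass defect below 0.5 Da.
    """
    return mz - round(mz)

def mass_defect_filter(ions, parent_mz, defect_window_mda=50.0, mass_window=200.0):
    """Keep ions with a parent-like mass defect within a mass range.

    ions: iterable of accurate m/z values (singly charged).
    defect_window_mda: allowed mass defect deviation from the parent, in mDa.
    mass_window: allowed absolute mass difference to the parent, in Da.
    """
    parent_defect = mass_defect(parent_mz)
    kept = []
    for mz in ions:
        if abs(mz - parent_mz) > mass_window:
            continue
        if abs(mass_defect(mz) - parent_defect) * 1000.0 <= defect_window_mda:
            kept.append(mz)
    return kept

# Hypothetical parent ion at m/z 310.1234 with a hydroxylated metabolite
# (+15.9949 Da), a glucuronide (+176.0321 Da), and one matrix ion.
print(mass_defect_filter([326.1183, 325.3100, 486.1555], 310.1234))
```

Common biotransformations change the mass defect by only a few tens of millidaltons, whereas lipid-rich or salt-rich matrix ions usually fall outside this window, which is why such a simple filter can already remove much of the background.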

Table 2 Selected biotransformations for in vivo drug metabolism studies detectable by accurate mass spectrometry (reproduced from [282] with permission of Future Science Ltd)

The availability of hybrid triple quadrupole mass spectrometers with linear ion traps (QTRAP) allows the sensitive detection of metabolites using multiple reaction monitoring (MRM) and a subsequent MS/MS (product ion) scan for metabolite identification or annotation [194, 195, 294]. Newly developed software (LightSight, ABI/Sciex) [295–297] can automatically create MRM or multiple ion monitoring transitions. This software approach, called predictive MRM, allows for a very sensitive analysis and detection of new metabolites [298].

Isotope labeling studies

Stable isotopic labeling studies [193, 299] and hydrogen/deuterium exchange reactions [300–302] are commonly applied in drug metabolism studies. Proteomics approaches use labeling studies for the quantification of peptides and proteins [303–305] as well as mass defect isotopomer studies [306, 307]. In vivo labeling with stable isotopes can be applied for metabolism studies in plants [308, 309], isotopomer-based flux balance analysis [310–315], and structural elucidation of unknown compounds [197, 316–319]. The use of deuterated mobile phases (D2O) or post-column infusion of D2O has been popular over the last several years for metabolite identification studies [320–322].

Determination of impurities and contaminants

The elucidation of impurities is a recurring event during daily lab work. Contaminants can be avoided either by experience or, better, by quality control of all reagents and solvents used. For GC-MS, LC-MS, and CE-MS, this includes the purchase of solvents and reagents in batches to obtain consistent quality and the use of quality check monitoring procedures. The resulting chromatograms or mass spectra (solvent blanks or reagent blanks) need to be stored long term to monitor impurities over months and years. Existing collections of fragments and ions can help during the investigation of such contaminations. Certain detergents and buffer components (Triton X) are excellently ionized in ESI mode and result in large abundant peaks that suppress the signal of other ions. A comprehensive review [323] discusses mostly ESI and MALDI interferences and contains a large EXCEL sheet in the supplementary data section that covers around 800 potential interferences and contaminant ions in positive and negative electrospray mode. Additionally, it contains 40 repetitive fragments such as sodium formate clusters (NaHCO2) and lists multiple adducts, losses, and possible replacements. A constant batch-wise monitoring of the purity of solvents and derivatization agents is important along with the removal of artifacts from datasets for GC-MS [324]. The hot injector in GC-MS can act as a small chemical reactor, and this could introduce a series of breakdown products that can lead to false analysis conclusions [325]. Many volatile compounds, among them pesticides and insecticides (DDT), easily decompose in a hot injector [326]. Using a GC cold injection system with a near zero degree Celsius injection temperature to avoid the breakdown of chemicals and an automatic liner exchange (ALEX, Gerstel Inc.) to avoid carryover can increase the level of confidence in compound identification of complex samples [327]. A chip-based nanoelectrospray system (NanoMate, Advion Inc.) can be used to avoid cross-contaminations. For each sample, a new ESI nozzle is used during direct infusion mass spectrometry experiments.

Mass spectral fragmentation reaction databases

Mass spectral fragmentation reaction databases contain chemical reactions and fragmentation mechanisms from mass spectral investigations. These are organic reaction proposals drawn by mass spectrometrists in order to explain specific fragments or mass spectral abundances. If applied to new molecules or mass spectra, they can speed up the elucidation process by using existing knowledge. Until recently, no structure-searchable mass spectral fragmentation library existed. Currently, only MassFrontier (HighChem Ltd.) contains a large fragmentation library of 30,936 fragmentation schemes with 129,229 reactions and 151,762 associated structures. Direct molecule search, substructure search, similarity search, and name search can be performed, and all associated meta-data are electronically searchable. The database was manually curated from several thousand publications (see Fig. 12, reproduced with permission of HighChem Ltd.) and can be used to develop in silico fragmentation predictions as discussed in the next chapter.

Fig. 12

A mass spectral fragmentation pathway database containing 30,936 fragmentation mechanisms. The mass spectrometry community never enthusiastically endorsed digital data sharing. Therefore, most of the spectra and reaction data had to be captured from old paper publications (reproduced with permission from Robert Mistrik/HighChem Ltd)

The current practice of dissemination of chemical fragmentation reactions on paper publications (PDF) is not keeping up with existing technological possibilities. It is impractical to search compound structures and reaction data from paper publications. Also, many data centric approaches, including the development of novel fragmentation algorithms, are actively hindered. Chemical reaction and fragmentation data should be submitted in electronic, machine-readable exchange formats to journals or external repositories. Currently, no such repository for mass spectral reaction data exists.

Mass spectral simulation and generation of in silico mass spectra

Chemical compound databases currently cover more than 50 million chemical structures; however, only around one million mass spectra (including duplicates) from known compounds exist. This gap could be filled by computer generation of mass spectra from large compound structure databases. An in silico algorithm has to predict accurate mass fragments and their abundances. Such an in silico generation of theoretical mass spectra could be useful because experimentally obtained mass spectra can then be matched against large in silico mass spectral databases. Several mass spectral simulation algorithms have been published in the literature. Many of those programs, however, were never made commercially or publicly available, which therefore prevents any possible independent scientific validation. The main problem of most algorithms is to simulate or calculate peak abundances or peak intensities [328] that reflect experimentally measured peak abundances [329–331]. This problem has not been solved for the vast majority of small molecules under different ionization modes. The success rate of any algorithm has to be determined by a validation study using unknown molecules and a library match of the in silico generated spectra against the experimental spectra. Furthermore, the structural diversity and the number of compounds have to be high to avoid overfitting.

Successful cases of in silico generation are known for molecules with certain structural scaffolds and consistent fragmentation patterns. That includes lipids (see Fig. 13), oligosaccharides [332], glycans [333], and peptides [334]. For example, compound libraries from combinatorial synthesis show common neutral mass losses when studied under electrospray conditions [335]. Another study used neural networks to simulate 70-eV electron ionization mass spectra of alkanes [336]. MASSIS/MASSIMO was a rule-based spectral simulation system for electron ionization spectra that included McLafferty rearrangements, retro-Diels-Alder reactions, neutral losses, and oxygen migration [337–339]. Another method was developed for predicting the presence of carboxylic acids using low-energy CID spectra and the loss of CO2 (44 Da) in MS/MS product ions [340]. The publicly available MetFrag algorithm [341] compares in silico mass spectra, obtained by a bond dissociation approach, with experimental mass spectra and assigns a score to all results. A validation study [342] compared the success rate of three commercial programs (MOLGEN-MSF (University Bayreuth), MS Fragmenter (Advanced Chemistry Development Inc.) [343, 344], and Mass Frontier (HighChem Ltd.)) and came to the conclusion that the simulation of mass spectral fragmentations of electron ionization spectra is still far from daily practical usability.

Fig. 13

An experimental phospholipid spectrum and computer generated MS/MS spectrum. Mass spectral libraries of theoretical in silico spectra can be generated from large structure databases (source: Tobias Kind/FiehnLab)

Expert systems for mass spectral interpretation

Computer-aided interpretation of mass spectra started in the 1960s [345, 346] when the first commercial computers were available. The DENDRAL project pioneered approaches with the aim of predicting isomer structures from mass spectra using self-learning or artificial intelligence algorithms [347]. There are several software tools that can assist during interpretation of mass spectra, including Automated Mass Spectral Deconvolution and Identification System (AMDIS), MassFrontier, ACD/MS Manager, MASSLib [348], and the freely available NIST MS Interpreter as part of the NIST08 database search program. The NIST MS search program can generate substructure information using a nearest-neighbor approach [349] by searching unknown mass spectra against a large reference database. The algorithm will generate a good list and a bad list of substructures based on an actual hit list. If there is no mass spectrum with similar features in the database, then the algorithm fails. The tools AMDIS and MOLGEN-MS [350, 351] integrate the Varmuza feature-based classification approach [352–354]. Mass spectral classifiers for neutral loss selection using Fisher ratio and linear discriminant analysis and genetic algorithm partial least squares discriminant analysis have been investigated to distinguish alcohols and ethers [353]. A decision tree-based prediction of substructures from mass spectral features allowed the classification of unknown metabolites into different compound classes [355]. For soft ionization techniques (ESI, APCI), programs such as HighChem Mass Frontier [87, 205, 227, 356–360] or ACD/MS Manager [361] can help during data interpretation and fragmentation prediction. Older software usually works well with unit resolution data. New software should allow the handling of accurate and high-resolution mass spectral data. There is currently no software or “magic bullet” that combines mass spectral knowledge and scientific intuition and is able to present a correct compound structure from mass spectral data only.

Use of computational chemistry to explain gas-phase phenomena

A steady stream of papers uses computational chemistry to investigate gas-phase reactions or ionization processes with regard to thermochemistry and kinetics [362]. In some cases, this approach can lead to a better understanding of fragmentation pathways. The book titled “Assigning structures to ions in mass spectrometry” covers many small-molecule-related approaches regarding thermochemistry, including potential energy curves, calculation of heats of formation, and proton affinities [363]. Quantum mechanical methods can also be used to determine bond cleavage energies and bond dissociation energies [364], and help to interpret adduct formation [365, 366]. Proton affinities have been calculated with semiempirical methods (AM1) [367] and with ab initio (MP2) and density functional theory (DFT; B3LYP) methods [368, 369]. Investigation of CID cross sections can be used to determine binding affinities of cations and small molecules [370]. The kinetic method with entropy correction can be used to calculate proton and electron affinities [371, 372]. Ab initio and DFT calculations were used to elucidate the energetics of ECD [96]. A recent paper discussed the application of DFT to understand tandem mass spectrometric (MS/MS) fragmentation for non-peptidic molecules [373]. The report on three example molecules shows that protonation significantly perturbs the electron density and affects ion formation and subsequent bond fragmentation throughout the whole molecule. The fragmentation pathways for phthalates [374] were investigated using DFT. Even chirality detection of molecules is possible by means of electrospray ionization mass spectrometry and competitive binding analysis [375]. Many of the applied quantum chemical methods require deep computational chemistry knowledge and can make use of available software tools such as GAMESS, GAUSSIAN, NWCHEM, or AMBER [376]. Moreover, recently released Intel Xeon (Nehalem) and AMD Opteron (Magny-Cours) processor technology provides the needed computational speed on commodity desktop computers. The performance of 200 GFlop/s (giga floating point operations per second; double-precision mode) is comparable with speeds only reached by supercomputers 10 years ago. Because of the high software and hardware barriers, computational interpretations of mass spectra remain interesting for research, but they have not yet been translated into easy-to-use software tools for mass spectrometry practitioners.

Approaches for hyphenated techniques (GC-MS and LC-MS)

Mass spectral deconvolution for clean mass spectra

Mass spectral deconvolution refers to the process of creating background- and noise-free mass spectra from GC-MS or LC-MS data. Traditionally, chromatographers would use a simple chromatographic peak detection method and would manually select a detected peak to obtain the related mass spectrum. This manual process is error prone and time consuming, and requires manual background subtraction before and after the chromatographic peak. With an automated deconvolution routine, peaks can be detected under the baseline of the total ion chromatogram, and overlapping peaks can be resolved (see Fig. 14). Additionally, if the chromatographic resolution is not sufficient, then the process is also able to separate (deconvolute) overlapping compound mass spectra. The automated deconvolution process itself is now standard in many GC-MS investigations [377] and is mostly known from the freely available AMDIS [378]. The AMDIS process includes four sequential steps: (1) noise analysis, (2) component perception, (3) spectral deconvolution, and (4) compound identification. AMDIS was recently adapted to monitor air quality and identify toxic gases on board the International Space Station [379, 380]. Multiple other software solutions for the analysis of GC-MS and LC-MS data exist [381]. These include LECO ChromaTOF, SpectralWorks AnalyzerPro, Ion Signature Quantitative Deconvolution Software, HighChem MassFrontier, and TargetSearch [382]. The use of peak picking and peak detection algorithms for LC-MS data [114, 383] is still an active field of research due to high noise ratios, broader chromatographic peaks, and mass spectra that show fewer fragments than electron ionization spectra. The deconvolution process itself usually performs best if it is optimized for a specific scan rate; otherwise, false-positive and false-negative peak detections may occur [384]. The detection of these deconvolution errors [385, 386] is best solved by using reference compound mixes with a known number of analytes and a subsequent optimization process to detect the correct number of compounds. Deconvoluted compound mass spectra are subsequently submitted to a mass spectral database search. If additional MS/MS spectra were extracted, then a tandem mass spectral search can be performed.
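The core idea behind spectral deconvolution, namely that all ions of one compound reach their maximum at the same scan, can be shown with a strongly simplified Python sketch. It is not the AMDIS algorithm and omits noise analysis, peak shape modeling, and library identification; the function names, the fixed noise threshold, and the input layout (one intensity list per m/z trace) are assumptions for illustration.

```python
from collections import defaultdict

def apex_scan(trace):
    """Return the scan index at which an ion trace reaches its maximum."""
    return max(range(len(trace)), key=lambda i: trace[i])

def deconvolute(ion_traces, noise=0.0):
    """Group m/z traces that share the same apex scan into components.

    ion_traces: dict mapping m/z to a list of intensities, one per scan.
    noise: traces whose apex intensity does not exceed this value are dropped.
    Returns a dict mapping apex scan to a list of (m/z, apex intensity),
    i.e. one extracted mass spectrum per perceived component.
    """
    components = defaultdict(list)
    for mz, trace in ion_traces.items():
        apex = apex_scan(trace)
        if trace[apex] > noise:
            components[apex].append((mz, trace[apex]))
    return dict(components)

# Two co-eluting compounds whose apexes are one scan apart,
# plus a constant background ion at m/z 50.
traces = {
    73: [0, 1, 5, 10, 5, 1, 0],
    147: [0, 0, 2, 6, 2, 0, 0],
    91: [0, 0, 1, 4, 9, 4, 1],
    105: [0, 0, 0, 2, 5, 2, 0],
    50: [1, 1, 1, 1, 1, 1, 1],
}
print(deconvolute(traces, noise=1.5))
```

The example also hints at why the scan rate matters: if too few scans are acquired across a peak, the apexes of neighboring compounds collapse onto the same scan and can no longer be separated.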

Fig. 14

Peak picking and mass spectral deconvolution. The program can automatically detect peaks under the baseline (case A). Overlapping (non-resolved) peaks can be detected, and clean mass spectra are extracted (case B) (source: Tobias Kind/FiehnLab created with MassFrontier)

Chromatographic heart cut, column switching, and fractionation techniques

The fractionation of complex samples using liquid chromatography is a frequently performed step to obtain pure compounds or reduce the complexity of the sample. This further allows 1D and 2D NMR investigations of complex natural products [387]. Peaks can be frozen out using a preparative fraction collector in conjunction with a low-efficiency preparative packed GC column or higher film thickness megabore traps for gas chromatography [388]. These applications are exemplified in biomarker research [389], investigation of hydrocarbons [390, 391], and entomology and pheromone studies [392, 393]. Using column switching of hydrophilic interaction chromatography (HILIC) and reversed-phase (RP) columns, complex samples can be analyzed within one single run [394]. To increase the chromatographic peak capacity, columns with different polarity can be coupled together into a 2D-LC-MS setup. Comprehensive two-dimensional liquid chromatography (LC×LC) is currently in a developmental stage [395, 396]. The enrichment of samples using peak parking [397] or fraction collection [398] is commonly used during natural product investigations and drug research [399–402]. The combination of liquid chromatography with solid phase extraction and NMR has been applied for pharmaceutical studies [403], drug discoveries [404], food investigations [405, 406], and natural product research [407–410]. When this technique is combined with mass spectrometric detectors (LC-SPE-NMR-MS), an almost universal system for structure elucidation is created [411–414].

Cheminformatics meets mass spectrometry

Modern mass spectrometry-centric approaches for structure elucidation cannot be performed without proper molecular structure handling [415]. Many of the software tools for structural elucidation (MassFrontier, ACD/MS Manager, NIST MS Search, Sierra’s APEX) also have inbuilt structure handling capabilities to either allow substructure analysis or perform structure–spectra correlations. Many drug metabolism studies also include computational chemistry approaches. It is also important to investigate how many resonance structures [416], tautomers [417–420], and stereoisomers [421] can be generated from a given structure. For the ionization processes [422], it is favorable to understand the ionization behavior by calculating charges, electronegativities, and H-bond donor and acceptor counts. To estimate the retention behavior, it is important to calculate the distribution of microspecies and pKa values [423] at different pH values in a given buffer system [424]. Software tools such as the Marvin Calculator Plugins (ChemAxon) [425] can calculate all those compound properties, including all possible tautomers and stereoisomers, with a single program. Multiple candidate structures can be investigated using the commercially available ChemAxon Instant-JChem program or the open-source BioClipse software [426].
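
The dependence of microspecies distribution on pH can be illustrated with the simplest possible case. The following is a minimal sketch, assuming a single monoprotic acid described by the Henderson-Hasselbalch relation; it is not how the Marvin plugins or any other cited tool work internally, and the function names are hypothetical. The pKa of 4.76 is the textbook value for acetic acid.

```python
def fraction_deprotonated(pka: float, ph: float) -> float:
    """Fraction of a monoprotic acid HA present as A- at the given pH.

    Follows from Henderson-Hasselbalch: pH = pKa + log10([A-]/[HA]).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def microspecies_distribution(pka: float, ph: float) -> dict:
    """Return the relative abundance of the two microspecies HA and A-."""
    ionized = fraction_deprotonated(pka, ph)
    return {"HA": 1.0 - ionized, "A-": ionized}


if __name__ == "__main__":
    ACETIC_ACID_PKA = 4.76  # textbook value, used only as an illustration
    for ph in (2.0, 4.76, 7.4):
        dist = microspecies_distribution(ACETIC_ACID_PKA, ph)
        print(f"pH {ph:>4}: HA = {dist['HA']:.3f}, A- = {dist['A-']:.3f}")
```

Real molecules often carry several ionizable groups, so a full microspecies calculation has to enumerate all protonation states; the two-state case only shows why the neutral fraction, and with it reversed-phase retention, changes sharply around the pKa.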

Structure retention relationships and retention index predictions

The investigation and accurate prediction of the retention behavior of a molecule is a major cornerstone for structure elucidation using mass spectrometry. The theoretically predicted retention index or retention time can be used as a powerful orthogonal filter for hyphenated chromatographic techniques. If the elemental composition and possible substructures can be detected from the mass spectrum, then this information can be used within molecular isomer generators (MOLGEN-MS, SMOG [427], and Assemble [428]) to generate all structural isomers [429]. Those molecular isomer generators usually work with the constraint of a molecular formula and a good list and bad list of possible substructures. The retention index (RI) prediction algorithm could then be used to predict the retention index or retention time of these virtual compounds. Subsequently, these theoretical RIs can be matched against the experimental RI values, and all compounds outside a specific retention index window can be removed as false candidates. Current prediction algorithms are either very accurate for a small subset of structures but lack wider substance coverage, or they cover a broad range of structural classes but lack prediction accuracy. A model with a good correlation coefficient may still exhibit poor predictive power. Additionally, the development datasets for comprehensive solutions should have a minimum size of 500–1,000 compounds, which are best acquired under the same conditions and the same instrumental setup. The obtained quantitative structure retention relationship (QSRR) models must be carefully validated with a large number of external test set compounds to avoid overfitting [430]. It has to be stated that published QSRR models without an existing commercial or open software implementation are interesting scientific exercises, but they are of little use to the majority of practitioners, who cannot apply these models.
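
The retention index window filter described above can be sketched in a few lines. This is an illustrative example only: the candidate names, the predicted RI values, and the window width are invented, and the predicted values would in practice come from whatever QSRR model is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A virtual compound from an isomer generator with its predicted RI."""
    name: str
    predicted_ri: float


def filter_by_ri(candidates, experimental_ri, window):
    """Keep candidates whose predicted RI lies within +/- window of the
    experimental RI; everything else is rejected as a false candidate."""
    return [
        c for c in candidates
        if abs(c.predicted_ri - experimental_ri) <= window
    ]


if __name__ == "__main__":
    # Hypothetical isomers and hypothetical predicted retention indices.
    isomers = [
        Candidate("isomer_A", 1180.0),
        Candidate("isomer_B", 1245.0),
        Candidate("isomer_C", 1410.0),
    ]
    kept = filter_by_ri(isomers, experimental_ri=1230.0, window=65.0)
    print([c.name for c in kept])
```

The window should be chosen from the validated error of the prediction model on external test compounds. A window that is too narrow removes the correct structure, whereas one that is too wide removes almost nothing.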

Several QSRR models for gas and liquid chromatography have been published and already reviewed in the recent literature [431]. Kaliszan wrote a series of papers regarding structure retention relationship models culminating in a single comprehensive review [432]. Both reviews cover several hundred scientific papers. Katritzky discussed the use of quantum chemical descriptors for QSRR calculations [433]. HPLC retention indices were calculated from a set of 500 drug-like compounds, and molecular descriptors and neural network machine learning were applied for the prediction of the RI values [434, 435].

A large commercial database of Kovats GC retention index values was released in 2005 [436]. Table 3 lists selected GC column parameters from this database. As the column types and film thicknesses cover a wide range of possible parameters, retention indices also differ. Such variations must be included during model development and require careful statistical evaluation [437] because one single compound can have a high variability in observed experimental retention indices. Freely available software for the prediction of Kovats retention indices (based on alkanes) was released by NIST in 2007 [438]. The software was developed with RI values from 35,000 compounds and used a group contribution method of 85 different substructures for polar and nonpolar column data. The median error for polar columns was 65 RI units, which is not accurate enough to determine single structures, but can be used as a refinement filter for comprehensive structure elucidation workflows. Several standard compounds including keto-alkanes [439], alkylarylketones [440], or 1-nitroalkanes [441] were proposed as retention index markers in the past. No universal or unified HPLC retention index system for RP, normal phase, and HILIC has been developed yet. Suitable standard compounds should cover a wide retention time range on a given LC phase, and they should be easily ionizable with electrospray ionization, non-toxic, non-reactive, inexpensive, commercially available, and outside of targeted profiling approaches. Synthetic peptides could be amenable compound structures for HPLC RI values [442].
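
For readers unfamiliar with the alkane-based scale, the isothermal Kovats index of a compound eluting between the n-alkanes with n and n+1 carbon atoms is I = 100 · [n + (log t'x − log t'n) / (log t'n+1 − log t'n)], where t' denotes adjusted retention times (retention time minus column dead time). The sketch below implements this standard definition. The function name and the example retention times are hypothetical, and temperature-programmed runs normally use the linear (van den Dool and Kratz) variant instead.

```python
import math


def kovats_index(t_x, t_n, t_n1, n, dead_time=0.0):
    """Isothermal Kovats retention index.

    t_x   retention time of the analyte
    t_n   retention time of the n-alkane with n carbons (elutes before)
    t_n1  retention time of the n-alkane with n+1 carbons (elutes after)
    n     carbon number of the earlier-eluting alkane
    """
    # Convert to adjusted retention times by subtracting the dead time.
    adj_x, adj_n, adj_n1 = (t - dead_time for t in (t_x, t_n, t_n1))
    if not (0.0 < adj_n <= adj_x <= adj_n1) or adj_n == adj_n1:
        raise ValueError("analyte must elute between the two bracketing alkanes")
    ratio = (math.log10(adj_x) - math.log10(adj_n)) / (
        math.log10(adj_n1) - math.log10(adj_n)
    )
    return 100.0 * (n + ratio)


if __name__ == "__main__":
    # Hypothetical retention times in minutes, bracketed by C10 and C11.
    print(round(kovats_index(t_x=8.0, t_n=4.0, t_n1=16.0, n=10), 1))
```

By construction, each n-alkane receives an index of 100 times its carbon number, which is why retention indices transfer between instruments better than raw retention times.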

Table 3 Statistics of selected column parameters of semipolar stationary gas chromatographic phases obtained from the NIST05 retention index database

Derivatization strategies for LC-MS and GC-MS

The detection of functional groups with the help of selective derivatization [55] is one of the oldest wet-lab techniques in chemistry. The book “A Handbook of Derivatives for Mass Spectrometry” by Zaikin and Halket comprehensively covers most derivatization reactions for different ionization modes, and LC-MS- and GC-MS-based mass spectrometric studies [443]. In the case of GC-MS, the aim is to increase volatility of the compound and protect reactive groups to avoid thermal breakdown or reactions with the column material. For LC-MS studies, derivatizations are performed to improve ionization characteristics for poorly ionizable compounds [444–447]. The obtained products must be stable against hydrolysis, as for example the tert-butyldimethylsilyl products from an N-methyl-N-[tert-butyldimethylsilyl]trifluoroacetamide derivatization [448]. Common fields of application are pesticide and environmental screening [449–451], separation of complex sugars as mono-, di-, and trisaccharides [452, 453], enantiomer analysis [454], amino acid analysis [447, 455], steroid and drug testing [456, 457], and metabolomic profiling studies [458–460].

Use of structure databases for targeted compound annotation

The availability of large public compound databases, such as PubChem [461] and ChemSpider [462], or specialized drug and metabolism databases, such as KEGG [463], HMDB [464], ChEBI [465], DrugBank [466], MZedDB [467], and the Chemical Lookup Service [468], allows for a web-based search of molecular formulae or accurate masses [469]. DrugBank has a search interface allowing an accurate mass search in positive or negative mode within the known human metabolite pool, and the results are presented with possible adducts and links to further database sources. This information can include literature, chemical taxonomy data, or other related information. If other molecular features are known from mass spectral investigations, such as the number of polar hydrogens from derivatization or H/D exchange experiments, then these molecular properties can be used as additional orthogonal filters. Additionally, theoretical retention indices can be used to match experimental RI values and remove false candidate hits. Multiple databases have application programming interfaces (APIs) that allow a connection of standalone programs with online databases without the need of downloading several gigabytes of the database itself.
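
An accurate mass search with adduct handling and an orthogonal filter can be sketched as follows. This is a minimal illustration against a tiny in-memory table; it does not call or imitate the interface of any of the databases named above. The table contents, function name, and tolerance are assumptions, only three common singly charged adducts are considered, and the exchangeable hydrogen counts stand in for results from an H/D exchange experiment.

```python
# Mass shifts (Da) of singly charged adduct ions relative to the neutral
# molecule M. Electron mass is already accounted for.
ADDUCTS = {
    "positive": {"[M+H]+": 1.007276, "[M+Na]+": 22.989221},
    "negative": {"[M-H]-": -1.007276},
}

# Hypothetical local stand-in for a compound database:
# name -> (monoisotopic mass in Da, number of exchangeable hydrogens)
DATABASE = {
    "glucose": (180.063388, 5),
    "caffeine": (194.080376, 0),
}


def search_by_mz(mz, mode, database=DATABASE, tol_ppm=5.0, exchangeable_h=None):
    """Return (name, adduct, ppm error) for every database entry whose
    neutral mass explains the observed m/z within tol_ppm.

    If exchangeable_h is given, it acts as an additional orthogonal filter.
    """
    hits = []
    for adduct, shift in ADDUCTS[mode].items():
        neutral = mz - shift  # candidate neutral mass for this adduct
        for name, (mass, n_exchangeable) in database.items():
            ppm = (neutral - mass) / mass * 1e6
            if abs(ppm) > tol_ppm:
                continue
            if exchangeable_h is not None and n_exchangeable != exchangeable_h:
                continue
            hits.append((name, adduct, round(ppm, 2)))
    return hits


if __name__ == "__main__":
    for name, adduct, ppm in search_by_mz(195.0877, "positive"):
        print(f"{name} as {adduct}, mass error {ppm} ppm")
```

With a real database, many isomers share the same elemental composition and therefore the same accurate mass, which is why additional filters such as exchangeable hydrogens or retention indices are needed to shorten the candidate list.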

Fields of applications—review of reviews

The use of mass spectrometric analysis in metabolomics has been reviewed in [470–473]. A comprehensive review covered the identification of known endogenous and exogenous metabolites by applying accurate mass, isotopic pattern filter, retention indices, and mass spectral fragmentation in a sequential filter cascade and combining the results with a database search [474]. A wide range of LC-MS-based methods including MRM-based approaches, precursor ion scans, and radio-labeling were discussed [475]. Multistage mass spectrometry approaches were used for the identification of drugs, metabolites, toxins, and plant and animal metabolites [124, 476–486] with their associated fragmentation pathways. The structural elucidation of flavonoids, flavonoid glycosides [487–491], and drug metabolites [294, 318, 492–498] using multiple-stage tandem mass spectrometry was reviewed in several papers. Natural product investigations usually combine mass spectral information with a de novo structure elucidation step using NMR [499–502]. Lipid and phospholipid analysis [503–512] can be performed with all ionization modes and types of mass spectrometers including triple quadrupole, ion trap, and TOF mass spectrometers. Structure elucidation of compounds from environmental samples [118, 389, 450, 513–516] is among the most complex cases of structure elucidation. The use of capillary electrophoresis coupled to mass spectrometry has been previously reviewed [517, 518].

Electronic data sharing of mass spectra

The future success of structure elucidation with mass spectrometry will largely depend on the development of new software algorithms. Similar to the success of the bioinformatics [519] and the proteomics [520] communities, which had open access to large genome associated data, mass spectrometric data must be made publicly available to enable long-term data reuse and allow data-driven research [521, 522]. Multiple software database implementations are currently in use or in development [523, 524], among them SPECTRa [525], MeltDB [526], SetupX [527], MassBank [219, 220], MMCD [528], METLIN [529, 530], GMD [458], KNApSAcK [531], and PRIME [532]. Mass spectra from a wide range of instrument types can be used in machine learning approaches for mass spectral elucidation. The future of mass spectral structure elucidation will depend on a wide array of well-described, meta-data enhanced and freely available resources. Not only high- and low-resolution mass spectra but also their suggested fragmentation pathways can be electronically collected. Mass spectra and associated molecular structure drawings need to be shared in open exchange formats and global repositories. This will create a new breed of scientists who only deal with mass spectrometric data evaluation independent from access to mass spectrometers just as in bioinformatics.

Unfortunately, there were never any data sharing policies released by the mass spectrometric community itself. The American Society for Mass Spectrometry, which is open to all scientists worldwide and is the leading mass spectrometric society worldwide, never actively pushed or developed data sharing principles for the community. Driven by community efforts, however, the proteomics community [533] and several funding agencies, including the Wellcome Trust (UK), National Cancer Institute (NCI, USA), and National Institutes of Health (NIH, USA), released the International Summit on Proteomics Data Release and Sharing Policy [534], which urges the rapid release of mass spectra, tandem MS, and liquid chromatography MS into the public domain. On the technical level [535], the data sharing problem can be solved with large repositories such as the PeptideAtlas.org or peer-to-peer (P2P) approaches as in the Tranche project [536] at ProteomeCommons.org [537]. Open exchange formats that can store multidimensional data must be further developed. This can only be done with the support of the mass spectrometry industry, which in recent years also opened up parts of their proprietary programming interfaces (APIs) to allow open-source programmers access to specific data formats.

The Human Proteome Organisation was among the leading organizations and supported the open mzData and mzXML formats [538], which were later merged into the upcoming mzML format [539]. Different organizations, such as the Institute for Systems Biology (mzXML, PepXML, and ProtXML) and the Proteomics Standards Initiative (mzData and PSI-MI) started the development of those standard formats and provided vendor-specific converters [537, 540, 541]. The continuous success of each of those formats requires broad support for vendor-specific converter software and additional software that can visualize and manipulate such exchange data. The netCDF [542] and JCAMP-DX [543] formats are widely supported within the GC-MS community. Both formats suffer from missing accurate mass and MSn implementations and the lack of broader development support within the mass spectrometry community.

The crystal structure community has shown that once data and exchange standards are established, no human interaction is needed to collect spectral data [525]. The CrystalEye project (http://wwmm.ch.cam.ac.uk/crystaleye/) shows that the aggregation of crystal structures can be totally robotized using modern web technologies. The only requirement is that the spectral data must be available under open-data licenses (http://www.opendefinition.org/) [544]. Publicly available and open mass spectral resources will allow commercial [545] as well as governmental entities (NIST) [546] that are specialized in collection of mass spectral data to focus on the expensive curation of these data. The enhancement of these spectral datasets with meta-information and compound structures will add value to those collections and allow commercial distribution due to market demands.

We have shown in a recent research publication regarding the data sharing of compound, spectral, and meta-data [521] that parsing bitmap data to obtain multidimensional high-resolution mass spectral data does not keep up with today’s technological possibilities. This approach entails a tremendous loss of information, and it is impossible to investigate (enlarge or zoom) such bitmap or paper-based mass spectra. During the spectral capturing process, many peaks disappear because their associated accurate mass values cannot be obtained (see Fig. 15). Additionally, molecular structure data capturing is an error-prone process. The scientific value of such open-access and open-data shared mass spectral collections, their structures, and associated reaction data, will outweigh initial hesitations as learned from the genomics community.

Fig. 15 Capturing high-resolution mass spectral data from paper publications is an error-prone process. The final machine readable structure usually does not represent the original spectrum (hamburger-to-cow algorithm). New digital data-sharing principles need to be set in place (source: Tobias Kind/FiehnLab)

Future software and hardware advances (opinion)

The success of structure elucidation using mass spectrometric approaches depends not only on technical machine developments but even more on the development of better software algorithms. Particularly, software for working with data output from multiple-stage mass spectrometry (MSn) or data from multiple orthogonal hybrids, including ion mobility, is not yet fully developed. The large proteomics software community with a very active bioinformatics development branch could be a positive example for the small-molecule community. In terms of software development, there will always be a very innovative core of commercial and open-source software developers that will develop state-of-the-art software tools. For taxpayer-funded research in universities and government-funded laboratories, the direction should go towards open-source software or at least towards freely or publicly available software with the least restrictive software licenses to allow commercial and non-commercial exploitation. Community efforts can solve many of the complex software challenges, whereas consistent software support, user help, and error fixing can be obtained from commercial services. The publication and discussion of approaches or programs that are not commercially or publicly available should be avoided because claims made within the publication cannot be independently verified. One of the most important issues is the public sharing of mass spectra and other spectral data from a wide variety of mass spectrometers. This may ultimately lead to a generation of scientists and software developers who specialize in software development for small-molecule identification using mass spectrometry.

In terms of a technological process, it must be stated that in principle, all technological prerequisites for advancement in structure elucidation exist. The hyphenation of LC-MS with NMR seems to be a fast lane towards successful structure elucidation. Hybrid orthogonal approaches that add an additional dimension to the chromatographic side (GC×GC and LC×LC) or mass spectrometric part (ion mobility coupled to time of flight) will particularly enable the extraction of cleaner mass spectra from fully resolved compounds. Increased resolving power, better mass and isotopic abundance accuracy, and high data acquisition rate will enable a faster structure elucidation process. Accurate masses and high resolution for all multiple-stage mass spectra (MSn) will subsequently allow the evolution of new software tools as discussed in this article.

Conclusions

Structure elucidation using mass spectrometry is a challenging field of research with many success stories. Mass spectrometry itself is seldom used for the de novo structure elucidation of small molecules but serves as an important building block together with NMR, IR, X-ray crystallography, and other spectroscopic techniques. Together with hyphenated chromatographic techniques (GC and LC), mass spectrometry serves as a powerful tool for the elucidation of drugs, pesticides, metabolites, and complex chemical mixtures. Mass spectrometry hardware is currently in a very advanced stage with many technologies not fully exploited yet. More data-centric approaches have to be taken in the future. This includes the electronic publishing of investigated structures and their associated multiple-stage mass spectra with open-data licenses. The ultimate success of structure elucidation of small molecules lies in better software programs and the development of sophisticated tools for data evaluation of high-resolution and accurate mass multiple-stage (MSn) mass spectral data.