Advances in structure elucidation of small molecules using mass spectrometry
- Cite this article as:
- Kind, T. & Fiehn, O. Bioanal Rev (2010) 2: 23. doi:10.1007/s12566-010-0015-9
The structural elucidation of small molecules using mass spectrometry plays an important role in modern life sciences and bioanalytical approaches. This review covers different soft and hard ionization techniques and figures of merit for modern mass spectrometers, such as mass resolving power, mass accuracy, isotopic abundance accuracy, accurate mass multiple-stage MS(n) capability, as well as hybrid mass spectrometric and orthogonal chromatographic approaches. The latter part discusses mass spectral data handling strategies, which includes background and noise subtraction, adduct formation and detection, charge state determination, accurate mass measurements, elemental composition determinations, and complex data-dependent setups with ion maps and ion trees. The importance of mass spectral library search algorithms for tandem mass spectra and multiple-stage MS(n) mass spectra as well as mass spectral tree libraries that combine multiple-stage mass spectra are outlined. The successive chapter discusses mass spectral fragmentation pathways, biotransformation reactions and drug metabolism studies, the mass spectral simulation and generation of in silico mass spectra, expert systems for mass spectral interpretation, and the use of computational chemistry to explain gas-phase phenomena. A single chapter discusses data handling for hyphenated approaches including mass spectral deconvolution for clean mass spectra, cheminformatics approaches and structure retention relationships, and retention index predictions for gas and liquid chromatography. The last section reviews the current state of electronic data sharing of mass spectra and discusses the importance of software development for the advancement of structure elucidation of small molecules.
Keywords: Structure elucidation; Mass spectrometry; Tandem mass spectra; Fragmentation prediction; Mass spectral interpretation; Mass spectral library search; Multistage tandem mass spectrometry
Mass spectrometry is a standard technique for the analytical investigation of molecules and complex mixtures. It is important in determining the elemental composition of a molecule and in gaining partial structural insights using mass spectral fragmentations. The final structure confirmation of an unknown organic compound is always performed with a set of independent methods such as one- (1D) and two-dimensional (2D) nuclear magnetic resonance spectroscopy (NMR) or infrared spectroscopy and X-ray crystallography and other spectroscopic methods. The term structure elucidation usually refers to full de novo structure identification, and it results in a complete molecular connection table with correct stereochemical assignments. Such an identification process without any assumptions or pre-knowledge is commonly the domain of nuclear magnetic resonance spectroscopy. The term dereplication often refers to the rediscovery of known natural products by means of mass spectral library search or the interpretation of known mass spectral fragmentations.
Scope of this review
This review investigates theoretical and experimental structure elucidation techniques using mass spectrometry for organic molecules with a molecular mass less than 2,000 Da. The review covers newer techniques within the last 10–15 years; if none were available, then older material was included. Hyphenated separation techniques (gas chromatography coupled to mass spectrometry (GC-MS) and liquid chromatography coupled to mass spectrometry (LC-MS)) are covered due to the close relationship of those techniques with mass spectrometry. Detailed proteomics and peptide sequencing strategies along with the structure elucidation of large biomolecules, such as RNA, DNA, and oligosaccharides/glycans, are outside the scope of this review. The term “small molecules,” used throughout this review, thus refers to all small molecules excluding peptides. Approaches for inorganic mass spectrometry as well as elemental and organometallic analysis are only sparsely covered.
Mass spectral instrumentation and ionization techniques
The history of commercial mass spectrometry instrumentation covers more than 40 years. Brunnee covers the principles of common mass analyzers in a vibrant 1987 review. Gelpi discusses over 130 different mass spectrometers built since 1965 in a series of two reviews [2, 3]. Only one totally new mass spectrometer type, the Orbitrap analyzer [4, 5], has been developed lately. Nevertheless, many new hybrid approaches, among them ion mobility coupled to time-of-flight (TOF) mass spectrometers, have been introduced to the market recently. A series of ionization techniques and figures of merit for mass spectrometers are discussed in the following paragraphs.
Soft and hard ionization techniques
Electron ionization (EI) at 70 eV is historically seen as the oldest ionization technique for small-molecule investigations. Because of the selected constant ionization energy, this technique results in consistent and fragment-rich mass spectra. These mass spectra can be easily used for a mass spectral library search. Electron ionization is commonly used for GC-MS setups. A major disadvantage of mass spectra obtained under EI conditions is the low-abundance or missing molecular ion. An abundant molecular ion, however, is needed for the calculation of elemental compositions. Chemical ionization (CI) is a soft ionization technique mostly used in GC-MS setups to obtain molecular ion information [6, 7]. Supersonic molecular beam interfaces provide the ability to obtain fragment-rich electron ionization spectra together with abundant molecular ions.
Atmospheric pressure chemical ionization (APCI) [21–24], atmospheric pressure photoionization (APPI) [25–28], and matrix-assisted laser desorption/ionization (MALDI) [29–31] are mature soft ionization techniques. Field desorption and field ionization are also soft ionization techniques, and both produce abundant molecular ions with few fragment ions [32–34]. Direct analysis in real time (DART) is an ambient ionization technique and allows for the real-time analysis of the sample. The DART source has been widely used in “open access/walk-up” laboratories together with robotic sample handling. Techniques for sampling molecules from surfaces have been extensively reviewed as well. Secondary ion mass spectrometry (SIMS) and MALDI are used for mass spectrometric imaging, a new and exciting technology to gain spatial and structural insights from tissues and organs [40–42]. Several new surface-based ionization techniques, including desorption electrospray ionization, desorption ionization on silicon, and nanostructure-initiator mass spectrometry, have been developed recently.
Figures of merit of mass spectrometers
Table: Important figures of merit for modern mass spectrometric systems
Figures of merit | Example ranges (multiple instruments)
Mass resolving power | 1000–1,000,000 (at m/z 400)
Isotopic abundance accuracy |
Linear dynamic range |
Accurate mass MSn capability | MS/MS or multiple-stage MSn
 | Pulsed or continuous
Positive/negative polarity switching | Fast switching within run
Robustness, maintenance, ease of use | Just chillin in the lab/get the hell out of here
Instrument and software cost |
 | Benchtop or room size
Software updates with active support | Customer involved or customer ignored
Open-data exchange formats supported | netCDF, mzXML, ASCII
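As a minimal illustration of two of the figures of merit listed above, mass resolving power and mass accuracy can be computed as follows. The sketch assumes the common FWHM definition of resolving power (R = m/Δm, with Δm the full peak width at half maximum) and expresses mass accuracy as a relative error in ppm. The function names and the numbers in the usage lines are hypothetical and serve only as an illustration.

```python
def resolving_power(mz: float, fwhm: float) -> float:
    """Mass resolving power R = m / delta_m (assumption: delta_m is the peak FWHM)."""
    return mz / fwhm


def ppm_error(measured_mz: float, theoretical_mz: float) -> float:
    """Signed mass accuracy error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


if __name__ == "__main__":
    # A peak at m/z 400 that is 0.004 m/z units wide at half maximum
    print(round(resolving_power(400.0, 0.004)))
    # A measurement that is 0.0004 m/z units too high at m/z 400
    print(round(ppm_error(400.0004, 400.0), 2))
```

Quoting the m/z value together with the resolving power, as the table does (at m/z 400), matters because the achievable resolving power depends on m/z for most analyzer types.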
Tandem mass spectrometers and modes of operation
Ion trapping instruments such as quadrupole ion traps and FT-ICR mass spectrometers can be used to create tandem mass spectra, and multiple-stage MSn experiments can be performed without instrument modification or couplings of different mass analyzers . Other hybrid instrument types are discussed in Ref. [2, 3]. Orthogonal or hybrid mass spectrometers are favorable for structural elucidation because they either increase the total peak resolution or they introduce another separation dimension that can be used either to trigger or acquire additional mass spectrometric information [89, 90]. The different modes of operation, which include precursor ion scans, product ion scans, neutral loss scans and selected reaction monitoring, are discussed in De Hoffmann . The MS/MS and MSn scans are usually triggered via data-dependent setups. Multiple precursor ions can be manually selected or the software can acquire tandem mass spectra when a certain peak abundance or signal/noise ratio is exceeded. For example, electrospray ionization with ion mobility mass spectrometry coupled to time-of-flight mass spectrometry (ESI-IMMS-TOF-MS) was used for metabolic profiling of Escherichia coli metabolites , phospholipid , and drug analysis .
Ion activation modes
Collision-induced dissociation (CID), or collisionally activated dissociation, is the most common technique to obtain tandem mass spectra. Precursor ion stability and internal energy under CID have been previously discussed . A series of new fragmentation modes are aimed at improved protein and peptide identification rates by creating more specific fragmentations. These modes include electron capture dissociation (ECD) [96–98], electron transfer dissociation [99–101], and infrared multiphoton dissociation . They are not fully exploited yet for small-molecule applications outside proteomics.
Two-dimensional, three-dimensional, hybrid, and orthogonal chromatographic approaches
Multiple dimension setups are possible on the chromatographic and mass spectrometric sides. On the chromatography side, the usual aim is directed at increasing the peak resolution, which therefore provides a better separation of overlapping compound peaks. The peak capacity can be increased by using different selective chromatographic phases in a two-dimensional or multi-column setup. These approaches are known for liquid chromatography and prominently used for protein identification by coupling an ion exchange column together with a reversed phase column, which coined the term multidimensional protein identification technology . The difference between simple two-dimensional connections such as GC-GC compared with truly orthogonal approaches such as comprehensive two-dimensional GC (GC × GC)  lies in the fact that a modulator is used to accumulate parts of the sample from the first column and pulse the sample to the second shorter column with a different polarity of the stationary phase . The detector must be a fast scanning detector with a high acquisition rate and an example of this is a time-of-flight mass analyzer. Sampling rates are usually between 100 and 200 spectra per second for GC×GC-TOF-MS  instruments. The resulting mass spectra have a very high signal to noise ratio and therefore represent cleaner mass spectra and give better mass spectral library search scores . Miniaturization and the introduction of chip-based liquid chromatography  play a major role in high-throughput methods.
Mass spectral data handling
The following section discusses basic steps that have to be performed to obtain clean and background free mass spectra. Charge state deconvolution, accurate mass measurements, and software algorithms for elemental composition calculations are reviewed. Certain hardware specific setups are discussed when required.
Background and noise subtraction
Automatic background and noise subtraction are standard techniques to obtain clean and interference-free mass spectra. The Biller–Biemann algorithm or similar algorithms by Dromey et al. have been in use for more than 30 years. It is generally advisable to perform blank or solvent runs to obtain possible noise or contamination data. These infusion mass spectra or complete LC-MS and GC-MS runs must be subtracted from the real sample data [111–113]. Most modern mass spectrometry software tools have inbuilt algorithms to perform these tasks. Many of the mentioned algorithms have been developed for EI (70 eV) mass spectra. Several approaches have been introduced with the CODA algorithm of Windig et al. for ESI and LC-MS data, and similar methods have been applied in drug discovery studies [115–117]. A more detailed discussion about automated approaches is covered in the mass spectral deconvolution and biotransformation sections.
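The idea of subtracting a blank or solvent spectrum from a sample spectrum can be sketched in a few lines. This is a deliberately simple peak-wise subtraction, not the Biller–Biemann or CODA algorithm; the function name, the peak-list representation as (m/z, intensity) tuples, and the tolerance are assumptions made for the example.

```python
def subtract_blank(sample, blank, mz_tol=0.5, min_intensity=0.0):
    """Subtract blank intensities from a sample peak list.

    sample, blank: lists of (mz, intensity) tuples (hypothetical format).
    For every sample peak, the most intense blank peak within mz_tol is
    subtracted; peaks at or below min_intensity afterwards are dropped.
    """
    cleaned = []
    for mz, intensity in sample:
        background = max(
            (b_int for b_mz, b_int in blank if abs(b_mz - mz) <= mz_tol),
            default=0.0,
        )
        residual = intensity - background
        if residual > min_intensity:
            cleaned.append((mz, residual))
    return cleaned


if __name__ == "__main__":
    sample = [(100.0, 500.0), (149.0, 300.0), (200.0, 50.0)]
    blank = [(149.0, 290.0), (200.1, 80.0)]
    print(subtract_blank(sample, blank))
```

For complete LC-MS or GC-MS runs the same subtraction would have to be applied scan by scan at matching retention times, which this sketch does not attempt.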
Adduct formation and detection
Ionization techniques such as CI, MALDI, ESI, or APCI show not only single adduct ions but also sets of multiple adducts [118, 119]. The process of adduct formation can be studied using heuristic and computational methods [120, 121]. Solvent and buffer constitution, pKa, pH, substance proton donor and acceptor properties, and gas-phase acidities influence the formation of adducts [122, 123]. Different adducts can also result in different fragmentation pathways. The correct adduct ion must be detected in order to obtain the accurate mass of the neutral molecule. One possible solution is to increase the concentration of specific ions in the liquid phase so that those adducts are formed preferentially. When analyzing lipids, lithium is used as a modifier to obtain characteristic [M+Li]+ ions. An extended list of common electrospray adducts, including [M+H]+, [M+NH4]+, [M+Na]+, and [M−H]−, has been prepared. In the case of MALDI, metal cation adducts [M+Na]+ and [M+K]+ are often observed [29, 128]. Software tools such as CAMERA and IntelliXtract, and tools for infusion spectra, can help detect adduct ions in mass spectra automatically. Currently, no software exists that can predict adduct probabilities based on a given compound structure for a specified ionization mode (CI, ESI, APCI, and APPI).
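The need to identify the correct adduct before calculating the neutral mass can be illustrated with a short sketch. It is not how CAMERA or IntelliXtract work internally; it only tests which pairs of singly charged adducts would explain two observed m/z values with one common neutral mass. The adduct mass shifts are standard monoisotopic values (ion mass minus or plus one electron), rounded to five or six decimals; the function names and tolerance are assumptions.

```python
# Monoisotopic mass shift of each singly charged adduct relative to neutral M
ADDUCT_MASS_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.03383,
    "[M+Na]+": 22.98922,
    "[M+K]+": 38.963158,
    "[M+Li]+": 7.01545,
    "[M-H]-": -1.007276,
}


def neutral_mass(mz, adduct):
    """Neutral monoisotopic mass for a singly charged adduct ion."""
    return mz - ADDUCT_MASS_SHIFTS[adduct]


def consistent_adduct_pairs(mz_a, mz_b, tol=0.005):
    """Adduct pairs of equal polarity that yield the same neutral mass."""
    pairs = []
    for a, shift_a in ADDUCT_MASS_SHIFTS.items():
        for b, shift_b in ADDUCT_MASS_SHIFTS.items():
            same_polarity = a[-1] == b[-1]
            if a != b and same_polarity:
                if abs((mz_a - shift_a) - (mz_b - shift_b)) <= tol:
                    pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Two peaks observed in the same positive-mode spectrum (illustrative)
    print(consistent_adduct_pairs(181.07067, 203.05261))
    print(round(neutral_mass(203.05261, "[M+Na]+"), 4))
```

A spacing of about 21.982 between two peaks is the typical signature of a protonated and a sodiated form of the same molecule, which is what the usage example reproduces.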
Charge state deconvolution
Accurate mass measurements
Higher mass accuracy on unit mass resolution instruments can be obtained using post-processing peak shaping algorithms as implemented in the MassWorks software (Cerno Biosciences) [150, 151]. These algorithms use an internal calibrant that is later used for post-calibration of mass accuracy errors. Unit resolution mass spectrometers (inaccurate mass spectrometers) can be converted into accurate mass spectrometers as long as mass spectral data are obtained in profile mode, which is required to perform the spectral peak shape correction. If data are obtained in centroid mode or stick mode, then no such post-correction can be performed. A correction for spectral accuracy can also be performed with high-resolution data . Artificial neural network calibration  in conjunction with AGC and better peak centroiding can improve the mass accuracy on FT-MS instruments to reach 100 ppb for certain experiments .
Several unit mass resolution instruments, including ion traps and triple quadrupole instruments, allow a hardware-based high-resolution or ultra-zoom scan. This zoom scan can be used for accurate mass measurements or better charge state assignments. The resolving power can usually be increased by one order of magnitude, for example from 1,000 to 10,000. However, the m/z scan range is usually very limited, and the duty cycle is high for enhanced resolution scans.
Isotope abundance measurements and isotopic pattern calculations
The isotopic abundances of common monoisotopic (F, Na, P, and I) or polyisotopic (H, C, N, O, S, Cl, and Br) elements are listed . Isotopic abundances are measured and have been utilized in mass spectrometric measurements since the beginning of mass spectrometry . The most sensitive and accurate method for isotopic abundance measurements is accelerator mass spectrometry , and this method is used for age determination, forensics, and food monitoring . Its precision is around 0.05% for the measurement of the 13C/12C ratio  requiring total combustion of the sample. The availability of commodity mass spectrometers delivering isotopic abundance errors less than ±5% was utilized for LC-MS-based screening approaches [161–164] and environmental screening applications [165–167].
To filter or match elemental compositions according to their experimental isotopic abundances, the high- or low-resolution isotopic envelopes of molecular formulas must be calculated. Several algorithms have been proposed to calculate the isotopic fine structures and allow the modeling of Gaussian peak shapes according to the selected resolving power of the instrument. Several of the algorithms implement either polynomial-based methods or Fourier transform-based methods (IsoDalton, MWTWIN, Mercury, IsotopeCalculator, IsoPro, emass/qmass, libmercury++, ISOMABS, and Decon2Ls) [168–171]. Isotopic abundances from tandem mass spectra and multiple-stage MSn can yield additional information that can help during the structure elucidation process [172–174].
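A polynomial-based isotope pattern calculation can be sketched as repeated multiplication (convolution) of per-element isotope distributions. The example below is a simplified illustration at nominal mass resolution only: it does not compute isotopic fine structure or Gaussian peak shapes, and it is not the implementation used by any of the programs named above. The natural abundances are rounded standard values, and the function names are hypothetical.

```python
import re

# element -> list of (nominal mass offset from lightest isotope, abundance)
ISOTOPES = {
    "C": [(0, 0.9893), (1, 0.0107)],
    "H": [(0, 0.999885), (1, 0.000115)],
    "N": [(0, 0.99636), (1, 0.00364)],
    "O": [(0, 0.99757), (1, 0.00038), (2, 0.00205)],
    "S": [(0, 0.9499), (1, 0.0075), (2, 0.0425), (4, 0.0001)],
    "Cl": [(0, 0.7576), (2, 0.2424)],
    "Br": [(0, 0.5069), (2, 0.4931)],
}


def parse_formula(formula):
    """Parse a simple formula such as 'C6H12O6' into element counts."""
    counts = {}
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[elem] = counts.get(elem, 0) + (int(num) if num else 1)
    return counts


def convolve(p, q, max_len):
    """Multiply two abundance polynomials, keeping at most max_len terms."""
    out = [0.0] * min(len(p) + len(q) - 1, max_len)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if i + j < len(out):
                out[i + j] += a * b
    return out


def isotope_pattern(formula, n_peaks=5):
    """Relative abundances of M, M+1, M+2, ... normalized to the base peak."""
    pattern = [1.0]
    for elem, count in parse_formula(formula).items():
        poly = [0.0] * (max(shift for shift, _ in ISOTOPES[elem]) + 1)
        for shift, abundance in ISOTOPES[elem]:
            poly[shift] = abundance
        for _ in range(count):
            pattern = convolve(pattern, poly, n_peaks)
    base = max(pattern)
    return [p / base for p in pattern]


if __name__ == "__main__":
    print([round(p, 4) for p in isotope_pattern("C6H12O6")])
    print([round(p, 4) for p in isotope_pattern("CH3Cl")])
```

Truncating each intermediate product to `n_peaks` terms keeps the computation small; Fourier transform-based methods replace the repeated multiplication with a product in the frequency domain.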
Elemental composition determination
The determination of the molecular formula or elemental composition requires a clean mass spectrum with no interfering noise or coeluting compounds. A process for elemental composition determination from electrospray data was described in Ref. . The algorithm includes a decision making step for proton and alkali metal adducts, automated determination of charge states and overlapping peaks, and an isotopic pattern matching. It was validated with 220 pharmaceutical compounds and yielded a success rate of 90%. Isotope-enriched metabolites can be investigated using a method that includes spectral correlation methods along with mass accuracy and isotope ratio filters . Another software discusses the use of isotopic abundance ratios to confirm or reject NIST mass spectral library search results . A series of papers discusses the process of isotopic pattern matching for elemental formula determination in environmental chemistry [165–167], metabolic profiling experiments [178, 179], and geochemistry [180, 181]. The freely available software SIRIUS (Sum formula Identification by Ranking Isotope patterns Using mass Spectrometry)  has a user-friendly graphical interface and can be used on LINUX, MAC, and Windows platforms. The newer implementation “SIRIUS Starburst” also includes features such as peak intensity, number of hetero atoms in the molecular formula, neutral losses, and tandem mass spectral information .
The influence of spectral accuracy of molecular ions on elemental composition calculations was investigated on a high-resolution mass spectrometer . The automated correction of isotope pattern abundance errors using peak shaping and correction algorithms resulted in better identification rates of the molecular formulas. An algorithm for isotopic pattern calculation that includes stable isotope markers (13C and 15N labeled) was developed . Recently, an approach was developed that uses elemental formula calculations with database lookup and a subsequent in silico generation of CID mass spectra from the obtained isomer structures . The obtained in silico tandem mass spectra (calculated by MassFrontier) were then compared with experimental CID spectra. This approach combined with additional filter constraints and possible MSn fragmentation information can be used for compound annotations (compound dereplication), provided that the structure is known in compound databases. Other prerequisites such as proper validation of the in silico prediction algorithms and use of larger datasets will be discussed in a later chapter.
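The first step of such workflows, calculating candidate elemental formulas from an accurate mass, can be sketched as a brute-force enumeration. The example below is restricted to C, H, N, and O, uses a simple upper bound on the hydrogen count, and omits the isotopic pattern and heuristic filters discussed above; it is not the algorithm of SIRIUS or any other cited tool. Element limits, tolerance, and function names are assumptions.

```python
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def format_formula(c, h, n, o):
    """Build a formula string, omitting zero counts and counts of one."""
    parts = []
    for symbol, k in (("C", c), ("H", h), ("N", n), ("O", o)):
        if k:
            parts.append(symbol + (str(k) if k > 1 else ""))
    return "".join(parts)


def candidate_formulas(neutral_mass, ppm=5.0, max_n=10, max_o=20, max_c=50):
    """Enumerate CHNO formulas within a ppm window of a neutral mass.

    Returns (formula, ppm_error) tuples sorted by absolute error.
    """
    tol = neutral_mass * ppm * 1e-6
    hits = []
    for c in range(1, min(max_c, int(neutral_mass // MONO["C"])) + 1):
        for n in range(max_n + 1):
            for o in range(max_o + 1):
                rest = neutral_mass - (c * MONO["C"] + n * MONO["N"] + o * MONO["O"])
                if rest < -tol:
                    break  # adding more oxygen only makes the formula heavier
                h = round(rest / MONO["H"])
                if h < 0 or h > 2 * c + n + 2:
                    continue  # more hydrogens than a saturated molecule allows
                mass = c * MONO["C"] + h * MONO["H"] + n * MONO["N"] + o * MONO["O"]
                if abs(mass - neutral_mass) <= tol:
                    error = (mass - neutral_mass) / neutral_mass * 1e6
                    hits.append((format_formula(c, h, n, o), error))
    return sorted(hits, key=lambda hit: abs(hit[1]))


if __name__ == "__main__":
    for formula, error in candidate_formulas(180.06339, ppm=5.0):
        print(formula, round(error, 2))
```

The number of candidates grows quickly with mass and with the number of allowed elements, which is why isotopic abundance filters and database lookups are applied after this step.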
Algorithms for formula calculation from high-resolution MS/MS data
Another approach used accurate masses from MS/MS product ions during the investigation of fragmentation processes of some natural products [189, 190]. SIRIUS Starburst is a freely available software that combines MS/MS fragment and element ratio information with elemental composition determinations. A useful hardware-based approach, the acquisition of exact masses at alternating low and high collision energy (MSE), can lead to more accurate elemental formula determinations.
Complex data-dependent setups including ion maps and ion trees
Data-dependent acquisition methods are used in most of today’s tandem mass spectrometers [87, 192–197]. The mass spectrometry software triggers MS/MS or MSn product ion scans based on specific events. The trigger can be set on events such as the most abundant peaks, manually selected masses, specific neutral losses, or specific isotopic patterns.
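The simplest of these triggers, selecting the most abundant peaks of a survey scan above an intensity threshold, can be sketched as follows. The exclusion list for already fragmented precursors is an added assumption that is common in practice but not described in the text above; the function name, parameters, and peak values are hypothetical and do not correspond to any vendor software.

```python
def select_precursors(survey_scan, top_n=3, min_intensity=1000.0,
                      exclusion=(), mz_tol=0.5):
    """Pick precursor m/z values for MS/MS from a survey scan.

    survey_scan: list of (mz, intensity) tuples.
    exclusion: m/z values to skip (assumption: previously fragmented ions).
    """
    candidates = [
        (mz, intensity)
        for mz, intensity in survey_scan
        if intensity >= min_intensity
        and not any(abs(mz - excluded) <= mz_tol for excluded in exclusion)
    ]
    # Most abundant peaks first
    candidates.sort(key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in candidates[:top_n]]


if __name__ == "__main__":
    scan = [(150.1, 500.0), (203.05, 8000.0), (301.2, 12000.0),
            (410.3, 2500.0), (512.4, 900.0)]
    print(select_precursors(scan, top_n=2))
    print(select_precursors(scan, top_n=2, exclusion=[301.2]))
```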
Mass spectral library search
Mass spectral library search is the first step in any mass spectral interpretation and therefore will be discussed in deeper detail. Mass spectral search can be performed with unit mass and high-resolution mass spectra of all stages (MS to MSn). The aim of a library search is either to obtain a correct structure hit of compounds already in the library or to obtain partial structural insights from compounds that nearly match. For that purpose, an experimental mass spectrum is searched against a large collection of already recorded mass spectra that are stored in a database. A general review of mass spectral libraries  and mass spectral search algorithms [216, 217] has been written.
MS and MS/MS and MSn libraries and search algorithms
Search algorithms for electron ionization spectra were developed first , and these include the INCOS algorithm, probability-based matching (PBM) , and dot-product algorithm . The size of publicly and commercially available MS/MS libraries is small compared with electron ionization libraries (Wiley and NIST) that cover several hundred thousand electron ionization mass spectra. Currently, the NIST08 MS/MS collection is a large commercially available database with 14,802 MS/MS spectra from 5,308 precursor ions. There are a variety of commercial libraries that have been generated for certain instrument types and settings. The publicly available Massbank [219, 220] and ReSpect database (RIKEN) [221–223] are databases currently covering 24,772 mass spectra and tandem mass spectra from 13,200 compounds. An electrospray tandem mass spectrometry library (ESI-MS/MS) for forensic applications covered 5,600 spectra of 1,253 compounds acquired at different ionization voltages using a hybrid tandem mass spectrometer coupled to a linear ion trap . Smaller but specialized libraries are in use for toxicological screening and drug analysis [225, 226]. An in-house library of MS/MS spectra from 1,200 natural products with the majority of entries having [M+H]+ adducts and 95% of those compounds being able to ionize in positive mode was investigated in Ref. . Tandem mass spectra are not as reproducible as electron ionization spectra when obtained from different instruments. However, the creation of reproducible and transferable MS/MS spectral libraries for use on multiple instrument types  is possible [229, 230]. A fragmentation energy index was proposed for LC-MS  to normalize collision energies and create reproducible spectra comparable to 70-eV electron ionization spectra. 
Another study compared tandem mass spectra obtained from quadrupole–quadrupole–time of flight, quadrupole–quadrupole–linear ion trap, quadrupole–quadrupole–quadrupole, and linear ion trap–Fourier transform ion cyclotron resonance mass spectrometer and came to the conclusion that platform independent MS/MS spectra can be obtained with multiple fragmentation voltage settings [232–234].
Search algorithms for MS/MS spectra of small molecules can use similar approaches as used for EI mass spectra [55, 235]. Peptide mass spectra usually show specific fragmentations, and a series of specialized search algorithms were developed for these purposes [236, 237]. MS/MS spectra can be searched according to spectral similarity, probability match (PBM) [216, 239], or dot-product algorithm search. If the MS/MS spectra were obtained in data-dependent mode and precursor mass information is available, this precursor mass can be used as a powerful first filter for all subsequent MS/MS matches [240, 241]. The precursor m/z search window can be selected according to the experimentally determined mass accuracy of the instrument. Well-calibrated unit mass resolution instruments can reach a mass accuracy of ±0.5 Da (or better with post-calibration methods). In this case, a precursor search window of ±0.5 Da can be set for MS/MS search. The subsequent MS/MS match uses a product ion window search tolerance that is slightly higher due to possible hydrogen shifts. Well-established dot product, PBM, and reverse search algorithms are used to match the filtered MS/MS spectra. The accuracy, recall, precision, and true and false discovery rates of the selected algorithm, as well as all other statistical parameters, are best obtained from test sets with known spectra and decoy mass spectral datasets, as practiced in the proteomics community [242–245]. The freely available NIST Mass Spectral Search Program contains efficient algorithms to search accurate mass tandem mass spectra, including m/z precursor and product ion filtering. Moreover, NIST MS Search can handle and search molecular structures together with their associated mass spectra, which is an obligatory prerequisite for any advanced library search program.
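The two-stage search described above, a precursor m/z filter followed by a dot-product match of the product ions, can be sketched as follows. This is a plain cosine score with greedy one-to-one peak matching and without the m/z or intensity weighting that established search programs apply; it is not the NIST algorithm. The library layout (a list of dictionaries), the function names, and the tolerances are assumptions.

```python
import math


def cosine_score(query, reference, frag_tol=0.7):
    """Cosine similarity between two peak lists of (mz, intensity) tuples.

    Each query peak is matched to the closest unused reference peak
    within frag_tol (greedy one-to-one matching).
    """
    used = set()
    dot = 0.0
    for q_mz, q_int in query:
        best = None
        for idx, (r_mz, _) in enumerate(reference):
            if idx in used or abs(q_mz - r_mz) > frag_tol:
                continue
            if best is None or abs(q_mz - r_mz) < abs(q_mz - reference[best][0]):
                best = idx
        if best is not None:
            used.add(best)
            dot += q_int * reference[best][1]
    norm_q = math.sqrt(sum(i * i for _, i in query))
    norm_r = math.sqrt(sum(i * i for _, i in reference))
    return dot / (norm_q * norm_r) if norm_q and norm_r else 0.0


def library_search(precursor_mz, peaks, library, precursor_tol=0.5, frag_tol=0.7):
    """Return (score, name) hits, best first, after precursor filtering."""
    hits = []
    for entry in library:
        if abs(entry["precursor_mz"] - precursor_mz) <= precursor_tol:
            score = cosine_score(peaks, entry["peaks"], frag_tol)
            hits.append((score, entry["name"]))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    library = [
        {"name": "compound A", "precursor_mz": 181.1,
         "peaks": [(163.1, 100.0), (145.0, 40.0), (127.0, 15.0)]},
        {"name": "compound B", "precursor_mz": 181.2,
         "peaks": [(135.0, 100.0), (107.1, 60.0)]},
        {"name": "compound C", "precursor_mz": 250.3,
         "peaks": [(163.1, 100.0), (145.0, 40.0)]},
    ]
    query = [(163.1, 100.0), (145.0, 40.0), (127.0, 15.0)]
    for score, name in library_search(181.1, query, library):
        print(name, round(score, 3))
```

In the usage example, compound C is never scored although its product ions overlap with the query, because its precursor m/z falls outside the ±0.5 Da window.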
Mass spectral trees combine multiple-stage mass spectra
Mass spectral interpretation
Many of the developments in mass spectral interpretation are deeply rooted in the era of electron ionization mass spectrometry from the 1970s and 1980s. Hence, mass spectral fragmentation interpretation rules are best developed for EI mass spectrometry. The red book entitled “Interpretation of mass spectra” written by Turecek and McLafferty , the book entitled “Introduction to Mass Spectrometry” by Watson and Sparkman , and “Understanding mass spectra: a basic approach” by Smith  are standard sources for mass spectrometrists investigating electron ionization spectra. These books contain very detailed explanations of reactions and fragmentation pathways, including rearrangement reactions, homolytic or heterolytic bond cleavages, hydrogen rearrangements, electron shifts, resonance reactions, and aromatic stabilizations. Any de novo interpretation without any pre-knowledge is still challenging, if not totally impossible, due to the high molecular diversity and many similar compound structures.
The even electron rule states that even-electron ions usually fragment by losing neutral molecules rather than radicals, but radical loss can also occur in the case of aromatic and nitroaromatic compounds [254, 255]. Under positive electrospray ionization (ESI), most fragment ions were reported to be even-electron ions, whereas the formation of odd-electron ions under EI was significantly higher. The Stevenson rule states that, when a bond is cleaved, the charge is preferentially retained by the fragment with the lower ionization energy, which therefore gives the more abundant peak in the mass spectrum. The nitrogen rule should in principle only be used for unit resolution mass spectra because high-resolution and high-accuracy mass spectrometry can always calculate the correct number of nitrogen atoms. The Rings Plus Double Bonds Equivalent (RDBE) should not be used with elements that allow multiple valence counts (such as phosphorus and sulfur), as otherwise only possible RDBE ranges can be obtained instead of unique solutions. Mass spectral visualization techniques such as van Krevelen or Kendrick plots, and spectral mappings using dimension reduction methods with principal component analysis, are helpful for the investigation of unresolved and complex organic matter (petroleum, coal, sediments, and fulvic acids) [259, 260].
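Two of these rules are simple enough to state as code. The sketch below uses the textbook RDBE formula for molecules containing C, H, N, O, and halogens with their lowest valences (RDBE = C − (H + X)/2 + N/2 + 1), and the nitrogen rule for neutral molecules (an odd nominal mass corresponds to an odd number of nitrogen atoms). Elements with multiple valence states such as phosphorus and sulfur are deliberately left out, in line with the caveat above; the function names are hypothetical.

```python
def rdbe(c, h, n=0, halogens=0):
    """Rings plus double bond equivalents for a CHNO(X) formula.

    Assumes the lowest valence for every element; oxygen does not
    contribute, halogens are counted like hydrogen.
    """
    return c - (h + halogens) / 2 + n / 2 + 1


def nitrogen_rule_consistent(nominal_mass, n):
    """True if the nominal mass of a neutral molecule fits its N count.

    Applies to the neutral molecule (or the odd-electron molecular ion),
    not to protonated or otherwise adducted ions.
    """
    return nominal_mass % 2 == n % 2


if __name__ == "__main__":
    print(rdbe(6, 6))                       # benzene, C6H6
    print(rdbe(8, 10, n=4))                 # caffeine, C8H10N4O2
    print(nitrogen_rule_consistent(79, 1))  # pyridine, C5H5N
```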
Electron ionization and chemical ionization mass spectrometry
Electron ionization at 70 eV is a very hard ionization technique that results in very complex rearrangements and fragmentations. The EI mass spectra themselves are very reproducible, which is important for a mass spectral library search. The ions in the gas phase have no “memory” of where they originate from, which renders the structural interpretation of full scan EI mass spectra very complex. Electron ionization MS/MS with accurate masses may ease that problem. Several book chapters discuss the most important aspects of CI [6, 263]. One interesting aspect of chemical ionization is that multiple ionization gases with different proton acidities can be used, which results in different molecular ions for correct molecular ion and elemental composition determination. Although most GC-MS instruments are capable of performing CI analysis, the use of chemical ionization GC-MS is not common anymore. One reason may be the non-existence of chemical ionization mass spectral libraries and the lower sensitivity during chemical ionization GC-MS measurements. Nevertheless, chemical ionization GC-MS remains an attractive technique for structural identifications due to the capability of obtaining abundant molecular ions.
Electrospray and atmospheric pressure chemical ionization
The study of the fragmentation behavior of compounds under electrospray conditions (ESI) [11, 264] is an important topic due to the wide availability of LC-MS devices with ESI interfaces. Using high-resolution CID data, compound substructures were ranked using a systematic bond disconnection approach. In a similar approach for the structural investigation of MS/MS product ion spectra, the authors of a freely available software used a brute-force ab initio combinatorial approach to generate possible fragment ions [266, 267], and they concluded that it is “a non-trivial task to accomplish.” Currently, only MassFrontier contains a large fragmentation reaction library as discussed in the section below. Different voltage settings should be selected for complete coverage of fragmentations. Automatic solutions such as CID voltage ramping exist for obtaining maximum fragmentation patterns. A lookup table of common neutral losses during CID fragmentation has also been published, and typical fragmentations for atmospheric pressure ionization are discussed in the literature.
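How a neutral loss lookup table is used with accurate mass CID data can be sketched as follows. The four losses and their monoisotopic masses are standard values chosen for illustration and are not taken from the published table mentioned above; the function name and tolerance are assumptions.

```python
# Small illustrative lookup table of neutral losses (monoisotopic masses)
NEUTRAL_LOSSES = {
    "H2O": 18.010565,
    "NH3": 17.026549,
    "CO": 27.994915,
    "CO2": 43.989829,
}


def annotate_neutral_losses(precursor_mz, fragment_mzs, tol=0.005):
    """Map each fragment m/z to the neutral losses that explain it."""
    annotations = {}
    for fragment in fragment_mzs:
        loss = precursor_mz - fragment
        annotations[fragment] = [
            name for name, mass in NEUTRAL_LOSSES.items()
            if abs(loss - mass) <= tol
        ]
    return annotations


if __name__ == "__main__":
    result = annotate_neutral_losses(181.0707, [163.0601, 153.0758, 100.0])
    for fragment, losses in result.items():
        print(fragment, losses)
```

The sketch only considers single losses from the precursor; consecutive losses would require comparing fragments with each other as well.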
Determination of stereochemistry using mass spectrometry
The determination of stereochemical (absolute) configuration usually requires a separation technique such as GC, CE, or LC with chiral columns. ESI-MS was used to determine the binding affinities of ion-molecule reactions by performing CID experiments of host–guest complexes . It is possible to determine the chirality of molecules without preseparation using chiral selector agents and ESI-MS/MS . Additionally, traveling wave ion mobility spectrometry can be used to determine stereochemistry. The book titled “Applications of Mass Spectrometry to Organic Stereochemistry”  discusses practical approaches for stereochemical investigations of molecules.
Determination of 3D conformations using mass spectrometry
Although conformational changes of small molecules can be monitored using mass spectrometry, this approach was usually applied to high molecular weight compounds such as peptides and proteins  with the requirement of high resolving power. Mainly, protein folding and dynamics  have been studied in recent years. It has been reported that small-molecule mass spectra show differences depending on the 3D conformation of the molecule . The determination of the conformational changes of small molecules is possible using ion mobility mass spectrometers or hybrids thereof. This approach requires the experimental determination of cross sections from known molecules and the use of such data for theoretical models [276, 277].
Biotransformation reactions and drug metabolism studies with mass spectrometry
Biotransformation and drug metabolism studies play a crucial role in all analytical studies targeted at drug design for phase I and phase II metabolites. The tools and approaches discussed in this section aim to identify or predict in vivo metabolites from cytochrome P450 (CYP) enzymes and to guide preclinical drug metabolism and pharmacokinetics as well as absorption, distribution, metabolism, and excretion/Tox studies. More than 50 CYPs are known in humans, and the CYP1A2, CYP2C9, CYP2C19, CYP2D6, CYP3A4, and CYP3A5 enzymes metabolize 90% of drugs. In pharmacokinetics and metabolism studies, the pathway of one single drug and all related enzymatically transformed metabolites are investigated. Levsen et al. discuss the utilization of tandem mass spectrometry for the investigation of phase II metabolites. In recent years, software expert algorithms for metabolite predictions have been developed, and these include tools such as DEREK, Catabol, LHASA, MetaboGen, METEOR, and MetabolExpert [281–284]. The software works along known metabolic transformation rules and performs an in silico prediction of possible metabolites. Those metabolite structures can be identified later either by accurate mass shifts or by tandem mass spectrometry. Specialized mass spectrometry-centric software from vendors, such as Metabolite ID (AB Sciex), Metabolynx (Waters), Metworks (Thermo), MassHunter/Metabolite ID (Agilent), and MetaboliteTools (Bruker), mostly uses a combination of accurate mass, neutral loss, and biotransformation rules with associated accurate masses for metabolite identification.
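Matching an observed accurate mass shift against biotransformation rules can be sketched by deriving each shift from its molecular formula change. Only the change "−F + H" for the reductive displacement of fluorine is stated in this text; the other formula changes below are assumptions filled in from general chemistry for illustration and should be checked against the original table. Function names and the tolerance are hypothetical, and the sketch is not the method of any vendor software.

```python
MONO = {"H": 1.00782503, "C": 12.0, "N": 14.00307401,
        "O": 15.99491462, "F": 18.99840316, "Cl": 34.96885268}

# name -> change in element counts (negative = atoms removed)
BIOTRANSFORMATIONS = {
    "Reductive displacement of fluorine": {"F": -1, "H": 1},
    # Assumed formula changes (not given in the text):
    "Oxidative displacement of fluorine": {"F": -1, "O": 1, "H": 1},
    "Oxidative displacement of chlorine": {"Cl": -1, "O": 1, "H": 1},
    "Reductive displacement of chlorine": {"Cl": -1, "H": 1},
    "Alcohol to carboxylic acid": {"O": 1, "H": -2},
    "Methyl to carboxylic acid": {"O": 2, "H": -2},
}


def mass_shift(change):
    """Monoisotopic mass change in Da for a formula change."""
    return sum(MONO[element] * count for element, count in change.items())


def annotate_shift(parent_mz, metabolite_mz, tol=0.003):
    """Names of biotransformations whose mass shift matches within tol."""
    delta = metabolite_mz - parent_mz
    return [
        name for name, change in BIOTRANSFORMATIONS.items()
        if abs(mass_shift(change) - delta) <= tol
    ]


if __name__ == "__main__":
    for name, change in BIOTRANSFORMATIONS.items():
        print(f"{mass_shift(change):+.4f}  {name}")
    print(annotate_shift(300.0, 300.0 - 17.990578))
```

The reductive displacement of fluorine (about −17.9906 Da) and the assumed oxidative displacement of chlorine (about −17.9661 Da) share the same nominal mass shift and are separated only by accurate mass, which illustrates why these rules are paired with accurate mass measurements.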
Selected biotransformations for in vivo drug metabolism studies detectable by accurate mass spectrometry (reproduced with permission of Future Science Ltd)
Columns: biotransformation, mass change (Da), molecular formula change
Phase I: CYP, FMO
- Oxidative displacement of chlorine
- Oxidative displacement of fluorine
Phase I: reductases (e.g., CYP)
- Reductive displacement of fluorine: –F + H
Phase I: dehydrogenases (ADH, ALDH), aldoketoreductases (e.g., ALK)
Phase I: other enzymes (not easily assignable)
- Reductive displacement of chlorine
- Loss of nitro group
- Alcohol to carboxylic acid
- Methyl to carboxylic acid
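The link between a molecular formula change and its accurate mass change can be illustrated with a short calculation. The following is a minimal sketch, not the method of any vendor software named above: the function name `mass_shift` is hypothetical, the monoisotopic masses are standard rounded values, and the hydroxylation example is added purely for illustration.

```python
# Monoisotopic masses (Da) of the most abundant isotopes (standard values, rounded).
MONOISOTOPIC = {
    "H": 1.00782503,
    "C": 12.0,
    "N": 14.0030740,
    "O": 15.9949146,
    "F": 18.9984032,
    "Cl": 34.9688527,
}


def mass_shift(formula_change):
    """Return the monoisotopic mass change (Da) for a formula change.

    formula_change maps element symbols to signed atom counts,
    e.g. {"F": -1, "H": +1} for the change "-F + H".
    """
    return sum(MONOISOTOPIC[element] * count
               for element, count in formula_change.items())


# Reductive displacement of fluorine (-F + H), the formula change given in the table.
reductive_defluorination = mass_shift({"F": -1, "H": +1})

# Hydroxylation (+O); an illustrative assumption, not taken from the table.
hydroxylation = mass_shift({"O": +1})

print(f"-F + H: {reductive_defluorination:+.4f} Da")
print(f"+O:     {hydroxylation:+.4f} Da")
```

A metabolite candidate is then any detected ion whose accurate mass differs from that of the parent drug by one of these shifts within the mass accuracy of the instrument.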
The availability of hybrid triple quadrupole mass spectrometers with linear ion traps (QTRAP) allows the sensitive detection of metabolites using multiple reaction monitoring (MRM) and a subsequent MS/MS (product ion) scan for metabolite identification or annotation [194, 195, 294]. Newly developed software (LightSight, ABI/Sciex) [295–297] can automatically create MRM or multiple ion monitoring transitions. This software approach, called predictive MRM, allows for a very sensitive analysis and detection of new metabolites.
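The idea behind predictive MRM can be sketched as follows. This is an illustrative simplification and not the algorithm of the commercial software: the function `predicted_transitions`, the parent transition, and the short shift table are all hypothetical, and only singly charged ions are considered. For every expected biotransformation, one transition assumes that the fragment does not carry the modification and one assumes that it does.

```python
# Monoisotopic mass shifts (Da) for a few common biotransformations (rounded).
SHIFTS = {
    "hydroxylation (+O)": 15.9949,
    "demethylation (-CH2)": -14.0157,
    "glucuronidation (+C6H8O6)": 176.0321,
}


def predicted_transitions(precursor_mz, product_mz, shifts):
    """Build (name, Q1 m/z, Q3 m/z) tuples for hypothetical metabolites."""
    transitions = []
    for name, delta in shifts.items():
        q1 = round(precursor_mz + delta, 4)
        # Case 1: the monitored fragment does not contain the modified site.
        transitions.append((name, q1, round(product_mz, 4)))
        # Case 2: the monitored fragment retains the modification.
        transitions.append((name, q1, round(product_mz + delta, 4)))
    return transitions


# Hypothetical parent drug transition: m/z 300.1000 -> 150.0500.
transitions = predicted_transitions(300.1000, 150.0500, SHIFTS)
for name, q1, q3 in transitions:
    print(f"{name:28s} {q1:9.4f} -> {q3:9.4f}")
```

In practice such a list grows quickly with the number of biotransformations and fragments, which is why the transitions are usually generated automatically rather than typed by hand.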
Isotope labeling studies
Stable isotopic labeling studies [193, 299] and hydrogen/deuterium exchange reactions [300–302] are commonly applied in drug metabolism studies. Proteomics approaches use labeling studies for the quantification of peptides and proteins [303–305] as well as mass defect isotopomer studies [306, 307]. In vivo labeling with stable isotopes can be applied for metabolism studies in plants [308, 309], isotopomer-based flux balance analysis [310–315], and structural elucidation of unknown compounds [197, 316–319]. The use of deuterated mobile phases (D2O) or post-column infusion of D2O has been popular over the last several years for metabolite identification studies [320–322].
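How an H/D exchange experiment yields the number of exchangeable hydrogens can be shown with simple arithmetic. The sketch below assumes a singly charged ion, complete exchange, and positive ion mode, in which the charge-carrying proton is also replaced by a deuteron; the function name and the example m/z values are hypothetical.

```python
# Mass difference between deuterium and protium (Da), from standard isotope masses.
D_MINUS_H = 2.01410178 - 1.00782503


def exchangeable_hydrogens(mz_in_h2o, mz_in_d2o, charge_carrier_is_deuteron=True):
    """Estimate the number of exchangeable hydrogens of the neutral molecule.

    The observed m/z shift is divided by the D-H mass difference. In positive
    mode with a deuterated mobile phase, [M+H]+ becomes [M(d_n)+D]+, so one
    unit of the shift belongs to the charge carrier and is subtracted.
    """
    shift = mz_in_d2o - mz_in_h2o
    total = round(shift / D_MINUS_H)
    return total - 1 if charge_carrier_is_deuteron else total


# Hypothetical measurement: m/z 180.1019 in H2O and m/z 184.1270 in D2O.
n_exchangeable = exchangeable_hydrogens(180.1019, 184.1270)
print(n_exchangeable)
```

The resulting count of polar hydrogens (for example from OH, NH, or SH groups) can be used to reject candidate structures with a different number of exchangeable sites.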
Determination of impurities and contaminants
The elucidation of impurities is a recurring task during daily lab work. Contaminants can be avoided either by experience or by better quality control of all reagents and solvents used. For GC-MS, LC-MS, and CE-MS, this includes the purchase of solvents and reagents in batches to obtain consistent quality and the use of quality check monitoring procedures. The resulting chromatograms or mass spectra (solvent blanks or reagent blanks) need to be stored long term to monitor impurities over months and years. Existing collections of fragments and ions can help during the investigation of such contaminations. Certain detergents and buffer components (Triton X) are ionized very efficiently in ESI mode and result in highly abundant peaks that suppress the signal of other ions. A comprehensive review discusses mostly ESI and MALDI interferences and contains a large Excel sheet in the supplementary data section that covers around 800 potential interferences and contaminant ions in positive and negative electrospray mode. It also contains 40 repetitive fragments such as sodium formate clusters (NaHCO2) and lists multiple adducts, losses, and possible replacements. Constant batch-wise monitoring of the purity of solvents and derivatization agents is important, along with the removal of artifacts from datasets for GC-MS. The hot injector in GC-MS can act as a small chemical reactor, and this can introduce a series of breakdown products that lead to false analytical conclusions. Many volatile compounds, among them pesticides and insecticides (DDT), easily decompose in a hot injector. Using a GC cold injection system with an injection temperature near 0 °C to avoid the breakdown of chemicals and an automatic liner exchange (ALEX, Gerstel Inc.) to avoid carryover can increase the level of confidence in compound identification in complex samples. A chip-based nanoelectrospray system (NanoMate, Advion Inc.) can be used to avoid cross-contamination: for each sample, a new ESI nozzle is used during direct infusion mass spectrometry experiments.
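Repetitive background ions such as sodium formate clusters can be flagged automatically because their m/z values follow a regular series. The sketch below is only an illustration under stated assumptions: it assumes singly charged positive-mode clusters of the form [Na(NaHCO2)n]+, the function names are hypothetical, and the 5 ppm tolerance is an arbitrary example value.

```python
ELECTRON = 0.00054858          # electron mass (Da)
NA = 22.98976928               # monoisotopic mass of sodium (Da)
# Sodium formate NaHCO2 = Na + H + C + 2 O (monoisotopic, Da).
NAHCO2 = NA + 1.00782503 + 12.0 + 2 * 15.9949146


def sodium_formate_cluster_mz(n):
    """m/z of the singly charged cluster [Na(NaHCO2)n]+ (assumed ion form)."""
    return (NA - ELECTRON) + n * NAHCO2


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def flag_background_ions(observed_mzs, max_n=10, tol_ppm=5.0):
    """Return (m/z, n) pairs for peaks that match a cluster within tol_ppm."""
    series = [sodium_formate_cluster_mz(n) for n in range(1, max_n + 1)]
    flagged = []
    for mz in observed_mzs:
        for n, theoretical in enumerate(series, start=1):
            if abs(ppm_error(mz, theoretical)) <= tol_ppm:
                flagged.append((mz, n))
    return flagged


# Hypothetical peak list: the first peak matches the n = 1 cluster.
print(flag_background_ions([90.9766, 200.0000]))
```

The same matching step works for any tabulated contaminant list, provided that the list stores exact masses and the expected ion form.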
Mass spectral fragmentation reaction databases
The current practice of disseminating chemical fragmentation reactions in paper publications (PDF) is not keeping up with existing technological possibilities. It is impractical to search for compound structures and reaction data in paper publications. Many data centric approaches, including the development of novel fragmentation algorithms, are actively hindered as a result. Chemical reaction and fragmentation data should be submitted in electronic, machine-readable exchange formats to journals or external repositories. Currently, no such repository for mass spectral reaction data exists.
Mass spectral simulation and generation of in silico mass spectra
Chemical compound databases currently cover more than 50 million chemical structures; however, only around one million mass spectra (including duplicates) from known compounds exist. This gap could be filled by computer generation of mass spectra from large compound structure databases. An in silico algorithm has to predict accurate mass fragments and their abundances. Such an in silico generation of theoretical mass spectra could be useful because experimentally obtained mass spectra can then be matched against large in silico mass spectral databases. Several mass spectral simulation algorithms have been published in the literature. Many of those programs, however, were never made commercially or publicly available, which therefore prevents any possible independent scientific validation. The main problem of most algorithms is to simulate or calculate peak abundances or peak intensities  that reflect experimentally measured peak abundances [329–331]. This problem has not been solved for the vast majority of small molecules under different ionization modes. The success rate of any algorithm has to be determined by a validation study using unknown molecules and a library match of the in silico generated spectra against the experimental spectra. Furthermore, the structural diversity and the number of compounds have to be high to avoid overfitting.
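The library match between an in silico spectrum and an experimental spectrum is commonly scored with a dot product (cosine) similarity. The sketch below is a minimal, unweighted version at unit mass resolution; it is not the scoring function of any particular search program, and the function names and the example peak lists are hypothetical. It also illustrates why abundance prediction matters: two spectra with identical fragment masses but different abundances still score well below 1.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks into bins of the given width."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spec_a, spec_b, bin_width=1.0):
    """Cosine of the angle between two binned spectra (0 = no overlap, 1 = identical)."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical spectra: same fragment masses, different predicted abundances.
experimental = [(91.05, 100.0), (65.04, 30.0)]
predicted = [(91.05, 40.0), (65.04, 90.0)]

print(f"{cosine_similarity(experimental, experimental):.3f}")
print(f"{cosine_similarity(experimental, predicted):.3f}")
```

In a validation study, the rank of the correct structure in the similarity-sorted hit list, rather than the score itself, would be the figure of interest.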
Expert systems for mass spectral interpretation
Computer-aided interpretation of mass spectra started in the 1960s [345, 346] when the first commercial computers were available. The DENDRAL project pioneered approaches with the aim of predicting isomer structures from mass spectra using self-learning or artificial intelligence algorithms. There are several software tools that can assist during interpretation of mass spectra, including the Automated Mass Spectral Deconvolution and Identification System (AMDIS), MassFrontier, ACD/MS Manager, MASSLib, and the freely available NIST MS Interpreter as part of the NIST08 database search program. The NIST MS search program can generate substructure information using a nearest-neighbor approach by searching unknown mass spectra against a large reference database. The algorithm generates a good list and a bad list of substructures based on an actual hit list. If there is no mass spectrum with similar features in the database, then the algorithm fails. The tools AMDIS and MOLGEN-MS [350, 351] integrate the Varmuza feature-based classification approach [352–354]. Mass spectral classifiers for neutral loss selection using Fisher ratio and linear discriminant analysis and genetic algorithm partial least squares discriminant analysis have been investigated to distinguish alcohols and ethers. A decision tree-based prediction of substructures from mass spectral features allowed the classification of unknown metabolites into different compound classes. For soft ionization techniques (ESI, APCI), programs such as HighChem Mass Frontier [87, 205, 227, 356–360] or ACD/MS Manager can help during data interpretation and fragmentation prediction. Older software usually works well with unit resolution data. New software should allow the handling of accurate and high-resolution mass spectral data. There is currently no software or “magic bullet” that combines mass spectral knowledge and scientific intuition and is able to present a correct compound structure from mass spectral data only.
Use of computational chemistry to explain gas-phase phenomena
There is a constant series of papers that use computational chemistry to investigate gas-phase reactions or ionization processes with regard to thermochemistry and kinetics. In some cases, this approach can lead to a better understanding of fragmentation pathways. The book titled “Assigning structures to ions in mass spectrometry” covers many small-molecule-related approaches regarding thermochemistry, including potential energy curves, calculation of heats of formation, and proton affinities. Quantum mechanical methods can also be used to determine bond cleavage energies and bond dissociation energies, and help to interpret adduct formation [365, 366]. Proton affinities have been calculated with semiempirical methods (AM1) as well as with ab initio (MP2) and density functional theory (DFT; B3LYP) methods [368, 369]. Investigation of CID cross sections can be used to determine binding affinities of cations and small molecules. The kinetic method with entropy correction can be used to calculate proton and electron affinities [371, 372]. Ab initio and DFT calculations were used to elucidate the energetics of ECD. A recent paper discussed the application of DFT to understand tandem mass spectrometric (MS/MS) fragmentation of non-peptidic molecules. The report on three example molecules shows that protonation significantly perturbs the electron density and affects ion formation and subsequent bond fragmentation throughout the whole molecule. The fragmentation pathways of phthalates were investigated using DFT. Even chirality detection of molecules is possible by means of electrospray ionization mass spectrometry and competitive binding analysis. Many of the applied quantum chemical methods require deep computational chemistry knowledge and can make use of available software tools such as GAMESS, GAUSSIAN, NWCHEM, or AMBER. Moreover, recently released Intel Xeon (Nehalem) and AMD Opteron (Magny-Cours) processor technology provides the needed computational speed on commodity desktop computers. The performance of 200 GFlop/s (giga floating point operations per second; double-precision mode) is comparable with speeds reached only by supercomputers 10 years ago. High software and hardware barriers have kept computational interpretation of mass spectra interesting for research, but it has not yet translated into easy-to-use software tools for mass spectrometry practitioners.
Approaches for hyphenated techniques (GC-MS and LC-MS)
Mass spectral deconvolution for clean mass spectra
Chromatographic heart cut, column switching, and fractionation techniques
The fractionation of complex samples using liquid chromatography is a frequently performed step to obtain pure compounds or reduce the complexity of the sample. This further allows 1D and 2D NMR investigations of complex natural products. Peaks can be frozen out using a preparative fraction collector in conjunction with a low-efficiency preparative packed GC column or higher film thickness megabore traps for gas chromatography. These applications are exemplified in biomarker research, investigation of hydrocarbons [390, 391], and entomology and pheromone studies [392, 393]. Using column switching of hydrophilic interaction chromatography (HILIC) and reversed-phase (RP) columns, complex samples can be analyzed within one single run. To increase the chromatographic peak capacity, columns with different polarity can be coupled together into a 2D-LC-MS setup. Comprehensive two-dimensional liquid chromatography (LC×LC) is currently in a developmental stage [395, 396]. The enrichment of samples using peak parking or fraction collection is commonly used during natural product investigations and drug research [399–402]. The combination of liquid chromatography with solid phase extraction and NMR has been applied for pharmaceutical studies, drug discovery, food investigations [405, 406], and natural product research [407–410]. When this technique is combined with mass spectrometric detectors (LC-SPE-NMR-MS), an almost universal system for structure elucidation is created [411–414].
Cheminformatics meets mass spectrometry
Modern mass spectrometry centric approaches for structure elucidation cannot be performed without proper molecular structure handling. Many of the software tools for structural elucidation (MassFrontier, ACD/MS Manager, NIST MS Search, Sierra’s APEX) also have built-in structure handling capabilities to either allow substructure analysis or perform structure–spectra correlations. Many drug metabolism studies also include computational chemistry approaches. It is also important to investigate how many resonance structures, tautomers [417–420], and stereoisomers can be generated from a given structure. For the ionization processes, it is favorable to understand the ionization behavior by calculating charges, electronegativities, and H-bond donor and acceptor counts. It is important to calculate the distribution of microspecies and pKa values under different pH values in a given buffer system to estimate the retention behavior. Software tools such as the Marvin Calculator Plugins (ChemAxon) can calculate all those compound properties, including all possible tautomers and stereoisomers, with a single program. Multiple candidate structures can be investigated using the commercially available ChemAxon Instant-JChem program or the open-source BioClipse software.
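The dependence of the microspecies distribution on pH can be illustrated with the Henderson-Hasselbalch relationship. The sketch below is a deliberate simplification that assumes a single ionizable acidic group with a known pKa; it does not reproduce the multi-site microspecies calculation of the software named above, and the pKa value is a hypothetical example.

```python
def fraction_deprotonated(pka, ph):
    """Fraction of a monoprotic acid HA present as A- at the given pH.

    Follows from the Henderson-Hasselbalch equation:
    pH = pKa + log10([A-]/[HA]).
    """
    return 1.0 / (1.0 + 10 ** (pka - ph))


# Hypothetical carboxylic acid with pKa 4.8 in three mobile phase pH settings.
for ph in (2.8, 4.8, 6.8):
    print(f"pH {ph}: {fraction_deprotonated(4.8, ph):.3f} ionized")
```

Two pH units below the pKa the compound is almost completely neutral and is expected to be retained more strongly on a reversed-phase column, whereas two pH units above the pKa it is almost completely ionized.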
Structure retention relationships and retention index predictions
The investigation and accurate prediction of the retention behavior of a molecule is a major cornerstone of structure elucidation using mass spectrometry. The theoretically predicted retention index or retention time can be used as a powerful orthogonal filter for hyphenated chromatographic techniques. If the elemental composition and possible substructures can be detected from the mass spectrum, then this information can be used within molecular isomer generators (MOLGEN-MS, SMOG, and Assemble) to generate all structural isomers. Those molecular isomer generators usually work with the constraint of a molecular formula and a good list and bad list of possible substructures. A retention index (RI) prediction algorithm can then be used to predict the retention index or retention time of these virtual compounds. Subsequently, these theoretical RIs can be matched against the experimental RI values, and all compounds outside a specific retention index window can be removed as false candidates. Current prediction algorithms are either very accurate for a small subset of structures but lack wider substance coverage, or they cover a broad range of structural classes but lack prediction accuracy. A model with a good correlation coefficient may still have poor predictive power. Additionally, the development datasets for comprehensive solutions should have a minimum size of 500–1,000 compounds, which are best acquired under the same conditions and with the same instrumental setup. The obtained quantitative structure retention relationship (QSRR) models must be carefully validated with a large number of external test set compounds to avoid overfitting. It has to be stated that published QSRR models without an existing commercial or open software implementation are interesting scientific exercises, but they are of little use to the majority of practitioners because they cannot apply these models.
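The use of predicted retention indices as an orthogonal filter can be sketched in a few lines. The candidate names, the predicted RI values, and the window width below are hypothetical; in practice the window has to reflect the validated prediction error of the QSRR model, otherwise correct structures are removed together with the false candidates.

```python
def filter_by_retention_index(candidates, experimental_ri, window):
    """Keep candidates whose predicted RI lies within +/- window of the measured RI."""
    return [c for c in candidates
            if abs(c["predicted_ri"] - experimental_ri) <= window]


# Hypothetical isomer candidates from a structure generator with predicted RIs.
candidates = [
    {"name": "isomer A", "predicted_ri": 1210.0},
    {"name": "isomer B", "predicted_ri": 1345.0},
    {"name": "isomer C", "predicted_ri": 1502.0},
]

# Hypothetical experimental RI of 1330 and a window of 50 index units.
kept = filter_by_retention_index(candidates, experimental_ri=1330.0, window=50.0)
print([c["name"] for c in kept])
```

A narrow window removes more false candidates but is only justified for a model whose external test set error is correspondingly small.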
Several QSRR models for gas and liquid chromatography have been published and already reviewed in the recent literature. Kaliszan wrote a series of papers regarding structure retention relationship models, culminating in a single comprehensive review. Both reviews cover several hundred scientific papers. Katritzky discussed the use of quantum chemical descriptors for QSRR calculations. HPLC retention indices were calculated from a set of 500 drug-like compounds, and molecular descriptors and neural network machine learning were applied for the prediction of the RI values [434, 435].
Statistics of selected column parameters of semipolar stationary gas chromatographic phases obtained from the NIST05 retention index database (GC column parameters, semi-standard nonpolar: CP Sil 8 CB)
Derivatization strategies for LC-MS and GC-MS
The detection of functional groups with the help of selective derivatization is one of the oldest wet-lab techniques in chemistry. The book “A Handbook of Derivatives for Mass Spectrometry” by Zaikin and Halket comprehensively covers most derivatization reactions for different ionization modes and for LC-MS- and GC-MS-based mass spectrometric studies. In the case of GC-MS, the aim is to increase the volatility of the compound and to protect reactive groups in order to avoid thermal breakdown or reactions with the column material. For LC-MS studies, derivatizations are performed to improve the ionization characteristics of poorly ionizable compounds [444–447]. The obtained products must be stable against hydrolysis, as for example the tert-butyldimethylsilyl products from an N-methyl-N-(tert-butyldimethylsilyl)trifluoroacetamide derivatization. Common fields of application are pesticide and environmental screening [449–451], separation of complex sugars as mono-, di-, and trisaccharides [452, 453], enantiomer analysis, amino acid analysis [447, 455], steroid and drug testing [456, 457], and metabolomic profiling studies [458–460].
Use of structure databases for targeted compound annotation
The availability of large public compound databases, such as PubChem and ChemSpider, or specialized drug and metabolism databases, such as KEGG, HMDB, ChEBI, DrugBank, MZedDB, and the Chemical Lookup Service, allows for a web-based search of molecular formulae or accurate masses. DrugBank has a search interface allowing an accurate mass search in positive or negative mode within the known human metabolite pool, and the results are presented with possible adducts and links to further database sources. This information can include literature, chemical taxonomy data, or other related information. If other molecular features are known from mass spectral investigations, such as the number of polar hydrogens from derivatization or H/D exchange experiments, then these molecular properties can be used as additional orthogonal filters. Additionally, theoretical retention indices can be used to match experimental RI values and remove false candidate hits. Multiple databases have application programming interfaces (APIs) that allow a connection of standalone programs with online databases without the need to download several gigabytes of the database itself.
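An accurate mass search that takes adducts into account can be sketched as follows. To keep the example self-contained, a small local dictionary stands in for an online database; no real database API is called, and the function name, the adduct list, and the 5 ppm tolerance are assumptions made for illustration. The monoisotopic masses of glucose and caffeine are computed from standard isotope masses.

```python
PROTON = 1.00727646688                 # proton mass (Da)
SODIUM_ION = 22.98976928 - 0.00054858  # Na minus one electron (Da)

# Mass added to the neutral molecule for each singly charged adduct.
ADDUCTS = {
    "positive": {"[M+H]+": PROTON, "[M+Na]+": SODIUM_ION},
    "negative": {"[M-H]-": -PROTON},
}

# Stand-in for a compound database: name -> monoisotopic neutral mass (Da).
LOCAL_DATABASE = {
    "glucose": 180.063388,   # C6H12O6
    "caffeine": 194.080376,  # C8H10N4O2
}


def accurate_mass_search(mz, mode, database=LOCAL_DATABASE, tol_ppm=5.0):
    """Return (compound, adduct) pairs whose neutral mass explains the observed m/z."""
    hits = []
    for adduct, delta in ADDUCTS[mode].items():
        neutral = mz - delta
        for name, mass in database.items():
            if abs(neutral - mass) / mass * 1e6 <= tol_ppm:
                hits.append((name, adduct))
    return hits


# Hypothetical positive-mode measurements.
print(accurate_mass_search(203.0526, "positive"))
print(accurate_mass_search(195.0877, "positive"))
```

Against a database with millions of entries, the same query typically returns many isobaric candidates, which is why the orthogonal filters described above are needed.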
Fields of applications—review of reviews
The use of mass spectrometric analysis in metabolomics has been reviewed in [470–473]. A comprehensive review covered the identification of known endogenous and exogenous metabolites by applying accurate mass, isotopic pattern filter, retention indices, and mass spectral fragmentation in a sequential filter cascade and combining the results with a database search. A wide range of LC-MS-based methods including MRM-based approaches, precursor ion scans, and radio-labeling were discussed. Multistage mass spectrometry approaches were used for the identification of drugs, metabolites, toxins, and plant and animal metabolites [124, 476–486] with their associated fragmentation pathways. The structural elucidation of flavonoids, flavonoid glycosides [487–491], and drug metabolites [294, 318, 492–498] using multiple-stage tandem mass spectrometry was reviewed in several papers. Natural product investigations usually combine mass spectral information with a de novo structure elucidation step using NMR [499–502]. Lipid and phospholipid analysis [503–512] can be performed with all ionization modes and types of mass spectrometers including triple quadrupole, ion trap, and TOF mass spectrometers. Structure elucidation of compounds from environmental samples [118, 389, 450, 513–516] is among the most complex cases of structure elucidation. The use of capillary electrophoresis coupled to mass spectrometry has been previously reviewed [517, 518].
Electronic data sharing of mass spectra
The future success of structure elucidation with mass spectrometry will largely depend on the development of new software algorithms. Similar to the success of the bioinformatics  and the proteomics  communities, which had open access to large genome associated data, mass spectrometric data must be made publicly available to enable long-term data reuse and allow data-driven research [521, 522]. Multiple software database implementations are currently in use or in development [523, 524], among them SPECTRa , MeltDB , SetupX , MassBank [219, 220], MMCD , METLIN [529, 530], GMD , KNApSAcK , and PRIME . Mass spectra from a wide range of instrument types can be used in machine learning approaches for mass spectral elucidation. The future of mass spectral structure elucidation will depend on a wide array of well-described, meta-data enhanced and freely available resources. Not only high- and low-resolution mass spectra but also their suggested fragmentation pathways can be electronically collected. Mass spectra and associated molecular structure drawings need to be shared in open exchange formats and global repositories. This will create a new breed of scientists who only deal with mass spectrometric data evaluation independent from access to mass spectrometers just as in bioinformatics.
Unfortunately, the mass spectrometric community itself never released any data sharing policies. The American Society for Mass Spectrometry, which is open to all scientists and is the leading mass spectrometric society worldwide, never actively pushed or developed data sharing principles for the community. Driven by community efforts, however, the proteomics community and several funding agencies, including the Wellcome Trust (UK), National Cancer Institute (NCI, USA), and National Institutes of Health (NIH, USA), released the International Summit on Proteomics Data Release and Sharing Policy, which urges the rapid release of mass spectra, tandem MS, and liquid chromatography MS data into the public domain. On the technical level, the data sharing problem can be solved with large repositories such as PeptideAtlas.org or peer-to-peer (P2P) approaches as in the Tranche project at ProteomeCommons.org. Open exchange formats that can store multidimensional data must be developed further. This can only be done with the support of the mass spectrometry industry, which in recent years has also opened up parts of its proprietary programming interfaces (APIs) to allow open-source programmers access to specific data formats.
The Human Proteome Organisation was among the leading organizations and supported the open mzData format, which was later merged into the upcoming mzML format. Different organizations, such as the Institute for Systems Biology (mzXML, PepXML, and ProtXML) and the Proteomics Standards Initiative (mzData and PSI-MI), started the development of those standard formats and provided vendor-specific converters [537, 540, 541]. The continued success of each of those formats requires broad support for vendor-specific converter software and additional software that can visualize and manipulate such exchange data. The netCDF and JCAMP-DX formats are widely supported within the GC-MS community. Both formats lack accurate mass and MSn implementations as well as broader development support within the mass spectrometry community.
The crystal structure community has shown that, once data and exchange standards are established, no human interaction is needed anymore to collect spectral data. The CrystalEye project (http://wwmm.ch.cam.ac.uk/crystaleye/) shows that the aggregation of crystal structures can be totally robotized using modern web technologies. The only requirement is that the spectral data must be available under open-data licenses (http://www.opendefinition.org/). The public availability of open mass spectral resources will allow commercial as well as governmental entities (NIST) that are specialized in the collection of mass spectral data to focus on the expensive curation of these data. The enhancement of these spectral datasets with meta-information and compound structures will add value to those collections and allow commercial distribution due to market demands.
Future software and hardware advances (opinion)
The success of structure elucidation using mass spectrometric approaches depends not only on technical machine developments but even more on the development of better software algorithms. In particular, software for working with data output from multiple-stage mass spectrometry (MSn) or data from multiple orthogonal hybrids, including ion mobility, is not yet fully developed. The large proteomics software community with its very active bioinformatics development branch could be a positive example for the small-molecule community. In terms of software development, there will always be a very innovative core of commercial and open-source software developers that will develop state-of-the-art software tools. For taxpayer-funded research in universities and government-funded laboratories, the direction should go towards open-source software or at least towards freely or publicly available software with the least restrictive software licenses to allow commercial and non-commercial exploitation. Community efforts can solve many of the complex software challenges, whereas consistent software support, user help, and error fixing can be obtained from commercial services. The publication and discussion of approaches or programs that are not commercially or publicly available should be avoided because claims made within the publication cannot be independently verified. One of the most important issues is the public sharing of mass spectra and other spectral data from a wide variety of mass spectrometers. This may ultimately lead to a new generation of scientists and software developers who specialize in software development for small-molecule identification using mass spectrometry.
In terms of technological progress, it must be stated that, in principle, all technological prerequisites for advancement in structure elucidation exist. The hyphenation of LC-MS with NMR seems to be a fast lane towards successful structure elucidation. Hybrid orthogonal approaches that add an additional dimension to the chromatographic side (GC×GC and LC×LC) or to the mass spectrometric part (ion mobility coupled to time of flight) will particularly enable the extraction of cleaner mass spectra from fully resolved compounds. Increased resolving power, better mass and isotopic abundance accuracy, and high data acquisition rates will enable a faster structure elucidation process. Accurate masses and high resolution for all multiple-stage mass spectra (MSn) will subsequently allow the evolution of new software tools as discussed in this article.
Structure elucidation using mass spectrometry is a challenging field of research with many success stories. Mass spectrometry itself is seldom used for the de novo structure elucidation of small molecules but serves as an important building block together with NMR, IR, X-ray crystallography, and other spectroscopic techniques. Together with hyphenated chromatographic techniques (GC and LC), mass spectrometry serves as a powerful tool for the elucidation of drugs, pesticides, metabolites, and complex chemical mixtures. Mass spectrometry hardware is currently at a very advanced stage, with many technologies not yet fully exploited. More data centric approaches have to be taken in the future. This includes the electronic publishing of investigated structures and their associated multiple-stage mass spectra with open-data licenses. The ultimate success of structure elucidation of small molecules lies in better software programs and the development of sophisticated tools for data evaluation of high-resolution and accurate mass multiple-stage (MSn) mass spectral data.
We thank Vladimir Tolstikov (UC Davis Genome Center Metabolomics CoreLab) for an internal review of the paper. We also thank Google Scholar (http://scholar.google.com) and Google Books for pleasant scholarly information retrieval. Additional thanks are provided to the Jane Journal Estimator (www.biosemantics.org), CAS SciFinder Scholar, and SCOPUS (www.scopus.com) for literature analysis.
This work was supported by the grant R01 ES013932 of the NIEHS/US National Institutes of Health and the US National Science Foundation MCB-0520140. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The authors declare no competing interests.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.