Pivotal Role of Computers and Software in Mass Spectrometry – SEQUEST and 20 Years of Tandem MS Database Searching
Advances in computer technology and software have driven developments in mass spectrometry over the last 50 years. Computers and software have been impactful in three areas: the automation of difficult calculations to aid interpretation, the collection of data and control of instruments, and data interpretation. As the power of computers has grown, so too have the capabilities and impact of mass spectrometers. This has been particularly evident in the use of tandem mass spectrometry data to search protein and nucleotide sequence databases to identify peptide and protein sequences. This capability has driven the development of many new approaches to study biological systems, including the use of “bottom-up shotgun proteomics” to directly analyze protein mixtures.
Keywords: SEQUEST; Database searching; Tandem mass spectrometry; Computers; Proteomics
Technological advancement depends on an intricate balance between the inventions and innovations of different fields. For example, Charles Babbage had the idea to create programmable computers in the early 1800s, but the technology needed to fully realize his idea had not yet been invented. Babbage’s ideas were simply far ahead of his time; his design was shown to be sound when a team built a working mechanical calculator from his original plans in 1989. The evolution of technology depends on the innovations of many fields and, often, on the timing of those developments. In addition, unexpected advances in technology in one field can create disruptive leaps in another. Advances in mass spectrometry have come about this way. The technique was invented over 100 years ago and has since benefited from advances in many fields, notably computerized metal machining and vacuum technology, but it has been advances in electronics and computers that have driven the creation of the powerful mass spectrometers we have today.
Many advances in mass spectrometry technology over the last 100 years can be directly linked to innovations in the electronics industry. Mass spectrometers depend on the use of electric and magnetic fields for the separation of ions and, thus, the capabilities of instruments have increased with the robustness, accuracy, and precision of electronics. The invention and commercialization of solid-state electronics coincided with the emergence of quadrupole theory and eventually helped drive the development and use of quadrupole mass spectrometers by providing the dependable and precise electronics required for the electric fields used to separate ions in quadrupoles. Throughout the 1960s, innovations in the electronics industry, particularly at Fairchild Semiconductor and Texas Instruments, resulted in improved performance and decreased costs of integrated circuits. Eventually, the advances in both price and performance allowed more affordable computers to be built in academic environments, and commercialization of these innovations through new companies such as Digital Equipment Corporation (DEC) led to the development of real-time interactive minicomputers. MIT was one of the centers of the 1960s “computing revolution” and, as a result, the Biemann mass spectrometry laboratory was well positioned to exploit computers to aid in the interpretation of mass spectra.
The need to collect data coming off the mass spectrometer directly into a computer drove a second stage of the integration with computers (Figure 1b). A dramatic increase in the amount of data collected using GCMS instead of a solids probe-based technique created a data analysis bottleneck and the need not only to collect the data directly into the computer but also to create algorithms to process the data. Initial GCMS analyses used a Mattauch-Herzog design to project a complete mass spectrum onto a photographic plate, and as many as 30 spectra could be collected on one plate. Mass spectral data from the photographic plate could be read by a comparator-densitometer and then fed into a computer for processing. An advantage of this method over the use of a scanning magnetic sector was the ability to collect high-resolution data. Although photographic plates could collect as many as 30 spectra before a new plate had to be inserted into the mass spectrometer, there was still a drive to read data directly into computers to accommodate increasingly complicated samples, and this process required conversion of ion signals into electrical signals [8, 9]. Adding an electron multiplier to detect the ions and then converting the analog signal to a digital form, with storage on a magnetic tape that could be fed directly into a computer, achieved this goal. Continuous recording of data created new obstacles to overcome. Hites and Biemann automated the collection and calibration of mass spectra using computers, which paved the way for better acquisition of GCMS data. As these processes became more sophisticated, new problems arose, such as the loss of fine detail for the molecules eluting into the mass spectrometer when GCMS data were plotted as a total ion chromatogram. The Biller-Biemann algorithm solved this problem by plotting the base peak ion for every scan, which restored fine detail to the chromatogram.
These algorithms enhanced the efficiency of data processing and put MS data directly into a form that could be further processed by computer algorithms. However, in the 1960s, manual collection and interpretation of the mass spectra was still the norm for all but a few laboratories, and it was not until the middle of the 1970s that a commercial data system from INCOS Inc. became available for most mass spectrometers.
As mass spectrometers evolved, there was a concurrent drive to understand the gas-phase ion chemistry of molecules to aid in the interpretation of spectra. Throughout the 1960s there was a focus on deciphering fragmentation mechanisms of ions using 70-eV electron ionization. To understand how molecular bonds fragmented during ionization, many studies were carried out using specifically prepared synthetic molecules with strategically placed stable isotopes. Determining which fragment ion retained the stable isotope, evident as an increase in the expected m/z value, helped establish the mechanisms of bond cleavage. The details of these processes have been described in classic books on mass spectrometry [12, 13, 14]. As the details of fragmentation revealed themselves, efforts turned to making the process of spectral interpretation more efficient and, consequently, a third stage of algorithm development emerged that focused on increasing the speed of interpretation of mass spectra (Figure 1c). Two efforts undertaken during this period are most notable. The first was the creation of mass spectral libraries and the algorithms used to match spectra as an aid in interpretation of mass spectra [16, 17, 18, 19]. This was perhaps one of the earliest attempts to “crowd source” a problem, since a scientist’s interpretation of a particular spectrum to identify the molecule it represented was subsequently shared throughout the mass spectrometry field through a library. Several groups were prominent in the development of algorithms to search spectral libraries. McLafferty’s and Biemann’s groups laid the early foundations for library searching, but their efforts were constrained by the limited capabilities of computers as well as by the availability of computers more generally [16, 17, 18, 20, 21]. This situation was remedied, as mentioned previously, with the development of a commercial data system.
The community-wide effort to collect EI spectra of small molecules for inclusion in libraries would make this approach quite powerful in time. Such libraries are still widely used for the analysis of mass spectra, and the library concept has become even more relevant as metabolomics has surged in popularity [22, 23, 24, 25].
The second strategy to improve data interpretation efficiency was to develop more “intelligent” approaches to data analysis. In the mid to late 1950s, the field of artificial intelligence (AI) was born. By the mid-1960s, the field had achieved some surprising successes, and interest in its programming methods expanded. Edward Feigenbaum, Bruce Buchanan, Joshua Lederberg, and Carl Djerassi teamed up to develop an approach to use AI to interpret mass spectra of small molecules. A goal was to use a heuristic approach to interpret mass spectra in much the same way a human would, and this idea was encoded in a software program called Dendral. Although the program never achieved complete independence of interpretation, it was able to eliminate the most implausible structures and narrow the possible structures to be considered. This narrowing was sufficient to allow a non-expert to complete the interpretation process without too much difficulty. Despite this futuristic vision of automating de novo interpretation of organic molecule mass spectra, the concept never advanced much beyond Dendral for many decades. Most interpretation of organic molecule mass spectra is still performed through library searching or by manual interpretation, although the increasing popularity of metabolomics has stimulated new interest in developing de novo interpretation algorithms [27, 28]. However, de novo interpretation remains a significant challenge, as the even-electron ionization methods of today (ESI and APCI) often yield limited fragmentation information, even when coupled with CID.
When separation methods are integrated with mass spectrometers, a very powerful and large-scale method to detect the molecules being separated is created. Connecting liquid chromatography with mass spectrometers was of long-standing interest in the field. Some rather inventive methods were developed, from the moving belt interface to thermospray, but it was not until the development of electrospray ionization (ESI) that the problem was finally solved [46, 47, 48]. The essential element of ESI is placing a high voltage on the tip of the LC outlet to form a spray of fine droplets outside the vacuum system. Previous methods had tried to introduce the liquid into the high vacuum of the ion source, an approach fraught with complications. ESI solved the interface problem for LC and created a robust and powerful method to introduce biomolecules directly into the mass spectrometer. At the same time ESI was emerging, vigorous discussions on sequencing the human genome were taking place [49, 50]. Advances in the technology to sequence DNA drove an optimism that tackling the sequencing of the human genome as a world-wide, coordinated effort would yield huge benefits for human biology, medicine, and mankind. So important was the invention of DNA sequencing that Alan Malcolm argued this development spelled the end of protein sequencing, and he was substantially correct. Automation of the DNA sequencing process drove a consensus of leading scientists, government officials, and politicians to sequence not only the human genome but also a set of model organisms to provide experimental systems in which to understand the functions of human genes. While the Genome projects were heralded for their vision, it was believed that the data would be understood through a combination of bioinformatics and genetics.
In the report of the National Academy of Sciences recommending the project, there was little mention of the role of protein biochemistry in understanding the function of gene products, nor was there any indication that protein biochemistry might benefit from sequencing genomes. Despite early optimism, so far it has been the simpler Mendelian diseases and traits that have been readily understood through genome sequencing. More complex, multi-genetic diseases have been slower to yield their secrets through large-scale genomics, suggesting these diseases are quite complicated in their etiology [52, 53].
Until this time, the principal method for the analysis of mass spectra had remained the library search (Figure 2b). While peptide tandem mass spectra might have seemed like an easy case for de novo interpretation, as the fragmentation patterns are relatively straightforward, the number of possible combinations can escalate very quickly. A 5-residue peptide can have 20^5 = 3,200,000 different sequence combinations, but the potential combinations can be limited by knowing the peptide’s molecular weight. As molecular weight measurement becomes more accurate, the number of possible combinations of amino acids decreases. Despite efforts to automate the process, the combination of low mass resolution and poor accuracy made accurate de novo interpretation difficult to achieve [36, 37, 54, 55]. As genome-sequencing projects got underway, new strategies for mass spectral data interpretation emerged. Genome sequencing projects produced DNA sequences that could be translated into the amino acid sequences that evolved in a biological system. This information could then be used to sharply limit the sequences that needed to be considered when trying to interpret tandem mass spectra.
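The arithmetic behind this combinatorial narrowing is easy to demonstrate. The sketch below (illustrative Python, not from the original work; the residue masses are standard monoisotopic values) counts candidate sequences consistent with a measured peptide mass at two mass tolerances, using 3-residue peptides rather than 5 simply to keep the enumeration fast:

```python
from itertools import product

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
WATER = 18.01056  # H2O added to the summed residue masses for an intact peptide

def count_candidates(length, target_mass, tol):
    """Count sequences whose neutral peptide mass is within tol Da of target."""
    return sum(
        1 for seq in product(RESIDUE_MASS.values(), repeat=length)
        if abs(sum(seq) + WATER - target_mass) <= tol
    )

total = 20 ** 3                              # all possible tripeptide sequences
target = sum(RESIDUE_MASS[aa] for aa in "AGS") + WATER  # a hypothetical measurement
loose = count_candidates(3, target, 0.5)     # low mass accuracy
tight = count_candidates(3, target, 0.005)   # high mass accuracy
print(f"{total} tripeptides, {loose} within 0.5 Da, {tight} within 0.005 Da")
```

Tightening the tolerance from 0.5 Da to 0.005 Da collapses the candidate list dramatically, which is why higher mass accuracy makes both de novo interpretation and database searching more tractable.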
The idea for creating an algorithm to directly search tandem mass spectra through sequence databases emerged quite simply from an effort to sequence peptides derived from class II MHC proteins [56, 57]. These peptides come from a well-studied family of immunologic proteins with a high level of sequence conservation. To minimize the time spent performing de novo sequencing on a tandem mass spectrum for a sequence that was already known, we would read off a stretch of five amino acids and then send the sequence for a BLAST search using the NCBI e-mail server. If the sequence matched to a protein, it was possible to see if the surrounding sequence of the initial 5-residue sequence fit the tandem mass spectrum. While waiting for a BLAST search to return, it occurred to me that we should simply send off the tandem mass spectrum to the database to see if a sequence would match the tandem mass spectrum. In thinking about how to achieve this, it was apparent that three immediate problems had to be solved. The first issue was how to get local access to the protein sequence database, which was stored at the National Center for Biotechnology Information (NCBI). At the time, the World Wide Web was in its infancy, so accessing sites via the Internet was not always straightforward. Assuming we could acquire a protein sequence database from the NCBI, the second obstacle was figuring out how to access the computerized tandem mass spectral data. Data files were kept in a proprietary format to maintain compliance with Federal regulations and, consequently, it was not easy to access an entire LC run’s worth of tandem mass spectra. Finally, once we were able to access tandem mass spectra, we needed to figure out how to match them to sequences.
The problem of getting a database was solved quickly, since databases were increasingly becoming available via ftp servers. Extracting data from the proprietary data formats of the Finnigan mass spectrometer we were using was not as straightforward. Finnigan had made a program available to us to extract data over a set of scans, but it was cumbersome to use because you needed to know the exact scan numbers for the tandem mass spectrum you wanted to extract. What was needed was a way to automatically extract all the tandem mass spectra from a file. Even though, in the era before data-dependent acquisition (DDA), the number of spectra collected was not large and encompassed only those m/z values entered into the computer during the run, we still needed a way to extract all spectra at once. As a first step, we set about trying to decipher the proprietary format of the Finnigan file system, and within a short time we could read the files. As a result, Finnigan gave us access to their software libraries that more faithfully accessed their file formats. Access to Finnigan’s software libraries allowed us to extract all MS/MS spectra from an LC run, and this capability proved to be pivotal to fully automate the process. By granting access to their software libraries, Finnigan initiated a new era in software cooperation between academic laboratories and instrument manufacturers that continues to this day.
Once we could extract tandem mass spectra from the proprietary files, a strategy to match spectra to the sequences in the database was the next challenge. At this point, there was a significant amount of literature on spectral library searching, and some very powerful mathematical techniques had been developed to evaluate matches. Our problem was that we would not have a mass spectrum to represent the amino acid sequences extracted from the sequence databases. As fragment ions are easy to predict from an amino acid sequence, a straightforward strategy for sequence matching is to predict the expected fragment ions and simply count the number of peaks the sequence shares with the experimental spectrum. We developed a scoring function based on this model, which became known as the SEQUEST preliminary score, or Sp. It calculates a score based on the number of shared peaks as a function of the expected peaks, together with the immonium ions present in the spectrum when the amino acids that generate those immonium ions are present. This approach was tested and worked reasonably well, but it was not as accurate as hoped when spectral quality was not high.
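The shared-peak idea can be sketched in a few lines (illustrative Python, not the original SEQUEST code; the immonium-ion term and the published Sp weighting by matched-ion intensity and ion continuity are omitted for clarity):

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
PROTON, WATER = 1.00728, 18.01056

def predicted_ions(peptide):
    """Singly protonated b- and y-ion m/z values for a candidate sequence."""
    m = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(m[:i]) + PROTON for i in range(1, len(m))]
    y = [sum(m[i:]) + WATER + PROTON for i in range(1, len(m))]
    return b + y

def shared_peak_score(peptide, spectrum_mz, tol=0.5):
    """Sp-like preliminary score: matched ions as a fraction of predicted ions."""
    ions = predicted_ions(peptide)
    matched = sum(1 for mz in ions
                  if any(abs(mz - p) <= tol for p in spectrum_mz))
    return matched / len(ions)
```

Scoring a candidate against a spectrum containing all of its predicted fragments returns 1.0, while an unrelated sequence scores near zero; spectra with missing or noisy fragments land in between, which is where this simple count loses discriminating power.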
To improve scoring, we capitalized on an approach that had been employed for library searching. In the original paper, we referred to our method as a “pseudo library” search because, by matching a tandem mass spectrum with a predicted one reconstructed from the amino acid sequence, we could take advantage of methods already developed for library searching. At the time it was difficult to predict the intensity of fragment ions, so a method was fashioned to recreate spectra in a way that would minimize the impact of fragment ion intensity. The reconstructed spectrum set all the predicted b and y fragment ions to the same intensity of 50%. Neutral losses of water and ammonia, a common occurrence in triple quadrupole spectra, were set at 10%. To make the experimental spectrum comply with this reconstruction, we divided the spectrum into 10 even windows across the mass range and then normalized the ions within each window to 50%. We tried a few different methods to compare the experimental and reconstructed spectra, but none was satisfactory until we found a paper by Kevin Owens describing the use of a cross-correlation function as part of a mass spectral library searching approach. When we implemented this approach, it proved to be incredibly sensitive and accurate. However, the cross-correlation function uses a fast Fourier transform (FFT) in its calculation, making it a computationally intensive approach, particularly so in the early 1990s. Computers have since gotten faster and better implementations of the FFT calculation have been developed [60, 61, 62], while others have used the computationally less intensive “poor man’s cross-correlation,” or dot product, to compare spectra.
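The two steps described above, window normalization and an FFT-based cross-correlation, might be sketched as follows (illustrative Python/NumPy, not the original implementation; spectra are assumed to be pre-binned at unit resolution, and the ±75-bin background range is an assumption drawn from the published SEQUEST description rather than from this text):

```python
import numpy as np

def normalize_windows(intensities, n_windows=10):
    """Scale the largest peak in each of n_windows equal windows to 50."""
    arr = np.asarray(intensities, dtype=float)
    out = np.zeros_like(arr)
    bounds = np.linspace(0, len(arr), n_windows + 1).astype(int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        if hi > lo and arr[lo:hi].max() > 0:
            out[lo:hi] = arr[lo:hi] * (50.0 / arr[lo:hi].max())
    return out

def xcorr(observed, theoretical):
    """Cross-correlation score: the zero-offset correlation minus the mean
    correlation over nonzero offsets up to +/-75 bins (background correction)."""
    obs = np.asarray(observed, dtype=float)
    theo = np.asarray(theoretical, dtype=float)
    # Circular cross-correlation of the two spectra via the FFT.
    corr = np.fft.irfft(np.fft.rfft(obs) * np.conj(np.fft.rfft(theo)), len(obs))
    background = np.concatenate([corr[1:76], corr[-75:]])
    return corr[0] - background.mean()
```

Subtracting the mean correlation at nonzero offsets rewards spectra that align only at the correct offset, which is a large part of what made the cross-correlation score so discriminating.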
What was immediately clear from the early tests of the search program was that the interpretation of spectra was now a simplified, and possibly solved, problem once databases became complete for an organism (Figure 2c). As shown in the 1994 JASMS paper, the search program also allowed the analysis of peptides obtained from intentionally digested protein mixtures. However, these data were collected by manually typing in the m/z values for the peptides, a process that required performing an LC/MS run to first measure m/z values for peptides and subsequently setting up MS/MS experiments for each m/z value of interest to be analyzed in a second LC/MS analysis. This was not as cumbersome as it might sound, given that multiple windows could be open in Finnigan’s Interactive Chemical Information System (ICIS) MS control software, making it possible to cycle through the windows and trigger the MS/MS experiment for each m/z value. Clearly, this process was ripe for an infusion of new technology, and it turned out the technology was already present in the Finnigan ICIS data system.
When the Finnigan MAT TSQ70 was designed, it contained an on-board computer to control the operation of the instrument. This computer was distinct from the computer workstation used to acquire and process the data. Also built into the system was an instrument control language (ICL). ICL was a scripting language that could be used to write small programs to operate the instrument in an automated fashion, such as an autotune program. The real power of ICL was realized by Terry Lee’s laboratory through the creation of a data-dependent acquisition (DDA) script. This script made it possible to collect automated MS/MS over the course of an LC run. When the Lee paper came out, we were working on a DDA script of our own, so we quickly capitalized on this method and published a DDA script that used a neutral loss scan to identify the loss of phosphoric acid from phosphopeptides and trigger an MS/MS scan of the phosphopeptides. This method was published in combination with the demonstration of the SEQUEST algorithm as a way to identify differential modifications to peptides. By simplifying data interpretation, the analysis of post-translational modifications by mass spectrometry became a straightforward endeavor. Because PTMs regulate biological processes, the ability to identify them in large-scale data created a paradigm shift in biology, as it was now possible to analyze the state of modifications in cells as a function of different states or conditions.
By the mid-1990s, The Institute for Genomic Research (TIGR) and academic laboratories were increasingly depositing cDNA sequences in the databases [66, 67]. In fact, since the deposition of cDNA sequence data was outstripping the deposition of whole-genome DNA sequences, it made sense to develop an approach to search these data. Searching cDNA sequence data required converting the DNA sequences to protein sequences, but doing so in six frames, as the mRNA sequencing methods sequenced randomly from either end of the transcript (i.e., the 3′ or 5′ end). This approach could be used to search genomic as well as cDNA sequences, and the upshot was that you could identify sequences within open reading frames using tandem mass spectra. cDNAs represent transcribed sequences, which almost certainly encode proteins, but this searching concept enabled what would eventually be called “proteogenomics” as a way to identify ORFs within genomic data [70, 71, 72].
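Six-frame translation itself is simple to sketch (illustrative Python using the standard genetic code; real pipelines additionally split each frame at stop codons before digesting the resulting ORFs in silico):

```python
# Standard genetic code, compactly encoded with bases in TCAG order:
# the string lists the amino acid for each codon, stops written as '*'.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON = {a + b + c: AMINO[16 * i + 4 * j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}

def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frame_translation(dna):
    """Translate a DNA sequence in all six reading frames (3 forward, 3 reverse)."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            frames.append("".join(CODON[c] for c in codons))
    return frames
```

For example, `six_frame_translation("ATGGCCAAA")` yields "MAK" as its first forward frame and "FGH" as the first frame of the reverse strand.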
Tandem mass spectra from triple quadrupole mass spectrometers were used to develop SEQUEST. A question at the time was whether the approach was generalizable to other types of tandem mass spectra. MALDI-TOF mass spectrometers were capable of generating post-source decay (PSD) spectra, in which peptides fragment after acceleration into the flight tube. Very often, PSD spectra have enough fragmentation to generate sequence, and when they were tested for database searching the approach worked quite well. Although tandem double-focusing magnetic sector mass spectrometers were being used less often after the development of ESI, these instruments produced tandem mass spectra with high-energy collisions and many amino acid side-chain cleavages. By adapting the sequence ions considered in the search, SEQUEST could be used to search these tandem mass spectra as well. What was interesting was that a peptide containing a leucine (Leu) residue could be differentiated from the same sequence in the database with the leucine residue replaced by isoleucine (Ile). Differentiation was possible because the w-ions created in high-energy CID spectra differ depending on whether the side chain is Leu or Ile. The tandem mass spectra produced in ion traps also have a very different appearance from those produced in TSQs, including the lack of immonium ions in the low-mass range of the spectra, but these spectra could still be searched with SEQUEST. A big advantage of the LCQ ion trap was a much more sophisticated DDA, with the ability to exclude ions once a tandem mass spectrum was collected (dynamic exclusion), which greatly improved the efficiency of data collection.
The combination of DDA with the ability to rapidly and accurately match spectra to sequences in the database enabled revolutionary new approaches to protein analysis. The 1994 JASMS paper demonstrated the ability to collect tandem mass spectra of digested protein mixtures obtained from a yeast cell lysate, and although DDA was not used in the data collection, it showed that the tandem mass spectra of peptides could be matched back to their proteins from a purposely digested protein mixture. The invention of DDA made the process much more efficient and opened the possibility of developing a new approach to analyze protein mixtures. The development of faster-scanning instruments together with more effective DDA methods has resulted in larger and larger data sets, necessitating methods to speed up searches, such as computing clusters, as well as methods to filter and process search results [78, 79, 80, 81, 82].
Understanding biology requires determining the function of proteins. Some of the information used to understand the function of proteins includes expression in response to perturbations, physical interactions, modifications, and localization. After demonstrating in 1994 that we could identify proteins in mixtures, we set out to determine if this new approach to protein analysis could be used to measure these types of parameters.
Proteins performing specific functions are frequently sequestered in specific compartments in the cell. We determined the proteins localized to the periplasmic space of E. coli by combining a specific method to enrich the proteins with separation of the intact proteins using ion exchange LC. Fractions of proteins were collected, digested, and then identified using LC/MS/MS. DDA was used to collect the tandem mass spectra, resulting in the identification of 80 proteins. This study demonstrated that proteins could be identified by digesting mixtures and analyzing the resulting peptides by LC/MS/MS with database searching. In effect, this approach could be used to identify proteins localized to a subcellular compartment, a strategy that has since been used to identify the components of many of the cell’s compartments. The strategy is very dependent on the efficacy of the initial enrichment method, which is why robust cell biological methods are important for these types of studies. The idea of purposely digesting a mixture of proteins for mass spectrometry analysis was called “shotgun proteomics.”
Protein–protein interactions are a powerful tool to understand the role of proteins in cellular processes. The approach is based on a “guilt by association” phenomenon, which assumes that if proteins are interacting with each other, they must all have some role in the same biological process. If you know something about the function of one of the proteins and you use this protein as a “bait” to find its interactors, the interactors are highly likely to be involved in the same biological process. Because protein complexes frequently contain a small number of proteins (tens of proteins), they should be an ideal target for direct analysis of the proteins, or shotgun proteomics. We used shotgun proteomics to analyze protein complexes enriched by three different methods. The methods consisted of immunoprecipitation with an antibody to a protein, affinity enrichment of binding proteins using a bound protein bait, and enrichment of a protein complex using a nondenaturing separation method. The enriched proteins were digested with trypsin and analyzed by LC/MS/MS with database searching to identify the proteins. This approach has proven to be very powerful, and numerous large-scale studies have been performed to identify the protein networks of organisms [86, 87, 88].
The co-evolution of mass spectrometers with computing capacity has driven innovation in mass spectrometry and increased the impact of mass spectrometers in many fields. These advances have come about not simply as an improvement in data processing but also as a result of increased operational capabilities of mass spectrometers to collect data and to combine the collection of different data types [94, 95]. The power of computer-driven data processing is the creation of more accurate and large-scale data analysis approaches that enable new experimental paradigms. The ability to search tandem mass spectra of peptides against sequence databases enabled an approach to identify intentionally digested proteins in complex mixtures, upending the established paradigm for protein discovery. Search is only the first step in the discovery process, and search results need to be filtered and assembled back into protein sequences to be more useful [78, 80, 81, 82]. Beyond search and filtering is the need to quantify results and, consequently, software tools were developed to quantify using label-free methods, stable isotope labels, or covalent tags such as TMT. Many of these software tools have been combined into pipelines to streamline data processing. All of these tools and methods have made it possible to answer many biological questions, to explore the mechanisms of diseases, and to search for biomarkers of disease. A question going forward is: Can new, innovative software tools be envisioned to address as yet unanswered biological questions using mass spectrometry?
The author thanks Claire Delahunty for reading drafts of this manuscript, and members of the Yates laboratory past and present for the exciting journey. Funding has been provided by National Institutes of Health grants P41 GM103533, R01 MH067880, 1 R01 MH100175, and HHSN268201000035C.