A Deeper Look into Comet—Implementation and Features
- Cite this article as:
- Eng, J.K., Hoopmann, M.R., Jahan, T.A. et al. J. Am. Soc. Mass Spectrom. (2015) 26: 1865. doi:10.1007/s13361-015-1179-x
The Comet database search software was initially released as an open source project in late 2012. Prior to that, Comet existed as the University of Washington’s academic version of the SEQUEST database search tool. Despite its availability and widespread use over the years, some details about its implementation have not been previously disseminated or are not well understood. We address a few of these details in depth and highlight new features available in the latest release. Comet is freely available for download at http://comet-ms.sourceforge.net or it can be accessed as a component of a number of larger software projects into which it has been incorporated.
Key words: MS/MS, Database search, Comet
In a seminal paper published in 1994, the ability to sequence peptides by searching uninterpreted tandem mass spectra (MS/MS) against protein sequence databases was disseminated to the proteomics community. Now, over 20 years later, MS/MS database searching has become arguably the most commonly applied computational proteomics analysis method in practice. Not surprisingly, a number of novel MS/MS database search tools have been developed over the years, but the SEQUEST algorithm continues to be widely used.
SEQUEST was originally developed at the University of Washington and commercially licensed to the Thermo Electron Corporation. For a number of years through the 1990s, SEQUEST existed in two forms: the academic version developed at the University of Washington and the commercial version distributed by Thermo. Over the years, SEQUEST-like tools have expanded to include a commercial version from Sage-N Research (Sorcerer), other academic versions developed at the University of Washington (Crux, Tide), Scripps Research (ProLuCID), and Dartmouth College (macroSEQUEST, Tempest), among others. In 2012, the University of Washington’s version of SEQUEST was released as an open source project and renamed Comet. This article describes in detail some of the features and optimizations in the latest version of the Comet software tool.
High resolution MS/MS data have become common because Orbitrap (Fourier transform) and time-of-flight analyzers can now generate high resolution MS/MS spectra at fast acquisition rates. Accurate MS/MS fragmentation spectra allow for more stringent identifications because matching fragmentation peaks with tight mass tolerances is significantly more selective. However, such data pose a challenge to the Comet algorithm, and SEQUEST before it, with respect to how the data are represented internally. Spectra are stored as discrete arrays of numbers, where the array index represents the mass and the array value at that index represents the intensity. In this format, a high resolution spectrum requires a large amount of memory because of the large number of small mass buckets, or bins. This paper presents a detailed description of what such mass bins represent, how optimal bin sizes were determined for low and high resolution data, and two different mechanisms for addressing memory use for high resolution data.
In 1995, the second paper to be published on the SEQUEST algorithm described the ability to search for post-translational modifications. Being able to identify modified residues not contained in the protein sequences stored in sequence databases is a powerful method, with numerous applications for biological insight. For example, Swaney et al. were able to identify over 2,000 phosphorylation sites co-occurring with over 2,000 ubiquitylation sites in S. cerevisiae, allowing the investigation of how phosphorylation can be regulated by ubiquitylation. In a different application, the ability to search for post-translational modifications enabled Chavez et al. to identify cross-linked proteins in living human cells, demonstrating the ability to make direct topological measurements and provide evidence for novel protein–protein interactions. To give researchers more flexibility in how post-translational modifications can be analyzed, the most recent release of Comet incorporates new options for modification analysis, which are described in detail below with usage examples.
Materials and Methods
Comet is written in C++ and developed on both Linux and Windows operating systems. Comet incorporates the MSToolkit file parsing library to read mass spectral data in various formats. Multi-threading is implemented using POSIX threads on Linux and Windows native threads on Windows. The computer configuration used for all analyses and benchmarks is a dual Intel Xeon E5-2470 2.4 GHz CPU (eight total physical cores) with 64 GB RAM, running Red Hat Enterprise Linux Server 6.5. False discovery rates and q-values are calculated by ordering results by Comet’s E-value score and then computing, for a concatenated target-decoy search, the ratio of the number of accepted decoy matches to the number of accepted target matches at a given score threshold. Mass spectral data files used in the analysis presented here were downloaded from the PRoteomics IDEntification (PRIDE) repository or the Stem Cell Omics Repository (SCOR).
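The target-decoy calculation described above can be sketched in a few lines. This is a minimal illustration, not Comet's own code: it assumes that a lower E-value indicates a better match, that each match is already labeled as target or decoy, and that the q-value of a match is the smallest false discovery rate at any threshold that still accepts it. The function name `q_values` and the input layout are hypothetical.

```python
def q_values(psms):
    """Compute q-values for (e_value, is_decoy) pairs from a
    concatenated target-decoy search.

    Assumption: lower E-values are better matches.
    Returns (e_value, is_decoy, q_value) tuples sorted by E-value.
    """
    ordered = sorted(psms, key=lambda p: p[0])

    # FDR at each threshold: accepted decoys / accepted targets.
    fdrs = []
    targets = 0
    decoys = 0
    for _, is_decoy in ordered:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / targets) if targets else 1.0)

    # q-value: the minimum FDR over this and all looser thresholds.
    q = []
    running_min = 1.0
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q.append(running_min)
    q.reverse()

    return [(e, d, qv) for (e, d), qv in zip(ordered, q)]


example = [(1e-5, False), (1e-4, False), (1e-3, True), (1e-2, False), (0.1, False)]
for e_value, is_decoy, q in q_values(example):
    print(e_value, "decoy" if is_decoy else "target", q)
```

Taking the running minimum from the loosest threshold backward makes the q-values monotonic in the E-value, so a single cutoff on q corresponds to a single cutoff on the score.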
Results and Discussion
Sparse Matrix Representation of Spectra
The core scoring algorithm in SEQUEST is the cross-correlation score, or xcorr. In the 1994 manuscript, the cross-correlation score was calculated by performing Fourier transforms on both the experimental spectrum and the theoretical spectrum, multiplying one transform by the complex conjugate of the other transform, and performing an inverse Fourier transform. This mathematical operation generated the full correlation spectrum from which the cross-correlation score was derived. In 2008, a method to calculate the cross-correlation score efficiently was published, in which each experimental spectrum is preprocessed so that the cross-correlation score can be calculated by simply summing the processed intensity values at each theoretical fragment ion mass location. This optimization enabled the cross-correlation score to be applied to scoring all peptides instead of just the 500 best candidate peptides in the original implementation, enabling E-value and p-value [17, 18] calculations based on the cross-correlation score distribution. Performance comparisons of Comet with other search engines have been published elsewhere, and a comparison of Comet versus SEQUEST cross-correlation scores is presented below.
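The preprocessing idea can be illustrated with a small sketch. It assumes the form commonly given for the fast cross-correlation: from each bin, subtract the mean intensity of the neighboring bins within a fixed offset window (plus or minus 75 bins is assumed here; the window size is not stated in this text), after which the score is a plain sum over the theoretical fragment bins. Unit-intensity theoretical peaks are assumed, and all function names are hypothetical. The `direct_xcorr` function computes the same quantity the slow way, as a check.

```python
def preprocess(spectrum, window=75):
    """Subtract from each bin the mean of its neighbors within
    +/- window bins (the bin itself excluded). Bins beyond the
    array ends are treated as zero intensity."""
    n = len(spectrum)
    processed = []
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n - 1, i + window)
        background = sum(spectrum[lo:hi + 1]) - spectrum[i]
        processed.append(spectrum[i] - background / (2 * window))
    return processed


def fast_xcorr(processed, fragment_bins):
    """Score = sum of preprocessed intensities at the fragment bins."""
    n = len(processed)
    return sum(processed[b] for b in fragment_bins if 0 <= b < n)


def direct_xcorr(spectrum, fragment_bins, window=75):
    """Reference calculation: correlation at zero shift minus the
    mean correlation over all nonzero shifts in the window."""
    n = len(spectrum)
    bins = [b for b in fragment_bins if 0 <= b < n]

    def dot_at(shift):
        return sum(spectrum[b + shift] for b in bins if 0 <= b + shift < n)

    background = sum(dot_at(t) for t in range(-window, window + 1) if t != 0)
    return dot_at(0) - background / (2 * window)


spectrum = [0.0, 1.0, 5.0, 2.0, 0.0, 3.0, 8.0, 1.0, 0.0, 0.0, 4.0, 0.0]
fragments = [2, 6, 10]
processed = preprocess(spectrum, window=3)
print(fast_xcorr(processed, fragments), direct_xcorr(spectrum, fragments, window=3))
```

Because the preprocessing is done once per spectrum, scoring each candidate peptide costs only one array lookup per fragment ion, which is what makes scoring every candidate practical.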
The array index for a spectrum does not need to be exactly 1 Da wide. In fact, the optimum mass bin width is 1.0005 for low resolution data such as that acquired on an ion trap detector, and we recommend a mass bin width of 0.02 for high resolution spectra. In Comet, this bin size setting is controlled by the parameter “fragment_bin_tol.” For a given mass m and bin width w, the array index idx is determined by the equation “idx = (int)(m/w)”, which simply defines the array index as the integer part of the mass divided by the bin width. This allows for the array representation of a spectrum at any arbitrary bin width value. For high resolution data, much narrower bin widths are necessary to take advantage of the high mass accuracy of the fragment ion measurements. But as the bin width w is reduced from, say, 1.0 to 0.01, the corresponding spectral array grows 100-fold to accommodate the much smaller mass bins. Accordingly, the memory required to store the spectral data in this array format increases 100-fold, making this representation untenable for the analysis of standard-sized LC-MS/MS runs on typical desktop computers.
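The indexing rule and its memory consequence can be made concrete with a short sketch. The dictionary below only stands in for a sparse layout, storing occupied bins and nothing else; it is not a description of Comet's actual sparse matrix format, and keeping the most intense peak per bin is an assumption made for illustration. All names are hypothetical.

```python
def bin_index(mass, bin_width):
    """idx = (int)(m / w): the integer part of mass / bin width."""
    return int(mass / bin_width)


def dense_array_length(max_mass, bin_width):
    """Number of array elements needed to index masses up to max_mass."""
    return bin_index(max_mass, bin_width) + 1


def to_sparse(peaks, bin_width):
    """Map (mass, intensity) peaks to {bin index: intensity}.
    Assumption: when several peaks share a bin, keep the largest."""
    sparse = {}
    for mass, intensity in peaks:
        idx = bin_index(mass, bin_width)
        sparse[idx] = max(sparse.get(idx, 0.0), intensity)
    return sparse


peaks = [(147.113, 10.0), (147.118, 30.0), (500.31, 5.0)]
for width in (1.0005, 0.02):
    dense = dense_array_length(2000.0, width)
    occupied = len(to_sparse(peaks, width))
    print(f"bin width {width}: dense length {dense}, occupied bins {occupied}")
```

The dense length scales with 1/w, while the number of occupied bins is bounded by the number of peaks in the spectrum, which is why a sparse layout pays off at narrow bin widths.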
Table: Run times (in minutes:seconds) and memory use for a combination of default, sparse matrix, batch size, and fragment bin width options. Memory use is a function of the bin width setting plus the number of input spectra and does not vary with the sequence database size. Results indicate the new sparse matrix format performs as fast as the default array format with the added benefit of significant memory savings for high resolution search settings. [Table body not recovered; column groups: Default array + batch size, Sparse matrix + batch size]
Search Speed and Memory Use Versus Spectrum Batch Size
Given the rapidly increasing data acquisition throughput of modern instrumentation, the ability to process much larger MS/MS datasets is a necessity for modern search engines. Comet is well suited to handle this use case given the two developments described here: the sparse matrix format for improved memory efficiency and batch searching to facilitate iteratively analyzing extremely large files. Comet will run well on any modern computer, with search throughput directly related to the CPU speed, core count, and, to a lesser extent, memory size. The fast cross-correlation score is a spectrum-specific analysis in which every spectrum is processed independently of every other spectrum. There are therefore no inherent issues with searching extremely large queries beyond simply having to process more spectra. The benchmark run times shown in Figure 3, obtained using four cores of a 2012-era Intel CPU, should assist users in defining computer/server configurations suitable for the processing throughput desired (e.g., double the CPU core count to double the search throughput).
Impact of the Fragment Bin Offset Parameter
As demonstrated in Figure 4, the choice of bin size and bin offset can have a large impact on the resulting spectral representation and search scores. Note that the bin width is related to, but inherently different from, a classic fragment mass tolerance setting: varying the bin width does not have the same effect as varying a fragment mass tolerance over the same values. The bin width, along with the bin offset value, defines where the bin edges lie, but this does not guarantee that the bins are centered on each fragment peak, even when using values greater than the instrument mass accuracy. Small variations of the bin width will cause the bin edges to end up in suboptimal locations across many regions of the spectrum.
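The effect of the offset on bin edges can be illustrated with a small sketch. The indexing formula used here, idx = int(m / w + (1 - offset)), is an assumption about how an offset might enter the bin calculation; this text only gives the offset-free form idx = (int)(m/w). The masses and the function name are hypothetical, chosen so that two peaks 0.1 Da apart straddle a bin edge at one offset but not at the other.

```python
def bin_index_with_offset(mass, bin_width, bin_offset):
    """Assumed form: shift every bin edge by a fraction of the bin width.
    With this form, bin idx covers masses from (idx - 1 + offset) * w
    up to, but not including, (idx + offset) * w."""
    return int(mass / bin_width + (1.0 - bin_offset))


width = 1.0005
observed, theoretical = 500.20, 500.30  # 0.1 Da apart

for offset in (0.0, 0.4):
    a = bin_index_with_offset(observed, width, offset)
    b = bin_index_with_offset(theoretical, width, offset)
    verdict = "same bin" if a == b else "different bins"
    print(f"offset {offset}: observed -> {a}, theoretical -> {b} ({verdict})")
```

With an offset of 0.0 a bin edge falls near 500.25, between the two peaks, so they fail to match despite differing by far less than the bin width. With an offset of 0.4 the same bin spans roughly 499.65 to 500.65 and both peaks land in it.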
Theoretical Peak Shape and Flanking Peaks
The “theoretical_fragment_ion” parameter controls whether or not Comet includes signal from the flanking bins in the cross-correlation calculation. In the original implementation of SEQUEST, each fragment ion in the theoretical spectrum was reconstructed as a peak with an intensity of 50 at the mass bin corresponding to the fragment ion mass and peaks of intensity 25 at the two flanking bins. Adding the flanking peaks was meant to generate a peak shape that mimicked the wide peaks of the low resolution data of that time. With a wide bin size of 1.0005, adding signal from the flanking bins performs poorly compared with leaving off the flanking peaks. With narrow bin widths, however, contributions from the flanking peaks do improve identification rates (data not shown).
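The peak shape described above can be sketched as follows. The intensities of 50 and 25 come from the text; the `use_flanks` flag, the function name, and the choice to keep the larger value where peaks overlap are illustrative assumptions and do not describe Comet's internals.

```python
def theoretical_spectrum(fragment_masses, bin_width, n_bins, use_flanks):
    """Build a binned theoretical spectrum: intensity 50 at each
    fragment bin and, optionally, 25 at the two flanking bins."""
    spectrum = [0.0] * n_bins
    for mass in fragment_masses:
        idx = int(mass / bin_width)
        if 0 <= idx < n_bins:
            spectrum[idx] = max(spectrum[idx], 50.0)
        if use_flanks:
            for neighbor in (idx - 1, idx + 1):
                if 0 <= neighbor < n_bins:
                    # Assumption: never overwrite a stronger peak.
                    spectrum[neighbor] = max(spectrum[neighbor], 25.0)
    return spectrum


masses = [3.2, 7.9]
print(theoretical_spectrum(masses, 1.0, 10, use_flanks=False))
print(theoretical_spectrum(masses, 1.0, 10, use_flanks=True))
```

With flanking peaks enabled, each fragment ion spans three bins, so at a bin width of 1.0005 a single theoretical peak covers about 3 Da, whereas at a width of 0.02 it covers only about 0.06 Da.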
The current variable modification support allows for flexibility in how and where modifications are applied. Up to nine variable modifications can be specified, each of which can be applied to multiple residues, and more than one (up to nine) variable modification can be specified for the same amino acid. The concept of a binary modification, where the applicable residues in a peptide must be either all modified or all unmodified, is currently supported on a per-modification parameter basis. However, binary modifications across modification parameters, such as heavy lysine and heavy arginine in a SILAC experiment, which require different modification masses specified in separate variable modification parameters, are a feature that will be implemented in a future release. Additionally, N15 metabolic labeling currently requires two separate searches, one normal and one where all amino acid masses are statically modified to their N15 counterparts; a future release will support N15 light and heavy searches directly.
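The difference between an ordinary variable modification and a binary one can be shown by enumerating the candidate modified forms of a peptide for a single modification. This is a simplified sketch with hypothetical names: it handles one modification at a time and ignores any limit on the number of modified residues per peptide.

```python
from itertools import combinations


def modified_forms(peptide, residues, binary=False):
    """List the sets of modified positions (0-based) for one
    variable modification that applies to the given residues.

    binary=False: every subset of applicable sites is a candidate.
    binary=True:  the sites are either all modified or all unmodified.
    """
    sites = [i for i, aa in enumerate(peptide) if aa in residues]
    if binary:
        return [()] if not sites else [(), tuple(sites)]
    return [combo
            for k in range(len(sites) + 1)
            for combo in combinations(sites, k)]


peptide = "ASTKSR"
print(len(modified_forms(peptide, "ST")))       # all subsets of three sites
print(modified_forms(peptide, "ST", binary=True))
```

For a peptide with n applicable sites, the ordinary case yields 2^n candidate forms, while the binary case yields at most two, which is the behavior needed for labeling schemes in which a peptide is either fully light or fully heavy.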
Comparison of Comet Versus SEQUEST Cross-Correlation Score
Quite often, the internal details of database search algorithms are a mystery to those who use the tools daily, even for tools that are open source or have been in use for decades. While Comet stems from the academic version of SEQUEST that has existed for many years, it is still being actively developed and extended on a regular basis. Improvements include adding search features, optimizing the code for speed, and tweaking the core identification routines. Changes in the current release of Comet include implementation of more flexible modification options, a new sparse matrix data structure, multi-threaded optimization, and better search progress reporting. mzXML, mzML, ms2, and native Thermo RAW files are supported input formats, whereas pepXML, SQT, Percolator TSV, and text files are supported output formats. Since its initial release in 2012, Comet has had five subsequent major releases, has been directly downloaded well over 1,000 times, and is incorporated into a number of larger software projects. These include Crux, Chorus, PatternLab, ProHits, LabKey Server, PeptideShaker, MASSyPup, and the Trans-Proteomics Pipeline, among others. Users are encouraged to access Comet from within any one of these tools. Documentation and direct Comet download are available at http://comet-ms.sourceforge.net.
The authors acknowledge support for this work from NIH awards R01GM096306 (to W.S.N.) and P41GM103533 (to W.S.N. and M.J.M.). This work is supported in part by the University of Washington’s Proteomics Resource (UWPR95794). The authors thank Mike Riffle and Vagisha Sharma for manuscript feedback. J.K.E. thanks Nathan D. Camp for in vivo support.