Revealing the Proteome Complexity by Mass Spectrometry
The complexity of higher biological organisms is astounding, but its source is far from obvious. With the emergence of epigenetics, the presumed main source of complexity has shifted from the genome to pre- and post-translational modifications of proteins. The human organism is estimated to contain about 100,000 different protein sequences and perhaps 10-100 times as many distinct protein forms. Analyzing the human proteome is therefore a much more challenging task than analyzing the human genome. The challenge is to obtain experimental datasets that contain enough information to match the underlying complexity.
Mass spectrometry (MS) is one of the most informative techniques and is widely used today for protein characterization. MS is the fastest-growing area of spectroscopy; in 2005 it overtook NMR as the prime research field. After a major revolution in the late 1980s (recognized by the Nobel Prize in Chemistry in 2002), MS has continued to develop rapidly, showing a remarkable capacity for innovation. Today, several different types of mass analyzers compete for a place in the future of the field. This diversity means that MS, although a century old, is still evolving quickly and is far from saturation.
Despite the rapid progress, today’s MS tools are still largely insufficient. Both mathematical models of MS-based proteomics analysis and experimental assessments have shown a large disproportion between the information content of experimental MS datasets and the underlying sample complexity. One of the most desired improvements is higher-quality ion fragmentation in tandem mass spectrometry (MS/MS). Fragmentation quality boils down to the ability to specifically cleave each of the three types of backbone bonds (Cα-C, C-N and N-Cα) linking amino acid residues in a polypeptide sequence. This formidable physico-chemical challenge is being met by recently emerged techniques involving ion-electron reactions.
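To make the three backbone cleavage types concrete, the following minimal sketch lists the singly charged fragment ions expected at each inter-residue site of an unmodified peptide. It assumes the common nomenclature in which cleavage of the Cα-C bond gives a/x ions, of the C-N bond gives b/y ions, and of the N-Cα bond gives c/z ions, and it uses standard monoisotopic residue masses. The function name `fragment_ions` and the example peptide are illustrative only. The z value follows the simple y minus NH3 convention; ion-electron reactions commonly produce the radical z ion, which is heavier by one hydrogen atom.

```python
# Monoisotopic residue masses in Da (standard values for unmodified residues).
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565
AMMONIA = 17.026549
CO = 27.994915
CO2 = 43.989829


def fragment_ions(peptide):
    """Return singly charged a/b/c and x/y/z m/z values for each cleavage site.

    Site i lies between residue i and residue i+1 (1-based). This is an
    illustrative helper, not taken from any specific software package.
    """
    masses = [RESIDUE[aa] for aa in peptide]
    total = sum(masses)
    rows = []
    prefix = 0.0
    for i in range(1, len(peptide)):
        prefix += masses[i - 1]          # N-terminal residues up to the site
        suffix = total - prefix          # C-terminal residues after the site
        b = prefix + PROTON              # C-N (amide) bond, N-terminal piece
        y = suffix + WATER + PROTON      # C-N (amide) bond, C-terminal piece
        rows.append({
            "site": i,
            "a": b - CO,                 # Calpha-C bond, N-terminal piece
            "b": b,
            "c": b + AMMONIA,            # N-Calpha bond, N-terminal piece
            "x": suffix + CO2 + PROTON,  # Calpha-C bond, C-terminal piece
            "y": y,
            "z": y - AMMONIA,            # N-Calpha bond, C-terminal piece
        })
    return rows


if __name__ == "__main__":
    for row in fragment_ions("PEPTIDE"):
        print(row["site"], {k: round(v, 4) for k, v in row.items() if k != "site"})
```

A spectrum that contains ions from all three series at every site constrains the sequence far more tightly than one with gaps, which is why bond-specific fragmentation matters for sequencing.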
Characterizing the primary sequence of polypeptides composed of unmodified amino acids is a basic task in proteomics. A recent large-scale evaluation has shown that de novo sequencing by conventional MS/MS is insufficiently reliable. Fortunately, novel fragmentation techniques have improved the situation and allowed the first proteomics-grade de novo sequencing routine to be developed.
Another group of challenges relates to the ability to extract maximum information from MS/MS data. The database search technologies developed in the late 1990s are still the backbone of routine proteomics analyses, but they are rapidly becoming insufficient. Typically, only 5 to 15% of all MS/MS spectra produce “hits” in the database, and the bulk of the data is discarded. Research into this issue has led to the emergence of a quality factor for MS/MS data (S-score). S-score analysis has shown that only half of the discarded data are discarded for a good reason, while the other half could be utilized by improved algorithms. Such algorithms, specially designed to deal with any mutation or modification, have recently uncovered hundreds of new types of modifications in the human proteome. High mass accuracy reveals the elemental compositions of these modifications, and MS/MS determines their positions. The potential of such algorithms for unearthing the vast and previously invisible world of modifications, and thus for tackling the proteome’s enormous complexity, will be discussed.
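The link between mass accuracy and elemental composition can be illustrated with a small sketch. It enumerates C/H/N/O compositions whose monoisotopic mass matches an observed mass shift within a given tolerance. The element set, the maximum atom counts, the tolerance values and the function name `compositions` are all assumptions chosen for illustration and are not taken from the algorithms mentioned above. The example uses the well-known near-isobaric pair of acetylation (C2H2O, +42.0106 Da) and trimethylation (C3H6, +42.0470 Da).

```python
from itertools import product

# Monoisotopic atomic masses in Da.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

# Assumed upper bounds on atom counts for a small modification (illustrative).
MAX_COUNT = {"C": 4, "H": 8, "N": 2, "O": 3}


def compositions(mass_shift, tol_da):
    """List (composition, mass) pairs within tol_da of mass_shift, best match first."""
    elements = list(MONO)
    hits = []
    for counts in product(*(range(MAX_COUNT[e] + 1) for e in elements)):
        mass = sum(n * MONO[e] for n, e in zip(counts, elements))
        if abs(mass - mass_shift) <= tol_da:
            hits.append((dict(zip(elements, counts)), mass))
    return sorted(hits, key=lambda h: abs(h[1] - mass_shift))


if __name__ == "__main__":
    observed = 42.0106  # mass shift consistent with acetylation
    for tol in (0.05, 0.003):
        found = compositions(observed, tol)
        print(f"tolerance {tol} Da: {len(found)} candidate(s)")
        for comp, mass in found:
            print("   ", comp, round(mass, 4))
```

Under these assumed bounds, a loose tolerance leaves several candidate compositions for the same nominal mass, while a tight tolerance leaves only C2H2O. Real modifications can also involve other elements (for example P or S) and losses of atoms, which this sketch does not cover.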