Ebook Advancement In Protein Inference From Shotgun Proteomics Using Peptide Detectability

Submitted by wulan on Thu, 07/30/2009 - 03:18

Shotgun proteomics refers to the use of bottom-up proteomics techniques in which the protein content in a biological sample mixture is digested prior to separation and mass spectrometry analysis. Typically, liquid chromatography (LC) is coupled with tandem mass spectrometry (MS/MS) resulting in high-throughput peptide analysis. The MS/MS spectra are searched against a protein database to identify peptides in the sample. Currently, Sequest and Mascot are the most frequently used computer programs for conducting peptide identification, both comparing experimental MS/MS spectra with in silico spectra generated from the peptide sequences in a database. Compared to top-down proteomics techniques, shotgun proteomics avoids the modest separation efficiency and poor mass spectral sensitivity associated with intact protein analysis, but it also encounters a new problem in data analysis, that of determining the set of proteins present in the sample based on the peptide identification results.
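The digestion step can be mimicked in silico. The sketch below assumes trypsin as the protease (the text refers to tryptic peptides later on) and uses the commonly cited rule of cleaving after K or R unless the next residue is P. The function name `tryptic_digest` and the example sequence are hypothetical, and missed cleavages and modifications are ignored.

```python
def tryptic_digest(sequence, min_length=1):
    """Split a protein sequence into fully tryptic peptides.

    Assumed rule: cleave C-terminal to K or R, except when the next
    residue is P. Missed cleavages are not modeled.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep the C-terminal remainder if the sequence does not end in K/R.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return [p for p in peptides if len(p) >= min_length]


# Hypothetical toy sequence, not taken from any real database.
toy_peptides = tryptic_digest("MKWVTFRPLLK")
print(toy_peptides)
```

In a database search, peptides produced this way for every protein entry are the candidates whose theoretical spectra are compared with the experimental MS/MS spectra.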

At first glance, this problem seems trivial. It may be concluded that a protein is present in the sample if and only if at least one of its peptides is identified. This conclusion is true, however, only when each identified peptide is unique, i.e. when it belongs to only one protein. If some peptides are degenerate, i.e. shared by two or more proteins in the database, the question of which of these proteins are present in the sample has multiple possible solutions. Indeed, tryptic peptides are frequently degenerate, especially in proteome samples from vertebrates, which, due to recent gene duplications, often have a large number of paralogs. In addition, alternative splicing in higher eukaryotes results in many identical protein subsequences. The following example illustrates the extent of peptide degeneracy in a real proteomics experiment. Of the 693 peptides identified in a rat sample used in this study (see sections 3-4 for details), 296 were unique and 397 were degenerate when searched against the full proteome of R. norvegicus. These peptides can be assigned to a total of 805 proteins, of which only 149 could be assigned based on the 296 unique peptides.
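The distinction between unique and degenerate peptides can be made concrete with a small sketch. It assumes the database is available as a mapping from protein identifier to its set of peptides; the function name `classify_peptides` and the toy data are hypothetical and are not the rat data set described above.

```python
from collections import defaultdict


def classify_peptides(protein_to_peptides, identified):
    """Split identified peptides into unique and degenerate sets.

    A peptide is unique if exactly one protein in the database contains
    it, and degenerate if two or more proteins contain it.
    """
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)

    unique = {p for p in identified if len(peptide_to_proteins[p]) == 1}
    degenerate = {p for p in identified if len(peptide_to_proteins[p]) > 1}
    return unique, degenerate


# Hypothetical toy database: three proteins sharing some peptides.
toy_db = {
    "P1": {"AAK", "CCR", "DDK"},
    "P2": {"CCR", "EEK"},
    "P3": {"DDK", "CCR"},
}
unique, degenerate = classify_peptides(toy_db, {"AAK", "CCR", "DDK"})
print(sorted(unique), sorted(degenerate))
```

Here only "AAK" points to a single protein; the other two identified peptides are compatible with more than one protein, which is the situation the text describes for 397 of the 693 rat peptides.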

Nesvizhskii and colleagues first formalized this challenge in shotgun proteomics data analysis. They formulated the protein inference problem and proposed, as a solution, the minimum set of proteins that contains all identified peptides. Other methods assign the unique peptides first, and then use statistical methods to assign the degenerate peptides based on the likelihood of each putative protein already identified. As a result, if two proteins share some common tryptic peptides, these methods can decide the presence of each protein only if at least one unique peptide has been identified in one of the proteins. The degenerate peptides will most likely be assigned to the longer protein, because the shorter proteins may not contain any unique peptide (e.g. see Fig. 2 in reference 7).
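The minimum-set formulation is an instance of the set cover problem. The sketch below uses the standard greedy heuristic, which is an approximation chosen here for brevity and is not claimed to be the procedure used by Nesvizhskii and colleagues. The function name, the alphabetical tie-breaking, and the toy data are all assumptions.

```python
def greedy_minimal_protein_set(protein_to_peptides, identified):
    """Greedy approximation of a minimal protein set.

    Repeatedly picks the protein that explains the most still-unexplained
    identified peptides. Ties are broken alphabetically (an arbitrary,
    assumed choice). Returns the chosen proteins and any peptides that
    no protein in the database can explain.
    """
    uncovered = set(identified)
    chosen = []
    while uncovered:
        best = max(
            sorted(protein_to_peptides),
            key=lambda prot: len(protein_to_peptides[prot] & uncovered),
        )
        gained = protein_to_peptides[best] & uncovered
        if not gained:
            break  # remaining peptides match no protein in the database
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical toy database.
toy_db = {
    "P1": {"p1", "p2", "p3"},
    "P2": {"p2", "p3"},
    "P3": {"p3", "p4"},
}
chosen, unexplained = greedy_minimal_protein_set(toy_db, {"p1", "p2", "p3", "p4"})
print(chosen, unexplained)
```

In this toy case P2 is never reported, because all of its peptides are already explained by P1. When two proteins explain exactly the same identified peptides, the choice reduces to the tie-break, which is the ambiguity the rest of the text addresses.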

In this paper, we revisit the protein inference problem based on the recently proposed concept of peptide detectability. The detectability of a peptide is defined as the probability of observing it in a standard proteomics experiment. We proposed that detectability is an intrinsic property of a peptide, completely determined by its sequence and its parent protein. We also showed that peptide detectability can be estimated from the parent protein’s primary structure using a machine learning approach. The introduction of peptide detectability provides a new approach to protein inference, in which not only the identified peptides but also those that are missed (not identified) are important for the overall outcome. Figure 1 illustrates the advantage of the new idea. Assume A and B are two proteins sharing 3 degenerate tryptic peptides (a, b, and c, shaded). Each protein in Fig. 1 also has unique tryptic peptides (d and e for A; f, g, h, and i for B; white). Suppose that only peptides a, b, and c are identified. According to the original formulation of the protein inference problem, the identities of A and B cannot be determined, since the only identified peptides are degenerate.

However, if all the tryptic peptides are ranked in each protein according to their detectabilities (Fig. 1), we may infer that protein A is more likely to be present in the sample than protein B. This is because, if B were present, we would probably have observed peptides f-i along with peptides a-c, all of which have lower detectabilities than any of f, g, h, or i. On the other hand, if protein A is present, we may still miss peptides d and e, which have lower detectabilities than peptides a-c, especially if A is at relatively low abundance. In summary, peptide detectability and its correlation with protein abundance provide a means of inferring the likelihood of identifying a peptide relative to all other peptides in the same parent protein. This idea can then be used to distinguish between proteins that share tryptic peptides based on a probabilistic framework.
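The reasoning about proteins A and B can be illustrated numerically. The sketch below is a deliberately simplified model and not the probabilistic framework of the paper: it assumes that, given a protein is present, each of its peptides is observed independently with probability equal to its detectability, and it ignores protein abundance. All detectability values are invented for illustration; only their ordering follows the description of Fig. 1.

```python
def observation_likelihood(detectabilities, observed):
    """Likelihood of the observed peptide set if the protein is present.

    Assumed model: each peptide is observed independently with
    probability equal to its detectability, so identified peptides
    contribute d and missed peptides contribute (1 - d).
    """
    likelihood = 1.0
    for peptide, d in detectabilities.items():
        likelihood *= d if peptide in observed else (1.0 - d)
    return likelihood


# Hypothetical detectabilities. In A, the shared peptides a-c rank above
# the unique peptides d and e; in B, they rank below the unique peptides f-i.
protein_a = {"a": 0.8, "b": 0.7, "c": 0.6, "d": 0.2, "e": 0.1}
protein_b = {"f": 0.9, "g": 0.85, "h": 0.8, "i": 0.75,
             "a": 0.5, "b": 0.4, "c": 0.3}

observed = {"a", "b", "c"}  # only the degenerate peptides were identified
like_a = observation_likelihood(protein_a, observed)
like_b = observation_likelihood(protein_b, observed)
print(like_a > like_b)
```

Under these assumed values the observation is far better explained by A than by B, because explaining it with B requires four highly detectable peptides to have been missed.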
