Rp3: Ribosome profiling-assisted proteogenomics improves coverage and confidence during microprotein discovery

Vieira de Souza, Eduardo; L. Bookout, Angie; Barnes, Christopher A.; Miller, Brendan; Machado, Pablo; Basso, Luiz A.; Bizarro, Cristiano V.; Saghatelian, Alan

doi:10.1038/s41467-024-50301-4

Rp3: Ribosome profiling-assisted proteogenomics improves coverage and confidence during microprotein discovery

Article
Open access
Published: 09 August 2024

Volume 15, article number 6839, (2024)
Cite this article

Download PDF

You have full access to this open access article

From

View current issue

Rp3: Ribosome profiling-assisted proteogenomics improves coverage and confidence during microprotein discovery

Download PDF

1783 Accesses
19 Altmetric
1 Mention
Explore all metrics

Abstract

There has been a dramatic increase in the identification of non-canonical translation and a significant expansion of the protein-coding genome. Among the strategies used to identify unannotated small Open Reading Frames (smORFs) that encode microproteins, Ribosome profiling (Ribo-Seq) is the gold standard for the annotation of novel coding sequences by reporting on smORF translation. In Ribo-Seq, ribosome-protected footprints (RPFs) that map to multiple genomic sites are removed since they cannot be unambiguously assigned to a specific genomic location. Furthermore, RPFs necessarily result in short (25-34 nucleotides) reads, increasing the chance of multi-mapping alignments, such that smORFs residing in these regions cannot be identified by Ribo-Seq. Moreover, it has been challenging to identify protein evidence for Ribo-Seq. To solve this, we developed Rp3, a pipeline that integrates proteogenomics and Ribosome profiling to provide unambiguous evidence for a subset of microproteins missed by current Ribo-Seq pipelines. Here, we show that Rp3 maximizes proteomics detection and confidence of microprotein-encoding smORFs.

Short open reading frames (sORFs) and microproteins: an update on their identification and validation measures

Article Open access 17 March 2022

Detecting actively translated open reading frames in ribosome profiling data

Article 14 December 2015

Accurate annotation of human protein-coding small open reading frames

Article 09 December 2019

Introduction

Small open reading frames (smORFs, <300–450 nucleotides)^1,2,3,4 were excluded from typical genome annotation workflows because there were no methods to validate their translation, which led to the concern that most smORFs were false positives^5,6. The advent of Ribosome profiling, or Ribo-Seq, provided empirical data to identify actively translated smORFs and provide an accurate snapshot of translation at sub-codon resolution^7,8. The ability to accurately define translated smORFs has led to greater and greater efforts in this area to identify and functionally characterize these genes.

The assignment of smORFs occurs after Ribo-Seq data is fed into a bioinformatics workflow. A variety of different scoring algorithms are used to identify smORFs, such as RibORF⁹, Ribocode¹⁰, RiboDIPA¹¹, ORFRater¹², RiboTaper¹³, ORFscore¹⁴, RP-BP¹⁵, and PRICE¹⁶. These tools differ substantially in their approach, but most tools were designed for the de novo annotation of the translatome by assessing the 3-nucleotide periodicity, although machine learning algorithms trained on available annotation data are also common. The performance of these tools varies in many metrics, and ORFRater performance is also heavily affected by the usage of the drug harringtonine during sample preparation, which is a protein synthesis inhibitor that stalls the ribosome on the initiation codon to better identify the start of a (sm)ORF¹².

Although many pipelines are being successfully used for the identification of actively translated smORFs, there are some shortcomings that limit smORF annotation. A common problem is the existence of reads that map to multiple locations in the genome during the alignment step of these workflows¹⁷. These duplicated regions might result from a variety of events, such as whole genome duplication^18,19, recombination²⁰, retro-transposition²¹, and gene duplication²². Multi-mapping reads must be addressed with caution during genomics and transcriptomics analysis, as they can confound gene quantification and make genome annotation more difficult.

In a typical RNA-Seq analysis with short-reads, multi-mapping reads make up from 5% to 40% of total mapped reads¹⁷. This proportion is expected to be bigger in a Ribo-Seq analysis, as reads from ribosome footprints range from 25 to 34 bp^2,7, which increases the chance of them mapping to multiple sites when compared to traditional Illumina short-reads, which are 75–300 bp long²³. Most of the time, multi-mapped reads are discarded during read assignments in RNA-Seq analyses¹⁷. In our analysis, we show that the proportion of Ribo-Seq reads that multi-map is much higher than that of RNA-Seq reads, which should complicate ORF discovery when relying solely on translational evidence.

Proteogenomics is a multi-omics approach that integrates genomics, transcriptomics, and proteomics, and can be used for a multitude of tasks, including supporting genome annotation and allowing the identification of microproteins encoded by smORFs^24,25. The most robust evidence for the translation of a smORF is the direct detection of the resultant microprotein. Therefore, while the total number of smORFs revealed by proteogenomics is eclipsed by the higher number of Ribo-Seq annotated smORFs, there is added value in validating the microprotein translation with proteomics as it provides direct protein evidence.

We reasoned that the integration of proteogenomics alongside Ribo-Seq could provide a solution to overcome the challenge incurred by ambiguous and multi-mapping reads during smORF identification during Ribo-Seq-only analysis. We refer to this as the Ribosome Profiling and Proteogenomics Pipeline (Rp3). Although there are strategies to account for multi-mapping reads, such as ignoring them, splitting the reads across the mapped genes, or using statistical modeling of mapping uncertainty¹⁷, all of which are valid when working with sequencing reads alone, the integration of proteogenomics with Ribo-Seq is an advance since it provides additional layers of evidence by combining protein and translation information to achieve higher confidence in the detection of microproteins.

In this study, we employ the Rp3 workflow and reanalyze previously published datasets to provide additional evidence of smORFs that exist in regions inaccessible by Ribo-Seq alone (Fig. 1). Using the data from these results, we explore how Ribo-seq identifications differ from mass spectrometry peptide evidence of unannotated translated microproteins, including differences in genome and protein sequence composition, transcript isoforms, gene paralogy, and the presence of repeat regions, all of which can preclude the identification of translational events. The data show that Rp3 identifies smORFs and resultant microproteins that are disregarded in most pipelines that rely solely on Ribo-Seq evidence, significantly increasing the number of Ribo-Seq detected smORFs with proteomics detection—the highest confidence set of microproteins.

**Fig. 1: Workflow to reanalyze and overlay datasets.**

Results

Proteogenomics approach identifies microproteins that could not be found with Ribo-Seq

We have previously used Ribo-Seq to identify unannotated translated smORFs from thermogenic brown adipose tissue (BAT), energy storage-focused white adipose tissue (WAT), and beige adipose tissue²⁶, which yielded a total of 3877 unannotated sequences, including a bioactive secreted microprotein encoded by the mouse gene Gm8773. In the study, the 3877 unannotated smORFs were translated and appended to the mouse UniProt database to create a custom proteomics database that is able to detect stable microproteins from these sequences. While these results validated the translation of 85 smORFs into microproteins, there is a vast difference in the numbers of microproteins obtained from Ribo-Seq and proteomics. Significantly higher numbers of smORFs from Ribo-Seq are common in the field, and typically only a handful of microproteins are detected using proteomics workflows that incorporate Ribo-Seq smORFs. The lack of detectable microproteins from Ribo-Seq annotated smORFs has been attributed to several factors, including low abundance of microproteins, their length, which results in the generation of few tryptic peptides, and lack of unique tryptic peptides. Indeed, we have detected microprotein via Western blot with antibodies that we have never obtained proteomics data from to further highlight some of the challenges of microprotein detection²⁷.

Nevertheless, Ribo-Seq does not readily capture all ORFs. Alternative ORFs (AltORFs) overlap but are read in a different reading frame from the canonical ORF, and are easy to detect proteomically but difficult to detect by Ribo-Seq because of two different but overlapping reading frames that come from the same RNA. Several AltORFs have been shown to be functional⁵, highlighting the value of identifying these proteins. We also avoided the annotation of upstream overlapping smORFs (uoORFs) in the 3877 smORFs due to similar issues with overlapping transcripts, though uoORFs have been annotated in other RiboSeq studies^1,28. To determine if we were missing any translated smORFs that were excluded from the Ribo-Seq dataset, we reanalyzed the proteome and secretome data using a proteogenomics approach (Fig. 1).

Proteogenomics relies on custom proteomics databases that are derived from the 3-frame translation of RNA-Seq data, which, in theory, should capture all translated smORFs from the translatome. After excluding peptides from the proteogenomics analysis that matched annotated proteins and keeping only the ones that matched predicted unannotated proteins exclusively, we identified 130 unannotated microproteins (Fig. 2a and Supplementary Data 1). Out of these, 46 had at least one peptide matching proteins in the unreviewed Uniprot database (Supplementary Fig. 1). 33 of these peptides matched sequences with very low annotation scores (1–2), representing protein evidence for poorly characterized proteins. We also ran PRICE, Ribocode (Supplementary Data 1), and RibORF to identify smORFs based solely on RNA-Seq and Ribo-Seq evidence (RS microproteins). Although the number of Ribo-Seq identifications is much higher overall, the microproteins detected by proteomics are valuable since they are validated as stable members of the proteome. Overall, the overlap of the results coming from different pipelines is considerably small, which suggests the approach of Rp3 is orthogonal and complementary to the ones that use Ribo-Seq evidence alone to score smORFs. The proteomics coverage for Ribo-Seq smORFs identified by any of the three RS pipelines is significantly lower than those coming from Rp3 (Fig. 2b). To help understand this, we investigated the classification of smORFs most likely to be found using the different tools (Fig. 2c). The vast majority of smORFs identified with Rp3 reside in genome regions that are either non-coding or that are pseudogenes or retrotransposons (rtORFs). RibORF, Ribocode, and to a lesser extent, PRICE, were more likely to identify upstream ORFs (uORFs). The sequence length for microproteins identified with different pipelines also differ, where Rp3 tends to identify longer microproteins, and most smORFs identified by the other methods tend to be shorter (Fig. 2d).

**Fig. 2: Comparison of microproteins identified by proteogenomics and Ribo-seq.**

Differences in sequence composition may confound mass spectrometry characterization of microproteins from Ribo-Seq results

Unannotated smORFs differ in their amino acid sequence composition when compared to already annotated proteins². We wanted to see if this sequence bias extended to groups of microproteins coming from different approaches, such as proteogenomics and Ribo-Seq pipelines. Indeed, the distribution of amino acids varies among these groups (Fig. 2e), and lysine, which is of utmost importance for bottom-up proteomics²⁹, is less common in the RS microproteins than it is for the ones found with Rp3, which could contribute to the difficulty in their identification by mass spectrometry. Proline, which is substantially enriched in Ribo-seq microproteins, is known to confound the interpretation of fragmentation spectra. For instance, if the residue is present at the C-terminal, it fragments almost exclusively by cleavage of the amide bond that is adjacent to that proline, resulting mostly in y1 fragment ions³⁰.

Such amino acid biases could explain the low proteomics coverage for Ribo-seq microproteins, due to an increased difficulty in identifying microproteins rich in proline residues or those with very few tryptic sites. To confirm the latter, we inspected the tryptic peptides (TPs) obtained after an in silico digestion of annotated Rp3 and RS microproteins. The average number of TPs is larger for Rp3 proteins, which makes them more likely to be found in a mass spectrometry experiment (Fig. 2f). The number of TPs for Rp3 microproteins is also higher than those coming from Ribo-seq, which could contribute to their higher proteomics coverage. This could be explained in part by the tendency of Rp3 to pick up longer microproteins (Fig. 2d), which should produce more TPs after protein digestion. Moreover, the molecular weight for peptides from all three microproteins groups also differs significantly (Fig. 2g). The isoelectric point (pI) is also important for protein identification using mass spectrometry-based proteomics²⁹, and is significantly lower than the TPs from RS microproteins coming from the RibORF pipeline compared to ones from Rp3 (Fig. 2h).

Most Ribo-Seq reads cannot be uniquely assigned to smORF-containing transcripts that encode the Proteogenomics-derived (Rp3) microproteins

To understand the minimal overlap between the proteogenomics and Ribo-Seq results, we checked for evidence of translation for the proteogenomics hits. A priori, we expected to find evidence for the translation of Rp3 microproteins but figured they were absent from the RS-microproteins because they fell below a scoring threshold. We mapped Ribo-Seq reads for the Rp3 microproteins back to the transcripts that encode the smORFs. Surprisingly, we measured no Ribo-Seq coverage for the vast majority of proteogenomics smORFs when running FeatureCounts³¹ with the default settings (Fig. 3b), which should raise concerns about whether the peptide is a true identification or not, due to the lack of translational evidence, despite the presence of proteomics evidence. When we allowed the tool to include ambiguous and multi-mapping reads during read counting, however, we found translational evidence for many Rp3 smORFs in the Ribo-Seq data.

**Fig. 3: Ribo-Seq coverage for Rp3 smORFs.**

We grouped Rp3 smORFs together based on their detectability by Ribo-Seq. Some smORFs only have Ribo-Seq coverage when allowing for specific parameters during read counting, such as multi-mapping (Fig. 3a, middle), and ambiguous mapping (Fig. 3a, right) which means that these features preclude the identification of translational evidence for smORFs. Most smORFs do not have Ribo-Seq coverage when using default settings during read counting (Fig. 3b, c). We found a significant increase in the RPKMs for RPFs mapping to Rp3 smORFs when allowing multi-mapping (MM), ambiguous mapping (Amb), or both (MM_Amb), and divided them into their respective mapping groups. smORFs and their respective mapping groups are shown in Supplementary Data 1. In these datasets, the overall mapping landscape of Ribo-Seq reads differs drastically compared to RNA-Seq reads (Fig. 3d). As Ribo-Seq reads are much shorter, more than 80% are unassigned due to multi-mapping. This number is considerably lower for RNA-Seq reads, with a distribution ranging from 10% to 30% (t-test, P < 0.001). In fact, ~90% of Ribo-Seq reads could not be assigned when running featureCounts with default parameters. For RNA-seq, most reads could not be assigned due to overlapping regions. This could be due to the longer size of transcripts that the RNA-Seq reads are mapping to, compared to the ORFs the Ribo-Seq reads map to. As such, the transcripts should be more likely to overlap with another feature in the genome, requiring the tool to include overlapping features. Out of the multi-mappers from MM and MM_Amb that require multi-mapping specifically to be detected, and thus are absent from the Amb group, only 3 could be identified by Ribocode or RibORF (Fig. 3e). None of these were found with PRICE. The distribution of raw counts covering the Rp3 smORFs resembles a normal distribution (Fig. 3f), while most smORFs have lower RPKM coverage (Fig. 3g).

We found that smORFs in both the Amb and MM_Amb groups are present in regions with a high number of transcript isoforms and/or other overlapping features (Fig. 4a), making it harder for the counting software to distinguish which transcripts the reads might be mapping to. Additionally, MM and MM_Amb groups present a higher percentage of smORFs in repeat regions when compared to Amb and Default groups (Fig. 4b). When not accounting for these features, the counting tool tends to disregard these reads, leading to an apparent lack of translational evidence. This data shows that Rp3 smORFs, for which we have reliable protein evidence, would be disregarded by Ribo-Seq alone, due to limitations in the mapping step. By checking the coverage for Rp3 smORFs while allowing for different mapping characteristics, we are able to detect translational evidence for these sequences as well. Those are backed by the strongest evidence, as we have both proteomics and Ribo-Seq coverage.

**Fig. 4: Genomic landscape of Rp3 smORFs based on their mapping classification.**

To better understand the genomic regions where these smORFs reside based on their mapping classification (Supplementary Fig. 2), we checked the annotated regions where the smORFs overlap in their transcripts (Fig. 4c). Most smORFs in non-coding regions were found to have no Ribo-Seq coverage. The majority of smORFs with Ribo-Seq coverage were rtORFs. smORFs found in an alternative initiation site of another ORF (aiORF) were common among Amb and MM_Amb smORFs, which is expected as they are supposed to be overlapping other genes. Rp3 identified very few smORFs in transcripts annotated as long non-coding RNAs (lncRNAs) and smORFs in transcripts with retained introns (riORFs). The majority of identified uORFs required Amb or MM_Amb parameters to obtain Ribo-Seq coverage. Lastly, we checked the conservation levels for smORFs with no Ribo-Seq coverage and smORFs with coverage from the Rp3 pipeline (Supplementary Data 1). Overall, a higher proportion of Rp3 smORFs are conserved across different eukaryotes (Fig. 4d), even in distant ones like Danio rerio and Drosophilha melanogaster, while the ones with no coverage have very few homologs in distant organisms. This suggests that the coverage inferred by our Rp3 pipeline can identify microproteins that are evolutionarily conserved and, thus, more likely to be real.

Mass spectrometry peptides allow the identification of smORFs with uncertain Ribo-Seq mapping

Using Ribo-Seq evidence alone would result in microproteins being missing (Fig. 3b and c). By contrast, Rp3 combines the strengths of Ribo-Seq and proteogenomics to identify a unique subset of smORFs with a very high level of confidence. As an example of the mapping landscape of short Ribo-Seq reads for Rp3 smORFs, we plotted the position of each smORF identified with both proteogenomics and Ribosome profiling, along with the annotated genes using a circos plot (Fig. 5a). For instance, a smORF found on the reverse strand of chr8, (coordinates 107262838–107263248) would have been filtered out because of multi-mapping (Fig. 5b). The reads for this smORF were mapped to multiple different regions in the genome. None of the Ribo-Seq pipelines were able to identify this smORF due to the high degree of homology to other sequences in the genome. By aligning the microprotein peptide sequence identified from a high-quality MS/MS spectrum (Fig. 5c) across 50 genomic locations with high homology to this protein (full multiple sequence alignment available in Supplementary Fig. 3), it shows that the only nucleic acid sequence that is able to produce this particular microprotein is found at -chr8:107262838–107263248 (Fig. 5d). This example highlights the potential of Rp3 to reveal microproteins with translational and proteomics evidence—the highest confidence group of smORFs/microproteins. It exemplifies the challenges imposed by working solely with Ribo-Seq reads and how proteogenomics can circumvent these limitations.

**Fig. 5: Mapping landscape of the representative rtORF—chr8:107262838–107263248.**

Proteogenomics allows the identification of a unique subset of peptides presented in the HLA complex

Peptidomics studies from human leukocyte antigen (HLA) complex provide a rich source of potentially immunogenic microproteins and were used before to characterize smORFs identified by Ribo-Seq. Here, we reanalyzed published datasets^2,32 to determine whether using Rp3 would result in the identification of additional microproteins that were missed using the Ribo-Seq-only pipeline (Fig. 6a). After translating the previously assembled transcriptomes for HeLa-S3, K562 and HEK293T cell lines to the three reading frames, we performed a proteogenomics search with Rp3. Additionally, we used PRICE, Ribocode, and RibORF to score smORFs in these assemblies using RNA-Seq and Ribo-Seq evidence alone (Supplementary Data 2). After running Rp3, we identified 692 unannotated Rp3 microproteins with strong affinity to HLA Class I (Supplementary Data 2), 150 of which were shared with results from at least one of the Ribo-Seq pipelines (Fig. 6b). This overlap was maintained after matching fragmentation spectra from HLA peptidomics experiments with the microproteins coming from the Ribo-Seq pipelines, but the overall proportion of RS microproteins covered by mass spectrometry evidence was still low, following the same patterns of the mouse datasets (Fig. 6c). After checking for Ribo-Seq coverage using the same approach as in Fig. 3b, we found a similar pattern regarding mapping limitations (Fig. 6d), with hundreds of Rp3 smORFs present in regions affected by ambiguous and/or multi-mapping (Fig. 6e). Out of 31 multi-mapping smORFs identified with Rp3, which were found in the MM and/or MM_Amb groups but not Amb, only a few could be identified by the Ribo-Seq pipelines, with RibORF identifying the highest number (Fig. 6f). The distribution of raw counts follows a normal distribution (Fig. 6g), whereas most smORFs have a tendency for lower RPKM values (Fig. 6i). In concordance to the distribution of RPKM values for the mouse datasets (Fig. 3g), the distribution of RPKMs for Ribo-Seq reads mapping to Rp3 microproteins identified with HLA peptidomics seems shifted towards lower values, which suggests these microproteins could have faster turnover. Moreover, we have identified significant evolutionary conservation for most Rp3 microproteins (Fig. 6h), and only 122 microproteins were not found in the analyzed eukaryotic genomes. The vast majority of conserved microproteins are those that also have Ribo-Seq coverage.

**Fig. 6: Identification of microproteins in HLA peptidomics datasets.**

Discussion

The orthogonal approach of using both proteogenomics and Ribo-Seq to identify smORFs we employed in this study is particularly important to perform a truly comprehensive microprotein identification analysis. We showed how limitations in the read alignment step of Ribo-Seq reads prevent the correct annotation of unannotated smORFs that can be found with proteogenomics unambiguously. Sequence-wise, there are two main reasons for this discrepancy: the presence of repeat regions and paralogous sequences for the smORFs in the genome, and multiple different mRNA isoforms that carry the same or a similar smORF.

Gene paralogy results in reads that map to multiple regions in the genome during the alignment step, called multi-mappers, which are often discarded after a certain threshold¹⁷. Reads that map to a genomic region that encodes many different transcript isoforms are often treated as ambiguous reads, which are disregarded by default by read counting tools such as featureCounts. Simply including all ambiguous and multi-mapping reads into the analysis is not advisable, however, as this would increase the number of false positives and add uncertainty about the coordinates the reads are, in fact, mapping to. Moreover, the number of uniquely assigned reads in our datasets is greatly undermined by the number of multi-mapping reads, which reinforces the idea of not simply including non-uniquely mapped reads. Proteogenomics solves that by allowing us to unambiguously assign a Unique Tryptic Peptide (UTP) to a microprotein in the custom database. It is worth noting that just because a smORF has Ribo-Seq reads above a certain threshold, it is no guarantee that it would be called by Ribo-Seq-only pipelines, as the tools require, among other features, coverage across the whole smORF sequence, 3-nt periodicity, and a usual build-up of reads around the start codon with a slope around the stop codon. By leveraging proteomics evidence, we are able to circumvent this by identifying peptide evidence in a region that would otherwise be too similar to other parts of the genome, even if the rest of the sequence is not. If the remaining portion of the sequence has lower homology, the Ribo-Seq reads could still map to it without multi-mapping, but the highly conserved portion would confound Ribo-Seq callers. This is the case with the representative smORF in Fig. 5, where a higher degree of homology towards the end of the sequence confounds Ribo-Seq analysis, but still allows the MS peptide to map unambiguously. To add to the reliability of the identifications, most Rp3 microproteins in both the mouse (Fig. 4d) and human (Fig. 6h) datasets were highly conserved in other eukaryotes. A few, however, seem to be species-specific, even with Ribo-Seq coverage. Nonetheless, this is in agreement with findings reporting substantial evidence for the emergence of de novo genes^33,34,35, especially short ones, reiterating the importance of not disregarding evolutionarily new microproteins.

The annotated regions in the genome overlapped by novel smORFs provide some insights into the role of these novel sequences. In this study, instead of using traditional ways of annotating smORFs, we leveraged information that allowed us to better understand the reason behind the mapping limitations of very short Ribo-Seq reads. A big proportion of smORFs in MM and MM_Amb clusters were classified as rtORFs, meaning they share the loci with either pseudogenes or retrotransposons. This is very informative, as pseudogenes typically arise from the duplication of functional genes³⁶, and retrotransposons are present in multiple copies across the genome due to their highly dynamic copy-paste mechanism³⁷. Thus, these two classes of genes are expected to have high sequence similarity to other sites in the genome, resulting in regions prone to multi-mapping. The low numbers of uORFs across all groups is especially informative, as this is the most common type of smORFs to be identified in Ribo-Seq studies. This means that proteogenomics and Ribo-Seq pipelines might be more suitable to identify distinct groups of smORFs each, regarding their classification. As the number of Rp3 smORFs with no Ribo-Seq coverage is low but still exists in both proteomics datasets we analyzed, we advise to prioritize Rp3 microproteins that have at least coverage when allowing ambiguous and/or multi-mapping during read counting to avoid the inclusion of hits that have no translational evidence whatsoever. Rp3 microproteins with Ribo-Seq coverage and pure Ribo-Seq hits from established pipelines with mass spectrometry evidence are thus considered the gold standard for microprotein identification in this study.

Moreover, to reduce the number of false positives, we performed a second round of searches during the proteogenomics analysis by generating a database consisting of annotated proteins plus the microproteins identified in the initial proteogenomics searches, which used the whole 3-frame translated database from the transcriptome. This way, the FDR assessment of proteogenomics hits is not that strongly affected by a bloated database consisting of millions of sequences and we can achieve an accurate global FDR. By reassessing the FDR and checking the Ribo-Seq coverage for Rp3 microproteins, we are left with a very high-confidence set of novel sequences. To control FDR separately for unannotated microproteins identified by Rp3 and canonical proteins, we also employed a grouped-FDR approach, where we separated these two groups of microproteins and assessed the FDR separately with Percolator for each group, for both the mouse (Supplementary Fig. 4) and human (Supplementary Fig. 5) datasets. However, this greatly increased the number of identifications. To be more conservative and avoid including false positives, we kept the results using the approach where we assessed the FDR for the microproteins and canonical proteins together. Hits without Ribo-Seq coverage, although less reliable, are still robust as we included many additional steps in the pipeline to properly control the FDR. In the lack of Ribo-Seq data, Rp3 may still be used like a more traditional proteogenomics pipeline.

An important aspect to note is that the overlap between the Ribo-Seq and proteogenomics results was negligible, as very few microproteins were shared between the approaches. This is surprising, as we expected a higher overlap between microproteins with proteomics evidence and microproteins with translational evidence. This reinforces the complementary aspect of these two different approaches, as each returns a unique subset of microproteins that could not be identified with the other method due to differences in sequence composition. This corroborates findings from a recent study reporting a small intersection of results from Ribo-Seq-only pipelines identifying microproteins encoded by smORFs³⁸. Furthermore, the HLA peptidomics analysis provides a reliable source of evidence for novel microproteins without limiting the search space due to enzymatic cleavage of the proteome, potentially enabling its annotation based on antigen presentation. Also, it shows that this pattern is neither limited to the mouse genome or to the presented Ribo-Seq and proteomics datasets from a specific study²⁶. We would like to reiterate that Ribo-Seq-only pipelines are still a very reliable way to identify smORFs, especially due to their high sensitivity. However, using a Ribo-seq-assisted proteogenomics workflow allows the identification of a subset of microproteins residing in regions largely inaccessible to pipelines using translational evidence alone. We expect this workflow to increase proteomics coverage for Ribo-Seq results as well as to improve the quality of microprotein identification pipelines, as those microproteins with Ribo-Seq and proteomics coverage are the most compelling for downstream biological studies.

Methods

Reference-guided transcriptome assemblies

To maintain consistency with the previous analysis, we obtained the GTF files produced by Martinez et al. (2023)²⁶ for brown, beige, and white mouse adipose tissue. These files were generated after trimming the reads, removing the ones that mapped to rRNAs and tRNAs and aligning the remaining reads to the genome with STAR. The transcriptome was assembled using StringTie with default parameters to identify novel transcripts. For human data, we used GTF files from the cell lines HeLa-S3, and K562 from Martinez et al. (2020)² and HEK293T from Ma et al. (2018)³⁹. These GTF files were generated after aligning trimmed and ribosomal-clean reads to the human genome (assembly hg19) with STAR. The transcriptome was assembled using Cufflinks with default parameters while requiring multi-mapping and fragment bias correction. The reads for these three cell lines were obtained from total RNA-Seq ran on Illumina HiSeq 2500 and NextSeq 500.

Generation of custom databases for proteomics

To generate a database containing the whole coding potential of each cell line, we translated each transcriptome assembly generated to the three reading frames with the script GTFToFasta², giving priority to ATGs. In the case of multiple ATGs for a region between two stop codons, the most upstream ATG was selected. For the mouse datasets, we generated a non-redundant GTF file containing the merged transcriptomes for the three assemblies. This was done because the mass spectrometry files in the study were generated by combining the tryptic digests of the secretomes and proteomes of all three adipose tissue phenotypes. Afterward, we appended the Swiss-Prot Mouse proteome from Uniprot (UP000000589) to the database and tagged each annotated sequence to be removed in the post-processing steps. For each resulting database, we generated a decoy database containing the reverse sequences of each protein plus the contaminants, which were appended to the initial database. We performed the same bioinformatic steps for the HLA peptidomics proteogenomics analysis but used previously assembled human transcriptomes from HEK293T³⁹, HeLa-S3, and K562 cell lines² to generate the custom databases, to which we appended the Swiss-Prot human reference proteome from Uniprot (UP000005640). We have implemented this step in the database mode of the Rp3 pipeline.

Peptide search and post-processing of mass spectrometry data

To identify microproteins with Rp3, we used MSFragger (v3.5)⁴⁰ to map fragmentation spectra to the custom databases generated in the last step. For mouse datasets we used proteomics datasets that were previously generated for mouse brown, beige, and white adipose tissues. We used DDA files from whole-cell lysates and whole-secretome proteomes. The parameters were set as follows: precursor mass tolerance of 20 ppm, fragment mass tolerance of 50 ppm, Data-dependent acquisition as data type, fragment mass tolerance of 0.02 Da, and included carbamidomethylation (+57.021464 Da) as a fixed modification, and oxidation of methionine (+15.9949 Da) as a variable modification. For the validation step, we used Percolator (v3.06.1)⁴¹ to infer the FDR with a cutoff set at 0.01 at the protein level using the –picked-protein method. All pin files from the same datasets were processed together to better infer the FDR. After the search, we removed any peptides that matched an annotated protein but kept those that matched more than one predicted microprotein from the three-frame translated database. To make sure no annotated sequence was included in the results, we performed a Blastp (v2.12.0+) search against NCBI Refseq and excluded any microproteins that had a hit with identity and query cover 100%, i.e., a perfect match. Lastly, we performed a second round of searches to re-score the identified peptides by appending the final set of microproteins from the first search to the same reference proteome. Then, this smaller database was searched using the same mass spectrometry data and following the same bioinformatics steps. This was done to accurately assess the FDR without the effect of bloated databases coming from the three-frame translation of the transcriptome. During re-scoring, we applied MSBooster (v1.2.1)⁴² after the peptide search with MSFragger to predict retention times (RT) for each peptide using DIA-NN⁴³. The delta_loess_RT values were included in the ‘pin’ file generated during the peptide search as an extra feature for Percolator, so its model would include the calibrated RT values during prediction, which correspond to the experimental RT mapped to the predicted scale of a local regression.

To process HLA peptidomics data, since we are looking for peptides with no specific enzymatic cleavage, such as trypsin’s, we specified nonspecific cleavage with the parameters -- search_enzyme_cutafter ARNDCQEGHILKMFPSTWYV and --search_enzyme_name nonspecific. Additionally, we specified the parameters --num_enzyme_termini 0, --precursor_true_tolerance 6, --digest_mass_range 500.0_1500.0, --max_fragment_charge 3, --digest_min_length 8, --digest_max_length 25. Since the evidence provided by HLA peptidomics is the intact peptide, and a protein does not undergo experimental digestion beforehand, we applied a cutoff for an FDR of 0.01 at the peptide level. Similarly to the mouse datasets, we performed the same re-scoring steps and applied MSBooster to predict RT retention times. After removing every peptide that matched an annotated protein, we ran mhcflurry-predict-scan from the MHCFlurry package to identify which microproteins would contain peptides with strong binding affinity to the MHC Class I using the alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:02,HLA-C*07:02,HLA-A*01:01,HLA-A*02:06,HLA-B*44:02,HLA-B*07:02,HLA-C*01:02,HLA-C*03:01. We selected peptides with an affinity ≤ 100 and an affinity percentile ≤ 0.02. Only peptides 7–12 aa long were included (--peptide-lengths 7–12).

To check for proteomics coverage for the microproteins encoded by the smORFs identified with Ribo-Seq, we used the same parameters for MSFragger and Percolator and applied similar filters. The databases consisted of microproteins identified by either PRICE, Ribocode, or RibORF in each cell line. Since the database was much smaller than the proteogenomics one used for Rp3, we did not re-score the results from the first search to increase the number of identifications and keep fair comparisons. To identify which Rp3 microproteins would be annotated in Uniprot unreviewed databases, we performed a string match for each mass spectrometry peptide and protein in the Uniprot database and classified them based on their annotation level, which indicates how well-characterized the protein is. We have implemented the peptide search and post-processing of mass spectrometry data in the search mode of the Rp3 pipeline.

Processing of Ribo-Seq data

We trimmed the Ribo-Seq reads to remove the Illumina adapters (AGATCGGAAGAGCACACGTCT) using fastx_clipper with the parameters -Q 33, -l 20, -n, -v, -c, piped to fastx_trimmed with the parameters -Q 33, -f 1, both tools from the FastX Toolkit (v0.0.13) (http://hannonlab.cshl.edu/fastx_toolkit/). To remove reads mapping no tRNAs and rRNAs, we mapped the trimmed reads to a contaminant database running STAR with -outSAMstrandField intronMotif, --outReadsUnmapped Fastx. To map reads to the genomic coordinates of Rp3 smORFs, we used unmapped reads in fastq format as input for STAR with the parameters --outSAMstrandField intronMotif, -outFilterMismatchNmax 2, --outFilterMultimapNmax 9999, --chimScoreSeparation 10, --chimScoreMin 20, --chimSegmentMin 15, --outSAMattributes All. We specified a max of 9999 multi-mappings so STAR would not discard all reads mapping up to that many regions, as it does not keep reads with a number of alignments above the specified threshold. Reads for RibORF, Ribocode, and PRICE were aligned specifying a --outFilterMultimapNmax 4, as these tools use Ribo-Seq evidence alone. The Rp3 pipeline is more permissive because it processes smORFs identified with proteomics evidence, and uses the reads only to check for coverage for these microproteins. RibORF was run with default parameters, keeping only the ones with a score ≥0.7. Ribocode and PRICE both require CDS and start and stop codons in the GTF file to generate metaplots and predict 3-nt periodicity. As we are working with reference-guided assemblies from StringTie and Cufflinks, these GTF files contain only transcript and exon features. To include CDS features, we ran Transdecoder (v5.7.1) (https://github.com/TransDecoder/TransDecoder) to predict the longest ORF in each transcript and then appended the predicted CDS to the GTF files. These predicted CDS are only used as a starting point for predicting 3-nt periodicity, but ORFs are predicted from other putative regions in the transcripts later by the tools regardless. For each transcriptome assembly with CDS information, we generated a STAR (v2.7.4a) index with the custom annotations specified with the parameter—sjdbGTFfile. For PRICE, reads were aligned the same way as for RibORF. For Ribocode, since it requires reads to be aligned to the transcriptome and not the genome, we ran STAR with the parameters –quantMode TranscriptomeSAM GeneCounts –outFilterType BySJout and –alignEndsType EndToEnd, as suggested in their documentation.

We chose to infer the 3-nt periodicity and generate metaplots using each software’s own method, instead of defining the same off-sets for every analysis. This was done to keep consistency with the method of each tool, since they also calculate statistics based on their own predictions. To score ORFs with PRICE, we ran it specifying an FDR of 0.01. Since it does not generate protein sequences or GTF files, we ran $ gedi -e ViewCIT to generate results files filtered with the specified FDR and generated a custom GTF file based on the coordinates, which was used to generate a fasta file containing protein sequences with Gffread⁴⁴. To score the ORFs with Ribocode, we first ran $ prepare_transcripts specifying the adequate custom GTF file for each condition and the same reference genome used when preparing the indexes. Then, we ran $ metaplots and $ Ribocode with default parameters. Afterward, we filtered the results, applied a cutoff for the adjusted p-value (FDR) of 1%, and removed any ORF that was annotated by blasting them to the NCBI RefSeq, or that had a sequence longer than 150 amino acids or 450 codons. The same filtering was applied to results coming from PRICE.

Analysis of ambiguous and multi-mapping reads

To check for Ribo-Seq coverage for the smORFs we could find peptide evidence for with proteogenomics, we first appended the coordinates of those smORFs to the mouse mm18 reference GTF file. Then, we performed four rounds of read counting with featureCounts (v1.6.3)³¹, first with default parameters and then allowing multi-mapping reads with -M, allowing ambiguous mapping reads with -O, and allowing both ambiguous and multi-mapping reads with -M and -O. We plotted the counts for each combination of these two parameters (-M: MM, -O: Amb, -M -O: MM_Amb, and Default, when using default parameters) into a heatmap using the Python package nheatmap (https://pypi.org/project/nheatmap/). We then classified the smORFs based on their detectability using the different parameters of featureCounts. A smORF is included in a specific group if its set of parameters results in at least 10 raw counts for that smORF. The mapping classification for proteogenomics smORFs can be obtained by running the Rp3 pipeline in the ribocov mode. For each group, we performed comparisons regarding a number of isoforms and repeated regions as follows: first, for the overlapping features, we ran bedtools intersect (v2.30.0) (ref. ³¹) using the reference GTF file to identify how many features overlapped each smORF, and plotted those into box plots. To investigate repeat regions, we used repeat files for the mm10 genome provided by RepeatMasker (http://www.repeatmasker.org/species/mm.html) and checked which smORFs overlapped the repeat coordinates. For overlapping regions, we classified the overlap with the reference GTF file for mm10 from Ensembl after running bedtools intersect on the reference and custom GTF file containing the smORFs.

Circos plot generation

To generate the circos plot for read mapping visualization at the genome level, we used the Python package PyCircos (v0.3.0) (https://github.com/ponnhide/pyCircos) and added bars corresponding to the coordinates of each gene annotated in the reference mm10 GTF file to the outermost ring. Then, we did the same for the smORFs found with Ribo-Seq and for the Rp3 smORFs, which were added to the second and third ring, from edge to center, respectively. To analyze the multi-mapping landscape and generate the circos plot, we performed additional filtering in the sam files to analyze the distribution of reads with secondary alignments across the genome. For each sam file, we first kept only the reads with the flag 0x100, which indicates a secondary alignment, by running Samtools (v1.13) with the parameter -f 0x100. Afterward, we iterated the alignments and selected the reads that aligned at least once to a region in the genome that encodes an Rp3 microprotein based on the coordinates of our custom GTF file, while keeping all the other regions these selected reads mapped to. Using the filtered sam file containing only reads that mapped to at least one Rp3 smORF, we selected the reads that mapped to the representative smORF. Afterward, we added links to the center of the plot whose coordinates they point to in the genome correspond to the other regions the reads map to. To generate the multiple sequence alignment, we first identified homologs by running tblastn on the mouse genome assembly mm10. Then, we performed the multiple sequence alignment (MSA) with these sequences using MAFFT⁴⁵ and generated the MSA figure using the msa4u tool from the package uORF4u⁴⁶.

Sequence analysis comparison among annotated, Ribo-Seq and proteogenomics microproteins

To plot the amino acid composition, we extracted the microproteins sequences for each group of microproteins and plotted the distribution with the Python package matplotlib. To obtain the number of UTPs for each microprotein and the isoelectric point for each peptide, we used the tool RapidPeptidesGenerator (RPG)⁴⁷, using trypsin as the enzyme for the in silico protein digestion, and plotted the distributions with matplotlib.

Conservation analysis

To identify Rp3 smORF homologs in the genome of other eukaryotic organisms and check their overall level of conservation, we a performed a tBlastn search against the genome assemblies of Equus caballus (GCF_002863925.1), Homo sapiens (GCF_000001405.40), Canis lupus (GCF_000002285.5), Sus scrofa (GCF_000003025.6), Bos taurus (GCF_002263795.2), Pan troglodytes (GCF_028858775.1), Rattus norvegicus (GCF_015227675.2), Danio rerio (GCF_000002035.6), Balaenoptera musculus (GCF_009873245.2), Loxodonta africana (GCF_000001905.1), Drosophila melanogaster (GCF_000001215.4), Macaca mulatta (GCF_003339765.1). Then, we filtered the alignments to include only those with an E-value < 0.001 and a bitScore > 50. We then annotated a common NCBI tree in phylip format for these species using EvolView⁴⁸ by adding bars to the side of the tree corresponding to the smORFs identified with Rp3 in the organism of interest and numbers of smORF homologs in each species, according to their Ribo-Seq coverage.

Statistical analysis

We used the Python package SciPy (https://scipy.org/) to perform statistical analysis when adequate. To compare metrics such as the number of peptides per protein (Fig. 2f), molecular weight (Fig. 2g), pI (Fig. 2h), and number of overlapping features in the genome (Fig. 4a), we first combined the microprotein results coming from the same tool that passed the FDR thresholds. We performed comparisons based on the combined results of each tool (n = 5 biological replicates for a tryptic digest pool for mass spectrometry experiments. n = 2 biological replicates for each phenotype for Ribo-Seq experiments). To compare distribution of read assignment (Fig. 3d), we ran featureCounts on each RNA-Seq (n = 2 biological replicates for each of the three adipose tissue phenotypes for total RNA and another 2 for mRNA, all grouped together) and Ribo-Seq (n = 2 biological replicates for each of the three adipose tissue phenotypes, all grouped together) fastq files and compared the summary output containing the number of assigned reads for each group. We first checked for a Gaussian distribution for each group using the Shapiro–Wilk test. As our data followed a non-parametric distribution (Fig. 2f–h), we ran the Kruskal–Wallis test followed by Dunn’s post hoc adjusted for multiple hypothesis testing with the Benjamini-Holchberg FDR correction using SciPy. To perform comparisons between the read assignment of RNA-Seq and Ribo-Seq (Fig. 3d), since the data followed a parametric distribution, we ran T-tests using SciPy. We added statistical significance to the plots using statsannotations (https://github.com/trevismd/statannotations).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Raw data from mouse Ribo-Seq and mass spectrometry experiments used in this study were previously generated by Martinez et al. (2022)²⁵ and downloaded from Gene Expression Omnibus (GEO) under accession code GSE198109 [https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/geo/query/acc.cgi?acc=GSE198107] and MassIVE under accession codes MSV000089022 [https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=220e5ba9df934a34a5eff5dc93081a7c] and MSV000089023 [https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=1aef9f40f32b45488349ab0bfc5dfa17], respectively. smORF sequences previously identified by Ribo-Seq and raw Ribo-Seq data were obtained from Martinez et al. (2020)² under GEO accession code GSE125218. HLA peptidomics datasets were previously generated by Bassani-Sternberg (2015)³² and are available at proteomeXchange under accession code PXD000394. Information regarding novel microproteins identified in this study is available in Supplementary Data 1 and 2. GTF files comprising Rp3 identifications are available in Supplementary Data 3. Source data are provided with this paper.

Code availability

A command-line Python implementation of the Rp3 pipeline under the GNU General Public License (GPLv3) and custom scripts written for this study are available at https://github.com/Eduardo-vsouza/rp3⁴⁹, covering the steps for proteogenomics analysis, and Ribo-Seq coverage for proteogenomics microproteins.

References

Chothani, S. P. et al. A high-resolution map of human RNA translation. Mol. Cell 82, 2885–2899 (2022).
Article CAS PubMed Google Scholar
Martinez, T. F. et al. Accurate annotation of human protein-coding small open reading frames. Nat. Chem. Biol. 16, 458–468 (2020).
Article ADS CAS PubMed Google Scholar
Mumtaz, M. A. S. & Couso, J. P. Ribosomal profiling adds new coding sequences to the proteome. Biochem. Soc. Trans. 43, 1271–1276 (2015).
Article CAS PubMed Google Scholar
Aspden, J. L. et al. Extensive translation of small open reading frames revealed by Poly-Ribo-Seq. elife 3, e03528 (2014).
Article PubMed PubMed Central Google Scholar
Orr, M. W., Mao, Y., Storz, G. & Qian, S. B. Alternative ORFs and small ORFs: shedding light on the dark proteome. Nucleic Acids Res. https://doi.org/10.1093/nar/gkz734 (2020).
Basrai, M. A., Hieter, P. & Boeke, J. D. Small open reading frames: beautiful needles in the haystack. Genome Res. 7, 768–771 (1997).
Article CAS PubMed Google Scholar
Ingolia, N. T., Brar, G. A., Rouskin, S., McGeachy, A. M. & Weissman, J. S. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments. Nat. Protoc. 7, 1534–1550 (2012).
Article CAS PubMed PubMed Central Google Scholar
Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009).
Article ADS CAS PubMed PubMed Central Google Scholar
Ji, Z., Song, R., Regev, A. & Struhl, K. Many lncRNAs, 5’UTRs, and pseudogenes are translated and some are likely to express functional proteins. elife 4, e08890 (2015).
Article PubMed PubMed Central Google Scholar
Xiao, Z. et al. De novo annotation and characterization of the translatome with ribosome profiling data. Nucleic Acids Res. 46, e61–e61 (2018).
Article PubMed PubMed Central Google Scholar
Li, K., Hope, C. M., Wang, X. A. & Wang, J.-P. RiboDiPA: a novel tool for differential pattern analysis in Ribo-seq data. Nucleic Acids Res. 48, 12016–12029 (2020).
Article CAS PubMed PubMed Central Google Scholar
Fields, A. P. et al. A regression-based analysis of ribosome-profiling data reveals a conserved complexity to mammalian translation. Mol. Cell 60, 816–827 (2015).
Article CAS PubMed PubMed Central Google Scholar
Calviello, L. et al. Detecting actively translated open reading frames in ribosome profiling data. Nat. Methods 13, 165–170 (2016).
Article CAS PubMed Google Scholar
Bazzini, A. A. et al. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J. 33, 981–993 (2014).
Article CAS PubMed PubMed Central Google Scholar
Malone, B. et al. Bayesian prediction of RNA translation from ribosome profiling. Nucleic Acids Res. 45, 2960–2972 (2017).
CAS PubMed PubMed Central Google Scholar
Erhard, F. et al. Improved Ribo-seq enables identification of cryptic translation events. Nat. Methods 15, 363–366 (2018).
Article CAS PubMed PubMed Central Google Scholar
Deschamps-Francoeur, G., Simoneau, J. & Scott, M. S. Handling multi-mapped reads in RNA-seq. Comput. Struct. Biotechnol. J. 18, 1569–1576 (2020).
Article CAS PubMed PubMed Central Google Scholar
Wolfe, K. H. & Shields, D. C. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387, 708–713 (1997).
Article ADS CAS PubMed Google Scholar
McLysaght, A., Hokamp, K. & Wolfe, K. H. Extensive genomic duplication during early chordate evolution. Nat. Genet. 31, 200–204 (2002).
Article CAS PubMed Google Scholar
Ohta, T. Role of gene duplication in evolution. Genome 31, 304–310 (1989).
Article CAS PubMed Google Scholar
Kazazian, H. H. Jr Mobile elements: drivers of genome evolution. Science 303, 1626–1632 (2004).
Article ADS CAS PubMed Google Scholar
Magadum, S., Banerjee, U., Murugan, P., Gangapur, D. & Ravikesavan, R. Gene duplication as a major force in evolution. J. Genet. 92, 155–161 (2013).
Article PubMed Google Scholar
Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).
Article CAS PubMed PubMed Central Google Scholar
Zhu, Y. et al. Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow. Nat. Commun. 9, 1–14 (2018).
ADS Google Scholar
Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 11, 1114–1125 (2014).
Article CAS PubMed PubMed Central Google Scholar
Martinez, T. F. et al. Profiling mouse brown and white adipocytes to identify metabolically relevant small ORFs and functional microproteins. Cell Metab. 35, 166–183 (2023).
Article CAS PubMed PubMed Central Google Scholar
Morgado-Palacin, L. et al. The TINCR ubiquitin-like microprotein is a tumor suppressor in squamous cell carcinoma. Nat. Commun. 14, 1328 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
van Heesch, S. et al. The translational landscape of the human heart. Cell 178, 242–260 (2019).
Article PubMed Google Scholar
Zhang, Y., Fonslow, B. R., Shan, B., Baek, M.-C. & Yates, J. R. III Protein analysis by shotgun/bottom-up proteomics. Chem. Rev. 113, 2343–2394 (2013).
Article CAS PubMed PubMed Central Google Scholar
Grewal, R. N., El Aribi, H., Harrison, A. G., Siu, K. M. & Hopkinson, A. C. Fragmentation of protonated tripeptides: the proline effect revisited. J. Phys. Chem. B 108, 4899–4908 (2004).
Article CAS Google Scholar
Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).
Article CAS PubMed Google Scholar
Bassani-Sternberg, M., Pletscher-Frankild, S., Jensen, L. J. & Mann, M. Mass spectrometry of human leukocyte antigen class I peptidomes reveals strong effects of protein abundance and turnover on antigen presentation*[S]. Mol. Cell. Proteom. 14, 658–673 (2015).
Article CAS Google Scholar
Carvunis, A.-R. et al. Proto-genes and de novo gene birth. Nature 487, 370–374 (2012).
Article ADS CAS PubMed PubMed Central Google Scholar
Suenaga, Y. et al. NCYM, a Cis-antisense gene of MYCN, encodes a de novo evolved protein that inhibits GSK3β resulting in the stabilization of MYCN in human neuroblastomas. PLoS Genet. 10, e1003996 (2014).
Article PubMed PubMed Central Google Scholar
Van Oss, S. B. & Carvunis, A.-R. De novo gene birth. PLoS Genet. 15, e1008160 (2019).
Article PubMed PubMed Central Google Scholar
D’Errico, I., Gadaleta, G. & Saccone, C. Pseudogenes in metazoa: origin and features. Brief. Funct. Genom. 3, 157–167 (2004).
Article Google Scholar
Hancks, D. C. & Kazazian, H. H. Roles for retrotransposon insertions in human disease. Mob. DNA 7, 1–28 (2016).
Article Google Scholar
Tong, G., Hah, N. & Martinez, T. F. Comparison of software packages for detecting unannotated translated small open reading frames by Ribo-seq. Brief. Bioinform. 25, bbae268 (2024).
Article PubMed PubMed Central Google Scholar
Ma, J., Saghatelian, A. & Shokhirev, M. N. The influence of transcript assembly on the proteogenomics discovery of microproteins. PLoS ONE 13, e0194518 (2018).
Article PubMed PubMed Central Google Scholar
Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513–520 (2017).
Article CAS PubMed PubMed Central Google Scholar
Käll, L., Canterbury, J. D., Weston, J., Noble, W. S. & MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 4, 923–925 (2007).
Article PubMed Google Scholar
Yang, K. L. et al. MSBooster: improving peptide identification rates using deep learning-based features. Nat. Commun. 14, 4539 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Demichev, V., Messner, C. B., Vernardis, S. I., Lilley, K. S. & Ralser, M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat. Methods 17, 41–44 (2020).
Article CAS PubMed Google Scholar
Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Research 9, ISCB Comm J-304 (2020).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Article CAS PubMed PubMed Central Google Scholar
Egorov, A. A. & Atkinson, G. C. uORF4u: a tool for annotation of conserved upstream open reading frames. Bioinformatics 39, btad323 (2023).
Article CAS PubMed PubMed Central Google Scholar
Maillet, N. Rapid Peptides Generator: fast and efficient in silico protein digestion. NAR Genom. Bioinform. 2, lqz004 (2020).
Article PubMed Google Scholar
Subramanian, B., Gao, S., Lercher, M. J., Hu, S. & Chen, W.-H. Evolview v3: a webserver for visualization, annotation, and management of phylogenetic trees. Nucleic Acids Res. 47, W270–W275 (2019).
Article CAS PubMed PubMed Central Google Scholar
De Souza, E. V. et al. Rp3: Ribosome Profiling-assisted Proteogenomics Improves Coverage and Confidence During Microprotein Discovery https://github.com/Eduardo-vsouza/rp3 (2024) 10.5281/zenodo.12092044.
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Article CAS PubMed Google Scholar
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

A.S. acknowledges financial support from NIH grants P30CA014195, R01GM102491, and RC2DK129961; Frederick Paulsen and the Ferring Foundation; and a sponsored research agreement with Novo Nordisk Research Center Seattle, Inc. This work was supported in part by National Institute of Science and Technology on Tuberculosis (Decit/SCTIE/MS-MCT-CNPq-FNDTC-CAPES-FAPERGS) [grant number 421703/2017-2], Banco Nacional de Desenvolvimento Econômico e Social (BNDES/FUNTEC) [grant number 14.2.0914.1], FAPERGS [grant numbers 17/1265-8 INCT-TB and 19/1724-3 PQG] and CAPES (CAPES-Print program 041/2017). C.V.B. (CNPq, grant 311949/2019-3) is a Research Career Awardee of CNPq. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES), Finance Code 001. We would like to acknowledge the financial support given by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES). C.V.B. would like to acknowledge the financial support given by CNPq/FAPERGS/CAPES/BNDES to the National Institute of Science and Technology on Tuberculosis (INCT-TB), Brazil. We would like to thank Manami J. Nishikawa for helping design the schematics for Figs. 3 and 6.

Author information

Authors and Affiliations

Centro de Pesquisas em Biologia Molecular e Funcional (CPBMF) and Instituto Nacional de Ciência e Tecnologia em Tuberculose (INCT-TB), Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Brazil
Eduardo Vieira de Souza, Pablo Machado, Luiz A. Basso & Cristiano V. Bizarro
Programa de Pós-Graduação em Biologia Celular e Molecular, Pontifícia Universidade Católica do Rio Grande do Sul, 90616-900, Porto Alegre, Rio Grande do Sul, Brazil
Eduardo Vieira de Souza, Pablo Machado, Luiz A. Basso & Cristiano V. Bizarro
Clayton Foundation Laboratories for Peptide Biology, Salk Institute for Biological Studies, La Jolla, CA, USA
Eduardo Vieira de Souza, Brendan Miller & Alan Saghatelian
Novo Nordisk Research Center Seattle Inc., Seattle, WA, USA
Angie L. Bookout & Christopher A. Barnes

Authors

Eduardo Vieira de Souza
View author publications
You can also search for this author in PubMed Google Scholar
Angie L. Bookout
View author publications
You can also search for this author in PubMed Google Scholar
Christopher A. Barnes
View author publications
You can also search for this author in PubMed Google Scholar
Brendan Miller
View author publications
You can also search for this author in PubMed Google Scholar
Pablo Machado
View author publications
You can also search for this author in PubMed Google Scholar
Luiz A. Basso
View author publications
You can also search for this author in PubMed Google Scholar
Cristiano V. Bizarro
View author publications
You can also search for this author in PubMed Google Scholar
Alan Saghatelian
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

E.V.S. and A.S. conceptualized the study. E.V.S., C.A.B., B.M. and A.L.B. performed the bioinformatics methodologies, visualization, and data curation. E.V.S. wrote the python code for the bioinformatics pipeline. A.S., C.V.B., P.M., and L.A.B. performed supervision of the research. E.V.S., A.S., and C.V.B. wrote the original draft of the manuscript.

Corresponding authors

Correspondence to Cristiano V. Bizarro or Alan Saghatelian.

Ethics declarations

Competing interests

The authors declare the following competing interests. Angie L. Boookout and Christopher A. Barnes are employees of Novo Nordisk. Alan Saghatelian is a consultant for and cofounder of Exo and Velia Therapeutics. All other authors have no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Description of Additional Supplementary Files

Supplementary Dataset 1

Supplementary Dataset 2

Supplementary Dataset 3

Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Vieira de Souza, E., L. Bookout, A., Barnes, C.A. et al. Rp3: Ribosome profiling-assisted proteogenomics improves coverage and confidence during microprotein discovery. Nat Commun 15, 6839 (2024). https://doi.org/10.1038/s41467-024-50301-4

Download citation

Received: 10 October 2023
Accepted: 08 July 2024
Published: 09 August 2024
DOI: https://doi.org/10.1038/s41467-024-50301-4
Springer Nature Limited

Rp3: Ribosome profiling-assisted proteogenomics improves coverage and confidence during microprotein discovery

Abstract

Similar content being viewed by others

Introduction

Results

Proteogenomics approach identifies microproteins that could not be found with Ribo-Seq

Differences in sequence composition may confound mass spectrometry characterization of microproteins from Ribo-Seq results

Most Ribo-Seq reads cannot be uniquely assigned to smORF-containing transcripts that encode the Proteogenomics-derived (Rp3) microproteins

Mass spectrometry peptides allow the identification of smORFs with uncertain Ribo-Seq mapping

Proteogenomics allows the identification of a unique subset of peptides presented in the HLA complex

Discussion

Methods

Reference-guided transcriptome assemblies

Generation of custom databases for proteomics

Peptide search and post-processing of mass spectrometry data

Processing of Ribo-Seq data

Analysis of ambiguous and multi-mapping reads

Circos plot generation

Sequence analysis comparison among annotated, Ribo-Seq and proteogenomics microproteins

Conservation analysis

Statistical analysis

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation