Background

Deciphering the human proteome is an important challenge of the post-genomic era. Improvements to the current set of reference proteins can affect several aspects of health and disease research, for example by filling gaps in known molecular pathways or by revealing new molecular markers and therapeutic targets. Currently, most research in the genomic sciences uses reference protein sequences available in a few databases (AceView [1], GENCODE [2], RefSeq [3], Ensembl [4], VEGA [5], CCDS [6]). In most cases, each gene is assigned one reference Open Reading Frame (ORF), commonly chosen by its length (the longest one) and by other criteria such as the presence of domains in the predicted amino acid sequence, evolutionary conservation and number of introns [1]. In contrast, the same databases contain hundreds of thousands of distinct human transcripts for approximately twenty thousand genes, revealing a huge diversity in the human transcriptome (e.g., http://ensembl.org/Homo_sapiens/Info/Annotation). This difference arises because almost all human genes produce several distinct transcripts through molecular mechanisms such as alternative splicing [7], alternative polyadenylation [8], and alternative transcription initiation [9]. Moreover, a single transcript can have multiple ORFs (altORFs), which may give rise to proteins that are entirely distinct in amino acid composition and cellular function (for a review see [10]). Thus, compared with the transcriptome, the known diversity of the proteome is still limited, and the assumption of a single reference protein for each gene is currently being challenged. Interestingly, among the increasing number of experimentally detected alternative proteins [11,12,13], many were discovered in cancer tissues or cell lines [14], highlighting their relevance for cancer research.
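To make the notion of multiple ORFs per transcript concrete, the following minimal sketch lists every ATG-to-stop ORF in the three forward reading frames of a transcript. It is an illustration only, not the procedure used by any of the cited databases; the function name `find_orfs`, the `min_codons` cut-off and the toy sequence are assumptions introduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(mrna, min_codons=10):
    """Return (start, end, frame) for each ATG..stop ORF on the forward strand.

    Coordinates are 0-based, end-exclusive and include the stop codon.
    min_codons is a hypothetical length cut-off counted without the stop codon.
    """
    seq = mrna.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens an ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in the same frame
    # Longest first, mirroring the "longest ORF" convention for reference ORFs
    return sorted(orfs, key=lambda orf: orf[1] - orf[0], reverse=True)

# Toy transcript carrying one ORF in frame 0 and a shorter altORF in frame 2
for start, end, frame in find_orfs("ATGGCATGCCCTAAATAG", min_codons=2):
    print(start, end, frame)
```

Under the longest-ORF convention only the first entry would be kept as the reference; every other entry is a candidate altORF whose translation would have to be confirmed experimentally.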

Evidence for translation of human alternatively spliced transcripts

Alternative splicing (AS) is the mechanism by which the primary transcript (primary RNA) of a gene is processed in different ways, resulting in distinct mature transcripts or messenger RNAs (mRNAs) [7, 15, 16]. It is now known that practically all multi-exon human genes produce at least two differently spliced mRNA transcripts [17]. The most abundant type of AS event in mammals is exon skipping [7, 17], in which an exon is either entirely included in or entirely absent from the mature transcript (Fig. 1). Despite the widespread occurrence of AS in different physiological and pathological conditions [18,19,20,21], the functional relevance of this mechanism is still under debate [22, 23]. One of the most controversial issues concerns the fraction of the non-reference transcriptome that is actually translated into proteins. Since its discovery in the 1970s, alternative splicing has been interpreted as a cellular mechanism that could increase the diversity of proteins [24]. However, only recently has this prediction been tested, mainly through mass-spectrometry (MS) experiments. The task of finding alternative proteins (APs) is complicated by limitations of the MS method, such as the difficulty of detecting proteins expressed at low levels. Moreover, a mass spectrum can be viewed as a “signature” of mass values that allows the identification of a peptide, which is just a fragment of the whole protein. Additionally, mass spectra are usually compared against peptides from a database of reference proteins. These reference proteins are selected by several criteria that, in many cases, retain only proteins longer than 150 amino acids [1]. Thus, the choice of protein sequence database is central to the success of MS-based protein identification. All these aspects should be taken into account when interpreting the relatively low number of experimentally validated alternative proteins reported so far.
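The dependence of identification on the chosen database can be illustrated with a deliberately simplified sketch: an observed peptide mass is matched against the theoretical masses of database peptides within a tolerance. Real search engines score fragment-ion spectra rather than a single precursor mass, so this is only a toy model. The function names, the 10 ppm tolerance and the example peptides are assumptions made for illustration; the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in daltons
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-termini

def peptide_mass(peptide):
    """Theoretical monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def match_masses(observed_masses, database_peptides, ppm=10.0):
    """Map each observed mass to the database peptides within the ppm tolerance."""
    matches = {}
    for observed in observed_masses:
        tolerance = observed * ppm / 1e6
        matches[observed] = [
            pep for pep in database_peptides
            if abs(peptide_mass(pep) - observed) <= tolerance
        ]
    return matches

# Hypothetical peptides: the observed mass belongs to a peptide that is
# absent from the reference database, so the reference-only search finds nothing.
observed = peptide_mass("MAEVLSDK")
reference_db = ["MAEVFGR", "LSDK"]
print(match_masses([observed], reference_db))
print(match_masses([observed], reference_db + ["MAEVLSDK"]))
```

The first search returns an empty match list and the second recovers the peptide, which is the sense in which a peptide missing from the database cannot be identified no matter how good the spectrum is.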

Fig. 1
Diagram showing the computational strategy applied to search for an exon-skipping alternative splicing event in mass spectrometry data

Exploration of human mass spectrometry data has detected tens to hundreds of alternative proteins in normal tissues and thousands in cancer cell lines. One of the first attempts to investigate the human proteome in this way was described in [25]. These authors collected human expressed sequence tag (EST) data and proposed a computational method to reconstruct the mRNAs. The mRNAs were then translated and searched against mass-spectrometry data. In total, the authors detected 20 instances of alternative proteins. In another study, Ezkurdia et al. [11] searched public mass-spectrometry data against protein reference sequences from public databases. They detected splicing isoforms for 150 human genes, which in most cases differed only slightly from the reference proteins. However, for three genes (CUX1, NEBL, and MACF1), the APs differed from the full reference sequence over a considerable part of their length. Moreover, the protein regions modified by alternative splicing contain functional domains, suggesting that important cellular functions may be affected.

In 2013, Sheynkman et al. [12] performed RNA-seq and mass-spectrometry analyses on the same cell population (Jurkat cells). Using a customized protein database, these authors discovered 57 peptides that align to exon-exon junctions created by alternative splicing events (Fig. 1). Of these, 12 were exon-skipping events. In a later study, Ramalho et al. [13] also searched public mass-spectrometry data for evidence of alternative proteins. In this work, hundreds of exon-skipping events previously found by RNA-seq were used to build a customized mRNA sequence database. These mRNAs were then translated into amino acid sequences and searched against millions of mass spectra from public repositories. The authors detected signatures of exon-skipping events in protein sequences from 14 human genes. Interestingly, the majority of these exon-skipping events were present in different vertebrate species, suggesting that they are not just splicing errors but may play a role in some cellular function. Moreover, among the 14 detected cases, four resulted in protein sequences that were much shorter than the reference sequences (truncated forms) and that differed in amino acid composition at one terminal end.
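The general strategy described above (build the exon-skipping mRNA, translate it, and look for peptides that cover the new exon-exon junction) can be sketched as follows. This is a simplified illustration, not the pipeline of the cited studies. It assumes that the exon sequences are coding, start at the start codon in frame 0 and contain only A, C, G and T, that the skipped exon is an internal one, and that digestion follows the usual trypsin rule (cleavage after K or R, but not before P). All function names and the toy exons are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(cds):
    """Translate from position 0 up to the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def tryptic_peptides(protein):
    """Return (start, end, peptide) after in-silico trypsin digestion."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append((start, i + 1, protein[start:i + 1]))
            start = i + 1
    return peptides

def junction_peptides(exons, skipped):
    """Peptides of the exon-skipping isoform that span the new junction
    and are absent from the digest of the full-inclusion reference protein."""
    reference = translate("".join(exons))
    isoform = translate("".join(exons[:skipped] + exons[skipped + 1:]))
    junction_nt = sum(len(exon) for exon in exons[:skipped])
    last_upstream_aa = (junction_nt - 1) // 3   # residue holding the last upstream base
    first_downstream_aa = junction_nt // 3      # residue holding the first downstream base
    reference_digest = {pep for _, _, pep in tryptic_peptides(reference)}
    return [
        pep for start, end, pep in tryptic_peptides(isoform)
        if start <= last_upstream_aa and first_downstream_aa < end
        and pep not in reference_digest
    ]

# Toy gene with three coding exons; the middle exon is skipped in frame
exons = ["ATGGCTGAAGTT", "TTTGGTCGT", "CTGTCTGATAAATAA"]
print(junction_peptides(exons, skipped=1))
```

Only peptides returned by `junction_peptides` are diagnostic for the skipping event; peptides shared with the reference digest cannot distinguish the two isoforms, which is one reason why relatively few events are confirmed at the protein level.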

Several of the exon-skipping events detected at the protein level by Ramalho et al. [13] had previously been detected at the transcript level by distinct methods and authors. Eight of these events (in the genes IMMT [26], COL6A3 [20], FN1 [21], TSC2 [27], CLIP170 [28], THYN1 [29], Junctin [30], and Ktn1 [31]) had been detected not only by recent high-throughput RNA sequencing but also by exon arrays, quantitative real-time RT-PCR, and Southern and Northern blot assays. Some of these exon-skipping events were described as tissue-specific and/or associated with tumor tissues.

Alternative proteins in breast cancer

Recently, Lawrence et al. [14] used mass spectrometry to conduct a proteomic characterization of 20 breast cancer cell lines and 4 triple-negative breast cancer (TNBC) tumor samples, with the aim of identifying molecular diagnostic markers to improve drug selection. Among these cell lines, 16 were of the TNBC subtype, which is characterized by the absence of three cellular receptors: estrogen receptor (ESR1), progesterone receptor (PGR), and human epidermal growth factor receptor 2 (ERBB2). TNBC is of great interest in the breast cancer field because it tends to be a more aggressive tumor subtype, is correlated with a worse prognosis than hormone receptor-positive subtypes, and is disproportionately diagnosed in women with pathogenic mutations in the BRCA1 gene [32, 33]. Although a poly(ADP-ribose) polymerase (PARP) inhibitor, Olaparib, has been approved for the treatment of advanced ovarian cancer with BRCA1 or BRCA2 mutations, only subsets of TNBC show sensitivity to this targeted therapy [34, 35] and, consequently, its efficacy in TNBC treatment remains uncertain.

Lawrence et al. identified 12,775 distinct proteins encoded by 11,466 genes (protein false discovery rate [FDR] < 1%). Hierarchical clustering of protein expression measures grouped the cell lines into four clusters corresponding to the major molecular subtypes previously defined by mRNA expression arrays and morphological studies.

Gene ontology analysis of each cluster revealed that luminal-like cells expressed higher levels of proteins associated with proliferation, such as cell cycle, growth factor signaling, metabolism, and DNA damage repair mechanisms. TNBC cell types, particularly the tumor samples and the more invasive cell lines, showed overexpression of proteins associated with metastasis, such as ECM-receptor interaction, cell adhesion, and angiogenesis. Besides EGFR, ERBB2, ESR1 and PGR, which are already routine clinical targets in breast cancer, the authors found ephrin type A receptors to be highly overexpressed in many TNBC cell lines compared with luminal-like cells.

Surprisingly, 1,860 protein variants resulting from alternative splicing were found in the proteome of these cells. Furthermore, this relatively high number of alternative proteins is probably an underestimate, because only isoforms already present in the reference database UniProt were considered in the search against mass spectra. Regarding proteins involved in cancer, the authors found a truncated variant (with a premature stop codon) of the p65 subunit of the NF-κB transcription factor. This variant lacks regulatory regions that directly affect its transcriptional activity and was detected as highly expressed in two cell lines and in all tumor samples. Additionally, two alternative splicing variants of the CD47 protein were detected in two cell lines (DU4475 and MCF7). CD47 is a membrane receptor with five membrane-spanning domains that participates in integrin signaling and is a tumor antigen. The two protein variants differ in the cytoplasmic tail, probably resulting in distinct intracellular signaling. So far, little is known about the functional differences between these variants. Lastly, a truncated form of the focal adhesion kinase PTK2 (as well as the reference form) was detected in most of the cell lines analyzed. The truncated form lacks the FERM (4.1-Ezrin-Radixin-Moesin) domain that regulates PTK2 localization and interaction with other proteins.

Ribosomal profiling brings further evidence for non-canonical proteins

There is a much larger body of genome-wide evidence for the translation of non-canonical transcripts when results from the ribosome profiling method are considered [36,37,38]. This technique is used to quantify the translation state of specific mRNAs, and the idea behind it is to capture mRNA molecules in the translation process by freezing actively translating ribosomes on different transcripts, and then separating the resulting polyribosomes by ultracentrifugation on a sucrose gradient. This process allows for the identification of highly translated (bound by several ribosomes), poorly translated (bound by one or two ribosomes) and non-translated transcripts [39].
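The three translation states described above can be written as a simple rule on the number of ribosomes bound to a transcript. The sketch below is only a restatement of that rule. Taking "several" to mean three or more follows from "one or two" being the poorly translated class, but the exact cut-off, the function names and the transcript labels are assumptions made for illustration.

```python
def translation_state(ribosomes_bound, several=3):
    """Classify one transcript by the number of ribosomes bound to it.

    several is a hypothetical cut-off for "highly translated".
    """
    if ribosomes_bound < 0:
        raise ValueError("ribosome count cannot be negative")
    if ribosomes_bound == 0:
        return "non-translated"
    if ribosomes_bound < several:
        return "poorly translated"
    return "highly translated"

def classify_transcripts(ribosome_counts, several=3):
    """Apply the rule to a mapping of transcript label to ribosome count."""
    return {
        transcript: translation_state(count, several)
        for transcript, count in ribosome_counts.items()
    }

# Hypothetical transcripts and counts
print(classify_transcripts({"tx_A": 0, "tx_B": 2, "tx_C": 7}))
```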

Ribosomal profiling applied to mouse embryonic fibroblast cells and human HEK293 cells revealed that the majority of mRNAs contain more than one translation initiation site (TIS) and >50% of detected TISs map to alternative ORFs [36, 37].

Recently, a machine learning approach (RibORF) was used to interpret ribosomal profiling data collected from fibroblast and breast epithelial cell lines [38]. This study showed that ~40% of so-called long non-coding RNAs (lncRNAs) and pseudogene RNAs are translated in vivo, and hence are not truly non-coding RNAs. Interestingly, the translated lncRNA and pseudogene peptides have median lengths of 69 and 92 amino acids, respectively, which is shorter than most of the protein sequences available in the main reference databases.

Unlike canonical mRNAs, in which the longest candidate ORFs are virtually always translated, lncRNAs have their longest ORFs translated in only 56% of cases. Moreover, most lncRNA peptides (92%) do not contain protein domains annotated in the Pfam database.

Conversely, among transcribed pseudogenes (~3% of all annotated human pseudogenes), 19% are translated into peptides longer than 100 amino acids and, of these, 80% contain at least one protein domain.

Alternative proteins as therapeutic targets

The relevance of alternative splicing in cancer has been extensively discussed in the literature, and different aspects have been highlighted, for example the differential expression of alternatively spliced transcripts in cancer or the impact of somatic mutations in splice sites, in splicing regulatory motifs or in the core and auxiliary factors of the spliceosome (for recent reviews see [16, 40]). Since 1996, certain bacterial compounds (FR901464, herboxidienes and pladienolides), extracted from the genera Pseudomonas and Streptomyces, have been known as cytotoxins that arrest the cell cycle in the G1 and G2/M phases [41,42,43,44]. Despite their promising anticancer properties, these compounds were chemically unstable and thus unsuitable for therapy. In 2007, analogs of these compounds with improved stability were developed, most notably E7107 [45] (an analog of pladienolide B), spliceostatin A [46] (SSA; an analog of FR901464) and the sudemycins [47].

Further studies demonstrated that SSA and E7107 impair pre-mRNA splicing in a dose- and time-dependent manner by binding to splicing factor 3b (SF3B), a protein complex that is a component of the U2 snRNP. U2 is an essential splicing factor, directly involved in splice site recognition [45, 46]. Importantly, these studies found that although most unspliced pre-mRNAs are retained in the cell nucleus, a minor fraction reaches the cytoplasm and can therefore be translated into new proteins. Moreover, treatment with these compounds resulted in the production of a truncated but functional form of the cell cycle inhibitor p27 (encoded by CDKN1B) that is more stable than the normal protein because it lacks a C-terminal domain necessary for its normal degradation.

Further evidence for the central role of the SF3B splicing factor in cell cycle arrest came from colorectal cancer cell lines that acquired resistance to pladienolide B. RNA-seq of both resistant and parental cells showed that the resistant cells had acquired a point mutation in SF3B1 (splicing factor 3b subunit 1), R1074H, that reduces its binding affinity for the compounds [48].

Another example of drug resistance induced by alternative proteins comes from human melanomas. Vemurafenib is a potent RAF kinase inhibitor with remarkable clinical activity in some melanoma tumors that harbor a common BRAF mutation (V600E), which constitutively activates downstream MAPK signaling. However, a splice variant of BRAF that lacks the RAS-binding domain (RBD) confers resistance to this drug [49]. Additionally, in a human melanoma cell line, an intronic mutation in BRAF was observed to cause resistance to Vemurafenib. This mutation is associated with in-frame skipping of exons 3–5, which encode the RBD. Remarkably, SSA restores the inclusion of these exons and can reverse Vemurafenib resistance both in vitro and in vivo [50].

Conclusions

The assumption of a single coding sequence (CDS) per gene is being challenged by increasing evidence that, in mammals, two or more proteins can be translated from the same mRNA and that the resulting proteins can interact, affecting the known gene function. Additionally, recent findings in human proteome research, obtained mainly through mass spectrometry and ribosomal profiling, have provided evidence for the translation of non-annotated CDSs, some of them produced by alternative splicing events. The growing evidence for the presence of alternative proteins (APs) in distinct human tumor tissues and cell lines raises an important issue for the genomic sciences and oncology that should be addressed in the near future. Research on breast tumor tissues and derived cell lines has shown, for example, that these samples exhibit a frequency of alternative proteins at least 10-fold higher than that of their normal counterparts. The relevance of alternative proteins as therapeutic targets in cancer is exemplified by the alternative p27 protein, a cell cycle inhibitor that is more stable than its normal counterpart, as well as by an alternative B-raf protein that lacks the RAS-binding domain (RBD) and confers resistance to Vemurafenib in melanoma. These findings suggest that the alternative proteome in cancer is a promising research field with the potential to reveal new cancer biomarkers, molecular pathways and therapeutic targets. Moreover, genes currently annotated as non-coding RNA genes may need to be re-annotated if there is evidence that some of them are in fact translated into new proteins.

Importantly, the translation of alternative ORFs, alternatively spliced transcripts and long non-coding RNAs does not per se indicate functional relevance, and determining their function in different physiological and pathological conditions is fundamental. Thus, the central questions to be addressed in this new field of investigation are whether the alternative proteins are sufficiently stable, whether they can be detected by other methods and, finally, whether they play a cellular role. The investigation of alternative proteins may reveal new perspectives in cancer research, and evolutionarily conserved altORFs and alternatively spliced transcripts might be good candidates to start with.