sORFPred: A Method Based on Comprehensive Features and Ensemble Learning to Predict the sORFs in Plant LncRNAs

  • Original research article
Interdisciplinary Sciences: Computational Life Sciences

Abstract

Long non-coding RNAs (lncRNAs) are important regulators of biological processes. It has recently been shown that some lncRNAs contain small open reading frames (sORFs) that can encode small peptides of no more than 100 amino acids. However, existing prediction methods are mostly applied to human and animal datasets and still suffer from weak feature representation. Accurate and credible prediction of sORFs with coding ability in plant lncRNAs is therefore needed. This paper proposes a new method, sORFPred, in which we design a model named MCSEN that combines multi-scale convolution with Squeeze-and-Excitation Networks to fully mine the distinct information embedded in sORFs, integrate and optimize multiple sequence-based and physicochemical feature descriptors, and build a two-layer prediction classifier based on a Bayesian optimization algorithm and Extra Trees. sORFPred was evaluated on sORF datasets from three species and on an experimentally validated sORF dataset. The results indicate that sORFPred outperforms existing methods, achieving 97.28% accuracy, 97.06% precision, 97.52% recall, and a 97.29% F1-score on Arabidopsis thaliana, a significant improvement in prediction performance over various conventional shallow machine learning and deep learning models.
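To make the idea of combining multi-scale convolution with Squeeze-and-Excitation (SE) channel reweighting more concrete, the following is a minimal NumPy sketch. It is not the authors' MCSEN architecture: the one-hot encoding, the kernel sizes (3, 5, 7), the channel counts, the reduction to 6 hidden units, the example sequence, and the random untrained weights are all assumptions chosen for illustration only.

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix (assumed encoding)."""
    alphabet = "ACGT"
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        out[i, alphabet.index(base)] = 1.0
    return out

def conv1d_same(x, kernels):
    """1D convolution with zero padding and ReLU.

    x: (length, in_ch); kernels: (k, in_ch, out_ch) with odd k.
    Returns an array of shape (length, out_ch).
    """
    k = kernels.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([
        np.tensordot(xp[i:i + k], kernels, axes=([0, 1], [0, 1]))
        for i in range(x.shape[0])
    ])
    return np.maximum(out, 0.0)

def multi_scale_conv(x, kernel_bank):
    """Run convolutions of several widths and concatenate their channels."""
    return np.concatenate([conv1d_same(x, w) for w in kernel_bank], axis=1)

def squeeze_excite(features, w1, w2):
    """SE block: squeeze (global average), excite (two dense layers), rescale."""
    squeezed = features.mean(axis=0)               # one value per channel
    hidden = np.maximum(squeezed @ w1, 0.0)        # bottleneck with ReLU
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # sigmoid gate per channel
    return features * gates, gates

rng = np.random.default_rng(0)
x = one_hot("ATGGCTTCAGGATAA")  # hypothetical toy sORF: ATG ... TAA
# Three kernel widths, 8 output channels each (illustrative, untrained weights).
kernel_bank = [rng.normal(scale=0.5, size=(k, 4, 8)) for k in (3, 5, 7)]
features = multi_scale_conv(x, kernel_bank)        # shape (15, 24)
w1 = rng.normal(scale=0.5, size=(24, 6))
w2 = rng.normal(scale=0.5, size=(6, 24))
recalibrated, gates = squeeze_excite(features, w1, w2)
print(features.shape, gates.shape)
```

In a trained model the gates would learn to emphasize the channels (and therefore the kernel scales) that are most informative; here they only demonstrate the data flow.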

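As a quick consistency check on the reported Arabidopsis thaliana figures, the F1-score can be recomputed from the stated precision and recall using the standard harmonic-mean definition. The function name below is illustrative, and because the published percentages are already rounded, agreement is only expected to two decimal places.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# Values as reported in the abstract, in percent.
precision = 97.06
recall = 97.52

f1 = f1_score(precision, recall)
print(round(f1, 2))  # 97.29, matching the reported F1-score
```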
Data availability

Datasets and associated source codes of sORFPred are freely available for download at https://github.com/orangewindczw/sORFPred.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 32072592 and 32230091).

Author information

Corresponding author

Correspondence to Jun Meng.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflicts of interest.

Ethical Approval

Not applicable.

Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 120 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, Z., Meng, J., Zhao, S. et al. sORFPred: A Method Based on Comprehensive Features and Ensemble Learning to Predict the sORFs in Plant LncRNAs. Interdiscip Sci Comput Life Sci 15, 189–201 (2023). https://doi.org/10.1007/s12539-023-00552-4

  • DOI: https://doi.org/10.1007/s12539-023-00552-4
