Abstract
Distinguishing long non-coding RNAs (lncRNAs) from protein-coding transcripts (PCTs) has previously been addressed with machine learning (ML) algorithms that use hundreds of features. However, a large number of features can degrade the predictive performance of these algorithms, because it can lead to problems such as overfitting, a consequence of the phenomenon known as the curse of dimensionality. To deal with these problems, dimensionality reduction techniques have been proposed, among them feature selection. This work proposes and experimentally evaluates a simple and fast feature selection technique called Single Score Feature Selection (\(S^2FS\)).
First, the frequencies of 2-mers, 3-mers and 4-mers were extracted from PCT and lncRNA sequences of Homo sapiens obtained from public databases, resulting in a dataset with two groups of RNA sequences, one of PCTs and one of lncRNAs, and a large number of features. \(S^2FS\) was then applied to this dataset to reduce the number of features. Experimental results showed that relevant features were selected and predictive accuracy was maintained, at a lower processing cost than some existing feature selection techniques.
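As an illustration of the feature extraction step, the following minimal sketch computes normalized 2-mer, 3-mer and 4-mer frequencies for a single sequence. The function name `kmer_frequencies`, the ACGT alphabet, the mapping of U to T, the skipping of windows with ambiguous bases, and the per-k normalization are all assumptions made for this example; the abstract does not specify how the authors computed the frequencies.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet; RNA 'U' is mapped to 'T' below


def kmer_frequencies(sequence, ks=(2, 3, 4)):
    """Return a dict mapping every possible k-mer to its relative frequency.

    Hypothetical helper: frequencies are normalized separately for each k,
    and windows containing characters outside ALPHABET are skipped.
    """
    seq = sequence.upper().replace("U", "T")
    features = {}
    for k in ks:
        # Start every possible k-mer at zero so all sequences share one feature space.
        counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
        total = 0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
                total += 1
        for kmer, count in counts.items():
            features[kmer] = count / total if total else 0.0
    return features


features = kmer_frequencies("AUGGCGAUGCCA")
print(len(features))  # 16 + 64 + 256 = 336 features per sequence
```

Under these assumptions each transcript yields 336 features, which gives a sense of the dimensionality that a feature selection step would then reduce.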
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Kümmel, B.C., de Carvalho, A.C.P.L.F., Brigido, M.M., Ralha, C.G., Walter, M.E.M.T. (2018). \(S^2FS\): Single Score Feature Selection Applied to the Problem of Distinguishing Long Non-coding RNAs from Protein Coding Transcripts. In: Alves, R. (eds) Advances in Bioinformatics and Computational Biology. BSB 2018. Lecture Notes in Computer Science, vol 11228. Springer, Cham. https://doi.org/10.1007/978-3-030-01722-4_10
DOI: https://doi.org/10.1007/978-3-030-01722-4_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01721-7
Online ISBN: 978-3-030-01722-4
eBook Packages: Computer Science, Computer Science (R0)