Abstract
Over the past decades, the volume of data in all fields of the life sciences has grown rapidly. Despite this growth, there is a notable general tendency to retain well-established techniques tailored to specific biological requirements, together with common taxonomies for data classification. A change in perspective towards advanced technological concepts for persisting, organizing and analyzing these large amounts of data is therefore essential. The Intelligent Cluster Index (ICIx) is a modern technology that indexes multidimensional data by semantic criteria and is thus suited to this challenge. In this paper, methodical approaches for indexing biological sequences with the ICIx are discussed and evaluated. This includes an examination of established methods for transforming sequences into vectors and an outline of the efficiency of different distance measures applied to these vectors. Our results indicate that position-conserving methods are superior to other approaches and that the applied distance measure heavily influences performance and quality.
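To make the idea of a vector transformation followed by a distance measure concrete, the following is a minimal sketch and not the paper's method. It assumes a plain n-gram frequency descriptor over a DNA alphabet and three common distance measures; the function names (`ngram_vector`, `euclidean`, `manhattan`, `cosine_distance`) and the choice of `n=2` are illustrative assumptions, since the abstract does not name the descriptor algorithms or distance measures that were evaluated.

```python
from collections import Counter
from itertools import product
import math


def ngram_vector(sequence, n=2, alphabet="ACGT"):
    """Map a sequence to a fixed-length vector of relative n-gram frequencies.

    Hypothetical descriptor for illustration only; n-grams outside the
    given alphabet are ignored.
    """
    # One vector component per possible n-gram, in a fixed order.
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts[k] for k in keys)
    return [counts[k] / total if total else 0.0 for k in keys]


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """One minus cosine similarity; defined as 1.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


if __name__ == "__main__":
    a = ngram_vector("ACGTACGT")
    b = ngram_vector("ACGTTGCA")
    for name, dist in (("euclidean", euclidean),
                       ("manhattan", manhattan),
                       ("cosine", cosine_distance)):
        print(name, dist(a, b))
```

Plain n-gram counting discards where in the sequence each n-gram occurs, so it is not a position-conserving descriptor in the sense of the abstract; it serves here only to show how the choice of distance measure changes the value computed for the same pair of vectors.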
Notes
1. Other commonly used terms for n-grams are k-, t- or n-tuples and k-, t- or n-mers.
Acknowledgement
The study has been supported by the Free State of Saxony, the University of Applied Sciences Mittweida and Chemnitz University of Technology.
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Schildbach, S., Heinke, F., Benn, W., Labudde, D. (2016). Evaluation of Descriptor Algorithms of Biological Sequences and Distance Measures for the Intelligent Cluster Index (ICIx). In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery. BDAS 2015, BDAS 2016. Communications in Computer and Information Science, vol 613. Springer, Cham. https://doi.org/10.1007/978-3-319-34099-9_33
DOI: https://doi.org/10.1007/978-3-319-34099-9_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-34098-2
Online ISBN: 978-3-319-34099-9
eBook Packages: Computer Science, Computer Science (R0)