Abstract
Biotechnology has advanced considerably over the past century. Researchers can now measure expression levels for thousands of genes under different conditions and over varying periods of time. Analyzing these measurements is essential for understanding gene patterns and for extracting information about gene functions and biological roles. This paper describes a novel approach for clustering large-scale next-generation sequencing (NGS) data. The approach also facilitates predicting patterns and the likelihood of mutations by means of a semi-supervised clustering technique. It builds on the previously developed FuzzyFind Dictionary construction, which uses the Golay code for error correction. The method has linear time complexity and requires only a single pass through the file.
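The idea of using Golay error correction as a one-pass grouping key can be illustrated with a minimal sketch. This is not the authors' FuzzyFind Dictionary or their clustering algorithm; it only shows how decoding a 23-bit vector to its nearest codeword of the binary Golay code (23, 12, 7) yields a hash key, so that vectors within Hamming distance 3 of the same codeword fall into one bucket. The function names, and the assumption that each sequence has already been reduced to a 23-bit feature vector, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

N = 23        # length of the binary Golay code (23, 12, 7)
GEN = 0xC75   # generator polynomial x^11 + x^10 + x^6 + x^5 + x^4 + x^2 + 1


def syndrome(word: int) -> int:
    """Remainder of a 23-bit word divided by GEN over GF(2)."""
    for shift in range(N - 1, 10, -1):      # clear bits 22 down to 11
        if word & (1 << shift):
            word ^= GEN << (shift - 11)
    return word                             # 11-bit remainder


def build_error_table() -> dict:
    """Map each syndrome to its unique error pattern of weight <= 3."""
    table = {}
    for weight in range(4):
        for positions in combinations(range(N), weight):
            error = sum(1 << p for p in positions)
            table[syndrome(error)] = error
    return table


ERRORS = build_error_table()


def nearest_codeword(word: int) -> int:
    """Decode a 23-bit vector to the Golay codeword within distance 3."""
    return word ^ ERRORS[syndrome(word)]


def cluster_one_pass(vectors):
    """Group vector indices by decoded codeword in a single pass."""
    clusters = defaultdict(list)
    for index, vector in enumerate(vectors):
        clusters[nearest_codeword(vector)].append(index)
    return clusters


# Hypothetical 23-bit feature vectors: a codeword, a 3-bit perturbation
# of it, and the all-zero vector (itself a codeword).
codeword = GEN << 3
noisy = codeword ^ (1 << 0) ^ (1 << 10) ^ (1 << 22)
groups = cluster_one_pass([codeword, noisy, 0])
```

Each vector costs one table lookup, so the pass over the input is linear in the number of vectors. A single codeword key does not bring together neighbors that lie on opposite sides of a decoding-sphere boundary; the sketch makes no attempt to handle that case.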
Authors’ profile
Faisal Alsaby received his BSc degree in Computer Science and Information Systems from King Saud University, Saudi Arabia, in 2005. He received an MS degree in Computer Science from the George Washington University (GWU), USA, in 2012. He is currently a Ph.D. candidate in Computer Science at GWU. His research interests are big data clustering algorithms, machine learning, and pattern recognition.
Kholood Alnowaiser received her BSc degree in Computer Science from Dammam University. She received an MS degree in Computer Science from the George Washington University, USA, in 2015. She is currently a lecturer at Dammam University.
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kholood, F., Alnowaiser, A. An Efficient DNA Molecule Clustering using GCC Algorithm. GSTF J Comput 4, 11 (2015). https://doi.org/10.7603/s40601-014-0011-y
DOI: https://doi.org/10.7603/s40601-014-0011-y