Optimal Samples Selection from Gene Expression Microarray Data Using Relational Algebra and Clustering Technique

  • Soumen Kr. Pati
  • Asit Kr. Das
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 132)


Real-world data from the natural and social sciences are often very high-dimensional. Handling datasets in high-dimensional spaces presents complicated problems, such as degraded performance in data access, data manipulation, and query processing. Dimensionality reduction tackles this problem efficiently and helps to visualize the intrinsic properties hidden in the dataset. The proposed method first generates a decision attribute by computing the class label of each gene using a clustering technique. It then computes a score for each sample of the cancerous-gene microarray data, based on the decision attribute, using the division operation of relational algebra, and selects the samples whose score is below the average score as the initial reduct. The reduced dataset is grouped into k clusters by the k-means algorithm, where k is the number of distinct values of the decision attribute, and the matching factor of the reduct is computed from the overlap between the clusters and the original classes of the genes. The remaining samples are then added iteratively, one at a time in order of increasing score, provided the computed matching factor improves; the final reduct obtained in this way is the optimal set of samples.
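The selection loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the relational-algebra division produces a score, so per-sample scores are taken as a given input; the matching factor is approximated here by cluster purity (the fraction of genes that fall in the majority class of their cluster); and the k-means routine uses a simple deterministic farthest-point seeding. The names `select_samples`, `matching_factor`, and `kmeans` are hypothetical.

```python
from collections import Counter

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means; farthest-point seeding is an assumption."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        if new == labels:  # assignments stopped changing
            break
        labels = new
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def matching_factor(labels, classes):
    """Assumed stand-in for the matching factor: cluster purity."""
    hits = 0
    for c in set(labels):
        in_cluster = Counter(cls for l, cls in zip(labels, classes) if l == c)
        hits += in_cluster.most_common(1)[0][1]
    return hits / len(classes)

def select_samples(data, classes, scores):
    """data: rows are genes, columns are samples; classes: decision
    attribute per gene; scores: one precomputed score per sample."""
    k = len(set(classes))

    def evaluate(cols):
        if not cols:
            return 0.0
        points = [[row[c] for c in cols] for row in data]
        return matching_factor(kmeans(points, k), classes)

    mean = sum(scores) / len(scores)
    order = sorted(range(len(scores)), key=lambda s: scores[s])
    reduct = [s for s in order if scores[s] < mean]  # initial reduct
    best = evaluate(reduct)
    for s in order:  # remaining samples in increasing score order
        if s in reduct:
            continue
        trial = evaluate(reduct + [s])
        if trial > best:  # keep the sample only if matching improves
            reduct.append(s)
            best = trial
    return reduct, best

# Toy data (made up for illustration): 6 genes, 3 samples.
data = [[0.0, 1.0, 0.0], [0.1, 2.0, 0.0], [5.0, 1.0, 0.0],
        [5.1, 2.0, 10.0], [0.2, 1.0, 10.0], [5.2, 2.0, 10.0]]
classes = [0, 0, 0, 1, 1, 1]
scores = [0.2, 0.9, 0.8]
reduct, best = select_samples(data, classes, scores)
print(reduct, best)
```

In this toy run only the first sample scores below the average, so it forms the initial reduct; the third sample is then added because it raises the purity, while the second is rejected because it cannot improve on it.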


Keywords: Root Mean Square Error; Singular Value Decomposition; Relational Algebra; Decision Attribute; Gene Expression Dataset
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Soumen Kr. Pati (1)
  • Asit Kr. Das (2)
  1. Department of Information Technology, St. Thomas’ College of Engineering and Technology, Kolkata, India
  2. Department of Computer Science and Technology, Bengal Engineering and Science University, Howrah, India
