Graph clustering-based discretization approach to microarray data

  • Regular Paper
Knowledge and Information Systems

Abstract

Several data mining techniques require discrete data, and learning over discrete domains often performs better than learning over continuous data. Multivariate discretization transforms continuous data into discrete data while taking correlations among attributes into account. Given the benefit of this idea, many multivariate discretization algorithms have been proposed. However, few discretization algorithms apply directly to microarray (gene expression) data, which is high-dimensional and imbalanced, and, despite the interest of the problem, no multivariate method has been put forward for microarray data analysis. According to recently published research, the graph clustering-based discretization of splitting and merging methods (GraphS and GraphM) usually achieves superior results compared with many well-known discretization algorithms. In this paper, GraphS and GraphM are extended by adding an alpha parameter, which is the ratio between the similarity of gene expressions (distance) and the similarity of the class labels. Moreover, the extensions consider three similarity measures (cosine similarity, Euclidean distance, and Pearson correlation) in order to determine the proper pairwise similarity measure. An evaluation on 20 real microarray datasets with 4 classifiers suggests that the two proposed methods based on cosine similarity, GraphM(C) and GraphS(C), outperform 9 state-of-the-art discretization algorithms on three classification performance measures (ACC, AUC, Kappa) and on running time.
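
As a rough illustration of the idea described in the abstract, the following Python sketch computes the three pairwise measures and blends one of them with a class-label similarity through alpha. The abstract does not state the formula, so the convex combination `alpha * expression + (1 - alpha) * label`, the 0/1 label similarity, the `1 / (1 + d)` conversion of Euclidean distance into a similarity, and all function names are assumptions made for illustration, not the authors' definitions.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two expression vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_x * norm_y)


def euclidean_similarity(x, y):
    """Euclidean distance mapped into (0, 1]; the mapping is an assumption."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + dist)


def pearson_similarity(x, y):
    """Pearson correlation, computed as the cosine of the mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centred_x = [a - mean_x for a in x]
    centred_y = [b - mean_y for b in y]
    return cosine_similarity(centred_x, centred_y)


def label_similarity(label_x, label_y):
    """Hypothetical class-label similarity: 1 if the labels match, else 0."""
    return 1.0 if label_x == label_y else 0.0


def combined_similarity(x, y, label_x, label_y, alpha=0.5,
                        measure=cosine_similarity):
    """Blend expression and label similarity (assumed convex combination)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    expression_part = measure(x, y)
    label_part = label_similarity(label_x, label_y)
    return alpha * expression_part + (1.0 - alpha) * label_part


if __name__ == "__main__":
    sample_a, sample_b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.5]
    for measure in (cosine_similarity, euclidean_similarity, pearson_similarity):
        score = combined_similarity(sample_a, sample_b, "tumor", "normal",
                                    alpha=0.7, measure=measure)
        print(measure.__name__, round(score, 4))
```

Under these assumptions, alpha = 1 uses only the gene expression similarity and alpha = 0 uses only label agreement; intermediate values trade one against the other.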




Notes

  1. http://leo.ugr.es/elvira/DBCRepository.

  2. https://www.rdocumentation.org/packages/datamicroarray/versions/0.2.3.

  3. https://kittakorn.crru.ac.th/kais/2018.

  4. www.uco.es/grupos/kdis/wiki/ur-CAIM.

  5. https://kittakorn.crru.ac.th/kais/2018.

References

  1. Aha DW, Kibler D, Albert MK (1991) Instance-based learning algorithms. Mach Learn 6(1):37–66

  2. Alcalá-Fdez J, Sánchez L, García S, del Jesus M, Ventura S, Garrell J, Otero J, Romero C, Bacardit J, Rivas V, Fernández J, Herrera F (2009) KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 13(3):307–318

  3. Alcalá J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2010) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Mult Valued Logic Soft Comput 17:255–287

  4. Baralis E, Bruno G, Fiori A (2011) Measuring gene similarity by means of the classification distance. Knowl Inf Syst 29(1):81–101

  5. Bay SD (2001) Multivariate discretization for set mining. Knowl Inf Syst 3(4):491–512

  6. Ben-David A (2008a) About the relationship between ROC curves and Cohen’s kappa. Eng Appl Artif Intell 21(6):874–882

  7. Ben-David A (2008b) Comparison of classification accuracy using Cohen’s weighted kappa. Expert Syst Appl 34(2):825–832

  8. Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A (2010) On the effectiveness of discretization on gene selection of microarray data. In: The 2010 international joint conference on Neural networks (IJCNN). IEEE, pp 1–8

  9. Boullé M (2006) MODL: a Bayes optimal discretization method for continuous attributes. Mach Learn 65(1):131–165

  10. Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159

  11. Brandes U, Gaertler M, Wagner D (2003) Experiments on graph clustering algorithms. Springer, Berlin

  12. Cai R, Hao Z, Wen W, Wang L (2013) Regularized Gaussian mixture model based discretization for gene expression data association mining. Appl Intell 39(3):607–613

  13. Cai R, Tung AK, Zhang Z, Hao Z (2011) What is unequal among the equals? ranking equivalent rules from gene expression data. IEEE Trans Knowl Data Eng 23(11):1735–1747

  14. Cano A, Nguyen DT, Ventura S, Cios KJ (2016) ur-CAIM: improved CAIM discretization for unbalanced and balanced data. Soft Comput 20(1):173–188

  15. Cano A, Nguyen D, Ventura S, Cios K (2014) ur-CAIM: improved CAIM discretization for unbalanced and balanced data. Soft Comput 20:1–16

  16. Catlett J (1991) On changing continuous attributes into ordered discrete attributes. In: Machine learning: EWSL-91. Springer, pp 164–178

  17. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  18. de Sá CR, Soares C, Knobbe A (2015) Entropy-based discretization methods for ranking data. Inf Sci

  19. Deegalla S, Boström H (2007) Classification of microarrays with kNN: comparison of dimensionality reduction methods. In: Intelligent data engineering and automated learning-IDEAL 2007. Springer, pp 800–809

  20. Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Prieditis A, Russell S (eds) Machine learning proceedings 1995. Morgan Kaufmann, San Francisco, pp 194–202

  21. Durrant B, Frank E, Hunt L, Holmes G, Mayo M, Pfahringer B, Smith T, Witten I (2014) Weka 3: data mining software in Java. Machine Learning Group at the University of Waikato

  22. Fayyad U, Irani K (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: IJCAI, pp 1022–1029

  23. Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701

  24. Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86–92

  25. García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf Sci 180(10):2044–2064

  26. Garcia S, Luengo J, Sáez JA, López V, Herrera F (2013) A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans Knowl Data Eng 25(4):734–750

  27. Giancarlo R, Bosco GL, Pinello L (2010) Distance functions, clustering algorithms and microarray data analysis. In: Learning and intelligent optimization. Springer, pp 125–138

  28. Gonzalez-Abril L, Cuberos FJ, Velasco F, Ortega JA (2009) Ameva: an autonomous discretization algorithm. Expert Syst Appl 36(3):5327–5332

  29. Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco

  30. Hayashi Y, Setiono R, Azcarraga A (2016) Neural network training and rule extraction with augmented discretized input. Neurocomputing 207:610–622

  31. Ho KM, Scott PD (1997) Zeta: a global method for discretization of continuous variables. In: Proc. Third intl conf. knowledge discovery and data mining (KDD97), pp 191–194

  32. Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 65–70

  33. Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17(3):299–310

  34. John GH, Langley P (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence, UAI’95, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 338–345

  35. Kautz T, Eskofier BM, Pasluosta CF (2017) Generic performance measure for multiclass-classifiers. Pattern Recognit 68:111–125

  36. Kerber R (1992) ChiMerge: discretization of numeric attributes. In: Proceedings of the tenth national conference on artificial intelligence, AAAI Press, pp 123–128

  37. Kurgan LA, Cios KJ (2004) CAIM discretization algorithm. IEEE Trans Knowl Data Eng 16(2):145–153

  38. Li J, Fong S, Mohammed S, Fiaidhi J (2016) Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms. J Supercomput 72(10):3708–3728

  39. Lustgarten JL, Gopalakrishnan V, Grover H, Visweswaran S (2008) Improving classification performance with discretization on biomedical datasets. In: AMIA annual symposium proceedings, Vol. 2008, American Medical Informatics Association, p 445

  40. Lustgarten JL, Visweswaran S, Gopalakrishnan V, Cooper GF (2011) Application of an efficient Bayesian discretization method to biomedical data. BMC Bioinform 12(1):309

  41. Lv J, Peng Q, Chen X, Sun Z (2016) A multi-objective heuristic algorithm for gene expression microarray data classification. Expert Syst Appl 59:13–19

  42. Madhu G, Rajinikanth T, Govardhan A (2014) Improve the classifier accuracy for continuous attributes in biomedical datasets using a new discretization method. Procedia Comput Sci 31:671–679

  43. Nguyen V-A, Lió P (2009) Measuring similarity between gene expression profiles: a Bayesian approach. BMC Genom 10(Suppl 3):S14

  44. Ong HF, Mustapha N, Sulaiman MN (2014) An integrative gene selection with association analysis for microarray data classification. Intell Data Anal 18(4):739–758

  45. Piatetsky-Shapiro G, Tamayo P (2003) Microarray data mining: facing the challenges. ACM SIGKDD Explor Newsl 5(2):1–5

  46. Quinlan JR (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA

  47. Rahman MG, Islam MZ (2016) Discretization of continuous attributes through low frequency numerical values and attribute interdependency. Expert Syst Appl 45:410–423

  48. Ramirez-Gallego S, Garcia S, Benitez J, Herrera F (2015a) Multivariate discretization based on evolutionary cut points selection for classification. IEEE Trans Cybern PP(99):1–1

  49. Ramirez-Gallego S, Garcia S, Benitez JM, Herrera F (2015b) Multivariate discretization based on evolutionary cut points selection for classification

  50. Ruan J, Jahid MJ, Gu F, Lei C, Huang Y-W, Hsu Y-T, Mutch DG, Chen C-L, Kirma NB, Huang TH-M (2016) A novel algorithm for network-based prediction of cancer recurrence. Genomics

  51. Sang Y, Qi H, Li K, Jin Y, Yan D, Gao S (2014) An effective discretization method for disposing high-dimensional data. Inf Sci 270:73–91

  52. Shang C, Shen Q (2005) Aiding classification of gene expression data with feature selection: a comparative study. Int J Comput Intell Res 1(1):68–76

  53. Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905

  54. Sriwanna K, Boongoen T, Iam-On N (2017) Graph clustering-based discretization of splitting and merging methods (GraphS and GraphM). Human-Centric Comput Inf Sci 7(1):21

  55. Sriwanna K, Puntumapon K, Waiyamai K (2012) An enhanced class-attribute interdependence maximization discretization algorithm. In: Advanced data mining and applications. Springer, pp 465–476

  56. Wang H-Q, Jing G-J, Zheng C (2014) Biology-constrained gene expression discretization for cancer classification. Neurocomputing 145:30–36

  57. Wei D, Jiang Q, Wei Y, Wang S (2012) A novel hierarchical clustering algorithm for gene sequences. BMC Bioinform 13(1):174

  58. Wu X, Kumar V (2009) The top ten algorithms in data mining, 1st edn. Chapman & Hall, Boca Raton

  59. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

  60. Yang P, Li J-S, Huang Y-X (2011) HDD: a hypercube division-based algorithm for discretisation. Int J Syst Sci 42(4):557–566

  61. Yang Y, Webb GI (2009) Discretization for naive-Bayes learning: managing discretization bias and variance. Mach Learn 74(1):39–74

  62. Yu Z, You J, Li L, Wong H-S, Han G (2012) Representative distance: a new similarity measure for class discovery from gene expression data. IEEE Trans NanoBiosci 11(4):341–351

Acknowledgements

The authors would like to thank the KEEL software project [2, 3] for distributing the source code of the discretization algorithms, the authors of EMD [48] for the EMD program, and the authors of ur-CAIM [15] for distributing the ur-CAIM program.

Author information

Corresponding author

Correspondence to Kittakorn Sriwanna.

Cite this article

Sriwanna, K., Boongoen, T. & Iam-On, N. Graph clustering-based discretization approach to microarray data. Knowl Inf Syst 60, 879–906 (2019). https://doi.org/10.1007/s10115-018-1249-z
