Knowledge Discovery from Complex High Dimensional Data

Chapter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9580)

Abstract

Modern data analysis faces problems of ever-increasing dimensionality, driven mainly by the higher resolutions available for data acquisition and by the use of larger models with more degrees of freedom to investigate complex systems in greater depth. High dimensionality is one aspect of “big data”, and it brings not only computational but also statistical and perceptual challenges. Most data analysis problems are solved using optimization techniques, and large-scale optimization requires faster algorithms and implementations. Computed solutions must be evaluated for statistical quality, since false discoveries can otherwise be made. Recent papers suggest controlling and modifying the algorithms themselves to obtain better statistical properties. Finally, human perception inherently limits our understanding to three-dimensional spaces, which makes complex high-dimensional phenomena almost impossible to grasp. As an aid, we use dimensionality reduction and other techniques, but these usually do not capture the relations between objects of interest. Here graph-based knowledge representation has considerable potential, for instance for creating perceivable and interactive representations and for performing new types of analysis based on graph theory and network topology. In this article, we give glimpses of new developments in these aspects.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Artificial Intelligence Unit LS8, Computer Science Department, Technische Universität Dortmund, Dortmund, Germany
  2. Research Unit HCI-KDD, Institute for Medical Informatics, Statistics and Documentation, Medical University Graz, Graz, Austria
  3. Institute for Information Systems and Computer Media, Graz University of Technology, Graz, Austria
