Enhancing White-Box Machine Learning Processes by Incorporating Semantic Background Knowledge

  • Gilles Vandewiele
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10250)


Currently, most white-box machine learning techniques are purely data-driven and ignore prior background and expert knowledge. Much of this knowledge has already been captured in domain models, i.e. ontologies, using Semantic Web technologies. The goal of this research proposal is to improve the predictive performance and reduce the required training time of white-box models by incorporating the vast amount of available knowledge in the pre-processing, feature extraction and feature selection phases of a machine learning process.


White-box machine learning · Knowledge incorporation · Semantic knowledge bases



I would like to thank my promoters prof. Filip De Turck & dr. Femke Ongenae from Ghent University and my mentor, prof. Agnieszka Ławrynowicz from Poznan University, for their support and valuable input in the realization of this work. This research is funded by a PhD SB fellow scholarship of FWO (1S31417N).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Information Technology, Ghent University - imec, IDLab, Ghent, Belgium
