Partially Synthesised Dataset to Improve Prediction Accuracy

  • Ahmed J. Aljaaf
  • Dhiya Al-Jumeily
  • Abir J. Hussain
  • Paul Fergus
  • Mohammed Al-Jumaily
  • Hani Hamdan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9771)


Real-world data sources, such as statistical agencies, library databanks and research institutes, are the major data sources for researchers. Using this type of data has several advantages: it improves the credibility and validity of the experiment and, more importantly, it relates to real-world problems and is typically unbiased. However, this type of data is often unavailable or inaccessible, for the following reasons. First, there are privacy and confidentiality concerns, since the data must be protected on legal and ethical grounds. Second, collecting real-world data is costly and time-consuming. Third, the data may not exist, particularly for newly arising research subjects. Therefore, many studies have used fully and/or partially synthesised data instead of real-world data, because it is simple to create, takes relatively little time and can be generated in sufficient quantity to fit the requirements. In this context, this study introduces the use of partially synthesised data to improve the prediction of heart diseases from risk factors. We propose generating partially synthetic data from agreed principles using a rule-based method, in which an extra risk factor is added to the real-world data. In the conducted experiment, more than 85 % of the data was derived from observed values (i.e., real-world data), while the remaining data was synthetically generated using a rule-based method in accordance with the World Health Organisation criteria. The analysis revealed an improvement in the variance captured by the first two principal components of the partially synthesised data. A further evaluation was conducted using five popular supervised machine-learning classifiers, in which the partially synthesised data considerably improved the prediction of heart diseases: the majority of classifiers approximately doubled their predictive performance when using the extra risk factor.
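The rule-based step described above can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not name the extra risk factor or state the rules, so the choice of body mass index (BMI), the field names (age, chol, trestbps), the thresholds in synthesise_bmi and the toy records are all assumptions made for illustration. Only the BMI category cut-offs follow the published WHO adult classification.

```python
import random


def who_bmi_category(bmi):
    """Map a BMI value (kg/m^2) to the WHO adult category."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    return "obese"


def synthesise_bmi(record, rng):
    """Hypothetical rule: records with more elevated observed risk
    factors draw their synthetic BMI from a higher range.
    The thresholds and ranges are illustrative assumptions only."""
    risk_score = (
        (record["age"] >= 55)
        + (record["chol"] >= 240)
        + (record["trestbps"] >= 140)
    )
    ranges = [(18.5, 25.0), (22.0, 30.0), (25.0, 35.0), (30.0, 40.0)]
    low, high = ranges[risk_score]
    return round(rng.uniform(low, high), 1)


def add_synthetic_factor(records, seed=0):
    """Return new records with one synthesised column added.
    The observed (real-world) values are copied unchanged."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    augmented = []
    for record in records:
        bmi = synthesise_bmi(record, rng)
        augmented.append(
            {**record, "bmi": bmi, "bmi_category": who_bmi_category(bmi)}
        )
    return augmented


if __name__ == "__main__":
    # Toy records, invented for this sketch (not taken from any dataset).
    toy_records = [
        {"age": 63, "chol": 260, "trestbps": 150},
        {"age": 40, "chol": 180, "trestbps": 120},
    ]
    for row in add_synthetic_factor(toy_records):
        print(row)
```

The observed columns stay untouched and only the added column is generated, which is what makes the result partially rather than fully synthesised.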


Keywords: Partially synthesised data, Prediction, Heart diseases, Machine learning, Rule-based method



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Ahmed J. Aljaaf (1), corresponding author
  • Dhiya Al-Jumeily (1)
  • Abir J. Hussain (1)
  • Paul Fergus (1)
  • Mohammed Al-Jumaily (2)
  • Hani Hamdan (3)

  1. Applied Computing Research Group, Liverpool John Moores University, Liverpool, UK
  2. Department of Neurosurgery, Dr. Sulaiman al Habib Hospital, Dubai Healthcare City, Dubai, UAE
  3. Département Signal & Statistiques, CentraleSupélec, Châtenay-Malabry, France
