Privacy-Preserving Genomic Data Publishing via Differentially-Private Suffix Tree

  • Tanya Khatri
  • Gaby G. Dagher
  • Yantian Hou
Conference paper
Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 304)


Privacy-preserving data publishing is a mechanism for sharing data while ensuring that the privacy of individuals is preserved in the published data and that utility is maintained for data mining and analysis. There is a pressing need to share genomic data to advance medical and health research. However, since genomic data is highly sensitive and serves as the ultimate identifier, publishing it while protecting the privacy of the individuals in the data is a major challenge. In this paper, we address this challenge by presenting an approach for privacy-preserving genomic data publishing via a differentially-private suffix tree. The proposed algorithm uses a top-down approach and the Laplace mechanism to divide the raw genomic data into disjoint partitions, and then normalizes the partitioning structure to ensure consistency and maintain utility. The output of our algorithm is a differentially-private suffix tree, a data structure well suited to efficient search on genomic data. We experiment on real-life genomic data obtained from the Human Genome Privacy Challenge project, and we show that our approach is efficient, scalable, and achieves high utility with respect to genomic sequence matching count queries.
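As a rough illustration of the Laplace mechanism applied to a sequence matching count query, the following Python sketch may help. It is not the paper's algorithm: it builds no suffix tree and performs no partitioning or normalization. The function names, the toy sequences, and the assumption that each individual contributes exactly one sequence (so that the count has sensitivity 1) are all hypothetical choices made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with location 0 and that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def matching_count(sequences, pattern):
    """Exact count of sequences that contain `pattern` at least once."""
    return sum(1 for seq in sequences if pattern in seq)


def noisy_matching_count(sequences, pattern, epsilon, rng=None):
    """Matching count perturbed with Laplace noise of scale 1 / epsilon.

    Assumption (hypothetical): one sequence per individual, so adding or
    removing a single record changes the true count by at most 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    sensitivity = 1.0
    return matching_count(sequences, pattern) + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_sequences = ["ACGTAC", "TTGACA", "ACGAAC", "GGGTTT"]
    print("exact count:", matching_count(toy_sequences, "ACG"))
    print("noisy count:", noisy_matching_count(toy_sequences, "ACG", epsilon=1.0))
```

A smaller epsilon gives a larger noise scale, so stronger privacy comes at the cost of less accurate counts. This is the trade-off that the utility evaluation on count queries is meant to measure.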



This research was partially supported by Forsta, Inc (



Copyright information

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2019

Authors and Affiliations

  1. Department of Computer Science, Boise State University, Boise, USA
