Cluster Computing, Volume 22, Supplement 2, pp 2907–2919

A differential privacy DNA motif finding method based on closed frequent patterns

  • Xiang Wu
  • Yuyang Wei
  • Yaqing Mao
  • Liang Wang


DNA motif finding is one of the basic research methods of bioinformatics, and it matters both for studying the mechanisms that regulate gene expression and for discovering biologically functional sites. Because DNA data are highly sensitive, however, the risk of disclosing them during motif finding has become a bottleneck in gene research. Traditional privacy-preserving data mining methods cannot handle DNA sequences directly, and existing private motif finding methods usually reduce the utility of the results. To address these problems, we propose a high-utility motif finding algorithm based on \(\epsilon\)-differential privacy, a rigorous definition of privacy that gives meaningful guarantees even in the presence of arbitrary external information. Our solution uses the closed frequent pattern set to remove redundant motifs from the result set and to obtain accurate motif results while satisfying \(\epsilon\)-differential privacy. In addition, a post-processing method based on the best linear unbiased estimate is used to improve the utility of the noisy consolidated motif supports. Experiments on real-life DNA sequence datasets confirm that our algorithm achieves higher utility than existing algorithms.
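The two ingredients named in the abstract, closed frequent patterns and noisy supports, can be illustrated with a small Python sketch. This is not the authors' algorithm: the function names (`ngram_supports`, `closed_frequent`, `noisy_supports`), the toy sequences, the bounded n-gram length, and the choice of the Laplace mechanism with the budget spread evenly over the released counts are all assumptions made for illustration. The sketch also ignores the privacy cost of selecting patterns from the data, so on its own it does not provide an end-to-end \(\epsilon\)-differential privacy guarantee, and it omits the best linear unbiased estimate post-processing step.

```python
import random
from collections import Counter


def ngram_supports(sequences, max_len):
    """Support of an n-gram = number of sequences that contain it at least once."""
    supports = Counter()
    for seq in sequences:
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(seq[i:i + n])
        supports.update(seen)  # each sequence contributes at most 1 per n-gram
    return supports


def closed_frequent(supports, min_support):
    """Keep frequent n-grams that have no longer frequent n-gram containing
    them (as a substring) with the same support. Closedness is only checked
    among the n-grams present in `supports` (i.e. up to max_len)."""
    frequent = {p: s for p, s in supports.items() if s >= min_support}
    closed = {}
    for p, s in frequent.items():
        absorbed = any(
            len(q) > len(p) and p in q and frequent[q] == s for q in frequent
        )
        if not absorbed:
            closed[p] = s
    return closed


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_supports(patterns, epsilon, rng):
    """Add Laplace noise to each support. One sequence changes each support by
    at most 1, so k counts have L1 sensitivity at most k (assumed budget split).
    ASSUMPTION: the pattern set is treated as fixed; selecting it from the data
    is not accounted for in the privacy budget here."""
    scale = len(patterns) / epsilon
    return {p: s + laplace_noise(scale, rng) for p, s in patterns.items()}


# Hypothetical toy input, not from the paper's datasets.
sequences = ["ACGT", "ACGA", "TACG"]
closed = closed_frequent(ngram_supports(sequences, max_len=3), min_support=2)
print(sorted(closed.items()))  # [('ACG', 3), ('T', 2)]
print(noisy_supports(closed, epsilon=1.0, rng=random.Random(0)))
```

In the toy input, the frequent n-grams A, C, G, AC and CG all have the same support as ACG, so only ACG is kept. This is the redundancy reduction that closed patterns provide, and it also means fewer counts need noise.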


  • Motif finding
  • Differential privacy
  • Privacy protection
  • n-Gram model
  • Closed frequent patterns



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Medical Informatics, Xuzhou Medical University, Xuzhou, China
