
A differential privacy DNA motif finding method based on closed frequent patterns

Cluster Computing

Abstract

As one of the basic research methods of bioinformatics, DNA motif finding is important for studying the mechanisms that regulate gene expression and for discovering biologically functional sites. However, because DNA data are highly sensitive, the risk of privacy disclosure during motif finding has become a bottleneck in gene research. Meanwhile, traditional privacy-preserving data mining methods cannot handle DNA sequences directly, and existing private motif finding methods usually reduce the utility of the results. To solve these problems, we propose a high-utility motif finding algorithm based on \(\epsilon\)-differential privacy, a rigorous definition of privacy that provides meaningful guarantees even in the presence of arbitrary external information. Our solution uses the closed frequent pattern set to remove redundant motifs from the result set and obtain accurate motif results while satisfying \(\epsilon\)-differential privacy. Furthermore, a post-processing method based on the best linear unbiased estimate is used to improve the utility of the noisy consolidated motif supports. Experiments on real-life DNA sequence datasets confirm that our algorithm outperforms existing algorithms in terms of utility.
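The two ingredients named in the abstract, closed frequent patterns and noisy supports, can be illustrated with a small sketch. This is not the authors' algorithm: the function names, the use of contiguous substrings as candidate motifs, the length bounds, and the even split of the privacy budget are all assumptions made for illustration. In particular, choosing which patterns to release from the raw data leaks information, so the sketch as a whole does not satisfy \(\epsilon\)-differential privacy; only the noise step, applied to a fixed pattern set, follows the standard Laplace mechanism.

```python
import random
from collections import defaultdict


def substring_support(sequences, min_len, max_len):
    """Count, for each substring, how many sequences contain it at least once."""
    support = defaultdict(int)
    for seq in sequences:
        seen = set()
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                seen.add(seq[start:start + length])
        for pattern in seen:
            support[pattern] += 1
    return dict(support)


def closed_frequent(support, min_support):
    """Keep frequent patterns with no longer super-pattern of equal support.

    Closedness is judged only against the candidates in `support`,
    so it is relative to the chosen maximum length.
    """
    frequent = {p: s for p, s in support.items() if s >= min_support}
    closed = {}
    for p, s in frequent.items():
        absorbed = any(
            len(q) > len(p) and p in q and frequent[q] == s for q in frequent
        )
        if not absorbed:
            closed[p] = s
    return closed


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_supports(closed, epsilon, rng):
    """Add Laplace noise to each support count.

    Assumption: one sequence changes each count by at most 1, and the
    budget epsilon is split evenly across the released counts.
    """
    scale = len(closed) / epsilon
    return {p: s + laplace_noise(scale, rng) for p, s in closed.items()}


sequences = ["ACGT", "ACGA", "TACG"]
support = substring_support(sequences, min_len=2, max_len=3)
closed = closed_frequent(support, min_support=3)
noisy = noisy_supports(closed, epsilon=1.0, rng=random.Random(0))
```

In this toy input, "AC", "CG" and "ACG" each occur in all three sequences, so the two shorter patterns are absorbed by "ACG" and only one count needs noise. Releasing fewer counts means each one receives a larger share of the privacy budget, which is the intuition behind using closed patterns to improve utility.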

Corresponding author

Correspondence to Xiang Wu.

Cite this article

Wu, X., Wei, Y., Mao, Y. et al. A differential privacy DNA motif finding method based on closed frequent patterns. Cluster Comput 22 (Suppl 2), 2907–2919 (2019). https://doi.org/10.1007/s10586-017-1691-9

  • DOI: https://doi.org/10.1007/s10586-017-1691-9
