Abstract
DNA motif finding is a basic research method in bioinformatics and is important for studying the mechanisms that regulate gene expression and for discovering biologically functional sites. However, because DNA data are highly sensitive, the risk of privacy disclosure during motif finding has become a bottleneck in gene research. Traditional privacy-preserving data mining methods cannot handle DNA sequences directly, and existing private motif finding methods usually reduce the utility of the results. To address these problems, we propose a high-utility motif finding algorithm based on \(\epsilon \)-differential privacy, a rigorous definition of privacy that provides meaningful guarantees even in the presence of arbitrary external information. Our solution uses the closed frequent pattern set to reduce redundant motifs in the result set and to obtain accurate motif results while satisfying \(\epsilon \)-differential privacy. Furthermore, a post-processing method based on the best linear unbiased estimate is used to improve the utility of the noisy consolidated motif support. Experiments on real-life DNA sequence datasets confirm that our algorithm outperforms existing algorithms in terms of utility.
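The two ideas named in the abstract, closed frequent patterns and noisy supports, can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper. It assumes that a motif is a contiguous substring, that support is the number of sequences containing the pattern, and that the privacy budget is split evenly across the released counts. The function names `substring_supports`, `closed_frequent`, and `laplace_noisy_supports` are hypothetical. Only the noise-adding step follows the Laplace mechanism. The pattern selection here is not private, and the best-linear-unbiased-estimate post-processing is not shown.

```python
import random
from collections import defaultdict


def substring_supports(sequences, max_len):
    """Count, for each substring of length <= max_len, how many sequences contain it."""
    support = defaultdict(int)
    for seq in sequences:
        seen = set()  # each sequence contributes at most 1 to a pattern's support
        for i in range(len(seq)):
            for j in range(i + 1, min(i + max_len, len(seq)) + 1):
                seen.add(seq[i:j])
        for pattern in seen:
            support[pattern] += 1
    return dict(support)


def closed_frequent(support, min_support):
    """Keep frequent patterns that have no frequent super-pattern with equal support.

    Closedness is judged only against the enumerated patterns (length <= max_len).
    """
    frequent = {p: s for p, s in support.items() if s >= min_support}
    closed = {}
    for p, s in frequent.items():
        absorbed = any(
            len(q) > len(p) and p in q and frequent[q] == s for q in frequent
        )
        if not absorbed:
            closed[p] = s
    return closed


def laplace_noisy_supports(supports, epsilon, rng):
    """Add Laplace noise to each support count.

    Assumption: adding or removing one sequence changes each count by at most 1,
    and the budget epsilon is split evenly over the len(supports) counts, so each
    count gets Laplace noise of scale len(supports) / epsilon. The pattern set
    itself is treated as fixed, which a full private algorithm could not assume.
    """
    scale = len(supports) / epsilon
    noisy = {}
    for p, s in supports.items():
        # The difference of two i.i.d. exponentials with mean `scale`
        # is a Laplace(0, scale) sample.
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        noisy[p] = s + noise
    return noisy


sequences = ["ACGT", "ACGA", "TACG"]
closed = closed_frequent(substring_supports(sequences, max_len=3), min_support=2)
print(sorted(closed.items()))  # [('ACG', 3), ('T', 2)]
noisy = laplace_noisy_supports(closed, epsilon=1.0, rng=random.Random(0))
```

In this toy input, the seven frequent substrings collapse to two closed patterns, so fewer counts share the privacy budget and each receives less noise. This is the intuition behind using the closed set to improve utility.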
Cite this article
Wu, X., Wei, Y., Mao, Y. et al. A differential privacy DNA motif finding method based on closed frequent patterns. Cluster Comput 22 (Suppl 2), 2907–2919 (2019). https://doi.org/10.1007/s10586-017-1691-9
DOI: https://doi.org/10.1007/s10586-017-1691-9