The VLDB Journal, Volume 23, Issue 6, pp 871–893

ACME: A scalable parallel system for extracting frequent patterns from a very long sequence

Special Issue Paper

Abstract

Modern applications, including bioinformatics, time series, and web log analysis, require the extraction of frequent patterns, called motifs, from one very long (i.e., several gigabytes) sequence. Existing approaches are either error-prone heuristics or exact (also called combinatorial) methods that are extremely slow and therefore applicable only to very small sequences (i.e., on the order of megabytes). This paper presents ACME, a combinatorial approach that scales to gigabyte-long sequences and is the first to support supermaximal motifs. ACME is a versatile parallel system that can be deployed on desktop multi-core systems or on thousands of CPUs in the cloud. However, merely using more compute nodes does not guarantee efficiency, because of the related overheads. To address this, ACME introduces an automatic tuning mechanism that suggests the appropriate number of CPUs to use in order to meet the user's run-time constraints while minimizing the financial cost of cloud resources. Our experiments show that, compared to the state of the art, ACME supports sequences three orders of magnitude longer (e.g., DNA for the entire human genome); handles large alphabets (e.g., the English alphabet for Wikipedia); scales out to 16,384 CPUs on a supercomputer; and supports elastic deployment in the cloud.
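The tuning idea in the abstract (pick a CPU count that meets a run-time constraint at the lowest financial cost) can be illustrated with a minimal sketch. This is not ACME's actual mechanism, which the abstract does not detail. The run-time model, the function names `estimated_runtime` and `suggest_cpus`, and all parameters are hypothetical, chosen only to show why adding CPUs can stop helping once overheads dominate.

```python
def estimated_runtime(cpus, serial_s, parallel_s, overhead_per_cpu_s):
    """Hypothetical run-time model (an assumption, not taken from the paper):
    a fixed serial part, a perfectly divisible parallel part, and a
    coordination overhead that grows linearly with the number of CPUs."""
    return serial_s + parallel_s / cpus + overhead_per_cpu_s * cpus


def suggest_cpus(deadline_s, price_per_cpu_s, max_cpus,
                 serial_s, parallel_s, overhead_per_cpu_s):
    """Return (cpus, runtime, cost) with the lowest cost among all CPU counts
    whose estimated run time meets the deadline, or None if none does."""
    best = None
    for cpus in range(1, max_cpus + 1):
        runtime = estimated_runtime(cpus, serial_s, parallel_s,
                                    overhead_per_cpu_s)
        if runtime > deadline_s:
            continue  # violates the user's run-time constraint
        cost = cpus * runtime * price_per_cpu_s  # pay per CPU-second
        if best is None or cost < best[2]:
            best = (cpus, runtime, cost)
    return best


if __name__ == "__main__":
    # Illustrative numbers only: 10 s serial, 1000 s parallel work,
    # 1 s of overhead per CPU, 200 s deadline, unit price per CPU-second.
    print(suggest_cpus(deadline_s=200, price_per_cpu_s=1.0, max_cpus=64,
                       serial_s=10, parallel_s=1000, overhead_per_cpu_s=1))
```

Under this toy model the cost grows with the CPU count, so the cheapest feasible choice is the smallest count that still meets the deadline. When the deadline is tighter than the best achievable run time, the function returns `None` rather than suggesting more CPUs.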

Keywords

Automatic tuning, Cache efficient, Cloud, Elastic, Motif, Suffix tree

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  2. Qatar Computing Research Institute, Doha, Qatar
