Efficient String Mining under Constraints Via the Deferred Frequency Index

  • David Weese
  • Marcel H. Schulz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5077)

Abstract

We propose a general approach for frequency-based string mining, which has many applications, e.g. in contrast data mining. Our contribution is a novel algorithm based on a deferred data structure. Despite its simplicity, our approach is up to 4 times faster than the best-known algorithm of Fischer et al. and uses about half the memory. Applications in various string domains, e.g. natural language, DNA or protein sequences, demonstrate the improvement achieved by our algorithm.
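To make the problem setting concrete, the following is a minimal brute-force sketch of frequency-based contrast mining. It is not the deferred frequency index proposed in the paper, nor the algorithm of Fischer et al. It assumes that the frequency of a substring in a database is the number of strings that contain it, and that a substring qualifies when it is frequent in one database and rare in another. The function names and the thresholds `min_pos` and `max_neg` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def substring_frequencies(database):
    """Map each substring to the number of strings in `database` containing it."""
    counts = Counter()
    for text in database:
        # Collect substrings in a set so each one counts once per string.
        seen = {
            text[i:j]
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
        }
        counts.update(seen)
    return counts


def mine_contrast_substrings(positive, negative, min_pos, max_neg):
    """Return substrings occurring in at least `min_pos` positive strings
    and in at most `max_neg` negative strings (hypothetical thresholds)."""
    pos = substring_frequencies(positive)
    neg = substring_frequencies(negative)
    return sorted(
        s for s, f in pos.items()
        if f >= min_pos and neg.get(s, 0) <= max_neg
    )


if __name__ == "__main__":
    print(mine_contrast_substrings(["abab", "abc"], ["bcd"], min_pos=2, max_neg=0))
```

This enumeration takes time and space quadratic in the length of each string, which is why index-based approaches such as the one proposed here matter for realistic inputs.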

Keywords

Frequency Vector, Frequent Pattern Mining, Space Consumption, Union String, Frequency Predicate
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: current status and future directions. Data Min. Knowl. Discov. 15(1), 55–86 (2007)
  2. Berry, M.J., Linoff, G.S.: Data Mining Techniques: For Marketing, Sales, and Customer Support, 1st edn., pp. 51–62. John Wiley & Sons, Chichester (1997)
  3. Muthusamy, Y.K., Barnard, E., Cole, R.A.: Reviewing automatic language identification. IEEE Sig. Proc. Mag. 11(4), 33–41 (1994)
  4. Zhang, M.Q.: Computational analyses of eukaryotic promoters. BMC Bioinformatics 8(Supp 6), S3 (2007)
  5. Birzele, F., Kramer, S.: A new representation for protein secondary structure prediction based on frequent patterns. Bioinformatics 22(21), 2628–2634 (2006)
  6. Redhead, E., Bailey, T.L.: Discriminative motif discovery in DNA and protein sequences using the DEME algorithm. BMC Bioinformatics 8, 385 (2007)
  7. Fischer, J., Heun, V., Kramer, S.: Fast frequent string mining using suffix arrays. In: IEEE ICDM 2005, pp. 609–612. IEEE Computer Society Press, Los Alamitos (2005)
  8. Raedt, L.D., Jaeger, M., Lee, S.D., Mannila, H.: A theory of inductive query answering. In: IEEE ICDM 2002, pp. 123–130. IEEE Computer Society, Los Alamitos (2002)
  9. Chan, S., Kao, B., Yip, C.L., Tang, M.: Mining emerging substrings. In: DASFAA 2003, pp. 119–126. IEEE Computer Society, Los Alamitos (2003)
  10. Lee, S.D., Raedt, L.D.: An efficient algorithm for mining string databases under constraints. In: Goethals, B., Siebes, A. (eds.) KDID 2004. LNCS, vol. 3377, pp. 108–129. Springer, Heidelberg (2005)
  11. Abouelhoda, M., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 53–86 (2004)
  12. Fischer, J., Heun, V., Kramer, S.: Optimal string mining under frequency constraints. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 139–150. Springer, Heidelberg (2006)
  13. Fischer, J., Heun, V.: Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 36–48. Springer, Heidelberg (2006)
  14. Manber, U., Myers, E.: Suffix arrays: a new method for on-line string searches. In: SODA 1990, pp. 319–327. SIAM, Philadelphia (1990)
  15. Giegerich, R., Kurtz, S., Stoye, J.: Efficient implementation of lazy suffix trees. Software Pract. Exper. 33(11), 1035–1049 (2003)
  16. Giegerich, R., Kurtz, S.: A comparison of imperative and purely functional suffix tree constructions. Sci. Comput. Program. 25, 187–218 (1995)
  17. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: 8.2: Counting sort. In: Introduction to Algorithms, 2nd edn., pp. 168–170. MIT Press and McGraw-Hill (2001)
  18. Fitzgerald, P.C., Sturgill, D., Shyakhtenko, A., Oliver, B., Vinson, C.: Comparative genomics of Drosophila and human core promoters. Genome Biol. 7, R53 (2006)
  19. The UniProt Consortium: The Universal Protein Resource (UniProt). Nucl. Acids Res. 36(suppl_1), D190–D195 (2008), ftp://ftp.ebi.ac.uk/pub/databases/integr8/uniprot/proteomes
  20. Asuncion, A., Newman, D.: UCI machine learning repository (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
  21. Dong, G., Li, J.: Efficient mining of emerging patterns: discovering trends and differences. In: KDD 1999, pp. 43–52. ACM, New York (1999)
  22. Kurtz, S.: Reducing the space requirement of suffix trees. Software Pract. Exper. 29(13), 1149–1171 (1999)
  23. Li, J., Dong, G., Ramamohanarao, K.: Making use of the most expressive jumping emerging patterns for classification. In: PAKDD 2000, pp. 220–232. Springer, Heidelberg (2000)
  24. Mitasiunaite, I., Boulicaut, J.F.: Looking for monotonicity properties of a similarity constraint on sequences. In: SAC 2006, pp. 546–552. ACM, New York (2006)
  25. Ji, X., Bailey, J., Dong, G.: Mining minimal distinguishing subsequence patterns with gap constraints. Knowl. Inf. Syst. 11(3), 259–286 (2007)
  26. Döring, A., Weese, D., Rausch, T., Reinert, K.: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics 9, 11 (2008)

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • David Weese (1)
  • Marcel H. Schulz (2)
  1. Department of Computer Science, Free University of Berlin, Berlin, Germany
  2. Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestr. 73, 14195 Berlin, Germany, and International Max Planck Research School for Computational Biology and Scientific Computing
