Data Mining and Knowledge Discovery, Volume 10, Issue 2, pp 117–139

On the Use of Wavelet Decomposition for String Classification

Article

Abstract

In recent years, technological advances in gene mapping have made it increasingly easy to store and use a wide variety of biological data. Such data usually take the form of very long strings, for which it is difficult to determine the features most relevant to a classification task. For example, a typical DNA string may be millions of characters long, and a database may contain thousands of such strings. In many cases, the classification behavior of the data may be hidden in the compositional behavior of certain segments of the string, which cannot easily be determined a priori. A further complication is that in some cases the classification behavior is reflected in the global behavior of the string, whereas in others it is reflected in local patterns. Given the enormous variation in the behavior of strings across data sets, it is useful to develop a classification approach that is sensitive to both their global and local behavior. To this end, we exploit the multi-resolution property of wavelet decomposition to create a scheme that can mine classification characteristics at different levels of granularity. The resulting scheme turns out to be very effective in practice on a wide range of problems.
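The multi-resolution idea mentioned in the abstract can be illustrated with a small sketch. This is not the scheme from the article. It assumes, purely for illustration, that a string is turned into a 0/1 indicator sequence for one symbol and decomposed with an unnormalized (average/difference) Haar transform. The function names `indicator`, `haar_decompose`, and `wavelet_features` are hypothetical.

```python
def indicator(string, symbol):
    """Map a string to a 0/1 sequence marking where `symbol` occurs (illustrative encoding)."""
    return [1 if ch == symbol else 0 for ch in string]


def haar_decompose(signal):
    """Unnormalized Haar decomposition by pairwise averaging and differencing.

    Returns (details, mean): `details` is a list of detail-coefficient lists,
    finest level first; `mean` is the overall average of the signal.
    Assumes the length is a power of two.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Half-differences capture local change at this resolution.
        details.append([(a - b) / 2 for a, b in pairs])
        # Averages become the coarser signal for the next level.
        approx = [(a + b) / 2 for a, b in pairs]
    return details, approx[0]


def wavelet_features(string, symbol):
    """Flatten coefficients coarse-to-fine: global mean first, finest details last."""
    details, mean = haar_decompose(indicator(string, symbol))
    features = [mean]
    for level in reversed(details):
        features.extend(level)
    return features


if __name__ == "__main__":
    print(wavelet_features("AACCAAAA", "A"))
```

In this sketch the first coefficients summarize global composition (the overall frequency of the symbol), while later coefficients describe progressively more local patterns, so a classifier given such features could draw on either scale.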

Keywords

strings, classification, wavelets

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  1. IBM T. J. Watson Research Center, Yorktown Heights, USA
