The VLDB Journal, Volume 24, Issue 1, pp 1–24

Compressive mining: fast and optimal data mining in the compressed domain

  • Michail Vlachos
  • Nikolaos M. Freris
  • Anastasios Kyrillidis
Regular Paper

Abstract

Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier and wavelets). However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the tightest lower/upper bound on Euclidean distances when each data object is potentially compressed using a different set of orthonormal coefficients. Our technique leads to tighter distance estimates, which translate into more accurate search, learning, and mining operations directly in the compressed domain. We formulate the problem of estimating lower/upper distance bounds as an optimization problem. We establish the properties of optimal solutions and leverage the theoretical analysis to develop a fast algorithm to obtain an exact solution to the problem. The suggested solution provides the tightest estimation of the \(L_2\)-norm or the correlation. We show that typical data analysis operations, such as \(k\)-nearest-neighbor search or \(k\)-means clustering, can operate more accurately using the proposed compression and distance reconstruction technique. We compare it with many other prevalent compression and reconstruction techniques, including random projections and PCA-based techniques. We highlight a surprising result, namely that when the data are highly sparse in some basis, our technique may even outperform PCA-based compression. The contributions of this work are generic, as our methodology is applicable to any sequential or high-dimensional data as well as to any orthogonal data transformation used for the underlying data compression scheme.
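To make the distance-bounding idea concrete, the following minimal Python sketch (ours, not the paper's code; all identifiers are illustrative) handles the simplified one-sided case: the query is uncompressed, and each data series is stored as its largest coefficients under an orthonormal transform plus the total energy of the discarded ones. By Parseval's theorem, Euclidean distance can be computed coefficient-wise; on the discarded positions, where only the residual energy is known, the distance contribution ranges over a tight interval given by the triangle inequality.

    import numpy as np

    def distance_bounds(q_coeffs, kept_idx, kept_vals, residual_energy):
        # Lower/upper bounds on ||q - x||_2 when x is stored as its
        # coefficients at positions kept_idx plus the total energy of
        # all discarded coefficients (orthonormal transform assumed,
        # so distances can be computed coefficient-wise via Parseval).
        q_coeffs = np.asarray(q_coeffs, dtype=float)
        kept_idx = np.asarray(kept_idx)
        # Exact squared contribution on the stored coefficients.
        known = np.sum((q_coeffs[kept_idx] - np.asarray(kept_vals)) ** 2)
        # On the remaining positions only ||x_rest|| = r is known; the
        # distance there lies in [|q_rest - r|, q_rest + r], attained
        # when x_rest is aligned / anti-aligned with the query tail.
        mask = np.ones(q_coeffs.size, dtype=bool)
        mask[kept_idx] = False
        q_rest = np.linalg.norm(q_coeffs[mask])
        r = np.sqrt(residual_energy)
        lower = np.sqrt(known + (q_rest - r) ** 2)
        upper = np.sqrt(known + (q_rest + r) ** 2)
        return lower, upper

    # Toy check with the identity "transform" (any orthonormal basis works):
    rng = np.random.default_rng(0)
    x, q = rng.standard_normal(64), rng.standard_normal(64)
    kept = np.argsort(np.abs(x))[-8:]        # keep the 8 largest coefficients
    e_x = np.sum(np.delete(x, kept) ** 2)    # energy of the discarded ones
    lo, hi = distance_bounds(q, kept, x[kept], e_x)
    assert lo <= np.linalg.norm(q - x) <= hi

When both series are compressed, and with different coefficient sets, the unknown tails interact and the tightest bounds no longer have such a closed form; the paper casts that case as a convex program and solves it exactly with a water-filling-style algorithm, per the keywords below.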

Keywords

Data compression · Compressive sensing · Fourier · Wavelets · Waterfilling algorithm · Convex optimization

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Michail Vlachos (1)
  • Nikolaos M. Freris (2)
  • Anastasios Kyrillidis (1)
  1. IBM Research Zürich, Rüschlikon, Switzerland
  2. New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
