Progressive and Approximate Join Algorithms on Data Streams

Part of the Intelligent Systems Reference Library book series (ISRL, volume 36)

Abstract

In this chapter, we discuss the design and implementation of join algorithms for data streaming systems, where memory is often limited relative to the data that needs to be processed. We first focus on progressive join algorithms for various data models. We introduce a framework for progressive join processing, called the Result Rate based Progressive Join (RRPJ) framework, which can be used for join processing over various data models, and discuss its instantiations for processing relational, high-dimensional, spatial, and XML data.
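To make the term "progressive join" concrete, the following is a minimal sketch of a generic non-blocking symmetric hash join, which emits each result as soon as both matching tuples have arrived. It is an illustration only and not the RRPJ framework described in the chapter. The function name `symmetric_hash_join`, the `(side, key, payload)` tuple layout, and the stream labels `"R"` and `"S"` are assumptions made for this example. The sketch also assumes both hash tables fit in memory, so it does not show the flushing policy a memory-limited streaming join would need.

```python
from collections import defaultdict
from typing import Any, Hashable, Iterable, Iterator, Tuple

# Hypothetical tuple layout for this sketch: (side, join_key, payload),
# where side is "R" or "S" and names the stream the tuple arrived on.
Arrival = Tuple[str, Hashable, Any]


def symmetric_hash_join(arrivals: Iterable[Arrival]) -> Iterator[Tuple[Any, Any]]:
    """Yield (r_payload, s_payload) pairs as soon as both sides have arrived.

    Assumption: all tuples seen so far fit in memory. A memory-limited
    streaming join would additionally have to choose what to flush to disk.
    """
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    for side, key, payload in arrivals:
        other = "S" if side == "R" else "R"
        # Probe the opposite table first, so results are produced immediately.
        for match in tables[other].get(key, ()):
            yield (payload, match) if side == "R" else (match, payload)
        # Then insert the new tuple so later arrivals can join with it.
        tables[side][key].append(payload)


if __name__ == "__main__":
    stream = [
        ("R", 1, "r1"),
        ("S", 1, "s1"),
        ("S", 2, "s2"),
        ("R", 2, "r2"),
        ("R", 1, "r1b"),
    ]
    for pair in symmetric_hash_join(stream):
        print(pair)
```

Because the function is a generator, a consumer sees the first join result after the second arrival rather than after both inputs have been read in full, which is the behaviour the word "progressive" refers to.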

We then consider progressive and approximate join algorithms. The need for approximate join algorithms is motivated by the observation that users often do not require a complete set of answers. Some answers, which we refer to as an approximate result, are often sufficient. Users expect the approximate result to be either the largest possible or the most representative (or both) given the resources available. We discuss the tradeoffs between maximizing the quantity and the quality of the approximate result. To address the different tradeoffs, we discuss a family of algorithms for progressive and approximate join processing.

Keywords

Data Stream, Query Processing, Priority Queue, Kullback Leibler, Inclusion Probability
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Microsoft, Singapore, Singapore
  2. School of Computing, National University of Singapore, Singapore, Singapore
