Optimising Queue-Based Semi-stream Joins by Introducing a Queue of Frequent Pages

  • Conference paper
Databases Theory and Applications (ADC 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9877)

Abstract

Semi-stream joins perform a join between a stream and a disk-based table. In many scenarios these joins can handle typical workloads in online real-time data warehousing with relatively modest system requirements. The disk access is page-based. Several earlier proposals exploit skew in the distribution of the join attribute. Such skew commonly results from natural short- or long-tailed distributions in master data. Several semi-stream joins use caching strategies to improve performance. This works up to a point, but these algorithms are still relatively slow at processing stream data that matches the infrequent tuples in the master data. In this work we explore an additional strategy to exploit data skew: disk pages that are frequently accessed as a whole are accessed with priority. We show that this strategy achieves a considerable gain in service rate while keeping memory consumption low. In essence, we obtain a three-stage approach for dealing with skewed, unsorted data: caching, plus our new strategy, plus processing of the long tail of the distribution. We also present a cost model for our approach and validate the approach empirically.
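
The three-stage idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' algorithm. It is a simplified batch simulation under stated assumptions: master data is an in-memory dictionary standing in for the disk-based table, keys are integers laid out in key order so that `page_of` can map a key to a page, and the set of cached keys and the list of frequent pages are given as inputs rather than learned from the stream. All names (`PAGE_SIZE`, `page_of`, `load_and_join`, `three_stage_join`) are hypothetical.

```python
from collections import deque

PAGE_SIZE = 4  # hypothetical: number of master tuples per disk page


def page_of(key):
    """Map an integer join key to its page (assumes master data stored in key order)."""
    return key // PAGE_SIZE


def load_and_join(page, waiting, master, output):
    """Simulate one page read and resolve every waiting stream tuple on that page."""
    rows = {k: v for k, v in master.items() if page_of(k) == page}
    remaining = deque()
    for key, payload in waiting:
        if page_of(key) != page:
            remaining.append((key, payload))        # needs a different page
        elif key in rows:
            output.append((key, payload, rows[key]))
        # else: the key has no master match and is dropped (inner-join semantics)
    return remaining


def three_stage_join(stream, master, cached_keys, frequent_pages):
    """Return (joined tuples, number of simulated page reads)."""
    cache = {k: master[k] for k in cached_keys}     # hot master tuples held in memory
    output, waiting, page_reads = [], deque(), 0

    # Stage 1: stream tuples that hit the cache are joined without disk access.
    for key, payload in stream:
        if key in cache:
            output.append((key, payload, cache[key]))
        else:
            waiting.append((key, payload))

    # Stage 2: pages assumed to be frequently accessed are read with priority.
    for page in frequent_pages:
        if any(page_of(k) == page for k, _ in waiting):
            waiting = load_and_join(page, waiting, master, output)
            page_reads += 1

    # Stage 3: long tail, read the page needed by the oldest waiting tuple.
    while waiting:
        waiting = load_and_join(page_of(waiting[0][0]), waiting, master, output)
        page_reads += 1

    return output, page_reads


# Usage with made-up data: key 0 is cached, page 1 is treated as frequent.
master = {k: f"m{k}" for k in range(12)}
stream = [(0, "a"), (5, "b"), (0, "c"), (6, "d"), (9, "e"), (5, "f")]
joined, reads = three_stage_join(stream, master, cached_keys=[0], frequent_pages=[1])
print(len(joined), "tuples joined with", reads, "page reads")
```

In this toy run the two tuples with key 0 never touch the disk, one read of page 1 serves three waiting tuples, and only the single long-tail tuple costs a page read of its own. A real semi-stream join would process an unbounded stream incrementally within a fixed memory budget, which this sketch does not model.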

Notes

  1. This dataset is available at: http://cdiac.ornl.gov/ftp/ndp026b/.

Author information

Corresponding author

Correspondence to M. Asif Naeem.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Naeem, M.A., Weber, G., Lutteroth, C. (2016). Optimising Queue-Based Semi-stream Joins by Introducing a Queue of Frequent Pages. In: Cheema, M., Zhang, W., Chang, L. (eds) Databases Theory and Applications. ADC 2016. Lecture Notes in Computer Science, vol 9877. Springer, Cham. https://doi.org/10.1007/978-3-319-46922-5_32

  • DOI: https://doi.org/10.1007/978-3-319-46922-5_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46921-8

  • Online ISBN: 978-3-319-46922-5

  • eBook Packages: Computer Science, Computer Science (R0)
