The VLDB Journal, Volume 24, Issue 6, pp 849–866

S\(^3\)-TM: scalable streaming short text matching

  • Fuat Basık
  • Buğra Gedik
  • Hakan Ferhatosmanoğlu
  • Mert Emin Kalender
Regular Paper

Abstract

Micro-blogging services have become major venues for information creation, as well as channels of information dissemination. Accordingly, monitoring them for relevant information is a critical capability. This is typically achieved by registering content-based subscriptions with the micro-blogging service. Such subscriptions are long-running queries that are evaluated against the stream of posts. Given the popularity and scale of micro-blogging services like Twitter and Weibo, building a scalable infrastructure to evaluate these subscriptions is a challenge. To address this challenge, we present the S\(^3\)-TM system for streaming short text matching. S\(^3\)-TM is organized as a stream processing application, in the form of a data parallel flow graph designed to run in a data center environment. It takes advantage of the structure of the publications (posts) and subscriptions to perform the matching in a scalable manner, without broadcasting publications or subscriptions to all of the matcher instances. The basic design of S\(^3\)-TM uses a scoped multicast for publications and a scoped anycast for subscriptions. To further improve throughput, we introduce publication routing algorithms that aim at minimizing the scope of the multicasts. The first set of algorithms we develop is based on partitioning the word co-occurrence frequency graph, with the aim of routing posts that include commonly co-occurring words to a small set of matchers. While effective, these algorithms fall short in balancing the load. To address this, we develop the SALB algorithm, which provides better load balance by modeling the load more accurately using the word-to-post bipartite graph. We also develop a subscription placement algorithm, called LASP, to group together similar subscriptions in order to minimize the subscription matching cost. Furthermore, to achieve good scalability for an increasing number of nodes, we introduce techniques to handle workload skew. Finally, we introduce load shedding techniques for handling unexpected load spikes with small impact on the accuracy. Our experimental results show that S\(^3\)-TM is scalable. Furthermore, the SALB algorithm provides more than \(2.5\times\) the throughput of the baseline multicast and outperforms the graph partitioning-based approaches.
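
The scoped multicast and scoped anycast idea described in the abstract can be illustrated with a small sketch. This is not the S\(^3\)-TM implementation. It assumes that a subscription is a conjunction of words, that each word is assigned to exactly one matcher, and that a subscription is placed on the least-loaded matcher among those owning one of its words. The names `Router`, `word_to_matcher`, `publication_scope`, `place_subscription`, and `publish`, as well as the hash fallback for unseen words, are hypothetical.

```python
import zlib
from collections import defaultdict


class Router:
    """Toy word-partitioned router (illustrative assumptions only)."""

    def __init__(self, word_to_matcher, num_matchers):
        # word_to_matcher: hypothetical precomputed word -> matcher id map,
        # e.g. the output of some word partitioning step.
        self.word_to_matcher = dict(word_to_matcher)
        self.num_matchers = num_matchers
        # matcher id -> list of (subscription id, frozenset of words)
        self.subs = defaultdict(list)

    def owner(self, word):
        # Assumption: words missing from the map fall back to a
        # deterministic hash so every word has exactly one owner.
        if word in self.word_to_matcher:
            return self.word_to_matcher[word]
        return zlib.crc32(word.encode("utf-8")) % self.num_matchers

    def publication_scope(self, words):
        # Scoped multicast: a post goes only to the matchers that own
        # at least one of its words, not to all matchers.
        return {self.owner(w) for w in words}

    def place_subscription(self, sub_id, words):
        # Scoped anycast: any single matcher owning one of the
        # subscription's words is sufficient; pick the least loaded.
        candidates = {self.owner(w) for w in words}
        target = min(candidates, key=lambda m: (len(self.subs[m]), m))
        self.subs[target].append((sub_id, frozenset(words)))
        return target

    def publish(self, words):
        # Evaluate the post only at the matchers in its scope.
        words = set(words)
        matched = []
        for m in self.publication_scope(words):
            for sub_id, sub_words in self.subs[m]:
                if sub_words <= words:  # all subscription words present
                    matched.append(sub_id)
        return sorted(matched)


router = Router({"vldb": 0, "stream": 0, "ankara": 1, "twitter": 2}, 3)
router.place_subscription("s1", {"vldb", "stream"})
router.place_subscription("s2", {"stream", "ankara"})
router.place_subscription("s3", {"twitter"})
print(router.publish({"vldb", "stream", "ankara"}))
```

Under these assumptions no match is missed: a subscription lives on a matcher that owns one of its words, and any post containing all of the subscription's words necessarily contains that word, so the post's multicast scope includes that matcher. Shrinking the scope then amounts to assigning words that frequently co-occur to the same matcher.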

Keywords

  • Short text matching
  • Stream processing
  • Publish/subscribe

Notes

Acknowledgments

This study was funded in part by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under grants EEEAG #111E217 and #112E271.

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Fuat Basık (1)
  • Buğra Gedik (1)
  • Hakan Ferhatosmanoğlu (1)
  • Mert Emin Kalender (1)
  1. Computer Engineering Department, Bilkent University, Ankara, Turkey
