Abstract
Social media services have become major sources for monitoring emerging topics and sensing real-life events. A social media platform manages a social stream consisting of a huge volume of timestamped user-generated data, including original posts and reposts. However, previous research on keyword search over social media data mainly emphasizes the recency of information. In this paper, we first propose the problem of the top-k most significant temporal keyword query to enable more complex query analysis. Such a query returns the top-k most popular social items that contain the query keywords within a given query time window. Then, we design a temporal inverted index with a two-tier posting list to index social time series, together with a segment store to compute the exact social significance of social items. Next, we implement a basic query algorithm based on the proposed index structure and give a detailed performance analysis of this algorithm. Based on the analysis, we further refine the query algorithm with a piecewise maximum approximation (PMA) sketch. Finally, extensive empirical studies on a real-life microblog dataset demonstrate that the combination of the two-tier posting list and the PMA sketch achieves a substantial performance improvement under different query settings.
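To make the query semantics concrete, the following minimal Python sketch gives a brute-force baseline for the query together with one plausible reading of a piecewise maximum bound. It is an illustration under stated assumptions, not the index or algorithm of the paper: the abstract does not define the significance score or the PMA construction, so the per-slot repost counts, the sum-of-reposts score, the fixed-length segments, and all names (`Item`, `significance`, `pma_sketch`, `pma_upper_bound`, `top_k`) are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Item:
    item_id: str
    text: str
    # Assumed layout: reposts[t] is the number of reposts in time slot t.
    reposts: list = field(default_factory=list)


def significance(item, start, end):
    """Assumed score: total reposts in the half-open slot range [start, end)."""
    return sum(item.reposts[start:end])


def pma_sketch(series, seg_len):
    """Per-segment maxima of a series (fixed-length segments are an assumption)."""
    return [max(series[i:i + seg_len]) for i in range(0, len(series), seg_len)]


def pma_upper_bound(sketch, seg_len, start, end, n):
    """Upper bound on sum(series[start:end]) computed from segment maxima only."""
    start, end = max(start, 0), min(end, n)
    bound = 0
    for s, seg_max in enumerate(sketch):
        lo = max(start, s * seg_len)
        hi = min(end, (s + 1) * seg_len, n)
        if hi > lo:
            # Each slot in the overlap contributes at most the segment maximum.
            bound += seg_max * (hi - lo)
    return bound


def top_k(items, keywords, start, end, k):
    """Brute-force baseline: score every item that contains all keywords."""
    wanted = {w.lower() for w in keywords}
    scored = [
        (significance(it, start, end), it.item_id)
        for it in items
        if wanted <= set(it.text.lower().split())
    ]
    return heapq.nlargest(k, scored)


items = [
    Item("a", "storm hits city", [1, 5, 2, 0]),
    Item("b", "storm warning", [0, 1, 9, 3]),
    Item("c", "city news", [7, 7, 7, 7]),
]
result = top_k(items, ["storm"], 1, 3, 2)
```

Under these assumptions, a bound of this kind never underestimates the exact score, so a query could skip the exact computation for any item whose bound already falls below the current k-th best score. Whether the PMA sketch in the paper is used in exactly this way is not stated in the abstract.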
Acknowledgments
This work is partially supported by the National High-tech R&D Program (863 Program) under grant number 2015AA015307, and by the National Science Foundation of China under grant numbers 61432006 and 61672232.
About this article
Cite this article
Xia, F., Yu, C., Xu, L. et al. Top-k temporal keyword search over social media data. World Wide Web 20, 1049–1069 (2017). https://doi.org/10.1007/s11280-016-0430-0
DOI: https://doi.org/10.1007/s11280-016-0430-0