Abstract
A large number of texts are rapidly generated as streaming data in social media. Because such text streams are difficult to process in real time with limited memory, researchers resort to text stream compression and sampling to retain a small but valuable portion of the stream. In this study, we investigate how to store more valuable texts in less memory while maintaining the global information of the stream. First, we propose a lightweight text stream sampling framework based on compressed sensing theory, which reduces space consumption while retaining the most valuable texts. We then develop a query word-based retrieval task and a topic detection and evolution analysis task on the sample stream to evaluate how well the framework retains valuable information. The framework is evaluated on two representative social media datasets in terms of compression ratio, runtime, information reserved rate, and efficiency of the text analysis tasks. Experimental results demonstrate that the proposed framework outperforms baseline methods and completes the text analysis tasks with promising results.
Notes
In the following, to distinguish the sampling framework proposed in this paper from the samples used in compressed sensing (CS) theory, the latter are referred to as linear measurements, or measurements for short, which is also common wording in CS theory.
The dataset is downloaded from http://snap.stanford.edu/data.
The code is available at http://code.google.com/p/word2vec/.
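To make the term "linear measurements" from the first note concrete, the following is a minimal sketch of the general CS idea, not the sampling framework proposed in the paper. It assumes a Gaussian random measurement matrix and recovery by orthogonal matching pursuit; the function names `measure` and `omp`, the signal length, the number of measurements, and the sparsity level are all hypothetical choices made for illustration.

```python
import numpy as np


def measure(x, m, rng):
    """Take m random linear measurements y = phi @ x of a length-n signal x."""
    n = x.shape[0]
    # Gaussian measurement matrix (an assumption made for this sketch)
    phi = rng.standard_normal((m, n)) / np.sqrt(m)
    return phi, phi @ x


def omp(phi, y, k):
    """Orthogonal matching pursuit: greedily estimate a k-sparse x from y."""
    n = phi.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        correlations = np.abs(phi.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same column twice
        support.append(int(np.argmax(correlations)))
        # Least-squares fit of y on the columns selected so far.
        coef, *_ = np.linalg.lstsq(phi[:, support], y, rcond=None)
        residual = y - phi[:, support] @ coef
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat


# Illustrative sizes only: a 4-sparse signal of length 256, 128 measurements.
rng = np.random.default_rng(0)
n, m, k = 256, 128, 4
x = np.zeros(n)
nonzero = rng.choice(n, size=k, replace=False)
x[nonzero] = rng.choice([-1.0, 1.0], size=k) * rng.uniform(1.0, 2.0, size=k)

phi, y = measure(x, m, rng)
x_hat = omp(phi, y, k)
print("measurements:", y.shape[0], "max abs error:", np.max(np.abs(x_hat - x)))
```

The vector `y` holds fewer values than `x`, yet under the sparsity assumption the original signal can typically be recovered from it, which is the property a CS-based sampler relies on to save space.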
Acknowledgements
The work was partially supported by the National Science Foundation of China (NSFC, No. 61472291) and (NSFC, No. 41472288).
Cite this article
Tian, G., Huang, J., Peng, M. et al. Dynamic sampling of text streams and its application in text analysis. Knowl Inf Syst 53, 507–531 (2017). https://doi.org/10.1007/s10115-017-1039-z