Dynamic sampling of text streams and its application in text analysis

Abstract

Social media generates large volumes of text as rapidly arriving streaming data. Because such text streams are difficult to process in real time with limited memory, researchers have turned to text stream compression and sampling to retain a small portion of valuable information from the streams. In this study, we investigate the crucial question of how to store more valuable texts in less memory while maintaining the global information of the stream. First, we propose a lightweight text stream sampling framework based on compressed sensing theory, which reduces space consumption while still retaining the most valuable texts. We then develop a query word-based retrieval task and a topic detection and evolution analysis task on the sample stream to evaluate how well the framework retains valuable information. Using two representative social media datasets, we evaluate the framework on several aspects: compression ratio, runtime, information reserved rate, and efficiency of the text analysis tasks. Experimental results demonstrate that the proposed framework outperforms baseline methods and completes the text analysis tasks with promising results.
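The compressed sensing idea that the framework builds on can be illustrated with a small, generic sketch. This is not the authors' sampling framework: the names `make_sparse_signal`, `measure` and `omp`, the Gaussian measurement matrix, and the sizes `n`, `m`, `k` are all assumptions chosen for illustration. It shows a sparse vector of length `n` being reduced to `m < n` linear measurements and then recovered. Orthogonal matching pursuit is used only as a familiar recovery routine; the abstract does not say which recovery method the paper relies on.

```python
import numpy as np


def make_sparse_signal(n, k, rng):
    """Return a length-n vector with exactly k nonzero entries (hypothetical helper)."""
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    # Nonzero magnitudes are drawn from [1, 2) with random signs.
    x[support] = rng.choice([-1.0, 1.0], size=k) * rng.uniform(1.0, 2.0, size=k)
    return x


def measure(phi, x):
    """Take linear measurements y = phi @ x (m values instead of n)."""
    return phi @ x


def omp(phi, y, k):
    """Greedy recovery of a k-sparse vector via orthogonal matching pursuit."""
    n = phi.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        correlations = np.abs(phi.T @ residual)
        if support:
            correlations[support] = 0.0  # never select the same column twice
        support.append(int(np.argmax(correlations)))
        # Refit all selected coefficients by least squares, then update the residual.
        coef, *_ = np.linalg.lstsq(phi[:, support], y, rcond=None)
        residual = y - phi[:, support] @ coef
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat


# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
n, m, k = 200, 100, 4
x = make_sparse_signal(n, k, rng)
phi = rng.standard_normal((m, n)) / np.sqrt(m)  # random Gaussian measurement matrix
y = measure(phi, x)
x_hat = omp(phi, y, k)

print("compression ratio m/n:", m / n)
print("max recovery error:", np.max(np.abs(x_hat - x)))
```

In this toy setting the stored vector `y` is half the length of `x`, which is the sense in which a sparse signal can be kept in less space while remaining recoverable. How texts are mapped to such vectors and which texts are retained are questions for the paper itself.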

Notes

  1. http://www.199it.com/archives/255612.html.

  2. http://blog.sysomos.com/2011/05/02/how-fast-thenews-spreads-through-social-media/.

  3. In the following, to distinguish the sampling framework proposed in this paper from the samples used in compressed sensing (CS) theory, the latter are referred to as linear measurements, or measurements for short, which is also common terminology in CS theory.

  4. The dataset was downloaded from http://snap.stanford.edu/data.

  5. The code is available at http://code.google.com/p/word2vec/.

Acknowledgements

This work was partially supported by the National Science Foundation of China (NSFC, Nos. 61472291 and 41472288).

Author information

Corresponding author

Correspondence to Min Peng.

About this article

Cite this article

Tian, G., Huang, J., Peng, M. et al. Dynamic sampling of text streams and its application in text analysis. Knowl Inf Syst 53, 507–531 (2017). https://doi.org/10.1007/s10115-017-1039-z

Keywords

  • Text stream
  • Compressed sensing
  • Sampling
  • Text analysis