Stream-based live public opinion monitoring approach with adaptive probabilistic topic model

  • Kun Ma
  • Ziqiang Yu
  • Ke Ji
  • Bo Yang
Methodologies and Application


Public opinion monitoring, also known as first story detection, is a task within topic detection and tracking that focuses on a particular Internet news event. It is generally used to trace how news propagates. Traditional methods rely on text matching, which has two main limitations: it cannot discover hidden or latent topics, and it ranks matching results incorrectly by relevance on large-scale data. In this paper, we propose three solutions to live public opinion monitoring: simple keyword computing and matching, simple probabilistic topic computing and matching, and stream-based live probabilistic topic computing and matching. We point out the disadvantages of the first two solutions, such as poor semantic matching and low efficiency on timely big data. To improve substantially on them, we propose stream-based real-time topic computing together with topic matching that uses query-time document and field boosting. Finally, topic computing and matching experiments on crawled historical Netease news records show that our approaches are effective and efficient.
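The idea of query-time document and field boosting mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the field names (`title`, `body`), the boost values, the whitespace tokenizer, and the helper names `score` and `rank` are all assumptions made for illustration. The sketch weights matches on the query topic words per field and then multiplies by a per-document boost.

```python
from collections import Counter

# Hypothetical field boosts chosen for illustration; not values from the paper.
FIELD_BOOSTS = {"title": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for real word segmentation)."""
    return text.lower().split()


def score(query_topics, doc, field_boosts=FIELD_BOOSTS, doc_boost=1.0):
    """Sum the boosted term frequencies of the query topic words over each field."""
    total = 0.0
    for field, boost in field_boosts.items():
        counts = Counter(tokenize(doc.get(field, "")))
        total += boost * sum(counts[word] for word in query_topics)
    return doc_boost * total


def rank(query_topics, docs, field_boosts=FIELD_BOOSTS):
    """Return (doc id, score) pairs sorted by descending score."""
    scored = [
        (d["id"], score(query_topics, d, field_boosts, d.get("boost", 1.0)))
        for d in docs
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy documents (invented for this example, not data from the paper).
docs = [
    {"id": "a", "title": "flood warning issued", "body": "heavy rain expected"},
    {"id": "b", "title": "city news", "body": "flood damage reported"},
]
ranking = rank(["flood"], docs)
```

Because the boosts are applied when the query is scored rather than when documents are indexed, they can be changed per query without recomputing anything stored about the documents.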


Public opinion, Public sentiment, Topic computing, Topic matching, Probabilistic topic model, Stream computing, Stream processing, MapReduce



This work was supported by the National Natural Science Foundation of China (61772231 & 61702217 & 61702216), the Shandong Provincial Natural Science Foundation (ZR2017MF025 & ZR2014FQ029), the Shandong Provincial Key R&D Program of China (2015GGX106007 & 2016ZDJS01A12 & 2017CXGC0701 & 2018CXGC0706), and the Science and Technology Program of University of Jinan (XKY1734 & XKY1828).

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan, China
