Knowledge and Information Systems, Volume 48, Issue 2, pp 379–398

Word network topic model: a simple but general solution for short and imbalanced texts

  • Yuan Zuo
  • Jichang Zhao
  • Ke Xu
Regular Paper

Abstract

Short text has become the prevalent format for information on the Internet, especially with the development of online social media. Although the sophisticated signals carried by short texts make them a promising source for topic modeling, their extreme sparsity and imbalance pose serious challenges to conventional topic models such as LDA and its variants. Aiming at a simple but general solution for topic modeling in short texts, we present a word co-occurrence network-based model named WNTM that tackles sparsity and imbalance simultaneously. Unlike previous approaches, WNTM models the distribution over topics for each word instead of learning topics for each document, which enhances the semantic density of the data space without adding much time or space complexity. Meanwhile, the rich contextual information preserved in the word–word space also ensures its sensitivity in identifying rare topics with convincing quality. Furthermore, because it employs the same Gibbs sampling as LDA, WNTM is easy to extend to various application scenarios. Extensive validation on both short and normal texts shows that WNTM outperforms the baseline methods. We also demonstrate its potential for precisely discovering newly emerging topics or unexpected events in Weibo at very early stages.
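The idea of modeling topics per word rather than per document can be illustrated with a small sketch. The abstract does not spell out how the network is built, so everything below is an assumption made for illustration: the sliding-window co-occurrence rule, the window size, and the hypothetical helper names `build_cooccurrence_network` and `pseudo_documents`. Each word receives a pseudo-document made of its network neighbors, and those pseudo-documents could then be passed to any standard LDA Gibbs sampler.

```python
from collections import Counter, defaultdict


def build_cooccurrence_network(docs, window=2):
    """Count how often two distinct words fall inside the same sliding window.

    Returns {word: Counter({neighbor: weight})}, a symmetric weighted graph.
    The windowing rule is an illustrative assumption, not taken from the paper.
    """
    graph = defaultdict(Counter)
    for tokens in docs:
        for i, word in enumerate(tokens):
            # Pair the word with the next (window - 1) tokens only, so each
            # pair of positions is counted exactly once.
            for other in tokens[i + 1 : i + window]:
                if other != word:
                    graph[word][other] += 1
                    graph[other][word] += 1
    return graph


def pseudo_documents(graph):
    """Build one pseudo-document per word: its neighbors repeated by edge weight."""
    return {
        word: sorted(neighbors.elements())
        for word, neighbors in graph.items()
    }


# Toy corpus of three short, already tokenized texts.
docs = [
    ["apple", "iphone", "launch"],
    ["apple", "iphone", "price"],
    ["rain", "flood", "warning"],
]
graph = build_cooccurrence_network(docs, window=3)
pseudo = pseudo_documents(graph)
```

In this toy corpus every original text has only three tokens, while the pseudo-document for "apple" collects four neighbor tokens across two texts, which hints at how the word–word space can be denser than the document space.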

Keywords

Word co-occurrence network · Topic modeling · Short texts · Imbalanced texts

Notes

Acknowledgments

This work was supported by NSFC (Grant Nos. 71501005 and 61421003) and the fund of the State Key Lab of Software Development Environment (Grant No. SKLSDE-2015ZX-05).

Copyright information

© Springer-Verlag London 2015

Authors and Affiliations

  1. State Key Lab of Software Development Environment, Beihang University, Beijing, China
  2. School of Economics and Management, Beihang University, Beijing, China
