Word network topic model: a simple but general solution for short and imbalanced texts

Abstract

Short text has become the prevalent format for information on the Internet, especially with the development of online social media. Although the rich signals carried by short texts make them a promising source for topic modeling, their extreme sparsity and imbalance pose serious challenges to conventional topic models such as LDA and its variants. Aiming at a simple but general solution for topic modeling in short texts, we present a word co-occurrence network-based model named WNTM that tackles sparsity and imbalance simultaneously. Unlike previous approaches, WNTM models the distribution over topics for each word instead of learning topics for each document, which enhances the semantic density of the data space without introducing much additional time or space complexity. Meanwhile, the rich contextual information preserved in the word–word space keeps the model sensitive to rare topics, which it identifies with convincing quality. Furthermore, because it employs the same Gibbs sampling as LDA, WNTM is easy to extend to various application scenarios. Extensive validation on both short and normal texts shows that WNTM outperforms the baseline methods. We also demonstrate its potential for precisely discovering newly emerging topics or unexpected events in Weibo at very early stages.
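As a rough illustration of the word-centred view described in the abstract, the following minimal Python sketch builds a word co-occurrence network from tokenized short texts and then collects, for each word, the list of its neighbours. Such per-word lists could be passed to any standard LDA Gibbs sampler in place of the original documents. The sliding-window size, the function names `build_word_network` and `pseudo_documents`, and the weighting of neighbours by co-occurrence count are assumptions made for this example; the abstract does not specify them.

```python
from collections import Counter, defaultdict


def build_word_network(docs, window=3):
    """Count co-occurrences of distinct words inside a sliding window.

    `docs` is a list of token lists. `window` (an assumed parameter) is the
    number of consecutive tokens treated as one context.
    """
    edges = Counter()
    for tokens in docs:
        for i in range(len(tokens)):
            # Pair token i with the following tokens in the same window.
            for j in range(i + 1, min(i + window, len(tokens))):
                a, b = tokens[i], tokens[j]
                if a != b:
                    edges[frozenset((a, b))] += 1
    return edges


def pseudo_documents(edges):
    """For each word, list its neighbours, repeated by edge weight."""
    pseudo = defaultdict(list)
    for pair, weight in edges.items():
        a, b = tuple(pair)
        pseudo[a].extend([b] * weight)
        pseudo[b].extend([a] * weight)
    # Sort only to make the output deterministic for inspection.
    return {word: sorted(neighbours) for word, neighbours in pseudo.items()}


if __name__ == "__main__":
    docs = [["apple", "iphone", "launch"], ["apple", "fruit"]]
    network = build_word_network(docs, window=3)
    for word, neighbours in sorted(pseudo_documents(network).items()):
        print(word, neighbours)
```

In this sketch a word that appears in only a few short documents still accumulates context from every document it occurs in, which is one way to read the abstract's claim about denser data and sensitivity to rare topics.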

Notes

  1. http://jgibblda.sourceforge.net/.
  2. http://code.google.com/p/plda/.
  3. http://code.google.com/p/btm/.
  4. Publicly available at http://ipv6.nlsde.buaa.edu.cn/zhaojichang/paper/wntm.rar.
  5. http://ictclas.nlpir.org/downloads.
  6. http://www.sogou.com/labs/dl/ca.html.
  7. http://www.csie.ntu.edu.tw/~cjlin/liblinear/.

Acknowledgments

This work was supported by NSFC (Grant Nos. 71501005 and 61421003) and the fund of the State Key Lab of Software Development Environment (Grant No. SKLSDE-2015ZX-05).

Author information

Corresponding author

Correspondence to Ke Xu.

About this article

Cite this article

Zuo, Y., Zhao, J. & Xu, K. Word network topic model: a simple but general solution for short and imbalanced texts. Knowl Inf Syst 48, 379–398 (2016). https://doi.org/10.1007/s10115-015-0882-z

Keywords

  • Word co-occurrence network
  • Topic modeling
  • Short texts
  • Imbalanced texts