Topic Modeling over Short Texts by Incorporating Word Embeddings

Qiang, Jipeng; Chen, Ping; Wang, Tong; Wu, Xindong

doi:10.1007/978-3-319-57529-2_29

Jipeng Qiang^19,20,21,
Ping Chen²¹,
Tong Wang²¹ &
…
Xindong Wu^20,22

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 10235))

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

3876 Accesses
42 Citations

Abstract

Inferring topics from the overwhelming amount of short texts becomes a critical but challenging task for many content analysis tasks. Existing methods such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) cannot solve this problem very well since only very limited word co-occurrence information is available in short texts. This paper studies how to incorporate the external word correlation knowledge into short texts to improve the coherence of topic modeling. Based on recent results in word embeddings that learn semantically representations for words from a large corpus, we introduce a novel method, Embedding-based Topic Model (ETM), to learn latent topics from short texts. ETM not only solves the problem of very limited word co-occurrence information by aggregating short texts into long pseudo-texts, but also utilizes a Markov Random Field regularized model that gives correlated words a better chance to be put into the same topic. The experiments on real-world datasets validate the effectiveness of our model comparing with the state-of-the-art models.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
http://news.google.com.

References

Aggarwal, C.C., Zhai, C.: A survey of text clustering algorithms. In: Mining Text Data, pp. 77–128 (2012)
Google Scholar
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. JMLR 3, 993–1022 (2003)
MATH Google Scholar
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J.L., Blei, D.M.: Reading tea leaves: how humans interpret topic models. In: NIPS, pp. 288–296 (2009)
Google Scholar
Cheng, X., Yan, X., Lan, Y., Guo, J.: BTM: topic modeling over short texts. TKDE 26(12), 2928–2941 (2014)
Google Scholar
Griffiths, T., Steyvers, M.: Finding scientific topics. Proc. Nat. Acad. Sci. 101, 5228–5235 (2004)
Article Google Scholar
Hofmann, T.: Probabilistic latent semantic indexing. In: SIGIR, pp. 50–57 (1999)
Google Scholar
Kusner, M.J., Sun, Y., Kolkin, N.I., Weinberger, K.Q.: From word embeddings to document distances. In: ICML, pp. 957–966 (2015)
Google Scholar
Mehrotra, R., Sanner, S., Buntine, W., Xie, L.: Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: SIGIR, pp. 889–892 (2013)
Google Scholar
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)
Google Scholar
Newman, D., Bonilla, E.V., Buntine, W.: Improving topic coherence with regularized topic models. In: NIPS, pp. 496–504 (2011)
Google Scholar
Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic coherence. In: NAACL, pp. 100–108 (2010)
Google Scholar
Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Mach. Learn. 39(2–3), 103–134 (2000)
Article MATH Google Scholar
Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
Google Scholar
Qiang, J., Chen, P., Ding, W., Wang, T., Fei, X., Wu, X.: Topic discovery from heterogeneous texts. In: ICTAI (2016)
Google Scholar
Quan, X., Kit, C., Ge, Y., Pan, S.J.: Short and sparse text topic modeling via self-aggregation. In: ICAI, pp. 2270–2276 (2015)
Google Scholar
Wang, X., Wang, Y., Zuo, W., Cai, G.: Exploring social context for topic identification in short and noisy texts. In: AAAI (2015)
Google Scholar
Weng, J., Lim, E.-P., Jiang, J., He, Q.: Twitterrank: finding topic-sensitive influential twitterers. In: WSDM, pp. 261–270 (2010)
Google Scholar
Xie, P., Yang, D., Xing, E.P.: Incorporating word correlation knowledge into topic modeling. In: NACACL (2015)
Google Scholar
Yan, X., Guo, J., Lan, Y., Xu, J., Cheng, X.: A probabilistic model for bursty topic discovery in microblogs. In: AAAI, pp. 353–359 (2015)
Google Scholar
Yin, J., Wang, J.: A dirichlet multinomial mixture model-based approach for short text clustering. In: SIGKDD, pp. 233–242 (2014)
Google Scholar
Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., Li, X.: Comparing Twitter and traditional media using topic models. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 338–349. Springer, Heidelberg (2011). doi:10.1007/978-3-642-20161-5_34
Chapter Google Scholar

Download references

Acknowledgement

This research is partially supported by the National Key Research and Development Program of China (2016YFB1000900), and the National Natural Science Foundation of China (No. 61503116).

Author information

Authors and Affiliations

Yangzhou University, Yangzhou, 225009, China
Jipeng Qiang
Hefei University of Technology, Hefei, 230009, China
Jipeng Qiang & Xindong Wu
University of Massachusetts Boston, Boston, MA, 02155, USA
Jipeng Qiang, Ping Chen & Tong Wang
University of Louisiana at Lafayette, Lafayette, LA, 70504, USA
Xindong Wu

Authors

Jipeng Qiang
View author publications
You can also search for this author in PubMed Google Scholar
Ping Chen
View author publications
You can also search for this author in PubMed Google Scholar
Tong Wang
View author publications
You can also search for this author in PubMed Google Scholar
Xindong Wu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jipeng Qiang .

Editor information

Editors and Affiliations

Kangwon National University, Chuncheon, Korea (Republic of)
Jinho Kim
Seoul National University, Seoul, Korea (Republic of)
Kyuseok Shim
University of Technology Sydney, Sydney, New South Wales, Australia
Longbing Cao
KAIST, Daejeon, Korea (Republic of)
Jae-Gil Lee
University of New South Wales, Sydney, New South Wales, Australia
Xuemin Lin
Kangwon National University, Chuncheon, Korea (Republic of)
Yang-Sae Moon

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Qiang, J., Chen, P., Wang, T., Wu, X. (2017). Topic Modeling over Short Texts by Incorporating Word Embeddings. In: Kim, J., Shim, K., Cao, L., Lee, JG., Lin, X., Moon, YS. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2017. Lecture Notes in Computer Science(), vol 10235. Springer, Cham. https://doi.org/10.1007/978-3-319-57529-2_29

Download citation

DOI: https://doi.org/10.1007/978-3-319-57529-2_29
Published: 23 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57528-5
Online ISBN: 978-3-319-57529-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics