Release ‘Bag-of-Words’ Assumption of Latent Dirichlet Allocation

Xuan, Junyu; Lu, Jie; Zhang, Guangquan; Luo, Xiangfeng

doi:10.1007/978-3-642-54924-3_8

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 277)

Abstract

Topic models such as latent Dirichlet allocation (LDA) are built on a vector-based representation of documents under the ‘bag-of-words’ assumption. They discover the distribution of underlying topics in a document and the distribution of keywords in a topic, and they have proved successful and practical in many scenarios. Compared with vector-based representation, graph-based representation preserves more of a document's semantics, because it considers not only the keywords but also the relations between them. In this paper, a topic model for graph-represented documents (GTM) is proposed. In this model, a Bernoulli distribution is used to model the formation of an edge between two keywords in a document. The experimental results show that GTM outperforms LDA in a document classification task when the topics discovered by each model are used to represent the documents.
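The contrast between the two representations, and the idea of treating each keyword pair as a Bernoulli trial, can be illustrated with a minimal sketch. This is not the GTM model itself. The co-occurrence window used to form edges, the function names, and the `edge_prob` callable are all assumptions made for illustration, since the abstract does not state how edges are extracted or how the Bernoulli parameter depends on topics.

```python
from collections import Counter
from itertools import combinations
import math


def bag_of_words(tokens):
    """Vector-style representation: keyword counts only, order discarded."""
    return Counter(tokens)


def keyword_graph(tokens, window=2):
    """Graph-style representation: keywords are nodes, and an undirected
    edge links two distinct keywords that occur within `window` tokens.
    The co-occurrence window is an assumption made for this sketch."""
    edges = set()
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        for a, b in combinations(span, 2):
            if a != b:
                edges.add(frozenset((a, b)))
    return edges


def edge_log_likelihood(edges, vocabulary, edge_prob):
    """Bernoulli log-likelihood of the observed edge set: every unordered
    keyword pair is an independent trial that succeeds (edge present)
    with probability edge_prob(a, b). `edge_prob` is a hypothetical
    callable; the abstract does not say how it depends on topics."""
    total = 0.0
    for a, b in combinations(sorted(vocabulary), 2):
        p = edge_prob(a, b)
        present = frozenset((a, b)) in edges
        total += math.log(p if present else 1.0 - p)
    return total


# Two toy documents with identical keyword counts but different word order.
doc_a = ["topic", "model", "graph"]
doc_b = ["topic", "graph", "model"]

same_bow = bag_of_words(doc_a) == bag_of_words(doc_b)
same_graph = keyword_graph(doc_a) == keyword_graph(doc_b)
```

The two toy documents are indistinguishable as bags of words, yet their keyword graphs differ, which is the extra information a graph-based topic model can draw on. In a full model the edge probability would presumably be tied to the latent topics of the two keywords; here it is left as an arbitrary function so that no particular parameterization is implied.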

Acknowledgments

The work presented in this paper was partially supported by the Australian Research Council (ARC) under Discovery Grant DP110103733 and by the China Scholarship Council.

Author information

Authors and Affiliations

Shanghai University, Shanghai, China
Junyu Xuan & Xiangfeng Luo
Centre for Quantum Computation and Intelligent Systems (QCIS), Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia
Junyu Xuan
Centre for Quantum Computation and Intelligent Systems (QCIS), Faculty of Engineering and Information Technology, University of Technology, Broadway 2007, PO Box 123, Sydney, NSW, Australia
Jie Lu & Guangquan Zhang

Corresponding author

Correspondence to Junyu Xuan.

Editor information

Editors and Affiliations

College of Computer and Software Engineering, Shenzhen University, Shenzhen, China
Zhenkun Wen
School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China
Tianrui Li

About this paper

Cite this paper

Xuan, J., Lu, J., Zhang, G., Luo, X. (2014). Release ‘Bag-of-Words’ Assumption of Latent Dirichlet Allocation. In: Wen, Z., Li, T. (eds) Foundations of Intelligent Systems. Advances in Intelligent Systems and Computing, vol 277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54924-3_8

DOI: https://doi.org/10.1007/978-3-642-54924-3_8
Published: 20 June 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54923-6
Online ISBN: 978-3-642-54924-3
eBook Packages: Engineering, Engineering (R0)
