Release ‘Bag-of-Words’ Assumption of Latent Dirichlet Allocation

  • Conference paper
Foundations of Intelligent Systems

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 277))

Abstract

Topic models such as latent Dirichlet allocation (LDA) are built on a vector-based representation of documents under the ‘bag-of-words’ assumption. They discover the distribution of underlying topics in a document and the distribution of keywords in a topic, and have recently proved successful and practical in many scenarios. Compared with vector-based representation, graph-based representation preserves more of a document's semantics, because it considers not only the keywords but also the relations between them. In this paper, a topic model for graph-represented documents (GTM) is proposed. In this model, a Bernoulli distribution is used to model the formation of an edge between two keywords in a document. Experimental results show that GTM outperforms LDA on a document classification task in which the topics discovered by each model are used to represent the documents.
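The contrast between the two representations, and the role of a Bernoulli edge model, can be illustrated with a minimal sketch. This is not the GTM inference procedure, which the abstract does not specify. The sliding-window rule for creating edges, the function names, and the `edge_prob` callback are all assumptions introduced here for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def bag_of_words(tokens):
    """Vector-style representation: keyword counts only, word order discarded."""
    return Counter(tokens)


def keyword_graph(tokens, window=1):
    """Graph-style representation: nodes are keywords, and an undirected edge
    links two distinct keywords that occur within `window` positions of each
    other. The sliding-window rule is an assumption, not the paper's method."""
    nodes = set(tokens)
    edges = set()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                edges.add(frozenset((left, right)))
    return nodes, edges


def bernoulli_edge_log_likelihood(nodes, edges, edge_prob):
    """Log-likelihood of the observed edge set when each keyword pair is an
    independent Bernoulli trial. `edge_prob(u, v)` is a hypothetical callback
    returning the probability that an edge forms between u and v."""
    total = 0.0
    for u, v in combinations(sorted(nodes), 2):
        p = edge_prob(u, v)
        present = frozenset((u, v)) in edges
        total += math.log(p if present else 1.0 - p)
    return total


# Two toy documents with the same keywords in a different order.
doc_a = ["topic", "model", "graph"]
doc_b = ["model", "topic", "graph"]

same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
nodes_a, edges_a = keyword_graph(doc_a)
nodes_b, edges_b = keyword_graph(doc_b)
same_graph = edges_a == edges_b

# With a constant edge probability of 0.5, every pair contributes log(0.5).
loglik_a = bernoulli_edge_log_likelihood(nodes_a, edges_a, lambda u, v: 0.5)
```

Here the two documents are indistinguishable as bags of words but yield different edge sets, which is the extra information a graph-based model can exploit. In a topic model, the edge probability would presumably depend on the topics of the two keywords rather than being constant.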

Notes

  1. http://jgibblda.sourceforge.net/#Griffiths04

  2. http://www.daviddlewis.com/resources/testcollections/reuters21578/

  3. http://www.cs.waikato.ac.nz/ml/weka

Acknowledgments

The work presented in this paper was partially supported by the Australian Research Council (ARC) under Discovery Grant DP110103733 and by the China Scholarship Council.

Author information

Corresponding author

Correspondence to Junyu Xuan.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Xuan, J., Lu, J., Zhang, G., Luo, X. (2014). Release ‘Bag-of-Words’ Assumption of Latent Dirichlet Allocation. In: Wen, Z., Li, T. (eds) Foundations of Intelligent Systems. Advances in Intelligent Systems and Computing, vol 277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54924-3_8

  • DOI: https://doi.org/10.1007/978-3-642-54924-3_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54923-6

  • Online ISBN: 978-3-642-54924-3

  • eBook Packages: Engineering, Engineering (R0)
