Topic Significance Ranking of LDA Generative Models

  • Loulwah AlSumait
  • Daniel Barbará
  • James Gentle
  • Carlotta Domeniconi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5781)

Abstract

Topic models, such as Latent Dirichlet Allocation (LDA), have recently been used to automatically generate topics for text corpora and to subdivide the corpus words among those topics. However, not all of the estimated topics are equally important or correspond to genuine themes of the domain: some can be collections of irrelevant words, or represent insignificant themes. Current approaches to topic modeling rely on manual examination to find meaningful topics. This paper presents the first automated, unsupervised analysis of LDA models that distinguishes junk topics from legitimate ones and ranks topics by significance. The distance between each topic distribution and three definitions of a “junk distribution” is computed under a variety of measures, and the resulting distances are combined into an expressive figure of topic significance using a 4-phase Weighted Combination approach. Our experiments on synthetic and benchmark datasets show the effectiveness of the proposed approach in ranking topic significance.
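As a concrete illustration of the idea in the abstract, the sketch below scores each topic of a fitted LDA model by its Kullback-Leibler divergence from a uniform “junk” word distribution, arguably the simplest junk definition consistent with the description above. This is a minimal sketch, not the paper's method: the function names (`kl_divergence`, `uniformity_score`), the choice of a uniform junk distribution with KL divergence as the single measure, and the synthetic `phi` matrix are all illustrative assumptions, whereas the paper combines three junk definitions and several distance measures in a 4-phase Weighted Combination.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def uniformity_score(topic_word_dist):
    """Distance of a topic's word distribution from the uniform 'junk'
    distribution; a larger value suggests a more significant topic."""
    vocab_size = len(topic_word_dist)
    junk = np.full(vocab_size, 1.0 / vocab_size)
    return kl_divergence(topic_word_dist, junk)

# Stand-in for an estimated model: rows of phi are P(word | topic).
phi = np.random.dirichlet(alpha=[0.1] * 1000, size=20)

scores = [uniformity_score(topic) for topic in phi]
ranking = np.argsort(scores)[::-1]  # most significant topics first
```

In the full approach described in the abstract, per-measure scores like this one would be normalized and combined across the different distance measures and junk definitions, rather than used in isolation.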

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Loulwah AlSumait (1)
  • Daniel Barbará (1)
  • James Gentle (2)
  • Carlotta Domeniconi (1)

  1. Department of Computer Science, George Mason University, Fairfax, USA
  2. Department of Computational and Data Sciences, George Mason University, Fairfax, USA
