Incremental Aspect Models for Mining Document Streams

  • Arun C. Surendran
  • Suvrit Sra
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4213)

Abstract

In this paper we introduce a novel approach for incrementally building aspect models and use it to dynamically discover underlying themes from document streams. Using the new approach we present an application we call “query-line tracking”, i.e., we automatically discover and summarize the different themes or stories that appear over time and relate to a particular query. We present an evaluation on news corpora to demonstrate the strength of our method for query-line tracking, online indexing, and clustering.
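The paper's incremental algorithm is not reproduced on this page. As a rough illustration of the general idea of maintaining an aspect model over a document stream, the sketch below shows a PLSA-style model that folds in each arriving document and then nudges the shared topic-word distributions with a small learning-rate blend. The class name, update rule, and learning rate are illustrative assumptions, not the authors' method.

# Hypothetical sketch of an online aspect model over a document stream.
# Not the authors' algorithm; the fold-in EM and blended update are assumptions.
import numpy as np

class StreamingAspectModel:
    def __init__(self, vocab_size, n_aspects, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        # P(w|z): rows are aspects, columns are vocabulary terms.
        self.p_w_given_z = rng.random((n_aspects, vocab_size))
        self.p_w_given_z /= self.p_w_given_z.sum(axis=1, keepdims=True)
        self.lr = lr

    def fold_in(self, counts, n_iter=30):
        """Estimate P(z|d) for one document (a term-count vector) with EM,
        keeping P(w|z) fixed."""
        n_aspects = self.p_w_given_z.shape[0]
        p_z = np.full(n_aspects, 1.0 / n_aspects)
        for _ in range(n_iter):
            # E-step: responsibilities P(z|d,w), normalized over aspects per term.
            joint = p_z[:, None] * self.p_w_given_z              # shape (Z, V)
            resp = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
            # M-step: re-estimate this document's aspect mixture.
            p_z = (resp * counts[None, :]).sum(axis=1)
            p_z /= p_z.sum() + 1e-12
        return p_z, resp

    def update(self, counts):
        """Incremental step: fold in the new document, then move P(w|z) a small
        step toward the document's expected term assignments."""
        p_z, resp = self.fold_in(counts)
        expected = resp * counts[None, :]                        # expected counts per aspect
        expected /= expected.sum(axis=1, keepdims=True) + 1e-12
        self.p_w_given_z = (1 - self.lr) * self.p_w_given_z + self.lr * expected
        self.p_w_given_z /= self.p_w_given_z.sum(axis=1, keepdims=True)
        return p_z

In use, each incoming document (e.g., a news article matching the query) would be converted to a term-count vector and passed to update(); tracking how the returned aspect mixtures shift over time gives a crude analogue of following themes in a query-related stream.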

Keywords

Latent Dirichlet Allocation · Normalized Mutual Information · Latent Variable Model · Rand Index · Aspect Model

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Arun C. Surendran (1)
  • Suvrit Sra (2)
  1. Microsoft Research, Redmond, USA
  2. Dept. of Computer Sciences, The University of Texas at Austin, USA