Sampled Weighted Min-Hashing for Large-Scale Topic Mining

  • Gibran Fuentes-Pineda
  • Ivan Vladimir Meza-Ruíz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9116)


We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SWMH topics are ordered subsets of such vocabulary. Interestingly, the topics mined by SWMH underlie themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics both qualitatively and quantitatively on the NIPS (1.7 K documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora. Additionally, we compare the quality of SWMH with Online LDA topics for document representation in classification.
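The partition step described above can be illustrated with a minimal sketch. The snippet below is a simplified, *unweighted* MinHash illustration of the core idea, grouping vocabulary terms whose document occurrence sets collide under random min-hash functions; the actual SWMH uses weighted min-hashing over co-occurrence statistics and a subsequent agglomeration of overlapping cells, which this sketch omits. All names and the toy corpus are hypothetical.

```python
import random
from collections import defaultdict

def minhash_partition(term_docs, num_hashes=2, seed=0):
    """Partition vocabulary terms by MinHash fingerprints of their
    document sets (simplified, unweighted illustration of SWMH's
    partition step)."""
    rng = random.Random(seed)
    # One random "permutation" of document ids per hash function,
    # simulated by assigning each document a random priority.
    all_docs = sorted({d for docs in term_docs.values() for d in docs})
    perms = [{d: rng.random() for d in all_docs} for _ in range(num_hashes)]
    cells = defaultdict(set)
    for term, docs in term_docs.items():
        # Fingerprint: the minimum-priority document under each permutation.
        fp = tuple(min(docs, key=perm.__getitem__) for perm in perms)
        cells[fp].add(term)
    return list(cells.values())

# Toy corpus: term -> set of documents in which it occurs.
term_docs = {
    "neural":  {0, 1, 2},
    "network": {0, 1, 2},
    "market":  {3, 4},
    "stock":   {3, 4},
}
partition = minhash_partition(term_docs)
# Terms with identical document sets always share a fingerprint,
# so they land in the same cell of the partition.
```

In SWMH, many such random partitions are drawn, and cells that overlap heavily across partitions are merged to form the final topics, which is why each topic is an ordered subset of the vocabulary rather than a distribution over it.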


Keywords: Large-scale topic mining · Min-Hashing · Co-occurring terms


References

  1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
  2. Broder, A.Z.: On the resemblance and containment of documents. Comput. 33(11), 46–53 (2000)
  3. Buckley, C.: The importance of proper weighting methods. In: Proceedings of the Workshop on Human Language Technology, pp. 349–352 (1993)
  4. Chum, O., Matas, J.: Large-scale discovery of spatially related images. IEEE Trans. Pattern Anal. Mach. Intell. 32, 371–377 (2010)
  5. Chum, O., Philbin, J., Zisserman, A.: Near duplicate image detection: min-hash and tf-idf weighting. In: Proceedings of the British Machine Vision Conference (2008)
  6. Fuentes Pineda, G., Koga, H., Watanabe, T.: Scalable object discovery: a hash-based approach to clustering co-occurring visual words. IEICE Trans. Inf. Syst. E94–D(10), 2024–2035 (2011)
  7. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)
  8. Hoffman, M.D., Blei, D.M., Bach, F.: Online learning for latent Dirichlet allocation. In: Advances in Neural Information Processing Systems 23 (2010)
  9. Larochelle, H., Lauly, S.: A neural autoregressive topic model. In: Advances in Neural Information Processing Systems 25, pp. 2717–2725 (2012)
  10. Mimno, D., Hoffman, M.D., Blei, D.M.: Sparse stochastic inference for latent Dirichlet allocation. In: International Conference on Machine Learning (2012)
  11. Mimno, D., Wallach, H.M., Talley, E., Leenders, M., McCallum, A.: Optimizing semantic coherence in topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262–272. ACL (2011)
  12. Salakhutdinov, R., Srivastava, N., Hinton, G.: Modeling documents with a deep Boltzmann machine. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence (2013)
  13. Salakhutdinov, R., Hinton, G.E.: Replicated softmax: an undirected topic model. In: Advances in Neural Information Processing Systems 22, pp. 1607–1614 (2009)
  14. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 512–523 (1988)
  15. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101, 1566–1581 (2006)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Gibran Fuentes-Pineda (1)
  • Ivan Vladimir Meza-Ruíz (1)

  1. Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Mexico City, Mexico
