Sampled Weighted Min-Hashing for Large-Scale Topic Mining
Abstract
We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SWMH topics are ordered subsets of such vocabulary. Interestingly, the topics mined by SWMH underlie themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics both qualitatively and quantitatively on the NIPS (1.7 K documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora. Additionally, we compare the quality of SWMH with Online LDA topics for document representation in classification.
Keywords
Large-scale topic mining · Min-Hashing · Co-occurring terms

1 Introduction
The automatic extraction of topics has become very important in recent years, since topics provide a meaningful way to organize, browse and represent large-scale collections of documents. Among the most successful approaches to topic discovery are directed topic models such as Latent Dirichlet Allocation (LDA) [1] and Hierarchical Dirichlet Processes (HDP) [15], which are directed graphical models with latent topic variables. More recently, undirected graphical models have also been applied to topic modeling (e.g., Boltzmann Machines [12, 13] and Neural Autoregressive Distribution Estimators [9]). The topics generated by both directed and undirected models have been shown to underlie the thematic structure of a text corpus. These topics are defined as distributions over the terms of a vocabulary, and documents in turn as distributions over topics. Traditionally, inference in topic models has not scaled well to large corpora; however, more efficient strategies have been proposed to overcome this problem (e.g., Online LDA [8] and stochastic variational inference [10]). Undirected topic models can also be trained efficiently using approximate strategies such as Contrastive Divergence [7].
In this work, we explore the mining of topics based on term co-occurrence. The underlying intuition is that terms consistently co-occurring in the same documents are likely to belong to the same topic. The resulting topics correspond to ordered subsets of the vocabulary rather than distributions over such a vocabulary. Since finding co-occurring terms is a combinatorial problem that lies in a large search space, we propose Sampled Weighted Min-Hashing (SWMH), an extended version of Sampled Min-Hashing (SMH) [6]. SMH partitions the vocabulary into sets of highly co-occurring terms by applying Min-Hashing [2] to the inverted file entries of the corpus. The basic idea of Min-Hashing is to generate random partitions of the space so that sets with high Jaccard similarity are more likely to lie in the same partition cell.
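To make the partitioning step concrete, the following Python sketch applies Min-Hashing to a toy inverted file, mapping each term's set of document ids to a tuple of min-hash values that identifies its partition cell; terms whose document sets have high Jaccard similarity are likely to receive the same tuple. This is only an illustrative sketch (the number of min-hash values per tuple and the hashing scheme are arbitrary choices here), not the implementation used in the paper.

```python
import random
from collections import defaultdict

def minhash_partition(inverted_file, num_hashes=3, seed=0):
    """Group vocabulary terms into the cells of one random partition.

    inverted_file: dict mapping term -> set of document ids where it occurs.
    Terms whose document sets have high Jaccard similarity are likely to
    receive the same tuple of min-hash values and thus fall in the same cell.
    """
    rng = random.Random(seed)
    # One random "permutation" of document ids per hash function,
    # simulated by hashing each id together with a random salt.
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]

    cells = defaultdict(list)
    for term, docs in inverted_file.items():
        signature = tuple(min(hash((salt, d)) for d in docs) for salt in salts)
        cells[signature].append(term)
    return list(cells.values())

# Toy usage: terms occurring in the same documents tend to share a cell.
inv = {
    "neuron": {1, 2, 3}, "spike": {1, 2, 3}, "firing": {1, 2, 4},
    "chip": {5, 6}, "vlsi": {5, 6},
}
print(minhash_partition(inv))
```

Repeating this procedure with fresh random hash functions yields the multiple partitions from which highly co-occurring term sets are collected.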
SWMH topic examples.

| Corpus | Example topics |
|---|---|
| NIPS | introduction, references, shown, figure, abstract, shows, back, left, process, \(\ldots \) (51) |
| | chip, fabricated, cmos, vlsi, chips, voltage, capacitor, digital, inherent, \(\ldots \) (42) |
| | spiking, spikes, spike, firing, cell, neuron, reproduces, episodes, cellular, \(\ldots \) (17) |
| 20 Newsgroups | algorithm, communications, clipper, encryption, chip, key |
| | lakers, athletics, alphabetical, pdp, rams, pct, mariners, clippers, \(\ldots \) (37) |
| | embryo, embryos, infertility, ivfet, safetybelt, gonorrhea, dhhs, \(\ldots \) (37) |
| Reuters | prior, quarterly, record, pay, amount, latest, oct |
| | precious, platinum, ounce, silver, metals, gold |
| | udinese, reggiana, piacenza, verona, cagliari, atalanta, perugia, \(\ldots \) (64) |
| Wikipedia | median, householder, capita, couples, racial, makeup, residing, \(\ldots \) (54) |
| | decepticons’, galvatron’s, autobots’, botcon, starscream’s, rodimus, galvatron |
| | avg, strikeouts, pitchers, rbi, batters, pos, starters, pitched, hr, batting, \(\ldots \) (21) |
The remainder of the paper is organized as follows. Section 2 reviews the Min-Hashing scheme for pairwise set similarity search. The proposed approach for topic mining by SWMH is described in Sect. 3. Section 4 reports the experimental evaluation of SWMH as well as a comparison against Online LDA. Finally, Sect. 5 concludes the paper with some discussion and future work.
2 Min-Hashing for Pairwise Similarity Search
3 Sampled Min-Hashing for Topic Mining
Partitioning of the vocabulary by Min-Hashing.
The clustering stage merges chains of co-occurring term sets with high overlap coefficient into the same topic. As a result, co-occurring term sets associated with the same topic can belong to the same cluster even if they do not share terms with one another, as long as they are members of the same chain. In general, the generated clusters have the property that for any co-occurring term set, there exists at least one co-occurring term set in the same cluster with which it has an overlap coefficient greater than a given threshold \(\epsilon \).
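The following sketch illustrates this chaining behaviour, assuming the co-occurring term sets are plain Python sets and using a union-find structure to merge every pair whose overlap coefficient exceeds \(\epsilon \); the exhaustive pairwise comparison is only for clarity, whereas in practice candidate pairs would be restricted, e.g., to term sets that collide under Min-Hashing.

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|) for two term sets."""
    return len(a & b) / min(len(a), len(b))

def chain_cluster(term_sets, epsilon=0.7):
    """Merge term sets whose overlap coefficient exceeds epsilon;
    chains of such merges end up in the same cluster."""
    parent = list(range(len(term_sets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(term_sets)):
        for j in range(i + 1, len(term_sets)):
            if overlap_coefficient(term_sets[i], term_sets[j]) > epsilon:
                union(i, j)

    clusters = {}
    for i in range(len(term_sets)):
        clusters.setdefault(find(i), set()).update(term_sets[i])
    return list(clusters.values())

# The first and third sets overlap weakly with each other but strongly
# with the second, so all three are chained into a single cluster.
sets = [{"spike", "firing", "neuron"}, {"firing", "neuron", "cell"},
        {"neuron", "cell", "membrane"}]
print(chain_cluster(sets, epsilon=0.6))
```

In the toy example the first and third term sets share only one term, yet they end up in the same cluster because each overlaps strongly with the second set, which is exactly the chaining property described above.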
4 Experimental Results
In this section, we evaluate different aspects of the mined topics. First, we present a comparison between the topics mined by SWMH and SMH. Second, we evaluate the scalability of the proposed approach. Third, we use the mined topics to perform document classification. Finally, we compare SWMH topics with Online LDA topics.
The corpora used in our experiments were: NIPS, 20 Newsgroups, Reuters and Wikipedia1. NIPS is a small collection of articles (3,649 documents), 20 Newsgroups is a larger collection of mail newsgroups (34,891 documents), Reuters is a medium-sized collection of news (137,589 documents) and Wikipedia is a large-scale collection of encyclopedia articles (1,265,756 documents)2.
All the experiments presented in this work were performed on an Intel(R) Xeon(R) 2.66 GHz workstation with 8 GB of memory and 8 processors. However, we would like to point out that the current version of the code is not parallelized, so we did not take advantage of the multiple processors.
4.1 Comparison Between SMH and SWMH
Number of topics mined by SMH and SWMH in the (a) NIPS and (b) Reuters corpora.
4.2 Scalability Evaluation
To test the scalability of SWMH, we measured the time and memory required to mine topics in the Reuters corpus while increasing the number of documents to be analyzed. In particular, we performed 10 experiments with SWMH, each time increasing the number of documents by 10 %3. Figure 3 illustrates the time taken to mine topics as we increase the number of documents and as we increase an index of complexity given by a combination of the size of the vocabulary and the average number of times a term appears in a document. As can be noticed, in both cases the time grows almost linearly and is in the thousands of seconds.
Time scalability for the Reuters corpus.
Document classification for the 20 Newsgroups corpus.

| Model | Topics | Accuracy | Avg. score |
|---|---|---|---|
| SWMH 205 | 3394 | 59.9 | 60.6 |
| SWMH 319 | 4427 | 61.2 | 64.3 |
| SWMH 693 | 6090 | 68.9 | 70.7 |
| SWMH 1693 | 2868 | 53.1 | 55.8 |
| SWMH 2427 | 3687 | 56.2 | 60.0 |
| SWMH 6963 | 5510 | 64.1 | 66.4 |
| Online LDA | 100 | 59.2 | 60.0 |
| Online LDA | 400 | 65.4 | 65.9 |
4.3 Document Classification
In this evaluation we used the mined topics to create a document representation based on the similarity between topics and documents. This representation was used to train an SVM classifier to predict the class of each document. In particular, we focused on the 20 Newsgroups corpus for this experiment. We used the typical setting of this corpus for document classification (\(60\,\%\) training, \(40\,\%\) testing). Table 2 shows the performance for different variants of topics mined by SWMH and for Online LDA topics. The results illustrate that the number of topics is relevant for the task: Online LDA with 400 topics outperforms Online LDA with 100 topics. A similar behavior can be observed for SWMH; however, the parameter r also affects the content of the topics and therefore the performance.
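A schematic version of this pipeline is sketched below, assuming each mined topic is a set of terms and representing each document by a simple coverage-based similarity to every topic before training a linear SVM with scikit-learn; the particular similarity measure, the toy data and the function names are illustrative assumptions rather than the exact setup used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer

def topic_document_features(doc_term_matrix, vocab_index, topics):
    """Represent each document by its similarity to each mined topic.

    doc_term_matrix: (n_docs, n_terms) sparse count matrix.
    vocab_index: dict mapping term -> column index.
    topics: list of term subsets (the mined topics).
    Similarity here is the fraction of a document's tokens covered by the topic.
    """
    features = np.zeros((doc_term_matrix.shape[0], len(topics)))
    totals = np.asarray(doc_term_matrix.sum(axis=1)).ravel() + 1e-9
    for k, topic in enumerate(topics):
        cols = [vocab_index[t] for t in topic if t in vocab_index]
        if cols:
            counts = np.asarray(doc_term_matrix[:, cols].sum(axis=1)).ravel()
            features[:, k] = counts / totals
    return features

# Illustrative usage with a handful of documents and two "topics".
docs = ["the neuron fired a spike", "the chip uses cmos vlsi circuits",
        "spiking neurons and firing rates", "vlsi chips and voltage"]
labels = [0, 1, 0, 1]
vec = CountVectorizer()
X_counts = vec.fit_transform(docs)
topics = [{"neuron", "spike", "spiking", "firing"},
          {"chip", "cmos", "vlsi", "voltage"}]
X = topic_document_features(X_counts, vec.vocabulary_, topics)
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
```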
4.4 Comparison Between Mined and Modeled Topics
Coherence of topics mined by SWMH vs Online LDA topics in the (a) 20 Newsgroups and (b) Reuters corpora.
5 Discussion and Future Work
In this work we presented a large-scale approach to automatically mine topics in a given corpus based on Sampled Weighted Min-Hashing. The mined topics consist of subsets of highly correlated terms from the vocabulary. The proposed approach is able to mine topics in corpora ranging from thousands of documents (approx. 1 min) to millions of documents (approx. 7 h), including topics similar to the ones produced by Online LDA. We found that the mined topics can be used to represent a document for classification. We also showed that the complexity of the proposed approach grows linearly with the number of documents. Interestingly, some of the topics mined by SWMH are related to the structure of the documents (e.g., in NIPS the words in the first topic correspond to parts of an article) and others to specific domains (e.g., team sports in 20 Newsgroups and Reuters, or the Transformers universe in Wikipedia). These examples suggest that SWMH is able to generate topics at different levels of granularity.
Further work is needed to make sense of overly specific topics or to filter them out. In this direction, we found that weighting the terms has the effect of discarding several irrelevant topics and producing more compact ones. Another alternative is to restrict the vocabulary to the most frequent terms, as done in other approaches. Other interesting directions for future work include exploring other weighting schemes, finding a better representation of documents from the mined topics, and parallelizing SWMH.
Footnotes
- 1.
Wikipedia dump from 2013-09-04.
- 2.
All corpora were preprocessed to cut off terms that appeared less than 6 times in the whole corpus.
- 3.
The parameters were fixed to \(s^{*} = 0.1\), \(r = 3\), and an overlap threshold of 0.7.
- 4.
https://github.com/qpleple/online-lda-vb was adapted to use our file formats.
References
- 1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
- 2. Broder, A.Z.: On the resemblance and containment of documents. Comput. 33(11), 46–53 (2000)
- 3. Buckley, C.: The importance of proper weighting methods. In: Proceedings of the Workshop on Human Language Technology, pp. 349–352 (1993)
- 4. Chum, O., Matas, J.: Large-scale discovery of spatially related images. IEEE Trans. Pattern Anal. Mach. Intell. 32, 371–377 (2010)
- 5. Chum, O., Philbin, J., Zisserman, A.: Near duplicate image detection: min-hash and tf-idf weighting. In: Proceedings of the British Machine Vision Conference (2008)
- 6. Fuentes Pineda, G., Koga, H., Watanabe, T.: Scalable object discovery: a hash-based approach to clustering co-occurring visual words. IEICE Trans. Inf. Syst. E94–D(10), 2024–2035 (2011)
- 7. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)
- 8. Hoffman, M.D., Blei, D.M., Bach, F.: Online learning for latent Dirichlet allocation. In: Advances in Neural Information Processing Systems 23 (2010)
- 9. Larochelle, H., Lauly, S.: A neural autoregressive topic model. In: Advances in Neural Information Processing Systems 25, pp. 2717–2725 (2012)
- 10. Mimno, D., Hoffman, M.D., Blei, D.M.: Sparse stochastic inference for latent Dirichlet allocation. In: International Conference on Machine Learning (2012)
- 11. Mimno, D., Wallach, H.M., Talley, E., Leenders, M., McCallum, A.: Optimizing semantic coherence in topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262–272. ACL (2011)
- 12. Salakhutdinov, R., Srivastava, N., Hinton, G.: Modeling documents with a deep Boltzmann machine. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence (2013)
- 13. Salakhutdinov, R., Hinton, G.E.: Replicated softmax: an undirected topic model. In: Advances in Neural Information Processing Systems 22, pp. 1607–1614 (2009)
- 14. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 512–523 (1988)
- 15. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101, 1566–1581 (2004)