Abstract
Nonnegative matrix factorization (NMF) has been widely used in topic modeling of large-scale document corpora, where a set of underlying topics is extracted as a low-rank factor matrix. However, the resulting topics often convey only general, and thus redundant, information about the documents rather than minor but potentially meaningful information. To address this problem, we present a novel ensemble method based on NMF that discovers meaningful local topics. Our method brings the idea of an ensemble model, which has shown advantages in supervised learning, into an unsupervised topic modeling context. That is, our model successively performs NMF on a residual matrix obtained from previous stages and generates a sequence of topic sets. The update algorithm we employ is novel in two respects. First, it uses a residual matrix, inspired by state-of-the-art gradient boosting models. Second, it applies a local weighting scheme to the given matrix to enhance the locality of topics, which in turn delivers high-quality, focused topics of interest to users. We subsequently extend this ensemble model with keyword- and document-based user interaction to support user-driven topic discovery.
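The staged procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (their MATLAB code is linked in the Notes) and rests on several assumptions: the names `nmf_mu`, `local_weights`, and `boosted_nmf` are hypothetical; plain multiplicative-update NMF stands in for whatever solver the paper uses; and the local weighting is simplified to the cosine similarity between each document and one sampled anchor document. Each stage factorizes a locally weighted residual, refits the coefficients on the unweighted residual, and subtracts the explained part before the next stage.

```python
import numpy as np

EPS = 1e-10  # guards against division by zero


def nmf_mu(X, k, n_iter=200, seed=0):
    """Multiplicative-update NMF (a stand-in solver): X ~= W @ H, W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + EPS
    H = rng.random((k, n)) + EPS
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + EPS)
        W *= (X @ H.T) / (W @ (H @ H.T) + EPS)
    return W, H


def local_weights(R, anchor):
    """Assumed weighting: cosine similarity of each column to the anchor column."""
    norms = np.linalg.norm(R, axis=0) + EPS
    return (R.T @ R[:, anchor]) / (norms * norms[anchor])


def boosted_nmf(X, k, n_stages, n_iter=200, seed=0):
    """Run NMF stage by stage on a nonnegative residual of X (terms x documents)."""
    rng = np.random.default_rng(seed)
    R = np.asarray(X, dtype=float).copy()
    stages, residual_norms = [], [np.linalg.norm(R)]
    for s in range(n_stages):
        col_mass = R.sum(axis=0)
        if col_mass.sum() <= 0:  # nothing left to explain
            break
        # Sample an anchor document in proportion to its unexplained mass.
        anchor = rng.choice(R.shape[1], p=col_mass / col_mass.sum())
        d = local_weights(R, anchor)
        # Topics (W) come from the residual re-weighted toward the anchor.
        W, _ = nmf_mu(R * d, k, n_iter=n_iter, seed=seed + s)
        # Refit the coefficients on the unweighted residual with W held fixed.
        H = rng.random((k, R.shape[1])) + EPS
        for _ in range(n_iter):
            H *= (W.T @ R) / (W.T @ W @ H + EPS)
        # Remove what this stage explains; clip to keep the residual nonnegative.
        R = np.maximum(R - W @ H, 0.0)
        stages.append((W, H))
        residual_norms.append(np.linalg.norm(R))
    return stages, residual_norms


# Usage on a small random term-document matrix (30 terms, 20 documents).
X = np.random.default_rng(0).random((30, 20))
stages, residual_norms = boosted_nmf(X, k=2, n_stages=3)
```

Because each stage subtracts a nonnegative product and clips at zero, the residual can only shrink entrywise, so later stages are pushed toward content that earlier stages left unexplained. Under the assumptions above, that is the mechanism by which the ensemble surfaces minor, local topics.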
Notes
The code is available at https://github.com/sanghosuh/lens_nmf-matlab.
References
Aletras N, Stevenson M (2013) Evaluating topic coherence using distributional semantics. In: Proceedings of the international conference on computational semantics, pp 13–22
Andrzejewski D, Zhu X, Craven M (2009) Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In: Proceedings of the international conference on machine learning (ICML), pp 25–32
Bakharia A, Bruza P, Watters J, Narayan B, Sitbon L (2016) Interactive topic modeling for aiding qualitative content analysis. In: Proceedings of the ACM SIGIR conference on human information interaction and retrieval (CHIIR), pp 213–222
Bernstein MS, Suh B, Hong L, Chen J, Kairam S, Chi EH (2010) Eddi: interactive topic-based browsing of social status streams. In: Proceedings of the annual ACM symposium on user interface software and technology (UIST), pp 303–312
Biggs M, Ghodsi A, Vavasis S (2008) Nonnegative matrix factorization via rank-one downdate. In: Proceedings of the international conference on machine learning (ICML), pp 64–71
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res (JMLR) 3:993–1022
Brandes U, Corman SR (2003) Visual unrolling of network evolution and the analysis of dynamic discourse. Inf Vis 2(1):40–50
Cho Y-S, Ver Steeg G, Ferrara E, Galstyan A (2016) Latent space model for multi-modal social data. In: Proceedings of the international conference on world wide web (WWW), pp 447–458
Choo J, Lee C, Reddy CK, Park H (2013) UTOPIAN: user-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Trans Vis Comput Graph (TVCG) 19(12):1992–2001
Choo J, Lee C, Reddy CK, Park H (2015) Weakly supervised nonnegative matrix factorization for user-driven clustering. Data Min Knowl Discov (DMKD) 29(6):1598–1621
Cichocki A, Zdunek R, Amari S-I (2007) Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. In: Independent component analysis and signal separation, pp 169–176
DeCoste D (2006) Collaborative prediction using ensembles of maximum margin matrix factorizations. In: Proceedings of the international conference on machine learning (ICML), pp 249–256
Ding C, Li T, Peng W, Park H (2006) Orthogonal nonnegative matrix tri-factorizations for clustering. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD)
Freund Y, Schapire R, Abe N (1999) A short introduction to boosting. J Jpn Soc Artif Intell 14(5):771–780
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232
Gillis N, Glineur F (2010) Using underapproximations for sparse nonnegative matrix factorization. Pattern Recogn 43(4):1676–1687
Golub GH, van Loan CF (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore
Greene D, Cagney G, Krogan N, Cunningham P (2008) Ensemble non-negative matrix factorization methods for clustering protein-protein interactions. Bioinformatics 24(15):1722–1728
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, Berlin
Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the ACM SIGIR international conference on research and development in information retrieval (SIGIR), pp 50–57
Hoque E, Carenini G (2015) Convisit: interactive topic modeling for exploring asynchronous online conversations. In: Proceedings of the international conference on intelligent user interfaces (IUI), pp 169–180
Huang F, Zhang S, Zhang J, Yu G (2017) Multimodal learning for topic sentiment analysis in microblogging. Neurocomputing 253:144–153
Jo Y, Oh AH (2011) Aspect and sentiment unification model for online review analysis. In: Proceedings of the ACM international conference on web search and data mining (WSDM), pp 815–824
Kim H, Park H (2007) Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23(12):1495–1502
Kim H, Park H (2008) Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J Matrix Anal Appl 30(2):713–730
Kim J, Park H (2008) Sparse nonnegative matrix factorization for clustering. Georgia Institute of Technology, Georgia
Kim J, Park H (2011) Fast nonnegative matrix factorization: an active-set-like method and comparisons. SIAM J Sci Comput 33(6):3261–3281
Kim J, He Y, Park H (2014) Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J Glob Optim 58(2):285–319
Kim H, Choo J, Kim J, Reddy CK, Park H (2015) Simultaneous discovery of common and discriminative topics via joint nonnegative matrix factorization. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 567–576
Kim M, Kang K, Park D, Choo J, Elmqvist N (2017) Topiclens: efficient multi-level visual topic exploration of large-scale document collections. IEEE Trans Vis Comput Graph (TVCG) 23(1):151–160
Kuang D, Park H (2013) Fast rank-2 nonnegative matrix factorization for hierarchical document clustering. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 739–747
Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2(1–2):83–97
Kumar S, Mohri M, Talwalkar A (2009) Ensemble Nyström method. In: Advances in neural information processing systems (NIPS), pp 1060–1068
Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791
Lee H, Kihm J, Choo J, Stasko J, Park H (2012) iVisClustering: an interactive visual document clustering via topic modeling. Comput Graph Forum 31(3 pt 3):1155–1164
Lee J, Sun M, Kim S, Lebanon G (2012) Automatic feature induction for stagewise collaborative filtering. In: Advances in neural information processing systems (NIPS)
Lee J, Kim S, Lebanon G, Singer Y, Bengio S (2016) Llorma: local low-rank matrix approximation. J Mach Learn Res (JMLR) 17(15):1–24
Li T, Zhang Y, Sindhwani V (2009) A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge. In: Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP, pp 244–252
Lin C-J (2007) Projected gradient methods for nonnegative matrix factorization. Neural Comput 19(10):2756–2779
Mackey LW, Talwalkar AS, Jordan MI (2011) Divide-and-conquer matrix factorization. In: Advances in neural information processing systems (NIPS), pp 1134–1142
Meyer M, Munzner T, DePace A, Pfister H (2010) Multeesum: a tool for comparative spatial and temporal gene expression data. IEEE Trans Vis Comput Graph (TVCG) 16(6):908–917
Mukherjea S, Hirata K, Hara Y (1996) Visualizing the results of multimedia web search engines. In: Proceedings of the IEEE symposium on information visualization (InfoVis), pp 64–65, 122
Newman D, Lau JH, Grieser K, Baldwin T (2010) Automatic evaluation of topic coherence. In: Proceedings of the annual conference of the North American chapter of the association for computational linguistics (NAACL-HLT), pp 100–108
Paatero P, Tapper U (1994) Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5:111–126
Qian S, Zhang T, Xu C, Shao J (2016) Multi-modal event topic model for social event analysis. IEEE Trans Multimed 18:233–246
Sill J, Takacs G, Mackey L, Lin D (2009) Feature-weighted linear stacking. arXiv preprint arXiv:0911.0460
Su X, Khoshgoftaar TM (2009) A survey of collaborative filtering techniques. Adv Artif Intell 2009:421425
Suh S, Choo J, Lee J, Reddy CK (2016) L-EnsNMF: boosted local topic discovery via ensemble of nonnegative matrix factorization. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 479–488
Titov I, McDonald R (2008) Modeling online reviews with multi-grain topic models. In: Proceedings of the international conference on world wide web (WWW), pp 111–120
Wang S, Chen Z, Liu B (2016) Mining aspect-specific opinion using a holistic lifelong topic model. In: Proceedings of the international conference on world wide web (WWW), pp 167–176
Wei F, Liu S, Song Y, Pan S, Zhou MX, Qian W, Shi L, Tan L, Zhang Q (2010) Tiara: a visual exploratory text analytic system. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 153–162
Wilkinson JH (1965) The algebraic eigenvalue problem, vol 87. Clarendon Press, Oxford
Wu Q, Tan M, Li X, Min H, Sun N (2015) Nmfe-sscc: non-negative matrix factorization ensemble for semi-supervised collective classification. Knowl Based Syst 89:160–172
Yang P, Su X, Ou-Yang L, Chua H-N, Li X-L, Ning K (2014) Microbial community pattern detection in human body habitats via ensemble clustering framework. BMC Syst Biol 8(Suppl 4):S7
Zheng Y, Zhang YJ, Larochelle H (2016) A deep and autoregressive approach for topic modeling of multimodal data. IEEE Trans Pattern Anal Mach Intell (TPAMI) 38:1056–1069
Acknowledgements
This work was supported in part by the National Science Foundation Grants IIS-1707498, IIS-1619028, and IIS-1646881 and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2016R1C1B2015924). Any opinions, findings, and conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of the funding agencies.
Additional information
This work is an extended version of [48].
About this article
Cite this article
Suh, S., Shin, S., Lee, J. et al. Localized user-driven topic discovery via boosted ensemble of nonnegative matrix factorization. Knowl Inf Syst 56, 503–531 (2018). https://doi.org/10.1007/s10115-017-1147-9
DOI: https://doi.org/10.1007/s10115-017-1147-9