Abstract
With the rapid development of Internet technologies and sensing techniques, it has become much easier to acquire data from different sources at different dates and times. However, computing the correlation among such heterogeneous data remains a major challenge for data mining and information retrieval. Here, the features of a data point obtained from one source are called a view, and the multiview features all describe the same data point. In this paper, the hidden correlation between two-view features is used to construct a Heterogeneous (multiview) Topic Model (HTM). In particular, a probabilistic topic model is used for each view because generative models usually provide much richer features when handling high-dimensional data such as text. However, most existing probabilistic topic models, such as latent Dirichlet allocation, require the form of the probability distribution to be known. To avoid this limitation, the HTM is reduced to solving a non-negative matrix tri-factorization problem with certain constraints, so that the proposed approach can be used with an arbitrary model.
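To make the term concrete, the following is a minimal sketch of a generic, unconstrained non-negative matrix tri-factorization, X ≈ F S Gᵀ, fitted with standard multiplicative updates for the Frobenius-norm objective. It is not the HTM itself: the constraints mentioned in the abstract are not specified here and are not reproduced. The function name `nmtf`, the ranks `k1` and `k2`, and the synthetic matrix `X` are assumptions made for illustration only.

```python
import numpy as np

def nmtf(X, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Generic NMTF sketch: find non-negative F, S, G with X ~ F @ S @ G.T.

    Illustrative only; the HTM's additional constraints are not modeled.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random initialization keeps the updates well defined.
    F = rng.random((n, k1)) + eps
    S = rng.random((k1, k2)) + eps
    G = rng.random((m, k2)) + eps
    # Track the reconstruction error, starting from the initial guess.
    errors = [np.linalg.norm(X - F @ S @ G.T)]
    for _ in range(n_iter):
        # Each update multiplies by the ratio of the negative to the positive
        # part of the gradient, so non-negativity is preserved.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        errors.append(np.linalg.norm(X - F @ S @ G.T))
    return F, S, G, errors

# Hypothetical non-negative data with an exact tri-factor structure.
rng = np.random.default_rng(1)
X = rng.random((20, 4)) @ rng.random((4, 3)) @ rng.random((3, 15))
F, S, G, errors = nmtf(X, k1=4, k2=3)
print(f"initial error {errors[0]:.4f}, final error {errors[-1]:.4f}")
```

In this sketch the middle factor `S` links the two sets of latent factors, which is one reason a tri-factorization can relate two different feature spaces without assuming a particular probability distribution.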
Notes
Available at http://www.cs.nyu.edu/~roweis/data.html.
About this article
Cite this article
Shen, J., Chi, M. A Novel Multiview Topic Model to Compute Correlation of Heterogeneous Data. Ann. Data. Sci. 5, 9–19 (2018). https://doi.org/10.1007/s40745-017-0135-y
DOI: https://doi.org/10.1007/s40745-017-0135-y