Supervised topic models with word order structure for document classification and retrieval learning
Abstract
One limitation of most existing probabilistic latent topic models for document classification is that they ignore useful side-information, namely the class labels of documents. Supervised topic models, which do incorporate this side-information, in turn ignore the word order structure of documents. One motivation for modeling word order is to capture the semantic fabric of the document. We investigate a low-dimensional latent topic model for document classification in which class label information and word order structure are integrated into a supervised topic model, enabling more effective interaction between these two sources of information. We derive a collapsed Gibbs sampler for our model. Likewise, supervised topic models with word order structure have not been explored in document retrieval learning. We propose a novel supervised topic model for document retrieval learning, which can be regarded as a pointwise model for the learning-to-rank task. Available relevance assessments and word order structure are integrated into the topic model itself. We conduct extensive experiments on several publicly available benchmark datasets and show that our model improves upon state-of-the-art models.
Keywords
Topic modeling, Maximum-margin, Document classification, Learning-to-rank, Structured topic model
Notes
Acknowledgments
The work described in this paper is substantially supported by grants from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Codes: 413510 and 14203414) and the Direct Grant of the Faculty of Engineering, CUHK (Project Code: 4055034). This work is also affiliated with the CUHK MoE-Microsoft Key Laboratory of Human-centric Computing and Interface Technologies. The authors would like to thank the anonymous reviewers for their comments and suggestions.