Machine Learning, Volume 95, Issue 3, pp 423–469

Interactive topic modeling

  • Yuening Hu
  • Jordan Boyd-Graber
  • Brianna Satinoff
  • Alison Smith

Abstract

Topic models are a useful and ubiquitous tool for understanding large corpora. However, topic models are not perfect, and for many users in computational social science, digital humanities, and information studies (who are not machine learning experts), existing models and frameworks are often a “take it or leave it” proposition. This paper presents a mechanism for giving users a voice by encoding their feedback as correlations between words in a topic model. This framework, interactive topic modeling (ITM), allows untrained users to encode their feedback easily and iteratively into the topic models. Because latency in interactive systems is crucial, we develop more efficient inference algorithms for tree-based topic models. We validate the framework with both simulated and real users.

Keywords

Topic models, Latent Dirichlet Allocation, Feedback, Interactive topic modeling, Online learning, Gibbs sampling

Copyright information

© The Author(s) 2013

Authors and Affiliations

  • Yuening Hu (1)
  • Jordan Boyd-Graber (2)
  • Brianna Satinoff (1)
  • Alison Smith (1)
  1. Computer Science, University of Maryland, College Park, USA
  2. iSchool and UMIACS, University of Maryland, College Park, USA