Topic Cropping: Leveraging Latent Topics for the Analysis of Small Corpora

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8092)

Abstract

Topic modeling has become a popular means of identifying and describing the topical structure of textual documents and whole corpora. Many document collections, however, such as qualitative studies in the digital humanities, cannot easily benefit from this technology: their limited size leads to poor-quality topic models. Higher-quality topic models can be learned by incorporating additional domain-specific documents with similar topical content, but finding or even manually composing such corpora requires considerable effort. To solve this problem, we propose topic cropping, a fully automated, adaptable process. To learn topics, the process automatically tailors a domain-specific cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. An evaluation on a real-world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. Specifically, we analyzed the learned topics with respect to coherence, diversity, and relevance.
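
The corpus-tailoring step can be illustrated with a small sketch. The abstract does not say how documents are selected from the general corpus, so the TF-IDF cosine ranking below is an assumption made purely for illustration, and the names `tokenize` and `crop_corpus` are hypothetical. The sketch stops at selecting the cropping corpus; training a topic model on it and inferring topics for the working corpus would be left to a topic modeling toolkit.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, and keep purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def crop_corpus(working_docs, general_docs, k):
    """Return the k general documents most similar to the working corpus.

    Assumption: similarity is TF-IDF cosine between each general document
    and the concatenated working corpus. This is an illustrative choice,
    not necessarily the selection procedure used in the paper.
    """
    working = [tokenize(d) for d in working_docs]
    general = [tokenize(d) for d in general_docs]

    # Smoothed inverse document frequency, estimated on the general corpus.
    n = len(general)
    df = Counter(t for doc in general for t in set(doc))

    def idf(term):
        return math.log((1 + n) / (1 + df[term])) + 1.0

    def vectorize(tokens):
        return {t: c * idf(t) for t, c in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    # Treat the whole working corpus as one query document.
    query = vectorize([t for doc in working for t in doc])
    scored = [(cosine(query, vectorize(doc)), i) for i, doc in enumerate(general)]
    # Highest similarity first; ties are broken by original position.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [general_docs[i] for _, i in scored[:k]]


if __name__ == "__main__":
    working = ["agency staff work under temporary contracts"]
    general = [
        "temporary agency work and employment contracts",
        "volcanoes erupt when magma reaches the surface",
    ]
    print(crop_corpus(working, general, k=1))
```

In this sketch `k` fixes the size of the cropping corpus; a similarity threshold would be an equally plausible design, and either choice is an assumption rather than something stated in the abstract.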

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tran, N.K., Zerr, S., Bischoff, K., Niederée, C., Krestel, R. (2013). Topic Cropping: Leveraging Latent Topics for the Analysis of Small Corpora. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G., Farrugia, C.J. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2013. Lecture Notes in Computer Science, vol 8092. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40501-3_30

  • DOI: https://doi.org/10.1007/978-3-642-40501-3_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40500-6

  • Online ISBN: 978-3-642-40501-3

  • eBook Packages: Computer Science, Computer Science (R0)
