
Finding Scientific Topics Revisited

Advances in Latent Variables

Abstract

The publication of statistical results based on computational tools requires that both the data and the code are provided so that the results can be reproduced and verified with reasonable effort. However, this only allows the exact same analysis to be rerun. While this helps to understand and retrace the steps of the analysis that led to the published results, it constitutes only a limited proof of reproducibility. In fact, for “true” reproducibility one might require that essentially the same results are obtained in an independent analysis. To check for this “true” reproducibility of the results of a text mining application, we replicate a study in which a latent Dirichlet allocation model was fitted to the document-term matrix derived from the abstracts of the papers published in the Proceedings of the National Academy of Sciences from 1991 to 2001. Comparing the results, we assess (1) how well the corpus and the document-term matrix can be reconstructed, (2) whether the same model would be selected, and (3) whether the analysis of the fitted model leads to the same main conclusions and insights. Our study indicates that the results of the original study are robust with respect to slightly different preprocessing steps and the use of different software to fit the model.
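
As a rough illustration of point (1), the following minimal Python sketch builds a document-term matrix from a toy corpus under two slightly different preprocessing variants and measures how much the resulting vocabularies overlap. It is not the procedure used in the chapter: the two example sentences, the stopword list, the function names and the use of the Jaccard index are all assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical toy abstracts; the chapter's real corpus is the PNAS abstracts.
DOCS = [
    "Latent topics are inferred from the abstracts.",
    "The topics of the abstracts change over time.",
]

# Illustrative stopword list only (an assumption, not the list used in the study).
STOPWORDS = {"the", "of", "are", "from", "over"}


def tokenize(text, drop_stopwords):
    """Lowercase, keep runs of letters, optionally drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if not (drop_stopwords and t in STOPWORDS)]


def document_term_matrix(docs, drop_stopwords):
    """Return (sorted vocabulary, list of per-document count rows)."""
    counts = [Counter(tokenize(d, drop_stopwords)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


def jaccard(a, b):
    """Share of terms common to both vocabularies."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Variant A keeps stopwords, variant B removes them.
vocab_a, dtm_a = document_term_matrix(DOCS, drop_stopwords=False)
vocab_b, dtm_b = document_term_matrix(DOCS, drop_stopwords=True)
overlap = jaccard(vocab_a, vocab_b)
print(len(vocab_a), len(vocab_b), round(overlap, 3))
```

An overlap close to 1 would suggest that the two preprocessing pipelines yield nearly the same vocabulary. A real comparison would also need to look at document counts and term frequencies, not only at the set of terms.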


References

  1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)


  2. Feinerer, I.: tm: Text Mining Package (2013). URL http://CRAN.R-project.org/package=tm. R package version 0.5-8.3

  3. Feinerer, I., Hornik, K., Meyer, D.: Text mining infrastructure in R. J. Stat. Softw. 25(5), 1–54 (2008). URL http://www.jstatsoft.org/v25/i05/

  4. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proc. Natl. Acad. Sci. USA 101, 5228–5235 (2004)


  5. Grün, B., Hornik, K.: topicmodels: An R package for fitting topic models. J. Stat. Softw. 40(13), 1–30 (2011). URL http://www.jstatsoft.org/v40/i13/

  6. Hothorn, T., Leisch, F.: Case studies in reproducibility. Brief. Bioinform. 12(3), 288–300 (2011)


  7. Keiding, N.: Reproducible research and the substantive context. Biostatistics 11(3), 376–378 (2010)


  8. Koenker, R., Zeileis, A.: On reproducible econometric research. J. Appl. Econ. 24, 833–847 (2009)


  9. de Leeuw, J.: Reproducible research: the bottom line. Technical Report 2001031101, Department of Statistics Papers, University of California, Los Angeles (2001). URL http://repositories.cdlib.org/uclastat/papers/2001031101/

  10. Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th International World Wide Web Conference (WWW 2008), pp. 91–100. Beijing, China (2008)


  11. Ponweiser, M.: Latent Dirichlet allocation in R. Diploma thesis, Institute for Statistics and Mathematics, WU (Wirtschaftsuniversität Wien), Austria (2012). URL http://epub.wu.ac.at/id/eprint/3558

  12. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2012). URL http://www.R-project.org/. ISBN:3-900051-07-0

  13. Steyvers, M., Griffiths, T.: MATLAB Topic Modeling Toolbox 1.4 (2011). URL http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm


Acknowledgements

This research was supported by the Austrian Science Fund (FWF) under Elise-Richter grant V170-N18.

Author information

Corresponding author

Correspondence to Bettina Grün.


Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Ponweiser, M., Grün, B., Hornik, K. (2014). Finding Scientific Topics Revisited. In: Carpita, M., Brentari, E., Qannari, E. (eds) Advances in Latent Variables. Studies in Theoretical and Applied Statistics. Springer, Cham. https://doi.org/10.1007/10104_2014_11
