Abstract
Publishing statistical results obtained with computational tools requires that both the data and the code be provided, so that the results can be reproduced and verified with reasonable effort. However, this only makes it possible to rerun the exact same analysis. While rerunning helps to understand and retrace the steps that led to the published results, it constitutes only a limited proof of reproducibility. For “true” reproducibility one might require that essentially the same results are obtained in an independent analysis. To check for this “true” reproducibility in a text mining application, we replicate a study in which a latent Dirichlet allocation model was fitted to the document-term matrix derived from the abstracts of the papers published in the Proceedings of the National Academy of Sciences from 1991 to 2001. Comparing the results, we assess (1) how well the corpus and the document-term matrix can be reconstructed, (2) whether the same model would be selected and (3) whether the analysis of the fitted model leads to the same main conclusions and insights. Our replication indicates that the results of the original study are robust to slightly different preprocessing steps and to the use of different software to fit the model.
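As a rough illustration of point (1), the sketch below builds a toy document-term matrix for two versions of a tiny corpus and measures how much their vocabularies overlap. The helper names (`document_term_matrix`, `vocabulary`, `jaccard`), the whitespace tokenizer, and the two toy corpora are assumptions made for this example only; they do not reproduce the preprocessing or the software used in the study.

```python
from collections import Counter


def document_term_matrix(docs, min_len=2):
    """Return one term-count Counter per document (a sparse toy DTM).

    Hypothetical preprocessing: lowercase, split on whitespace, keep
    purely alphabetic tokens with at least `min_len` characters.
    """
    rows = []
    for text in docs:
        tokens = [
            tok for tok in text.lower().split()
            if tok.isalpha() and len(tok) >= min_len
        ]
        rows.append(Counter(tokens))
    return rows


def vocabulary(dtm):
    """Collect the set of all terms that occur in any document."""
    vocab = set()
    for row in dtm:
        vocab.update(row)
    return vocab


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy stand-ins for the original and the reconstructed corpus (invented data).
original_docs = ["Protein binding in cell membranes", "Topic models for text"]
replicated_docs = ["protein binding in cell membranes",
                   "topic models for scientific text"]

dtm_original = document_term_matrix(original_docs)
dtm_replicated = document_term_matrix(replicated_docs)
overlap = jaccard(vocabulary(dtm_original), vocabulary(dtm_replicated))
print(f"Vocabulary overlap (Jaccard): {overlap:.2f}")
```

A vocabulary overlap close to 1 would suggest that the reconstructed matrix covers nearly the same terms as the original one. It says nothing about whether the term counts agree, so a fuller comparison would also look at the per-document counts.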
References
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Feinerer, I.: tm: Text Mining Package (2013). URL http://CRAN.R-project.org/package=tm. R package version 0.5-8.3
Feinerer, I., Hornik, K., Meyer, D.: Text mining infrastructure in R. J. Stat. Softw. 25(5), 1–54 (2008). URL http://www.jstatsoft.org/v25/i05/
Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proc. Natl. Acad. Sci. U S A 101, 5228–5235 (2004)
Grün, B., Hornik, K.: topicmodels: An R package for fitting topic models. J. Stat. Softw. 40(13), 1–30 (2011). URL http://www.jstatsoft.org/v40/i13/
Hothorn, T., Leisch, F.: Case studies in reproducibility. Brief. Bioinform. 12(3), 288–300 (2011)
Keiding, N.: Reproducible research and the substantive context. Biostatistics 11(3), 376–378 (2010)
Koenker, R., Zeileis, A.: On reproducible econometric research. J. Appl. Econ. 24, 833–847 (2009)
de Leeuw, J.: Reproducible research: the bottom line. Technical Report 2001031101, Department of Statistics Papers, University of California, Los Angeles (2001). URL http://repositories.cdlib.org/uclastat/papers/2001031101/
Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th International World Wide Web Conference (WWW 2008), pp. 91–100. Beijing, China (2008)
Ponweiser, M.: Latent Dirichlet allocation in R. Diploma thesis, Institute for Statistics and Mathematics, WU (Wirtschaftsuniversität Wien), Austria (2012). URL http://epub.wu.ac.at/id/eprint/3558
R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2012). URL http://www.R-project.org/. ISBN:3-900051-07-0
Steyvers, M., Griffiths, T.: MATLAB Topic Modeling Toolbox 1.4 (2011). URL http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
Acknowledgements
This research was supported by the Austrian Science Fund (FWF) under Elise-Richter grant V170-N18.
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Ponweiser, M., Grün, B., Hornik, K. (2014). Finding Scientific Topics Revisited. In: Carpita, M., Brentari, E., Qannari, E. (eds) Advances in Latent Variables. Studies in Theoretical and Applied Statistics. Springer, Cham. https://doi.org/10.1007/10104_2014_11
DOI: https://doi.org/10.1007/10104_2014_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02966-5
Online ISBN: 978-3-319-02967-2
eBook Packages: Mathematics and Statistics (R0)