A Review of Best Practice Recommendations for Text Analysis in R (and a User-Friendly App)

Abstract

In recent decades, the amount of text available for organizational science research has grown tremendously. Despite the availability of text and advances in text analysis methods, many of these techniques remain largely segmented by discipline. Moreover, although the number of open-source tools for text analysis (e.g., R, Python) is increasing, social science researchers, who likely have limited programming knowledge and exposure to computational methods, cannot easily take advantage of them. In this article, we compare quantitative and qualitative text analysis methods used across the social sciences. We describe basic terminology and the overlooked, but critically important, steps in pre-processing raw text (e.g., selection of stop words; stemming). Next, we provide an exploratory analysis of open-ended responses from a prototypical survey dataset using topic modeling with R. We provide a list of best practice recommendations for text analysis focused on (1) hypothesis and question formation, (2) design and data collection, (3) data pre-processing, and (4) topic modeling. We also discuss the creation of scale scores for more traditional correlation and regression analyses. All data are available in an online repository for the interested reader to practice with, along with a reference list for additional reading, an R Markdown file, and an open-source interactive topic model tool (topicApp; see https://github.com/wesslen/topicApp, https://github.com/wesslen/text-analysis-org-science, and https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/R4W7ZS).

Notes

  1. Changes from pre-registered protocol: The final sample size (n = 585) was lower than expected (n = 1000), but was dictated by our prespecified budgetary limit. Also, we originally planned to ask participants about their time working with the leader, but dropped the question due to space concerns. We had planned to examine how occupation related to LMX. However, there were not enough respondents for the majority of the occupations (n < 20); given the small n, there is not adequate power to detect even a small magnitude effect (e.g., d = .30). When we aggregated the occupations, the information became redundant with our industry question. Hence, our question about how LMX varied by occupation was dropped.

  2. Start words also exist: here, a researcher specifies that only certain words be included in an analysis.
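The contrast between stop words (terms excluded from an analysis) and start words (the only terms retained) can be illustrated with a minimal sketch. It is written in plain Python rather than R purely for illustration, and it is not the authors' pipeline. The word lists, the example response, and the function names are all hypothetical.

```python
import re

# Hypothetical word lists, chosen only for illustration.
STOP_WORDS = {"the", "a", "is", "and", "my", "to"}
START_WORDS = {"leader", "support", "feedback"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words):
    """Keep every token that is NOT on the stop list."""
    return [t for t in tokens if t not in stop_words]


def keep_start_words(tokens, start_words):
    """Keep ONLY the tokens that are on the start list."""
    return [t for t in tokens if t in start_words]


# Hypothetical open-ended survey response.
response = "My leader is supportive and gives feedback to the team"
tokens = tokenize(response)

print(remove_stop_words(tokens, STOP_WORDS))
print(keep_start_words(tokens, START_WORDS))
```

In this sketch the start list contains "support" but the response contains "supportive", so that token is not retained. This is one reason decisions about stemming interact with the choice of word lists.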

Author information

Corresponding author

Correspondence to George C. Banks.

Additional information

We dedicate this article to Jared Borns for his insight, patience, and guidance in the data collection process. We thank the three reviewers at Journal of Business and Psychology as well as John Batchelor, Wenwen Dou, Katherine Frear, Tiffany Gallicano, Andy Loignon, Aaron McKenny, Bob Muenchen, Ernest O’Boyle, Jeremy Short, Anne Smith, Allison Toth, and Christopher Whelpley for their feedback on previous versions of the manuscript and our analysis. The article was pre-registered via the Open Science Framework (https://osf.io/g9wjy/?view_only=045606c4e42843f7b3d131de6d0908d0).

About this article

Cite this article

Banks, G.C., Woznyj, H.M., Wesslen, R.S. et al. A Review of Best Practice Recommendations for Text Analysis in R (and a User-Friendly App). J Bus Psychol 33, 445–459 (2018). https://doi.org/10.1007/s10869-017-9528-3

Keywords

  • Text analysis
  • Topic modeling
  • Structural topic modeling
  • Thematic analysis
  • Content-analysis
  • Dictionary analysis
  • Natural language processing