A Review of Best Practice Recommendations for Text Analysis in R (and a User-Friendly App)

  • George C. Banks
  • Haley M. Woznyj
  • Ryan S. Wesslen
  • Roxanne L. Ross
Original Paper

Abstract

In recent decades, the amount of text available for organizational science research has grown tremendously. Despite the availability of text and advances in text analysis methods, many of these techniques remain largely segmented by discipline. Moreover, although a growing number of open-source tools (R, Python) support text analysis, social science researchers, who likely have limited programming knowledge and exposure to computational methods, cannot easily take advantage of them. In this article, we compare quantitative and qualitative text analysis methods used across the social sciences. We describe basic terminology and the overlooked, but critically important, steps in pre-processing raw text (e.g., selection of stop words; stemming). Next, we provide an exploratory analysis of open-ended responses from a prototypical survey dataset using topic modeling with R. We provide a list of best practice recommendations for text analysis focused on (1) hypothesis and question formation, (2) design and data collection, (3) data pre-processing, and (4) topic modeling. We also discuss the creation of scale scores for more traditional correlation and regression analyses. All the data are available in an online repository for the interested reader to practice with, along with a reference list for additional reading, an R markdown file, and an open-source interactive topic model tool (topicApp; see https://github.com/wesslen/topicApp, https://github.com/wesslen/text-analysis-org-science, https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/R4W7ZS).
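As a rough illustration of the pre-processing steps named in the abstract (stop-word selection and stemming), the following minimal sketch turns raw open-ended responses into per-document term counts. It is written in Python with the standard library only, whereas the article's own analyses use R; the stop-word list, the `crude_stem` suffix stripper, and the two example responses are all hypothetical stand-ins, not the authors' data or method. A real analysis would likely use a curated stop-word list and an established stemmer.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; which words to drop
# is itself an analytic decision.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "my", "of", "to", "i", "with"}


def crude_stem(token):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def document_term_counts(docs):
    """Return one term-frequency Counter per document."""
    return [Counter(preprocess(d)) for d in docs]


# Made-up survey responses, not taken from the article's dataset.
responses = [
    "My manager supports the team and is supporting my goals.",
    "The managers supported flexible schedules.",
]
counts = document_term_counts(responses)
```

In this sketch, "supports", "supporting", and "supported" all collapse to the stem "support", and "manager" and "managers" to "manager", which shows how pre-processing choices change the vocabulary that a topic model would later see.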

Keywords

Text analysis; Topic modeling; Structural topic modeling; Thematic analysis; Content-analysis; Dictionary analysis; Natural language processing

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • George C. Banks (1)
  • Haley M. Woznyj (2)
  • Ryan S. Wesslen (3)
  • Roxanne L. Ross (4)

  1. Department of Management, Belk College of Business, University of North Carolina at Charlotte, Charlotte, USA
  2. Department of Management, Longwood University, Farmville, USA
  3. Department of Computer Science, University of North Carolina at Charlotte, Charlotte, USA
  4. Department of Organizational Science, University of North Carolina at Charlotte, Charlotte, USA
