The College Completion Puzzle: A Hidden Markov Model Approach

Research in Higher Education

Abstract

Higher education in America is characterized by widespread access to college but low rates of completion, especially among undergraduates at less selective institutions. We analyze longitudinal transcript data to examine processes leading to graduation, using Hidden Markov modeling. We identify several latent states that are associated with patterns of course taking, and show that a trained Hidden Markov model can predict graduation or nongraduation based on only a few semesters of transcript data. We compare this approach to more conventional methods and conclude that certain college-specific processes, associated with graduation, should be analyzed in addition to socio-economic factors. The results from the Hidden Markov trajectories indicate that both graduating and nongraduating students take the more difficult mathematical and technical courses at an equal rate. However, undergraduates who complete their bachelor’s degree within 6 years are more likely to alternate between semesters with a heavy course load and less course-intense semesters. The course-taking patterns found among college students also indicate that nongraduates withdraw from coursework more often than average, yet when graduates withdraw, they tend to do so in exactly those semesters of the college career in which more difficult courses are taken. These findings, as well as the sequence methodology itself, emphasize the importance of careful course selection and counseling early on in a student’s college career.

Notes

  1. Panel weights are not conventionally used in the construction of the HMM itself (the hidden states). An HMM looks at variation over time within individuals’ sequences rather than representativeness of samples to a larger population.

  2. Around 8190 of these sampled students were 18 or 19 years old when they entered a four-year college for the first time (790 students were 20 years or older). At the urging of one reviewer, we reran the Hidden Markov model omitting students who were 20 years or older. These reworked analyses yielded similar results in terms of state description, transition probabilities, and prediction accuracy, and are available upon request.

  3. Murphy’s (2002, 2005) Matlab toolbox is used for all HMM calculations. See Appendix 2 (for HMM training) and Appendix 3 (for classification).

References

  • Achieve Inc. (2004). Ready or not: Creating a high school diploma that counts. An American diploma project. http://www.achieve.org/files/ADPreport.pdf. Accessed 24 November 2015.

  • Adelman, C. (1999). Answers in the toolbox: Academic intensity, attendance patterns, and bachelor’s degree attainment. Washington, DC: U.S. Department of Education.

  • Adelman, C. (2004). Undergraduate grades: A complex story (Chapter 6). In C. Adelman (Ed.), Principal indicators of student academic histories in postsecondary education (pp. 1972–2000). Washington, DC: US Department of Education.

  • Adelman, C. (2006). The toolbox revisited: paths to degree completion from high school through college. Washington, DC: US Department of Education.

  • Adelman, C. (2009). The spaces between numbers: getting international data on higher education straight. Washington, DC: Institute for Higher Education Policy.

  • Aud, S., Wilkinson-Flicker, S., Kristapovich, P., Rathbun, A., Wang, X., & Zhang, J. (2013). The condition of education 2013. National Center for Education Statistics (NCES) 2013-037. Washington, DC: US Department of Education.

  • Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1), 164–171.

  • Bean, J. P., & Metzner, B. S. (1985). A conceptual model of nontraditional undergraduate student attrition. Review of Educational Research, 55(4), 485–540.

  • Bowen, W. G., Chingos, M. M., & McPherson, M. S. (2009). Crossing the finish line: Completing college at America’s Public Universities. Princeton, NJ: Princeton University Press.

  • Bozick, R. (2007). The role of students’ economic resources, employment, and living arrangements. Sociology of Education, 80(3), 261–285.

  • Chen, X. (2005). First generation students in postsecondary education: a look at their college transcripts. National Center for Education Statistics (NCES) 2005-171. Washington, DC: US Department of Education.

  • Complete College America. (2011). Time is the enemy. Washington, DC: Complete College America. http://www.completecollege.org/docs/Time_Is_the_Enemy.pdf. Accessed 24 Nov 2015.

  • Duda, R. O., Hart, P. E., & Stork, D. G. (1973). Pattern classification and scene analysis (1st ed.). New York: John Wiley.

  • Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: John Wiley.

  • Elzinga, C. H., Hoogendoorn, A. W., & Dijkstra, W. (2007). Linked Markov Sources modeling outcome-dependent social processes. Sociological Methods and Research, 36(1), 26–47.

  • Heck, R. H., Price, C. L., & Thomas, S. L. (2004). Tracks as emergent structures: A network analysis of student differentiation in a high school. American Journal of Education, 110(4), 321–353.

  • Hess, F., Schneider, M., Carey, K., & Kelly, A. P. (2009). Diplomas and dropouts: Which colleges actually graduate their students (and which don’t). Washington, DC: American Enterprise Institute.

  • Horn, L., & Kojaku, L. K. (2001). High school curriculum and the persistence path through college. National Center for Education Statistics (NCES) 2001-163. Washington, DC: US Department of Education.

  • Ip, E. H., Snow Jones, A., Heckert, A., Zhang, Q., & Gondolf, E. D. (2010). Latent Markov model for analyzing temporal configuration for violence profiles and trajectories in a sample of batterers. Sociological Methods and Research, 39(2), 222–255.

  • Langeheine, R., & Van de Pol, F. (2002). Latent Markov chains. In J. Hagenaars & A. McCutcheon (Eds.), Applied latent class analysis (pp. 304–334). Cambridge, UK: Cambridge University Press.

  • McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: John Wiley.

  • Murphy, K. P. (2002). Dynamic Bayesian networks: Representation, inference and learning. (PhD dissertation, Department of Computer Science). Berkeley, CA: University of California.

  • Murphy, K. P. (2005). Hidden Markov model (HMM) Toolbox for Matlab (Original Toolbox of 1998). http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html. Accessed 1 Aug 2016.

  • National Center for Education Statistics. (2011). 2004/2009 Beginning postsecondary students longitudinal study restricted use data file [in Stata]. Washington, DC: US Department of Education, NCES 2011-244 [distributor].

  • Perna, L. W. (2010). Toward a more complete understanding of financial aid in promoting college enrollment. In J. Smart, Higher education: handbook of theory and research (Vol. 25) (pp. 129–180). New York, NY: Springer.

  • Perna, L. W., & Li, C. (2006). College affordability: Implications for college opportunity. Journal of Student Financial Aid, 36(1), 7–24.

  • Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.

  • Radford, A. W., Berkner, L., Wheeless, S. C., & Shepard, B. (2011). Persistence and attainment of 2003–2004 beginning postsecondary students: After six years. National Center for Education Statistics (NCES) 2011-151. Washington, DC: US Department of Education.

  • Schneider, M., & Yin, M. L. (2012). Completion matters: The high cost of low community college graduation rates. Washington, DC: American Enterprise Institute for Public Policy Research.

  • Schuh, J. (2005). Finances and retention: Trends and potential implications. In A. Seidman (Ed.), College student retention: Formula for student success (pp. 277–294). Westport, CT: American Council on Education and Praeger.

  • Scott, S. L. (2002). Bayesian methods for hidden Markov models. Journal of the American Statistical Association, 97(457), 337–351.

  • St. John, E. P., Cabrera, A. F., Nora, A., & Asker, E. H. (2000). Economic influences on persistence reconsidered. In J. M. Braxton (Ed.), Reworking the student departure puzzle (pp. 29–47). Nashville, TN: Vanderbilt University Press.

  • Stamp, M. (2015). A revealing introduction to hidden Markov models (Course). http://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf. Accessed 24 Nov 2015.

  • Tinto, V. (1993). Leaving college: Rethinking the causes of student attrition (2nd ed.). Chicago, IL: University of Chicago Press.

  • Vermunt, J. K., Tran, B., & Magidson, J. (2008). Latent class models in longitudinal research. In S. Menard (Ed.), Handbook of longitudinal research: design, measurement, and analysis (pp. 373–385). Burlington, MA: Elsevier.

  • Wine, J., Janson, N., & Wheeless, S. (2011). 2004/09 Beginning postsecondary students longitudinal study (BPS:04/09) full-scale methodology report. National Center for Education Statistics (NCES) 2012-246. Washington, DC: US Department of Education.

  • Wyner, J. S., Bridgeland, J.M., Diiulio, J. (2007). Achievement trap: How America is failing millions of high-achieving students from lower-income families. In V.A. Lansdowne (Ed.), Jack Kent Cooke Foundation. http://www.jkcf.org/news-knowledge/research-reports/. Accessed 24 Nov 2015.

Acknowledgements

We thank the National Science Foundation (Grant DRL 1243785) and the Bill & Melinda Gates Foundation (Grant OPP 1012951) for their support for this study. We also thank Andrew Rosenberg (Queens College, CUNY) for his extensive technical support and his feedback on programming Hidden Markov models.

Author information

Corresponding author

Correspondence to Dirk Witteveen.

Appendix

Appendix 1: Trellis diagram of a Hidden Markov Chain

Appendix 2: Building an HMM in Matlab

Notes: O = number of categories of all variables together, Q = number of expected states, s_prior0 = random initial distributions, s_transmat0 = random transition probabilities, s_obsmat0 = random observation probabilities, s_prior1 = expected initial distributions, s_transmat1 = expected transition probabilities, s_obsmat1 = expected observation probabilities, LL_S = log-likelihood (of the Graduation-HMM).
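The random starting values named in these notes can be illustrated with a minimal sketch. This is not the authors' Matlab code: it is a plain-Python illustration, and the function names (`random_stochastic_vector`, `random_stochastic_matrix`, `init_hmm`) as well as the values Q = 3 and O = 5 are assumptions chosen for the example. It shows only the initialization step; the subsequent training that turns the starting values into s_prior1, s_transmat1, and s_obsmat1 is not shown.

```python
import random


def random_stochastic_vector(n, rng):
    """Return n random non-negative weights normalised to sum to 1."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def random_stochastic_matrix(rows, cols, rng):
    """Return a rows x cols matrix in which every row sums to 1."""
    return [random_stochastic_vector(cols, rng) for _ in range(rows)]


def init_hmm(Q, O, seed=0):
    """Draw random starting parameters for a discrete HMM.

    Q = number of hidden states, O = number of observation categories.
    The seed argument is an illustrative addition for reproducibility.
    """
    rng = random.Random(seed)
    prior0 = random_stochastic_vector(Q, rng)        # initial state distribution
    transmat0 = random_stochastic_matrix(Q, Q, rng)  # state-to-state transitions
    obsmat0 = random_stochastic_matrix(Q, O, rng)    # state-to-observation emissions
    return prior0, transmat0, obsmat0


# Hypothetical sizes, chosen only for illustration.
prior0, transmat0, obsmat0 = init_hmm(Q=3, O=5)
```

Each row is a probability distribution, which is the property the training step relies on when it re-estimates the three matrices.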

Appendix 3: Assessing the HMM

Notes: ‘Start’ and ‘End’ indicate the test set of nongraduating students (N = 1045). The dhmm_logprob function in the HMM toolbox was used to produce two log-likelihoods: one that evaluates each individual test case against the prior, transition, and observation matrices of the Graduation-HMM (LL_S), and one that evaluates it against those of the Non-Completion-HMM (LL_F). The :,i index can be replaced with any selection of semester transcripts (e.g., semesters 1 through 4). Finally, the algorithm classifies each case by testing whether LL_F > LL_S.
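The classification rule in these notes can be sketched in a few lines. The following is a plain-Python illustration rather than the toolbox code: `log_likelihood` is a standard scaled forward algorithm for a discrete HMM, and `classify`, the two toy models, and the example sequences are hypothetical values invented for the example, not estimates from the study.

```python
import math


def log_likelihood(obs, prior, transmat, obsmat):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    n_states = len(prior)
    # alpha[i] is proportional to P(state i, observations so far).
    alpha = [prior[i] * obsmat[i][obs[0]] for i in range(n_states)]
    scale = sum(alpha)
    if scale == 0.0:
        return float("-inf")
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * transmat[i][j] for i in range(n_states)) * obsmat[j][symbol]
            for j in range(n_states)
        ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        loglik += math.log(scale)          # accumulate log of each scaling factor
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return loglik


def classify(obs, grad_model, nongrad_model):
    """Label a sequence by comparing LL_F (non-completion) with LL_S (graduation)."""
    ll_s = log_likelihood(obs, *grad_model)
    ll_f = log_likelihood(obs, *nongrad_model)
    return "nongraduate" if ll_f > ll_s else "graduate"


# Toy two-state, two-symbol models (hypothetical numbers).
grad_model = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]])
nongrad_model = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.2, 0.8]])

sequence = [0, 0, 0, 0, 1, 0]
label_all = classify(sequence, grad_model, nongrad_model)
label_first4 = classify(sequence[:4], grad_model, nongrad_model)  # semesters 1-4 only
```

Slicing the sequence before scoring, as in `sequence[:4]`, plays the role of restricting the selection to the first few semesters.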

Appendix 4: Encoding and Decoding Transcripts

The input for each student-semester observation (\(m_{1 \ldots n}\)) is based on a vector (\(v_{1 \ldots n}\)) that has all feature values encoded using the following algorithm:

The student-semester observations (\(m_{1 \ldots n}\)) include categorical and continuous values. Categorical features include binary features such as “did the student take a STEM class?” and “did the student drop any courses?” Continuous features include “number of attempted credits” and “cumulative GPA.” To simplify the modeling process, we represent all features as independent categorical features. The first step in this process is the discretization of continuous features: each continuous value is represented as a categorical feature as described in Table 1. At this point, each vector \(v_i\) is a vector of \(k\) categorical features, the \(j\)-th of which can take one of \(m^{(j)}\) values. The second step converts each vector \(v_i\) to \(v'_i\), a single categorical variable that can take one of \(m = \prod_{j=1}^{k} m^{(j)}\) values. This transformation is accomplished by a bijection, \(f(v_i) = v'_i\). Since the HMM assumes that all elements in the original vector \(v\) are independent, no information is lost in this transformation.

Subsequently, in the analysis phase, the encoded student-semester observation can be decoded through the inverse function \(f^{-1}(v'_i) = v_i\). Since \(f\) is bijective, no information is lost in this inversion.
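One common way to build such a bijection is a mixed-radix encoding, sketched below in plain Python. The text does not say which bijection the authors used, so this is an assumption. The function names and the feature sizes (two binary features, four credit-load bins, five GPA bins) are hypothetical; the actual bins are those of Table 1, which is not reproduced here.

```python
def encode(values, sizes):
    """Map a vector of categorical values to one integer (mixed-radix).

    values[j] must lie in range(sizes[j]); the result lies in
    range(product of sizes).
    """
    if len(values) != len(sizes):
        raise ValueError("values and sizes must have the same length")
    code = 0
    for value, size in zip(values, sizes):
        if not 0 <= value < size:
            raise ValueError("feature value out of range")
        code = code * size + value
    return code


def decode(code, sizes):
    """Inverse of encode: recover the vector of categorical values."""
    values = []
    for size in reversed(sizes):
        code, value = divmod(code, size)
        values.append(value)
    values.reverse()
    return values


# Hypothetical feature sizes: STEM class (2), dropped a course (2),
# attempted-credit bins (4), cumulative-GPA bins (5).
sizes = [2, 2, 4, 5]
observation = [1, 0, 3, 2]
code = encode(observation, sizes)   # a single category out of 2*2*4*5 = 80
restored = decode(code, sizes)      # equal to the original observation
```

Because every combination of feature values maps to a distinct integer and back, the single encoded variable carries the same information as the original vector.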

About this article

Cite this article

Witteveen, D., Attewell, P. The College Completion Puzzle: A Hidden Markov Model Approach. Res High Educ 58, 449–467 (2017). https://doi.org/10.1007/s11162-016-9430-2

  • DOI: https://doi.org/10.1007/s11162-016-9430-2
