Using topic modeling to understand comments in student evaluations of teaching

Written comments in student evaluations of teaching offer a rich source of data for understanding instructors' teaching and students' learning experiences. However, most previous studies of student evaluations of teaching have focused on numeric ratings of closed-ended questions, while few have analyzed the content of students' written comments on open-ended questions, which normally requires a labor-intensive manual process of coding and categorizing. Such qualitative effort does not scale, since it is practically impossible to work through all the textual data manually. An innovative quantitative method that can analyze a large corpus of data therefore holds great promise. This paper proposes the latent Dirichlet allocation (LDA) method of topic modeling to discover important themes that emerge in students' written comments. We compare our results with findings from previous qualitative studies, and we investigate how these themes vary by course grade level and course subject. Our results provide evidence that topic modeling can be an effective and efficient alternative for understanding teaching and learning experiences through students' written comments on a large scale.


Introduction
Providing good-quality education is the essential goal of higher education institutions. When research performance alone can no longer meet the expectations of a wide range of stakeholders, a deep commitment to the core mission of teaching and learning becomes a demand that higher education institutions must meet. This commitment is fulfilled by faculty members who are actively involved with students both inside and outside the classroom. It is also fulfilled by formal evaluation processes, such as program reviews and teaching evaluations. While different strategies have emerged to assess teaching and learning, student evaluations of teaching (SET) are the most widely adopted method of conducting teaching evaluations in the U.S. [3,23].
Although there are different kinds of course evaluation questionnaires, the most common consists of two parts [6]: (1) a number of closed-ended questions in which the students rate the instructor or the class and (2) a few open-ended questions in which students can write freely. Most prior studies on SET have focused on analyzing the quantitative part of survey data, including the validity and reliability of the instrument and the external factors that influence numerical ratings [10,27]. Only a few studies have examined the qualitative part of SET, and the majority of those focus on quantitative aspects of students' written comments, such as the frequency and length of positive and negative comments. Very few studies have analyzed the content of the comments. The lack of attention to students' written comments in the literature could be attributed to three factors: (1) students' written comments may appear idiosyncratic and anecdotal, and thus could be perceived as less valuable; (2) students' written comments lack intrinsic structure and may therefore be difficult to interpret; and (3) students' written comments require a labor-intensive and time-consuming manual coding process for in-depth analysis, which makes such analysis difficult to implement on a large scale [6,10,14].
Despite these challenges, examining students' written comments has enormous benefits. First, written comments are more specific in identifying explicit weaknesses and strengths of a course and subsequently provide concrete suggestions for teaching improvement [1,8,15,21]. Second, written comments can reflect the affective aspect of students' learning, such as their feelings about their instructor's attitude toward them [10]. Third, written comments can provide insights into the different intellectual challenges encountered by different levels of learners [10]. Fourth, written comments can verify whether the themes measured in the numerical part of the SET remain relevant to students' learning experience [10]. Fifth, written comments give instructors more impetus to take effective actions for teaching improvement [23].
To fill this gap in the SET literature and benefit from students' written comments, our study adopts an innovative approach, topic modeling, to investigate the content of students' written comments in depth and on a large scale. By extracting the latent topics from students' written comments, we are able to achieve the following threefold goals: (1) discover meaningful themes that emerge from students' written comments; (2) investigate the popularity of these themes in a large corpus of data and how they vary by course grade level and course subject; and (3) provide a low-cost procedure for analyzing students' written comments that can yield real-time support for instructors, researchers and policy-makers in higher education institutions.

Students' written comments in SET
Only a small body of studies discuss students' written SET comments, and most of these examine statistical attributes of the comments, such as their frequency, length and polarity, the characteristics of the students who wrote them, the factors that influence the rate of commenting, and the consistency with the dimensions identified from closed-ended numerical questions.
Prior studies show large variations in the proportion of students who provide written comments. Theall and Franklin [26] reported the percentage of written comments in students' responses to be as low as 10-12%; other studies have reported that the percentage can reach 40-50% [1,9]; and still others have reported percentages as high as 60-70% [17].
Regarding the polarity of the comments, some studies have found that students tend to provide more positive comments than negative comments, and the ratio can be as high as 2:1 [1,17]. While the positive comments tend to be more general, the negative comments are more specific and offer more details about the course and instructor [28].
Regarding the rate of commenting, previous studies have examined multiple organizational and individual factors. Sorenson and Reiner [22] examined whether different ways of administering the survey lead to different rates of commenting. They found that when course evaluations are administered online, students tend to write more comments and longer comments than when evaluations are administered on paper. When the survey is short in form, students are also more likely to write comments [12]. Furthermore, higher achievers, females, older students, domestic (U.S. versus international) students, and full-time (versus part-time) students are more likely to provide comments in SET [17].
The literature also examines the consistency between the themes reflected in students' written comments and those derived from factor analyses of the closed-ended numeric questions, providing evidence for the validity of the instrument [17,28]. The common domains that appear in both written comments and numerical questions include the teaching competences and interpersonal skills of the instructor, course content, organization, and general quality [1]. However, these studies do not elicit insights from the written comments beyond what the closed-ended items already ask.
However, few studies have explored the content of students' SET written comments in depth. The two most recent studies [1,6] that have examined the content of students' written comments both adopted a qualitative approach involving a similar procedure: in step 1, multiple independent researchers manually coded the comments into various categories and subcategories; in step 2, interrater reliability was examined and confirmed using a small portion of the sample; in step 3, a codebook was created based on the whole sample, and a categorization scheme was restructured in a hierarchical fashion; and finally, in step 4, the codes and categories were revisited to finalize the categorization process.
Specifically, the study by Alhija and Fresko [1] showed that students' SET written comments can be classified into 45 subcategories, which can be conceptually grouped into eight categories: three related to the course (content, assignments, and general evaluation), three related to the instructor (personal traits, teaching style, and general evaluation), and two related to the context (scheduling issues and student composition). Among these 45 subcategories, six were found to appear in more than 10% of the comments: general evaluation of the instructor, general evaluation of the course, interest generated by the course, interest generated by the instructor, contribution to learning, and clarity of the instruction.
Similarly, Brockx et al. [6] classified students' SET written comments into 25 subcategories that were subsequently grouped into seven categories: one related to overall evaluation, three related to the course (content, evaluation, and course materials), two related to the instructor (personal traits and general teaching), and one related to the context (course organization). Among these 25 subcategories, three appeared most frequently: combining theory and practice, the build-up of the course, and relevance and interest. Furthermore, Brockx et al. [6] revealed different patterns in positive and negative comments. Most of the positive comments were related to the subcategory of combining theory and practice, whereas the negative comments were most likely to address the relevance and interest of the course. Details of the similarities and differences between these two studies are displayed in Table 1, in which the column titled 'Overlap' depicts the common subcategories identified by both studies.

Topic modeling
Manually reading and coding thousands or even tens of thousands of students' written comments is labor-intensive and time-consuming. As such, an alternative, algorithm-driven method can save both time and cost. Topic modeling fits this need. Topic modeling is a family of statistical models that utilize computer algorithms for the exploratory analysis of large text collections, extracting the latent topics discussed in the corpus [4,5,7]. Among the various topic modeling methods, a probability-based technique named latent Dirichlet allocation (LDA) [4] has gained increasing popularity due to its simplicity. LDA is a clustering algorithm that detects latent topics by finding patterns in the co-occurrence of words. The basic building blocks for LDA are words. That is, LDA treats each topic as a probability distribution over a collection of words and each document as a probability distribution over these topics [4]. One important assumption underlying LDA is that the order of words and the grammatical structure in the corpus are insignificant: LDA treats each document as an unordered collection of words.
LDA has been increasingly used in a variety of areas, including voting themes in politics, social media texts, and education. In the context of education, previous researchers have used LDA to examine forum discussions, students' essays, and course documents. For example, Ramesh et al. [18] built a topic model using LDA to detect latent topics in students' forum-discussion posts to predict student retention in a massive open online course (MOOC); Kakkonen et al. [13] used LDA to automate essay grading; and Sekiya et al. [20] applied LDA to curriculum analysis by analyzing syllabus documents. Two more recent studies have pioneered the application of topic modeling to student surveys. Taylor et al. [25] showcased the implementation of structural topic modeling (STM) to identify gender bias in SET comments at a previously unattainable scale, i.e., over 172,000 open-ended comments from students between 2013 and 2015 at a private research university. Alkhnbashi and Nassr [2] combined content analysis and LDA to investigate students' perspectives on learning both during and after the COVID-19 pandemic from 626 open-ended questionnaires.
In this study, we employ the LDA topic modeling technique to examine students' written comments with the expectation that this approach will offer valuable insights into instructors' teaching and students' learning experiences. Specifically, we seek to address the following research questions: (1) What are the latent topics detected by topic modeling in students' written comments? (2) What are the similarities and differences between the latent topics discovered in students' written comments and the explicit topics elaborated in closed-ended items? (3) How do these latent topics differ by course grade level and course subject?

Methods

Sample
The sample in our study included 110,420 undergraduate SET surveys collected at a research university in Texas, United States, during the fall and spring semesters of the 2013-2019 academic years (12 semesters in total). The courses involved both introductory and advanced courses from different departments across three divisions: humanities and liberal arts, social sciences, and natural sciences. The class sizes of the courses being evaluated ranged from 3 to 180 (mean = 23, median = 17, SD = 20). In our sample, students were evenly distributed by gender. The proportions of ethnic groups, ranked from highest to lowest, were White (60%), Hispanic (13%), International (9%), Asian (8%), Black (6%), and Other (4%).

Instrument
The SET questionnaire was administered online outside of class time by sending electronic invitations to all students near the end of the semester. Students were given two weeks to submit their responses before the final week began. Students' anonymity was maintained by the system automatically removing personal information upon retrieval of the survey data.
The SET questionnaire included two parts. In the first part, students were presented with 10 Likert-scale items ranging from 1 (strongly disagree) to 5 (strongly agree): six items were related to the course itself, two items were related to the instructor(s), and two items were related to learning outcomes. In the second part, students were given two open-ended questions: (1) Did any particular aspects of this course enhance your learning? and (2) Did any particular aspects of this course detract from your learning? The first open-ended question was intended to elicit positive responses, while the second was intended to elicit negative responses. Our sample consisted of 6,784 class sections, which contributed a total of 110,420 responses, of which 48,932 (44% of the entire sample) provided comments on the first open-ended question and 30,940 (28% of the entire sample) provided comments on the second open-ended question.

Analytical procedure
The initial screening of students' written comments showed that some students provided positive comments to the second open-ended question, whose purpose was to elicit negative comments, thus blurring the distinction between the two open-ended questions. As such, the written comments to these two questions were combined into one dataset. With each written comment considered a single document, we applied the topic modeling technique LDA to discover the latent themes in students' written comments.
Before applying the topic model, some prerequisite Python packages were installed, such as re (the regular expression module), NLTK (Natural Language Toolkit) and the spaCy English model for text preprocessing. Data preprocessing is one key component of text mining algorithms; it plays a crucial role in reducing the noise caused by irrelevant words and thus helps improve model performance [19]. This preprocessing is also called data normalization, which includes the following steps: (1) expanding contractions, such as expanding "I'll" to "I will" and "shouldn't" to "should not"; (2) removing punctuation, accented letters and special characters; (3) removing stop words, which are commonly used low-information words such as 'a', 'the', 'that', 'or', 'and', 'be', 'it', etc.; (4) tokenizing each comment into a list of words; and (5) lemmatizing the words through the spaCy model, which reduces inflected word forms (e.g., verb tenses and plural forms) to their root forms. The result of the data preprocessing stage was a cleaned, tokenized and lemmatized list of words for each comment.
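The normalization steps above can be sketched in a few lines of plain Python. This is a simplified stand-in, not the study's code: the stop-word list and contraction table below are tiny illustrative subsets (the study used NLTK's full stop-word list), and the spaCy lemmatization step is indicated only by a comment.

```python
import re

# A small stand-in stop-word list; the study used NLTK's full list.
STOP_WORDS = {"a", "the", "that", "or", "and", "be", "it", "is", "to", "i", "will", "not"}

# A few common contractions; a real pipeline would cover many more.
CONTRACTIONS = {"i'll": "i will", "shouldn't": "should not", "can't": "cannot"}

def preprocess(comment: str) -> list[str]:
    """Normalize one written comment into a cleaned list of tokens."""
    text = comment.lower()
    # (1) Expand contractions.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # (2) Remove punctuation, accents and special characters.
    text = re.sub(r"[^a-z\s]", " ", text)
    # (3) Drop stop words after (4) tokenizing on whitespace.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # (5) Lemmatization via spaCy's English model would follow here.
    return tokens

print(preprocess("I'll admit that the homework shouldn't be graded!"))
# → ['admit', 'homework', 'should', 'graded']
```

The order matters: contractions must be expanded before punctuation removal, or the apostrophes are stripped and "I'll" becomes the unrecognizable token "ill".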
Next, we applied Python Gensim's Dictionary() function to the cleaned texts to construct a vocabulary mapping, based on which Gensim's doc2bow() function converted each document into a "bag of words". A bag of words is a representation of text that describes the occurrence of words within a document. Then, we used Gensim's TfidfModel() function to compute the "term frequency-inverse document frequency" (TF-IDF), a technique used to quantify the importance of words in each document. The TF-IDF representation was passed directly to Gensim's LdaMulticore() function for LDA topic modeling.
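To make the TF-IDF weighting concrete, the following is a minimal pure-Python illustration of the idea on a toy corpus; it is not the study's Gensim code (Gensim's TfidfModel uses its own smoothing and normalization defaults), and the three example documents are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus of preprocessed comments (each a token list), invented for illustration.
docs = [
    ["homework", "helpful", "practice"],
    ["lecture", "slides", "helpful"],
    ["homework", "hard", "lecture"],
]

def tfidf(doc: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """Weight each word by term frequency times inverse document frequency."""
    tf = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)  # documents containing the word
        weights[word] = (count / len(doc)) * math.log(n_docs / df)
    return weights

w = tfidf(docs[0], docs)
# "practice" appears in only one document, so it outweighs "homework" and
# "helpful", which each appear in two documents.
assert w["practice"] > w["homework"] == w["helpful"]
```

The intuition carries over to the real pipeline: words that occur in nearly every comment (e.g., "class") are down-weighted, so the topic model is driven by the more distinctive vocabulary.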
In the final stage of tuning the topic model, determining the number of topics (K) is key. If K is too large, the topics become redundant and overlap with each other; if K is too small, the topics become too broad and lose their nuanced meanings. Therefore, choosing the appropriate number of topics is the major challenge in discovering meaningful units in the corpus. The common practice for choosing K is to first fit topic models for a set of candidate K values and then select the best K based on criteria such as prior knowledge in the task domain and the highest model coherence score [24]. Previous qualitative studies [1,6] identified 7-8 major categories and 25-45 subcategories in students' written comments. Therefore, we explored a wide range of candidate values that included all these numbers, ranging from 5 to 65 topics. Gensim's coherence model was used to calculate topic coherence for each model. Normally, the K value that marks the highest coherence score, or the end of a rapid growth in topic coherence scores, offers the most meaningful and informative topics [11]. We examined each topic model by manually scrutinizing the list of significant keywords in each topic for interpretability and by comparing coherence scores for maximum model fit. When the tuning results coincided, we determined the best-fitting model and then produced the topic distribution for each comment.
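The two selection criteria described above (highest coherence, and the peak that ends a rapid growth in coherence) can be sketched as follows. The coherence values below are hypothetical placeholders, not the study's actual scores (those appear in Fig. 1), and the candidate range is truncated for brevity.

```python
# Hypothetical coherence scores for consecutive candidate topic numbers K
# (illustrative values only; the study evaluated K from 5 to 65).
coherence_by_k = {5: 0.48, 6: 0.50, 7: 0.51, 8: 0.59, 9: 0.52, 10: 0.53}

# Criterion 1: the K with the highest coherence score.
best_k = max(coherence_by_k, key=coherence_by_k.get)

# Criterion 2: a local peak, i.e., a K whose coherence exceeds that of both
# K-1 and K+1, marking the end of a rapid growth in coherence.
ks = sorted(coherence_by_k)
peak_k = max(
    (k for k in ks[1:-1]
     if coherence_by_k[k] > coherence_by_k[k - 1]
     and coherence_by_k[k] > coherence_by_k[k + 1]),
    key=coherence_by_k.get,
)

# When both criteria point to the same K, the tuning results coincide.
assert best_k == peak_k == 8
```

In practice each candidate model's keyword lists would also be inspected manually; the automated criteria only narrow the search.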

Latent topics in student written comments
The coherence scores for K ranging from 5 to 65 are displayed in Fig. 1. We found the highest coherence score (0.593) when the number of topics (K) was 8. Furthermore, we observed a rapid growth in topic coherence from K = 7 to K = 8 and a rapid decline from K = 8 to K = 9. Therefore, the results consistently indicated that K = 8 was the appropriate number of topics for our data.
Next, the pyLDAvis package was used to draw the distance map of the estimated topics. In Fig. 2, each bubble represents a topic. The size of the bubble reflects the prevalence of the topic, and the distance between bubbles reflects the extent to which topics differ from one another. A good topic model is one whose large, nonoverlapping bubbles are scattered across the plot. If bubbles are clustered in one place and thus overlap heavily, the topic model needs further improvement. The plot in Fig. 2 shows that our eight topics were fairly large and had few overlaps with one another, indicating that the topic model fit well. Hence, we selected this model as our final model. In Table 2, we list these 8 latent topics with the top 10 keywords associated with each topic. A manual label was assigned to each topic. As shown in Table 2, some latent topics can be linked to the explicit dimensions measured by the closed-ended numeric questions in the SET questionnaire, such as class time, course content, and developing knowledge or skills. However, other topics cannot be linked to dimensions of the SET numeric section, such as assignments, teaching method, course materials, instructor traits, and course evaluation. This indicates that students' written comments were broader and at the same time more specific than the closed-ended numeric questions.
Furthermore, we also read the written comments that had the highest probability within each topic. This provided a fuller understanding of the content. Table 3 shows two comment examples for each topic. After examining the distribution of topic words and the corresponding comments, we were able to interpret the meaning of each topic as follows: • Topic 1 addressed students' perspectives on assignments, including their difficulty level, relevance, timing, contribution to understanding, etc.; • Topic 2 addressed students' perspectives on course organization, including course expectations, coursework, course schedule, class time, study guides, etc.; • Topic 3 addressed students' perspectives on the direct factors that detract from or enhance students' learning; • Topic 4 addressed students' perspectives on the instructor's teaching method, including the use of group discussions, combining theory and practice, the use of lab components, the pace of the course, etc.; • Topic 5 addressed students' perspectives on the effective use of course materials, including the availability, clarity and informativeness of the materials; • Topic 6 addressed students' perspectives on the instructor's traits, including professionalism, sense of humor, passion for the subject, attitude and conduct toward the students, etc.; • Topic 7 addressed students' perspectives on students' experience with course content, including readings, videos, discussions, essays, etc.; and • Topic 8 addressed students' perspectives on students' experience with course evaluation, including grading policy, the tools and timing for posting grades, and the feedback and communication of grades.

Table 3 Two example comments for each topic

Assignments
1. I felt that the homework problems did not line up with the information I learned in class. The homework was a lot more complicated then what was taught in class and sometimes was not even in the notes. I also felt that my success would have been better if we were offered practice questions in class before a test.
2. The many example problems we were able to do in class was very helpful. I appreciated the fact that we were able to watch the lectures outside of the class so that class time could be used to demonstrations and practice problems. The office hours were also very useful for the homework assignments.

Course organization
1. The instructor was very clear about classroom expectations and assigned a manageable amount of coursework. Flexible and extensive office hours were also much appreciated. Class time was always taken very seriously, with not a single minute wasted.
2. Liked the study guides. There's just so much at times so it was great the professor organized the study guides. I felt that the readings made sense to what we were learning although may have not been timed exactly right. I like the way the class is broken up and sectioned.

Student learning
1. The main thing that detracted me from my learning was that, sometimes in the lecture notes, there was some jargon that I did not understand the terminology for, so I either had to look up the term or use context clues to find out what it means.
2. On some class days, especially as the semester went on, the class would get easily distracted by asking questions about the research paper or talking about the new psychology BA/BS programs. Some days it went on for half of the class, which would then be frustrating when it felt like we were trying to hurry through material.

Teaching method
1. I think that the group discussions that were enforced, were a great way of allowing us to stimulate our minds in a sense. In addition, acting out certain scenarios allowed to really take in what we were learning.
2. I think the fast pace made it difficult for me to stay on track at times. I think a small amount of time to go over questions or material learned in the previous lecture (or on practice problems) would have helped a lot.

Course materials
1. His way of lecturing with the aid of power point slides is not very effective at all. Part of that is because the slides themselves are not very informative.
2. Not having the powerpoint slides for lecture available online made taking notes and keeping up with the pace that the material was covered in class very challenging.

Instructor traits
1. She is very organized and prepared for each class. I love her use of Powerpoints because she makes it easy to take notes. She engages the class and you can tell that she is passionate about psychology. She makes learning fun and I actually enjoy studying for her class. She has furthered my interest in psychology and is a great role model. I wish she taught more courses because I will miss her class.
2. He is an outstanding professor who knows his material very well. He explains all course content clearly and offers examples that help enhance understanding. He keeps the class engaged with funny economics jokes and truly seems to care about each student's learning of economics. He is by far the best professor I've ever had.

Course content
1. The readings and videos watched were particularly interesting. I appreciated that the class was discussion based as this allowed me to come to class with my own thoughts prepared as well as talk a lot about my own experiences.
2. Many of the articles that we read and movies that we watched were really interesting and helped me become more aware of current events and important conflicts that have happened throughout the world.

Course evaluation
1. I have felt very lost in this class grade wise. There is no Canvas for the class so I haven't known where I stand grade wise all semester which is a little concerning considering I would like to pass.
2. The tests weren't handed back and our grades were not communicated to us at all. I never knew the class average or which problems I got wrong on the test. 40 min tests are stressful and almost impossible to do well on, let alone finish.

Topic distribution over course grade levels and course subjects
The topic model calculates not only the word distribution for each topic but also the topic distribution for each comment. Each comment is composed of multiple topics, among which normally only one is dominant. In this study, we extracted the dominant topic for each comment and calculated the frequency of each topic in various groups. Table 4 shows the frequency of dominant topics across comments, from which we observed the following: • Topic 2, addressing course organization, was least discussed; • Topic 1, addressing assignments, and Topic 7, addressing course content, were most discussed; • Compared to other grade levels, students in courses at grade level 1000 (i.e., freshmen) discussed Topic 8, addressing course evaluation, more and Topic 7, addressing course content, less; • Students in humanities and liberal arts courses discussed Topic 7, addressing course content, and Topic 3, addressing student learning, more, while those in social sciences and natural sciences courses were more concerned with Topic 1, addressing assignments, and Topic 8, addressing course evaluation.
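Extracting the dominant topic and tallying it by group can be sketched as follows. The per-comment topic distributions and grade levels below are hypothetical placeholders, not the study's data; with a fitted Gensim model, the distributions would come from the model itself.

```python
from collections import Counter

# Hypothetical per-comment topic distributions (8 topics each) paired with
# each comment's course grade level; illustrative values only.
comments = [
    {"grade_level": 1000, "topic_probs": [0.05, 0.02, 0.03, 0.05, 0.05, 0.05, 0.15, 0.60]},
    {"grade_level": 1000, "topic_probs": [0.70, 0.05, 0.05, 0.05, 0.05, 0.02, 0.03, 0.05]},
    {"grade_level": 3000, "topic_probs": [0.10, 0.05, 0.05, 0.05, 0.05, 0.05, 0.60, 0.05]},
]

def dominant_topic(topic_probs: list[float]) -> int:
    """Return the 1-based index of the topic with the highest probability."""
    return max(range(len(topic_probs)), key=topic_probs.__getitem__) + 1

# Tally dominant topics within each course grade level.
freq: dict[int, Counter] = {}
for c in comments:
    level = c["grade_level"]
    freq.setdefault(level, Counter())[dominant_topic(c["topic_probs"])] += 1

# In this toy sample, freshman comments are dominated by Topics 8 and 1.
assert freq[1000] == Counter({8: 1, 1: 1})
assert freq[3000] == Counter({7: 1})
```

The same tally, keyed by course subject instead of grade level, yields the distribution by field of study.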

Discussion and conclusion
Most prior studies of SET focus on the numeric questions or the numeric attributes of students' written comments. Only a scarcity of studies adopt a qualitative approach to study the content of students' comments, and concerns arise regarding the limited scope of such an approach due to the cost and time needed for the manual coding process. In this study, therefore, we investigated the possibility of using an innovative quantitative approach, namely topic modeling, to capture latent themes in textual data on a large scale. Our findings make a unique contribution in the following four aspects. First, the results of our study provide evidence that topic modeling can be a legitimate alternative for better understanding teaching and learning through students' written comments in SET surveys. For example, eight latent topics (assignments, course organization, student learning, teaching method, course materials, instructor traits, course content, and course evaluation) were found in students' written comments in our study. These results were well aligned with those from previous studies [1,6] that used a qualitative approach to examine students' written comments. However, unlike the three-layered categorization (dimensions, categories and subcategories) produced through manual coding in the two abovementioned studies, topic modeling proposes a flat categorization that facilitates faster and easier analysis in understanding and interpreting textual data on a large scale. In the past, because analyzing responses to open-ended questions on a large scale posed a major challenge, researchers and educators tended to rely on closed-ended numeric questions to understand teaching and learning in SET surveys.
Given that responses to open-ended questions can provide a nuanced understanding of students' learning that would not be captured by closed-ended questions, finding a way to reduce the cost of time and labor may increase the use of rich textual data in SET surveys. Our study provides evidence that topic modeling may be useful for this purpose.
Second, the results of our study show that some latent topics in students' written comments corresponded to the dimensions measured in the closed-ended questions of SET surveys. However, other latent topics were neglected by the closed-ended questions, including assignments, student learning, course materials, instructor traits and course evaluation. The difference between the latent topics underlying students' written comments and the dimensions measured by the numeric part of the SET survey shows that students' comments indeed provide feedback that is broader in scope and more specific in content. Thus, we can conclude that while closed-ended numeric questions (e.g., Likert-scale questions) serve their own specific purpose, introducing in-depth analysis of open-ended questions can enhance the scope and capabilities of SET surveys, leading to a better understanding of teaching and learning and to better interventions.
Third, our results show that the latent topics were addressed differently by freshmen (i.e., course grade level 1000) and students at other grade levels. Students were much more likely to comment on the aspect of course evaluation (e.g., grades) in freshman courses than in courses at other grade levels. This suggests that freshmen tend to care more about aspects such as course grading policies, feedback, and communication of grades. This may be partly because freshmen are in a transitional phase (i.e., from high school to college/university) and are thus more concerned with their performance in a new academic environment.
Fourth, our study also shows that different frequency patterns exist across fields of study. Students in the humanities and liberal arts were more likely to address student learning and course content in their comments. Meanwhile, students in the natural sciences tended to be more concerned about assignments and course evaluation. This is probably because teaching and learning in the natural sciences are more strongly shaped by the course content itself. Furthermore, the characteristics of assignments and grading used in natural science courses may have a clearer impact on students' learning outcomes than those used in other disciplines. Future research may examine the specific mechanisms behind these differences between subjects.
While this study's findings provide aggregate themes, which are beneficial to school administrators who oversee the educational activities of an entire school or university, classroom instructors can also benefit by using the results to improve their teaching. For instance, in entry-level courses, instructors should pay extra attention to the aspect of course evaluation, such as grading policy and grading communication, partly because freshmen, who are in transition from high school to college, are still adapting to the new social and academic environment. The rigidity of the college curriculum poses a significant challenge to these students, and they may need additional support and guidance regarding how to study effectively and obtain good grades. Furthermore, assignments appear to be a critical part of students' learning experience, especially for students enrolled in science courses. Therefore, instructors need to devote sufficient effort to designing high-quality assignments that align with the materials taught in class and to providing adequate support and guidance for students to solve the problems in those assignments. Last, for courses in the humanities and liberal arts, since students' comments mostly focus on course content, instructors' teaching would benefit from enriching the course content and leading students to a better understanding of theoretical and empirical concepts in an increasingly complex world.
There are some limitations to using topic modeling to analyze SETs in our study. One limitation concerns the representativeness of our sample. The data collected in this study represent the demographics of students enrolled at a private university in the southern United States. Future research may use topic modeling to analyze data from public universities in other regions. Another limitation concerns the inherently subjective and interpretive nature of the topic modeling method. Determining a suitable number of latent topics, as with any dimensionality reduction technique, depends largely on which solution appears most interpretable. Furthermore, LDA's accuracy is reduced in this study because students tended to write short phrases and few words, characteristics that constrain LDA's capability to provide an exact count of themes. Therefore, the results disclosed in this study represent a single potential thematic representation of the comments.

Fig. 1
Fig. 1 Coherence scores for different topic numbers

Fig. 2
Fig. 2 Distance map of topics

Table 1
An overview of the categories and subcategories identified in previous literature

Table 2
Topic summarization and the link to the SET-survey

Table 4
Frequency of dominant topics