1 Introduction

Surveys are a significant research tool that can help to gain insight into a study subject. Specifically, open-ended questions have been considered a critical element of surveys because they provide information to clarify ambiguities, to examine attitudes, and to detect spontaneous perceptions that were not anticipated during survey planning [18]. Consequently, these questions allow the researcher to explore a topic even when a lack of prior knowledge prevents the adequate formulation of closed questions. Common use cases of open-ended questions to study and analyze citizens’ perceptions about social indicators include surveys on education [13, 19, 24], health care [1, 17, 20], and social service systems [7]. The results of these studies allow the identification of relevant topics that matter to stakeholders and the detection of obstacles to improved performance, and they can help to explain and understand the impact of social reforms and their possible lack of improvement.

Despite the great benefits of using open-ended questions to acquire and analyze information about stakeholders’ perceptions and expectations, their processing is generally associated with a high workload. The main reason is that the traditional approach to this task involves analysts who read and manually categorize the whole dataset [18]. This process tends to be tedious and time-consuming. In addition, it can be susceptible to errors when different analysts process the data individually [22].

Several researchers have proposed strategies to explore and analyze text collections. At present, these techniques range from simple methodologies, such as frequency counts [21], to more complex Machine Learning (ML) based algorithms [16, 25, 26]. In particular, Topic Modelling (TM) based strategies have emerged as a powerful paradigm for automatically processing the semantic characteristics of large textual databases.

TM groups text instances under the assumption that each sample can be modeled as a function of latent variables called topics. In this context, a topic is defined by a set of words selected by statistical methods [2]. This approach is generally considered unsupervised because of the inference process involved in representing the content of each modeled topic. Applications of this methodology include software engineering, linguistic sciences, social networking, and so on [8, 23].

Latent Dirichlet Allocation (LDA) is a text analysis method used to represent the topic structure present in a collection of text documents [2]. Recent interesting results obtained with this approach include the identification of relevant topics for each coronavirus disease and the exploration of their corresponding research trends from academic papers and news [4], the modelling of key research topics in the big data literature [14], and the identification of evolving trends and underlying topics in humanoid robot research by analyzing scientific articles and patents [9]. In education, this approach has not been fully exploited. One study in the field focuses on the analysis and visualization of cognitive information to improve collaborative learning in classrooms. To this end, the work in [6] implements a Vector Space Model to develop the methodology, which was subsequently validated in an experimental case study; the results provide significant elements for the discussion of the student learning process. Recently, an LDA-based approach has also been used to analyze the responses of a teacher self-assessment survey at an Ecuadorian university. As a result of this case study, a set of main strategies that teachers can carry out in their classes to improve student retention was identified and discussed [3]. An alternative analysis was developed in [15], where massive open online course (MOOC) reviews were analyzed using LDA; the most important characteristics of courses for learners were identified and presented as a way to improve the overall MOOC learning experience.

This paper presents a complete methodology of collection, pre-processing, topic modelling and results analysis, based on LDA, to represent the categories emerging from several groups of stakeholders in a set of answers to open-ended questions about educational system limitations. As its key points, this approach covers the data collection, an initial exploratory analysis based on a relevant-word frequency metric, the topic modelling method, the hyper-parameter setting, and the final labelling stage. The survey evaluated in this study is designed to acquire information about the main expectations and difficulties regarding the current educational system in Bogota, Colombia. Considering the possible diversity of ideas among different stakeholders (students, parents and teachers), each group is analyzed separately. Based on this process and on the analysis provided by a team of experts in qualitative analysis, the results show the main similarities and differences between the considered groups.

This paper is organized as follows. Section 2 describes the methodology of pre-processing and analysis that is used to process the textual data of the case study under consideration. The algorithm used for topic modelling and analysis, and the corresponding approach to setting and interpreting its hyper-parameters, are also presented in this section. In Section 3, the results of the case study are detailed. Finally, Section 4 outlines the conclusions of the developed work.

2 Methodology

In this section, the methodology to model the topics in a set of unstructured textual data is presented. The case study analyzes the answers to open-ended questions designed to identify the current expectations and limitations of the educational system of Bogota from different points of view. Figure 1 details every stage of the proposed analysis methodology. During the first stage, the textual data is collected and pre-processed. Once the data is processed, the topic modelling analysis is performed through the implementation and tuning of an LDA-based model. In the final stage, a group of experts in this study area carries out the label identification of each topic, using the keyword and bigram information of each stakeholder group. It is important to highlight that the relevance of the identified topics is analyzed with reference to the problem under consideration: the case study examines the automatically generated topics to identify the main limitations that stakeholders find in the current educational system.

Fig. 1: Topic modelling methodology

2.1 Constructing the dataset

In the first stage, the open-ended questions for each stakeholder group are designed and the data collection is carried out. The generated dataset is then pre-processed to extract the main information used in the following stage.

2.1.1 Question design

To identify the most significant pedagogical and technical aspects to be improved in the educational process from different points of view, each stakeholder group has been asked a slightly different question, formulated as follows:

  (i) Students: According to your experience as a school/higher education student, describe the characteristics you expect to be changed in your educational environment to face the challenges that arise in your life after finishing school/university.

  (ii) Teachers: What characteristics of the pedagogical processes of the classroom, the institution and the educational system would you change to promote integral development during secondary/high school education?

  (iii) Parents: What elements in the educational process would you change to impact students’ lives in a significant way and allow them to face challenges on a personal, family and social level?

These questions have been designed in conjunction with a group of experts in education to focus the formulation on the points of interest of each stakeholder. In addition, a minimum answer length of 250 words was set to ensure that a collection of topics was addressed in each answer. The complete set of open-ended questions, as well as a sample of the multiple-choice questions included in this study for each stakeholder, can be found in the Appendix.

2.1.2 Text collection and pre-processing

The data utilized in this research were obtained from the Mission of Educators and Citizen Wisdom, a Bogota Secretary of Education initiative intended to define educational public policies for the city up to 2038. The mission’s main purpose is to listen to diverse citizen opinions about education. Therefore, several virtual and face-to-face spaces were organized to collect the perceptions and expectations of around one million people. Students, teachers, and parents contributed to creating an educational landscape of the entire city.

A set of open-ended questions was designed for each role and validated by subject-matter experts and psychometric specialists. Responses were acquired through several mechanisms: a widely announced web platform, and paper-based forms administered in streets, at bus stations and during new student enrolments. In addition, a large educational event promoted by the Secretary of Education allowed us to collect answers from more than 500,000 parents.

To summarize, the data collection stage allowed the analysis of 669,456 answers from parents, 41,390 answers from students and 7,814 answers from teachers, obtained from different sources. The data were then digitized (where necessary), and a pre-processing stage was subsequently carried out to guarantee data quality. This phase is particularly important for the analysis of unstructured textual information [10]. Figure 2 summarizes the sequential steps conducted in this process.

Fig. 2: Topic modelling methodology

First, this stage involves lowercase normalization, followed by the removal of special characters, punctuation and extra white spaces. The next pre-processing step is tokenization, whose main objective is to break the text down into smaller units called tokens. The text can be divided into words, characters or subwords (character n-grams); here, the data is tokenized by words, splitting each string into sub-strings. Based on this result, common words that add little meaning to a document (stop words) are also removed. Subsequently, a lemmatization process groups the different inflected forms of a word into a basic root form called a lemma. In addition, the singular form of the words is obtained.

The final step is to discard sparse terms that appear fewer than two times in the whole corpus, as well as terms that appear in more than 95% of the documents, without losing relevant relationships inherent in the text instances. This reduces the computation time of the subsequent phases of the analysis. Likewise, duplicated answers are removed. The final dataset, which is the input for the topic modelling analysis, is built from the results of the previously described pre-processing. It is important to note that the textual information analyzed in this survey was acquired in Spanish. The pre-processing steps were adapted to the particularities of this language, considering that natural language processing resources for Spanish (stop-word removal/lemmatization) are still under development and some exceptions have not yet been covered. Finally, the results presented in this work were translated into English by native speakers.
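The following is a minimal sketch of this pre-processing pipeline; the paper does not name its tooling, so spaCy (with its Spanish model es_core_news_sm) and gensim are assumptions here, and all function and variable names are illustrative.

```python
import re
import spacy
from gensim.corpora import Dictionary

# Illustrative pipeline; assumes "python -m spacy download es_core_news_sm".
nlp = spacy.load("es_core_news_sm", disable=["parser", "ner"])

def preprocess(answer):
    # Lowercase normalization; remove special characters, punctuation
    # and extra white spaces (Spanish letters are kept).
    text = re.sub(r"[^a-záéíóúüñ\s]", " ", answer.lower())
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenize by words, drop stop words, and lemmatize (the lemma also
    # maps plural forms back to a singular root).
    return [tok.lemma_ for tok in nlp(text)
            if not tok.is_stop and len(tok.text) > 1]

answers = ["Quisiera clases más didácticas y acceso a la tecnología.",
           "Más acceso a la tecnología en las clases."]
docs = [preprocess(a) for a in sorted(set(answers))]  # drop duplicate answers

# Discard sparse terms (fewer than two occurrences, approximated here via
# document frequency) and terms present in more than 95% of the documents.
dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=2, no_above=0.95)
```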

2.2 Topic modelling

After the data is processed, the topic modelling analysis is carried out, based on the following three main steps: the term-document matrix generation, where an initial exploratory analysis is performed; the implementation of the unsupervised algorithm (LDA); and the final setting of the related hyper-parameters.

2.2.1 Term-document matrix generation and exploratory analysis

During the processing and analysis of natural language, the textual instances are characterized as bags of words, computationally represented by a term-document matrix. In this context, the term-document matrix can be considered a simplified version of the textual corpus, and it is the input to the algorithms used to model the corpus topics [11]. It is important to note that the order of the textual instances does not encode any implicit relation; in fact, during the computation of the term-document matrix, all textual elements are randomly shuffled for the required statistical processing and analysis. As such, strategies such as probabilistic Latent Semantic Analysis (PLSA) and LDA rely on the assumption of exchangeability of words and textual instances [12].

Once the term-document matrix is generated, the most frequently used words and word sequences, known as n-grams, are analyzed. Specifically, a uni-gram is a single word together with its frequency, a bi-gram is a pair of consecutive words together with its frequency, and so on. The frequency of these word sequences helps to explore the most common concepts in the corpus. This analysis is carried out as a preliminary step to understand the recurrent ideas in the dataset, which will later support the identification of the topics. To weigh the importance of each word in relation to the other instances of the same corpus, the Term Frequency - Inverse Document Frequency (TF-IDF) is computed. Calculating the TF-IDF requires the word frequency within a document (in this case, an answer) and the word frequency across the other documents of the corpus. In other words, the following elements are calculated:

  • Term Frequency (TF): Frequency of each token or word t, which appears in the document d, tf(t,d) = f(t,d).

  • Inverse Document Frequency (IDF): the logarithm of the number of documents N divided by the number of documents \(df_{t}\) that contain the token t (see (1)).

    $$ \text{idf}(t, N)= \log \frac{N}{df_{t}} $$
    (1)

Lastly, the TF-IDF is calculated by multiplying the TF by the IDF:

$$ \text{tfidf}= \text{tf}(t,d) \cdot \text{idf}(t,N) $$
(2)

This metric gives greater relevance to words that characterize individual answers, i.e., words that are frequent within an answer but appear in few answers overall, while down-weighting words that are common across most of the corpus.
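As a concrete illustration, the sketch below counts bi-gram frequencies and computes the TF-IDF scores of (1)-(2) directly from their definitions, using the tokenized answers from the previous stage; a library such as scikit-learn could be used instead, although its TF-IDF variant adds smoothing.

```python
import math
from collections import Counter

def bigram_counts(docs):
    # Frequency of each pair of consecutive words (bi-grams) in the corpus.
    return Counter((d[i], d[i + 1]) for d in docs for i in range(len(d) - 1))

def tfidf(docs):
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # df_t: docs containing t
    out = []
    for d in docs:
        tf = Counter(d)                            # tf(t, d) = f(t, d)
        out.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return out

docs = [["class", "virtual", "class"], ["class", "technology"]]
print(bigram_counts(docs))  # e.g. ('class', 'virtual'): 1
print(tfidf(docs))          # 'class' scores 0: it appears in every answer
```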

2.2.2 LDA model

To obtain the topics of the analyzed set of answers, a topic modelling strategy using LDA is implemented. LDA is an unsupervised machine learning technique that discovers patterns, or latent topics, in data. It is commonly used in studies with small numbers of observations or with unstructured text data, such as answers to open-ended questions. LDA assigns every word a probabilistic score for the most probable topic it could belong to, where each topic is a mixture of words and each document is a mixture of topic probabilities.

In this context, the model considers the corpus \(D = \{\mathbf{w}_{1},\mathbf{w}_{2},\cdots,\mathbf{w}_{M}\}\) as a collection of M documents, where each document \(\mathbf{w}_{m} = (w_{1},w_{2},\cdots,w_{N_{m}})\) contains \(N_{m}\) words drawn from a vocabulary of W unique words. Each document is then represented as a combination of k bag-of-words topics, and each topic is modeled by means of a discrete probability distribution that establishes the probability of each word being present in that topic. Figure 3 shows the generative process of LDA. In this model, α and η are the hyper-parameters of the Dirichlet distributions, 𝜃 is the distribution of topics for each instance i, and β is the distribution of words for each topic k. In addition, z indicates the topic from which a particular word is sampled, and w represents a single word.

Fig. 3: LDA model representation

In this context, the probability distribution over words within a given answer is:

$$ P(w_{i}) = \sum\limits^{T}_{j=1}P\left( w_{i} \mid z_{i} = j\right) P \left( z_{i}= j\right) $$
(3)

where the sum runs over the T = k topics, \(P(z_{i} = j)\) is the probability that the j-th topic was sampled for the i-th word, and \(P(w_{i} \mid z_{i} = j)\) is the probability of word \(w_{i}\) under topic j.
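A minimal sketch of fitting this model is shown below. The paper does not name its implementation; however, the prior options described in Section 2.2.3 match gensim's interface, so gensim's LdaModel is used as a stand-in, with the tokenized `docs` assumed to come from the pre-processing stage.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(docs)                # docs: tokenized answers
bow = [dictionary.doc2bow(d) for d in docs]  # term-document counts

lda = LdaModel(
    corpus=bow,
    id2word=dictionary,
    num_topics=8,      # k, tuned in Section 2.2.3
    alpha="auto",      # document-topic prior (learned from the corpus)
    eta="auto",        # topic-word prior (learned from the corpus)
    iterations=100,
    random_state=0,    # LDA starts from randomness; fix it to reproduce runs
)

# Each topic is a distribution over words; print its 10 most likely terms.
for tid, words in lda.show_topics(num_topics=-1, num_words=10, formatted=False):
    print(tid, [w for w, _ in words])
```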

2.2.3 Hyperparameter tuning

LDA treats α, η, and k as parameters and randomizes all other values (excluding w). Based on this consideration, the goal is to determine the α and η that maximize the probability of generating the actual corpus, by determining the best instance/topic (𝜃) and topic/word (β) distributions.

For the LDA implementation, hyperparameter tuning is applied to set the number of topics (k), the document-topic density parameter (α), the word-topic density parameter (η), and the number of iterations. To measure and compare model performance, the coherence score c_v is calculated. This probabilistic measure estimates whether the words within a topic go well together: a high coherence score means the words are closely related, while a very low score indicates that the topic contains words that do not occur together in the same documents or are not closely related.

Taking into account the corpus (bag of words associated with the complete answers) of each stakeholder group, a series of sensitivity tests is carried out to determine the best hyperparameters for the model. As previously stated, four parameters are considered for the LDA modelling: k, α, η, and the number of iterations. Consequently, the hyperparameter tuning consists of three tests (a condensed sketch of this search is given after the list):

  (i) Finding the number of topics k.

  (ii) Finding the best Dirichlet hyperparameters α and η. To calculate α, the following approaches are considered:

    • Fixed normalized asymmetric prior of

      $$ \begin{array}{@{}rcl@{}} \alpha = \left (\frac{1}{\left (1+\sqrt{k} \right )},\frac{1}{\left (2+\sqrt{k} \right )}...\right )\rightarrow \\ \alpha_{i}= \frac{1}{\left (i+\sqrt{k} \right )}, i = 1,2,\cdots,k \end{array} $$
      (4)

      where i is the topic index and k is the number of topics.

    • Fixed normalized symmetric prior of 1/k (one over the number of topics).

    • An asymmetric prior learned from the corpus.

    • An array of symmetric values, uniform across all k topics, with values from 0.01 to 1 in steps of 0.3 [5].

    For the η calculation, three different approaches are considered:

    • A scalar for a symmetric prior over the topic/word probability.

    • An asymmetric prior learned from the corpus.

    • An array of symmetric values for all W words, with values from 0.01 to 1 in steps of 0.3.

    By exploring these alternatives, the α and η values with the highest coherence score are selected. In short, α is the hyperparameter of the Dirichlet distribution that generates the k-dimensional document-topic vectors (𝜃), while η generates the W-dimensional topic-word vectors (β). In turn, 𝜃 and β act as parameters of the categorical distributions from which topics and words are sampled, respectively.

  (iii) Obtaining the optimal number of iterations of the model: once k is set and the best values of α and η are calculated, the best number of iterations is finally selected. This parameter controls the repetitions of the inference loop over each document. It is important to set this value sufficiently high, so a range from 50 to 150 iterations is explored and the value providing the best coherence score is chosen.
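Under the same assumptions as the earlier sketches (gensim, with `docs`, `dictionary` and `bow` already built), the three tests could be condensed as follows. In gensim, the options 'symmetric', 'asymmetric' and 'auto' correspond to the 1/k prior, the fixed asymmetric prior of (4), and the prior learned from the corpus, respectively; in practice each step would be run and its coherence curve inspected separately.

```python
from gensim.models import LdaModel, CoherenceModel

def cv_score(k, alpha, eta, iters):
    # Fit one LDA configuration and return its c_v coherence.
    lda = LdaModel(bow, id2word=dictionary, num_topics=k,
                   alpha=alpha, eta=eta, iterations=iters, random_state=0)
    return CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                          coherence="c_v").get_coherence()

# (i) number of topics k, evaluated from 1 to 14
k = max(range(1, 15), key=lambda n: cv_score(n, "symmetric", None, 50))

# (ii) Dirichlet priors: symmetric grid 0.01..1 (step 0.3), plus the
# asymmetric prior of (4) and the prior learned from the corpus
grid = [0.01, 0.31, 0.61, 0.91, 1] + ["symmetric", "asymmetric", "auto"]
alpha = max(grid, key=lambda a: cv_score(k, a, None, 50))
eta = max([0.01, 0.31, 0.61, 0.91, 1, "auto"],
          key=lambda e: cv_score(k, alpha, e, 50))

# (iii) number of iterations, from 50 to 150 in steps of 10
iters = max(range(50, 160, 10), key=lambda n: cv_score(k, alpha, eta, n))
```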

With these steps, the best parameters (k, α, η, and number of iterations) are selected so that the model attains the highest c_v, which in turn generates more meaningful and interpretable topics. Hence, the final step of the topic modelling is to analyze the topics that the model generated, draw conclusions about the theme of each topic, and analyze the topics in terms of their distribution in the dataset.

In addition to this analysis, the intertopic distance is computed to analyze the closeness among the modeled topics. To visualize it, the Jensen-Shannon divergence (JSD) between topics is first calculated. This metric is a symmetrized and smoothed version of the Kullback-Leibler divergence and is used to measure the similarity between two distributions. The Jensen-Shannon divergence of P and Q is defined as:

$$ \text{JSD} (P \parallel Q)= \frac{1}{2} D (P \parallel M) + \frac{1}{2} D (Q \parallel M) $$
(5)

where \(M= \frac {1}{2}(P +Q)\) and D denotes the Kullback-Leibler divergence. The Jensen-Shannon distance is obtained by taking the square root of this divergence. Consequently, the probability distributions of each topic (β) extracted by the LDA algorithm are analyzed and the distance between each pair of topics is computed. Multidimensional scaling is then used to project the intertopic distances onto a 2D plane. In this representation, the area of each circle (blob) represents the importance of the corresponding topic over the entire corpus, while the distances between blobs indicate the closeness or similarity between topics: the circle centers are placed according to the computed intertopic distances, and each circle’s area reflects the prevalence of its topic. Hence, during the analysis, the preferred model is the one with the fewest (preferably no) overlapping circles, spread throughout the graph.
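A sketch of this computation is given below, assuming the fitted `lda` model and `bow` corpus from the earlier sketches; tools such as pyLDAvis produce this map directly, so the code only illustrates the underlying steps. Note that SciPy's jensenshannon already returns the distance, i.e., the square root of the divergence in (5).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon  # JS distance = sqrt(JSD)
from sklearn.manifold import MDS

beta = lda.get_topics()  # k x W topic-word distributions
k = beta.shape[0]

# Pairwise Jensen-Shannon distances between the topic-word distributions.
dist = np.array([[jensenshannon(beta[i], beta[j]) for j in range(k)]
                 for i in range(k)])

# Multidimensional scaling projects the distances onto a 2D plane;
# the resulting points are the circle centers of the intertopic map.
centers = MDS(n_components=2, dissimilarity="precomputed",
              random_state=0).fit_transform(dist)

# Circle areas: prevalence of each topic over the entire corpus.
prevalence = np.zeros(k)
for d in bow:
    for t, p in lda.get_document_topics(d, minimum_probability=0.0):
        prevalence[t] += p
prevalence /= prevalence.sum()
```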

2.3 Expert analysis

At this stage, based on the results of the topic modelling algorithm, the labels that identify each of the obtained categories were analyzed for each stakeholder group. To this end, a team of experts in qualitative analysis evaluated the keywords and bigrams returned by the proposed methodology for each topic.

Before analyzing the information of each model, a manual corpus-labeling process was performed. In this task, randomly selected subsets of the answers (5% for parents and 10% for students and teachers) were analyzed. This approach consists of a general reading of the chosen answers and the identification of the macro-descriptors to which each stakeholder refers. The manual labeling provided relevant information to establish criteria for the final category-tagging process.

Based on these results and the keyword/bigram information of each model, we titled each category whenever a pattern allowed a satisfactory label. As a result of this stage, a logical association between the descriptive keywords and the related category is obtained. It is important to note that no descriptor was assigned to word groups whose topics seemed incomprehensible.

3 Experimental results and discussion

3.1 Preliminary analysis

After the dataset construction stage was finished and the term-document matrix was generated, a TF-IDF analysis was performed and the relevant terms in each corpus were identified. In the unigram (single word) case, the terms present in more than 95% of the answers were skipped. The most important words, bigrams and trigrams for each stakeholder group are listed in Tables 1, 2, and 3.

Table 1 Words, bigrams and trigrams with higher TF-IDF in the student group
Table 2 Words, bigrams and trigrams with higher TF-IDF in the teacher group
Table 3 Words, bigrams and trigrams with higher TF-IDF in the parent group

Considering these results, the answers associated with each bigram and trigram were extracted and an initial qualitative analysis was performed. This stage allowed us to identify the main recurrent idea behind each bigram/trigram obtained in the preliminary analysis, for each stakeholder group. From the results, it is possible to see that the students’ answers involve ideas about having their school as a space with large green areas and special care for the environment, where they could again have face-to-face and outdoor classes. Likewise, with respect to classes, the need for dynamic and didactic classes and ludic activities, where the teacher understands the students’ needs and focuses on the development of their skills, is identified. An additional relevant concept in their answers concerns the skills needed to complete a resume. Finally, these participants see their educational process as an opportunity to enter a university, where the knowledge addressed at school could contribute to a better future and improve, in some way, their quality of life.

When analyzing the complete answers of the teacher group, a great concern can be observed about having an adequate number of students in the classroom, as well as about promoting comprehensive development and meaningful learning during the educational process. Specifically, this group gives particular importance to the pedagogical and learning processes in the classroom, including reading-writing processes and ludic activities, and to the incorporation of technological tools and didactic material for the development of competencies. Finally, the importance of social-emotional skills and of the participation of parents in the teaching process of their children is also mentioned.

The parents’ answers reveal a focus on awakening the students’ interest, highlighting the importance of learning in a didactic and amusing way and of developing skills that prepare students for the future and impact their daily life. The relevance of developing a life project is also pointed out, together with caring for the environment and values such as respect.

3.2 Topic modelling results

After carrying out the exploratory analysis, an exhaustive search over the hyperparameters k (number of topics), α (document-topic Dirichlet prior), η (topic-word Dirichlet prior) and the number of iterations was performed to optimize the results. Following the methodology described in the previous section, the first step was to evaluate models with 1 to 14 topics and to select the one with the highest coherence value. The other parameters were set to their defaults in the LDA model, with α and η both symmetric and equal to one over the number of topics (1/k). The coherence values as a function of the number of topics are shown in Fig. 4.

Fig. 4: Coherence score for each stakeholder group, varying the number of topics k

Based on these results, the selected value is the number of topics that marks the end of the rapid growth of the coherence curve, where a suitable number of topics is obtained and the topics can be interpreted without many keywords being repeated across categories. From Fig. 4, the k values with the highest c_v are obtained with 8 and 10 topics in the student group; 5, 7 and 11 in the teacher group; and 9, 10 and 13 in the parent group. Accordingly, the models with these numbers of topics were evaluated, calculating the intertopic distance and selecting the k value that yields the most meaningful and interpretable topics.

As discussed in the previous section, the best model has the fewest overlapping circles, with the topics spread all over the graph representing the intertopic distance (Fig. 5). During the evaluation process, it was observed that as the number of topics increased, more small circles appeared (possibly subtopics) and more overlapping blobs were present in the analysis. It is also important to consider that a greater number of topics makes them less comprehensible. Therefore, the largest circles and the least overlap were obtained with k equal to 8, 7 and 10 for the student, teacher and parent groups, respectively (see Fig. 5).

Fig. 5: Intertopic distance for the selected models of the student, teacher and parent groups

With the optimal number of topics fixed, the α and η parameters are tuned to obtain the highest coherence score. For α, both symmetric and asymmetric values were considered, while for η only symmetric values were considered. For the symmetric case, uniformly distributed values (0.01, 0.31, 0.61, 0.91 and 1) were evaluated [5]. It is important to highlight that a low α in a symmetric distribution means that each document is likely to contain a mixture of just a few topics, whereas a high α means that each document is likely to contain a mixture of most of the topics rather than a single one. Likewise, a high η means that each topic is likely to contain a mixture of most of the words, with a smoother distribution of weight across the vocabulary. In addition to α and η, the number of iterations was also tuned, over a range from 50 to 150 in steps of 10. The results with the highest coherence values for each stakeholder group can be seen in Table 4. As the results show, α is small for the parent group, meaning that, in proportion to the number of answers, the set of textual instances is modeled with a small number of topics, while it is larger for the student and teacher groups. Likewise, students and parents have a higher η, which means that each topic is modeled as a mixture of a considerable number of words.

Table 4 Best parameters found for the model of the student, teacher and parent groups

The number of iterations is similar in every group, with a smaller value for the parent group; this is an expected result, given the number of answers analyzed for this stakeholder. With the generated model for each group, the 10 most likely keywords and the most frequent bigrams were found for each topic (see Tables 5, 6 and 7). One sign of a good topic model is the possibility of labeling each topic from its top words/bigrams. As such, an initial category was assigned to each cluster, based on an initial qualitative assessment, for each stakeholder group. This initial category only seeks to provide a preliminary label for the different groups; these labels were neither provided to the expert team nor fed back to the LDA model. Specifically, the labels were chosen by analyzing the words/bigrams of each topic, with their probabilities and frequencies respectively, and by evaluating the answers most likely to belong to each topic. Although the models presented for each stakeholder seem to have consistent and interpretable topics, it is important to highlight that no single topic of any model was able to describe the entire analyzed dataset. The LDA parameters are the key elements that characterize the models: because LDA begins with a degree of randomness, it generates a slightly different topic model every run. However, in this case, the topics produced for each stakeholder in each run were similar.

Table 5 Keywords, and bigrams for each topic modeled in the student group
Table 6 Keywords, and bigrams for each topic modeled in the teacher group
Table 7 Keywords, and bigrams for each topic modeled in the parent group

The relevant topics that could be labeled for the student group include the need for language skills, preparation for the real world, the use of didactic strategies, access to higher education, the importance of social relations at school, the improvement of the facilities of educational institutions, the limitations experienced during virtual classes, and the importance of the use of technology. Meanwhile, teachers highlight the reduction of the number of students per classroom, the importance of family involvement, and the integral development and emotional intelligence of students; similar to the students, an important number of their answers focus on pedagogical strategies, the learning process and the use of technology. Parents, in turn, attach particular importance to the need for both theory and practice in the learning process, the use of strategies to awaken the students’ interest, the development of talent and skills in their children, and the importance of social interaction and instruction in values. Similar to the students, they highlight preparation for the real world, access to higher education, the use of technology, and the limitations experienced during virtual classes. Finally, in accordance with the teachers, they give prominence to family involvement during the learning process.

Based on these results, the final classification of answers into the different topics for each stakeholder group can be seen in Fig. 6. These distributions show that preparation for the real world and social relations at school are the most recurrent topics addressed by students, while teachers are more focused on pedagogical strategies in the classroom and parents are more interested in awakening the students’ interest in class and in the development of talent and skills.

Fig. 6: Answer distribution for the selected models of the student, teacher and parent groups

3.3 Expert analysis results

To complete the analysis, the expert team in qualitative analysis assessed the keywords and bigrams obtained for each topic and, based on a preliminary manual categorization, defined a more descriptive title for each one. During the process, they followed these steps:

  1. The kinds of answers for each question were determined and ordered according to each stakeholder.

  2. A sample of the answers (5% for parents, 10% for students and teachers) was selected for the manual categorization.

  3. A list of answers was drawn up for the questions of the different stakeholders and the first categories were derived.

  4. A logical grouping of descriptive categories was made and descriptors were established (see the Appendix).

  5. Based on the previous results, a categorization and coding manual was built for the responses of the different stakeholders.

  6. The answers were assigned to each category and the frequency of each category was calculated.

  7. A triangulation analysis of the qualitative results against the results of the automatic classification (LDA) was performed.

  8. The categories were adjusted for each analyzed group.

Specifically, during the triangulation-analysis step, the categories established by the experts are matched with the groups obtained by the LDA model. This matching process was developed through the following steps (a toy sketch of the scoring rule appears after the list):

  1. Reading of all topics, bigrams and trigrams by stakeholder.

  2. Performing the qualitative analysis between categories and descriptors, and bigrams and trigrams. This process has the following characteristics:

    (a) Each category obtained from the manual analysis was scored against each group found by the LDA model, assigning a qualitative coherence value.

    (b) The score, defined from 0 to 1, took the bigrams and trigrams of each group into account. It was computed by dividing the number of bigrams or trigrams consistent with the suggested category by the total number of bigrams and trigrams of the topic. The bigrams and trigrams selected per category were those present in more than 10% of the observations.

    (c) The category with the highest score, provided it was higher than 0.7, was assigned as the final category of each LDA group.

  3. In each case, the consistency between the proportions of the manual categorization of the proposed category and the LDA group was finally validated. In all cases, the results were congruous.
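The scoring rule of step 2(b) amounts to a simple fraction; a toy sketch with hypothetical n-grams follows.

```python
def category_score(topic_ngrams, consistent_ngrams):
    """Fraction of a topic's frequent bigrams/trigrams (those present in
    more than 10% of the observations) judged consistent with a candidate
    expert category; the label is assigned if the best score exceeds 0.7."""
    return len(set(consistent_ngrams) & set(topic_ngrams)) / len(topic_ngrams)

# Hypothetical example: 3 of the 4 frequent n-grams match the candidate
# category, so the score (0.75) clears the 0.7 threshold.
topic = ["virtual class", "internet access", "online class", "green area"]
assert category_score(topic, ["virtual class", "internet access",
                              "online class"]) == 0.75
```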

Based on this analysis, the final descriptive labels selected for each topic can be seen in Tables 8, 9 and 10. The results show that the new labels involve more detail about the focus of the answers, which was the main objective of the preliminary manual labeling of the selected data sample.

Table 8 Final labels for each topic in the student group
Table 9 Final labels for each topic in the teacher group
Table 10 Final labels for each topic in the parent group

4 Conclusions

To assess the degree of public satisfaction with public policies (addressed in sectors of mutual interest as important as education), surveys are commonly used to understand the point of view of stakeholders (e.g., students, teachers, parents). These surveys allow the collection of valuable information about possible lines of improvement in the education process. Usually, these tools include open-ended questions focused on identifying spontaneous thoughts and discovering new lines of action. Although open-ended questions allow the acquisition of new information, they also require a large workload and manual processing time. This has been considered a significant disadvantage, discouraging the use of this kind of question and precluding the collection of information of great importance.

This study presents a complete methodology for the collection, pre-processing and automatic analysis of answers to open-ended questions, using an unsupervised approach based on the identification of latent topics. Additional insight is added to the automatically obtained topic labels through an initial exploratory analysis using the TF-IDF metric and a fine-grained labeling provided by an expert team in qualitative analysis. This approach allows us to model the topics discussed in the collected answers and to obtain a macro-perspective of the perception of the education system from different points of view. The methodology helps to reduce the workload and processing time required to analyze unstructured textual data from different sources, such as answers acquired through open-ended questions.

During the analysis, three groups of stakeholders were surveyed: students, teachers and parents. The questions were structured for each stakeholder to obtain information about the identified limitations, and the aspects of the educational system to be changed, in line with the goals of the respective participant role. This application provides important information about potential lines of action to improve the perception and satisfaction of the population in the education sector. As a result, the categories generated by the models, together with the expert feedback, allowed us to clearly identify the relevant topics for each stakeholder. These results suggest that this methodology can be used to extract different kinds of information in this field.

The results obtained from the methodology presented in this work show that some topics are addressed by only one group of participants. Only the students highlight the importance of foreign language proficiency, investment in infrastructure, and strategies to improve school coexistence. In turn, teachers emphasize pedagogical methodologies and curricular change, the reduction of the number of students per class, and the development of skills and competences focused on an integral development that integrates multiple intelligences. Parents, for their part, were interested in instruction in values, the importance of teaching interpersonal skills, and changes to traditional education to awaken the students’ interest. As a complement, both students and parents underline the relevance of wider coverage of and access to higher education, the development of life skills and competences, and the limitations of online classes and access. Teachers and parents highlight the importance of greater family involvement in the educational process. As a common topic for all groups, access to and use of new technologies in education was reported to be an important element to consider in changing the education system.

It is important to highlight that the proposed methodology is practically applicable for identifying prominent underlying topics in a large collection of responses to open-ended questions oriented to multiple stakeholders. The questionnaire design, acquisition, pre-processing, automatic categorization and expert feedback stages could be applied (without loss of generality) to study and analyze a macro-perspective of multiple stakeholders’ perceptions in any application. However, some considerations must be analyzed in particular: the number of responses must be large enough to obtain a model with acceptable performance and to take advantage of the time reduction during the categorization analysis, and the changes in the questionnaire between the different stakeholders determine what information can be extracted for the same topic from multiple points of view. The remaining stages can be replicated for similar tasks, such as analyzing open-ended feedback or discussion forums.

In future work, the analysis will focus on deepening the stakeholders’ perception of the educational system, with a subdivision based on grades and level of education. In this way, students, teachers and parents will be divided into sub-stakeholders, and the questionnaire will delve deeper into the topics of interest of each stakeholder reported in the present study. Considering that LDA-based models do not properly estimate correlations between topics, because of the nature of the Dirichlet distribution, an additional line of action is the automatic analysis of the relationships between topics through the modelling of spatial distributions; this approach aims to avoid the overlap of concepts among different categories. Complementary studies could involve the acquisition of new variables such as age, gender or residence location, as well as information from other areas (e.g., the corporate sector, administrative employees of educational institutions). This new data could help to expand the scope of our results.