1 Introduction

Surveys are a significant research tool that can help to gain insight into a study subject. Specifically, open-ended questions have been considered a critical element of surveys because they provide information to clarify ambiguities, to examine attitudes, and to detect spontaneous perceptions that were not anticipated during survey planning [18]. Consequently, these questions allow the researcher to explore a topic even when a lack of prior knowledge prevents the adequate formulation of closed questions. Common use cases of open-ended questions to study and analyze citizens’ perceptions about social indicators include surveys on education [13, 19, 24], health care [1, 17, 20], and social service systems [7]. The results of these studies allow the identification of relevant topics that matter to stakeholders and the detection of obstacles to improved performance, and they can help to explain and understand the impact of social reforms and their possible lack of improvement.

Despite the great benefits of using open-ended questions to acquire and analyze information about stakeholders’ perceptions and expectations, their processing is generally associated with a high workload. The main reason is that the traditional approach to this task involves analysts who read and manually categorize the whole dataset [18]. This process tends to be tedious and time-consuming. In addition, it can be susceptible to errors when different analysts process the data individually [22].

Several researchers have proposed strategies to explore and analyze text collections. At present, these techniques range from simple methodologies, such as frequency counts [21], to more complex Machine Learning (ML) based algorithms [16, 25, 26]. In particular, Topic Modelling (TM) based strategies have emerged as a powerful paradigm for automatically processing the semantic characteristics of large textual databases.

TM groups text instances under the assumption that each sample can be modeled as a function of latent variables called topics. In this context, a topic is defined by a set of words selected by statistical methods [2]. This approach is generally considered unsupervised because of the inference process involved in representing the content of each modeled topic. Applications of this methodology include software engineering, linguistic sciences, social networking, and so on [8, 23].

Latent Dirichlet Allocation (LDA) is a text analysis method used to represent the topic structure present in a collection of text documents [2]. Recent interesting results obtained with this approach include the identification of relevant topics for each coronavirus disease and the exploration of their corresponding research trends from academic papers and news [4], the modelling of key research topics in the big data literature [14], and the identification of evolving trends and underlying topics in humanoid robot research by analyzing scientific articles and patents [9]. In education, this approach has not been fully exploited. One study in the field focuses on the analysis and visualization of cognitive information to improve collaborative learning in classrooms. To this end, the work in [6] implements a Vector Space Model to develop the methodology, which was subsequently validated in an experimental case study; the results provide significant elements for the discussion of the student learning process. Recently, an LDA-based approach has also been used to analyze the responses of a teacher self-assessment survey at an Ecuadorian university. As a result of this case study, a set of main strategies that teachers can carry out in their classes to improve student retention was identified and discussed [3]. An alternative analysis was developed in [15], where massive open online course (MOOC) reviews were analyzed using LDA; the most important characteristics of courses for learners were identified and presented as a way to improve the overall MOOC learning experience.

This paper presents a complete methodology of collection, pre-processing, topic modelling and results analysis, based on LDA, to represent the categories emerging from several groups of stakeholders in a set of answers to open-ended questions about educational system limitations. As its key points, this approach covers the data collection, an initial exploratory analysis based on a relevant-word frequency metric, the topic modelling method, the hyper-parameter setting, and the final labelling stage. The survey evaluated in this study is designed to acquire information about the main expectations and difficulties regarding the current educational system in Bogota, Colombia. Considering the possible diversity of ideas among different stakeholders (students, parents and teachers), each group is analyzed separately. Based on this process and on the analysis provided by a team of experts in qualitative analysis, the results show the main similarities and differences between the considered groups.

This paper is organized as follows. Section 2 describes the methodology of pre-processing and analysis that is used to process the textual data of the case study under consideration. The algorithm used for topic modelling and analysis, and the corresponding approach to setting and interpreting its hyper-parameters, are also presented in this section. In Section 3, the results of the case study are detailed. Finally, Section 4 outlines the conclusions of the developed work.

2 Methodology

In this section, the methodology to model the topics in a set of unstructured textual data is presented. The case study analyzes the answers to open-ended questions designed to identify the current expectations and limitations of the educational system of Bogota from different points of view. Figure 1 details every stage of the proposed analysis methodology. During the first stage, the textual data is collected and pre-processed. Once the data is processed, the topic modelling analysis is performed through the implementation and tuning of an LDA-based model. In the final stage, a group of experts in this study area carries out the label identification of each topic, using the keyword and bigram information of each stakeholder group. It is important to highlight that the relevance of the identified topics is analyzed with reference to the problem under consideration: the case study examines the automatically generated topics to identify the main limitations that stakeholders find in the current educational system.

Fig. 1: Topic modelling methodology

2.1 Constructing the dataset

In the first stage, the open-ended questions for each stakeholder group are designed and the data collection is carried out. The generated dataset is then pre-processed to extract the main information used in the following stage.

2.1.1 Question design

To identify the most significant pedagogical and technical aspects to be improved in the educational process from different points of view, each stakeholder group has been asked a slightly different question, formulated as follows:

  (i) Students: According to your experience as a school/higher education student, describe the characteristics you expect to be changed in your educational environment to face the challenges that arise in your life after finishing school/university.

  (ii) Teachers: What characteristics of the pedagogical processes of the classroom, the institution and the educational system would you change to promote integral development during secondary/high school education?

  (iii) Parents: What elements in the educational process would you change to impact students’ lives in a significant way and allow them to face challenges on a personal, family and social level?

These questions have been designed in conjunction with a group of experts in education to focus the formulation on the points of interest of each stakeholder. In addition, a minimum answer length of 250 words was set to ensure that a collection of topics was addressed in each answer. The complete set of open-ended questions, as well as a sample of the multiple-choice questions included in this study for each stakeholder, can be found in the Appendix.

2.1.2 Text collection and pre-processing

The data utilized in this research were obtained from the Mission of Educators and Citizen Wisdom, a Bogota Secretary of Education initiative intended to define educational public policies for the city up to 2038. The mission’s main purpose is to listen to diverse citizen opinions about education. Therefore, several virtual and face-to-face spaces were organized to collect the perceptions and expectations of around one million people. Students, teachers, and parents contributed to creating an educational landscape of the entire city.

A set of open-ended questions was designed for each role and validated by subject-matter experts and psychometric specialists. Responses were acquired through several mechanisms: a widely announced web platform, and paper-based forms administered in streets, at bus stations and during new student enrolments. In addition, a large educational event promoted by the Secretary of Education allowed us to collect answers from more than 500,000 parents.

To summarize, the data collection stage allowed the analysis of 669,456 answers from parents, 41,390 answers from students and 7,814 answers from teachers, obtained from different sources. The data were then digitized (where necessary), and a pre-processing stage was subsequently carried out to guarantee data quality. This phase is particularly important for the analysis of unstructured textual information [10]. Figure 2 summarizes the sequential steps conducted in this process.

Fig. 2: Topic modelling methodology

First, this stage involves lowercase normalization, followed by the removal of special characters, punctuation and extra white spaces. The next pre-processing step is tokenization, whose main objective is to break the text down into smaller units called tokens. The text can be divided into words, characters or subwords (character n-grams); here, the data is tokenized by words, splitting each string into sub-strings. Based on this result, common words that add little meaning to a document (stop words) are also removed. Subsequently, a lemmatization process groups the different inflected forms of a word into a basic root form called a lemma. In addition, the singular form of the words is obtained.

The final step is to discard sparse terms that appear fewer than two times in the whole corpus, as well as terms that appear in more than 95% of the documents, without losing relevant relationships inherent in the text instances. This reduces the computation time of the subsequent phases of the analysis. Likewise, duplicated answers are removed. The final dataset, which is the input for the topic modelling analysis, is built from the results of the previously described pre-processing. It is important to note that the textual information analyzed in this survey was acquired in Spanish. The pre-processing steps were adapted to the particularities of this language, considering that natural language processing resources for Spanish (stop-word removal/lemmatization) are still under development and some exceptions have not yet been covered. Finally, the results presented in this work were translated into English by native speakers.
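The following is a minimal sketch of this pre-processing pipeline; the paper does not name its tooling, so spaCy (with its Spanish model es_core_news_sm) and gensim are assumptions here, and all function and variable names are illustrative.

```python
import re
import spacy
from gensim.corpora import Dictionary

# Illustrative pipeline; assumes "python -m spacy download es_core_news_sm".
nlp = spacy.load("es_core_news_sm", disable=["parser", "ner"])

def preprocess(answer):
    # Lowercase normalization; remove special characters, punctuation
    # and extra white spaces (Spanish letters are kept).
    text = re.sub(r"[^a-záéíóúüñ\s]", " ", answer.lower())
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenize by words, drop stop words, and lemmatize (the lemma also
    # maps plural forms back to a singular root).
    return [tok.lemma_ for tok in nlp(text)
            if not tok.is_stop and len(tok.text) > 1]

answers = ["Quisiera clases más didácticas y acceso a la tecnología.",
           "Más acceso a la tecnología en las clases."]
docs = [preprocess(a) for a in sorted(set(answers))]  # drop duplicate answers

# Discard sparse terms (fewer than two occurrences, approximated here via
# document frequency) and terms present in more than 95% of the documents.
dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=2, no_above=0.95)
```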

2.2 Topic modelling

After the data is processed, the topic modelling analysis is carried out, based on the following three main steps: the term-document matrix generation, where an initial exploratory analysis is performed; the implementation of the unsupervised algorithm (LDA); and the final setting of the related hyper-parameters.

2.2.1 Term-document matrix generation and exploratory analysis

During the processing and analysis of natural language, the textual instances are characterized as bags of words, computationally represented by a term-document matrix. In this context, the term-document matrix can be considered a simplified version of the textual corpus, and it is the input to the algorithms used to model the corpus topics [11]. It is important to note that the order of the textual instances does not encode any implicit relation; in fact, during the computation of the term-document matrix, all textual elements are randomly shuffled for the required statistical processing and analysis. As such, strategies such as probabilistic Latent Semantic Analysis (PLSA) and LDA rely on the assumption of exchangeability of words and textual instances [12].

Once the term-document matrix is generated, the most frequently used words and word sequences, known as n-grams, are analyzed. Specifically, a uni-gram is a single word together with its frequency, a bi-gram is a pair of consecutive words together with its frequency, and so on. The frequency of these word sequences helps to explore the most common concepts in the corpus. This analysis is carried out as a preliminary step to understand the recurrent ideas in the dataset, which will later support the identification of the topics. To weigh the importance of each word in relation to the other instances of the same corpus, the Term Frequency - Inverse Document Frequency (TF-IDF) is computed. Calculating the TF-IDF requires the word frequency within a document (in this case, an answer) and the word frequency across the other documents of the corpus. In other words, the following elements are calculated:

  • Term Frequency (TF): Frequency of each token or word t, which appears in the document d, tf(t,d) = f(t,d).

  • Inverse Document Frequency (IDF): the logarithm of the number of documents N divided by the number of documents \(df_{t}\) that contain the token t (see (1)).

    $$ \text{idf}(t, N)= \log \frac{N}{df_{t}} $$
    (1)

Lastly, the TF-IDF is calculated by multiplying the TF by the IDF:

$$ \text{tfidf}= \text{tf}(t,d) \cdot \text{idf}(t,N) $$
(2)

This metric gives greater relevance to words that characterize individual answers, i.e., words that are frequent within an answer but appear in few answers overall, while down-weighting words that are common across most of the corpus.
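As a concrete illustration, the sketch below counts bi-gram frequencies and computes the TF-IDF scores of (1)-(2) directly from their definitions, using the tokenized answers from the previous stage; a library such as scikit-learn could be used instead, although its TF-IDF variant adds smoothing.

```python
import math
from collections import Counter

def bigram_counts(docs):
    # Frequency of each pair of consecutive words (bi-grams) in the corpus.
    return Counter((d[i], d[i + 1]) for d in docs for i in range(len(d) - 1))

def tfidf(docs):
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # df_t: docs containing t
    out = []
    for d in docs:
        tf = Counter(d)                            # tf(t, d) = f(t, d)
        out.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return out

docs = [["class", "virtual", "class"], ["class", "technology"]]
print(bigram_counts(docs))  # e.g. ('class', 'virtual'): 1
print(tfidf(docs))          # 'class' scores 0: it appears in every answer
```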

2.2.2 LDA model

To obtain the topics of the analyzed set of answers, a topic modelling strategy using LDA is implemented. LDA is an unsupervised machine learning technique that discovers patterns, or latent topics, in data. It is commonly used in studies with small numbers of observations or with unstructured text data, such as answers to open-ended questions. LDA assigns every word a probabilistic score for the most probable topic it could belong to, where each topic is a mixture of words and each document is a mixture of topic probabilities.

In this context, the model considers the corpus \(D = \{\mathbf{w}_{1},\mathbf{w}_{2},\cdots,\mathbf{w}_{M}\}\) as a collection of M documents, where each document \(\mathbf{w}_{m} = (w_{1},w_{2},\cdots,w_{N_{m}})\) contains \(N_{m}\) words drawn from a vocabulary of W unique words. Each document is then represented as a combination of k bag-of-words topics, and each topic is modeled by means of a discrete probability distribution that establishes the probability of each word being present in that topic. Figure 3 shows the generative process of LDA. In this model, α and η are the hyper-parameters of the Dirichlet distributions, 𝜃 is the distribution of topics for each instance i, and β is the distribution of words for each topic k. In addition, z indicates the topic from which a particular word is sampled, and w represents a single word.

Fig. 3: LDA model representation

In this context, the probability distribution over words within a given answer is:

$$ P(w_{i}) = \sum\limits^{T}_{j=1}P\left( w_{i} \mid z_{i} = j\right) P \left( z_{i}= j\right) $$
(3)

where the sum runs over the T = k topics, \(P(z_{i} = j)\) is the probability that the j-th topic was sampled for the i-th word, and \(P(w_{i} \mid z_{i} = j)\) is the probability of word \(w_{i}\) under topic j.
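A minimal sketch of fitting this model is shown below. The paper does not name its implementation; however, the prior options described in Section 2.2.3 match gensim's interface, so gensim's LdaModel is used as a stand-in, with the tokenized `docs` assumed to come from the pre-processing stage.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(docs)                # docs: tokenized answers
bow = [dictionary.doc2bow(d) for d in docs]  # term-document counts

lda = LdaModel(
    corpus=bow,
    id2word=dictionary,
    num_topics=8,      # k, tuned in Section 2.2.3
    alpha="auto",      # document-topic prior (learned from the corpus)
    eta="auto",        # topic-word prior (learned from the corpus)
    iterations=100,
    random_state=0,    # LDA starts from randomness; fix it to reproduce runs
)

# Each topic is a distribution over words; print its 10 most likely terms.
for tid, words in lda.show_topics(num_topics=-1, num_words=10, formatted=False):
    print(tid, [w for w, _ in words])
```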

2.2.3 Hyperparameter tuning

LDA treats α, η, and k as parameters and randomizes all other values (excluding w). Based on this consideration, the goal is to determine the α and η that maximize the probability of generating the actual corpus, by determining the best instance/topic (𝜃) and topic/word (β) distributions.

For the LDA implementation, hyperparameter tuning is applied to set the number of topics (k), the document-topic density parameter (α), the word-topic density parameter (η), and the number of iterations. To measure and compare model performance, the coherence score c_v is calculated. This probabilistic measure estimates whether the words within a topic go well together: a high coherence score means the words are closely related, while a very low score indicates that the topic contains words that do not occur together in the same documents or are not closely related.

Taking into account the corpus (bag of words associated with the complete answers) of each stakeholder group, a series of sensitivity tests is carried out to determine the best hyperparameters for the model. As previously stated, four parameters are considered for the LDA modelling: k, α, η, and the number of iterations. Consequently, the hyperparameter tuning consists of three tests (a condensed sketch of this search is given after the list):

  (i) Finding the number of topics k.

  (ii) Finding the best Dirichlet hyperparameters α and η. To calculate α, the following approaches are considered:

    • Fixed normalized asymmetric prior of

      $$ \begin{array}{@{}rcl@{}} \alpha = \left (\frac{1}{\left (1+\sqrt{k} \right )},\frac{1}{\left (2+\sqrt{k} \right )}...\right )\rightarrow \\ \alpha_{i}= \frac{1}{\left (i+\sqrt{k} \right )}, i = 1,2,\cdots,k \end{array} $$
      (4)

      where i is the topic index and k is the number of topics.

    • Fixed normalized symmetric prior of 1/k (one over the number of topics).

    • An asymmetric prior learned from the corpus.

    • An array of symmetric values, uniform across all k topics, with values from 0.01 to 1 in steps of 0.3 [5].

    For the η calculation, three different approaches are considered:

    • A scalar for a symmetric prior over the topic/word probability.

    • An asymmetric prior learned from the corpus.

    • An array of symmetric values for all W words, with values from 0.01 to 1 in steps of 0.3.

    By exploring these alternatives, the α and η values with the highest coherence score are selected. In short, α is the hyperparameter of the Dirichlet distribution that generates the k-dimensional document-topic vectors (𝜃), while η generates the W-dimensional topic-word vectors (β). In turn, 𝜃 and β act as parameters of the categorical distributions from which topics and words are sampled, respectively.

  (iii) Obtaining the optimal number of iterations of the model: once k is set and the best values of α and η are calculated, the best number of iterations is finally selected. This parameter controls the repetitions of the inference loop over each document. It is important to set this value sufficiently high, so a range from 50 to 150 iterations is explored and the value providing the best coherence score is chosen.
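Under the same assumptions as the earlier sketches (gensim, with `docs`, `dictionary` and `bow` already built), the three tests could be condensed as follows. In gensim, the options 'symmetric', 'asymmetric' and 'auto' correspond to the 1/k prior, the fixed asymmetric prior of (4), and the prior learned from the corpus, respectively; in practice each step would be run and its coherence curve inspected separately.

```python
from gensim.models import LdaModel, CoherenceModel

def cv_score(k, alpha, eta, iters):
    # Fit one LDA configuration and return its c_v coherence.
    lda = LdaModel(bow, id2word=dictionary, num_topics=k,
                   alpha=alpha, eta=eta, iterations=iters, random_state=0)
    return CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                          coherence="c_v").get_coherence()

# (i) number of topics k, evaluated from 1 to 14
k = max(range(1, 15), key=lambda n: cv_score(n, "symmetric", None, 50))

# (ii) Dirichlet priors: symmetric grid 0.01..1 (step 0.3), plus the
# asymmetric prior of (4) and the prior learned from the corpus
grid = [0.01, 0.31, 0.61, 0.91, 1] + ["symmetric", "asymmetric", "auto"]
alpha = max(grid, key=lambda a: cv_score(k, a, None, 50))
eta = max([0.01, 0.31, 0.61, 0.91, 1, "auto"],
          key=lambda e: cv_score(k, alpha, e, 50))

# (iii) number of iterations, from 50 to 150 in steps of 10
iters = max(range(50, 160, 10), key=lambda n: cv_score(k, alpha, eta, n))
```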

With these steps, the best parameters (k, α, η, and number of iterations) are selected so that the model attains the highest c_v, which in turn generates more meaningful and interpretable topics. Hence, the final step of the topic modelling is to analyze the topics that the model generated, draw conclusions about the theme of each topic, and analyze the topics in terms of their distribution in the dataset.

In addition to this analysis, the intertopic distance is computed to analyze the closeness among the modeled topics. To visualize it, the Jensen-Shannon divergence (JSD) between topics is first calculated. This metric is a symmetrized and smoothed version of the Kullback-Leibler divergence and is used to measure the similarity between two distributions. The Jensen-Shannon divergence of P and Q is defined as:

$$ \text{JSD} (P \parallel Q)= \frac{1}{2} D (P \parallel M) + \frac{1}{2} D (Q \parallel M) $$
(5)

where \(M= \frac {1}{2}(P +Q)\) and D denotes the Kullback-Leibler divergence. The Jensen-Shannon distance is obtained by taking the square root of this divergence. Consequently, the probability distributions of each topic (β) extracted by the LDA algorithm are analyzed and the distance between each pair of topics is computed. Multidimensional scaling is then used to project the intertopic distances onto a 2D plane. In this representation, the area of each circle (blob) represents the importance of the corresponding topic over the entire corpus, while the distances between blobs indicate the closeness or similarity between topics: the circle centers are placed according to the computed intertopic distances, and each circle’s area reflects the prevalence of its topic. Hence, during the analysis, the preferred model is the one with the fewest (preferably no) overlapping circles, spread throughout the graph.
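A sketch of this computation is given below, assuming the fitted `lda` model and `bow` corpus from the earlier sketches; tools such as pyLDAvis produce this map directly, so the code only illustrates the underlying steps. Note that SciPy's jensenshannon already returns the distance, i.e., the square root of the divergence in (5).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon  # JS distance = sqrt(JSD)
from sklearn.manifold import MDS

beta = lda.get_topics()  # k x W topic-word distributions
k = beta.shape[0]

# Pairwise Jensen-Shannon distances between the topic-word distributions.
dist = np.array([[jensenshannon(beta[i], beta[j]) for j in range(k)]
                 for i in range(k)])

# Multidimensional scaling projects the distances onto a 2D plane;
# the resulting points are the circle centers of the intertopic map.
centers = MDS(n_components=2, dissimilarity="precomputed",
              random_state=0).fit_transform(dist)

# Circle areas: prevalence of each topic over the entire corpus.
prevalence = np.zeros(k)
for d in bow:
    for t, p in lda.get_document_topics(d, minimum_probability=0.0):
        prevalence[t] += p
prevalence /= prevalence.sum()
```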

2.3 Expert analysis

At this stage, based on the results of the topic modelling algorithm, the labels that identify each of the obtained categories were analyzed for each stakeholder group. To this end, a team of experts in qualitative analysis evaluated the keywords and bigrams returned by the proposed methodology for each topic.

Before analyzing the information of each model, a manual corpus-labeling process was performed. In this task, randomly selected subsets of the answers (5% for parents and 10% for students and teachers) were analyzed. This approach consists of a general reading of the chosen answers and the identification of the macro-descriptors to which each stakeholder refers. The manual labeling provided relevant information to establish criteria for the final category-tagging process.

Based on these results and the keyword/bigram information of each model, we titled each category whenever a pattern allowed a satisfactory label. As a result of this stage, a logical association between the descriptive keywords and the related category is obtained. It is important to note that no descriptor was assigned to word groups whose topics seemed incomprehensible.

3 Experimental results and discussion

3.1 Preliminary analysis

After the dataset construction stage was finished and the term-document matrix was generated, a TF-IDF analysis was performed and the relevant terms in each corpus were identified. In the unigram (single word) case, the terms present in more than 95% of the answers were skipped. The most important words, bigrams and trigrams for each stakeholder group are listed in Tables 1, 2, and 3.

Table 1 Words, bigrams and trigrams with higher TF-IDF in the student group
Table 2 Words, bigrams and trigrams with higher TF-IDF in the teacher group
Table 3 Words, bigrams and trigrams with higher TF-IDF in the parent group

Considering these results, the answers associated with each bigram and trigram were extracted and an initial qualitative analysis was performed. This stage allowed us to identify the main recurrent idea behind each bigram/trigram obtained in the preliminary analysis, for each stakeholder group. From the results, it is possible to see that the students’ answers involve ideas about having their school as a space with large green areas and special care for the environment, where they could again have face-to-face and outdoor classes. Likewise, with respect to classes, the need for dynamic and didactic classes and ludic activities, where the teacher understands the students’ needs and focuses on the development of their skills, is identified. An additional relevant concept in their answers concerns the skills needed to complete a resume. Finally, these participants see their educational process as an opportunity to enter a university, where the knowledge addressed at school could contribute to a better future and improve, in some way, their quality of life.

When analyzing the complete answers of the teacher group, a great concern can be observed about having an adequate number of students in the classroom, as well as about promoting comprehensive development and meaningful learning during the educational process. Specifically, this group gives particular importance to the pedagogical and learning processes in the classroom, including reading-writing processes and ludic activities, and to the incorporation of technological tools and didactic material for the development of competencies. Finally, the importance of social-emotional skills and of the participation of parents in the teaching process of their children is also mentioned.

The parents’ answers reveal a focus on awakening the students’ interest, highlighting the importance of learning in a didactic and amusing way and of developing skills that prepare students for the future and impact their daily life. The relevance of developing a life project is also pointed out, together with caring for the environment and values such as respect.

3.2 Topic modelling results

After carrying out the exploratory analysis, an exhaustive search over the hyperparameters k (number of topics), α (document-topic Dirichlet prior), η (topic-word Dirichlet prior) and the number of iterations was performed to optimize the results. Following the methodology described in the previous section, the first step was to evaluate models with 1 to 14 topics and to select the one with the highest coherence value. The other parameters were set to their defaults in the LDA model, with α and η both symmetric and equal to one over the number of topics (1/k). The coherence values as a function of the number of topics are shown in Fig. 4.

Fig. 4: Coherence score for each stakeholder group, varying the number of topics k

Based on these results, the selected value is the number of topics that marks the end of the rapid growth of the coherence curve, where a suitable number of topics is obtained and the topics can be interpreted without many keywords being repeated across categories. From Fig. 4, the k values with the highest c_v are obtained with 8 and 10 topics in the student group; 5, 7 and 11 in the teacher group; and 9, 10 and 13 in the parent group. Accordingly, the models with these numbers of topics were evaluated, calculating the intertopic distance and selecting the k value that yields the most meaningful and interpretable topics.

As discussed in the previous section, the best model has the fewest overlapping circles, with the topics spread all over the graph representing the intertopic distance (Fig. 5). During the evaluation process, it was observed that as the number of topics increased, more small circles appeared (possibly subtopics) and more overlapping blobs were present in the analysis. It is also important to consider that a greater number of topics makes them less comprehensible. Therefore, the largest circles and the least overlap were obtained with k equal to 8, 7 and 10 for the student, teacher and parent groups, respectively (see Fig. 5).

Fig. 5: Intertopic distance for the selected models of the student, teacher and parent groups

With the optimal number of topics fixed, the α and η parameters are tuned to obtain the highest coherence score. For α, both symmetric and asymmetric values were considered, while for η only symmetric values were considered. For the symmetric case, uniformly distributed values (0.01, 0.31, 0.61, 0.91 and 1) were evaluated [5]. It is important to highlight that a low α in a symmetric distribution means that each document is likely to contain a mixture of just a few topics, whereas a high α means that each document is likely to contain a mixture of most of the topics rather than a single one. Likewise, a high η means that each topic is likely to contain a mixture of most of the words, with a smoother distribution of weight across the vocabulary. In addition to α and η, the number of iterations was also tuned, over a range from 50 to 150 in steps of 10. The results with the highest coherence values for each stakeholder group can be seen in Table 4. As the results show, α is small for the parent group, meaning that, in proportion to the number of answers, the set of textual instances is modeled with a small number of topics, while it is larger for the student and teacher groups. Likewise, students and parents have a higher η, which means that each topic is modeled as a mixture of a considerable number of words.

Table 4 Best parameters found for the model of the student, teacher and parent groups

The number of iterations is similar in every group, with a smaller value for the parent group; this is an expected result, given the number of answers analyzed for this stakeholder. With the generated model for each group, the 10 most likely keywords and the most frequent bigrams were found for each topic (see Tables 5, 6 and 7). One sign of a good topic model is the possibility of labeling each topic from its top words/bigrams. As such, an initial category was assigned to each cluster, based on an initial qualitative assessment, for each stakeholder group. This initial category only seeks to provide a preliminary label for the different groups; these labels were neither provided to the expert team nor fed back to the LDA model. Specifically, the labels were chosen by analyzing the words/bigrams of each topic, with their probabilities and frequencies respectively, and by evaluating the answers most likely to belong to each topic. Although the models presented for each stakeholder seem to have consistent and interpretable topics, it is important to highlight that no single topic of any model was able to describe the entire analyzed dataset. The LDA parameters are the key elements that characterize the models: because LDA begins with a degree of randomness, it generates a slightly different topic model every run. However, in this case, the topics produced for each stakeholder in each run were similar.

Table 5 Keywords, and bigrams for each topic modeled in the student group
Table 6 Keywords, and bigrams for each topic modeled in the teacher group
Table 7 Keywords, and bigrams for each topic modeled in the parent group

The relevant topics that could be labeled for the student group include the need for language skills, preparation for the real world, the use of didactic strategies, access to higher education, the importance of social relations at school, the improvement of the facilities of educational institutions, the limitations experienced during virtual classes, and the importance of the use of technology. Meanwhile, teachers highlight the reduction of the number of students per classroom, the importance of family involvement, and the integral development and emotional intelligence of students; similar to the students, an important number of their answers focus on pedagogical strategies, the learning process and the use of technology. Parents, in turn, attach particular importance to the need for both theory and practice in the learning process, the use of strategies to awaken the students’ interest, the development of talent and skills in their children, and the importance of social interaction and instruction in values. Similar to the students, they highlight preparation for the real world, access to higher education, the use of technology, and the limitations experienced during virtual classes. Finally, in accordance with the teachers, they give prominence to family involvement during the learning process.

Based on these results, the final classification of answers into the different topics for each stakeholder group can be seen in Fig. 6. These distributions show that preparation for the real world and social relations at school are the most recurrent topics addressed by students, while teachers are more focused on pedagogical strategies in the classroom and parents are more interested in awakening the students’ interest in class and in the development of talent and skills.

Fig. 6: Answer distribution for the selected models of the student, teacher and parent groups

3.3 Expert analysis results

To complete the analysis, the expert team in qualitative analysis assessed the keywords and bigrams obtained for each topic and, based on a preliminary manual categorization, defined a more descriptive title for each one. During the process, they followed these steps:

  1. The kinds of answers for each question were determined and ordered according to each stakeholder.

  2. A sample of the answers (5% for parents, 10% for students and teachers) was selected for the manual categorization.

  3. A list of answers was drawn up for the questions of the different stakeholders and the first categories were derived.

  4. A logical grouping of descriptive categories was made and descriptors were established (see the Appendix).

  5. Based on the previous results, a categorization and coding manual was built for the responses of the different stakeholders.

  6. The answers were assigned to each category and the frequency of each category was calculated.

  7. A triangulation analysis of the qualitative results against the results of the automatic classification (LDA) was performed.

  8. The categories were adjusted for each analyzed group.

Specifically, during the triangulation-analysis step, the categories established by the experts are matched with the groups obtained by the LDA model. This matching process was developed through the following steps (a toy sketch of the scoring rule appears after the list):

  1. Reading of all topics, bigrams and trigrams by stakeholder.

  2. Performing the qualitative analysis between categories and descriptors, and bigrams and trigrams. This process has the following characteristics:

    (a) Each category obtained from the manual analysis was scored against each group found by the LDA model, assigning a qualitative coherence value.

    (b) The score, defined from 0 to 1, took the bigrams and trigrams of each group into account. It was computed by dividing the number of bigrams or trigrams consistent with the suggested category by the total number of bigrams and trigrams of the topic. The bigrams and trigrams selected per category were those present in more than 10% of the observations.

    (c) The category with the highest score, provided it was higher than 0.7, was assigned as the final category of each LDA group.

  3. In each case, the consistency between the proportions of the manual categorization of the proposed category and the LDA group was finally validated. In all cases, the results were congruous.
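The scoring rule of step 2(b) amounts to a simple fraction; a toy sketch with hypothetical n-grams follows.

```python
def category_score(topic_ngrams, consistent_ngrams):
    """Fraction of a topic's frequent bigrams/trigrams (those present in
    more than 10% of the observations) judged consistent with a candidate
    expert category; the label is assigned if the best score exceeds 0.7."""
    return len(set(consistent_ngrams) & set(topic_ngrams)) / len(topic_ngrams)

# Hypothetical example: 3 of the 4 frequent n-grams match the candidate
# category, so the score (0.75) clears the 0.7 threshold.
topic = ["virtual class", "internet access", "online class", "green area"]
assert category_score(topic, ["virtual class", "internet access",
                              "online class"]) == 0.75
```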

Based on this analysis, the final descriptive labels selected for each topic can be seen in Tables 8, 9 and 10. The results show that the new labels involve more detail about the focus of the answers, which was the main objective of the preliminary manual labeling of the selected data sample.

Table 8 Final labels for each topic in the student group
Table 9 Final labels for each topic in the teacher group
Table 10 Final labels for each topic in the parent group

4 Conclusions

To assess the degree of public satisfaction with public policies (addressed in sectors of mutual interest as important as education), surveys are commonly used to understand the point of view of stakeholders (e.g., students, teachers, parents). These surveys allow the collection of valuable information about possible lines of improvement in the education process. Usually, these tools include open-ended questions focused on identifying spontaneous thoughts and discovering new lines of action. Although open-ended questions allow the acquisition of new information, they also require a large workload and manual processing time. This has been considered a significant disadvantage, discouraging the use of this kind of question and precluding the collection of information of great importance.

This study presents a complete methodology for the collection, pre-processing and automatic analysis of answers to open-ended questions, using an unsupervised approach based on the identification of latent topics. Additional insight is added to the automatically obtained topic labels through an initial exploratory analysis using the TF-IDF metric and a fine-grained labeling provided by an expert team in qualitative analysis. This approach allows us to model the topics discussed in the collected answers and to obtain a macro-perspective of the perception of the education system from different points of view. The methodology helps to reduce the workload and processing time required to analyze unstructured textual data from different sources, such as answers acquired through open-ended questions.

During the analysis, three groups of stakeholders were surveyed: students, teachers and parents. The questions were structured for each stakeholder to obtain information about the identified limitations, and the aspects of the educational system to be changed, in line with the goals of the respective participant role. This application provides important information about potential lines of action to improve the perception and satisfaction of the population in the education sector. As a result, the categories generated by the models, together with the expert feedback, allowed us to clearly identify the relevant topics for each stakeholder. These results suggest that this methodology can be used to extract different kinds of information in this field.

The results obtained from the methodology presented in this work show that some topics are addressed by only one group of participants. Only the students highlight the importance of foreign language proficiency, investment in infrastructure, and strategies to improve school coexistence. In turn, teachers emphasize pedagogical methodologies and curricular change, the reduction of the number of students per class, and the development of skills and competences focused on an integral development that integrates multiple intelligences. Parents, for their part, were interested in instruction in values, the importance of teaching interpersonal skills, and changes to traditional education to awaken the students’ interest. As a complement, both students and parents underline the relevance of wider coverage of and access to higher education, the development of life skills and competences, and the limitations of online classes and access. Teachers and parents highlight the importance of greater family involvement in the educational process. As a common topic for all groups, access to and use of new technologies in education was reported to be an important element to consider in changing the education system.

It is important to highlight that the proposed methodology is practically applicable for identifying prominent underlying topics in a large collection of responses to open-ended questions oriented to multiple stakeholders. The questionnaire design, acquisition, pre-processing, automatic categorization and expert feedback stages could be applied (without loss of generality) to study and analyze a macro-perspective of multiple stakeholders’ perceptions in any application. However, some considerations must be analyzed in particular: the number of responses must be large enough to obtain a model with acceptable performance and to take advantage of the time reduction during the categorization analysis, and the changes in the questionnaire between the different stakeholders determine what information can be extracted for the same topic from multiple points of view. The remaining stages can be replicated for similar tasks, such as analyzing open-ended feedback or discussion forums.

In future work, the analysis will focus on deepening the stakeholders’ perception of the educational system, with a subdivision based on grades and level of education. In this way, students, teachers and parents will be divided into sub-stakeholders, and the questionnaire will delve deeper into the topics of interest of each stakeholder reported in the present study. Considering that LDA-based models do not properly estimate correlations between topics, because of the nature of the Dirichlet distribution, an additional line of action is the automatic analysis of the relationships between topics through the modelling of spatial distributions; this approach aims to avoid the overlap of concepts among different categories. Complementary studies could involve the acquisition of new variables such as age, gender or residence location, as well as information from other areas (e.g., the corporate sector, administrative employees of educational institutions). This new data could help to expand the scope of our results.