1 Introduction

In the present-day social environment, the discussion forum has become an increasingly popular tool for members to communicate and share information. Through a discussion forum, members update their knowledge about new topics. At present, the discussion forum has become a standard part of Learning Management Systems (LMS) and Massive Open Online Courses (MOOC) (Piña 2018; Kuran et al. 2017; Ruano et al. 2016), which means that a large amount of knowledge and information arises in discussion forums. However, the posts are known only to the members. Recently, to improve content relevance, some discussion forums have also invited external parties (e.g., practitioners from industry) to join as members alongside lecturers; they too can raise questions for the students. The problem is how to retrieve the information or knowledge contained in discussion forums, which would in turn enrich the teaching and learning process.

In the last five years, several studies on discussion forums have emerged, focusing on the association between discourse behavior and students’ learning (Wang et al. 2015), the effect of confusion in discussion forums (Yang et al. 2015), sentiment analysis of MOOC discussion forums (Wen et al. 2014) and unsupervised classification methods to understand student posts (Ezen-can et al. 2015). In contrast, few works focus on extracting information from the discussions themselves. To enrich research that takes the discussion forum as its object, this study focuses on extracting the topics of discussion forum posts using a latent semantic approach. The topic becomes a label of the post, so that information and knowledge can be retrieved from the discussion forum. This study therefore complements previous work.

The study proposes a model for clustering posts based on the topic of discussion through a latent semantic approach. The model, named the Topics Finding Model (TFM), is a new approach for clustering posts. However, the characteristics of discussion forums make this a challenging research environment. One challenge is that a post is written by a member without any editing process, so it may consist of unstructured statements with grammatical errors. Another characteristic concerns the topic of discussion: when a discussion is opened by a thread from a member, the thread ideally focuses on one topic, yet the discussion may drift to other topics that diverge among the members. The language used in a discussion forum is a further characteristic. Although a specific language is expected, slang terms may appear on several occasions; for example, in a discussion held in Indonesian, members might use several English slang terms. The latent semantic approach is therefore used to handle these characteristics of discussion forums.

Using an LMS for the experiments, the study collected 1050 posts and divided them into three different course subjects: information systems, management and character building. The reason for choosing course subjects from three different areas (computing, social and behavioral) is to observe the consistency of the model. The language used in the discussion forum is Indonesian, and the effectiveness of the model is measured by the F-measure. The results show that the TFM is consistent and effective in revealing the topic of discussion.

The rest of the paper is organized as follows: Section 2 discusses the previous research related to this paper. Section 3 explains the proposed model and method. Section 4 discusses the evaluation and the result. Section 5 provides the conclusion.

2 Related works

2.1 Information retrieval (IR)

There are two scopes of research in IR. The first concerns how to index documents; the second concerns document retrieval (Baeza-Yates and Ribeiro-Neto 2011; Sanderson and Croft 2012). This study focuses on indexing, where the index is based on the topic of discussion. To find the topic, a language modeling approach is used. In recent years, much research on language modeling for information retrieval has appeared. One line of work is the smoothing method for language modeling, which uses word probability estimation. Equation (1) is the general form of a smoothed model (Zhai and Lafferty 2017).

$$ p\left(w|d\right)=\begin{cases}{p}_s\left(w|d\right), & \text{if word } w \text{ is seen}\\ {\alpha}_d\, p\left(w|\mathcal{C}\right), & \text{otherwise}\end{cases} $$
(1)

where ps(w| d) denotes the smoothed probability of a word seen in the document, that is, a probability that adjusts the maximum likelihood estimate of the language model. The collection language model is denoted by p(w| C), while αd is the coefficient controlling the probability mass assigned to unseen words.
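
For illustration, Eq. (1) can be instantiated with Jelinek–Mercer smoothing, in which the document and collection models are interpolated with a fixed coefficient; the following minimal Python sketch makes this concrete (the function name, the coefficient value and the toy corpus are illustrative assumptions, not part of the original study):

```python
from collections import Counter

def smoothed_prob(word, doc_tokens, collection_tokens, lam=0.7):
    """One possible instance of Eq. (1): Jelinek-Mercer smoothing.
    lam weights the document model; (1 - lam) plays the role of alpha_d
    applied to the collection model p(w|C)."""
    doc_tf, col_tf = Counter(doc_tokens), Counter(collection_tokens)
    p_doc = doc_tf[word] / len(doc_tokens)          # maximum likelihood p(w|d)
    p_col = col_tf[word] / len(collection_tokens)   # collection model p(w|C)
    if doc_tf[word] > 0:                            # word w is seen in d
        return lam * p_doc + (1 - lam) * p_col
    return (1 - lam) * p_col                        # otherwise: alpha_d * p(w|C)

# Illustrative usage on a tiny made-up corpus
doc = "sistem informasi strategi organisasi".split()
collection = ("sistem informasi strategi organisasi usaha "
              "pengetahuan pembelajaran pengalaman").split()
print(smoothed_prob("strategi", doc, collection))
```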

Another approach uses statistics to find the a posteriori most likely documents given the query, based on Bayes’ law as in Eq. (2): the retrieved document d is the one for which p(d| q, U) is highest, where q is the query and U represents the user (Berger and Lafferty 1999).

$$ p\left(d|q,U\right)=\frac{p\left(q|d,U\right)\ p\left(d|U\right)}{p\left(q|U\right)} $$
(2)

In this study, the language model uses a latent semantic approach based on the probability of a latent variable to find the topic of discussion, which can then be used as a label to index and retrieve documents. The latent semantic approach finds information in a text document, based on certain entities, through a latent variable. The latent variable represents an association between an unobserved class variable and each observation, based on co-occurrence data, and is adopted from the generative model of Probabilistic Latent Semantic Analysis (PLSA) (Hofmann 1999).

2.2 Corpus classification

Corpus classification is the process of classifying the documents in a specific corpus based on a certain approach. Several previous studies focus on clustering or classifying a corpus, e.g., scatter/gather clustering of large corpora (Cutting et al. 2017), context semantic analysis (Benedetti et al. 2018), word importance-based similarity clustering (Botev et al. 2017), machine learning for text categorization (Sailaja et al. 2018) and the corpus classifier algorithm (Setiawan et al. 2019).

Ideally, a thread in a discussion forum discusses one topic; however, the discussion can grow to other topics. The corpus classifier algorithm addresses this shortcoming, which causes a variety of topics to be discussed within one thread. In this approach, documents are classified based on the similarity of the words with the highest term frequency. A document in a corpus is represented as a set of words; for example, the first and second documents containing m words are denoted as d1 = {word1, word2, ⋯, wordm} and d2 = {word1, word2, ⋯, wordm}, respectively. Therefore, the ith document containing m words in the corpus is expressed by Eq. (3):

$$ {d}_i=\left\{{word}_1,{word}_2,\cdots, {word}_m\right\} $$
(3)

The similarity of two documents is expressed in Eq. (4), and Fig. 1 shows the model of the corpus classification approach (Setiawan et al. 2019). The algorithm needs two inputs: the number of words with the highest term frequency, denoted by m, and the number of similar words, denoted by n. The similarity between document A and document B is expressed by Eq. (4) as follows:

$$ sim\left({d}_A,{d}_B\right)=\begin{cases}1, & \text{if } \left({d}_A\cap {d}_B\neq \varnothing \right) \text{ and } \left(\left|{d}_A\cap {d}_B\right|\ge n\right)\\ 0, & \text{otherwise}\end{cases} $$
(4)

where:

sim(dA, dB):

denotes the similarity between two documents

n:

denotes the number of similar words within m words with highest term-frequency

Fig. 1 The model of corpus classification approach (Setiawan et al. 2019)

The value of the similarity is one if two conditions are fulfilled: first, there is an intersection between the two documents, and second, the number of elements in the intersection is greater than or equal to n. Otherwise, the value of the similarity is zero.
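
As an illustration only (not the authors’ implementation), Eq. (4) can be sketched in a few lines of Python, assuming each document has already been reduced to the set of its m highest term-frequency words; the tokenized posts below are made up:

```python
from collections import Counter

def top_m_words(tokens, m=5):
    """Represent a document by its m words with the highest term frequency."""
    return {word for word, _ in Counter(tokens).most_common(m)}

def sim(d_a, d_b, n=2):
    """Eq. (4): 1 if the two word sets share at least n words, otherwise 0."""
    overlap = d_a & d_b
    return 1 if overlap and len(overlap) >= n else 0

# Illustrative usage with two made-up stemmed posts
d1 = top_m_words("strategi usaha organisasi strategi porter usaha".split())
d2 = top_m_words("strategi usaha pasar analisis strategi".split())
print(sim(d1, d2))   # 1: the posts share 'strategi' and 'usaha'
```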

The similar documents produced by the classification are defined as a corpus. Such a corpus is more specific and focused than a corpus based on a thread of discussion. Another corpus development model in previous research is based on Naïve Bayes, SVM and J48 with a term-weighting ranking scheme (Utomo and Bijaksana 2016), whereas the corpus classification used here is based on the similarity of the words with the highest term frequency among documents; the two models thus take different approaches.

2.3 Probabilistic latent semantic analysis (PLSA)

Probabilistic Latent Semantic Analysis (PLSA) is a statistical approach for finding a categorized topic based on latent semantics through co-occurrence data analysis (Hong et al. 2008). PLSA is an aspect model introduced by Thomas Hofmann (Hofmann 1999) and can be used for information retrieval. The aspect model associates the co-occurrence data with an unobserved class variable (topic), where an observation is the occurrence of a word in a particular document (Hofmann 2001). Figure 2 shows the general structure of PLSA and the associations among documents, topics and words. The probabilities P(z| d) and P(w| z) link the topic layer to the documents and to the words, respectively (Dan Oneata 1999).

Fig. 2 The general structure of PLSA model (Dan Oneata 1999)

The aspect model has two parameterizations: asymmetric and symmetric. The asymmetric parameterization is used when the number of topics is much smaller than the number of documents and the number of words (K ≪ N, D).

In the asymmetric parameterization, a generative model for document and word co-occurrences with a joint probability is expressed by:

$$ P\left({d}_i,{w}_j\right)=P\left({d}_i\right)P\left({w}_j|{d}_i\right) $$
(5)

where

$$ P\left({w}_j|{d}_i\right)={\sum}_{k=1}^KP\left({w}_j|{z}_k\right)P\left({z}_k|{d}_i\right) $$
(6)

The explanation of the symbols is:

P(di):

probability of a word occurrence in a particular document di

P(wj| zk):

class-conditional probability of a specific word given the unobserved class variable zk

P(zk| di):

a document specific probability distribution over the latent variable space

In the symmetric parameterization, the joint probability of document and word co-occurrences is expressed by:

$$ P\left({d}_i,{w}_j\right)={\sum}_{k=1}^KP\left({z}_k\right)P\left({d}_i|{z}_k\right)P\left({w}_j|{z}_k\right) $$
(7)

The explanation of the symbols is:

P(zk):

probability of the latent class variable zk

P(di| zk):

class-conditional probability of a particular document given the unobserved class variable zk

P(wj| zk):

class-conditional probability of a specific word given the unobserved class variable zk

The latent semantic approach here is based on PLSA because one of the characteristics of discussion forums is that posts bypass any editing process; a statistical approach is therefore suitable for handling these characteristics.
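
As a small numerical illustration (with arbitrary sizes and randomly initialized parameters rather than values estimated from data), the joint probability in Eq. (7) can be assembled from the three component distributions as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N_docs, N_words = 3, 4, 6   # illustrative sizes: topics, documents, distinct words

# Randomly initialized component distributions, normalized into probabilities
P_z = rng.random(K);                     P_z /= P_z.sum()                        # P(z_k)
P_d_given_z = rng.random((N_docs, K));   P_d_given_z /= P_d_given_z.sum(axis=0)  # P(d_i|z_k)
P_w_given_z = rng.random((N_words, K));  P_w_given_z /= P_w_given_z.sum(axis=0)  # P(w_j|z_k)

# Eq. (7): P(d_i, w_j) = sum_k P(z_k) * P(d_i|z_k) * P(w_j|z_k)
P_dw = np.einsum('k,ik,jk->ij', P_z, P_d_given_z, P_w_given_z)
print(P_dw.shape)   # (4, 6): one joint probability per (document, word) pair
print(P_dw.sum())   # approximately 1.0, i.e. a valid joint distribution
```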

3 Proposed model

This paper proposes a latent semantic approach to find the topic of discussion in a discussion forum. The approach is packaged in a model named the Topics Finding Model (TFM), shown in Fig. 3. The TFM aims to find the topics of discussion in a corpus through three steps, where a corpus is a set of posts and a post in a discussion forum is a text document. The three steps are: pre-processing the documents, corpus classification, and finding the topic. Pre-processing comprises three activities: tokenization, stop-word removal, and stemming. The stemming process uses the flexible affix classification approach, a stemming algorithm for the Indonesian language (Setiawan et al. 2016). The corpus is obtained from the discussion forum of Bina Nusantara University’s learning management system and is not publicly accessible. Since teaching and learning at Bina Nusantara University are conducted in Indonesian, the discussion forum uses Indonesian as well. Empty posts have been removed from the corpus to ensure that it meets the needs of the research object. The corpus is not validated by the authors; instead, the corpus and the stemming results are validated and verified by the Language Center of Bina Nusantara University as an independent party.

Fig. 3 The Topics Finding Model (TFM)

In the corpus classification step, the corpus classifier algorithm is used. Afterwards, in the finding topic step, the Probabilistic Latent Semantic Analysis (PLSA) approach is used.

The Topics Finding Model depicted in Fig. 3 above can be elaborated in the following equations:

  • Pre-processing Document step:

In this step, the text document undergoes tokenization, stop-word removal based on a stop-word list, and stemming. Equations (8) and (9) model a text document in the tokenization process and in the stemming process, respectively, while the corpus contains the stemmed text documents, as shown in Eq. (10).

$$ {D}_i=\left\{{T}_1,{T}_2,\cdots, {T}_p\right\} $$
(8)

where:

D:

denotes an original text document

i:

denotes number of original text documents

T:

denotes a token in original text document

p:

denotes number of tokens

$$ {d}_j=\left\{{w}_1,{w}_2,\cdots, {w}_l\right\} $$
(9)

where:

d:

denotes a stemmed text document

j:

denotes number of stemmed text documents

w:

denotes a stemmed distinct word in text document

l:

denotes number of stemmed distinct words

$$ C=\left\{{d}_1,{d}_2,\cdots, {d}_i\right\} $$
(10)

where C denotes a corpus containing the stemmed text documents d
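
For illustration, the pre-processing step can be sketched as follows; the abbreviated stop-word list is hypothetical (the study uses the Tala list) and the stemmer is left as a placeholder for the flexible affix classification algorithm, which is not reproduced here:

```python
import re

# Abbreviated, illustrative stop-word list; the study uses the Tala (2003) list
STOP_WORDS = {"bisa", "didapat", "dari", "dan", "yang", "di", "ke"}

def tokenize(text):
    """Eq. (8): split a raw post into lowercase tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Placeholder for the flexible affix classification stemmer
    (Setiawan et al. 2016); substitute that implementation here."""
    return word

def preprocess(text):
    """Eq. (9): tokenize, remove stop-words, then stem."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("Pengetahuan bisa didapat dari pembelajaran dan pengalaman"))
# ['pengetahuan', 'pembelajaran', 'pengalaman']
```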

  • Corpus Classification step:

In this step, the corpus is classified based on similar distinct words with the highest term frequency across documents. There are two parameters in this step: m, the number of words with the highest term frequency, and n, the number of similar words. For convenience, Eq. (4) from Section 2 is rewritten as Eq. (11).

$$ sim\left({d}_A,{d}_B\right)=\begin{cases}1, & \text{if } \left({d}_A\cap {d}_B\neq \varnothing \right) \text{ and } \left(\left|{d}_A\cap {d}_B\right|\ge n\right)\\ 0, & \text{otherwise}\end{cases} $$
(11)

where:

sim(dA, dB):

denotes the similarity between two documents

n:

denotes the number of similar words within m words with highest term-frequency

The value of the similarity is one if two conditions are fulfilled: the first condition is that there is an intersection between the two documents, and the second is that the number of elements in the intersection is greater than or equal to n. Otherwise, the value of the similarity is zero.
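
A minimal sketch of this step is given below; it is not the authors’ code, and the greedy seed-based grouping is only one simple way to apply Eq. (11), with all names chosen for illustration:

```python
from collections import Counter

def top_m_words(tokens, m):
    """Represent a stemmed document by its m highest term-frequency words."""
    return {w for w, _ in Counter(tokens).most_common(m)}

def classify_corpus(docs, m=5, n=2):
    """Group the stemmed documents of one thread into smaller corpora:
    a document joins a group if it shares at least n of its top-m words
    with the group's first document (Eq. (11)); otherwise it starts a new group."""
    seeds, groups = [], []
    for doc in docs:
        words = top_m_words(doc, m)
        for seed, group in zip(seeds, groups):
            overlap = words & seed
            if overlap and len(overlap) >= n:      # sim = 1
                group.append(doc)
                break
        else:                                      # sim = 0 for every existing group
            seeds.append(words)
            groups.append([doc])
    return groups

# Illustrative usage on three made-up stemmed posts of one thread
thread = [
    "strategi usaha organisasi strategi usaha".split(),
    "strategi usaha pasar porter strategi".split(),
    "etika tanggung jawab etika sosial".split(),
]
for corpus in classify_corpus(thread, m=5, n=2):
    print(len(corpus), "post(s)")    # 2 post(s), then 1 post(s)
```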

  • Finding Topic step:

There are eight steps to find the topic in a corpus, explained as follows:

  1.

    Prepare a matrix to store the term frequency of each distinct word in each document. The term frequency, denoted by tf, is the number of times a distinct word occurs in a document. The matrix has size J × I, where J is the number of distinct words in the corpus and I is the number of documents. Thus, tf11 is the frequency of the first distinct word in the first document.

  2.

    Prepare a matrix for the probability of a word given a topic, P(word| topic). The size of the matrix is J × K, where J is the number of distinct words in the corpus and K is the number of topics. The values of the matrix are initialized with random numbers and normalized into probabilities using Eq. (12). The normalization gives the weight of a word with respect to a topic. The probability and the random number are symbolized by P(wj| zk) and wjzk, respectively. The number of topics must be defined beforehand. Thus, P(w1| z1) is the probability of word w1 given topic z1.

    $$ P\left({w}_j|{z}_k\right)=\frac{w_j{z}_k}{\sum_{j=1}^J{w}_j{z}_k} $$
    (12)
  3.

    Prepare a matrix for the probability of a topic given a document, P(topic| doc). The size of the matrix is K × I, where K is the number of topics and I is the number of documents in the corpus. As with P(word| topic), the values of the matrix are initialized with random numbers and normalized into probabilities using Eq. (13). The normalization gives the weight of a topic with respect to a document. The probability and the random number are symbolized by P(zk| di) and zkdi, respectively. Thus, P(z1| d1) is the probability of topic z1 given document d1.

    $$ P\left({z}_k|{d}_i\right)=\frac{z_k{d}_i}{\sum_{k=1}^K{z}_k{d}_i} $$
    (13)
  4.

    Prepare a matrix for the probability of a word given a document, P(word| doc). The size of the matrix is J × I, where J is the number of distinct words in the corpus and I is the number of documents. The values of the matrix are initialized with zeros, and the probability is accumulated using Eq. (14). The probability is symbolized by P(wj| di). In the equation, n denotes the current iteration, so n + 1 denotes the next iteration. The number of topics determines the number of accumulation iterations.

    $$ P{\left({w}_j|{d}_i\right)}_{n+1}=P{\left({w}_j|{d}_i\right)}_n+P\left({w}_j|{z}_k\right)\times P\left({z}_k|{d}_i\right) $$
    (14)
  5.

    Prepare a matrix for the probability of a topic given a word and a document, P(topic| doc, word). The size of the matrix is K × J × I, where K is the number of topics, J is the number of distinct words in the corpus, and I is the number of documents in the corpus. The probability is symbolized by P(zk| di, wj) and is obtained with Eq. (15). This is the estimation step, which computes the posterior probabilities of the latent variables.

    $$ P\left({z}_k|{d}_i,{w}_j\right)=P\left({w}_j|{z}_k\right)\times P\left({z}_k|{d}_i\right)/P\left({w}_j|{d}_i\right) $$
    (15)
  6.

    Update the probability of a topic given a document, P(topic| doc), using Eq. (16) followed by Eq. (13). This is a maximization step that updates P(zk| di).

    $$ P{\left({z}_k|{d}_i\right)}_{n+1}=P{\left({z}_k|{d}_i\right)}_n+{\sum}_{j=1}^J{tf}_{ji}\times P\left({z}_k|{w}_j,{d}_i\right) $$
    (16)
  7.

    Update the probability of a word given a topic, P(word| topic), using Eq. (17) followed by Eq. (12). This is a maximization step that updates P(wj| zk).

    $$ P{\left({w}_j|{z}_k\right)}_{n+1}=P{\left({w}_j|{z}_k\right)}_n+{\sum}_{i=1}^I{tf}_{ji}\times P\left({z}_k|{w}_j,{d}_i\right) $$
    (17)
  8.

    The last step is a maximization step that updates the probability of a word given a document, P(word| doc), using Eq. (14). In the maximization steps, the term-frequency matrix affects the updates of P(topic| doc) and P(word| topic); hence the term frequency of a distinct word influences the resulting topic.

The fourth to eighth steps are illustrated in Figs. 4, 5, 6 and 7, which explain the topic-finding process step by step. Figure 4 represents the calculation of the probability of a word given a document, which is also part of the maximization step; P(word| doc) is obtained by accumulating the products of P(word| topic) and P(topic| doc), as stated in Eq. (14). Figure 5 visualizes the estimation step. Finally, Figs. 6 and 7 depict the maximization steps: P(topic| doc) and P(word| topic) are obtained by accumulating the products of the term frequencies and P(topic| doc, word), as stated in Eqs. (16) and (17), respectively.

Fig. 4 The calculation of P(word| doc) for word w1 to wj in document d1 of topic z1

Fig. 5 The calculation of P(topic| doc, word) for word w1 to wj in document d1 of topic z1

Fig. 6 The calculation of updated P(topic| doc) for topic z1 of document d1 and whole words

Fig. 7 The calculation of updated P(word| topic) for word w1 of topic z1 in whole documents

Figure 8 depicts the TFM as a flow chart to explain the model in more detail. The flow chart represents the step-by-step process of finding the topics in a corpus of discussion forum posts.

Fig. 8 The flow chart of TFM

Based on the TFM process shown in Fig. 3, the flow chart in Fig. 8 describes the flow from pre-processing the documents to corpus classification and finding the topic. The TFM starts by storing discussion forum posts as text documents, as shown in the first parallelogram of Fig. 8; every post becomes one text document. The text documents then undergo tokenization, stop-word removal and stemming. These processes are part of the pre-processing step and affect the term frequency of each distinct word in every text document. The stop-word list and the stemming process are adjusted to the language used in the discussion forum. The stemming process is required to produce root words and likewise affects the term frequency of the distinct words in a document. The term frequencies of the distinct words are needed in the second step, corpus classification. The inputs of corpus classification are the number of distinct words with the highest term frequency and the number of similar words, as shown in the first column of Fig. 8. Corpus classification groups documents based on similar words, and its output is used in the third process, finding the topic. As shown in the second column of Fig. 8, besides the corpus created by corpus classification, the finding topic process takes two inputs: the number of topics and the number of iterations. The result of the finding topic process is the topic of each corpus, which is used as a label of the posts, as shown in the last parallelogram of Fig. 8. Generally, a discussion is opened by a thread and followed by replies or responses from the members. Ideally, a thread discusses a specific topic, but there is no guarantee that the responses will follow it; the discussion may drift to other topics within the thread. Therefore, a thread, taken as a corpus, is passed to the corpus classification step, which groups the posts of the thread into more specific corpora. The grouping is based on the number of similar distinct words among the words with the highest term frequency.

Furthermore, the topic is found through the latent semantic approach. The topic of a document is defined as the one with the highest probability value. The number of topics and the number of iterations of the estimation and maximization steps are the two parameters of this process. The number of distinct words and the number of documents in a corpus are needed to compute the posterior probability of the latent variable. The approach consists of the eight steps above (Setiawan et al. 2019). The process is explained in detail in Algorithm 1, the Latent_Semantic algorithm.

Algorithm 1 Latent_Semantic Algorithm
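
Since Algorithm 1 is rendered as a figure in the original, the following Python sketch only illustrates the eight steps above; it uses the standard EM formulation of PLSA, in which the maximization step recomputes the parameters rather than accumulating them as literally written in Eqs. (16) and (17), and the term-frequency matrix below is made up:

```python
import numpy as np

def plsa(tf, K, iterations=50, seed=0):
    """Illustrative PLSA via EM on a term-frequency matrix tf of shape (J, I):
    J distinct words x I documents, K latent topics (steps 1-8 in spirit)."""
    rng = np.random.default_rng(seed)
    J, I = tf.shape
    P_wz = rng.random((J, K)); P_wz /= P_wz.sum(axis=0)   # Eq. (12): P(w_j|z_k)
    P_zd = rng.random((K, I)); P_zd /= P_zd.sum(axis=0)   # Eq. (13): P(z_k|d_i)
    for _ in range(iterations):
        P_wd = P_wz @ P_zd                                # Eq. (14): P(w_j|d_i)
        # Estimation step, Eq. (15): P(z_k|d_i, w_j), shape (K, J, I)
        P_zdw = (P_wz.T[:, :, None] * P_zd[:, None, :]) / (P_wd[None, :, :] + 1e-12)
        # Maximization step in the spirit of Eqs. (16)-(17), then re-normalization
        P_zd = (P_zdw * tf[None, :, :]).sum(axis=1)
        P_zd /= P_zd.sum(axis=0) + 1e-12
        P_wz = (P_zdw * tf[None, :, :]).sum(axis=2).T
        P_wz /= P_wz.sum(axis=0) + 1e-12
    return P_wz, P_zd

# Illustrative usage: 6 distinct words, 4 documents, 2 topics
tf = np.array([[3, 0, 0, 1],
               [2, 1, 0, 0],
               [0, 0, 4, 2],
               [0, 1, 2, 3],
               [1, 2, 0, 0],
               [0, 0, 1, 2]], dtype=float)
P_wz, P_zd = plsa(tf, K=2)
print("Topic label per document:", P_zd.argmax(axis=0))   # highest P(topic|doc)
```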

4 Evaluation and result

This study used 1050 text documents from the Learning Management System (LMS) of Bina Nusantara University to evaluate the model. The data were gathered from three different course subjects: information systems, management and character building as the first, second and third course subject, respectively. Online discussion characteristics can be grouped into: (1) highly confined discussions, because the course is governed by mathematical formulas and physical laws; (2) less confined discussions, because mathematical formulas and physical laws are less prominent; and (3) unrestricted discussions that express personal experience and character. To accommodate these concerns, the three courses Information Systems, Management and Character Building were selected to represent groups 1, 2 and 3, respectively. In the data-gathering period, 62 courses were being taught, and they were grouped according to these characteristics. Another reason for choosing subjects from three different areas (computing, social and behavioral) is to observe the consistency of the model. The numbers of documents per course subject are 330, 370, and 350 text documents, respectively. A text document represents one post in a thread of discussion. Table 1 shows the profile of the data.

Table 1 The profile of the data

First, the data were processed with the pre-processing steps: tokenization, stop-word removal and stemming. An example of tokenization for a statement in English is ‘Knowledge can be obtained from learning and experience’, which yields 8 tokens: ‘Knowledge’, ‘can’, ‘be’, ‘obtained’, ‘from’, ‘learning’, ‘and’ and ‘experience’. Another example, in Indonesian, is ‘Pengetahuan bisa didapat dari pembelajaran dan pengalaman’, which yields 7 tokens: ‘Pengetahuan’ (‘Knowledge’), ‘bisa’ (‘can be’), ‘didapat’ (‘obtained’), ‘dari’ (‘from’), ‘pembelajaran’ (‘learning’), ‘dan’ (‘and’) and ‘pengalaman’ (‘experience’).

The next process is stop-word removal. Since most of the discussion is in Indonesian, the stop-word list used is Tala’s list, complemented with some common English words (Tala 2003). Using the previous Indonesian example, ‘Pengetahuan bisa didapat dari pembelajaran dan pengalaman’, the removed tokens are ‘bisa’, ‘didapat’, ‘dari’ and ‘dan’; the remaining tokens are therefore ‘Pengetahuan’, ‘pembelajaran’ and ‘pengalaman’.

In this study, the stemming process uses the flexible affix classification approach (Setiawan et al. 2016). This algorithm is used because most of the discussion is in Indonesian and the algorithm achieves high accuracy.

Second, the documents of each thread were classified with the corpus classification approach (Setiawan et al. 2019). This process was needed to classify documents into corpora that are more specific than a corpus based on a thread. The parameters for the number of words with the highest term frequency and the number of similar words are 5 and 2, respectively. Tables 2, 3 and 4 show several corpora resulting from the classification of the 1st, 2nd and 3rd course subjects. For example, in Table 2, the 1st thread consists of 82 posts classified into 25 corpora based on 2 or more similar words within the 5 highest term-frequency words. This means that a thread of discussion, which would ideally be assumed to be one corpus, can be divided into several corpora based on the similar words. The same happened in the other threads of discussion.

Table 2 The corpus profiles of 1st course subject
Table 3 The corpus profiles of 2nd course subject
Table 4 The corpus profiles of 3rd course subject

Third, the last step of the model is finding the topic of discussion in a corpus using the latent semantic approach. The documents of each corpus were processed by the eight PLSA steps described in Section 3. Figures 9, 10 and 11 show the performance of the model in finding the topic of discussions, measured with the F-measure. Every post was read and its topic was defined manually as a label in the text document. Precision and recall were measured by comparing the topic produced by the model with the label of each document, and the F-measure was computed from the average precision and average recall per thread.
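
The standard definitions used for these measurements (assumed here, since the paper does not restate them) are, with TP, FP and FN denoting topics correctly found, incorrectly found and missed by the model, respectively:

$$ Precision=\frac{TP}{TP+ FP},\qquad Recall=\frac{TP}{TP+ FN},\qquad F\text{-}measure=\frac{2\times Precision\times Recall}{Precision+ Recall} $$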

Fig. 9 The performance model of 1st course subject based on F-measure

Fig. 10 The performance model of 2nd course subject based on F-measure

Fig. 11 The performance model of 3rd course subject based on F-measure

Figure 9 shows that the precision value is good, which reveals that the topics produced by the model are correct; however, the recall values trend lower than the precision. This indicates that some topics are not found by the model, because the topic chosen from the model is taken only from the highest probability value of the latent variable. On that basis, the recall value is not good enough. An example is the following post:

figure b

The post is from the Information System Concept course subject. Since most topics are in Indonesian, English translations are given in brackets in the Topic column for the convenience of non-Indonesian readers. The topics found by the model are ‘strategi’ (‘strategy’), ‘organisasi’ (‘organization’) and ‘usaha’ (‘business’), while the manually assigned topics are ‘strategi’ (‘strategy’), ‘usaha’ (‘business’) and ‘porter’ (‘porter’). The topic ‘organization’ arises from the model but not from the manual labeling, which lowers the precision. On the other hand, the topic ‘porter’ does not appear among the highest probability values of the model although it is a topic of the post, which lowers the recall. Table 5 presents an example of the topics in a corpus from the model, ordered by probability value in descending order; the 1st level is the highest probability value.

Table 5 The topics from the TFM ordered by probability value (English translations are added for the reader’s convenience)

Consistent with Table 5, the recall value could be improved by also exploring the topics found by the model at the next few levels of probability value.

Figures 10 and 11 provide results with a trend similar to Fig. 9 for precision, recall and F-measure. Overall, the results show that the TFM is consistent and effective in finding the topic of discussion; the precision is good, while the recall can still be increased and will be examined in a future study.

Although the datasets are not the same, an attempt was made to compare the results of the TFM approach with LDA and k-means. The results for k-means and LDA are taken from previous research (Rajasundari et al. 2017). The comparison of k-means, LDA and TFM using precision, recall and F-measure as parameters is shown in Fig. 12.

Fig. 12 Performance of topic modelling comparison

The comparison indicates that the topic modelling algorithms (LDA and TFM) perform better than the machine learning approach (k-means). The precision, recall and F-measure of the TFM are 65%, 50% and 57%, which are greater than those of k-means at 55%, 47% and 51%, respectively.

5 Conclusion

Through discussion forums, members enhance their knowledge about new topics. Unfortunately, that knowledge or information is known only to the members. This paper has presented a model to extract the topic of discussion forum posts, the Topics Finding Model (TFM). The topic from the model is used to label the post, making it possible to retrieve knowledge from the discussion forum. The TFM consists of three steps: pre-processing the text documents, corpus classification and finding the topic through latent semantics. The F-measure was used to measure the effectiveness of the model, and the results show that the TFM is consistent and effective in revealing the topic of discussion. A limitation of this study is that the topic is determined only by the highest probability value of the latent variable; although the precision is good, the recall can still be increased. Exploring additional levels of the probability value of a topic to raise the recall is an opportunity for further study.

Since this approach does not yet handle slang and typographical errors in posts, a normalization process may be added to the pre-processing step so that its impact on the results can be observed. This can be explored in future work.