The Algorithm of Modelling and Analysis of Latent Semantic Relations: Linear Algebra vs. Probabilistic Topic Models
Abstract
This paper presents an algorithm for modelling and analysis of Latent Semantic Relations inside an argumentative type of document collection. The novelty of the algorithm consists in using a systematic approach: in combining the probabilistic Latent Dirichlet Allocation (LDA) and the Linear Algebra based Latent Semantic Analysis (LSA) methods; and in considering each document as a complex of topics, defined on the basis of separate analysis of its particular paragraphs. The algorithm contains the following stages: modelling and analysis of Latent Semantic Relations consecutively on the LDA- and LSA-based levels; rules-based adjustment of the results of the two levels of analysis. The proposed algorithm was verified on corpora of subjectively positive and negative Polish-language film reviews. The levels of the recall rate and precision indicator obtained in the case study allowed drawing conclusions about the effectiveness of the proposed algorithm.
Keywords
Latent semantic analysis · Latent Dirichlet allocation · Rules of adjustment · Corpus · Linear algebra · Probability

1 Introduction
Modelling and Analysis of Latent Semantic Relations (LSR) is an approach to constructing a model of a corpus that reflects the transition from a set of documents and a set of words in the documents to a set of topics describing the contents of the documents. In other words, in the mathematical model of a text collection, the description of words or documents is associated with a family of probability distributions on a set of topics [4, 6, 13].
Construction of the mathematical model can be considered as a problem of simultaneous clustering of documents and words into the same set of clusters, known as topics. In terms of cluster analysis, a topic is the result of bi-clustering, i.e. the simultaneous clustering of words and documents in accordance with their semantic closeness. Thus, the compressed semantic description of a word or a document is a probability distribution on a set of hidden variables (topics). The process of finding these distributions is called topic modelling [18, 19, 20].
These hidden variables (topics) allow presenting a document as a vector in the space of latent topics instead of representing it in the space of words. As a result, the document has a lower number of components, allowing faster and more efficient handling. Thus, topic modelling is closely related to another class of problems known as dimensionality reduction [14, 17, 18, 19, 20].
The basic topic modelling algorithms, on which we concentrate in this paper, are the deterministic Latent Semantic Analysis (LSA) and the probabilistic Latent Dirichlet Allocation (LDA). Although both share the fundamental assumption about the latent semantic (topical) structure of documents, they use different mathematical frameworks: Linear Algebra (LSA) vs. Probabilistic Topic Modelling (LDA) [3, 4, 15].

The main goals of this research are:
– analysing the advantages and disadvantages of the LSR-revealing algorithms inside a textual collection, using two different mathematical frameworks;

– developing the complex Algorithm of Modelling and Analysis of Latent Semantic Relations, based on the advantages of the two different mathematical frameworks;

– demonstrating the effectiveness of the proposed Algorithm implementation for a specific, Argumentative, type of documents, via conducting a case study for Polish-language Film Reviews Corpora.
The research results presented in the paper are supported by the Polish National Centre for Research and Development (NCBiR) under Grant No. PBS3/B3/35/2015, the project “Structuring and classification of Internet contents with the prediction of its dynamics”.
2 Theoretical Background of the Research
2.1 Vector Space Models of the Semantic Relations Analysis
The aim of LSR analysis is to extract the “semantic structure” of a collection in the information flow and automatically expand it into the underlying topics. Significant progress on the problem of representing and analysing such data has been made by researchers in the field of information retrieval (IR) [1, 10, 11]. The basic methodology proposed by IR researchers for text collections reduces each document in the corpus to a vector of real numbers, each of which represents ratios of counts, where:
\( k(w,L_{t}) \) – the number of occurrences of the word w in the text t; \( df \) – the total number of words in the text t; D – the total number of documents in the collection.
Then, for solving the problem of finding the similarity of documents (terms) from the point of view of their relation to the same topic, different metrics can be applied. The most appropriate metric is the cosine of the angle between the vectors [14, 20, 21, 22].
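As an illustration, the vectorization and cosine measure described above can be sketched as follows (a minimal sketch: the toy token lists and the plain TF-IDF weighting are assumptions, not the exact weighting used in this study):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors: tf is the count ratio within a document,
    idf is log(D / number of documents containing the term)."""
    vocab = sorted({w for d in docs for w in d})
    D = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    vecs = []
    for d in docs:
        counts = Counter(d)
        n = len(d)
        vecs.append([(counts[w] / n) * math.log(D / df[w]) for w in vocab])
    return vocab, vecs

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical tokenized documents (terms borrowed from the case study).
docs = [["fabuła", "akcja", "efekt"],
        ["kino", "obsada", "akcja"],
        ["bohater", "gra", "rola"]]
vocab, vecs = tfidf_vectors(docs)
sim = cosine(vecs[0], vecs[1])   # similarity of the first two documents
```

Documents sharing no terms get a cosine of zero, which is exactly the “surface usage” limitation discussed below.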
A further part of the algorithm consists in dividing the source data into groups corresponding to events, as well as in determining whether a text document describes any given topic. The main idea of the solution is the use of clustering algorithms [12, 14, 17, 18, 19, 20, 21].
The limitations of this method are: the calculations measure only the “surface” usage of words as patterns of letters; they cannot distinguish such phenomena as polysemy and synonymy [10, 13, 16].
2.2 Latent Semantic Indexing
In 1988, Dumais et al. [7] proposed the method of Latent Semantic Indexing (LSI), most frequently referred to as LSA (Deerwester et al., 1990 [8]), designed to improve the efficiency of IR algorithms and search engines by projecting documents and terms into a space of lower dimension, which includes the semantic concepts of the original set of documents.
\( A_{t \times d} \approx U_{k}\varSigma_{k}V_{k}^{T} \), where \( U_{k}\varSigma_{k} \) represents the terms in the k-dimensional latent space; \( \varSigma_{k}V_{k}^{T} \) represents the documents in the k-dimensional latent space; \( U_{k} \), \( V_{k} \) retain the term–topic and document–topic relations for the top k topics.
But, as [18, 19] proved, there are three limitations to applying LSA: documents must have the same writing style (Lim#1); each document must be centered on a single topic (Lim#2); a word has a high probability of belonging to one topic but a low probability of belonging to other topics (Lim#3). The limitations of LSA stem from the orthogonality of the dimension factors, as well as from the fact that the probabilities for each topic and document are distributed uniformly, which does not correspond to the actual characteristics of collections of documents [7, 8, 23]. That is why LSA tends to prevent multiple occurrences of a word in different topics and thus cannot be used effectively to resolve polysemy issues (Lim#4).
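The low-dimensional projection described above can be sketched with a truncated SVD (a minimal sketch assuming a small hypothetical term–document count matrix; the study itself operates on term–CF matrices):

```python
import numpy as np

# Hypothetical term-document count matrix A (t x d):
# rows = terms, columns = documents.
A = np.array([[1., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.],
              [0., 0., 0., 1.]])

# Full SVD, then keep only the top k singular values/vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

terms_latent = U_k @ S_k     # terms in the k-dimensional latent space
docs_latent = S_k @ Vt_k     # documents in the k-dimensional latent space
A_k = U_k @ S_k @ Vt_k       # best rank-k approximation of A
```

Because `A_k` is dense, two documents sharing no terms in `A` can still be close in the latent space, which is the effect exploited later in Sect. 3.3.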
2.3 Probabilistic Topic Models
In contrast to the so-called discriminative approaches (LSI, LSA), in a probabilistic approach the topics are given by the model, and the term–document matrix is then used to estimate its hidden parameters, which can subsequently be used to generate the simulated distributions [4, 6, 17, 25].
Latent Dirichlet Allocation
LDA is a generative probabilistic graphical model proposed by David Blei [3, 4, 15]. LDA is a three-level hierarchical Bayesian model. The algorithm of the method is as follows. Each document is generated independently: randomly select its distribution over topics \( \theta_{d} \); for each word of the document, randomly select a topic from the distribution \( \theta_{d} \) obtained in the first step; then randomly select a word from the distribution of words in the chosen topic \( \varphi_{k} \) (the distribution of words in topic k). In the classical LDA model, the number of topics is initially fixed and specified by the explicit parameter k.
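The generative process above can be sketched as follows (a toy sketch: the dimensions are hypothetical, while the Dirichlet hyperparameters echo the alpha and eta values used later in the case study):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): K topics, V vocabulary terms, D documents.
K, V, D, doc_len = 3, 8, 2, 10
alpha, eta = 1.5, 1.0            # Dirichlet hyperparameters

# Per-topic word distributions phi_k ~ Dirichlet(eta), shape (K, V).
phi = rng.dirichlet([eta] * V, size=K)

docs = []
for _ in range(D):
    theta_d = rng.dirichlet([alpha] * K)    # topic mixture of the document
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta_d)        # pick a topic from theta_d
        w = rng.choice(V, p=phi[z])         # pick a word from phi_z
        words.append(w)
    docs.append(words)
```

Inference inverts this process: given only `docs`, it estimates `theta_d` and `phi` for a fixed k.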
Methods of Evaluating the Quality of Results
The limitation of the LDA method is that, even though it is possible to choose the optimal value of k, the probability of a document belonging to a particular topic can still be insignificant (Lim#5) [3, 4, 15].
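The quality measure used for model selection in this study is perplexity [2], reported later in Table 1. A commonly used formulation over a held-out set of D documents \( \mathbf{w}_d \) with \( N_d \) words each is:

```latex
\mathrm{Perplexity} = \exp\left( - \frac{\sum_{d=1}^{D} \log p(\mathbf{w}_{d})}{\sum_{d=1}^{D} N_{d}} \right)
```

Lower perplexity indicates a better predictive model; in Table 1 it is used to compare settings of the number of topics and hyperparameters.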
3 Methodology
In this research, the following basic definitions are used:
1. Term is a basic unit of discrete data.
2. Latent Semantic/Probabilistic topic (topic) is a basic unit of Latent Semantic Relations, obtained by the LSA/LDA approach.
3. Context Fragment (CF) is an indivisible, topically completed sequence of terms, located within a document's paragraph.
4. Document is a set of CFs.
5. Corpus (film reviews corpus, FRC) is a collection of Documents.
6. Semantic Cluster (SC) is a set of CFs that have hidden semantic closeness (HSC).
7. Contextual Dictionary (CD) is a set of terms that have HSC.
8. Subjective Sentiment Corpus (SSC) is a collection of Documents that have common sentiment closeness.
3.1 Novelty and Motivation
The main scientific research questions of this study are:
1. Does taking into account the specific features of the Argumentative type of document affect the Quality of the Topic Modelling Process Results?
2. Is it possible to increase the Level of Quality of the Topic Modelling Process Results via using a combination of the Discriminant and Probabilistic Methods?
To find the answers to these questions, the following main heuristics and hypotheses were formulated:
Heuristic H1.1. Taking into account the specificity of the Type of Documents chosen for this study and the presence of non-official requirements on a Film Review's structure and writing rules [22], assume that the writing style of each review is approximately the same (eliminating Lim#1).
Hypothesis H1.2. Taking into account the chosen Document Type Specificity, assume that each paragraph (CF) is centered on a single Topic and should be analysed separately (eliminating Lim#2).

Hypothesis H2.1. Assume that the combination of the Discriminant and Probabilistic Methods allows improving:
– the quality of the LDA method of topic recognition, via increasing the probability of assigning a topic to a particular CF by taking into account the hidden LSR phenomena (eliminating Lim#5);

– the quality of the LSA method of LSR recognition, via adjusting the consequences of the uniform distribution of topics within the document by taking into account the probabilistic approaches (eliminating Lim#3 and Lim#4).
As a sample for the case study experiments, Polish-language film reviews from filmweb.pl are used. For demonstrating the basic workability of the authors' Algorithm, a preliminary case study (PCS) was conducted on a data set of only one randomly chosen Polish-language film review, which contains 7 CFs. All words/terms of film reviews in this paper are presented in Polish and English (separated by the symbol “/”). The experimental part of all steps of the authors' Algorithm has been implemented in Python 3.4.1.
3.2 The Level of LDA-Based Modelling and Analysis of Latent Semantic Relations
LDA-based Modelling of LSR
LDA-based Modelling of LSR is the stage which ensures the implementation of the level of LDA-based Topic Analysis; it presupposes a “Bag of Words” forming (preprocessing) step, which includes:
– a text adaptation procedure, based on the specificity of the structure of the review document layout (Fig. 2). This procedure implements the replacement of the Film's Title and the Names/Surnames of the Director/Actors/Characters in the corresponding positions of the descriptive part of the review (for example, the Title of the film is replaced by “Film”, the Name and Surname of an actor by “Actor”, etc.);

– expanding by the authors the list of stop words (nearly 400 Polish words) for improving the lemmatization process (based on the dictionary pyMorfologik [13, 16, 22]);

– performing part-of-speech (adjectives, nouns, verbs) morphological tagging and filtering procedures, which allowed increasing the resolution of the LSR analysis.
LDA-based Analysing of Latent Semantic Relations
Step I. Identifying the Topics
LDA-based Analysis is the stage which aims: (1) to reveal the optimal number of latent probabilistic topics that describe the main content of the analysed document; (2) to assign them to the CFs based on the probabilistic LSR within the paragraphs. As technical support for the implementation of this phase, the gensim Python package (https://radimrehurek.com/gensim/models/ldamodel.html) was used.
Table 1. PCS results of studying the LDA model parameters
Perplexity  Number of topics  Number of terms  Number of passes  Alpha parameter  Eta parameter  Max probability topic  Max probability of terms in the topics 

3336  10  10  100  1.70  1.00  0.1025  0.057 
633  7  7  100  1.50  1.00  0.6050  0.177 
202  5  5  100  1.50  1.00  0.7134  0.167 
64  3  5  100  1.50  1.00  0.8417  0.132 
63  3  7  100  1.50  1.00  0.8411  0.166 
Table 2. PCS results of the list of latent probabilistic topics with the distribution of terms
Terms (Polish/English)  Probability  Terms (Polish/English)  Probability  Terms (Polish/English)  Probability 

Topic #0  Topic #1  Topic #2  
fabuła/story  0.080  kino/cinema  0.109  bohater/character  0.166 
akcja/action  0.062  twórca/creator  0.066  gra/playing  0.140 
efekt/effect  0.050  kobieta/woman  0.062  dobry/good  0.130 
bohater/character  0.047  obsada/cast  0.052  postać/character  0.090 
ksiazka/book  0.046  scena/scene  0.051  rola/role  0.040 
obraz/image  0.044  główny/main  0.050  typowy/typical  0.030 
historia/history  0.042  reżyser/director  0.049  intryga/intrigue  0.029 
Step II. LDA-clustering of CF in Semantic Dimensions of Corpus
Table 3. PCS results of the semantic clustering of CF
CF  CF_5  CF_0  CF_1  CF_4  CF_6  CF_2  CF_3 

# topic (cluster)  0  1  1  1  1  2  2 
Probability  0.8411  0.6228  0.8022  0.7039  0.4800  0.7957  0.6603 
The values of the Perplexity in Table 1 prove the validity of the assumption about analysing the Corpora by paragraphs (Hypothesis H1.2). But, on the other hand, we can note that the probability of a CF belonging to a particular topic/cluster is not significant for all CFs (for example, for CF_6 it is lower than 0.5).
3.3 The Level of LSA-Based Modelling and Analysis of Latent Semantic Relations
LSA-based Modelling of Latent Semantic Relations
Table 4. The fragment of PCS results of the absolute frequency term–CF matrix
Terms (Polish/English)  CF_0  CF_1  CF_2  CF_3  CF_4  CF_5  CF_6  Sum 

bohater/character  1  1  4  5  2  2  1  16 
akcja/action  0  1  0  2  1  3  2  9 
kino/cinema  1  3  0  2  1  0  2  9 
film/movie  0  2  1  0  0  1  1  5 
główny/main  1  2  1  0  0  0  0  4 
kobieta/woman  0  3  0  0  1  0  0  4 
As for the results of the TF-IDF transformation of this matrix, we can state the following facts: the differences in absolute term frequencies were reduced; frequently appearing terms are less relevant compared to infrequent terms; the term–CF matrix now contains weighted term frequencies.
However, according to [26] and a number of the authors' experiments, the following was found: the TF-IDF approach does not work well, because when a CF contains only 100–150 words, terms seldom occur more than once within a document; the most common words within one CF are the so-called key terms, which to a large extent determine the topic label of the analysed CF; it is therefore more important to focus on removing stop words and keeping the most significant parts of speech, maximising the weight of the CF's keywords by excluding terms that have no semantic weight.
LSA-based Analysing of Latent Semantic Relations
LSA-based Analysis of LSR is the stage which aims to identify the patterns in the relationships between the terms and the latent semantic topics. As already stated, the LSA method is based on the principle that terms used in the same contexts tend to have similar meanings. For revealing this information about the LSR between topics and CFs/terms, we need: to assess the degree of semantic correlation between CFs/terms via building the reduced model of LSR; to form the semantic clusters of CFs via determining the cosine distance between the CFs, in order to identify the LSR between topics and CFs; and to form the contextual dictionary of the semantic clusters of CFs via determining the cosine distances between the terms, in order to identify the LSR between the terms and the top k topics.
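The steps above (build the reduced model, then compare CFs by cosine measure) can be sketched as follows, reusing the six term rows of the Table 4 fragment (a sketch: the full study uses the complete term–CF matrix, and k = 2 is an assumption):

```python
import numpy as np

# Fragment of the term-CF matrix from Table 4 (rows = terms, columns = CF_0..CF_6).
A = np.array([[1., 1., 4., 5., 2., 2., 1.],   # bohater/character
              [0., 1., 0., 2., 1., 3., 2.],   # akcja/action
              [1., 3., 0., 2., 1., 0., 2.],   # kino/cinema
              [0., 2., 1., 0., 0., 1., 1.],   # film/movie
              [1., 2., 1., 0., 0., 0., 0.],   # główny/main
              [0., 3., 0., 0., 1., 0., 0.]])  # kobieta/woman

# Reduced LSR model: project CFs into a k-dimensional latent space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
cf_latent = (np.diag(s[:k]) @ Vt[:k, :]).T    # one row per CF

# Cosine similarity matrix between the CF vectors in the latent space.
norms = np.linalg.norm(cf_latent, axis=1, keepdims=True)
unit = cf_latent / norms
cos_sim = unit @ unit.T
```

The matrix `cos_sim` plays the role of Table 7; the same construction applied to `U[:, :k] @ np.diag(s[:k])` gives the term-to-term comparison of Table 8.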
Step III. Identifying the Hidden Semantic Connection Within the Documents
Table 5. The fragment of PCS results of the reduced model for identifying the LSR
Via comparison of the red numbers in Table 5 with the zero values in the same places of Table 4, one can identify, as an example, the existence of the following LSR phenomena:
– the term “Film/Movie” seems to be present in all CFs where the word “Bohater/Character” appears;
– the term “Kobieta/Woman” seems to be present in the CFs where the word “Kino/Cinema” appears.
Table 6. Example of PCS results of the comparison of the CC between terms
Source terms  Absolute frequency terms–CF matrix  Reduced model for identifying the hidden connection 

Bohater, Film  –0.33391154  0.984754769 
Kino, Kobieta  0.64162365  0.984405802 
Steps IV–V. Identifying the Degree of Closeness Between the CF/Terms in the Semantic Dimensions of Topics
Table 7. PCS results of the matrix of cosine distances between the vectors of CF
CF_0  CF_1  CF_2  CF_3  CF_4  CF_5  CF_6  

CF_0  1  0.9998  0.8052  0.8403  0.9537  –0.3376  0.8764 
CF_1  0.9998  1  0.8164  0.8505  0.9592  –0.3196  0.8855 
CF_2  0.8052  0.8164  1  0.9981  0.9463  0.2863  0.9912 
CF_3  0.8403  0.8505  0.9981  1  0.9645  0.2266  0.9975 
CF_4  0.9537  0.9592  0.9463  0.9645  1  –0.0387  0.9807 
CF_5  –0.3376  –0.3196  0.2863  0.2266  –0.0387  1  0.1573 
CF_6  0.8764  0.8855  0.9912  0.9975  0.9807  0.1573  1 
Table 8. The fragment of PCS results of the matrix of cosine distances between the vectors of terms
akcent/accent  akcja/action  bohater/character  …  łatwo/easily  osiągać/reach  

akcent  1  0.9938  0.6136  …  0.873  0.1269 
akcja  0.9938  1  0.6978  …  0.8132  0.2367 
bohater  0.6136  0.6978  1  …  0.1506  0.8611 
…  
łatwo  0.873  0.8132  0.1506  …  1  0.373 
osiągać  0.1269  0.2367  0.8611  …  0.373  1 
Step VI. LSA Clustering of CF/Terms in the Semantic Dimensions of Topics
Table 9. PCS results of the labels of the contextual fragments' clustering
CF  CF_0  CF_1  CF_5  CF_2  CF_3  CF_4  CF_6 

Cluster  0  0  1  2  2  2  2 
Table 10. PCS results of the contextual dictionary of semantic clusters
Terms (Polish/English)  Cluster  Terms (Polish/English)  Cluster  Terms (Polish/English)  Cluster 

fabuła/story  0  reżyser/director  1  bohater/character  2 
akcent/accent  0  kino/cinema  1  dobry/good  2 
scenariusz/script  0  kobieta/woman  1  film/movie  2 
akcja/action  0  główny/main  1  intryga/intrigue  2 
ksiazka/book  0  obsada/cast  1  sposób/method  2 
scena/scene  0  efekt/effect  1  typowy/typical  2 
obraz/image  0  schemat/scheme  1  gra/playing  2 
historia/history  0  stworzyć/create  1  rola/role  2 
3.4 Adjustments of the Results of the Two Levels of Analysis
1. Forming the table of the comparison of the numerical labels of the Latent Semantic Clusters of the set of CFs, obtained on the two levels of research (Table 11). As we can see, the clustering results for CF_4 and CF_6, obtained on the LSA and LDA analysis levels, do not match.
Table 11. PCS results of the comparison of the semantic clusters as a set of CF labels
LDA-level  LSA-level
CF  # Topic (Cluster)  Probability  CF  Cluster

CF_0  1  0.6228  CF_0  0
CF_1  1  0.8022  CF_1  0
CF_2  2  0.7957  CF_2  2
CF_3  2  0.6603  CF_3  2
CF_4  1  0.7039  CF_4  2
CF_5  0  0.8411  CF_5  1
CF_6  1  0.4800  CF_6  2
2. Formulation and implementation of the Rules of Adjustment of the results obtained on the LSA and LDA analysis levels.
Table 12. Rules of adjustment of CF clustering results
# of rule  LSAanalysis result  Result of comparison  LDAanalysis result  LDA Probability (P)  Assignable cluster 

1  LSA Cluster  =  LDA Cluster  P > 0.3  LSA Cluster = LDA Cluster 
2  LSA Cluster  =  LDA Cluster  P ≤ 0.3  Cluster is Not recognized 
3  LSA Cluster  ≠  LDA Cluster  P ≤ 0.3  LSA Cluster 
4  LSA Cluster  ≠  LDA Cluster  0.3 < P ≤ 0.7  LSA Cluster/Reclustering 
5  LSA Cluster  ≠  LDA Cluster  P > 0.7  LDA Cluster 
These rules allow:
– to improve the quality of the LDA method of recognizing the CF's topics (rules 3, 4), due to the possibility of correcting the clustering results characterized by a low probability of a CF belonging to a particular topic. The suggested instrument is the latent semantic specificity of the LSA method;
– to improve the quality of the LSA method of recognizing hidden relations between the CFs (rules 2, 5), due to the possibility of correcting the clustering results characterized by situations when the CF coordinates are located on a cluster's boundary. The suggested instrument is the probabilistic characteristics of the LDA method.
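The five rules can be sketched as a single function (a sketch: for rule 4 the re-clustering option is collapsed to keeping the LSA label, and `None` marks an unrecognized cluster):

```python
def adjust_cluster(lsa_cluster, lda_cluster, p):
    """Combine the LSA and LDA cluster labels of a CF
    using the LDA probability P, following the rules table."""
    if lsa_cluster == lda_cluster:
        # Rules 1-2: matching labels are kept only if P is significant.
        return lsa_cluster if p > 0.3 else None   # None = not recognized
    if p <= 0.3:
        return lsa_cluster                        # Rule 3: trust LSA
    if p <= 0.7:
        return lsa_cluster                        # Rule 4: LSA / re-clustering
    return lda_cluster                            # Rule 5: trust LDA
```

For example, with the PCS values for CF_4 (LSA 2, LDA 1, P = 0.7039) rule 5 fires and the LDA label is kept, while CF_6 (LSA 2, LDA 1, P = 0.4800) falls under rule 4 and keeps the LSA label.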
Table 13. PCS results of the final version of the labels of the CF's semantic clusters
CF  CF_5  CF_0  CF_1  CF_4  CF_2  CF_3  CF_6 

# topic  0  1  1  1  2  2  2 
4 Case Study Results and Discussion
For the verification of the authors' Algorithm, the sentimental structure of the FRC was formed via classification of the review collection into the Subjectively Positive (SPSC) and Subjectively Negative (SNSC) Sentiment Corpora. This procedure is realized on the basis of information about the subjective assessment (SA) of the films by the reviewers (measured on a 10-point scale).
As a condition of building the sentimental structure of the FRC, the following Heuristic H1.3 was adopted: a review is assigned to the SPSC if its SA is more than 5 points, and to the SNSC if it is equal to or less than 5 points.
Table 14. The structure of the semantic clusters
SPSC  SNSC  

Labels of the Topics  LSA, %  LDA, %  LSA & LDA, %  Labels of the Topics  LSA, %  LDA, %  LSA & LDA, % 
Bohater/Character  19.71  18.75  19.23  Bohater/Character  11.54  13.46  12.31 
Reżyser/Director  32.21  36.06  33.65  Aktor/Actor  30.00  30.38  29.23 
Scenariusz/Scenario  17.79  12.50  16.83  Widz/Spectator  28.85  26.92  28.08 
Fabuła/Story  30.29  32.69  30.29  Fabuła/Story  29.62  29.23  30.38 
Table 15. The quality of the LSR analysis results
SPSC  SNSC  

Labels of the topics  Indicator 1  Indicator 2  Labels of the topics  Indicator 1  Indicator 2 
Bohater/Character  7.50  5.56  Bohater/Character  9.23  6.25 
Reżyser/Director  2.82  5.48  Aktor/Actor  1.27  5.13 
Scenariusz/Scenario  3.17  12.00  Widz/Spectator  5.52  9.09 
Fabuła/Story  6.11  7.81  Fabuła/Story  2.61  2.70 
Recall rate  95.19  Recall rate  96.15 
5 Conclusions
In this paper, the authors presented the complex two-level Algorithm of Modelling and Analysis of LSR, aimed at eliminating the Limitations characterizing the two mathematical frameworks and taking into account the Document Type Specificity. The answers to the main scientific research questions were found: the combination of the Discriminant and Probabilistic Methods (Hypothesis H2.1), as well as the approach oriented at the Specificity of the Argumentative Document Type (Hypothesis H1.2), gave the opportunity to improve the following qualitative characteristics of the LSR Analysis:
– the recall rate (the ratio of the number of semantically clustered/recognized paragraphs to the total number of paragraphs in the corpora) to 90–95%;
– the precision indicator (the average probability of the significantly clustered/recognized paragraphs) from 62% to 70–75%.
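The two indicators, as defined above, can be computed as follows (a sketch: the label and probability lists are hypothetical, not the case-study data):

```python
def recall_rate(labels):
    """Ratio of recognized (non-None) paragraph labels to all paragraphs, in %."""
    recognized = [c for c in labels if c is not None]
    return 100.0 * len(recognized) / len(labels)

def precision_indicator(probs, labels):
    """Average LDA probability over the recognized paragraphs, in %."""
    kept = [p for p, c in zip(probs, labels) if c is not None]
    return 100.0 * sum(kept) / len(kept)

# Hypothetical adjusted cluster labels (None = not recognized) and probabilities.
labels = [1, 1, 2, 2, 1, 0, 2, None]
probs = [0.62, 0.80, 0.80, 0.66, 0.70, 0.84, 0.48, 0.25]

r = recall_rate(labels)               # 7 of 8 paragraphs recognized
pi = precision_indicator(probs, labels)
```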
In future research, these results are planned to be used: to evaluate the Algorithm's effectiveness for processing English-language Documents; to develop an algorithm for forming the hierarchical structure of the Latent Topics of the Corpora, taking into account the Sentiment specificity.
References
1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval, 2nd edn. Addison-Wesley, Wokingham (2011)
2. Bahl, L., Baker, J., Jelinek, F., Mercer, R.: Perplexity – a measure of the difficulty of speech recognition tasks. In: Program, 94th Meeting of the Acoustical Society of America, vol. 62, p. S63 (1977)
3. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
4. Blei, D.: Introduction to probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)
5. Ali, D., Juanzi, L., Lizhu, Z., Faqir, M.: Knowledge discovery through directed probabilistic topic models: a survey. In: Proceedings of Frontiers of Computer Science in China, pp. 280–301 (2010)
6. Blei, D.: Topic modeling. http://www.cs.princeton.edu/~blei/topicmodeling.html
7. Dumais, S.T., Furnas, G.W., Landauer, T.K., Deerwester, S.: Using latent semantic analysis to improve information retrieval. In: Proceedings of CHI 1988: Conference on Human Factors in Computing, pp. 281–285. ACM, New York (1988)
8. Deerwester, S., Dumais, S.T., Harshman, R.: Indexing by latent semantic analysis (1990). http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
9. Eldén, L.: Matrix Methods in Data Mining and Pattern Recognition. SIAM, Philadelphia (2007)
10. Furnas, G.W., Deerwester, S., Dumais, S.T., Landauer, T.K., Harshman, R.A., Streeter, L.A., Lochbaum, K.E.: Information retrieval using a singular value decomposition model of latent semantic structure. In: Proceedings of the ACM SIGIR Conference, pp. 465–480. ACM, New York (1988)
11. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill Computer Science Series, vol. XV, 448 p. McGraw-Hill, New York (1983)
12. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3) (1999)
13. Gramacki, J., Gramacki, A.: Metody algebraiczne w zadaniach eksploracji danych na przykładzie automatycznego analizowania treści dokumentów. In: XVI Konferencja PLOUG, pp. 227–249 (2010)
14. Kapłanski, P., Rizun, N., Taranenko, Y., Seganti, A.: Text-mining similarity approximation operators for opinion mining in BI tools. In: Proceedings of the 11th Scientific Conference “Internet in the Information Society 2016”, pp. 121–141. University of Dąbrowa Górnicza (2016)
15. Canini, K.R., Shi, L., Griffiths, T.: Online inference of topics with latent Dirichlet allocation. J. Mach. Learn. Res. Proc. Track 5, 65–72 (2009)
16. Tomanek, K.: Analiza sentymentu – metoda analizy danych jakościowych. Przykład zastosowania oraz ewaluacja słownika RID i metody klasyfikacji Bayesa w analizie danych jakościowych. Przegląd Socjologii Jakościowej, pp. 118–136 (2014). www.przegladsocjologiijakosciowej.org
17. Aggarwal, C., Zhai, C.: Mining Text Data. Springer, New York (2012)
18. Leticia, H.A.: Comparing Latent Dirichlet Allocation and Latent Semantic Analysis as Classifiers. Doctor of Philosophy (Management Science), 226 p. (2011)
19. Papadimitriou, C.H., Raghavan, P., Tamaki, H., Vempala, S.: Latent semantic indexing: a probabilistic analysis. J. Comput. Syst. Sci. 61, 217–235 (2000)
20. Rizun, N., Kapłanski, P., Taranenko, Y.: Development and research of the text messages semantic clustering methodology. In: Third European Network Intelligence Conference (ENIC 2016), pp. 180–187 (2016)
21. Rizun, N., Kapłanski, P., Taranenko, Y.: Method of a two-level text-meaning similarity approximation of the customers' opinions. In: Economic Studies – Scientific Papers, vol. 296, pp. 64–85. University of Economics in Katowice (2016)
22. Rizun, N., Taranenko, Y.: Development of the algorithm of Polish language film reviews preprocessing. In: Proceedings of the 2nd International Conference on Information Technologies in Management, Rocznik Naukowy Wydziału Zarządzania WSM (2017, in print)
23. Xu, R., Wunsch, D.C.: Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(3), 645–678 (2005)
24. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975)
25. Hofmann, T.: Probabilistic latent semantic analysis. In: UAI 1999, pp. 289–296 (1999); Hofmann, T.: Probabilistic latent semantic indexing. In: SIGIR 1999, pp. 50–57 (1999)
26. Timonen, M.: Term Weighting in Short Documents for Document Categorization, Keyword Extraction and Query Expansion. PhD Thesis, Series of Publications A, Report A-2013-1 (2013)