GDTM: Graph-based Dynamic Topic Models

Dynamic Topic Modeling (DTM) extracts topics from the short texts generated in Online Social Networks (OSNs) like Twitter. A DTM solution must be scalable and must account for the sparsity and dynamicity of short texts. Current solutions combine probabilistic mixture models like Dirichlet Multinomial or Pitman-Yor processes with approximate inference approaches like Gibbs Sampling and Stochastic Variational Inference to, respectively, account for the dynamicity and scalability of DTM. However, these methods rely on weak probabilistic language models, which do not account for sparsity in short texts. In addition, their inference is based on iterative optimizations, which raise scalability issues when it comes to DTM. We present GDTM, a single-pass graph-based DTM algorithm, to solve the problem. GDTM combines a context-rich and incremental feature representation method with graph partitioning to address scalability and dynamicity and uses a rich language model to account for sparsity. We run multiple experiments over a large-scale Twitter dataset to analyze the accuracy and scalability of GDTM and compare the results with four state-of-the-art models. As a result, GDTM outperforms the best model by 11% on accuracy and runs an order of magnitude faster, while creating four times better topic quality over standard evaluation metrics.

Requiring a fixed number of topics limits the power of classical topic models to account for dynamicity. GSDMM [8], FGSDMM [9] and PYPM [10] propose to solve this problem using stochastic optimization approaches such as Dirichlet Multinomial Mixture (DMM) [11] models or Pitman-Yor [10] processes to consider an infinite number of topics and allow the algorithm to dynamically adapt the number of partitions. In addition, all these approaches strive to achieve scalability by reducing the sample size using approximate optimization algorithms such as Gibbs Sampling [12] and Stochastic Variational Inference [13]. However, they still rely on the same iterative optimization mechanisms and are therefore sensitive to scalability issues when it comes to DTM.
Our approach This paper presents GDTM, a Graph-based Dynamic Topic Modeling algorithm designed to overcome those limitations by taking all the above-mentioned aspects into consideration. The solution combines a dimensionality reduction technique called Random Indexing (RI) [14] to address scalability, an advanced language modeling approach based on the Skip-Gram [15] technique, used in natural language modeling and speech recognition, to address sparsity, and an innovative graph model together with a single-pass graph partitioning algorithm to account for dynamicity. Figure 1 shows the overall protocol of the algorithm, which is constructed as a pipeline.
A stream of documents passes through the pipeline of four components, where each document is processed by each component until a topic is assigned to it. First, the Feature Vector Extraction component reads and tokenizes the document and extracts a vector representation for each word in the document using RI. Then, the Feature Vector Composition component combines the corresponding feature vectors into a document representation vector using the skip-gram model. After that, the Graph Representation component converts each document vector into a graph representation called a document graph. Finally, the Graph Partitioning component extracts the topics by aggregating the document graphs into a single, weighted and highly dense graph representation called the Knowledge Graph (KG). The algorithm uses the KG for two purposes: first, to assign topics to new documents based on the overlap between their corresponding document graphs and the KG, and second, to maintain the dynamics of the topics following a deterministic optimization function.

Main contribution
The key element of success in our algorithm is the distinction between its two main components, namely (i) feature representation and (ii) topic extraction. This allows us to develop a single-pass algorithm where each document passes only once through the entire process. Moreover, the two main characteristics that play a significant role in this scenario are (i) the incremental nature of the RI technique, which allows us to extract semantically rich, low-dimensional feature representation vectors without the need to access the entire dataset, and (ii) the single-pass streaming graph partitioning, which enables the extraction of high-quality topics encoded in the graph representation using the rich language representation model.

Summary of experiments and results
We run two sets of experiments to analyze the (i) accuracy and (ii) scalability of GDTM. To measure accuracy, we define a topic modeling task on a tagged Twitter dataset and compare GDTM with four state-of-the-art approaches on this task using a standard evaluation metric called B-Cubed [16]. For scalability, we run a set of experiments on a large-scale Twitter dataset and compare the execution time and the quality of the extracted topics, using the evaluation method called Coherency [17]. The results show that GDTM outperforms all the state-of-the-art approaches in both accuracy and scalability. In particular, GDTM provides more than 11% improvement in accuracy over the best of the state-of-the-art approaches. In addition, we show that GDTM is an order of magnitude faster than the best approach on scalability, while the extracted partitions exhibit significantly higher quality in terms of coherency.

Related work
Classical solutions for topic modeling on text, such as PLSI [18] and LDA [3], model the co-occurrence patterns as a probability distribution over a batch of long documents and infer the topics using statistical techniques such as variational inference and Gibbs Sampling [12]. However, with the emergence of online social networks and the appearance of short texts, like tweets, these solutions face various challenges related to the size, the number and the dynamics of the documents in such new environments.
Yan et al. [4] and Ghoorchian et al. [26] presented methods to overcome the sparsity in short texts by applying a more complex language model, known as bigram, to construct richer context representations from short texts. However, they did not consider dynamicity, as their models still require the number of topics to be known in advance and therefore lack flexibility when it comes to fast dynamic changes in the documents.
Blei et al. [5] proposed the first solution specifically designed to address dynamicity in DTM. They developed a family of probabilistic time-series models to analyze and extract the evolution of topics in a stream of documents. The authors approached dynamicity by discretizing the stream of documents into a stream of batches and interrelating the consecutive models through variational approximation methods based on Kalman Filters [19]. Their model was limited in scalability when the discretization of the topics went to infinity. Wang et al. [6] proposed another solution, called Continuous-time Dynamic Topic Model (CDTM), to overcome the discretization problem in DTM using a continuous generalization approach. DTM and CDTM are designed for topic modeling on large documents and do not account for sparsity and scalability in short texts.
Liang et al. [7] proposed another solution, based on short-term and long-term inter-dependencies between the means of the distributions across multiple time stamps, to solve the discretization problem and also account for the sparsity in short texts. However, their model, similar to DTM and CDTM, requires the number of topics to be known in advance, which limits its power to account for dynamicity. In addition, their inference approach is based on the same iterative Gibbs Sampling optimization mechanism, which limits scalability.
To solve the problem of the fixed number of topics, Yin et al. proposed two solutions, GSDMM [8] and FGSDMM [9], based on Dirichlet Multinomial Mixture (DMM) processes. Qiang et al. [10] improved Yin's solution using a new clustering approach with probabilities derived from a Pitman-Yor Mixture Process [20]. These approaches have significantly improved the accuracy of the extracted topics. However, they are designed for batch processing problems and therefore face scalability issues when it comes to DTM.
Multiple solutions have been developed to overcome different challenges in DTM, but to our knowledge, a single approach that tackles all the challenges at once is missing. Thus, we present GDTM as a universal model designed to meet all the challenges in DTM.

Solution
In this section, we explain the details of our single-pass graph-based dynamic topic modeling algorithm. The algorithm is designed as a pipeline that receives a stream of documents. The documents pass through four components: Feature Vector Extraction, Feature Vector Composition, Graph Representation and Graph Partitioning. In the following sections, we explain each of these components and the way they interact with each other to extract the topics.

Feature vector extraction
We consider words as the atomic features and use a vector representation model to construct the feature vectors as the building blocks of the document representation model. GDTM requires a representation model that (i) is low-dimensional, to account for scalability, (ii) is incremental, to be useful in a streaming setting, and (iii) creates relatively rich representations that contribute to efficiency in a single-pass optimization approach. RI is a reliable [21] dimensionality reduction technique that satisfies all of the above requirements.
RI follows the famous statement "you shall know a word by the company it keeps" [22], based on distributional semantics [23]. The algorithm iterates through the document and constructs a low-dimensional vector representation for each word as follows. First, for each new word, RI creates a new vector WV of a fixed dimension d and randomly initializes an arbitrary number ζ of its elements to 1 and the remaining d − ζ to 0. Then, the algorithm updates the WV of each word by looking into a window of an arbitrary size ω around the corresponding word and aggregating the corresponding WVs. The dimension d of the vectors is fixed and significantly lower than the size n of the original feature space (e.g., the total number of words), d ≪ n. To avoid redundancy, we maintain a list of previously seen words together with their corresponding feature vectors and update the feature vectors only upon the observation of new context structures. This mechanism allows each feature vector to contain a rich representation of the context structure around the corresponding word without any clue on the significance of those structures; that requirement is addressed by the graph partitioning component.
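The construction above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the function and variable names are ours, the 0/1 index vectors follow the text (production RI variants often draw elements from {−1, +1}), and neighbors are aggregated uniformly within the window.

```python
import random
from collections import defaultdict

def random_indexing(tokens, d=100, zeta=4, omega=2, seed=42):
    """Sketch of Random Indexing: each new word gets a fixed sparse random
    index vector (zeta ones, d - zeta zeros); a word's feature vector is the
    sum of the index vectors of its neighbors within a window of size omega."""
    rng = random.Random(seed)
    index_vec = {}                              # word -> fixed index vector
    feature_vec = defaultdict(lambda: [0] * d)  # word -> accumulated WV

    def index_of(word):
        if word not in index_vec:
            v = [0] * d
            for pos in rng.sample(range(d), zeta):
                v[pos] = 1                      # zeta random elements set to 1
            index_vec[word] = v
        return index_vec[word]

    for i, w in enumerate(tokens):
        index_of(w)                             # ensure w has an index vector
        lo, hi = max(0, i - omega), min(len(tokens), i + omega + 1)
        for j in range(lo, hi):
            if j != i:                          # aggregate neighbors' vectors
                feature_vec[w] = [a + b for a, b in
                                  zip(feature_vec[w], index_of(tokens[j]))]
    return dict(feature_vec)
```

Note that the resulting vectors have the fixed dimension d regardless of the vocabulary size, and the update is incremental: a new document only touches the vectors of the words it contains.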
Neural language models [15] are another group of vector representation models that create low-dimensional and rich feature vector representations. However, they rely on classification based on iterative back-propagation, which does not suit the dynamic ecosystem of GDTM.

Feature vector composition
The next step is to compose the extracted feature vectors into a document representation vector. A valid composition method should satisfy two properties: (i) preserving the complexity of the original feature vectors without losing information and (ii) accounting for sparsity in the documents.
Mitchell et al. [24] proposed a variety of vector composition methods, such as pairwise multiplication, pairwise addition and weighted pairwise multiplication, that satisfy the lossless property. However, these simple composition methods do not account for sparsity. For example, pairwise addition is similar to the BOW [3] approach used in LDA, which does not address sparsity. Therefore, a more complex composition method is required. The choice of the composition depends on the language model used in the analysis. We use a well-known technique called Skip-gram [25] for this purpose. Skip-gram derives the probability of a feature given the history of its surrounding context. (This provides a more complex model than the plain N-gram model, which only considers the history of the previous context.) More specifically, we use an m-skip-bigram model, where m is a parameter specified by the user. This model derives the context structure of a word w in a given context by looking at the bigrams between 0 and m step(s) before and after w.
Let us explain the composition model and the weighting mechanism with an example. Assume we are given a document D containing four consecutive words W_D = {w_1, w_2, w_3, w_4} and we set m = 1. To construct the document vector, we iterate through the document and, for each word, first extract the set of 1-skip-bigrams, which contains all the bigrams with skip value between 0 and 1. For example, w_2 has three bigrams: two 0-skip-bigrams, w_2w_1 and w_2w_3, and one 1-skip-bigram, w_2w_4. Table 1 shows the list of all 1-skip-bigrams extracted for all the words in D. Afterward, for each bigram w_iw_j, we create a bigram vector by weighted pairwise multiplication of its corresponding feature vectors v_i = {e_i1, ..., e_id} and v_j = {e_j1, ..., e_jd}, constructed in the previous step:

bv_ij = {α_i e_i1 · α_j e_j1, ..., α_i e_id · α_j e_jd}

Here, α_i and α_j are the weights related to the words w_i and w_j, respectively, calculated using a sigmoid function of the word frequency. The weight α_l is inversely proportional to the frequency of the corresponding word w_l and is used to reduce the negative effect of highly frequent words in the dataset. We use an adjustment parameter δ to control the significance of the ratio and a threshold parameter γ that indicates which words to remove from the document representation. In particular, if α_l < γ, we set α_l = 0, which eliminates the corresponding bigram vectors from the construction of the document vector.
The final step is to combine all valid bigram vectors to construct the corresponding document vector. We use normalized pairwise addition as the composition method in this step. In the above example, given that none of the weights are zero and all bigrams are valid, we have ten bigram vectors corresponding to the skip-grams presented in Table 1. Now, assuming that each bigram vector contains d elements, bv_i = {l_i1, ..., l_id}, the document vector DV is created by averaging the bigram vectors element-wise:

DV = {(1/B) Σ_{i=1}^{B} l_i1, ..., (1/B) Σ_{i=1}^{B} l_id}

where B is the number of valid bigram vectors (here, B = 10).
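The skip-bigram extraction and the weighted composition can be sketched as follows. The sigmoid weight below is an illustrative stand-in with assumed parameters, not the paper's exact formula; it only preserves the stated behavior that weights fall as word frequency rises, with δ as the adjustment parameter and γ as the elimination threshold. All names are ours.

```python
import math

def skip_bigrams(words, i, m=1):
    """All k-skip-bigrams (0 <= k <= m) anchored at words[i]; e.g. for
    W_D = {w1, w2, w3, w4} and m = 1, w2 yields w2w1, w2w3 and w2w4."""
    pairs = []
    for step in range(1, m + 2):          # offset 1..m+1 means skip 0..m
        for j in (i - step, i + step):
            if 0 <= j < len(words):
                pairs.append((words[i], words[j]))
    return pairs

def compose_document(words, vectors, freqs, m=1, delta=1.0, gamma=0.05):
    """Weighted pairwise multiplication per bigram, then normalized
    pairwise addition of all valid bigram vectors into the document vector."""
    d = len(next(iter(vectors.values())))

    def weight(w):
        # assumed inverse-frequency sigmoid; below gamma the word is dropped
        a = 1.0 / (1.0 + math.exp(delta * (freqs[w] - 1)))
        return a if a >= gamma else 0.0

    doc, count = [0.0] * d, 0
    for i in range(len(words)):
        for wi, wj in skip_bigrams(words, i, m):
            ai, aj = weight(wi), weight(wj)
            if ai == 0.0 or aj == 0.0:
                continue                  # eliminated bigram vector
            bv = [ai * x * aj * y for x, y in zip(vectors[wi], vectors[wj])]
            doc = [s + b for s, b in zip(doc, bv)]
            count += 1
    return [s / count for s in doc] if count else doc
```

For a four-word document with m = 1, `skip_bigrams` produces exactly the ten bigrams of Table 1.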

Graph construction
In the previous steps, the algorithm encoded topics as unique structures in the form of document vector representations. The goal in this step is to project those structures into a graph representation model, the document graph (DG), to be used for extracting topics by the graph partitioning component. Each vertex v_i corresponds to one element l_i in the document vector DV, and each edge e_ij represents the relation between its incident vertices v_i and v_j in the graph representation. The edges are weighted, and the weight w_ij of a given edge e_ij is calculated as the product of the values of the corresponding elements in the DV, w_ij = l_i × l_j. The construction method suggests that the created graph is a mesh. However, this is not the case, since the document vectors are often highly sparse, with most of their elements being zero. Therefore, the created document graph will also be sparse.
After converting each DV into a DG, which is representative of the topical structure of the corresponding documents in the stream, the next step is to combine the DGs and extract the topics using graph partitioning.
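The conversion of a document vector into a document graph can be sketched as follows (a minimal illustration; the names are ours):

```python
def document_graph(dv):
    """Build a document graph (DG) from a document vector: one vertex per
    element, and an edge (i, j) with weight dv[i] * dv[j] for each pair of
    nonzero elements. Zero elements create no edges, so a sparse vector
    yields a sparse graph rather than a full mesh."""
    edges = {}
    nonzero = [i for i, x in enumerate(dv) if x != 0]
    for a in range(len(nonzero)):
        for b in range(a + 1, len(nonzero)):
            i, j = nonzero[a], nonzero[b]
            edges[(i, j)] = dv[i] * dv[j]
    return edges
```

For example, the vector [2, 0, 3, 1] produces only three edges, since the zero element contributes nothing.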

Graph partitioning
Let us first present a set of definitions required for understanding the mechanism of the graph partitioning algorithm, before explaining the details. Knowledge Graph (KG): a graph with the same number of vertices d and the same number of possible edges d × (d − 1)/2 as the DGs. GDTM uses the KG as the universal model in the algorithm to aggregate the DGs, keep track of the topics and assign topics to the documents.
Density: a metric to measure the degree of connectedness of the nodes in a graph. We define the density of a given graph G(V, E) as the average weight over the total possible edges in the graph:

density(G) = (Σ_{e∈E} w_e) / (|V| × (|V| − 1)/2)

Therefore, the higher the weights of the edges, the higher the density. Consequently, a nonexistent edge makes zero contribution to the density.
Average density: a measure of the average total density over a graph, calculated as the average of the densities of all partitions in that graph. Given a graph G(V, E) and a set of n partitions P = {p_1, ..., p_n}, the average density is calculated as:

ad(G) = (1/n) Σ_{i=1}^{n} density(p_i)

Now, let us move forward to explain the graph partitioning algorithm. The main assumption is that each DG is representative of the unique topical structure of its corresponding document. Thus, the goal is to aggregate the DGs into a single graph representation, the KG, and extract the topics by partitioning the KG following an optimization mechanism. GDTM is an online approach that applies partitioning upon receiving every single document. Thus, the algorithm is designed in two steps: (i) topic assignment and (ii) optimization. The first step assigns a topic to each document by comparing its DG with the KG. The second step aggregates the corresponding DG into the KG and applies an optimization mechanism such that the partitioning of the KG is continuously updated as every single edge from the DG is aggregated. The next sections explain the details of these two steps and how they interact with each other to extract the topics.
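The two definitions translate directly into code. This is a sketch under our own assumptions: the vertex count of a partition is passed in explicitly, and partitions are represented as edge-to-weight mappings.

```python
def density(edges, num_vertices):
    """Density of a (sub)graph: the average edge weight over all
    d * (d - 1) / 2 possible edges; a missing edge contributes zero,
    exactly as in the definition above."""
    possible = num_vertices * (num_vertices - 1) / 2
    return sum(edges.values()) / possible if possible else 0.0

def average_density(partitions, num_vertices):
    """Average density: the mean of the per-partition densities."""
    if not partitions:
        return 0.0
    return sum(density(p, num_vertices) for p in partitions) / len(partitions)
```

A graph on 3 vertices with two edges of weights 2 and 4 has density 6 / 3 = 2, since the third possible edge contributes zero.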

Topic assignment
Before aggregating each document into the KG, we need to know the topic of the document in order to apply the correct optimization. The basic idea is to extract the distribution of topics over that document, by matching the edges of its DG against the partitioned edges of the KG, and choose the dominant topic. The only challenge in this step arises when one or more edges in the DG have no overlapping edges in the KG and therefore cannot be assigned a topic. This condition occurs when the document under operation belongs to a new topic other than those currently present in the KG (e.g., note the Orange topic in DG_3 in Fig. 2, which does not exist in the KG before aggregating DG_3). In this situation, GDTM creates a new topic and assigns it to the corresponding edge(s) in the DG. The new topic is then added to the KG upon aggregating the corresponding DG. This is one of the key advantages of GDTM: it enables the model to account for an infinite number of partitions, in contrast to approaches with a fixed partition count. Following the same argument, it is important to note that the first document in the stream is always assigned a new topic, as the KG is initially empty and there are no topics to be assigned. After assigning a topic to the DG, the next step is to aggregate the DG with the KG and update the KG following an optimization mechanism.
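A sketch of the topic-assignment step, under our own assumptions: the "majority of edges" test for creating a new topic follows the New Partition (NP) definition given in the optimization section, and the names are ours.

```python
def assign_topic(dg_edges, kg_partition_of, next_topic_id):
    """Vote for the dominant topic: each DG edge that overlaps a KG edge
    votes for that edge's partition. If the majority of DG edges have no
    overlap in the KG, a fresh topic id is returned instead (New Partition).
    kg_partition_of maps a KG edge to its topic label."""
    votes = {}
    for edge in dg_edges:
        topic = kg_partition_of.get(edge)
        if topic is not None:
            votes[topic] = votes.get(topic, 0) + 1
    matched = sum(votes.values())
    if not votes or matched * 2 < len(dg_edges):
        return next_topic_id, True              # new topic created
    return max(votes, key=votes.get), False     # dominant existing topic
```

On an empty KG the first document always receives a new topic, matching the observation above.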

Optimization
Optimization is an online process to extract high-quality topics encoded as dense weighted partitions in the KG. We consider the quality of the partitioning in terms of average density. More specifically, the higher the average density, the better the partitioning. Thus, the goal in this step is to define an optimization problem that maximizes the average density of the partitioning over the KG and to develop an accurate algorithm to solve it. Next comes a formal definition of the problem, followed by a detailed explanation of the algorithm.
Problem definition Given a partitioned KG and a DG with a dominant partition assigned to it, how can we aggregate the DG into the KG and update the partitioning of the KG such that the average density of the partitioning is maximized?
Solution GDTM develops a local deterministic optimization algorithm to solve this problem. The algorithm establishes and applies a set of policies upon aggregating each DG into the KG. The policies ensure maximization of the local density of the partitions, which in turn guarantees the monotonic optimization of the global average density. Let us present and prove the basic proposition that ensures the monotonically increasing behavior of the algorithm before explaining the conditions and their corresponding policies. Proposition 1 Assume a set of real numbers R = {r_1, ..., r_n} with mean μ. Consider the set R\{r_j} = {r_1, ..., r_{j−1}, r_{j+1}, ..., r_n} with mean denoted by μ_n(j). Then μ_n(j) ≥ μ for all j such that r_j ≤ μ.
Proof We start with a simple rearrangement:

μ = (1/n) Σ_{i=1}^{n} r_i = (1/n) r_j + (1/n) Σ_{i≠j} r_i

Since Σ_{i≠j} r_i = (n − 1) μ_n(j), this yields

μ = (1/n) r_j + ((n − 1)/n) μ_n(j)

that is, μ is a convex combination of r_j and μ_n(j) and therefore lies between them. Hence, if r_j ≤ μ, then μ_n(j) ≥ μ. The two main intuitions behind the above proposition, relative to our definitions of partition and density, are as follows. Given a partition p with density μ, (i) removing an edge e with weight w_e ≤ μ will not decrease the density, and (ii) adding an edge e with weight w_e ≥ μ never decreases the density (and strictly increases it when w_e > μ). Now let us present the details of the algorithm and the way it applies these intuitions in the aggregation process to achieve the optimization.
Given a partitioned KG and a DG with a dominant partition assigned to it, the algorithm iterates through all the edges in the DG and, for each edge e′ with weight w′ and dominant partition p′, applies an optimization upon aggregating e′ with the matching edge e with weight w and partition p in the KG. Different conditions can occur depending on the type of p′ and the weights of the edges w and w′. GDTM applies an appropriate policy upon aggregation in each condition in order to ensure the optimization requirements. Two types of partitions can be assigned to a DG, as explained in Sect. 3.4.1: first, a New Partition (NP), when the majority of edges in the DG do not match any edge in the KG, and second, an Old Partition (OP), which currently exists in the KG and whose edges have the highest overlap with the edges of the DG. Also, there are three different conditions depending on the current status of the edges and partitions in the KG and the DG: (i) e = ∅, meaning that the edge does not exist in the KG; (ii) e ≠ ∅ and p ≠ p′, meaning that the edge e exists in the KG but has a different partition than e′; and (iii) e ≠ ∅ and p = p′, indicating that e exists and belongs to the same partition as e′. Table 2 shows a summary of all conditions, labeled {c_1, ..., c_5}. Note that the condition e ≠ ∅ and p = p′ is not valid when p′ is new (NP), which is clear by definition.
Next, we present the different conditions and explain the appropriate policy applied in each. Algorithm 1 shows the overall process of the optimization mechanism and the corresponding policy applied depending on the condition. Each condition is numbered according to the numbers in Table 2. We use e, w and p to refer to the elements of the KG, and e′, w′ and p′ for the elements of the DG. In addition, we use a function density(p) that retrieves the density of a given partition p. For performance reasons, GDTM creates and maintains a key-value storage to retrieve the partition densities in the KG.

C1:
In this condition, we can simply add the new edge e′ to the KG and assign p′ as its partition. This increases the average density for two reasons. First, it does not affect the density of any other partition in the KG. Second, it always increases the density of the new partition p′, as p′ did not previously exist in the KG.

C2:
In this condition, we can only aggregate if the weight of the current edge w is at most the density of its corresponding partition, w ≤ density(p). The reason is that, according to Proposition 1, removing e then does not reduce the density of p. We call this an expansion condition, where a partition tries to expand its territory around its borders and take over another partition. Proposition 1 ensures that no partition p′ can completely take over another partition p unless the weight of the largest edge in p is smaller than the density of p.
C3: This is the case where a nonexisting edge is to be added to an existing partition p′. There are two possible scenarios, depending on whether the newly created edge e′ will be an internal edge of the partition p′ or not. An edge is called internal with respect to a specific partition if both of its incident vertices are connected to other edges of the same partition. Based on this, if e′ will become an internal edge, then the algorithm aggregates e′ without further consideration, because the aggregation always increases the density of p′ and does not affect the density of any other partition in the KG. On the other hand, if e′ is not an internal edge, then it can only be aggregated if w′ ≥ density(p′), according to Proposition 1.

C4:
In this condition, the aggregation changes the partition p of an existing edge e to another existing partition p′, moving its weight w to p′. Since we are dealing with two existing partitions p and p′, we need to check the optimization conditions on both. In particular, we have to make sure that removing an edge with weight w from p and adding an edge with weight w + w′ to p′ do not reduce their corresponding densities. Following Proposition 1, the removal is allowed if w ≤ density(p). However, the aggregation into p′ depends on whether the new edge is internal or not. If it is internal, then we can apply the aggregation following the same reasoning as in C3. For a noninternal edge, the aggregation is only allowed if the weight of the new edge is at least the density, w + w′ ≥ density(p′). This is another example of the expansion condition, similar to C3.

C5:
The last condition is aggregating an edge e′ with partition p′ from the DG into an existing edge e with the same partition p = p′ in the KG. We call this a reinforcement condition, where only the weight of an edge in a specific partition increases. This operation always increases the density of the corresponding partition p and does not affect any other partition in the KG. Thus, the algorithm aggregates w′ with w on e in the KG.
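The five policies can be condensed into one aggregation routine. This is a simplified sketch under our own assumptions: internal-edge detection is omitted (so C3 and C4 always apply the stricter non-internal check), a new partition is taken to have density 0, the density bookkeeping is delegated to a `part_density` callback, and all names are ours.

```python
def aggregate_edge(kg, partition_of, part_density, edge, w_new, p_new, p_is_new):
    """Aggregate one DG edge (weight w_new, dominant partition p_new) into
    the KG, applying the C1-C5 policies. kg maps edge -> weight,
    partition_of maps edge -> partition, part_density(p) returns the
    current density of partition p (0 for a new partition)."""
    if edge not in kg:
        if p_is_new or w_new >= part_density(p_new):
            kg[edge] = w_new            # C1 (new partition) or C3 (old
            partition_of[edge] = p_new  # partition, non-internal check)
        return
    w_old, p_old = kg[edge], partition_of[edge]
    if p_old == p_new:
        kg[edge] = w_old + w_new        # C5: reinforcement
    elif w_old <= part_density(p_old):  # removal keeps the density of p_old
        if w_old + w_new >= part_density(p_new):
            kg[edge] = w_old + w_new    # C2/C4: expansion takes the edge over
            partition_of[edge] = p_new
```

Since `part_density` returns 0 for a new partition, the expansion branch reduces to the single removal check of C2 when p′ is an NP.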

Experiments and result
In this section, we demonstrate the accuracy and scalability of GDTM by running the algorithm over two sets of experiments. To measure accuracy, we run a set of supervised experiments on a tagged Twitter dataset and report the B-Cubed [16] score; for scalability, we use a large-scale Twitter dataset and report the execution time and the coherence score [17] of the extracted topics. B-Cubed is a standard evaluation metric that measures the accuracy of a classification task. Each experiment is repeated 100 times, and the average is reported. We compare the results with four state-of-the-art approaches and show that GDTM significantly outperforms the others on both accuracy and scalability. All experiments are executed on a machine with 48 cores of 2 GHz CPUs and 20 GB of RAM.

Datasets
In our experiments, we use a Twitter dataset collected during 2014 over the geographic area of London. The dataset contains 9.8 million tweets. We extracted the data for the 3 months of March, April and May from the original dataset to use in our scalability experiments; this subset contained 1.8M tweets. We cleaned the dataset by removing URLs, punctuation marks, hashtags and mentions, and kept only the tweets containing more than three words. The resulting dataset was reduced to 1.2M tweets. Next, we created a tagged dataset from the cleaned dataset for the experiments on accuracy. To create the tagged dataset, we first extracted a list of trending topics during the corresponding timespan (Mar-May 2014) from Twitter's official blog and the English Wikipedia page reporting the events of 2014 in the United Kingdom. Then, we hand-tagged the tweets in the clean dataset using the extracted topics and removed the topics with fewer than 100 occurrences. The remaining dataset contained 26K tweets from 22 different topics. Figure 3 shows the titles and the overall distribution of the topics. As we can see, the topics cover a wide range of events, from domestic (e.g., London Marathon) to international (e.g., EuroVision, WorldCup and the Oscar's Award), and contain subjects from overlapping categories (e.g., WorldCup, FACup and London-Marathon from the sports category) (Table 3).

B-Cubed
B-Cubed [16] is a statistical metric to measure the accuracy of a classification compared to the ground truth. It is calculated as the average F-score over all documents. Given a dataset D with n documents, tagged with k hand labels L = {l_1, ..., l_k}, and a classification of the documents into k class labels C = {c_1, ..., c_k}, the B-Cubed of a document d with hand label l_d and class label c_d is calculated as:

F(d) = 2 · P(d) · R(d) / (P(d) + R(d))

where P and R stand for precision and recall, respectively, and are calculated as:

P(d) = |c_d ∩ l_d| / |c_d|,   R(d) = |c_d ∩ l_d| / |l_d|

Precision shows the likelihood of documents correctly classified in a specific class c with respect to the total number of documents in that class, whereas recall represents the likelihood with respect to the total number of documents in a specific label l. The total B-Cubed score is calculated as the average over all documents in the dataset:

B-Cubed = (1/n) Σ_{d∈D} F(d)

Note that precision and recall measure the quality of the classification with respect to the tagged labels for the individual categories of the problem (e.g., individual topics) and therefore provide a more accurate evaluation than more general methods like Coherency, which provide an average over all instances. That is the main reason we decided to use precision and recall in our supervised classification task.
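The B-Cubed computation can be sketched directly from the definitions (a straightforward O(n^2) illustration; the names are ours):

```python
def b_cubed(labels, classes):
    """B-Cubed: the average per-document F-score, with precision and recall
    taken from the overlap between a document's class and its hand label."""
    n = len(labels)
    total = 0.0
    for d in range(n):
        same_class = {i for i in range(n) if classes[i] == classes[d]}
        same_label = {i for i in range(n) if labels[i] == labels[d]}
        overlap = len(same_class & same_label)
        p = overlap / len(same_class)   # precision of document d
        r = overlap / len(same_label)   # recall of document d
        total += 2 * p * r / (p + r)    # per-document F-score
    return total / n
```

A classification that matches the hand labels exactly (up to renaming) scores 1.0; merging two labels into one class lowers the per-document precision and hence the score.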
Coherency [17] is an evaluation metric for measuring the quality of the extracted topics in a topic classification problem. It assumes that the most frequent words in each class tend to co-occur more among the documents within that class than among documents across multiple classes. Thus, given a set of documents classified into k topics, T = {t_1, ..., t_k}, first, the coherency of each topic z with its top m probable words W_z = {w_1, ..., w_m} is calculated as:

C(z) = Σ_{i=2}^{m} Σ_{j=1}^{i−1} log((D(w_i^z, w_j^z) + 1) / D(w_j^z))

where D(w_i^z, w_j^z) is the co-occurrence frequency of the words w_i and w_j among the documents in z, and D(w_j^z) is the total frequency of w_j in z. Then, the total coherency of the partitioning is calculated as:

C(T) = (1/k) Σ_{z=1}^{k} C(z)
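A sketch of the coherence computation following the UMass-style formula above. The co-occurrence and frequency counts are assumed to be precomputed and passed in, and the +1 smoothing term follows the common variant of the metric; the names are ours.

```python
import math

def topic_coherence(top_words, doc_freq, co_freq):
    """Coherence of one topic: the sum over ordered pairs of its top words
    of log((D(wi, wj) + 1) / D(wj)), where co_freq[(wi, wj)] is the
    co-occurrence count and doc_freq[wj] the frequency of wj in the topic."""
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            score += math.log((co_freq.get((wi, wj), 0) + 1) / doc_freq[wj])
    return score

def total_coherence(topics, doc_freq, co_freq):
    """Average coherence over all extracted topics."""
    return sum(topic_coherence(t, doc_freq, co_freq) for t in topics) / len(topics)
```

The score is at most 0 when co-occurrence never exceeds the marginal frequency; higher (less negative) values indicate more coherent topics.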

Baseline and experimental settings
We compare GDTM with four state-of-the-art approaches, namely LDA [3], BTM [4], CDTM [6] and PYPM [10]. The source code of each approach was downloaded from its corresponding URL. Table 2 shows a summary of all approaches with respect to their support for sparsity, scalability and dynamicity, in comparison with GDTM. As we can see, GDTM is the only approach that satisfies all three properties, which we analyze and show in the coming sections.
Accuracy. In this experiment, we use the tagged Twitter dataset to compare the accuracy of the different algorithms in dealing with sparsity in short texts. We run each approach 100 times and report the average B-Cubed score.
Scalability. This experiment is designed and run in an online setting to show the power of GDTM in accounting for dynamicity and scalability in comparison with the state-of-the-art approaches. Since none of the four baseline approaches are real streaming solutions, to comply with our streaming model we developed a mini-batch streaming mechanism as follows. First, we sorted the documents by date and considered the cardinal number of the documents as the lowest possible discretization value for the streaming. Then, we used a snapshot period to extract the results for each algorithm. The snapshot was set to 10K, and we ran each algorithm over the entire dataset. More specifically, for every 10K documents, we calculate and report the coherence score of the extracted partitioning for each algorithm. Our initial experiments showed that CDTM and PYPM are not tractable using the available resources. In particular, they required more than 20 GB of memory after processing around 350K and 80K documents, respectively. Therefore, we had to exclude these two approaches from the scalability experiments.

Discussion
We now turn to the details of the results of the two sets of experiments.
Accuracy. Figure 4 shows the results of the accuracy of the topic assignment task performed by the different algorithms over the tagged Twitter dataset. The results show that GDTM significantly outperforms the other approaches, with a p value less than 0.01 at a 95% confidence interval, in all cases. GDTM shows the largest value on precision, which illustrates the strength of its language modeling approach. In particular, the application of the skip-gram method enables GDTM to cope with sparsity by extracting even the modest amount of information available in the sparse contexts of tweets and enriching the feature vector representations.
A more interesting outcome is the remarkable improvement in precision of GDTM compared to all other approaches. This is due to the strong partitioning mechanism in GDTM, which allows the algorithm to automatically choose the best number of topics and prevents incorrect mixing of the documents. Note that the higher recall value of PYPM compared to the other three approaches, namely LDA, BTM and CDTM, confirms this statement, as PYPM supports an infinite number of topics, similar to GDTM. In summary, GDTM is the only approach with high values on both precision and recall, which yields the largest overall F-score of 77.5%. This result is around 11% higher than that of the second-best approach, PYPM, with a B-Cubed score of 65.8%.

Scalability. Figure 5 shows a comparison of the execution times of the different algorithms over the large-scale Twitter dataset. As we can see, GDTM shows linear time complexity, as opposed to the baseline models. In particular, it performs around three times faster than LDA and an order of magnitude faster than BTM over 1.2M documents. At the same time, GDTM does not sacrifice quality to gain this significant speedup. In fact, the quality of the extracted partitions is significantly higher than that of both LDA and BTM, as shown in Fig. 6, which reports the average coherence score over 100 runs on different snapshots (the documents are sorted by date, and for every 10K documents a snapshot is created and the coherence score is calculated; the figure is truncated at 100K for the sake of presentation). The coherence scores of the partitions created by GDTM are between four and five times larger than those extracted by BTM and LDA, respectively. Moreover, the score is measured over three different numbers of top sample words per partition, namely 5, 10 and 20, and GDTM shows a higher coherence score in all three cases, confirming the significance of the outcomes. The main justification behind this remarkable result is the rich feature representation model in GDTM, which enables the algorithm to create and extract high-quality partitions without the iterative optimization algorithms required by the other approaches. The other justification lies in the graph partitioning mechanism in GDTM, which lets partitions emerge and disappear following the natural dynamics of their representative topics and thus enables the algorithm to adapt to changes of the topics in the stream.

Conclusion
We developed GDTM, a solution for dynamic topic modeling on short texts in online social networks. Natural language is the best model for its own representation; however, the sparsity, velocity and dynamicity of short texts make it difficult to develop appropriate models for extracting topics from them. GDTM overcomes this problem with an online topic modeling approach. It first combines an incremental dimensionality reduction method called Random Indexing with a language representation technique called skip-gram to construct a strong feature representation model. Then, it uses a novel graph representation technique and a graph partitioning algorithm to extract the topics in an online fashion. We examined the accuracy and scalability of GDTM and compared the results with four state-of-the-art approaches. The results show that GDTM significantly outperforms all other solutions on both accuracy and scalability.
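To make the first stage of the pipeline concrete, a minimal Random Indexing sketch is given below; the dimensionality, sparsity and window values are illustrative assumptions, not the settings used in GDTM:

```python
import numpy as np

def index_vector(word, dim=512, nnz=8, seed=0):
    """Sparse ternary index vector, derived deterministically from the word,
    so no vocabulary-sized matrix ever needs to be stored."""
    rng = np.random.default_rng(abs(hash((word, seed))) % (2**32))
    v = np.zeros(dim)
    positions = rng.choice(dim, size=nnz, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], size=nnz)
    return v

def update_context_vectors(tokens, vectors, dim=512, window=2):
    """Incrementally add each neighbour's index vector to a word's context
    vector, one document at a time (a single pass over the stream)."""
    for i, word in enumerate(tokens):
        ctx = vectors.setdefault(word, np.zeros(dim))
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx += index_vector(tokens[j], dim=dim)
    return vectors
```

Because each update is a fixed-cost vector addition, the representation can be maintained in a single pass, which is what makes this stage compatible with a streaming setting.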
Even though we only applied GDTM to short texts in this paper, its application is not limited to linguistic data. In fact, GDTM provides a generic algorithm for automatic feature extraction over any stream of data that admits some form of discrete representation. This opens a new track of research that we plan to pursue in future work.