1 Introduction

The advent of Web 2.0 has led to the popularity of systems that are based on user-generated content (Srba and Bielikova 2016). One such web-based service that relies on users generating content themselves is known as Community Question Answering (CQA). CQA websites such as Yahoo! Answers, Quora, Stack Overflow and Wikianswers are becoming increasingly important for sharing and spreading knowledge (Srba and Bielikova 2016; Yuan et al. 2020). These platforms leverage “the wisdom of crowds” (Surowiecki 2005) and provide a venue where multiple users can exchange information in the form of questions and answers (Yang et al. 2015). While CQA services offer significant help to knowledge seekers, their rapid growth poses unique challenges. First, these sites witness thousands of questions posted every day in addition to millions of questions that already exist. This makes it difficult for an answerer to find the appropriate question that matches their expertise (Chang and Pal 2013). Second, as expertise and education levels vary widely among answerers, the quality of answers received is difficult to control. Third, the increasing time to receive a high-quality answer and a growing number of low-quality answers cause a high churn rate (the rate at which users leave the community or become inactive), hampering the sustainability of these CQA systems (Srba and Bielikova 2016).

Over the past years, various studies have addressed these issues; they broadly fall into the fields of question routing (Li et al. 2011), expert recommendation, and question retrieval (Chen et al. 2018; Dai and Callan 2019). While question routing or expert recommendation aims at recommending a new question to users who would be able (and possibly willing) to provide answers, question retrieval tries to identify the information required to answer a new question from the already existing question-answer pairs. In this paper, we focus on the domain of question routing.

Past studies have shown that CQA communities have only a handful of domain experts who provide most of the high-quality answers (Sung et al. 2013). Acknowledging this long-tail distribution of users’ answering quality, one could resolve the above challenges by automatically identifying the “experts” who tend to contribute high-quality answers and directing unanswered questions to them. An algorithm that successfully links users to questions they are likely to provide good answers to is known as question routing (QR) (Li et al. 2011). The success of QR can potentially increase the participation rates of users and foster stronger communities in CQA. Approaches to question routing make use of various data science tools such as information retrieval (IR), machine learning, natural language processing (NLP), and social computing (Wang et al. 2018; Al-Taie et al. 2018).

However, despite the active research in CQA, QR remains a challenging task which can be attributed to three key limitations. First, the sparsity of users’ historical question and answer records makes it difficult to infer their domain expertise (Le and Shah 2016). This results in a lack of personalized recommendations. Second, most of the CQA systems are dynamic: new users join constantly, and some accounts become abandoned. As a result, evaluating available experts for a newly posted question would require periodic updates. Thus, running complex models that use all possible data is becoming increasingly costly. Third, most of the past studies have assumed CQA sites as static environments, while overlooking the temporal aspects such as users’ willingness to contribute, changing interests, and knowledge evolution (Pal et al. 2012). These characteristics indicate the need for self-evolving approaches for expert recommendation that can be updated efficiently when new information is available (Fig. 1).

Fig. 1

An example of question routing based on different types of metadata. a Considering previous answer count, answerers U2 and U3 have equal expertise for a new question Q4 (both having the shortest path). b Considering the context of the previous questions based on tags, Q4 is closest to Q2 (tag ‘git-pull’ in this case); thus, answerers U2, U3 and U4 can be considered to have equal expertise (shortest path for all three users). c Considering the topic community, the tags in all the questions Q1, Q2, Q3 and Q4 are correlated with the topic ‘git’; thus, answerers U2 and U3 would be ranked higher. However, considering the temporal aspect, U2 is more likely to answer question Q4 (only shortest path)

To overcome these limitations, we propose a novel topic community based temporal expertise for question routing (TCTE-QR). Our contribution in this work is threefold and can be summarized as follows:

1) We introduce a novel personalized recommendation method that exploits a key feature of CQA platforms: the semantic similarity networks of questions are modular. The proposed recommender framework considers the similarity between the content of a user’s recent activity (the “domain expertise”) and the content of a given new question. To find the domain expertise of a user, we use the topics of the archived questions answered by that user.

2) In contrast to past studies that relied on all available textual data to infer topic experts, our approach uses only the tags assigned to questions. We define a “tag graph” using association rule mining (Agrawal et al. 1993) and subsequently apply community detection to infer topics from the tag graph and topic experts from users’ past activities on these topics. In this way, the proposed method can learn domain experts efficiently with reduced computational cost, facilitating routine implementation.

3) Another key feature that TCTE-QR leverages is that users’ interests and expertise change over time. Thus, the knowledge of a user is determined using the questions the user has answered recently. To incorporate the user’s evolving interest, we apply a decay function to the archived answers, giving more weight to recent answering activities.

The rest of this paper is organized as follows: In Sect. 2, we briefly review the standard approaches to question routing. Next, in Sect. 3, we look at relevant properties of CQA platforms and discuss the task of question routing. In Sect. 4, we present our proposed approach—“Topic Community-based Temporal Expertise Question Routing (TCTE-QR)”. In Sect. 5, we describe the models and the data used to test our algorithm. In Sect. 6, we compare the proposed model against several baseline methods, and we summarize the article in Sect. 7.

2 Related work

In the past few years, question routing has attracted a lot of attention in the IR community (Neshati et al. 2017). We briefly review prior research on question routing and classify the approaches into the following four broad categories: classification-based, network-based, text-based, and collaborative filtering methods. Further, we highlight the major challenges in each of these approaches.

2.1 Classification-based approaches

Question routing can be translated into the problem of identifying experts as a class of users among all users and recommending questions to them. Thus, it can be solved as a classification problem that aims to distinguish the expert class of users from the others. The advantage of classification methods is that they can easily incorporate multiple features of users, questions, answers, or interaction networks. For example, Pal and Konstan (2010) used question features (e.g., question length), user features (e.g., the previous number of best answers), and user feedback on answers, training a binary classifier to distinguish experts from others. Similarly, Zhou et al. (2012) built a classifier using local and global features of questions, user history, and question-user relationships. Commonly used classifiers in the above approaches are Support Vector Machines (Ji and Wang 2013), Random Forests (Choetkiertikul et al. 2015), and Naive Bayes (van Dijk et al. 2015). However, these approaches are limited in that they require hand-crafted feature extraction, which is not only time-consuming but also reliant on the selected features, introducing a potential selection bias.

2.2 Network-based approaches

Network-based approaches analyze a user-user network formed by asking-answering relationships. A link analysis technique (Borodin et al. 2005) is then applied to the network to evaluate the authority of each user. The simplest way to measure a user’s authority in the CQA community is the degree centrality measure InDegree (Zhang et al. 2007), which considers users who have answered more questions in the user-user network to be better answerers (Jeon et al. 2006). Another notable approach is the community expertise network (CEN), which uses a “z-score” to measure the authority of an answerer based on in-degrees and out-degrees (Zhang et al. 2007). Other centrality measures are also used, e.g., ExpertiseRank, a slight variant of PageRank (Page et al. 1999), or HITS (Kleinberg 1999). A major limitation of methods based on centrality measures is that they cannot leverage textual information, topics, or categories of questions. Thus, most of these approaches recommend questions to users based on their general expertise rather than their expertise on particular topics.

2.3 Text-based approaches

Another set of approaches builds recommendations around topics inferred from the text of questions and answers using language and topic models (Li et al. 2011). Language models use a generative approach to compute the word-based relevance of a user’s past activities to a new question and the probability of the user answering that question (Zheng et al. 2012). Finally, a ranked list of users based on their likelihood of answering the given question is generated. However, language models are based on exact word matching; therefore, they cannot capture deeper semantics (the “lexical gap”) (Zhou et al. 2012). In addition, data sparseness and the co-occurrence of irrelevant words in user profiles or questions can lead to word mismatch between the routed question and user profiles.
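As an illustration, the word-based relevance computed by such language models can be sketched as a smoothed query likelihood: a user’s profile (the words of their previously answered questions) is treated as a unigram language model, and users are ranked by the probability their model assigns to the words of the new question. The profiles, smoothing constant, and vocabulary size below are hypothetical:

```python
from collections import Counter

def language_model_score(profile_words, question_words, mu=1.0, vocab_size=1000):
    """Query likelihood of a question under a user's profile language model,
    with additive smoothing to avoid zero probabilities for unseen words."""
    counts = Counter(profile_words)
    total = len(profile_words)
    score = 1.0
    for w in question_words:
        score *= (counts[w] + mu) / (total + mu * vocab_size)
    return score

# Toy profiles built from the words of previously answered questions.
profiles = {
    "u1": "git merge rebase branch git commit".split(),
    "u2": "python pandas dataframe numpy".split(),
}
new_q = "git branch merge".split()

# Ranked list of users by their likelihood of answering the new question.
ranked = sorted(profiles, key=lambda u: language_model_score(profiles[u], new_q),
                reverse=True)
```

Note that exact word matching is what produces the lexical gap discussed above: a profile about “version control” would score no better than an unrelated one for the query “git”.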

To bypass the lexical gap, topic models were introduced; they measure relationships in the topic space rather than the word space and thus do not require the exact word to appear in the user profile (Riahi et al. 2012). One of the most widely used topic models is Latent Dirichlet Allocation (LDA), in which the topic mixture is drawn from a conjugate Dirichlet prior that remains the same for all users (Blei et al. 2003). In such QR approaches, LDA is first used to extract topics from users’ past activity, connecting expert users to new questions. In the second step, these topics are used to compute the probability of each user providing an answer, and users are ordered by this probability (Momtazi and Naumann 2013; Zhou et al. 2012). However, a limitation of LDA-based methods is that standard LDA groups all of a user’s questions under a single topic distribution. Riahi et al. (2012) proposed a segmented topic model to overcome this limitation, allowing questions to have different topical distributions.

In addition, there are other studies based on a hybrid approach that leverages both topic relevance and user authority for expert finding in CQA. For example, Kao et al. (2010) incorporated user reputation and question category into link analysis for expert finding, while related studies combined a topic-sensitive probabilistic model with PageRank to improve recommendation methods based on the latter (see Li et al. 2015; Zhao et al. 2014). Yang et al. (2013) used topic modeling in the Topic Expertise Model (TEM) and combined it with link analysis among users to recommend experts.

A major challenge of these approaches is that they require all of the question and answer data to extract topics. This makes them computationally expensive in dynamic environments like CQA services, where periodic updates are useful. Further, since probabilistic models such as LDA distribute the total probability of 1 among all topics for each user, a higher probability on one topic discounts the user’s probability on other topics (Wang et al. 2018). However, a user can have great expertise in multiple topics simultaneously.

2.4 Collaborative filtering approaches

Another stream of research applies collaborative filtering (CF) methods such as matrix factorization (MF) techniques, which are known to be advantageous in terms of flexibility and scalability in the recommendation domain (Koren et al. 2009). For example, Zhao et al. (2014) applied MF by representing questions with their content words and defining a user-word matrix to discover each user’s expertise on particular words. However, this results in a high-dimensional, sparse matrix, which affects the performance of the MF approach (Wang et al. 2016; Idrissi and Zellou 2020). Further, as items in the matrix factorization approach are treated as independent, the semantic similarity between words is ignored.

Advancing in this direction, Yang and Manandhar (2014) proposed to use tags (keywords), which summarize the question’s focus, instead of the textual content of questions and answers to perform MF for QR. The study highlighted that tags are more informative and summarize the focus of a question well. The approach also uses the number of votes for an answer to evaluate the answerer’s expertise on a given question and its related tags. This method outperformed text-based topic modeling approaches such as TEM (Yang et al. 2013) and was shown to be several orders of magnitude faster. In particular, MF learns the latent feature space of both users and tags (words) to build a user-tag (user-word) matrix, which is used to recommend experts given a new question.

Fig. 2

Question post from StackExchange CQA network

However, while using tags instead of words (i.e., the textual content of questions and answers) mitigates the dimensionality issue to some extent, it does not completely resolve the problem of data sparsity. In addition, the approach still suffers from a high number of correlated items. Both of these issues affect MF performance (Najafabadi and Mahrin 2016; Huang et al. 2004; Idrissi and Zellou 2020). For example, while tags such as “java”, “java-9”, and “jdk” are arguably similar, they are treated as independent when applying matrix factorization based on the user-tag matrix. Figure 2 shows an example of two topically similar questions and their seemingly similar tags. A recent study by Fukui et al. (2019) tried to improve the approach of Yang and Manandhar (2014) by expanding tagged keywords based on word embeddings to mitigate issues related to spelling variants of tagged keywords such as “java” and “java-9”.

In addition to the issues highlighted above, most of the literature considers the expertise in a static environment (i.e., at the new question time), ignoring the evolution of personal expertise and interest over time (Neshati et al. 2017).

In summary, although many innovative approaches to question routing have been proposed over the years, there are several outstanding issues and possible areas of improvement which we aim to address in this work.

3 Problem formalization

3.1 Basic properties of community question-answer platforms

In this section, we define the CQA platform as a network. Topologically, the platform consists of a global set of questions, users, and tags. First, we have a set of questions \({\mathcal {Q}} = \{q_i\}\), \(\vert {\mathcal {Q}} \vert =Q\), indexed over integers \(i,j\in [1,Q]\), and a set of M answerers (users that may respond to questions) \({\mathcal {U}}= \{u_{\alpha },u_{\beta },...\}\), indexed over \(\alpha ,\beta\), with \(\vert {\mathcal {U}} \vert =M\). Lastly, we have tags \(\tau\) that form a set \({\mathcal {V}}\); tags are indexed with integers \(\kappa\) that range from 1 to K. Each question is associated with a tag if the tag features in the question; we then write \(\tau \sim q\). Each user is also associated with questions and their related tags: we write \(u \sim q\) if a question q is addressed by u. By transitivity, if \(q \sim \tau\), then \(u \sim \tau\).

Each question \(q_i\in {\mathcal {Q}}\) is mapped to a tuple \(q_i=({\mathcal {C}}^{(i)}, t_i)\), where \({\mathcal {C}}^{(i)}\) is some set describing the question’s contents and \(t_i\) is the timestamp when the question was posed. For example, \({\mathcal {C}}^{(i)}\) can be obtained using natural language processing, topic modeling, tags, etc. In a tag-based QR approach, the tags of each question are used to describe \({\mathcal {C}}^{(i)}\).

Each question-answer session comprises a question \(q_i\) and a subset of answerers \({\mathcal {U}}_i\) that responded to the question \(q_i\):

$$\begin{aligned} {\mathcal {U}}_i = \{ u \in {\mathcal {U}} \quad \mid \quad u\sim q_i\}, \end{aligned}$$
(1)

We further denote the user who gave the accepted answer by \(u^{*},u^{*}\in {\mathcal {U}}_i\), i.e., the user whose answer is considered as fully addressing the question \(q_i\).

3.2 Question routing

Question routing is the task of predicting which user in a CQA is most likely to share knowledge and answer a newly posted question (Zhou et al. 2012). In general, the question routing task ranks the user set \({\mathcal {U}}\) to obtain an ordered set \({\mathcal {U}}^{\text {rank}}\) such that the order reflects how likely a user is to provide an answer to a new question \(\hat{q}=(\hat{{\mathcal {C}}},\hat{t})\notin {\mathcal {Q}}\): the higher u is in the ranking, the more likely the user is to answer \(\hat{q}\). We essentially endow the set \({\mathcal {U}}\) with an order relation “\(\prec\)”, e.g., \(u_{\beta }\prec u_{\alpha } \in {\mathcal {U}}^{\text {rank}}\) if we find that user \(u_{\alpha }\) is more likely to respond to \(\hat{q}\) than user \(u_{\beta }\). In principle, the set \({\mathcal {U}}^{\text {rank}}\) need not be totally ordered (i.e., we may find more than one user equally likely to respond to \(\hat{q}\)), yielding a partially ordered set; in practice, however, ties are very unlikely, as we map users to real-valued scores to produce this order. Finally, the order relation \(\prec\) reflects the deduced rank relation (2):

$$\begin{aligned} {\mathcal {U}}^{\text {rank}}= {\left\{ \begin{array}{ll} u_{\beta } \succ u_{\alpha } \in {\mathcal {U}} \quad \mid \quad \text{rank}\,\,(u_{\beta },\hat{q}) >\text {rank}\,\,(u_{\alpha },\hat{q})\\ u_{\beta } \prec u_{\alpha } \in {\mathcal {U}} \quad \mid \quad \text {rank}\,\,(u_{\beta },\hat{q}) <\text {rank}\,\,(u_{\alpha },\hat{q}) \end{array}\right. }. \end{aligned}$$
(2)

We also define \(u^*\) as the supremum of \({\mathcal {U}}^{\text {rank}}\):

$$\begin{aligned} u^* = \sup ({\mathcal {U}}^{\text {rank}}). \end{aligned}$$
(3)

4 Proposed model—TCTE-QR

Our approach consists of four steps:

  1. Building topic communities given a projection of a bipartite network of questions and tags;

  2. Creating a user-topic activity matrix \(\mathbf {S}(\hat{t})\) that accounts for users’ activities within topics and for temporal changes in users’ response patterns;

  3. Factorizing the user-topic activity matrix to obtain the user-topic expertise matrix \({\textbf {U}}{} {\textbf {T}}^\top\);

  4. Obtaining a ranked user set \({\mathcal {U}}^{\text {rank}}\) from the user-topic expertise matrix for a given newly added question \(\hat{q}\).

Figure 3 illustrates our proposed approach.

Fig. 3

Framework of the proposed approach

4.1 Step 1: building topic communities given a projection of a bipartite network of questions and tags

As shown in Yang and Manandhar (2014), the expertise of the user answering a question can be viewed as his/her expertise on the tags of this question. In our approach, we propose using a collection of similar tags \({\mathcal {C}}_{\omega } =\{\tau \}\), as opposed to each individual tag, to quantify the user’s activity/expertise within the topic \({\mathcal {C}}_{\omega }\) that the tags represent.

To systematically obtain collections of similar tags \(\{\tau _{\kappa }\}\), we consider a weighted undirected graph \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}},{\mathcal {W}})\), where the node set \({\mathcal {V}}\) contains the tags \(\tau _{\kappa }\), the edge set \({\mathcal {E}}\) contains undirected edges \((\tau _{\kappa }, \tau _{\lambda })\), and the weight set \({\mathcal {W}}\) contains the edge weights. We define the weight \(w_{\lambda \kappa }\) of an edge \((\tau _{\kappa }, \tau _{\lambda })\) as the number of questions associated with the given pair of tags. The edge set is defined as:

$$\begin{aligned} {\mathcal {E}} = \{ (\tau _{\kappa }, \tau _{\lambda }) ,\tau _{\kappa }, \tau _{\lambda }\in {\mathcal {V}} \text { if } \exists q_i \in {\mathcal {Q}} \quad \mid \quad q_i \sim \tau _{\kappa }, q_i\sim \tau _{\lambda }\}, \end{aligned}$$
(4)

and for each edge its weight is defined as:

$$\begin{aligned} w_{\lambda \kappa } = \mid \{q_i \sim \tau _{\kappa } \text { and } q_i\sim \tau _{\lambda } \text { for } q_i \in {\mathcal {Q}}\} \mid \end{aligned}$$
(5)

Further, we require each pair of tags to be jointly associated with a minimum (non-unitary) number of questions, \(N_q\), in order to confirm the edge in \({\mathcal {G}}\). The result is a projection of the bipartite network between questions and tags onto the tag layer.
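A minimal sketch of this construction, assuming questions are given as a mapping from question id to tag list (the example questions and the threshold value are illustrative):

```python
from collections import Counter
from itertools import combinations

def build_tag_graph(questions, n_q=1):
    """Weighted tag co-occurrence graph (Eqs. 4-5): nodes are tags, and an
    edge (tau_k, tau_l) is kept only if at least n_q questions carry both tags."""
    weights = Counter()
    for tags in questions.values():
        for pair in combinations(sorted(set(tags)), 2):
            weights[pair] += 1  # one more question jointly tagged with this pair
    return {pair: w for pair, w in weights.items() if w >= n_q}

# Toy question -> tags mapping.
questions = {
    "q1": ["git", "git-pull"],
    "q2": ["git", "git-pull", "github"],
    "q3": ["git", "github"],
}
# With N_q = 2, the weakly supported edge (git-pull, github) is dropped.
edges = build_tag_graph(questions, n_q=2)
```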

We expect \({\mathcal {G}}\) to be modular: related tags should be strongly connected to each other but loosely connected to unrelated tags. In light of this, we apply community detection, maximizing the partition quality given by (6), to separate tags into communities \({\mathcal {C}}\). We use these communities as a proxy for topics in the subsequent steps.

$$\begin{aligned} Q({\mathcal {G}},{\mathcal {C}}) = \frac{1}{2m} \sum _{\kappa ,\lambda } \left( A_{\kappa \lambda } - \frac{d_\kappa d_\lambda }{2m}\right) \phi (c_{\kappa }, c_{\lambda }) \end{aligned}$$
(6)

where for the tag graph \({\mathcal {G}}\) and given partition \({\mathcal {C}}\), A is the adjacency matrix, m is the total number of edges in \({\mathcal {G}}\), \(d_\kappa\) is the degree of the node \(\tau _\kappa\) and function \(\phi (c_{\kappa }, c_{\lambda })\) is 1 if tags \((\tau _{\kappa }, \tau _{\lambda })\) are in the same community and 0 otherwise.
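The partition quality of Eq. (6) can be evaluated directly. The sketch below computes Q for a toy tag graph of two triangles joined by a single bridge edge; a real pipeline would instead run a modularity-maximizing algorithm (e.g., Louvain) over the full tag graph:

```python
def modularity(edges, communities):
    """Newman modularity Q (Eq. 6) for an unweighted graph given as a set of
    edges and a node -> community mapping."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    nodes = list(degree)
    q = 0.0
    for u in nodes:
        for v in nodes:  # sum over all ordered node pairs in the same community
            if communities[u] != communities[v]:
                continue
            a_uv = 1 if (u, v) in edges or (v, u) in edges else 0
            q += a_uv - degree[u] * degree[v] / (2 * m)
    return q / (2 * m)

# Two triangles of tags joined by one bridge edge (2, 3).
edges = {(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)}
partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, partition)  # 5/14, a clearly modular split
```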

4.2 Step 2: creating temporal user-topic activity matrix \(\mathbf {S}(\hat{t})\) that accounts for user’s activities within topics, and for temporal changes in users’ response patterns

Once the tags are clustered into topic communities as described above, a user-topic activity matrix is created. For each user, all positively scored answers and the respective questions, along with their associated tags, are collected. Next, the tags are mapped to the topic communities discovered in the previous step. Accordingly, the user is assigned an activity score on each topic the question is related to. This score is calculated as the fraction of the question’s tags that come from the topic out of all tags related to the question:

$$\begin{aligned} f_{q_i}^{\alpha \omega } =\frac{\mid \{\tau \in {\mathcal {C}}_{\omega } \mid u_\alpha \sim \tau \sim q_i\} \mid }{\mid \{\tau \in {\mathcal {V}} \mid u_\alpha \sim \tau \sim q_i\} \mid }. \end{aligned}$$
(7)

The net activity score of a user for a given topic is the sum of the activity scores over all questions related to that topic:

$$\begin{aligned} s_{\alpha \omega } = \sum _{q_i \in {\mathcal {Q}}}f_{q_i}^{\alpha \omega } \end{aligned}$$
(8)
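Equations (7)-(8) can be sketched for a single user as follows (the topic communities and the tagged question are illustrative, mirroring the 0.67/0.33 fractions used later in Fig. 4):

```python
def activity_scores(answered, topics):
    """Per-topic activity of one user (Eqs. 7-8): each positively scored answer
    contributes, per topic, the fraction of the question's tags in that topic."""
    scores = {name: 0.0 for name in topics}
    for tags in answered:  # tags of each positively scored question
        for name, community in topics.items():
            scores[name] += len([t for t in tags if t in community]) / len(tags)
    return scores

# Two topic communities and one answered question with two "red" tags
# and one "green" tag.
topics = {"red": {"git", "git-pull"}, "green": {"python"}}
s = activity_scores([["git", "git-pull", "python"]], topics)  # red: 2/3, green: 1/3
```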

4.2.1 Temporal discounting

Temporal discounting (Green et al. 1994) is a term that refers to the tendency of people to give more value to near-future rewards while discounting delayed rewards. Similarly, in a dynamic environment where the users’ activities keep changing, the system should give more value to users with recent activities and discount earlier activities.

Consider a CQA site where the first post occurred at \(t_1=1\) and a new question \(\hat{q}\) is posted at time \(t_q=\hat{t}\). To identify the expert candidates who can reply to question \(\hat{q}\), all the positively scored answers posted by users within the period \([t_1,t_q]\) are considered. Next, the time period \([t_1,t_q]\) is divided into time windows of length \(\delta\), such as a day, week, or month. The answers are grouped according to their corresponding time window, determined by the date of each answer. Next, a user-topic activity matrix \(\mathbf {S}_j\) is defined, where an entry \(s_{\alpha \omega }^{j}\) corresponds to the activity of user \(u_\alpha\) on topic \(\omega\) within the time window \([t_{j-1},t_{j}]\):

$$\begin{aligned} s_{\alpha \omega }^{j} = \sum _{q_i \in {\mathcal {Q}}_{[t_{j-1},t_{j}]}}f_{q_i}^{\alpha \omega }. \end{aligned}$$
(9)

To account for the temporal changes in users’ activities, we consider a temporal kernel (discounting function) of the form

$$\begin{aligned} g(j)=\frac{1}{1+j}, \end{aligned}$$
(10)

where j is the number of time windows \(\delta\) that have passed from the time of interest until the present.

Fig. 4

Illustration of the workflow to obtain the temporal user-topic activity matrix \({\textbf {S}}(\hat{t})\) for a given time window \([t_{j-1},t_j]\) and the current time \(\hat{t}\). The first step is to find communities of tags, which are then taken as topics \({\mathcal {C}}_{\omega }\). We combine the information about all the positively scored answers on questions answered by users within the time window into \({\textbf {R}}^{q}_j\). Further, to avoid double counting, the entries of the user-tag matrix are estimated based on the proportion of tags. Thus, for a question \(q_2\) with three tags, each tag gets an entry of 0.33 in \({\textbf {R}}^{\tau }_{j}\). We then combine this with information about which topic communities the questions belong to, obtaining each entry \(s_{\alpha \omega }\). In particular, for user \(u_\alpha\), the contribution towards the black topic \({\mathcal {C}}_1\) is not counted, because the user did not receive a positive score for the answer to question \(q_1\), the only question associated with this topic community. Therefore, \(u_\alpha\) has \(s_{\alpha \omega }=0.67\) for the red topic community and \(s_{\alpha \omega }=0.33\) for the green topic community; this contribution comes from the positively scored answer to question \(q_2\). The obtained matrix \({\textbf {S}}_j\) is weighted by a function g(j) and the result is added to the temporally discounted user activity matrix \({\textbf {S}}(\hat{t})\)

Finally, given this temporal division of users’ answering activity, we define a temporal user-topic activity matrix \(\mathbf {S}(\hat{t})\), where an entry \(s_{\alpha \omega }^{\hat{t}}\) corresponds to the temporally discounted activity of user \(u_\alpha\) on topic \(\omega\):

$$\begin{aligned} \mathbf {S}({\hat{t}}) = \sum _{j=1}^{J}g(j) \mathbf {S}_{j}, \end{aligned}$$
(11)

where J is the total number of time windows.

We illustrate the process of clustering topics into communities and building \(\mathbf {S}(\hat{t})\) in Fig. 4. This matrix is of size \(M \times N\), where N is the number of topics and M is the number of users.
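A sketch of Eqs. (10)-(11), under the reading that each window \(\mathbf {S}_j\) is weighted by the number of windows that have elapsed since it, so the most recent window receives the largest weight; the matrices below are illustrative:

```python
def discounted_activity(windows):
    """Temporally discounted user-topic activity (Eqs. 10-11). `windows` is a
    list of per-window matrices S_j (users x topics), ordered oldest -> newest;
    each is weighted by g(j) = 1/(1 + j), with j the windows elapsed since it."""
    J = len(windows)
    rows, cols = len(windows[0]), len(windows[0][0])
    s = [[0.0] * cols for _ in range(rows)]
    for i, s_j in enumerate(windows):      # i = 0 is the oldest window
        g = 1.0 / (1 + (J - 1 - i))        # elapsed windows since s_j
        for a in range(rows):
            for w in range(cols):
                s[a][w] += g * s_j[a][w]
    return s

# Two users x two topics, three monthly windows (oldest first).
windows = [
    [[3.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [0.0, 1.0]],
]
s_hat = discounted_activity(windows)
```

Note how user 0’s old burst of activity on topic 0 (three answers, two windows ago) ends up weighing less than their two recent answers on topic 1.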

4.3 Step 3: factorizing the user-topic activity matrix to obtain the user-topic expertise matrix \({\textbf {U}}{} {\textbf {T}}^\top\)

We perform matrix factorization (Koren et al. 2009) on the user-topic activity matrix to learn the latent features of users and topics. Since not all users have answered questions from all topics, \(\mathbf {S}(\hat{t})\) is incomplete. Therefore, the goal of matrix factorization is to find an approximation of \(\mathbf {S}(\hat{t})\). In practice, this is done by mapping users and items (topics) to a joint latent-factor space of dimensionality \(\ell\), lower than M for users and N for topics. The mapping is done such that the inner product of the user feature matrix and the topic feature matrix approximates users’ expertise on topics. This matrix factorization is illustrated in Fig. 5.

Fig. 5

Illustration of a workflow of matrix factorization

The algorithm begins by creating a user feature matrix \(\mathbf {U}\) of size \(M\times \ell\) and a topic feature matrix \(\mathbf {T}\) of size \(N \times \ell\), where \(\ell\) is the number of latent features. Initially, the entries of each matrix are assigned at random, sampled from a Gaussian distribution. Thus, each user \(u_{\alpha }\) and topic \({\mathcal {C}}_{\omega }\) are associated with vectors \(\mathbf {u}_\alpha \in \mathbb {R}^{\ell }\) and \(\mathbf {t}_{\omega } \in \mathbb {R}^{\ell }\), respectively. The dot product \(\mathbf {u}_\alpha \mathbf {t}_{\omega }^{\top }\) captures the overall interest of user \(u_{\alpha }\) in topic \({\mathcal {C}}_{\omega }\) and approximates the user’s expertise score \(s_{\alpha \omega }^{\hat{t}}\) on that topic.

In order to learn the entries of the vectors \(\mathbf {u}_\alpha\) and \(\mathbf {t}_{\omega }\), namely the “latent features”, the model minimizes the regularized squared error over the set of known activity scores:

$$\begin{aligned} \min _{\mathbf {U},\mathbf {T}} \sum _{(\alpha ,\omega ) \mid {\mathcal {C}}_{\omega } \sim u_{\alpha }} \left( s_{\alpha \omega }^{\hat{t}} - \mathbf {u}_\alpha \mathbf {t}_{\omega }^{\top } \right) ^{2} + \lambda \left( \Vert \mathbf {u}_\alpha \Vert _{{\mathcal {F}}}^2 + \Vert \mathbf {t}_{\omega } \Vert _{{\mathcal {F}}}^2 \right) , \end{aligned}$$
(12)

where \((\alpha ,\omega ) \mid {\mathcal {C}}_{\omega } \sim u_{\alpha }\) denotes the set of user-topic index pairs for which \(s_{\alpha \omega }^{\hat{t}}\) is known, and \(\Vert \cdot \Vert _{{\mathcal {F}}}\) is the Frobenius norm (equivalent to the L2-norm for vectors).

The constant \(\lambda\) controls the extent of regularization and avoids overfitting the observed data. The terms in the right-most bracket are known as Tikhonov regularization terms. Commonly used approaches to minimize (12) are stochastic gradient descent and alternating least squares (Koren et al. 2009). The optimal values for the hyperparameters, such as the number of latent factors \(\ell\), the regularization parameter \(\lambda\), and the learning rate, are determined by k-fold cross-validation on the training dataset (Bishop 2006).
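A minimal stochastic-gradient-descent sketch of the minimization in Eq. (12), fitting only the observed entries of a small dense matrix (the matrix sizes, data, and hyperparameter values are all illustrative, not the ones used in the experiments):

```python
import numpy as np

def factorize(S, mask, ell=2, lam=0.02, lr=0.01, epochs=2000, seed=0):
    """Learn U (M x ell) and T (N x ell) so that U @ T.T approximates the
    observed entries of S, minimizing the regularized squared error of Eq. (12)
    by stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    M, N = S.shape
    U = rng.random((M, ell))
    T = rng.random((N, ell))
    observed = list(zip(*np.nonzero(mask)))
    for _ in range(epochs):
        for a, w in observed:
            err = S[a, w] - U[a] @ T[w]
            # Simultaneous gradient step on both factors, with Tikhonov terms.
            U[a], T[w] = (U[a] + lr * (err * T[w] - lam * U[a]),
                          T[w] + lr * (err * U[a] - lam * T[w]))
    return U, T

# Toy 3 users x 3 topics activity matrix; zeros are unobserved entries.
S = np.array([[5.0, 1.0, 0.0],
              [4.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])
mask = S > 0
U, T = factorize(S, mask)  # U @ T.T also fills in the unobserved entries
```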

4.4 Step 4: expert recommendation for a new question

For a newly posted question \(\hat{q}\), the tags of the question are first mapped to topic communities, and a score is calculated for each user \(u_\alpha\):

$$\begin{aligned} \text {rank}(u_\alpha ,\hat{q}) = \sum _{{\mathcal {C}}_{\omega } \mid \exists \tau \in {\mathcal {C}}_{\omega },\ \hat{q}\sim \tau } w_\omega \left[ \mathbf {U} \mathbf {T}^{\top }\right] _{\alpha \omega }, \end{aligned}$$
(13)

where \(w_\omega\) is a weight (the importance of the topic community in the context of \(\hat{q}\)), calculated as the fraction of the question’s tags that come from the community \({\mathcal {C}}_{\omega }\) out of all tags related to the question:

$$\begin{aligned} w_\omega =\frac{\mid \{\tau \in {\mathcal {C}}_{\omega } \mid \tau \sim \hat{q}\} \mid }{\mid \{\tau \in {\mathcal {V}} \mid \tau \sim \hat{q}\} \mid }=\frac{\mid {\mathcal {C}}_{\omega }\cap \hat{{\mathcal {C}}}\mid }{\mid \hat{{\mathcal {C}}}\mid }. \end{aligned}$$
(14)
Fig. 6

Step 4 of the approach, following the example of user activity matrix deduced via matrix factorization as exemplified in Fig. 5

The ordered set \({\mathcal {U}}^{\text {rank}}\) is obtained by applying the rank of (13), as per the order relation (2). The last step of the process is illustrated in Fig. 6.
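Putting Eqs. (13)-(14) together, the routing step can be sketched as follows (the topic communities and the expertise values, which stand in for rows of \(\mathbf {U}\mathbf {T}^{\top }\), are illustrative):

```python
def route_question(expertise, topics, q_tags, top_k=1):
    """Rank users for a new question (Eqs. 13-14): score each user by their
    expertise on the question's topic communities, weighted by the fraction
    of the question's tags falling into each community."""
    # Eq. (14): importance of each topic community for this question.
    weights = {
        name: len([t for t in q_tags if t in tags]) / len(q_tags)
        for name, tags in topics.items()
    }
    # Eq. (13): weighted sum of per-topic expertise for each user.
    scores = {
        user: sum(w * exp.get(name, 0.0) for name, w in weights.items())
        for user, exp in expertise.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

topics = {"git": {"git", "git-pull"}, "python": {"python", "pandas"}}
expertise = {  # rows of U T^T, indexed by topic community
    "u1": {"git": 0.9, "python": 0.1},
    "u2": {"git": 0.2, "python": 0.8},
}
# A question tagged with two "git" tags and one "python" tag routes to u1.
best = route_question(expertise, topics, ["git", "git-pull", "python"])
```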

5 Data and models

5.1 Experimental setup

In this section, the performance of the proposed model is evaluated using three datasets. Here, we introduce the experimental settings: the datasets, the baseline models against which our method is evaluated, and the metrics used for comparison.

5.1.1 Datasets

We considered three CQA datasets from the StackExchange (SE) network to evaluate the performance of our model. All data are available online in an archive that contains the full history of every SE community, including the entire digital footprint of users' activity: time-stamped posts (questions, answers), votes on questions and answers, and tags used.

We used data from three CQAs, namely Super User, Server Fault, and Ask Ubuntu. The data for each of these three CQA platforms were retrieved from the archive, which contains all the data from each site's inception until \(31^{\text {st}}\) December 2020. Summary statistics of the datasets are presented in Table 1, including the link to download each dataset.

Table 1 Summary statistics for three datasets considered in the main text

From each site's archive, we downloaded two XML files, Posts.xml and Tags.xml. The Tags.xml file contains the tag information and the date on which each tag was created. The Posts.xml file contains all information on timestamped questions and answers, the corresponding votes, and the users who made each post. Questions also include the annotated tags and the corresponding accepted answer. To implement our proposed approach, we used all the data from \(1^{\text {st}}\) January 2015 until \(31^{\text {st}}\) December 2018 as training data. Data from \(1^{\text {st}}\) January 2019 to \(31^{\text {st}}\) March 2019 were used to evaluate the performance of our model. The details of the training datasets are presented in Table 2.

Table 2 Details of the training datasets (average and standard deviation)

5.2 Topic communities

The first part of building the temporal expertise matrix is obtaining topics. For this, we considered the tag network \({\mathcal {G}}\) obtained from the training dataset. Note that in StackExchange, a user can tag a question with at most five tags from a predefined list. We used the Tags.xml file, which contains tags and their creation dates, to analyze the temporal nature of the tag list. We found that for all datasets, 90% of the tags were created on or before \(1^{\text {st}}\) January 2015. Since the list of tags has remained relatively constant for most of the datasets, we analyzed the tag-tag network and the topic communities therein without temporal changes.

We started by creating the tag graph from tag pairs that co-occurred in questions, as discussed in Sect. 4 (Step 1). Next, for each tag pair, we calculated the number of questions \(N_q\) associated with the pair in the dataset and retained only the pairs with \(N_q \ge 5\). Finally, we used the Louvain method (Blondel et al. 2008) to find topic communities.
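The construction of the thresholded co-occurrence graph can be sketched as follows (illustrative names; the resulting edge set is then handed to a community detection routine such as Louvain via CDlib or networkx):

```python
from collections import Counter
from itertools import combinations

def tag_graph_edges(questions, n_q=5):
    """Build edges of the tag co-occurrence graph (Step 1).

    questions: iterable of tag lists, one per question (at most 5 tags on SE).
    An edge between two tags is kept only if at least n_q questions carry
    both tags, mirroring the threshold used in the text."""
    pair_counts = Counter()
    for tags in questions:
        for t1, t2 in combinations(sorted(set(tags)), 2):
            pair_counts[(t1, t2)] += 1
    return {pair for pair, count in pair_counts.items() if count >= n_q}
```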

Fig. 7: Topic communities on the three CQAs. Nodes represent tags, and node color indicates the community to which each tag belongs. The figures were made using the Gephi tool (Bastian et al. 2009)

Figure 7 illustrates the topic communities in the tag networks of the three CQAs. Nodes (tags) clustered in the same community are shown in the same color. The label size of a node reflects its degree centrality; that is, nodes connected to more other nodes have larger labels.

5.3 Testing and training data

To perform the matrix factorization at the heart of the proposed approach, we used the training data to form \({\textbf {S}}(\hat{t})\). Following the settings of previous studies (Li et al. 2019), we filtered out all users who provided fewer than five answers in the training dataset to avoid the cold-start problem (Najafabadi and Mahrin 2016). The elements of the user-topic activity matrix were calculated using the hyperbolic temporal discounting function g(j), as per (11). The time period was divided into time windows of size \(\delta\) equal to 1 month. Further, since the ground truth for a question is the user who provided the accepted answer, the test set was filtered to contain only questions with an accepted answer.
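As an illustration of the discounting step, assuming the canonical hyperbolic form \(g(j) = 1/(1 + kj)\) (the exact form of g(j) is fixed by (11) and may differ):

```python
def discounted_activity(answer_counts, k=1.0):
    """Temporally discounted activity of one user on one topic.

    answer_counts[j] holds the number of answers the user gave j windows
    (months, delta = 1) before the reference time t_hat. The hyperbolic
    form 1 / (1 + k*j) used here is an assumption for illustration."""
    return sum(count / (1.0 + k * j) for j, count in enumerate(answer_counts))
```

Recent windows thus dominate the score, while older activity decays slowly rather than being cut off.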

5.4 Quality metrics

We considered three popular rank evaluation metrics.

  1. Mean Reciprocal Rank (MRR): for each of the Q questions in the test dataset:

    $$\begin{aligned} \text {MRR}=\frac{1}{Q}\sum _{q\in {\mathcal {Q}}}\frac{1}{ \mid \left\{ v\in {\mathcal {U}}^{\text {rank}} \mid v\succ u^*\right\} \mid }. \end{aligned}$$
    (15)
  2. "Precision@r" (P@r): the fraction of questions for which the recommendation placed \(u^*\) among the first r items of \({\mathcal {U}}^{\text {rank}}\):

    $$\begin{aligned} \text {P}@r=\frac{1}{Q}\sum _{q\in {\mathcal {Q}}}{\left\{ \begin{array}{ll}1 \text { if } \mid \left\{ v\in {\mathcal {U}}^{\text {rank}} \mid v\succ u^*\right\} \mid \le r\\ 0 \text { otherwise,} \end{array}\right. } \end{aligned}$$
    (16)

    We considered \(r=5,10\) and reported P@5, P@10.
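Both metrics follow directly from the rank at which \(u^*\) appears in \({\mathcal {U}}^{\text {rank}}\) (taken here as one plus the number of users strictly preferred over \(u^*\)); a minimal sketch:

```python
def mrr_and_p_at_r(ranks, r=5):
    """Compute MRR (15) and P@r (16) from 1-based ranks of u*.

    ranks: for each test question, the position of the accepted answerer
    in the recommended ordering U^rank."""
    q = len(ranks)
    mrr = sum(1.0 / rank for rank in ranks) / q
    p_at_r = sum(1 for rank in ranks if rank <= r) / q
    return mrr, p_at_r
```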

5.5 Comparison to other approaches

We compared our method to five alternatives:

  • Random, where \({\mathcal {U}}^{\text {rank}}\) is obtained by randomly ordering users.

  • FeatureEngg (Ji and Wang 2013), where \({\mathcal {U}}^{\text {rank}}\) is obtained by training a classifier using question features and user features that distinguish the expert class of users from others.

  • InDegree (answer count) (Zhang et al. 2007), where \({\mathcal {U}}^{\text {rank}}\) is obtained by considering that a user who has given a larger number of positively scored answers has higher authority.

  • Z-score (Zhang et al. 2007), where \({\mathcal {U}}^{\text {rank}}\) is deduced from considering that a user who answers many questions is an authority, while those who ask more questions have less authority.

  • Tag-based matrix factorization (T-MF) (Yang and Manandhar 2014). In this approach, the user-tag activity matrix is decomposed to learn user and tag latent features, and subsequently user expertise on a given tag. In other words, the communities considered here are singletons, each containing a single tag. The number of latent features is set to 10, as in previous studies (Yang and Manandhar 2014).

  • Topic Community-based Question Routing (TComm-QR), i.e., our approach without temporal discounting. In this approach, the user-topic activity matrix is decomposed. As in our proposed approach, the number of latent features is set to 10.

In our approach, the number of latent features is set to 10, and the regularization parameter \(\lambda\) is set to 0.01. The experiments were run on a MacBook Pro (CPU: 3.5 GHz Intel Core i7, Memory: 16 GB). Matrix factorization is implemented using the Surprise library (Hug 2020), and community detection using the CDlib library (Rossetti et al. 2019).

6 Results

6.1 Performance

Table 3 summarizes the performance of our proposed model on the three datasets under the three performance metrics.

Table 3 Performance of proposed approach and baseline models

The results show that the proposed model significantly outperforms all alternatives. First, the proposed approach without temporal dynamics (TComm-QR) shows a \(\approx 40\%\) performance improvement on average across datasets compared with the T-MF method. With the addition of temporal discounting (TCTE-QR), the improvement grows to \(\approx 95\%\) over T-MF. Our model also outperforms all baseline models on P@r, obtaining improvements of \(\approx 80\%\) and \(\approx 60\%\) over T-MF for P@5 and P@10, respectively.

6.2 Topic communities underlying the tag network

6.2.1 Density of user activity matrix

For the data we used, we found the sparsity of the user-tag matrix to be high (\(\approx 99.7\%\)). However, since many tags are correlated, clustering tags into topic communities improves both the information on users' expertise and the density of the user activity matrix. Table 4 compares the density of the user-topic activity matrix in our approach (TCTE-QR) with that of the user-tag activity matrix in the tag-based approach (T-MF). The density of the matrix increased \(\approx 45\)-fold.

Table 4 Density of the user activity matrix, defined as the fraction of non-empty entries of \(\mathbf {S}\) when the activity is based on tags (first column), and on topic communities
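The density gain comes purely from aggregating tag columns into community columns; a sketch (illustrative names; communities given as lists of column indices):

```python
import numpy as np

def density(matrix):
    """Fraction of non-empty (non-zero) entries of an activity matrix."""
    return np.count_nonzero(matrix) / matrix.size

def to_topic_matrix(user_tag, communities):
    """Sum the columns of tags belonging to the same topic community."""
    return np.stack([user_tag[:, list(c)].sum(axis=1) for c in communities], axis=1)
```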

6.2.2 Significance of the modular structure in the tag network

To evaluate the effectiveness of the underlying topic communities in \({\mathcal {G}}\), we compared the performance of the proposed approach using topics inferred from the original network against topics inferred from a randomly generated graph.

To generate a random graph, we defined a function that randomly rewires all edges while preserving the original graph's degree distribution. This is done by choosing two arbitrary edges, e.g., (a,b) and (c,d), and substituting them with (a,d) and (c,b) if these do not already exist in the graph. We repeat this until all edges have been swapped, yielding a completely randomized graph (an average of \(\approx 7500\) edge swaps across the three datasets). For the resulting randomized graph, we perform community detection and compare the performance of TCTE-QR using topic communities inferred from the original network against those from the randomized network. We measured the performance metrics on ten independent test datasets: alternate quarterly datasets from Q3-2015 to Q1-2020. For each test dataset, the previous two years of data were used for training. For example, for the Q1-2018 test dataset, we used all data from \(1^{\text {st}}\) January 2016 until \(31^{\text {st}}\) December 2017 as training data. We compared the performance metrics using the Wilcoxon signed-rank test, a widely used statistical test for comparing the performance of recommendation systems (Shani and Gunawardana 2011).
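The degree-preserving randomization can be sketched as below (an illustrative reimplementation; networkx's double_edge_swap performs the same operation):

```python
import random

def rewire(edges, n_swaps, seed=0, max_tries=10000):
    """Degree-preserving rewiring: pick two edges (a,b), (c,d) and replace
    them with (a,d), (c,b), skipping swaps that would create self-loops
    or parallel edges."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # swap would create a parallel edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges
```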

Table 5 Community structure of the tag network \({\mathcal {G}}\) and performance of TCTE-QR when communities are obtained from considering modularity maximization (“Original”) and when communities are created from random graph (“Random”)

Results, shown in Table 5, suggest that the advantage of our technique is due to accounting for the modules within the data, resulting in effective clustering of tags into topic communities.

Further, to analyze the effectiveness of clustering tags into topics without temporal discounting, we examined the performance of TComm-QR when topics are obtained from the original network versus a random graph. Results, shown in Table 6, indicate that the performance of TComm-QR on topics inferred from the random graph is comparable to T-MF, where tags were treated independently. Moreover, while the density of the user-topic matrix increases regardless of how nodes are clustered together, the improvement in performance is significantly higher for TComm-QR on the original graph. These results suggest that reducing sparsity for MF is effective only when community structure exists in the tag network and is accounted for.

Table 6 Average Performance on ten test datasets of proposed approach (without temporal discounting) and baseline models

Robustness of the topic community structure

Results from Table 5 suggest that the network has a modular structure: the modularity of the networks decreases by \(\approx 50\%\) when the original graph is completely randomized. It is important to note that while networks with a highly modular structure have high modularity, high modularity does not imply that a network has a modular structure (Guimera et al. 2004). Thus, to evaluate whether the topic communities are significant, we follow the approach described in Karrer et al. (2008) to assess the robustness of the community structure. In particular, we examined the stability of the recovered partition against random perturbations of the original graph. This is done in four steps:

First, we find the community assignment C that maximizes the modularity of the original network for a given community detection algorithm (Louvain in our case). Second, we perturb the network at different perturbation levels p, i.e., we rewire a proportion p of the edges in the original graph and connect them randomly. The value of p is varied from 0 to 1 (20 values); values close to 0 mean only a few edges are moved, while values close to 1 mean the network becomes completely random. Third, we find the optimal community assignment \(C'\) for the perturbed network and finally measure the variation of information (VI) (Meilă 2007) between \(C'\) and C. We repeat this step ten times for each value of p to derive an average value of the VI.

We repeat these four steps starting from a null random graph with the same degree sequence as the original graph, and compare the two VI curves thus obtained. The first curve, \(VI_{org}\), is obtained by computing the VI between the partition of the original network and the partitions of perturbed versions of the original network. The second curve, \(VI_{random}\), is obtained by computing the VI between the partition of the null random network and the partitions of perturbed versions of that null network. The VI should be robust to small perturbations: if small changes in the network result in a completely different partition, the found communities are not trustworthy. In other words, robustness of the partition against perturbation implies the presence of modular structure in the network (Karrer et al. 2008; Carissimo et al. 2018).
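The VI itself is computed directly from Meilă's (2007) definition, \(VI = H(C) + H(C') - 2I(C, C')\); a minimal sketch operating on partitions given as lists of node sets:

```python
from math import log

def variation_of_information(part_a, part_b):
    """VI between two partitions of the same node set; identical partitions
    give 0, and larger values mean more dissimilar partitions."""
    n = sum(len(block) for block in part_a)
    vi = 0.0
    for b1 in part_a:
        for b2 in part_b:
            r = len(b1 & b2) / n          # joint probability of the block pair
            if r > 0:
                p, q = len(b1) / n, len(b2) / n
                vi -= r * (log(r / p) + log(r / q))
    return vi
```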

Fig. 8: The variation of information as a function of the perturbation level \(p\) for the three networks (blue), along with equivalent results for the corresponding random graphs (red)

The results in Fig. 8 show that the community structure discovered in the original network is significantly more robust against perturbations than that of the random graph. For example, the VI value that the original graph reaches at perturbation level 0.2, i.e., after rewiring \(\approx 20\%\) of the edges, is reached by the random graph after rewiring only \(\approx 5\%\) of the edges.

6.2.3 Effectiveness of community detection algorithms

Although our results relied on the Louvain community detection algorithm (Blondel et al. 2008), we also considered alternatives. In particular, we compared the results obtained using the Louvain algorithm with the Leiden (Traag et al. 2019) and greedy modularity (Clauset et al. 2004) algorithms.

Table 7 Performance based on different community detection algorithms

Louvain is one of the most widely used community detection algorithms. It optimizes a quality function such as modularity (Newman 2004) in two phases: first by locally moving nodes to increase the quality function, and then by aggregating the network obtained in the previous step; the two phases are repeated until the quality function converges. The Leiden algorithm refines this procedure with a smart local move strategy. The greedy modularity (GreedyM) algorithm greedily optimizes modularity, merging at each step the two communities that contribute the maximum positive value to global modularity. Table 7 shows that the results from the Louvain and Leiden algorithms are comparable, and both found higher-quality partitions than GreedyM. As a result, tags were clustered more efficiently into topics, resulting in better routing of questions to potential expert answerers.
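All three algorithms optimize the same quality function, Newman's modularity \(Q = \sum_c \left[ L_c/m - (d_c/2m)^2 \right]\), where \(L_c\) is the number of intra-community edges and \(d_c\) the total degree of community c; a minimal sketch (illustrative names):

```python
def modularity(edges, communities):
    """Newman's modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    label = {v: i for i, community in enumerate(communities) for v in community}
    intra = [0] * len(communities)   # L_c: edges inside each community
    degree = [0] * len(communities)  # d_c: total degree of each community
    for a, b in edges:
        degree[label[a]] += 1
        degree[label[b]] += 1
        if label[a] == label[b]:
            intra[label[a]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))
```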

7 Discussion

This paper proposed a topic community-based temporal expertise algorithm for question routing (TCTE-QR) on community question answering websites. Our model integrates topic-community detection and accounts for temporal changes in users' activity and interest patterns. Using community detection, we cluster tags, ensuring that correlated items are treated together in matrix factorization. Overall, our method makes use of (i) the modular structure of the underlying tag network, (ii) tags alone instead of all textual information, reducing the computational cost of identifying domain experts, and (iii) the evolution of users' interest and activity patterns. The computational complexity of approaches based on topic modeling such as LDA to determine topical expertise is polynomial, solvable in time \({\mathcal {O}}(bk^3)\), where b = \({\mathcal {O}}(N)\), k = (N + T), N is the number of unique words in the documents, and T is the number of topics (Sontag and Roy 2011). In contrast, the Louvain algorithm used in our approach to infer topics and users' topical expertise has linear runtime complexity \({\mathcal {O}}(m)\), where m is the number of edges in the tag network (Blondel et al. 2008). We performed extensive experiments on three real-world datasets, demonstrating significant improvement in question routing over existing baselines. Further, the proposed method addresses key challenges in the question routing task, namely the lack of scalability of algorithms and the often-missing dynamic aspect of the CQA environment (Wang et al. 2018).

In addition, our work has practical contributions not limited to CQA and can be extended to the recommendation systems domain as a whole. In particular, the data sparsity of the user-item matrix is a well-known challenge in collaborative filtering and has been addressed by approaches such as data mining, clustering, and dimensionality reduction (Najafabadi et al. 2017). Here, we argued that this issue can also be mitigated by leveraging the modular structure of the graph formed by the items of interest.

In the future, we intend to address additional issues such as the cold-start problem to improve personalized recommendation in CQA, as well as extending our approach to other CQA platforms like Quora, which hosts questions from multiple domains. Further, we intend to extend this idea to other recommendation datasets, such as music recommendation on Last.fm, movie recommendation on Netflix, and product recommendation on Amazon.