1 Introduction

Most existing probabilistic latent topic models, such as Latent Dirichlet Allocation (LDA) (Blei et al. 2001, 2003), are unsupervised models which analyze a high-dimensional term space and discover a low-dimensional topic space (Blei et al. 2003; Steyvers and Griffiths 2007; Blei and Lafferty 2009; Blei 2012). They have been employed for tackling text mining problems (Sun et al. 2012) including document classification (Jameel and Lam 2013b; Rubin et al. 2012; Li et al. 2015) and document retrieval (Wei and Croft 2006; Wang et al. 2007; Chen 2009; Yi and Allan 2009; Egozi et al. 2011; Andrzejewski and Buttler 2011; Wang et al. 2011, 2013a; Lu et al. 2011; Yi and Allan 2008; Cao et al. 2007a; Park and Ramamohanarao 2009; Duan et al. 2012). These models can achieve better performance by detecting the latent topic structure and establishing a relationship between the latent topics and the goal of the problem. One limitation of unsupervised topic models for document classification is that the topic model itself does not consider the class labels of documents during inference. The advantages of considering this information in latent topic models have been discussed in Zhu et al. (2012a) and Blei and McAuliffe (2008). Another limitation of latent topic models is that they do not exploit the word order structure of documents. Some works attempt to integrate class label information into a topic model for solving document classification, for example, supervised Latent Dirichlet Allocation (sLDA) (Blei and McAuliffe 2008), multi-class supervised Latent Dirichlet Allocation (mcLDA) (Wang et al. 2009), supervised Hierarchical Dirichlet Processes (Zhang et al. 2013; Storkey and Dai 2014), and the maximum margin supervised topic model MedLDA (Zhu et al. 2012a). These models have been shown to improve document classification performance (Zhu et al. 2013a; Jiang et al. 2012; Zhu et al. 2014). However, one common limitation of the above models is that they do not make use of the word order structure in text documents, which could interact with the class label information for solving the document classification task. The technical challenges in considering the word order structure in a supervised topic model are considerable. First, the mathematical derivation of the Gibbs sampling equations needs to be revised from that of the unigram models because our classification model considers distributions over bigrams; this revision rests on a theoretical distinction: bag-of-words models assume exchangeability in the probability space, whereas models which maintain the order of words in the document relax this strong assumption (Aldous 1985). Second, the form of input data to the model changes from the traditional word-document co-occurrence matrix to full documents with word order.

Likewise, unsupervised topic models such as the Topical N-Gram model (TNG) (Wang et al. 2007; Wang and McCallum 2005) and Latent Dirichlet Allocation (LDA) have been used in developing document retrieval models (Wang et al. 2007; Wei and Croft 2006). However, they have not been explored for document retrieval learning, which can essentially be cast as a learning-to-rank problem (Hang 2011). Learning-to-rank models make use of the available relevance judgment information of a document for a query in the training process; the task is then to predict a desired ordering of documents. Several learning-to-rank models have been introduced, for example, Wang et al. (2014), Zong and Huang (2014), Yu et al. (2014) and Niu et al. (2014), but none of them considers the similarity between the document and the query under a low-dimensional topic space within the topic model itself.

The main idea in both of our models is to conduct posterior regularization (Ganchev et al. 2010) in a Bayesian inference parameter learning setup (Zhu et al. 2014). In posterior regularization using Bayesian inference, we intend to find a new desired posterior which is regularized using a regularization model. In our framework, the regularization comes from a maximum margin classifier whose main role is to predict the relevant class of the data. The idea is that for points which are difficult to classify, the classifier receives an extra classification signal from the topic model that helps assign the point to its correct class. Such hard points are mainly located at the margin of the classifier, or would generally be mis-classified by the classifier without any latent topic information. The result of this posterior regularization is a new posterior obtained from the topic model.

1.1 Our main contributions

We propose two topic models that build upon previous works on topic models with word order (Wallach 2006, 2008; Noji et al. 2013; Jameel and Lam 2013b, c; Kawamae 2014; Wang et al. 2007), which discuss in detail the challenges, motivation, and advantages of such models for solving various text mining tasks. One of the main advantages is that such models can better capture the semantic fabric of the document, which is lost when the order of words in the document is discarded. In particular, our models incorporate the notion of side-information within the latent topic model itself, whereas none of the existing topic models with word order considers it. Side-information is mainly handled by a maximum margin classifier which is tightly integrated into the topic model. Topic models with word order have been shown to produce more interpretable latent topics than unigram models (Wang et al. 2007; Jameel and Lam 2013b, c; Lindsey et al. 2012), and they have also been shown to perform better on other quantitative tasks (Jameel and Lam 2013b). However, such models fail to take advantage of side-information to produce more discriminative and interpretable latent topics. Our hybrid models can accomplish this goal. Our first model is a low-dimensional latent topic model for document classification. Class label information and word order structure are integrated into our supervised topic model with maximum margin learning, enabling more effective interaction among such information for solving document classification. The mathematical derivation of the Gibbs sampling equations is quite complex due to the Markovian assumption on the order of words in our model. Since our classification model considers distributions over bigrams, the framework described in Jiang et al. (2012) and Zhu et al. (2012a) needs considerable changes due to the exchangeability assumption (Heath and Sudderth 1976; Aldous 1985). We adopt the collapsed Gibbs sampling framework (Shao and Ibrahim 2000), with considerable changes from Jiang et al. (2012), because it collapses out the nuisance variables and speeds up inference (Porteous et al. 2008). The design and the study of the interplay between the side-information and word order is an interesting contribution: our model provides insights into how word order interacts with side-information in a topic model. The implementation of the model is also challenging, since the input is not a word co-occurrence matrix but full documents with word order.

Another contribution is that we propose a new supervised topic model for document retrieval learning which can be regarded as a pointwise model for tackling the learning-to-rank task. Available relevance assessments and word order structure are integrated into the topic model itself. We jointly model the similarity between the query and the document under a low-dimensional topic space in a maximum margin framework. The main motivation for proposing this model is that in the document retrieval learning setting, our model, apart from using the usual query-dependent features such as similarity metrics between the query and the document, and query-independent features (Qin et al. 2010) such as PageRank (Brin and Page 1998), can also use a topic similarity feature which measures the similarity between the query and the document in the latent topic space. Fundamentally, even if the words of the query and the document do not overlap, when their low-dimensional representations are semantically close or identical in their latent topic assignments, we get a signal that they are describing the same thematic content. We conduct extensive experiments on several publicly available benchmark datasets, and show that our model improves upon the state-of-the-art models. One major difference between our model and existing learning-to-rank models is that the latter do not consider latent topic information in the learning framework. Our pointwise learning-to-rank model lays a foundation for future research on document retrieval learning, for example, further development of pairwise and listwise probabilistic latent topic models for document retrieval learning. Note that we develop both our document retrieval learning and classification models based on the design paradigm of Jiang et al. (2012) and Zhu et al. (2012a). An important point is that these methods have shown superior performance to two-stage heuristic methods which first compute latent topic vector representations and then feed these vectors to a separate prediction model. In order to adapt the classification model for solving the document retrieval learning problem, a new design is needed. First, the discriminant function, along with the formulations that follow from it, needs to be redesigned to handle the document retrieval learning task. Second, the relevance judgment associated with each query-document pair is also considered in our model. Third, the prediction task on unseen query-document pairs needs to be reformulated, as the prediction rule of the classification model does not directly carry over to the document retrieval learning task.

1.2 Our previous works

Recently, in Jameel and Lam (2013b) we presented a topic model inspired by the Bigram Topic Model (BTM) (Wallach 2006). This model relaxes the bag-of-words assumption and generates collocations just like the LDA-Collocation Model (LDACOL) (Griffiths et al. 2007). It differs from the new models proposed in this paper in that it is unsupervised, whereas here we incorporate side-information. Our temporal model proposed in Jameel and Lam (2013c) also generates more interpretable latent topics with word order; however, this model does not consider side-information and cannot solve the document retrieval learning task. Our nonparametric topic model proposed in Jameel and Lam (2013a) also differs significantly from the models proposed in this paper: although it maintains the order of words and shows promising empirical performance, it does not incorporate side-information and it is a nonparametric topic model. Recently, we also proposed a nonparametric topic model where the order of words is maintained (Jameel et al. 2015). This model introduced a new non-exchangeable metaphor known as the Chinese Restaurant Franchise with Buddy Customers (CRF-BC). It is significantly different from the models proposed in this work in that the CRF-BC model does not incorporate side-information; also, it is well suited for generating collocations and is nonparametric.

2 Related work

Unsupervised and supervised topic models have been applied to the document classification task (Blei et al. 2003; Blei and McAuliffe 2008; Wang et al. 2013b). An advantage that supervised topic models have over unsupervised ones is that they consider the available side-information as response variables in the topic model itself. This helps discover a more predictive low-dimensional representation of the data for better classification (Zhu et al. 2012a). Blei and McAuliffe (2008) proposed the supervised Latent Dirichlet Allocation (sLDA) model, which captures a real-valued document rating as a regression response and relies upon a maximum-likelihood based mechanism for parameter estimation. Wang et al. (2009) proposed multi-class sLDA (mcLDA), which directly captures discrete labels of documents as a classification response. The Discriminative LDA (DiscLDA) (Lacoste-Julien et al. 2008) also performs classification, using a different mechanism than sLDA. Different from the above models, Zhu et al. (2012a) proposed the Maximum Entropy Discrimination LDA model, known as MedLDA, which directly minimizes a margin-based loss derived from an expected prediction rule. The MedLDA model uses a variational inference method for parameter estimation; subsequently, Markov Chain Monte Carlo techniques were proposed in Zhu et al. (2013a, b, c) and Jiang et al. (2012). Ramage et al. (2009) proposed a supervised topic model which jointly models available class labels and text content by defining a one-to-one correspondence between latent topics and class label information, which allows their model to directly learn word-tag correspondences in the topic model itself. What has not been studied in supervised topic modeling is the role that the word order structure in the text content could play along with the side-information in the document classification task. Our proposed supervised topic model falls in the class of parametric topic models, where the number of latent topics has to be supplied by the user. Recently, Kawamae (2014) presented a nonparametric supervised n-gram topic model based on a Pitman–Yor process prior (Pitman and Yor 1997) for phrase extraction, which takes advantage of labels during the training process. However, it cannot perform document retrieval learning as our model can. Moreover, Bartlett et al. (2010) stated that nonparametric models with Pitman–Yor process priors cannot scale to large datasets. There are other supervised nonparametric topic modeling approaches (Perotte et al. 2011; Storkey and Dai 2014; Lakshminarayanan and Raich 2011; Xie and Passonneau 2012; Liao et al. 2014; Acharya et al. 2013), but these models too cannot perform the document retrieval learning task. In addition, such nonparametric topic models are computationally very expensive (Wallach et al. 2009).

Unsupervised topic models have also been used to perform document classification. As mentioned above, they do not make use of the available side-information in the topic model itself. The LDA model is one example, and it has been shown to achieve better performance than Support Vector Machines (SVM) (Joachims 1998; Cortes and Vapnik 1995; Vapnik 2000). In (Rubin et al. 2012), the authors presented a model that maintains the order of words in documents, which helps achieve better classification results. In (Li and McCallum 2006), the authors presented an unsupervised hierarchical topic model which generates super- and sub-topics, and showed better classification performance than the comparative methods. The model is represented by a directed acyclic graph, which has the capability to capture correlations between two levels of topics. Topic models have also been used on datasets other than text documents for classification under the unsupervised setting (Bicego et al. 2010; Pinoli et al. 2014).

It has been shown in the past that considering the order of words in documents helps improve both the quantitative and qualitative performance of probabilistic topic models. For example, Wallach (2008) observed that word order is an important component in many applications such as natural language processing, speech recognition, and text compression, and that bag-of-words models might therefore not be very suitable for such applications. Wallach proposed the Bigram Topic Model (BTM), an extension of the LDA model. The BTM adopts a Markovian assumption on the order of words in documents and has been shown to perform better than the LDA model in predictive tasks. However, the BTM has a limitation in that it only generates bigrams, which may not be desirable for some tasks. Griffiths et al. (2007) proposed the LDA collocation model (LDACOL), which can generate either unigrams or bigrams based on the context information. In the LDACOL model, however, only the first term has a topic assignment whereas the second term does not, a limitation addressed in the topical n-gram model (TNG) (Wang and McCallum 2005; Wang et al. 2007). Some improvements to the BTM have been proposed in Noji et al. (2013). All these works suggest that word order plays an important role in topic models. In terms of qualitative results, the topical words appear more interpretable (Lindsey et al. 2012), and in terms of quantitative results, word order has been shown to improve many applications such as document classification (Jameel and Lam 2013b) and information retrieval (Wang et al. 2007).

Learning-to-rank models have been extensively investigated and can be categorized into pointwise, pairwise, and listwise approaches (Liu 2009). One early work used bag-of-features representations to train an SVM model for document retrieval learning, which can be regarded as a pointwise approach to the learning-to-rank task (Nallapati 2004). This approach makes a binary relevance prediction; documents are then ranked based on the confidence scores given by the discriminative classifier. Subsequently, other discriminative learning-to-rank models have been proposed, such as those which handle multi-class relevance assessments (Busa-Fekete et al. 2013; Li et al. 2007). Many state-of-the-art learning-to-rank models have been proposed recently. For example, Gao and Yang (2014) presented a novel semi-supervised listwise learning-to-rank model which is extended to an adaptive ranker for domains where no training data is available. In (Lai et al. 2013), the authors presented a sparse learning-to-rank model for information retrieval. Dang et al. (2013) proposed a two-stage learning-to-rank framework to address the problem of sub-optimal ranking when many relevant documents are excluded from the ranking list by bag-of-words retrieval models. In (Tan et al. 2013), the authors proposed a model which directly optimizes the ranking measure without resorting to any upper bounds or approximations. However, a major difference between these learning-to-rank models and our proposed document retrieval learning model is that our model considers latent topic information unified within a discriminative framework.

In the past, a few proposals have been made to conduct document retrieval using a low-dimensional latent semantic space. In (Li and Xu 2014), the authors summarize many of those works. The main motivation for incorporating semantic information into the document retrieval task is to compute the similarity between latent factors based on the semantic content of the document. In (Bai et al. 2010), the authors proposed a discriminative model called supervised semantic indexing which can be trained on labeled data. Their model can compute query-document and document-document similarity in the semantic space, but their focus is primarily on traditional document retrieval rather than learning-to-rank using an extensive set of feature values. Gao et al. (2011) and Jagarlamudi and Gao (2013) proposed topic models which jointly consider the query and the title of the document to conduct the document retrieval task using a language modeling framework. Their motivation for considering title fields is mainly that queries (Broder 2002) as well as titles are mostly short in nature, so a short document title can be more informative than the entire document for a query. One difference between our model and their framework is that their model is not designed to solve the learning-to-rank task using feature instances. Our model jointly learns the query and document pair along with the associated relevance label in the latent topic space.

Our document retrieval learning framework is also closely related to some works in posterior regularization. The objective of the posterior regularization framework is to restrict the space of the model parameters on unlabeled data as a way to guide the model towards some desired behaviour. In (Ganchev et al. 2010), the authors proposed a framework which incorporates side-information into parameter estimation in the form of linear constraints on posterior expectations. Recently, Zhu et al. (2012b, 2014) introduced Bayesian posterior regularization under an information theoretic formulation and applied their framework to the infinite latent SVM. Earlier, the same authors had extended Zellner's optimization view described in Zellner (1988) to propose a regularized Bayesian framework for the multi-task learning problem (Zhu et al. 2011), mainly by adding a convex function to the optimization framework proposed by Zellner. Models such as MedLDA (Zhu et al. 2009, 2012a) and some of its extensions are based on such frameworks (Zhu et al. 2013a; Jiang et al. 2012).

Relational topic models, such as the one described in Chang and Blei (2009), incorporate side-information in the form of connections in information networks. Such connections can be social network friendships, as used in Yuan et al. (2013), or scholarly citation networks. In (Tang et al. 2011), the authors proposed a topic model with supervised information for advertising. These models are not designed to handle document retrieval learning, which can be cast as a learning-to-rank problem. Also, our model incorporates the latent topic structure of the BTM model to better capture latent semantic information, and the supervision signal is used in the maximum margin framework.

3 Background

In this section, we first present a brief background that helps understand our proposed models described later. We start with a basic topic model known as Latent Dirichlet Allocation (LDA) (Blei et al. 2003) and present the details of its main components. Then we present the optimization-based view of the posterior distribution obtained from LDA. This optimization framework will then be extended to incorporate loss functions from a maximum-margin classifier. We present an example of a supervised topic model that makes use of this optimization framework of LDA by extending it to incorporate posterior constraints in Bayesian inference, leading to what is known as the regularized Bayesian inference framework.

3.1 Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a generative probabilistic topic model for collections of discrete data such as text document collections. The model assumes that documents exhibit multiple latent topics; therefore, each document is a mixture of a number of topics. LDA represents a latent topic as a probability distribution over words taken from a vocabulary set. A document is denoted by \(d \in \left\{ 1,\ldots ,D\right\} \) where \(D\) is the total number of documents in the collection. Let \(\varvec{W}=\left\{ \varvec{w}^d\right\} _{d=1}^{D}\) denote all the words in all the documents in the collection, where each \(\varvec{w}^d\) denotes the words in the document \(d\). \(N^d\) is the number of words in the document \(d\). \(w_n^d\) is the word at the position \(n\) in the document \(d\). \(K\) is the total number of latent topics, which is specified by the user. \(z_n^d\) is the topic assignment of the word \(w_n^d\). \(\varvec{Z}=\left\{ \varvec{z}^d\right\} _{d=1}^{D}\) are the topic assignments to all the words. \(\varvec{\varTheta }=\left\{ \varvec{\theta }^d\right\} _{d=1}^{D}\) are the topic distributions for all documents. Let \(\varvec{\varPhi }=\left\{ \varvec{\phi }_k\right\} _{k=1}^{K}\) denote the word-topic distributions. Let \(V\) denote the number of words in the vocabulary. Let \(\varvec{\alpha }\) be the vector of hyperparameter values for the document-topic distributions, and \(\varvec{\beta }\) the vector of hyperparameter values for the word-topic distributions.

The LDA model describes the generative procedure of each document in the collection. Each document is generated from a mixture of topics that pervades the document. Each of those topics is in turn responsible for generating the words without giving importance to the order of the occurrence of the words in those documents.

The generative process of the LDA model is written as:

1. Draw the topic proportion \(\varvec{\theta }^d\) for each document \(d\) from Dirichlet(\(\varvec{\alpha }\)),
2. Draw \(\phi _k\) for each topic \(k\) from Dirichlet(\(\varvec{\beta }\)),
3. For each word \(w_n^d\) in the document \(d\):
   (a) Draw a topic assignment \(z_n^d|\varvec{\theta }^d\) from Multinomial(\(\varvec{\theta }^d\)),
   (b) Draw the observed word \(w_n^d|z_n^d,\varvec{\varPhi }\) from Multinomial(\(\phi _{z_n^d}\)).
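For concreteness, the generative process above can be simulated with a short script. The sketch below is only an illustration with our own variable names (e.g., `n_topics`, `vocab_size`); the model itself is fit by posterior inference over Eq. (2) rather than by forward simulation.

```python
import numpy as np

def simulate_lda(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Illustrative forward simulation of the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Step 2: one word distribution phi_k per topic, drawn from Dirichlet(beta)
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    corpus = []
    for d in range(n_docs):
        # Step 1: topic proportions theta^d for document d, drawn from Dirichlet(alpha)
        theta = rng.dirichlet(np.full(n_topics, alpha))
        doc = []
        for n in range(doc_len):
            z = rng.choice(n_topics, p=theta)     # Step 3(a): z_n^d ~ Multinomial(theta^d)
            w = rng.choice(vocab_size, p=phi[z])  # Step 3(b): w_n^d ~ Multinomial(phi_{z_n^d})
            doc.append(w)
        corpus.append(doc)
    return corpus, phi

# Example: a tiny synthetic collection with K = 3 topics and V = 50 vocabulary terms
docs, phi = simulate_lda(n_docs=5, doc_len=20, n_topics=3, vocab_size=50, alpha=0.1, beta=0.01)
```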

The probability of a document collection \({\mathbb {D}}\) in LDA is given as:

$$ p\left( {\mathbb {D}}|\varvec{\alpha }, \varvec{\beta }\right) = \prod _{d=1}^{D}\int P\left( \varvec{\theta }^d|\varvec{\alpha }\right) \left( \prod _{n=1}^{N^d} \sum _{z_{n}^d}P\left( z_{n}^d|\varvec{\theta }^d\right) P\left( w_{n}^d|z_{n}^d, \varvec{\beta }\right) \right) {\text {d}}\varvec{\theta }^d $$
(1)

The posterior distribution inferred by the LDA model can be written as:

$$ P\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta }\right) = \frac{P_0\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{\alpha },\varvec{\beta }\right) P\left( \varvec{W}|\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }\right) }{P\left( \varvec{W}|\varvec{\alpha },\varvec{\beta }\right) } $$
(2)

where \(P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta })\) is the posterior distribution of the model. Let the prior distribution be denoted by \(P_0(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{\alpha },\varvec{\beta })\); it is defined as:

$$ P_0\left( \varvec{\varTheta },\varvec{\varPhi },\varvec{Z}|\varvec{\alpha },\varvec{\beta }\right) =\left( \prod _{d=1}^{D}P\left( \varvec{\theta }^d|\varvec{\alpha }\right) \prod _{n=1}^{N^d}P\left( z_{n}^d|\varvec{\theta }^d\right) \right) \prod _{k=1}^{K}P\left( \varvec{\phi }_k|\varvec{\beta }\right) $$
(3)

\(P(\varvec{W}|\varvec{\varTheta },\varvec{Z},\varvec{\varPhi })\) is the likelihood. \(P(\varvec{W}|\varvec{\alpha },\varvec{\beta })\) is the marginal probability distribution.

3.2 Learning using Bayesian inference

Equation 2 presented in Sect. 3.1 can be further translated into an information theoretic optimization problem (Jiang et al. 2012; Zhu et al. 2012a, 2013a, 2014). An advantage of this paradigm is that it can easily be extended to incorporate regularization terms on the desired posterior distribution obtained using Bayes' theorem. It leads to a learning model where the posterior distribution is directly regularized using a model which considers side-information. The regularizer can be obtained from the maximum-margin learning principle and then integrated into the Bayesian learning paradigm, leading to regularized Bayesian inference using maximum-margin learning. In principle, this hybrid model could achieve better prediction performance than using a topic model or a maximum-margin classifier alone, because it inherits the prediction power from both maximum margin learning and topic models. It is well known that maximum margin classifiers show strong generalization performance (Burges 1998), and topic models have also shown good performance on the document classification task (Rubin et al. 2012; Li and McCallum 2006). Therefore, we can expect the hybrid model to inherit the advantages of both. When conducting posterior inference, we can directly regularize the posterior distribution, which leads to a new posterior regularized by a constraint. Some supervised topic models, such as MedLDA (Zhu et al. 2012a) and Monte Carlo MedLDA (Jiang et al. 2012), are based on this paradigm.

According to the findings described in Zellner (1988), Eq. (2) can be transformed to an optimization problem which can be written as follows:

$$\begin{aligned} \begin{aligned}&\underset{P\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }\right) \in {\mathbb {P}}}{\text {minimize}}\quad {\text {KL}}\left[ P\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta })||P_0(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{\alpha },\varvec{\beta }\right) \right] -{\mathbb {E}}_P \left[ \log {\text {P}}\left( \varvec{W}|\varvec{Z},\varvec{\varPhi }\right) \right] \\&{\text {subject \, to}} \quad \, P\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }\right) \in {\mathbb {P}}, \end{aligned} \end{aligned}$$
(4)

where \({\mathbb {P}}\) is the probability distribution space, and \({\text {KL}}(P||P_0)\) is the Kullback–Leibler divergence from \(P\) to \(P_0\). The above optimization interpretation will be useful in our later discussion where we will show how this technique can be used to derive a new maximum margin learning framework using a topic model. We present how the posterior distribution can be transformed into the optimization problem depicted in Eq. (4) in the "Appendix".
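As a brief sketch of why this optimization view recovers standard Bayesian inference (the full argument is deferred to the "Appendix"), observe that for any candidate distribution \(P\),

$$ {\text {KL}}\left[ P||P_0\right] -{\mathbb {E}}_P\left[ \log {\text {P}}\left( \varvec{W}|\varvec{Z},\varvec{\varPhi }\right) \right] = {\text {KL}}\left[ P||P_{\text {Bayes}}\right] -\log P\left( \varvec{W}|\varvec{\alpha },\varvec{\beta }\right) , $$

where \(P_{\text {Bayes}}\) denotes the posterior on the left-hand side of Eq. (2) and the arguments of \(P\) and \(P_0\) are as in Eq. (4). Since \(\log P(\varvec{W}|\varvec{\alpha },\varvec{\beta })\) does not depend on \(P\), the unconstrained minimizer of Eq. (4) is exactly the Bayes posterior; the feasible set \({\mathbb {P}}\), and the regularization terms added later, are what move the solution away from it.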

3.3 Maximum Entropy Discrimination LDA (MedLDA)

As mentioned above, our proposed model can be regarded as a supervised topic model where the class label information is incorporated into the topic model itself. Supervised topic models have been used for both classification and regression tasks. One example of a supervised topic model is supervised LDA (sLDA) (Blei and McAuliffe 2008), which extends LDA via the likelihood principle. Another recent supervised topic model is MedLDA (Zhu et al. 2009, 2012a; Jiang et al. 2012), whose graphical model is presented in Fig. 1. Note that in this model \(\beta \) is not used explicitly, but it can be used as a prior to make the model fully Bayesian (Zhu et al. 2012a). MedLDA combines a maximum margin learning algorithm based on Support Vector Machines (SVM) for label prediction with a topic model based on LDA for modeling the semantic content of the words.

Fig. 1
figure 1

Graphical representation of the MedLDA model

The class label for the document \(d\) is denoted by \(y^d\), which takes one of the values in \({\texttt{Y}} =\left\{ 1,\ldots , M\right\} \). Let \(\overline{\varvec{z}}^d\) denote a \(K\)-dimensional vector with each element \({\overline{z}}_k^d=\frac{1}{N^d}\sum _{n=1}^{N^d} {\mathbb {I}}(z_{n}^d=k)\), where \({\mathbb {I}}(.)\) is an indicator function which equals 1 if the predicate holds and 0 otherwise. \(\varvec{f}(y,\varvec{\overline{z}}^d)\) is an \(MK\)-dimensional vector whose elements from position \((y-1)K+1\) to \(yK\) are \(\overline{\varvec{z}}^d\) and the rest are all 0. Let \(\varvec{\eta }\) denote the parameters of the maximum margin classification model. Let \(C\) be a regularization constant, \(\xi ^d\) be the slack variable, and \(l^d(y)\) be the loss function for the label \(y\); all of these are positive. \(\varvec{\xi }\) denotes the vector of nonnegative auxiliary parameters, usually referred to as the slack variables. Consider Zellner's interpretation shown in Eq. (4). In the regularized Bayesian framework, a convex function is added to the optimization problem described above (Zhu et al. 2011). One choice of such a convex function borrows ideas from a maximum margin classifier, and the resulting problem can be written as:

$$\begin{aligned} \underset{P(\varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }) \in {\mathbb {P}}, \varvec{\xi }}{\text {minimize}}\quad&{\text {KL}}[P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W}, \varvec{\alpha },\varvec{\beta })||P_0(\varvec{\varTheta },\varvec{Z}, \varvec{\varPhi }|\varvec{\alpha },\varvec{\beta })]-{\mathbb {E}}_P[\log {\text {P}}(\varvec{W}|\varvec{Z},\varvec{\varPhi })] + B(\varvec{\xi }) \nonumber \\ {\text {subject \, to }}\quad \,&P(\varvec{\eta }, \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }) \in {\mathbb {P}}(\varvec{\xi }), \end{aligned}$$
(5)

where \(B(\varvec{\xi })\) is a convex function, usually the hinge loss function of the maximum margin classifier, \(\varvec{\eta }\) denotes the parameters of the maximum margin classifier, and \({\mathbb {P}}(\varvec{\xi })\) is the subspace of probability distributions that satisfies a set of constraints. Note that, as stated in Sect. 3.2, we can add a loss function to the optimization view of the Bayes' theorem obtained from LDA. Thus the interpretation given by Zellner can easily be used to develop supervised topic models for prediction tasks.

For MedLDA, a maximum margin based topic model for label prediction, the soft-margin formulation can be written as:

$$\begin{aligned} \underset{p \left( \varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }\right) \in {\mathbb {P}},\xi }{{\text {minimize}}}\quad&{\text {KL}}\left[ P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta })||P_0(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{\alpha },\varvec{\beta })\right] -{\mathbb {E}}_P\left[ \log {\text {P}}\left( \varvec{W}|\varvec{Z},\varvec{\varPhi }\right) \right] + \frac{C}{D} \sum _{d=1}^{D} \xi ^d \nonumber \\ {\text {subject \, to }} \quad&{\mathbb {E}}_P\left[ \varvec{\eta }^{\intercal } \left( \varvec{f} \left( y^d,\varvec{\overline{z}}^d\right) -\varvec{f}\left( y,\varvec{\overline{z}}^d\right) \right) \right] \ge l^d(y) - \xi ^d, \xi ^d \ge 0, \forall d, \forall y, \end{aligned}$$
(6)

One can see from the above equation that MedLDA conducts regularized Bayesian inference of the same form as depicted in Eq. (5). Therefore, MedLDA is a hybrid topic model which takes advantage of both the topic model and the maximum margin learning framework. Equation (6) can also be written as:

$$\begin{aligned} \begin{aligned}&\underset{P\left( \varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }\right) \in {\mathbb {P}},\xi }{{\text {minimize}}}\quad {\text {KL}}\left[ P\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta }\right) ||P_0\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{\alpha },\varvec{\beta }\right) \right] - {\mathbb {E}}_P\left[ \log {\text {P}}\left( \varvec{W}|\varvec{Z},\varvec{\varPhi }\right) \right] \\&\quad +\frac{C}{D} \sum _d \max _y\left( l^d\left( y\right) -{\mathbb {E}}_P\left[ \varvec{\eta }^{\intercal } \left( \varvec{f}\left( y^d,\varvec{\overline{z}}^d\right) -\varvec{f}\left( y,\varvec{\overline{z}}^d\right) \right) \right] \right) \end{aligned} \end{aligned}$$

The component \(\frac{1}{D} \sum _d \max _y\left( l^d\left( y\right) -{\mathbb {E}}_P\left[ \varvec{\eta }^{\intercal }\left( \varvec{f}\left( y^d,\varvec{\overline{z}}^d\right) -\varvec{f}\left( y,\varvec{\overline{z}}^d\right) \right) \right] \right) \) is the hinge loss, which is an upper bound of the prediction error on the training data.
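As an illustration of these quantities, the sketch below computes \(\overline{\varvec{z}}^d\), the feature map \(\varvec{f}(y,\overline{\varvec{z}}^d)\), and the per-document hinge loss for a single draw of \((\varvec{\eta },\varvec{z}^d)\) (in the model these are averaged over the posterior). All names are our own, classes are 0-indexed, and the 0/16 loss used as a default is only a placeholder choice; this mirrors the definitions above rather than the authors' implementation.

```python
import numpy as np

def mean_topic_vector(z_d, n_topics):
    """\bar{z}^d: empirical topic proportions of the sampled assignments in document d."""
    return np.bincount(z_d, minlength=n_topics) / len(z_d)

def feature_map(y, z_bar, n_classes):
    """f(y, \bar{z}^d): an M*K vector whose y-th block of length K is \bar{z}^d
    (with 0-indexed classes, block y occupies positions yK .. (y+1)K - 1)."""
    K = len(z_bar)
    f = np.zeros(n_classes * K)
    f[y * K:(y + 1) * K] = z_bar
    return f

def hinge_loss(eta, z_bar, y_true, n_classes,
               loss=lambda y, y_true: 16.0 * (y != y_true)):
    """max_y [ l^d(y) - eta^T (f(y_true, z_bar) - f(y, z_bar)) ] for one document."""
    f_true = feature_map(y_true, z_bar, n_classes)
    values = [loss(y, y_true) - eta @ (f_true - feature_map(y, z_bar, n_classes))
              for y in range(n_classes)]
    return max(values)
```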

One characteristic of MedLDA is that it conducts posterior regularization, where the posterior distribution obtained using a topic model is regularized with maximum margin constraints. This leads to a posterior which is mainly helpful in classifying those points which lie on the margin of the classifier or are mis-classified. The latent topic information supplied by the topic model helps classify such hard instances, which the maximum margin classifier alone would find difficult. This mechanism makes the model different from two-stage approaches in which one first computes the latent topic information using a topic model and then uses it as an added feature in the classification task. A two-stage approach to prediction may suffer from error propagation from one stage to the next, which is mitigated in single-stage models such as MedLDA.

4 Supervised topic model with word order for document classification

4.1 Model description

We propose a document classification model based on a latent topic model that integrates the class label information and the word order structure into the topic model itself. This enables interaction among such information for more effective modeling for document classification. There are two main components. One component is a topic model with word order. The other component is the maximum margin model. One fundamental difference between MedLDA and our proposed model is that our model exploits the word order structure of a document. The design of the above two components leads to a latent topic representation that is more discriminative and advantageous for the supervised document classification learning problem.

The document content modeling component of our model is primarily a bigram topic model which captures dependencies between consecutive words. Each topic is characterized by a distribution over bigrams. The goal of our model is to generate a latent topic representation that is suitable for the classification task. We adopt the same notation as in Sect. 3. In our model, word generation is defined by the conditional distribution \(P(w_n^d|w_{n-1}^d, z_n^d)\). The word-topic distribution denoted by \(\varvec{\varPhi }\) is different from MedLDA: \(\varvec{\varPhi }=\left\{ \varvec{\phi }_{kv}\right\} _{v,k=1}^{V,K}\) denotes the word-topic distributions. We depict the graphical model of our model in Fig. 2. Note that we show the hyperparameter \(\beta \) explicitly in the graphical model. The generative process of our model is depicted below:

1. Draw a Multinomial distribution \(\phi _{zw}\) from a Dirichlet prior \(\beta \) for each topic \(z\) and each word \(w\),
2. For each document \(d\):
   (a) Draw a topic proportion \(\varvec{\theta }^d\) for the document \(d\) from Dirichlet(\(\varvec{\alpha }\)), where Dirichlet(\(\varvec{\alpha }\)) is the Dirichlet distribution with parameter \(\varvec{\alpha }\),
   (b) For each word \(w_n^d\):
       (i) Draw a topic \(z_n^d\) from Multinomial(\(\varvec{\theta }^d\)),
       (ii) Draw the word \(w_n^d\) from the distribution over words for the context defined by the topic \(z_n^d\) and the previous word \(w_{n-1}^d\), i.e., from Multinomial(\(\phi _{w_{n-1}^d z_n^d}\)),
3. Draw the class label parameter \(\varvec{\eta }\) from Normal(\(0,\varvec{\eta }_0\)), where \(\varvec{\eta }_0\) is the hyperparameter for \(\varvec{\eta }\); this is sampled \(M\) times, where \(M\) is the number of classes considered in the classification problem,
4. Draw a class label \(y^d|(\varvec{z}^d,\varvec{\eta })\) according to Eqs. (8)–(10).
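A minimal sketch of the word-level steps 2(b)(i)-(ii) above is given below; it highlights the only generative difference from LDA, namely that the word distribution is indexed by both the current topic and the previous word. Variable names are our own, the handling of the first word of a document is glossed over, and the classifier part (steps 3-4) is omitted.

```python
import numpy as np

def simulate_bigram_doc(doc_len, theta, phi, first_word, rng):
    """phi has shape (K, V, V): phi[z, w, v] is the probability of word w given
    topic z and previous word v (cf. Multinomial(phi_{w_{n-1}^d z_n^d}))."""
    words, topics = [first_word], []
    for n in range(1, doc_len):
        z = rng.choice(len(theta), p=theta)                   # z_n^d ~ Multinomial(theta^d)
        w = rng.choice(phi.shape[1], p=phi[z, :, words[-1]])  # w_n^d | w_{n-1}^d, z_n^d
        topics.append(z)
        words.append(w)
    return words, topics
```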

Let \(\varvec{b}^d\) denote \(\{b_{n,n+1}^d\}_{n=1}^{N^d-1}\), where \(b_{n,n+1}^d\) denotes the words at the positions \(n\) and \(n+1\) in the document \(d\) written as \(b_{n,n+1}^d=(w_n^d,w_{n+1}^d)\). \(\varvec{W}=\{\varvec{b}^d\}_{d=1}^{D}\) is the word order information. The prior distribution defined in the model is expressed as:

$$\begin{aligned} P_0(\varvec{\varTheta },\varvec{\varPhi },\varvec{Z})=\left( \prod _{d=1}^{D}P(\varvec{\theta }^d|\varvec{\alpha }) \prod _{n=1}^{N^d}P\left( z_{n}^d|\varvec{\theta }^d\right) \right) \prod _{k=1}^{K}\prod _{v=1}^{V} P\left( \varvec{\phi }_{kv}|\varvec{\beta }\right) \end{aligned}$$
(7)

In our model, the objective is to infer the joint distribution \(P(\varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta })\), where \(\varvec{\eta }\) is a random variable representing the parameter of the classification model. In addition, the discriminant function is defined as:

$$ F\left( y,\varvec{\eta },\varvec{z};\varvec{b}^d\right) = \varvec{\eta }^{\intercal } \varvec{f}\left( y;\varvec{\overline{z}}^d\right) $$
(8)

The above latent function cannot be directly used for prediction tasks for an observed input document as it involves random variables. Therefore, we take the expectation and define the effective discriminant function as follows:

$$ F\left( y;\varvec{b}^d\right) = {\mathbb {E}}_{p\left( \varvec{\eta },\varvec{z}| \varvec{b}^d\right) } \left[ F\left( y,\varvec{\eta },\varvec{z};\varvec{b}^d\right) \right] $$
(9)

The prediction rule incorporating the word order structure in the classification task is:

$$ \hat{y}=\underset{y}{{{\text {argmax}}}}\quad F\left( y;\varvec{b}^d\right) $$
(10)

Let \(C\) be a regularization constant, \(\xi ^d\) be the slack variable and \(l^d(y)\) be the loss function for the label \(y\); all of which are positive. The soft-margin framework for our model can be written as:

$$\begin{aligned} \begin{aligned}&\underset{P\left( \varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }\right) \in \mathbb {P}, \varvec{\xi }}{\text {minimize}}\quad {\text {KL}}\left[ P\left( \varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta }\right) ||P_0\left( \varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{\alpha },\varvec{\beta }\right) \right] - \mathbb {E}_P\left[ \log {\text {P}}\left( \varvec{W}|\varvec{Z},\varvec{\varPhi }\right) \right] \\&\quad + \frac{C}{D} \sum _d \max _y\left( l^d\left( y\right) -\mathbb {E}_P\left[ \varvec{\eta }^{\intercal } \left( \varvec{f}\left( y^d,\varvec{\overline{z}}^d\right) -\varvec{f}\left( y,\varvec{\overline{z}}^d\right) \right) \right] \right) \\&{\text {subject \, to }}\quad \mathbb {E}_P\left[ \varvec{\eta }^{\intercal }\left( \varvec{f}\left( y^d,\varvec{\overline{z}}^d\right) -\varvec{f}\left( y,\varvec{\overline{z}}^d\right) \right) \right] \ge l^d(y) - \xi ^d, \xi ^d \ge 0, \forall d, \forall y, \end{aligned} \end{aligned}$$
(11)
Fig. 2
figure 2

Graphical representation of our proposed document classification model

4.2 Posterior inference

We use collapsed Gibbs sampling for posterior inference, taking the word order structure in the document into account. The collapsed Gibbs sampler collapses out the nuisance parameters and speeds up the posterior inference (Shafiei and Milios 2006). Eq. (11) can be solved in two steps in an alternating manner. The first step is to estimate \(P(\varvec{\eta })\) given \(P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi })\). In the second step, we estimate \(P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi })\) given \(P(\varvec{\eta })\). We can estimate \(P(\varvec{\eta })\) using the algorithm described in Jiang et al. (2012), which makes use of Lagrange multipliers; however, our topic modeling component is different and thus the distribution \(P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi })\) needs to be estimated differently. We define \(\varvec{\kappa }\) as follows:

$$ \varvec{\kappa }=\sum _{d=1}^{D} \sum _{y^d} \lambda ^d_{y^d} \varvec{\Delta } \varvec{f}\left( y^d,\mathbb {E}\left[ \varvec{\overline{z}}^d\right] \right), $$
(12)

where \(\varvec{\kappa }\) is the mean of the classifier parameters \(\varvec{\eta }\). When we attach a \(*\) to \(\kappa \), it denotes the optimum solution. We describe an outline of the estimation of topical bigrams below.

First, we can factorize the topic model component and the maximum margin parameter component as follows:

$$ P\left( \varvec{\eta },\varvec{\varTheta },\varvec{\varPhi },\varvec{Z}\right) = P\left( \varvec{\eta }\right) P\left( \varvec{\varTheta },\varvec{\varPhi },\varvec{Z}\right) $$
(13)

Let \(\varvec{\Delta }\varvec{f}\left( y^d,\varvec{\overline{z}}^{d}\right) \) be defined as follows:

$$ \varvec{\Delta }\varvec{f}\left( y^d,\varvec{\overline{z}}^{d}\right) =\varvec{f}\left( y^d,\varvec{\overline{z}}^d\right) -\varvec{f}\left( y, \varvec{\overline{z}}^d\right) $$
(14)

Based on Eq. (13), the formulation for the optimum solution is given as follows:

$$\begin{aligned} P\left( {\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }}\right) \propto P\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi },\varvec{W}\right) {\text {e}}^{\varvec{\kappa ^{\left( *\right) }}^{\intercal } \sum _{d=1}^{D}\sum _{y^d} \left( \lambda ^d_{y^d}\right) ^{*}\varvec{\Delta }\varvec{f}\left( y^d,\varvec{\overline{z}}^{d}\right) } \end{aligned}$$
(15)

where \(\lambda _{y^d}^{d}\) is the Lagrange multiplier. The problem now is to efficiently draw samples from \(P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi })\) and also compute the expectation statistics of the maximum margin classifier used in our model. In order to simplify the integrals, we can take advantage of conjugate priors. We can integrate out the intermediate variables \(\varvec{\varTheta },\varvec{\varPhi }\) and build a Markov chain whose equilibrium distribution is the resulting marginal distribution \(P(\varvec{Z})\).

Let \(Z\) be a normalization constant. We get the following marginalized posterior distribution for our model after integrating out \(\varvec{\varTheta },\varvec{\varPhi }\):

$$\begin{aligned} P\left( \varvec{Z}\right) =\frac{P\left( \varvec{W},\varvec{Z}|\varvec{\alpha },\varvec{\beta }\right) }{Z} {\text {e}}^{\varvec{\kappa ^{\left( *\right) }}^{\intercal } \sum _{d=1}^{D} \sum _{y} \left( \lambda ^d_y\right) ^{*}\varvec{\Delta }\varvec{f}\left( y,\varvec{\overline{z}}^{d}\right) } \end{aligned}$$
(16)

The original BTM proposed in Wallach (2006) used an EM algorithm for approximate inference, but we use a collapsed Gibbs sampler. Therefore, in order to solve the first component on the right hand side of the above equation, collapsed Gibbs sampling for the model has to be implemented. The second component can be solved using any existing SVM implementation with some modifications based on the formulations used in our model.

Let \(m_{zwv}\) be the number of times the word \(w\) is generated by the topic \(z\) when preceded by the word \(v\). \(q_{dz}\) is the number of times a word is assigned to the topic \(z\) in the document \(d\). The element \(\kappa _{y^dk}\) represents the contribution of the topic \(k\) in classifying a data point to the class \(y^d\). The transition probability along with the maximum margin constraint can be expressed as:

$$\begin{aligned} \begin{aligned} P\left( z_n^d|\varvec{W},\varvec{Z}_{\lnot n},\alpha ,\beta \right) =&\left( \frac{\alpha _{z_n^d}+q_{dz_n^d}-1}{\sum _{z=1}^{K} \left( \alpha _z+q_{dz}\right) -1} \times {\text {e}}^{\frac{1}{N^d} \sum _{y} \left( \lambda ^d_y\right) ^{*}\left( \varvec{\kappa }^{*}_{{y_d}k}-\varvec{\kappa }^{*}_{yk}\right) } \right) \\&\times \frac{\beta _{w_n^d}+m_{z_n^d w_{n}^dw_{n-1}^d}-1}{\sum _{v=1}^{V}\left( \beta _v+m_{z_n^d w_{n}^dv}\right) -1} \end{aligned} \end{aligned}$$
(17)

Note that all the counts used above exclude the current case, i.e., the word being visited during sampling. When we use a \(\lnot \) sign in the subscript of a variable, it means that the case corresponding to the subscripted index is removed from the calculation of the count. In the above equation, the \(-1\) terms arise from the chain rule expansion of the Gamma function. The posterior estimates of the model can be written as:

$$\begin{aligned} \begin{aligned} P\left( z_n^d|\varvec{W},\varvec{Z}_{\lnot n},\alpha ,\beta \right) =&\left( \frac{\alpha _{z_n^d}+q_{dz_n^d}}{\sum _{z=1}^{K} \left( \alpha _z+q_{dz}\right) } \times {\text {e}}^{\frac{1}{N^d} \sum _{y} \left( \lambda ^d_y\right) ^{*} \left( \varvec{\kappa }^{*}_{{y_d}k}-\varvec{\kappa }^{*}_{yk}\right) } \right) \\&\times \frac{\beta _{w_n^d}+m_{z_n^d w_{n}^dw_{n-1}^d}}{\sum _{v=1}^{V} \left( \beta _v+m_{z_n^d w_{n}^dv}\right) } \end{aligned} \end{aligned}$$
(18)
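The sampling step in Eq. (18) can be sketched as follows. The counts `q` and `m` are assumed to already exclude the word being resampled, `lam` holds the optimal Lagrange multipliers \((\lambda ^d_y)^*\), and `kappa` holds the classifier means \(\varvec{\kappa }^*\) arranged per class and topic; all names and the data layout are our own, and the snippet mirrors the equation rather than the authors' implementation.

```python
import numpy as np

def topic_weights(d, w, w_prev, q, m, alpha, beta, lam, kappa, y_d, n_classes, N_d):
    """Normalized probabilities of Eq. (18) for every topic z at one word position.
    q[d, z]   : number of words in document d assigned to topic z
    m[z, w, v]: number of times word w is generated by topic z when preceded by v
    """
    K = q.shape[1]
    weights = np.empty(K)
    for z in range(K):
        doc_term = (alpha[z] + q[d, z]) / (alpha.sum() + q[d].sum())
        word_term = (beta[w] + m[z, w, w_prev]) / (beta.sum() + m[z, w].sum())
        # Max-margin term: contribution of topic z to separating class y_d from the others
        margin = sum(lam[d, y] * (kappa[y_d, z] - kappa[y, z]) for y in range(n_classes))
        weights[z] = doc_term * np.exp(margin / N_d) * word_term
    return weights / weights.sum()

# Usage: probs = topic_weights(d, w, w_prev, q, m, alpha, beta, lam, kappa, y_d, M, N_d)
#        z_new = rng.choice(len(probs), p=probs)
```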

4.3 Prediction for unseen documents

Our prediction framework for unseen documents follows a strategy similar to that used in many other topic model works (Jiang et al. 2012; Yao et al. 2009). Let the unseen document be denoted as \(d^{\text {new}}\). We consider the notion of word order. The input for the prediction task is unlabeled test data, and the output is the predicted label for the new document \(d^{\text {new}}\). We compute the point estimate of the topics in the matrix \(\varvec{\varPhi }\) from the training data; this matrix is used in the prediction task. When the unseen document is given to the model, we need to determine the latent dimensions \(\varvec{z}^{d^{\text {new}}}\) for this unseen document. This is computed using the MAP estimate of \(\varvec{\varPhi }\), denoted \(\hat{\varvec{\varPhi }}\). Specifically, we compute \(z_n^{d^{\text {new}}}\) in each new document \(d^{\text {new}}\) as follows:

$$ P\left( z_n^{d^{\text {new}}}|\varvec{z}_{\lnot n}^{d^{\text {new}}}\right) \propto \hat{\phi }_{\left( z_n^{d^{\text {new}}},w_n^{d^{\text {new}}},w_{n-1}^{d^{\text {new}}}\right) } \left( \alpha _{z_n^{d^{\text {new}}}}+q_{dz_n^{d^{\text {new}}}}\right) $$
(19)

Expectation statistics computation can be derived in a similar manner as the classifier described in Jiang et al. (2012).
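A sketch of the resampling step in Eq. (19) is given below. `phi_hat` is the point estimate \(\hat{\varvec{\varPhi }}\) obtained from training (indexed here as topic, word, previous word), `q_new[z]` counts the current topic assignments in the unseen document excluding the word being resampled, and `alpha` is the document-topic hyperparameter; the names and the exact data layout are our own assumptions.

```python
import numpy as np

def resample_unseen_topic(w, w_prev, phi_hat, alpha, q_new, rng):
    """Draw a new topic for one word of an unseen document according to Eq. (19)."""
    K = phi_hat.shape[0]
    weights = np.array([phi_hat[z, w, w_prev] * (alpha[z] + q_new[z]) for z in range(K)])
    return rng.choice(K, p=weights / weights.sum())
```

After a few sweeps of this step, the resulting topic assignments yield \(\overline{\varvec{z}}^{d^{\text {new}}}\), which is then used in the prediction rule of Eq. (10).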

5 Document classification experiments

5.1 Experimental setup

We conduct extensive experiments on document classification using some benchmark test collections and compare with many related methods. In addition, we present some high quality topical words showing how our model generates interpretable topics. In all our experiments for topic models, we run the sampler for 1000 iterations. We have also removed stopwords and performed stemming using Porter's stemmer. Text pre-processing and vector space generation were done using the Gensim package. Fivefold cross validation is used as in Zhu et al. (2012a). In each fold, the macro-average across the classes is computed. Each model is run five times. We take the average of the results obtained over all the runs and all the folds.

We use four datasets, namely, the 20 Newsgroups dataset, the OHSUMED-23 dataset, the TechTC-300 Test Collection for Text Categorization, and the Reuters-21578 text categorization collection. For OHSUMED-23, as adopted in Joachims (1998), we used the first 20,000 documents. We present the details about the datasets in Table 1. In the table, the first column presents the names of the datasets. The second column gives the total number of classes in the dataset. The third column presents the total number of documents in the entire dataset. The fourth column shows the average number of documents in each class. The fifth column presents the average length of the documents in the entire dataset. One can see that we have used both small and large document collections.

Table 1 Details about different datasets used in the document classification experiments

In all our datasets, we used a validation set for determining the number of topics. The validation set consisted of approximately 20 % of the documents, the training set comprised approximately 60 % of the documents, and the test set consisted of approximately 20 % of the documents. We use Precision, Recall and F-measure to measure the classification performance; the definitions of these metrics for the classification task can be found in Jameel and Lam (2013b). We solve the multiclass classification problem by decomposing it into binary classification problems, one per class. However, this procedure also introduces the problem of unbalanced data, as stated in Nallapati (2004). We therefore adopted under-sampling, in which the majority class is down-sampled so that both classes have an equal number of samples (Nallapati 2004); empirical evidence suggests that such a method generally produces better results, as pointed out by Zhang and Mani (2003). We used the training set to train the model and varied the number of topics from 10 to 100 in steps of 10 as in Jameel and Lam (2013b). The trained model was then validated on the validation set. We performed this procedure in each fold and computed the average F-measure. The number of topics which produced the best F-measure is the output of the validation process. We then used the test set to evaluate the models with the number of topics obtained from the validation process. We set the loss function \(l^d(y)\) to a constant value of 16, just as in Jiang et al. (2012). For simplicity, we assume a symmetric bigram Dirichlet prior and set the value of \(\beta \) to 0.01. The settings for the other hyperparameters remain the same as in Jiang et al. (2012) for fair comparison. As in Wang and McCallum (2006), we also found little variation in results with different hyperparameter values. Hyperparameter values of the other topic models (supervised and unsupervised) are the same as used in their respective works and their publicly shared implementations, which ensures that we use the best configurations for each of the models. In Jiang et al. (2012), the authors conduct extensive experimentation to find the best \(C\) value; we use the same \(C\) value for fair comparison and found that different values of \(C\) did not have much effect on the results.
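The under-sampling step described above can be sketched as follows (our own illustration of the procedure, not the authors' code):

```python
import random

def undersample(pos_docs, neg_docs, seed=0):
    """Balance a one-vs-rest binary split by randomly down-sampling the larger class."""
    random.seed(seed)
    n = min(len(pos_docs), len(neg_docs))
    return random.sample(pos_docs, n), random.sample(neg_docs, n)
```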

We chose a wide range of comparative methods as follows: (1) Gibbs MedLDA (Zhu et al. 2013a), denoted gMedLDA; (2) variational MedLDA (Zhu et al. 2009), denoted vMedLDA; (3) supervised LDA (sLDA) (Blei and McAuliffe 2008); (4) Discriminative LDA (DiscLDA) (Lacoste-Julien et al. 2008); (5) LDA (Blei et al. 2003); (6) LDA+SVM, used in the same way as described in Zhu et al. (2012a); (7) the Bigram Topic Model (BTM) (Wallach 2006); (8) BTM+SVM, following the same procedure as for LDA+SVM; (9) the LDA-Collocation model (LDACOL) (Griffiths et al. 2007); (10) LDACOL+SVM; (11) the Topical N-gram model (TNG) (Wang et al. 2007); (12) TNG+SVM; (13) the recently proposed model NTSeg (Jameel and Lam 2013b); (14) NTSeg+SVM; (15) SVM (Joachims 1998). The features for the linear SVM are the same as those in Zhu et al. (2013a).

5.2 Quantitative results

We present our main classification results in Tables 2, 3, 4 and 5. We observe that our model has outperformed all the comparative methods. In all datasets, our F-measure results are statistically significant based on the sign test with a \(p\) value \(<\)0.05 against each of the comparative methods. Maintaining the word order and considering extra side-information helps improve classification results to a great extent. Since we capture the inherent word order semantics in the document, just like other structured unsupervised topic models, we obtain improvements over the comparative methods.

Table 2 Table depicting precision, recall and F-measure values for different models in the 20 Newsgroups dataset
Table 3 Table depicting precision, recall and F-measure values for different models in the OHSUMED-23 dataset
Table 4 Table depicting precision, recall and F-measure values for different models in the TechTC300 dataset
Table 5 Table depicting precision, recall and F-measure values for different models in the Reuters dataset

In Table 6 we present the numbers of topics obtained during the validation process. These numbers of topics were subsequently used on the test set to compute the final results depicted in Tables 2, 3, 4 and 5.

Table 6 Table depicting the number of latent topics \(K\) obtained using the validation process, which was used in the test set for different models in different datasets

In Tables 7, 8, 9, and 10, we study the effect of the number of topics on document classification performance, as measured by F-measure, when we vary the number of topics from 10 to 100 for the topic models in different datasets. Starting from \(K=10\) in the 20 Newsgroups dataset, we see that our model does not perform very well in the beginning; nevertheless, it still outperforms the other topic models. Our model performs very well for \(K \ge 70\). Similarly, in the OHSUMED-23 dataset, our model does not perform well for \(K \le 60\), although it still outperforms the other topic models, and then gains good improvement as we increase the number of latent topics. The performance of the unsupervised n-gram topic models also cannot be disregarded. One observation is that the recently proposed unsupervised n-gram topic model NTSeg has done well compared to the other unsupervised topic models in the 20 Newsgroups dataset; a similar pattern is observed in the OHSUMED-23 dataset. In TechTC-300, all the models show poor performance, which indicates that the dataset has examples which the topic models find difficult to classify. In Reuters too, our model shows good performance as the number of latent topics is varied from 10 to 100. This suggests that considering the word order can offer some contribution to document classification performance. Our model can outperform the other comparative methods because it inherits the advantages of both n-gram unsupervised topic models and supervised topic models. Note that, as exemplified in Jameel and Lam (2013b) and many other works which follow word order, the computational complexity of models that follow word order is generally higher than that of their bag-of-words counterparts. Nevertheless, models incorporating word order structure have shown superior performance to bag-of-words models (Jameel and Lam 2013b). Several attempts have been made recently to speed up the inference procedures for both supervised and unsupervised topic models, such as Zhu et al. (2013b, c) and Porteous et al. (2008).

Table 7 The effect of the number of topics on document classification measured by F-measure in the 20 Newsgroups dataset
Table 8 The effect of the number of topics on document classification measured by F-measure in the OHSUMED-23 dataset
Table 9 The effect of the number of topics on document classification measured by F-measure in the TechTC-300 dataset
Table 10 The effect of the number of topics on document classification measured by F-measure in the Reuters-21578 dataset

5.3 Examples of topical words

We present some high probability topical words in topics and compare our model with some related n-gram and supervised topic models, including BTM (Wallach 2006), LDACOL (Griffiths et al. 2007), TNG (Wang et al. 2007), PDLDA (Lindsey et al. 2012), NTSeg (Jameel and Lam 2013b), and MedLDA (Zhu et al. 2012a). We present the top five most representative words from a topic describing a semantically similar theme from each model. We chose documents from the comp.graphics class to present the list of topical words in this experiment, as adopted in Zhu et al. (2012a).

The objective of presenting a list of topical words for comparison is to show whether the words in each topic give some insight about the topic. Obviously, ambiguous words will not convey the topic to a reader, and we can then infer that the topic model is unable to generate interpretable latent topics. Note that many works related to topic models present some top-k words from selected topics, but this analysis cannot be regarded as a strong indication of the superiority of one topic model over another. This is why quantitative analysis, which we have already presented and in which our model performed better than the comparative methods, is so important.

From the results shown in Tables 11 and 12, we can make two observations. First, our model generates more fine-grained topical words compared to the other topic models. Second, our model generates more interpretable latent topics compared to the other models. Words such as “video memory”, “simple routing”, and “package zip” appear to make sense to a reader. For example, “package zip” is a bigram which might describe zipping the contents of a file. Overall, most of the bigrams in the topic generated by our model suggest that our model has generated words related to the domain “computer graphics”. The other models instead generate ambiguous n-grams or unigrams which do not offer much understanding to the user; for instance, the bigrams generated by the BTM model do not suggest that the topic describes “computer graphics”, as words such as “compgraph path”, “xref compgraph”, etc. are not very insightful to a reader.

Table 11 Top five probable words from a topic from comp.graphics class of 20 Newsgroups dataset
Table 12 Top five probable words from a topic from comp.graphics class of 20 Newsgroups dataset

6 Topic model for document retrieval learning

6.1 Model description

We also investigate a supervised low-dimensional latent topic model for document retrieval learning. Suppose that relevance assessments of documents for some queries are available for training. Our goal is to learn a model that can predict the relevance of an unseen test query-document pair, and rank the documents based on the predicted relevance score. This problem setting is similar to the pointwise learning-to-rank problem. Manual relevance assessments are modeled as a response variable in our topic model. In addition, the word order structure of the text content is also considered. The main motivation for considering the word order is to capture the inherent semantics of the document, which is lost when the order of words in the document is broken. Similar to our proposed document classification model, there are two main components in our document retrieval learning model. One component is a topic model which measures the goodness of fit of the text content of documents and queries. Queries are modeled as short documents, in a similar manner as in Wu and Zhong (2013) and Salton et al. (1975). Our topic model considers the word order structure in documents and queries. The second component predicts the relevance labels within a maximum margin framework, which makes our retrieval learning model pointwise. The dataset can be represented as (\((d,q),y_{(d,q)}\)), composed of query-document pairs \((d,q)\) along with the relevance assessment label \(y_{(d,q)}\), which signifies the relevance of the document \(d\) to the query \(q\). Let \(c(d,q)\) be the total number of query-document pairs in the training set, \(D\) the number of documents in the training set, and \(Q\) the number of queries in the training set. As adopted in Nallapati (2004), the confidence scores obtained from the discriminant function are used to rank documents in our proposed model. Let the words in the document \(d\) be represented by \(\varvec{w}^d\) and the words in the query \(q\) by \(\varvec{w}^q\). Let the set of topics used in the document \(d\) be represented by \(\varvec{z}^d\), and the set of topics in the query \(q\) by \(\varvec{z}^q\).
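To make the data layout concrete, the following minimal sketch (in Python, with hypothetical field names) shows one way to represent the training instances described above; it is illustrative only and not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QueryDocPair:
    """One training instance: a query-document pair with its relevance label."""
    query_words: List[str]   # w^q, word sequence of the query (order preserved)
    doc_words: List[str]     # w^d, word sequence of the document (order preserved)
    relevance: int           # y_(d,q), graded relevance judgment

# A training set is simply a list of such pairs; c(d,q) is its size.
training_set: List[QueryDocPair] = [
    QueryDocPair(["heart", "attack", "treatment"],
                 ["patients", "with", "acute", "myocardial", "infarction"],
                 relevance=2),
]
```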

There are several fundamental differences between our document retrieval learning framework and previously proposed supervised topic models. In our model, each input data instance consists of a pair of a document and a query instead of a single document. In contrast to other supervised topic models such as Jiang et al. (2012) and Zhu et al. (2009, 2012a), the property of the feature vector is also different: in our retrieval learning model, the feature vector includes different query-dependent and query-independent features which are useful for conducting the learning-to-rank task.

We first describe a new discriminant function suited for handling the document retrieval learning problem. The discriminant function of our model is designed as follows:

$$ F(y,\varvec{\eta },(d,q))= \varvec{\eta }^{\intercal } \varvec{f}( y,(d,q)) $$
(20)

where \(\varvec{\eta }\) represents the model parameters, which are essentially feature weights, and \(\varvec{f}( y,(d,q))\) is a vector of features designed to be useful for retrieval. The new definitions of \(\varvec{\eta }\) and \(\varvec{f}( y,(d,q))\) make the model suitable for handling the document retrieval task. Some examples of features are depicted in Table 13. Note that, just as in the LETOR learning-to-rank datasets (Qin et al. 2010), these features are computed for the entire dataset \({\mathbb {D}}\) before generating the training, test and validation sets. \(c(w_n^d,d)\) is the number of times the word \(w_n^d\) appears in the document \(d\), \(N^q\) is the number of words in the query \(q\), \(|.|\) denotes the size function, and idf is the inverse document frequency. The first six features have also been used in Nallapati (2004), where readers can find the motivation behind their design. Some minor refinements to some of these six features were made in Xu and Li (2007) and Qin et al. (2010), and we use these refined features in our experimental setup. The last feature, called the topic similarity feature, is a similarity measure between the topics of the query and the document in the low-dimensional topic space generated by our model. Let \(\varvec{Z}^d=\left\{ \varvec{z}^d\right\} _{d=1}^{D}\) be the topic assignments to all the words of the training documents; \(\varvec{Z}^q=\left\{ \varvec{z}^q\right\} _{q=1}^{Q}\) be the topic assignments to all the words in the training queries; \(\varvec{\varTheta }^d=\left\{ \varvec{\theta }^d\right\} _{d=1}^{D}\) be the topic distributions for all training documents; \(\varvec{\varTheta }^q=\left\{ \varvec{\theta }^q\right\} _{q=1}^{Q}\) be the topic distributions for all training queries; and \(\varvec{\varPhi }=\left\{ \varvec{\phi }_{kv}\right\} _{v,k=1}^{V,K}\) be the word-topic distribution. In order to compute the topic similarity in the low-dimensional topic space between the document and the query, we make use of the topic-document and topic-query distributions \(\varvec{\varTheta }^d\) and \(\varvec{\varTheta }^q\). Under these distributions, each document or query is represented as a \(K \times 1\) low-dimensional vector in the latent topic space, whose entries are essentially \(P(z|d)\) or \(P(z|q)\), where \(d\) is a document and \(q\) is a query. Each value in this vector can be considered as a weight for the corresponding latent topic (Hazen 2010), or simply the contribution of a topic to a document. Consider a document \(d\) associated with a query \(q\); each is thus represented by its own low-dimensional latent topic vector. Let the \(K \times 1\) latent topic vector of the document \(d\) be denoted by \(v^d\) and that of the query \(q\) by \(v^q\). We compute the cosine similarityFootnote 10 between these two vectors. The intuitive idea is that if the two vectors are close to each other in the latent topic space, i.e. if they are semantically related even though they do not share the same words, they tend to have a high cosine similarity value in the latent topic space. In fact, works such as Liu et al. (2009) and Maas et al. (2011) have also used cosine similarity between words and documents in the latent topic space. Other similarity metrics such as KL-divergence could also be used.

Table 13 Features used in our discriminant function in our document retrieval learning model
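As a concrete illustration of the topic similarity feature, the following sketch (Python, hypothetical function and variable names) computes the cosine similarity between the \(K\)-dimensional topic proportions of a document and a query; it is a minimal sketch under the assumption that \(\varvec{\theta }^d\) and \(\varvec{\theta }^q\) have already been estimated.

```python
import numpy as np

def topic_similarity(theta_d: np.ndarray, theta_q: np.ndarray) -> float:
    """Cosine similarity between document and query topic proportions.

    theta_d, theta_q: K-dimensional vectors holding P(z|d) and P(z|q).
    """
    denom = np.linalg.norm(theta_d) * np.linalg.norm(theta_q)
    if denom == 0.0:
        return 0.0
    return float(np.dot(theta_d, theta_q) / denom)

# Example: a document and a query concentrated on the same topics score high.
theta_d = np.array([0.7, 0.2, 0.1])
theta_q = np.array([0.6, 0.3, 0.1])
print(topic_similarity(theta_d, theta_q))  # close to 1.0
```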

Unlike the classification model, where we took the expectation, the effective discriminant function is obtained directly from Eq. (20) as follows:

$$ F(y,(d,q))= F(y,\varvec{\eta },(d,q)) $$
(21)

The prediction rule, in which our objective is to find the label with the highest discriminant score, is given in Eq. (22):

$$ \hat{y}=\underset{y}{{\text {argmax }}} \quad F(y,(d,q)) $$
(22)
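A minimal sketch of the prediction rule in Eq. (22) (Python, hypothetical names): given the learned weights and the label-dependent feature map, the predicted relevance label is simply the one with the highest discriminant score.

```python
import numpy as np

def predict_label(eta, feature_fn, pair, labels):
    """Return argmax_y eta^T f(y, (d, q)) over the candidate relevance labels.

    eta: learned weight vector; feature_fn(y, pair): feature vector f(y, (d, q)).
    """
    scores = {y: float(np.dot(eta, feature_fn(y, pair))) for y in labels}
    return max(scores, key=scores.get)
```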

The following maximum margin constraints are imposed:

$$ F(y_{(d,q)},(d,q))-F(y,(d,q)) \ge l_{(d,q)}(y)- \xi _{(d,q)}, \forall y \in Y, \forall (d,q) $$
(23)

where \(l_{(d,q)}(y)\) is a non-negative loss function, \(\xi _{(d,q)}\) are non-negative slack variables which account for inseparable data instances, and \(C\) is a positive regularization constant. The soft-margin framework for our model is described below:

$$\begin{aligned} \begin{aligned}&\underset{P\left( \varvec{\varTheta }^d,\varvec{\varTheta }^q,\varvec{Z}^d,\varvec{Z}^q,\varvec{\varPhi }\right) \in {\mathbb {P}},\xi ,\varvec{\eta }}{{\text {minimize }}} {\text {KL }}\left[ P\left( \varvec{\varTheta }^d,\varvec{\varTheta }^q,\varvec{Z}^d,\varvec{Z}^q,\varvec{\varPhi }\right) ||P_0\left( \varvec{\varTheta }^d,\varvec{\varTheta }^q,\varvec{Z}^d,\varvec{Z}^q,\varvec{\varPhi }\right) \right] \\&\quad -\mathbb {E}_P\left[ \log P\left( \varvec{W}^d,\varvec{W}^q|\varvec{\varTheta }^d,\varvec{\varTheta }^q,\varvec{Z}^d,\varvec{Z}^q,\varvec{\varPhi }\right) \right] + \frac{C}{c(d,q)} \sum _{(d,q)} \xi _{(d,q)} \\&{\text {subject\, to }}\quad \left[ \varvec{\eta }^{\intercal } \left( \varvec{f}(y_{(d,q)},(d,q))-\varvec{f}(y,(d,q))\right) \right] \ge l_{(d,q)}(y) - \xi _{(d,q)}, \xi _{(d,q)} \ge 0, \forall (d,q),\forall y \end{aligned} \end{aligned}$$
(24)

6.2 Posterior inference

In order to proceed with the derivation of the collapsed Gibbs sampling, we need to define a joint distribution for the words and the topics along with the regularization effects due to the maximum margin posterior constraints. In this model too, we need to alternately find the optimal solution of the maximum margin classifier and solve the topic model component. But unlike the posterior inference of the classification model, we can directly adopt the implementation of an existing SVM algorithm to find the optimal solution of the classifier. Let \(\varvec{\eta }^{(*)}\) denote the optimal parameter weights. The joint distribution is written as:

$$\begin{aligned} \begin{aligned} P\left( \varvec{Z}^d,\varvec{W}^d,\varvec{Z}^q,\varvec{W}^q|\varvec{\alpha },\varvec{\beta }\right) =&P\left( \varvec{W}^d|\varvec{Z}^d,\varvec{\beta }\right) \times P\left( \varvec{W}^q|\varvec{Z}^q,\varvec{\beta }\right) \times P\left( \varvec{Z}^d|\varvec{\alpha }\right) \times P\left( \varvec{Z}^q|\varvec{\alpha }\right) \\&\times \, {\text {e}}^{\varvec{\eta ^{(*)}}^{\intercal } \sum _{(d,q)} \sum _{y=1}^{M} \left( \lambda ^y_{(d,q)}\right) ^{*}\left( \varvec{f}(y_{(d,q)},(d,q))-\varvec{f}(y,(d,q))\right) } \end{aligned} \end{aligned}$$
(25)

After some algebraic manipulations, we arrive at the following update equation:

$$\begin{aligned} \begin{aligned} P\left( z_n^d,z_n^q|\mathbf {W}^d,{\mathbf {W}}^q,{\mathbf {Z}}_{\lnot n}^d,{\mathbf {Z}}_{\lnot n}^q,\alpha ,\beta \right) =&\left( \frac{\alpha _{z_n^d}+m_{z_n^d w_n^d}-1}{\sum _{z=1}^{K} \left( \alpha _z+m_{z}\right) -1} \times \frac{\alpha _{z_n^q}+m_{z_n^qw_n^q}-1}{\sum _{z=1}^{K} \left( \alpha _z+m_{z}\right) -1} \right. \\&\left. \times \, {\text {e}}^{\frac{1}{\left( N^d+N^q\right) } \varvec{\eta }^{(*)\intercal } \sum _{y=1}^{M} \left( \lambda ^y_{(d,q)}\right)^{*} \left( \varvec{f}\left( y_{(d,q)},(d,q)\right) -\varvec{f}(y,(d,q))\right) } \right) \\&\times \frac{\beta _{w_n^d}+m_{z_n^d w_n^d w_{n-1}^d}-1}{\sum _{v=1}^{V}\left( \beta _v+m_{z_n^d w_n^d v}\right) -1} \times \frac{\beta _{w_n^q}+m_{z_n^q w_n^q w_{n-1}^q}-1}{\sum _{v=1}^{V}\left( \beta _v+m_{z_n^q w_n^q v}\right) -1} \end{aligned} \end{aligned}$$
(26)

where \(m_{zwv}\) is the number of times the word \(w\) is generated by the topic \(z\) when preceded by the word \(v\); it applies to a document or a query when super-scripted by \(d\) or \(q\), respectively. \(m_{zw}\) is the number of times the word \(w\) has been sampled in the topic \(z\), and likewise applies to a document or a query when super-scripted by \(d\) or \(q\), respectively.
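The update in Eq. (26) can be hard to parse, so the following simplified Python sketch shows the shape of one collapsed Gibbs step for a single document word. It drops the query part, treats the max-margin exponential factor as a precomputed per-topic weight, and uses hypothetical count arrays, so it is illustrative rather than the authors' implementation.

```python
import numpy as np

def sample_topic_for_doc_word(w, w_prev, d,
                              n_dz,           # n_dz[d, z]: topic counts in document d
                              n_zwv,          # n_zwv[z, w, v]: word w generated by topic z after word v
                              alpha, beta, V,
                              margin_weight): # margin_weight[z]: exp(...) factor from the SVM constraints
    """Draw a topic for word w (preceded by w_prev) in document d.

    Counts are assumed to already exclude the current assignment of this word.
    """
    K = n_dz.shape[1]
    weights = np.empty(K)
    for z in range(K):
        doc_part = alpha + n_dz[d, z]                          # document-topic term
        bigram_part = (beta + n_zwv[z, w, w_prev]) / \
                      (V * beta + n_zwv[z, :, w_prev].sum())   # bigram word-topic term
        weights[z] = doc_part * bigram_part * margin_weight[z]
    weights /= weights.sum()
    return np.random.choice(K, p=weights)
```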

One can argue that asymmetric priors may work better, especially on short documents such as queries; many previous works on short documents have assumed asymmetric priors in their topic models, such as Yan et al. (2013) and Hasler et al. (2014). Our model is flexible enough to accommodate asymmetric priors, but in this paper we only test it with symmetric priors for simplicity. In Nallapati (2004), the author discussed some shortcomings of discriminative models for IR, in particular out-of-vocabulary words, and suggested a few ways of dealing with them. We follow those strategies in this paper.

6.3 Ranking unseen query-document pairs

The prediction task on test data using the prediction rule given in Eq. (22) can be realized as follows. Let (\(q^{\text {new}},d^{\text {new}}\)) be an unseen test query-document pair for which we need to predict the relevance label. The task is to compute the latent topic representations of \(q^{\text {new}}\) and \(d^{\text {new}}\) using the topic space learned from the training data. These latent components for the unseen query and document can be obtained from \(\hat{\varvec{\varPhi }}\), the maximum a posteriori estimate of \(P(\varvec{\varPhi })\) computed during training. Suppose there are \(J\) samples from a proposal distribution; \(\hat{\varvec{\varPhi }}\) is obtained from these samples using the following equation:

$$ \hat{\phi }_{zwv} \propto \frac{1}{J} \sum _{j=1}^{J} \left( \beta _{w_n^{d}}+m_{z_n^{d} w_n^d w_{n-1}^d}^{(j)}\right) \times \left( \beta _{w_n^{q}}+m_{z_n^{q} w_n^q w_{n-1}^q}^{(j)}\right) $$
(27)

where the counts are assigned in the jth sample. The latent components for the unseen document and the query can be computed as follows.

$$\begin{aligned} \begin{aligned}&P\left( z_n^{d^{\text {new}}},z_n^{q^{\text {new}}}|{\mathbf {W}}^{d^{\text {new}}},{\mathbf {W}}^{q^{\text {new}}},{\mathbf {Z}}_{\lnot n}^{d^{\text {new}}},{\mathbf {Z}}_{\lnot n}^{q^{\text {new}}},\alpha ,\beta \right) \propto \hat{\phi }_{z_n^{d^{\text {new}}} w_n^{d^{\text {new}}}w_{n-1}^{d^{\text {new}}}} \left( \alpha _{z_n^{d^{\text {new}}}}+m_{z_n^{d^{\text {new}}}}\right) \\&\quad \times \,\hat{\phi }_{z_n^{q^{\text {new}}} w_n^{q^{\text {new}}}w_{n-1}^{q^{\text {new}}}} \left( \alpha _{z_n^{q^{\text {new}}}}+m_{z_n^{q^{\text {new}}}}\right) \end{aligned} \end{aligned}$$
(28)

where the count for the word being sampled is excluded. We then compute the similarity between the query and the document in the latent topic space. Note that \(y_{(d,q)}\) can be dropped during the prediction step. The maximum margin prediction of labels for unseen vectors follows the standard maximum margin formulation (Yu and Kim 2012); note that this formalism is different from the expectation-based maximum margin classifier discussed previously for document classification. Once the similarity score is computed, it can be used in Eq. (20) to compute the prediction score, and documents can be ranked based on this confidence score.
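To make the ranking step concrete, the following minimal sketch (Python, hypothetical names) scores candidate documents for a test query with the learned weight vector \(\varvec{\eta }\) over the features of Table 13 (including the topic similarity feature) and sorts them by this confidence score; it is illustrative only.

```python
import numpy as np

def rank_documents(eta, feature_fn, query, candidate_docs):
    """Rank candidate documents for a query by the discriminant score eta^T f.

    eta: learned weight vector (one weight per retrieval feature).
    feature_fn(query, doc): returns the feature vector of Table 13 for the pair,
                            with the topic similarity feature as its last entry.
    """
    scores = [float(np.dot(eta, feature_fn(query, doc))) for doc in candidate_docs]
    order = np.argsort(scores)[::-1]          # highest confidence first
    return [(candidate_docs[i], scores[i]) for i in order]
```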

7 Retrieval learning experiments

7.1 Experimental setup

We conduct document retrieval learning experiments using benchmark text collections. We will show the performance of our model by conducting extensive quantitative analysis. In addition, we will also present some high probability topical words from topics, and show how our model is able to generate better topical words leading to more interpretable topics. In all our experiments, we run the Gibbs sampler of our model for 1000 iterations. We removed stopwords, and performed stemming using Porter’s stemmer.

We use four test collections for our experiments. The first is the benchmark OHSUMED test collection (latest versionFootnote 11) from the LETOR (Qin et al. 2010) dataset. This dataset consists of 45 comprehensive features along with query-document pairs and their relevance judgments, and it has been used extensively in evaluating several learning-to-rank algorithms. We obtained the raw documents and queries of this dataset from the webFootnote 12 in order to get the word order. The dataset contains the document-id along with the list of features, which helps us relate each set of features in LETOR OHSUMED to its document. Our proposed topic similarity feature is treated as one additional feature on top of the existing 45 features. In each of the five folds, approximately 60 % of the query-document pairs are in the training set, 20 % in the validation set, and the rest in the test set. For a particular fold, the queries involved in the training, validation, and test sets are different. The validation set is used by the comparative learning-to-rank models for parameter tuning and for determining the number of iterations. Our second collection is AQUAINT, used in TREC HARD.Footnote 13 Basic details about this dataset can be found in Allan (2005). Note that we only consider document-level relevance assessments in AQUAINT and leave out the passage-level judgments. The third dataset is WT2G,Footnote 14 along with the standard relevance judgments and topics (401-450) obtained from the TREC site. The fourth dataset is the Category B English documents from the ClueWeb09 collection, which we obtained from the authors of Asadi and Lin (2013). In order to create the training, test and validation sets for AQUAINT and WT2G, we adopted the strategies popularly used for learning-to-rank problems, choosing the same percentage of query-document pairs in the training, test and validation sets in each fold as in the LETOR OHSUMED dataset. The features used for the AQUAINT and WT2G datasets are given in Table 13. Note that only the number of features differs between the datasets that we generated (WT2G and AQUAINT) and LETOR OHSUMED; we list the number of features used in the document retrieval learning experiments in Table 14. Based on our proposed model, we also investigate a variant, called Variant 1, which we test empirically. In this variant we ignore the word order structure in queries but maintain it in documents, the reason being that queries are mostly short, so the role of word order might not be very significant. In addition, we also compare with another variant, named Variant 2, where word order is maintained in neither queries nor documents. We use NDCG@5 and NDCG@10 as our metrics, similar to the metrics used in Cai et al. (2011). NDCG is well suited for our task because it is defined with an explicit position discount factor and can leverage judgments in multiple ordered categories (Järvelin and Kekäläinen 2002).

Table 14 Number of features in each dataset used in document retrieval learning experiments
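For reference, a minimal sketch (Python) of how NDCG@k is typically computed from graded relevance judgments, following the position-discounted definition of Järvelin and Kekäläinen (2002); details such as the gain function vary between implementations, so treat this as an illustration rather than the exact evaluation script used here.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance grades."""
    rel = np.asarray(relevances[:k], dtype=float)
    gains = 2.0 ** rel - 1.0                        # common graded-gain choice
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(gains / discounts))

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the system ranking normalized by the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: relevance grades of documents in ranked order for one query.
print(ndcg_at_k([2, 0, 1, 0, 2], k=5))
```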

In order to determine the number of topics \(K\), the parameter \(C\), and the constant loss function \(l_{(d,q)}(y)\) in our model, we use the validation set. We first train our model on the training set and measure NDCG@5 and NDCG@10 performance on the validation set; the number of topics and the model parameters can thus be determined automatically from the validation process. We then evaluate our model on the test set. We varied the number of topics from 50 to 300 in steps of 10, the values of \(C\) in multiples of 10, and \(l_{(d,q)}(y)\) from 1 to 20 in steps of 1. We again set a weak \(\beta \) prior of 0.01 and use symmetric Dirichlet priors for our model. We found that varying the value of the hyperparameter does not drastically affect the results, which is consistent with Wang and McCallum (2006). We also found experimentally that different values of \(C\) do not significantly change the performance of the model. The experimental results are averaged over the five folds for all the models. Each model is run only once in each fold.
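A minimal sketch (Python, hypothetical train and evaluation functions) of the validation procedure just described: sweep the grid of \(K\), \(C\), and the constant loss, and keep the setting with the best validation NDCG. This is an assumption about how such a sweep is commonly organized, not the authors' code.

```python
from itertools import product

def select_parameters(train_fn, ndcg_fn, train_set, valid_set):
    """Grid search over K, C and the constant loss l, scored by validation NDCG@10."""
    topic_grid = range(50, 301, 10)   # number of topics K
    c_grid = [10, 100, 1000]          # C varied in multiples of 10 (illustrative values)
    loss_grid = range(1, 21)          # constant loss l_(d,q)(y)
    best_setting, best_score = None, -1.0
    for K, C, l in product(topic_grid, c_grid, loss_grid):
        model = train_fn(train_set, K=K, C=C, loss=l, beta=0.01)
        score = ndcg_fn(model, valid_set, k=10)
        if score > best_score:
            best_setting, best_score = (K, C, l), score
    return best_setting
```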

We compare the performance of our model with a range of comparative methods, including popular learning-to-rank models in RankLibFootnote 15 such as MART (Friedman 2001), RankNet (Burges et al. 2005), AdaRank (Xu and Li 2007), Coordinate Ascent (Metzler and Croft 2007), LambdaRank (Quoc and Le 2007), LambdaMART (Wu et al. 2010), ListNet (Cao et al. 2007b), and Random Forests (Breiman 2001), a popular pointwise learning-to-rank model. In addition, we used Ranking SVM (Joachims 2002)Footnote 16 and \({\texttt{SVM }}^{MAP}\) (Yue et al. 2007).Footnote 17 The first six features in Table 13 are also used in these comparative methods, as in Nallapati (2004), for learning (the first 45 features in the case of LETOR OHSUMED). Note that the seventh feature (the 46th in the case of LETOR OHSUMED) involves latent topic information which cannot be used in the comparative methods. In order to conduct the experiments for the comparative learning-to-rank models, we followed the standard learning-to-rank experimental procedure for each comparative method. Some models have standard published parameter values; for example, for LETOR OHSUMED, the values for Ranking SVM Footnote 18 and \({\texttt{SVM} }^{MAP}\) Footnote 19 are available online.

We present detailed parameter settings obtained from the validation dataset in each fold for our model in Tables 15, 16, 17, 18 and 19. In addition, we also present parameter settings for our Variant 1 and Variant 2 models in Tables 20, 21, 22, 23, 24, and Tables 25, 26, 27, 28, and 29, respectively.

Table 15 Values for different parameters obtained using the validation set for our model in Fold 1
Table 16 Values for different parameters obtained using the validation set for our model in Fold 2
Table 17 Values for different parameters obtained using the validation set for our model in Fold 3
Table 18 Values for different parameters obtained using the validation set for our model in Fold 4
Table 19 Values for different parameters obtained using the validation set for our model in Fold 5
Table 20 Values for different parameters obtained using the validation set for Variant 1 in Fold 1
Table 21 Values for different parameters obtained using the validation set for Variant 1 in Fold 2
Table 22 Values for different parameters obtained using the validation set for Variant 1 in Fold 3
Table 23 Values for different parameters obtained using the validation set for Variant 1 in Fold 4
Table 24 Values for different parameters obtained using the validation set for Variant 1 in Fold 5
Table 25 Values for different parameters obtained using the validation set for Variant 2 in Fold 1
Table 26 Values for different parameters obtained using the validation set for Variant 2 in Fold 2
Table 27 Values for different parameters obtained using the validation set for Variant 2 in Fold 3
Table 28 Values for different parameters obtained using the validation set for Variant 2 in Fold 4
Table 29 Values for different parameters obtained using the validation set for Variant 2 in Fold 5

Note that we do not choose any unsupervised topic model for comparison, primarily because such models cannot make use of relevance judgment information during training. They are therefore always at a disadvantage when compared with the learning-to-rank methods and with our model, which explicitly uses relevance labels during training. Also, supervised topic models such as sLDA cannot be directly used for comparison, as significant changes would be required for them to handle the document retrieval learning problem. In addition, the learning-to-rank models have already shown state-of-the-art results in this task, and thus they can be regarded as strong comparative methods. Our model does not directly use word proximity features in the learning setup (MacDonald et al. 2013). Instead, it uses word order to find the best model to fit the data, as it has been shown in the literature that topic models with word order improve model selection (Jameel and Lam 2013b; Kawamae 2014). Such proximity features have indeed helped improve learning-to-rank performance, but in this work our objective is to demonstrate the robustness of our model.

7.2 Quantitative results

We present the results obtained from all the test collections in Tables 30, 31, 32, and 33. From the results, we can see that our model outperforms all the comparative methods. The improvements are statistically significant according to the Wilcoxon signed rank test (with 95 % confidence) against each of the comparative methods on all the datasets, except NDCG@5 in the ClueWeb-2009 dataset where Variant 2 has also done better. Our results show that the latent topic information generated by our model, which is then used to compute query-document similarity, plays a significant role. Word order also plays a role, in that we are able to detect better topics than unigram models.

Table 30 NDCG@5 and NDCG@10 values for different models in the LETOR OHSUMED dataset
Table 31 NDCG@5 and NDCG@10 values for different models in the AQUAINT dataset
Table 32 NDCG@5 and NDCG@10 values for different models in the WT2G dataset
Table 33 NDCG@5 and NDCG@10 values for different models in the ClueWeb-2009 Category B English dataset

In the OHSUMED collection, we find that our main proposed model, in which word order is maintained in both queries and documents, performs better than the other models. Looking closely at the NDCG@5 results, our model performs considerably better, with statistically significant improvements over the comparative models. Variant 2 does not perform better than Variant 1 at NDCG@5, which brings out the importance of word order in the retrieval learning task. Models such as SVM-MAP and RankNet also perform well in this collection, mainly due to their mechanism of optimizing a different objective function. The Coordinate Ascent model also performs well, but does not outperform our main proposed model. At NDCG@10, we see improvement in the Variant 1 and Variant 2 models and the performance gap narrows, but they still do not outperform our model; the improvement of our model remains statistically significant. Other models such as Ranking SVM, Coordinate Ascent, RankNet, and SVM-MAP also perform well on this dataset. In the AQUAINT collection, we notice consistently superior performance of our model compared with the comparative models, with improvements that are statistically significant. We also find that the gap between our model and Variant 2, especially at NDCG@5, is reduced. Models such as SVM-MAP and RankNet also perform well on this dataset, and the difference between Variant 1 and Variant 2 is small here. We see some interesting results on the WT2G dataset: many models do well and are quite close in performance to our model, especially at NDCG@5, while at NDCG@10 our model consistently does better. In the ClueWeb-2009 dataset, however, Variant 2 matches the performance of our model, and even at NDCG@10 many models are close to ours. This suggests that spam and noisy pages have some impact on our model, and that maintaining word order may not be a good way to model collections containing noisy documents; the bag-of-words model can also do well in noisy collections.

We have seen from the results obtained in these experiments that considering the order of words in both queries and documents simultaneously helps improve the performance of document retrieval learning using topic models, and that relaxing the order of words in either queries or documents does not help improve the results. The good performance is primarily because our model is able to capture the semantic dependencies in text and matches words based on word proximity. We also found that noise has an impact on our model; therefore, in collections which are very noisy and contain many spam pages, the bag-of-words model can also be adopted.

One interesting facet to consider is the effect of the number of topics in the document retrieval learning experiment for our models. In order to study this effect, we varied the number of topics in the training set in each fold. We used the same set of parameters obtained in each fold in each dataset as shown earlier, except for the number of topics, which we specify manually in this set of experiments. After training the model on the training set, we used the test set directly to measure the effect of the number of topics, and we report results averaged over all five folds. In Table 34, we vary the number of topics from 50 to 290 in steps of 20 and present the results for our model. In the OHSUMED dataset we can see that, as we increase the number of topics, the results improve up to a certain number of topics and then begin to deteriorate as the number of topics keeps increasing. This gives us an insight into the dependence between the number of topics and the retrieval learning results for our models. However, we do not find any clearly noticeable pattern when the number of topics is varied; what we do observe is that the effect of varying the number of topics is not large, and most of the values are very close to each other in all datasets.

Table 34 NDCG@5 (denoted as N@5), and NDCG@10 (denoted as N@10) results obtained from our model when we vary the number of topics from 50 to 290

In addition, we present the results obtained from Variant 1 in Table 35 for the different datasets. We observe that the effect of the number of topics is not very noticeable in this model either. We make a similar observation for Variant 2 in Table 36.

Table 35 NDCG@5 (denoted as N@5), and NDCG@10 (denoted as N@10) results obtained from Variant 1 when we vary the number of topics from 50 to 290
Table 36 NDCG@5 (denoted as N@5), and NDCG@10 (denoted as N@10) results obtained from Variant 2 when we vary the number of topics from 50 to 290

It is quite interesting to see that our model outperforms some powerful learning-to-rank models. Our model performs consistently well with more features (in LETOR OHSUMED) and with fewer features (in WT2G and AQUAINT), which shows that the generalization ability of our proposed model is robust. The results suggest that incorporating topic similarity helps improve document retrieval performance. One reason why topic models help is that we compare the similarity between the document and the query based on latent factors rather than just the words (Wei and Croft 2006; Sordoni et al. 2013). Hence, this feature which our model computes is extremely important for the document retrieval learning task.

7.3 Investigation on topic enhancements for comparative models

In this section, we present results where we add the latent topic feature as one of the features, in addition to the existing list of features, in a two-stage approach. Our motivation is to study whether the latent topic feature obtained from either LDA or BTM can help improve the performance of the comparative models. The results of our model and its variants remain the same as in the previous experiment described in Sect. 7.2.

7.3.1 Employing LDA

In this set of experiments, for all the comparative methods, we manually append a latent topic similarity feature. The procedure is to first conduct latent topic modeling using the LDA model on the set of documents used in the learning-to-rank experiments. Then we use the existing method described in Wei and Croft (2006) to compute the query-document topic similarity. We obtain a score for each number of latent topics (\(K\)), which we vary from 10 to 100. Then we create the training, test and validation datasets based on the same split as used in the previous experiment. We use the validation set to tune the parameters of the comparative models, and we choose the number of topics \(K\) which gives the best NDCG@5 and NDCG@10 across all topic numbers on the validation set.
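A minimal sketch (Python, hypothetical variable names) of this two-stage enhancement: an LDA-based query-document score in the spirit of Wei and Croft (2006) is computed and appended as an extra feature column for the comparative learning-to-rank models. It illustrates the procedure only and is not the authors' exact code.

```python
import numpy as np

def lda_query_likelihood(query_word_ids, theta_d, phi):
    """P_LDA(q|d) = prod_i sum_z P(q_i|z) P(z|d), using LDA estimates.

    theta_d: K-dimensional topic proportions of document d.
    phi: K x V matrix with P(w|z).
    """
    score = 1.0
    for w in query_word_ids:
        score *= float(np.dot(theta_d, phi[:, w]))
    return score

def append_topic_feature(feature_matrix, topic_scores):
    """Append the topic similarity score as one more column of features."""
    return np.hstack([feature_matrix, np.asarray(topic_scores).reshape(-1, 1)])
```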

We present results for this set of experiments on different datasets in Tables 37, 38, 39 and 40. This topic enhanced setting is used in the comparative methods only.

Table 37 NDCG@5 and NDCG@10 values for different models in the LETOR OHSUMED dataset when the comparative models are enhanced with latent topic feature obtained from the LDA model
Table 38 NDCG@5 and NDCG@10 values for different models in the AQUAINT dataset when the comparative models are enhanced with latent topic feature obtained from the LDA model
Table 39 NDCG@5 and NDCG@10 values for different models in the WT2G dataset when the comparative models are enhanced with latent topic feature obtained from the LDA model
Table 40 NDCG@5 and NDCG@10 values for different models in the ClueWeb-2009 Category B English dataset when the comparative models are enhanced with latent topic feature obtained from the LDA model

Our results show that even when the externally computed latent topic feature is manually added, the comparative methods cannot outperform our proposed model. From the results in all datasets, we can conclude that in the majority of cases the results of the comparative methods improve when the latent topic similarity feature is added, but they still do not outperform our proposed document retrieval learning model. The reason lies in the inherent design of our model, which embeds the latent topic model within the maximum margin prediction. Even the closest learning-to-rank model, Ranking SVM, could not outperform our model.

The improvements that we obtain are statistically significant according to the Wilcoxon signed rank test (with 95 % confidence) against each of the comparative methods in all the datasets except NDCG@5 in the ClueWeb-2009 dataset. We can notice that the comparative methods have improved when the latent topic feature is added, and the performance gap between the comparative methods and our model has narrowed. In the LETOR OHSUMED dataset, the SVM-MAP and Coordinate Ascent models perform well. In the ClueWeb-2009 dataset, most of the models are able to narrow the performance gap, but our model still remains competitive.

Another interesting observation concerns the length of the query and the performance of our model. We have noticed that our model performs relatively better on longer queries than on shorter queries. The reason may be that word order conveys more information to our model for longer queries than for shorter ones.

7.3.2 Employing BTM

In this set of experiments, instead of using the LDA model, we use the BTM model, which considers word order. The procedure for adding latent topic information is similar to that described in Sect. 7.3.1, except that the retrieval formulation using the language modeling technique needs to be changed slightly in order to incorporate word order. We present the retrieval formulations below.

The query likelihood model scores each document \(d\) by calculating the likelihood of its model generating a query \(q\), written as \(P_{\text {LM}}(q|d)\). Under the bag-of-words assumption, we can write the following likelihood function:

$$ P_{\text {LM}}(q|d)=\prod _{i=1}^{N^q} P(q_i|d) $$
(29)

The above Eq. (29) is specified by a document model where we can consider Dirichlet smoothing (Zhai and Lafferty 2004). Therefore, Eq. (29) can be expressed as:

$$ P_{\text {LM}}(q|d)=\frac{N^d}{N^d+\mu } P_{\text {ML}}(q|d)+\left( 1-\frac{N^d}{N^d+\mu }\right) P_{\text {ML}}(q|{\mathbb {D}}) $$
(30)

where \(P_{\text {ML}}(q|d)\) is the maximum likelihood estimate for the query \(q\) generated from the document \(d\), and \(P_{\text {ML}}(q|{\mathbb {D}})\) is the maximum likelihood estimate for the query \(q\) generated from the entire collection \({\mathbb {D}}\). \(\mu =1000\) is the smoothing prior; this value has been adopted from the work of Zhai and Lafferty (2004).
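A minimal sketch (Python, hypothetical inputs) of the Dirichlet-smoothed query likelihood in Eq. (30), computed per query term and multiplied over the query; in practice log-probabilities would usually be summed instead, but the structure is the same.

```python
def smoothed_query_likelihood(query_terms, doc_tf, doc_len, coll_prob, mu=1000.0):
    """P_LM(q|d) with Dirichlet smoothing (Zhai and Lafferty 2004).

    doc_tf: dict term -> count in document d.
    doc_len: N^d, number of tokens in d.
    coll_prob: dict term -> maximum likelihood probability in the collection.
    """
    lam = doc_len / (doc_len + mu)
    score = 1.0
    for t in query_terms:
        p_ml_d = doc_tf.get(t, 0) / doc_len if doc_len > 0 else 0.0
        p_coll = coll_prob.get(t, 0.0)
        score *= lam * p_ml_d + (1.0 - lam) * p_coll
    return score
```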

In order to calculate the query likelihood for the BTM model using the language modeling framework, we need to sum over all the topic variables for each word. The posterior estimates can be used in the likelihood model. The query likelihood for the query \(q\) given the document \(d\) from BTM is written as \(P_{\text {BTM}}(q|d)\). Therefore, the likelihood function can be written as:

$$ P_{\text {BTM}}(q|d)=\prod _{i=1}^{N^q} P_{\text {BTM}}(q_i|q_{i-1},d) $$
(31)

where \(P_{\text {BTM}}(q_i|q_{i-1},d)\) can be expressed as:

$$ P_{\text {BTM}}(q_i|q_{i-1},d) = \sum _{k_i=1}^{K} P(q_i|\varvec{\varPhi }_{k_i},q_{i-1})P\left( k_i|\varvec{\theta }^d\right) $$
(32)

Similar to the framework described in Wei and Croft (2006), we can adopt the following:

$$ P(q|d)=\lambda P_{\text {LM}}(q|d)+(1-\lambda )P_{\text {BTM}}(q|d) $$
(33)

where \(\lambda \) is a weighting parameter. For consistency with the experiments performed using the LDA model in Sect. 7.3.1, we set the value of \(\lambda =0.7\).
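The following sketch (Python, hypothetical estimates) shows the interpolated score of Eq. (33), with the BTM term of Eqs. (31)-(32) summing over topics for each query bigram; it can reuse the smoothed language model sketch above and is illustrative only.

```python
import numpy as np

def btm_query_likelihood(query_word_ids, theta_d, phi_bigram):
    """P_BTM(q|d) = prod_i sum_k P(q_i | phi_k, q_{i-1}) P(k | theta^d).

    phi_bigram: K x V x V array with P(w | z, previous word v).
    The first query word is conditioned on a start symbol at index 0 here (an assumption).
    """
    score = 1.0
    prev = 0
    for w in query_word_ids:
        score *= float(np.dot(theta_d, phi_bigram[:, w, prev]))
        prev = w
    return score

def interpolated_score(p_lm, p_btm, lam=0.7):
    """Eq. (33): lambda * P_LM(q|d) + (1 - lambda) * P_BTM(q|d)."""
    return lam * p_lm + (1.0 - lam) * p_btm
```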

We present the results obtained by adding the topic information using BTM in Tables 41, 42, 43, and 44. In all our experiments, the improvement shown by our model is statistically significant according to Wilcoxon signed rank test (with 95 % confidence) against each of the comparative methods in all the datasets except NDCG@5 in ClueWeb-2009 dataset.

Table 41 NDCG@5 and NDCG@10 values for different models in the LETOR OHSUMED dataset when the comparative models are enhanced with latent topic feature obtained from the BTM model
Table 42 NDCG@5 and NDCG@10 values for different models in the AQUAINT dataset when the comparative models are enhanced with latent topic feature obtained from the BTM model
Table 43 NDCG@5 and NDCG@10 values for different models in the WT2G dataset when the comparative models are enhanced with latent topic feature obtained from the BTM model
Table 44 NDCG@5 and NDCG@10 values for different models in the ClueWeb-2009 Category B English dataset when the comparative models are enhanced with latent topic feature obtained from the BTM model

In the OHSUMED dataset, as depicted in Table 41, our model remains competitive compared with the other models. We achieve very good performance at NDCG@5, but the other models also do very well at NDCG@10. Compared to the results obtained using the LDA model, depicted in Table 37, the performance of the comparative models has indeed improved when word order is maintained in the topic model and that topic feature is used in the learning-to-rank models. Looking more closely, we notice that at NDCG@5 most of the comparative models show improved performance except LambdaMART, ListNet, and SVM-MAP; in fact, the performance of ListNet and LambdaMART has deteriorated to some extent, suggesting that latent topic information with word order did not give much help to these models. At NDCG@10, ListNet recovers from its poor performance, but SVM-MAP and LambdaMART do not. We also notice that at NDCG@10, in Table 41, the gap between our model and the comparative models has narrowed. In AQUAINT, as depicted in Table 42, our model performs better than the comparative models. At NDCG@5, the performance of three models, MART, Coordinate Ascent, and SVM-MAP, deteriorates compared to the LDA setting depicted in Table 38, although the change is not very significant. At NDCG@10 for AQUAINT, as depicted in Table 42, MART and SVM-MAP show an improvement over the LDA setting in Table 38, whereas the performance of LambdaRank deteriorates when latent topic information with word order is added. In WT2G, as depicted in Table 43, we notice a good improvement in the comparative models over the LDA setting in Table 39 at both NDCG@5 and NDCG@10, but their performance is still not as good as that of our model. LambdaRank, at NDCG@5, does not show an improvement when the latent topic feature from BTM is added to the list of features, and neither does RankNet. In the ClueWeb09 collection, as depicted in Table 44, many models in fact show lower NDCG@5 results, suggesting that spam and noisy text has some impact on the results. Models such as RankNet, AdaRank, and Coordinate Ascent deteriorate compared with the results listed in Table 40, while ListNet and SVM-MAP show no change in performance. At NDCG@10, RankBoost, Coordinate Ascent, and SVM-MAP show no performance improvement, and the performance of AdaRank in fact deteriorates.

In general, the above results reveal that incorporating latent topic information with word order into the comparative learning-to-rank methods does help improve performance. But since the approach is two-stage, the comparative models are not able to do better than our proposed model. We can conclude that word order has helped improve performance to some extent, but not consistently across all our results.

7.4 Topical words examples

We can see from Tables 45 and 46 that our model has generated words which appear more meaningful than those of the other models. From the list of top five words, it can be noted that our model describes “Egypt” and news related to the revolution during that time. We have only considered words from documents in order to present the results in this table. The AQUAINT collection does not have documents indexed into classes like those used in the classification experiments; therefore supervised topic models such as MedLDA might not generate interpretable words in topics, as they cannot use extra side-information while learning. For this comparison, we have therefore only considered unsupervised n-gram topic models. Our model uses the query-document relevance label (during learning) when generating words. We can see that words such as “president nasser” and “foreign minister” are more insightful than words such as “hk salem” and “today” generated by the NTSeg model. Much research has already been done on topic models with word order, where it has been shown empirically that n-gram models generate more interpretable latent topics than unigram models (Lindsey et al. 2012; Jameel and Lam 2013b, c; Wang et al. 2007; Griffiths et al. 2007). But what those n-gram models fail to consider is side-information, which can help generate even better latent topical representations. We have shown empirically that our model generates more meaningful latent topics than the comparative models.

Table 45 Top five probable words from a topic from AQUAINT collection
Table 46 Top five probable words from a topic from AQUAINT collection

8 Conclusions

We have presented supervised topic models which maintain word order in the document. We first proposed a bigram supervised topic model within a maximum margin framework, and compared its performance with comparative methods; the empirical analysis demonstrates that our model outperforms many of them. We then extended the supervised bigram topic model to handle the document retrieval learning task. This model takes query-document pairs as input, with the relevance assessments given manually by annotators as the response variables. The experimental analysis shows that our model outperforms many popular learning-to-rank models. By presenting a list of topical words in topics, we showed how our model generates better topical words than the comparative methods. The results clearly show that learning with side-information helps the model generate more interpretable topics with words that are insightful to a reader.