Poisson Dependency Networks: Gradient Boosted Models for Multivariate Count Data
Abstract
Although count data are increasingly ubiquitous, surprisingly little work has employed probabilistic graphical models for modeling count data. The univariate case has been well studied; in many situations, however, counts influence each other and should not be considered independently. Standard graphical models such as multinomial or Gaussian ones are also often ill-suited, since they disregard either the infinite range over the natural numbers or the potentially asymmetric shape of the distribution of count variables. Existing classes of Poisson graphical models can only model negative conditional dependencies, or neglect the prediction of counts, or do not scale well. To ease the modeling of multivariate count data, we therefore introduce a novel family of Poisson graphical models, called Poisson Dependency Networks (PDNs). A PDN consists of a set of local conditional Poisson distributions, each representing the probability of a single count variable given the others, that naturally facilitates a simple Gibbs sampling inference. In contrast to existing Poisson graphical models, PDNs are nonparametric and trained using functional gradient ascent, i.e., boosting. The particularly simple form of the Poisson distribution allows us to develop the first multiplicative boosting approach: starting from an initial constant value, alternatively a log-linear Poisson model, or a Poisson regression tree, a PDN is represented as products of regression models grown in a stagewise optimization. We demonstrate on several real-world datasets that PDNs can model positive and negative dependencies and scale well while often outperforming the state of the art, in particular when using multiplicative updates.
Keywords
Graphical models Dependency networks Poisson distribution Learning MAP inference
1 Introduction
The world contains an unimaginably vast amount of information, and much of the data consists of counts, i.e., observations that can take only nonnegative integer values. Without counting, we cannot know how many people are born and have died; how many men and women still live in poverty; how many children need education, and how many teachers to train or schools to build; the prevalence and incidence of diseases; how many customers arrive in a shop daily and whether demands for products are expanding; and how many games a team won in the football world championship. Our scientific and digital lives also thrive on counts. Example data are publication and citation counts, bag-of-X representations of, e.g., collections of images or text documents, genomic sequencing data, user-ratings data, spatial incidence data, climate studies, and site visits, among others. Behavioral data of users visiting web sites, for example, are tracked on a large scale, and visits or logins are counted and then used to enrich the user experience and to increase revenue. Or consider computational social science, a field that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviors. Here, for instance, the number of people living in a region and the number of people migrating from one city to another are of great interest to politicians and also influence decisions in education and research. Finally, counts also play an essential role in statistics. Consider for example an election analyst interested in the association of gender and voting intentions. Standing on a street corner for an hour recording data from anyone willing to talk to them, they build a contingency table, say, with the gender in the rows and the voting intentions in the columns. 
In contrast to interviewing a fixed number of n individuals, there are no constraints on the row and column totals, and hence the cell entries follow a count distribution.
All these examples share a common attribute in that they require a distribution over counts, a potentially skewed, discrete distribution over the natural numbers. Indeed, there are several count distributions. Here we focus on one of the most widely used ones, namely the Poisson distribution. Poisson distributions are observed in sports data (Karlis and Ntzoufras 2003) and in the natural sciences (Feller 1968). Clarke (1946) even observed that the points of impact of bombs in flying-bomb attacks are Poisson distributed. However, these classical studies have only considered the univariate and bivariate cases, even though in many situations count variables may directly influence each other and should not be considered independently. For instance, if “Neural” appears often in a document then it is also likely that “Network” appears in the same document. Indeed, one may consider employing a probabilistic graphical model, a tool widely used for modeling distributions over several random variables. Unfortunately, research has so far mainly focused on graphical models over binary, multinomial and Gaussian random variables only. These standard distributions, however, are often ill-suited for modeling count data, since they disregard either the infinite range over the natural numbers or the potentially asymmetric shape of the distribution of count variables; counts are neither binary nor continuous, they are discrete with a typically right-skewed distribution over an infinite range.
Therefore, it is not surprising that extensions of them to the Poisson case have been proposed, see e.g., Besag (1974) and Yang et al. (2012, 2013). However, all existing extensions are restricted in that they model only negative conditional dependencies or neglect the prediction of counts or do not scale well. In particular negative dependencies limit the expressiveness of the models because one can only represent relationships where the mean of a variable decreases with increasing neighbors.
To ease the modeling of multivariate count data, we therefore propose a novel family of Poisson graphical models, called Poisson Dependency Networks (PDNs). A PDN consists of a set of local conditional Poisson distributions, each representing the probability of a single count variable given the others, that naturally facilitates a Gibbs sampling inference procedure. Moreover, the family admits simple training procedures induced by a functional gradient view on training. Specifically, triggered by the recent successes of functional gradient ascent for representing multinomial dependency networks, our first technical contribution is to show that PDNs can be represented as sums of regression models grown in a stagewise optimization starting from an initial constant value, or alternatively a log-linear model, or a Poisson regression tree. In fact, this first functional gradient approach to modeling multivariate Poisson distributions, as we will show empirically on several real-world datasets, scales well and can model positive and negative dependencies among count variables while often outperforming state-of-the-art approaches. However, our main technical contribution is the development of the first multiplicative functional gradient ascent, which is demonstrated empirically to boost the performance even further, since it essentially implements an automated step size selection without incurring computational overhead.
We proceed as follows. We start off by discussing related work in more detail. We then introduce Poisson Dependency Networks (PDNs) in Sect. 3, and in Sect. 4 we show how to learn them, in particular using the multiplicative functional gradient approach. In Sect. 5 we explain how to perform inference in PDNs. Before concluding, Sect. 6 presents our exhaustive experimental evaluation on different synthetic and real-world datasets, demonstrating the effectiveness of PDNs.
2 Related Work and a First Empirical Illustration
Most of the existing machine learning and data mining literature on graphical models (we refer to Koller and Friedman (2009) for a general introduction) is dedicated to binary, multinomial, or certain classes of continuous (e.g., Gaussian) random variables. Undirected models, a.k.a. Markov Random Fields (MRFs), such as Ising (binary random variables) and Potts (multinomial random variables) models have found many applications in various fields such as robotics, computer vision and statistical physics, among others. Whereas MRFs allow for cycles in the graphical structure, directed models, a.k.a. Bayesian Networks (BNs), require acyclic directed relationships among the random variables. They have also been used in a number of applications such as planning, NLP and information retrieval, among others. Dependency Networks (DNs), the focus of the present paper, combine concepts from the directed and undirected worlds and are due to Heckerman et al. (2000). Specifically, like BNs, DNs have directed arcs, but they allow for networks with cycles and bidirectional arcs, akin to MRFs. This makes DNs quite appealing because, if the data are fully observed, learning is done locally on the level of the conditional probability distributions for each variable, mixing directed and undirected relationships as needed. Based on these local distributions, samples from the joint distribution are obtained via Gibbs sampling (Heckerman et al. 2000; Bengio et al. 2014). Except for a few cases that we will discuss below, however, surprisingly little attention has been paid to (graphical) models of multivariate count data within the machine learning and data mining communities.
Indeed, when only one count variable is considered, the literature is vast. A lot of this can essentially be treated as a Generalized Linear Model (GLM), see McCullagh and Nelder (1989) for example. In GLMs, the mean of the response variable is related to a linear model through a link function. One specific instantiation of GLMs is the Poisson regression case where the link function is the logarithm, i.e., the mean of the Poisson distribution is defined by a log-linear model. Compared to ordinary least squares regression, an advantage of the GLM framework is the fact that non-Gaussian error structures are possible. When considering two or more count variables jointly, however, things become more complicated and there exists much less work.
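To make the Poisson regression case concrete, the following is a minimal sketch that fits a log-linear mean with one covariate by Newton's method. The function name `fit_poisson_regression` and the restriction to a single covariate are illustrative assumptions made here for brevity; they are not taken from any of the cited works.

```python
import math


def fit_poisson_regression(xs, ys, n_iter=25):
    """Fit log(lambda) = b0 + b1 * x by Newton's method (hypothetical helper)."""
    # Start from the constant model, whose mean equals the average count.
    b0 = math.log(max(sum(ys) / len(ys), 1e-8))
    b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            lam = math.exp(b0 + b1 * x)
            # Score (gradient of the log-likelihood): sum of (y - lambda) * feature.
            g0 += y - lam
            g1 += (y - lam) * x
            # Negative Hessian: sum of lambda * feature * feature^T.
            h00 += lam
            h01 += lam * x
            h11 += lam * x * x
        det = h00 * h11 - h01 * h01
        if det <= 0.0:  # degenerate design, e.g. all x identical
            break
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

At the fitted parameters the score equations hold, i.e., the residuals \(y - \lambda\) sum to zero both unweighted and weighted by the covariate.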
For instance, one can define a multivariate Poisson distribution by modeling node variables as sums of independent Poisson variables, see e.g., Karlis (2003) and Ghitany et al. (2012). Since a sum of independent Poisson variables is again Poisson, the marginals are Poisson as well. The resulting joint distribution, however, can only model positive correlations. There are also bivariate extensions for specific models, e.g., Karlis and Ntzoufras (2003). In general, however, even calculating probabilities for these multivariate Poisson distributions is computationally challenging and hence their usage is often limited (Tsiamyrtzis and Karlis 2004). Hoff (2003) proposed to use Generalized Linear Mixed Models (GLMMs) based on Poisson regression, modeling the dependencies between variables using random effects. Training the resulting GLMM, however, is computationally demanding because it requires the estimation of the unobserved mixed effects; nevertheless, GLMMs are often used, for example in studies in ecology and evolution (Bolker et al. 2009). Using just GLMs, Yang et al. (2012) have recently proposed an undirected Poisson model, called GLM graphical model, where, close in spirit to PDNs, each conditional node distribution is assumed to be from the exponential family with the Poisson distribution as one particular instantiation.
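The restriction to positive correlations can be seen in the simplest bivariate instance of the sum construction. The following worked equations are a sketch; the symbols \(Y_0, Y_1, Y_2\) and their rates \(\lambda_0, \lambda_1, \lambda_2\) are notation introduced here for illustration only.

```latex
\begin{aligned}
X_1 &= Y_1 + Y_0, \qquad X_2 = Y_2 + Y_0, \qquad
Y_k \sim \mathrm{Poisson}(\lambda_k) \ \text{independent}, \\
X_j &\sim \mathrm{Poisson}(\lambda_j + \lambda_0), \qquad j = 1, 2, \\
\mathrm{Cov}(X_1, X_2) &= \mathrm{Cov}(Y_0, Y_0) = \mathrm{Var}(Y_0) = \lambda_0 \;\ge\; 0 .
\end{aligned}
```

The shared component \(Y_0\) is the only source of dependence, and its variance cannot be negative, so the covariance cannot be negative either.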
In contrast to the nonparametric functional gradient approach of PDNs, where interactions among variables are introduced only as needed and hence the learner does not explicitly consider the potentially immense parameter space, they employ a sparsity-constrained, parametrized conditional MLE approach along the lines of Meinshausen and Bühlmann (2006). It must be mentioned that Yang et al.’s GLM graphical models, like PDNs, extend the seminal work by Besag (1974). More precisely, Besag’s Auto-Poisson model can be seen as an instantiation of GLM graphical models, as the latter allow for higher-order cliques. In general, this line of work models the dependencies between the variables directly, instead of adding mixed effects. That is, the mean of each variable follows a GLM where all neighboring variables are used as explanatory variables. However, to guarantee a consistent joint probability distribution known in closed form, the parameters are required to be negative, i.e., to encode competitive relationships. In the case of arbitrary positive dependencies, the joint distribution is not guaranteed to be normalizable anymore because the normalization constant becomes infinite. In contrast, PDNs drop the guarantee of consistency and stay local, allowing for negative and positive parameters, i.e., competitive and attractive relationships. Alternatively, Kaiser and Cressie (1997) suggested the use of Winsorized Poisson distributions to remove the drawback of negative dependencies only. The Winsorized Poisson distribution is the univariate distribution obtained by truncating the integer-valued Poisson variables at a finite constant. Doing so, however, makes estimation considerably harder than for PDNs, which are naturally equipped with a simple learning approach.
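The normalizability issue mentioned above can be sketched for a pairwise model with two variables. The node parameters \(\theta_i\) are notation assumed here for illustration; \(\beta_{12}\) denotes the pairwise interaction parameter.

```latex
\begin{aligned}
p(\mathbf{x}) &\propto \exp\Big( \sum_i \big(\theta_i x_i - \log x_i!\big)
  + \sum_{i<j} \beta_{ij}\, x_i x_j \Big), \\
x_1 = x_2 = k:\quad
  & (\theta_1 + \theta_2)\,k + \beta_{12}\,k^2 - 2 \log k!
  \;\longrightarrow\; \infty \ \ (k \to \infty) \quad \text{if } \beta_{12} > 0, \\
\beta_{12} \le 0:\quad
  & \sum_{x_1, x_2 \ge 0} \exp(\cdot)
  \;\le\; \prod_{i=1}^{2} \sum_{x_i \ge 0} \frac{e^{\theta_i x_i}}{x_i!}
  \;=\; \exp\big(e^{\theta_1} + e^{\theta_2}\big) \;<\; \infty .
\end{aligned}
```

Since \(\log k! \le k \log k\) grows more slowly than \(k^2\), a positive interaction makes the unnormalized mass grow without bound along the diagonal, whereas a nonpositive interaction is dominated by the product of two independent Poisson terms.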
Table 1 A comparison of existing Poisson graphical models most similar to PDNs

| Property | Auto-Poisson (Besag 1974) | TPGMs (Yang et al. 2013) | SPGMs (Yang et al. 2013) | LPGMs (Allen and Liu 2013) | PDNs |
| --- | --- | --- | --- | --- | --- |
| Consistent joint distribution | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | (\(\checkmark \)) | (\(\checkmark \)) |
| Arbitrary parameters | – | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) |
| Unbounded range | \(\checkmark \) | – | – | \(\checkmark \) | \(\checkmark \) |
| Covers learning | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) |
| Higher-order potentials | – | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) |
| Tree-structured potentials | – | – | – | – | \(\checkmark \) |
| Functional gradient | – | – | – | – | \(\checkmark \) |
| Investigates inference | – | – | – | – | \(\checkmark \) |
Our empirical results support this, as illustrated in Fig. 1. It summarizes the results of using the most recent approaches from Table 1 as well as PDNs for a network discovery task. More precisely, following Yang et al. (2013), two network families were considered that are commonly used throughout genomics: the hub and scale-free graph structures. As one can see, the local approaches, i.e., LPGMs and PDNs, have competitive performance compared to the guaranteed consistent SPGMs. The plots show the F1-scores for structure recovery for varying problem sizes averaged over five runs; we sampled 1000 graphs per run and problem size. Moreover, the training of PDNs is considerably faster compared to the state-of-the-art approaches, often by an order of magnitude.
The speedup is likely due to the nonparametric nature of PDNs. In particular, we train PDNs using functional gradients. That is, we train them in a stagewise manner at low and scalable costs following Friedman’s Gradient Tree Boosting (GTB) (Friedman 2001). This boosting approach has been proven successful in a number of cases, see e.g., Ridgeway (2006), Kersting and Driessens (2008), Dietterich et al. (2008), Elith et al. (2008), Natarajan et al. (2012, 2014b, 2013) and Weiss et al. (2012), and since it estimates the parameters and the structure jointly, it is generally related to structure learning of graphical models, in particular to approaches that use the local neighborhood of each variable to construct the entire graph. For example, covariance selection is used in Gaussian models where edges are added to the graph until a stopping criterion is met. Despite being greedy, the method is not practical for multivariate distributions with a large number of variables. A popular alternative is neighborhood selection via Lasso (Meinshausen and Bühlmann 2006). This has also been used for learning the structure of binary Ising models, see e.g., Ravikumar et al. (2010). In the case of DNs, Heckerman et al. (2000) originally did neighborhood selection implicitly by learning probabilistic decision trees for each variable. Also, undirected probabilistic relational models such as Markov Logic Networks were learned with the help of (boosted) decision trees in Khot et al. (2011), Lowd and Davis (2014) and Natarajan et al. (2014a).
3 Poisson Dependency Networks (PDNs)

Unrestricted parameters: Some approaches only handle negative dependencies, which is a serious limitation because positive dependencies are often present as well.

Joint distribution: Given local Poisson probability distributions, formulate a joint probability distribution.

Learning: Recovery of the dependencies in the present data, so that an easy interpretation of the data is possible. A learned model should also quantify the dependencies properly to enable accurate inference on unlabeled instances.

Inference: Given a partially observed case, the goal of MAP inference is to predict the most likely assignment of the unobserved variables. Other probabilistic queries, such as marginal probabilities, are of great interest as well.

Positive and negative dependencies: PDNs are capable of modeling positive and negative dependencies. Furthermore, we do not have to restrict the weights to be symmetric or limited in any way.

Nonparametric: PDNs are the first nonparametric Poisson graphical models, both at the level of the local models and at the global level of the overall model. Using (Poisson) regression trees for the (initial) conditional distributions is more flexible than using parametric log-linear models for the mean in the Poisson distribution. Moreover, finding many rough rules of thumb of how count variables interact can potentially be a lot easier than finding a single, highly accurate local Poisson model.

Multiplicative GTB: We present the very first multiplicative functional gradient ascent. This implements an automated step size selection and hence avoids an expensive line search while improving convergence.

Flexible structure learning: The learning estimates the structure and parameters simultaneously and scales well. PDNs do not require fixing the missing symmetries in the structure to do inference. The structure learning of PDNs is competitive with consistent models, has much lower computational costs, and is readily parallelizable.

Predictions: PDNs naturally facilitate a simple Gibbs sampling inference. More precisely, we use a Gibbs sampler that provides us with samples from a joint probability distribution. This joint distribution allows us to predict counts of unobserved instances easily instead of solely focusing on structure recovery as done for existing Poisson graphical models.
An example of a PDN with three variables is depicted in Fig. 2 (right). The gray shaded variable \(X_0\) can be seen as an observational vector. In the spirit of other graphical models, for example Conditional Random Fields (Lafferty et al. 2001), we can define a set of observed variables in advance. It will not be necessary to learn a local model for these variables as they are always observed. Additionally, we have fewer constraints on these variables. They do not have to be count variables but instead can be arbitrary features. This allows us to easily incorporate features that are, for example, normally distributed, hence paving the way for hybrid models with all kinds of local distributions. We call this new formalism Poisson Dependency Networks because they generalize Dependency Networks (DNs) (Heckerman et al. 2000) for multinomial distributions to the Poisson case.
One crucial part of PDNs that we have not touched upon yet is the encoding of \(p(X_i \mid {\mathbf {X}}_{\setminus i})\) and of \(\lambda _i\) in particular. There are two sensible ways. First, one can follow a parametric approach as has been done before in the literature. Second, as an alternative, we propose a nonparametric encoding.
More precisely, we show that a PDN can be represented as sums or, respectively, products of regression models grown in a stagewise optimization starting from an initial constant value, or alternatively a log-linear Poisson model, or a Poisson regression tree. Before doing so, let us touch upon the issue of consistency again. Our nonparametric approach to modeling the means can essentially be seen as linear functions with no restrictions on the parameters. In turn, we will generally not be able to formulate the underlying joint probability distribution of the PDN in closed form by means of a standard (pairwise) Poisson graphical model. However, as we will show later, we can use sampling techniques to draw samples from the joint distribution as we have access to the conditional probabilities. If this sampling converges, there exists a consistent distribution, which however does not have to be known in closed form. We will discuss the question of consistency in greater detail later in this work.
4 Learning Poisson Dependency Networks
Before going into details, let us summarize the resulting high-level approach in Algorithm 1, since it covers all the approaches. The mean \(\lambda _i\) of each variable is assumed to consist of a set of local models grown in a stagewise manner. More precisely, initially the set is a singleton learned in line 9. Here, the formula \(X_i \sim \sum { {\mathbf{X}}_{\setminus i}}\) denotes that the mean for \(X_i\) can potentially have all other variables as features. Please note that the models used may include hyperparameters such as step sizes and pruning parameters (e.g., if using regression trees) that might be estimated using a grid search and a validation set; we will touch upon them along with each approach. After the initial learning, a predefined number of gradient steps (T) are made to further improve the model (line 11). Thus, the main computational task is the induction of regression models, more precisely, one regression model per random variable \(X_i\) and per iteration in the stagewise optimization (lines 9–11). For simplicity, let us assume that the regression models are trees. Tree-based regression models are known for their simplicity and efficiency when dealing with domains with large numbers of variables and cases. With careful implementations, inducing a single regression tree can be realized with practically linear time complexity, see e.g., Dobra and Gehrke (2002). Hence, the running time is, practically speaking, linearly dependent on \(n\cdot T\), i.e., on the number of count variables times the number of stagewise optimization iterations.
4.1 Poisson Log Linear Models
Before moving to functional gradient ascent, let us make a further note on the consistency of PDNs. In general, PDNs may not be consistent. Using log-linear models, however, we can fit PDNs with log-linear means into the setting of Besag (1974) (and its follow-up work) by restricting the parameters and structure. As Besag showed for the pairwise lattice case, there exists an undirected graphical model with Poisson local conditional probability distributions if the parameters \(\varvec{\beta }\) are nonpositive. Further, it is assumed that \(\beta _{ij} = \beta _{ji}\). This was further generalized to the non-pairwise case in Yang et al. (2012). If our parameters satisfy these conditions, we can specify the joint distribution for our model and hence we do have a consistent Poisson MRF as well.
However, we are interested in models with as few limitations as possible on the structure and the parameters. Hence we do not rely on the existence of a closed-form equation for the joint distribution or even insist on the existence of a consistent distribution. Instead, we introduce a more general class of Poisson graphical models. This is mainly motivated by computational concerns. For domains with a large number of count variables and many dependencies among them, learning a consistent Poisson MRF might be too costly. Moreover, not being able to model both positive and negative influences may hurt the performance and limit the expressiveness. Finally, as the results of our first experimental comparison in Fig. 1 already showed, approaches that are guaranteed to provide an explicit joint distribution are not necessarily better. Specifically, we follow Heckerman et al. (2000) and Bengio et al. (2014) and employ the machinery of Markov chains to generate pseudo samples. The generated samples are used to answer complex probabilistic count queries. For more details on Gibbs sampling and a justification for its usage we refer to Sect. 5.
4.2 Nonparametric Poisson Models Via Gradient Boosting
Local log-linear models tend to estimate fully connected networks and overfit. Consequently, they typically do not provide major insights into the structure of the underlying data generation process. Hence, regularization or post-processing such as thresholding (Allen and Liu 2013; Yang et al. 2013) has to be employed to extract the true nature of the network. Moreover, as demonstrated in our experiments, using log-linear models easily results in PDNs that overshoot at prediction time, leading to overflows; in that case, prediction is not possible at all.

By allowing the tree structure to handle much of the overall model complexity, the models in each leaf can be kept at a low order and, hence, are more easily interpreted.

Interactions among features are directly conveyed by the structure of the regression tree. As a result, interactions can be understood and interpreted more easily in qualitative terms; we will touch upon this later again.
Theorem 1
Proof
These pointwise gradients are quite intuitive. We want to make the predicted means \(\lambda _i({\mathbf{x}}_{\setminus i})\) as similar to the observed count \(x_i\) as possible.
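This intuition can be checked with a short derivation. The following is a sketch that assumes the log link of Poisson regression, \(\lambda _i = \exp (\psi _i)\), with \(\psi _i\) denoting the function being boosted and the dependence on \({\mathbf{x}}_{\setminus i}\) suppressed.

```latex
\begin{aligned}
\log p(x_i \mid \mathbf{x}_{\setminus i})
  &= x_i \log \lambda_i - \lambda_i - \log x_i!
   = x_i\, \psi_i - \exp(\psi_i) - \log x_i! , \\
\frac{\partial \log p(x_i \mid \mathbf{x}_{\setminus i})}{\partial \psi_i}
  &= x_i - \exp(\psi_i) = x_i - \lambda_i .
\end{aligned}
```

Under this assumption the gradient vanishes exactly when the predicted mean equals the observed count, and its sign indicates whether the current mean is too small or too large.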
However, we are still left with the step size parameter \(\eta \) to compute the updated mean model in (4). For other probabilistic models such as CRFs, performing a line search was reported to be too expensive (Dietterich et al. 2008). Hence, it was suggested to rely on the “self-correcting” property of tree boosting to correct any overshoot or undershoot on the next iteration, i.e., to use a step size of \(\eta = 1\). Unfortunately, we have observed in our experiments that using a fixed step size of \(\eta = 1\) can lead to very slow convergence or failures due to overflows for large counts. Consequently, we now develop our main technical contribution, a multiplicative gradient boosting approach that avoids this without incurring any computational overhead.
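As a minimal sketch of additive functional gradient ascent for a single conditional mean, assume the log link \(\lambda = \exp (\psi )\), for which the pointwise gradient of the Poisson log-likelihood is \(x - \lambda \). The helper names `fit_stump` and `boost_additive`, the use of one-split regression stumps instead of full regression trees, and the default step size are illustrative assumptions, not the implementation used in the experiments.

```python
import math


def fit_stump(X, r):
    """Least-squares regression stump: one threshold on one feature."""
    n = len(X)
    mean_all = sum(r) / n
    best_sse, best_split = float("inf"), None
    for j in range(len(X[0])):
        # Candidate thresholds: every observed value except the largest.
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [r[k] for k in range(n) if X[k][j] <= thr]
            right = [r[k] for k in range(n) if X[k][j] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if sse < best_sse:
                best_sse, best_split = sse, (j, thr, ml, mr)
    if best_split is None:  # no feature varies: constant prediction
        return lambda row: mean_all
    j, thr, ml, mr = best_split
    return lambda row: ml if row[j] <= thr else mr


def boost_additive(X, y, n_steps=50, eta=0.1):
    """Additive boosting of psi = log(lambda) for one count variable."""
    psi0 = math.log(max(sum(y) / len(y), 1e-8))  # initial constant model
    stumps = []

    def psi(row):
        return psi0 + eta * sum(s(row) for s in stumps)

    for _ in range(n_steps):
        # Pointwise gradients x - lambda at the current model.
        grads = [yi - math.exp(psi(row)) for row, yi in zip(X, y)]
        stumps.append(fit_stump(X, grads))
    return lambda row: math.exp(psi(row))
```

The default `eta` is deliberately small. If each stump fits its gradients exactly, the pointwise iteration \(\psi \leftarrow \psi + \eta (x - e^{\psi })\) has local contraction factor \(1 - \eta x\) around \(\psi = \log x\), so in this sketch a step size of 1 oscillates once counts exceed 2.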
4.3 Multiplicative Gradient Boosting
For nonnegative optimization problems, multiplicative update rules for the parameters have been shown to have much better convergence rates, see e.g., Lee and Seung (2000), Saul and Lee (2001), Sha et al. (2003) and Yang and Laaksonen (2007), and we confirm this for PDNs in our experimental section. More importantly, since the tree structure of the induced regression trees handles much of the overall PDN complexity, faster convergence implies sparser PDNs.
To derive a multiplicative update, we consider a simpler functional dependency between \(\lambda _i({\mathbf{x}}_{\setminus i})\) and the feature function \(\psi _i\). More precisely, triggered by Chen et al. (2009), who have shown the identity link function to be beneficial when using parametrized univariate Poisson distributions for large-scale behavioral targeting, we assume \(\lambda _i = \psi _i\) (again omitting the iteration index t). For this link function, the pointwise gradients are the following:
Theorem 2
Proof
The multiplicative update now follows from Theorem 2 together with a functional step size.
Corollary 1
Proof
The pointwise multiplicative updates are quite intuitive. Instead of making the differences as small as possible, we want to make the ratio of observed counts \(x_i\) and predicted means \(\lambda _i({\mathbf{x}}_{\setminus i})\) as close to 1 as possible.
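The ratio view can be sketched in code. Under the identity link the pointwise gradient of the Poisson log-likelihood is \(x/\lambda - 1\); choosing the step size equal to the current mean gives \(\lambda + \lambda (x/\lambda - 1) = \lambda \cdot (x/\lambda )\), so a regression model fitted to the ratios is multiplied into the current mean. The helper names, the one-split stumps, and the small positive floor on the mean are assumptions of this sketch rather than details of the method above.

```python
def fit_stump(X, r):
    """Least-squares regression stump: one threshold on one feature."""
    n = len(X)
    mean_all = sum(r) / n
    best_sse, best_split = float("inf"), None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [r[k] for k in range(n) if X[k][j] <= thr]
            right = [r[k] for k in range(n) if X[k][j] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if sse < best_sse:
                best_sse, best_split = sse, (j, thr, ml, mr)
    if best_split is None:  # no feature varies: constant prediction
        return lambda row: mean_all
    j, thr, ml, mr = best_split
    return lambda row: ml if row[j] <= thr else mr


def boost_multiplicative(X, y, n_steps=10, floor=1e-8):
    """Multiplicative boosting of the mean lambda for one count variable."""
    lam0 = max(sum(y) / len(y), floor)  # initial constant mean
    factors = []

    def lam(row):
        value = lam0
        for f in factors:
            value *= f(row)
        # Keep the mean strictly positive so that ratios stay defined.
        return max(value, floor)

    for _ in range(n_steps):
        # Pointwise ratios of observed counts to current means.
        ratios = [yi / lam(row) for row, yi in zip(X, y)]
        factors.append(fit_stump(X, ratios))
    return lam
```

On data where one split separates the groups, a single multiplicative step already moves each group mean to its observed average, and later steps multiply by factors of 1.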
4.4 Model Compression, Initial Count Models, and Dependency Recovery
One additional way to avoid overfitting, next to Laplace estimates, is model compression (Bucila et al. 2006). Here, we collapse the trained additive or multiplicative model into a single model. To do so, we evaluate it on the training set and learn a single Poisson model per count variable. Since we are dealing with count data, we should use e.g., a Poisson regression tree (Chaudhuri et al. 1995) as shown in Fig. 3; that is, the compressed PDN model consists of a set of local Poisson regression trees, one for each variable \(X_i\), where we train \(\text {tree}_i(X_i \mid \mathbf {X}_{\setminus i})\) and evaluate \(\lambda ^0_i({\mathbf {x}_{\setminus i}}) = \text {tree}_i({\mathbf {x}_{\setminus i}})\).
More precisely, a Poisson regression tree partitions the training examples in the space of the explanatory variables \(\mathbf {X}_{\setminus i}\) in order to best fit the response variable \(X_i\). It is a binary tree whose leaves represent the \(\lambda _i\) of the given partition; all other nodes represent a splitting criterion on a variable \(X_j \in \mathbf {X}_{\setminus i}\). We use the Poisson regression tree implementation of rpart (Therneau et al. 2011), where the splitting criterion is given by the likelihood ratio test for two Poisson groups, \(D_{\mathrm{parent}} - (D_{\mathrm{left son}} + D_{\mathrm{right son}})\), with the deviance given by \(D = \sum \left[ x_i \log \Big ( \frac{x_i}{\hat{\lambda }} \Big ) - (x_i - \hat{\lambda }) \right] ,\) where \(\hat{\lambda }\) is the sample mean of count variable \(X_i\). This splitting is recursively applied until each subgroup reaches a minimum size or there are no improvements. To avoid overfitting, the depth of a tree is typically limited a priori; alternatively, post-pruning can be used after the tree has been learned. In the rpart implementation the pre-pruning is controlled by a complexity parameter cp. This parameter (which we consider to be part of the hyperparameter space of our algorithm) controls the size of the initial tree. After the initial tree has been learned, a secondary step is executed in which the tree is pruned back using cross validation. The height of the final tree will be the one that reduces the cross validation error, and in our experiments it generalizes well.
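The splitting criterion can be illustrated with a small sketch that computes the deviance as written above, with \(\hat{\lambda }\) taken to be the plain sample mean of each group, and searches the thresholds of one feature for the largest deviance reduction. The function names are hypothetical, and the sketch is not meant to reproduce rpart's internals.

```python
import math


def poisson_deviance(counts):
    """Deviance of a group of counts around its sample mean."""
    lam = sum(counts) / len(counts)
    if lam == 0:
        return 0.0  # all counts are zero: the mean fits perfectly
    dev = 0.0
    for x in counts:
        if x > 0:  # x * log(x / lam) tends to 0 as x tends to 0
            dev += x * math.log(x / lam)
        dev -= x - lam
    return dev


def best_split(feature, counts):
    """Threshold on one feature with the largest deviance reduction."""
    parent = poisson_deviance(counts)
    best_gain, best_thr = 0.0, None
    # Candidate thresholds: every observed value except the largest.
    for thr in sorted(set(feature))[:-1]:
        left = [c for f, c in zip(feature, counts) if f <= thr]
        right = [c for f, c in zip(feature, counts) if f > thr]
        gain = parent - (poisson_deviance(left) + poisson_deviance(right))
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain
```

A split that separates two internally homogeneous groups leaves zero deviance in both children, so its gain equals the full parent deviance.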
To summarize, in contrast to other Poisson graphical models, compressed PDNs return local models that are likely to be sparse and therefore easier to interpret.
5 Making Predictions Using Poisson Dependency Networks
In many applications, we obtain only partially observed instances and we want to use probabilistic inference to predict the values of the missing variables. To be more precise, assume that \(\mathbf{X} = \mathbf{Y} \cup \mathbf{E}\), where \(\mathbf{Y}\) and \(\mathbf{E}\) are disjoint. \(\mathbf{Y}\) amounts to the unobserved variables and \(\mathbf{E}\) describes the evidence. Then we want to answer queries of the form \(p(\mathbf{y} \mid \mathbf{e}), p( y_i \mid \mathbf{e})\), or \({{\mathrm{arg\,max}}}_{\mathbf{y}} p({\mathbf{Y}} = \mathbf{y} \mid \mathbf{e})\). The latter query corresponds to MAP inference and finds the most likely assignment to the unobserved variables. In a univariate Poisson model, MAP inference consists of just reading off the mode of the Poisson distribution, which is equal to \(\lfloor \lambda _i\rfloor \). Other marginal probabilities, e.g., \(p(X_i = k)\), can also be read off the distribution because the Poisson distribution is completely defined by its mean. The same holds for a variable \(X_i\) in a PDN if all neighbors of this variable are observed. However, PDNs with unobserved variables require an inference machinery to account for the dependencies. We here resort to Gibbs sampling (Geman and Geman 1984) to do so.
Since we do not know the underlying joint distribution explicitly, the Gibbs sampler provides us only with pseudo samples, and hence it is called a Pseudo Gibbs sampler (Heckerman et al. 2000) due to the potential inconsistencies arising from conflicting local and joint distributions. Specifically, the Pseudo Gibbs sampler starts with an arbitrary initialization of the unobserved variables and then iterates over each variable for a previously defined number of sweeps. Each sweep produces a new sample by first calculating the conditional probability distribution for every variable \(X_i\), that is, by calculating the mean \(\lambda _i\) based on the current states of its parent variables \({{\mathrm{\mathbf {pa}}}}_i\). Based on the Poisson distribution parameterized by \(\lambda _i\), a new state for \(X_i\) is sampled. This procedure will sample from the joint distribution after an adequate burn-in phase, which removes early samples because they are heavily biased by the initial values. In terms of computations, this algorithm does not differ from a standard Gibbs sampler. However, in the case of an inconsistent set of local probability distributions, it is referred to as a “Pseudo” Gibbs sampler to stress the fact that it may not produce samples from a joint distribution consistent with the PDN. Its run time is identical to that of a standard Gibbs sampler. Besides the number of variables in the PDN and the number of samples, the run time also depends on the number of trees used in the stagewise optimization. However, regression trees can be evaluated rapidly, and hence calculating \(\lambda _i\) can be done efficiently.
From the collected samples we can then compute an approximate marginal distribution or MAP assignment. If we are interested in the MAP assignment, we simply select the configuration occurring most often in our samples. Due to the nature of count variables, however, the number of distinct configurations can be fairly large. In that case, we can approximate the MAP assignment by looking at the marginal probabilities and picking the most probable state for each variable individually.
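Both ways of reading a MAP estimate off a set of samples can be sketched in a few lines of Python. The sample matrix below is made up for illustration, and the function name is an assumption; the sketch only shows that the most frequent joint configuration and the per-variable modes need not coincide.

```python
from collections import Counter

import numpy as np


def map_from_samples(samples):
    """Return (most frequent joint configuration, per-variable most frequent states).

    samples: array of shape (num_samples, num_variables) with nonnegative integers.
    """
    samples = np.asarray(samples)
    joint_counts = Counter(map(tuple, samples.tolist()))
    joint_map = joint_counts.most_common(1)[0][0]
    marginal_map = tuple(
        Counter(column.tolist()).most_common(1)[0][0] for column in samples.T
    )
    return joint_map, marginal_map


# Toy samples over two count variables (illustrative values only).
toy_samples = [[1, 0], [1, 0], [1, 0], [2, 3], [1, 3], [2, 3], [0, 3]]
joint_map, marginal_map = map_from_samples(toy_samples)
```

In this toy example the joint estimate is (1, 0), whereas picking the most probable state per variable yields (1, 3), so the marginal-based estimate should be understood as an approximation.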
The order of the Gibbs updates, however, does matter in the case of PDNs due to conflicting \(\beta _{ij}\) and \(\beta _{ji}\). Consequently, we follow Bengio et al. (2014) and use an unordered Pseudo Gibbs sampler in which, in each step, one chooses an \(X_i\) at random. Bengio et al. showed that this unordered Pseudo Gibbs sampler induces a so-called Generative Stochastic Network (GSN) Markov chain. As long as this GSN Markov chain has a stationary distribution, the DN defines a joint distribution, which, however, does not have to be known in closed form. This is, for example, the case if the chain is ergodic.
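The unordered Pseudo Gibbs sampler described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the log-linear mean functions, their weights, and all function names are hypothetical stand-ins for the learned local models (which may equally be boosted regression trees), and one sample is recorded per single-variable update.

```python
import numpy as np


def make_loglinear_mean(bias, weights):
    """Hypothetical local model: lambda_i = exp(bias + weights . x), weight on X_i itself is 0."""
    weights = np.asarray(weights, dtype=float)

    def mean(x):
        return float(np.exp(bias + weights @ x))

    return mean


def unordered_pseudo_gibbs(mean_fns, x_init, observed, num_steps, burn_in, rng):
    """Resample one randomly chosen unobserved variable per step from its local Poisson."""
    x = np.array(x_init, dtype=np.int64)
    free = np.flatnonzero(~np.asarray(observed, dtype=bool))
    samples = []
    for step in range(num_steps):
        i = int(rng.choice(free))            # unordered: pick a variable at random
        x[i] = rng.poisson(mean_fns[i](x))   # sample X_i given the current state of the others
        if step >= burn_in:                  # discard the burn-in phase
            samples.append(x.copy())
    return np.array(samples)


# Three variables; X_2 is observed (evidence) and kept fixed at 4.
mean_fns = [
    make_loglinear_mean(1.0, [0.0, -0.1, 0.05]),
    make_loglinear_mean(0.5, [-0.1, 0.0, 0.0]),
    make_loglinear_mean(0.0, [0.05, 0.0, 0.0]),
]
rng = np.random.default_rng(0)
samples = unordered_pseudo_gibbs(
    mean_fns, x_init=[1, 1, 4], observed=[False, False, True],
    num_steps=2000, burn_in=200, rng=rng,
)
marginal_means = samples.mean(axis=0)
```

The evidence variable is never resampled, so its column in `samples` stays constant, while the remaining columns approximate the conditional marginals given the evidence.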
From this perspective, PDNs aim to estimate the generating distribution of multivariate count data indirectly, by boosting the transition operator of a Markov chain rather than directly. Since all means are positive, and hence \(p({\mathbf {x}}) > 0\), the chain is ergodic and we will arrive at a consistent estimator of the associated joint distribution.
6 Experimental Evaluation

Q1 Can PDNs learn both positive and negative dependences?

Q2 Can Gibbs sampling predict good counts?

Q3 Can PDNs outperform existing Poisson graphical models?

Q4 Are PDNs easy to interpret?

Q5 Is the strength of the learned dependencies robust to data perturbations?

Q6 Can gradient tree boosting improve the quality of an initial model?

Q7 Can multiplicative updates speed up training, without sacrificing performance?
6.1 Network Discovery from Simulated Data (Q1, Q2, Q3)
Table 2 Comparison of network discovery performance based on F1-scores (the higher, the better); columns give the number of nodes n for each graph type
Method  Hub n=10  Hub n=25  Hub n=50  Hub n=75  Hub n=100  Scale-free n=10  Scale-free n=25  Scale-free n=50  Scale-free n=75  Scale-free n=100
PDNs  0.614  0.599  0.558  0.493  0.463  0.415  0.489  0.548  0.472  0.412
LPGMs  (0.222)  0.400  0.530  0.545  0.554  (0.100)  0.262  0.372  0.376  0.286
SPGMs  (0.316)  0.310  0.415  0.515  0.473  (0.000)  0.194  0.286  0.327  0.271
The datasets were generated as follows. First, with the help of the R package “huge”^{7}, a graph with \(n =10, 25, 50, 75, 100\) nodes was generated. This data generation process is based on multivariate normal distributions for different graph structures; we created “hub” and “scale-free” graphs. The adjacency matrix of this true graph was used to construct the neighborhoods in the PGM model. Together with a unary (\(\beta _i=0\)) and a pairwise (\(\beta _{ij} = 0.1\)) parameter, we then obtained the PGM model. We used the Winsorized PGM Gibbs Simulator contained in the “PGM” package to generate \(m=1000\) samples. This resulted in an \(m\times n\) matrix of count values, which was used to learn a model for the original neighborhood graph. More precisely, we used the relative influence to construct an adjacency matrix from our learned models by setting a cell to 1 whenever \(I^2(uv;\lambda _u)>0\). Similarly, an adjacency matrix can be read off from the parameter matrix of LPGMs and SPGMs. The results in terms of F1-scores are shown in Table 2 and correspond to Fig. 1. As can be seen, PDNs achieve competitive results on the hub graphs and outperform LPGMs and SPGMs on the scale-free networks; the latter may not even converge. Moreover, as already shown in Fig. 1c, PDNs can be an order of magnitude faster.
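The evaluation step described above, thresholding the relative influence at zero and scoring the recovered edges, might look roughly like this in Python. The influence matrix is made up, the function names are illustrative, and comparing directed off-diagonal entries without symmetrizing the learned matrix is an assumption of this sketch.

```python
import numpy as np


def adjacency_from_influence(influence):
    """Set cell (u, v) to 1 whenever the relative influence of v on lambda_u is positive."""
    adj = (np.asarray(influence, dtype=float) > 0).astype(int)
    np.fill_diagonal(adj, 0)  # a variable is not its own neighbor
    return adj


def f1_score_edges(adj_true, adj_pred):
    """F1-score over off-diagonal entries, F1 = 2 TP / (2 TP + FP + FN)."""
    adj_true = np.asarray(adj_true, dtype=bool)
    adj_pred = np.asarray(adj_pred, dtype=bool)
    off_diag = ~np.eye(adj_true.shape[0], dtype=bool)
    tp = np.sum(adj_true & adj_pred & off_diag)
    fp = np.sum(~adj_true & adj_pred & off_diag)
    fn = np.sum(adj_true & ~adj_pred & off_diag)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


true_adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
toy_influence = [[0.0, 0.3, 0.2], [0.1, 0.0, 0.0], [0.0, 0.0, 0.0]]
pred_adj = adjacency_from_influence(toy_influence)
f1 = f1_score_edges(true_adj, pred_adj)
```

Here two true edges are recovered and one spurious edge is added, which gives a precision of 2/3, a recall of 1, and hence an F1-score of 0.8.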
Because of this, we only compared PDNs to LPGMs in the remaining experiments. In fact, we did run SPGMs, but on some of the more complex datasets they were not only an order of magnitude slower but also did not terminate after days of running, whereas learning a PDN took less than 30 min. Moreover, we note that in the following, LPGMs refer to local Poisson models with GLM mean models learned without regularization. To cover regularization, we also present results for PDNs with Poisson regression trees as mean models, which can be seen as a regularized model.
6.2 Cell Counts and Bibliography Data (Q2, Q3, Q6)
To investigate the quality of predicted counts returned by the Gibbs sampler, we used two different datasets, namely cell counts and counts of publications.
Table 3 Tenfold cross-validation comparison of the log-scores (the lower, the better) on training and test data for the cell count images (top) and publication data (bottom)
Model  Boost. Iters.  Train LL  LOO LL
Cells
\(\textsc {PDNA}^{\text {const,log}}\) (\(\hbox {w}\slash \,\eta =1\))  1.91  1.026  1.541
\(\textsc {PDNA}^{\text {const,log}}\) (\(\hbox {w}\slash \) selected \(\eta \))  5.60  1.046  \(1.428^\bullet \)
\(\textsc {PDNM}^{\text {const,id}}\)  1.26  1.167  1.547
\(\textsc {PDNM}^{\text {const,id}}\) (Laplace)  1.75  1.074  1.511
\(\textsc {PDNM}^{\text {tree,id}}\)  0.85  1.134  1.655
\(\textsc {PDNM}^{\text {tree,id}}\) (Laplace)  1.18  1.049  1.590
\(\textsc {PDN}^{\text {tree}}\)  –  1.219  1.525
\(\textsc {LPGM}\) (Allen and Liu 2013)  –  0.963  \(\mathbf{1.117}^\circ \)
Publications
\(\textsc {PDNA}^{\text {const,log}}\) (\(\hbox {w}\slash \,\eta =1\))  x  x  x
\(\textsc {PDNA}^{\text {const,log}}\) (\(\hbox {w}\slash \) ParamILS \(\eta \))  31.50  0.980  \(\mathbf{1.182}^\star \)
\(\textsc {PDNM}^{\text {const,id}}\)  1.95  1.069  1.248
\(\textsc {PDNM}^{\text {const,id}}\) (Laplace)  2.68  0.994  1.228
\(\textsc {PDNM}^{\text {tree,id}}\)  0.23  0.998  1.232
\(\textsc {PDNM}^{\text {tree,id}}\) (Laplace)  0.83  0.955  1.235
\(\textsc {PDN}^{\text {tree}}\)  –  1.004  1.212
\(\textsc {LPGM}\) (Allen and Liu 2013)  –  1.137  1.233
Table 4 Tenfold cross-validation comparison of collective prediction performances (the lower, the better) made by the Gibbs sampler on cell count images (top) and publication data (bottom)
Model  Pred. LL  NRMSE
Cells
\(\textsc {PDNA}^{\text {const,log}}\) (\(\hbox {w}\slash \) selected \(\eta \))  \(\mathbf{1.429}^\bullet \)  \(\mathbf{0.396}^\bullet \)
\(\textsc {PDNM}^{\text {const,id}}\)  1.570  0.408
\(\textsc {PDNM}^{\text {const,id}}\) (Laplace)  1.503  0.410
\(\textsc {PDNM}^{\text {tree,id}}\)  1.619  0.455
\(\textsc {PDNM}^{\text {tree,id}}\) (Laplace)  1.576  0.448
\(\textsc {PDN}^{\text {tree}}\)  1.511  0.439
\(\textsc {LPGM}\) (Allen and Liu 2013)  –  –
Publications
\(\textsc {PDNA}^{\text {const,log}}\) (\(\hbox {w}\slash \) ParamILS \(\eta \))  \(\mathbf{0.446}^\bullet \)  0.262
\(\textsc {PDNM}^{\text {const,id}}\)  0.609  0.320
\(\textsc {PDNM}^{\text {const,id}}\) (Laplace)  0.610  0.268
\(\textsc {PDNM}^{\text {tree,id}}\)  0.525  0.256
\(\textsc {PDNM}^{\text {tree,id}}\) (Laplace)  0.550  0.254
\(\textsc {PDN}^{\text {tree}}\)  0.508  0.249
\(\textsc {LPGM}\) (Allen and Liu 2013)  –  –
Publication Data. It has been shown in Hadiji et al. (2013) that migration data based on bibliographic entries exhibits interesting phenomena. Here, we used the AAN (Radev et al. 2009) corpus instead of DBLP and moved from descriptive to predictive models. The goal is to predict the number of publications of a researcher in future years. The AAN corpus at hand contains 19,410 publications written by 15,397 authors from the NLP community. We observe the first 6 years of a researcher’s publication record and predict the number of publications for the following 4 years. We take only active researchers into account, i.e., researchers who had publications in the first 3 years of their career. For this experiment, we ran the Gibbs sampler for 40,000 iterations, with an initial burn-in of 4000 iterations (10 %), to obtain predictions of the number of publications of an author.
The tenfold cross-validated results for the training and test likelihoods are summarized in Table 3. Here, boosted PDNs with additive updates achieve the best results. However, one should note that these results were obtained by optimizing the step size for each class separately. Instead of a simple grid search, we used ParamILS (Hutter et al. 2009) to find the best step sizes. In our setting, we obtained step sizes of 0.5, 0.075, 0.05, and 0.075 for the 4 years to predict. Although Allen et al.’s LPGMs (PDNs with log-linear mean models) do a full maximum likelihood optimization for the parameters, they are not able to outperform the boosted approaches. Most importantly, with only a few iterations of optimization and no ParamILS, multiplicative updates achieved a better train likelihood than LPGMs, without sacrificing much test likelihood.
Even more interesting are the collective prediction performances as summarized in Table 4. As one can clearly see, \(\textsc {PDN}^{\text {tree}}\), i.e., PDNs with PRTs, do the best job. At first sight, it is striking that boosted PDNs with initial tree models and multiplicative updates perform worse than standard \(\textsc {PDN}^{\text {tree}}\). We attribute this to the rather small validation set used during boosting and, in turn, to overfitting.
This is also confirmed by the boosting results using Laplace smoothing with \(\alpha _i=0.1\) and \(\beta _i=0.2\), which significantly reduced the error for \(\textsc {PDNM}^{\text {const,id}}\). In any case, LPGMs, the only model not developed in the present paper, are not capable of predicting all test instances due to overflows in the Gibbs sampling.
To summarize, both experiments answer question Q2 affirmatively. Moreover, they already indicate that Q7 may also be answered affirmatively.
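For readers who want to reproduce numbers of the kind reported in Tables 3 and 4, the following Python sketch shows one plausible way to compute the two measures. Both definitions are assumptions made for illustration: the log-score is taken to be the average negative Poisson log-likelihood, and the NRMSE is taken to be the root mean squared error divided by the range of the observed counts; the normalizations used in the experiments may differ.

```python
import math

import numpy as np


def poisson_log_score(counts, means):
    """Average negative Poisson log-likelihood (assumed definition; lower is better)."""
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    log_factorials = np.array([math.lgamma(c + 1.0) for c in counts])
    log_lik = counts * np.log(means) - means - log_factorials
    return float(-log_lik.mean())


def nrmse(counts, predictions):
    """RMSE divided by the range of the observed counts (assumed normalization)."""
    counts = np.asarray(counts, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    rmse = np.sqrt(np.mean((counts - predictions) ** 2))
    return float(rmse / (counts.max() - counts.min()))


observed = [0, 2, 4]             # illustrative test counts
predicted_means = [1.0, 2.0, 3.0]  # illustrative predicted Poisson means
score = poisson_log_score(observed, predicted_means)
error = nrmse(observed, predicted_means)
```

For collective prediction, the predicted means would be replaced by estimates obtained from the Gibbs samples.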
6.3 Bag-of-Word PDNs (Q4, Q5)
To investigate the robustness of the relative influence among words extracted from the PDN (Q5), we shuffled the NIPS dataset randomly ten times and, for each reordering, removed 5, 10, and 15 % of the documents from the end. We learned PDNs for each of the reduced subsets and considered the mean and standard deviation of the normalized relative influences among the words induced by the PDNs. The results are depicted in Fig. 7. Here, the strength of an edge represents the mean, and the gray shade represents the standard deviation of the relative influence, with darker tones indicating small values. As one can see, strong dependencies are less affected by removing documents from the corpus; the variance does increase, however, mostly through deviations on the strong edges or through new weak edges.
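The perturbation protocol above can be sketched as a small Python harness. Everything specific here is an assumption for illustration: `placeholder_influence` uses absolute correlations as a stand-in and is not the PDN relative influence, the row-wise normalization is one possible reading of "normalized", and pooling all shuffles and removal fractions into a single mean and standard deviation is likewise a simplification.

```python
import numpy as np


def placeholder_influence(counts):
    """Stand-in for PDN training: absolute correlations, NOT the relative influence."""
    infl = np.abs(np.corrcoef(counts, rowvar=False))
    np.fill_diagonal(infl, 0.0)
    return infl


def influence_stability(counts, fit_influence, fractions=(0.05, 0.10, 0.15),
                        num_shuffles=10, seed=0):
    """Mean and standard deviation of normalized influences over perturbed corpora."""
    rng = np.random.default_rng(seed)
    num_docs = counts.shape[0]
    runs = []
    for _ in range(num_shuffles):
        shuffled = counts[rng.permutation(num_docs)]         # random reordering
        for frac in fractions:
            keep = num_docs - int(round(frac * num_docs))    # drop documents from the end
            infl = fit_influence(shuffled[:keep])
            row_sums = infl.sum(axis=1, keepdims=True)
            runs.append(infl / np.where(row_sums > 0, row_sums, 1.0))
    runs = np.stack(runs)
    return runs.mean(axis=0), runs.std(axis=0)


# Toy document-by-word count matrix (200 documents, 5 words), illustrative only.
toy_counts = np.random.default_rng(1).poisson(3.0, size=(200, 5))
mean_infl, std_infl = influence_stability(toy_counts, placeholder_influence)
```

In an actual experiment, `fit_influence` would train a PDN on the reduced corpus and return its relative influence matrix; entries with a large mean and a small standard deviation then correspond to robust dependencies.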
6.4 Communities and Crime and Click-Stream Data (Q6, Q7)
Finally, to investigate the overall performance of PDNs as well as to compare additive and multiplicative updates, we used two real-world datasets, namely the Communities and Crime dataset and a click-stream dataset.
For a comparison of the additive and multiplicative updates for PDNs, we used a tenfold cross validation based on the folds defined in the dataset. The bottom plots in Fig. 8 show the learning curves for the eight different crimes. The plots are averaged over the ten folds using a maximum of 50 iterations and measure the effectiveness of the learning rate in terms of the log-score per class. For the additive models, PDNA, we used a step size of \(\eta =0.01\) in the case of the identity link function and \(\eta =1^{5}\) in the case of the log link. Both step sizes were found by a systematic search: we started with \(\eta =1.0\) and decreased the step size until no more oscillation was observed during training. In particular, when using a constant mean model for initialization, one can see that the multiplicative updates outperform the additive ones: they are significantly faster and achieve better test performance. It must be mentioned that this was achieved without any time-consuming selection of an adequate step size. In the case of the additive updates, the log link function learns faster for crimes with high counts such as “larcenies”, with a data mean of around 1999 per community. For crimes with small means such as “murders”, with an empirical mean of around 6.4, however, additive updates using the identity link function learned faster. Moreover, the additive updates always required the maximum of 50 iterations to obtain the best value. We determined the optimal iteration based on a validation set during learning. Multiplicative updates, on the other hand, required only a fraction of the iterations while not sacrificing predictive performance. Averaged over all experiments and classes, multiplicative updates require far fewer than ten iterations, and the log-likelihood score on the test data is significantly better than for the additive case with mean initialization. This clearly shows the advantage of multiplicative updates.
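The step-size search for the additive updates can be illustrated on a deliberately simplified problem. The Python sketch below is not the boosting procedure itself: it fits a single constant Poisson mean with the identity link by plain gradient ascent, treats any decrease of the training log-likelihood as "oscillation", and shrinks \(\eta \) by a factor of ten each time. The shrink factor, the oscillation criterion, the data, and all names are assumptions.

```python
import math


def log_likelihood(counts, lam):
    """Poisson log-likelihood of the counts under a constant mean lam."""
    return sum(c * math.log(lam) - lam - math.lgamma(c + 1) for c in counts)


def run_additive_updates(counts, eta, num_iters=50, lam_init=1.0):
    """Gradient ascent on a constant mean; returns (final lam, oscillation observed?)."""
    lam = lam_init
    prev = log_likelihood(counts, lam)
    for _ in range(num_iters):
        grad = sum(c / lam - 1.0 for c in counts)   # d/d lam of the log-likelihood
        lam = max(lam + eta * grad, 1e-8)           # additive update, kept positive
        cur = log_likelihood(counts, lam)
        if cur < prev - 1e-12:                      # likelihood dropped: treat as oscillation
            return lam, True
        prev = cur
    return lam, False


def search_step_size(counts, eta=1.0, shrink=0.1, min_eta=1e-8):
    """Start with eta = 1.0 and decrease it until training no longer oscillates."""
    while eta >= min_eta:
        lam, oscillated = run_additive_updates(counts, eta)
        if not oscillated:
            return eta, lam
        eta *= shrink
    raise ValueError("no stable step size found")


toy_counts = [4, 6, 7, 8, 5, 9, 6, 7]   # illustrative counts with empirical mean 6.5
eta_found, lam_found = search_step_size(toy_counts)
```

On this toy data the first candidate, \(\eta =1.0\), overshoots badly, whereas the next one lets the mean approach the empirical mean monotonically. Multiplicative updates avoid this kind of search altogether, which is the practical advantage noted above.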
Finally, we investigated the robustness of the learned PDNs w.r.t. the number of iterations of the stepwise optimization. To this end, we computed the gain in normalized relative influence over the iterations for each fold in the CnC dataset. The summarized results in Fig. 9 show that the gain in influence decreases with later iterations. That is, the learner focuses on important dependencies first, and influences added at later iterations are less important. The important influences are therefore robust w.r.t. the number of iterations set. This agrees with the robustness results obtained in Sect. 6.3.
To summarize, PDNs outperform their univariate counterpart, boosting improves the initial model, and multiplicative updates outperform additive updates. These are affirmative answers to our initial questions (Q6, Q7).
Click-Stream Data. To reaffirm the results of the last experiment, we created a count dataset based on the MSNBC.com dataset from the UCI repository^{12}. The data gives sequences corresponding to a user’s page views for an entire day, which are grouped into 17 categories. Instead of the original dataset, we used the postprocessed version from the SPMF library^{13}, which removed very short click sequences. In total, this dataset contains information about 31,790 users. We discarded the sequence information and instead analyzed solely the frequencies of the visited categories. In contrast to the CnC dataset, the 17 categories all have low means. To be more precise, the category “frontpage” has the highest mean (3.62) and the category “msnnews” has the lowest mean (0.03). We also note that the variance in the data is much lower than in the CnC dataset.
First, we considered boosted PDNs with the empirical mean as initial model again. Since the learning curves were qualitatively identical to those of the CnC experiment, we omit them here. Due to the low means and variance of the categories, the simple initial models themselves already achieve a low average log-score of 1.22. Still, boosting was able to improve the model with both additive and multiplicative updates. Additive updates achieved log-scores of 1.14 (\(\textsc {PDNA}^{\text {const,id}}\)) and 1.08 (\(\textsc {PDNA}^{\text {const,log}}\)), however, requiring more than 40 iterations. With fewer than four iterations on average, multiplicative updates achieved an average log-score of 1.09. That is, again, multiplicative updates can be an order of magnitude faster, without sacrificing predictive performance. Initialization using Poisson regression trees gives a head start for the additive updates but not for the multiplicative ones; the latter are still significantly faster. These are affirmative answers to questions (Q3–7).
Taking all results together, the experimental evaluation clearly shows that all seven questions (Q1–7) can be answered affirmatively and indicates that PDNs have the potential to be a fast alternative to existing Poisson graphical models.
7 Conclusion
Count data are increasingly ubiquitous in data science settings. Example data are bag-of-X representations of, e.g., collections of images or text documents, genomic sequencing data, user-ratings data, spatial incidence data, climate studies, and site visits, among others. Unfortunately, standard graphical models such as multinomial or Gaussian ones are often ill-suited for modeling such data. We have therefore introduced Poisson Dependency Networks (PDNs), a new graphical model for multivariate count data. Its representation naturally facilitates a Gibbs sampler and a very simple, nonparametric training procedure: starting from a simple initial model, which can be a constant value in the simplest case, or alternatively a log-linear Poisson model or a Poisson regression tree, PDNs are represented as sums or products of regression models grown in a stagewise optimization. On several real-world datasets we have demonstrated empirically that PDNs are competitive multivariate count models, both in terms of efficiency and predictive performance, although they do not guarantee a consistent joint distribution in closed form.
The intent of our paper has been to introduce and explore the basic idea of nonparametric dependency networks for multivariate count data. Consequently, there is significant additional work to be done. More work is needed to characterize the set of consistent PDNs, i.e., PDNs with a consistent joint distribution in closed form. This would allow one to characterize the complexity of inference in terms of treewidth and to adapt other inference methods, e.g., based on message-passing-like inference, and other learning methods, e.g., based on Expectation Maximization. As another example, additional work is needed to understand when the joint distribution of an (inconsistent) PDN has low predictive accuracy. Generally, one should explore PDNs within other machine learning tasks such as characterizing neural dependencies (Berkes et al. 2008), training topic models (Gehler et al. 2006) that also capture word dependencies within each topic (Inouye et al. 2014a, b), predicting user behavior such as retention and churn (Hadiji et al. 2014), and recommendation (Gopalan et al. 2014). Another interesting avenue for future work is to exploit functional gradients for learning hybrid multivariate models. Along the way, one should investigate dependency networks for the complete family of generalized linear models; for instance, Dobra (2009) has shown that hybrid dependency networks among Gaussian and logistic variables perform well for discovering genetic networks, and Guo and Gu have shown logistic (conditional) dependency networks to perform well for multilabel classification (Guo and Gu 2011). Upgrading the resulting nonparametric, hybrid dependency networks to relational domains may provide novel structure learning approaches for BLOG (Milch et al. 2005) and probabilistic programming languages (Goodman 2013), which feature Poisson distributions, and would complement relational Gaussian models (Singla and Domingos 2007; Choi and Amir 2010; Ahmadi et al. 2011) as well as relational copulas as proposed by Xiang and Neville (2013) for relational collective classification; additionally, this line of research could pave the way to novel evaluation methods for statistical relational models. In general, copula models for multivariate count data (Hee Lee 2014) are an interesting option, in particular extending them to the relational case based on Xiang and Neville (2013). All this could lead to better methods for mining web-populated knowledge bases such as TextRunner, NELL, YAGO, and KnowledgeGraph. In such open information retrieval tasks, one cannot easily assume a fixed number of “entities” and in turn use existing probabilistic relational models such as Markov Logic Networks: we have to deal with multivariate count distributions.
Footnotes
 1.
If the mean \(\lambda \) of a Poisson distribution \(p(x) = \lambda ^x e^{-\lambda }\slash x!\) is small, say less than 5, then its probability histogram is markedly asymmetrical, making a Normal approximation ill-suited. For large \(\lambda \), in contrast, using Stirling’s formula for x!, i.e., \(x!\approx \sqrt{2\pi x}\, e^{-x}x^x\) as \(x\rightarrow \infty \), one can see that the probability histogram is essentially symmetric and bell-shaped. More precisely, we rewrite \(p(x) \approx (\sqrt{2\pi \lambda }\,x^{x+0.5})^{-1}e^{x-\lambda }\lambda ^{x+0.5}\). In the limit of large \(\lambda \), this approaches the Normal distribution \(p(x)\approx (2\pi \lambda )^{-0.5}e^{-(x-\lambda )^2/(2\lambda )}\). This fact can be traced back to A. De Moivre, Approximatio ad Summam Terminorum Binomii \((a+b)^n\) in Seriem Expansi (London, 1733).
 2.
 3.
We did not compare to TPGMs since they were compared to SPGMs already in Yang et al. (2013).
 4.
For the sake of simplicity we assume the data is fully observed. Together with the Gibbs sampler discussed later, everything can in principle be extended to the partially observed case.
 5.
For stability reasons the implementation uses a revised Bayes estimate of \(\hat{\lambda }\).
 6.
 7.
 8.
 9.
 10.
 11.
As one further exception, we also removed New York City from the data as it presents an extreme outlier in terms of size.
 12.
 13.
Notes
Acknowledgments
The authors would like to thank Ying-Wooi Wan and Zhandong Liu for making the code for Poisson graphical models and for the simulated network recovery experiments from multivariate count data publicly available. This work was partly supported by the DFG Collaborative Research Center SFB 876 project B4 and by the DFG, KE 1686/21, as part of the Coordination Project SPP 1527.
References
 Ahmadi, B., Kersting, K., & Sanner, S. (2011). Multi-evidence lifted message passing, with application to PageRank and the Kalman filter. In Proceedings of the 22nd international joint conference on artificial intelligence (IJCAI).
 Allen, G., & Liu, Z. (2013). A local Poisson graphical model for inferring networks from sequencing data. IEEE Transactions on Nanobioscience, 12, 189–198.
 Bengio, Y., Thibodeau-Laufer, É., Alain, G., & Yosinski, J. (2014). Deep generative stochastic networks trainable by backprop. In Proceedings of the 31st international conference on machine learning (ICML) (pp. 226–234).
 Berkes, P., Wood, F., & Pillow, J. (2008). Characterizing neural dependencies with copula models. In Proceedings of the twenty-second annual conference on neural information processing systems (NIPS) (pp. 129–136).
 Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 36(2), 192–236.
 Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., et al. (2009). Generalized linear mixed models: A practical guide for ecology and evolution. Trends in Ecology and Evolution, 24, 127–135.
 Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. Belmont: Wadsworth.
 Bucila, C., Caruana, R., & Niculescu-Mizil, A. (2006). Model compression. In Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining (KDD) (pp. 535–541).
 Chaudhuri, P., Lo, W. D., Loh, W. Y., & Yang, C. C. (1995). Generalized regression trees. Statistica Sinica, 5, 641–666.
 Chen, Y., Pavlov, D., & Canny, J. (2009). Large-scale behavioral targeting. In Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (CIKM) (pp. 209–218).
 Choi, J., & Amir, E. (2010). Lifted inference for relational continuous models. In Proceedings of the 26th conference on uncertainty in artificial intelligence (UAI).
 Clarke, R. D. (1946). An application of the Poisson distribution. Journal of the Institute of Actuaries, 72, 481.
 Dietterich, T. G., Hao, G., & Ashenfelter, A. (2008). Gradient tree boosting for training conditional random fields. Journal of Machine Learning Research, 9, 2113–2139.
 Dobra, A. (2009). Variable selection and dependency networks for genome-wide data. Biostatistics, 19, 621–639.
 Dobra, A., & Gehrke, J. (2002). SECRET: A scalable linear regression tree algorithm. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 481–487).
 Elith, J., Leathwick, J., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77, 802–813.
 Feller, W. (1968). An introduction to probability theory and its applications. London: Wiley.
 Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.
 Gehler, P., Holub, A., & Welling, M. (2006). The rate adapting Poisson model for information retrieval and object recognition. In Proceedings of the twenty-third international conference (ICML) (pp. 337–344).
 Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
 Ghitany, M., Karlis, D., Al-Mutairi, D., & Al-Awadhi, F. (2012). An EM algorithm for multivariate Poisson regression models and its application. Applied Mathematical Sciences, 6, 6843–6856.
 Goodman, N. (2013). The principles and practice of probabilistic programming. In Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on principles of programming languages (POPL) (pp. 399–402).
 Gopalan, P., Charlin, L., & Blei, D. (2014). Content-based recommendations with Poisson factorization. In Proceedings of the annual conference on neural information processing systems (NIPS) (pp. 3176–3184).
 Guo, Y., & Gu, S. (2011). Multilabel classification using conditional dependency networks. In Proceedings of the 22nd international joint conference on artificial intelligence (IJCAI) (pp. 1300–1305).
 Hadiji, F., Kersting, K., Bauckhage, C., & Ahmadi, B. (2013). GeoDBLP: Geotagging DBLP for mining the sociology of computer science. arXiv preprint arXiv:1304.7984.
 Hadiji, F., Sifa, R., Drachen, A., Thurau, C., Kersting, K., & Bauckhage, C. (2014). Predicting player churn in the wild. In Proceedings of the IEEE conference on computational intelligence and games (CIG).
 Heckerman, D., Chickering, D., Meek, C., Rounthwaite, R., & Kadie, C. (2000). Dependency networks for density estimation, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1, 49–76.
 Hoff, P. (2003). Random effects models for network data. In R. Breiger, K. Carley, & P. Pattison (Eds.), Dynamic social network modeling and analysis: Workshop summary and papers (pp. 303–312). Washington: The National Academies Press.
 Hutter, F., Hoos, H. H., Leyton-Brown, K., & Stützle, T. (2009). ParamILS: An automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36, 267–306.
 Inouye, D., Ravikumar, P., & Dhillon, I. (2014a). Admixture of Poisson MRFs: A topic model with word dependencies. In Proceedings of the 31st international conference on machine learning (ICML) (pp. 683–691).
 Inouye, D., Ravikumar, P., & Dhillon, I. (2014b). Capturing semantically meaningful word dependencies with an admixture of Poisson MRFs. In Proceedings of the annual conference on neural information processing systems (NIPS) (pp. 3158–3166).
 Kaiser, M. S., & Cressie, N. (1997). Modeling Poisson variables with positive spatial dependence. Statistics and Probability Letters, 35(4), 423–432.
 Karlis, D. (2003). An EM algorithm for multivariate Poisson distribution and related models. Journal of Applied Statistics, 30, 63–77.
 Karlis, D., & Ntzoufras, I. (2003). Analysis of sports data by using bivariate Poisson models. Journal of the Royal Statistical Society: Series D (The Statistician), 52(3), 381–393.
 Kersting, K., & Driessens, K. (2008). Nonparametric policy gradients: A unified treatment of propositional and relational domains. In Proceedings of the twenty-fifth international conference (ICML) (pp. 456–463).
 Khot, T., Natarajan, S., Kersting, K., & Shavlik, J. (2011). Learning Markov logic networks via functional gradient boosting. In Proceedings of the 11th IEEE international conference on data mining (ICDM) (pp. 320–329).
 Koller, D., & Friedman, N. (2009). Probabilistic graphical models. Cambridge: The MIT Press.
 Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings 18th international conference on machine learning (pp. 282–289). Morgan Kaufmann, San Francisco, CA.
 Lee, E. H. (2014). Copula analysis of correlated counts. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in econometrics (Chap. 16, pp. 325–348). Bradford: Emerald Group Publishing.
 Lee, D., & Seung, H. S. (2000). Algorithms for nonnegative matrix factorization. In Proceedings of neural information processing systems (NIPS) (pp. 556–562).
 Lehmussola, A., Ruusuvuori, P., Selinummi, J., Huttunen, H., & Yli-Harja, O. (2007). Computational framework for simulating fluorescence microscope images with cell populations. IEEE Transactions on Medical Imaging, 26(7), 1010–1016.
 Lowd, D., & Davis, J. (2014). Improving Markov network structure learning using decision trees. Journal of Machine Learning Research, 15(1), 501–532.
 McCullagh, P., & Nelder, J. (1989). Generalized linear models. London: Chapman and Hall.
 Meinshausen, N., & Bühlmann, P. (2006). High dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3), 1436–1462.
 Milch, B., Marthi, B., Russell, S., Sontag, D., Ong, D., & Kolobov, A. (2005). BLOG: Probabilistic models with unknown objects. In Proceedings of the nineteenth international joint conference on artificial intelligence (IJCAI) (pp. 1352–1359).
 Natarajan, S., Kersting, K., Khot, T., & Shavlik, J. (2014a). Boosted statistical relational learners: From benchmarks to data-driven medicine. Berlin: Springer.
 Natarajan, S., Khot, T., Kersting, K., Gutmann, B., & Shavlik, J. (2012). Gradient-based boosting for statistical relational learning: The relational dependency network case. Machine Learning Journal, 86(1), 25–56.
 Natarajan, S., Leiva, J. M. P., Khot, T., Kersting, K., Re, C., & Shavlik, J. (2014b). Effectively creating weakly labeled training examples via approximate domain knowledge. In ILP.
 Natarajan, S., Saha, B., Joshi, S., Edwards, A., Khot, T., Davenport, E. M., et al. (2014c). Relational learning helps in three-way classification of Alzheimer patients from structural magnetic resonance images of the brain. International Journal of Machine Learning and Cybernetics, 5(5), 659–669.
 Radev, D., Muthukrishnan, P., & Qazvinian, V. (2009). The ACL anthology network corpus. In Proceedings, ACL workshop on natural language processing and information retrieval for digital libraries. Singapore.
 Ravikumar, P., Wainwright, M. J., & Lafferty, J. D. (2010). High-dimensional Ising model selection using a l1-regularized logistic regression. The Annals of Statistics, 38(3), 1287–1936.
 Ridgeway, G. (2006). Generalized boosted models: A guide to the GBM package. R vignette.
 Saul, L., & Lee, D. (2001). Multiplicative updates for classification by mixture models. In Proceedings of neural information processing systems (NIPS) (pp. 897–904).
 Sha, F., Saul, L. K., & Lee, D. D. (2003). Multiplicative updates for large margin classifiers. In Proceedings of the 16th annual conference on computational learning theory (COLT) (pp. 188–202).
 Singla, P., & Domingos, P. (2007). Markov logic in infinite domains. In Proceedings of the twenty-third conference on uncertainty in artificial intelligence (UAI) (pp. 368–375).
 Therneau, T. M., Atkinson, B., & Ripley, B. (2011). rpart: Recursive Partitioning. http://CRAN.R-project.org/package=rpart
 Tsiamyrtzis, P., & Karlis, D. (2004). Strategies for efficient computation of multivariate Poisson probabilities. Communications in Statistics, Simulation and Computation, 33, 271–292.
 Weiss, J., Natarajan, S., Peissig, P., McCarty, C., & Page, D. (2012). Statistical relational learning to predict primary myocardial infarction from electronic health records. In Proceedings of the twenty-fourth annual conference on innovative applications of artificial intelligence (IAAI-12).
 Xiang, R., & Neville, J. (2013). Collective inference for network data with copula latent Markov networks. In Proceedings of the sixth ACM international conference on web search and data mining (WSDM) (pp. 647–656).
 Yang, E., Ravikumar, P., Allen, G., & Liu, Z. (2012). Graphical models via generalized linear models. In Proceedings of the annual conference on neural information processing systems (NIPS) (pp. 1367–1375).
 Yang, E., Ravikumar, P., Allen, G.I., & Liu, Z. (2013). On Poisson graphical models. In Proceedings of the annual conference on neural information processing systems (NIPS) (pp. 1718–1726).
 Yang, Z., & Laaksonen, J. (2007). Multiplicative updates for nonnegative projections. Neurocomputing, 71(1–3), 363–373.