MAP inference algorithms without approximation for collective graphical models on path graphs via discrete difference of convex algorithm

Collective graphical model (CGM) is a probabilistic model that provides a framework for analyzing aggregated count data. Maximum a posteriori (MAP) inference of unobserved variables under given observations is one of the essential operations in CGM. Because the MAP inference problem is known to be NP-hard in general, the current mainstream approach is to solve an alternative problem obtained by approximating the objective function and applying continuous relaxation. However, this approach has two significant drawbacks. First, the quality of the solution deteriorates when the values in the count data are negligible due to the inaccuracy of Stirling’s approximation. Second, the application of continuous relaxation causes the violation of integrality constraints. This paper proposes novel algorithms for MAP inference in CGMs on path graphs to overcome these problems. Our method is based on the discrete difference of convex algorithm (DCA); DCA is a general framework to minimize the sum of a convex function and a concave function by repeatedly minimizing surrogate functions. Utilizing the particular structure of path graphs, we efficiently solve the surrogate function minimization by minimum convex cost flow algorithms. Furthermore, our approach also leads to a new method of solving another important task; MAP inference of the sample size in CGM on path graphs. Our method is naturally applicable to this task, allowing us to design very efficient algorithms. Experimental results on synthetic and real-world datasets show the effectiveness of the proposed algorithms.


Introduction
In recent years, the importance of aggregated count data, which is calculated from the data of multiple individuals, has been increasing (Tanaka et al. 2019;Zhang et al. 2020). Although technologies for acquiring individual data such as sensors and GPS have greatly advanced, it is still very difficult to handle individual data due to privacy concerns and the cost of data collection. However, there are many situations where data aggregated from multiple individuals can be obtained and utilized easily. For example, mobile spatial statistics , which is the hourly population data of fixed-size square grid cells calculated from cell phone network data in Japan, is available for purchase; such data is being used for disaster prevention and urban planning (Suzuki et al. 2013). In traffic networks, traffic volume data at each point can be obtained more easily by sensors or cameras than the trajectories of individual cars, and the data is useful for managing traffic congestion (Morimura et al. 2013;Zhang et al. 2017).
Collective graphical model (CGM) (Sheldon and Dietterich 2011) is a probabilistic model to describe aggregated statistics of a sample drawn from a graphical model. CGM makes it possible to conduct various practical tasks on aggregated count data, such as interpolation of unobserved aggregated count values, denoising of observed count values, and parameter learning of the underlying graphical model. Particularly, the case where the underlying graph is a path graph is important because CGMs on path graphs can treat time series data in which the states of interest follow Markov chains. In fact, most of the realworld applications of CGMs utilize CGMs on path graphs to represent the collective movement of humans and animals (Du et al. 2014;Sun et al. 2015;Akagi et al. 2018). Detailed analyses of time series of collective people movements from limited observations would be useful for controlling people flow to avoid congestion and to maintain social distancing in urban spaces.
One of the essential operations in CGM is maximum a posteriori (MAP) inference. MAP inference is the discrete (combinatorial) optimization problem of finding an assignment of unobserved variables that maximizes the posterior probability under given observations. MAP inference makes it possible to interpolate missing values of aggregated data and estimate more detailed information behind the observations. For example, suppose the population distribution of humans in each area and at each time is given as observations. In that case, the number of people moving between each time and area with the highest posterior probability can be estimated by performing MAP inference on a CGM. This allows us to obtain more detailed information about crowd movements from a series of snapshots of population distribution. Another example is population interpolation; by conducting MAP inference on a CGM with population distribution at two-time points as input, we can obtain population distribution at the time points in between. This allows us to obtain the city's high temporal resolution population dynamics from observations of population distribution at limited time points.
Unfortunately, MAP inference for general CGMs has been shown to be NP-hard (Sheldon et al. 2013) and thus is difficult to solve exactly and efficiently. Therefore, an alternative approach that solves an approximate problem, which is derived by applying Stirling's approximation and continuous relaxation, has been proposed (Sheldon et al. 2013). Subsequent studies have focused on solving this approximate problem efficiently (Sun et al. 2015;Vilnis et al. 2015;Nguyen et al. 2016;Singh et al. 2020).
However, there are inherent problems with this approach to solving the approximate problem. First, this approach tends to output a solution with a low posterior probability when the values in the count tables are small, because Stirling's approximation, log x! ≈ x log x − x , is inaccurate when x is small. Such a situation frequently occurs when the number of values that each variable in the graphical model takes is large, or when the sample size is small. Second, since continuous relaxation is applied, the integrality constraints of count table values are violated in the output. As a result, values that should be integers (e.g., the number of people) are no longer integers, which not only reduces interpretability, but also makes the output less sparse, resulting in high memory consumption to maintain the output. It is possible to obtain integer-valued results by rounding the output, but this rounding process destroys the sum constraints among the estimated counts, e.g., the sum of the count table values at each node may not match the sample size.
To resolve these issues, in this paper, we propose a new method for MAP inference for CGMs on path graphs. We first show that the objective function of the problem can be expressed as the sum of univariate discrete convex functions and discrete concave functions. Based on this expression, we utilize the idea of the difference of convex algorithm (DCA) (An Le Thi and Pham Dinh 2018). DCA is a framework to minimize a function expressed as the sum of a convex function and a concave function. In DCA, a solution is obtained by repeatedly minimizing a surrogate function that upper-bounds the objective function, and the objective function value decreases monotonically in each iteration. In addition, the algorithm terminates in a finite number of iterations in our case since the variables are discrete, not continuous.
The key to make the DCA-based algorithm efficient is a fast minimization algorithm for the surrogate function. Because the feasible region of our problem is limited to integer lattice points, continuous optimization methods such as the gradient descent, which are usually used in DCAs, cannot be applied to minimize our surrogate function. Instead, we utilize the special structure of path graphs; it enables us to formulate the minimization problem of the surrogate function as a combinatorial optimization problem called the minimum convex cost flow problem. Fast algorithms for the minimum convex cost flow problem are known and we can minimize the surrogate function efficiently by using these algorithms.
The proposed method has several practical advantages. First, since the proposed method does not use Stirling's approximation, it offers an accurate inference even when the values in the count tables are small. This makes it possible to output solutions with much higher posterior probability than the approximation-based approach. Second, because the proposed method does not apply continuous relaxation, the obtained solution is guaranteed to be integer-valued, which results in sparse and interpretable outputs. In Sect. 5, we show experimental results gained from synthetic and real-world datasets; they indicate that the proposed method outputs higher quality solutions than the existing approach. We show that the superiority of the proposed method is much greater when the sample size is not very large or the number of states on nodes in the graphical model is large.
Furthermore, our approach also leads to a new method of solving another important task; sample size estimation in CGM on path graphs. The sample size is the number of individuals in a sample obtained from the original graphical model before aggregation; for instance, in human flow analysis, the sample size corresponds to the total number of people in the entire space. In most existing CGM studies, the MAP inference problem is solved under the assumption that the sample size is given in advance (Sheldon et al. 2013;Sun et al. 2015;Singh et al. 2020;Nguyen et al. 2016). However, it is often difficult to know the true sample size exactly a priori. This is because the true sample size cannot be obtained from the observed aggregate values due to observation noise, or worse, some aggregate values may be missing. For example, when dealing with population distribution data of a city, it is possible that the total number of people in the entire city at each time is not constant due to inaccurate observations. In this case, there are several options for the sample size, and we have to manually set the appropriate value. This may be a barrier to real-world applications of the MAP inference problem of CGMs.
A natural approach to tackle this task is to set a prior distribution for the sample size and conduct MAP inference of sample size and unobserved aggregate values simultaneously. However, it is not easy to construct an efficient algorithm for this new MAP inference problem; a naive method is to solve the MAP inference problem with a given sample size for all candidate sample sizes and output the solution with the highest posterior probability among them. But this method requires a lot of computation time because we have to solve many MAP inference problems.
Our approach based on DCA and the minimum convex cost flow algorithms can naturally handle the MAP inference problem even when the sample size is not given. The proposed method can estimate the sample size by solving only one MAP inference problem and thus can output a solution very efficiently. Experimental results confirm that the proposed method can achieve almost the same objective function values with significantly less computational time than the baseline method with a brute force search of sample size.
A preliminary version of this work appeared in the Proceedings of NeurIPS'21 (Akagi et al. 2021). The main difference from Akagi et al. (2021) is that the algorithm is extended to the problem setting when the sample size is unknown, and the effectiveness of the algorithm has been confirmed by experiments using synthetic and real-world datasets.

MAP inference for CGMs
Several methods have been proposed for the MAP inference of CGMs, but most of them take the approach of solving the approximate problem (Sheldon et al. 2013), which is derived by applying Stirling's approximation and continuous relaxation. For example, the interior point method (Sheldon et al. 2013), projected gradient descent (Vilnis et al. 2015), message passing (Sun et al. 2015) and Sinkhorn-Knopp algorithm (Singh et al. 2020) have been used to solve the approximate problem. In particular, Nguyen et al. (2016) proposes a method to use DCA to solve this approximate problem. Although this approach is similar to our proposal in that it uses DCA, the purpose of applying DCA is totally different: our focus is to solve the MAP inference problem without using any approximation or continuous relaxation.
One of the few exceptions is the method proposed in Akagi et al. (2020), which solves the original MAP inference problem directly without using approximation. Our method follows this line of research, but there are two major differences. First, their method can only be applied to CGM on a graph with two vertices, and thus applicability is very limited. Since our method is consistent with this method when applied to CGM on a graph with two vertices, our method can be regarded as a generalization of their method. Second, their work assumes accurate observations and does not handle observation noise. Sheldon et al. (2007) solves related collective MAP inference problems on path graphs. The problems addressed in this paper are different from ours; their purpose is finding the most likely assignments of the entire variables for each individual, while our purpose is finding the most likely node and edge contingency tables. In their settings, non-linear terms 1 3 in the log posterior probability vanish, and the MAP inference problem can be solved easily by linear optimization approaches.

Difference of convex algorithm (DCA)
DCA, which is sometimes called Convex Concave Procedure (Yuille and Rangarajan 2001), is a framework to minimize a function expressed as the sum of a convex function and a concave function (An Le Thi and Pham Dinh 2018). DCA was originally proposed as a method for optimization in continuous domains. DCA has been used in various machine learning fields, such as feature selection (An Le Thi et al. 2015), reinforcement learning (Piot et al. 2014), support vector machines (Xu et al. 2017) and Boltzmann machines (Nitanda and Suzuki 2017).
Several studies have applied DCA to discrete optimization problems. This line of research is sometimes called discrete DCA (Maehara and Murota 2015). (Narasimhan and Bilmes 2005;Iyer and Bilmes 2012) propose algorithms to minimize the sum of a submodular function and a supermodular function. This algorithm is generalized to yield the minimization of the sum of an M/L-convex function and an M/L-concave function (Maehara and Murota 2015), where M-convex function and L-convex function are classes of discrete convex functions (Murota 1998). Although our work is closely related to these studies, it is not part of them. This is because our problem can be regarded as the minimization of the sum of two M-convex functions and a separable concave function, and this is not included in the class of functions dealt with in Maehara and Murota (2015) 1 .

Collective graphical models in previous studies
Collective graphical model (CGM) is a probabilistic generative model that describes the distributions of aggregated statistics of a sample drawn from a certain graphical model (Sheldon and Dietterich 2011). Let G = (V, E) be an undirected tree graph (i.e., a connected graph with no cycles). We consider a pairwise graphical model over discrete random variable X ∶= (X u ) u∈V defined by is the partition function for normalization. In this paper, we assume that x u takes values on the set [R] for all u ∈ V , where [k] denotes the set {1, 2, … , k} for a positive integer k.
We draw an ordered sample (X (1) , … , X (M) ) independently from the graphical model, where M is the sample size. Let n u ∶= (n u (i)) i∈ [R] and n uv ∶= (n uv (i, j)) i,j∈ [R] , where n u (i) ∶= |{m | X (m) u = i}| and n uv (i, j) ∶= |{m | X (m) u = i, X (m) v = j}| . Each entry of n u and n uv is the number of occurrences of a particular variable setting (see Fig. 1). We

3
call (n u ) u∈V node contingency table and (n uv ) (u,v)∈E edge contingency table, and denote n ∶= ((n u ) u∈V , (n uv ) (u,v)∈E ) . We assume that observations y ∶= (y u ) u∈V are generated by adding noise to the node contingency table (n u ) u∈V , and the distribution of y is given by where p ui is the noise distribution. An additional assumption is described below.
Assumption 1 For u ∈ V and i ∈ [R] , log p ui (y | n) is a concave function in n.
Assumption 1 is a quite common assumption in CGM studies (Sheldon et al. 2013;Sun et al. 2015). Commonly used noise distributions such as Gaussian distribution and Poisson distribution p ui (y u (i) | n u (i)) = n u (i) y u (i) ∕y u (i)! ⋅ exp(−n u (i)) satisfy Assumption 1. It is also possible to consider observation models other than the one described here. For example, some observations may be missing, noiseless observations may be obtained, or some or all of the elements of the noisy edge contingency table may be observed. For the sake of simplicity, we limit ourselves to models where noisy vertex contingency tables are observed, but the following description and the proposed method can be generalized to the above cases as well.
The MAP inference problem for CGM is to find n that maximizes the posterior probability Pr(n | y) . The MAP inference problem is the operation of finding the vertex/edge contingency table with the highest posterior probability from noisy observations. It is of great importance in CGM research because it allows interpolation of missing values in aggregate data and estimation of more detailed information hidden behind observations. For more specific applications, see the example in Sect. 3.3. Since Pr(n | y) = Pr(n, y)∕ Pr(y) from Bayes' rule, it suffices to maximize the joint probability Pr(n, y) = Pr(n) ⋅ Pr(y | n) . Pr(n) is called CGM distribution and calculated as follows (Sun et al. 2015): Here, (⋅) is the indicator function, u is the degree of node u in G, and ℤ M is the set of possible contingency tables. Using the above notations, the MAP inference problem can be written as

Prior distribution for sample size
In this paper, we consider both existing problem setups where the sample size is given as input and new problem setups where the sample size is not given as input. As we will see later, the existing problem setup can be considered a particular case of the new problem setup, so we will discuss the new problem setup, i.e., when the sample size is unknown, and mention the particular case when necessary.
We set a prior probability distribution of the sample size, Pr(M) = q(M) , where M ≥ 0 is a random variable which represents the sample size. We make an assumption on the prior probability distribution q(M).
Many practical probability distributions satisfy Assumption 2; e.g. the discrete Note that by setting we obtain the existing problem setup where the sample size M given is given as input. This indicates that our new formulation is a generalization of the existing problem setup. Therefore, all subsequent discussions are also applicable to existing problem settings. The posterior distribution of (M, n) under the given observation y is written as Pr(M, n | y) = Pr(M, n, y)∕ Pr(y) ∝ q(M) Pr(n | M) Pr(y | n) . Therefore, by the same argument as the derivation of (4), the MAP inference problem can be written as where and F(M, n) is the same function as F(n) in (3) with M added to the argument.

CGMs on path graphs
Hereafter, we focus on CGMs on path graphs, which is the main topic of this paper. Path graph P T is an undirected graph whose vertex set is V = [T] and edge set is Fig. 2). A graphical model (not CGM) on path graph is the most basic graphical model that represents a time series generated by a Markov model; that is, the current state depends only on the previous state. A CGM on a path graph represents the distribution of aggregated statistics when there are many individuals whose state transition is determined by a Markov model. In the rest of this paper, we use the notation n ti ∶= n t (i) , n tij ∶= n t,t+1 (i, j) , and tij ∶= t,t+1 (i, j) for simplicity.
We give an example of a CGM on a path graph which models human mobility. Consider that a space is divided into R distinct areas and that M people are moving around in the space. The random variable X (m) t represents the area to which person m belongs at time step t, and the time series . Here, tij is the affinity between two areas i and j at time step t → t + 1 . n ti represents the number of people in area i at time step t, and n tij represents the q given (M) = 1 M = M given , 0 otherwise,

Fig. 2 A path graph P T
number of people who moved from area i to j at time step t → t + 1 . We have noisy observations y ti for t ∈ [T] and i ∈ [R] , which are generated by adding noise to n ti . The MAP inference problem we want to solve is to find the sample size M, the true number of people of each area at each time step, (n ti ) t∈ [T],i∈ [R] , and the true number of people moving between each two areas, (n tij ) t∈ [T−1],i,j∈ [R] , with the highest posterior probability given the observation (y ti ) t∈ [T],i∈ [R] . The MAP inference problem of (M, n) can be formulated as follows from (2), (3), (6), (7): Note that Z is the partition function of the original graphical model (see (1)).
We can derive the optimization problem (8) as follows; because holds on path graphs, we have from (3). This gives where C is a constant. We can verify easily that the feasible region of problem (8) is ℤ defined in (7).

Application of DCA
To solve problem (8), we propose utilizing the idea of the Difference of Convex Algorithm (DCA). Before describing our method, we review a discrete version of DCA (Maehara and Murota 2015;Maehara et al. 2018). DCA is a general framework to solve the minimization problem Here, we define the convexity for discrete functions as follows. 4 To solve problem (9), DCA generates a solution sequence x (1) , x (2) , … by the following procedure: Step 1 Choose an arbitrary feasible solution x (1) ∈ D and set the iteration counter as s ← 1.
Step 2 Find a function R (s) ∶ ℝ d → ℝ such that Note that such a function exists since R is discrete concave.
To apply the DCA framework, the objective function must be expressed as the sum of convex and concave functions. The following proposition shows that our MAP inference problem in (8) has such a structure.

Proposition 3 Let D be the feasible region of problem (8), and define discrete functions
The proof is given in the "Appendix". Hereafter, we set functions Q and R as in Proposition 3. As the objective function of problem (8) is written as P(M, n) = Q(M, n) + R(M, n) , we can apply DCA to our problem.
The following proposition explicitly provides an efficiently computable upper bound of R we can use in Step 2.

Proposition 4 Define a function
Please see the "Appendix" for the proof.

Minimum cost flow algorithm for the subroutine
The most important and difficult part to derive an efficient DCA-based algorithm is designing efficient algorithms for subproblem (11). To achieve this, we show that the subproblem can be formulated as the Minimum Convex Cost Flow Problem (C-MCFP), which is the efficiently solvable subclass of the Minimum Cost Flow Problem (MCFP). The (nonlinear) MCFP is a combinatorial optimization problem on a directed graph G = (V, E) . Each node i ∈ V has a supply value b i ∈ ℤ , and each edge (i, j) ∈ E has a cost function MCFP is the problem of finding a minimum cost flow on G that satisfies the supply constraints at all nodes. MCFP can be described as follows: Note that z takes only integer values (i.e., z ∈ ℤ |E| ). A subclass of MCFP in which all cost functions are discrete convex functions (see Definition 1) is called the C-MCFP; it is known to be efficiently solvable (Ahuja et al. 1993).
The following proposition shows that the subproblem min (M,n)∈DP (s) (M, n) can be formulated as a C-MCFP.
Proposition 5 Define the MCFP instance as follows: where (u, v, c(z)) represents a directed edge from node u to node v with cost function c(z), Let z * is an optimal solution of this MCFP instance, and define M * by M * ∶= z * d,o , n * by n * ti ∶= z *  (M, n) . Furthermore, the MCFP instance belongs to C-MCFP.
The proof is given in the "Appendix". Figure 3 illustrates an example of the MCFP instance defined in Proposition 5. The above proposition enables us to solve the subproblem cost flow problem such that b v = 0 for all v ∈ V , as with this problem, is called a minimum cost circulation problem (Tardos 1985).

Overall view of the proposed method and time complexity analysis
From the above arguments, we can construct an efficient optimization algorithm, described in Algorithm 1, for the MAP inference Problem (8). Under the assumption that the support of the prior distribution q(M) is finite, the algorithm is guaranteed to terminate after a finite number of iterations because P(M (s) , n (s) ) monotonically decreases and D is a finite set. We analyze the time complexity of one iteration of the proposed method (Lines 3-5 in Algorithm 1). The computational bottleneck is solving C-MCFP in Line 3. There are several algorithms to solve C-MCFP and time complexity varies depending on which one is adopted. In this paper, we consider two typical methods, the successive shortest path algorithm (SSP) and the capacity scaling algorithm (CS) (Ahuja et al. 1993).
SSP is an algorithm that successively augments unit flow along the shortest path from a supply node (i.e. b i > 0 ) to a demand node (i.e. b i < 0 ) in the residual graph, which is an auxiliary graph calculated from the current flow. Given a C-MCFP instance with graph G = (V, E) , the shortest path in the residual graph can be computed in O(|E| log |V|) time by Dijkstra's algorithm with a binary heap, and the augmentation of the flow can be done in O(|E|) time. The augmentation is performed B ∶= ( ∑ i∈V � � b i � � )∕2 times totally, so the total time complexity is O(B|E| log |V|) . CS resembles SSP, but it differs in that it tries to push a large amount of flow, rather than a unit amount of flow, in a single augmentation. In CS, the number of shortest path calculations and flow augmentations can be bounded Although b v = 0, ∀v ∈ V in the C-MCFP instance constructed in Proposition 5, we have to push M max flow from o to d beforehand to eliminate the negative cost edge, where M max is the maximum value M can take (i.e. the maximum value of the support of the prior distribution q(M) ). Therefore, B = Θ(M max ) and U = Θ(M max ) holds in our problem. Because |V| = O(TR) and |E| = O(TR 2 ) , the time complexity in one iteration is O(M max TR 2 log(TR)) when SSP is applied and O(T 2 R 4 log(TR) log M max ) when CS is applied. As a special case, the time complexity for the MAP inference problem with a given sample size (the problem setting in previous studies) with SSP is O(M given TR 2 log(TR)) and with CS is O(T 2 R 4 log(TR) log M given ) where M given is the given sample size, because M max = M given . These result imply that each method has its own advantages and disadvantages: SSP has small time complexity for T and R, while CS has small time complexity for M max or M given . This difference is confirmed empirically in Sect. 5.

Discussions
Here, we discuss why it is possible to construct an efficient algorithm when the graph is a path graph. The difficulty of MAP inference in CGM can be decomposed into the following two factors. The first is the non-convexity of the objective function; the objective function is the sum of convex functions and concave functions, as expressed in (8), and the objective function as a whole is not convex. The second is many constraints; as can be seen from (7), the variables M and n has a lot of complex constraints, all of which must be considered. The proposed method addresses difficulty (i) with the discrete DCA. Difficulty (ii) is addressed by restricting the graph to path graphs. An essential property of path graphs is that they only contain vertices of degree 2 or less. Because the constraints on vertices of degree 2 or less can be expressed as flow conservation laws, we can formulate the problem as a minimum convex cost flow problem by constructing an appropriate flow network.
When considering graphical models on tree graphs other than path graphs, there are always vertices of degree 3 or higher in the graph, and constraints around these vertices cannot be expressed as flow-preserving laws. Therefore, to extend our approach to graphs other than path graphs, it is necessary to develop a different method than network flow to handle constraints at vertices of degree 3 or higher. We leave this extension to future work.

Experiments
We perform experiments to evaluate the effectiveness of the proposed methods. All experiments are conducted on a 64-bit macOS machine with Intel Core i7 CPUs and 16 GB of RAM. All algorithms are implemented in C++ (gcc 9.1.0 with -O3 option). Experiments are conducted both in the existing problem setting where sample size is given as input and in the new problem setting where no sample size is given. The experiments are as follows: • Experiments using synthetic instances with a given sample size in 5.1, • Experiments using synthetic instances without a given sample size in 5.2, • Experiments using real-world instances with and without a given sample size in 5.3, • Experiments on histogram interpolation, one of the applications of MAP inference, with synthetic instances in 5.4.

Settings
We solve randomly generated synthetic instances of the MAP inference problem (8). We fix T to 5 and vary the values of R and M. We use two types of potential functions as follows.
1. uniform. tij is independently drawn from a uniform distribution on the set of integers {1, 5, 10}.
2. distance. We set tij = 1 |i−j+1| . This potential models the movement of individuals in one-dimensional space: the state indices i and j represent coordinates in the space, and the closer the two points are, the more likely are movements between them to occur.
The input observations y are generated by the following procedure; first, we generate independent M samples by Gibbs sampling in the graphical model (Bishop and Nasrabadi 2006), then we calculate the true contingency tables n true by aggregating the generated samples, and finally we get y according to the Gaussian observation distribution p ti (y ti | n ti ) ∝ exp − (y ti −n ti ) 2 10 . The sample size M is given as input of the algorithms. To construct surrogate functions in the proposed method, we can choose arbitrary (s) ti which satisfies the condition − log(n (s) ti + 1) ≤ (s) ti ≤ − log n (s) ti (see Proposition 4). To investigate the influence of the choice of (s) ti , we try three strategies to decide (s) ti : (1) . We call them Proposed (L), Proposed (M), Proposed (R), respectively. Note that when the sample size is given, the term k(M) + g(M) in the objective function of (8) can be ignored because M is a constant, and it is not necessary to determine (s) .
As the compared method, we use Non-Linear Belief Propagation (NLBP) (Sun et al. 2015), which is a message-passing style algorithm to the solve approximate MAP inference problem derived by applying Stirling's approximation and continuous relaxation. Because the output of NLBP is not integer-valued and log(z!) is defined only if z is an integer, we cannot calculate the objective function of (8) directly. To address this, we calculate it by replacing the term log(z!) by linear interpolation of log(⌊z⌋!) and log(⌈z⌉!) , which is given by (⌈z⌉ − z) ⋅ log(⌊z⌋!) + (z − ⌊z⌋) ⋅ log(⌈z⌉!) . Note that although there are various algorithms to solve the approximate MAP inference problem (see Sect. 2.1), the objective function values attained by these algorithms are the same. This is because the approximate problem is a convex optimization problem (Sheldon et al. 2013).
We also calculated and compared the error between the solution output by each algorithm and the ground truth contingency table which was generated by Gibbs sampling and aggregation. The error metric we used is normalized absolute error (NAE), which is defined as where n true tij is the ground truth value of the edge contingency table and n est tij is the estimated value of the edge contingency table. Note that NAE is a metric that has been used frequently in previous studies of CGM (Tanaka et al. 2018;Akagi et al. 2018; Iwata and Shimizu 2019).

Results
First, we compare the attained objective values and NAEs. The results are shown in Table 1. We generate 10 instances for each parameter setting and determined the average of attained objective function values. Because the objective function P(n) is equal to − log Pr(n | y) + const. , P(n) takes both positive and negative values, and the difference of the objective function values is essential; when P(n 1 ) − P(n 2 ) = , Pr(n 1 | y) = exp(− ) ⋅ Pr(n 2 | y) holds. All the proposed methods consistently have smaller objective function values than the compared method. The difference tends to be large when R is large and M is small. This would be because small values appear in the contingency table more frequently when R is large and M is small, and the effect of the inaccuracy of Stirling's approximation becomes larger. Among the proposed methods, proposed (L) achieves the smallest objective function values in most cases. This finding is considered to be an important guideline for determining hyperparameters (s) ti . Furthermore, in most cases, all the proposed methods achieve smaller NAEs than the existing method. As with the objective function values, the differences tend to be larger when R is large or M is small. This confirms the superior performance of the proposed method in terms of estimation accuracy. However, NLBP achieves a slightly smaller NAE when R = 10, M = 10 3 or R = 20, M = 10 3 with the distance potential. This may be due to the fact that the values included in the contingency table are large enough that Stirling's approximation becomes accurate and NLBP is able to output a good solution. Among the three proposed methods, there was not much difference in NAEs. This indicates that the proposed method is robust with respect to the choice of hyperparameters (s) ti in terms of estimation accuracy.
To compare the characteristics of solutions obtained by proposed (L) and NLBP, we solve an instance with R = 20 , M = 10 2 , and uniform potential by each method. Obtained edge contingency tables n 1ij are shown in Fig. 4 as heat maps. We also show the edge contingency table obtained by rounding each element of the NLBP solution to the nearest integer. We observe that the proposed method outputs sparse solutions while the solutions by NLBP are blurred and contain a lot of non-zero elements. This difference is quantified by "sparsity", which is calculated by Sparsity of the output of proposed (L) is 77%, while the sparsity of the output of NLBP is 0%. This is caused by its application of continuous relaxation and the inaccuracy of Stirling's approximation around 0. For NLBP (rounded), many near-zero values are rounded to 0 and the solution is sparser than the Proposed (L) solution. But the constraints of the   (12) and Constraints Violation is defined by (13) problem (8)  for each solution n est (we call this value "constraints violation"). Each term in this formula corresponds to a constraint of the optimization problem (8), and this value measures the violation of constraints of the optimization problem (8). For the solutions output by Proposed (L) and NLBP, this value is 0, indicating that the constraints are not violated, but for NLBP (rounded), the value is 82.0, indicating that the constraints are violated drastically. In additional experiments, we observed that the outputs of the three methods become closer as M increases. We compare the computation time of each algorithm. As explained in Sect. 4.3, we can choose an arbitrary C-MCFP algorithm as the subroutine in the proposed method and the time complexity varies depending on the choice. We compare proposed (L) with SSP, proposed (L) with CS, and NLBP. Figure 5 shows the relationship between input size and computation time. These results are consistent with the complexity analysis results in Sect. 4.3; SSP is efficient when R is large but becomes inefficient when M is large, and the converse is true for CS. The results also suggest that it is important to choose the algorithm depending on the size parameter of the input. The proposed method is not much worse than the existing method in terms of computation time by choosing an appropriate C-MCFP algorithm according to the size 1 3

Settings
We compared methods for MAP inference problem without a given sample size using random instances generated in the same way as in Sect. 5.1. Note that the sample size used to generate random instances is not given as input to the algorithms. We use Gaussian distribution p ti (y ti | n ti ) ∝ exp − (y ti − n ti ) 2 as the noise distribution and the uniform distribution q uniform,10 6 (M) defined by (5) as the prior distribution of M.
We prepare two proposed methods; one is to solve the instance of C-MCFP constructed in Proposition 5 using SSP (DCA-SSP) and the other is to solve it using CS (DCA-CS). In both methods, we set (s) ti = − log(n (s) ti ) , (s) = − log M (s) when constructing the surrogate function (see Proposition 5).
Because no methods for MAP inference of the sample size have been proposed so far, we use a naive approach to solve the optimization problem (8) as baselines; solving the problem for fixed M by the algorithm for the MAP inference problem with a given sample size with a brute force search of M ∈ {0, 1, … , M max } . We prepare two baseline methods, one using SSP (BF-SSP) and the other using CS (BF-CS), for solving the problem (8) with fixed M . We set M max = max(10, 2 ⋅ ∑ R i=1 y 1i ).

Results
First, we compare the attained objective values and NAEs by DCA-SSP, DCA-CS, BF-SSP, BF-CS. The results are shown in Table 2. As shown, the four methods achieve nearly identical objective function values and NAEs in all cases. BF-SSP and BF-CS are naive methods that conduct brute force searches of the sample size, so the objective function value achieved by them is considered to be the smallest among the currently available methods. The fact that the proposed methods, DCA-SSP and DCA-CS, achieve almost the same objective function values as BF-SSP and BF-CS indicates that the optimization performances of the proposed methods are sufficiently high. Second, we compare the computation time of the four methods. Figure 7 shows the relationship between input size and computation time. It can be seen that the two proposed methods are much faster than the two baseline methods. In particular, the difference is  significantly larger when M is large. This is because the proposed method needs to apply DCA only once, while the baseline method needs to apply it M max times. In the comparison between proposed methods, DCA-CS is faster when M is large, and DCA-SSP is faster when R is large. This result is consistent with the results of the time complexity analysis. Overall, the proposed methods achieve almost the same objective function values and NAEs as the baseline methods in much smaller computation times, demonstrating the effectiveness of the proposed methods.

Real-world Instances
We conduct experiments using real-world population datasets in both settings where a sample size is given and not given.

Settings
The datasets are generated from 8694 car trajectories collected by a car navigation application in the Greater Tokyo area, Japan 5 . We randomly sample M (M = 100, 500, 1000) trajectories from this data and create aggregated population data of each area at fixed time intervals. The areas are decided by dividing the targeted geospatial space into fixedsize grid cells. The grid size is set to 10 km × 10 km ( R = 8 × 7 = 56 ) and 5 km × 5 km ( R = 16 × 13 = 208 ), and time interval is 60 min ( T = 24).
Aggregated population data of humans often contain noise based on the following scenarios. The first scenario is that it is difficult to obtain accurate aggregate values due to the poor performance of the sensors and equipment used to observe location information. The second scenario is that noises are added artificially to aggregate values when published to prevent identifying individual information. To take such a situation into account, we assume noisy observation. As the noise distribution, we use Gaussian distribution p ti (y ti | n ti ) ∝ exp −(y ti − n ti ) 2 . We construct the potential tij = exp (− dist (i, j)) , where dist (i, j) is the Euclidean distance between the centers of cell i and cell j in the grid space. We create 10 instances by random sampling and averaged the attained objective function values for each setting. We also evaluate the estimation accuracy of the edge contingency table (n tij ) t∈ [T−1],i,j∈[R] by NAE.
We solved the MAP inference problem in both problem settings where the sample size is given and not given. In the former setting, the compared methods are proposed (L) and NLBP. In addition, we also evaluate the estimation accuracy of NLBP (rounded), which is a method that rounds the output of NLBP to integer values. Note that we do not evaluate the objective function values of NLBP (rounded). This is because the output of NLBP (rounded) completely violates the constraints on summation in the MAP inference problem (8) and the objective function value cannot evaluate whether the optimization problem is successfully solved or not. In the setting where the sample size is not given, the compared methods are DCA-SSP and DCA-CS, which is explained in Sect. 5.2. The surrogate function and the prior distribution for the sample size are the same as those depicted in Sect. 5.2.  Table 3 shows the results. The top table shows the results for a problem with a given sample size, and the bottom table shows the results for a problem without a given sample size.  The bottom table also shows the sample sizes estimated by each method. First, in the top table, we observe that Proposed (L) consistently attain smaller objective values and NAEs than the existing method. The superiority of the proposed method increase when R is large and M is small, and this is the same trend as the results with synthetic data. The NAE of NLBP is relatively large; this is due to the fact that small values are assigned to the elements of the output that should be 0, which is the same phenomenon seen in Fig. 4. The NAE values are improved to some extent by rounding, but the proposed method is still superior.

Results
In the setting where the sample size is not given (the bottom table in Table 3), the two proposed methods (DCA-SSP and DCA-CS) achieved similar objective function values and NAEs, suggesting that the effect of the method of solving the minimum convex cost flow problem on the estimation performance is small. For cases other than M = 100 , the proposed method achieves the same level of NAE as the problem setting where the sample size is given. This indicates that the proposed method can successfully estimate the sample size even when the sample size is not given. Note that it is not very meaningful to compare the objective function values between the setting where the sample size is given and not because their objective functions are different. For M = 100 , NAEs are worse than in the problem setting where the sample size is given. This may be because the estimated sample size (the value of "estimated M" in the table) is underestimated compared to the true value. When the problem is highly sparse (i.e., M is small and R is large), the sample size tends to be underestimated (this tendency is also confirmed in experiments using synthetic data). There are possible ways to get around this, such as using a probability distribution that assigns more probability to large values as a prior distribution for sample size.

Histogram interpolation
As an application of MAP inference of CGMs on path graphs, we can interpolate the time series of histograms. This application is useful, for example, when the population distribution can only be observed at rough time intervals. By estimating the population distribution at fine time intervals through interpolation, it will be possible to understand the movement of crowds in more detail.
In this section, we show experimental results on this application and discuss the differences between the output of the proposed method and that of the existing method.

Settings
First, we briefly explain how to realize interpolation between two histograms by MAP inference of CGMs on path graphs. Suppose we are given histogram 1 ∶= [ 11 , … , 1R ] at time 1 and the histogram T ∶= [ T1 , … , TR ] at time T. The interpolated histogram t at time t (= 2, … , T − 1) is calculated by the following procedure.
1. Consider a CGM on a path graph with T vertices. 2. Let y 1 = 1 and y T = T . 3. y t (t = 2, … , T − 1) is treated as a missing value. This can be achieved by setting h ti (z) = 0 (t = 2, … , T − 1, i ∈ [R]) in the objective function of the problem (8).
4. Find a solution n * to the MAP inference problem under an appropriate potential . 5. Obtain an interpolation result by ti = n * ti (t = 2, … , T − 1, i ∈ [R]).
In our experiment, we consider a grid space of size 5 × 5 = 25 (= R) and a histogram ∶= [ 1 , … , R ] with a value i for each cell i (= 1, … , R) . To get interpolation results which consider the geometric structure defined by Euclidean distance in the grid space, we set the potential tij = exp(−(r i − r j ) 2 − (c i − c j ) 2 )) , where (r i , c i ) is the two-dimensional coordinate of the center of cell i in the grid space. We set T = 6 and use Gaussian distribution p ti (y ti | n ti ) ∝ exp(−5(y ti − n ti ) 2 ) for the noise distributions at t = 1, T . The sample size M is given as input, and M is set to 20.

Results
The results are shown in Fig. 8. Note that Fig. 8 illustrates different objects from what is shown in Figs 4; 8 illustrates the interpolated node contingency table values n ti as twodimensional grid spaces, while Fig. 4 illustrate edge contingency table values n 1ij as matrices. As shown in the figure, NLBP tends to assign non-zero values to many cells, while proposed (L) assigns non-zero values to a small number of cells, resulting in sparse solutions. Moreover, the outputs of the proposed (L) are integer-valued while those of NLBP are not. This characteristic of the proposed method is beneficial for interpretability when the histogram values are the numbers of countable objects (e.g., the number of people in the area). The figure also describes the objective function values achieved. In all cases, the objective function values achieved by the proposed method are much smaller than those by NLBP, which confirms that the proposed method is able to output interpolation results with large posterior probabilities. Note that the sparse and sharp nature of the output of the proposed method may mislead the readers to believe that there is a solution with a posteriori probability significantly higher than the others; there are many solutions with almost identical posterior probabilities around the MAP solution in the MAP inference problem we are solving here. Care should be taken when interpreting the output of the proposed method.

Conclusion
In this paper, we propose a non-approximate method to solve the MAP inference problem for CGMs on path graphs. Our algorithm is based on an application of DCA. In the algorithm, surrogate functions can be constructed in closed-form and minimized efficiently by C-MCFP algorithms. Our method is naturally applicable to problem settings where sample size is not given as input. Experimental results show the effectiveness of our algorithms both in the quality of solutions and computation time.  8 Three examples of interpolation results yielded by each method. In each example, three sequences of histograms in the two-dimensional grid space are presented; the first row shows the input histograms 1 and T , the second row shows the interpolation results obtained by proposed (L), and the third row shows the interpolation results obtained by NLBP by (A1), we obtain the desired inequality as follows: For y < x , it follows from the same argument that ̄( y) ≤ (y) . ◻ From Lemma 6, it suffices to show that f tij (z), −g(z), h ti (z), k(z) are univarite discrete convex functions. From Lemma 7, the proof is completed by showing The inequation (A2) holds because The inequation (A3) holds because The inequation (A4) holds because − log p ti (y ti | z) is a continuous convex function in z from Assumption 1. The inequation (A5) holds from Assumption 2. ◻

A.3 Proof of Proposition 5
Proof There is a one-to-one correspondence between a feasible solution to the problem (8), (M, n) , and a feasible solution to the MCFP instance constructed, z , under the relationship M = z d,o , n ti = z u t,i w t,i and n tij = z w t,i u t+1,j ; the constraint ∑ i∈ [R] n ti = M corresponds to the flow conservation rule at node o and d, the constraint ∑ j∈ [R] n tij = n ti corresponds to the flow conservation rule at node w t,i and the constraint ∑ i∈ [R] n tij = n i+1,j corresponds to the flow conservation rule at node u t+1,j . Moreover, corresponding (M, n) and z have the same objective function value in problem (8) and the MCFP instance, respectively. These facts yield that (M * , n * ) is the optimum solution of the problem (8). Because all the cost functions are discrete convex (this can be easily verified by (A2, A3, A4, A5) and definition of ḡ (s) ti (z) and ḡ (s) (z) in Proposition 4), the constructed instance belongs to C-MCFP. ◻ Funding This research is funded by the Nippon Telegraph and Telephone Corporation (NTT).

Availability of data and material
The authors cannot make data public because it is the confidential information of NTT, the company for which they work.Code availability The authors cannot make codes public because it is the confidential information of NTT, the company for which they work.

Declarations
Conflicts of interest All authors are employed and paid by NTT.
Ethics approval Not applicable.

Consent for publication Not applicable.
Authors' contributions All authors contributed to the study conception. The details of the proposed methods were developed by YA and NM. Implementation and analysis of the experimental results were conducted by YA. The first draft of the manuscript was written by YA and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.