
Optimal representative sample weighting


We consider the problem of assigning weights to a set of samples or data records, with the goal of achieving a representative weighting, which happens when certain sample averages of the data are close to prescribed values. We frame the problem of finding representative sample weights as an optimization problem, which in many cases is convex and can be efficiently solved. Our formulation includes as a special case the selection of a fixed number of the samples, with equal weights, i.e., the problem of selecting a smaller representative subset of the samples. While this problem is combinatorial and not convex, heuristic methods based on convex optimization seem to perform very well. We describe our open-source implementation rsw and apply it to a skewed sample of the CDC BRFSS dataset.



Availability of data and material.

All data and material are freely available online at


References

  • Agrawal, A., Verschueren, R., Diamond, S., Boyd, S.: A rewriting system for convex optimization problems. J. Control Decis. 5(1), 42–60 (2018)

  • Angeris, G., Vučković, J., Boyd, S.: Computational bounds for photonic design. ACS Photonics 6(5), 1232–1239 (2019)

  • MOSEK ApS: MOSEK modeling cookbook (2020)

  • Bethlehem, J., Keller, W.: Linear weighting of sample survey data. J. Off. Stat. 3(2), 141–153 (1987)

  • Bishop, Y., Fienberg, S., Holland, P.: Discrete Multivariate Analysis. Springer, New York (2007)

  • Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  • Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2010)

  • Centers for Disease Control and Prevention (CDC): Behavioral Risk Factor Surveillance System survey data (2018a)

  • Centers for Disease Control and Prevention (CDC): LLCP 2018 codebook report (2018b)

  • Chen, S., Donoho, D., Saunders, M.: Atomic decomposition by basis pursuit. SIAM Rev. 43(1), 129–159 (2001)

  • Daszykowski, M., Walczak, B., Massart, D.: Representative subset selection. Anal. Chim. Acta 468(1), 91–103 (2002)

  • Deming, W., Stephan, F.: On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Stat. 11(4), 427–444 (1940)

  • Deville, J.-C., Särndal, C.-E., Sautory, O.: Generalized raking procedures in survey sampling. J. Am. Stat. Assoc. 88(423), 1013–1020 (1993)

  • Diamond, S., Boyd, S.: CVXPY: a Python-embedded modeling language for convex optimization. J. Mach. Learn. Res. 17(83), 1–5 (2016)

  • Diamond, S., Takapoui, R., Boyd, S.: A general system for heuristic minimization of convex functions over non-convex sets. Optim. Methods Softw. 33(1), 165–193 (2018)

  • Domahidi, A., Chu, E., Boyd, S.: ECOS: an SOCP solver for embedded systems. In: 2013 European Control Conference (ECC), pp. 3071–3076. IEEE, Zurich (2013)

  • Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 20(1), 18–36 (2004)

  • Fougner, C., Boyd, S.: Parameter selection and preconditioning for a graph form solver. In: Emerging Applications of Control and Systems Theory, pp. 41–61. Springer, Cham (2018)

  • Fu, A., Narasimhan, B., Boyd, S.: CVXR: an R package for disciplined convex optimization. J. Stat. Softw. 94, 1–34 (2019)

  • Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. In: Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences, pp. 95–110. Springer, London (2008)

  • Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming, version 2.1 (2014)

  • Gurobi Optimization: Gurobi optimizer reference manual (2020)

  • Heckman, J.: The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Ann. Econ. Soc. Meas. 5, 475–492 (1976)

  • Holt, D., Smith, F.: Post stratification. J. R. Stat. Soc. Ser. A 142(1), 33–46 (1979)

  • Horvitz, D., Thompson, D.: A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47(260), 663–685 (1952)

  • Iannacchione, V., Milne, J., Folsom, R.: Response probability weight adjustments using logistic regression. Proc. Surv. Res. Methods Sect. Am. Stat. Assoc. 20, 637–642 (1991)

  • Jones, E., Oliphant, T., Peterson, P.: SciPy: open source scientific tools for Python (2001)

  • Kalton, G., Flores-Cervantes, I.: Weighting methods. J. Off. Stat. 19(2), 81 (2003)

  • Karp, R.: Reducibility among combinatorial problems. In: Complexity of Computer Computations, pp. 85–103. Springer, Boston (1972)

  • Kish, L.: Weighting for unequal \(p_i\). J. Off. Stat. 8(2), 183–200 (1992)

  • Kolmogorov, A.: Sulla determinazione empirica di una legge di distribuzione [On the empirical determination of a distribution law] (1933)

  • Kruithof, J.: Telefoonverkeersrekening [Telephone traffic calculus]. De Ingenieur 52, 15–25 (1937)

  • Kullback, S., Leibler, R.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)

  • Lambert, J.: Observationes variae in mathesin puram [Various observations in pure mathematics]. Acta Helvetica, physico-mathematico-anatomico-botanico-medica 3, 128–168 (1758)

  • Lavallée, P., Beaumont, J.-F.: Why we should put some weight on weights. Survey Methods: Insights from the Field (SMIF) (2015)

  • Lepkowski, J., Kalton, G., Kasprzyk, D.: Weighting adjustments for partial nonresponse in the 1984 SIPP panel. In: Proceedings of the Section on Survey Research Methods, pp. 296–301. American Statistical Association, Washington, DC (1989)

  • Löfberg, J.: YALMIP: a toolbox for modeling and optimization in MATLAB. In: IEEE International Conference on Robotics and Automation, pp. 284–289. IEEE (2004)

  • Lumley, T.: Complex Surveys: A Guide to Analysis Using R, vol. 565. Wiley, Hoboken (2011)

  • McKinney, W.: Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference, pp. 56–61 (2010)

  • Mercer, A., Lau, A., Kennedy, C.: How different weighting methods work (2018)

  • Neyman, J.: On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. J. R. Stat. Soc. 96(4), 558–625 (1934)

  • O’Donoghue, B., Chu, E., Parikh, N., Boyd, S.: Conic optimization via operator splitting and homogeneous self-dual embedding. J. Optim. Theory Appl. 169(3), 1042–1068 (2016)

  • Parikh, N., Boyd, S.: proximal GitHub repository (2013)

  • Parikh, N., Boyd, S.: Block splitting for distributed optimization. Math. Program. Comput. 6(1), 77–102 (2014a)

  • Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014b)

  • Peyré, G., Cuturi, M.: Computational optimal transport: with applications to data science. Found. Trends Mach. Learn. 11(5–6), 355–607 (2019)

  • She, Y., Tang, S.: Iterative proportional scaling revisited: a modern optimization perspective. J. Comput. Graph. Stat. 28(1), 48–60 (2019)

  • Stella, L., Antonello, N., Falt, M.: ProximalOperators.jl (2020)

  • Stellato, B., Banjac, G., Goulart, P., Bemporad, A., Boyd, S.: qdldl: a free LDL factorization routine (2020a)

  • Stellato, B., Banjac, G., Goulart, P., Bemporad, A., Boyd, S.: OSQP: an operator splitting solver for quadratic programs. Math. Program. Comput. 12, 637–672 (2020b)

  • Teh, Y., Welling, M.: On improving the efficiency of the iterative proportional fitting procedure. In: AISTATS (2003)

  • Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)

  • Udell, M., Mohan, K., Zeng, D., Hong, J., Diamond, S., Boyd, S.: Convex optimization in Julia. In: Workshop on High Performance Technical Computing in Dynamic Languages (2014)

  • Valliant, R., Dever, J., Kreuter, F.: Practical Tools for Designing and Weighting Survey Samples. Springer, New York (2013)

  • Vanderbei, R.: Symmetric quasidefinite matrices. SIAM J. Optim. 5(1), 100–113 (1995)

  • van der Walt, S., Colbert, C., Varoquaux, G.: The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13(2), 22 (2011)

  • Wittenberg, M.: An introduction to maximum entropy and minimum cross-entropy estimation using Stata. Stata J. 10(3), 315–330 (2010)

  • Yu, C.: Resampling methods: concepts, applications, and justification. Pract. Assess. Res. Eval. 8(1), 19 (2002)

  • Yule, U.: On the methods of measuring association between two attributes. J. R. Stat. Soc. 75(6), 579–652 (1912)



Acknowledgements

The authors would like to thank Trevor Hastie, Timothy Preston, Jeffrey Barratt, and Giana Teresi for discussions about the ideas described in this paper.


Shane Barratt is supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1656518.

Author information

Corresponding author

Correspondence to Shane Barratt.

Ethics declarations

Conflicts of interest

Not applicable.

Code availability.

All computer code is freely available online at



A Iterative proportional fitting

The connection between iterative proportional fitting, initially proposed by Deming and Stephan (1940), and the maximum entropy weighting problem has long been known and has been explored by many authors (Teh and Welling 2003; Fu et al. 2019; She and Tang 2019; Wittenberg 2010; Bishop et al. 2007). Our presentation is similar to that of She and Tang (2019), Sect. 2.1, though we show that the iterative proportional fitting algorithm as commonly implemented is actually a block coordinate ascent algorithm on the dual variables, rather than a direct coordinate descent algorithm. Writing this update in terms of the primal variables gives exactly the usual iterative proportional fitting update over the marginal distribution of each property.

Maximum entropy problem In particular, we will analyze the application of block coordinate ascent on the dual of the problem

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} \quad \sum _{i=1}^n w_i \log w_i\\ \text{ subject } \text{ to } &{} \quad Fw = f^\mathrm {des}\\ &{} \quad {\mathbf {1}}^Tw = 1, ~~ w \ge 0, \end{array} \end{aligned}$$

with variable \(w \in {\mathbf{R}}^n\), where the problem data matrix is Boolean, i.e., \(F \in \{0, 1\}^{m \times n}\). This is just the maximum entropy weighting problem given in Sect. 3.1, but in the specific case where F is a matrix with Boolean entries.

Selector matrices We will assume that there are several properties \(k=1, \dots , N\) over which the user has stratified, and we will define selector matrices \(S_k \in \{0,1\}^{p_k \times m}\) which ‘pick out’ the rows of F corresponding to property k. For example, if the first three rows of F specify the data entries corresponding to the first property, then \(S_1\) would be a matrix such that

$$\begin{aligned} S_1F = F_{1:3,1:n}, \end{aligned}$$

and each column of \(S_1F\) is a unit vector, i.e., a vector whose entries are all zeros except at a single entry, where it is one. This is the same as saying that, for each property k, each data point belongs to exactly one of the \(p_k\) possible classes. Additionally, since each desired marginal should be a proper probability distribution, we will also require that \({\mathbf {1}}^TS_k f^\mathrm {des} = 1\), i.e., the desired marginal distribution for property k should itself sum to 1.
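
Concretely, each \(S_k\) is just a block of rows of the \(m \times m\) identity matrix. A minimal numpy sketch, using a hypothetical layout with \(m = 5\), \(p_1 = 3\), and \(p_2 = 2\):

```python
import numpy as np

# Hypothetical layout: rows 0-2 of F belong to property 1, rows 3-4 to property 2.
m, p = 5, [3, 2]
starts = np.cumsum([0] + p[:-1])
S = [np.eye(m)[s:s + pk] for s, pk in zip(starts, p)]

# Toy F whose columns, restricted to each property's rows, are unit vectors.
F = np.array([[1, 0],
              [0, 1],
              [0, 0],
              [1, 1],
              [0, 0]])
print(S[0] @ F)  # the first three rows of F, i.e., F_{1:3, 1:n}
```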

Dual problem To show that iterative proportional fitting is equivalent to block coordinate ascent, we first formulate the dual problem (Boyd and Vandenberghe 2004, Ch. 5). The Lagrangian of (6) is

$$\begin{aligned} \mathcal {L}(w, \nu , \lambda ) = \sum _{i=1}^n w_i \log w_i + \nu ^T(Fw - f^\mathrm {des}) + \lambda ({\mathbf {1}}^Tw - 1), \end{aligned}$$

where \(\nu \in {\mathbf{R}}^m\) is the dual variable for the first constraint and \(\lambda \in {\mathbf{R}}\) is the dual variable for the normalization constraint. Note that we do not need to include the nonnegativity constraint on \(w_i\), since the domain of \(w_i \log w_i\) is \(w_i \ge 0\).

The dual function (Boyd and Vandenberghe 2004, Sect. 5.1.2) is given by

$$\begin{aligned} g(\nu , \lambda ) = \inf _{w} \mathcal {L}(w, \nu , \lambda ), \end{aligned}$$

which is easily computed using the Fenchel conjugate of the negative entropy (Boyd and Vandenberghe 2004, Sect. 3.3.1):

$$\begin{aligned} g(\nu , \lambda ) = - {\mathbf {1}}^T\exp (-(1 + \lambda ) {\mathbf {1}}- F^T\nu ) - \nu ^Tf^\mathrm {des} - \lambda , \end{aligned}$$

where \(\exp \) of a vector is interpreted componentwise. Note that the optimal weights \(w^\star \) are exactly those given by

$$\begin{aligned} w^\star = \exp ( -(1 + \lambda ) {\mathbf {1}}- F^T\nu ). \end{aligned}$$

Strong duality Because of strong duality, the maximum of the dual function (7) has the same value as the optimal value of the original problem (6) (Boyd and Vandenberghe 2004, Sect. 5.2.3). Because of this, it suffices to find an optimal pair of dual variables, \(\lambda \) and \(\nu \), which can then be used to find an optimal \(w^\star \), via (9).

To do this, first partially maximize g with respect to \(\lambda \), i.e.,

$$\begin{aligned} g^p(\nu ) = \sup _{\lambda } g(\nu , \lambda ). \end{aligned}$$

We can find the maximum by differentiating (8) with respect to \(\lambda \) and setting the result to zero. This gives

$$\begin{aligned} 1+\lambda ^\star = \log \left( {\mathbf {1}}^T\exp (-F^T\nu )\right) , \end{aligned}$$

and therefore


$$\begin{aligned} g^p(\nu ) = -\log ({\mathbf {1}}^T\exp (-F^T\nu )) - \nu ^Tf^\mathrm {des}. \end{aligned}$$

This also implies that, after using the optimal \(\lambda ^\star \) in (9),

$$\begin{aligned} w^\star = \frac{\exp (- F^T\nu )}{{\mathbf {1}}^T\exp (- F^T\nu )}. \end{aligned}$$
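
Both the partially maximized dual \(g^p\) and the weights recovered from a dual variable are one-liners in numpy; the following sketch (with a hypothetical toy F) is only for illustration:

```python
import numpy as np

def g_p(nu, F, f_des):
    # Partially maximized dual: -log(1^T exp(-F^T nu)) - nu^T f_des.
    z = np.exp(-F.T @ nu)
    return -np.log(z.sum()) - nu @ f_des

def weights(nu, F):
    # Primal weights recovered from a dual variable nu.
    z = np.exp(-F.T @ nu)
    return z / z.sum()

F = np.array([[1, 0, 1],
              [0, 1, 0]])                 # toy Boolean data matrix, m = 2, n = 3
nu = np.zeros(2)
print(g_p(nu, F, np.array([0.5, 0.5])))   # -log(3) at nu = 0
print(weights(nu, F))                     # uniform weights
```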

Block coordinate ascent In order to maximize the dual function \(g^p\), we will use the simple method of block coordinate ascent over the blocks of dual variables corresponding to the constraints for each of the N properties. Equivalently, we will consider updates of the form

$$\begin{aligned} \nu ^{t + 1} = \nu ^{t} + S_{t}^T\xi ^t,\quad t =1, \ldots , T, \end{aligned}$$

where \(\nu ^t\) is the dual variable at iteration t, while \(\xi ^t \in {\mathbf{R}}^{p_t}\) is the optimization variable we consider at iteration t. To simplify notation, we write \(S_t\) for the selector matrix used at iteration t, with \(S_t = S_{((t-1) \mod N) + 1}\), i.e., we choose the selector matrices in round-robin fashion. The updates result in an ascent algorithm, which is guaranteed to converge to the global optimum since \(g^p\) is a smooth concave function (Tseng 2001).

Block coordinate update In order to apply the update rule to \(g^p(\nu )\), we first work out the optimal steps defined as

$$\begin{aligned} \xi ^{t} = {{\,\mathrm{argmax}\,}}_{\xi } ~g^p(\nu ^t + S^T_t \xi ). \end{aligned}$$

To do this, set the gradient of \(g^p\) to zero,

$$\begin{aligned} \nabla _\xi ~g^p(\nu ^t + S^T_t \xi ) = 0, \end{aligned}$$

which implies that

$$\begin{aligned} \frac{\sum _{i=1}^n (S_t f_i) \exp (-f_i^T\nu ^t - f_i^TS_t^T\xi )}{\sum _{i=1}^n \exp (-f_i^T\nu ^t - f_i^T S^T_t \xi )} = S_t f^\mathrm {des}, \end{aligned}$$

where \(f_i\) is the ith column of F and the division is understood to be elementwise.

To simplify this expression, note that, for any unit basis vector \(x \in {\mathbf{R}}^{p_t}\) (i.e., \(x_i = 1\) for some i and 0 otherwise), we have the simple equality,

$$\begin{aligned} x\exp (x^T\xi ) = x\circ \exp (\xi ), \end{aligned}$$

where \(\circ \) indicates the elementwise product of two vectors. Using this result with \(x = S_t f_i\) on each term of the numerator from the left hand side of (11) gives

$$\begin{aligned} \sum _{i=1}^n (S_t f_i) \exp (-f_i^T\nu ^t - f_i^TS_t^T\xi ) = \exp (-\xi ) \circ y, \end{aligned}$$

where \(y = \sum _{i=1}^n \exp (-f_i^T\nu ^t)S_t f_i\). We can then rewrite (11) in terms of y by multiplying both sides by the denominator:

$$\begin{aligned} \exp (-\xi ) \circ y = (\exp (-\xi )^Ty)S_t f^\mathrm {des}, \end{aligned}$$

which implies that

$$\begin{aligned} \frac{y \circ \exp (-\xi )}{{\mathbf {1}}^T(y \circ \exp (-\xi ))} = S_t f^\mathrm {des}. \end{aligned}$$

Since \({\mathbf {1}}^TS_t f^\mathrm {des} = 1\),

$$\begin{aligned} y \circ \exp (-\xi ) = S_t f^\mathrm {des}, \end{aligned}$$

or, after solving for \(\xi \),

$$\begin{aligned} \xi = -\log \left( \mathbf{diag}(y)^{-1}S_t f^\mathrm {des}\right) , \end{aligned}$$

where the logarithm is taken elementwise. The resulting block coordinate ascent update can be written as

$$\begin{aligned} \nu ^{t+1} = \nu ^t - S_t^T \log \left( \frac{S_t f^\mathrm {des}}{\sum _{i=1}^n \exp (-f_i^T\nu ^t)S_t f_i}\right) , \end{aligned}$$

where the logarithm and division are performed elementwise. This update can be interpreted as changing \(\nu \) in the entries corresponding to the constraints for property t by subtracting the log ratio of the desired marginal distribution to the (unnormalized) marginal distribution for this property implied by the previous iterate; equivalently, the weights are multiplied by the ratio of the desired marginals to the current ones. This follows from (10), which implies \(w_i^t \propto \exp (-f_i^T\nu ^t)\) for each \(i=1, \dots , n\), where \(w^t\) is the distribution suggested by \(\nu ^t\) at iteration t.
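
The block coordinate ascent update above can be sketched in a few lines of numpy. The data below are a hypothetical toy problem with two properties (not the BRFSS data), and the loop cycles through the properties in round-robin order:

```python
import numpy as np

# Hypothetical toy problem: n = 6 samples, two Boolean properties.
# Rows 0-1 of F encode property 1 (p_1 = 2); rows 2-4 encode property 2 (p_2 = 3).
F = np.array([
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
])
f_des = np.array([0.6, 0.4, 0.2, 0.3, 0.5])  # each property's marginals sum to 1
props = [np.arange(0, 2), np.arange(2, 5)]   # row indices picked out by each S_t

nu = np.zeros(F.shape[0])
for t in range(20):                          # round-robin over the properties
    rows = props[t % len(props)]
    y = (np.exp(-F.T @ nu) * F[rows]).sum(axis=1)  # unnormalized marginals
    nu[rows] -= np.log(f_des[rows] / y)            # dual block coordinate step

w = np.exp(-F.T @ nu)
w /= w.sum()
print(F @ w)  # matches f_des
```

Each step matches one property's marginal exactly, which is the raking interpretation of iterative proportional fitting.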

Resulting update over w We can rewrite the update for the dual variables \(\nu \) as a multiplicative update for the primal variable w, which is exactly the update given by the iterative proportional fitting algorithm. More specifically, from (10),

$$\begin{aligned} w^{t+1}_i = \frac{\exp (-f_i^T\nu ^{t+1})}{\sum _{j=1}^n \exp (-f_j^T\nu ^{t+1})}. \end{aligned}$$

For notational convenience, we will write \(x_{t i}= S_t f_i\), which is a unit vector denoting the category to which data point i belongs, for property t. Plugging in update (12) gives, after some algebra,

$$\begin{aligned} \exp (-f_i^T\nu ^{t+1}) = \exp (-f_i^T\nu ^t) \frac{\exp (x_{t i}^T\log (S_t f^\mathrm {des}))}{\exp \left( x_{t i}^T\log \left( \sum _j \exp (-f_j^T\nu ^t)x_{t j}\right) \right) }. \end{aligned}$$

Since \(x_{t i}\) is a unit vector, \( \exp (x_{t i}^T \log (y)) = x_{t i}^Ty\) for any vector \(y > 0\), so

$$\begin{aligned} \exp (-f_i^T\nu ^{t+1}) = \exp (-f_i^T\nu ^t) \frac{x_{t i}^TS_t f^\mathrm {des}}{\sum _{j=1}^n \exp (-f_j^T\nu ^t)x_{t i}^Tx_{t j}}. \end{aligned}$$

Finally, using (10) with \(\nu ^t\) gives

$$\begin{aligned} w_i^{t+1} = w_i^t\frac{x_{t i}^TS_t f^\mathrm {des}}{\sum _{j=1}^n w^t_j (x_{t i}^Tx_{t j})}, \end{aligned}$$
which is exactly the multiplicative update of the iterative proportional fitting algorithm, performed for property t.

B Expected values of BRFSS data

See Tables 1, 2, 3, and 4:

Table 1 Desired sex values in percentages
Table 2 Desired education and income values in percentages
Table 3 Desired reported health values in percentages
Table 4 Desired state and age values in percentages


About this article


Cite this article

Barratt, S., Angeris, G. & Boyd, S. Optimal representative sample weighting. Stat Comput 31, 19 (2021).



Keywords

  • Sample weighting
  • Iterative proportional fitting
  • Convex optimization
  • Distributed optimization