Abstract
We consider the problem of assigning weights to a set of samples or data records, with the goal of achieving a representative weighting, which happens when certain sample averages of the data are close to prescribed values. We frame the problem of finding representative sample weights as an optimization problem, which in many cases is convex and can be efficiently solved. Our formulation includes as a special case the selection of a fixed number of the samples, with equal weights, i.e., the problem of selecting a smaller representative subset of the samples. While this problem is combinatorial and not convex, heuristic methods based on convex optimization seem to perform very well. We describe our open-source implementation rsw and apply it to a skewed sample of the CDC BRFSS dataset.
Availability of data and material
All data and material are freely available online at www.github.com/cvxgrp/rsw.
References
Agrawal, A., Verschueren, R., Diamond, S., Boyd, S.: A rewriting system for convex optimization problems. J. Control Decis. 5(1), 42–60 (2018)
Angeris, G., Vučković, J., Boyd, S.: Computational bounds for photonic design. ACS Photonics 6(5), 1232–1239 (2019). https://doi.org/10.1021/acsphotonics.9b00154. ISSN 2330-4022
MOSEK ApS: MOSEK modeling cookbook. https://docs.mosek.com/MOSEKModelingCookbook.pdf (2020)
Bethlehem, J., Keller, W.: Linear weighting of sample survey data. J. Off. Stat. 3(2), 141–153 (1987)
Bishop, Y., Fienberg, S., Holland, P.: Discrete Multivariate Analysis. Springer, New York (2007). ISBN 978-0-387-72805-6
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004). ISBN 978-0-521-83378-3
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Mach. Learn. 3(1), 1–122 (2010). https://doi.org/10.1561/2200000016. ISSN 1935-8237, 1935-8245
Centers for Disease Control and Prevention (CDC): Behavioral Risk Factor Surveillance System Survey Data (2018a)
Centers for Disease Control and Prevention (CDC): LLCP 2018 codebook report. https://www.cdc.gov/brfss/annual_data/2018/pdf/codebook18_llcp-v2-508.pdf (2018b)
Chen, S., Donoho, D., Saunders, M.: Atomic decomposition by basis pursuit. SIAM Rev. 43(1), 129–159 (2001)
Daszykowski, M., Walczak, B., Massart, D.: Representative subset selection. Anal. Chim. Acta 468(1), 91–103 (2002)
Deming, W., Stephan, F.: On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Stat. 11(4), 427–444 (1940)
Deville, J.-C., Särndal, C.-E., Sautory, O.: Generalized raking procedures in survey sampling. J. Am. Stat. Assoc. 88(423), 1013–1020 (1993)
Diamond, S., Boyd, S.: CVXPY: A Python-embedded modeling language for convex optimization. J. Mach. Learn. Res. 17(83), 1–5 (2016)
Diamond, S., Takapoui, R., Boyd, S.: A general system for heuristic minimization of convex functions over non-convex sets. Optim. Methods Softw. 33(1), 165–193 (2018)
Domahidi, A., Chu, E., Boyd, S.: ECOS: An SOCP solver for embedded systems. In: 2013 European Control Conference (ECC), pp. 3071–3076, Zurich (2013). IEEE. ISBN 978-3-033-03962-9. https://doi.org/10.23919/ECC.2013.6669541
Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 20(1), 18–36 (2004)
Fougner, C., Boyd, S.: Parameter selection and preconditioning for a graph form solver. Emerging Applications of Control and Systems Theory, pp. 41–61. Springer, Cham (2018)
Fu, A., Narasimhan, B., Boyd, S.: CVXR: an R package for disciplined convex optimization. J. Stat. Softw. 94, 1–34 (2019)
Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences, pp. 95–110. Springer, London (2008)
Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming, version 2.1 (2014)
Gurobi Optimization: Gurobi optimizer reference manual. https://www.gurobi.com/wp-content/plugins/hd_documentations/documentation/9.0/refman.pdf (2020)
Heckman, J.: The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Ann. Econ. Soc. Meas. 5, 475–492 (1976)
Holt, D., Smith, F.: Post stratification. J. R. Stat. Soc. Ser. A 142(1), 33–46 (1979)
Horvitz, D., Thompson, D.: A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47(260), 663–685 (1952)
Iannacchione, V., Milne, J., Folsom, R.: Response probability weight adjustments using logistic regression. Proc. Surv. Res. Methods Sect. Am. Stat. Assoc. 20, 637–642 (1991)
Jones, E., Oliphant, T., Peterson, P.: SciPy: Open source scientific tools for Python. http://www.scipy.org/ (2001)
Kalton, G., Flores-Cervantes, I.: Weighting methods. J. Off. Stat. 19(2), 81–97 (2003)
Karp, R.: Reducibility among combinatorial problems. Complexity of Computer Computations, pp. 85–103. Springer, Boston (1972). https://doi.org/10.1007/978-1-4684-2001-2_9. ISBN 978-1-4684-2003-6 978-1-4684-2001-2
Kish, L.: Weighting for unequal \(p_i\). J. Off. Stat. 8(2), 183–200 (1992)
Kolmogorov, A.: Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari 4, 83–91 (1933)
Kruithof, J.: Telefoonverkeersrekening. De Ingenieur 52, 15–25 (1937)
Kullback, S., Leibler, R.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
Lambert, J.: Observations variae in mathesin puram. Acta Helvitica, physico-mathematico-anatomico-botanico-medica 3, 128–168 (1758)
Lavallée, P., Beaumont, J.-F.: Why we should put some weight on weights. In: Survey Methods, Insights from the Field (SMIF) (2015)
Lepkowski, J., Kalton, G., Kasprzyk, D.: Weighting adjustments for partial nonresponse in the 1984 SIPP panel. In: Proceedings of the Section on Survey Research Methods, pp. 296–301. American Statistical Association, Washington, DC (1989)
Löfberg, J.: YALMIP: A toolbox for modeling and optimization in MATLAB. In: IEEE International Conference on Robotics and Automation, IEEE, pp. 284–289 (2004)
Lumley, T.: Complex surveys: a guide to analysis using R, vol. 565. Wiley, Hoboken (2011)
McKinney, W.: Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference, pp. 56–61 (2010). https://doi.org/10.25080/Majora-92bf1922-00a
Mercer, A., Lau, A., Kennedy, C.: How different weighting methods work. https://www.pewresearch.org/methods/2018/01/26/how-different-weighting-methods-work/ (2018)
Neyman, J.: On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. J. R. Stat. Soc. 96(4), 558–625 (1934)
O’Donoghue, B., Chu, E., Parikh, N., Boyd, S.: Conic optimization via operator splitting and homogeneous self-dual embedding. J. Optim. Theory Appl. 169(3), 1042–1068 (2016)
Parikh, N., Boyd, S.: proximal GitHub repository. https://github.com/cvxgrp/proximal (2013)
Parikh, N., Boyd, S.: Block splitting for distributed optimization. Math. Program. Comput. 6(1), 77–102 (2014a)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends® Optim. 1(3), 127–239 (2014b). https://doi.org/10.1561/2400000003. ISSN 2167-3888, 2167-3918
Peyré, G., Cuturi, M.: Computational optimal transport: with applications to data science. Found. Trends® Mach. Learn. 11(5–6), 355–607 (2019)
She, Y., Tang, S.: Iterative proportional scaling revisited: a modern optimization perspective. J. Comput. Graph. Stat. 28(1), 48–60 (2019)
Stella, L., Antonello, N., Falt, M.: ProximalOperators.jl. https://github.com/kul-forbes/ProximalOperators.jl (2020)
Stellato, B., Banjac, G., Goulart, P., Bemporad, A., Boyd, S.: qdldl: a free LDL factorization routine. https://github.com/oxfordcontrol/qdldl (2020a)
Stellato, B., Banjac, G., Goulart, P., Bemporad, A., Boyd, S.: OSQP: An operator splitting solver for quadratic programs. Math. Program. Comput. 12, 637–672 (2020b). https://doi.org/10.1007/s12532-020-00179-2
Teh, Y., Welling, M.: On improving the efficiency of the iterative proportional fitting procedure. In: AISTATS (2003)
Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001). https://doi.org/10.1023/A:1017501703105. ISSN 0022-3239, 1573-2878
Udell, M., Mohan, K., Zeng, D., Hong, J., Diamond, S., Boyd, S.: Convex optimization in Julia. Workshop on High Performance Technical Computing in Dynamic Languages (2014)
Valliant, R., Dever, J., Kreuter, F.: Practical Tools for Designing and Weighting Survey Samples. Springer, New York (2013)
Vanderbei, R.: Symmetric quasidefinite matrices. SIAM J. Optim. 5(1), 100–113 (1995)
van der Walt, S., Colbert, S.C., Varoquaux, G.: The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13(2), 22–30 (2011)
Wittenberg, M.: An introduction to maximum entropy and minimum cross-entropy estimation using stata. Stata J. Promot. Commun. Stat. Stata 10(3), 315–330 (2010). https://doi.org/10.1177/1536867X1001000301. ISSN 1536-867X, 1536-8734
Yu, C.: Resampling methods: concepts, applications, and justification. Pract. Assess. Res. Eval. 8(1), 19 (2002)
Yule, U.: On the methods of measuring association between two attributes. J. R. Stat. Soc. 75(6), 579–652 (1912)
Acknowledgements
The authors would like to thank Trevor Hastie, Timothy Preston, Jeffrey Barratt, and Giana Teresi for discussions about the ideas described in this paper.
Funding
Shane Barratt is supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1656518.
Ethics declarations
Conflicts of interest
Not applicable.
Code availability
All computer code is freely available online at www.github.com/cvxgrp/rsw.
Appendices
A Iterative proportional fitting
The connection between iterative proportional fitting, initially proposed by Deming and Stephan (1940), and the maximum entropy weighting problem has long been known and has been explored by many authors (Teh and Welling 2003; Fu et al. 2019; She and Tang 2019; Wittenberg 2010; Bishop et al. 2007). Our presentation is similar to that of She and Tang (2019, Sect. 2.1), though we show that the iterative proportional fitting algorithm as commonly implemented is actually a block coordinate descent algorithm on the dual variables, rather than a direct coordinate descent algorithm. Writing this update in terms of the primal variables gives exactly the usual iterative proportional fitting update over the marginal distribution of each property.
Maximum entropy problem. In particular, we will analyze the application of block coordinate descent on the dual of the problem

\[
\begin{array}{ll}
\text{minimize} & \sum_{i=1}^n w_i \log w_i \\
\text{subject to} & Fw = f^\mathrm{des}, \quad {\mathbf{1}}^T w = 1,
\end{array} \tag{6}
\]
with variable \(w \in {\mathbf{R}}^n\), where the problem data matrix is Boolean, i.e., \(F \in \{0, 1\}^{m \times n}\). This is just the maximum entropy weighting problem given in Sect. 3.1, specialized to the case where F has Boolean entries.
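As a concrete (hypothetical) instance, problem (6) can be posed directly in CVXPY (Diamond and Boyd 2016); the dimensions, data matrix, and target below are synthetic placeholders for illustration, not the rsw implementation itself.

```python
import cvxpy as cp
import numpy as np

np.random.seed(0)
n, m = 100, 5

# Synthetic Boolean data matrix F and a feasible target f_des, built from a
# random distribution w0 so that the constraints of (6) can be met exactly.
F = (np.random.rand(m, n) < 0.5).astype(float)
w0 = np.random.rand(n)
w0 /= w0.sum()
f_des = F @ w0

# Problem (6): minimize sum_i w_i log w_i subject to F w = f_des, 1^T w = 1.
# cp.entr(w) is the elementwise entropy -w log w, so its negated sum is the
# negative entropy objective; its domain also enforces w >= 0.
w = cp.Variable(n)
constraints = [F @ w == f_des, cp.sum(w) == 1]
prob = cp.Problem(cp.Minimize(-cp.sum(cp.entr(w))), constraints)
prob.solve()
print(prob.value, np.abs(F @ w.value - f_des).max())  # residuals ~ 0
```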
Selector matrices. We will assume that there are several properties \(k=1, \dots , N\) over which the user has stratified, and we will define selector matrices \(S_k \in \{0,1\}^{p_k \times m}\) which ‘pick out’ the rows of F corresponding to property k. For example, if the first three rows of F specify the data entries corresponding to the first property, then we can take

\[
S_1 = \begin{bmatrix} I_3 & 0 \end{bmatrix},
\]
and each column of \(S_1F\) is a unit vector, i.e., a vector whose entries are all zeros except at a single entry, where it is one. This is the same as saying that, for each property k, each data point is allowed to be in exactly one of the \(p_k\) possible classes. Additionally, since this should be a proper probability distribution, we will also require that \({\mathbf {1}}^TS_k f^\mathrm {des} = 1\), i.e., the desired marginal distribution for property k should itself sum to 1.
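As a toy numerical illustration of these definitions (all dimensions and matrices here are invented), suppose \(m = 5\), with the first property having \(p_1 = 3\) categories and the second \(p_2 = 2\):

```python
import numpy as np

# Selector matrices: S_1 picks rows 1-3 of F (property 1), S_2 picks rows 4-5.
S1 = np.hstack([np.eye(3), np.zeros((3, 2))])   # S_1 = [I_3  0]
S2 = np.hstack([np.zeros((2, 3)), np.eye(2)])   # S_2 = [0  I_2]

# Three data points; each is in exactly one category of each property.
F = np.array([[1, 0, 0],    # property 1, category 1
              [0, 1, 0],    # property 1, category 2
              [0, 0, 1],    # property 1, category 3
              [1, 1, 0],    # property 2, category 1
              [0, 0, 1]],   # property 2, category 2
             dtype=float)

# Each column of S_k F is a unit vector, so the column sums are all one.
assert np.allclose((S1 @ F).sum(axis=0), 1)
assert np.allclose((S2 @ F).sum(axis=0), 1)
```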
Dual problem. To show that iterative proportional fitting is equivalent to block coordinate ascent, we first formulate the dual problem (Boyd and Vandenberghe 2004, Ch. 5). The Lagrangian of (6) is

\[
L(w, \nu, \lambda) = \sum_{i=1}^n w_i \log w_i + \nu^T\left(Fw - f^\mathrm{des}\right) + \lambda\left({\mathbf{1}}^T w - 1\right),
\]
where \(\nu \in {\mathbf{R}}^m\) is the dual variable for the first constraint and \(\lambda \in {\mathbf{R}}\) is the dual variable for the normalization constraint. Note that we do not need to include the nonnegativity constraint on \(w_i\), since the domain of \(w_i \log w_i\) is \(w_i \ge 0\).
The dual function (Boyd and Vandenberghe 2004, Sect. 5.1.2) is given by

\[
g(\nu, \lambda) = \inf_{w} \; L(w, \nu, \lambda), \tag{7}
\]
which is easily computed using the Fenchel conjugate of the negative entropy (Boyd and Vandenberghe 2004, Sect. 3.3.1):

\[
g(\nu, \lambda) = -{\mathbf{1}}^T \exp\left(-F^T\nu - (\lambda + 1){\mathbf{1}}\right) - \nu^T f^\mathrm{des} - \lambda, \tag{8}
\]
where \(\exp \) of a vector is interpreted componentwise. Note that the optimal weights \(w^\star \) are exactly those given by

\[
w^\star = \exp\left(-F^T\nu - (\lambda + 1){\mathbf{1}}\right). \tag{9}
\]
Strong duality. Since strong duality holds, the maximum of the dual function (7) has the same value as the optimal value of the original problem (6) (Boyd and Vandenberghe 2004, Sect. 5.2.3). It therefore suffices to find an optimal pair of dual variables \(\lambda \) and \(\nu \), which can then be used to recover an optimal \(w^\star \) via (9).
To do this, first partially maximize g with respect to \(\lambda \), i.e.,

\[
g^p(\nu) = \sup_{\lambda} \; g(\nu, \lambda).
\]
We can find the maximum by differentiating (8) with respect to \(\lambda \) and setting the result to zero. This gives

\[
\lambda^\star = \log\left({\mathbf{1}}^T \exp(-F^T\nu)\right) - 1,
\]
while

\[
g^p(\nu) = -\nu^T f^\mathrm{des} - \log\left({\mathbf{1}}^T \exp(-F^T\nu)\right).
\]
This also implies that, after using the optimal \(\lambda ^\star \) in (9),

\[
w^\star = \frac{\exp(-F^T\nu)}{{\mathbf{1}}^T \exp(-F^T\nu)}. \tag{10}
\]
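Continuing the CVXPY sketch above, (10) can be checked numerically by recovering \(w^\star \) from the multiplier on the constraint \(Fw = f^\mathrm{des}\). This assumes the usual Lagrangian sign convention for the equality-constraint dual reported by CVXPY; if a solver reports the negated multiplier, the sign of \(\nu \) must be flipped.

```python
# nu: dual variable attached to F @ w == f_des in the sketch above.
nu = constraints[0].dual_value

# Equation (10): w* is the normalized, elementwise exponential of -F^T nu.
w_from_dual = np.exp(-F.T @ nu)
w_from_dual /= w_from_dual.sum()
print(np.abs(w_from_dual - w.value).max())   # should be ~ 0
```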
Block coordinate ascent. In order to maximize the dual function \(g^p\), we will use the simple method of block coordinate ascent with respect to the dual variables corresponding to the constraints for each of the N properties. Equivalently, we will consider updates of the form

\[
\nu^{t+1} = \nu^t + S_t^T \xi^t,
\]
where \(\nu ^t\) is the dual variable at iteration t, while \(\xi ^t \in {\mathbf{R}}^{p_t}\) is the optimization variable we consider at iteration t. To simplify notation, we have used \(S_t\) to refer to the selector matrix at iteration t, if \(t \le N\), and otherwise set \(S_t = S_{((t-1) \bmod N) + 1}\), i.e., we choose the selector matrices in a round-robin fashion. The updates result in an ascent algorithm, which is guaranteed to converge to the global optimum since \(g^p\) is a smooth concave function (Tseng 2001).
Block coordinate update. In order to apply the update rule to \(g^p(\nu )\), we first work out the optimal step, defined as

\[
\xi^\star = \mathop{\mathrm{argmax}}_{\xi \in {\mathbf{R}}^{p_t}} \; g^p(\nu^t + S_t^T \xi).
\]
To do this, set the gradient of \(g^p\) with respect to \(\xi \) to zero,

\[
\nabla_\xi \, g^p(\nu^t + S_t^T \xi) = 0,
\]
which implies that

\[
\frac{\sum_{i=1}^n \exp\left(-f_i^T(\nu^t + S_t^T\xi)\right) S_t f_i}{\sum_{i=1}^n \exp\left(-f_i^T(\nu^t + S_t^T\xi)\right)} = S_t f^\mathrm{des}, \tag{11}
\]
where \(f_i\) is the ith column of F and the division is understood to be elementwise.
To simplify this expression, note that, for any unit basis vector \(x \in {\mathbf{R}}^{p_t}\) (i.e., \(x_i = 1\) for some i and 0 otherwise), we have the simple equality

\[
\exp(-\xi^T x)\, x = \exp(-\xi) \circ x,
\]
where \(\circ \) indicates the elementwise product of two vectors. Using this result with \(x = S_t f_i\) on each term of the numerator of the left-hand side of (11) gives

\[
\sum_{i=1}^n \exp\left(-f_i^T(\nu^t + S_t^T\xi)\right) S_t f_i = \exp(-\xi) \circ y,
\]
where \(y = \sum _{i=1}^n \exp (-f_i^T\nu ^t)\,S_t f_i\); the same identity shows that the denominator equals \({\mathbf {1}}^T(\exp (-\xi ) \circ y)\). We can then rewrite (11) in terms of y by multiplying both sides by the denominator:

\[
\exp(-\xi) \circ y = \left({\mathbf{1}}^T\left(\exp(-\xi) \circ y\right)\right) S_t f^\mathrm{des},
\]
which implies that

\[
\exp(-\xi) \circ y \propto S_t f^\mathrm{des}.
\]
Since \({\mathbf {1}}^TS_t f^\mathrm {des} = 1\), and shifting \(\xi \) by a constant does not change the weights in (10), we can take

\[
\exp(-\xi) \circ y = S_t f^\mathrm{des},
\]
or, after solving for \(\xi \),

\[
\xi^\star = \log\left(\frac{y}{S_t f^\mathrm{des}}\right), \tag{12}
\]
where the logarithm is taken elementwise. The resulting block coordinate ascent update can be written as

\[
\nu^{t+1} = \nu^t + S_t^T \log\left(\frac{y}{S_t f^\mathrm{des}}\right),
\]
where the logarithm and division are performed elementwise. This update can be interpreted as changing \(\nu \) in the entries corresponding to the constraints given by property t by adding the elementwise log ratio of the (unnormalized) marginal distribution implied by the previous iterate to the desired distribution for this property. This follows from (10), which implies \(w_i^t \propto \exp (-f_i^T\nu ^t)\) for each \(i=1, \dots , n\), where \(w^t\) is the distribution given by \(\nu ^t\) at iteration t.
Resulting update over w. We can rewrite the update for the dual variables \(\nu \) as a multiplicative update for the primal variable w, which is exactly the update given by the iterative proportional fitting algorithm. More specifically, from (10),

\[
w_i^{t+1} = \frac{\exp(-f_i^T\nu^{t+1})}{\sum_{j=1}^n \exp(-f_j^T\nu^{t+1})}, \quad i = 1, \dots, n.
\]
For notational convenience, we will write \(x_{t i}= S_t f_i\), which is a unit vector denoting the category to which data point i belongs, for property t. Plugging in update (12) gives, after some algebra,

\[
w_i^{t+1} = \frac{\exp(-f_i^T\nu^{t})\exp\left(x_{t i}^T \log\left(S_t f^\mathrm{des}/y\right)\right)}{\sum_{j=1}^n \exp(-f_j^T\nu^{t})\exp\left(x_{t j}^T \log\left(S_t f^\mathrm{des}/y\right)\right)}.
\]
Since \(x_{t i}\) is a unit vector, \(\exp (x_{t i}^T \log (z)) = x_{t i}^Tz\) for any vector \(z > 0\), so

\[
w_i^{t+1} = \frac{\exp(-f_i^T\nu^{t})\, x_{t i}^T\left(S_t f^\mathrm{des}/y\right)}{\sum_{j=1}^n \exp(-f_j^T\nu^{t})\, x_{t j}^T\left(S_t f^\mathrm{des}/y\right)}.
\]
Finally, using (10) with \(\nu ^t\) (so that \(\exp (-f_i^T\nu ^t)\) is proportional to \(w_i^t\) and \(y\) is proportional to \(S_tFw^t\), which makes the denominator above equal to one) gives

\[
w_i^{t+1} = w_i^t \, \frac{x_{t i}^T S_t f^\mathrm{des}}{x_{t i}^T S_t F w^t}, \quad i = 1, \dots, n,
\]
which is exactly the multiplicative update of the iterative proportional fitting algorithm, performed for property t.
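The equivalence is easy to verify numerically. The following self-contained sketch (with invented dimensions and a randomly generated, feasible \(f^\mathrm{des}\)) runs the dual update (12) and the primal multiplicative update side by side, and checks that they produce the same weights:

```python
import numpy as np

np.random.seed(0)
n = 50

# Two properties with 2 and 3 categories; F stacks their one-hot encodings.
cat1 = np.random.randint(2, size=n)
cat2 = np.random.randint(3, size=n)
F = np.vstack([np.eye(2)[:, cat1], np.eye(3)[:, cat2]])
S = [np.hstack([np.eye(2), np.zeros((2, 3))]),
     np.hstack([np.zeros((3, 2)), np.eye(3)])]

# A feasible desired marginal vector f_des, generated from a random distribution.
w0 = np.random.rand(n)
w0 /= w0.sum()
f_des = F @ w0

# Dual block coordinate ascent: nu^{t+1} = nu^t + S_t^T log(y / S_t f_des).
nu = np.zeros(F.shape[0])
for t in range(200):
    St = S[t % len(S)]
    y = (St @ F) @ np.exp(-F.T @ nu)          # unnormalized marginal y
    nu += St.T @ np.log(y / (St @ f_des))     # update (12)
w_dual = np.exp(-F.T @ nu)
w_dual /= w_dual.sum()

# Primal IPF: scale each w_i by desired / current marginal of its category.
w = np.full(n, 1.0 / n)
for t in range(200):
    St = S[t % len(S)]
    w *= (St @ F).T @ ((St @ f_des) / (St @ F @ w))

assert np.allclose(w, w_dual)                 # same iterates, as derived above
assert np.allclose(F @ w, f_des, atol=1e-6)   # marginals are matched
```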
B Expected values of BRFSS data
Cite this article
Barratt, S., Angeris, G. & Boyd, S. Optimal representative sample weighting. Stat Comput 31, 19 (2021). https://doi.org/10.1007/s11222-021-10001-1
Keywords
- Sample weighting
- Iterative proportional fitting
- Convex optimization
- Distributed optimization