Constructing effective personalized policies using counterfactual inference from biased data sets with many features

  • Onur Atan
  • William R. Zame
  • Qiaojun Feng
  • Mihaela van der Schaar


This paper proposes a novel approach for constructing effective personalized policies when the observed data lack counterfactual information, are biased, and possess many features. The approach is applicable in a wide variety of settings, from healthcare to advertising to education to finance. These settings have in common that the decision maker can observe, for each previous instance, an array of features of the instance, the action taken in that instance, and the reward realized, but not the rewards of actions that were not taken (the counterfactual information). Learning in such settings is made even more difficult because the observed data are typically biased by the existing policy that generated them, and because the array of features that might affect the reward in a particular instance, and hence should be taken into account in deciding on an action in that instance, is often vast. The approach presented here estimates propensity scores for the observed data, infers counterfactuals, identifies a relatively small number of features that are most relevant for each possible action and instance, and prescribes a policy to be followed. Comparison of the proposed algorithm against state-of-the-art algorithms on actual datasets demonstrates that the proposed algorithm achieves a significant improvement in performance.
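The first two steps named in the abstract, estimating propensity scores and inferring the value of actions that were not taken, can be illustrated with a generic inverse-propensity-score (IPS) estimator. The sketch below is not the paper's algorithm: it assumes a toy setting with one discrete context feature, two actions, and a hypothetical biased logging policy, and it omits the feature-selection step entirely. All function names and the simulated data are assumptions made for illustration.

```python
import random
from collections import Counter


def estimate_propensities(contexts, actions, n_actions):
    """Estimate p(a | x) for discrete contexts by add-one smoothed frequencies."""
    pair_counts = Counter(zip(contexts, actions))
    context_counts = Counter(contexts)
    return {
        (x, a): (pair_counts[(x, a)] + 1) / (context_counts[x] + n_actions)
        for x in context_counts
        for a in range(n_actions)
    }


def ips_value(contexts, actions, rewards, propensities, policy):
    """Inverse-propensity-score estimate of the mean reward of `policy`.

    Only logged instances whose action agrees with `policy` contribute,
    each reweighted by 1 / p(a | x) to undo the logging policy's bias.
    """
    total = 0.0
    for x, a, r in zip(contexts, actions, rewards):
        if policy[x] == a:
            total += r / propensities[(x, a)]
    return total / len(contexts)


def simulate_logs(n, seed=0):
    """Toy logged data: two contexts, two actions, biased logging policy."""
    rng = random.Random(seed)
    contexts, actions, rewards = [], [], []
    for _ in range(n):
        x = rng.randrange(2)
        p_action0 = 0.8 if x == 0 else 0.3  # hypothetical logging policy
        a = 0 if rng.random() < p_action0 else 1
        contexts.append(x)
        actions.append(a)
        # Only the reward of the action actually taken is recorded.
        rewards.append(1.0 if a == x else 0.0)
    return contexts, actions, rewards


contexts, actions, rewards = simulate_logs(10_000)
propensities = estimate_propensities(contexts, actions, n_actions=2)

# Mean reward under the logging policy, for comparison.
logged_mean = sum(rewards) / len(rewards)
# Estimated value of two candidate policies that were never deployed.
matched_value = ips_value(contexts, actions, rewards, propensities, {0: 0, 1: 1})
mismatched_value = ips_value(contexts, actions, rewards, propensities, {0: 1, 1: 0})
print(logged_mean, matched_value, mismatched_value)
```

In this toy setup the reward is 1 exactly when the action equals the context, so a policy that always matches the context has true value 1 and one that never matches has true value 0. The IPS estimates recover this ordering from the biased logs, whereas the raw logged mean only reflects the logging policy. With many features, the frequency-count propensity model used here would no longer be practical, which is the difficulty the abstract highlights.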


  • Inferring counterfactuals
  • Identifying relevant features
  • Constructing personalized policies



This research was funded by NSF grants ECCS 1462245 and IIP 1533983.



Copyright information

© The Author(s) 2018

Authors and Affiliations

  • Onur Atan (1)
  • William R. Zame (1, 2)
  • Qiaojun Feng (3)
  • Mihaela van der Schaar (1, 4)

  1. University of California, Los Angeles, Los Angeles, USA
  2. Nuffield College, Oxford University, Oxford, UK
  3. Tsinghua University, Beijing, China
  4. Oxford-Man Institute, Oxford University, Oxford, UK
