
Learning Action Embeddings for Off-Policy Evaluation

  • Conference paper, in Advances in Information Retrieval (ECIR 2024)


Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to the doubly robust (DR) estimator for combining the low variance of the direct method (DM) with the low bias of IPS.
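To illustrate the variance reduction the abstract describes, here is a minimal NumPy sketch (not the paper's code) comparing vanilla IPS weights with MIPS-style weights marginalized over a coarse action embedding. The synthetic setup, the deterministic action-to-category map `cat`, and all policy/reward choices are illustrative assumptions, not the paper's experimental design.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions, n_cats = 10_000, 1_000, 10

# Logging policy is near-uniform; target policy concentrates on few actions.
pi_0 = rng.dirichlet(np.ones(n_actions))
pi_e = rng.dirichlet(np.ones(n_actions) * 0.1)

# Hypothetical embedding: each action maps to one of 10 coarse categories.
cat = rng.integers(n_cats, size=n_actions)

# Log data under pi_0; the reward depends only on the action's category.
a = rng.choice(n_actions, size=n, p=pi_0)
base = rng.normal(size=n_cats)
r = base[cat[a]] + rng.normal(scale=0.1, size=n)

# IPS: per-action importance weights (high variance with 1,000 actions).
w_ips = pi_e[a] / pi_0[a]
v_ips = np.mean(w_ips * r)

# MIPS: marginalize both policies over the embedding before weighting.
p_e_target = np.bincount(cat, weights=pi_e, minlength=n_cats)
p_e_logging = np.bincount(cat, weights=pi_0, minlength=n_cats)
w_mips = p_e_target[cat[a]] / p_e_logging[cat[a]]
v_mips = np.mean(w_mips * r)

print(f"IPS estimate:  {v_ips:.3f} (weight var {np.var(w_ips):.1f})")
print(f"MIPS estimate: {v_mips:.3f} (weight var {np.var(w_mips):.1f})")
```

Because the reward here depends on the action only through its category, the marginalized weights keep the estimate (approximately) unbiased while their variance is far smaller than the per-action IPS weights; this mirrors the identification assumption under which MIPS is unbiased.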

M. Cief—Work done during an internship at Amazon.



  1. Code to reproduce this and further experiments is available at

  1. Chuklin, A.: Click Models for Web Search, vol. 7, no. 3, pp. 1–115 (2015)


  2. Dhrymes, P.J.: Topics in Advanced Econometrics: Probability Foundations, vol. 1. Springer, Heidelberg (1989).


  3. Dudík, M., Erhan, D., Langford, J., Li, L.: Doubly robust policy evaluation and optimization. Stat. Sci. 29(4), 485–511 (2014). ISSN 0883–4237, 2168–8745.

  4. Efron, B.: The efficiency of logistic regression compared to normal discriminant analysis. J. Am. Stat. Assoc. 70(352), 892–898 (1975)


  5. Farajtabar, M., Chow, Y., Ghavamzadeh, M.: More robust doubly robust off-policy evaluation. In: Proceedings of the 35th International Conference on Machine Learning, pp. 1447–1456. PMLR (2018). ISSN 2640–3498

  6. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York (2009).

  7. Kallus, N., Zhou, A.: Policy evaluation and optimization with continuous treatments. In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pp. 1243–1251. PMLR (2018). ISSN 2640–3498

  8. Metelli, A.M., Russo, A., Restelli, M.: Subgaussian and differentiable importance sampling for off-policy evaluation and learning. In: Advances in Neural Information Processing Systems, vol. 34, pp. 8119–8132. Curran Associates, Inc. (2021).

  9. Peng, J., et al.: Offline policy evaluation in large action spaces via outcome-oriented action grouping. In: Proceedings of the ACM Web Conference 2023, WWW 2023, pp. 1220–1230. Association for Computing Machinery, New York (2023). ISBN 978-1-4503-9416-1.

  10. Robins, J.M., Rotnitzky, A., Zhao, L.P.: Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 89(427), 846–866 (1994). ISSN 0162–1459.

  11. Sachdeva, N., Su, Y., Joachims, T.: Off-policy bandits with deficient support. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2020, pp. 965–975. Association for Computing Machinery, New York (2020). ISBN 978-1-4503-7998-4.

  12. Saito, Y., Aihara, S., Matsutani, M., Narita, Y.: Open bandit dataset and pipeline: towards realistic and reproducible off-policy evaluation (2021). arXiv:2008.07146 [cs, stat]

  13. Saito, Y., Joachims, T.: Off-policy evaluation for large action spaces via embeddings. In: Proceedings of the 39th International Conference on Machine Learning, pp. 19089–19122. PMLR (2022). ISSN 2640–3498

  14. Saito, Y., Ren, Q., Joachims, T.: Off-policy evaluation for large action spaces via conjunct effect modeling. In: Proceedings of the 40th International Conference on Machine Learning, pp. 29734–29759. PMLR (2023). ISSN 2640–3498

  15. Su, Y., Dimakopoulou, M., Krishnamurthy, A., Dudik, M.: Doubly robust off-policy evaluation with shrinkage. In: Proceedings of the 37th International Conference on Machine Learning, pp. 9167–9176. PMLR (2020). ISSN 2640–3498

  16. Su, Y., Wang, L., Santacatterina, M., Joachims, T.: CAB: continuous adaptive blending for policy evaluation and learning. In: Proceedings of the 36th International Conference on Machine Learning, pp. 6005–6014. PMLR (2019). ISSN 2640–3498

  17. Swaminathan, A.: Counterfactual Evaluation and Learning From Logged User Feedback. Ph.D. thesis, Cornell University, Ithaca, NY, United States (2017).

  18. Swaminathan, A., Joachims, T.: The self-normalized estimator for counterfactual learning. In: Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (2015).

  19. Swaminathan, A., et al.: Off-policy evaluation for slate recommendation. In: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017).

  20. Wang, Y.X., Agarwal, A., Dudik, M.: Optimal and adaptive off-policy evaluation in contextual bandits. In: Proceedings of the 34th International Conference on Machine Learning, pp. 3589–3597. PMLR (2017). ISSN 2640–3498

  21. Zhou, L.: A Survey on Contextual Multi-armed Bandits (2016). arXiv:1508.03326 [cs]



We thank Mohamed Sadek for his contributions to the codebase. The research conducted by Matej Cief (also with slovak.AI) was partially supported by TAILOR, a project funded by EU Horizon 2020 under GA No. 952215.

Author information



Corresponding author

Correspondence to Matej Cief.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Cief, M., Golebiowski, J., Schmidt, P., Abedjan, Z., Bekasov, A. (2024). Learning Action Embeddings for Off-Policy Evaluation. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14608. Springer, Cham.

Download citation


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56026-2

  • Online ISBN: 978-3-031-56027-9

  • eBook Packages: Computer Science, Computer Science (R0)
