Abstract
Problems of online pricing with offline data, like other problems of online decision making with offline data, aim at designing and evaluating online pricing policies in the presence of a certain amount of existing offline data. To evaluate pricing policies when offline data are available, the decision maker can position herself either at the time point when the offline data have already been observed and can be viewed as deterministic, or at the time point when the offline data have not yet been generated and must be viewed as stochastic. We develop a framework to discuss how and why these two positions are relevant to online policy evaluation, from both a worst-case perspective and a Bayesian perspective. We then use a simple online pricing setting with offline data to illustrate the construction of optimal policies under the two approaches and discuss their differences, in particular whether the search for an optimal policy can be decomposed into independent subproblems that are optimized separately, and whether a deterministic optimal policy exists.
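The distinction between the two evaluation perspectives can be illustrated with a toy sketch (not the paper's model): a linear demand curve with an unknown parameter pair, where worst-case evaluation takes the maximum regret over a candidate set and Bayesian evaluation averages regret under a prior. All names, parameter values, and the candidate set below are illustrative assumptions.

```python
# Toy linear demand: expected demand = a - b * price for unknown (a, b).
# Expected revenue r(p; a, b) = p * (a - b * p); the optimal price is a / (2b).

def revenue(p, a, b):
    return p * (a - b * p)

def regret(p, a, b):
    # Regret of charging price p when the true parameters are (a, b).
    p_star = a / (2 * b)
    return revenue(p_star, a, b) - revenue(p, a, b)

# Hypothetical uncertainty set of demand parameters (a, b).
candidates = [(10.0, 1.0), (12.0, 1.5), (8.0, 0.8)]

def worst_case_regret(p):
    # Worst-case evaluation: maximize regret over the candidate set.
    return max(regret(p, a, b) for (a, b) in candidates)

def bayesian_regret(p, prior):
    # Bayesian evaluation: average regret under a prior on (a, b).
    return sum(w * regret(p, a, b) for w, (a, b) in zip(prior, candidates))

uniform_prior = [1 / 3, 1 / 3, 1 / 3]
p = 4.5  # a fixed candidate price
# Averaging can never exceed the maximum, so Bayesian regret
# is bounded above by worst-case regret for any prior.
assert bayesian_regret(p, uniform_prior) <= worst_case_regret(p)
```

A policy that looks good under one criterion need not under the other: the worst-case value is driven by a single unfavorable parameter, while the Bayesian value reflects the whole prior.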
Data Availability
The datasets generated and/or analysed during the current study are available in the GitHub repository, https://github.com/YueWangMathbio/OPOD.
Acknowledgments
The authors would like to thank the anonymous referees for providing helpful comments that improved the quality of this paper.
Author information
Yue Wang has been a postdoctoral fellow at the Department of Computational Medicine, University of California, Los Angeles, since 2021. During 2018–2021, Dr. Wang was a postdoctoral researcher at the Institut des Hautes Études Scientifiques in France. Dr. Wang received a Ph.D. in applied mathematics from the University of Washington in 2018 and a B.Sc. in mathematics from Peking University in 2013. Dr. Wang applies a variety of mathematical tools, such as modeling, simulation, algorithms, statistical analysis, and theoretical analysis with discrete mathematics, differential equations, and stochastic processes, to biology, e.g., population dynamics, gene regulation, and developmental biology. Dr. Wang also applies probability, stochastic processes, and discrete mathematics to other subjects, such as reinforcement learning, causal inference, statistical physics, biochemistry, dynamical systems, and law.
Zeyu Zheng has been an assistant professor at the Department of Industrial Engineering and Operations Research, University of California, Berkeley, since 2018. Dr. Zheng received a Ph.D. in operations research from Stanford University in 2018, an M.S. in economics from Stanford University in 2016, and a bachelor's degree in mathematics from Peking University in 2012. He has done research in Monte Carlo simulation theory and simulation optimization. He is also interested in non-stationary stochastic modeling.
Cite this article
Wang, Y., Zheng, Z. Measuring Policy Performance in Online Pricing with Offline Data: Worst-case Perspective and Bayesian Perspective. J. Syst. Sci. Syst. Eng. 32, 352–371 (2023). https://doi.org/10.1007/s11518-023-5557-9