
Interpreting Decision Process in Offline Reinforcement Learning for Interactive Recommendation Systems

Conference paper · Neural Information Processing (ICONIP 2023)

Abstract

Recommendation systems, which predict relevant and appealing items for users on web platforms, often rely on static user interests, resulting in limited interactivity and adaptability. Reinforcement Learning (RL), while providing a dynamic and adaptive approach, brings its unique challenges in this context. Interpreting the behavior of an RL agent within recommendation systems is complex due to factors such as the vast and continuously evolving state and action spaces, non-stationary user preferences, and implicit, delayed rewards often associated with long-term user satisfaction.

Addressing the inherent complexities of applying RL in recommendation systems, we propose a framework that includes innovative metrics and a synthetic environment. The metrics aim to assess the real-time adaptability of an RL agent to dynamic user preferences. We apply this framework to LastFM datasets to interpret metric outcomes and test hypotheses regarding MDP setups and algorithm choices by adjusting dataset parameters within the synthetic environment. This approach illustrates potential applications of our framework, while highlighting the necessity for further research in this area.



Acknowledgments

The authors extend their heartfelt gratitude to Evgeny Frolov and Alexey Skrynnyk for their insightful feedback and guidance on key focus areas for this research.

Author information


Corresponding author

Correspondence to Zoya Volovikova.


Appendices

Classic Recommendation Algorithms

Recommendation systems have been extensively studied, and several classic algorithms have emerged as popular approaches for generating recommendations. In this section, we provide a brief overview of some of these algorithms, including matrix factorization and other notable examples.

1.1 Algorithms

Matrix factorization is a widely used technique for collaborative filtering in recommendation systems. The basic idea is to decompose the user-item interaction matrix into two lower-dimensional matrices, representing latent factors for users and items. The interaction between users and items can then be approximated by the product of these latent factors. Singular Value Decomposition (SVD) and Alternating Least Squares (ALS) are common methods for performing matrix factorization. The objective function for matrix factorization can be written as:

$$\begin{aligned} \min _{U, V} \sum _{(i, j) \in \Omega } (R_{ij} - U_i^T V_j)^2 + \lambda (||U_i||^2 + ||V_j||^2), \end{aligned}$$
(7)

where \(R_{ij}\) is the observed interaction between user i and item j, \(U_i\) and \(V_j\) are the latent factors for user i and item j, respectively, \(\Omega \) is the set of observed user-item interactions, and \(\lambda \) is a regularization parameter to prevent overfitting.
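
For illustration only, the following minimal Python sketch minimizes the objective in Eq. (7) with stochastic gradient descent over the observed entries (SVD or ALS would typically be used in practice); the function name, hyperparameters, and toy ratings are illustrative choices, not the implementation used in our experiments.

```python
import numpy as np

def factorize(R_obs, n_users, n_items, n_factors=16, lr=0.01, reg=0.1, n_epochs=20, seed=0):
    """Fit latent factors U, V by SGD on the regularized squared error over observed entries.

    R_obs: list of (user, item, rating) triples, i.e. the set Omega of observed interactions.
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, n_factors))
    V = 0.1 * rng.standard_normal((n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in R_obs:
            pu, qi = U[u].copy(), V[i].copy()
            err = r - pu @ qi                       # residual R_ij - U_i^T V_j
            U[u] += lr * (err * qi - reg * pu)      # gradient step with L2 penalty lambda = reg
            V[i] += lr * (err * pu - reg * qi)
    return U, V

# toy usage: three users, four items, a handful of observed ratings
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 3, 2.0)]
U, V = factorize(ratings, n_users=3, n_items=4)
print(U @ V.T)  # reconstructed user-item score matrix
```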

Besides matrix factorization, other classic recommendation algorithms include:

  • User-based Collaborative Filtering: This approach finds users who are similar to the target user and recommends items that these similar users have liked or interacted with. The similarity between users can be computed using metrics such as Pearson correlation or cosine similarity; a small cosine-similarity sketch follows this list.

  • Item-based Collaborative Filtering: Instead of focusing on user similarity, this method computes the similarity between items and recommends items that are similar to those the target user has liked or interacted with.

  • Content-based Filtering: This approach utilizes features of items and user profiles to generate recommendations, assuming that users are more likely to be interested in items that are similar to their previous interactions.
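
For concreteness, a small, self-contained sketch of the user-based variant with cosine similarity is given below; the dense toy matrix and the function `user_based_scores` are illustrative assumptions rather than a production implementation.

```python
import numpy as np

def user_based_scores(R, target_user, k=2):
    """Score items for `target_user` from the k most cosine-similar users.

    R: dense user-item matrix with 0 for unobserved interactions (illustrative only).
    """
    norms = np.linalg.norm(R, axis=1, keepdims=True) + 1e-12
    sims = (R / norms) @ (R[target_user] / norms[target_user])  # cosine similarity to every user
    sims[target_user] = -np.inf                                 # exclude the target user themselves
    neighbors = np.argsort(sims)[-k:]                           # indices of the k most similar users
    weights = sims[neighbors]
    scores = weights @ R[neighbors] / (np.abs(weights).sum() + 1e-12)
    scores[R[target_user] > 0] = -np.inf                        # mask items the user already saw
    return scores

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)
print(user_based_scores(R, target_user=1))  # the highest-scoring unseen item is recommended
```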

1.2 Evaluation Metrics for Recommendation Algorithms

Several evaluation metrics are commonly used to assess the performance of recommendation algorithms. We provide brief descriptions and equations for the metrics we use in our work.

Normalized Discounted Cumulative Gain (NDCG), used for measuring the effectiveness of ranking algorithms, takes into account the position of relevant items in the ranked list. First, DCG is calculated:

$$\begin{aligned} \text {DCG}@k = \sum _{i=1}^{k} \frac{\text {rel}_i}{\log _2(i+1)}, \end{aligned}$$
(8)

where k is the number of top recommendations considered, and \(\text {rel}_i\) is the relevance score of the item at position i in the ranked list. The DCG value is then normalized by the ideal DCG (IDCG), which represents the highest possible DCG value:

$$\begin{aligned} \text {NDCG}@k = \frac{\text {DCG}@k}{\text {IDCG}@k}. \end{aligned}$$
(9)
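
The following short sketch computes Eqs. (8) and (9) directly from a list of relevance scores ordered as the recommender ranked them; the helper names are illustrative.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG@k from Eq. (8): relevance discounted by log2 of (1-based rank + 1)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    ranks = np.arange(1, rel.size + 1)
    return float(np.sum(rel / np.log2(ranks + 1)))

def ndcg_at_k(relevances, k):
    """NDCG@k from Eq. (9): DCG normalized by the DCG of an ideally ordered list."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# relevance of items in the order the recommender ranked them
print(ndcg_at_k([3, 2, 0, 1], k=4))  # close to 1: near-ideal ordering
print(ndcg_at_k([0, 1, 2, 3], k=4))  # lower: the most relevant items are ranked last
```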

Coverage measures the ratio of distinct recommended items to the total number of items:

$$\begin{aligned} \text {Coverage} = \frac{|\text {Recommended Items}|}{|\text {Total Items}|}. \end{aligned}$$
(10)

High coverage indicates that the algorithm can recommend a diverse set of items, while low coverage implies that it is limited to a narrow subset. Hit Rate calculates the proportion of relevant recommendations out of the total recommendations provided:

$$\begin{aligned} \text {Hit Rate} = \frac{\text {Number of Hits}}{\text {Total Number of Recommendations}}, \end{aligned}$$
(11)

where a “hit” occurs when a recommended item is considered relevant or of interest to the user. A higher hit rate indicates better performance.
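
Both metrics reduce to a few lines of code. The sketch below mirrors Eqs. (10) and (11) on toy data; the function names and the toy catalog are illustrative.

```python
def coverage(recommended_items, total_items):
    """Eq. (10): fraction of the catalog that appears in the recommendations."""
    return len(set(recommended_items)) / len(total_items)

def hit_rate(recommendations, relevant_items):
    """Eq. (11): share of recommendations that hit an item the user finds relevant."""
    hits = sum(1 for item in recommendations if item in relevant_items)
    return hits / len(recommendations) if recommendations else 0.0

catalog = range(100)
recs = [3, 7, 7, 42, 56]                  # recommendations issued across users
print(coverage(recs, catalog))            # 0.04: only 4 distinct items recommended
print(hit_rate([3, 7, 42], {7, 42, 99}))  # 2/3 of this user's recommendations were hits
```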

Fig. 5. The results of an experiment in a highly dynamic environment when trained on suboptimal data.

Experiments in a Dynamic Environment

Figures 5 and 6 show experiments in a highly dynamic environment for agents trained on suboptimal and optimal data, respectively. Preference Correlation is the best metric for evaluating lower-quality datasets with negative reviews, as it assesses the agent’s understanding of user needs. However, high values can occur if the dataset has many positive ratings and the agent’s coverage is low. For higher-quality datasets with positive responses, the I-HitRate metric correlates with online evaluations but is sensitive to the agent’s coverage.

Fig. 6. The results of an experiment in a highly dynamic environment when trained on optimal data.

Reinforcement Learning as an MDP Problem

Reinforcement learning (RL) addresses the problem of learning optimal behaviors by interacting with an environment. A fundamental concept in RL is the Markov Decision Process (MDP), which models the decision-making problem as a tuple \((S, A, P, R, \gamma )\). In this framework, S represents the state space, A is the action space, P is the state transition probability function, R denotes the reward function, and \(\gamma \) is the discount factor. By formulating the recommendation problem as an MDP, RL algorithms can learn to make decisions that optimize long-term rewards. The MDP framework provides a solid foundation for designing and evaluating RL agents in recommendation systems, allowing for the development of more adaptive and effective algorithms.
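
The sketch below is a minimal, illustrative mapping of a recommendation session onto the MDP tuple \((S, A, P, R, \gamma )\): the state is the user’s interaction history, actions are item ids, rewards are user responses, and the agent optimizes the discounted return. The class and function names are ours and do not correspond to any specific library API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RecState:
    """Illustrative state: the user's recent interaction history and observed feedback."""
    history: List[int] = field(default_factory=list)     # item ids recommended so far
    feedback: List[float] = field(default_factory=list)  # observed rewards (e.g., clicks, ratings)

def step(state: RecState, action: int, user_response: float) -> Tuple[RecState, float]:
    """One MDP transition: recommend `action`, observe the reward, move to the next state."""
    next_state = RecState(history=state.history + [action],
                          feedback=state.feedback + [user_response])
    return next_state, user_response

def discounted_return(rewards, gamma=0.9):
    """The objective the agent optimizes under the MDP: sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

s = RecState()
s, r1 = step(s, action=17, user_response=1.0)  # user engaged with item 17
s, r2 = step(s, action=4, user_response=0.0)   # user ignored item 4
print(discounted_return([r1, r2]))             # 1.0 + 0.9 * 0.0 = 1.0
```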

Comparing Algorithm Behavior in a Highly Dynamic Environment

Table 1. Comparison of the CQL, SAC, BC, and Dert4Rec algorithms in an environment with highly dynamic user mood changes.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Volovikova, Z., Kuderov, P., Panov, A.I. (2024). Interpreting Decision Process in Offline Reinforcement Learning for Interactive Recommendation Systems. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1963. Springer, Singapore. https://doi.org/10.1007/978-981-99-8138-0_22


  • DOI: https://doi.org/10.1007/978-981-99-8138-0_22


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8137-3

  • Online ISBN: 978-981-99-8138-0

