
Interpreting Decision Process in Offline Reinforcement Learning for Interactive Recommendation Systems

Conference paper · Neural Information Processing (ICONIP 2023)

Abstract

Recommendation systems, which predict relevant and appealing items for users on web platforms, often rely on static user interests, resulting in limited interactivity and adaptability. Reinforcement Learning (RL), while providing a dynamic and adaptive approach, brings its unique challenges in this context. Interpreting the behavior of an RL agent within recommendation systems is complex due to factors such as the vast and continuously evolving state and action spaces, non-stationary user preferences, and implicit, delayed rewards often associated with long-term user satisfaction.

Addressing the inherent complexities of applying RL in recommendation systems, we propose a framework that includes innovative metrics and a synthetic environment. The metrics aim to assess the real-time adaptability of an RL agent to dynamic user preferences. We apply this framework to LastFM datasets to interpret metric outcomes and test hypotheses regarding MDP setups and algorithm choices by adjusting dataset parameters within the synthetic environment. This approach illustrates potential applications of our framework, while highlighting the necessity for further research in this area.



Acknowledgments

The authors extend their heartfelt gratitude to Evgeny Frolov and Alexey Skrynnyk for their insightful feedback and guidance on key focus areas for this research.

Author information


Corresponding author

Correspondence to Zoya Volovikova.


Appendices

Classic Recommendation Algorithms

Recommendation systems have been extensively studied, and several classic algorithms have emerged as popular approaches for generating recommendations. In this section, we provide a brief overview of some of these algorithms, including matrix factorization and other notable examples.

1.1 Algorithms

Matrix factorization is a widely used technique for collaborative filtering in recommendation systems. The basic idea is to decompose the user-item interaction matrix into two lower-dimensional matrices, representing latent factors for users and items. The interaction between users and items can then be approximated by the product of these latent factors. Singular Value Decomposition (SVD) and Alternating Least Squares (ALS) are common methods for performing matrix factorization. The objective function for matrix factorization can be written as:

$$\begin{aligned} \min _{U, V} \sum _{(i, j) \in \Omega } (R_{ij} - U_i^T V_j)^2 + \lambda (||U_i||^2 + ||V_j||^2), \end{aligned}$$
(7)

where \(R_{ij}\) is the observed interaction between user i and item j, \(U_i\) and \(V_j\) are the latent factors for user i and item j, respectively, \(\Omega \) is the set of observed user-item interactions, and \(\lambda \) is a regularization parameter to prevent overfitting.
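
For illustration only, the following minimal Python sketch minimizes the objective in Eq. (7) with stochastic gradient descent over the observed entries (SVD or ALS would typically be used in practice); the function name, hyperparameters, and toy ratings are illustrative choices, not the implementation used in our experiments.

```python
import numpy as np

def factorize(R_obs, n_users, n_items, n_factors=16, lr=0.01, reg=0.1, n_epochs=20, seed=0):
    """Fit latent factors U, V by SGD on the regularized squared error over observed entries.

    R_obs: list of (user, item, rating) triples, i.e. the set Omega of observed interactions.
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, n_factors))
    V = 0.1 * rng.standard_normal((n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in R_obs:
            pu, qi = U[u].copy(), V[i].copy()
            err = r - pu @ qi                       # residual R_ij - U_i^T V_j
            U[u] += lr * (err * qi - reg * pu)      # gradient step with L2 penalty lambda = reg
            V[i] += lr * (err * pu - reg * qi)
    return U, V

# toy usage: three users, four items, a handful of observed ratings
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 3, 2.0)]
U, V = factorize(ratings, n_users=3, n_items=4)
print(U @ V.T)  # reconstructed user-item score matrix
```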

Besides matrix factorization, other classic recommendation algorithms include:

  • User-based Collaborative Filtering: This approach finds users who are similar to the target user and recommends items that these similar users have liked or interacted with. The similarity between users can be computed using metrics such as Pearson correlation or cosine similarity; a small cosine-similarity sketch follows this list.

  • Item-based Collaborative Filtering: Instead of focusing on user similarity, this method computes the similarity between items and recommends items that are similar to those the target user has liked or interacted with.

  • Content-based Filtering: This approach utilizes features of items and user profiles to generate recommendations, assuming that users are more likely to be interested in items that are similar to their previous interactions.
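
For concreteness, a small, self-contained sketch of the user-based variant with cosine similarity is given below; the dense toy matrix and the function `user_based_scores` are illustrative assumptions rather than a production implementation.

```python
import numpy as np

def user_based_scores(R, target_user, k=2):
    """Score items for `target_user` from the k most cosine-similar users.

    R: dense user-item matrix with 0 for unobserved interactions (illustrative only).
    """
    norms = np.linalg.norm(R, axis=1, keepdims=True) + 1e-12
    sims = (R / norms) @ (R[target_user] / norms[target_user])  # cosine similarity to every user
    sims[target_user] = -np.inf                                 # exclude the target user themselves
    neighbors = np.argsort(sims)[-k:]                           # indices of the k most similar users
    weights = sims[neighbors]
    scores = weights @ R[neighbors] / (np.abs(weights).sum() + 1e-12)
    scores[R[target_user] > 0] = -np.inf                        # mask items the user already saw
    return scores

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)
print(user_based_scores(R, target_user=1))  # the highest-scoring unseen item is recommended
```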

1.2 Evaluation Metrics for Recommendation Algorithms

Several evaluation metrics are commonly used to assess the performance of recommendation algorithms. We provide brief descriptions and equations for the metrics we use in our work.

Normalized Discounted Cumulative Gain (NDCG), used for measuring the effectiveness of ranking algorithms, takes into account the position of relevant items in the ranked list. First, DCG is calculated:

$$\begin{aligned} \text {DCG}@k = \sum _{i=1}^{k} \frac{\text {rel}_i}{\log _2(i+1)}, \end{aligned}$$
(8)

where k is the number of top recommendations considered, and \(\text {rel}_i\) is the relevance score of the item at position i in the ranked list. The DCG value is then normalized by the ideal DCG (IDCG), which represents the highest possible DCG value:

$$\begin{aligned} \text {NDCG}@k = \frac{\text {DCG}@k}{\text {IDCG}@k}. \end{aligned}$$
(9)
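
The following short sketch computes Eqs. (8) and (9) directly from a list of relevance scores ordered as the recommender ranked them; the helper names are illustrative.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG@k from Eq. (8): relevance discounted by log2 of (1-based rank + 1)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    ranks = np.arange(1, rel.size + 1)
    return float(np.sum(rel / np.log2(ranks + 1)))

def ndcg_at_k(relevances, k):
    """NDCG@k from Eq. (9): DCG normalized by the DCG of an ideally ordered list."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# relevance of items in the order the recommender ranked them
print(ndcg_at_k([3, 2, 0, 1], k=4))  # close to 1: near-ideal ordering
print(ndcg_at_k([0, 1, 2, 3], k=4))  # lower: the most relevant items are ranked last
```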

Coverage measures the ratio of distinct recommended items to the total number of items:

$$\begin{aligned} \text {Coverage} = \frac{|\text {Recommended Items}|}{|\text {Total Items}|}. \end{aligned}$$
(10)

High coverage indicates that the algorithm can recommend a diverse set of items, while low coverage implies that it is limited to a narrow subset. Hit Rate calculates the proportion of relevant recommendations out of the total recommendations provided:

$$\begin{aligned} \text {Hit Rate} = \frac{\text {Number of Hits}}{\text {Total Number of Recommendations}}, \end{aligned}$$
(11)

where a “hit” occurs when a recommended item is considered relevant or of interest to the user. A higher hit rate indicates better performance.
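
Both metrics reduce to a few lines of code. The sketch below mirrors Eqs. (10) and (11) on toy data; the function names and the toy catalog are illustrative.

```python
def coverage(recommended_items, total_items):
    """Eq. (10): fraction of the catalog that appears in the recommendations."""
    return len(set(recommended_items)) / len(total_items)

def hit_rate(recommendations, relevant_items):
    """Eq. (11): share of recommendations that hit an item the user finds relevant."""
    hits = sum(1 for item in recommendations if item in relevant_items)
    return hits / len(recommendations) if recommendations else 0.0

catalog = range(100)
recs = [3, 7, 7, 42, 56]                  # recommendations issued across users
print(coverage(recs, catalog))            # 0.04: only 4 distinct items recommended
print(hit_rate([3, 7, 42], {7, 42, 99}))  # 2/3 of this user's recommendations were hits
```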

Fig. 5. The results of an experiment in a highly dynamic environment when trained on suboptimal data.

Experiments in a Dynamic Environment

Figures 5 and 6 show experiments in a highly dynamic environment for agents trained on suboptimal and optimal data, respectively. Preference Correlation is the best metric for evaluating lower-quality datasets with negative reviews, as it assesses the agent’s understanding of user needs. However, high values can occur if the dataset has many positive ratings and the agent’s coverage is low. For higher-quality datasets with positive responses, the I-HitRate metric correlates with online evaluations but is sensitive to the agent’s coverage.

Fig. 6. The results of an experiment in a highly dynamic environment when trained on optimal data.

Reinforcement Learning as an MDP Problem

Reinforcement learning (RL) addresses the problem of learning optimal behaviors by interacting with an environment. A fundamental concept in RL is the Markov Decision Process (MDP), which models the decision-making problem as a tuple \((S, A, P, R, \gamma )\). In this framework, S represents the state space, A is the action space, P is the state transition probability function, R denotes the reward function, and \(\gamma \) is the discount factor. By formulating the recommendation problem as an MDP, RL algorithms can learn to make decisions that optimize long-term rewards. The MDP framework provides a solid foundation for designing and evaluating RL agents in recommendation systems, allowing for the development of more adaptive and effective algorithms.
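
The sketch below is a minimal, illustrative mapping of a recommendation session onto the MDP tuple \((S, A, P, R, \gamma )\): the state is the user’s interaction history, actions are item ids, rewards are user responses, and the agent optimizes the discounted return. The class and function names are ours and do not correspond to any specific library API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RecState:
    """Illustrative state: the user's recent interaction history and observed feedback."""
    history: List[int] = field(default_factory=list)     # item ids recommended so far
    feedback: List[float] = field(default_factory=list)  # observed rewards (e.g., clicks, ratings)

def step(state: RecState, action: int, user_response: float) -> Tuple[RecState, float]:
    """One MDP transition: recommend `action`, observe the reward, move to the next state."""
    next_state = RecState(history=state.history + [action],
                          feedback=state.feedback + [user_response])
    return next_state, user_response

def discounted_return(rewards, gamma=0.9):
    """The objective the agent optimizes under the MDP: sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

s = RecState()
s, r1 = step(s, action=17, user_response=1.0)  # user engaged with item 17
s, r2 = step(s, action=4, user_response=0.0)   # user ignored item 4
print(discounted_return([r1, r2]))             # 1.0 + 0.9 * 0.0 = 1.0
```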

Comparing Algorithm Behavior in a Highly Dynamic Environment

Table 1. Comparison of the CQL, SAC, BC, and Dert4Rec algorithms in an environment with highly dynamic user mood changes.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Volovikova, Z., Kuderov, P., Panov, A.I. (2024). Interpreting Decision Process in Offline Reinforcement Learning for Interactive Recommendation Systems. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1963. Springer, Singapore. https://doi.org/10.1007/978-981-99-8138-0_22


  • DOI: https://doi.org/10.1007/978-981-99-8138-0_22


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8137-3

  • Online ISBN: 978-981-99-8138-0

