Knowledge Infused Policy Gradients with Upper Confidence Bound for Relational Bandits

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12975)


Contextual bandits have important use cases in real-life scenarios such as online advertising, recommendation systems, and healthcare. However, most algorithms represent context with flat feature vectors, whereas real-world contexts contain a varying number of objects and relations among them. For example, in a music recommendation system, the user context includes the music they listen to, the artists who create that music, the artists' albums, and so on. Richer relational context representations also introduce a much larger context space, which makes balancing exploration and exploitation harder. To improve the efficiency of exploration-exploitation, knowledge about the context can be infused to guide the strategy. Relational context representations offer a natural way for humans to specify such knowledge, owing to their descriptive nature. We propose an adaptation of Knowledge Infused Policy Gradients to the contextual bandit setting and a novel Knowledge Infused Policy Gradients Upper Confidence Bound algorithm. We perform an experimental analysis on a simulated music recommendation dataset and various real-life datasets, showing where expert knowledge can drastically reduce the total regret and where it cannot.
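The music example can be made concrete with a small sketch. This is not the paper's algorithm. It only illustrates, under assumed names, what a relational context might look like and how a human could state knowledge over it. The predicate names (`listens`, `created_by`, `on_album`), the toy facts, and the rule in `knowledge_preferred_arms` are all hypothetical.

```python
# Hypothetical relational context: ground facts as (predicate, subject, object) triples.
# The number of facts varies per user, which a fixed-length flat vector cannot capture directly.
FACTS = {
    ("listens", "user1", "song_a"),
    ("listens", "user1", "song_b"),
    ("created_by", "song_a", "artist_x"),
    ("created_by", "song_b", "artist_y"),
    ("created_by", "song_c", "artist_x"),
    ("created_by", "song_d", "artist_z"),
    ("on_album", "song_a", "album_1"),
    ("on_album", "song_c", "album_1"),
}


def objects(facts, predicate, subject):
    """All o such that predicate(subject, o) is a known fact."""
    return {o for (p, s, o) in facts if p == predicate and s == subject}


def subjects(facts, predicate, obj):
    """All s such that predicate(s, obj) is a known fact."""
    return {s for (p, s, o) in facts if p == predicate and o == obj}


def knowledge_preferred_arms(facts, user):
    """Illustrative expert rule (an assumption, not taken from the paper):
    prefer songs the user has not heard by artists the user already listens to."""
    heard = objects(facts, "listens", user)
    artists = set()
    for song in heard:
        artists |= objects(facts, "created_by", song)
    candidates = set()
    for artist in artists:
        candidates |= subjects(facts, "created_by", artist)
    return sorted(candidates - heard)


preferred = knowledge_preferred_arms(FACTS, "user1")
print(preferred)
```

A bandit learner could use such a preferred set to bias which arms it explores first. How the knowledge actually enters the policy gradient update is specified in the paper itself, not in this sketch.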

Author information


Corresponding author

Correspondence to Manas Gaur.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Roy, K., Zhang, Q., Gaur, M., Sheth, A. (2021). Knowledge Infused Policy Gradients with Upper Confidence Bound for Relational Bandits. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12975. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86485-9

  • Online ISBN: 978-3-030-86486-6

  • eBook Packages: Computer Science, Computer Science (R0)
