Navigation-based candidate expansion and pretrained language models for citation recommendation

Nogueira, Rodrigo; Jiang, Zhiying; Cho, Kyunghyun; Lin, Jimmy

doi:10.1007/s11192-020-03718-9

Published: 10 October 2020

Scientometrics, Volume 125, pages 3001–3016 (2020)

Rodrigo Nogueira^1,2 (ORCID: orcid.org/0000-0002-2600-6035),
Zhiying Jiang^2,
Kyunghyun Cho^3,4,5,6 &
Jimmy Lin^2

Abstract

Citation recommendation systems for the scientific literature, to help authors find papers that should be cited, have the potential to speed up discoveries and uncover new routes for scientific exploration. We treat this task as a ranking problem, which we tackle with a two-stage approach: candidate generation followed by reranking. Within this framework, we adapt to the scientific domain a proven combination based on “bag of words” retrieval followed by rescoring with a BERT model. We experimentally show the effects of domain adaptation, both in terms of pretraining on in-domain data and exploiting in-domain vocabulary. In addition, we introduce a novel navigation-based document expansion strategy to enrich the candidate documents fed into our neural models. On three benchmark datasets, our methods achieve or rival the state of the art in the citation recommendation task.
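
The two-stage approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the BM25 parameter values, and the toy corpus are all assumptions made for illustration, and `overlap_scorer` is only a stand-in for the BERT model that would rescore candidates in practice.

```python
import math
from collections import Counter


def tokenize(text):
    # Bag-of-words tokenization: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=0.9, b=0.4):
    # Score every document against the query with a common BM25 variant.
    # k1 and b are illustrative values, not taken from the paper.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n if n else 0.0
    df = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def generate_candidates(query, docs, k):
    # Stage 1: keep the indices of the k highest-scoring documents.
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def overlap_scorer(query, doc):
    # Hypothetical placeholder for a neural relevance model:
    # Jaccard overlap between the query and document token sets.
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if q | d else 0.0


def rerank(query, docs, candidates, scorer):
    # Stage 2: reorder the candidates with a (more expensive) scorer.
    return sorted(candidates, key=lambda i: scorer(query, docs[i]), reverse=True)


if __name__ == "__main__":
    corpus = [
        "bert reranking for passage retrieval",
        "citation recommendation with neural ranking models",
        "protein folding dynamics",
    ]
    query = "neural citation recommendation"
    candidates = generate_candidates(query, corpus, k=2)
    print(rerank(query, corpus, candidates, overlap_scorer))
```

The split keeps the expensive scorer away from the full corpus: only the `k` candidates that survive the first stage are rescored, so `k` trades recall against reranking cost.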

Notes

https://www.nlm.nih.gov/bsd/stats/cit_added.html.
This work was initially published at the 10th International Workshop on Bibliometric-Enhanced Information Retrieval (BIR 2020) (Nogueira et al. 2020). The main additions to this publication are the introduction of the navigation-based method for expanding candidates and an analysis of the impact of query length on the models’ effectiveness.
https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2017-02-21/papers-2017-02-21.zip.
https://github.com/allenai/citeomatic/blob/master/citeomatic/scripts/evaluate.py.
http://anserini.io/.
https://whoosh.readthedocs.io/en/latest/.
https://github.com/allenai/citeomatic#citeomatic-evaluation.

References

Ammar, W., Groeneveld, D., Bhagavatula, C., Beltagy, I., Crawford, M., Downey, D., Dunkelberger, J., Elgohary, A., Feldman, S., Ha, V., Kinney, R., Kohlmeier, S., Lo, K., Murray, T., Ooi, H.H., Peters, M., Power, J., Skjonsberg, S., Wang, L., Wilhelm, C., Yuan, Z., van Zuylen, M., & Etzioni, O. (2018). Construction of the literature graph in semantic scholar. In Proceedings of the 2018 conference of the north american chapter of the association for computational linguistics: Human language technologies, Vol. 3 (Industry Papers), pp. 84–91.
Basu, C., Hirsh, H., Cohen, W. W., & Nevill-Manning, C. (2001). Technical paper recommendation: A study in combining multiple information sources. Journal of Artificial Intelligence Research, 14, 231–252.
Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, pp. 3615–3620.
Bhagavatula, C., Feldman, S., Power, R., & Ammar, W. (2018). Content-based citation recommendation. In Proceedings of the 2018 conference of the north american chapter of the association for computational linguistics: Human language technologies, Vol. 1 (Long Papers), pp. 238–251.
Bollacker, K. D., Lawrence, S., & Giles, C. L. (1999). A system for automatic personalized tracking of scientific literature on the web. In Proceedings of the fourth ACM conference on digital libraries, pp. 105–113.
Bordes, A., Chopra, S., & Weston, J. (2014). Question answering with subgraph embeddings. In Proceedings of the 2014 conference on empirical methods in natural language processing, pp. 615–620.
Caragea, C., Silvescu, A., Mitra, P., & Giles, C. L. (2013). Can’t see the forest for the trees? A citation recommendation system. In Proceedings of the 13th ACM/IEEE-CS joint conference on digital libraries, pp. 111–114.
Chen, T. T., & Lee, M. (2018). Research paper recommender systems on big scholarly data. In Pacific rim knowledge acquisition workshop, pp. 251–260.
Das, R., Dhuliawala, S., Zaheer, M., Vilnis, L., Durugkar, I., Krishnamurthy, A., Smola, A., & McCallum, A. (2017). Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. arXiv:1711.05851.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American Chapter of the association for computational linguistics: Human language technologies, Vol. 1 (Long and Short Papers), pp. 4171–4186.
Dinh, D., & Tamine, L. (2011). Combining global and local semantic contexts for improving biomedical information retrieval. In European conference on information retrieval, pp. 375–386.
Eto, M. (2019). Extended co-citation search: Graph-based document retrieval on a co-citation network containing citation context information. Information Processing & Management, 56(6), 102046.
Fiorini, N., Canese, K., Starchenko, G., Kireev, E., Kim, W., Miller, V., et al. (2018a). Best match: New relevance search for PubMed. PLoS Biology, 16(8), e2005343.
Fiorini, N., Leaman, R., Lipman, D. J., & Lu, Z. (2018b). How user intelligence is improving PubMed. Nature Biotechnology, 36(10), 937.
Gao, Y., Kinoshita, J., Wu, E., Miller, E., Lee, R., Seaborne, A., et al. (2006). Swan: A distributed knowledge infrastructure for Alzheimer disease research. Web Semantics: Science, Services and Agents on the World Wide Web, 4(3), 222–228.
Ginsparg, P. (1994). First steps towards electronic research communication. Computers in Physics, 8(4), 390–396.
Guu, K., Miller, J., & Liang, P. (2015). Traversing knowledge graphs in vector space. arXiv:1506.01094.
He, Q., Kifer, D., Pei, J., Mitra, P., & Giles, C.L. (2011). Citation recommendation without author supervision. In Proceedings of the fourth ACM international conference on web search and data mining, pp. 755–764.
He, Q., Pei, J., Kifer, D., Mitra, P., & Giles, C. L. (2010). Context-aware citation recommendation. In Proceedings of the 19th international conference on world wide web, pp. 421–430.
Huang, W., Kataria, S., Caragea, C., Mitra, P., Giles, C. L., & Rokach, L. (2012). Recommending citations: Translating papers into references. In Proceedings of the 21st ACM international conference on information and knowledge management, pp. 1910–1914.
Huang, W., Wu, Z., Liang, C., Mitra, P., & Giles, C. L. (2015). A neural probabilistic model for context based citation recommendation. In Proceedings of the twenty-ninth AAAI conference on artificial intelligence.
Jerome, R. N., Giuse, N. B., Gish, K. W., Sathe, N. A., & Dietrich, M. S. (2001). Information needs of clinical teams: Analysis of questions received by the clinical informatics consult service. Bulletin of the Medical Library Association, 89(2), 177.
Jiang, Z., Liu, X., & Gao, L. (2015). Chronological citation recommendation with information-need shifting. In Proceedings of the 24th ACM international conference on information and knowledge management, pp. 1291–1300.
Jiang, Z., Yin, Y., Gao, L., Lu, Y., & Liu, X. (2018). Cross-language citation recommendation via hierarchical representation learning on heterogeneous graph. In Proceedings of the 41st annual international ACM SIGIR conference on research and development in information retrieval, pp. 635–644.
Johnson, R., Watkinson, A., & Mabe, M. (2018). The STM report: An overview of scientific and scholarly publishing. International Association of Scientific, Technical and Medical Publishers.
Kanakia, A., Shen, Z., Eide, D., & Wang, K. (2019). A scalable hybrid research paper recommender system for Microsoft Academic. In The world wide web conference, pp. 2893–2899.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the 3rd international conference on learning representations.
Kobayashi, Y., Shimbo, M., & Matsumoto, Y. (2018). Citation recommendation using distributed representation of discourse facets in scientific articles. In Proceedings of the 18th ACM/IEEE on joint conference on digital libraries, pp. 243–251.
Kodakateri Pudhiyaveetil, A., Gauch, S., Luong, H., & Eno, J. (2009). Conceptual recommender system for CiteSeerX. In Proceedings of the Third ACM conference on recommender systems, pp. 241–244.
Lao, N., & Cohen, W. W. (2010). Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(1), 53–67.
Lawrence, S., Bollacker, K., & Giles, C. L. (1999). Indexing and retrieval of scientific literature. In Proceedings of the 8th ACM international conference on information and knowledge management, pp. 139–146.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2019). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. arXiv:1901.08746.
Lin, J. (2019). The neural hype and comparisons against weak baselines. In ACM SIGIR Forum, Vol. 52, ACM, pp. 40–51.
Lin, X.V., Socher, R., & Xiong, C. (2018). Multi-hop knowledge graph reasoning with reward shaping. arXiv:1808.10568.
Liu, H., Kong, X., Bai, X., Wang, W., Bekele, T. M., & Xia, F. (2015). Context-based collaborative filtering for citation recommendation. IEEE Access, 3, 1695–1703.
Liu, X., Yu, Y., Guo, C., & Sun, Y. (2014). Meta-path-based ranking with pseudo relevance feedback on heterogeneous graph for citation recommendation. In Proceedings of the 23rd ACM international conference on information and knowledge management, pp. 121–130.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
Livne, A., Gokuladas, V., Teevan, J., Dumais, S. T., & Adar, E. (2014). CiteSight: Supporting contextual citation recommendation using differential search. In Proceedings of the 37th annual international ACM SIGIR conference on research and development in information retrieval, pp. 807–816.
Lu, Y., He, J., Shan, D., & Yan, H. (2011). Recommending citations with translation model. In Proceedings of the 20th ACM international conference on information and knowledge management, pp. 2017–2020.
Ma, S., Zhang, C., & Liu, X. (2020). A review of citation recommendation: From textual content to enriched context. In Scientometrics, pp. 1–28.
MacAvaney, S., Yates, A., Cohan, A., & Goharian, N. (2019). CEDR: Contextualized embeddings for document ranking. arXiv:1904.07094.
McNee, S. M., Albert, I., Cosley, D., Gopalkrishnan, P., Lam, S. K., Rashid, A. M., Konstan, J. A., & Riedl, J. (2002). On the recommending of citations for research papers. In Proceedings of the 2002 ACM conference on computer supported cooperative work, pp. 116–125.
Mohan, S., Fiorini, N., Kim, S., & Lu, Z. (2017). Deep learning for biomedical information retrieval: Learning textual relevance from click logs. BioNLP, 2017, 222–231.
Nabeel Asim, M., Wasim, M., Usman Ghani Khan, M., & Mahmood, W. (2018). Improved biomedical term selection in pseudo relevance feedback. Database.
Nogueira, R., & Cho, K. (2016). End-to-end goal-driven web navigation. In Advances in neural information processing systems, pp. 1903–1911.
Nogueira, R., & Cho, K. (2019). Passage re-ranking with BERT. arXiv:1901.04085.
Nogueira, R., Jiang, Z., Cho, K., & Lin, J. (2020). Evaluating pretrained transformer models for citation recommendation. In BIR@ECIR.
Peng, Y., Yan, S., & Lu, Z. (2019). Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv:1906.05474.
Ren, X., Liu, J., Yu, X., Khandelwal, U., Gu, Q., Wang, L., & Han, J. (2014). ClusCite: Effective citation recommendation by information network-based clustering. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 821–830.
Spangler, S., Wilkins, A. D., Bachman, B. J., Nagarajan, M., Dayaram, T., Haas, P. J., Regenbogen, S., Pickering, C. R., Comer, A., Myers, J. N., Stanoi, I. R., Kato, L., Lelescu, A., Labrie, J. J., Parikh, N., Lisewski, A. M., Donehower, L., Chen, Y., & Lichtarge, O. (2014). Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1877–1886.
Sugiyama, K., & Kan, M. Y. (2013). Exploiting potential citation papers in scholarly paper recommendation. In Proceedings of the 13th ACM/IEEE-CS joint conference on digital libraries, pp. 153–162.
Sugiyama, K., & Kan, M. Y. (2015). A comprehensive evaluation of scholarly paper recommendation using potential citation papers. International Journal on Digital Libraries, 16(2), 91–109.
Sybrandt, J., Shtutman, M., & Safro, I. (2017). Moliere: Automatic biomedical hypothesis generation system. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1633–1642.
Yang, P., Fang, H., & Lin, J. (2017). Anserini: Enabling the use of Lucene for information retrieval research. In Proceedings of the 40th annual international ACM SIGIR conference on research and development in information retrieval, pp. 1253–1256.
Yang, P., Fang, H., & Lin, J. (2018). Anserini: Reproducible ranking baselines using Lucene. Journal of Data and Information Quality, 10(4), Article 16.
Yang, W., Lu, K., Yang, P., & Lin, J. (2019). Critically examining the “neural hype”: Weak baselines and the additivity of effectiveness gains from neural ranking models. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pp. 1129–1132.

Acknowledgements

This research was supported in part by the Canada First Research Excellence Fund, the Natural Sciences and Engineering Research Council (NSERC) of Canada, NVIDIA, and eBay. Additionally, we would like to thank Google for computational resources in the form of Google Cloud credits.

Author information

Authors and Affiliations

Tandon School of Engineering, New York University, New York, USA
Rodrigo Nogueira
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
Rodrigo Nogueira, Zhiying Jiang & Jimmy Lin
Courant Institute of Mathematical Sciences, New York University, New York, USA
Kyunghyun Cho
Center for Data Science, New York University, New York, USA
Kyunghyun Cho
Facebook AI Research, New York, USA
Kyunghyun Cho
CIFAR Azrieli Global Scholar, Toronto, Canada
Kyunghyun Cho

Corresponding author

Correspondence to Rodrigo Nogueira.

About this article

Cite this article

Nogueira, R., Jiang, Z., Cho, K. et al. Navigation-based candidate expansion and pretrained language models for citation recommendation. Scientometrics 125, 3001–3016 (2020). https://doi.org/10.1007/s11192-020-03718-9

Received: 19 May 2020
Published: 10 October 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s11192-020-03718-9
