Navigation-based candidate expansion and pretrained language models for citation recommendation

Scientometrics

Abstract

Citation recommendation systems for the scientific literature, which help authors find papers that should be cited, have the potential to speed up discoveries and uncover new routes for scientific exploration. We treat this task as a ranking problem, which we tackle with a two-stage approach: candidate generation followed by reranking. Within this framework, we adapt to the scientific domain a proven combination of “bag of words” retrieval followed by rescoring with a BERT model. We experimentally show the effects of domain adaptation, both in terms of pretraining on in-domain data and of exploiting in-domain vocabulary. In addition, we introduce a novel navigation-based document expansion strategy to enrich the candidate documents fed into our neural models. On three benchmark datasets, our methods achieve or rival the state of the art in the citation recommendation task.
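The two-stage pipeline described in the abstract can be illustrated with a minimal, self-contained sketch. It is not the authors' implementation: the BM25 scorer is a plain-Python stand-in for a search toolkit, `overlap_scorer` is a hypothetical placeholder for a fine-tuned BERT relevance model, and `expand` assumes that navigation means following outgoing links from the retrieved candidates, which the abstract does not specify. All names, parameter values, and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use a proper analyzer.
    return text.lower().split()


class BM25:
    """Stage 1: bag-of-words candidate generation (illustrative parameters)."""

    def __init__(self, docs, k1=0.9, b=0.4):
        self.k1, self.b = k1, b
        self.tf = {d: Counter(tokenize(t)) for d, t in docs.items()}
        self.length = {d: sum(c.values()) for d, c in self.tf.items()}
        self.avg_len = sum(self.length.values()) / len(docs)
        df = Counter(term for c in self.tf.values() for term in c)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, doc_id):
        tf = self.tf[doc_id]
        norm = self.length[doc_id] / self.avg_len
        total = 0.0
        for term in set(tokenize(query)):
            f = tf.get(term, 0)
            if f:
                denom = f + self.k1 * (1 - self.b + self.b * norm)
                total += self.idf[term] * f * (self.k1 + 1) / denom
        return total

    def search(self, query, k):
        # Keep only documents that share at least one term with the query.
        scored = [(self.score(query, d), d) for d in self.tf]
        scored = [(s, d) for s, d in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [d for _, d in scored[:k]]


def expand(candidates, links, hops=1):
    # ASSUMPTION: "navigation" is modeled as following outgoing links
    # (e.g. citations) from each candidate for a fixed number of hops.
    seen, frontier = list(candidates), list(candidates)
    for _ in range(hops):
        next_frontier = []
        for doc_id in frontier:
            for neighbor in links.get(doc_id, []):
                if neighbor not in seen:
                    seen.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return seen


def overlap_scorer(query, text):
    # Placeholder for a neural reranker: Jaccard overlap of token sets.
    q, t = set(tokenize(query)), set(tokenize(text))
    return len(q & t) / len(q | t) if q | t else 0.0


def recommend(query, docs, links, k=2, scorer=overlap_scorer):
    candidates = BM25(docs).search(query, k)       # stage 1: retrieval
    candidates = expand(candidates, links)         # candidate expansion
    # Stage 2: rescore every candidate and sort by descending score.
    return sorted(candidates, key=lambda d: (-scorer(query, docs[d]), d))


# Toy corpus; the link p1 -> p4 is the only way p4 can become a candidate.
docs = {
    "p1": "neural ranking with pretrained transformers",
    "p2": "bag of words retrieval with bm25",
    "p3": "protein folding simulation",
    "p4": "pretrained language model for scientific text",
}
links = {"p1": ["p4"]}
print(recommend("neural ranking transformers", docs, links))
```

In this toy run, `p4` shares no terms with the query, so keyword retrieval alone never surfaces it; it only reaches the reranker because it is linked from `p1`. Swapping `overlap_scorer` for a learned model changes the second stage without touching retrieval or expansion.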

Notes

  1. https://www.nlm.nih.gov/bsd/stats/cit_added.html.

  2. This work was initially published at the 10th International Workshop on Bibliometric-Enhanced Information Retrieval (BIR 2020) (Nogueira et al. 2020). The main additions to this publication are the introduction of the navigation-based method for expanding candidates and an analysis of the impact of query length on the models’ effectiveness.

  3. https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2017-02-21/papers-2017-02-21.zip.

  4. https://github.com/allenai/citeomatic/blob/master/citeomatic/scripts/evaluate.py.

  5. http://anserini.io/.

  6. https://whoosh.readthedocs.io/en/latest/.

  7. https://github.com/allenai/citeomatic#citeomatic-evaluation.

References

  • Ammar, W., Groeneveld, D., Bhagavatula, C., Beltagy, I., Crawford, M., Downey, D., Dunkelberger, J., Elgohary, A., Feldman, S., Ha, V., Kinney, R., Kohlmeier, S., Lo, K., Murray, T., Ooi, H. H., Peters, M., Power, J., Skjonsberg, S., Wang, L., Wilhelm, C., Yuan, Z., van Zuylen, M., & Etzioni, O. (2018). Construction of the literature graph in Semantic Scholar. In Proceedings of the 2018 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies, Vol. 3 (Industry Papers), pp. 84–91.

  • Basu, C., Hirsh, H., Cohen, W. W., & Nevill-Manning, C. (2001). Technical paper recommendation: A study in combining multiple information sources. Journal of Artificial Intelligence Research, 14, 231–252.

  • Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, pp. 3615–3620.

  • Bhagavatula, C., Feldman, S., Power, R., & Ammar, W. (2018). Content-based citation recommendation. In Proceedings of the 2018 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies, Vol. 1 (Long Papers), pp. 238–251.

  • Bollacker, K. D., Lawrence, S., & Giles, C. L. (1999). A system for automatic personalized tracking of scientific literature on the web. In Proceedings of the fourth ACM conference on digital libraries, pp. 105–113.

  • Bordes, A., Chopra, S., & Weston, J. (2014). Question answering with subgraph embeddings. In Proceedings of the 2014 conference on empirical methods in natural language processing, pp. 615–620.

  • Caragea, C., Silvescu, A., Mitra, P., & Giles, C. L. (2013). Can’t see the forest for the trees? A citation recommendation system. In Proceedings of the 13th ACM/IEEE-CS joint conference on digital libraries, pp. 111–114.

  • Chen, T. T., & Lee, M. (2018). Research paper recommender systems on big scholarly data. In Pacific rim knowledge acquisition workshop, pp. 251–260.

  • Das, R., Dhuliawala, S., Zaheer, M., Vilnis, L., Durugkar, I., Krishnamurthy, A., Smola, A., & McCallum, A. (2017). Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. arXiv:1711.05851.

  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies, Vol. 1 (Long and Short Papers), pp. 4171–4186.

  • Dinh, D., & Tamine, L. (2011). Combining global and local semantic contexts for improving biomedical information retrieval. In European conference on information retrieval, pp. 375–386.

  • Eto, M. (2019). Extended co-citation search: Graph-based document retrieval on a co-citation network containing citation context information. Information Processing & Management, 56(6), 102046.

  • Fiorini, N., Canese, K., Starchenko, G., Kireev, E., Kim, W., Miller, V., et al. (2018a). Best match: New relevance search for PubMed. PLoS Biology, 16(8), e2005343.

  • Fiorini, N., Leaman, R., Lipman, D. J., & Lu, Z. (2018b). How user intelligence is improving PubMed. Nature Biotechnology, 36(10), 937.

  • Gao, Y., Kinoshita, J., Wu, E., Miller, E., Lee, R., Seaborne, A., et al. (2006). Swan: A distributed knowledge infrastructure for Alzheimer disease research. Web Semantics: Science, Services and Agents on the World Wide Web, 4(3), 222–228.

  • Ginsparg, P. (1994). First steps towards electronic research communication. Computers in Physics, 8(4), 390–396.

  • Guu, K., Miller, J., & Liang, P. (2015). Traversing knowledge graphs in vector space. arXiv:1506.01094.

  • He, Q., Kifer, D., Pei, J., Mitra, P., & Giles, C. L. (2011). Citation recommendation without author supervision. In Proceedings of the fourth ACM international conference on web search and data mining, pp. 755–764.

  • He, Q., Pei, J., Kifer, D., Mitra, P., & Giles, C. L. (2010). Context-aware citation recommendation. In Proceedings of the 19th international conference on world wide web, pp. 421–430.

  • Huang, W., Kataria, S., Caragea, C., Mitra, P., Giles, C. L., & Rokach, L. (2012). Recommending citations: Translating papers into references. In Proceedings of the 21st ACM international conference on information and knowledge management, pp. 1910–1914.

  • Huang, W., Wu, Z., Liang, C., Mitra, P., & Giles, C. L. (2015). A neural probabilistic model for context based citation recommendation. In Proceedings of the twenty-ninth AAAI conference on artificial intelligence.

  • Jerome, R. N., Giuse, N. B., Gish, K. W., Sathe, N. A., & Dietrich, M. S. (2001). Information needs of clinical teams: Analysis of questions received by the clinical informatics consult service. Bulletin of the Medical Library Association, 89(2), 177.

  • Jiang, Z., Liu, X., & Gao, L. (2015). Chronological citation recommendation with information-need shifting. In Proceedings of the 24th ACM international conference on information and knowledge management, pp. 1291–1300.

  • Jiang, Z., Yin, Y., Gao, L., Lu, Y., & Liu, X. (2018). Cross-language citation recommendation via hierarchical representation learning on heterogeneous graph. In Proceedings of the 41st annual international ACM SIGIR conference on research and development in information retrieval, pp. 635–644.

  • Johnson, R., Watkinson, A., & Mabe, M. (2018). The STM report: An overview of scientific and scholarly publishing. International Association of Scientific, Technical and Medical Publishers.

  • Kanakia, A., Shen, Z., Eide, D., & Wang, K. (2019). A scalable hybrid research paper recommender system for Microsoft Academic. In The world wide web conference, pp. 2893–2899.

  • Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the 3rd international conference on learning representations.

  • Kobayashi, Y., Shimbo, M., & Matsumoto, Y. (2018). Citation recommendation using distributed representation of discourse facets in scientific articles. In Proceedings of the 18th ACM/IEEE on joint conference on digital libraries, pp. 243–251.

  • Kodakateri Pudhiyaveetil, A., Gauch, S., Luong, H., & Eno, J. (2009). Conceptual recommender system for CiteSeerX. In Proceedings of the third ACM conference on recommender systems, pp. 241–244.

  • Lao, N., & Cohen, W. W. (2010). Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(1), 53–67.

  • Lawrence, S., Bollacker, K., & Giles, C. L. (1999). Indexing and retrieval of scientific literature. In Proceedings of the 8th ACM international conference on information and knowledge management, pp. 139–146.

  • Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2019). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. arXiv:1901.08746.

  • Lin, J. (2019). The neural hype and comparisons against weak baselines. In ACM SIGIR Forum, Vol. 52, ACM, pp. 40–51.

  • Lin, X. V., Socher, R., & Xiong, C. (2018). Multi-hop knowledge graph reasoning with reward shaping. arXiv:1808.10568.

  • Liu, H., Kong, X., Bai, X., Wang, W., Bekele, T. M., & Xia, F. (2015). Context-based collaborative filtering for citation recommendation. IEEE Access, 3, 1695–1703.

  • Liu, X., Yu, Y., Guo, C., & Sun, Y. (2014). Meta-path-based ranking with pseudo relevance feedback on heterogeneous graph for citation recommendation. In Proceedings of the 23rd ACM international conference on information and knowledge management, pp. 121–130.

  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

  • Livne, A., Gokuladas, V., Teevan, J., Dumais, S. T., & Adar, E. (2014). CiteSight: Supporting contextual citation recommendation using differential search. In Proceedings of the 37th annual international ACM SIGIR conference on research and development in information retrieval, pp. 807–816.

  • Lu, Y., He, J., Shan, D., & Yan, H. (2011). Recommending citations with translation model. In Proceedings of the 20th ACM international conference on information and knowledge management, pp. 2017–2020.

  • Ma, S., Zhang, C., & Liu, X. (2020). A review of citation recommendation: From textual content to enriched context. In Scientometrics, pp. 1–28.

  • MacAvaney, S., Yates, A., Cohan, A., & Goharian, N. (2019). CEDR: Contextualized embeddings for document ranking. arXiv:1904.07094.

  • McNee, S. M., Albert, I., Cosley, D., Gopalkrishnan, P., Lam, S. K., Rashid, A. M., Konstan, J. A., & Riedl, J. (2002). On the recommending of citations for research papers. In Proceedings of the 2002 ACM conference on computer supported cooperative work, pp. 116–125.

  • Mohan, S., Fiorini, N., Kim, S., & Lu, Z. (2017). Deep learning for biomedical information retrieval: Learning textual relevance from click logs. BioNLP, 2017, 222–231.

  • Nabeel Asim, M., Wasim, M., Usman Ghani Khan, M., & Mahmood, W. (2018). Improved biomedical term selection in pseudo relevance feedback. Database.

  • Nogueira, R., & Cho, K. (2016). End-to-end goal-driven web navigation. In Advances in neural information processing systems, pp. 1903–1911.

  • Nogueira, R., & Cho, K. (2019). Passage re-ranking with BERT. arXiv:1901.04085.

  • Nogueira, R., Jiang, Z., Cho, K., & Lin, J. (2020). Evaluating pretrained transformer models for citation recommendation. In BIR@ECIR.

  • Peng, Y., Yan, S., & Lu, Z. (2019). Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv:1906.05474.

  • Ren, X., Liu, J., Yu, X., Khandelwal, U., Gu, Q., Wang, L., & Han, J. (2014). ClusCite: Effective citation recommendation by information network-based clustering. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 821–830.

  • Spangler, S., Wilkins, A. D., Bachman, B. J., Nagarajan, M., Dayaram, T., Haas, P. J., Regenbogen, S., Pickering, C. R., Comer, A., Myers, J. N., Stanoi, I. R., Kato, L., Lelescu, A., Labrie, J. J., Parikh, N., Lisewski, A. M., Donehower, L., Chen, Y., & Lichtarge, O. (2014). Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1877–1886.

  • Sugiyama, K., & Kan, M. Y. (2013). Exploiting potential citation papers in scholarly paper recommendation. In Proceedings of the 13th ACM/IEEE-CS joint conference on digital libraries, pp. 153–162.

  • Sugiyama, K., & Kan, M. Y. (2015). A comprehensive evaluation of scholarly paper recommendation using potential citation papers. International Journal on Digital Libraries, 16(2), 91–109.

  • Sybrandt, J., Shtutman, M., & Safro, I. (2017). Moliere: Automatic biomedical hypothesis generation system. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1633–1642.

  • Yang, P., Fang, H., & Lin, J. (2017). Anserini: Enabling the use of Lucene for information retrieval research. In Proceedings of the 40th annual international ACM SIGIR conference on research and development in information retrieval, pp. 1253–1256.

  • Yang, P., Fang, H., & Lin, J. (2018). Anserini: Reproducible ranking baselines using Lucene. Journal of Data and Information Quality, 10(4), Article 16.

  • Yang, W., Lu, K., Yang, P., & Lin, J. (2019). Critically examining the “neural hype”: Weak baselines and the additivity of effectiveness gains from neural ranking models. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pp. 1129–1132.

Acknowledgements

This research was supported in part by the Canada First Research Excellence Fund, the Natural Sciences and Engineering Research Council (NSERC) of Canada, NVIDIA, and eBay. Additionally, we would like to thank Google for computational resources in the form of Google Cloud credits.

Author information

Corresponding author

Correspondence to Rodrigo Nogueira.

About this article

Cite this article

Nogueira, R., Jiang, Z., Cho, K. et al. Navigation-based candidate expansion and pretrained language models for citation recommendation. Scientometrics 125, 3001–3016 (2020). https://doi.org/10.1007/s11192-020-03718-9

  • DOI: https://doi.org/10.1007/s11192-020-03718-9

Keywords

Navigation