Abstract
Proper citation is essential in academic writing because it enables knowledge accumulation and maintains academic integrity. However, citing properly is not easy. For published scientific entities, the ever-growing volume of academic publications and over-familiarity with terms easily lead to missing citations. To address this problem, we design a dedicated method, Citation Recommendation for Published Scientific Entity (CRPSE), based on the co-occurrences between published scientific entities and in-text citations within the same sentences written by previous researchers. Experimental results show that our method is effective in recommending the source papers of published scientific entities. We further conduct a statistical analysis of missing citations among papers published in prestigious computer science conferences in 2020. In the 12,278 papers collected, 475 published scientific entities from computer science and mathematics are found to have missing citations. Many entities mentioned without citations are well-accepted research results. The papers that proposed these entities were published a median of 8 years earlier, which can be regarded as the time it takes for a published scientific entity to develop into a well-accepted concept. We call for accurate and complete citation of the source papers of published scientific entities, as academic standards require.
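The co-occurrence idea behind the method can be illustrated with a minimal sketch. This is not the CRPSE implementation, whose details the abstract does not give. It only assumes that each sentence has already been reduced to the entities it mentions and the papers it cites, and it recommends the paper most often cited alongside an entity. The function names, the input layout, and the paper identifiers are all hypothetical.

```python
from collections import Counter, defaultdict

def build_cooccurrence(sentences):
    """Count, per entity, how often each paper is cited in the same sentence.

    `sentences` is an iterable of (entities, cited_paper_ids) pairs,
    one pair per sentence (an assumed, simplified input format).
    """
    counts = defaultdict(Counter)
    for entities, citations in sentences:
        for entity in set(entities):
            # Each (entity, paper) pair counts at most once per sentence.
            counts[entity].update(set(citations))
    return counts

def recommend(counts, entity, top_k=1):
    """Return up to top_k papers most frequently co-cited with the entity."""
    ranked = counts.get(entity, Counter()).most_common(top_k)
    return [paper for paper, _ in ranked]

# Toy data with made-up paper identifiers, for illustration only.
sentences = [
    (["BERT"], ["devlin2019"]),
    (["BERT", "LSTM"], ["devlin2019", "hochreiter1997"]),
    (["LSTM"], ["hochreiter1997"]),
    (["BERT"], []),  # a mention with no citation contributes no counts
]
counts = build_cooccurrence(sentences)
print(recommend(counts, "BERT"))
print(recommend(counts, "LSTM"))
```

Raw counts favor heavily cited papers, so a practical system would likely need some normalization or filtering; the sketch leaves that out.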
Notes
Data was obtained in November of 2021.
The examples in the figure are created for better illustration, not real examples from S2ORC.
Papers with parsing errors are excluded.
Acknowledgements
We would like to acknowledge the support of Yingmin Wang in improving the mathematical expressions. We are grateful to Li Lei, Xun Zhou, Lei Lin and Meizhen Zheng for their help with the data processing. We also thank the two anonymous reviewers for their valuable comments. Special and heartfelt gratitude goes to the first author’s wife, Fenmei Zhou, for her understanding and love. Her unwavering support and continuous encouragement made this research possible.
Funding
This work is partly funded by the 13th Five-Year Plan project "Artificial Intelligence and Language" of the State Language Commission of China (Grant No. WT135-38).
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
About this article
Cite this article
Lin, J., Yu, Y., Song, J. et al. Detecting and analyzing missing citations to published scientific entities. Scientometrics 127, 2395–2412 (2022). https://doi.org/10.1007/s11192-022-04334-5
DOI: https://doi.org/10.1007/s11192-022-04334-5