
CAD: an algorithm for citation-anchors detection in research papers


Citation counts inform many important decisions, such as ranking researchers, institutions, and countries, and they are used to measure the relationship between research papers. All of these uses require accurate counting of citations and of their occurrences within the citing papers (in-text citation counts). A citation anchor is the marker of a citation within the full text of the citing paper, for example ‘[1]’, ‘(Afzal et al, 2015)’, or ‘[Afzal, 2015]’. Identifying citation anchors in plain text is challenging because citation styles and formats vary widely. Recently, Shahid et al. highlighted several problems in automatically identifying in-text citation frequencies, such as commonality in content, wrong allotment, mathematical ambiguities, and string variations. This paper proposes a rule-based algorithm, CAD, for identifying citation anchors and their in-text citation frequencies. For a comprehensive analysis, a dataset of research papers was prepared from two digital libraries: (1) the Journal of Universal Computer Science (J.UCS) and (2) CiteSeer. The J.UCS dataset consists of 1200 research papers with 16,000 citation strings (references), while the CiteSeer dataset consists of 52 research papers with 1850 references, for a total of 1252 citing documents and 17,850 references. We conducted two experiments. In the first experiment, the proposed approach is compared with a state-of-the-art technique (Shahid et al. in Int J Arab Inf Technol 12:481–488, 2014) on both datasets. CAD improved the F-score over that technique by 44% on the J.UCS dataset and by 37% on the CiteSeer dataset, an average improvement of about 41%. In the second experiment, the proposed approach is compared against the existing state-of-the-art tools CERMINE and GROBID. According to our results, the proposed approach performs best with an F1 of 0.99, followed by GROBID (F1 0.89) and CERMINE (F1 0.82).
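To make the task concrete, the following is a minimal Python sketch of rule-based anchor detection and in-text frequency counting. It is not the CAD algorithm: the two regular expressions and the helper names `expand_numeric` and `count_anchors` are assumptions chosen for illustration, and they cover only the three anchor shapes quoted in the abstract.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; these are NOT the CAD rules.
# Numeric anchors such as [1], [1, 2] or [3-5].
NUMERIC = re.compile(r"\[(\d+(?:\s*[,\-–]\s*\d+)*)\]")
# Author-year anchors such as (Afzal et al, 2015) or [Afzal, 2015].
AUTHOR_YEAR = re.compile(
    r"[\(\[]"
    r"([A-Z][A-Za-z'\-]+(?:\s+et\s+al\.?|\s+(?:and|&)\s+[A-Z][A-Za-z'\-]+)?)"
    r",?\s+((?:19|20)\d{2}[a-z]?)"
    r"[\)\]]"
)


def expand_numeric(body):
    """Expand the inside of a numeric anchor, e.g. '1, 3-5' -> [1, 3, 4, 5]."""
    numbers = []
    for part in re.split(r"\s*,\s*", body):
        bounds = re.split(r"\s*[-–]\s*", part)
        if len(bounds) == 2:
            low, high = int(bounds[0]), int(bounds[1])
            numbers.extend(range(low, high + 1))
        else:
            numbers.append(int(bounds[0]))
    return numbers


def count_anchors(text):
    """Return a Counter mapping each detected anchor to its in-text frequency."""
    counts = Counter()
    for match in NUMERIC.finditer(text):
        for number in expand_numeric(match.group(1)):
            counts[f"[{number}]"] += 1
    for match in AUTHOR_YEAR.finditer(text):
        counts[f"{match.group(1)}, {match.group(2)}"] += 1
    return counts


sample = (
    "Citations matter [1]. As shown in [1, 3-5] and (Afzal et al, 2015), "
    "see also [Afzal, 2015]."
)
for anchor, frequency in sorted(count_anchors(sample).items()):
    print(anchor, frequency)
```

A sketch this simple runs straight into the problems the abstract attributes to Shahid et al.: a bracketed number in a formula or interval would be counted as a citation (mathematical ambiguity), and spelling variants of the same author-year anchor would be counted separately (string variations).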



  6. American Psychological Association.

  7. Modern Language Association.

  8. American Medical Association.

  9. Council of Biology Editors.


  1. Afzal, M., Maurer, H., Balke, W., & Kulathuramaiyer, N. (2010). Rule based autonomous citation mining with TIERL. Journal of Digital Information Management, 8(3), 96–204.

  2. Ahmad, R., Afzal, M. T., & Qadir, M. A. (2016). Information extraction from PDF sources based on rule-based system using integrated formats. In Semantic web evaluation challenge (pp. 293–308). Cham: Springer.

  3. Beel, J., & Gipp, B. (2009). Google Scholar’s ranking algorithm: The impact of citation counts (An empirical study). In Proceedings of the 3rd international conference on research challenges in information science.

  4. Boyack, K., Small, H., & Klavans, R. (2013). Improving the accuracy of cocitation clustering using full text. Journal of the American Society for Information Science and Technology, 64(9), 1759–1767.

  5. Butt, B., Rafi, M., Jamal, A., Rehman, R., Alam, S., & Alam, M. (2015). Classification of research citations (CRC). arXiv preprint arXiv:1506.08966.

  6. Ciancarini, P., Iorio, A., Nuzzolese, A., Peroni, S., & Vitali, F. (2013). Semantic annotation of scholarly documents and citations. In Advances in Artificial Intelligence (pp. 336–347). Cham: Springer.

  7. Constantin, A., Pettifer, S., & Voronkov, A. (2013). PDFX: Fully-automated PDF-to-XML conversion of scientific literature. In Proceedings of the ACM symposium on document engineering (pp. 177–180).

  8. Councill, I. G., Giles, C. L., & Kan, M. Y. (2008). ParsCit: An open-source CRF reference string parsing package. In LREC (Vol. 8, pp. 661–667).

  9. Ding, Y., Liu, X., Guo, C., & Cronin, B. (2013). The distribution of references across texts: Some implications for citation analysis. Journal of Informetrics, 7(3), 583–592.

  10. Garfield, E. (1972). Citation analysis as a tool in journal evaluation. American Association for the Advancement of Science, 178, 471–479.

  11. Giles, C., Bollacker, K., & Lawrence, S. (1998). CiteSeer: An automatic citation indexing system. In Proceedings of third ACM conference on digital libraries (pp. 89–98).

  12. Goodall, A. (2006). Should top universities be led by top researchers and are they? A citations analysis. Journal of Documentation, 62(3), 388–411.

  13. Gipp, B., & Beel, J. (2009). Citation proximity analysis (CPA): A new approach for identifying related work based on co-citation analysis. In Proceedings of the 12th international conference on scientometrics and informetrics (Vol. 2, pp. 571–575).

  14. Gipp, B., Beel, J., & Hentschel, C. (2009). Scienstein: A research paper recommender system. In Proceedings of the international conference on emerging trends in computing (iCETiC’09).

  15. Hirsch, J. (2005). An index to quantify an individual’s scientific research output. PNAS, 102(46), 16569–16572.

  16. Hou, W., Li, M., & Niu, D. (2011). Counting citations in texts rather than reference lists to improve the accuracy of assessing scientific contribution. BioEssays, 33(10), 724–727.

  17. Hu, Z., Chen, C., & Liu, Z. (2013). Where are citations located in the body of scientific articles? A study of the distributions of citation locations. Journal of Informetrics, 7(4), 887–896.

  18. Hu, Z., Chen, C., & Liu, Z. (2015). The recurrence of citations within a scientific article. Istanbul: International Society for Scientometrics and Informetrics, Boğaziçi University Printhouse.

  19. Iorio, A., Nuzzolese, A., & Peroni, S. (2013). Characterising citations in scholarly documents: The CiTalO framework. In The semantic web: ESWC satellite events (pp. 66–77).

  20. Lopez, P. (2009). GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In International conference on theory and practice of digital libraries (pp. 473–474). Berlin: Springer.

  21. Liu, S., & Chen, C. (2011). The effects of co-citation proximity on co-citation analysis. In 13th conference of the international society for scientometrics and informetrics (pp. 474–484).

  22. Powers, D. (2011). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2(1), 37–63.

  23. Ritchie, A. (2009). MS Thesis: “Citation context analysis for information retrieval”. University of Cambridge.

  24. Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24, 265–269.

  25. Shahid, A., Afzal, M., & Qadir, M. (2011). Discovering semantic relatedness between scientific articles through citation frequency. Australian Journal of Basic and Applied Sciences, 5(6), 1599–1604.

  26. Shahid, A., Afzal, M., & Qadir, M. (2014). Lessons learned: The complexity of accurate identification of in-text citations. International Arab Journal of Information Technology, 12, 481–488.

  27. Teufel, S., & Kan, M. (2011). Robust argumentative zoning for sensemaking in scholarly documents. In Advanced language technologies for digital libraries (pp. 154–170). Berlin, Heidelberg: Springer.

  28. Tkaczyk, D., & Bolikowski, L. (2015). Extracting contextual information from scientific literature using CERMINE system. In Semantic web evaluation challenge (pp. 93–104). Cham: Springer.

  29. Tkaczyk, D., Collins, A., Sheridan, P., & Beel, J. (2018). Evaluation and comparison of open source bibliographic reference parsers: A business use case. arXiv:1802.01168.

  30. Zhao, D., Cappello, A., & Johnston, L. (2017). Functions of uni- and multi-citations: Implications for weighted citation analysis. Journal of Data and Information Science, 2(1), 51–69.


Author information

Correspondence to Riaz Ahmad.

About this article

Cite this article

Ahmad, R., Afzal, M.T. CAD: an algorithm for citation-anchors detection in research papers. Scientometrics 117, 1405–1423 (2018).


  • In-text citation analysis
  • Citation string
  • Citation-anchor
  • Citation-tag
  • Citation frequency
  • Research papers