CAD: an algorithm for citation-anchors detection in research papers

Ahmad, Riaz; Afzal, Muhammad Tanvir

doi:10.1007/s11192-018-2920-6

Published: 29 September 2018

Volume 117, pages 1405–1423 (2018)
Scientometrics

Abstract

Citations are important parameters used to make many decisions, such as ranking researchers, institutions, and countries, and to measure the relationship between research papers. All of these require accurate counting of citations and of their occurrences (in-text citation counts) within the citing papers. Citation anchors are the citations made within the full text of the citing paper, for example ‘[1]’, ‘(Afzal et al, 2015)’, or ‘[Afzal, 2015]’. Identifying citation anchors in plain text is a challenging task because of the various styles and formats of citations. Recently, Shahid et al. highlighted some of the problems in automatically identifying in-text citation frequencies, such as commonality in content, wrong allotment, mathematical ambiguities, and string variations. This paper proposes an algorithm, CAD, for rule-based identification of citation anchors and their in-text citation frequencies. For a comprehensive analysis, a dataset of research papers was prepared from two digital libraries: (1) Journal of Universal Computer Science (J.UCS) and (2) CiteSeer. The J.UCS dataset consists of 1200 research papers with 16,000 citation strings (references), while the CiteSeer dataset consists of 52 research papers with 1850 references, for a total of 1252 citing documents and 17,850 references. We conducted two experiments. In the first, the proposed approach was compared with a state-of-the-art technique on both datasets. CAD improved the F-score by 44% on J.UCS and 37% on CiteSeer over the contemporary technique (Shahid et al. in Int J Arab Inf Technol 12:481–488, 2014), an average improvement of 41% across the two datasets. In the second experiment, the proposed approach was compared with the existing state-of-the-art tools CERMINE and GROBID. According to our results, the proposed approach performs best with an F1 of 0.99, followed by GROBID (F1 0.89) and CERMINE (F1 0.82).
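To make the notion of a citation anchor and its in-text frequency concrete, the following is a minimal Python sketch. It is not the CAD algorithm, whose rules are not given in this abstract. The two regular expressions, the function name `count_anchors`, and the sample sentence are all assumptions chosen only to cover the three anchor styles quoted above.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; these are NOT the CAD rules.
# Numeric anchors such as [1], [2, 3] or [4-6].
NUMERIC = re.compile(r"\[(\d+(?:\s*[,–-]\s*\d+)*)\]")
# Author-year anchors such as (Afzal et al, 2015) or [Afzal, 2015].
AUTHOR_YEAR = re.compile(
    r"[\(\[]([A-Z][A-Za-z'-]+(?: et al\.?| (?:and|&) [A-Z][A-Za-z'-]+)?)"
    r",? (\d{4}[a-z]?)[\)\]]"
)


def count_anchors(text):
    """Return a Counter mapping each detected anchor to its in-text frequency."""
    counts = Counter()
    for match in NUMERIC.finditer(text):
        for part in re.split(r"\s*,\s*", match.group(1)):
            bounds = re.split(r"\s*[–-]\s*", part)
            if len(bounds) == 2:
                # Expand a range such as 4-6 into 4, 5, 6.
                for number in range(int(bounds[0]), int(bounds[1]) + 1):
                    counts[f"[{number}]"] += 1
            else:
                counts[f"[{int(part)}]"] += 1
    for match in AUTHOR_YEAR.finditer(text):
        # Key on "<authors> <year>", e.g. "Afzal et al 2015".
        counts[f"{match.group(1)} {match.group(2)}"] += 1
    return counts


# Assumed sample sentence, not taken from the paper's datasets.
SAMPLE = (
    "Prior work [1] and [2, 3] is extended in [1]. "
    "See (Afzal et al, 2015) and [Afzal, 2015]; also [4-6]."
)

if __name__ == "__main__":
    for anchor, frequency in sorted(count_anchors(SAMPLE).items()):
        print(anchor, frequency)
```

A pattern-only approach like this one would also count a mathematical interval such as `[0, 1]` as two citations, which is an instance of the mathematical ambiguities the abstract mentions and one reason simple matching is not enough on its own.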

Notes

http://cermine.ceon.pl/index.html.
http://cloud.science-miner.com/grobid/.
https://www.editpadpro.com/.
http://pdfbox.apache.org/.
http://pdfx.cs.man.ac.uk/.
American Psychological Association.
Modern Language Association.
American Medical Association.
Council of Biology Editors.

References

Afzal, M., Maurer, H., Balke, W., & Kulathuramaiyer, N. (2010). Rule based autonomous citation mining with TIERL. Journal of Digital Information Management, 8(3), 96–204.
Ahmad, R., Afzal, M. T., & Qadir, M. A. (2016). Information extraction from PDF sources based on rule-based system using integrated formats. In Semantic web evaluation challenge (pp. 293–308). Cham: Springer.
Beel, J., & Gipp, B. (2009). Google Scholar’s ranking algorithm: The impact of citation counts (An empirical study). In Proceedings of the 3rd international conference on research challenges in information science.
Boyack, K., Small, H., & Klavans, R. (2013). Improving the accuracy of cocitation clustering using full text. Journal of the American Society for Information Science and Technology, 64(9), 1759–1767.
Butt, B., Rafi, M., Jamal, A., Rehman, R., Alam, S., & Alam, M. (2015). Classification of research citations (CRC). arXiv preprint arXiv:1506.08966.
Ciancarini, P., Iorio, A., Nuzzolese, A., Peroni, S., & Vitali, F. (2013). Semantic annotation of scholarly documents and citations. In Advances in Artificial Intelligence (pp. 336–347). Cham: Springer.
Constantin, A., Pettifer, S., & Voronkov, A. (2013). PDFX: Fully-automated PDF-to-XML conversion of scientific literature. In Proceedings of the ACM symposium on document engineering (pp. 177–180).
Councill, I. G., Giles, C. L., & Kan, M. Y. (2008). ParsCit: An open-source CRF reference string parsing package. In LREC (Vol. 8, pp. 661–667).
Ding, Y., Liu, X., Guo, C., & Cronin, B. (2013). The distribution of references across texts: Some implications for citation analysis. Journal of Informetrics, 7(3), 583–592.
Garfield, E. (1972). Citation analysis as a tool in journal evaluation. Science, 178, 471–479.
Giles, C., Bollacker, K., & Lawrence, S. (1998). CiteSeer: An automatic citation indexing system. In Proceedings of third ACM conference on digital libraries (pp. 89–98).
Goodall, A. (2006). Should top universities be led by top researchers and are they? A citations analysis. Journal of Documentation, 62(3), 388–411.
Gipp, B., & Beel, J. (2009). Citation proximity analysis (CPA): A new approach for identifying related work based on co-citation analysis. In Proceedings of the 12th international conference on scientometrics and informetrics (Vol. 2, pp. 571–575).
Gipp, B., Beel, J., & Hentschel, C. (2009). Scienstein: A research paper recommender system. In Proceedings of the international conference on emerging trends in computing (iCETiC’09).
Hirsch, J. (2005). An index to quantify an individual’s scientific research output. PNAS, 102(46), 16569–16572.
Hou, W., Li, M., & Niu, D. (2011). Counting citations in texts rather than reference lists to improve the accuracy of assessing scientific contribution. BioEssays, 33(10), 724–727.
Hu, Z., Chen, C., & Liu, Z. (2013). Where are citations located in the body of scientific articles? A study of the distributions of citation locations. Journal of Informetrics, 7(4), 887–896.
Hu, Z., Chen, C., & Liu, Z. (2015). The recurrence of citations within a scientific article. Istanbul: International Society for Scientometrics and Informetrics, Boğaziçi University Printhouse.
Iorio, A., Nuzzolese, A., & Peroni, S. (2013). Characterising citations in scholarly documents: The CiTalO framework. In The semantic web: ESWC satellite events (pp. 66–77).
Lopez, P. (2009). GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In International conference on theory and practice of digital libraries (pp. 473–474). Berlin: Springer.
Liu, S., & Chen, C. (2011). The effects of co-citation proximity on co-citation analysis. In 13th conference of the international society for scientometrics and informetrics (pp. 474–484).
Powers, D. (2011). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2(1), 37–63.
Ritchie, A. (2009). MS Thesis: “Citation context analysis for information retrieval”. University of Cambridge.
Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24, 265–269.
Shahid, A., Afzal, M., & Qadir, M. (2011). Discovering semantic relatedness between scientific articles through citation frequency. Australian Journal of Basic and Applied Sciences, 5(6), 1599–1604.
Shahid, A., Afzal, M., & Qadir, M. (2014). Lessons learned: The complexity of accurate identification of in-text citations. International Arab Journal of Information Technology, 12, 481–488.
Teufel, S., & Kan, M. (2011). Robust argumentative zoning for sensemaking in scholarly documents. In Advanced language technologies for digital libraries (pp. 154–170). Berlin, Heidelberg: Springer.
Tkaczyk, D., & Bolikowski, L. (2015). Extracting contextual information from scientific literature using CERMINE system. In Semantic web evaluation challenge (pp. 93–104). Cham: Springer.
Tkaczyk, D., Collins, A., Sheridan, P., & Beel, J. (2018). Evaluation and comparison of open source bibliographic reference parsers: A business use case. arXiv:1802.01168
Zhao, D., Cappello, A., & Johnston, L. (2017). Functions of uni-and multi-citations: implications for weighted citation analysis. Journal of Data and Information Science, 2(1), 51–69.

Author information

Authors and Affiliations

Capital University of Science & Technology, Islamabad, Pakistan
Riaz Ahmad & Muhammad Tanvir Afzal

Corresponding author

Correspondence to Riaz Ahmad.

About this article

Cite this article

Ahmad, R., Afzal, M.T. CAD: an algorithm for citation-anchors detection in research papers. Scientometrics 117, 1405–1423 (2018). https://doi.org/10.1007/s11192-018-2920-6

Received: 13 November 2017
Issue Date: December 2018
