Abstract
Objective: To automatically detect duplicate citations in a bibliographical database.
Background: Citations retrieved from multiple search databases take different forms, which makes both manual and automatic detection of duplicates difficult. Existing methods rely on fuzzy-similarity measures, which are error-prone.
Methods: We analysed four pairs of original search results from MEDLINE and EMBASE that were used to create systematic reviews. An automatic tool deduplicated citations by first enriching them with Digital Object Identifiers (DOI) and/or other unique identifiers. Duplicate records were then identified by comparing these unique identifiers. We compared our method with the duplicate detection function of a popular citation management desktop application in several configurations.
Results: Citation enrichment identified 93 % (range 86 %–100 %) of the duplicates indexed online and erroneously marked 3 % (range 0 %–6 %) of documents as duplicates. The citation management application found 68 % (range 64 %–72 %) of duplicates without error using its default settings. When set for highest deduplication, the citation management application found 94 % of duplicates (range 77 %–100 %) with 4 % error (range 0 %–8 %).
Conclusion: Citation enrichment using unique identifiers enhances automatic deduplication. On its own, the approach seems slightly superior to tools that compare citations without enrichment. Methods that combine citation enrichment with existing fuzzy matching may substantially reduce the resource requirements of evidence synthesis.
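The identifier-comparison step described in the Methods can be illustrated with a minimal sketch. It assumes that citations have already been enriched and are held as dictionaries with a `doi` field; the function names, the record layout, and the sample DOIs are hypothetical and are not taken from the tool evaluated in the paper. Records without an identifier are kept rather than guessed at.

```python
from typing import Optional


def normalise_doi(doi: Optional[str]) -> Optional[str]:
    """Lower-case a DOI and strip common prefixes; return None if absent."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
    return doi or None


def deduplicate(citations):
    """Keep the first record seen for each DOI.

    Returns (unique, duplicates). Records with no DOI cannot be
    compared by identifier, so they are always kept as unique.
    """
    seen = set()
    unique, duplicates = [], []
    for citation in citations:
        key = normalise_doi(citation.get("doi"))
        if key is not None and key in seen:
            duplicates.append(citation)
            continue
        if key is not None:
            seen.add(key)
        unique.append(citation)
    return unique, duplicates


# Hypothetical records: the same article exported by two databases
# in different forms, plus one record with no identifier.
records = [
    {"source": "MEDLINE", "title": "An example trial", "doi": "10.1000/example.1"},
    {"source": "EMBASE", "title": "Example trial, An.", "doi": "doi: 10.1000/EXAMPLE.1"},
    {"source": "EMBASE", "title": "No identifier found", "doi": None},
]
unique, duplicates = deduplicate(records)
```

In this sketch the two differently formatted records collapse to one because their normalised identifiers match, while the record without a DOI is left for manual or fuzzy comparison.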
Abbreviations
- DOI: Digital Object Identifiers
- FN: False Negatives
- FP: False Positives
- HTML: HyperText Markup Language
- MAS: Microsoft Academic Search
- MASID: MAS unique identifiers
- P: Precision
- PDF: Portable Document Format
- R: Recall
- SR: Systematic Reviews
- TP: True Positives
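The abbreviations TP, FP, FN, P and R are related by the standard definitions P = TP / (TP + FP) and R = TP / (TP + FN). The sketch below computes both from sets of record pairs; the function name and the sample pairs are hypothetical, and it is an assumption that the percentages reported in the abstract were derived in exactly this way.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) from two collections of duplicate pairs.

    predicted: pairs flagged as duplicates by a tool
    actual:    pairs that are duplicates according to a reference standard
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # true positives: flagged and correct
    fp = len(predicted - actual)   # false positives: flagged but wrong
    fn = len(actual - predicted)   # false negatives: missed duplicates
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical example: three flagged pairs, two of them correct,
# and one true duplicate pair that was missed.
flagged = {("a", "b"), ("c", "d"), ("e", "f")}
truth = {("a", "b"), ("c", "d"), ("g", "h")}
p, r = precision_recall(flagged, truth)
```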
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Choong, M.K., Thorning, S., Tsafnat, G. (2015). Citation Enrichment Improves Deduplication of Primary Evidence. In: Li, XL., Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D. (eds) Trends and Applications in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, vol 9441. Springer, Cham. https://doi.org/10.1007/978-3-319-25660-3_20
DOI: https://doi.org/10.1007/978-3-319-25660-3_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25659-7
Online ISBN: 978-3-319-25660-3
eBook Packages: Computer Science, Computer Science (R0)