Volume 99, Issue 4, pp 313–349

A systematic review and comparative analysis of cross-document coreference resolution methods and tools

  • Seyed-Mehdi-Reza Beheshti
  • Boualem Benatallah
  • Srikumar Venugopal
  • Seung Hwan Ryu
  • Hamid Reza Motahari-Nezhad
  • Wei Wang


Information extraction (IE) is the task of automatically extracting structured information from unstructured or semi-structured machine-readable documents. Among the various IE tasks, extracting actionable intelligence from an ever-increasing amount of data depends critically on cross-document coreference resolution (CDCR): the task of identifying entity mentions across information sources that refer to the same underlying entity. CDCR is the basis of knowledge acquisition and is at the heart of Web search, recommendations, and analytics. Real-time CDCR processing is very important and has various applications in discovering must-know information in real time for clients in finance, the public sector, news, and crisis management. Because CDCR is an emerging area of research and practice, the reported literature on its challenges and solutions is growing fast but is scattered, owing to the large problem space, the variety of applications, and datasets on the order of terabytes to petabytes. To fill this gap, we provide a systematic review of the state of the art of challenges and solutions for a CDCR process. We identify a set of quality attributes that have been frequently reported in the context of CDCR processes, to be used as a guide for identifying important and outstanding issues for further investigation. Finally, we assess existing tools and techniques for CDCR subtasks and provide guidance on the selection of tools and algorithms.


  • Information extraction
  • Cross-document coreference resolution
  • Large datasets

Mathematics Subject Classification

  • 68 Computer Science
  • 68-02 Research exposition (monographs, survey articles)
  • 68U15 Text processing; mathematical typography



We acknowledge the Data to Decisions CRC (D2D CRC), the Cooperative Research Centres Programme, and the Defence Systems Innovation Centre (DSIC) for funding this research.



Copyright information

© Springer-Verlag Wien 2016

Authors and Affiliations

  • Seyed-Mehdi-Reza Beheshti (1)
  • Boualem Benatallah (1)
  • Srikumar Venugopal (1)
  • Seung Hwan Ryu (1)
  • Hamid Reza Motahari-Nezhad (1, 2)
  • Wei Wang (1)

  1. School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
  2. IBM Almaden Research Center, San Jose, USA
