Computing, Volume 99, Issue 4, pp 313–349

A systematic review and comparative analysis of cross-document coreference resolution methods and tools

  • Seyed-Mehdi-Reza Beheshti
  • Boualem Benatallah
  • Srikumar Venugopal
  • Seung Hwan Ryu
  • Hamid Reza Motahari-Nezhad
  • Wei Wang
Article

DOI: 10.1007/s00607-016-0490-0

Cite this article as:
Beheshti, SMR., Benatallah, B., Venugopal, S. et al. Computing (2017) 99: 313. doi:10.1007/s00607-016-0490-0

Abstract

Information extraction (IE) is the task of automatically extracting structured information from unstructured or semi-structured machine-readable documents. Among the various IE tasks, extracting actionable intelligence from an ever-increasing amount of data depends critically on cross-document coreference resolution (CDCR): the task of identifying entity mentions across information sources that refer to the same underlying entity. CDCR is the basis of knowledge acquisition and is at the heart of Web search, recommendations, and analytics. Real-time CDCR processing is very important and has various applications in discovering must-know information in real time for clients in finance, the public sector, news, and crisis management. Because CDCR is an emerging area of research and practice, the reported literature on its challenges and solutions is growing fast but is scattered, owing to the large problem space, the variety of applications, and datasets on the order of terabytes to petabytes. To fill this gap, we provide a systematic review of the state of the art of challenges and solutions for a CDCR process. We identify a set of quality attributes that have been frequently reported in the context of CDCR processes, to be used as a guide to identify important and outstanding issues for further investigation. Finally, we assess existing tools and techniques for CDCR subtasks and provide guidance on the selection of tools and algorithms.

Keywords

Information extraction · Cross-document coreference resolution · Large datasets

Mathematics Subject Classification

68 Computer Science · 68-02 Research exposition (monographs, survey articles) · 68U15 Text processing; mathematical typography

Copyright information

© Springer-Verlag Wien 2016

Authors and Affiliations

  • Seyed-Mehdi-Reza Beheshti (1)
  • Boualem Benatallah (1)
  • Srikumar Venugopal (1)
  • Seung Hwan Ryu (1)
  • Hamid Reza Motahari-Nezhad (1, 2)
  • Wei Wang (1)
  1. School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
  2. IBM Almaden Research Center, San Jose, USA