Coreference resolution: an empirical study based on SemEval-2010 shared Task 1

Abstract

This paper presents an empirical evaluation of coreference resolution that covers several interrelated dimensions. The main goal is to complete the comparative analysis from the SemEval-2010 task on Coreference Resolution in Multiple Languages. To do so, the study restricts the number of languages and systems involved, but extends and deepens the analysis of the system outputs, including a more qualitative discussion. The paper compares three automatic coreference resolution systems for three languages (English, Catalan and Spanish) in four evaluation settings, and using four evaluation measures. Given that our main goal is not to provide a comparison between resolution algorithms, these are merely used as tools to shed light on the different conditions under which coreference resolution is evaluated. Although the dimensions are strongly interdependent, making it very difficult to extract general principles, the study reveals a series of interesting issues in relation to coreference resolution: the portability of systems across languages, the influence of the type and quality of input annotations, and the behavior of the scoring measures.

Notes

  1. Available at the SemEval-2010 Task 1 website: http://stel.ub.edu/semeval2010-coref.

  2. This material is available at http://nlp.lsi.upc.edu/coreference/LRE-2011/.

  3. The average number of entities per document is calculated as the total number of coreference chains across all documents divided by the number of documents.
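The calculation described in this note can be sketched as follows. The document representation (a document as a list of coreference chains, each chain a list of mention ids) is a hypothetical illustration, not the actual format of the corpora.

```python
# Sketch of the note's calculation. Assumed (hypothetical) data layout:
# a document is a list of coreference chains; a chain is a list of
# mention ids. This is for illustration, not the AnCora file format.
def avg_entities_per_document(docs):
    """Total number of coreference chains across all documents,
    divided by the number of documents."""
    if not docs:
        return 0.0
    return sum(len(chains) for chains in docs) / len(docs)

# Toy 2-document corpus: 2 chains + 1 chain = 3 chains over 2 documents.
corpus = [
    [["m1", "m2"], ["m3"]],   # document with 2 entities (chains)
    [["m4", "m5", "m6"]],     # document with 1 entity
]
print(avg_entities_per_document(corpus))  # 1.5
```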

  4. Singletons are excluded.

  5. Note that, in this study, elliptical pronouns do not need to be recognized in either the gold or the predicted setting, since they appear as special lexical tokens in the Catalan and Spanish corpora. They were inserted during the manual syntactic annotation of the AnCora corpora (Civit and Martí 2005).

  6. Cistell is the Catalan word for ‘basket.’

  7. http://opennlp.sourceforge.net.

  8. The evaluation of SemEval-2010 Task 1 (Recasens et al. 2010) also distinguished between closed and open settings. In the former, systems had to be built strictly with the information provided in the task data sets. In the latter, systems could be developed using any external tools and resources (e.g., WordNet, Wikipedia, etc.). In this study we do not make such a distinction because the three systems rely on the same sources of information: the training set, particular heuristics, and WordNet.

  9. Although our scores by class are similar to Stoyanov et al.’s (2009) MUC-RC score, a variant of MUC, we do not start from the assumption that all the coreferent mentions that do not belong to the class under analysis are resolved correctly. The results by mention class for all the scenarios and measures, as well as the detailed scoring software, are available at http://nlp.lsi.upc.edu/coreference/LRE-2011.

References

  1. Abad, A., Bentivogli, L., Dagan, I., Giampiccolo, D., Mirkin, S., Pianta, E., et al. (2010). A resource for investigating the impact of anaphora and coreference on inference. In Proceedings of the 7th conference on language resources and evaluation (LREC 2010) (pp. 128–135). Valletta, Malta.

  2. Azzam, S., Humphreys, K., & Gaizauskas, R. (1999). Using coreference chains for text summarization. In Proceedings of the ACL workshop on coreference and its applications (pp. 77–84). Baltimore, USA.

  3. Bagga, A., & Baldwin, B. (1998). Algorithms for scoring coreference chains. In Proceedings of the linguistic coreference workshop at LREC 98 (pp. 563–566). Granada, Spain.

  4. Bengtson, E., & Roth, D. (2008). Understanding the value of features for coreference resolution. In Proceedings of the conference on empirical methods in natural language processing (EMNLP 2008) (pp. 294–303). Honolulu, USA.

  5. Cai, J., & Strube, M. (2010). Evaluation metrics for end-to-end coreference resolution systems. In Proceedings of the annual SIGdial meeting on discourse and dialogue (SIGDIAL 2010) (pp. 28–36). Tokyo, Japan.

  6. Chambers, N., & Jurafsky, D. (2008). Unsupervised learning of narrative event chains. In Proceedings of the 46th annual meeting of the association for computational linguistics (ACL-HLT 2008) (pp. 789–797). Columbus, USA.

  7. Civit, M., & Martí, M. A. (2005). Building Cast3LB: A Spanish treebank. Research on Language and Computation, 2(4), 549–574.

  8. Daelemans, W., Buchholz, S., & Veenstra, J. (1999). Memory-based shallow parsing. In Proceedings of the conference on natural language learning (CoNLL 1999) (pp. 53–60). Bergen, Norway.

  9. Daumé, H., & Marcu, D. (2005). A large-scale exploration of effective global features for a joint entity detection and tracking model. In Proceedings of human language technology conference and conference on empirical methods in natural language processing (HLT-EMNLP 2005) (pp. 97–104). Vancouver, Canada.

  10. Denis, P., & Baldridge, J. (2009). Global joint models for coreference resolution and named entity classification. Procesamiento del Lenguaje Natural, 42, 87–96.

  11. Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S., & Weischedel, R. (2004). The automatic content extraction (ACE) program—tasks, data, and evaluation. In Proceedings of the 4th conference on language resources and evaluation (LREC 2004) (pp. 837–840). Lisbon, Portugal.

  12. Finkel, J., & Manning, C. (2008). Enforcing transitivity in coreference resolution. In Proceedings of the 46th annual meeting of the association for computational linguistics (ACL-HLT 2008) (pp. 45–48). Columbus, USA.

  13. Gerber, M., & Chai, J. Y. (2010). Beyond NomBank: A study of implicit arguments for nominal predicates. In Proceedings of the 48th annual meeting of the association for computational linguistics (ACL 2010) (pp. 1583–1592). Uppsala, Sweden.

  14. Heim, I. (1983). File change semantics and the familiarity theory of definiteness. In R. Bäuerle, C. Schwarze, & A. von Stechow (Eds.), Meaning, use, and interpretation of language (pp. 164–189). Berlin, Germany: Mouton de Gruyter.

  15. Hirschman, L., & Chinchor, N. (1997). MUC-7 coreference task definition—version 3.0. In Proceedings of the 7th message understanding conference (MUC-7), Fairfax, USA.

  16. Hummel, R. A., & Zucker, S. W. (1987). On the foundations of relaxation labeling processes. In M. A. Fischler, & O. Firschein (Eds.), Readings in computer vision: Issues, problems, principles, and paradigms (pp. 585–605). San Francisco, USA: Morgan Kaufmann Publishers Inc.

  17. Lundquist, L. (2007). Lexical anaphors in Danish and French. In M. Schwarz-Friesel, M. Consten, & M. Knees (Eds.), Anaphors in text: Cognitive, formal and applied approaches to anaphoric reference (pp. 25–32). Amsterdam, Netherlands: John Benjamins.

  18. Luo, X. (2005). On coreference resolution performance metrics. In Proceedings of the joint conference on human language technology and empirical methods in natural language processing (HLT-EMNLP 2005) (pp. 37–48). Vancouver, Canada.

  19. Luo, X., Ittycheriah, A., Jing, H., Kambhatla, N., & Roukos, S. (2004). A mention-synchronous coreference resolution algorithm based on the Bell tree. In Proceedings of the 42nd annual meeting of the association for computational linguistics (ACL 2004) (pp. 21–26). Barcelona, Spain.

  20. McCarthy, J. F., & Lehnert, W. G. (1995). Using decision trees for coreference resolution. In Proceedings of the 1995 international joint conference on AI (IJCAI 1995) (pp. 1050–1055). Montreal, Canada.

  21. Mirkin, S., Berant, J., Dagan, I., & Shnarch, E. (2010). Recognising entailment within discourse. In Proceedings of the 23rd international conference on computational linguistics (COLING 2010) (pp. 770–778). Beijing, China.

  22. Morton, T. S. (1999). Using coreference in question answering. In Proceedings of the 8th Text REtrieval Conference (TREC-8) (pp. 85–89).

  23. Ng, V. (2010). Supervised noun phrase coreference research: the first fifteen years. In Proceedings of the 48th annual meeting of the association for computational linguistics (ACL 2010) (pp. 1396–1411). Uppsala, Sweden.

  24. Ng, V., & Cardie, C. (2002). Improving machine learning approaches to coreference resolution. In Proceedings of the 40th annual meeting of the association for computational linguistics (ACL 2002) (pp. 104–111). Philadelphia, USA.

  25. Nicolov, N., Salvetti, F., & Ivanova, S. (2008). Sentiment analysis: Does coreference matter? In Proceedings of the symposium on affective language in human and machine (pp. 37–40). Aberdeen, UK.

  26. Orasan, C., Cristea, D., Mitkov, R., & Branco, A. (2008). Anaphora resolution exercise: An overview. In Proceedings of the 6th conference on language resources and evaluation (LREC 2008) (pp. 28–30). Marrakech, Morocco.

  27. Padró, L. (1998). A hybrid environment for syntax–semantic tagging. PhD thesis, Dep. Llenguatges i Sistemes Informàtics. Barcelona, Spain: Universitat Politècnica de Catalunya.

  28. Poon, H., Christensen, J., Domingos, P., Etzioni, O., Hoffmann, R., Kiddon, C., et al. (2010). Machine reading at the University of Washington. In Proceedings of the NAACL-HLT first international workshop on formalisms and methodology for learning by reading (pp. 87–95). Los Angeles, USA.

  29. Popescu, A., & Etzioni, O. (2005). Extracting product features and opinions from reviews. In Proceedings of the conference on human language technology and empirical methods in natural language processing (HLT-EMNLP 2005) (pp. 339–346). Vancouver, Canada.

  30. Popescu-Belis, A., Robba, I., & Sabah, G. (1998). Reference resolution beyond coreference: a conceptual frame and its application. In: Proceedings of the 36th annual meeting of the association for computational linguistics joint with the international conference on computational linguistics (COLING-ACL 1998) (pp. 1046–1052). Montreal, Canada.

  31. Pradhan, S., Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., & Weischedel, R. (2007). OntoNotes: A unified relational semantic representation. In Proceedings of the international conference on semantic computing (ICSC 2007) (pp. 517–526). Irvine, USA.

  32. Pradhan, S., Ramshaw, L., Marcus, M., Palmer, M., Weischedel, R., & Xue, N. (2011). CoNLL-2011 shared task: Modeling unrestricted coreference in OntoNotes. In Proceedings of the conference on natural language learning (CoNLL 2011) (pp. 1–27). Shared Task, Portland, USA.

  33. Quinlan, J. (1993). C4.5: Programs for machine learning. MA, USA: Morgan Kaufmann.

  34. Rahman, A., & Ng, V. (2009). Supervised models for coreference resolution. In Proceedings of the conference on empirical methods in natural language processing (EMNLP 2009) (pp. 968–977). Suntec, Singapore.

  35. Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850.

  36. Recasens, M. (2010). Coreference: Theory, annotation, resolution and evaluation. PhD thesis, University of Barcelona, Barcelona, Spain.

  37. Recasens, M., & Hovy, E. (2009). A deeper look into features for coreference resolution. In S. L. Devi, A. Branco, & R. Mitkov. (Eds.), Anaphora processing and applications (DAARC 2009) (Vol. 5847, pp. 29–42). Berlin, Germany, LNAI: Springer.

  38. Recasens, M., & Hovy, E. (2011). BLANC: Implementing the rand index for coreference evaluation. Natural Language Engineering, 17(4), 485–510.

  39. Recasens, M., & Martí, M. A. (2010). AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4), 315–345.

  40. Recasens, M., Màrquez, L., Sapena, E., Martí, M. A., Taulé, M., Hoste, V., et al. (2010). SemEval-2010 Task 1: Coreference resolution in multiple languages. In Proceedings of the 5th international workshop on semantic evaluation (SemEval 2010) (pp. 1–8). Uppsala, Sweden.

  41. Ruppenhofer, J., Sporleder, C., & Morante, R. (2010). SemEval-2010 Task 10: Linking events and their participants in discourse. In Proceedings of the 5th international workshop on semantic evaluation (SemEval 2010) (pp. 45–50). Uppsala, Sweden.

  42. Sapena, E., Padró, L., & Turmo, J. (2010a). A global relaxation labeling approach to coreference resolution. In Proceedings of 23rd international conference on computational linguistics (COLING 2010) (pp. 1086–1094). Beijing, China.

  43. Sapena, E., Padró, L., & Turmo, J. (2010b). Relaxcor: A global relaxation labeling approach to coreference resolution. In Proceedings of the ACL workshop on semantic evaluations (SemEval-2010) (pp. 88–91). Uppsala, Sweden.

  44. Soon, W. M., Ng, H. T., & Lim, D. C. Y. (2001). A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4), 521–544.

  45. Steinberger, J., Poesio, M., Kabadjov, M. A., & Ježek, K. (2007). Two uses of anaphora resolution in summarization. Information Processing and Management: An International Journal, 43(6), 1663–1680.

  46. Stoyanov, V., Gilbert, N., Cardie, C., & Riloff, E. (2009). Conundrums in noun phrase coreference resolution: Making sense of the state-of-the-art. In Proceedings of the 47th annual meeting of the association for computational linguistics and the 4th international joint conference on natural language processing (ACL-IJCNLP 2009) (pp. 656–664). Suntec, Singapore.

  47. Stoyanov, V., Cardie, C., Gilbert, N., Riloff, E., Buttler, D., & Hysom, D. (2010). Coreference resolution with Reconcile. In Proceedings of the 48th annual meeting of the association for computational linguistics (ACL 2010) (pp. 156–161). Uppsala, Sweden.

  48. Versley, Y., Ponzetto, S., Poesio, M., Eidelman, V., Jern, A., Smith, J., et al. (2008). BART: A modular toolkit for coreference resolution. In: Proceedings of the 6th conference on language resources and evaluation (LREC 2008) (pp. 962–965). Marrakech, Morocco.

  49. Vicedo, J. L., & Ferrández, A. (2006). Coreference in Q&A. In T. Strzalkowski & S. Harabagiu (Eds.), Advances in open domain question answering, text, speech and language technology (Vol. 32, pp. 71–96). Berlin, Germany: Springer.

  50. Vilain, M., Burger, J., Aberdeen, J., Connolly, D., & Hirschman, L. (1995). A model-theoretic coreference scoring scheme. In Proceedings of the 6th message understanding conference (MUC-6) (pp. 45–52).

  51. Wick, M., Culotta, A., Rohanimanesh, K., & McCallum, A. (2009). An entity based model for coreference resolution. In Proceedings of the SIAM data mining conference (SDM 2009) (pp. 365–376). Reno, USA.

Acknowledgements

This work was partially funded by the Spanish Ministry of Science and Innovation through the projects TEXT-MESS 2.0 (TIN2009-13391-C04-04), OpenMT-2 (TIN2009-14675-C03), and KNOW2 (TIN2009-14715-C04-04). It also received financial support from the Seventh Framework Programme of the EU (FP7/2007–2013) under GAs 247762 (FAUST) and 247914 (MOLTO), and from Generalitat de Catalunya through a Batista i Roca project (2010 PBR 00039). We are grateful to the two anonymous reviewers of this paper. Their insightful and careful comments allowed us to significantly improve the quality of the final version of this manuscript.

Author information

Corresponding author

Correspondence to Lluís Màrquez.

About this article

Cite this article

Màrquez, L., Recasens, M. & Sapena, E. Coreference resolution: an empirical study based on SemEval-2010 shared Task 1. Lang Resources & Evaluation 47, 661–694 (2013). https://doi.org/10.1007/s10579-012-9194-z

Keywords

  • Coreference resolution and evaluation
  • NLP system analysis
  • Machine learning based NLP tools
  • SemEval-2010 (Task 1)
  • Discourse entities