A Study of the Effect of Document Representations in Clustering-Based Cross-Document Coreference Resolution

  • Horacio Saggion
Part of the Theory and Applications of Natural Language Processing book series (NLP)


Finding information about people in huge text collections or online repositories on the Web is a common activity. We describe experiments aimed at identifying the contribution of semantic information (e.g., named entities) and summarization (e.g., sentence extracts) to a cross-document coreference resolution system. Our system uses a clustering-based algorithm to group documents referring to the same entity. Clustering uses vector representations created by summarization and semantic tagging components. We investigate different clustering configurations and show that the choice of summary type and of term type used for the vector representation is important for achieving good performance.
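
The clustering idea can be illustrated with a minimal sketch. It is not the system evaluated in the chapter: the single-link merging rule, the raw term-frequency weights, the similarity threshold of 0.3, and the toy term lists are all assumptions made for illustration. Each document is reduced to a bag of terms (for example, named entities taken from a summary), and documents whose term vectors are close under cosine similarity are merged into the same cluster.

```python
from __future__ import annotations

import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs: dict[str, list[str]], threshold: float = 0.3) -> list[set[str]]:
    """Single-link agglomerative clustering (an assumed variant).

    Merging stops once the best pair of clusters is less similar
    than `threshold`, which is a hypothetical parameter.
    """
    vectors = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    clusters = [{doc_id} for doc_id in docs]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(cosine(vectors[x], vectors[y])
                          for x in clusters[i] for y in clusters[j])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Hypothetical term lists, e.g. named entities extracted from each summary.
docs = {
    "d1": ["John Smith", "Sheffield", "University", "NLP"],
    "d2": ["John Smith", "University", "Sheffield", "summarization"],
    "d3": ["John Smith", "Chicago", "Bulls", "basketball"],
}
result = cluster(docs)
print(result)
```

In this toy input all three documents mention the same name, so the name alone cannot separate them; the remaining terms decide the grouping. Changing which terms enter the vectors (all words, only named entities, only terms from a summary) changes the similarities and therefore the clusters, which is the kind of effect the experiments examine.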


Keywords: Named Entity Recognition, Target Person, Document Representation, Text Summarization, Input Document



We thank the reviewers for their comments and suggestions, which helped improve the final version of this paper. Horacio Saggion is grateful for a fellowship from Programa Ramón y Cajal, Ministerio de Ciencia e Innovación, Spain. We acknowledge the support of the editors of this volume.



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Spain
