Abstract
This article introduces the MC4WEPS corpus, a new resource for evaluating Web people search disambiguation tasks. It describes the design, collection and annotation of the corpus, reports the agreement between annotators, and presents a baseline evaluation. The corpus is built by compiling multilingual search engine results for queries that are person names. Proper noun disambiguation is an open problem in natural language ambiguity resolution, and resolving the ambiguity of person names in Web search results in particular remains challenging. However, state-of-the-art approaches have been evaluated only on monolingual web page collections. The MC4WEPS corpus aims to provide the research community with a reference corpus for the task of disambiguating search engine results where the query is a person name shared by homonymous individuals. Two features distinguish this corpus from existing corpora for the same task: multilingualism and the inclusion of social networking websites. These characteristics make it more representative of a real search scenario, especially for evaluating person name disambiguation in a multilingual context. The article also includes detailed information about the format and availability of the corpus.
Acknowledgments
We thank the Spanish Ministry of Science and Innovation for its financial support of this research (MED-RECORD Project, TIN2013-46616-C2-2-R), and we thank the annotators for their work.
About this article
Cite this article
Montalvo, S., Martínez, R., Campillos, L. et al. MC4WEPS: a multilingual corpus for Web people search disambiguation. Lang Resources & Evaluation 51, 805–832 (2017). https://doi.org/10.1007/s10579-016-9365-4
DOI: https://doi.org/10.1007/s10579-016-9365-4