Quo Vadis: A Corpus of Entities and Relations

  • Dan Cristea
  • Daniela Gîfu
  • Mihaela Colhon
  • Paul Diac
  • Anca-Diana Bibiri
  • Cătălina Mărănduc
  • Liviu-Andrei Scutelnicu
Chapter
Part of the Text, Speech and Language Technology book series (TLTB, volume 48)

Abstract

This chapter describes a collective effort to build a corpus annotated with semantic relations, on a text belonging to the belletristic genre. It presents annotation conventions for four categories of semantic relations and the process of building the corpus as a collaborative work. Part of the annotation, namely the token/part-of-speech/lemma layer, is produced automatically during a preprocessing phase. An entity layer (where entities of type person are marked) and a relation layer (marking binary relations between entities) are then added manually by a team of trained annotators, resulting in a heavily annotated file. Several methods for ensuring annotation accuracy are detailed. Finally, some statistics on the corpus are presented. The language under investigation is Romanian, but the proposed annotation conventions and methodological hints are applicable to any language and text genre.
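
As a rough illustration of the three layers mentioned above (token/part of speech/lemma, entities, binary relations), the following minimal sketch parses a small layered XML fragment. All element names, attribute names, tag values, and the relation type are hypothetical assumptions chosen for illustration; the abstract does not specify the corpus schema, and this is not the authors' format.

```python
import xml.etree.ElementTree as ET

# Hypothetical layered annotation. Element names (S, W, ENTITY, RELATION),
# attribute names, and tag values are illustrative assumptions only,
# not the actual schema of the corpus.
SAMPLE = """
<S id="s1">
  <ENTITY id="e1" type="PERSON">
    <W id="w1" pos="PROPN" lemma="Petronius">Petronius</W>
  </ENTITY>
  <W id="w2" pos="CCONJ" lemma="și">și</W>
  <ENTITY id="e2" type="PERSON">
    <W id="w3" pos="PROPN" lemma="Vinicius">Vinicius</W>
  </ENTITY>
  <RELATION id="r1" type="illustrative-relation" from="e1" to="e2"/>
</S>
""".strip()


def load_layers(xml_text):
    """Return the token, entity, and relation layers of one sentence."""
    root = ET.fromstring(xml_text)
    # Token layer: every W element, whether or not it sits inside an entity.
    tokens = [
        (w.get("id"), w.text, w.get("pos"), w.get("lemma"))
        for w in root.iter("W")
    ]
    # Entity layer: entity id mapped to the surface text of its tokens.
    entities = {
        e.get("id"): " ".join(w.text for w in e.iter("W"))
        for e in root.iter("ENTITY")
    }
    # Relation layer: binary relations that refer to entities by id.
    relations = [
        (r.get("type"), r.get("from"), r.get("to"))
        for r in root.iter("RELATION")
    ]
    return tokens, entities, relations


tokens, entities, relations = load_layers(SAMPLE)
for rel_type, src, dst in relations:
    print(f"{rel_type}: {entities[src]} -> {entities[dst]}")
```

Keeping relations as separate elements that point to entity identifiers, rather than nesting them, is one common way to let a manually added layer sit on top of an automatically produced token layer without altering it.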

Keywords

Semantic relations; Annotated corpus; Anaphora; XML; Annotation conventions

Notes

Acknowledgments

We are grateful to the master's students in Computational Linguistics from the “Alexandru Ioan Cuza” University of Iaşi, Faculty of Computer Science, who, over three consecutive terms, annotated and then corrected large segments of the “Quo Vadis” corpus. Part of the work on the construction of this corpus was done in connection with COROLA (The Computational Representational Corpus of Contemporary Romanian), a joint project of the Institute for Computer Science in Iaşi and the Research Institute for Artificial Intelligence in Bucharest, under the auspices of the Romanian Academy.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Dan Cristea (1, 2)
  • Daniela Gîfu (1)
  • Mihaela Colhon (3)
  • Paul Diac (1)
  • Anca-Diana Bibiri (4)
  • Cătălina Mărănduc (5)
  • Liviu-Andrei Scutelnicu (1, 2)

  1. Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iaşi, Iaşi, Romania
  2. Institute for Computer Science, Romanian Academy - The Iaşi Branch, Iaşi, Romania
  3. Department of Computer Science, University of Craiova, Craiova, Romania
  4. Department of Interdisciplinary Research in Social-Human Sciences, “Alexandru Ioan Cuza” University of Iaşi, Iaşi, Romania
  5. “Iorgu Iordan-Al. Rosetti” Institute of Linguistics of the Romanian Academy, Bucharest, Romania
