Syntax Deep Explorer

  • José Correia
  • Jorge Baptista
  • Nuno Mamede
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9727)


The analysis of co-occurrence patterns between words allows for a better understanding of the use (and meaning) of words. Its most straightforward applications are lexicography and linguistic description in general. Some tools already produce co-occurrence information about words taken from Portuguese corpora, but few can use lemmata or syntactic dependency information. Syntax Deep Explorer is a new tool that uses several association measures to quantify several co-occurrence types, defined on the syntactic dependencies (e.g. subject, complement, modifier) between a target word lemma and its collocates. The resulting co-occurrence statistics are represented in lex-grams, that is, synopses of the syntactically based co-occurrence patterns of a word's distribution within a given corpus. These lex-grams are obtained from a large Portuguese corpus processed by STRING [19] and are presented in a user-friendly way through a graphical interface. Syntax Deep Explorer will allow the development of finer lexical resources and the improvement of STRING processing in general, while also providing public access to co-occurrence information derived from parsed corpora.
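As a rough illustration of what quantifying dependency-based co-occurrence can look like, the sketch below scores (dependency, target lemma, collocate lemma) triples with two common association measures, pointwise mutual information and the Dice coefficient (cf. [7] and [9] in the reference list). The abstract does not say which measures the tool uses or how it stores its counts, so the triple format, the function name `association_scores`, the dependency labels and the toy data are all assumptions made for this example, not the tool's actual implementation.

```python
import math
from collections import Counter


def association_scores(triples):
    """Score (dependency, target, collocate) triples with PMI and Dice.

    Hypothetical helper: counts are kept separately per dependency type,
    so a lemma's frequency as a subject is not mixed with its frequency
    as a modifier.
    """
    pair_counts = Counter(triples)                          # f(dep, target, collocate)
    target_counts = Counter((d, t) for d, t, _ in triples)  # f(dep, target, *)
    colloc_counts = Counter((d, c) for d, _, c in triples)  # f(dep, *, collocate)
    dep_totals = Counter(d for d, _, _ in triples)          # f(dep, *, *)

    scores = {}
    for (dep, target, colloc), f_xy in pair_counts.items():
        f_x = target_counts[(dep, target)]
        f_y = colloc_counts[(dep, colloc)]
        n = dep_totals[dep]
        # PMI: log2 of observed joint probability over the expected one.
        pmi = math.log2(f_xy * n / (f_x * f_y))
        # Dice: harmonic-mean style overlap of the two marginal counts.
        dice = 2 * f_xy / (f_x + f_y)
        scores[(dep, target, colloc)] = {"freq": f_xy, "pmi": pmi, "dice": dice}
    return scores


# Toy data (invented for illustration): dependency label, target, collocate.
toy_triples = [
    ("SUBJ", "ladrar", "cão"),
    ("SUBJ", "ladrar", "cão"),
    ("SUBJ", "ladrar", "homem"),
    ("SUBJ", "correr", "homem"),
    ("MOD", "cão", "grande"),
]

for key, value in sorted(association_scores(toy_triples).items()):
    print(key, value)
```

Grouping the scored triples by target lemma and dependency type, and sorting each group by one of the measures, would give a table in the spirit of the lex-gram described above.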


Natural Language Processing (NLP), Co-occurrence, Collocation, Association measures, Graphic interface, Lex-gram, Portuguese



This work was supported by national funds through FCT–Fundação para a Ciência e a Tecnologia, ref. UID/CEC/50021/2013. Thanks to Neuza Costa (UAlg) for revising the final version of this paper.


  1. Aït-Mokhtar, S., Chanod, J.P., Roux, C.: Robustness beyond shallowness: incremental deep parsing. Nat. Lang. Eng. 8, 121–144 (2002)
  2. Bick, E.: The Parsing System PALAVRAS. Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University Press, Aarhus (2000)
  3. Bick, E.: DeepDict - a graphical corpus-based dictionary of word relations. In: Proceedings of NODALIDA 2009. NEALT Proceedings Series, vol. 4, pp. 268–271. Tartu University Library, Tartu (2009)
  4. Biemann, C., Bordag, S., Heyer, G., Quasthoff, U., Wolff, C.: Language-independent methods for compiling monolingual lexical data. In: Gelbukh, A. (ed.) CICLing 2004. LNCS, vol. 2945, pp. 217–228. Springer, Heidelberg (2004)
  5. Carapinha, F.: Extração Automática de Conteúdos Documentais. Master’s thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, June 2013
  6. Chen, P.: The entity-relationship model—toward a unified view of data. ACM Trans. Database Syst. 1(1), 9–36 (1976)
  7. Church, K., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990)
  8. Codd, E.: A relational model of data for large shared data banks. Commun. ACM 26(6), 64–69 (1983)
  9. Dice, L.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
  10. Diniz, C., Mamede, N., Pereira, J.: RuDriCo2 - a faster disambiguator and segmentation modifier. In: INFORUM II, pp. 573–584, September 2010
  11. Diniz, C., Mamede, N., Pereira, J.D.: RuDriCo2 - a faster disambiguator and segmentation modifier. In: Simpósio de Informática - INForum, pp. 573–584. Universidade do Minho, Portugal (2010)
  12. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19(1), 61–74 (1993)
  13. Hagège, C., Baptista, J., Mamede, N.: Identificação, Classificação e Normalização de Expressões Temporais em Português: a Experiência do Segundo HAREM e o Futuro. In: Mota, C., Santos, D. (eds.) Desafios na Avaliação Conjunta do Reconhecimento de Entidades Mencionadas: o Segundo HAREM, chap. 2, pp. 33–54. Linguateca (2008)
  14. Hagège, C., Baptista, J., Mamede, N.: Portuguese temporal expressions recognition: from TE characterization to an effective TER module implementation. In: 7th Brazilian Symposium in Information and Human Language Technology, STIL 2009, pp. 1–5. Sociedade Brasileira de Computação, São Carlos (2009)
  15. Hagège, C., Baptista, J., Mamede, N.J.: Reconhecimento de entidades mencionadas com o XIP: Uma colaboração entre o INESC-L2F e a Xerox. In: Mota, C., Santos, D. (eds.) Desafios na avaliação conjunta do reconhecimento de entidades mencionadas: Actas do Encontro do Segundo HAREM (Aveiro, 11 de Setembro de 2008). Linguateca (2009)
  16. Hagège, C., Baptista, J., Mamede, N.J.: Caracterização e processamento de expressões temporais em português. Linguamática 2(1), 63–76 (2010)
  17. Kilgarriff, A., et al.: The sketch engine: ten years on. Lexicography 1(1), 7–36 (2014)
  18. Kilgarriff, A., Rychly, P., Tugwell, D., Smrz, P.: The sketch engine. In: Proceedings of Euralex, vol. Demo Session, pp. 105–116. Lorient, France, July 2004
  19. Mamede, N., Baptista, J., Diniz, C., Cabarrão, V.: STRING: an hybrid statistical and rule-based natural language processing chain for Portuguese. In: PROPOR 2012, vol. Demo Session, April 2012
  20. Mamede, N.J., Baptista, J.: Nomenclature of chunks and dependencies in Portuguese XIP Grammar 4.5. Technical report, L2F-Spoken Language Laboratory, INESC-ID Lisboa, Lisboa, January 2016
  21. Manning, C., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
  22. Marques, J.S.: Anaphora Resolution. Master’s thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisboa (2013)
  23. Maurício, A.: Identificação, Classificação e Normalização de Expressões Temporais. Master’s thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisboa, November 2011
  24. Nobre, N.: Resolução de Expressões Anafóricas. Master’s thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, June 2011
  25. Oliveira, D.: Extraction and Classification of Named Entities. Master’s thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa (2010)
  26. Pereira, S.: Linguistics Parameters for Zero Anaphora Resolution. Master’s thesis, Universidade do Algarve and University of Wolverhampton (2010)
  27. Quasthoff, U., Richter, M., Biemann, C.: Corpus portal for search in monolingual corpora. In: Proceedings of the 5th LREC, pp. 1799–1802 (2006)
  28. Ribeiro, R.: Anotação Morfossintática Desambiguada do Português. Master’s thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, March 2003
  29. Rychly, P.: Manatee/Bonito - a modular corpus manager. In: Sojka, P., Horák, A. (eds.) RASLAN 2008, pp. 65–70. Masaryk University, Brno (2007)
  30. Rychly, P.: A lexicographer-friendly association score. In: RASLAN 2008, pp. 6–9. Masarykova Univerzita, Brno (2008)
  31. Santos, D., Rocha, P.: Evaluating CETEMPúblico, a free resource for Portuguese. In: Proceedings of the 39th Annual Meeting of ACL, ACL 2001, pp. 450–457. Association for Computational Linguistics, Stroudsburg (2001)
  32. Silberschatz, A., Korth, H., Sudarshan, S.: Database System Concepts. Connect, learn, succeed. McGraw-Hill Education (2010)
  33. Sinclair, J.: Corpus, Concordance, Collocation. Oxford University Press, Oxford (1991)
  34. Smadja, F., McKeown, K., Hatzivassiloglou, V.: Translating collocations for bilingual lexicons: a statistical approach. Comput. Linguist. 22(1), 1–38 (1996)
  35. Vicente, A.M.F.: LexMan: um Segmentador e Analisador Morfológico com Transdutores. Master’s thesis, Instituto Superior Técnico, Universidade de Lisboa, Lisboa, June 2013

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
  2. L2F – Spoken Language Lab, INESC-ID Lisboa, Lisbon, Portugal
  3. Universidade do Algarve, Faro, Portugal
