Automatic Multiword Identification in a Specialist Corpus

  • Pasquale Pavone
Part of the Quantitative Methods in the Humanities and Social Sciences book series (QMHSS)


Within the study of specialist-technical corpora, this work proposes a lexical-textual model for the automatic identification of the nominal Multiword Expressions present in texts. In automatic text analysis, particular attention is usually devoted to recognizing the nominal Multiword Expressions in a corpus, which include both nominal idiomatic expressions and linguistic collocations. This vast class of Multiword Expressions includes technical terms and compound personal nouns, and is therefore often found in specialist-technical language. Though they are not nominal idioms, these complex lexemes represent technical or specialist expressions. Accurate detection of Multiword Expressions makes it possible to disambiguate the meaning of words and to define or enhance terminological glossaries for a specific specialist sector. This objective is reached by recognizing the syntactic structures that define nominal expressions. Multiword Expressions represent the universe of unambiguous subjects and objects in a text, that is to say, the terminology of the discourse. We show that factor analysis applied to a limited number of Multiword Expressions can reconstruct the same structure as that of the whole vocabulary under analysis. The procedure presented here is applied to a corpus consisting of the titles of papers published in the journals Mind, The Monist, The Journal of Philosophy and The Philosophical Review, from their foundation to the last issue of 2016.


Multiword expressions; Technical language; Part-of-speech tagging; Regular expressions; Taltac2 software
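
The abstract describes recognizing the syntactic structures that define nominal expressions, and the keywords point to part-of-speech tagging and regular expressions. As a rough illustration of that general idea only, the sketch below matches a regular expression against a sequence of part-of-speech tags. It is not the chapter's model or the Taltac2 procedure: the one-letter tag codes, the pattern (a simplified adjective/noun/preposition pattern), the function name extract_nominal_mwes and the hand-tagged example title are all assumptions made for illustration.

```python
import re

# Hypothetical one-letter codes for coarse part-of-speech classes:
# A = adjective, N = noun, P = preposition, O = anything else.
# Each tag must be exactly one character so that positions in the tag
# string line up with positions in the token list.

# Assumed pattern: a run of adjectives/nouns ending in a noun, optionally
# containing one "noun + preposition" link (e.g. "theory of knowledge").
# The longer alternative is listed first so that it is preferred.
NOMINAL_PATTERN = re.compile(r"(?:[AN]*NP[AN]*|[AN]+)N")


def extract_nominal_mwes(tagged_tokens):
    """Return candidate nominal multiword expressions.

    tagged_tokens is a list of (word, tag) pairs using the codes above.
    """
    tag_string = "".join(tag for _, tag in tagged_tokens)
    candidates = []
    for match in NOMINAL_PATTERN.finditer(tag_string):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        candidates.append(" ".join(words))
    return candidates


# Invented, hand-tagged example title (not taken from the corpus).
title = [
    ("The", "O"), ("philosophical", "A"), ("theory", "N"), ("of", "P"),
    ("knowledge", "N"), ("in", "P"), ("modern", "A"), ("science", "N"),
]

print(extract_nominal_mwes(title))
```

In practice the tags would come from a part-of-speech tagger rather than being assigned by hand, and the candidates would still need filtering, for instance by frequency, before being treated as terminology.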



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Università degli Studi di Modena e Reggio Emilia, Modena, Italy
