Abstract
The automatic processing of medical language is a challenge for computational linguists because of an intrinsic feature of this sub-code: its lexicon comprises a vast number of terms that appear infrequently in texts. In addition, the many sub-domains that can coincide in a single text complicate the collection of a training set for a supervised classification task. This paper tackles the problem of unsupervised classification of medical scientific papers based on hybrid Multiword Expression (MWE) discovery. We apply a morpho-semantic approach to extract medical domain terms and their semantic tags, alongside the classic MWE discovery strategies. The collected MWEs are used to vectorize texts and generate a network of similarities among corpus documents. With this approach, we try to address both problems posed by the features of the medical domain. The vast lexicon of low-frequency terms is dealt with by extracting many semantic tags with a small dictionary; the issue of co-occurring sub-domains is addressed by generating clusters of similarity values instead of a rigid classification.
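As a rough illustration of the vectorization and similarity-network step described above, the following minimal sketch counts occurrences of already-discovered MWEs in each document and links documents whose count vectors are similar. The abstract does not specify the weighting scheme or the similarity measure, so the raw counts, the cosine similarity, the `threshold` parameter, and all function names and sample data here are assumptions made for illustration only; the MWE discovery and the clustering of the resulting network are left out.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def vectorize(text, mwes):
    """Map a text to a sparse vector of raw MWE counts (assumed weighting)."""
    lowered = text.lower()
    return Counter({m: lowered.count(m) for m in mwes if m in lowered})


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_network(docs, mwes, threshold=0.1):
    """Return weighted edges between documents whose similarity reaches the threshold."""
    vectors = {name: vectorize(text, mwes) for name, text in docs.items()}
    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        weight = cosine(vectors[a], vectors[b])
        if weight >= threshold:
            edges[(a, b)] = weight
    return edges


# Hypothetical MWEs and documents, invented for this example.
mwes = ["myocardial infarction", "heart failure", "bone marrow"]
docs = {
    "d1": "Acute myocardial infarction and heart failure in elderly patients.",
    "d2": "Heart failure after myocardial infarction: a cohort study.",
    "d3": "Bone marrow transplantation outcomes.",
}
edges = similarity_network(docs, mwes)
print(edges)
```

In this toy corpus only `d1` and `d2` share MWEs, so they are the only pair connected. Because the edges carry graded weights, a document that mixes sub-domains can sit between several groups rather than being forced into a single class.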
Notes
1. This paper has been produced with the financial support of the Project financed by Campania Region of Italy ‘REMIAM - Rete Musei intelligenti ad avanzata Multimedialita’. CUP B63D18000360007.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Maisto, A. (2021). Unsupervised Classification of Medical Documents Through Hybrid MWEs Discovery. In: Barolli, L., Woungang, I., Enokido, T. (eds) Advanced Information Networking and Applications. AINA 2021. Lecture Notes in Networks and Systems, vol 227. Springer, Cham. https://doi.org/10.1007/978-3-030-75078-7_6
DOI: https://doi.org/10.1007/978-3-030-75078-7_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-75077-0
Online ISBN: 978-3-030-75078-7
eBook Packages: Intelligent Technologies and Robotics (R0)