Unsupervised Classification of Medical Documents Through Hybrid MWEs Discovery

Maisto, Alessandro

doi:10.1007/978-3-030-75078-7_6

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 227)

Included in the following conference series:

International Conference on Advanced Information Networking and Applications

Abstract

The automatic processing of medical language is a challenge for computational linguists because of an intrinsic feature of this sub-code: its lexicon comprises a vast number of terms that appear infrequently in texts. In addition, the presence of many sub-domains that can co-occur in a single text complicates the collection of a training set for a supervised classification task. This paper tackles the problem of unsupervised classification of medical scientific papers through hybrid Multiword Expression (MWE) discovery. We apply a morpho-semantic approach to extract medical domain terms and their semantic tags, in addition to the classic MWE discovery strategies. The collected MWEs are used to vectorize texts and generate a network of similarities among corpus documents. With this approach, we try to address both problems caused by the features of the medical domain. The vast lexicon of low-frequency terms is dealt with by extracting many semantic tags with a small dictionary; the issue of co-occurring sub-domains is addressed by generating clusters of similarity values instead of a rigid classification.
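The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: pointwise mutual information (PMI) over bigrams stands in for the "classic MWE discovery strategies", substring matching against a tiny hypothetical morpheme dictionary (`MORPHEMES`) stands in for the morpho-semantic analysis, and all function names, thresholds, and sample documents are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical mini-dictionary mapping neoclassical morphemes to semantic tags.
# The chapter's real dictionary is not given in the abstract.
MORPHEMES = {"cardi": "heart", "gastr": "stomach", "hepat": "liver", "itis": "inflammation"}


def semantic_tags(token):
    """Return the tags of every dictionary morpheme found inside the token (naive substring match)."""
    return sorted({tag for morph, tag in MORPHEMES.items() if morph in token})


def pmi_bigrams(docs, min_count=2, min_pmi=0.0):
    """Select bigrams whose PMI exceeds min_pmi (assumed stand-in for MWE discovery)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    selected = set()
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        expected = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        if math.log2((count / n_bi) / expected) > min_pmi:
            selected.add((w1, w2))
    return selected


def vectorize(tokens, mwes):
    """Represent a document as counts of discovered MWEs and semantic tags."""
    features = Counter()
    for bigram in zip(tokens, tokens[1:]):
        if bigram in mwes:
            features["mwe:" + " ".join(bigram)] += 1
    for token in tokens:
        for tag in semantic_tags(token):
            features["tag:" + tag] += 1
    return features


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_network(docs, threshold=0.5):
    """Return weighted edges (i, j) -> similarity for document pairs above the threshold."""
    mwes = pmi_bigrams(docs)
    vectors = [vectorize(tokens, mwes) for tokens in docs]
    edges = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                edges[(i, j)] = sim
    return edges


# Toy corpus (invented for illustration), already tokenized.
docs = [
    "acute myocarditis with heart failure".split(),
    "heart failure after myocarditis".split(),
    "chronic hepatitis and liver fibrosis".split(),
    "liver fibrosis in hepatitis patients".split(),
]
edges = similarity_network(docs)
```

In this sketch a small dictionary yields tags for many rare surface forms, and the weighted edges keep graded similarity between documents instead of a single hard label. A community detection step over `edges` would then produce the clusters the abstract mentions.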

Notes

1. This paper has been produced with the financial support of the Project financed by Campania Region of Italy ‘REMIAM - Rete Musei intelligenti ad avanzata Multimedialita’. CUP B63D18000360007.

Author information

Authors and Affiliations

University of Salerno, Via Giovanni Paolo II, 132, 84084, Fisciano, (SA), Italy
Alessandro Maisto

Corresponding author

Correspondence to Alessandro Maisto.

Editor information

Editors and Affiliations

Department of Information and Communication Engineering, Fukuoka Institute of Technology, Fukuoka, Japan
Leonard Barolli
Department of Computer Science, Ryerson University, Toronto, ON, Canada
Isaac Woungang
Faculty of Business Administration, Rissho University, Tokyo, Japan
Tomoya Enokido

Cite this paper

Maisto, A. (2021). Unsupervised Classification of Medical Documents Through Hybrid MWEs Discovery. In: Barolli, L., Woungang, I., Enokido, T. (eds) Advanced Information Networking and Applications. AINA 2021. Lecture Notes in Networks and Systems, vol 227. Springer, Cham. https://doi.org/10.1007/978-3-030-75078-7_6

DOI: https://doi.org/10.1007/978-3-030-75078-7_6
Published: 01 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-75077-0
Online ISBN: 978-3-030-75078-7
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
