Unsupervised Classification of Medical Documents Through Hybrid MWEs Discovery

  • Conference paper
Advanced Information Networking and Applications (AINA 2021)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 227)

Abstract

The automatic processing of medical language represents a challenge for computational linguists because of an intrinsic feature of this sub-code: its lexicon comprises a vast number of terms that appear infrequently in texts. In addition, many sub-domains can co-occur in a single text, which complicates the collection of a training set for a supervised classification task. This paper tackles the problem of unsupervised classification of medical scientific papers based on hybrid Multiword Expression (MWE) discovery. In addition to the classic MWE discovery strategies, we apply a morpho-semantic approach to extract medical domain terms and their semantic tags. The collected MWEs are used to vectorize texts and generate a network of similarities among corpus documents. With this approach, we try to address both problems caused by the features of the medical domain. The vast lexicon of low-frequency terms is dealt with by extracting many semantic tags with a small dictionary; the issue of co-occurring sub-domains is addressed by generating clusters of similarity values instead of a rigid classification.
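
The pipeline outlined in the abstract (MWE-based vectorization followed by a document similarity network) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the toy morpheme dictionary `MORPHEME_TAGS`, the function names, the use of raw counts with cosine similarity, and the edge threshold are all hypothetical. The clustering step that would group the resulting network is left out.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy dictionary mapping medical morphemes to semantic tags.
# A real morpho-semantic dictionary would be far richer.
MORPHEME_TAGS = {"cardio": "HEART", "gastr": "STOMACH", "neuro": "NERVOUS_SYSTEM"}


def semantic_tags(text):
    """Return one semantic tag for every (token, morpheme) match in the text."""
    tags = []
    for token in text.lower().split():
        for morpheme, tag in MORPHEME_TAGS.items():
            if morpheme in token:
                tags.append(tag)
    return tags


def vectorize(text, mwes):
    """Build a sparse count vector from known MWEs plus semantic tags."""
    lowered = text.lower()
    counts = Counter({m: lowered.count(m) for m in mwes if m in lowered})
    counts.update(semantic_tags(text))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_network(docs, mwes, threshold=0.1):
    """Return weighted edges between documents whose similarity meets the threshold."""
    vectors = {name: vectorize(text, mwes) for name, text in docs.items()}
    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            edges[(a, b)] = score
    return edges


docs = {
    "d1": "Cardiomyopathy and heart failure in cardiology",
    "d2": "Heart failure after cardiovascular surgery",
    "d3": "Gastritis and gastric ulcer",
}
edges = similarity_network(docs, mwes=["heart failure", "gastric ulcer"])
```

In this toy corpus only the two cardiology documents end up linked. Because the edges carry similarity weights instead of hard labels, a document that mixes sub-domains could be connected to several groups at once, which is the motivation the abstract gives for preferring clusters over a rigid classification.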

Notes

  1. This paper has been produced with the financial support of the project financed by the Campania Region of Italy, ‘REMIAM - Rete Musei intelligenti ad avanzata Multimedialita’. CUP B63D18000360007.

Author information

Correspondence to Alessandro Maisto.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Maisto, A. (2021). Unsupervised Classification of Medical Documents Through Hybrid MWEs Discovery. In: Barolli, L., Woungang, I., Enokido, T. (eds) Advanced Information Networking and Applications. AINA 2021. Lecture Notes in Networks and Systems, vol 227. Springer, Cham. https://doi.org/10.1007/978-3-030-75078-7_6
