Topic Identification from Spanish Unstructured Health Texts

Mena, Andrea; Reátegui, Ruth

doi:10.1007/978-3-030-71503-8_27

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1388)

Included in the following conference series:

International Conference on Applied Technologies


Abstract

Topic models extract topics from documents and make it possible to classify them. In this work, the Latent Dirichlet Allocation (LDA) model was applied to extract topics from documents containing medical information. The corpus consisted of 220 digital documents written in Spanish, each describing different health conditions. Pre-processing, comprising tokenization, stop-word removal, and lemmatization, was carried out to define the medical terms that represent the documents. The documents were then represented as a document-term matrix. An important step was the use of a medical glossary, built from terminology extracted from the Internet, to assign weights to the terms. Applying LDA yielded two new matrices: a document-topic matrix and a topic-term matrix. In total, 25 topics were identified; they can be visualized with heat maps, word clouds, and an interactive tool called pyLDAvis. The application was developed in Python using libraries such as spaCy, scikit-learn, tmtoolkit, and pyLDAvis, among others.



Author information

Authors and Affiliations

Universidad Técnica Particular de Loja, San Cayetano Alto, 1101608, Loja, Ecuador
Andrea Mena & Ruth Reátegui

Corresponding author

Correspondence to Ruth Reátegui.

Editor information

Editors and Affiliations

Eindhoven University of Technology, Eindhoven, The Netherlands
Miguel Botto-Tobar
Universidad de las Fuerzas Armadas (ESPE), Quito, Ecuador
Sergio Montes León
Escuela Politécnica Nacional, Quito, Ecuador
Oscar Camacho
Escuela Politécnica Nacional, Quito, Ecuador
Danilo Chávez
Universidad Técnica Particular de Loja, Loja, Ecuador
Pablo Torres-Carrión
Universidad Técnica del Norte, Ibarra, Ecuador
Marcelo Zambrano Vizuete

About this paper

Cite this paper

Mena, A., Reátegui, R. (2021). Topic Identification from Spanish Unstructured Health Texts. In: Botto-Tobar, M., Montes León, S., Camacho, O., Chávez, D., Torres-Carrión, P., Zambrano Vizuete, M. (eds) Applied Technologies. ICAT 2020. Communications in Computer and Information Science, vol 1388. Springer, Cham. https://doi.org/10.1007/978-3-030-71503-8_27

DOI: https://doi.org/10.1007/978-3-030-71503-8_27
Published: 01 April 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71502-1
Online ISBN: 978-3-030-71503-8
eBook Packages: Computer Science, Computer Science (R0)
