Abstract
Topic models make it possible to extract topics from documents and to classify those documents. In this work, the Latent Dirichlet Allocation (LDA) model was applied to extract topics from documents containing medical information. The corpus consisted of 220 digital documents written in Spanish, with information about different health conditions. Pre-processing consisted of tokenization, stop-word removal, and lemmatization, and defined the medical terms used to represent the documents. The documents were then represented as a document-term matrix. An important step was the use of a medical glossary, based on terminology extracted from the Internet, to assign weights to the terms. Applying LDA produced two new matrices: a document-topic matrix and a topic-term matrix. In total, 25 topics were identified; they can be visualized with heat maps, word clouds, and an interactive tool called pyLDAvis. The application was developed in Python using libraries such as spaCy, scikit-learn, tmtoolkit, and pyLDAvis, among others.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Mena, A., Reátegui, R. (2021). Topic Identification from Spanish Unstructured Health Texts. In: Botto-Tobar, M., Montes León, S., Camacho, O., Chávez, D., Torres-Carrión, P., Zambrano Vizuete, M. (eds) Applied Technologies. ICAT 2020. Communications in Computer and Information Science, vol 1388. Springer, Cham. https://doi.org/10.1007/978-3-030-71503-8_27
DOI: https://doi.org/10.1007/978-3-030-71503-8_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71502-1
Online ISBN: 978-3-030-71503-8
eBook Packages: Computer Science, Computer Science (R0)