Topic Identification from Spanish Unstructured Health Texts

  • Conference paper
Applied Technologies (ICAT 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1388)

Abstract

Topic models make it possible to extract topics from documents and to classify them. In this work, the Latent Dirichlet Allocation (LDA) model was applied to extract topics from documents containing medical information. The corpus consisted of 220 digital documents written in Spanish, which contain information about different health conditions. Pre-processing, consisting of tokenization, stop-word removal, and lemmatization, was carried out to define the medical terms that represent the documents. The documents were then represented as a document-term matrix. An important step was the use of a medical glossary, based on terminology extracted from the Internet, to assign weights to the terms. Applying LDA produced two new matrices: a document-topic matrix and a topic-term matrix. Twenty-five topics were identified; they can be visualized with heat maps, word clouds, and an interactive tool called pyLDAvis. The application was developed in Python using libraries such as spaCy, scikit-learn, tmtoolkit, and pyLDAvis, among others.
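
The representation step described in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library and does not reproduce the authors' implementation: the three sample sentences, the stop-word list, the glossary, and the weight of 2 for glossary terms are all hypothetical, since the abstract does not state the actual lists or weighting scheme. Lemmatization is omitted for brevity.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for the 220 Spanish health documents.
docs = [
    "La diabetes es una enfermedad que eleva la glucosa en la sangre.",
    "La hipertensión aumenta la presión de la sangre en las arterias.",
    "El asma es una enfermedad que inflama los bronquios.",
]

# Illustrative stop words and glossary; the paper's real lists are not given.
STOP_WORDS = {"la", "las", "el", "los", "es", "una", "que", "en", "de"}
GLOSSARY = {"diabetes", "glucosa", "hipertensión", "asma", "bronquios", "arterias"}
GLOSSARY_WEIGHT = 2  # assumed boost applied to counts of glossary terms


def preprocess(text):
    """Lowercase, tokenize on Spanish letters, and drop stop words."""
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_weighted_dtm(documents):
    """Return (vocabulary, matrix); matrix[d][t] is a glossary-weighted count."""
    tokenized = [preprocess(d) for d in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [
            counts[t] * (GLOSSARY_WEIGHT if t in GLOSSARY else 1)
            for t in vocabulary
        ]
        matrix.append(row)
    return vocabulary, matrix


vocabulary, dtm = build_weighted_dtm(docs)
for row in dtm:
    print(row)
```

A non-negative matrix of this shape is the kind of input an LDA implementation expects. In scikit-learn, for example, `LatentDirichletAllocation.fit_transform` returns the document-topic matrix, and the fitted `components_` attribute holds the topic-term matrix.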

Notes

  1. https://temu.bsc.es/BARR2/

  2. https://tmtoolkit.readthedocs.io/en/latest/index.html

  3. https://pyldavis.readthedocs.io/en/latest/

Author information

Corresponding author

Correspondence to Ruth Reátegui.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Mena, A., Reátegui, R. (2021). Topic Identification from Spanish Unstructured Health Texts. In: Botto-Tobar, M., Montes León, S., Camacho, O., Chávez, D., Torres-Carrión, P., Zambrano Vizuete, M. (eds) Applied Technologies. ICAT 2020. Communications in Computer and Information Science, vol 1388. Springer, Cham. https://doi.org/10.1007/978-3-030-71503-8_27

  • DOI: https://doi.org/10.1007/978-3-030-71503-8_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-71502-1

  • Online ISBN: 978-3-030-71503-8

  • eBook Packages: Computer Science, Computer Science (R0)
