Unsupervised Topic Modeling in a Large Free Text Radiology Report Repository
Radiology report narratives contain a large amount of information about the patient’s health and the radiologist’s interpretation of medical findings. Most of this critical information is entered as free text, even when structured radiology report templates are used. Terminology and language in the narrative vary among radiologists and organizations. The free text format, together with the subtlety and variation of natural language, hinders the extraction of reusable information from radiology reports for decision support, quality improvement, and biomedical research. Therefore, as a first step toward organizing and extracting the information content of a large multi-institutional free text radiology report repository, we designed and developed an unsupervised machine learning approach that captures the main concepts in the repository and partitions the reports by their main foci. In this approach, radiology reports are modeled in a vector space and compared with each other using a cosine similarity measure. This similarity is used to cluster the reports and identify the repository’s underlying topics. We applied our approach to a repository of 1,899,482 radiology reports from three major healthcare organizations. Our method identified 19 major radiology report topics in the repository and clustered the reports according to these topics. The results were verified by a domain expert radiologist; they successfully explain the repository’s primary topics and extract the corresponding reports. The results of our system provide a target-based corpus and framework for information extraction and retrieval systems for radiology reports.
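The pipeline summarized above (vector space model, cosine similarity, clustering) can be illustrated with a minimal sketch. This is not the authors' implementation. The TF-IDF weighting, the k-means-style clustering loop, the seed choice, and the four toy reports are all assumptions made for illustration, since the abstract names neither the term weighting nor the clustering algorithm.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic tokens only (an illustrative choice).
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    # Represent each report as a sparse {term: weight} vector.
    # Assumption: smoothed TF-IDF; the abstract does not name the weighting.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def cluster(vectors, seed_indices, iterations=10):
    # Assumption: a k-means-style loop that assigns each report to the
    # centroid with the highest cosine similarity, seeded from given reports.
    centroids = [dict(vectors[i]) for i in seed_indices]
    labels = []
    for _ in range(iterations):
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for k in range(len(centroids)):
            total = {}
            for v, lab in zip(vectors, labels):
                if lab == k:
                    for term, w in v.items():
                        total[term] = total.get(term, 0.0) + w
            if total:
                centroids[k] = total
    return labels

# Hypothetical toy reports, not drawn from the repository in the paper.
reports = [
    "chest radiograph shows no acute cardiopulmonary disease",
    "chest radiograph shows clear lungs no pleural effusion",
    "mri brain shows no acute intracranial hemorrhage",
    "mri brain shows no mass or midline shift",
]
vectors = tfidf_vectors(reports)
labels = cluster(vectors, seed_indices=[0, 2])
print(labels)  # -> [0, 0, 1, 1]
```

On this toy input the two chest reports land in one cluster and the two brain MRI reports in the other. At the scale reported here (about 1.9 million reports), a distributed implementation would likely be needed in place of this in-memory loop.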
Keywords: Topic modeling; Radiology report narrative; Clustering; Text mining; Natural language processing
The authors would like to thank Chuck Kahn, Kevin McEnery, and Brad Erickson for their work on compiling the RadCore database, and Daniel Rubin for his contribution to RadCore and for providing access to this database.