Two Applications of Statistical Modelling to Natural Language Processing
Each week the Columbia-Presbyterian Medical Center collects several megabytes of English text transcribed from radiologists’ dictation and notes of their interpretations of medical diagnostic x-rays. The goal is to automate the extraction of diagnoses from these natural language reports. This paper reports on two aspects of the project that require advanced statistical methods. First, the identification of pairs of words and phrases that tend to appear together (collocate) uses a hierarchical Bayesian model that adjusts to different word and word pair distributions in different bodies of text. Second, we present an analysis of data from experiments comparing the performance of the computer diagnostic program with that of a panel of physician and lay readers of randomly sampled texts. A measure of inter-subject distance with respect to the diagnoses is defined, for which estimated variances and covariances are easily computed. This allows statistical conclusions to be drawn about the similarities and dissimilarities among diagnoses by the various programs and experts.
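To make the idea of scoring collocated word pairs concrete, here is a minimal Python sketch. It does not implement the hierarchical Bayesian model the abstract describes. It uses a simpler, widely used substitute, the log-likelihood ratio (G²) statistic for a 2x2 contingency table of co-occurrence counts. The function names and the counts in the usage lines are hypothetical and chosen only for illustration.

```python
import math


def _xlogx(x):
    """Return x * ln(x), with the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of word pair counts.

    k11: contexts containing both words
    k12: contexts containing the first word but not the second
    k21: contexts containing the second word but not the first
    k22: contexts containing neither word
    Larger values indicate stronger departure from independence.
    """
    n = k11 + k12 + k21 + k22
    row_terms = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    col_terms = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    cell_terms = sum(_xlogx(k) for k in (k11, k12, k21, k22))
    # Algebraically equal to 2 * sum(O * ln(O / E)) with E = row * col / n.
    return 2.0 * (cell_terms - row_terms - col_terms + _xlogx(n))


# Hypothetical counts: an independent pair and a perfectly associated pair.
independent = log_likelihood_ratio(10, 10, 10, 10)
associated = log_likelihood_ratio(10, 0, 0, 10)
```

A statistic of this kind treats every corpus the same way. The model described in the abstract, by contrast, is said to adjust to differing word and word pair distributions across bodies of text, which this sketch does not attempt.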
Keywords: Chronic Obstructive Pulmonary Disease; Natural Language Processing; Word Pair; Clinical Information System; Hierarchical Bayesian Model