Lexical Acquisition for Clinical Text Mining Using Distributional Similarity

  • John Carroll
  • Rob Koeling
  • Shivani Puri
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7182)


We describe experiments into the use of distributional similarity for acquiring lexical information from clinical free text, in particular notes typed by primary care physicians (general practitioners). We also present a novel approach to lexical acquisition from ‘sensitive’ text, which does not require the text to be manually anonymised – a very expensive process – and therefore allows much larger datasets to be used than would normally be possible.


Free Text; General Practice Research Database; Read Code; Parallel Corpus; Natural Language Processing Technique
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • John Carroll (1)
  • Rob Koeling (1)
  • Shivani Puri (2)
  1. Department of Informatics, University of Sussex, Brighton, UK
  2. GPRD, London, UK
