Linguistic and Statistically Derived Features for Cause of Death Prediction from Verbal Autopsy Text

Danso, Samuel; Atwell, Eric; Johnson, Owen

doi:10.1007/978-3-642-40722-2_5

Samuel Danso^22,23,
Eric Atwell^22,23 &
Owen Johnson²³

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 8105))

1326 Accesses
6 Citations

Abstract

Automatic Text Classification (ATC) is an emerging technology with economic importance given the unprecedented growth of text data. This paper reports on work in progress to develop methods for predicting Cause of Death from Verbal Autopsy (VA) documents recommended for use in low-income countries by the World Health Organisation. VA documents contain both coded data and open narrative. The task is formulated as a Text Classification problem and explores various combinations of linguistic and statistical approaches to determine how these may improve on the standard bag-of-words approach using a dataset of over 6400 VA documents that were manually annotated with cause of death. We demonstrate that a significant improvement of prediction accuracy can be obtained through a novel combination of statistical and linguistic features derived from the VA text. The paper explores the methods by which ATC may leads to improved accuracy in Cause of Death prediction.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 49.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

World Health Organization: WHO Handbook for Reporting Results of Cancer Treatments (WHO Offset Publication No. 48) (2004)
Google Scholar
Kahn, K., Tollman, S.M., Garenne, M., Gear, J.S.: Validation and application of verbal autopsies in a rural area of South Africa. Tropical Medicine & International Health 5(11), 824–831 (2000)
Article Google Scholar
Byass, P., Kathleen, K., Edward, F., Mark, A.C., Stephen, M.T.: Moving from Data on Deaths to Public Health Policy in Agincourt, South Africa: Approaches to Analysing and Understanding Verbal Autopsy Findings. PLoS Medicine 7(8) (2010)
Google Scholar
King, G., Lu, Y., Shibuya, K.: Designing verbal autopsy studies. Population Health Metrics 8(1), 19 (2010)
Article Google Scholar
Byass, P., Edward, F., Dao Lan, H., Yamene, B., Tumani, C., Kathleen, K., Lulu, M.: Refining a probabilistic model for interpreting verbal autopsy data. Scandinavian Journal of Public Health 34(1), 26–31 (2006)
Article Google Scholar
Murray, C.J.L., Alan, D.L., Dennis, F., Shannon, T.P., Gonghuan, Y.: Validation of the symptom pattern method for analyzing verbal autopsy data. PLOS Medicine 4, 1739–1753 (2007)
Google Scholar
Soleman, N., Chandramohan, D., Shibuya, K.: WHO Technical Consultation on Verbal Autopsy Tools, Geneva (2005)
Google Scholar
Pakhomov, S., Shah, N., Hanson, P., Balasubramaniam, S., Smith, S.: Automatic quality of life prediction using electronic medical records. American Medical Informatics Association (2008)
Google Scholar
Pakhomov, S., Weston, S.A., Jacobsen, S.J., Chute, C.G., Meverden, R., Roger, V.L.: Electronic medical records for clinical research: application to the identification of heart failure. The American Journal of Managed Care 13(6 Part 1), 281 (2007)
Google Scholar
Cohen, A.M., Hersh, W.R.: A survey of current work in biomedical text mining. Briefings in Bioinformatics 6(1), 57–71 (2005)
Article Google Scholar
Cohen, A.M.: An effective general purpose approach for automated biomedical document classification. In: AMIA Annual Symposium Proceedings. American Medical Informatics Association (2006)
Google Scholar
Nikfarjam, A., Gonzalez, G.H.: Pattern mining for extraction of mentions of adverse drug reactions from user comments. In: AMIA Annual Symposium Proceedings. American Medical Informatics Association (2011)
Google Scholar
Leaman, R., Wojtulewicz, L., Sullivan, R., Skariah, A., Yang, J., Gonzalez, G.: Towards internet-age pharmacovigilance: extracting adverse drug reactions from user posts to health-related social networks. Association for Computational Linguistics (2010)
Google Scholar
Gamon, M.: Sentiment classification on customer feedback data: noisy data, large feature vectors, and the role of linguistic analysis. In: Proceedings of the 20th International Conference on Computational Linguistics, p. 841. Association for Computational Linguistics, Geneva (2004)
Google Scholar
Oberlander, J., Nowson, S.: Whose thumb is it anyway?: classifying author personality from weblog text. In: Proceedings of the COLING/ACL on Main Conference Poster Sessions. Association for Computational Linguistics (2006)
Google Scholar
Turney, P.D.: Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 417–424. Association for Computational Linguistics, Philadelphia (2002)
Google Scholar
Pang, B., Lee, L.: Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In: Annual Meeting-Association For Computational Linguistics (2005)
Google Scholar
Danso, S., Atwell, E.S., Johnson, O., ten Asbroek, A., Soromekun, S., Edmond, K., Hurt, C., Hurt, L., Zandoh, C., Tawiah, C., Fenty, J., Etego, S., Agyei, S., Kirkwood, B.: A semantically annotated Verbal Autopsy corpus for automatic analysis of cause of death. ICAME Journal of the International Computer Archive of Modern English 37 (in press, 2013)
Google Scholar
Francis, W.N., Kucera, H.: Brown corpus manual. Letters to the Editor 5(2), 7 (1979)
Google Scholar
Scott, S., Matwin, S.: Text classification using WordNet hypernyms. In: Use of WordNet in Natural Language Processing Systems: Proceedings of the Conference (1998)
Google Scholar
Forman, G.: A pitfall and solution in multi-class feature selection for text classification. ACM (2004)
Google Scholar
Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. Lawrence Erlbaum Associates Ltd. (1995)
Google Scholar
Danso, S., Atwell, E.S., Johnson, O.: A Comparative Study of Machine Learning Methods for Verbal Autopsy Text Classification. International Journal of Computer Science Issues 10 (in press)
Google Scholar
Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann (2005)
Google Scholar
Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, vol. 10, Association for Computational Linguistics (2002)
Google Scholar
Loper, E., Bird, S.: NLTK: the Natural Language Toolkit. In: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, vol. 1, pp. 63–70. Association for Computational Linguistics, Philadelphia (2002)
Chapter Google Scholar
Wilks, Y., Stevenson, M.: Word sense disambiguation using optimised combinations of knowledge sources. In: Proceedings of the 17th International Conference on Computational Linguistics, vol. 2. Association for Computational Linguistics (1998)
Google Scholar
Moschitti, A., Basili, R.: Complex linguistic features for text classification: A comprehensive study. In: McDonald, S., Tait, J.I. (eds.) ECIR 2004. LNCS, vol. 2997, pp. 181–196. Springer, Heidelberg (2004)
Chapter Google Scholar
Matsumoto, S., Takamura, H., Okumura, M.: Sentiment classification using word sub-sequences and dependency sub-trees. In: Ho, T.-B., Cheung, D., Liu, H. (eds.) PAKDD 2005. LNCS (LNAI), vol. 3518, pp. 301–311. Springer, Heidelberg (2005)
Chapter Google Scholar
Scott, S., Matwin, S.: Feature engineering for text classification. In: Machine Learning-International Workshop Conference (1999)
Google Scholar
Harris, Z.S.: Methods in structural linguistics (1951)
Google Scholar
McKeown, K.R., Radev, D.R.: Collocations. Handbook of Natural Language Processing. Marcel Dekker (2000)
Google Scholar
Pearce, D., Qh, B.: Using conceptual similarity for collocation extraction. In: Proceedings of the Fourth Annual CLUK Colloquium (2001)
Google Scholar
Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational. Linguistics 19(1), 61–74 (1993)
Google Scholar
Seretan, V., Nerima, L., Wehrli, E.: Extraction of multi-word collocations using syntactic bigram composition. In: Proceedings of the Fourth International Conference on Recent Advances in NLP, RANLP-2003 (2003)
Google Scholar
Pearce, D.: A comparative evaluation of collocation extraction techniques. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, LREC 2002 (2002)
Google Scholar

Download references

Author information

Authors and Affiliations

School of Computing, Language Research Group, University of Leeds, UK
Samuel Danso & Eric Atwell
Yorkshire Centre for Health Informatics, eHealth Research Group, University of Leeds, UK
Samuel Danso, Eric Atwell & Owen Johnson

Authors

Samuel Danso
View author publications
You can also search for this author in PubMed Google Scholar
Eric Atwell
View author publications
You can also search for this author in PubMed Google Scholar
Owen Johnson
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Technical University Darmstadt, 64289 Darmstadt, Germany, and German Institute for International Education Research,, 60486, Frankfurt, Germany
Iryna Gurevych
Technical University Darmstadt, 64289, Darmstadt, Germany
Chris Biemann
Technical University Darmstadt, 64289 Darmsadt, and German Institute for International Educational Research, 60486, Frankfurt, Germany
Torsten Zesch

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Danso, S., Atwell, E., Johnson, O. (2013). Linguistic and Statistically Derived Features for Cause of Death Prediction from Verbal Autopsy Text. In: Gurevych, I., Biemann, C., Zesch, T. (eds) Language Processing and Knowledge in the Web. Lecture Notes in Computer Science(), vol 8105. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40722-2_5

Download citation

DOI: https://doi.org/10.1007/978-3-642-40722-2_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40721-5
Online ISBN: 978-3-642-40722-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics