Towards Using Scientific Publications to Automatically Extract Information on Rare Diseases

Abstract

A small percentage of the population is afflicted by what is called an orphan or rare disease. Several thousand such diseases are known worldwide. Taken together, the individuals affected amount to up to 10% of the US population. Research on these diseases is often poorly funded because the potential market for a treatment is small, which leaves patients and clinicians with very limited and scattered access to vital information. To help address this issue, we present in this paper a new software tool that automates the extraction of information related to rare diseases from scientific publications. More precisely, our contribution is a new method for automatically extracting the symptoms of these diseases from research papers, using a Named Entity Recognition (NER) algorithm based on the Term Frequency - Inverse Document Frequency (TF-IDF) numerical statistic. The proposed tool has been tested on the PubMed Central (PMC) database.
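
To make the TF-IDF idea concrete, the following is a minimal sketch and not the authors' implementation. It scores each term of a document by its frequency in that document weighted by how rare the term is across the corpus, then keeps the top-scoring terms that also appear in a symptom lexicon. The toy documents, the lexicon, and the names `tokenize`, `tf_idf`, and `candidate_symptoms` are all assumptions made for illustration; the abstract does not specify the weighting variant, the tokenizer, or how candidate terms are filtered.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    (One common variant; the paper may use a different one.)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def candidate_symptoms(doc_scores, lexicon, top_k=5):
    """Keep the top_k highest-scoring terms that also appear in the lexicon."""
    ranked = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:top_k] if term in lexicon]


# Hypothetical toy corpus and symptom lexicon (not data from the paper).
docs = [
    "Patients with the disease present with ataxia and progressive ataxia of gait.",
    "The disease causes seizures in most patients.",
    "Patients report fatigue, and the disease progresses slowly.",
]
lexicon = {"ataxia", "seizures", "fatigue"}

scores = tf_idf(docs)
print(candidate_symptoms(scores[0], lexicon))  # ['ataxia']
```

Terms that occur in every document, such as "disease" here, receive a score of zero, which is why TF-IDF tends to surface terms specific to one publication rather than vocabulary shared by the whole corpus.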

Notes

  1. https://rarediseases.info.nih.gov/diseases/pages/31/faqs-about-rare-diseases

  2. https://sites.google.com/site/bionlpst/home/geniaevent-extraction-genia

Acknowledgements

This project was conducted with the financial support of UQAC and the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information

Corresponding author

Correspondence to Charles Cousyn.

About this article

Cite this article

Cousyn, C., Bouchard, K., Gaboury, S. et al. Towards Using Scientific Publications to Automatically Extract Information on Rare Diseases. Mobile Netw Appl 25, 953–960 (2020). https://doi.org/10.1007/s11036-019-01237-3

Keywords

  • Text mining
  • Rare disease
  • Named entity recognition
  • Knowledge aggregation
  • Symptoms