Towards Using Scientific Publications to Automatically Extract Information on Rare Diseases

  • Charles Cousyn
  • Kévin Bouchard
  • Sébastien Gaboury
  • Bruno Bouchard


A small percentage of the population is afflicted by what is called an orphan or rare disease. Several thousand such diseases are known worldwide. Taken together, the individuals affected amount to up to 10% of the US population. Scientific work on these diseases is often poorly financed because of the lack of potential markets for a treatment, which leaves patients and clinicians with very limited and scattered access to vital information. To help address this issue, we present in this paper a new software tool for automating the extraction of information related to rare diseases from scientific publications. More precisely, our contribution is a new method for automatically extracting the symptoms of these diseases from research papers, using a Named Entity Recognition (NER) algorithm based on the Term Frequency - Inverse Document Frequency (TF-IDF) numerical statistic. The proposed tool has been tested using the PubMed Central (PMC) database.
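To make the TF-IDF statistic named above concrete, the following is a minimal sketch of the standard textbook weighting (term frequency multiplied by the logarithm of the inverse document frequency). It is not the authors' implementation: the function name `tf_idf`, the whitespace tokenization, and the toy corpus are assumptions made purely for illustration, and the NER step built on top of the scores is not shown.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document.

    Hypothetical helper: uses the plain tf * log(N / df) weighting,
    which may differ from the variant used in the paper.
    """
    n_docs = len(documents)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus invented for illustration; it is not taken from PMC.
corpus = [
    "patients present with ataxia and seizures".split(),
    "patients present with fever".split(),
    "patients with chronic fever and fatigue".split(),
]

scores = tf_idf(corpus)

# Terms found in every document score 0; terms specific to one document rank highest.
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)[:2]
```

In this toy corpus, words shared by all three documents (such as "patients") receive a score of zero, while words that occur in only one document (such as "ataxia") receive the highest scores. That property is what makes the statistic a plausible signal for picking out disease-specific terms.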


Text mining, Rare disease, Named entity recognition, Knowledge aggregation, Symptoms



This project was conducted with financial support from UQAC and the Natural Sciences and Engineering Research Council of Canada (NSERC).

Supplementary material

11036_2019_1237_MOESM1_ESM.png (PNG, 142 KB)
11036_2019_1237_MOESM2_ESM.png (PNG, 295 KB)
11036_2019_1237_MOESM3_ESM.png (PNG, 749 KB)
11036_2019_1237_MOESM4_ESM.png (PNG, 84.4 KB)
11036_2019_1237_MOESM5_ESM.png (PNG, 67.9 KB)
11036_2019_1237_MOESM6_ESM.png (PNG, 53.8 KB)



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. LIARA Laboratory, Université du Québec à Chicoutimi, Chicoutimi, Canada
