Towards Using Scientific Publications to Automatically Extract Information on Rare Diseases


A small percentage of the population is afflicted by what is called an orphan or rare disease. There are several thousand such diseases worldwide. Taken together, the individuals affected amount to up to 10% of the US population. Research on these diseases is often poorly funded because the potential market for a treatment is small, which leaves patients and clinicians with very limited and scattered access to vital information. To help address this issue, we present in this paper a new software tool that automates the extraction of information related to rare diseases from scientific publications. More precisely, our contribution is a new method for automatically extracting the symptoms of these diseases from research papers, using a Named Entity Recognition (NER) algorithm based on the numerical statistic Term Frequency - Inverse Document Frequency (TF-IDF). The proposed tool has been tested on the PubMed Central (PMC) database.
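
To make the TF-IDF statistic mentioned above concrete, the following is a minimal sketch of the textbook definition (term frequency multiplied by the logarithm of the inverse document frequency). It is not the NER pipeline described in the paper: the function names `tokenize` and `tf_idf` and the three-sentence corpus are hypothetical and were invented for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus, invented for illustration (not taken from PMC).
corpus = [
    "patients with the disease present fever and muscle weakness",
    "the disease is rare and patients often report joint pain",
    "the study reviews funding for rare disease research",
]
scores = tf_idf(corpus)
```

In this toy corpus, words that occur in every document (such as "the" and "disease") receive a score of zero, while a word that occurs in only one document (such as "fever") scores higher than one shared by two documents (such as "patients"). This is the property that makes the statistic a plausible signal for picking out document-specific terms, such as candidate symptoms.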





This project was conducted with financial support from UQAC and the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information



Corresponding author

Correspondence to Charles Cousyn.


Cite this article

Cousyn, C., Bouchard, K., Gaboury, S. et al. Towards Using Scientific Publications to Automatically Extract Information on Rare Diseases. Mobile Netw Appl 25, 953–960 (2020).



  • Text mining
  • Rare disease
  • Named entity recognition
  • Knowledge aggregation
  • Symptoms