Weakly-Supervised Symptom Recognition for Rare Diseases in Biomedical Text

  • Pierre Holat
  • Nadi Tomeh
  • Thierry Charnois
  • Delphine Battistelli
  • Marie-Christine Jaulent
  • Jean-Philippe Métivier
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9897)


In this paper, we tackle the issue of symptom recognition for rare diseases in biomedical texts. Symptoms typically have a more complex and ambiguous structure than other biomedical named entities. Furthermore, existing resources are scarce and incomplete. Therefore, we propose a weakly-supervised framework based on a combination of two approaches: sequential pattern mining under constraints and sequence labeling. We use unannotated abstracts of biomedical papers, together with dictionaries of rare diseases and symptoms, to create our training data. Our experiments show that both approaches outperform a simple projection of the dictionaries onto the text, and that their combination is beneficial. We also introduce a novel pattern mining constraint based on semantic similarity between words inside patterns.
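The dictionary projection mentioned above, which serves both as the baseline and as the source of weak training labels, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `project_dictionary`, the greedy longest-match strategy, the lowercase matching, and the BIO tag scheme are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def project_dictionary(
    tokens: List[str],
    lexicon: Set[Tuple[str, ...]],
    label: str = "SYMPTOM",
) -> List[str]:
    """Tag tokens with BIO labels by greedy longest-match dictionary lookup.

    `lexicon` holds dictionary entries as tuples of lowercase tokens.
    This is a hypothetical helper, not code from the paper.
    """
    lowered = [t.lower() for t in tokens]
    max_len = max((len(entry) for entry in lexicon), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in lexicon:
                matched = n
                break
        if matched:
            tags[i] = "B-" + label
            for j in range(i + 1, i + matched):
                tags[j] = "I-" + label
            i += matched  # skip past the matched entry
        else:
            i += 1
    return tags


# Toy example with an invented sentence and a two-entry dictionary.
sentence = "The patient presented with muscle weakness and fever .".split()
symptoms = {("muscle", "weakness"), ("fever",)}
weak_labels = project_dictionary(sentence, symptoms)
```

Labels produced this way are noisy and incomplete, which is why they would only serve as weak supervision for a pattern miner or a sequence labeler rather than as a gold standard.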


Keywords: Information extraction, Pattern mining, CRF, Symptoms recognition, Biomedical texts



This work is supported by the French National Research Agency (ANR) as part of the project Hybride ANR-11-BS02-002 and the “Investissements d’Avenir” program (reference: ANR-10-LABX-0083).



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Pierre Holat (1, email author)
  • Nadi Tomeh (1)
  • Thierry Charnois (1)
  • Delphine Battistelli (2)
  • Marie-Christine Jaulent (3)
  • Jean-Philippe Métivier (4)
  1. LIPN, University of Paris 13, Sorbonne Paris Cité, Paris, France
  2. MoDyCo, University of Paris Ouest Nanterre La Défense, Paris, France
  3. Inserm, Paris, France
  4. GREYC, University of Caen Basse-Normandie, Caen, France
