DL-VSM based document indexing approach for information retrieval

  • Kabil Boukhari
  • Mohamed Nazih Omri
Original Research


Textual information is constantly increasing, and as documents accumulate, satisfying user needs becomes increasingly complex. Several information retrieval systems have therefore been designed to respond to user requests. Document indexing is a crucial phase in information retrieval. The main contribution of this work is a novel hybrid approach for biomedical document indexing. We improve the estimation of the correspondence between a document and a given concept using two methods: the vector space model (VSM) and description logics (DL). VSM performs partial matching between document terms and the terms of an external resource. DL represents knowledge in a form suited to better matching. The proposed contribution reduces the limitations of exact matching: it indexes documents by exploiting the services of the Medical Subject Headings (MeSH) thesaurus with approximate matching. Approximate matching partially matches document terms against biomedical vocabularies to extract morphological variants present in that resource, but it also generates irrelevant concepts. A filtering step addresses this problem and selects the most important concepts by exploiting the knowledge provided by MeSH. Experiments carried out on different corpora show encouraging results (+25% improvement in average accuracy compared to other approaches in the literature).
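The partial-matching and filtering idea can be illustrated with a minimal sketch. It is not the authors' implementation: the concept entries, the trailing-"s" stemmer, the raw term-count weighting, and the 0.1 threshold are all assumptions chosen for illustration, and the description-logics filtering is replaced here by a plain score cutoff.

```python
import math
import re
from collections import Counter

# Hypothetical concept entries (illustrative only, not taken from MeSH).
CONCEPTS = {
    "Heart Failure": "heart failure cardiac failure",
    "Kidney Diseases": "kidney disease renal disease",
    "Asthma": "asthma bronchial asthma",
}


def crude_stem(token):
    """Toy stemmer (assumption): drop one trailing 's', but not 'ss'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def vectorize(text):
    """Turn text into a bag-of-stems vector with raw counts as weights."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_concepts(document, concepts, threshold=0.1):
    """Score every concept against the document and keep those above the
    threshold (a simple stand-in for the filtering step)."""
    doc_vec = vectorize(document)
    scored = [(name, cosine(doc_vec, vectorize(terms)))
              for name, terms in concepts.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    doc = "Patients with chronic kidney diseases often develop heart failure."
    for name, score in rank_concepts(doc, CONCEPTS):
        print(f"{name}: {score:.3f}")
```

Because matching is done on stems and scored by vector overlap, "diseases" in the document still contributes to the "Kidney Diseases" entry even though the exact string differs, while a concept sharing no stems with the document scores zero and is filtered out.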


Document indexing, Vector space model, Description logics, Partial matching, Stemming, Biomedical vocabulary




Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Authors and Affiliations

  1. MARS Research Laboratory, University of Sousse, Sousse, Tunisia
