Abstract
Digital resource objects (DROs) are among the most valuable resources that store the accumulated knowledge of humankind, and many organisations now aim to make these resources available to users. The Dirichlet smoothing (DS) model is widely used to retrieve DRO documents. The DS model relies on a smoothing parameter μ, which largely determines the probability assigned to unseen terms and thereby avoids zero probability values. The value of μ is typically fixed as a constant, as it would be for documents of equal length, even though the appropriate value depends on the length of a document. In DROs, almost all documents differ in length, and each metadata unit within a document also has a different length. Hence, it is not appropriate to predefine μ as a constant and use it across different search spaces; doing so makes DRO documents harder to access and retrieve. To solve the fixed smoothing-parameter problem in DRO retrieval and make DROs more accessible, the Adaptive Dirichlet Smoothing (ADS) and Adaptive Structured Dirichlet Smoothing (ASDS) models are proposed; they improve DRO retrieval performance by estimating the smoothing parameter automatically. The proposed ASDS model comprises the ADS model together with an existing DS model. Experimental results on the CHiC2013 collections show that, compared with state-of-the-art traditional methods, the proposed models retrieve the most relevant results (documents or metadata units) for a given query and reduce the number of zero-probability values, particularly on DROs. Moreover, a t-test shows that the performance of the proposed models is statistically significant.
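To make the role of μ concrete, the following is a minimal sketch of the standard textbook Dirichlet-smoothed estimate, p(w|d) = (c(w,d) + μ·p(w|C)) / (|d| + μ). The abstract does not state this formula, so it is given here as the commonly used form of DS, not as the proposed ADS or ASDS models. The function name `dirichlet_prob`, the toy documents, and the two μ values are hypothetical and chosen only for illustration.

```python
from collections import Counter


def dirichlet_prob(term, doc_tokens, collection_counts, collection_len, mu):
    """Standard Dirichlet-smoothed estimate of p(term | document)."""
    tf = doc_tokens.count(term)                            # c(w, d)
    p_coll = collection_counts[term] / collection_len      # p(w | C)
    return (tf + mu * p_coll) / (len(doc_tokens) + mu)


# Toy collection (hypothetical data): one short and one long metadata unit.
short_doc = "roman coin".split()
long_doc = ("roman coin silver denarius found near the river bank "
            "during the excavation of an ancient roman villa").split()
collection = short_doc + long_doc
counts = Counter(collection)
total = len(collection)

# "villa" does not occur in short_doc, so only smoothing gives it mass there.
for mu in (10, 2000):
    for name, doc in (("short", short_doc), ("long", long_doc)):
        p = dirichlet_prob("villa", doc, counts, total, mu)
        weight = mu / (len(doc) + mu)  # share of the estimate taken from the collection
        print(f"mu={mu:5d} {name:5s} p(villa|d)={p:.4f} collection weight={weight:.3f}")
```

With a single fixed μ, the collection weight μ / (|d| + μ) differs between the short and the long unit, which is the length dependence the abstract describes. Setting μ to 0 removes the smoothing entirely, so the unseen term receives zero probability in the short unit.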
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Alma’aitah, W.Z., Talib, A.Z. & Osman, M.A. Towards adaptive structured Dirichlet smoothing model for digital resource objects. Multimed Tools Appl 80, 12175–12194 (2021). https://doi.org/10.1007/s11042-020-10305-w
DOI: https://doi.org/10.1007/s11042-020-10305-w