Query expansion based on clustering and personalized information retrieval

  • Regular Paper
  • Published in: Progress in Artificial Intelligence

Abstract

Information retrieval systems cover a variety of processes that deliver information to the people who need it. Several mathematical approaches have been studied to formalize the main components of an information retrieval system (query representation, information item representation and the retrieval process), yet such systems still struggle to extract relevant information for users, especially when the processed data are texts. This is due to the complex nature of text databases. Generally, an information retrieval system reformulates queries according to associations among information items before matching them to dataset items. In this sense, semantic relationships or machine learning techniques can be applied to refine the returned results. This paper presents a formal model to organize data and a new search algorithm to browse it. The approach incorporates a natural language preprocessing stage, a statistical representation of short documents and queries, and a machine learning model to select relevant results. We then propose two further optimizations that returned satisfying results on two datasets in a reasonable computation time. The first optimization concerns query expansion, while the second concerns dataset restructuring. We formally evaluate the impact of each optimization by computing the performance of the information retrieval system with and without it; the highest recall and precision reached were 96.2% and 99.2%, respectively.
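The evaluation protocol described in the abstract (measuring precision and recall with and without an optimization) can be illustrated with a minimal sketch. This is not the authors' model: the term-overlap retrieval, the `expand` helper, and the toy documents, clusters and relevance judgments are all hypothetical and chosen only to show how the two measurements are compared.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def retrieve(query_terms, docs):
    """Toy retrieval: keep documents sharing at least one term with the query."""
    terms = set(query_terms)
    return [doc_id for doc_id, text in docs.items()
            if terms & set(text.lower().split())]


def expand(query_terms, clusters):
    """Hypothetical expansion: add terms that share a cluster with a query term."""
    original = set(query_terms)
    expanded = set(original)
    for cluster in clusters:
        if original & cluster:
            expanded |= cluster
    return expanded


# Toy data (assumed for illustration, not taken from the paper's datasets).
DOCS = {
    "d1": "jazz piano trio",
    "d2": "bebop saxophone quartet",
    "d3": "stock market report",
}
CLUSTERS = [{"jazz", "bebop", "swing"}, {"stock", "finance"}]
RELEVANT = {"d1", "d2"}

query = ["jazz"]
baseline = precision_recall(retrieve(query, DOCS), RELEVANT)
expanded = precision_recall(retrieve(expand(query, CLUSTERS), DOCS), RELEVANT)

print("without expansion:", baseline)
print("with expansion:   ", expanded)
```

In this toy setting, expansion raises recall because the query reaches a document that uses a related term; on real data, expansion can also lower precision, which is why both measures are reported.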

Notes

  1. Yahoo! Webscope dataset ydata-ymusic-user-artist-ratings-v1_0 [http://research.yahoo.com/Academic_Relations].

Author information

Corresponding author

Correspondence to Hamid Khalifi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khalifi, H., Cherif, W., Qadi, A.E. et al. Query expansion based on clustering and personalized information retrieval. Prog Artif Intell 8, 241–251 (2019). https://doi.org/10.1007/s13748-019-00178-y

  • DOI: https://doi.org/10.1007/s13748-019-00178-y
