An Improved Algorithm for Fast K-Word Proximity Search Based on Multi-component Key Indexes

  • Alexander B. Veretennikov
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1251)

Abstract

A search query consists of several words. In a proximity full-text search, we want to find documents that contain these words near each other. This task is time-consuming when the query consists of frequently occurring words. If we cannot sidestep the problem by declaring frequently occurring words to be stop words and excluding them from consideration, then we can speed up query execution by introducing additional indexes. In previous work, we discussed how to decrease the search time with multi-component key indexes. We showed that additional indexes can improve the average query execution time by up to 130 times for queries consisting of frequently occurring words. In this paper, we present another search algorithm that overcomes some limitations of our previous algorithm and provides an even greater performance gain.
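
To make the task concrete, the following is a minimal sketch of proximity search over an ordinary positional inverted index. It is a baseline illustration only, not the multi-component key index algorithm of the paper. The function names (`build_positional_index`, `min_window`, `proximity_search`), whitespace tokenization, and the sample documents are all assumptions made for this example. It hints at why frequently occurring words are expensive: their position lists are long, and every position has to be scanned.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} (an ordinary inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Assumption: naive lowercase whitespace tokenization.
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def min_window(position_lists):
    """Smallest span (max - min position) that covers one entry of each list."""
    # Merge all positions in sorted order, tagged with their source list.
    events = sorted((p, i) for i, plist in enumerate(position_lists) for p in plist)
    counts = [0] * len(position_lists)
    covered = 0
    best = None
    left = 0
    for pos, i in events:
        if counts[i] == 0:
            covered += 1
        counts[i] += 1
        # While every word is inside the window, record it and shrink from the left.
        while covered == len(position_lists):
            left_pos, left_i = events[left]
            span = pos - left_pos
            if best is None or span < best:
                best = span
            counts[left_i] -= 1
            if counts[left_i] == 0:
                covered -= 1
            left += 1
    return best


def proximity_search(index, query_words, max_distance):
    """Return (doc_id, span) for documents whose best window fits max_distance."""
    # Assumption: repeated query words are collapsed to a single occurrence.
    words = list(dict.fromkeys(w.lower() for w in query_words))
    if not words or any(w not in index for w in words):
        return []
    # Only documents that contain every query word can match.
    common = set(index[words[0]])
    for w in words[1:]:
        common &= set(index[w])
    hits = []
    for doc_id in sorted(common):
        span = min_window([index[w][doc_id] for w in words])
        if span is not None and span <= max_distance:
            hits.append((doc_id, span))
    return hits


docs = {
    1: "to be or not to be",
    2: "who are you and who am i to say",
    3: "you said who i am",
}
index = build_positional_index(docs)
hits = proximity_search(index, ["who", "you"], max_distance=2)
```

With the sample documents, `hits` holds the documents in which "who" and "you" occur at most two positions apart. The cost of `min_window` grows with the total length of the position lists, which is the part that additional indexes aim to reduce for frequently occurring words.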

Keywords

Full-text search; Search engines; Inverted indexes; Additional indexes; Proximity search; Term proximity; Information retrieval; Query processing; Document-At-A-Time; DAAT

Notes

Acknowledgement

The work was supported by Act 211 of the Government of the Russian Federation, contract no. 02.A03.21.0006.

Copyright information

© Springer Nature Switzerland AG 2021

Authors and Affiliations

  1. Chair of Calculation Mathematics and Computer Science, Ural Federal University, Yekaterinburg, Russia
