Utilising Information Foraging Theory for User Interaction with Image Query Auto-Completion
Query Auto-completion (QAC) is a prominently used feature in search engines, where user interaction is facilitated by automatically suggesting queries based on a prefix typed by the user. Existing QAC models have paid little attention to user interaction and cannot capture the context of a user’s information need (IN). In this work, we devise a new task of QAC applied to an image, estimating patch (one of the key components of Information Foraging Theory) probabilities for query suggestion. Our work supports query completion by extending a user query prefix (one or two characters) to a complete query utilising a foraging-based probabilistic patch selection model. We present iBERT, a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model, which leverages combined textual-image queries to address image QAC by computing probabilities over a large set of image patches. The resulting patch probabilities are used for selection while being agnostic to changing information needs or contextual mechanisms. Experimental results show that query auto-completion using both natural language queries and images is more effective than using only language-level queries. Also, our fine-tuned iBERT model allows patches in the image to be ranked efficiently.
Keywords: Query auto completion · Interactive information retrieval · Information Foraging Theory
Query auto-completion (QAC) is the action of suggesting full queries once the user starts typing a prefix of a few characters, which eases query composition. It is also termed (dynamic) query suggestion, query completion and real-time query expansion. Popular features such as QAC make people more dependent on search engines to find relevant information. However, such features can let users express their queries only ambiguously, leaving them too vague to be fully interpreted by search engines. This makes query auto-completion a bottleneck in the usability of search engines. Also, users often go through several rounds of search, reformulating their queries to better match their information needs once they find some relevant results. Past work [6, 20] demonstrated the use of information scent to model users’ information need during web search, and it has been used to understand the factors affecting search and what leads a user to stop searching. Despite these observations, the exploitation of information scent (from Information Foraging Theory) is under-explored for ambiguous queries and has not been extended to take an image into account in query expansion (or suggestion) tasks. For users’ convenience, current search engines generally provide query suggestions that help them describe their queries more explicitly. Such suggestions have been explored extensively in query auto-completion tasks, especially the traditional approach known as Most Popular Completion (MPC), which in the extreme case cannot anticipate a query it has never seen before. These solutions were further improved by recent semantically driven models [23, 24] and neural approaches, which are the current state of the art in QAC. In addition, most language embedding models have obtained strong results on multiple benchmarks for understanding the polarity of word compositions.
Unsupervised pre-trained natural language embeddings [7, 21] successfully model long-term dependencies by predicting masked terms and assessing whether sentences follow one another, and have shown strong results on several natural language processing and information retrieval tasks. Recent advances in sequence models have been adapted to expand a prefix to full text and index, but despite this progress they have not been generalised to take an image into account. Also, deep neural networks are now mature enough to segment regions within an image [9, 10].
To the best of our knowledge, we are the first to present a method for image query auto-completion where a user query prefix is adapted upon an image.
We elaborate the analogy of query auto-completion based on Information Foraging Theory and propose an explainable strategy for the observed challenges of query formulation and users’ varying information needs.
We propose iBERT inspired by  to compute probabilities of patches and rank them efficiently in the image.
2 Related Work
This section gives a brief overview of query auto-completion, image search suggestion, Information Foraging Theory and the BERT pre-trained language embedding model. We investigate the latter approach experimentally in the following sections.
Query Auto-Completion: Query auto-completion is an important component of information retrieval systems, allowing them to predict the next character (or query item) as soon as the user presses the first key. Predictions in IR systems are generally driven by query logs (or query history), which record the actual queries that users previously entered while trying to satisfy their information need [14, 37]. Earlier work introduced a method called NearestCompletion, which addresses “context”, i.e. the user’s preceding queries, in suggestion-based IR systems. The MPC mechanism proposed by the same authors relies on the overall popularity of the queries matching the given prefix. Other work studies user reformulation behaviour by leveraging textual features, whereas personalised query auto-completion was found to be more effective when utilising a user’s long-term search logs and locations as well as both context-based textual features and demographic features. More recent advances in QAC use neural language models based on recurrent neural networks, which improve performance on previously unseen queries. A generalised and adaptable language model for personalised QAC has also been introduced. We extend this adaptable language model to query completion in an image search scenario in the following section.
Query Suggestion in Image Search: Query suggestion and query completion differ in their end goal: the former outputs a list of ranked queries for an input query, whereas the latter outputs queries whose first few characters (or text) match the user’s input. Recent work introduced a learning-based personalised framework for query suggestion which uses both visual and textual queries and relies on users’ click-through data. A new paradigm of attention-based mechanisms for referring expressions in image segmentation has been proposed, which contains a keyword-aware network and a query attention model that relates various image regions to a given query. Inspired by the idea of attention models, we modify this mechanism for patch alignments within images via information scent in the following section.
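To make the Most Popular Completion idea concrete, the following is a minimal sketch, assuming the query log is simply a list of past query strings. The function name `most_popular_completion` and the toy log are illustrative and not taken from the cited work.

```python
from collections import Counter


def most_popular_completion(prefix, query_log, k=3):
    """Return up to k logged queries that start with `prefix`, most frequent first."""
    counts = Counter(q for q in query_log if q.startswith(prefix))
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [query for query, _ in ranked[:k]]


# Hypothetical query log, for illustration only.
query_log = [
    "white keyboard", "white keyboard", "white cat",
    "wooden desk", "white cat", "white keyboard",
]

print(most_popular_completion("wh", query_log))  # ['white keyboard', 'white cat']
print(most_popular_completion("zz", query_log))  # []
```

The second call illustrates the limitation of purely log-based completion: a prefix that never occurred in the log yields no suggestion at all.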
Information Foraging Theory: Information Foraging Theory (IFT) is a theoretical framework for understanding information access behaviour, derived from optimal foraging theory in ecology and applied to how humans access information. IFT rests on three models, namely the information scent model, the information patch model and the information diet model, which can illustrate users’ search preferences and behaviours: (1) The information within a certain environment is scattered in the form of patches (images, text snippets, documents) consisting of information features (colors, words); this refers to the information patch model. (2) A user can go from one patch to another via a cue (e.g., typing a query by following perceptual or heuristic cues) that meets the user’s information need. The goal of such cues is to characterise the content that will be encountered by following the links; this refers to the information scent model. (3) Different types of information sources vary in their information access costs. Users assess the information sources based on information gain per unit cost, or profitability, and then narrow or expand the diversity of information sources accordingly. This user behaviour refers to the information diet model.
One of the main IFT concepts is the information patch. For instance, sections and their associated features in search engine results can be considered patches. From a foraging perspective in image search, the searcher is the predator (or forager), and the information patch is any segment or region within an image (or the image itself) in a given information environment. The piece of information a user is looking for is the prey, and the consumed (or gained) information is the information diet. Something on the user interface that tells users where they should look next is referred to as a cue of the information scent.
Language Embeddings: Nowadays, many information retrieval and natural language processing tasks rely on language embeddings, such as word2vec, GloVe and fastText. They use vector word embeddings to transform the discrete space of human language into a continuous space, which is then usually processed further by a neural network. In query auto-completion, embeddings have been employed for distributed representation of queries based on a convolutional latent semantic model. Word embeddings have also been used to compute query similarity for query auto-completion, incorporating the features into the Most Popular Completion model. Very recent work introduced a pre-trained deep language model known as BERT, which has shown promising results on several IR and natural language processing tasks. However, how to leverage such pre-trained language models for QAC is still not well explored, and it poses certain challenges regarding both the task and training. Based on this work, we describe our proposed BERT-based model for computing patch probabilities in the following section.
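The information diet model's notion of profitability (information gain per unit cost) can be illustrated with a small sketch. The `Patch` class, its fields and the numeric values are hypothetical and chosen only to show the ranking; the text does not prescribe how gain or cost are estimated.

```python
from dataclasses import dataclass


@dataclass
class Patch:
    name: str
    gain: float  # assumed estimate of the information gained from the patch
    cost: float  # assumed cost (e.g. time or effort) of accessing the patch


def rank_by_profitability(patches):
    """Order patches by gain per unit cost, most profitable first."""
    return sorted(patches, key=lambda p: p.gain / p.cost, reverse=True)


# Illustrative image regions with made-up gain and cost values.
patches = [
    Patch("couch", gain=3.0, cost=2.0),     # profitability 1.5
    Patch("keyboard", gain=2.0, cost=1.0),  # profitability 2.0
    Patch("mouse", gain=0.5, cost=1.0),     # profitability 0.5
]

print([p.name for p in rank_by_profitability(patches)])
# ['keyboard', 'couch', 'mouse']
```

Under this reading, a forager with a narrow diet would keep only the top-ranked patches, while a broader diet admits patches with lower profitability.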
3 Our Model
An overview of the proposed end-to-end architecture is shown in Fig. 2. The user types a query prefix to autocomplete for the given image, and we perform image feature extraction using a pre-trained Convolutional Neural Network (CNN). Then, we feed the image features together with the query prefix into the extended Long Short-Term Memory (LSTM) language model, which has a context-dependent weight matrix with an adaptation matrix constructed from a context-driven embedding model. These two inputs, visual features from the image and the textual query, are used to complete the query. The completed query is then passed to iBERT (our fine-tuned BERT language embedding model) to compute the patch probabilities, which in turn are utilised for patch selection. More details are provided in the following subsections.
3.1 Image Query Auto Completion
3.2 iBERT - BERT for Patch Probability
We describe our approach to computing the probability of image patches, which addresses an important aspect of query auto-completion systems. We assume that during the search process, users are typically interested in some part of the image as well as the image itself if it matches the mental picture they have in mind. Our work focuses on a new perspective of query auto-completion on images, and the proposed model finds image patches which match the user context based on the query prefix using Eq. (1). BERT (Bidirectional Encoder Representations from Transformers) shows promising results in multiple natural language processing and information retrieval tasks and is presently the state-of-the-art embedding model. We propose to fine-tune the BERT model as a transfer learning task for patch selection, using images composed of several patches (regions of an image), hence the name iBERT. To the best of our knowledge, BERT has not yet been explored for the QAC task. We use the twelve-layer implementation of the BERT embedding model and extend it by adding a dense layer with 10% dropout, which is then mapped to the final pooled layer connected to the object classes and which outputs patch probabilities, as shown in Fig. 2.
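As a rough illustration of the last step, the sketch below turns per-class scores into patch probabilities and ranks them. It assumes that the final layer emits one raw score (logit) per patch class and that a softmax normalises them; the text does not state the normalisation, so this is an assumption, and the patch names and scores are made up.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_patches(patch_logits, top_k=2):
    """Return the top_k (patch, probability) pairs, most probable first."""
    names = list(patch_logits)
    probs = softmax([patch_logits[name] for name in names])
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical scores for a completed query such as "white keyboard on the desk".
patch_logits = {"keyboard": 2.0, "desk": 1.0, "couch": -1.0}

for name, prob in rank_patches(patch_logits):
    print(f"{name}: {prob:.3f}")
```

In a real model the logits would come from the dense layer on top of the pooled BERT output; here they are supplied by hand so that the example stays self-contained.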
3.3 Information Foraging Explanation
Our goal in using Information Foraging Theory from a cognitive viewpoint is to find explanations for the observed behaviour in query auto-completion and to model the information need within query sessions. IFT postulates that human information seekers follow an information scent to navigate from one information region to another in an information environment that is inherently patchy, and from one information patch to another within a region. IFT implies that foragers adapt their behaviour to the structure of the information environment in which they operate, such that the entire system (encompassing the information seeker, the information environment, and the interactions between the two) tries to maximise the ratio of the expected value of the information gained to the total cost of the interaction. Following the IFT analogy, when users start typing a prefix to auto-complete, their perceptual cues (such as mental beliefs) lead them either to type the next character or to pick a suggestion provided under the query field, which acts as a distal cue and visually prompts the user to follow it. From an IFT perspective, query auto-completion as query-level user interaction is initiated by the user typing as little as a single-character query prefix. The user may then follow a suggestion if a completion is generated (which again follows the strategy described above). If the query prefix is unknown to the system (e.g. because it is entered for the first time), the information scent associated with a result might be too poor to immediately infer information needs. In this case we apply beam search to generate the query based on image features. Suggestions are based on information scent values as described in the following subsection. These query suggestions represent the diversity of information scent patterns, which elicits a varied distribution of relevant queries in the search field.
We use two well-known and diverse datasets: Visual Genome, a visual dataset with large-scale knowledge bases that provides a rich collection of language annotations for visual concepts, with over 100k images where most image categories fall within a long tail, and the ReferIt dataset, which contains \(\sim \)42k image regions with descriptions. These two datasets fit our tasks well. The Visual Genome dataset includes images, region descriptions, question-answers, objects, relationships, and attributes. The region descriptions serve as a substitute for queries, as they refer to several objects in various regions of every image. Some region descriptions are referring phrases, while others are closer to plain descriptions. For example, referring descriptions are “guy sitting on the couch” and “white keyboard on the desk”, and non-referring descriptions are “couch is brown” and “mouse is in the charger”. The large number of instances in the Visual Genome dataset makes it convenient for our task. The ReferIt dataset is a collection of referring expressions attached to images which closely resemble probable user queries about images. We separately train models for query auto-completion and patch selection using both datasets.
We combine query and image as pairs by utilising the region descriptions from the Visual Genome dataset and the referring expressions from the ReferIt dataset. For training, we take 85% of the Visual Genome data as the training set, which consists of 16,000 images and 740,000 corresponding region descriptions, with approximately 40–45 text descriptions per image. The training data from the ReferIt dataset consists of 9,000 images and 54,000 referring expressions, with approximately 4–6 referring expressions per image.
For the query auto-completion task, we train our extended LSTM language model, where the dimension of the image representation is 128 and \(r = 64\) is the rank of the matching personalised matrix (component from Fig. 2). We use character embeddings with dimension 24, LSTM hidden units of dimension 512, and a maximum length of 50 characters per query, with the Adam optimizer at a learning rate of 5e-4 for 50,000 iterations and a batch size of 32. For the patch selection task, we train our proposed iBERT model using pairs of (region description, patch set) from the Visual Genome dataset, giving rise to a training set of approximately 1.73 million samples. The extra 0.3 million samples are split into test and validation sets. We train the patch selection model, which fine-tunes the twelve-layer BERT, with a batch size of 32 for 250,000 iterations using Adam as the optimizer at a learning rate of 5e-5; performance increases steeply over the initial 10% of iterations. We use an NVIDIA Tesla T4 GPU, on which the complete training takes a day and a half.
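The beam search step can be sketched as follows. This is a toy character-level version: `next_char_probs` is a hypothetical stand-in for the language model that estimates next-character probabilities from a tiny corpus, and it does not use image features, which the actual model conditions on. The `END` marker, the corpus and the beam width are assumptions for illustration.

```python
import math
from collections import Counter

END = "$"  # hypothetical end-of-query marker


def next_char_probs(prefix, corpus):
    """Toy stand-in for the language model: next-character distribution
    estimated from the corpus queries that start with `prefix`."""
    counts = Counter()
    for query in corpus:
        if query.startswith(prefix):
            counts[query[len(prefix)] if len(query) > len(prefix) else END] += 1
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def beam_search(prefix, corpus, beam_width=2, max_steps=50):
    """Extend `prefix` one character at a time, keeping the `beam_width`
    most probable partial queries after every step."""
    beams = [(0.0, prefix)]  # (log-probability, partial query)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for logp, text in beams:
            for ch, p in next_char_probs(text, corpus).items():
                if ch == END:
                    finished.append((logp + math.log(p), text))
                else:
                    candidates.append((logp + math.log(p), text + ch))
        if not candidates:
            break
        beams = sorted(candidates, reverse=True)[:beam_width]
    return [text for _, text in sorted(finished, reverse=True)]


# Illustrative corpus of region descriptions used as queries.
corpus = ["white cat"] * 3 + ["white keyboard"] + ["wooden desk"] * 2

print(beam_search("w", corpus))   # ['white cat', 'wooden desk']
print(beam_search("wh", corpus))  # ['white cat', 'white keyboard']
```

With a beam width of 2 and the prefix "w", the less likely continuation "white keyboard" is pruned once "white c" and "wooden " both score higher, which shows how the beam width trades coverage for speed.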
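The reported training settings can be collected in a small configuration sketch, which also makes it easy to work out how much data each run processes. The class and field names are illustrative; only the numeric values come from the text, and the sample count of 1.73 million is approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    batch_size: int
    iterations: int
    learning_rate: float

    def samples_seen(self) -> int:
        """Total number of training samples processed over the whole run."""
        return self.batch_size * self.iterations


# Values as reported in the text; the variable names are hypothetical.
QAC_LSTM = TrainingConfig(batch_size=32, iterations=50_000, learning_rate=5e-4)
IBERT = TrainingConfig(batch_size=32, iterations=250_000, learning_rate=5e-5)

IBERT_TRAIN_SAMPLES = 1_730_000  # "approximately 1.73 million samples"

print(QAC_LSTM.samples_seen())  # 1600000
print(IBERT.samples_seen())     # 8000000
print(round(IBERT.samples_seen() / IBERT_TRAIN_SAMPLES, 2))  # 4.62
```

Under these figures, the iBERT fine-tuning run corresponds to roughly 4.6 passes over its training set.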
4.3 Performance Measure
We evaluate the quality of our predictions and estimations using the following performance metrics:
We evaluate the patch selection by F1 score.
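The measures referred to in this section (MRR and perplexity for query completion, F1 for patch selection) can be sketched in their standard textbook forms as below. The function names and toy inputs are illustrative, and the set-based F1 is a simplification that may differ from the exact averaging used over patch classes in the experiments.

```python
import math


def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of the target query in each ranked candidate list
    (a list that misses its target contributes 0)."""
    total = 0.0
    for candidates, target in zip(ranked_lists, targets):
        if target in candidates:
            total += 1.0 / (candidates.index(target) + 1)
    return total / len(targets)


def perplexity(token_probs):
    """Exponential of the average negative log-probability that the model
    assigned to each observed token; lower is better."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


def f1_score(predicted, relevant):
    """Harmonic mean of precision and recall over two sets of patches."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


print(mean_reciprocal_rank([["a", "b"], ["c", "d"]], ["b", "c"]))  # 0.75
print(f1_score(["keyboard", "desk"], ["desk", "couch"]))           # 0.5
print(perplexity([0.25, 0.25]))
```

A model that assigns probability 0.25 to every observed character has a perplexity of about 4, i.e. it is as uncertain as a uniform choice among four characters.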
4.4 Results and Discussion
Evaluation results of the query completion task. Our MRR score is in bold face.
Perplexity of image query auto-completion on both datasets utilising an image and indiscriminate noise. Inclusion of image results in a better (lower) perplexity
We evaluated our proposed iBERT model for finding patch probabilities, which are used to select and rank patches in the image. We achieve an F1 score of 0.7638 over 3,000 patch classes.
5 Conclusion and Future Work
In this work, we propose an extended LSTM language model for a new task of query auto-completion adapted to an image. The language model combines image features and text information, and with beam search over the model we can efficiently predict queries from as little as a single-character prefix. The significant increase in MRR is due to the inclusion of visual information alongside textual queries, as explained by the IFT model. Also, we present iBERT for patch selection, which efficiently ranks patches in the image and eventually predicts the most suitable image for the auto-completed query, and we compare it against the result from the probabilistic patch selection model. This work is among the first attempts to apply a foraging-based strategy to QAC. The explanatory power of IFT for understanding user interaction at query level lays the foundation for the probabilistic patch selection model to capture users’ information need. Our future work is to generalise the referring expressions with a contextual model to distinguish referring and non-referring region descriptions. We intend to aggregate information from textual queries and visual descriptions to scale to multimodal query auto-completion in a single model.
This work is supported by the Quantum Access and Retrieval Theory (QUARTZ) project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 721321, and is partially supported for computing resources by a Google Cloud grant.
- 1. Azzopardi, L., Girolami, M., Van Rijsbergen, K.: Investigating the relationship between language model perplexity and IR precision-recall measures (2003)
- 2. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
- 3. Bar-Yossef, Z., Kraus, N.: Context-sensitive query auto-completion. In: Proceedings of the 20th International Conference on World Wide Web, pp. 107–116. ACM (2011)
- 5. Cao, H., et al.: Context-aware query suggestion by mining click-through and session data. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 875–883. ACM (2008)
- 6. Chi, E.H., Pirolli, P., Chen, K., Pitkow, J.: Using information scent to model user information needs and actions on the web. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 490–497. ACM (2001)
- 7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
- 8. Hauff, C., Murdock, V., Baeza-Yates, R.: Improved query difficulty prediction for the web. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 439–448. ACM (2008)
- 9. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
- 10. Hu, R., Dollár, P., He, K., Darrell, T., Girshick, R.: Learning to segment every thing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4233–4241 (2018)
- 11. Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., Darrell, T.: Natural language object retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4555–4564 (2016)
- 12. Jaech, A., Ostendorf, M.: Personalized language model for query auto-completion. arXiv preprint arXiv:1804.09661 (2018)
- 13. Jaiswal, A.K., Holdack, G., Frommholz, I., Liu, H.: Quantum-like generalization of complex word embedding: a lightweight approach for textual classification. In: Proceedings of the Conference “Lernen, Wissen, Daten, Analysen”, LWDA 2018, Mannheim, Germany, 22–24 August 2018, pp. 159–168 (2018). http://ceur-ws.org/Vol-2191/paper19.pdf
- 14. Ji, S., Li, G., Li, C., Feng, J.: Efficient interactive fuzzy keyword search. In: Proceedings of the 18th International Conference on World Wide Web, pp. 371–380. ACM (2009)
- 15. Jiang, J.Y., Ke, Y.Y., Chien, P.Y., Cheng, P.J.: Learning user reformulation behavior for query auto-completion. In: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 445–454. ACM (2014)
- 16. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798 (2014)
- 17. Kharitonov, E., Macdonald, C., Serdyukov, P., Ounis, I.: User model-based metrics for offline query suggestion evaluation. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 633–642. ACM (2013)
- 19. Liu, H., Mulholland, P., Song, D., Uren, V., Rüger, S.: Applying information foraging theory to understand user interaction with content-based image retrieval. In: Proceedings of the Third Symposium on Information Interaction in Context, pp. 135–144. ACM (2010)
- 21. McCann, B., Bradbury, J., Xiong, C., Socher, R.: Learned in translation: contextualized word vectors. In: Advances in Neural Information Processing Systems, pp. 6294–6305 (2017)
- 22. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
- 23. Mitra, B.: Exploring session context using distributed representations of queries and reformulations. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12. ACM (2015)
- 24. Mitra, B., Craswell, N.: Query auto-completion for rare prefixes. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1755–1758. ACM (2015)
- 25. Mitra, B., Rosset, C., Hawking, D., Craswell, N., Diaz, F., Yilmaz, E.: Incorporating query term independence assumption for efficient retrieval and ranking using deep neural networks. arXiv preprint arXiv:1907.03693 (2019)
- 26. Park, D.H., Chiba, R.: A neural language model for query auto-completion. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1189–1192. ACM (2017)
- 28. Pirolli, P., Card, S.K., Van Der Wege, M.M.: Visual information foraging in a focus+context visualization. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 506–513. ACM (2001)
- 29. Shao, T., Chen, H., Chen, W.: Query auto-completion based on word2vec semantic similarity. In: Journal of Physics: Conference Series, vol. 1004, p. 012018. IOP Publishing (2018)
- 31. Shokouhi, M.: Learning to personalize query auto-completion. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 103–112. ACM (2013)
- 33. Sutskever, I., Martens, J., Hinton, G.E.: Generating text with recurrent neural networks. In: Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pp. 1017–1024 (2011)
- 34. Vijayakumar, A.K., et al.: Diverse beam search: decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424 (2016)
- 35. Weber, I., Castillo, C.: The demographics of web search. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 523–530. ACM (2010)
- 36. White, R.: Beliefs and biases in web search. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12. ACM (2013)
- 39. Wu, C.C., Mei, T., Hsu, W.H., Rui, Y.: Learning to personalize trending image search suggestion. In: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 727–736. ACM (2014)