QISS: An Open Source Image Similarity Search Engine
Qwant Image Similarity Search (QISS) is a multi-lingual image similarity search engine based on a dual-path neural network that embeds texts and images into a common feature space where they can be directly compared. Our demonstrator, available at http://research.qwant.com/images, allows real-time searches in a database of approximately 100 million images.
Keywords: Neural networks · Image retrieval
Qwant Image Similarity Search (QISS) is a multi-lingual image search engine. It allows users to make queries either textually or with images. QISS relies on similarity search: it compares the content of a query with the data in its index and returns the elements it considers most similar, visually or semantically. In our case, the most similar index elements are those whose Euclidean distance to the query representation is smallest.
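The retrieval principle above can be sketched in a few lines: rank indexed feature vectors by their Euclidean distance to the query vector and return the closest ones. The dimensions and vectors here are toy values for illustration, not QISS's actual features.

```python
import numpy as np

def nearest_neighbors(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k index vectors closest to the query
    in Euclidean distance (smaller distance = more similar)."""
    dists = np.linalg.norm(index - query, axis=1)
    return np.argsort(dists)[:k]

# Toy 4-dimensional feature space with five indexed items.
index = np.array([
    [0.0, 0.0, 0.0, 1.0],
    [0.1, 0.1, 0.0, 0.9],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.9, 0.9, 0.1, 0.1],
])
query = np.array([0.05, 0.0, 0.0, 0.95])
print(nearest_neighbors(query, index, k=2))  # -> [0 1]
```

In a production setting this exhaustive scan is replaced by an approximate nearest-neighbor index so that the search scales to millions of images.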
If an image and its describing text are close to one another in the representation space, it is possible to query either one with the other. QISS aims to let the user query a set of images with either text or images. While other search engines rely on the text surrounding images, or on tags, QISS evaluates the semantic similarity between the query and each element of the database.
2 System Description
QISS can be used to query the image index with either texts or images as queries. Moreover, the representation of texts is multi-lingual: words from different languages with similar meanings have close representations in the semantic space.
2.1 Multi-lingual Text Representation
One of QISS’s constraints is to be available in several languages. Instead of translating textual image descriptions, we propose to use multi-lingual word embeddings to cope with multiple languages. Word embeddings project words into a semantic space where distance and semantic similarity are related. Multi-lingual embeddings, such as Multilingual Unsupervised or Supervised word Embeddings (MUSE), represent different languages in one common space. Thanks to this alignment, a neural network can extract information from the embedded words in all learned languages. This allows QISS to maintain a single index that contains every image and can answer queries expressed in several languages, a key difference from classic search engines, which maintain one index per language.
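The effect of the alignment can be illustrated with a small sketch. The vectors below are hand-made stand-ins for aligned MUSE embeddings (the real ones are 300-dimensional and loaded from the published per-language files); the point is only that, once the spaces are aligned, a word in one language can be matched against words in another with no translation step.

```python
import numpy as np

# Toy stand-ins for aligned multi-lingual embeddings: one shared space
# for English and French words (values are illustrative only).
aligned = {
    ("en", "dog"):     np.array([0.90, 0.10, 0.00]),
    ("fr", "chien"):   np.array([0.88, 0.12, 0.02]),
    ("en", "car"):     np.array([0.00, 0.20, 0.95]),
    ("fr", "voiture"): np.array([0.03, 0.18, 0.93]),
}

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A French query word is compared directly against English words.
query = aligned[("fr", "chien")]
best = max((k for k in aligned if k[0] == "en"),
           key=lambda k: cosine(query, aligned[k]))
print(best)  # -> ('en', 'dog')
```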
2.2 Model for Image and Text Representation
To project both images and texts into the same space, we use two networks trained simultaneously. The image branch uses a Convolutional Neural Network (CNN) followed by a fully connected layer that embeds images. The second branch is a multi-layer Recurrent Neural Network (RNN) that composes a list of multi-lingual word embeddings, corresponding to a given sentence, into the same space.
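A minimal PyTorch sketch of this dual-branch architecture is shown below. The layer sizes, the choice of GRU, and the toy CNN are assumptions for illustration, not the actual QISS model; what matters is that both branches output vectors of the same dimension, so image and sentence representations live in one space.

```python
import torch
import torch.nn as nn

EMB_DIM = 300    # dimension of the multi-lingual word embeddings
JOINT_DIM = 256  # dimension of the common image/text space (assumed value)

class ImageBranch(nn.Module):
    """CNN feature extractor followed by a fully connected projection."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, JOINT_DIM)

    def forward(self, images):               # (B, 3, H, W)
        feats = self.cnn(images).flatten(1)  # (B, 32)
        return self.fc(feats)                # (B, JOINT_DIM)

class TextBranch(nn.Module):
    """Multi-layer RNN composing a sentence's word embeddings."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(EMB_DIM, JOINT_DIM, num_layers=2, batch_first=True)

    def forward(self, word_embs):  # (B, T, EMB_DIM)
        _, h = self.rnn(word_embs)
        return h[-1]               # (B, JOINT_DIM): last layer's final state

img_vec = ImageBranch()(torch.randn(2, 3, 64, 64))
txt_vec = TextBranch()(torch.randn(2, 7, EMB_DIM))
print(img_vec.shape, txt_vec.shape)  # both are (2, 256): the common space
```

During training, a ranking or contrastive loss would pull an image and its caption together in this space while pushing mismatched pairs apart.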
We use two datasets to train the models used by QISS. Each dataset is composed of images and their corresponding captions. The first dataset is Common Objects in COntext (COCO). It contains 123 287 images with 5 English captions per image. The second dataset, Multi30K, contains 31 014 images with captions in French, German, and Czech; we use 29 000 images for training, 1 014 for validation, and 1 000 for testing.
MUSE allows for a common representation of 110 languages. Once we have trained our model in English using COCO, we can use MUSE to transfer the computed embeddings to any language it supports, at no additional training cost.
For the online demonstration, we indexed images from the Yahoo Flickr Creative Commons (YFCC) image dataset. This dataset contains roughly 100 million images under Creative Commons licenses.
As described above, QISS is a full image search engine based on similarity search. Figure 1 shows the interface, where a search can be made with a text query or by uploading an image. The results are shown in Fig. 2: the images our method evaluates as most similar to the query (either text or image) are returned.
Upload an image. The image is sent to the Image Handler, and inference is performed with NVIDIA TensorRT.
Search with text. The text is sent to the language detector and the Text Features Extractor.
- 2. Elliott, D., Frank, S., Sima’an, K., Specia, L.: Multi30K: multilingual English-German image descriptions. In: Proceedings of the 5th Workshop on Vision and Language. Association for Computational Linguistics, Stroudsburg (2016). https://doi.org/10.18653/v1/W16-3210
- 3. Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017)
- 5. Portaz, M., Randrianarivo, H., Nivaggioli, A., Maudet, E., Servan, C., Peyronnet, S.: Image search using multilingual texts: a cross-modal learning approach between image and text. Ph.D. thesis, Qwant Research (2019)
- 6. Shamma, D.A.: One hundred million creative commons Flickr images for research, 24 June (2014)