1 Introduction

The retail industry is currently experiencing a digital transformation. To remain competitive, retailers use new machine learning based technologies that allow them to exploit real-time purchase behavior, achieve more efficient store management, and offer new shopping experiences to their customers. [1] identifies six value drivers for digital transformation that generate new revenue or improve efficiency: innovation, employee productivity, asset utilization, customer experience, supply chain and logistics, and sustainability. These value drivers include use cases that improve the efficiency of the checkout process and reduce the number of thefts. In grocery stores, this can be achieved by product recognition, which enables automatic checkout, on-shelf availability analysis, and fraud detection at self-checkouts (SCOs).

Fig. 1 Example of challenging cases where different grocery products have a similar appearance and are only differentiable by subtle details: ingredients side (a), meat packages (b), lactose and lactose-free product variants (c), and the same type of product with different weights (d)

Automatic recognition of products in grocery stores poses many challenges to machine learning. One such challenge relates to the large number of products to recognize. A large supermarket typically has tens of thousands of unique products; for hypermarkets, this number can increase to hundreds of thousands. There is also a large imbalance in product sales; some products are sold in large quantities each day, while others are only sold a few times a week. Moreover, new products are added to and removed from the retailer's assortment each week. Another challenge relates to finding robust recognition solutions that can distinguish between products with very subtle visual differences.

In computer vision, fine-grained image recognition focuses on differentiating between similar, hard-to-distinguish types of objects. [2] categorizes fine-grained image recognition into three main paradigms: finding key parts in an image and merging the local feature vector with a global vector representation, learning better feature representations through high-order feature interactions or novel loss functions, and using auxiliary data sources. Multimodal classification is part of the last paradigm, using a combination of data from different modalities to improve recognition performance. In retail, this is common in e-commerce applications [3,4,5], where product images combined with textual metadata are used to create more accurate models. For automatic recognition of grocery products, image-based techniques are the dominant category. However, recent progress in optical character recognition (OCR) has made it possible to extract textual elements from product packages. Thus, combining the text elements from the product packages with the respective image allows for more accurate and reliable recognition of products using multimodal classifiers. This approach has been proposed in two recent surveys and listed as an important research direction to improve fine-grained recognition of grocery products [6, 7]. In fact, recent work has shown that using extracted textual data from packages as input to natural language processing (NLP) models results in a robust text classifier with high accuracy, see [8, 9]. Furthermore, [7] state that no specific datasets exist for fine-grained product recognition. Therefore, a dataset is needed to identify the performance and challenges prevalent in fine-grained recognition of grocery products. In addition, more research is needed on how to construct such multimodal classifiers/models and how image and text information can be fused.

Thus, in this work, we suggest a multimodal product recognition approach that uses image and textual information as input on a fine-grained product recognition dataset. Our approach uses separate image and text models and combines their features with multimodal fusion methods. We test several deep image models, ranging from small models to larger state-of-the-art architectures. We also use several state-of-the-art text models based on the Transformer architecture [10] to evaluate text recognition performance. Furthermore, experiments are performed with multiple multimodal fusion models to explore the performance of the different fusion methods with respect to model size, image resolution, and text length.

A challenge with existing datasets for automated product recognition is the lack of high-resolution images that cover a wide range of classes with subtle distinctions captured in all product orientations. To the best of our knowledge, there exists no public product dataset that provides both images and detailed package text, products that are only distinguishable by subtle details, samples from all product orientations, and a significant number of classes and samples. Therefore, we collected a dataset with these characteristics from a grocery store and use it for all our experiments. The created dataset contains a great variety of products, where most products have one or more related products that differ only in small textual or color details on the packaging. Figure 1 shows different types of challenging cases for grocery product recognition.

In summary, the main contributions of this paper are as follows:

  • A novel fine-grained dataset (FineGrainOCR) for grocery product recognition consisting of image and textual data. The distinguishing characteristics of this dataset are:

    • Subtle differences between products

    • A wide range of product orientations

    • High-resolution image data

    • Detailed text information from packages

    • A significant number of classes and samples

  • A multimodal product recognition approach that combines image and textual data from grocery products using multimodal fusion methods. This approach performs significantly better compared to unimodal models on fine-grained image recognition on a dataset collected from a grocery store. Furthermore, the approach shows state-of-the-art results in a related retail domain.

  • Recommendations and trade-off aspects based on our extensive experimental evaluations related to how to implement and deploy multimodal product recognition methods.

The remainder of this paper is structured as follows. Section 2 reviews related work on grocery product recognition and multimodal classification. We then present our dataset in Sect. 3 and the details of our proposed approach in Sect. 4. The experimental results are reported in Sect. 5, and Sect. 6 discusses our results in the context of the wider literature. Finally, we conclude our work in Sect. 7.

2 Background

In this section, we present a summary of relevant grocery product recognition research that includes both general and fine-grained methods. To contextualize our own dataset, we provide a detailed description of the available datasets for recognition of grocery products. Furthermore, we give an introduction to multimodal classification techniques and work focusing on multimodal techniques for image and text data. Finally, examples of multimodal image and text models are provided from the retail domain.

2.1 Grocery product recognition

Recognition Techniques Recognition of products is required for many types of automation solutions at grocery stores. These include automatic checkout systems for the registration of products [11,12,13], the monitoring of availability and misplacement on store shelves [14,15,16], frictionless checkout, where a camera system combined with other in-store sensors registers the products picked by customers [17, 18], and the detection of barcode switches and other fraudulent actions at SCOs [8, 19]. Examples of these applications are shown in Fig. 2.

Fig. 2 Example of application areas for product recognition in grocery stores: a autonomous checkout systems, b shelf monitoring, and c detection of barcode switches at SCOs

Early work used local feature-based methods to recognize products, see examples in [20, 21]. In particular, [20] used scale-invariant feature transform (SIFT) [22] keypoints, color histograms, and boosted Haar-like features to recognize products.

In recent years, many methods have used convolutional neural networks (CNN) to improve the recognition of grocery products. These include new models that improve recognition performance [23,24,25,26], improved learning strategies [27,28,29,30], the use of synthetic data for larger training sets [31, 32], and the adaptation of grocery models to new environments [33, 34]. To exemplify the breadth and variety of these solutions, we present details of some relevant works below. In [27], the problem of new grocery products introduced in a store is addressed. An object classification system is proposed that incorporates the rejection of unknown products in a checkout counter scenario. A metric learning approach [35] enables more reliable confidence estimates using cosine similarity as a distance measure. In addition, a novel margined unknown loss function further improves the rejection of unknown products. The handling of new grocery products in a product recognition system is addressed by [33]. The method learns product embeddings using a generative adversarial network (GAN). Leveraging the learned embedding, the method can add new classes without any retraining. This is achieved by formulating product recognition as an instance-level recognition problem and applying the k-nearest neighbors algorithm (KNN) to recognize products.

Because some grocery products are differentiated only by small fine-grained details, several fine-grained methods have been proposed in the literature to tackle this challenge. Such solutions consist of several steps and use complex models; researchers use hierarchical information [36], techniques to find discriminative features [37,38,39,40], and contextual information [41]. However, the methods for finding discriminative features vary considerably. For example, [37] presents a method to improve the fine-grained recognition of products using the destruction and construction learning proposed in [42] together with a self-attention mechanism. The model learns to correlate semantic regions in product images and is also able to find specific discriminatory areas. [39] propose a fine-grained classification network where an input image is passed through two classification paths, using object-level and part-level information. The part-level path contains discriminatory regions that are passed through stacked convolutional LSTMs and then combined with the object-level features before a classification layer. [40] propose a framework for product label recognition using textual information. However, the results reveal challenges associated with OCR reading and the correct classification of product labels. For a more general overview of fine-grained recognition techniques, see [2, 43].

Multimodal techniques, which combine image and text for fine-grained grocery product recognition, have also been proposed in recent years; see, for example, [44, 45]. In [44], the recognition of soda bottles is carried out by combining textual and visual features with a simple attention mechanism. The product text is extracted from each input image by OCR reading. [45] also use OCR to combine textual and visual features. Their approach adopts supervised contrastive learning and achieves a significant improvement in accuracy on several benchmarks for the recognition of grocery products. However, these techniques are applied to front-facing grocery products with a limited number of classes. Furthermore, the datasets contain a limited number of text elements and thereby lack classes with challenging fine-grained textual details. Thus, based on the limited research using images and OCR for grocery product recognition and the research gap identified by [6, 7], we aim to bridge this gap by proposing a multimodal product recognition approach. Moreover, we create a fine-grained dataset with a large number of classes and perform an extensive evaluation that identifies the trade-offs of using our multimodal product recognition approach.

Datasets Early work by [20] introduced the first open dataset for the recognition of grocery products. It consists of 120 different types of products, where the training data contain studio images with only a handful of images per class. A large test set is also included, extracted from images captured by a camera in a real store environment. [21] use the same setup but increase the number of product classes to 8,350. In [46], this is further extended by adding text descriptions of each product in the dataset, enabling the inclusion of semantic information in model development. [47] use both training and test data from challenging in-store conditions such as different perspectives and lighting. The dataset is aimed at autonomous robots. It consists of 25 classes with 5,000 image samples in total, where each class describes a coarse category such as flour, milk, or pasta.

The RPC dataset is aimed at automatic checkout systems [48]. It includes 200 different types of products, with 53,739 single-product images from a studio setup and 30,000 multi-product images from a checkout counter setup. The authors also propose using the single-product images to construct synthetic images by cutting out the products and pasting them into the checkout counter environment. In a similar setting, [49] provide the D2S dataset consisting of semantic segmentation masks for products.

Object detection of products in shelf images poses a significant challenge since there is usually a large number of products per image positioned close to each other or overlapping one another. The SKU-110K dataset [50] is a dense object detection dataset for grocery products, which includes 11,762 shelf images, where each image contains around 147 products. The products in the dataset are unlabeled; hence, recognition of specific classes is not possible. In [51], SKU-110K is expanded by adding multiple rotated instances for each sample. In contrast to the unlabeled products in SKU-110K, the RP2K dataset [52] consists of 2,000 products with 500k shelf images collected from physical retail stores. The dataset is organized hierarchically with meta-categories for shape and product type. Additionally, details such as brand, flavor, and type are included, allowing evaluation at a customized categorization level. In the Locount dataset provided by [53], an instance count is added to each product at each shelf position. This enables evaluations of quantity estimations at each shelf position.

In recent years, datasets have been presented for grocery products that use text as an additional input. [54] present a dataset designed to help people with visual impairment, including textual descriptions acquired from OCR reading of images as metadata. The dataset contains a candidate set of 6,348 products, with a total of 13,290 images. To evaluate product recognition performance, a test set with 373 samples from 104 randomly selected classes is extracted from a real-world environment using a smartphone camera. The United Retail Datasets (Unitail) [55] consist of a product detection and a product recognition dataset. The product recognition dataset uses 1,454 products for the training set, where each product has a front-facing product sample. Manually annotated text regions for each sample allow the dataset to support text detection, text recognition, and text-based product matching.

However, despite the large number of aforementioned datasets for grocery product recognition, there is no dataset that addresses the challenges of fine-grained grocery product recognition with the following characteristics:

  • Containing challenging products with similar appearance

  • Sufficient image resolution to capture fine-grained details

  • Products facing in all directions

  • Large number of classes

  • A substantial amount of training data

2.2 Multimodal classification

By combining several data sources, such as images, video, text, and audio, a better representation of a subject is possible. Multimodal classification is used to fuse the representations from each unimodal data source and to learn a classifier with improved performance. Multimodal data are present in many different types of domains, such as retail [3, 56], agriculture [57, 58], and self-driving cars [59, 60].

[61] describe five challenges for a multimodal setting: learning to construct a representation that uses information from all modalities, how to translate from one modality to another, how to align the information, how to fuse the information, and finally, how to transfer knowledge between modalities. The description and naming of multimodal fusion methods are inconsistent in the current literature. [62] propose a holistic taxonomy of multimodal classification that we apply in this work. The taxonomy categorizes the fusion methods into three different types: early fusion, late fusion, and cross-modality fusion. In early fusion, data are aggregated before the learning model is applied. Late fusion combines the extracted features of each modality, which can be the class probabilities or the final feature representation of each modality. In cross-modality fusion, data and intermediate features are shared during learning, which can model complex relationships between modalities.

Several methods have been proposed to fuse image and textual features. Encoding textual features into the image domain and passing the resulting image through a CNN has been shown to improve classification performance [63,64,65]. Furthermore, transformer-based models have been successfully combined for multimodal classification [66]. [67] evaluate different fusion methods using textual features from FastText [68] and image features from a ResNet model [69]. [70] use image and text information for scene understanding. Text is extracted from images via OCR, and a pyramidal histogram of characters (PHOC) is used to calculate a Fisher vector that encodes the textual information. The textual features are then fused with the image features of a CNN using an attention mechanism. Examples of multimodal fusion work for the recognition of grocery products are described in Sect. 2.1.

Within the retail industry, e-commerce is another application area where image and text information is often found. Zahavy et al. [71] recognize products using the image of the product and its corresponding title. Two CNNs are used, one for each modality, to extract feature representations. Fusion is performed by training a policy network that learns the preferred modality for classification. [72] learn a simple fusion model composed of a fully connected layer that assigns a weight to each modality. To reduce the risk of neglecting one modality during training, a regularization scheme is proposed using a Kullback-Leibler divergence loss in combination with a cross-entropy loss. Other work uses early fusion methods by concatenating the final feature layers of the unimodal models [73, 74]. In [75] and [76], the authors improve the classification results using an attention mechanism that combines the image and textual features. Combining images and text from OCR is not restricted to grocery product recognition; it has been applied to other domains such as document classification [77], package identification in logistics [78], and product leaflet classification [79].

3 FineGrainOCR—a fine-grained product dataset with OCR

There are several datasets for the recognition of grocery products, as described in Sect. 2. However, they either contain a limited number of classes with similar appearance, have too low an image resolution to capture small distinguishing details, or use only front-facing products. Therefore, we created a fine-grained multimodal product recognition dataset. We established that such a dataset would have to fulfill the following requirements:

  • Contain high-resolution images

  • Present different sides and orientations of packages

  • Include a large number of classes, most of which strongly resemble one or several other classes.

Fig. 3 Autonomous checkout system used for data collection. The scanning tunnel where image data is collected is marked with a red rectangle

Images of grocery products for this new dataset were collected by an autonomous checkout system deployed in a large grocery store. In the system, products are placed on a moving conveyor belt and then pass through a scanning tunnel. Figure 3 shows the system with the scanning tunnel marked by a red rectangle. Product registration is carried out using barcode scanners placed within the scanning tunnel. The scanning process was performed by customers in the store, who received no instructions on how to place the products, resulting in a wide variety of product orientations. An RGB camera with a resolution of 2592 \(\times \) 1944 is placed in the scanning tunnel. The high resolution and the close placement of the camera to the products ensure that fine-grained details are captured.

Fig. 4 Gallery of image samples in the new dataset created for this work. Rows contain example categories of products, while columns depict examples for each category

Our FineGrainOCR dataset has been constructed by extracting a total of 256 classes from the collected data. Classes have been selected from six different categories, namely, chocolate, dairy, meat, milk/cream, mushroom, and toppings. In each category, most classes have one or several other classes with a similar appearance. Figure 4 illustrates some of the categories (rows) and a few examples of different classes in each (columns).

To assess the overall performance of our proposed approach, we also want to investigate how sample size affects product recognition performance. Therefore, a substantial number of samples is extracted for each class, with a target of 500 samples per class. However, sales of products in grocery stores are highly imbalanced, making it difficult to extract a balanced dataset. Figure 5 illustrates the number of training samples for each class. Although some classes have a limited number of samples, most classes are at or close to the target of 500 samples. FineGrainOCR is divided into a training and a validation set, with 80% of the samples of each class in the training set and 20% in the validation set.

The product texts of each image sample are extracted using the OCR engine in Google Vision API,Footnote 1 which captures many fine-grained words from product images. Figure 6 illustrates parts of the words extracted from a sample in the dataset.

All extracted words from an image sample are concatenated into a single text string, where each word is separated by a newline character. Words are ordered using the default output from the Google Vision API. The text provided to the text models needs to be transformed into a feature vector. This is done by first tokenizing the text and then transforming it into a text embedding.

The transformer-based text models used in our approach expect a fixed number of tokens to be input into the model, with a maximum number of 512 tokens. However, the number of tokens quadratically affects the runtime performance of the text model for the BERT architecture [80], which requires us to use as few tokens as possible without significantly affecting the performance of text recognition. Figure 7 shows a histogram of the length of the tokenized text for the samples in the dataset. Based on this, the maximum number of tokens is selected as 128, 256, and 384 in the experiments. Text samples containing longer sentences are cropped.
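To make the tokenization step concrete, the following sketch shows how the newline-separated OCR words can be encoded for a BERT-style model using a HuggingFace tokenizer. The tokenizer identifier distilbert-base-uncased and the helper function encode_ocr_words are illustrative assumptions rather than the exact configuration of our pipeline.

```python
# Minimal sketch: concatenating OCR words and tokenizing them for a BERT-style
# text model. The tokenizer name is illustrative; any BERT-compatible
# tokenizer from HuggingFace can be substituted.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def encode_ocr_words(words, max_length=256):
    """Join OCR words with newlines and produce fixed-length token tensors."""
    text = "\n".join(words)          # one string, newline-separated
    return tokenizer(
        text,
        truncation=True,             # crop longer samples
        padding="max_length",        # pad shorter samples
        max_length=max_length,       # 128, 256, or 384 in our experiments
        return_tensors="pt",
    )

# Example: words in the default order returned by the OCR engine
encoding = encode_ocr_words(["MEDIUM", "GROUND", "BEEF", "500", "g"])
print(encoding["input_ids"].shape)   # torch.Size([1, 256])
```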

4 Proposed approach

As explained earlier, many products in a grocery store have minimal differences and are therefore difficult to distinguish visually. In many of these cases, the differences consist of small textual details (e.g., products with different fat content) or small patches with additional information (e.g., lactose-free products). Figure 8 illustrates two different types of products where the small text details "medium ground beef" and "extra lean ground beef" make up the entire difference. In addition, different sides of products can be more or less difficult to classify. For example, the side of the package with the ingredients is difficult to differentiate, whereas the front side of a package is typically easier. To handle these challenges, we suggest using a simple auxiliary input, such as text extracted from the products by OCR, to improve recognition accuracy, thus creating a recognition pipeline that combines both images and text.

Fig. 5 Histogram of the number of samples for the classes in the FineGrainOCR training set

Fig. 6 Words recognized by the OCR engine are merged into a single text string. The image exemplifies how a subset of the words recognized in the image is concatenated into a single string

Fig. 7 Histogram of the number of tokens of the text samples in the training dataset

Fig. 8 Example of two different types of packaged meat. While the packages show few visual differences, the extracted text patches (red boxes) in (a) and (b) show that the texts (medium ground beef and extra lean ground beef) can easily be used to discriminate between the two products

The details of our approach are explained in the following sections. Section 4.1 presents an overview of the image and text classification pipeline. In Sect. 4.2, we motivate and describe the selected image and text models. In addition, we describe how image and text representations are fused into different multimodal models.

4.1 Overview

The architecture of our approach is shown in Fig. 9. The input image is passed to two separate models, one image-based model and one text-based model. Textual input is extracted using OCR on the input image.Footnote 2 The output of each model is a feature vector, labeled \(x_{txt}\) for the text model and \(x_{img}\) for the image model. These feature vectors are passed to a multimodal feature fusion module that combines the feature vectors of each modality and generates a fused representation called \(x_{fused}\). This representation is passed through a classification layer that outputs the class scores \(y_{score}\).
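As an illustration of this two-stream design, the following PyTorch sketch combines a ResNet50 image encoder and a DistilBERT text encoder with feature concatenation as the fusion step. The class name, layer dimensions, and pre-trained weight identifiers are illustrative assumptions; any of the fusion methods described in Sect. 4.2 can be substituted for the concatenation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import AutoModel

class TwoStreamClassifier(nn.Module):
    """Sketch of an image + text two-stream model with concatenation fusion."""

    def __init__(self, num_classes=256, text_model="distilbert-base-uncased"):
        super().__init__()
        backbone = resnet50(weights="DEFAULT")                        # ImageNet pre-trained
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.text_encoder = AutoModel.from_pretrained(text_model)
        img_dim, txt_dim = 2048, self.text_encoder.config.hidden_size
        self.classifier = nn.Linear(img_dim + txt_dim, num_classes)

    def forward(self, image, input_ids, attention_mask):
        x_img = self.image_encoder(image).flatten(1)                  # (B, 2048)
        txt_out = self.text_encoder(input_ids=input_ids,
                                    attention_mask=attention_mask)
        x_txt = txt_out.last_hidden_state[:, 0]                       # first ([CLS]) token
        x_fused = torch.cat([x_txt, x_img], dim=1)                    # feature concatenation
        return self.classifier(x_fused)                               # class scores y_score
```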

Fig. 9 The proposed multimodal architecture of our approach for grocery product recognition. A two-stream model uses an image and a text model to create feature representations \(x_{img}\) and \(x_{txt}\) for a multimodal fusion module, which produces a new representation \(x_{fused}\) and passes it to a classifier

Fig. 10 Multimodal fusion methods used in the proposed approach. Each technique uses the features \(x_{img}\) and \(x_{txt}\) from both modalities and processes them into a final feature representation \(x_{fused}\)

Fig. 11 Histograms of how the per-class accuracy is affected by the multimodal model (ResNet50+DistilBERT) compared to a the image model ResNet50 and b the text model DistilBERT, for different numbers of maximum training samples. Reduction means that the multimodal accuracy for a class is significantly lower than that of the unimodal model, equal means that they are the same, and improvement means that the multimodal accuracy is higher

4.2 Model selection

We consider three different types of CNN architectures for the image models: ResNet [69], MobileNetV3 [81], and ConvNext [82]. Each of these has different characteristics: a strong baseline, a model designed for mobile devices, and a state-of-the-art model, respectively. We want to see how these different characteristics affect performance in a multimodal setting. ResNet [69] is the de facto standard CNN backbone for many computer vision tasks. Published in 2015, it still offers good performance with recent training procedures [83], while also having a low computational cost. MobileNetV3 [81] is a lightweight CNN designed for mobile devices with a low memory footprint and low latency requirements. The ConvNext architecture [82] has shown state-of-the-art results in several image recognition tasks. It is based on the ResNet50 design with several techniques and design elements that resemble the recent Vision Transformer architecture [84].

For the recognition of the package text on grocery products, BERT [85] with a small classification head [9] has shown notably better classification results than other methods using GloVe embeddings [86]. Therefore, we use three types of text models based on the Transformer architecture used in BERT: the baseline model BERT [85], the optimized DistilBERT [87] model, and the more accurate DeBERTa [88] model.

Four types of late fusion methods are used to combine the image and text modalities. Feature Concatenation and Score Fusion are two simple and classic methods for combining multiple modalities. Two recent multimodal methods are also used: the Gated Multimodal Unit (GMU) [89] learns the influence of each modality and gates the most informative features, and EmbraceNet [90] is designed for robustness and learns cross-modal correlations by randomly selecting a subset of the features of each modality during model training. Each multimodal fusion method is shown in Fig. 10.

As shown in Fig. 9, the features extracted from the image and text modalities are passed to a multimodal feature fusion module; \(x_{img}\) denotes the features of the image modality, while \(x_{txt}\) is used for the text modality. The resulting feature representation of the multimodal feature fusion module is denoted as \(x_{fused}\).

Feature Concatenation is a simple feature fusion method based on concatenating the features of each modality into a single feature vector. It can be expressed as follows:

$$\begin{aligned} x_{fused}&= [x_{txt}, x_{img}] \end{aligned}$$

where \([\cdot ,\cdot ]\) is the concatenation operation.

In Score Fusion, the classification results of each modality are used to train a new classifier by concatenating the probabilities of each class. This fusion method has the following form:

$$\begin{aligned} s_{txt}&= Softmax(W_{txt} \cdot x_{txt} + b_{txt}) \\ s_{img}&= Softmax(W_{img} \cdot x_{img} + b_{img}) \\ x_{fused}&= [s_{txt}, s_{img}] \end{aligned}$$

where \(s_{txt}\) and \(s_{img}\) are the probabilities of each classifier, \(W_{txt}\) and \(W_{img}\) are trainable weight matrices, while \(b_{txt}\) and \(b_{img}\) are trainable bias parameters. Compared to other multimodal fusion methods, a drawback of this method is that it is not end-to-end trainable and requires each modality to be trained separately.
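A minimal sketch of the Score Fusion step is shown below, assuming the logits of the separately trained image and text classifiers are already available; the module name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    """Fuses the class probabilities of two already trained unimodal classifiers."""

    def __init__(self, num_classes=256):
        super().__init__()
        # the fusion classifier sees the concatenated probability vectors
        self.classifier = nn.Linear(2 * num_classes, num_classes)

    def forward(self, logits_txt, logits_img):
        s_txt = torch.softmax(logits_txt, dim=1)
        s_img = torch.softmax(logits_img, dim=1)
        x_fused = torch.cat([s_txt, s_img], dim=1)
        return self.classifier(x_fused)
```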

Introduced by Arevalo et al. [89], Gated Multimodal Unit (GMU) learns how each modality influences activations. Formally, it can be expressed as:

$$\begin{aligned} h_{txt}&= tanh(W_{txt} \cdot x_{txt}) \\ h_{img}&= tanh(W_{img} \cdot x_{img}) \\ h_{gate}&= Sigmoid(W_{fused} \cdot [x_{txt}, x_{img}]) \\ x_{fused}&= h_{gate} * h_{txt} + (1 - h_{gate}) * h_{img} \end{aligned}$$

where \(h_{txt}\) and \(h_{img}\) are intermediate feature representations, and \(h_{gate}\) is a gate that weighs the contribution of each modality. \(W_{txt}\), \(W_{img}\), and \(W_{fused}\) are trainable weight matrices.
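The GMU equations above translate directly into a small PyTorch module; the feature dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedMultimodalUnit(nn.Module):
    """GMU fusion following the equations above (sketch; dimensions are illustrative)."""

    def __init__(self, txt_dim=768, img_dim=2048, fused_dim=512):
        super().__init__()
        self.W_txt = nn.Linear(txt_dim, fused_dim, bias=False)
        self.W_img = nn.Linear(img_dim, fused_dim, bias=False)
        self.W_fused = nn.Linear(txt_dim + img_dim, fused_dim, bias=False)

    def forward(self, x_txt, x_img):
        h_txt = torch.tanh(self.W_txt(x_txt))
        h_img = torch.tanh(self.W_img(x_img))
        h_gate = torch.sigmoid(self.W_fused(torch.cat([x_txt, x_img], dim=1)))
        return h_gate * h_txt + (1.0 - h_gate) * h_img   # x_fused
```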

EmbraceNet [90] is a feature fusion method designed to ensure robustness against loss of data or modalities. It consists of a docking layer that transforms each input modality into the same feature dimension. Then an embracement layer combines the features in a probabilistic fashion. The operations can be expressed as:

$$\begin{aligned} d_{txt}&= W_{txt} \cdot x_{txt} + b_{txt} \\ d_{img}&= W_{img} \cdot x_{img} + b_{img} \\ e_{txt}&= r_{txt} \odot d_{txt} \\ e_{img}&= r_{img} \odot d_{img} \\ x_{fused}&= e_{txt} + e_{img} \end{aligned}$$

where \(d_{txt}\), \(d_{img}\), \(e_{txt}\), and \(e_{img}\) are the feature representations after the docking and embracement layers. \(W_{txt}\) and \(W_{img}\) are the trainable weight matrices of the docking layers, and \(b_{txt}\) and \(b_{img}\) the trainable bias parameters. \(r_{txt}\) and \(r_{img}\) are influence vectors of the same size as the docking layer outputs \(d_{txt}\) and \(d_{img}\). They are generated jointly from a multinomial distribution, where, for each feature index, the value is set to 1 for exactly one modality and 0 for the other. These vectors are then multiplied element-wise with the output of each docking layer, which gives the embracement vectors \(e_{txt}\) and \(e_{img}\). This mechanism enhances the contribution of both modalities while also providing robustness when one modality is missing, for example, when no OCR text can be read from a product image.
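The following is a simplified sketch of this EmbraceNet-style fusion for two modalities, assuming equal selection probabilities for both modalities during training; the dimensions and the inference-time averaging are illustrative choices rather than the exact reference implementation of [90].

```python
import torch
import torch.nn as nn

class EmbraceNetFusion(nn.Module):
    """Simplified EmbraceNet-style fusion for two modalities (sketch)."""

    def __init__(self, txt_dim=768, img_dim=2048, embrace_dim=512):
        super().__init__()
        self.dock_txt = nn.Linear(txt_dim, embrace_dim)   # docking layers
        self.dock_img = nn.Linear(img_dim, embrace_dim)

    def forward(self, x_txt, x_img):
        d_txt = self.dock_txt(x_txt)
        d_img = self.dock_img(x_img)
        if self.training:
            # draw, per feature index, which modality contributes (0 = text, 1 = image)
            choice = torch.multinomial(
                torch.tensor([0.5, 0.5], device=d_txt.device),
                d_txt.size(1), replacement=True)
            r_txt = (choice == 0).float()
            r_img = (choice == 1).float()
            return r_txt * d_txt + r_img * d_img           # x_fused
        # at inference, use the expected contribution of each modality
        return 0.5 * d_txt + 0.5 * d_img
```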

The resulting feature representation \(x_{fused}\) of each fusion method is then passed through a classifier that outputs the class scores.

We also considered using the CLIP model [91] as a standalone multimodal model; however, we discarded it due to its reported poor performance in fine-grained recognition tasks compared to task-specific models.

5 Experiments

5.1 Experimental settings

We perform our experiments with a varying number of training samples of sizes \(\{50, 100, 200, 300, 400\}\). Training samples are extracted by timestamp (time of registration), ensuring that the same samples are always selected for each size. Classes that do not have enough samples to reach the target number use all available training samples. The same validation set is used throughout all experiments, namely, the validation split consisting of 20% of all samples of each class. The purpose of evaluating the performance of different models is two-fold. First, we want to see how the performance of each modality is affected by the training sample size. Second, we want to analyze the effect of the number of training samples on unimodal models compared to multimodal models.

In grocery stores, the computational requirements depend on the type of application, ranging from in-store shelf monitoring systems that analyze images at minute intervals to automatic checkout, which typically needs a response within a few hundred milliseconds. Therefore, we experiment with different model sizes, image resolutions, and text lengths to find the trade-offs of different model selections and hyperparameters. The image models ResNet, MobileNetV3, and ConvNext evaluated in this work use the PyTorch implementationFootnote 3 with its pre-trained weights. For the text models, we use the implementations provided by HuggingFace.Footnote 4 For the BERT, DistilBERT, and DeBERTa models, we use the pre-trained weights bert-uncased, distilbert-uncased, and deberta-base, respectively. The default model configurations are used for all text models. Image augmentations are performed when training the image and multimodal models. We use a data augmentation pipeline consisting of random vertical/horizontal flipping and random rotation, each with a probability of \(50\%\). When training the text models, no pre-processing of the textual data is performed.
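A sketch of such an augmentation pipeline using torchvision is shown below; the rotation range and the ImageNet normalization constants are assumptions not specified above.

```python
import torchvision.transforms as T

# Sketch of the image augmentation pipeline described above; the resize value
# corresponds to one of the evaluated image sizes (256, 384, or 512), and the
# rotation range and normalization statistics are illustrative assumptions.
train_transforms = T.Compose([
    T.Resize((256, 256)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomApply([T.RandomRotation(degrees=180)], p=0.5),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```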

We train our CNN-based image classifiers using stochastic gradient descent (SGD). The learning rate is set to \(1e^{-3}\) and the weight decay to \(1e^{-3}\), with a learning-rate step size of 10. Text and multimodal models use the ADAM optimizer with a learning rate of \(2e^{-5}\), as suggested in [92], and a weight decay of \(1e^{-4}\). The standard batch size is set to 32 for the models. For some larger models, a batch size of 16 has been used due to computational limitations. Cross-entropy is used to calculate the loss in all experiments.
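The optimizer configuration can be summarized as in the following sketch; the stand-in models and the interpretation of the step size as a StepLR schedule (with its default decay factor) are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

image_model = resnet50(num_classes=256)        # stand-in for an image classifier
fusion_head = nn.Linear(2048 + 768, 256)       # stand-in for a multimodal fusion head

# SGD for the CNN image classifiers, with a step learning-rate schedule
optimizer_img = torch.optim.SGD(image_model.parameters(), lr=1e-3, weight_decay=1e-3)
scheduler_img = torch.optim.lr_scheduler.StepLR(optimizer_img, step_size=10)

# ADAM for text and multimodal models
optimizer_fused = torch.optim.Adam(fusion_head.parameters(), lr=2e-5, weight_decay=1e-4)

criterion = nn.CrossEntropyLoss()              # loss used in all experiments
```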

5.2 Image classification results

We evaluated the image classification performance using our FineGrainOCR dataset. Three different model architectures are considered: ResNet, MobileNetV3, and ConvNext. For each architecture, two versions are evaluated to test whether a larger model size improves the results. The images provided to each image model are downsampled before being passed to the model, while the text extracted by OCR uses the original resolution of the images. To see whether the image size affects the results, the following image sizes are considered: 256, 384, and 512 (images are square, i.e., \(256\times 256\), and so on). The results of product recognition on FineGrainOCR using only the image modality are summarized in Table 1.

Table 1 Accuracy (in percentage) from image classification on our FineGrainOCR using the image sizes \(\{256, 384, 512\}\) and max training samples for each class \(\{50, 100, 200, 300, 400\}\).

As we can see in Table 1, the accuracy of the ResNet50 model is the highest for all image and sample sizes. Despite a significantly larger model size, the ConvNext results are consistently 0.5–1.0 percentage points worse than ResNet50, with only a few combinations where performance is similar. Overall, we see that increasing the size of the model for each architecture yields improved accuracy. The size of the image also affects the accuracy of the model, improving it by one percentage point in most cases. Furthermore, the accuracy improves steadily when more training samples are used and starts to saturate after 200 training samples. It is noteworthy that the ResNet models outperform the ConvNext models in terms of accuracy. A potential reason for this is that the more complex ConvNext needs more samples and classes to effectively learn and generalize. Exploratory experiments with different data augmentation techniques and hyperparameter selections have been carried out with these architectures. However, no deviations from the results presented in Table 1 were observed. We also performed experiments with two vision transformer networks, namely, ViT [84] and Swin [93]. However, under identical experimental conditions, these models achieved an accuracy that was \(1.0-2.0\) percentage points lower than the ResNet and ConvNext models.

5.3 Text classification results

We evaluate the accuracy of the text classification module using the BERT, DistilBERT, and DeBERTa models. For each model, we test the following maximum sizes of the sequence length: 128, 256, and 384, as described in Sect. 3. The results of the text classification can be seen in Table 2.

Table 2 Accuracy (in percentage) from text classification on our FineGrainOCR with sequence lengths \(\{128, 256, 384\}\) and max training samples for each class \(\{50, 100, 200, 300, 400\}\)

From the results, it can be observed that a maximum sequence length of 128 gives an accuracy that is typically \(2-3\) percentage points lower than a maximum sequence length of 256 or 384 for most models. Furthermore, an increasing number of training samples improves the accuracy of text classification. However, the gain is drastically reduced after 200 training samples. BERT shows the best overall performance for sequence length 128, and DeBERTa for sequence lengths 256 and 384. However, the difference compared to DistilBERT is only a fraction of a percentage point.

5.4 Multimodal fusion results

To evaluate multimodal recognition performance, we use insights from the image and text classification results presented in Sects. 5.2 and 5.3. We observe that ResNet50 is the model with the highest classification performance, while the lowest performing image classifier is MobileNetV3-Small. We select those for further evaluation to investigate the effect of the image classifier selection.

Image size is a factor that affects image recognition performance. Therefore, we select two image size values, 256 and 512, for further evaluation. All text models have similar performance for the same sequence length and sample size. Therefore, we select all of them to see how each complements the image model. We choose a sequence length of 256, which performs significantly better than a sequence length of 128. It also performs similarly to a sequence length of 384, while requiring less computational power. The sample size is also evaluated for the multimodal models to see how each modality complements the other and how this evolves when the sample size increases.

Analysis of potential multimodal performance To see how a potential multimodal classifier is affected by all these parameters, an oracle classifier is applied. The oracle classifier always selects the best prediction from either the image or the text recognition model, and its result can be seen as the potential performance of an optimal classifier. The values in parentheses show the gain compared to the best unimodal classifier with the same sample size and image size. We evaluate different combinations of the selected image and text models, the image size, and the sample size. The results of the oracle classifier can be seen in Table 3.
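One common way to realize such an oracle, sketched below, counts a sample as correct whenever at least one of the two unimodal models predicts it correctly; the helper function and the toy predictions are illustrative.

```python
import numpy as np

def oracle_accuracy(labels, preds_img, preds_txt):
    """Oracle accuracy: a sample counts as correct if either modality is correct."""
    labels = np.asarray(labels)
    correct_img = np.asarray(preds_img) == labels
    correct_txt = np.asarray(preds_txt) == labels
    return np.mean(correct_img | correct_txt)

# toy example with five samples
print(oracle_accuracy([0, 1, 2, 3, 4],
                      [0, 9, 2, 9, 9],     # image predictions
                      [9, 1, 9, 3, 9]))    # text predictions -> 0.8
```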

Table 3 Ideal oracle classification accuracy (in percentage) for different combinations of the image models MobileNetV3-Small and ResNet50, and the text models BERT, DistilBERT and DeBERTa

The results of the oracle evaluation show the potential of a multimodal classifier to improve classification performance. In addition, the number of training samples does not have a large effect on the oracle classification accuracy: the performance difference between 50 and 400 training samples is between 1.3 and 2.6 percentage points depending on the multimodal configuration. Furthermore, the increased image size affects MobileNetV3-Small and ResNet50 by only a fraction of a percentage point, and not always in a positive direction. There is no deviation between the different text models, indicating that they all learn to distinguish similar classes.

Multimodal classification Based on the results of the oracle classification, we continue to use the MobileNetV3-Small and ResNet50 models for the multimodal evaluation. The results suggest that a multimodal model with MobileNetV3-Small can achieve performance competitive with a multimodal model with ResNet50, although MobileNetV3-Small has significantly lower accuracy individually. A larger image size requires more computational power for the image models, and the oracle classifier showed only minor sub-percentage point improvements when using a larger image size. Therefore, the image size for the image model is set to 256. DistilBERT is selected as the text model as it is the smallest text model and because the text models produce similar results. All fusion models except Score Fusion are trained end-to-end, initialized with the pre-trained weights from the respective frameworks. For the Score Fusion model, the image and text modalities are first trained separately, resulting in separate image and text classifiers. In a second step, the Score Fusion model is trained with the class probabilities of each modality as input. The results of the multimodal classification can be seen in Table 4.

The multimodal results clearly show the improvements from fusing image and textual features. For both image models combined with DistilBERT, all fusion models increase the classification accuracy compared to DistilBERT alone. The number of training samples affects the classification accuracy of the multimodal model: the accuracy increases by several percentage points with 50 training samples, while the increment is reduced when more training samples are used. The Feature Concatenation fusion method is overall better than the other fusion methods. Unlike for the oracle classifier, there is also a large gap in classification accuracy between the image models, showing that ResNet50 provides superior performance when used for multimodal classification.

Table 4 Multimodal classification accuracy (in percentage) of the image model MobileNetV3-Small and ResNet50 model with the text model DistilBERT, using the fusion models Feature Concatenation, Score Fusion, GMU and EmbraceNet
Table 5 The bottom 5 classes with lowest accuracy for the image model ResNet50, text model DistilBERT and the multimodal classifier combining these models with the feature concatenation as the fusion method

Experiments with more complex feature fusion models have also been carried out. The co-attention and cross-attention techniques proposed by Zhang et al. [94] did not improve the results compared to the selected fusion methods. Furthermore, we investigated whether using a long short-term memory (LSTM) network on the output of the text model, as suggested by Gallo et al. [95], yields improved classification results; however, no improvement was observed. A more thorough discussion of the results of the fusion models is presented in Sect. 6.

Analysis of multimodal per-class improvements We evaluate how the multimodal model affects the classification accuracy of each class compared to the unimodal models. The image and text combination ResNet50+DistilBERT with feature concatenation as the fusion method has the highest multimodal accuracy; hence, we select it for this evaluation. Sample sizes of 50, 200, and 400 are used, where the purpose is to see how many classes are affected by the multimodal model, positively and negatively. We perform a paired t-test to determine which classes have reduced, equal, or improved performance. For this, we train 10 different image, text, and multimodal models. For each class, we extract the total number of correct predictions on both the unimodal and multimodal models. Then, a null hypothesis test is performed, checking whether the mean difference between the unimodal and multimodal models is different from zero. A significance level of \(1\%\) is used to test the null hypothesis. As we are interested in a qualitative measure of the multimodal effect on the different classes, this level ensures that the results for the 256 classes are reliable. The results of the test can be seen in Fig. 11.
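A sketch of this per-class test using scipy is shown below; the helper function and the toy per-run counts are illustrative, not the actual experimental data.

```python
import numpy as np
from scipy.stats import ttest_rel

def classify_effect(correct_unimodal, correct_multimodal, alpha=0.01):
    """Compare per-class correct counts over repeated runs (e.g., 10 trained models).

    Returns 'improvement', 'reduction', or 'equal' for one class.
    """
    stat, p_value = ttest_rel(correct_multimodal, correct_unimodal)
    if p_value >= alpha:
        return "equal"
    return ("improvement"
            if np.mean(correct_multimodal) > np.mean(correct_unimodal)
            else "reduction")

# toy example: correct predictions per run for one class (10 runs each)
unimodal   = [40, 41, 39, 42, 40, 38, 41, 40, 39, 40]
multimodal = [45, 46, 44, 45, 47, 44, 46, 45, 44, 45]
print(classify_effect(unimodal, multimodal))   # 'improvement'
```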

The results show that the per-class accuracy of the multimodal model is equal to or better than that of the image and text models. The reason that fewer classes are improved relative to the image model is that the performance of the image model approaches that of the multimodal model when more training samples are available.

Analysis of low accuracy classification cases To see how the most challenging products are affected by a multimodal model, we show the bottom 5 products with the lowest classification accuracy for the image, text, and multimodal models in Table 5. As in previous experiments, we use the ResNet50+DistilBERT combination with feature concatenation as the multimodal model. A training sample size of 200 is selected, and the maximum sequence length and image size are set to 256 (similar results are obtained with other settings). The results show that the multimodal model is much more accurate for the classes with the lowest classification performance, improving the per-product accuracy by at least 12 percentage points and up to more than 50 percentage points. Inspection of the classes with the poorest image performance showed that each had similar-looking classes that were only distinguished by subtle text details. For the text model, the poor performance had multiple reasons, such as a lack of discriminative text information, poor or partial OCR reading, and changes in product appearance in the test set. For most of the worst classes of the multimodal model, the poor performance is due to a changing product appearance: although the text information is similar, the packages have a different visual appearance. The remaining poorly performing classes are almost identical, and the OCR reading is unable to extract discriminatory parts in many cases.

Evaluation on Product Leaflet dataset As described in Sect. 3, no previous grocery product recognition dataset contains challenging products for fine-grained recognition. To compare our approach to other methods, we therefore evaluate it on a dataset in a similar field. The Product Leaflet dataset [79] is a fine-grained classification dataset with product leaflet images. The challenge with this dataset is the great variation in the appearance of the images within a class. However, the textual information is rich and informative. The dataset has 832 classes with 33,280 training images and 8,320 test images. For each class, there are 40 training samples and 10 test samples.

We evaluate our approach on this dataset with the data extraction method described in Sect. 3. ResNet50 and DistilBERT are selected for the unimodal comparisons. For the multimodal comparison, we select Score Fusion and Feature Concatenation. We use the same experimental setup as described in the article, with an image size of 256 for the image classifier and an image size of 512 for OCR extraction and text classification. The results can be seen in Table 6.

We see that our approach substantially improves the accuracy of both the unimodal and multimodal models, setting a new state-of-the-art baseline. In particular, the performance of the DistilBERT model surpasses the results of the weight proba fusion method of the previous approach. In addition to the above-mentioned results, our approach also won the Kaggle competition associated with the article.Footnote 5 We achieved a mean F1-score of \(94.67\%\) on the private leaderboard, while the runner-up had an F1-score of \(93.00\%\).

Table 6 Recognition results on Product Leaflet Dataset

6 Discussion and recommendations

In this section, we discuss the significance of our results and provide recommendations for recognizing grocery products using image and textual data.

In our work, we can see that fusing image and text information from product packages improves the recognition performance for grocery products. This has been shown using a dataset from a grocery store that captures products with fine-grained details. To show that the textual information provides additional information, we have demonstrated that increasing the image resolution of the image models does not notably improve performance. Instead, using textual information significantly increases recognition performance; hence, the image models in our experiments are not able to capture the fine-grained details on their own. An alternative approach is to use only fine-grained image-based recognition methods. However, these methods often require additional annotated data or complex training methods [2, 43]. Another aspect is that two products might be easy to distinguish on all sides except one, making it necessary to use fine-grained recognition techniques only on that side. Our approach handles this by learning which modality to focus on, depending on the input data.

It is also shown that the performance difference between the text models is negligible. However, the size of each model differs significantly; for example, DistilBERT has 40% fewer parameters than BERT. This indicates that the complexity of the text models could be reduced further. The default settings are used in the experiments, but several parameters can be configured to further reduce the complexity of the models, for example, the number of attention heads, the size of the encoder layers, and the number of hidden layers. Another approach is to use even simpler text models, such as the MobileBERT model [96]. Reducing the maximum sequence length from 256 to 128 changed the accuracy by only a few percentage points. Even though this work focuses on products with detailed text elements, this also suggests that our approach can be used in setups where only limited text details are distinguishable. This makes the solution viable for product recognition at SCOs, where products are moved closer to or farther from the camera by the customer during scanning. We also showed that our approach can be utilized in other domains with image and OCR data by achieving state-of-the-art results on the Product Leaflet dataset. In this work, no filtering or data augmentation techniques have been applied to the text models; this is an area to explore in the future.

We also see that the less complex fusion methods perform better in our multimodal models. This is in line with previous results reported by [72]; the authors argue that patterns learned during pre-training are forgotten due to the large errors backpropagated in the initial training iterations. To alleviate this, they suggest training multimodal models with small learning rates using ADAM as the optimizer, which is also applied in this work. Another observation is that the two modalities complement each other well when a small number of training samples is available. When the number of training samples increases, the difference between the best unimodal model and the multimodal model decreases. Even so, a performance gap remains, showing that the image and text modalities complement each other.

Although not better than the simple Feature Concatenation fusion model, EmbraceNet has several properties that make it attractive in a practical environment. For several types of products, such as fruits and vegetables, there is no textual information. This could be handled seamlessly by the EmbraceNet model by activating only the visual features for recognition. With the other fusion models, an image-based model and a multimodal model would have to be trained separately to achieve the same purpose. In [9], it is shown that text-based grocery product recognition models are robust to domain adaptation, while the performance of image-based models degrades significantly. This domain adaptation aspect is a practical problem when a system is installed in a new grocery store or if the camera type or camera position changes in an existing one. Using the EmbraceNet model, only the text model could be used at the beginning, and the image model could be activated once sufficient training data have been acquired in the new environment.

From the analysis of the challenging classes, some products had low recognition performance because the OCR text information could not be extracted sufficiently. To handle these cases, the recommendation is to improve the camera and lighting so that fine-grained images can be captured and used to separate the classes.

The additional computational cost of incorporating OCR for product recognition is significant; this has been identified as a concern for the applicability of these types of models in [97]. However, in many cases it is not a limiting factor because of non-real-time requirements, for example, in shelf monitoring, where images are captured at intervals of several minutes. When barcode switches are identified at SCOs, only one input image is needed, namely, when the product is scanned at the barcode scanner. In streaming applications such as frictionless checkout, our approach can be adapted to specialize in products that are hard to classify. Specifically, the EmbraceNet architecture can be utilized to first classify using the image modality alone. If the result indicates a known hard-to-classify product, the textual information from OCR could be applied to further improve the confidence of the classification. While the above applies to systems with either in-store machine learning servers or cloud computing, edge devices might pose additional challenges. A complete analysis of edge device performance is not within the scope of this paper. However, using the suggestions on how to adapt the text model architecture or even more lightweight versions of BERT, in combination with previous work that uses similar text models on edge devices [98], we argue that our approach is also feasible to adapt and use on edge devices.

To summarize our results, we find that:

  • Combining images and text from OCR significantly increases classification performance, especially with a low number of training samples; the difference is still considerable with a large number of training samples.

  • The size of the input image and the length of the text sequence had a small impact on the multimodal classification results. This indicates that our approach generalizes well, and thus, it can be used in systems with less image resolution.

  • Model selection matters for the image domain, while simpler text models saw only a small drop in performance. This suggests that simpler text models can be used, further reducing computational complexity.

  • Feature concatenation gave the best multimodal classification results, outscoring more complex fusion models. However, we also recommend EmbraceNet due to its practical design for easy deployment.

Our results show that this technology can be applied to many product recognition applications and further improve recognition performance.

7 Conclusions and future work

In this work, we present the FineGrainOCR dataset for fine-grained recognition of grocery products. For this dataset, we propose an approach that uses images and OCR-extracted product text to improve recognition performance. Several experiments are performed that combine image and text models with multimodal fusion methods. Several trade-offs are considered, such as the number of training samples, the image resolution, and the text length. The results show that combining image and text information using multimodal techniques is superior to unimodal models and significantly improves the results on our dataset with a limited number of training samples. In addition, we see that our approach generalizes well and shows state-of-the-art results in another retail domain. In future work, we intend to evaluate our approach in a more realistic open retail environment, such as an SCO. In this environment, multiple products are normally present and overlap each other, requiring a more complex method to handle such overlaps. Additionally, we will tackle the challenge of performing OCR reading and multimodal classification in real time with limited computational power, applying the recommendations proposed in this paper. Finally, we will employ explainable AI techniques to interpret what types of information the image and text models learn compared to the multimodal model.