1 Introduction

In retail environments, understanding consumer behaviour is of great importance and one of the keys to success for retailers [1]. Many efforts have been devoted in particular to monitoring how shoppers move about the retail space and interact with products. This challenge is still open due to several serious problems, including occlusions, appearance changes and dynamic, complex backgrounds [2]. RGB-D cameras are popular sensors for this task because of their affordability, reliability and availability, and several studies have demonstrated their great value (both in accuracy and efficiency) in coping with severe occlusions among humans and complex backgrounds. Additionally, while the retail environment has several favourable characteristics for computer vision (such as reasonable lighting), the large number and diversity of products sold and the potential ambiguity of shopper movements mean that accurately measuring shopper behaviour is still challenging.

The advent of low-cost RGB-D devices, such as Microsoft’s Kinect and Asus’s Xtion Pro Live sensors, has led to a revolution in computer vision and vision-related research. The combination of high-resolution depth and visual information has led to new challenges and opportunities for activity recognition and people tracking for many retail applications based on human–environment interactions.

In several research works, the top-view configuration has been adopted to counter these challenges because it facilitates these tasks and makes it simpler to extract different trajectory features. This setup also increases robustness, because it reduces occlusions among individuals; it has the advantage of preserving privacy, because faces are not recorded by the camera; and it is easier to set up in a retail environment. Furthermore, reliable depth maps can provide valuable additional information that can significantly improve detection and tracking results [3]. Top-view RGB-D applications are the most accurate (with up to 99% accuracy) application type, especially in very crowded scenarios (more than three people per square metre).

Over the past years, machine-learning and feature-based tools have been developed with the aim of learning shopper behaviour in intelligent retail environments. Each application uses RGB-D cameras in a top-view configuration, installed in different locations of a given store, providing large volumes of multidimensional data that can be used to compute statistics and deduce insights [4,5,6]. These data are analysed with the aim of examining attraction (the level of attraction that a store category exerts, based on the ratio between the total number of shoppers who entered the store and those who passed by the category), attention (the amount of time that shoppers spend in front of a brand display) and action (the consumers who go into the store and interact with the products, those who buy a product and those who interact with a product without buying it). People’s re-identification (re-id) across different categories is also crucial to understanding the shopping trip of every customer. Based on these insights, new store layouts could be designed to improve product exposure and placement and to promote products by actively acquiring and maintaining users’ attention [7].

However, after moving into the era of multimedia big data, machine-learning approaches have evolved into deep learning approaches, which are a more powerful and efficient way of dealing with the massive amounts of data generated by modern systems and of coping with the complexities of understanding human behaviour. Deep learning builds on the key features of machine-learning models and takes them one step further by constantly learning new abilities and adjusting existing ones [8].

In this work, a novel VRAI deep learning framework is introduced with the goal of improving our existing applications [4,5,6], and it is suggested that this evolution of machine intelligence provides a solid guide for discovering powerful insights in the current big data era.

According to the Pareto principle, stores are mapped with a focus on targeted Stock Keeping Units (SKUs) that offer greater profit margins. Figure 1 shows a camera installation in one of the stores where our experiments were performed.

This store has an area of about 1500 \(m^2\), and 24 RGB-D (Asus Xtion Pro Live) cameras were installed in it in a top-view configuration without overlapping one another.

To maximise coverage with this relatively small number of cameras, two RGB-D cameras were placed at the store’s entrances to identify and count the shoppers, while the other 22 cameras were placed in front of shelves to measure the shoppers’ attraction, attention and interactions, counting the people and re-identifying them in every top-seller category.

This test installation, together with those in four other stores in Italy, China, Indonesia and the USA, became the basis for the datasets and results presented in this paper, which draw on a two-year experience measuring 10.4 million shoppers and about 3 million interactions.

In order to conduct a comprehensive performance evaluation, it is critical to collect representative datasets. While much progress has been made in recent years in sharing code and datasets, it is of great importance to develop libraries and benchmarks against which state-of-the-art methods can be gauged.

To this end, newly challenging datasets were specifically designed for the tasks described in this study. In fact, each described application involved the construction of one dataset, which was used as the input. Thus, the learning methods described were evaluated according to the following proposed datasets: the Top-View Heads (TVHeads) dataset, the Hands dataset (HaDa) and the Top-View Person Re-Identification 2 (TVPR2) dataset.

Fig. 1

Camera installations in the target store where our experiments are performed. This store has an area of about 1500 \(m^2\) and was covered with a total of 24 RGB-D cameras that were installed in a top-view configuration. In particular, two RGB-D cameras were used for counting and identifying shoppers at the store’s entrances (marked in yellow), and the other 22 cameras, in order to measure shoppers’ attractions, attentions and interactions, were installed in front of shelves, counting the people and re-identifying them in every top-seller category (marked in red)

Based on these evaluation configurations and datasets, a novel VRAI deep learning framework is introduced that uses three CNNs to simultaneously count people passing through the camera area, perform top-view re-id and measure shopper–shelf interactions on a single RGB-D frame. The system is able to process data at 10 frames per second, ensuring high performance even in cases of very brief shopper–shelf interactions.

The proposed methods were evaluated using three new publicly available datasets: TVHeads, HaDa and TVPR2, which are described more thoroughly in the next sections. Experimental results showed that the proposed VRAI networks significantly outperformed all competitive state-of-the-art methods with an accuracy of 99.5% on people counting, 92.6% on interaction classification and 74.5% on re-id.

The paper is organised as follows. Section 2 provides a description of the approaches that have been adopted using RGB-D sensors installed in a top-view configuration. Section 3 describes our approach to evolving our systems towards VRAI deep learning and offers details on the “VRAI datasets”, three new, challenging datasets that are publicly available. In Sect. 4, an extensive comparative evaluation of our approach against the state of the art on the “VRAI datasets” is offered, as well as a detailed analysis of each component of our approach. Finally, in Sect. 5, conclusions and a discussion about future directions for this field of research are drawn.

2 Related work

In this section, the relevant literature concerning human behaviour analysis in crowded environments is reviewed. The section then focuses on the applications, discussing different existing approaches.

2.1 People detection and tracking in crowded environments

Detecting and tracking people in video sequences has recently attracted a great deal of attention in research, with wide application particularly in security surveillance and human–computer interaction [2]. Considerable research has been conducted on robust methods for tracking isolated and small numbers of humans, for which only transient occlusion exists. However, studying and tracking people in crowded situations remains challenging because of persistent occlusions, changes of appearance and complex, dynamic backgrounds, and conventional surveillance technologies often have difficulties interpreting such images [9].

Several early works studied the deployment of conventional video cameras that lack depth information [10]. Among these methods, the most popular are those that extracted spatially global features [11] and those that used statistical learning with local features and boosting, such as edgelets [12] and histograms of oriented gradients [13]. Even though some works reported that these methods can lead to satisfactory detection and tracking results, their performance deteriorates in more challenging applications due to the limitations of conventional cameras in complex-background situations.

To cope with crowded environments, several approaches have proposed the use of depth data produced by stereo rigs [14]. In a stationary camera setting, moving pixels can usually be extracted by change detection techniques, as outlined in [15]. Blobs are defined as groups of moving pixels according to their connectivity. However, although these methods have advantages over those that use conventional cameras, a blob-based analysis may face several difficulties in realistic scenarios. For instance, a single blob may contain multiple humans, a single object may be fragmented into several blobs when the colour contrast is low, and blobs may contain pixels corresponding to shadows or reflections caused by the moving objects. Several approaches have been proposed to help solve these problems. For example, [16] studied vertical blob projection in order to segment a large blob into multiple humans, and [17] analysed foreground boundaries in order to detect head candidates. However, in highly crowded environments, these methods may not be successful.

More recently, the quality of depth maps has greatly improved through the use of depth cameras such as Kinect [18], Xtion and TOF (time-of-flight) cameras, which are available at affordable prices. These cameras have demonstrated great value (efficiency-wise and accuracy-wise) in coping with severe occlusions among humans and complex backgrounds [19]. Usually, depth cameras are placed either vertically overhead [3, 20] or horizontally at the same level as humans [21]. Recent works have shown that the use of RGB-D cameras and depth maps can provide valuable additional information that significantly improves tracking and detection results [2, 3].

Scene understanding is one of the main problems in computer vision, and semantic segmentation is a high-level task that paves the way towards it. In the past, these problems were addressed by various traditional computer vision and machine-learning techniques. More recently, deep learning architectures and CNNs have been proposed and are becoming more popular due to their improved accuracy and efficiency compared to earlier techniques [5, 22,23,24,25]. Semantic segmentation refers to the understanding of an image at the pixel level: each pixel of the image has to be assigned to an object class. In the literature, the widely known networks that have made significant contributions in the deep learning field are AlexNet [26], VGG-16 [27], GoogleNet [28] and ResNet [29]. Today, these are exploited as building blocks for many segmentation architectures.

The current milestone in deep learning techniques for semantic segmentation appears to be the fully convolutional network (FCN), as described by Long et al. [30]. Contemporary classification networks are adapted into FCNs, and they transfer their learned representations by fine-tuning them according to the segmentation task. A significant improvement has been achieved in segmentation accuracy through this method compared to traditional methods on standard datasets, such as PASCAL VOC, while also preserving inference efficiencies [31]. Although the FCN model is known for its power and flexibility, it still lacks various features. For instance, its inherent spatial invariance does not take into account useful global context information, it exhibits no instance-awareness by default, its efficiency has not yet reached real-time execution at high resolutions, and it is not completely suitable for unstructured data. Novel, state-of-the-art solutions such as SegNet, DeepLab, ENet and U-Net have been proposed in the literature in an attempt to overcome these challenges [31].

2.2 People counting

People counting can be classified into two categories: counting the number of people in a certain area and counting the number of people passing through a passage [32]. This work deals with the second category, i.e. counting people passing through a door or passage. In this regard, the studies are mainly vision-based, although some use radio frequency (RF) signals, such as Wi-Fi, ZigBee and UWB [33]. Studies on counting people in a given area have mainly focused on the characteristics and detection of the person, defining indexes that tend to change with the number of people and estimating the number of people from these indexes. Instead, when the goal is counting people passing under a door or gate, the crossing direction must also be taken into account along with the detection [34].

Recent works have demonstrated the validity of CNNs for density map estimation in single images for crowd counting. The first researchers to adopt CNN-based methods for crowd density estimation were Wang et al. [35] and Fu et al. [36]. At the same time, in [37], the authors proposed a cross-scene crowd counting method. The key idea is to map images to crowd counts and adapt this mapping to new target scenes for cross-scene counting. However, the drawback of this method is the need for perspective maps for both training and test scenes.

Del Pizzo et al. [38] described a vision-based approach for counting the number of persons who cross a virtual line. In their work, they analysed the video stream acquired by a camera mounted in a zenithal position with respect to the counting line, determining the number of persons crossing the virtual line and the crossing direction of each person.

In [39], an image representation method is proposed that combines semantic attributes and spatial cues to increase the discriminative power of the feature representation. Shang et al. presented an end-to-end network composed of a CNN model and an LSTM decoder to predict the number of people [40]. A deep spatial regression model based on a CNN and an LSTM for counting the number of people present in a still image with arbitrary perspective and arbitrary resolution is described in [41].

Recently, different approaches have focused on combining people counting with additional cues, such as detection, attention, localisation and synthetic data [42]. In [43], the authors presented a large synthetic crowd counting dataset, as well as a spatial fully convolutional network to improve real-world performance with synthetic data. All these approaches have achieved considerable success in crowd counting. However, these single-image crowd counting methods may lead to inconsistent head counts across neighbouring frames in video crowd counting. Liciotti et al. introduced a novel modified U-Net architecture for head segmentation, first modifying the U-Net into U-Net 3 and then making it even more robust and efficient [5].

2.3 Human–object interactions

This topic is essentially concerned with engineering feature extraction methods that can promptly detect and represent motion in the input sequence of frames. Many approaches have been introduced, from the early space-time pyramids [44] to recent attempts to replace hand-crafted features with deep neural network feature extractors. In particular, human–object interaction is a considerable problem in computer vision. However, the types of interactions studied mostly concern sports activities [45], cooking [46] or everyday activities [47].

In the retail environment, shopper interactions are completely different. Firstly, all the objects (e.g. products) can be considered a single object category without hurting the recognition. Secondly, only their movement in the scene helps distinguish some activities (picking from the shelf vs putting back). In [48], the authors proposed a method to recognise actions involving objects, using graph neural networks to fuse human pose and object pose data for action recognition from surveillance cameras. However, their approach is heavily reliant on the quality of the input pose information. In [49], the authors introduced a framework for integrating human pose and object motion to both temporally detect and classify activities in a fine-grained manner. They combined partial human pose and interaction with the objects in a multi-stream neural network architecture to guide the spatiotemporal attention mechanism for more efficient activity recognition. They also proposed using a generative adversarial network (GAN) to generate exact joint locations from noisy probability heat maps. Furthermore, they integrated a second stream of object motion into their network as prior knowledge, which they quantitatively show improves the recognition results. The approach was applied to the MERL dataset, composed of six activities, namely: “reach the shelf”, “retract from the shelf”, “hand in the shelf”, “inspect the product”, “inspect the shelf” and the background (or no action) class. It was recorded using a roof-mounted camera to simulate a real shopping store environment, albeit not an actual store. In contrast, our proposed approach is evaluated on a dataset collected in a real retail environment.

2.4 Person re-identification

Common re-id approaches are generally based on frontal image datasets, but sensors installed in a top-view configuration have proven especially effective in crowded environments [5].

Many recent works have addressed the person re-id problem, focusing mainly on either developing new descriptors for a person’s appearance or on learning techniques for person re-id [50]. The problem of appearance models for person recognition, re-acquisition and tracking was first introduced in [51]. The authors propose the cumulative match curve (CMC) as the performance evaluation metric and introduce the viewpoint invariant pedestrian recognition (VIPeR) dataset for re-id. Regarding discriminating features, hue–saturation–value (HSV) and red–green–blue (RGB) histogram colours were initially used due to their robustness to variations in resolution and perspective [52]. Later on, anthropometric measures were proposed based on calculating the physical conformation values of people, such as their heights, which were estimated using RGB cameras [53]. Another example is the use of anthropometric measurements in combination with clothing descriptors, with both extracted through RGB-D cameras [54]. An estimation of the body’s pose to guide feature extraction was proposed in [55]. With performance similar to the former example, another state-of-the-art approach proposed an appearance model that does not rely on body parts but is based on a descriptor called the Mean Riemannian Covariance Grid [56].

Liciotti et al. [6, 57] proposed a method to extract anthropometric features through image processing techniques and then train machine-learning algorithms for re-id tasks. Their tests were carried out on a dataset of 100 people acquired using a top-view RGB-D camera.

Haque et al. [58] developed an attention-based model that deduces human body shape and motion dynamics by using depth information. Their approach was a combination of convolutional and recurrent neural networks leveraging unique 4-D spatiotemporal signatures to identify small discriminative regions indicative of a person’s identity. Their tests were carried out on the DPI-T dataset, which consisted of 12 persons appearing in 25 videos while wearing different sets of clothing and holding different objects.

In [59], the authors started with a two-stream convolutional neural network (CNN) (one for RGB and one for depth) and a final fusion layer. They improved on this approach with a multi-modal attention network [60], adding an attention module to extract local and discriminative features that were fused with globally extracted features. In another work, Lejbolle et al. [61] presented a SLATT network with two types of attention modules (one spatial and one layer-wise). The authors also collected the OPR dataset in a university canteen, composed of 64 persons each captured twice (entering and leaving a room). However, these datasets are not publicly available. The approach used in the current work, top-view re-id, matches the one proposed by the authors in [6].

2.5 Comparisons and contributions

In this work, a novel VRAI deep learning approach is introduced into existing applications [4,5,6], evolving the machine-learning and features-based methods into deep learning methods, which deal more powerfully and efficiently with the massive amounts of data generated by modern systems.

The main contributions of this paper with respect to the state of the art are: i) solutions for real retail environments with great variability in the acquired data, derived from extensive experience over 10.4 million shoppers observed during two years in different types of stores and in different countries; and ii) an initial integrated framework for the deep understanding of shopper behaviours in crowded environments with high accuracy, currently limited to counting and re-identifying people passing by and analysing their interactions with shelves. Three datasets are also provided that are publicly available to the scientific community for testing and comparing different approaches.

Table 1 is a comparative overview of the key differences between the previous approaches and the proposed VRAI method. The triple deep network guarantees a single identification for every user entering the area (even with multiple cameras) and an association between the user identification and their interactions directly at the frame level, also ensuring multi-user identification and a correct association of shopper–shelf interactions with every user in the scene. Finally, the common frame pre-processing flow and the triple deep network inferences on the same input frame allow parallelisation on multi-core or multi-CPU architectures, ensuring a low computational time for the whole framework; the system is able to process data at 10 frames per second, ensuring high performance even in cases of very fast shopper–shelf interactions. The networks use the same data stream and a common pre-processing phase that produces a proper input for each network.

Table 1 Comparative overview table with the key differences between the different previous approaches and the VRAI proposed method (PC stands for people counting and HOI stands for human–object interactions)
Fig. 2

VRAI framework for the deep understanding of shoppers’ behaviour

3 From a geometric and features-based approach to VRAI deep learning

In this section, this evolution process as well as the datasets used for the evaluation are described. The framework is depicted in Fig. 2 and comprises three main systems: people counting [5], interaction classification [4] and re-id [6]. Specially designed new VRAI-Nets are used: VRAI-Net 1, VRAI-Net 2 and VRAI-Net 3, which are applied to every frame coming from every RGB-D camera in the store in order to move these systems towards deep learning. In fact, to address increasingly complex problems and when dealing with big data, deep learning approaches can provide a powerful framework for supervised learning. For example, when measuring everything that happens in front of the shelf, many fake interactions can occur because of unintentional movements or new objects appearing in the scene; thus, an object that appears to be present may not actually be present.

Further details pertaining to this are described in the following subsections.

The VRAI framework is comprehensively evaluated on the new “VRAI datasets”, collected for this work. The details of the data collection and ground truth labelling are discussed in Sect. 3.4.

3.1 VRAI-Net 1 for people counting

Semantic segmentation has proven to be a high-level task when dealing with 2D images, 2D videos and even 3D data [24]. It paves the way towards the complete understanding of scenes and is being tackled using deep learning architectures, primarily deep convolutional neural networks (DCNNs), because they perform more accurately and sometimes even more efficiently than machine-learning and features-based approaches [23]. An efficient segmentation leads to the complete understanding of a scene; moreover, since the segmentation of an image takes place at the pixel level, each object has to be assigned to a class, and thus its boundaries are uniquely defined. To obtain a high-quality output, a novel VRAI-Net 1 has been designed.

VRAI-Net 1 includes a batch normalisation layer at the end of each layer, after the first rectified linear unit (ReLU) activation function and after each max pooling and upsampling function. In this way, it obtains better training performance and yields more precise segmentations. Furthermore, the classification layer is modified: it is composed of two convolutional layers with hard sigmoid functions. This block is faster to compute than a simple sigmoid activation, and it maps the features of the previous layer according to the desired number of classes. Compared to U-Net [63], the number of filters of the first convolution block was halved in the current study, obtaining a simpler network that goes from 7.8 million parameters to 2 million.

The VRAI-Net 1 architecture is shown in Fig. 3.

VRAI-Net 1 is even more robust and efficient than the work proposed in [5]. In the current work, the expansive path of the network was replaced by a refinement procedure. This procedure is composed of four layers, and each layer combines two types of feature maps. It is basically formed of two branches: the first uses an up-convolution layer to up-sample the activations of the previous layer and a ReLU function to avoid the vanishing gradient problem, while the second joins the corresponding layer of the contracting path with a dropout layer to prevent overfitting. These two branches are merged through an equivalent layer thickness (ELT) layer, which computes the element-wise sum of their outputs. The output of each refinement layer is the input of the first branch of the next refinement layer. This network also uses a particular dropout technique instead of the standard one, which is based on the random zeroing of individual activations: the spatial dropout method of [64] has been implemented, performing standard Bernoulli trials during the training phase and then propagating the dropout value over the entire feature map. In this way, a dropout with a 50% ratio zeroes half the channels of the previous layer. Dropping spatial correlations is aimed at increasing the robustness of our network in a shorter amount of time than the standard method. Finally, the channels of the layers of the contraction part were increased by a factor of four compared to the corresponding layers of the expansive part, and a \( 1 \times 1 \) convolutional layer with a single channel was added both between the two sides and at the end of the expansive part. This provides a good trade-off between computational efficiency and segmentation quality, since the first part of the network processes sufficiently large feature maps compared to the second part, while the latter still maintains a suitable number of parameters to perform a good up-sampling.
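For concreteness, the following is a minimal sketch of one such refinement layer in PyTorch (the framework is an assumption, as are the channel widths and the \( 1 \times 1 \) projection used to match the skip-connection channels): an up-convolution branch, a skip-connection branch with spatial dropout, and an element-wise sum acting as the ELT merge.

```python
# A minimal sketch (PyTorch, assumed framework) of one refinement layer: an
# up-convolution branch and a skip-connection branch with spatial dropout,
# merged by an element-wise sum ("ELT"). Channel sizes are illustrative.
import torch
import torch.nn as nn

class RefinementLayer(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # Branch 1: up-sample the activations of the previous layer.
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.relu = nn.ReLU(inplace=True)
        # Branch 2: project the contracting-path features to the same channel
        # count (illustrative assumption) and apply spatial dropout.
        self.proj = nn.Conv2d(skip_ch, out_ch, kernel_size=1)
        self.drop = nn.Dropout2d(p=0.5)  # zeroes entire feature maps

    def forward(self, x, skip):
        a = self.relu(self.up(x))
        b = self.drop(self.proj(skip))
        return a + b  # element-wise sum (ELT)

# Example: decoder features (256 ch, 16x16) merged with contracting features
# (1024 ch, 32x32, i.e. four times more channels than the expansive side).
layer = RefinementLayer(in_ch=256, skip_ch=1024, out_ch=256)
out = layer(torch.randn(1, 256, 16, 16), torch.randn(1, 1024, 32, 32))
print(out.shape)  # torch.Size([1, 256, 32, 32])
```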

Fig. 3

VRAI-Net 1 architecture. It is composed of two main parts: a contracting path (left side) and an expansive path (right side). Its main contribution is its use of a refinement process in the expansive path; each step is a combination of two branches: one from the upsampling and the other from the corresponding layer of the contracting path. The combination is performed using an element-wise sum. Another improvement is the use of spatial dropout layers instead of standard ones, which are aimed towards improving the robustness of the network in a shorter amount of time

3.2 VRAI-Net 2 for interaction classification

In [4], a “virtual wall” (threshold) is placed in front of the shelf with the help of the depth sensor’s 3D coordinate system. When the shopper crosses this wall forward and backward, ideally with a hand, to interact with a product, a region of interest (ROI) is cropped from the colour frame and analysed. People detection and tracking are performed in the depth stream, while the final step of the interaction analysis uses the colour frame. A depth-based hand detection method is used for this analysis: the system searches for every object that intersects the “virtual wall” in the depth channel, and a crop of the RGB channels is performed around the intersection point with a resolution of \( 80 \times 80 \) px. The final step of the interaction analysis, based on the proposed deep network, uses only the colour frame. The analysis thus involves two images per interaction: the first when entering the “virtual wall” and the second when exiting. Interactions can occasionally be performed by the shoulder or other parts of the body; these kinds of interactions are classified as neutral. The depth frame is also used for the interaction classification. After background subtraction, the classification was made using geometric features, calculating the difference in area between the ROIs (ideally the hand with or without a product). To clarify this concept, if the ROI of the hand exiting the wall has a bigger area than the entering one, the shopper has taken something from the shelf, and thus the interaction is positive. The aforementioned methods cannot, for example, distinguish between real and unintentional interactions (e.g. a shopper unintentionally crossing the “virtual wall” with their body or even with a shopping cart). Another issue is that they cannot distinguish between real customers and store staff. In order to understand shopper behaviour, the interactions performed by store employees, for example while refilling a shelf, must be filtered out. In this paper, a new interaction class is therefore introduced: “Refill”, which indicates an interaction performed by a staff member instead of a customer. The depth information is also crucial in this approach, because it makes it easy to match a shopper interaction with a product by looking at the real-world coordinates of the action. A step towards deep learning was necessary: the key idea is to build on the aforementioned methods, classify the interaction colour-image ROIs independently with a deep learning approach and combine the predictions into the interaction classification. Since no public datasets with this scope were available in the literature, the new HaDa dataset was collected in real stores to train a new neural network.
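As an illustration, the following is a minimal sketch of the “virtual wall” crossing test and the \( 80 \times 80 \) RGB crop described above (NumPy is an assumed tool; the depth threshold and the centroid-based localisation of the intersection point are illustrative assumptions, not the published implementation).

```python
# A minimal sketch of the "virtual wall" check: pixels closer than a threshold
# are treated as a crossing and an 80x80 RGB ROI is cropped around the
# intersection point. Threshold and centroid localisation are assumptions.
import numpy as np

def crop_interaction_roi(depth, rgb, wall_depth_mm=900, roi=80):
    """Return an 80x80 RGB crop around the virtual-wall crossing, or None."""
    crossing = depth < wall_depth_mm          # objects in front of the virtual wall
    if not crossing.any():
        return None
    ys, xs = np.nonzero(crossing)
    cy, cx = int(ys.mean()), int(xs.mean())   # centroid of the crossing region
    half = roi // 2
    y0 = int(np.clip(cy - half, 0, rgb.shape[0] - roi))
    x0 = int(np.clip(cx - half, 0, rgb.shape[1] - roi))
    return rgb[y0:y0 + roi, x0:x0 + roi]

# Usage: one crop is taken when the hand enters the wall and one when it exits;
# the two ROIs are then classified independently and the predictions combined.
```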

In this context, VRAI-Net 2 was designed as an efficient architecture for classifying shopper interactions. The architecture can be adapted to classify either three (negative, neutral and positive) or four (negative, neutral, positive and refill) classes by simply modifying the last layer. The design of the network is based on the key idea of the inception module defined in [28] and also uses the improvements described in [65] and [66]. The aim was to scale up the network while reducing the number of parameters and the computational power required. In a typical CNN layer, one would choose either a stack of \( 3 \times 3 \) filters, a stack of \( 5 \times 5 \) filters or a max pooling layer; all of these are beneficial for the modelling power of the network. The inception module suggests using all of them: the outputs of all these filters are concatenated and passed as input to the next layer.
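A minimal PyTorch sketch of such an inception-style module is given below (the framework and the branch widths are assumptions made for illustration); the three parallel branches are concatenated along the channel axis.

```python
# A minimal sketch (PyTorch, assumed framework) of an inception-style module:
# parallel 3x3 and 5x5 convolution branches and a max pooling branch whose
# outputs are concatenated channel-wise. Branch widths are illustrative.
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, b3, b5, bp):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, b3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, b5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, bp, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat(
            [self.branch3(x), self.branch5(x), self.branch_pool(x)], dim=1)

# Example: an 80x80 RGB interaction crop passed through one module.
m = InceptionModule(3, b3=32, b5=16, bp=16)
print(m(torch.randn(1, 3, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])
```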

For this task, the main idea of a typical convolutional network architecture was followed, in which going deeper into the net also means subsampling. A max pooling layer of \( 2 \times 2 \) (stride 2) is used after every two inception modules. At the same time, the number of feature maps learned in each module increases: in this way, the spatial resolution is halved while the number of feature channels is doubled.

The main architecture of VRAI-Net 2 is composed of two inception modules followed by a max pooling layer of \( 2 \times 2 \) (stride 2) and one further inception module.

The number of modules was chosen so as to optimally process the images of our dataset: starting from \( 80 \times 80 \) pixel images, feature maps with growing volumes are extracted step by step, up to dimensions (width and height) that are neither too small nor too large, until the classification layer is reached. In this way, the number of learned parameters is kept reasonably low.

The last block of the network was used to map the features learned in the desired number of classes. Usually, only fully connected layers are used; however, these are very expensive in terms of learned parameters. Thus, it has been decided to use a global average pooling (GAP) layer after the last module, as in [67]. The GAP layer calculates the average of each feature map, and these values are fed directly into a softmax layer. This can remove the need for fully connected layers in a CNN-based classifier. It is considered to be a structural regulariser of CNNs, transforming feature activations into confidence maps by creating correspondences between features and classes. It also allows for a significant reduction of the parameters learned, compared to the parameters of the fully connected layers.

To speed up learning and increase the stability of the neural network, batch normalisation has been added after each layer [68]. The advantages are manifold. First, higher learning rates can be used because batch normalisation ensures that no activation goes extremely high or extremely low. Second, batch normalisation reduces overfitting because it has a slight regularising effect.

However, it is important not to depend solely on batch normalisation for regularisation; it should be used together with dropout. Thus, a dropout layer (rate 50%) has been added before the classification layer. The VRAI-Net 2 architecture is shown and described in Fig. 4.
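Putting these pieces together, a minimal sketch of the overall classifier could look as follows (reusing the InceptionModule sketched above; the channel widths, the placement of batch normalisation and the 4-class output are illustrative assumptions). A final \( 1 \times 1 \) convolution maps the features to one channel per class so that the GAP output feeds the softmax directly, with no fully connected layer.

```python
# A minimal sketch (PyTorch, assumed framework) assembling the blocks described
# above: two inception modules, a 2x2 max pooling (stride 2), a third module,
# a 50% dropout, a 1x1 convolution acting as the classification layer, global
# average pooling and softmax. Widths and BN placement are illustrative.
import torch.nn as nn

def vrai_net_2_sketch(num_classes=4):
    return nn.Sequential(
        InceptionModule(3, 32, 16, 16), nn.BatchNorm2d(64),
        InceptionModule(64, 64, 32, 32), nn.BatchNorm2d(128),
        nn.MaxPool2d(kernel_size=2, stride=2),
        InceptionModule(128, 128, 64, 64), nn.BatchNorm2d(256),
        nn.Dropout(p=0.5),
        nn.Conv2d(256, num_classes, kernel_size=1),  # one channel per class
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # global average pooling
        nn.Softmax(dim=1))
```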

Fig. 4

VRAI-Net 2 architecture. It is composed of two inception modules followed by a max pooling layer of \( 2 \times 2 \) (stride 2) and one inception module. The last block of the network is used to map the learned features into the desired number of classes. A GAP layer is used after the last module, and its values are fed directly into a softmax layer. This can remove the need for fully connected layers in a CNN-based classifier

VRAI-Net 1 and 2 share only the pre-processing layers, and to limit computational overload, VRAI-Net 2 is only activated when a person is detected and their hand interacts with the shelf.

3.3 VRAI-Net 3 for Top-View Person Re-Identification

As stated in the previous sections, the RGB-D cameras installed in the stores were devoted not only to counting and classifying the interactions but also to re-identifying the customers.

In a previous work [69], a Top-View Person Re-id (TVPR) dataset was built that contained videos of 100 persons recorded from an RGB-D camera in a top-view configuration. An Asus Xtion Pro Live RGB-D camera was chosen because it allows the acquisition of colour and depth information in an affordable and fast way. The camera was installed on the ceiling above the area to be analysed. The current work followed the same procedure adopted in [6] for re-identifying customers in the store. In particular, with this methodology, important statistics on the shoppers have been extracted, including the time spent in the store, the products chosen by the same customer and the shelf attraction times. The approach recognises people from RGB-D images and consists of two steps: person detection and person identification. Person detection is carried out on the depth channel, using an algorithm that locates people within frames and crops each person with a \( 150 \times 150 \) pixel bounding box, with a threshold on people’s minimum height. In this way, it is possible to remove the noise produced by the frame background and focus only on the relevant details of every single image, i.e. the person. The \( 150 \times 150 \) size was chosen experimentally, since the people in our dataset had average dimensions between \( 80 \times 80 \) and \( 125 \times 125 \) pixels (Fig. 5).
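The following is a minimal sketch of this depth-based detection and cropping step (NumPy/OpenCV are assumed tools; the ceiling height, minimum-height threshold and connected-component localisation are illustrative assumptions rather than the published algorithm).

```python
# A minimal sketch of depth-based person detection and cropping: pixels above a
# minimum-height threshold are segmented, each connected component is treated as
# a candidate person and a fixed 150x150 crop is taken around its centroid.
import cv2
import numpy as np

def crop_people(depth_mm, ceiling_mm=3000, min_height_mm=1000, size=150):
    height_map = ceiling_mm - depth_mm                  # height above the floor
    mask = (height_map > min_height_mm).astype(np.uint8)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    crops, half = [], size // 2
    for i in range(1, n):                               # label 0 is the background
        cx, cy = centroids[i].astype(int)
        y0 = int(np.clip(cy - half, 0, depth_mm.shape[0] - size))
        x0 = int(np.clip(cx - half, 0, depth_mm.shape[1] - size))
        crops.append(depth_mm[y0:y0 + size, x0:x0 + size])
    return crops
```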

In the second step, a novel architecture called VRAI-Net 3 was designed to carry out the identification of the people. This network is based on a classic DCNN architecture used for classification tasks, which in turn is based on the same concepts as VRAI-Net 2. The network was adapted to process \( 150 \times 150 \) pixel images by adding several inception layers followed by a max pooling layer. The network thus became deeper, increasing the number of features learned and improving the accuracy of the classification. In addition, the classification layer was adapted to classify 1000 classes. Figure 6 depicts the VRAI network chosen for the re-id process.

The re-id phase allows the creation of an intermediate dataset that can be used to feed our DCNN and better perform the training. To increase accuracy, the dataset is balanced, maintaining a constant number of frames for each person in both the training and validation sets. In particular, the balanced dataset for 1000 people has a training set of 22 frames per person for each of the 1000 people, i.e. 22,000 frames, and a testing set of 22 frames per person, i.e. another 22,000 frames. The data augmentation (Fig. 5) yields 1,320,000 frames and is performed using the following operations (a minimal sketch follows the list):

  • image flipping, left to right and top to bottom;

  • image rotation by \( 90^\circ \), \( 180^\circ \) and \( 270^\circ \);

  • \( 3 \times 3 \) crops (crop size \( 130 \times 130 \), stride 10 pixels, 3 horizontal steps × 3 vertical steps, with each crop resized to \( 150 \times 150 \)).
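A minimal sketch of these augmentation operations (NumPy/OpenCV assumed) is given below; it generates the flipped, rotated and cropped-and-resized variants of a single \( 150 \times 150 \) frame.

```python
# A minimal sketch of the listed augmentations: horizontal and vertical flips,
# 90/180/270 degree rotations and a 3x3 grid of 130x130 crops (stride 10)
# resized back to 150x150.
import cv2
import numpy as np

def augment(img):
    out = [np.fliplr(img), np.flipud(img)]                 # flips
    out += [np.rot90(img, k) for k in (1, 2, 3)]           # 90, 180, 270 degrees
    for dy in range(3):                                    # 3x3 grid of crops
        for dx in range(3):
            y0, x0 = dy * 10, dx * 10
            crop = img[y0:y0 + 130, x0:x0 + 130]
            out.append(cv2.resize(crop, (150, 150)))
    return out  # 2 + 3 + 9 = 14 augmented variants per frame
```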

The proposed number is intended as the maximum number of simultaneously enrolled people in the re-id gallery (closed-world setting), and it is a credible number for a week-based re-id process in a store. Moreover, the system can be retrained for every new customer and, in a real installation in Germany reported in the last part of the paper, a specialised processing unit is able to train the network within a suitable time frame, given also the very slow dynamics of a store visit (the average dwell time is 18 minutes in the test store). Even if the deep network is retrained for every new customer, the total number of different people in the gallery remains constant (1000) to ensure a high and stable re-id accuracy over time.

Fig. 5

Data augmentation

Fig. 6

The person re-identification workflow consists of two steps: person detection and person identification. Person detection is carried out on the depth channel. For the identification, the VRAI-Net 3 architecture was designed. The network was adapted to process \( 150 \times 150 \) pixel images by adding several inception layers followed by a max pooling layer. In addition, the classification layer was adapted to classify 1000 classes based on our TVPR2 dataset

3.4 VRAI datasets

In this work, the first study on understanding shoppers’ behaviours using an RGB-D camera installed in a top-view configuration is provided. As discussed in Sect. 1, the top-view configuration was chosen over a front-view configuration because it reduces the problem of occlusions and has the advantage of preserving privacy, since faces are not recorded by the camera.

Three new datasets are constructed from the images and videos acquired from the RGB-D cameras that were installed in a top-view configuration in different areas of the target store. The “VRAI datasets” were composed of the following three datasets:

The TVHeads dataset contained 1815 depth images (16 bit) with dimensions of \(320\times 240\) pixels captured from an RGB-D camera in a top-view configuration. The images collected in this dataset represented a crowded retail environment with at least three people per square metre and physical contact between them.

Following the pre-processing phase, a suitable scaling method was applied to the images, which allowed us to switch from the original 16 bits per pixel to 8. In this way, it is possible to obtain a more clearly highlighted profile of the heads, improving the contrast and brightness of the image. The ground truth for head detection was manually labelled by human operators.
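A minimal sketch of such a scaling step is shown below (NumPy assumed); the min-max normalisation is an illustrative assumption about the exact scaling method, whose effect is to stretch contrast before quantising to 8 bits.

```python
# A minimal sketch of 16-bit to 8-bit depth scaling: values are normalised over
# their observed range before quantisation, which stretches contrast and
# highlights the head profiles. Min-max normalisation is an assumption.
import numpy as np

def depth16_to_8(depth16):
    d = depth16.astype(np.float32)
    d_min, d_max = d.min(), d.max()
    scaled = (d - d_min) / max(d_max - d_min, 1.0)   # normalise to [0, 1]
    return (scaled * 255).astype(np.uint8)           # quantise to 8 bits per pixel
```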

Figure 7 shows an example of a dataset instance that includes the two images described above (8-bit depth image and the corresponding ground truth).

Fig. 7

TVHeads dataset. It consists of an 8-bit scaled depth image (a) and the corresponding ground truth (b)

The HaDa dataset was composed of 13,856 manually labelled frames. These frames were of the same type as those used in the aforementioned features-based approach, resulting in a total of four images for each interaction (first RGB, first depth, last RGB and last depth). This dataset was acquired in a real retail environment over a period of three months using seven different cameras located in four different shelf categories (Chips, Women’s Care, Baby Care and Spirits) above a total of ten shelves.

Frames were labelled with the following four interaction classes:

  • Positive: images that show a hand holding a product;

  • Negative: images that show only a hand;

  • Neutral: images in which the customer is not interacting with the shelf; and

  • Refill: images that indicate a refill action, which happens every time a box filled with products is visible. This class has a “priority” over the others. (For instance, if there is a hand holding a product and a box containing the same products in the same image, the class is deemed “refill” and not “positive”.)

Figure 8 depicts four samples of the HaDa dataset classes.

Fig. 8

HaDa dataset

The third dataset is TVPR2. These data were collected following the procedure outlined in [6], which described near-realistic settings. This new dataset enables research in multiple directions, including deep learning, large-scale metric learning, multiple query techniques and search re-ranking. The dataset was composed of 235 videos containing RGB and depth channels. Each video recorded people on the forward path (left to right) for half the time and the same people on the return path (right to left) for the other half, though not necessarily in that order. The number of people present in the videos varied from one to eleven, and the total number of people in this dataset was 1027.

Table 2 briefly summarises the characteristics of the data collected for the VRAI datasets.

Table 2 VRAI datasets

4 Results and discussion

Systems based on the functionality described here have been deployed in a number of stores around the world, and many have been in operation for over two years. This paper focuses specifically on the analysis of a supermarket. Several days of video data were recorded from 24 cameras (2 used solely as counters and 22 used for counting, interaction classification and re-id) and processed by the system. Computation resources are a key factor in keeping the processing on the edge with low-cost embedded hardware, which is desirable in real retail applications (i.e. a high number of stores, a high number of categories, etc.). In the following subsections, the results of our VRAI deep learning framework are evaluated and compared with state-of-the-art approaches.

4.1 People counting

In this subsection, the results of the experiments conducted using the TVHeads dataset are reported. In addition to the performance of VRAI-Net 1, the performance of different CNN-based approaches from the literature, such as SegNet [70], ResNet [29], FractalNet [71], U-Net, U-Net 2 [63, 72] and U-Net 3 [5], is also presented for the problem of head image segmentation.

Each DCNN is trained using two types of depth images to highlight head silhouettes: 16-bit (original depth images) and 8-bit (scaled images); the scaling improves each image’s contrast and brightness. In the training phase, the dataset is split into training and validation sets with a ratio of 10%. The learning process is conducted for 200 epochs using a learning rate of 0.001 and the Adam optimisation algorithm. Semantic segmentation performances are shown in Table 3, which reports the Jaccard [73] and Dice [74] indices for training and validation, respectively, as well as the results in terms of accuracy, precision, recall and F1-score. These metrics mainly concern the quality of the segmentation rather than the counting of people. However, the Dice and Jaccard metrics are based on the area of the predicted segmentation compared to the ground truth; since their values are very good, the segmentation is very accurate. People can then be counted from the predicted segmentation mask by simply using an image processing algorithm that detects and counts the contours of the segmented areas (a minimal sketch is given below). As can be inferred, the current study’s VRAI-Net 1 network outperformed the state-of-the-art networks in terms of the Jaccard and Dice indices and in terms of accuracy. VRAI-Net 1 reached 0.9381 for the Jaccard index and 0.9691 for the Dice index, while its accuracy reached 0.9951, demonstrating the effectiveness and suitability of the proposed approach. The comparison shows that VRAI-Net 1 performed better than the previous U-Net designs. Among the various tests performed, the best performance was obtained mainly with images scaled to 8 bits.
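The following is a minimal sketch of these evaluation and counting steps (NumPy/OpenCV assumed): the Jaccard and Dice indices computed between a predicted and a ground-truth mask, and the head count obtained from the contours of the predicted mask.

```python
# A minimal sketch of the segmentation metrics and the contour-based people
# counting described above.
import cv2
import numpy as np

def jaccard_dice(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    jaccard = inter / union if union else 1.0
    dice = 2 * inter / (pred.sum() + gt.sum()) if (pred.sum() + gt.sum()) else 1.0
    return jaccard, dice

def count_heads(pred_mask):
    # OpenCV 4.x signature: returns (contours, hierarchy)
    contours, _ = cv2.findContours(
        pred_mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return len(contours)   # one external contour per segmented head
```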

Table 3 Jaccard and Dice indices comparison and segmentation results obtained for different DCNN architectures

The deep learning results obtained are compared with image processing algorithms: multi-level segmentation [75] and water filling [76]. Table 4 shows the results of these algorithms in terms of precision, recall and F1-score. The algorithms reach lower performance values; in particular, when the heads are along the edge of the image, their accuracy decreases. The multi-level segmentation algorithm appears more accurate than the water filling algorithm.

Table 4 Image processing algorithms performances

The main applications of accurate people counting in crowded scenarios related to the shopper marketing area are: i) accurate funnel evaluation at the store and category level, starting from people entering the space; ii) store and category A/B testing for performance comparison; and iii) store flow modelling, also for high-traffic areas (e.g. promo areas). To better understand the applications of the proposed method and its high impact on the shopper marketing area, other aggregated results of the proposed framework are reported in Sect. 4.4.

4.2 Interaction classification

To evaluate the HaDa dataset, the first and last frames of each interaction are classified independently, and these predictions are combined to obtain the aforementioned interaction type. Four different networks were tested to determine the best results. These networks included a classic CNN whose core structure was essentially the same as that of the LeNet architectures introduced in the late 1980s by LeCun et al. [77], AlexNet [26] and CaffeNet [78]. Then, the CNN structure was modified by duplicating the main block, which was composed of two convolution layers and a max pooling layer, obtaining a deeper network denoted \( CNN^2 \).

Table 5 outlines the classification results for the interaction frames according to the classes defined in Sect. 3.2. From the same test set, each type of interaction was extracted and compared with the features-based approach, using the test set labels as the ground truth. The test set was composed of 784 images; 624 of them were paired, leading to a total of 312 interactions. By combining the labels given to the frames involved in each pair, the type of each interaction was determined. If at least one frame in a pair was labelled as “neutral”, the interaction could be excluded, as it was not a real interaction. (These fake interactions are discussed in Sect. 3.2.) This led to the first important result: of a total of 312 interactions, 69 (22%) were fake and could at that point be excluded by the deep learning approach, while they had earlier been misinterpreted by the features-based approaches. After excluding the fake interactions and the interactions labelled “refill” (in the features-based approach, the refill operation was misclassified into one of the other categories), an accuracy of 70% was obtained for the features-based approach. This value represented the accuracy on the real interactions performed by the customers; however, it represented only 16% of the total interactions. In terms of accuracy, VRAI-Net 2, which was the best DCNN in our case, achieved 92% for the entire test set, and thus the same accuracy for the interaction-type classification, outperforming the previous features-based results. To prove the interaction classification accuracy from a different point of view, positive interaction classifications were compared with sell-out data over a period of 4 weeks in a real store in Italy. The underlying assumptions are that a positive interaction (taking a product from the shelf) is a final buy action for the shopper and that there are no other secondary placements for the analysed category (i.e. diapers). A total of 1353 positive interactions on the diapers category were measured over 4 weeks (opening time from 9 a.m. to 10 p.m.) and compared with the sell-out provided by the store cashier system, with a final accuracy of 96.72%. This final real-world test confirms again the high quality of the proposed approach in a real scenario. To better understand the applications of the proposed method and its high impact on the shopper marketing area, other aggregated results of the proposed framework are reported in Sect. 4.4.
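The following is a minimal sketch of one plausible rule for combining the two frame-level predictions into an interaction type, following the priorities described above (a neutral frame marks a fake interaction; refill overrides the other classes). The exact pick-up/put-back mapping is an illustrative assumption, not the published rule.

```python
# A hedged sketch of combining the entering/exiting frame labels into an
# interaction type. The pick-up/put-back mapping is an assumption.
def combine(entering, exiting):
    if "neutral" in (entering, exiting):
        return "fake"         # not a real shopper-shelf interaction, excluded
    if "refill" in (entering, exiting):
        return "refill"       # staff operation, filtered out of shopper statistics
    if entering == "negative" and exiting == "positive":
        return "positive"     # empty hand in, product out: product taken
    if entering == "positive" and exiting == "negative":
        return "negative"     # product in, empty hand out: product put back
    return "neutral"

print(combine("negative", "positive"))   # positive
```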

Table 5 Shopper interaction results

4.3 Re-identification

In this subsection, the re-id results of VRAI-Net 3 on the TVPR2 dataset are presented and compared with those obtained from other state-of-the-art approaches. The results obtained are shown in Table 6; they refer to re-id over a gallery of 1000 simultaneously present shoppers.

In the classification stage, different classifiers are compared according to the nature of the feature descriptors TVD (depth descriptor) and TVH (colour descriptor). The overall prediction is performed by averaging the computed posterior probability of each classifier in order to provide the optimal decision rule. Based on the TVD and TVH features, five state-of-the-art classifiers, namely k-nearest neighbours (kNN) [81], support vector machine (SVM) [82], decision tree (DT) [83], random forest (RF) [84] and Naïve Bayes (NB) [85], are compared to recognise customers.

Our network has been compared with another state-of-the-art classification network, VGG-16 [27]. To obtain shorter training times, a VGG-16 [27] pre-trained on the ImageNet dataset [26] is used. Then, network fine-tuning was performed: the final classification layer was replaced with our own custom layer, and the network was re-trained using a lower learning rate in the first convolutional layers and a more aggressive learning rate in the last layers.
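A minimal sketch of this fine-tuning strategy is given below (PyTorch/torchvision is an assumed framework and the learning rates are illustrative): the pretrained backbone receives a small learning rate while the replaced classification layers receive a larger one.

```python
# A minimal sketch of fine-tuning an ImageNet-pretrained VGG-16 with layer-wise
# learning rates. Framework and hyperparameters are illustrative assumptions.
import torch.nn as nn
import torch.optim as optim
from torchvision import models

vgg = models.vgg16(weights="IMAGENET1K_V1")
vgg.classifier[6] = nn.Linear(4096, 1000)   # custom layer for 1000 identities

optimizer = optim.SGD([
    {"params": vgg.features.parameters(),   "lr": 1e-4},   # early conv layers
    {"params": vgg.classifier.parameters(), "lr": 1e-2},   # last / new layers
], momentum=0.9)
```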

For the training phase, data augmentation techniques were also used. In particular, the following were applied:

  • image flipping, left to right and top to bottom;

  • image rotation by \( 90^\circ \), \( 180^\circ \) and \( 270^\circ \);

  • \( 3 \times 3 \) crops (crop size \( 130 \times 130 \), stride 10 pixels, 3 horizontal steps × 3 vertical steps, with each crop resized to \( 150 \times 150 \)).

These techniques were applied both to the original images and to some of their crops. The crops were generated by moving a \( 130 \times 130 \) pixel box inside the image with steps of 10 pixels in both directions, forming a \( 3 \times 3 \) grid.

To improve accuracy during the testing phase, a technique called 10-crop validation was used. For each image of the validation dataset, the network was tested on the original image, on four of its crops (top-left, top-right, bottom-left and bottom-right), on the original image flipped left to right and on four crops of the flipped image. The most commonly predicted class over these 10 tests was then used as the final prediction.
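A minimal sketch of this 10-crop voting scheme is shown below (NumPy assumed); `predict` stands for a single-view forward pass of the trained network and is a placeholder.

```python
# A minimal sketch of 10-crop validation: the final label is the most frequently
# predicted class over the original image, its four corner crops, the flipped
# image and the four corner crops of the flip.
from collections import Counter
import numpy as np

def corner_crops(img, size=130):
    h, w = img.shape[:2]
    return [img[:size, :size], img[:size, w - size:],
            img[h - size:, :size], img[h - size:, w - size:]]

def ten_crop_predict(img, predict):
    views = [img] + corner_crops(img)
    flipped = np.fliplr(img)
    views += [flipped] + corner_crops(flipped)
    labels = [predict(v) for v in views]           # 10 single-view predictions
    return Counter(labels).most_common(1)[0][0]    # majority vote
```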

Moreover, the results obtained by training a DCNN, pretrained on the ImageNet dataset, with a triplet loss are reported. Given an input image (called the anchor), this loss tends to bring it closer to images of its own class (hard positives) while simultaneously moving it away from images of other classes (hard negatives) [86]. GoogleNet [87], pretrained on the ImageNet dataset [88], was chosen as the backbone; the triplet loss function was based on the work of Hermans et al. [89].
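For reference, the batch-hard formulation of the triplet loss proposed in [89] selects, for each anchor in a batch, the hardest positive and the hardest negative:

\[
\mathcal{L}_{\mathrm{triplet}} = \sum_{a \in \mathcal{B}} \left[ m + \max_{p \in \mathcal{P}(a)} D\big(f(x_a), f(x_p)\big) - \min_{n \in \mathcal{N}(a)} D\big(f(x_a), f(x_n)\big) \right]_{+},
\]

where \( f \) is the embedding network, \( D \) a distance function, \( m \) the margin, \( \mathcal{P}(a) \) and \( \mathcal{N}(a) \) the positive and negative samples of the anchor \( a \) within the batch \( \mathcal{B} \), and \( [\cdot]_{+} = \max(\cdot, 0) \).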

From the results reported in Table 6, it can be observed that VRAI-Net 3 exceeded the performance of the other features-based methods in all metrics. In particular, an increase of about 0.2 was obtained for all the classification metrics compared to the SVM, which is the most common features-based approach. It is also interesting to note that the rank-one CMC of our VRAI-Net 3 was lower than its own accuracy, which is unusual.

Several real in-store checks through human observation were also carried out. Data were collected in two target stores in Germany, observing people entering and leaving the stores for 2 hours. These real in-store tests gave an accuracy of 73%, comparable with the VRAI-Net 3 results on the laboratory dataset.

In addition to accuracy, precision, recall and F1-score, our approaches have been evaluated using the CMC. The CMC represents the expectation of finding the correct identity within the first n predicted identities. This metric is suitable for evaluating performance in recognition problems. Figure 9 shows the CMCs of the compared approaches: the horizontal axis indicates the rank, while the vertical axis indicates the probability of a correct identification at the corresponding rank. From the CMC, it can be inferred that the curve of our proposed network, VRAI-Net 3, was always higher than the CMC curves of the other state-of-the-art methods.

An additional comparison between the approaches was carried out to evaluate recognition performance according to the number of people identified, as depicted in Fig. 10.

Table 6 Re-id results on TVPR2, i.e. 1027 people simultaneously present in the retail space
Fig. 9 CMC on TVPR2 dataset

Fig. 10 Scores of the people in the TVPR2 dataset

Table 7 shows the measured runtime at different network sizes for the compared DCNNs. The experiments were conducted on an NVIDIA Tesla K80 GPU. The results reveal that VRAI-Net 2 does not scale well, whereas VRAI-Net 1 finishes the same task faster. The performance of the VRAI-Nets is aligned with the general purpose of the framework, offering a suitable balance between accuracy and computation time. The resulting systems are reliable and cost-effective enough to remain scalable, yet efficient enough to run on the edge. The designed network is fast, light in terms of parameters, and learns with good performance.

Table 7 Parameters and computational time comparison

The main applications of re-id are: i) the evaluation of dwell time inside the store, by identifying the same person entering and exiting the store; ii) the identification of returning customers, both at store level and at category level; and iii) the store flow of a single person passing by different categories. The next section clarifies the impact of the proposed methods on shopper marketing.
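As an illustration of application (i), re-id matches at the entrance and exit can be paired to derive dwell times; the event format and names below are hypothetical, not part of the framework's actual interface.

```python
# Hypothetical sketch: dwell time from re-identified entry/exit events.
from datetime import datetime

def dwell_times(entry_events, exit_events):
    """Both arguments: lists of (person_id, timestamp) produced by the re-id module."""
    entries = {pid: t for pid, t in entry_events}
    return {
        pid: (t_out - entries[pid]).total_seconds()   # seconds spent in store
        for pid, t_out in exit_events
        if pid in entries
    }

# Example with synthetic events:
ins = [(1, datetime(2021, 5, 1, 10, 0)), (2, datetime(2021, 5, 1, 10, 5))]
outs = [(1, datetime(2021, 5, 1, 10, 25))]
print(dwell_times(ins, outs))   # {1: 1500.0} -> 25 minutes in store
```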

4.4 Marketing applications of shopper behaviour understanding

As previously mentioned, the technologies discussed in this work have great relevance in the marketing field. In particular, they offer relevant contributions to behavioural science and, more precisely, to consumer behaviour studies by using innovative methodologies and tools. Over the years, various attempts have been made to study in-store consumer behaviours using mainly manual recording techniques [94]. However, the extremely laborious nature of these techniques means that they take a long time to complete, and it is often difficult to obtain a large sample. Moreover, it is almost impossible to obtain a complete picture of a consumer's behaviour during his or her entire shopping journey at any given moment or from a series of moments over time. Another possible technique for measuring in-store behaviour is to interview consumers when they leave the store. However, a study on pedestrian flow in the city centre of Lincoln, Nebraska, indicated that such investigative techniques lead to unacceptably high levels of inaccuracy [95]. Because of the obvious limitations of analyses made through manual surveys, researchers have over the years begun to experiment with passive methods of data collection, which are considered the most appropriate for in-store consumer behaviour studies. Implicit behaviour detection using technology has been carried out by many researchers, including Sorensen et al. [94], who used a shopping cart-tracking system, and Oosterlinck et al. [96], who used Bluetooth technology to detect in-store shopping journeys. The use of the technologies described in this paper therefore allowed us to analyse shoppers implicitly and continuously in all the stores in which they were observed, obtaining multiple independent, comparable studies. The main goals of this approach, defined in the literature as "meta-analysis" [97, 98], consist of:

  • assessing Shopping Experience Fundamentals by comparing insights across different categories and in different store formats and

  • confirming (or refuting) behavioural science theories using data obtained from actual shopper observations.

For example, by comparing data from multiple categories, it is possible to test the most frequently cited category management claims, for instance that "the middle third of the shelf performs better than the top and bottom thirds". By using interaction recognition technology between people and products on a shelf, it is possible to promptly verify which parts of the shelf are touched more often, confirming the theory as shown in Fig. 11.

Fig. 11 Distribution of positive interactions by top, middle and bottom shelf

Through the same technology, it is also possible to measure interaction dwell time to analyse the relationship between the time spent at the shelf and sales, which is positive up to a ceiling and then becomes inverse (Fig. 12). This means that the longer a consumer spends at a shelf, the more purchases will be made, but only up to a threshold of three products being touched.

Fig. 12 Time spent at the shelf by purchasers having had one to six interactions

The use of these technologies therefore has a wide range of applications in the marketing field. Further studies should be conducted to explore their potential more deeply and to confirm or refute, through implicit observations, consumer behaviour theories.

5 Conclusion

In this paper, a novel and powerful methodology and application for shopper behaviour analysis is presented. The system is based on RGB-D video in an intelligent retail environment and is evaluated in real environments, collecting three public datasets. Results prove that the proposed methodology is suitable for implicit shopper behaviour analysis, with relevant applications in the marketing and consumer research fields and a particular focus on implicit consumer understanding.

The proposed research starts from the idea of collecting relevant datasets from real scenarios, moving the overall methodology from a feature-based and geometric approach to a fully deep learning method with three concurrent CNNs processing the same frame to: i) segment and count people with high accuracy (more than 99%), even in crowded environments; ii) measure and classify interactions between shoppers and shelves, distinguishing positive, negative, neutral and refill actions with good accuracy, also compared with cashier sell-out data; and iii) perform re-identification over contemporary shoppers (up to 1000 people in the same area at the same time) with good accuracy, in order to collect massive behavioural data on the best performing categories (more than 80% with 100 or 250 contemporary shoppers in the area).

For every purpose, a public dataset is collected and shared together with the framework source code to allow comparisons with the proposed method and future improvements and collaborations on these challenging problems. The paper describes one of the most extensive tests based on real data from real retail scenarios in the literature. The system is also designed to comply with modern privacy regulations: it avoids recording and transmitting any personal image or video, processes all frames at the edge, and transmits to the cloud only synthetic and anonymised data.

Future work should improve and better integrate the three CNNs with more complex architectures able to improve performance. Incremental learning methods will be investigated to improve the online performance of the re-identification algorithm. Further investigation of CNN generalisation is needed to prove the effectiveness of the approach in very different retail categories (from grocery to fashion) and across cross-country human behaviours and attitudes.