Introduction

Nowadays, many applications of mobile robotics are becoming part of everyday life. Among them, shopping centers are one of the sectors where automated robots can be used to facilitate shopping activities. Shopping carts are widely used in modern shopping centers, supermarkets and hypermarkets. However, pushing a shopping cart from shelf to shelf can be a tiring and laborious job, especially for elderly customers or customers with certain disabilities. If a customer is accompanied by one or more children, pushing the cart becomes even more difficult, since he or she has to hold a child’s hand at the same time; in such situations, customers sometimes need caregivers for support. An intelligent shopping support robot is a good alternative. In [1], Kobayashi et al. show the benefits of robotic shopping trolleys for supporting the elderly. In general, the core functions of a shopping support robot are: following its user (the customer), navigating the paths that the customer takes while shopping and avoiding collisions with obstacles and other objects. Shopping malls and supermarkets typically have many crowded regions. For this reason, in our previous research [2], we developed an autonomous person-following robot that can follow a given target person in crowded areas.

In addition to robust person following, the robot can support the user better if it acts in advance to meet the user’s next move. For example, when the user picks up a product from a shelf, it is convenient if the robot automatically moves to the user’s right-hand side (if the user is right-handed) so that he or she can easily put the product in the basket. To realize such functions, the robot needs to recognize the user’s behavior.

To recognize the user’s behavior, we use a GRU (Gated Recurrent Unit) network [3] instead of an LSTM network, because the GRU performs better for our purposes: it has a simpler structure and can be computed faster. The three gates of the LSTM are combined into two gates in the GRU, an update gate and a reset gate.

Before presenting the details of our methods, we summarize the contributions of this paper. First, we integrate head orientation, body orientation and a GRU network for customer shopping behavior recognition, and then provide shopping support to the customer based on the recognized behavior. We propose a GRU network to classify five types of shopping behavior: reach to shelf, retract from shelf, hand in shelf, inspect product and inspect shelf. Head and body orientations are used to classify the customer’s gaze and interest in a given shelf.

Related work

Over the last decades, several teams of roboticists have presented prototypes of new shopping support robots, representing worldwide cutting-edge advancements in the field. An autonomous robotic shopping cart was developed by Nishimura et al. [4]; it can follow customers autonomously and transport their goods. Kohtsuka et al. [5] followed a similar approach: they equipped a conventional shopping cart with a laser range sensor to measure the distance to and position of its user, and developed a system to prevent collisions. Their robotic shopping cart also follows users to transport goods.

The study carried out in [6] concludes that elderly people interact better with robots that carry the shopping basket and provide conversational facilities. In [7, 8] a shopping help system was developed that can obtain the shopping list from a mobile device through a QR code, carry the shopping basket, show at each moment which articles are in it, and communicate with the supermarket computer system to report the location of articles. It uses a laser range finder, sonar, and contact sensors (bumpers) to navigate. An indoor shopping cart tracking system was developed in [9]. This system requires installing a computer and a video camera on each shopping cart, so that the carts can perform self-localization and send their positions to a centralized system.

In addition to customer shopping support, the analysis of customer shopping behavior is commercially important for marketing. Usually, the records of cash registers or credit cards are used to analyze customers’ buying behavior. However, this information is insufficient for understanding customer behavior in situations such as when a customer shows interest in front of a merchandise shelf but does not make a purchase. The main tasks of customer shopping behavior recognition are to count customers and analyze their trajectories so that merchants can easily understand customers’ interests. Haritaoglu et al. [10] described a system for counting shopping groups waiting in checkout lanes. Leykin et al. [11] used a swarming algorithm to group customers throughout a store into shopping groups. Person counting is a useful tool for marketing and staff planning decisions. Senior et al. [12] analyzed surveillance camera footage in a retail store to understand hot zones and the dwell-time trajectories of individual customers. However, customer shopping behavior includes more diverse actions, such as stopping in front of products, browsing, picking up a product, reading its label, returning it to the shelf or putting it into the shopping cart. These different behaviors, or combinations of them, convey much richer marketing information. Hu et al. [13] proposed an action recognition system to detect the interaction between a customer and the merchandise on the shelf. Recognition of retail customers’ shopping actions was also developed using stereo cameras from a top view [14]. Lao et al. [15] recognize customer actions, such as pointing, squatting and raising a hand, using a single surveillance camera. Haritaoglu et al. [16] extracted customer behavior information while customers watched advertisements on a billboard or a new product promotion.

In this paper, we propose to combine the research on autonomous shopping cart robots and that on shopping behavior recognition to realize a shopping support robot. The robot can act in advance to meet the user’s next move based on the user’s behavior recognition results.

Customer behavior model for the front of the shelf

In our previous work [2], we developed a person-following shopping support robot. In this paper, we focus on a more intelligent shopping support robot that can recognize the customer’s shopping behavior and act accordingly.

Definition of customer behavior model

Our customer behavior model captures indications of increasing interest that a customer shows towards the store’s products. If a customer has no interest in a given product, he or she will neither look at the shelf nor at the product, and will likely not turn towards the shelf. We classify this behavior with our head and body orientation methods.

Other shopping behaviors, namely reach to shelf, retract from shelf, hand in shelf, inspect product and inspect shelf, are classified by our proposed GRU network. These behaviors indicate increasing levels of interest in the product. They are defined in Table 1, and Fig. 1 shows some examples.

Fig. 1 Examples of customer behaviors

Table 1 Customer behavior model
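
For use in the later code sketches, the five behaviors can be encoded as a simple label list. The class names follow Table 1, while the integer ordering is an illustrative assumption, not taken from the paper.

```python
# Illustrative label encoding for the five shopping behaviors of Table 1.
# The integer order is an assumption; only the class names come from the paper.
BEHAVIOR_CLASSES = [
    "reach_to_shelf",      # 0
    "retract_from_shelf",  # 1
    "hand_in_shelf",       # 2
    "inspect_product",     # 3
    "inspect_shelf",       # 4
]
```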

Framework of customer behavior classification

Fig. 2 Framework of customer shopping behavior classification system

Head orientation and body orientation are relevant to our shopping behavior recognition model. According to our previous work, a robot with a shopping cart can effectively follow a person. If the person’s body orientation is \(0^{\circ }\) or \(180^{\circ }\), the robot simply follows that person and the behavior is recognized as “no interest in the products”. If the person’s body and head orientations are neither \(0^{\circ }\) nor \(180^{\circ }\), our proposed GRU network is used to classify the shopping behavior. The system’s framework is shown in Fig. 2.

Customer behavior classification

Head orientation detection

Fig. 3 Examples of different head orientation detection

In this paper the head orientation of the customer is estimated in eight directions, as shown in Fig. 3. We propose a simple method to detect head orientation from OpenPose [17] results. In OpenPose, the whole-body pose is represented by joints [0, 1, 2, ..., 17], as shown on the right-hand side of Fig. 3. Depending on which skeleton joints are detected, we can easily classify the head orientation. For example, for the \(0^{\circ }\) head orientation, all head skeleton joints [0, 14, 15, 16, 17] are detected. The detected head joints corresponding to the other head orientations are shown in Table 2.

Table 2 Detected head joint points for classification of head orientation
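
As a concrete illustration, this rule can be implemented as a lookup from the set of detected head joints to an orientation. The sketch below assumes OpenPose output as a per-joint dictionary; only the \(0^{\circ }\) entry is taken from the text, and the remaining entries would be filled in from Table 2.

```python
# Sketch: head orientation from the set of detected OpenPose head joints.
# Only the 0-degree rule (all head joints visible) comes from the text;
# the remaining subsets must be filled in according to Table 2.
HEAD_JOINTS = {0, 14, 15, 16, 17}  # nose, right/left eye, right/left ear

ORIENTATION_BY_JOINTS = {
    frozenset({0, 14, 15, 16, 17}): 0,  # all head joints detected -> 0 degrees
    # ... other detected-joint subsets -> 45, 90, ..., 315 degrees (see Table 2)
}

def head_orientation(keypoints, conf_threshold=0.1):
    """keypoints: {joint_id: (x, y, confidence)} for one person (OpenPose output)."""
    detected = frozenset(
        j for j in HEAD_JOINTS
        if j in keypoints and keypoints[j][2] > conf_threshold
    )
    return ORIENTATION_BY_JOINTS.get(detected)  # None if the pattern is not listed
```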

Body orientation detection

Similar to the head orientation, the body orientation is also calculated in eight directions as shown in Fig. 4.

Fig. 4 Examples of body orientation detection

We use four angle values \(\angle EAC\), \(\angle ACE\), \(\angle DBE\) and \(\angle AEC\) of the target person to predict the body orientation. More details of the body orientation detection are given in our previous work [1].
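
The specific joints denoted A through E and the mapping from the four angles to the eight orientations follow Fig. 4 and our previous work. As a minimal sketch, the building block is a 2-D angle computation such as the following, where the point names in the usage comment are placeholders.

```python
import numpy as np

def angle_at(vertex, p1, p2):
    """Angle (in degrees) at `vertex` formed by the rays towards p1 and p2.
    All points are 2-D image coordinates taken from the skeleton joints."""
    v1 = np.asarray(p1, dtype=float) - np.asarray(vertex, dtype=float)
    v2 = np.asarray(p2, dtype=float) - np.asarray(vertex, dtype=float)
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

# For example, the angle EAC (with vertex A) would be angle_at(A, E, C), where
# A, C, E stand for the joint positions used in the previous work (placeholders here).
```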

Fig. 5 Detection results of head and body orientation

Figure 5 shows examples of our head and body orientation detection results. It can be seen that our method clearly identifies different head and body orientations.

Gated Recurrent Neural Network (GRU)

The GRU is similar to the well-known LSTM. A GRU has two gates, a reset gate and an update gate. The reset gate determines how to combine the new input with the previous memory, and the update gate determines how much of the previous memory is retained.

Fig. 6 Proposed GRU network for shopping behavior classification

As shown in Fig. 6, our GRU network consists of 32 GRU cells; the number of cells corresponds to the number of skeleton-data frames in each activity sequence.

The activation  \(h_t^j\) of the GRU at time t is a linear interpolation between the previous activation  \(h_{t-1}^j\) and the candidate activation  \({\widetilde{h}}_t^j\):

$$\begin{aligned} h_t^j=(1-z_t^j)h_{t-1}^j+z_t^j{\widetilde{h}}_t^j \end{aligned}$$
(1)

where an update gate  \(z_t^j\) decides how much the unit updates its activation, or content. The update gate is computed by:

$$\begin{aligned} z_t^j={\sigma ({W_z}{x_t}+{U_z}{h_{t-1}})}^j \end{aligned}$$
(2)

where  \(x_t\) is the input sequence,  \(W_z\) and \(U_z\) are weight matrices and  \(\sigma \) is the logistic sigmoid function. The candidate activation  \({\widetilde{h}}_t^j\) is computed similarly to that of a traditional recurrent unit:

$$\begin{aligned} {\widetilde{h}}_t^j={\tanh (W{x_t}+U(r_t{\odot }{h_{t-1}}))}^j \end{aligned}$$
(3)

where  \(r_t^j\) is a set of reset gates and  \(\odot \) denotes element-wise multiplication. When the reset gate is off (\(r_t^j\) close to 0), it effectively makes the unit act as if it were reading the first symbol of the input sequence, allowing it to forget the previously computed state.

The reset gate  \(r_t^j\) is computed similarly to the update gate:

$$\begin{aligned} r_t^j={\sigma ({W_r}{x_t}+{U_r}{h_{t-1}})}^j \end{aligned}$$
(4)

And the output is given by

$$\begin{aligned} y=\sigma ({W_y}{h_{32}}+b_y) \end{aligned}$$
(5)

The output vector y is computed from the hidden state vector  \(h_{32}\) at the last (32nd) time step, which is multiplied by the weight matrix \(W_y\) and offset by the bias \(b_y\), as expressed in Eq. (5). We use the sigmoid function as the output activation function of the network.
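
A minimal NumPy sketch of Eqs. (1)–(4) is given below, vectorized over the unit index j and with the bias terms omitted, as in the equations; the matrix shapes (input dimension d, hidden dimension n) are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    """One GRU update following Eqs. (1)-(4), vectorized over the units.
    x_t: input at time t, shape (d,); h_prev: previous activation, shape (n,);
    W_*: (n, d) input weights; U_*: (n, n) recurrent weights. Biases omitted."""
    z = sigmoid(W_z @ x_t + U_z @ h_prev)            # update gate, Eq. (2)
    r = sigmoid(W_r @ x_t + U_r @ h_prev)            # reset gate, Eq. (4)
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))    # candidate activation, Eq. (3)
    return (1.0 - z) * h_prev + z * h_tilde          # new activation, Eq. (1)

# After processing the 32 frames, the class scores follow Eq. (5):
# y = sigmoid(W_y @ h_32 + b_y)
```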

Dataset construction

We built a dataset recording five kinds of shopping behavior, equally distributed over the recording period: reach to shelf, retract from shelf, hand in shelf, inspect product and inspect shelf.

To create the dataset, we constructed shopping shelves in our lab environment and placed different items or products on them. We then set up four cameras to record videos from four viewing angles. A total of 20 people took part in the recording sessions, and each participant performed the desired shopping actions for 10 min. Since four cameras were used for each person, the total length of our video sequences is 20 × 10 × 4 = 400 min. We then ran the OpenPose model to extract skeleton data for each action, obtaining 211,872 skeleton frames of different actions. A single frame’s input (where j refers to a joint) is stored in our dataset as:  \( [j0_x,j0_y,j1_x,j1_y,j2_x,j2_y,j3_x,j3_y,j4_x,j4_y,j5_x,j5_y,j6_x,j6_y,j7_x,j7_y,j8_x,j8_y,j9_x,j9_y,j10_x,j10_y,j11_x,j11_y,j12_x,j12_y,j13_x,j13_y,j14_x,j14_y,j15_x,j15_y,j16_x,j16_y,j17_x,j17_y ]\)
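
As an illustration, one frame of OpenPose output can be flattened into this 36-value layout as in the sketch below; treating undetected joints as zeros is our assumption and is not stated in the paper.

```python
import numpy as np

N_JOINTS = 18  # OpenPose 18-joint model, joints 0..17

def frame_vector(keypoints):
    """Flatten one OpenPose frame into [j0_x, j0_y, ..., j17_x, j17_y].
    keypoints: {joint_id: (x, y, confidence)}; undetected joints stay 0 (assumption)."""
    vec = np.zeros(2 * N_JOINTS, dtype=np.float32)
    for j, (x, y, _conf) in keypoints.items():
        if 0 <= j < N_JOINTS:
            vec[2 * j], vec[2 * j + 1] = x, y
    return vec
```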

Experiments description

All experiments were performed on an NVIDIA GTX TITAN X GPU with 12 GB of global memory, using NVIDIA DIGITS. We divided our dataset into two parts: 80% of the data for training and 20% for testing. Using these data, we trained our GRU network to classify the shopping behaviors. A fixed learning rate of 0.000220 was used, and the model was trained for 50,000 epochs. Training took around 5 h. Other training specifications are given in Table 3.

Table 3 Training specification for our proposed GRU network

Figure 7 shows the plot of the model’s loss and accuracy over 50,000 iterations.

Fig. 7 The model accuracy and loss over 50,000 iterations

Table 4 shows the detailed layer information for our proposed GRU network structure. It has three layers: the first is the GRU layer, the main layer, which contains the two gates (a reset gate and an update gate); the second is a Dropout layer, which reduces overfitting; and the last is a Dense (fully connected) layer.

Table 4 Detailed layer information for the proposed GRU structures
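
A minimal Keras sketch of the three-layer structure in Table 4, combined with the training configuration reported above (80/20 split, learning rate 0.000220, 50,000 epochs), is given below. The number of hidden units, the dropout rate, the loss function and the random seed are assumptions; the paper fixes only the 32 time steps, the 36-value frame vector, the five output classes and the sigmoid output of Eq. (5).

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dropout, Dense
from tensorflow.keras.optimizers import Adam

def build_gru_model(timesteps=32, features=36, hidden_units=32,
                    dropout_rate=0.5, num_classes=5):
    """GRU -> Dropout -> Dense, as listed in Table 4.
    hidden_units and dropout_rate are assumed values, not reported in the paper."""
    return Sequential([
        GRU(hidden_units, input_shape=(timesteps, features)),
        Dropout(dropout_rate),
        Dense(num_classes, activation="sigmoid"),  # sigmoid output, Eq. (5)
    ])

def train(X, y):
    """X: (num_sequences, 32, 36) skeleton sequences; y: (num_sequences, 5) labels."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)            # 80% / 20% split (seed assumed)
    model = build_gru_model()
    model.compile(optimizer=Adam(learning_rate=0.000220),  # fixed learning rate
                  loss="binary_crossentropy",              # assumed; matches sigmoid output
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=50000,              # 50,000 iterations as reported
              validation_data=(X_test, y_test))
    return model
```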

Architecture of the shopping support robot based on the user’s behavior recognition

Fig. 8 Our proposed shopping support robot

Figure 8 shows our proposed shopping support robot. First, it detects the nearest person as its user and starts following him or her. It can robustly follow the target person in crowded places; the details of our person tracking and following system are discussed in our previous paper [2]. Our shopping support robot uses a LiDAR sensor mounted about 20 cm above the floor, so the sensor can cover customers of any height. Our subsequent task is to make the robot recognize the customer’s behavior and level of interest in the products.

Fig. 9 Flowchart of our proposed shopping support robot

Figure 9 shows the flowchart of our behavior-based shopping support robot. The overall working procedure is given below; a minimal sketch of the decision logic follows the list:

Step 1: Track the target customer.

Step 2: Recognize the customer’s body orientation. If the customer’s body orientation is \(0^{\circ }\), go to Step 3. If the body orientation is \(180^{\circ }\), go to Step 4. Otherwise, go to Step 5.

Step 3: Recognize the customer’s head orientation. If the customer’s head orientation is \(0^{\circ }\), take a suitable position in front of the customer. Otherwise, go to Step 5.

Step 4: If the customer’s head orientation is \(180^{\circ }\), follow the target customer at a certain distance. Otherwise, go to Step 5.

Step 5: Recognize the customer’s shopping behavior actions using the GRU network.
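
The sketch below condenses Steps 2–5 into a single dispatch function (Step 1, tracking, runs continuously). The `classify_with_gru` argument and the returned action strings are placeholders for the robot’s actual recognition and control routines.

```python
def decide_action(body_deg, head_deg, skeleton_sequence, classify_with_gru):
    """Steps 2-5 of Fig. 9. classify_with_gru is the GRU classifier supplied
    by the caller; the returned strings are placeholder action names."""
    if body_deg == 0:                                     # Step 2 -> Step 3
        if head_deg == 0:
            return "take_position_in_front_of_customer"
        return classify_with_gru(skeleton_sequence)       # Step 5
    if body_deg == 180:                                   # Step 2 -> Step 4
        if head_deg == 180:
            return "follow_customer_at_distance"
        return classify_with_gru(skeleton_sequence)       # Step 5
    return classify_with_gru(skeleton_sequence)           # Step 5
```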

Results

Evaluation of behavior recognition

Performance metrics

To verify the performance of behavior recognition, we employed four widely used evaluation metrics for multi-class classification.

Precision The precision, or positive predictive value (PPV), is the proportion of instances correctly assigned to a class (TP: true positives) out of all instances the classifier assigned to that class, i.e. TP plus FP (false positives).

$$\begin{aligned} Precision=TP/(TP+FP) \end{aligned}$$
(6)

Recall The recall, or sensitivity, is the proportion of instances of a class that the classifier correctly assigns to it, out of the total number of instances belonging to that class, i.e. TP plus FN (false negatives).

$$\begin{aligned} Recall=TP/(TP+FN) \end{aligned}$$
(7)

Accuracy The accuracy measures the proportion of correctly predicted labels over all predictions:

$$\begin{aligned} Overall\ accuracy =(TP+TN)/(TP+TN+FP+FN) \end{aligned}$$
(8)

F1 measure The harmonic mean of precision and recall. The F1 score ranges from 0 to 1 and can be interpreted as a weighted average of the precision and recall in which both contribute equally. The formula for the F1 measure is:

$$\begin{aligned} F1\ \ measure=(2*Precision*Recall)/(Precision+Recall) \end{aligned}$$
(9)
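
For multi-class evaluation, these metrics can be computed per class and then averaged. A short scikit-learn sketch is given below; the choice of macro averaging is our assumption.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate(y_true, y_pred):
    """y_true, y_pred: integer labels (0-4) for the five behavior classes."""
    return {
        "precision": precision_score(y_true, y_pred, average="macro"),  # Eq. (6), averaged over classes
        "recall":    recall_score(y_true, y_pred, average="macro"),     # Eq. (7), averaged over classes
        "accuracy":  accuracy_score(y_true, y_pred),                    # overall accuracy
        "f1":        f1_score(y_true, y_pred, average="macro"),         # Eq. (9), averaged over classes
    }
```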

Table 5 reports the classification performance of our proposed GRU network for the different shopping behaviors.

Table 5 Performance evaluation of shopping behavior classification
Fig. 10 Confusion matrix of different shopping behavior

Figure 10 shows the confusion matrix of our proposed network for shopping behavior classification. Only 243 of 1361 samples are misclassified, which corresponds to an accuracy of 82.1%. Hand in shelf and inspect product are the classes least discernible from retract from shelf in this case.

Evaluation of shopping support robot

The prediction time for head orientation, body orientation and shopping behavior recognition is five frames per second, i.e. 200 ms per frame. At this processing speed, we evaluate our shopping support robot in two situations: in the first case the robot is on the right side of the customer, and in the second case it is on the left side.

Fig. 11 Evaluation of shopping support robot

For the first case, the first frame of Fig. 11 shows our shopping support robot observing, from a distance, the customer inspecting products with a head and body orientation of \(45^{\circ }\). In the second frame, the customer’s head and body orientation is \(0^{\circ }\) with respect to the robot. In this situation, the robot decides to move closer to the customer and adjusts its orientation to a suitable position so that the customer can easily put the product in the shopping basket. The last frame shows the customer putting the product in the basket.

Fig. 12 Evaluation of shopping support robot

The second case proceeds similarly to the first, except that the head and body orientations while inspecting the product are different. In the first frame of Fig. 12 the customer inspects the product with a head and body orientation of \(270^{\circ }\). After inspecting the product, the customer looks towards the robot, and his head and body orientation are both \(0^{\circ }\) with respect to the robot. The robot then decides to move close to the customer and assumes the proper orientation so that the customer can put the product into the shopping basket.

In this way, our shopping support robot supports the customer by carrying the shopping products and following the customer until he or she has finished shopping.

Conclusion and future work

In this paper, we addressed the design considerations for an intelligent shopping support robot. One objective of this work was to develop a simple, reliable and easy-to-use system that provides freedom of movement for elderly and handicapped people. To this end, a person-following robot was developed in our previous work [2]; in this work, we add shopping support facilities for the elderly.

We have confirmed that our vision system can recognize the shopping behaviors necessary for supporting the user, and we have developed a corresponding robot system. However, the current visual processing speed is not fast enough for the robot to move smoothly and be used in practice. We will improve the processing speed and perform experiments in actual shopping situations to evaluate the whole system. Based on the experimental results, we will then further refine the robot system so that it can provide practical shopping support for the elderly.