1 Introduction

The computer-based technological advances achieved over the last decade have become essential for measuring consumer behavior in terms of unconscious processes, which has undoubtedly improved marketing research practices (Alcañiz et al. 2019; Cherubino et al. 2019). This accelerated growth has generated particular interest among marketing scholars, who have found a new digital sales channel in these technologies, potentially offering greater knowledge of consumer perceptions (Bonetti et al. 2018; Dad et al. 2016; Pantano and Servidio 2012; Pizzi et al. 2019). In this respect, a significant number of scientific studies have examined the impact that extended reality technologies, such as augmented reality and virtual reality (VR), have on consumers’ purchasing experiences (Bonetti et al. 2018; Desmet et al. 2013; Marín-Morales et al. 2019; Martínez-Navarro et al. 2019; Dad et al. 2016; Peukert et al. 2019). Specifically, VR has become a significant trend in consumer neuroscience and market research. Along with the development of portable, stand-alone, and behavior-tracking devices (e.g., VR head-mounted displays [HMDs]), the landscape of consumer behavior research has allowed new types of interactive experiences to evolve.

VR technology can include an experiential layer that was not accessible until recently, allowing companies and researchers to predict real consumer behaviors in a controlled setting (Alcañiz et al. 2019; Guixeres et al. 2017; Harris et al. 2020; Martínez-Navarro et al. 2019; Meißner et al. 2019; Wedel et al. 2020). This purchasing “virtualization” phenomenon has brought about a robust transformation in retail research while facilitating logistics, customer experience, and management (Grewal et al. 2017).

Shopping experiences are modulated not only by the external characteristics of the selling environment but also by intrinsic attributes such as personality traits, age, and gender (Bogomolova et al. 2016; Chang and Yeh 2016; Chebat et al. 2008; Hwang and Lee 2018; Khatri et al. 2022; Moghaddasi et al. 2021; Spiers et al. 2008; Zaharia et al. 2017). Identifying such psychological and demographic traits in consumers is a key aspect of personalizing products and the shopping experience. However, the extent to which consumers’ experiences can be improved or even predicted in highly immersive VR environments considering age and gender is still unexplored. This study gathers users’ behavioral data from a VR shopping interaction across three different tasks. The goal is to infer the demographic attributes of consumers from their implicit responses after completing the v-commerce purchase task. Anticipating buyers’ behavior in virtual contexts according to demographic variables would provide researchers and sellers with more objective estimates to predict types of buyers, allowing them to apply personalized improvements in virtual commerce sales channels.

In the following sections, we briefly review the literature on consumer behavior, emphasizing VR studies. Based on the available evidence, we highlight the interplay between the implicit aspects of consumer behavior and the role of age- and gender-based differences in virtual shopping.

2 Related framework

2.1 Methodological considerations in consumer behavior research

Regarding brands, products, or purchasing environments, consumer preferences are mediated by cognitive and emotional processes that are difficult to approach with traditional consumer research tools (Chartrand 2005; Hsu and Yoon 2015; Woodside and Brasel 2011). In terms of unconscious drivers, scientific interest is mainly focused on predicting consumers' product/brand choices and preferences in different purchasing contexts (Pound et al. 2000; Meissner et al. 2019; Ravaja et al. 2013; Yanan and Yang 2019). Although these studies have provided valuable insights into the unconscious drivers of consumer behavior, they are limited by significant methodological constraints. Meissner et al. (2019) reviewed some of the methodological pitfalls of using implicit association tasks (IAT) to predict consumer behavior. The most notable finding is the low predictive validity of the reviewed IAT studies, which is partly due to their limited ability to predict behavior in real-world settings. Ravaja et al. (2013) used an implicit approach, EEG, to predict consumer purchase decisions, but the products were presented in a non-immersive way on a conventional desktop computer monitor. Yanan and Yang (2019) analyzed the prediction of consumers' purchase behavior in a social network environment and concluded that implicit measures predict consumers’ purchase intention better than explicit feature preferences alone. Although interesting, the study lacks ecological validity since it is based only on consumers’ online preferences. The study by Pound et al. (2000) is also revealing since it evaluates consumer behavior in four types of testing situations (central location, in-home, teaching laboratory, and formal sensory laboratory). However, the study does not test consumers’ responses in a real-life purchasing environment, which would be more ecologically valid.

Moreover, a real-life purchasing context activates consumers’ motivational orientation toward the purchase. As shown in previous research, there is a functional relationship between consumers’ shopping decisions and their shopping motivational orientation (Brown et al. 2003), mainly reflecting the experiential (i.e., hedonic, unplanned) and the goal-oriented (e.g., utilitarian, planned) sides of shopping (Pizzi et al. 2019; Wolfinbarger and Gilly 2001; Khatri et al. 2022). Such different shopping orientations entail different consumer perceptions and shopping strategies (Pizzi et al. 2019; Siegrist et al. 2019), resulting in different product interaction patterns and navigation strategies.

Incorporating virtual interfaces into online commerce brought some temporary solutions to the field. However, these interfaces still do not provide the level of realism needed to elicit the same behaviors that customers would display in physical stores (Wagner et al. 2020). This is a significant limitation, particularly in retail research, because the interactive aspects naturally present in in-store experiences (e.g., navigation, and trying and touching products) are neglected outside the context of the physical store, which ultimately affects the validity of the conclusions. Additionally, the lack of ecologically valid scenarios capable of generating more natural behaviors restricts the extrapolation and generalization of results (Plassmann et al. 2015; Köster 2003). Evidence shows that moving consumer research from standard experimental lab settings closer to where consumers purchase or consume the tested products may increase ecological validity. For example, Van Herpen et al. (2016) examined whether the greater realism of a virtual store, compared with pictorial (2D) stimuli, elicits consumer behavior more in line with behavior in a physical store. The reported results indicated that virtual reality can improve realism in responses to shelf assignment compared to the 2D pictorial store.

2.1.1 Virtual reality: from the sense of presence to the sense of real experience

Today, immersive VR approaches are proposed as a solution to the outlined methodological limitations, with benefits for research that have been widely recognized in the literature (see Alcañiz et al. 2019; Bonetti et al. 2018; Desmet et al. 2013; Peukert et al. 2019). In contrast with traditional research methods, VR environments present consumers with different challenges and purchasing experiences, enhancing the online experience in many ways (Flavián et al. 2019a). One of the main advantages of VR in experimental and consumer research is the inclusion of high levels of standardization, almost comparable with classical laboratory experiments. The underlying assumption is that VR offers unbiased estimates of consumer behaviors similar to those in natural, physical environments (Marín-Morales et al. 2019; Needel 1998). This has been empirically demonstrated in several studies comparing VR shopping experiences with experiences in physical stores (Bonetti et al. 2018; Pizzi et al. 2019).

Technically speaking, VR immerses users in a seemingly natural 3D environment, allowing them to interact with objects/avatars using special wearable devices, such as helmets allowing vision or gloves equipped with sensors. Their perception of the outside world is blocked to induce a more engaging experience (Bonetti et al. 2018; Brookes et al. 2020; Johnson-Glenberg 2018; Dad et al. 2016). Because VR allows the measurement of behavior in real time, it has become an essential tool for investigating the neurocognitive processes elicited naturally by virtual navigation experiences (Alcañiz et al. 2019). In consumer neuroscience, these realistic approaches are used by scholars and market researchers to get more reliable insights into customers’ purchase preferences while overcoming the typical limitations of physical shops, such as limited stock and product presentation options (Burke 2017). From the consumer’s point of view, VR provides a new form of purchasing products that is more playful and oriented toward influencing their hedonic experience by increasing the sense of presence (Farah et al. 2019; Flavián et al. 2019b; Xue et al. 2020; Peukert et al. 2019). In such an environment, customers perceive, feel, and interact with products just as they would naturally do in a physical store. This assumption relies on previous studies showing relatively stable neurophysiological patterns exhibited in virtual and real-world contexts during the performance of physical activities (e.g., Marin-Morales et al. 2019; Petukhov et al. 2020). However, in the context of retail, researchers still need to clarify whether the physiological and neurocognitive reactions in virtual contexts are the same as those elicited in real, physical stores (Alcañiz et al. 2009; Baumgartner 2008; Peukert et al. 2019; Sanchez-Vives and Slater 2005).

As shown in the prior literature, VR shopping, compared to physical commerce, contributes to enriching the buying–selling process, allowing consumers to make better-informed decisions about which products or services to consume (Bressoud 2013; Burke 2017; Lau and Lee 2019; Martínez-Navarro et al. 2019; Oh et al. 2008).

Identifying consumers’ implicit behavior in VR has created many methodological opportunities. Recent advancements in integrating human behavior-tracking technologies into HMDs and external wearables (Marin-Morales et al. 2020) have further expanded the research scope in retail, mainly helping to improve consumer shopping experiences on-site. Virtual store (VS) layouts can now be optimized to anticipate shopper needs, ultimately helping save time and money. Retail is therefore transitioning to what has been termed “virtual commerce” (Alcañiz et al. 2019; Martínez-Navarro et al. 2019; Velev and Zlateva 2019), conceptualized as a new digital sales channel centered on the use of VR technology. The advantage of this new concept of VSs is twofold. On the one hand, VSs are expected to enhance the overall consumer purchasing experience through personalized shopping environments and products while reducing the perceived risk of buying online. On the other hand, VR methods allow buyers to experience the product before purchasing it, thereby overcoming an intrinsic limitation of traditional physical shops and e-commerce selling channels. For example, we could test-drive a car that we intend to buy in a realistic simulation or teleport to a hotel room that we are going to book. Ultimately, it is possible to customize the experience while maintaining a high level of immersion, using prior information about the shopper just as an e-commerce platform does (e.g., payment preferences, product categories of greatest interest, maximum cost).

2.1.2 The role of demographic factors in shopping behavior

Attempts to further understand purchasing behavior have shown that demographic factors play a fundamental role (Chandrasekar and George 2013; Cleveland et al. 2011; Dotson and Hyatt 2004; Vipul 2010; Yildirim et al. 2015). In this regard, the influences of age and gender as predictors of shopping behavior have been analyzed in both real-world (Laroche et al. 2000; Sommer et al. 1992) and simulated spaces (Hasan 2010; Kizony et al. 2017; Spiers et al. 2008; Tlauka et al. 2005; Waller 2000). In the marketing literature, gender differences have been investigated in several shopping behavior streams, such as product perception (Borges et al. 2013; Sebastianelli et al. 2008), sales promotion (Harmon and Hill 2003; Hill and Harmon 2009), shopping attitudes (Garbarino and Strahilevitz 2004), shopping styles (Dittmar et al. 2004), and advertising (Martin 2003). Overall, the reported results support essential differences between men and women regarding attitudes, product searching efficiency, and in-store shopping time (Bogomolova et al. 2016; Hasan 2010; Sommer et al. 1992). Some research exploring gender-based differences in virtual spatial navigation has confirmed that, on average, males and females differ in their use of specific spatial navigation and goal-oriented strategies (Mueller et al. 2008; Spiers et al. 2008; Tlauka et al. 2005). Using motion-tracking immersive VR systems, one recent study demonstrated that women spend significantly more time in-store (mostly interacting with products) than men (Schnack et al. 2020). When accomplishing a specific shopping task (e.g., a goal-oriented purchase), males spend, on average, less time completing it than their female counterparts. This aligns with prior research on in-store contexts showing relatively longer shopping times among females (Bogomolova et al. 2016; Chang and Yeh 2016; Chebat et al. 2008; Davies and Bell 1991; Schnack et al. 2020).

Age has also been reported as a relevant demographic influence in real (Gazova et al. 2013) and virtual (Driscoll et al. 2005; Jansen et al. 2009; Moffat and Resnick 2002; Rodgers et al. 2012) spatial navigation. It is well known from neurodegeneration studies that cognitive abilities, such as processing speed, working memory, and spatial orientation, decline linearly with age (Deary et al. 2009; Ghisletta et al. 2012; Salthouse 2009). These changes may impact several aspects of shopping behavior in older customers, especially those related to the speed and quality of decision-making processes and their motivation (Drolet et al. 2018; Rodgers et al. 2012). For instance, metrics such as walk-around time and speed of travel have been shown to differ between young adults and old adults, especially in allocentric navigation (Rodgers et al. 2012). In online shopping, the likelihood of purchasing more expensive products, taking risks, and preferring traditional shopping methods has been found to differ according to age (Lian and Yen 2014). However, the impact of age-related factors in highly immersive 3D experiences needs to be investigated further (e.g., Plechatá et al. 2019).

Demographic differences have also been explored in conjunction with eye movement patterns. Previous eye-tracking studies in the context of online shopping have suggested that gender influences the visual processing strategies deployed (Bergstrom et al. 2013; Hwang and Lee 2018; Tupikovskaja-Omovie and Tyler 2020; Zaharia et al. 2017). Metrics of interest often include fixation times and gaze mapping of areas of interest (AOIs), which mostly provide information about key components of the shopping journey (Hwang and Lee 2018; Menon et al. 2016; Zaharia et al. 2017). Findings have consistently suggested that women’s visual attention to products tends to be greater than men’s, likely indicating their different buying styles and strategies (Hwang and Lee 2018). Also, studies based on spatial cognition have reported evidence of different cognitive strategies adopted by men and women when navigating VR environments (Andersen et al. 2012; Mueller et al. 2008). For instance, in a virtual maze setting, relatively longer fixation durations and greater increases in pupil diameter have been observed in women compared to men (e.g., Mueller et al. 2008).

The current scientific literature confirms that age and gender play a key role in influencing purchase decisions and motivations. Thanks to a new generation of integrated neuroscientific tools with more versatile VR designs, studying the intrinsic layers of consumer behavior is feasible from an experimental approach (Bigné et al. 2016; Meißner et al. 2019). However, current research does not clearly describe the role of implicit measures in predicting consumer typologies. In the specific retail context, it is unknown, for example, whether the age and gender of consumers can be inferred from implicit measures such as body gestures or spatial navigation patterns. In our understanding, the combination of immersive VR with wearable technologies may help gain much more generalizable and actionable insights into the demographic influence on consumer behavior.

2.1.3 The present study

The focus of this study was to examine whether implicit behavioral responses of consumers can predict their demographic attributes (i.e., age and gender) in a virtual store. Following similar experimental VR approaches to those validated in previous research (Harris et al. 2020; Marín-Morales et al. 2019; Martínez-Navarro et al. 2019; Pfeiffer et al. 2020; Khatri et al. 2022), we conducted an experiment in which participants had to navigate a virtual supermarket while performing goal-oriented and free search tasks. A set of implicit behavioral features was processed, and demographic characteristics (age and gender) were recognized using a supervised machine learning (ML) classifier, specifically a support vector machine. In the context of this study, implicit features or measures refer to nonverbal reactions gathered from eye-tracking, body-tracking, and navigation register systems (e.g., Peukert et al. 2019). These biometric signals help track consumers' shopping routes, visual attention, purchase behaviors, and time spent on each task. The relevance of implicit features in recognizing age and gender was statistically analyzed based on the different tasks and regions of interest (ROIs).

A main contribution of this research is the characterization of consumers in v-commerce spaces according to the shopper’s demographic profile. Specifically, the results of this study are insightful for retailers and marketers considering the use of immersive virtual reality approaches in v-commerce practices. First, the study provides a better understanding of the role of consumer age and gender in immersive virtual reality retail environments based on human behavior-tracking (HBT) and eye-tracking techniques. The definition of this standard behavioral measure will allow researchers and marketers to adapt or customize the different areas of a hypothetical virtual shop according to the demographic attributes of consumers without altering the shopping experience. Additionally, we add other implicit behavioral metrics, such as virtual space navigation and product interaction, to infer intrinsic attributes of age and gender following the same procedures validated in our previous research (Khatri et al. 2022). Second, our study informs about the type of tasks and spatial features that can optimize the classification of consumers based on their demographic characteristics. Third, the present study assesses implicit consumer behavior in a VR store based on an integrated approach that will extend existing findings in the v-commerce literature (Meißner et al. 2020; Elboudali et al. 2020; Peukert et al. 2019; Velev and Zlateva 2019; Khatri et al. 2022). Ultimately, the findings of this study will be relevant for the future application of hands-free VR technologies in different retail ecosystems. The ability to infer multiple intrinsic shopper attributes in a transparent and noninvasive way is of great interest for current online sales platforms based on virtual reality. We expect that this new capability will allow them to develop a new concept of adaptive store capable of redesigning the experience offered based on the consumer's profile (Alcañiz et al. 2019; Khatri et al. 2022).

3 Material and methods

3.1 Participants

A sample of 57 participants balanced by gender (27 females and 30 males) and age (from 18 to 36 years) was recruited for the experiment. Initially, 60 individuals were considered but three participants were removed due to corrupted data in the eye-tracking signal. All the participants were healthy and reported no motor diseases, no evident mental pathologies, and normal or corrected-to-normal vision and hearing. Participants' prior experience with VR was self-reported, with a total of 46% of the sample reporting having no experience, 53% having experienced it at least once, and 1% having experienced it multiple times. Informed written consent was obtained from all the participants, and the study was approved by the ethical committee of the Polytechnic University of Valencia in accordance with the Declaration of Helsinki. All participants were incentivized with money, which they received after the experiment was concluded.

3.2 Technological setup

A brick-and-mortar VS measuring 6 m × 6 m was developed using the Unity 3D game engine (Unity 2020). The participants wore an HTC VIVE Pro HMD (HTC 2020) running on SteamVR (SteamVR 2020) and could interact with the virtual objects using HTC VIVE Pro controllers. The HMD sent the data wirelessly to a central computer in the same room. Participants could move inside the virtual environment by walking naturally, thanks to the tracking provided by four base stations covering a physical zone of 6 m × 6 m, the same dimensions as the generated VS (Fig. 1).

Fig. 1 Above: technology setup employed. Below: VS of 6 m × 6 m with seven shelving units and three shelf levels

We recorded 3D movement data from the HMD (head tracking) at an average of 76 Hz (SD = 8.41). Similarly, 3D movement data for the hands (hand tracking) were recorded at the same frequency using the controllers. The HTC VIVE Pro includes built-in eye-tracking (ET) technology by TOBII (TOBII 2021), which uses infrared sensors and emitters to detect pupil movement. The HMD has a resolution of 1440 × 1600 pixels per lens (2880 × 1600 pixels combined), with a field of view of 110 degrees. The raw gaze data were collected at a variable sampling rate of 60–70 Hz using the HTC SRanipal SDK.

3.3 Virtual store

The VS consisted of seven shelving units with three levels each (top, middle, and bottom), as shown in Fig. 1. Each shelving unit contained realistic product models of fast-moving consumer goods (e.g., milk, juice, coffee, and noodles) and durable goods (e.g., shoes). A close-up look of the products is shown in Fig. 2. The products were highly interactable: they could be picked up, rotated, dropped, and purchased (if applicable). A blue circle on the ground in one of the corners of the VS was used as a trigger point to start and end tasks.

Fig. 2 Training room showing purchasable and non-purchasable products

The shelving units were 210 cm tall in total. The bottom level had a height of 60 cm and was situated 30 cm above the ground. The middle and top levels were 55 cm high. Above the top level, there was an upper rim measuring 10 cm.

3.4 Training room

Because the objective of the study is to analyze natural user behavior inside a store, a training room was developed to let participants familiarize themselves with the technology by learning to move in the environment and use the controllers before entering the VS (Slater et al. 1996; Steinicke et al. 2009). The training room contained two white tables at its center. Each table held four objects, green or red, of different shapes (Fig. 3). In this task, participants could walk around and interact with the objects. They could pick up a green object and hold the purchase key to buy it; after a successful purchase, the object vanished, accompanied by a soft sound. Red objects were not purchasable, and a buzzer sound informed participants of this when they tried to buy one.

Fig. 3 Training room showing purchasable and non-purchasable products

3.5 Experimental design

A within-subjects design was adopted to test the research hypotheses of this study. The experiment was conducted at the LENI laboratory of the Polytechnic University of Valencia (LENI 2021). The study took 45 min and was structured as an initial training task followed by three experimental tasks.

Upon arrival at the laboratory, participants were welcomed, seated, and then had the procedure explained to them. After reading and signing the informed consent form, participants were taken to the starting point to put on the HMD with the experimenter’s help. The familiarization task described above was conducted in the training room, where participants were informed about the mechanics of the VR gear and the controllers. When the 4-min time limit was over, or when participants became familiar and comfortable with the VR gear’s mechanics, they were instructed to go to the blue circle in the corner to finish the task. Next, participants underwent calibration for the eye tracker following HTC guidelines. After equipment calibration, participants received instructions to perform the three purchasing tasks, detailed below.

Task 1 (Free exploration task) In this task, participants were instructed to roam freely and explore the VS for up to 4 min. This task represents unplanned browsing behavior, meaning that the customer does not have any specific goal when visiting the shop. They could interact with the products present in the store and end the task early by standing on the blue circle.

Task 2 (Search and buy snacks task) In this task, participants were instructed to search for the shelving unit containing snacks (potato chips) and purchase the ones of their choice. The shelving unit held nine types of different potato snacks with different prices distributed on the three shelf levels. Each participant had a budget of 5 Euros and was allowed to buy up to three snacks. The snacks were priced between 1.00 and 2.50 Euros. After buying the snacks, they were instructed to return to the blue circle to finish the task.

Task 3 (Search and buy shoes task) Similar to Task 2, participants were instructed to search for the shelving unit containing shoes. There were nine pairs of shoes of different colors and prices distributed on the three shelf levels. The budget for this task was 180 Euros, with the shoes ranging between 115 and 180 Euros. Participants could only choose one pair of shoes, unlike in Task 2. After buying the shoes, they were instructed to return to the blue circle to finish the task. Following Task 3, participants removed the HMD and completed some questionnaires.

3.6 Data analysis

3.6.1 Data recording and preprocessing

Raw data measuring the behavior of the participants in the virtual environment were collected using the Unity 3D game engine. The data were preprocessed to create four data source groups which, combined, compose the human behavior-tracking (HBT) dataset analyzed in the experiment. For a full review of HBT, please see Khatri et al. (2022); a general idea of the groups is given below:

  • Eye-tracking (ET): data gathered from the eye gaze of participants. Using these data, fixations and saccades were classified in a 3D environment (Khatri et al. 2020) with the dispersion threshold identification algorithm (I-DT; Salvucci and Goldberg 2000). The parameters were set based on previous studies as follows: the mean time fixation was at 0.25 s and the dispersion threshold at less than 1° (Llanes-Jurado et al. 2020). The duration and centroid were computed for each fixation. Example features include Number of Fixations and Fixation Saccade Ratio.

  • Navigation (NAV): two-dimensional movement data on the subject inside the virtual space, collected from head tracking. It considered the movement of participants along the two axes of the ground and discarded the height axis. Example features include Total Distance and Mean Velocity.

  • Posture (POS): three-dimensional movement data extracted from head and hand tracking. It was similar to NAV, with the addition of the height axis and the position of the hands in three dimensions. Example features include Number of visits in AOI and Mean Velocity in AOI.

  • Interaction (INT): data extracted from events and interactions of the participant with the VS in a session. Example features include start and end times, the times at which products were picked up, and the number of products picked up.
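The I-DT procedure named in the eye-tracking item can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes gaze samples are already expressed as (time in seconds, azimuth in degrees, elevation in degrees), it uses the dispersion measure of Salvucci and Goldberg (horizontal range plus vertical range), and it treats the 0.25 s value as the minimum duration of the initial window. The function name and the data layout are hypothetical.

```python
def idt_fixations(samples, min_duration=0.25, max_dispersion=1.0):
    """Dispersion-threshold fixation detection (I-DT sketch).

    samples: list of (t_seconds, azimuth_deg, elevation_deg), sorted by time.
    Returns a list of dicts with start, end, duration and centroid.
    """
    def dispersion(window):
        az = [s[1] for s in window]
        el = [s[2] for s in window]
        return (max(az) - min(az)) + (max(el) - min(el))

    fixations = []
    i, n = 0, len(samples)
    while i < n:
        # Grow the initial window until it spans at least min_duration.
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if j >= n:
            break  # remaining samples cannot cover the duration threshold
        if dispersion(samples[i:j + 1]) < max_dispersion:
            # Extend the window while the dispersion stays under the threshold.
            while j + 1 < n and dispersion(samples[i:j + 2]) < max_dispersion:
                j += 1
            window = samples[i:j + 1]
            fixations.append({
                "start": window[0][0],
                "end": window[-1][0],
                "duration": window[-1][0] - window[0][0],
                "centroid": (
                    sum(s[1] for s in window) / len(window),
                    sum(s[2] for s in window) / len(window),
                ),
            })
            i = j + 1  # continue after the fixation
        else:
            i += 1  # slide the window start by one sample
    return fixations
```

Samples that fall between two detected fixations would then be treated as saccade candidates; how the original pipeline handles them is described in the cited works rather than here.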

Some features, such as stops, saccades, and fixations, were divided into two categories: short and long. The following thresholds were used for these divisions: (1) 1 m on the 3D projection for saccades, (2) 0.45 s for fixations, and (3) 2.5 s for stops. In addition, saccades above 45 degrees were considered vertical and those below this angle horizontal. Because of the smaller number of elements in the interaction category, we combined posture and interaction as POS + INT. The combination of these preprocessed data sources (ET + NAV + POS + INT) made up the HBT dataset.
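The short/long and vertical/horizontal divisions can be sketched as follows. The thresholds come from the text; everything else is an assumption made for illustration: values equal to a threshold are counted as long, the saccade angle is measured from the horizontal plane, the y axis points up (as in Unity), and all names are hypothetical.

```python
import math

# Thresholds stated in the text.
SACCADE_LENGTH_M = 1.0      # short vs. long saccade, on the 3D projection
FIXATION_DURATION_S = 0.45  # short vs. long fixation
STOP_DURATION_S = 2.5       # short vs. long stop
SACCADE_ANGLE_DEG = 45.0    # vertical vs. horizontal saccade


def short_or_long(value, threshold):
    # Assumption: a value exactly at the threshold counts as "long".
    return "long" if value >= threshold else "short"


def describe_saccade(start, end):
    """start, end: (x, y, z) gaze hit points in meters, y pointing up."""
    dx, dy, dz = (e - s for s, e in zip(start, end))
    length = math.sqrt(dx * dx + dy * dy + dz * dz)
    # Angle between the displacement and the horizontal (x-z) plane.
    angle = math.degrees(math.atan2(abs(dy), math.hypot(dx, dz)))
    return {
        "length_m": length,
        "size": short_or_long(length, SACCADE_LENGTH_M),
        "orientation": "vertical" if angle > SACCADE_ANGLE_DEG else "horizontal",
    }
```

For example, `short_or_long(0.3, FIXATION_DURATION_S)` labels a 0.3 s fixation as short, and `short_or_long(3.0, STOP_DURATION_S)` labels a 3 s stop as long.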

First, subjects with corrupted data were removed (i.e., incomplete recordings due to failure in transmission or storage). In this phase, 3 participants were excluded due to corrupted eye-tracking data. Data preprocessing and analysis were conducted using Python version 3.7.3.

3.6.2 Data segmentation

In this second data analysis phase, features were grouped into three sets (1: ET, 2: NAV, 3: POS + INT) plus a fourth set that combines all previous sets (4: HBT).

Related to feature extraction, two strategies were adopted:

  1. Zonal domain features: features that are related to the intrinsic zonal parameters of the VS. We differentiated AOIs, associated with vertical shelf levels (top, middle, and bottom), and zones of interest (ZOIs), segmenting the floor plan into four different ZOIs (shelf, adjacent, near, and far) based on the zone proximity to the shelves (Fig. 4). In the case of Tasks 2 and 3, the ZOIs were divided to reflect the target shelving unit pertaining to the task, as shown in Fig. 4. The widths of the adjacent and near zones were optimized based on a previous work (Moghaddasi et al. 2020), which set the width of adjacent and near zones to 18 cm and 13 cm, respectively. Hence, the far zone covers the rest of the area. In Tasks 2 and 3, the near zone covered all space in front of the shelf after the adjacent zone. Only the width of the adjacent zone had to be determined. This width was also set to 18 cm, as in Task 1.

  2. General features: general features that are not related to zonal parameters.
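The zonal segmentation can be illustrated with a small sketch. The level heights are taken from the virtual store description (bottom level 60 cm high starting 30 cm above the ground, middle and top levels 55 cm each) and the zone widths from the text above. The function names, the use of a single distance to the shelf front, and the handling of boundary values are assumptions made for illustration only.

```python
# Shelf geometry from the virtual store description (centimeters).
BOTTOM_START_CM = 30.0
LEVELS = (("bottom", 60.0), ("middle", 55.0), ("top", 55.0))


def aoi_for_height(height_cm):
    """Map a height on the shelving unit to a vertical AOI, or None."""
    lower = BOTTOM_START_CM
    for name, level_height in LEVELS:
        upper = lower + level_height
        if lower <= height_cm < upper:
            return name
        lower = upper
    return None  # below the bottom level or on the upper rim


def zoi_for_distance(distance_cm, adjacent_width=18.0, near_width=13.0):
    """Map a horizontal distance from the shelf front to a ZOI.

    Assumption: a distance of zero or less means the position lies
    within the shelf footprint.
    """
    if distance_cm <= 0:
        return "shelf"
    if distance_cm <= adjacent_width:
        return "adjacent"
    if distance_cm <= adjacent_width + near_width:
        return "near"
    return "far"
```

With these numbers the levels span 30 to 90 cm, 90 to 145 cm, and 145 to 200 cm, which together with the 10 cm rim gives the stated 210 cm. For Tasks 2 and 3, where the near zone covers all the space in front of the target shelf beyond the adjacent zone, the same function could be called with `near_width=float("inf")`.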

Fig. 4 Segmentation of VS into several horizontal levels (ZOIs) and vertical levels (AOIs) for (a) Task 1, (b) Task 2, and (c) Task 3. Vertical levels (d) refer to zonal divisions of shelves. The far zone covers the entire store plan except the areas that are selected for the shelf, adjacent, and near zones. In the far zone, the vertical dimension AOI is considered to be one level

Further, these zonal and general features were subdivided into spatial, kinematic, and temporal features based on the type of data they contained: for example, total distance traveled in x min (spatial), total time (temporal), and number of stops per min (kinematic). The full list of selected features is given in Tables 1, 2, 3 and 4.

Table 1 Gender recognition based on general features extracted from implicit sources across tasks. The different fonts (bold and italics) account for the higher/lower statistical significances
Table 2 Zone-related features found statistically significant in gender recognition across tasks. The structure of the AOIs includes four spatial zones, all segmented into three levels, excluding the “FAR” region. Effect sizes and p-values are reported
Table 3 Age recognition based on general features found significant for each metric across tasks
Table 4 Zone-related features found significant in age recognition across tasks. Effect sizes and p-values are reported

3.6.3 Statistical analysis of gender and age differences

As a first step to explore the importance of each feature, unpaired two-sample Mann–Whitney–Wilcoxon tests were performed for each feature to test the significance of the differences between genders and between age-groups. This nonparametric test was chosen after checking that most of the variables were not normally distributed (Shapiro–Wilk p value < 0.05). The effect sizes were computed using rank–biserial correlation.
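To make the effect-size measure concrete, the following minimal sketch computes a rank–biserial correlation for two independent groups directly from pairwise comparisons. The function name and the sample values are hypothetical, and the sketch covers only the effect size; the p value itself would normally come from a statistics library routine such as `scipy.stats.mannwhitneyu`.

```python
def rank_biserial(group_a, group_b):
    """Rank-biserial correlation between two independent samples.

    Equals (wins - losses) / (n_a * n_b), which is the same as
    2 * U_a / (n_a * n_b) - 1 when ties count as half a win in U_a.
    Positive values mean group_a tends to have larger values.
    """
    wins = sum(1 for a in group_a for b in group_b if a > b)
    losses = sum(1 for a in group_a for b in group_b if a < b)
    return (wins - losses) / (len(group_a) * len(group_b))


# Hypothetical feature values (not data from the study)
males = [1.2, 0.9, 1.5, 1.1]
females = [0.8, 1.0, 0.7, 0.95]
effect = rank_biserial(males, females)  # 14 wins, 2 losses out of 16 pairs
```

The sign of the result indicates which group tends to have the larger values, which corresponds to the > M and > F labels used in the tables.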

For age, participants were divided into two age-groups based on the median of the participants’ ages for statistical comparison. The young adult group comprised participants aged between 18 and 24, and participants above that range (age ≥ 25) were assigned to the old adult group (Arnett 2006).

3.6.4 Machine learning to predict demographics

3.6.4.1 Preprocessing

Data were normalized using min–max rescaling to map each feature onto the interval [0, 1]. Features with a standard deviation under 1e-5 were excluded to avoid including variables that carry no information. Features that were linearly dependent on others were removed using Pearson correlation (rho > 0.95).
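The three preprocessing steps can be sketched as follows. This is an illustration under stated assumptions rather than the original pipeline: the step order follows the order in which the text lists them, the correlation threshold is applied to the absolute value, and when two features are highly correlated the one encountered first is kept. All function names and the toy data are hypothetical.

```python
import math


def min_max(column):
    """Rescale a list of values to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)  # constant feature
    return [(v - lo) / (hi - lo) for v in column]


def std(column):
    mean = sum(column) / len(column)
    return math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def preprocess(features, std_min=1e-5, corr_max=0.95):
    """features: dict mapping feature name -> list of values (one per participant)."""
    scaled = {name: min_max(col) for name, col in features.items()}
    informative = {n: c for n, c in scaled.items() if std(c) >= std_min}
    kept = {}
    for name, col in informative.items():
        # Assumption: keep the first feature of any highly correlated pair
        if all(abs(pearson(col, other)) <= corr_max for other in kept.values()):
            kept[name] = col
    return kept


# Toy example with hypothetical values
toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.1], "c": [5, 5, 5, 5], "d": [4, 1, 3, 2]}
kept = preprocess(toy)  # "b" is nearly collinear with "a"; "c" is constant
```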

3.6.4.2 Feature selection and machine learning

Based on the different types of signals used in the classification (ET, NAV, POS + INT, and combined HBT), there were 33, 68, 157, and 258 features extracted, respectively. A stepwise feature selection algorithm was implemented to reduce the number of features that had to be fed to the models. In the first step, by applying stepwise forward selection, the number of features was reduced to 25. After this step, a backward elimination algorithm was applied to select a maximum of 15 features for the final modeling. This step was performed to explore the importance of the features and to avoid possible overfitting of the classifier derived from a high number of features.
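The two-stage selection described above can be sketched generically. In this sketch, `score` is a hypothetical callable that returns a quality value for a feature subset (for instance, a cross-validated accuracy); the text does not specify the criterion, so it is left abstract. The backward stage here always reduces to exactly `n_final` features, which simplifies the "maximum of 15" rule stated in the text.

```python
def forward_select(candidates, score, n_keep):
    """Greedily add the feature that most improves score() until n_keep are chosen."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < n_keep:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_eliminate(selected, score, n_keep):
    """Greedily drop the feature whose removal leaves the best-scoring subset."""
    selected = list(selected)
    while len(selected) > n_keep:
        drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
        selected.remove(drop)
    return selected


def stepwise_select(candidates, score, n_forward=25, n_final=15):
    """Forward selection to n_forward features, then backward elimination to n_final."""
    return backward_eliminate(forward_select(candidates, score, n_forward), score, n_final)


# Toy criterion with hypothetical feature weights (stands in for model accuracy)
weights = {"f1": 3.0, "f2": 2.0, "f3": 1.0, "f4": 0.5, "noise": -1.0}
toy_score = lambda subset: sum(weights[f] for f in subset)
chosen = stepwise_select(list(weights), toy_score, n_forward=4, n_final=2)
```

If selection is driven by model accuracy, running it inside the training folds only would avoid leaking test information; the text does not state how this was handled.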

The evaluation of the machine learning model was done using a k-fold (k = 5) cross-validation strategy. In this method, the samples were shuffled and split into k groups. The data was split into folds using stratified sampling, meaning that each fold had the same proportions of different classes. Then, one group was taken as a test dataset, and the remaining groups were used as a training dataset. This process was repeated for each fold to be used as a test, so that each subject could be used to test the accuracy of the model created with the rest of the observations. The folds helped to reduce the impact of diversity in the distributions of the testing and training data and tune the hyper-parameters. The evaluation values obtained through all the test groups were then averaged and the method was repeated 5 times, with the data reshuffled prior to each repetition, resulting in a different split of the sample (Schneider 1997). This was to ensure that different combinations of the observations were used for training the models and evaluating the test set.
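A minimal sketch of this evaluation scheme is shown below, assuming one sample per participant and a hypothetical `evaluate(train_idx, test_idx)` callable that trains a model on the training indices and returns its score on the test indices. The round-robin assignment used to stratify the folds is one simple way to keep class proportions equal and is not necessarily the procedure used in the study.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with class proportions as equal as possible."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # reshuffle within each class
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal indices round-robin
    return folds


def repeated_cv(labels, evaluate, k=5, repeats=5):
    """Average evaluate(train_idx, test_idx) over k folds and several reshuffles."""
    scores = []
    for rep in range(repeats):
        folds = stratified_folds(labels, k, seed=rep)  # new split per repetition
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)


# Hypothetical labels: 10 female and 10 male participants
labels = ["F"] * 10 + ["M"] * 10
folds = stratified_folds(labels, k=5, seed=1)
```

Within one repetition every participant appears in exactly one test fold, so each observation is used once to test a model trained on the rest.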

Regarding the algorithm, we tested several classifiers, including support vector machines (SVMs; Chang and Lin 2011) with a linear kernel, with the cost parameter optimized by searching among 20 logarithmically spaced values between 0.1 and 10^4 according to the cross-validation score, and k-nearest neighbors (KNN; Weber et al. 1998), with k optimized by grid search from 1 to 20 according to the cross-validation score. We used accuracy and Cohen’s Kappa to measure the performance of the models. Cohen’s Kappa is interpreted as follows according to the guideline of Landis and Koch (1977): 0.00–0.20 indicates slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and 0.81–1.00 almost perfect agreement.
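For reference, the sketch below computes Cohen’s Kappa from a confusion matrix and maps it to the agreement bands listed above. It also builds a 20-point logarithmic grid for the SVM cost parameter, assuming the upper bound of the search range is 10^4. The function names and the confusion matrix are hypothetical and do not reproduce the study’s data.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column marginals for each class
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(n)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    """Agreement label following the bands of Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"  # below-chance band of the guideline, not listed in the text
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# 20 logarithmically spaced cost values from 0.1 to 10^4 (upper bound assumed)
cost_grid = [10 ** (-1 + 5 * i / 19) for i in range(20)]

# Hypothetical balanced two-class result with 94% accuracy
example = [[47, 3], [3, 47]]
kappa = cohen_kappa(example)
```

With two perfectly balanced classes, chance agreement is 0.5, so kappa reduces to 2 × accuracy − 1; this is why an accuracy of 94% corresponds to a kappa of about 0.88 in such a case.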

4 Results

4.1 Gender recognition based on a statistical analysis of general and zonal features

To assess the validity of both general and zone-related features in recognizing participants’ gender, we focused on significant p values and effect sizes linked to each source of information across tasks. Results are displayed in Table 1 and Table 2 according to metrics of interest and tasks. Color intensity indicates the strength of significance (lightest red = p < 0.05, light red = p < 0.001, dark red = p < 0.0001) based on the Mann–Whitney–Wilcoxon test. For each feature with a significant gender difference, the tables indicate whether averaged values were greater for males (> M) or for females (> F). Features with no gender difference are labeled ND.

4.1.1 Free exploration context (Task 1)

The examination of general features shown in Table 1 revealed a total of 8 significant metrics related to the Navigation and Posture (Head and Hand) sources. The “mean duration of stops” feature exhibited the highest level of significance linked to Navigation (p = 0.008**) and Posture–Head (p = 0.0004***). None of the measured features from ET were found to be significant.

Analysis of zone-related features (ZOIs/AOIs) pertaining to eye movement measures revealed a total of 5 significant features across all AOIs. The strongest significance was shown for “number of visits to Adjacent AOI” (p = 0.0006***). Metrics related to the Body (Head and Hand) signal yielded a total of 14 significant features, with “mean velocity” in “Far” and “Shelf–Mid” being highly significant (p = 0.0002***, p = 0.005**). Navigation patterns yielded 6 significant features, largely centered on the “Adjacent” ZOI. The analysis identified “mean velocity” in the “Far” ZOI as the most relevant feature (p = 0.0001***).

4.1.2 Goal-oriented context (Task 2 and Task 3)

Considering the results shown in Table 1, a total of 5 general features were shown to significantly identify gender. In both tasks, only “number of stops × min” linked to Navigation and Posture (Head) was shown to be significant. In the case of ET, the results of the analysis indicate one informative feature (i.e., “mean saccade time,” p = 0.041*) for gender recognition.

When focusing on zone-related features, the results shown in Table 2 revealed many significant metrics. A total of 10 features from ET were significant, with “mean velocity” and “STD acceleration” exhibiting the strongest significance in the “Adjacent” AOI (i.e., p < 0.0005***). The analysis of the Navigation data yielded 8 significant features, mostly centered on the “Adjacent,” “Near,” and “Far” ZOIs. Results in the case of Body (Head) data reveal a total of 12 significant features highly focused on “Adjacent–Mid,” “Near–Top,” and “Far.” In the case of Body (Hand), the statistical analysis showed 3 significant features linked to “Shelf,” “Adjacent,” and “Near,” with low to moderate levels of significance.

The measured zonal features from the navigation and body posture sources were consistently insightful for determining the gender of buyers regardless of the task context. Considering general features, gender recognition was affected by the task context at the Posture–Hand source level. Finally, ET appears suitable for recognizing gender in free exploration and goal-oriented shopping contexts but only based on zone-related features.

4.2 Age results based on general and zonal features

The results of the performed analysis are presented in Table 3 and Table 4. The greater absolute mean difference found for the young age-group and the old age-group is termed > Y or > O, respectively.

4.2.1 Free exploration context (Task 1)

Considering the results of general features only (Table 3), the analysis revealed 7 significant features linked to the different levels of implicit metrics. The results from the ET dataset confirm 4 important features, with the most substantial level of significance found for “number of fixations × min” (p = 0.013*). The analysis of the Navigation and Posture (Head) signals indicated a significant contribution of a single feature: “total distance traveled per min.” In the case of Posture (Hand), only the “mean velocity” feature was found to be significant (p = 0.002**).

The results from the zone-related dataset (Table 4) indicate a total of 8 significant features. The analysis of the Navigation data identified one feature (i.e., “time of 1st visit to Near ZOI”) as relevant in discriminating the user’s age. In the case of Posture (Head and Hand), 7 features linked to different AOIs were found to be statistically significant, with 3 specific head-movement-related features focused on the “Near–Top” AOI. In this task context, ET analysis did not identify any significant features.

4.2.2 Goal-oriented context (Task 2 and Task 3)

Focusing only on general features, the results revealed 12 significant features for characterizing age recognition in goal-oriented shopping contexts. The ET analysis confirmed at least 9 powerful features across all AOIs, with a relatively more robust significance for “short fixations” (p = 0.0008***) and “small saccades” (p = 0.007**) linked to the context of Task 2. In the case of Navigation, the analysis revealed one significant feature related to the context of Task 3 (“total time,” p = 0.029*). The examination of Posture (Head and Hand) sources showed a total of 6 features sharing similar levels of significance for both goal-oriented tasks.

Results for the zone-related metrics (Table 4) yielded many significant features contributing to recognizing the user’s age. A total of 9 features from the ET measures were shown to be significant in subdivisions belonging to all AOIs, with some of them predominantly focused on “Near” areas, and with “mean velocity” showing the highest level of significance (p = 0.0001***). The analysis of the navigation data indicated that the “Adjacent” and “Near” zones were significantly related to 6 features elicited in the context of Task 3. The results in the case of Posture (Head) indicate that “Near–Top” is the most predominant AOI, with a total of 7 significant features. This is also limited to the context of Task 3. In the case of Posture (Hand), the results confirm that there are at least 10 significant features, though more evenly distributed across AOIs.

In sum, implicit behavior features were insightful enough to potentially identify age in both exploration and goal-oriented contexts. First, metrics captured using ET are robust enough to discriminate shoppers’ ages, but only in goal-oriented contexts. Second, Posture (Head and Hand) measures were also informative in free exploration, with an important number of features pertaining to both general and zonal levels. Finally, measured Navigation features may not be as consistent in goal-oriented settings as they appear to be in free exploration for the purpose of age recognition.

4.3 Demographic prediction based on ML model accuracy

Among the ML classifiers described in the previous section, the SVM model performed best for classifying the gender and age datasets across all measured features.

4.3.1 Gender accuracy results

Table 5 shows the performance of the computed ML models, considering the combination of optimal features derived from the automatic feature selection procedure. Classification accuracies, kappa indicators, and confusion matrices (i.e., TP rates) are also tabulated in Table 5.

Table 5 ML classification accuracies found relevant for each task and data source to recognize gender

ET outcomes show the best classification with the SVM. The highest accuracy was reached in the context of Task 3 (79%, kappa = 0.58) compared to those reached in Task 1 (74%, kappa = 0.47) and Task 2 (72%, kappa = 0.44). When combining all tasks, the tested model improved (77%, kappa = 0.54) relative to the Task 1 and Task 2 accuracies but did not outperform the Task 3 accuracy.

The Navigation signal also showed the highest accuracy with the SVM in all tasks except Task 2. The results show the highest accuracy (73%, kappa = 0.47) in Task 1 compared to Task 2 and Task 3 conditions (56% and 62%, respectively). The accuracy level improved by 8% when combining all the tasks (81%, kappa = 0.63). According to the TP rate, the classifier performed better in classifying females (84%) than males (64%).

In the case of Posture + Interaction, model accuracy increased when including “all tasks” (89%, kappa = 0.78) compared to the accuracy reached in each task separately (see Table 5). For Task 3, the TP rate was more than 25 percentage points higher in the female group (92%) than in the male group (66%).

Results concerning the HBT signal indicate that the highest accuracy was achieved by combining all tasks (94%, kappa = 0.88). When tested individually, Task 1 had the highest accuracy level (87%, kappa = 0.74) compared to Task 2 (82%) and Task 3 (81%). HBT metrics had a strongly balanced TP rate across task conditions in all performed models.

Together, all signals achieved acceptable accuracy across tasks. However, HBT and Posture + Interaction metrics performed relatively better than ET and Navigation, especially when the model included all tasks. In the context of a goal-oriented task, the Navigation signal was less accurate in predicting gender than in the context of free exploration. In the case of ET, gender prediction accuracies were reasonably balanced across all tasks.

4.3.2 Age accuracy results

The classification accuracies for each task and metric are tabulated in Table 6. As with gender, SVM performed better on almost all metric characteristics, followed by the KNN model (Fukunaga and Hostetler 1975).

Table 6 ML classification accuracies found relevant for each task and data set to recognize age

Regarding ET, the results show that the highest accuracy was reached when the model included all tasks (74%, kappa = 0.46), outperforming the accuracies achieved in each task individually.

Navigation metrics performed better with the SVM (Task 1 and Task 2) and KNN (Task 3) algorithms. The highest accuracy was achieved by combining all the tasks (84%, kappa = 0.68), with performance 14 percentage points higher in the young age-group (TP = 0.91 vs. 0.77). When taking each task individually, the highest accuracy was obtained under Task 2 conditions (67%, kappa = 0.34) compared to Task 1 (61%) and Task 3 (63%).

Regarding the Posture + Interaction signal, SVM models worked better in all-task scenarios. Specifically, the model including “all tasks” presented 83% accuracy (kappa = 0.65) and a balanced TP rate. The accuracy of the model that included only the free exploration task (Task 1) was 7% higher (79%, kappa = 0.58) than that achieved in the goal-oriented tasks separately (72%, kappa = 0.43).

Finally, the HBT signal worked better with SVM in Task 1 and Task 2. The highest accuracy and a balanced confusion matrix (TP) were achieved with SVM in all task-combined categories (81%, kappa = 0.62). When tested individually, Task 2 had the highest accuracy level (79%, kappa = 0.58) compared to Task 1 (77%) and Task 3 (74%). Based on the TP, the classifier was significantly better at classifying old age-group cases than young age-group cases in free exploration and goal-oriented contexts (> 20%).

The highest age prediction accuracies were shown in models that included “all tasks,” with Navigation, Posture + Interaction, and HBT being more accurate than ET. HBT was more accurate in predicting age than Posture + Interaction when considering each task separately, but only in goal-oriented shopping contexts. Conversely, in free exploration settings, Posture + Interaction metrics were slightly more accurate in predicting age than the HBT signal.

5 Discussion

The present study highlights the use of VR combined with implicit measures captured with multiple signals as an effective tool for recognizing two tested demographic variables of users: gender and age. One goal was to identify which implicit behavior metrics yield reliable information for recognizing consumers' age and gender when they perform three different tasks in a VS. Another goal was to investigate whether a VR environment favors the measurement of consumer behavior in a more ecologically viable way, thus facilitating a method for age and gender detection. In this section, we discuss our results based on (1) the contributions of significant biometric features in recognition of customers' demographic profiles in retail environments, (2) the accuracy of ML methods in predicting and classifying shoppers based on their gender and age, (3) the influence of purchase context on implicit behavior metrics, and (4) the limitations and further directions.

5.1 Contributions of implicit behavior to identify demographic factors

5.1.1 Gender

Our results confirm an overall substantial contribution of implicit features in recognizing consumers’ gender, particularly in the male group. Based on gaze behavior insights linked to zonal features, we have shown that AOIs located in the “Shelf,” “Adjacent,” and “Near” domains were relevant to detecting gender in two different task contexts. In terms of the dimensions of gaze features, spatial and kinematic categories tended to be more appropriate in recognizing men. In contrast, temporal features (e.g., “time of 1st visit to AOI”) tended to be more informative in identifying women, but only in free exploration settings. This outcome is consistent with some studies that have reported gender-based differences in temporal aspects of oculomotor behavior, specifically longer gaze fixation durations for women than for men (Andersen et al. 2012; Campagne et al. 2005; Mathew et al. 2020; Mueller et al. 2008). In virtual navigation, findings have consistently suggested that women spend relatively more time than men looking at spatial location landmarks, ultimately affecting navigation performance (Andersen et al. 2012).

Concerning the performance of zonal features measured from Navigation and Posture sources, our results suggest that crucial visitation areas in the “Adjacent,” “Near,” and “Far” domains may be more informative than spatial domains located on shelves. Some research has suggested gender-based differences in VR navigation (Halpern 2000), with traveled distance and task-goal as influential moderating factors (for a review, see Nazareth et al. 2019).

Moreover, our results confirm the slightly higher relevance of Head Posture measures in both shopping contexts compared with patterns found in hand gestures, which appear to be valid only in free exploration and to a greater extent in recognizing the gender of men. These results might reflect a gender difference in the level of presence experienced by men and women when navigating a virtual environment. According to some reports, men tend to engage more in virtual environments and thus develop stronger feelings of presence than women (Lachlan and Krcmar 2011; Felnhofer et al. 2012). Therefore, we recommend an additional evaluation using more spatial and interactive features and tasks to account for the observed gender difference.

5.1.2 Age

Based on eye movement patterns, the outcomes suggest that the context of goal-oriented searching plays a significant role in the success of age recognition, with spatiotemporal and kinematic dynamics of gaze fixation being insightful. When participants performed goal-oriented purchases, as in Task 2 and Task 3, gaze behavior was particularly relevant in several areas of interest. Conversely, eye movements did not contribute to recognizing the age of shoppers in the context of free exploration to the same extent. Previous studies show that visual attention in virtual environments is task-dependent (Hadnett-Hunter et al. 2019). Therefore, a possible explanation for such a discrepancy between task contexts is that specific drivers of attention allocation (e.g., feature conspicuity in particular areas) might differ with age, but only when the participants’ attention is focused on a specific zone. Unlike gender, age recognition seems to be more affected by the task’s context, with the visual strategies of the youngest participants being strongly detectable.

Similarly, zonal features computed from Navigation and Posture (Head and Hand) sources were also found to be notably relevant. In general, spatial navigation and head posture signals were more pronounced in the younger age-group, which aligns with previous research showing age-based differences in virtual spatial navigation strategies (Driscoll et al. 2005; Rodgers et al. 2012; Kizony et al. 2017). However, this age-related influence could also be due to a generational component that should be investigated further using more polarized age-groups.

When focusing on the contributions of features unrelated to space, all signals yield valuable information for recognizing participants’ age. The most significant contribution came from gaze dynamics and Posture (Head and Hand) patterns, showing higher recognition for the older age-group. Also, we found that both sources were, to some extent, influenced by the context of the task. For instance, when participants freely explored the VS without any goal in mind, the gaze pattern concerning the number of fixations and saccades (spatial span) was particularly relevant and more effective in discriminating members of the older age-group. Again, the influence of the task context on visual attention may explain the greater relevance of gaze duration as an indicator of engagement in the context of searching for products, but not in the context of free exploration. We believe this temporal aspect of gaze behavior might help discriminate the age of older buyers across different shopping contexts.

The fact that mean values linked to HBT technology were generally higher for the young age-group may be related to greater levels of distraction among older adults. Indeed, some studies have suggested that high VR immersion experiences contain a distracting element affecting task performance (Moreno and Mayer 2004; Richards and Taylor 2015; Parong and Mayer 2018). Therefore, adding an extra measure assessing the ability to inhibit distracting information (e.g., pop-up objects and background music) in different age-groups could be advantageous, especially in goal-oriented shopping contexts.

5.2 Contributions of the ML model

A critical contribution of this work is its detailed analysis of implicit features derived from different biometric sources to accurately identify user gender and age using an ML approach. The achieved levels of accuracy (> 70%) in both gender and age datasets validate the relative discriminating value of ET, NAV, and HBT sources. Altogether, the presented insights provide a comprehensive understanding of how multiple implicit behavioral features (i.e., body movement, spatial navigation, and gaze behavior) combine to increase classification accuracy effectively. Such a procedure overcomes the current limitations of classical statistical methods for measuring implicit responses, as recently reported (e.g., Pfeiffer et al. 2019; Shu et al. 2018). We suggest researchers assess and compare these features, selecting them according to their relevance to their work.

In the following subsections, we highlight the specific insights of this work according to the implications it may have for future VR research.

5.2.1 Gender recognition

Consistent with previous studies (Bailey et al. 2014; Pfeiffer et al. 2019), the combination of all biometric features (HBT) enhanced gender classification accuracy in all task contexts, thereby confirming our predictions. Metrics computed from the position of the participants’ heads and hands (POS + INT) produced a more accurate classification (> 82%) than ET (74%), but only in a shopping context of free exploration. In contrast, ET metrics gain relevance (79% accuracy) in the particular context of searching for a relatively high-valued product, as in the case of Task 3. A possible explanation is that different visual attention processes, namely bottom-up and top-down (e.g., Corbetta and Shulman 2002; van der Laan et al. 2015), are contingent upon the purchasing context. Unlike goal-oriented tasks (i.e., top-down processing), in the case of free browsing (i.e., bottom-up processing), the buyers’ gaze is not focused on any particular product, which holds equally for men and women. It could also indicate that, regardless of gender, the presented product features did not capture the participants’ attention, which is consistent with studies showing the influence of visual stimuli on gazing patterns (Wenzlaff et al. 2016). Conversely, in the forced-search purchase task, participants were driven by an objective in mind, and men and women differed in how they visually latched onto specific product features. This suggestion is consistent with reported gender-based differences in visual attention based on gaze patterns (Abdi Sargezeh et al. 2019; Hwang and Lee 2018). In addition, the perceived tangible attributes of products, such as purchase price, can differ between genders, with women being more sensitive to price than men (Kraljević and Filipović 2017). In our study, the economic value of to-be-searched products was substantially lower in Task 2 (i.e., 1–5€) than in Task 3 (i.e., 105–180€). Such a price difference may explain why gender recognition accuracy was slightly higher in Task 3 (i.e., 79%) compared to Task 2 (i.e., 72%). One possible explanation is that purchase price engages different visual attention mechanisms in men and women (see Guo et al. 2016; Hwang and Lee 2018). Thus, the ET signal contributes strongly to recognizing gender in VR retail contexts when a specific goal drives the purchase. At the same time, in a shopping experience of free exploration, eye movement metrics may be helpful only if products have remarkable visually salient features (e.g., eye-catching colors and prominent contrast).

Regarding the ability of navigation to predict gender, achieved accuracies when shoppers freely explored the store (73%) were, on average, 14% higher than when the shoppers were searching for a particular product. This imbalance could be because wandering from aisle to aisle in free exploration is more likely to occur than in goal-oriented purchases, which also entails more considerable gender-based differences (Mueller et al. 2008; Spiers et al. 2008; Tlauka et al. 2005). Finally, the accuracy of the prediction of our model increased for all signals when the three tasks were combined into one category, thus validating, for the first time, the proposed computational methodology for identifying gender in a VR shopping context.

5.2.2 Age recognition

According to the results, the adopted ML approach worked effectively for all signals. In terms of classification accuracy, HBT features achieved the highest values (> 74%) in all shopping tasks, performing slightly better than ET features (< 69%). The main explanation for this is that body posture, gait, and movement variability may be more discriminating of age-groups than eye movement patterns. Such variability may be improved when measuring body posture in older age-groups (e.g., Del Din et al. 2016; Kang and Dingwell 2009), affecting the algorithm’s prediction performance. Finally, our results indicate that a task integration approach is more advantageous than considering each task separately. This is particularly relevant in the case of navigation measurements, where we found accuracy to increase by 20% when all the tasks were combined in the classification model.

In sum, compared to other signals, HBT is the best indicator for gender and age recognition in all task contexts. In future approaches, the task context (free browsing versus goal-oriented) may be more critical for classifying gender than age, particularly when using navigational measures.

5.3 Conclusions and implications

Effectively detecting consumers’ gender and age in VR settings based on implicit measurements is a promising, innovative approach with practical implications for marketers and consumer experience research (see Grewal et al. 2017; Martínez-Navarro et al. 2019). A notable contribution of this work is the possibility of inferring the gender and age of an anonymous shopper visiting an online virtual store using an ML model with high classification accuracies (Tables 5, 6) for multiple behavioral signals (ET, navigational patterns, posture and interactions with products) and tasks. This is not possible using traditional methods that rely either on data collection via surveys or on adopting less generalizable implicit behavior assessments. Moreover, our results suggest that zonal features related to body posture and navigation sources should be used conjointly with eye movement patterns for better detection of demographic attributes. For prediction purposes, our ML model highlights the superior accuracy of combining all sources of information over taking an individual approach. To date, research tracking consumer behavior using ET, navigation, posture and interaction inside a VR retail store can only be found in previous work conducted in our lab (Moghaddasi et al. 2021; Khatri et al. 2022). The results of this work complement those reported in Khatri et al. (2022) on the usefulness of implicit behavioral signals in classifying consumers based on their personality traits. Hence, our study goes one step further by validating the accuracy and methodology of the provided ML model to also recognize the age and gender of shoppers in a virtual store.

We also believe that the results of this work could be of potential interest for retail designers and market researchers. In line with the marketing research literature (e.g., Croson and Gneezy 2009; Fang et al. 2016; Yoon and Occeña 2015), classifying consumers based on demographic variables has been shown to help marketers in personalizing products while enhancing shopping experiences. Particularly in the context of v-commerce, our tested ML model provides information on consumers’ gender and age features at the offset of the purchase process. Having this knowledge raises the possibility for retailers to analyze which areas of a hypothetical v-commerce store could be more customizable according to buyers’ demographic profile. Overall, the proposed approach involves a new methodology in the study of consumer behavior that is more integrated and aligned with generalizable VR research.

5.4 Limitations and further research directions

Despite the stated advantages of the proposed study, there are some methodological limitations to be addressed in future research. The first limitation is related to the specific task contexts and prediction model time interval used in this study. Validation of the procedure requires performing direct comparisons using different task versions and stimuli to extrapolate the obtained results to other shopping environments. For instance, seasonal purchases, such as those during Christmas, hold a potentially rich source of information for studying gender recognition because sex-role orientation has been proven to influence shopping activities (Laroche et al. 2000).

A second limitation concerns the scope of the study. The proposed method allows assessing the age and gender of the consumer only after the purchase is over. The fact that our assessed model was limited to the end of the consumer buying process prevented us from making predictions during the earlier exploration and product search phases. A precise definition of the minimum analysis time required within the v-commerce experience could clarify whether our prediction approach also generalizes to the time interval during the purchase.

A third limitation of our study is that we did not use a separate test set to validate our results due to the small sample size. This is a common problem in experimental laboratory studies that aim to classify subjects, since each participant contributes one sample to the dataset and only one recording can be done at a time (Pfeiffer et al. 2020). Future studies with larger samples are needed to evaluate the results in real situations. Furthermore, the sample was limited in terms of the age range of participants. Young volunteers are likely to handle VR environments and HMD technology more efficiently than their older counterparts (Lian and Yen 2014), which restricts the generalizability of our results. For this reason, replications with samples covering a wider age range are necessary to support the versatility of this methodology in recognizing age in older participants.
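Regarding the absence of a separate test set, one option often considered when each participant contributes a single sample is leave-one-out cross-validation, in which every participant is held out once and predicted by a model trained on all the others. The following is a minimal sketch of that idea only; it is not the validation procedure or classifier used in this study. The nearest-centroid classifier, the function names, and the toy feature values are illustrative assumptions.

```python
from statistics import mean


def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_index), holding out one participant at a time."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out


def fit_centroids(features, labels):
    """Compute the mean feature vector of each class (hypothetical stand-in classifier)."""
    centroids = {}
    for label in set(labels):
        rows = [f for f, l in zip(features, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def loo_accuracy(features, labels):
    """Fraction of held-out participants classified correctly."""
    hits = 0
    for train, test in leave_one_out_splits(len(features)):
        model = fit_centroids([features[i] for i in train],
                              [labels[i] for i in train])
        hits += predict(model, features[test]) == labels[test]
    return hits / len(features)


# Synthetic toy data (not from the study): one row per participant.
toy_features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
toy_labels = ["a", "a", "a", "b", "b", "b"]

print(loo_accuracy(toy_features, toy_labels))
```

Such a scheme uses every participant for testing exactly once, but it does not replace an independent test set when model selection is performed on the same data.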

A fourth limitation is that our work is based on specific interaction and navigation metaphors (i.e., natural walking), which may limit some technical aspects of the 3D experience, such as preference or usability. To overcome such limitations, future research could examine the scope of these results using other locomotion and interaction techniques, such as redirected teleportation (Liu et al. 2018) or haptic gloves instead of controllers (Parastoo et al. 2019). For instance, with teleportation, users can navigate much faster and the available walking space increases, which may lead to a different experience that affects their implicit reactions. In the case of the virtual hand metaphor, wearing gloves offers more natural interactions and a stronger feeling of presence (e.g., through haptic response), which may change how gesture-based features contribute to demographic recognition.

Future research should consider v-commerce applications based on adaptive models (Pfeiffer et al. 2020) capable of identifying the most effective behavioral features in real time. A further direction is therefore to build a methodology that customizes the shopping experience and product placements by adapting to the unique characteristics of the consumer in real time. Furthermore, including other implicit measurements not tested in our study, such as cardiac variability, skin conductance, facial gestures, and voice, could make the prediction of consumers’ attributes more versatile and generalizable.

Moreover, future research should address group comparisons beyond the age and gender of the participants, to assess how our model performs with more granular groups or even when the task is framed as a regression problem. Finally, the increasing interest in the study of social cues in retail environments has created a new research direction (e.g., Barnes 2016; Jang Ho Moon et al. 2013; Silva and Bonetti 2021). We believe that our results might be influenced by an interactive, avatar-based virtual shopping paradigm (e.g., a salesperson or peer consumer), which would likely mediate kinematic metrics in hedonic and utilitarian shopping. In this regard, our work is a starting point for inferring not only gender and age but also other, more complex attributes (e.g., cultural level, race, cognitive style), in order to offer a personalized virtual marketing experience.