In Study 1, we designed an experiment with eight different treatments and invited 375 subjects to participate (43 for a pretest and 332 for the main experiment). We first used analysis of variance (ANOVA) and analysis of covariance (ANCOVA) to examine the influence of colorful packages, extra gifts, and preprinted return labels on consumers’ return intentions. We then estimated a PLS model to better understand how these effects operate through consumers’ cognitive–affective reaction processes.
We based our choice of package stimuli on related research and the pretest results. For the pretest, we invited 43 German participants [50% female; average age 36.1 years (SD = 10.20)] to provide feedback on different manipulations. Figure 3 shows the final treatments for both the product and package stimuli. The results of the pretest also show that our scales achieved good reliability.
Four criteria guided our selection of a product: (1) a high proportion of our target population should be interested in buying this product, (2) the product should belong to an industry whose return rate is relatively high, (3) the product should have both utilitarian and hedonic value, and (4) a product defect should be easy to manipulate in order to enhance consumer return intention after the package opening. Keeping all these criteria in mind, we selected a jersey of the German national football team and added a 5 mm × 5 mm black stain on the back. We decided to use only one black stain because, in our pretest, more than one mark led to extremely high return rates (> 90%), strongly reducing the variance in our dependent variable. To ensure that people noticed the stain and assessed the problem similarly, we stated that “You have no idea what caused the stain, but you notice that you might not be allowed to return the jersey after washing it.”
According to a 2001 report, blue is the favorite color of 40% of Germans, followed by red (19%) and green (18%) (Institut für Demoskopie Allensbach 2001). In our pretest, both men and women indicated that the color of an ideal delivery package, other than standard brown, was blue. Thus, we chose blue-colored delivery packages for our experiment. Crowley (1993) documents that blue has a strong impact on shopping in terms of both evaluation and activation, which meets the requirements of our research goal. The control group received a delivery package in standard brown.
In the pretest, we also tested the estimated price of various extra gifts. In line with the results, we selected Nivea Creme Care as the extra gift for our main test. The price (approximately €2.5) is 3% of the price of a soccer jersey, and both men and women can use it.
Preprinted return label
We placed a preprinted and prepaid DHL label with a return shipping address into the package. To return the package, participants needed only to glue this return label to the original delivery package and bring it to a post office. In the control group, participants needed to log into their accounts, complete several forms, and then print the return document themselves. We reasoned that a preprinted, prepaid DHL label could significantly reduce return costs and thus, according to utility theory, increase consumers’ return intentions.
Design and procedure
We employed a 2 (colorful vs. not colorful) × 2 (gift vs. no gift) × 2 (preprinted return label vs. no preprinted return label) between-subjects design on the online survey platform Dynamic Intelligent Survey Engine. In step 1, we randomly assigned participants to one of the eight experimental conditions and asked for demographic information (i.e., age, gender, and career). In step 2, we simulated an online purchase process. Participants were asked to imagine that they had decided to buy a jersey of the German national football team for the upcoming World Cup and then to specify their size and gender in order to obtain the appropriate jersey.
In step 3, we clarified that they were to imagine that they paid for their selected jersey, and then we asked for their emotions (pleasure and arousal) toward and perceived utility (utilitarian and hedonic) of the jersey. For step 4, we needed to create an artificial time delay between the payment and the virtual receipt. Thus, we employed a filler task in which participants answered questions about their online shopping experience and personality by identifying the extent of their extroversion, agreeableness, conscientiousness, neuroticism, and openness (a 10-item short version of the Big Five Inventory in German) (Rammstedt and John 2007). Afterward, participants learned that “after 3 days, you receive your order.” Subsequently, in step 5, we told participants, “Please assume that you were the person who opened the package in the video” and then used a 30-s stop-motion animation to show the entire opening process. In stop-motion (also known as stop-frame) animation, an object (in this case, the package) is moved in small increments between individually photographed frames, creating the illusion of movement when the series of frames is played as a continuous sequence. This technique allowed us to control the timing and method of package opening. The eight videos in the eight experimental groups were exactly the same except for our manipulations. The gift and/or the preprinted return label appeared for approximately 5 s (six photos for the process taking the items from the package, two photos for a full-screen display of the details, and another two photos for putting the items down; for details, see Table 1). We used an amplification process for the gift and preprinted return label to ensure that every participant could recognize each stimulus clearly. Participants could not move to the next step until they finished watching the whole video.
In step 6, we surveyed participants’ current emotions and the perceived utility of the whole package, along with their satisfaction and return intentions. To keep the package in participants’ minds, we placed a picture of the package, showing all the items, at the top of the questionnaire (see Fig. 3). In step 7, in order to match their return intention to real return behavior, we communicated that every participant had a chance to win the real package shown in the video (with extra gift/colorful package/preprinted return label and a jersey with a stain) and that they could send the jersey back for a new, flawless one. We then asked whether they would really return their jersey in that case. As an additional motivation and to increase realism, we asked participants to voluntarily give their contact information and jersey size.
In the final step 8, we randomly chose five participants and sent them the package exactly as shown in the video of their treatment group and asked them whether they would like to return the flawed jersey. If they wanted to return, they had to bring the parcel to the post office and had to wait until they received their flawless jersey. This additional step allowed us to observe their real return decision and examine whether their answers (return intentions) in the experiment matched their real behavior. Figure 4 summarizes the entire experimental procedure.
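To make the random assignment in step 1 concrete, the following is a minimal sketch of how participants could be allocated to the eight cells of the 2 × 2 × 2 design. It assumes simple (unrestricted) randomization, which the text does not specify; the factor names, level labels, and function names are hypothetical and are not taken from the survey platform.

```python
import itertools
import random

# Hypothetical factor names; the levels mirror the 2 x 2 x 2 design in the text.
FACTORS = {
    "color": ("blue", "brown"),
    "gift": ("gift", "no_gift"),
    "label": ("preprinted_label", "no_label"),
}


def build_conditions():
    """Return the eight cells of the full factorial design as dicts."""
    names = list(FACTORS)
    return [
        dict(zip(names, levels))
        for levels in itertools.product(*FACTORS.values())
    ]


def assign(participant_ids, seed=None):
    """Assign each participant to one cell uniformly at random.

    Assumption: simple randomization. Quota-based or blocked
    randomization would need additional bookkeeping.
    """
    rng = random.Random(seed)
    conditions = build_conditions()
    return {pid: rng.choice(conditions) for pid in participant_ids}


# Example: allocate the 332 main-study participants.
assignment = assign(range(332), seed=1)
```

With simple randomization the cell sizes will differ slightly by chance; a design that needs equal cells would instead shuffle a balanced list of conditions.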
We adapted our items for measuring the constructs from prior marketing research (see Table 2) using multi-item Likert-type scales for each. We assessed perceived utility using the hedonic/utilitarian scale proposed by Voss et al. (2003). This scale includes eight-point semantic differential items, but we decided to use only seven points according to the Cronbach’s α results (> .7). Moreover, we measured emotions using the PA model (including three items for pleasure and three items for arousal) from Mehrabian and Russell (1974). We measured perceived utility and emotion twice: once after participants’ purchase decisions and again after the package-opening process. Note that the perceived utility tested following the package opening pertains to the whole package. For consumer satisfaction, we adopted Finn’s (2005) three-item scale, which is widely used in marketing research.
To assess consumer return intention, we used the Net Promoter Score (NPS), which is based on an 11-point Likert scale (0 = “not at all likely” and 10 = “very likely”) introduced by Reichheld (2003) and widely used to measure attitudes or behavioral intentions (Samson 2006). The NPS is calculated with a single question, in our case, “How likely is it that you would return the package?” We identified participants who responded with a score of 9 or 10 on the NPS as package returners and those who responded with a score of 0–6 as package keepers.
In the real return behavior check (steps 7 and 8), we coded participants’ answers with a dummy variable equal to 0 if they claimed to keep the whole package shown in the video and 1 if they opted to send it back to get a new one. Although receiving a gift is different from a real purchase, the return decision is similar in our simulated case. Thus, we believe participants’ choice of gift return can proxy for their actual behavior after receiving a product with a small flaw. We then compared participants’ return intention (0–6 for non-return, 9–10 for return) and their real return choice (0 for non-return, 1 for return); these two answers were highly correlated (p < .01).
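The coding rules above can be written out as a short sketch. The cut-offs (9–10 for returners, 0–6 for keepers) and the 0/1 dummy come from the text; the function names and the label for scores of 7 and 8, which the text assigns to neither group, are hypothetical.

```python
def classify_return_intention(score):
    """Map a 0-10 likelihood-to-return score to the groups used in the text."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score >= 9:
        return "returner"
    if score <= 6:
        return "keeper"
    # Scores of 7 or 8 fall in neither group; this label is a placeholder.
    return "unclassified"


def matches_behavior(score, returned):
    """Compare stated intention with the real-return dummy (0 = keep, 1 = return).

    Returns True or False for classified scores and None for scores of 7 or 8.
    """
    group = classify_return_intention(score)
    if group == "unclassified":
        return None
    return (group == "returner") == bool(returned)
```

For example, `matches_behavior(9, 1)` is `True` (a stated returner who actually returned), whereas `matches_behavior(3, 1)` is `False`.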
All survey items were presented in the respondents’ native language (German). We pretested the final questionnaire with doctoral students and university employees to identify unclear and ambiguous questions. The convergent and discriminant validity for the constructs exceeded all critical values (see Table 2).
After conducting a pretest with 43 participants who came from our target population of native Germans with Internet access, we employed a professional market research company to collect a representative sample for our main study in March 2015. Our initial sample for our main study included 332 participants, all of whom had recent online shopping experience. To keep our sample representative within each experimental group, we set quotas for age and gender according to Europe’s 2014 online shopping consumer report (Eurostat 2014). To verify the validity of the responses, we checked each participant’s response patterns and completion time. We excluded five questionnaires that were completed in less than five minutes, six questionnaires that exhibited a visible pattern of the same response on all the Likert scales, and one questionnaire from a participant who reported that his computer was unable to play the video. The final sample thus consisted of 320 completed surveys (see Table 3). An ANOVA revealed no significant differences in participants’ age, gender, occupation, and soccer preference among the eight experimental groups, which indicates that our randomization worked as intended.
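The three exclusion rules can be expressed as a simple filter. This is a sketch under stated assumptions: the `Response` record and its field names are hypothetical, and "the same response on all the Likert scales" is read here as literally identical answers to every Likert item, which may be stricter or looser than the visual check used in the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Response:
    minutes: float             # completion time in minutes
    likert: List[int]          # answers to the Likert items
    video_played: bool = True  # False if playback failure was reported


def exclusion_reason(r, min_minutes=5.0):
    """Return the first exclusion rule a response violates, or None if kept."""
    if r.minutes < min_minutes:
        return "too_fast"
    if len(set(r.likert)) == 1:
        return "straight_lining"
    if not r.video_played:
        return "video_failed"
    return None


def clean(responses):
    """Keep only responses that violate none of the exclusion rules."""
    return [r for r in responses if exclusion_reason(r) is None]


# The counts reported in the text: 332 initial responses, 12 exclusions.
assert 332 - 5 - 6 - 1 == 320
```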
Common method bias analysis
We strived to design the questionnaire carefully, which entailed ensuring participants’ anonymity, using a random order for survey items, providing concrete survey instructions, and asking participants to answer the questions as honestly as possible (Podsakoff et al. 2003). Nevertheless, self-reported data can suffer from common method biases, such as consistency motifs or social desirability concerns (Podsakoff et al. 2003). Thus, we adopted the marker variable approach (Rönkkö and Ylitalo 2011) to test whether a common method bias confounded our results.
We performed the marker variable method (Rönkkö and Ylitalo 2011) with two marker items (two items for Openness, which the ANCOVA in Table 4 shows to be unrelated to the dependent variables) taken from our empirical data set; these items were not included in our research model and lack an explicit theoretical influence on the constructs in our research model. Following Rönkkö and Ylitalo’s (2011) method, we found relatively low correlations between the marker items and study items (the mean values of the correlation coefficients were .046 and .061) and determined that these low correlations must have been caused by the method. Next, we included the marker items as additional latent variables in our PLS analysis model and compared the results between the original research model (without the marker variables) and the common method bias test model (with marker variables). The results indicate that the marker variables had no significant effects on the dependent variables (satisfaction and return intention) or on other effective endogenous variables (utilitarian utility, hedonic utility, and pleasure) (see Online Resource 1.1). In any case, only one relationship between the marker variable and arousal was significant; however, because arousal was non-significant (see Sect. 3.6), this finding does not influence our main conclusions. In addition, the path coefficients between all main contrasts and consumer behavior did not significantly differ between these two models. Therefore, we can conclude that a common method bias did not likely distort the main results of our study.
Measurement model validation
Our research model contains seven reflective multi-item constructs and six one-item constructs. The quality of the reflective measurement models depends on convergent validity and discriminant validity (Bagozzi and Yi 1988).
To analyze convergent validity, we determined indicator reliability and internal consistency. All the indicator loadings of the reflective multi-item constructs were, at a minimum, significant at the .01 level. For the internal consistency assessment, we examined the composite reliability (CR), Cronbach’s alpha, and average variance extracted (AVE) (see Table 2) (Teo et al. 2003). All the CR indices, as well as the Cronbach’s alpha values, met the threshold of .7 (Nunnally et al. 1967). Furthermore, for AVE, all reflective multi-item constructs met Fornell and Larcker’s (1981) suggested critical level of .5. In summary, the constructs satisfied all criteria for indicator reliability and internal consistency, in support of convergent validity.
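For readers unfamiliar with these indices, the sketch below shows the standard textbook formulas for CR and AVE (computed from standardized loadings) and for Cronbach’s alpha (computed from item scores). It is an illustration only: the function names are hypothetical, and the loadings in the usage note are made up rather than taken from Table 2.

```python
from statistics import pvariance


def composite_reliability(loadings):
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances).

    Assumes standardized loadings, so each error variance is 1 - loading^2.
    """
    s = sum(loadings)
    error = sum(1 - l ** 2 for l in loadings)
    return s ** 2 / (s ** 2 + error)


def average_variance_extracted(loadings):
    """AVE = mean of the squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)


def cronbach_alpha(items):
    """Cronbach's alpha from a list of per-item score lists (equal lengths)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance = sum(pvariance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / pvariance(totals))
```

With three hypothetical loadings of .8 each, AVE is .64 and CR is about .84, so such a construct would pass both the .5 and the .7 thresholds.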
We also analyzed the constructs’ discriminant validity by examining whether the square root of the indicators’ AVE within any construct was higher than the correlations between it and any other construct (Son and Benbasat 2007). All included constructs met this criterion, thus evidencing discriminant validity (see Online Resource 1.2). Moreover, none of the correlations between any pair of constructs were higher than the threshold value of .9 (Son and Benbasat 2007), and there was no evidence of critically high cross-loadings between the main constructs (see Online Resource 1.3). Therefore, we can conclude that the reflective constructs possessed discriminant validity.
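The discriminant validity check described above (the Fornell–Larcker comparison plus the .9 correlation ceiling) can be sketched as follows. The data structures, function name, and example values are hypothetical; the actual AVE values and correlations are those reported in Table 2 and Online Resource 1.2.

```python
import math


def discriminant_violations(ave, correlations, max_corr=0.9):
    """List construct pairs that fail either discriminant validity check.

    ave: {construct: AVE}
    correlations: {(construct_a, construct_b): correlation}
    A pair fails if |r| is not below the square root of both AVEs,
    or if |r| exceeds max_corr.
    """
    violations = []
    for (a, b), r in correlations.items():
        sqrt_floor = min(math.sqrt(ave[a]), math.sqrt(ave[b]))
        if abs(r) >= sqrt_floor or abs(r) > max_corr:
            violations.append((a, b))
    return violations


# Hypothetical values for illustration only.
example_ave = {"satisfaction": 0.70, "pleasure": 0.64}
example_corr = {("satisfaction", "pleasure"): 0.55}
print(discriminant_violations(example_ave, example_corr))
```

An empty list means every pair of constructs satisfies both criteria.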
Results from ANOVA and ANCOVA
We first used ANOVA to test the significant differences in satisfaction and return intention among different package design groups (color, gift, and return label). We then added participants’ demographics and personality to the analysis model as covariates (ANCOVA) to test the stability of the results (see Table 4 and Fig. 5).
The results of both analyses showed that an extra gift can significantly influence consumer satisfaction and return intention, while a colorful package only has a significant impact on consumers’ return intentions. More specifically, a colorful package significantly reduced consumers’ return intentions ($R_{\text{color}} = 9.076$ vs. $R_{\text{no color}} = 9.662$, see Fig. 5b; F = 3.66, p < .1, see Table 4) compared with a standard brown package, but had no significant impact on consumer satisfaction. Meanwhile, an extra gift in the package increased consumer satisfaction ($S_{\text{gift}} = 2.576$ vs. $S_{\text{no gift}} = 2.072$, see Fig. 5a; F = 10.685, p < .001, see Table 4) and reduced return intentions ($R_{\text{gift}} = 9.050$ vs. $R_{\text{no gift}} = 9.648$, see Fig. 5c; F = 4.417, p < .05, see Table 4). These results offer initial evidence for the impact of package design on consumer return behavior.
Our results further showed that a preprinted return label had no significant effect on consumer satisfaction or return intentions (p > .1, see Table 4). However, this result might have occurred because European consumers know that their return rights are highly protected by the Consumer Protection Law, and thus the 14-day return policy is already deeply rooted in their decision processes. The other possible reason is that the preprinted return label does not significantly reduce return costs. We also tested the interactions among color, gift, and return label, but none of them were significant.
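As a reminder of what the F values in Table 4 measure, the sketch below computes the F statistic for a single factor (for example, gift vs. no gift) from raw scores. This is a deliberately simplified one-way ANOVA, not the three-factor ANOVA/ANCOVA reported above, which also involves interaction terms and covariates; the function name and the toy data are hypothetical.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, one list of scores per experimental group.
    """
    all_scores = [v for g in groups for v in g]
    grand_mean = mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Toy data: two groups of three scores each.
f_value, df_b, df_w = one_way_f([[1, 2, 3], [2, 3, 4]])
```

A larger F indicates that the difference between group means is large relative to the spread of scores within groups.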
Results from PLS analysis
To analyze the package-opening process more thoroughly, we operationalized our model as a structural equation model and estimated it using Smart PLS (v.3.2.1) (Ringle et al. 2015). This method is well suited for exploratory research and shares the modest distributional and sample size requirements of ordinary least squares linear regression. We also used two models to individually test the cognitive process (without affective reactions) and the affective process (without cognitive reactions); the results can be found in Online Resource 1.1. To reduce common method bias, we included common control variables for our main dependent variables: age, gender, soccer preference, and personality. The main results appear in Fig. 6.
The squared multiple correlations ($R^2$) of .39 for satisfaction and .28 for consumers’ return intention are high, which means 39% of the variance in satisfaction and 28% of the variance in return intention can be explained by the chosen constructs (Glantz and Slinker 1990). To assess the significance of the path coefficients, we used the bootstrapping procedure implemented in Smart PLS with 1000 resamples. Figure 6 displays the results, with continuous lines representing significant path coefficients and dashed lines indicating non-significant paths.
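The logic of bootstrapped significance tests can be illustrated with a minimal sketch. It resamples cases with replacement 1000 times and uses the spread of the re-estimated coefficient as its standard error. As a loud simplification, it bootstraps a single ordinary least squares slope rather than a full PLS path model, so it is a stand-in for, not a reproduction of, the Smart PLS procedure; all names and the synthetic data are hypothetical.

```python
import random
from statistics import mean, stdev


def slope(x, y):
    """OLS slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def bootstrap_t(x, y, resamples=1000, seed=0):
    """Return (estimate, bootstrap standard error, t value) for the slope."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(resamples):
        # Draw n cases with replacement and re-estimate the coefficient.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(slope([x[i] for i in idx], [y[i] for i in idx]))
    original = slope(x, y)
    se = stdev(estimates)
    return original, se, original / se


# Synthetic data with a negative relationship, for illustration only.
_rng = random.Random(42)
xs = [_rng.gauss(0, 1) for _ in range(200)]
ys = [-0.4 * a + _rng.gauss(0, 1) for a in xs]
estimate, se, t_value = bootstrap_t(xs, ys)
```

A coefficient is conventionally treated as significant at the .05 level when the absolute t value exceeds roughly 1.96.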
Package color can positively influence consumer return decisions, as we expected, but surprisingly, our data indicate that it works only through the cognitive process via perceived utilities. These results confirm Chebat and Morrin’s (2007) major finding that, in the realm of consumer behavior, the influence of colors is largely facilitated by cognitive rather than affective mechanisms. Specifically, we found that the perceived utilitarian utility of the blue delivery package is higher than that of the standard brown package (.166, p < .01). In other words, the blue hues associated with a high-value brand can enhance consumers’ evaluation of packaged products.
The extra gift significantly increased both the utilitarian utility (.107, p < .05) and the hedonic utility (.130, p < .05) of the whole package, but showed no significant direct impact on arousal and pleasure. The reason might be that because e-retailers commonly offer extra gifts, consumers may not feel special when receiving one. At the same time, consumers can easily recognize the utility benefits of extra gifts. When comparing the relative impact of gifts and color, the former works more effectively, but the costs of the latter are significantly lower.
Our results also show that utilitarian and hedonic utility impact the consumers’ post-purchase decisions in various ways. Higher utilitarian utility increases consumer satisfaction (.183, p < .01), which is consistent with previous empirical findings (e.g., Anderson et al. 2009). In contrast, hedonic utility is positively and strongly related to pleasure (.551, p < .01).
In line with our expectations, satisfaction is negatively related to consumer return intention (− .379, p < .01). In short, the more satisfied consumers are after opening the package, the less return intention they exhibit. The results also indicate that pleasure plays the most crucial role in consumers’ post-purchase decisions. Pleasure is the only factor in our research model that can directly increase satisfaction (.449, p < .01) and simultaneously decrease return intention (− .212, p < .01). However, arousal influenced neither satisfaction nor return intention. Indeed, the PLS results revealed that arousal had no significant relationship to any other constructs in our research model.
Furthermore, by using the bootstrapping procedure as a mediation test (Suwelack et al. 2011), we found significant indirect effects of the package design (i.e., extra gifts and colorful packages) on emotions and return intentions (see Table 5), emphasizing the cognitive–affective reaction process. Specifically, we found that extra gifts invoke more pleasure by increasing hedonic utility (.072, p < .05, see Table 5). In turn, pleasure can directly and indirectly (via satisfaction, − .171, p < .01, see Table 5) reduce return intentions. In addition, only a colorful package (.030, p < .1, see Table 7) can indirectly lead to higher consumer satisfaction, namely by increasing the utilitarian utility. Satisfaction is thus an important mediator, through which utilitarian utility (− .073, p < .05, see Table 5) and pleasure (− .171, p < .01, see Table 5) can significantly reduce consumer return intention.
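An indirect effect in a path model is commonly computed as the product of the path coefficients along the mediation chain. The sketch below applies that rule to the direct coefficients reported in this section; the construct labels and function name are hypothetical, and whether the reported indirect effects were obtained exactly this way is an assumption. Because the inputs are rounded to three decimals, the products need not match the reported values exactly.

```python
# Direct path coefficients as reported in the text (rounded values).
PATHS = {
    ("gift", "hedonic_utility"): 0.130,
    ("hedonic_utility", "pleasure"): 0.551,
    ("color", "utilitarian_utility"): 0.166,
    ("utilitarian_utility", "satisfaction"): 0.183,
    ("pleasure", "satisfaction"): 0.449,
    ("satisfaction", "return_intention"): -0.379,
}


def indirect_effect(*chain):
    """Multiply the path coefficients along a chain of constructs."""
    effect = 1.0
    for source, target in zip(chain, chain[1:]):
        effect *= PATHS[(source, target)]
    return effect


gift_on_pleasure = indirect_effect("gift", "hedonic_utility", "pleasure")
color_on_satisfaction = indirect_effect("color", "utilitarian_utility", "satisfaction")
pleasure_on_return = indirect_effect("pleasure", "satisfaction", "return_intention")
utility_on_return = indirect_effect("utilitarian_utility", "satisfaction", "return_intention")
```

The first three products (about .072, .030, and − .170) are close to the reported indirect effects of .072, .030, and − .171; the last (about − .069) differs somewhat from the reported − .073, which is plausible given the rounding of the inputs.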
Moreover, we also tested the models that solely included cognitive or affective reactions. The results (see Online Resource 1.1) show that extra gifts and colorful packages can have direct effects on perceived utilities, but not on emotions. The only significant direct effect on emotions is the one from extra gifts on pleasure (.097, p < .1), but that might be a result of cognitive reactions like hedonic utility (.072, p < .01, see Table 5). Furthermore, we tested the model in reverse order (i.e., affective-cognitive reaction process) and found that our package manipulations did not directly influence affective user reactions (pleasure and arousal). Thus, a cognitive-affective reaction process seems more plausible based on our data.
Among the control variables, only agreeableness had a significantly negative effect on return intentions. In other words, consumers who are kind, sympathetic, cooperative, warm, and considerate are more tolerant of product defects, as might be expected.
Real return behavior check
Following the experiment’s completion, we randomly drew five winners [2 men and 3 women, average age 25.9 years (SD = 10.78)] from the final sample of 320 participants. They received the package as shown in the video of their experimental group. The participants who did not receive a pre-paid DHL label were allowed to email us for a free DHL label (a PDF file). Four of the winners returned the slightly flawed jersey to get a new one and one kept it, which was exactly in line with their stated survey response. This small-number sample may serve as initial evidence that the measured return intention is a reasonable and valid proxy for actual return behavior. This point will be further corroborated in our third study.