Effects of front-of-pack labels on the nutritional quality of supermarket food purchases: evidence from a large-scale randomized controlled trial

To examine whether four pre-selected front-of-pack nutrition labels improve food purchases in real-life grocery shopping settings, we put 1.9 million labels on 1266 food products in four categories in 60 supermarkets and analyzed the nutritional quality of 1,668,301 purchases using the FSA nutrient profiling score. Effect sizes were 17 times smaller on average than those found in comparable laboratory studies. The most effective nutrition label, Nutri-Score, increased the purchases of foods in the top third of their category nutrition-wise by 14%, but had no impact on the purchases of foods with medium, low, or unlabeled nutrition quality. Therefore, Nutri-Score only improved the nutritional quality of the basket of labeled foods purchased by 2.5% (−0.142 FSA points). Nutri-Score’s performance improved with the variance (but not the mean) of the nutritional quality of the category. In-store surveys suggest that Nutri-Score’s ability to attract attention and help shoppers rank products by nutritional quality may explain its performance.


Introduction
To promote healthier eating, regulatory authorities worldwide are encouraging the use of labels that provide simplified nutrition information on the front of the pack (FOP) in addition to the mandatory calorie and nutrition information already provided on the back. The European Union, for example, recently introduced a voluntary scheme for manufacturers to put graphical information about nutritional product quality on the front of the pack (Regulation 2011). However, there is disagreement about whether FOP nutrition labels truly improve food purchases and, if they do, about which specific label regulators should endorse, and companies adopt (Askew 2019).
In May 2018, Mondelēz, Mars, Nestlé, PepsiCo, Coca-Cola, and Unilever backed the Evolved Nutrition Label, a nutrient-specific system inspired by the British "Multiple Traffic Light" label. A few months later, however, Mars and Nestlé withdrew their support and, in June 2019, Nestlé announced its support for Nutri-Score. In France, four FOP nutrition labels competed for governmental endorsement in 2016. Two were analytic systems, which provide information for each nutrient: Nutri-Couleurs, an adaptation of the British traffic-light system, and the mono-colored Nutri-Repère, which was backed by industry groups. The other two provide a single summary indicator: Nutri-Score, which has started to be used by some manufacturers and retailers in France, Spain, Belgium, and Germany, and SENS, which was developed for French retailers and promoted by French, German, and Belgian food industry groups (Michail 2015).
Despite the importance of this issue, evidence of the effects of FOP nutrition labeling on food purchases in natural settings is sparse. A recent review and meta-analysis (Ikonen et al. 2020) concluded that, "Although most FOP nutrition labels help consumers identify healthier options within product sets, this does not directly translate into other measures of effectiveness, which show smaller effects and greater variability between different label types and product categories". It is still unclear whether FOP nutrition labels in general, and the four labels seeking government approval in particular, improve the nutritional quality of purchases made in real-life grocery shopping settings. The literature is also silent about whether nutritional improvements come from an increase in purchases of foods with higher nutritional quality in their category, a fall in purchases of options with lower nutritional quality, or both. Moreover, the impact of FOP labels on purchases of unlabeled foods remains unknown. This is a critical issue as current regulations prevent retailers or governments from forcing manufacturers to adopt a FOP nutrition label. Studying the purchases of unlabeled products also allows testing whether the mere presence of FOP nutrition labels in one category changes the consumer's decision process or preferences (e.g., for health versus taste) rather than simply providing information about the nutritional quality of labeled foods. Finally, disagreement exists about the role of design, notably whether FOP nutrition labels should provide summary or nutrient-specific scores, show the range of possible grades on each label, and be color-coded.
To answer these questions, the French health authorities asked the study authors to help determine which of the four FOP labeling systems mentioned earlier, if any, should receive official approval. Meanwhile, industry associations and individual retailers and manufacturers are waiting for evidence from real-life grocery shopping studies to decide which FOP system to adopt, if any. We therefore conducted a randomized controlled trial (RCT) to test the effects of these four labeling systems in sixty supermarkets over a ten-week period. To shed light on some of the mechanisms underlying the effectiveness of the labels, we also conducted a survey of shoppers in the test and control stores, both before and during the experiment. The large scale of the study, which involved adding millions of labels and tracking millions of purchases, provided the large sample size required to obtain precise estimates for effect sizes that are known to be small (Ikonen et al. 2020).
In addition to examining whether FOP labels improve the nutritional quality of supermarket food purchases, the randomized controlled trial allowed us to test whether they do so by influencing the purchase of foods that were labeled as having high or low nutritional quality, as well as to assess their effects on other products in the category without an FOP nutrition label. The field experiment also allowed us to assess whether the effects of nutrition labels vary across four product categories that differ in terms of the mean and variance of their nutrition quality. Finally, by comparing four different FOP nutrition labels, this study examines the effects of design choices, such as providing nutrient-specific information or a summary score.
Front-of-package nutrition labeling
FOP nutrition labels provide summary, simplified information about the calorie and/or nutrient content of foods on the front of the pack, sometimes augmented with evaluative symbols or color coding (McGuire 2012; Newman et al. 2018). They should not be confused with various other FOP information such as warning labels (e.g., "contains sulfites"), health or structure/function claims (e.g., "calcium helps create strong bones"), unregulated food claims (e.g., "natural"), or the regulated nutrient claims such as the "low-fat", "no trans-fat" or "extra antioxidants" studied by Kiesel and Villas-Boas (2013) or by Belei et al. (2012). They also differ from "reductive" nutrient-specific labels such as "Facts Up Front" labels which simply reproduce a subset of the nutrition information available on the back of the package without further interpretation.
FOP nutrition labels are related to, but qualitatively different from, the various initiatives developed to provide calorie or nutrition information on the menus of restaurants. Although the goal is similar, the context is not. Unlike in restaurants, foods sold in supermarkets are pre-packaged (with a few exceptions that fall outside the scope of labeling laws, such as fresh produce in bulk) and hence already provide ingredients, calorie, and nutrition information on the back of the pack. People make purchase decisions in supermarkets, whereas they make consumption decisions in restaurants. For these reasons, studies conducted in restaurants are suggestive and can be used for hypothesis generation. However, evidence that adding calorie labeling to menus can work in restaurant settings (Bollinger et al. 2011; Bleich et al. 2017) cannot be directly extrapolated to the effects of FOP nutrition labels on supermarket purchases.

Effects of nutrition labels in laboratory settings
Consumers have favorable attitudes towards FOP nutrition labels and believe that they help them make healthier food choices (Cadario and Chandon 2019; Feunekes et al. 2008; Grunert and Wills 2007; Hawley et al. 2013). Reviews (Cecchini and Warin 2016; Kiszko et al. 2014) and meta-analyses (Ikonen et al. 2020) show that there is also evidence that FOP nutrition labels help consumers identify healthier products and increase intentions to choose healthier foods in online surveys (e.g., Andrews et al. 2011; Roberto et al. 2012), although there are many counter-examples of studies finding no effects on purchase intentions (e.g., Gorski Findling et al. 2018).
Analyses of the effects of FOP labels on actual food choices or consumption found more limited effects. For example, the recent meta-analysis by Ikonen et al. (2020) found small effect sizes for interpretive (nutrient-specific) and summary FOP labels on the choice of healthier options (respectively, Fisher's z back-transformed estimated correlation = 0.079 [0.064, 0.094] and 0.023 [0.008, 0.039]) and smaller and statistically insignificant effects on actual consumption (respectively, 0.036 [−0.022, 0.093] and 0.006 [−0.030, 0.041]). However, their meta-analysis relied mainly on laboratory studies.
A recent laboratory study by Crosetto et al. (2020), which was not included in the meta-analysis by Ikonen et al. (2020), provides the strongest evidence that FOP nutrition labels can significantly influence consequential food choices in laboratory settings. In a design grounded in the paradigm of experimental economics (e.g., Muller et al. 2017), respondents were asked to shop for their households using a paper catalog of 290 products, including the full denomination of the product, color pictures, prices, and bar-code, but without a front-of-pack nutrition label. With a bar-code reader, the shoppers could access the information typically available when shopping online (e.g., the list of ingredients as well as nutrition information). To make the experiment incentive compatible, the participants were informed that they would have to buy a randomly determined subset of the products that they chose. After purchasing food for a day once, participants were unexpectedly asked to make a second "shopping trip" from the same catalog, but now with a front-of-pack nutrition label implemented on all products (or the same catalog without any nutrition label for the control treatment). The authors tested five labels, including the four that we tested and NutriMark (a French version of Australia and New Zealand's Health Star Ratings). They found that all labels significantly improved the nutritional quality of the shopping baskets purchased, in the following order (from most to least effective): Nutri-Score, NutriMark, Nutri-Couleurs, Nutri-Repère, and SENS.

Effects of nutrition labels in field settings
Evidence that FOP nutrition labels work in a laboratory setting does not mean that they will necessarily work in the field. For example, a meta-analysis of the effects of menu calorie labeling in restaurants found that it led to an 18-kcal reduction per meal in laboratory and online studies but to an insignificant 8-kcal reduction in studies conducted in restaurants (Long et al. 2015).
As reviewed by Hawley et al. (2013), few studies have examined the effects of FOP labeling on food purchases made by people in supermarkets, and those that have done so focused on a few products and short periods of time. Gaigi et al. (2015) placed an ad-hoc summary FOP label (Vita+) in two French supermarkets and found no effects on food purchases. In a study encompassing 100 stores and eight categories over six months, Nikolova and Inman (2015) found that the introduction of the summary Nuval system, which grades each food on a 100-point scale, led people to switch to higher-scoring (healthier) foods. However, that label, which was briefly used by one North American retailer before being abandoned, is unrepresentative of most FOP labels. First, it was added by the retailer to the shelf tag, and not on the packages themselves. Second and more importantly, it was available for all the products in the category, whereas EU laws, for example, require the agreement of the manufacturer. Mhurchu et al. (2017) used an RCT to deliver "traffic light" labels, Health Star Rating labels, or nutritional information via a smartphone app to 1357 shoppers. They found no significant effect on the nutritional quality of grocery purchases over a four-week period. It remains unclear whether these results would hold in a larger-scale trial over a longer period, and when nutrition information is displayed directly on packages. A recent meta-analysis of field studies (Cadario and Chandon 2020) found similarly low effect sizes for both descriptive and evaluative (interpretive) labeling (respectively, d = 0.10, SE = 0.07 and d = 0.17, SE = 0.06). Unfortunately, this study did not differentiate between food selection and consumption, included warning labels and, like the meta-analysis performed by Ikonen et al. (2020), relied primarily on studies conducted in restaurant settings.
In short, we still do not know whether established FOP nutrition labels, like the British Traffic-Light system, the French Nutri-Score, or alternatives backed by research and industry groups, significantly improve the nutritional quality of foods purchased over multiple shopping trips in real-life grocery shopping conditions and across a wide variety of brands and product categories. However, absence of evidence is not evidence of absence and, based on the positive results of recent laboratory studies, all four labels may have a small but positive effect on the nutritional quality of supermarket food purchases in real-life grocery shopping conditions.

Effects of nutrition labels across brands
Another important question is whether the effects of FOP nutrition labels come from boosting the purchases of brands or categories with high nutrition value or from hurting the purchases of brands or categories with low nutrition value. Ikonen et al. (2020) found a positive and statistically significant effect of FOP nutrition labels on intentions to purchase virtue food products such as almonds but a null effect for vice foods like cakes. Similarly, Nikolova and Inman (2015) found that the roll-out of Nuval had stronger effects in healthier product categories. They also hypothesized that Nuval would have the largest effect in categories with the widest within-category variance in nutrition quality, which provide greater opportunity to switch to healthier alternatives, but this hypothesis was not supported by their data.
To date, no field study has examined whether FOP labels have a different impact on the purchases of foods with a high, low, or unknown (unlabeled) nutritional quality within their product category. Crosetto et al. (2020) found that summary labels like Nutri-Score and SENS had larger effects on the purchases of brands carrying the most extreme labels (e.g., green or red compared to yellow). However, they did not examine whether the effects of analytic labels varied according to the nutritional quality of the food. Also, because all the brands in their study were labeled, they did not examine the effects on unlabeled foods.
Based on these findings, FOP nutrition labels may have stronger effects in product categories with a high mean and a high variance in nutrition quality. Given the small effects of nutrition labels overall, this would imply very small to null effects for categories with a low mean and variance in nutritional quality, at least if there are few substitutions from one category to another. By extending this reasoning to within-category nutrition differences, nutrition labels could increase the purchases of foods with a higher nutritional quality than the average of their category but have no effects for foods with a lower-than-average nutritional quality. Finally, given the limited effects of FOP labels on the purchases of labeled products, they are unlikely to influence the purchases of those foods in the category without a nutrition label, whose nutritional quality is not immediately knowable.

Stimuli
Four labeling systems were pre-selected following a comprehensive consultation process involving national research institutes, food manufacturers and retailers, the French national health insurance administration, consumer and patient advocacy groups, and the Ministries of Health, Agriculture, and Consumer Affairs. These labeling systems were chosen based on scientific evidence and were backed by industry and consumer associations. Each system was developed independently by competing research teams in accordance with a research and innovation contest. These contests are an effective and increasingly popular tool for encouraging innovation because they incentivize each competitor to design the best possible solution and increase the diversity of solutions (Boudreau et al. 2011). On the other hand, the lack of central coordination is a disadvantage in terms of the ability to pinpoint why one solution outperformed another. In any case, the costs involved in conducting a large-scale randomized controlled trial in natural grocery shopping conditions allow for a limited number of experimental conditions, which prevents the identification of the effects of each design characteristic.
We display a graphical representation of each labeling system in competition in Fig. 1. SENS, in panel (a), provides a summary evaluation of the nutritional quality of the food (Maillot et al. 2016). It is based on an algorithm adapted from the nutrient profiling system SAIN/LIM (Darmon et al. 2009), which scores nutritional quality based on the relative quantity of favorable ("SAIN") and unfavorable ("LIM") nutrients. This system classifies foods into four categories represented by green, orange, blue, or purple inverted pyramids. Each level has a label specifying the consumption frequency or portion size (e.g., the purple label says "occasionally or in small quantity"). Only the proper pyramid is visible on a given product's label.
The Nutri-Score labeling system in panel (b) below provides a summary evaluation based on the amount of positive and negative nutrients (Julia and Hercberg 2017). It is adapted from the British Food Standards Agency's nutrient profiling system. It grades products on a five-point scale, from A to E, and displays the assigned grade with a larger font on a sliding scale showing the five grades, colored from green to yellow to dark orange, identifying the relative nutritional quality on this scale.
Nutri-Repère, panel (c), is an analytic label that displays the amount of energy, fat, sugars, and salt per suggested portion, as well as their contribution (in percentage and as a blue bar graph) to the Guideline Daily Amounts, which represent the daily nutritional requirements of the average adult consumer. It is backed by industry bodies (Nutri-Repère 2015). We refer to it as mono-color rather than monochrome because the level and percentage information are in black, not in the same color as the bar chart, the key visual element.
Finally, Nutri-Couleurs, in panel (d), is an analytic label that provides the same information as Nutri-Repère but, like the British Multiple Traffic Light label on which it is based, it color-codes each nutrient as red, amber, or green based on thresholds set by the British Food Standards Agency (2007).
According to the classification of FOP nutrition labels (Newman et al. 2018; Ikonen et al. 2020), all of these are "interpretive" because they repeat some of the descriptive nutrition information present on the back of the package but enhance it with graphical symbols (e.g., colors, bar charts, more or less filled triangles) that help to convey the nutritional quality of each nutrient or of the food product as a whole. This is true even for Nutri-Repère, the least "enhanced" label, where a bar chart allows consumers to see at a glance the contribution of the product to daily nutritional requirements.

Procedure
From September 26 to December 4 of 2016, we placed FOP labels on food products in 40 randomly selected supermarkets in France, 10 supermarkets per labeling system. In addition, 20 control supermarkets were randomly chosen in which products had no additional FOP labeling. These 60 supermarkets belonged to three of the largest retail chains in France. The protocol was made publicly available on the website of the French Ministry of Social Affairs and Health in April of 2016 (Renaudin et al. 2016) and the intervention was authorized by ministerial decree. The field experiment was registered at the International Standard Randomized Controlled Trials Number Registry (ISRCTN 58212763). The operational settings are summarized in Fig. 4, in the Appendix.
Before the randomization procedure started, the number of stores in the treatment and control groups was chosen based on implementation costs (hence the higher number of stores in the control group) and market conditions (e.g., the market share of each retailer chain). Each treatment condition included four supermarkets from Carrefour (for a total of 16 stores), three from Simply Market (total: 12 stores), and three from Casino (total: 12 stores), while the control condition included eight supermarkets from Carrefour, six from Simply Market, and six from Casino (total: 20 stores).
The random sample of 60 stores included in the study, out of the universe of the 174 French supermarkets of these three chains, was obtained through several steps. First, we categorized the universe of stores into groups characterized by two attributes: retailer chain (Casino, Simply Market, and Carrefour) and whether the store was in a privileged or underprivileged geographical area. The latter was operationalized as the store's catchment area being in the bottom two quintiles in terms of the proportion of unskilled laborers, the only measure that was available to us for all stores. This was done to ensure that we had enough shoppers of lower socio-economic status, because prior research has shown that nutrition labeling tends to have smaller effects for this population (e.g., Elbel et al. 2009). This resulted in six groups (three chains by two area types). Second, for each of the six groups, the stores were a-priori assigned a number from 1 to N, where N was the total number of stores in that group. For example, for Carrefour stores in underprivileged areas, it was decided to have 12 stores (2 in each treatment condition and 4 in the control group) out of 22 possible stores, so N was 22 for that group and each store was assigned a number from 1 to 22. For each treatment or control condition, we then obtained random numbers from 1 to N using the www.randomizer.org site. In the Carrefour example, we obtained 12 numbers, 2 random numbers for each treatment condition and 4 for the control group, from the set of 1 to 22 (e.g., we obtained numbers 4, 12, 15, and 21 for the control condition). We then matched each of these randomly drawn numbers to the a-priori numbers assigned to each store, hence obtaining the set of stores assigned to the study for each condition.
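The within-group draw described above can be illustrated with a short sketch. It is not the procedure's actual implementation (the study used www.randomizer.org); the function name, the seed, and the choice to draw all numbers of a group at once and then split them across conditions are assumptions made for illustration.

```python
import random

# Condition names taken from the text; "control" is added by the function.
LABELS = ["SENS", "Nutri-Score", "Nutri-Repère", "Nutri-Couleurs"]


def draw_stores(n_stores, per_treatment, n_control, treatments, seed=None):
    """Assign pre-numbered stores (1..n_stores) of one group to conditions.

    Hypothetical helper: draws distinct store numbers without replacement,
    then splits them into `per_treatment` stores per treatment condition
    and `n_control` stores for the control condition.
    """
    rng = random.Random(seed)
    needed = per_treatment * len(treatments) + n_control
    drawn = rng.sample(range(1, n_stores + 1), needed)  # distinct numbers
    assignment = {}
    for k, label in enumerate(treatments):
        start = k * per_treatment
        assignment[label] = sorted(drawn[start:start + per_treatment])
    assignment["control"] = sorted(drawn[len(treatments) * per_treatment:])
    return assignment


# Carrefour stores in underprivileged areas: 22 candidates, 12 selected
# (2 per treatment condition and 4 controls). The seed is arbitrary.
example = draw_stores(22, per_treatment=2, n_control=4,
                      treatments=LABELS, seed=2016)
```

Running the same function once per group (chain by area type) would yield a stratified sample in which no store appears in more than one condition.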
Consumers were informed of the local intervention in each treatment supermarket through leaflets and displays that were rigorously identical, except for the explanation of each labeling system. Shoppers were informed about the labels present in their store (but not about the other types of labeling systems) in three ways. First, leaflets were made available in each store, describing the labeling system, the four product categories where it could be found, and two hypothetical examples (one of a high and one of a low nutritional quality food). The leaflets also described the goal of the intervention, the sponsors (French Government, National Health Care, French Fund for Food and Health) and the regions in which the study was conducted. The four leaflets were identical in structure and design and the text describing each labeling system was validated by a scientific committee to ascertain its validity and fairness (see Fig. 5 for an example of a leaflet). Second, information about the label was made available in the aisles themselves thanks to self-standing signs ("totems"). As can be seen in Fig. 6 in the Appendix, some of these displays included additional leaflets. Third, stickers were manually placed on the front of each of the products themselves (Fig. 7). After we had checked that all the communication was put in place properly at the start of the intervention, responsibility for in-store signage was transferred to the stores and no additional measures were taken to ensure that the signs remained visible or would be replaced if they went missing. This was done on purpose, to replicate what would happen when a national labeling system is put in place by the retailers.
The operational management of the study was handled by the French Fund for Food and Health (FFAS), an organization jointly created by the French Nutrition Institute (Institut Français pour la Nutrition) and the National Association of Food Industries (Association Nationale des Industries Alimentaires), with the aim of developing a partnership between the academic community and economic actors to better serve public health. The FFAS raised both private and public funding for the study, selected the company in charge of its implementation, and oversaw the coordination between all the parties involved.
Stickers were affixed to food products in four categories: fresh prepared foods (e.g., pizzas, quiches), pastries (e.g., croissants, brioches), breads (e.g., sliced breads, baguettes), and canned prepared meals (e.g., cooked beans, ravioli). These categories were selected because they are consumed regularly by a large percentage of shoppers and because it is relatively easy to place stickers on their packages. These are all processed or ultra-processed foods, classified in the third or fourth NOVA category (Monteiro et al. 2018), and therefore representative of most foods targeted by FOP nutrition labels (nutrition labelling is not mandatory for minimally processed foods, like fruits and vegetables). As we report in the data section, these four categories provide a good cross section of products in terms of nutritional quality compared to the typical French diet.
Participation by manufacturers was voluntary, as per E.U. regulations. Allowing firms to decline to participate matches the regulations of most markets but means that our results are not illustrative of a case of mandatory enforcement. A large majority of manufacturers (29 firms) and all three retailers agreed to participate, leading to a total of 1266 tested products.
More than 1.9 million stickers were affixed. Sixty research assistants printed the labels on site by directly scanning the product's barcode and using an open-source web-based application designed for this study (https://getiq.inra.fr). Daily quality checks were carried out by supermarket personnel, with additional checks performed bi-weekly by 24 trained dieticians. Seven independent professional auditors oversaw the quality control of the intervention.

Data
Retailers provided purchase data from their loyalty cardholders for two time periods: the ten weeks during which the study was implemented in 2016, and the corresponding ten weeks in the previous year, 2015. Because labels were only affixed on some products from the fifth week onwards, we restricted the analysis to weeks 5 through 10 (for both years). This allowed us to examine the effects of labeling after the initial curiosity and trial phase was over. After removing transactions where product information was missing, our data set included 1,668,301 purchases of 3586 products, of which 1266 were labeled products, made by 171,827 consumers.
We assessed the nutritional quality of purchased food using the Ofcom nutrient profiling score developed by the British Food Standards Agency (FSA). The FSA score allocates positive points, between 0 and 10, according to the amount of four 'A' components (energy, sugars, saturated fat, and sodium) per 100 g or 100 mL (UK Department of Health 2011), and negative points, between −5 and 0, according to the amount of three 'C' components (percentage of fruits, vegetables, and nuts; fibers; and protein). The FSA score can range from −15 (best) to +40 (worst nutritional quality). We chose the FSA score as a measure of nutritional quality because it is one of the most used in the scientific literature. It is also the only system that has been validated in a French context by, among others, prospective associations with the onset of metabolic syndrome, cancer, and cardiovascular risks (Labonté et al. 2018).
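To make the scoring range concrete, the following minimal sketch sums component points as described above. The nutrient thresholds that convert amounts per 100 g into points are not given in this section, so the points are taken as inputs; the component keys and function name are hypothetical, and the full model may include further rules that this sketch does not capture.

```python
# Component names follow the text; the dictionary keys are illustrative.
A_COMPONENTS = ("energy", "sugars", "saturated_fat", "sodium")
C_COMPONENTS = ("fruit_veg_nuts", "fiber", "protein")


def fsa_score(a_points, c_points):
    """Sum 'A' points (0..10 each) and 'C' points (-5..0 each).

    Simplified aggregation only: inputs are already-assigned points,
    not nutrient amounts. Lower totals mean better nutritional quality.
    """
    for name in A_COMPONENTS:
        if not 0 <= a_points[name] <= 10:
            raise ValueError(f"'A' points for {name} must be in [0, 10]")
    for name in C_COMPONENTS:
        if not -5 <= c_points[name] <= 0:
            raise ValueError(f"'C' points for {name} must be in [-5, 0]")
    total_a = sum(a_points[name] for name in A_COMPONENTS)
    total_c = sum(c_points[name] for name in C_COMPONENTS)
    return total_a + total_c


# Extremes implied by the point ranges: 4 x 0 + 3 x (-5) and 4 x 10 + 3 x 0.
best = fsa_score(dict.fromkeys(A_COMPONENTS, 0), dict.fromkeys(C_COMPONENTS, -5))
worst = fsa_score(dict.fromkeys(A_COMPONENTS, 10), dict.fromkeys(C_COMPONENTS, 0))
```

The two extreme cases reproduce the stated bounds of −15 and +40.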
The FSA score can be computed using nutrient and energy information that is mandatory on food packages, but also requires additional information on the amount of fruits, vegetables, and fiber, which we obtained from the manufacturers or, when missing, using standard recipes for the category. We computed the FSA score for the 1266 products participating in the RCT, but it was not possible to compute the FSA score for most of the unlabeled products.
As shown in Table 1, canned prepared foods and breads have the best nutritional quality (lowest FSA score), pastries have the worst, and fresh prepared foods are close to the average of the four categories. Overall, the mean FSA score of the chosen products (5.62, when computed across both years) was slightly better than the average FSA score of the French diet, estimated to be 7.67 for men and 7.47 for women. However, the four categories encompassed a large range of FSA scores, from low (high nutrition quality) for canned prepared meals and breads (respectively, −0.10 and 0.27), to medium for fresh prepared foods (6.1), to high for pastries (14.4). The four categories also differed in terms of the variance of FSA scores across the products of the category, which was low for breads (1.8), moderate for pastries and canned prepared meals (respectively, 3.1 and 3.4), and high for fresh prepared foods (7.23). The average calorie content per product is about 1000 and the average weight of each product is about 400 g.
The overall FSA score ranges between 5.41 in 2015 in the SENS stores and 5.75 in the control stores. Table 1 shows that the largest declines (which means the largest improvements) from 2015 to 2016 happened in the fresh prepared foods category, with the largest decline emerging in the Nutri-Score stores, from 6.36 to 5.79.

Outcome measures
We assessed changes in the nutritional quality of food purchased at the purchase incidence level (whether or not to buy foods with high, medium, low, or unlabeled nutritional quality) and at the purchase quantity level (the nutritional quality of the basket of foods purchased, weighted by calories).
To compute the dependent variable of the purchase incidence models, we categorized products as unlabeled or labeled, and further divided labeled products into nutrition terciles based on the FSA score in each category. We then computed the change in the number of purchases over the weeks in our data:

$$N_{icb,2016} - N_{icb,2015} = \sum_{t} \sum_{j \in b} X_{ijcbt,2016} - \sum_{t} \sum_{j \in b} X_{ijcbt,2015} \qquad (1)$$

In equation (1), $X_{ijcbt,2016}$ and $X_{ijcbt,2015}$ take the value of 1 if shopper i purchased product j of category c in bracket b (one of the three terciles or unlabeled) in week t, and 0 otherwise. Hence, $N_{icb,2016}$ and $N_{icb,2015}$ measure the number of purchases for each individual i over all products in each bracket b, and so the difference $N_{icb,2016} - N_{icb,2015}$ evaluates whether that particular shopper increased or decreased purchases of products in the various brackets in product category c. By computing the difference between years, we are effectively controlling for individual fixed effects (each shopper acts as her own control). For robustness, we also performed our analysis with the number of units purchased (i.e., taking into account whether consumers bought more than one unit of each product) instead of the number of purchases. The results are substantively similar to the ones presented here.
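The purchase incidence measure can be sketched as follows for a single product category. This is an illustration rather than the analysis code: the tuple layout of the purchase records, the function names, and the rank-based rule used to cut terciles are all assumptions.

```python
from collections import Counter


def tercile_brackets(fsa_by_product):
    """Split labeled products into thirds by FSA score (lowest = best).

    Rank-based cut (an assumption): the best third is tagged 'high'
    nutritional quality, then 'medium', then 'low'.
    """
    ranked = sorted(fsa_by_product, key=fsa_by_product.get)
    n = len(ranked)
    names = ("high", "medium", "low")
    return {p: names[rank * 3 // n] for rank, p in enumerate(ranked)}


def purchase_count_changes(purchases, brackets):
    """Return N_2016 - N_2015 per (shopper, bracket).

    `purchases` holds (shopper, product, year, week) tuples. Duplicates
    are collapsed so each shopper-product-week counts once, mirroring
    the 0/1 indicator in the text. Products without a bracket are
    treated as 'unlabeled'.
    """
    counts = Counter()
    for shopper, product, year, week in set(purchases):
        counts[(shopper, brackets.get(product, "unlabeled"), year)] += 1
    pairs = {(s, b) for s, b, _ in counts}
    return {(s, b): counts[(s, b, 2016)] - counts[(s, b, 2015)]
            for s, b in pairs}


# Toy data: three labeled products and one unlabeled product ("x9").
fsa = {"p1": -1, "p2": 3, "p3": 12}
brackets = tercile_brackets(fsa)
purchases = [
    ("s1", "p1", 2015, 5),
    ("s1", "p1", 2016, 5),
    ("s1", "p1", 2016, 6),
    ("s1", "p3", 2015, 7),
    ("s1", "x9", 2016, 8),
]
changes = purchase_count_changes(purchases, brackets)
```

In this toy example, shopper s1 buys one more high-quality item, one fewer low-quality item, and one more unlabeled item in 2016 than in 2015.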
By categorizing labeled products into three nutrition tiers, as opposed to just two, we test whether nutrition labels have the intended effect of increasing the purchases of products with a higher nutritional quality than average in the category and of decreasing the purchases of products with a lower nutritional quality than average. This allows us to test for unintended purchase consequences of labeling, such as promoting foods with low nutritional quality. This is important because even if an unintended positive effect on the purchases of low nutritional foods is more than compensated by a positive effect on high nutrition foods, one of the first principles of labeling is that it should do no harm. By incorporating unlabeled products as a fourth tier, the analysis provides information for manufacturers considering adopting each specific FOP labeling system. Studying unlabeled products also allows us to test whether the mere presence of FOP labels in the category changes the consumer's decision processes in the category rather than just providing information about the nutritional quality of labeled foods.
As a second outcome measure, like Crosetto et al. (2020), we computed the nutritional quality of the average basket of foods purchased in each product category. Given the FSA score for each labeled product j purchased in category c, $\mathrm{FSA}_{jc}$, we compute the difference between the two years in the quantity-weighted average FSA score of labeled purchased goods for each shopper i. In other words, our unit of analysis is the individual-specific basket of labeled goods purchased in each category. Formally, for shopper i who purchased quantity $q_{ijc}$ of product j of product category c, the difference in the weighted average (across products) FSA score of the basket of goods in that category is:

$$\mathrm{FSA}_{i,2016} - \mathrm{FSA}_{i,2015} = \frac{\sum_{j} q_{ijc,2016}\, \mathrm{FSA}_{jc}}{\sum_{j} q_{ijc,2016}} - \frac{\sum_{j} q_{ijc,2015}\, \mathrm{FSA}_{jc}}{\sum_{j} q_{ijc,2015}} \tag{2}$$

Following earlier studies, the weighting quantity $q_{ijc}$ is measured in calories (as a robustness check, the analysis was also done using the quantity in kilograms as weights, with substantively similar results). In terms of interpretation, the dependent variable $\mathrm{FSA}_{i,2016} - \mathrm{FSA}_{i,2015}$ measures the change in the average FSA score across the shopping basket of products purchased. Because a low FSA score indicates a high nutrition quality, a reduction in the FSA score is an improvement in the nutrition quality of the basket of foods purchased.
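As a rough illustration of the basket measure, the following Python sketch computes a calorie-weighted average FSA score for one shopper in one category and takes the difference between years. The product names, FSA scores, and calorie quantities are hypothetical, and the sketch assumes that each product's FSA score is the same in both years.

```python
def basket_fsa(calories_by_product, fsa_scores):
    """Calorie-weighted mean FSA score of one shopper's labeled basket."""
    total = sum(calories_by_product.values())
    if total == 0:
        # The measure is undefined for shoppers with no labeled purchases.
        raise ValueError("empty basket: weighted FSA score is undefined")
    weighted = sum(q * fsa_scores[p] for p, q in calories_by_product.items())
    return weighted / total


# HYPOTHETICAL products in a single category (lower FSA = healthier).
fsa_scores = {"salad": 2.0, "quiche": 10.0}

# HYPOTHETICAL calories bought by one shopper in each period.
calories_2015 = {"salad": 1000.0, "quiche": 1000.0}
calories_2016 = {"salad": 1500.0, "quiche": 500.0}

fsa_2015 = basket_fsa(calories_2015, fsa_scores)  # (2000 + 10000) / 2000
fsa_2016 = basket_fsa(calories_2016, fsa_scores)  # (3000 + 5000) / 2000
change = fsa_2016 - fsa_2015  # negative = healthier basket
```

The explicit error on an empty basket mirrors the restriction that the measure can only be computed for shoppers who bought labeled products in the category in both periods.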
The main advantage of the basket measure is that it summarizes the effects of the label on shoppers' buying of high-, medium-, and low-nutritional-quality products into one number, while considering the quantity bought. For this reason, it is the measure of choice in epidemiological studies of the effects of the nutritional quality of food purchased on health (e.g., Donnenfeld et al. 2015). The drawback is that it can only be computed for consumers who bought labeled products in the category both before and during the intervention. It cannot tell whether consumers entered or exited the category. In addition, it cannot tell whether an improvement in average nutritional quality is driven by increasing the purchase of high nutritional quality products, reducing the purchase of products of low nutritional quality, or both. Therefore, we used both measures.

Econometric specifications
We estimate the average treatment effect (ATE) of nutrition labels using a difference-in-differences (DiD) approach, which is commonly used for policy evaluation (Bertrand et al. 2004). In the purchase incidence regressions, we include as covariates the difference in the number of store visits and prices between 2016 and 2015, and fixed effects for each tercile and unlabeled products, and food category. We include, for both regressions, consumers with at least one store visit in any week during the 2015 and 2016 periods, where a visit is inferred from having at least one purchase in any of the four categories. The coefficients of interest are those of the interactions between the variables coding the intervention and the nutritional quality tier or unlabeled product. For the purchase incidence model, the difference-in-differences equation is:

$$N_{icb,2016} - N_{icb,2015} = \alpha_c + \beta' \left( X_{icb,2016} - X_{icb,2015} \right) + \gamma' Z_{iLb} + \varepsilon_{icb} \tag{3}$$

where, as described earlier, the dependent variable $N_{icb,2016} - N_{icb,2015}$ measures whether the specific consumer i increased or decreased their purchases of products in the various brackets in category c. On the right-hand side, $\alpha_c$ includes an intercept and fixed effects for each product tier and category. The term $X_{icb,2016} - X_{icb,2015}$ is the difference between the two years in observed characteristics, such as the difference in the average price paid by the individual and in the number of visits to the store. The variable $Z_{iLb}$ is the interaction term between label system L faced by individual i and bracket b; $\beta$ and $\gamma$ are the coefficient vectors to be estimated and $\varepsilon_{icb}$ is the error term. We used a dummy coding to examine the effect of each label compared to the control condition. Each observation is assigned to only one condition, as each consumer shops at one unique store from the set of stores in our data. All standard errors are clustered by store.
For the nutritional basket analysis, the equation is the following:

$$\mathrm{FSA}_{i,2016} - \mathrm{FSA}_{i,2015} = \alpha_c + \beta' \left( X_{ic,2016} - X_{ic,2015} \right) + \gamma' Z_{iL} + \varepsilon_{ic} \tag{4}$$

As described earlier, the dependent variable $\mathrm{FSA}_{i,2016} - \mathrm{FSA}_{i,2015}$ measures the change in the average FSA score across the shopping basket of products purchased. The term $\alpha_c$ includes an intercept and category-specific intercepts, while $X_{ic,2016} - X_{ic,2015}$ includes the difference between years in the number of purchases in category c for individual i and the difference in the average prices of individual baskets. The term $Z_{iL}$ includes the variables coded for the condition faced by individual i, where L is the FOP label system; $\beta$ and $\gamma$ are the coefficient vectors to be estimated and $\varepsilon_{ic}$ is the error term. We computed the basket's FSA for all consumers with at least one purchase of a labeled product in the category in both years. As for the purchase incidence analyses, all standard errors are clustered by store.
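The logic of the difference-in-differences estimate can be illustrated with a deliberately stripped-down Python sketch. It assumes no covariates, no category or tier fixed effects, and no store clustering, so it is not the specification estimated in the paper. Under those simplifying assumptions, regressing the year-over-year change on label dummies gives, for each label, the mean change in that label's stores minus the mean change in control stores. All numbers below are hypothetical.

```python
from statistics import mean


def did_estimates(changes, control="control"):
    """Simplified DiD: mean first difference per label minus the control mean.

    `changes` is a list of (condition, outcome_2016 - outcome_2015) pairs,
    one per shopper. Covariates and clustered standard errors are omitted.
    """
    by_condition = {}
    for condition, delta in changes:
        by_condition.setdefault(condition, []).append(delta)
    baseline = mean(by_condition[control])
    return {
        condition: mean(values) - baseline
        for condition, values in by_condition.items()
        if condition != control
    }


# HYPOTHETICAL changes in basket FSA score, one per shopper.
changes = [
    ("control", 0.1), ("control", -0.1),
    ("Nutri-Score", -0.2), ("Nutri-Score", -0.4),
    ("SENS", 0.0), ("SENS", -0.2),
]

effects = did_estimates(changes)
```

Because each shopper's outcome is already differenced across years, any time-invariant shopper characteristic drops out, and subtracting the control-store mean removes trends common to all stores.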

Purchase incidence analyses
The estimates and standard errors are reported in Fig. 2 and Table 2. Figure 2 shows the estimated effect of labeling on the number of purchases of products in the first (in green), second (in amber), and third (in red) nutrition quality tier as well as for unlabeled products (in white).
Looking at each label separately, but for all product categories together, the second column of Table 2 shows that Nutri-Score, and to a lesser degree Nutri-Couleurs, significantly increased the purchases of products with high nutritional quality, with coefficients of 0.021 and 0.012 respectively. Given that consumers bought on average 0.148 high-nutrition products in 2015, these estimates correspond to an increase of 14.4% and 8.0% in purchases of products in the healthiest tier for Nutri-Score and Nutri-Couleurs, respectively. In addition, the effects of Nutri-Score were appropriately ordered, decreasing from high- to medium- to low-nutritional-quality products. In comparison, Nutri-Couleurs had the undesirable effect of slightly increasing the purchase of low-quality products (directionally, not statistically significant) and SENS had the undesirable effect of reducing the purchases of products of medium nutritional quality more than those of products of low nutritional quality. In terms of performance, Nutri-Score is best, followed by Nutri-Couleurs, with SENS and Nutri-Repère having essentially no beneficial effect.
The last four columns of Table 2 show the results of the purchase incidence analyses for each of the four categories separately. They show that the overall results are mostly driven by the fresh prepared foods category, which has the widest variance of FSA scores across products. Looking at each category separately, the results are consistent with those reported for all products together (second column in Table 2): Nutri-Score is the only label that always has a positive impact on the purchase of products with the highest nutrition quality. Unsurprisingly, given the lower number of observations, the reliability of the estimates is lower when each category is estimated separately. We return to category differences when discussing the results of the shopping basket analyses.
In a final analysis, we examined whether the nutrition labels led shoppers to change the total number of purchases by estimating the same model as in Equation 3 but without the four brackets b. Nutri-Score led to a small but statistically insignificant increase in the total number of purchases (B = .006, t = 1.52, p = .13), whereas the other three labels all led to small and statistically insignificant decreases in the total number of purchases (for Nutri-Couleurs: B = −.004, t = −.85, p = .40; for Nutri-Repère: B = −.003, t = −.63, p = .53; for SENS: B = −.008, t = −1.82, p = .07).

Nutritional basket analyses
When considering all products together, the coefficients of the four labels are all in the expected (negative) direction (second column of Table 3) but are all statistically insignificant at the 5% level. The most reliable decrease is for Nutri-Score, with a reduction in FSA score of 0.142 points (t = −1.66, p = 0.097), confirming the results from the purchase incidence models that favored this system over others. The reduction of statistical significance of the effects, when compared to the purchase incidence analyses, was expected given the reduction in the number of observations from four per shopper (the changes in the number of purchases of products of high, medium, low, and unlabeled nutritional quality) to just one (the change in the average nutritional quality of the shopping basket score). Another contributing factor is that the FSA score is averaged over all tiers and that purchase incidence analyses found a positive effect on the highest tier of products but no effect on the lowest tier. The ordering of the four systems is robust across both analyses: as for the increase in the purchases of high-nutritional-quality products shown in Table 2, Nutri-Score is best, followed by Nutri-Couleurs, while SENS and Nutri-Repère have the weakest effects.
The diagonal of Table 4 shows the effect sizes associated with each label in the shopping basket analysis. We report Cohen's $f^2$, calculated based on the semi-partial correlation coefficients (Cohen 2013). All of these effect sizes can be labeled as "very small" according to Sawilowsky's (2009) classification. The lower triangle of Table 4 shows the t-values of pairwise comparison tests for the four labels. These t-values are obtained by estimating the same model as in equation 4 but with a different coding of the treatment conditions in order to measure the difference between two treatment groups, as opposed to the difference between the control group and one treatment group. The upper triangle of Table 4 provides similar t-values but comparing the effects of labels on the purchases of high nutrition products in the purchase incidence analysis. These results show that the only statistically significant differences between the labels are those involving Nutri-Score.

Fig. 2 Effects of labels on purchases of labeled products in the high (green), medium (amber), and low (red) nutrition tier and on purchases of unlabeled products (white). Note: *** p < .01, ** p < .05, * p < .10. An intercept, fixed effects for nutrition tercile 2, nutrition tercile 3 and unlabeled products, and changes in the number of purchases in the category and in average basket prices were also included in the regressions but are not shown here.
The last four columns of Table 3 show the shopping basket results separately for each product category. As for the purchase incidence analysis, the largest effects on the nutritional quality of baskets are observed for fresh prepared foods and are the weakest for breads, the categories with, respectively, the largest and lowest variance in nutritional quality across foods. To examine whether nutrition labels are more effective in categories with a higher variance in nutrition quality or in those with a higher mean nutrition quality, we computed the squared correlation between the four regression coefficients of a particular label (one per category) and the mean and standard deviation of the FSA score of that category before the intervention (in 2015, shown in Table 1).
The results of this analysis are shown for Nutri-Score in Fig. 3, which plots the regression coefficient (the estimated decrease in FSA score) as a function of the within-category standard deviation in FSA score (top panel) and the mean FSA score of the category (bottom panel). As can be seen from the linear trendline, the effectiveness of Nutri-Score increased linearly with the standard deviation in the nutritional quality of the product category ($R^2$ = 0.90). On the other hand, the effectiveness of Nutri-Score is unrelated to the average nutritional quality of the category: the trend is flat and mean FSA scores have low explanatory power ($R^2$ = 0.05).
The pattern observed for Nutri-Score was also observed for SENS ($R^2_{SD}$ = 0.68 vs. $R^2_{M}$ = 0.16) and for Nutri-Couleurs ($R^2_{SD}$ = 0.47 vs. $R^2_{M}$ = 0.28). This means that the variance, rather than the mean, of nutritional quality is linked to the performance of the three most effective labels. Interestingly, the opposite pattern held for Nutri-Repère, whose limited effects were less strongly associated with the variance than with the mean category nutrition quality ($R^2_{SD}$ = 0.07 vs. $R^2_{M}$ = 0.76). One explanation may be that the 5 nutrient-specific and monocolored bar charts in Nutri-Repère make it hard to distinguish between foods with high and low nutritional quality within a category, but that taller bars overall signal to shoppers that the category, as a whole, has a low nutritional quality.
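The squared-correlation comparison can be reproduced in outline with the Python sketch below. The category means and dispersion values are those quoted in the data description; the per-category label coefficients are hypothetical placeholders, because the Table 3 estimates are not reproduced in this section. The output therefore illustrates the computation only and is not a replication of the reported $R^2$ values.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Category order: breads, pastries, canned prepared meals, fresh prepared foods.
category_sd = [1.8, 3.1, 3.4, 7.23]        # dispersion values quoted in the text
category_mean = [0.27, 14.4, -0.10, 6.1]   # mean FSA scores quoted in the text

# HYPOTHETICAL per-category coefficients for one label (not the paper's values).
label_coefs = [-0.02, -0.10, -0.05, -0.40]

r2_sd = r_squared(category_sd, label_coefs)      # link with dispersion
r2_mean = r_squared(category_mean, label_coefs)  # link with mean quality
```

With only four categories per label, each $R^2$ rests on four points, so the comparison is descriptive rather than a formal test.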

Shopper survey results
The goal of the survey was to shed some light on the differences in the effectiveness of the nutrition labels, particularly between the most effective (Nutri-Score) and the runner-up (Nutri-Couleurs), by assessing their effects on shopper behaviors that are not captured by transaction data, such as their ability to attract attention and to identify the healthiest options. To achieve this goal, we conducted face-to-face shopper surveys in two waves, one in early September before the start of the experiment, the second in late November, while the labels were in place. We also sent an online questionnaire in late December using the email addresses collected during the two waves but chose not to use it because of concerns about the low response rate (13.6%), potential contamination from earlier responses, and its overlap with the holiday season.
We randomly selected 20 stores among the 60 that participated in the study to obtain 4 stores per experimental group and a balance across the three retail chains, geographical location, and income level of the store's catchment area. As with the transaction data, we oversampled shoppers in the control group, aiming for 95 respondents in each of the 4 control stores and 70 in each of the test stores, for a total of 1500 different respondents per wave. The survey was conducted on our behalf by CREDOC, a market-research company. Trained interviewers stationed themselves in one aisle of the tested product categories and moved from aisle to aisle during the day. They approached shoppers who appeared to be considering purchasing products in those categories. The response rate was 54.4% in the first wave, yielding 1844 observations, and 52.2% in the second wave, yielding 1737 observations. The interviewers asked respondents for their age and estimated it when respondents declined to answer. The data were weighted by gender and age group in order to match the respondent samples with the distribution of the entire sample of shoppers approached in the stores (for more details about the procedure, see CREDOC 2017).

Table 3 note: *** p < .01, ** p < .05, * p < .10. An intercept and changes in the number of purchases in the category and in average basket prices were also included in the regressions. Negative coefficients represent an improvement in nutritional quality (i.e., lower FSA scores).

Table 4 note: Numbers in the diagonal are the effect sizes (Cohen's $f^2$) from the basket analyses. Numbers in the off diagonals are the t-values of pairwise comparisons of the regression coefficients of each label for the basket analyses (lower triangle) or for the purchase of high nutrition products (upper triangle). ** p < .05.
To understand the differences between the FOP labels, two of the dimensions examined by the survey are particularly illuminating: 1) their ability to attract attention, 2) their ability to help shoppers rank foods by their nutritional quality (other results are available online, CREDOC 2017). To measure the capacity of each FOP nutrition label to attract attention, participants in the second wave were asked: "Did you see that some products on the shelf had a nutrition label on the front of their packages?", which is a standard measure when eye-tracking information is not available (van Herpen and van Trijp 2011). The two summary systems were noticed by more people (48% and 46% of the respondents for Nutri-Score and SENS) than the two analytic systems (respectively 31% and 37% for Nutri-Couleurs and Nutri-Repère). A logistic regression of attention on label system controlling for retail chain and sociodemographic variables showed that the difference between Nutri-Score and SENS was not statistically significant, but that Nutri-Score was more visible than both Nutri-Couleurs and Nutri-Repère at the 1% level.

Fig. 3 Effects of Nutri-Score by category nutritional quality: standard deviation (top) vs. mean (bottom)
The key objective of FOP labels is to help people identify the nutritional quality of food products. To measure the performance of the four systems on this key criterion, we first selected three food products in each of the four categories tested, one per nutrition tercile, showed them side by side to the respondents, and asked them to rank them by nutritional quality. This approach is an extension to three products of the classic nutrition comprehension test which asks people to say which of two products presented side by side is nutritionally better (Gorski Findling et al. 2018; Roberto et al. 2012).
To control for the effects of trends in nutrition comprehension or seasonality, we used a difference-in-differences approach, just like for the main study. As mentioned before, we interviewed people who were shopping in control stores, where no FOP label was ever available, as well as in test stores, where FOP labels were absent in the first wave of the survey but present in the second. In the first wave of the survey, as well as in both survey waves for shoppers in control stores, interviewers invited respondents to look at the nutrition fact panel on the back of the packages to help them rank the products. In the second wave, people shopping in test stores were invited to look at the FOP labels when ranking the products. By comparing the changes in the percentage of test-store shoppers who correctly ranked the three products in the first and second waves, we measure whether the FOP systems improved people's ability to identify healthier products. By comparing these changes to those of the respondents who shopped in the control stores, we control for the effects of trends.
The first two rows of Table 5 show the percentage of respondents who correctly ranked the three products by nutritional quality in each wave and experimental condition. The change in that percentage between waves is provided in the third row (first difference). Accuracy remained stable in control stores where respondents could only rely on back-of-pack nutrition information in both waves. In the Nutri-Score and SENS stores, accuracy improved between the first and the second wave, suggesting that these labels helped shoppers identify the relative nutritional quality of products. In the Nutri-Couleurs and Nutri-Repère stores, however, accuracy decreased between the first and second waves, suggesting that the two analytic labels confused shoppers. All the results were robust and statistically significant at the 5% level when looking at the double differences shown in the bottom row. The improvement driven by Nutri-Score was the largest and was twice as large as the improvement driven by the second-best system, SENS.
Overall, the in-store surveys suggest that the superior performance of Nutri-Score stems from its ability to draw as much attention as SENS, and significantly more than the two analytic systems, and from a superior ability (compared with the analytic systems and with SENS) to help shoppers rank products by nutritional quality. The survey results also suggest that the null effect of Nutri-Score on the purchase of products of low nutritional quality does not arise from shoppers' confusion about their relative nutritional quality.

Conclusion
Although FOP labels have been investigated at length in laboratory experiments and in hypothetical settings, there was still a lot of uncertainty about their ability to improve the nutritional quality of supermarket food purchases over a significant time period and across a wide variety of brands and product categories. This study also allowed us to measure whether FOP labels primarily affect the purchase of brands with a label signaling high or low nutrition quality, their effect on brands without any FOP label, and whether the effects differ across product categories. Comparison of the four labels was used to determine which nutrition label should receive government approval.
The key overall conclusion is that, compared to the encouraging findings of laboratory-based studies with exhaustive labeling implementation in all products, FOP nutrition labels had disappointingly modest effects on the nutritional quality of the foods purchased in four categories in real-life grocery shopping conditions. Despite the controls and large number of observations, their impact on the nutritional quality of the shopping basket of labeled products was not statistically significant at the customary 5% level. Although they slightly increased purchases of the best tercile of products in terms of nutritional quality, they slightly decreased purchases of products in the second tercile and had no effect on products in the lowest tercile. They also had insignificant effects on the purchases of unlabeled products.
A second conclusion is that Nutri-Score is the best nutrition label, closely followed by Nutri-Couleurs, with SENS and Nutri-Repère significantly behind. Nutri-Score had the largest and the only statistically significant (at 10%) improvement in the nutritional quality of the basket of labeled products purchased, thanks to its positive impact on the purchase of high-nutritional-quality products, followed by a monotonically decreasing effect on the purchases of products of medium and low nutritional quality. This last result is important given that other labels had the undesirable property of having a larger influence on products of medium nutritional quality than on products of low nutritional quality. The shopper survey suggests that this happened because Nutri-Score is among the two most visible labels and provides the easiest way to gauge the relative nutritional quality of the food. The effectiveness of Nutri-Score was unrelated to the mean nutritional quality of the four product categories studied, but increased with the variance in the nutritional quality of the category.

Effect sizes of FOP nutrition labels in the field and in the lab
Compared to what could have been expected based on the results of recent laboratory studies, the effects of even the best nutrition label, Nutri-Score, were disappointingly small. Nutri-Score led to a 14.4% increase in purchases of high-nutritional-quality products (an increase of 0.021 over an average of 0.148 purchases of these products). In terms of the nutrition quality of the overall basket of labeled products purchased, the improvement was 0.142 FSA points, a 2.5% improvement over the average FSA score of 5.61. Given that the FSA score has a standard deviation of 7.31, this corresponds to a standardized mean difference (Cohen's d) of only 0.02. In addition, the effects of Nutri-Score, like those of the other three labels, were mostly driven by the fresh prepared food category, the category with the widest variance in nutrition quality. Nutri-Score did not reliably improve the nutritional quality of the food basket in the other three categories, in which foods do not differ as much in terms of nutrition quality.
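The three summary figures in this paragraph follow from simple ratios, which the short Python sketch below recomputes from the rounded values quoted in the text. The variable names are illustrative. Because the published coefficient is rounded to three decimals, the first ratio comes out near 14.2% rather than exactly 14.4%, presumably a rounding difference.

```python
# Rounded values quoted in the text (variable names are illustrative).
coef_high_tier = 0.021       # Nutri-Score effect on high-tier purchases
baseline_high_tier = 0.148   # average high-tier purchases per shopper
basket_change = 0.142        # reduction in basket FSA score
mean_fsa = 5.61              # average basket FSA score
sd_fsa = 7.31                # standard deviation of the FSA score

# Relative increase in purchases of high-nutritional-quality products.
relative_increase = coef_high_tier / baseline_high_tier

# Relative improvement in the basket's FSA score.
relative_improvement = basket_change / mean_fsa

# Standardized mean difference (Cohen's d): change divided by the SD.
cohens_d = basket_change / sd_fsa

print(f"high-tier purchases: +{relative_increase:.1%}")
print(f"basket FSA score:    -{relative_improvement:.1%}")
print(f"Cohen's d:            {cohens_d:.2f}")
```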
The Crosetto et al. (2020) study described earlier provides the best lab-based comparison to our results because it collected incentive-compatible purchase decisions, used the same FSA metric to measure nutritional quality, tested the same four labels (implemented on all products) with a similar population, and used the same within-subjects difference-in-differences architecture. The good news is that the correlation between the effectiveness estimates (changes in FSA scores of shopping baskets compared to the control condition) of the two studies is 0.82. Nutri-Score was the winner in both studies, followed by Nutri-Couleurs (but the order of SENS and Nutri-Repère was different). In terms of effect sizes, however, there is a 17 to 1 difference between their estimates and ours. For Nutri-Score, their estimate of the change in basket FSA score is −2.65, which is 18.6 times larger than our estimate of −0.142. The gap was nearly as wide for Nutri-Couleurs (−1.40 vs. −0.12, 12 times larger) and SENS (−0.81 vs. −0.06, 13 times larger) and even bigger for Nutri-Repère (−1.02 vs. −0.024, 43 times larger).
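The lab-versus-field comparison can be checked with the Python sketch below, which uses the rounded estimates quoted above. Two assumptions are worth flagging: the Nutri-Repère field estimate is taken as −0.024, the magnitude implied by the reported 43-fold ratio, and the overall "17 to 1" figure is interpreted here as the ratio of summed lab effects to summed field effects, which is one reading that matches the reported number. Since the inputs are rounded, the recomputed ratios and correlation only approximate the published ones.

```python
from math import sqrt

# Rounded changes in basket FSA score relative to control, as quoted in the text.
lab = {"Nutri-Score": -2.65, "Nutri-Couleurs": -1.40, "SENS": -0.81, "Nutri-Repère": -1.02}
# ASSUMPTION: Nutri-Repère field estimate taken as -0.024 (see lead-in).
field = {"Nutri-Score": -0.142, "Nutri-Couleurs": -0.12, "SENS": -0.06, "Nutri-Repère": -0.024}

# Per-label ratio of lab effect to field effect.
ratios = {label: lab[label] / field[label] for label in lab}

# Overall ratio: summed lab effects over summed field effects.
overall_ratio = sum(lab.values()) / sum(field.values())


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


labels = list(lab)
correlation = pearson([lab[k] for k in labels], [field[k] for k in labels])

for label in labels:
    print(f"{label}: lab effect is {ratios[label]:.1f} times the field effect")
print(f"overall ratio: {overall_ratio:.1f}, correlation: {correlation:.2f}")
```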
The much lower effects that we observed in the field could be driven by any of the differences between the two studies. First, Crosetto et al. (2020) studied two consecutive purchase decisions whereas this study encompasses multiple purchase decisions over several weeks. It is possible that initial interest in healthier products may have decayed over time as people reverted to their habitual behaviors. Another important factor may be that their participants paid stronger attention to the nutrition labels because they had just seen the same catalogue without the labels minutes earlier and because each label was present on all the products, unlike in our study in which the participation of manufacturers was voluntary. Another important difference is that their study was a "framed natural experiment" as opposed to a "natural experiment" like ours, in which people are unaware that their choices are being studied (Levitt and List 2007). In the food domain, and especially for nutrition-related decisions, there can be a large difference between what people choose when they are being watched and when they are not (Herman et al. 2003; Holden et al. 2016; Vartanian 2015).
That nutrition labels have much smaller effects in the field than in the lab has important implications for public health. A recent paper (Egnell et al. 2019) used the estimates of the Crosetto et al. (2020) study to simulate the effects of nutrition labels on dietary intakes (using data from an observational study) and then, through a macro-simulation using the PRIME model (Scarborough et al. 2014), estimated the effects on mortality from diet-related non-communicable diseases. They estimated that approximately 7680 deaths (3.4% of all deaths from diet-related non-communicable diseases in France) would be avoidable if the Nutri-Score was implemented. Clearly, it would be useful to run these simulations with our much smaller estimates. On the other hand, even small changes in FSA scores can have significant health outcomes. Prospective studies have found that a one-point increase in the FSA score computed over a total diet is associated with a 16% higher risk of obesity among men and with a higher risk (hazard ratio = 1.08) of cardiovascular disease (Adriouch et al. 2016).
More generally, it would be important to compare the costs and benefits of FOP nutrition labeling with those of other cognitive-focused nudges, such as shelf display changes, but also of affect-focused nudges that motivate people to eat better by using pleasure or social pressure, and of behavior-focused nudges that directly change behaviors by changing portion sizes, for example. A recent meta-analysis has shown that the effectiveness of nudges increases significantly as their focus shifts from cognition to affect to behavior (Cadario and Chandon 2020).

Implications for future research
Because participation in the study by manufacturers was on a voluntary basis, our results must be interpreted in this context and cannot be easily extrapolated to contexts that would force all firms to adopt a label. We found no systematic effect of labeling on the purchases of unlabeled products, ruling out the possibility that consumers drew any statistically significant inferences, negative or positive, from the fact that a product did not carry a nutrition label. Under the plausible assumption that the nutritional quality of unlabeled products is below the category average (otherwise, their manufacturers would have chosen to participate in the study), it indicates that the mere presence of labels did not increase healthy eating motivation or attention to back-of-pack nutrition information. Relatedly, our study did not look at the impact of labeling systems on the decision of manufacturers to reformulate their products, which has been identified as one of the most relevant and cost-effective public health nutrition strategies (Dobbs et al. 2014), and has occurred in response to other labeling initiatives (Vyth et al. 2010).
The strengths of the trial include its large-scale real setting condition and rigorous RCT design, which are increasingly called for given that most true effect sizes are likely to be small and hard to estimate precisely in under-powered studies (Funder and Ozer 2019). However, it is not without limitations. The study was restricted to four food categories. Further research using multiple socioeconomic indicators is necessary to examine the effects of graphical labeling systems on specific populations and for a larger variety of foods, including less processed ones. Future research should also examine the underlying purchasing process in order to study whether nutrition labels operate by enhancing attention to nutrition information, reducing information-processing costs (Russo et al. 1986), or by increasing the importance attached to health over other benefits of food, like taste. To help manufacturers and retailers determine whether they should adopt these labels and reformulate their foods, it would also be useful to estimate the short- and long-term effects of these labels at the brand level on consumer price sensitivity and on product-level substitution patterns. Finally, given the distinctiveness of French attitudes to food, it remains to be seen whether these results would hold in different countries.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.