Introduction

Pictures are often used as stimuli in studies on visual perception, memory, and language. In the field of dialogue psychology, which focuses on the mental representations and the psychological processes which enable (at least) two people to reach mutual comprehension as they interact, various methodologies have been developed. They range from the experimental study of genuine dialogues (e.g., Clark & Krych, 2004; Clark & Wilkes-Gibbs, 1986; Isaacs & Clark, 1987; Knutsen & Le Bigot, 2012; Kraut et al., 2003; Roßnagel, 2000) to the use of the visual world paradigm in interactive settings (e.g., Barr & Keysar, 2002; Brown-Schmidt, 2009a, 2009b; Metzing & Brennan, 2003). Interestingly, a number of these studies use similar pictures, that is, tangram pictures (e.g., Bangerter et al., 2020; Bard et al., 2014; Branigan et al., 2011; Brennan & Clark, 1996; Clark & Wilkes-Gibbs, 1986; Fox Tree & Clark, 2013; Horton & Gerrig, 2002; Hupet et al., 1991; Hupet & Chantraine, 1992; Knutsen, Col, et al., 2018a; Knutsen et al., 2019; Knutsen, Ros, et al., 2018b; Knutsen & Le Bigot, 2018; Lysander & Horton, 2012; Murfitt & McAllister, 2001; Ntsame-Mba & Caron, 1999; Rogers et al., 2013; Rogers & Fay, 2016; Russell & Schober, 1999; Schober & Clark, 1989; Swets et al., 2013; Wilkes-Gibbs & Clark, 1992; Yoon & Brown-Schmidt, 2014, 2019) (see Fig. 1). Tangram pictures are abstract pictures which are usually made of seven smaller geometric black-and-white figures (one square, two big triangles, two small triangles, one medium triangle, and one parallelogram), although some studies have used more figures to create bigger tangram pictures (e.g., Bard et al., 2014).

Fig. 1

Example of a tangram picture made of seven figures

These pictures are used as stimuli in dialogue studies, as they may be perceived in different ways, leading dialogue partners to engage in discussion and negotiation to reach an agreement as to how each picture should be referred to. The content of the conversations is then analyzed to infer the nature of the processes involved in dialogue. For instance, a decrease in the number of words and speech turns when a tangram picture is referred to repeatedly may reflect the emergence of “common ground” (i.e., knowledge that two people share and are aware of sharing) between the two dialogue partners (Clark, 1996; Clark & Wilkes-Gibbs, 1986).

A potential issue with the use of tangram pictures as stimuli is that their characteristics are likely to affect the way in which people talk about them, and therefore the results of the studies they are used in. For instance, Hupet et al. (1991) found that picture codability (i.e., the ease with which each picture can be interpreted and verbally expressed) and discriminability (i.e., the ease with which a picture can be discriminated from others) made it more or less difficult for participants to reach an agreement as to how to name the pictures. Other characteristics may also play a role. For instance, some tangram pictures may be visually simpler than others. Likewise, some tangram pictures are perceived as more “consensual” than others; that is, many people may come up with the same label to refer to them, whereas other pictures may be associated with several different labels. However, the characteristics of tangram pictures are seldom controlled for in dialogue studies.

Murfitt and McAllister (2001) controlled their pictures for codability and discriminability. Swets et al. (2013) selected their tangram pictures based on a norming pre-study in which they assessed the mean naming time for each picture. Knutsen, Ros, et al. (2018) attempted to control for label consensus by considering the frequency with which each label was used in their study in a post hoc control. Nonetheless, it seems that in most studies, tangram pictures are chosen randomly, based only on the category they belong to (e.g., human-like pictures, animals or objects). This is problematic because, as mentioned above, most of the dialogue studies which involve tangram pictures base their conclusions on the analysis of the characteristics of the participants’ speech, but these characteristics may also depend on the features of the tangram pictures under discussion. For instance, the decrease over time in the number of words needed to repeatedly describe a referent may reflect not only common ground construction, but also the ease with which the picture may be described, due for instance to its familiarity or its concreteness. In other words, without strict control of the stimuli used, it is difficult to determine which conclusions can be drawn from the study of dialogues about tangram pictures.

In sum, although tangram pictures are often used in dialogue research, little is known regarding the characteristics of these pictures or the way in which these characteristics may affect the interaction between participants. The current study aimed to examine several variables associated with tangram pictures in order to enable researchers to control for them in their experiments. We specifically focused on variables related to the visual properties of the pictures, the way they are represented conceptually, and the ease with which the label used to refer to each picture may be retrieved. Indeed, picture-naming is known to involve various cognitive processes, such as converting the visual stimulus into a conceptual representation, label/lexical retrieval, and lexical production (Dell et al., 1997). Variables that may affect each of these processing steps must thus be identified and controlled for.

Snodgrass and Vanderwart (1980) were the first to collect data on four distinct characteristics of a set of pictures of common objects (black-and-white line drawings): name agreement, image agreement, familiarity, and visual complexity. Name agreement reflects the extent to which participants use the same label to refer to a given picture. Image agreement is defined as the extent to which the picture’s appearance is representative of the mental representation that participants associate with the picture’s label. Familiarity refers to the extent to which participants are in contact with the picture in everyday life. Visual complexity is defined as the amount of visual detail in the picture (e.g., the number of lines in the drawing). Following Snodgrass and Vanderwart’s (1980) study, many studies involving picture-naming tasks were conducted in a wide variety of languages (e.g., in French, Alario & Ferrand, 1999, and Bonin et al., 2003; in Spanish, Duñabeitia et al., 2018, and Manoiloff et al., 2010; in Greek, Dimitropoulou et al., 2009; in Russian, Tsaparina et al., 2011; in Persian, Ghasisin et al., 2014) and in various populations including children and younger and older adults (e.g., Cannard et al., 2005; Yoon et al., 2004). Normative data on picture characteristics usually include additional variables such as image variability and age of acquisition (AoA). Image variability reflects whether the label used to refer to a picture is related to a small or a large number of different images. AoA is the age at which participants believe that they learned the label and is one of the main determinants of the speed of lexical retrieval and production (e.g., Cortese & Khanna, 2007).

To constitute a normative database for tangram pictures, we focused on the same variables as in other studies involving picture-naming tasks: name agreement, image agreement, familiarity, visual complexity, image variability, and AoA (see Table 1). We adapted the instructions used in name agreement tasks to obtain labels which were not necessarily isolated words, as participants in dialogue studies may use simple (e.g., “the guy”) or more complex referential expressions to refer to tangram pictures (e.g., “the guy walking and wearing a hat”). Concreteness, which is known to influence the processes involved in lexical production (e.g., Hanley et al., 2013), was also measured in this study. Most studies on concreteness focus on word concreteness (e.g., a participant is shown a word and is asked to say how concrete the word is). In the current study, concreteness reflected the extent to which a picture was judged as representing a concrete concept as opposed to a more abstract entity. The remaining data concerned the use of the labels associated with the pictures in interactive dialogue settings, the use of alternative labels to name the pictures, and participants’ prior knowledge of the pictures outside the study.

Table 1 Summary statistics for all variables

Among the variables usually explored in normative studies, some are significantly correlated with each other. For instance, name agreement and image agreement are correlated negatively, suggesting that a stronger agreement between the picture’s label and its appearance is related to a smaller number of labels provided, owing to the selection of similar labels across participants (Alario & Ferrand, 1999; Bonin et al., 2003; Dimitropoulou et al., 2009; Ghasisin et al., 2014; Manoiloff et al., 2010; Snodgrass & Vanderwart, 1980; Tsaparina et al., 2011). Moreover, AoA is correlated positively with name agreement (Alario & Ferrand, 1999; Bonin et al., 2003; Dimitropoulou et al., 2009; Duñabeitia et al., 2018; Ghasisin et al., 2014; Manoiloff et al., 2010; Tsaparina et al., 2011) and negatively with familiarity and image variability (Alario & Ferrand, 1999; Bonin et al., 2003; Ghasisin et al., 2014; Manoiloff et al., 2010; Tsaparina et al., 2011). In other words, when labels are acquired early on, fewer labels are provided and the level of agreement across participants is high. In addition, when labels are acquired early on, the pictures are judged as more familiar and the labels provided are associated with a larger number of different images.

Importantly, due to the nature of the stimuli used, we also adapted the instructions usually used in studies focusing on picture-naming tasks (Alario & Ferrand, 1999; Bonin et al., 2003; Manoiloff et al., 2010; Snodgrass & Vanderwart, 1980; Tsaparina et al., 2011) by asking questions on the visual and conceptual properties of the pictures immediately after the participant provided a label. Each participant thus focused on their choice of a label before answering the questions on the visual and conceptual properties of the picture. In picture-naming studies involving line drawings (Alario & Ferrand, 1999; Bonin et al., 2003; Manoiloff et al., 2010; Snodgrass & Vanderwart, 1980; Tsaparina et al., 2011), participants are usually asked to make judgments about the visual and conceptual properties of the labels provided most frequently by another set of participants. Our approach made it possible to collect, from the same participants, both the labels they had chosen and their judgments on the visual and conceptual properties of each picture.

Despite these methodological differences, similar cognitive processes are likely to be involved in picture-naming tasks using both line drawings and tangram pictures (e.g., converting the visual stimulus into a conceptual representation, label/lexical retrieval, and lexical production).

When using tangram pictures, we thus expect to find the same correlations between the variables usually taken into account in picture-naming studies involving line drawings. However, tangram pictures may be labeled in several different ways; we therefore expect greater diversity in the labels provided for a given tangram picture than in those provided for line drawings. Another difference from line drawings is that tangram pictures contain less visual detail.

Methods

Participants

One hundred and ninety-three native French speakers took part in the study. They received course credit or monetary compensation (€20) for their participation. All participants had normal or corrected-to-normal vision and no history of language disorders. After data collection, 20 participants were removed from the dataset because their participation time was either overly long (more than three hours) or overly short (less than one hour). This criterion ensured that all participants performed the task carefully, neither rushing through the questionnaires nor taking an excessive amount of time. Following the same principle, two participants were removed from the analysis owing to a rate of similar responses greater than 10%. Two other participants were removed from the database to balance the number of participants across groups (see Materials for details). Therefore, data from 169 participants were examined (133 female and 36 male, 18–29 years old, M = 20.17; SD = 1.47). Before the beginning of the experiment, participants were informed about the goal and duration of the study. They also completed an online written consent form in accordance with the Declaration of Helsinki.

Materials

Tangram pictures

We started by collecting a total of 375 monochrome (black) tangram pictures from booklets found in various tangram games. As specified below, not all 375 pictures were included in the final database. As in most studies involving tangram pictures (e.g., Knutsen, Ros, et al., 2018), all the pictures were made of one square, two big triangles, two small triangles, one medium triangle, and one parallelogram. The pictures were then scanned and randomly divided into two sets, hereafter referred to as set A, which contained 187 pictures, and set B, which contained 188 pictures. All pictures were then uploaded to the online survey platform LimeSurvey (version 2.6). The longest side of each picture (height or width) was always 300 pixels, and the other dimension was adjusted automatically to preserve the original proportions of each picture.

Questionnaires

The variables examined in this study were split into three sets of questionnaires (hereafter questionnaires 1, 2, and 3). Two different versions of each questionnaire were then created, each corresponding to a different set of pictures (A or B). Six questionnaires (1A, 1B, 2A, 2B, 3A, and 3B) were thus created in total. We divided the pictures into two sets and the questions into three questionnaires to reduce the length of the experiment and to make sure that the collected data were reliable. In each questionnaire, the first question (which was always the same in all questionnaires) asked the participants to state the first word or expression which came to their mind when they saw the picture. That question was thus related to the name agreement variable (see Footnote 1). Due to the expected diversity in the labels provided by the participants to refer to a given tangram picture, the first question was always related to name agreement, and the following questions were divided into different categories: questions on how the picture may be referred to in dialogue in questionnaires 1A and 1B, questions on the interface between the labels chosen and their visual representation (image agreement and image variability) in questionnaires 2A and 2B, and questions on the visual and conceptual properties of the pictures in questionnaires 3A and 3B.

Following the name agreement question, in questionnaires 1A and 1B, participants were asked to say whether they would use the label they had provided to describe the picture during a dialogue with another person, and if not, which label they would prefer to use. The purpose of this question was to determine how likely the labels provided by the participants were to be used in a dialogue setting. Participants were then asked to state whether any other label (i.e., word or expression) came to their mind when they looked at the picture. These two questions were respectively referred to as “use in dialogue” and “other label.” Participants had to answer “yes” or “no” to each of these two questions. They were required to provide an additional label if they had answered “no” to the “use in dialogue” question or if they had answered “yes” to the “other label” question.

In questionnaires 2A and 2B, the name agreement question was followed by questions on the conceptual characteristics of the pictures that assessed image agreement, image variability, and AoA. The image agreement question asked participants to judge to what extent the picture’s appearance was representative of the mental representation associated with the label they had provided, using a 5-point Likert scale from “very weakly representative” to “very highly representative.” In the image variability question, the participants were instructed to rate whether the label they had provided was related to few or many different visual representations, using a 5-point Likert scale from “there are very few ways to visually represent this word or expression” to “there are many ways to visually represent this word or expression.” Finally, in the AoA question, the participants were asked to estimate the age at which they thought they had learned the labels they had provided by selecting one age class among five: 0–3, 4–6, 7–9, 10–12, after 12.

In questionnaires 3A and 3B, the name agreement question was followed by questions about the visual properties of the pictures. The second question of these questionnaires sought to determine whether the picture had already been seen (referred to as the “already seen” question; e.g., the participants might have already seen the picture before taking part in the study). This question involved a “yes/no” answer. It was then followed by questions on familiarity, visual complexity, and concreteness, all rated on a 5-point Likert scale. The familiarity question asked participants to indicate how familiar they were with the picture, from “unfamiliar” to “very familiar.” For the visual complexity question, participants rated the picture from “very simple” to “very complex.” Finally, the concreteness question asked participants to rate the concept associated with the picture from “abstract” to “concrete.” We asked participants to name the picture before rating the related concept on the concreteness scale. All questions used in each questionnaire are listed in Appendix A (we provide the initial French wording as well as a translation in English; the questions are listed in the same order as in the initial questionnaires).

Procedure

To complete the online questionnaire, participants were asked to sit in a quiet room to avoid distractions such as music or noise and to answer the questions at their own pace. The first page of the questionnaire described the experiment (goal and duration) and was followed by the consent form. Once participants had given their consent by answering “yes” to the question “Do you consent to take part in this study?” they were shown the instructions for the task. Pictures were then displayed one by one on the participants’ screen. The pictures were alternately shown on a green or blue background; this alternation made it easier for participants to notice that they had moved on to a new picture. Each page of the questionnaire included one picture as well as all the questions the participant was required to answer. All questions were presented on the same page, below the picture. Each participant was shown only one of the six questionnaires (1A, 1B, 2A, 2B, 3A, or 3B). Therefore, they saw only the questions corresponding to the questionnaire they had been allocated. Each question consisted of one sentence (the question itself) followed by either a response field or a Likert scale, depending on the type of question. When all the questions corresponding to a given picture had been answered, participants clicked on the “next” button to move on to the following picture. The order in which the pictures were presented was randomized across participants. At the end of the questionnaire, participants provided demographic information regarding their first language and other spoken languages, gender, age, and history of language disorders. The entire questionnaire took approximately 90 minutes to complete.

Data preprocessing on name agreement question

The data from the name agreement question were first examined by two native French speakers in order to correct spelling mistakes. Determiners of isolated words were removed, except for words with different meanings depending on their grammatical gender (e.g., “le vase”, which means “vase”, had to be distinguished from “la vase”, which means “mud”). In cases where the participant’s response consisted of a letter (“M”) or an expression such as “the letter M” (“la lettre M”), only the letter (capitalized) was kept (“M”). Regarding numbers, when they were used to count things, as in the expression “two mountains” (“deux montagnes” in French), they were written in words. In all other cases (e.g., when the picture was believed to represent a number), the number form was kept. Plural words were replaced by singular forms. Finally, when a participant answered with a punctuation mark (e.g., “?”, implying that the picture looked like a question mark), the mark was written out in full (e.g., “question mark”). All other punctuation marks were removed. Words deliberately written in English were not translated.
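
As an illustration, the more mechanical of these rules could be scripted as follows. This is only a partial sketch under our own assumptions (the function name and regular expressions are hypothetical); the actual cleaning was performed manually by two native French speakers, notably for gender-disambiguating determiners and singularization.

```r
# Partial sketch of the label normalization step (illustrative only).
# Shown here: stripping the determiner of an isolated word and removing punctuation.
normalize_label <- function(x) {
  x <- trimws(x)
  # remove a leading determiner when the response is a single word; responses such as
  # "le vase"/"la vase", where gender changes the meaning, were handled by hand
  x <- sub("^(le |la |les |l'|un |une |des )(\\S+)$", "\\2", x)
  # remove remaining punctuation marks
  gsub("[[:punct:]]", "", x)
}

normalize_label(c("la maison", "un chat!", "l'oiseau"))
#> "maison" "chat" "oiseau"
```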

Secondly, labels indicating that participants could not think of a label or did not want to respond were coded as “non-responses” (“abs” in the data files; e.g., “nothing”, “no idea”, “no opinion”). We also coded the following responses as NA: (a) when we did not understand the labels or when they were lexically ambiguous, (b) when participants provided more than one label, and (c) when labels referred to another picture which had been shown previously (e.g., “the same corridor as before but the door is closer”) (see Footnote 2).

Sixteen pictures were removed from the analysis because they generated either more than 10% of NA responses or more than 10% of non-responses. This represented 0.05% of the dataset. In the final dataset, 0.70% of responses were NA responses and 1.20% of responses were non-responses. In addition, 27 pictures were removed from the final dataset owing to an experimenter error. Therefore, name agreement and all other variables were analyzed for the 332 remaining pictures (166 in set A and 166 in set B). Summary descriptive statistics for these 332 pictures are presented in Table 1. Appendix B provides the associated mean and standard deviation values for each picture on the following variables: percentage of name agreement, H index, image agreement, familiarity, visual complexity, image variability, AoA, and concreteness.
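
A sketch of this exclusion step, under an assumed long-format layout of the responses (the data frame and column names below are hypothetical):

```r
# Remove pictures with more than 10% NA responses or more than 10% non-responses ("abs").
# `responses` is assumed to have one row per picture x participant answer.
library(dplyr)

kept_pictures <- responses %>%
  group_by(picture) %>%
  summarise(p_na  = mean(is.na(label)),
            p_abs = mean(label == "abs", na.rm = TRUE)) %>%
  filter(p_na <= 0.10, p_abs <= 0.10) %>%
  pull(picture)
```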

Data analyses

Analyses on name agreement data: modal label, percentage of name agreement, and H index

In line with previous literature on name agreement data (e.g., Alario & Ferrand, 1999; Bonin et al., 2003; Dimitropoulou et al., 2009; Duñabeitia et al., 2018; Ghasisin et al., 2014; Manoiloff et al., 2010; Tsaparina et al., 2011), three measures were calculated for each picture: the modal label, that is, the label given most often to refer to a given picture; the percentage of name agreement, which corresponded to the percentage of participants who gave the modal label as their answer; and the H index. The H index (Shannon & Weaver, 1949) reflects the diversity in the labels provided by participants to refer to a given picture. The H index was calculated for each picture using the following formula:

$$\mathrm{H}=\sum_{i=1}^k{p}_i{\log}_2\left(1/{p}_i\right)$$

where k refers to the number of different labels given to a picture and p_i represents the proportion of participants who gave label i (Snodgrass & Vanderwart, 1980). More precisely, if all participants use the same label to refer to a given picture, the picture has an H index of 0 and its percentage of name agreement is 100. In contrast, when the variability in labels provided across participants increases, the value of the H index also increases, and the percentage of name agreement usually decreases as well. As in previous work (Alario & Ferrand, 1999; Snodgrass & Vanderwart, 1980), the H index was calculated from the responses to the name agreement question because it captures the distribution of labels for each picture across participants better than the percentage of name agreement does. The modal label, the percentage of name agreement, and the H index were computed on the two sets of pictures separately and on both sets combined (set A-B in Table 2). The results for the H index and the percentage of name agreement are presented in Table 2, while modal labels can be found online (see “ModalResponses-NamingAgreement” file).
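
For clarity, a minimal sketch of how these three measures can be obtained for one picture (the label vector below is hypothetical; ties between modal labels are handled as described further below):

```r
# `labels` holds the answers of all participants to the name agreement question
# for a single picture (hypothetical example)
labels <- c("maison", "maison", "bouteille", "fusée", "maison")

counts <- table(labels)
p      <- counts / length(labels)                # proportion of participants per label
H      <- sum(p * log2(1 / p))                   # H index (Shannon & Weaver, 1949)
modal_label <- names(counts)[which.max(counts)]  # modal label (first one in case of ties)
pct_name_agreement <- 100 * max(counts) / length(labels)  # percentage of name agreement
```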

Table 2 Summary descriptive statistics for name agreement (percentage of name agreement and H index) in sets A and B taken separately and for both sets combined.

Moreover, we explored whether the diversity in labels provided across participants for each picture was concordant across the three sets of questionnaires. As pointed out by Snodgrass and Vanderwart (1980), the diversity in the participants’ labels to refer to a given picture is best reflected by the H index. We therefore calculated Kendall's coefficient of concordance on the H indexes obtained in each questionnaire by comparing the questionnaires of the same set of pictures (i.e., 1A, 2A, and 3A, or 1B, 2B, and 3B). In this analysis, the pictures rather than the participants were the basic unit of analysis. Given the large number of pictures, the tables of critical values for Kendall’s W statistic (Siegel & Castellan, 1988) were not appropriate for testing the significance of W, and a chi-square test was used instead.
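
A sketch of this analysis with the kendall function of the irr package; the matrix name and layout, as well as the tie correction, are our own assumptions, since the paper does not state which implementation was used:

```r
# `h_setA` is assumed to be a 166 x 3 matrix of H indexes: one row per picture of
# set A, one column per questionnaire (1A, 2A, 3A)
library(irr)

kendall(h_setA, correct = TRUE)  # Kendall's W, tested with a chi-square statistic
                                 # (correct = TRUE applies the correction for ties)
```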

We then examined the homogeneity of modal labels for each picture across the three sets of questionnaires. In other words, for each picture, we checked whether the mode was the same regardless of the questionnaire in which the picture was presented. We calculated Krippendorff’s alpha rather than Fleiss’ kappa, as the comparison included three questionnaires for each set of pictures (A and B) and the dataset contained missing data (Zapf, Castell, Morawietz, & Karch, 2016). Using the kripp.alpha function of the R package irr (Gamer, Lemon, Fellows, & Singh, 2019), we compared the modal labels obtained for each picture separately for the two sets of pictures. This was performed on questionnaires 1A, 2A, and 3A on the one hand, and on questionnaires 1B, 2B, and 3B on the other, as sets A and B included different pictures.
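
A sketch of this computation (the matrix modes_setA and its layout are hypothetical: rows are the three questionnaires of set A, columns are the pictures, with NA where no mode was available):

```r
library(irr)

# kripp.alpha() expects a classifier x object matrix; the (nominal) modal labels
# are therefore recoded as integer categories before the call
coded <- matrix(as.integer(factor(modes_setA)), nrow = nrow(modes_setA))
kripp.alpha(coded, method = "nominal")
```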

Importantly, 67 pictures from set A and 63 pictures from set B had more than one modal label (multiple-mode pictures, i.e., pictures for which two or more labels had been given the same number of times and were the most frequent labels). For example, picture A23 (set B) was named “bouteille” (i.e., bottle) by 12 participants and “maison” (i.e., house) by 12 other participants across all three questionnaires. In addition, 12 pictures from set A and 3 pictures from set B had no modal label. These were cases where each participant provided a different label to describe a picture, each label thus being provided only once. Two different approaches were used to handle these cases. The first approach consisted in including multiple-mode pictures in the analysis but taking only one modal label per picture into account. The mode to be included was selected using the following procedure: for each picture, if one of the multiple modes in one of the questionnaires was the same as the (unique) mode in another questionnaire, that mode was retained, as it could be considered the most representative of the picture. All other possible modes were removed from the analysis. If none of the multiple modes matched the modes of the other questionnaires, then all multiple modes were replaced by an NA response and were not included in the analysis. The second approach consisted in removing, for each picture, the data from all three questionnaires if the picture was associated with multiple modes in at least one of the questionnaires. The modal responses for each approach are available online (see “ModalResponses-NamingAgreement” file).
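
A sketch of the first approach, written as a small helper for a single picture; the function and data layout are our own, and the published modal responses were produced with the authors' procedure, not this code:

```r
# `modes_by_q` is a list with one character vector of modal labels per questionnaire
# (possibly several tied modes, as in the "bouteille"/"maison" example above)
resolve_modes <- function(modes_by_q) {
  sapply(seq_along(modes_by_q), function(i) {
    modes_i <- modes_by_q[[i]]
    if (length(modes_i) == 1) return(modes_i)   # unique mode: kept as is
    others <- modes_by_q[-i]
    unique_elsewhere <- unlist(others[lengths(others) == 1])
    hit <- modes_i[modes_i %in% unique_elsewhere]
    if (length(hit) > 0) hit[1] else NA         # no match in the other questionnaires: NA
  })
}

resolve_modes(list(c("bouteille", "maison"), "maison", "maison"))
#> "maison" "maison" "maison"
```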

Reliability for image agreement, AoA, image variability, familiarity, visual complexity, and concreteness

The reliability of the ratings was assessed by calculating intra-class correlation coefficients (ICCs) on image agreement, AoA, image variability, familiarity, visual complexity, and concreteness. For each variable, the ICC was obtained by using two-way random effects as the model, consistency as the definition, and multiple raters/measurements as the type (see McGraw & Wong, 1996). The analysis sought to assess inter-participant reliability for each variable within each set of pictures.
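
A sketch of one such ICC computation using the icc function of the irr package (whether this particular function was used is an assumption; the matrix name is hypothetical, with pictures as rows and participants as columns):

```r
library(irr)

# e.g., familiarity ratings of set A: one row per picture, one column per participant
icc(familiarity_setA,
    model = "twoway",       # two-way random effects
    type  = "consistency",  # consistency definition
    unit  = "average")      # multiple raters/measurements
```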

Correlational analyses between H index, image agreement, AoA, image variability, familiarity, visual complexity, and concreteness

Correlations between the H index, image agreement, AoA, image variability, familiarity, visual complexity, and concreteness were performed for the two sets of pictures separately and for both sets combined. The purpose of these correlations was to determine whether the correlational results reported in the literature on black-and-white drawings (e.g., Alario & Ferrand, 1999; Bonin et al., 2003) were also found with the tangram pictures used in the current study. The percentage of name agreement was not included in these analyses, as the diversity in the participants’ labels to refer to a given picture is best reflected by the H index. Image agreement, AoA, image variability, familiarity, visual complexity, and concreteness scores were obtained by averaging the numerical responses to each picture across participants. The average scores for each picture are included in Appendix B, and the corresponding raw data are available online (see “OtherVariables” file).
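
A sketch of these correlational analyses using Kendall's tau (as reported in the Results), under an assumed per-picture data frame with hypothetical column names:

```r
# `norms` contains one row per picture with the averaged scores and the H index
round(cor(norms[, c("h_index", "image_agreement", "aoa", "image_variability",
                    "familiarity", "visual_complexity", "concreteness")],
          method = "kendall", use = "pairwise.complete.obs"), 2)

# significance test for a single pair, e.g., H index and image agreement
cor.test(norms$h_index, norms$image_agreement, method = "kendall")
```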

Analyses on three additional exploratory variables: “use in dialogue”, “other label”, and “already seen” variables

Descriptive statistics were calculated for the “use in dialogue” and the “other label” variables in order to provide information on how the participants would have named the pictures in dialogue settings and on whether the participants would have provided any other label (i.e., word or expression) as an alternative to their first label. Analyses on these variables are presented as exploratory since, to our knowledge, this is the first norming paper to examine such variables. Indeed, the pictures examined in other norming studies are not systematically used in dialogue research, thus making questions about how they would be referred to in dialogue less relevant. We therefore computed the percentage of “no” answers, that is, when an additional label was provided as being more likely to be used in a dialogue setting than the first label (hereafter called the percentage of “use in dialogue”) across all pictures. Regarding the “other label” question, we calculated the percentage of “yes” answers for which an alternative label was provided (hereafter called the percentage of “other label”) across all pictures. This enabled us to determine how often the first label provided in response to the naming question (a) was not the label participants would have used in a dialogue setting or (b) was not the only label to come to the participants’ mind to describe the picture.

Finally, the proportion of “yes” responses to the “already seen” question was also calculated in order to perform correlational analyses between these data and the other numerical variables measured in the questionnaires (i.e., questionnaires 3A and 3B, in which H index, familiarity, visual complexity, and concreteness were examined). This analysis was performed to quantify the effect of prior knowledge on the other variables explored in this study.

Results

Analyses on name agreement data: modal label, percentage of name agreement, and H index

Table 2 presents a summary of descriptive statistics related to the percentage of name agreement and the H index, including the 25th (Q1) and the 75th (Q3) percentiles. As shown in Table 2, the mean percentage of name agreement across participants from sets A and B combined was 24.01% (SD = 17.89), reflecting the fact that situations in which most participants produced the same label to refer to a given picture were quite rare. In line with the percentage of name agreement, the average H index for both sets A and B combined was 4.59 (SD = 1.14), which implies that the variability in naming was large across participants. For almost half of the pictures (151 tangram pictures), the modal response was given by 10% to 30% of the participants, and only 11% of the pictures (37 pictures) had a modal response given by participants more than 50% of the time. One possible explanation for this lack of agreement lies in the opportunity for participants to use full referential expressions (instead of isolated words only) to refer to the pictures. To address this possibility, we used an additional level of coding on the modal labels associated with each picture. If a modal label was a word, we counted the total number of labels in which this word occurred alone or as part of an expression. However, if the modal label was an expression such as “a person who dances”, we counted the total number of labels in which the corresponding content words (“person” and “to dance”) appeared separately or together. If a picture was associated with more than one modal label, we used the most frequent label after having counted the number of times each modal label occurred as an isolated word or as a content word within a full referential expression. The goal was to determine whether this new coding, which grouped labels by their content words, led to percentages of name agreement closer to those reported in previous studies on name agreement (e.g., Alario & Ferrand, 1999; Bonin et al., 2003; Dimitropoulou et al., 2009; Duñabeitia et al., 2018; Ghasisin et al., 2014; Manoiloff et al., 2010; Tsaparina et al., 2011). Using this procedure, the maximal mean percentage of name agreement was 38.09% (SD = 24.28) in set A and 31.95% (SD = 20.59) in set B. This was once again lower than the values reported in previous studies.
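
A sketch of this additional coding for one picture (hypothetical helper; morphological variants and stop words would require more careful handling than the plain substring matching shown here):

```r
# count the labels in which the content words of the modal label occur,
# alone or within a fuller referential expression
recount_agreement <- function(labels, content_words) {
  hits <- sapply(labels, function(lab)
    any(sapply(content_words, grepl, x = lab, ignore.case = TRUE)))
  100 * sum(hits) / length(labels)
}

labels <- c("une personne qui danse", "danseuse", "un bonhomme", "personne", "fusée")
recount_agreement(labels, c("personne", "danse"))
#> 60  (3 of the 5 labels contain "personne" and/or "danse")
```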

Kendall's coefficients of concordance were calculated between the H indexes of each set of questionnaires within each set of pictures. The analysis revealed a coefficient of concordance of 0.91 for the three questionnaires of set A and of 0.91 for the three questionnaires of set B. The chi-square test revealed a significant concordance for both sets A and B (respectively χ2 = 451, p < 0.001 and χ2 = 448, p < 0.001). This result implies that for both sets A and B, the H indexes of the 166 pictures were ranked in approximately the same order in each of the three questionnaires (Siegel & Castellan, 1988). In other words, the agreement across participants regarding picture naming was concordant across the three questionnaires. In line with this result, the distributions of the H index were quite similar across the three questionnaires (see Fig. 2).

Fig. 2

Density plots of the H index for the three questionnaires (1, 2, and 3)

The analysis conducted using Krippendorff’s alpha revealed a reliability rate of 0.69 for set A, 0.66 for set B, and 0.68 for both sets combined, when multiple-mode pictures were included in the analysis of homogeneity of modal labels across the three sets of questionnaires. When multiple-mode pictures were removed from the analysis, we observed a higher reliability rate of 0.78 for set A, 0.71 for set B, and 0.74 for both sets combined. It is acceptable to draw tentative conclusions given that these reliability coefficients are between 0.67 and 0.80 (Krippendorff, 2004). Modal responses for a given picture were thus closely related across all three sets of questionnaires.

Reliability for image agreement, AoA, image variability, familiarity, visual complexity, and concreteness

Intra-class correlational analyses (ICC) were performed on image agreement, AoA, image variability, familiarity, visual complexity, and concreteness, in order to assess the inter-participant reliability for each of these variables. Table 3 reports the results for each set of pictures (A and B).

Table 3 ICC index for image agreement, AoA, image variability, familiarity, visual complexity and concreteness

All ICC values reflect moderate to excellent reliability.

Correlational analyses between H index, image agreement, AoA, image variability, familiarity, visual complexity, and concreteness

Correlations between the variables examined in questionnaires 2 (2A and 2B) and 3 (3A and 3B) were calculated using Kendall’s tau, as none of the variables except AoA was normally distributed. Table 4 shows the correlation matrix for all variables. Since the data associated with each variable were not collected from the same sample of participants, the correlation matrix was divided into three parts. Matrix A shows correlations between the variables and the H index obtained using questionnaire 2. Matrix B shows correlations between the variables and the H index obtained using questionnaire 3. Finally, Matrix C shows correlations between the variables obtained using both questionnaires.

Table 4 Summary of all correlations performed within and across questionnaires 2 and 3

As shown in Table 4, almost all variables were significantly correlated with each other. High correlations were found between concreteness and familiarity (.73) and between concreteness and image agreement (.67). This suggests that pictures rated as representing a concrete concept had an appearance judged as highly representative of the participants’ associated mental representations of the labels. Pictures rated as representing a concrete concept were also judged as more familiar.

As expected, name agreement, as measured by the H index, and image agreement were correlated negatively (–.43). Name agreement was also correlated negatively with familiarity (–.41) and concreteness (–.42), suggesting that pictures with a high level of agreement were judged as familiar and as representing a concrete concept. AoA was also correlated positively with name agreement (.25) and negatively with familiarity (–.14), image variability (–.30), image agreement (–.18), and concreteness (–.15). This suggests that when labels are acquired early on, fewer labels are provided, and the level of agreement is high across participants. In addition, when labels are acquired early on, the pictures are judged as more familiar and as representing a more concrete concept, and the labels evoke a larger number of different images. Labels acquired early on were also associated with pictures whose appearance was judged as highly representative of the participants’ associated mental representations of the labels. Image agreement was correlated positively with familiarity (.62), suggesting that pictures whose appearance was judged as highly representative of the participants’ associated mental representations of the labels were also rated as familiar.

There were also significant (but small or moderate) correlations between visual complexity and several other variables. Visual complexity was correlated positively with image variability (.21) and name agreement (.30), but negatively with image agreement (–.30), familiarity (–.48), and concreteness (–.37). This means that complex pictures were associated with labels which evoked a larger number of different images and led participants to provide more labels. However, complex pictures were also judged as less representative of the participants’ associated mental representations of the labels and were rated as unfamiliar and more abstract.

Analyses on three additional exploratory variables: “use in dialogue”, “other label”, and “already seen”

Finally, we explored the data from the “use in dialogue”, “other label”, and “already seen” variables. The “use in dialogue” variable reflects whether a participant would use a word other than the label provided to describe the picture to another person in an interactive dialogue setting. For example, for picture A23, a participant used the word “tour” (i.e., tower) to name the picture but indicated that they would use the word “robe” (i.e., dress) to describe this picture in a dialogue setting. As for the “other label” variable, participants had the opportunity to suggest a label other than the one they had initially provided (i.e., their answer to the name agreement question). For example, for the A23 picture, one participant proposed the label “maison avec une cheminée” (i.e., house with a chimney) for the picture, but added the label “bouteille” (i.e., bottle) as their response to the “other label” question.

The percentage of “no” answers for the “use in dialogue” variable (i.e., when an additional label was provided as being more likely to be used in a dialogue setting than the first label) and of “yes” answers for the “other label” variable (i.e., “yes” answers when an alternative label was provided) was relatively low. The mean percentage of “no” answers for the “use in dialogue” variable was 17.58% (SD = 7.53), and the mean percentage of “yes” answers for the “other label” variable was 27.59% (SD = 10.48). This means that most participants considered that their response to the naming question could be used in a dialogue setting. In addition, participants rarely gave another label after providing the first label that came to mind. Regarding the “already seen” variable, the mean percentage of “yes” responses was 12.04% (SD = 12.36). This could be because some of the participants had already played tangram games before. To develop the analysis of the data from questionnaire 3, we also computed correlations between the percentage of “yes” answers to the “already seen” variable and the other variables from this set. All these correlations were significant (Table 5). There was a negative correlation between the “already seen” variable and the H index (–.30), suggesting that pictures which had already been seen led to less diversity in the labels provided. As revealed by a negative correlation between the “already seen” variable and visual complexity (–.46) and positive correlations between the “already seen” variable and the familiarity (.42) and concreteness (.28) variables, pictures which had already been seen were also judged as less complex, more familiar, and as representing a more concrete concept.

Table 5 Correlations between the percentage of “yes” answers to the “already seen” question and the other variables measured in questionnaire 3.

Discussion

We present the first French normative database for 332 tangram pictures that can be used in dialogue research and other research fields. This database includes norming data on several characteristics, from visual properties to conceptual representations (name agreement, image agreement, familiarity, visual complexity, image variability, AoA, and concreteness).

The first important result of this study concerns the high variability in the labels given to each picture across the three questionnaires, as shown by the high H index and the low percentage of name agreement. The reliability analysis revealed a sufficient level of agreement between the three sets of questionnaires regarding how the pictures were named in all groups of participants. This variability cannot be attributed to the format of the participants’ responses. Indeed, one may argue that while most norming studies (e.g., Alario & Ferrand, 1999; Bonin et al., 2003; Dimitropoulou et al., 2009; Duñabeitia et al., 2018; Ghasisin et al., 2014; Manoiloff et al., 2010; Tsaparina et al., 2011) offer only isolated words as labels to refer to pictures, allowing participants to use either isolated words or expressions to respond increases the number of ways in which the same label may be expressed, thus explaining the variability in the data. However, the additional analysis performed by grouping participants’ responses by word content yielded results which do not support this hypothesis. Although the percentage of name agreement increased in this analysis, it remained low in comparison with the results on name agreement reported in the literature (e.g., Alario & Ferrand, 1999; Bonin et al., 2003; Duñabeitia et al., 2018). This reinforces and helps quantify the idea that tangram pictures are suitable for dialogue studies. Indeed, as mentioned in the introduction, dialogue research needs stimuli which can be perceived in different ways to encourage participants to negotiate how to refer to them. This enables researchers to study collaboration and the emergence of mutual knowledge and common ground (Clark, 1996; Clark & Wilkes-Gibbs, 1986). In this context, our database may be used to select stimuli based on name agreement: for example, a researcher may be interested in selecting tangram pictures with a high level of name agreement (which could lead participants to reach an agreement promptly) versus tangram pictures with a low level of name agreement (which could lead to more negotiation among participants, thus potentially causing the dialogue to last longer). The high level of variability in the labels provided by the participants also highlights that dividing tangram pictures into categories (e.g., people, cats, boats, etc., as is often done in the dialogue literature) is not as straightforward as it may seem.

The second important result concerns the correlations obtained between name agreement, calculated using the H index, and the other variables examined in the study. We found that name agreement was correlated with all other variables except for image variability, as expected from prior literature (Alario & Ferrand, 1999; Bonin et al., 2003; Dimitropoulou et al., 2009; Ghasisin et al., 2014; Manoiloff et al., 2010; Snodgrass & Vanderwart, 1980; Tsaparina et al., 2011). Name agreement and image agreement were negatively correlated, suggesting that a stronger agreement between a picture’s label and its appearance is related to a lower number of labels provided. As in previous studies (Alario & Ferrand, 1999; Bonin et al., 2003; Dimitropoulou et al., 2009; Ghasisin et al., 2014; Manoiloff et al., 2010; Tsaparina et al., 2011), we found that pictures associated with a smaller agreement rate were also those for which the label was acquired later.

These pictures were judged as less familiar and more complex as well. These results also match the findings of Bonin et al. (2003). Therefore, while the level of name agreement may be lower for tangram pictures than for line drawing pictures, correlations between name agreement and other variables pertaining to the characteristics of the pictures seem to remain relatively stable. Correlations between variables other than name agreement are discussed below.

Negative correlations were also found between AoA and familiarity on the one hand, and between AoA and image variability on the other, as in previous studies (Alario & Ferrand, 1999; Bonin et al., 2003; Ghasisin et al., 2014; Manoiloff et al., 2010; Tsaparina et al., 2011). The correlation between AoA and familiarity suggests that the later in life a word is learned, the less familiar people are with its visual representation, probably because they have been less exposed to the word. AoA was also negatively correlated with image agreement, as in the work by Alario and Ferrand (1999), Ghasisin et al. (2014), and Tsaparina et al. (2011). Words acquired early thus tend to evoke more visual representations and are perceived as more familiar in our study, and they are also associated with more representative pictures than words acquired late. One difference from previous results (Alario & Ferrand, 1999; Ghasisin et al., 2014; Manoiloff et al., 2010; Tsaparina et al., 2011) is the absence of a correlation between AoA and visual complexity in our data, although this correlation was also absent in the study by Bonin et al. (2003).

Image variability was positively correlated with visual complexity in our study, which means that labels evoking a large number of different images were associated with more complex pictures. This correlation was not significant in the work by Bonin et al. (2003), and it was negative in that of Alario and Ferrand (1999) and Manoiloff et al. (2010), where more complex pictures were associated with labels evoking a small number of images. This difference may lie in the nature of the pictures used in these studies and ours, and in the resulting interpretation of visual complexity. In line drawings, visual complexity may arise from a greater number of lines and thus be associated with a more detailed picture, leading to fewer possible representations. However, in tangram pictures, increased visual complexity can be associated with a loss of clarity of the visual forms and a greater number of possible representations for the same picture. In accordance with this interpretation, it is interesting to note that the mean value of visual complexity in our study is similar to that found in previous studies (Alario & Ferrand, 1999; Bonin et al., 2003; Manoiloff et al., 2010), although line drawings usually include more visual details than tangram pictures.

Image agreement was positively correlated with familiarity as in research by Ghasisin et al. (2014), unlike in other studies (Alario & Ferrand, 1999; Bonin et al., 2003; Manoiloff et al., 2010; Tsaparina et al., 2011). This correlation means that the more a picture is considered familiar, the more it is perceived as representative of its label. Image agreement was also negatively correlated in our study with visual complexity, and visual complexity was negatively correlated with familiarity. In other words, the more a tangram picture is perceived as complex, the less the picture is considered as representative of its label (Ghasisin et al., 2014; Tsaparina et al., 2011). Similar to previous studies using line drawing pictures (Alario & Ferrand, 1999; Bonin et al., 2003; Ghasisin et al., 2014; Manoiloff et al., 2010; Tsaparina et al., 2011), pictures rated as more complex were judged as less familiar. As for name agreement, correlations between visual complexity and other variables pertaining to the characteristics of the pictures seem to remain relatively stable with respect to line drawing pictures, except for the correlation between visual complexity and image variability.

This is also the first study to explore concreteness for picture norms. This variable, which reflects the extent to which a picture is judged as representing a concrete concept, as opposed to a more abstract entity, is known to influence the processes involved in lexical production (e.g., Hanley et al., 2013). In this study on tangram pictures, concreteness was negatively correlated with name agreement and visual complexity, and was positively correlated with familiarity. These correlations mean that the more a concept was perceived as concrete, the more familiar and less complex the picture was perceived to be. Pictures which were judged as representing a more concrete concept were also associated with less diversity in the labels provided (i.e., a lower H index). Our findings about concreteness are not surprising considering that (a) concreteness and imageability (i.e., the ease with which a mental image can be generated in response to the presentation of a written word) are known to be highly correlated (Paivio et al., 1968) and (b) the same correlations between the variables previously cited and imageability have already been found in the literature (Tsaparina et al., 2011). Therefore, if the perception of the concreteness of a concept is partly linked with the ease of generating a mental image of a given word, it seems reasonable to assume that these two variables will correlate in the same way with name agreement, visual complexity, and familiarity. Moreover, it is interesting to note that tangram pictures were rated as representing rather abstract concepts (with a median value of 2.83). Hence, this study shows that tangram pictures are an interesting tool for presenting abstract concepts, even though depicting abstract concepts in pictures is not usually considered feasible (Hanley et al., 2013).

Regarding the “already seen” variable, there were negative correlations with the H index and visual complexity and positive correlations with familiarity and concreteness. This means that the way participants perceived the pictures was influenced by their previous experiences with them. Consequently, the more a picture had already been seen by participants, the more it was perceived as familiar and as representing a concrete concept, and the less it was perceived as complex. Pictures which had already been seen also led to more consensus in the participants’ answers, as shown by the correlation with name agreement.

In conclusion, the present database provides the first French norms for a new set of 332 tangram pictures. The analyses of this database show that the characteristics of tangram pictures are very similar to those of line drawings regarding many variables. Nevertheless, tangram pictures also have specific characteristics. Overall, the results confirm that tangram pictures are particularly well suited for dialogue studies. As they can be perceived in several different ways, they lead participants to use a wide range of labels to refer to them, implying that they must discuss and negotiate in order to reach an agreement regarding how to refer to them. In addition, the characteristics assessed in this study may be particularly useful for dialogue researchers. Indeed, the data collected through the naming question may be used to anticipate how a sample of participants may refer to a given picture, and/or how likely it is for someone to use a given label to refer to a specific tangram picture. This could be particularly useful in studies in which the degree of consensus associated with tangram pictures must be taken into account (e.g., Knutsen, Ros, et al., 2018), or when attempting to write plausible scripts for studies involving confederates (see Kuhlen & Brennan, 2013). Moreover, the other variables examined in this study may help to better control the materials used in dialogue studies, as the visual and conceptual characteristics of tangram pictures are known to affect the way in which participants talk about them (e.g., Hupet et al., 1991). Note that the variables taken into account in the current study were not the same as those examined by Hupet et al. (who focused on codability and discriminability). Discriminability could not be assessed in the current study, as this variable involves comparing a tangram picture with other tangram pictures. However, codability, which is defined as the ease with which a tangram picture can be verbally expressed, may be related to a number of our variables (e.g., name agreement, as a picture whose name agreement is high may be perceived as easy to express verbally; pictures with a high level of image agreement may also be perceived as easy to express verbally). Thus, it would be interesting to examine whether these two variables affect dialogue in the same way as codability. More generally speaking, we hope that the variables measured in this study will be used by dialogue researchers to anticipate how picture characteristics might affect dialogue characteristics. This should enable researchers to control for a number of item-related phenomena (e.g., the fact that labels associated with more complex or less familiar pictures may take longer to negotiate) in order to obtain a better understanding of the processes which underlie dialogue.

Unlike the work by Duñabeitia et al. (2018), which provided norms for 750 drawings in six European languages, our database concerns French labels, thus providing information on a less studied language than English. However, the part of the database concerning the visual properties of the pictures can still be used in other languages, and the concreteness ratings can also be generalized to other languages which classify concepts in the same way. Above all, the methodological approach adopted in this study offers interesting ways to develop new databases of abstract pictures in other languages. For example, the procedure we used to assess image agreement, image variability, and AoA (these were assessed following the choice of a label to refer to the picture) strongly differs from the procedure usually used with line drawing pictures (in which modal labels of pictures are judged in terms of image agreement, image variability, and AoA; Alario & Ferrand, 1999; Bonin et al., 2003; Manoiloff et al., 2010; Tsaparina et al., 2011). We chose to adapt the procedure due to the way in which tangram pictures are used in the dialogue literature, in which participants are usually required to come up with a word or expression to refer to the tangram pictures they are shown. In this kind of context, both the features of the tangram picture per se and the features of the label/concept chosen are likely to affect the way in which the tangram picture is referred to in dialogue. This is why we decided to ask the participants to focus on the concept they had chosen (even if it did not ultimately correspond to the modal label for the picture) rather than on modal labels. A limitation of this study is the fact that data were collected in a situation other than a dialogue setting. This led us to create the “use in dialogue” and “other label” variables. In future studies, it would thus be interesting to verify whether the modal responses found in our name agreement question are used in genuine dialogue situations, or whether participants favor the use of the “other labels” listed in this study (in the latter case, it would also be interesting to determine which factors lead participants to favor the use of “other labels” over modal labels).

Outside the field of dialogue research, researchers focusing on the cognitive processes involved in the processing of pictures and concepts, on the interface between picture recognition and language processing, or on language production in general may want to use our database to control for the variables measured here in their own studies. Moreover, the fact that tangram pictures may be associated with several different labels may enable researchers to address questions such as how lexical production is affected by competition between the multiple concepts and labels evoked by a picture, which factors influence the selection of concepts and labels, or which cognitive mechanisms underlie the activation of abstract concepts after the presentation of pictures.

The materials used in this study are open access and free from copyright restrictions for noncommercial purposes, to facilitate their use in further research and the exploration of other aspects of the pictures. Researchers can find the materials online at https://osf.io/bxkpa/view_only=2fe0acd124e64df19a0195354ebe45b4, including all the featured pictures and the corresponding norms. The modal responses obtained using the various coding methods presented in this article are also available.