Cards are printed and laminated to make a high-quality deck of 13 cards: 5 feminized prototypes (f), 5 masculinized prototypes (m), one Jesus prototype (jp), one male prototype (mp), and one female prototype (fp). Each image can be marked with its category on the back so that the results are easy to read. Personalized prototypes are used to make the purpose of the task less obvious. The participants are asked to sort the images according to beauty, and they are not told that some images are more feminine than others.

Goals

Statistics: to familiarize students with statistical tests on ordinal data.

Experiment: to demonstrate a sorting task that can be performed without the use of computers. The task involves physical manipulation of objects and can be performed in a short time with very little training.

Variations: The instructions for how the cards should be sorted can be varied. We suggest attractiveness and trustworthiness, but another task might be to sort the cards on a feminine-to-masculine scale.

Variant: If you perform more than one variant on each subject, consider evaluating the tasks together as repeated measures on the same subject. The presentation order of the tasks should also be varied. For three tasks there are 6 possible orders to consider: A B C, A C B, B A C, B C A, C A B, C B A.
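The six orders can also be generated and handed out in rotation with a few lines of R. This is only a sketch: the helper `permutations`, the task labels, and the number of subjects (12) are assumptions made for illustration, not part of the exercise.

```r
# All orderings of a set of tasks, built recursively (base R only).
# `permutations` is a hypothetical helper written for this sketch.
permutations <- function(items) {
  if (length(items) <= 1) return(list(items))
  out <- list()
  for (i in seq_along(items)) {
    for (rest in permutations(items[-i])) {
      out[[length(out) + 1]] <- c(items[i], rest)
    }
  }
  out
}

tasks <- c("A", "B", "C")             # placeholder task labels
orders <- permutations(tasks)
length(orders)                        # 6

# Hand out the orders in rotation so that each order is used equally
# often (12 subjects is an assumed number).
n_subjects <- 12
assignment <- orders[(seq_len(n_subjects) - 1) %% length(orders) + 1]
sapply(assignment, paste, collapse = " ")
```

With 12 subjects each order is used twice; with a number of subjects that is not a multiple of 6, some orders are used one time more than others.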

Publication: For the statistics, we rely on ordinal data. Because ordinal data are not on an interval scale, models that assume an interval scale are difficult to justify, and measures of variance are not readily available. More data is the key to convincing statistics, which can be achieved by increasing the number of participants and the number of cards. We will show one way of accounting for individual differences: comparing the Wilcoxon scores between individuals, assuming only that the W scores can be sorted.

Preparatory Activities

You will need about 10 subjects to perform the task, and more is better.

The task is either to sort on attractiveness or trustworthiness, from left to right.

Assume that we ask to sort on attractiveness, most attractive first.

Each card is marked on the back as male or female, or with the type of prototype. Figs. 1 and 2 show some examples of cards.

The alternating rows show first male, then female individuals. The morphed pictures consist of 80% Christ prototypes and 20% individual pictures.

The sorted sequences can be read and marked as in the example below.

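| Card | m | jp | m | fp | f | f | mp | f | m | f | m | f | m |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Position | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |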

For each card and category, we count the number of cards to the left in another category. The examples do not assume the same images as in Figs. 1 and 2.

Fig. 1
Three digitally developed prototype photographs of a woman, a man, and an estimated visualization of Jesus.

Prototypes: female, male, Jesus

Fig. 2
16 digitally developed prototype photographs of an estimated visualization of Jesus.

Personalized “Jesuses”. Each portrait has been morphed with the Jesus prototype.

We test for a difference in the sorting order between male and female images with a Wilcoxon rank sum test for each person. For each card in one category, we count how many cards of the other category have a higher ranking number, and we sum those counts. For example, the first and second male images (positions 0 and 2) outrank all five female images, and the last male image (position 12) is outranked by all five female images (Table 1).

$$ \begin{aligned} & \text{Male (m): } 5 + 5 + 2 + 1 + 0 = 13 \quad (\text{max score is } 25 = 5 + 5 + 5 + 5 + 5) \\ & \text{Female (f): } 3 + 3 + 3 + 2 + 1 = \mathbf{12} \end{aligned} $$
$$ \text{In R: } \texttt{wilcox.test(c(0, 2, 8, 10, 12), c(4, 5, 7, 9, 11))} $$
Table 1 Rankings for one person (13 cards: 10 person, 3 prototypes)

Here, W = 12 and p = 1, so there is no significant difference in the sorting order for this participant; the order can be explained by random chance. The count for the female images was lower (12 versus 13). Because a higher count means that more cards of the other category were sorted below, the female images were slightly less attractive to this individual.
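The counting can be reproduced directly from the sorted sequence. The following is a minimal sketch, assuming the card sequence of the example above is typed in by hand; the variable names are chosen for illustration only.

```r
# Sorted sequence from the example, leftmost card first (position 0).
cards <- c("m", "jp", "m", "fp", "f", "f", "mp", "f", "m", "f", "m", "f", "m")
pos <- seq_along(cards) - 1           # positions 0..12

m_pos <- pos[cards == "m"]            # 0 2 8 10 12
f_pos <- pos[cards == "f"]            # 4 5 7 9 11

# For each card, count the cards of the other category with a higher position.
m_counts <- sapply(m_pos, function(p) sum(f_pos > p))   # 5 5 2 1 0
f_counts <- sapply(f_pos, function(p) sum(m_pos > p))   # 3 3 3 2 1
sum(m_counts)                         # 13
sum(f_counts)                         # 12

res <- wilcox.test(m_pos, f_pos)
res$statistic                         # W = 12
res$p.value                           # 1
```

The W reported by R equals the female count because R counts the pairs in which a value of the first sample (male) exceeds a value of the second sample (female).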

The W reported by R is the score of the second group in the call (here the female images, 12), which in this example is also the smaller of the two scores. Say that we have females as our second group. For 10 people we got the following W numbers for M and F (Table 2):

Table 2 Calculating rank sums

Table 2 can be evaluated by a paired Wilcoxon signed rank test, because the scores in the two categories come from the same ten people. Female images were judged more attractive 8 times out of 10. Is this significant?

The rank sums for the positive and negative classes:

$$ \begin{aligned} & \text{M: } 2 + 2 = \mathbf{4} \\ & \text{F: } 2 + 4 + 5 + 6 + 7.5 + 7.5 + 9.5 + 9.5 = 51 \end{aligned} $$

(Note: the maximum score is 55 = 1 + 2 + … + 10, reached when all ten differences fall in one class. This equals 10 × 5.5, where 5.5, the midpoint between 5 and 6, is the mean rank.)

$$ \begin{aligned} & \text{In R: } {>}\ \texttt{wilcox.test(c(13, 12, 11, 10, 8, 7, 7, 5, 5, 13),} \\ & \qquad \texttt{c(12, 13, 14, 15, 17, 18, 18, 20, 20, 12), paired = T)} \\ & V = \mathbf{4};\ p = 0.0186\ (p < 0.05) \end{aligned} $$

In this hypothetical example, we can conclude that the difference is significant: F is judged more attractive. The significance is not very strong because we have only 10 participants, each judging only five male and five female images. The obvious next question is whether male and female participants differed in their judgments.
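The signed ranks behind V can be checked by hand in R. This sketch uses the ten pairs of scores from the `wilcox.test` call above; the variable names are assumptions made for illustration (`F_score` rather than `F`, since `F` is R's shorthand for `FALSE`).

```r
# Scores for the ten participants, taken from the wilcox.test call above.
M <- c(13, 12, 11, 10, 8, 7, 7, 5, 5, 13)
F_score <- c(12, 13, 14, 15, 17, 18, 18, 20, 20, 12)

d <- M - F_score                      # signed differences
r <- rank(abs(d))                     # tied values share their mean rank
sum(r[d > 0])                         # 4  (M class)
sum(r[d < 0])                         # 51 (F class)

# suppressWarnings: with ties R warns that no exact p-value is computed
# and falls back to a normal approximation.
res <- suppressWarnings(wilcox.test(M, F_score, paired = TRUE))
res$statistic                         # V = 4
round(res$p.value, 4)                 # 0.0186
```

The three differences of absolute size 1 occupy ranks 1 to 3 and therefore each receive rank 2, which is why both classes contain the value 2.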

In a second test, we examine whether the gender of the subjects influences the results.

We have two groups of participants, male and female.

A very simple test relates the preference (m or f) to the gender of the subject (coded as M or F in the first row of Table 3). We can use an unpaired Wilcoxon rank sum test for this (compare Table 2). For each person, we simply count how many people in the other group lie to the left (here we observed either 5 or 0, showing a complete separation between male and female participants). Looking at Table 3, we see that male participants did not have strong preferences, whereas female participants ranked male and female images with a stronger and consistent difference.

Table 3 The sign indicates the winner. Reordered from highest to smallest difference

Since the differences in ranks cannot be assumed to be normally distributed and we do not have an interval scale, we cannot use a test that assumes a normal distribution. We do, however, have ordinal data, and larger absolute numbers indicate a stronger preference. In this example, males seem to have weak preferences, and females have stronger preferences toward female images. Note that the two males who had the m category as the winner showed very little difference between the sorting order of male and female images. We use an unpaired test because the two groups consist of different individuals and no individual repeated the sorting task. The test is applied to the signed difference between the two W scores (M minus F) of each individual.

Rank sums:

$$ \begin{aligned} & \text{Males: } 5 + 5 + 5 + 5 + 5 = \mathbf{25} \\ & \text{Females: } 0 + 0 + 0 + 0 + 0 = 0 \\ & \text{In R: } {>}\ \texttt{wilcox.test(c(1, 1, -1, -3, -5), c(-9, -11, -11, -15, -15))} \\ & \qquad \text{Wilcoxon rank sum test with continuity correction} \\ & \text{data: } \texttt{c(1, 1, -1, -3, -5) and c(-9, -11, -11, -15, -15)} \\ & W = \mathbf{25},\ \text{p-value} = 0.01141 \end{aligned} $$
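The complete separation between the two groups can be verified with the same counting approach as before. This is a sketch using the signed differences from the call above; the variable names are assumptions.

```r
# Signed differences (M score minus F score) per participant, grouped by
# the participant's gender; values are those used in the call above.
male_subj   <- c(1, 1, -1, -3, -5)
female_subj <- c(-9, -11, -11, -15, -15)

# For each male participant: number of female participants with a smaller value.
male_counts <- sapply(male_subj, function(x) sum(female_subj < x))
# For each female participant: number of male participants with a smaller value.
female_counts <- sapply(female_subj, function(x) sum(male_subj < x))
sum(male_counts)                      # 25
sum(female_counts)                    # 0

# suppressWarnings: ties prevent an exact p-value, so R uses the normal
# approximation with continuity correction.
res <- suppressWarnings(wilcox.test(male_subj, female_subj))
res$statistic                         # W = 25
signif(res$p.value, 4)                # 0.01141
```

W = 25 is the largest value possible with two groups of five, so the p-value here is limited by the group sizes rather than by the size of the effect.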

Analyzing Prototypes

For the prototypes, we investigate whether preferences differ among the three prototypes. We do this with a Friedman test on the raw rank numbers given by each subject. This also introduces a new kind of test, for the case of more than two groups (Table 4).

Table 4 Rankings for the three prototypes, per subject

When the data is entered into R, we typically read the columns from top to bottom, column after column, and instruct R about the number of rows and columns. It is often necessary to check that the format is correct by simply printing the data to the screen.

$$ \begin{aligned} & \text{In R: } {>}\ \texttt{data <- matrix(c(1, 2, 5, 6, 3, 2, 5, 2, 1, 4, 3, 5, 2,} \\ & \qquad \texttt{1, 2, 5, 6, 3, 4, 2, 6, 8, 6, 3, 6, 8, 8, 5, 7, 5), nrow = 10, ncol = 3)} \\ & {>}\ \texttt{friedman.test(data)} \\ & \qquad \text{Friedman rank sum test} \\ & \text{data: } \texttt{data} \\ & \text{Friedman chi-squared} = 12.2,\ \text{df} = 2,\ \text{p-value} = 0.002243 \end{aligned} $$

This means that at least one of JP, MP, and FP differs from the others.

To find out which, we apply pairwise.wilcox.test.
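The Friedman statistic can be recomputed by hand, which shows what the test actually measures. In this sketch the column labels JP, FP, and MP are inferred from the data file listing later in this section, and the formula used is the textbook version without tie correction, which is sufficient here because no subject gave two prototypes the same rank.

```r
# Rankings per subject (rows) for the three prototypes (columns);
# values from the call above, column labels are an inference.
data <- matrix(c(1, 2, 5, 6, 3, 2, 5, 2, 1, 4,
                 3, 5, 2, 1, 2, 5, 6, 3, 4, 2,
                 6, 8, 6, 3, 6, 8, 8, 5, 7, 5),
               nrow = 10, ncol = 3,
               dimnames = list(NULL, c("JP", "FP", "MP")))

# Rank the three prototypes within each subject, then sum per prototype.
row_ranks <- t(apply(data, 1, rank))
R <- colSums(row_ranks)               # JP 15, FP 16, MP 29

n <- nrow(data)                       # 10 subjects
k <- ncol(data)                       # 3 prototypes
chi2 <- 12 / (n * k * (k + 1)) * sum(R^2) - 3 * n * (k + 1)
chi2                                  # 12.2
p <- pchisq(chi2, df = k - 1, lower.tail = FALSE)
signif(p, 4)                          # 0.002243

friedman.test(data)$statistic         # 12.2 as well
```

The within-subject rank sums already suggest the pattern: MP collects almost all of the third places, while JP and FP share the first two.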

We need to create a different, grouped data format. The easiest way is to use Excel to create a data file with a column called Class and a column called Rank and to insert all the values. The file can be saved as a tab-separated text file to make it easy to import into R. The file will look like:

| Class | Rank |
| --- | --- |
| JP | 1 |
| JP | 2 |
| JP | 5 |
| FP | 3 |
| FP | 5 |
| FP | 2 |
| MP | 6 |
| MP | 8 |
| MP | 6 |

The data can now be imported into a data frame:

$$ \text{In R: } {>}\ \texttt{dataW <- read.delim(file.choose())} $$

where we pick the text file in which we stored the data.
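As an alternative to preparing the file in Excel, the same grouped format can be built directly in R from the matrix used for the Friedman test. This is a sketch under the assumption that the matrix columns correspond to JP, FP, and MP in that order, as the file listing suggests; the file name in the commented line is hypothetical.

```r
# Same rankings as in the Friedman example (one column per prototype).
data <- matrix(c(1, 2, 5, 6, 3, 2, 5, 2, 1, 4,
                 3, 5, 2, 1, 2, 5, 6, 3, 4, 2,
                 6, 8, 6, 3, 6, 8, 8, 5, 7, 5),
               nrow = 10, ncol = 3,
               dimnames = list(NULL, c("JP", "FP", "MP")))

# Stack the columns: all JP rows first, then FP, then MP, keeping the
# subjects in the same order within each class.
dataW <- data.frame(Class = rep(colnames(data), each = nrow(data)),
                    Rank  = as.vector(data))
head(dataW, 3)
nrow(dataW)                           # 30

# Optional: write a tab-separated file that read.delim() can import.
# "ranks.txt" is a hypothetical file name.
# write.table(dataW, "ranks.txt", sep = "\t", quote = FALSE, row.names = FALSE)
```

Keeping the subjects in the same order within each class matters for the paired tests that follow, because the i-th value of one class is paired with the i-th value of another.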

Check that it looks fine.

$$ \text{In R: } {>}\ \texttt{summary(dataW)} $$

| Class | Rank |
| --- | --- |
| Length: 30 | Min.: 1.0 |
| Class: character | 1st Qu.: 2.0 |
| Mode: character | Median: 4.5 |
| | Mean: 4.2 |
| | 3rd Qu.: 6.0 |
| | Max.: 8.0 |

The pairwise test can now be applied. We have three data points from each subject, so the comparisons are paired.

$$ \begin{aligned} & \text{In R: } {>}\ \texttt{with(dataW, pairwise.wilcox.test(Rank, Class, paired = T))} \\ & \qquad \text{Pairwise comparisons using Wilcoxon signed rank test with continuity correction} \\ & \text{data: } \texttt{Rank and Class} \end{aligned} $$

The result is a table with p-values.

From the table below, we find that MP differs from FP (p = 0.015) and from JP (p = 0.041), whereas JP and FP do not differ significantly (p = 0.757). JP and FP are thus both judged significantly more attractive than MP.

| | FP | JP |
| --- | --- | --- |
| JP | 0.757 | |
| MP | 0.015 | 0.041 |

P value adjustment method: holm.
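The p-value table can be reproduced without an external file. In this sketch the data frame is built in R from the rankings used in the Friedman example, with the assumption that the values are listed as JP, then FP, then MP, with subjects in the same order in each class.

```r
# Rankings per class, subjects in the same order within each class.
jp <- c(1, 2, 5, 6, 3, 2, 5, 2, 1, 4)
fp <- c(3, 5, 2, 1, 2, 5, 6, 3, 4, 2)
mp <- c(6, 8, 6, 3, 6, 8, 8, 5, 7, 5)
dataW <- data.frame(Class = rep(c("JP", "FP", "MP"), each = 10),
                    Rank  = c(jp, fp, mp))

# suppressWarnings: ties and zero differences prevent exact p-values,
# so R uses the normal approximation with continuity correction.
res <- suppressWarnings(
  with(dataW, pairwise.wilcox.test(Rank, Class, paired = TRUE))
)
round(res$p.value, 3)                 # rows JP, MP; columns FP, JP
res$p.adjust.method                   # "holm"
```

The Holm adjustment multiplies the smallest raw p-value by 3, the next by 2, and leaves the largest unchanged, which is why the adjusted values are larger than those from three separate `wilcox.test` calls.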

Summary

There are two main goals of this lab exercise: to make students comfortable with an experimental design, and to make them comfortable with the statistical evaluation of results. We have shown how a simple sorting task can be evaluated, and how we may assess differences between the genders of the subjects as well as between the different classes of stimuli.

The sorting task allows the subjects to manipulate and compare images until they are satisfied or until a specified time limit, for example 5 minutes.

For statistical evaluation, the use of unpaired and paired Wilcoxon tests has been introduced, and the calculations behind the tests have been exemplified. We have also introduced the Friedman test, and how to perform post hoc tests. These are the main tests that can be performed on smaller experiments using ordinal data, stemming from sorting tasks. There are many variants of the study, and it is possible to perform the tests without any advanced equipment. This may be important for use in the field. The number of images may easily be extended. The time it takes to physically sort images is often shorter than a comparable experiment using a computer.

Further Research Questions:

  • Are there cultural differences in sorting preferences?

  • Will subjects of different ethnicities sort the images differently?

  • Do people react differently to pictures of themselves? In order to investigate this, it might be necessary to create more images that contain “self.”

  • Do people react differently to in-group and out-group?

  • Can the judgment of beauty be explained by anatomical measures, for example face symmetry?

  • Can the underlying gender of the image be detected from anatomical measures, for example face symmetry?

It could be demonstrated that factors such as in-group/out-group and male/female can be detected in the images, for example from face proportions.

It is also possible to use various other controlled images, for example images with rounded forms versus images with jagged forms, as in the classic bouba/kiki images.

It is also easy to let people sort the images along some dimension other than beauty, for example trustworthiness, masculinity, femininity, intelligence, mental states, or emotions. Many of these dimensions cannot be judged from images, but people may still show consistent tendencies in how they sort them.