Abstract
Consider how we evaluate how normal an object is. On the dual-nature hypothesis, a normality evaluation depends on both the object’s goodness (how good do you think it is?) and its frequency (how frequent do you think it is?). On the single-nature hypothesis, the evaluation depends solely on either frequency or goodness. To assess these hypotheses, I ran four experiments. Study 1 shows that normality evaluations vary with both the goodness and the frequency assessment of the object. Study 2 shows that manipulating the goodness and the frequency dimensions changes the normality evaluation. Yet neither experiment rules out that some people evaluate normality solely on frequency while the rest evaluate it solely on goodness. Hence two more experiments. Study 3 reveals that when scenarios are contrasted—presented one after another—only frequency matters. But, as study 4 shows, when scenarios are evaluated alone, both frequency and goodness influence normality evaluations within a single person, although the more sensitive a person is to one dimension, the less sensitive she is to the other. The dual-nature hypothesis thus seems true of uncontrasted applications of the concept of normality, whereas the single-nature hypothesis seems true of contrasted applications.
Notes
I use the term object to refer to the object of a normality judgment. Thus, the term can refer to material objects (“this seems like a perfectly normal orange to me”), behaviors (“what are you doing? This is not normal at all!”), events (“earthquakes are pretty normal in this part of the country”), and so on.
Later, Bear and Knobe indeed studied the applications of the concept of normality (Bear and Knobe 2017). However, the first two experiments in this paper (§2, §3) were run before theirs, and the subsequent two experiments (§4, §5) take a different route than Bear and Knobe’s study. I discuss their results briefly in §2.
Given the recommendations by Green (1991) and to account for some participants missing a question, I intended to have 200 participants. However, only 189 mTurk workers filled out the questionnaire before the time allotted for data collection expired. Of those, 10 participants missed one question and thus were excluded from the analysis.
The experiment was run before same-sex marriage was legalized in the U.S.
This interpretation isn’t strictly correct because the interaction is significant; still, it provides a good grasp of what the bs mean. The interaction term will be discussed shortly.
In linear regression, a β coefficient denotes how many standard deviations a dependent variable will change if the predictor changes by one standard deviation. βs thus behave like bs but are interpreted in terms of standard deviations rather than the units of the IV and DV.
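To illustrate, here is a minimal sketch of the relation between an unstandardized b and a standardized β (in Python with simulated data, not the study’s own R scripts): z-scoring both variables before fitting yields the same coefficient as rescaling b by the two standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)  # true unstandardized slope: 2.0

# Unstandardized slope b from a simple least-squares fit.
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Standardized beta: the same slope after rescaling, beta = b * sd(x) / sd(y).
beta = b * np.std(x, ddof=1) / np.std(y, ddof=1)

# Equivalently, fit the regression on z-scored variables.
zx = (x - x.mean()) / np.std(x, ddof=1)
zy = (y - y.mean()) / np.std(y, ddof=1)
beta_direct = np.cov(zx, zy, ddof=1)[0, 1] / np.var(zx, ddof=1)

assert abs(beta - beta_direct) < 1e-10
```

So β answers “how many standard deviations of y per standard deviation of x,” while b stays in the raw units of the two variables.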
Let me thank one of the reviewers for drawing my attention to this fact.
For the ordinal regression results, see the appendix, §1. Since in all studies I use a Likert scale for the dependent variable, ordinal regression might seem like a better choice than linear regression. However, the interpretation of ordinal models isn’t as straightforward as that of linear models, only loose analogs of R2 and η2 coefficients are available, and in practice psychologists use linear regression for modelling similar problems. Therefore, in this investigation, I limit myself to linear models (standard and mixed-effects). The appendix, as well as the data, protocols, and scripts in R, can be downloaded from the project’s repository on the Open Science Framework: https://osf.io/6kr2m/.
That’s consistent with the results of the fourth and fifth experiments (§4, §5), which indicate that people’s average normality evaluations differ between scenarios.
Presumably, typicality—again, in this research denoting being a good example of a member of a target category—is different from normality. Declaring something a bad example doesn’t carry the normative implications of declaring it abnormal (“What you have done isn’t normal” sounds like an accusation; “what you have done here isn’t what people typically do in similar circumstances” may even sound like praise of your creativity). It’s also felicitous to utter “this one isn’t a good example of skin moles, yet it’s still normal,” and it sounds fine to call Kaczyński a good example of a modern Central European authoritarian figure, but less so to call him a normal authoritarian. For similar reasons, goodness is presumably very different from closeness to the ideal: traits close to the ideal of the category ‘annoying personality features’ won’t score high on goodness.
I would like to thank one of the reviewers for this observation.
Again, let me explain the choice of the sample size. There were 18 conditions; for each condition, I planned to have at least 14 observations, so the comparisons between the conditions would be meaningful. Moreover, running a mediation analysis (below) required a large sample (Fritz and MacKinnon 2007). I thus intended there to be 300 participants; 284 answered both questions about the scenario.
In all following studies, the interaction term was not significant, and so I report models evaluated for equations without interaction terms. I include the models with interaction terms in the appendix, §2.
That is, the green (red) line shows how normal depends on frequency for trait = 1 (trait = 0).
In Preacher and Hayes’s method the correlations are established using linear regression with bootstrapping. On fig. 2, bt → n is the regression coefficient in the model predicting normal from trait alone, bt → g is the coefficient in the model predicting good from trait alone, and bg → n is the coefficient in the model predicting normal from good alone. All of them are significant (p < 0.001). b’t → n is the coefficient next to trait in the model predicting normal from good and trait, and this coefficient isn’t significant (p = 0.515): once good is included, the relationship between trait and normal disappears.
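The bootstrapped-regression logic can be sketched as follows (a plain-Python illustration with simulated data; the variable names trait, good, and normal follow the text, but the data and effect sizes are invented, and this is not the study’s own analysis code). The indirect effect is the product of the trait → good slope and the good → normal slope controlling for trait, with a percentile confidence interval from resampling:

```python
import numpy as np

def ols_coefs(X, y):
    """Least-squares coefficients of y ~ X, with an intercept prepended."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def indirect_effect(trait, good, normal):
    # a-path: trait -> good; b-path: good -> normal, controlling for trait.
    a = ols_coefs(trait[:, None], good)[1]
    b = ols_coefs(np.column_stack([trait, good]), normal)[2]
    return a * b

def bootstrap_ci(trait, good, normal, n_boot=1000, seed=0):
    """Percentile bootstrap confidence interval for the indirect effect."""
    rng = np.random.default_rng(seed)
    n = len(trait)
    boots = [indirect_effect(trait[idx], good[idx], normal[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return np.percentile(boots, [2.5, 97.5])

# Simulated data in which trait affects normality only through goodness.
rng = np.random.default_rng(42)
n = 300
trait = rng.integers(0, 2, n).astype(float)
good = 2.0 * trait + rng.normal(size=n)
normal = 1.5 * good + rng.normal(size=n)

point = indirect_effect(trait, good, normal)
lo, hi = bootstrap_ci(trait, good, normal)
```

In this simulation the interval excludes zero, which is the mediation-analysis signature that the trait → normal relationship runs through goodness.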
These correspond to Hitchcock and Knobe’s notions of statistical and prescriptive norms, respectively.
Given recommendations in Westfall and Kenny (2014) and expecting at least a medium-sized effect, I set the sample size at 100 participants; 97 mTurk workers filled out the questionnaire before the task expired.
Therefore, goodness refers to the participants’ own evaluation, on a seven-point scale, of how good the described behavior or situation is.
The model with random slopes for participants and random intercepts for scenarios was the simplest model whose fit wasn’t worse than any other model’s—i.e., no other model was deemed better with the χ2 test. Additionally, the chosen model has the smallest BIC of all models considered. For the details of model selection, see the appendix, §3.1.
There’s no straightforward way to obtain η2s for a mixed-effects model, but there are two indirect ways. First, you can use the corresponding ANOVA model with random-effects variables as blocks. And indeed, the values in the table come from such a model with participants and scenarios as blocking variables. (Moreover, the ANOVA model corroborates the mixed-effects results: goodness and frequency, but not their interaction, are significant at the p < 0.001 level; I don’t report these results in detail, as doing so wouldn’t add to the analysis.) Another way is to compare the R2s of mixed-effects models with either predictor removed. So, marginal R2 = 34.3% for the full model with scenario and participant as random effects; for goodness as the sole predictor, marginal R2 = 0.7%, and for frequency as the sole predictor, marginal R2 = 33.7%. These values match almost perfectly the η2s obtained with ANOVA.
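The fixed-effects analog of this comparison can be sketched as follows (plain OLS on simulated data; the R2s here are ordinary rather than marginal, and the effect sizes are invented for illustration, with frequency dominating goodness as in the study):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - resid.var() / y.var()

# Simulated ratings: frequency explains most of the variance in normality,
# goodness contributes little, mimicking the reported pattern.
rng = np.random.default_rng(7)
n = 400
goodness = rng.normal(size=n)
frequency = rng.normal(size=n)
normality = 0.15 * goodness + 1.0 * frequency + rng.normal(size=n)

r2_full = r_squared(np.column_stack([goodness, frequency]), normality)
r2_goodness = r_squared(goodness[:, None], normality)
r2_frequency = r_squared(frequency[:, None], normality)
```

With (near-)orthogonal predictors, the gap between the full model’s R2 and a single-predictor model’s R2 approximates the variance uniquely attributable to the omitted predictor, which is the logic behind comparing the marginal R2s above.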
For one participant, the first response would be one to the good-rare version of the second scenario; for another, it would be one to the good-frequent version of the fifth scenario, and so on.
Models with scenario-specific random slopes for either goodness or frequency had no better fit than the simpler model from table 5, and that’s why I chose the latter. See the appendix, §3, for the details of model selection.
Again, a corresponding ANOVA model with scenario as a blocking variable delivered η2s. (It also agreed with the mixed-effects model on the predictors’ significance: p < 0.001 for both goodness and frequency, with a nonsignificant interaction.) Similar (although not identical) results follow from analyzing the proportions of the total variance explained. For the full mixed-effects model, marginal R2 = 26%; for a model with goodness as the sole predictor, marginal R2 = 10%, and for a model with frequency as the sole predictor, marginal R2 = 21%. Notice that the additional amount of variance explained by a variable added to the model depends on the order of adding variables, which is the reason why I prefer relying on the less ambiguous η2s from ANOVA.
For instance, point (0.1, 1) means that, according to the equation for that individual, a one-point increase in goodness evaluation increases the expected normality evaluation by 0.1, and switching from a rare to a frequent behavior increases the evaluation by 1 point.
The normative dimension was additionally highlighted, as immediately after the first normality evaluation, the participants answered how good or bad they found the behavior/situation depicted.
The mixed-effects model can’t be used to evaluate that interaction, but an ANOVA can. A three-way ANOVA modelling the influence of goodness, frequency, and the scenario (a categorical variable with five levels) on the normality evaluation yields results consistent with the mixed-effects models. All three main effects, but no interactions, were significant. For details, see the appendix, §4.
The median times between filling out the vignettes were: 6 days between the first and second one, 4 days between the second and third, and 3 days between the third and fourth.
Justifying the sample size is tricky in this case because the statistics used below are descriptive, not inferential. I therefore simply collected responses from as many students as possible.
Again, η2s come from ANOVA with participants as a blocking variable, and these values are closely matched by analyzing marginal R2: 12% for a mixed-effects model with goodness as the sole predictor and 26% for a model with frequency as the sole predictor.
That is, I estimated the model in table 6 (right) using only the answers to the four versions of the squirrel scenario, even though the participants evaluated four other scenarios too.
63% of uncontrasted goodness slopes are larger than the largest contrasted goodness slope.
The versions appeared in the same order for all participants. Since the first two scenarios differ with respect to goodness but not frequency, that might have drawn the participants’ attention to that dimension. That is, the distribution in these answers (fig. 5, red) might be due to the salience of the normative dimension rather than to considering the cases in isolation. Although only a follow-up study could settle this possibility, I don’t find it worryingly likely, as the median time between filling out the first two vignettes was 6 days, more than enough for the participants to forget the details of the scenario.
Allow me to mention my reasons—I don’t expect you to share them. First, the data collected doesn’t yet allow for inferring underlying mechanisms (I find the story presented plausible but still merely hypothetical). Second, if the correct meaning of a term has normative implications about how one ought to use the term, establishing the correct meaning of normality would require analyzing the ramifications of using the concept (Burgess and Plunkett 2013). Given my casual observation of how this concept functions in moral reasoning (see §6.3), my tentative verdict is that the term should be banished altogether, except for its technical uses (as in, e.g., statistics).
But notice that “Hibbles eat one kind of berries” is a generic statement, whose meaning is much more complex than that of statistical statements (Leslie 2012). If generics themselves encode normative information, children in the experiment aren’t making is-to-ought inferences.
And indeed, the authors devised the fake-galaxy experiment to test for the existence bias, although in the experiment they manipulated the frequency of the galaxy.
If this speculation is indeed correct, the two hypotheses should be put in terms of being rather than frequency.
However, Tworek and Cimpian’s model (estimated on adults) explains R2 = 11% of the variance in their participants’ is-ought inferences, which suggests there’s more to these inferences than people’s focus on objects’ intrinsic features.
References
Bailenson, J., M. Shum, S. Atran, D. Medin, and J. Coley. 2002. A bird’s eye view: Biological categorization and reasoning within and across cultures. Cognition 84 (1): 1–53.
Barsalou, L. 1985. Ideals, central tendency, and frequency of instantiation as determinants of graded structure in categories. Journal of Experimental Psychology: Learning, Memory, and Cognition 11: 629–654.
Bear, A., and J. Knobe. 2017. Normality: Part descriptive, part prescriptive. Cognition 167: 25–37.
Burgess, A., and D. Plunkett. 2013. Conceptual ethics I. Philosophy Compass 8 (12): 1091–1101.
Burnett, R., D. Medin, N. Ross, and S. Blok. 2005. Ideal is typical. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale 59 (1): 3–10.
Egré, P., and F. Cova. 2015. Moral asymmetries and the semantics of many. Semantics and Pragmatics 8 (13): 1–45.
Eidelman, S., and C. Crandall. 2014. The intuitive traditionalist: How biases for existence and longevity promote the status quo. Advances in Experimental Social Psychology 50: 53–104.
Eidelman, S., C. Crandall, and J. Pattershall. 2009. The existence bias. Journal of Personality and Social Psychology 97 (5): 765–775.
Eidelman, S., J. Pattershall, and C. Crandall. 2010. Longer is better. Journal of Experimental Social Psychology 46 (6): 993–998.
Fritz, M., and D. MacKinnon. 2007. Required sample size to detect the mediated effect. Psychological Science 18 (3): 233–239.
Goldstein, N., R. Cialdini, and V. Griskevicius. 2008. A room with a viewpoint: Using social norms to motivate environmental conservation in hotels. Journal of Consumer Research 35: 472–482.
Green, S. 1991. How many subjects does it take to do a regression analysis? Multivariate Behavioral Research 26 (3): 499–510.
Grice, P. 1961. The causal theory of perception. Proceedings of the Aristotelian Society, Supplementary Volume 35: 121–168.
Hitchcock, C., and J. Knobe. 2009. Cause and norm. Journal of Philosophy 106: 587–612.
Hlavka, H. 2014. Normalizing sexual violence: Young women account for harassment and abuse. Gender and Society.
Kalish, C. 2015. Normative concepts. In The conceptual mind: New directions in the study of concepts, ed. E. Margolis and S. Laurence, 519–539. Cambridge, MA: MIT Press.
Kushner, H. 2017. On the other hand: Left hand, right brain, mental disorder, and history. Baltimore: Johns Hopkins University Press.
Leslie, S.-J. 2012. Generics. In The Routledge companion to philosophy of language, ed. G. Russell and D. Fara, 355–367. New York: Routledge.
Madon, S. 1997. What do people believe about gay males? A study of stereotype content and strength. Sex Roles 37 (9): 663–685.
Men’s Health. 2014a. Am I normal? p. 21.
Men’s Health. 2014b. Am I normal? p. 30.
Men’s Health. 2014c. Am I normal? p. 26.
Nolan, J., P. Schultz, R. Cialdini, N. Goldstein, and V. Griskevicius. 2008. Normative social influence is underdetected. Personality and Social Psychology Bulletin 34: 913–923.
Preacher, K., and A. Hayes. 2004. SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers 36: 717–731.
Roberts, S., S. Gelman, and A. Ho. 2016. So it is, so it shall be: Group regularities license children’s prescriptive judgments. Cognitive Science 41: 576–600.
Roberts, S., A. Ho, and S. Gelman. 2017. Group presence, category labels, and generic statements influence children to treat descriptive group regularities as prescriptive. Journal of Experimental Child Psychology 158: 19–31.
Tworek, C., and A. Cimpian. 2016. Why do people tend to infer “ought” from “is”? The role of biases in explanation. Psychological Science 27 (8): 1109–1122.
Voorspoels, W., W. Vanpaemel, and G. Storms. 2011. A formal ideal-based account of typicality. Psychonomic Bulletin & Review 18 (5): 1006–1014.
Westfall, J., and D. Kenny. 2014. Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General 143 (5): 2020–2045.
Acknowledgements
I would especially like to thank Joshua Knobe and the two reviewers—Steven Verheyen and the other, anonymous reviewer—for their invaluable advice and rich feedback. (Only a person who has worked with Josh knows how encouraging and helpful he is; I wouldn’t have completed this study if it wasn’t for him. And Steven Verheyen’s suggestions, especially the ones about the methodology and related psychological literature, were amazing; I wish everyone a reviewer like him.) For their help, I also want to thank John Doris, Dominik Dziedzic, Phoebe Friesen, Thomas Icard, Katarzyna Szubert, Adam Shmidt, Pascale Willemsen, and Jennifer Whyte. This research was supported by the National Science Centre (Poland), grant no. DEC-2013/09/N/HS4/03693.
Cite this article
Wysocki, T. Normality: a Two-Faced Concept. Rev.Phil.Psych. 11, 689–716 (2020). https://doi.org/10.1007/s13164-020-00463-z