Further model applications, analytical extensions, and clarifications are detailed in the 12 sections that follow the first section, which summarizes the results and implications of the Chap. 5 analyses.

6.1 Empirical and Conceptual Main Points

It seems reasonable to suspect that anyone who has read this far is likely to be much more interested in sex differences in math than in sex differences in reading. One might therefore start on a lighter note by observing that the above theory might provide the justification for a popular book’s title: The Math Gene [123]. Curiously, the title is misleading. Devlin rejects any notion of a math gene. He writes “Roughly speaking, by ‘the math gene’ I mean ‘an innate facility for mathematical thought…’ [123, p. xvi].”

“There cannot be any single or simple answer to the many complex questions about sex differences in math and science. Readers expecting a single conclusion…are surely disappointed [12, p. 41, italics added].” Ignoring the arrogance of the claim, certainly there are settings in which sex differences concerning math and science are boldly evident and which may be the result of bias or discrimination [124]. Such matters are certainly of concern. However, one of the most important, most widely noted, and most puzzling sex differences has concerned differences in math test score distributions obtained in observational settings. Such distributional differences have been evident for more than a century in the U.S.A. and are evident globally today. These differences have now been coherently explained by a model which, at root, is little more than the model for two independent coin flips. All the analytical model inequalities can be traced to sex differences in probabilistic outcomes of binary events. The same model addresses the widely neglected but far larger sex differences in reading test score distributions as well.

Thus, contrary to what has been claimed (with some exceptions to be addressed momentarily), a simple and coherent answer can be given to empirical test score sex differences puzzles. It is simply a matter of first, recognizing the patterns in data, as Kagan [34] wisely stressed, and second, modeling the processes appropriately. The answer provided by \(\mathcal {Y}\) may not be a popular answer, nor perhaps the answer hoped for or desired. However, \(\mathcal {Y}\) provides a plausible answer. There exists no alternative, formal or otherwise, which provides the conceptual basis for addressing a wide range of empirical facts and in particular provides the theoretical basis for the inequalities of \(\mathcal {S}.\) The level of model-to-empirical correspondence, that is, the correspondence between V  and \(\widehat V\), especially within so simple a framework, seems rare. Consequently, it seems reasonable to suggest that \(\mathcal {Y}\) resembles the process that generated the V  sample data. \(\mathcal {Y}\) has just three critical parameters to estimate: \(q,\mu _2-\mu _1\), and \(\sigma \), the same number as the effect size \(\delta .\) If the law of parsimony (Occam’s razor) applies, then \(\mathcal {Y}\) would appear to set a high bar for alternative explanations to achieve. Any competing framework must coherently account for the inequalities of \(\mathcal {S}\) and hope to do so with three parameters.

There are settings where \(\mathcal {Y}\) does not hold, or at least the estimation algorithm fails, even if the elements of \(\mathcal {S}\) are satisfied. In the case of PISA math testing, mo fails in five countries. These countries are noted in Table 2.5. In some cases, these failures may reflect sampling variation when q for math in a particular country is small. They may reflect girls’ dominant advantage in reading, which facilitates girls’ performance on some PISA math test items. The language of the country in which the test is administered may impact boys and girls differently, thus influencing math test performance in ways not fully understood. And it must be admitted that the model may be flat-out wrong, at least for some countries.

For reading scores, rdo seems to hold nearly invariably, and the PISA test failures of four countries in Table 2.6, noted in Sect. 5.2.3, were of a different kind, namely estimation failures when \(\hat \sigma ^2<0.\) It seems likely the assumption that \(N_b\) and \(N_g\) share the same variance is wrong. There are several cases as well where mdo holds but estimation fails, for example [47], likely for the same reason. Relaxing the equal variance assumption could be addressed within a likelihood model framework, but estimation in this case would require a raw data stream, much more conceptual machinery, and iterative methods.

While there are no competing models to \(\mathcal {Y},\) no claim is made that the model is the “correct” model, as all models are wrong models [125]. However, even wrong models can be useful, and the breadth of explanatory power seems remarkable for so simple a model. The model appears to provide a coherent explanation for the widely noted differences among countries in their PISA test scores, except for those cases just mentioned. Being able to estimate the parameters of \(\mathcal {Y}\) based on elements of V  without requiring a raw data stream, and then subsequently being able to illustrate the model graphically, can certainly be viewed as a model strength. There are numerous other V  which could be analyzed. What has been presented is only a sampling.

It has been noted repeatedly that the estimates of q for math are generally small, and correspondingly so are the X-linked heritabilities for math for both boys and girls. This result reveals that most of the test score variance in data is unexplained, an outcome that was expected. X-linked heritabilities are far larger for reading, but substantial variance remains unaccounted for in reading test scores as well. Thus, much research is required to identify other plausible and likely larger sources of variance.

Untangling other sources of influence is likely to require new frameworks for thinking about sex differences that put empirical regularities at the center of attention as, to note again, Kagan [34] has eloquently argued. This simply has not occurred. Witness the general disregard for differences in test score variances as having any substantive relevance for understanding either math or reading test score sex differences. The viewpoint here is that effect size approaches have hijacked efforts at understanding and impeded if not halted conceptual progress. Meta-analysis approaches have not contributed to conceptual understanding, a fact that has long been recognized [126]. Yet the belief in the value of effect size analysis has grown to the point where it has been claimed to be a viable framework for any sex difference variables of interest.

In Figure 1 [15] are displayed four panels. In each panel are two equal variance but shifted normal distributions; each panel implies a different effect size. The authors write “Figure 1 shows four possible alternatives for the distribution of males’ and females’ scores on a trait, which could be anything from hippocampus size to mathematics performance [15, p. 177, italics added].” Such a statement implies a remarkably naive belief that all between-group mean differences are understandable through a location-shift, effect size framework, as if this perspective is the only way by which Nature could produce sex differences in task performance.

Not only is an effect size approach inappropriate for math (and reading) test scores, but it is inappropriate for the measures of the hippocampus as well. That is because, while it might be surprising to learn, mo holds for hippocampus volumes as well. Ritchie and others in their Table 1 [127] report the means and standard deviations of the left and right hippocampus volumes of 2466 men and 2750 women. Data for both left and right volumes satisfy mo as do all 15 additional brain variables in their table.

Two cognitive test variables reporting sex differences also appear in their Table 1. One test satisfies mo, while the other test reports reaction times. Perhaps the questions should be, first, for what sex difference variables is an effect size model \(\delta \) appropriate and thus the calculation of d sensible? Second, could X-linkage play a role in brain sex differences?

Nature can produce sex differences in test score distributions, or sex differences in other settings, by many different vehicles. The researcher’s task is to attempt to learn which vehicle Nature uses and to try to model that process suitably. This fundamental step in the research process has gone missing.

Under \(\mathcal {Y}\), it is assumed Nature produces distributional differences by a frequency mechanism which is different from a location-shift mechanism. This is illustrated by all the Chap. 5 examples. Recall that under \(\mathcal {Y}\) both boys and girls share identical latent component distributions, \(f_k(y), k=1,2\), so boys and girls within the same component share identical math test score distributions. What produces the mean differences (and variance differences) are the frequencies with which these scores appear. These frequency differences are determined by the component coefficients, functions of q and \(q^2.\) Another way of thinking about differences, and as noted earlier, is that \(\mathcal {Y}\) models the latent within-sex differences, and these within-sex differences produce the between-sex differences observed in data. By understanding within-sex differences, the between-sex observed differences may well take care of themselves [128].
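The frequency mechanism is easily simulated. The sketch below, with purely hypothetical parameter values, draws boys’ and girls’ scores from the same two shared normal components, differing only in the component weights q and \(q^2\); the simulated mean difference then matches the model value \(q(1-q)(\mu _2-\mu _1)\).

```python
import random

def simulate_scores(n, p_high, mu1, mu2, sigma, rng):
    """Draw n scores from the shared two-component normal mixture,
    using the higher component with probability p_high."""
    return [rng.gauss(mu2 if rng.random() < p_high else mu1, sigma)
            for _ in range(n)]

rng = random.Random(1)
q, mu1, mu2, sigma = 0.2, 0.0, 1.0, 1.0       # hypothetical values
boys = simulate_scores(200_000, q, mu1, mu2, sigma, rng)       # weight q
girls = simulate_scores(200_000, q * q, mu1, mu2, sigma, rng)  # weight q^2

diff = sum(boys) / len(boys) - sum(girls) / len(girls)
# model value of the mean difference: q(1-q)(mu2 - mu1) = 0.16 here
print(round(diff, 2))
```

Both samples come from identical component distributions; only the frequencies of component membership differ, yet a mean difference (and a variance difference) emerges.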

“This book is about the reasons males are overrepresented in mathematics and mathematically intensive professions…[5, p. ix].” This was the motivation of Ceci and Williams, 2010. Their motivation was clearly a frequency matter, not a location-shift matter: why are there more men than women in some professions? Yet in “Gauging the size of sex differences,” they use d [5, pp. 20–21], which does not index frequency differences. This does not imply, of course, that a location-shift perspective cannot often lead, plausibly, to observed frequency differences. But given that interest in sex differences often starts with everyday observations of frequency differences of boys and girls or men and women in various settings [23], it does suggest that a broader perspective is needed in how sex differences are conceptualized. Although it may seem heretical to suggest it, abandoning the devotion to effect sizes may be a good place to start. There is much to admire about the openness and scholarly approach of the now rarely referenced, nearly 50-year-old Maccoby and Jacklin book [53].

Recently, Casey and Ganley [20] have expressed interest in considering within-sex differences. Latent processes, often viewed as latent distributions, appear to have attracted little interest among sex differences researchers. The main reason is likely the dominance of effect size procedures, which leave no space for the latent-variable thinking achievable through mixture models [99, 129]. If between-group differences are the result of within-group component distributions with different component weights, as in \(\mathcal {Y}\), then no between-group location-shift model is appropriate. That is because each group’s probability distribution is of a different shape. Under a location-shift perspective, the distributions of different groups remain identical in shape. In real-world realizations, however, there may appear to be no sex differences in graphical portrayals, as Fig. 5.2 of Example 4 reveals.

One conceptual fact needs to be emphasized; it was described in Example 3 and illustrated in Fig. 5.1. Namely, for both sexes, the majority of high math performers come from the lower, not the higher, scoring latent component. This is true, for example, for the U.S. PISA math data, portrayed in Fig. 5.9. Only those boys with scores above 667 more probably come from the higher component. Virtually all higher scoring girls come from the lower scoring component. The reason is that both for boys and girls the proportions of individuals in the higher scoring component are small, so most of the probability mass is in the lower scoring PISA math component. For boys the higher scoring component has about 2.5% of the total probability mass, while for girls it is about 0.1%.

A casual reader might believe that, under \(\mathcal {Y},\) only those boys and girls in the higher scoring components are high math achievers; that view is incorrect. An interesting question to ponder, however, is whether two individuals with the same high math score, but with scores that arise from different latent component distributions, are equally talented.
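The crossover score above which membership in the higher component becomes more probable follows from Bayes’ rule: a score y more probably arose from the higher component when \(q f_2(y) > (1-q) f_1(y).\) The sketch below, with hypothetical parameter values rather than the Fig. 5.9 estimates, locates this crossover numerically.

```python
import math

def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_high(y, q, mu1, mu2, sigma):
    """P(the higher component produced score y), by Bayes' rule."""
    hi = q * normal_pdf(y, mu2, sigma)
    lo = (1.0 - q) * normal_pdf(y, mu1, sigma)
    return hi / (hi + lo)

# hypothetical parameter values, for illustration only
q, mu1, mu2, sigma = 0.025, 480.0, 600.0, 90.0

# bisect for the crossover score where the posterior equals 1/2;
# the posterior is strictly increasing in y when mu2 > mu1
lo_s, hi_s = mu1, mu1 + 10.0 * sigma
for _ in range(100):
    mid = 0.5 * (lo_s + hi_s)
    if posterior_high(mid, q, mu1, mu2, sigma) < 0.5:
        lo_s = mid
    else:
        hi_s = mid
crossover = 0.5 * (lo_s + hi_s)
```

Because q is small, the crossover lies far into the right tail; below it, even a high score more probably came from the lower component.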

\(\mathcal {Y}\) has nearly 40-year-old conceptual roots [104]. There is a historical commentary about this early work and its corresponding reception [130]. The following 12 sections address additional model applications, add some further analytical results, and consider issues which have been alluded to earlier.

6.2 Math Meta-analyses and Variance Ratios

The puzzle of why small \(d>0\) and small average d or \(\bar d>0\) are commonly observed in math meta-analyses was noted at the outset in Chap. 1. No conceptual interpretation of these findings has appeared. However, from the perspective of \(\mathcal {Y},\) a conceptual explanation emerges.

To see how d is viewed under \(\mathcal {Y},\) replace the sample values in the expression for \(d,\) given at the outset in Chap. 1, with their \(\mathcal {Y}\) parameter values, given in Chap. 4. For example, replace \(\bar x_b\) with \(\mu _b=(1-q)\mu _1+q\mu _2\) and \(s_g^2\) with \(\sigma _g^2=q^2(1-q^2)(\mu _2-\mu _1)^2+\sigma ^2\). After some algebra, obtain

$$\displaystyle \begin{aligned} \eta={{q(1-q)}\over{\sqrt{ 1/\xi^2+q(1-q^3)/2}}}\end{aligned}$$

with \(\xi =(\mu _{2}-\mu _{1})/\sigma .\) Thus, d under \(\mathcal {Y}\) estimates \(\eta \), not \(\delta \), and \(\eta \) provides the conceptual model for d under \(\mathcal {Y}\).

Although \(\eta \) has no useful math-substantive interpretation, inspection of \(\eta \) reveals why small \(d>0\) appear. First, \(\eta >0\), so \(d>0\) are expected. Second, \(\eta \) will be small when q is small, and estimates \(\hat q\) suggest q for math is small—at least for U.S. math test results. Hence, \(d>0\) should be small. As \(q\rightarrow 0,\eta \rightarrow 0\) and similarly for d. \(\eta \) increases with increasing component mean difference \(\mu _2-\mu _1.\) \(\eta \) increases with decreasing \(\sigma \) and correspondingly increasing \(\xi .\) \(\eta \in (0,0.556)\) and is maximum when \(\xi =\infty \) and \(q\approx 0.37\).

For illustration, consider Example 4. \(d=0.067;\) replacing the parameters in \(\eta \) with estimates gives \(\hat \eta =0.067\) with \(\hat q=0.015.\) Of course, d will not always be positive, but in expectation under \(\mathcal {Y}\) it should be. Thus, \(\mathcal {Y}\) provides an explanation revealing why small \(d>0\) appear and in particular small \(\bar d>0\) in math meta-analyses.
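These properties of \(\eta \) are easily checked numerically. The sketch below (the \(\xi \) value is hypothetical) evaluates \(\eta \) over a grid of q and recovers the maximum of about 0.556 near \(q\approx 0.37\).

```python
import math

def eta(q, xi):
    """The quantity d estimates under the mixture model."""
    return q * (1.0 - q) / math.sqrt(1.0 / xi**2 + q * (1.0 - q**3) / 2.0)

# eta is small when q is small (xi value hypothetical) ...
print(eta(0.015, 1.5) < 0.05)        # True

# ... and bounded: at xi -> infinity the maximum over q is about 0.556
grid_max = max(eta(k / 1000.0, 1e9) for k in range(1, 1000))
print(round(grid_max, 3))            # 0.556
```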

Variance ratios of \(s^2_b/s^2_g\) greater than one have also been a puzzle. As with the above, replace the sample quantities with the quantities which they estimate under \(\mathcal {Y}\), and then form the corresponding ratio. \(s^2_b/s^2_g\) estimates the ratio \(\sigma _b^2/\sigma ^2_g\) which equals

$$\displaystyle \begin{aligned} {{q(1-q)(\mu_2-\mu_1)^2+\sigma^2}\over{q^2(1-q^2)(\mu_2-\mu_1)^2+\sigma^2}}.\end{aligned}$$

Inspection reveals why the corresponding sample ratios are larger than one: \(q(1-q)>q^2(1-q^2)\) when \(0<q<0.618.\) The estimates \(\hat q\) have been well below \(0.618\) for both reading and math. For Example 4, \(s^2_b/s^2_g= 1.306,\) while the model expression, just above, with estimates replacing parameters gives \(1.341.\)
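The crossing point can be verified directly: the ratio exceeds one exactly when \(q(1-q)>q^2(1-q^2)\), that is, when \(q<(\sqrt 5-1)/2\approx 0.618.\) A brief numerical check with illustrative parameter values:

```python
def variance_ratio(q, delta, sigma):
    """Model value of sigma_b^2 / sigma_g^2; delta = mu2 - mu1."""
    num = q * (1.0 - q) * delta ** 2 + sigma ** 2
    den = q ** 2 * (1.0 - q ** 2) * delta ** 2 + sigma ** 2
    return num / den

golden = (5 ** 0.5 - 1.0) / 2.0      # 0.618..., the crossing point
print(variance_ratio(0.3, 1.0, 1.0) > 1.0)           # True: q below 0.618
print(variance_ratio(0.8, 1.0, 1.0) < 1.0)           # True: q above 0.618
print(round(variance_ratio(golden, 1.0, 1.0), 6))    # 1.0 at the crossing
```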

6.3 Arguments Against Genetic Influences

It is difficult to find explicit statements as to why sex differences, especially in math, are not to be found in matters genetic. But consider these quotes:

For some countries, girls do as well or better than boys at the left tail, but worse at the right tail (United States, Hungary). In other countries, sex differences are most pronounced in the middle of the distribution (Russia, Austria). It is hard to come up with a compelling genetic explanation for such diversity! [5, p. 164].

If the genetic contribution were strong, however, then males should predominate at the upper tail of performance in all countries and at all times, and the male-female ratio should be of comparable size across different samples [131, p. 956].

…there is no reason to believe that the genetic factors involved in determining gender will vary across countries, implying that to the degree that gender differences in mathematics result from genetic factors, there should be no international variation in these differences [132, p. S140].

…the gender gap in math differs substantially across countries. Hence, ‘nature’ cannot be the only account for the females’ disadvantage in math; there must be alternative explanations…[61, p. 2].

A puzzle is that these quotes all seem to imply a belief, which appears widely held, that genetic influences, if they occurred, should resemble a constant effect, invariant over countries, as if genes, their relative frequencies, and their estimates did not vary. One need only consider human height, a genetically influenced phenotypic sex difference variable, which varies in countries around the world, both in the heights of men and women and in the sizes of their mean differences [133].

Gene frequencies are known to vary widely over different populations and countries due to migration, geography, climate, altitude, and mutation, among other factors [134–137]. As the previous chapter’s examples, analyses, and figures reveal, the relative size of \(\hat q\) for the PISA data varies widely over different countries, both for math with \(0.001\leq \hat q_m\leq 0.476\) and for reading with \(0.197\leq \hat q_r\leq 0.614.\)

As observed earlier, part of this variation seems plausibly attributed to geography and country history. Recall that the Czech Republic and Slovakia have among the largest PISA math \(q_m\) estimates, respectively, \(\hat q_m=0.302, 0.351\), from Table 5.3. These countries share a border and were earlier a single country. Norway and Sweden are another pair with a common border; their \(\hat q\) for both reading and math will be considered below within another context. However, note here, \(\hat q_m=0.026, 0.052\) for Norway and Sweden, respectively. Within each pair of these countries, the \(\hat q_m\) seem similar, but between the pairs of countries the \(\hat q_m\) are remarkably different.

It had been thought Europe was largely homogeneous in genetic variation. Recent palaeogenetic evidence clearly falsifies this belief and makes the variation in estimates of PISA q over different countries even more conceptually plausible. There were two genetically distinct groups in Europe around 30,000 years ago: one group living in France and Spain and the other living in what is now the Czech Republic and in Italy [118, 119]. The PISA estimates for France and Spain are \(\hat q_m= 0.050, 0.056\), respectively. The Italian PISA estimate \(\hat q_m=0.144,\) plus two additional Italian estimates \(\hat q_m=0.178(0.023), 0.237(0.028)\), with standard errors in parentheses, from [60], appear to diverge from the Czech Republic PISA estimate \(\hat q_m=0.302.\) But collectively the Czech Republic PISA and Italian \(q_m\) estimates are multiples of the France and Spain PISA \(\hat q_m.\) The \(\hat q_r\) for these four countries seem more similar than do their \(\hat q_m\), as Table 5.3 reveals. For France and Spain, respectively, \(\hat q_r =0.470, 0.464.\) For the Czech Republic and Italy, \(\hat q_r =0.568, 0.417\), respectively.

An additional part of the variation in these parameter estimates, as already noted, is likely because the tests are different translations, which may advantage boys and girls differently, depending on the language and culture. There is of course sampling variation in the estimates, as noted earlier. What is nearly constant over countries are the inequalities in \(\mathcal {S}.\)

6.4 The Search for Biological Genetical Evidence

While \(\mathcal {Y}\) explicitly models X-linked influences, other influences, both genetical and otherwise, if plausibly viewed as additive effects, can be represented through the location parameters \(\mu _1\) and \(\mu _2.\) N can absorb variance changes. Consider the 1990 and 2017 eighth grade math means in Table 2.1. Both sexes increased 20 points over a 27-year interval. The component means, \(\hat \mu _1=262\) and 282 and \(\hat \mu _2=407\) and 439 for 1990 and 2017, respectively, reflect these mean changes. The other parameter estimates remained similar. For example, \(\hat q=0.007\) and \(0.006\) for 1990 and 2017, respectively.

Genome wide association studies (GWAS) have become the most widely used approach for mapping genotype to phenotype associations [138]. These approaches, which may be viewed as a kind of data mining, can have serious difficulties because of linkage disequilibrium [139].

The core notion is that phenotypes are the results of thousands of minuscule genetical elements, SNPs (single-nucleotide polymorphisms), the outcomes of which combine additively, in a Mendelian manner, and are modelled as sums of random variables called polygenic scores. These scores are obtainable for individuals. Using polygenic scores for variables which may be considered proxies for an intelligence score, such as educational attainment, it has been claimed “…will bring the omnipotent variable of intelligence to all areas of the life sciences without the need to assess intelligence [140, p. 157].” For a very different perspective on this possibility, see Charney [141].

If the goal of understanding math and reading test score sex differences in task performance is seen as equivalent to the goal of coherently accounting for the inequalities of \(\mathcal {S}\), which is the view here, then GWAS approaches, as currently constituted, appear unable to address the matter. There are a number of reasons for this conclusion. One is that GWAS requires huge sample sizes, in the thousands or hundreds of thousands. Such a database for math and reading testing simply does not exist. More fundamental, however, is that the target traits for GWAS are known observable phenotypes. In \(\mathcal {Y}\) the two phenotypes are unobserved latent distributions. GWAS approaches cannot address latent processes. To add a latent-process layer to the already complex GWAS framework would invariably increase outcome uncertainty, which would seem to suggest the need for sample sizes perhaps unattainable for any variable of interest.

As stated before, at least for math, X-linked heritability is generally very small. So there is substantial variance unmodelled and unexplained. As just noted, the variable N or \(N_b\) for boys and \(N_g\) for girls reflects these contributions, some of which doubtlessly are polygene effects. There is no inconsistency here. A trait can have a major Mendelian influence as well as polygene influences. An unresolved issue in biology is where, along the continuous spectrum ranging from Mendelian genetics to complex polygene traits, particular phenotypes reside [142]. Whether current technology allows the biological identification of genes with small relative frequencies implied by \(\mathcal {Y}\) seems unclear.

Historically, genetical theory has achieved its wide acceptance through conceptual arguments and how these arguments “fit” with phenotypical data. It is difficult to find fault with \(\mathcal {Y}\) on these grounds.

6.5 q as the Realization of a Random Variable Q

Because each individual V  has been viewed throughout as the unit of analysis, q may be viewed as random over different \(V.\) As Table 5.3 makes clear, \(\hat q\) certainly varies widely, at the country level, for both reading and math. Within countries, there are doubtlessly subpopulations reflecting population flows over the ages, as well as the known stochastic behavior of genes [143]. Assuming q is a fixed unknown constant for any V  is, as features of all models are, an approximation to reality. More realistically, \(\hat q\) is a mean value of different values of \(q.\) This within-V  randomness, while not explicitly modeled, does not appear to jeopardize the core features of the model. The following argument hopefully makes this clear.

Let the random variable Q have realizations q and with continuous or discrete distribution function \(G(q)\). Consider math for boys. Clearly, \(P(B_b=\mu _2|q)=q\). Then,

$$\displaystyle \begin{aligned} P(B_b=\mu_2)=\int P(B_b=\mu_2|q)dG(q)= \int qdG(q)=\mathrm{E}(Q).\end{aligned}$$

So, estimates of q may be viewed as estimates of the mean of Q and similarly for girls.
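A small simulation illustrates the argument. With Q uniform on (0, 0.2), a hypothetical choice of G, the long-run proportion of draws landing on \(\mu _2\) approaches \(\mathrm {E}(Q)=0.1\):

```python
import random

rng = random.Random(7)
n = 200_000
hits = 0
for _ in range(n):
    q_i = rng.uniform(0.0, 0.2)   # a realization of Q; E(Q) = 0.1
    if rng.random() < q_i:        # Bernoulli(q_i): did this draw yield mu_2?
        hits += 1
estimate = hits / n
print(round(estimate, 2))          # close to E(Q) = 0.1
```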

6.6 Sex Differences in Distributional Tails

As has been noted above, a widely recognized empirical marker for sex differences, particularly in math, is observed differences in the test score distributional tails, with boys having larger math right tails than girls [11, 12, 144]. Often proxies for these differences, such as effect sizes or variance ratios, are the focus. The comparison of reading and math tail areas, both upper and lower, has also been observed: “The sex difference in mathematics was non-existent in the lower end of the performance distribution, but the sex difference in reading at the lower end was at its peak [122, p. 4].” The matter of right tail inequality is particularly concerning to Ceci and Williams [5]. They make reference to the “right tail” more than fifty times in their book.

That \(\mathcal {Y}\) produces graphs which portray these tail area differences both for reading and for math is clear from the many figures displayed above. What is noted here is that with an additional distributional assumption, these differences follow from \(\mathcal {Y}\).

Define

$$\displaystyle \begin{aligned} r(s)= \int_s^\infty f_b(y)dy\bigg / \int_s^\infty f_g(y)dy,-\infty<s<\infty.\end{aligned}$$

If the component distributions of \(f_b(y)\) and \(f_g(y)\) are assumed to be normal, which is the assumption in the graphic displays when components appear, then \(r(s)\) is strictly increasing as s increases (for the argument, see [104]). That is, the ratio of the upper tail areas not only favors boys over girls, but the ratio \(r(s)\) will also increase as s, the smallest test score, increases. American Mathematics Competition data [145] are consistent with the theory. A similar result holds for reading.
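A numerical sketch of \(r(s)\) under normal components, with hypothetical parameter values, illustrates the monotonicity:

```python
import math

def survival(s, mu, sigma):
    """Upper-tail probability P(N(mu, sigma^2) > s)."""
    return 0.5 * math.erfc((s - mu) / (sigma * math.sqrt(2.0)))

def tail_ratio(s, q, mu1, mu2, sigma):
    """r(s): ratio of boys' to girls' upper-tail mass under the model,
    with boys' weight q and girls' weight q^2 on the higher component."""
    boys = (1 - q) * survival(s, mu1, sigma) + q * survival(s, mu2, sigma)
    girls = (1 - q * q) * survival(s, mu1, sigma) + q * q * survival(s, mu2, sigma)
    return boys / girls

# hypothetical parameters; the ratio grows as the cutoff s moves right
ratios = [tail_ratio(s, 0.1, 0.0, 1.0, 1.0) for s in (0.0, 2.0, 4.0)]
print(all(a < b for a, b in zip(ratios, ratios[1:])))   # True
```

As s grows, both tails become dominated by the higher component, and \(r(s)\) approaches the weight ratio \(q/q^2=1/q\).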

6.7 Two Additional Alternatives for d

In addition to the Hellinger distance, two other possible replacements for the effect size d are suggested here. Neither requires raw data. Their downsides are that they require specification of the boys’ and girls’ test score probability distributions and computation with software. In the examples below, the component distributions are normal.

6.7.1 Girls Beat Boys

The spirit of this approach is to ask: by what proportion does one sex beat the other in test scores? Consider the probability \(P(Y_b<Y_g)\), the proportion of girls’ test scores which are greater than the boys’ test scores. It is perhaps intuitively clear that if the random variables \(Y_b\) and \(Y_g\) shared the same continuous test score distribution, then \(P(Y_b<Y_g)=1/2\). The departure of an observed proportion, \(\hat P(Y_b<Y_g)\), from one-half then expresses the separation of boys from girls. One might think pairs of boys’ and girls’ test scores would be required to assess the matter. However, this is not necessary, as there is a general expression:

$$\displaystyle \begin{aligned} P(Y_b<Y_g)=\int F_b(s)f_g(s)ds, -\infty<s<\infty,\end{aligned}$$

where s is a test score. \(F_b(s)\) is the lower tail cumulative probability distribution for boys and \(f_g(s)\) the probability distribution (density) for girls. Both appear in Chap. 4. While the focus is on the distributions under \(\mathcal {Y}\), the integral equation holds for all continuous probability distributions. Estimates \(\widehat F_b(s)\) and \(\hat f_g(s)\) are easily obtained by replacing their parameters with their estimates. Then a line or two of R code, using numerical integration, return \(\hat P(Y_b<Y_g).\)

Using estimates obtained from the V  associated with the U.S. 2003 PISA reading scores in Table 2.6 results in \(\hat P(Y_b<Y_g)=0.590.\) Girls beat boys here.
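A minimal sketch of the computation, in Python rather than R and with hypothetical parameter values, evaluates the integral by the trapezoid rule; as a sanity check, identical distributions return one-half.

```python
import math

def mix_cdf(s, p_high, mu1, mu2, sigma):
    """CDF of a two-component equal-variance normal mixture."""
    def cdf(mu):
        return 0.5 * (1.0 + math.erf((s - mu) / (sigma * math.sqrt(2.0))))
    return (1.0 - p_high) * cdf(mu1) + p_high * cdf(mu2)

def mix_pdf(s, p_high, mu1, mu2, sigma):
    """Density of a two-component equal-variance normal mixture."""
    def pdf(mu):
        z = (s - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return (1.0 - p_high) * pdf(mu1) + p_high * pdf(mu2)

def p_boys_below_girls(q, mu1, mu2, sigma, n=20_000):
    """Trapezoid approximation of the integral of F_b(s) f_g(s) ds,
    with boys' weight q and girls' weight q^2 on the higher component."""
    lo, hi = min(mu1, mu2) - 10 * sigma, max(mu1, mu2) + 10 * sigma
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        s = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * mix_cdf(s, q, mu1, mu2, sigma) * mix_pdf(s, q * q, mu1, mu2, sigma)
    return total * h

# sanity check: q = 0 makes the two distributions coincide, giving 1/2
print(round(p_boys_below_girls(0.0, 0.0, 1.0, 1.0), 3))   # 0.5
```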

6.7.2 The Overlap Coefficient OVL

OVL computes the amount of overlap between two distributions [146]. There are two versions: one for discrete distributions and one for continuous distributions. For discrete distributions,

$$\displaystyle \begin{aligned} OVL=\sum_y \min[f_b(y),f_g(y)],\end{aligned}$$

which just sums the minimum height at each value of \(y.\) In a similar way for continuous data,

$$\displaystyle \begin{aligned} {OVL}=\int \min[f_b(y),f_g(y)]dy.\end{aligned}$$

It is clear from these two definitions that if the distributions of boys and girls coincide, then \(OVL=1\), and if they share no support, that is, their distributions are disjoint, then \(OVL=0.\)
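As a numerical sketch, consider the simplified case of two equal-variance normal densities rather than the mixture densities of \(\mathcal {Y}\); for a mean shift of \(\delta \) standard deviations, the closed form \(OVL=2\Phi (-\delta /2)\) checks the trapezoid computation.

```python
import math

def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def ovl_normal(mu_b, mu_g, sigma, n=20_000):
    """Continuous OVL of two equal-variance normals, by the trapezoid rule."""
    lo = min(mu_b, mu_g) - 10 * sigma
    hi = max(mu_b, mu_g) + 10 * sigma
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        y = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * min(normal_pdf(y, mu_b, sigma), normal_pdf(y, mu_g, sigma))
    return total * h

print(round(ovl_normal(0.0, 0.0, 1.0), 3))   # identical distributions: 1.0
print(round(ovl_normal(0.0, 1.0, 1.0), 3))   # one-SD shift: 0.617 = 2*Phi(-0.5)
```

Replacing the two plain normal densities with the estimated mixture densities \(\hat f_b(y)\) and \(\hat f_g(y)\) yields the \(\widehat {OVL}\) values reported below.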

OVL provides a useful way to portray how boys and girls are similar to or different from each other with regard to their relative reading and math distributional overlap. Figures 6.1 and 6.2 display NAEP twelfth grade math and reading solutions. Focus on \(\hat f_g(y)\), the bold solid line, and \(\hat f_b(y)\), the bold dashed line. The white areas represent the overlap, while the shaded areas represent their corresponding distributional departures. The figures make clear there is much greater similarity in U.S. NAEP math scores at Grade 12 than in the NAEP reading scores at Grade 12. For math, \(\widehat {OVL}=0.971\), while for reading \(\widehat {OVL}=0.894.\)

Fig. 6.1

NAEP 2019 Grade 12 math solution with V  from Table 2.3. The unshaded area is the area of overlap

Fig. 6.2

NAEP 2015 Grade 12 reading solution with V  from Table 2.4. The unshaded area is the area of overlap

6.8 A PISA Reading and Math “Paradox”

This section addresses what is claimed to be a paradox. Addressing it is unrelated to matters pertaining to \(\mathcal {Y}\), and thus it may be skipped if desired. But because the matter can be addressed with the data provided here and a resolution provided, the paradox is considered.

Figure 6.3 plots the difference scores girls’ mean minus the boys’ mean in reading, against the boys’ mean minus the girls’ mean in math, for all 41 countries with PISA data from Tables 2.5 and 2.6. \(r = -0.589\). Marks [147] first reported such a relationship using PISA data from 25 countries. Stoet and Geary [122] provide four plots, their Figure 2, representing PISA data spanning a decade, with axes defined as in Fig. 6.3, which show \(-0.78\leq r\leq -0.60.\) They regarded these negative correlations as very troubling: “…a hitherto unexplained paradoxical finding: The smaller the sex differences in mathematics, the larger the sex differences in reading (i.e., countries with a smaller sex difference in mathematics have a larger sex differences in reading, and countries with larger sex differences in mathematics have a smaller sex difference in reading). This inverse relation between the sex differences in mathematics and reading achievement poses a critical challenge for educators and policy makers who might wish to eliminate such differences …there are currently no countries that have successfully eliminated both the sex difference in mathematics…and the sex difference in reading…[122, p. 2, italics in original].”

Fig. 6.3

Plot of 41 countries PISA reading mean differences, girls minus boys, against PISA math mean differences, boys minus girls. Data from Tables 2.5 and 2.6

In fact, the negative relationship can be the consequence of a simple, obvious observation: math skills are not required for reading test performance, but some reading skills are certainly necessary for some math test items. Girls hugely dominate boys’ performance in reading. It has been noted [148] that girls’ superior reading skill can lead them to opportunities where this skill is particularly advantageous. But it may not have been recognized that girls’ reading skills would likely be advantageous as well for understanding at least some math test items, and thus their math and reading scores would be expected to be correlated. This correlation explains the paradox.

Assume that girls’ reading and math scores are positively correlated. Then, negative correlations, as observed in, e.g., Fig. 6.3, fall out immediately as an expected empirical consequence. This result may seem counterintuitive.

To show this, define random variables \(B_{bm}, B_{gm}, B_{br}\), and \(B_{gr}\), where subscripts denote b for boys, g for girls, m for math, and r for reading. Thus, \(B_{gr}\) is a girl’s reading score random variable. Let \(\rho (\cdot )\) denote the correlation between pairs of random variables. Assume \(\rho ( B_{gm},B_{gr})>0\), which is equivalent to the covariance \(\mathrm {cov}( B_{gm},B_{gr})>0.\) Other pairs of random variables are assumed independent of each other and thus are zero correlated.

Next, define

$$\displaystyle \begin{aligned} M=B_{bm}-B_{gm}\end{aligned}$$

and

$$\displaystyle \begin{aligned} R=B_{gr}-B_{br}.\end{aligned}$$

This pair of expressions is the probability model to evaluate. It is the sign of correlation between M and \(R,\) that is, \(\rho (M,R),\) which is of concern. Because the sign of the covariance dictates the sign of the correlation, it is sufficient to consider the covariances, and without loss of generality, it may be assumed all random variables have zero mean.

$$\displaystyle \begin{aligned} \mathrm{cov}(M,R)=\mathrm{cov}(B_{bm}-B_{gm},B_{gr}-B_{br})=-\mathrm{cov}(B_{gm},B_{gr})<0\Rightarrow\rho(M,R)<0,\end{aligned}$$

because the three cross-covariance terms involving independent pairs vanish, leaving only \(-\mathrm {cov}(B_{gm},B_{gr}).\)

If for boys \(\rho (B_{bm},B_{br})\) were also assumed positive, as seems plausible, the same result would hold and \(\rho (M,R)\) would be driven more negative. Thus, assuming that girls’ math and reading scores are positively correlated is a sufficient condition to produce the negative M and R difference score correlation, and consequently the resulting empirical correlations, the negative rs, are simply realizations of what is expected.
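The argument can be checked numerically. The sketch below is not from the source; it assumes, for illustration only, unit-variance scores and a hypothetical girls’ math–reading correlation \(\rho =0.4\), and recovers the predicted negative correlation between the difference scores.

```python
import numpy as np

# Simulate the model above: all pairs independent except girls' math
# and reading, which are positively correlated (rho is hypothetical).
rng = np.random.default_rng(1)
n = 100_000
rho = 0.4  # hypothetical positive girls' math-reading correlation

b_bm = rng.standard_normal(n)                 # boys' math
b_br = rng.standard_normal(n)                 # boys' reading
z1, z2 = rng.standard_normal((2, n))
b_gm = z1                                     # girls' math
b_gr = rho * z1 + np.sqrt(1 - rho**2) * z2    # girls' reading, corr rho with math

M = b_bm - b_gm   # math difference, boys minus girls
R = b_gr - b_br   # reading difference, girls minus boys

print(np.corrcoef(M, R)[0, 1])  # approximately -rho/2 = -0.2 under these assumptions
```

Note also that rescaling \(M\) by any positive constant, as in the policy thought experiment below, leaves this correlation unchanged.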

Marks interpreted the correlation as having a causal thrust: “Policies that promote girls’ educational performance decrease the gender gap in mathematics but also increase the gap in reading [147, p. 105].” If true, this conclusion is disturbing. Suppose a policy existed that could reduce the math gap by one-half. Then \({1\over 2}M={1\over 2}(B_{bm}-B_{gm})\), which would shrink the expected mean difference in math by one-half. Under the model above, such a change has no influence on R, the reading gap, and in particular \(\rho (\frac {1}{2}M,R)=\rho (M,R)\), leaving the correlation unchanged. Stoet and Geary [122] acknowledge they have no explanation for the paradox and claim further study is required. Given the above analysis, further study may not be needed.

6.9 Can mdo and rdo Be “Chance” Occurrences?

Could satisfying rdo or mdo simply be random chance events? The following focuses on math and mdo; the changes necessary for reading and rdo should be apparent. Two very different answers must be given. As a first answer, assume boys’ and girls’ test scores are all independent and identically distributed, based on random samples, and that their shared distribution is continuous. Then, \(P(\bar X_b>\bar X_g)=P(S^2_b>S^2_g)=1/2.\) Assuming in addition that the test distributions are normal, the sample mean and variance are independent, so \(P(\bar X_b>\bar X_g\cap S^2_b>S^2_g)=1/4.\) The probability to consider if mdo is of focus is

$$\displaystyle \begin{aligned} p_m=P(\bar X_b>\bar X_g\cap S^2_b>S^2_g\cap S^2_b-S^2_g>\bar X_b-\bar X_g).\end{aligned}$$

Under normality, \(p_m<1/4\), and \(1/4\) is perhaps only a crude upper bound. Otherwise, evaluating \(p_m\) would appear to require approximation by simulation. As an example, if test scores for both boys and girls followed a t distribution on 15 df, then with a sample size of 100 for each sex, \(p_m\approx 0.17,\) a probability that appears roughly independent of sample size. Should \(p_m\approx 1/4\), then a given V  satisfying mdo could easily occur through sampling variation. However, given k independent \(V,\) the probability to consider is \(1/4^k\), which becomes vanishingly small as k grows. Consequently, a sampling variation argument is implausible given the large body of data reviewed above in Chap. 2.
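The simulation just described can be sketched as follows. The settings (t scores on 15 df, 100 per sex) are those mentioned above; the replication count and seed are arbitrary choices, not from the source.

```python
import numpy as np

# Monte Carlo estimate of p_m: both sexes' scores iid t(15),
# samples of size 100 each.
rng = np.random.default_rng(0)
reps, n, df = 20_000, 100, 15

boys = rng.standard_t(df, size=(reps, n))
girls = rng.standard_t(df, size=(reps, n))

mb, mg = boys.mean(axis=1), girls.mean(axis=1)
vb, vg = boys.var(axis=1, ddof=1), girls.var(axis=1, ddof=1)

# mdo-form event: boys' mean and variance both larger, and the
# variance difference exceeding the mean difference.
p_m = np.mean((mb > mg) & (vb > vg) & (vb - vg > mb - mg))
print(p_m)  # roughly 0.17
```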

This first answer just given applies to settings where it can be reasonably assumed that the elements of V  were obtained from a random sample. For none of the large-scale national or international surveys is this assumption appropriate, for reasons noted earlier: the sampling and estimation procedures are far removed from a random sampling model. Thus, a second answer is required, but a definitive answer cannot be given. In the case of NAEP data, the consistency with which rdo and mdo hold over decades, together with the knowledge that very large sample sizes underlie each sample estimated quantity, will have to suffice. There are 66 NAEP V  in Tables 2.1, 2.2, 2.3, and 2.4 for both reading and math. Among them, 58 satisfy either mdo or \(rdo.\) Assuming the V  are independent of each other and that each would satisfy mdo or rdo with probability \(1/4\) by chance, the expected number is about \(17.\) In the end, it is the pervasiveness of the empirical inequalities holding for children of various ages, over intervals spanning decades, in studies of varying size, globally and historically, for both reading and math that appears compelling and important. These empirical facts provide the ultimate justification for implementing \(\mathcal {Y}\).
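Under the chance assumption, the probability of observing 58 or more successes among the 66 V  is a binomial upper tail; a sketch of the arithmetic:

```python
from math import comb

# If each of 66 independent V satisfied mdo or rdo with probability 1/4,
# the chance of 58 or more successes is the binomial upper tail.
n, k, p = 66, 58, 0.25
tail = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

print(n * p)   # expected count, 16.5, i.e., "about 17"
print(tail)    # vanishingly small
```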

6.10 Model \({\mathcal {Y}}\) Dimensions and Fitting

The data of focus are the two independent pairs composing V, \((\bar x_b,s_b)\) and \((\bar x_g, s_g)\), or four data points, while \(\mathcal {Y}\) has four parameters, \(q, \mu _1, \mu _2\), and \(\sigma .\) At first glance, one might think that fitting four parameters to four data points results in a saturated model, consequently of little use. However, an examination of the model and estimation procedures shows that \(\mathcal {Y}\) is a much more tightly constrained model than might first be thought, as has been stated earlier.

True, four parameters have been estimated in all the examples above. But as observed earlier, only three parameters are required to estimate the critical quantities and to uniquely fix the graphical display properties: \(q, \mu _2-\mu _1,\) and \(\sigma .\) Estimating \(\mu _2\) and \(\mu _1\) fixes the abscissa location of the display; estimating only \(\mu _2-\mu _1\) changes nothing substantively in any of the above discussion, and the graphs would then be unique within an abscissa translation. Furthermore, \(\mathcal {Y}\) is a more constrained model than counting its parameters reveals. Considering math, \(\mathcal {Y}\) has three inequality constraints, expressed in inequalities (4.2), (4.3), and (4.4) of Chap. 4, which correspond to the empirical inequalities of mo and \(mdo.\) In addition, an examination of these inequalities and their analytical arguments shows they are independent of \(\sigma ,\) and thus the inequalities are forced by functions of just two parameters, \(\mu _2-\mu _1\) and \(q.\) Said another way, the inequalities are forced by properties of \(B_b\) and \(B_g\), while \(N_g\) and \(N_b\) play no role in generating \(\mathcal {Y}\)’s inequalities.

From a geometrical perspective, this means only a two-dimensional space is required to capture the parameter constraints, with the two parameters q and \(\mu _2-\mu _1\) viewed as variables in this space. The space capturing the model equivalent of mo is the rectangular region shown in Fig. 6.4, open on the right side; the upper bound \(\mu _2-\mu _1\leq 50\) in the graph is arbitrary, as capturing mo requires only \(0<q<0.618\) and \(\mu _1<\mu _2\) or \(0<\mu _2-\mu _1.\)

Fig. 6.4

Parameter space associated with \(\mathcal {Y}\) inequalities. The rectangular space \(0<q<0.618\) and \(\mu _2-\mu _1>0\), otherwise unbounded, defines the space where inequalities (4.2) and (4.3) of Chap. 4 hold. The curved line bounds the space where inequality (4.4) holds. These two parameters, \(\mu _2-\mu _1\) and q, generate the model inequalities

To satisfy the model constraints corresponding to \(mdo,\) further constraints must be imposed: \(\mu _2-\mu _1>1\) and \(q\leq [{\sqrt {5-4/(\mu _2-\mu _1)}-1}]/2<0.618.\) The area below the curved line in Fig. 6.4 shows the corresponding model parameter space. Thus, Fig. 6.4 makes clear that the analytical inequality properties of \(\mathcal {Y}\) are determined by just two parameters. From a very different perspective, these results show just how readily falsifiable \(\mathcal {Y}\) is. While the focus here has been on math, a similar development follows easily for reading.
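The boundary function just given can be examined numerically; a sketch of its limiting behavior:

```python
from math import sqrt

# The mdo bound on q as a function of the mean separation d = mu2 - mu1 > 1.
def q_bound(d):
    return (sqrt(5 - 4 / d) - 1) / 2

print(q_bound(1.0))   # 0: at d = 1, no q > 0 satisfies the constraint
print(q_bound(2.0))   # intermediate value of the curved boundary
print(q_bound(1e9))   # approaches (sqrt(5) - 1)/2 = 0.618..., the mo bound
```

The bound increases strictly in \(d\), rising from 0 at \(d=1\) toward the mo ceiling \(0.618\) as \(d\rightarrow \infty \), which is the concave curve of Fig. 6.4.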

6.11 \({\mathcal {Y}}\) and the Global Gender Gap Index

Yearly, the World Economic Forum releases an index of gender equality for each of about 150 different countries. Called the Global Gender Gap Index, or here GGGI, it is a scalar index bounded between zero and one, with one presumably denoting equality. It is a composite of four domains assessed for equality: economic participation and opportunities, educational attainment, health and survival, and political empowerment. GGGI is billed as a measure of “gender equality,” which “assesses countries on how well they are dividing their resources and opportunities among their male and female populations, regardless of the overall levels of these resources and opportunities [149].” By design, any departure of the GGGI from one indexes only women’s disadvantage; the measure cannot reflect any disadvantage for boys or men. Intuitively, this fact seems to violate the very premise of the symmetrical relation which the idea of “gender equality” would seem to imply. The GGGI index for 2022 for the OECD PISA countries is given in Table 5.4, column four [150]. These values generally change modestly from year to year.

It is interesting to consider the range of GGGI values from the perspective of \(\mathcal {Y}.\) The assumption in doing so is that country-wide indices of sex differences in testing for both math and reading, as indexed by PISA scores, have wider implications. Although assessing children with tests is usually motivated by efforts to measure educational progress or by selection purposes, tests may arguably be taken as a proxy index portending, in ways perhaps difficult to model or quantify, sex differences in behaviors across various domains of activity long after the tests are taken, perhaps indicating individual life trajectories in a country. Furthermore, because girls show some marginal disadvantage in math scores in almost all OECD countries, such influences should modestly suppress GGGI scores; they would likely increase disparities in domains that contribute to a country’s index, thus lowering its GGGI value. However, girls show a huge advantage in all OECD countries on PISA reading tests, and thus reading score indices should be positively associated with GGGI scores. It is difficult to specify any waking activity in modern culture which is independent of, or at least uncorrelated with, the ability to read. It is possible to be more quantitatively precise; in doing so, all data employed in this section are in Tables 5.3 and 5.4.

Consider first math. To remind the reader, letting \(q_m\) denote q for math, the upper component latent math distribution was weighted \(q_m\) for boys and \(q_m^2\) for girls, and the component weights were the only features that marked their distributions as different. Recall \(q_m-q_m^2>0\) for \(0<q_m<1.\) The difference \(q_m-q_m^2\) is smallest for \(q_m\) small, and this difference strictly increases in \(q_m\) for \(0<q_m\leq 1/2.\) This leads to the prediction that as the difference \(q_m-q_m^2\) increases, signaling wider sex disparities in various domains for which math performance has relevance, the GGGI should modestly decrease. There are 30 countries in Tables 5.3 and 5.4 where pairs of these variables are available. Using notation that should be intuitively clear, the corresponding correlation is \(r(\hat q_m-\hat q_m^2, GGGI)= -0.313(0.165)\) with standard error in parenthesis. The corresponding scatter plot appears in Fig. 6.5.

Fig. 6.5

Scatter plot of GGGI against \(\hat q_m-\hat q_m^2\)

Consider reading: the reasoning is identical to that for math, with \(q_r\) denoting the reading q. The girls’ latent higher scoring reading component weight is \(1-q_r^2\), which is larger than the boys’ component weight of \(1-q_r.\) Thus, \((1-q_r^2)-(1-q_r)=q_r-q_r^2\), so precisely the same correlational structure of math and GGGI is of focus for reading except that a positive correlation is expected. Retaining the same 30 countries as before, \(r(\hat q_r-\hat q_r^2,GGGI)=0.441(0.147)\) with standard error in parenthesis. The corresponding scatter plot appears in Fig. 6.6, while \(r(\hat q_m-\hat q_m^2, \hat q_r-\hat q_r^2)=0.112.\)
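The quoted standard errors are not derived in the text; they appear consistent with the large-sample approximation \(se(r)\approx (1-r^2)/\sqrt {n}\), which is an inference here, not the author’s stated formula. A sketch reproducing the parenthetical values:

```python
from math import sqrt

# Large-sample standard error of a correlation coefficient (assumed form).
def se_r(r, n):
    return (1 - r**2) / sqrt(n)

print(round(se_r(-0.313, 30), 3))  # 0.165, math difference vs. GGGI
print(round(se_r(0.441, 30), 3))   # 0.147, reading difference vs. GGGI
```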

Fig. 6.6

Scatter plot of GGGI against \(\hat q_r-\hat q_r^2\). Japan, with coordinates (0.158, 0.650), is not shown. If Japan were excluded, \(r = 0.456\) for the remaining 29 countries

More broadly, these findings seem noteworthy. For one, they increase the plausibility of the wide variability of \(\hat q\) for reading and math among countries, and for another, they allow \(\mathcal {Y}\) to be viewed as having salience within a wider context. The predictions are model-based, and the estimated proportion of the variance in the GGGI accounted for by the two PISA tests is \(0.327\) (please see Appendix A.5 for clarification).

The above exposition views GGGI indices, in part at least, as a consequence of the same factors, under \(\mathcal {Y}\), that produce sex differences in reading and math test scores. Other perspectives view the causal suggestion as going in the opposite direction: countries with high GGGI, or other similar indices of country-wide social equality, are viewed as incubators for, if not causal agents for reducing sex differences in math [19]. Furthermore, at the same time, as the math gap for girls is presumably under reduction because of social forces, the same social forces are increasing the reading gap advantage for girls [151]. There is, apparently, no safe harbor for boys with their substantial mean reading gap.

An article’s headline summary is “Analysis of PISA results suggests that the gender gap in math scores disappears in countries with a more gender-equal culture [151, p. 1164, italics added].” Now replace the first and second italicized words with, respectively, reading and increases. Doing so, one has an alternative description of their findings.

The article’s goal is to convince the reader that “…in countries with a higher GGI index (here GGGI), girls close the gender gap by becoming both better in math and reading, not by closing the math gap alone [151, p. 1165].” And they contend that “In more gender-equal countries, such as Norway and Sweden, the math gender gap disappears [151, p. 1164].” The authors repeatedly state that in more gender-equal countries the math gender-gap “disappears.”

While the implied causal linkage between countries with high GGGI and girls’ math achievement already appears suspect [152], there is another difficulty: the claim that the gap disappears in high GGGI countries is misleading hyperbole. The gender gap does not disappear, nor are the Norway and Sweden math differences especially small. For both countries, \(6<\bar x_b-\bar x_g<7.\) For 12 of the 37 countries with both GGGI and PISA scores, the math differences are less than 7, and for two of these countries the differences are negative, as the data in Table 2.5 reveal. It is notable in this context that the U.S. math gap is 6.25, less than Sweden’s gap of 6.53. This U.S. math gap has not been claimed, in the U.S.A. at least, to have “disappeared.”

Norway and Sweden were featured in the authors’ chart [151, p. 1164] because in 2006, their GGGI values were, along with Finland’s, the highest. Then, Norway’s GGGI was 0.799 and Sweden’s 0.813 [149]. In Table 5.4, the 2022 values show Norway with the third largest GGGI among 37 countries, 0.845, while Sweden’s 0.822 is the fifth largest.

The authors apparently employ the 2003 PISA cycle data, as is done here, although whether the source files are identical cannot be determined. For 37 countries with GGGI values and PISA scores, for math, \(r(\bar x_b-\bar x_g,GGGI)=-0.426\), so smaller math disparities are associated with higher GGGI.

The problem with the authors’ interpretation arises when reading is considered. If countries with equitable resource distribution are credited with reducing math disparities, would it not be plausible to expect reading disparities to be reduced as well? This is most certainly not the case. For PISA reading differences, \(r(\bar x_g-\bar x_b, GGGI)= 0.494\), signaling that as GGGI increases, so does the reading gap. The authors must claim, implausible as it seems, that a country’s gender equality increases reading disparities. This does not imply, of course, that all of the \(\bar x_g-\bar x_b>0\) PISA reading differences are attributable to the presumed influence of a country’s gender equality. For their two featured countries, Norway and Sweden, the reading scores hugely favor girls.

For Norway, the difference in means is 49.2, the second largest among 41 countries in Table 2.6; Sweden’s difference is 36.75, well above the median of 33.34. These differences are many times the size of their math differences. These mean differences favoring girls in reading would seem to well exceed the sizes of socially or environmentally based influences for nearly any variable of interest. So, what is there about a country’s gender equality that works so differently on math, decreasing disparities, than on reading, increasing them? Is it to be claimed that social equality forces lift only girls’ scores? The issue is not addressed.

The expected math mean test score difference between boys and girls under \(\mathcal {Y}\) is

$$\displaystyle \begin{aligned} \mathrm{E}(Y_b-Y_g)=q(1-q)(\mu_2-\mu_1)>0,0<q<1,\mu_1<\mu_2,\end{aligned}$$

showing the math gap approaches zero as \(q\rightarrow 0\) and increases as \(q\rightarrow 1/2;\) for fixed \(\mu _2-\mu _1>0\), the gap is largest when \(q=1/2.\) (Replace the left side with \(\mathrm {E}(Y_g-Y_b)\) for reading.)
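The behavior of this expected difference is easily checked. In the sketch below, the separation \(\Delta =\mu _2-\mu _1=50\) is a hypothetical value for illustration, not an estimate; only the small Norway-sized \(q_m=0.026\) is taken from the text.

```python
import numpy as np

# Expected gap under Y: E(Yb - Yg) = q(1 - q)(mu2 - mu1).
def expected_gap(q, delta):
    return q * (1 - q) * delta

qs = np.linspace(0.01, 0.99, 99)
gaps = expected_gap(qs, 1.0)

print(round(float(qs[np.argmax(gaps)]), 2))  # 0.5: the gap is maximized at q = 1/2
print(expected_gap(0.026, 50.0))             # small gap for a small q, hypothetical delta
```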

The \(\mathcal {Y}\) interpretation for reading and math for Sweden and Norway is quite different of course. First note that for both countries the \(\widehat V\) for both reading and math are nearly identical to their V . \(\hat q_m\) is small for both countries, and that is why the math mean sex differences are small. For Norway, \(\hat q_m=0.026\) and for Sweden \(\hat q_m=0.052\), the seventh and thirteenth smallest values among those in Table 5.3. For reading, for Sweden \(\hat q_r=0.541,\) and for Norway \(\hat q_r=0.506\), the fifth and eighth largest values in Table 5.3. Because the \(\hat q_r\) are large, the reading mean differences are large. The rough similarity of these two countries’ q estimates for both reading and math would seem to be best understood, as suggested before, by their geography, a fact not mentioned in the article of focus. The two countries share a 1630 kilometer border and thus are thought to share similar gene pools [120].

Another application of GGGI has led to a spectacular failure. It was expected there would be a positive correlation between women’s participation in college level math focused curricula among different countries and their GGGI index. Instead, a strikingly negative correlation, \(r = -0.47,\) appeared, a finding called a paradox by Stoet and Geary [153]. This is the second such correlational finding so labelled by them as a paradox. The first “paradox” was discussed above in Sect. 6.8.

The most common explanation for such findings continues to be gender stereotyping. Subsequently, a gender stereotype variable, GMS, yielded \(r(GMS, GGGI)=0.291\) for OECD countries [154, p. 30, Table S4A], the most suitable comparison for the data available here. For a larger sample of countries, \(r(GMS, GGGI)=0.434\) [155, Table 1]. Thus, GMS accounts for about 9% or at most about 19% of the variance of GGGI. Using both math and reading PISA tests, \(\mathcal {Y}\) accounts for nearly one-third of the GGGI variance (32.7%, as given above), well more than does GMS.

While such correlations may be of interest, the spirit of a core global issue seems best captured by observations of Ceci and Williams [5, p. 168] discussed earlier: why is the male to female ratio of computer scientists in The Czech Republic more than six, while in the U.S.A. that ratio is about two? To the extent to which PISA math scores address the matter, and to briefly return to that discussion here, \(\mathcal {Y}\) provides at least part of the answer. Figure 6.7 shows the estimated upper tail distribution functions under \(\mathcal {Y}\) for both the U.S.A. and The Czech Republic assuming component normality. The upper tail sex differences are far more pronounced in The Czech Republic than in the U.S.A., largely because, as noted earlier, \(\hat q_m=0.306\) and \(0.025\) for The Czech Republic and the U.S.A., respectively. For both countries, the \(\widehat V\) match their V  reported in Table 2.5.

Fig. 6.7

Estimated distribution function upper tails for PISA math for boys and girls, The Czech Republic and the United States
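How larger \(q\) inflates the boy-girl upper tail gap can be sketched with the two-component mixture of \(\mathcal {Y}\), in which the upper component gets weight \(q\) for boys and \(q^2\) for girls. The \(\mu _1, \mu _2, \sigma \), and cut score below are hypothetical illustration values, not the estimates behind Fig. 6.7; only the two \(\hat q_m\) values are taken from the text.

```python
from math import erf, sqrt

# Normal survival function via the error function.
def norm_sf(x, mu, sigma):
    return 0.5 * (1 - erf((x - mu) / (sigma * sqrt(2))))

# P(score > x) under a two-component mixture with upper-component weight q.
def upper_tail(x, q, mu1, mu2, sigma):
    return q * norm_sf(x, mu2, sigma) + (1 - q) * norm_sf(x, mu1, sigma)

mu1, mu2, sigma = 480.0, 530.0, 90.0   # hypothetical component parameters
x = 700.0                              # hypothetical high cut score

for country, q in [("Czech Republic", 0.306), ("U.S.A.", 0.025)]:
    boys = upper_tail(x, q, mu1, mu2, sigma)        # boys' weight q
    girls = upper_tail(x, q**2, mu1, mu2, sigma)    # girls' weight q^2
    print(country, boys - girls)  # tail gap is proportional to q(1 - q)
```

Because the gap equals \(q(1-q)\) times the difference of the component tails, the Czech factor \(0.306\times 0.694\) dwarfs the U.S. factor \(0.025\times 0.975\), consistent with Fig. 6.7.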

The GGGI indices employed in the above analyses are from 2022, while the PISA data are from the 2003 assessment cycle; nearly two decades separate the assessment times of the two variables of focus. That the associations nonetheless hold across this interval suggests the durability of the findings.

Finally, please keep in mind that none of the references in this section which attempt to address the mean gender gap in math consider variance differences, the largest of the sex differences in both math and reading testing. To repeat yet again, any coherent explanation of sex differences must jointly address the variance differences as well as the mean differences. Only \(\mathcal {Y}\) has, so far, achieved that status.

6.12 The Misleading Language and Images of Sex Differences

Both the language psychologists have used to refer to sex differences, and the graphical images drawn to portray such differences have been misleading.

Consider language first. Words can shape beliefs and perceptions. Perhaps no two words have been used more often to characterize sex differences, certainly with respect to math and secondarily with respect to reading (as well as many other domains of focus), than “gender gap.” Search the words “gender gap” in any browser and millions of hits are revealed. Gender gap appears in the titles of several books; it appears in three of fifteen chapter titles in a single book addressing math sex differences [79] and 128 times in a single book [5]. Gender gap is rarely explicitly defined, but it seems clear that it is taken to mean sample mean test score differences, at least where math and reading are of focus. Popular as the two words are, gender gap would seem to fail to satisfy a “real” definition, meaning one conveying the “essential nature” or “essential attributes” of some entity [75, p. 93]. Certainly, in math testing, \(\bar x_b-\bar x_g>0\) typically, so considering gender gap as equivalent to boys’ and girls’ mean math test score difference is not wrong.

However, once it is recognized how widely the inequality mdo holds and when it is realized how much larger \(s_b^2-s_g^2\) is than \(\bar x_b-\bar x_g\) for math, any “essential nature” scalar characterizing sex differences in math and reading should be the variance difference. The median of the 41 ratios \((s_b^2-s_g^2)/|\bar x_b-\bar x_g|\) for PISA math scores of Table 2.5 is 111. For PISA reading scores, Table 2.6, the median of the ratio is 46. Is it unreasonable to suggest that for many years the focus has been on the mean gender gap when it should be on the variance gender gap?

The figures or graphs of distributions intended to portray sex differences in math are also misleading. Invariably portrayed are equal variance but shifted normal distributions. This visual image is certainly empirically wrong, and it is conceptually misleading as well.

6.13 Coda

In the executive summary of Why So Few? a book that concerns why there are few women in math and related fields, the authors write “While biological gender differences, yet to be well understood, may play a role, they clearly are not the whole story [156, p. xiv].” Nothing written in these chapters contradicts the spirit of this statement. What has been shown, however, is that virtually all of the reading and math test-based sex differences displayed by children in observational settings, especially those of \(\mathcal {S}\), can be explained by a simple model.

View the foregoing effort as an attempt to advance understanding of the biological basis of the differences expressed in the above quote. In the process, the effort has hopefully illuminated the far larger mean difference favoring girls in reading that has been mostly ignored. It is literacy, not math, that is the far larger and more important skill for children to acquire.

It would appear to be time to focus attention on the multitude of sex differences that are most likely under some form of environmental or societal control and were not addressed above.