1 Introduction

Statistical inference is of great importance in scientific research and in other areas of human activity such as public policy, clinical medicine, and technology (Peck et al., 2006). Because of this universal utility, statistical inference is a central topic of university statistics courses. However, researchers in statistics education and many teachers agree that students learn stereotyped calculations and procedures to make inferences but are often unable to understand the underlying processes and interpret the results correctly (Chance et al., 2004). One of these important processes is repeated sampling and the abstract mathematical object it gives rise to: the sampling distribution (Cobb, 2007; Lee, 2018). In traditional courses, sampling distributions are presented as pre-prepared tables with instructions for their use, but without revealing their origin or construction. This is understandable, since the theory of sampling distributions rests on sophisticated mathematical tools that are inaccessible to many university students. Noll and Shaughnessy (2012) highlight that, according to the literature, mastering sampling distributions requires coordinating various attributes of distributions in statistical reasoning, a coordination that students find difficult. Cobb (2007) likewise underscores the complexity students encounter in moving from action tasks, such as calculating a single sample’s mean, to the abstract concept of sampling distributions. Lipson (2003) suggests that the concept of sampling distribution develops through several levels of abstraction and proposes three: physical, empirical, and traditional or theoretical. However, she focuses only on the sampling distribution without attending to the accompanying concepts of population, sample, and repeated sampling; these also have their own development (see Watson & Moritz, 2000, for the development of the sample concept).

On the other hand, research in statistics education also anticipates a remarkable shift in the teaching of sampling distributions with the advent of digital technologies (Biehler et al., 2013; Andre & Lavicza, 2019). Technological possibilities have enabled the development of alternative informal approaches through which university and pre-university students can understand sampling distributions (Chance et al., 2004; van Dijke-Droogers et al., 2019; Silvestre et al., 2022). For students at these educational levels, research has examined the development of their reasoning about sampling concepts (Ainley et al., 2015; Saldanha & Thompson, 2002, 2007), but it is still necessary to understand, at a finer-grained level, the conceptions that students form in their effort to assimilate the concept of sampling distribution during a sequence of teaching activities.

The topic of statistical inference chosen for this study is significance testing, as it demonstrates how decisions are made based on a sample with the help of the sampling distribution and provides a context that elucidates the function and importance of sampling distributions. With this in mind, we formulate the following research question:

What emerging theory explains the development of high school students’ conceptions of sampling distribution observed during problem-solving activities of significance tests with technology support?

The goal is to offer a humble theory, based on our observations, of how students transition during the activities from empirical conceptions to ones closer to a mathematical conception of the sampling distribution.

2 Background

The research on sampling distributions boasts a rich and extensive literature, including seminal works dating back over 30 years (e.g., Rubin et al., 1990). A recent systematic review spanning the past decade across two leading journals in statistics education unearthed 19 pertinent papers: 11 from the Journal of Statistics and Data Science Education (JSE) and 8 from the Statistics Education Research Journal (SERJ). Of these articles, 12 explored technology’s role in teaching sampling distributions, whereas 7 delved into hands-on or mathematical methodologies. These 19 studies varied in terms of the educational level they focused on: five catered to Primary (Manor & Ben-Zvi, 2017; Lehrer, 2017; Watson & English, 2016) or Middle School students (Balkaya & Kurt, 2023; Nilsson, 2020), four targeted undergraduate students (Aquilonious & Brenner, 2015; Findley & Lyford, 2019; Hancock & Rummerfield, 2020; Taylor & Doehler, 2015), three were directed at teachers (Dolor & Noll, 2015; Fergusson & Pfannkuch, 2020; Lee et al., 2016), and only two focused on high school students (Batanero et al., 2020; Case & Jacobbe, 2018).

2.1 Key findings in recent research on sampling distributions

From this review of the literature, we draw five key findings. First, there is a growing discourse around the sampling distribution and its simulation in the teaching of statistics. Second, research indicates that even 9-year-old students can successfully extrapolate population predictions from collections of simulated samples (Watson & English, 2016). Third, engaging students in hands-on activities prior to computer simulations enhances their understanding of sampling distributions, a finding that applies even to college-level students (Hancock & Rummerfield, 2020). Fourth, for graduate students and statistics teachers, a digital technology-based teaching approach to sampling distributions should emphasize the modelling process, the role of probability in inference, and the use of probability language (Lee et al., 2016). At the high school level, it is crucial to explicitly model the phases so as to distinguish between hands-on and computer-simulated distributions (Case & Jacobbe, 2018; Manor & Ben-Zvi, 2017), and caution is advised regarding the validity of conclusions drawn from this approach (Hayden, 2019; Watkins et al., 2014). Fifth, research involving high school students using simulation-based inference to explore their understanding of sampling distributions remains scarce. Our study is largely aligned with these findings.

2.2 High school-level studies on sampling distributions and technology

Within the limited research on sampling distributions and technology at the high school level, four studies share commonalities with our investigation. Saldanha and Thompson (2002) examined how high school students in grades 11 and 12 developed their understanding of sampling concepts during a teaching experiment. They identified two distinct sample conceptions: additive, viewing a sample simply as a subset of the population, and multiplicative, perceiving the sample as a proportional small-scale version of the population. Saldanha and Thompson (2007) sought insights into students’ thought processes and conceptual operations when making inferences about a population from a collection of values of a statistic. They found that students could learn to infer about a population by viewing a collection of statistical values as a distribution, a fundamental concept in statistical inference. Van Dijke-Droogers et al. (2019) created, implemented, and evaluated a hypothetical learning trajectory for 9th-grade students. Their study concentrated on informal inferential reasoning about three key statistical concepts: sample, frequency distribution, and simulated sampling distribution. They concluded that students began using simulated sampling distributions as a model to explore and interpret variation and uncertainty. Lastly, Case and Jacobbe (2018) identified two primary challenges in teaching high school students about inference with simulated sampling distributions: difficulty in coordinating different distributions (such as those of the population, a single sample, and statistics from multiple samples) and difficulty in reconciling real-world and hypothetical scenarios. This latter challenge echoes our study’s findings.

3 Conceptual framework

Statistical inference refers to the method of drawing probabilistic conclusions about a population based on a sample from it. Harradine et al. (2011) define it as the process of assessing the extent to which a set of observations is consistent with a hypothesized mechanism. This characterization underscores the importance of assessing a sample to substantiate a hypothesis and aligns with the notion of a significance test. One role of a sampling distribution is to gauge how strongly a sample supports a hypothesis.

3.1 Sampling distribution

Sampling distribution in statistics education can be understood from three perspectives: empirical, virtual, and formal (or mathematical). We limit ourselves to sampling and inference situations in which the characteristic under study is the proportion of an attribute in a population. Under the empirical perspective, from a population in which a proportion of its individuals has an attribute of interest, many samples of a fixed size are randomly obtained. Recording the attribute’s proportion in each sample and representing it through histograms or dot plots forms an empirical sampling distribution (ESD); consequently, an ESD is only descriptive. We use the expression “empirical sample” to refer to a sample obtained from a real population; this avoids confusion with a simulated sample. It should be noted that in field research it is rarely affordable to obtain many empirical samples, since each one usually involves considerable expense and time; under these conditions, students face challenges in creating an ESD with enough samples to reveal its key features. This deficiency frequently results in misunderstandings about the sampling distribution, such as the misconception that it should resemble the population, as highlighted by Chance et al. (2004). For students, it is only viable to build an ESD with populations made up of manipulatives, such as balls in an urn. Several authors referenced in Section 2 have championed the use of manipulative activities before proceeding to computer simulations. Such activities help students build or reinforce their understanding of repeated random sampling. Despite these efforts, enabling students to discern the properties of a sampling distribution and to recognize the analogy between real-life situations and a balls-in-an-urn model remains a persistent educational hurdle.

The virtual approach employs a computer simulation program to create a simulated sampling distribution (SSD) through a two-tiered process. Initially, it leverages a computer’s RANDOM function to select elements randomly from a given set, either numerical or nominal, forming a sample of size \(n\) through \(n\) independent repetitions (with replacement); the statistic (the number of successes or the proportion of an attribute) is then calculated. Subsequently, this first-tier process is independently replicated multiple times, say 500 times, to accumulate an equal count of statistics; these are then structured into a distribution and depicted on a frequency graph, culminating in an SSD. In the program-based activity, students can efficiently engage in multiple processes. They can repeatedly execute the program with identical proportions and sample sizes, observing the variations and constants in each resulting SSD. By altering parameters such as the proportion or the sample size, they can examine the SSDs for each change. Additionally, upon generating an SSD, they can tally the frequency of a specific number of successes or proportion, or of a range of them. All these tasks can be accomplished within a timeframe that is reasonable for class time. Several authors (e.g., Lee et al., 2016; Lipson, 2003) refer to SSDs as ESDs, overlooking the crucial difference that SSDs are constructed mathematically, while ESDs are derived from real-world data; consequently, an SSD is a model of variation while an ESD is a description of results. Acknowledging this difference has been crucial for us to grasp the complexities and importance of high school students’ progression from one conception to another, as well as the methods through which this transition can be achieved.
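
To make the two-tiered process concrete, the following minimal sketch (in Python rather than the Fathom environment actually used in the study) generates an SSD of the sample proportion; the proportion, sample size, and number of repetitions are illustrative values, not those of the study’s problems.

```python
import numpy as np

def simulate_ssd(p, n, repetitions=500, seed=None):
    """Simulated sampling distribution (SSD) of a sample proportion.

    First tier: draw one sample of size n (with replacement) from a
    Bernoulli(p) population and compute its proportion of successes.
    Second tier: repeat that process independently `repetitions` times.
    """
    rng = np.random.default_rng(seed)
    # Each binomial draw counts the successes in one simulated sample of size n.
    successes = rng.binomial(n=n, p=p, size=repetitions)
    return successes / n  # one sample proportion per repetition

# Illustrative run: hypothesized proportion 0.5, samples of size 60, 500 repetitions.
ssd = simulate_ssd(p=0.5, n=60, repetitions=500, seed=1)
print(round(ssd.mean(), 3), round(ssd.std(), 3))  # centre and spread of the SSD
```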

The formal or mathematical approach is based on the theory of random variables and probability distributions. A sample of size \(n\) consists of values of \(n\) independent and identically distributed random variables \({X}_{1}, {X}_{2},\dots , {X}_{n}\) associated with the same random experiment. A statistic is a real function \(Y\) of the random variables: \(Y=u({X}_{1}, {X}_{2},\dots , {X}_{n})\). The sampling distribution of the statistic \(Y\) is the probability distribution of \(Y\). A particular case of the Central Limit Theorem asserts, among other results, that when the sample is large and the statistic is the proportion of an attribute, the sampling distribution is approximately normal with mean equal to the population proportion. This theoretical result becomes visible to students through simulation activities.
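
In this notation, the particular case mentioned can be written as follows (a standard statement of the normal approximation for a sample proportion, added here for reference):

```latex
% Sample proportion as a statistic of n i.i.d. Bernoulli(p) variables
\hat{P} \;=\; \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad X_i \sim \mathrm{Bernoulli}(p)

% For large n, its sampling distribution is approximately normal,
% centred at p with variance p(1-p)/n:
\hat{P} \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\!\left(p,\; \frac{p(1-p)}{n}\right)
```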

3.2 Statistical significance test

A significance test is a method in statistical reasoning used to evaluate a null hypothesis with an empirical sample. The null hypothesis, typically stating no change from the status quo, requires substantial evidence for its rejection. The reasoning scheme is: if an empirical sample is unusual under the null hypothesis, the hypothesis should be rejected. To apply it, a procedure is required for judging when an empirical sample is unusual given a hypothesis. In tests concerning proportions, the null hypothesis specifies a value for the proportion of the attribute in the population. As mentioned earlier, an SSD can be created from a hypothesized value and the sample size. Thus, one can use the SSD to assess whether the empirical sample is unusual under the hypothesis. An SSD can be used to estimate the conditional probability of obtaining samples with a proportion within a specified region, assuming the null hypothesis is true. This probability is estimated by dividing the frequency of SSD sample points in the target region by the total number of SSD sample points. Assessing an empirical sample hinges on two concepts: the p-value and the significance level. The p-value represents the probability of obtaining a sample proportion as extreme as, or more extreme than, the proportion of the empirical sample, given the hypothesis. Essentially, it is the conditional probability of a region in the distribution’s tail. An empirical sample is deemed unusual if the p-value is notably small. The cutoff between usual and unusual is set conventionally by the researcher, based on context. Often, a threshold of 5% is used, meaning that if the p-value is under 5%, the empirical sample is classified as unusual.
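
A minimal sketch of this frequency-based estimate follows (again in Python, as a stand-in for the counts the students performed in Fathom); the hypothesized proportion, the sample size, and the empirical result of 35 successes out of 60 are illustrative assumptions, and only the one-tailed case is shown.

```python
import numpy as np

def one_tailed_p_value(ssd, observed_proportion):
    """Estimate a one-tailed p-value from a simulated sampling distribution.

    The estimate is the relative frequency of simulated sample proportions
    as extreme as, or more extreme than, the observed (empirical) one.
    """
    ssd = np.asarray(ssd)
    return float(np.mean(ssd >= observed_proportion))

# Illustrative use: SSD under H0 (p = 0.5, n = 60, 500 repetitions) and an
# empirical sample in which 35 of 60 respondents showed the attribute.
rng = np.random.default_rng(1)
ssd = rng.binomial(n=60, p=0.5, size=500) / 60
p_value = one_tailed_p_value(ssd, observed_proportion=35 / 60)
print(f"estimated p-value: {p_value:.3f}")
print("reject H0 at the 5% level" if p_value <= 0.05 else "do not reject H0")
```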

Significance tests have been a topic of ongoing debate since their inception (Nickerson, 2000; Morrison & Henkel, 1970), largely due to misunderstandings and misapplications. Misinterpretations of the p-value, for instance, are prevalent not just among students but also among professionals using significance tests (Goodman, 2008; Wasserstein & Lazar, 2016). Therefore, it could be beneficial to introduce high school students to informal approaches to p-value and significance tests. This could help identify the root causes of these widespread difficulties and potentially lay the groundwork for improved comprehension in the future. However, in this report, we concentrate specifically on issues related to sampling distributions rather than delving into the complexities associated with p-value and significance tests.

3.3 Informal statistical inference

Research trends indicate a shift towards exploring informal statistical inference in teaching before studying classical formal approaches (Pratt & Ainley, 2008; Zieffler et al., 2008). This includes the introduction of statistical inference topics at pre-university levels. Makar and Rubin (2009, 2018, p. 273) identify key elements of informal statistical inference, including “claims beyond data, expression of uncertainty, data use as evidence, aggregate consideration, and context knowledge integration.” However, this framework may overlook a critical aspect identified by Ainley et al. (2015, p. 406): in situations of inferential reasoning, populations can be tangible, finite, and countable, or abstract mathematical formulations. This distinction extends to other sampling concepts such as the sample, repeated sampling, and the sampling distribution; each can be conceived, at one extreme, as an empirical entity and, at the other, as a mathematical formulation. In our study, we propose that between a concept’s empirical interpretation and its mathematical definition there exist intermediate, informal versions. For instance, empirically a sample is a subset of individuals from the population, while mathematically it is a point in n-dimensional space, as per Neyman and Pearson (Lehmann, 2011). Reasoning with a computer-simulated sample entails an informal conception of the sample that lies between these two extremes. This suggests that technology can serve as a tool to create scenarios that encourage the development of informal reasoning.

3.4 Technology

The Fathom software enables students to create simulated sampling distributions and examine their implications and applications (Biehler et al., 2013). Students, aided by their teacher, can engage in constructing a model for random sampling and developing a program to generate an SSD. When simulating sample proportions, the (virtual) population is reduced to a parameter value, aligning more with a mathematical formulation than an empirical population. This can be exemplified as a list with two symbol types, like S, S, S, F, F indicating a success rate of 3/5, or by employing a Bernoulli distribution with p = 3/5. The software’s features, like adjusting input parameters, visualizing the sampling distribution graph, analyzing sample points in specific regions, and computing summaries, transform it into a microworld. Sarama and Clements (2002, p. 2) define a microworld as a compact, self-contained computer environment that includes tools, structures, and activities mirroring mathematics or science domains, facilitating learning through exploration, problem formulation, and resolution. Within the microworld of sampling created in Fathom, students interact with virtual manifestations (representations or processes) of abstract concepts like randomness, variability, probability (frequency), and distribution. They engage with these concepts during activities before fully understanding their definitions or meanings.
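
The reduction of the (virtual) population to a parameter value can be illustrated outside Fathom as well. In the sketch below (Python, with illustrative values), sampling with replacement from the symbolic list S, S, S, F, F and drawing from a Bernoulli distribution with p = 3/5 are equivalent models: both produce success counts with the same Binomial(n, 3/5) distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # illustrative sample size

# Population as an explicit list of symbols (success rate 3/5)...
population = np.array(["S", "S", "S", "F", "F"])
sample_from_list = rng.choice(population, size=n, replace=True)
successes_from_list = int(np.sum(sample_from_list == "S"))

# ...or the same population reduced to a single parameter p = 3/5.
sample_from_parameter = rng.binomial(n=1, p=3 / 5, size=n)
successes_from_parameter = int(sample_from_parameter.sum())

# Both counts are draws from the same Binomial(n, 3/5) distribution.
print(successes_from_list, successes_from_parameter)
```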

4 Method

This study is part of the second author’s doctoral research (García, 2017), wherein the data from his thesis were reanalyzed to conceptualize the development of student reasoning in sampling distribution, thus extending and deepening previous findings. The method for data collection is described next. García aimed to develop and validate a hypothetical learning trajectory (HLT) to enable students to (1) understand the logic behind statistical significance tests of proportions (SSTP) through an informal approach; (2) identify the representativeness and variability of samples; (3) create sampling distributions (simulated with Fathom) to tackle SSTP problems; (4) informally estimate the p-value; and (5) use the p-value to solve SSTP problems with a 5% significance level. The HLT lessons were structured around three core elements: (1) a problem-solving approach, complemented with teacher interventions; (2) the use of educational software Fathom; (3) pair collaboration. Four problems were designed, two set in a cola consumption scenario (Levine & Rowling, 1993), and two based on an academic text (Wackerly et al., 2010) where students had to decide whether to reject or maintain a null hypothesis given an empirical sample from the population.

The problem-solving activities, spanning four sessions, were preceded by instructional episodes in the first three. Initially, the instructor demonstrated the Fathom software, teaching the creation of sampling distributions via simulated repeated sampling in diverse contexts. Concepts like population, sample, repeated sampling, and distribution were explained with their software representations. Students learned to identify hypotheses and select simulation inputs based on problem data. They worked in pairs on the first problem, receiving guidance but not direct solutions from the instructor.

In the next two sessions, the instructor reviewed students’ previous solutions and introduced new concepts (see the Appendix): “usualness and unusualness” in the second session, and “p-value and 5% significance level” in the third. Each session featured a new problem for students to tackle. The final session focused on administering and solving the fourth problem, as detailed in Table 1.

Table 1 The classroom activity structure

The overarching hypothesis is that the structure of these activities facilitates learning about the topic. Additionally, three learning hypotheses were formulated: (1) the logic of significance tests has an intuitive basis (Rossman, 2008), providing a strong foundation for student understanding; (2) the significance testing problems of proportions set in social or technical contexts can be comprehended without needing theoretical definitions of parameters, statistics, and population distribution; (3) the use of Fathom software creates a microworld (Biehler et al., 2013) where students engage with virtual representations of key inferential concepts like randomness, variation, distribution, and probability (frequency) without having previously learned them formally. These hypotheses, grounded in literature, were also explored in previous studies conducted by two authors (García, 2012; García & Sánchez, 2014, 2015a, b), contributing to the HLT design.

4.1 Participants

Thirty-six students, aged 16–17, from a Mexican public high school participated in this study, organized into 18 pairs with one computer per pair. The sample comprised students from a course taught by one of the researchers. This course was selected because the students were in the targeted age range, had a competent background in mathematics, and had not enrolled in or completed a formal course in probability and statistics. The requirement that they had not taken the statistics and probability course was meant to avoid the misunderstood knowledge that such a course often leaves behind. The students’ prior statistical knowledge was not diagnosed.

4.2 Data collection tool

For this study, four problems (outlined in Table 2) and a program in the Fathom software to generate SSDs were developed. The problems were designed in contexts relatable to students, yet not experimentally testable, differing from traditional ball-in-urn scenarios. The problem contexts suggest that the populations at play are real individuals or objects, leaving the underlying mathematical populations hidden; this feature is significant in the development of the activity. In problems 1 and 2, the population under consideration is people who like to drink Coke or Pepsi. The population in problem 3 is manufactured products with or without defects. In problem 4, the population consists of patients who take a medicine. The sample size and the proportion within the empirical sample were intentionally varied across problems to prevent the tasks from becoming routine for the students.

Table 2 Problems used in the workshop

The dataset analyzed consists of 72 written reports in Word files, featuring tables and graphs clipped from computer screens and created using Fathom software. Each student pair drafted a report per problem once they had agreed on a satisfactory answer. The report was to follow the guidelines presented in the following paragraph:

1. Is the hypothesis correct? Write a report where you: (a) explain your conclusion; (b) detail step-by-step how you reached your conclusion; (c) answer: How confident are you in your conclusion? You may insert screenshots or clippings from the program to adequately complete your report.
2. Is the sample percentage presented in the problem atypical or not?
3. Under what survey results do you conclude that you cannot reject the hypothesis?
4. Under what survey results do you conclude that you can reject the hypothesis?
5. Is it possible that the decision to reject (or not) the hypothesis could be incorrect even if the procedure was correctly followed?
For questions 2 to 5, you are asked to explain your answer.

The length of student reports varied, with the shortest being one page and the longest five pages, due to the figures included. The reports were compiled into four Word files, one per problem, with each file spanning 26, 43, 32, and 44 pages, respectively. Condensing this information to make it conveyable within the limited space of this article required the application of grounded theory techniques.

4.3 Analysis procedure

The study employs grounded theory as its analytical framework, a method distinct from traditional approaches in that it does not commence with pre-existing theories. Instead, it focuses on identifying and conceptualizing categories directly emerging from the data. This approach aims to understand students’ informal reasoning processes about sampling distribution and significance tests in the context of the activity through categorizing and conceptualizing the data that we have recorded. Grounded theory, as defined by Glaser (1992), is a systematic methodology intertwining data collection and analysis to generate inductive (humble) theories about specific areas of study.

Grounded theory is suitable when little is known about the study area and there is a desire to generate explanatory theories or hypotheses (Birks & Mills, 2015), especially for rich and complex underlying processes. This applies here, as Section 2 revealed scant knowledge of students’ understanding of sampling distributions in computational environments and their use for statistical inferences. Showing the difficulties and misconceptions, or the effectiveness of an educational proposal, is not enough if theories or hypotheses are not generated to explain them.

In this study, initial coding identified key categories relating to sampling distributions and significance tests. Six categories were established, each linked to multiple codes (Table 3).

Table 3 Coding and categories

Most of the codes in the first two columns (categories) were anticipated from the design phase, although during the analysis they were selected from among others. Conversely, the next four categories and their codes emerged from the data analysis without having been previously anticipated. The analysis involved comparing and coding the responses and justifications of all student pairs to identify similarities and differences. This led to the selection of the most frequent student procedures, which are organized in the third column. The analysis also revealed specific traits in the reasoning of some pairs, such as conflicts, the overcoming of challenges, and changes in perspective, which led to the definition of the abstraction levels organized in the fourth column. The codes in the fifth column emerged by considering the centrality of variability and searching for its consideration in the responses. Finally, the theoretical need to better characterize the abstraction process envisioned in the fourth column was considered, and from this effort the theoretical codes of the last column emerged. The characterization and organization of these categories and codes constitute what we refer to as our humble theory. This methodology focused on deriving, through a comprehensive and systematic analysis of empirical data, an understanding of students’ informal reasoning processes about sampling distributions when solving significance tests in a computational setting.

5 Results

This section details the key findings from the data analysis, which identified three phases in the students’ microevolution in understanding the simulated sampling distribution. Initially, students perceive an SSD as a portrayal of empirical samples from a real population. Next, they focus on the variability displayed in the SSD. Finally, they employ the SSD to evaluate the atypicality or unusualness of the empirical sample. Significant influences on this microevolution were students’ intuitions, the problem-solving approach, the computational microworld, the instructor’s timely interventions, and the strength of normative concepts such as unusualness, p-value, and significance level.

5.1 Misconceptions and intuitions

The primary role of a sampling distribution is integral to statistical inference, particularly concerning the link between a sample and its originating population. Misunderstandings and intuitive beliefs about this relationship can significantly influence the formation and evolution of concepts related to sampling distribution. To explore this, we analyzed students’ responses for evidence of their inherent notions about the sample-population relationship. This analysis yielded three key codes: representativeness, approximation, and sample size.

The code “representativeness” pertains to the erroneous belief that every sample precisely reflects the population, assuming the sample’s proportion is identical to that of the population. For instance, student pair S3 observed in their answer to Problem 1:

(S3, P1): …given that 35 out of 60 people preferred Coca Cola in our experiment, equating to 58% (0.58 × 60 = 34.8), we infer that Coca Cola’s advertising claims are accurate.

This pair of students takes the proportion of the attribute in the sample as an accurate indicator of the proportion in the population.

The “approximation” code embodies the idea that a sample’s proportion is close to the population’s proportion. However, this notion often lacks a clear criterion for assessing closeness, leading to subjective interpretations of approximation. Student pair S9 expressed:

(S9, P2): While exactly half of 180 is 90, we consider a range around 90 to represent the 50% preference.

This comment suggests their view that a range near 90 could be seen as indicative of a population with a roughly 50% preference for both beverages.

Lastly, the “sample size” code reflects the belief that a larger sample size yields a more precise estimation of the population’s proportion. This is exemplified by student pair S15, who commented:

(S15, P1): The larger our population [SIC] for the experiment, the more accurate the result will be.

Here, the students mistakenly use “population” in place of “sample,” indicating a reliance on the idea that a bigger sample leads to more accurate approximations.

5.2 Main response patterns

To identify commonalities and variations, we compared and coded the students’ responses against each other. A “response pattern” is defined as a solution procedure that is fundamentally consistent across the responses of two or more student pairs, notwithstanding minor variances. Among all identified patterns, we selected the most prevalent one for each problem: double majority, a scale for variability, pseudo p-value, and p-value. These labels describe the approach students employed to reach their conclusions for each problem. This analysis of procedures offers a basis for hypothesizing about the abstraction process that a group of students might undergo in understanding sampling and sampling distributions.

5.2.1 Double majority criterion

In this pattern, the students’ argument is that if most samples show a predominant preference for Coca-Cola, this validates the hypothesis favoring Coca-Cola. Decoding this line of thinking, it appears students assume that each point in the sampling distribution represents a real and mutually exclusive sample. Team S7’s response typifies the logic underpinning this pattern. They created an SSD with p = 0.5 and n = 60, leading to their conclusion:

(S7, P1): The hypothesis favoring Coca-Cola preference is supported because the simulation indicates that in 285 of 500 surveys, over 50% of the 60 respondents preferred Coca-Cola. Conversely, in 215 surveys, fewer than 50% showed a preference for Coca-Cola. (Fig. 1) (Note: They erroneously counted samples with exactly 50% as supportive of the hypothesis).

Fig. 1 Simulated sampling distribution of the S7 team

While the students’ interpretation in the “double majority” process diverges significantly from the standard interpretation, the activity nonetheless engages them with features of the SSD; for example, they learn to read information from the distribution.

5.2.2 A scale for variability

During the second session, the instructor addressed the entire class to discuss the students’ conclusions from problem 1 and introduced new concepts. He pointed out that students were mistaking the SSD for the population. He clarified that SSD is a hypothetical construct, not a distribution of actual samples. He explained that SSD’s purpose is to evaluate whether an empirical sample falls within the usual or unusual range, assuming the hypothesis is correct. He emphasized the uncertainty of the hypothesis’s truth but noted the feasibility of assessing the rarity of the empirical sample. If the empirical sample is deemed unusual for a given hypothesis (null), this hypothesis should be rejected. Following these explanations, the students were tasked with solving the second problem.

The emerging pattern involved determining an interval around the expected value (center) and considering that sample points within this interval represent proportions of usual samples, supporting the null hypothesis (50% preference for Coca-Cola), whereas sample points in the tails indicate proportions of unusual samples. This led to two decision-making rules. Six student pairs compared the number of sample points in each tail of the distribution, concluding that a greater number in the right tail supported the hypothesis of a stronger preference for Coca-Cola. However, three student pairs based their decision on whether the empirical sample fell into the usual or unusual zone, deducing that falling into the unusual zone supported this hypothesis; this type of response was coded as “Intuitive testing” (see Table 4).

The student pair S2 defined a scale of variability in the following way:

(S2, P2): Three reference ranges were established to validate the hypothesis. The 500 simulated surveys plotted the results, leading to these ranges for Coca-Cola preference: (1) less than 50%: 0–79 people; (2) equal to 50%: 80–100 people; (3) more than 50%: 101–180 people.

Table 4 Frequency of response codes by sessions

While most students continue to rely on empirical conceptions, they notice key characteristics of the SSD, particularly its central point and bell-shaped distribution. Their suggestion to divide the distribution into three distinct zones provides a coarse scale for assessing variability (Fig. 2).

Fig. 2 Simulated sampling distribution created by team S2

5.2.3 Calculation of the p-value

In the third session, the instructor reviewed the students’ solutions to problem 2 and addressed the entire class, noting that the students’ methods for defining the threshold between usual and unusual sample points were largely arbitrary. He observed that students who based their decision on comparing sample points across the distribution’s tails failed to grasp that the distribution was hypothetical, not representative of actual population samples; furthermore, their procedure ignored the empirical sample. However, he acknowledged that two pairs effectively used the distribution zones to assess the empirical sample’s unusualness, which could be valid with a more precise division of the distribution between usual and unusual samples.

The instructor then informally introduced the concept of p-value. He explained that, given a sample from the population and a sampling distribution based on the hypothesis, one could estimate the frequency of sample points with proportions equal to or more extreme than that of the empirical sample. In order not to complicate the concept further, he restricted himself to the single-tailed case. Through examples with varied parameters and samples illustrated in distribution graphs, he elucidated the concept. The p-value, he clarified, depends on the hypothesis and the specific sample. A p-value near 1/2 indicates that the empirical sample is a common sample under the hypothesis, while a value close to zero signifies that it is an unusual sample. He explained that scientists recommend a 0.05 (5%) threshold to deem a sample unusual, although this can vary with experimental conditions. Thus, the instructor concluded, if the p-value is 5% or lower, it implies either the occurrence of a rare event or the invalidity of the null hypothesis. Given the impossibility of conclusively identifying the correct scenario, the recommended course of action is to reject the null hypothesis rather than attributing the results to the occurrence of an unusual event. Following these explanations, problem 3 was presented in the third session, and problem 4 was given in the fourth session without further instructor input.

5.2.4 Pseudo p-value

Two pairs of students in problem 3 and five in problem 4 (Table 4) proposed a procedure that was nearly correct except that they did not properly calculate the p-value frequency. Instead, they considered only the frequency of the statistic value equal to that of the empirical sample. We have termed this conception “pseudo p-value” because it is a misconception of the p-value. For example, the response of team S5 was:

(S5, P3): Because in the different simulations that we did, we programmed in 10% and in the value of 16 pieces we observed that it was within the percentage. And the result was produced 30 times. This is 6%, which means it is a typical value (Fig. 3).

Fig. 3 Simulated sampling distribution of the S5 team

5.2.5 Frequency p-value

In problems 3 and 4, there were 8 and 7 pairs of students, respectively (Table 4), who followed the procedure using the notion of p-value. For example:

(S9, P3): The machine should not be repaired since the result is not atypical [unusual], that is, it is not more than 10% as presented in the problem. Taking the 5% rule, we should get a result of less than or equal to 5% of the pieces, but this time we got 15.4% (77 experiments out of 500), and this reflects that it is a typical result of the samples we took. Therefore, if we had gotten a percentage of less than or equal to 5%, the machine should have been repaired.

(S2, P4): A range of 74 and below was taken to see if the result is atypical or not; this was done by running a simulation of 500 tests of which 42 of these were of 74 people or fewer. After calculating the percentage, we got 8.4%, so using the theory: ‘less than 5% is atypical and more than 5% is normal’, we can conclude that 8.4% is greater than 5% and thus falls within the 80% efficacy of the drug towards the population, since the result is normal.

When the students state that the result is usual (common or typical), they recognize that the sample proportion (the statistic) can deviate from the value of the parameter due to pure chance, and that only in extreme cases should the deviation be considered significant. The use of the SSD to evaluate whether the empirical sample is usual or not amounts to capturing a fundamental property of sampling distributions.
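
The contrast between the pseudo p-value and the frequency p-value can be made explicit with a small sketch (illustrative values only; the study’s students performed the equivalent counts on Fathom graphs): the pseudo p-value counts only the simulated samples whose statistic equals the observed value, whereas the p-value counts all samples at least as extreme.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p0, repetitions = 60, 0.5, 500   # illustrative hypothesis and sample size
observed = 35                       # illustrative empirical count of successes

counts = rng.binomial(n=n, p=p0, size=repetitions)  # SSD as success counts

# Pseudo p-value: relative frequency of exactly the observed count (misconception).
pseudo_p_value = float(np.mean(counts == observed))

# Frequency p-value: counts as extreme as, or more extreme than, the observed one.
p_value = float(np.mean(counts >= observed))

print(f"pseudo p-value: {pseudo_p_value:.3f}   p-value: {p_value:.3f}")
```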

5.3 Other codes

Table 4 includes additional codes: “sample,” where responses rely solely on the empirical sample, not the SSD; “mode,” focusing on the SSD mode for deciding on the hypothesis; “atypicality,” using atypical data in the SSD for decision-making; and “alternative hypothesis,” proposing SSD input proportions differing from the null hypothesis. It also covers “p-value and hypothesis,” involving p-value estimation and comparison with the hypothesis, and “rejection region,” identifying a 5% frequency area as the rejection zone. The latter is correct, but it does not include the p-value estimation. This study, however, does not explore or analyze these codes in detail.

Table 4 shows that each response code highlighted above, except for the pseudo p-value, was the most frequent in some session. Although there is considerable variability in student responses, a progression is configured in which the frequencies of the procedure codes reflect effects of the instruction in each session. However, Table 5 reveals that the way students apply the new notions and approach proposed by the instructor is somewhat unstable and irregular, in the sense that some pairs apply an appropriate procedure in one session and in the next revert to a less appropriate one. In the first row, for example, of the 12 pairs who apply the double majority procedure in the first session, five apply the variability scale procedure in the second, and two of these apply the p-value procedure in the third; however, these two revert in the fourth session to the pseudo p-value and variability scale procedures, which are less appropriate than the p-value procedure they had applied in the previous session. We see some irregularity and instability by pairs, but a general trend in learning; we interpret this as the result of the students’ effort to resolve and reconcile conflicts between their intuitions and previous conceptions and the new notions and approach proposed by the instructor.

Table 5 Response codes and frequency of appearance per session

5.4 Uncertainty

While the instructor’s presentations touched upon aspects of uncertainty, they did not systematically lead to probabilistic statements; rather, an emphasis was placed on statements of frequency. However, through their involvement with the problems and activities in the microworld, students encountered uncertainty firsthand. The students were prompted to reflect on their confidence in their conclusions through a question in their report: “How sure are you of your conclusion?” An analysis of the student pairs’ responses revealed three distinct categories: out of 72 expected responses, uncertainty in the obtained conclusion was recognized in 36, certainty was expressed in 12, and there were no responses in 24. For those acknowledging uncertainty, we identified three primary sources: sample size, simulated samples, and variability, as detailed in Table 6.

Table 6 Frequency of responses classified in the codes for sources of uncertainty

Responses citing sample size as a source of uncertainty typically reflected the belief that only a complete census can provide certainty. Those attributing uncertainty to simulated samples viewed simulation as merely illustrative, a precursor to actual fieldwork. Finally, responses identifying variability as the source often stemmed from sampling distribution data, indicating the possibility of obtaining samples not supporting the hypothesis even if it is true. Interestingly, the recognition of variability as a source of uncertainty predominantly occurs in responses to the final problem. This may be influenced by the problem’s context, yet the impact of the activity itself as a contributing factor cannot be discounted.

6 Discussion

Reflection on the different ways in which students understand the sampling distribution has led us to propose a framework for the development of conceptions of the basic constructs of sampling, whose origin and organization we describe below. Ainley et al. (2015) distinguished between inference situations in which the population is “material, finite, and countable” and those in which it is a “mathematical formulation.” This distinction suggests two important points about sampling. The first is that they give importance to the concept of population, which had been neglected in statistics education studies on this topic. Second, they point out two conceptions of population that we could imagine as phases of a hypothetical scale of conceptions of the construct “population.” Based on these suggestions and the observations of our research, we consider it convenient to organize the students’ conceptions along two axes: one axis contains the basic constructs of Population, Sample, Repeated Sampling, and Sampling Distribution; the other is a scale of three levels through which the conceptions of these constructs develop, namely, Empirical, Informal, and Mathematical (see Table 7). It is worth clarifying that the levels on which we focus are only the empirical and informal ones; the mathematical level lies on the horizon of high school students, for further development in their university or postgraduate studies. It should be understood that a broad conception at the third level includes the integration of the mathematical formulation with the relevant situations.

Table 7 Students’ conceptions about some sampling concepts

As a result of the analysis of the activity we carried out, we will argue that students moved from an empirical to an informal conception of sampling distribution and that this transition constitutes a process of abstraction composed of four specific interrelated properties that we coded as: mathematization, processing, uncertainty/randomness, and conditionality; these properties emerge and develop thanks to computational mediation.

Mathematization is the transformation of real-world contexts and situations into mathematical problems and models. During the activity, students use the program and, to transfer the situation to the computational environment, they must reduce the problem’s population to the proportion of the attribute in question; moreover, to obtain a simulated sample, they only require the sample size and the attribute’s proportion. The random selection of samples, the calculation of the statistics, and the recording and representation of the results are left to the machine. This whole process takes place in the computational environment, which is independent of the problem context. The double majority procedure shows that students have difficulties moving from the empirical conception to the mathematical conception of population. However, in the computational environment, it is evident that the program operates with a mathematical version of population. Thus, in this environment, students develop a conception of population that juxtaposes characteristics of both empirical and mathematical population.

Processing is the result of performing operations and transformations. Although the software takes over the combinatorial operations and the random choice to generate the samples, it leaves a margin for the user to perform explorations in which they can change the program’s input and see the resulting distributions. By generating various distributions, students noticed two fundamental properties: the bell shape of the distributions and the relationship of the distribution’s mean to the attribute’s proportion (input). Moreover, once a distribution is generated, students operate on it, carrying out counts of frequencies of specific values or ranges of values. This operability allows students to propose variation scales to solve the problems and give operational meaning to the notions of p-value and level of significance.

Uncertainty/randomness is the consideration that it is not possible to predict the attribute’s proportion in a sample because it is chosen randomly and, therefore, is a cause of variability. Although the random generation of the samples is taken over by the program, students have a sampling scheme that allows them to interpret what the software does and observe its consequences; for example, they realize that in successive runs of the program with the same input and sample size, the resulting distributions are not identical, since they show slight or notable variations from one another, but also perceive invariants referring to the shape and mean of the distribution. Students weigh the randomness/regularity relationship differently; for instance, some are unsure of a conclusion, believing a repeated procedure might yield a different distribution and possibly an opposite conclusion, while others view the differences as minor, seldom altering a conclusion.

Conditional reasoning derives consequences from a situation under the assumption that a condition holds. A distribution generated with the software depends on the input proportion and the sample size, and it models the sampling distribution of a real population for a given sample size only if the attribute’s proportion matches the program’s input proportion. As in an inference problem the true proportion of the attribute is not known, this cannot be verified. Thus, using the software to carry out a significance test requires hypothetical reasoning: “What would happen if the input proportion of the computer program were equal to the attribute proportion of the population?” Students who solved the task by evaluating whether the empirical sample is usual or unusual with the pseudo p-value or p-value procedure applied conditional reasoning.

As mentioned in the background, the studies by Saldanha and Thompson (2002), van Dijke–Droogers et al. (2019), and Case and Jacobbe (2018) share our focus on sampling and sample distributions for pre-university students, the implementation of instructional episodes, and the use of technology to enhance learning. It is worth highlighting some other similarities and contrasts between these studies and ours.

Saldanha and Thompson (2002) were pioneers in addressing the problem of analyzing high school students’ conceptions of sample and sampling in relation to the learning of sampling distribution and statistical inference. The identification of additive and multiplicative sample conceptions that they proposed suggested considering a mathematical perspective in their development. In our framework, we consider that a sample and repeated sampling at the empirical level have the characteristics of the additive sample conception, while its placement at the informal level has the characteristics they attributed to a multiplicative sample conception.

Regarding the study by van Dijke–Droogers et al. (2019), we note that it contrasts with ours in that they used manipulatives as the central experimental device for students, while we let them rely entirely on computer software. For them, the inference problem was estimating the number or proportion of the attribute, while for us, it was about significance tests of proportions. A point of agreement is that they highlight the importance of the conceptual transition from conceiving the sampling distribution only as a representation of sampling results to conceiving it as a model of variability and uncertainty.

Finally, Case and Jacobbe (2018) suggest a two-axis framework to characterize students’ difficulties. One axis refers to the difficulty of distinguishing and understanding the distribution of the population, that of a sample, and the sampling distribution; the other refers to students’ difficulty in distinguishing a real perspective from a hypothetical perspective. A contrast with our framework is that we characterize conceptions rather than difficulties and consider different concepts on the vertical axis; however, their horizontal axis, which distinguishes between real and hypothetical perspectives, highlights a phenomenon that we conceptualize on the axis of levels of abstraction: empirical, informal, and mathematical. We consider that what distinguishes an ESD from an SSD is precisely that the former reflects a real perspective, while the latter reflects a hypothetical perspective.

7 Conclusions

We conclude that this study highlights the importance of understanding the simulated sampling distribution (SSD) as a model of sample variation, emphasizing its independence from external factors and its dependence solely on the attribute proportion and the sample size. The role of computer simulation in fostering this understanding is underscored, linking it to a process of mathematization. The study also highlights the need to develop the notions of population, sample, and repeated sampling from empirical to informal conceptions, with a perspective towards their mathematical conception. This development involves key steps: reducing a population in context to the attribute proportion, distinguishing between empirical and simulated samples and their functions, and differentiating between replication and simulation. It is argued that understanding and applying significance tests, together with the concepts of p-value and significance level, is crucial to facilitating the transition from the conception of an ESD to that of an SSD used to assess whether an empirical sample is usual or unusual, that is, to conceiving the SSD as a model. Through the analysis of the conducted activity, it was observed that students transitioned from an empirical to an informal conception of sampling distribution, a process of abstraction characterized by four interrelated properties: mathematization, processing, uncertainty/randomness, and conditional reasoning, facilitated by computational mediation.

As a teaching implication, we find that an informal approach to sampling distributions is recommendable and viable for high school statistics classes, provided there is strong support from dynamic software for simulation and graphical representation. Moreover, incorporating significance-test problems is key to contextualizing sampling distributions and imbuing them with meaning. Looking towards future research, we believe the findings of this study can aid in designing, implementing, and evaluating learning trajectories that involve students in an abstraction process to develop an informal understanding of sampling distribution, by designing activities and hypothesizing learning while considering the attributes of mathematization, processing, randomness/uncertainty, and conditionality. Despite its contributions, this research had several limitations, notably the absence of a strategy to promote statistical and, specifically, probabilistic language during the learning activities, as we relied on assessing samples by percentages and proportions.