1 Introduction

Assessments are crucial in education because they provide valuable insights into a learner's knowledge and skills. In Nigeria, examination bodies such as the West African Examinations Council (WAEC) and the National Business and Technical Examinations Board (NABTEB) conduct high-stakes school-leaving certificate examinations, and candidates use these certificates for enrollment in higher education. Unlike some testing programs that avoid repeating the same test in order to preserve security and cater to diverse candidates [1, 2], WAEC and NABTEB administer tests multiple times throughout the year. In the Nigerian context, however, test questions can be leaked or taken out of test centers, potentially compromising the integrity of the original skill assessment. To address this risk, test development should prioritize the use of content and test specifications/blueprints whenever feasible. Concerns have also been raised about perceived differences in the quality and difficulty of the two bodies' test items: some argue that items from one body are better or more favorable than those of the other, raising worries about discrimination in certificate awards. The need to equate WAEC and NABTEB therefore stems from potential inequity and inconsistency in the assessment process between the two bodies, which can affect students' opportunities and academic outcomes. An equating study comparing the mathematics test items and scores of WAEC and NABTEB is essential to determine the true comparability of their assessments and to address concerns about potential bias and unfairness in the education system.

While test scores are often used for critical decisions by various entities, the nature of the exams taken by the examinees is rarely considered [3, 4]. The expectation is that parallel tests, such as NABTEB test A and WAEC test B, should yield equal scores across examinees. Equating test scores is vital for ensuring the validity, reliability, objectivity, and fairness of results, allowing scores to be used interchangeably across different tests and across testing years. Previous research has established the comparability of scores from examinations conducted by WAEC, NECO, or regional state bodies in Nigeria [5,6,7,8,9] using traditional linear equating. This study seeks to fill both an evidence gap and a methodological gap by placing WAEC and NABTEB mathematics items on a common scale at a broad level and analyzing them in depth using the IRT-based Stocking-Lord and Haebara methods. Doing so can help resolve the ongoing debates and perceptions about the credibility of the test items and certificates issued by these examination bodies. Ultimately, the study is also important for an international audience, as it addresses a crucial issue in educational assessment that extends beyond the borders of Nigeria. Its focus on comparing mathematics test items and scores between two prominent examination bodies, WAEC and NABTEB, carries implications that resonate with educational systems worldwide. One key aspect that makes the study globally relevant is its investigation of examination integrity: it examines potential differences in test items and the comparability of scores between different examination bodies. Ensuring the integrity and fairness of examinations is a universal concern, as it directly affects the credibility and reliability of educational qualifications in any country.

Furthermore, the findings of this study can significantly influence educational policies and standardization efforts at an international level. Policymakers and organizations involved in educational assessment can draw valuable insights from this study to enhance assessment practices and ensure equitable opportunities for students in diverse educational contexts. The study's implications for university admissions are also of broad interest: if test scores from different examination bodies are not comparable, this raises questions about the validity of using those scores as a basis for admission decisions in higher education institutions, and the findings can therefore inform admissions processes in countries with multiple examination bodies. The study's focus on fairness and equity in certificate awards is another aspect of international relevance, since concerns about potential discrimination based on which examination body a student sat are not unique to Nigeria. Additionally, the study's application of Item Response Theory (IRT)-based equating methods has implications beyond Nigeria's context; these methods are relevant to international assessments and standard-setting practices, making the study valuable for educational researchers and professionals around the world. Understanding the comparability of test scores is critical for ensuring the validity and accuracy of educational assessments, and this study contributes insights that are relevant to educational systems globally. Researchers worldwide can build on this work to advance assessment methodologies and practices in their respective contexts.

In this paper, the following sections are presented. First, relevant and related literature, including the underpinning theory, test score equating and its assumptions, score equating techniques, equating designs, and IRT equating methods, is discussed in Sect. 2. The methodology adopted in the study is presented in Sect. 3, and Sect. 4 focuses on the analysis of the data collected and the presentation of the results. In the final sections, the author discusses the findings, highlights the implications for educational policy, and concludes by outlining limitations and future research prospects.

2 Literature review

2.1 Theory underpinning the study

In test equating, Classical Test Theory (CTT) and Item Response Theory (IRT) are commonly used. CTT equating involves comparing observed or true scores at the examination level, where the characteristics of the score distributions are assumed to be equal for a specified group of examinees. CTT, however, does not provide the equality and invariance properties that equating assumes [10,11,12,13]. Various methods within CTT are employed for equating, such as linear equating, equipercentile equating, and several others. In contrast, IRT methods, which form the basis of this study [14], model examinees' ability and item-level characteristics such as difficulty, discrimination, and guessing. IRT equating places the item parameter estimates of two tests on the same scale rather than relying on the distributions of total test scores. IRT likewise uses mathematical functions to describe how the probability of answering an item correctly depends on the latent ability of the respondent [15,16,17,18]. The equating process in IRT involves estimating examinee and item parameters separately, and an individual's ability estimate is not affected by the difficulty or easiness of the items [19]. Data for IRT equating can be collected under any design, ranging from the statistically strongest single-group design to weaker designs in which only a handful of common items links the groups of participants [20, 21].

Additionally, different IRT models use different numbers of item parameters to define an item's statistical characteristics. The one-parameter logistic model (Rasch model or 1PL) uses only item difficulty to describe an item and assumes that all items are equally discriminating. The two-parameter logistic model (2PL) describes items by their difficulty and discrimination, thus reflecting how strongly each item relates to ability. The three-parameter logistic model (3PL) further includes a guessing parameter to account for examinees with low proficiency [22]. Although a four-parameter logistic model (4PL) has been developed, this study used the 3PL model for the dichotomously scored responses collected (see Eq. 1). The 3PL model was chosen on the basis of a model-data fit assessment: compared with the 1PL and 2PL models, it showed a better fit on information indices such as AIC (Akaike Information Criterion), SABIC (Sample-Size Adjusted BIC), HQ (Hannan-Quinn), BIC (Bayesian Information Criterion), and the log-likelihood (logLik), as well as on likelihood ratio tests (chi-square statistic (X2), degrees of freedom (df), and p-value). The 3PL model's superior fit, even after the information criteria penalized its additional parameters, led to its selection as the preferred model for fitting the WAEC and NABTEB datasets.
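For illustration, the sketch below shows how such a model-data fit comparison can be carried out in R with the mirt package used later in this study; the response matrix resp is hypothetical, and the indices reported above come from the authors' own data, not from this code.

```r
# A minimal sketch of the 1PL/2PL/3PL fit comparison, assuming a hypothetical
# 0/1 response matrix `resp` (examinees in rows, items in columns).
library(mirt)

fit_1pl <- mirt(resp, model = 1, itemtype = "Rasch", verbose = FALSE)
fit_2pl <- mirt(resp, model = 1, itemtype = "2PL", verbose = FALSE)
fit_3pl <- mirt(resp, model = 1, itemtype = "3PL", verbose = FALSE)

# Likelihood-ratio tests together with AIC, SABIC, HQ, BIC and logLik
anova(fit_1pl, fit_2pl)
anova(fit_2pl, fit_3pl)
```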

$$P\left({Y}_{j}=1\mid \theta \right)= {P}_{j}= {c}_{j}+ \left(1- {c}_{j}\right)\frac{exp\left\{{a}_{j}\left(\theta -{b}_{j}\right)\right\}}{1+exp\left\{{a}_{j}\left(\theta -{b}_{j}\right)\right\}}$$
(1)

where Yj is the response to item j; \(\theta\) is the ability; aj is the discrimination parameter; bj is the difficulty parameter, and cj is the guessing parameter. If cj = 0, the model reduces to the 2PL model. If cj = 0 and aj = 1, the model reduces to the Rasch model.
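As a concrete illustration of Eq. 1, the short R function below evaluates the 3PL response probability for supplied parameter values; the function name and the example values are purely illustrative.

```r
# Probability of a correct response under the 3PL model (Eq. 1)
p_3pl <- function(theta, a, b, c) {
  c + (1 - c) * plogis(a * (theta - b))   # plogis(x) = exp(x) / (1 + exp(x))
}

# Illustrative call: an item with b = 0, a = 1.1 and guessing floor c = 0.2,
# evaluated for low-, average- and high-ability examinees
p_3pl(theta = c(-2, 0, 2), a = 1.1, b = 0, c = 0.2)
```

Setting c = 0 in this function reduces it to the 2PL model, and additionally fixing a = 1 gives the Rasch model, mirroring the reductions noted above.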

2.2 Score equating and its assumptions

Test equating involves creating comparable scores for multiple versions of a test, thus enabling those versions to be used interchangeably. Researchers have devoted decades to the examination of test equating and have found that scores are comparable when more than one form of a test is used [23,24,25]. Maintaining the technical quality of testing programmes is crucial for ensuring the validity of assessment results. The interpretation of scores is supported by evidence and theory, and equating is a statistical process that helps establish the same substantive meaning of scores on different test forms [26, 27]. Test equating, as described by various researchers including [26, 28,29,30,31,32], involves adjusting test scores to make them comparable across different tests by accounting for differences in test difficulty and other statistical properties. Another perspective is that test equating establishes relationships between the raw scores of different tests, as explained by [33]. Consequently, when examining bodies administer test items on multiple occasions and to multiple examinee groups, equating is encouraged as a statistical procedure to counter the overexposure of items, which can threaten test security [34]. Using equating methods with less error helps ensure fairness among examinees. Also, [35] explain equating test scores as an attempt to translate scores from one test to another defensibly. The question is: how can one determine whether a score of 75 means the same thing when half of the students see one set of items and the other half see another set? Could one set be a little easier? When assessments are administered in linear forms, or when a bank of computerized adaptive test or linear-on-the-fly test items is piloted, more than one test form will likely be used, which requires test equating. The primary aim of equating test scores is to make them comparable, exchangeable, and interchangeable.

To achieve interchangeability and comparability between test scores, researchers suggest a set of five equating requirements [33, 36]. These involve ensuring that the tests measure the same construct, have equal reliability, use a symmetrical equating transformation, exhibit group invariance, and employ a common equating function [37, 38]. Equating instruments with markedly different reliabilities is discouraged [39]. Although achieving complete group invariance and population invariance is challenging [40], these properties can be evaluated through empirical assessments [41, 42, 44]. The equating requirements have been subject to various critiques, and a consensus on which ones are crucial remains elusive [45, 46]. Nonetheless, the primary objective of equating is to make scores on different test forms comparable, with an emphasis on aligning the tests to the same construct. In this study, the focus was on equating tests built on the same construct, as both tests were designed to measure the same content in the senior secondary school mathematics curriculum.

2.3 Score equating techniques

Equating scores from different test forms or testing programmes can be accomplished using several techniques, processes, or methodologies. As noted by [47, 48], equating can be classified into three types: vertical, score linking, and horizontal. Vertical equating, also known as across-grade equating or scaling, compares student scores across tests at different or multiple levels. It compares the content and difficulty of tests across grade levels on the same construct, thereby summarizing student progress over time ([49], p. 50). Vertical equating is often used to build a developmental scale; for instance, mathematics tests for grade 11 and grade 9 may be compared. In these tests, mathematical skill is the focus even though their content differs, and the ability to do mathematics should increase steadily as students progress from year to year. Because two different groups of examinees are compared, vertical equating is conceptually complex. Moreover, when the same tests are given to students in adjacent grades, such as grade 8 and grade 9, the tests are likely to be easier for the grade 9 group, producing a ceiling effect that will not provide accurate performance information at the second administration. The horizontal equating technique, also known as within-grade equating, compares the scores of similar tasks among students at the same level, on the same topic, and from the same population [49,50,51,52]. It is used when students re-take exams on different forms, all of which are equated to yield comparable scores. This method is straightforward, comparing groups of examinees at the same ability level using different tests based on the same content and difficulty range, such as the WAEC and NABTEB mathematics test items examined here. Equating can also be carried out at two points in time: pre-equating and post-equating. Pre-equating converts raw scores into scaled scores before the operational test is administered, using data from field tests analyzed with statistical procedures [53]. Post-equating, on the other hand, adjusts operational test data for difficulty after administration and is considered more accurate. Pre-equating helps with score reporting, quality control, and assessment flexibility, but it may lead to motivation issues if field test items are presented separately. Post-equating is more accurate but requires sufficient time; when time is limited, the equating process may be compromised, affecting the reporting process and quality control.

2.4 Score equating designs

Various test equating designs include single-group designs, counterbalanced single-group designs, equivalent-group designs, and the non-equivalent groups with anchor test design (NEAT). In a single-group design, two testing instruments (X and Y) are administered to one group of participants to estimate item parameters; the drawback of this design is possible fatigue or familiarity effects. Counterbalanced designs involve two sample groups taking X and Y in different orders to control for order effects, with the advantage of accurate results from a small sample size [43]. Equivalent-group designs administer separate test forms to different groups assumed to be equivalent, although the equivalence assumption can be problematic [54]. In non-equivalent group designs, different populations take the test forms. [1] assert that NEAT is widely used for its administrative flexibility but can be challenging because of the non-equivalence of the groups taking the test forms. According to [55], equating based on different sets of common items can yield different results when equating two test forms. Discussing such results, [56] reported that when the groups were similar in ability, the anchor tests yielded similar equating results, but when the groups differed in ability level, the anchor tests yielded very disparate equating results; consequently, the anchor test must be chosen carefully [56]. Short tests are less reliable than long tests, so anchor tests need to be long enough to reflect group differences accurately and reliably, and random equating error decreases as the number of common items increases [57,58,59]. More importantly, common-person and common-item equating are further distinctions among equating methods. Common-person equating involves administering both tests to the same group of people, and a linear transformation is derived from the means and standard deviations of the scale locations on the two tests. Because the same persons take both tests in this study, a common-person equating is proposed. In common-item equating, by contrast, an anchor set of common items is embedded in the two different tests, and the common items are equated through their mean location. Four steps are involved in equating: data collection, defining an operational equating transformation, selecting an estimation method, and evaluating the results.

2.5 IRT equating methods

The equating process through IRT involves several steps. First, ability and item parameters are obtained under a chosen equating design or data collection design. Second, ability and item parameters are estimated (calibrated) either simultaneously or separately for each form of the test [60]. In concurrent calibration, the item parameters of the two forms of a test are calibrated in a single run, whereas in separate calibration each form is calibrated in its own run of the calibration software. Lastly, item parameters from different test administrations must be placed on a common scale, since equating cannot take place without item parameters on a common metric. In IRT, the scaling coefficients that place parameter estimates from different calibrations on a common metric can be obtained using various methods, among them Haebara and Stocking-Lord (characteristic curve methods) and mean/mean and mean/sigma (moment methods). Haebara and Stocking-Lord take into account both the difficulty and the discrimination parameters of the test [61, 62]. In these methods, transformation constants are identified first, the characteristic curves of the two calibrations are then compared, and the constants are chosen so that the differences between the curves are minimized. Consequently, a comparison of the Haebara and Stocking-Lord (characteristic curve) methods was conducted in this study to determine the comparability of the mathematics items administered by WAEC and NABTEB in Nigeria. The Haebara method estimates the difference between item characteristic curves by summing the squared differences for each item [62, 63]. Here is the mathematical expression for this method:

A: slope (scaling) constant of the equating transformation

B: intercept constant of the equating transformation

\({P}_{ij}\left({\theta }_{i}; {a}_{j}, {b}_{j}, {c}_{j}\right)\): Item characteristic function.

\({P}_{ij}\left({\theta }_{i}; \frac{{a}_{j}}{A}, A{b}_{j}+B, {c}_{j}\right)\): Equated item characteristic function

$$Haediff\left({\theta }_{i}\right) = \sum_{j \in V}{\left[{P}_{ij}\left({\theta }_{i}; {a}_{j}, {b}_{j}, {c}_{j}\right)- {P}_{ij}\left({\theta }_{i}; \frac{{a}_{j}}{A}, A{b}_{j}+B, {c}_{j}\right)\right]}^{2}$$
(2)
$${Hae}_{crit} = \sum_{i}Haediff\left({\theta }_{i}\right)$$
(3)

Moreover, the Stocking and Lord (1983) method takes the square of the sum of the differences between the item characteristic curves, which is equivalent to the squared difference between the two test characteristic curves, at each ability point. Below is a mathematical expression of this method:

A: slope (scaling) constant of the equating transformation.

B: intercept constant of the equating transformation.

\({P}_{ij}\left({\theta }_{i}; {a}_{j}, {b}_{j}, {c}_{j}\right)\): Item characteristic function.

\({P}_{ij}\left({\theta }_{i}; \frac{{a}_{j}}{A}, A{b}_{j}+B, {c}_{j}\right)\): Equated item characteristic function

$$Stldiff\left({\theta }_{i}\right)= {\left[\sum_{j \in V}\left({P}_{ij}\left({\theta }_{i}; {a}_{j}, {b}_{j}, {c}_{j}\right)- {P}_{ij}\left({\theta }_{i}; \frac{{a}_{j}}{A}, A{b}_{j}+B, {c}_{j}\right)\right)\right]}^{2}$$
(4)
$${Stl}_{crit} = \sum_{i}Stldiff\left({\theta }_{i}\right)$$
(5)
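To make the two criteria concrete, the sketch below codes Eqs. 2–5 directly in R, using the 3PL function defined earlier. The anchor-item parameters, the ability grid, and the trial values of A and B are all hypothetical and serve only to show how the two loss functions differ: squared differences summed item by item for Haebara versus the squared difference of the summed curves for Stocking-Lord. In an actual linking, the first set of parameters would come from one calibration and the rescaled set from the other.

```r
# Minimal sketch of the Haebara and Stocking-Lord criteria (Eqs. 2-5),
# with hypothetical anchor-item parameters on one form's scale.
p_3pl <- function(theta, a, b, c) c + (1 - c) * plogis(a * (theta - b))

a <- c(1.2, 0.9, 1.5); b <- c(-0.5, 0.3, 1.1); g <- c(0.15, 0.20, 0.10)
theta_grid <- seq(-4, 4, length.out = 41)   # evaluation points theta_i

hae_crit <- function(A, B) {
  sum(sapply(theta_grid, function(th) {
    orig <- p_3pl(th, a, b, g)                  # original ICCs
    eqtd <- p_3pl(th, a / A, A * b + B, g)      # rescaled ICCs
    sum((orig - eqtd)^2)                        # Haediff(theta_i), Eq. 2
  }))                                           # Hae_crit, Eq. 3
}

sl_crit <- function(A, B) {
  sum(sapply(theta_grid, function(th) {
    orig <- sum(p_3pl(th, a, b, g))             # test characteristic curve
    eqtd <- sum(p_3pl(th, a / A, A * b + B, g)) # rescaled TCC
    (orig - eqtd)^2                             # Stldiff(theta_i), Eq. 4
  }))                                           # Stl_crit, Eq. 5
}

# Both criteria are zero at the identity transformation A = 1, B = 0
c(Haebara = hae_crit(1, 0), StockingLord = sl_crit(1, 0))
```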

2.6 Previous studies

Numerous studies (e.g., [2, 46, 53, 57, 62, 64,65,66,67,68,69,70]) have investigated the comparability of test-takers' scores from different examination bodies using Item Response Theory (IRT) equating methods, including Haebara, Stocking-Lord, mean-mean, mean-sigma, and concurrent calibration. These studies have shown that certain equating methods perform better than others in various examination contexts. For example, [46] compared the Haebara and Stocking-Lord methods using data from the 2018 National Examination administration and found that the Haebara method had higher mean-sigma values, suggesting potential improvements in discrimination and difficulty level. Another study [65] employed the Rasch model to equate mathematics test scores, demonstrating the effectiveness of IRT-based methods in this context. Similarly, [66] compared various equating methods and found that the Haebara and Stocking-Lord methods yielded comparable results in terms of test score comparability for mathematics assessments. In the context of WAEC mathematics test items, [70] compared CTT and IRT equating methods and found that IRT's mean-sigma method outperformed the others, with smaller errors. Additionally, [2] explored linking and concurrent calibration methods using mixed-format tests under IRT for non-equivalent groups with common-item designs; the concurrent calibration method generally performed better, recovering item parameters accurately and generating more precise estimated scores. Furthermore, [57] found that IRT methods produced smaller equating errors than CTT methods in the context of SAT-Verbal tests. Studies such as [71] and [72] investigated the comparability of mathematics examination items between different examining bodies and found varying degrees of equivalence. Despite these valuable insights, no study in Nigeria has specifically compared WAEC and NABTEB mathematics test items using the Stocking-Lord and Haebara IRT methods. The present study aims to fill this research gap and contribute to the existing body of knowledge.

3 Materials and method

3.1 Philosophical lens

A paradigm tells the reader how to interpret research results based on the collected data [73]. Selecting a paradigm is useful because, without one, the researcher cannot commit to a particular philosophical position and evaluate other possibilities against it. This study is based on a post-positivist research philosophy: in postpositivism, factual knowledge, which includes measurement, is considered trustworthy. Ultimately, the study is shaped and driven by the beliefs, convictions, expectations, and values of the postpositivist paradigm [73]. Additionally, the study employed a cross-sectional quantitative method with a single-group counterbalanced design, which requires each examinee to take both tests in a counterbalanced order. Because the same examinees write both tests, which strengthens the validity of the dataset used in the study, this equating design is regarded as the strongest statistical design: it uses examinees of presumed equal ability and thereby meets Lord's equity requirement.

3.2 Participants

Detailed demographic information about the Grade 12 students in government-owned schools who voluntarily participated in this study is shown in Table 1. Consent was sought from students and school administrators before participation in order to maintain ethical standards. The two test forms attracted 1,300 responses in total, of which 1,210 were usable in the analysis. A greater percentage of the respondents were male, representing 51.7% of the sample. At the time of data collection, the students were aged 17–19 years, the expected age range for sitting the terminal examinations conducted by public examining bodies. Although students were sampled across different locations in Educational District II of Lagos State, Nigeria, most of the students in this study (45.1%) were from schools within the Ikorodu area. While 44.5% of the students had a science specialization, the sample also included students with commercial and humanities specializations. As part of the study design, the sample was split into group A (647 students) and group B (563 students).

Table 1 Demographic profile of the participants

3.3 Measures

WAEC is considered a large-scale examination, as it involves a substantial number of candidates and is administered to a wide-ranging population across several West African countries, Nigeria being one of them. All Grade 12 students are required to sit the mathematics test, one of the cross-cutting subjects, and students use the results of this test to apply to higher education institutions (HEIs). The test is therefore high-stakes, and its items must measure the intended traits in a valid and reliable manner. The test comprises 50 items drawn from a wide range of content domains, each with one correct answer and three dummy options. Several stages of development and validation have been undertaken to standardize the test. Participants recorded their answers by shading an optical mark reader (OMR) sheet and scored 1 for each correct answer and 0 for each incorrect answer. There is also often a correction for guessing, whereby test-takers are typically awarded points for correct answers but receive a penalty or no points for incorrect answers; the aim is to discourage random guessing and encourage test-takers to answer only when they are reasonably confident in their responses. The specific scoring scheme may vary, but the idea is to prevent a large number of wrong answers from outweighing correct ones and affecting the overall score negatively.

Furthermore, NABTEB is another Nigerian public examination body, entrusted with the responsibility of conducting the technical and business certificate examinations that WAEC had previously conducted [74]. NABTEB examinations are likewise held twice a year, in May/June and November/December. In this study, its May/June mathematics test, which comprised 50 items with four options labeled A through D, was utilized. All test items were dichotomously scored (0 for each incorrect answer and 1 for each correct answer), so the maximum possible raw score was 50 and the minimum was 0. The items were taken from the high school mathematics curriculum, which has almost the same content as WAEC's. Using the two test forms, this study examined whether the scores are comparable when placed on the same continuum or scale. Subject experts validated the content of the instruments: the factor loadings and the content validity index (CVI) exceeded 0.50, and the IRT empirical reliability coefficients were 0.89 and 0.86, respectively. The instruments were therefore suitable for equating.

3.4 Data analysis

The obtained data were analyzed with the Haebara and Stocking-Lord equating methods implemented through different R packages (see Appendix, Section 1, for the code used to import the data into the RStudio environment). First, the assumption that the two forms measure a similar construct was tested using the sirt package in R version 4.0.1 [75, 76]. In this package, the confirmatory DETECT function (conf.detect()) implements Stout's test of essential unidimensionality as assessed by the DETECT index [77,78,79]. Under a confirmatory specification of item clusters, this function computes the dimensionality evaluation to enumerate contributing traits (DETECT) statistic for dichotomous item responses and the polyDETECT statistic for polytomous item responses [77, 80,81,82,83]. DETECT produces several indices, including the DETECT value itself, the approximate simple structure index (ASSI), and the ratio index (RATIO) [82]. The option unweighted means that all conditional covariances of item pairs are weighted equally, whereas weighted means that these covariances are weighted by the sample size of the item pairs. The following classification scheme is used to determine the dimensionality of a test [83, 84]; a minimal illustration of the check follows the scheme below.

$$\text{Strong multidimensionality}\quad \text{DETECT} > 1.00$$
$$\text{Moderate multidimensionality}\quad 0.40 < \text{DETECT} < 1.00$$
$$\text{Weak multidimensionality}\quad 0.20 < \text{DETECT} < 0.40$$
$$\text{Essential unidimensionality}\quad \text{DETECT} < 0.20$$
$$\text{Maximum value under simple structure}\quad \text{ASSI} = 1,\ \text{RATIO} = 1$$
$$\text{Essential deviation from unidimensionality}\quad \text{ASSI} > 0.25,\ \text{RATIO} > 0.36$$
$$\text{Essential unidimensionality}\quad \text{ASSI} < 0.25,\ \text{RATIO} < 0.36$$
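As referenced above, the following is a minimal sketch of how this check can be run with sirt. The response matrix resp, the Rasch-based conditioning score, and the two-cluster item partition are hypothetical stand-ins for the authors' appendix code, and argument details may vary across sirt versions.

```r
# Minimal sketch of the essential-unidimensionality check, assuming a
# hypothetical 0/1 response matrix `resp` with an even number of items.
library(sirt)

# Conditioning score: weighted likelihood ability estimates from a Rasch fit
rasch_fit <- rasch.mml2(resp)
theta_wle <- wle.rasch(resp, b = rasch_fit$item$b)$theta

# Hypothetical confirmatory partition of the items into two content clusters
itemcluster <- rep(1:2, each = ncol(resp) / 2)

# Confirmatory DETECT: small DETECT, ASSI and RATIO values relative to the
# cut-offs listed above indicate essential unidimensionality
conf.detect(data = resp, score = theta_wle, itemcluster = itemcluster)
```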

Scores from the two test forms were then transformed using R packages such as the multidimensional item response theory package (mirt) [14], which gives the user the option of modeling response data with several IRT models. Objects fitted with the mirt package can subsequently be read into the equateIRT package [84]. IRT models [85,86,87,88] are widely used today for the analysis and scoring of tests, and it is natural to use IRT equating because many testing programmes assemble their tests using IRT [19, 85, 89]. When a dichotomous data matrix of order \({k}_{x} \times {J}_{x}\) is fitted with an IRT model, parameter estimates \(\widehat{\theta}\) and \(\widehat{\omega}\) are obtained for persons and items, respectively. A test score based on an IRT model is typically a prediction of the ability parameter (\(\widehat{\theta}\)), known as an IRT score, rather than the sum score \({X}_{i} \in x\). Under the IRT setting, a transformation function is needed for a test form X and a test form Y to have equivalent IRT scores, and IRT scales are transformed differently depending on the equating design used. When estimating single-group counterbalanced designs, the abilities are not a concern; no additional transformations of the scales are needed if a common mean and variance of the ability distribution are assumed [32]. Kolen et al. [26] show that IRT parameter estimates from two calibrations of different test forms should be placed on the same scale, which is done by means of IRT parameter linking [90].

The functions modIRT() and direc() are therefore included in equateIRT to implement IRT parameter linking/equating. When modIRT() is invoked, the coef argument accepts the matrices of item parameter estimates. Besides point estimates of the equating coefficients, equateIRT can also estimate their standard errors when the covariance matrix of the item parameter estimates is supplied. In summary, equating scores on a new test form 1 to the scale of test form 2 can be achieved in two steps. A linear transformation is used to rescale the item parameters of one form onto the scale of the other using IRT methods such as the characteristic curve methods [61, 91, 92]. Once the IRT item parameters of the two test forms are on the same scale, the scores on the two tests become comparable, since they are direct functions of the IRT item parameters. When multiple new test forms (e.g., X1, X2, ..., Xn) need to be equated to a base form, score equating is an efficient approach. Succinctly, test scores were equated using the 3PL IRT model for the multiple-choice mathematics questions: item difficulty and person ability parameters are estimated from the item response data so that test scores are comparable between different forms or administrations of the test. In addition, maximum likelihood estimation was used for estimating the coefficients and calibrating the items. A sketch of this workflow is given below.
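The sketch below outlines this workflow end to end. The response matrices resp1 and resp2 are hypothetical, the items assumed to be shared by the two calibrations (matched by name) stand in for whatever linking items were used, and argument names reflect the equateIRT interface as commonly documented; the appendix code remains the authoritative record of the analysis actually run.

```r
# Minimal sketch of the 3PL calibration and characteristic-curve linking
# workflow, assuming hypothetical 0/1 response matrices `resp1` (form 1) and
# `resp2` (form 2) whose linking items carry identical column names.
library(mirt)
library(equateIRT)

# Separate 3PL calibrations of the two forms
fit1 <- mirt(resp1, model = 1, itemtype = "3PL", verbose = FALSE)
fit2 <- mirt(resp2, model = 1, itemtype = "3PL", verbose = FALSE)

# Item parameters on the conventional a/b/g scale (as summarised in Table 3)
coef(fit1, IRTpars = TRUE, simplify = TRUE)$items

# Import estimates (and, when available, their covariance matrices) from mirt
est1 <- import.mirt(fit1, display = FALSE)
est2 <- import.mirt(fit2, display = FALSE)

mods <- modIRT(coef = list(est1$coef, est2$coef),
               var  = list(est1$var, est2$var),
               names = c("form1", "form2"), display = FALSE)

# Linking form 1 onto the form 2 scale with the two characteristic-curve methods
link_sl  <- direc(mods = mods, which = c(1, 2), method = "Stocking-Lord")
link_hae <- direc(mods = mods, which = c(1, 2), method = "Haebara")
summary(link_sl)   # equating coefficients A and B (with SEs if var was supplied)
summary(link_hae)

# Raw-score conversion tables of the kind reported in Table 5
score(link_sl,  method = "TSE")   # true-score equating
score(link_hae, method = "TSE")
```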

4 Results

To check the major requirement of a similar construct, as recommended by [38, 72, 93], the responses of the examinees to the two mathematics tests (prepared, administered, and assessed by WAEC and NABTEB) were subjected to Stout's test of essential unidimensionality implemented in the sirt package of the R language and environment for statistical computing (see Appendix, Section 2, which describes the R code for establishing essential unidimensionality of the two test forms prior to the equating process). Table 2 presents the result.

Table 2 Dimensionality assessment of the two tests

Table 2 shows that the two test forms of mathematics multiple-choice items were essentially unidimensional: form 1 had a maximum DETECT value of −0.304 (< 0.20), ASSI = −0.476 (< 0.25), and RATIO = −0.678 (< 0.36), and form 2 had a maximum DETECT value of −0.303 (< 0.20), ASSI = −0.469 (< 0.25), and RATIO = −0.671 (< 0.36). The assumption of unidimensionality was therefore not rejected for either test. This result shows that one dominant dimension accounted for the variation observed in students' responses to the two mathematics multiple-choice tests. The tests thereby fulfilled the same-construct requirement for conducting test equating and the unidimensionality assumption of item response theory. Table 3 presents the 3PL item parameters for the two test forms.

Table 3 Item parameters for the two test forms

Table 3 displays the parameters generated using the 3PL model implemented in the mirt package of the R language (a detailed description of the R code for the 3PL item calibrations of the two test forms before the equating process is given in Appendix, Section 3). In these estimates, a1 represents discrimination/slope, b represents difficulty/threshold, and g represents the guessing parameter. Form 2 discriminated between examinees who knew the subject material and those who did not (M = 1.11, SD = 0.16) slightly more strongly than form 1 (M = 1.09, SD = 0.15); form 2 items therefore distinguish somewhat better between examinees of low and high ability. The difficulty indices indicate how easy or difficult the two test forms were for the examinees. As noted in [92, 94], easier items have lower difficulty indices (negative values), with very easy items below −2, while harder items have higher indices (positive values), with very hard items above +2. Accordingly, form 2 items were, on average, slightly more difficult (M = 0.15, SD = 1.08) than form 1 items (M = 0.11, SD = 1.07). A guessing parameter is considered unacceptable if it exceeds c > 0.35 [95]; on average, neither test form (M = 0.05, SD = 0.09 and M = 0.06, SD = 0.08, respectively) was vulnerable to guessing. Using the equateIRT package in R, the equating coefficients for the two test forms were obtained with the Stocking-Lord and Haebara equating methods, as presented in Tables 4 and 5. The R code for establishing the equivalence and comparability of the two test forms using the Stocking-Lord and Haebara IRT methods can be found in Appendix, Section 4.

Table 4 Characteristics curve equating coefficient of the two test forms
Table 5 Equating methods on the two test forms

The Stocking-Lord and Haebara procedures for computing the equating coefficients were compared and found to differ in their computational details. The Stocking-Lord procedure minimizes a quadratic loss function based on the test characteristic curve with respect to the equating coefficients, while the Haebara method computes the probability of a correct response for each item from the item parameter estimates of one calibration and a corresponding vector from the rescaled estimates at the same theta value; the slope and intercept coefficients are then obtained by a weighted least squares procedure that minimizes the average squared difference between the two sets of probabilities across the ability distribution. Table 4 shows that the equating coefficients produced by the two methods yield similar estimates of item and person parameters when used for common equating operations. The focus of the comparison is on the agreement between the two methods rather than on the goodness of fit of the transformed values to the underlying parameters. The StdErr column in the output, however, shows NA because the covariance matrix of the item parameter estimates was not provided.

Table 5 displays the equated scores for the two test forms obtained with the two methods, Stocking-Lord and Haebara, to allow for comparison. The table has four columns, with the second and fourth columns representing the possible observed scores, ranging from 0 to 50, on the two test forms. The column labeled 'form1.as.form2' displays the form 1 scores expressed on the form 2 scale under each method. For instance, under Stocking-Lord a raw score of 0 corresponds to an equated score of −0.11, a raw score of 1 to 0.87, and so on; under Haebara, a raw score of 0 corresponds to −0.49 and a raw score of 1 to 0.38. The equated scores from the two methods were relatively similar, with a mean of 24.98 and a standard deviation (SD) of 14.91 for Stocking-Lord, and a mean of 25.07 and an SD of 17.87 for Haebara. Relatedly, a student with a raw score of 50 would be expected to obtain an equated score of 50.06 under Stocking-Lord and 50.49 under Haebara. To compare the two equating methods visually, a plot (Fig. 1) displays the equated scores from the test forms against the scale scores.

Fig. 1 Equated scores of the test forms

Figure 1 shows that equating form 1 and form 2 with the two IRT methods yields closely aligned equated scores across the score range, supporting a much stronger claim that scores on the two forms can be interpreted correctly and interchangeably. On the evidence of these equated scores, the two public examining bodies are measuring very similar knowledge and skills with their test forms.

5 Discussion and conclusion

Prompted by continuing and diverse public perceptions about the superiority of one of the two public examining bodies in Nigeria with respect to placement, decision-making, the award of scholarships, and admission to universities and colleges, the present study compared the mathematics test items of both examining bodies to determine the comparability of their scores. Using the confirmatory DETECT version of Stout's test of essential unidimensionality implemented in R, it was possible to establish the most important test equating assumption, namely that the two tests measure the same construct. The study demonstrated that a single dominant construct explains performance on both test forms examined. This finding agrees with [5, 70], who reported that the WAEC June and November diets had the same construct specification. Because the two examination bodies use the same mathematics curriculum/syllabus to develop their test items, the two tests measure the same content and cognitive processes. The two examination bodies therefore support the same inferences about what students know and can do, and a strong claim has been established in this study to correct varied public perceptions and opinions about the disparity in quality of the certificates awarded by these two bodies. Also, based on the calibration of the two forms with the 3PL model, form 2 was found to be slightly more difficult than form 1 and more useful for differentiating between examinees with low and high abilities. The guessing parameters of the two forms provided no strong evidence that low-ability students could randomly guess the correct answers to the items in either form.

Further, the study utilized two different equating procedures, the Stocking-Lord and Haebara methods, to ensure the comparability and equivalence of the two test forms despite the procedures' different underlying computational characteristics. The results showed a strong correlation between the equated scores obtained from the two methods, indicating a reasonable level of agreement between them. This finding is consistent with previous studies [26, 46, 65, 66, 71] that have demonstrated the effectiveness of IRT-based equating methods for comparing test scores across different forms or administrations of a test. In line with [46], this study confirms that the Haebara method exhibits higher mean-sigma values, suggesting potential improvements in discrimination and difficulty level. Similarly, [65] demonstrated the effectiveness of IRT-based methods, which aligns with the use of the 3PL model in this research to equate mathematics test scores. The study concurs with the findings of [66], where the Haebara and Stocking-Lord methods yielded comparable results in terms of test score comparability. Studies such as [26] and [71], which investigated the comparability of examination items between different examining bodies, also resonate with the findings of this study. Furthermore, the findings indicated that the WAEC mathematics test form exhibited slightly higher difficulty and better discriminatory power among examinees than the NABTEB form, as evident from the estimated item parameters. These differences in item parameters could be attributed to specific characteristics of the test items or to variations in the populations of examinees from the two examining bodies; for instance, the WAEC test may have been designed to assess a higher level of proficiency or administered to a more advanced group of examinees than the NABTEB test. This highlights the significance of equating test scores to account for such differences in item parameters and to ensure fair score interpretation across different test forms or examinee populations. Equating test scores is crucial for achieving score comparability and making valid comparisons across different test forms or examination bodies. Interestingly, it was somewhat surprising that, despite the differences in the mathematical formulation of the two procedures, they produced very similar estimates of item and ability parameters from a given set of items. This observation underscores the robustness of equated metrics in producing comparable estimates, even when test forms differ significantly. In conclusion, the study demonstrates that the two test forms possess similar, if not the same, characteristics. There should therefore be no doubt about the comparability of their results or the usability of their certificates. Equating the scores of WAEC and NABTEB is crucial for ensuring fairness and equity in score interpretation, making the certificates of both examining bodies equally valid and reliable for various purposes.

6 Implications for educational assessment and policy

The study findings have profound implications for educational policy and practice in Nigeria. The equating results obtained with the equateIRT package show that test scores are comparable across the two examination bodies, demonstrating that these scores can be used fairly and equally when making important decisions, including curriculum redesign, student placement, admissions, and promotions. By equating test scores, scores can be interpreted and compared meaningfully across different test forms and examination bodies, making it easier to reach accurate and reliable decisions based on test scores. In the education system, this can have a significant impact on fairness, transparency, and accountability. Furthermore, this study contributes to the growing body of research on equating methods and their applicability in diverse educational contexts. The equateIRT package was found to be useful and effective for equating test scores from Nigerian educational assessment organizations, and researchers and practitioners can use it to equate test scores in a variety of settings, given its user-friendly and flexible design. More broadly, the study's implications extend beyond the Nigerian context, offering valuable lessons and insights to an international audience. It explores equating methods, such as Haebara and Stocking-Lord, which can guide countries facing challenges in comparing test scores. Ensuring fairness and comparability in test scores is a concern shared worldwide, and educational systems can learn from this study's approach to interpreting scores accurately. The study's use of IRT to assess test score comparability emphasizes the importance of employing rigorous methodologies to validate assessments. Policymakers and educational institutions around the world should consider the significance of equating methods in maintaining sound assessment practices and providing equitable opportunities for students. Moreover, the findings contribute to discussions on fairness and equity in education at a global level. Fair test scores are crucial when making decisions about student performance, admissions, and educational opportunities. By showcasing practices in equating, this study provides guidance for countries seeking to enhance their own assessment procedures and ensure precise interpretations of test scores. International collaboration in research is vital, as sharing knowledge and experiences regarding equating assessments can foster a more comprehensive understanding of test score comparability globally. In conclusion, while the study's primary focus is on Nigeria, its implications and messages are relevant to educational systems worldwide; by sharing these insights, this research has the potential to enhance assessment methods and foster equal educational opportunities beyond Nigeria's boundaries.

7 Limitations and future research

The sample used in this study provides valuable insights into the research question, even though the scores obtained from the sampled Nigerian students are not representative of all Nigerian students. To further validate the findings, future research on this topic should consider a larger sample that includes students from other geopolitical zones in Nigeria. Future studies should also examine other data collection designs and equating methods within Item Response Theory (IRT) to compare scores from other subjects administered by public testing agencies. In addition, future work should establish the standard errors of the equating coefficients and take into account the differences between observed and true scores. Future findings would be more robust if these considerations were taken into account.