1 Introduction

Beliefs, attitudes, and expectations are psychological constructs studied by the discipline-based education research (DBER) community, including the physics education research (PER) community [1,2,3]. Affective constructs are frequently assessed with rating-scale questionnaires consisting of several behavioral statements. In contrast to physical measurement, attitudinal measurement involves a latent structure of person traits that PER researchers can only measure indirectly. Several research-based assessments (RBAs) have therefore been developed to measure attitudinal variables in physics teaching and learning. These tools are archived in PhysPort [4], a web-based repository maintained by the PER community.

The latent nature of attitudinal variables leads scholars to adopt categorical scales that are well established in the field of psychological assessment. Among these, the Likert [5] scale has been the most widely implemented across the social sciences and humanities, including the PER community [4, 6]. Most attitudinal RBAs disseminated by the PER community employ five categories, ranging from strongly disagree (1) to strongly agree (5), to measure students' attitudes, beliefs, and expectations. The Likert [5] scale is widely adopted as an attitudinal scale because it is comparatively easy to follow, implement, disseminate, understand, and interpret.

In practice, PER scholars sometimes convert the 5-point Likert scale into a different number of categories [7,8,9,10], for instance by collapsing it into a 2-point or 4-point format. A recent example is the validation study of the German version of the Colorado Learning Attitudes about Science Survey for Experimental Physics (E-CLASS) by Teichmann et al. [11]. E-CLASS is a questionnaire with a 5-point Likert scale that measures students' attitudes toward experiment-based physics learning. Teichmann et al. [11] collapsed the original format when reporting E-CLASS results to faculty members and students, arguing that this scale collapsing facilitates easier interpretation of the E-CLASS categorical data for practical purposes.

Differing category numbers on an attitudinal physics assessment were further studied within the Colorado Learning Attitudes about Science Survey (CLASS) [12]. CLASS measures students' attitudes toward physics learning through 41 attitudinal statements rated on a five-point scale. The CLASS developers retained this number of categories even though prior cognitive interviews revealed inconsistencies in how the scale was interpreted [7]. In practice, Adams and colleagues therefore recommended that the neutral response of the CLASS scale be discarded, collapsing CLASS into a dichotomous (agree-disagree) format. Van Dusen and Nissen [13] took a different position on the importance of the neutral category in CLASS, suggesting that collapsing to a 3-point format, or retaining the 5-point format, is preferable to the dichotomous format proposed by Adams et al. [7]. Van Dusen and Nissen compared CLASS data scored with five and with two categories. Their findings showed that including the neutral category in the 5-point scoring produced higher pretest and posttest CLASS scores, smaller standard deviations, and larger Cohen's d effect sizes.

Studying the psychometric behavior of different category numbers is the novel contribution offered by this paper. To the best of our knowledge, the question of the optimal number of categories in an attitudinal scale remains open. Admittedly, this question has attracted psychometricians for more than 70 years [14, 15], and investigations grounded in modern measurement theory (e.g., item response theory, IRT) are ongoing, for instance by Lee and Paek [16] and, more recently, Kutscher and Eid [17]. Nevertheless, the research community has reached no consensus about the optimal number of categories to adopt when using a Likert rating scale.

Methodologically, the behavior of different category numbers can be studied through two approaches: clinical experimentation [17,18,19] or computer simulation [20, 21]. Clinical experimentation has the advantage of capturing psychological aspects of participants that a simulation study cannot explain. On the other hand, a simulation study is easier to implement because it avoids the large participant pools required by experimental comparisons of different category numbers. Moreover, a manipulated intervention naturally introduces experimental effects that should be avoided, particularly memory effects: participants are likely to produce unreliable data because they remember their previous responses during repeated measurement under different category numbers. In a simulation study, by contrast, participants are assumed to respond identically across different category numbers, so the studied effect can be attributed solely to the experimental setting. Recent studies have reported that different category numbers affect raters' ability to discriminate among their responses [22]. Even though the conclusions of a simulation study are drawn from artificial data, the underlying distribution of the generated data can be controlled by appropriate simulation procedures. On the basis of underlying distributions carefully designed around our research question, we explored the psychometric behavior of five different category numbers (3, 5, 7, 9, and 11-point) as a case of attitudinal measurement in PER.

Previous work has calculated several statistical measures to examine the potential effect of different category numbers: reliability [18], discrimination index [20, 23,24,25,26,27], criterion validity [22], factor analysis [19], inferential statistics [21], polytomous item response theory [17], the Rasch model [24, 28], structural equation modeling [29], and information theory [15, 30, 31]. In this literature, reliability is the most commonly reported measure for evaluating the psychometric behavior of different category numbers. Since the development of item response theory (IRT) around the 1960s, psychometric studies have increasingly adopted the IRT framework to evaluate the properties of a given category number. The categorical nature of Likert rating-scale data calls for a polytomous modeling framework. Theoretically, IRT offers several advantages for item-level analysis over reliability measures derived from classical test theory (CTT) [16, 32, 33].

Nevertheless, the literature remains inconclusive about the impact of different category numbers on psychometric properties. The simulation study by Green and Rao [30] recommended a minimum of 6 or 7 response categories because they found information degradation when the number of categories was reduced. In contrast, the classical work of Miller [15], comparing several experimental results on absolute judgment, stands against the findings of Green and Rao [30]: Miller found that humans tend to reach their maximum channel capacity at seven categories, and that adding assessment dimensions can improve a person's ability to distinguish information in terms of channel capacity. In the context of reliability, Bendig [14] and Komorita [6] found that reliability was unaffected by the number of response categories they studied. Komorita [6] conceded that changing to a dichotomous format did not significantly reduce the reliability and information of a polytomous measurement. In line with Jacoby and Matell's [34] study, these authors also suggested that reliability should not be the psychometric measure used to evaluate different category numbers, since it does not depend on the number of categories investigated.

Given the above, the existing literature offers no firm conclusion about the psychometric behavior of different Likert scale lengths. In this paper, we extend earlier findings and update the established picture using one of the polytomous IRT frameworks, the Graded Response Model (GRM) [35]. Using a simulation study, we investigate the effect of different category numbers on psychometric properties calculated through the lens of the GRM framework. Three psychometric measures investigated through the GRM (discrimination index, item location, and the test information function) explain the behavior of the different category numbers. Our findings add new information to a previously inconclusive picture. Within the PER community in particular, our results contribute practical recommendations for the measurement of attitudes, beliefs, and expectations toward physics teaching and learning. The item-level information calculated within the IRT framework provides more detailed parameters than the statistical measures used by earlier researchers.

2 Method

2.1 Simulation design

This paper reports an exploratory study using a simulation approach. A simulation study is designed to evaluate statistical procedures through computational experiments. First, artificial data are generated for k items (observed variables) and n records under a specified distribution corresponding to the statistical procedure under investigation [36]. The statistical method under study is then applied to the generated data, and the effect of the experimental conditions on observable statistical measures is evaluated. These steps are repeated for M replications. The observed measures from each experiment are then aggregated and interpreted to answer the research questions addressed in the study.

The simulation reported in this paper was performed in the R language, version 4.1.3 [37]. Two primary packages were employed: "GenOrd" [38] and "mirt" [39]. The former was used to generate categorical data from a Gaussian copula that follows a specified inter-item correlation matrix and marginal distribution (mimicking attitudinal measurement with a Likert-type scale, as frequently employed in PER). The latter was used to fit the Graded Response Model (GRM) [35]. The "GenOrd" package can handle any discrete random variable with ordinal categories, as is characteristic of attitudinal measurement [8]. The GRM, in turn, corresponds to the "graded" nature of the Likert rating scale [16] and reports two item parameters, the discrimination index (a) and the item location (b). Other models, such as the Rasch model, do not estimate a discrimination parameter; the GRM was therefore better suited to capture the psychometric behavior of the different category numbers investigated in this study.
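To make this concrete, the sketch below mimics one replication of the pipeline for a 5-point scale. The off-diagonal correlation of 0.5 and the uniform marginal category probabilities are illustrative placeholders, not the exact specifications of our simulation.

```r
# Minimal sketch of one replication (5-point scale); the correlation value
# and uniform marginals are illustrative placeholders, not our exact settings
library(GenOrd)  # ordinal data generation via a Gaussian copula
library(mirt)    # Graded Response Model (GRM) estimation

k_items <- 10    # items (observed variables)
n_resp  <- 1000  # respondents (rows)
n_cat   <- 5     # 5-point Likert scale

# Inter-item correlation matrix (placeholder: 0.5 off-diagonal)
Sigma <- matrix(0.5, k_items, k_items)
diag(Sigma) <- 1

# Marginal distribution: cumulative probabilities of the first n_cat - 1
# categories; uniform category probabilities are assumed here for brevity
marginal <- rep(list(cumsum(rep(1 / n_cat, n_cat - 1))), k_items)

set.seed(123)
dat <- ordsample(n_resp, marginal, Sigma)  # categories coded 1..n_cat

# Fit a unidimensional GRM and read off the item parameters (a, b1..b4)
fit <- mirt(as.data.frame(dat), model = 1, itemtype = "graded", verbose = FALSE)
coef(fit, IRTpars = TRUE, simplify = TRUE)$items
```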

This study aimed to investigate the extent to which five different category numbers (3, 5, 7, 9, and 11 points) affect psychometric properties viewed through the lens of the GRM framework. In the simulation design outlined above, the number of categories served as the independent condition, while the psychometric properties calculated from the GRM framework were the dependent variables. Two kinds of statistical properties were studied through the GRM. First, the item parameters: the discrimination index (a) and the item location (b). Second, the test information function (TIF) curve, which is frequently used to assess the extent to which a measurement model provides sufficient information across the spectrum of respondents' latent traits (attitudes) regarding the measured construct. The TIF curve can thus be used to compare the information provided by each category number, and these psychometric parameters together characterize the behavior of the category numbers studied here.
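For reference, the GRM expresses the probability that a respondent with latent trait \(\theta\) selects category k or higher of item i through a boundary response function with discrimination \(a_i\) and threshold \(b_{ik}\):

\[
P^{*}_{ik}(\theta) = \frac{1}{1 + \exp\left[-a_i\left(\theta - b_{ik}\right)\right]}, \qquad
P_{ik}(\theta) = P^{*}_{ik}(\theta) - P^{*}_{i,k+1}(\theta),
\]

with the conventions \(P^{*}_{i1}(\theta) = 1\) and \(P^{*}_{i,m+1}(\theta) = 0\) for an item with m ordered categories, so that an m-point item carries one discrimination parameter and m − 1 ordered thresholds.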

2.2 Simulation procedure

Figure 1 summarizes our procedure for studying the IRT properties of different category numbers through the GRM framework. First, we ran the experiment for the 3-point Likert scale: categorical data were generated for 10 items and 1000 respondents (rows) (Fig. 1a), repeated for 1000 replications. Second, we fitted the GRM to each artificial dataset (Fig. 1b). Third, we extracted psychometric properties from the GRM in terms of two statistical measures: the item parameters (a, b) and the test information function (TIF) (Fig. 1c). We then iterated these stages for the remaining category numbers (5, 7, 9, and 11-point) consecutively, as sketched below. Overall, 5000 datasets were generated and 5000 GRM models were fitted throughout the simulation study.
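The full design can be sketched as the following loop; gen_data() abbreviates the GenOrd call shown earlier, and its settings remain illustrative placeholders rather than our exact specification.

```r
# Sketch of the full simulation: 5 scale lengths x 1000 replications each
# (5000 datasets and GRM models); settings are illustrative placeholders
library(GenOrd)
library(mirt)

cats <- c(3, 5, 7, 9, 11)  # studied category numbers
M    <- 1000               # replications per category number

gen_data <- function(n_cat, k_items = 10, n_resp = 1000, rho = 0.5) {
  Sigma <- matrix(rho, k_items, k_items); diag(Sigma) <- 1
  marginal <- rep(list(cumsum(rep(1 / n_cat, n_cat - 1))), k_items)
  ordsample(n_resp, marginal, Sigma)
}

results <- lapply(cats, function(n_cat) {
  replicate(M, {
    dat <- gen_data(n_cat)
    fit <- mirt(as.data.frame(dat), 1, itemtype = "graded", verbose = FALSE)
    coef(fit, IRTpars = TRUE, simplify = TRUE)$items[, "a"]  # keep, e.g., a
  })
})
```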

Fig. 1 Simulation procedure of the study

The data distribution and internal structure for each studied category number were constrained by a specified inter-item correlation matrix and marginal distribution, as required by the "GenOrd" package [38]. We therefore specified the same correlation matrix, and the corresponding marginal distribution, for every studied category number. Interested readers may consult Barbiero and Ferrari [38] for the mechanics of the GenOrd package, which mimics random Likert-type data using a Gaussian copula approach. This constraint fixes the internal structure of the generated data, so that any observed effect stems solely from the different category numbers of the attitudinal measurement. The "GenOrd" package applied the marginal distribution when generating the Gaussian-copula categorical data for the five category numbers; our categorical data start at one and end at the maximum point of each category number. The generated data and modeling results were stored via R's serialization mechanism, using the R data serialization (.rds) file format provided by the base R environment.
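For instance, each replication's data and fitted model can be persisted with base R serialization; the file names below are illustrative only.

```r
# Persist one replication's data and fitted model (illustrative file names)
saveRDS(dat, file = "data_5point_rep001.rds")
saveRDS(fit, file = "grm_5point_rep001.rds")

# ...restored later during aggregation
dat <- readRDS("data_5point_rep001.rds")
fit <- readRDS("grm_5point_rep001.rds")
```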

2.3 Data analysis

Subsequently, the GRM was fitted for each category number (3, 5, 7, 9, and 11-point) and iterated over 1000 replications (M = 1000); in total, one thousand GRM models were built for each number of categories (Fig. 1). Statistical measures calculated from the GRM models were extracted and aggregated for each category number using the mean and standard deviation, to capture the psychometric behavior over the 1000 replications [36]. The results are described in Table 1 and depicted in Fig. 2 below. From these results, we interpreted the investigated psychometric measures, namely the item parameters and the test information function (TIF) curves, to answer the extent to which different category numbers affect scale behavior (Fig. 1c). For item location (b), the GRM estimates ordered thresholds for each item [35]. In Table 1 we report a single location index for ordinal polytomous IRT models, as proposed by Educational Testing Service (ETS) researchers [40]: the average of the item category locations, which takes all item category response functions (ICRFs) into account. This index is identical to the item location of Andrich's rating scale model and to the location parameter b produced by the PARSCALE software. For replicability, the R code of this project has been uploaded to a Git repository [41].
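A sketch of this aggregation step follows, assuming the coef() output shown earlier; one simple variant of such a single-location index is the plain average of the ordered thresholds b1, ..., b(m−1), used here for illustration.

```r
# Collapse each item's ordered thresholds into a single location, then
# aggregate a and b across replications (means and SDs as in Table 1)
summarise_grm <- function(fit) {
  p <- coef(fit, IRTpars = TRUE, simplify = TRUE)$items
  b_cols <- grep("^b", colnames(p))
  data.frame(a = p[, "a"], b = rowMeans(p[, b_cols, drop = FALSE]))
}

# Given a hypothetical list `fits` of the 1000 GRM models for one scale length:
# per_rep <- lapply(fits, summarise_grm)
# a_bar   <- sapply(per_rep, function(x) mean(x$a))
# c(mean = mean(a_bar), sd = sd(a_bar))  # one cell of Table 1
```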

Table 1 Psychometric properties of five different Likert scale lengths through GRM
Fig. 2 Test information function (TIF) of the five Likert scale category numbers

3 Results

This study aimed to compare the IRT properties of five different category numbers through the lens of the GRM framework. This section explains our aggregated findings and discusses them in relation to the research questions of this simulation study. Through the GRM framework, psychometric properties at the item level and the test level are explored and compared for each category number. Table 1 presents the aggregated IRT properties calculated from the 10 generated items for each category number. Each item parameter reported in Table 1 is the mean over the 1000 replications for that category number, with the corresponding standard deviation in parentheses.

The first trend across the studied category numbers concerns the item fit measured by the chi-square (\({\chi }^{2}\)) statistic. The generated Likert rating-scale items fit the GRM well, with \({\chi }^{2}\) p-values exceeding the significance level (\(\alpha\) = 0.05). The root mean square error of approximation (RMSEA) also gradually decreased (indicating better fit) as the number of categories grew, with a significant difference in RMSEA among the observed category numbers (F(4, 49) = 25.877, p < 0.05). Both the chi-square (\({\chi }^{2}\)) and RMSEA measures showed evidence of good fit between our generated data and the GRM framework, even under the manipulation of category number. One can conclude that the GRM framework selected for this study was appropriate for investigating the properties of the simulated polytomous items.
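Such fit statistics can be obtained with mirt's item- and test-level fit functions; a hedged sketch follows (the exact statistics we tabulated may differ from these defaults).

```r
# Item-level fit: S-X2 chi-square statistics with p-values and RMSEA per item
itemfit(fit, fit_stats = "S_X2")

# Test-level fit with RMSEA; C2 is a common choice for polytomous data
# when the full M2* statistic is not identified
M2(fit, type = "C2")
```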

Second, Fig. 2 presents the test information functions (TIFs) of the five category numbers. Each TIF curve in Fig. 2 is the information value averaged over the 1000 replications for that category number. The solid line shows the aggregated information function; the dashed line shows the averaged standard error of measurement (SEM), which is the square root of the reciprocal of the information function [16]. The intersection of the TIF and SEM curves plotted on the same scale (Fig. 2) can be interpreted as the range over which the measurement model provides sufficient information to gauge the studied latent trait (attitude). The vertical axis represents the amount of information available for estimating students' latent traits: the greater the information function, the more information the categorical observed items provide. In other words, the reliability of the latent trait measurement for each Likert scale length is highest where the TIF reaches its peak, corresponding to what classical test theory (CTT) would describe as a more reliable assessment tool.
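Formally, the test information is the sum of the item information functions over the k items, and the SEM is the reciprocal square root of the test information (in mirt, testinfo() evaluates \(I(\theta)\) over a grid of \(\theta\) values):

\[
I(\theta) = \sum_{i=1}^{k} I_i(\theta), \qquad
\mathrm{SEM}(\theta) = \frac{1}{\sqrt{I(\theta)}}.
\]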

The test-level properties of each category number admit three interpretations derived from the TIFs reported in Fig. 2: the tail range, the peak, and the intersection between the TIF and SEM curves. First, the left tails of the 3, 7, 9, and 11-point curves coincide with each other, whereas the right tails show a different pattern: on the right, the 3-point curve coincides with the 5-point curve, while the 7, 9, and 11-point curves coincide as one group. This suggests different properties at the left and right boundaries of the TIFs. To test the differences among these TIF tail patterns, one-way ANOVA was applied to three TIF groupings: TIF3–7–9–11 (left), TIF3–5 (right), and TIF7–9–11 (right). The results indicated no significant difference in the aggregated information values within each grouping (TIF3–7–9–11: F(3, 796) = 2.46, p > 0.05; TIF3–5: F(1, 398) = 1.46, p > 0.05; and TIF7–9–11: F(3, 796) = 0.024, p > 0.05). Visually, Fig. 2 shows that the 3-point TIF resembles the 5-point TIF, suggesting that similar information can be obtained from those two scales, while the 7-point TIF more closely resembles the 9 and 11-point TIFs; the visual difference between the 3 and 5-point curves is nevertheless larger than that among the 7, 9, and 11-point curves.
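Each of these comparisons amounts to a one-way ANOVA on the aggregated information values; a minimal sketch, with a hypothetical data frame layout and placeholder values, is shown below.

```r
# One-way ANOVA comparing aggregated TIF values across category numbers;
# `tif_tail` is a hypothetical layout with placeholder values
tif_tail <- data.frame(
  info  = c(rnorm(200, 5.0, 1), rnorm(200, 5.1, 1)),
  scale = factor(rep(c("3-point", "5-point"), each = 200))
)
summary(aov(info ~ scale, data = tif_tail))
```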

Furthermore, the peak of the TIF also characterizes the test-level psychometric properties of the five category numbers. The five TIFs show distinct shapes and peaks. The 3 and 7-point TIFs exhibit a trough in the middle, i.e., an information value lower than the highest peak. For these two scales, the trough indicates a possible lack of information for measuring the latent trait spectrum around zero (the neutral position). A plausible scale should instead provide more information in the middle range of the latent trait: under the assumption of a normal distribution for the attitudinal measurement, the largest proportion of raters (students) falls in this range, so measurement tools should carry relatively more information there to distinguish among participants' latent traits. In addition, the 3-point Likert scale achieved a higher information peak than the 9 and 11-point scales. The peak of the TIF curve represents the maximum information the measurement tool can provide; nevertheless, the peak of the 3-point TIF may be unstable because of the trough of information around the average respondent's ability.

Finally, Fig. 2 shows that the intersections of the TIF and SEM curves for the 3 and 5-point scales span a narrower region than those for the 7, 9, and 11-point scales. The measurement spectrum of the 3 and 5-point scales ranged from about − 3 to 2, whereas the 7, 9, and 11-point scales could measure participants with attitudes between − 3.5 and 3.5. The intersection points between the TIF and SEM curves delimit the spectrum of participants' attitudes that each category number can measure accurately without systematic error. Thus, the 3 and 5-point scales provide less sufficient information for measuring the latent traits than the 7, 9, and 11-point scales.

4 Discussion

The results reported in this paper explore the psychometric properties of five category numbers through the GRM framework. Two kinds of statistical measures were examined: the item parameters (discrimination index and item location) and the TIF curves at the test level. First, we found no difference in the discrimination index (a) among the 3, 5, 7, 9, and 11-point scales (Table 1). This pattern is in line with previously published work by Weng [27] and Lee and Paek [16]. Weng [27] argued that finer scale points yielding discrimination indices beyond respondents' ability to discriminate are useless and contribute to measurement error through spurious scale differences. In other words, increasing the number of categories does not by itself improve participants' ability to distinguish between scale points. Working within the IRT framework, Lee and Paek [16] reinforce both Weng's result and ours: they argued that a smaller number of categories can measure as effectively as a greater number, as long as the items have good discriminatory power. According to Ebel's criterion for well-discriminating items (a > 0.3) [42], Table 1 shows that our items discriminate well among the attitudes expressed by students.

The pattern observed for the discrimination index relates to test reliability, consistent with earlier studies [14, 43]. Bendig [14] argued that discrimination indices represent how well a test discriminates across the spectrum of participant abilities, a concept tied to reliability, which describes the test's ability to differentiate students. In addition, Dawis [43] corroborated Bendig's view that reliability is closely related to the item-total score correlation. Recall from the methodology section that the artificial categorical data mimicking attitudinal measurement were generated under the same underlying correlation matrix for all category numbers. The arguments of Bendig [14] and Dawis [43] thus support our simulation design: the pattern of the discrimination index (a) reflects reliability, which in turn corresponds directly to the underlying correlation matrix specified in this simulation. Because the correlation matrix was fixed prior to data generation, the reliability of our data was guaranteed to be the same across conditions, which in turn produced the same discrimination indices in Table 1. On this basis, we conclude that the discrimination index (a) is unaffected by the number of categories.

However, not all researchers take the same position as our study; previous works have reported competing results, for example Preston and Colman [44], Ilhan and Guler [28], and Alan and Kabasakal [23]. These authors found findings opposing the present study, arguing that the smaller the number of categories, the less able the items are to distinguish among students. Using a discrimination coefficient, Preston and Colman [44] examined the pattern of discrimination across 11 sets of questionnaires and found that a greater number of response categories (up to 101) made the test more capable of distinguishing raters' abilities to respond to the questionnaire. Ilhan and Guler [28] studied a similar pattern through the Rasch model, calculating point-biserial correlations to judge the discriminatory power of the items; they found that the 5-point scale obtained the highest point-biserial values compared with greater and smaller numbers of categories. Within the context of childhood education, Ilhan and Guler [28] investigated three different category numbers (2-point, 3-point, and 4-point) and evaluated their behavior in assessing school attachment for children and adolescents through the Rasch model. Similar to Preston and Colman [44], Ilhan and Guler [28] reported that discriminatory power increased with the number of response categories, up to the 4-point Likert scale. However, the context of that study should be taken into account, since the children they investigated differ in ability from the participants of Alan and Kabasakal [23] and Preston and Colman [44]. Although these reports found better discrimination indices on the 5-point scale, Weng [27] argued that there was no substantial difference in test-retest reliability between the two studied scales.

Moreover, our simulation study found that the number of categories affects the item location (b) of polytomous items (see Table 1). Item location is one measure of the quality of an attitudinal measurement: it is the trait level at which participants endorse the items. Our results show that a growing number of response categories corresponds to a more positive item location, a pattern indicating that respondents express their agreement with the items more positively. This finding is in line with the previous study by Daher et al. [24]. One could also highlight that the range of the item locations shrinks as the number of scale points is reduced. According to Chang [45] and Nunnally [46], this diminishing variance in the data is associated with the separation index, which should be as high as possible; the separation index expresses how well participant attitudes and item locations can be classified into distinct clusters.

Finally, the TIF curves visualized in Fig. 2 can be interpreted within the framework of information theory widely used by earlier scholars. Our results indicate that maximum information over the largest measurable ability range begins at the 7-point scale, and that greater scale lengths, up to 11 points, offer more stable information for measuring the spectrum of the latent attitude construct. We judged the stability of the information from the behavior of the information peaks of the TIF curves: three scale lengths (the 3, 7, and 9-category scales) have relatively unstable information peaks, while the 5-point scale shows a peak as stable as that of the 11-point scale (see Fig. 2). The 11-point scale, however, covers a wider spectrum of latent abilities than the 5-point scale. These results agree with previous studies by Kutscher and Eid [17] and Weng [27]. Kutscher and Eid [17] take the same position as this paper in interpreting the peak of the TIF curve: at the information peak lies the region of the latent continuum that raters/participants are most likely to endorse. Scale lengths such as 3, 7, and 9 categories lose information in the middle range of ability and can thus contribute to measurement error. Garner [26] even reported that maximum information could be achieved when the scale has more than 20 categorical points, and Weng [27] agreed that decreasing the number of scale points is associated with decreased information about individual differences and, consequently, reduced reliability.

As a final remark, this simulation study, with its specified marginal distribution and inter-item correlation matrix for each generated dataset, has examined the psychometric properties of five different category numbers. The information revealed here has implications for the practice of scale collapsing when measuring attitudes, beliefs, and expectations within the PER community and in psychoeducational measurement more broadly. Our results suggest that collapsing the scale points to a smaller number can destabilize the measurement properties of the latent construct. The information that can be recovered from a reduced scale should be handled carefully, since the available information decreases as the number of scale points shrinks. The IRT-GRM framework implemented in this study captured the effect of different category numbers on psychometric properties at both the item and the test level.

Admittedly, this simulation study reports analytical results within the constraints of the underlying distribution implemented by the software package used. This choice is a potential pitfall of our study: many facets of uncertainty arise from the dynamics of behavioral traits, and multiple factors may account for the psychometric behavior of different category numbers in an attitudinal measurement. The results demonstrated in this paper should therefore be generalized with caution, and interested scholars may address this concern in future investigations.