Higher education stakeholders are actively seeking effective teaching practices to improve student learning and overall success. Among these efforts, High-Impact Practices (HIPs) have emerged as powerful educational practices, supported by studies consistently showing a positive correlation between students’ participation in HIPs and their learning outcomes (e.g., Finley & McNair, 2013; Kuh, 2008). Researchers at educational organizations have offered varied perspectives on specific categories of HIPs. For example, the American Association of Colleges and Universities (2024) defines HIPs as educational practices that show evidence of academic advantages for students who engage in them, identifying 11 practices: capstone projects, collaborative assignments, common intellectual experiences, diversity/global learning, e-Portfolios, first-year seminars, learning communities, service learning, internships, undergraduate research, and writing-intensive courses. The Center for Community College Student Engagement (CCCSE, 2014) identified 13 HIPs, including learning communities, supplemental instruction, tutoring, and others. Despite variations in HIP categorizations based on institutional contexts, these practices consistently yield positive student outcomes, including personal and practical gains (Finley & McNair, 2013; Kuh, 2008), as well as improved course completion and persistence (CCCSE, 2014).

While HIPs generally correlate positively with student learning outcomes, a longitudinal study by Kilgo and colleagues (2015) has singled out undergraduate research as particularly impactful within the category of HIPs, consistently demonstrating a range of positive outcomes compared to other HIPs. Moreover, studies have consistently shown a positive relationship between undergraduate research participation and improved learning outcomes (e.g., Hunter et al., 2007; Kilgo et al., 2015; Kuh, 2008; Lopatto, 2007). Recognizing these benefits has spurred researchers and educators to delve deeper into the factors that make undergraduate research high impact, particularly focusing on its quality aspects (Kuh & O’Donnell, 2013; Zilvinskis, 2019). Specifically, researchers have focused on evaluating the quality of students’ undergraduate research experiences using a range of survey methodologies. Surveys offer advantages such as broad coverage of learning domains and longitudinal or interdisciplinary data collection compared to methods like interviews or focus groups (Laursen, 2015).

Several survey instruments have been utilized for assessing the quality of the undergraduate research experiences, some of which are grounded in the eight quality components delineated by Kuh and O’Donnell (2013), including time and effort investment and interactions with faculty and peers. For example, using the data from the National Survey of Student Engagement (NSSE), studies (e.g., Kinzie et al., 2020; Zilvinskis, 2019) have investigated students’ experiences with undergraduate research and its association with students’ learning outcomes. In addition to surveys centering on these eight quality elements, researchers have developed survey items to assess and evaluate students’ undergraduate research experiences, such as the Higher Education Research Institute (HERI) College Senior Survey (2024) and the Project Ownership Survey (POS; Hanauer & Dolan, 2014). Other researchers have developed surveys that specifically target certain disciplines, such as Lopatto’s (2004) Survey of Undergraduate Research Experience (SURE) and the Undergraduate Research Student Self-Assessment (URSSA) developed by Hunter and colleagues (2009). These surveys provide multifaceted insights into students’ undergraduate research experiences, offering educators and faculty valuable information to enhance both course-based and extracurricular research opportunities.

While recognizing the value of these surveys, it is crucial to evaluate their psychometric properties to ensure that survey items function as intended. Studies have highlighted the significance of ensuring the validity of survey items to accurately reflect students’ undergraduate research experiences. For example, Weston and Laursen (2015) examined the validity of URSSA and suggested revisions to its items. Similarly, Hanauer and Dolan (2014) investigated the psychometric properties of POS and identified areas for survey item improvement in newer versions. In reviewing existing literature on students’ undergraduate research experiences and learning outcomes, studies have relied heavily on Kuh and O’Donnell’s (2013) eight quality elements when exploring the relationships between undergraduate research quality elements and student learning outcomes (e.g., Loker & Wolf, 2023; Zilvinskis, 2019). However, limited attention has been given to the construct validity of these quality elements. This gap leaves uncertainty regarding the precise and effective measurements of students’ undergraduate research experiences. Given the pivotal role of evaluating student learning experiences in promoting student success and the widespread use of the NSSE, the present study aims to bridge the gap by employing psychometric analyses—including Exploratory Factor Analysis (EFA), Parallel Analysis (PA), Item Response Theory (IRT), and Differential Item Functioning (DIF)—to assess the construct validity of the quality of students’ undergraduate research experiences.

Literature Review

Quality of Undergraduate Research Experience

Extant literature on the quality of undergraduate research experience has underscored its positive correlation with students’ learning outcomes, demonstrating numerous academic benefits. For example, studies have consistently found a positive relationship between undergraduate research participation and higher GPAs (Brown et al., 2020), the development of problem-solving skills (Seymour et al., 2004) as well as a heightened tendency among students to express interest in pursuing graduate programs (Eagan et al., 2013; Lopatto, 2007; Russell et al., 2007; Seymour et al., 2004). Additionally, undergraduate research involvement is associated with enhanced critical thinking abilities (Ishiyama, 2002; Kilgo et al., 2015; Thiry et al., 2011), technical competencies (Lopatto, 2004), and confidence in research skills (Russell et al., 2007; Seymour et al., 2004). Furthermore, Kilgo and colleagues (2015) found numerous benefits of undergraduate research, including favorable literacy attitudes, cognitive engagement, and intercultural effectiveness. Given the significance of these benefits, understanding the quality of undergraduate research experiences becomes crucial as students embark on their undergraduate research journey.

Weston and Laursen (2015) identified various factors shaping the quality of the undergraduate research experience, such as those delineated in the URSSA, which assesses students’ abilities and gains in scientific practices, confidence in research, academic skills, and attitudes toward scientific work. They identified four core constructs related to cognitive, affective, and skill-based outcomes: thinking and working like a scientist, personal gains related to research work, skills, and attitudes and behaviors. Using these constructs, Thiry et al. (2012) discovered that experienced students reported advanced personal, professional, and cognitive outcomes compared to novice students, underscoring the importance of developing intellectual skills and behaviors. Besides the URSSA survey, the HERI College Senior Survey identified two constructs used to examine students’ undergraduate research experiences: science self-efficacy and science identity. The science self-efficacy construct assesses students’ confidence in their ability to conduct scientific research, while the science identity construct examines the extent to which students see themselves as scientists (HERI, 2017).

Course-based Undergraduate Research Experiences (CUREs) provide students with opportunities to gain research experience. Auchincloss and colleagues (2014) described CUREs as involving scientific practices, inquiry and discovery, broader relevance, collaboration, and iteration. To evaluate students’ experiences in CUREs, various assessment tools have been developed. One such tool is the Classroom Undergraduate Research Experience Survey, which builds on Lopatto’s (2004) SURE survey and examines three aspects of students’ course experiences: (a) course activities information provided by the instructor; (b) a pre-course student survey covering reasons for taking the course, expectations, experiences with course elements, science attitudes, and learning styles; and (c) a post-course student survey assessing learning gains, attitudes toward science, and educational and career interests (Lopatto, 2010).

In addition to the undergraduate research experience assessed in the Classroom Undergraduate Research Experience Survey, researchers have developed a model to assess effective CUREs, drawing upon Lave and Wenger’s (1991) Situated Learning Theory (e.g., Auchincloss et al., 2014; Corwin Auchincloss et al., 2015). Auchincloss and colleagues (2014) outlined various aspects of CUREs, such as interacting with faculty and working in groups to collect data, analyze results, and evaluate work. They identified benefits such as the development of collaboration skills, science identity, and research skills, increased interaction with others, and retention in the major. These benefits manifest over the short, medium, and long terms, indicating the enduring impact of engaging in CUREs on students’ academic and professional trajectories. Moreover, Shaffer and colleagues (2014) identified substantial time commitment as another important element of CUREs and emphasized the necessity of this aspect for fully assessing students’ experiences in CUREs.

Beyond the components discussed above, researchers have identified several other crucial components of undergraduate research experiences. For example, Hunter and colleagues (2007) conducted a qualitative study in which they interviewed students involved in undergraduate research to explore the benefits of their experiences. They found that students gained a variety of advantages, including opportunities to present at conferences, acquire career-related knowledge and skills, gain practical work experience, collaborate with faculty and peers, improve their writing, and apply their knowledge in real-world settings. Additionally, the study revealed enhancements in students’ conceptual understanding, a deepening of their disciplinary knowledge, and a greater sense of connection within and between scientific fields. Similarly, Kuh and O’Donnell (2013) outlined eight critical elements of HIPs as indicators of implementation quality: High-Performance Expectations, Time and Effort Investment, Interactions with Faculty and Peers, Experiences with Diversity, Constructive Feedback, Reflective and Integrative Learning, Real-World Applications, and Public Demonstration of Competence.

Building on Kuh and O’Donnell’s (2013) eight quality elements, researchers have conducted in-depth investigations into students’ undergraduate research experiences and their relationships with quality elements and student outcomes. For instance, through a systematic literature review, Kinzie and colleagues (2020) highlighted that only seven of the eight quality elements were correlated with undergraduate research experiences, with no evidence concerning Experience with Diversity. Additionally, investigating the undergraduate research experiences of first-generation, low-income, and racially minoritized students, Loker and Wolf (2023) revealed positive gains across various dimensions of undergraduate research experiences, including Time and Effort Investment, Interactions, Exposure to Diversity, Constructive Feedback, Application in Real-World Contexts, and Public Demonstration of Competence. Furthermore, using data from the 2015 administration of NSSE’s HIP experimental item set, Zilvinskis (2019) examined the quality of undergraduate research experiences and found that High-Performance Expectations, Faculty Feedback, and Real-World Applications were positively associated with student satisfaction and with perceptions of effective teaching and a supportive environment. However, it is important to note that not all of the aforementioned quality elements positively correlate with students’ undergraduate research experiences. For example, Russell and colleagues (2007) found no evidence linking writing final reports or investing time with students’ positive outcomes. Collectively, these studies suggest the importance of examining various aspects of students’ undergraduate research experiences.

Researchers have further scrutinized undergraduate research experiences and their relationship with outcomes across diverse student populations with varying social identities and racial or ethnic backgrounds (e.g., Kinzie et al., 2020; Zilvinskis, 2019). For example, Kinzie and colleagues (2020) found differences in the experiences of White and racially minoritized students. They found that 30% of White students reported having structured opportunities to integrate, whereas this was reported by 40% of racially minoritized students. Additionally, 66% of White students reported gaining applied real-world experience, compared to 71% of racially minoritized students involved in undergraduate research. In another study, Zilvinskis (2019) discovered that Black or African American students who spent more time on undergraduate research perceived less effective teaching and a less supportive environment. Hispanic or Latinx students experienced a negative correlation between effective teaching and real-world applications in their research endeavors. Additionally, Zilvinskis found that first-generation students were less likely to perceive effective teaching positively and tended to have lower GPAs when frequently receiving faculty feedback. In essence, these studies have indicated inconsistencies in the benefits of undergraduate research across student groups and highlight the necessity of accounting for diverse student experiences.

While studies (e.g., Loker & Wolf, 2023; Zilvinskis, 2019) have emphasized the crucial role of quality undergraduate research experiences in enhancing student success and have advocated for integrating these factors into undergraduate research programs to improve student outcomes, two key points need consideration. First, despite the widely used conceptual foundation provided by Kuh and O’Donnell’s (2013) framework, there remains a notable gap in the development and validation of measurement tools to accurately assess these quality dimensions. Second, the influence of the quality elements may vary depending on students’ social identities and racial or ethnic backgrounds. Given the diverse experiences students may encounter during their undergraduate research endeavors, it is critical to evaluate whether these measures may be biased or lack generalizability. Therefore, there is a pressing need to validate these measures of quality experiences through psychometric testing to ensure their effectiveness in assessing students’ undergraduate research experiences and to provide more convincing claims using the survey items.

Psychometric Properties of Undergraduate Research Surveys

Various surveys have been developed to collect data on students’ undergraduate research experiences. For instance, the Classroom Undergraduate Research Experience survey examines three facets of students’ experiences (Lopatto, 2010). The Student Experience in the Research University undergraduate survey (2024) provides detailed insights into students’ experiences at research universities, aiding administrators, policymakers, and scholars in enhancing undergraduate education. Additionally, the POS was designed to measure variations in scientific inquiry experiences (Hanauer & Dolan, 2014). National surveys such as NSSE and the HERI College Senior Survey aim to understand students’ quality of undergraduate research experiences. NSSE’s HIP Quality Topical Modules examine quality concerns related to HIP experiences such as undergraduate research, study abroad, and internships. The HERI College Senior Survey contributes to understanding students’ cognitive and affective growth during college (HERI, 2024). The URSSA survey was aimed at evaluating student outcomes in science-related undergraduate research experiences (Hunter et al., 2009). While these surveys have been developed to collect information on student experiences, studies (e.g., Bowman, 2011; Porter et al., 2011) have highlighted the importance of validity and reliability in self-report survey items for accurately measuring college student experiences.

Auchincloss and colleagues (2014) recognized the usefulness of the Classroom Undergraduate Research Experience survey; however, they raised concerns regarding its uncertain validity due to insufficient evidence of accurately measuring diverse aspects of CUREs. This uncertainty extends to its dimensionality, particularly in discerning correlations between items related to career interests and attitudes toward science, as well as its comprehensiveness across various settings and levels. In contrast, surveys like the URSSA, developed by Hunter and colleagues (2009), underwent rigorous psychometric testing. Their methodology included a comprehensive literature review, evaluation of different undergraduate research programs, and interviews with focus groups. Pilot testing of the URSSA survey involved Confirmatory Factor Analysis (CFA) to assess model fit and reliability. As Weston and Laursen (2015) identified, the URSSA survey includes four core constructs, subjected to validity studies utilizing CFA to align hypothesized factors with empirical student responses. Their examination revealed areas for survey design enhancement, particularly regarding correlations among constructs and students’ aspirations for graduate school attendance. Consequently, they advocated for clearer item wording and reorganization to enhance differentiation between various aspects of students’ undergraduate research experiences.

Other surveys that focus on specific facets of students’ undergraduate research experiences have undergone rigorous psychometric testing during their development process. For example, the HERI College Senior Survey underwent psychometric testing to ensure scoring validity (Sharkness et al., 2010). Their analysis began with EFA, where they explored various factor solutions to determine if the survey items could be explained by underlying factors. Subsequently, they generated a scree plot of the eigenvalues to validate the factor solution. Once the factor solution was determined, they proceeded with IRT, employing Samejima’s (1969) Graded Response Model (GRM). They highlighted the advantages of IRT over Classical Test Theory, including the independence of the “true score” from specific items, the flexible interpretation of scores, and the variation of standard errors of measurement based on different values of theta. Moreover, IRT offers context-independent item parameters and theta estimates, simplifying the estimation of a person’s theta from any relevant set of items.

In addition to the HERI College Senior Survey and the URSSA survey, researchers have drawn on Kuh and O’Donnell’s (2013) eight-element HIP quality framework to assess students’ undergraduate research experiences. This framework has been utilized in surveys (e.g., Kinzie et al., 2020) and in qualitative assessments of course-based undergraduate research (e.g., Loker & Wolf, 2023). NSSE, a nationwide survey, focuses on the college experience of both first-year and senior students at four-year colleges and universities. NSSE staff and researchers have created a psychometric portfolio page on their website that details the survey’s accuracy, validity, and reliability. This page addresses the ten engagement indicator scales, as well as established scales such as perceived gains and sense of belonging (NSSE, n.d.a.). These measures are categorized into five aspects: survey content, response process, internal structure, relations to other variables, and consequential validity evidence, all assessed using various methods (Kuh, 2009). In 2019, NSSE staff initiated a two-phase project titled “Assessing Quality and Equity in High-Impact Practices” aimed at evaluating students’ experiences with HIPs. The first phase involved examining and formulating HIP quality measures using existing NSSE data. The second phase included testing and data collection through cognitive interviews, focus groups, and surveys (Kinzie et al., 2020; NSSE, n.d.b.). The HIP quality item sets were administered as experimental item sets appended to the Spring 2019 survey at 40 institutions. Building on this project, a new topical module was introduced in 2022 to assess quality concerns related to students’ experiences in HIPs. However, without further psychometric testing of survey items developed solely from the framework, the extent to which the module provides a valid measure remains uncertain. Therefore, this study aims to address the lack of psychometric evidence for these survey items.

Purpose of the Study

Establishing sound psychometric properties for items measuring students’ educational experiences is crucial, particularly when educators at colleges and universities rely on these measures to effectively promote student success. Without these psychometric properties, achieving positive outcomes for both students and institutions becomes challenging. More empirical evidence is required to establish the appropriateness of use and the interpretation of the scores. This study examines the construct validity of quality elements of students’ undergraduate research experiences based on Kuh and O’Donnell’s (2013) framework, using NSSE’s HIP Quality module. The study seeks to determine if these eight quality elements effectively evaluate students’ undergraduate research experiences. To guide the exploration, we formulated three research questions as follows:

  1. How closely do the NSSE’s HIP Quality module items measuring undergraduate research quality experiences align with the hypothesized high-impact practice characteristics identified by Kuh and O’Donnell (2013)?

  2. To what degree do the items and scales serve as reliable and valid indicators of undergraduate research quality?

  3. What is the extent of measurement generalizability across various student demographic groups when utilizing DIF analysis to evaluate these HIP quality items?

Methods

Data

Our study utilized data from the 2022 administration of the NSSE’s HIP Quality Topical Module. NSSE is a large-scale, multi-institutional survey that examines first-year and senior student engagement at four-year colleges and universities (NSSE, n.d.c.). The core survey covers a wide range of topics, including student engagement in academic and co-curricular activities, the quality of interactions with faculty and peers, and the perceived gains from their college experience. Additionally, several topical modules are appended to the core survey to explore various pressing topics in higher education (NSSE, n.d.d.). In the following sections, we describe the sample of participants, detail NSSE’s HIP Quality module items used in the analysis, and outline our analytical approach to addressing the research questions.

Participants

The NSSE participating institutions provided lists of eligible students, consisting of first-year and senior students. These institutions then extended invitations, through both email and regular mail, encouraging students’ voluntary participation in the survey. For the HIP Quality module, NSSE staff received responses from 19,489 student participants across 26 colleges and universities. The student participants reported on their HIP experiences, such as study abroad, undergraduate research, and internships. Given that our study focused on the undergraduate research experience, we applied the following criterion to refine our final sample: we included senior students who indicated that their undergraduate research experience was “done or in progress” in the HIP module, resulting in a final sample size of 1,564 participants. For details on the characteristics of the student respondents, refer to Appendix A.

Among the 26 institutions that participated in NSSE, 11 are doctoral universities, nine are master’s institutions, five are baccalaureate colleges, and one falls under another institutional category based on the 2018 Carnegie Classification. Of the 25 institutions with a Carnegie locale classification, 12 are urban, four are suburban, seven are in towns, and two are rural. These institutions vary in size based on undergraduate enrollment, with five small (under 2,500), two medium (2,500–4,999), six large (5,000–9,999), and 12 very large (over 10,000). Additionally, 18 are public, seven are private not-for-profit, and five are Minority-Serving Institutions (three Hispanic-Serving Institutions and two Historically Black Colleges and Universities). Geographically, the institutions are spread across the US, with five in the Northeast, three in the Midwest, seven in the Southeast, two in the Southwest, seven in the West, and one in another US territory.

Instrument and Item

Our investigation into the quality of the undergraduate research experience drew upon relevant literature and the eight quality elements identified by Kuh and O’Donnell (2013). Considering the novelty of NSSE’s HIP Quality module items and the limited theoretical guidance available, we employed a multi-pronged approach in our analysis. We focused on the first section of the HIP Quality module items, which were administered to students who had done or were currently involved in undergraduate research. This set of questions delves into students’ perceptions of their experience in connection with their participation in undergraduate research (NSSE, n.d.d). The survey items cover a range of topics, including diversity, a sense of belonging, reflective and integrative learning, and interactions with faculty, peers, or staff members. The response options for these questions differ, ranging from two to seven response options. Specifically, response options across items include: a binary (yes or no), a four-option (never, sometimes, often, very often), a five-option (not at all, very little, some, quite a bit, very much), and a seven-option (extending from not at all to very much).

To assess the quality of these experiences, we purposefully excluded seven generic items from our assessment because they were not relevant to evaluating the quality of experience. For example, we did not include the question about how long students had been participating in the experience. Similarly, we chose not to include the follow-up item directed at students who responded with “sometimes,” “often,” or “very often” to the item concerning the frequency of their meetings with faculty or staff members at their institutions. This follow-up item aims to gauge the extent to which faculty/staff meetings focused on students’ learning within the experience. We excluded it because the subset of student respondents routed to this follow-up question was not randomly chosen; rather, it comprised those who indicated somewhat frequent meetings with faculty or staff members. This targeted selection has the potential to introduce bias and restricts the generalizability of findings beyond this subgroup (e.g., to students who selected “never” for the previous question). Additionally, the narrow focus of this follow-up item may have rendered it ineffective in capturing meaningful data or insights. Consequently, our subsequent analysis included 31 items. A more detailed explanation of the included items, along with their scoring methodology, is available on the NSSE website (NSSE, n.d.d.).

Analysis

We employed a comprehensive approach to gathering psychometric evidence for structural and internal validity, utilizing exploratory methods (EFA and PA) and confirmatory methods through IRT. Additionally, we conducted a DIF analysis to gather evidence of generalizability. Reliability assessment was based on item and scale information derived from the IRT analyses. We considered guidelines for sample size requirements, particularly in relation to PA, which served as an additional validation step for a single-factor model or set of items prior to proceeding to the IRT analyses. While consensus in the literature has been lacking, Gorsuch (1983) and Kline (2014) suggested a minimum of 100 participants for factor analysis. Moreover, when focusing only on a single-factor model (since we fit PA for each subscale individually), Mundfrom and colleagues (2005) recommended sample sizes ranging from 18 to 60 on average across studied conditions involving a single-factor model. Considering various factors, such as low and high communalities and variable-to-factor ratios, Mundfrom and colleagues found that a sample size of 13–45 was needed to meet a good-fit criterion and 32–150 to meet an excellent-fit criterion. In our study, we divided the sample of 1,564 participants into three random subsets: approximately 50% for EFA, 20% for PA, and the remaining 30% for IRT and DIF analyses, yielding adequate sample sizes for each approach. In the following sections, we provide a detailed account of our statistical methodologies and modeling choices.
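As a minimal illustration of this partitioning step (not the authors’ actual code), the following R sketch shows one way to draw the three non-overlapping random subsets; the object hip_data is a hypothetical 1,564-by-31 data frame of item responses.

    # Hypothetical sketch of the 50/20/30 random split described above
    set.seed(2022)                          # arbitrary seed, for reproducibility only
    n   <- nrow(hip_data)                   # hip_data: assumed 1,564 x 31 item-response data frame
    idx <- sample(seq_len(n))               # randomly permute row indices
    efa_rows <- idx[1:round(.50 * n)]                       # ~50% for EFA
    pa_rows  <- idx[(round(.50 * n) + 1):round(.70 * n)]    # ~20% for PA
    irt_rows <- idx[(round(.70 * n) + 1):n]                 # ~30% for IRT and DIF
    efa_data <- hip_data[efa_rows, ]
    pa_data  <- hip_data[pa_rows, ]
    irt_data <- hip_data[irt_rows, ]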

Exploratory Factor Analysis (EFA)

Prior to conducting EFA, the suitability of the dataset for factor analysis was assessed using the Kaiser–Meyer–Olkin (KMO) measure and Bartlett’s test. The KMO measure yielded a value of .87, indicating a high degree of sampling adequacy (Kaiser, 1970; Kaiser & Rice, 1974). Additionally, Bartlett’s test of sphericity was statistically significant (χ²(465) = 18,036.96, p < .01), further supporting the factorability of the data. We then proceeded to fit a series of eight EFA models using a polychoric correlation matrix based on the 31 HIP items. We performed this analysis using the EFA function in the EFAtools package (Steiner & Grieder, 2020) in R, with the Unweighted Least Squares estimation method and Promax rotation. For each model, we extracted fit indices commonly used in EFA, including the Chi-square statistic, Comparative Fit Index (CFI), and Root Mean Square Error of Approximation (RMSEA). Our determination of the optimal number of factors, or the best-fitting model, was guided by the goodness-of-fit criteria proposed by Hu and Bentler (1999). Specifically, we aimed to select a model with a non-significant Chi-square, CFI greater than .95, and RMSEA less than .05. In addition to these fit indices, we compared models using the Akaike Information Criterion (AIC; Akaike, 1987) and Bayesian Information Criterion (BIC; Schwarz, 1978), both of which balance goodness of fit against model complexity, with lower values indicating a better trade-off. Notably, AIC tends to perform well with smaller sample sizes, whereas BIC is more effective with larger sample sizes (Lopes & West, 2004; Song & Belin, 2008). Given the larger sample size in our study, we selected the final model based on the lowest BIC value alongside substantive interpretation of factor loadings greater than .30 and item content. This comprehensive approach ensured a robust and theoretically meaningful selection of the EFA model best suited to our data.
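As a hedged sketch of this sequence (data checks, polychoric correlations, and one- to eight-factor EFA models), the R code below follows the packages and settings named above; the object efa_data and the exact arguments and output element names are assumptions, since the authors’ code is not reported.

    library(psych)      # polychoric correlations, KMO, and Bartlett's test
    library(EFAtools)   # EFA() with ULS estimation and promax rotation

    poly <- psych::polychoric(efa_data)$rho            # polychoric correlation matrix for the 31 ordinal items
    psych::KMO(poly)                                    # sampling adequacy (reported as .87 in the text)
    psych::cortest.bartlett(poly, n = nrow(efa_data))   # Bartlett's test of sphericity

    # Fit one- to eight-factor EFA models using the polychoric matrix
    efa_fits <- lapply(1:8, function(k) {
      EFAtools::EFA(poly, n_factors = k, N = nrow(efa_data),
                    method = "ULS", rotation = "promax")
    })
    # Inspect fit indices (chi-square, CFI, RMSEA, AIC/BIC); lowest BIC guided the final choice
    lapply(efa_fits, `[[`, "fit_indices")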

Parallel Analysis (PA)

Once the final EFA model was selected, the sets of items that loaded on individual factors were treated as subscales in the PA analysis. Specifically, we used PA to seek further evidence of unidimensionality (Timmerman & Lorenzo-Seva, 2011), as unidimensionality is a core assumption of IRT modeling. PA has been shown to be a more accurate method for determining the dimensionality of item responses than relying solely on the eigenvalue > 1 rule (e.g., Crawford et al., 2010). To conduct PA, we used the fa.parallel function in the psych package (Revelle, 2023) in R with default options, except that we increased the number of random datasets to 100 (n.iter = 100) and used polychoric correlations to obtain eigenvalues (cor = “poly”), given the ordinal nature of our data.
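A minimal sketch of this PA call, assuming subscale_items holds the item columns for one EFA-derived subscale (a placeholder name, not the authors’ object), would be:

    library(psych)
    # Parallel analysis with 100 simulated datasets and polychoric correlations, as described above
    pa_out <- fa.parallel(subscale_items, n.iter = 100, cor = "poly")
    pa_out$nfact   # suggested number of factors; a value of 1 supports unidimensionality for the subscale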

Item Response Theory (IRT) Modeling

After establishing that unidimensionality was tenable (i.e., that a single latent trait underlay the item responses for each subscale), we proceeded to fit an IRT model to each subscale that retained more than three items. Subscales with only three items would be just-identified from a measurement perspective, so model fit would not be informative, making them inadequate for offering a comprehensive representation of the underlying relationships within the dataset (Kline, 2015). We therefore focused only on subsets containing a minimum of four items, which yielded four subscales for further examination.

We conducted the IRT analysis by fitting Samejima’s (1969) GRM to the item responses. We chose this model for its flexibility and ability to handle a wide range of ordered response categories (Embretson & Reise, 2013). In assessing item fit, we employed Orlando and Thissen’s (2000, 2003) and Kang and Chen’s (2007) signed chi-squared test, adjusting the p-values using a Bonferroni correction at a significance level of .05, as recommended by Armstrong (2014) for multiple tests without preplanned hypotheses, to mitigate the heightened risk of a Type I error. There is a lack of consensus on evaluating IRT model and item fit beyond the significance of the chi-squared test. Because of the comparability of IRT and CFA models, many scholars have used CFA-based guidelines to evaluate IRT model fit. While a cutoff of .05 for RMSEA is commonly used in CFA contexts, the appropriate cutoff point for RMSEA has been subject to debate. For example, Steiger (1989) proposed that RMSEA values less than .05 indicate close fit, while Browne and Cudeck (1992) suggested a value of .05 for good fit, .05 to .08 for decent fit, and values above .10 for poor fit. Additionally, MacCallum and colleagues (1996) recommended thresholds of .01, .05, and .08 for excellent, good, and mediocre fit, respectively, while emphasizing that these guidelines serve as aids for interpretation rather than strict cutoff points. With this in mind, we evaluated IRT item fit based on the p-values from the formal chi-squared test as well as RMSEA values. We used the mirt function in the mirt package (Chalmers, 2012) to fit the IRT models and obtain the relevant output. In addition to examining and reporting the IRT model parameters, we reported item fit statistics and reliability at the item and subscale levels.
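For one retained subscale, a hedged mirt-based sketch of the GRM fit, the S-X2 item-fit check with Bonferroni adjustment, and the parameter and information output might look as follows; subscale_items is again a placeholder for that subscale’s columns in the IRT subset, not the authors’ code.

    library(mirt)
    # Fit Samejima's graded response model to one subscale (placeholder data: subscale_items)
    grm_fit <- mirt(subscale_items, model = 1, itemtype = "graded")

    # Item fit via Orlando and Thissen's signed chi-squared (S-X2) statistic, with item-level RMSEA
    item_fit <- itemfit(grm_fit, fit_stats = "S_X2")
    item_fit$p_bonf <- p.adjust(item_fit$p.S_X2, method = "bonferroni")   # Bonferroni-adjusted p-values
    item_fit

    coef(grm_fit, IRTpars = TRUE, simplify = TRUE)   # discrimination (a) and location (b) estimates
    plot(grm_fit, type = "info")                     # test information function (TIF) for the subscale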

Generalizability via Differential Item Functioning (DIF)

To examine evidence of generalizability, we investigated DIF across different subgroups, including gender identity, first-generation status, STEM majors, and race/ethnicity. Due to constraints posed by sample size limitations for certain subgroups at the individual item level, we were unable to thoroughly explore all categories within each student demographic variable. Consequently, we decided to dichotomize the groups into categories such as man or woman, first-generation or continuing generation, STEM or non-STEM, and White or racially minoritized groups. Using the function DIF.Logistic in the MeasInv package (Wells, 2021), we conducted separate analyses for each grouping variable and each subscale. For the analyses, we used significance levels of .05 and a purification process to re-estimate person parameters after removing any detected DIF items. Furthermore, we examined both uniform and non-uniform DIF and we applied Bonferroni adjustment to maintain the overall alpha of .05 level using the p.adjust function in R.
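Because we do not reproduce the exact interface of the DIF.Logistic function from the MeasInv package here, the sketch below illustrates the underlying logistic-regression DIF logic (separate likelihood-ratio tests for uniform and non-uniform DIF, with a Bonferroni adjustment) in base R; theta, item_resp, and group are hypothetical placeholders, and a dichotomized item response is used purely for simplicity.

    # Generic logistic-regression DIF sketch (not the MeasInv::DIF.Logistic call used in the study).
    # theta: IRT person estimates; item_resp: a dichotomized item response (0/1); group: 0/1 grouping variable.
    m0 <- glm(item_resp ~ theta,         family = binomial)   # baseline: trait only
    m1 <- glm(item_resp ~ theta + group, family = binomial)   # adds group main effect (uniform DIF)
    m2 <- glm(item_resp ~ theta * group, family = binomial)   # adds trait-by-group interaction (non-uniform DIF)

    uniform_p    <- anova(m0, m1, test = "LRT")$`Pr(>Chi)`[2]   # likelihood-ratio test for uniform DIF
    nonuniform_p <- anova(m1, m2, test = "LRT")$`Pr(>Chi)`[2]   # likelihood-ratio test for non-uniform DIF
    p.adjust(c(uniform = uniform_p, nonuniform = nonuniform_p), method = "bonferroni")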

Results

Exploratory Factor Analysis and Parallel Analysis

Our EFA results were based on investigating model fit indices, the eigenvalues, and factor loadings. The analysis of both eigenvalues and model fit indices indicated that a seven-factor model was the most appropriate solution. This determination was reinforced by the observation that we obtained seven eigenvalues exceeding one. Additionally, the BIC demonstrated the lowest values for the seven-factor model in comparison to all alternative models (see Table 1).

Table 1 EFA model fit indices

Regarding the factor loadings, our findings indicated that 30 of the 31 items exhibited loadings greater than .3 on at least one of the seven factors; the remaining item (HIPfacmeet) did not load on any factor. Furthermore, two items (HIPstudentfb and HIPdiverse) were omitted from subsequent analyses because they showed reasonably large cross-loadings on more than one factor. Additionally, we intentionally excluded an item (HIPquality) that did not assess a specific experience, resulting in a final set of 27 items distributed across the seven factors (see Table 2). The seven-factor EFA model thus yielded interpretable factors: Reflective and Integrative Learning, Real-World Applications, Interactions with Others, High-Performance Expectations, Constructive Feedback, Sense of Belonging, and Public Demonstration of Competence. We treated these factors as subscales in the subsequent analyses. Additionally, separate PA analyses across the subscales supported the tenability of a single factor for each. Thus, for the IRT analyses, subscales with four or more items were retained, leaving four subscales for subsequent analysis, as described in the next section.

Table 2 Unstandardized factor loadings for all items across seven factors

IRT Modeling

In evaluating the IRT models, we focused on examining item fit, item parameters, and reliability indicators such as Item Information Function (IIF) and Test Information Function (TIF). Table 3 shows our findings, indicating that most items across all subscales showed acceptable fit. Notably, only three items—HIPissues, HIPsocietal, and HIPjournal—yielded significant p-values, suggesting potential item misfit. We decided to keep them for further analysis because of their crucial roles as indicators for measuring the respective underlying subscales.

Table 3 Item fit statistics for individual IRT models

Table 4 presents item parameter estimates for the retained four subscales. The subscale associated with Reflective and Integrative Learning displayed discrimination parameter estimates ranging from 2.11 to 3.26, signifying the items’ capacity to discern between different levels of the subscale. Correspondingly, the location parameters varied from -2.52 to -.05, suggesting where on the subscale continuum the items are most endorsed, highlighting a nuanced comprehension of how students engage in reflective and integrative learning across various experience levels. For the subscale assessing Real-World Applications, discrimination parameter estimates fell between 1.08 and 5.34, accompanied by location parameters ranging from -3.07 to 1.03. In terms of Interactions with Others, discrimination parameter estimates covered a range from .77 to 4.28, while location parameters extended from -2.22 to 1.64. Lastly, concerning High-Performance Expectations, the discrimination parameter estimates varied from .65 to 1.99, with location parameters spanning from -5.04 to 3.83.

Table 4 IRT parameter estimates

We examined reliability at both the test and item levels. The TIFs for the four subscales are shown in Fig. 1, while item-level reliability (i.e., IIFs) can be found in Appendix B. As Fig. 1 shows, the four subscales varied in the amount and location of information they provided. For example, the Reflective and Integrative Learning subscale was largely reliable for participants at the lower end of the continuum (in the range of -4 to 0), suggesting that those with higher levels of the estimated construct would not be measured reliably. Real-World Applications and Interactions with Others produced similar TIFs: in both cases, precision was high for the majority of participants located centrally (within about -2 to +2 standard deviations of the mean). Lastly, High-Performance Expectations yielded reliability across a large range of the continuum, suggesting that items in this subscale can measure the construct quite precisely across the whole continuum.

Fig. 1 Test information functions (TIFs) for individual subscales

DIF Analysis

We conducted DIF analyses across various sub-grouping variables, including gender identity, first-generation status, STEM majors, and race/ethnicity, to explore potential differences in the four subscales. Overall, the results suggested that the four subscales were generally applicable across these sub-groups, with only two exceptions. Specifically, the HIPprobsolve item exhibited uniform DIF (DIF = .027, p < .05), indicating a discrepancy in item functioning between different gender identities. Similarly, we found that the HIPequity item showed non-uniform DIF (DIF = .014, p < .01), suggesting variations in item functioning based on first-generation status. Despite these flagged instances of DIF, we observed that the magnitude of the differences (i.e., effect size) was small.

Discussion and Conclusion

Undergraduate research has played a crucial role in fostering college student success by contributing to various desired outcomes. These positive outcomes have been identified not only through students’ participation but also through the diverse range of experiences acquired during their undergraduate research journey (Loker & Wolf, 2023; Zilvinskis, 2019). Surveys such as the URSSA, the SURE, and the College Senior Survey were developed to assess the quality of students’ undergraduate research experiences. For these surveys, ensuring validity and minimizing bias or inadequacy in capturing the entirety of students’ undergraduate research experiences is crucial. Researchers have delved into the psychometric properties of items on these measures (e.g., Hanauer & Dolan, 2014; Weston & Laursen, 2015). However, there is a notable gap in the psychometric evaluation of the NSSE’s HIP Quality module items, which are based on Kuh and O’Donnell’s (2013) eight HIP characteristics.

Given the widespread use of NSSE across states, more empirical evidence is needed to validate the appropriateness and interpretation of the NSSE’s HIP Quality module items and scores. Our study addresses this gap by assessing the alignment of the NSSE’s HIP Quality module items with the hypothesized HIP characteristics using EFA and PA. We evaluated their reliability and validity as indicators of undergraduate research quality through IRT and examined measurement generalizability across various student demographic groups using DIF analysis. The goal of our study is to enhance the design of these survey items to better reflect students’ actual experiences. By refining these measures, we aim to help institutions improve their assessment strategies and the quality of undergraduate research experiences through regular survey administration. Additionally, our study seeks to validate NSSE’s HIP Quality module items, identify effective components for evaluating undergraduate research, and offer guidance for researchers adapting these measures to assess specific aspects of students’ undergraduate research experiences.

The Alignment of NSSE’s HIP Quality Measures and HIP Characteristics

In this section, we delve into the alignment and misalignment of quality elements with Kuh and O’Donnell’s (2013) framework as revealed by our findings. We aim to provide a comprehensive analysis of how these elements correspond to the framework and discuss the implications for future research on undergraduate research experiences. Our study not only identifies key areas of alignment but also highlights gaps that need further exploration to enhance the validity and reliability of assessments in this domain. Specifically, the findings showed that four quality elements aligned with Kuh and O’Donnell’s (2013) framework, which included Reflective and Integrative Learning, Interactions with Others, Real-World Applications, and High-Performance Expectations subscales. Researchers (e.g., Loker & Wolf, 2023; Zilvinskis, 2019) have investigated the relationship between these quality elements and student outcomes in the context of undergraduate research. Given this alignment, we suggest continuing the investigation of the four quality aspects when assessing students’ undergraduate research experiences, as these aspects support the relevance and applicability of Kuh and O’Donnell’s framework in evaluating the effectiveness of the quality of undergraduate research.

On the contrary, we identified a misalignment between the EFA results and some of Kuh and O’Donnell’s (2013) quality elements, resulting in seven subscales instead of the anticipated eight. While two subscales—Public Demonstration of Competence and Constructive Feedback—aligned with their framework, there were insufficient indicators to fully assess the psychometric properties of each item within the subscales. However, although we were unable to test the psychometric properties of these two subscales, they have been used in previous studies. For example, Kinzie and colleagues (2020) identified both Public Demonstration of Competence and Constructive Feedback as important quality elements in undergraduate research. Similarly, Zilvinskis (2019) found that Constructive Feedback was positively related to various student outcomes, such as a supportive environment and overall satisfaction. Given that these two studies used earlier NSSE data and did not report any psychometric properties of the items used, our study addresses this gap by highlighting potential improvements for these quality elements. Furthermore, we found that elements such as Experience with Diversity were incorporated into the Real-World Applications subscale, while Investment of Time and Effort became part of the High-Performance Expectations subscale. We also discovered a new subscale, Sense of Belonging, which had not been previously articulated in the existing framework. Our findings have raised questions about constructing a subscale to measure the quality of students’ undergraduate research experiences, highlighting the necessity of isolating certain quality elements as independent subscales while underscoring the interrelated nature of these elements.

Psychometric Quality of NSSE Measures

Given the diverse range of assessments used to capture students’ undergraduate research experiences, ensuring these assessments yield valid and reliable scores is crucial. Our findings indicated that the four subscales demonstrated sufficient internal and construct validity in measuring various aspects of undergraduate research experiences. In particular, the Reflective and Integrative Learning subscale effectively assessed students with limited development in reflective and integrative outcomes. Notably, at the item level, the item related to students acquiring job-related skills (HIPjobskills) emerged as a strong discriminator (α = 3.26) among those with limited experience in this outcome development. This finding partially aligns with Hunter and colleagues’ (2007) study, which revealed that students reported gaining knowledge and skills beneficial for their careers or graduate school post-graduation. However, the item related to preparing for plans after graduation (HIPpostgrad) did not provide substantial information for assessing students’ undergraduate research experience, contrary to that found in existing studies (e.g., Hunter et al., 2007; Weston & Laursen, 2015).

The remaining items in the Reflective and Integrative Learning subscale showed lower discrimination parameters and were less reliable. For example, our results revealed that items focused on connecting students’ learning to their majors (HIPconnect) and understanding concepts in the courses or majors (HIPconcept) had low discrimination parameters, indicating they did not provide enough useful information. However, studies (e.g., Hunter et al., 2007; Seymour et al., 2004) have emphasized that undergraduate research participation is positively associated with students’ conceptual understanding, deepens their disciplinary knowledge, and strengthens their connections within and across scientific fields. Similarly, although research (e.g., Hunter et al., 2007; Seymour et al., 2004) has demonstrated that developing problem-solving skills and applying knowledge are key components of undergraduate research, our findings indicated that the items related to solving complex, real-world problems (HIPprobsolve) and applying theory to practice (HIPtheory) also had low discrimination parameters. The inconsistency between our study and existing research suggests a need to revisit these items to better capture the key aspects of students’ undergraduate research experiences.

In our evaluation of Real-World Applications subscale, our results indicated the subscale’s effectiveness in capturing students with average levels of experience in this aspect. The item focusing on understanding societal problems or issues (HIPissues) exhibited the highest discrimination parameter (α = 5.34), offering detailed insights into this aspect of the undergraduate research experience. This finding is somewhat inconsistent with Hunter and colleagues’ (2007) study, which indicated that students perceived their undergraduate research experiences as valuable for envisioning real-world full-time work, acquiring professional transferable skills, and applying knowledge and skills. Our findings suggest that examining the real-world application aspect of students’ undergraduate research experiences could focus on their understanding of societal problems and issues. It is worth mentioning that the wording of HIPsocietal is remarkably similar to that of HIPissues, indicating a potential issue or redundancy with this item. The remaining three items in the Real-World Applications subscale—connecting their learning to societal problems or issues (HIPsocietal), respecting the expression of diverse ideas (HIPideas), and examining issues of equity or privilege (HIPequity)—provided limited information, indicating potential areas for item revision.

The Interactions with Others subscale effectively measured students with average levels of experience in interacting with others, notably highlighting the value of engaging in discussions with other students in an organized setting (HIPdiscuss), which emerged as a rich source of information about interaction experiences (α = 4.28). However, this finding slightly deviates from studies emphasizing the importance of collaborating with faculty and peers (e.g., Auchincloss et al., 2014; Hunter et al., 2007) as part of students’ undergraduate research experiences. While these studies focus on collaboration, they did not specifically address students’ discussion experiences. In fact, the wordings of the other two items in this subscale, related to working with other students (HIPcollab) and interacting with people from different backgrounds or identities (HIPinteract), are more aligned with studies highlighting the experience of collaboration and interaction with faculty and peers but did not provide sufficient information. Regarding the item related to tracking the experience with informal writing (HIPjournal), our findings showed a low discrimination parameter, suggesting it provided limited information. These findings indicate the need to update these items to better capture students’ experiences.

Concerning the High-Performance Expectations subscale, our findings showed that this subscale provided precise measurements across a wide spectrum of student experiences. The item regarding student effort (HIPeffort) offered particularly insightful perspectives on this experience, indicating the importance of considering the effort students put into undergraduate research when examining performance expectations. However, the item related to time investment (HIPhours) did not stand out within this subscale, which aligns with Russell and colleagues’ (2007) study suggesting its irrelevance to positive outcomes in undergraduate research participation. It is noteworthy that two items related to challenge (HIPchallenge) and time investment (HIPhours) did not yield substantial information due to a surplus of response options, with nearly half of the options lacking meaningful discrimination. Additionally, the item related to experiences with new settings or circumstances (HIPnew) had a small discrimination parameter and did not provide sufficient information either. These findings collectively provide insights into the functionality of survey items when examining students’ undergraduate research experiences.

Generalizability Across Student Demographic Groups

Our findings revealed DIF in the items HIPprobsolve and HIPequity, albeit with small effect sizes, indicating subtle differences in how students from diverse social identity groups perceived their undergraduate research experiences with respect to these specific items. Specifically, our results indicated potential bias across gender identity groups in the item assessing students’ experience solving complex, real-world problems (HIPprobsolve). Similarly, a small-effect-size DIF was observed in the HIPequity item with respect to first-generation status, suggesting disparities in how students perceive equity-related aspects of their undergraduate research experiences. These findings underscore the importance of considering demographic variables in assessment practices to ensure fairness and accuracy, as prioritizing equity in survey items can enhance the validity and relevance of evaluations across diverse student populations.

Limitations

While this study provides valuable insights into the construct validity of the four examined quality elements, it is important to acknowledge some constraints. First, the reliance on extensive multi-institutional NSSE data is noteworthy; however, it is essential to recognize that the participants exclusively represent four-year institutions. Therefore, the findings may not be readily applicable to students engaged in undergraduate research at different types of institutions, such as two-year institutions or community colleges. Additionally, institutions that chose to participate in this HIP Quality module may be more motivated to support their students’ participation in HIPs, potentially skewing the results regarding students’ perceptions of their quality experiences.

Second, three items (HIPdiverse, HIPstudent, and HIPfacmeet) were excluded from the subsequent PA and IRT analyses because they either cross-loaded or failed to load onto any factor, precluding further exploration of these aspects of students’ undergraduate research experience. Specifically, the item concerning receiving helpful feedback from other students (HIPstudent) loaded onto both the Interactions with Others and Constructive Feedback subscales, blurring the distinction between these quality elements outlined by Kuh and O’Donnell (2013). Similarly, the item related to developing the skills needed to work effectively with people from various backgrounds (HIPdiverse) cross-loaded on both the Real-World Applications and Reflective and Integrative Learning subscales, indicating difficulty in differentiating between reflective learning and practical application. Additionally, the item related to meeting with a faculty or staff member (HIPfacmeet) did not load onto any factor, further limiting the comprehensiveness of the analysis.

Third, an insufficient number of indicators hindered the analysis of three quality element subscales (Public Demonstration of Competence, Constructive Feedback, Sense of Belonging), which could have provided additional insights into students’ experiences. Additionally, due to limited sample sizes for certain subgroups at the individual item level, we were unable to thoroughly explore all categories within each student demographic variable. Consequently, we opted to dichotomize the groups, potentially resulting in an incomplete capture of all student groups’ responses. These constraints underscore the importance of ongoing refinement and adaptation of assessment tools to better capture the nuances of students’ experiences. Continuous improvement is crucial for developing more comprehensive and inclusive assessment methods, ensuring they accurately reflect the diverse and dynamic nature of undergraduate research experiences.

Implications and Future Directions

This study holds substantial implications for assessment practices and institutional policies geared toward evaluating and enhancing the quality of students' undergraduate research experiences. We offer several key insights at both the scale and item levels for educators and researchers interested in assessing students' undergraduate research experiences effectively. Specifically, our findings could inform future revisions of the HIP Quality module item set developed by NSSE staff and researchers. The study underscores the significance of the four quality aspects aligned with Kuh and O'Donnell's (2013) framework and the ongoing need to investigate these aspects within students' undergraduate research experiences. Despite this alignment, we identified discrepancies between their framework and several of the subscales examined here, prompting a reassessment of the weight assigned to the eight quality elements when educators, researchers, and assessors gauge students' undergraduate research experiences on their respective campuses. It is particularly important for these stakeholders either to expand existing survey items to cover the underlying constructs of the Experience with Diversity and Investment of Time and Effort quality elements or to reconsider whether these two elements need to be separated from the Real-World Applications and High-Performance Expectations subscales, respectively. Furthermore, three subscales (Sense of Belonging, Public Demonstration of Competence, and Constructive Feedback) merit deeper exploration because of their limited number of item indicators. We therefore suggest that researchers and assessors explore alternative methodologies for assessing these three aspects, such as developing additional survey items that more comprehensively capture the underlying constructs, enabling more nuanced item-level analysis and a clearer understanding of which items or indicators yield the most informative insights.

The study demonstrates that psychometric analysis of the NSSE HIP Quality module items provides valuable insights at the item level, highlighting both promising items for future assessment of undergraduate research experiences and areas in need of refinement. Specifically, our study identifies items that effectively capture certain aspects of students' undergraduate research experiences, such as job-related skills acquisition (HIPjobskills), while indicating the need for revision in other areas. For instance, the item assessing students' preparation for post-graduation plans (HIPpostgrad) requires refinement to better reflect the interconnectedness of career and graduate school aspirations and undergraduate research experiences. Similarly, the wording of the items related to working with other students and interacting with people from different backgrounds or identities (HIPcollab and HIPinteract) needs updating to accurately assess students' experiences in these areas; clarifying and strengthening the connections among discussion, collaboration, and interaction could enhance these indicators' effectiveness. Additionally, some items (e.g., HIPpostgrad, HIPprobsolve, and HIPtheory), although identified in prior studies as reflecting important undergraduate research experiences, showed low discrimination parameters and did not provide sufficient information. It is therefore essential for researchers and educators to revise the wording of these items to ensure they effectively capture these aspects of students' undergraduate research experiences.
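To make the link between low discrimination and low information explicit, the expression below gives the item information function under a graded response model, a common choice for polytomous survey items such as these; the notation is ours and is included only as a reminder of why information scales with the square of the discrimination parameter, not as a statement of the exact model specification used in this study.

```latex
% Cumulative and category probabilities for item i with ordered categories
% k = 0, ..., m_i, where P*_{i0} = 1 and P*_{i,m_i+1} = 0:
\[
P^{*}_{ik}(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_{ik})}}, \qquad
P_{ik}(\theta) = P^{*}_{ik}(\theta) - P^{*}_{i,k+1}(\theta).
\]
% Item information, which is proportional to the squared discrimination a_i^2:
\[
I_i(\theta) = a_i^{2} \sum_{k=0}^{m_i}
\frac{\bigl[P^{*}_{ik}(1 - P^{*}_{ik}) - P^{*}_{i,k+1}(1 - P^{*}_{i,k+1})\bigr]^{2}}
     {P_{ik}(\theta)}.
\]
```

Because information is proportional to the squared discrimination parameter, items with low discrimination contribute comparatively little measurement precision at any trait level; plotting these functions across the latent trait, as in Appendix B, shows where each item contributes most.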

Furthermore, researchers and educators should integrate items related to understanding societal issues (HIPissues) and the level of effort invested (HIPeffort) into their assessment tools to gain a more comprehensive picture of students' undergraduate research engagement. Items with limited informational value, such as those assessing the challenge level of the experience (HIPchallenge) or students' weekly time commitment (HIPhours), also need further refinement; streamlining their response options could improve the quality of the insights gained. Additionally, the three items excluded from the PA and IRT analyses require revision to ensure alignment with the quality elements and proper categorization within subscales. This refinement process is crucial for maintaining the validity and reliability of assessment tools used to evaluate undergraduate research experiences and for informing enhancements to undergraduate research programs.
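As one concrete illustration of what streamlining response options could look like before refitting the models, the snippet below collapses a hypothetical set of weekly-hours categories into three broader bands. The original category codes and the three-band mapping are assumptions chosen for illustration, not the actual NSSE response options.

```python
# Hedged sketch: collapse adjacent response options of a sparse ordinal item
# (e.g., an hours-per-week item) into broader bands before re-running the
# IRT analysis. Codes 1-7 and the 3-band mapping are illustrative only.
import pandas as pd

RECODE = {1: "low", 2: "low", 3: "medium", 4: "medium",
          5: "high", 6: "high", 7: "high"}

def collapse_hours(hours: pd.Series) -> pd.Series:
    banded = hours.map(RECODE)
    return pd.Series(
        pd.Categorical(banded, categories=["low", "medium", "high"], ordered=True),
        index=hours.index,
    )

# Example: collapse_hours(pd.Series([1, 4, 7, 2])) -> low, medium, high, low
```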

The identification of DIF in two items, namely the development of job- or work-related skills (HIPjobskills) across gender identities and equity-related aspects of undergraduate research (HIPequity) between first-generation and continuing-generation students, carries important implications for researchers and educators. It suggests that these items should be reviewed for potential subtle biases, although, given the small effect sizes, major revisions may not be necessary. Understanding the underlying reasons for the observed DIF is crucial to ensuring that students from diverse social identities interpret the related items in comparable ways. Such review not only fosters inclusivity but also ensures that the insights provided by students participating in undergraduate research accurately reflect their actual experiences.

Expanding assessment methods is essential for comprehensively understanding students' undergraduate research experiences. Alternative survey instruments and interviews allow for deeper insight into individual experiences within different programs or campuses. Given the diversity of research offerings across programs, relying solely on existing quality elements may not capture the full scope of students' undergraduate research experiences, which can exceed established quality frameworks (HERI, 2017; Kuh & O'Donnell, 2013; Weston & Laursen, 2015). Educators and researchers should also conduct longitudinal assessments to track students' experiences over time and to identify emerging quality elements. Continuous monitoring and a sustained focus on equity enhance the validity and relevance of assessments across diverse student populations, enabling educators and faculty to better support students' varied undergraduate research journeys through integrated assessment methods and ongoing evaluation. Moreover, researchers and educators using existing survey instruments must navigate proprietary concerns and licensing processes: customizing or modifying survey items requires proper authorization, such as use permissions or licensing agreements, and such alterations should be carefully evaluated to ensure they do not compromise the instrument's validity.

Appendix A

Student Participation in Undergraduate Research by Demographic Characteristics.

| Variable | Category | N | % |
|---|---|---|---|
| Gender identity | Man | 631 | 40.5 |
| | Woman | 868 | 55.7 |
| | Another gender identity | 32 | 2.1 |
| | Prefer not to respond | 27 | 1.7 |
| Major field categories | Arts & Humanities | 85 | 5.5 |
| | Biological Sciences, Agriculture, & Natural Resources | 488 | 31.4 |
| | Physical Sciences, Mathematics, & Computer Science | 237 | 15.3 |
| | Social Sciences | 205 | 13.2 |
| | Business | 51 | 3.3 |
| | Communications, Media, & Public Relations | 14 | 0.9 |
| | Education | 19 | 1.2 |
| | Engineering | 209 | 13.4 |
| | Health Professions | 176 | 11.3 |
| | Social Service Professions | 27 | 1.7 |
| | All Other | 42 | 2.7 |
| | Undecided, undeclared | 1 | 0.1 |
| STEM field | No | 606 | 39.0 |
| | Yes | 948 | 61.0 |
| First-generation | Not first-generation | 1168 | 75.1 |
| | First-generation | 388 | 24.9 |
| Race/ethnicity | American Indian or Alaska Native | 4 | 0.3 |
| | Asian | 100 | 6.4 |
| | Black or African American | 43 | 2.8 |
| | Hispanic or Latina/o | 79 | 5.1 |
| | Middle Eastern or North African | 13 | 0.8 |
| | Native Hawaiian or Other Pacific Islander | 5 | 0.3 |
| | White | 1100 | 70.7 |
| | Another race or ethnicity | 7 | 0.5 |
| | Multiracial | 158 | 10.2 |
| | I prefer not to respond | 46 | 3.0 |

Note. The classification of STEM field was determined by major field categories; Biological Sciences, Agriculture, & Natural Resources; Physical Sciences, Mathematics, & Computer Science; and Engineering were included in the STEM category.

Appendix B

Item Information Functions for Studied Subscales.