Background

Substantial research has articulated how undergraduate students learn and the instructional practices that best support student learning, including empirically validated instructional strategies (e.g., Chickering & Gamson 1987; Pascarella & Terenzini 1991; 2005). Efforts to transform postsecondary STEM courses to include more of these strategies have had only modest success. One reason for this is that researchers lack shared language and methods for describing teaching practices (Henderson et al. 2011; Beach et al. 2012). As a result, there is a need for documenting tools that describe what teaching practices actually occur in college classrooms (American Association for the Advancement of Science [AAAS] 2013).

Surveys are one method to measure the instructional practices of college and university instructors. Self-report surveys can be used alone or in combination with observation to provide a portrait of postsecondary teaching (AAAS 2013); these portraits can serve as baseline data for individual instructors, institutions, and faculty developers to plan and enact more effective change initiatives (Turpen & Finkelstein 2009). While self-report surveys are acknowledged as useful tools for measuring teaching practices, there has been little systematic work characterizing the available instruments.

Ten surveys of postsecondary instructional practices were summarized in a recent report of the American Association for the Advancement of Science (AAAS 2013). This report was the result of a 3-day workshop to develop shared language and tools by examining current systematic efforts to improve undergraduate STEM education. Although the report provides an overview of available instruments, it examines neither the design and development of the surveys nor the content and structure of survey items. As a result, it is difficult for researchers to know whether currently available instruments are sufficient or whether new instruments are needed.

The purpose of this paper is to provide a comparison and content analysis of available postsecondary STEM teaching practice surveys. Our goal is to provide a single resource that gives researchers a sense of the available instruments. We bound our analysis to the 10 instruments included in the AAAS (2013) report and two instruments that have been released since the report. The AAAS report was developed by a diverse panel of experts in describing college-level STEM teaching. We are not aware of any relevant surveys that the AAAS report missed, but we are aware of two relevant surveys that have been disseminated since its publication: the Teaching Practices Inventory (TPI; Wieman & Gilbert 2014) and the Postsecondary Instructional Practices Survey (PIPS; Walter et al. 2014). These instruments were included in our analysis because, had they been available at the time, they likely would have been included in the AAAS report (Smith et al. 2014; Walter et al. 2014).

Through our analysis, we seek to characterize the development and administration of the self-report instruments and provide detailed descriptions of their item content (e.g., specific teaching practices) and structure (e.g., clarity, specificity). We also highlight questions that users should consider before adopting or designing an instrument and make suggestions for future work.

Research questions

Our analysis was guided by two research questions:

  • RQ1. What is the nature of the sample of available surveys that elicit self-report of postsecondary teaching practices?

    a. What are the intended populations of the surveys?

    b. What measures of reliability and validity were used in the development of the surveys?

    c. What is the respondent and administrative burden of the surveys?

  • RQ2. What teaching practices do the surveys elicit?

Methods

Proper instrument development is essential for a survey to correctly measure its intended subject for its intended population (DeLamater et al. 2014). As we considered a comparison of the instruments, we sought to understand the elements essential to their development and administration (RQ1). These elements include the background of the instrument, intended population, respondent and administrative burden, reliability and validity, scoring convention, and reported analyses. These attributes were selected based on commonalities in reported instrument features as well as recommendations in the instrument development literature.

We carefully reviewed the original and related follow-up manuscripts for descriptions of how each instrument employed these features. This section is intended to provide operational definitions for the key features of the instruments; we later describe how these elements were embodied in the instruments we reviewed (see “Results and discussion” section).

Background

Background for an instrument includes details on its original authors, broad development procedure, and a brief description of its content. Where applicable, we include relevant manuscripts associated with the original publication.

Intended population

The intended population of an instrument refers to the group of participants that the instrument was designed to survey (DeLamater et al. 2014).

Respondent and administrative burden

Respondent burden is the amount of time and effort required by participants to complete an instrument. We report estimated time to completion for instruments in their entirety (this may include items other than those related to teaching practice). Administrative burden refers to the demand placed on individuals implementing the instrument. The number of response scales used, and the consistency among them, can add to both respondent and administrative burden.

Reliability and validity

In survey research, it is common to report methods by which reliability and validity were achieved. Reliability is the consistency with which an instrument provides similar results across items, testing occasions, and raters (Cronbach 1947; Nunnally 1967). There are several commonly reported forms of evidence for instrument reliability, including internal consistency, test-retest, and inter-rater reliability.

Internal consistency addresses whether an instrument is consistent across items and is often reported with Cronbach’s alpha (for non-binary surveys). Alpha is a general measure of the interrelatedness of items, assumes that measurement errors across items are uncorrelated, and depends on the number of items in the test (Nunnally 1978).
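As a point of reference (this is the standard formula, not one drawn from the reviewed instruments), Cronbach's alpha for a scale of $k$ items is

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right),$$

where $\sigma^{2}_{Y_i}$ is the variance of item $i$ and $\sigma^{2}_{X}$ is the variance of the total score. For a fixed average inter-item correlation, alpha grows with the number of items, which is why it depends on test length.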

It is hypothetically possible that the instructional practices on a given survey are not correlated with one another. However, a subset of items on a given survey is typically interrelated in some way. For example, a survey may include multiple items that get at a particular practice or multiple items designed around a particular construct about teaching (as evidenced by the use of exploratory and confirmatory factor analyses in many studies).

Test-retest reliability refers to the ability of an instrument to produce consistent measurements across testing occasions. Although instructional practices can change over time, some elements could remain consistent.

Inter-rater reliability is the extent to which two or more raters measuring the same phenomenon agree in their ratings. This form of reliability is more common in qualitative work than in survey administration.

Validity is the extent to which an instrument measures what it was intended to measure (Haynes et al. 1995). Three commonly reported types of validity are content, construct, and face validity.

Content validity documents how well an instrument represents aspects of the subject of interest (e.g., teaching practices). A panel of subject matter experts is often used to improve content validity through refinement or elimination of items (Anastasi & Urbina 1997). We would expect content validity efforts to be evident in all of the surveys we examined.

Construct validity refers to the degree to which an instrument is consistent with theory (Coons et al. 2000); this is often assessed through confirmatory and/or exploratory factor analyses (Thompson & Daniel 1996). It is not appropriate for every survey to report construct validity since not every survey was developed from a theory base. For example, the TPI (Wieman & Gilbert 2014) was designed as a checklist or rubric of possible teaching practices in a given course. Therefore, as the TPI authors argue, there should be no expectation of underlying constructs.
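To make the factor-analytic approach concrete, the following sketch shows one way an exploratory factor analysis could be run on item responses. It is purely illustrative: the response data are simulated, the two-factor solution and varimax rotation are assumptions, and it does not reproduce the analyses of any reviewed instrument.

```python
# Illustrative sketch only: simulated Likert-style responses, hypothetical factor count.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(200, 10)).astype(float)  # 200 respondents x 10 items

fa = FactorAnalysis(n_components=2, rotation="varimax")  # assume two underlying constructs
fa.fit(responses)

loadings = fa.components_.T  # rows = items, columns = factors
for i, row in enumerate(loadings, start=1):
    print(f"item {i:2d} loadings: {np.round(row, 2)}")
```

A confirmatory factor analysis, as reported for the ATI, FSSE, and PIPS, would instead test a pre-specified assignment of items to constructs.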

An instrument has face validity if, from the perspective of participants, it appears to have relevance and measures its intended subject. This requires developers to use clear and concise language, avoid jargon, and write items to the education and reading level of the participants (DeLamater et al. 2014). Pilot testing items with a representative sample (e.g., postsecondary instructors) and refining items based on feedback is a common method to improve face validity. We would expect actions to ensure face validity in all of the surveys we examined.

Scoring convention

Scoring convention refers to any procedures used by the instrument authors to score items for the purposes of analyzing participant responses.

Reported analyses

The reported analyses are any statistical procedures used or recommended by the instrument authors to analyze data collected using the instruments. Additionally, the format in which the authors report their data is included here.

Item-level analysis

We undertook a content analysis to understand the aspects of teaching practices measured by each instrument included in the sample (RQ2). Content analysis is a systematic, replicable technique for compressing text (in our case, survey items) into fewer content categories based on explicit coding rules (Berelson 1952; Krippendorff 1980; U.S. General Accounting Office [GAO] 1996; Weber 1990). Content analysis enables researchers to sift through data with ease in a systematic fashion (U.S. General Accounting Office [GAO] 1996) and is a useful technique for describing the focus of individuals or groups (Weber 1990); in our case, we can examine in detail the goals of those surveying the instructional practices of postsecondary instructors. Although content analysis generates quantitative patterns (counts), the technique is methodologically rich and meaningful due to its reliance on explicit coding and categorizing of the data (Stemler 2001).

The analysis began with examining all of the items from the 12 instruments and identifying those related to teaching practices. Items that did not capture an instructional practice were excluded, because we were only interested in analyzing items directly related to instructional practices; this left a pool of 320 instructional practice items. The most common type of excluded item was one that elicited only a belief about teaching without the direct implication that the belief informed practice, e.g., “how much do you agree that students learn more effectively from a good lecture than from a good activity?” We did include rationale statements in the analysis, as these beliefs directly inform instructional practice, e.g., “I feel it is important to present a lot of facts to students so that they know what they have to learn in this subject.”

The first phase of our item-level analysis began with two members of the research team (authors 1 and 2) independently categorizing the 320 items into emergent coarse- and fine-grained codes. The codes were created based on the content of the items themselves. We designed the codes to be autonomous, that is, one code could not overlap with another. This means that items within coding categories must not only have similar meaning (Weber 1990, p. 37), but codes should be mutually exclusive and exhaustive (U.S. General Accounting Office [GAO] 1996, p. 20). Categories are mutually exclusive when no item falls between two categories and each item is represented by only one data point; they are exhaustive when the codebook represents all applicable items without exception. For this convention to function, we needed (a) to write code names and code definitions carefully and (b) to sort each item into a code based on the single instructional practice best represented by its text.

The second phase of the analysis brought in two additional researchers (authors 3 and 4) to categorize the items using the codebook created by authors 1 and 2. As a four-member research team, we engaged in subsequent rounds of group coding, codebook refinement, and repeated independent coding until an acceptable overall agreement was achieved (82.1 % agreement). The result was 34 autonomous codes in three primary categories: (a) instructional format (20 codes, 138 items), (b) assessment (10 codes, 74 items), and (c) reflective practice (4 codes, 24 items). We define each code and provide a sample item for each in Table 1.
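As a simple illustration of the agreement statistic reported above, the sketch below computes overall percent agreement between two coders; the item labels and code assignments are hypothetical, not taken from our actual coding data.

```python
# Hypothetical example: each list holds one coder's code assignment for the same items.
coder_a = ["lecture", "group_work", "formative_assessment", "lecture", "demonstration"]
coder_b = ["lecture", "group_work", "grading_policy", "lecture", "demonstration"]

matches = sum(a == b for a, b in zip(coder_a, coder_b))
percent_agreement = 100 * matches / len(coder_a)
print(f"Overall percent agreement: {percent_agreement:.1f}%")  # 80.0% for this toy example
```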

Table 1 Codebook used for content analysis

Codebook categories

Instructional format

The instructional format codes refer to items that describe the method by which a course is taught. The codes within this category differ mainly in who the primary actor of the instruction is, i.e., students versus the instructor. We created three main groups of instructional format codes: transmission-based instruction, student active, and general practice codes. The transmission-based instruction codes are traditional practices where the instructor is the primary actor. Teaching practices included in this group are lecture, demonstration, and instructor-led question-and-answer. The “student active” codes include a diverse set of practices where students are the primary actors. Example practices in this group are students explaining course concepts, analyzing or manipulating data, completing lab or experimental activities, and having input into course content. The student active codes also include group work practices where two or more students collaborate. The general practice codes consist of practices where there is no designated primary actor, such as connecting course content to scientific research, drawing attention to connections among course concepts, and real-time polling.

Assessment

The assessment codes relate to teaching practices used to determine how well students are learning course content. We created three groups of assessment codes: assignment types, nature of feedback to students, and nature of assessments. The assignment type codes are various activities assigned to students, e.g., student presentations, writing, and group projects. The “nature of feedback to students” codes refer to how much feedback is given by the instructor to students and the policies enacted by the instructor for how student work is graded. Finally, the “nature of assessment” codes include the types of questions used on summative assessments and the types of outcomes assessed.

Reflective practice

The reflective practice codes are associated with items that ask instructors to think about the big picture of what and how they teach. Additionally, the items ask about how instructors improve their teaching. Example practices include gathering information on student learning to inform future teaching and communicating with students about instructional goals and strategies for success in the course. Also included under the reflective practice codes are items that ask instructors about their rationale behind a particular teaching practice.

Results and discussion

In this section, we review the key features of each instrument. The instruments are described in alphabetical order. Table 2 includes intended population, the number of items and estimated time to completion, and information about reliability and validity for each instrument. Table 3 summarizes the scoring conventions and reported analyses for each instrument. For consistency and ease of explanation, we chose to create a name and acronym for instruments that were not given to them by their original authors. Our titles and acronyms were determined by the STEM discipline of the instrument and original authors’ surnames. An asterisk indicates self-generated acronyms.

Table 2 Instrument key features (part 1)
Table 3 Instrument key features (part 2)

The “Broad patterns and comparisons” section below includes an overview of the background, intended population, reliability and validity, respondent and administrative burden, scoring convention, and reported analyses across the instruments (RQ1). It then discusses strengths and weaknesses of the development process used in our sample of instruments. We also consider patterns in the content and structure of the items of each instrument based on our codebook analysis (RQ2). For more in-depth descriptions of each instrument, please see Additional file 1.

Broad patterns and comparisons

What is the nature of the instruments that elicit self-report of postsecondary teaching practices? (RQ1)

Background

Almost all of the instruments were developed out of a growing interest to improve undergraduate instruction at a local and/or national scale. Furthermore, eight of the 12 surveys we reviewed have been published or revised since 2012, heralding a movement among the research community to measure the state of undergraduate education.

Intended population

Four of the instruments we reviewed span all postsecondary disciplines (Faculty Survey of Student Engagement (FSSE), Higher Education Research Institute (HERI), National Study of Postsecondary Faculty (NSOPF), PIPS). The remaining instruments are designed for STEM faculty, including physics (Henderson & Dancy Physics Faculty Survey (HDPFS)), engineering (Borrego Engineering Faculty Survey (BEFS)), chemistry and biology (Survey of Teaching Beliefs and Practices (STEP)), geosciences (On the Cutting Edge Survey (OCES)), statistics (Statistics Teaching Inventory (STI)), and science and mathematics (TPI). There are no instruments designed specifically for postsecondary technology instructors, with the exception of an instrument to measure the integration of technology into postsecondary math classrooms (Lavicza 2010). However, this instrument focuses on the use of particular technologies and not on particular teaching practices.

Administrative and respondent burden

There is great variability in the number of items on the surveys we reviewed (mean 84.4 ± 72.7 items). Lengthy surveys, such as the FSSE (130 items), HERI (284 items), NSOPF (83 items), TPI (72 items), and STEP (67 items), may cause participants to develop test fatigue, i.e., become bored or not pay attention to how they respond (Royce 2007).

The number of teaching practice items (mean 26.7 ± 14.2) and the proportion of teaching practice items in the overall instrument (mean 43.4 ± 26.1 %) also vary widely. This may be problematic for administrators seeking only to elicit the teaching practices of respondents. Furthermore, although teaching practice items could be pulled out from a larger survey, doing so can impact the construct validity of the instrument.

The instruments with the lowest proportion of teaching practice items are national interdisciplinary surveys designed to assess multiple elements of the faculty work experience: FSSE (17.7 % instructional practice items), HERI (12.3 %), and NSOPF (12.0 %). In contrast, the remaining instruments (mostly discipline-specific, with the exception of the PIPS) devote more of their items to instructional practices: TPI (83.3 %), PIPS (72.7 %), HDPFS (65.6 %), OCES (63.0 %), Approaches to Teaching Inventory (ATI) (56.3 %), STI (42.0 %), and STEP (34.9 %). The exception to this pattern is the Southeastern University and College Coalition for Engineering Education (SUCCEED) survey, with only 17.9 % of its items devoted to instructional practices.

The instruments we analyzed also employ a variety of response scales (Table 4). Many use a 5-point response scale (e.g., BEFS, PIPS, STI, SUCCEED, TPI), but others use 3-point (STEP, NSOPF, SUCCEED, TPI), 8-point (FSSE), and binary scales (OCES, STI, SUCCEED, TPI). Response scales are an important consideration in instrument development, as is an explicit rationale for the chosen scales in development documents. Five-point scales are generally recommended to maximize variance in responses, unless there is a compelling reason not to use such a scale (Bass et al. 1974; Clark & Watson 1995). Despite these recommendations in the literature, authors rarely stated their rationale for scale choice. Notable exceptions are the STI (Zieffler et al. 2012) and PIPS (Walter et al. 2014), which document the rationale behind their scale selections.

Table 4 Nature of the scales used by the instruments

Scoring convention

Seven of the instruments reported some form of scoring system. In general, scoring is done on a positive scale with higher scores given to responses indicating greater importance or use of reformed teaching practices. Providing scoring systems for an instrument can help users make sense of large data sets and produce more consistent data sets across implementations.

Reported analyses

The majority of the instruments reported descriptive statistics such as frequency distributions, means, and standard deviations. A few instruments (BEFS, PIPS, STEP, SUCCEED) reported mean comparisons using common statistical tests such as independent t tests, ANOVA, and chi-square. Some instruments (ATI, PIPS) also reported correlational analyses between instrument scores and various aspects related to teaching and learning.
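To illustrate the kinds of comparisons named above, the sketch below runs an independent t test, a one-way ANOVA, and a chi-square test on simulated data; the groups, values, and contingency table are hypothetical and are not drawn from any of the reviewed instruments.

```python
# Hypothetical data: mean item scores for three groups of instructors and a 2x2 usage table.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(3.5, 0.6, size=30)
group_b = rng.normal(3.1, 0.6, size=30)
group_c = rng.normal(3.3, 0.6, size=30)

t_stat, t_p = stats.ttest_ind(group_a, group_b)           # independent t test
f_stat, f_p = stats.f_oneway(group_a, group_b, group_c)   # one-way ANOVA
contingency = np.array([[20, 10], [12, 18]])               # e.g., practice used (yes/no) by discipline
chi2, chi_p, dof, expected = stats.chi2_contingency(contingency)

print(f"t test:     t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"ANOVA:      F = {f_stat:.2f}, p = {f_p:.3f}")
print(f"chi-square: chi2 = {chi2:.2f}, p = {chi_p:.3f}")
```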

Areas for improvement and strengths related to the development of existing instrumentation

Face validity

It is key that an instrument makes sense and appears to measure its intended concept from the perspective of the participant (DeLamater et al. 2014). This requires avoiding jargon-based (e.g., inquiry, problem solving), overly complex, and vague statements. Although 8 of the 12 instruments were pilot tested and revised before wide implementation, we coded vague teaching practice items in all instruments except the ATI, regardless of whether they were pilot tested (see Additional file 2). Items we considered “vague” were described too broadly to be assigned to any other instructional format or assessment code, for example, “How often did you use multimedia (e.g., video clips, animations, sound clips)?” (Marbach-Ad et al. 2012). Similarly, many instruments included double-barreled (or multi-barreled) items, which describe two or more concepts in a single question. For example, “In your selected course section, how much does the coursework emphasize applying facts, theories, or methods to practical problems or new situations?” (Center for Post-secondary Research at Indiana University [CPRIU]). Such items can be problematic for participants to answer and can produce data that are difficult for researchers to interpret (Clark & Watson 1995). We encourage users to look for and identify vague items in any instrument, as these items may reduce face validity and fail to produce meaningful data.

Content validity

Seven of the instruments we reviewed have documented use of an outside panel of experts to improve content validity (BEFS, HDPFS, OCES, PIPS, STEP, STI, and TPI). In particular, we highlight the efforts of the authors of the STI (Zieffler et al. 2012) for their iterative review process utilizing statistics education community members and NSF project advisors.

Construct validity

Construct validity is the least addressed component of validity in the instruments we reviewed. Only the ATI (2 constructs), FSSE (9 constructs), HERI (11 constructs), and PIPS (2 or 5 constructs) have documented analyses of how items grouped together in factor or principal components analyses. Furthermore, only the ATI, FSSE, and PIPS use confirmatory factor analyses to sort items into a priori categorizations. To this end, we add that none of the instruments builds upon a specific educational theory or generates a theoretical framework for the nature of postsecondary instructional practice.

Reliability

Only two of the available instruments (BEFS and FSSE) cite reliability values by construct. All other instruments fail to provide reliability statistics, bringing into question the precision of their results. Furthermore, none of the instruments we reviewed provided test-retest reliability statistics. We encourage future users of the instruments to consider longitudinal studies that would allow for the publication of these values.

Development process

We were surprised by the lack of documentation available for the development process of the instruments we reviewed. How items were generated, revised, and ultimately finalized was often not apparent. Survey development should be a transparent process, documented online if not in a manuscript. The ATI and STI are good examples of detailed methodological reporting, providing extensive detail from development of the initial item pool through item refinement and pilot testing to data analyses and ongoing revisions. Rationale should also be provided for item scales, with the goal of avoiding unjustified changes in scale among item blocks. We recommend referencing the psychometric literature (e.g., Bass et al. 1974; Clark & Watson 1995) to provide support for the use of particular scales.

What teaching practices do the instruments elicit? (RQ2)

Across the instruments in our sample, the majority focused the largest number of their items on instructional format (BEFS, HDPFS, OCES) or on a combination of instructional format and assessment (FSSE, PIPS, STI, STEP, SUCCEED, TPI). Other instruments had different foci. The ATI has nearly equal numbers of reflective practice items (n = 4) and instructional format items (n = 5), and the NSOPF devotes almost all of its 10 teaching practice items to assessment practice (n = 9). Only the HERI has equal proportions of instructional format, assessment, and reflective practice items, although these items are a subset of 284 total questions on the instrument. Figure 1 provides a breakdown of item types by instrument.

Fig. 1 Instrument items per coding category. Number of items per code category for postsecondary instructional practice surveys

Across the full 320-item pool, most items were coded into the instructional format category (see Additional file 2 for a full tabulation of codes). These 174 items most often referred to discussions (n = 17), group work (n = 16), students doing problem-solving activities (n = 16), real-world contexts (n = 12), instructor demonstration/example (n = 11), real-time polling (n = 9), and using quantitative approaches to manipulate or analyze data (n = 9). Rarely did items describe instruction in a lab or field setting (n = 6). In addition, the lab-specific items did not reflect current reforms in laboratory instruction (e.g., avoiding verification-based activities or allowing flexibility in methods; Lunetta et al. 2007).

Assessment practice items (n = 111) focused primarily on the nature of summative assessments. Items usually referred to instructor grading policy (n = 20), the format of questions on summative assessments (e.g., multiple-choice, open-ended questions) (n = 19), formative assessment (n = 12), or the general format of summative assessments (e.g., midterms, quizzes) (n = 11). The remaining assessment items primarily referred to student term papers (n = 10), group assessments (n = 7), student presentations (n = 7), content assessed on summative assessments (n = 6), the nature of feedback given to students (n = 6), and peer evaluation of assessments (n = 4). Few instruments explicitly address formative assessment practices, i.e., those that elicit, build upon, or evaluate students’ prior knowledge and ideas (Angelo & Cross 1993). While there were 12 total items referring to formative assessment, over half of them came from one instrument (TPI). Although the nine items sorted into the “real-time polling” code could refer to formative assessment, the use of clickers and whole-class voting does not imply formative use.

We also looked specifically at the discipline-based instruments in our sample including the BEFS (engineering), HDPFS (physics), OCES (geosciences), STEP (chemistry and biology), STI (statistics), and SUCCEED (engineering). Most of the discipline-based instruments focused the majority of their items on instructional format. The SUCCEED and the STI are exceptions in that they are evenly split between instructional format and assessment. The instructional format items across the discipline-based instruments most commonly focused on group work (n = 14), students analyzing data (n = 9), discussion (n = 6), and lecture (n = 5). Some of the instruments dedicated a substantial amount of their instructional format items to particular practices. For example, the HDPFS (n = 7) and OCES (n = 5) both have several items related to problem solving. The OCES (n = 5) and STI (n = 3) have items focused on having students quantitatively analyze datasets. In addition, BEFS has a particular focus on providing a real-world context for students (n = 4) and group work (n = 4). The HDPFS is also noteworthy for being the only discipline-based instrument with multiple items (n = 3) related to laboratory teaching practices. Only one other discipline-based instrument, the OCES, has a single item related to the laboratory.

The discipline-based instruments also had a secondary focus on assessment practices. The most common assessment items across the instruments were those related to the nature of the questions included on course assessments. In particular, the HDPFS authors dedicated the majority of their assessment items (n = 6) to the nature of assessment questions. That said, two instruments had a unique focus for their assessment practice items. The STI has six items (out of nine) related to including specific content on assessments, while the SUCCEED has three items (of five) focused on group assessments.

None of the disciplinary instruments had many reflective practice items. Three instruments had no reflective practice items. Two minor exceptions are the STEP and the SUCCEED, which both had two items aimed at whether learning goals are provided to students.

Conclusions

Although many of the instruments have development and/or psychometric issues, no instrument is wholly problematic. To conclude the paper, we return to our research questions and provide recommendations for users and developers of postsecondary teaching practice surveys.

What is the nature of the instruments that elicit self-report of postsecondary teaching practices? (RQ1)

The majority of instruments we reviewed were designed for particular STEM disciplines. Outside of large national instruments, there are few instruments designed for measuring teaching practices across disciplines. In addition, there is considerable variability in overall instrument length, the proportion of teaching practice items, and response scales. All of these aspects should be taken into account to maximize participants’ ease of completing the instrument and researchers’ interpretations of the data produced.

Considerations for users and developers

The purpose of this paper has been to analyze and compare available instruments, in part so that readers have a sense of direction when determining how to measure instructional practices in their given context. Based on this experience, we are able to identify questions for potential users and developers of postsecondary instructional practice instruments. This is not a set of research questions but rather questions to consider prior to implementation. For more specific recommendations for quality test administration, consider the guidelines published by the International Test Commission (International Test Commission [ITC] 2001).

Consideration 1: is there an established instrument?

We consider the first step to finding or developing a postsecondary teaching practice instrument to be an examination of what is currently available. We have created a flowchart (Fig. 2) to help users distinguish among the basic features of available instruments. Please note that this chart is a first step to navigating the sea of available instruments. It should not be interpreted as a recommendation for any of the instruments without deeper examination of the validity, reliability, content, and clarity of an instrument.

Fig. 2 Faculty self-report teaching practice instrument flowchart. A flowchart of guiding questions for use in selecting an instrument based on intended population and the general nature of its items. This chart should be used in tandem with the analysis in this paper and not as the sole source of information on available instrumentation

Consideration 2: is the instrument valid and reliable?

Upon confirmation that an instrument is appropriate for a particular audience, context, and set of research questions, the instrument should be assessed to determine whether it measures what it was intended to target (validity; Haynes et al. 1995) and produces repeatable and precise results (reliability; Cronbach 1947; Nunnally 1967). We describe common methods used to establish validity and reliability earlier in the manuscript (see the “Methods” section), and we summarize the methods used for each instrument in Table 2. If validity and reliability have been accounted for, a user can have some confidence in the results produced by an instrument. Keep in mind that not all measures of validity and reliability are appropriate for every instrument, depending on the goals of the instrument and how it was developed.

Consideration 3: what response scale(s) does the instrument use?

Inconsistent and unjustified item scales may add to the administrative burden of a test and may contribute to test fatigue (Royce 2007). We recommend careful examination of item scales, including the number of response options (see Bass et al. 1974) and the use of a neutral point on the scale. Forcing agreement or disagreement by eliminating a neutral option may reduce the number of participants who claim “no opinion” when they actually have one (Bishop 1987; Johns 2005).

Consideration 4: will you modify or adapt the instrument?

Should a user decide that an instrument is valid, reliable, and acceptable for their intended audience, we recommend that the survey be administered in its entirety and without modifying the items. Gathering data in this controlled way enables comparison with data from others who have used the instrument and preserves construct validity (van de Vijver 2001). Deviations from these conditions should be reported as constraints on the interpretation of results. We note that using a complete instrument may be more challenging for users interested in the FSSE, HERI, NSOPF, and/or SUCCEED, as these surveys have a large number of non-teaching practice items.

Consideration 5: do you plan to develop a new instrument?

Should the current instrumentation be insufficient for your needs, we recommend that new instruments be created in the most methodologically sound and transferable way possible (e.g., Rea & Parker 2014). Keep and disseminate detailed records of your development process, testing, and analyses. Communicate with other research groups for compatibility, comparability, and further reliability and validity testing. Since there has been little work to compare data gathered from the same population using different teaching practice instruments, we suggest gathering data using both the new instrument and a reliable and valid existing instrument to see whether the instruments elicit teaching practices in similar or distinct ways.

What teaching practices do the instruments elicit? (RQ2)

The bulk of the teaching practice items across the instruments reviewed were focused on instructional format and/or assessment practice. Two important areas that seem to be missing from many of the instruments are lab instructional practices and formative assessment. Both areas should be addressed in future instrument development.

Recommendations for future research

As discussed in this paper, many instruments currently exist for describing postsecondary teaching practices. More work is certainly needed to further refine these instruments and other similar instruments. More importantly, though, the field currently lacks instrumentation for measuring teaching practices in laboratory and online settings.

Measuring instructional practices in online courses

Despite widespread and increasing adoption of online learning approaches (Johnson et al. 2013), there are no comprehensive surveys of online teaching practices nor an objective set of descriptors to classify online teaching practices. This is not to say we do not know what makes effective online instruction. Significant effort by instructional designers, faculty developers, and online platform providers has generated checklists and rubrics of best practices (e.g., Quality Matters, BlackBoard Exemplary Course Program Rubric, MERLOT Evaluation Standards for Learning Materials).

However, best practice rubrics are designed for self-reflection or peer evaluation. They are not designed to consistently and precisely measure the same instructional practices over separate administrations, nor are they confirmed to measure what they intend. For proper comparisons among data sets and accurate results, valid and reliable instruments should be designed to measure instructional practices in online settings.

Laboratory instructional practices

As with online course settings, we find that the available surveys for face-to-face classrooms are missing elements that describe components of effective laboratory teaching, such as avoiding verification-based activities and allowing flexibility in methods (e.g., Lunetta et al. 2007).

Inclusivity

Lastly, we see little discussion of teaching strategies specific to improving outcomes for many groups of students that are typically underrepresented in STEM disciplines, such as students with disabilities or underprepared students. Such students make up an increasing proportion of the college student population. We consider many reform-based instructional strategies to include components of universal design (Scott et al. 2003); universal design requires an intentional approach to a variety of human needs and diversity. Some universal design elements may be elicited through items on existing instruments, including items that highlight a community of learners, flexibility in teaching methods, and tolerance for student error on assessments. Other elements, including the intentionality to use methods that address the needs of diverse learners, are not as apparent in the current instrumentation. We encourage developers to consider elements of universal design when generating survey items.