Evaluating the effectiveness of rating instruments for a communication skills assessment of medical residents
Cite this article as: Iramaneerat, C., Myford, C. M., Yudkowsky, R., et al. Adv in Health Sci Educ (2009) 14: 575. doi:10.1007/s10459-008-9142-2
The investigators used evidence based on response processes to evaluate and improve the validity of scores on the Patient-Centered Communication and Interpersonal Skills (CIS) Scale for the assessment of residents’ communication competence. The investigators retrospectively analyzed the communication skills ratings of 68 residents at the University of Illinois at Chicago (UIC). Each resident encountered six standardized patients (SPs) portraying six cases. SPs rated the performance of each resident using the CIS Scale—an 18-item rating instrument asking for level of agreement on a 5-category scale. A many-faceted Rasch measurement model was used to determine how effectively each item and scale on the rating instrument performed. The analyses revealed that items were too easy for the residents. The SPs underutilized the lowest rating category, making the scale function as a 4-category rating scale. Some SPs were inconsistent when assigning ratings in the middle categories. The investigators modified the rating instrument based on the findings, creating the Revised UIC Communication and Interpersonal Skills (RUCIS) Scale—a 13-item rating instrument that employs a 4-category behaviorally anchored rating scale for each item. The investigators implemented the RUCIS Scale in a subsequent communication skills OSCE for 85 residents. The analyses revealed that the RUCIS Scale functioned more effectively than the CIS Scale in several respects (e.g., a more uniform distribution of ratings across categories, and better fit of the items to the measurement model). However, SPs still rarely assigned ratings in the lowest rating category of each scale.
Keywords: Validity · Rating scale · Communication skills · Many-faceted Rasch measurement · OSCE
Communication and interpersonal skills constitute one of the six core competencies for which residency programs must demonstrate training outcomes (Accreditation Council for Graduate Medical Education 1999). An assessment of residents’ communication skills that can provide valid inferences about their ability to exchange information and ally with patients requires an observed interaction with patients. The Accreditation Council for Graduate Medical Education (ACGME) and the American Board of Medical Specialties (ABMS) recommend using an assessment format that asks residents to interact with standardized patients (SPs) in an Objective Structured Clinical Examination (OSCE) as the most desirable approach for communication skills assessment (Bashook and Swing 2000).
The rating instrument that a standardized patient uses to record his/her observations of a resident’s performance during a communication skills OSCE plays a critical role in providing valid inferences from an assessment. A rating instrument not only guides the observation but also dictates the scoring of the performance of individual residents. Several rating instruments for the assessment of medical communication skills by SPs in OSCE settings have been developed and validated, including the Interpersonal and Communication Skills Checklist (Cohen et al. 1996), the Interpersonal Skills Rating Form (Schnabl et al. 1991), the Arizona Clinical Interview Rating Scale (Stillman et al. 1976, 1986), the Brown University Interpersonal Skill Evaluation (Burchard and Rowland-Morin 1990), the SEGUE Framework (Makoul 2001), the Liverpool Communication Skills Assessment Scale (LCSAS) (Humphris 2002; Humphris and Kaney 2001), and the Patient-Centered Communication and Interpersonal Skills (CIS) Scale (Yudkowsky et al. 2004, 2006).
Despite the many available rating instruments for communication skills assessment in OSCE settings, choosing an appropriate instrument to score residents’ performance in a communication skills OSCE is not an easy task. Validity evidence that supports the use of scores obtained from these rating instruments is quite limited. Researchers conducting validity studies of these instruments have focused mainly on reporting internal consistency reliability, inter-rater reliability, and correlations of scores with measures of other variables. According to the Standards for Educational and Psychological Testing (American Educational Research Association et al. 1999), such validity research only provides evidence based on internal structure and relations to other variables, leaving out evidence based on test content, response processes, and consequences.
In this study, we evaluated validity evidence related to the use of one of the existing communication skills OSCE rating instruments—the Patient-Centered Communication and Interpersonal Skills (CIS) Scale. We focused on evidence based on response processes, a source of validity evidence that test score users often overlook. In the context of a communication skills OSCE, the validity evidence based on response processes refers to the evaluation of the extent to which the SPs apply rating criteria to rate the residents’ performance in a manner that is consistent with the intended interpretation and uses of scores (American Educational Research Association et al. 1999).
There are several approaches that researchers can use to gather validity evidence based on response processes. Researchers can collect some pieces of evidence before the OSCE administration (e.g., documenting the rating criteria and the processes for selecting, training, and qualifying SPs). Researchers can collect other pieces of evidence at the time a SP rates the performance (e.g., engaging SPs in verbal think-aloud during the rating process, thus allowing researchers to know what SPs are thinking while deciding what rating they will assign (Heller et al. 1998)). The focus of this study was on gathering validity evidence related to response processes after an OSCE administration (i.e., when all the ratings were available to us). That is, we carried out a psychometric analysis of ratings to investigate to what extent the OSCE ratings were consistent with the intended uses of the scores. OSCE ratings are the result of the interaction between residents, cases, items (and their rating scales), and SPs. A comprehensive validity study of response processes for an OSCE would require close examination of responses related to all these components of an OSCE. In this study, we limited the scope of our analyses to response processes related to items and scales on the rating instrument. That is, we investigated the extent to which SPs used the rating instrument to rate the residents’ performance in a way that was consistent with the intended uses of the scores.
This study looked at the use of the CIS Scale in the scoring of internal medicine residents’ performance in communication skills OSCEs carried out at the University of Illinois at Chicago (UIC). The purposes of our study were (1) to evaluate the effectiveness of the CIS Scale in scoring the residents’ performance in the communication skills OSCE, (2) to use the findings obtained from the analysis to determine whether the rating instrument needed to be revised to improve its effectiveness, (3) to use the results from the analysis to guide the instrument revision process, and (4) to compare the original CIS Scale to the modified rating instrument to determine whether the modifications helped improve the scale’s functioning, thus in effect enhancing the validity of the inferences made from scores on the communication skills OSCE. In the course of evaluating the effectiveness of these two rating instruments, we demonstrate how researchers can analyze OSCE rating data to provide validity evidence related to response processes.
We carried out the study in two phases. The first phase was a retrospective analysis of the communication skills OSCE ratings for internal medicine residents obtained in 2003, in which SPs used the CIS scale to rate the performance of residents. We identified certain items and scales on that rating instrument that did not function effectively and revised the rating instrument to address those weaknesses. We piloted the revised instrument with a small group of SP trainers and medical faculty members and then further revised the instrument based on the comments obtained from the pilot study. This led to the development of a revised rating instrument for communication skills assessment called the Revised UIC Communication and Interpersonal Skills (RUCIS) scale.
In the second phase of the study, we implemented the RUCIS scale in the 2007 communication skills OSCE for internal medicine residents. We carried out an analysis to evaluate the effectiveness of the revised rating instrument in order to determine whether the instrument modifications helped improve the effectiveness of the instrument. Both the 2003 and 2007 communication skills OSCEs were mandatory formative assessments conducted as part of the standard curriculum of the residency program.
Participants in the 2003 communication skills OSCE included 68 internal medicine residents (51% PGY-2 and 49% PGY-3; 66% male and 34% female) and 8 SPs (38% male and 62% female). Participants in the 2007 communication skills OSCE included 85 internal medicine residents (54% PGY-1 and 46% PGY-2; 47% male and 53% female) and 17 SPs (29% male and 71% female).
The CIS Scale, which SPs used to rate the performance of residents in the 2003 communication skills OSCE, is an 18-item rating instrument. Each item asks SPs to provide an agreement rating using a 5-category rating scale, in which 1 corresponds to “strongly disagree” and 5 corresponds to “strongly agree.” Since all items are statements of desirable communication behaviors, higher ratings indicate a higher level of communication competence (see Appendix A).
The RUCIS Scale, which SPs used to rate the performance of residents in the 2007 communication skills OSCE, is a 13-item rating instrument. Each item contains a short description of the particular aspect of communication under consideration and four behaviorally anchored rating categories unique to each item. For each item, the lowest rating category always describes the least appropriate behavior for that aspect of communication, while the highest rating category always describes the most appropriate behavior for that aspect. In addition to the four rating categories for each item, six items also have a “not applicable” option that SPs could use when they did not observe the behavior related to that aspect of communication (See Appendix B).
In the 2003 communication skills OSCE, all the SPs took part in an intensive training program to learn how to portray the cases and how to rate resident performance before participating in the OSCE. The training program included a review and discussion of the case script and repeatedly practicing the appropriate portrayal of the cases under the supervision of a trainer. Training on the CIS scale included a review and discussion of the scale and practice using it to rate a videotaped or observed performance. There was no attempt to reach agreement between the SP and trainer in the ratings they assigned, but divergent ratings were noted and discussed. The trainer ensured that each SP could portray the case consistently and rate the performance of residents to the trainer’s satisfaction before the SP was allowed to participate in the communication skills OSCE.
In the 2007 communication skills OSCE, all the SPs also took part in an intensive SP training program similar to the training for the 2003 communication skills OSCE to ensure an accurate portrayal of the cases before participating in the OSCE. However, this time we employed a frame-of-reference (FOR) approach in training the SPs to provide ratings (Bernardin and Buckley 1981). Prior to training, a group of SP trainers reviewed selected videotaped OSCE sessions and provided a consensus “gold standard” rating for each item in each encounter. During the training sessions SPs rated the selected videotaped OSCE sessions using the RUCIS scale, compared their ratings to the trainers’ “gold standard” ratings, and discussed the rationale for the gold standard. By practicing and receiving feedback on several videotaped OSCE sessions, the SPs developed a common rating standard (i.e., frame) by which to evaluate residents’ performances.
Both OSCEs employed the same cases and the same administration format. Six residents were assessed in each half-day session. In each session, each resident encountered six different SPs in six different clinical scenarios (cases). In each case, residents spent 10 min in the encounter with the SP, 5 min reviewing task-related educational materials while the SP rated the performance, and another 5 min receiving verbal feedback from the SP. The task-related educational materials consisted of printed documents describing effective ways to interact with a patient in the situation they had just encountered. The verbal feedback session provided SPs and residents with the opportunity to discuss effective and ineffective behaviors observed during the encounter, and to practice techniques that the SP suggested. The SP did not inform the resident of his/her specific ratings. The six communication tasks that residents encountered were: (1) providing patient education, (2) obtaining informed consent, (3) dealing with a patient who refuses treatment, (4) counseling an elderly patient who has been abused, (5) giving bad news to a patient, and (6) conducting a physical examination. We repeated the OSCE sessions once or twice a week until all residents had the opportunity to participate, which took 2 months for the 2003 OSCE and 4 months for the 2007 OSCE.
Because the OSCE is a multi-faceted assessment method where the rating of a resident’s performance depends upon many factors, including the communication competence of the resident, the difficulty of the item on the rating instrument, the severity of the SP, and the difficulty of the case, we used a many-faceted Rasch measurement (i.e., Facets) model (Linacre 1989) to analyze the data. The Facets model uses a logarithmic function of the odds of receiving a rating in a given category as compared to the odds of receiving a rating in the next lower category to define the communication competence of residents, the difficulty of items, the severity of SPs, and the difficulty of cases. All measures of these four facets are reported on the logit scale, which is a linear, equal interval scale. Higher logit measures indicate more competent residents, more difficult items, more severe SPs, and more difficult cases. Because there were multiple rating categories for each item, the Facets model also calculated a set of step thresholds for each item. (A step threshold is the transition point between two adjacent categories, where the probabilities of receiving a rating in the two categories are equal.) We used the Facets computer program (Linacre 2005) to conduct the analyses.
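In our notation (the symbols here are ours, chosen to match the facets named above, not Linacre's original notation), the many-faceted Rasch model just described can be written as:

```latex
\log\!\left(\frac{P_{nijmk}}{P_{nijm(k-1)}}\right) = B_n - D_i - C_j - E_m - F_k
```

where \(P_{nijmk}\) is the probability that resident \(n\) receives a rating in category \(k\) from SP \(j\) on item \(i\) for case \(m\); \(B_n\) is the communication competence of the resident, \(D_i\) the difficulty of the item, \(C_j\) the severity of the SP, \(E_m\) the difficulty of the case, and \(F_k\) the step threshold marking the transition from category \(k-1\) to category \(k\).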
To ensure that the analyses to obtain validity evidence based on response processes would be based on reliable data, we first examined the degree of reproducibility of residents’ communication competence measures—validity evidence related to the internal structure of test scores. We calculated a measure of internal consistency reliability, the resident separation reliability, which is an index analogous to KR-20 or Cronbach’s Alpha. Because ratings of multiple items on the same case by the same SP can be dependent on one another, which could lead to overestimation of reliability (Sireci et al. 1991; Thissen et al. 1989), we used cases (rather than items) as scoring units. That is, we averaged the ratings a SP gave to all items in a given case to produce a case score, which we considered as one rating in the Facets analysis.
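The case-score construction and the separation reliability index can be sketched as follows. This is a simplified illustration, not the Facets program's actual estimation: the ratings are randomly generated, the raw case-score means stand in for logit measures, and the error variance is approximated from each resident's standard error of measurement.

```python
import numpy as np

# Hypothetical ratings: 68 residents x 6 cases x 18 items (values 1-5).
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(68, 6, 18)).astype(float)

# Average the item ratings within each case so each SP encounter
# contributes a single case score, reducing local item dependence.
case_scores = ratings.mean(axis=2)          # shape: (residents, cases)

# Separation reliability, analogous to KR-20 / Cronbach's alpha:
#   R = (observed variance - error variance) / observed variance,
# with error variance estimated from each resident's SEM across cases.
person_measures = case_scores.mean(axis=1)  # crude stand-in for logit measures
sem2 = case_scores.var(axis=1, ddof=1) / case_scores.shape[1]
obs_var = person_measures.var(ddof=1)
reliability = (obs_var - sem2.mean()) / obs_var
print(round(float(reliability), 3))
```

With purely random ratings the index hovers near zero; with real data the true-score variance between residents would push it toward 1.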
An effective rating instrument for an OSCE should produce ratings that satisfy two tests related to response processes. The first involves determining whether each rating scale functioned appropriately (i.e., were the categories on the scales that the SPs used well-defined, mutually exclusive, and exhaustive?). The second involves determining whether each item on the rating instrument functioned properly (i.e., when evaluating each resident’s performance, did SPs assign ratings for each item in a consistent fashion?).
1. There should be at least 10 ratings in each rating category to allow accurate calibration of step thresholds.
2. The frequency distribution of ratings across categories should have a uniform or unimodal pattern. If SPs use only a few of the rating categories and rarely use other rating categories, the resulting irregular distribution of ratings indicates a poorly functioning scale that cannot effectively differentiate residents according to their levels of communication competence.
3. The average measures of residents’ communication competence should increase as the rating categories increase. In other words, residents who receive ratings in higher categories should have higher overall communication competence measures than those who receive ratings in lower categories.
4. The step thresholds should increase as the rating categories increase. This criterion mirrors the third criterion. Failure of step thresholds to increase as the rating categories increase is called step disordering, which suggests that SPs may have difficulty differentiating the performance of residents in those categories. One or more of the rating categories for a particular item may not be clearly defined.
5. The step thresholds should advance by at least 1 logit, but not more than 5 logits. The finding that two step thresholds advance by less than 1 logit would suggest that those two rating categories are practically inseparable. That is, SPs may not be able to reliably differentiate between them. On the other hand, step thresholds that are too far apart indicate a possible dead zone on the scale where measurement loses its precision.
6. The outfit mean-square value for each rating category should be less than 2.0. An outfit mean-square value is a statistical index that indicates how well the ratings in each category fit the measurement model. Its value can range from 0 to infinity, with an expected value of 1. A high outfit mean-square value for a rating category is an indicator that some SPs used that rating category in an unexpected or surprising manner that was inconsistent with the way that other SPs used that category.
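Several of the category-functioning checks above reduce to simple tallies over the rating data. The sketch below (with invented data and a hypothetical helper function, not part of the Facets program) checks the minimum-count criterion and the monotonicity of average competence measures across categories for a single item:

```python
import numpy as np

def category_diagnostics(ratings, person_measures, categories=(1, 2, 3, 4, 5)):
    """Check two category criteria for one item's ratings.

    ratings: 1-D array of category ratings for a single item
    person_measures: matching array of resident competence measures (logits)
    Returns per-category counts, per-category average measures, and two flags.
    """
    counts = {k: int((ratings == k).sum()) for k in categories}
    avg_measures = {k: float(person_measures[ratings == k].mean())
                       if counts[k] else None
                    for k in categories}

    # Criterion 1: at least 10 ratings in every category.
    enough_ratings = all(c >= 10 for c in counts.values())
    # Criterion 3: average measures increase with the category.
    observed = [avg_measures[k] for k in categories if avg_measures[k] is not None]
    monotonic = all(a < b for a, b in zip(observed, observed[1:]))
    return counts, avg_measures, enough_ratings, monotonic

# Hypothetical data: 400 ratings on a 5-category scale, with competence
# measures that rise (noisily) with the category assigned.
rng = np.random.default_rng(1)
cats = rng.integers(1, 6, size=400)
measures = cats + rng.normal(0, 0.3, size=400)
counts, avgs, enough, mono = category_diagnostics(cats, measures)
print(enough, mono)
```

A real analysis would take the person measures from the Facets calibration rather than constructing them alongside the ratings.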
In addition to evaluating the functioning of the scale categories, we evaluated fit statistics for each item on the instrument to determine whether SPs provided aberrant ratings on any items, which might indicate problematic response processes. These fit statistics are indices that indicate how well the rating data for each item fit the measurement model. In this study, we examined both outfit and infit mean-square statistics for each item. We calculated an outfit mean-square value for each item by dividing the sum of the squared standardized residuals for the item by its degrees of freedom. (A residual is the difference between the rating a SP assigned a resident on an item and the rating the measurement model predicted the SP would assign.) This calculation produces a value that can range from 0 to infinity, with an expectation of 1.0. Values larger than 1.0 indicate the presence of unmodeled noise in the ratings for that item (i.e., unexpected ratings that SPs assigned when evaluating residents, given how SPs assigned ratings for other items). By contrast, values less than 1.0 indicate that there was too little variation in the ratings SPs assigned for that item (Linacre and Wright 1994; Wright and Masters 1982). However, outfit mean-square values are very sensitive to outlier ratings. To reduce the influence of outlier ratings, we weighted each squared standardized residual by its information function before we summed them. (This involved applying differential weights to standardized residuals. That is, residuals that resulted from SP ratings of items and cases that were far too easy or too difficult for residents received less weight than those that resulted from SP ratings of items and cases that were at the appropriate difficulty level for residents.)
This calculation produced an infit mean-square statistic that has the same distribution and interpretation as an outfit mean-square statistic, but is more immune to the influence of the ratings for residents on items or cases that are far too easy or difficult for them. Wright and Linacre (1994) recommended that an appropriate mean-square fit statistic for judge-mediated ratings should be in the range of 0.4–1.2.
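The two fit statistics just described can be sketched directly from their definitions. In this illustration (the observed ratings, expected ratings, and model variances are invented for the example; in practice all three come from the Facets calibration), the outfit mean-square is the unweighted mean of squared standardized residuals, while the infit mean-square weights each squared standardized residual by its model variance (the information function):

```python
import numpy as np

def item_fit(observed, expected, variance):
    """Outfit and infit mean-squares for one item across N ratings.

    observed: ratings SPs assigned
    expected: model-predicted ratings
    variance: model variance of each rating (the information function)
    """
    z2 = (observed - expected) ** 2 / variance        # squared standardized residuals
    outfit = z2.mean()                                # unweighted: outlier-sensitive
    infit = np.sum(z2 * variance) / np.sum(variance)  # information-weighted
    return float(outfit), float(infit)

# Hypothetical values for eight ratings of one item.
obs = np.array([4., 5., 3., 2., 4., 5., 1., 4.])
exp = np.array([3.8, 4.6, 3.2, 2.5, 3.9, 4.4, 2.8, 4.1])
var = np.array([0.9, 0.7, 1.0, 1.1, 0.9, 0.8, 1.2, 0.9])
outfit, infit = item_fit(obs, exp, var)
print(round(outfit, 2), round(infit, 2))
```

Both statistics have an expectation of 1.0; values below Wright and Linacre's (1994) recommended 0.4–1.2 band signal too little variation, values above it signal unmodeled noise.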
From the analysis of the 2003 communication skills OSCE ratings, we identified the items and rating categories on the CIS Scale that did not function effectively according to one or more of these criteria. We then used these findings to guide the development of a modified rating instrument—the RUCIS Scale. We implemented the RUCIS Scale in the 2007 communication skills OSCE and evaluated the effectiveness of the revised instrument using the same criteria to determine whether the modifications helped improve the effectiveness of the instrument, thus in effect enhancing the validity of the score interpretation.
Evaluating the effectiveness of the CIS scale
Table 1 Summary of measures obtained from the analysis of the communication skills OSCEs: (A) 2003 communication skills OSCE; (B) 2007 communication skills OSCE
Table 2 Comparing the functioning of the CIS scale (2003) and RUCIS scale (2007) using Linacre’s (2004) guidelines. Entries are the number (and percentage) of items satisfying each criterion.

| Criterion | CIS scale (5-category scales, 18 items) | RUCIS scale (4-category scales, 13 items) |
| --- | --- | --- |
| Resident separation reliability | | |
| Criteria for evaluating the functioning of the categories on each rating scale | | |
| At least 10 ratings in each category | 5 items (28%) | 6 items (46%) |
| Uniform/unimodal distribution of ratings across categories | 1 item (6%) | 12 items (92%) |
| Residents with higher category ratings have higher overall communication competence measures | 7 items (39%) | 12 items (92%) |
| No step disordering | 9 items (50%) | 11 items (85%) |
| Step thresholds advance by at least 1 logit, but not more than 5 logits | 1 item (6%) | 10 items (77%) |
| An outfit mean-square value <2.0 for each rating category | 11 items (61%) | 13 items (100%) |
| Criteria for evaluating the functioning of the items on the instrument | | |
| Outfit mean-square values <1.2 | 14 items (78%) | 12 items (92%) |
| Infit mean-square values <1.2 | 16 items (89%) | 13 items (100%) |
The analysis also revealed that some SPs experienced difficulty in differentiating between the middle categories of the 5-category agreement scale, as demonstrated by the failure of the average measures and step thresholds to increase properly along with the rating categories. Only seven items (items 4, 5, 6, 9, 11, 16, and 18) exhibited proper advancement of average resident communication competence measures as the rating categories increased. Nine items (items 2, 4, 6, 10, 11, 13, 14, 15, and 17) showed disordered step thresholds. Seven items (items 7, 8, 9, 10, 13, 14, and 15) had one or more rating categories with outfit mean-square values equal to or greater than 2, reflecting inconsistent use of the categories. Only one item (item 12) had all adjacent step thresholds separated by at least one logit. The other 17 items had one or more step thresholds that were too close to adjacent thresholds, especially for step thresholds in the middle of the scale. However, none of the 18 items had step thresholds that advanced by more than five logits, suggesting that there were no significant gaps between the categories.
Table 3 Summary of item fit statistics (outfit and infit mean-square values): (A) 2003 communication skills OSCE; (B) 2007 communication skills OSCE
Modifying the rating instrument
The findings from our validity study revealed that there were several aspects of the CIS Scale that did not function well. Using these findings as our guide, we worked with medical faculty and SPs to revise the CIS Scale in several ways. Instead of using a single Likert-style agreement rating scale that was applicable to all items on the instrument, we devised a behaviorally anchored rating scale (BARS) (Bernardin and Smith 1981; Smith and Kendall 1963) that provided a detailed description of the specific communication behavior characteristic of each rating category for each item. Our expectation was that the change in the scale format would make each rating scale more specific to the context of a particular item and less open to SPs’ idiosyncratic interpretations.
Because our analysis revealed that the lowest rating category on the CIS Scale was a non-functioning category, we decided to change the scale format from 5-category scales to 4-category scales. To address the problem of an unbalanced rating distribution in which 70–80% of ratings were positive ratings, while only 20–30% of ratings were neutral or negative ratings, we developed 4-category scales that were saturated on the positive side. In other words, we created a separate rating scale for each item with only one category describing inadequate performance and three categories describing satisfactory communication behaviors that exemplified progressively higher levels of performance.
We also provided a “not applicable” option for six items. Our goal was to eliminate some unexpected ratings that SPs assigned in the neutral category of the agreement scale when they found themselves unable to rate a certain aspect of communication during the encounter because they did not observe any evidence that the resident engaged in that aspect.
Although we did not change the content coverage of the rating instrument, we revised the items to eliminate redundancy and improve their clarity. We combined into one item the redundant items that addressed the same aspect of communication. Specifically, we combined items 1 and 2 into an item on friendly communication; combined items 7, 8, and 9 into an item on discussion of options; combined items 10, 11, and 12 into an item on encouraging questions; and combined items 13 and 14 into an item on providing a clear explanation. We created a new item on physical examination to allow SPs to separate the act of providing an explanation of a physical examination from the act of providing an explanation about medical conditions.
Finally, we attempted to make several items more difficult by requiring that residents demonstrate communication behaviors that are more sophisticated and/or difficult to perform to qualify for a rating in the highest category.
These modifications led to the development of a revised rating instrument, called the RUCIS Scale (Appendix B), which we later used in the scoring of residents’ performance in the 2007 communication skills OSCE.
Evaluating the effectiveness of the RUCIS scale
Table 2 provides a point-by-point comparison of the findings from our analyses of the functioning of the CIS Scale and the RUCIS Scale. We found that seven items on the revised instrument still had fewer than 10 ratings assigned in the lowest category. Beyond this, nearly all the items and rating scales appearing on the RUCIS Scale satisfied Linacre’s criteria. All items but one had a uniform or unimodal distribution of ratings that peaked in the middle or at the high end. Item 5 (interest in me as a person) was the only item that had a rating distribution that peaked in rating category 1. Item 2 (respectful treatment) was the only item that did not show increasing average measures as rating categories increased. The rating categories for all items fit the measurement model (i.e., all outfit mean-square values for the rating categories were less than 2). Items 7 and 12 were the only two items with disordered step thresholds. Some of the distances between step thresholds for Items 5, 6, and 10 were too narrow (i.e., less than one logit apart). However, all the step thresholds for the other items were appropriately ordered and advanced by more than one logit but less than five logits.
We summarized item fit statistics obtained from the analysis of the 2007 communication skills OSCE in Table 3. All items showed good fit to the measurement model according to their infit mean-square values. Item 5 (interest in me as a person) was the only item with an outfit mean-square value higher than 1.2, indicating too much unexplained variance in the ratings that SPs assigned for this item. Thus, it was the only item that needed close examination to try to determine what made it difficult for SPs to use the item’s behaviorally anchored rating scale to assign consistent ratings.
This study demonstrated the process of using validity evidence obtained from a Facets analysis to help revise an assessment instrument such as an OSCE rating scale. Validation is a continuing process of gathering and evaluating various sources of evidence to determine whether that evidence supports (or refutes) the proposed score interpretation. The two phases of this study correspond to the two stages of validation that Kane (2006) described. In the first phase of the study, we focused on finding ways to build a measurement instrument that possessed appropriate psychometric properties that would support the intended uses of OSCE scores. This phase corresponds to the development stage of validation. In the second phase, we critically evaluated whether the newly developed rating instrument actually functioned as predicted. This phase corresponds to the appraisal stage of validation.
In the first phase of our study, validity evidence based on response processes helped us identify several aspects of the CIS Scale that did not function as intended. The validity evidence suggested that the 5-category Likert-style agreement scale functioned as an unbalanced 4-category rating scale (i.e., most of the ratings were positive ratings, while only a few ratings were neutral or negative). This finding indicated that the items on the CIS Scale were too easy for this sample of residents. Results from our analyses also suggested that some SPs were unable to differentiate performance in the middle categories of the scale. Additionally, we found that some SPs assigned a number of surprising or unexpected ratings for item 10 (I felt you encouraged me to ask questions) and for item 15 (I felt you were careful to use plain language), suggesting that these SPs were not able to consistently apply the rating criteria for these two items to rate some residents’ performances. All these pieces of validity evidence provided useful information to guide the development of a revised rating instrument in our attempt to address these weaknesses of the CIS Scale.
In the second phase of our validity study, we implemented the revised instrument in a later administration of the communication skills OSCE and carried out the same types of analyses that had revealed the inadequacies of the CIS Scale. We considered this as a test of whether the revised instrument could withstand the same validity challenges as its predecessor. We found that in many aspects the RUCIS Scale helped improve score interpretability. The SPs more consistently applied the rating criteria to rate residents’ performances. The items on the RUCIS Scale fit the measurement model quite well. Providing a clear description of communication behavior that was appropriate for each rating category for the two misfitting items on the CIS Scale (items 10 and 15) helped eliminate confusion among SPs in rating these two aspects of communication (as demonstrated by good item fit statistics for items 7 and 10 on the RUCIS Scale).
However, the modifications we made to the rating instrument did not address all the validity issues we identified in the CIS Scale. There was one area in which the revised instrument did not show significant improvement over its predecessor. When using the behaviorally anchored rating scales, SPs still assigned only a few ratings in the lowest rating category of many items. This could be due to a restricted range of communication competence among the particular sample of residents assessed. We developed the RUCIS Scale with a broad range of communication competence in mind—from very incompetent physicians to very competent physicians. The subjects included in the 2007 communication skills OSCE were a single group of residents in one residency program. This limited the range of observable communication skills that SPs were likely to see. If we were to assess a broader range of subjects, ranging from medical students in their early years of training to experienced physicians practicing in various specialties from geographically diverse medical settings, the SPs would be more likely to observe a broader range of communication behaviors and would be more likely to employ the full range of rating categories appearing on each behaviorally anchored rating scale. Testing this hypothesis would require that researchers conduct additional studies to evaluate validity generalization (American Educational Research Association et al. 1999). That is, we are suggesting that researchers carry out studies to determine the extent to which variations in situational facets (e.g., residents from different residency programs, different SPs, etc.) may affect the assignment of ratings. The studies would help us determine how generalizable the results we obtained are across subjects that differ in education and experience, and across SPs.
Another possible explanation for the non-uniform distributions of ratings is that SPs may have been uncomfortable assigning very low ratings to residents. If so, SP trainers could address this issue during training, helping SPs understand that it is appropriate (and expected) to assign low ratings when they observe physician behaviors that warrant them. However, we would caution against applying the expectation of uniform category usage too strictly. In a formative assessment, or in a summative assessment of residents who have not yet been trained, a relatively uniform distribution of ratings is to be expected. In a summative assessment in which the majority of residents have practiced the skills and are well prepared for the communication tasks, however, a skewed distribution in which only a few residents receive ratings in the lower categories may result, and such a distribution need not indicate a problem with the rating instrument.
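The kind of category-usage check discussed above can be sketched in a few lines of code. The sketch below, written in Python, tallies how often raters use each category of a 4-category scale and flags underused categories. The function name, the illustrative data, and the 10-observation minimum (a common rule of thumb for stable rating-scale analysis) are our assumptions, not values or procedures taken from the study.

```python
from collections import Counter

def category_usage(ratings, categories=(1, 2, 3, 4), min_count=10):
    """Tally how often raters used each category and flag underused ones.

    min_count=10 reflects a common rule of thumb (roughly ten
    observations per category for stable threshold estimation); it is
    an assumption here, not a criterion taken from the study.
    """
    counts = Counter(ratings)
    total = len(ratings)
    report = {}
    for k in categories:
        n = counts.get(k, 0)
        report[k] = {
            "count": n,
            "proportion": n / total if total else 0.0,
            "underused": n < min_count,
        }
    return report

# Hypothetical ratings on a 4-category scale in which the lowest
# category is rarely used, as observed for the RUCIS Scale.
ratings = [4] * 40 + [3] * 30 + [2] * 8 + [1] * 2
report = category_usage(ratings)
```

A report of this form makes it easy to see at a glance which categories raters are avoiding before turning to the full Rasch category diagnostics.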
The evaluation of item fit statistics for the RUCIS Scale revealed that item 5 (interest in me as a person) was the only item with too much unexplained variance in its ratings. Interestingly, two of the SPs were responsible for 65% of the statistically significant unexpected ratings for this item (i.e., ratings whose standardized residuals had an absolute value greater than 2.0). This finding suggests that the source of error in the ratings of item 5 might be the inconsistency of only two SPs, and that the fit of the item might be improved through additional training of these two SPs to clear up any confusion they experienced when rating this item.
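The standardized residuals used above to flag unexpected ratings can be computed directly from the model. The sketch below shows the computation for the Andrich rating scale model that underlies many-faceted Rasch analysis: the model yields an expected score and variance for each rating, and z = (observed − expected) / √variance. The parameter values are purely illustrative, not estimates from the study, and categories are indexed 0–3 here.

```python
import math

def category_probs(theta, difficulty, thresholds):
    """Category probabilities under the Andrich rating scale model.

    thresholds: step parameters tau_1..tau_m; the cumulative logit
    for category 0 is defined as 0.
    """
    cum, logits = 0.0, [0.0]
    for tau in thresholds:
        cum += theta - difficulty - tau
        logits.append(cum)
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def standardized_residual(x, theta, difficulty, thresholds):
    """z = (observed - expected) / sqrt(model variance) for one rating."""
    p = category_probs(theta, difficulty, thresholds)
    expected = sum(k * pk for k, pk in enumerate(p))
    variance = sum((k - expected) ** 2 * pk for k, pk in enumerate(p))
    return (x - expected) / math.sqrt(variance)

# Illustrative values: a resident of average ability rated on an
# average-difficulty item with symmetric thresholds.
z = standardized_residual(x=0, theta=0.0, difficulty=0.0,
                          thresholds=[-1.0, 0.0, 1.0])
flagged = abs(z) > 2.0  # the |z| > 2.0 criterion mentioned above
```

Ratings flagged this way can then be cross-tabulated by SP, which is how a pattern such as two SPs accounting for most of an item's misfit would surface.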
Although we carried out the study in two phases that addressed both the development and appraisal stages of validation (Kane 2006), this study by no means presents a complete validation effort. Validation is an ongoing process of gathering various sources of evidence to support proposed score interpretations. One could treat the findings from the second phase of this study as input for further modifying the rating instrument to craft an even more psychometrically sound assessment, thus cycling back to the development stage of validation once again. For example, our results suggest that item 5 on the RUCIS Scale is still problematic, since it continues to show inadequate fit to the measurement model; additional modification of this item is a potential avenue for further instrument improvement.
There are several limitations to the interpretation and application of the findings from this study. The first is the instrument’s exclusive focus on patient-centered medical communication. The ACGME’s (1999) definition of communication skills emphasizes the ability to communicate not only with patients but also with other members of a healthcare team, and the RUCIS Scale does not address the skills needed to communicate effectively with other team members. A second limitation is the assessment context: the psychometric properties of the RUCIS Scale demonstrated in this study might apply only to its use in an OSCE setting in which SPs are properly trained in how to use the rating instrument. A third limitation is the homogeneity of the resident samples we examined. Because our participants were internal medicine residents from a single training program, they were relatively homogeneous in their medical communication experience, and communication behaviors that were not observed in these residents might be evident when other groups of physicians are assessed. A multi-center trial of the rating instrument involving medical schools from various geographical regions could examine how the RUCIS Scale functions with a more heterogeneous group of physicians.
We hope that the findings from our study will benefit the medical education community in several ways. First, the product of this validation effort, the RUCIS Scale, together with validity evidence that supports its use in a communication skills OSCE, should serve the needs of many residency programs, especially given the increasing interest in communication skills assessment that the ACGME Outcome Project has generated. Second, our study provides a concrete example of how to use a many-faceted Rasch measurement approach to improve the quality of SP rating instruments and to provide validity evidence based on response processes, as outlined in the 1999 Standards for Educational and Psychological Testing (American Educational Research Association et al. 1999). Finally, this study generated many ideas for future research.