Introduction

Fracture-dislocations of the proximal interphalangeal joint can be problematic as they can result in stiffness, pain, and secondary osteoarthritis [7, 18, 21]. Characteristics of the articular fracture of the middle phalanx that are thought to indicate benefit of operative treatment are subluxation of the joint, greater than 2-mm articular stepoff, and more than 40% of articular surface involvement [13, 18, 21]. These fracture characteristics also are used in selecting a specific operative treatment with options extending from extension block pinning to hemihamate autograft arthroplasty [13, 17, 18, 21]. However, it is unclear if middle phalanx base fracture characteristics can be assessed with sufficient reliability to be useful for surgical decision making.

Validation of a specific criterion for recommending operative treatment includes evidence that: (1) patients do better with surgery when that criterion is met; (2) the criterion accurately measures pathophysiologic features; and (3) measures of the criterion are reliable. Putting aside whether the criteria for recommending operative treatment of a proximal interphalangeal joint fracture-dislocation are useful and accurate thresholds, if observers cannot agree on the criteria (unreliable) they may not be useful. Studies of fractures at other anatomic sites (such as the humerus, clavicle, and olecranon) showed that some fracture characteristics are assessed more reliably than others [1, 2, 5, 8, 10]. In addition to measuring the reliability of radiographic criteria for recommending operative treatment, better understanding of the criteria that have the greatest influence on recommendation for treatment of middle phalanx base fractures would help direct the development of guidelines and measurement techniques for those criteria.

We assessed interobserver agreement among hand surgeons on middle phalanx base fracture characteristics and the resulting treatment recommendations. Specifically, we evaluated: (1) the degree of interobserver agreement as a function of fracture characteristics, (2) differences in interobserver agreement between experienced and less-experienced hand surgeons, and (3) what fracture characteristics and surgeon characteristics were associated with the decision for operative treatment.

Materials and Methods

This study was approved by our institutional review board and a waiver of informed consent was obtained. With the use of International Classification of Diseases, 9th Revision codes (code 816.01: closed middle or proximal phalanx fracture), patients who received a diagnosis of a phalangeal fracture at a tertiary hand clinic between July 2011 and August 2013 were identified (n = 315). There were 21 (6.7%) acute intraarticular middle phalanx base fractures assessed with a true lateral radiograph (assessed by looking at projection of the proximal phalanx condyles) (Appendix 1. Supplemental material is available with the online version of CORR®).

We used SurveyMonkey (Palo Alto, CA, USA), an online survey tool, to build a survey containing the 21 cases. For all cases, the following seven questions were asked in the following order: (1) What percentage of the articular surface is fractured? (2) Is there an articular step or gap greater than 2 mm? (3) Is there comminution or fragmentation of the articular surface? (4) How many fracture fragments? (5) Is the proximal interphalangeal joint subluxated or dislocated? (6) How would you classify this injury based on imaging: stable, tenuous, or unstable? (7) What is your proposed treatment: nonoperative, extension block pinning, external fixation, open reduction and internal fixation, volar plate arthroplasty, or hemihamate autograft arthroplasty? The answer to Question 7 then was dichotomized into nonoperative and operative treatment and was analyzed separately.

An invitation to participate was sent to a list of surgeons built as part of the Science Of Variation Group (SOVG) [11, 12]. The SOVG consists of 691 orthopaedic, trauma, and plastic surgeons, all with an interest in treating upper extremity conditions. We approached only hand surgeons (n = 296); 108 (36%) responded and 99 (33%) completed all seven questions for all 21 cases. The SOVG aims to study variation in the definition and treatment of human illness without financial incentive.

Surgeon Characteristics

Of these 99 hand surgeons, there were 90 men (91%). The majority of hand surgeons were from the United States (80%). There was substantial variability in the number of years the surgeons were in practice; 57 (58%) had less than 10 years in practice. Eighty-one (82%) hand surgeons were involved in supervising trainees (Table 1).

Table 1 Surgeon characteristics (n = 99)

Statistical Analysis

We used Fleiss’ kappa analysis to assess the interobserver agreement for the questions with categorical answers (Questions 2, 3, 5, 6, and 7). Bootstrapping (number of resamples = 1000) was used to calculate a standard error, z statistic, p value, and 95% CI for the kappa values [16]. Kappa is a quantitative measure of agreement among observers and takes into account that observers sometimes will choose the same answer to a question by chance [23]. A perfect agreement among observers would be reflected as a kappa of 1, whereas agreement totally based on chance would equate to a kappa of 0. Interpretation of kappa often is done by a classification proposed by Landis and Koch [15] in which a kappa between 0.01 and 0.20 is considered to reflect slight agreement, between 0.21 and 0.40 fair agreement, between 0.41 and 0.60 moderate agreement, between 0.61 and 0.80 substantial agreement, and 0.81 or greater almost perfect agreement.
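The kappa calculation and its interpretation can be illustrated with a minimal sketch. It assumes that every case is rated by the same number of observers and that the bootstrap resamples cases with replacement and reads the 95% CI from percentiles; the text above does not specify these details, and the function names (fleiss_kappa, bootstrap_ci, landis_koch) are hypothetical and not taken from the Stata analysis. The "poor" label for values of 0 or less is not part of the classification as quoted above.

```python
import math
import random
from collections import Counter


def fleiss_kappa(ratings):
    """ratings[i] is the list of category labels given to case i, one per observer."""
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(row) for row in ratings]
    categories = sorted({label for row in ratings for label in row})
    # Observed agreement: proportion of agreeing observer pairs, averaged over cases.
    p_obs = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_cases
    # Chance agreement from the overall share of each category.
    shares = [sum(c[cat] for c in counts) / (n_cases * n_raters) for cat in categories]
    p_chance = sum(s * s for s in shares)
    if p_chance == 1.0:
        return float("nan")  # only one category used: kappa is undefined
    return (p_obs - p_chance) / (1.0 - p_chance)


def bootstrap_ci(ratings, n_resamples=1000, seed=0):
    """Percentile 95% CI from resampling cases (assumed scheme, see lead-in)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = [rng.choice(ratings) for _ in ratings]
        kappa = fleiss_kappa(sample)
        if not math.isnan(kappa):
            estimates.append(kappa)
    estimates.sort()
    lower = estimates[int(0.025 * len(estimates))]
    upper = estimates[int(0.975 * len(estimates)) - 1]
    return lower, upper


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch label quoted in the text."""
    if kappa <= 0.0:
        return "poor"  # assumption: label for values not covered in the text
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

With the values reported in this study, landis_koch(0.73) gives "substantial" and landis_koch(0.34) gives "fair".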

Intraclass correlation coefficient (ICC) with 95% CI was calculated through a two-way mixed-effects model with absolute agreement and shows interobserver agreement for questions with a continuous answer (Questions 1 and 4). To calculate an ICC, the variability of ratings per subject is compared with the total possible variability in ratings. Absolute agreement in an ICC assesses how much each measurement performed per observer differs from the other observers. As with kappa, a score of 1 reflects perfect agreement in ICC, whereas 0 reflects no agreement. Fisher’s z transformation was used to calculate p values comparing ICCs in the experienced and less-experienced groups [3, 4].
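To make the idea of absolute agreement concrete, the following sketch computes a single-measures, absolute-agreement ICC from the mean squares of a two-way layout (cases by observers). Whether single or average measures were reported is not stated above, so the single-measures form is an assumption, and the function name icc_absolute_single is hypothetical.

```python
def icc_absolute_single(scores):
    """scores[i][j] is the value observer j assigned to case i (no missing cells)."""
    n = len(scores)        # number of cases
    k = len(scores[0])     # number of observers
    grand = sum(sum(row) for row in scores) / (n * k)
    case_means = [sum(row) / k for row in scores]
    observer_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Sums of squares for cases, observers, and the residual.
    ss_cases = k * sum((m - grand) ** 2 for m in case_means)
    ss_observers = n * sum((m - grand) ** 2 for m in observer_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_cases - ss_observers

    ms_cases = ss_cases / (n - 1)
    ms_observers = ss_observers / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Absolute agreement: systematic differences between observers
    # (ms_observers) count against agreement, not only random error.
    return (ms_cases - ms_error) / (
        ms_cases + (k - 1) * ms_error + k * (ms_observers - ms_error) / n
    )
```

For example, two observers who rank three cases identically but differ by a constant 10 percentage points ([[10, 20], [30, 40], [50, 60]]) yield an ICC below 1, because absolute agreement penalizes the systematic offset.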

We compared the kappa values and ICCs of experienced (≥ 10 years of practice) and less-experienced (< 10 years of practice) hand surgeons for every question.
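A comparison of two coefficients with Fisher's z transformation can be sketched as follows. This is the textbook form for two independent correlation-type coefficients; the exact variant applied to ICCs in the cited methods [3, 4] may differ. The function name and the sample sizes passed to it are hypothetical.

```python
import math


def compare_coefficients(r1, n1, r2, n2):
    """Two-sided test of r1 versus r2 using Fisher's z transformation."""
    z1 = math.atanh(r1)  # Fisher z of the first coefficient
    z2 = math.atanh(r2)  # Fisher z of the second coefficient
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # Two-sided p value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

Two coefficients that are close together and based on a small number of observations give a z statistic near 0 and a large p value, consistent with "no difference, with the numbers available."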

We used multivariable linear regression analysis to assess if surgeon characteristics (sex, location of practice, years in practice, or supervising trainees) were independently associated with the likelihood of choosing surgery per surgeon (99 surgeons). We calculated an overall surgery score per surgeon by dividing the number of cases they would operate on by 21 (the total number of cases). The score ranges from 0 to 1 with a higher score indicating a higher likelihood of choosing surgery. Linear regression analysis provides a β regression coefficient that indicates the difference in surgery score (likelihood for surgery) in one group compared with another.
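The surgery score per surgeon can be sketched as below. The second function shows only an unadjusted difference in mean score between two groups, which is what β reduces to with a single binary predictor; it is not the multivariable model used in this study, which adjusts for the other surgeon characteristics. All names and data are hypothetical.

```python
def surgery_score_per_surgeon(recommendations):
    """recommendations[s][c] is True if surgeon s would operate on case c."""
    return [sum(row) / len(row) for row in recommendations]


def unadjusted_beta(scores, in_group):
    """Difference in mean surgery score: group members minus non-members."""
    members = [s for s, flag in zip(scores, in_group) if flag]
    others = [s for s, flag in zip(scores, in_group) if not flag]
    return sum(members) / len(members) - sum(others) / len(others)


# Hypothetical toy data: 3 surgeons, 4 cases (the study had 99 and 21).
toy_recommendations = [
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, True],
]
toy_scores = surgery_score_per_surgeon(toy_recommendations)
toy_beta = unadjusted_beta(toy_scores, [True, False, True])
```

A β near 0, as reported for all surgeon characteristics in this study, means the two groups recommend surgery for a similar share of cases.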

Simple linear regression was used to assess which fracture characteristics influenced the recommendation for operative treatment per case (21 cases). We calculated an overall surgery score per case by dividing the number of surgeons who would operate by 99 (the total number of surgeons) and assessed the association of fracture characteristics with the surgery score. We used simple linear regression because fracture characteristics were highly collinear, which resulted in an unstable multivariable model. The R-squared indicates how much of the variation in surgery score per case is explained by the specific fracture characteristic.
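A simple linear regression of the per-case surgery score on one fracture characteristic can be sketched with ordinary least squares. The predictor used here (the proportion of surgeons who judged a characteristic present in each case) is an assumption for illustration; the text does not state how each characteristic was coded per case. The function name is hypothetical.

```python
def simple_linear_regression(x, y):
    """Return (beta, intercept, r_squared) for y regressed on a single x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx                      # change in surgery score per unit of x
    intercept = mean_y - beta * mean_x
    r_squared = sxy ** 2 / (sxx * syy)    # share of variation in y explained by x
    return beta, intercept, r_squared
```

Here x would hold one value per case for the characteristic and y the corresponding surgery score (number of surgeons who would operate divided by 99). An R-squared close to 1 means the characteristic alone explains nearly all between-case variation in the recommendation for surgery.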

A two-tailed p value less than 0.05 was considered significant; all statistical analyses were performed using Stata® 14 (StataCorp LP, College Station, TX, USA).

Results

Interobserver Reliability for Fracture Characteristics

When all surgeons were pooled, the interobserver agreement for fracture characteristics was greatest for assessment of articular stepoff or gap of 2 mm or greater (kappa, 0.73; 95% CI, 0.60–0.86, p < 0.001) and lowest for the number of fracture fragments (ICC, 0.39; 95% CI, 0.27–0.57, p < 0.001) (Fig. 1). There was substantial agreement on the percentage of articular surface area that was involved in the fracture (ICC, 0.67; 95% CI, 0.54–0.81, p < 0.001) (Table 2).

Fig. 1A–H

The bar graphs show the percentage of surgeons indicating (A) 2-mm step or gap (black bar); (B) comminution (black bar); (C) subluxation or dislocation (black bar); (D) unstable (black bar), tenuous (gray bar), or stable (white bar) joint per case (case number on y-axis). The box plots show the median percentage of (E) fractured articular surface and (F) number of fracture fragments, with the interquartile range and range per case. (G) The bar graph shows the percentage of surgeons recommending operative treatment (black bar) per case. Cases (Appendix 1) are ordered from no surgeons recommending surgery (Cases 19 and 2) to all surgeons recommending surgery (Cases 9, 11, and 17). (H) The graph shows the variation in recommended treatment options: nonoperative (white bar), hemihamate autograft arthroplasty (light gray bar), volar plate arthroplasty (medium light gray bar), open reduction and internal fixation (medium dark gray bar), external fixation (dark gray bar), and extension block pinning (black bar).

Table 2 Interobserver agreement among all hand surgeons

Interobserver Reliability Based on Surgeon Experience

We found no difference, with the numbers available, in interobserver agreement for any of the fracture characteristics between less-experienced and more-experienced hand surgeons (Table 3).

Table 3 Comparison of interobserver agreement between less experienced and experienced hand surgeons

Factors Associated with Recommendation of Surgery

We found that all fracture characteristics, except for stable and uncertain joint stability, were associated with a recommendation for operative treatment. Articular stepoff or gap greater than 2 mm was most strongly associated with a recommendation for operative treatment (β, 0.90; R-squared, 0.89; 95% CI, 0.75–1.05; p < 0.001) (Table 4).

Table 4 Simple linear regression of fracture characteristics associated with decision for surgery per case

After controlling for relevant confounding variables, we found that male sex (β regression coefficient, −0.019; 95% CI, −0.082 to 0.044; p = 0.55), more experience (β, 0.0035; 95% CI, −0.033 to 0.040; p = 0.85), and supervising trainees (β, −0.028; 95% CI, −0.075 to 0.020; p = 0.26) were not associated with the recommendation for operative treatment (Table 5). There also was no difference in recommendation for surgery among practice locations: Europe (β, −0.018; 95% CI, −0.074 to 0.038; p = 0.53) and Other (β, 0.046; 95% CI, −0.021 to 0.11; p = 0.18) compared with the United States and Canada (Table 5).

Table 5 Multivariable linear regression of surgeon characteristics associated with decision for surgery

Interobserver agreement was fair (kappa, 0.34; 95% CI, 0.21–0.47, p < 0.001) for the type of treatment recommended but substantial (kappa, 0.69; 95% CI, 0.50–0.88, p < 0.001) for the recommendation to operate or not (Table 2). The fair agreement for type of treatment indicates considerable variation in decision for specific treatment techniques among included surgeons (Fig. 1). We found no difference in interobserver agreement when deciding for specific strategies (p = 0.79) or the decision to operate (p = 0.80) between the less-experienced and experienced hand surgeons with the numbers evaluated (Table 3).

Discussion

Several middle phalanx base fracture characteristics are commonly used in selecting treatment options. However, it is unclear if these characteristics can be assessed with sufficient reliability to be useful for surgical decision-making and what factors are most strongly associated with the decision for surgery. We assessed interobserver agreement among 99 hand surgeons in assessing morphologic features of fractures and decision for treatment on lateral radiographs of 21 intraarticular middle phalanx base fractures. We found that: (1) interobserver agreement was greatest for assessment of articular step or gap, likelihood of subluxation or dislocation, and percentage articular surface involvement; (2) there were no differences in agreement between experienced and less-experienced surgeons; (3) articular step or gap, likelihood of subluxation or dislocation, and unstable fractures are most strongly associated with the decision for operative treatment; and (4) there was substantial agreement when deciding for operative versus nonoperative treatment, whereas there was only fair agreement when deciding for specific strategies.

This study has several limitations. First, with our sample size we found no difference in agreement between experienced and less-experienced hand surgeons; however, a larger sample size might have resulted in a significant difference. We performed a post hoc power analysis assessing effect size, achieved power (1 − β), and required sample size (assuming a similar effect) for assessing comminution, as the difference between experienced (kappa, 0.58) and less-experienced (kappa, 0.52) hand surgeons seems to be largest for this characteristic. This is a very small effect size (0.11), which we had only 8.1% power to detect. This small difference is probably clinically unimportant, because to achieve a power of 0.80 to be more assured that this difference is not spurious, we would need 1180 experienced and 1602 less-experienced hand surgeons [22]. Second, only one-third of the hand surgeons in the SOVG completed the questionnaire. Because most of the people with email addresses in the SOVG database are not active participants (ie, their email addresses are not regularly updated and we do not weed out the nonresponders), the rate of participation is not a true response rate. However, there is no difference between responding and nonresponding SOVG hand surgeons regarding sex (p = 0.56, by Fisher’s exact test), years in practice (p = 0.081, by Fisher’s exact test), and location of practice (p = 0.47, by Fisher’s exact test). Participants (81 of 99; 82%) more often were involved in academic medicine (ie, supervising trainees) compared with the nonresponding hand surgeons (122 of 197; 62%; p < 0.001).

Furthermore, surgeons in the SOVG probably are more involved in academic medicine compared with the larger community of hand surgeons. However, we do not believe this difference compromises the generalizability of our results, as phalangeal fractures are common and typically assessed with radiographs. We found no trend toward a difference in agreement based on experience, nor was any of the included factors (ie, sex, practice location, years of experience, supervising trainees) associated with the decision for operative treatment. We therefore believe that our results are generalizable to the hand surgeon community as a whole. Third, we used only lateral radiographs for assessment of morphologic features of fractures. Use of other imaging modalities such as CT, and physical examination, might influence interobserver agreement. We see this as a minor limitation because lateral radiographs are used most commonly and have a strong influence on the treatment of these fractures. Fourth, experience varies among participants; however, as experience is continuous and cumulative with time we decided to dichotomize level of experience based on practice years. We chose 10 years of experience as a cutoff as this resulted in more or less equally sized groups increasing statistical power. Fifth, we did not assess classification systems as these often group fractures based on multiple characteristics (for example, the AO classification uses fragment size, impaction, instability, and dislocation). We chose to study individual characteristics instead of a classification system to allow for more-detailed exploration of which of these specific factors are associated with decision for treatment.

There was moderate to substantial agreement among hand surgeons for assessment of all fracture characteristics, except for the number of fracture fragments. Assessment of a 2-mm articular step or gap, likelihood of subluxation or dislocation, and percentage of articular surface involvement on a lateral radiograph was most reliable. This means that these three factors are the most reliable basis for treatment decisions and for communication between surgeons. Classifications and guidelines should incorporate these factors, and future studies need to assess how these fracture characteristics relate to functional outcome, stiffness, and secondary osteoarthritis. Agreement on the number of fracture fragments on lateral radiographs is poor, rendering this characteristic less useful for decision-making. Assessment of fracture stability could be more reliable in clinical practice as it might improve with examination and stress radiographs [7]. Therefore it is difficult to draw firm conclusions on the usefulness of this parameter. Overall interobserver agreement for assessment of morphologic features of a middle phalanx base fracture was much greater than that reported for other anatomic areas (humerus, elbow, clavicle, and olecranon) [1, 2, 5, 8, 10]. We found only fair agreement, indicating variation, regarding specific proposed treatments, which might be a reflection of the lack of high-level evidence and relatively small case series on which surgeons base their decisions [7]. Variation in reported indications for surgical strategies also might influence decision making among hand surgeons when deciding for specific surgical strategies [7, 13, 18, 21]. Creating guidelines by reaching consensus among hand surgeons might reduce variation and improve quality of care [9, 14].

Our study showed that, with the numbers evaluated, experienced and less-experienced hand surgeons are equally good at assessing morphologic features of fractures. Hand surgeons seem to reach proficiency assessing middle phalanx base fractures on radiographs early during their experience. A previous study assessing the learning curve of pediatric residents in diagnosing ankle fractures on radiographs showed proficiency after evaluation of 50 cases, after which learning slowed but did not stop until all 234 cases were reviewed [19]. A study of distal radius fracture classification on radiographs in children also showed that experienced surgeons were more reliable than junior registrars [20]. However, additional training, even of already fully trained surgeons, might further improve the assessment of morphologic features of fractures, as interobserver reliability improved after training of 64 fully trained orthopaedic and trauma surgeons in diagnosing scaphoid fracture displacement [6].

None of the included surgeon characteristics was associated with the decision for operative treatment, whereas all fracture characteristics were, except for uncertain fracture stability. This means that hand surgeons are consistent and like-minded regardless of sex, experience, academic involvement, and practice location when recommending operative treatment, bearing in mind the limited sample size. Our findings that articular step or gap greater than 2 mm, injuries judged unstable, and proximal interphalangeal joint subluxation or dislocation are associated most strongly with the decision for operative treatment are in line with the indications usually cited [7, 13, 18, 21]. R-squared values (0.76 to 0.89) and β regression coefficients (0.80 to 0.90) of these three characteristics are high and vary only slightly, meaning that these factors drive the decision for operative treatment to a similar degree and in a similar fashion. Future studies need to assess how these factors relate to functional outcome, stiffness, and secondary osteoarthritis.

Surgeons mostly agree on which fractures might benefit from surgery and the radiographic criteria they use most for this determination (articular step or gap and likelihood of subluxation or dislocation). The effectiveness of these thresholds for recommending surgery would be supported by additional research comparing initial operative and nonoperative approaches for these injuries, but the ethics of such research are a consideration given the current high level of agreement of hand surgeons based largely on wisdom and experience. Efforts at improving the care of proximal interphalangeal joint fracture-dislocations can focus primarily on the comparative effectiveness of the various operative treatment options.